Skip to content

Conversation

@ankedia
Copy link
Member

@ankedia ankedia commented Nov 10, 2021

Merge changes for OWLS-930732 from PR 2571 for following domain conditions -

  • When retries exceed the max limit, add "introspect failure retry count exceeded error" to the domain status message.
  • In case of a fatal error, include the fatal status in the domain status message.
  • For non-fatal retries, indicate the current retry stage i.e. "Introspection failed on try 1 of x".


String INTROSPECTION_ERROR = "Introspection Error: ";

String FATAL_ERROR_DOMAIN_STATUS_MESSAGE = "Introspection encountered a fatal error and will not retry automatically."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we internationalize this by pulling from the resource bundle?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea644db .


private String getNonFatalRetryStatusMessage(DomainStatus domainStatus) {
return "Introspection failed on try " + domainStatus.getIntrospectJobFailureCount()
+ " of " + DomainPresence.getDomainPresenceFailureRetryMaxCount() + ".";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also provide the information about when the next retry will happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjeberhard @tbarnes-us @jshum2479 - Any thoughts about this and where do we get this information from? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would DomainPresence.getDomainPresenceFailureRetrySeconds() work for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need exact timing? or can we just tell the user it will retry approximately in next make right interval (2 mins or the configured time).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases such as when an introspector job times out, we retry after DomainPresence.getDomainPresenceFailureRetrySeconds but in other cases we retry during next make right interval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will track this new requirement with a separate JIRA and address in a new pull request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please let me know the retry JIRA# when filed

note there's an existing umbrella retry JIRA OWLS-84438 that's meant for brain-storming purposes.

IMO, it's confusing to have two different retry settings - as this makes it hard for customers to figure out which setting they need to tune

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@russgold russgold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks very good; I have requested some minor changes.

@ankedia
Copy link
Member Author

ankedia commented Nov 12, 2021

Overall looks very good; I have requested some minor changes.

Thanks, I have made the suggested changes in ec9d534.


nextSteps = Step.chain(
createFailureRelatedSteps(Introspection, onSeparateLines(severeStatuses)),
getNextStep(packet, domainIntrospectorJob), nextSteps);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe getNextSteop() method can return null. When it is null, will we have a null step in the middle of the chain here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that Step.chain can handle null elements in the middle of the chain. The reason I made the above change is because we want the FailureCountStep to be executed last (i.e. after the FailedStep) to retain the correct message in domain status. If FailedStep is executed last, it overwrites the domain status message created by the FailureCount step. I didn't see any issues in the unit or the integration test with this approach. Please let me know if my understanding is incorrect, I can try to come up with some other approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Javadoc of Step.chain, the parameter stepGroups is multiple groups of steps. Any ideas if the group of steps have to be non-null?

  /**
   * Chain the specified step groups into a single chain of steps.
   *
   * @param stepGroups multiple groups of steps
   * @return the first step of the resultant chain
   */

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at StepChainTest.ignoreNullMiddleSteps. It actually verifies that this case works - null steps are ignored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Russ. We are good then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the FailureCount step should be after the FailedStep and before createEventStep so that the event can be populated with the domain status message. This will help to match the event message with the domain status message. If this requires too much changes, we can consider to make this change in a future PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in ea2d103.

Please resolve the error and then update 'domain.spec.introspectVersion' to force a retry.
WLSKO-0196=Stop introspection retry - exceeded configured domainPresenceFailureRetryMaxCount: {0}. The \
domainPresenceFailureRetryMaxCount is an operator tuning parameter and can be controlled by adding it to the \
weblogic-operator-cm configmap. To force the introspector to start retrying again, update 'domain.spec.introspectVersion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a "'" at the end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ea2d103 .

WLSKO-0195=Introspection encountered a fatal error and will not retry automatically. \
Please resolve the error and then update 'domain.spec.introspectVersion' to force a retry.
WLSKO-0196=Stop introspection retry - exceeded configured domainPresenceFailureRetryMaxCount: {0}. The \
domainPresenceFailureRetryMaxCount is an operator tuning parameter and can be controlled by adding it to the \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can domainPresenceFailureRetryMaxCount also be controlled via helm upgrade command?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, this can't be done using helm upgrade command.

Copy link
Member

@doxiao doxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@rjeberhard rjeberhard merged commit ce39d31 into main Nov 16, 2021
@ankedia ankedia deleted the merge_pr2571_owls930732 branch November 16, 2021 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants