-
Notifications
You must be signed in to change notification settings - Fork 216
Merge changes for OWLS-930732 to include retry stage and fatal condition in domain status message #2622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| String INTROSPECTION_ERROR = "Introspection Error: "; | ||
|
|
||
| String FATAL_ERROR_DOMAIN_STATUS_MESSAGE = "Introspection encountered a fatal error and will not retry automatically." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we internationalize this by pulling from the resource bundle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in ea644db .
|
|
||
| private String getNonFatalRetryStatusMessage(DomainStatus domainStatus) { | ||
| return "Introspection failed on try " + domainStatus.getIntrospectJobFailureCount() | ||
| + " of " + DomainPresence.getDomainPresenceFailureRetryMaxCount() + "."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also provide the information about when the next retry will happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjeberhard @tbarnes-us @jshum2479 - Any thoughts about this and where do we get this information from? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would DomainPresence.getDomainPresenceFailureRetrySeconds() work for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need exact timing? or can we just tell the user it will retry approximately in next make right interval (2 mins or the configured time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases such as when an introspector job times out, we retry after DomainPresence.getDomainPresenceFailureRetrySeconds but in other cases we retry during next make right interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will track this new requirement with a separate JIRA and address in a new pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please let me know the retry JIRA# when filed
note there's an existing umbrella retry JIRA OWLS-84438 that's meant for brain-storming purposes.
IMO, it's confusing to have two different retry settings - as this makes it hard for customers to figure out which setting they need to tune
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
russgold
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks very good; I have requested some minor changes.
operator/src/main/java/oracle/kubernetes/operator/DomainStatusUpdater.java
Outdated
Show resolved
Hide resolved
operator/src/main/java/oracle/kubernetes/operator/DomainStatusUpdater.java
Outdated
Show resolved
Hide resolved
operator/src/main/java/oracle/kubernetes/utils/OperatorUtils.java
Outdated
Show resolved
Hide resolved
Thanks, I have made the suggested changes in ec9d534. |
|
|
||
| nextSteps = Step.chain( | ||
| createFailureRelatedSteps(Introspection, onSeparateLines(severeStatuses)), | ||
| getNextStep(packet, domainIntrospectorJob), nextSteps); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe getNextSteop() method can return null. When it is null, will we have a null step in the middle of the chain here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that Step.chain can handle null elements in the middle of the chain. The reason I made the above change is because we want the FailureCountStep to be executed last (i.e. after the FailedStep) to retain the correct message in domain status. If FailedStep is executed last, it overwrites the domain status message created by the FailureCount step. I didn't see any issues in the unit or the integration test with this approach. Please let me know if my understanding is incorrect, I can try to come up with some other approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the Javadoc of Step.chain, the parameter stepGroups is multiple groups of steps. Any ideas if the group of steps have to be non-null?
/**
* Chain the specified step groups into a single chain of steps.
*
* @param stepGroups multiple groups of steps
* @return the first step of the resultant chain
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at StepChainTest.ignoreNullMiddleSteps. It actually verifies that this case works - null steps are ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Russ. We are good then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, the FailureCount step should be after the FailedStep and before createEventStep so that the event can be populated with the domain status message. This will help to match the event message with the domain status message. If this requires too much changes, we can consider to make this change in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in ea2d103.
| Please resolve the error and then update 'domain.spec.introspectVersion' to force a retry. | ||
| WLSKO-0196=Stop introspection retry - exceeded configured domainPresenceFailureRetryMaxCount: {0}. The \ | ||
| domainPresenceFailureRetryMaxCount is an operator tuning parameter and can be controlled by adding it to the \ | ||
| weblogic-operator-cm configmap. To force the introspector to start retrying again, update 'domain.spec.introspectVersion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a "'" at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in ea2d103 .
| WLSKO-0195=Introspection encountered a fatal error and will not retry automatically. \ | ||
| Please resolve the error and then update 'domain.spec.introspectVersion' to force a retry. | ||
| WLSKO-0196=Stop introspection retry - exceeded configured domainPresenceFailureRetryMaxCount: {0}. The \ | ||
| domainPresenceFailureRetryMaxCount is an operator tuning parameter and can be controlled by adding it to the \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can domainPresenceFailureRetryMaxCount also be controlled via helm upgrade command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, this can't be done using helm upgrade command.
doxiao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Merge changes for OWLS-930732 from PR 2571 for following domain conditions -