Skip to content

Conversation

@ankedia
Copy link
Member

@ankedia ankedia commented Nov 5, 2021

This PR is for merging changes from PR 2580 to main. It has below changes -

  • Fix for the introspector retry behavior following a introspection job time out.
  • Fix for NPE when the domain resource has a missing secret
  • Changes to capture the WDT log files to the LOG_HOME directory to facilitate in debugging introspector timeout.
  • Improve error message when introspector job times out with suggestion to increase the timeout value.

Copy link
Contributor

@russgold russgold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left a number of detailed comments; my biggest concern, though, is that this change seems to replace something (resetting the failure count) that works in main with a very different implementation in 3.3, and I don't understand why. It maybe an artifact of the forward-porting process, which is why it is generally best to start by porting only the tests and then let them guide you as to which other changes are needed.

}

@Test
@Disabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test disabled?

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is not valid since it tries to force an exception by generating a NullPointerException by removing WebLogic credentials secret. The WebLogic credentials secret is a mandatory field and the domain validation should fail when this field is missing. We had a bug where missing secret would generate the NPE and I fixed that bug. I have another test where we force the exception when an introspector job times out. I wanted to discuss this with you and Ryan to see if we need another test for forcing exception or if this can be removed completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the system now prevents us from ever removing the secret, we should not need a test for it. If it doesn't, we still need to test that we handle that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the secret is missing during initial domain creation, the operator will generate a domain validation error. If the secret is removed after the domain is up and running, we no longer throw a NullPointerException. The REST API to read server health will fail due to missing credentials until the next make-right operation. The domain validation during the next make-right cycle will catch and report a validation error.

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the test to assert that the NPE and DomainProcessingAborted event are not generated when WebLogic credentials secret is removed in eaf7651 .

Step read = new CallBuilder().readSecretAsync(secretName, namespace, new SecretResponseStep(getNext()));

return doNext(read, packet);
if (secretName != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the test for this change?

Stylistically, the code could be improved by

  1. flipping the test so that the first branch is taken when secretName == null. That puts the shorter and exceptional case first.
  2. Making read final.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing a NPE. I'll look into adding unit test.

return job;
}

private long getIntrospectorJobActiveDeadlineSeconds() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is not used and should be deleted.

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, deleted in eaf7651 .

*
* @return the secret name
*/
public String getWebLogicCredentialsSecretName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Where is the test for this change?
  2. Your IDE should have advised you to change s -> s.getName() to V1SecretReference::getName

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is also to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing the NPE. I'll look into adding unit test and make the change use V1SecretReference::getName.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is best to add a unit test when fixing any problem; it means that we didn't think of that case when the code was originally written, or else that case is new.

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I have changed the test whenWebLogicCredentialsSecretRemoved_NullPointerExceptionAndAbortedEventNotGenerated in DomainProcessorTest to assert that NPE is not generated when WebLogicCredentialsSecret is removed.

}

static boolean isFailed(V1Job job) {
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Test if" is never a good Javadoc comment. This method is better described as "Returns true if the specified job has a failed status or condition." Then you don't need the @return. Also, it is bad style to describe a parameter simply by repeating its name. I usually say something like, "the job to be tested."

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eaf7651.

*/
public void incrementIntrospectJobFailureCount(String uid) {
if (!Objects.equals(uid, failedIntrospectionUid)) {
if ((uid == null) || (!Objects.equals(uid, failedIntrospectionUid))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the test for this change? Is the failure count really intended to be incremented when the uid is null, even if the failedIntrospectionUid is? If so, you no longer need the call to Objects.equals and the test should be:

  if ((uid == null) || !uid.equals(failedIntrospectionUid))

and since this is now a non-obvious combination, the expression in the if should be extracted into its own method with a simple explanatory name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is for the case when the introspector job times out or fiber encounters an exception during processing. I had added a new unit test whenIntrospectionJobTimedout_failureCountIncremented which covers this change. As you have suggested, I'll extract the if into it's own method.

static class FailureCountStep extends DomainStatusUpdaterStep {

private final V1Job domainIntrospectorJob;
private final boolean resetFailureCount;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have changed this step to do one of two things, based on the value of a flag. That is generally poor practice. It would be better simply to add a new step to reset the failure count, if that is needed, and leave the original step alone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I had originally created two steps but later thought that I can reuse the same method and put additional logic in it. I'll change it back to add a new step for resetting failure count and keep the original step same.

return Optional.ofNullable(eventData).map(EventData::getItem).orElse(null) == DOMAIN_PROCESSING_ABORTED;
}

private void resetIntrospectorJobFailureCount() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main already has functionality for resetting the failure count. Why discard it in favor of the approach used in 3.3.3?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is resetting the failure count in the domain status. I believe this approach is same in latest 3.3.3 release. Previously we had 2 failure/retry counts (1) in the DomainStatus and (2) in-memory count in DomainPresenceInfo. After the fix for OWLS-90180, we can no longer rely on the in-memory state of the operator as the introspector job might have been created before the operator started. I have removed the in-memory retry count in DomainPresenceInfo and made changes to use the failure count in domain status instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear the reasoning; I'm just surprised by the size of the change needed to do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I have used the same approach in PR 2580 which is merged into latest 3.3.3 release. Please let me know if you have other suggestions for this approach. Thanks.

.header("Accept", "application/json")
.header("Content-Type", "application/json")
.header("X-Requested-By", "WebLogic Operator");
Optional.ofNullable(getAuthorizationSource())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What problem led to this change? Is there a unit test for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is also to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing the NPE. I'll look into adding unit test.

}

if (isKnownFailedJob(job) || hasImagePullError) {
return doContinueListOrNext(callResponse, packet, cleanUpAndReintrospect());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doContinueListOrNext is intended for list operations. Why are you using it here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The onSuccess method is for listing the introspector job pod. Since we don't know the exact name of the introspector job pod, we list the pods having job-name label. The list pods operation returns pods in multiple chunks and it's possible that introspector job pod is not returned in first chunk. My understanding is that doContinueListOrNext method will either return the next chunk or perform the next action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a concern if we expected to find hundreds of active pods with that name, which would shock me. Did you run into a problem that prompted this change?

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the same approach that we use in the PodListResponseStep on line 637 of JobHelper which is called from ReadDomainIntrospectorPodStep. The PodListResponseStep records the introspector job pod-name in the packet and is called after the introspector job completes or fails. I thought it's possible that if the chunk size is very small and first chunk returns an empty list then the active pod might be returned in second chunk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that it wasn't actually needed there, either.

Can you explain the problem you were solving? That is, what was the old code doing, why was it a problem, and what is the new approach? There is a basic assumption that is either presumably not valid, and yet when I look at the code, it appears to be make the same assumptions, and just getting the job pod with more steps.

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Above change means we're always going to delete an existing failed job and do the reintrospection. So it's expected that whenNewFailedJobExists_readPodLogAndReportFailure fails. I had initially looked into reusing the existing step that finds the job pod but was running into some issues since that step was invoked from different places. Let me see if the changed logic in main makes it possible to reuse that step now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current change, I have added a new step to read the job pod after reading the job and look for the relevant errors. I thought this was more clear than adding the logic into existing step to execute different set of next steps based on different errors. The introspector job pod may not always be there (for e.g. k8s might delete the pod once it times out) and it's not possible to read the logs of a job pod that has failed due to image pull error. It might be good to have a zoom call to see if it's worthwhile to spend time experimenting with that if current change is more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we meet in the morning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that works for me. Let me schedule a time in the calendar.

Copy link
Member Author

@ankedia ankedia Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried making the changes to use existing step but the logic seems to be get more complicated and some unit tests are hanging. Based on my current understanding, below is the psuedo code of the new logic. Can you confirm if this understanding is correct or if I'm missing something?

The onSuccess method of existing findIntrospectorPodName step will do following.

  1. Fetch the job from the packet
  2. Read the job pod
  3. Get the job pod name
If (known failed job or image pull error) then 
   Execute deleteIntrospectorJob -> readExistingIntrospectorConfigMap -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action
else if (job is timed-out) then 
   Execute deleteDomainIntrospectorJobStep -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action 
else if (job is not null)
   Execute waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action
else if (introspection is needed)  
   Execute readExistingIntrospectorConfigMap -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action
else 
   Execute next action
fi

@ankedia
Copy link
Member Author

ankedia commented Nov 10, 2021

Integration test run results look good, the three failures look unrelated to latest changes - https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/7172/

@rjeberhard rjeberhard merged commit bbda203 into main Nov 10, 2021
@ankedia ankedia deleted the owls_91212_pr2580_merge branch November 10, 2021 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants