Owls 91212 - Merge fixes for the introspector retry behavior after the job times out and to capture WDT logs. #2613

ankedia · 2021-11-05T18:04:15Z

This PR is for merging changes from PR 2580 to main. It has below changes -

Fix for the introspector retry behavior following a introspection job time out.
Fix for NPE when the domain resource has a missing secret
Changes to capture the WDT log files to the LOG_HOME directory to facilitate in debugging introspector timeout.
Improve error message when introspector job times out with suggestion to increase the timeout value.

…hen domain resource has missing secret.

russgold

I have left a number of detailed comments; my biggest concern, though, is that this change seems to replace something (resetting the failure count) that works in main with a very different implementation in 3.3, and I don't understand why. It maybe an artifact of the forward-porting process, which is why it is generally best to start by porting only the tests and then let them guide you as to which other changes are needed.

russgold · 2021-11-08T10:52:11Z

operator/src/test/java/oracle/kubernetes/operator/DomainProcessorTest.java

  }

  @Test
+  @Disabled


Why is this test disabled?

The test is not valid since it tries to force an exception by generating a NullPointerException by removing WebLogic credentials secret. The WebLogic credentials secret is a mandatory field and the domain validation should fail when this field is missing. We had a bug where missing secret would generate the NPE and I fixed that bug. I have another test where we force the exception when an introspector job times out. I wanted to discuss this with you and Ryan to see if we need another test for forcing exception or if this can be removed completely.

If the system now prevents us from ever removing the secret, we should not need a test for it. If it doesn't, we still need to test that we handle that case.

If the secret is missing during initial domain creation, the operator will generate a domain validation error. If the secret is removed after the domain is up and running, we no longer throw a NullPointerException. The REST API to read server health will fail due to missing credentials until the next make-right operation. The domain validation during the next make-right cycle will catch and report a validation error.

I have changed the test to assert that the NPE and DomainProcessingAborted event are not generated when WebLogic credentials secret is removed in eaf7651 .

russgold · 2021-11-08T10:55:11Z

operator/src/main/java/oracle/kubernetes/operator/helpers/SecretHelper.java

-        Step read = new CallBuilder().readSecretAsync(secretName, namespace, new SecretResponseStep(getNext()));
-
-        return doNext(read, packet);
+        if (secretName != null) {


Where is the test for this change?

Stylistically, the code could be improved by

flipping the test so that the first branch is taken when secretName == null. That puts the shorter and exceptional case first.

Making read final.

This change is to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing a NPE. I'll look into adding unit test.

russgold · 2021-11-08T10:58:21Z

operator/src/test/java/oracle/kubernetes/operator/helpers/DomainIntrospectorJobTest.java

+    return job;
+  }
+
+  private long getIntrospectorJobActiveDeadlineSeconds() {


This method is not used and should be deleted.

Thanks, deleted in eaf7651 .

russgold · 2021-11-08T11:03:13Z

operator/src/main/java/oracle/kubernetes/weblogic/domain/model/Domain.java

   *
   * @return the secret name
   */
  public String getWebLogicCredentialsSecretName() {


Where is the test for this change?

Your IDE should have advised you to change s -> s.getName() to V1SecretReference::getName

This change is also to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing the NPE. I'll look into adding unit test and make the change use V1SecretReference::getName.

It is best to add a unit test when fixing any problem; it means that we didn't think of that case when the code was originally written, or else that case is new.

ok, I have changed the test whenWebLogicCredentialsSecretRemoved_NullPointerExceptionAndAbortedEventNotGenerated in DomainProcessorTest to assert that NPE is not generated when WebLogicCredentialsSecret is removed.

russgold · 2021-11-08T11:10:15Z

operator/src/main/java/oracle/kubernetes/operator/JobWatcher.java

  }

-  static boolean isFailed(V1Job job) {
+  /**


"Test if" is never a good Javadoc comment. This method is better described as "Returns true if the specified job has a failed status or condition." Then you don't need the @return. Also, it is bad style to describe a parameter simply by repeating its name. I usually say something like, "the job to be tested."

Fixed in eaf7651.

russgold · 2021-11-08T11:21:11Z

operator/src/main/java/oracle/kubernetes/weblogic/domain/model/DomainStatus.java

   */
  public void incrementIntrospectJobFailureCount(String uid) {
-    if (!Objects.equals(uid, failedIntrospectionUid)) {
+    if ((uid == null) || (!Objects.equals(uid, failedIntrospectionUid))) {


Where is the test for this change? Is the failure count really intended to be incremented when the uid is null, even if the failedIntrospectionUid is? If so, you no longer need the call to Objects.equals and the test should be:

if ((uid == null) || !uid.equals(failedIntrospectionUid))

and since this is now a non-obvious combination, the expression in the if should be extracted into its own method with a simple explanatory name.

This change is for the case when the introspector job times out or fiber encounters an exception during processing. I had added a new unit test whenIntrospectionJobTimedout_failureCountIncremented which covers this change. As you have suggested, I'll extract the if into it's own method.

russgold · 2021-11-08T13:37:21Z

operator/src/main/java/oracle/kubernetes/operator/DomainStatusUpdater.java

  static class FailureCountStep extends DomainStatusUpdaterStep {

    private final V1Job domainIntrospectorJob;
+    private final boolean resetFailureCount;


You have changed this step to do one of two things, based on the value of a flag. That is generally poor practice. It would be better simply to add a new step to reset the failure count, if that is needed, and leave the original step alone.

ok. I had originally created two steps but later thought that I can reuse the same method and put additional logic in it. I'll change it back to add a new step for resetting failure count and keep the original step same.

russgold · 2021-11-08T13:38:27Z

operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java

      return Optional.ofNullable(eventData).map(EventData::getItem).orElse(null) == DOMAIN_PROCESSING_ABORTED;
    }

-    private void resetIntrospectorJobFailureCount() {


Main already has functionality for resetting the failure count. Why discard it in favor of the approach used in 3.3.3?

This method is resetting the failure count in the domain status. I believe this approach is same in latest 3.3.3 release. Previously we had 2 failure/retry counts (1) in the DomainStatus and (2) in-memory count in DomainPresenceInfo. After the fix for OWLS-90180, we can no longer rely on the in-memory state of the operator as the introspector job might have been created before the operator started. I have removed the in-memory retry count in DomainPresenceInfo and made changes to use the failure count in domain status instead.

I hear the reasoning; I'm just surprised by the size of the change needed to do that.

Ok. I have used the same approach in PR 2580 which is merged into latest 3.3.3 release. Please let me know if you have other suggestions for this approach. Thanks.

russgold · 2021-11-08T13:41:31Z

operator/src/main/java/oracle/kubernetes/operator/steps/HttpRequestProcessing.java

          .header("Accept", "application/json")
          .header("Content-Type", "application/json")
          .header("X-Requested-By", "WebLogic Operator");
+    Optional.ofNullable(getAuthorizationSource())


What problem led to this change? Is there a unit test for it?

This change is also to avoid the NullPointerException when WebLogicCredentialsSecret is removed. I didn't know whether we need unit test when fixing the NPE. I'll look into adding unit test.

russgold · 2021-11-08T13:44:00Z

operator/src/main/java/oracle/kubernetes/operator/helpers/JobHelper.java

+          }
+
+          if (isKnownFailedJob(job) || hasImagePullError) {
+            return doContinueListOrNext(callResponse, packet, cleanUpAndReintrospect());


doContinueListOrNext is intended for list operations. Why are you using it here?

The onSuccess method is for listing the introspector job pod. Since we don't know the exact name of the introspector job pod, we list the pods having job-name label. The list pods operation returns pods in multiple chunks and it's possible that introspector job pod is not returned in first chunk. My understanding is that doContinueListOrNext method will either return the next chunk or perform the next action.

That would be a concern if we expected to find hundreds of active pods with that name, which would shock me. Did you run into a problem that prompted this change?

I was following the same approach that we use in the PodListResponseStep on line 637 of JobHelper which is called from ReadDomainIntrospectorPodStep. The PodListResponseStep records the introspector job pod-name in the packet and is called after the introspector job completes or fails. I thought it's possible that if the chunk size is very small and first chunk returns an empty list then the active pod might be returned in second chunk.

I suspect that it wasn't actually needed there, either.

Can you explain the problem you were solving? That is, what was the old code doing, why was it a problem, and what is the new approach? There is a basic assumption that is either presumably not valid, and yet when I look at the code, it appears to be make the same assumptions, and just getting the job pod with more steps.

ok. Above change means we're always going to delete an existing failed job and do the reintrospection. So it's expected that whenNewFailedJobExists_readPodLogAndReportFailure fails. I had initially looked into reusing the existing step that finds the job pod but was running into some issues since that step was invoked from different places. Let me see if the changed logic in main makes it possible to reuse that step now.

In current change, I have added a new step to read the job pod after reading the job and look for the relevant errors. I thought this was more clear than adding the logic into existing step to execute different set of next steps based on different errors. The introspector job pod may not always be there (for e.g. k8s might delete the pod once it times out) and it's not possible to read the logs of a job pod that has failed due to image pull error. It might be good to have a zoom call to see if it's worthwhile to spend time experimenting with that if current change is more clear.

OK, can we meet in the morning?

ok, that works for me. Let me schedule a time in the calendar.

I tried making the changes to use existing step but the logic seems to be get more complicated and some unit tests are hanging. Based on my current understanding, below is the psuedo code of the new logic. Can you confirm if this understanding is correct or if I'm missing something?

The onSuccess method of existing findIntrospectorPodName step will do following.

Fetch the job from the packet

Read the job pod

Get the job pod name

If (known failed job or image pull error) then Execute deleteIntrospectorJob -> readExistingIntrospectorConfigMap -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action else if (job is timed-out) then Execute deleteDomainIntrospectorJobStep -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action else if (job is not null) Execute waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action else if (introspection is needed) Execute readExistingIntrospectorConfigMap -> createNewJob -> waitForIntrospectionToComplete -> findIntrospectorPodName -> readNamedPodLog -> deleteIntrospectorJob -> createIntrospectorConfigMap -> next action else Execute next action fi

This reverts commit 2b4f1c6.

…ot use unitTestAdaptor method.

…d change to JobHelper to make the test pass.

ankedia · 2021-11-10T14:30:28Z

Integration test run results look good, the three failures look unrelated to latest changes - https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/7172/

Changes to merge fix for OWLS-91212 (PR 2580) in main and fix a NPE w…

e343010

…hen domain resource has missing secret.

ankedia requested review from doxiao, jshum2479, rjeberhard, russgold, sankarpn and tbarnes-us November 5, 2021 18:04

check domain event with long retry policy

4ba0b7a

russgold suggested changes Nov 8, 2021

View reviewed changes

ankedia and others added 7 commits November 8, 2021 18:10

Changes to address review comments.

eaf7651

cleanup

2b4f1c6

Revert "cleanup".

14be2e6

This reverts commit 2b4f1c6.

Modified the test for introspector timeout and changed JobHelper to n…

47826d0

…ot use unitTestAdaptor method.

fix job timeout

993dc36

fix job timeout

87afabe

Added unit test to check for introspector job pod timed out status an…

bd39be8

…d change to JobHelper to make the test pass.

russgold approved these changes Nov 10, 2021

View reviewed changes

jshum2479 approved these changes Nov 10, 2021

View reviewed changes

rjeberhard merged commit bbda203 into main Nov 10, 2021

ankedia deleted the owls_91212_pr2580_merge branch November 10, 2021 20:43

alai8 mentioned this pull request May 27, 2022

Backport fix for OWLS-96080 to 3.4 branch #3109

Merged

Owls 91212 - Merge fixes for the introspector retry behavior after the job times out and to capture WDT logs. #2613

Owls 91212 - Merge fixes for the introspector retry behavior after the job times out and to capture WDT logs. #2613

Uh oh!

Conversation

ankedia commented Nov 5, 2021

Uh oh!

russgold left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ankedia Nov 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

russgold left a comment •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading

ankedia Nov 8, 2021 •

edited

Loading