Retry behavior for synchronous calls during initialization #2118

rjeberhard · 2020-12-23T22:23:15Z

As you know, we've been seeing fairly regular integration test failures for the ItIstio* series of tests. The tests fail because the operator cannot connect to the Kubernetes master and therefore fails to start. In each case I analyzed, the operator failed just before the Istio Envoy proxy finished initialization.

Most operator calls use the async pattern, which has built-in delay and retry; however, the synchronous calls made during operator initialization do not have this functionality. Therefore, I've added a wrapper method that performs this retry. I put it at the CallBuilder level because that was the easiest place to add a unit test.
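
For readers following along, here is a minimal sketch of the retry-wrapper idea described above. It is not the code from this PR: the class name `SynchronousRetry`, the method `callWithRetry`, and the `isRetryable` predicate are hypothetical, and the delay is passed in as a parameter rather than read from operator configuration.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Illustrative only: all names here are hypothetical, not the operator's API. */
final class SynchronousRetry {

  private SynchronousRetry() {
  }

  /**
   * Invokes {@code call} until it returns without a retryable failure.
   * There is deliberately no retry limit: the caller cannot proceed without a
   * result, and an external watchdog (for example, a liveness probe) is assumed
   * to terminate the process if the condition never clears.
   */
  static <T> T callWithRetry(Callable<T> call, Predicate<Throwable> isRetryable, long delayMillis)
      throws Exception {
    while (true) {
      try {
        return call.call();
      } catch (Exception e) {
        if (!isRetryable.test(e)) {
          throw e; // not a transient failure, so surface it to the caller
        }
        System.err.println("Call failed, retrying in " + delayMillis + " ms: " + e);
      }
      // Sleep outside the try block so an InterruptedException ends the loop.
      Thread.sleep(delayMillis);
    }
  }
}
```

A caller would wrap each synchronous initialization call in `callWithRetry` and supply a predicate that recognizes connection failures. Keeping the predicate separate from the loop makes the "which exceptions are retryable" decision easy to unit-test.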

Creating this as a draft so that you can take a look. I think I need to use a configured value for the retry delay rather than hardcoding 5 seconds.
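
One way the hardcoded delay could be replaced, sketched under assumptions: the tuning values are taken to be available as a string map, and the key `initializationRetryDelaySeconds` and the class `RetryTuning` are made up for illustration. The 5-second default mirrors the value mentioned above.

```java
import java.util.Map;

/** Illustrative only: the class and the tuning key used with it are hypothetical. */
final class RetryTuning {

  /** Fallback that matches the 5 seconds mentioned in the description. */
  static final long DEFAULT_DELAY_SECONDS = 5L;

  private RetryTuning() {
  }

  /** Returns the configured delay, or the default if the value is missing or invalid. */
  static long retryDelaySeconds(Map<String, String> tuning, String key) {
    String raw = tuning.get(key);
    if (raw == null) {
      return DEFAULT_DELAY_SECONDS;
    }
    try {
      long value = Long.parseLong(raw.trim());
      return value > 0 ? value : DEFAULT_DELAY_SECONDS; // reject zero and negative delays
    } catch (NumberFormatException e) {
      return DEFAULT_DELAY_SECONDS; // a malformed value falls back to the default
    }
  }
}
```

Falling back to the default on bad input keeps a configuration typo from preventing operator startup, at the cost of hiding the typo; logging the fallback would be a reasonable addition.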

Istio tests are clean with this change: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/3454/console

jshum2479 · 2020-12-23T23:44:35Z

operator/src/main/java/oracle/kubernetes/operator/helpers/CallBuilder.java

+      try {
+        result = call.call();
+        complete = true;
+      } catch (RuntimeException re) {


What happens if it is a RuntimeException but not an ApiException? Do we want to limit the number of retries, or let it retry forever (in the hope that the condition can be resolved)?

rjeberhard · 2020-12-24

I'll think about other exceptions... I didn't want to limit the number of retries here because the operator cannot proceed until it can connect to the master, and the liveness probe will eventually kill the operator if it never connects.

ddsharpe · 2021-01-05T15:50:20Z

operator/src/main/java/oracle/kubernetes/operator/helpers/CallBuilder.java

+        result = call.call();
+        complete = true;
+      } catch (RuntimeException re) {
+        Throwable cause = re.getCause();


Does the ApiException ever get nested further down? Is there ever a case where you need to loop through the causes looking for ApiException?
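
If the exception can be nested more deeply, the lookup could walk the cause chain instead of checking only the immediate cause. The helper below is a hypothetical sketch (`Causes` and `findCause` are not part of the operator); in this PR's context the `type` argument would be the Kubernetes client's `ApiException` class, which is left out here so the example stays self-contained.

```java
import java.util.Optional;

/** Illustrative only: a hypothetical helper, not part of the operator code base. */
final class Causes {

  /** Guards against unusually long or cyclic cause chains. */
  private static final int MAX_DEPTH = 32;

  private Causes() {
  }

  /**
   * Returns the first throwable of the requested type in the cause chain,
   * starting with {@code t} itself, or empty if none is found.
   */
  static <E extends Throwable> Optional<E> findCause(Throwable t, Class<E> type) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
      if (type.isInstance(current)) {
        return Optional.of(type.cast(current));
      }
      current = current.getCause();
    }
    return Optional.empty();
  }
}
```

With such a helper, the catch block would test `findCause(re, SomeException.class).isPresent()` rather than `re.getCause() instanceof SomeException`.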

ddsharpe · 2021-01-05T15:52:59Z

operator/src/main/java/oracle/kubernetes/operator/helpers/CallBuilder.java

+          LOGGER.warning(MessageKeys.EXCEPTION, cause);
+        }
+      } catch (Throwable t) {
+        LOGGER.warning(MessageKeys.EXCEPTION, t);


Please add a comment here that we expect the liveness probe to cancel this process if it "retries forever". The next coder may not understand the assumption/expectation.

Retry behavior for synchronous calls during initialization

3ae8b98

rjeberhard requested review from russgold and jshum2479 December 23, 2020 22:23

jshum2479 reviewed Dec 23, 2020


rjeberhard added 2 commits January 4, 2021 16:26

Merge remote-tracking branch 'origin/develop' into owls-86461

1286b17

Add initialization retry tuning

bb16ede

jshum2479 approved these changes Jan 4, 2021


rjeberhard marked this pull request as ready for review January 4, 2021 23:12

ddsharpe approved these changes Jan 5, 2021


Add implementation note

156014b

rjeberhard merged commit 5812d44 into develop Jan 5, 2021

rjeberhard deleted the owls-86461 branch January 5, 2021 21:50

