Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector #2720

doxiao · 2022-01-19T16:07:27Z

Problem: If the DomainNamespaceSelectionStrategy is LabelSelector, the operator relies on specifying the label selector on the listNamespaceAsync call to filter the namespaces. When the strategy is changed from List to LabelSelector while the operator is running, the listNamespaceAsync call may return the full list of namespaces instead of only the namespaces that match the label selector, because the strategy is still List when the call is made.
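To make the timing concrete, here is a minimal, self-contained sketch of the race. Every name in it (`StaleListRace`, `Strategy`, `buildSelector`, `serverList`, `isDomainNamespaceOld`, the `weblogic-operator` label) is a hypothetical stand-in rather than operator code, and the label selector is reduced to a single required label key.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Collectors;

// Hypothetical model of the race; none of these names come from the operator code base.
final class StaleListRace {
  enum Strategy { LIST, LABEL_SELECTOR }

  // The current strategy; it can change while a list call is in flight.
  static final AtomicReference<Strategy> current = new AtomicReference<>(Strategy.LIST);

  // Namespace name -> labels, standing in for the cluster contents.
  static final Map<String, Map<String, String>> CLUSTER = Map.of(
      "ns1", Map.of("weblogic-operator", "enabled"),
      "ns2", Map.<String, String>of());

  // The request is built from the strategy in effect at call time:
  // a selector is only sent when the strategy is LABEL_SELECTOR.
  static String buildSelector() {
    return current.get() == Strategy.LABEL_SELECTOR ? "weblogic-operator" : null;
  }

  // Simulated server: applies the selector (a required label key) only if one was sent.
  static List<String> serverList(String selector) {
    return CLUSTER.entrySet().stream()
        .filter(e -> selector == null || e.getValue().containsKey(selector))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  // Pre-fix response handling: LABEL_SELECTOR trusts the server and accepts everything.
  static boolean isDomainNamespaceOld(String name) {
    if (current.get() == Strategy.LABEL_SELECTOR) {
      return true;
    }
    return List.of("ns1").contains(name); // LIST: explicitly configured names
  }

  static List<String> raceOutcome() {
    current.set(Strategy.LIST);
    String selector = buildSelector();     // request created under LIST: no selector sent
    current.set(Strategy.LABEL_SELECTOR);  // strategy changes while the call is in flight
    return serverList(selector).stream()
        .filter(StaleListRace::isDomainNamespaceOld)
        .collect(Collectors.toList());
  }
}
```

In this model `raceOutcome()` accepts `ns2` even though it carries no label, which mirrors the symptom described above.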

Fix: The changes in this PR remove the label selectors from the listNamespaceAsync call so that it always returns the full list, and modify the isDomainNamespace method, which is used to filter the namespaces returned from the listNamespaceAsync call; instead of always returning true, it now actually evaluates the selector in the LabelSelector case. This approach makes the LabelSelector handling consistent with all other strategies.
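The shape of the fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `NamespaceMeta` is a hypothetical stand-in for the client's `V1ObjectMeta`, `matches` and `domainNamespaces` are invented helper names, and the selector grammar is simplified to comma-separated `key=value` or bare `key` terms (real Kubernetes label selectors support more operators).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; NamespaceMeta stands in for the client's V1ObjectMeta.
final class LabelSelectorFilter {
  static final class NamespaceMeta {
    final String name;
    final Map<String, String> labels; // may be null, as on a real object

    NamespaceMeta(String name, Map<String, String> labels) {
      this.name = name;
      this.labels = labels;
    }
  }

  // Evaluates a simplified selector: every comma-separated term must hold.
  // "key=value" requires an exact match; a bare "key" requires the label to exist.
  static boolean matches(String selector, Map<String, String> labels) {
    Map<String, String> actual = labels == null ? Map.of() : labels;
    for (String term : selector.split(",")) {
      String t = term.trim();
      if (t.isEmpty()) {
        continue;
      }
      int eq = t.indexOf('=');
      if (eq < 0) {
        if (!actual.containsKey(t)) {
          return false;
        }
      } else if (!t.substring(eq + 1).equals(actual.get(t.substring(0, eq)))) {
        return false;
      }
    }
    return true;
  }

  // Client-side filter applied to the full, unfiltered namespace list.
  static List<String> domainNamespaces(String selector, List<NamespaceMeta> all) {
    return all.stream()
        .filter(meta -> meta != null && matches(selector, meta.labels))
        .map(meta -> meta.name)
        .collect(Collectors.toList());
  }
}
```

Because the filter runs on the response, the result no longer depends on which strategy was active when the request was created.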

This PR also fixes an NPE that I noticed during testing/debugging in DomainProcessorImpl (in the apply method of DomainPresenceInfoStep).

Integration test results (no unknown failures):
Main branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8066/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8080/

This branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8067/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8081/

New results to be posted once available.
Main branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8187/
This branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8188/

operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java

operator/src/main/java/oracle/kubernetes/operator/DomainRecheck.java

operator/src/main/java/oracle/kubernetes/operator/Namespaces.java

russgold · 2022-01-21T16:00:08Z

operator/src/main/java/oracle/kubernetes/operator/Namespaces.java

-      public boolean isDomainNamespace(@Nonnull String namespaceName) {
-        return true;  // filtering is done by Kubernetes list call
+      public boolean isDomainNamespace(@Nonnull V1ObjectMeta nsMetadata) {
+        // although filtering is done by Kubernetes list call, there is a rice condition where readExistingNamespaces


I'm trying to imagine what a "rice" condition could be. :)

Can you explain what this race condition is? Why would K8s give us a non-matching namespace?

The intermittent issue only happens when the selection strategy is changed from List to LabelSelector. When readExistingNamespaceAsync is called, the strategy may still be List, so K8s returns all namespaces.

russgold · 2022-01-21T16:02:17Z

operator/src/test/java/oracle/kubernetes/operator/MainTest.java

  }

+  @Test
+  void withLabelSelector_onCreateReadNamespaces_ignoreSelectorOnList_startsNamespaces() {


This test name is unclear. What is it trying to verify?

I'm very concerned about the ignoreSelectorOnListOperation call. What K8s behavior is being simulated or tracked?

The condition the change is trying to simulate is this: when the readExistingNamespace async call is issued, the strategy is List, so the result is as if the selector does not exist, even though the strategy is LabelSelector by the time the returned values are processed.

I'm not following that; if the strategy has been changed to LabelSelector, why is K8s acting as though it is List? Are we sending the wrong call? Do we have an in-flight list call going when the strategy is changed? If so, maybe we should address that in the list response?

Yes, the in-flight call uses List. We simulate this race condition by ignoring the selector.

In the product code, we do address the issue in the list response, through the changes in DomainRecheck.

As to the test method name, what about withLabelSelector_returnAllNamespacesOnCreateReadNamespaces_startsExpectedNamespaces?

Having thought more about the problem, I think we should not have used the label selector in the list call for the LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way.

That's fine, and lends itself to a simple solution: NamespaceListResponseStep.onSuccess() can call a new method, getMetadatas, rather than getNames, and the Namespaces.isDomainNamespace() method can take a @Nonnull V1ObjectMeta rather than a String.

As to the test, I would suggest that it should now test the change of strategy between the list call and the response. In that case, the thing to add to KubernetesTestSupport would be a doOnList call, similar to doOnCreate, etc. The test would then set the strategy to, say, List and have the doOnList change it to LabelSelector. That could be shown to fail without any other changes.

I would name such a test, whenSelectionStrategyChangesDuringRequest_startDesiredNamespaces.

As a rule, it is better to have the test name describe the desired behavior rather than the implementation of the test.
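A rough sketch of what such a test could look like, assuming a `doOnList`-style hook existed. The `FakeSupport` class below is a hypothetical miniature written for this illustration and is not `KubernetesTestSupport`; `doOnList` is the hook proposed above, not an existing API; and the `managed` label and namespace names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical test double; doOnList is the proposed hook, not an existing API.
final class StrategyChangeDuringListSketch {
  enum Strategy { LIST, LABEL_SELECTOR }

  static Strategy strategy = Strategy.LIST;
  static final List<String> CONFIGURED = List.of("ns1", "ns2");
  static final String REQUIRED_LABEL = "managed";

  // Minimal fake with a hook that runs while the list call is "in flight".
  static final class FakeSupport {
    private final Map<String, Map<String, String>> namespaces = new TreeMap<>();
    private Runnable onList = () -> { };

    void defineNamespace(String name, Map<String, String> labels) {
      namespaces.put(name, labels);
    }

    void doOnList(Runnable hook) {
      this.onList = hook;
    }

    Map<String, Map<String, String>> listAll() {
      onList.run(); // simulate an event occurring between request and response
      return namespaces;
    }
  }

  // Post-fix filter: every strategy, including LABEL_SELECTOR, is evaluated on the response.
  static boolean isDomainNamespace(String name, Map<String, String> labels) {
    return strategy == Strategy.LABEL_SELECTOR
        ? labels.containsKey(REQUIRED_LABEL)
        : CONFIGURED.contains(name);
  }

  static List<String> whenSelectionStrategyChangesDuringRequest() {
    strategy = Strategy.LIST;
    FakeSupport support = new FakeSupport();
    support.defineNamespace("ns1", Map.of());
    support.defineNamespace("ns2", Map.of());
    support.defineNamespace("ns3", Map.of(REQUIRED_LABEL, "true"));
    support.doOnList(() -> strategy = Strategy.LABEL_SELECTOR);

    List<String> started = new ArrayList<>();
    support.listAll().forEach((name, labels) -> {
      if (isDomainNamespace(name, labels)) {
        started.add(name);
      }
    });
    return started;
  }
}
```

With response-side filtering the sketch starts only the labeled namespace; replacing the `LABEL_SELECTOR` branch with `true` (the pre-fix behavior) would make it start all three.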

We don't need the new test case any more. One of the existing test cases already covers it, now that we have changed the listNamespaces call for the LabelSelector case.

operator/src/test/java/oracle/kubernetes/operator/helpers/KubernetesTestSupport.java

russgold · 2022-01-21T19:39:18Z

I believe that the current approach is flawed, and requires too many changes, plus is not guaranteed to work in all cases.

The basic problem starts with the reality that our list namespace call is asynchronous, meaning that the strategy used to create the call might not match the strategy when we process the response. This has been identified in a specific case: changing from List to LabelSelector, as the latter does no post-processing, assuming that the filtering is being done by Kubernetes. It should also, then, logically be a problem in other combinations. For example, changing from LabelSelector to any other strategy means that the namespaces we start handling will exclude those without the label, even though we have explicitly said that we want to see them all. In general, any attempt to compensate for the mismatch runs into a problem of this kind. I therefore suggest that we have two simpler approaches, depending on how we want the operator to react. Both require creating the list response step with the current strategy:

1. We could simply use the specified strategy for all strategy-dependent computations, thus ensuring that the request and processing are consistent. This would mean that the user’s change of strategies would take effect with the next list operation, not the in-flight one.
2. We could declare a mismatch to be an error, and restart the request. This would cause the change to take effect even with an in-flight request. The difficulty is whether we have a way to restart a multi-step list (used when there are a large number of namespaces), when the change happens partway through. If there is not, I would recommend against this approach.
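The first approach, creating the response step with the strategy in effect at request time, might look something like the sketch below. It is only an illustration of the idea: `SnapshotStrategySketch`, `ListResponseStep`, and the `accepts` method are hypothetical and do not correspond to the operator's step classes, and the namespace names are made up.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Collectors;

// Hypothetical sketch of the "use the request-time strategy" approach.
final class SnapshotStrategySketch {
  enum Strategy {
    LIST, LABEL_SELECTOR;

    // Response-side filtering as it worked before the fix:
    // LIST checks configured names, LABEL_SELECTOR trusts server-side filtering.
    boolean accepts(String name) {
      return this == LABEL_SELECTOR || List.of("ns1").contains(name);
    }
  }

  static final AtomicReference<Strategy> current = new AtomicReference<>(Strategy.LIST);

  // Response step that remembers the strategy used to build the request.
  static final class ListResponseStep {
    private final Strategy requestStrategy;

    ListResponseStep(Strategy requestStrategy) {
      this.requestStrategy = requestStrategy;
    }

    List<String> onSuccess(List<String> names) {
      // Uses the captured strategy, never current.get(), so request and processing agree.
      return names.stream().filter(requestStrategy::accepts).collect(Collectors.toList());
    }
  }

  static List<String> run() {
    current.set(Strategy.LIST);
    ListResponseStep step = new ListResponseStep(current.get()); // snapshot at request time
    current.set(Strategy.LABEL_SELECTOR);                        // change while in flight
    return step.onSuccess(List.of("ns1", "ns2"));                // unfiltered LIST response
  }
}
```

In this sketch the in-flight response is still processed as List, so the user's change takes effect with the next list operation, which is the trade-off the first approach accepts.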

doxiao · 2022-01-21T19:52:49Z

I believe that the current approach is flawed, and requires too many changes, plus is not guaranteed to work in all cases.

The PR contains fixes for other existing and unrelated issues that I noticed during testing, such as the NPE. The only change for this issue is to filter the namespaces that the operator gets from the listNamespaces call, which matches the handling of all other strategies.

The basic problem starts with the reality that our list namespace call is asynchronous, meaning that the strategy used to create the call might not match the strategy when we process the response. This has been identified in a specific case: changing from List to LabelSelector, as the latter does no post-processing, assuming that the filtering is being done by Kubernetes. It should also, then, logically be a problem in other combinations. For example, changing from LabelSelector to any other strategy means that the namespaces we start handling will exclude those without the label, even though we have explicitly said that we want to see them all. In general, any attempt to compensate for the mismatch runs into a problem of this kind. I therefore suggest that we have two simpler approaches, depending on how we want the operator to react. Both require creating the list response step with the current strategy:
1. We could simply use the specified strategy for all strategy-dependent computations, thus ensuring that the request and processing are consistent. This would mean that the user’s change of strategies would take effect with the next list operation, not the in-flight one.

This is error-prone because the operator uses the current selection strategy in many other places. If the operator continues with the in-flight one here, we need to make sure all other places use the in-flight strategy as well, which is hard to do and error-prone.

2. We could declare a mismatch to be an error, and restart the request. This would cause the change to take effect even with an in-flight request. The difficulty is whether we have a way to restart a multi-step list (used when there are a large number of namespaces), when the change happens partway through. If there is not, I would recommend against this approach.

russgold · 2022-01-21T19:58:51Z

This is error-prone because the operator uses the current selection strategy in many other places. If the operator continues with the in-flight one here, we need to make sure all other places use the in-flight strategy as well, which is hard to do and error-prone.

In what other cases do you think a problem will occur? It's a problem in the list case because the request and post-processing need to use the same strategy. In most cases, it should not be an issue, as far as I can tell. And if it is indeed a problem, how do the changes here address it?

doxiao · 2022-01-21T20:21:01Z

Having thought more about the problem, I think we should not have used the label selector in the list call for the LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way.

russgold

Looks very good. I have left a couple of stylistic recommendations.

operator/src/main/java/oracle/kubernetes/operator/Main.java

operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java

ankedia

LGTM. I have a minor comment about copyright.

integration-tests/src/test/java/oracle/weblogic/kubernetes/utils/K8sEvents.java

doxiao added 3 commits January 18, 2022 10:08

Verify label selectors against namespaces returned by readExistingNamespaces, plus fix NPE

e74da57

Merge remote-tracking branch 'origin/main' into owls93995-list2labelselector-race

3ea045e

Minor test change to make the test result more eeliable

ff3d802

doxiao requested review from ankedia, rjeberhard, russgold and sankarpn January 19, 2022 16:07

russgold suggested changes Jan 21, 2022

doxiao added 2 commits January 21, 2022 14:34

Address review comments

7ab063e

Merge remote-tracking branch 'origin/main' into owls93995-list2labelselector-race

7ef5bd5

doxiao changed the title ~~Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector~~ WIP Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector Jan 21, 2022

doxiao added 4 commits January 21, 2022 16:06

Remove label selectors from list namespaces call

cf2041a

Filter null metsdata

a539255

Minor improvement

e3d6c38

Remove unnecessary comment

d028c5f

doxiao changed the title ~~WIP Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector~~ Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector Jan 24, 2022

doxiao requested a review from russgold January 24, 2022 13:14

russgold approved these changes Jan 24, 2022

operator/src/main/java/oracle/kubernetes/operator/Main.java

operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java

Address the comments/suggestions to code style

d75a4c9

ankedia approved these changes Jan 24, 2022

integration-tests/src/test/java/oracle/weblogic/kubernetes/utils/K8sEvents.java

Fix a copyright

f87775c

robertpatrick merged commit 4cef5a0 into main Jan 24, 2022

robertpatrick deleted the owls93995-list2labelselector-race branch January 24, 2022 18:33

doxiao mentioned this pull request Jan 27, 2022

Backport PR#2720 for owls-93995 to release 3.3 #2743

Merged
