Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector #2720
operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java
operator/src/main/java/oracle/kubernetes/operator/DomainRecheck.java
| public boolean isDomainNamespace(@Nonnull String namespaceName) { | ||
| return true; // filtering is done by Kubernetes list call | ||
| public boolean isDomainNamespace(@Nonnull V1ObjectMeta nsMetadata) { | ||
| // although filtering is done by Kubernetes list call, there is a rice condition where readExistingNamespaces |
I'm trying to imagine what a "rice" condition could be. :)
Can you explain what this race condition is? Why would K8s give us a non-matching namespace?
The intermittent issue only happens when the selection strategy is changed from List to LabelSelector. When readExistingNamespaceAsync is called, the strategy may still be List, so K8s returns all namespaces.
| } | ||
|
|
||
| @Test | ||
| void withLabelSelector_onCreateReadNamespaces_ignoreSelectorOnList_startsNamespaces() { |
This test name is unclear. What is it trying to verify?
I'm very concerned about the ignoreSelectorOnListOperation call. What K8s behavior is being simulated or tracked?
The condition the change is trying to simulate is this: when the readExistingNamespace async call is issued, the strategy is List, so the result is as if the selector did not exist, even though the strategy is LabelSelector by the time the returned values are processed.
I'm not following that; if the strategy has been changed to LabelSelector, why is K8s acting as though it is List? Are we sending the wrong call? Do we have an in-flight list call going when the strategy is changed? If so, maybe we should address that in the list response?
Yes, the in-flight call uses List. We simulate this race condition by ignoring the selector.
In the product code, we do address the issue in the list response; see the changes in DomainRecheck.
As to the test method name, what about withLabelSelector_returnAllNamespacesOnCreateReadNamespaces_startsExpectedNamespaces?
Having thought more about the problem, I think we should not have used the label selector in the list call for LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way.
That's fine, and lends itself to a simple solution: NamespaceListResponseStep.onSuccess() can call a new method, getMetadatas, rather than getNames, and the Namespaces.isDomainNamespace() method can take a @Nonnull V1ObjectMeta rather than a String.
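As a rough illustration of the suggestion above, the sketch below filters an unfiltered namespace list by inspecting each namespace's metadata under whichever strategy is current when the response is processed. It is not the operator's code: `NamespaceMeta` is a hypothetical stand-in for `V1ObjectMeta`, the `Strategy` enum and static fields are invented for the example, and the selector is assumed to be a plain set of `key=value` equality requirements (real Kubernetes label selectors also allow set-based expressions).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for V1ObjectMeta: only the name and labels matter here.
final class NamespaceMeta {
  final String name;
  final Map<String, String> labels;

  NamespaceMeta(String name, Map<String, String> labels) {
    this.name = name;
    // Metadata may carry no labels at all; treat that as an empty map.
    this.labels = labels == null ? Collections.<String, String>emptyMap() : labels;
  }
}

final class NamespaceFilterSketch {
  // Assumption: only two strategies are modeled in this sketch.
  enum Strategy { LIST, LABEL_SELECTOR }

  // volatile because the strategy can change while a list call is in flight.
  static volatile Strategy strategy = Strategy.LIST;
  static volatile List<String> configuredNames = Collections.emptyList();
  static volatile Map<String, String> requiredLabels = Collections.emptyMap();

  // Decides from the metadata, so every strategy is applied to the response.
  static boolean isDomainNamespace(NamespaceMeta meta) {
    switch (strategy) {
      case LIST:
        return configuredNames.contains(meta.name);
      case LABEL_SELECTOR:
        // Equality-based matching only: every required label must be present.
        for (Map.Entry<String, String> required : requiredLabels.entrySet()) {
          if (!required.getValue().equals(meta.labels.get(required.getKey()))) {
            return false;
          }
        }
        return true;
      default:
        return false;
    }
  }

  // Applies the predicate to the full, unfiltered list returned by the list call.
  static List<String> selectDomainNamespaces(List<NamespaceMeta> listed) {
    List<String> selected = new ArrayList<>();
    for (NamespaceMeta meta : listed) {
      if (isDomainNamespace(meta)) {
        selected.add(meta.name);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    strategy = Strategy.LABEL_SELECTOR;
    requiredLabels = Collections.singletonMap("example-label", "enabled");
    List<NamespaceMeta> listed = Arrays.asList(
        new NamespaceMeta("ns1", Collections.singletonMap("example-label", "enabled")),
        new NamespaceMeta("ns2", null));
    System.out.println(selectDomainNamespaces(listed)); // prints [ns1]
  }
}
```

Because the request never carries a selector in this arrangement, it no longer matters which strategy was active when the call was issued; only the strategy at response time decides.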
As to the test, I would suggest that it should now test the change of strategy between the list call and the response. In that case, the thing to add to KubernetesTestSupport would be a doOnList call, similar to doOnCreate, etc. The test would then set the strategy to, say, List and have the doOnList change it to LabelSelector. That could be shown to fail without any other changes.
I would name such a test whenSelectionStrategyChangesDuringRequest_startDesiredNamespaces.
As a rule, it is better to have the test name describe the desired behavior rather than the implementation of the test.
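The scenario described in this comment can be sketched in isolation. The code below is a toy model, not the operator or its test support: `FakeCluster`, its `doOnList` hook, the `Operator` class, and the label key are all hypothetical names made up for illustration. It models a strategy change that lands between issuing the list request and processing the response, and compares the pre-fix behavior (selector sent with the request, response trusted under LabelSelector) with filtering the response under the current strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StrategyChangeRaceSketch {
  enum Strategy { LIST, LABEL_SELECTOR }

  // Hypothetical fake cluster: namespace name -> whether it carries the label.
  static final class FakeCluster {
    final Map<String, Boolean> namespaces = new LinkedHashMap<>();
    Runnable doOnList = () -> { }; // hypothetical hook, fired during each list call

    List<String> listNamespaces(boolean useSelector) {
      List<String> response = new ArrayList<>();
      for (Map.Entry<String, Boolean> ns : namespaces.entrySet()) {
        if (!useSelector || ns.getValue()) {
          response.add(ns.getKey());
        }
      }
      doOnList.run(); // simulates a change made while the request is in flight
      return response;
    }
  }

  static final class Operator {
    volatile Strategy strategy = Strategy.LIST;
    final List<String> configuredNames = Arrays.asList("ns1", "ns2");

    // Pre-fix model: the request uses the strategy at call time, and the
    // response is trusted as already filtered when the strategy is LABEL_SELECTOR.
    List<String> startNamespacesBeforeFix(FakeCluster cluster) {
      List<String> response = cluster.listNamespaces(strategy == Strategy.LABEL_SELECTOR);
      List<String> started = new ArrayList<>();
      for (String name : response) {
        if (strategy == Strategy.LABEL_SELECTOR || configuredNames.contains(name)) {
          started.add(name);
        }
      }
      return started;
    }

    // Post-fix model: always list everything, then filter with the current strategy.
    // (The fake looks labels up in the cluster map; a real response carries metadata.)
    List<String> startNamespacesAfterFix(FakeCluster cluster) {
      List<String> started = new ArrayList<>();
      for (String name : cluster.listNamespaces(false)) {
        boolean wanted = strategy == Strategy.LABEL_SELECTOR
            ? cluster.namespaces.get(name)
            : configuredNames.contains(name);
        if (wanted) {
          started.add(name);
        }
      }
      return started;
    }
  }

  // Runs the scenario: strategy is LIST at request time, LABEL_SELECTOR at response time.
  static List<String> run(boolean fixed) {
    FakeCluster cluster = new FakeCluster();
    cluster.namespaces.put("ns1", true);  // labeled
    cluster.namespaces.put("ns2", false); // unlabeled
    Operator operator = new Operator();
    cluster.doOnList = () -> operator.strategy = Strategy.LABEL_SELECTOR;
    return fixed
        ? operator.startNamespacesAfterFix(cluster)
        : operator.startNamespacesBeforeFix(cluster);
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // prints [ns1, ns2]: unlabeled ns2 is started by mistake
    System.out.println(run(true));  // prints [ns1]
  }
}
```

A test built this way describes the desired behavior (only the desired namespaces start when the strategy changes mid-request) and fails against the pre-fix model without any other changes.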
We don't need the new test case any more. One of the existing test cases already covers it, now that we have changed listNamespaces for the LabelSelector case.
operator/src/test/java/oracle/kubernetes/operator/helpers/KubernetesTestSupport.java
|
I believe that the current approach is flawed, and requires too many changes, plus is not guaranteed to work in all cases. The basic problem starts with the reality that our list namespace call is asynchronous, meaning that the strategy used to create the call might not match the strategy when we process the response. This has been identified in a specific case: changing from List to LabelSelector, as the latter does no post-processing, assuming that the filtering is being done by Kubernetes. It should also, then, logically be a problem in other combinations. For example, changing from LabelSelector to any other strategy means that the namespaces we start handling will exclude those without the label, even though we have explicitly said that we want to see them all. In general, any attempt to compensate for the mismatch runs into a problem of this kind. I therefore suggest that we have two simpler approaches, depending on how we want the operator to react. Both require creating the list response step with the current strategy
|
The PR contains fixes for other existing and unrelated issues that I noticed during testing, such as an NPE. The only change for this issue is to filter the namespaces that the operator gets from the listNamespaces call, which matches the handling of all other strategies.
This is error-prone because the operator uses the current selection strategy in many other places. If the operator continues with the in-flight one here, we need to make sure all other places use the in-flight strategy as well, which is hard to do and error-prone.
|
In what other cases do you think a problem will occur? It's a problem in the list case because the request and post-processing need to use the same strategy. In most cases, it should not be an issue, as far as I can tell. And if it is indeed a problem, how do the changes here address it? |
|
Having thought more about the problem, I think we should not have used the label selector in the list call for LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way. |
russgold
left a comment
Looks very good. I have left a couple of stylistic recommendations.
operator/src/main/java/oracle/kubernetes/operator/DomainProcessorImpl.java
ankedia
left a comment
LGTM. I have a minor comment about copyright.
Problem: If the DomainNamespaceSelectionStrategy is LabelSelector, the operator relies on specifying the label selector on the listNamespaceAsync call to filter the namespaces. When the strategy is changed from List to LabelSelector while the operator is running, the listNamespaceAsync call may return the full list of namespaces instead of only those that match the label selector, because the strategy is still List when the call is made.
Fix: The changes in this PR remove the label selector from the listNamespaceAsync call so that it always returns the full list, and modify the isDomainNamespace method, which is used to filter the namespaces returned from the listNamespaceAsync call. Instead of always returning true, it now actually evaluates the selector in the LabelSelector case. This approach makes the LabelSelector handling consistent with all other strategies.
This PR also fixes an NPE that I noticed during testing/debugging in DomainProcessorImpl (in the apply method of DomainPresenceInfoStep).
Integration test results (no unknown failures):
Main branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8066/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8080/
This branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8067/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8081/
New results to be posted once available.
Main branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8187/
This branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8188/