Skip to content

Conversation

@doxiao
Copy link
Member

@doxiao doxiao commented Jan 19, 2022

Problem: If the DomainNamespaceSelectionStrategy is LabelSelector, the operator relies on specifying the LabelSelector on the listNamespaceAsync call to filter the namespaces. When the strategy is changed from List to LabelSelector when the operator is running, listNamespaceAsync call may return the full list of namespaces instead of the namespaces that match the label selector, because the strategy is still List when the call is made.

Fix: The changes in this PR removes the label selectors from the listNamespaceAsync call so that it always returns the full list, and modifies the isDomainNamespace method, which is used to filter the namespaces that are returned from the listNamespaceAsync call; instead of always returning true, it now actually evaluates the selector in the LabelSelector case. This approach makes the LabelSelector handling consistent with all other strategies.

This PR also fixes a NPE that I noticed during testing/debugging in DomainProcessorImpl (in the apply method of DomainPresenceInfoStep).

Integration test results (no unknown failures):
Main branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8066/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8080/

This branch:
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8067/
https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8081/

New results to be posted once available.
Main branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8187/
This branch: https://build.weblogick8s.org:8443/job/weblogic-kubernetes-operator-kind-new/8188/

public boolean isDomainNamespace(@Nonnull String namespaceName) {
return true; // filtering is done by Kubernetes list call
public boolean isDomainNamespace(@Nonnull V1ObjectMeta nsMetadata) {
// although filtering is done by Kubernetes list call, there is a rice condition where readExistingNamespaces
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to imagine what a "rice" condition could be. :)

Can you explain what this race condition is? Why would K8s give us a non-matching namespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intermittent issue only happens when the selection strategy is changed from List to LabelSelector. When the readExistingNamespaceAysnc is called, the strategy may still be List, so the K8S returns all namespaces.

}

@Test
void withLabelSelector_onCreateReadNamespaces_ignoreSelectorOnList_startsNamespaces() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test name is unclear. What is it trying to verify?

I'm very concerned about the ignoreSelectorOnListOperation call. What K8s behavior is being simulated or tracked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior/condition the change is trying to simulate is when the readExistingNamespace async call is issued, the strategy is List, so the results is as if the selector does not exist although the strategy is LabelSelector when the returned valued are processed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following that; if the strategy has been changed to LabelSelector, why is K8s acting as though it is List? Are we sending the wrong call? Do we have an in-flight list call going when the strategy is changed? If so, maybe we should address that in the list response?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the in-flight call uses List. We simulate this race condition by ignoring the selector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the product code, we do address the issue in the list response as the changes in DomainRecheck.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to the test method name, what about withLabelSelector_returnAllNamespacesOnCreateReadNamespaces_startsExpectedNamespaces?

Copy link
Contributor

@russgold russgold Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought more about the problem, I think we should not have used the label selector in the list call for LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way.

That's fine, and lends itself to a simple solution: NamespaceListResponseStep.onSuccess() can call a new method, getMetadatas rather getNames and the Namespaces.isDomainNamespace() method can take a @nonnull V1ObjectMeta rather than a String.

Copy link
Contributor

@russgold russgold Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to the test, I would suggest that it should now test the change of strategy between the list call and the response. In that case, the thing to add to KubernetesTestSupport would be a doOnList call, similar to doOnCreate, etc. The test would then set the strategy to, say, List and have the doOnList change it to LabelSelection. That could be shown to fail without any other changes.

I would name such a test, whenSelectionStrategyChangesDuringRequest_startDesiredNamespaces.

As a rule, it is better to have the test name describe the desired behavior rather than the implementation of the test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the new test case any more. One of the existing test case covers it already now that we changed the listNamespaces for LabelSelector case.

@russgold
Copy link
Contributor

I believe that the current approach is flawed, and requires too many changes, plus is not guaranteed to work in all cases.

The basic problem starts with the reality that our list namespace call is asynchronous, meaning that the strategy used to create the call might not match the strategy when we process the response. This has been identified in a specific case: changing from List to LabelSelector, as the latter does no post-processing, assuming that the filtering is being done by Kubernetes. It should also, then, logically be a problem in other combinations. For example, changing from LabelSelector to any other strategy means that the namespaces we start handling will exclude those without the label, even though we have explicitly said that we want to see them all. In general, any attempt to compensate for the mismatch runs into a problem of this kind. I therefore suggest that we have two simpler approaches, depending on how we want the operator to react. Both require creating the list response step with the current strategy

  1. We could simply use the specified strategy for all strategy-dependent computations, thus ensuring that the request and processing are consistent. This would mean that the user’s change of strategies would take effect with the next list operation, not the in-flight one.

  2. We could declare a mismatch to be an error, and restart the request. This would cause the change to take effect even with an in-flight request. The difficulty is whether we have a way to restart a multi-step list (used when there are a large number of namespaces), when the change happens partway through. If there is not, I would recommend against this approach.

@doxiao
Copy link
Member Author

doxiao commented Jan 21, 2022

I believe that the current approach is flawed, and requires too many changes, plus is not guaranteed to work in all cases.

The PR contains fixes for other existing and unrelated issues that I noticed during testing, such as NPE. The only change for this is to filter the namespaces that the operator get from the listNamesapces call, which matches the handling of all other strategies.

The basic problem starts with the reality that our list namespace call is asynchronous, meaning that the strategy used to create the call might not match the strategy when we process the response. This has been identified in a specific case: changing from List to LabelSelector, as the latter does no post-processing, assuming that the filtering is being done by Kubernetes. It should also, then, logically be a problem in other combinations. For example, changing from LabelSelector to any other strategy means that the namespaces we start handling will exclude those without the label, even though we have explicitly said that we want to see them all. In general, any attempt to compensate for the mismatch runs into a problem of this kind. I therefore suggest that we have two simpler approaches, depending on how we want the operator to react. Both require creating the list response step with the current strategy

1. We could simply use the specified strategy for all strategy-dependent computations, thus ensuring that the request and processing are consistent. This would mean that the user’s change of strategies would take effect with the next list operation, not the in-flight one.

This is error-prune because the operator uses the current selection strategy in many other places. If the operator continues with the in-flight one here, we need to make sure all other places use the in-flight strategy as well, which is hard to do and error-prune.

2. We could declare a mismatch to be an error, and restart the request. This would cause the change to take effect even with an in-flight request. The difficulty is whether we have a way to restart a multi-step list (used when there are a large number of namespaces), when the change happens partway through. If there is not, I would recommend against this approach.

@russgold
Copy link
Contributor

This is error-prune because the operator uses the current selection strategy in many other places. If the operator continues with the in-flight one here, we need to make sure all other places use the in-flight strategy as well, which is hard to do and error-prune.

In what other cases do you think a problem will occur? It's a problem in the list case because the request and post-processing need to use the same strategy. In most cases, it should not be an issue, as far as I can tell. And if it is indeed a problem, how do the changes here address it?

@doxiao
Copy link
Member Author

doxiao commented Jan 21, 2022

Having thought more about the problem, I think we should not have used the label selector in the list call for LabelSelector case. We should always list all namespaces (without a selector), and filter the returned list with the selector. All other strategies are handled this way.

@doxiao doxiao changed the title Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector WIP Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector Jan 21, 2022
@doxiao doxiao changed the title WIP Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector Owls93995 fix a race condition when DomainNamespaceSelectionStrategy is changed from List to LabelSelector Jan 24, 2022
@doxiao doxiao requested a review from russgold January 24, 2022 13:14
Copy link
Contributor

@russgold russgold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good. I have left a couple of stylistic recommendations.

Copy link
Member

@ankedia ankedia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have a minor comment about copyright.

@robertpatrick robertpatrick merged commit 4cef5a0 into main Jan 24, 2022
@robertpatrick robertpatrick deleted the owls93995-list2labelselector-race branch January 24, 2022 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants