
Recheck for cloud taint before initializing nodes in cloud controller #75405

Merged
merged 1 commit into kubernetes:master from tghartland:node-init-fix May 14, 2019

Conversation

@tghartland
Contributor

commented Mar 15, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

If a node generates an Update event before its Add event has been processed by the node controller, then by the time the Update is processed the node object it carries is outdated and does not reflect that the node has already been initialized. The Update event therefore initializes the node again.

With enough nodes that the Update event rate exceeds the rate at which the node controller can call initializeNode (taking into account the time spent making queries to the cloud), events start to pile up. A node that is added later than the others will have its initial Add event at the end of the event queue, meaning it can take a long time to be processed. And while that node is waiting to be initialized, it keeps filling the event queue with more Update events, each of which will also call initializeNode.

By maintaining an internal map of nodes that have been initialized, outdated Update events for nodes that are known to have already been initialized can be discarded.
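The map-based guard described above can be sketched as a small mutex-protected map. This is a simplified, hypothetical illustration of this early iteration of the fix (names like initGuard are made up, not the actual controller code):

```go
package main

import (
	"fmt"
	"sync"
)

// initGuard tracks node names whose initialization has completed, so
// stale Update events carrying an outdated node object can be skipped.
type initGuard struct {
	mu          sync.Mutex
	initialized map[string]bool
}

func newInitGuard() *initGuard {
	return &initGuard{initialized: make(map[string]bool)}
}

// MarkInitialized records that a node finished initialization.
func (g *initGuard) MarkInitialized(name string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.initialized[name] = true
}

// ShouldSkip reports whether an Update event for the named node can be
// discarded because the node is already known to be initialized.
func (g *initGuard) ShouldSkip(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.initialized[name]
}

func main() {
	g := newInitGuard()
	g.MarkInitialized("node-a")
	fmt.Println(g.ShouldSkip("node-a")) // true: stale Update can be dropped
	fmt.Println(g.ShouldSkip("node-b")) // false: node-b still needs initialization
}
```

As the review discussion shows, the tricky part of this approach is deciding when an entry can safely be removed again.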

There is more detailed information in issue #75285

Which issue(s) this PR fixes:

Fixes #75285

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix bug where cloud-controller-manager initializes nodes multiple times
@k8s-ci-robot

Contributor

commented Mar 15, 2019

Hi @tghartland. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tghartland

Contributor Author

commented Mar 15, 2019

/sig cloud-provider
/sig scalability
/area cloudprovider
/area controller-manager

@tghartland

Contributor Author

commented Mar 15, 2019

cnc.initializedNodesMutex.Lock()
machineID, found := cnc.initializedNodes[node.Name]
cnc.initializedNodesMutex.Unlock()
alreadyInitialized := found && (machineID == node.Status.NodeInfo.MachineID)

@andrewsykim

andrewsykim Mar 15, 2019

Member

I believe NodeInfo.MachineID is only set if cAdvisor is enabled, not sure if we want to be relying on cAdvisor behavior in this way here 🤔

@tghartland

tghartland Mar 15, 2019

Author Contributor

Ah, OK. Is there another way to uniquely identify a node like this?
Otherwise I think I can manage without it by removing a name from the map once it is seen in an Update event where the node object appears initialized, since from then on it can be discarded for that reason.


@yastij

yastij Mar 20, 2019

Member

we might want to rely on providerID of the instance here cc @andrewsykim

@andrewsykim

andrewsykim Mar 20, 2019

Member

With the current state of the PR, I think it's possible for the provider ID to be not set 🤔

@yastij

yastij Mar 20, 2019

Member

on the kubelet we have:

		if node.Spec.ProviderID == "" {
			node.Spec.ProviderID, err = cloudprovider.GetInstanceProviderID(context.TODO(), kl.cloud, kl.nodeName)
			if err != nil {
				return nil, err
			}
		}

Basically it relies on the InstanceID call from the Instances interface, so as long as the cloud provider implements this, it should be fine. Thoughts?

@andrewsykim

andrewsykim Mar 20, 2019

Member

IIRC, that code path is only for in-tree providers. This controller only runs out-of-tree.

@yastij

yastij Mar 20, 2019

Member

True, there are also some other ways:

  • store the providerID in the map after initialization and use cloudprovider.GetInstanceProviderID(context.TODO(), cnc.cloud, types.NodeName(curNode.Name)) for comparison (this might be expensive, as it needs a call to retrieve instanceID())

  • change the data structure to a set of nodeNames and make sure we add the nodeName to the set at the end of the initialization; when receiving an update we can check for presence in the set

@yastij

yastij Mar 21, 2019

Member

If the Add fails there's no reason to keep its nodeName in the set, so when receiving an Update event we'd still initialize the node

@tghartland

tghartland Mar 21, 2019

Author Contributor

Names are removed from the set when an Update event sees the transition from cloud taint to no cloud taint on the event's node object. If the Add event initialization for a second node with the same name failed, and it couldn't be initialized on an Update event because the name already exists in the set, it would never be initialized, so that transition would never be seen.

Maybe going back to a map and having the value in the map be the time that the name is added would be a way to check if a name should really still be tracked or not.

@yastij

yastij Mar 21, 2019

Member

@tghartland - if we are not able to initialize the node we shouldn't add it to the set, so if there's an error on the Add we won't add the nodeName.
If we remove the node right after initialization and add it back, the Add event will be processed and start initialization, with 2 possible outcomes I think:

  • initialization succeeds, and we add the nodeName to the set
  • initialization fails, and we remove the nodeName from the set if it exists

When receiving an Update after the Add event, the set will behave as it should, since if initialization is done the nodeName will be in the set.

@tghartland

tghartland Mar 21, 2019

Author Contributor

initialization failed, we remove the nodeName from the set if it exists

In fact it doesn't even have to fail: if the name is in the set at the start of an Add event then it must be left over from a previous node with the same name, and should be removed.
Commit fixing this: 7f12d7d
I think this takes care of any issues to do with a name not being removed properly.

@yastij

yastij Mar 21, 2019

Member

This would work too. More generally, seeing an Add should clean the nodeName from the set.

@tghartland tghartland force-pushed the tghartland:node-init-fix branch from 3794403 to e578ab5 Mar 21, 2019

@k8s-ci-robot

Contributor

commented Mar 21, 2019

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@tghartland tghartland force-pushed the tghartland:node-init-fix branch from e578ab5 to 564d7cd Mar 21, 2019

@andrewsykim

Member

commented May 7, 2019

We can, the question is whether this will cause too many API requests. The only time the request needs to be made is when the cloud taint is present, so when the cluster is static it will not be called. Even when adding nodes it will not need to be used often, but it could be possible to have a scenario where:

It's fewer requests than you think, because we re-check the taint only if the update event's resource has the taint. FWIW, a GET request to the API server is cheaper than multiple requests to cloud APIs, so the trade-off seems worth it if the solution is simple. If we think a GET request is too expensive, we can even check the node cache from the lister; there's still a chance we get a stale object, but it's still better than what we had before.

This is why I like the cache.FIFO. There is no reason to hold on to more than one event for a particular node in the queue. Maybe combining both solutions is the best option?

As you mentioned, with the FIFO cache there's still a chance we would reinitialize a node, so not sure what the benefit of adding that complexity is over just re-checking state in cache.
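For context, the deduplication property being discussed — keeping at most one pending event per node, in the spirit of cache.FIFO — can be sketched like this (hypothetical, simplified; keyedQueue is an illustrative stand-in, not the client-go implementation):

```go
package main

import "fmt"

// keyedQueue keeps at most one pending event per node name, so a backlog
// of stale Update events for one node collapses into a single entry.
type keyedQueue struct {
	order   []string
	pending map[string]string // node name -> latest event payload
}

func newKeyedQueue() *keyedQueue {
	return &keyedQueue{pending: make(map[string]string)}
}

// Add enqueues an event, replacing any event already queued for the name.
func (q *keyedQueue) Add(name, event string) {
	if _, ok := q.pending[name]; !ok {
		q.order = append(q.order, name)
	}
	q.pending[name] = event // newer event replaces the queued one
}

// Pop returns the oldest queued name with its latest event.
func (q *keyedQueue) Pop() (string, string, bool) {
	if len(q.order) == 0 {
		return "", "", false
	}
	name := q.order[0]
	q.order = q.order[1:]
	event := q.pending[name]
	delete(q.pending, name)
	return name, event, true
}

func main() {
	q := newKeyedQueue()
	q.Add("node-a", "update-1")
	q.Add("node-a", "update-2") // collapses with update-1
	q.Add("node-b", "add")
	name, event, _ := q.Pop()
	fmt.Println(name, event) // prints: node-a update-2
}
```

As noted above, even a deduplicating queue cannot rule out re-initialization from a stale object, which is why the discussion converges on re-checking state instead.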

@andrewsykim

Member

commented May 7, 2019

Looks like we already get the latest state of the node here. If we're already making the API request, let's recheck the taint state there; seems like the fix might be easier than we thought.

@tghartland

Contributor Author

commented May 8, 2019

@andrewsykim that works, I've just tried it.

Should this case print a line in the logs, or is it not necessary? With only the minimal change

+       cloudTaint := getCloudTaint(curNode.Spec.Taints)
+       if cloudTaint == nil {
+               // Node object from event had the cloud taint but was outdated
+               return fmt.Errorf("Node %s already had cloud taint removed", curNode.Name)
+       }

an error line is printed because of this later part:

    if err != nil {
        utilruntime.HandleError(err)
        return
    }

It is not possible to return from the RetryOnConflict with nil, as that then prints the "Successfully initialized" line. We could return a custom error type and check for that to return from initializeNode without logging anything.

Also, getting the current node state should probably be moved to the top of the function passed to RetryOnConflict, yes? Otherwise that GCE code is still going to be run.

@andrewsykim

Member

commented May 8, 2019

Also, getting the current node state should probably be moved to the top of the function passed to RetryOnConflict, yes? Otherwise that GCE code is still going to be run.

That sounds good to me.

It is not possible to return from the RetryOnConflict with nil, as that then prints the "Successfully initialized" line. We could return a custom error type and check for that to return from initializeNode without logging anything.

Let's move the log line into the end of the retry block?

@tghartland

Contributor Author

commented May 8, 2019

Let's move the log line into the end of the retry block?

Good idea. I'll sort out moving those changes into this PR's branch tomorrow, then remove the WIP status (and probably change the title as well).

@andrewsykim

Member

commented May 8, 2019

Good idea. I'll sort out moving those changes into this PR's branch tomorrow, then remove the WIP status (and probably change the title as well).

Sounds good, thanks for iterating @tghartland much appreciated!!

@tghartland

Contributor Author

commented May 9, 2019

Moving the fetch of the current node state above the GCE-specific part does introduce one complication, which the tests caught.

=== RUN   TestGCECondition
--- FAIL: TestGCECondition (0.00s)
    node_controller_test.go:374: 
                Error Trace:    node_controller_test.go:374
                Error:          Not equal: 
                                expected: 2
                                actual  : 1
                Test:           TestGCECondition
                Messages:       No new conditions were added for GCE
    node_controller_test.go:383: 
                Error Trace:    node_controller_test.go:383
                Error:          Should be true
                Test:           TestGCECondition
                Messages:       Network Route Condition for GCE not added by external cloud initializer

What happens is that nodeutil.SetNodeCondition modifies the node's state, so when the current node state is fetched first, it is no longer current after the condition is set. Then when the node is updated later, the update removes the conditions that were added in the GCE part.

Either we should leave the GCE code at the top of RetryOnConflict, or the current state has to be refreshed after SetNodeCondition, like so:

    err := clientretry.RetryOnConflict(UpdateNodeSpecBackoff, func() error {
            curNode, err := cnc.kubeClient.CoreV1().Nodes().Get(node.Name, metav1.GetOptions{})
            if err != nil {
                    return err 
            }   
 
            cloudTaint := getCloudTaint(curNode.Spec.Taints)
            if cloudTaint == nil {
                    // Node object received from event had the cloud taint but was outdated,
                    // the node has actually already been initialized.
                    return nil 
            }   
 
            // TODO(wlan0): Move this logic to the route controller using the node taint instead of condition
            // Since there are node taints, do we still need this?
            // This condition marks the node as unusable until routes are initialized in the cloud provider
            if cnc.cloud.ProviderName() == "gce" {
                    if err := nodeutil.SetNodeCondition(cnc.kubeClient, types.NodeName(node.Name), v1.NodeCondition{
                            Type:               v1.NodeNetworkUnavailable,
                            Status:             v1.ConditionTrue,
                            Reason:             "NoRouteCreated",
                            Message:            "Node created without a route",
                            LastTransitionTime: metav1.Now(),
                    }); err != nil {
                            return err 
                    }   
+
+                   var err error
+                   curNode, err = cnc.kubeClient.CoreV1().Nodes().Get(node.Name, metav1.GetOptions{})
+                   if err != nil {
+                           return err 
+                   }   
            }

Or curNode could be directly modified to add the condition without having to make another API request:

       if cnc.cloud.ProviderName() == "gce" {
-              if err := nodeutil.SetNodeCondition(cnc.kubeClient, types.NodeName(node.Name), v1.NodeCondition{
+              nodeCondition := v1.NodeCondition{
                       Type:               v1.NodeNetworkUnavailable,
                       Status:             v1.ConditionTrue,
                       Reason:             "NoRouteCreated",
                       Message:            "Node created without a route",
                       LastTransitionTime: metav1.Now(),
-              }); err != nil {
+              }   
+
+              if err := nodeutil.SetNodeCondition(cnc.kubeClient, types.NodeName(node.Name), nodeCondition); err != nil {
                       return err 
               }   
+
+              curNode.Status.Conditions = append(curNode.Status.Conditions, nodeCondition)
       }

But I'm not sure that it's a good idea to trust that the manually changed node will always be identical to what SetNodeCondition does. It does make the tests pass though.

@yastij

Member

commented May 9, 2019

@tghartland @andrewsykim - Sorry for the delay

But I'm not sure that it's a good idea to trust that the manually changed node will always be identical to what SetNodeCondition does. It does make the tests pass though.

I think this can be avoided by using patch, as we'd send only a diff; that would avoid conflicts on the node object. There's a broader issue with context here: #56350

@tghartland

Contributor Author

commented May 9, 2019

@yastij do you mean using patch instead of update here?

curNode.Spec.Taints = excludeCloudTaint(curNode.Spec.Taints)
_, err = cnc.kubeClient.CoreV1().Nodes().Update(curNode)

Because SetNodeCondition has this line:

    condition.LastHeartbeatTime = metav1.NewTime(time.Now())

which means that the condition you pass to the function isn't exactly what gets set on the node. The only way to get the exact condition that is set on the node is to request the node state from the API again. Without that, the heartbeat time would be changed when sending the update.
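To illustrate the patch idea in isolation, here is a hedged sketch of building a strategic-merge-patch payload. taintRemovalPatch is a hypothetical helper; it assumes node.spec.taints merges by "key" and honors the "$patch": "delete" directive (which would need verifying against the real API types), and the real call would go through the client's Nodes().Patch method rather than Update:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taintRemovalPatch builds a strategic-merge-patch payload that removes a
// single taint by key, instead of sending the whole node object with
// Update. Only the diff is sent, so concurrent writers (for example the
// kubelet refreshing condition heartbeat times) are less likely to be
// clobbered.
func taintRemovalPatch(taintKey string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"taints": []map[string]string{
				{"$patch": "delete", "key": taintKey},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, err := taintRemovalPatch("node.cloudprovider.kubernetes.io/uninitialized")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```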

@andrewsykim

Member

commented May 9, 2019

@tghartland on second thought, let's recheck the node after setting the GCE node condition. I don't see any harm in re-adding the NoRoute condition there; that should be idempotent anyway.

Abort node initialization if cloud taint was already removed
If node events are received at a faster rate than they
can be processed, then initialization for some nodes will
be delayed. Once those nodes are eventually processed their
cloud taint is removed, but there may already be several
Update events for them, with the cloud taint still present,
in the event queue. To avoid re-initializing those nodes,
the cloud taint is checked again after requesting
the current state of the node. If the cloud taint is no
longer on the node then nil is returned from the
RetryOnConflict, as an error does not need to be logged.

The logging for a successful initialization is also
moved inside the RetryOnConflict so that the early nil
return does not cause the aborted initialization to be
logged as a success.

@tghartland tghartland force-pushed the tghartland:node-init-fix branch from bb127c2 to 081ec69 May 10, 2019

@k8s-ci-robot k8s-ci-robot added size/S and removed size/M labels May 10, 2019

@tghartland tghartland changed the title WIP Track nodes already initialized by cloud controller Recheck for cloud taint before initializing nodes in cloud controller May 10, 2019

@andrewsykim

Member

commented May 13, 2019

/approve
/lgtm

Thanks for iterating @tghartland!

@k8s-ci-robot

Contributor

commented May 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, tghartland

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andrewsykim

Member

commented May 13, 2019

Can you update the release note please, maybe something like "fix bug where cloud-controller-manager initializes nodes multiple times"

@tghartland

Contributor Author

commented May 14, 2019

@andrewsykim I've changed the note to what you suggested.

@andrewsykim

Member

commented May 14, 2019

/hold cancel

@fejta-bot


commented May 14, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit ec88ec2 into kubernetes:master May 14, 2019

17 of 20 checks passed

pull-kubernetes-e2e-gce-100-performance Job triggered.
pull-kubernetes-integration Job triggered.
tide Not mergeable. Job pull-kubernetes-e2e-gce-100-performance has not succeeded.
cla/linuxfoundation tghartland authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.