
NodeLifecycleController - MarkPodsNotReady retry fix #84445

Merged
merged 2 commits into kubernetes:master from krzysied:node_controller_retry_fix on Nov 4, 2019

Conversation

@krzysied
Contributor

krzysied commented Oct 28, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes MarkPodsNotReady retry logic.
Previously there was no retry; now, if MarkPodsNotReady fails, it will be retried in the next monitorNodeHealth pass.
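
To make the intent concrete, here is a rough sketch of the retry bookkeeping (it uses the nodesToRetry map and markPodsNotReady helper that appear in the diff below; it is not the exact merged code):

// Illustrative only: if marking pods NotReady fails, remember the node so the
// next monitorNodeHealth pass tries again; on success, clear the bookkeeping.
if err := nc.markPodsNotReady(node.Name); err != nil {
  klog.Errorf("Unable to mark pods NotReady on node %v: %v; queuing for retry", node.Name, err)
  nc.nodesToRetry.Store(node.Name, struct{}{})
} else {
  nc.nodesToRetry.Delete(node.Name)
}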

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@krzysied

Contributor Author

krzysied commented Oct 28, 2019

/assign @wojtek-t

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Oct 30, 2019
@krzysied krzysied force-pushed the krzysied:node_controller_retry_fix branch 4 times, most recently from 252d855 to f59bed2 Oct 30, 2019
updateReactor: func() func(action testcore.Action) (bool, runtime.Object, error) {
i := 0
return func(action testcore.Action) (bool, runtime.Object, error) {
if action.GetVerb() != "update" || action.GetResource().Resource != "pods" || action.GetSubresource() != "status" {

@wojtek-t

wojtek-t Oct 31, 2019

Member

I don't like it - let's be explicit:
if this is "list pods", then return what you return now;

if it's "update pod status", run the logic below;

otherwise, return an error to make it very explicit what should happen.

@krzysied

krzysied Oct 31, 2019

Author Contributor

Not sure I understand that...
I inverted the if statement and added an error for unsupported actions. Is this what you suggested?
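
For illustration, here is a self-contained version of that "explicit actions" reactor pattern, built on client-go's fake clientset and PrependReactor; the test itself uses its own fakeNodeHandler/updateReactor, so treat this only as a sketch of the shape:

package main

import (
  "fmt"

  "k8s.io/apimachinery/pkg/runtime"
  "k8s.io/client-go/kubernetes/fake"
  k8stesting "k8s.io/client-go/testing"
)

func main() {
  client := fake.NewSimpleClientset()
  failedOnce := false
  client.PrependReactor("*", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) {
    switch {
    case action.GetVerb() == "list" && action.GetSubresource() == "":
      // Let the default object tracker serve pod lists.
      return false, nil, nil
    case action.GetVerb() == "update" && action.GetSubresource() == "status":
      // Fail the first status update so the caller's retry path is exercised.
      if !failedOnce {
        failedOnce = true
        return true, nil, fmt.Errorf("fake error")
      }
      return false, nil, nil
    default:
      // Any other action on pods is unexpected: fail loudly.
      return true, nil, fmt.Errorf("unsupported action: %s %s", action.GetVerb(), action.GetResource().Resource)
    }
  })
  fmt.Println("fake clientset configured")
}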

@krzysied krzysied force-pushed the krzysied:node_controller_retry_fix branch from f59bed2 to 8330d8b Oct 31, 2019
if (currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue) ||
(needsRetry && observedReadyCondition.Status != v1.ConditionTrue) {
// Report node event only once when status changed.
if !needsRetry {

@wojtek-t

wojtek-t Oct 31, 2019

Member

Those conditions became fairly complicated...

How about factoring this out to a separate function:

func (...) markPodsNotReady() {
  pods, err := listPodsFromNode(nc.kubeClient, node.Name)
  ...

And here doing:

if currentReadyCondition.. && observed... {
  // report node event
  nc.markPodsNotReady()
} else if needsRetry && observed... {
  nc.markPodsNotReady()
}

@krzysied

krzysied Oct 31, 2019

Author Contributor

I don't want to log the error or handle nodesToRetry in markPodsNotReady, but rather in the parent method.
I went with switch/case. Is fallthrough acceptable, or do we try not to use it in the code?
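
For reference, the switch/fallthrough shape under discussion would look roughly like this (nc.markPodsNotReady and nodesToRetry follow the diff; the event-recording call is illustrative; this is a sketch, not the exact merged code):

switch {
case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
  // Report the node event only once, on the Ready -> not-Ready transition...
  nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
  fallthrough
case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
  // ...but mark pods NotReady both on the transition and on a retry.
  if err := nc.markPodsNotReady(node.Name); err != nil {
    klog.Errorf("Unable to mark pods NotReady on node %v: %v; queuing for retry", node.Name, err)
    nc.nodesToRetry.Store(node.Name, struct{}{})
    break
  }
  nc.nodesToRetry.Delete(node.Name)
}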

@krzysied krzysied force-pushed the krzysied:node_controller_retry_fix branch from 8330d8b to b0bce1f Oct 31, 2019
@krzysied

Contributor Author

krzysied commented Oct 31, 2019

/test pull-kubernetes-e2e-gce

@@ -236,6 +236,8 @@ type Controller struct {
// workers that are responsible for tainting nodes.
zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

nodesToRetry *sync.Map

@wojtek-t

wojtek-t Oct 31, 2019

Member

why pointer?
Why not just sync.Map?

@krzysied

krzysied Oct 31, 2019

Author Contributor

IMHO it doesn't matter. Pointer removed.
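
For reference, the zero value of sync.Map is ready to use, so holding it by value in the struct works fine; a minimal standalone example:

package main

import (
  "fmt"
  "sync"
)

type controller struct {
  // Zero value is ready to use; no pointer or initialization needed.
  nodesToRetry sync.Map
}

func main() {
  var c controller
  c.nodesToRetry.Store("node-1", struct{}{})
  if _, needsRetry := c.nodesToRetry.Load("node-1"); needsRetry {
    fmt.Println("node-1 queued for retry")
  }
  c.nodesToRetry.Delete("node-1")
}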

if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
_, needsRetry := nc.nodesToRetry.Load(node.Name)
// We call markPodsNotReadyFunc when node condition changed from true to something different
// or previous markPodsNotReadyFunc failed and ready status is still not true.

@wojtek-t

wojtek-t Oct 31, 2019

Member

Remove this comment - it describes what the code is doing without any additional context on why/when/etc.

@krzysied

krzysied Oct 31, 2019

Author Contributor

Removed.

func (nc *Controller) markPodsNotReady(nodeName string) error {
pods, err := listPodsFromNode(nc.kubeClient, nodeName)
if err != nil {
return fmt.Errorf("Unable to list pods from node %v: %v", nodeName, err)

@wojtek-t

wojtek-t Oct 31, 2019

Member

unable (error strings start with a lowercase letter)

@krzysied

krzysied Oct 31, 2019

Author Contributor

Fixed.
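
The change is just the message casing, per the Go convention that error strings are not capitalized:

return fmt.Errorf("unable to list pods from node %v: %v", nodeName, err)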

}
}
fakeNow := metav1.Date(2015, 1, 1, 12, 0, 0, 0, time.UTC)
timeNow := metav1.Date(2015, 1, 1, 12, 0, 0, 0, time.UTC)

@wojtek-t

wojtek-t Oct 31, 2019

Member

what's the point in having both timeNow and fakeNow?

i++
switch i {
case 1:
return true, nil, fmt.Errorf("radom error")

@wojtek-t

wojtek-t Oct 31, 2019

Member

s/radom/fake/

@krzysied

krzysied Oct 31, 2019

Author Contributor

Done.

}

for i, item := range table {
nodeController, _ := newNodeLifecycleControllerFromClient(

@wojtek-t

wojtek-t Oct 31, 2019

Member

please add a "desc" field to the table, and wrap this into:

t.Run(item.desc, func(t *testing.T) {
})

@krzysied

krzysied Oct 31, 2019

Author Contributor

Done.
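
A minimal, self-contained sketch of the desc + t.Run pattern (the fields and cases here are illustrative, not the PR's actual test table):

package node

import "testing"

func TestMarkPodsNotReadyRetry(t *testing.T) {
  table := []struct {
    desc             string
    failedPodUpdates int
  }{
    {desc: "pod status update succeeds on first try", failedPodUpdates: 0},
    {desc: "pod status update fails once and is retried next pass", failedPodUpdates: 1},
  }
  for _, item := range table {
    t.Run(item.desc, func(t *testing.T) {
      // Set up the fake node handler with item.failedPodUpdates injected
      // failures, run monitorNodeHealth, and assert on pod readiness here.
    })
  }
}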

}
for _, iteration := range item.nodeIterations {
nodeController.now = func() metav1.Time { return metav1.Time{Time: fakeNow.Add(iteration.timeToPass)} }
item.fakeNodeHandler.Existing[0].Status = itertion.newNodeStatus

@wojtek-t

wojtek-t Oct 31, 2019

Member

This is really hacky. Why, instead of passing the node status in each iteration, don't we simply pass a list of Nodes and assign it here to fakeNodeHandler?

BTW - then you should change makeNodeStatus into simply a makeNode function.

@krzysied

krzysied Oct 31, 2019

Author Contributor

Done.
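
For illustration, a makeNode-style helper could look like this (the signature is hypothetical; the PR's actual helper may differ):

package node

import (
  v1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeNode builds a Node with a single Ready condition, so each test iteration
// can supply a complete Node instead of mutating the fake handler's status.
func makeNode(name string, ready v1.ConditionStatus, heartbeat, transition metav1.Time) *v1.Node {
  return &v1.Node{
    ObjectMeta: metav1.ObjectMeta{Name: name},
    Status: v1.NodeStatus{
      Conditions: []v1.NodeCondition{
        {
          Type:               v1.NodeReady,
          Status:             ready,
          LastHeartbeatTime:  heartbeat,
          LastTransitionTime: transition,
        },
      },
    },
  }
}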

t.Errorf("unexpected error: %v", err)
}
if err := nodeController.monitorNodeHealth(); err != nil {
t.Errorf("Case[%d] unexpected error: %v", i, err)

@wojtek-t

wojtek-t Oct 31, 2019

Member

Why do some cases include Case[%d] and others don't?

BTW - once you wrap this in t.Run() you will no longer need it.

@krzysied

krzysied Oct 31, 2019

Author Contributor

Done.

if err := nodeController.syncNodeStore(item.fakeNodeHandler); err != nil {
t.Errorf("unexpected error: %v", err)
}
if err := nodeController.monitorNodeHealth(); err != nil {

@wojtek-t

wojtek-t Oct 31, 2019

Member

I think I'm missing one thing - in all cases you're starting with the condition set to true. So what's the point of this call at all?
[Same question for the syncNodeStore above]

@krzysied

krzysied Oct 31, 2019

Author Contributor

We need this additional setup to initialize the controller's inner structures - we want to simulate a node not being updated after its status was previously recorded as Ready.
BTW, I removed this code. The initialization data was moved into nodeIterations as the first entry.
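
The reworked loop then has roughly this shape (a sketch assuming each iteration carries a full node built by makeNode, with the first, Ready iteration only seeding the controller's internal state):

for _, iteration := range item.nodeIterations {
  nodeController.now = func() metav1.Time { return metav1.Time{Time: fakeNow.Add(iteration.timeToPass)} }
  item.fakeNodeHandler.Existing = []*v1.Node{iteration.node}
  if err := nodeController.syncNodeStore(item.fakeNodeHandler); err != nil {
    t.Errorf("unexpected error: %v", err)
  }
  if err := nodeController.monitorNodeHealth(); err != nil {
    t.Errorf("unexpected error: %v", err)
  }
}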

@krzysied krzysied force-pushed the krzysied:node_controller_retry_fix branch from b0bce1f to 3a82f50 Oct 31, 2019
@wojtek-t wojtek-t removed the needs-priority label Nov 3, 2019
@wojtek-t

Member

wojtek-t commented Nov 3, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 3, 2019
@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysied, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot


fejta-bot commented Nov 3, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot


fejta-bot commented Nov 4, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 5e33f3d into kubernetes:master Nov 4, 2019
15 checks passed
cla/linuxfoundation krzysied authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 4, 2019
@krzysied krzysied deleted the krzysied:node_controller_retry_fix branch Nov 4, 2019
k8s-ci-robot added a commit that referenced this pull request Mar 10, 2020
…82884-#83248-#83320-#83780-#84445-#81167-upstream-release-1.16

Automated cherry pick of fix for #77733 (NodeLifecycleController is overloading kube-apiserver) into release-1.16
k8s-ci-robot added a commit that referenced this pull request Mar 11, 2020
…81839-#82489-#82884-#83248-#83320-#83780-#84445-#81167-upstream-release-1.15

Automated cherry pick of fix for #77733 (NodeLifecycleController is overloading kube-apiserver) into release-1.15