
Fix retry issues when the nodes are under deleting on Azure #80419

Merged
merged 2 commits into kubernetes:master from feiskyer:vmss-fix on Jul 24, 2019

Conversation

@feiskyer (Member) commented Jul 22, 2019

What type of PR is this?


/kind bug
/priority critical-urgent
/sig azure
/milestone v1.16

What this PR does / why we need it:

Fix retry issues when nodes are being deleted on Azure. For example, when a node is being deleted, the update of its network interface for the LB backend pool should be canceled and not retried.
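To illustrate the intended behavior, here is a minimal sketch (hypothetical helper names, not the exact code in this PR) of a backoff loop that stops retrying as soon as the error indicates the scale set VM is no longer active, i.e. the node is being deleted:

```go
package azure

import (
	"strings"

	"k8s.io/apimachinery/pkg/util/wait"
)

// vmssVMNotActiveErrorMessage is the substring Azure returns when the target
// scale set VM has been deleted and is no longer an active instance.
const vmssVMNotActiveErrorMessage = "not an active Virtual Machine Scale Set VM instanceId"

// updateNICWithBackoff retries a NIC update with exponential backoff, but
// gives up immediately when the node behind the NIC is being deleted.
func updateNICWithBackoff(backoff wait.Backoff, update func() error) error {
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		err := update()
		if err == nil {
			return true, nil // update succeeded, stop retrying
		}
		if strings.Contains(err.Error(), vmssVMNotActiveErrorMessage) {
			// The VM is being deleted; further retries cannot succeed, so
			// return the error and let the caller give up.
			return true, err
		}
		return false, nil // transient failure, retry after backoff
	})
}
```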

It also reports a NodeNotInitialized error when providerID is an empty string, which can reduce Azure API calls when cloud-controller-manager is used.
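The empty-providerID case can be sketched as follows (illustrative helper, not the PR's exact code): return an error before making any Azure API call when cloud-controller-manager has not initialized the node yet.

```go
package azure

import (
	"fmt"
	"strings"
)

// instanceIDByProviderID is a hypothetical helper showing the early return:
// an empty providerID means cloud-controller-manager has not initialized the
// node yet, so there is no point in querying Azure for it.
func instanceIDByProviderID(providerID string) (string, error) {
	if providerID == "" {
		return "", fmt.Errorf("node is not initialized yet: providerID is empty")
	}
	// Normally the Azure resource ID would be parsed out of providerID here.
	return strings.TrimPrefix(providerID, "azure://"), nil
}
```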

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Should be cherry-picked to older releases.

Does this PR introduce a user-facing change?:

Fix retry issues when nodes are being deleted on Azure

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Report NodeNotInitialized error when providerID is an empty string
This can happen when cloud-controller-manager is used.
@k8s-ci-robot (Contributor) commented Jul 22, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@k8s-ci-robot k8s-ci-robot requested a review from andyzhangx Jul 22, 2019

Resolved review threads (now outdated):
staging/src/k8s.io/legacy-cloud-providers/azure/azure_instances.go
staging/src/k8s.io/legacy-cloud-providers/azure/azure_backoff.go
@@ -32,6 +33,10 @@ import (
"k8s.io/klog"
)

const (
vmssVMNotActiveErrorMessage = "not an active Virtual Machine Scale Set VM instanceId"

@andyzhangx (Member) Jul 22, 2019

What do you mean by "not active"? Does it mean the VM no longer exists?

@andyzhangx (Member) Jul 22, 2019

It looks like this error message is returned by the Azure API; consider adding a comment explaining what "not active" means.

@andyzhangx (Member) Jul 22, 2019

And is this only for VMSS? What about VMAS?

@feiskyer (Member, Author) Jul 22, 2019

Yep, the error message comes from the VMSS API, so it is VMSS-only.

@feiskyer (Member, Author) Jul 22, 2019

Will add a comment for this.
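For illustration, the requested comment could read roughly like this (wording is a guess at the intent, not necessarily what was merged):

```go
// vmssVMNotActiveErrorMessage is the error message the Azure VMSS API returns
// when the requested instance has already been removed from the scale set,
// i.e. the VM is being deleted and is no longer active.
const vmssVMNotActiveErrorMessage = "not an active Virtual Machine Scale Set VM instanceId"
```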

klog.Errorf("GetPrivateIPsByNodeName(%s): backoff failure, will retry,err=%v", nodeName, retryErr)
return false, nil
}
- klog.V(2).Infof("GetPrivateIPsByNodeName(%s): backoff success", nodeName)
+ klog.V(3).Infof("GetPrivateIPsByNodeName(%s): backoff success", nodeName)

@andyzhangx (Member) Jul 22, 2019

Any reason for changing the log level?

@feiskyer (Member, Author) Jul 22, 2019

V(2) would produce too much noise for large clusters. Changing to V(3) to reduce it.

@feiskyer feiskyer force-pushed the feiskyer:vmss-fix branch from 241eacd to 2a62bc7 Jul 22, 2019

@feiskyer feiskyer added this to In progress in SIG Azure via automation Jul 22, 2019

@feiskyer (Member, Author) commented Jul 23, 2019

/retest


@andyzhangx (Member) reviewed and left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 24, 2019

@k8s-ci-robot k8s-ci-robot merged commit c08a88a into kubernetes:master Jul 24, 2019

23 checks passed

cla/linuxfoundation feiskyer authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

SIG Azure automation moved this from In progress to Done Jul 24, 2019

@feiskyer feiskyer deleted the feiskyer:vmss-fix branch Jul 24, 2019
