Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix get azure accounts timeout issue when there is no out-bound IP #74191

Merged
merged 2 commits into from Feb 22, 2019

Conversation

@andyzhangx
Copy link
Member

andyzhangx commented Feb 18, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
make getAllStorageAccounts func happen only when there is create unmanaged disk request, for managed disk cluster, it's not necessary to make getAllStorageAccounts func run.

Which issue(s) this PR fixes:

Fixes #74190

Original issue: kubernetes/cloud-provider-azure#86

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix get azure accounts timeout issue when there is no out-bound IP

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

cc @djsly

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 18, 2019

Question, does the call to getAllStorageAccounts needs to be done at instantiation time ?

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 19, 2019

Question, does the call to getAllStorageAccounts needs to be done at instantiation time ?

yes, and it's for unmanaged disk which is corner scenario now, while for compatibility, I still keep it in the code. So is setting timeout as 1 min ok in your scenario? @djsly

@andyzhangx andyzhangx force-pushed the andyzhangx:get-account-timeout branch from 105d50d to 5d4dacd Feb 19, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 19, 2019

I have fixed @feiskyer 's comments, and will wait for @djsly 's reply, to make sure whether this PR is acceptable in the no outbound IP initiallly scenario.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 19, 2019

/hold

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 19, 2019

@andyzhangx if you are saying that there is no choice but to query the list of StorageAccount at initiation time, then I guess it's good.

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 19, 2019

I was looking at the code, and I see that the accounts property of the BlobDiskController isn't used at all in the azure.go. The bloc controller gets created blobController, err := newBlobDiskController(common) but not used until a // create/attach/detach/delete blob based (unmanaged disks) tasks is performed.

type BlobDiskController struct {
	common   *controllerCommon
	accounts map[string]*storageAccountState
}

Couldnt the call to accounts, err := c.getAllStorageAccounts() be done on the first call of create/attach/detach/delete blob, like those public methods could be enhanced with a "isInitialized" and this would allow them to call accounts, err := c.getAllStorageAccounts() if it hasn't been initialized. Preventing the API call all together during the initialization of the Azure Cloud Provider object ?

Food for though...

@andyzhangx andyzhangx force-pushed the andyzhangx:get-account-timeout branch from 5d4dacd to 65350c7 Feb 19, 2019

@andyzhangx andyzhangx changed the title set timeout for get azure account operation fix get azure accounts timeout issue when there is no out-bound IP Feb 19, 2019

@k8s-ci-robot k8s-ci-robot added size/M and removed size/XS labels Feb 19, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 19, 2019

@djsly thanks for your code analysis, you are right:
getAllStorageAccounts func should only happen when there is create unmanaged disk request, for managed disk cluster, it's not necessary to make getAllStorageAccounts func run.

I have changed the code according to your request. PTAL, thanks.

@andyzhangx
Copy link
Member Author

andyzhangx left a comment

Thanks for @djsly 's suggestion, this is an old issue which should be fixed in earlier version. I will cherry pick to earlier version when this PR get merged.
/hold cancel

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 20, 2019

@andyzhangx , perfect! thanks for the cherry picks.

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 20, 2019

result, err := az.VirtualMachineSizesClient.List(context.TODO(), az.Location)
                if err != nil || result.Value == nil {
                        klog.Errorf("failed to list vm sizes in GetVolumeLimits, plugin.host: %s, location: %s", plugin.host.GetHostName(), az.Location)
                        return volumeLimits, nil
                }

Problem here is that the source of the call is in k8s, not azure.go

./pkg/kubelet/nodestatus/setters.go:                    attachLimits, err := volumePlugin.GetVolumeLimits()

Can't we use instanceMetadata for this ?

I'm not sure yet, why the full list of VmSizes in a SubID is required.

// List lists virtual-machine-sizes available in a location for a subscription.
//
// location is the location upon which virtual-machine-sizes is queried.
@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 20, 2019

so I now understand better how this works, Since the instance metadata doesn't provide the MaxDataDiskCount you get the list of all VM instance type possible in the SubID/Location and match the InstanceType of the current VM with the list to extract the MaxDataDiskCount ...

I would try your timeout logic you initially wrote but I can't find the commit anymore, must have been squashed. Do we know if the instance metadata team could augment the metadata with this information ?

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 21, 2019

@djsly use this code for set timeout as 1 min:

ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)

can we merge this PR first? @djsly

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 21, 2019

Sure this is good to go. I guess we will need to open another one for the issue with the

result, err := az.VirtualMachineSizesClient.List(context.TODO(), az.Location)

I will leave that to your discretion to see if you prefer to manage two PRs with multiple cherry picks.

Or if it would be best tackle all those issues within the same PR, since in the end all this work is related to Kubelet not starting in a timely fashion when there isn't any outbound IP

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2019

OK I tested the context.WithTimeout() on the VirtualMachineSizesClient API call and this seem to be the last long blocking call . You can see bellow that the node isn't registered for 15sec, until the first API call timesout.

I0221 23:14:50.774027   22942 kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
belet[22942]: I0221 23:14:50.774066   22942 kubelet_node_status.go:326] Adding node label from cloud provider: beta.kubernetes.io/instance-type=Sta
belet[22942]: I0221 23:14:50.774077   22942 azure_zones.go:69] Availability zone is not enabled for the node, falling back to fault domain
belet[22942]: I0221 23:14:50.774085   22942 kubelet_node_status.go:337] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zo
belet[22942]: I0221 23:14:50.774093   22942 kubelet_node_status.go:341] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/re
belet[22942]: W0221 23:14:50.774139   22942 setters.go:144] replacing cloudprovider-reported hostname of kn-default-3 with overridden hostname of k
...
...
belet[22942]: E0221 23:14:50.827380   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:14:50.927536   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:14:51.027770   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:14:51.127979   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:14:51.228213   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:14:51.328371   22942 kubelet.go:2266] node "kn-default-3" not found
...
...
...
belet[22942]: E0221 23:15:04.780600   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:15:04.880789   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:15:04.980982   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: E0221 23:15:05.066718   22942 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-3, location: eastu
belet[22942]: I0221 23:15:05.066768   22942 setters.go:736] Error getting volume limit for plugin kubernetes.io/aws-ebs
belet[22942]: I0221 23:15:05.066779   22942 setters.go:736] Error getting volume limit for plugin kubernetes.io/gce-pd
belet[22942]: I0221 23:15:05.066800   22942 kubelet_node_status.go:446] Recording NodeHasSufficientMemory event message for node kn-default-3
belet[22942]: I0221 23:15:05.066824   22942 kubelet_node_status.go:446] Recording NodeHasNoDiskPressure event message for node kn-default-3
belet[22942]: I0221 23:15:05.066838   22942 kubelet_node_status.go:446] Recording NodeHasSufficientPID event message for node kn-default-3
belet[22942]: I0221 23:15:05.066866   22942 kubelet_node_status.go:72] Attempting to register node kn-default-3
belet[22942]: I0221 23:15:05.067321   22942 server.go:459] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kn-default-3", UID:"kn-default
belet[22942]: I0221 23:15:05.067343   22942 server.go:459] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kn-default-3", UID:"kn-default
belet[22942]: I0221 23:15:05.067354   22942 server.go:459] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kn-default-3", UID:"kn-default
belet[22942]: E0221 23:15:05.081204   22942 kubelet.go:2266] node "kn-default-3" not found
belet[22942]: I0221 23:15:05.082542   22942 kubelet_node_status.go:75] Successfully registered node kn-default-3
@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2019

also, before my patch, we can see that the calls were failing every 10mins (I'm guessing kubelet will keep retrying) and after I patched, on existing Nodes, the error messages is every ~15sec

Feb 21 19:28:23  kubelet[42192]: E0221 19:28:23.656890   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 19:38:53  kubelet[42192]: E0221 19:38:53.679577   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 19:49:23  kubelet[42192]: E0221 19:49:23.703875   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 19:59:53  kubelet[42192]: E0221 19:59:53.725452   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 20:10:23  kubelet[42192]: E0221 20:10:23.754637   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 20:20:53  kubelet[42192]: E0221 20:20:53.788917   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 20:31:23  kubelet[42192]: E0221 20:31:23.818574   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 20:41:53  kubelet[42192]: E0221 20:41:53.840404   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 20:52:23  kubelet[42192]: E0221 20:52:23.891879   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:02:53  kubelet[42192]: E0221 21:02:53.917138   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:13:23  kubelet[42192]: E0221 21:13:23.937846   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:23:53  kubelet[42192]: E0221 21:23:53.957359   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:34:23  kubelet[42192]: E0221 21:34:23.981998   42192 azure_dd.go:165] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2

Feb 21 21:38:07  kubelet[88761]: E0221 21:38:07.171911   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:38:22  kubelet[88761]: E0221 21:38:22.191122   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:39:22  kubelet[88761]: E0221 21:39:22.191373   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:40:37  kubelet[88761]: E0221 21:40:37.212780   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:41:52  kubelet[88761]: E0221 21:41:52.234522   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:43:07  kubelet[88761]: E0221 21:43:07.254709   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:44:22  kubelet[88761]: E0221 21:44:22.276533   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:45:37  kubelet[88761]: E0221 21:45:37.300389   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:46:52  kubelet[88761]: E0221 21:46:52.319842   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:48:07  kubelet[88761]: E0221 21:48:07.348183   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:49:22  kubelet[88761]: E0221 21:49:22.367277   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:50:37  kubelet[88761]: E0221 21:50:37.392020   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:51:52  kubelet[88761]: E0221 21:51:52.410877   88761 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:52:29  kubelet[97239]: E0221 21:52:29.759218   97239 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:52:44  kubelet[97239]: E0221 21:52:44.785222   97239 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:52:59  kubelet[97239]: E0221 21:52:59.802677   97239 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
Feb 21 21:53:59  kubelet[97239]: E0221 21:53:59.802780   97239 azure_dd.go:168] failed to list vm sizes in GetVolumeLimits, plugin.host: kn-default-1, location: eastus2
@feiskyer

This comment has been minimized.

Copy link
Member

feiskyer commented Feb 22, 2019

@djsly Thanks for validating this.

@andyzhangx Could you fix the issue within same PR? They are actually same issue, but just different APIs. And we need to cherry pick the changes to stable releases.

@feiskyer feiskyer added this to In progress in Cloud Provider Azure Feb 22, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 22, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx

This comment has been minimized.

Copy link
Member Author

andyzhangx commented Feb 22, 2019

@feiskyer @djsly pls review commit: 28a9aa8

accounts, err := c.getAllStorageAccounts()
if err != nil {
klog.Errorf("azureDisk - getAllStorageAccounts error: %v", err)
c.accounts = make(map[string]*storageAccountState)

This comment has been minimized.

@feiskyer

feiskyer Feb 22, 2019

Member

Keep c.accouts as nil here? or else, it won't try again on errors.

This comment has been minimized.

@andyzhangx

andyzhangx Feb 22, 2019

Author Member

@feiskyer yes, won't try again on error, it would make new map and create new storage account(don't use existing accounts)

@@ -164,7 +165,9 @@ func (plugin *azureDataDiskPlugin) GetVolumeLimits() (map[string]int64, error) {
}

if vmSizeList == nil {
result, err := az.VirtualMachineSizesClient.List(context.TODO(), az.Location)
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()

This comment has been minimized.

@feiskyer

feiskyer Feb 22, 2019

Member

Should we also add same timeout to getAllStorageAccounts?

This comment has been minimized.

@andyzhangx

andyzhangx Feb 22, 2019

Author Member

@feiskyer fixed.

@@ -164,7 +165,9 @@ func (plugin *azureDataDiskPlugin) GetVolumeLimits() (map[string]int64, error) {
}

if vmSizeList == nil {
result, err := az.VirtualMachineSizesClient.List(context.TODO(), az.Location)
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)

This comment has been minimized.

@feiskyer

feiskyer Feb 22, 2019

Member

nit: define a consts for timeout, e.g. listAPITimeout = 60*time.Second

@feiskyer

This comment has been minimized.

Copy link
Member

feiskyer commented Feb 22, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Feb 22, 2019

add timeout in GetVolumeLimits operation
add timeout for getAllStorageAccounts

@andyzhangx andyzhangx force-pushed the andyzhangx:get-account-timeout branch from 28a9aa8 to 8cd09bb Feb 22, 2019

@feiskyer
Copy link
Member

feiskyer left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 22, 2019

@k8s-ci-robot k8s-ci-robot merged commit 8d6f20e into kubernetes:master Feb 22, 2019

16 checks passed

cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped
tide In merge pool.
Details

Cloud Provider Azure automation moved this from In progress to Done Feb 22, 2019

@djsly

This comment has been minimized.

Copy link
Contributor

djsly commented Feb 22, 2019

Thanks guys!

k8s-ci-robot added a commit that referenced this pull request Feb 27, 2019

Merge pull request #74486 from andyzhangx/automated-cherry-pick-of-#7…
…4191-upstream-release-1.13

Automated cherry pick of #74191: remove get azure accounts in the init process set timeout

k8s-ci-robot added a commit that referenced this pull request Mar 5, 2019

Merge pull request #74490 from andyzhangx/automated-cherry-pick-of-#7…
…4191-upstream-release-1.11

Automated cherry pick of #74191: remove get azure accounts in the init process set timeout

k8s-ci-robot added a commit that referenced this pull request Mar 7, 2019

Merge pull request #74489 from andyzhangx/automated-cherry-pick-of-#7…
…4191-upstream-release-1.12

Automated cherry pick of #74191: remove get azure accounts in the init process set timeout
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.