Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodeInfoManager should use informer to cache CSINodeInfo objects #82556

Open
wants to merge 1 commit into
base: master
from

Conversation

@tedyu
Copy link
Contributor

commented Sep 10, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #70773

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 10, 2019

@k8s-ci-robot k8s-ci-robot requested review from feiskyer and jsafrane Sep 10, 2019

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from e35edfc to c1f5232 Sep 10, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

/test pull-kubernetes-e2e-gce-storage-slow

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

/test pull-kubernetes-e2e-gce

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

/test pull-kubernetes-e2e-gce-100-performance

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from c1f5232 to 36963c2 Sep 11, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

From https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1171880436243107840/ :

W0911 20:26:21.842] Copying file:///tmp/bazel-gcs.SRf6n5/kubernetes.tar.gz [Content-Type=application/x-tar]...
W0911 20:26:21.901] \ [70/160 files][ 77.4 MiB/  2.9 GiB]   2% Done
...
W0911 20:39:36.411] - [159/160 files][  2.9 GiB/  2.9 GiB]  99% Done     0.0 B/s                    ^M\^MINFO 0911 20:39:36.410741 retry_util.py] Retrying request, attempt #22...
W0911 20:40:18.540] \ [159/160 files][  2.9 GiB/  2.9 GiB]  99% Done     0.0 B/s                    ^M|^MServiceException: 503 Backend Error
W0911 20:40:18.585] CommandException: 1 files/objects could not be copied/removed.
@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 11, 2019

/test pull-kubernetes-e2e-gce

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 36963c2 to 7038950 Sep 11, 2019

pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/volume/csi/nodeinfomanager/nodeinfomanager.go Outdated Show resolved Hide resolved

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 7038950 to 9ffacbd Sep 11, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 12, 2019

From https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1171932156784545793/build-log.txt :

I0912 00:02:35.427] Sep 12 00:00:21.024: INFO: security-context-60f836f1-ac22-4918-b30d-38a1ce52dcf2 started at 2019-09-11 23:58:17 +0000 UTC (0+1 container statuses recorded)
...
I0912 00:02:35.458] Sep 12 00:02:35.375: INFO: namespace persistent-local-volumes-test-9981 deletion completed in 13.545889768s
...
I0912 00:02:35.460]       ^[[91mSep 12 00:00:17.897: Unexpected error:
I0912 00:02:35.460]           <*errors.errorString | 0xc000d2f600>: {
I0912 00:02:35.460]               s: "pod \"security-context-60f836f1-ac22-4918-b30d-38a1ce52dcf2\" is not Running: timed out waiting for the condition",
I0912 00:02:35.460]           }
I0912 00:02:35.460]           pod "security-context-60f836f1-ac22-4918-b30d-38a1ce52dcf2" is not Running: timed out waiting for the condition

Maybe the pod not running was related to deletion of the namespace ?

+               klog.Infof("informer returned %v err: %v", nodeInfo, err)

Didn't find the above log in e2e test output

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 9ffacbd to 033a35d Sep 12, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 12, 2019

From e2e-gce test output:

I0912 04:24:27.767] Sep 12 04:22:21.227: INFO: Pod pod-secrets-022b2d75-4972-455e-b066-fbe44c68f5b5 no longer exists
I0912 04:24:27.767] Sep 12 04:22:21.227: FAIL: Unexpected error:
I0912 04:24:27.767]     <*errors.errorString | 0xc000d90450>: {
I0912 04:24:27.767]         s: "expected pod \"pod-secrets-022b2d75-4972-455e-b066-fbe44c68f5b5\" success: Gave up after waiting 5m0s for pod \"pod-secrets-022b2d75-4972-455e-b066-fbe44c68f5b5\" to be \"success or failure\"",
I0912 04:24:27.767]     }
I0912 04:24:27.768]     expected pod "pod-secrets-022b2d75-4972-455e-b066-fbe44c68f5b5" success: Gave up after waiting 5m0s for pod "pod-secrets-022b2d75-4972-455e-b066-fbe44c68f5b5" to be "success or failure"

Wondering why the success prefix wasn't recognized.

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 033a35d to a8f7001 Sep 12, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 12, 2019

Checked e2e-gce test output, searching for info logs added in the PR.
I didn't find any of them.

@msau42

This comment has been minimized.

Copy link
Member

commented Sep 12, 2019

You might need to explicitly set the log level to 4

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from bd23723 to 1b9fa60 Sep 13, 2019

pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/kubelet/volume_host.go Show resolved Hide resolved
pkg/kubelet/volume_host.go Show resolved Hide resolved
pkg/volume/csi/csi_plugin.go Show resolved Hide resolved
@@ -228,7 +230,7 @@ func (p *csiPlugin) Init(host volume.VolumeHost) error {
}

// Initializing the label management channels
nim = nodeinfomanager.NewNodeInfoManager(host.GetNodeName(), host, migratedPlugins)
nim = nodeinfomanager.NewNodeInfoManager(host.GetNodeName(), host, migratedPlugins, nodeLister)

This comment has been minimized.

Copy link
@msau42

msau42 Sep 13, 2019

Member

I think we should not initialize nodeinfomanager at all if the informer is nil. Also nim appears to be unused?

This comment has been minimized.

Copy link
@tedyu

tedyu Sep 13, 2019

Author Contributor

So:
If CSINodeInfo feature gate is on, but we cannot get nodeLister, return an error.

The nim is global variable.

This comment has been minimized.

Copy link
@msau42

msau42 Sep 17, 2019

Member

Sorry my first comment was not correct. We still need to initialize node info manager even if the CSINodeInfo feature gate is off, because nim still needs to manage Node objects.

Instead of passing in nodeLister as a new variable, can we just get it from the KubeletVolumeHost inside?

pkg/volume/csi/nodeinfomanager/nodeinfomanager.go Outdated Show resolved Hide resolved

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch 2 times, most recently from 7904d4d to ccfae25 Sep 13, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2019

From a successful e2e-gce test output:

I0913 23:42:42.840] ^[[0;33mWaiting for 4 ready nodes. 0 ready nodes, 0 registered. Retrying.^[[0m
I0913 23:42:58.088] ^[[0;33mWaiting for 4 ready nodes. 3 ready nodes, 3 registered. Retrying.^[[0m
I0913 23:43:13.403] Found 4 node(s).

For the above e2e-gce test output (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1172693962192850944/):

I0914 02:51:09.905] ^[[0;33mWaiting for 4 ready nodes. 0 ready nodes, 0 registered. Retrying.^[[0m
W0914 02:51:25.193] No resources found in default namespace.
I0914 02:51:25.294] ^[[0;33mDetected 0 ready nodes, found 0 nodes out of expected 4. Your cluster may not be fully functional.^[[0m

I wonder how to retrieve log from one of the nodes.
Edit: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1173316427403235328/artifacts/e2e-d742743543-674b9-master/

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from ccfae25 to 0b0f551 Sep 14, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2019

I0914 14:37:56.393]       ^[[91mSep 14 14:33:47.497: Error getting Kubelet e2e-ce5e83be40-674b9-minion-group-gb8x metrics: Timed out when waiting for proxy to gather metrics from e2e-ce5e83be40-674b9-minion-group-gb8x

It seems metrics collection timeout was related to test failure.

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch 5 times, most recently from 50e7da7 to 5f791e7 Sep 14, 2019

@k8s-ci-robot k8s-ci-robot added size/L sig/apps and removed size/M labels Sep 15, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 16, 2019

From https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1173316427403235328/artifacts/e2e-d742743543-674b9-master/kube-controller-manager.log

E0915 19:39:44.662973       1 csi_plugin.go:227] cannot get node lister from &{<nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> 0xc0005397c0 0xc000b364c0 0xc0005b7b00 {{1 0} map[kubernetes.io/aws-ebs:0xc000d37dc0 kubernetes.io/azure-disk:0xc000d37dd0 kubernetes.io/azure-file:0xc000d37de0 kubernetes.io/cinder:0xc000dbbf20 kubernetes.io/flocker:0xc000d37e10 kubernetes.io/gce-pd:0xc000d37df0 kubernetes.io/glusterfs:0xc000dbbf00 kubernetes.io/host-path:0xc00048c780 kubernetes.io/local-volume:0xc000ac9710 kubernetes.io/nfs:0xc00048c9b0 kubernetes.io/portworx-volume:0xc000dbbf40 kubernetes.io/quobyte:0xc000d37db0 kubernetes.io/rbd:0xc000d37da0 kubernetes.io/scaleio:0xc000dbbf60 kubernetes.io/storageos:0xc000d37e20 kubernetes.io/vsphere-volume:0xc000d37e00] 0x77403b0 map[] 0xc0001b3a00} true e2e-d742743543-674b9 15000000000 {0xc000af0100} 0xc000af0120 0xc000085aa0 0xc000085b60 0xc000ac9d70 <nil> 5 10000000000 <nil> {{{0 0} {<nil>} map[] 0}}}

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 5f791e7 to 3f30bea Sep 16, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 16, 2019

In csiPlugin#Init, I added check for volume.AttachDetachVolumeHost in order to extract CSINodeLister.

From https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1173421743792984064/artifacts/e2e-0d6b2075a6-674b9-master/kube-controller-manager.log :

E0916 02:36:09.209372       1 csi_plugin.go:231] cannot get node lister from &{<nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> <nil> 0xc000c2c280 0xc000df6f80 0xc00058f440 {{1 0} map[kubernetes.io/aws-ebs:0xc000fab5c0 kubernetes.io/azure-disk:0xc000fab5d0 kubernetes.io/azure-file:0xc000fab5e0 kubernetes.io/cinder:0xc0007e6920 kubernetes.io/flocker:0xc000fab610 kubernetes.io/gce-pd:0xc000fab5f0 kubernetes.io/glusterfs:0xc0007e6900 kubernetes.io/host-path:0xc000c17bd0 kubernetes.io/local-volume:0xc00079ab40 kubernetes.io/nfs:0xc000c17e50 kubernetes.io/portworx-volume:0xc0007e6940 kubernetes.io/quobyte:0xc000fab5b0 kubernetes.io/rbd:0xc000fab5a0 kubernetes.io/scaleio:0xc0007e6960 kubernetes.io/storageos:0xc000fab620 kubernetes.io/vsphere-volume:0xc000fab600] 0x77403b0 map[] 0xc000201520} true e2e-0d6b2075a6-674b9 15000000000 {0xc0007e6b00} 0xc0007e6b20 0xc00096fbc0 0xc00096fc20 0xc00079b1a0 <nil> 5 10000000000 <nil> {{{0 0} {<nil>} map[] 0}}}

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 3f30bea to 9e848c2 Sep 16, 2019

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 16, 2019

We currently have :
AttachDetachVolumeHost
KubeletVolumeHost
fakeVolumeHost

Looks like obtaining CSINodeLister would be smoother if we create an interface which is embedded into the above 3 Hosts

pkg/controller/daemon/daemon_controller.go Outdated Show resolved Hide resolved
} else {
klog.Warning("kubeClient is nil. Skip initialization of CSIDriverLister")
}
}
if utilfeature.DefaultFeatureGate.Enabled(features.CSINodeInfo) {
if kubelet.kubeClient != nil {
informerFactory = informers.NewSharedInformerFactory(kubelet.kubeClient, resyncPeriod)

This comment has been minimized.

Copy link
@msau42

msau42 Sep 17, 2019

Member

Can we have both csiDriver and csiNode share the same informer factory?

pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/kubelet/volume_host.go Outdated Show resolved Hide resolved
pkg/volume/csi/csi_plugin.go Outdated Show resolved Hide resolved
@@ -228,7 +230,7 @@ func (p *csiPlugin) Init(host volume.VolumeHost) error {
}

// Initializing the label management channels
nim = nodeinfomanager.NewNodeInfoManager(host.GetNodeName(), host, migratedPlugins)
nim = nodeinfomanager.NewNodeInfoManager(host.GetNodeName(), host, migratedPlugins, nodeLister)

This comment has been minimized.

Copy link
@msau42

msau42 Sep 17, 2019

Member

Sorry my first comment was not correct. We still need to initialize node info manager even if the CSINodeInfo feature gate is off, because nim still needs to manage Node objects.

Instead of passing in nodeLister as a new variable, can we just get it from the KubeletVolumeHost inside?

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 17, 2019

The presence of 'cannot get node lister from' log meant that the host parameter is neither AttachDetachVolumeHost nor KubeletVolumeHost.

@msau42

This comment has been minimized.

Copy link
Member

commented Sep 17, 2019

The presence of 'cannot get node lister from' log meant that the host parameter is neither AttachDetachVolumeHost nor KubeletVolumeHost.

csi Init() is called by both master controllers and kubelet. We only want to initialize NodeInfoManager when we're running in kubelet.

@yutedz yutedz force-pushed the yutedz:vol-host-informer branch from 9e848c2 to 050bdcd Sep 17, 2019

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Sep 17, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Sep 17, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tedyu
To complete the pull request process, please assign thockin
You can assign the PR to them by writing /assign @thockin in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tedyu

This comment has been minimized.

Copy link
Contributor Author

commented Sep 17, 2019

From https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/82556/pull-kubernetes-e2e-gce/1173768535843803137/artifacts/e2e-8ccbbf85b1-674b9-minion-group-4bz2/kubelet.go

E0917 02:42:47.904187    1378 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:node:e2e-8ccbbf85b1-674b9-minion-group-4bz2" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only get, create, update, patch, or delete a CSINode

Maybe this was related to test failure.

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Sep 17, 2019

@tedyu: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-bazel-test 050bdcd link /test pull-kubernetes-bazel-test
pull-kubernetes-verify 050bdcd link /test pull-kubernetes-verify
pull-kubernetes-e2e-gce 050bdcd link /test pull-kubernetes-e2e-gce
pull-kubernetes-e2e-gce-storage-slow 050bdcd link /test pull-kubernetes-e2e-gce-storage-slow
pull-kubernetes-e2e-gce-csi-serial 050bdcd link /test pull-kubernetes-e2e-gce-csi-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.