Serial suite is timing out #70810

Closed
msau42 opened this Issue Nov 8, 2018 · 26 comments

Comments

@msau42
Member

msau42 commented Nov 8, 2018

Which jobs are failing:
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial

Which test(s) are failing:
should unmount if pod is force deleted while kubelet is down [Disruptive][Slow]

Since when has it been failing:
11/07/18 15:02

Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial

Reason for failure:
The test case stops and starts the kubelet, and it seems the kubelet is unable to start again.

Anything else we need to know:

/kind failing-test

@AishSundar

Contributor

AishSundar commented Nov 8, 2018

Thanks for opening this @msau42

Looking at commits from yesterday, it looks like PR #70713 could be the cause?

/assign @deads2k

@deads2k can you please look into this as a priority? This is failing critical release-blocking tests.

/cc @jberkus @maria @mortent

/sig cli
/priority critical-urgent
/milestone v1.13

@msau42

Member

msau42 commented Nov 8, 2018

I1107 23:30:16.131] STEP: Starting the kubelet and waiting for pod to delete.
I1107 23:30:16.135] Nov  7 23:30:16.135: INFO: Checking if sudo command is present
I1107 23:30:16.204] Nov  7 23:30:16.203: INFO: Checking if systemctl command is present
I1107 23:30:16.270] Nov  7 23:30:16.270: INFO: Attempting `sudo systemctl start kubelet`
I1107 23:30:16.354] Nov  7 23:30:16.354: INFO: ssh prow@35.226.141.5:22: command:   sudo systemctl start kubelet
I1107 23:30:16.354] Nov  7 23:30:16.354: INFO: ssh prow@35.226.141.5:22: stdout:    ""
I1107 23:30:16.354] Nov  7 23:30:16.354: INFO: ssh prow@35.226.141.5:22: stderr:    ""
I1107 23:30:16.355] Nov  7 23:30:16.354: INFO: ssh prow@35.226.141.5:22: exit code: 0

I1107 23:31:14.480] Nov  7 23:31:14.479: INFO: Condition Ready of node bootstrap-e2e-minion-group-tjh1 is false, but Node is tainted by NodeController with [{node.kubernetes.io/not-ready  NoSchedule 2018-11-07 23:30:16 +0000 UTC} {node.kubernetes.io/not-ready  NoExecute 2018-11-07 23:30:21 +0000 UTC}]. Failure
I1107 23:31:16.480] Nov  7 23:31:16.480: INFO: Node bootstrap-e2e-minion-group-tjh1 didn't reach desired Ready condition status (true) within 1m0s
I1107 23:31:16.480] Nov  7 23:31:16.480: INFO: Node bootstrap-e2e-minion-group-tjh1 failed to enter Ready state
@msau42

Member

msau42 commented Nov 8, 2018

It looks like kubelet keeps restarting. The error right before restarting is:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4323/artifacts/bootstrap-e2e-minion-group-tjh1/kubelet.log

F1108 07:23:17.005364   26737 kubelet.go:1375] failed to start Plugin Watcher. err: failed to traverse plugin socket path, err: error accessing path: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-28f57e30-e2e2-11e8-b368-42010a800002/globalmount error: readdirent: input/output error
panic: send on closed channel
goroutine 889 [running]:
k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher.(*Watcher).traversePluginDir.func1.1(0xc000d28720, 0xc001087640, 0x36)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher/plugin_watcher.go:191 +0x68
created by k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher.(*Watcher).traversePluginDir.func1
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/util/pluginwatcher/plugin_watcher.go:190 +0x1ad
kubelet.service: Service hold-off time over, scheduling restart.
Stopped Kubernetes kubelet.
Started Kubernetes kubelet.
kubelet.service: Main process exited, code=exited, status=255/n/a
kubelet.service: Unit entered failed state.
kubelet.service: Failed with result 'exit-code'.
kubelet.service: Service hold-off time over, scheduling restart.
Stopped Kubernetes kubelet.
Started Kubernetes kubelet.
@msau42

Member

msau42 commented Nov 8, 2018

This bug could have been exposed by the newly added tests in #68025

@RenaudWasTaken

Member

RenaudWasTaken commented Nov 8, 2018

Hmm, so I see two errors:

F1108 07:23:17.005364 26737 kubelet.go:1375] failed to start Plugin Watcher. err: failed to traverse plugin socket path, err: error accessing path: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-28f57e30-e2e2-11e8-b368-42010a800002/globalmount error: readdirent: input/output error

This is a fatal error and would prevent the kubelet from starting anyway. The error is handled correctly by the pluginwatcher, but I'm not sure preventing the kubelet from starting is the desired behavior.

Should we log this error and skip it? There also seems to be a cleanup issue with that file; is this expected?

panic: send on closed channel

This one is clearly a bug and not handled correctly. It's due to the fact that we encounter an error at start and clean up, but don't wait for the goroutines to finish before doing so.

I can send a fix for the second issue pretty easily: basically, add a WaitGroup in traversePluginDir.
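
A minimal sketch of that idea (hypothetical names, not the actual plugin_watcher.go code): wait for every traversal goroutine to finish before closing the channel they send on.

```go
package main

import (
	"fmt"
	"sync"
)

// traverse fans work out to one goroutine per path and collects errors on a
// channel. The WaitGroup guarantees all sends happen before close(errCh);
// without the Wait, a late send hits a closed channel and panics with
// "send on closed channel".
func traverse(paths []string) error {
	errCh := make(chan error, len(paths))
	var wg sync.WaitGroup

	for _, p := range paths {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			// ... inspect the discovered socket; report failures on errCh.
			errCh <- fmt.Errorf("example failure for %s", path)
		}(p)
	}

	wg.Wait() // wait for all goroutines before closing the channel
	close(errCh)

	for err := range errCh {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(traverse([]string{"/var/lib/kubelet/plugins/example.sock"}))
}
```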

For the first error, what do we think is the expected behavior?

@yujuhong

Contributor

yujuhong commented Nov 8, 2018

This one is clearly a bug and not handled correctly. It's due to the fact that we encounter an error at start and clean up, but don't wait for the goroutines to finish before doing so.

This was mentioned in #70460 as well.
/cc @davidz627

@msau42

Member

msau42 commented Nov 8, 2018

Is this related to the issue where plugin watcher is recursively scanning for sockets?

@msau42

Member

msau42 commented Nov 8, 2018

Also, it should be noted that the path it's erroring on is not a socket type.

@RenaudWasTaken

Member

RenaudWasTaken commented Nov 8, 2018

Is this related to the issue where plugin watcher is recursively scanning for sockets?
Also, it should be noted that the path it's erroring on is not a socket type.

That's a bit different, as it is intended to scan recursively. Also note that the error here is produced by filepath.Walk, as can be seen from the "error accessing path" message in the log.

What should our behavior be in such a case? Ignore that error?

@msau42

Member

msau42 commented Nov 8, 2018

I'm not sure if we want to be recursively scanning. The volume directories are mount points, which means there could be lots of files/directories existing in those volumes.

@figo

Member

figo commented Nov 8, 2018

I'm not sure if we want to be recursively scanning. The volume directories are mount points, which means there could be lots of files/directories existing in those volumes.

@RenaudWasTaken maybe we can put a limit on the depth of the plugin directory traversal in traversePluginDir in the same PR: #70494
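
A rough sketch of what such a depth cap could look like inside a filepath.Walk callback (hypothetical helper, not the actual traversePluginDir):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// walkWithDepthLimit visits root but refuses to descend into directories more
// than maxDepth levels below it, so mount points nested under the plugin dir
// are never traversed.
func walkWithDepthLimit(root string, maxDepth int) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, relErr := filepath.Rel(root, path)
		if relErr != nil {
			return relErr
		}
		depth := 0
		if rel != "." {
			depth = strings.Count(rel, string(os.PathSeparator)) + 1
		}
		if info.IsDir() && depth >= maxDepth {
			return filepath.SkipDir // prune everything below this directory
		}
		// ... check whether path is a plugin registration socket.
		return nil
	})
}

func main() {
	fmt.Println(walkWithDepthLimit("/var/lib/kubelet/plugins", 2))
}
```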

@saad-ali

Member

saad-ali commented Nov 8, 2018

I'm not sure if we want to be recursively scanning. The volume directories are mount points, which means there could be lots of files/directories existing in those volumes.

Ideally the directory that the plugin sockets are dropped in should be different from the directory where mounts happen.

Currently the plugin socket directory is /var/lib/kubelet/plugins/.
The mounts directory is /var/lib/kubelet/pods/[poduid]/volumes/... (so far so good).
The problem is that the "global mount" directory is also under /var/lib/kubelet/plugins/... :(

So options are:

  1. Drop recursive scanning, as @msau42 suggested, and say plugin sockets can only be directly under /var/lib/kubelet/plugins/.
    • This will break existing plugins; for example, the GCP plugin is at /var/lib/kubelet/plugins/com.google.csi.gcepd/csi.sock.
  2. Keep recursive scanning but change the plugin socket directory to /var/lib/kubelet/plugins/sockets/?
    • This will also break existing plugins, like the GCP one pointed out above.
  3. Leave the plugin dir the same and keep the recursion, but put a limit on the depth of the plugin directory traversal, as suggested by @figo.
    • This is ugly but may be the cleanest solution.
@RenaudWasTaken

Member

RenaudWasTaken commented Nov 8, 2018

I'm not sure if we want to be recursively scanning. The volume directories are mount points, which means there could be lots of files/directories existing in those volumes.

This conversation is completely different from the bug surfaced by this issue. Let's move that to the corresponding issue: #69015

I have a fix ready to go out for the panic; we should figure out what behavior we expect the kubelet to have when encountering this error from filepath.Walk:

F1108 07:23:17.005364 26737 kubelet.go:1375] failed to start Plugin Watcher. err: failed to traverse plugin socket path, err: error accessing path: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-28f57e30-e2e2-11e8-b368-42010a800002/globalmount error: readdirent: input/output error

Should we ignore it?

@msau42

Member

msau42 commented Nov 8, 2018

Ignoring the error (maybe with a warning) instead of failing all of the kubelet seems reasonable to me. If a plugin did not set up its registration socket correctly, then the end result should be that the plugin fails to register, but all other plugins should continue to work correctly.
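
A minimal sketch of that behavior (hypothetical helper, using the standard log package for illustration): the filepath.Walk callback warns about an unreadable path and keeps going instead of propagating the error and aborting the traversal.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// scanPluginDir walks the plugin directory. If a path cannot be accessed
// (e.g. "readdirent: input/output error" on a stale CSI global mount), it
// logs a warning and skips that entry instead of failing the whole scan.
func scanPluginDir(root string) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			log.Printf("warning: error accessing path %s: %v; skipping", path, err)
			return nil // keep walking; only this entry is ignored
		}
		// ... register path here if it is a plugin registration socket.
		return nil
	})
}

func main() {
	fmt.Println(scanPluginDir("/var/lib/kubelet/plugins"))
}
```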

@AishSundar

Contributor

AishSundar commented Nov 9, 2018

@msau42 @saad-ali @RenaudWasTaken the latest runs of this job no longer time out - https://testgrid.k8s.io/sig-release-master-blocking#gce-acos-master-serial

Instead, I am seeing a new CSI-related failure: [sig-storage] CSI Volumes [Driver: com.google.csi.gcepd][Serial] [Testpattern: Dynamic PV (default fs)] volumes should be mountable

Is this expected? Do we want to track that failure in this bug now, or do we have a separate issue?

@AishSundar

Contributor

AishSundar commented Nov 9, 2018

@msau the same test run, #4325, shows up as a timeout on that particular test. Let's wait for the next run.

But somehow the timeout seems to have gone away. Are we still planning to merge #69015 and #70821, which are associated with this issue?

@RenaudWasTaken

Member

RenaudWasTaken commented Nov 9, 2018

#70810 should be merged as it fixes a bug.

@saad-ali can you help with approval?

Thanks!

@AishSundar

Contributor

AishSundar commented Nov 9, 2018

#70810 should be merged as it fixes a bug

@RenaudWasTaken are you missing the right PR# in the comment above?

@vladimirvivien

Member

vladimirvivien commented Nov 9, 2018

@AishSundar this is the correct PR - #70821

@AishSundar

Contributor

AishSundar commented Nov 10, 2018

/reopen

Keeping it open until we can see the job pass a couple of rounds

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 10, 2018

@AishSundar: Reopened this issue.

In response to this:

/reopen

Keeping it open until we can see the job pass a couple of rounds

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 10, 2018

@AishSundar

Contributor

AishSundar commented Nov 11, 2018

I see just one CSI test timing out in the serial job: https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-serial

@RenaudWasTaken Is that something the fixes here attempted to resolve? If not, I can open a new issue if it persists and close this one out. Thanks.

@AishSundar

Contributor

AishSundar commented Nov 11, 2018

The test mentioned in my comment above was a flake. Closing this issue now; I will open a new issue for the flake.

@AishSundar AishSundar closed this Nov 11, 2018
