NFS test failures #58578

Closed
justinsb opened this issue Jan 21, 2018 · 10 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@justinsb
Member

We're observing some NFS test failures on kops-aws tests. kops on AWS does not use the containerized mounter.
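
(For anyone cross-checking the mounter setup: a quick way to confirm this on a node, assuming the containerized mounter still shows up as the --experimental-mounter-path kubelet flag, is to look at the kubelet command line. On kops/AWS nodes we expect no match, i.e. mount.nfs runs directly on the host.)

```sh
# Sketch: check whether kubelet was started with a containerized mounter.
# Flag name assumed to be --experimental-mounter-path; no match expected on kops/AWS.
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -- '--experimental-mounter-path' || echo "no containerized mounter"
```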

For example:

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56132/pull-kubernetes-e2e-kops-aws/69965/

I0121 02:54:06.028] Jan 21 02:53:54.179: INFO: At 2018-01-21 02:48:44 +0000 UTC - event for pvc-tester-w4q98: {kubelet ip-172-20-41-247.us-west-2.compute.internal} FailedMount: MountVolume.SetUp failed for volume "nfs-qjgnk" : mount failed: exit status 32
I0121 02:54:06.028] Mounting command: systemd-run
I0121 02:54:06.029] Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/92f5e91d-fe55-11e7-a178-0690eca8f2ac/volumes/kubernetes.io~nfs/nfs-qjgnk --scope -- mount -t nfs 100.96.3.43:/exports /var/lib/kubelet/pods/92f5e91d-fe55-11e7-a178-0690eca8f2ac/volumes/kubernetes.io~nfs/nfs-qjgnk
I0121 02:54:06.029] Output: Running as unit run-30771.scope.
I0121 02:54:06.029] mount.nfs: rpc.statd is not running but is required for remote locking.
I0121 02:54:06.029] mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
I0121 02:54:06.030] mount.nfs: an incorrect mount option was specified
I0121 02:54:06.030] 

Is it a requirement to run rpc.statd, or should kubelet start it? I can't see any recent code changes. It feels like this is also flaking (rather than reliably failing), which is also confusing.
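
(A minimal sketch for poking at this on an affected node, assuming a systemd-based image where the relevant units are named rpcbind and rpc-statd as in Debian's nfs-common/rpcbind packaging; the server address is copied from the log above and the mount point is just scratch:)

```sh
# Is statd actually running, and does starting it fix the mount?
systemctl status rpcbind rpc-statd
systemctl start rpcbind rpc-statd

# Or take the error message's suggestion and keep locks local, bypassing statd
# entirely (server address from the log above; /mnt/nfs-scratch is hypothetical):
mkdir -p /mnt/nfs-scratch
mount -t nfs -o nolock 100.96.3.43:/exports /mnt/nfs-scratch
```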

We did recently have to update the image & kernel for Meltdown / Spectre for kops-on-AWS.

@k8s-ci-robot added the needs-sig label Jan 21, 2018
@justinsb added the sig/storage label Jan 21, 2018
@k8s-ci-robot removed the needs-sig label Jan 21, 2018
justinsb added a commit to justinsb/test-infra that referenced this issue Jan 21, 2018
The alpha channel has the upcoming cloud images, so we can catch issues
before images are committed to the stable channel.

This will also let us gather data for issues which may be image
dependent e.g.
kubernetes/kubernetes#58578
@justinsb
Member Author

And it does look to be non-deterministic, as https://k8s-testgrid.appspot.com/google-aws#kops-aws shows: 22752 & '53 failed all the NFS tests, '54 failed 3/7, '55 failed 5/7, '56 and '57 failed 3/7 (but not the exact same 3), and '58 passed all 3.

(I'm assuming not a whole lot changed over that time interval)

@liggitt added the priority/critical-urgent and kind/failing-test labels Jan 22, 2018
@liggitt
Member

liggitt commented Jan 22, 2018

this appears to have failed 24 out of the last 35 kops PR runs.

the resulting PVC test errors started appearing early afternoon on 1/20: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=pvc-tester&job=pull-kubernetes-e2e-kops-aws

do we know what changed in the image around that time? (https://github.com/kubernetes/kops/commits/master)

@justinsb
Member Author

The change should primarily have been the kernel -> 4.4.111; we also do a package update, so we could have picked up a newer version of the NFS package or another supporting package.
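
(To help narrow that down, a rough sketch of what to compare on a node from a failing run vs. one on the previous image, assuming a Debian-based image where the NFS userspace bits come from nfs-common and rpcbind:)

```sh
# Rough comparison sketch; package/unit names assume a Debian-based image.
uname -r                                 # kernel actually running on the node
dpkg -l nfs-common rpcbind | grep '^ii'  # NFS userspace package versions
systemctl is-enabled rpcbind rpc-statd   # whether statd/rpcbind are set to start at boot
```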

We did try going to Debian stretch for 1.10, but we still have/had the nf_conntrack issue, so I reverted that to unblock the queue.

I'm also starting to add more tests with other images (e.g. kubernetes/test-infra#6364), both so that we get a heads-up before promoting images to stable, and because I guess we should also try other OSes (e.g. I could add a test on RHEL).

justinsb added a commit to justinsb/test-infra that referenced this issue Jan 22, 2018
The alpha channel has the upcoming cloud images, so we can catch issues
before images are committed to the stable channel.

This will also let us gather data for issues which may be image
dependent e.g.
kubernetes/kubernetes#58578
@justinsb
Member Author

It does seem highly likely to be something caused by the image, given the timing and the fact that https://k8s-testgrid.appspot.com/google-kops-gce#kops-gce is consistently green (that uses COS). I'll put through a PR for a test that uses another distro also.

@justinsb
Member Author

Proposed kubernetes/test-infra#6367 to add a test for ubuntu-16.04, which should be similar. That way we'll get an indication of whether it's a Meltdown/Spectre regression or something specific to the image.

Not sure how we should proceed here: we want to get the queue unblocked, but we don't really want to ignore the issue if it turns out to be real. I guess we could skip the NFS tests on the queue-blocking jobs, and ensure we have non-blocking jobs that will continue to fail until it's resolved.
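
(If we go the skip route, it would look roughly like extending the blocking job's ginkgo skip regex; illustrative only, since the exact NFS test names and the --test_args plumbing live in the test-infra job config:)

```sh
# Hypothetical sketch: skip the NFS tests in the blocking job by extending the
# ginkgo skip regex passed to the e2e runner. Check the actual test names and
# the job's existing skip value in test-infra before using anything like this.
e2e.test --provider=aws --ginkgo.skip='\[sig-storage\].*NFS'
```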

@zouyee
Member

zouyee commented Jan 22, 2018

At present, the pull-kubernetes-e2e-kops-aws job is failing widely.

@justinsb
Member Author

Correction: the kernel in the stable channel is 4.4.110. The kernel in the alpha channel is 4.4.111. PR to add testing of the alpha channel is kubernetes/test-infra#6364

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 22, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 22, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
