NFS test failures #58578

Closed
justinsb opened this issue Jan 21, 2018 · 10 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@justinsb
Member

We're observing some NFS test failures on kops-aws tests. kops on AWS does not use the containerized mounter.
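
(For anyone cross-checking the mounter setup: a quick way to confirm this on a node, assuming the containerized mounter still shows up as the --experimental-mounter-path kubelet flag, is to look at the kubelet command line. On kops/AWS nodes we expect no match, i.e. mount.nfs runs directly on the host.)

```sh
# Sketch: check whether kubelet was started with a containerized mounter.
# Flag name assumed to be --experimental-mounter-path; no match expected on kops/AWS.
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -- '--experimental-mounter-path' || echo "no containerized mounter"
```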

For example:

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56132/pull-kubernetes-e2e-kops-aws/69965/

I0121 02:54:06.028] Jan 21 02:53:54.179: INFO: At 2018-01-21 02:48:44 +0000 UTC - event for pvc-tester-w4q98: {kubelet ip-172-20-41-247.us-west-2.compute.internal} FailedMount: MountVolume.SetUp failed for volume "nfs-qjgnk" : mount failed: exit status 32
I0121 02:54:06.028] Mounting command: systemd-run
I0121 02:54:06.029] Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/92f5e91d-fe55-11e7-a178-0690eca8f2ac/volumes/kubernetes.io~nfs/nfs-qjgnk --scope -- mount -t nfs 100.96.3.43:/exports /var/lib/kubelet/pods/92f5e91d-fe55-11e7-a178-0690eca8f2ac/volumes/kubernetes.io~nfs/nfs-qjgnk
I0121 02:54:06.029] Output: Running as unit run-30771.scope.
I0121 02:54:06.029] mount.nfs: rpc.statd is not running but is required for remote locking.
I0121 02:54:06.029] mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
I0121 02:54:06.030] mount.nfs: an incorrect mount option was specified
I0121 02:54:06.030] 

Is it a requirement to run rpc.statd, or should kubelet start it? I can't see any recent code changes. It feels like this is also flaking (rather than reliably failing), which is also confusing.
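
(A minimal sketch for poking at this on an affected node, assuming a systemd-based image where the relevant units are named rpcbind and rpc-statd as in Debian's nfs-common/rpcbind packaging; the server address is copied from the log above and the mount point is just scratch:)

```sh
# Is statd actually running, and does starting it fix the mount?
systemctl status rpcbind rpc-statd
systemctl start rpcbind rpc-statd

# Or take the error message's suggestion and keep locks local, bypassing statd
# entirely (server address from the log above; /mnt/nfs-scratch is hypothetical):
mkdir -p /mnt/nfs-scratch
mount -t nfs -o nolock 100.96.3.43:/exports /mnt/nfs-scratch
```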

We did recently have to update the image & kernel for Meltdown / Spectre for kops-on-AWS.

@k8s-ci-robot added the needs-sig label Jan 21, 2018
@justinsb added the sig/storage label Jan 21, 2018
@k8s-ci-robot removed the needs-sig label Jan 21, 2018
justinsb added a commit to justinsb/test-infra that referenced this issue Jan 21, 2018
The alpha channel has the upcoming cloud images, so we can catch issues
before images are committed to the stable channel.

This will also let us gather data for issues which may be image
dependent e.g.
kubernetes/kubernetes#58578
@justinsb
Member Author

And it does look to be non-deterministic, as https://k8s-testgrid.appspot.com/google-aws#kops-aws shows: 22752 & '53 failed all the NFS tests, '54 failed 3/7, '55 failed 5/7, '56 and '57 failed 3/7 (but not the exact same 3), and '58 passed all 3.

(I'm assuming not a whole lot changed over that time interval)

@liggitt added the priority/critical-urgent and kind/failing-test labels Jan 22, 2018
@liggitt
Member

liggitt commented Jan 22, 2018

this appears to have failed 24 out of the last 35 kops PR runs.

the resulting PVC test errors started appearing early afternoon on 1/20: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=pvc-tester&job=pull-kubernetes-e2e-kops-aws

do we know what changed in the image around that time? (https://github.com/kubernetes/kops/commits/master)

@justinsb
Member Author

The change should primarily have been the kernel -> 4.4.111; we also do a package update, so we could have picked up a newer version of the NFS package or another supporting package.
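
(To help narrow that down, a rough sketch of what to compare on a node from a failing run vs. one on the previous image, assuming a Debian-based image where the NFS userspace bits come from nfs-common and rpcbind:)

```sh
# Rough comparison sketch; package/unit names assume a Debian-based image.
uname -r                                 # kernel actually running on the node
dpkg -l nfs-common rpcbind | grep '^ii'  # NFS userspace package versions
systemctl is-enabled rpcbind rpc-statd   # whether statd/rpcbind are set to start at boot
```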

We did try going to Debian stretch for 1.10, but we still have/had the nf_conntrack issue, so I reverted that to unblock the queue.

I'm also starting to add more tests with other images (e.g. kubernetes/test-infra#6364), both so that we get a heads-up before promoting images to stable, and because I guess we should also try other OSes (e.g. I could add a test on RHEL).

justinsb added a commit to justinsb/test-infra that referenced this issue Jan 22, 2018
The alpha channel has the upcoming cloud images, so we can catch issues
before images are committed to the stable channel.

This will also let us gather data for issues which may be image
dependent e.g.
kubernetes/kubernetes#58578
@justinsb
Member Author

It does seem highly likely to be something caused by the image, given the timing and the fact that https://k8s-testgrid.appspot.com/google-kops-gce#kops-gce is consistently green (that uses COS). I'll put through a PR for a test that uses another distro also.

@justinsb
Member Author

Proposed kubernetes/test-infra#6367 to add a test for ubuntu-16.04, which should be similar. That way we'll get an indication of whether it's a Meltdown/Spectre regression or something specific to the image.

Not sure how we should proceed here: we want to get the queue unblocked, but we don't really want to ignore the issue if it turns out to be real. I guess we could skip the NFS tests on the queue-blocking jobs, and ensure we have non-blocking jobs that will continue to fail until it's resolved.
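
(If we go the skip route, it would look roughly like extending the blocking job's ginkgo skip regex; illustrative only, since the exact NFS test names and the --test_args plumbing live in the test-infra job config:)

```sh
# Hypothetical sketch: skip the NFS tests in the blocking job by extending the
# ginkgo skip regex passed to the e2e runner. Check the actual test names and
# the job's existing skip value in test-infra before using anything like this.
e2e.test --provider=aws --ginkgo.skip='\[sig-storage\].*NFS'
```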

@zouyee
Member

zouyee commented Jan 22, 2018

At present, the pull-kubernetes-e2e-kops-aws job is failing widely.

@justinsb
Member Author

Correction: the kernel in the stable channel is 4.4.110. The kernel in the alpha channel is 4.4.111. PR to add testing of the alpha channel is kubernetes/test-infra#6364

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 22, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 22, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
