NFS test failures #58578
Comments
The alpha channel has the upcoming cloud images, so we can catch issues before images are committed to the stable channel. This will also let us gather data for issues which may be image-dependent, e.g. kubernetes/kubernetes#58578.
And it does look to be non-deterministic, as https://k8s-testgrid.appspot.com/google-aws#kops-aws shows: 22752 & '53 failed all the NFS tests, '54 failed 3/7, '55 failed 5/7, '56 and '57 failed 3/7 (but not the exact same 3), and '58 passed all 3. (I'm assuming not a whole lot changed over that time interval.)
This appears to have failed 24 out of the last 35 kops PR runs. The resulting PVC test errors started appearing early afternoon on 1/20: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=pvc-tester&job=pull-kubernetes-e2e-kops-aws
Do we know what changed in the images around that time (https://github.com/kubernetes/kops/commits/master)?
The change should primarily have been the kernel -> 4.4.111; we also do a package update, so we could have picked up a newer version of the NFS package or another supporting package. We did try going to stretch for 1.10, but we still have / had the nf_conntrack issue, so I reverted that to unblock the queue. I'm also starting to add more tests with other images (e.g. kubernetes/test-infra#6364), both so that we can get a heads-up before promoting images to stable, and also because I guess we should try other OSes (e.g. I could add a test on RHEL).
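(For reference, a generic way to check whether the NFS-related packages actually differ between the stable and alpha images would be to compare versions on a node booted from each. This is just a sketch assuming a Debian/Ubuntu-style image; it's not output captured in this issue.)

```sh
# On a node from each image, list the installed NFS/RPC packages and versions
# so the stable vs alpha images can be diffed (Debian/Ubuntu-style check).
dpkg -l | grep -Ei 'nfs-common|nfs-kernel-server|rpcbind'
```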
It does seem highly likely to be something caused by the image, given the timing and the fact that https://k8s-testgrid.appspot.com/google-kops-gce#kops-gce is consistently green (that uses COS). I'll put through a PR for a test that uses another distro also.
Proposed kubernetes/test-infra#6367 to add a test for ubuntu-16.04, which should be similar. That way we'll get an indication of whether it's a Meltdown/Spectre regression or something specific to the image. Not sure how we should proceed here - we want to get the queue unblocked, but we don't really want to ignore the issue if it turns out to be real. I guess we could ignore the NFS tests on the queue-blocking jobs, and ensure that we have non-blocking jobs that will continue to fail until it's resolved.
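(A rough sketch of what that could look like: the e2e NFS specs all have "NFS" in their names, so a blocking job could drop them via the ginkgo skip regex. The invocation below is illustrative, not the actual job config.)

```sh
# Illustrative only: run the e2e suite but skip NFS-named specs so they don't
# block the merge queue; keep a separate non-blocking job without the skip.
go run hack/e2e.go -- --provider=aws --test \
  --test_args="--ginkgo.skip=NFS"
```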
At present, the
Correction: the kernel in the stable channel is 4.4.110. The kernel in the alpha channel is 4.4.111. The PR to add testing of the alpha channel is kubernetes/test-infra#6364.
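(Generic check, not specific to these jobs: the kernel a cluster's nodes actually booted shows up in kubectl, which makes it easy to confirm which channel's image a given run picked up.)

```sh
# The KERNEL-VERSION column distinguishes 4.4.110 (stable) from 4.4.111 (alpha).
kubectl get nodes -o wide

# Or directly on a node:
uname -r
```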
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
We're observing some NFS test failures on kops-aws tests. kops on AWS does not use the containerized mounter.
For example:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56132/pull-kubernetes-e2e-kops-aws/69965/
Is it a requirement to run rpc.statd, or should kubelet start it? I can't see any recent code changes. It feels like this is also flaking (rather than reliably failing), which is also confusing.
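(For anyone poking at a node: a quick way to check whether statd is actually running, assuming a systemd-based Debian/Ubuntu image with nfs-common and rpcbind installed; the exact service names on the kops image aren't confirmed here.)

```sh
# Check the client-side NFS helper daemons on the node.
systemctl status rpcbind rpc-statd

# rpcinfo should list a "status" service if rpc.statd is registered.
rpcinfo -p | grep status

# Starting it by hand is one way to test whether a missing statd is what the
# NFS mounts are tripping over.
sudo systemctl start rpc-statd
```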
We did recently have to update the image & kernel for Meltdown / Spectre for kops-on-AWS.