Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite #31633

fejta · 2016-08-29T20:06:31Z

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/30321/node-pull-build-e2e-test/21737/

I0829 10:30:09.777786   22504 remote.go:210] Copying test artifacts from tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727
I0829 10:30:09.792249   22504 run_remote.go:559] Deleting instance "tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727"
================================================================
Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite
Running Suite: E2eNode Suite

Super unclear what happened here...


timstclair · 2016-08-29T20:29:15Z

cc @Random-Liu

dchen1107 · 2016-08-29T21:25:59Z

No test failed in this run. Is this blocking any PR merge?

Random-Liu · 2016-08-29T21:34:22Z

Connection to 146.148.103.180 closed by remote host.
, command [scp -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -r 146.148.103.180:/tmp/gcloud-e2e-622678784/results/ /var/lib/jenkins/workspace/node-pull-build-e2e-test/_artifacts/tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727] failed with error: exit status 1 and output:
ssh_exchange_identification: read: Connection reset by peer
]

This should be the reason. The SSH connection was reset in the middle of the test for some reason.

fejta · 2016-08-29T21:39:09Z

I think this is the main cause of #31439

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/8802/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/8768/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/8763/

Random-Liu · 2016-08-29T21:47:20Z

One observation is that all these test failures are on coreos tmp-node-e2e-498cbc0e-coreos-alpha-1122-0-0-v20160727.

@yifan-gu @euank Any idea what the cause might be?

dchen1107 · 2016-08-29T21:49:29Z

Looks like all the failures are due to losing the connection to the CoreOS host during the tests:

Connection to 104.197.17.25 closed by remote host.
, command [scp -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -r 104.197.17.25:/tmp/gcloud-e2e-1986596147/results/ /var/lib/jenkins/workspace/kubelet-gce-e2e-ci/_artifacts/tmp-node-e2e-beffb2ff-coreos-alpha-1122-0-0-v20160727] failed with error: exit status 1 and output:
ssh_exchange_identification: read: Connection reset by peer
]

cc/ @euank Can you help with this? Otherwise, we will have to remove the CoreOS image from node test metrics to stop the bleeding.

euank · 2016-08-29T21:53:55Z

Possible theory: when I updated to the current way of launching CoreOS nodes, I only masked the update units and did not explicitly stop them. That might not always be sufficient. I'll verify now whether that's going wrong (my bad if so).

euank · 2016-08-29T22:15:51Z

Theory confirmed (note the masked + active):

core@tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727 ~ $ sudo systemctl status update-engine
● update-engine.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:24 UTC; 1min 33s ago
 Main PID: 1034 (update_engine)
   CGroup: /system.slice/update-engine.service
           └─1034 /usr/sbin/update_engine -foreground -logtostderr

Aug 29 22:12:23 localhost systemd[1]: Starting Update Engine...
Aug 29 22:12:24 localhost update_engine[1034]: [0829/221224:INFO:main.cc(155)] CoreOS Update Engine starting
Aug 29 22:12:24 localhost systemd[1]: Started Update Engine.
Aug 29 22:12:24 localhost update_engine[1034]: [0829/221224:INFO:update_check_scheduler.cc(82)] Next update check in 11m12s
Aug 29 22:13:09 tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727.c.coreos-g update_engine[1034]: [0829/221309:INFO:update_attempter.cc(485)] Updating boot flags...
core@tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727 ~ $ sudo systemctl status locksmithd
● locksmithd.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:23 UTC; 1min 37s ago
 Main PID: 1016 (locksmithd)
   CGroup: /system.slice/locksmithd.service
           └─1016 /usr/lib/locksmith/locksmithd

Aug 29 22:12:23 localhost systemd[1]: Started Cluster reboot manager.
Aug 29 22:12:24 tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727.c.coreos-g locksmithd[1016]: locksmithd starting currentOperation="UPDATE_STATUS_IDLE" strategy="best-effort"
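
The "masked + active" combination in the status output above can also be checked mechanically. The following is a minimal sketch, not part of the node e2e tooling: the function name `masked_but_active` and the sample string are hypothetical, and it assumes the `systemctl status` text layout shown in this thread (a `●` header line followed by `Loaded:` and `Active:` lines).

```python
import re


def masked_but_active(status_text):
    """Return unit names whose status shows 'Loaded: masked' and 'Active: active'.

    Assumes the text layout of `systemctl status` as pasted in this thread.
    """
    flagged = []
    unit = None
    loaded_masked = False
    for line in status_text.splitlines():
        header = re.match(r"^\s*●\s+(\S+)", line)
        if header:
            # A new unit section starts; reset per-unit state.
            unit = header.group(1)
            loaded_masked = False
            continue
        stripped = line.strip()
        if stripped.startswith("Loaded:"):
            loaded_masked = stripped.split()[1:2] == ["masked"]
        elif stripped.startswith("Active:") and unit is not None:
            if loaded_masked and stripped.split()[1:2] == ["active"]:
                flagged.append(unit)
    return flagged


# Sample trimmed from the status output in this thread.
SAMPLE = """\
● update-engine.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:24 UTC; 1min 33s ago
● locksmithd.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:23 UTC; 1min 37s ago
"""

print(masked_but_active(SAMPLE))
```

A unit that is masked after it has already started keeps running until it is stopped, which is why both services are flagged here.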

PR coming in a second... Mea culpa

This disables update-engine and locksmithd with ignition instead of cloud-init so that they're really totally 100% disabled. Pretty much every way of disabling them with cloud-init is mildly racy. Fixes kubernetes#31633
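
For readers unfamiliar with Ignition, here is a rough sketch of what a config that masks both units might look like. The actual config from the PR is not shown in this thread, so this is an assumption: the helper name `ignition_mask_units` is hypothetical, and the `"2.0.0"` version string and the `systemd.units[].mask` field are taken from the Ignition v2 config format rather than from the PR itself.

```python
import json


def ignition_mask_units(unit_names):
    """Build an Ignition-style config dict that masks the given systemd units.

    The version string and field names are assumptions based on the
    Ignition v2 config format, not copied from the PR in this thread.
    """
    return {
        "ignition": {"version": "2.0.0"},
        "systemd": {
            "units": [{"name": name, "mask": True} for name in unit_names],
        },
    }


config = ignition_mask_units(["update-engine.service", "locksmithd.service"])
print(json.dumps(config, indent=2))
```

Because Ignition runs once on first boot, before the units would normally be started, this approach avoids the ordering race described for cloud-init.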

euank · 2016-08-29T23:20:40Z

PR #31653 if any of you fine flake-hunters want to give a review.

@vishh

Automatic merge from submit-queue

test/node-e2e: Update CoreOS update disabling

Previously in this saga... #25004

This disables update-engine and locksmithd with ignition instead of cloud-init so that they're really totally 100% disabled. Our ignition guy promises.

Pretty much every way of disabling them with cloud-init is mildly racy.

Fixes #31633

I think @vishh can say "I told you so" after the comment on #30023 (diff) .. he was right, but it turns out "stop" there doesn't really work either because of the mess that is cloud-init.

Fortunately, converting our cloud-init to json and calling it "ignition" works quite well 😄

Testing done: I ssh'd in and verified that yes, they're disabled. I didn't wait on the e2e tests to pass, so we'll let this PR check that.

fejta added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 29, 2016

fejta assigned dchen1107 Aug 29, 2016

fejta changed the title ~~Failure~~ Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite Aug 29, 2016

fejta mentioned this issue Aug 29, 2016

Convert bool to error, helper func for cd to skew #30321

Merged

timstclair mentioned this issue Aug 29, 2016

Include security options in the container created event #31557

Merged

k8s-github-robot mentioned this issue Aug 29, 2016

kubelet-gce-e2e-ci: broken test run #31439

Closed

fejta added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 29, 2016

dchen1107 assigned euank Aug 29, 2016

euank mentioned this issue Aug 29, 2016

test/node-e2e: Update CoreOS update disabling #31653

Merged

This was referenced Aug 30, 2016

Cleanup node failure message #31634

Merged

increase latency and resource limit accroding to test results #31664

Merged

k8s-github-robot closed this as completed in #31653 Aug 30, 2016

fejta mentioned this issue Aug 30, 2016

1 errors encountered.Running gubernator.sh #31656

Closed

timstclair mentioned this issue Aug 31, 2016

Append "AppArmor enabled" to the Node ready condition message #31659

Merged
