Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite #31633

Closed
fejta opened this issue Aug 29, 2016 · 9 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@fejta
Contributor

fejta commented Aug 29, 2016

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/30321/node-pull-build-e2e-test/21737/

I0829 10:30:09.777786   22504 remote.go:210] Copying test artifacts from tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727
I0829 10:30:09.792249   22504 run_remote.go:559] Deleting instance "tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727"
================================================================
Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite
Running Suite: E2eNode Suite

Super unclear what happened here...

@fejta fejta added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels Aug 29, 2016
@fejta fejta changed the title from "Failure" to "Failure Finished Host tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727 Test Suite" Aug 29, 2016
@timstclair

cc @Random-Liu

@dchen1107
Member

No test failed in this run. Is this blocking any PR merges?

@Random-Liu
Member

Random-Liu commented Aug 29, 2016

Connection to 146.148.103.180 closed by remote host.
, command [scp -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -r 146.148.103.180:/tmp/gcloud-e2e-622678784/results/ /var/lib/jenkins/workspace/node-pull-build-e2e-test/_artifacts/tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727] failed with error: exit status 1 and output:
ssh_exchange_identification: read: Connection reset by peer
]

This is likely the cause: the SSH connection was reset in the middle of the test, for reasons that are not yet clear.

@fejta fejta added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 29, 2016
@Random-Liu
Member

Random-Liu commented Aug 29, 2016

One observation is that all these test failures are on coreos tmp-node-e2e-498cbc0e-coreos-alpha-1122-0-0-v20160727.

@yifan-gu @euank Any idea what the cause might be?
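
One way to check the observation that the failures cluster on a single image is to group the failing host names by image. The sketch below assumes the host names follow the pattern `tmp-node-e2e-<8 hex chars>-<image>`, which is inferred only from the names quoted in this thread; the helper names `image_of` and `failures_by_image` are hypothetical and are not part of the node e2e tooling.

```python
import re
from collections import Counter

# Assumed naming scheme, inferred from the host names quoted in this thread:
# tmp-node-e2e-<8 hex chars>-<image name>
HOST_RE = re.compile(r"^tmp-node-e2e-[0-9a-f]{8}-(?P<image>.+)$")


def image_of(host):
    """Return the image part of a test host name, or None if it does not match."""
    match = HOST_RE.match(host)
    return match.group("image") if match else None


def failures_by_image(hosts):
    """Count failing hosts per image; hosts that do not match count as 'unknown'."""
    return Counter(image_of(h) or "unknown" for h in hosts)


# Host names copied from the failures reported above.
failed_hosts = [
    "tmp-node-e2e-b6d45b2a-coreos-alpha-1122-0-0-v20160727",
    "tmp-node-e2e-498cbc0e-coreos-alpha-1122-0-0-v20160727",
]

for image, count in failures_by_image(failed_hosts).items():
    print(f"{image}: {count} failing host(s)")
```

If every failing host maps to the same image, that points at the image (or how it is launched) rather than at the tests themselves.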

@dchen1107
Member

It looks like all the failures are due to losing the connection to the CoreOS host during the tests:

Connection to 104.197.17.25 closed by remote host.
, command [scp -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -r 104.197.17.25:/tmp/gcloud-e2e-1986596147/results/ /var/lib/jenkins/workspace/kubelet-gce-e2e-ci/_artifacts/tmp-node-e2e-beffb2ff-coreos-alpha-1122-0-0-v20160727] failed with error: exit status 1 and output:
ssh_exchange_identification: read: Connection reset by peer
]

cc @euank Can you help with this? Otherwise, we will have to remove the CoreOS image from the node test metrics to stop the bleeding.

@euank
Contributor

euank commented Aug 29, 2016

Possible theory: when I updated to the current way of launching CoreOS nodes, I only masked the update units and did not explicitly stop them. That might not always be sufficient, since a unit that has already started keeps running after it is masked. I'll verify now whether that's what is going wrong (my bad if so).

@euank
Copy link
Contributor

euank commented Aug 29, 2016

Theory confirmed (note the masked + active):

core@tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727 ~ $ sudo systemctl status update-engine
● update-engine.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:24 UTC; 1min 33s ago
 Main PID: 1034 (update_engine)
   CGroup: /system.slice/update-engine.service
           └─1034 /usr/sbin/update_engine -foreground -logtostderr

Aug 29 22:12:23 localhost systemd[1]: Starting Update Engine...
Aug 29 22:12:24 localhost update_engine[1034]: [0829/221224:INFO:main.cc(155)] CoreOS Update Engine starting
Aug 29 22:12:24 localhost systemd[1]: Started Update Engine.
Aug 29 22:12:24 localhost update_engine[1034]: [0829/221224:INFO:update_check_scheduler.cc(82)] Next update check in 11m12s
Aug 29 22:13:09 tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727.c.coreos-g update_engine[1034]: [0829/221309:INFO:update_attempter.cc(485)] Updating boot flags...
core@tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727 ~ $ sudo systemctl status locksmithd
● locksmithd.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:23 UTC; 1min 37s ago
 Main PID: 1016 (locksmithd)
   CGroup: /system.slice/locksmithd.service
           └─1016 /usr/lib/locksmith/locksmithd

Aug 29 22:12:23 localhost systemd[1]: Started Cluster reboot manager.
Aug 29 22:12:24 tmp-node-e2e-96819429-coreos-alpha-1122-0-0-v20160727.c.coreos-g locksmithd[1016]: locksmithd starting currentOperation="UPDATE_STATUS_IDLE" strategy="best-effort"

PR coming in a second... Mea culpa
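
The "masked + active" combination above can also be detected mechanically. This is a minimal sketch, assuming the `systemctl status` output has the same layout as the transcript in this comment (a `●` header line followed by `Loaded:` and `Active:` lines); the function name `masked_but_active` is hypothetical, and other systemd versions may format the output differently.

```python
import re
from typing import List


def masked_but_active(status_output: str) -> List[str]:
    """Return the units that are reported as masked but are still active."""
    flagged = []
    unit = None
    masked = False
    for line in status_output.splitlines():
        # A header line such as "● update-engine.service" starts a new unit.
        header = re.match(r"\s*●\s+(\S+)", line)
        if header:
            unit, masked = header.group(1), False
            continue
        stripped = line.strip()
        if stripped.startswith("Loaded:"):
            masked = "masked" in stripped
        elif stripped.startswith("Active: active") and unit is not None and masked:
            flagged.append(unit)
    return flagged


# Excerpt of the status output pasted above.
sample = """\
● update-engine.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:24 UTC; 1min 33s ago
● locksmithd.service
   Loaded: masked (/dev/null; bad)
   Active: active (running) since Mon 2016-08-29 22:12:23 UTC; 1min 37s ago
"""

print(masked_but_active(sample))
```

On the excerpt above this flags both `update-engine.service` and `locksmithd.service`, which matches the diagnosis: masking prevents future starts but does not stop a unit that is already running.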

euank added a commit to euank/kubernetes that referenced this issue Aug 29, 2016
This disables update-engine and locksmithd with ignition instead of
cloud-init so that they're really totally 100% disabled.

Pretty much every way of disabling them with cloud-init is mildly racy.

Fixes kubernetes#31633
@euank
Contributor

euank commented Aug 29, 2016

PR #31653 if any of you fine flake-hunters want to give a review.

k8s-github-robot pushed a commit that referenced this issue Aug 30, 2016
Automatic merge from submit-queue

test/node-e2e: Update CoreOS update disabling

Previously in this saga... #25004

This disables update-engine and locksmithd with ignition instead of
cloud-init so that they're really totally 100% disabled. Our ignition guy promises.

Pretty much every way of disabling them with cloud-init is mildly racy.

Fixes #31633 

I think @vishh can say "I told you so" after the comment on #30023 (diff) .. he was right, but it turns out "stop" there doesn't really work either because of the mess that is cloud-init. Fortunately, converting our cloud-init to json and calling it "ignition" works quite well 😄 

Testing done: I ssh'd in and verified that yes, they're disabled. I didn't wait on the e2e tests to pass, so we'll let this PR check that.
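
For readers unfamiliar with Ignition, the following sketch shows roughly what an Ignition config that masks the two units could look like. It is not the file from the PR, which is not reproduced in this thread; the `2.0.0` version string and the `systemd.units[].mask` layout are assumptions based on the Ignition config format, and `ignition_mask_config` is a hypothetical helper name.

```python
import json


def ignition_mask_config(units):
    """Build an Ignition-style config dict that masks the given systemd units.

    The field layout and the version string are assumptions, not taken
    from the PR discussed in this thread.
    """
    return {
        "ignition": {"version": "2.0.0"},
        "systemd": {"units": [{"name": name, "mask": True} for name in units]},
    }


config = ignition_mask_config(["update-engine.service", "locksmithd.service"])
print(json.dumps(config, indent=2))
```

Because Ignition runs in the initramfs before systemd starts the real root's units, a mask applied this way is in place before the units could be started, which avoids the race described for cloud-init.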

5 participants