Fix race condition in e2e test script #315
[test]
lgtm.
@soltysh I think this is a big cause of hanging e2e test runs in Jenkins. When it occurs, which is almost every time for me in Jenkins, you have to wait for the full 30 min timeout... meanwhile, 8 MB of logs accumulate...
Origin Test Results: SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/352/)
cc @ncdc
Still hanging there, I must still not be accounting for something in the build detection.
To save anybody else looking into this: the builds and deployments are working correctly: the e2e script simply fails to detect that the build has completed and so never moves on to the next checks (which would succeed if only it saw the completed build). Not sure why the race fix here didn't work, but I'm out of time to continue investigating.
hmm... i'm pretty sure i had a run where the build timed out and the docker logs of the build showed it was still pushing.
[INFO] Success running command: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/560dbc67-6158-11e4-98a0-22000b510bf6 | grep complete' after 2020 seconds (from my PR). 33 minutes. things are moving in the wrong direction. i'm going to try pre-pulling centos7 and pushing it into the docker-registry separate from the build so we can at least measure the times semi-independently, and not have the builds timeout.
This should be able to merge now, unfortunately you'll have to rebase first. sorry.
@bparees From the Jenkins log (and this matches my other runs from other PRs including this one):
This indicates that the build pod was started, a newly built image was detected 4 minutes later, a deployment then occurred immediately, and the deployment controller noticed the deployment thereafter and ignored it. The e2e script just stalled trying to detect the first event (build completion). What evidence is there that the build itself was hanging?
@bparees I also see this repeated, which could be a separate problem:
8f4dd36 to e9a2441
[test] again
[test] again
@ironcladlou my evidence for the build itself hanging came from one of my runs that had my new build log dumping logic. In that run the build timed out (after 30 minutes), and the build log showed the start of the push but not its completion, indicating the push was still running when the timer popped.
@ironcladlou see this log:
As you can see we timed out waiting for complete. But the build log shows the build was still pushing to the repository when things were aborted:
2dba3a0 to 843a4d1
The build detection relied on looking up the build ID immediately after a webhook simulation, and then repeatedly querying that build for status. If the ID lookup beats the build record creation, the script will time out looking for a nil build. Look at the build list for status instead to eliminate the race.
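As a rough illustration of the fix described in this commit message (not the actual e2e script), the sketch below polls the whole build list for a completed build instead of first resolving a build ID. `list_builds` is a hypothetical stub that stands in for whatever command prints the build list; it deliberately returns nothing on the first two calls to mimic the window before the build record exists. The names `STATE_FILE` and `wait_for_complete_build`, the `build-1` row, and the attempt and interval values are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: list_builds is a hypothetical stand-in for the real
# "list all builds" command. A counter kept in a temp file lets the stub
# change its answer between calls even when it runs inside a pipeline
# subshell.
STATE_FILE="$(mktemp)"
echo 0 > "${STATE_FILE}"

# Calls 1-2: no output (build record not created yet).
# Calls 3-4: the build is listed as running.
# Call 5 onward: the build is listed as complete.
list_builds() {
  n=$(( $(cat "${STATE_FILE}") + 1 ))
  echo "${n}" > "${STATE_FILE}"
  if [ "${n}" -le 2 ]; then
    :
  elif [ "${n}" -le 4 ]; then
    echo "build-1  running"
  else
    echo "build-1  complete"
  fi
}

# Poll the build list until any build reports "complete" or the attempts
# run out. No build ID is looked up, so an early poll that sees an empty
# list is just another retry rather than a wait on a nil build.
wait_for_complete_build() {
  local max_attempts="$1" interval="$2" attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    if list_builds | grep -q complete; then
      echo "${attempt}"
      return 0
    fi
    sleep "${interval}"
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# Usage: up to 10 polls with no delay between them (values are arbitrary).
attempts="$(wait_for_complete_build 10 0)"
echo "build reported complete after ${attempts} polls"
```

With the stub as written, the first two polls see an empty list and the loop simply keeps retrying. That is the point of polling the list rather than a single ID captured before the record existed.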
843a4d1 to 2d0f8dd
[test] again...
This LGTM. Any reason this can't be merged, Ben?
Do not merge yet, still using this as a canary to sort out the intermittent e2e failures.
[test]
Evaluated for origin up to 2d0f8dd
Origin Action Required: Pull request cannot be automatically merged, please rebase your branch from latest HEAD and push again
Force a rebuild after ./hack/sync-to-origin.sh
…atch-1 Merged by openshift-bot