Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in e2e test script #315

Closed
wants to merge 1 commit into from

Conversation

ironcladlou
Copy link
Contributor

The build detection relied on looking up the build ID immediately
after a webhook simulation, and then repeatedly querying that build
for status. If the ID lookup beats the build record creation, the
script will time out looking for a nil build.

Look at the build list for status instead to eliminate the race.

@ironcladlou
Copy link
Contributor Author

[test]

@bparees
Copy link
Contributor

bparees commented Oct 31, 2014

lgtm.

@ironcladlou
Copy link
Contributor Author

@soltysh I think this is a big cause of hanging e2e test runs in Jenkins. When it occurs, which is almost every time for me in Jenkins, you have to wait for the full 30 min timeout... meanwhile, 8mb of logs accumulate...

@openshift-bot
Copy link
Contributor

@ironcladlou
Copy link
Contributor Author

cc @ncdc

@ironcladlou
Copy link
Contributor Author

Still hanging there, I must still not be accounting for something in the build detection.

@ironcladlou
Copy link
Contributor Author

To save anybody else looking into this: the builds and deployments are working correctly: the e2e script simply fails to detect that the build has completed and so never moves on to the next checks (which would succeed if only it saw the completed build). Not sure why the race fix here didn't work, but I'm out of time to continue investigating.

@ncdc @bparees @pmorie @abhgupta @smarterclayton

@bparees
Copy link
Contributor

bparees commented Oct 31, 2014

hmm... i'm pretty sure i had a run where the build timed out and the docker logs of the build showed it was still pushing.

@bparees
Copy link
Contributor

bparees commented Nov 1, 2014

[INFO] Success running command: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/560dbc67-6158-11e4-98a0-22000b510bf6 | grep complete' after 2020 seconds

(from my PR). 33 minutes. things are moving in the wrong direction.

i'm going to try pre-pulling centos7 and pushing it into the docker-registry separate from the build so we can at least measure the times semi-independently, and not have the builds timeout.

@bparees
Copy link
Contributor

bparees commented Nov 1, 2014

This should be able to merge now, unfortunately you'll have to rebase first. sorry.

@ironcladlou
Copy link
Contributor Author

@bparees From the Jenkins log (and this matches my other runs from other PRs including this one):

I1031 21:52:36.519941 03717 controller.go:117] Attempting to create pod: &api.Pod{TypeMeta:api.TypeMeta{Kind:"", ID:"build-docker-2f7c4dd5-6148-11e4-931b-22
000b4d0fb8", <snip>
I1031 21:56:26.541171 03717 image_change_controller.go:42] Detecting changed images for deploymentConfig frontend
I1031 21:56:26.612775 03717 deployment_config_controller.go:110] Creating new deployment from config frontend
I1031 22:01:26.649526 03717 deployment_config_controller.go:63] Shouldn't deploy because a deployment 'frontend-1' already exists for latest config frontend

This indicates that the build pod was started, a new built image was detected 4 minutes later, a deployment then occurred immediately, and then the deployment controller noticed the deployment thereafter and ignores. The e2e script just stalled trying to detect the first event (build completion). What evidence is there that the build itself was hanging?

@ironcladlou
Copy link
Contributor Author

@bparees I also see this repeated, which could be a separate problem:

I1031 22:21:49.536373 03717 docker.go:228] Pulling image 172.121.17.1:5001/openshift/origin-ruby-sample without credentials
E1031 22:21:51.995775 03717 kubelet.go:566] Failed to pull image 172.121.17.1:5001/openshift/origin-ruby-sample: Tag latest not found in repository 172.121.17.1:5001/openshift/origin-ruby-sample skipping pod cf83656a-6148-11e4-931b-22000b4d0fb8.default.etcd container ruby-helloworld.

@ironcladlou
Copy link
Contributor Author

[test] again

1 similar comment
@ironcladlou
Copy link
Contributor Author

[test] again

@bparees
Copy link
Contributor

bparees commented Nov 1, 2014

@ironcladlou my evidence for the build itself hanging was that one of my runs where the build timed out (after 30 minutes) and had my new build log dumping logic, the build log showed the start of the push and not the completion, indicating the push was still running when the timer popped.

@bparees
Copy link
Contributor

bparees commented Nov 1, 2014

@ironcladlou see this log:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/333/console

[INFO] Waiting for build to complete
[INFO] Waiting for command to finish: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/60820b81-6149-11e4-90e8-22000b5b1026 | grep complete'...
[ ERR] Gave up waiting for: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/60820b81-6149-11e4-90e8-22000b5b1026 | grep complete'
[FAIL] !!!!! Test Failed !!!!

As you can see we timed out waiting for complete.

But the build log shows the build was still pushing to the repository when things were aborted:

[INFO] Build logs: /tmp/openshift.local.logs.3n1T/build.log
Step 0 : FROM openshift/ruby-20-centos
 ---> a3d20336b598
Step 1 : RUN gem install sinatra sinatra-activerecord mysql2 --no-ri --no-rdoc
 ---> Running in 25908fa85996
Successfully installed rack-1.5.2
Successfully installed tilt-1.4.1
Successfully installed rack-protection-1.5.3
Successfully installed sinatra-1.4.5
Successfully installed i18n-0.6.11
Successfully installed thread_safe-0.3.4
Successfully installed tzinfo-1.2.2
Successfully installed minitest-5.4.2
Successfully installed activesupport-4.1.7
Successfully installed builder-3.2.2
Successfully installed activemodel-4.1.7
Successfully installed arel-5.0.1.20140414130214
Successfully installed activerecord-4.1.7
Successfully installed sinatra-activerecord-2.0.3
Building native extensions.  This could take a while...
Successfully installed mysql2-0.3.16
15 gems installed
 ---> efd4bfaf5242
Removing intermediate container 25908fa85996
Step 2 : ADD sinatra_app /tmp/
 ---> 7bcbe45ee84b
Removing intermediate container 530ce55a75df
Step 3 : USER root
 ---> Running in d216a0943edf
 ---> 15895eb9c85e
Removing intermediate container d216a0943edf
Step 4 : RUN chown -R ruby:ruby /tmp/*
 ---> Running in 05b5d772f52b
 ---> 355f9001c37b
Removing intermediate container 05b5d772f52b
Step 5 : USER ruby
 ---> Running in d78334f9900b
 ---> 1e12903dd440
Removing intermediate container d78334f9900b
Step 6 : WORKDIR /tmp/
 ---> Running in 6dcbc4b99a90
 ---> a5ff95ab7307
Removing intermediate container 6dcbc4b99a90
Step 7 : EXPOSE 8080
 ---> Running in 9157068e75b1
 ---> c6477b921d8d
Removing intermediate container 9157068e75b1
Step 8 : CMD ruby start.rb
 ---> Running in 02086a049ab1
 ---> 0869ed30ff57
Removing intermediate container 02086a049ab1
Successfully built 0869ed30ff57
The push refers to a repository [172.121.17.1:5001/openshift/origin-ruby-sample] (len: 1)
Sending image list
Pushing repository 172.121.17.1:5001/openshift/origin-ruby-sample (1 tags)
[INFO] Tearing down test

The build detection relied on looking up the build ID immediately
after a webhook simulation, and then repeatedly querying that build
for status. If the ID lookup beats the build record creation, the
script will time out looking for a nil build.

Look at the build list for status instead to eliminate the race.
@ironcladlou
Copy link
Contributor Author

[test] again...

@smarterclayton
Copy link
Contributor

This is LGTM, any reason this can't be merged Ben?

@ironcladlou
Copy link
Contributor Author

Do not merge yet, still using this as a canary to sort out the intermittent e2e failures.

@ironcladlou
Copy link
Contributor Author

[test]

@openshift-bot
Copy link
Contributor

Evaluated for origin up to 2d0f8dd

@ironcladlou ironcladlou mentioned this pull request Nov 3, 2014
@openshift-bot
Copy link
Contributor

Origin Action Required: Pull request cannot be automatically merged, please rebase your branch from latest HEAD and push again

@ironcladlou ironcladlou closed this Nov 4, 2014
@ironcladlou ironcladlou deleted the e2e-build-race-fix branch November 7, 2014 20:01
danwinship pushed a commit to danwinship/origin that referenced this pull request Jun 24, 2016
Force a rebuild after ./hack/sync-to-origin.sh
miminar pushed a commit to miminar/origin that referenced this pull request Oct 24, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants