Fix race condition in e2e test script #315

ironcladlou · 2014-10-31T21:39:55Z

The build detection relied on looking up the build ID immediately
after a webhook simulation, and then repeatedly querying that build
for status. If the ID lookup beats the build record creation, the
script will time out looking for a nil build.

Look at the build list for status instead to eliminate the race.

ironcladlou · 2014-10-31T21:40:03Z

[test]

bparees · 2014-10-31T21:40:57Z

lgtm.

ironcladlou · 2014-10-31T21:41:40Z

@soltysh I think this is a big cause of hanging e2e test runs in Jenkins. When it occurs, which is almost every time for me in Jenkins, you have to wait for the full 30 min timeout... meanwhile, 8mb of logs accumulate...

openshift-bot · 2014-10-31T21:44:40Z

Origin Test Results: SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/352/)

ironcladlou · 2014-10-31T21:47:49Z

cc @ncdc

ironcladlou · 2014-10-31T22:01:02Z

Still hanging there, I must still not be accounting for something in the build detection.

ironcladlou · 2014-10-31T23:51:23Z

To save anybody else looking into this: the builds and deployments are working correctly: the e2e script simply fails to detect that the build has completed and so never moves on to the next checks (which would succeed if only it saw the completed build). Not sure why the race fix here didn't work, but I'm out of time to continue investigating.

@ncdc @bparees @pmorie @abhgupta @smarterclayton

bparees · 2014-10-31T23:55:36Z

hmm... i'm pretty sure i had a run where the build timed out and the docker logs of the build showed it was still pushing.

bparees · 2014-11-01T01:38:32Z

[INFO] Success running command: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/560dbc67-6158-11e4-98a0-22000b510bf6 | grep complete' after 2020 seconds

(from my PR). 33 minutes. things are moving in the wrong direction.

i'm going to try pre-pulling centos7 and pushing it into the docker-registry separate from the build so we can at least measure the times semi-independently, and not have the builds timeout.

bparees · 2014-11-01T03:30:30Z

This should be able to merge now, unfortunately you'll have to rebase first. sorry.

ironcladlou · 2014-11-01T11:54:46Z

@bparees From the Jenkins log (and this matches my other runs from other PRs including this one):

I1031 21:52:36.519941 03717 controller.go:117] Attempting to create pod: &api.Pod{TypeMeta:api.TypeMeta{Kind:"", ID:"build-docker-2f7c4dd5-6148-11e4-931b-22
000b4d0fb8", <snip>
I1031 21:56:26.541171 03717 image_change_controller.go:42] Detecting changed images for deploymentConfig frontend
I1031 21:56:26.612775 03717 deployment_config_controller.go:110] Creating new deployment from config frontend
I1031 22:01:26.649526 03717 deployment_config_controller.go:63] Shouldn't deploy because a deployment 'frontend-1' already exists for latest config frontend

This indicates that the build pod was started, a new built image was detected 4 minutes later, a deployment then occurred immediately, and then the deployment controller noticed the deployment thereafter and ignores. The e2e script just stalled trying to detect the first event (build completion). What evidence is there that the build itself was hanging?

ironcladlou · 2014-11-01T11:59:57Z

@bparees I also see this repeated, which could be a separate problem:

I1031 22:21:49.536373 03717 docker.go:228] Pulling image 172.121.17.1:5001/openshift/origin-ruby-sample without credentials
E1031 22:21:51.995775 03717 kubelet.go:566] Failed to pull image 172.121.17.1:5001/openshift/origin-ruby-sample: Tag latest not found in repository 172.121.17.1:5001/openshift/origin-ruby-sample skipping pod cf83656a-6148-11e4-931b-22000b4d0fb8.default.etcd container ruby-helloworld.

ironcladlou · 2014-11-01T12:01:55Z

[test] again

ironcladlou · 2014-11-01T12:04:10Z

[test] again

bparees · 2014-11-01T14:44:40Z

@ironcladlou my evidence for the build itself hanging was that one of my runs where the build timed out (after 30 minutes) and had my new build log dumping logic, the build log showed the start of the push and not the completion, indicating the push was still running when the timer popped.

bparees · 2014-11-01T14:50:12Z

@ironcladlou see this log:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/333/console

[INFO] Waiting for build to complete
[INFO] Waiting for command to finish: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/60820b81-6149-11e4-90e8-22000b5b1026 | grep complete'...
[ ERR] Gave up waiting for: '/data/src/github.com/openshift/origin/hack/../_output/go/bin/openshift kube get builds/60820b81-6149-11e4-90e8-22000b5b1026 | grep complete'
[FAIL] !!!!! Test Failed !!!!

As you can see we timed out waiting for complete.

But the build log shows the build was still pushing to the repository when things were aborted:

[INFO] Build logs: /tmp/openshift.local.logs.3n1T/build.log
Step 0 : FROM openshift/ruby-20-centos
 ---> a3d20336b598
Step 1 : RUN gem install sinatra sinatra-activerecord mysql2 --no-ri --no-rdoc
 ---> Running in 25908fa85996
Successfully installed rack-1.5.2
Successfully installed tilt-1.4.1
Successfully installed rack-protection-1.5.3
Successfully installed sinatra-1.4.5
Successfully installed i18n-0.6.11
Successfully installed thread_safe-0.3.4
Successfully installed tzinfo-1.2.2
Successfully installed minitest-5.4.2
Successfully installed activesupport-4.1.7
Successfully installed builder-3.2.2
Successfully installed activemodel-4.1.7
Successfully installed arel-5.0.1.20140414130214
Successfully installed activerecord-4.1.7
Successfully installed sinatra-activerecord-2.0.3
Building native extensions.  This could take a while...
Successfully installed mysql2-0.3.16
15 gems installed
 ---> efd4bfaf5242
Removing intermediate container 25908fa85996
Step 2 : ADD sinatra_app /tmp/
 ---> 7bcbe45ee84b
Removing intermediate container 530ce55a75df
Step 3 : USER root
 ---> Running in d216a0943edf
 ---> 15895eb9c85e
Removing intermediate container d216a0943edf
Step 4 : RUN chown -R ruby:ruby /tmp/*
 ---> Running in 05b5d772f52b
 ---> 355f9001c37b
Removing intermediate container 05b5d772f52b
Step 5 : USER ruby
 ---> Running in d78334f9900b
 ---> 1e12903dd440
Removing intermediate container d78334f9900b
Step 6 : WORKDIR /tmp/
 ---> Running in 6dcbc4b99a90
 ---> a5ff95ab7307
Removing intermediate container 6dcbc4b99a90
Step 7 : EXPOSE 8080
 ---> Running in 9157068e75b1
 ---> c6477b921d8d
Removing intermediate container 9157068e75b1
Step 8 : CMD ruby start.rb
 ---> Running in 02086a049ab1
 ---> 0869ed30ff57
Removing intermediate container 02086a049ab1
Successfully built 0869ed30ff57
The push refers to a repository [172.121.17.1:5001/openshift/origin-ruby-sample] (len: 1)
Sending image list
Pushing repository 172.121.17.1:5001/openshift/origin-ruby-sample (1 tags)
[INFO] Tearing down test

The build detection relied on looking up the build ID immediately after a webhook simulation, and then repeatedly querying that build for status. If the ID lookup beats the build record creation, the script will time out looking for a nil build. Look at the build list for status instead to eliminate the race.

ironcladlou · 2014-11-03T14:18:44Z

[test] again...

smarterclayton · 2014-11-03T14:58:35Z

This is LGTM, any reason this can't be merged Ben?

ironcladlou · 2014-11-03T14:59:18Z

Do not merge yet, still using this as a canary to sort out the intermittent e2e failures.

ironcladlou · 2014-11-03T22:53:51Z

[test]

openshift-bot · 2014-11-03T22:58:46Z

Evaluated for origin up to 2d0f8dd

openshift-bot · 2014-11-04T18:58:44Z

Origin Action Required: Pull request cannot be automatically merged, please rebase your branch from latest HEAD and push again

Force a rebuild after ./hack/sync-to-origin.sh

…atch-1 Merged by openshift-bot

ironcladlou mentioned this pull request Oct 31, 2014

Adding metadata for fired trigger to the deployment #302

Merged

ironcladlou force-pushed the e2e-build-race-fix branch from 8f4dd36 to e9a2441 Compare November 1, 2014 12:01

ironcladlou force-pushed the e2e-build-race-fix branch from 2dba3a0 to 843a4d1 Compare November 3, 2014 14:17

ironcladlou force-pushed the e2e-build-race-fix branch from 843a4d1 to 2d0f8dd Compare November 3, 2014 14:18

ironcladlou mentioned this pull request Nov 3, 2014

Fixed buildLogs #323

Merged

ironcladlou closed this Nov 4, 2014

ironcladlou deleted the e2e-build-race-fix branch November 7, 2014 20:01

danwinship pushed a commit to danwinship/origin that referenced this pull request Jun 24, 2016

Merge pull request openshift#315 from danwinship/sync-rebuild

0dc6df5

Force a rebuild after ./hack/sync-to-origin.sh

miminar pushed a commit to miminar/origin that referenced this pull request Oct 24, 2016

Merge pull request openshift#315 from smarterclayton/smarterclayton-p…

83b71c5

…atch-1 Merged by openshift-bot

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition in e2e test script #315

Fix race condition in e2e test script #315

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

bparees commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

openshift-bot commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

bparees commented Oct 31, 2014

bparees commented Nov 1, 2014

bparees commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

bparees commented Nov 1, 2014

bparees commented Nov 1, 2014

ironcladlou commented Nov 3, 2014

smarterclayton commented Nov 3, 2014

ironcladlou commented Nov 3, 2014

ironcladlou commented Nov 3, 2014

openshift-bot commented Nov 3, 2014

openshift-bot commented Nov 4, 2014

Fix race condition in e2e test script #315

Fix race condition in e2e test script #315

Conversation

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

bparees commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

openshift-bot commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

ironcladlou commented Oct 31, 2014

bparees commented Oct 31, 2014

bparees commented Nov 1, 2014

bparees commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

ironcladlou commented Nov 1, 2014

bparees commented Nov 1, 2014

bparees commented Nov 1, 2014

ironcladlou commented Nov 3, 2014

smarterclayton commented Nov 3, 2014

ironcladlou commented Nov 3, 2014

ironcladlou commented Nov 3, 2014

openshift-bot commented Nov 3, 2014

openshift-bot commented Nov 4, 2014