
Add extended test for pipeline build #11130

Merged
merged 1 commit into openshift:master from wanghaoran1988:test_pipeline on Oct 21, 2016

Conversation

wanghaoran1988
Member

@wanghaoran1988
Member Author

@bparees Please have a look when you can, thanks.

@gabemontero
Contributor

Generally speaking, LGTM. The various checks, etc. look in line with what we do in other tests. Good use of our exutil helpers.

@bparees - I toyed with the idea of suggesting that some sort of examination of the Jenkins job log be done (like we do in our e), but since the deployment endpoint being available is the culmination of activities in the sample pipeline job, its readiness is sufficient indication that things went well, and examining the job log would be redundant.
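For reference, the readiness check in question is the endpoint wait from the test under review (it is quoted again in the review thread below):

g.By("expecting the frontend service get endpoints")
err = oc.KubeFramework().WaitForAnEndpoint("frontend")
o.Expect(err).NotTo(o.HaveOccurred())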

please post the merge comment @bparees at your convenience - thanks.

@bparees
Contributor

bparees commented Sep 28, 2016

[testextended][extended:core(openshift pipeline build)]

@bparees bparees self-assigned this Sep 28, 2016
@bparees
Contributor

bparees commented Sep 28, 2016

@wanghaoran1988 the new test appears to be failing.

@gabemontero
Contributor

gabemontero commented Sep 28, 2016

So I looked at the failed extended test run.
The sample-pipeline build did not complete successfully.
The debug that is dumped shows that the attempt to start the build occurred, but the build timed out after 10 seconds.
At first blush, it would seem to me the test worked "ok": at minimum it uncovered an environmental issue (perhaps 10 seconds is not long enough in our PR testing env), or a problem with the Jenkinsfile strategy.
@bparees any thoughts / corrections?

@gabemontero
Contributor

Some snippets from the console log:

Waiting for build/sample-pipeline-1 to complete
Done waiting for build/sample-pipeline-1: util.BuildResult{BuildPath:"build/sample-pipeline-1", StartBuildStdErr:"", StartBuildStdOut:"build/sample-pipeline-1", StartBuildErr:error(nil), BuildConfigName:"", Build:(*api.Build)(0xc820e1e000), BuildAttempt:true, BuildSuccess:false, BuildFailure:false, BuildTimeout:true, oc:(*util.CLI)(0xc8202723c0)}

and

Sep 28 12:58:41.582: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc logs --namespace=extended-test-jenkins-pipeline-r2be4-x0986 --config=/tmp/openshift/extended-test-jenkins-pipeline-r2be4-x0986-user.kubeconfig -f build/sample-pipeline-1] []   Error from server: Timeout: timed out waiting for build sample-pipeline-1 to start after 10s
 Error from server: Timeout: timed out waiting for build sample-pipeline-1 to start after 10s
 [] <nil> 0xc820f2f7e0 exit status 1 <nil> true [0xc8201500f8 0xc820150178 0xc820150178] [0xc8201500f8 0xc820150178] [0xc820150120 0xc820150160] [0xaf7f70 0xaf80d0] 0xc821025200}:
Error from server: Timeout: timed out waiting for build sample-pipeline-1 to start after 10s
Error during log retrieval: Error retieving logs for util.BuildResult{BuildPath:"build/sample-pipeline-1", StartBuildStdErr:"", StartBuildStdOut:"build/sample-pipeline-1", StartBuildErr:error(nil), BuildConfigName:"", Build:(*api.Build)(0xc820e1e000), BuildAttempt:true, BuildSuccess:false, BuildFailure:false, BuildTimeout:true, oc:(*util.CLI)(0xc8202723c0)}: exit status 1

@bparees
Contributor

bparees commented Sep 28, 2016

we wait an hour for a build to complete.

the 10s timeout you see is from the code that tries to dump the build logs after we've decided the build has failed/timed out. and that's always going to happen, because pipeline builds don't have logs to dump, so it will always time out waiting for those logs. But the real issue is why the build failed/timed out in the first place.
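Roughly, the helper flow being described is (a sketch, not the exact helper code; the field names come from the BuildResult dump above):

br, _ := exutil.StartBuildAndWait(oc, "sample-pipeline") // waits up to the hour for the build to complete
if !br.BuildSuccess {
	// After a failure/timeout the helper tries to stream the build logs.
	// A pipeline build has no builder pod producing logs, so this "oc logs -f"
	// is what hits the server's 10s "waiting for build ... to start" timeout.
	out, err := oc.Run("logs").Args("-f", br.BuildPath).Output()
	fmt.Fprintf(g.GinkgoWriter, "build logs: %s, err: %v\n", out, err)
}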

@gabemontero
Contributor

Ah - gotcha (the 10s piece).

As to the real issue, I suppose it is complicated by the fact that the Jenkins build strategy does not generate build logs in the classic sense.

Any suggestions on what debug mechanism should be added to the extended test exutil bag of tools to capture what is needed (unless the key data is there in this console already and I'm just missing it)?

@bparees
Contributor

bparees commented Sep 28, 2016

@gabemontero yeah we should locate the jenkins pod and dump its logs, at least. that won't get us the job logs, but at least it will tell us if something went wrong inside jenkins itself.
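A minimal sketch of that with the existing CLI wrapper (the DumpDeploymentLogs helper that is eventually used later in this thread wraps something along these lines):

// "dc/jenkins" resolves to the pods of the latest jenkins deployment,
// so we don't need to look up the generated pod name ourselves
out, err := oc.Run("logs").Args("dc/jenkins").Output()
if err == nil {
	fmt.Fprintf(g.GinkgoWriter, "jenkins pod logs:\n%s\n", out)
}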

@wanghaoran1988
Member Author

wanghaoran1988 commented Sep 30, 2016

Build Description:
Name: sample-pipeline-1
Namespace: extended-test-jenkins-pipeline-r2be4-x0986
Created: About an hour ago
Labels: app=jenkins-pipeline-example
buildconfig=sample-pipeline
name=sample-pipeline
openshift.io/build-config.name=sample-pipeline
openshift.io/build.start-policy=Serial
template=application-template-sample-pipeline
Annotations: openshift.io/build-config.name=sample-pipeline
openshift.io/build.number=1

Status: New
Duration: waiting for 1h0m2s
Build Config: sample-pipeline
Build Pod: sample-pipeline-1-build
By the log we can see that the "sample-pipeline-1" build never starts and keeps status "New" for the whole hour; no idea why the build never starts. @gabemontero, could you please help investigate this?

@bparees
Contributor

bparees commented Sep 30, 2016

@wanghaoran1988 start by looking at the jenkins pod logs.

if you walk through the steps from the test manually on a cluster, does it work?

if the build never starts that means the sync plugin in the jenkins pod is not working correctly (or the jenkins pod itself is not running properly at all)
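A quick way to observe that stuck state from the test side (a hedged sketch; the jsonpath query is illustrative):

// a build the sync plugin has picked up should move out of the New phase;
// if this keeps printing New, the plugin never processed the build
phase, err := oc.Run("get").Args("build/sample-pipeline-1", "-o", "jsonpath={.status.phase}").Output()
o.Expect(err).NotTo(o.HaveOccurred())
fmt.Fprintf(g.GinkgoWriter, "build phase: %s\n", phase)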

@wanghaoran1988
Member Author

@bparees, it works when I manually run the steps, and when I ran it again with devenv-rhel7_5101 on AWS, it passed.

@bparees
Contributor

bparees commented Sep 30, 2016

i'm going to rerun it in this PR, but if it fails again we're going to have to dig deeper into the test.

br.AssertSuccess()

g.By("expecting the frontend service get endpoints")
err = oc.KubeFramework().WaitForAnEndpoint("frontend")
Contributor

@wanghaoran1988 so the latest test run passed, which is good, but given the hiccup we saw, I would suggest adding some debug here in case this test catches the issue again after we merge.

Before calling o.Expect(err).NotTo(o.HaveOccurred()), first test if err is not nil and call one of the debug facilities to "locate the jenkins pod and dump its logs" as @bparees noted earlier. The code would look like this:

if err != nil {
   exutil.DumpDeploymentLogs("jenkins", oc)
}

There are means to dump the contents of the jenkins job logs, but they are much more complicated, and based on the details of the hiccup, I'm not even sure the job got started. If need be I can put it in at a later date.

Put in the minimal debug I've outlined (incorporating, of course, any comments @bparees might have) and then we can merge this PR and go from there.

thanks!!

Contributor

oh, still leave the o.Expect(err).NotTo(o.HaveOccurred()) line in after the if block I outlined above.
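Putting the two pieces together, the end result would look like this (sketch):

g.By("expecting the frontend service get endpoints")
err = oc.KubeFramework().WaitForAnEndpoint("frontend")
if err != nil {
	exutil.DumpDeploymentLogs("jenkins", oc) // capture the jenkins pod logs before failing
}
o.Expect(err).NotTo(o.HaveOccurred())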

@wanghaoran1988
Member Author

@gabemontero Updated to dump the jenkins log when the jenkins deployment fails and when the build fails.

@wanghaoran1988
Member Author

wanghaoran1988 commented Oct 2, 2016

The build stays in status "New" again, and this is the error log from the jenkins pod:

INFO: Updated job sample-pipeline from BuildConfig NamespaceName{extended-test-jenkins-pipeline-fuw6c-i0j6q:sample-pipeline} with revision: 766
java.io.IOException: closed
at okhttp3.internal.ws.WebSocketWriter.writeControlFrameSynchronized(WebSocketWriter.java:119)
at okhttp3.internal.ws.WebSocketWriter.writeClose(WebSocketWriter.java:111)
at okhttp3.internal.ws.RealWebSocket.close(RealWebSocket.java:168)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onClose(WatchConnectionManager.java:229)
at okhttp3.internal.ws.RealWebSocket.peerClose(RealWebSocket.java:197)
at okhttp3.internal.ws.RealWebSocket.access$200(RealWebSocket.java:38)
at okhttp3.internal.ws.RealWebSocket$1$2.execute(RealWebSocket.java:84)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
(the same java.io.IOException: closed stack trace repeats twice more)

@gabemontero
Contributor

@wanghaoran1988 thanks for the debug. The stack trace you posted is in the openshift-sync plugin.

@jimmidyson - could you provide some insight here? The context is that we are running the sample-pipeline job from the origin extended tests, and we intermittently see the test fail where the OpenShift-side build hangs in the New state. When debug was added to dump the jenkins pod, the associated jenkins job's output had a series of Java stack traces.

See @wanghaoran1988 's prior comment.

Is this indicative of an environmental-type issue wrt the watches the sync plugin employs? Or is this an error which should be handled and the operation retried, etc.?

thanks.

@jimmidyson
Contributor

Not sure if Jenkins has had the sync plugin upgraded to 0.0.13 yet? Could someone check?

@gabemontero
Contributor

Not yet, but the move to 0.0.13 is in progress. The RPM has been built. I'm waiting on a couple of other items and then will be submitting a pull that should start the image updates.

Does this look like one of the known items fixed in 0.0.13?

@jimmidyson
Contributor

There are some fixes to reconnection that could be this, but I've not seen this exact issue until now.

@gabemontero
Contributor

OK - we'll still wait until the image gets 0.0.13 and see. Though not consistent, it has happened with enough frequency that we should be able to see if it makes a difference. And of course we've raised your awareness.


@gabemontero
Contributor

Oh and with the debug in now we could in theory merge this testcase. Let's see what @bparees says next time he checks email.


@bparees
Contributor

bparees commented Oct 4, 2016

don't merge broken tests. is what i say :)

we have enough flakes.

@gabemontero
Contributor

Roger that :-)


@bparees bparees changed the title Add extened test for pipeline build Add extended test for pipeline build Oct 10, 2016
@gabemontero
Contributor

Went back and looked at the console output.

The slave pod was listed as running.

However, since the name is maven-2ec09ee38ad, it was not caught by our dump of the pod logs on the error.

@wanghaoran1988 - could you add an exutil.DumpDeploymentLogs("maven", oc) call after the exutil.DumpDeploymentLogs("jenkins", oc) call?
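Concretely, the failure path in the test would then read (a sketch based on the snippet under review below):

if !br.BuildSuccess {
	exutil.DumpDeploymentLogs("jenkins", oc)
	exutil.DumpDeploymentLogs("maven", oc) // pick up the maven slave pod logs as well
}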

We'll try a few more runs and see if we catch either of these new flakes.

@bparees - fyi, based on how this goes, I'm leaning toward NOT moving https://github.com/openshift/origin/blob/master/test/extended/jenkins/kubernetes_plugin.go under image_ecosystem when I move https://github.com/openshift/origin/blob/master/test/extended/jenkins/plugin.go

@bparees
Contributor

bparees commented Oct 14, 2016

can we update this pipeline test to not use a slave image? then we don't need to worry about the kubernetes plugin flake issues.

I think we still want a test that does use the slave launcher, but it can be a separate test. @wanghaoran1988 @gabemontero what do you think?

@gabemontero
Contributor

I'm good with that.

And Michal did create a slave launcher test (the kubernetes_plugin.go test I referenced earlier in this PR), though it creates a slave image vs. using one of the predefined ones (re: the issue you assigned me regarding the master-slave example). Like the plugin test, I'm betting it is not getting invoked consistently by any of the existing ci.openshift jobs.


@wanghaoran1988 wanghaoran1988 force-pushed the test_pipeline branch 2 times, most recently from 62c2e1c to 3fab1b0 on October 18, 2016 01:01
@wanghaoran1988
Member Author

@bparees @gabemontero test updated; added a new template with a Jenkinsfile that does not use the maven node.

g.By("starting a pipeline build")
br, _ := exutil.StartBuildAndWait(oc, "sample-pipeline")
if !br.BuildSuccess {
exutil.DumpDeploymentLogs("jenkins", oc)
Contributor

With your switch to ruby, I would add an

exutil.DumpDeploymentLogs("ruby", oc)

call in case we get slave errors there.

Contributor

Never mind - duh, you moved off of slave images per the earlier comment from @bparees.

Member Author

@gabemontero Yes, I removed the slave images

@gabemontero
Contributor

Running extended tests a few times to see if flakes emerge. Also added 1 new review comment.

[testextended][extended:core(openshift pipeline build)]

@gabemontero
Contributor

OK one successful run.

Second run:

[testextended][extended:core(openshift pipeline build)]

@gabemontero
Contributor

Second run successful.

Third:

[testextended][extended:core(openshift pipeline build)]

@gabemontero
Contributor

Third successful run.

Fourth:

[testextended][extended:core(openshift pipeline build)]

@gabemontero
Contributor

Fourth run successful.

Fifth:
[testextended][extended:core(openshift pipeline build)]

@gabemontero
Contributor

@bparees that is 5 successful runs in a row; with the sync plugin update and the move off of slave images, I think this PR is good to go.

Contributor

@bparees bparees left a comment

one spelling nit and please squash the commits and then i'll merge.

err := exutil.WaitForBuilderAccount(oc.KubeREST().ServiceAccounts(oc.Namespace()))
o.Expect(err).NotTo(o.HaveOccurred())
})
g.Context("Manual deploy the jenkins and triger a jenkins pipeline build", func() {
Contributor

triger->trigger

Member Author

Sorry for the typo, updated

@bparees
Contributor

bparees commented Oct 19, 2016

@wanghaoran1988 i still need you to squash your commits.

@wanghaoran1988
Member Author

@bparees squashed

@bparees
Contributor

bparees commented Oct 20, 2016

[merge]

@openshift-bot
Contributor

Evaluated for origin testextended up to 02bc722

@openshift-bot
Contributor

Evaluated for origin merge up to 02bc722

@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

Evaluated for origin test up to 02bc722

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/630/) (Base Commit: 44fd91b) (Extended Tests: core(openshift pipeline build))

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10287/) (Base Commit: 44fd91b)

@openshift-bot
Contributor

openshift-bot commented Oct 21, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10380/) (Base Commit: c94f61a) (Image: devenv-rhel7_5214)

@openshift-bot openshift-bot merged commit 87f1f55 into openshift:master Oct 21, 2016
@wanghaoran1988 wanghaoran1988 deleted the test_pipeline branch November 17, 2016 02:22