Jenkins plugin e2e hangs and runs forever #7277

smarterclayton · 2016-02-13T15:11:54Z

Looks like it didn't detect failure and didn't exit?

Failed

Extended Core.[jenkins] openshift pipeline plugin jenkins-plugin test context jenkins-plugin test case execution (from (junit_01.xml))

Failing for the past 2 builds (Since Failed#623 )
Took 1 hr 1 min.
Stacktrace

/data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/jenkins/plugin.go:156
Expected error:
    <*errors.errorString | 0xc208b36160>: {
        s: "the deploy did not finish within 3 minutes",
    }
    the deploy did not finish within 3 minutes
not to have occurred

gabemontero · 2016-02-16T19:38:32Z

@bparees some preliminary research and thoughts on changing this test ... PTAL:

Based on the error message that was returned, we did not observe success in the WaitForDeployment method, which as I recall has received some debate in the past. (By the way, the duration in that message is incorrect: the wait is up to 15 minutes, not the 3 minutes I originally tried. It also would have been nice for me to have printed which deployment was involved.)
The switch here to 15 minutes stemmed from comments from you when the jenkins extended test was introduced.
Otherwise, the logic is pretty much the same as when @stevekuznetsov introduced/changed it back in August.

If I interpret the original error msg correctly (the test ran for 1 hr 1 min and was then kicked out), the hang(s) could have been in one spot or spread out across a few:

We wait for the deployment of jenkins itself (watch / channel based via WaitForDeployment) for up to 15 minutes
We wait for the deployment of frontend from the sample jenkins job (watch / channel based via WaitForDeployment) for up to 15 minutes
We wait for the deployment of frontend-prod from the sample jenkins job (watch / channel based via WaitForDeployment) for up to 15 minutes
We wait for the jenkins job to complete (simple polling of Jenkins via HTTP GET for up to 3 minutes, 1 second at a time)

There is still a bit of a time discrepancy, of course: the 4 waits above add up to 48 minutes (3 × 15 + 3), not 1 hr 1 min. Perhaps the wait for input on the channel is taking longer than we expect?

Initial thoughts on elements of a change:

bump down that 15 min per deployment wait
leave each wait per deployment the same, but have an overall test clock that bails out after X number of minutes
devise an alternative to WaitForDeployment (leaving the existing one as is for the other consumers) that simply polls via client.List instead of employing the watch

Any thoughts?

bparees · 2016-02-16T20:01:28Z

we'd have more info if @smarterclayton had included the full logs of his run since it would show which steps were completed (and possibly timestamps for each step?)

my first guess would be a bug in WaitForADeployment that causes it to hang somewhere inside the loop, so the 15 minute interval never gets checked. For example, if the watch hangs (no events show up, but the watch doesn't get closed), I think you could end up sitting here indefinitely:
https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L222

Given that no new stuff is likely being added to etcd, your watch will probably not expire frequently. (based on my current poor understanding of etcd and watches).

So we probably need a separate goroutine that babysits WaitForADeployment, or a channel switch (a select with a timeout):
http://blog.golang.org/go-concurrency-patterns-timing-out-and

gabemontero · 2016-02-16T20:08:32Z

@bparees thx for the input (including the channel switch stuff) and the +1 on the theory the hang centers around the channel processing.

As an fyi, I may have reproduced the hang locally ... at least running the extended test locally is hanging. I'll see if it is a general env issue by running some other tests. And in looking at the jenkins logs, it says the test job failed. So perhaps a fix for the job failure itself, as well as better error reaction, is in the offing.

gabemontero · 2016-02-16T21:06:11Z

OK ... I did reproduce the test hang locally. Recent plugin usability changes necessitated an update to the test jenkins job config associated with the jenkins extended test. I have a fix in hand.

That said, I'll stash the job config fix temporarily, and work on better error detection and test exit.

smarterclayton added the area/tests and kind/test-flake labels Feb 13, 2016

danmcp assigned bparees Feb 15, 2016

danmcp added the priority/P2 label Feb 15, 2016

bparees assigned gabemontero and unassigned bparees Feb 15, 2016

gabemontero mentioned this issue Feb 17, 2016

fix jenkins testjob xml; fix jenkins ext test deployment error handling #7382

Merged

openshift-bot closed this as completed in #7382 Feb 18, 2016
