Jenkins plugin e2e hangs and runs forever #7277

Closed
smarterclayton opened this issue Feb 13, 2016 · 4 comments
Labels: area/tests, kind/test-flake (categorizes issue or PR as related to test flakes), priority/P2

Comments

@smarterclayton
Contributor

Looks like it didn't detect failure and didn't exit?

Failed

Extended Core.[jenkins] openshift pipeline plugin jenkins-plugin test context jenkins-plugin test case execution (from (junit_01.xml))

Failing for the past 2 builds (Since Failed#623 )
Took 1 hr 1 min.
Stacktrace

/data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/jenkins/plugin.go:156
Expected error:
    <*errors.errorString | 0xc208b36160>: {
        s: "the deploy did not finish within 3 minutes",
    }
    the deploy did not finish within 3 minutes
not to have occurred
@smarterclayton smarterclayton added area/tests kind/test-flake Categorizes issue or PR as related to test flakes. labels Feb 13, 2016
@bparees bparees assigned gabemontero and unassigned bparees Feb 15, 2016
@gabemontero
Contributor

@bparees some preliminary research and thoughts on changing this test ... PTAL:

  1. Based on the error message that was returned, we did not observe success in the WaitForDeployment method, which as I recall has received some debate in the past. (By the way, the message states the wrong duration: the wait is now up to 15 minutes, not the 3 minutes I originally tried. It would also have been nice if I had printed which deployment timed out.)

  2. the switch here to 15 minutes stemmed from comments from you when the jenkins extended test was introduced.

  3. otherwise, the logic is pretty much the same as it has been since @stevekuznetsov introduced/changed it back in August

If I interpret the original error msg correctly that the test ran for 1h and 1 min and then was kicked out, the hang(s) could have been in 1 spot or potentially spread out across a few:

  1. We wait for the deployment of jenkins itself (watch/channel based, via WaitForDeployment) for up to 15 minutes
  2. We wait for the deployment of frontend from the sample jenkins job (watch/channel based, via WaitForDeployment) for up to 15 minutes
  3. We wait for the deployment of frontend-prod from the sample jenkins job (watch/channel based, via WaitForDeployment) for up to 15 minutes
  4. We wait for the jenkins job to complete (simple polling of Jenkins via HTTP GET, 1 second at a time, for up to 3 minutes)

There is still a bit of a time discrepancy, of course (the four waits above add up to 48 minutes, not the 1 hr 1 min reported). Perhaps the wait for input on the channel is taking longer than we expect?

Initial thoughts on elements of a change

  1. bump down that 15 min per deployment wait
  2. leave each wait per deployment the same, but have an overall test clock that bails out after X number of minutes
  3. devise an alternative to WaitForDeployment (leave the existing one as is for the other consumers) that simply polls via client.List vs. employing the watch

Any thoughts?

@bparees
Contributor

bparees commented Feb 16, 2016

we'd have more info if @smarterclayton had included the full logs of his run since it would show which steps were completed (and possibly timestamps for each step?)

my first guess would be a bug in WaitForADeployment that causes it to get hung somewhere inside the loop and thus the 15 minute interval doesn't get checked. For example if the watch hangs (no events show up, but the watch doesn't get closed), i think you could end up sitting here indefinitely:
https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L222

Given that no new stuff is likely being added to etcd, your watch will probably not expire frequently. (based on my current poor understanding of etcd and watches).

So we probably need a separate goroutine that babysits WaitForADeployment, or use a channel switch (a select with a timeout):
http://blog.golang.org/go-concurrency-patterns-timing-out-and

@gabemontero
Contributor

@bparees thx for the input (including the channel switch stuff) and the +1 on the theory the hang centers around the channel processing.

As an FYI, I may have reproduced the hang locally ... at least, running the extended test locally is hanging. I'll see if it is a general env issue by running some other tests. And looking at the jenkins logs, they say the test job failed. So perhaps a fix for the job failure itself, as well as better error reaction, is in the offing.

@gabemontero
Contributor

OK ... I did reproduce the test hang locally. Recent plugin usability changes necessitated an update to the test jenkins job config associated with the jenkins extended test. I have a fix in hand.

That said, I'll stash the job config fix temporarily, and work on better error detection and test exit.
