Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Re-runs should be part of the main job flow #217

Open
Krinkle opened this Issue · 9 comments

4 participants

@Krinkle
Collaborator

Right now they're kind of implemented as an independent thing. So the job looks complete when it really isn't.

Suggesting to change the display that if a run is not "complete" (!passing and runs < max) to not display "fail", but wait for the re-run.

Notes:

  • Keep showing in-progress icons and API return status for those runs. We can still output the current status (green/red) as well as an in between progress indicator, but keep the semantic status on in-progress.
@gnarf
Owner

You might be able to implement a "yellow" status for "failing, but still has retries" in the grid, would make it a little more obvious whats happening, although if it still had the "waiting" indicator & the red result, that would prolly do it too.

@jayarjo

What is the reason behind necessity of reruns? Is it really possible for a test to pass one time, and fail on another, while testing not changing code base?

@Krinkle
Collaborator

@jayarjo Of course the test suite itself does not change. Re-runs are not related to the test suite but are to fix environmental issues. When working on a localhost this may not be an issue in most cases, but when working over the internet the imperfections of our international network really show. Check out Result correction for more information.

If you don't need them, they are easily omitted by simply setting runMax=1.

@jayarjo

Yes sure, just wanted to know a theory behind the solution. Thanks :)

@jzaefferer
Owner

@Krinkle any news on this one?

@jzaefferer
Owner

I've looked into this, mostly to get our Jenkins jobs to wait for reruns instead of declaring failure immediately. I learned something, but not enough to solve this.

Jenkins uses node-testswarm to call the job action to figure out if a job is done or not. For that it iterates through all uaRuns and looks at their status (which I think maps to runresults.status in the DB). As long as they are "new" or "progress", it waits. The problem here is that runresults.status has no concept of reruns (only busy, finished, 2 kinds of timeout).

The concept of rerun exists in the run_useragent table, where status can be idle, busy and done (passed and/or reached max). For the Jenkins API calls, this is what we care about, not (only) the runresult.status. At first glance we can just change this line to map the value of $runUaRow->status to something node-testswarm can understand. Since that value doesn't tell us if the run has passed or failed, its more complicated then that. I can't figure out how to merge the two status values, if that's even the right approach.

@Krinkle
Collaborator

It looks you've got it in memory better than I do at the moment. However here's the approach I would take.

The swarmstate API has a query for computing the number of pending reruns. I'd re-use that logic in the job status API and add an additional property for pending reruns. Then a consumer (e.g. node-testswarm) can decide to wait for reruns by asserting both status=passed and pendingReruns=0.

I remember now why I did not do this originally. The reason is that reruns are restricted to running on a different client. A big factor in why a job failed and might pass if re-run is things that relate to whatever goggled up state or network issues or something else a client might have. That's why we at the very least require a rerun not be distributed to the same client again. The new client could be a new VM worker, or the same one after they refreshed the run page (either is enough to get a new client id).

Because of this requirement (enforced in the getrun API), and because of how testswarm-browserstack works, waiting for reruns to drop to 0 could be an indefinite or at least very long wait. Reruns (both the automated ones and the manual ones), as currently implemented, only work well when used as a way to clean up afterwards, not to block an initial CI response.

While integrating automated reruns into the initial CI run makes sense, that would require testswarm-browserstack to account for it as well. Otherwise you'd be waiting indefinitely (node-testswarm waiting for reruns to complete, and testswarm-browserstack doing nothing since its clients have finished all tests and are not given new tests).

In practice reruns do eventually finish because of margins and delays. E.g. when a job is finished (except for reruns) we don't immediately terminate the worker (for one, it takes us a minute to realise the worker is finished in the first place). So the TestSwarm Run page stays open and continues to poll for more tests to run. Let's start observing at a point in time where job 10 has completed (with pending reruns) and there are no online clients. Job 11 is submitted and VM workers are spawned. The VMs poll the getrun API for tests to run (newest first). When runs for Job 11 are finished, the next poll will yield a rerun for Job 10. Then at some point testswarm-browserstack sees there are no more pending runs (not including reruns) for this worker so it terminates it. There may or may not be any pending reruns left at that point.

@jzaefferer
Owner

As discussed in IRC: We'd have to first update testswarm-browserstack to account for reruns, then TestSwarm API to expose reruns per job, then node-testswarm to wait for them.

Sounds okay to me, we'd finally get some value out of doing reruns.

@jzaefferer
Owner

@Krinkle any estimate if/when you could implement this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.