Limit the number of concurrent tests in integration.go #6655

Merged
merged 1 commit into kubernetes:master from yujuhong:parallel_tests on Apr 10, 2015

Conversation

@yujuhong
Member

yujuhong commented Apr 9, 2015

This is to address the integration test timeout problem (#6651).

@googlebot googlebot added the cla: yes label Apr 9, 2015

@yujuhong

Member Author

yujuhong commented Apr 9, 2015

Waiting for travis/shippable results to see if this would drastically lengthen the overall test latency.

@vmarmol vmarmol self-assigned this Apr 9, 2015

@vmarmol

Contributor

vmarmol commented Apr 9, 2015

@yujuhong it didn't seem to really change the runtime of Travis.

// Only run at most maxConcurrency tests in parallel.
numFinishedTests := 0
for numFinishedTests < len(testFuncs) {
	testToRun := len(testFuncs) - numFinishedTests

@vmarmol

vmarmol Apr 9, 2015

Contributor

nit: numTestsToRun?

numFinishedTests := 0
for numFinishedTests < len(testFuncs) {
	testToRun := len(testFuncs) - numFinishedTests
	glog.Infof("Running %d tests in parallel.", testToRun)

@vmarmol

vmarmol Apr 9, 2015

Contributor

We should move this after the if; otherwise we may print a wrong value.

@yujuhong

yujuhong Apr 9, 2015

Author Member

Yep, I just saw that. Thanks!

@yujuhong

Member Author

yujuhong commented Apr 9, 2015

Not always. One run logged "I0409 22:34:01.362097 32622 integration.go:1002] Running 6 tests in parallel." instead of running all 8 tests together.

runtime.NumCPU() may not be a good indicator if the machine is overloaded. I just restarted shippable again to see if the number varies between runs. If shippable still times out when it has >8 CPUs, we'll need to set a hard limit (which will affect local runs) or pass a flag to restrict this for shippable or travis only.

@yujuhong yujuhong force-pushed the yujuhong:parallel_tests branch from e8c0ef5 to 91f304f Apr 9, 2015

@vmarmol

Contributor

vmarmol commented Apr 9, 2015

It's probably not a good indicator if the machine is overloaded. It'll just tell us the number of cores on the machine. The number of runnable cores may be less than this (if we get a cpu mask applied to us for example).

@vmarmol

Contributor

vmarmol commented Apr 9, 2015

It may make sense to set this to a static value for now; I don't know how feasible it'll be for us to detect how loaded the CI machine is and change our behavior based on that.

@yujuhong

Member Author

yujuhong commented Apr 9, 2015

In general, we use runtime.NumCPU() in many places to set GOMAXPROCS, so if these overloaded machines have more CPUs, we may use more threads :\

Yes, it's not feasible to detect the load on the machines, and it varies. I was suggesting that maybe we could pass a fixed number in the travis/shippable configuration, but allow developers to use as many cores as their machines have locally. I'll test with a fixed number locally and measure the performance difference. If it's okay, I'll just set it to a static number.

@vmarmol

Contributor

vmarmol commented Apr 9, 2015

WDYT of exposing this as a flag in the integration test? We can default to SOME_NUMBER or allow developers to raise it (with -1 meaning use NumCPU()) if they want to run it faster locally.
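
A minimal sketch of what such a flag could look like (the flag name and default value here are hypothetical, not the merged code; only the maxConcurrency identifier appears in the diff excerpts below):

var maxConcurrency = flag.Int("max-concurrency", 4, "Maximum number of tests to run in parallel; -1 means use runtime.NumCPU().")

func concurrencyLimit() int {
	// Treat -1 as "use every core on this machine", as suggested above.
	if *maxConcurrency == -1 {
		return runtime.NumCPU()
	}
	return *maxConcurrency
}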

@yujuhong

Member Author

yujuhong commented Apr 9, 2015

@vmarmol, that's what I suggested, but maybe I didn't express it well :)

@vmarmol

Contributor

vmarmol commented Apr 9, 2015

Ah I see it now, I'm on-board :D

Limit the number of concurrent tests in integration.go
Integration tests often time out because the machine is loaded. Instead of
increasing the timeout, this change hopes to address the issue by limiting
the number of tests running simultaneously.

Add a new flag in integration.go to specify the maximum number of concurrent
tests. Set the default in travis and shippable configurations to be 4.
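
Pieced together from the review excerpts in this thread, the batching approach looks roughly like the following; the goroutine body is a best-guess reconstruction, not the exact merged code:

// Only run at most maxConcurrency tests in parallel.
numFinishedTests := 0
for numFinishedTests < len(testFuncs) {
	numTestsToRun := len(testFuncs) - numFinishedTests
	if numTestsToRun > maxConcurrency {
		numTestsToRun = maxConcurrency
	}
	glog.Infof("Running %d tests in parallel.", numTestsToRun)
	var wg sync.WaitGroup
	wg.Add(numTestsToRun)
	for i := 0; i < numTestsToRun; i++ {
		f := testFuncs[numFinishedTests+i]
		go func() {
			defer wg.Done()
			f()
		}()
	}
	// Wait for the whole batch to finish before starting the next one.
	wg.Wait()
	numFinishedTests += numTestsToRun
}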

@yujuhong yujuhong force-pushed the yujuhong:parallel_tests branch from 91f304f to faf47b6 Apr 9, 2015

@yujuhong

Member Author

yujuhong commented Apr 10, 2015

Updated and passed all the tests. Only time will tell if this helps...

@vmarmol

Contributor

vmarmol commented Apr 10, 2015

lol, sounds ominous :)

LGTM

vmarmol added a commit that referenced this pull request Apr 10, 2015

Merge pull request #6655 from yujuhong/parallel_tests
Limit the number of concurrent tests in integration.go

@vmarmol vmarmol merged commit 0f4f6cc into kubernetes:master Apr 10, 2015

4 checks passed

Shippable: Shippable builds completed
cla/google: All necessary CLAs are signed
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.01%) to 54.03%
	numTestsToRun = maxConcurrency
}
glog.Infof("Running %d tests in parallel.", numTestsToRun)
var wg sync.WaitGroup

@lavalamp

lavalamp Apr 10, 2015

Member

The canonical way to do this in Go is to start N worker threads all reading from a channel, and then dump everything you want done into the channel.
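
As a rough illustration of that worker-pool pattern (not code from this PR; it assumes testFuncs is a slice of func()):

// Start maxConcurrency workers, each draining the shared channel of test functions.
tasks := make(chan func())
var wg sync.WaitGroup
for i := 0; i < maxConcurrency; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		for task := range tasks {
			task() // each worker keeps pulling tests until the channel is closed
		}
	}()
}
for _, f := range testFuncs {
	tasks <- f
}
close(tasks)
wg.Wait()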

@lavalamp

Member

lavalamp commented Apr 10, 2015

Did this actually reduce the running time? I'd expect doing things more serially to make it slower, not faster.

Also I think running everything at once is valuable for detecting races and making sure asynchronous code keeps working.

@yujuhong

Member Author

yujuhong commented Apr 10, 2015

@lavalamp, thanks for the pointer. @yifan-gu submitted a PR yesterday to improve this (using a buffered channel as a semaphore) in #6669.

The goal of this PR is not to reduce the running time (latency), and it would not. It is meant to reduce the load on the system, since all the tests share the same apiserver, scheduler, etc. (assuming shippable/travis don't schedule more tasks onto one machine). If we keep adding tests, we'll have to scale the timeout values along with them, which is a pain. I think we should have a separate performance suite, similar to our e2e suite, to track performance regressions. Integration tests should focus on testing correctness because (1) we don't control the testing infrastructure, (2) they are too expensive since they run for every PR and a lot of time and effort is wasted in triaging failures, (3) even if we catch some bugs, it is hard to reproduce the case locally, and (4) it makes people ignore the test results.

That said, it is possible that a performance regression (or even a deadlock) is causing the timeout problem. This PR was meant to stop wasting all developers' time. I will file an issue for the potential performance problem. Feel free to chime in or revert this PR if you think running all the tests at once is more valuable :)
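
For reference, the buffered-channel-as-semaphore idea mentioned above generally looks like the sketch below; this is illustrative only, see #6669 for the actual change:

// A channel with capacity maxConcurrency acts as a counting semaphore.
sem := make(chan struct{}, maxConcurrency)
var wg sync.WaitGroup
for _, f := range testFuncs {
	f := f // capture the loop variable for the goroutine
	wg.Add(1)
	go func() {
		defer wg.Done()
		sem <- struct{}{}        // acquire a slot; blocks if maxConcurrency tests are running
		defer func() { <-sem }() // release the slot when the test finishes
		f()
	}()
}
wg.Wait()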

@yujuhong yujuhong deleted the yujuhong:parallel_tests branch May 8, 2015
