parallelize_tests #1951

Merged

18 commits merged into matplotlib:master from mdboom:parallelize_tests on May 17, 2013


@mdboom
Matplotlib Developers member

This is a possible solution to #1508 to get the test suite running in parallel.

The main fix was to stop using a single long-lived gs process for the conversions and instead launch a new gs process for each one. That makes parallelizing much easier, and the shared process was never much of an optimization anyway. (Alternatively, we could keep using one gs process when not parallelized, but I couldn't figure out how to detect from within nose which mode we were in.)
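The one-shot approach can be sketched like this (the helper name and the exact gs flag set are illustrative, not matplotlib's actual code):

```python
import subprocess

def convert_pdf_to_png(pdf_path, png_path, dpi=100):
    """Build the argv for a one-shot ghostscript PDF -> PNG conversion."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
            "-sDEVICE=png16m", "-r%d" % dpi,
            "-sOutputFile=%s" % png_path, pdf_path]

# Each test spawns its own short-lived gs process, so there is no shared
# child process for the worker processes to coordinate over:
# subprocess.check_call(convert_pdf_to_png("in.pdf", "out.png"))
```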

There was also a bug where the mathtext tests were being generated with different function names than what they were called in the module's namespace, which broke pickling of those functions through multiprocessing.
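For context, pickle serializes a function by its module and (qualified) name and looks it up again on load, so a generated test must live in its module's namespace under its own name. A minimal reconstruction of the failure mode (all names here are made up):

```python
import pickle
import sys
import types

# A stand-in module to hold generated test functions.
mod = types.ModuleType("fake_tests")
sys.modules["fake_tests"] = mod

def _make(name):
    def test():
        pass
    test.__module__ = "fake_tests"
    test.__name__ = name
    test.__qualname__ = name
    return test

good = _make("test_cm")
mod.test_cm = good                 # attribute name matches: picklable
assert pickle.loads(pickle.dumps(good)) is good

bad = _make("test_generated_0")
mod.test_stix = bad                # mismatch: the lookup fails
try:
    pickle.dumps(bad)
except pickle.PicklingError:
    pass                           # this is what multiprocessing hits
```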

On my 4-core i7 machine, the wall clock time used to be 10:32; now it is 3:05 using all 4 cores. Pretty nice!

This also adds more tests to the default suite run by Travis.

@mdboom
Matplotlib Developers member

Does anyone have any thoughts about the test failures? I'm stumped, and I can't reproduce them on an Ubuntu Precise VM I have (which apparently matches what Travis runs).

@WeatherGod
Matplotlib Developers member

Why is the testing/util.py file deleted?

@mdboom
Matplotlib Developers member

testing/util.py contained only one class: an expect-style wrapper used to communicate with a long-lived gs process -- we feed it PDF filenames and it writes out PNG files. Since this approach doesn't work with multiprocessing (each worker would need its own ghostscript process), I just deleted it. It could probably be made to work with some cleverness, but it's a very small optimization over simply creating a new gs process for each image.

@pelson
Matplotlib Developers member

Does anyone have any thoughts about the test failures?

I can't see travis-ci test results anymore (this has been true for some time now) -- my browser crashes and burns when it sees the long log (Firefox 17.0.5). That reduces travis-ci to a pretty useless binary pass/fail indicator for me. Can we look at making the logs shorter again? (I have a "print if fails" executable in cartopy that hides the build output unless something went wrong, for example: https://travis-ci.org/SciTools/cartopy/builds/6686231). As it stands, I can't see past the build output.

@pelson
Matplotlib Developers member

In principle this looks good. I'm guessing you're adding the necessary multiprocess arguments to your python tests.py call since there is no update to that file?

I also can't see what has changed that would make travis-ci use multiprocessing - I would have expected an update to .travis.yml

@mdboom
Matplotlib Developers member

@pelson: I haven't had any trouble getting the logs from Travis lately. They recently made log fetching more chunked, which has helped a lot. But I could capitulate and hide the setup parts of the log if there's no other way.

In any event, for this specific PR, you can get the raw log here:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/6665520/log.txt

As for how to run this with multiprocess -- tests.py merely passes its arguments along to nosetests verbatim, so it's still up to the user to pass --processes=-1 to run the tests in parallel. I think that's the right thing, because we don't want to override any of the nose defaults and surprise people.
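The pass-through can be sketched like this (illustrative names; not the actual tests.py):

```python
import sys

def build_nose_argv(argv=None):
    """Return the argv tests.py would hand to nose.main(argv=...).

    tests.py injects nothing itself: whatever the user typed (e.g.
    --processes=-1 --process-timeout=300) goes through untouched, so
    nose's own defaults stay in force unless the caller opts in.
    """
    return list(sys.argv if argv is None else argv)

argv = build_nose_argv(["tests.py", "--processes=-1", "--process-timeout=300"])
assert "--processes=-1" in argv
```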

I had originally added --processes=-1 to the .travis.yml, but then took it out when the tests failed, to see if that would resolve things. It didn't (the results are identical). So the failures are not the result of multiprocessing but of some other change here, which makes it all the more puzzling.

@mdboom mdboom was assigned Apr 30, 2013
@mdboom
Matplotlib Developers member

Ok -- I've got this mostly working. The initial failures were due to still importing matplotlib.testing.util after it had been removed. :rage1:

It still seems, however, that running the tests in parallel on Travis doesn't work:

The command "cd ../tmp_test_dir" exited with 0.
$ python ../matplotlib/tests.py -sv --processes=-1 --process-timeout=300
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

I'm perfectly fine with merging this as is, because it's still very useful for running tests locally. I'm not entirely sure that Travis would give us access to more than one core per VM anyway -- I find the Travis docs completely unwieldy, so I can't confirm or deny that, but I wouldn't be surprised.

@pelson
Matplotlib Developers member

FWIW you can get at the log output in the form: https://api.travis-ci.org/jobs/{build_id}/log.txt?deansi=true

I was able to reproduce the freezing that travis-ci saw -- it was related to passing a negative value to the processes argument, so instead I went for the less elegant --processes=$(nproc) approach.

After fixing that, I was able to get the tests to run in ~91 seconds with just 3 failures: https://api.travis-ci.org/jobs/6781057/log.txt?deansi=true . I think it might be worth tracking these down (if it's not too thorny). (My extra commits are in my copy of your branch: https://github.com/pelson/matplotlib/tree/multiprocess_testing)

@mdboom
Matplotlib Developers member

@pelson: Thanks for getting to the bottom of the hanging. nosetests uses multiprocessing.cpu_count under the hood, which in turn uses num = os.sysconf('SC_NPROCESSORS_ONLN'). I guess that system call never returns?

In any case, I think your solution is just fine. I think it also makes sense to log the value of $(nproc) for informational purposes.
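Something like the following .travis.yml script section would do that (a sketch based on the commands quoted in this thread; the exact lines are illustrative):

```yaml
script:
  - mkdir ../tmp_test_dir
  - cd ../tmp_test_dir
  # Log the core count so the build output records it.
  - echo "Testing using $(nproc) processes"
  - python ../matplotlib/tests.py -sv --processes=$(nproc) --process-timeout=300
```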

@mdboom
Matplotlib Developers member

Ok -- it seems we're down to one failure: test_bbox_inches_tight_suptile_legend. The image is a different size than expected.

@mdboom
Matplotlib Developers member

With the latest commit (which should not have affected test_bbox_inches_tight_suptile_legend), that test is now passing, so there's evidently some kind of race condition in that test. (That doesn't surprise me; it's always been a bit flaky.) Not sure how to tackle it, though.

I wonder if the inkscape failure may be due to some race condition there -- since it's now possible that many inkscape processes will be launched at the same time, I wouldn't be surprised if there is greater memory pressure, etc.

@mdboom
Matplotlib Developers member

I think I finally have something that works on Travis (excepting the usual Travis network errors). Any volunteers to do a little more local testing before merging?

@WeatherGod
Matplotlib Developers member
@WeatherGod
Matplotlib Developers member
@mdboom
Matplotlib Developers member

@WeatherGod: Is the import error on building, or importing once installed? Are you certain you're in the same virtualenv/version of Python where you have pyparsing installed?

@mdboom
Matplotlib Developers member

@WeatherGod: also -- how are you running the tests? If from nosetests at the command line, I have personally run into issues before where the nosetests command was using a different Python than I intended.

@WeatherGod
Matplotlib Developers member
@mdboom
Matplotlib Developers member

Odd indeed. Here's a wild guess: do you have a pyparsing.py file sitting around from an old matplotlib installation in /nas/home/broot/centos6/lib/python2.7/site-packages/matplotlib-1.3.x-py2.7-linux-x86_64.egg/matplotlib/? matplotlib is still using the old import semantics (it doesn't do from __future__ import absolute_import), so a pyparsing.py in that directory would take precedence over the globally installed one. If that's the case, try cleaning the git directory with git clean -fxd and reinstalling.
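The shadowing mechanism can be demonstrated like this (with a made-up module name, since actually shadowing pyparsing would be destructive): a module earlier on the search path wins, which is effectively what Python 2's implicit relative imports do with a stale file in the package directory.

```python
import os
import sys
import tempfile

# Simulate a stale local copy of a module in a directory that precedes
# site-packages on the search path (the module name is made up).
d = tempfile.mkdtemp()
with open(os.path.join(d, "pyparsing_demo.py"), "w") as fh:
    fh.write("VERSION = 'stale local copy'\n")

sys.path.insert(0, d)  # analogous to the package dir under Python 2
import pyparsing_demo

# The stale copy shadows any globally installed module of that name.
assert pyparsing_demo.VERSION == 'stale local copy'
```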

@WeatherGod
Matplotlib Developers member
@WeatherGod WeatherGod and 1 other commented on an outdated diff May 8, 2013
- python setup.py install
script:
- mkdir ../tmp_test_dir
- cd ../tmp_test_dir
- - python ../matplotlib/tests.py -sv
+ - echo Testing using 4 processes
@WeatherGod
WeatherGod May 8, 2013

The echo statement here doesn't match what is passed on the command line

@mdboom
mdboom May 8, 2013

Indeed. Thanks for catching.

@mdboom
Matplotlib Developers member

@WeatherGod: since setuptools is doing the downloading, my understanding is that it's ok, since the Debian build environment will cause the downloading not to happen. They basically have a generic way to handle all setuptools-based builds.

@WeatherGod
Matplotlib Developers member

So, even with two processes and 4 GB of RAM, I am running into issues where the processes use so much memory (~3 GB each) that my system starts swapping like crazy. And it seems like it isn't making any further progress, even though my processors stay pegged (whenever an io wait finishes, that is). I can do a simple matplotlib.test() run with no problems, so I don't know why doing it with two processes is so much worse.

@mdboom
Matplotlib Developers member

@WeatherGod: Thanks. That's a useful data point. I'm not sure what further to investigate. For me, on Fedora 18 (which shouldn't be fundamentally different from your CentOS box), with 4GB and 4 cores, I'm not seeing runaway memory usage. This is also on Python 2.7. What do you get for:

› gs --version
9.06
› inkscape --version
Inkscape 0.48.4 r9939 (Dec 18 2012)
@WeatherGod
Matplotlib Developers member
@WeatherGod
Matplotlib Developers member

So, on my Ubuntu 12.04 machine, setting it to 2 processes, it just hangs immediately after completing the first test. But, if I run it in single process mode, it works just fine.

ben@tigger:~$ gs --version
9.05
ben@tigger:~$ inkscape --version
Inkscape 0.48.3.1 r9886 (Jan 29 2013)
@mdboom
Matplotlib Developers member

Ok. Interesting -- I guess this is farther from ready than I thought. The Travis tests run in an Ubuntu 12.04 VM, if I recall correctly, so I'm surprised it works there and not for you. We did see hanging initially when trying to use multiprocessing.cpu_count() to get the number of cores, but since you're explicitly specifying 2 processes, you can't just be hitting that problem. Thanks for trying -- I'll have to go back to the drawing board (perhaps try this on an Ubuntu VM myself) and see if I can come up with better questions to ask...

@pelson: Have you tried running the tests locally, or only on Travis?

@mdboom
Matplotlib Developers member

I'm still unable to reproduce @WeatherGod's issue (I tried on a clean Ubuntu 12.04 VM). Very strange.

How do we all feel about this PR? Personally, I'd love to have it in (since it works for me -- sorry to be selfish), and it shouldn't be any worse for anyone not running the tests in parallel.

Travis seems to like running the tests in parallel as well, so I'm leaning toward turning it on in .travis.yml, but that of course is also optional.

@WeatherGod
Matplotlib Developers member
@mdboom
Matplotlib Developers member

After installing matplotlib, from a temporary directory, I run:

$PATH_TO_MATPLOTLIB_SOURCE/tests.py --processes=-1 --process-timeout=300
@pelson
Matplotlib Developers member

This is absolutely fine to be merged. I too have memory problems (Intel® Xeon® E5520 with 8 threads and 5.7 GB addressable RAM on RHEL6 64-bit, Python 2.7), which mean I cannot run the tests in parallel (even with --processes=1), but this PR represents an improvement (and the travis tests are a lot quicker). So 👍 for v1.3.x.

@pelson
Matplotlib Developers member

For the record, running with the -sv flags, my machine froze (and the run was terminated with ctrl+c) at:

$> python tests.py  --processes=1 --process-timeout=300 -sv
...

test_line_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_for_non_affine_transData (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_line_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok
test_pathc_extents_non_affine (matplotlib.tests.test_transforms.TestTransformPlotInterface) ... ok

^C
Traceback (most recent call last):
  File "tests.py", line 20, in <module>
    run()
  File "tests.py", line 17, in run
    defaultTest=default_test_modules)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 118, in __init__
    **extra_args)
  File "lib/python2.7/unittest/main.py", line 95, in __init__
    self.runTests()
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/core.py", line 197, in runTests
    result = self.testRunner.run(self.test)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 349, in run
    timeout=nexttimeout)
  File "<string>", line 2, in get
  File "lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
KeyboardInterrupt
Process Process-2:
Traceback (most recent call last):
  File "lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 625, in runner
    keyboardCaught, shouldStop, loaderClass, resultClass, config)
  File "lib/python2.7/site-packages/nose-1.1.2-py2.7.egg/nose/plugins/multiprocess.py", line 692, in __runner
    keyboardCaught.set()
  File "iris/sci/lib/python2.7/multiprocessing/managers.py", line 1010, in set
    return self._callmethod('set')
  File "lib/python2.7/multiprocessing/managers.py", line 758, in _callmethod
    conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
@WeatherGod
Matplotlib Developers member
@pelson
Matplotlib Developers member

Come to think of it, I also ran my commands with the -sv option.

For the record, I don't think it is because you're running it with -sv -- I just did that to see if I could tell where it stalls (I'm not sure you can, though...)

@mdboom
Matplotlib Developers member

Ok -- it definitely sucks that this is failing for an unknown reason for (at least) @pelson and @WeatherGod, but I think I'll merge it so we at least get the benefits on Travis as we head into the release period, and then I'll open a new issue for getting to the bottom of the failures.

@mdboom mdboom merged commit d39f9c0 into matplotlib:master May 17, 2013

1 check passed

default: The Travis CI build passed
@mdboom mdboom deleted the mdboom:parallelize_tests branch Aug 7, 2014