
Memory errors / timeouts with affine registration on Windows #694

Closed
matthew-brett opened this Issue Aug 1, 2015 · 3 comments

@matthew-brett
Member

matthew-brett commented Aug 1, 2015

After merging the affine registration branch and Omar's fixes for NaNs, we've got a couple of remaining errors on the Windows buildbots:

======================================================================
ERROR: dipy.align.tests.test_imaffine.test_affreg_all_transforms
----------------------------------------------------------------------
Traceback (most recent call last):
[...]
nose.proxy.MemoryError: 

(http://nipy.bic.berkeley.edu/builders/dipy-bdist32-33/builds/141/steps/shell_8/logs/stdio)

and

dipy.align.tests.test_imaffine.test_affreg_all_transforms ... 
command timed out: 1200 seconds without output, attempting to kill

(http://nipy.bic.berkeley.edu/builders/dipy-py2.6-32/builds/563/steps/shell_6/logs/stdio)

I guess the two errors may be related, with the second caused by the machine starting to thrash on swap rather than raising a MemoryError?
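One way to tell the two failure modes apart would be to log the resident memory of the test process while the suspect test runs. A minimal diagnostic sketch, assuming psutil is installed (it is not a dipy dependency) and that the nose test function is directly callable; this is not part of the dipy test suite:

import psutil  # assumption: installed on the buildbot

from dipy.align.tests import test_imaffine

proc = psutil.Process()
before = proc.memory_info().rss  # resident set size before the test
test_imaffine.test_affreg_all_transforms()
after = proc.memory_info().rss
# A large jump here points at a genuine allocation problem; a small jump
# combined with a long wall-clock time points at contention/swap on the host.
print("RSS grew by %.1f MiB" % ((after - before) / 2.0 ** 20))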

@arokem


Member

arokem commented Sep 25, 2015

Did this one just go away?

@omarocegueda


Contributor

omarocegueda commented Sep 25, 2015

Hi!

Did this one just go away?

Yes, I tried to reproduce the time-out bug on two Windows machines but couldn't; memory usage was always reasonable. I have observed similar time-outs and memory errors on my institution's cluster when another user with admin privileges runs a relatively heavy process. It happens like this:
1. The admin runs a heavy process on a node X without using the scheduler.
2. I send a job to the execution queue.
3. My job is assigned to node X because the scheduler "thinks" it has enough available memory.
4. The job fails, either because the admin's process didn't leave enough memory for my job (producing the MemoryError above) or because that process was heavy enough to make my job run very slowly (producing the time-out).

I have seen a process that normally finishes in 50 minutes time out after 4 hours!

Maybe something similar happened with the buildbot? Do you think it is possible that the buildbot machine was running another heavy process while the test was being executed (an automatic update or something like that)?
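If host-level contention is the culprit, one defensive option would be to skip the heavy test when the machine reports too little free memory. A hedged sketch, assuming psutil is available and using the numpy.testing skip decorator that nose-based suites of this era commonly used; the 2 GiB threshold is an illustrative guess, not a measured requirement of the test:

import psutil  # assumption: available on the buildbot
from numpy.testing import dec

REQUIRED_BYTES = 2 * 2 ** 30  # illustrative headroom; tune per platform

# Skip rather than fail when another process has claimed most of the RAM.
@dec.skipif(psutil.virtual_memory().available < REQUIRED_BYTES,
            "not enough free memory for affine registration test")
def test_affreg_all_transforms():
    pass  # original test body unchanged; stubbed here for brevity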

@arokem


Member

arokem commented Sep 25, 2015

That all makes sense to me; your hypothesis (the system was running some other process, the two together ran into swap, and the machine stalled) sounds reasonable. Either way, I'm going to close this for now.
