[MRG] Custom pickling pool for no-copy memmap handling with multiprocessing #44

Merged
merged 71 commits on Jul 31, 2013

Conversation

ogrisel
Contributor

@ogrisel ogrisel commented Sep 2, 2012

This is a new multiprocessing.Pool subclass to better deal with the shared memory situation as discussed at the end of the comment thread of #43.

Feedback welcome. I plan to do more testing, benchmarking, documentation and joblib.parallel integration, and to validate on sklearn's RandomForestClassifier use case in the coming week.

TODO before merge:

  • doctest fixture to skip memmap doctest when numpy is not installed
  • support for memmap instance in the .base attribute of an array
  • add a joblib.has_shareable_memory utility function to detect data structures with shared memory (see the sketch after these lists).

Tasks left as future work for another pull request:

  • add support for multiprocessing Lock if possible
  • demonstrate concurrent read / write access with Lock + doc
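
A rough sketch of how the planned has_shareable_memory helper could be used; the exact import path (here assumed to be joblib.pool.has_shareable_memory) is an assumption:

import numpy as np
from joblib import dump, load
from joblib.pool import has_shareable_memory  # assumed location of the helper

a = np.zeros(10)                        # plain in-memory array
dump(a, '/tmp/a.pkl')
b = load('/tmp/a.pkl', mmap_mode='r')   # np.memmap backed by the dumped file

print(has_shareable_memory(a))  # False: would be copied when sent to a worker
print(has_shareable_memory(b))  # True: backed by shared (memmapped) memory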

@travisbot

This pull request fails (merged c1caab3 into 8aa6e48).

@travisbot

This pull request fails (merged fff0404 into 8aa6e48).

@GaelVaroquaux
Member

I am looking at this PR on my mobile phone, so I may fail to see the big picture, but it seems to me that the solution you found will not work in general (with raw multiprocessing, or another parallel computing engine). Am I right?

@ogrisel
Contributor Author

ogrisel commented Sep 3, 2012

It will work as long as you use the joblib.pool.MemmapingPool implementation instead of multiprocessing.Pool. Once we get that approach validated as working for joblib / sklearn I might submit that improvement upstream to the python std lib, as it might be useful in other situations as well. For instance we could provide a reducer at the mmap type level instead of the numpy.memmap level.
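
To illustrate what a reducer at the type level means in practice, here is a simplified, hedged sketch of the general technique (not this PR's actual code, which ships its own customizable picklers):

import numpy as np
from multiprocessing.reduction import ForkingPickler  # Python 3 location

def reduce_memmap(a):
    # Pickle a memmap as "re-open the backing file" instead of copying its buffer.
    return (np.memmap, (a.filename, a.dtype, 'r+', a.offset, a.shape))

# From now on every np.memmap sent to a worker is rebuilt from its file, not copied.
ForkingPickler.register(np.memmap, reduce_memmap)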

Review comment on this docstring excerpt:

    copy between the parent and child processes.

    This module should not be imported if multiprocessing is not
    available. as it implements subclasses of multiprocessing Pool

Typo: comma instead of dot.

@travisbot

This pull request fails (merged b5810a5 into 8aa6e48).

@travisbot

This pull request fails (merged 6853472 into 8aa6e48).

@travisbot

This pull request fails (merged 25eddbe into 8aa6e48).

@ogrisel
Contributor Author

ogrisel commented Sep 4, 2012

@GaelVaroquaux do you still want to support python 2.5 and the lack of multiprocessing for the next release? It seems that it makes the code very complicated to maintain and would make it harder to support both python 3 and python 2 in a single codebase.

@GaelVaroquaux
Member

@GaelVaroquaux do you still want to support python 2.5 and the lack of multiprocessing for the next release?

I'd prefer to. Python 2.6 was only released in 2008. joblib's goal is to be
as seamless as possible.

@ogrisel
Contributor Author

ogrisel commented Sep 4, 2012

Alright, I will try to rework the tests to skip them when multiprocessing is not available.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

@glouppe I think the current state of the branch is good enough to start experimenting on real world machine learning problems (extra trees / RF and cross validation / grid search with n_jobs >> 2). np.array instances passed as input to Parallel operations should automatically be dumped to filesystem-backed np.memmap instances if they are larger than 1MB (this can be changed or even disabled using the max_nbytes arg).

When using the Parallel high level API, all temporary files are collected automatically without the need for explicit user intervention. Finer-grained control can be obtained by using the joblib.pool.MemmapingPool API directly (it has the same public API as multiprocessing.pool.Pool).
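
As an illustration of the two levels of API described above (argument values are just examples):

import numpy as np
from joblib import Parallel, delayed
from joblib.pool import MemmapingPool

data = np.random.randn(500, 1000)  # ~4MB, above the default 1MB threshold

# High level: large inputs are automatically dumped to a memmap before dispatch
# and the temporary files are collected when the Parallel call returns.
sums = Parallel(n_jobs=2, max_nbytes=1e6)(
    delayed(np.sum)(data[i::2]) for i in range(2))

# Lower level: same public API as multiprocessing.pool.Pool, finer-grained control.
pool = MemmapingPool(processes=2, max_nbytes=1e6)
try:
    sums = pool.map(np.sum, [data[0::2], data[1::2]])
finally:
    pool.terminate()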

@GaelVaroquaux I would appreciate it if you could have a look at the implementation. If you agree with the design and @glouppe's tests work as expected, I will start writing the missing documentation.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

Oh and BTW, I restored the Python 2.5 compatibility, as TravisBot reports with the green dot next to the latest commits.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

I have launched a c1.xlarge (8 cores, 7GB RAM) box on EC2 to run the ExtraTrees covertype benchmark with n_jobs=-1, just replacing sklearn's joblib folder with a symlink to the joblib folder from this branch. I observed a 1.7GB reduction in memory usage and a training time that decreased from 96s to 91s (memory allocation is actually expensive enough to be measurable on this benchmark, and memmapping spares some of those allocations in the subprocesses). I find this very cool for a simple drop-in replacement.

Those numbers could be reduced even further if the original dataset were memmapped directly rather than loaded as numpy arrays, but that would require a change in the benchmark script.

@ogrisel ogrisel mentioned this pull request Sep 9, 2012
@glouppe

glouppe commented Sep 9, 2012

That looks great! I will have a deeper look at it tomorrow :)

@glouppe

glouppe commented Sep 10, 2012

I have run a quick test on my machine and it seems to work like a charm :) Building a forest of 10 extra-trees on mnist3vs8 gives the following results:

master:

  • 1 job: 35s, 120MB
  • 2 jobs: 18s, 290MB

pickling-pool:

  • 1 job: 27s, 140MB
  • 2 jobs: 16s, 140MB

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :)

Once my colleagues arrive, I'll try a bigger task on our 48-core 512GB machine! I'll keep you posted.

@glouppe

glouppe commented Sep 10, 2012

Those numbers could even be further reduced if the original dataset would be memmaped directly rather than loaded as numpy arrays but that would require a change in the benchmark script to do so.

Do you mean that an extra copy is made? Could we fix that when X is checked and converted?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :) Once my colleagues arrive, I'll try a bigger task on our 48-core 512Gb machine! I keep you posted.

Thanks for those early tests. I am really looking forward to the tests on your "real life" work environment :) Most likely you will no longer need those 512GB of RAM :)

Do you mean that an extra-copy is made? Could we fix that at checking and conversion time of X?

Yeah, if you load your data from disk into a numpy array, this array will stay in the memory of the master process while a memmap copy is also allocated (once) and shared with the subprocesses for the duration of the computation.

In order to get rid of the initial extra copy, assuming you load the data into variable X with the right dtype and memory layout (C or Fortran), you can do:

from joblib import dump, load

filename = '/tmp/cached_source_data.pkl'
dump(X, filename)                   # write X once to disk in a memmap-friendly layout
X = load(filename, mmap_mode='c')   # reload it as a copy-on-write np.memmap

The original X array will be garbage collected (assuming no other references to it) and the memmap instance will directly be shared with the subprocesses from now on.

The same remark applies to the X_argsorted internal data structure. When n_jobs != 1, we could make ExtraTrees* and RandomForest* allocate X_argsorted directly as a np.memmap array in w+ mode pointing to a temporary file (stored as an attribute of the main estimator class and collected in the __del__ method) instead of an in-memory numpy array. That would remove another memory copy.
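
A hypothetical sketch of that idea (names, dtype and shape are illustrative, not sklearn's actual code), reusing the memmapped X from the snippet above:

import tempfile
import numpy as np

# Allocate the internal buffer directly as a writable memmap over a temporary file...
tmp = tempfile.NamedTemporaryFile(suffix='_argsorted.mmap', delete=False)
X_argsorted = np.memmap(tmp.name, dtype=np.int32, mode='w+', shape=X.shape)

# ...fill it once in the parent process...
X_argsorted[:] = np.argsort(X, axis=0)

# ...and let the subprocesses address the same pages instead of receiving a pickled copy.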

@bdholt1

bdholt1 commented Sep 10, 2012

Thanks @ogrisel for this PR! I'm giving it a go on my 13GB dataset (X.shape = (1678300, 2000)) on a machine with 32GB of RAM... So to train a forest I'm going to have to either train one tree at a time or use memmapping.

I followed your instructions (working on Windows 7, EPD 64-bit + cygwin) and I'm getting a strange error:

File "E:\brian\poseme\rgbd\train_rgbd_trees.py", line 133, in train_trees
  X = joblib.load(x_filename, mmap_mode='c')
File "E:\brian\scikit-learn\sklearn\externals\joblib\numpy_pickle.py", line 418, in load
  obj = unpickler.load()
File "C:\Python27\lib\pickle.py", line 858, in load
  dispatch[key](self)
KeyError: '\x93'

I get the same problem when mmap_mode=None, and no problems when I use np.load(x_filename, mmap_mode=None). Also, I've saved X as a binary Fortran array .npy. Any thoughts on how to proceed?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Thanks @bdholt1 for testing (especially under Windows, as I could not find the motivation so far :). Could you please add a print statement / pdb breakpoint on line 858 of pickle.py to display the content of the self variable when key is '\x93'?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

BTW, to be able to work with your 13GB dataset you need to memmap it directly upstream as a float32 / Fortran array and make sure that X_argsorted is memmapped too from the start (that requires a change in sklearn's source code, not just your script). I would suggest starting with a smaller subset of your data (e.g. ~5GB) to make sure that n_jobs >> 1 behaves as expected first, and then moving on to optimizing sklearn / your script afterwards.

Also, could you please run the joblib test suite under Windows? You just need to run nosetests in the top folder.

@bdholt1

bdholt1 commented Sep 10, 2012

On line 858 of pickle.py I added:
from pprint import pprint
print self
pprint(vars(self))
dispatch[key](self)

which prints:

<sklearn.externals.joblib.numpy_pickle.NumpyUnpickler instance at 0x0000000006F7C248>
{'_dirname': 'H:/RGB-D_2D/features',
 '_filename': 'rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy',
 'append': <built-in method append of list object at 0x0000000006F7D248>,
 'file_handle': <open file 'H:/RGB-D_2D/features/rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy', mode 'rb' at 0x0000000006EDBE40>,
 'mark': <object object at 0x0000000001DC80E0>,
 'memo': {},
 'mmap_mode': 'c',
 'np': <module 'numpy' from 'C:\Python27\lib\site-packages\numpy\__init__.pyc'>,
 'read': <built-in method read of file object at 0x0000000006EDBE40>,
 'readline': <built-in method readline of file object at 0x0000000006EDBE40>,
 'stack': []}

I have no need for X_argsorted because I'm using the lazy argsort branch :)

After running nosetests I don't see a summary report (apart from the OK at the end), so I checked every statement looking for failures and this looked like the only possible problem:
joblib.test.test_numpy_pickle.test_numpy_persistence ... Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(2.57e-322)> ignored

@bdholt1

bdholt1 commented Sep 10, 2012

When you say "memmap it directly upstream in float32 / Fortran array", I think I've done that, in that the data stored on disk is a float32 Fortran binary .npy file. Will that work?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

When you say "memmap it directly upstream in float32 / Fortran array", I think I've done that, in that the data stored on disk is a float32 Fortran binary .npy file. Will that work?

If you used the default numpy.save, I don't think there is an easy way to memmap it directly. So you will need to either write your own memmap serialization code or use joblib.dump / joblib.load as explained previously.

About the KeyError: apparently under Windows, multiprocessing passes more stuff to the child processes as pickles (probably due to the lack of fork on Windows). I need to understand why it's trying to pickle the joblib.numpy_pickle.NumpyUnpickler class itself, as that sounds useless to me. I will probably need to debug that on a Windows VM. Can you please post a minimalistic python script (with as little data as possible, possibly generated using np.random.randn) that reproduces this error as a new http://gist.github.com/ ?
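
With hindsight from the later comments, one possible shape for such a script (file names and sizes are purely illustrative):

import numpy as np
from sklearn.externals import joblib

# Save with plain np.save, then reload with joblib.load, as in the report above.
X = np.asfortranarray(np.random.randn(1000, 50).astype(np.float32))
np.save('X_test.npy', X)

# On the reported setup this raises KeyError: '\x93' because the .npy file
# was not written by joblib.dump.
X = joblib.load('X_test.npy', mmap_mode='c')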

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

BTW:

>>> print "\x93"
?

Sounds appropriate :)

@bdholt1

bdholt1 commented Sep 10, 2012

Ok, I think I made a mistake by not using joblib.dump. Now that I have done it, I no longer experience the problem with joblib.load mentioned above.

Now it should be working but the behaviour is a bit weird. Watching my memory usage at X = joblib.load(x_filename, mmap_mode='c'), it doesn't appear to load anything into memory. Is this right? Should it simply point to the data on disk and load it when needed?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Now it should be working but the behaviour is a bit weird. Watching my memory usage at X = joblib.load(x_filename, mmap_mode='c'), it doesn't appear to load anything into memory. Is this right? Should it simply point to the data on disk and load it when needed?

This is right, and it is the beauty of memmapping: data is loaded (page by page) by the kernel, and cached by the kernel's disk cache, only when needed by the processes addressing this virtual memory segment.
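
A small illustration of that behaviour (the path and sizes are arbitrary):

import numpy as np
from joblib import dump, load

dump(np.zeros((10000, 1000)), '/tmp/big_array.pkl')   # ~80MB written to disk

X = load('/tmp/big_array.pkl', mmap_mode='c')
print(type(X))       # numpy.memmap: nothing has been read into memory yet
print(X[:10].sum())  # touching a slice only faults in the pages backing it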

@glouppe

glouppe commented Sep 10, 2012

I don't think this is directly related, but we got the following error when we consider datasets that are too large (from 234000*1728 and beyond).

python2.6: ./Modules/cStringIO.c:419: O_cwrite: Assertion `oself->pos + l < 2147483647' failed.

We already had that bug before, but it disappeared without any good reason... From what I googled, this seems to come from Python itself, which does not handle such large objects... :/ There has been an open ticket since 2009 concerning that bug but no one seems to have solved it.

http://bugs.python.org/issue7358

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

Rebased on top of master. @GaelVaroquaux I think this is ready to be merged (once travis is green again after the rebase).

@GaelVaroquaux
Member

Still breaking on my box :$. I'll try to check that and fix it.

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

Which version of numpy are you using? I disabled the mitigation strategy for broken numpy versions. I could not reproduce the issue on the released 1.7.0 nor on the 1.7.1 version.

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.

@GaelVaroquaux
Member

Which version of numpy are you using?

1.8.0.dev-1ea1592

I disabled the mitigation strategy for broken numpy versions.

That's probably why.

@GaelVaroquaux
Member

If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.

Don't bother. It works on numpy master. I pushed your branch to joblib
master (merged).

Could you please make an announcement on joblib's mailing list (and maybe
twitter) to push people to use this new feature and report problems?

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

Are you sure? I don't see it on master.

@GaelVaroquaux GaelVaroquaux merged commit fc7b0de into joblib:master Jul 31, 2013
@GaelVaroquaux
Member

Are you sure? I don't see it on master.

Why, oh why do I keep messing my merges up...

It should be good now. I am just surprised that I got a merge.

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

:)

What do you think of tagging the current master and synchronizing the joblib version of scikit-learn master on that tag to get a wider testing base?

@GaelVaroquaux
Member

What do you think of tagging the current master and synchronizing the joblib
version of scikit-learn master on that tag to get a wider testing base?

I usually have a policy of not having an unreleased joblib in sklearn.
But... rules are meant to be bent.

How about finishing the release of sklearn (I am somewhat
bandwidth-limited currently) and merging these changes in sklearn after?

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

We can also do a major release just for that :)

But do as you wish. I am in no hurry.

@jni

jni commented Jul 31, 2013

I'm in a hurry! This is what I've been waiting for for over a year! ;) Thanks guys!

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

@jni since the version you know, the main difference is that the auto-memmap feature now uses the ramdisk partition /dev/shm when available. That should remove the need to touch the hard drive under Linux. Under OS X and Windows, the temp folder is still used by default.
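
A hedged sketch of how this can be exercised explicitly (the temp_folder keyword is an assumption about this branch's MemmapingPool API; when it is left unset, /dev/shm is picked automatically if available):

import numpy as np
from joblib.pool import MemmapingPool

data = np.random.randn(2000, 1000)

# Pin the temporary memmap files to the shared-memory tmpfs partition.
pool = MemmapingPool(processes=2, temp_folder='/dev/shm', max_nbytes=1e6)
try:
    print(pool.map(np.mean, [data] * 4))
finally:
    pool.terminate()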

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

Looking forward to any benchmarks, memory profile traces or bug reports.

@mluessi
Contributor

mluessi commented Jul 31, 2013

Great :) now I can revive mne-tools/mne-python#99 that uses this feature. This should really help with non-parametric statistics which currently uses crazy amounts of memory when running in parallel.

@GaelVaroquaux
Member

This should really help with non-parametric statistics which currently uses crazy amounts of memory when running in parallel.

Note that saving memory was already possible by passing around file names
and memmapping them when reading. This PR just makes everything easier.

@mluessi
Contributor

mluessi commented Jul 31, 2013

@GaelVaroquaux yes, I know.. but this feature makes it much easier :)

@GaelVaroquaux
Member

Indeed. All the credit goes to @ogrisel. Tell us how it goes in your
usage.
