[MRG] Custom pickling pool for no-copy memmap handling with multiprocessing #44

Merged
merged 71 commits on Jul 31, 2013

Conversation

ogrisel
Contributor

@ogrisel ogrisel commented Sep 2, 2012

This is a new multiprocessing.Pool subclass to better deal with the shared memory situation as discussed at the end of the comment thread of #43.

Feedback welcome. I plan to do more testing, benchmarking, documentation and joblib.parallel integration, and to validate on sklearn's RandomForestClassifier use case in the coming week.

TODO before merge:

  • doctest fixture to skip memmap doctest when numpy is not installed
  • support for memmap instance in the .base attribute of an array
  • add a joblib.has_shareable_memory utility function to detect data structures with shared memory (see the sketch after these lists).

Tasks left as future work for another pull request:

  • add support for multiprocessing Lock if possible
  • demonstrate concurrent read / write access with Lock + doc
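
A rough sketch of how the planned has_shareable_memory helper could be used; the exact import path (here assumed to be joblib.pool.has_shareable_memory) is an assumption:

import numpy as np
from joblib import dump, load
from joblib.pool import has_shareable_memory  # assumed location of the helper

a = np.zeros(10)                        # plain in-memory array
dump(a, '/tmp/a.pkl')
b = load('/tmp/a.pkl', mmap_mode='r')   # np.memmap backed by the dumped file

print(has_shareable_memory(a))  # False: would be copied when sent to a worker
print(has_shareable_memory(b))  # True: backed by shared (memmapped) memory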

@travisbot

This pull request fails (merged c1caab3 into 8aa6e48).

@travisbot

This pull request fails (merged fff0404 into 8aa6e48).

@GaelVaroquaux
Member

I am looking at this PR on my mobile phone, so I may fail to see the big picture, but it seems to me that the solution you found will not work in general (with raw multiprocessing, or another parallel computing engine). Am I right?

@ogrisel
Contributor Author

ogrisel commented Sep 3, 2012

It will work as long as you use the joblib.pool.MemmapingPool implementation instead of multiprocessing.Pool. Once we get that approach validated as working for joblib / sklearn I might submit that improvement upstream to the python std lib, as it might be useful in other situations as well. For instance we could provide a reducer at the mmap type level instead of the numpy.memmap level.
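
To illustrate what a reducer at the type level means in practice, here is a simplified, hedged sketch of the general technique (not this PR's actual code, which ships its own customizable picklers):

import numpy as np
from multiprocessing.reduction import ForkingPickler  # Python 3 location

def reduce_memmap(a):
    # Pickle a memmap as "re-open the backing file" instead of copying its buffer.
    return (np.memmap, (a.filename, a.dtype, 'r+', a.offset, a.shape))

# From now on every np.memmap sent to a worker is rebuilt from its file, not copied.
ForkingPickler.register(np.memmap, reduce_memmap)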

Review comment on this docstring excerpt:

    copy between the parent and child processes.

    This module should not be imported if multiprocessing is not
    available. as it implements subclasses of multiprocessing Pool

Typo: comma instead of dot.

@travisbot

This pull request fails (merged b5810a5 into 8aa6e48).

@travisbot

This pull request fails (merged 6853472 into 8aa6e48).

@travisbot

This pull request fails (merged 25eddbe into 8aa6e48).

@ogrisel
Contributor Author

ogrisel commented Sep 4, 2012

@GaelVaroquaux do you still want to support python 2.5 and the lack of multiprocessing for the next release? It seems that it makes the code very complicated to maintain and would make it harder to support both python 3 and python 2 in a single codebase.

@GaelVaroquaux
Member

@GaelVaroquaux do you still want to support python 2.5 and the lack of multiprocessing for the next release?

I'd prefer to. Python 2.6 was only released in 2008. joblib's goal is to be
as seamless as possible.

@ogrisel
Contributor Author

ogrisel commented Sep 4, 2012

Alright, I will try to rework the tests to skip them when multiprocessing is not available.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

@glouppe I think the current state of the branch is good enough to start experimenting on real world machine learning problems (extra trees / RF and cross validation / grid search with n_jobs >> 2). np.array instances passed as input to Parallel operations should automatically be dumped to filesystem-backed np.memmap instances if they are larger than 1MB (this can be changed or even disabled using the max_nbytes arg).

When using the Parallel high level API, all temporary files are collected automatically without the need for explicit user intervention. Finer-grained control can be obtained by using the joblib.pool.MemmapingPool API directly (it has the same public API as multiprocessing.pool.Pool).
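
As an illustration of the two levels of API described above (argument values are just examples):

import numpy as np
from joblib import Parallel, delayed
from joblib.pool import MemmapingPool

data = np.random.randn(500, 1000)  # ~4MB, above the default 1MB threshold

# High level: large inputs are automatically dumped to a memmap before dispatch
# and the temporary files are collected when the Parallel call returns.
sums = Parallel(n_jobs=2, max_nbytes=1e6)(
    delayed(np.sum)(data[i::2]) for i in range(2))

# Lower level: same public API as multiprocessing.pool.Pool, finer-grained control.
pool = MemmapingPool(processes=2, max_nbytes=1e6)
try:
    sums = pool.map(np.sum, [data[0::2], data[1::2]])
finally:
    pool.terminate()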

@GaelVaroquaux I would appreciate it if you could have a look at the implementation. If you agree with the design and @glouppe's tests work as expected, I will start writing the missing documentation.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

Oh and BTW, I restored the Python 2.5 compatibility, as TravisBot reports with the green dot next to the latest commits.

@ogrisel
Contributor Author

ogrisel commented Sep 9, 2012

I have launched a c1.xlarge (8 cores, 7GB RAM) box on EC2 to run the ExtraTrees covertype benchmark with n_jobs=-1, just replacing sklearn's joblib folder with a symlink to the joblib folder from this branch. I observed a 1.7GB reduction in memory usage and a training time that decreased from 96s to 91s (memory allocation is actually expensive enough to be measurable on this benchmark, and memmapping spares some of those allocations in the subprocesses). I find this very cool for a simple drop-in replacement.

Those numbers could be reduced even further if the original dataset were memmapped directly rather than loaded as numpy arrays, but that would require a change in the benchmark script.

@ogrisel ogrisel mentioned this pull request Sep 9, 2012
@glouppe

glouppe commented Sep 9, 2012

That looks great! I will have a deeper look at it tomorrow :)

@glouppe

glouppe commented Sep 10, 2012

I have run a quick test on my machine and it seems to work like a charm :) Building a forest of 10 extra-trees on mnist3vs8 gives the following results:

master:

  • 1 job: 35s, 120MB
  • 2 jobs: 18s, 290MB

pickling-pool:

  • 1 job: 27s, 140MB
  • 2 jobs: 16s, 140MB

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :)

Once my colleagues arrive, I'll try a bigger task on our 48-core 512GB machine! I'll keep you posted.

@glouppe

glouppe commented Sep 10, 2012

Those numbers could even be further reduced if the original dataset would be memmaped directly rather than loaded as numpy arrays but that would require a change in the benchmark script to do so.

Do you mean that an extra copy is made? Could we fix that when X is checked and converted?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :) Once my colleagues arrive, I'll try a bigger task on our 48-core 512Gb machine! I keep you posted.

Thanks for those early tests. I am really looking forward to the tests on your "real life" work environment :) Most likely you will no longer need those 512GB of RAM :)

Do you mean that an extra-copy is made? Could we fix that at checking and conversion time of X?

Yeah, if you load your data from disk into a numpy array, this array will stay in the memory of the master process while a memmap copy is also allocated (once) and shared with the subprocesses for the duration of the computation.

In order to get rid of the initial extra copy, assuming you load the data into variable X with the right dtype and memory layout (C or Fortran), you can do:

from joblib import dump, load

filename = '/tmp/cached_source_data.pkl'
dump(X, filename)                   # write X once to disk in a memmap-friendly layout
X = load(filename, mmap_mode='c')   # reload it as a copy-on-write np.memmap

The original X array will be garbage collected (assuming no other references to it) and the memmap instance will directly be shared with the subprocesses from now on.

The same remark applies to the X_argsorted internal data structure. When n_jobs != 1, we could make ExtraTrees* and RandomForest* allocate X_argsorted directly as a np.memmap array in w+ mode pointing to a temporary file (stored as an attribute of the main estimator class and collected in the __del__ method) instead of an in-memory numpy array. That would remove another memory copy.
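
A hypothetical sketch of that idea (names, dtype and shape are illustrative, not sklearn's actual code), reusing the memmapped X from the snippet above:

import tempfile
import numpy as np

# Allocate the internal buffer directly as a writable memmap over a temporary file...
tmp = tempfile.NamedTemporaryFile(suffix='_argsorted.mmap', delete=False)
X_argsorted = np.memmap(tmp.name, dtype=np.int32, mode='w+', shape=X.shape)

# ...fill it once in the parent process...
X_argsorted[:] = np.argsort(X, axis=0)

# ...and let the subprocesses address the same pages instead of receiving a pickled copy.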

@bdholt1

bdholt1 commented Sep 10, 2012

Thanks @ogrisel for this PR! I'm giving it a go on my 13GB dataset (X.shape = (1678300, 2000)) on a machine with 32GB of RAM... So to train a forest I'm going to have to either train one tree at a time or use memmapping.

I followed your instructions (working on Windows 7, EPD 64-bit + cygwin) and I'm getting a strange error:

File "E:\brian\poseme\rgbd\train_rgbd_trees.py", line 133, in train_trees
  X = joblib.load(x_filename, mmap_mode='c')
File "E:\brian\scikit-learn\sklearn\externals\joblib\numpy_pickle.py", line 418, in load
  obj = unpickler.load()
File "C:\Python27\lib\pickle.py", line 858, in load
  dispatch[key](self)
KeyError: '\x93'

I get the same problem when mmap_mode=None, and no problems when I use np.load(x_filename, mmap_mode=None). Also, I've saved X as a binary Fortran array .npy. Any thoughts on how to proceed?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Thanks @bdholt1 for testing (especially under Windows, as I could not find the motivation so far :). Could you please add a print statement / pdb breakpoint on line 858 of pickle.py to display the content of the self variable when key is '\x93'?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

BTW, to be able to work with your 13GB dataset you need to memmap it directly upstream as a float32 / Fortran array and make sure that X_argsorted is memmapped too from the start (that requires a change in sklearn's source code, not just your script). I would suggest starting with a smaller subset of your data (e.g. ~5GB) to make sure that n_jobs >> 1 behaves as expected first, and then moving on to optimizing sklearn / your script afterwards.

Also, could you please run the joblib test suite under Windows? You just need to run nosetests in the top folder.

@bdholt1

bdholt1 commented Sep 10, 2012

On line 858 of pickle.py I added:
from pprint import pprint
print self
pprint(vars(self))
dispatch[key](self)

which prints:

<sklearn.externals.joblib.numpy_pickle.NumpyUnpickler instance at 0x0000000006F7C248>
{'_dirname': 'H:/RGB-D_2D/features',
 '_filename': 'rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy',
 'append': <built-in method append of list object at 0x0000000006F7D248>,
 'file_handle': <open file 'H:/RGB-D_2D/features/rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy', mode 'rb' at 0x0000000006EDBE40>,
 'mark': <object object at 0x0000000001DC80E0>,
 'memo': {},
 'mmap_mode': 'c',
 'np': <module 'numpy' from 'C:\Python27\lib\site-packages\numpy\__init__.pyc'>,
 'read': <built-in method read of file object at 0x0000000006EDBE40>,
 'readline': <built-in method readline of file object at 0x0000000006EDBE40>,
 'stack': []}

I have no need for X_argsorted because I'm using the lazy argsort branch :)

After running nosetests I don't see a summary report (apart from the OK at the end), so I checked every statement looking for failures and this looked like the only possible problem:
joblib.test.test_numpy_pickle.test_numpy_persistence ... Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(2.57e-322)> ignored

@bdholt1

bdholt1 commented Sep 10, 2012

When you say "memmap it directly upstream in float32 / Fortran array", I think I've done that, in that the data stored on disk is a float32 Fortran binary .npy file. Will that work?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

When you say "memmap it directly upstream in float32 / Fortran array", I think I've done that, in that the data stored on disk is a float32 Fortran binary .npy file. Will that work?

If you used the default numpy.save, I don't think there is an easy way to memmap it directly. So you will need to either write your own memmap serialization code or use joblib.dump / joblib.load as explained previously.

About the KeyError: apparently under Windows, multiprocessing passes more stuff to the child processes as pickles (probably due to the lack of fork on Windows). I need to understand why it's trying to pickle the joblib.numpy_pickle.NumpyUnpickler class itself, as that sounds useless to me. I will probably need to debug that on a Windows VM. Can you please post a minimalistic python script (with as little data as possible, possibly generated using np.random.randn) that reproduces this error as a new http://gist.github.com/ ?
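
With hindsight from the later comments, one possible shape for such a script (file names and sizes are purely illustrative):

import numpy as np
from sklearn.externals import joblib

# Save with plain np.save, then reload with joblib.load, as in the report above.
X = np.asfortranarray(np.random.randn(1000, 50).astype(np.float32))
np.save('X_test.npy', X)

# On the reported setup this raises KeyError: '\x93' because the .npy file
# was not written by joblib.dump.
X = joblib.load('X_test.npy', mmap_mode='c')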

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

BTW:

>>> print "\x93"
?

Sounds appropriate :)

@bdholt1

bdholt1 commented Sep 10, 2012

Ok, I think I made a mistake by not using joblib.dump. Now that I have done it, I no longer experience the problem with joblib.load mentioned above.

Now it should be working but the behaviour is a bit weird. Watching my memory usage at X = joblib.load(x_filename, mmap_mode='c'), it doesn't appear to load anything into memory. Is this right? Should it simply point to the data on disk and load it when needed?

@ogrisel
Contributor Author

ogrisel commented Sep 10, 2012

Now it should be working but the behaviour is a bit weird. Watching my memory usage at X = joblib.load(x_filename, mmap_mode='c'), it doesn't appear to load anything into memory. Is this right? Should it simply point to the data on disk and load it when needed?

This is right, and it is the beauty of memmapping: data is loaded (page by page) by the kernel, and cached by the kernel's disk cache, only when needed by the processes addressing this virtual memory segment.
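
A small illustration of that behaviour (the path and sizes are arbitrary):

import numpy as np
from joblib import dump, load

dump(np.zeros((10000, 1000)), '/tmp/big_array.pkl')   # ~80MB written to disk

X = load('/tmp/big_array.pkl', mmap_mode='c')
print(type(X))       # numpy.memmap: nothing has been read into memory yet
print(X[:10].sum())  # touching a slice only faults in the pages backing it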

@glouppe

glouppe commented Sep 10, 2012

I don't think this is directly related, but we got the following error when we consider datasets that are too large (from 234000*1728 and beyond).

python2.6: ./Modules/cStringIO.c:419: O_cwrite: Assertion `oself->pos + l < 2147483647' failed.

We already had that bug before, but it disappeared without any good reason... From what I googled, this seems to come from Python itself, which does not handle such large objects... :/ There has been an open ticket since 2009 concerning that bug but no one seems to have solved it.

http://bugs.python.org/issue7358

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

Rebased on top of master. @GaelVaroquaux I think this is ready to be merged (once travis is green again after the rebase).

@GaelVaroquaux
Member

Still breaking on my box :$. I'll try to check that and fix it.

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

Which version of numpy are you using? I disabled the mitigation strategy for broken numpy versions. I could not reproduce the issue on the released 1.7.0 nor on the 1.7.1 version.

@ogrisel
Contributor Author

ogrisel commented Jul 30, 2013

If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.

@GaelVaroquaux
Member

Which version of numpy are you using?

1.8.0.dev-1ea1592

I disabled the mitigation strategy for broken numpy versions.

That's probably why.

@GaelVaroquaux
Member

If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.

Don't bother. It works on numpy master. I pushed your branch to joblib
master (merged).

Could you please make an announcement on joblib's mailing list (and maybe
twitter) to push people to use this new feature and report problems?

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

Are you sure? I don't see it on master.

@GaelVaroquaux GaelVaroquaux merged commit fc7b0de into joblib:master Jul 31, 2013
@GaelVaroquaux
Member

Are you sure? I don't see it on master.

Why, oh why do I keep messing my merges up...

It should be good now. I am just surprised that I got a merge.

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

:)

What do you think of tagging the current master and synchronizing the joblib version of scikit-learn master on that tag to get a wider testing base?

@GaelVaroquaux
Member

What do you think of tagging the current master and synchronizing the joblib
version of scikit-learn master on that tag to get a wider testing base?

I usually have a policy of not having an unreleased joblib in sklearn.
But... rules are meant to be bent.

How about finishing the release of sklearn (I am somewhat
bandwidth-limited currently) and merging these changes in sklearn after?

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

We can also do a major release just for that :)

But do as you wish. I am in no hurry.

@jni

jni commented Jul 31, 2013

I'm in a hurry! This is what I've been waiting for for over a year! ;) Thanks guys!

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

@jni since the version you know, the main difference is that the auto-memmap feature now uses the ramdisk partition /dev/shm when available. That should remove the need to touch the hard drive under Linux. Under OS X and Windows, the temp folder is still used by default.
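
A hedged sketch of how this can be exercised explicitly (the temp_folder keyword is an assumption about this branch's MemmapingPool API; when it is left unset, /dev/shm is picked automatically if available):

import numpy as np
from joblib.pool import MemmapingPool

data = np.random.randn(2000, 1000)

# Pin the temporary memmap files to the shared-memory tmpfs partition.
pool = MemmapingPool(processes=2, temp_folder='/dev/shm', max_nbytes=1e6)
try:
    print(pool.map(np.mean, [data] * 4))
finally:
    pool.terminate()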

@ogrisel
Contributor Author

ogrisel commented Jul 31, 2013

Looking forward to any benchmarks, memory profile traces or bug reports.

@mluessi
Contributor

mluessi commented Jul 31, 2013

Great :) now I can revive mne-tools/mne-python#99 that uses this feature. This should really help with non-parametric statistics which currently uses crazy amounts of memory when running in parallel.

@GaelVaroquaux
Member

This should really help with non-parametric statistics which currently uses crazy amounts of memory when running in parallel.

Note that saving memory was already possible by passing around file names
and memmapping them when reading. This PR just makes everything easier.

@mluessi
Contributor

mluessi commented Jul 31, 2013

@GaelVaroquaux yes, I know.. but this feature makes it much easier :)

@GaelVaroquaux
Member

Indeed. All the credit goes to @ogrisel. Tell us how it goes in your
usage.
