[MRG] Custom pickling pool for no-copy memmap handling with multiprocessing #44
Conversation
I am looking at this PR on my mobile phone, so I may fail to see the big picture, but it seems to me that the solution you found will not work in general (with raw multiprocessing, or another parallel computing engine). Am I right?

It will work as long as you use the
copy between the parent and child processes.
This module should not be imported if multiprocessing is not
available. as it implements subclasses of multiprocessing Pool
Typo: comma instead of dot.
@GaelVaroquaux do you still want to support python 2.5 and the lack of multiprocessing for the next release? It seems that it makes the code very complicated to maintain and would make it harder to support both python 3 and python 2 in a single codebase.
I'd prefer. Python 2.6 was only released in 2008. joblib's goal is to be
Alright, I will try to rework the tests to skip those if multiprocessing is not available.
@glouppe I think the current state of the branch is good enough to start experimenting on real world machine learning problems (extra trees / RF and cross validation / grid search with n_jobs >> 2). When using the

@GaelVaroquaux I would appreciate it if you could have a look at the implementation. If you agree with the design and @glouppe's tests work as expected I will start to write the missing documentation.
Oh and BTW I restored the Python 2.5 compat as TravisBot is reporting with the green dot next to the latest commits.
I have launched a c1.xlarge (8 cores, 7GB) box on EC2 to run the ExtraTrees covertype benchmark with n_jobs=-1 by just replacing sklearn's joblib folder with a symlink to the joblib folder from this branch, and I could observe a 1.7GB reduction in memory usage and a training time that decreased from 96s to 91s (memory allocation is actually expensive enough to be measurable on this benchmark and memmapping spares some of those allocations in subprocesses). I find this very cool for a simple drop-in replacement. Those numbers could be further reduced if the original dataset were memmapped directly rather than loaded as numpy arrays, but that would require a change in the benchmark script.
That looks great! I will have a deeper look at it tomorrow :)
I have run a quick test on my machine and it seems to work like a charm :) Building a forest of 10 extra-trees on mnist3vs8 gives the following results: master:
pickling-pool:
Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :) Once my colleagues arrive, I'll try a bigger task on our 48-core 512GB machine! I'll keep you posted.
Do you mean that an extra copy is made? Could we fix that at checking and conversion time of
Thanks for those early tests. I am really looking forward to the tests on your "real life" work environment :) Most likely you will no longer need those 512GB of RAM :)
Yeah, if you load your data from the disk into a numpy array, this array will stay in the memory of the master process while a memmap copy will also be allocated (once) and shared with the subprocesses for the duration of the computation. In order to get rid of the initial extra copy, make sure you load the data into the variable as follows:

```python
from joblib import dump, load

filename = '/tmp/cached_source_data.pkl'
dump(X, filename)
X = load(filename, mmap_mode='c')
```

The original X array will be garbage collected (assuming no other references to it) and the memmap instance will directly be shared with the subprocesses from now on. The same remark applies to the
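As a quick sanity check (a sketch with a hypothetical temp file location, not part of the thread), you can verify that reloading through joblib really yields a memory-mapped array rather than an in-memory copy:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# Hypothetical location for the cached data.
filename = os.path.join(tempfile.mkdtemp(), 'cached_source_data.pkl')

X = np.random.rand(100, 10)
dump(X, filename)

# Reloading with mmap_mode returns a np.memmap view backed by the
# file instead of an in-memory copy, so subprocesses can share pages.
X = load(filename, mmap_mode='c')
print(isinstance(X, np.memmap))  # → True
```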
Thanks @ogrisel for this PR! I'm giving it a go on my 13GB dataset. I followed your instructions (working on Windows 7, EPD 64-bit + cygwin) and I'm getting a strange error:
Thanks @bdholt1 for testing (especially under Windows, as I could not find the motivation so far :). Could you please add a print statement / pdb breakpoint on line 858 of pickle.py to display the content of the
BTW, to be able to work with your dataset of 13GB you need to memmap it directly upstream as a float32 / Fortran array and make sure that
Also, could you please run the joblib test suite under Windows? You just need to run
On line 858 of pickle.py I added:

I have no need for X_argsorted because I'm using the lazy argsort branch :) After running nosetests I don't see a summary report (except the OK at the end), so I checked every statement looking for failures and this looked like the only possible problem.
When you say "memmap it directly upstream in float32 / Fortran array", I think I've done that, in that the data stored on disk is in float32 Fortran order as a binary
If you used the default

About the
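The upstream-memmapping advice from the exchange above can be sketched like this (the file path is hypothetical): convert the data once to a float32 / Fortran-ordered array saved as .npy, then always reopen it memory-mapped:

```python
import os
import tempfile

import numpy as np

# Hypothetical path for the converted dataset.
path = os.path.join(tempfile.mkdtemp(), 'data.npy')

# One-off conversion to a float32 / Fortran-ordered array saved in
# .npy format (the format records both the dtype and memory layout).
X = np.asfortranarray(np.random.rand(200, 30).astype(np.float32))
np.save(path, X)

# From now on, open a read-only memmap instead of loading into RAM.
X = np.load(path, mmap_mode='r')
print(type(X).__name__, X.dtype, np.isfortran(X))  # → memmap float32 True
```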
BTW:
Sounds appropriate :)
Ok, I think I made a mistake by not using

Now it should be working but the behaviour is a bit weird. Watching my memory usage at
This is right, this is the beauty of memmapping: data is loaded (page by page) by the kernel (and cached using the kernel's disk cache) only when needed by processes addressing this virtual memory segment.
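The page-by-page behaviour can be observed with a plain numpy.memmap (sizes and path are illustrative, not from the thread):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'lazy.dat')

# Creating the file-backed array does not read or write the whole
# buffer up front: pages are only materialized when addressed.
mm = np.memmap(path, dtype=np.float32, mode='w+', shape=(1024, 1024))

mm[0, :10] = 1.0  # touches a single page; the rest stays zero-filled
print(float(mm[0, 5]), float(mm[500, 500]))  # → 1.0 0.0
```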
I don't think this is directly related, but we got the following error when we consider datasets that are too large (from 234000*1728 and beyond).
We already had that bug before, but it disappeared without any good reason... From what I googled, this seems to come from Python itself, which does not handle very large objects... :/ There has been an open ticket since 2009 concerning that bug but no one seems to have solved it.
Rebased on top of master. @GaelVaroquaux I think this is ready to be merged (once travis is green again after the rebase).
Still breaking on my box :$. I'll try to check that and fix it.
Which version of numpy are you using? I disabled the mitigation strategy for broken numpy versions. I could not reproduce the issue on the released 1.7.0 nor on the 1.7.1 version.
If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.
1.8.0.dev-1ea1592
That's probably why.
Don't bother. It works on numpy master. I pushed your branch to joblib

Could you please make an announcement on joblib's mailing list (and maybe
Are you sure? I don't see it on master.
Why, oh why do I keep messing my merges up... It should be good now. I am just surprised that I got a merge.
:) What do you think of tagging the current master and synchronizing the joblib version of scikit-learn master on that tag to get a wider testing base?
I usually have a policy of not having an unreleased joblib in sklearn. How about finishing the release of sklearn (I am somewhat
We can also do a major release just for that :) But do as you wish. I am in no hurry.
I'm in a hurry! This is what I've been waiting for, for over a year! ;) Thanks guys!
@jni since the version you know, the main difference is that now the auto-memmap feature will use the ramdisk partition
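In current joblib releases, the auto-memmap feature described here is controlled by the `max_nbytes` parameter of `Parallel` (the exact default threshold is stated here as an assumption, and `row_mean` is just an illustrative helper):

```python
import numpy as np
from joblib import Parallel, delayed


def row_mean(X, i):
    # With auto-memmapping, X arrives in the worker as a read-only
    # memmap backed by a file in the shared temp folder.
    return float(X[i].mean())


X = np.ones((100, 50))

# max_nbytes=0 forces memmapping of every array argument; by default
# only arrays larger than the threshold (1M) trigger it.
means = Parallel(n_jobs=2, max_nbytes=0)(
    delayed(row_mean)(X, i) for i in range(5))
print(means)  # → [1.0, 1.0, 1.0, 1.0, 1.0]
```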
Looking for any benchmark, memory profile trace or bug report.
Great :) Now I can revive mne-tools/mne-python#99 that uses this feature. This should really help with non-parametric statistics, which currently use crazy amounts of memory when running in parallel.
Note that saving memory was already possible by passing around file names
@GaelVaroquaux yes, I know... but this feature makes it much easier :)
Indeed. All the credit goes to @ogrisel. Tell us how it goes in your
This is a new `multiprocessing.Pool` subclass to better deal with the shared memory situation as discussed at the end of the comment thread of #43.

Feedback welcome, I plan to do more testing, benchmarking, documentation and `joblib.parallel` integration + validate on sklearn's `RandomForestClassifier` use case in the coming week.

TODO before merge:

- `.base` attribute of an array
- `joblib.has_shareable_memory` utility function to detect datastructures with shared memory.

Tasks left as future work for another pull request: