WIP: Shared arrays #43
Conversation
Nitpicks (I am starting with the nitpicks, because they don't require an understanding of the code):
Now to the big picture:
```
In [1]: from joblib import sharedarray

In [2]: a = np.zeros((10, 10))

In [3]: b = sharedarray.assharedarray(a)

In [4]: b
Out[4]:
SharedArray([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [5]: b[:2]
Out[5]:
SharedArray([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
             [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [6]: b[:2] + 3
Out[6]:
SharedArray([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.],
             [ 3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,  3.]])
```

However, the way you have coded it, there is memmapping going on for Out[5] but not for Out[6] (because of the check you are doing). Thus the SharedArray in Out[6] is not actually usable in a multiprocessing context. I do believe that this is a desired feature: if you code it the other way around, as people do their computations they will end up with heaps of shared arrays. Each one of these has an associated file descriptor, and at the end of the day you run out of file descriptors (the infamous 'too many open files' error). This happens to me when I work with memmapped arrays.

So we want this feature that daughter arrays do not rely on memmapping, but right now it is confusing to the user, who has the impression that he has arrays that can be shared across processes. I am not sure how to address this problem, but I think that using the priority mechanism and the inheritance model of numpy we can improve things. At the very least, we can probably avoid grand-daughter arrays being SharedArrays.
In this light, I suggest setting `__array_priority__` to -9999: this means that when operating with other array subclasses, this subclass will always lose in the subclass-coercion mechanism (see http://docs.scipy.org/doc/numpy/reference/arrays.classes.html). Also, I wonder if `__array_prepare__` and `__array_wrap__` could not be used in a clever way, so that in the cases where `_mmap` is None, standard ndarrays are created instead of SharedArrays.
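To make the suggestion concrete, here is a minimal sketch of the mechanism (the `SharedArray` class and its `_mmap` attribute here are stand-ins for the PR's implementation, not the actual code): `__array_wrap__` demotes ufunc results to plain ndarrays whenever the source array has no backing mmap.

```python
import numpy as np


class SharedArray(np.ndarray):
    # Always lose subclass coercion against other ndarray subclasses.
    __array_priority__ = -9999.0

    def __new__(cls, input_array, mmap=None):
        obj = np.asarray(input_array).view(cls)
        obj._mmap = mmap
        return obj

    def __array_finalize__(self, obj):
        # Called on views and slices: inherit the mmap reference, if any.
        if obj is None:
            return
        self._mmap = getattr(obj, '_mmap', None)

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # If this array is not backed by an mmap buffer, demote the result
        # to a plain ndarray so users cannot mistake it for a shareable one.
        if getattr(self, '_mmap', None) is None:
            return np.asarray(out_arr)
        result = out_arr.view(SharedArray)
        result._mmap = self._mmap
        return result
```

With this, `SharedArray(np.zeros((2, 2))) + 3` comes back as a plain `np.ndarray`, while arithmetic on an mmap-backed instance still yields a `SharedArray`.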
I suppose you meant shared_array.py
Alright, I just wanted to be consistent with
Yes this is planned. The lock feature is missing too. Now to the big picture:
Indeed but they would share a lot of common code with the file-based variant.
Yes, precisely to solve the multiprocessing memory copy of memmaps (and add the lock feature to them). This was my initial use case (the anonymous mode is an almost free bonus of this refactoring). I will try to do experiments with array priorities and do some open file descriptors profiling tonight. Thanks for this first review.
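The lock feature mentioned here could, for instance, amount to pairing the array with a `multiprocessing.Lock` used as a context manager. A hypothetical sketch (`LockedArray` is not part of the PR, and the buffer here is a plain in-process array; in the real thing it would be the shared mmap buffer):

```python
import multiprocessing as mp

import numpy as np


class LockedArray:
    """Hypothetical sketch: guard array access with an inter-process lock."""

    def __init__(self, shape, dtype=np.float64):
        self.lock = mp.Lock()  # inherited by fork()ed worker processes
        # Plain array for the sketch; the PR would use a shared mmap buffer.
        self.data = np.zeros(shape, dtype=dtype)

    def __enter__(self):
        self.lock.acquire()
        return self.data

    def __exit__(self, *exc_info):
        self.lock.release()
        return False


arr = LockedArray((4,))
with arr as a:  # any other holder of arr.lock blocks here
    a += 1.0
```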
Good news: for anonymous shared arrays, there is no attached file descriptor:
Before running this:
After running this:
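(The actual before/after listings are not preserved here, but the claim can be checked on Linux by counting the entries of `/proc/self/fd` around the creation of an anonymous mapping. A sketch, assuming a Linux system:)

```python
import mmap
import os


def open_fd_count():
    # Linux-specific: each entry in /proc/self/fd is an open descriptor.
    return len(os.listdir('/proc/self/fd'))


before = open_fd_count()
buf = mmap.mmap(-1, 4096)  # fileno=-1 requests an anonymous mapping
after = open_fd_count()
# The anonymous mapping keeps no file descriptor open: after == before.
```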
I decided to split the file-based and anonymous memory cases into distinct classes as you requested. I have also tried to summarize what remains to be done in the description of this PR, based on your feedback plus other considerations.
That was a suggestion, and not a request. I can be convinced otherwise.
Cool. I am not sure what the `shared=True` option in cache/load would be
The code is now much simpler so I think it's better this way :)
The goal would be to avoid doing useless memory allocation in a non-shareable array of cached serialized results prior to feeding a call to `Parallel`.
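The idea can be sketched with plain numpy standing in for the joblib loader (the `shared=True` API itself does not exist; `np.load`'s `mmap_mode` shows the mechanism): a cached result is mapped back read-only instead of being re-allocated, so the OS can share its pages among worker processes.

```python
import os
import tempfile

import numpy as np

# Persist a large result once, as a cached computation would.
path = os.path.join(tempfile.mkdtemp(), 'result.npy')
np.save(path, np.zeros((1000, 1000)))

# Map it back read-only instead of allocating a fresh array: the pages
# can then be shared by every process that loads the same file.
result = np.load(path, mmap_mode='r')
```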
Just a quick progress note on this. The coverage of the current implementation is good but there are still 2 outstanding issues:
OK, I think that I am going to do a new minor release of joblib without waiting for this.
No problem. Don't wait for me, this is still a WIP.
@ogrisel Have you made progress on this? I am not so familiar with the joblib codebase, but if I can be of any help, please tell me! (I can dive into it) Shared arrays are definitely something I'd like to see properly implemented.
@glouppe yes, I decided to stop using anonymous mmap as it would make it much too complex to implement proper multiprocess garbage collection and use
@ogrisel How can I help you with any of these things? Feel free to delegate some work. I'd be glad to help.
I have given it some thought this week-end and I cannot come up with a good solution anymore. The current code has two issues:
I am thinking that the best way to go would be to not use a custom |
@ogrisel: I cannot allocate time on this before the sprint, but I'd love |
I have started to work on |
I am closing this PR as the approach in #44 looks much better. |
Early pull request to introduce a new datastructure that blends the good features of `np.memmap` and `multiprocessing.Array` for working with `joblib.Parallel` without exhausting the memory when dealing with large data arrays. This can be considered an alternative or complementary solution to PR #40 for issue #38.

TODO:

- What should `SharedArray(10) + 3` be? A regular numpy array? If so, implement it and test it.
- `as_shared_datastructure` to reallocate scipy.sparse and other nested datastructures with arrays, to make it easier to work in a multiprocessing context
- Integration with `joblib.load` and `joblib.Memory.cache` (maybe with a `shared=True` option)?
- `share_memory` option to `joblib.Parallel` to call `as_shared_datastructure` on the args?
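The blend the description refers to can be sketched with an anonymous shared mapping wrapped in an ndarray: with the fork start method the mapping is inherited by workers, like `multiprocessing.Array`, while keeping the full numpy interface, like `np.memmap`. A minimal sketch, assuming a Unix platform (this is not the PR's `SharedArray` code):

```python
import mmap
import multiprocessing as mp

import numpy as np

# Anonymous shared mapping: lives only in memory, shared across fork().
buf = mmap.mmap(-1, 10 * 10 * 8)
shared = np.frombuffer(buf, dtype=np.float64).reshape(10, 10)


def fill(value):
    # Runs in the child process; the writes land in the shared pages.
    shared[:] = value


ctx = mp.get_context('fork')  # fork keeps the mapping shared (Unix only)
p = ctx.Process(target=fill, args=(3.0,))
p.start()
p.join()
# shared now contains 3.0 everywhere, written by the child process.
```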