
[MRG] Custom pickling pool for no-copy memmap handling with multiprocessing #44

Merged
merged 71 commits into from Jul 31, 2013

7 participants

@ogrisel
ogrisel commented Sep 2, 2012

This is a new multiprocessing.Pool subclass to better deal with the shared memory situation as discussed at the end of the comment thread of #43.

Feedback welcome. I plan to do more testing, benchmarking, documentation and joblib.parallel integration, and to validate on sklearn's RandomForestClassifier use case in the coming week.

TODO before merge:

  • doctest fixture to skip memmap doctest when numpy is not installed
  • support for memmap instance in the .base attribute of an array
  • add a joblib.has_shareable_memory utility function to detect data structures with shared memory.

Tasks left as future work for another pull request:

  • add support for multiprocessing Lock if possible
  • demonstrate concurrent read / write access with Lock + doc
@travisbot

This pull request fails (merged c1caab3 into 8aa6e48).

@travisbot

This pull request fails (merged fff0404 into 8aa6e48).

@GaelVaroquaux
joblib member

I am looking at this PR on my mobile phone, so I may fail to see the big picture, but it seems to me that the solution you found will not work in general (with raw multiprocessing, or another parallel computing engine). Am I right?

@ogrisel
ogrisel commented Sep 3, 2012

It will work as long as you use the joblib.pool.MemmapingPool implementation instead of multiprocessing.Pool. Once we get that approach validated as working for joblib / sklearn, I might submit that improvement upstream to the Python standard library as it might be useful in other situations as well. For instance, we could provide reducers at the mmap type level instead of the numpy.memmap level.
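For illustration, a minimal usage sketch (array size, worker count and file name are arbitrary placeholders):

import numpy as np
from joblib.pool import MemmapingPool

if __name__ == '__main__':
    # a memmap-backed array that the workers can address without any copy
    data = np.memmap('/tmp/shared_data.mmap', dtype=np.float64,
                     shape=int(1e6), mode='w+')
    data[:] = np.random.randn(data.shape[0])

    # same public API as multiprocessing.Pool
    pool = MemmapingPool(2)
    try:
        print(pool.map(np.mean, [data, data, data, data]))
    finally:
        pool.terminate()  # also cleans up the pool's temporary files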

@glouppe glouppe commented on an outdated diff Sep 3, 2012
joblib/pool.py
@@ -0,0 +1,139 @@
+"""Custom implementation of multiprocessing.Pool with custom pickler
+
+This module provides efficient ways of working with data stored in
+shared memory with numpy.memmap arrays without inducing any memory
+copy between the parent and child processes.
+
+This module should not be imported if multiprocessing is not
+available. as it implements subclasses of multiprocessing Pool
@glouppe
glouppe added a note Sep 3, 2012

Typo: comma instead of dot.

@travisbot

This pull request fails (merged b5810a5 into 8aa6e48).

@travisbot

This pull request fails (merged 6853472 into 8aa6e48).

@travisbot

This pull request fails (merged 25eddbe into 8aa6e48).

@ogrisel
ogrisel commented Sep 4, 2012

@GaelVaroquaux do you still want to support Python 2.5 and the lack of multiprocessing for the next release? It seems that it makes the code very complicated to maintain and would make it harder to support both Python 3 and Python 2 in a single codebase.

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Sep 4, 2012

Alright, I will try to rework the tests to skip those if multiprocessing is not available.

@ogrisel
ogrisel commented Sep 9, 2012

@glouppe I think the current state of the branch is good enough to start experimenting on real-world machine learning problems (extra trees / RF and cross validation / grid search with n_jobs >> 2). np.array instances passed as input to Parallel operations should automatically be dumped to FS-backed np.memmap instances if they are larger than 1MB (this can be changed or even disabled using the max_nbytes arg).

When using the Parallel high level API, all temporary files are collected automatically without the need for explicit user intervention. Finer-grained control can be obtained by using the joblib.pool.MemmapingPool API directly (same public API as multiprocessing.pool.Pool).
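For illustration, a minimal sketch of the intended behaviour (sizes are arbitrary):

import numpy as np
from joblib import Parallel, delayed

# 'data' is ~80MB, well above the default 1MB threshold, so the workers
# should receive a read-only np.memmap instead of a pickled copy
data = np.random.randn(int(1e7))

results = Parallel(n_jobs=2, max_nbytes=int(1e6))(
    delayed(np.sum)(data) for _ in range(4))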

@GaelVaroquaux I would appreciate it if you could have a look at the implementation. If you agree with the design and @glouppe's tests work as expected, I will start writing the missing documentation.

@ogrisel
ogrisel commented Sep 9, 2012

Oh and BTW I restored the Python 2.5 compat as TravisBot is reporting with the green dot next to the latest commits.

@ogrisel
ogrisel commented Sep 9, 2012

I have launched a c1.xlarge (8 cores, 7GB) box on EC2 to run the ExtraTrees covertype benchmark with n_jobs=-1 by just replacing sklearn's joblib folder with a symlink to the joblib folder from this branch, and I could observe a 1.7GB reduction in memory usage and a training time that decreased from 96s to 91s (memory allocation is actually expensive enough to be measurable on this benchmark, and memmap allows us to spare some of those allocations in the subprocesses). I find this very cool for a simple drop-in replacement.

Those numbers could even be further reduced if the original dataset were memmapped directly rather than loaded as numpy arrays, but that would require a change in the benchmark script to do so.

@ogrisel ogrisel referenced this pull request Sep 9, 2012
Closed

WIP: Shared arrays #43

@glouppe
glouppe commented Sep 9, 2012

That looks great! I will have a deeper look at it tomorrow :)

@glouppe
glouppe commented Sep 10, 2012

I have run a quick test on my machine and it seems to work like a charm :) Building a forest of 10 extra-trees on mnist3vs8 gives the following results:

master:

  • 1 job: 35s, 120mb
  • 2 jobs: 18s, 290mb

pickling-pool:

  • 1 job: 27s, 140mb
  • 2 jobs: 16s, 140mb

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :)

Once my colleagues arrive, I'll try a bigger task on our 48-core 512GB machine! I'll keep you posted.

@glouppe
glouppe commented Sep 10, 2012

Those numbers could even be further reduced if the original dataset were memmapped directly rather than loaded as numpy arrays, but that would require a change in the benchmark script to do so.

Do you mean that an extra copy is made? Could we fix that when X is checked and converted?

@ogrisel
ogrisel commented Sep 10, 2012

Those figures may not be very accurate (the benchmark was run only once), but they at least confirm that it works as expected! That's a very good job Olivier :) Once my colleagues arrive, I'll try a bigger task on our 48-core 512GB machine! I'll keep you posted.

Thanks for those early tests. I am really looking forward to the tests on your "real life" work environment :) Most likely you will no longer need those 512GB of RAM :)

Do you mean that an extra copy is made? Could we fix that when X is checked and converted?

Yeah if you load your data from the disk into a numpy array, this array will stay in the memory of the master process while a memmap copy will also be allocated (once) and shared with the subprocesses for the duration of the computation.

In order to get rid of the initial extra copy, make sure you load the data into the variable X with the right dtype and memory layout (C or Fortran); you can do:

from joblib import dump, load

# dump the in-memory array once, then reload it as a copy-on-write memmap
filename = '/tmp/cached_source_data.pkl'
dump(X, filename)
X = load(filename, mmap_mode='c')

The original X array will be garbage collected (assuming no other references to it) and the memmap instance will directly be shared with the subprocesses from now on.

The same remark applies to the X_argsorted internal data structure. When n_jobs != 1, we could make ExtraTrees* and RandomForest* directly allocate X_argsorted as a np.memmap array in w+ mode pointing to a temporary file (stored as an attribute of the main estimator class and collected in the __del__ method) instead of an in-memory numpy array. That would remove another memory copy.
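For illustration, a rough sketch of that idea (file name and dtype are arbitrary; this is not actual sklearn code):

import os
import tempfile

import numpy as np

# allocate X_argsorted as a temporary-file backed memmap instead of an
# in-memory array; X is assumed to be the (possibly memmapped) input data
fd, argsorted_filename = tempfile.mkstemp(suffix='_argsorted.mmap')
os.close(fd)

X_argsorted = np.memmap(argsorted_filename, dtype=np.intp,
                        shape=X.shape, mode='w+')
X_argsorted[:] = np.argsort(X, axis=0)
# the estimator would keep argsorted_filename as an attribute and
# os.unlink() it in its __del__ method once fitting is done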

@bdholt1
bdholt1 commented Sep 10, 2012

Thanks @ogrisel for this PR! I'm giving it a go on my 13GB dataset X.shape=(1678300, 2000) where I have 32GB of RAM... So to train a forest I'm going to either have to train one tree at a time or use memory mapping.

I followed your instructions (working on windows 7, EPD64bit + cygwin) and I'm getting a strange error:

File "E:\brian\poseme\rgbd\train_rgbd_trees.py", line 133, in train_trees
X = joblib.load(x_filename, mmap_mode='c')
File "E:\brian\scikit-learn\sklearn\externals\joblib\numpy_pickle.py", line 418, in load
obj = unpickler.load()
File "C:\Python27\lib\pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\x93'

I get the same problem when mmap_mode=None and no problems when I use np.load(x_filename, mmap_mode=None). Also, I've saved X as a binary Fortran array .npy. Any thoughts on how to proceed?

@ogrisel
ogrisel commented Sep 10, 2012

Thanks @bdholt1 for testing (especially under Windows, as I could not find the motivation so far :). Could you please add a print statement / pdb breakpoint on line 858 of pickle.py to display the content of the self variable when key is \x93?

@ogrisel
ogrisel commented Sep 10, 2012

BTW, to be able to work with your dataset of 13GB you need to memmap it directly upstream in float32 / Fortran array and make sure that X_argsorted is memmapped too from the start (that requires a change in sklearn's source code, not just your script). I would suggest starting with a smaller subset of your data (e.g. ~5GB) to make sure that n_jobs >> 1 behaves as expected first, and then moving on to optimizing sklearn / your script afterwards.

Also, could you please run joblib test suite under windows? You just need to run nosetests in the top folder.

@bdholt1
bdholt1 commented Sep 10, 2012

On line 858 of pickle.py I added:

from pprint import pprint
print self
pprint (vars(self))
dispatch[key](self)


<sklearn.externals.joblib.numpy_pickle.NumpyUnpickler instance at 0x0000000006F7C248>
{'_dirname': 'H:/RGB-D_2D/features',
'_filename': 'rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy',
'append': <built-in method append of list object at 0x0000000006F7D248>,
'file_handle': <open file 'H:/RGB-D_2D/features/rgbd_leaveout0_bc_data_100pix_100x100_32bit_fortran.npy', mode 'rb' at 0x0000000006EDBE40>,
'mark': <object object at 0x0000000001DC80E0>,
'memo': {},
'mmap_mode': 'c',
'np': <module 'numpy' from 'C:\Python27\lib\site-packages\numpy\__init__.pyc'>,
'read': <built-in method read of file object at 0x0000000006EDBE40>,
'readline': <built-in method readline of file object at 0x0000000006EDBE40>,
'stack': []}

I have no need for X_argsorted because I'm using the lazy argsort branch :)

After running nosetests I don't see a summary report (except the OK at the end), so I checked every statement looking for failures, and this looked like the only possible problem:
joblib.test.test_numpy_pickle.test_numpy_persistence ... Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(2.57e-322)> ignored

@bdholt1
bdholt1 commented Sep 10, 2012

When you say "memmap it directly upstream directly in float32 / Fortran array", I think I've done that in that the data stored on disk is in float32 fortran as a binary .npy file. Will that work?

@ogrisel
ogrisel commented Sep 10, 2012

When you say "memmap it directly upstream directly in float32 / Fortran array", I think I've done that in that the data stored on disk is in float32 fortran as a binary .npy file. Will that work?

If you used the default numpy.save, I don't think there is an easy way to memmap to it directly. So you will need to either write your own memmap serialization code or use joblib.dump / joblib.load as said previously.

About the KeyError: apparently under Windows multiprocessing is passing more stuff to the child processes as pickles (probably due to the difference in the fork operation). I need to understand why it's trying to pickle the joblib.numpy_pickle.NumpyUnpickler class itself, as it sounds useless to me. I will probably need to debug that on a Windows VM. Can you please post a minimalistic Python script (with as little data as possible, possibly generated using np.random.randn) that reproduces this error as a new http://gist.github.com/ ?

@ogrisel
ogrisel commented Sep 10, 2012

BTW:

>>> print "\x93"
?

Sounds appropriate :)

@bdholt1
bdholt1 commented Sep 10, 2012

Ok, I think I made a mistake by not using joblib.dump. Now that I have done it, I no longer experience the problem with joblib.load mentioned above.

Now it should be working but the behaviour is a bit weird. Watching my memory usage at
X = joblib.load(x_filename, mmap_mode='c') it doesn't appear to be loading anything into memory. Is this right? Should it simply point to data on disk and load when needed?

@ogrisel
ogrisel commented Sep 10, 2012

Now it should be working but the behaviour is a bit weird. Watching my memory usage at
X = joblib.load(x_filename, mmap_mode='c') it doesn't appear to be loading anything into memory. Is this right? Should it simply point to data on disk and load when needed?

This is right; this is the beauty of memmapping: data is loaded (page by page) by the kernel (and cached by the kernel's disk cache) only when needed by processes addressing this virtual memory segment.
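A quick way to see this (illustrative file name):

import numpy as np
from joblib import dump, load

# dump an ~80MB array once, then reload it as a copy-on-write memmap:
# the load call only maps the file, the kernel reads pages on first access
dump(np.random.randn(int(1e7)), '/tmp/example_data.pkl')
X = load('/tmp/example_data.pkl', mmap_mode='c')
print(type(X))       # numpy.core.memmap.memmap -- no data read yet
print(X[:5].sum())   # only the touched pages are read from disk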

@glouppe
glouppe commented Sep 10, 2012

I don't think this is directly related, but we got the following error when we consider datasets that are too large (from 234000*1728 and beyond).

python2.6: ./Modules/cStringIO.c:419: O_cwrite: Assertion `oself->pos + l < 2147483647' failed.

We already had that bug before, but it disappeared without any good reason... From what I googled, this seems to come from Python itself, which does not handle such large objects... :/ There has been an open ticket since 2009 concerning that bug but no one seems to have solved it.

http://bugs.python.org/issue7358

@ogrisel
ogrisel commented Sep 10, 2012

@glouppe Do you have the full traceback? What object is this? A numpy array? If so it should have been converted to a memmap using the joblib.numpy_pickle module, which should be able to handle very large numpy arrays.

@glouppe
glouppe commented Sep 10, 2012

I don't have any traceback. It does not trigger any exception: it crashes and exits. I am just building an ExtraTrees classifier with joblib replaced by yours. I am currently investigating at which step it crashes...

@bdholt1
bdholt1 commented Sep 10, 2012

@ogrisel from my tests now with an ExtraTreesRegressor, X is loaded and remains on disk with X = joblib.load(x_filename, mmap_mode='c'), but the code that sets up forest._parallel_build_trees (forest.py line 254):

all_trees = Parallel(n_jobs=n_jobs, verbose=self.verbose)(
    delayed(_parallel_build_trees)(
        n_trees[i],
        self,
        X,
        y,
        self.random_state.randint(MAX_INT),
        verbose=self.verbose)
    for i in xrange(n_jobs))

is causing memory to be allocated proportional to the number of jobs. Bootstrap is set to false to ensure that trees train on the same underlying data.

Is that expected?

@glouppe
glouppe commented Sep 10, 2012

@ogrisel Oops, false alarm. Python was not using the right version of sklearn/joblib. It now passes with a dataset of that size. We will be testing soon with even larger datasets.

@bdholt1
bdholt1 commented Sep 10, 2012

Maybe I'm missing something; possibly I'm still using multiprocessing.Pool instead of joblib.pool.PicklingPool. @glouppe can you create a gist of the program you ran earlier?

@ogrisel
ogrisel commented Sep 10, 2012

@bdholt1 you just have to use the joblib.Parallel of this branch by replacing the sklearn.externals.joblib folder with a symlink to the clone of this joblib repo + branch. joblib.Parallel has been updated to use joblib.pool.MemmapingPool instead of multiprocessing.pool.Pool by default.

@glouppe
glouppe commented Sep 10, 2012

Using a dataset larger than 2GB (476000*1728), I got the following error:


Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 238, in save
    obj, filename = self._write_array(obj, filename)
  File "/usr/lib64/python2.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 199, in _write_array
    self.np.save(filename, array)
  File "/usr/lib64/python2.6/site-packages/numpy/lib/npyio.py", line 411, in save
    format.write_array(fid, arr)
  File "/usr/lib64/python2.6/site-packages/numpy/lib/format.py", line 409, in write_array
    array.tofile(fp)
ValueError: 274176000 requested and 169876470 written

/usr/lib64/python2.6/pickle.py:286: DeprecationWarning: 'i' format requires -2147483648 <= number <= 2147483647
  f(self, obj) # Call unbound method with explicit self
@glouppe
glouppe commented Sep 10, 2012

@bdholt1 It is the same as the one I used for your lazy argsort PR. The only change is that I used mnist3vs8 (MNIST restricted to the '3' and '8' digits only).

@glouppe
glouppe commented Sep 10, 2012

@ogrisel I think this is related to the underlying file system. How can I set the /tmp directory where files are put?

@ogrisel
ogrisel commented Sep 10, 2012

Use the temp_folder kwarg of Parallel (it was added recently, you might need to git pull). I can confirm I can reproduce the same error on OSX by using the /tmp folder, which is apparently on a swap or in-memory partition, hence with limited size.
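For example ('/data/tmp' is just a placeholder path on a partition with enough free space):

import numpy as np
from joblib import Parallel, delayed

data = np.random.randn(int(1e7))   # ~80MB, above the auto-memmap threshold

# large arrays are dumped under temp_folder instead of the default /tmp
results = Parallel(n_jobs=4, temp_folder='/data/tmp')(
    delayed(np.mean)(data) for _ in range(8))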

@bdholt1
bdholt1 commented Sep 10, 2012

It works! Once I had set temp_folder to a place that had space, I can see that my 4 python instances are now sharing 6.5GB between them (I halved the dataset size) and then each has its own private memory.

Great job @ogrisel, I'm sure this is going to make a huge difference for many users, certainly myself. Thanks very much!

@glouppe
glouppe commented Sep 10, 2012

It now works for me as well with a custom temp_folder. We'll definitely have to add that parameter to the forest models, or as a global parameter within scikit-learn (I don't know exactly how though).

@ogrisel
ogrisel commented Sep 10, 2012

I think it's good to have a programmatic parameter. Otherwise this uses the tempfile module of the Python standard library, which means that users always have the ability to set the default temp folder using an environment variable: TMPDIR, TEMP or TMP.

@ogrisel
ogrisel commented Sep 10, 2012

It works! Once I had set temp_folder to a place that had space, I can see that my 4 python instances are now sharing 6.5GB between them (I halved the dataset size) and then each has its own private memory.

@bdholt1 great news! Was this using Win7 64 bits? If so, that would spare me a Windows testing session :)

@bdholt1
bdholt1 commented Sep 10, 2012

@ogrisel yes, I'm on Windows 7, EPD 64-bit (7.3.1) with Python 2.7 + cygwin. I use the EPD-provided toolchain for g++ etc. to ensure compatibility.

@bdholt1
bdholt1 commented Sep 10, 2012

@ogrisel, is it possible for the main thread to release the reference to the array object once it's been dumped to the temp folder? It seems like my master thread is holding onto the object as well as the workers reading the memmapped file from the temp folder, meaning that it's unnecessarily using twice as much memory!

@ogrisel
ogrisel commented Sep 10, 2012

In your caller script you should del the unneeded variable holding the reference before calling the joblib.Parallel operations. You can maybe also force a gc:

import gc
gc.collect()

Anyway, it's probably better to run a first script to preprocess the original dataset and store it on the drive in a memmapable format with the right dtype and C/Fortran alignment, and then a second script that does the parallel learning directly using the memmap (without ever allocating the data in memory using numpy.array).
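For illustration, the two-script pattern could look like this (paths and file names are placeholders):

# prepare_data.py -- hypothetical preprocessing script, run once
import numpy as np
from joblib import dump

X = np.load('raw_data.npy')                   # placeholder source file
X = np.asfortranarray(X.astype(np.float32))   # right dtype / memory layout
dump(X, '/data/tmp/X_float32_fortran.pkl')

# train.py -- hypothetical learning script: X is never allocated in memory here
from joblib import load

X = load('/data/tmp/X_float32_fortran.pkl', mmap_mode='c')
# ... hand X over to the estimator / Parallel call from here on ...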

@bdholt1
bdholt1 commented Sep 10, 2012

I'm doing exactly what you've suggested in the second part: I've saved my fortran 32bit data using joblib.dump and I'm now loading it with X = joblib.load(x_filename, mmap_mode='c'). As I said before, after this call memory usage doesn't increase so it seems to be doing the right thing. But it always tries to dump this very same data into temp_folder and then read it from there. Is there any way to prevent that? Or is that the code path that is always taken?

@ogrisel
ogrisel commented Sep 10, 2012

As I said before, after this call memory usage doesn't increase so it seems to be doing the right thing. But it always tries to dump this very same data into temp_folder and then read it from there. Is there any way to prevent that? Or is that the code path that is always taken?

memmap instances should not be dumped into new files and should not trigger any memory copy. I will have a look to see if I can reproduce this.

@bdholt1 bdholt1 commented on the diff Sep 10, 2012
joblib/pool.py
+ self._recv = recv = self._reader.recv
+ racquire, rrelease = self._rlock.acquire, self._rlock.release
+
+ def get():
+ racquire()
+ try:
+ return recv()
+ finally:
+ rrelease()
+
+ self.get = get
+
+ if self._reducers:
+ def send(obj):
+ buffer = BytesIO()
+ CustomizablePickler(buffer, self._reducers).dump(obj)
@bdholt1
bdholt1 added a note Sep 10, 2012

It seems that this is where the queue tries to transmit the data through the pipe, and always dumps the data to a file even if it is already memmaped....

@ogrisel
ogrisel added a note Sep 10, 2012

The reducer registered for the type np.memmap should prevent that.
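For reference, a rough sketch of what such a reducer looks like (illustrative only, not the exact joblib implementation; the real one also handles views and the temp-folder dumping):

import numpy as np

def reduce_memmap(a):
    # Illustrative reducer for np.memmap: pickle only the metadata so the
    # child process re-opens the same file instead of receiving a copy.
    mode = 'r+' if a.mode == 'w+' else a.mode   # never truncate in a child
    return (np.memmap, (a.filename, a.dtype, mode, a.offset, a.shape))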

@ogrisel
ogrisel commented Sep 10, 2012

@bdholt1 I have pushed a bunch of small improvements + tests. Can you run the tests on your box again? I cannot reproduce your problem on my MacOSX box (even before the latest fixes).

If you still have the unnecessary FS-dump issue, could you please provide a minimalistic reproduction script on http://gist.github.com ?

@bdholt1
bdholt1 commented Sep 11, 2012

@ogrisel This has been a very interesting exercise for me. I updated to your latest push and then added a few print statements in ArrayMemmapReducer.__call__ to see what is happening.

def __call__(self, a):
        print "__call__ ", a.shape
        if isinstance(a, np.memmap):
            print "a is memmap"
            # np.memmap is a subclass of np.ndarray that does not need to be
            # dumped on the filesystem
            return reduce_memmap(a)
        if a.nbytes > self.max_nbytes:
            print "dumping a to file"
            # check that the folder exists (lazily create the pool temp folder
            # if required)
            if not os.path.exists(self.temp_folder):
                os.makedirs(self.temp_folder)

            # Find a unique, concurrent safe filename for writing the
            # content of this array only once.
            basename = "%d-%d-%d-%s.pkl" % (
                os.getpid(), id(threading.current_thread()), id(a), hash(a))
            filename = os.path.join(self.temp_folder, basename)

            # In case the same array with the same content is passed several
            # times to the pool subprocess children, serialize it only once
            if not os.path.exists(filename):
                dump(a, filename)

            # Let's use the memmap reducer
            return reduce_memmap(load(filename, mmap_mode=self.mmap_mode))
        else:
            print "dumping a through default pickler"
            # do not convert a into memmap, let pickler do its usual copy with
            # the default system pickler
            return (loads, (dumps(a, protocol=HIGHEST_PROTOCOL),))

The test script has 3 parts. Firstly it creates a regression dataset (larger than 1MB) and dumps the features to a memmaped file and the targets to a npy file.

Then there are 2 forest routines. The first is a scaled-down version of what's in forest.py and simply creates the trees in parallel. Here is the instrumented output (I kill the process after the threads have spawned):

__call__ (1000000L,)
dumping a to file
__call__ (1000000L,)
dumping a to file
__call__ (1000000L,)
dumping a to file

as you can see, it's dumping the labels to a memmapped file but not touching the features. This is exactly the behaviour I expect.

The second function creates an ExtraTreesRegressor. Here is the instrumented output:

__call__ (624L,)
dumping a through default pickler
__call__ (1000000L, 100L)
dumping a to file
__call__ (1000000L, 1L)
dumping a to file
__call__ (624L,)
dumping a through default pickler
__call__ (1000000L, 100L)
dumping a to file
__call__ (1000000L, 1L)
dumping a to file
__call__ (624L,)
dumping a through default pickler
__call__ (1000000L, 100L)
dumping a to file
__call__ (1000000L, 1L)
dumping a to file

Can you see that it's trying to memmap both the features and the targets? This is what I was seeing yesterday. What's not clear to me is why forest1 works correctly and why ExtraTrees doesn't. The main difference I see is passing self into forest._parallel_build_trees, but I don't know what effect that might have, except to account for the length-624 array being passed into the threads.

In any event, it seems that your code does what it is supposed to, but @glouppe and I probably need to refactor forest.py so as to prevent this behaviour.

@ogrisel
ogrisel commented Sep 11, 2012

The following:

__call__ (1000000L, 100L)
dumping a to file

means that the data has already been copied to a numpy.ndarray prior to the call to Parallel. You should print the type of this object in upstream code to find the culprit.

@ogrisel
ogrisel commented Sep 11, 2012

@bdholt1 Also, please push your experimental script to http://gist.github.com again so that I can have a look. Blind guessing is not the most efficient way to debug stuff.

@bdholt1
bdholt1 commented Sep 11, 2012

@ogrisel I created a gist and linked to it in my post (the words test script are a hyperlink), but here it is again: https://gist.github.com/3696748

@bdholt1
bdholt1 commented Sep 11, 2012

Turns out that my "hyperlink" was to a git repo. I've fixed that. Sorry!

@bdholt1
bdholt1 commented Sep 11, 2012

@ogrisel I've found the culprit: X, y = check_arrays(X, y, sparse_format="dense")

before: type of X <class 'numpy.core.memmap.memmap'>
after: type of X <type 'numpy.ndarray'>

@bdholt1
bdholt1 commented Sep 11, 2012

More specifically, array = np.asarray(array, dtype=dtype) will convert a memmap to an ndarray. Do you think it's worth changing validation.py so as not to execute that line if the array is memmapped?
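For illustration, something along these lines (hypothetical helper name, sketch only, not actual sklearn code):

import numpy as np

def _asarray_preserving_memmap(array, dtype):
    # keep np.memmap instances untouched when the dtype already matches,
    # so they stay recognizable by the pool's memmap reducer; otherwise
    # fall back to the usual conversion
    if isinstance(array, np.memmap) and array.dtype == dtype:
        return array
    return np.asarray(array, dtype=dtype)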

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Sep 11, 2012

@bdholt1 I missed the hyperlink, sorry :) Thanks for the investigation. I will try to have a look at it tonight.

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Sep 12, 2012

@bdholt1 the issue with the extra dumps you observe is caused by this scikit-learn bug: scikit-learn/scikit-learn#1142

I plan to fix it tomorrow during the sprint with @GaelVaroquaux.

@bdholt1
@mluessi
mluessi commented Sep 13, 2012

I did some experiments using this branch and there seems to be a problem when a reduction is applied to the argument of a function and the result is returned. See this gist: https://gist.github.com/3717046

On the master branch it works as expected but on this branch, Parallel() returns a list of memmap-arrays and the values are all the same, i.e.,

(mismatch 100.0%)
x: array([ 0.35186981, 0.35186981, 0.35186981, 0.35186981])
y: array([ 0. , 0.49930411, 0.99860822, 1.49791233])

It works if I cast the values to float (line 12). So far I haven't been able to figure out what is causing this.

@ogrisel
ogrisel commented Sep 14, 2012

Thanks for the report @mluessi. This is caused by a bug in np.memmap when dealing with ufuncs and operators: instead of returning np.ndarray, they return np.memmap instances that are actually not mapped to the underlying file.

In [1]: import numpy as np

In [2]: a = np.memmap('/tmp/a', shape=3, dtype=np.float32, mode='w+')

In [3]: a.fill(1.)

In [4]: np.mean(a)
Out[4]: memmap(1.0)

In [5]: b = 3 * np.mean(a)

In [6]: b
Out[6]: memmap(3.0)

In [10]: a.filename
Out[10]: '/tmp/a'

In [11]: b.filename
Out[11]: '/tmp/a'

In [12]: a
Out[12]: memmap([ 1.,  1.,  1.], dtype=float32)

In [13]: b
Out[13]: memmap(3.0)

We indeed need to find a way to work around that numpy bug before merging this branch.

@ogrisel
ogrisel commented Sep 14, 2012

I found a cheap way to detect those buggy cases:

In [14]: from _multiprocessing import address_of_buffer

In [32]: address_of_buffer(a._mmap)
Out[32]: (4296896512, 12)

In [33]: address_of_buffer(a.data)
Out[33]: (4296896512, 12)

In [34]: address_of_buffer(b.data)
Out[34]: (4343115344, 8)

In [35]: address_of_buffer(b._mmap)
Out[35]: (4296896512, 12)

We will thus be able to pickle the "bad" np.memmap instances as regular np.ndarray as they should have been in the first place.

Thanks again @mluessi for the report.

@ogrisel
ogrisel commented Sep 14, 2012

Actually we don't even have to import _multiprocessing.address_of_buffer:

In [38]: id(a._mmap)
Out[38]: 4344963680

In [39]: id(a.data)
Out[39]: 4346978480

In [40]: id(b.data)
Out[40]: 4346978416

In [41]: id(b._mmap)
Out[41]: 4344963680

Edit: actually if you look at the ids, they are different for a, hence they cannot be used to detect buffer drift events.

@ogrisel
ogrisel commented Sep 14, 2012

@mluessi I pushed a fix for the bug you reported.

@mluessi
mluessi commented Sep 14, 2012

Yep, it works now. Thanks for fixing this :).

@mluessi
mluessi commented Sep 14, 2012

I did some more tests and I'm very impressed by this feature :).

I wonder if it would make sense to add a feature where you tell Parallel() explicitly for which arguments a memmap should be used. Let's say you have a function that takes two arguments arr_1 and arr_2, both large enough that joblib will use a memmap. In the function, arr_2 is modified, so it is read back into memory. Does this create an overhead compared to the situation where joblib only uses a memmap for arr_1 and passes arr_2 "directly"*? If it does create an overhead, it would make sense to use max_nbytes=None and specify explicitly that a memmap should be used for arr_1.

(* I'm not sure how exactly joblib passes arr_2 "directly", I guess this is why I'm asking)

Also, I think it may be "dangerous" to enable memmapping by default. Some users may not have a lot of space in the temporary directory (since they have not been using it so far), so joblib may fill up the filesystem.

Finally, I also noticed that joblib doesn't clean up the temporary files (/tmp/joblib_memmaping_pool..). Is this a bug or is it intended (not yet implemented)?

PS: Not to hijack this PR, but does anyone know if it makes sense (speed-wise) to use a tmpfs (file system mounted in memory) for the temporary folder? EDIT: to answer my own question, on my laptop the speedup is about 20% (4s vs 5s) for a 500MB array and 4 processes.

@ogrisel
ogrisel commented Sep 14, 2012

I wonder if it would make sense to add a feature where you tell Parallel() explicitly for which arguments a memmap should be used. Let's say you have a function that takes two arguments arr_1 and arr_2, both large enough that joblib will use a memmap. In the function, arr_2 is modified, so it is read back into memory. Does this create an overhead compared to the situation where joblib only uses a memmap for arr_1 and passes arr_2 "directly"*?

Hard to say in advance. I think it depends on the speed of the disk.

If it does create an overhead, it would make sense to use max_nbytes=None, and specify explicitly that a memmap should be used for arr_1.

Yes, you can pass max_nbytes=None to disable the autodump feature and pass numpy memmaps yourself only for the data you want.

>>> from joblib import load, dump
>>> filename = "/tmp/my_readonly_data"
>>> dump(data_array, filename)
>>> data_mmap = load(filename, mmap_mode='c')  # you can pass 'r' or 'r+' alternatively
>>> results = Parallel(2, max_nbytes=None)(delayed(some_func)(data_mmap, p) for p in params)

(* I'm not sure how exactly joblib passes arr_2 "directly", I guess this is why I'm asking)

When arrays are not memmapped, they are pickled and streamed through a pipe over to the worker process, which has to deal with them. So if an array is sent 10 times to the workers, it will be pickled and reallocated 10 times.

Also, I think it may be "dangerous" to enable memmapping by default. Some users may not have a lot of space in the temporary directory (since they have not been using it so far), so joblib may fill up the filesystem.

Usually you have more space on the hard drive than in RAM; that's why it sounded to me like a reasonable default.

Finally, I also noticed that joblib doesn't clean up the temporary files (/tmp/joblib_memmaping_pool..). Is this a bug or is it intended (not yet implemented)?

This sounds like a bug but I cannot reproduce it:

>>> from joblib import Parallel, delayed
>>> import numpy as np
>>> Parallel(2, max_nbytes=1e6)(delayed(type)(np.zeros(i)) for i in [int(1e4), int(1e6)])
[numpy.ndarray, numpy.core.memmap.memmap]
>>> !ls /tmp/ | grep joblib

If you use the MemmapingPool directly don't forget to call the terminate method.

PS: Not to hijack this PR, but does anyone know if it makes sense (speed-wise) to use a tmpfs (file system mounted in memory) for the temporary folder?

No idea, it should be benchmarked (probably system dependent). '/tmp' on OSX is indeed such an in-memory folder AFAIK and is smaller than other folders; that's why it might be interesting to pass your own temp_folder to Parallel to point to a real disk folder with a lot of free space.

@mluessi
mluessi commented Sep 14, 2012

@ogrisel thanks for your answers. I think the problem is that MemmapingPool.terminate() never gets called; ogrisel/joblib#1 fixes it for me.

Regarding the tmpfs, on my laptop (Linux) it makes about a 20% difference in speed (see the edit above).

@mluessi mluessi referenced this pull request in ogrisel/joblib Sep 14, 2012
Closed

FIX: call Pool.terminate() #1

@ogrisel
ogrisel commented Sep 18, 2012

@GaelVaroquaux I started to write some narrative documentation. The numpy doctest fixture is still lacking.

@mluessi mluessi referenced this pull request in mne-tools/mne-python Sep 19, 2012
Closed

WIP: use joblib memmapping pool #99

@ogrisel
ogrisel commented Sep 20, 2012

@glouppe @bdholt1 could you please re-run your tests with the current state of this branch? Memory usage and runtimes should be further decreased when the original data is already a numpy.memmap-backed data structure: this case is now detected and the redundant dump is optimized away.

@ogrisel
ogrisel commented Sep 20, 2012

Actually there is a bug. I am on it.

@ogrisel
ogrisel commented Sep 20, 2012

Should be fixed now.

@ogrisel ogrisel referenced this pull request in scikit-learn/scikit-learn Sep 21, 2012
Open

Implement Parallelized SGD as in NIPS 2010 paper #1174

@ogrisel
ogrisel commented Sep 22, 2012

Ok this PR is getting in good shape I think. Reviews of the narrative documentation appreciated.

@GaelVaroquaux I introduced a helpful has_shared_memory utility function. Right now it lives in joblib.pool, but I think it's not really related to the pool. Should we move it and/or promote it to the top level? Note: it requires numpy.

@glouppe
glouppe commented Sep 22, 2012

@ogrisel I am leaving for ECML this week-end and won't have time to re-test it until my return :/ (i.e. not before October 1st)

@ogrisel
ogrisel commented Sep 22, 2012

@glouppe no pbm, have a good ECML :)

@GaelVaroquaux
joblib member

Tests fail on my box (probably due to a different numpy version):

Doctest: parallel_numpy.rst ... Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
  File "/home/varoquau/dev/joblib/joblib/pool.py", line 303, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/varoquau/dev/joblib/joblib/pool.py", line 243, in dispatcher
    reduced = reduce_func(obj)
  File "/home/varoquau/dev/joblib/joblib/pool.py", line 175, in __call__
    return _reduce_memmap_backed(a, m)
  File "/home/varoquau/dev/joblib/joblib/pool.py", line 107, in _reduce_memmap_backed
    offset += m.offset
AttributeError: 'numpy.ndarray' object has no attribute 'offset'

I am on this problem and will see what I can do.

That said, it raises a bigger problem: in this situation the test suite freezes and never finishes. For automated builds, this will be a problem. It is a problem related to error handling in multiprocessing, and I do not know how to solve it.

@GaelVaroquaux
joblib member

OK, shame on me: the error was due to an unclean install of joblib: the parent process was picking up the local joblib install, but the children were picking up the system install. So the problem disappears.

I am getting another test failure, though:

Check that it is possible to reduce a memmap backed array ... FAIL

======================================================================
FAIL: Check that it is possible to reduce a memmap backed array
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/varoquau/dev/joblib/joblib/test/test_pool.py", line 120, in test_memmap_based_array_reducing
    assert_array_equal(b_reconstructed, b)
  File "/home/varoquau/dev/numpy/numpy/testing/utils.py", line 753, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/home/varoquau/dev/numpy/numpy/testing/utils.py", line 677, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

(mismatch 100.0%)
 x: memmap([ 0.,  5.])
 y: memmap([ 3.,  8.])
>>  raise AssertionError('\nArrays are not equal\n\n(mismatch 100.0%)\n x: memmap([ 0.,  5.])\n y: memmap([ 3.,  8.])')

Investigating...

@ogrisel
ogrisel commented Sep 22, 2012

Indeed the multiprocessing failures are a pain to handle. Maybe we could register a suicide thread with a timer (e.g. max 1min) in a setup_module fixture?
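Something along these lines (a rough sketch of the idea, not the actual fixture):

import os
import threading

# abort the whole test process if the multiprocessing tests hang for more
# than one minute, so CI builds cannot stall forever
_autokill_timer = None


def setup_module():
    global _autokill_timer
    _autokill_timer = threading.Timer(60.0, lambda: os._exit(1))
    _autokill_timer.daemon = True
    _autokill_timer.start()


def teardown_module():
    # cancel the suicide timer when the tests finished in time
    _autokill_timer.cancel()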

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Sep 22, 2012

Ok I'll probably do that next week (unless you want to do it first).

@GaelVaroquaux
joblib member

I'll first try to tackle the failure, that doesn't look spurious.

@ogrisel
ogrisel commented Sep 22, 2012

Thanks.

@GaelVaroquaux
joblib member

I must cook lunch so I'll be offline for an hour, but here is a difference that I observed between numpy master (where it's failing) and a numpy version where it works:
working

(Pdb) print b.base
[[  0.   1.   2.   3.   4.]
 [  5.   6.   7.   8.   9.]
 [ 10.  11.  12.  13.  14.]]
(Pdb) print b_reconstructed.base

failing

(Pdb) print b.base

(Pdb) print b_reconstructed.base

I think that there is an optimization in recent numpy where it avoids propagating a chain of 'base' and only keeps the top-most original base. The reason is that it facilitates garbage collection.

I think that that's where the code breaks, but I don't have enough understanding (yet) to know exactly where. Maybe, as you know this code better, you can have a better guess than me.

@ogrisel
ogrisel commented Sep 22, 2012

Alright, I will install numpy master; thanks for the report / investigation. However, I probably won't have time to dive into this before next week. But there's no hurry anyway. Better to take time and make the code clean and robust :)

@GaelVaroquaux
joblib member

I dug a bit more, and the problem is that as the 'base' no longer points to the original memmapped array but to the mmap file itself, the 'byte_bounds' trick used in '_reduce_memmap_backed' cannot be used to find the offset to the file. Indeed, I couldn't figure out a way to know when a view of a memmap was not the original view.

I asked a related question on the numpy mailing list:
http://mail.scipy.org/pipermail/numpy-discussion/2012-September/063978.html

@GaelVaroquaux
joblib member

As you have suggested on the numpy mailing list, a way around this problem is

from _multiprocessing import address_of_buffer
offset = address_of_buffer(b.data)[0] - address_of_buffer(b._mmap)[0]
@ogrisel
ogrisel commented Sep 22, 2012

But that does not solve the case where you also lose the .filename attribute when you have a = np.asarray(some_memmap_instance). In that case you just have access to a.base, which is an mmap instance (with no metadata on how this memory segment was opened in the first place). We could serialize the mmap buffer address (+ dtype + shape + strides) and then remake a new buffer from the subprocess with ctypes and use np.frombuffer, but that sounds more fragile to me.

BTW this is the way multiprocessing handles shared memory (the ctypes buffer from_address thingy).

@GaelVaroquaux
joblib member
@GaelVaroquaux
joblib member

There might not be anything that we can do about that... Doing 'asarray' on a memmap pretty much kills it.

@ogrisel
ogrisel commented Sep 22, 2012

I have a solution:

from ctypes import c_byte
from _multiprocessing import address_of_buffer
from numpy.lib.stride_tricks import as_strided
import numpy as np

filename = '/tmp/some.mmap'
o = np.memmap(filename, dtype=np.float64, shape=1000, mode='w+')
o[:] = np.arange(o.shape[0]) * -1

m = np.memmap(filename, dtype=np.float64, shape=(3, 4), order='C', mode='r+')
m[:] = np.arange(12).reshape(m.shape)
print 'm:', m

a = np.asarray(m[:2, :])
print 'a:', a

addr, _ = address_of_buffer(a.base)

a2 = np.frombuffer((c_byte * a.nbytes).from_address(addr)).astype(a.dtype)
a3 = as_strided(a2, shape=a.shape, strides=a.strides)
print 'a3:', a3

Only the order='F' case in m is left to be dealt with.

But I would rather have used explicit memmapping instead of directly addressing memory.

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Sep 22, 2012

There is probably a bug in my code as it breaks when the 2nd dim is sliced too:

from ctypes import c_byte
from _multiprocessing import address_of_buffer
from numpy.lib.stride_tricks import as_strided
import numpy as np

filename = '/tmp/some.mmap'
o = np.memmap(filename, dtype=np.float64, shape=1000, mode='w+')
o[:] = np.arange(o.shape[0]) * -1

m = np.memmap(filename, dtype=np.float64, shape=(3, 4), order='C', mode='r+')
m[:] = np.arange(12).reshape(m.shape)
print 'm:', m

a = np.asarray(m[:2, 1:3])
print 'a:', a

addr, _ = address_of_buffer(a.base)

a2 = np.frombuffer((c_byte * a.nbytes).from_address(addr)).astype(a.dtype)
a3 = as_strided(a2, shape=a.shape, strides=a.strides)
print 'a3:', a3

But that should be fixable IMHO, as should the Fortran alignment case. I just don't have time to investigate further now.

Is that going to work across processes?

Yes, that works across processes, but only from subprocesses addressing memory that was mmap-allocated by their parent.

I don't like the potential segfaults either.

It's a pity that the mmap objects are such black boxes.

Yes. Maybe we could implement a non-broken numpy.memmap alternative in joblib. We would have control over it. Also, instead of using the python mmap object directly as the array buffer, we could devise our own class that would wrap it, implement the python buffer interface, but also store the filename and offset metadata directly as buffer attributes.

Hence that metadata would be preserved and available in np.asarray(m).base, which would be an instance of our FileBackedMmapBuffer class:

a.base.filename
a.base.offset
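A rough sketch of that idea (hypothetical class and names; it subclasses mmap.mmap to carry the extra attributes rather than reimplementing the buffer interface from scratch):

import mmap
import os

import numpy as np


class FileBackedMmapBuffer(mmap.mmap):
    """Hypothetical sketch: an mmap buffer that remembers its file and
    offset, so the metadata stays reachable even from plain ndarray views."""

    def __new__(cls, filename, length, offset=0, access=mmap.ACCESS_WRITE):
        fd = os.open(filename, os.O_RDWR)
        try:
            buf = super(FileBackedMmapBuffer, cls).__new__(
                cls, fd, length, access=access, offset=offset)
        finally:
            os.close(fd)  # the mapping keeps its own duplicated descriptor
        buf.filename = filename
        buf.offset = offset
        return buf


if __name__ == '__main__':
    with open('/tmp/example_buffer.bin', 'wb') as f:
        f.write(b'\x00' * 8 * 10)

    buf = FileBackedMmapBuffer('/tmp/example_buffer.bin', 8 * 10)
    a = np.frombuffer(buf, dtype=np.float64)
    # the filename / offset metadata travels with the buffer itself
    print(buf.filename)
    print(buf.offset)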
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 25, 2013
examples/parallel_memmap.py
+ folder = tempfile.mkdtemp()
+ samples_name = os.path.join(folder, 'samples')
+ sums_name = os.path.join(folder, 'sums')
+ try:
+ # Generate some data and an allocate an output buffer
+ samples = rng.normal(size=(10, int(1e6)))
+ sums = np.memmap(sums_name, dtype=samples.dtype,
+ shape=samples.shape[0], mode='w+')
+
+ # Dump the input data to disk to free the memory
+ dump(samples, samples_name)
+ samples = load(samples_name, mmap_mode='r')
+
+ # Make sure that the original arrays are no longer in memory before
+ # forking
+ gc.collect()
@GaelVaroquaux
joblib member

Is this actually useful? If it is, we should put it in the Parallel code, just before the fork.

@ogrisel
ogrisel added a note Jul 25, 2013

This is the only pattern you can implement to get zero-copy behaviour.

However, you can only implement it outside of the Parallel call: here the trick works because samples is a local variable and nobody else has a reference to it, so it can be garbage collected before doing the fork. If we were to put that into the Parallel code, the caller would still have a reference to the array, hence gc.collect would not do anything.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 25, 2013
examples/parallel_memmap.py
+def sum_row(input, output, i):
+ """Compute the sum of a row in input and store it in output"""
+ sum_ = input[i, :].sum()
+ print("[Worker %d] Sum for row %d is %f" % (os.getpid(), i, sum_))
+ output[i] = sum_
+
+if __name__ == "__main__":
+ rng = np.random.RandomState(42)
+ folder = tempfile.mkdtemp()
+ samples_name = os.path.join(folder, 'samples')
+ sums_name = os.path.join(folder, 'sums')
+ try:
+ # Generate some data and an allocate an output buffer
+ samples = rng.normal(size=(10, int(1e6)))
+ sums = np.memmap(sums_name, dtype=samples.dtype,
+ shape=samples.shape[0], mode='w+')
@GaelVaroquaux
joblib member

You should add a small comment here explaining why you are creating this array (it is a container for the returned results).

@GaelVaroquaux GaelVaroquaux commented on the diff Jul 25, 2013
joblib/parallel.py
@@ -486,7 +513,12 @@ def __call__(self, iterable):
# Set an environment variable to avoid infinite loops
os.environ['__JOBLIB_SPAWNED_PARALLEL__'] = '1'
- self._pool = multiprocessing.Pool(n_jobs)
+ self._pool = MemmapingPool(
+ n_jobs, max_nbytes=self._max_nbytes,
+ mmap_mode=self._mmap_mode,
+ temp_folder=self._temp_folder,
+ verbose=max(0, self.verbose - 50),
+ )
@GaelVaroquaux
joblib member

If we are not under the right numpy versions, we should probably use a standard Pool here, rather than a MemmapingPool.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 25, 2013
joblib/pool.py
+
+from pickle import HIGHEST_PROTOCOL
+try:
+ from io import BytesIO
+except ImportError:
+ # Python 2.5 compat
+ from StringIO import StringIO as BytesIO
+try:
+ from multiprocessing.pool import Pool
+ from multiprocessing import Pipe
+ from multiprocessing.synchronize import Lock
+ from multiprocessing.forking import assert_spawning
+except ImportError:
+ class Pool(object):
+ """Dummy class for python 2.5 backward compat"""
+ pass
@GaelVaroquaux
joblib member

Joblib doesn't support Python 2.5 anymore. I think that you can simplify the code here.

@ogrisel
ogrisel commented Jul 26, 2013

I did some more manual testing with Python 2 / 3 and numpy 1.7.0 as well, and everything seems to work as expected. I think this is ready for merging.

@ogrisel ogrisel referenced this pull request in scikit-learn/scikit-learn Jul 29, 2013
Closed

Random Forest memory efficiency #2179

ogrisel and others added some commits Sep 2, 2012
@ogrisel ogrisel WIP: started work on a pool with custom pickler 81f2233
@ogrisel ogrisel WIP: more untested work on custom pickler f9edc40
@ogrisel ogrisel prototype custom pickling pool is working a749da8
@ogrisel ogrisel Fix wrong docstring + python 2.5 support d6c4f14
@ogrisel ogrisel Drop useless Unpickler, use HIGHEST_PROTOCOL + fix docstrings 183a011
@ogrisel ogrisel Fix Pool parameter passing to parent class 49a8ebe
@ogrisel ogrisel pep8 4d5157c
@ogrisel ogrisel More work on automated switch to memmap for large arrays b08a46e
@ogrisel ogrisel Cheaper and concurrent safe filenames for dumped arrays 73e4faf
@ogrisel ogrisel New MemmapingPool for automated shared memory of numpy arrays ab94c10
@ogrisel ogrisel MemmapingPool can automatically switch to memmap for large arrays ee4c8f1
@ogrisel ogrisel Better testing for lazy temp folder creation 65f4ef1
@ogrisel ogrisel More tests comments improvements 1d4941c
@ogrisel ogrisel Make it possible to disable auto-memmaping 4997726
@ogrisel ogrisel Enable new memmap support in the Parallel class 082567e
@ogrisel ogrisel Make it possible to plug custom reducers from the Parallel API 7ae929c
@ogrisel ogrisel Make it possible to configure the temporary folder as well 90ea07a
@ogrisel ogrisel One more assertion in the tests fedccc0
@ogrisel ogrisel Make ArrayMemmapReducer more robust in case a np.memmap is passed as …
…argument
7b1631b
@ogrisel ogrisel Make numpy an optional dependency, again. d091593
@ogrisel ogrisel Normalize temp_folder path for MemmapingPool 1c6e2eb
@ogrisel ogrisel Improve docstrings, get rid of dead code, verbose mode 6350510
@ogrisel ogrisel Improved docstring for the Parallel class 96f5152
@GaelVaroquaux GaelVaroquaux MISC: check semaphore support early e0d92be
@GaelVaroquaux GaelVaroquaux BUG: propagate arguments in MemmappingPool ceb0616
@GaelVaroquaux GaelVaroquaux BUG: propagate arguments in PicklingPool 4f736b9
@GaelVaroquaux GaelVaroquaux MISC: name nitckpicking 0d6199f
@GaelVaroquaux GaelVaroquaux Notes and misc from discussion with @ogrisel c55e509
@GaelVaroquaux GaelVaroquaux ENH: non C nor F-con arrays: fail gracefully 48c1bde
@GaelVaroquaux GaelVaroquaux BUG: fix my previous 'improvement'
Import would no longer work as multiprocessing ended up being an integer
55088b2
@GaelVaroquaux GaelVaroquaux MISC: reducers: sequence -> dict 358c562
@ogrisel ogrisel FIX: handle memmap instance with in memory buffers 40df092
@ogrisel ogrisel FIX: call self._pool.terminate() in Parallel f10a6be
@ogrisel ogrisel DOC: some narrative documentation + example for the new memmaping stuff 7901817
@ogrisel ogrisel TST: fixture to skip memmap doctest when numpy is not installed b2738bc
@ogrisel ogrisel TST: fixture for missing multiprocessing too fd87ecf
@ogrisel ogrisel Remove old TODOs a9986d3
@ogrisel ogrisel WIP: transparent support for mmap backed np.ndarray instances c2da715
@ogrisel ogrisel ENH: make test future proof to new memmap reducer 55ad6f2
@ogrisel ogrisel FIX: broken as_strided implementation 0fa99bd
@ogrisel ogrisel typo 3b0f04e
@ogrisel ogrisel Add TODO and fix python 2.5 tests 4cee2fe
@ogrisel ogrisel cosmits e4df9d0
@ogrisel ogrisel Add a new test for memmaping pool on array views on memmaps 72dea4c
@ogrisel ogrisel Add support for non-contiguous sliced memmap + a lot more tests 5084992
@ogrisel ogrisel ENH: bug and always register the smart ArrayMemmapReducer 802712f
@ogrisel ogrisel Typo in doc 63de112
@ogrisel ogrisel Fix typo in example 120de97
@ogrisel ogrisel FIX: do not try to preserve the array class type as it's not always p…
…ossible and useless anyway
0189709
@ogrisel ogrisel Even more tricky original shape for the reconstruction tests c7ca438
@ogrisel ogrisel ENH: add an autokill testing utility to prevent stalling when failing…
… multiprocessing tests
4754821
@ogrisel ogrisel FIX: typo that caused autokill false positive on travis CI 8dec5a0
@ogrisel ogrisel Restore python 3 support 47b6ba2
@ogrisel ogrisel Readonly access raises ValueError instead of RuntimeError in numpy 1.7+ 3c9e6c1
@ogrisel ogrisel Make it possible to disable the autokill feature in interactive debug…
…ging mode
025aab1
@ogrisel ogrisel Make example work with Python 3 4e82e28
@ogrisel ogrisel Make it possible to disable autokill when debugging 06d0bb3
@ogrisel ogrisel DOC: small improvements fb1adcb
@ogrisel ogrisel FIX: make sure we do not leak references to file handles (pipes and s…
…emaphores)
bee4f01
@ogrisel ogrisel renamed has_shared_memory to has_shareable_memory 3377476
@ogrisel ogrisel Skip parallel_numpy.rst doctests when multiprocessing is disabled 8966715
@ogrisel ogrisel Use /dev/shm as the default tmp folder when available 109a8a9
@ogrisel ogrisel Protect test_pool against OS than don't support semaphores c239047
@ogrisel ogrisel Skip numpy memmap pool tests if multiprocessing does not support sema…
…phores
595b5c9
@ogrisel ogrisel Various improvements to the parallel_numpy.rst doc 692670d
@ogrisel ogrisel Add a comment in the inline example aaff79d
@ogrisel ogrisel Remove boilerplate that was only useful for python 2.5 which we don't…
… support anymore
e301216
@ogrisel ogrisel Always do gc.collect() before forking 4af71f3
@ogrisel ogrisel Cleanup duplicated code 331ffb1
@ogrisel ogrisel Disable optimized numpy.memmap support on numpy 1.7.0 which has a reg…
…ression
f46f4c1
@ogrisel ogrisel Apparently numpy 1.7.0 works as well fc7b0de
@ogrisel
ogrisel commented Jul 30, 2013

Rebased on top of master. @GaelVaroquaux I think this is ready to be merged (once travis is green again after the rebase).

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Jul 30, 2013

Which version of numpy are you using? I disabled the mitigation strategy for broken numpy versions. I could not reproduce the issue on the released 1.7.0 nor on the 1.7.1 version.

@ogrisel
ogrisel commented Jul 30, 2013

If you have the exact sha1 of the broken numpy I can try to reproduce on my box as well.

@GaelVaroquaux
joblib member
@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Jul 31, 2013

Are you sure? I don't see it on master.

@GaelVaroquaux GaelVaroquaux merged commit fc7b0de into joblib:master Jul 31, 2013

1 check passed

Details default The Travis CI build passed
@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Jul 31, 2013

:)

What do you think of tagging the current master and synchronizing the joblib version of scikit-learn master on that tag to get a wider testing base?

@GaelVaroquaux
joblib member
@ogrisel
ogrisel commented Jul 31, 2013

We can also do a major release just for that :)

But do as you wish. I am in no hurry.

@jni
jni commented Jul 31, 2013

I'm in a hurry! This is what I've been waiting for for over a year! ;) Thanks guys!

@ogrisel
ogrisel commented Jul 31, 2013

@jni since the version you know, the main difference is that the auto-memmap feature will now use the ramdisk partition /dev/shm when available. That should remove the need to touch the hard drive under Linux. Under OSX and Windows the temp folder is still used by default.

@ogrisel
ogrisel commented Jul 31, 2013

I'm looking forward to any benchmarks, memory profile traces or bug reports.

@mluessi
mluessi commented Jul 31, 2013

Great :) now I can revive mne-tools/mne-python#99 that uses this feature. This should really help with non-parametric statistics which currently uses crazy amounts of memory when running in parallel.

@GaelVaroquaux
joblib member
@mluessi mluessi referenced this pull request in mne-tools/mne-python Jul 31, 2013
Merged

Use joblib memmaping pool (finally :) ) #707

@mluessi
mluessi commented Jul 31, 2013

@GaelVaroquaux yes, I know.. but this feature makes it much easier :)

@GaelVaroquaux
joblib member