mmap objects used in base should expose a filename attribute #470

Closed
GaelVaroquaux opened this Issue Sep 30, 2012 · 24 comments

Projects

None yet

5 participants

@GaelVaroquaux
Contributor

In numpy <= 1.6 it is possible to inspect arrays to find out if they come from memmapped files and which file:

>>> import numpy as np
>>> a = np.ones((10, 10))
>>> np.save('a.npy', a)
>>> b = np.load('a.npy', mmap_mode='r')
>>> c = np.asarray(b)
>>> c.base is b
True
>>> c.base.filename
'/home/varoquau/a.npy'

In 1.7b2, the 'base' of 'c' is now a reference to the Python mmap object, and thus it is now impossible to get back to original filename. This may be an important usecase in parallel computing (e.g. with multiprocessing, or IPython) to control if data is shared between processes or not.

>>> import numpy as np
>>> a = np.ones((10, 10))
>>> np.save('a.npy', a)
>>> b = np.load('a.npy', mmap_mode='r')
>>> c = np.asarray(b)
>>> c.base

I propose to use as a 'base' in the numpy memmap objects something that keeps tracks of the filename (and maybe the offset, although it can be retrieved with a bit of magic). The simplest option would probably to simply use a subclass of mmap.mmap.

I, @GaelVaroquaux, volunteer to implement such a patch, if the strategy above is deemed as acceptable.

@teoliphant
Member

I think it makes sense to have a standard base object interface for all NumPy arrays --- so that if base is not none it points to something with a standard interface.

In other words, I like this but for all NumPy arrays and not just memmap arrays.

Travis Oliphant
(on a mobile)
512-826-7480

On Sep 30, 2012, at 10:50 AM, Gael Varoquaux notifications@github.com wrote:

In numpy <= 1.6 it is possible to inspect arrays to find out if they come from memmapped files and which file:

import numpy as np
a = np.ones((10, 10))
np.save('a.npy', a)
b = np.load('a.npy', mmap_mode='r')
c = np.asarray(b)
c.base is b
True
c.base.filename
'/home/varoquau/a.npy'
In 1.7b2, the 'base' of 'c' is now a reference to the Python mmap object, and thus it is now impossible to get back to original filename. This may be an important usecase in parallel computing (e.g. with multiprocessing, or IPython) to control if data is shared between processes or not.

import numpy as np
a = np.ones((10, 10))
np.save('a.npy', a)
b = np.load('a.npy', mmap_mode='r')
c = np.asarray(b)
c.base
I propose to use as a 'base' in the numpy memmap objects something that keeps tracks of the filename (and maybe the offset, although it can be retrieved with a bit of magic). The simplest option would probably to simply use a subclass of mmap.mmap.

I, @GaelVaroquaux, volunteer to implement such a patch, if the strategy above is deemed as acceptable.


Reply to this email directly or view it on GitHub.

@GaelVaroquaux
Contributor

I think it makes sense to have a standard base object interface for all NumPy
arrays --- so that if base is not none it points to something with a standard
interface.

OK, what kind of field do you think would be useful?

@njsmith
Member
njsmith commented Sep 30, 2012

@teoliphant: Err.... why? Right now the interface is "this is a refcountable Python object", which matches the semantics of "base" (i.e., specifying an object that we will drop a reference to on destruction). Except for UPDATEIFCOPY, which overloads this to mean something else entirely, in which case the interface is "this is a type-and-shape-compatible ndarray". And Gael is suggesting we formalize another overload, for the memmap subclass of ndarray. Plus there are tons of codes out there that set .base to random opaque object handles and stuff. What possible common interface would all of these objects have?

@teoliphant
Member

If we are going to make .base always point to the underlying byte source instead of the most recent object, I think we need some common things we can expect from that object.

Another approach, though, is to just halt the traversal of base back to the first object of that subtype/subclass

I think that would fix the underlying problem Gael is struggling with without defining another buffer or array interface.

Travis

Travis Oliphant
(on a mobile)
512-826-7480

On Sep 30, 2012, at 12:34 PM, njsmith notifications@github.com wrote:

@teoliphant: Err.... why? Right now the interface is "this is a refcountable Python object", which matches the semantics of "base" (i.e., specifying an object that we will drop a reference to on destruction). Except for UPDATEIFCOPY, which overloads this to mean something else entirely, in which case the interface is "this is a type-and-shape-compatible ndarray". And Gael is suggesting we formalize another overload, for the memmap subclass of ndarray. Plus there are tons of codes out there that set .base to random opaque object handles and stuff. What possible common interface would all of these objects have?


Reply to this email directly or view it on GitHub.

@GaelVaroquaux
Contributor

On Sun, Sep 30, 2012 at 11:01:23AM -0700, Travis E. Oliphant wrote:

I think that would fix the underlying problem Gael is struggling with without
defining another buffer or array interface.

Probably, and it does look like a sound and light rule. Although in the
specific example I gave 'c' is of type np.ndarray, whereas its parent is
of type np.memmap. The ndarray object should be able to accept a memmap
as a parent, I believe.

@teoliphant
Member

On Sep 30, 2012, at 1:08 PM, Gael Varoquaux wrote:

On Sun, Sep 30, 2012 at 11:01:23AM -0700, Travis E. Oliphant wrote:

I think that would fix the underlying problem Gael is struggling with without
defining another buffer or array interface.

Probably, and it does look like a sound and light rule. Although in the
specific example I gave 'c' is of type np.ndarray, whereas its parent is
of type np.memmap. The ndarray object should be able to accept a memmap
as a parent, I believe.

So isinstance(c.base, np.ndarray) would work if c.base were a memmap so it would be acceptable according to the proposed rule.

Specifically the proposal is to set the base object of an instance to the top-most object that is still is of the same type as the instance in question.

Would this solve your underlying problem, Gael? It seem like a simple change that solves the bug but also preserves some of the same semantics people might be expecting of "base" in situations like yours.

-Travis

@GaelVaroquaux
Contributor

On Sun, Sep 30, 2012 at 11:13:48AM -0700, Travis E. Oliphant wrote:

So isinstance(c.base, np.ndarray) would work if c.base were a memmap so it
would be acceptable according to the proposed rule.

Yes.

Would this solve your underlying problem, Gael? It seem like a simple change
that solves the bug but also preserves some of the same semantics people might
be expecting of "base" in situations like yours.

I do believe that it would solve my problem. I also think that would
match a bit more the expectation of people. I don't really see a downside
of it. Maybe this change would need to be run by more people, but as far
as I am concerned, it seems a good thing, and would make me happy.

Thanks!

@certik
Contributor
certik commented Nov 14, 2012

This was the behavior in b2:

>>> import numpy as np
>>> a = np.ones((10, 10))
>>> np.save('a.npy', a)
>>> b = np.load('a.npy', mmap_mode='r')
>>> c = b.view()
>>> c.base.filename
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'mmap.mmap' object has no attribute 'filename'

but now after the PR #2708 was merged and backported to the 1.7.x branch by the commit a9c7b38, the behavior is:

>>> import numpy as np
>>> a = np.ones((10, 10))
>>> np.save('a.npy', a)
>>> b = np.load('a.npy', mmap_mode='r')
>>> c = b.view()
>>> c.base.filename
'/tmp/a.npy'

So this problem should be fixed and I am closing this issue. @GaelVaroquaux, please let me know if all is fixed now.

@certik certik closed this Nov 14, 2012
@GaelVaroquaux
Contributor

Sorry, numpy team, I have had to focus on many other things, and
completely lost track of this issue.

I'll try to come back to it. In the mean time, @ogrisel, could you also
have a look: you spent a while on these problems too.

Thanks guys!

@ogrisel
Contributor
ogrisel commented Nov 14, 2012

Thanks for the heads up, I will check that it fixes our problem as soon as I can find some spare cycles.

@ogrisel ogrisel referenced this issue in joblib/joblib Nov 14, 2012
Merged

[MRG] Custom pickling pool for no-copy memmap handling with multiprocessing #44

3 of 3 tasks complete
@ogrisel
Contributor
ogrisel commented Nov 15, 2012

Unfortunately @certik pull request now merged in the master does not solve the issue reported by @GaelVaroquaux:

>>> m = np.memmap('/tmp/a', dtype=np.float32, shape=3, mode='w+')
>>> m.base
<mmap.mmap object at 0x10eaf3a30>

>>> m.view().base
memmap([ 0.,  0.,  0.], dtype=float32)

>>> np.asarray(m).base
<mmap.mmap object at 0x10eaf3a30>

When using np.array the filename / offset info is lost. I think the buffer solution advertised by @GaelVaroquaux in the header of this issue is the most robust way to not loose the original memory mapping metadata. I implemented it at the end of this file:

https://github.com/ogrisel/joblib/blob/72b6bc520ce1fd5fba9639594a9dd0dafba89e68/joblib/numpy_mmap.py#L99

This makes it possible to not use the numpy.memmap anymore and just use pure vanilla ndarrays on mmap buffers.

The use of the FileBackedMmapBuffer class / mmap_array function combo also fixes other issues of the current numpy.memmap arrays such as the following behavior which I find very confusing for the users:

>>> m = np.memmap('/tmp/a', dtype=np.float32, shape=3, mode='w+')

>>> type(m)
<class 'numpy.core.memmap.memmap'>
>>> m.base
<mmap.mmap object at 0x10eaf3cb0>
>>> m.filename, m.offset
('/tmp/a', 0)

>>> m2 = 2 * m 
>>> type(m2)
<class 'numpy.core.memmap.memmap'>
>>> m2.base
array([ 0.,  0.,  0.], dtype=float32)
>>> m2.filename, m2.offset
(None, None)

m2 is using a non-memmaped in-memory buffer while being an instance of the numpy.memmap class.

@ogrisel
Contributor
ogrisel commented Nov 15, 2012

To make it clearer, it is expected that m2 should be a memory copy of m but what is unexpected is for it to be an instance of the np.memmap class. Object oriented class derivation does not work well in this case I think.

@certik
Contributor
certik commented Nov 15, 2012

@ogrisel, thanks for the thorough review. Would you recommend numpy to include code from numpy_mmap.py that you referenced? Or is there anything else that should still be changed in numpy before the release?

I want to make sure that all backwards incompatibilities are resolved before the release.

@ogrisel
Contributor
ogrisel commented Nov 15, 2012

@certik I am not familiar enough with numpy to tell if numpy can be easily fixed so that np.asarray(np.memmap) preserves the offset / filename info in the .base attribute.

The content of the numpy_mmap.py file could be considered a longer term fix (replacement) for the other behavioral issues of the numpy.memmap class. Maybe not for the 1.7 release if it's already underway.

@njsmith
Member
njsmith commented Nov 15, 2012

@certik @ogrisel: Ugh, right, now I remember the discussions. What this PR was supposed to do to work around things in 1.7 was set the .base to be the last array before you hit another type of object. So there's an off-by-one error in #2708; it's iterating one step too far.

@njsmith njsmith reopened this Nov 15, 2012
@GaelVaroquaux
Contributor

@certik : thanks a lot for handling this issue and keeping track of it.

@njsmith : I think that indeed offseting by one #2708 would fix our usecase. I'll have to check more seriously first (I am scheduling some time for this today). Also, it would be good to have a review from the core devs of numpy of the consequences of changing this offset by one in the bigger picture.

@ogrisel : as a side remark, numpy 1.3 does not expose the 'filename' attribute on a np.memmap instance. Thus will have to cater for the lack of filename attribute.

@ogrisel
Contributor
ogrisel commented Nov 15, 2012

@ogrisel : as a side remark, numpy 1.3 does not expose the 'filename' attribute on a np.memmap instance. Thus will have to cater for the lack of filename attribute.

@GaelVaroquaux for numpy 1.3 we will just fallback to in-memory copy of the data. But let's move this part of the discussion to the joblib PR if needed.

@GaelVaroquaux
Contributor

@ogrisel said:

To make it clearer, it is expected that m2 should be a memory copy of m but what is unexpected is for it to be an instance of the np.memmap class. Object oriented class derivation does not work well in this case I think.

Agreed, but I think that such a change in behavior should not happen before 2.0 as it is too invasive and might break code relying on it.

@ogrisel
Contributor
ogrisel commented Nov 15, 2012

I would not advocate to remove numpy.memmap in 1.7, just adding the alternative way to instanciate mmap-backed ndarrays in numpy and maybe a deprecation warning for numpy.memmap recommending to use the new mmap_array instead.

@ogrisel
Contributor
ogrisel commented Nov 15, 2012

But this can be done later too if @certik or @njsmith or someone else knows a simpler way to fix the np.asarray(m).base issue for 1.7.

@GaelVaroquaux
Contributor

But this can be done later too if @certik or @njsmith or someone else knows a
simpler way to fix the np.asarray(m).base issue for 1.7.

Working on it.

@GaelVaroquaux
Contributor

I have issued a pull request ( #2746 ) that should address this issue. It is a very slight modification of @teoliphant 's original code. The new behavior should be validated by the numpy core devs.

Also, @ogrisel , could you please check that it addresses our usecase. I think that it does.

@ogrisel
Contributor
ogrisel commented Nov 16, 2012

As posted there #2746 is indeed solving the issue for my use case.

@certik certik pushed a commit that closed this issue Dec 4, 2012
@GaelVaroquaux GaelVaroquaux FIX: base always refers to the original subclass
Closes #470

The use case is that if b is a memmap, 'np.asarray(b).base' should be a
memmap, and thus we put is to b in such a situation.
e4fe789
@certik certik closed this in e4fe789 Dec 4, 2012
@certik certik added a commit to certik/numpy that referenced this issue Dec 4, 2012
@GaelVaroquaux @certik GaelVaroquaux + certik FIX: base always refers to the original subclass
Closes #470

The use case is that if b is a memmap, 'np.asarray(b).base' should be a
memmap, and thus we put is to b in such a situation.
3190a57
@ogrisel
Contributor
ogrisel commented Dec 5, 2012

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment