New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accept file-like objects in np.load memmap mode #3306

Closed
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@juliantaylor
Contributor

juliantaylor commented May 3, 2013

accept file-like objects in open_memmap

closes gh-3143

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 3, 2013

I didn't find any information why it was not allowed originally, commit 05fa6e7 does not give details, maybe np.memmap did not work with file-likes then

@charris

This comment has been minimized.

Member

charris commented May 3, 2013

@stefanv My guess is that it was a fix for Python 3.3. It might not be needed if an integer return is checked for.

@rkern

This comment has been minimized.

Member

rkern commented May 3, 2013

You are testing with a bona-fide file object (or the Python 3 equivalent), not a generic file-like object like a GzipFile which cannot be memmapped. That's the intention behind the restriction.

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 3, 2013

gzip file is handled by a different function in np.load, it never goes into open_memmap.
open_memmap is not documented, so not part of the api (?), do we really need to check for gzip there?

@rkern

This comment has been minimized.

Member

rkern commented May 3, 2013

We're not "checking for gzip". We're documenting the restrictions on valid input to the function. You can't memory map onto arbitrary file-like objects.

open_memmap is documented. It is part of the numpy.lib.format API. It's not just for implementing numpy.load.

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 3, 2013

I'm not sure what you are trying to tell me.
That open_memmap should not work on file descriptors at all?
Or just that the documentation should be improved to say it won't work as expected on compressed npy files and NpzFile should be used instead and that it will generally only work on real file like objects, no sockets, fifos etc.?

@rkern

This comment has been minimized.

Member

rkern commented May 3, 2013

You can update the documentation and the code to allow memmap-able file objects but there still needs to be documentation and hopefully an early check for general non-memmap-able file-like objects. Probably you can check that filename.read exists and filename.fileno() exists and returns an integer. An early check which raises a helpful explanatory exception is preferable to an obscure AttributeError from numpy.memmap().

The reason that it was originally restricted to string filenames is that when you are given a valid file object, you can usually get the string filename from it easily, so there wasn't much point to doing more thorough testing for the limited amount of flexibility you get. As you have demonstrated elsewhere, it appears that in Python 3, we can now get valid memmap-able file objects (well, io.BufferedReader objects now) that don't have an easily accessible string filename. So now is a good time to relieve some of that restriction.

Note that "file-like object" is a much broader term in Python parlance than you seem to be thinking of. It refers to any Python object with a minimal complement of methods like .read(), .seek(), and .tell(). There are lots of them in the Python world that have nothing to do with files or sockets or any other service provided by the OS.

@charris

This comment has been minimized.

Member

charris commented May 4, 2013

@juliantaylor I'd like to get this in, @rkern suggestions look good to me.

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 4, 2013

I'm not so sure anymore, allowing arbitrary file descriptors does allow for a lot of mistakes, and as rkern said you can get the filename from a valid open descriptor easily.

The only real reason I see to allow descriptors is if you are at the system limit of allowed descriptors.
I think it may be better to close the pull and the associated issue unless I'm overlooking a real use case.

@charris

This comment has been minimized.

Member

charris commented May 4, 2013

I think a more informative error message would be useful however this goes.

@njsmith

This comment has been minimized.

Member

njsmith commented May 5, 2013

Being able to memmap file descriptors is handy in all sorts of cases. I'm
not sure how important they are for numpy's purposes, but certainly being
able to memmap an anonymous temporary file seems useful, which was the
original issue. (The whole point of such a file is that it doesn't have a
name -- which means it will be automatically cleaned up as soon as the
program using it exits -- so getting a filename from a descriptor is
certainly not possible in this case, never mind easy.)

On Sat, May 4, 2013 at 6:52 PM, Charles Harris notifications@github.comwrote:

I think a more informative error message would be useful however this goes.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3306#issuecomment-17442768
.

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 5, 2013

memmapping an unnamed file is certainly useful, but this pull is about np.load which does not make sense for these files.

@njsmith

This comment has been minimized.

Member

njsmith commented May 5, 2013

Oh, right then. (Sorry, I've been at a workshop all week.)
On 5 May 2013 08:02, "Julian Taylor" notifications@github.com wrote:

memmapping an unnamed file is certainly useful, but this pull is about
np.load which does not make sense for these files.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3306#issuecomment-17450741
.

@juliantaylor

This comment has been minimized.

Contributor

juliantaylor commented May 6, 2013

I'm going to close this for now, if someone comes up with a good use case we can reopen.

@FreddieWitherden

This comment has been minimized.

FreddieWitherden commented Jul 6, 2013

In our project we have a strong use case for allowing np.load to be able to memory map arbitrary file objects. Our code makes extensive use of uncompressed .npz (zip) files for our file format which are often several gigabytes in size. As they are uncompressed the .npy files inside this file are prime candidates for memory mapping. Without too much work one can get the offset of each entry inside of the file. Exploiting this, however, requires that np.load accept file objects as opposed to file names with the mmap parameter.

@charris

This comment has been minimized.

Member

charris commented Jul 8, 2013

@juliantaylor Want to reopen this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment