accept file-like objects in np.load memmap mode #3306

Closed
wants to merge 2 commits into
from

Projects

None yet

5 participants

@juliantaylor
Contributor

accept file-like objects in open_memmap

closes gh-3143

@juliantaylor
Contributor

I didn't find any information why it was not allowed originally, commit 05fa6e7 does not give details, maybe np.memmap did not work with file-likes then

@charris
Member
charris commented May 3, 2013

@stefanv My guess is that it was a fix for Python 3.3. It might not be needed if an integer return is checked for.

@rkern
Member
rkern commented May 3, 2013

You are testing with a bona-fide file object (or the Python 3 equivalent), not a generic file-like object like a GzipFile which cannot be memmapped. That's the intention behind the restriction.

@juliantaylor
Contributor

gzip file is handled by a different function in np.load, it never goes into open_memmap.
open_memmap is not documented, so not part of the api (?), do we really need to check for gzip there?

@rkern
Member
rkern commented May 3, 2013

We're not "checking for gzip". We're documenting the restrictions on valid input to the function. You can't memory map onto arbitrary file-like objects.

open_memmap is documented. It is part of the numpy.lib.format API. It's not just for implementing numpy.load.

@juliantaylor
Contributor

I'm not sure what you are trying to tell me.
That open_memmap should not work on file descriptors at all?
Or just that the documentation should be improved to say it won't work as expected on compressed npy files and NpzFile should be used instead and that it will generally only work on real file like objects, no sockets, fifos etc.?

@rkern
Member
rkern commented May 3, 2013

You can update the documentation and the code to allow memmap-able file objects but there still needs to be documentation and hopefully an early check for general non-memmap-able file-like objects. Probably you can check that filename.read exists and filename.fileno() exists and returns an integer. An early check which raises a helpful explanatory exception is preferable to an obscure AttributeError from numpy.memmap().

The reason that it was originally restricted to string filenames is that when you are given a valid file object, you can usually get the string filename from it easily, so there wasn't much point to doing more thorough testing for the limited amount of flexibility you get. As you have demonstrated elsewhere, it appears that in Python 3, we can now get valid memmap-able file objects (well, io.BufferedReader objects now) that don't have an easily accessible string filename. So now is a good time to relieve some of that restriction.

Note that "file-like object" is a much broader term in Python parlance than you seem to be thinking of. It refers to any Python object with a minimal complement of methods like .read(), .seek(), and .tell(). There are lots of them in the Python world that have nothing to do with files or sockets or any other service provided by the OS.

@charris
Member
charris commented May 4, 2013

@juliantaylor I'd like to get this in, @rkern suggestions look good to me.

@juliantaylor
Contributor

I'm not so sure anymore, allowing arbitrary file descriptors does allow for a lot of mistakes, and as rkern said you can get the filename from a valid open descriptor easily.

The only real reason I see to allow descriptors is if you are at the system limit of allowed descriptors.
I think it may be better to close the pull and the associated issue unless I'm overlooking a real use case.

@charris
Member
charris commented May 4, 2013

I think a more informative error message would be useful however this goes.

@njsmith
Member
njsmith commented May 5, 2013

Being able to memmap file descriptors is handy in all sorts of cases. I'm
not sure how important they are for numpy's purposes, but certainly being
able to memmap an anonymous temporary file seems useful, which was the
original issue. (The whole point of such a file is that it doesn't have a
name -- which means it will be automatically cleaned up as soon as the
program using it exits -- so getting a filename from a descriptor is
certainly not possible in this case, never mind easy.)

On Sat, May 4, 2013 at 6:52 PM, Charles Harris notifications@github.comwrote:

I think a more informative error message would be useful however this goes.


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3306#issuecomment-17442768
.

@juliantaylor
Contributor

memmapping an unnamed file is certainly useful, but this pull is about np.load which does not make sense for these files.

@njsmith
Member
njsmith commented May 5, 2013

Oh, right then. (Sorry, I've been at a workshop all week.)
On 5 May 2013 08:02, "Julian Taylor" notifications@github.com wrote:

memmapping an unnamed file is certainly useful, but this pull is about
np.load which does not make sense for these files.


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3306#issuecomment-17450741
.

@juliantaylor
Contributor

I'm going to close this for now, if someone comes up with a good use case we can reopen.

@FreddieWitherden

In our project we have a strong use case for allowing np.load to be able to memory map arbitrary file objects. Our code makes extensive use of uncompressed .npz (zip) files for our file format which are often several gigabytes in size. As they are uncompressed the .npy files inside this file are prime candidates for memory mapping. Without too much work one can get the offset of each entry inside of the file. Exploiting this, however, requires that np.load accept file objects as opposed to file names with the mmap parameter.

@charris
Member
charris commented Jul 8, 2013

@juliantaylor Want to reopen this?

@dichen001 dichen001 referenced this pull request in dichen001/Paper-Reading Jun 28, 2016
Open

Summary of the 20 issues in Herbsleb's 2014 FSE paper. #6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment