
ENH: Support for file-like objects in np.fromfile #4505

Closed
ddasilva wants to merge 1 commit from ddasilva:meltingwax/3207

Conversation

@ddasilva (Contributor) commented Mar 17, 2014

Fixes #3207. Reads the file-like object into a tempfile, then proceeds normally.

An object is determined to be "file-like" if it has read and close attributes. If any exceptions occur while calling these two methods, they are propagated upwards. Tests include use cases with StringIO and zipfile.
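
A minimal Python sketch of that duck-typing check (illustrative only; the actual check in this PR is the C-level PyArray_IsFileLike):

    def is_file_like(obj):
        # An object counts as "file-like" if it exposes both a read
        # and a close attribute.
        return hasattr(obj, "read") and hasattr(obj, "close")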

    def test_from_zipfile(self):
        self.x.tofile(self.filename)

        zip_filename = tempfile.mktemp(dir=self.tempdir)

@charris (Member) commented Mar 17, 2014

mktemp has been deprecated since Python 2.3 (security); use mkstemp instead. See the Python documentation.
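
A minimal sketch of the suggested fix (names taken from the test snippet above):

    import os
    import tempfile

    # mkstemp creates and opens the file atomically, avoiding the
    # race condition that makes mktemp insecure.
    fd, zip_filename = tempfile.mkstemp(dir=self.tempdir)
    os.close(fd)  # the test only needs the path, not the descriptor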

@charris (Member) commented Mar 17, 2014

Failure on Python 3: StringIO has moved. You need something like

    if sys.version_info[0] >= 3:
        # In Python3 stderr, stdout are text files.
        from io import StringIO
    else:
        from StringIO import StringIO

But do you want StringIO or BytesIO?
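
(For what it's worth, BytesIO seems the natural fit for binary data; an illustrative sketch, not code from this PR:)

    import io
    import numpy as np

    # fromfile deals in raw bytes, so BytesIO is the matching Python 3
    # buffer type; StringIO holds text and would reject binary data.
    buf = io.BytesIO(np.arange(4, dtype=np.float64).tobytes())
    # np.fromfile(buf, dtype=np.float64)  # would work once this PR lands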

    /* Try handling file-like objects that are not file */
    if (PyArray_IsFileLike(file)) {
        /* create tempfile and fill it with repetitions of obj.read() */
        fp = tmpfile();

@juliantaylor (Contributor) commented Mar 17, 2014

tmpfile is insecure; use mkstemp, which is a POSIX extension. Maybe using the Python API is better.
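
A sketch of the Python-level equivalent (hypothetical; the PR code itself is C): tempfile.TemporaryFile is the secure analogue of C's tmpfile().

    import tempfile

    # TemporaryFile is created securely and removed on close, unlike
    # tmpfile()/mktemp(); fileobj stands in for the file-like argument.
    with tempfile.TemporaryFile() as fp:
        fp.write(fileobj.read())
        fp.seek(0)
        # ... hand fp to the existing FILE*-based reader ...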

    fp = tmpfile();

    while (1) {
        buff = PyObject_CallMethod(file, "read", NULL);

@juliantaylor (Contributor) commented Mar 17, 2014

Reading in chunks would save memory; even better would be reading into a pipe from a thread/select, but portability issues ...
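
A sketch of the chunked-read idea (chunk size arbitrary; fileobj and tmp are hypothetical names for the file-like argument and the temporary file):

    CHUNK = 1 << 16  # 64 KiB; any bounded size avoids slurping the whole file

    while True:
        buff = fileobj.read(CHUNK)
        if not buff:
            break  # EOF
        tmp.write(buff)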

@njsmith (Member) commented Mar 17, 2014

I have to ask -- how hard would it be to teach fromfile to read from file-like objects directly?

All we need is "read", right? It's not like the boilerplate to call a "read" method is that much more complicated than fread().


@juliantaylor (Contributor) commented Mar 17, 2014

For binary files we have no issue; for some string stuff we need fgetc and unget. But I think in total, having custom buffering for that would have been a lot faster than all the hacks we have now.

@njsmith (Member) commented Mar 17, 2014

Writing a little wrapper around file.read that provides fgetc, unget, read, and 1 character of buffering would take like... 15 lines of code.
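
Something along those lines, perhaps (a hypothetical Python sketch; the real wrapper would live in C):

    class ReadWrapper:
        """fgetc/ungetc semantics on top of a bare .read(), with a
        single byte of push-back buffering, as described above."""

        def __init__(self, fileobj):
            self._f = fileobj
            self._pushed = None  # at most one buffered byte

        def fgetc(self):
            if self._pushed is not None:
                c, self._pushed = self._pushed, None
                return c
            return self._f.read(1)  # b'' at EOF

        def ungetc(self, c):
            self._pushed = c

        def read(self, n):
            head = b""
            if self._pushed is not None:
                head, self._pushed = self._pushed, None
            return head + self._f.read(n - len(head))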


@juliantaylor (Contributor) commented Mar 17, 2014

I don't like all this hacking around the issues when dealing with Python file objects. I would prefer we simply change the whole file handling over to the Python API; for the missing ungetc we would need custom buffering.

@ddasilva (Contributor, Author) commented Mar 18, 2014

Thank you @charris, @juliantaylor, and @njsmith for your feedback. I moved forward with changing the whole file handling to the Python API. The code is updated, but it is a major WIP.

However, there is a major issue: files opened through the zipfile module (as in #3207) don't support the tell() and seek() methods. We are relying on those to determine the size of the data before allocating. What would you like to do here? In principle we could read the data into a linked list and then copy it over.

@njsmith (Member) commented Mar 18, 2014

The best bet is probably to either read into a linked list or use repeated doubling:

    length = 1 << 20;  /* or whatever, just a starting point */
    data = malloc(length);
    used = 0;
    while (1) {
        data_read = file.read(...);
        if (len(data_read) == 0)
            break;  /* EOF */
        if (length < used + len(data_read)) {
            while (length < used + len(data_read))
                length *= 2;
            data = realloc(data, length);
        }
        memcpy(data + used, data_read, len(data_read));
        used += len(data_read);
    }
    data = realloc(data, used);  /* trim to final size */


@ddasilva (Contributor, Author) commented Mar 25, 2014

Thank you @njsmith. I talked with @juliantaylor on IRC and we discussed that buffering is also necessary in the next_element implementation. The code must over-pull from obj.read() to ensure the scanf-like parser has access to all necessary bytes for the element.

I want to make sure one thing is clear. The function PyArray_FromFile(FILE *fp, ...) is marked as being part of the NumPy API. Is the intent here to deprecate this function? If not, all the code for reading from file-like objects will live alongside the current FILE*-based implementation.

@njsmith (Member) commented Mar 25, 2014

Potentially PyArray_FromFile could just be a call to PyFile_FromFile/FromFd plus the new code, though.

@ddasilva ddasilva closed this Mar 28, 2014

@ddasilva ddasilva deleted the ddasilva:meltingwax/3207 branch Mar 28, 2014

@ddasilva ddasilva restored the ddasilva:meltingwax/3207 branch Mar 28, 2014

@ddasilva ddasilva reopened this Mar 28, 2014

@charris (Member) commented Mar 28, 2014

Looks like you need to rebase.

    return (PyObject *)ret;

    copy = malloc(sizeof(PyArrayObject_fields));
    memcpy(copy, ret, sizeof(PyArrayObject_fields));

@ddasilva (Author, Contributor) commented Apr 14, 2014

I am unsure about creating a copy and returning it, but testing fails about half the time on 3.x if this part is removed. Through debugging I found that in the failing scenarios the ret pointer would be valid only until this function returned; dereferencing it immediately afterwards in the calling function would segfault. If anyone has any insight, please let me know.

@ddasilva (Contributor, Author) commented Apr 14, 2014

I would like to revisit this PR. I reimplemented the core file handling on top of the Python API. All the old tests pass, and I added a few new ones.

The overall code-sharing design between fromfile and fromstring was retained. The Python API is buffered through a ~4096-byte block implemented as a struct named NPyFileBuff. At the core, the reading now uses fromstr instead of scanfunc.

With this change the last remaining call to scanfunc was removed. Should scanfunc itself be removed? I also noticed that even though PyArray_FromFile(FILE*, …) is labeled as being part of the NumPy API in a comment, it is not exported in any header files. Can it be dropped?
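
For intuition, a Python analogue of that buffering design (NPyFileBuff itself is a C struct; this sketch is illustrative only):

    class BlockBuffer:
        """Refill a fixed ~4096-byte block from obj.read() and serve
        the scanf-like parser one byte at a time from it."""

        BLOCK = 4096

        def __init__(self, fileobj):
            self._f = fileobj
            self._buf = b""
            self._pos = 0

        def getc(self):
            if self._pos == len(self._buf):
                self._buf = self._f.read(self.BLOCK)
                self._pos = 0
                if not self._buf:
                    return b""  # EOF
            c = self._buf[self._pos:self._pos + 1]
            self._pos += 1
            return c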

@njsmith (Member) commented Apr 14, 2014

If an internal function is unused, it should be deleted.

The way the numpy API works, that ALLCAPS_COMMENT actually makes it part of the API -- there's a script that collects those up and autogenerates a header file.

That said, if we no longer support reading from a FILE* then I guess we can make it just return an error or something... ideally we would keep it working for a bit with a deprecation warning, assuming that maintaining it isn't onerous.


ENH: Rewrite np.load(f) to use Python API
Rewrite FILE* based code to rely only on file_obj.read().
@ddasilva (Contributor, Author) commented Apr 24, 2014

Understood. I have removed the code related to scanfunc. As for PyArray_FromFile, I think it is okay to continue supporting it; it's only a dozen lines.

@ddasilva (Contributor, Author) commented May 13, 2014

Hi, this is @meltingwax. I changed my GitHub username to @ddasilva.

Do any core developers have any feedback on the Python-API-based implementation in this PR? Whether the feedback is low- or high-level, I can iterate on this PR and improve it until it meets the requirements to be merged.

Thank you for your time,
Daniel

@rgommers (Member) commented May 28, 2014

@ddasilva this sometimes takes some repeated kicking :)

I don't know this code, so no opinion on internal changes. Test coverage looks good.

@juliantaylor (Contributor) commented May 28, 2014

Currently we are trying to get 1.9 ready, so review time is limited. When it's done, hopefully someone will have a look, but it might take a while.

The memcpy required to work around the failure indicates an issue somewhere, probably a refcounting issue.

@mattip (Member) commented Feb 14, 2019

I think we should close this. tofile and fromfile are fast-and-dirty APIs for saving to disk. Better alternatives exist for serializing ndarrays to BytesIO or zipfiles.

@mattip mattip added the 57 - Close? label Feb 14, 2019

@ddasilva (Contributor, Author) commented Feb 15, 2019

I'm closing this. It's been 5 years and I don't have the time to revisit this right now.

@ddasilva ddasilva closed this Feb 15, 2019
