Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow reading from buffered stdout using np.fromfile #12324

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

hmaarrfk
Copy link
Contributor

@hmaarrfk hmaarrfk commented Nov 4, 2018

Previously, np.fromfile would check if the object was an object of RawIO. If it was, then it wasn't seekable.

Unfortunately, (at least in python 3) this doesn't work for stdout as it is buffered but not seekable (at least on linux).

This fixes that by calling file.seekable()

Tests are also added to ensure the behaviour lives on.

Fixes #12309

Patch from 2018, before rebase in 2022:
12324_before_rebase.patch.txt

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 4, 2018

Shippable failed with this test:

=================================== FAILURES ===================================
______________________ TestPathUsage.test_tofile_fromfile ______________________
[gw0] linux2 -- Python 2.7.12 /root/venv/2.7/bin/python

self = <numpy.core.tests.test_records.TestPathUsage object at 0xffff8d904f90>
    def test_tofile_fromfile(self):
        with temppath(suffix='.bin') as path:
            path = Path(path)
            a = np.empty(10, dtype='f8,i4,a5')
            a[5] = (0.5,10,'abcde')
            a.newbyteorder('<')
            with path.open("wb") as fd:
                a.tofile(fd)
            x = np.core.records.fromfile(path,
                                         formats='f8,i4,a5',
                                         shape=10,
                                         byteorder='<')

but I can't seem to find it when i search...

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 4, 2018

Is there a preference between

ret = PyObject_CallMethod(file, "seekable", "");

and

ret = PyObject_CallMethod(file, "seekable", NULL);

@eric-wieser
Copy link
Member

eric-wieser commented Nov 5, 2018

I'd probably opt for passing NULL - I'd expect it to be faster, else CPython wouldn't bother supporting it.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

Ok.

Any clue where the python 2.7 test is? Seems important...

@eric-wieser
Copy link
Member

numpy.core.tests.test_records would be at numpy/core/tests/test_records.py

@charris
Copy link
Member

charris commented Nov 5, 2018

Shippable failed with this test:

I've seen that failure on other PRs, but not all, or even all recent, so I don't know what is the cause. It seems inconsistent, only on shippable, and only on Python 2.7. Might try a rebase to see if that fixes it.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

Sorry my repo was out of sync.

How is it even getting in that test? Path should be None.....

I rebased, lets see what happens.

@tylerjereddy
Copy link
Contributor

The shippable CI uses pathlib for 2.x backporting of the functionality. I guess we could try switching to pathlib2, which is perhaps more well maintained. Or just accept not using pathlib on 2.x & remove that from shippable 2.7 job.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

Thanks @tylerjereddy.

Is this considered a bugfix or feature? If it is a feature, then maybe we can just ignore 2.7?

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

@charris didn't mean to ignore your comment. Thanks for chiming in. my webpage was out of sync.

@tylerjereddy
Copy link
Contributor

I opened another PR to try to fix the shippable 2.x issue. In that PR & here I see circleci now starting to fail for some reason!

@eric-wieser
Copy link
Member

eric-wieser commented Nov 5, 2018

How is it even getting in that test? Path should be None.....

I can't see the old shippable output, as the link is discarded every time you rebase - so can't help you. Can you link to the failed output next time?

Edit: https://app.shippable.com/github/numpy/numpy/runs/1301/1/tests, I assume

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

@eric-wieser it seems that shippable is using a backport of pathlib, I didn't know that. Hence the discussion in #12330

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

@eric-wieser adding te comment here since that subthread is annoying.

On the line:

PyErr_SetString(PyExc_IOError,

It detects an error, but keeps going. Shouldn't it return in the if statement?

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

I would also like it if a out parameter was provided to fromfile does that discussion go to the the mailing list, or issue tracker?

@eric-wieser
Copy link
Member

It detects an error, but keeps going. Shouldn't it return in the if statement?

Congrats, you found a bug

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 5, 2018

Lol, shippable passed, then I removed some commit noise (whitespace that I had added) and it failed.

@charris charris changed the title Allow reading from buffered stdout using np.fromfile ENH: Allow reading from buffered stdout using np.fromfile Nov 6, 2018
@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 7, 2018

Thanks @tylerjereddy . This is passing now!!!!

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor nits left.

I don't understand the big picture here - why did we care that the file was buffered before, and why do we care that it's seekable now?

'import sys; sys.stdout.write("numpy is not np")'],
stdout=sp.PIPE, bufsize=bufsize)
arr = np.fromfile(p.stdout, dtype=np.uint8,
count=len('numpy is not np'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to read to completion in the test, rather than specify a length? Alternatively, do you want to add some extra characters, and see if p.stdout.read() returns only the extras?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it will work with unbuffered stdout, but lets see.

numpy/core/tests/test_longdouble.py Outdated Show resolved Hide resolved
arr = np.fromfile(p.stdout, dtype=np.uint8,
count=len('numpy is not np'))
assert_array_equal(arr.tostring(),
np.asarray('numpy is not np', dtype='|S'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a weird test. The return type of tostring (aka tobytes) is bytes, so I'd compare it against a bytes literal, not against a 0d array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 7, 2018

I think we care that it is seekable because on failure it would reset the location of the file. Not really sure though.

The comments indicated that they were using the buffered test as a proxy to seekability.

@hmaarrfk hmaarrfk force-pushed the from_file_stdout branch 4 times, most recently from 73047f3 to 538f414 Compare November 7, 2018 15:19
@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 7, 2018

@eric-wieser It seems that the code that tries to guess how long a file is also uses seek, but doesn't check the seekable parameter. It seems difficult to tell the size of the stdout buffer otherwise.

Basically multiarraymodule.c should check the output of originpos and nin somewhere in:

    fp = npy_PyFile_Dup2(file, "rb", &orig_pos);
    if (fp == NULL) {
        Py_DECREF(file);
        return NULL;
    }
    if (type == NULL) {
        type = PyArray_DescrFromType(NPY_DEFAULT_TYPE);
    }
    ret = PyArray_FromFile(fp, type, (npy_intp) nin, sep);

and return a resonable error message telling the user that they must specify count for unseekable file buffers. Maybe I'll get to it this weekend.

In my mind, a patch would include a test that raises an exception when the count parameter isn't provided when trying to read from the stdout buffer.

@hmaarrfk hmaarrfk force-pushed the from_file_stdout branch 3 times, most recently from a2e78a1 to 482b315 Compare November 7, 2018 20:07
@hmaarrfk
Copy link
Contributor Author

hmaarrfk commented Nov 7, 2018

Thanks eric, seems like there is a serious bug that was found due to the tests you requested. I'll have to keep looking into it.

@hmaarrfk
Copy link
Contributor Author

@eric-wieser i'm stumped. Basically, it works in pure python as shown by the tests

https://travis-ci.org/numpy/numpy/jobs/453301883#L4540

only 2 "numpy-ic" fail, the two "pythonic" ones pass well. I split the tests so you can do it on your computer by checking out the commit before the mods to the implementation are done.

My ultimate goal here would be to make a well tested fromfile so that I can expand the capabilities with an out parameter.

I couldn't find many tests for fromfile 😕

Base automatically changed from master to main March 4, 2021 02:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Pending author's response
Development

Successfully merging this pull request may close these issues.

from npy_PyFile_Dup2 check the seekable attribute
5 participants