
BUG: fix max_rows and chunked string/datetime reading in loadtxt #26762

Merged: 7 commits, Jun 27, 2024

Conversation

@l09rin (Contributor) commented Jun 19, 2024

In the function _read(), which backs the loadtxt() method, large files are read in chunks of at most _loadtxt_chunksize = 50000 lines by calling _load_from_filelike() iteratively.
When max_rows exceeded _loadtxt_chunksize, _load_from_filelike() was still called with max_rows=max_rows instead of max_rows=chunk_size, which can never be greater than _loadtxt_chunksize. As a result, each call loaded max_rows lines while the surrounding loop assumed it had loaded at most _loadtxt_chunksize lines.
The max_rows=max_rows argument at line 1058 has been changed to max_rows=chunk_size, and a test checking loadtxt() with different sizes has been added to numpy/lib/tests/test_loadtxt.py.

Closes #26754.
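For readers following along, here is a simplified sketch of the chunking loop described above (not the actual _npyio_impl.py code; the real loop passes many more arguments to _load_from_filelike()):

```python
# Simplified sketch of the chunked-reading loop in _read(); read_chunk(n)
# stands in for _load_from_filelike(..., max_rows=n).
_loadtxt_chunksize = 50000

def read_rows_in_chunks(read_chunk, max_rows=-1):
    chunks = []
    while max_rows != 0:
        if max_rows < 0:
            chunk_size = _loadtxt_chunksize
        else:
            chunk_size = min(_loadtxt_chunksize, max_rows)

        # The bug: this call used to receive max_rows instead of chunk_size,
        # so a single call could return far more than one chunk's worth of
        # rows while the bookkeeping below assumed at most chunk_size rows.
        data = read_chunk(chunk_size)
        chunks.append(data)

        if max_rows >= 0:
            max_rows -= len(data)
        if len(data) < chunk_size:
            break  # a short chunk means the input is exhausted
    return chunks
```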

…read(), called by loadtxt() method, when files are read in chunks to reduce memory overhead, max_rows lines were always loaded every time, also in the case max_rows>_loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py, to check for the array size loaded for different max_rows, less and greater than _loadtxt_chunksize.
@charris charris changed the title fixed bug in function _read() in file numpy/lib/_npyio_impl.py BUG: fix function _read() in file numpy/lib/_npyio_impl.py Jun 19, 2024
@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Jun 19, 2024
@seberg (Member) left a comment

Thanks, but unfortunately the tests are now correctly failing on datetime.

The test failures are real. Something is wrong with:

test_parametric_unit_discovery

which I suspect is a problem with the chunking (the test also seems to fail to use nrows correctly).
I am not sure what is wrong; maybe it reads the wrong number of rows (which can make some tests fail to get the right unit too).

@@ -1043,3 +1043,13 @@ def test_field_growing_cases():
    for i in range(1, 1024):
        res = np.loadtxt(["," * i], delimiter=",", dtype=bytes)
        assert len(res) == i+1

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000)) # less, equal, greater, twice and more than twice than _loadtxt_chunksize
Review comment (Member):

The line is long; maybe move the comment (even into the function body).
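One way the suggestion could be applied (just an illustration; the merged test may look different):

```python
import pytest

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000))
def test_maxrows_exceeding_chunksize(nmax):
    # nmax values cover less than, equal to, greater than, twice, and more
    # than twice the internal _loadtxt_chunksize of 50000 lines.
    ...
```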

@l09rin (Contributor, PR author) replied:

It seems that when the function is called on a file-like path it works fine, while on a file object it misses the final lines, which also breaks the expected units in test_parametric_unit_discovery, because they should be inferred from the last line...

Reply (Contributor):

> while on a file object it misses the final lines, which also breaks the expected units in test_parametric_unit_discovery, because they should be inferred from the last line...

I think there's actually a bug in the test itself: the nrows parameter of test_parametric_unit_discovery doesn't appear to actually be used. If you add max_rows=nrows to the two calls to loadtxt, you see failures when nrows is <= 50000 (i.e. the _loadtxt_chunksize). Perhaps this is what you were saying above (where the "last line" is dictated by the value of nrows). However, I'd argue that's actually correct behavior: shouldn't the dtype only be determined by the values in the chunk the user requested to be read?
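A small illustration of that argument (hypothetical data, not the actual test from the suite):

```python
from io import StringIO
import numpy as np

# Ten coarse month-resolution entries followed by one day-resolution entry.
data = "\n".join(["2012-03"] * 10 + ["2013-01-15"])

# With max_rows=5 only the coarse rows are requested, so, following the
# argument above, the discovered datetime unit should come from those rows
# alone, not from the finer-grained last line that was never read.
a = np.loadtxt(StringIO(data), dtype="M8", max_rows=5)
print(a.dtype, a.shape)
```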

@l09rin (Contributor, PR author) replied:

Yes, that indeed caused 2 of the 7 test failures... I think the type-determination behaviour is correct, because (after substituting 50000 with nrows in the definition of data) the first part of the test function never fails; it fails only when loadtxt is given a file object. So I think there could be a bug in the chunked reading inside _load_from_filelike() from the _core._multiarray_umath module.

@l09rin (Contributor, PR author) replied:

I don't know if it's useful, but for a file of 50001 lines, inside the loop starting at line 212 of numpy/_core/src/multiarray/textreading/rows.c (in read_rows()), when it tries to read the second chunk (consisting of just one line), the call npy_tokenize(s, &ts, pconfig); reports ts.num_fields == 0 when it should be 1... Could npy_tokenize() have issues handling file objects?
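In case it helps to reproduce, a minimal sketch of the 50001-line case with a real file object (the file name is just an example):

```python
import numpy as np

# One full 50000-line chunk plus a single trailing line.
fname = "chunk_boundary.txt"  # example temporary file name
with open(fname, "w") as fh:
    fh.write("1\n" * 50001)

with open(fname) as fh:
    a = np.loadtxt(fh, dtype=int)

print(a.shape)  # expected (50001,) once the chunked file-object path is fixed
```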

…ions at lines test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check if loadtxt() method loads correctly files as a whole and in chunks. It seems that the function _load_from_filelike() works well with file-like streams, but not with file objects.
…t line 1045; file was converted to iterable, but not accounted for, then _load_from_filelike() was not able to read the stream properly until the end.
…est functions for reading files in chunks...
@l09rin (Contributor, PR author) commented Jun 21, 2024

I found the issue with the loading of files in numpy/lib/_npyio_impl.py at line 1045: before chunking, the file was converted to an iterable, but the variable filelike was not updated accordingly.
Now it seems to work properly, although three tests in the class TestCReaderUnitTests are now failing.
But at first sight there are methods that try to load non-file-like objects as if they were file-like, so if I understood correctly they are expected to fail; is my guess correct?
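For context, a simplified Python analogue of the filelike fix described above (not the actual _npyio_impl.py code; in the real function the flag is forwarded to _load_from_filelike(), which treats file objects and plain iterables differently):

```python
from io import StringIO

def _read_sketch(fh, filelike, chunksize=50000):
    # Toy analogue of the chunked path in _read().
    if filelike:
        fh = iter(fh)      # switch to line-by-line iteration for chunking
        filelike = False   # the missing update: fh is no longer a file object
    rows = []
    while True:
        # Consume at most `chunksize` lines per iteration.
        chunk = [line for _, line in zip(range(chunksize), fh)]
        rows.extend(chunk)
        if len(chunk) < chunksize:  # short chunk: input exhausted
            return rows

# All 50001 lines are read, not just the first chunk.
print(len(_read_sketch(StringIO("1\n" * 50001), filelike=True)))
```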

… arrays within function test_maxrows_exceeding_chunksize()
…rowing_cases() to avoid memory allocation issues when the line grows too much.
@l09rin (Contributor, PR author) commented Jun 22, 2024

I noticed that on a couple of machines some checks failed due to memory allocation issues. Could it be useful/feasible not to have a fixed _loadtxt_chunksize = 50000, but to choose it based on some guess of the length of the lines in the file?
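Purely as an illustration of that idea (hypothetical, not part of this PR or of NumPy), the chunk size could be estimated from a small sample of line lengths:

```python
def guess_chunksize(fh, target_bytes=64 * 1024 * 1024, sample_lines=100,
                    max_chunksize=50000):
    # Hypothetical helper: pick a chunk size aiming at roughly target_bytes
    # of text per chunk, based on the first few lines. Assumes a seekable
    # file object opened in text mode.
    start = fh.tell()
    sample = [fh.readline() for _ in range(sample_lines)]
    fh.seek(start)  # rewind so the actual read starts from the top
    lengths = [len(line) for line in sample if line]
    if not lengths:
        return max_chunksize
    avg_len = sum(lengths) / len(lengths)
    return max(1, min(max_chunksize, int(target_bytes // avg_len)))
```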

@seberg seberg requested a review from rossbar June 25, 2024 07:40
@EngineerEricXie (Contributor) commented:

Sorry to disturb, but I have a few questions regarding the _load_from_filelike function.

I noticed that within the while loop, whose condition is while max_rows != 0, the max_rows variable is set to chunk_size (if max_rows < 0) on each iteration. Shouldn't this result in the function only returning the first chunk of data each time it is called? The current implementation doesn't seem to produce the expected results when reading multiple chunks (which is correct).

@seberg (Member) left a comment

LGTM, modulo a nit. This is a somewhat larger change now, since the chunking was effectively never active, so I am wondering if we should do a more minimal fix for backporting (i.e. keep chunking effectively disabled entirely as it was, and just fix max_rows).

But maybe it's OK; it does seem pretty safe.

numpy/lib/tests/test_loadtxt.py (outdated review thread, since resolved)
@seberg (Member) commented Jun 27, 2024

@EngineerEricXie, it all seems correct to me: it reads until a chunk isn't as large as expected. Do you have an example that fails?

@EngineerEricXie (Contributor) replied:

@seberg I've realized that I misunderstood how the data is read. Sorry for any confusion, and thanks for your response.

@seberg seberg changed the title BUG: fix function _read() in file numpy/lib/_npyio_impl.py BUG: fix max_rows and chunked string/datetime reading in loadtxt Jun 27, 2024
@seberg seberg merged commit 26a2e6c into numpy:main Jun 27, 2024
68 checks passed
@seberg (Member) commented Jun 27, 2024

Thanks for the follow-up fixes and new tests, especially @l09rin!

@l09rin (Contributor, PR author) commented Jun 27, 2024

You're welcome!

charris pushed a commit to charris/numpy that referenced this pull request Jul 3, 2024
…umpy#26762)

* fixed bug at line 1058 in file numpy/lib/_npyio_impl.py; in function _read(), called by loadtxt() method, when files are read in chunks to reduce memory overhead, max_rows lines were always loaded every time, also in the case max_rows>_loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py, to check for the array size loaded for different max_rows, less and greater than _loadtxt_chunksize.

* changed numpy/lib/tests/test_loadtxt.py; added further tests in functions at lines test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check if loadtxt() method loads correctly files as a whole and in chunks. It seems that the function _load_from_filelike() works well with file-like streams, but not with file objects.

* changed value of filelike variable in file numpy/lib/_npyio_impl.py at line 1045; file was converted to iterable, but not accounted for, then _load_from_filelike() was not able to read the stream properly until the end.

* I forgot to add the new version of test_loadtxt.py with the updated test functions for reading files in chunks...

* within file numpy/lib/tests/test_loadtxt.py I reduced the size of the arrays within function test_maxrows_exceeding_chunksize()

* add max_rows=10 in the call of loadtxt() within function test_field_growing_cases() to avoid memory allocation issues when the line grows too much.

* Update numpy/lib/tests/test_loadtxt.py

---------

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Jul 3, 2024

Successfully merging this pull request may close these issues.

BUG: numpy.loadtxt read more lines than specified with max_rows
6 participants