
BUG: fix max_rows and chunked string/datetime reading in loadtxt #26762

Merged: 7 commits, Jun 27, 2024

Conversation

@l09rin (Contributor) commented Jun 19, 2024

In the function _read(), which backs the loadtxt() method, large files are read in chunks of at most _loadtxt_chunksize = 50000 lines by calling _load_from_filelike() iteratively.
When max_rows exceeded _loadtxt_chunksize, _load_from_filelike() was still called with max_rows=max_rows instead of max_rows=chunk_size, which can never be greater than _loadtxt_chunksize. As a result, each call loaded max_rows lines while the surrounding loop assumed it had loaded at most _loadtxt_chunksize lines.
The max_rows=max_rows argument at line 1058 has been changed to max_rows=chunk_size, and a test checking loadtxt() with different sizes has been added to numpy/lib/tests/test_loadtxt.py.

Closes #26754.
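For readers following along, here is a simplified sketch of the chunking loop described above (not the actual _npyio_impl.py code; the real loop passes many more arguments to _load_from_filelike()):

```python
# Simplified sketch of the chunked-reading loop in _read(); read_chunk(n)
# stands in for _load_from_filelike(..., max_rows=n).
_loadtxt_chunksize = 50000

def read_rows_in_chunks(read_chunk, max_rows=-1):
    chunks = []
    while max_rows != 0:
        if max_rows < 0:
            chunk_size = _loadtxt_chunksize
        else:
            chunk_size = min(_loadtxt_chunksize, max_rows)

        # The bug: this call used to receive max_rows instead of chunk_size,
        # so a single call could return far more than one chunk's worth of
        # rows while the bookkeeping below assumed at most chunk_size rows.
        data = read_chunk(chunk_size)
        chunks.append(data)

        if max_rows >= 0:
            max_rows -= len(data)
        if len(data) < chunk_size:
            break  # a short chunk means the input is exhausted
    return chunks
```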

…read(), called by loadtxt() method, when files are read in chunks to reduce memory overhead, max_rows lines were always loaded every time, also in the case max_rows>_loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py, to check for the array size loaded for different max_rows, less and greater than _loadtxt_chunksize.
@charris charris changed the title fixed bug in function _read() in file numpy/lib/_npyio_impl.py BUG: fix function _read() in file numpy/lib/_npyio_impl.py Jun 19, 2024
@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Jun 19, 2024
@seberg (Member) left a comment

Thanks, but unfortunately the tests are now correctly failing on datetime.

The test failures are real. Something is wrong with:

test_parametric_unit_discovery

which I suspect is a problem with the chunking (the test also seems to fail to use nrows correctly).
I am not sure what is wrong; maybe it reads the wrong number of rows (which can make some tests fail to get the right unit too).

@@ -1043,3 +1043,13 @@ def test_field_growing_cases():
    for i in range(1, 1024):
        res = np.loadtxt(["," * i], delimiter=",", dtype=bytes)
        assert len(res) == i+1

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000)) # less, equal, greater, twice and more than twice than _loadtxt_chunksize
Review comment (Member):

The line is long; maybe move the comment (even into the function body).
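One way the suggestion could be applied (just an illustration; the merged test may look different):

```python
import pytest

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000))
def test_maxrows_exceeding_chunksize(nmax):
    # nmax values cover less than, equal to, greater than, twice, and more
    # than twice the internal _loadtxt_chunksize of 50000 lines.
    ...
```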

@l09rin (Contributor, PR author) replied:

It seems that when the function is called on a file-like path it works fine, while on a file object it misses the final lines, which also breaks the expected units in test_parametric_unit_discovery, because they should be inferred from the last line...

Reply (Contributor):

> while on a file object it misses the final lines, which also breaks the expected units in test_parametric_unit_discovery, because they should be inferred from the last line...

I think there's actually a bug in the test itself: the nrows parameter of test_parametric_unit_discovery doesn't appear to actually be used. If you add max_rows=nrows to the two calls to loadtxt, you see failures when nrows is <= 50000 (i.e. the _loadtxt_chunksize). Perhaps this is what you were saying above (where the "last line" is dictated by the value of nrows). However, I'd argue that's actually correct behavior: shouldn't the dtype only be determined by the values in the chunk the user requested to be read?
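A small illustration of that argument (hypothetical data, not the actual test from the suite):

```python
from io import StringIO
import numpy as np

# Ten coarse month-resolution entries followed by one day-resolution entry.
data = "\n".join(["2012-03"] * 10 + ["2013-01-15"])

# With max_rows=5 only the coarse rows are requested, so, following the
# argument above, the discovered datetime unit should come from those rows
# alone, not from the finer-grained last line that was never read.
a = np.loadtxt(StringIO(data), dtype="M8", max_rows=5)
print(a.dtype, a.shape)
```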

@l09rin (Contributor, PR author) replied:

Yes, that indeed caused 2 of the 7 test failures... I think the type-determination behaviour is correct, because (after substituting 50000 with nrows in the definition of data) the first part of the test function never fails; it fails only when loadtxt is given a file object. So I think there could be a bug in the chunked reading inside _load_from_filelike() from the _core._multiarray_umath module.

@l09rin (Contributor, PR author) replied:

I don't know if it's useful, but for a file of 50001 lines, inside the loop starting at line 212 of numpy/_core/src/multiarray/textreading/rows.c (in read_rows()), when it tries to read the second chunk (consisting of just one line), the call npy_tokenize(s, &ts, pconfig); reports ts.num_fields == 0 when it should be 1... Could npy_tokenize() have issues handling file objects?
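In case it helps to reproduce, a minimal sketch of the 50001-line case with a real file object (the file name is just an example):

```python
import numpy as np

# One full 50000-line chunk plus a single trailing line.
fname = "chunk_boundary.txt"  # example temporary file name
with open(fname, "w") as fh:
    fh.write("1\n" * 50001)

with open(fname) as fh:
    a = np.loadtxt(fh, dtype=int)

print(a.shape)  # expected (50001,) once the chunked file-object path is fixed
```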

…ions at lines test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check if loadtxt() method loads correctly files as a whole and in chunks. It seems that the function _load_from_filelike() works well with file-like streams, but not with file objects.
…t line 1045; file was converted to iterable, but not accounted for, then _load_from_filelike() was not able to read the stream properly until the end.
…est functions for reading files in chunks...
@l09rin (Contributor, PR author) commented Jun 21, 2024

I found the issue with the loading of files in numpy/lib/_npyio_impl.py at line 1045: before chunking, the file was converted to an iterable, but the variable filelike was not updated accordingly.
Now it seems to work properly, although three tests in the class TestCReaderUnitTests are now failing.
But at first sight there are methods that try to load non-file-like objects as if they were file-like, so if I understood correctly they are expected to fail; is my guess correct?
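For context, a simplified Python analogue of the filelike fix described above (not the actual _npyio_impl.py code; in the real function the flag is forwarded to _load_from_filelike(), which treats file objects and plain iterables differently):

```python
from io import StringIO

def _read_sketch(fh, filelike, chunksize=50000):
    # Toy analogue of the chunked path in _read().
    if filelike:
        fh = iter(fh)      # switch to line-by-line iteration for chunking
        filelike = False   # the missing update: fh is no longer a file object
    rows = []
    while True:
        # Consume at most `chunksize` lines per iteration.
        chunk = [line for _, line in zip(range(chunksize), fh)]
        rows.extend(chunk)
        if len(chunk) < chunksize:  # short chunk: input exhausted
            return rows

# All 50001 lines are read, not just the first chunk.
print(len(_read_sketch(StringIO("1\n" * 50001), filelike=True)))
```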

… arrays within function test_maxrows_exceeding_chunksize()
…rowing_cases() to avoid memory allocation issues when the line grows too much.
@l09rin (Contributor, PR author) commented Jun 22, 2024

I noticed that on a couple of machines some checks failed due to memory allocation issues. Could it be useful/feasible not to have a fixed _loadtxt_chunksize = 50000, but to choose it based on some guess of the length of the lines in the file?
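Purely as an illustration of that idea (hypothetical, not part of this PR or of NumPy), the chunk size could be estimated from a small sample of line lengths:

```python
def guess_chunksize(fh, target_bytes=64 * 1024 * 1024, sample_lines=100,
                    max_chunksize=50000):
    # Hypothetical helper: pick a chunk size aiming at roughly target_bytes
    # of text per chunk, based on the first few lines. Assumes a seekable
    # file object opened in text mode.
    start = fh.tell()
    sample = [fh.readline() for _ in range(sample_lines)]
    fh.seek(start)  # rewind so the actual read starts from the top
    lengths = [len(line) for line in sample if line]
    if not lengths:
        return max_chunksize
    avg_len = sum(lengths) / len(lengths)
    return max(1, min(max_chunksize, int(target_bytes // avg_len)))
```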

@seberg seberg requested a review from rossbar June 25, 2024 07:40
@EngineerEricXie (Contributor) commented:

Sorry to disturb, but I have a few questions regarding the _load_from_filelike function.

I noticed that within the while loop, whose condition is while max_rows != 0, the max_rows variable is set to chunk_size (if max_rows < 0) on each iteration. Shouldn't this result in the function only returning the first chunk of data each time it is called? The current implementation doesn't seem to produce the expected results when reading multiple chunks (which is correct).

@seberg (Member) left a comment

LGTM, modulo a nit. This is a somewhat larger change now, since the chunking was effectively never active, so I am wondering if we should do a more minimal fix for backporting (i.e. keep chunking effectively disabled entirely as it was, and just fix max_rows).

But maybe it's OK; it does seem pretty safe.

numpy/lib/tests/test_loadtxt.py (outdated review thread, since resolved)
@seberg (Member) commented Jun 27, 2024

@EngineerEricXie, it all seems correct to me: it reads until a chunk isn't as large as expected. Do you have an example that fails?

@EngineerEricXie (Contributor) replied:

@seberg I've realized that I misunderstood how the data is read. Sorry for any confusion, and thanks for your response.

@seberg seberg changed the title BUG: fix function _read() in file numpy/lib/_npyio_impl.py BUG: fix max_rows and chunked string/datetime reading in loadtxt Jun 27, 2024
@seberg seberg merged commit 26a2e6c into numpy:main Jun 27, 2024
68 checks passed
@seberg (Member) commented Jun 27, 2024

Thanks for the follow-up fixes and new tests, especially @l09rin!

@l09rin (Contributor, PR author) commented Jun 27, 2024

You're welcome!

charris pushed a commit to charris/numpy that referenced this pull request Jul 3, 2024
…umpy#26762)

* fixed bug at line 1058 in file numpy/lib/_npyio_impl.py; in function _read(), called by loadtxt() method, when files are read in chunks to reduce memory overhead, max_rows lines were always loaded every time, also in the case max_rows>_loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py, to check for the array size loaded for different max_rows, less and greater than _loadtxt_chunksize.

* changed numpy/lib/tests/test_loadtxt.py; added further tests in functions at lines test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check if loadtxt() method loads correctly files as a whole and in chunks. It seems that the function _load_from_filelike() works well with file-like streams, but not with file objects.

* changed value of filelike variable in file numpy/lib/_npyio_impl.py at line 1045; file was converted to iterable, but not accounted for, then _load_from_filelike() was not able to read the stream properly until the end.

* I forgot to add the new version of test_loadtxt.py with the updated test functions for reading files in chunks...

* within file numpy/lib/tests/test_loadtxt.py I reduced the size of the arrays within function test_maxrows_exceeding_chunksize()

* add max_rows=10 in the call of loadtxt() within function test_field_growing_cases() to avoid memory allocation issues when the line grows too much.

* Update numpy/lib/tests/test_loadtxt.py

---------

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Jul 3, 2024

Successfully merging this pull request may close these issues.

BUG: numpy.loadtxt read more lines than specified with max_rows
6 participants