BUG: fix max_rows and chunked string/datetime reading in loadtxt
#26762
Conversation
…read(), called by the loadtxt() method: when files are read in chunks to reduce memory overhead, max_rows lines were always loaded on every call, even when max_rows > _loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py to check the size of the loaded array for values of max_rows both less than and greater than _loadtxt_chunksize.
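The chunked-read loop described above can be sketched as follows. This is a hypothetical simplification, not NumPy's actual code (the real logic lives in _read() in numpy/lib/_npyio_impl.py); read_in_chunks and the shrunken _CHUNKSIZE are invented for illustration. The bug was equivalent to requesting max_rows rows on every call instead of capping each request at the chunk size.

```python
import itertools
import numpy as np

_CHUNKSIZE = 4  # stand-in for _loadtxt_chunksize (50000 in NumPy)

def read_in_chunks(lines, max_rows):
    """Read at most max_rows values, at most _CHUNKSIZE at a time."""
    it = iter(lines)
    chunks = []
    rows_left = max_rows
    while rows_left > 0:
        # The fix: cap each request at the chunk size, not at max_rows.
        chunk_size = min(rows_left, _CHUNKSIZE)
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunks.append(np.array(chunk, dtype=float))
        rows_left -= len(chunk)
        if len(chunk) < chunk_size:  # short chunk means end of data
            break
    return np.concatenate(chunks) if chunks else np.empty(0)
```

With the cap in place, requesting 7 rows from 10 lines reads a full chunk of 4 followed by a partial chunk of 3, and requesting more rows than exist simply stops at the end of the data.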
Thanks, unfortunately, the tests are correctly failing on datetime now.
Test failures are real, unfortunately. Something is wrong with:
test_parametric_unit_discovery
which I suspect is a problem with the chunking (the test seems to fail to use nrows correctly, too).
I am not sure what is wrong; maybe it reads the wrong number of rows (which can make some tests fail on getting the right unit too).
numpy/lib/tests/test_loadtxt.py
Outdated
@@ -1043,3 +1043,13 @@ def test_field_growing_cases():
    for i in range(1, 1024):
        res = np.loadtxt(["," * i], delimiter=",", dtype=bytes)
        assert len(res) == i+1

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000))  # less, equal, greater, twice and more than twice than _loadtxt_chunksize
The line is long; maybe move the comment (even into the function body).
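The suggestion above could be realized like this (a hypothetical rearrangement, not the committed code): keep the parametrize line short and move the explanation into the function body.

```python
import pytest

@pytest.mark.parametrize("nmax", (10000, 50000, 80000, 100000, 120000))
def test_maxrows_exceeding_chunksize(nmax):
    # nmax values are less than, equal to, greater than, twice,
    # and more than twice _loadtxt_chunksize (50000).
    ...
```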
It seems that when the function is called on a path or file-like stream it works fine, while on a file object it misses the final lines, which also breaks the unit inference in test_parametric_unit_discovery, because it should infer the units from the last line...
while on a file object it misses end lines, causing also the right units in test_parametric_unit_discovery, because it should infer them from the last line...
I think there's actually a bug in the test itself - the nrows parameter from test_parametric_unit_discovery doesn't appear to actually be used. If you add max_rows=nrows to the two calls to loadtxt, you see failures when nrows is <= 50000 (i.e. the _loadtxt_chunksize). Perhaps this is what you were saying above (where the "last line" is dictated by the value of nrows). However, I'd argue that's actually correct behavior - shouldn't the dtype only be determined by the values in the chunk requested to be read by the user?
Yes, that indeed caused 2 of the 7 test failures... I think the type-determination behaviour is correct, because (after substituting 50000 with nrows in the definition of data) the first part of the test function never fails; it fails only when loadtxt is given a file object. So I think there could be a bug in the chunked reading within the function _load_from_filelike() from the module _core._multiarray_umath.
I don't know if it's useful, but in the case of a file of 50001 lines, within the loop starting at line 212 of numpy/_core/src/multiarray/textreading/rows.c (inside read_rows()), when it tries to read the second chunk (consisting of just one line), the call npy_tokenize(s, &ts, pconfig); reports ts.num_fields == 0, but it should be 1... Could npy_tokenize() have issues handling file objects?
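The 50001-line case described above can be reproduced from Python without touching the C internals. This is a sketch assuming the fixed NumPy: 50000 matches _loadtxt_chunksize, so the second chunk holds exactly one row, and a real file object (not a StringIO stream) is used because that was the failing path.

```python
import os
import tempfile
import numpy as np

n = 50001  # one line more than the chunk size

# Write one number per line to a real file on disk.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("\n".join(str(i) for i in range(n)) + "\n")

# Read it back through a file object, exercising the chunked path.
with open(path) as fh:
    arr = np.loadtxt(fh)
os.unlink(path)
```

On a NumPy with the fix applied, all 50001 rows should come back, including the single row of the second chunk.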
…ions test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check that the loadtxt() method correctly loads files both as a whole and in chunks. It seems that the function _load_from_filelike() works well with file-like streams, but not with file objects.
…t line 1045; the file was converted to an iterable, but this was not recorded, so _load_from_filelike() was not able to read the stream properly until the end.
…est functions for reading files in chunks...
I found the issue with the loading of files in numpy/lib/_npyio_impl.py at line 1045: before chunking, the file was opened as an iterable, but the variable filelike was not updated accordingly.
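The fix described above amounts to keeping the flag in sync with the data. This is a sketch, not NumPy's actual code; prepare_source is an invented helper standing in for the logic around line 1045 of numpy/lib/_npyio_impl.py.

```python
import io

def prepare_source(data, filelike):
    """Wrap an open file as a line iterator for chunked reading."""
    if filelike:
        data = iter(data)    # consume the file line by line
        filelike = False     # the fix: it is no longer a raw file object
    return data, filelike
```

Without the second assignment, the low-level reader would still be told it has a file object while actually receiving an iterator, and would stop short of the end of the stream.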
… arrays within function test_maxrows_exceeding_chunksize()
…rowing_cases() to avoid memory allocation issues when the line grows too much.
I noticed that on a couple of machines some checks failed due to memory allocation issues. Could it be useful/feasible not to have a fixed _loadtxt_chunksize = 50000, but to define it according to some guess of the length of the lines within the file?
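The idea above could look roughly like this. Everything here is hypothetical: guess_chunksize is an invented helper and the 64 MiB budget is an arbitrary number chosen for illustration, not anything NumPy does.

```python
def guess_chunksize(first_line, budget_bytes=64 * 2**20, max_chunk=50000):
    """Estimate a chunk size (in rows) from the length of the first line,
    so that very wide rows get proportionally smaller chunks."""
    per_row = max(len(first_line), 1)  # rough bytes-per-row estimate
    return max(1, min(max_chunk, budget_bytes // per_row))
```

Short lines keep the full 50000-row chunks, while a 10 MB line would shrink the chunk to a handful of rows and avoid huge allocations.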
Sorry for disturbing, but I have a few questions. I noticed that within the while loop, where the condition is …
LGTM, modulo nit. A bit larger change now, since the chunking was effectively never active, so I am wondering if we should do a more minimal fix for backporting (i.e. keep chunking effectively disabled entirely as it was, just fixing max_rows).
But maybe it's OK, it does seem pretty safe.
@EngineerEricXie, it seems all correct to me; it reads until a chunk isn't as large as expected. Do you have an example that fails?
@seberg I've realized that I misunderstood how the data is read. Sorry for any confusion, and thanks for your response.
Thanks for the follow-up fixes and new tests, especially @l09rin!
You're welcome!
…umpy#26762)

* fixed bug at line 1058 in numpy/lib/_npyio_impl.py; in function _read(), called by the loadtxt() method, when files are read in chunks to reduce memory overhead, max_rows lines were always loaded on every call, even when max_rows > _loadtxt_chunksize, in which case it loaded chunks with the wrong size. A test has been added in numpy/lib/tests/test_loadtxt.py to check the size of the loaded array for values of max_rows less than and greater than _loadtxt_chunksize.
* changed numpy/lib/tests/test_loadtxt.py; added further tests in test_maxrows_exceeding_chunksize() and test_parametric_unit_discovery() to check that loadtxt() correctly loads files both as a whole and in chunks. It seems that _load_from_filelike() works well with file-like streams, but not with file objects.
* changed the value of the filelike variable in numpy/lib/_npyio_impl.py at line 1045; the file was converted to an iterable, but this was not recorded, so _load_from_filelike() was not able to read the stream properly until the end.
* added the new version of test_loadtxt.py with the updated test functions for reading files in chunks.
* reduced the size of the arrays in test_maxrows_exceeding_chunksize() in numpy/lib/tests/test_loadtxt.py.
* added max_rows=10 to the loadtxt() call in test_field_growing_cases() to avoid memory allocation issues when the line grows too much.
* Update numpy/lib/tests/test_loadtxt.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Within the function _read(), called by the loadtxt() method, large files are read in chunks of at most _loadtxt_chunksize = 50000 lines, by iteratively calling the function _load_from_filelike().
When max_rows exceeded _loadtxt_chunksize, the latter was still called with the option max_rows=max_rows instead of max_rows=chunk_size (which can never be greater than _loadtxt_chunksize). This caused the function to load chunks of max_rows lines while assuming it had loaded chunks of at most _loadtxt_chunksize lines.
The option max_rows=max_rows at line 1058 has been changed to max_rows=chunk_size, and a test function has been added in numpy/lib/tests/test_loadtxt.py to check loadtxt() for different sizes.
Closes #26754.
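A minimal check in the spirit of the added test can be run directly; the list below is shrunk for speed (NumPy's real chunk size is 50000), so the proportions of the parametrized nmax values are kept rather than their magnitudes.

```python
import numpy as np

# The number of rows returned must equal max_rows for values below,
# at, and above a chunk boundary.
lines = [str(i) for i in range(120)]
sizes = [len(np.loadtxt(lines, max_rows=m)) for m in (10, 50, 80, 100, 120)]
```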