Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: be able to read Stata files without reading them fully into memory #48922

Closed
wants to merge 1 commit into from

Conversation

akx
Copy link
Contributor

@akx akx commented Oct 3, 2022

Fixes #48700
Refs #9245
Refs #37639
Regressed in 6d1541e

@akx akx force-pushed the stata-no-memory branch 2 times, most recently from 7bc15e8 to 1532991 Compare October 3, 2022 15:38
@akx akx marked this pull request as ready for review October 3, 2022 16:36
@twoertwein
Copy link
Member

There seem to be some failing tests on windows:

FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[118] - Perm...
FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[119] - Perm...
FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[None] - Per...
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_labels - Per...
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_label_name_conversion
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_label_convert_categoricals_error

@akx
Copy link
Contributor Author

akx commented Oct 4, 2022

There seem to be some failing tests on windows:

Right, e.g.

       try:
>           self._accessor.unlink(self)
E           PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\32b05830-5662-48d4-9c1a-e0a1002e813c'

looks like not all handles are being correctly, which is apparently a test suite bug, fixed in 99d3540 :)

@mroeschke mroeschke added Performance Memory or execution speed performance IO Stata read_stata, to_stata labels Oct 4, 2022
@akx akx force-pushed the stata-no-memory branch 2 times, most recently from 99d3540 to 121ada5 Compare October 5, 2022 05:27
pandas/io/stata.py Outdated Show resolved Hide resolved
@akx akx requested review from twoertwein and removed request for mroeschke October 7, 2022 07:53
Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, otherwise looks good to me!

@jbrockmendel
Copy link
Member

LGTM cc @bashtage

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test changes indicate that this might be introducing a behavior change with respect to leaving handles open in existing code. Can you verif y this isn't the case and restore the other tests that should work unmodified?

@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion():
with tm.assert_produces_warning(InvalidColumnName):
data.to_stata(path, value_labels=value_labels)

reader = StataReader(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like all of this style of change. StataReader is self-closing. Hsa this changed? If so, that is introducing a new bug to user code that needs to be addressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bashtage StataReader is not self-closing. It has had a close method since 59dd18b (July 2015), to, quote, "ensure closing of the path". (At that point, when you'd pass in a string-like to the ctor, it would open it as a regular file and not buffer things into memory.) You can see 59dd18b also changes some tests to use with.

Some tests either didn't use that, or didn't call close(), which in turn caused cleanup issues on Windows now that we don't buffer everything into memory (which was a regression in 6d1541e, 2020).

Any user code that had not used StataReader() as a context manager had technically been in violation of the protocol established in 59dd18b, it just wouldn't surface since Pandas used to read everything to memory for the last two years.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a strong preference for there to be no test changes aside from those essential for this PR, so please revert these. It is fine to do a follow-up PR to clean the test code to use best practices. This just ensures that there are no visible side effects of other changes, e.g., leaking handles with existing code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.

Copy link
Contributor Author

@akx akx Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior arose from a human mistake in the refactoring in 6d1541e.

You call it a mistake, I call it a coding-efficient choice that was not memory efficient.

I don't think it's first and foremost an optimization as the current behavior entirely prevents using chunked or iterator reading on machines with less available memory than the stata file's size on disk.

The usual response is to get more memory. I think extending the reading to lower memory machines is an enhancement.

I don't think that's true. The test suite changes (where a block already ending with .close() wasn't changed to a with) are:

Most of these self-close. There is only one that would need an explicit call to close if you implement this change only for the case where an iterator is used. (iterator == True or chunksize not None).

Reading a data label and variable labels only, no iterator mode:

It is self-closing in main, so no leaks.

Chunked Stata reader left open:

It is not. Closed since iterator is exhausted.

Chunked Stata reader left open.

Yes. This one could be fixed in the test.

Only value labels being read, no iterator mode (this occurs three times with minor variations):

Should work without modification of the tests since not an iterator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To move this forwards, why not try:

  • implement it only in the case where an iterator is indicated
  • Restore the test suite so we can see the failures
  • Leave in the new tests, but make adapt them so that they are using an iterator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You call it a mistake, I call it a coding-efficient choice that was not memory efficient.

Please recall the reading code before that inadvertent change was memory-efficient.

pandas/pandas/io/stata.py

Lines 939 to 948 in 59dd18b

if isinstance(path_or_buf, (str, compat.text_type, bytes)):
self.path_or_buf = open(path_or_buf, 'rb')
else:
# Copy to BytesIO, and ensure no encoding
contents = path_or_buf.read()
try:
contents = contents.encode(self._default_encoding)
except:
pass
self.path_or_buf = BytesIO(contents)

The usual response is to get more memory.

Which is quite user-hostile when in this case the situation can be fixed with a +61/-26 line diff. (On that note, the person who had the issue with a 30-some-gigabyte Stata file on Stack Overflow reached out to me on Twitter for help, and I suggested trying to patch their Pandas' stata.py with the one from this branch. I think that's friendlier and quite less expensive (monetarily, ecologically, etc.) than to get more memory.)

I think extending the reading to lower memory machines is an enhancement.

See above. Restoring being able to read large files on lower memory machines is a regression fix or a bug fix.

It is self-closing in main, so no leaks.

On main, StataReader buffers to a BytesIO and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader on main (or rather, since 6d1541e) is "self-closing".

It is not. Closed since iterator is exhausted.

Which is technically a bug in itself, see #48922 (comment)

Yes. This one could be fixed in the test.

I'm not following you – why can this one be fixed in tests?

Should work without modification of the tests since not an iterator.

It only works on main without a leak because StataReader on main always buffers into memory, and an unclosed BytesIO does not raise a resource warning. It wouldn't work on any version prior to 6d1541e without a leak occurring. All of these three were added in fd151ba when the inadvertent buffering code was already in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main, StataReader buffers to a BytesIO and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader on main (or rather, since 6d1541e) is "self-closing".

The is exactly my point. The current behavior of StataReader is to never leak file handles. It has been this way for 2 years now. This is why the test suite passes despite some questionable code. Ideally, this behavior should not be changed without a deprecation cycle. It is probably necessary to change it to get the iterator to work without buffering the entire file, so I think it is acceptable to make this change in this limited case. I would also think the docs for iterator and chunksize should be strengthened to tell users that they must use a context manager or close the file themselves.

I do not think it is OK to change it where it isn't essential without letting users know of a change that could lead to errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to focus the fix on #48700 only and not expand the fix to apply to other cases.

block = block.set_index("index")
assert "cats" in block
tm.assert_series_equal(block.cats, df.cats.iloc[2 * i : 2 * (i + 1)])
with StataReader(path, chunksize=2, order_categoricals=False) as reader:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator mode is already self-closing. Why is the context manager introduced here?

pandas/pandas/io/stata.py

Lines 1704 to 1710 in 2402abe

if read_len <= 0:
# Iterator has finished, should never be here unless
# we are reading the file incrementally
if convert_categoricals:
self._read_value_labels()
self.close()
raise StopIteration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, in this test we're reading the reader through, but if we weren't, then it would be good form to close the reader (and the underlying file handle), just like with any file handles in Python.

In fact, you might argue the self.close() there in the reader is inappropriate, since an esoteric user could have e.g. saved the reader's underlying file handle's .seek() position at some point, would then rewind the handle and have the reader read some more... but perhaps that's an esoteric enough use case that it doesn't need to be covered. 😁

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please separate essential test file changes from those that should continue to work without issue after the StataReader changes.

@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion():
with tm.assert_produces_warning(InvalidColumnName):
data.to_stata(path, value_labels=value_labels)

reader = StataReader(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.

@bashtage
Copy link
Contributor

As for a way forward, maybe could consider adding a keyword argument buffer: {True, False, None (or "auto")} with a default of lib.NoDefault.

If True, then continue with the current behavior. If False use the new behavior. If None or "auto", let StattaReader decide. If lib.NoDefault then treat as True and issue a warning that there will be a deprecation warning in some future version. If False but the file requires buffering (e.g., a gzip), then either raise some sort of IOError or warn that buffering can't be used with files that do not support seek. None or "auto" will be the medium run-default. And possibly this keyword could be dropped in some very distant release after a second deprecation cycle.

If using this strategy, then the current test suite would still pass unmodified.

I'm always -1 for adding to the API though.

akx added a commit to akx/pandas that referenced this pull request Nov 2, 2022
akx added a commit to akx/pandas that referenced this pull request Nov 2, 2022
akx added a commit to akx/pandas that referenced this pull request Nov 2, 2022
akx added a commit to akx/pandas that referenced this pull request Nov 3, 2022
@jbrockmendel
Copy link
Member

@akx pretty much all of the maintainers are going to defer to bashtage on this

@akx
Copy link
Contributor Author

akx commented Nov 4, 2022

@akx pretty much all of the maintainers are going to defer to bashtage on this

Yeah no worries @jbrockmendel, we're continuing in #49228 (the v2 of this) :)

@github-actions
Copy link
Contributor

github-actions bot commented Dec 5, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@mroeschke
Copy link
Member

Closing in favor of #49228

@mroeschke mroeschke closed this Dec 17, 2022
akx added a commit to akx/pandas that referenced this pull request Feb 3, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 3, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 3, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 21, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 21, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 22, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 22, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 22, 2023
akx added a commit to akx/pandas that referenced this pull request Feb 23, 2023
mroeschke pushed a commit that referenced this pull request Feb 23, 2023
* CLN: StataReader: refactor repeated struct.unpack/read calls to helpers

* CLN: StataReader: replace string concatenations with f-strings

* CLN: StataReader: prefix internal state with underscore

* FIX: StataReader: defer opening file to when data is required

* FIX: StataReader: don't buffer entire file into memory unless necessary

Refs #48922

* DOC: Note that StataReaders are context managers

* FIX: StataReader: don't close stream implicitly

* Apply review changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StataReader processes whole file before reading in chunks
6 participants