REGR: be able to read Stata files without reading them fully into memory #48922

akx · 2022-10-03T15:33:30Z

Fixes #48700
Refs #9245
Refs #37639
Regressed in 6d1541e

Closes #48700 (StataReader processes whole file before reading in chunks)
Tests added and passed if fixing a bug or adding a new feature
- The existing tests for e.g. roundtripping zstandard-compressed Stata files test this code path.
- Added a test that checks e.g. a fp or BytesIO passed in is the same object the reader reads.
All code checks passed.
Added type annotations to new arguments/methods/functions.
- Nothing new to add.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

twoertwein · 2022-10-04T01:33:53Z

There seem to be some failing tests on windows:

FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[118] - Perm...
FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[119] - Perm...
FAILED pandas/tests/io/test_stata.py::TestStata::test_utf8_writer[None] - Per...
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_labels - Per...
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_label_name_conversion
FAILED pandas/tests/io/test_stata.py::test_non_categorical_value_label_convert_categoricals_error

akx · 2022-10-04T04:38:55Z

There seem to be some failing tests on windows:

Right, e.g.

       try:
>           self._accessor.unlink(self)
E           PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\32b05830-5662-48d4-9c1a-e0a1002e813c'

Looks like not all handles are being closed correctly, which is apparently a test suite bug, fixed in 99d3540 :)
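To illustrate why the unclosed handles matter: on Windows a file that still has an open handle generally cannot be deleted, which is what the `PermissionError` above reports. A minimal sketch, assuming nothing beyond the standard library; the helper name `roundtrip_and_delete` is made up for this example and is not part of the pandas test suite.

```python
import os
import tempfile


def roundtrip_and_delete(payload: bytes) -> bool:
    """Write, read back, and delete a temp file, closing every handle first."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # mkstemp returns an open descriptor; release it right away

    with open(path, "wb") as fh:
        fh.write(payload)

    # The `with` block guarantees the read handle is closed before unlink.
    with open(path, "rb") as fh:
        data = fh.read()

    # With a handle still open, this is the call that fails on Windows.
    os.unlink(path)
    return data == payload and not os.path.exists(path)
```

On POSIX systems the unlink would succeed even with a handle open, which is presumably why the failures only showed up in the Windows jobs.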

twoertwein

Small comment, otherwise looks good to me!

jbrockmendel · 2022-10-07T23:56:08Z

LGTM cc @bashtage

bashtage

The test changes indicate that this might be introducing a behavior change with respect to leaving handles open in existing code. Can you verify this isn't the case and restore the other tests that should work unmodified?

bashtage · 2022-10-08T06:16:31Z

pandas/tests/io/test_stata.py

@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion():
        with tm.assert_produces_warning(InvalidColumnName):
            data.to_stata(path, value_labels=value_labels)

-        reader = StataReader(path)


I don't like all of this style of change. StataReader is self-closing. Has this changed? If so, that is introducing a new bug to user code that needs to be addressed.

@bashtage StataReader is not self-closing. It has had a close method since 59dd18b (July 2015), to, quote, "ensure closing of the path". (At that point, when you'd pass in a string-like to the ctor, it would open it as a regular file and not buffer things into memory.) You can see 59dd18b also changes some tests to use with.

Some tests either didn't use that, or didn't call close(), which in turn caused cleanup issues on Windows now that we don't buffer everything into memory (which was a regression in 6d1541e, 2020).

Any user code that had not used StataReader() as a context manager had technically been in violation of the protocol established in 59dd18b; it just wouldn't surface, since Pandas has read everything into memory for the last two years.
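The close/`with` protocol referred to here can be sketched with a toy class. `TinyReader` is a hypothetical stand-in, not StataReader itself, and whether the real reader should close a caller-supplied handle is a separate question that this sketch does not settle.

```python
import io
from typing import BinaryIO


class TinyReader:
    """Hypothetical reader that keeps its source open instead of copying it."""

    def __init__(self, handle: BinaryIO) -> None:
        # No copy into memory: the caller's object is used as-is.
        self._handle = handle

    def read(self, size: int = -1) -> bytes:
        return self._handle.read(size)

    def close(self) -> None:
        self._handle.close()

    def __enter__(self) -> "TinyReader":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions alike.
        self.close()


source = io.BytesIO(b"stata-bytes")
with TinyReader(source) as reader:
    first = reader.read(5)

# Leaving the block closed the underlying handle.
closed_after_with = source.closed
```

Code that constructs such a reader without `with` has to call `close()` itself, which is the situation the modified tests were in.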

I have a strong preference for there to be no test changes aside from those essential for this PR, so please revert these. It is fine to do a follow-up PR to clean the test code to use best practices. This just ensures that there are no visible side effects of other changes, e.g., leaking handles with existing code.

Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.

@bashtage

Leaked handles did fail CI, hence this separate commit to fix those up. See earlier discussion:

REGR: be able to read Stata files without reading them fully into memory #48922 (comment)

REGR: be able to read Stata files without reading them fully into memory #48922 (comment)

This behavior arose from a human mistake in the refactoring in 6d1541e.

You call it a mistake, I call it a coding-efficient choice that was not memory efficient.

I don't think it's first and foremost an optimization as the current behavior entirely prevents using chunked or iterator reading on machines with less available memory than the stata file's size on disk.

The usual response is to get more memory. I think extending the reading to lower memory machines is an enhancement.
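For readers following the memory argument, here is a rough sketch of what chunked reading buys when the source is streamed rather than buffered: each step only holds one chunk. The record layout (`RECORD`) and the function `iter_chunks` are invented for illustration and have nothing to do with the actual Stata file format.

```python
import io
import struct
from typing import BinaryIO, Iterator, List, Tuple

# Hypothetical fixed-width row: little-endian int32 followed by float64.
RECORD = struct.Struct("<id")


def iter_chunks(handle: BinaryIO, chunksize: int) -> Iterator[List[Tuple[int, float]]]:
    """Yield lists of at most `chunksize` rows, reading only what is needed."""
    while True:
        raw = handle.read(RECORD.size * chunksize)
        if not raw:
            return
        # Ignore a trailing partial record, if any.
        usable = len(raw) - len(raw) % RECORD.size
        yield [RECORD.unpack_from(raw, off) for off in range(0, usable, RECORD.size)]


data = b"".join(RECORD.pack(i, i / 2) for i in range(5))
chunks = list(iter_chunks(io.BytesIO(data), chunksize=2))
```

If the handle were first copied into a `BytesIO`, the loop would look the same but peak memory would be at least the size of the whole file.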

I don't think that's true. The test suite changes (where a block already ending with .close() wasn't changed to a with) are:

Most of these self-close. There is only one that would need an explicit call to close if you implement this change only for the case where an iterator is used. (iterator == True or chunksize not None).

Reading a data label and variable labels only, no iterator mode:

It is self-closing in main, so no leaks.

Chunked Stata reader left open:

It is not. Closed since iterator is exhausted.

Chunked Stata reader left open.

Yes. This one could be fixed in the test.

Only value labels being read, no iterator mode (this occurs three times with minor variations):

Should work without modification of the tests since not an iterator.

To move this forwards, why not try:

implement it only in the case where an iterator is indicated

Restore the test suite so we can see the failures

Leave in the new tests, but adapt them so that they are using an iterator
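The first suggestion amounts to a small predicate on the reader arguments. A sketch, assuming the condition is exactly the one quoted earlier (iterator == True or chunksize not None); the function name `wants_streaming` is hypothetical.

```python
from typing import Optional


def wants_streaming(iterator: bool, chunksize: Optional[int]) -> bool:
    """Return True only when the caller asked for incremental reading."""
    # Everything else would keep the existing buffer-into-memory path.
    return bool(iterator) or chunksize is not None
```

Under this rule a plain `StataReader(path)` with no iterator arguments would keep today's behavior, and only the iterator case would hold the file open.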

You call it a mistake, I call it a coding-efficient choice that was not memory efficient.

Please recall the reading code before that inadvertent change was memory-efficient.

pandas/pandas/io/stata.py

Lines 939 to 948 in 59dd18b

    if isinstance(path_or_buf, (str, compat.text_type, bytes)):
        self.path_or_buf = open(path_or_buf, 'rb')
    else:
        # Copy to BytesIO, and ensure no encoding
        contents = path_or_buf.read()
        try:
            contents = contents.encode(self._default_encoding)
        except:
            pass
        self.path_or_buf = BytesIO(contents)
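Read as standalone Python 3, the logic quoted from 59dd18b is roughly the following. This is a paraphrase for illustration, not the pandas source: `open_source` is a made-up name, and `"latin-1"` is only a placeholder for the reader's default encoding.

```python
import io
import os
from typing import IO, Union


def open_source(path_or_buf: Union[str, bytes, os.PathLike, IO]) -> IO[bytes]:
    """Open paths as plain files; copy file-like objects into a BytesIO."""
    if isinstance(path_or_buf, (str, bytes, os.PathLike)):
        # Streaming case: nothing is read until the caller asks for data.
        return open(path_or_buf, "rb")
    contents = path_or_buf.read()
    if isinstance(contents, str):
        # Placeholder encoding; the original used self._default_encoding.
        contents = contents.encode("latin-1")
    return io.BytesIO(contents)
```

The point being made in the thread is the first branch: a path was opened as a regular file handle and read on demand, so only file-like inputs were ever copied into memory.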

The usual response is to get more memory.

Which is quite user-hostile when in this case the situation can be fixed with a +61/-26 line diff. (On that note, the person who had the issue with a 30-some-gigabyte Stata file on Stack Overflow reached out to me on Twitter for help, and I suggested trying to patch their Pandas' stata.py with the one from this branch. I think that's friendlier and quite less expensive (monetarily, ecologically, etc.) than to get more memory.)

I think extending the reading to lower memory machines is an enhancement.

See above. Restoring being able to read large files on lower memory machines is a regression fix or a bug fix.

It is self-closing in main, so no leaks.

On main, StataReader buffers to a BytesIO and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader on main (or rather, since 6d1541e) is "self-closing".

It is not. Closed since iterator is exhausted.

Which is technically a bug in itself, see #48922 (comment)

Yes. This one could be fixed in the test.

I'm not following you – why can this one be fixed in tests?

Should work without modification of the tests since not an iterator.

It only works on main without a leak because StataReader on main always buffers into memory, and an unclosed BytesIO does not raise a resource warning. It wouldn't work on any version prior to 6d1541e without a leak occurring. All of these three were added in fd151ba when the inadvertent buffering code was already in.
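The difference between an unclosed file and an unclosed BytesIO can be checked directly. A small sketch, assuming CPython's finalization behavior; `leak_warns` is a hypothetical helper, and the `gc.collect()` call is only there for runtimes without reference counting.

```python
import gc
import io
import os
import tempfile
import warnings


def leak_warns(make_handle) -> bool:
    """Return True if dropping an unclosed handle emits a ResourceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        handle = make_handle()
        del handle    # drop the last reference without calling close()
        gc.collect()  # force finalization where refcounting does not apply
    return any(issubclass(w.category, ResourceWarning) for w in caught)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    file_warns = leak_warns(lambda: open(path, "rb"))
    bytesio_warns = leak_warns(lambda: io.BytesIO(b"data"))
finally:
    os.unlink(path)
```

This is why a test that never closes the reader can pass silently while everything is buffered, and start failing once a real file handle is kept open.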

On main, StataReader buffers to a BytesIO and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader on main (or rather, since 6d1541e) is "self-closing".

This is exactly my point. The current behavior of StataReader is to never leak file handles. It has been this way for 2 years now. This is why the test suite passes despite some questionable code. Ideally, this behavior should not be changed without a deprecation cycle. It is probably necessary to change it to get the iterator to work without buffering the entire file, so I think it is acceptable to make this change in this limited case. I would also think the docs for iterator and chunksize should be strengthened to tell users that they must use a context manager or close the file themselves.

I do not think it is OK to change it where it isn't essential without letting users know of a change that could lead to errors.

I think it would be best to focus the fix on #48700 only and not expand the fix to apply to other cases.

bashtage · 2022-10-08T06:18:33Z

pandas/tests/io/test_stata.py

-            block = block.set_index("index")
-            assert "cats" in block
-            tm.assert_series_equal(block.cats, df.cats.iloc[2 * i : 2 * (i + 1)])
+        with StataReader(path, chunksize=2, order_categoricals=False) as reader:


The iterator mode is already self-closing. Why is the context manager introduced here?

pandas/pandas/io/stata.py

Lines 1704 to 1710 in 2402abe

    if read_len <= 0:
        # Iterator has finished, should never be here unless
        # we are reading the file incrementally
        if convert_categoricals:
            self._read_value_labels()
        self.close()
        raise StopIteration

Sure, in this test we're reading the reader through, but if we weren't, then it would be good form to close the reader (and the underlying file handle), just like with any file handles in Python.

In fact, you might argue the self.close() there in the reader is inappropriate, since an esoteric user could have e.g. saved the reader's underlying file handle's .seek() position at some point, would then rewind the handle and have the reader read some more... but perhaps that's an esoteric enough use case that it doesn't need to be covered. 😁
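The two positions can be made concrete with a toy iterator that closes itself only on exhaustion. `ChunkIter` is hypothetical and much simpler than the real reader; it just mirrors the `self.close()` before `StopIteration` shown above.

```python
import io
from contextlib import closing
from typing import BinaryIO


class ChunkIter:
    """Hypothetical iterator that closes its handle only once exhausted."""

    def __init__(self, handle: BinaryIO, size: int) -> None:
        self._handle = handle
        self._size = size

    def __iter__(self) -> "ChunkIter":
        return self

    def __next__(self) -> bytes:
        chunk = self._handle.read(self._size)
        if not chunk:
            self.close()  # self-closing, but only when the end is reached
            raise StopIteration
        return chunk

    def close(self) -> None:
        self._handle.close()


exhausted = io.BytesIO(b"abcdef")
for _ in ChunkIter(exhausted, 2):
    pass  # read to the end: the iterator closes the handle itself

abandoned = io.BytesIO(b"abcdef")
for _ in ChunkIter(abandoned, 2):
    break  # stop early: nothing has closed the handle

guarded = io.BytesIO(b"abcdef")
with closing(ChunkIter(guarded, 2)) as it:
    next(it)  # stop early again, but `closing` calls close() on exit
```

So "self-closing" holds for the first loop only; the context-manager form is what covers the early-exit case.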

bashtage

Please separate essential test file changes from those that should continue to work without issue after the StataReader changes.

bashtage · 2022-10-10T08:13:26Z

pandas/tests/io/test_stata.py

@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion():
        with tm.assert_produces_warning(InvalidColumnName):
            data.to_stata(path, value_labels=value_labels)

-        reader = StataReader(path)


Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.

bashtage · 2022-10-11T09:00:45Z

As for a way forward, maybe we could consider adding a keyword argument buffer: {True, False, None (or "auto")} with a default of lib.NoDefault.

If True, then continue with the current behavior. If False, use the new behavior. If None or "auto", let StataReader decide. If lib.NoDefault, then treat as True and issue a warning that there will be a deprecation warning in some future version. If False but the file requires buffering (e.g., a gzip), then either raise some sort of IOError or warn that buffer=False can't be used with files that do not support seek. None or "auto" will be the medium-run default. And possibly this keyword could be dropped in some very distant release after a second deprecation cycle.

If using this strategy, then the current test suite would still pass unmodified.

I'm always -1 for adding to the API though.


jbrockmendel · 2022-11-04T16:01:49Z

@akx pretty much all of the maintainers are going to defer to bashtage on this

akx · 2022-11-04T21:40:25Z

@akx pretty much all of the maintainers are going to defer to bashtage on this

Yeah no worries @jbrockmendel, we're continuing in #49228 (the v2 of this) :)


github-actions · 2022-12-05T00:05:29Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.


mroeschke · 2022-12-17T19:31:55Z

Closing in favor of #49228

* CLN: StataReader: refactor repeated struct.unpack/read calls to helpers
* CLN: StataReader: replace string concatenations with f-strings
* CLN: StataReader: prefix internal state with underscore
* FIX: StataReader: defer opening file to when data is required
* FIX: StataReader: don't buffer entire file into memory unless necessary (Refs #48922)
* DOC: Note that StataReaders are context managers
* FIX: StataReader: don't close stream implicitly
* Apply review changes

akx force-pushed the stata-no-memory branch 2 times, most recently from 7bc15e8 to 1532991 Compare October 3, 2022 15:38

akx marked this pull request as ready for review October 3, 2022 16:36

twoertwein reviewed Oct 4, 2022

doc/source/whatsnew/v1.5.1.rst (outdated, resolved)

akx mentioned this pull request Oct 4, 2022 in "TST: use with where possible instead of manual close" #48931 (merged)

mroeschke added the Performance (memory or execution speed performance) and IO Stata (read_stata, to_stata) labels Oct 4, 2022

akx force-pushed the stata-no-memory branch 2 times, most recently from 99d3540 to 121ada5 Compare October 5, 2022 05:27

mroeschke reviewed Oct 5, 2022

pandas/io/stata.py (resolved)

mroeschke reviewed Oct 5, 2022

doc/source/whatsnew/v1.6.0.rst (outdated, resolved)

akx force-pushed the stata-no-memory branch from e368dfc to f384408 Compare October 6, 2022 07:12

akx requested a review from mroeschke October 6, 2022 07:12

akx force-pushed the stata-no-memory branch from f384408 to 97cd619 Compare October 6, 2022 08:29

twoertwein reviewed Oct 6, 2022

pandas/io/stata.py (outdated, resolved)

akx force-pushed the stata-no-memory branch from 97cd619 to 2402abe Compare October 7, 2022 07:51

akx requested review from twoertwein and removed request for mroeschke October 7, 2022 07:53

twoertwein reviewed Oct 7, 2022

pandas/tests/io/test_stata.py (outdated, resolved)

twoertwein approved these changes Oct 7, 2022


bashtage requested changes Oct 8, 2022


akx requested a review from bashtage October 10, 2022 07:54

akx force-pushed the stata-no-memory branch from 2402abe to 98828fe Compare October 10, 2022 07:55

bashtage requested changes Oct 10, 2022


akx force-pushed the stata-no-memory branch from 98828fe to 932694e Compare October 10, 2022 11:36

akx requested a review from bashtage October 11, 2022 08:39

akx added commits to akx/pandas that referenced this pull request, each titled "FIX: StataReader: don't buffer entire file into memory unless necessary" (Refs pandas-dev#48922):

- Nov 2, 2022: 2e718e6, 8de26be, 5c639e2
- Nov 3, 2022: 98fa563
- Nov 7, 2022: 8294763, 1fb6687
- Nov 8, 2022: 8775e24
- Nov 15, 2022: 1d8d4b2, 8b279be
- Nov 29, 2022: 1cd8cad

github-actions bot added the Stale label Dec 5, 2022

akx added commits to akx/pandas that referenced this pull request, each titled "FIX: StataReader: don't buffer entire file into memory unless necessary" (Refs pandas-dev#48922):

- Dec 5, 2022: 7b52273
- Dec 15, 2022: 2045b67
- Dec 16, 2022: f5b7a4a, 166fb69

mroeschke closed this Dec 17, 2022

akx added commits to akx/pandas that referenced this pull request, each titled "FIX: StataReader: don't buffer entire file into memory unless necessary" (Refs pandas-dev#48922):

- Feb 3, 2023: 2accab5, 0f5c7b0, afb5587
- Feb 21, 2023: ce397b1, 686ff38
- Feb 22, 2023: 33240a0, b7bd57f, beef885
- Feb 23, 2023: d72d5f9
