StataReader processes whole file before reading in chunks #48700

Closed
sterlinm opened this issue Sep 21, 2022 · 6 comments · Fixed by #49228
Labels
Enhancement IO Stata read_stata, to_stata

Comments

@sterlinm

I've noticed that when reading large Stata files with the chunksize parameter, the time it takes to create the StataReader object depends on the size of the file. This is a bit surprising: all of the metadata it needs is contained in the file header, so creating the reader should take the same time regardless of the total file size.

I took a look at the code, and the culprit seems to be this line, which reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally, it would be possible to create the StataReader object after processing just the header portion of the file.

pandas/pandas/io/stata.py

Lines 1167 to 1175 in 71fc89c

with get_handle(
    path_or_buf,
    "rb",
    storage_options=storage_options,
    is_text=False,
    compression=compression,
) as handles:
    # Copy to BytesIO, and ensure no encoding
    self.path_or_buf = BytesIO(handles.handle.read())
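
To make the cost of that copy concrete, here is a minimal sketch using only the standard library. `CountingReader` and `HEADER_SIZE` are hypothetical names introduced for illustration; the real Stata header is not a fixed 128 bytes, and this is not pandas code. It only shows that `BytesIO(handle.read())` consumes every byte of the input, while a reader that stops after the header touches a small, size-independent amount.

```python
import io


class CountingReader:
    """Hypothetical wrapper that counts how many bytes are read through it."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data


HEADER_SIZE = 128  # assumption for illustration only, not the real Stata header size
payload = bytes(1_000_000)  # stand-in for a large file

# Eager: copy the whole stream into memory, as the quoted line does.
eager = CountingReader(io.BytesIO(payload))
buffered = io.BytesIO(eager.read())

# Lazy: read only as many bytes as the (hypothetical) header needs.
lazy = CountingReader(io.BytesIO(payload))
header = lazy.read(HEADER_SIZE)

print("eager bytes read:", eager.bytes_read)
print("lazy bytes read:", lazy.bytes_read)
```

The eager count grows with the payload, which would match the observation that constructing the reader slows down as the file gets larger.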

@twoertwein
Member

Is that a behavior change you noticed since 1.5, or did it also exist in previous versions? I think these particular lines of code have been around since 1.3, but I think earlier versions had similar logic.

I think the issue is that some IO-like objects are not seekable, but read_stata does a lot of seeking internally (some of the compression IO objects don't support seeking). We might be able to change the above line to read the file completely only if it isn't seekable.
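
As a small illustration of the seekability point, the sketch below builds a forward-only stream and checks it. `NonSeekableStream` and `can_seek` are hypothetical names, not part of pandas; the class just stands in for any handle that cannot seek. Note that `seekable` is a method on Python's IO classes, so it has to be called.

```python
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical forward-only stream standing in for a handle that cannot seek."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        return self._inner.readinto(buffer)


def can_seek(handle) -> bool:
    """Return True only if the handle has a seekable() method that returns True."""
    seekable = getattr(handle, "seekable", None)
    return bool(seekable()) if callable(seekable) else False


stream = NonSeekableStream(b"header and data")
print(can_seek(io.BytesIO(b"abc")))
print(can_seek(stream))

try:
    stream.seek(0)
except io.UnsupportedOperation as exc:
    print("seek failed:", exc)
```

A parser that jumps around with `seek()` would fail on such a stream, which is presumably why the current code copies everything into a `BytesIO` first.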

@twoertwein twoertwein added the IO Stata read_stata, to_stata label Sep 22, 2022
@sterlinm
Author

sterlinm commented Sep 22, 2022

I don't think it changed in 1.5; I had already noticed it with 1.4. I didn't look back to see when it was introduced or whether it has always been there.

I see the uses of seek in parsing the header but it seems like it should be possible to avoid that.

EDIT: Commented too soon. I think the suggestion to skip that step when the file is seekable is simpler.

@twoertwein
Member

Feel free to open a PR!

I think the main change is

self.handles = get_handle(...)
handle = self.handles.handle
# seekable is a method, so it must be called
if hasattr(handle, "seekable") and handle.seekable():
    self.path_or_buf = handle
else:
    # not seekable: fall back to an in-memory copy
    with self.handles:
        self.path_or_buf = BytesIO(handle.read())

# and then appropriate code to close self.handles (and self.path_or_buf in case of BytesIO)
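
For the cleanup part, here is a rough standalone sketch of the same idea, assuming a hypothetical `HeaderReader` class and a `pipe_stream` helper (neither is pandas API). It keeps a seekable handle as-is, copies a non-seekable one into a `BytesIO`, and closes whichever object it ended up holding.

```python
import io
import os


class HeaderReader:
    """Hypothetical reader that buffers its input only when it cannot seek."""

    def __init__(self, handle):
        seekable = getattr(handle, "seekable", None)
        if callable(seekable) and seekable():
            # Seekable: keep reading lazily from the original handle.
            self.buf = handle
            self.copied = False
        else:
            # Not seekable: copy everything once, then release the original.
            self.buf = io.BytesIO(handle.read())
            self.copied = True
            handle.close()

    def close(self):
        # Closes either the original handle or the in-memory copy.
        self.buf.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


def pipe_stream(data: bytes):
    """Return a non-seekable binary stream backed by an OS pipe."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, data)
    os.close(write_fd)
    return os.fdopen(read_fd, "rb")


with HeaderReader(io.BytesIO(b"abcdef")) as reader:
    print(reader.copied, reader.buf.read(3))

with HeaderReader(pipe_stream(b"abcdef")) as reader:
    reader.buf.seek(3)  # works because the pipe was copied into BytesIO
    print(reader.copied, reader.buf.read())
```

Using the reader as a context manager is one way to make sure the handle is closed; the actual fix may handle this differently.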

@sterlinm
Author

> Feel free to open a PR!

I'll give it a shot over the weekend. Thanks!

@akx
Contributor

akx commented Oct 3, 2022

By the looks of it, 2f0ada3 was the commit that changed this behavior, way back when.

(Came here via answering https://stackoverflow.com/a/73934594/51685 :-) )

akx added a commit to akx/pandas that referenced this issue Oct 3, 2022
akx added a commit to akx/pandas that referenced this issue Oct 5, 2022
akx added a commit to akx/pandas that referenced this issue Oct 6, 2022
akx added a commit to akx/pandas that referenced this issue Oct 7, 2022
@sterlinm
Author

Amazing, thanks so much!
