Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAS7BDAT parser: Improve subheader lookup performance #47656

Merged
merged 5 commits into from
Oct 4, 2022

Conversation

jonashaag
Copy link
Contributor

Avoid constructing _SubheaderPointer objects and make dictionary lookups in C rather than in Python.

Speedup relative to current main:

     <main>           <sas/shlookup~1>
-     8.32±0.07ms      7.51±0.06ms     0.90  io.sas.SAS.time_test_meta2_pagesas7bdat
-      82.8±0.5ms       73.6±0.5ms     0.89  io.sas.SAS.time_read_sas7bdat_2_chunked
       before           after         ratio

Will extend what's new from #47404 once that's merged.

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jonashaag
Copy link
Contributor Author

jonashaag commented Jul 12, 2022

Test failure seems unrelated to this PR but still worth fixing. I'll need some guidance here. The problem is that we don't have an explicit .close() when using pd.read_sas(iterator=True) and so a file descriptor is leaked. But file handling in read_sas is implemented the same way it is implemented in pd.read_csv(iterator=True), which doesn't cause leaked file descriptors. So I wonder what's wrong here.

@mroeschke mroeschke added Performance Memory or execution speed performance IO SAS SAS: read_sas labels Jul 22, 2022
@jonashaag
Copy link
Contributor Author

@mroeschke any pointers for the failing test (see comment above)?

@mroeschke
Copy link
Member

Sorry, I'm not too familiar with this part of the codebase. It looks like it's using the shared file handling code, but not sure if the one of the sas is still keeping a reference to the file handle somewhere?

@jonashaag
Copy link
Contributor Author

Bug in file_leak_context, see #30096.

@jonashaag
Copy link
Contributor Author

@mroeschke tests fixed, should be good to review now.

@mroeschke mroeschke added this to the 1.6 milestone Oct 4, 2022
@mroeschke mroeschke merged commit 39fc318 into pandas-dev:main Oct 4, 2022
@mroeschke
Copy link
Member

Thanks @jonashaag

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* SAS7BDAT parser: Improve subheader lookup performance

* Fix ssize_t type

* Update _sas.pyi

* Lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants