Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Dont include deleted rows from sas7bdat files (#15963) #22650

Closed

Conversation

troels
Copy link
Contributor

@troels troels commented Sep 9, 2018

Sas7bdat files may contain rows which are actually deleted.

After reverse engineering the file format a bit, I found that
if the page_type has bit 7 set, there is a bitmap following
the normal row data with a bit set for a given row if it has been
deleted. Use that information to not include deleted rows in
the resulting dataframe.

This PR is built on top of #22628

I've added two test-cases one from the issue and another constructed
example, which tests parsing on different page types and across multiple sas7bdat-pages.

The constructed example is rather large, mainly to have a test case where the data
flows across several pages.

@pep8speaks
Copy link

Hello @troels! Thanks for submitting the PR.

@troels troels force-pushed the dont-include-deleted-rows-sas branch 3 times, most recently from 1b1e8ec to b8696f9 Compare September 9, 2018 18:28
@troels troels changed the title Dont include deleted rows from sas7bdat files (#15963) BUG: Dont include deleted rows from sas7bdat files (#15963) Sep 9, 2018
@codecov
Copy link

codecov bot commented Sep 9, 2018

Codecov Report

Merging #22650 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22650      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50721      +13     
==========================================
+ Hits        46740    46754      +14     
+ Misses       3968     3967       -1
Flag Coverage Δ
#multiple 90.58% <100%> (ø) ⬆️
#single 42.34% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/io/sas/sas_constants.py 100% <100%> (ø) ⬆️
pandas/io/sas/sas7bdat.py 91.26% <100%> (+0.17%) ⬆️
pandas/core/internals/managers.py 96.55% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73dd6ec...b31144a. Read the comment docs.

@WillAyd
Copy link
Member

WillAyd commented Sep 11, 2018

Haven't looked in detail due to size of PR but I noticed your original post only mentions one issue whereas the PR references 3 in the whatsnew. Is that intentional or have you simply pulled in some other concurrent developments accidentally?

@WillAyd WillAyd added the IO SAS SAS: read_sas label Sep 11, 2018
@troels
Copy link
Contributor Author

troels commented Sep 11, 2018

Hi @WillAyd I made a separate PR #22628 for the two first commits. This one I though was large enough to have its own PR, but as there was a dependency on the first two commits I continued this PR from the branch #22628 had sprung from.
If #22628 can be merged soon, this PR will suddenly get slightly smaller (but just a bit)

@troels troels force-pushed the dont-include-deleted-rows-sas branch 3 times, most recently from db6ba8b to 69779d5 Compare September 18, 2018 19:56
@codecov
Copy link

codecov bot commented Sep 18, 2018

Codecov Report

Merging #22650 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22650      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50812    50821       +9     
==========================================
+ Hits        46842    46851       +9     
  Misses       3970     3970
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.36% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/sas/sas_constants.py 100% <100%> (ø) ⬆️
pandas/io/sas/sas7bdat.py 91.26% <100%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f65fa75...fcb1699. Read the comment docs.

@troels troels force-pushed the dont-include-deleted-rows-sas branch from 69779d5 to fcb1699 Compare September 23, 2018 12:23
Sas7bdat may contain rows which are actually deleted.

If the page_type has bit 128 set, there is a bitmap following
the normal row data with a bit set for a given row if it has been
deleted. Use that information to not include deleted rows in
the resulting dataframe.
page_type = self.current_page_type
subheader_pointers_length = \
self.subheader_pointer_length * self.current_page_subheaders_count

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments in here

self.calculate_deleted_rows_bitmap_offset()

cdef bint is_row_deleted(self, int row_number):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string & comments

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2018

Can you address comments and merge master?

@jreback
Copy link
Contributor

jreback commented Dec 14, 2018

can you merge master and address comments

@WillAyd
Copy link
Member

WillAyd commented Feb 6, 2019

Closing as stale. Ping if you'd like to continue

@WillAyd WillAyd closed this Feb 6, 2019
@bezdomniy
Copy link

This would be pretty useful as the issue makes read_sas unusable for a lot of sas datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IO/SAS (sas7bdat) deleted observations are not filtered out
5 participants