Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Some sas7bdat files with many columns are not parseable by read_sas #22628

Merged
merged 2 commits into from
Sep 18, 2018

Conversation

troels
Copy link
Contributor

@troels troels commented Sep 7, 2018

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The reason is that column definitions may be split up into different pages.
Allow column information to be parsed from different pages
and add a test for it.

@pep8speaks
Copy link

pep8speaks commented Sep 7, 2018

Hello @troels! Thanks for updating the PR.

Comment last updated on September 07, 2018 at 19:51 Hours UTC

@gfyoung gfyoung added the IO SAS SAS: read_sas label Sep 7, 2018
@gfyoung
Copy link
Member

gfyoung commented Sep 7, 2018

@troels : Thanks for the report! Do you mind sharing in this PR what output you get without your patch for reference? A small, reproducible example would be great.

pandas/io/sas/sas7bdat.py Outdated Show resolved Hide resolved
@gfyoung gfyoung added the Bug label Sep 7, 2018
@codecov
Copy link

codecov bot commented Sep 7, 2018

Codecov Report

Merging #22628 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22628      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50712       +4     
==========================================
+ Hits        46740    46745       +5     
+ Misses       3968     3967       -1
Flag Coverage Δ
#multiple 90.58% <100%> (ø) ⬆️
#single 42.35% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/sas/sas7bdat.py 91.16% <100%> (+0.07%) ⬆️
pandas/core/internals/managers.py 96.55% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73dd6ec...c459f6e. Read the comment docs.

@troels
Copy link
Contributor Author

troels commented Sep 7, 2018

Hi @gfyoung

So the test case I added will fail in the current version of pandas, with the following error (these files can't be parsed):

pandas/tests/io/sas/test_sas7bdat.py ............F.                                              [100%]

=============================================== FAILURES ===============================================
__________________________________________ test_many_columns ___________________________________________

datapath = <function datapath.<locals>.deco at 0x7f53f5a47f28>

    def test_many_columns(datapath):
        fname = datapath("io", "sas", "data", "many_columns.sas7bdat")
>       df = pd.read_sas(fname, encoding='latin-1')

pandas/tests/io/sas/test_sas7bdat.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/io/sas/sasreader.py:68: in read_sas
    data = reader.read()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.sas.sas7bdat.SAS7BDATReader object at 0x7f53f5899860>, nrows = 3

    def read(self, nrows=None):
    
        if (nrows is None) and (self.chunksize is not None):
            nrows = self.chunksize
        elif nrows is None:
            nrows = self.row_count
    
        if len(self.column_types) == 0:
            self.close()
>           raise EmptyDataError("No columns to parse from file")
E           pandas.errors.EmptyDataError: No columns to parse from file

pandas/io/sas/sas7bdat.py:604: EmptyDataError
================================= 1 failed, 13 passed in 18.88 seconds =================================

@troels
Copy link
Contributor Author

troels commented Sep 7, 2018

Hi @gfyoung

I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

@troels troels force-pushed the sas-fix branch 2 times, most recently from 2d81f92 to 8f498d5 Compare September 7, 2018 19:51
@gfyoung
Copy link
Member

gfyoung commented Sep 7, 2018

I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

@troels : In this case, it should be fine. Thanks for letting me know!

@@ -690,7 +690,8 @@ I/O
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
-
- :func:`read_sas` will correctly parse sas7bdat files with many columns (:issue:`22628`)
- :func:`read_sas` will correctly parse sas7bdat files with odd data page types (:issue:`16615`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you expand on 'odd' here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to elaborate a bit, but the meaning of bit 7 is still rather unclear. It may have something to do with page also containing a bit map of deleted rows, but don't know precisely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is certain is that it can be parsed as a normal data page (possibly including the deleted rows)

pandas/io/sas/sas.pyx Show resolved Hide resolved
pandas/tests/io/sas/test_sas7bdat.py Outdated Show resolved Hide resolved
@@ -183,6 +183,28 @@ def test_date_time(datapath):
tm.assert_frame_equal(df, df0)


def test_many_columns(datapath):
fname = datapath("io", "sas", "data", "many_columns.sas7bdat")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: likewise.

@troels
Copy link
Contributor Author

troels commented Sep 10, 2018

The test failure here is quite unrelated to the PR.

The reason is that column definitions may be split up into different pages.
Allow column information to be parsed from different pages
and add a test for it.
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.
@troels
Copy link
Contributor Author

troels commented Sep 16, 2018

Hi @jreback

Anything else preventing this from being merged?

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not @jreback , but I don't think there are any barriers to merging this. LGTM!

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018
@jreback jreback merged commit c6f7e86 into pandas-dev:master Sep 18, 2018
@jreback
Copy link
Contributor

jreback commented Sep 18, 2018

thanks!

@troels
Copy link
Contributor Author

troels commented Sep 18, 2018

Thanks both of you :)

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants