BUG: Some sas7bdat files with many columns are not parseable by read_sas #22628

Merged
merged 2 commits into from Sep 18, 2018

4 participants
@troels
Contributor

commented Sep 7, 2018

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The reason is that column definitions may be split up into different pages.
Allow column information to be parsed from different pages
and add a test for it.
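
The idea in the description above can be sketched roughly as follows. This is a simplified illustration, not the pandas implementation: the `Page` class, its `kind` and `column_names` fields, and the `collect_columns` helper are hypothetical names invented for this example, and the real sas7bdat page layout is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical, heavily simplified stand-in for one sas7bdat page."""
    kind: str  # "meta" or "data" in this toy model
    column_names: List[str] = field(default_factory=list)


def collect_columns(pages: List[Page]) -> List[str]:
    """Accumulate column names from every metadata page.

    A reader that only keeps the definitions found on a single page
    would miss columns when a wide file spills them onto later pages.
    """
    columns: List[str] = []
    for page in pages:
        if page.kind == "meta":
            # Extend rather than overwrite, so earlier pages are kept.
            columns.extend(page.column_names)
    return columns


pages = [
    Page("meta", ["c0", "c1"]),
    Page("meta", ["c2", "c3"]),  # definitions continue on a second page
    Page("data"),
]
columns = collect_columns(pages)
```

In this toy model, a file with many columns is simply one whose column definitions do not fit on the first metadata page, which is the situation the new `many_columns.sas7bdat` test file is meant to exercise.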

@pep8speaks

commented Sep 7, 2018

Hello @troels! Thanks for updating the PR.

Comment last updated on September 07, 2018 at 19:51 Hours UTC

@gfyoung gfyoung added the IO SAS label Sep 7, 2018

@gfyoung

Member

commented Sep 7, 2018

@troels : Thanks for the report! Do you mind sharing in this PR what output you get without your patch for reference? A small, reproducible example would be great.

Review thread on doc/source/whatsnew/v0.23.5.txt (resolved, outdated)
Review thread on pandas/io/sas/sas7bdat.py (resolved, outdated)

@gfyoung gfyoung added the Bug label Sep 7, 2018

@codecov


commented Sep 7, 2018

Codecov Report

Merging #22628 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22628      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50712       +4     
==========================================
+ Hits        46740    46745       +5     
+ Misses       3968     3967       -1

| Flag | Coverage Δ |
|---|---|
| #multiple | 90.58% <100%> (ø) ⬆️ |
| #single | 42.35% <0%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/io/sas/sas7bdat.py | 91.16% <100%> (+0.07%) ⬆️ |
| pandas/core/internals/managers.py | 96.55% <0%> (+0.1%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 73dd6ec...c459f6e.

@troels

Contributor Author

commented Sep 7, 2018

Hi @gfyoung

So the test case I added will fail in the current version of pandas, with the following error (these files can't be parsed):

pandas/tests/io/sas/test_sas7bdat.py ............F.                                              [100%]

=============================================== FAILURES ===============================================
__________________________________________ test_many_columns ___________________________________________

datapath = <function datapath.<locals>.deco at 0x7f53f5a47f28>

    def test_many_columns(datapath):
        fname = datapath("io", "sas", "data", "many_columns.sas7bdat")
>       df = pd.read_sas(fname, encoding='latin-1')

pandas/tests/io/sas/test_sas7bdat.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/io/sas/sasreader.py:68: in read_sas
    data = reader.read()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.sas.sas7bdat.SAS7BDATReader object at 0x7f53f5899860>, nrows = 3

    def read(self, nrows=None):
    
        if (nrows is None) and (self.chunksize is not None):
            nrows = self.chunksize
        elif nrows is None:
            nrows = self.row_count
    
        if len(self.column_types) == 0:
            self.close()
>           raise EmptyDataError("No columns to parse from file")
E           pandas.errors.EmptyDataError: No columns to parse from file

pandas/io/sas/sas7bdat.py:604: EmptyDataError
================================= 1 failed, 13 passed in 18.88 seconds =================================

@troels troels force-pushed the troels:sas-fix branch from f368895 to eccc784 Sep 7, 2018

@troels

Contributor Author

commented Sep 7, 2018

Hi @gfyoung

I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

@troels troels force-pushed the troels:sas-fix branch 2 times, most recently from 2d81f92 to 8f498d5 Sep 7, 2018

@gfyoung

Member

commented Sep 7, 2018

> I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

@troels : In this case, it should be fine. Thanks for letting me know!

@@ -690,7 +690,8 @@ I/O
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
-
- :func:`read_sas` will correctly parse sas7bdat files with many columns (:issue:`22628`)
- :func:`read_sas` will correctly parse sas7bdat files with odd data page types (:issue:`16615`)

@jreback

jreback Sep 8, 2018

Contributor

can you expand on 'odd' here?

@troels

troels Sep 8, 2018

Author Contributor

I've tried to elaborate a bit, but the meaning of bit 7 is still rather unclear. It may have something to do with the page also containing a bitmap of deleted rows, but I don't know precisely.

@troels

troels Sep 8, 2018

Author Contributor

What is certain is that it can be parsed as a normal data page (possibly including the deleted rows)

Review thread on pandas/io/sas/sas.pyx (resolved)
Review thread on pandas/tests/io/sas/test_sas7bdat.py (resolved, outdated)
@@ -183,6 +183,28 @@ def test_date_time(datapath):
    tm.assert_frame_equal(df, df0)


def test_many_columns(datapath):
    fname = datapath("io", "sas", "data", "many_columns.sas7bdat")

@jreback

jreback Sep 8, 2018

Contributor

same

@troels

troels Sep 8, 2018

Author Contributor

@jreback: likewise.

@troels troels force-pushed the troels:sas-fix branch from 8f498d5 to 3511919 Sep 8, 2018

@troels troels referenced this pull request Sep 9, 2018

Closed

BUG: Dont include deleted rows from sas7bdat files (#15963) #22650

4 of 4 tasks complete

@troels troels force-pushed the troels:sas-fix branch from 3511919 to f05ebc8 Sep 9, 2018

@troels

Contributor Author

commented Sep 10, 2018

The test failure here is quite unrelated to the PR.

@troels troels force-pushed the troels:sas-fix branch from f05ebc8 to c459f6e Sep 11, 2018

troels added some commits Sep 7, 2018

BUG: Some sas7bdat files with many columns are not parseable by read_sas
The reason is that column definitions may be split up into different pages.
Allow column information to be parsed from different pages
and add a test for it.

BUG: Fix parsing of sas7bdat files with odd data pages (#16615)
SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.
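
The rule stated in this commit message can be sketched as a single bit-mask test. This is an illustration under stated assumptions, not the code from the PR: `DATA_PAGE_BIT` and `is_data_page` are hypothetical names, and the only facts taken from the message are that bit 8 (256) marks a data page and that the lower bits, including bit 7 (128), are ignored.

```python
# Hypothetical constant: bit 8 (value 256) of the page type field.
DATA_PAGE_BIT = 0x100


def is_data_page(page_type: int) -> bool:
    """Treat a page as a data page whenever bit 8 is set.

    Lower bits (for example bit 7, value 128) are deliberately ignored,
    following the rule described in the commit message.
    """
    return bool(page_type & DATA_PAGE_BIT)


plain_data_page = is_data_page(0x100)   # only bit 8 set
odd_data_page = is_data_page(0x180)     # bit 8 and bit 7 set
not_data_page = is_data_page(0x080)     # bit 7 set, bit 8 clear
```

An exact comparison such as `page_type == 0x100` would reject the page type with bit 7 set, which is the kind of "odd" page the commit message describes; masking accepts both.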

@troels troels force-pushed the troels:sas-fix branch from c459f6e to 3f3e051 Sep 16, 2018

@troels

Contributor Author

commented Sep 16, 2018

Hi @jreback

Anything else preventing this from being merged?

@gfyoung
Member

left a comment

I'm not @jreback , but I don't think there are any barriers to merging this. LGTM!

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018

@jreback jreback merged commit c6f7e86 into pandas-dev:master Sep 18, 2018

6 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
ci/circleci: py35_ascii Your tests passed on CircleCI!
ci/circleci: py36_locale Your tests passed on CircleCI!
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@jreback

Contributor

commented Sep 18, 2018

thanks!

@troels

Contributor Author

commented Sep 18, 2018

Thanks both of you :)

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
