BUG: Some sas7bdat files with many columns are not parseable by read_sas #22628

troels · 2018-09-07T15:07:08Z

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Some sas7bdat files with many columns fail to parse because their column definitions are split across several pages.
This PR allows column information to be parsed from different pages
and adds a test for it.
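
As a rough illustration of the idea (not the actual pandas implementation), the sketch below models a file as a list of pages and gathers column names from every page that carries them, rather than stopping at the first one. The `Page` class and `collect_columns` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for one page of a sas7bdat file."""
    column_names: List[str] = field(default_factory=list)
    rows: List[tuple] = field(default_factory=list)


def collect_columns(pages: List[Page]) -> List[str]:
    """Gather column names from every page that carries column metadata."""
    columns: List[str] = []
    for page in pages:
        # Keep appending instead of stopping at the first metadata page,
        # because a wide table may spread its definitions over several pages.
        columns.extend(page.column_names)
    return columns


# Toy file: two metadata pages followed by one page that only holds rows.
pages = [
    Page(column_names=["a", "b"]),
    Page(column_names=["c"]),
    Page(rows=[(1, 2, 3)]),
]
print(collect_columns(pages))
```

A reader that only looked at the first page in this toy model would see two of the three columns, which is the kind of gap the PR description refers to.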

pep8speaks · 2018-09-07T15:07:10Z

Hello @troels! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/io/sas/sas7bdat.py !
There are no PEP8 issues in the file pandas/tests/io/sas/test_sas7bdat.py !

Comment last updated on September 07, 2018 at 19:51 Hours UTC

gfyoung · 2018-09-07T16:44:29Z

@troels : Thanks for the report! Do you mind sharing in this PR what output you get without your patch for reference? A small, reproducible example would be great.

pandas/tests/io/sas/test_sas7bdat.py

doc/source/whatsnew/v0.23.5.txt

pandas/io/sas/sas7bdat.py

codecov · 2018-09-07T16:53:48Z

Codecov Report

Merging #22628 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22628      +/-   ##
==========================================
+ Coverage   92.17%   92.17%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50712       +4     
==========================================
+ Hits        46740    46745       +5     
+ Misses       3968     3967       -1

Flag	Coverage Δ
#multiple	`90.58% <100%> (ø)`	⬆️
#single	`42.35% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/sas/sas7bdat.py	`91.16% <100%> (+0.07%)`	⬆️
pandas/core/internals/managers.py	`96.55% <0%> (+0.1%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 73dd6ec...c459f6e.

troels · 2018-09-07T18:21:01Z

Hi @gfyoung

So the test case I added will fail in the current version of pandas, with the following error (these files can't be parsed):

pandas/tests/io/sas/test_sas7bdat.py ............F.                                              [100%]

=============================================== FAILURES ===============================================
__________________________________________ test_many_columns ___________________________________________

datapath = <function datapath.<locals>.deco at 0x7f53f5a47f28>

    def test_many_columns(datapath):
        fname = datapath("io", "sas", "data", "many_columns.sas7bdat")
>       df = pd.read_sas(fname, encoding='latin-1')

pandas/tests/io/sas/test_sas7bdat.py:188: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/io/sas/sasreader.py:68: in read_sas
    data = reader.read()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.sas.sas7bdat.SAS7BDATReader object at 0x7f53f5899860>, nrows = 3

    def read(self, nrows=None):
    
        if (nrows is None) and (self.chunksize is not None):
            nrows = self.chunksize
        elif nrows is None:
            nrows = self.row_count
    
        if len(self.column_types) == 0:
            self.close()
>           raise EmptyDataError("No columns to parse from file")
E           pandas.errors.EmptyDataError: No columns to parse from file

pandas/io/sas/sas7bdat.py:604: EmptyDataError
================================= 1 failed, 13 passed in 18.88 seconds =================================

troels · 2018-09-07T19:16:17Z

Hi @gfyoung

I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

gfyoung · 2018-09-07T20:40:19Z

> I added a commit fixing #16615 too. I hope it's ok to continue using the same PR, as the two pieces of code build upon each other.

@troels : In this case, it should be fine. Thanks for letting me know!

jreback · 2018-09-08T02:39:05Z

doc/source/whatsnew/v0.24.0.txt

@@ -690,7 +690,8 @@ I/O
 - :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`)
 - :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
 - :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
-
+- :func:`read_sas` will correctly parse sas7bdat files with many columns (:issue:`22628`)
+- :func:`read_sas` will correctly parse sas7bdat files with odd data page types (:issue:`16615`)


can you expand on 'odd' here?

I've tried to elaborate a bit, but the meaning of bit 7 is still rather unclear. It may have something to do with the page also containing a bitmap of deleted rows, but I don't know precisely.

What is certain is that it can be parsed as a normal data page (possibly including the deleted rows).

pandas/io/sas/sas.pyx

pandas/tests/io/sas/test_sas7bdat.py

jreback · 2018-09-08T02:40:42Z

pandas/tests/io/sas/test_sas7bdat.py

@@ -183,6 +183,28 @@ def test_date_time(datapath):
    tm.assert_frame_equal(df, df0)


+def test_many_columns(datapath):
+    fname = datapath("io", "sas", "data", "many_columns.sas7bdat")


@jreback: likewise.

troels · 2018-09-10T12:26:23Z

The test failure here is quite unrelated to the PR.

The reason is that column definitions may be split up into different pages. Allow column information to be parsed from different pages and add a test for it.

SAS can apparently generate data pages that have bit 7 (128) set on the page type. It seems that the presence of bit 8 (256) determines whether a page is a data page or not. So treat a page as a data page if bit 8 is set, regardless of the lower bits.
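
The rule described in this commit message can be sketched as a simple bit-mask test. This is only an illustration under the assumptions stated in the thread (bit 8 = 256 marks a data page, bit 7 = 128 may or may not be set); the names `DATA_PAGE_BIT` and `is_data_page` are hypothetical and are not taken from the pandas source.

```python
# Hypothetical constant: bit 8 (value 256), as described in the thread.
DATA_PAGE_BIT = 0x100


def is_data_page(page_type: int) -> bool:
    """Return True when bit 8 is set, ignoring all lower bits."""
    return (page_type & DATA_PAGE_BIT) != 0


print(is_data_page(256))        # bit 8 only
print(is_data_page(256 | 128))  # bit 8 plus the unexplained bit 7
print(is_data_page(128))        # bit 7 only
```

Masking with a single bit, instead of comparing the page type against a fixed list of known values, is what lets a page with bit 7 set still be treated as a data page.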

troels · 2018-09-16T14:05:35Z

Hi @jreback

Anything else preventing this from being merged?

gfyoung

I'm not @jreback , but I don't think there are any barriers to merging this. LGTM!

jreback · 2018-09-18T12:13:49Z

thanks!

troels · 2018-09-18T12:17:17Z

Thanks both of you :)

…sas (pandas-dev#22628)

gfyoung added the IO SAS SAS: read_sas label Sep 7, 2018

gfyoung reviewed Sep 7, 2018

pandas/tests/io/sas/test_sas7bdat.py (outdated)

gfyoung reviewed Sep 7, 2018

doc/source/whatsnew/v0.23.5.txt (outdated)

gfyoung reviewed Sep 7, 2018

pandas/io/sas/sas7bdat.py (outdated)

gfyoung added the Bug label Sep 7, 2018

troels force-pushed the sas-fix branch from f368895 to eccc784 Compare September 7, 2018 18:28

troels force-pushed the sas-fix branch 2 times, most recently from 2d81f92 to 8f498d5 Compare September 7, 2018 19:51

jreback requested changes Sep 8, 2018

troels force-pushed the sas-fix branch from 8f498d5 to 3511919 Compare September 8, 2018 11:19

troels mentioned this pull request Sep 9, 2018

BUG: Dont include deleted rows from sas7bdat files (#15963) #22650

Closed

4 tasks

troels force-pushed the sas-fix branch from 3511919 to f05ebc8 Compare September 9, 2018 17:00

troels force-pushed the sas-fix branch from f05ebc8 to c459f6e Compare September 11, 2018 21:57

troels added 2 commits September 16, 2018 13:02

BUG: Some sas7bdat files with many columns are not parseable by read_sas

d9dffe1

The reason is that column definitions may be split up into different pages. Allow column information to be parsed from different pages and add a test for it.

troels force-pushed the sas-fix branch from c459f6e to 3f3e051 Compare September 16, 2018 11:03

gfyoung approved these changes Sep 16, 2018

jreback added this to the 0.24.0 milestone Sep 18, 2018

jreback approved these changes Sep 18, 2018

jreback merged commit c6f7e86 into pandas-dev:master Sep 18, 2018

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018

BUG: Some sas7bdat files with many columns are not parseable by read_…

9465a59

…sas (pandas-dev#22628)

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: Some sas7bdat files with many columns are not parseable by read_…

9cf7b60

…sas (pandas-dev#22628)
