
BUG: Make sure that sas7bdat parsers memory is initialized to 0 (#21616) #22651

Merged
merged 1 commit into pandas-dev:master on Sep 15, 2018

Conversation

@troels (Contributor) commented Sep 9, 2018

Memory for numbers in sas7bdat parsing was not initialized to 0.
For sas7bdat files with numbers stored in fewer than 8 bytes, this made
the least significant part of the numbers essentially random.
Fix it by initializing the memory correctly.
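The failure mode can be sketched at the Python level. This is a simplified analogue of the parser's byte buffer, not the actual Cython code; `read_number` is a hypothetical helper:

```python
import struct

def read_number(stored: bytes, buf: bytearray) -> float:
    # Copy only the stored high-order bytes into an 8-byte buffer and
    # reinterpret the whole buffer as a big-endian IEEE-754 double.
    # Whatever was already sitting in the remaining low-order bytes
    # survives and ends up in the decoded value.
    buf[:len(stored)] = stored
    return struct.unpack(">d", bytes(buf))[0]

# 3500.0 as a big-endian double, truncated to 4 bytes as SAS would store it
stored = struct.pack(">d", 3500.0)[:4]

dirty = bytearray(b"\xff" * 8)  # stale bytes left over from a previous value
clean = bytearray(8)            # zero-initialized, as after the fix

print(read_number(stored, dirty))  # slightly off from 3500.0
print(read_number(stored, clean))  # exactly 3500.0
```

With a reused, uninitialized buffer the low-order mantissa bits are whatever the previous value left behind, which is why the corruption looked random from run to run.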

@pep8speaks

Hello @troels! Thanks for submitting the PR.

@codecov

codecov bot commented Sep 9, 2018

Codecov Report

Merging #22651 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22651   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50715    50715           
=======================================
  Hits        46747    46747           
  Misses       3968     3968
Flag       | Coverage Δ
#multiple  | 90.58% <100%> (ø) ⬆️
#single    | 42.35% <0%> (ø) ⬆️

Impacted Files            | Coverage Δ
pandas/io/sas/sas7bdat.py | 91.09% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c040353...796fc43. Read the comment docs.

pandas/tests/io/sas/test_sas7bdat.py (outdated thread, resolved)
pandas/tests/io/sas/data/cars.csv (outdated thread, resolved)
@@ -183,6 +183,18 @@ def test_date_time(datapath):
tm.assert_frame_equal(df, df0)


def test_compact_numerical_values(datapath):
# Regression test for #21616
fname = datapath("io", "sas", "data", "cars.sas7bdat")
Member:
Using the fixture in this fashion will generate warnings. We just went through an exercise to clean these up in #22515 - can you take a look at that and adjust here accordingly?

@troels (Contributor, Author):
If I understood the discussion on the pytest tracker properly, datapath used this way is in fact not something that will be deprecated.

datapath is a "factory as a fixture" as described here:
https://docs.pytest.org/en/latest/fixture.html#factories-as-fixtures

which is distinct from what is being discussed here:
pytest-dev/pytest#3661

Have I misunderstood?
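For reference, the factory-as-a-fixture pattern looks roughly like this. The names here are hypothetical; this is a simplified sketch, not pandas' actual `datapath` fixture:

```python
import os
import pytest

def make_datapath(base_dir):
    # The factory: returns a callable that joins path components onto
    # a base directory, rather than returning a value directly.
    def datapath(*parts):
        return os.path.join(base_dir, *parts)
    return datapath

@pytest.fixture(name="datapath")
def datapath_fixture():
    # The fixture provides the factory's product: tests receive a
    # callable and build concrete paths on demand.
    return make_datapath(os.getcwd())

def test_example(datapath):
    # Tests call the fixture value like a function.
    fname = datapath("io", "sas", "data", "cars.sas7bdat")
    assert fname.endswith("cars.sas7bdat")
```

The point of the pattern is that the fixture itself is not the value under discussion in pytest-dev/pytest#3661; it is a plain callable handed to the test.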

@troels (Contributor, Author) commented Sep 11, 2018:

And it produces no warnings, as far as I can tell.

Contributor:
This usage is fine I think.

But, do you need to include a binary file in the git repo to reproduce the error on master? I'd like to avoid adding new ones if possible.

@troels (Contributor, Author):
Hi @TomAugspurger

I am adding a sas7bdat file for testing the parsing of sas7bdat files. How can that be done more sensibly?

@WillAyd added the IO SAS (SAS: read_sas) label on Sep 11, 2018
# The two columns CYL and WGT in cars.sas7bdat have column
# width < 8 and only contain integral values. Test
# that pandas doesn't corrupt the least significant bits.
tm.assert_series_equal(df['WGT'], df['WGT'].round(), check_exact=True)
Member:
Hmm not sure comparison to self as part of the test is the best option here - do we have any way of strengthening the assertion(s) being made?

@troels (Contributor, Author):
We aren't comparing to self. We are making sure that floats with no decimal part in the sas7bdat file don't acquire one because of a bug in pandas. That is exactly what this bug is about, after all.

(See: https://stackoverflow.com/questions/49059421/pandas-fails-with-correct-data-type-while-reading-a-sas-file)

Before:

  1. Memory was not initialized.
  2. The decimal part of the float was taken from random uninitialized memory.
  3. Test fails most of the time.

Now:

  1. Memory is zeroed out.
  2. The decimal part of the float is not read (the file format implies it is zero), so it is still zero in pandas.
  3. Test succeeds.

Contributor:
can you use

result = 
expected = 

to make this easier to read?

why do we need to .round(), rather than just using check_less_precision?

@troels (Contributor, Author) commented Sep 12, 2018:
I can also do:

msg = "Expected df['WGT'] to be full of integers"

assert (df['WGT'] == df['WGT'].astype('int')).all(), msg
tm.assert_series_equal(df['WGT'], df['WGT'].astype('int'), check_dtype=False, check_exact=True)
assert all(f.is_integer() for f in df['WGT']), msg
assert all(f - int(f) == 0 for f in df['WGT']), msg
assert all(f == int(f) for f in df['WGT']), msg

or any other way of saying the same thing you would prefer...

@WillAyd (Member) left a comment:
@TomAugspurger any thoughts on this one?

@TomAugspurger (Contributor):

cc @kshedden and @Winand if you have time to give this a review.

@TomAugspurger (Contributor) commented Sep 11, 2018 via email

@troels (Contributor, Author) commented Sep 11, 2018

@TomAugspurger:

No, none of the existing test files has the problem. If any of them had numeric columns narrower than 8 bytes, the tests would probably have been unstable.

Usually one wants to avoid binary files in git repositories for two reasons:

  1. They tend to be large.
  2. Every time a binary file changes, git tends to store the complete file as the delta (git's diff is clumsy with binary content). That makes e.g. large images that are edited and committed often rather toxic.

A small file like cars.sas7bdat, which will never be edited and, to top it off, is highly compressible, is not going to take much space at all, now or in the future.


…as-dev#21616)

Memory for numbers in sas7bdat-parsing was not initialized properly to 0.
For sas7bdat files with numbers smaller than 8 bytes this made the
least significant part of the numbers essentially random.
Fix it by initializing memory correctly.
@troels (Contributor, Author) commented Sep 12, 2018

Hi @jreback

Sure thing, I've added variables and extended the comment a bit.

We are comparing the result that the sas7bdat parser reads from the file with the closest integers. Since we know the numbers are integral, they should be exactly equal to their closest integer. When the bug was active, the decimal part of the floating-point numbers was not read from the file but simply taken from uninitialized memory, so WGT and CYL were not actually whole numbers in the pandas DataFrame.

I originally had a CSV file containing the correct numbers, but was compelled to remove it because of @WillAyd's comments. I'll gladly add it back if that's preferable.
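The integrality check being described can be sketched against a synthetic series. `assert_integral` is a hypothetical helper for illustration, not the merged test itself:

```python
import pandas as pd

def assert_integral(s: pd.Series) -> None:
    # Every float must equal its nearest integer; with the bug active,
    # garbage in the low-order mantissa bits made s differ from s.round().
    bad = s[s != s.round()]
    assert bad.empty, f"non-integral values: {bad.tolist()}"

weights = pd.Series([3504.0, 3693.0, 3436.0], name="WGT")
assert_integral(weights)       # passes on correctly parsed data

corrupted = weights + 1e-9     # simulate uninitialized low-order bits
try:
    assert_integral(corrupted)
    caught = False
except AssertionError:
    caught = True
assert caught                  # the regression test would catch this
```

Comparing against `s.round()` rather than a fixed expected list is what makes the assertion self-contained: it encodes the file-format guarantee (integral values) without shipping a second copy of the data.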

@jreback added this to the 0.24.0 milestone on Sep 15, 2018
@jreback merged commit 307797c into pandas-dev:master on Sep 15, 2018
@jreback (Contributor) commented Sep 15, 2018

thanks @troels

Labels
IO SAS (SAS: read_sas)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_sas does not handle numeric variables stored with fewer than 8 bytes in SAS datasets
5 participants