
BUG: Make sure that sas7bdat parsers memory is initialized to 0 (#21616) #22651

Merged
merged 1 commit into pandas-dev:master on Sep 15, 2018

Conversation

@troels (Contributor) commented Sep 9, 2018

Memory for numbers in sas7bdat parsing was not initialized to 0.
For sas7bdat files with numbers stored in fewer than 8 bytes, this made
the least significant part of the numbers essentially random.
Fix it by initializing the memory correctly.
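The failure mode can be sketched at the Python level. This is a simplified analogue of the parser's byte buffer, not the actual Cython code; `read_number` is a hypothetical helper:

```python
import struct

def read_number(stored: bytes, buf: bytearray) -> float:
    # Copy only the stored high-order bytes into an 8-byte buffer and
    # reinterpret the whole buffer as a big-endian IEEE-754 double.
    # Whatever was already sitting in the remaining low-order bytes
    # survives and ends up in the decoded value.
    buf[:len(stored)] = stored
    return struct.unpack(">d", bytes(buf))[0]

# 3500.0 as a big-endian double, truncated to 4 bytes as SAS would store it
stored = struct.pack(">d", 3500.0)[:4]

dirty = bytearray(b"\xff" * 8)  # stale bytes left over from a previous value
clean = bytearray(8)            # zero-initialized, as after the fix

print(read_number(stored, dirty))  # slightly off from 3500.0
print(read_number(stored, clean))  # exactly 3500.0
```

With a reused, uninitialized buffer the low-order mantissa bits are whatever the previous value left behind, which is why the corruption looked random from run to run.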

@pep8speaks

Hello @troels! Thanks for submitting the PR.

@codecov

codecov bot commented Sep 9, 2018

Codecov Report

Merging #22651 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22651   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50715    50715           
=======================================
  Hits        46747    46747           
  Misses       3968     3968
Flag       | Coverage Δ
#multiple  | 90.58% <100%> (ø) ⬆️
#single    | 42.35% <0%> (ø) ⬆️

Impacted Files            | Coverage Δ
pandas/io/sas/sas7bdat.py | 91.09% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c040353...796fc43. Read the comment docs.

pandas/tests/io/sas/test_sas7bdat.py (outdated thread, resolved)
pandas/tests/io/sas/data/cars.csv (outdated thread, resolved)
@@ -183,6 +183,18 @@ def test_date_time(datapath):
tm.assert_frame_equal(df, df0)


def test_compact_numerical_values(datapath):
# Regression test for #21616
fname = datapath("io", "sas", "data", "cars.sas7bdat")
Member:
Using the fixture in this fashion will generate warnings. We just went through an exercise to clean these up in #22515 - can you take a look at that and adjust here accordingly?

@troels (Contributor, Author):
If I understood the discussion on the pytest tracker properly, datapath used this way is in fact not something that will be deprecated.

datapath is a "factory as a fixture" as described here:
https://docs.pytest.org/en/latest/fixture.html#factories-as-fixtures

which is distinct from what is being discussed here:
pytest-dev/pytest#3661

Have I misunderstood?
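For reference, the factory-as-a-fixture pattern looks roughly like this. The names here are hypothetical; this is a simplified sketch, not pandas' actual `datapath` fixture:

```python
import os
import pytest

def make_datapath(base_dir):
    # The factory: returns a callable that joins path components onto
    # a base directory, rather than returning a value directly.
    def datapath(*parts):
        return os.path.join(base_dir, *parts)
    return datapath

@pytest.fixture(name="datapath")
def datapath_fixture():
    # The fixture provides the factory's product: tests receive a
    # callable and build concrete paths on demand.
    return make_datapath(os.getcwd())

def test_example(datapath):
    # Tests call the fixture value like a function.
    fname = datapath("io", "sas", "data", "cars.sas7bdat")
    assert fname.endswith("cars.sas7bdat")
```

The point of the pattern is that the fixture itself is not the value under discussion in pytest-dev/pytest#3661; it is a plain callable handed to the test.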

@troels (Contributor, Author) commented Sep 11, 2018:

And it produces no warnings, as far as I can tell.

Contributor:
This usage is fine I think.

But, do you need to include a binary file in the git repo to reproduce the error on master? I'd like to avoid adding new ones if possible.

@troels (Contributor, Author):
Hi @TomAugspurger

I am adding a sas7bdat file for testing the parsing of sas7bdat files. How can that be done more sensibly?

@WillAyd added the IO SAS (SAS: read_sas) label on Sep 11, 2018
# The two columns CYL and WGT in cars.sas7bdat have column
# width < 8 and only contain integral values. Test
# that pandas doesn't corrupt the least significant bits.
tm.assert_series_equal(df['WGT'], df['WGT'].round(), check_exact=True)
Member:
Hmm not sure comparison to self as part of the test is the best option here - do we have any way of strengthening the assertion(s) being made?

@troels (Contributor, Author):
We aren't comparing to self. We are making sure that floats with no decimal part in the sas7bdat file don't acquire one because of a bug in pandas. That is exactly what this bug is about, after all.

(See: https://stackoverflow.com/questions/49059421/pandas-fails-with-correct-data-type-while-reading-a-sas-file)

Before:

  1. Memory was not initialized.
  2. The decimal part of the float was taken from random uninitialized memory.
  3. Test fails most of the time.

Now:

  1. Memory is zeroed out.
  2. The decimal part of the float is not read (the file format implies it is zero), so it is still zero in pandas.
  3. Test succeeds.

Contributor:
can you use

result = 
expected = 

to make this easier to read?

why do we need to .round(), rather than just using check_less_precision?

@troels (Contributor, Author) commented Sep 12, 2018:
I can also do:

msg = "Expected df['WGT'] to be full of integers"

assert (df['WGT'] == df['WGT'].astype('int')).all(), msg
tm.assert_series_equal(df['WGT'], df['WGT'].astype('int'), check_dtype=False, check_exact=True)
assert all(f.is_integer() for f in df['WGT']), msg
assert all(f - int(f) == 0 for f in df['WGT']), msg
assert all(f == int(f) for f in df['WGT']), msg

or any other way of saying the same thing you would prefer...

@WillAyd (Member) left a comment:
@TomAugspurger any thoughts on this one?

@TomAugspurger (Contributor):

cc @kshedden and @Winand if you have time to give this a review.

@TomAugspurger (Contributor) commented Sep 11, 2018 via email

@troels (Contributor, Author) commented Sep 11, 2018

@TomAugspurger:

No, none of the existing test files has the problem. If any of them had numeric columns narrower than 8 bytes, the tests would probably have been unstable.

Usually one wants to avoid binary files in git repositories for two reasons:

  1. They tend to be large.
  2. Every time a binary file changes, git tends to store the complete file as the delta (git's diff is clumsy with binary content). That makes e.g. large images that are edited and committed often rather toxic.

A small file like cars.sas7bdat, which will never be edited and, to top it off, is highly compressible, is not going to take much space at all, now or in the future.


…as-dev#21616)

Memory for numbers in sas7bdat-parsing was not initialized properly to 0.
For sas7bdat files with numbers smaller than 8 bytes this made the
least significant part of the numbers essentially random.
Fix it by initializing memory correctly.
@troels (Contributor, Author) commented Sep 12, 2018

Hi @jreback

Sure thing, I've added variables and extended the comment a bit.

We are comparing the result that the sas7bdat parser reads from the file with the closest integers. Since we know the numbers are integral, they should be exactly equal to their closest integer. When the bug was active, the decimal part of the floating-point numbers was not read from the file but simply taken from uninitialized memory, so WGT and CYL were not actually whole numbers in the pandas DataFrame.

I originally had a CSV file containing the correct numbers, but was compelled to remove it because of @WillAyd's comments. I'll gladly add it back if that's preferable.
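The integrality check being described can be sketched against a synthetic series. `assert_integral` is a hypothetical helper for illustration, not the merged test itself:

```python
import pandas as pd

def assert_integral(s: pd.Series) -> None:
    # Every float must equal its nearest integer; with the bug active,
    # garbage in the low-order mantissa bits made s differ from s.round().
    bad = s[s != s.round()]
    assert bad.empty, f"non-integral values: {bad.tolist()}"

weights = pd.Series([3504.0, 3693.0, 3436.0], name="WGT")
assert_integral(weights)       # passes on correctly parsed data

corrupted = weights + 1e-9     # simulate uninitialized low-order bits
try:
    assert_integral(corrupted)
    caught = False
except AssertionError:
    caught = True
assert caught                  # the regression test would catch this
```

Comparing against `s.round()` rather than a fixed expected list is what makes the assertion self-contained: it encodes the file-format guarantee (integral values) without shipping a second copy of the data.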

@jreback added this to the 0.24.0 milestone on Sep 15, 2018
@jreback merged commit 307797c into pandas-dev:master on Sep 15, 2018
@jreback (Contributor) commented Sep 15, 2018

thanks @troels

Labels
IO SAS (SAS: read_sas)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_sas does not handle numeric variables stored with fewer than 8 bytes in SAS datasets
5 participants