BUG: pandas.DataFrame.compare causes loss of accuracy for big ints #39899

Open
2 of 3 tasks
yuhan-wang opened this issue Feb 19, 2021 · 4 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@yuhan-wang

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

# Two frames that differ only in column 'a'; column 'b' is identical in both.
left = pd.DataFrame({'a': [1567808378753000000], 'b': [0]})
right = pd.DataFrame({'a': [1567808378753274000], 'b': [0]})
print(left.compare(right))

Problem description

The current output is


                     a                     
                  self                other
0  1567808378752999936  1567808378753274112

which differs from the values in the original dataframe.

Expected Output

                     a                     
                  self                other
0  1567808378753000000  1567808378753274000

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1036-gcp
Version : #39-Ubuntu SMP Thu Jan 14 18:41:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 19.3.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.22
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@yuhan-wang yuhan-wang added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021
@phofl
Member

phofl commented Feb 19, 2021

This is cast to float internally, which causes the precision loss. I think this should be fixed once we adopt the nullable dtypes more widely.
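To see why a cast to float loses digits here, the following is a minimal plain-Python sketch (not pandas internals). It assumes only that the internal cast is to a 64-bit float, which has a 53-bit significand; the helper name `round_trip` is hypothetical.

```python
# Sketch: an int64 value pushed through a 64-bit float and back.
# `round_trip` is a hypothetical helper, not part of pandas.

def round_trip(value: int) -> int:
    """Convert an integer to a Python float (IEEE 754 double) and back."""
    return int(float(value))

left_value = 1567808378753000000
right_value = 1567808378753274000

for original in (left_value, right_value):
    recovered = round_trip(original)
    # Near 1.5e18, adjacent doubles are 256 apart, so most integers
    # in that range cannot be represented exactly.
    print(original, recovered, recovered - original)

# Integers up to 2**53 survive the round trip; 2**53 + 1 does not.
print(round_trip(2**53) == 2**53, round_trip(2**53 + 1) == 2**53 + 1)
```

The recovered values match the `self` and `other` values printed in the report above.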

@phofl phofl added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021
@yuhan-wang
Author

OK. Do you know why the problem disappears if I remove the 'b' column?

@mzeitlin11
Member

As part of generating the result, positions with equal values are masked out. Since the values in column b are equal, a NaN ends up in that position.

This issue could be fixed, but it would likely not be easy. What's happening is that since a and b are integer columns, they are stored in a single block. There is logic hit that essentially says - if the block can't hold NaN, then coerce to something which can. So column a is coerced as well, even though it does not necessarily need to be.

Because of this, if b is a float column this will actually work because a and b will be stored in different blocks.
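The coercion described above can be sketched on a single Series. This is an illustration under assumptions, not the code path `compare` actually takes: it uses `Series.where` as a stand-in for the internal masking step, and contrasts the default `int64` dtype with the nullable `Int64` dtype mentioned earlier in the thread.

```python
import pandas as pd

big = 1567808378753000000
keep = pd.Series([True, False])  # False marks the position to mask out

# Default int64: the masked slot needs NaN, which int64 cannot hold,
# so the whole Series is upcast to float64 and the big value is rounded.
plain = pd.Series([big, 0]).where(keep)

# Nullable Int64: the masked slot becomes <NA> and the dtype stays integer,
# so the big value is preserved exactly.
nullable = pd.Series([big, 0], dtype="Int64").where(keep)

print(plain.dtype, int(plain.iloc[0]))
print(nullable.dtype, int(nullable.iloc[0]))
```

In this sketch the unmasked value is damaged only because it shares a container with a value that had to become NaN, which mirrors the single-block explanation above.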

@yuhan-wang
Author

Interesting. Thanks!
