BUG: pandas.DataFrame.compare causes loss of accuracy for big ints #39899

yuhan-wang · 2021-02-19T00:53:20Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
left = pd.DataFrame({'a':[1567808378753000000], 'b':[0]})
right = pd.DataFrame({'a':[1567808378753274000], 'b':[0]})
print(left.compare(right))

Problem description

The current output is


                     a                     
                  self                other
0  1567808378752999936  1567808378753274112

which differs from the values in the original DataFrames: the `self` value is off by 64 and the `other` value is off by 112.

Expected Output

                     a                     
                  self                other
0  1567808378753000000  1567808378753274000

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 7d32926
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1036-gcp
Version : #39-Ubuntu SMP Thu Jan 14 18:41:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 19.3.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.22
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


phofl · 2021-02-19T20:41:17Z

This is cast to float internally, which causes the precision loss. I think this should be fixed when we adopt the nullable dtypes more widely.
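To illustrate the precision loss itself, here is a minimal sketch that round-trips the two values from the report through a 64-bit float. It uses only plain Python (whose `float` is an IEEE 754 double, the same format as `float64`) and does not call `compare`, so it shows the effect of the cast, not the pandas code path.

```python
# The two integers from the report.
values = [1567808378753000000, 1567808378753274000]

# Convert each value to a 64-bit float and back to an int.
round_tripped = [int(float(v)) for v in values]

for original, result in zip(values, round_tripped):
    print(f"{original} -> {result} (off by {abs(result - original)})")
```

Both values lie between 2**60 and 2**61, where adjacent doubles are 256 apart, so each integer is rounded to the nearest multiple of 256. The results match the numbers shown in the reported output.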

yuhan-wang · 2021-02-20T02:47:28Z

OK. Do you know why the problem disappears if I remove the 'b' column?

mzeitlin11 · 2021-03-30T18:31:23Z

As part of generating the result, positions with equal values are masked out. Since the values in column b are equal, a NaN ends up in that position.

This issue could be fixed, but it would likely not be easy. What's happening is that, since a and b are both integer columns, they are stored in a single block. The masking hits logic that essentially says: if the block can't hold NaN, coerce it to something that can. So column a is coerced as well, even though it does not necessarily need to be.

Because of this, if b is a float column, this will actually work, because a and b will be stored in different blocks.
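The effect of masking on an integer container can be sketched with `Series.where`. This is only an approximation of the explanation above: it is not the internal block logic that `compare` uses, and the variable names are made up for illustration. It assumes a pandas version that provides the nullable `Int64` dtype.

```python
import pandas as pd

# Keep the first position and mask out the second, as happens to equal values.
keep = pd.Series([True, False])

# A plain int64 Series cannot hold NaN, so masking upcasts it to float64.
plain = pd.Series([1567808378753000000, 0])
plain_masked = plain.where(keep)

# The nullable Int64 dtype stores missing values separately, so no cast is needed.
nullable = pd.Series([1567808378753000000, 0], dtype="Int64")
nullable_masked = nullable.where(keep)

print(plain_masked.dtype, int(plain_masked.iloc[0]))
print(nullable_masked.dtype, int(nullable_masked.iloc[0]))
```

In the plain case the surviving value is rounded even though only the other position was masked, which mirrors what happens to column a when it shares a block with column b.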

yuhan-wang · 2021-03-31T04:39:59Z

Interesting. Thanks!

yuhan-wang added the Bug and Needs Triage labels on Feb 19, 2021

phofl added the Dtype Conversions label and removed the Needs Triage label on Feb 19, 2021
