pd.testing.assert_frame_equal doesn't do precision according to the doc #25068

Closed
stas00 opened this issue Feb 1, 2019 · 13 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Testing pandas testing functions or related to the test suite

Comments

@stas00

stas00 commented Feb 1, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016,                -0.154526,            -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)

Problem description

This raises an AssertionError, even though all values are identical in the first 3 digits after the decimal point.

AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (33.33333 %)
[left]:  [0.00016, -0.154526, -0.20580199999999998]
[right]: [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

It doesn't assert if check_less_precise=2 is used instead. So something is not right here. Is there some kind of a rounding issue here?

Doc:

check_less_precise : bool or int, default False

Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare

I understand the doc says check_less_precise defines how many digits after the decimal point are compared.

Unrelated: the doc should probably say "decimal point" (singular), as there is only one, no? And "specify the digits to compare" is vague; perhaps "If int, then specify how many digits after the decimal point to compare"?

Here is a proposed updated doc entry:

Specify comparison precision. Only used when check_exact is False. int: How many digits after the decimal point to compare, False: 5 digits, True: 3 digits.

Expected Output

No assertion error for values up to check_less_precise=4 in this example; the numbers start to diverge at digit 5.

It is also still unclear whether rounding is performed or not.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_CA.UTF-8
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.24.0
pytest: 4.0.2
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Feb 3, 2019

Thanks for the report! This does look strange - investigation and PRs would certainly be welcome

@WillAyd WillAyd added Bug Testing pandas testing functions or related to the test suite labels Feb 3, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Feb 3, 2019
stas00 added a commit to fastai/fastai that referenced this issue Feb 3, 2019
pandas-dev/pandas#25068 (comment)
which quite often fails on CI.
once it's resolved can change the setting back to check_less_precise=True (or better =3), until then using =2 as it works, but this check is less good.
@stas00
Author

stas00 commented Feb 3, 2019

What I have a hard time grasping is the way this function is designed. Unless I misunderstand the documentation, how can it help me compare these two numbers:

0.6000000
0.5999999

The approach of comparing only n decimals is strange. These two numbers are almost identical, yet no matter how many digits you set, this function will still raise an assertion failure if the 9's go on for quite a few more digits.

For example, math.isclose has a relative and an absolute tolerance, which makes total sense. So in the example above, I can ask for, say, a 0.1% tolerance and those two numbers will be considered close.

pd.testing.assert_frame_equal's approach is just totally unclear to me.
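To make the contrast concrete, here is a minimal standard-library sketch of the tolerance-based comparison described above, using the two numbers from this comment. The specific tolerance values (0.1% relative, 1e-6 absolute) are only illustrative choices, not anything pandas uses.

```python
import math

a, b = 0.6000000, 0.5999999

# Relative tolerance: the difference may be up to 0.1% of the larger magnitude.
close_rel = math.isclose(a, b, rel_tol=1e-3)

# Absolute tolerance only: the difference may be up to 1e-6.
close_abs = math.isclose(a, b, rel_tol=0.0, abs_tol=1e-6)

# A literal digit-by-digit reading already disagrees at the first decimal.
first_digit_a = f"{a:.7f}"[2]
first_digit_b = f"{b:.7f}"[2]

print(close_rel, close_abs, first_digit_a, first_digit_b)  # True True 6 5
```

Both tolerance checks accept the pair, while comparing the printed digits position by position would reject it at the very first decimal.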

kinow added a commit to kinow/pandas that referenced this issue Feb 3, 2019
kinow added a commit to kinow/pandas that referenced this issue Feb 3, 2019
@kinow
Contributor

kinow commented Feb 3, 2019

I think the comparison is done in this function

cdef bint decimal_almost_equal(double desired, double actual, int decimal):
    # Code from
    # http://docs.scipy.org/doc/numpy/reference/generated
    # /numpy.testing.assert_almost_equal.html
    return abs(desired - actual) < (0.5 * 10.0 ** -decimal)

The NumPy code referenced in that comment, however, does not use the (stricter) 0.5 factor; NumPy's function uses 1.5. There is also a note there now recommending NumPy's assert_allclose instead.

https://github.com/numpy/numpy/blob/d7272536955cb5bd662228787b761eab2ca2c729/numpy/testing/_private/utils.py#L897-L916

assert_allclose, in turn, calls a function that supports parameters for absolute and relative tolerance, @stas00. I tried changing the constant in the pandas function to 1.5 as well, but then it becomes too lenient and several tests fail (I was preparing a pull request because I thought it would be simpler...).

Perhaps it would be easier to replace the function with something like the newer NumPy function, or with some other function?
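For illustration, a small pure-Python sketch of how much the constant matters. The helper `within` and the two constant names are hypothetical, introduced only for this example; the 0.5 and 1.5 factors are the ones mentioned in this comment.

```python
def within(desired: float, actual: float, decimal: int, factor: float) -> bool:
    """Return True if |desired - actual| < factor * 10**-decimal."""
    return abs(desired - actual) < factor * 10.0 ** -decimal

PANDAS_FACTOR = 0.5  # constant in the decimal_almost_equal snippet quoted above
NUMPY_FACTOR = 1.5   # constant used by numpy.testing.assert_almost_equal

# A difference of about 0.001 at decimal=3:
# rejected with a 0.5e-3 threshold, accepted with a 1.5e-3 threshold.
strict = within(1.000, 1.001, 3, PANDAS_FACTOR)
lenient = within(1.000, 1.001, 3, NUMPY_FACTOR)

print(strict, lenient)  # False True
```

The threshold is three times wider with 1.5, which is consistent with the observation that switching the constant makes existing tests too lenient.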

Cheers
Bruno

@stas00
Author

stas00 commented Feb 3, 2019

thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.

What it does is check how small the difference between the two numbers is, in terms of decimal places (0.000x), not how many decimals of each number match. And then there is the factor of 1/2...

Let's rewrite:

abs(desired - actual) < (0.5 * 10.0 ** -decimal)

to:

(abs(desired - actual) * 10.0**decimal) < 0.5

so it's easier to understand.

So 2 digits gives us:

 (0.6-0.599)*10**2 = 0.1 < 0.5 [True]
 (0.6-0.595)*10**2 = 0.5 = 0.5 [False]
 (0.6-0.590)*10**2 = 1   > 0.5 [False]

so 2 digits gives us a [0,0.005) absolute range tolerance [0, 0.5*1e-2)

and 3:

 (0.6-0.5999)*10**3 = 0.1 < 0.5 [True]
 (0.6-0.5995)*10**3 = 0.5 = 0.5 [False]
 (0.6-0.5990)*10**3 = 1   > 0.5 [False]

so 3 digits gives us a [0,0.0005) absolute range tolerance [0, 0.5*1e-3)

and so n digits gives us [0, 0.5*1e-n) absolute range tolerance.

So the description should probably use code instead of words:

assert (abs(df2 - df1) * 10**n < 0.5).all().all(), f"frames differ by {0.5 * 10**-n} or more"

I hope I didn't miss a zero somewhere.

Except it doesn't seem to be the right function, since if I now apply this same logic to the original failing test to emulate check_less_precise=3:

import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016,                -0.154526,            -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
df3 = abs(df1.subtract(df2))*10**3
df3
#pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)

I get:

0	0.000182
1	0.000422
2	0.000111

none of which is >0.5, i.e. it shouldn't assert.

It should assert with check_less_precise=7 or higher, so somewhere 4 decimal places are lost, as it starts asserting with n=3, instead of n=7.

df3 = abs(df1.subtract(df2))*10**6 < 0.5
0	True
1	True
2	True
df3 = abs(df1.subtract(df2))*10**7 < 0.5
0	False
1	False
2	False

So it's not a question of 0.5 vs 1.5, but 1 vs 10000.
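The same conclusion can be reproduced without pandas. The sketch below applies the absolute rule derived above to the values from the original report and searches for the largest n that still passes; `abs_close` and `largest_passing_decimal` are hypothetical helper names used only here.

```python
left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

def abs_close(a: float, b: float, n: int) -> bool:
    """Absolute rule from the thread: |a - b| < 0.5 * 10**-n."""
    return abs(a - b) < 0.5 * 10.0 ** -n

def largest_passing_decimal(xs, ys, max_n=15):
    """Largest n in 0..max_n for which every pair passes, or None if n=0 fails."""
    best = None
    for n in range(max_n + 1):
        if all(abs_close(x, y, n) for x, y in zip(xs, ys)):
            best = n
        else:
            break
    return best

print(largest_passing_decimal(left, right))  # 6
```

Under a purely absolute rule the data would pass up to n=6, so the failure at n=3 suggests that something other than this absolute comparison is being applied.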

Finally, a sanity check of the same numbers with numpy:

import numpy as np
import numpy.testing
np.testing.assert_array_almost_equal([.00016,                 -0.154526,            -0.20580199999999998],
                                     [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637],
                                     decimal=6)

doesn't fail, with decimal=7 it does - as expected.

@zachlipp

I just hit this problem. Unsure if this is still on anyone's radar, but it was pretty surprising for me. I also used numpy functions (np.isclose instead of np.testing.assert_array_almost_equal, which I'll move to in the future) to get around it.

If there is interest in updating this parameter, it seems to me like @kinow's suggestion of using these numpy functions is a good path forward, though I'm far from an expert on this.

mikej888 added a commit to riboviz/riboviz that referenced this issue Aug 2, 2019
equal_tsv calls equal_dataframes.

equal_dataframes compares non-float columns for exact equality.

equal_dataframes converts float columns to numpy arrays and compares for equality within a given tolerance using numpy.allclose. This is used instead of pandas.testing.assert_frame_equal as there is an issue with how that function handles precision (see [pandas.testing.assert_frame_equal doesn't do precision according to the doc #25068](pandas-dev/pandas#25068)).

"NAN" values in float columns are considered to be equal.
@s-mariani

Dear everybody, any update on this? I'm trying to compare only 2 decimals but it seems it still checks 3...

@usmcamp0811

I just ran into this error with some code I am writing. I have the check_less and check_exact arguments set but still get an assertion error. The message prints the same numbers out to the maximum print precision of 15 decimal places.

@kinow
Contributor

kinow commented Dec 3, 2019

Hi @stas00

thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.

+1

Except it doesn't seem to be the right function, since

I'm also starting to think that that function may not be the best fit for what is documented in assert_frame_equal. Here are other ways to trigger the error.

import pandas as pd
df1 = pd.DataFrame([0.15])
df2 = pd.DataFrame([0.16])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)

Or

import pandas as pd
df1 = pd.DataFrame([0.099999])
df2 = pd.DataFrame([0.09])  # 0.099 will pass
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)

The function I mentioned before is not actually called with these values directly.


# case for zero
if abs(fa) < 1e-5:
    if not decimal_almost_equal(fa, fb, decimal):
        assert False, (f'(very low values) expected {fb:.5f} '
                       f'but got {fa:.5f}, with decimal {decimal}')
else:
    if not decimal_almost_equal(1, fb / fa, decimal):
        assert False, (f'expected {fb:.5f} but got {fa:.5f}, '
                       f'with decimal {decimal}')
return True

So for a=0.15, b=0.16, and decimal=1, the check abs(0.15) < 1e-5 is false, and we end up in the else block. We then have:

if not decimal_almost_equal(1, fb / fa, decimal):
# or, with fa=0.15 and fb=0.16
if not decimal_almost_equal(1, 0.16 / 0.15, 1):
# or
if not decimal_almost_equal(1, 1.0667, 1):

# which will be
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
# solving it
abs(1 - 1.0667) < (0.05)
0.0667 < 0.05

In this case, the ratio is not close enough to 1, so the function fails. However, the calling function was supposed to compare based on the digits after the decimal point. So if decimal=1, from what I understand, it should take 0.15 and 0.16 and compare only 0.1 == 0.1, i.e. using only 1 decimal.

If instead of the ratio, we use the function directly with decimal_almost_equal(0.15, 0.16, 1), then it will work OK.
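The ratio branch quoted above also seems to account for the numbers in the original report. The following is a pure-Python sketch of that branch, not the actual Cython code; `almost_equal_sketch` is a hypothetical name, and it assumes fa is the left value and fb the right value.

```python
def decimal_almost_equal(desired: float, actual: float, decimal: int) -> bool:
    return abs(desired - actual) < 0.5 * 10.0 ** -decimal

def almost_equal_sketch(fa: float, fb: float, decimal: int) -> bool:
    """Sketch of the quoted branch: absolute near zero, ratio against 1 otherwise."""
    if abs(fa) < 1e-5:
        return decimal_almost_equal(fa, fb, decimal)
    return decimal_almost_equal(1, fb / fa, decimal)

left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

results_3 = [almost_equal_sketch(a, b, 3) for a, b in zip(left, right)]
results_2 = [almost_equal_sketch(a, b, 2) for a, b in zip(left, right)]

print(results_3)  # [False, True, True]
print(results_2)  # [True, True, True]
```

Only the first pair fails at decimal=3, because 0.00016 is above the 1e-5 cutoff and the ratio differs from 1 by roughly 0.0011. That matches the reported "33.33333 %" mismatch and the observation that check_less_precise=2 passes.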

However, if we use the other example pair 0.099999 and 0.01, with decimal=1.

abs(a - b) < (0.5 * 10.0 ** -decimal)
abs(0.099999 - 0.01) < 0.05
0.08999900000000001 < 0.05

Still fails. Looks like decimal_almost_equal is not the right function for the comparison? I have a working function in my notebook, but it is using the simplest approach, that truncates the value instead of comparing differences, ratios, etc. Will prepare a PR soon for discussion 👍

Not super confident that that is the proper solution though, so happy if others chime in with their suggestions.

@kinow
Contributor

kinow commented Dec 3, 2019

Hmm, maybe I spoke too fast.

This commit has a unit test with the examples discussed here: kinow@f45be0e

The test passes, but several other tests fail. For example,

# test_timeseries.test_pct_change_shift_over_nas
    def test_pct_change_shift_over_nas(self):
        s = Series([1.0, 1.5, np.nan, 2.5, 3.0])

        chg = s.pct_change()
        expected = Series([np.nan, 0.5, 0.0, 2.5 / 1.5 - 1, 0.2])
        tm.assert_series_equal(chg, expected)

Fails with

E   AssertionError: Series are different
E   
E   Series values are different (20.0 %)
E   [left]:  [nan, 0.5, 0.0, 0.6666666666666667, 0.19999999999999996]
E   [right]: [nan, 0.5, 0.0, 0.6666666666666667, 0.2]

The values that fail are 0.19999999999999996 and 0.2 (and check_less_precise=False, so decimal=5). Not sure if it is following what's in the docs - maybe we just need to update the docs after all?

check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare.

This part is the most confusing for me: "digits (...) after decimal points are compared". If we have 5 digits, and 0.19999999999999996 and 0.2, the parts after the decimal points are 19999999999999996, and 2. Assuming we are to use only the 5 digits, then 19999 and 2 would be compared?
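To show why a literal "compare the first n digits" reading breaks this test, here is a small sketch that truncates both values to 5 decimal places using the standard decimal module. The `truncate` helper is hypothetical and only illustrates the truncation idea; it is not the code from the commit above.

```python
from decimal import Decimal, ROUND_DOWN

def truncate(x: float, digits: int) -> Decimal:
    """Drop everything beyond `digits` decimal places of the repr of x."""
    quantum = Decimal(10) ** -digits  # Decimal('0.00001') for digits=5
    return Decimal(repr(x)).quantize(quantum, rounding=ROUND_DOWN)

a = truncate(0.19999999999999996, 5)
b = truncate(0.2, 5)

print(a, b, a == b)  # 0.19999 0.20000 False
```

Two floats that differ only by rounding noise end up with different truncated digits, which is why a tolerance on the difference tends to be a better fit for floating-point test comparisons than truncation.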

mikej888 added a commit to riboviz/riboviz that referenced this issue Mar 4, 2020
@loikein

loikein commented Mar 7, 2020

Any updates? There have been several releases, but the problem seems to persist.

@wudstrand

Any updates?

@mzeitlin11
Member

Looks fixed by #30562. The example from the OP no longer raises (and check_less_precise is deprecated in favor of rtol and atol anyway).

@mzeitlin11 mzeitlin11 added Closing Candidate May be closeable, needs more eyeballs and removed Bug labels Dec 24, 2020
@stas00
Author

stas00 commented Dec 24, 2020

Thank you for tracking that, @mzeitlin11!

I verified that with the current version (1.1.5), if I replace check_less_precise with rtol=3, it works as expected.

Awesome!
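For anyone migrating, a hedged sketch of the rtol/atol form with the data from the original report. It assumes pandas 1.1 or later. The value 0.5e-3 is an assumption about what check_less_precise=3 corresponds to, based on the 0.5 * 10**-n rule discussed earlier in this thread. Note that rtol is a fraction of the compared value, so a value as large as 3 would accept almost any difference.

```python
import pandas as pd

df1 = pd.DataFrame([0.00016, -0.154526, -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])

# Assumed equivalent of the old check_less_precise=3 (0.5 * 10**-3).
tol = 0.5e-3
pd.testing.assert_frame_equal(df1, df2, check_exact=False, rtol=tol, atol=tol)

# A much tighter tolerance should reject the same frames.
try:
    pd.testing.assert_frame_equal(df1, df2, check_exact=False, rtol=1e-9, atol=1e-12)
    tight_passed = True
except AssertionError:
    tight_passed = False

print(tight_passed)  # False
```

Passing both rtol and atol explicitly makes the intended tolerance visible in the test instead of relying on a digit count.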

@stas00 stas00 closed this as completed Dec 24, 2020