pd.testing.assert_frame_equal doesn't do precision according to the doc #25068

stas00 · 2019-02-01T02:29:47Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016,                -0.154526,            -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)

Problem description

This asserts, despite all columns being identical in the first 3 digits after the decimal point.

AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (33.33333 %)
[left]:  [0.00016, -0.154526, -0.20580199999999998]
[right]: [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

It doesn't assert if check_less_precise=2 is used instead. So something is not right here. Is there some kind of a rounding issue here?

Doc:

check_less_precise : bool or int, default False

Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare

I understand the doc says check_less_precise defines how many digits after the decimal point are compared.

Unrelated: The doc should probably say "decimal point" (singular), as there is only one, no? And "specify the digits to compare" is vague; perhaps "If int, then specify how many digits after the decimal point to compare"?

Here is a proposed updated doc entry:

Specify comparison precision. Only used when check_exact is False. int: How many digits after the decimal point to compare, False: 5 digits, True: 3 digits.

Expected Output

no assert for up to check_less_precise=4 in this example, the numbers start to diverge at digit 5.

and it's still unclear whether rounding is performed or not.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_CA.UTF-8
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.24.0
pytest: 4.0.2
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-02-03T01:19:15Z

Thanks for the report! This does look strange - investigation and PRs would certainly be welcome

pandas-dev/pandas#25068 (comment) which quite often fails on CI. once it's resolved can change the setting back to check_less_precise=True (or better =3), until then using =2 as it works, but this check is less good.

stas00 · 2019-02-03T07:33:23Z

What I have a hard time grasping is the way this function is designed. Unless I don't understand the documentation, how can it help me to compare these two numbers:

0.6000000
0.5999999

The approach of comparing only n decimals is so strange. These two numbers are almost identical, and no matter how many digits you set, this function will still assert failure if the 9's go on for quite a few more digits.

For example, math.isclose has a relative and an absolute tolerance, which makes total sense. So in the example above, I can ask for, say, a 0.1% tolerance and those 2 numbers will be close.
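To make the contrast concrete, here is a minimal sketch that uses only `math.isclose` from the Python standard library. The 0.1% figure comes from the comment above; the stricter tolerance is an arbitrary value picked purely for illustration.

```python
import math

a, b = 0.6000000, 0.5999999

# Relative tolerance of 0.1%: the two values count as close.
close_at_0_1_percent = math.isclose(a, b, rel_tol=1e-3)

# A much stricter relative tolerance (illustrative value) rejects them.
close_at_1e_9 = math.isclose(a, b, rel_tol=1e-9)

print(close_at_0_1_percent, close_at_1e_9)
```

The result depends only on the size of the difference relative to the values, not on how many leading digits happen to match.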

pd.testing.assert_frame_equal's approach is just totally unclear to me.


kinow · 2019-02-03T12:01:38Z

I think the comparison is done in this function

pandas/pandas/_libs/testing.pyx

Lines 42 to 46 in f75a220

    
cdef bint decimal_almost_equal(double desired, double actual, int decimal):
    # Code from
    # http://docs.scipy.org/doc/numpy/reference/generated
    # /numpy.testing.assert_almost_equal.html
    return abs(desired - actual) < (0.5 * 10.0 ** -decimal)

The code referenced in that comment, however, does not use the (stricter) factor of 0.5; the NumPy function uses 1.5. There is also a note there now recommending NumPy's assert_allclose instead.

https://github.com/numpy/numpy/blob/d7272536955cb5bd662228787b761eab2ca2c729/numpy/testing/_private/utils.py#L897-L916

And assert_allclose calls a function that supports parameters for absolute and relative tolerance @stas00 . I tried adjusting the constant in the Pandas function to use 1.5 too, but then it becomes too lenient and several tests fail (was preparing a pull request because I thought it would be simpler...).
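The effect of the constant can be seen in a small pure-Python sketch. The helper name `within` and the sample values are made up for illustration; the factors 0.5 and 1.5 are the ones discussed in this comment.

```python
def within(desired, actual, decimal, factor):
    """Return True if |desired - actual| < factor * 10**-decimal."""
    return abs(desired - actual) < factor * 10.0 ** -decimal

# Hypothetical pair that differs by about 0.1, compared with decimal=1.
desired, actual = 0.25, 0.15

strict = within(desired, actual, decimal=1, factor=0.5)   # threshold 0.05
lenient = within(desired, actual, decimal=1, factor=1.5)  # threshold 0.15

print(strict, lenient)
```

With the larger factor the accepted difference is three times wider, which is consistent with existing tests starting to fail once the constant is changed.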

Instead, perhaps, it would be easier to replace the function by either something like the new function in NumPy, or perhaps some other function?

Cheers
Bruno

stas00 · 2019-02-03T18:51:45Z

thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.

What it does is compare the absolute difference between the 2 numbers against a threshold of the form 0.000x; it does not look at how many decimals of each number match. And then there is the factor of 1/2...

Let's rewrite:

abs(desired - actual) < (0.5 * 10.0 ** -decimal)

to:

(abs(desired - actual) * 10.0**decimal) < 0.5

so it's easier to understand.

So 2 digits gives us:

 (0.6-0.599)*10**2 = 0.1 < 0.5 [True]
 (0.6-0.595)*10**2 = 0.5 = 0.5 [False]
 (0.6-0.590)*10**2 = 1   > 0.5 [False]

so 2 digits gives us a [0,0.005) absolute range tolerance [0, 0.5*1e-2)

and 3:

 (0.6-0.5999)*10**3 = 0.1 < 0.5 [True]
 (0.6-0.5995)*10**3 = 0.5 = 0.5 [False]
 (0.6-0.5990)*10**3 = 1   > 0.5 [False]

so 3 digits gives us a [0,0.0005) absolute range tolerance [0, 0.5*1e-3)

and so n digits gives us [0, 0.5*1e-n) absolute range tolerance.

So the description should probably use code instead of words:

assert (abs(df2 - df1) * 10**n < 0.5).all().all(), f"frames difference is equal to or more than {0.5 * 10**-n}"

I hope I didn't miss a zero somewhere.

Except it doesn't seem to be the right function, since if I now apply this same logic to the original failing test to emulate check_less_precise=3:

import pandas as pd
import pandas.testing
df1 = pd.DataFrame([0.00016,                -0.154526,            -0.20580199999999998])
df2 = pd.DataFrame([0.00015981824253685772, -0.15452557802200317, -0.20580188930034637])
df3 = abs(df1.subtract(df2))*10**3
df3
#pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=3)

I get:

0	0.000182
1	0.000422
2	0.000111

none of which is >0.5, i.e. it shouldn't assert.

It should assert with check_less_precise=7 or higher, so somewhere 4 decimal places are lost, as it starts asserting with n=3, instead of n=7.

df3 = abs(df1.subtract(df2))*10**6 < 0.5

0	True
1	True
2	True

df3 = abs(df1.subtract(df2))*10**7 < 0.5

0	False
1	False
2	False

So it's not a question of 0.5 vs 1.5, but 1 vs 10000.
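The same arithmetic can be checked without pandas. This is a minimal sketch of the plain difference test only (the helper name `all_within` is made up here); it does not reproduce whatever `assert_frame_equal` actually does internally.

```python
left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

def all_within(n):
    """True if every |l - r| is below 0.5 * 10**-n."""
    return all(abs(l - r) * 10**n < 0.5 for l, r in zip(left, right))

# Largest n in 1..10 for which the plain difference test still passes.
largest_passing_n = max(n for n in range(1, 11) if all_within(n))
print(largest_passing_n)
```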

Finally, a sanity check of the same numbers with numpy:

import numpy as np
import numpy.testing
np.testing.assert_array_almost_equal([.00016,                 -0.154526,            -0.20580199999999998],
                                     [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637],
                                     decimal=6)

doesn't fail, with decimal=7 it does - as expected.

zachlipp · 2019-04-22T20:13:51Z

I just hit this problem. Unsure if this is still on anyone's radar, but it was pretty surprising for me. I also used numpy functions (np.isclose instead of np.testing.assert_array_almost_equal, which I'll move to in the future) to get around it.

If there is interest in updating this parameter, it seems to me like @kinow's suggestion of using these numpy functions is a good path forward, though I'm far from an expert on this.

equal_tsv calls equal_dataframes. equal_dataframes compares non-float columns for exact equality. equal_dataframes converts float columns to numpy arrays and compares for equality within a given tolerance using numpy.allclose. This is used instead of pandas.testing.assert_frame_equal as there is an issue with how that function handles precision (see [pandas.testing.assert_frame_equal doesn't do precision according to the doc #25068](pandas-dev/pandas#25068)). "NAN" values in float columns are considered to be equal.

s-mariani · 2019-09-03T07:12:26Z

Dear everybody, any update on this? I'm trying to compare only 2 decimals but it seems it still checks 3...

usmcamp0811 · 2019-11-21T15:02:40Z

I just ran into this error with some code I am writing. I have the check_less and check_exact arguments set but still get an assertion error. The message prints the same numbers out to the maximum print width of 15 decimal places.

kinow · 2019-12-03T02:01:23Z

Hi @stas00

thank you for digging up the code, @kinow! So the description of the functionality needs to be improved - numpy's version is indeed much better explained.

+1

Except it doesn't seem to be the right function, since

I'm also starting to think that that function may not be the best fit for what is documented in assert_frame_equal. Here are other ways to trigger the error.

import pandas as pd
df1 = pd.DataFrame([0.15])
df2 = pd.DataFrame([0.16])
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)

Or

import pandas as pd
df1 = pd.DataFrame([0.099999])
df2 = pd.DataFrame([0.09])  # 0.099 will pass
pd.testing.assert_frame_equal(df1, df2, check_exact=False, check_less_precise=1)

The function I mentioned before, is not actually called with these values.

pandas/pandas/_libs/testing.pyx

Lines 206 to 215 in 0ffee8b

    
# case for zero
if abs(fa) < 1e-5:
    if not decimal_almost_equal(fa, fb, decimal):
        assert False, (f'(very low values) expected {fb:.5f} '
                       f'but got {fa:.5f}, with decimal {decimal}')
else:
    if not decimal_almost_equal(1, fb / fa, decimal):
        assert False, (f'expected {fb:.5f} but got {fa:.5f}, '
                       f'with decimal {decimal}')
return True

So for a=0.15, b=0.16, and decimal=1, the check abs(0.15) < 1e-5 does not pass, and we end up in the else block. We then have:

if not decimal_almost_equal(1, fb / fa, decimal):
# or
if not decimal_almost_equal(1, 0.16 / 0.15, 1):
# or
if not decimal_almost_equal(1, 1.0667, 1):

# which will be
abs(desired - actual) < (0.5 * 10.0 ** -decimal)
# solving it
abs(1 - 1.0667) < (0.05)
0.0667 < 0.05

In this case, the ratio is not close enough. So the function is failing. However, the callee function was supposed to compare based on the digits after the decimal. So if decimal=1, from what I understand, it should get 0.15 and 0.16, and compare only 0.1 == 0.1, i.e. using only 1 decimal.

If instead of the ratio, we use the function directly with decimal_almost_equal(0.15, 0.16, 1), then it will work OK.
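For experimenting outside Cython, the branch quoted above can be transcribed to plain Python. This is a rough sketch: `scalar_almost_equal` is a made-up name, and it returns a boolean instead of raising.

```python
def decimal_almost_equal(desired, actual, decimal):
    """Absolute-difference test, as in the quoted Cython helper."""
    return abs(desired - actual) < 0.5 * 10.0 ** -decimal

def scalar_almost_equal(fa, fb, decimal):
    """Plain-Python transcription of the quoted branch (hypothetical name)."""
    if abs(fa) < 1e-5:
        # Values very close to zero: compare the difference directly.
        return decimal_almost_equal(fa, fb, decimal)
    # Otherwise compare the ratio fb / fa against 1.
    return decimal_almost_equal(1, fb / fa, decimal)

ratio_check = scalar_almost_equal(0.15, 0.16, 1)    # ratio branch
direct_check = decimal_almost_equal(0.15, 0.16, 1)  # difference only

# First pair from the original report, at 3 and at 2 digits.
op_at_3 = scalar_almost_equal(0.00016, 0.00015981824253685772, 3)
op_at_2 = scalar_almost_equal(0.00016, 0.00015981824253685772, 2)

print(ratio_check, direct_check, op_at_3, op_at_2)
```

Under this sketch the first pair from the original report fails at 3 digits and passes at 2, which matches the behaviour described at the top of the thread: the ratio test acts like a relative tolerance, so small values are held to a much tighter absolute difference.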

However, if we take the pair 0.099999 and 0.01, with decimal=1:

abs(a - b) < (0.5 * 10.0 ** -decimal)
abs(0.099999 - 0.01) < 0.05
0.08999900000000001 < 0.05

Still fails. Looks like decimal_almost_equal is not the right function for the comparison? I have a working function in my notebook, but it is using the simplest approach, that truncates the value instead of comparing differences, ratios, etc. Will prepare a PR soon for discussion 👍

Not super confident that that is the proper solution though, so happy if others chime in with their suggestions.

kinow · 2019-12-03T04:24:26Z

Hmm, maybe I spoke too fast.

This commit has a unit test with the examples discussed here: kinow@f45be0e

The test passes, but several other tests fail. For example,

# test_timeseries.test_pct_change_shift_over_nas
    def test_pct_change_shift_over_nas(self):
        s = Series([1.0, 1.5, np.nan, 2.5, 3.0])

        chg = s.pct_change()
        expected = Series([np.nan, 0.5, 0.0, 2.5 / 1.5 - 1, 0.2])
        tm.assert_series_equal(chg, expected)

Fails with

E   AssertionError: Series are different
E   
E   Series values are different (20.0 %)
E   [left]:  [nan, 0.5, 0.0, 0.6666666666666667, 0.19999999999999996]
E   [right]: [nan, 0.5, 0.0, 0.6666666666666667, 0.2]

The values that fail are 0.19999999999999996 and 0.2 (and check_less_precise=False, so decimal=5). Not sure if it is following what's in the docs - maybe we just need to update the docs after all?

check_less_precise : bool or int, default False
Specify comparison precision. Only used when check_exact is False.
5 digits (False) or 3 digits (True) after decimal points are compared.
If int, then specify the digits to compare.

This part is the most confusing for me: "digits (...) after decimal points are compared". If we have 5 digits, and 0.19999999999999996 and 0.2, the parts after the decimal points are 19999999999999996, and 2. Assuming we are to use only the 5 digits, then 19999 and 2 would be compared?
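The two readings of "digits after the decimal point" can be compared side by side. The sketch below is only an illustration; `truncate` is a hypothetical helper built on the standard `decimal` module, not anything pandas provides.

```python
from decimal import Decimal, ROUND_DOWN

def truncate(x, digits):
    """Drop everything past `digits` decimal places (no rounding)."""
    quantum = Decimal(1).scaleb(-digits)  # 1E-5 for digits=5
    return Decimal(repr(x)).quantize(quantum, rounding=ROUND_DOWN)

a, b = 0.19999999999999996, 0.2

# Reading 1: keep only the first 5 digits of each number, then compare.
same_after_truncation = truncate(a, 5) == truncate(b, 5)

# Reading 2: require the difference to be below half a unit in the 5th place.
within_half_unit = abs(a - b) < 0.5 * 10.0 ** -5

print(truncate(a, 5), truncate(b, 5), same_after_truncation, within_half_unit)
```

Truncation treats the two values as different even though they differ only by floating-point noise, which is why a literal digit-by-digit reading breaks tests like the one above.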


loikein · 2020-03-07T23:17:46Z

Any updates? It's been several updates, but the problem seems to persist.

wudstrand · 2020-03-12T22:25:46Z

Any updates?

mzeitlin11 · 2020-12-24T19:59:34Z

Looks fixed by #30562. Example from OP now does not raise (and also check_less_precise deprecated in favor of rtol and atol anyway).
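For readers unfamiliar with the two tolerances, here is a sketch of how a relative and an absolute check treat the numbers from the original report. It uses `math.isclose` from the standard library as a stand-in, and assumes that an absolute tolerance of 0.5 * 10**-n (the range derived earlier in the thread) is what one wants; the exact rule pandas uses to combine `rtol` and `atol` may differ in detail.

```python
import math

left = [0.00016, -0.154526, -0.20580199999999998]
right = [0.00015981824253685772, -0.15452557802200317, -0.20580188930034637]

def all_close(rel_tol, abs_tol):
    """True if every pair is within the given tolerances (math.isclose rule)."""
    return all(
        math.isclose(l, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for l, r in zip(left, right)
    )

# Absolute tolerance matching "3 digits" in the sense derived in the thread.
passes_abs_3_digits = all_close(rel_tol=0.0, abs_tol=0.5 * 10**-3)

# Purely relative tolerance of 1e-3: the first pair differs by about 0.11%.
passes_rel_1e_3 = all_close(rel_tol=1e-3, abs_tol=0.0)

print(passes_abs_3_digits, passes_rel_1e_3)
```

With pandas 1.1 or later, the corresponding values would be passed to `pd.testing.assert_frame_equal` through its `rtol` and `atol` arguments.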

stas00 · 2020-12-24T20:08:51Z

Thank you for tracking that, @mzeitlin11!

I verified that with the current ver==1.1.5 if I replace check_less_precise with rtol=3 it works as expected.

Awesome!

WillAyd added the Bug and Testing (pandas testing functions or related to the test suite) labels Feb 3, 2019

WillAyd added this to the Contributions Welcome milestone Feb 3, 2019

kinow added a commit to kinow/pandas that referenced this issue Feb 3, 2019: "Fix pandas-dev#25068 using the same constant as in numpy for decimal_almost_equal" (7621398)

kinow added a commit to kinow/pandas that referenced this issue Feb 3, 2019: "Fix pandas-dev#25068 using the same constant as in numpy for decimal_almost_equal" (8e173e4)

mzeitlin11 added the Closing Candidate (May be closeable, needs more eyeballs) label and removed the Bug label Dec 24, 2020

stas00 closed this as completed Dec 24, 2020
