
Int64 with null value mangles large-ish integers #30268

Closed
craigdsthompson opened this issue Dec 13, 2019 · 6 comments
Labels: Bug, ExtensionArray (Extending pandas with custom dtypes or arrays), NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

@craigdsthompson

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

x = 9999999999999999
y = 123123123123123123
z = 10000000000000543
s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
s[3] == x  # True
s
s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
s2[3] == x  # False
s2
np.iinfo(np.int64).max

With interpreter output:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x  # True
True
>>> s
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x  # False
False
>>> s2
0                     1
1                     2
2                     3
3     10000000000000000
4    123123123123123120
5     10000000000000544
6                   NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807

Problem description

It seems that the presence of np.nan values in a column typed as Int64 causes some non-null values to be mangled. This happens with large-ish values that are still well below the int64 maximum.

Given that Int64 is the "Nullable integer" data type, null values should be allowed, and should certainly not silently change the values of other elements in the data frame.
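
The mangled values match exactly what a round trip through float64 produces, which suggests an intermediate float cast somewhere (a quick check, using only numpy):

>>> import numpy as np
>>> # round-tripping each value through float64 reproduces the mangled output
>>> [int(np.float64(v)) for v in (9999999999999999, 123123123123123123, 10000000000000543)]
[10000000000000000, 123123123123123120, 10000000000000544]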

Expected Output

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x  # True
True
>>> s
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x  # True (was False above)
True
>>> s2
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
6                   NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.0.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 10.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@WillAyd
Member

WillAyd commented Dec 14, 2019

Thanks for the report. Must be an unwanted float cast in the mix as precision looks to get lost around the same range:

>>> x = 2 ** 53
>>> y = 2 ** 53 + 1
>>> z = 2 ** 53 + 2
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2
0                   1
1                   2
2                   3
3    9007199254740992
4    9007199254740992
5    9007199254740994
6                 NaN
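
That cutoff is no accident: float64 has a 53-bit significand, so 2 ** 53 is the last point at which every integer is exactly representable. A quick check, in plain Python:

>>> # 2**53 + 1 is not representable as float64 and rounds back down
>>> float(2 ** 53) == float(2 ** 53 + 1)
True
>>> float(2 ** 53 + 2) == 2 ** 53 + 2  # nearby even values are still exact
True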

Investigation and PRs certainly welcome

WillAyd added the ExtensionArray and Bug labels on Dec 14, 2019
WillAyd added this to the Contributions Welcome milestone on Dec 14, 2019
@jorisvandenbossche
Member

This is caused by converting the list to a numpy array and letting numpy do type inference. Because the list contains np.nan, numpy creates a float array, and that is what causes the precision loss.
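
A minimal illustration of that inference step, using plain numpy:

>>> import numpy as np
>>> np.array([9999999999999999, np.nan]).dtype  # nan forces float64 inference
dtype('float64')
>>> np.array([9999999999999999, np.nan])[0]     # precision is already lost here
1e+16
>>> np.array([9999999999999999, np.nan], dtype=object)[0]  # object dtype keeps the exact int
9999999999999999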

With a quick hack like:

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -205,7 +205,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
             mask = mask.copy()
         return values, mask
 
-    values = np.array(values, copy=copy)
+    if isinstance(values, list):
+        values = np.array(values, dtype=object)
+    else:
+        values = np.array(values, copy=copy)
     if is_object_dtype(values):
         inferred_type = lib.infer_dtype(values, skipna=True)
         if inferred_type == "empty":

I get the correct behaviour (by ensuring we convert the list to an object array, so we don't lose precision due to the intermediate float array, and we can do the conversion to an integer array ourselves).

For a proper fix, the if/else check will need to be a bit more advanced. I think we should basically check whether the input values are already an ndarray or have a dtype; if so, keep that dtype, and otherwise convert to object dtype.
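
Something along those lines (an untested sketch of that check, not a final implementation):

if hasattr(values, "dtype"):
    # ndarray or other dtype-aware input: keep its dtype
    values = np.array(values, copy=copy)
else:
    # list-likes: go through object dtype so large ints keep full precision
    values = np.array(values, dtype=object)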

Contributions certainly welcome!

@rushabh-v
Contributor

take

@fcollman

Is this behavior caused by the same bug? Here there are no NaNs involved, so casting was not even needed:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id":[864691135341199281, 864691135341199281]}, dtype='Int64')
>>> df['cid']=[49912377,49912377]
>>> print(df)
                   id       cid
0  864691135341199281  49912377
1  864691135341199281  49912377
>>> print(df.groupby('cid').first())
                          id
cid                         
49912377  864691135341199232
>>> 
>>> 
>>> print(df.groupby('cid').first().id.dtype, df.id.dtype)
Int64 Int64
>>> 
>>> import numpy as np
>>> print(np.int64(np.float64(864691135341199281)))
864691135341199232

@rushabh-v
Contributor

#31108 is the reason!

@phofl
Member

phofl commented Jan 18, 2023

Fixed in #50757
