
Int64 with null value mangles large-ish integers #30268

Closed
craigdsthompson opened this issue Dec 13, 2019 · 6 comments
Labels: Bug, ExtensionArray (Extending pandas with custom dtypes or arrays), NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

@craigdsthompson

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

x = 9999999999999999
y = 123123123123123123
z = 10000000000000543
s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
s[3] == x  # True
s
s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
s2[3] == x  # False
s2
np.iinfo(np.int64).max

With interpreter output:

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x  # True
True
>>> s
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x  # False
False
>>> s2
0                     1
1                     2
2                     3
3     10000000000000000
4    123123123123123120
5     10000000000000544
6                   NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807

Problem description

It seems that the presence of np.nan values in a column typed as Int64 causes some non-null values to be mangled. This happens with large-ish values that are still well below the int64 maximum.

Given that Int64 is the "Nullable integer" data type, null values should be allowed, and should certainly not silently change the values of other elements in the data frame.
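
The mangled values match exactly what a round trip through float64 produces, which suggests an intermediate float cast somewhere (a quick check, using only numpy):

>>> import numpy as np
>>> # round-tripping each value through float64 reproduces the mangled output
>>> [int(np.float64(v)) for v in (9999999999999999, 123123123123123123, 10000000000000543)]
[10000000000000000, 123123123123123120, 10000000000000544]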

Expected Output

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x  # True
True
>>> s
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x  # True (was False above)
True
>>> s2
0                     1
1                     2
2                     3
3      9999999999999999
4    123123123123123123
5     10000000000000543
6                   NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.0.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 10.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@WillAyd
Member

WillAyd commented Dec 14, 2019

Thanks for the report. Must be an unwanted float cast in the mix as precision looks to get lost around the same range:

>>> x = 2 ** 53
>>> y = 2 ** 53 + 1
>>> z = 2 ** 53 + 2
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2
0                   1
1                   2
2                   3
3    9007199254740992
4    9007199254740992
5    9007199254740994
6                 NaN
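
That cutoff is no accident: float64 has a 53-bit significand, so 2 ** 53 is the last point at which every integer is exactly representable. A quick check, in plain Python:

>>> # 2**53 + 1 is not representable as float64 and rounds back down
>>> float(2 ** 53) == float(2 ** 53 + 1)
True
>>> float(2 ** 53 + 2) == 2 ** 53 + 2  # nearby even values are still exact
True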

Investigation and PRs certainly welcome

WillAyd added the ExtensionArray and Bug labels on Dec 14, 2019
WillAyd added this to the Contributions Welcome milestone on Dec 14, 2019
@jorisvandenbossche
Member

This is caused by converting the list to a numpy array and letting numpy do type inference. Because the list contains np.nan, numpy creates a float array, and that is what causes the precision loss.
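
A minimal illustration of that inference step, using plain numpy:

>>> import numpy as np
>>> np.array([9999999999999999, np.nan]).dtype  # nan forces float64 inference
dtype('float64')
>>> np.array([9999999999999999, np.nan])[0]     # precision is already lost here
1e+16
>>> np.array([9999999999999999, np.nan], dtype=object)[0]  # object dtype keeps the exact int
9999999999999999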

With a quick hack like:

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -205,7 +205,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
             mask = mask.copy()
         return values, mask
 
-    values = np.array(values, copy=copy)
+    if isinstance(values, list):
+        values = np.array(values, dtype=object)
+    else:
+        values = np.array(values, copy=copy)
     if is_object_dtype(values):
         inferred_type = lib.infer_dtype(values, skipna=True)
         if inferred_type == "empty":

I get the correct behaviour (by ensuring we convert the list to an object array, so we don't lose precision due to the intermediate float array, and we can do the conversion to an integer array ourselves).

For a proper fix, the if/else check will need to be a bit more advanced. I think we should basically check whether the input values are already an ndarray or have a dtype; if so, keep that dtype, and otherwise convert to object dtype.
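
Something along those lines (an untested sketch of that check, not a final implementation):

if hasattr(values, "dtype"):
    # ndarray or other dtype-aware input: keep its dtype
    values = np.array(values, copy=copy)
else:
    # list-likes: go through object dtype so large ints keep full precision
    values = np.array(values, dtype=object)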

Contributions certainly welcome!

@rushabh-v
Contributor

take

@fcollman

Is this behavior caused by the same bug? Here there are no NaNs involved, so casting was not even needed:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id":[864691135341199281, 864691135341199281]}, dtype='Int64')
>>> df['cid']=[49912377,49912377]
>>> print(df)
                   id       cid
0  864691135341199281  49912377
1  864691135341199281  49912377
>>> print(df.groupby('cid').first())
                          id
cid                         
49912377  864691135341199232
>>> 
>>> 
>>> print(df.groupby('cid').first().id.dtype, df.id.dtype)
Int64 Int64
>>> 
>>> import numpy as np
>>> print(np.int64(np.float64(864691135341199281)))
864691135341199232

@rushabh-v
Contributor

#31108 is the reason!

@phofl
Member

phofl commented Jan 18, 2023

Fixed in #50757
