DataFrame with Int64 columns casts to float64 with .max()/.min() #32651

Closed
qwhelan opened this issue Mar 12, 2020 · 3 comments
Labels: Bug; NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

@qwhelan
Contributor

qwhelan commented Mar 12, 2020

Code Sample, a copy-pastable example if possible

```python
import pandas as pd
import numpy as np

int64_info = np.iinfo("int64")
s = pd.Series([int64_info.max, None, int64_info.min], dtype=pd.Int64Dtype())
df = pd.DataFrame({"Int64": s})

df.max()
# Int64    9.223372e+18
# dtype: float64
```

Problem description

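Data of the nullable Int64 dtype (pd.Int64Dtype()) is converted to np.float64 by certain reduction operations on pd.DataFrame, such as .max() and .min(). This silently corrupts data, because float64 cannot represent every int64 value exactly, and avoiding that loss is exactly what the Int64 dtype is intended for.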
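
To make the precision loss concrete, here is a minimal sketch in plain Python (an illustration, not part of the original report). It assumes only that Python's built-in `float` and `np.float64` are both IEEE 754 doubles with a 53-bit significand, so integers near the int64 limits cannot all be represented exactly.

```python
# Illustration only: Python's float is an IEEE 754 double, like np.float64.
INT64_MAX = 2**63 - 1  # same value as np.iinfo("int64").max

# Round-tripping through float rounds to the nearest representable value.
as_float = float(INT64_MAX)
round_tripped = int(as_float)

print(INT64_MAX)                   # 9223372036854775807
print(round_tripped)               # 9223372036854775808
print(round_tripped == INT64_MAX)  # False

# Distinct large integers collapse onto the same float.
print(float(INT64_MAX) == float(INT64_MAX - 1))  # True
```

The value reported by `df.max()` above, 9.223372e+18, is this rounded float rather than the original integer.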

Expected Output

df.max() should probably return a pd.Series of dtype='object' wrapping a pd.Int64 value.

Output of pd.show_versions()

<details>

```
INSTALLED VERSIONS
------------------
commit : 27ad779
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-29-generic
Version : #31-Ubuntu SMP Fri Jan 17 17:27:26 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+779.g27ad77971
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.5
hypothesis : 5.4.1
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.4.0.dev0+62.g8ac3a4c8
fastparquet : 0.3.2
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
numba : 0.48.0
```

</details>

@jorisvandenbossche
Member

This is expected for now (though no less wrong, of course) given how the reduction is implemented: the values are converted to float first. This will be solved by something like #30982, but then for min/max.
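
The difference between reducing via float and reducing on the integers directly can be sketched as follows. This is a simplified illustration in plain Python, not the pandas implementation: `float_max` and `masked_max` are hypothetical helpers, and the values/mask pair only mimics the idea of a masked array, where a boolean mask marks the missing entries.

```python
from typing import Optional, Sequence

INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def float_max(values: Sequence[int], mask: Sequence[bool]) -> float:
    """Reduce after converting to float, with NaN standing in for missing values."""
    as_float = [float("nan") if missing else float(v) for v, missing in zip(values, mask)]
    # Skip NaN entries (NaN != NaN), then take the max of the remaining floats.
    return max(x for x in as_float if x == x)


def masked_max(values: Sequence[int], mask: Sequence[bool]) -> Optional[int]:
    """Reduce on the integers directly, skipping masked (missing) entries."""
    valid = [v for v, missing in zip(values, mask) if not missing]
    return max(valid) if valid else None


# Mirrors the Series in the report: [int64 max, missing, int64 min].
values = [INT64_MAX, 0, INT64_MIN]  # the 0 is a placeholder hidden by the mask
mask = [False, True, False]

print(float_max(values, mask))   # 9.223372036854776e+18
print(masked_max(values, mask))  # 9223372036854775807
```

Because the masked version never leaves integer arithmetic, it returns the exact maximum, whereas the float path returns a value rounded up to 2**63.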

@qwhelan
Contributor Author

qwhelan commented Mar 12, 2020

@jorisvandenbossche Thanks for the confirmation and the pointer. I put up a PR that's still a bit of a work in progress, but I think I could probably get it working over the weekend.

@simonjayhawkins
Member

Fixed in #35254.
