
weird NaN in mean() of float16 series #20642

Closed
ubyjvovk opened this issue Apr 9, 2018 · 10 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Duplicate Report (Duplicate issue or pull request)

Comments

@ubyjvovk

ubyjvovk commented Apr 9, 2018

I have a shuffled series with a bunch of sine values in float16, like this:

   tdata.time_sin
   110405276   -0.183105
   175560878   -0.301270
   ...
   130331292   -0.158813
   6782127     -0.282471
    Name: time_sin, Length: 18490389, dtype: float16

There are no NaN values; everything is the sine of something:

tdata.time_sin[np.isnan(tdata.time_sin) == True].count()
0

But for some reason, mean() chokes somewhere in the middle, as if it's overflowing:


tdata.time_sin.mean()
nan

tdata.time_sin[:328720].mean()
0.0

tdata.time_sin[:328721].mean()
nan

tdata.time_sin[328719:328722]
117467643   -0.639648
85318746     0.956055
10829780     0.112000
Name: time_sin, dtype: float16

And it works fine when converted to float32:

foo = tdata.time_sin.astype(np.float32)
foo.mean()

0.20143597

Is this weird or am I missing something about float16?

This behavior persists after pickling, loading, and sorting by index, although it now chokes much earlier:

zzz = pickle.load(open('timesin.pkl', 'rb'))
bb = zzz.sort_index()

bb[:74351].mean()
-0.0

bb[:74352].mean()
nan

bb[74350:74355]
749371   -0.898438
749393   -0.898438
749432   -0.898438
749447   -0.898438
749479   -0.898438
Name: time_sin, dtype: float16
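For reference, the breaking points are consistent with a float16 overflow: a running sum of roughly 0.2 × 328,721 ≈ 65,700 (and, in the sorted case, roughly -0.898 × 74,352 ≈ -66,800) just passes the largest finite float16 value, 65,504. A minimal numpy-only sketch with synthetic data, not the series from this report:

import numpy as np

# Synthetic stand-in for the sine column: 1M values of ~0.2 in float16.
x = np.full(1000000, 0.2, dtype=np.float16)

print(np.finfo(np.float16).max)    # 65504.0, the largest finite float16
print(x.sum())                     # inf: the float16 running sum overflows
print(x.sum(dtype=np.float32))     # ~199951.2: fine once accumulated in a wider dtype

The 0.0 and -0.0 results just before the breaking points, and nan right after, would also be consistent with the element count being cast to float16 on this version (finite / inf gives 0.0 and inf / inf gives nan), but that is an implementation detail.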

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Apr 9, 2018

float16 is barely supported.

You can have a look to improve things.

@ubyjvovk
Author

ubyjvovk commented Apr 9, 2018

Well, maybe this explains why my models are not training very well :)

@ubyjvovk
Author

ubyjvovk commented Apr 9, 2018

I loaded this pickle on another machine, and the issue repeats exactly.

@jreback
Contributor

jreback commented Apr 9, 2018

float32 is quite well supported

@jreback
Contributor

jreback commented Apr 9, 2018

closing as duplicate of #9220

@jreback closed this as completed Apr 9, 2018
@jreback added the Dtype Conversions and Duplicate Report labels Apr 9, 2018
@jreback added this to the No action milestone Apr 9, 2018
@MichaelYin1994

I met the same problem when trying to reduce the memory usage of a DataFrame according to its data types.
The mean of an np.float16 column was NaN. After I switched the data type to np.float32, the problem was solved.
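For the memory-reduction use case, a hedged sketch of a downcasting helper (the function name is made up): pd.to_numeric with downcast="float" stops at float32, which sidesteps exactly this problem.

import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast float columns to the smallest safe float dtype. pandas'
    # downcast="float" never goes below float32, so reductions like mean()
    # keep working, unlike a manual cast to float16.
    out = df.copy()
    for col in out.columns:
        if out[col].dtype.kind == "f":
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out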

@jfpuget

jfpuget commented Feb 19, 2019

You have an overflow. Take the mean over a scaled column, (df[col] / n).mean() * n, where n is large enough.

To know how large n needs to be, you can compute the sum of the column once cast to float32 and compare it to the largest float16.
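A sketch of that workaround on a synthetic float16 series; the values, length, and n below are made up to force the overflow, not taken from this issue:

import numpy as np
import pandas as pd

s = pd.Series(np.full(60000, 2.0, dtype=np.float16))

print(s.mean())                    # inf or nan once the float16 sum passes 65504

# Pre-scale so the running sum stays far below the float16 maximum;
# a power of two keeps the division exact.
n = 16
print((s / n).mean() * n)          # ~2.0

# Checking how large n needs to be: the true sum in float32 vs the float16 ceiling.
print(s.astype(np.float32).sum(), np.finfo(np.float16).max)   # 120000.0 vs 65504.0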

@Kaushal28

@jfpuget You are correct! I was using float16, and when computing the mean, the sum of all the observations was out of range for float16. I changed the type to float64 and it's working. Thanks!

@snknitin

I spent 5 hours on this! 🤦🏻‍♂️
I realized that df.describe() was giving NaN for the mean and std even though the NaN values had already been replaced with zero: once with the median, once with df.fillna(0.0, inplace=True), and even with forced assignments at (row, column) = 0. Checking the previously-NaN indices in the same columns showed the values had changed to 0, and the count in describe() increased by the number of NaNs, but the mean and std were still NaN. After going through a whole cascade of tests, I realized this was a float16 column 😭

@Sonyoyo

Sonyoyo commented Oct 14, 2022

I had the same problem. Presumably x.mean() is defined as x.sum() / x.count(), with x.sum() overflowing to inf. By computing (x / x.count()).sum() instead, I resolved my issue. But depending on your data, x / x.count() may also underflow to zeros and give a poor estimate of the average (not my case, after comparing with the float32 data); a small illustration of that underflow follows the quote below.

Probably the best approach is to use an intermediate normalization, as proposed by @jfpuget:

You have an overflow. Take the mean over a scaled column, (df[col] / n).mean() * n, where n is large enough.

To know how large n needs to be, you can compute the sum of the column once cast to float32 and compare it to the largest float16.
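To illustrate the underflow concern with dividing by the count first: float16 cannot represent anything much below about 6e-8, so value / count can round to exactly zero when the count is large. The numbers here are illustrative only:

import numpy as np

v = np.float16(1e-3)
count = np.float16(50000)           # counts above 65504 are not even representable in float16

print(v / count)                            # 0.0: the per-element share underflows in float16
print(np.float32(v) / np.float32(count))    # ~2e-8: the true per-element share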
