
weird NaN in mean() of float16 series #20642

Closed
ubyjvovk opened this issue Apr 9, 2018 · 10 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Duplicate Report (Duplicate issue or pull request)

Comments

@ubyjvovk

ubyjvovk commented Apr 9, 2018

I have a shuffled series with a bunch of sine values in float16, like this:

   tdata.time_sin
   110405276   -0.183105
   175560878   -0.301270
   ...
   130331292   -0.158813
   6782127     -0.282471
    Name: time_sin, Length: 18490389, dtype: float16

There are no NaN values; everything is the sine of something:

tdata.time_sin[np.isnan(tdata.time_sin) == True].count()
0

But for some reason, mean() chokes somewhere in the middle, as if it's overflowing:


tdata.time_sin.mean()
nan

tdata.time_sin[:328720].mean()
0.0

tdata.time_sin[:328721].mean()
nan

tdata.time_sin[328719:328722]
117467643   -0.639648
85318746     0.956055
10829780     0.112000
Name: time_sin, dtype: float16

And it works fine when converted to float32:

foo = tdata.time_sin.astype(np.float32)
foo.mean()

0.20143597

Is this weird or am I missing something about float16?

This behavior persists after pickling, loading, and sorting by index, although it now chokes much earlier:

zzz = pickle.load(open('timesin.pkl', 'rb'))
bb = zzz.sort_index()

bb[:74351].mean()
-0.0

bb[:74352].mean()
nan

bb[74350:74355]
749371   -0.898438
749393   -0.898438
749432   -0.898438
749447   -0.898438
749479   -0.898438
Name: time_sin, dtype: float16
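For reference, the breaking points are consistent with a float16 overflow: a running sum of roughly 0.2 × 328,721 ≈ 65,700 (and, in the sorted case, roughly -0.898 × 74,352 ≈ -66,800) just passes the largest finite float16 value, 65,504. A minimal numpy-only sketch with synthetic data, not the series from this report:

import numpy as np

# Synthetic stand-in for the sine column: 1M values of ~0.2 in float16.
x = np.full(1000000, 0.2, dtype=np.float16)

print(np.finfo(np.float16).max)    # 65504.0, the largest finite float16
print(x.sum())                     # inf: the float16 running sum overflows
print(x.sum(dtype=np.float32))     # ~199951.2: fine once accumulated in a wider dtype

The 0.0 and -0.0 results just before the breaking points, and nan right after, would also be consistent with the element count being cast to float16 on this version (finite / inf gives 0.0 and inf / inf gives nan), but that is an implementation detail.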

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Apr 9, 2018

float16 is barely supported.

You can have a look to improve things.

@ubyjvovk
Author

ubyjvovk commented Apr 9, 2018

Well, maybe this explains why my models are not training very well :)

@ubyjvovk
Author

ubyjvovk commented Apr 9, 2018

I loaded this pickle on another machine, and the issue repeats exactly.

@jreback
Contributor

jreback commented Apr 9, 2018

float32 is quite well supported

@jreback
Contributor

jreback commented Apr 9, 2018

closing as duplicate of #9220

@jreback closed this as completed Apr 9, 2018
@jreback added the Dtype Conversions and Duplicate Report labels Apr 9, 2018
@jreback added this to the No action milestone Apr 9, 2018
@MichaelYin1994

I met the same problem when trying to reduce the memory usage of a DataFrame according to its data types.
The mean of an np.float16 column was NaN. After I switched the data type to np.float32, the problem was solved.
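For the memory-reduction use case, a hedged sketch of a downcasting helper (the function name is made up): pd.to_numeric with downcast="float" stops at float32, which sidesteps exactly this problem.

import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast float columns to the smallest safe float dtype. pandas'
    # downcast="float" never goes below float32, so reductions like mean()
    # keep working, unlike a manual cast to float16.
    out = df.copy()
    for col in out.columns:
        if out[col].dtype.kind == "f":
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out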

@jfpuget

jfpuget commented Feb 19, 2019

You have an overflow. Take the mean over a scaled column, (df[col] / n).mean() * n, where n is large enough.

To know how large n needs to be, you can compute the sum of the column once cast to float32 and compare it to the largest float16.
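A sketch of that workaround on a synthetic float16 series; the values, length, and n below are made up to force the overflow, not taken from this issue:

import numpy as np
import pandas as pd

s = pd.Series(np.full(60000, 2.0, dtype=np.float16))

print(s.mean())                    # inf or nan once the float16 sum passes 65504

# Pre-scale so the running sum stays far below the float16 maximum;
# a power of two keeps the division exact.
n = 16
print((s / n).mean() * n)          # ~2.0

# Checking how large n needs to be: the true sum in float32 vs the float16 ceiling.
print(s.astype(np.float32).sum(), np.finfo(np.float16).max)   # 120000.0 vs 65504.0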

@Kaushal28

@jfpuget You are correct! I was using float16, and when computing the mean, the sum of all the observations was out of range for float16. I changed the type to float64 and it's working. Thanks!

@snknitin

I spent 5 hours on this! 🤦🏻‍♂️
I realized that df.describe() was giving NaN for the mean and std even though the NaN values had already been replaced with zero: once with the median, once with df.fillna(0.0, inplace=True), and even with forced assignments at (row, column) = 0. Checking the previously-NaN indices in the same columns showed the values had changed to 0, and the count in describe() increased by the number of NaNs, but the mean and std were still NaN. After going through a whole cascade of tests, I realized this was a float16 column 😭

@Sonyoyo

Sonyoyo commented Oct 14, 2022

I had the same problem. Presumably x.mean() is defined as x.sum() / x.count(), with x.sum() overflowing to inf. By computing (x / x.count()).sum() instead, I resolved my issue. But depending on your data, x / x.count() may also underflow to zeros and give a poor estimate of the average (not my case, after comparing with the float32 data); a small illustration of that underflow follows the quote below.

Probably the best approach is to use an intermediate normalization, as proposed by @jfpuget:

You have an overflow. Take the mean over a scaled column, (df[col] / n).mean() * n, where n is large enough.

To know how large n needs to be, you can compute the sum of the column once cast to float32 and compare it to the largest float16.
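To illustrate the underflow concern with dividing by the count first: float16 cannot represent anything much below about 6e-8, so value / count can round to exactly zero when the count is large. The numbers here are illustrative only:

import numpy as np

v = np.float16(1e-3)
count = np.float16(50000)           # counts above 65504 are not even representable in float16

print(v / count)                            # 0.0: the per-element share underflows in float16
print(np.float32(v) / np.float32(count))    # ~2e-8: the true per-element share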
