Overflow when running reductions on float16 columns in pandas Series #22841

jsleroux · 2018-09-26T18:58:14Z

When running reductions on dataframe columns of dtype float16, we ran into a surprising behaviour:

import pandas as pd
df = pd.DataFrame(data=[1000]*100, columns=['A'], dtype='float16')

mean = df['A'].mean()
sum = df['A'].sum()
print('Mean: {}, sum: {}'.format(mean, sum))

# Both of these should be true, but they're not, returning inf in pandas 0.23.4
assert mean == 1000
assert sum == 1000 * 100

After investigation, we found that the accumulator used in (for example) mean() and sum() is not big enough and eventually overflows. In both nansum() and nanmean() (to which sum() and mean() delegate their work), when the data has a float dtype, the accumulator is downcast to the original dtype of the data:

nansum (pandas/pandas/core/nanops.py, line 333 in af7b0ba):

    dtype_sum = dtype

nanmean (pandas/pandas/core/nanops.py, line 352 in af7b0ba):

    dtype_sum = dtype

In our case, because the original dtype is float16, the accumulator is downcast to float16, for example in nansum():

    dtype_sum = dtype_max
    if is_float_dtype(dtype):
        dtype_sum = dtype

(Similar code in nanmean.)
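
The overflow itself can be reproduced outside pandas. The following is a minimal sketch in plain numpy, not the actual nanops code: the explicit loop and the names `values`, `narrow` and `wide` are illustrative assumptions. It shows that a running total kept in float16 exceeds the largest finite float16 value (65504) long before reaching 100000, while a float64 accumulator does not.

```python
import numpy as np

# Same data as in the report: one hundred values of 1000 stored as float16.
values = np.full(100, 1000, dtype=np.float16)

# The largest finite float16 value; the true sum (100000) is above it.
assert np.finfo(np.float16).max == 65504.0

# Accumulate in float16 (illustrative loop, not the pandas implementation).
narrow = np.float16(0)
with np.errstate(over="ignore"):  # silence the expected overflow warning
    for v in values:
        narrow = narrow + v  # float16 + float16 stays float16

# Accumulate in float64 instead.
wide = np.float64(0)
for v in values:
    wide = wide + np.float64(v)

print("float16 accumulator:", narrow)
print("float64 accumulator:", wide)
```

Once the float16 total overflows it stays at inf, which is consistent with both sum() and mean() returning inf in the report.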

However, pandas did not always behave like that. The current behaviour was added in the following commits, to solve other bugs:

sum(): 73f25b1
mean(): 3896e5e

Also, for the "int" codepaths, the accumulator is never downcast and always set to float64. It is only for the "float" cases that the size of the accumulator is set to be the same as the dtype of the column.

We'd be willing to submit a pull request, but we're not sure what the best fix here would be. Should we just always have a float64 accumulator in these functions for the float cases, instead of downcasting it? If not, what would a good fix look like?

By the way, things are a bit different in numpy, with the accumulator being set to float64 in more cases, and with the option for users to specify the dtype of the accumulator (and at the same time the output). Having the same option in pandas would have allowed us to at least work around this, by requesting a float64 accumulator. Was there a decision made in pandas not to offer a dtype argument to sum, etc., like numpy does? Otherwise, we could implement that also in the pull request.
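
For reference, here is a short sketch of the numpy option mentioned above, next to one possible workaround on the pandas side. The `dtype` argument of `np.sum` and `np.mean` is existing numpy API; upcasting the column with `astype` before reducing is only one workaround we can think of, not something pandas does for you, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[1000] * 100, columns=["A"], dtype="float16")
arr = df["A"].to_numpy()

# numpy: the dtype argument selects the accumulator (and result) dtype.
np_total = np.sum(arr, dtype=np.float64)
np_mean = np.mean(arr, dtype=np.float64)

# pandas workaround: upcast the column first, then reduce as usual.
pd_total = df["A"].astype("float64").sum()
pd_mean = df["A"].astype("float64").mean()

print(np_total, np_mean, pd_total, pd_mean)
```

The `astype` route makes a float64 copy of the column, so it trades memory for a correct result.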

(Cc @chrish42)

jreback · 2018-09-27T00:59:14Z

float16 has no support whatsoever

that said would take a patch that upcasts accumulators to a bigger dtype / uses float64 for float type ops if < float64
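
One way to read this suggestion, as a rough sketch only: choose float64 as the accumulator whenever the input is a float dtype narrower than float64, and leave other dtypes alone. The helper names `accumulator_dtype` and `upcast_sum` are hypothetical and do not exist in pandas; the sketch uses plain numpy and says nothing about how nanops would actually be patched.

```python
import numpy as np


def accumulator_dtype(dtype):
    """Hypothetical helper: pick the dtype to accumulate in."""
    dtype = np.dtype(dtype)
    # Float dtypes narrower than float64 (float16, float32) get a
    # float64 accumulator; everything else is returned unchanged.
    if dtype.kind == "f" and dtype.itemsize < np.dtype(np.float64).itemsize:
        return np.dtype(np.float64)
    return dtype


def upcast_sum(values):
    """Hypothetical helper: sum using the accumulator chosen above."""
    values = np.asarray(values)
    return values.sum(dtype=accumulator_dtype(values.dtype))


total = upcast_sum(np.full(100, 1000, dtype=np.float16))
print(total, total.dtype)
```

In this sketch the result comes back as float64. Whether a real fix should cast the result back to the original dtype is a separate design choice that this thread does not settle.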

Zac-HD · 2018-10-01T13:29:33Z

Tag as good first issue?

chrish42 · 2018-10-01T14:16:37Z

@Zac-HD Agree, but it would already be @jsleroux's first issue, and I've been coaching him a bit for this. We discussed creating the pull request at the office last Friday. Can we hold off on that tag for a little bit, while he works on this? :-)

Zac-HD · 2018-10-01T14:27:48Z

Oh, that's fantastic! I was just thinking of how to help someone find this, but for @jsleroux to fix it too would be great 🎉

mroeschke added the Dtype Conversions and Numeric Operations labels Jan 13, 2019

jreback mentioned this issue Jun 8, 2020: BUG: Float values get corrupted with df.astype(), for values with no overflow error #34618 (closed)

jbrockmendel added the Reduction Operations label Sep 21, 2020

jreback closed this as completed Nov 20, 2020
