Overflow when running reductions on float16 columns in pandas Series #22841

Closed
jsleroux opened this issue Sep 26, 2018 · 4 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Numeric Operations (arithmetic, comparison, and logical operations), Reduction Operations (sum, mean, min, max, etc.)

Comments

@jsleroux

When running reductions on dataframe columns of dtype float16, we ran into a surprising behaviour:

import pandas as pd
df = pd.DataFrame(data=[1000]*100, columns=['A'], dtype='float16')

mean = df['A'].mean()
sum = df['A'].sum()
print('Mean: {}, sum: {}'.format(mean, sum))

# Both of these should be true, but they're not, returning inf in pandas 0.23.4
assert mean == 1000
assert sum == 1000 * 100

After investigation, we found that the accumulator used in (for example) mean() and sum() is too small and eventually overflows. In both nansum() and nanmean() (to which sum() and mean() delegate their work), when the data has a float dtype, the accumulator is downcast to the original dtype of the data.

In our case, because the original dtype is float16, the accumulator is downcast to float16, for example in nansum():

    dtype_sum = dtype_max
    if is_float_dtype(dtype):
        dtype_sum = dtype

(Similar code in nanmean.)
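To make the overflow concrete, here is a minimal sketch using plain NumPy rather than the pandas internals. It assumes only that the accumulator dtype can be chosen through the `dtype` argument of `ndarray.sum`; the variable names are illustrative.

```python
import numpy as np

# 100 copies of 1000.0, stored as float16 (1000 is exactly representable).
values = np.full(100, 1000, dtype=np.float16)

# Largest finite float16 value. The true sum (100000) is larger than this.
float16_max = float(np.finfo(np.float16).max)

# Accumulating in float16 overflows; silence the overflow warning for the demo.
with np.errstate(over="ignore"):
    sum_f16 = values.sum(dtype=np.float16)

# Accumulating in float64 keeps the running total in range.
sum_f64 = values.sum(dtype=np.float64)

print(float16_max, sum_f16, sum_f64)
```

The float16 accumulator ends up as inf because the running total passes the float16 maximum long before the last element is added, while the float64 accumulator holds the exact total.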

However, pandas did not always behave like that. The current behaviour was added in the following commits, to solve other bugs:

Also, for the "int" codepaths, the accumulator is never downcast and always set to float64. It is only for the "float" cases that the size of the accumulator is set to be the same as the dtype of the column.

We'd be willing to submit a pull request, but we're not sure what the best fix here would be. Should we just always have a float64 accumulator in these functions for the float cases, instead of downcasting it? If not, what would a good fix look like?

By the way, things are a bit different in numpy: the accumulator is set to float64 in more cases, and users can specify the dtype of the accumulator (and, at the same time, of the output). Having the same option in pandas would have allowed us to at least work around this by requesting a float64 accumulator. Was there a decision made in pandas not to offer a dtype argument to sum, etc. like numpy does? Otherwise, we could implement that in the pull request as well.
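Until something like that exists, one possible workaround is to upcast before reducing, or to drop down to NumPy and pass `dtype` there. The sketch below reuses the DataFrame from the report; it is only an illustration of the workaround, not a proposed pandas API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[1000] * 100, columns=['A'], dtype='float16')

# Option 1: upcast the column first, so pandas reduces float64 data.
col64 = df['A'].astype('float64')
mean_upcast = col64.mean()
sum_upcast = col64.sum()

# Option 2: reduce the underlying NumPy array with an explicit float64 accumulator.
sum_numpy = df['A'].values.sum(dtype=np.float64)

print('Mean: {}, sum: {}, numpy sum: {}'.format(mean_upcast, sum_upcast, sum_numpy))
```

Option 1 copies the column at the wider dtype, which costs memory on large data. Option 2 avoids the copy but bypasses the NaN handling that the pandas reductions provide.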

(Cc @chrish42)

@jreback
Contributor

jreback commented Sep 27, 2018

float16 has no support whatsoever

that said would take a patch that upcasts accumulators to a bigger dtype / uses float64 for float type ops if < float64
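A rough sketch of what such an upcast could look like, written against NumPy only. The helper names `accumulator_dtype` and `upcast_sum` are hypothetical and are not pandas functions; the actual change would live in nansum() and nanmean().

```python
import numpy as np


def accumulator_dtype(dtype):
    """Hypothetical helper: use float64 for float dtypes narrower than float64."""
    dtype = np.dtype(dtype)
    float64 = np.dtype(np.float64)
    if np.issubdtype(dtype, np.floating) and dtype.itemsize < float64.itemsize:
        return float64
    # Non-float dtypes and floats at least as wide as float64 are left alone.
    return dtype


def upcast_sum(values):
    """Hypothetical helper: sum with the widened accumulator."""
    values = np.asarray(values)
    return values.sum(dtype=accumulator_dtype(values.dtype))


data = np.full(100, 1000, dtype=np.float16)
result = upcast_sum(data)
print(result, result.dtype)
```

Whether the result should afterwards be cast back to the column's dtype is left open here. For this data, casting 100000 back to float16 would yield inf again, so the sketch returns the float64 value.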

@Zac-HD
Contributor

Zac-HD commented Oct 1, 2018

Tag as good first issue?

@chrish42
Contributor

chrish42 commented Oct 1, 2018

@Zac-HD Agreed, but it would already be @jsleroux's first issue, and I've been coaching him a bit for this. We discussed creating the pull request at the office last Friday. Can we hold off on that tag for a little bit, while he works on this? :-)

@Zac-HD
Contributor

Zac-HD commented Oct 1, 2018

Oh, that's fantastic! I was just thinking of how to help someone find this, but for @jsleroux to fix it too would be great 🎉

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 13, 2019
@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 21, 2020
@jreback jreback closed this as completed Nov 20, 2020