pd.eval division operation upcasts float32 to float64 #12388

jennolsen84 · 2016-02-19T01:44:40Z

The current behavior is inconsistent with normal python division of two DataFrames (see code sample).

Pandas upcasts both terms to 64-bit floats when it detects a division, see:

https://github.com/pydata/pandas/blob/528108bba4104b939bcfe6923677ddacc916ff00/pandas/computation/ops.py#L453

I think numexpr can handle different types too, and upcast automatically, though I am not 100% sure. I can submit a PR, but how do you recommend fixing this? Something like the following?

if truediv or PY3:
    for term in com.flatten(self):
        try:
            dt = term.values.dtype  # can .values be expensive?
        except AttributeError:
            dt = type(term)

        # leave float32 terms alone; cast everything else to float64
        if dt != np.float32:
            _cast_inplace([term], np.float_)

The downside is that if someone does 2 + df, they'll probably still end up upcasting it. But this proposal is still better than what we have today

I might re-write the above using filter too, but at this time I just wanted to discuss the general approach

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(3, dtype=np.float32))
print('normal', (df/df).values.dtype)
print('pd_eval', pd.eval('df/df').values.dtype)
assert ((df/df).dtypes == pd.eval('df/df').dtypes).all()

Expected Output

normal float32 
pd_eval float32

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: None
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

jreback · 2016-02-19T01:59:36Z

In operations implying a scalar and an array, the normal rules of casting are used in Numexpr, in contrast with NumPy, where array types takes priority. For example, if 'a' is an array of type `float32` and 'b' is an scalar of type `float64` (or Python `float` type, which is equivalent), then 'a*b' returns a `float64` in Numexpr, but a `float32` in NumPy (i.e. array operands take priority in determining the result type). If you need to keep the result a `float32`, be sure you use a `float32` scalar too.

(This is different than what you are saying, but we should probably handle it nonetheless.) I would do this test/casting in _cast_inplace itself.
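To make the NumPy half of the quoted rule concrete, here is a small sketch that uses only NumPy. The numexpr half is as described in the quote and is not executed here; the variable names are illustrative.

```python
import numpy as np

a = np.arange(3, dtype=np.float32)

# A Python float carries float64 precision, but NumPy lets the
# float32 array determine the result dtype.
scaled = a * 2.0
print(scaled.dtype)  # float32

# An explicit float32 scalar also keeps the result in float32.
scaled32 = a * np.float32(2.0)
print(scaled32.dtype)  # float32
```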

jennolsen84 · 2016-02-19T05:23:11Z

numpy behavior seems to make more sense.

pd.eval('3.5 / float32array')

is much easier to write than:

s = np.float32('3.5')
pd.eval('s / float32array')

Also, someone who didn't read the numexpr docs very carefully would have missed this detail.

Therefore, should we mimic numpy behavior?

As for _cast_inplace, should we modify the signature? After the changes, it would be much more specialized function. It looks like it is only used once, so we have that going for us.

jennolsen84 · 2016-02-19T06:31:35Z

Thought about it some more

We could look at the whole expression, and come up with an output datatype:

if all arrays in the expression are float32 or ints:
    output type = float32
else:
    output type = float64

This still has corner cases: for example, adding two int32 arrays will result in float64. It is unclear what the result of adding two int32 arrays should be. If the numbers are small, an int32 output array is OK, but if the numbers are big you need an int64 array. A way around this would be to let the user specify an out parameter. We could do extra checks to warn the user about incompatibilities, e.g. if two float64s are being added but the output type is float32.

So, the proposal now becomes:

1. Add an out parameter to let the user specify the destination (and therefore the output datatype). It must be an ndarray or a pandas object (so it has either .dtype or .values.dtype).
2. Choose the output array dtype to be one of {float64, float32}, depending on the datatypes of the arrays in the expression. float32 is chosen if all arrays in the expression have dtypes of float32 or any of the ints; otherwise float64 is chosen.
3. Warn if out is specified and is a float32 array, but the input contains a float64 array.
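The proposal above could be sketched roughly as follows. This is only an illustration of the rule as stated in this comment, not code from pandas: `choose_output_dtype` is a hypothetical helper, and it only handles objects that expose `.dtype` directly.

```python
import warnings

import numpy as np


def choose_output_dtype(arrays, out=None):
    """Hypothetical helper: pick float32 or float64 for an expression."""
    # float32 only if every array is float32 or an integer type.
    all_small = all(
        arr.dtype == np.float32 or np.issubdtype(arr.dtype, np.integer)
        for arr in arrays
    )
    inferred = np.dtype(np.float32) if all_small else np.dtype(np.float64)
    if out is None:
        return inferred

    # A user-supplied destination wins, but warn when float64 input
    # would be stored in a float32 output.
    has_f64 = any(arr.dtype == np.float64 for arr in arrays)
    if out.dtype == np.float32 and has_f64:
        warnings.warn("float64 input will be stored in a float32 output")
    return out.dtype


f32 = np.zeros(3, dtype=np.float32)
i32 = np.zeros(3, dtype=np.int32)
f64 = np.zeros(3, dtype=np.float64)

print(choose_output_dtype([f32, i32]))  # float32
print(choose_output_dtype([f32, f64]))  # float64
```

Note that under this rule an expression made only of integer arrays would also be assigned float32, which is one of the corner cases the comment itself raises.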

jreback · 2016-02-19T13:10:36Z

I don't recall why we are casting in the first place. I would ideally like to defer this entirely to the engine.
@chris-b1 @cpcloud any recall?

if not, then would be ok with passing a dtype= argument for casting and default to the minimum casting needed (though this just adds another layer of indirection but I guess needs to be done).

jennolsen84 · 2016-02-22T09:10:35Z

Should we go with numpy casting behavior (instead of numexpr)? numpy behavior is consistent with pandas when numexpr is not used.

So, what we'd have to do here is to down-cast constants from float64 to float32, if and only if all arrays are float32s. E.g., numpy and pandas will use float64 as the output dtype when int32 arrays are multiplied by a float32 constant. So, it seems like the float32 array case is the main thing we have to worry about.

e.g.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.Series(np.arange(5, dtype=np.float32)) * 2.0
Out[3]: 
0    0
1    2
2    4
3    6
4    8
dtype: float32

In [11]: a = pd.Series(np.arange(5, dtype=np.int32)) * np.float32(1.1)
In [12]: a
Out[12]: 
0    0.0
1    1.1
2    2.2
3    3.3
4    4.4
dtype: float64

In [13]: np.arange(5, dtype=np.int32) * np.float32(1.1)
Out[13]: array([ 0.        ,  1.10000002,  2.20000005,  3.30000007,  4.4000001 ])
In [14]: z = np.arange(5, dtype=np.int32) * np.float32(1.1)
In [15]: z.dtype
Out[15]: dtype('float64')
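The "down-cast constants only when all arrays are float32" idea could look something like the sketch below. `downcast_constants` is a hypothetical helper written for illustration; it is not the change that was eventually made in pandas.

```python
import numpy as np


def downcast_constants(terms):
    """Hypothetical helper: cast float scalars to float32 only when every
    array among the terms is float32; otherwise return the terms unchanged."""
    arrays = [t for t in terms if isinstance(t, np.ndarray)]
    if not arrays or any(a.dtype != np.float32 for a in arrays):
        return list(terms)
    # Python floats (and np.float64, which subclasses float) are downcast.
    return [np.float32(t) if isinstance(t, float) else t for t in terms]


f32 = np.arange(5, dtype=np.float32)
i32 = np.arange(5, dtype=np.int32)

arr, const = downcast_constants([f32, 2.0])
print(type(const).__name__)  # float32

_, kept = downcast_constants([i32, 2.0])
print(type(kept).__name__)  # float
```

With an int32 array present, the constant is left alone so that the result can still be promoted to float64, matching the session above.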

jreback · 2016-02-23T16:44:13Z

I think you have to upcast by default. The only way I wouldn't would be if the user indicated (with dtype=) that it's ok to proceed, and then I would simply cast things to the passed dtype so the underlying engine wouldn't then upcast.

jennolsen84 · 2016-02-23T19:09:27Z

but wouldn't this result in inconsistent behavior between normal pandas binary operations (like s * 2.0, which does not upcast s if it is a float32 series) and pd.eval('s * 2.0'), which will end up upcasting?

jreback · 2016-02-23T19:22:30Z

@jennolsen84 hmm, that is a good point. I'm just trying to avoid having pandas do any casting here. What if we remove that and just let the engine do it? (I don't really recall why this is special-cased here.) Or, if we are forced to do it, then I guess you are right: we would have to do a lowest-common-denominator cast (maybe use np.find_common_type).
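As a rough illustration of a lowest-common-denominator cast, the sketch below uses `np.result_type`. That is a substitution on my part: the comment suggests `np.find_common_type`, which has since been deprecated in newer NumPy releases, so `np.result_type` stands in for it here.

```python
import numpy as np

# Smallest dtype that can hold every operand type without losing precision.
same = np.result_type(np.float32, np.float32)
mixed_float = np.result_type(np.float32, np.float64)
small_int = np.result_type(np.float32, np.int16)
large_int = np.result_type(np.float32, np.int32)

print(same)         # float32
print(mixed_float)  # float64
print(small_int)    # float32
print(large_int)    # float64
```

The last line matches the int32 case shown earlier in the thread, where an int32 array combined with a float32 constant ends up as float64.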

jennolsen84 · 2016-02-24T12:26:35Z

how about this as a start? jennolsen84@c82819f

I manually tested it, and the behavior is now consistent with non-numexpr code. I am trying to avoid casting unnecessarily, as you recommended, and letting the lower-level libraries take care of a lot of things.

I did run the nosetests, and they all pass on existing tests.

If the commit looks good to you, I can add in some tests, add to docs, etc. and submit a PR.

jennolsen84 · 2016-02-29T21:36:00Z

@jreback can you please take another look at the commit? I addressed your comment, and I am not sure if you missed it.

jreback · 2016-03-01T12:34:14Z

@jennolsen84 yeh just getting back to this.

your soln seems fine. However I still don't understand why it is necessary to upcast (and only for division); what does numexpr do (if you don't upcast)? is it wrong?

jennolsen84 · 2016-03-01T17:53:05Z

We're casting to float32 in all ops (not just division).

The division thing was another case where pandas was casting to float(64), so I had to make a change there as well.

The cast happens at all because, for some reason, numexpr casts a 64-bit float scalar * 32-bit float array to 64-bit floats. I am not sure why. This is inconsistent with numpy, unnecessarily slow, and takes up more RAM.
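The RAM point is easy to check with NumPy alone. This is a minimal sketch with an arbitrary array size; it shows only the storage cost of the upcast, not its speed.

```python
import numpy as np

a32 = np.ones(1_000_000, dtype=np.float32)
a64 = a32.astype(np.float64)  # what an unwanted upcast produces

print(a32.nbytes)  # 4000000
print(a64.nbytes)  # 8000000
```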

I will submit a PR (with whatsnew and tests)

jreback · 2016-03-01T19:35:32Z

thanks @jennolsen84 why don't you submit and we'll go from there

jreback added the Dtype Conversions (Unexpected or buggy dtype conversions) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Feb 19, 2016

jennolsen84 mentioned this issue Mar 8, 2016

do not upcast results to float64 when float32 scalar *+/- float64 array #12559

Closed

4 tasks

jreback added this to the 0.18.2 milestone May 27, 2016

jreback closed this as completed in cc1025a May 30, 2016
