
API: Inferring dtype from iterables in pandas vs numpy #47673

Open
rhshadrach opened this issue Jul 11, 2022 · 5 comments
Labels
API - Consistency (Internal Consistency of API/Behavior), Dtype Conversions (Unexpected or buggy dtype conversions), Needs Discussion (Requires discussion from core team before further action)

Comments

@rhshadrach (Member)

There are a number of situations where pandas must take an iterable and infer a single dtype from the data it contains. Two examples are Series/DataFrame construction and groupby.apply when provided a user defined function (UDF).
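
As a minimal illustration of the kind of inference at stake (standard constructor behavior; the precise rules for mixed numpy scalars are what this issue is about):

import pandas as pd

pd.Series([1, 2]).dtype
# dtype('int64')

pd.Series([1, 2.5]).dtype
# dtype('float64')

pd.Series([1, "a"]).dtype
# dtype('O')  (object: no common numeric dtype)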

In #47294 there was some discussion on how pandas treats this situation vs numpy, the relevant parts repeated below. I'm moving this to its own issue for better tracking.

@rhshadrach (Member, Author)

By @simonjayhawkins #47294 (comment)

In the case where all values are numpy scalars, this is effectively:

  • If both signed and unsigned scalars are seen, result is object dtype

Out of curiosity, why do we have our own logic and not just use the numpy array constructor when all values are numpy scalars/numeric?

np.array([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('int32')

np.array([np.uint32(np.iinfo(np.uint32).max), np.int32(np.iinfo(np.int32).min)]).dtype
# dtype('int64')

np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(np.iinfo(np.int64).min)]).dtype
# dtype('float64')
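
For contrast, the rule quoted above means pandas gives object dtype for the same style of input (a sketch based on the behavior described in this thread for pandas 1.4.x; worth re-verifying on a current release):

import numpy as np
import pandas as pd

pd.Series([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('O')  (object, vs numpy's int32 for the same values)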

This would probably be an API-breaking change, but reducing the cases that return object dtype is probably a win for most users.

I have also spoken with some users who find pandas slow, and who therefore use pandas to explore and develop but, where possible, use numpy for speed in production. So although I sometimes see comments that users learn pandas first and hence don't know numpy, and that we don't need to match numpy's behavior, I generally prefer to match numpy where possible.

@rhshadrach (Member, Author)

By @jreback #47294 (comment)

-1 on changing

pandas does the right thing - numpy sometimes does things that tbh are not great but have been there forever

slowness is almost 100% incorrect usage and writing non-idiomatic code

if someone says "pandas is slow" then show a specific example

@rhshadrach (Member, Author)

By @simonjayhawkins #47294 (comment)

> pandas does the right thing - numpy sometimes does things that tbh are not great but have been there forever

yes, converting integers (mix of uint64 and int64) to floats (dtype('float64')) could be viewed as "corrupting" input data through loss of precision.
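
A quick sketch of the precision loss in question: float64 has a 53-bit significand, so the top of the uint64 range cannot be represented exactly.

import numpy as np

arr = np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(-1)])
arr.dtype
# dtype('float64')

int(arr[0])
# 18446744073709551616  (the original value, 18446744073709551615, is gone)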

> slowness is almost 100% incorrect usage and writing non-idiomatic code
> if someone says "pandas is slow" then show a specific example

If they somehow end up with object dtype, then isn't slowness a given? To be fair, the users I have heard this from are doing machine learning and so tend to move their data into numpy arrays anyway; it's just a question of at which stage of their pipeline.
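
A minimal way to see the object-dtype cost (a sketch; the ratio depends on machine and versions, so no numbers are claimed here):

import timeit
import numpy as np
import pandas as pd

s_int = pd.Series(np.arange(1_000_000, dtype=np.int64))
s_obj = s_int.astype(object)  # same values, object dtype

# int64 reductions run in native code; object dtype falls back to
# Python-level iteration, which is typically much slower.
timeit.timeit(s_int.sum, number=10)
timeit.timeit(s_obj.sum, number=10)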

@rhshadrach (Member, Author)

Speaking generally, in situations where pandas and NumPy disagree and there is no clear best choice, I think we should value consistency with NumPy. If we think a NumPy choice is odd, even for NumPy itself, we should try to raise the issue with them. However, the use cases for NumPy and pandas can differ, so there might be a choice that is sensible for NumPy and a different one that is sensible for pandas; I do not currently know of any such example.

For the particular case @simonjayhawkins raised above, e.g. [np.uint32(1), -1], it does seem to me that int64 is better than converting to object. However, [np.uint64(1), -1] should still convert to object (assuming int128 is not supported).
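
For reference, what numpy itself does with those two cases under its value-based promotion (observed on the numpy 1.x line; NEP 50 in numpy 2.x changes promotion rules, so the details may differ there):

import numpy as np

np.array([np.uint32(1), -1]).dtype
# dtype('int64')  (int64 holds the full uint32 range plus negatives)

np.array([np.uint64(1), -1]).dtype
# dtype('float64')  (no int128, so numpy falls back to float64;
#                    the suggestion above is object dtype for pandas here)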

@buhrmann
To add to the discussion, groupby-apply with a user-defined function and integers is currently broken in pandas (1.4.2). E.g. aggregating over a column with a dtype like uint16 would do, uh, this:

pd.Series([np.uint16(1), np.uint16(41_000)])
# 0        1
# 1   -24536
# dtype: int16

Which is definitely not "the right thing"...
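
For what it's worth, the -24536 is a two's-complement wraparound: 41,000 does not fit in int16, so it comes back reduced modulo 2**16:

41_000 - 2**16
# -24536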
