
API: Inferring dtype from iterables in pandas vs numpy #47673

Open
rhshadrach opened this issue Jul 11, 2022 · 5 comments
Labels
API - Consistency (Internal Consistency of API/Behavior), Dtype Conversions (Unexpected or buggy dtype conversions), Needs Discussion (Requires discussion from core team before further action)

Comments

@rhshadrach (Member)

There are a number of situations where pandas must take an iterable and infer a single dtype from the data it contains. Two examples are Series/DataFrame construction and groupby.apply when provided a user defined function (UDF).
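
As a minimal illustration of the kind of inference at stake (standard constructor behavior; the precise rules for mixed numpy scalars are what this issue is about):

import pandas as pd

pd.Series([1, 2]).dtype
# dtype('int64')

pd.Series([1, 2.5]).dtype
# dtype('float64')

pd.Series([1, "a"]).dtype
# dtype('O')  (object: no common numeric dtype)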

In #47294 there was some discussion on how pandas treats this situation vs numpy, the relevant parts repeated below. I'm moving this to its own issue for better tracking.

@rhshadrach (Member, Author)

By @simonjayhawkins #47294 (comment)

In the case where all values are numpy scalars, this is effectively:

  • If both signed and unsigned scalars are seen, result is object dtype

Out of curiosity, why do we have our own logic and not just use the numpy array constructor when all values are numpy scalars/numeric?

np.array([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('int32')

np.array([np.uint32(np.iinfo(np.uint32).max), np.int32(np.iinfo(np.int32).min)]).dtype
# dtype('int64')

np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(np.iinfo(np.int64).min)]).dtype
# dtype('float64')
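
For contrast, the rule quoted above means pandas gives object dtype for the same style of input (a sketch based on the behavior described in this thread for pandas 1.4.x; worth re-verifying on a current release):

import numpy as np
import pandas as pd

pd.Series([np.uint16(np.iinfo(np.uint16).max), np.int16(np.iinfo(np.int16).min)]).dtype
# dtype('O')  (object, vs numpy's int32 for the same values)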

This would probably be an API-breaking change, but reducing the cases that return object dtype is probably a win for most users.

I have also spoken with some users who find pandas slow, and who therefore use pandas to explore and develop but, where possible, use numpy for speed in production. So although I sometimes see comments that users learn pandas first and hence don't know numpy, and that we don't need to match numpy's behavior, I generally prefer to match numpy where possible.

@rhshadrach (Member, Author)

By @jreback #47294 (comment)

-1 on changing

pandas does the right thing - numpy sometimes does things that tbh are not great but have been there forever

slowness is almost 100% incorrect usage and writing non-idiomatic code

if someone says "pandas is slow" then show a specific example

@rhshadrach (Member, Author)

By @simonjayhawkins #47294 (comment)

> pandas does the right thing - numpy sometimes does things that tbh are not great but have been there forever

yes, converting integers (mix of uint64 and int64) to floats (dtype('float64')) could be viewed as "corrupting" input data through loss of precision.
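
A quick sketch of the precision loss in question: float64 has a 53-bit significand, so the top of the uint64 range cannot be represented exactly.

import numpy as np

arr = np.array([np.uint64(np.iinfo(np.uint64).max), np.int64(-1)])
arr.dtype
# dtype('float64')

int(arr[0])
# 18446744073709551616  (the original value, 18446744073709551615, is gone)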

> slowness is almost 100% incorrect usage and writing non-idiomatic code
> if someone says "pandas is slow" then show a specific example

If they somehow end up with object dtype, then isn't slowness a given? To be fair, the users I have heard this from are doing machine learning and so tend to move their data into numpy arrays anyway; it's just a question of at which stage of their pipeline.
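
A minimal way to see the object-dtype cost (a sketch; the ratio depends on machine and versions, so no numbers are claimed here):

import timeit
import numpy as np
import pandas as pd

s_int = pd.Series(np.arange(1_000_000, dtype=np.int64))
s_obj = s_int.astype(object)  # same values, object dtype

# int64 reductions run in native code; object dtype falls back to
# Python-level iteration, which is typically much slower.
timeit.timeit(s_int.sum, number=10)
timeit.timeit(s_obj.sum, number=10)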

@rhshadrach (Member, Author)

Speaking generally, in situations where pandas and NumPy disagree and there is no clear best choice, I think we should value consistency with NumPy. If we think a NumPy choice is odd, even for NumPy itself, we should try to raise the issue with them. However, the use cases for NumPy and pandas can differ, so there might be a choice that is sensible for NumPy and a different one that is sensible for pandas; I do not currently know of any such example.

For the particular case @simonjayhawkins raised above, e.g. [np.uint32(1), -1], it does seem to me that int64 is better than converting to object. However, [np.uint64(1), -1] should still convert to object (assuming int128 is not supported).
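
For reference, what numpy itself does with those two cases under its value-based promotion (observed on the numpy 1.x line; NEP 50 in numpy 2.x changes promotion rules, so the details may differ there):

import numpy as np

np.array([np.uint32(1), -1]).dtype
# dtype('int64')  (int64 holds the full uint32 range plus negatives)

np.array([np.uint64(1), -1]).dtype
# dtype('float64')  (no int128, so numpy falls back to float64;
#                    the suggestion above is object dtype for pandas here)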

@buhrmann
To add to the discussion, groupby-apply with a user-defined function and integers is currently broken in pandas (1.4.2). E.g. aggregating over a column with a dtype like uint16 would do, uh, this:

pd.Series([np.uint16(1), np.uint16(41_000)])
# 0        1
# 1   -24536
# dtype: int16

Which is definitely not "the right thing"...
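
For what it's worth, the -24536 is a two's-complement wraparound: 41,000 does not fit in int16, so it comes back reduced modulo 2**16:

41_000 - 2**16
# -24536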
