ENH: groupby.max() should not cast int to int64 but keep original data type #42275

Closed
rd-andreas-lay opened this issue Jun 28, 2021 · 9 comments
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Groupby, Performance (Memory or execution speed performance)

Comments

@rd-andreas-lay

rd-andreas-lay commented Jun 28, 2021

Is your feature request related to a problem?

In pandas version 1.2.5, calling groupby.max() on a large matrix of int8 values (all 0 or 1) casts the DataFrame to int64, resulting in

MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64

Traceback:

/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
    143     try:
    144         # error: Unexpected keyword argument "casting" for "astype"
--> 145         return arr.astype("int64", copy=copy, casting="safe")  # type: ignore[call-arg]
    146     except TypeError:
    147         pass

Describe the solution you'd like

Keep the original datatype, in this case int8.
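
The allocation size in the error message follows directly from the array shape and the item size of the dtype. A minimal sketch of that arithmetic, using only the shape quoted in the MemoryError above (the helper name `size_gib` is made up for illustration):

```python
import numpy as np

# Shape taken from the MemoryError message in the report.
shape = (1915674, 5356)
n_elements = shape[0] * shape[1]


def size_gib(dtype):
    """Size in GiB of a dense array of `shape` with the given dtype."""
    return n_elements * np.dtype(dtype).itemsize / 2**30


gib_int8 = size_gib(np.int8)    # 1 byte per element
gib_int64 = size_gib(np.int64)  # 8 bytes per element

print(f"int8:  {gib_int8:.1f} GiB")
print(f"int64: {gib_int64:.1f} GiB")
```

The int64 copy is eight times the size of the int8 data, which matches the 76.4 GiB in the error message.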

rd-andreas-lay added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jun 28, 2021
@mzeitlin11
Member

Thanks for reporting this @rd-andreas-lay! This happens because our groupby algorithms only support specific types, so we need to cast to one which is supported. Wouldn't be hard to support more types for group_min and group_max, but it would increase distribution size (since we effectively need one function per supported type).
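
Until the groupby kernels support narrower types, one possible workaround is to compute the grouped maximum outside of groupby so that the data never leaves its original dtype. The following is only a sketch under stated assumptions, not how pandas implements group_max: `grouped_max_keep_dtype` is a hypothetical helper, it assumes a 2-D integer NumPy array and keys without missing values, and `np.maximum.at` can be slow on older NumPy versions.

```python
import numpy as np
import pandas as pd


def grouped_max_keep_dtype(values, keys):
    """Column-wise max per group, computed in the dtype of `values`.

    Hypothetical helper (not part of pandas). Assumes `values` is a 2-D
    integer array and `keys` contains no missing values.
    """
    # Map each key to an integer group code; sort=True orders the groups
    # the same way groupby does by default.
    codes, uniques = pd.factorize(np.asarray(keys, dtype=object), sort=True)

    # Start every group at the smallest representable value of the dtype.
    out = np.full(
        (len(uniques), values.shape[1]),
        np.iinfo(values.dtype).min,
        dtype=values.dtype,
    )

    # Unbuffered in-place maximum: out[codes[i]] = max(out[codes[i]], values[i]).
    np.maximum.at(out, codes, values)
    return pd.DataFrame(out, index=pd.Index(uniques))


rng = np.random.default_rng(0)
values = rng.integers(0, 2, size=(100, 3), dtype=np.int8)
keys = ["lion", "bird", "dog", "cat", "python"] * 20

result = grouped_max_keep_dtype(values, keys)
print(result.dtypes)
```

The only extra allocations here are the group codes (one integer per row) and the small output array, so no full-size int64 copy of the data is made.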

mzeitlin11 added the Algos, Groupby and Performance labels and removed the Enhancement and Needs Triage labels Jun 28, 2021
mzeitlin11 added this to the Contributions Welcome milestone Jun 28, 2021
@arubiales

arubiales commented Jul 1, 2021

Hi @mzeitlin11! I would like to contribute to this issue in pandas. Do you want to add support for int8? Can I work on it?

@mzeitlin11
Member

@arubiales that would be great!

@arubiales

arubiales commented Jul 1, 2021

Thanks @mzeitlin11 I will go for it!

Any useful information would be appreciated, for example which pandas module or files contain the relevant code, and anything else I should consider.

@mzeitlin11
Member

This is a pretty complicated issue, so there are a lot of things to consider :), but please reach out if you'd like any help:

@arubiales

Yes I know that it will take time, but I have a strong knowledge of C and Cython, so I think that with time I will do it.

Thank you for the info. I'm going to review it and get an overall idea of how everything is connected.

@arubiales

@mzeitlin11 @rd-andreas-lay Sorry, but I'm trying to reproduce the data type change with a minimal reproducible example and I can't, so I must be missing something here. I'm trying the following:

import numpy as np
import pandas as pd

# Create a dummy DF
df_prueba = pd.DataFrame(np.random.randint(0, 2, (100, 3), dtype=np.int8))
df_prueba["name"] = ["lion", "bird", "dog", "cat", "python"]*20

# The grouped result keeps the int8 dtype
df_group = df_prueba.groupby("name").max()
print(df_group.dtypes)

Output:

0    int8
1    int8
2    int8
dtype: object

@rd-andreas-lay
Author

rd-andreas-lay commented Jul 16, 2021

@arubiales As I understand it, the result is cast back to the original data type later on; the conversion to the wider type (int64 in the traceback above) is only intermediate. It can still cause memory allocation errors, though: in my example, an increase from 10 GB to 70 GB.

I'd have to run an example through the debugger though to see where the re-casting to int8 happens.

If you check memory consumption while running the example on a larger DataFrame, you should see memory increase during processing; the final result will be smaller again due to the recasting to int8. Basically, memory usage follows an inverted V shape.
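
One way to look for the temporary spike described above is to record the peak traced memory around the groupby call with the standard-library `tracemalloc` module. This is a rough sketch, not a precise profiler: `run_with_peak` is a hypothetical helper, the DataFrame size is arbitrary, and whether a spike shows up at all depends on the pandas version in use.

```python
import tracemalloc

import numpy as np
import pandas as pd


def run_with_peak(fn):
    """Run fn() and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Arbitrary test data: int8 values of 0/1 and a small integer group key.
rng = np.random.default_rng(0)
values = rng.integers(0, 2, size=(200_000, 20), dtype=np.int8)
df = pd.DataFrame(values)
df["name"] = rng.integers(0, 5, size=len(df))

result, peak = run_with_peak(lambda: df.groupby("name").max())

print(f"int8 input data:     {values.nbytes / 2**20:.1f} MiB")
print(f"peak during groupby: {peak / 2**20:.1f} MiB")
print(result.dtypes.unique())
```

If the peak is several times larger than the int8 input while the result dtype is still int8, that is consistent with an intermediate upcast followed by a cast back.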

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@lithomas1
Member

closed by #46745


5 participants