ENH: groupby.max() should not cast int to int64 but keep original data type #42275

rd-andreas-lay · 2021-06-28T07:17:59Z

Is your feature request related to a problem?

In pandas version 1.2.5, calling groupby.max() on a large matrix of int8 0/1 values makes pandas cast the dataframe to int64, resulting in

MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64

Traceback:

/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
    143     try:
    144         # error: Unexpected keyword argument "casting" for "astype"
--> 145         return arr.astype("int64", copy=copy, casting="safe")  # type: ignore[call-arg]
    146     except TypeError:
    147         pass

Describe the solution you'd like

Keep the original datatype, in this case int8.
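For reference, the size in the error message can be reproduced from the reported shape alone. The sketch below is only arithmetic on the numbers quoted in the MemoryError. It assumes 1 byte per int8 element and 8 bytes per int64 element, and it does not touch pandas at all.

```python
# Shape taken from the MemoryError quoted above.
rows, cols = 1_915_674, 5_356
n_elements = rows * cols

GIB = 1024 ** 3  # bytes per GiB

# int8 stores 1 byte per element, int64 stores 8 bytes per element.
int8_gib = n_elements * 1 / GIB
int64_gib = n_elements * 8 / GIB

print(f"int8 array:  {int8_gib:.1f} GiB")
print(f"int64 array: {int64_gib:.1f} GiB")
```

The int64 figure comes out at roughly 76.4 GiB, matching the error, while the same data as int8 needs a little under 10 GiB.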

mzeitlin11 · 2021-06-28T13:49:47Z

Thanks for reporting this @rd-andreas-lay! This happens because our groupby algorithms only support specific types, so we need to cast to one which is supported. Wouldn't be hard to support more types for group_min and group_max, but it would increase distribution size (since we effectively need one function per supported type).
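To illustrate what a grouped max without an upcast looks like, here is a minimal NumPy sketch. It is not the pandas Cython implementation: `group_max_keep_dtype` is a hypothetical helper, it assumes an integer dtype and precomputed integer group codes, and it ignores missing values. The output buffer is allocated in the input's own dtype, so no int64 copy of the data is ever made.

```python
import numpy as np


def group_max_keep_dtype(values, codes, ngroups):
    """Grouped max over rows that keeps the input's integer dtype.

    Hypothetical helper for illustration only, not a pandas API.
    values: 2D integer array, one row per observation.
    codes:  1D array of group ids in [0, ngroups), one per row.
    """
    values = np.asarray(values)
    codes = np.asarray(codes, dtype=np.intp)
    # Start every group at the smallest value the dtype can hold.
    # Groups with no rows keep this sentinel.
    out = np.full(
        (ngroups, values.shape[1]),
        np.iinfo(values.dtype).min,
        dtype=values.dtype,
    )
    # Unbuffered in-place maximum: out[codes[i]] = max(out[codes[i]], values[i]).
    np.maximum.at(out, codes, values)
    return out


values = np.array([[0, 1], [1, 0], [0, 0], [0, 1]], dtype=np.int8)
codes = np.array([0, 0, 1, 1])
result = group_max_keep_dtype(values, codes, ngroups=2)
print(result.dtype)
print(result)
```

NumPy can do this for any integer dtype because its ufuncs already carry one compiled loop per dtype, which is the same kind of per-type cost described above for the Cython functions.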

arubiales · 2021-07-01T17:45:31Z

Hi @mzeitlin11! I'd like to contribute to this issue in pandas. Do you want to add support for int8? Can I work on it?

mzeitlin11 · 2021-07-01T18:14:26Z

@arubiales that would be great!

arubiales · 2021-07-01T18:38:47Z

Thanks @mzeitlin11 I will go for it!

Any useful pointers would be appreciated, for example which pandas module or files this lives in, and anything else to consider.

mzeitlin11 · 2021-07-01T19:26:18Z

This is a pretty complicated issue, so there are a lot of things to consider :), but please reach out if you'd like any help:

The cython algorithm is here: pandas/pandas/_libs/groupby.pyx, line 1173 in b0082d2:

cdef group_min_max(groupby_t[:, ::1] out,

To avoid needing to upcast, the fused type should be updated to be numeric.

A lot of the preprocessing (and where the upcast happens) is here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py. I would recommend using a debugger to step through an example to figure out where/why the upcast occurs and how you can avoid it.

Since the purpose of this issue is to reduce memory usage, we'll want to verify any patch with a memory benchmark. For an example, see something like pandas/asv_bench/benchmarks/rolling.py, line 192 in 12513c4:

def peakmem_fixed(self, operation):
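A memory benchmark for this issue could look roughly like the sketch below. It relies on the asv convention that methods whose names start with `peakmem_` are measured for peak memory. The class name, column name, and array sizes are assumptions chosen for illustration and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class GroupByMaxInt8:
    """Hypothetical asv benchmark: peak memory of groupby.max() on int8 data."""

    def setup(self):
        # Build an int8 frame of 0/1 values plus a low-cardinality key column.
        rng = np.random.default_rng(0)
        values = rng.integers(0, 2, size=(100_000, 50), dtype=np.int8)
        self.df = pd.DataFrame(values)
        self.df["key"] = np.arange(len(self.df)) % 5

    def peakmem_groupby_max(self):
        # asv records the peak memory of the process while this method runs.
        self.df.groupby("key").max()
```

Comparing the reported peak before and after a patch would show whether the intermediate upcast is gone.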

arubiales · 2021-07-01T19:40:14Z

Yes, I know it will take time, but I have strong knowledge of C and Cython, so I think I can get it done.

Thank you for the info. I'm going to review it and get an overall idea of how everything is connected.

arubiales · 2021-07-15T15:51:43Z

@mzeitlin11 @rd-andreas-lay. Sorry, but I'm trying to reproduce the data type change with a minimal reproducible example and I can't, so I'm missing something here. I'm trying the following:

import numpy as np
import pandas as pd

# Create a dummy DF
df_prueba = pd.DataFrame(np.random.randint(0, 2, (100, 3), dtype=np.int8))
df_prueba["name"] = ["lion", "bird", "dog", "cat", "python"]*20

# The grouped result still reports int8 for every column
df_group = df_prueba.groupby("name").max()
print(df_group.dtypes)

Output:

0    int8
1    int8
2    int8
dtype: object

rd-andreas-lay · 2021-07-16T06:53:35Z

@arubiales In my understanding the final data type is recast to the original data type later on, and the conversion to int64 is just intermediate (still potentially causing memory allocation errors; in my example an increase from 10GB to 70GB).

I'd have to run an example through the debugger though to see where the re-casting to int8 happens.

If you check your memory consumption while running the example on a larger dataframe, you should see an increase in memory during processing, and the final result will again be smaller due to the recasting to int8. Basically an inverted V shape in memory usage.
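One way to look for that temporary spike is to trace allocations around the call. The sketch below uses the standard-library `tracemalloc` module and reuses the shape of the earlier example with more rows. It is an approximation: `tracemalloc` only sees allocations routed through Python's and NumPy's allocators, and the numbers depend on the installed pandas version, so no expected output is given. The sizes and variable names are assumptions.

```python
import tracemalloc

import numpy as np
import pandas as pd

# A larger int8 frame than the 100-row example, with five groups.
df = pd.DataFrame(np.random.randint(0, 2, (200_000, 20), dtype=np.int8))
df["name"] = np.arange(len(df)) % 5

input_bytes = df.drop(columns="name").to_numpy().nbytes

# Trace only the groupby call so the input frame itself is not counted.
tracemalloc.start()
result = df.groupby("name").max()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"int8 input data:         {input_bytes / 1e6:.1f} MB")
print(f"still allocated after:   {current / 1e6:.1f} MB")
print(f"peak during groupby.max: {peak / 1e6:.1f} MB")
print(result.dtypes.unique())
```

A peak several times larger than the int8 input, with little left allocated afterwards, would be consistent with an intermediate upcast followed by a cast back to int8.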

lithomas1 · 2023-01-16T16:04:55Z

closed by #46745

rd-andreas-lay added the Enhancement and Needs Triage labels Jun 28, 2021

mzeitlin11 added the Algos, Groupby, and Performance labels and removed the Enhancement and Needs Triage labels Jun 28, 2021

mzeitlin11 added this to the Contributions Welcome milestone Jun 28, 2021

mzeitlin11 mentioned this issue Sep 28, 2021

CLN: unify fused type definitions #43774

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

lithomas1 closed this as completed Jan 16, 2023
