PERF: groupby().apply() with a DataFrame-returning function on a large DataFrame keeps running for a very long time after the main loop has finished #54237

@yipukangda

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

When I group a DataFrame (df, shape[0] ~ 1M) and apply a function that returns a DataFrame, i.e.

import pandas as pd

def func(x):
    # Placeholder for the real per-group work ("do something to x").
    y = x
    return pd.DataFrame(y)  # y.shape[0] >= 1

df.groupby('some_columns').apply(func)

I tried this with df.head(100) and got the result I expected.

But when I run it on the full df, it has not finished even after 12 h. I checked the run time with progress_apply from the tqdm package and found that the progress bar completes in less than 15 min. The job, however, keeps running as before and still has not stopped several hours later.
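One way to check where the time goes is to measure the seconds spent inside the applied function and compare them with the total wall time of the apply call. The sketch below is only an illustration: `func` is a hypothetical stand-in for the real per-group work, the `timed` wrapper and the tiny `df` are assumptions made for the example, and the measurement says nothing about the cause on the real data.

```python
import time

import pandas as pd


def func(x):
    # Hypothetical stand-in for the real per-group work.
    return pd.DataFrame({"total": [x["value"].sum()]})


def timed(f):
    """Wrap f so that call count and seconds spent inside it accumulate on the wrapper."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return f(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.seconds += time.perf_counter() - start

    wrapper.calls = 0
    wrapper.seconds = 0.0
    return wrapper


# Tiny illustrative frame; the real df has about 1M rows.
df = pd.DataFrame({"some_columns": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

timed_func = timed(func)
start = time.perf_counter()
result = df.groupby("some_columns").apply(timed_func)
total = time.perf_counter() - start

# Whatever is not spent inside func is spent in pandas itself
# (splitting the groups and combining the returned frames).
print(f"calls: {timed_func.calls}")
print(f"inside func: {timed_func.seconds:.4f}s")
print(f"outside func: {total - timed_func.seconds:.4f}s")
```

If most of the wall time falls outside `func`, that would be consistent with the progress bar finishing early while the job keeps running.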

Btw, I tried it with an ordinary for loop and it finished after about 6 h.
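The loop itself is not shown in the report, so the following is only a guess at what such a loop could look like: call the function on each group, collect the pieces, and combine them once at the end. The helper name `apply_with_loop`, the stand-in `func`, and the sample data are all hypothetical.

```python
import pandas as pd


def func(x):
    # Hypothetical stand-in for the real per-group work.
    return pd.DataFrame({"total": [x["value"].sum()], "rows": [len(x)]})


def apply_with_loop(df, by, func):
    """Call func on each group in a plain loop, then combine the results once."""
    keys, pieces = [], []
    for key, group in df.groupby(by):
        keys.append(key)
        pieces.append(func(group))
    # A single concat at the end; `keys` puts the group label in the outer index level.
    return pd.concat(pieces, keys=keys, names=[by])


# Tiny illustrative frame; the real df has about 1M rows.
df = pd.DataFrame({"some_columns": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
result = apply_with_loop(df, "some_columns", func)
print(result)
```

Collecting the pieces in a list and concatenating once avoids growing a DataFrame inside the loop, which copies the accumulated data on every iteration.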

Installed Versions

version: 2.0.3
pd.show_versions() raises an error:
SystemError: initialization of _internal failed without raising an exception
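When pd.show_versions() fails like this, the installed package versions can still be read from the package metadata without importing the packages themselves. This is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages are assumptions chosen for illustration, not part of the original report.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "numpy", "tqdm")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version}")
```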

Prior Performance

No response

Metadata

    Labels

    • Apply: Apply, Aggregate, Transform, Map
    • Groupby
    • Needs Info: Clarification about behavior needed to assess issue
    • Performance: Memory or execution speed performance