The following code runs a very simple groupby operation with both **pandas** and **cuDF**, so we can compare their performance.

We will execute this code using **Google Colab** to avoid any potential configuration issues.

Before starting, we must check that this notebook is **GPU enabled**. Then uncomment the line below to run on the GPU with cuDF, or leave it commented out to run on the CPU with plain pandas.

In [None]:
#%load_ext cudf.pandas
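
To confirm that the runtime actually has a GPU, a quick check like the one below can help. It is only a sketch: it assumes the `nvidia-smi` command-line tool is on the `PATH` (normally the case on GPU runtimes), and the helper name `gpu_available` is ours, not part of any library.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tooling found: almost certainly a CPU-only runtime.
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected: the cudf.pandas line can be uncommented.")
else:
    print("No GPU detected: keep the cudf.pandas line commented out.")
```

If no GPU is reported, change the runtime type to a GPU one before uncommenting the extension line.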

We will start by importing the libraries needed.

In [None]:
import time
import numpy as np
import pandas as pd

Here are the knobs you can tweak to change the size of our data.

In [None]:
N_ROWS   = 400_000_000  # total number of rows
N_GROUPS = 10_000       # number of distinct groups

Then, we create our working dataframe.

In [None]:
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "team": rng.integers(0, N_GROUPS, size=N_ROWS, dtype=np.int32),
    "points": rng.random(N_ROWS, dtype=np.float32) * 100.0,
    "assists": rng.random(N_ROWS, dtype=np.float32) * 10.0,
})
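
Before raising `N_ROWS`, it is worth estimating how much memory the dataframe needs. The sketch below simply multiplies the row count by the bytes per row (one `int32` and two `float32` columns, 4 bytes each). The names `BYTES_PER_ROW` and `estimate_gib` are ours, and the figure is a lower bound: temporary arrays created while building the columns and the working memory of the groupby itself come on top.

```python
# Bytes used per value for each column of the dataframe above:
# "team" is int32, "points" and "assists" are float32.
BYTES_PER_ROW = {"team": 4, "points": 4, "assists": 4}


def estimate_gib(n_rows: int) -> float:
    """Approximate size of the column data in GiB (excludes temporaries)."""
    return n_rows * sum(BYTES_PER_ROW.values()) / 2**30


for n in (1_000_000, 100_000_000, 400_000_000):
    print(f"{n:>13,} rows -> {estimate_gib(n):.2f} GiB")
```

With the default 400 million rows, the column data alone takes roughly 4.5 GiB, so it is sensible to compare this number against the RAM (and GPU memory) of the runtime before running the cell.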

Finally, we run the groupby operation, time it, and print the results.

In [None]:
t0 = time.perf_counter()
# basic groupby: mean points per team
mean_per_team = df.groupby("team")["points"].mean()
t1 = time.perf_counter()

print(f"groups: {mean_per_team.shape[0]}  |  rows: {len(df):,}")
print(f"groupby duration: {t1 - t0:.2f} s")
print(mean_per_team.head())

Keep in mind that a single timed run is only a rough measurement. There are many strategies to optimize this operation (and to benchmark it more rigorously).
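
One simple way to get a steadier measurement is to do a warm-up run and then repeat the operation several times, reporting the minimum and the median. The helper below is a minimal sketch of that idea using only the standard library. `time_it` and its parameters are hypothetical names of ours, and the stand-in workload is only there so that the block runs on its own.

```python
import statistics
import time


def time_it(fn, repeats=5, warmup=1):
    """Call fn a few times and return timing statistics in seconds."""
    # Warm-up runs are not timed: they absorb one-off costs such as
    # lazy initialization or cold caches.
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "runs": durations,
    }


# Stand-in workload; in this notebook you would pass
# lambda: df.groupby("team")["points"].mean() instead.
stats = time_it(lambda: sum(range(100_000)))
print(f"min: {stats['min']:.4f} s  |  median: {stats['median']:.4f} s")
```

Running the same helper once with the `cudf.pandas` line commented out and once with it enabled gives two sets of numbers that are easier to compare than two single runs.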