# GPU + Datasets: Starting with Pandas

This notebook shows how to **measure** where the GPU helps (and where it doesn’t) when working with datasets.

Important clarification up-front:
- **Pandas itself is CPU-based**. It does not execute groupby/join/filter on the GPU.
- To use a GPU for dataframe-style operations, you typically switch to **cuDF** (RAPIDS), which has a pandas-like API.
- A very common workflow is: **pandas for I/O + ETL → GPU tensors for heavy compute/training**.

This notebook demonstrates both patterns with safe fallbacks if GPU libs aren’t installed.

In [1]:
import platform
from time import perf_counter

import numpy as np
import pandas as pd

print('Platform:', platform.platform())
print('Python:', platform.python_version())
print('pandas:', pd.__version__)
print('numpy:', np.__version__)

# GPU visibility checks (best-effort)
gpu_notes = []
try:
    import torch
    print('torch:', torch.__version__)
    print('torch CUDA available:', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('torch GPU:', torch.cuda.get_device_name(0))
        gpu_notes.append('PyTorch sees a CUDA GPU')
    else:
        gpu_notes.append('PyTorch does not see a CUDA GPU')
except Exception as e:
    print('torch not available or failed to import:', repr(e))
    gpu_notes.append('PyTorch not available')

try:
    import cupy as cp
    print('cupy:', cp.__version__)
    n = cp.cuda.runtime.getDeviceCount()
    print('cupy device count:', n)
    if n > 0:
        gpu_notes.append('CuPy sees a CUDA GPU')
except Exception as e:
    print('cupy not available (this is OK):', repr(e))

print('\nNotes:', '; '.join(gpu_notes) if gpu_notes else '(none)')

Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python: 3.12.12
pandas: 2.3.3
numpy: 2.2.6
torch: 2.9.1
torch CUDA available: True
torch GPU: NVIDIA RTX A4000 Laptop GPU
cupy: 13.6.0
cupy device count: 1

Notes: PyTorch sees a CUDA GPU; CuPy sees a CUDA GPU


In [2]:
import importlib

for mod in ["cudf", "cudf.pandas"]:
    try:
        importlib.import_module(mod)
        print(mod, "OK")
    except Exception as e:
        print(mod, "FAILED:", repr(e))

from IPython import get_ipython
print("ipython:", get_ipython() is not None)

cudf OK
cudf.pandas OK
ipython: True


## Quick note: what is CuPy (and why it’s optional here)?

**CuPy** is a NumPy-like array library that runs many operations on an NVIDIA GPU via CUDA.
It’s useful when you want to do "NumPy-style" compute on the GPU (e.g., large elementwise ops, reductions, some linear algebra), without switching to a deep-learning framework.

In this notebook:
- We *do not require* CuPy. The import is a best-effort check so you can see whether it’s available in your environment.
- If CuPy is missing, that’s totally fine: the rest of the notebook still works (pandas on CPU, and PyTorch on GPU if available).

How it relates to PyTorch:
- **PyTorch** is usually the right choice for GPU-accelerated training and tensor compute for ML.
- **CuPy** can be handy for GPU-accelerating parts of a pipeline that feel like NumPy, or for quick GPU array experiments.

About Windows availability (high-level):
- CuPy can work on Windows, but you typically need a compatible NVIDIA driver + CUDA runtime/toolkit setup, and you must install a CuPy build that matches your CUDA version.
- In practice, many people find CuPy easiest to use on Linux/WSL for reproducible CUDA environments.
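
Because CuPy mirrors much of the NumPy API, array code can often be written once against a module handle. Here is a minimal sketch of that idea. The helper name `pick_array_backend` is hypothetical (it is not part of this notebook), and it falls back to NumPy when CuPy or a GPU is missing:

```python
import numpy as np


def pick_array_backend():
    """Return (module, name): CuPy if it imports and sees a GPU, else NumPy."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, "cupy"
    except Exception:
        pass  # CuPy missing or CUDA unusable: fall through to NumPy
    return np, "numpy"


xp, backend_name = pick_array_backend()

# The same expression runs on either backend.
x = xp.arange(1_000, dtype=xp.float32)
total = float((xp.sqrt(x) + xp.log1p(x)).sum())
print(backend_name, total)
```

The `float(...)` call copies the scalar result back to the host on the CuPy path, so it also acts as a synchronization point.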

## Optional: cuDF “pandas accelerator” (drop-in, when available)

If you have RAPIDS/cuDF installed (typically easiest on Linux/WSL), you can enable an optional **pandas accelerator**:
- It hooks into parts of the pandas API and runs supported operations on the GPU.
- It’s *not* guaranteed to accelerate every pandas operation; coverage depends on your versions and the operation types.
- If it’s not installed, the cell just prints a message and everything continues on CPU as usual.
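
One detail worth knowing: as far as we understand the RAPIDS documentation, the accelerator is meant to be activated *before* pandas is first imported (in a notebook, `%load_ext cudf.pandas` in the first cell). The sketch below uses the programmatic form `cudf.pandas.install()` with a CPU fallback. The `accelerated` flag and the `df_demo` frame are illustrative names, not part of this notebook:

```python
import sys

accelerated = False
if "pandas" not in sys.modules:
    # Only attempt activation when pandas has not been imported yet.
    try:
        import cudf.pandas
        cudf.pandas.install()
        accelerated = True
    except Exception as e:
        print("cudf.pandas not available, using CPU pandas:", repr(e))
else:
    print("pandas was already imported; activate cudf.pandas in a fresh kernel instead")

import pandas as pd

# The pandas code itself is unchanged either way.
df_demo = pd.DataFrame({"k": ["a", "b", "a"], "v": [1.0, 2.0, 3.0]})
sums = df_demo.groupby("k")["v"].sum()
print("accelerated:", accelerated)
print(sums)
```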

In [3]:
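# Optional GPU acceleration: enable cuDF's pandas accelerator if available
# Notes:
# - RAPIDS/cuDF is typically easiest on Linux/WSL (not native Windows).
# - The CUDA Toolkit works on native Windows, but RAPIDS libraries generally target Linux environments.
# - cudf.pandas is meant to be loaded before pandas is first imported. Cell 1 already
#   imported pandas, so the `pd` name bound there may still refer to CPU pandas.
# - If this isn't installed, we continue with regular CPU pandas.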

GPU_ACCEL = False
try:
    from IPython import get_ipython
    ip = get_ipython()  # available in notebooks
    if ip is not None:
        ip.run_line_magic('load_ext', 'cudf.pandas')
        GPU_ACCEL = True
except Exception:
    GPU_ACCEL = False

if GPU_ACCEL:
    print('GPU acceleration enabled via cudf.pandas')
    try:
        import cudf
        import cupy as cp
        print('cuDF:', cudf.__version__, '| CuPy:', cp.__version__)
        n = cp.cuda.runtime.getDeviceCount()
        print('GPU count:', n)
        for i in range(n):
            p = cp.cuda.runtime.getDeviceProperties(i)
            name = p['name'].decode() if isinstance(p.get('name'), (bytes, bytearray)) else str(p.get('name'))
            mem_gb = int(p['totalGlobalMem']) // (1024**3)
            print(f'[{i}] {name} - {mem_gb} GB')
    except Exception as e:
        print('Accelerator enabled, but failed to query GPU details:', repr(e))
else:
    print('GPU acceleration disabled (cudf.pandas not available).')

GPU acceleration enabled via cudf.pandas
cuDF: 25.12.00 | CuPy: 13.6.0
GPU count: 1
[0] NVIDIA RTX A4000 Laptop GPU - 7 GB


## Step 1 — Create a dataset in pandas (CPU)

We’ll create a synthetic dataset that looks like a common analytics table:
- `user_id` (categorical-ish id)
- `country` (category)
- `amount` (numeric)
- `timestamp` (integer)

Then we’ll benchmark typical operations: filter + groupby aggregation.

In [4]:
# Tune dataset size here.
# 1_000_000 rows is a reasonable starting point for timing experiments.
N = 1_000_000
rng = np.random.default_rng(0)

countries = np.array(['US', 'CA', 'MX', 'BR', 'GB', 'DE', 'FR', 'IN', 'JP', 'AU'])

t0 = perf_counter()
df = pd.DataFrame({
    'user_id': rng.integers(0, 200_000, size=N, dtype=np.int32),
    'country': rng.choice(countries, size=N),
    'amount': rng.gamma(shape=2.0, scale=20.0, size=N).astype('float32'),
    'timestamp': rng.integers(1_700_000_000, 1_720_000_000, size=N, dtype=np.int64),
})

# Make `country` a categorical column (common in real datasets)
df['country'] = df['country'].astype('category')

t1 = perf_counter()
print('Created df:', df.shape)
print('Create time: %.3fs' % (t1 - t0))
print('Memory usage (MB):', round(df.memory_usage(deep=True).sum() / (1024**2), 2))
df.head()

Created df: (1000000, 4)
Create time: 0.216s
Memory usage (MB): 16.21


| | user_id | country | amount | timestamp |
|---|---|---|---|---|
| 0 | 170124 | US | 39.621334 | 1710486851 |
| 1 | 127392 | CA | 3.628782 | 1708395330 |
| 2 | 102227 | BR | 141.211502 | 1707076564 |
| 3 | 53957 | MX | 18.642044 | 1701269503 |
| 4 | 61565 | JP | 66.881958 | 1701723546 |
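
The reported memory usage can be sanity-checked from the column dtypes alone. This is a rough sketch that assumes pandas stores the categorical codes for 10 categories as `int8`, and it ignores the small overhead of the index and the category labels:

```python
import numpy as np

n_rows = 1_000_000
bytes_per_row = {
    "user_id": np.dtype("int32").itemsize,    # 4 bytes
    "country": np.dtype("int8").itemsize,     # 1 byte per categorical code (assumed int8)
    "amount": np.dtype("float32").itemsize,   # 4 bytes
    "timestamp": np.dtype("int64").itemsize,  # 8 bytes
}
row_bytes = sum(bytes_per_row.values())
total_mib = n_rows * row_bytes / 1024**2
print(row_bytes, "bytes/row ->", round(total_mib, 2), "MiB")
```

This lands on the same 16.21 figure printed above, which suggests nearly all of the memory is in the four column buffers.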


## Step 2 — CPU baseline: groupby + aggregation in pandas

This is the kind of operation people often want to accelerate.

We’ll do:
- filter to a subset of users
- group by `country`
- compute `count`, `mean(amount)`, `sum(amount)`

We time it to create a baseline.

In [5]:
t0 = perf_counter()

filtered = df[df['user_id'] < 50_000]
result_cpu = (
    filtered.groupby('country', observed=True)['amount']
    .agg(['count', 'mean', 'sum'])
    .sort_values('sum', ascending=False)
)

t1 = perf_counter()
print('CPU (pandas) time: %.3fs' % (t1 - t0))
result_cpu

CPU (pandas) time: 0.018s


| country | count | mean | sum |
|---|---|---|---|
| IN | 24956 | 40.3256 | 1006366.0 |
| US | 25134 | 39.968784 | 1004575.0 |
| DE | 25033 | 40.089798 | 1003568.0 |
| MX | 25151 | 39.849327 | 1002250.0 |
| GB | 25064 | 39.952671 | 1001374.0 |
| JP | 24896 | 40.14114 | 999353.8 |
| AU | 25051 | 39.864948 | 998656.8 |
| CA | 24849 | 39.827011 | 989661.4 |
| FR | 24794 | 39.868832 | 988507.8 |
| BR | 24650 | 40.077881 | 987919.8 |


## Step 2.5 — Optional: NumPy vs CuPy micro-benchmarks (GPU arrays)

This section answers: **if we move numeric arrays to the GPU, is it faster?**

> Important: this does *not* accelerate pandas `groupby` (pandas runs on the CPU).
> CuPy helps when your workload looks like **NumPy math** (elementwise ops, reductions, matrix multiplications, etc.).

We measure three things separately:
- Host→Device transfer (CPU → GPU copy)
- GPU compute time (with synchronization)
- Device→Host transfer (GPU → CPU copy)

In [6]:
# If CuPy isn't installed, this cell will print a message and skip.
# Install (optional) in this repo with:  uv sync --extra gpu

from time import perf_counter

import numpy as np

def _ms(dt: float) -> float:
    return 1e3 * dt

def _summarize_ms(name: str, values_ms: list[float]) -> None:
    values_ms = [float(v) for v in values_ms]
    values_ms_sorted = sorted(values_ms)
    mid = values_ms_sorted[len(values_ms_sorted) // 2]
    best = values_ms_sorted[0]
    worst = values_ms_sorted[-1]
    print(f'{name}: best={best:.2f} ms | median={mid:.2f} ms | worst={worst:.2f} ms | runs={len(values_ms_sorted)}')

try:
    import cupy as cp
except Exception as e:
    cp = None
    print('CuPy not available (skipping):', repr(e))

if cp is not None:
    # Basic CUDA sanity check
    try:
        n = cp.cuda.runtime.getDeviceCount()
    except Exception as e:
        n = 0
        print('CuPy installed, but CUDA is not usable (skipping):', repr(e))

    if n <= 0:
        print('No CUDA GPU visible to CuPy (skipping).')
    else:
        print('CuPy:', cp.__version__)
        p0 = cp.cuda.runtime.getDeviceProperties(0)
        name0 = p0['name'].decode() if isinstance(p0.get('name'), (bytes, bytearray)) else str(p0.get('name'))
        mem0_gb = int(p0['totalGlobalMem']) // (1024**3)
        print(f'[0] {name0} - {mem0_gb} GB')

        # -----------------------------
        # Benchmark 1: elementwise + reduction
        # -----------------------------
        # Prefer using the pandas dataframe if you've already created it (Step 1).
        # Otherwise fall back to a synthetic array so this cell can run standalone.
        if 'df' in globals() and hasattr(globals()['df'], '__getitem__') and 'amount' in globals()['df'].columns:
            x_np = globals()['df']['amount'].to_numpy(dtype='float32')
            print('\nUsing df["amount"] from Step 1:', x_np.shape)
        else:
            x_np = np.random.default_rng(0).gamma(shape=2.0, scale=20.0, size=1_000_000).astype('float32')
            print('\nUsing synthetic x_np:', x_np.shape)

        # NumPy baseline (CPU) — run a few times and summarize
        _ = np.log1p(x_np[:10])  # tiny warmup
        cpu_times = []
        for _ in range(5):
            t0 = perf_counter()
            y_np = np.sqrt(x_np) + np.log1p(x_np)
            s_np = float(y_np.sum())
            t1 = perf_counter()
            cpu_times.append(_ms(t1 - t0))
        _summarize_ms('[Elemwise+sum] NumPy CPU', cpu_times)
        print('[Elemwise+sum] NumPy sum:', f'{s_np:.3e}')

        # CuPy: measure transfer + GPU compute (separately), and optional D2H copy
        # Warmup to reduce one-time init effects
        _warm = cp.asarray(x_np[:1024])
        _warm = cp.sqrt(_warm)
        cp.cuda.Stream.null.synchronize()

        # Host -> Device copy
        t0 = perf_counter()
        x_cp = cp.asarray(x_np)
        cp.cuda.Stream.null.synchronize()
        t1 = perf_counter()
        print('[Elemwise] CuPy H2D:', f'{_ms(t1 - t0):.2f} ms')

        # GPU compute timing using CUDA events (more stable than perf_counter for GPU ops)
        # We exclude the initial run from stats to reduce startup effects.
        gpu_times = []
        for rep in range(6):
            start = cp.cuda.Event()
            end = cp.cuda.Event()
            start.record()
            y_cp = cp.sqrt(x_cp) + cp.log1p(x_cp)
            end.record()
            end.synchronize()
            dt_ms = cp.cuda.get_elapsed_time(start, end)
            if rep > 0:
                gpu_times.append(float(dt_ms))
        _summarize_ms('[Elemwise] CuPy GPU compute', gpu_times)

        # Reduction + scalar copy back (small, but we show it explicitly)
        t0 = perf_counter()
        s_cp = float(cp.sum(y_cp).get())
        t1 = perf_counter()
        print('[Elemwise+sum] CuPy reduction+scalar D2H:', f'{_ms(t1 - t0):.2f} ms')
        print('[Elemwise+sum] CuPy sum:', f'{s_cp:.3e}')

        # Optional: cost to copy a large result array back to CPU
        t0 = perf_counter()
        _ = cp.asnumpy(y_cp)
        t1 = perf_counter()
        print('[Elemwise] CuPy full-array D2H:', f'{_ms(t1 - t0):.2f} ms')

        # -----------------------------
        # Benchmark 2: matrix multiply (compute-only, GPU resident)
        # -----------------------------
        # For small matrices the CPU can be surprisingly competitive;
        # increase N to see clearer GPU wins (watch VRAM).
        # Note: this rebinds the notebook-level names N and rng from Step 1.
        N = 4096 if mem0_gb >= 7 else 3072
        rng = np.random.default_rng(0)

        def _cpu_matmul_ms(A, B):
            _ = A @ B  # warmup
            times = []
            for _ in range(3):
                t0 = perf_counter()
                _ = A @ B
                t1 = perf_counter()
                times.append(_ms(t1 - t0))
            return times

        def _gpu_matmul_ms(Ag, Bg):
            _ = Ag @ Bg  # warmup
            cp.cuda.Stream.null.synchronize()
            times = []
            Cg = None
            for rep in range(6):
                start = cp.cuda.Event()
                end = cp.cuda.Event()
                start.record()
                Cg = Ag @ Bg
                end.record()
                end.synchronize()
                dt_ms = cp.cuda.get_elapsed_time(start, end)
                if rep > 0:
                    times.append(float(dt_ms))
            return times, Cg

        try:
            A = rng.standard_normal((N, N), dtype=np.float32)
            B = rng.standard_normal((N, N), dtype=np.float32)
        except MemoryError:
            N = 2048
            A = rng.standard_normal((N, N), dtype=np.float32)
            B = rng.standard_normal((N, N), dtype=np.float32)
            print('Fell back to N=2048 due to host memory limits.')

        cpu_mm_times = _cpu_matmul_ms(A, B)
        _summarize_ms(f'[Matmul {N}x{N}] NumPy CPU', cpu_mm_times)

        Ag = cp.asarray(A)
        Bg = cp.asarray(B)
        gpu_mm_times, Cg = _gpu_matmul_ms(Ag, Bg)
        _summarize_ms(f'[Matmul {N}x{N}] CuPy GPU compute', gpu_mm_times)

        t0 = perf_counter()
        _ = cp.asnumpy(Cg)
        t1 = perf_counter()
        print(f'[Matmul {N}x{N}] CuPy D2H (result copy): {_ms(t1 - t0):.2f} ms')

CuPy: 13.6.0
[0] NVIDIA RTX A4000 Laptop GPU - 7 GB

Using df["amount"] from Step 1: (1000000,)
[Elemwise+sum] NumPy CPU: best=1.25 ms | median=1.60 ms | worst=2.92 ms | runs=5
[Elemwise+sum] NumPy sum: 9.409e+06
[Elemwise] CuPy H2D: 6.16 ms
[Elemwise] CuPy GPU compute: best=1.81 ms | median=1.96 ms | worst=11.81 ms | runs=5
[Elemwise+sum] CuPy reduction+scalar D2H: 27.09 ms
[Elemwise+sum] CuPy sum: 9.409e+06
[Elemwise] CuPy full-array D2H: 3.02 ms
[Matmul 4096x4096] NumPy CPU: best=311.88 ms | median=314.66 ms | worst=316.54 ms | runs=3
[Matmul 4096x4096] CuPy GPU compute: best=17.68 ms | median=22.55 ms | worst=24.92 ms | runs=5
[Matmul 4096x4096] CuPy D2H (result copy): 31.83 ms
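
To read these numbers as end-to-end costs, it helps to add the transfer times to the GPU compute time. The sketch below just does that arithmetic on the medians and single-shot timings printed above (they will differ on other machines). Note that the CPU elementwise figure includes the `sum`, while the GPU compute figure does not, so the comparison is only approximate:

```python
# Timings (ms) copied from the output above.
elem = {"cpu": 1.60, "h2d": 6.16, "gpu": 1.96, "d2h": 3.02}
matmul = {"cpu": 314.66, "gpu": 22.55, "d2h": 31.83}  # H2D was not timed for matmul

# Full round trip for the elementwise case: copy in, compute, copy the array back.
elem_gpu_total = elem["h2d"] + elem["gpu"] + elem["d2h"]
# For matmul we only have compute plus the result copy.
matmul_gpu_total = matmul["gpu"] + matmul["d2h"]

print(f"elementwise: CPU {elem['cpu']:.2f} ms vs GPU round trip {elem_gpu_total:.2f} ms")
print(f"matmul: CPU {matmul['cpu']:.2f} ms vs GPU compute + D2H {matmul_gpu_total:.2f} ms "
      f"({matmul['cpu'] / matmul_gpu_total:.1f}x)")
```

On this run the cheap elementwise op is dominated by transfers, while the matmul stays several times faster on the GPU even after paying for the result copy.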


## Step 2.6 — “Groupby-like” aggregation without cuDF (NumPy vs CuPy)

If you specifically want to compare **pandas-style aggregation** with a **CuPy implementation**, here’s a good pattern:
- Use pandas once to do the *dataframe-ish* parts (I/O, cleaning, categories).
- Convert the relevant columns to plain arrays.
- Implement the aggregation using `bincount` (fast on CPU with NumPy, and on GPU with CuPy).

This is not “CuPy running pandas”, but it *does* run the same math as a common pandas `groupby(...).agg(...)` using GPU arrays.

In [7]:
# Benchmark: pandas groupby vs NumPy bincount vs CuPy bincount
#
# This is a nice “apples-to-apples” idea because all three compute the same outputs:
#   count(country), sum(amount), mean(amount)
# but via different execution backends (pandas/CPU, numpy/CPU, cupy/GPU).

from time import perf_counter

import numpy as np
import pandas as pd

def _ms(dt: float) -> float:
    return 1e3 * dt

def _summarize_ms(name: str, values_ms: list[float]) -> None:
    values_ms = [float(v) for v in values_ms]
    values_ms_sorted = sorted(values_ms)
    mid = values_ms_sorted[len(values_ms_sorted) // 2]
    best = values_ms_sorted[0]
    worst = values_ms_sorted[-1]
    print(f'{name}: best={best:.2f} ms | median={mid:.2f} ms | worst={worst:.2f} ms | runs={len(values_ms_sorted)}')

if 'df' not in globals():
    raise RuntimeError('Run Step 1 first so df exists (the dataset creation cell).')

if 'country' not in df.columns or 'amount' not in df.columns:
    raise RuntimeError('df must contain country and amount columns.')

# Ensure categorical encoding exists (so we can map country -> small int codes)
if not isinstance(df['country'].dtype, pd.CategoricalDtype):
    df['country'] = df['country'].astype('category')

codes_np = df['country'].cat.codes.to_numpy(dtype=np.int32)
amount_np = df['amount'].to_numpy(dtype=np.float32)
k = int(df['country'].cat.categories.size)
labels = df['country'].cat.categories.to_list()

# -----------------------------
# 1) pandas groupby baseline (CPU)
# -----------------------------
pandas_times = []
for _ in range(3):
    t0 = perf_counter()
    gb = df.groupby('country', observed=True)['amount'].agg(['count', 'sum', 'mean']).sort_index()
    t1 = perf_counter()
    pandas_times.append(_ms(t1 - t0))
_summarize_ms('[Groupby] pandas (CPU)', pandas_times)

# -----------------------------
# 2) NumPy implementation (CPU)
# -----------------------------
numpy_times = []
for _ in range(5):
    t0 = perf_counter()
    counts = np.bincount(codes_np, minlength=k).astype(np.int64)
    sums = np.bincount(codes_np, weights=amount_np, minlength=k).astype(np.float64)
    means = sums / np.maximum(counts, 1)
    t1 = perf_counter()
    numpy_times.append(_ms(t1 - t0))
_summarize_ms('[Groupby] NumPy bincount (CPU)', numpy_times)

# Build a comparable table (CPU)
numpy_tbl = pd.DataFrame({'count': counts, 'sum': sums, 'mean': means}, index=labels)
numpy_tbl.index.name = 'country'
numpy_tbl = numpy_tbl.sort_index()

# Quick correctness check (vs pandas)
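# Note: sums can differ slightly. pandas appears to return float32 sums for this float32
# column, while bincount accumulates weights in float64 (reduction order can also differ).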
max_abs_sum_diff = float(np.max(np.abs(numpy_tbl['sum'].to_numpy() - gb['sum'].to_numpy())))
max_abs_mean_diff = float(np.max(np.abs(numpy_tbl['mean'].to_numpy() - gb['mean'].to_numpy())))
print('NumPy vs pandas max abs diff | sum:', f'{max_abs_sum_diff:.6g}', '| mean:', f'{max_abs_mean_diff:.6g}')

# -----------------------------
# 3) CuPy implementation (GPU)
# -----------------------------
try:
    import cupy as cp
except Exception as e:
    cp = None
    print('CuPy not available (skipping GPU bincount):', repr(e))

if cp is not None:
    # Host -> Device transfers
    t0 = perf_counter()
    codes_cp = cp.asarray(codes_np)
    amount_cp = cp.asarray(amount_np)
    cp.cuda.Stream.null.synchronize()
    t1 = perf_counter()
    print('[Groupby] CuPy H2D:', f'{_ms(t1 - t0):.2f} ms')

    # GPU compute timing using CUDA events
    gpu_times = []
    counts_cp = None
    sums_cp = None
    for rep in range(6):
        start = cp.cuda.Event()
        end = cp.cuda.Event()
        start.record()
        counts_cp = cp.bincount(codes_cp, minlength=k)
        sums_cp = cp.bincount(codes_cp, weights=amount_cp, minlength=k)
        end.record()
        end.synchronize()
        dt_ms = float(cp.cuda.get_elapsed_time(start, end))
        if rep > 0:
            gpu_times.append(dt_ms)
    _summarize_ms('[Groupby] CuPy bincount GPU compute', gpu_times)

    # Device -> Host transfer
    t0 = perf_counter()
    counts2 = cp.asnumpy(counts_cp).astype(np.int64)
    sums2 = cp.asnumpy(sums_cp).astype(np.float64)
    t1 = perf_counter()
    print('[Groupby] CuPy D2H:', f'{_ms(t1 - t0):.2f} ms')

    means2 = sums2 / np.maximum(counts2, 1)
    cupy_tbl = pd.DataFrame({'count': counts2, 'sum': sums2, 'mean': means2}, index=labels)
    cupy_tbl.index.name = 'country'
    cupy_tbl = cupy_tbl.sort_index()

    max_abs_sum_diff = float(np.max(np.abs(cupy_tbl['sum'].to_numpy() - gb['sum'].to_numpy())))
    max_abs_mean_diff = float(np.max(np.abs(cupy_tbl['mean'].to_numpy() - gb['mean'].to_numpy())))
    print('CuPy vs pandas max abs diff | sum:', f'{max_abs_sum_diff:.6g}', '| mean:', f'{max_abs_mean_diff:.6g}')

    # Show just the top few rows to keep notebook output small
    display(pd.concat([gb.add_prefix('pandas_'), cupy_tbl.add_prefix('cupy_')], axis=1).head())
else:
    display(pd.concat([gb.add_prefix('pandas_'), numpy_tbl.add_prefix('numpy_')], axis=1).head())

[Groupby] pandas (CPU): best=21.48 ms | median=22.12 ms | worst=25.41 ms | runs=3
[Groupby] NumPy bincount (CPU): best=3.91 ms | median=5.80 ms | worst=10.68 ms | runs=5
NumPy vs pandas max abs diff | sum: 0.119151 | mean: 2.92245e-06
[Groupby] CuPy H2D: 5.89 ms
[Groupby] CuPy bincount GPU compute: best=4.34 ms | median=4.56 ms | worst=11.08 ms | runs=5
[Groupby] CuPy D2H: 0.30 ms
CuPy vs pandas max abs diff | sum: 0.119151 | mean: 2.92245e-06


| country | pandas_count | pandas_sum | pandas_mean | cupy_count | cupy_sum | cupy_mean |
|---|---|---|---|---|---|---|
| AU | 100339 | 4005580.5 | 39.920475 | 100339 | 4005580.0 | 39.920474 |
| BR | 99854 | 3988503.0 | 39.943348 | 99854 | 3988503.0 | 39.943348 |
| CA | 100010 | 4001330.5 | 40.009304 | 100010 | 4001330.0 | 40.009303 |
| DE | 100048 | 3998313.25 | 39.963951 | 100048 | 3998313.0 | 39.963951 |
| FR | 99808 | 3986569.75 | 39.942387 | 99808 | 3986570.0 | 39.942386 |
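
The max abs sum difference of about 0.119 looks large, but it is consistent with float32 rounding at this magnitude. The `pandas_sum` values above are all multiples of 0.25, which suggests (an inference, not something the notebook verifies) that pandas returned float32 sums, while `bincount` produces float64. A small sketch of the arithmetic:

```python
import numpy as np

# float32 spacing near the per-country sums (about 4.0e6) printed above
ulp = float(np.spacing(np.float32(4_005_580.5)))
print("float32 spacing:", ulp, "| worst-case rounding error:", ulp / 2)

# np.bincount returns float64 sums even when the weights are float32
codes = np.array([0, 1, 0, 1], dtype=np.int32)
weights = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
sums64 = np.bincount(codes, weights=weights, minlength=2)
print("bincount dtype:", sums64.dtype)
```

A difference below half the float32 spacing is what you would expect from rounding a float64 sum to float32, so the check above passes in the sense that matters.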


## Step 3 — Optional: GPU dataframe acceleration with cuDF

If you have a CUDA GPU and install RAPIDS/cuDF, you can run **pandas-like dataframe operations on the GPU**.

Notes:
- cuDF wheels are not available for every OS/version combination; it’s typically easiest to install on Linux/WSL.
- If cuDF isn’t installed, this section will skip.

What we do:
- Convert pandas → cuDF
- Run the same filter + groupby aggregation
- Compare timing

In [8]:
use_cudf = False
try:
    import cudf
    use_cudf = True
    print('cudf:', cudf.__version__)
except Exception as e:
    print('cuDF not available (skipping GPU dataframe benchmark).')
    print('Reason:', repr(e))
    print('If you want this section, you typically install RAPIDS/cuDF on Linux/WSL.')

use_cudf

cudf: 25.12.00


True

In [9]:
if use_cudf:
    # Convert pandas -> cuDF (this costs time too, so we measure it separately)
    t0 = perf_counter()
    gdf = cudf.from_pandas(df)
    t1 = perf_counter()
    print('Converted to cuDF:', tuple(gdf.shape))
    print('Convert time: %.3fs' % (t1 - t0))

    # Run the same style of operation on GPU
    t0 = perf_counter()
    gfiltered = gdf[gdf['user_id'] < 50_000]
    result_gpu = (
        gfiltered.groupby('country')['amount']
        .agg(['count', 'mean', 'sum'])
        .sort_values('sum', ascending=False)
    )
    # Copy the result back to the host before stopping the timer so all GPU work is included
    _ = result_gpu.to_pandas()
    t1 = perf_counter()
    print('GPU (cuDF) time (includes to_pandas materialization): %.3fs' % (t1 - t0))

    display(result_gpu.head())  # a bare expression inside an if-block is not auto-displayed
else:
    print('Skipping: cuDF not installed.')

Converted to cuDF: (1000000, 4)
Convert time: 0.023s
GPU (cuDF) time (includes to_pandas materialization): 0.171s
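
The cuDF timing above comes from a single first call, which may include one-time GPU setup costs, so it is not necessarily representative. A fairer comparison warms up first and then repeats. Below is a minimal sketch using a hypothetical helper `time_repeated` and a small stand-in frame `demo` (both are illustrative, not part of this notebook). The cuDF branch is skipped if cuDF is not installed:

```python
from time import perf_counter

import numpy as np
import pandas as pd


def time_repeated(fn, warmup=1, repeats=5):
    """Call fn() `warmup` times untimed, then return `repeats` timings in ms."""
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(repeats):
        t0 = perf_counter()
        fn()
        times_ms.append(1e3 * (perf_counter() - t0))
    return times_ms


# Small stand-in frame so the sketch runs without a GPU.
rng_demo = np.random.default_rng(0)
demo = pd.DataFrame({
    "country": rng_demo.choice(["US", "CA", "MX"], size=100_000),
    "amount": rng_demo.gamma(2.0, 20.0, size=100_000),
})


def cpu_groupby():
    return demo.groupby("country")["amount"].agg(["count", "mean", "sum"])


cpu_ms = time_repeated(cpu_groupby)
print(f"pandas: median={sorted(cpu_ms)[len(cpu_ms) // 2]:.2f} ms")

try:
    import cudf
    gdemo = cudf.from_pandas(demo)
    # to_pandas() copies the result back, so each timing covers the full GPU round trip.
    gpu_ms = time_repeated(
        lambda: gdemo.groupby("country")["amount"].agg(["count", "mean", "sum"]).to_pandas()
    )
    print(f"cuDF: median={sorted(gpu_ms)[len(gpu_ms) // 2]:.2f} ms")
except Exception as e:
    print("cuDF not available (skipping):", repr(e))
```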


## Step 4 — Practical pattern: pandas ETL → GPU tensors for compute/training

Even if you don’t use cuDF, you can still leverage the GPU by moving numeric arrays to GPU tensors.

This is a very common real pattern:
- Use **pandas** for cleaning, joining, feature engineering, encoding categories, etc.
- Convert to **NumPy** arrays and then to **PyTorch** tensors
- Run heavy compute/training on the GPU

Below we create a tiny supervised learning task from the dataframe and train a small model.

In [10]:
try:
    import torch
    from torch import nn
except Exception as e:
    raise RuntimeError('PyTorch is required for this section. Run `uv sync`.') from e

# Create a simple target label from the data (synthetic but learnable):
# label = 1 if amount is high AND user_id is in a certain range.
# This is not a "real" ML problem, but it demonstrates the pipeline.
features = df[['user_id', 'amount']].copy()
features['user_id'] = (features['user_id'] / 200_000.0).astype('float32')
X = features.to_numpy(dtype='float32')
y = ((df['amount'].to_numpy(dtype='float32') > 60.0) & (df['user_id'].to_numpy() < 50_000)).astype('int64')

# Train/test split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Training device:', device)

X_train_t = torch.from_numpy(X_train).to(device)
y_train_t = torch.from_numpy(y_train).to(device)
X_test_t = torch.from_numpy(X_test).to(device)
y_test_t = torch.from_numpy(y_test).to(device)

model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 2)
).to(device)

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Simple training loop (full-batch for clarity)
t0 = perf_counter()
for epoch in range(10):
    model.train()
    logits = model(X_train_t)
    loss = loss_fn(logits, y_train_t)
    opt.zero_grad()
    loss.backward()
    opt.step()

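if device == 'cuda':
    torch.cuda.synchronize()  # CUDA calls are asynchronous; wait so the timer covers all queued work
t1 = perf_counter()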
print('Train time (10 epochs): %.3fs' % (t1 - t0))

# Evaluate
model.eval()
with torch.no_grad():
    pred = model(X_test_t).argmax(dim=1)
    acc = (pred == y_test_t).float().mean().item()
print('Test accuracy:', round(acc, 4))

Training device: cuda
Train time (10 epochs): 0.299s
Test accuracy: 0.9502
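
Before reading too much into this accuracy, compare it with the majority-class baseline. The sketch below computes the expected fraction of positive labels analytically, assuming `amount ~ Gamma(shape=2, scale=20)` and `user_id` uniform on `[0, 200_000)`, independent of each other (as in Step 1):

```python
import math

# For Gamma(shape=2, scale=s): P(X > x) = exp(-x/s) * (1 + x/s)
scaled_threshold = 60.0 / 20.0                      # x / scale = 3
p_amount_high = math.exp(-scaled_threshold) * (1 + scaled_threshold)
p_user_in_range = 50_000 / 200_000                  # uniform user_id below 50_000
p_positive = p_amount_high * p_user_in_range        # label == 1 requires both conditions
majority_baseline = 1 - p_positive                  # accuracy of always predicting 0

print(f"expected positive rate: {p_positive:.4f}")
print(f"always-predict-0 accuracy: {majority_baseline:.4f}")
```

The baseline comes out very close to the accuracy reported above, so this short training run may simply be predicting the majority class. Checking `pred.sum()` or a metric such as recall on the positive class would tell you whether the model learned the rule.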


## Optional: download a real dataset (small)

If you want to practice with a real CSV without huge downloads, this cell tries to pull a small dataset from GitHub.
If you don’t have internet access, it will fail gracefully.

(We keep this optional to avoid slowing down the main GPU experiments.)

In [11]:
import pandas as pd

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'
try:
    penguins = pd.read_csv(url)
    print('Downloaded penguins:', penguins.shape)
    display(penguins.head())
except Exception as e:
    print('Download failed (this is OK):', repr(e))

Downloaded penguins: (344, 7)




| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
