# 📈 Results

This notebook analyses the results of the experiments tracked in W&B.


## Setup 

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import autorootcwd

In [3]:
# Imports
from typing import Dict

import wandb
from wandb.sdk.wandb_run import Run

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [4]:
# Constants
WANDB_ENTITY = "mikasenghaas"
WANDB_PROJECT = "swarm"

In [5]:
# Helpers
def get_gpu(run: Run) -> Dict:
    if "gpu_nvidia" in run.metadata:
        gpu = run.metadata["gpu_nvidia"][0]
        return {"name": gpu["name"], "memory": gpu["memoryTotal"], "count": len(run.metadata["gpu_nvidia"])}
    elif "gpuapple" in run.metadata:
        return {"name": run.metadata["gpuapple"]["gpuType"], "count": 1}
    else:
        return {"name": "Unknown"}

def get_config(run: Run) -> Dict:
    return {**run.config, "gpu": get_gpu(run)}

def get_history(run: Run) -> pd.DataFrame:
    # Tag every logged row with the run id and use it as the index
    return run.history().assign(run_id=run.id).set_index("run_id")

def get_summary(run: Run) -> pd.DataFrame:
    return pd.DataFrame([dict(run.summary)], index=[run.id])

In [6]:
# Styling
sns.set_theme(style="whitegrid")
sns.set_palette("Blues_r")

In [None]:
# Initialize W&B
api = wandb.Api()

# Get runs
ALL_RUNS = api.runs(f"{WANDB_ENTITY}/{WANDB_PROJECT}")
print(f"✅ Loaded {len(ALL_RUNS)} runs from W&B ({WANDB_ENTITY}/{WANDB_PROJECT})")

## Verification


### Experiment 1: Verify Gradient Accumulation

This experiment verifies that gradient accumulation works as expected. We train a model with the debug configuration locally (Apple M1), varying the micro-batch size while keeping the global batch size fixed.

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace?nw=dm6rh6z8t14)

In [None]:
# Load runs
GROUP = "verify/grad-acc"
RUNS = [r for r in ALL_RUNS if r.group == GROUP]

print(f"✅ Loaded {len(RUNS)} runs for experiment {GROUP}")

In [9]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in RUNS}
runs_summary = pd.concat([get_summary(r) for r in RUNS])
runs_history = pd.concat([get_history(r) for r in RUNS])

In [None]:
# Plot loss by step
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.lineplot(data=runs_history, x="_step", y="train/loss/current", hue="run_id", marker="o", ax=ax[0])
sns.lineplot(data=runs_history, x="_step", y="train/loss/average", hue="run_id", marker="o", ax=ax[1])
ax[0].set_title("Loss by Step")
ax[1].set_title("Loss by Step (Average)")
for a in ax:
    a.set_xlabel("Step")
    a.set_ylabel("Loss")
plt.show();

Nice, gradient accumulation works. At every step, we accumulate gradients over a different number of micro-batches, and then we perform the same gradient update.
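
The idea can be illustrated with a minimal, self-contained sketch. It is not the project's training code: the toy linear model, the data, and the helper names `grad_mse` and `accumulated_grad` are all assumptions made for illustration. It shows that micro-batch gradients, each weighted by its share of the global batch, sum to the full-batch gradient regardless of the micro-batch size.

```python
# Toy 1-D linear model y_hat = w * x with a mean squared error loss.
# Data and helper names are illustrative, not taken from the training code.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each weighted by its share of the global batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += grad_mse(w, mx, my) * len(mx) / n
    return total

# One global batch of 8 samples (made-up values)
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, -2.5, 4.0, 2.5, -1.0, 6.5, 0.0, -4.0]
w = 0.3

full = grad_mse(w, xs, ys)
for mbs in (1, 2, 4, 8):
    acc = accumulated_grad(w, xs, ys, mbs)
    # Equal up to floating-point rounding
    assert abs(acc - full) < 1e-9
    print(f"micro_batch_size={mbs}: grad={acc:.6f} (full batch: {full:.6f})")
```

Since the accumulated gradient is the same, the optimizer step is the same too, which is why only wall-time and throughput should differ between the runs.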

In [None]:
# Plot Wall-Time by Run
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.barplot(data=runs_summary, x=runs_summary.index, y="_runtime", ax=ax[0])
sns.barplot(data=runs_summary, x=runs_summary.index, y="train/throughput/average", ax=ax[1])
ax[0].set_title("Wall-Time by Run")
ax[1].set_title("Throughput by Run")
ax[0].set_ylabel("Wall-Time (s)")
ax[1].set_ylabel("Throughput (T/s)")
for a in ax:
    a.set_xlabel("Micro-Batch Size")
    a.set_xticks(range(len(runs_summary)))
    a.set_xticklabels([runs_config[run_id]['train']['micro_batch_size'] for run_id in runs_summary.index]);
plt.show();

We see that the wall-time decreases with increasing micro-batch size, as expected. This is because we process more tokens per second (we use the GPU hardware more efficiently).

### Experiment 2: Cosine LR Scheduler

This experiment verifies that the cosine learning rate scheduling works as expected, i.e. the learning rate is 0 at the start and increases linearly for `train.scheduler.warmup_steps` steps. After the warmup, the cosine schedule kicks in and the learning rate decays along a cosine annealing curve until it reaches the minimum learning rate, which is `train.scheduler.min_lr_factor` times the initial learning rate. The experiment uses the debug configuration, is launched from the script `experiments/verify/scheduler.sh`, and runs locally on an Apple M1.

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace)

In [None]:
# Load runs
GROUP = "verify/scheduler"
RUNS = [r for r in ALL_RUNS if r.group == GROUP]

print(f"✅ Loaded {len(RUNS)} runs for experiment {GROUP}")

In [13]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in RUNS}
runs_summary = pd.concat([get_summary(r) for r in RUNS])
runs_history = pd.concat([get_history(r) for r in RUNS])

In [None]:
# Plot learning rate patterns
fig, ax = plt.subplots(figsize=(12, 6), dpi=300)
sns.lineplot(data=runs_history, x="_step", y="train/learning_rate/current", hue="run_id", ax=ax)
ax.set_title("Learning Rate by Step (All Runs)")
ax.set_xlabel("Step")
ax.set_ylabel("Learning Rate")
plt.legend(title="Run ID")

# Create custom legend with scheduler configuration
run_ids = runs_config.keys()
enable = [runs_config[run_id]['train']['scheduler']['enable'] for run_id in run_ids]
warmup_steps = [runs_config[run_id]['train']['scheduler']['warmup_steps'] for run_id in run_ids]
min_lr_factor = [runs_config[run_id]['train']['scheduler']['min_lr_factor'] for run_id in run_ids]

legend_elements = []
for run_id, e, w, m in zip(run_ids, enable, warmup_steps, min_lr_factor):
    color = ax.get_lines()[list(run_ids).index(run_id)].get_color()
    legend_elements.append(plt.Line2D([0], [0], color=color, lw=2, label=f"{run_id} (enable={e}, warmup_steps={w}, min_lr_factor={m})"))

ax.legend(handles=legend_elements, title="Scheduler Config")

plt.tight_layout()
plt.show();

Nice, looks good. The hyperparameters affect the learning rate pattern as expected:

- `enable`: With `False`, the learning rate stays constant at the initial learning rate; otherwise it follows a cosine annealing pattern.
- `warmup_steps`: The learning rate increases linearly from 0 to the initial learning rate over `warmup_steps` steps.
- `min_lr_factor`: By the end of training, the learning rate has decayed to `min_lr_factor` times the initial learning rate.
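
For reference, the schedule described in this experiment can be written down as a small function. This is a sketch under assumptions, not the project's scheduler: the function name `lr_at`, the argument `max_steps`, and the exact handling of the boundary steps are hypothetical. Only `enable`, `warmup_steps`, and `min_lr_factor` come from the configuration discussed above.

```python
import math

def lr_at(step, max_steps, base_lr, warmup_steps, min_lr_factor, enable=True):
    """Learning rate at `step` for linear warmup followed by cosine decay (illustrative)."""
    if not enable:
        # Scheduler disabled: constant learning rate
        return base_lr
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    min_lr = base_lr * min_lr_factor
    # Interpolate between base_lr (cosine = 1) and min_lr (cosine = 0)
    return min_lr + (base_lr - min_lr) * cosine

# Hypothetical settings, chosen only to show the shape of the curve
for step in (0, 5, 10, 55, 100):
    lr = lr_at(step, max_steps=100, base_lr=1e-3, warmup_steps=10, min_lr_factor=0.1)
    print(f"step={step:3d} lr={lr:.6f}")
```

Comparing the plotted curves against such a reference makes it easier to spot off-by-one differences at the end of the warmup or at the final step.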

## Benchmark

In this experiment, we are benchmarking the throughput of various GPUs on the [Prime Intellect Compute](https://api.primeintellect.ai) platform. Namely, we are comparing the following GPUs:

- NVIDIA RTX 4090 (24GB)
- NVIDIA A100 (40GB)
- NVIDIA A100 (80GB)
- *NVIDIA H100 (80GB) (not yet)*

We are using the script `experiments/benchmark/{method}/{model}.sh` to run the experiment. It uses the configuration from `configs/benchmark.toml` and runs the training script for the respective method. It trains GPT-2 (124M) for five steps on WikiText 2 (17.8M tokens). We do not use learning rate scheduling, and we test various micro-batch sizes, from 1 up to 128 (or until we run out of memory), as well as different precision settings.

### Baseline GPT-2 (124M)

This experiment benchmarks the throughput when training GPT-2 (124M) in the single-GPU setup with different precision settings and micro-batch sizes.

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace?nw=8h2bmt4h0n)

In [None]:
# Load runs
GROUP = "benchmark/baseline/gpt2"
RUNS = [r for r in ALL_RUNS if r.group == GROUP and r.state == "finished"]

print(f"✅ Loaded {len(RUNS)} runs for experiment {GROUP}")

In [9]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in RUNS}
runs_summary = pd.concat([get_summary(r) for r in RUNS])

In [None]:
# Construct performance dataframe
performance = runs_summary.copy()
cost_per_hour = {"NVIDIA A100 80GB PCIe": 1.35, "NVIDIA A100-SXM4-40GB": 1.29, "NVIDIA GeForce RTX 4090": 0.69}

# Add GPU type and run cost (hourly price * runtime in hours)
performance["gpu"] = runs_summary.index.map(lambda x: runs_config[x]["gpu"]["name"])
performance["cost"] = performance.gpu.map(lambda x: cost_per_hour[x]) * performance._runtime / 3600

# Add varying constants
performance["micro_batch_size"] = runs_summary.index.map(lambda x: str(runs_config[x]["train"]["micro_batch_size"]))
performance["precision"] = runs_summary.index.map(lambda x: runs_config[x]["train"]["amp"]["precision"])
performance["dtype"] = runs_summary.index.map(lambda x: runs_config[x]["train"]["amp"]["dtype"])

# Choose relevant columns
cols = ["gpu", "micro_batch_size", "precision", "dtype", "train/throughput/current", "cost"]
performance = performance[cols]

performance.head()

In [None]:
# Plot the average throughput per precision and dtype
gpus = performance["gpu"].unique()
fig, ax = plt.subplots(ncols=len(gpus), figsize=(16, 4), dpi=300)
fig.suptitle("Average Throughput per Precision and Autocast")
plt.tight_layout()

# Plot 1: Throughput per Autocast and Precision
# Bar orders and colors are the same for every GPU, so compute them once
precision_order = performance.groupby("precision")["train/throughput/current"].mean().sort_values(ascending=True).index
dtype_order = performance.groupby("dtype")["train/throughput/current"].mean().sort_values(ascending=True).index
colors = sns.color_palette("Blues", n_colors=2)
for i, gpu in enumerate(gpus):
    sns.barplot(data=performance[performance["gpu"] == gpu], x="dtype", y="train/throughput/current", hue="precision", order=dtype_order, hue_order=precision_order, ax=ax[i], gap=0.2, palette=colors)
    ax[i].set_title(f"{gpu}")
    ax[i].set_xlabel("Autocast")
    ax[i].set_ylabel("Average Throughput (kT/s)")
    ax[i].yaxis.set_major_formatter(lambda x, p: f'{x/1000:.0f}')
    ax[i].legend(title="Precision")

performance.groupby(["gpu", "precision", "dtype"])["train/throughput/current"].mean()

In [None]:
# Plot the average throughput per micro-batch size and GPU
fig, ax = plt.subplots(nrows=2, figsize=(16, 8), dpi=300)
fig.suptitle("Average Throughput per GPU and Micro-Batch Size")
stats = performance.groupby(["gpu", "micro_batch_size"])["train/throughput/current"].mean()

gpu_order = performance.gpu.unique()
batch_size_order = [str(2**i) for i in range(6)]
colors = sns.color_palette("Blues", n_colors=len(batch_size_order))
sns.barplot(data=performance, x="gpu", y="train/throughput/current", hue="micro_batch_size", order=gpu_order, hue_order=batch_size_order, ax=ax[0], gap=0.2, palette=colors)
ax[0].set_xlabel("GPU")
ax[0].set_ylabel("Average Throughput (kT/s)")
ax[0].yaxis.set_major_formatter(lambda x, p: f'{x/1000:.0f}')
ax[0].legend(title="Micro-Batch Size")

colors = sns.color_palette("Blues", n_colors=len(gpu_order))
sns.barplot(data=performance, x="micro_batch_size", y="train/throughput/current", hue="gpu", order=batch_size_order, hue_order=gpu_order, ax=ax[1], gap=0.2, palette=colors)
ax[1].set_xlabel("Micro-Batch Size")
ax[1].set_ylabel("Average Throughput (kT/s)")
ax[1].yaxis.set_major_formatter(lambda x, p: f'{x/1000:.0f}')
ax[1].legend(title="GPU");
plt.tight_layout()
plt.show();

In [None]:
# Best configuration per GPU
peak_performance = performance.groupby("gpu").apply(lambda x: x.loc[x["train/throughput/current"].idxmax()], include_groups=False)

baseline = peak_performance.loc[peak_performance.cost.idxmin()]
peak_performance["cost_factor"] = peak_performance.cost / baseline.cost
peak_performance["throughput_factor"] = peak_performance["train/throughput/current"] / baseline["train/throughput/current"]

peak_performance

It looks like the GeForce RTX 4090 is the best cost-to-performance choice for this training run. Compared to the RTX 4090, the A100 40GB costs 1.66x the price for a ~7% throughput improvement, and the A100 80GB costs 1.59x the price for a ~13% improvement. As long as we are not training distributed, the 4090 is a very good choice. We should choose `micro_batch_size=4` and train in half precision.
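
To make the trade-off explicit, the cost and throughput factors quoted above can be combined into a single "throughput per unit cost" ratio. The sketch below assumes that the RTX 4090 is the baseline with both factors equal to 1.0; the dictionary layout and the helper name `throughput_per_cost` are illustrative and not part of the notebook's dataframes.

```python
# Relative factors quoted in the text (assumption: RTX 4090 is the 1.0 baseline)
factors = {
    "RTX 4090": {"cost_factor": 1.00, "throughput_factor": 1.00},
    "A100 40GB": {"cost_factor": 1.66, "throughput_factor": 1.07},
    "A100 80GB": {"cost_factor": 1.59, "throughput_factor": 1.13},
}

def throughput_per_cost(entry):
    """Throughput gained per unit of cost, relative to the baseline GPU."""
    return entry["throughput_factor"] / entry["cost_factor"]

relative_value = {gpu: throughput_per_cost(f) for gpu, f in factors.items()}
best = max(relative_value, key=relative_value.get)

for gpu, value in relative_value.items():
    print(f"{gpu}: {value:.2f}x throughput per unit cost")
print(f"Best cost-to-performance: {best}")
```

A ratio below 1.0 means the GPU delivers less throughput per dollar than the baseline, which is the case for both A100 variants under these numbers.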