# 📈 Experiments

This notebook loads runs from Weights & Biases (W&B) and analyzes the verification experiments below.

## Setup 

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import autorootcwd

In [None]:
# Imports
from typing import Dict

import wandb
from wandb.sdk.wandb_run import Run

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# Constants
WANDB_ENTITY = "mikasenghaas"
WANDB_PROJECT = "swarm"

In [None]:
# Helpers
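def get_config(run: Run) -> Dict:
    """Return the configuration the run was launched with."""
    return run.config

def get_history(run: Run) -> pd.DataFrame:
    """Return the logged metric history of a run, tagged with its run ID."""
    # Note: run.history() returns a sampled history (500 points by default)
    history = run.history()
    history.insert(0, "run_id", run.id)
    return history

def get_summary(run: Run) -> pd.DataFrame:
    """Return the summary metrics of a run as a one-row frame indexed by run ID."""
    return pd.DataFrame([dict(run.summary)], index=[run.id])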

In [None]:
# Initialize W&B
api = wandb.Api()

# Get runs
RUNS = api.runs(f"{WANDB_ENTITY}/{WANDB_PROJECT}")
print(f"✅ Loaded {len(RUNS)} runs from W&B ({WANDB_ENTITY}/{WANDB_PROJECT})")

## Experiment 1: Verify Gradient Accumulation

This experiment verifies that gradient accumulation works as expected. We train a model with the debug configuration locally (Apple M1), varying the micro-batch size while keeping the global batch size fixed. If accumulation is correct, all runs should produce the same loss curve.

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace?nw=dm6rh6z8t14)

In [None]:
# Load runs
GROUP = "verify/grad-acc"
EXP1_RUNS = [r for r in RUNS if r.group == GROUP]

print(f"✅ Loaded {len(EXP1_RUNS)} runs for experiment {GROUP}")

In [None]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in EXP1_RUNS}
runs_summary = pd.concat([get_summary(r) for r in EXP1_RUNS])
runs_history = pd.concat([get_history(r) for r in EXP1_RUNS])

In [None]:
# Plot loss by step
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.lineplot(data=runs_history, x="_step", y="train/loss/current", hue="run_id", marker="o", ax=ax[0])
sns.lineplot(data=runs_history, x="_step", y="train/loss/average", hue="run_id", marker="o", ax=ax[1])
ax[0].set_title("Loss by Step")
ax[1].set_title("Loss by Step (Average)")
for a in ax:
    a.set_xlabel("Step")
    a.set_ylabel("Loss")
plt.show();

Nice, gradient accumulation works. In every step, each run accumulates gradients over a different number of micro-batches and then performs the same gradient update.
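
To illustrate why the runs should agree, here is a minimal, framework-free sketch. It is not the training code used in these runs; it assumes a toy scalar linear model with a mean-squared-error loss, and the names `mse_grad` and `accumulated_grad` are made up for this example. Each micro-batch gradient is weighted by its share of the global batch, so the accumulated gradient matches the full-batch gradient for any micro-batch size.

```python
import math

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each weighted by its share of the global batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        xb = xs[start:start + micro_batch_size]
        yb = ys[start:start + micro_batch_size]
        total += mse_grad(w, xb, yb) * len(xb) / n
    return total

# Toy data (illustrative only, not taken from the experiment)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 0.5, 3.0]
w = 0.5

full_grad = mse_grad(w, xs, ys)
for micro_batch_size in (1, 2, 3, 4):
    acc_grad = accumulated_grad(w, xs, ys, micro_batch_size)
    assert math.isclose(full_grad, acc_grad)
    print(f"micro_batch_size={micro_batch_size}: grad={acc_grad:.6f}")
```

The weighting by `len(xb) / n` is what keeps the result independent of the micro-batch size, including when the last micro-batch is smaller than the others.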

In [None]:
# Plot Wall-Time by Run
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.barplot(data=runs_summary, x=runs_summary.index, y="_runtime", ax=ax[0])
sns.barplot(data=runs_summary, x=runs_summary.index, y="train/throughput/average", ax=ax[1])
ax[0].set_title("Wall-Time by Run")
ax[1].set_title("Throughput by Run")
ax[0].set_ylabel("Wall-Time (s)")
ax[1].set_ylabel("Throughput (tokens/s)")
for a in ax:
    a.set_xlabel("Micro-Batch Size")
    a.set_xticks(range(len(runs_summary)))
    a.set_xticklabels([runs_config[run_id]['train']['micro_batch_size'] for run_id in runs_summary.index]);
plt.show();

We see that the wall-time decreases with increasing micro-batch size, as expected. This is because we process more tokens per second (using the GPU hardware more efficiently).
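
One way to see where the speed-up comes from: with a fixed global batch size, a larger micro-batch means fewer (but larger) forward/backward passes per optimizer step. The sketch below is only an illustration; the helper `accumulation_steps` and the batch sizes are hypothetical and not read from the debug configuration, and it assumes the global batch size is divisible by the micro-batch size.

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int) -> int:
    """Number of forward/backward passes needed per optimizer step."""
    if global_batch_size % micro_batch_size != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size")
    return global_batch_size // micro_batch_size

# Hypothetical global batch size, chosen for illustration only
GLOBAL_BATCH_SIZE = 8
for micro_batch_size in (1, 2, 4, 8):
    steps = accumulation_steps(GLOBAL_BATCH_SIZE, micro_batch_size)
    print(f"micro_batch_size={micro_batch_size}: {steps} passes per optimizer step")
```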

## Experiment 2: Cosine LR Scheduler

This experiment verifies that cosine learning rate scheduling works as expected, i.e. the learning rate starts at 0 and increases linearly for `train.scheduler.warmup_steps` steps. After that, the cosine schedule kicks in and the learning rate decays along a cosine annealing curve until it reaches the minimum learning rate, which is `train.scheduler.min_lr_factor` times the initial learning rate. The experiment uses the debug configuration, is launched from the script `experiments/verify/scheduler.sh`, and runs locally on an Apple M1.
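
The schedule described above can be sketched as a plain function. This is a minimal sketch, not the project's scheduler: the function name `lr_at_step`, its signature, and the choice to end the cosine decay exactly at `total_steps` are assumptions made for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps, min_lr_factor, enable=True):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if not enable:
        # Scheduler disabled: constant learning rate
        return base_lr
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to min_lr_factor * base_lr
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_factor + (1.0 - min_lr_factor) * cosine)

# Illustrative values only (not taken from the debug configuration)
for step in (0, 5, 10, 55, 100):
    lr = lr_at_step(step, total_steps=100, base_lr=1e-3, warmup_steps=10, min_lr_factor=0.1)
    print(f"step={step:3d}: lr={lr:.6f}")
```

With these values the learning rate is 0 at step 0, peaks at `base_lr` when warmup ends, and finishes at `min_lr_factor * base_lr`, which is the shape the plot below should show for runs with the scheduler enabled.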

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace)

In [None]:
# Load runs
GROUP = "verify/scheduler"
EXP2_RUNS = [r for r in RUNS if r.group == GROUP]

print(f"✅ Loaded {len(EXP2_RUNS)} runs for experiment {GROUP}")

In [None]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in EXP2_RUNS}
runs_summary = pd.concat([get_summary(r) for r in EXP2_RUNS])
runs_history = pd.concat([get_history(r) for r in EXP2_RUNS])

In [None]:
# Plot learning rate patterns
fig, ax = plt.subplots(figsize=(12, 6), dpi=300)
sns.lineplot(data=runs_history, x="_step", y="train/learning_rate/current", hue="run_id", ax=ax)
ax.set_title("Learning Rate by Step (All Runs)")
ax.set_xlabel("Step")
ax.set_ylabel("Learning Rate")

# Create custom legend with scheduler configuration
run_ids = list(runs_config.keys())
scheduler_configs = [runs_config[run_id]['train']['scheduler'] for run_id in run_ids]

legend_elements = []
for i, (run_id, cfg) in enumerate(zip(run_ids, scheduler_configs)):
    # Assumes seaborn draws one line per run, in the same order as `run_ids`
    color = ax.get_lines()[i].get_color()
    label = f"{run_id} (enable={cfg['enable']}, warmup_steps={cfg['warmup_steps']}, min_lr_factor={cfg['min_lr_factor']})"
    legend_elements.append(plt.Line2D([0], [0], color=color, lw=2, label=label))

ax.legend(handles=legend_elements, title="Scheduler Config")

plt.tight_layout()
plt.show();

Nice, looks good. The hyperparameters affect the learning rate pattern as expected:

- `enable`: The learning rate is constant at the initial learning rate for `False` and otherwise follows a cosine annealing pattern.
- `warmup_steps`: The learning rate increases linearly from 0 to the initial learning rate over the first `warmup_steps` steps.
- `min_lr_factor`: The learning rate decays to `min_lr_factor` times the initial learning rate by the end of training.