# 📈 Experiments

This notebook loads training runs from Weights & Biases (W&B) and analyzes the results of each experiment.

## Setup 

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
import autorootcwd

In [None]:
# Imports
from typing import Dict

import wandb
from wandb.sdk.wandb_run import Run

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# Constants
WANDB_ENTITY = "mikasenghaas"
WANDB_PROJECT = "swarm"

In [None]:
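# Helpers
def get_config(run: Run) -> Dict:
    """Return the run's configuration as a dictionary."""
    return run.config

def get_history(run: Run) -> pd.DataFrame:
    """Return the run's logged metrics with a leading `run_id` column."""
    run_id = run.id
    history = run.history()
    # Prepend the run ID so that histories of several runs can be concatenated and grouped
    return pd.concat([pd.Series([run_id]*len(history), name="run_id"), history], axis=1)

def get_summary(run: Run) -> pd.DataFrame:
    """Return the run's summary metrics as a single-row DataFrame indexed by run ID."""
    return pd.DataFrame([dict(run.summary)], index=[run.id])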

In [None]:
# Initialize W&B
api = wandb.Api()

# Get runs
RUNS = api.runs(f"{WANDB_ENTITY}/{WANDB_PROJECT}")
print(f"✅ Loaded {len(RUNS)} runs from W&B ({WANDB_ENTITY}/{WANDB_PROJECT})")

## Experiment 1: Verify Gradient Accumulation

This experiment verifies that gradient accumulation works as expected. We train a model with the debug configuration locally (Apple M1), varying the micro-batch size while keeping the global batch size fixed.

View the experiment: [W&B](https://wandb.ai/mikasenghaas/swarm/workspace?nw=dm6rh6z8t14)

In [None]:
# Load runs
GROUP = "verify/grad-acc"
EXP1_RUNS = [r for r in RUNS if r.group == GROUP]

print(f"✅ Loaded {len(EXP1_RUNS)} runs for experiment {GROUP}")

In [None]:
# Get config, summary, history
runs_config = {r.id: get_config(r) for r in EXP1_RUNS}
runs_summary = pd.concat([get_summary(r) for r in EXP1_RUNS])
# Reset the index so that rows from different runs do not share index labels
runs_history = pd.concat([get_history(r) for r in EXP1_RUNS], ignore_index=True)

In [None]:
# Plot loss by step
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.lineplot(data=runs_history, x="_step", y="train/loss/current", hue="run_id", marker="o", ax=ax[0])
sns.lineplot(data=runs_history, x="_step", y="train/loss/average", hue="run_id", marker="o", ax=ax[1])
ax[0].set_title("Loss by Step")
ax[1].set_title("Loss by Step (Average)")
for a in ax:
    a.set_xlabel("Step")
    a.set_ylabel("Loss")

Nice, gradient accumulation works. At every step, we accumulate gradients over several micro-batches and then perform the same gradient update, regardless of the micro-batch size.
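
The equivalence that the plot checks can be illustrated on a toy problem. The sketch below is not the training code used in these runs: it assumes a one-parameter linear model with mean-squared-error loss, and the helper names (`mse_grad`, `accumulated_grad`) are hypothetical. It shows that summing micro-batch gradients, each weighted by its share of the global batch, reproduces the full-batch gradient for any micro-batch size that divides the global batch size.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into one global-batch gradient."""
    global_batch_size = len(xs)
    assert global_batch_size % micro_batch_size == 0
    grad = 0.0
    for start in range(0, global_batch_size, micro_batch_size):
        mb_x = xs[start:start + micro_batch_size]
        mb_y = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean by its share of the global batch
        grad += mse_grad(w, mb_x, mb_y) * micro_batch_size / global_batch_size
    return grad


# Toy data (illustrative only, not taken from the runs above)
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.5, 4.0, 2.0, -1.0, 6.5, 0.5, 1.5]
w = 0.3

full = mse_grad(w, xs, ys)
for mbs in (1, 2, 4, 8):
    acc = accumulated_grad(w, xs, ys, mbs)
    print(f"micro_batch_size={mbs}: match={math.isclose(acc, full, rel_tol=1e-9)}")
```

Because the update depends only on the accumulated gradient, identical gradients imply identical parameter updates, which is why the loss curves of the runs should coincide up to floating-point noise.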

In [None]:
# Plot Wall-Time by Run
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(16, 4), dpi=300)
sns.barplot(data=runs_summary, x=runs_summary.index, y="_runtime", ax=ax[0])
sns.barplot(data=runs_summary, x=runs_summary.index, y="train/throughput/average", ax=ax[1])
ax[0].set_title("Wall-Time by Run")
ax[1].set_title("Throughput by Run")
ax[0].set_ylabel("Wall-Time (s)")
ax[1].set_ylabel("Throughput (T/s)")
for a in ax:
    a.set_xlabel("Micro-Batch Size")
    a.set_xticks(range(len(runs_summary)))
    a.set_xticklabels([runs_config[run_id]['train']['micro_batch_size'] for run_id in runs_summary.index]);

plt.show();

We see that the wall-time decreases with increasing micro-batch size, as expected. This is because larger micro-batches process more tokens per second (they use the GPU hardware more efficiently).
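
The relationship between micro-batch size, accumulation steps, and wall-time can be sketched with some simple arithmetic. The helper names and all numbers below are hypothetical and chosen only for illustration; they are not measurements from these runs. The sketch assumes a fixed token budget, so wall-time is the token count divided by the average throughput.

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int) -> int:
    """Number of forward/backward passes needed per optimizer step."""
    if global_batch_size % micro_batch_size != 0:
        raise ValueError("micro-batch size must divide the global batch size")
    return global_batch_size // micro_batch_size


def wall_time_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Time needed to process a fixed token budget at a given average throughput."""
    return total_tokens / tokens_per_second


# Hypothetical values for illustration only (not measured in this experiment)
GLOBAL_BATCH_SIZE = 32
TOTAL_TOKENS = 1_000_000
assumed_throughput = {1: 2_000.0, 8: 5_000.0, 32: 8_000.0}  # tokens/s per micro-batch size

for mbs, tps in assumed_throughput.items():
    steps = accumulation_steps(GLOBAL_BATCH_SIZE, mbs)
    seconds = wall_time_seconds(TOTAL_TOKENS, tps)
    print(f"micro_batch_size={mbs}: {steps} passes per step, {seconds:.0f} s")
```

With the global batch size fixed, a larger micro-batch means fewer passes per optimizer step, so any throughput gain translates directly into a shorter run.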