# Scaling Laws

This week we derive simple scaling laws for language models. We follow Approach 2 (IsoFLOP profiles) of Hoffmann et al. 2022, https://arxiv.org/abs/2203.15556.

- We train GPT models based on nanoGPT: https://github.com/karpathy/nanoGPT

- Download the dataset gutenberg_poetry.txt from MOODLE. This text is a filtered version of https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus.

- You can do the training in https://colab.google/.



In [None]:
# Scaling laws will be based on the following models. 
gpt_models = {
    'gpt-0':    dict(n_layer=2, n_head=2, n_embd=96),     
    'gpt-1':    dict(n_layer=4, n_head=4, n_embd=96),    
    'gpt-2':    dict(n_layer=9, n_head=8, n_embd=96),     
    'gpt-3':    dict(n_layer=10, n_head=8, n_embd=128),   
    'gpt-4':    dict(n_layer=20, n_head=16, n_embd=128),  
}

In [None]:
from model import GPTConfig, GPT

# gpt_models dictionary should be defined in the cell above

print(f"{'Model Name':<10} | {'Total Parameters':<20} | {'Non-Embedding Params (N)':<25} | {'N (model.py non_embedding=True)':<35}")
print(f"{'----------':<10} | {'----------------':<20} | {'-------------------------':<25} | {'-----------------------------------':<35}")

for model_name, model_args in gpt_models.items():
    # Use nanoGPT default vocab_size and block_size for N calculation
    # These are properties of the model architecture itself.
    # vocab_size=50304, block_size=1024 are defaults in nanoGPT's GPTConfig.
    config_args = model_args.copy()
    if 'vocab_size' not in config_args: config_args['vocab_size'] = 50304
    if 'block_size' not in config_args: config_args['block_size'] = 1024
    
    conf = GPTConfig(**config_args)
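    # GPT.__init__ prints its own parameter count, so one extra line per model appears when this cell runs.
    _model_for_params = GPT(conf)
    # get_num_params() defaults to non_embedding=True in nanoGPT's model.py, which already
    # subtracts the position embeddings (wpe). Request the full count explicitly so that
    # wpe is not subtracted twice below.
    total_params = _model_for_params.get_num_params(non_embedding=False)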
    
    # Calculate N as per Hoffmann et al. (Chinchilla paper)
    # N = Total params - (wte_params + wpe_params)
    # (lm_head_params are tied to wte_params and are thus included in wte_params)
    
    wte_params = conf.vocab_size * conf.n_embd
    wpe_params = conf.block_size * conf.n_embd
    
    n_params_chinchilla = total_params - wte_params - wpe_params
    
    # Value from model.get_num_params(non_embedding=True)
    # This is total_params - wpe_params as per model.py
    n_params_model_non_embedding_true = _model_for_params.get_num_params(non_embedding=True)

    print(f"{model_name:<10} | {total_params:<20,} | {n_params_chinchilla:<25,} | {n_params_model_non_embedding_true:<35,}")


Task 1: Encode the gutenberg_poetry.txt file using `prepare.py` from nanoGPT's `data/shakespeare_char` directory. Create the models from `gpt_models` (see above) and calculate the number of non-embedding parameters $N$ for each. (Hint: see model.py)
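
As a cross-check that does not require instantiating the models, $N$ can also be counted by hand. The sketch below assumes the GPT-2 style block found in nanoGPT's `model.py` (attention with `c_attn` and `c_proj`, an MLP with a 4x hidden layer, two LayerNorms per block, one final LayerNorm). The helper name `count_non_embedding_params` is hypothetical. With biases enabled each block contributes $12d^2 + 13d$ parameters, where $d$ is `n_embd`.

```python
def count_non_embedding_params(n_layer, n_embd, bias=True):
    """Analytic non-embedding parameter count for a GPT-2 style decoder stack.

    Assumes the block layout described above; n_head does not change the count.
    """
    d = n_embd
    # Weight matrices: c_attn (d x 3d), attention c_proj (d x d),
    # MLP c_fc (d x 4d), MLP c_proj (4d x d)  ->  12 d^2 in total.
    weights_per_block = 3 * d * d + d * d + 4 * d * d + 4 * d * d
    # Two LayerNorm scale vectors per block.
    norm_weights_per_block = 2 * d
    # Biases: c_attn (3d), attention c_proj (d), MLP c_fc (4d), MLP c_proj (d),
    # and the two LayerNorm biases (2d).
    biases_per_block = (3 * d + d + 4 * d + d + 2 * d) if bias else 0
    per_block = weights_per_block + norm_weights_per_block + biases_per_block
    # Final LayerNorm: scale vector, plus a bias vector if biases are enabled.
    final_norm = 2 * d if bias else d
    return n_layer * per_block + final_norm


gpt_models = {
    'gpt-0': dict(n_layer=2, n_head=2, n_embd=96),
    'gpt-1': dict(n_layer=4, n_head=4, n_embd=96),
    'gpt-2': dict(n_layer=9, n_head=8, n_embd=96),
    'gpt-3': dict(n_layer=10, n_head=8, n_embd=128),
    'gpt-4': dict(n_layer=20, n_head=16, n_embd=128),
}

analytic_N = {
    name: count_non_embedding_params(args['n_layer'], args['n_embd'])
    for name, args in gpt_models.items()
}
for name, n in analytic_N.items():
    print(f"{name}: N = {n:,}")
```

The count depends on the `bias` setting, so use the same value here as in the actual training runs when comparing against the numbers reported by the model.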


### Task 2: Deriving the Formula for Training Steps (S)
We are given the compute approximation $C \approx 6ND$, where:
- $C$ is the total compute budget.
- $N$ is the number of non-embedding parameters in the model.
- $D$ is the total number of processed training tokens.

The total number of training tokens $D$ can also be expressed as:
$D = S \times \text{batch_size} \times \text{block_size} \times \text{gradient_accumulation_steps}$
Where:
- $S$ is the number of training steps (iterations, or `max_iters`).
- `block_size` is the context length (given as 32).
- `batch_size` is the batch size (given as 64).
- `gradient_accumulation_steps` is given as 1.

Let $T_\text{step} = \text{batch_size} \times \text{block_size} \times \text{gradient_accumulation_steps}$ be the number of tokens processed per training step.
So, $D = S \times T_\text{step}$.

Substitute this into the compute formula:
$C \approx 6N(S \times T_\text{step})$

We want to find $S$. Rearranging the formula for $S$:
$S \approx \frac{C}{6N \times T_\text{step}}$

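Thus, with $T_\text{step}$ written out, the number of training steps $S$ is:
$S \approx \frac{C}{6 \times N \times \text{batch_size} \times \text{block_size} \times \text{gradient_accumulation_steps}}$
With the given values, $T_\text{step} = 64 \times 32 \times 1 = 2048$ tokens per step. Round $S$ up to the next integer when using it as `max_iters`.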
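
To make the formula concrete, here is a minimal sketch that evaluates it for one run. The function name `training_steps` and the model size `example_N` are hypothetical and chosen only to keep the arithmetic easy to follow. The real $N$ values come from Task 1.

```python
import math


def training_steps(compute, n_params, batch_size=64, block_size=32, grad_accum_steps=1):
    """Number of optimizer steps S such that 6 * N * D is approximately `compute`."""
    tokens_per_step = batch_size * block_size * grad_accum_steps
    tokens = compute / (6 * n_params)           # D = C / (6N)
    return math.ceil(tokens / tokens_per_step)  # round up to a whole step


# Hypothetical model size (not one of the gpt_models), budget C0 = 6e13.
example_N = 1_000_000
example_S = training_steps(6e13, example_N)
print(f"N = {example_N:,}, S = {example_S:,}")
```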

The following code cell implements this calculation and generates the table of training steps.

In [None]:
# Task 2: Training Step Calculation
from model import GPTConfig, GPT # Needed to calculate N again
import math

# Provided gpt_models dictionary (should be available from previous cell execution)
# If not, uncomment and run this: 
# gpt_models = {
#     'gpt-0':    dict(n_layer=2, n_head=2, n_embd=96),
#     'gpt-1':    dict(n_layer=4, n_head=4, n_embd=96),
#     'gpt-2':    dict(n_layer=9, n_head=8, n_embd=96),
#     'gpt-3':    dict(n_layer=10, n_head=8, n_embd=128),
#     'gpt-4':    dict(n_layer=20, n_head=16, n_embd=128),
# }

def get_chinchilla_N_for_task2(model_config_dict):
    """Return the non-embedding parameter count N for one model configuration."""
    # Fall back to the nanoGPT defaults if vocab_size / block_size are not specified.
    # N does not depend on either value, because both embedding tables are subtracted below.
    config_args = model_config_dict.copy()
    if 'vocab_size' not in config_args: config_args['vocab_size'] = 50304
    if 'block_size' not in config_args: config_args['block_size'] = 1024

    # Build a temporary model only to count its parameters.
    # GPT.__init__ prints "number of parameters: X.XXM" for each model; wrap the call in
    # contextlib.redirect_stdout if these lines clutter the output.
    temp_conf = GPTConfig(**config_args)
    temp_model = GPT(temp_conf)

    # get_num_params() defaults to non_embedding=True in nanoGPT's model.py, which already
    # subtracts wpe. Request the full count explicitly so that wpe is not subtracted twice.
    total_params = temp_model.get_num_params(non_embedding=False)
    wte_params = temp_conf.vocab_size * temp_conf.n_embd
    wpe_params = temp_conf.block_size * temp_conf.n_embd
    # lm_head shares its weights with wte in nanoGPT, so it is already covered by wte_params.
    n_chinchilla = total_params - wte_params - wpe_params
    return n_chinchilla

N_values = {name: get_chinchilla_N_for_task2(args) for name, args in gpt_models.items()}
print("Calculated N (Chinchilla non-embedding) values for models (will also show GPT init prints):")
for name, n_val in N_values.items():
    print(f"  {name}: {n_val:,}")
print('\n')

# Compute values
C0 = 6e13
C1 = 3e14
C2 = 6e14

compute_budgets = {
    'C0': C0,
    'C1': C1,
    'C2': C2
}

# Model assignments for each compute budget
model_assignments = {
    'C0': ['gpt-0', 'gpt-1', 'gpt-2', 'gpt-3'],
    'C1': ['gpt-1', 'gpt-2', 'gpt-3'],
    'C2': ['gpt-1', 'gpt-2', 'gpt-3', 'gpt-4']
}

# Training parameters (as defined in the notebook)
batch_size = 64
block_size = 32  # Context length
gradient_accumulation_steps = 1

tokens_per_step = batch_size * block_size * gradient_accumulation_steps

print(f"{'Budget':<5} | {'Model':<10} | {'N (Params)':<15} | {'D (Tokens)':<18} | {'S (Train Steps)':<18}")
print(f"{'-----':<5} | {'----------':<10} | {'---------------':<15} | {'------------------':<18} | {'------------------':<18}")

results_table = []
for budget_name, C_val in compute_budgets.items():
    for model_name in model_assignments[budget_name]:
        N_val = N_values.get(model_name)
        if N_val is None:
            print(f"Warning: N value for {model_name} not found. Skipping.")
            continue
        if N_val <= 0: # Avoid division by zero or an invalid N
            print(f"Warning: N value for {model_name} is {N_val}. Skipping calculation for this model.")
            continue
        D_val = C_val / (6 * N_val)      # D = C / (6N)
        S_val = D_val / tokens_per_step  # S = D / T_step
        S_val_int = math.ceil(S_val)
        D_val_int = math.ceil(D_val)
        results_table.append({'budget': budget_name, 'model': model_name, 'N': N_val, 'D': D_val_int, 'S': S_val_int})
        print(f"{budget_name:<5} | {model_name:<10} | {N_val:<15,} | {D_val_int:<18,} | {S_val_int:<18,}")


Task 2: For the models above, we vary the number of training tokens so that each run uses a constant amount of compute $C$. The compute budget can be approximated by $C \approx 6ND$, where $N$ is the model size and $D$ is the number of processed training tokens. We fix three compute values: $C_0=6\cdot 10^{13}$, $C_1=3\cdot 10^{14}$, and $C_2=6\cdot 10^{14}$. For $C_0$, we train the models gpt-0, gpt-1, gpt-2, and gpt-3. For $C_1$, we use gpt-1, gpt-2, and gpt-3. For $C_2$, we use gpt-1, gpt-2, gpt-3, and gpt-4. Derive a formula for the number of training steps $S$ and create a table with the corresponding training steps for each training run. We assume a fixed context length of $32$ and a batch size of $64$. We train on one GPU with 1 gradient accumulation step.


In [None]:
# You can use the following values for training. 
eval_interval = 250 
eval_iters = 200
log_interval = 10 

always_save_checkpoint = False

gradient_accumulation_steps = 1
batch_size = 64
block_size = 32 

learning_rate = 1e-3 
max_iters = None # must be set per run: use the value S calculated above
lr_decay_iters = None # usually set equal to max_iters
min_lr = 1e-4 
beta2 = 0.99

warmup_iters = 100 

device = 'cuda'

## Task 3: Training, Loss Recording, and Scaling Law Analysis
This task involves training the models defined in Task 1, using the training step counts calculated in Task 2, and then analyzing the results to derive scaling laws similar to Hoffmann et al. 2022.
**Note:** The actual training of models is computationally intensive and should be performed by the user in an environment with adequate resources (e.g., Google Colab with GPU access). The steps below outline the process.

### 3.1. Data Preparation (Reminder)
1. **Replace Dummy File:** Replace the dummy `gutenberg_poetry.txt` file in the root directory with the actual dataset provided on MOODLE.
2. **Re-run `prepare.py`:** After replacing the text file, navigate to the `data/gutenberg_poetry/` directory and run `python prepare.py` again. This will create the `train.bin`, `val.bin`, and `meta.pkl` files based on the actual dataset. Ensure this `meta.pkl` contains the correct `vocab_size` which will be used by the training script.
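
For orientation, the following sketch shows the kind of character-level encoding this step produces. It assumes the script behaves like nanoGPT's `shakespeare_char` version (a vocabulary built from the distinct characters and a 90/10 train/validation split) and it leaves out the file writing. The helper names and the sample text are illustrative stand-ins, not the dataset or the script itself.

```python
def build_char_vocab(text):
    """Map every distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    return ''.join(itos[i] for i in ids)


# Stand-in text; the real input is gutenberg_poetry.txt.
sample_text = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate.\n"

stoi, itos = build_char_vocab(sample_text)
ids = encode(sample_text, stoi)

# Assumed 90/10 split into training and validation tokens.
split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]

vocab_size = len(stoi)
print(f"vocab_size = {vocab_size}, train tokens = {len(train_ids)}, val tokens = {len(val_ids)}")
```

The `vocab_size` obtained this way is what ends up in `meta.pkl`. It is far smaller than the GPT-2 default of 50304, which matters for the total parameter count but not for the non-embedding count $N$.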

### 3.2. Model Training
For each model and compute budget configuration for which you calculated `S` (max_iters) in Task 2, you need to train a model using `train.py`.

**Example command structure for `train.py`:**
```bash
python train.py \
    --dataset=gutenberg_poetry \
    --out_dir=out-<model_name>-<budget_name> \
    --n_layer=<val> --n_head=<val> --n_embd=<val> \
    --block_size=32 --batch_size=64 --gradient_accumulation_steps=1 \
    --max_iters=<S_value_from_Task2_table> \
    --lr_decay_iters=<S_value_from_Task2_table> \
    --learning_rate=1e-3 --min_lr=1e-4 --beta2=0.99 \
    --warmup_iters=100 \
    --eval_interval=250 --eval_iters=200 --log_interval=10 \
    --always_save_checkpoint=False \
    --device=cuda
# If torch.compile causes problems, append --compile=False to the command above.
```
**Explanation of parameters:**
- `--dataset=gutenberg_poetry`: This should match the directory name in `data/` where `train.bin`, `val.bin` and `meta.pkl` are located.
- `--out_dir`: Specifies where checkpoints and logs will be saved. Use a unique name for each run (e.g., `out-gpt-1-C0`).
- `n_layer, n_head, n_embd`: These are the architectural parameters for the specific model (e.g., from the `gpt_models` dictionary for `gpt-1`).
- `block_size, batch_size, gradient_accumulation_steps`: Fixed as per the problem description (32, 64, 1 respectively).
- `max_iters`: This is the `S` value you calculated for the specific (model, compute budget) pair in Task 2.
- `lr_decay_iters`: Typically set equal to `max_iters`. The other hyperparameters (learning rate, warmup, etc.) are provided in the notebook or can be nanoGPT defaults.

**Training Runs Table (from Task 2):**
Refer to the table generated by the code cell in Task 2. For each row in that table, perform a training run with the corresponding model architecture and `max_iters`.
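
With a dozen runs, typing each command by hand is error-prone. A minimal sketch of generating the commands from the table is shown below. The helper `build_train_command` is hypothetical, the flags are the ones from the example command above, and the `max_iters` value in `example_runs` is a placeholder rather than a result from Task 2.

```python
import shlex

gpt_models = {
    'gpt-0': dict(n_layer=2, n_head=2, n_embd=96),
    'gpt-1': dict(n_layer=4, n_head=4, n_embd=96),
}


def build_train_command(model_name, budget_name, arch, max_iters):
    """Assemble the train.py argument list for one (model, budget) run."""
    return [
        'python', 'train.py',
        '--dataset=gutenberg_poetry',
        f'--out_dir=out-{model_name}-{budget_name}',
        f"--n_layer={arch['n_layer']}",
        f"--n_head={arch['n_head']}",
        f"--n_embd={arch['n_embd']}",
        '--block_size=32', '--batch_size=64', '--gradient_accumulation_steps=1',
        f'--max_iters={max_iters}',
        f'--lr_decay_iters={max_iters}',
        '--learning_rate=1e-3', '--min_lr=1e-4', '--beta2=0.99',
        '--warmup_iters=100',
        '--eval_interval=250', '--eval_iters=200', '--log_interval=10',
        '--always_save_checkpoint=False',
        '--device=cuda',
    ]


# (budget, model, max_iters) rows; 1234 is a placeholder, not a computed S.
example_runs = [('C0', 'gpt-0', 1234)]

commands = [
    build_train_command(model, budget, gpt_models[model], steps)
    for budget, model, steps in example_runs
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the nanoGPT directory, or the printed lines can be pasted into a Colab cell prefixed with `!`.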

### 3.3. Record Final Validation Loss (L)
During each training run, the script logs the training and validation loss at every evaluation interval. Use the final validation loss as the value of $L$ for that run.
1. **Locate Loss:** Read the loss from the console output (stdout). Evaluation runs every `eval_interval` = 250 steps, so use the loss reported at the last evaluation before `max_iters` is reached. In nanoGPT's `train.py`, the checkpoint `ckpt.pt` in `out_dir` records the best validation loss seen so far (`best_val_loss`), which can differ from the final one. With `always_save_checkpoint=False`, the checkpoint is only written when the validation loss improves.
2. **Create a Table for L:** You should manually create a table to store these loss values. It might look like this:
   ```
   | Compute Budget | Model Name | N (Non-Embed Params) | D (Tokens) | S (Train Steps) | Final Validation Loss (L) |
   |----------------|------------|----------------------|------------|-----------------|---------------------------|
   | C0             | gpt-0      | (value from Task 1)  | (value)    | (value)         | (record here)             |
   | C0             | gpt-1      | (value from Task 1)  | (value)    | (value)         | (record here)             |
   | ...            | ...        | ...                  | ...        | ...             | ...                       |
   ```
   Fill this table with the $N$ values from Task 1, $D$ and $S$ from Task 2, and the observed $L$ from your training runs.
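
Reading the loss off the console by hand works, but it can also be extracted from captured output. The sketch below assumes that the evaluation lines have the form `step <i>: train loss <x>, val loss <y>`. Check this against your own logs and adjust the pattern if needed. The function name and the numbers in `sample_log` are made up for illustration.

```python
import re

# Assumed format of an evaluation line in the training output.
EVAL_LINE = re.compile(r"step (\d+): train loss ([0-9.]+), val loss ([0-9.]+)")


def final_val_loss(log_text):
    """Return (step, validation loss) of the last evaluation line, or None."""
    matches = EVAL_LINE.findall(log_text)
    if not matches:
        return None
    step, _train_loss, val_loss = matches[-1]
    return int(step), float(val_loss)


# Made-up log excerpt; the values are not real results.
sample_log = """\
step 0: train loss 4.6000, val loss 4.6100
iter 10: loss 3.9000, time 20.00ms
step 250: train loss 2.5000, val loss 2.5500
step 500: train loss 2.3000, val loss 2.3600
"""

print(final_val_loss(sample_log))
```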

### 3.4. Plot L vs. N for each Compute Budget
Using the table from step 3.3:
1. **Group Data:** For each compute budget ($C_0, C_1, C_2$), you will have a set of $(N, L)$ pairs.
2. **Plot:** Create three scatter plots: $L$ (y-axis) vs. $N$ (x-axis), one for each $C$.
   ```python
   # Example using matplotlib
   import matplotlib.pyplot as plt
   import numpy as np
   from scipy.optimize import curve_fit

   # Assume you have lists of N_values_C0, L_values_C0, etc.
   # plt.scatter(N_values_C0, L_values_C0, label='C0')
   # ... add plots for C1, C2 ...

   # plt.xlabel('Number of Non-Embedding Parameters (N)')
   # plt.ylabel('Final Validation Loss (L)')
   # plt.legend()
   # plt.title('Loss vs. Model Size for Fixed Compute Budgets')
   # plt.xscale('log') # Often plotted on log scales
   # plt.yscale('log')
   # plt.show()
   ```
3. **Fit Curves:** As in Hoffmann et al. Figure 3 (left), fit a curve to each set of points. Hoffmann et al. fit a parabola to each IsoFLOP curve, i.e. a degree-2 polynomial of $L$ in $\log N$. A pure power law such as $L(N) = A N^\alpha + E_0$ is monotonic in $N$ and therefore has no interior minimum, so it can describe only one side of the curve. The goal is a curve that describes the points well and from which a minimum can be estimated, either inside the trained range or by extrapolation.
   ```python
   # Example: fit a power law L = A * N^alpha (in log space: log L = log A + alpha * log N).
   # Note: this curve is monotonic in N, so it cannot produce an interior minimum.
   # def power_law(n, a, b):
   #     return a * np.power(n, b)
   # params_c0, _ = curve_fit(power_law, N_values_C0, L_values_C0)
   # sorted_N_values_C0 = np.sort(N_values_C0)
   # plt.plot(sorted_N_values_C0, power_law(sorted_N_values_C0, *params_c0), label='C0 fit')
   ```
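
A curve with an interior minimum is needed to read off an optimal model size. The sketch below fits a parabola in $\log_{10} N$ with `np.polyfit` and returns the vertex. The function name `fit_isoflop_parabola` is hypothetical, and `N_demo` / `L_demo` are synthetic points generated from a known parabola, not measured losses.

```python
import numpy as np


def fit_isoflop_parabola(n_values, losses):
    """Fit L = c2 * x^2 + c1 * x + c0 with x = log10(N); return (N at minimum, minimum L)."""
    x = np.log10(np.asarray(n_values, dtype=float))
    y = np.asarray(losses, dtype=float)
    c2, c1, c0 = np.polyfit(x, y, 2)  # coefficients, highest degree first
    if c2 <= 0:
        raise ValueError("fitted parabola opens downward or is flat; no minimum")
    x_min = -c1 / (2 * c2)                 # vertex of the parabola
    loss_min = c0 - c1 ** 2 / (4 * c2)     # parabola value at the vertex
    return 10 ** x_min, loss_min


# Synthetic demo data: a parabola with its minimum placed at N = 6e5, L = 2.0.
N_demo = np.array([2e5, 4e5, 1e6, 2e6])
L_demo = 2.0 + 0.3 * (np.log10(N_demo) - np.log10(6e5)) ** 2

n_opt_demo, l_min_demo = fit_isoflop_parabola(N_demo, L_demo)
print(f"N_opt = {n_opt_demo:.3e}, L_min = {l_min_demo:.4f}")
```

A parabola needs at least three points. With exactly three models (as for $C_1$) the fit passes through every point, so the estimated minimum is sensitive to noise in any single run.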

### 3.5. Extract Minima and Plot Scaling Laws
1. **Find $N_\text{opt}(C)$:** For each compute budget $C$, find the model size $N_\text{opt}$ that minimizes the loss $L$. This might be one of the discrete points you trained, or an interpolated/extrapolated minimum from your fitted curve.
2. **Calculate $D_\text{opt}(C)$:** Once you have $N_\text{opt}(C)$ for each compute budget, calculate the corresponding optimal number of training tokens $D_\text{opt}(C)$ using the formula $D_\text{opt}(C) = C / (6 N_\text{opt}(C))$.
3. **Create Table for Optimals:**
   ```
   | Compute Budget (C) | N_opt(C)            | D_opt(C)            | Min Loss L(N_opt, C) |
   |--------------------|---------------------|---------------------|----------------------|
   | C0 (6e13)          | (extracted/interp.) | (calculated)        | (from fit/data)      |
   | C1 (3e14)          | (extracted/interp.) | (calculated)        | (from fit/data)      |
   | C2 (6e14)          | (extracted/interp.) | (calculated)        | (from fit/data)      |
   ```
4. **Plot $N_\text{opt}$ vs. $C$ and $D_\text{opt}$ vs. $C$:** 
   - Create a log-log plot of $N_\text{opt}$ (y-axis) vs. $C$ (x-axis). (Similar to Hoffmann et al. Figure 3, center).
   - Create a log-log plot of $D_\text{opt}$ (y-axis) vs. $C$ (x-axis). (Similar to Hoffmann et al. Figure 3, right).
   ```python
   # C_values = np.array([6e13, 3e14, 6e14])
   # N_opt_values = np.array([...]) # from your table
   # D_opt_values = np.array([...]) # from your table

   # plt.figure()
   # plt.plot(C_values, N_opt_values, 'o-', label='N_opt(C)')
   # plt.xlabel('Compute Budget (C)')
   # plt.ylabel('Optimal Model Size N_opt(C)')
   # plt.xscale('log'); plt.yscale('log'); plt.legend(); plt.title('N_opt vs. Compute')
   # plt.show()

   # plt.figure()
   # plt.plot(C_values, D_opt_values, 'o-', label='D_opt(C)')
   # plt.xlabel('Compute Budget (C)')
   # plt.ylabel('Optimal Training Tokens D_opt(C)')
   # plt.xscale('log'); plt.yscale('log'); plt.legend(); plt.title('D_opt vs. Compute')
   # plt.show()
   ```
5. **Fit Scaling Laws:** Fit power laws to these two plots:
   - $N_\text{opt}(C) = N_0 C^a$
   - $D_\text{opt}(C) = D_0 C^b$
   In log-log space, these are linear fits: $\log(N_\text{opt}) = \log(N_0) + a \log(C)$ and $\log(D_\text{opt}) = \log(D_0) + b \log(C)$. You can use `np.polyfit` on the log-transformed data to find $a, b, \log(N_0), \log(D_0)$.
   ```python
   # # Fit for N_opt = N0 * C^a  => log(N_opt) = log(N0) + a * log(C)
   # log_C = np.log(C_values)
   # log_N_opt = np.log(N_opt_values)
   # params_N = np.polyfit(log_C, log_N_opt, 1) # degree 1 polynomial
   # a_N = params_N[0]
   # log_N0 = params_N[1]
   # N0 = np.exp(log_N0)
   # print(f'N_opt(C) scaling: N0 = {N0:.2e}, a = {a_N:.2f}')

   # # Fit for D_opt = D0 * C^b => log(D_opt) = log(D0) + b * log(C)
   # log_D_opt = np.log(D_opt_values)
   # params_D = np.polyfit(log_C, log_D_opt, 1)
   # b_D = params_D[0]
   # log_D0 = params_D[1]
   # D0 = np.exp(log_D0)
   # print(f'D_opt(C) scaling: D0 = {D0:.2e}, b = {b_D:.2f}')
   ```
6. **Report Scaling Law Parameters:** Write down your estimated values for $N_0, a, D_0, b$.
7. **Create Final Plot:** Optionally, create a plot showing your data points and the fitted scaling law curves for $N_\text{opt}(C)$ and $D_\text{opt}(C)$.
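
One property of this procedure is worth checking before interpreting the exponents. Because $D_\text{opt}$ is computed from $N_\text{opt}$ through $C = 6ND$, the two least-squares fits are not independent: $a + b = 1$ and $N_0 D_0 = 1/6$ hold by construction. The sketch below demonstrates this with hypothetical `N_opt_demo` values, which are not results.

```python
import numpy as np

C_demo = np.array([6e13, 3e14, 6e14])
N_opt_demo = np.array([4e5, 9e5, 1.3e6])  # hypothetical optima, for illustration only
D_opt_demo = C_demo / (6 * N_opt_demo)    # D_opt = C / (6 * N_opt)

log_C = np.log(C_demo)
# Linear fits in log-log space; polyfit returns (slope, intercept) for degree 1.
a_fit, log_N0 = np.polyfit(log_C, np.log(N_opt_demo), 1)
b_fit, log_D0 = np.polyfit(log_C, np.log(D_opt_demo), 1)

print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, a + b = {a_fit + b_fit:.3f}")
print(f"6 * N0 * D0 = {6 * np.exp(log_N0 + log_D0):.3f}")
```

In other words, the second fit adds no new information. Only $a$ and $N_0$ are estimated from the data, and $b$ and $D_0$ follow from them.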

This completes the conceptual outline for Task 3. Good luck with the training and analysis!
