Stella: "So from our POV the most basic concept is the number of non-embedding params and the three ratios shown in Figure 5"

Ratios in question:

(1) Feed-forward ratio (`ffr`) $\frac{d_{ff}}{d_{model}}$

(2) Aspect ratio (`ar`) $\frac{d_{model}}{n_{layer}}$

(3) Attention head dimension (`attn_dim`) $\frac{d_{model}}{n_{head}}$

Subject to:

(4) $N \approx 2d_{model}n_{layer}(2d_{attn}+d_{ff})$

(5) $N = 12n_{layer}d_{model}^2$

(6) $d_{model} = d_{attn} = d_{ff}/4$
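
As a quick check that the three constraints agree, substituting (6) into (4), that is $d_{attn} = d_{model}$ and $d_{ff} = 4d_{model}$, recovers (5):

$N \approx 2d_{model}n_{layer}(2d_{model} + 4d_{model}) = 12n_{layer}d_{model}^2$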

We'll focus on (5): use `ar` to solve for `d_model`, compute `n_layer` from (2), then use (3) and `attn_dim` to find `n_head`.

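Substituting $n_{layer} = d_{model}/\phi$ from (2) into (5) gives $N = 12d_{model}^3/\phi$, so solving for `d_model` yields

$d_{model} = \left(\frac{\phi N}{12}\right)^{\frac{1}{3}}$

where $\phi$ is `ar`.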
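
To make the recipe concrete, here is a minimal sketch of the three steps for a single target size. The helper name `shape_from_params` and the example values of `N`, `ar`, and `attn_dim` are assumptions chosen for illustration (picked so the result comes out as whole numbers); they are not taken from Figure 5.

```python
def shape_from_params(N, ar, attn_dim):
    """Hypothetical helper: unrounded (d_model, n_layer, n_head) for a target N."""
    # d_model from (5) with n_layer = d_model / ar substituted in
    d_model = (ar * N / 12) ** (1 / 3)
    # n_layer from the aspect ratio (2)
    n_layer = d_model / ar
    # n_head from the attention head dimension (3)
    n_head = d_model / attn_dim
    return d_model, n_layer, n_head


# illustrative target: N = 12 * n_layer * d_model**2 with n_layer=12, d_model=768
N = 12 * 12 * 768 ** 2
d_model, n_layer, n_head = shape_from_params(N, ar=64, attn_dim=64)
print(round(d_model), round(n_layer), round(n_head))  # prints: 768 12 12
```

The values are rounded only at the end, so the realised parameter count $12n_{layer}d_{model}^2$ of the rounded shape can differ slightly from the target $N$.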

In [None]:
! pip install wandb
import wandb
import numpy as np

In [None]:
sweep_config = {
    'name': 'Scaling Laws for Neural Language Models sweep',
    'method': 'grid',
    'metric': {
        'name': 'loss',
        'goal': 'minimize'
    },
    'parameters': {
        'valid_set': {
          'values': [
                    # This will be a list of strings
          ]
        }
    }
}

param_dict = {
  # these are the ranges to sweep over
  # ballpark numbers from Figure 5
  # N = e**exponent, i.e. roughly 2.2e4 to 1.3e9 non-embedding params
  'exponent': list(range(10, 22)),
  'ar': [round(10**x) for x in np.linspace(1,2.5,3)],
  'attn_dim': [round(10**x) for x in np.linspace(1.5,2.5,3)],
}

for exponent in param_dict['exponent']:
  N = np.exp(exponent)
  # add LR according to equation D.1 from Kaplan et al.,
  # "Scaling Laws for Neural Language Models"
  lr = 0.003239 + (-0.0001395)*np.log(N)
  for ar in param_dict['ar']:
    # substitute for n_layer, solve for d_model
    d_model = (N*ar/12)**(1/3)
    # calculate n_layer
    n_layer = N/12/(d_model**2)
    if n_layer < 1:
      # don't clip n_layer to 1; 'ar' values are ascending and
      # n_layer shrinks as ar grows, so skip the remaining ones
      break
    for attn_dim in param_dict['attn_dim']:
      # add n_head per attn_dim
      n_head = d_model/attn_dim
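      if n_head < 1:
        # don't clip n_head to 1; 'attn_dim' values are ascending and
        # n_head shrinks as attn_dim grows, so skip the remaining ones
        break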
      # add this combination as a string to sweep_config
      sweep_config['parameters']['valid_set']['values'].append(
          ','.join(
              [str(round(x)) for x in \
               [exponent,n_layer, d_model, n_head]] + \
               [str(float(lr))])
      )

In [None]:
# check config
import pandas as pd
df = pd.DataFrame(
    [[float(x) for x in y.split(',')] for y in \
     sweep_config['parameters']['valid_set']['values']],
    columns=['exponent', 'n_layer', 'd_model', 'n_head', 'lr'],
    )
df.set_index('exponent', inplace=True)

# print some examples
for i in df.index.unique():
  print(df.loc[i])

In [None]:
def train():
    run = wandb.init()
    print(run.config.valid_set)
    # map each comma-separated field to its name
    # (named cfg so the built-in vars() is not shadowed)
    cfg = {k:v for k,v in zip(
        # these are from neox_arguments.md
        ['exponent',
         'num_layers', # "n_layers" (GPT)
         'hidden_size', # "d_model" (GPT)
         'num_attention_heads', # "n_heads" (GPT)
         'lr' # "learning_rate" (GPT)
         ],
        [float(x) for x in run.config.valid_set.split(',')]
    )}
    print('Exponent (model size) {}, num_layers {}, hidden_size {}, diff(N,12*num_layers*hidden_size**2) {} percent'.format(
        cfg['exponent'], cfg['num_layers'], cfg['hidden_size'],
        (np.exp(cfg['exponent'])-(12*cfg['num_layers']*cfg['hidden_size']**2))/np.exp(cfg['exponent'])*100
    ))
    run.finish()

sweep_id = wandb.sweep(sweep_config)
# wandb.agent starts the agent itself and blocks while it runs,
# so no separate run() call is needed
wandb.agent(sweep_id, function=train)