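In this notebook, you will

- build on your existing knowledge of PyTorch and deep learning basics
- train a small character-level model that generates names, then fine-tune it with a basic policy gradient and with proximal policy optimization (PPO)

Prior knowledge of reinforcement learning is helpful but not required.

The notebook focuses on
- easy-to-understand, incremental steps rather than mathematical rigor
- simplifying (without trivializing) the algorithms when possible
- a narrow slice of reinforcement learning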

For a more mathematical treatment of reinforcement learning, policy gradients, and proximal policy optimization, check out:

- [Sutton & Barto, "Reinforcement Learning: An Introduction"](http://incompleteideas.net/book/the-book-2nd.html): the classic RL textbook, especially Chapter 13 on policy gradient methods
- [OpenAI Spinning Up: Policy Gradient Methods](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#policy-gradient-methods): practical and mathematical overview
- [Proximal Policy Optimization Algorithms (Schulman et al., 2017)](https://arxiv.org/abs/1707.06347): original PPO paper
- [Lil'Log: Proximal Policy Optimization Explained](https://lilianweng.github.io/posts/2018-06-24-policy-gradient/#ppo): blog post with clear math and diagrams


Before we get started, let's make sure you have the correct version of PyTorch installed. If you have an older version or want to start fresh, you can uninstall the current PyTorch, torchvision, and torchaudio packages using the following command.

In [None]:
# !pip3 uninstall -y torch torchvision torchaudio

If you need to install PyTorch with CUDA support (for GPU acceleration), you can use the following command. Make sure your environment supports CUDA 11.8.

In [None]:
# !pip3 install torch --index-url https://download.pytorch.org/whl/cu118

Let's import PyTorch and check the version to make sure everything is set up correctly.

In [None]:
import torch as pt
pt.__version__

Now, let's set up the device for computation. We'll use the GPU if it's available, otherwise we'll fall back to the CPU. The `nvidia-smi` command will show you information about your GPU if one is present.

In [None]:
device = "cuda" if pt.cuda.is_available() else "cpu"
!nvidia-smi
device

We'll use a dataset of names for our experiments. Let's fetch a list of names from GitHub using Python's `urllib` library. This will give us a simple, real-world dataset to work with.

In [None]:
import urllib.request  # a bare `import urllib` does not guarantee the `request` submodule is loaded
with urllib.request.urlopen('https://raw.githubusercontent.com/karpathy/makemore/master/names.txt') as resp:
  src = resp.read().decode('utf-8')
src

Now, let's convert the downloaded text into a Python list of names. We'll take a quick look at the first few names to make sure everything looks good.

In [None]:
names = src.splitlines()
names[:5]

Next, we'll extract the unique characters (tokens) used in the dataset. We'll also add special tokens for the start (`_`) and end (`.`) of a name. This will help us later when we encode and decode names.

In [None]:
tokens = "_" + "." + "".join(sorted(set("".join(names))))
tokens

To work with our tokens, we'll create two dictionaries: `stoi` (string-to-index) and `itos` (index-to-string). These will let us easily convert between characters and their corresponding indices.

In [None]:
stoi = {v:k for k,v in enumerate(tokens)}


In [None]:
print(stoi)

{'_': 0, '.': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27}


In [None]:
itos = {k:v for k,v in enumerate(tokens)}

In [None]:
print(itos)

{0: '_', 1: '.', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'q', 19: 'r', 20: 's', 21: 't', 22: 'u', 23: 'v', 24: 'w', 25: 'x', 26: 'y', 27: 'z'}


Let's define two simple functions: `enc` to encode a name as a list of indices, and `dec` to decode a list of indices back to a string. This will make it easy to switch between string and numeric representations.

In [None]:
enc = lambda name: [stoi[s] for s in name]
dec = lambda chars: "".join(itos[i] for i in chars)


In [None]:
dec(enc('_emma.'))

'_emma.'

Now, let's create a function `stot` that converts a string into a one-hot encoded tensor. This will be useful for representing names in a format suitable for neural networks.

In [None]:
stot = lambda name: pt.nn.functional.one_hot(pt.tensor(enc(name)), len(tokens))

In [None]:
stot('_emma.')

tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]])

To go back from a one-hot encoded tensor to a string, we'll define the `ttos` function. This will help us interpret the outputs of our model later.

In [None]:
ttos = lambda x: dec(x.argmax(-1).tolist())

In [None]:
ttos(stot('_emma.'))

'_emma.'

Define a constant `CTX_SZ` for the size of the context window the model uses to predict the next character. For example, with `CTX_SZ = 4` and the name `olivia`:
- the window `____` (that is, `CTX_SZ` start-of-name tokens) predicts the token `o`
- the window `oliv` predicts the token `i`
- the window `livi` predicts the token `a`

In [None]:
CTX_SZ = 4
CTX_SZ

4
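
The windowing described above can be sketched in plain Python, without tensors. This is only an illustration: the helper name `windows` is hypothetical and not part of the notebook, and it assumes `_` as the start token and `.` as the end token, as defined earlier.

```python
CTX_SZ = 4  # same window size as in the notebook

def windows(name, ctx_sz=CTX_SZ):
    """Return (window, next_token) pairs for one name (hypothetical helper)."""
    # Pad with start tokens and close with the end token.
    padded = ctx_sz * "_" + name + "."
    # Each window of ctx_sz characters is paired with the character that follows it.
    return [(padded[i:i + ctx_sz], padded[i + ctx_sz])
            for i in range(len(padded) - ctx_sz)]

pairs = windows("olivia")
for window, target in pairs:
    print(window, "->", target)
```

Each pair corresponds to one training example: the window is what the model sees, and the target is the token it should predict.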

With the `CTX_SZ` constant defined, you can illustrate how a name like `emma` is represented as a one-hot (sparse) encoding using `stot`.

Create a tensor called `name` from `CTX_SZ` start tokens `_`, the name itself, and a single end token `.`. Also, report the shape of the `name` tensor.

In [None]:
name = stot(CTX_SZ * "_" + "emma" + ".")
name, name.shape


The rest of the notebook departs from common machine learning terminology to make the upcoming use of reinforcement learning easier to follow. The inputs to the model (usually called `X`) are called **observations** (`obs`), and the outputs of the model (usually called `y`) are called **actions**. You are probably comfortable with training a model on `(X, y)` pairs; here the same pairs are written as `(obs, action)`. You can think of an action as the token that the model predicts as its output, so generating a name amounts to taking a series of actions, each picking the next token.

PyTorch has a convenient `unfold` function to take a tensor like `name` and convert it to a sequence of observations, each with a window length of `CTX_SZ`. For example, you can `unfold` the `name` tensor along the first (0th) dimension using the step size of 1. For convenience, permute the shape of the resulting tensor to swap the last two dimensions.

In [None]:
name.unfold(0, CTX_SZ, 1).permute(0, 2, 1)

tensor([[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0]],

        [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0]],

        [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

To make the output of `unfold` clearer, you can use `ttos` to convert the tensors back to strings. For a name like `emma` you should get observations like

- `___e`
- `__em`
- `_emm`
- etc

**NOTE:** Don't forget to drop the last window, which ends with the `.` token and has no next token left to predict.

In [None]:
[ttos(x) for x in name.unfold(0, CTX_SZ, 1)[:-1, :, :].permute(0, 2, 1)]

['____', '___e', '__em', '_emm', 'emma']
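
To see why the `permute` is needed, it can help to look at the shapes on a toy tensor. The sketch below is an illustration only; the tensor `toy` and the variable names are made up for this example. `unfold` places the new window dimension last, so the windows come out as `(windows, features, window_size)` and the last two dimensions are swapped to get one row per token.

```python
import torch

# Toy "name": 6 time steps, each with 2 features (a stand-in for one-hot rows).
toy = torch.arange(12).reshape(6, 2)

# Sliding windows of size 4 with step 1 along dimension 0.
unfolded = toy.unfold(0, 4, 1)
print(unfolded.shape)   # (windows, features, window_size)

# Swap the last two dimensions to get (windows, window_size, features).
toy_obs = unfolded.permute(0, 2, 1)
print(toy_obs.shape)

# After the permute, the first window is simply the first 4 rows of the input.
print(torch.equal(toy_obs[0], toy[:4]))
```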

Since you are going to need to `unfold` all of the names in the data set, create a function `name_to_obs` that implements this capability.

In [None]:
name_to_obs = lambda name: stot(name).unfold(0, CTX_SZ, 1)[:-1, :, :].permute(0, 2, 1)

In [None]:
name_to_obs('____emma.').shape

torch.Size([5, 4, 28])

Next, let's create another function `name_to_action` that converts a name to all the actions (i.e. output tokens) that should be predicted by the model.

In [None]:
name_to_action = lambda name: stot(name)[CTX_SZ:]

In [None]:
y = name_to_action('____emma.')
y.shape

torch.Size([5, 28])

In [None]:
ttos(name_to_action('____emma.'))

'emma.'

Now you are ready to create the data set to train your model. 

- convert all the names in `names` to the observations
- use `cat` to concatenate all the sparse embeddings of the names to a data set tensor `X_data`

In [None]:
X_data = [name_to_obs(CTX_SZ * "_" + name + ".") for name in names]
X_data = pt.cat(X_data)

In [None]:
[ttos(x) for x in X_data[:10]], X_data.shape

(['____',
  '___e',
  '__em',
  '_emm',
  'emma',
  '____',
  '___o',
  '__ol',
  '_oli',
  'oliv'],
 torch.Size([228146, 4, 28]))

Move the `X_data` tensor to your `device`.

In [None]:
X_data = X_data.to(device)

Next, create the `y_data` tensor using `name_to_action`.

In [None]:
y_data = pt.cat([name_to_action(CTX_SZ * "_" + name + ".") for name in names])


In [None]:
ttos(y_data[:10]), y_data.shape

('emma.olivi', torch.Size([228146, 28]))

Move the `y_data` tensor to your `device`.

In [None]:
y_data = y_data.to(device)

The data set should be shuffled before use for training. Use the `randperm` function to shuffle both the `X_data` and the `y_data` tensors along the 0th dimension.

**NOTE:** Don't forget to set the seed using `manual_seed`.

In [None]:
pt.manual_seed(42)
idx = pt.randperm(len(y_data))
idx.shape

torch.Size([228146])

In [None]:
X_data, y_data = X_data[idx], y_data[idx]
X_data.shape, y_data.shape

(torch.Size([228146, 4, 28]), torch.Size([228146, 28]))

Since the data set is fairly large, let's use a 90%, 5%, 5% split for the training, validation, and test data sets respectively.

In [None]:
val_idx, test_idx = int(len(X_data) * .9), int(len(X_data) * .95)
X_train, y_train = X_data[:val_idx], y_data[:val_idx]
X_val, y_val = X_data[val_idx:test_idx], y_data[val_idx:test_idx]
X_test, y_test = X_data[test_idx:], y_data[test_idx:]

X_train.shape, X_val.shape, X_test.shape

(torch.Size([205331, 4, 28]),
 torch.Size([11407, 4, 28]),
 torch.Size([11408, 4, 28]))

In [None]:
class CharModel(pt.nn.Module):
  def __init__(self, tokens_sz, ctx_sz, emb_sz, head_sz, n_heads, device):
    super().__init__()
    
    self.tok_emb = pt.nn.Embedding(tokens_sz, emb_sz, device = device)
    self.pos_emb = pt.nn.Embedding(ctx_sz, emb_sz, device = device)
    self.pos_idx = pt.arange(ctx_sz, device = device)
    
    self.kw = pt.nn.Linear(emb_sz, head_sz, device = device, bias = False)
    self.qw = pt.nn.Linear(emb_sz, head_sz, device = device, bias = False)
    self.vw = pt.nn.Linear(emb_sz, head_sz, device = device, bias = False)
    self.mhsa = pt.nn.MultiheadAttention(head_sz, n_heads, batch_first = True, device = device)
    self.mhsa_ln = pt.nn.LayerNorm(head_sz, device = device)
    self.flatten = pt.nn.Flatten(1)
    self.relu = pt.nn.ReLU()
    self.mhsa_head = pt.nn.Linear(head_sz * ctx_sz, tokens_sz, device = device)


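  def forward(self, x):
    # x: (batch, ctx_sz) tensor of token indices
    x = self.tok_emb(x) + self.pos_emb(self.pos_idx)  # (batch, ctx_sz, emb_sz)
    q, k, v = self.qw(x), self.kw(x), self.vw(x)      # each (batch, ctx_sz, head_sz)
    x, _ = self.mhsa(q, k, v)                         # self-attention; attention weights are discarded
    x = self.mhsa_ln(x)
    x = self.flatten(x)                               # (batch, ctx_sz * head_sz)
    x = self.relu(x)
    x = self.mhsa_head(x)                             # (batch, tokens_sz) logits
    return x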



In [None]:
EMB_SZ = 32
MHSA_HEAD_SZ = 64
N_HEADS = 2
nn = CharModel(len(tokens), CTX_SZ, EMB_SZ, MHSA_HEAD_SZ, N_HEADS, device)
nn

CharModel(
  (tok_emb): Embedding(28, 32)
  (pos_emb): Embedding(4, 32)
  (kw): Linear(in_features=32, out_features=64, bias=False)
  (qw): Linear(in_features=32, out_features=64, bias=False)
  (vw): Linear(in_features=32, out_features=64, bias=False)
  (mhsa): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
  )
  (mhsa_ln): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (relu): ReLU()
  (mhsa_head): Linear(in_features=256, out_features=28, bias=True)
)

In [None]:
ctx = pt.stack([stot("_" * CTX_SZ).argmax(-1).to(device)])
ctx

tensor([[0, 0, 0, 0]], device='cuda:0')

In [None]:
nn(ctx).shape

torch.Size([1, 28])

In [None]:
X_batch, y_batch = X_train.argmax(-1), y_train.argmax(-1)
import copy
model = copy.deepcopy(nn)
optim = pt.optim.AdamW(model.parameters())
optim

AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.001
    maximize: False
    weight_decay: 0.01
)

In [None]:
def generate(model, seed = 42, device = device, ctx_sz = CTX_SZ, num_samples = 10, max_tokens = 16):
  pt.manual_seed(seed)
  samples = []  
  ctx = pt.stack([stot(ctx_sz * "_").argmax(-1).to(device)])
  for _ in range(num_samples):
    sample = ctx.clone()
    for _ in range(max_tokens):
      logits = model(sample[:, -ctx_sz:])
      probs = pt.nn.functional.softmax(logits, -1)
      tok = pt.multinomial(probs, 1)
      if tok.item() == stoi["."]:  # stop at the end-of-name token
        break
      sample = pt.cat([sample, tok], 1)

    sample = dec(sample.squeeze().tolist()[ctx_sz:])
    samples.append(sample)
  return samples

In [None]:
EPOCHS = 1000
for epoch in range(EPOCHS):
  logits = model(X_batch)  
  loss = pt.nn.functional.cross_entropy(logits, y_batch)
  if epoch % 100 == 0:
    with pt.no_grad():
      val_loss = pt.nn.functional.cross_entropy(model(X_val.argmax(-1)), y_val.argmax(-1))
      print(f"epoch {epoch:2d} train loss {loss.item():.4f} val loss {val_loss.item():.4f}")
      samples = generate(model, device = device)
      print(",".join(samples))
      print()
  loss.backward()
  optim.step()
  optim.zero_grad()

epoch  0 train loss 3.5386 val loss 3.5430
uqhs,ydjxufqboodnojfa,fbmpphjlmhqhoghh,xbehaiwpbue_asxt,vvzupofylhhqvxhh,zpnhnmxtptbhboln,zr_vzrlyykugprtl,mplseyvksx_ukgbb,ucrsvcmypnqjddxt,qtxcsugychjgsosq

epoch 100 train loss 2.2509 val loss 2.2490
jehi,instulieon,noria,kare,jalyah,zahnielai,kauena,anon,nerfylelavahe,dahnantet

epoch 200 train loss 2.1579 val loss 2.1654
jehs,inshuilynn,nora,remon,jalyah,zahnielai,kaiena,anovennalylee,varez,nhnaxtathan

epoch 300 train loss 2.1179 val loss 2.1332
jehi,idhani,aonden,caredon,jamyah,zahnielai,kaielassir,zapry,lehav,rezan

epoch 400 train loss 2.0943 val loss 2.1141
jehi,idhani,amoden,caredon,jamyah,zahnielai,kaielassir,zapry,lehav,hezan

epoch 500 train loss 2.0775 val loss 2.1025
jehi,inshuit,onden,caredon,jamyah,zahnielai,kaielastir,zaedy,lehav,rezan

epoch 600 train loss 2.0656 val loss 2.0945
jehi,insey,kamoden,caredon,jamyah,zahnielai,kaielastir,zapto,leeg,lana

epoch 700 train loss 2.0547 val loss 2.0859
jehi,inshuir,onden,caredon,jamy

# Human preferences for reinforcement learning with human feedback

In [None]:
import re
liked_names = [name for name in src.splitlines() if bool(re.match('^[^aeiou][aeiou][^aeiou]$', name))]
liked_names[:5]

['joy', 'liv', 'luz', 'sol', 'may']

In [None]:
names = [CTX_SZ * "_" + name + "." for name in liked_names]
names[:5]

['____joy.', '____liv.', '____luz.', '____sol.', '____may.']

In [None]:
obs = pt.cat([name_to_obs(name).argmax(-1) for name in names])
obs = obs.to(device)
obs.shape

torch.Size([936, 4])

In [None]:
actions = pt.cat([name_to_action(name).argmax(-1) for name in names])
actions = actions.to(device)
actions.shape

torch.Size([936])

In [None]:
def reward(name, gamma = .9):
  idx = pt.arange(len(name) - 1, -1, -1).to(device)
  return gamma ** idx

reward(names[0][CTX_SZ:])

tensor([0.7290, 0.8100, 0.9000, 1.0000], device='cuda:0')
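
The `reward` function above gives each token of a liked name a weight of `gamma` raised to its distance from the end of the name, so the final `.` token gets 1.0 and earlier tokens get progressively less. A plain-Python sketch of the same arithmetic follows; the helper name `discounted_weights` is hypothetical and used only for illustration.

```python
def discounted_weights(n_tokens, gamma=0.9):
    """Weight per position: gamma ** (steps remaining until the last token)."""
    return [gamma ** (n_tokens - 1 - i) for i in range(n_tokens)]

# "joy." has four tokens: three letters plus the end token.
weights = discounted_weights(len("joy."))
print([round(w, 4) for w in weights])  # [0.729, 0.81, 0.9, 1.0]
```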

In [None]:
rewards = pt.cat([reward(name[CTX_SZ:]) for name in names])
rewards[:10], actions[:10]

(tensor([0.7290, 0.8100, 0.9000, 1.0000, 0.7290, 0.8100, 0.9000, 1.0000, 0.7290,
         0.8100], device='cuda:0'),
 tensor([11, 16, 26,  1, 13, 10, 23,  1, 13, 22], device='cuda:0'))

In [None]:
import copy
rl_model = copy.deepcopy(model)

rl_optim = pt.optim.AdamW(rl_model.parameters())
rl_optim.zero_grad()

### Basic Policy Gradient Reinforcement Learning
* aka REINFORCE or vanilla policy gradient
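
The loss minimized by the code in this section can be written as follows. The symbols are introduced here only for illustration and do not appear in the code: $N$ is the number of (observation, action) pairs, $r_i$ is the entry of `rewards` for pair $i$, and $\pi_\theta(a_i \mid o_i)$ is the probability that `rl_model` assigns to action $a_i$ given observation $o_i$.

$$
L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} r_i \, \log \pi_\theta(a_i \mid o_i)
$$

Minimizing this loss raises the log-probability of each liked token in proportion to its reward. Note that the pairs here come from the fixed list of liked names rather than from samples drawn from the current policy, which is one of the simplifications this notebook makes.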

In [None]:
STEPS = 20
for step in range(STEPS):
  logits = rl_model(obs)
  log_prob_dist = pt.nn.functional.log_softmax(logits, -1)
  log_probs = log_prob_dist[pt.arange(len(actions)), actions]

  #policy gradient
  loss = -(rewards * log_probs).mean()
  
  with pt.no_grad():
    samples = generate(rl_model, device = device, seed = 42 + step)
    new_names = set(samples) - set(liked_names)
    print(f"step = {step:2d} loss={loss.item():.4f} net_new_names={len(new_names):2d}")
    print(",".join(new_names))

  loss.backward()
  rl_optim.step()
  rl_optim.zero_grad()

step =  0 loss=2.5757 net_new_names=10
kaielastion,honah,aelai,inslyn,rendy,varez,jehi,neslylee,amoden,carman
step =  1 loss=1.7851 net_new_names= 9
dhik,wrodsi,fahna,buuri,neba,brinleigh,que,yuwanni,sair
step =  2 loss=1.6210 net_new_names= 8
dway,am,le,stan,dd,mad,na,saj
step =  3 loss=1.5903 net_new_names= 7
bres,jen,ma,din,ed,cor,al
step =  4 loss=1.5442 net_new_names= 8
jo,xyn,juss,jov,vih,lus,joh,le
step =  5 loss=1.4966 net_new_names= 7
tz,kil,gyn,k,am,blo,mux
step =  6 loss=1.4621 net_new_names= 5
vaz,ky,kex,pay,gi
step =  7 loss=1.4353 net_new_names= 8
d,lel,dal,nak,day,ney,let,n
step =  8 loss=1.4096 net_new_names= 7
jey,veg,xy,ced,azsi,za,dol
step =  9 loss=1.3823 net_new_names= 8
kad,laz,ric,tz,saz,run,nom,lez
step = 10 loss=1.3604 net_new_names= 5
kid,rax,jol,had,zem
step = 11 loss=1.3456 net_new_names= 6
las,kad,bed,zur,ceg,sten
step = 12 loss=1.3354 net_new_names= 4
x,jahd,kan,c
step = 13 loss=1.3256 net_new_names= 6
kah,jej,raw,ar,nav,ten
step = 14 loss=1.3172 net_new_n

### Proximal Policy Optimization (PPO) Reinforcement Learning
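
PPO limits how far a single update can move the policy by clipping the probability ratio between the policy being trained and a reference policy. Below is a minimal plain-Python sketch of the clipped objective for a single (observation, action) pair, assuming the clipping range of 0.8 to 1.2 used in this section. The function name `clipped_objective` and its parameters are hypothetical, and the reward stands in for the advantage term used in the PPO paper.

```python
import math

def clipped_objective(log_prob, ref_log_prob, reward, lo=0.8, hi=1.2):
    """Clipped surrogate for one sample: min(r * ratio, r * clip(ratio, lo, hi))."""
    ratio = math.exp(log_prob - ref_log_prob)  # pi_new(a|o) / pi_ref(a|o)
    clipped = min(max(ratio, lo), hi)          # keep the ratio inside [lo, hi]
    return min(reward * ratio, reward * clipped)

# Ratio of about 1.5 with a positive reward: the gain is capped at reward * 1.2.
capped = clipped_objective(math.log(0.3), math.log(0.2), reward=1.0)
# Ratio of about 0.5: the unclipped term is the smaller one, so it is kept.
kept = clipped_objective(math.log(0.1), math.log(0.2), reward=1.0)
print(capped, kept)
```

The training loss is the negative mean of this quantity over all pairs, so once the ratio leaves the clipping range in the direction that increases the objective, that pair stops contributing gradient.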

In [None]:
rl_model = copy.deepcopy(model)
ref_model = copy.deepcopy(model)

rl_optim = pt.optim.Adam(rl_model.parameters())

In [None]:
STEPS = 20
for step in range(STEPS):
  logits = rl_model(obs)
  log_probs = pt.nn.functional.log_softmax(logits, -1)[pt.arange(len(actions)), actions]
  with pt.no_grad():
    ref_log_probs = pt.nn.functional.log_softmax(ref_model(obs), -1)[pt.arange(len(actions)), actions]
  
  ratio = log_probs - ref_log_probs
  ratio = ratio.exp()

  ppo_loss1 = rewards * ratio
  ppo_loss2 = rewards * pt.clamp(ratio, .8, 1.2)

  # PPO clipped surrogate loss
  loss = -pt.min(ppo_loss1, ppo_loss2).mean()
  
  with pt.no_grad():
    samples = generate(rl_model, device = device, seed = 42 + step)
    new_names = set(samples) - set(liked_names)
    print(f"step = {step:2d} loss={loss.item():.4f} net_new_names={len(new_names):2d}")
    print(",".join(new_names))

  loss.backward()
  rl_optim.step()
  rl_optim.zero_grad()

step =  0 loss=2.5757 net_new_names=10
kaielastion,honah,aelai,inslyn,rendy,varez,jehi,neslylee,amoden,carman
step =  1 loss=1.7851 net_new_names= 9
dhik,wrodsi,fahna,buuri,neba,brinleigh,que,yuwanni,sair
step =  2 loss=1.6210 net_new_names= 8
dway,am,le,stan,dd,mad,na,saj
step =  3 loss=1.5903 net_new_names= 7
bres,jen,ma,din,ed,cor,al
step =  4 loss=1.5442 net_new_names= 8
jo,xyn,juss,jov,vih,lus,joh,le
step =  5 loss=1.4965 net_new_names= 7
tz,kil,gyn,k,am,blo,mux
step =  6 loss=1.4621 net_new_names= 5
vaz,ky,kex,pay,gi
step =  7 loss=1.4355 net_new_names= 8
d,lel,dal,nak,day,ney,let,n
step =  8 loss=1.4099 net_new_names= 8
xy,laz,azsi,veg,dol,jey,ced,za
step =  9 loss=1.3823 net_new_names= 8
kad,laz,ric,tz,saz,run,nom,lez
step = 10 loss=1.3604 net_new_names= 5
kid,rax,jol,had,zem
step = 11 loss=1.3456 net_new_names= 6
las,kad,bed,zur,ceg,sten
step = 12 loss=1.3353 net_new_names= 4
x,jahd,kan,c
step = 13 loss=1.3252 net_new_names= 6
kah,jej,raw,ar,nav,ten
step = 14 loss=1.3169 net_n