# Makemore

A character-level language model trained in an auto-regressive fashion.

Becoming a backprop ninja, swole doge style.

## Yes you should understand backprop.

> The problem with Backpropagation is that it is a **leaky abstraction**.

It is easy to fall into the trap of auto-differentiating your way through a lifetime of engineering Machine Learning systems - believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.

A few common errors that are difficult to debug:

* Vanishing gradients on sigmoids
  * If we are sloppy with the weight initialization or data preprocessing, the sigmoid can “saturate” and entirely stop learning: the training loss will be flat and refuse to go down. This is because the gradient of the sigmoid is close to zero when its input is large in magnitude, whether positive or negative.
  * If the weight matrix is initialized too large, the output of the matrix multiply could have a very large range, which will make all elements in the sigmoid output (say, `z`) almost binary: either 1 or 0. Then the local gradient, `z*(1-z)`, will in both cases become zero (“vanish”), making the gradient for W zero. The rest of the backward pass will come out all zero from this point on due to multiplication in the chain rule.
  * The sigmoid local gradient (`z*(1-z)`) achieves a maximum of 0.25, at z = 0.5. Thus, every time the gradient flows through a sigmoid, its magnitude shrinks to one quarter of its value or less. Stacking MLP layers with sigmoids can therefore make the lower network layers train much slower than the higher ones.
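
To make these numbers concrete, here is a minimal sketch in plain Python (no autograd) that evaluates the sigmoid's local gradient `z*(1-z)` at a few inputs. The helper names `sigmoid` and `sigmoid_local_grad` are illustrative and not part of the notebook, and the per-layer bound ignores the weight matrices, which also scale the gradient.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_local_grad(x):
    """Local gradient dz/dx = z * (1 - z), where z = sigmoid(x)."""
    z = sigmoid(x)
    return z * (1.0 - z)

# The local gradient peaks at x = 0 (z = 0.5) and collapses as |x| grows.
for x in [0.0, 2.0, 5.0, 10.0, -10.0]:
    print(f"x = {x:6.1f}   z = {sigmoid(x):.5f}   dz/dx = {sigmoid_local_grad(x):.2e}")

# Upper bound on the sigmoid contribution after n sigmoids in a row.
for n in [1, 5, 10]:
    print(f"{n:2d} sigmoid layers: factor of at most {0.25 ** n:.2e}")
```

Saturated inputs such as `x = 10` give a local gradient several orders of magnitude below the 0.25 peak, which is the flat-loss symptom described above.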

* Dying ReLU
  * If a neuron gets clamped to zero in the forward pass, i.e., it doesn’t “fire”, then its weights will get zero gradient.
  * This can lead to what is called the “dead ReLU” problem: if a ReLU neuron is unfortunately initialized such that it never fires, or if its weights are ever knocked into this regime by a large update during training, then the neuron will remain permanently dead. It’s like permanent, irrecoverable brain damage.
  * Neurons can also die during training, usually as a symptom of aggressive learning rates.
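
The mechanism can be sketched by writing out the backward pass of a single ReLU neuron by hand. Everything here is hypothetical: the function `relu_neuron_grads`, the toy inputs, and the two weight settings are chosen only to contrast a neuron that fires with one whose bias has been pushed far negative.

```python
def relu_neuron_grads(w, b, x, upstream=1.0):
    """Forward and backward pass of out = relu(w . x + b) for one input x.

    Returns (out, dw, db), where dw and db are the gradients of the loss
    with respect to w and b, given the upstream gradient d(loss)/d(out).
    """
    pre = sum(wi * xi for wi, xi in zip(w, x)) + b
    out = max(0.0, pre)
    gate = 1.0 if pre > 0 else 0.0  # local gradient of ReLU
    dw = [upstream * gate * xi for xi in x]
    db = upstream * gate
    return out, dw, db

# Hypothetical inputs, all with features in [0, 1].
inputs = [[0.2, 0.9], [0.5, 0.1], [1.0, 1.0], [0.0, 0.3]]

# A healthy neuron versus one whose bias was knocked far negative.
healthy = ([0.5, -0.3], 0.1)
dead = ([0.5, -0.3], -10.0)

for name, (w, b) in [("healthy", healthy), ("dead", dead)]:
    total_dw = [0.0, 0.0]
    for x in inputs:
        _, dw, _ = relu_neuron_grads(w, b, x)
        total_dw = [t + g for t, g in zip(total_dw, dw)]
    print(name, "summed dw over the batch:", total_dw)
```

For the dead neuron the gate is zero on every input, so the summed gradient is exactly zero and no gradient step can move its weights back into the firing regime.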

* Exploding gradients in RNNs
  * Consider a vanilla RNN unrolled for T time steps. When you stare at what the backward pass is doing, you’ll see that the gradient signal going backwards in time through all the hidden states is always being multiplied by the same matrix (the recurrence matrix Whh), interspersed with non-linearity backprop.
  * What happens when you take one number a and start multiplying it by some other number b (i.e. a*b*b*b*b*b*b…)? This sequence goes to zero if |b| < 1, and explodes to infinity if |b| > 1. The same thing happens in the backward pass of an RNN, except b is a matrix and not just a number, so we have to reason about its largest eigenvalue instead.
  * TLDR: If you understand backpropagation and you’re using RNNs, you know to be wary of exploding gradients, and you either apply gradient clipping or prefer to use an LSTM. A longer explanation is given in a CS231n lecture video.
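
The scalar analogy is easy to check numerically. The sketch below, in plain Python, multiplies a starting value by `b` one hundred times and then applies clipping by L2 norm to the exploded result. The scalar `b` stands in for the largest eigenvalue magnitude of Whh, the non-linearity is ignored, and norm clipping is only one common variant of gradient clipping, so treat the helpers `repeated_product` and `clip_by_norm` as assumptions made for illustration.

```python
import math

def repeated_product(a, b, steps):
    """Multiply a by b `steps` times, mimicking backprop through `steps` time steps."""
    for _ in range(steps):
        a *= b
    return a

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

for b in [0.9, 1.0, 1.1]:
    print(f"b = {b}: a * b^100 = {repeated_product(1.0, b, 100):.3e}")

# A toy two-component "gradient" that exploded over 100 steps.
exploded = [repeated_product(1.0, 1.1, 100), repeated_product(-2.0, 1.1, 100)]
clipped = clip_by_norm(exploded, max_norm=5.0)
print("exploded:", exploded)
print("clipped: ", clipped)
```

Clipping bounds the size of the update but does nothing for the vanishing case (`b = 0.9`), which is one reason gated architectures such as the LSTM are the other option mentioned above.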

* Spotted in the Wild: DQN Clipping
  * If you’re familiar with DQN, you can see that there is the `target_q_t`, which is just `reward + \gamma \max_a Q(s', a)`, and then there is `q_acted`, which is `Q(s, a)` of the action that was taken. The authors here subtract the two into the variable `delta`, which they then want to minimize on line 295 with the L2 loss, `tf.reduce_mean(tf.square())`. So far so good.
  * The problem is on line 291. The authors are trying to be robust to outliers, so if the delta is too large, they clip it with `tf.clip_by_value`. This is well-intentioned and looks sensible from the perspective of the forward pass, but it introduces a major bug if you think about the backward pass.
  * The `clip_by_value` function has a local gradient of zero outside of the range `min_delta` to `max_delta`, so whenever the delta falls outside that range, the gradient becomes exactly zero during backprop. The authors are clipping the raw Q delta, when they are likely trying to clip the gradient for added robustness. In that case the correct thing to do is to use the Huber loss in place of `tf.square`.
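
The difference between the two approaches shows up directly in the gradients. The following is a minimal sketch in plain Python rather than the TensorFlow code under discussion: it compares "clip the delta, then square" against a Huber loss using finite differences. The function names, the clipping range of [-1, 1], and the Huber threshold `k = 1.0` are assumptions made for illustration.

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_square_loss(delta, lo=-1.0, hi=1.0):
    """The buggy pattern: clip the delta first, then square it."""
    return clip(delta, lo, hi) ** 2

def huber_loss(delta, k=1.0):
    """Quadratic for |delta| <= k, linear beyond that."""
    if abs(delta) <= k:
        return 0.5 * delta ** 2
    return k * (abs(delta) - 0.5 * k)

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for delta in [0.5, 3.0, -3.0]:
    g_clip = numerical_grad(clipped_square_loss, delta)
    g_huber = numerical_grad(huber_loss, delta)
    print(f"delta = {delta:5.1f}   clipped-square grad = {g_clip:6.3f}   Huber grad = {g_huber:6.3f}")
```

For large deltas the clipped-square loss gives a gradient of zero, so exactly the transitions with the largest error contribute no learning signal, while the Huber loss gives a gradient of constant magnitude `k`, which is the gradient clipping the authors presumably wanted. In PyTorch, `torch.nn.functional.huber_loss` provides a built-in version of this loss.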


**Conclusion**: Backprop is a leaky abstraction. It is important to understand the mechanics of backpropagation, and to be able to debug it when things go wrong.


In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

SEED = 2147483647

In [2]:
# read in all the words
with open('./data/names.txt', 'r') as f:  # close the file handle once read
    words = f.read().splitlines()
print(f"A few words: {words[:5]}")
print(f"Length of words: {len(words)}")


A few words: ['emma', 'olivia', 'ava', 'isabella', 'sophia']
Length of words: 32033


In [3]:
# Build the vocabulary of characters and mappings to/from integers.
chars = sorted(set(''.join(words)))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
vocab_size = len(itos)
print(itos)
print(vocab_size)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
27
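
As a quick illustration of how these two mappings are used, the sketch below rebuilds them from a three-word toy list (a stand-in for `names.txt`, so the integer ids differ from the 27-character vocabulary above) and round-trips a word. The helpers `encode` and `decode` are hypothetical names, and appending `'.'` as an end-of-word marker is an assumption about how the token at index 0 is used.

```python
# Toy stand-in for the contents of names.txt (assumption for this sketch).
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0  # special token at index 0
itos = {i: s for s, i in stoi.items()}

def encode(word):
    """Map a word to integer ids, terminated by the '.' token."""
    return [stoi[ch] for ch in word + "."]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("emma")
print(ids)          # [2, 5, 5, 1, 0] with this toy vocabulary
print(decode(ids))  # emma.
```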
