### NOTES

Initialization 

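At initialization, the activations flowing through the network (in particular the pre-activations fed into each non-linearity) should be roughly unit Gaussian: zero mean and a standard deviation close to 1. The weights are scaled to make this happen.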

This is because of the non-linear functions, whose output range is limited: (-1, 1) for tanh, [0, ∞) for ReLU, etc. Pre-activations with a standard deviation much larger than 1 are problematic because they saturate tanh, where the gradient vanishes, or leave ReLU neurons dead. E.g. the two input values -30 and -40 have the same effect: tanh outputs approximately -1 for both. Since the local gradient of tanh is 1 - tanh(x)^2, which is 0 when the output is -1, no gradient is propagated back through that neuron.
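The saturation effect can be checked with autograd. This is a minimal sketch; the input values are chosen only for illustration, with 0.5 added as a contrast point in the active region of tanh.

```python
import torch

# Two saturated pre-activations and one in the active region of tanh
x = torch.tensor([-40.0, -30.0, 0.5], requires_grad=True)
y = torch.tanh(x)

# Backpropagate a gradient of 1 into every output
y.sum().backward()

# Local gradient is 1 - tanh(x)^2: zero where the output has saturated at -1
print("outputs:  ", y.detach())
print("gradients:", x.grad)
```

The first two gradients vanish even though the inputs differ by 10, so a weight update cannot tell them apart.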

So it is important to initialize the weights properly.
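One common way to do this is to scale unit-Gaussian weights by `gain / sqrt(fan_in)` (Kaiming-style initialization). The sketch below assumes a single linear layer with hypothetical sizes and uses 5/3 as the gain, which is the value `torch.nn.init.calculate_gain('tanh')` returns.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 200, 100              # hypothetical layer sizes
x = torch.randn(1000, fan_in)           # unit-Gaussian inputs

# Naive init: unit-Gaussian weights blow up the pre-activation scale
W_naive = torch.randn(fan_in, fan_out)

# Scaled init: divide by sqrt(fan_in) and multiply by the tanh gain
gain = 5 / 3
W_scaled = torch.randn(fan_in, fan_out) * gain / fan_in**0.5

h_naive = x @ W_naive     # std grows to roughly sqrt(fan_in)
h_scaled = x @ W_scaled   # std stays close to the gain

print(f"naive std:  {h_naive.std().item():.2f}")
print(f"scaled std: {h_scaled.std().item():.2f}")
```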

But for bigger (deeper) neural networks, this becomes impractical, because it is difficult to work out the right gain for the weight initialization of every layer.

To address this problem, one can use batch normalization (BN).

It is common to sprinkle batch normalization layers throughout the network, typically right after linear or convolutional layers.
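As a rough illustration of this placement, the sketch below stacks a linear layer, a BN layer and a tanh using PyTorch's built-in modules. The layer sizes are hypothetical and not taken from these notes.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes, chosen only for illustration
n_in, n_hidden, n_out = 30, 100, 27

model = nn.Sequential(
    nn.Linear(n_in, n_hidden, bias=False),  # bias omitted; BN supplies its own shift
    nn.BatchNorm1d(n_hidden),               # normalizes each hidden unit over the batch
    nn.Tanh(),
    nn.Linear(n_hidden, n_out),
)

model.train()                               # BN uses the statistics of the current batch
logits = model(torch.randn(32, n_in))
print(logits.shape)

model.eval()                                # BN uses its stored running statistics
single = model(torch.randn(1, n_in))
print(single.shape)
```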

The stability of BN comes at a cost. BN couples the forward pass of each example to the rest of its batch, because the mean and standard deviation of the batch are now part of the forward pass. So the activations of a given example pick up some noise that depends on which other examples happen to be in the batch.
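The coupling can be made visible by normalizing the same example inside two different batches. This is a sketch with random data; `normalize` is a hypothetical helper that applies only the standardization step of BN, without the learnable scale and shift.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 4)                           # the example we track
batch_a = torch.cat([x, torch.randn(31, 4)])    # x plus 31 random companions
batch_b = torch.cat([x, torch.randn(31, 4)])    # x plus 31 different companions

def normalize(h, eps=1e-5):
    # Standardize every column with the statistics of this batch
    return (h - h.mean(0, keepdim=True)) / (h.std(0, keepdim=True) + eps)

out_a = normalize(batch_a)[0]
out_b = normalize(batch_b)[0]

print(out_a)
print(out_b)  # differs from out_a although the input row is identical
```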

But this noise acts as a regularizer and helps to prevent the network from overfitting.

BN also makes the output for a single input depend on the mean and standard deviation of the batch it arrives in. So at inference time we use fixed statistics instead: after training, compute the mean and standard deviation over the whole training set once and use these fixed values in the inference forward pass. The same statistics can also be estimated during training as running averages, which avoids the second-stage computation.
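A minimal sketch of a BN forward pass with running statistics follows. The names `stats`, `momentum` and `batchnorm` are hypothetical, and the data is synthetic (mean 5, standard deviation 3). As a simplification it tracks a running standard deviation, whereas `torch.nn.BatchNorm1d` tracks a running variance.

```python
import torch

torch.manual_seed(0)
n_hidden = 4
gamma = torch.ones(1, n_hidden)    # scale (learnable in a real layer)
beta = torch.zeros(1, n_hidden)    # shift (learnable in a real layer)
stats = {"mean": torch.zeros(1, n_hidden), "std": torch.ones(1, n_hidden)}
momentum = 0.1   # assumed smoothing factor for the running averages
eps = 1e-5       # avoids division by zero

def batchnorm(h, training):
    if training:
        # Training: normalize with the statistics of the current batch
        mean = h.mean(0, keepdim=True)
        std = h.std(0, keepdim=True)
        # Update the running estimates outside of the autograd graph
        with torch.no_grad():
            stats["mean"] = (1 - momentum) * stats["mean"] + momentum * mean
            stats["std"] = (1 - momentum) * stats["std"] + momentum * std
    else:
        # Inference: use the fixed running estimates
        mean, std = stats["mean"], stats["std"]
    return gamma * (h - mean) / (std + eps) + beta

# Simulated training batches
for _ in range(200):
    h = torch.randn(32, n_hidden) * 3 + 5
    out = batchnorm(h, training=True)

# Inference works on a single example because no batch statistics are needed
single = batchnorm(torch.randn(1, n_hidden) * 3 + 5, training=False)
print(stats["mean"], stats["std"])
print(single)
```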

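If using BN, it is better not to have a bias in the layer that feeds into it. BN subtracts the batch mean, which cancels any constant bias, so the bias has no effect and BN's own learnable shift takes its place.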
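The cancellation is easy to verify numerically. In this sketch, `normalize` is a hypothetical helper for the standardization step of BN, and the shapes are arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 10)   # a batch of 32 examples
W = torch.randn(10, 4)
b = torch.randn(4)        # bias of the layer feeding into BN

def normalize(h, eps=1e-5):
    # Standardize every column with the statistics of this batch
    return (h - h.mean(0, keepdim=True)) / (h.std(0, keepdim=True) + eps)

with_bias = normalize(x @ W + b)
without_bias = normalize(x @ W)

# The bias shifts the batch mean by the same amount, so it is subtracted out
same = torch.allclose(with_bias, without_bias, atol=1e-4)
print(same)
```

With PyTorch modules, the bias can be dropped by passing `bias=False` to `torch.nn.Linear`.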

Group normalization and layer normalization are often preferred alternatives, because BN's coupling of the examples within a batch makes it prone to subtle bugs.
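To illustrate the difference, layer normalization computes its statistics over the features of each example, so the result does not depend on the rest of the batch. The sketch below uses `torch.nn.functional.layer_norm` on random data with a hypothetical feature size of 8.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8)                          # the example we track
batch = torch.cat([x, torch.randn(31, 8)])     # the same example inside a batch

# Normalize over the last dimension (the 8 features) of each row
alone = F.layer_norm(x, (8,))[0]
in_batch = F.layer_norm(batch, (8,))[0]

# Same result with or without companions: no coupling across the batch
independent = torch.allclose(alone, in_batch, atol=1e-5)
print(independent)
```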

In [15]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt  # plotting API lives in the pyplot submodule

%matplotlib inline
DEBUG = True

def debug(*args):
    # Print only when debugging is enabled
    if DEBUG:
        print(*args)

In [17]:
# Read one name per line; the context manager closes the file afterwards
with open('names.txt', 'r') as f:
    words = f.read().splitlines()
debug(words[:8])

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']


In [26]:
# Build the encoders

chars = sorted(set("".join(words)))  # unique characters in alphabetical order
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {v:k for k,v in stoi.items()}
vocab_size = len(stoi)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
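A quick round trip shows how the two mappings are used. This sketch rebuilds them from a three-word stand-in list instead of `names.txt`, so the indices differ from the output above. `encode` and `decode` are hypothetical helpers that are not part of the notebook.

```python
# Hypothetical stand-in for the contents of names.txt
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0                                   # index 0 marks word boundaries
itos = {i: s for s, i in stoi.items()}

def encode(word):
    # Map a word to token indices, with the end marker appended
    return [stoi[c] for c in word + "."]

def decode(ids):
    # Map token indices back to characters
    return "".join(itos[i] for i in ids)

ids = encode("emma")
print(ids)
print(decode(ids))
```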


In [28]:
import random

# Build the dataset
context_size = 3
def build_dataset(words):
    X, Y = [],[]
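    for w in words:
        # Reset the context to all-'.' padding at the start of every word,
        # so that characters from the previous word do not leak into this one
        context = [0] * context_size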
        for c in w + '.':
            ix = stoi[c]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]

    X = torch.tensor(X)
    Y = torch.tensor(Y)

    return X, Y

random.seed(42)
random.shuffle(words)
n1 = int(len(words) * 0.8)
n2 = int(len(words) * 0.9)

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

debug(f"{Xtr.shape=},{Ytr.shape=}")
debug(f"{Xdev.shape=},{Ydev.shape=}")
debug(f"{Xte.shape=},{Yte.shape=}")

debug(f"{Xtr[0]=} -> {Ytr[0]=}")

Xtr.shape=torch.Size([182580, 3]),Ytr.shape=torch.Size([182580])
Xdev.shape=torch.Size([22767, 3]),Ydev.shape=torch.Size([22767])
Xte.shape=torch.Size([22799, 3]),Yte.shape=torch.Size([22799])
Xtr[0]=tensor([0, 0, 0]) -> Ytr[0]=tensor(5)
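To make the sliding window concrete, the sketch below lists the (context, target) pairs for a single word. It uses only the subset of the mapping needed for "emma", and `windows` is a hypothetical helper that mirrors the inner loop of `build_dataset`.

```python
context_size = 3
# Subset of the character mapping shown earlier, enough for the word "emma"
stoi = {".": 0, "a": 1, "e": 5, "m": 13}
itos = {i: s for s, i in stoi.items()}

def windows(word):
    # Return (context, target) index pairs for one word
    context = [0] * context_size        # start from all-'.' padding
    pairs = []
    for c in word + ".":
        ix = stoi[c]
        pairs.append((context, ix))
        context = context[1:] + [ix]    # slide the window by one character
    return pairs

pairs = windows("emma")
for ctx, ix in pairs:
    print("".join(itos[i] for i in ctx), "->", itos[ix])
```

The first pair is `[0, 0, 0] -> 5`, which matches `Xtr[0]` and `Ytr[0]` above if the first shuffled word starts with "e".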
