In [1]:
import torch
import matplotlib.pyplot as plt
%matplotlib inline

# **Task**

We want to train an MLP to learn the construction of names. To achieve this we want to maximise the likelihood of the $N$ names observed in the dataset:

$$
\max_{\theta} \prod_{i=1}^{N} \hat{p}(\mathbf{x}_i; \theta) = \max_{\theta} \prod_{i=1}^{N} \prod_{j=1}^{M_i} \hat{p}(x_j | x_{j-1}, x_{j-2}, x_{j-3}; \theta)
$$

In particular, we factor the probability of a name into the product of the conditional probabilities of each character given the three characters that precede it. Here $M_i$ is the number of prediction steps for the $i$-th name: one per character, plus one for the end marker. We use '.' as a special token to pad the start of a name and to mark its end. For example, the name "John" gives the following data points:

* ... $\rightarrow$ J
* ..J $\rightarrow$ o
* .Jo $\rightarrow$ h
* Joh $\rightarrow$ n
* ohn $\rightarrow$ .
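
The list above can be reproduced with a few lines of plain Python. This is a minimal sketch that works on strings rather than integer indices; the helper name `context_target_pairs` and its parameters are hypothetical and not part of the notebook.

```python
def context_target_pairs(name, block_size=3, pad='.'):
    """Return (context, target) string pairs for a single name."""
    # Pad the start with block_size tokens and append one end token.
    padded = pad * block_size + name + pad
    # Slide a window of block_size characters; the next character is the target.
    return [(padded[i:i + block_size], padded[i + block_size])
            for i in range(len(padded) - block_size)]

pairs = context_target_pairs('John')
for context, target in pairs:
    print(context, '->', target)
```

A name with $k$ characters yields $k + 1$ pairs, because the end marker is also a prediction target.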

We begin by constructing a training set of such data points from all the names in the dataset. We then use the negative log-likelihood of the data as the loss function to be minimized. See `mm_intro.ipynb` for more details.
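
Written out under the factorisation above, the negative log-likelihood might look as follows. Averaging over the total number of data points $T = \sum_{i=1}^{N} M_i$ is an assumption made here for illustration; the text only states that the negative log-likelihood is minimised.

$$
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{i=1}^{N} \sum_{j=1}^{M_i} \log \hat{p}(x_j | x_{j-1}, x_{j-2}, x_{j-3}; \theta)
$$

Because the logarithm is monotonic and turns products into sums, minimising $\mathcal{L}$ is equivalent to maximising the likelihood. As a reference point, a model that assigns equal probability to all 27 tokens (26 letters plus '.') has a loss of $\log 27 \approx 3.30$ per data point.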

# **Dataset**

In [2]:
# Load data
with open('names.txt', 'r') as f:
    words = f.read().splitlines()
print(words[:10])

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']


In [3]:
# Create a dictionary that maps characters to integers and vice versa
char2idx = {c: i+1 for i, c in enumerate('abcdefghijklmnopqrstuvwxyz')}
char2idx['.'] = 0 # special character for marking start and end of a word
idx2char = {i: c for c, i in char2idx.items()}
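
As a quick sanity check, the two dictionaries should invert each other. The sketch below rebuilds them so that it runs on its own; the `encode` and `decode` helpers are hypothetical names introduced only for illustration.

```python
# Rebuild the mappings from the cell above so the example is self-contained.
char2idx = {c: i + 1 for i, c in enumerate('abcdefghijklmnopqrstuvwxyz')}
char2idx['.'] = 0
idx2char = {i: c for c, i in char2idx.items()}

def encode(text):
    """Map each character to its integer index."""
    return [char2idx[c] for c in text]

def decode(indices):
    """Map integer indices back to a string."""
    return ''.join(idx2char[i] for i in indices)

encoded = encode('.emma.')
print(encoded)          # [0, 5, 13, 13, 1, 0]
print(decode(encoded))  # .emma.
```

The vocabulary has 27 entries: the 26 lowercase letters plus the '.' token at index 0.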

In [4]:
# Form training pairs of context and target characters
block_size = 3 # context size for next character prediction
X, Y = [], []

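for word in words[:5]:  # only the first 5 names are used here
    # pad with block_size start tokens and append one end token
    w2idx = [0] * block_size + [char2idx[c] for c in word] + [0]
    # slide a window of block_size indices; the next index is the target
    for i in range(len(w2idx) - block_size):
        X.append(w2idx[i:i+block_size])
        Y.append(w2idx[i+block_size])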

X = torch.tensor(X)
Y = torch.tensor(Y)

print('X:', X.shape, X.dtype, '\nY:', Y.shape, Y.dtype)
# print the first 5 samples of X and Y
# for i in range(5):
#     print(''.join([idx2char[idx.item()] for idx in X[i]]), '->', idx2char[Y[i].item()])

X: torch.Size([32, 3]) torch.int64 
Y: torch.Size([32]) torch.int64
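
The 32 rows can be checked by hand: each name contributes one sample per character plus one for the end marker. A small sketch, using the first five names from the printed list above:

```python
first_five = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

# One sample per character, plus one for predicting the end token.
samples_per_name = [len(name) + 1 for name in first_five]
total = sum(samples_per_name)

print(samples_per_name, total)  # [5, 7, 4, 9, 7] 32
```
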


# **MLP**