Exercises:

[X] E01: train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

[X] E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

[X] E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model, i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best smoothing setting and evaluate on the test set once, at the end. How good a loss do you achieve?

[X] E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

[X] E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

[X] E06: meta-exercise! Think of a fun/interesting exercise and complete it.
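
The claims behind E04 and E05 can be checked numerically. The following is a minimal sketch, not the notebook's own solution: `W`, `ix`, and `targets` are made-up stand-ins for a weight matrix, some input character indices, and their labels.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

W = torch.randn(27, 27)              # stand-in weight matrix, one row per input character
ix = torch.tensor([0, 5, 13, 26])    # a few example input character indices (assumed)
targets = torch.tensor([5, 13, 1, 0])  # example next-character labels (assumed)

# E04: multiplying a one-hot row vector by W picks out a single row of W,
# so plain indexing gives the same logits without building the one-hot matrix.
logits_onehot = F.one_hot(ix, num_classes=27).float() @ W
logits_index = W[ix]
same_logits = torch.allclose(logits_onehot, logits_index)

# E05: manual softmax followed by the mean negative log-likelihood,
# compared against F.cross_entropy applied directly to the logits.
counts = logits_index.exp()
probs = counts / counts.sum(dim=1, keepdim=True)
loss_manual = -probs[torch.arange(len(ix)), targets].log().mean()
loss_builtin = F.cross_entropy(logits_index, targets)
same_loss = torch.allclose(loss_manual, loss_builtin, atol=1e-5)

print(same_logits, same_loss)
```

One reason to prefer F.cross_entropy is numerical stability: it works from log-softmax rather than exponentiating the logits directly, so very large logits do not overflow. It also avoids materializing the intermediate probability tensor.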

In [1]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [2]:
words = open('names.txt', 'r').read().splitlines()
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [3]:
xs = []
ys = []

itos = {i + 1: chr(i + ord('a')) for i in range(26)} | {0: '.'}
stoi = {s:i for i,s in itos.items()}

for w in words:
    w = '..' + w + '.'
    for i in range(2, len(w)):
        xs.append((stoi[w[i - 2]], stoi[w[i - 1]]))
        ys.append(stoi[w[i]])

xs = torch.tensor(xs)
ys = torch.tensor(ys)
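
E01 also allows a counting approach. As a point of comparison for the neural net below, here is a rough sketch of a count-based trigram model with add-k smoothing. The helper names (`trigram_indices`, `count_trigrams`, `trigram_nll`) and the five-word `train_words` list are hypothetical, chosen only so the block runs on its own; in the notebook the counts would come from the training split of `words`.

```python
import torch

# Hypothetical stand-in for the names.txt word list, so the sketch runs on its own
train_words = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

itos = {i + 1: chr(i + ord('a')) for i in range(26)} | {0: '.'}
stoi = {s: i for i, s in itos.items()}

def trigram_indices(words):
    """Return an (n, 3) tensor of (first, second, target) character indices."""
    triples = []
    for w in words:
        w = '..' + w + '.'
        for i in range(2, len(w)):
            triples.append((stoi[w[i - 2]], stoi[w[i - 1]], stoi[w[i]]))
    return torch.tensor(triples)

def count_trigrams(words):
    """Count how often each (first, second, target) triple occurs."""
    N = torch.zeros((27, 27, 27))
    for a, b, c in trigram_indices(words).tolist():
        N[a, b, c] += 1
    return N

def trigram_nll(N, words, smoothing=1.0):
    """Average negative log-likelihood of `words` under add-k smoothed counts."""
    P = N + smoothing                   # smoothing must be > 0 to avoid empty rows
    P = P / P.sum(dim=2, keepdim=True)  # normalize over the target character
    idx = trigram_indices(words)
    return -P[idx[:, 0], idx[:, 1], idx[:, 2]].log().mean().item()

N = count_trigrams(train_words)
nll_k1 = trigram_nll(N, train_words, smoothing=1.0)
nll_k001 = trigram_nll(N, train_words, smoothing=0.01)
print(f"train NLL, k=1.0:  {nll_k1:.4f}")
print(f"train NLL, k=0.01: {nll_k001:.4f}")
```

Both numbers here are measured on the same words that were counted, so less smoothing looks better. For E02 and E03, `trigram_nll` would be called with dev words instead, where too little smoothing is expected to hurt.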


In [4]:
class TrigramModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(27, 27)
        self.linear = nn.Linear(54, 27)
    
    def forward(self, x):
        emb = self.embedding(x).view(x.shape[0], -1)
        logits = self.linear(emb)
        return logits

    def sample(self, num_samples=1, max_length=20):
        self.eval()
        with torch.no_grad():
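            # Start every sample from the '..' context; use int64 to match torch.multinomial's output
            x = torch.zeros((num_samples, 2), dtype=torch.long, device=self.linear.weight.device)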
            for _ in range(max_length):
                logits = self(x[:, -2:])
                probs = F.softmax(logits, dim=-1)
                x = torch.cat((x, torch.multinomial(probs, num_samples=1)), dim=1)
            out = []
            for i in range(num_samples):
                name = ''
                for j in range(2, x[i].shape[0]):
                    s = itos[x[i][j].item()]
                    if s == '.':
                        break
                    name += s
                out.append(name)
            return out
        

In [16]:
# device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
device = "cpu"
print(f"Using {device} device")

learning_rate = 0.01
batch_size = 256
epochs = 10

dataset = TensorDataset(xs.to(device), ys.to(device))

train_dataset, dev_dataset, test_dataset = torch.utils.data.random_split(dataset, [0.8, 0.1, 0.1])

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

model = TrigramModel().to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.001)

for epoch in range(epochs):
    
    model.train()
    for x_batch, y_batch in train_dataloader:
        logits = model(x_batch)
        loss = F.cross_entropy(logits, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # Evaluate on the full train and dev splits; no gradients are needed here
    model.eval()
    with torch.no_grad():
        x_train, y_train = train_dataset[:]
        logits_train = model(x_train)
        ce_train = F.cross_entropy(logits_train, y_train)
        print(f"Train CE: {ce_train.item():.4f}")

        x_dev, y_dev = dev_dataset[:]
        logits_dev = model(x_dev)
        ce_dev = F.cross_entropy(logits_dev, y_dev)
        print(f"Dev CE: {ce_dev.item():.4f}")


Using cpu device
Train CE: 2.3712
Dev CE: 2.3739
Train CE: 2.3587
Dev CE: 2.3602
Train CE: 2.3624
Dev CE: 2.3626
Train CE: 2.3586
Dev CE: 2.3601
Train CE: 2.3544
Dev CE: 2.3565
Train CE: 2.3567
Dev CE: 2.3607
Train CE: 2.3563
Dev CE: 2.3598
Train CE: 2.3568
Dev CE: 2.3585
Train CE: 2.3547
Dev CE: 2.3564
Train CE: 2.3537
Dev CE: 2.3560


In [18]:
model.sample(10)

['alickynni',
 'uska',
 'caikar',
 'saframi',
 'marime',
 'a',
 'lokdelona',
 'jelynn',
 'sanninn',
 'iv']

In [17]:
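# Evaluate once on the held-out test split; no gradients are needed here
model.eval()
with torch.no_grad():
    x_test, y_test = test_dataset[:]
    logits_test = model(x_test)
    ce_test = F.cross_entropy(logits_test, y_test)
print(f"Test CE: {ce_test.item():.4f}")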


Test CE: 2.3474
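
The run above uses a single fixed `weight_decay`. For E03, one possible way to tune the regularization strength is to retrain once per candidate value and keep the one with the lowest dev loss. The sketch below is an illustration under stated assumptions, not the procedure used for the numbers above: it trains on random stand-in data (`xs_demo`, `ys_demo`), uses a hypothetical helper `fit_and_eval`, and replaces the notebook's mini-batch loop with a few full-batch steps to stay short.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in data: random (context, target) pairs shaped like xs and ys
xs_demo = torch.randint(0, 27, (2000, 2))
ys_demo = torch.randint(0, 27, (2000,))
x_tr, y_tr = xs_demo[:1600], ys_demo[:1600]
x_dev, y_dev = xs_demo[1600:], ys_demo[1600:]

def fit_and_eval(weight_decay, steps=100):
    """Train a fresh embedding + linear model, return (train loss, dev loss)."""
    model = nn.Sequential(nn.Embedding(27, 27), nn.Flatten(), nn.Linear(54, 27))
    opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=weight_decay)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_tr), y_tr)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        train_loss = F.cross_entropy(model(x_tr), y_tr).item()
        dev_loss = F.cross_entropy(model(x_dev), y_dev).item()
    return train_loss, dev_loss

# Try several strengths and select on dev loss only; the test split stays untouched
results = {wd: fit_and_eval(wd) for wd in [0.0, 0.001, 0.01, 0.1, 1.0]}
best_wd = min(results, key=lambda wd: results[wd][1])

for wd, (tr, dev) in results.items():
    print(f"weight_decay={wd}: train {tr:.4f}, dev {dev:.4f}")
print(f"best weight_decay by dev loss: {best_wd}")
```

Because the stand-in data is random, the losses printed by this sketch carry no meaning. With the real splits, the selected value would be used for one final model, and the test set would be evaluated a single time.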
