## Exercises:
E01: train a trigram language model, i.e. take two characters as input to predict the third. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over the bigram model?

E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?

E06: meta-exercise! Think of a fun/interesting exercise and complete it.
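
For E04 and E05, a small check on made-up data can be useful before touching the real training loop. The sketch below is only an illustration: the size `V` and the tensors `W_demo`, `xs_demo` and `ys_demo` are invented for the example and do not come from this notebook. It compares the one-hot path with plain row indexing, and the hand-written softmax loss with `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)
V = 5                                          # illustrative vocabulary size (assumption)
W_demo = torch.randn((V, V), generator=g_demo)
xs_demo = torch.tensor([0, 3, 1, 4])           # made-up input indices
ys_demo = torch.tensor([2, 2, 0, 1])           # made-up target indices

# Path 1: explicit one-hot encoding, softmax by hand, negative log-likelihood
xenc_demo = F.one_hot(xs_demo, num_classes=V).float()
logits_onehot = xenc_demo @ W_demo
counts_demo = logits_onehot.exp()
probs_demo = counts_demo / counts_demo.sum(1, keepdim=True)
loss_manual = -probs_demo[torch.arange(len(ys_demo)), ys_demo].log().mean()

# Path 2: select rows of W directly and let F.cross_entropy do softmax + NLL
logits_indexed = W_demo[xs_demo]
loss_ce = F.cross_entropy(logits_indexed, ys_demo)

print(torch.allclose(logits_onehot, logits_indexed))
print(torch.allclose(loss_manual, loss_ce, atol=1e-6))
```

`F.cross_entropy` is usually preferred in practice because it computes the log-softmax in a numerically stable way and skips the intermediate `counts` and `probs` tensors.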

In [1]:
import torch
import torch.nn.functional as F

import matplotlib.pyplot as plt
from tqdm import tqdm
import time

In [2]:
with open("names.txt", "r") as f:
    words = f.read().splitlines()

In [3]:
print(len(words))

32033


In [4]:
# total words
print("Total names: ", len(words))

print("Minimum name length: ", min(len(w) for w in words))
print("Maximum name length: ", max(len(w) for w in words))

Total names:  32033
Minimum name length:  2
Maximum name length:  15


In [5]:
trigrams = {}
for w in words:
    #chs = ['<S>'] + list(w) + ['<E>']
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        trigram = (ch1+ch2, ch3)
        trigrams[trigram] = trigrams.get(trigram, 0) + 1

trigrams

{('.e', 'm'): 288,
 ('em', 'm'): 100,
 ('mm', 'a'): 72,
 ('ma', '.'): 174,
 ('.o', 'l'): 104,
 ('ol', 'i'): 69,
 ('li', 'v'): 54,
 ('iv', 'i'): 78,
 ('vi', 'a'): 147,
 ('ia', '.'): 903,
 ('.a', 'v'): 243,
 ('av', 'a'): 161,
 ('va', '.'): 93,
 ('.i', 's'): 124,
 ('is', 'a'): 142,
 ('sa', 'b'): 76,
 ('ab', 'e'): 173,
 ('be', 'l'): 201,
 ('el', 'l'): 822,
 ('ll', 'a'): 337,
 ('la', '.'): 684,
 ('.s', 'o'): 152,
 ('so', 'p'): 21,
 ('op', 'h'): 37,
 ('ph', 'i'): 61,
 ('hi', 'a'): 81,
 ('.c', 'h'): 352,
 ('ch', 'a'): 236,
 ('ha', 'r'): 329,
 ('ar', 'l'): 287,
 ('rl', 'o'): 44,
 ('lo', 't'): 14,
 ('ot', 't'): 34,
 ('tt', 'e'): 121,
 ('te', '.'): 175,
 ('.m', 'i'): 393,
 ('mi', 'a'): 95,
 ('.a', 'm'): 384,
 ('am', 'e'): 226,
 ('me', 'l'): 188,
 ('el', 'i'): 537,
 ('li', 'a'): 518,
 ('.h', 'a'): 505,
 ('ar', 'p'): 8,
 ('rp', 'e'): 5,
 ('pe', 'r'): 77,
 ('er', '.'): 683,
 ('.e', 'v'): 154,
 ('ev', 'e'): 142,
 ('ve', 'l'): 76,
 ('el', 'y'): 353,
 ('ly', 'n'): 976,
 ('yn', '.'): 953,
 ('.a', 'b'):

In [6]:
sorted_trigrams = sorted(trigrams.items(), key = lambda item: -item[1])
sorted_trigrams

[(('ah', '.'), 1714),
 (('na', '.'), 1673),
 (('an', '.'), 1509),
 (('on', '.'), 1503),
 (('.m', 'a'), 1453),
 (('.j', 'a'), 1255),
 (('.k', 'a'), 1254),
 (('en', '.'), 1217),
 (('ly', 'n'), 976),
 (('yn', '.'), 953),
 (('ar', 'i'), 950),
 (('ia', '.'), 903),
 (('ie', '.'), 858),
 (('an', 'n'), 825),
 (('el', 'l'), 822),
 (('an', 'a'), 804),
 (('ia', 'n'), 790),
 (('ma', 'r'), 776),
 (('in', '.'), 766),
 (('el', '.'), 727),
 (('ya', '.'), 716),
 (('an', 'i'), 703),
 (('.d', 'a'), 700),
 (('la', '.'), 684),
 (('er', '.'), 683),
 (('iy', 'a'), 669),
 (('la', 'n'), 647),
 (('.b', 'r'), 646),
 (('nn', 'a'), 633),
 (('.a', 'l'), 632),
 (('.c', 'a'), 628),
 (('ra', '.'), 627),
 (('ni', '.'), 625),
 (('.a', 'n'), 623),
 (('nn', '.'), 619),
 (('ne', '.'), 607),
 (('ee', '.'), 605),
 (('ey', '.'), 602),
 (('.k', 'e'), 601),
 (('al', 'e'), 601),
 (('.s', 'a'), 595),
 (('al', 'i'), 575),
 (('sh', 'a'), 562),
 (('el', 'i'), 537),
 (('.d', 'e'), 524),
 (('li', 'a'), 518),
 (('le', 'e'), 517),
 (('y
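
The counts above are already enough for the counting variant of E01. As a rough sketch, assuming add-k smoothing is an acceptable choice for E03 (the helper names `prob` and `avg_nll`, and the five-name `toy_words` list, are made up for this example), the counts can be turned into probabilities and an average negative log-likelihood like this:

```python
import math
from collections import Counter

toy_words = ["emma", "olivia", "ava", "isabella", "sophia"]  # stand-in for names.txt

# Count (two-character context, next character) pairs, as in the cell above
toy_counts = Counter()
for w in toy_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        toy_counts[(ch1 + ch2, ch3)] += 1

alphabet = ['.'] + sorted(set(''.join(toy_words)))

# Total number of times each context was seen
context_totals = {}
for (ctx, _), n in toy_counts.items():
    context_totals[ctx] = context_totals.get(ctx, 0) + n

def prob(ctx, ch, k=1.0):
    """Add-k smoothed estimate of P(ch | ctx); k > 0 keeps unseen pairs non-zero."""
    return (toy_counts[(ctx, ch)] + k) / (context_totals.get(ctx, 0) + k * len(alphabet))

def avg_nll(words, k=1.0):
    """Average negative log-likelihood of `words` under the smoothed trigram model."""
    total, n = 0.0, 0
    for w in words:
        chs = ['.'] + list(w) + ['.']
        for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
            total -= math.log(prob(ch1 + ch2, ch3, k))
            n += 1
    return total / n

for k in (0.01, 0.1, 1.0):
    print(k, avg_nll(toy_words, k))
```

On the data the counts came from, a smaller `k` always gives a lower loss, which is why E03 asks for the smoothing strength to be chosen on a held-out dev set instead.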

In [7]:
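# Build one shared vocabulary: every two-character context and every target character.
# Each item of sorted_trigrams is ((context, next_char), count).
elements = []

for trigram in sorted_trigrams:
    elements.append(trigram[0][0])   # two-character context, e.g. '.e'
    elements.append(trigram[0][1])   # single target character, e.g. 'm'

# NOTE: set() iteration order for strings can change between Python sessions,
# so the integer indices printed further down are specific to this run.
elements = list(set(elements))

print(len(elements))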

628
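
The 628 elements mix two kinds of things, two-character contexts and single target characters, in one index space. A possible alternative, which is not what this notebook does, is to keep a plain character vocabulary and derive the context index arithmetically. The sketch below assumes that layout; `context_index`, `context_chars` and `toy_words` are illustrative names.

```python
toy_words = ["emma", "olivia", "ava"]   # stand-in for the real names list

chars = ['.'] + sorted(set(''.join(toy_words)))   # '.' gets index 0
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
n_chars = len(chars)

def context_index(ch1, ch2):
    """Map a two-character context to one integer in [0, n_chars**2)."""
    return stoi[ch1] * n_chars + stoi[ch2]

def context_chars(ix):
    """Inverse of context_index: recover the two-character string."""
    return itos[ix // n_chars] + itos[ix % n_chars]

xs_toy, ys_toy = [], []
for w in toy_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        xs_toy.append(context_index(ch1, ch2))   # input: context id
        ys_toy.append(stoi[ch3])                 # target: character id

print(n_chars, len(xs_toy), max(xs_toy), max(ys_toy))
```

With this layout the weight matrix would have shape `(n_chars**2, n_chars)`, and every target index is guaranteed to be a valid column.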


In [8]:
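eltoi = {el: i for i, el in enumerate(elements)}
#print(eltoi)
# NOTE: '.' is already in `elements` (it occurs as a target), so it already has an index below 628.
# Overriding it with 999 pushes every end-of-word target outside the 628 columns of W,
# which is what the "index out of bounds" assertion in the training cell reports.
eltoi["."] = 999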

In [9]:
# int to string mapping
itoel = {i:el for i, el in enumerate(elements)}
#print(itoel)

In [10]:
# First let's create a training set of trigrams: x is the two-character context, y is the next character

xs, ys = [], []

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        el1 = ch1+ch2
        el2 = ch3

        ix1 = eltoi[el1]
        ix2 = eltoi[el2]
        xs.append(ix1)
        ys.append(ix2)
        print(f"{el1} {el2}: {ix1} {ix2}")

xs = torch.tensor(xs)
ys = torch.tensor(ys)

print(xs)
print(ys)

.e m: 20 40
em m: 210 40
mm a: 393 177
ma .: 201 999
.o l: 144 285
ol i: 162 492
li v: 620 572
iv i: 93 492
vi a: 108 177
ia .: 95 999
.a v: 477 572
av a: 609 177
va .: 43 999
.i s: 599 236
is a: 136 177
sa b: 514 182
ab e: 493 135
be l: 454 285
el l: 597 285
ll a: 414 177
la .: 416 999
.s o: 271 384
so p: 1 537
op h: 120 259
ph i: 435 492
hi a: 424 177
ia .: 95 999
.c h: 401 259
ch a: 146 177
ha r: 114 38
ar l: 368 285
rl o: 612 384
lo t: 515 433
ot t: 505 433
tt e: 603 135
te .: 30 999
.m i: 11 492
mi a: 444 177
ia .: 95 999
.a m: 477 40
am e: 288 135
me l: 618 285
el i: 597 492
li a: 620 177
ia .: 95 999
.h a: 327 177
ha r: 114 38
ar p: 368 537
rp e: 533 135
pe r: 56 38
er .: 518 999
.e v: 20 572
ev e: 82 135
ve l: 528 285
el y: 597 115
ly n: 348 420
yn .: 45 999
.a b: 477 182
ab i: 493 492
bi g: 484 389
ig a: 12 177
ga i: 497 492
ai l: 499 285
il .: 193 999
.e m: 20 40
em i: 210 492
mi l: 444 285
il y: 193 115
ly .: 348 999
.e l: 20 285
el i: 597 492
li z: 620 340
iz a: 25 177
za b

In [11]:
# Create the dataset
xs, ys = [], []

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        el1 = ch1+ch2
        el2 = ch3

        ix1 = eltoi[el1]
        ix2 = eltoi[el2]
        xs.append(ix1)
        ys.append(ix2)
        #print(f"{ch1}{ch2}: {ix1} {ix2}")

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

print("Number of examples: ", num)

Number of examples:  196113
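
E02 asks for a random 80% / 10% / 10% split. One way to do it, sketched here under the assumption that the split is made over whole names before any trigrams are built, is to shuffle a copy of the list and slice it. The function `split_words` and the placeholder `demo_words` are invented for this example.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of `words` and cut it into train/dev/test lists."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)          # seeded so the split is repeatable
    n1 = int(train_frac * len(shuffled))
    n2 = int((train_frac + dev_frac) * len(shuffled))
    return shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

demo_words = [f"name{i}" for i in range(100)]      # placeholder data
train_words, dev_words, test_words = split_words(demo_words)
print(len(train_words), len(dev_words), len(test_words))
```

Splitting by name rather than by individual trigram keeps all trigrams of one name in the same split, so the dev and test losses are measured on names the model has not seen.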


In [12]:
# randomly initialize the weights: 628 neurons, each receiving 628 inputs (one per vocabulary element)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((628, 628), generator=g, requires_grad=True)   # 628 x 628 weight matrix

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [14]:
# gradient descent in a loop -> training

lr = 75
alpha = 0.01

t_start = time.time()

# Move the data to the GPU when one is available.
# Tensor.to() returns a new tensor rather than modifying in place, so the result must be reassigned.
xs = xs.to(device)
ys = ys.to(device)

with tqdm(range(1000), unit="Epoch") as tepoch:
    for epoch in tepoch:
        tepoch.set_description(f"Epoch {epoch}")

        # Forward pass
        xenc = F.one_hot(xs, num_classes=628).float()

        # W stays on the CPU as the leaf tensor; gradients flow back through the .to() copy
        logits = xenc @ W.to(device) # log-counts
        counts = logits.exp() # counts, equivalent to N
        probs = counts / counts.sum(1, keepdim=True)

        # print(probs.shape)
        # print(ys.shape)

        # regularised loss; every value in ys must be a valid column index, i.e. in [0, 628)
        loss = -probs[torch.arange(num, device=device), ys].log().mean() + (alpha * (W**2).mean())

        # Backward pass
        W.grad = None   # reset the gradient
        loss.backward()

        # update
        W.data += -lr * W.grad # gradient descent

        tepoch.set_postfix(loss=loss.item(), time=time.time() - t_start)

Epoch 0:   0%|          | 0/1000 [00:00<?, ?Epoch/s]../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [10,0,0], thread: [5,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [10,0,0], thread: [11,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [10,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [10,0,0], thread: [23,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [10,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator():

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
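
The assertion is raised by the indexing `probs[torch.arange(num), ys]`: the targets contain 999 (the value assigned to `'.'`), while `probs` has only 628 columns. A minimal sketch of a loop that avoids this follows. It is a toy under stated assumptions, not this notebook's training run: it uses five names on the CPU, separate index spaces for contexts and targets, and an illustrative learning rate, and all `toy_*` names are made up.

```python
import torch
import torch.nn.functional as F

toy_words = ["emma", "olivia", "ava", "isabella", "sophia"]  # stand-in for names.txt

toy_contexts, toy_targets = [], []
for w in toy_words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        toy_contexts.append(ch1 + ch2)
        toy_targets.append(ch3)

# Separate vocabularies; '.' simply keeps whatever index enumerate() gives it
ctx_vocab = sorted(set(toy_contexts))
tgt_vocab = sorted(set(toy_targets))
ctoi = {c: i for i, c in enumerate(ctx_vocab)}
ttoi = {t: i for i, t in enumerate(tgt_vocab)}

toy_xs = torch.tensor([ctoi[c] for c in toy_contexts])
toy_ys = torch.tensor([ttoi[t] for t in toy_targets])

toy_g = torch.Generator().manual_seed(2147483647)
toy_W = torch.randn((len(ctx_vocab), len(tgt_vocab)), generator=toy_g, requires_grad=True)

toy_lr, toy_alpha = 1.0, 0.01   # illustrative values, not tuned
toy_losses = []
for step in range(200):
    logits = toy_W[toy_xs]                       # row lookup instead of one-hot matmul
    loss = F.cross_entropy(logits, toy_ys) + toy_alpha * (toy_W ** 2).mean()
    toy_W.grad = None
    loss.backward()
    with torch.no_grad():
        toy_W -= toy_lr * toy_W.grad             # gradient descent step
    toy_losses.append(loss.item())

print(toy_losses[0], toy_losses[-1])
```

Because every target index comes from `enumerate(tgt_vocab)`, it is always smaller than the number of columns of `toy_W`, so the lookup inside the loss cannot go out of bounds.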

In [None]:
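# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):

  # The model only predicts from two-character contexts, so it has no distribution over the
  # first letter. Seeding with the first letter of a real name is an assumption made here.
  ctx = '.' + words[i][0]
  out = [ctx[1]]
  while True:

    # ----------
    # BEFORE:
    #p = P[ix]
    # ----------
    # NOW:
    xenc = F.one_hot(torch.tensor([eltoi[ctx]]), num_classes=628).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdim=True) # probabilities for the next element
    # ----------

    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    el = itoel[ix]
    # Stop on the end token. itoel never contains 999, so compare the decoded element instead.
    # Also stop if the sample is a two-character element or leads to an unseen context.
    if el == '.' or len(el) != 1 or ctx[1] + el not in eltoi:
      break
    out.append(el)
    ctx = ctx[1] + el  # slide the context window by one character
  print(''.join(out))
  print()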
