## Ex 1 - Train a Trigram Model
- Train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

### Load Data

In [1]:
# Load the names.txt file
words = open("data/names.txt", "r").read().splitlines()
words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

### Trigram Model
- A trigram model is a language model that predicts the next character given the previous two characters.
- First, we train a trigram model with a simple counting approach.
- We use the training data to estimate the probability of each character given the previous two characters.
- Conceptually this needs a 3D array of shape 27 x 27 x 27 (two context characters and one target character). Here it is stored as a 2D tensor of shape 729 x 27, with one row for each two-character context.
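
As a minimal sketch of how a two-character context can be flattened into a single row index: the names `ALPHABET`, `stoi`, `context_to_row` and `row_to_context` are hypothetical, and the arithmetic layout `row = 27 * i + j` is an assumption made for illustration. The notebook builds its own lookup dictionary further down, with a different ordering.

```python
import string

# Assumption: the alphabet is '.' plus the 26 lowercase letters (27 symbols).
ALPHABET = "." + string.ascii_lowercase
stoi = {ch: i for i, ch in enumerate(ALPHABET)}
V = len(ALPHABET)  # 27


def context_to_row(ch1: str, ch2: str) -> int:
    """Map a two-character context to a row index in [0, V * V)."""
    return stoi[ch1] * V + stoi[ch2]


def row_to_context(row: int) -> str:
    """Invert context_to_row."""
    i, j = divmod(row, V)
    return ALPHABET[i] + ALPHABET[j]


print(V * V)                     # number of rows needed
print(context_to_row(".", "."))  # start-of-word context
print(row_to_context(context_to_row("e", "m")))
```

An arithmetic index like this avoids storing a dictionary, at the cost of fixing the row order in advance.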

In [2]:
# Loop over the words and count every (two-character context, next character) pair
trigram_count = {}
for word in words:
    # Pad each word with two start tokens and two end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for ch1, ch2, ch3 in zip(new_word, new_word[1:], new_word[2:]):
        key = (ch1 + ch2, ch3)
        trigram_count[key] = trigram_count.get(key, 0) + 1  # increment the count
trigram_count

{('..', 'e'): 1531,
 ('.e', 'm'): 288,
 ('em', 'm'): 100,
 ('mm', 'a'): 72,
 ('ma', '.'): 174,
 ('a.', '.'): 6640,
 ('..', 'o'): 394,
 ('.o', 'l'): 104,
 ('ol', 'i'): 69,
 ('li', 'v'): 54,
 ('iv', 'i'): 78,
 ('vi', 'a'): 147,
 ('ia', '.'): 903,
 ('..', 'a'): 4410,
 ('.a', 'v'): 243,
 ('av', 'a'): 161,
 ('va', '.'): 93,
 ('..', 'i'): 591,
 ('.i', 's'): 124,
 ('is', 'a'): 142,
 ('sa', 'b'): 76,
 ('ab', 'e'): 173,
 ('be', 'l'): 201,
 ('el', 'l'): 822,
 ('ll', 'a'): 337,
 ('la', '.'): 684,
 ('..', 's'): 2055,
 ('.s', 'o'): 152,
 ('so', 'p'): 21,
 ('op', 'h'): 37,
 ('ph', 'i'): 61,
 ('hi', 'a'): 81,
 ('..', 'c'): 1542,
 ('.c', 'h'): 352,
 ('ch', 'a'): 236,
 ('ha', 'r'): 329,
 ('ar', 'l'): 287,
 ('rl', 'o'): 44,
 ('lo', 't'): 14,
 ('ot', 't'): 34,
 ('tt', 'e'): 121,
 ('te', '.'): 175,
 ('e.', '.'): 3983,
 ('..', 'm'): 2538,
 ('.m', 'i'): 393,
 ('mi', 'a'): 95,
 ('.a', 'm'): 384,
 ('am', 'e'): 226,
 ('me', 'l'): 188,
 ('el', 'i'): 537,
 ('li', 'a'): 518,
 ('..', 'h'): 874,
 ('.h', 'a'): 505,


In [3]:
len(trigram_count.keys())

6089
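
The same counting can be written with `collections.Counter`, which handles the missing-key case for us. This is a sketch on a made-up three-word list, not on `names.txt`; the function name `count_trigrams` is hypothetical.

```python
from collections import Counter


def count_trigrams(words):
    """Count (two-char context, next char) pairs, padding as in the notebook."""
    counts = Counter()
    for word in words:
        chars = [".", "."] + list(word) + [".", "."]
        for ch1, ch2, ch3 in zip(chars, chars[1:], chars[2:]):
            counts[(ch1 + ch2, ch3)] += 1
    return counts


# Toy input, chosen only to keep the output small.
toy_counts = count_trigrams(["emma", "ava", "eva"])
print(toy_counts.most_common(3))
print(len(toy_counts), "distinct trigrams")
```

`Counter.most_common` also replaces the manual sort by frequency.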

In [4]:
# Sort the trigrams by frequency, from most to least frequent
sorted(trigram_count.items(),key=lambda kv: -kv[1])

[(('n.', '.'), 6763),
 (('a.', '.'), 6640),
 (('..', 'a'), 4410),
 (('e.', '.'), 3983),
 (('..', 'k'), 2963),
 (('..', 'm'), 2538),
 (('i.', '.'), 2489),
 (('..', 'j'), 2422),
 (('h.', '.'), 2409),
 (('..', 's'), 2055),
 (('y.', '.'), 2007),
 (('ah', '.'), 1714),
 (('..', 'd'), 1690),
 (('na', '.'), 1673),
 (('..', 'r'), 1639),
 (('..', 'l'), 1572),
 (('..', 'c'), 1542),
 (('..', 'e'), 1531),
 (('an', '.'), 1509),
 (('on', '.'), 1503),
 (('.m', 'a'), 1453),
 (('r.', '.'), 1377),
 (('l.', '.'), 1314),
 (('..', 't'), 1308),
 (('..', 'b'), 1306),
 (('.j', 'a'), 1255),
 (('.k', 'a'), 1254),
 (('en', '.'), 1217),
 (('s.', '.'), 1169),
 (('..', 'n'), 1146),
 (('ly', 'n'), 976),
 (('yn', '.'), 953),
 (('ar', 'i'), 950),
 (('..', 'z'), 929),
 (('ia', '.'), 903),
 (('..', 'h'), 874),
 (('ie', '.'), 858),
 (('o.', '.'), 855),
 (('an', 'n'), 825),
 (('el', 'l'), 822),
 (('an', 'a'), 804),
 (('ia', 'n'), 790),
 (('ma', 'r'), 776),
 (('in', '.'), 766),
 (('el', '.'), 727),
 (('ya', '.'), 716),
 ...

In [5]:
import torch

In [6]:
# Create a 2D tensor to store the trigram counts:
# 729 rows (27 * 27 two-character contexts) x 27 columns (next character)
trigram_tensor = torch.zeros((729, 27), dtype=torch.float32)

In [7]:
trigram_tensor.shape

torch.Size([729, 27])

In [9]:
# Map every two-character context to a row index of the count tensor
unique_chars = sorted(list(set("".join(words))))
bigram_to_int = {}
index = 1
for ch in ["."] + unique_chars:
    for ch_n in unique_chars + ["."]:
        if ch + ch_n != "..":
            bigram_to_int[ch + ch_n] = index
            index += 1
# The start-of-word context ".." gets row 0
bigram_to_int[".."] = 0

In [10]:
bigram_to_int

{'.a': 1,
 '.b': 2,
 '.c': 3,
 '.d': 4,
 '.e': 5,
 '.f': 6,
 '.g': 7,
 '.h': 8,
 '.i': 9,
 '.j': 10,
 '.k': 11,
 '.l': 12,
 '.m': 13,
 '.n': 14,
 '.o': 15,
 '.p': 16,
 '.q': 17,
 '.r': 18,
 '.s': 19,
 '.t': 20,
 '.u': 21,
 '.v': 22,
 '.w': 23,
 '.x': 24,
 '.y': 25,
 '.z': 26,
 'aa': 27,
 'ab': 28,
 'ac': 29,
 'ad': 30,
 'ae': 31,
 'af': 32,
 'ag': 33,
 'ah': 34,
 'ai': 35,
 'aj': 36,
 'ak': 37,
 'al': 38,
 'am': 39,
 'an': 40,
 'ao': 41,
 'ap': 42,
 'aq': 43,
 'ar': 44,
 'as': 45,
 'at': 46,
 'au': 47,
 'av': 48,
 'aw': 49,
 'ax': 50,
 'ay': 51,
 'az': 52,
 'a.': 53,
 'ba': 54,
 'bb': 55,
 'bc': 56,
 'bd': 57,
 'be': 58,
 'bf': 59,
 'bg': 60,
 'bh': 61,
 'bi': 62,
 'bj': 63,
 'bk': 64,
 'bl': 65,
 'bm': 66,
 'bn': 67,
 'bo': 68,
 'bp': 69,
 'bq': 70,
 'br': 71,
 'bs': 72,
 'bt': 73,
 'bu': 74,
 'bv': 75,
 'bw': 76,
 'bx': 77,
 'by': 78,
 'bz': 79,
 'b.': 80,
 'ca': 81,
 'cb': 82,
 'cc': 83,
 'cd': 84,
 'ce': 85,
 'cf': 86,
 'cg': 87,
 'ch': 88,
 'ci': 89,
 'cj': 90,
 'ck': 91,
 'cl': 9

In [11]:
int_to_bigram = {i: bg for bg, i in bigram_to_int.items()}
int_to_bigram

{1: '.a',
 2: '.b',
 3: '.c',
 4: '.d',
 5: '.e',
 6: '.f',
 7: '.g',
 8: '.h',
 9: '.i',
 10: '.j',
 11: '.k',
 12: '.l',
 13: '.m',
 14: '.n',
 15: '.o',
 16: '.p',
 17: '.q',
 18: '.r',
 19: '.s',
 20: '.t',
 21: '.u',
 22: '.v',
 23: '.w',
 24: '.x',
 25: '.y',
 26: '.z',
 27: 'aa',
 28: 'ab',
 29: 'ac',
 30: 'ad',
 31: 'ae',
 32: 'af',
 33: 'ag',
 34: 'ah',
 35: 'ai',
 36: 'aj',
 37: 'ak',
 38: 'al',
 39: 'am',
 40: 'an',
 41: 'ao',
 42: 'ap',
 43: 'aq',
 44: 'ar',
 45: 'as',
 46: 'at',
 47: 'au',
 48: 'av',
 49: 'aw',
 50: 'ax',
 51: 'ay',
 52: 'az',
 53: 'a.',
 54: 'ba',
 55: 'bb',
 56: 'bc',
 57: 'bd',
 58: 'be',
 59: 'bf',
 60: 'bg',
 61: 'bh',
 62: 'bi',
 63: 'bj',
 64: 'bk',
 65: 'bl',
 66: 'bm',
 67: 'bn',
 68: 'bo',
 69: 'bp',
 70: 'bq',
 71: 'br',
 72: 'bs',
 73: 'bt',
 74: 'bu',
 75: 'bv',
 76: 'bw',
 77: 'bx',
 78: 'by',
 79: 'bz',
 80: 'b.',
 81: 'ca',
 82: 'cb',
 83: 'cc',
 84: 'cd',
 85: 'ce',
 86: 'cf',
 87: 'cg',
 88: 'ch',
 89: 'ci',
 90: 'cj',
 91: 'ck',
 92: 'cl

In [12]:
# Create a mapping from character to index
unique_chars = sorted(list(set("".join(words))))
# Start the character index from 1
char_to_int = {char: i + 1 for i, char in enumerate(sorted(unique_chars))}
char_to_int["."] = 0
print(char_to_int)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}


In [13]:
# Start the character index from 1
int_to_char = {i + 1: char for i, char in enumerate(sorted(unique_chars))}
int_to_char[0] = "."
print(int_to_char) 

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [14]:
# Fill the count tensor: row = two-character context, column = next character
for word in words:
    # Pad each word with two start tokens and two end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for ch1, ch2, ch3 in zip(new_word, new_word[1:], new_word[2:]):
        index1 = bigram_to_int[ch1 + ch2]
        index2 = char_to_int[ch3]
        # Increment the count for this trigram
        trigram_tensor[index1, index2] += 1


In [15]:
trigram_tensor[1, :]

tensor([  0., 207., 190.,  31., 366.,  55.,  21.,  17.,  91., 154.,  27.,  75.,
        632., 384., 623.,  10.,  17.,   9., 482., 194.,  72., 152., 243.,   6.,
         27., 173., 152.])

In [16]:
# Convert the counts to probabilities: row i of P is the distribution over the next character given context i
# Add 1 to every count to avoid zero probabilities. The larger the added constant, the more uniform the distribution becomes.
P = (trigram_tensor + 1).float()
P /= P.sum(1, keepdim=True)

In [17]:
P[0]

tensor([3.1192e-05, 1.3759e-01, 4.0767e-02, 4.8129e-02, 5.2745e-02, 4.7785e-02,
        1.3038e-02, 2.0898e-02, 2.7293e-02, 1.8465e-02, 7.5577e-02, 9.2452e-02,
        4.9064e-02, 7.9195e-02, 3.5777e-02, 1.2321e-02, 1.6095e-02, 2.9008e-03,
        5.1154e-02, 6.4130e-02, 4.0830e-02, 2.4641e-03, 1.1759e-02, 9.6070e-03,
        4.2109e-03, 1.6719e-02, 2.9008e-02])
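
To make the normalization step concrete, here is a small pure-Python sketch of add-constant smoothing on a made-up 2 x 3 count table. The function name `normalize_rows` and the toy numbers are assumptions for illustration. The same arithmetic explains the first entry of `P[0]` above: 3.1192e-05 is (0 + 1) / (32033 + 27), since no name starts with the end token and the dataset holds 32,033 names.

```python
def normalize_rows(counts, alpha=1.0):
    """Turn each row of raw counts into probabilities with add-alpha smoothing."""
    probs = []
    for row in counts:
        total = sum(row) + alpha * len(row)
        probs.append([(c + alpha) / total for c in row])
    return probs


# Toy count table: 2 contexts x 3 possible next characters (made-up numbers).
toy_counts = [[0, 3, 1],
              [2, 2, 0]]
P_toy = normalize_rows(toy_counts, alpha=1.0)
for row in P_toy:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 6))
```

Every row sums to 1 and no entry is exactly 0, which is what keeps the log in the loss finite.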

In [18]:
def generate_words(P, num_words):
    # Sample characters one at a time, conditioning on the previous two characters
    g = torch.Generator().manual_seed(2147483647)
    for _ in range(num_words):
        idx = 0  # row 0 is the ".." (start-of-word) context
        out = []
        while True:
            # Distribution over the next character for the current context
            p = P[idx]
            # Sample the next character from that distribution
            ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
            prev_char = int_to_bigram[idx][-1]  # last character of the current context
            next_char = int_to_char[ix]
            out.append(next_char)
            # Slide the context window forward by one character
            idx = bigram_to_int[prev_char + next_char]
            if ix == 0:
                # "." marks the end of the word
                break
        print("".join(out))


In [19]:
generate_words(P, num_words=10)

junide.
jakasid.
prelay.
adin.
kairritoper.
sathen.
sameia.
yanileniassibduinrwin.
lessiyanayla.
te.
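
The sampling loop boils down to: look up the distribution for the current two-character context, draw one character, slide the window. A standard-library sketch on a made-up distribution (not estimated from `names.txt`) follows. The names `sample_word` and `P_toy` are hypothetical, and unlike the notebook version this one drops the trailing `.` and caps the word length.

```python
import random


def sample_word(P, rng, max_len=50):
    """Sample one word from a {context: {next_char: prob}} table."""
    context = ".."
    out = []
    while len(out) < max_len:
        chars = list(P[context].keys())
        weights = list(P[context].values())
        nxt = rng.choices(chars, weights=weights, k=1)[0]
        if nxt == ".":
            break  # end-of-word token
        out.append(nxt)
        context = context[1] + nxt  # slide the two-character window
    return "".join(out)


# Made-up toy distribution over a two-letter alphabet.
P_toy = {
    "..": {"a": 0.5, "b": 0.5},
    ".a": {"b": 0.7, ".": 0.3},
    ".b": {"a": 1.0},
    "ab": {"a": 0.4, ".": 0.6},
    "ba": {"b": 0.5, ".": 0.5},
}
rng = random.Random(0)
samples = [sample_word(P_toy, rng) for _ in range(5)]
print(samples)
```

Because the toy table gives zero probability to a repeated letter, no sample can contain `aa` or `bb`, which is a quick way to check that the context is being updated correctly.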


In [20]:
def loss_function(P, words):
    # Average negative log-likelihood of the dataset under the model P
    log_likelihood = 0.0  # running sum of the log probabilities of every trigram in the dataset
    n = 0
    for word in words:
        new_word = [".", "."] + list(word) + [".", "."]
        for ch1, ch2, ch3 in zip(new_word, new_word[1:], new_word[2:]):
            index1 = bigram_to_int[ch1 + ch2]
            index2 = char_to_int[ch3]
            # Probability that the model assigns to this trigram
            prob = P[index1, index2]
            # Sum log probabilities instead of multiplying many small probabilities
            logprob = torch.log(prob)
            log_likelihood += logprob
            n += 1
    print(f"{log_likelihood=}")
    negative_ll = -log_likelihood
    print(f"{negative_ll=}")
    print(f"Avg neg log likelihood : {negative_ll/n}")

In [21]:
# Laplace Smoothing or adding 1 to the counts
loss_function(P, words)

log_likelihood=tensor(-505260.7500)
negative_ll=tensor(505260.7500)
Avg neg log likelihood : 1.9419735670089722
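
Written out, the number printed above is the average negative log-likelihood over all $N$ trigrams in the dataset. The notation is introduced here for illustration: $c_i$ is the $i$-th target character and $c_{i-2}\, c_{i-1}$ is its two-character context.

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P\left(c_i \mid c_{i-2}\, c_{i-1}\right)
$$

As a reference point, a model that gives each of the 27 symbols probability $1/27$ would score $\log 27 \approx 3.296$, so any useful model should land well below that.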


## Reducing the Loss
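- The Laplace-smoothed trigram model reaches an average negative log-likelihood of about 1.94, which is lower than the bigram model (about 2.45).
- With two characters of context instead of one, the model can predict the next character more accurately.
- Note that padding each word with two end tokens adds one trigram per word (for example `('a.', '.')`) whose target is always `.`. These near-certain predictions pull the average down, so the comparison is only like-for-like if the bigram model is padded and scored the same way.
- We will first try to reduce the loss further with smoothing.
- We will also train the trigram model with a neural network and see whether that improves the loss.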

### Smoothing
- Smoothing adds a small constant to the trigram counts before they are normalized, so that trigrams never seen in the data still receive a non-zero probability.
- Without smoothing, an unseen trigram has probability 0, its log probability is negative infinity, and the loss becomes infinite.
- Below are some common smoothing techniques:
    - Additive Smoothing: adds the same constant to every count in the matrix. The two usual variants are:
        - Laplace Smoothing: adds 1 to every count.
        - Lidstone Smoothing: adds a constant $\alpha$, typically between 0 and 1, to every count. Laplace smoothing is the special case $\alpha = 1$. The value of $\alpha$ can be tuned.
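
In formula form (notation introduced here for illustration), additive smoothing with constant $\alpha$ over the 27-symbol alphabet estimates

$$
P_\alpha(c \mid c_1 c_2) = \frac{\mathrm{count}(c_1 c_2 c) + \alpha}{\mathrm{count}(c_1 c_2) + 27\,\alpha},
\qquad
\mathrm{count}(c_1 c_2) = \sum_{c'} \mathrm{count}(c_1 c_2 c')
$$

Setting $\alpha = 1$ gives Laplace smoothing. As $\alpha \to 0$ the estimate approaches the raw relative frequencies, and as $\alpha$ grows it approaches the uniform distribution $1/27$.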

- The implementation above uses Laplace smoothing, i.e. it adds 1 to every count.
- Next we try Lidstone smoothing with a smaller $\alpha$ and compare its loss with that of Laplace smoothing. Because the loss here is measured on the same data that was used for counting, a smaller $\alpha$ will always look at least as good. Choosing $\alpha$ properly requires a held-out set.

In [22]:
# Lidstone smoothing: add alpha (instead of 1) to every count.
# Only alpha = 0.01 is evaluated in this run; add more values to the list to sweep over them.
for alpha in [0.01]:
    print("Alpha value: ", alpha)
    P_s = (trigram_tensor + alpha).float()
    P_s /= P_s.sum(1, keepdim=True)
    generate_words(P_s, 10)
    loss_function(P_s, words)

Alpha value:  0.01
junide.
jakasid.
prelay.
adin.
kairritoper.
sathen.
sameia.
yanileniassibiainewin.
lessiyanayla.
te.
log_likelihood=tensor(-498757.6875)
negative_ll=tensor(498757.6875)
Avg neg log likelihood : 1.9169790744781494
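
The loss above is computed on the same words that produced the counts, so shrinking $\alpha$ can only help it. A fairer way to pick $\alpha$ is to count on one set of words and score another. The sketch below does this in pure Python on a tiny made-up split; `fit_counts`, `avg_nll` and the word lists are hypothetical and only show the mechanics, not a result for `names.txt`.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "." + "abcdefghijklmnopqrstuvwxyz"  # assumption: 27 symbols


def trigrams(words):
    """Yield (ch1, ch2, ch3) for every padded word, padding as in the notebook."""
    for word in words:
        chars = [".", "."] + list(word) + [".", "."]
        yield from zip(chars, chars[1:], chars[2:])


def fit_counts(words):
    """Count next characters per two-character context."""
    counts = defaultdict(Counter)
    for ch1, ch2, ch3 in trigrams(words):
        counts[ch1 + ch2][ch3] += 1
    return counts


def avg_nll(counts, words, alpha):
    """Average negative log-likelihood under add-alpha smoothing."""
    total, n = 0.0, 0
    for ch1, ch2, ch3 in trigrams(words):
        row = counts.get(ch1 + ch2, Counter())  # unseen context -> all zeros
        prob = (row[ch3] + alpha) / (sum(row.values()) + alpha * len(ALPHABET))
        total -= math.log(prob)
        n += 1
    return total / n


# Tiny made-up split, only to show the mechanics.
train_words = ["emma", "ava", "mia", "amelia", "ella", "mila"]
dev_words = ["emilia", "ela"]
counts = fit_counts(train_words)
for alpha in [0.01, 0.1, 0.5, 1.0]:
    print(alpha,
          round(avg_nll(counts, train_words, alpha), 4),
          round(avg_nll(counts, dev_words, alpha), 4))
```

The training column always improves as $\alpha$ shrinks. The held-out column is the one to use when choosing $\alpha$.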


## Trigrams using Neural Networks
- Now we will use a simple neural network to train the trigram model.
- The neural network takes the previous two characters as input and predicts the next character.
- The network is a single linear layer: the one-hot encoded context (729 values) is multiplied by a 729 x 27 weight matrix, and a softmax turns the resulting 27 logits into probabilities.
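
Multiplying a one-hot vector by the weight matrix simply selects one row of that matrix. The sketch below checks this on random weights; `W_demo` and the three example indices are assumptions for illustration, not values from the trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_contexts, num_chars = 729, 27
W_demo = torch.randn(num_contexts, num_chars)

idx = torch.tensor([0, 5, 147])  # three example context rows
one_hot = F.one_hot(idx, num_classes=num_contexts).float()

logits_matmul = one_hot @ W_demo  # one-hot input times weights
logits_lookup = W_demo[idx]       # equivalent row selection

# Softmax turns each row of logits into a probability distribution
probs = F.softmax(logits_lookup, dim=1)
print(torch.allclose(logits_matmul, logits_lookup))
print(probs.sum(dim=1))
```

Indexing avoids materializing the large one-hot matrix, which matters once the dataset has a few hundred thousand rows.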

### Data Preparation
- To prepare the data for the neural network, we convert the characters into integer indices.
- The x-values are the row indices of the previous two characters, and the y-values (labels) are the indices of the next character.

In [63]:
x_s = []  # input contexts (row index of the previous two characters)
y_s = []  # target labels (index of the next character)
for word in words:
    # Pad each word with two start tokens and two end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for ch1, ch2, ch3 in zip(new_word, new_word[1:], new_word[2:]):
        x_s.append(bigram_to_int[ch1 + ch2])
        y_s.append(char_to_int[ch3])

# NOTE: Use torch.tensor rather than torch.Tensor: torch.tensor infers the dtype from the data (int64 here),
# whereas torch.Tensor always creates a float32 tensor.
xs = torch.tensor(x_s)
ys = torch.tensor(y_s)
num_elements = xs.nelement()
print("Number of examples in the dataset: ", num_elements)

Number of examples in the dataset:  260179


In [64]:
print(f"Input Bigrams = {xs}")
print(f"Target Labels = {ys}")

Input Bigrams = tensor([  0,   5, 147,  ..., 700, 725, 674])
Target Labels = tensor([ 5, 13, 13,  ..., 24,  0,  0])


In [65]:
import torch.nn.functional as F

In [66]:
# ---------- Initialize the Network ------------
# Seed the random number generator so that W is the same on every run.
g_nn = torch.Generator().manual_seed(2147483647)
# Initialize the weights from a standard normal distribution.
W = torch.randn((729, 27), generator=g_nn, requires_grad=True)

### Training the Neural Network

In [81]:
# Gradient descent (full batch)

# One-hot encode the input contexts once; the encoding does not change between iterations
x_encoded = F.one_hot(xs, num_classes=729).float()

for k in range(100):
    # ----------- Forward Pass --------------------
    logits = x_encoded @ W  # interpreted as log-counts
    # The next two lines together are the softmax function
    counts = logits.exp()  # analogous to the count matrix of the counting approach
    probabilities = counts / counts.sum(1, keepdim=True)  # normalize each row to sum to 1
    # Loss = average negative log-likelihood of the correct next character
    #        + a small L2 penalty that pulls W toward zero
    loss = -probabilities[torch.arange(ys.nelement()), ys].log().mean() + 0.01 * (W ** 2).mean()
    print("Loss: ", loss.item())

    # ----------- Backward Pass (Backpropagation) --------------------
    # Reset the gradient before backpropagation
    W.grad = None
    loss.backward()

    # NOTE: The gradient points in the direction in which the loss increases,
    # so we step in the opposite direction, scaled by the learning rate (100 here).
    # A larger learning rate speeds up learning but can make training unstable.
    W.data += -100 * W.grad

Loss:  1.985234260559082
Loss:  1.98513662815094
Loss:  1.9850395917892456
Loss:  1.9849426746368408
Loss:  1.9848462343215942
Loss:  1.9847503900527954
Loss:  1.9846549034118652
Loss:  1.9845597743988037
Loss:  1.9844650030136108
Loss:  1.9843703508377075
Loss:  1.984276533126831
Loss:  1.9841831922531128
Loss:  1.9840898513793945
Loss:  1.983997106552124
Loss:  1.9839047193527222
Loss:  1.983812689781189
Loss:  1.9837208986282349
Loss:  1.983629584312439
Loss:  1.9835386276245117
Loss:  1.9834481477737427
Loss:  1.9833581447601318
Loss:  1.9832680225372314
Loss:  1.983178734779358
Loss:  1.9830896854400635
Loss:  1.9830007553100586
Loss:  1.9829126596450806
Loss:  1.982824683189392
Loss:  1.9827370643615723
Loss:  1.982649803161621
Loss:  1.9825628995895386
Loss:  1.9824762344360352
Loss:  1.9823899269104004
Loss:  1.9823038578033447
Loss:  1.9822183847427368
Loss:  1.982133150100708
Loss:  1.9820481538772583
Loss:  1.9819637537002563
Loss:  1.981879472732544
Loss:  1.981795787811279
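
The manual softmax followed by a negative log-likelihood, as used in the training loop, matches what `F.cross_entropy` computes in one call. The check below uses random weights and made-up indices (`W_demo`, `xs_demo`, `ys_demo` are hypothetical), and it leaves out the regularization term.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W_demo = torch.randn((729, 27), generator=g, requires_grad=True)
xs_demo = torch.tensor([0, 5, 147, 364])  # made-up context indices
ys_demo = torch.tensor([5, 13, 13, 1])    # made-up target characters

logits = W_demo[xs_demo]  # same as one-hot(xs_demo) @ W_demo

# Manual path: exponentiate, normalize, pick the target, log, average.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(len(ys_demo)), ys_demo].log().mean()

# Built-in path: log-softmax and negative log-likelihood in one call.
builtin = F.cross_entropy(logits, ys_demo)

print(manual.item(), builtin.item())
```

The built-in version is also numerically safer for large logits, because it does not exponentiate them directly.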

### Generate Samples from Trained Model

In [80]:
g_out = torch.Generator().manual_seed(2147483647)

# Generate 10 samples
for i in range(10):
    out = []
    idx = 0
    while True:
        x_enc = F.one_hot(torch.tensor([idx]), num_classes=729).float() # input to the network, one-hot encoding
        logits = x_enc @ W # predict log-counts
        counts = logits.exp() # counts - equivalent to N2
        p_nn = counts/counts.sum(1, keepdim=True)
        
        # Sample the next character from the above probability distribution
        ix = torch.multinomial(p_nn, num_samples=1, replacement=True, generator=g_out).item()
        
        prev_char = int_to_bigram[idx][-1] # get the previous character
        next_char =int_to_char[ix]
        out.append(next_char)
        idx = bigram_to_int[prev_char + next_char]
        if ix == 0:
            break
    print("".join(out))

junide.
janasid.
prelay.
adin.
kairritonian.
juel.
kalinaaurinileniassibduinrwin.
lessiyanayla.
te.
farmumarif.


### Ex1: Summary
- In this exercise, we trained a trigram language model using both counting and neural network approaches.
- In the trigram model for counting approach:
    - Used both Laplace ($\alpha = 1$) and Lidstone ($\alpha = 0.01$) smoothing.
    - Lidstone smoothing with $\alpha = 0.01$ gave the lower loss of the two. Only this one value was tried, and the loss was measured on the training data, so it is not a tuned optimum.
    - The lowest loss achieved was an average negative log-likelihood of 1.9170 (1.9420 with Laplace smoothing).
    - This is lower than the bigram model, whose lowest average negative log-likelihood was 2.4544.
    - The generated samples from the trigram model were more coherent than those from the bigram model.

- In the trigram model for neural network approach:
    - The neural network was trained using the previous two characters as input and the next character as output.
    - The loss was 1.9853, which is close to the counting approach. This figure includes the L2 regularization term, so it is not exactly the same quantity as the counting loss.
    - Training ran for 6 rounds of 100 full-batch gradient descent steps each.
    - The learning rate was set to 50 in the first 3 rounds and then increased to 100 in the last 3 rounds.
    - The loss did not improve on the counting approach, but it was comparable.
    - The generated samples from the neural network were also coherent and similar to those from the counting approach.

## Ex 2: Split the dataset into train, validation, and test sets
- Split the dataset randomly into an 80% train set, a 10% dev set, and a 10% test set.
- Train the bigram and trigram models only on the training set. Evaluate them on the dev and test splits. What can you see?

In [3]:
from sklearn.model_selection import train_test_split

# First split into train (80%) and temp (20%)
words_train, words_temp = train_test_split(words, test_size=0.2, random_state=42)

# Then split temp into validation (10%) and test (10%)
words_val, words_test = train_test_split(words_temp, test_size=0.5, random_state=42)

print(f"Train set size: {len(words_train)}")
print(f"Validation set size: {len(words_val)}")
print(f"Test set size: {len(words_test)}")

Train set size: 25626
Validation set size: 3203
Test set size: 3204
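
If scikit-learn is not available, the same 80/10/10 split can be sketched with the standard library. The function name `split_words` and the synthetic word list are assumptions for illustration; the shuffle order differs from `train_test_split`, so the resulting sets will not match the ones above.

```python
import random


def split_words(words, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle a copy of `words` and cut it into train/dev/test lists."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_dev = int(dev_frac * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test


# Synthetic stand-in for the list of names.
toy_words = [f"name{i}" for i in range(100)]
train, dev, test = split_words(toy_words)
print(len(train), len(dev), len(test))
```

Fixing the seed keeps the split reproducible, which matters when losses from different runs are compared.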


In [4]:
import torch.nn.functional as F

In [16]:
# ---------- Initialize the Network ------------
# Seed the random number generator so that W_train is the same on every run.
g_nn = torch.Generator().manual_seed(2147483647)
# Initialize the weights from a standard normal distribution.
W_train = torch.randn((729, 27), generator=g_nn, requires_grad=True)

In [15]:
xtrain_s = [] # These are the input bigrams for the model
ytrain_s = []  # These are the target labels for the x values (or output values)
for word in words_train:
    # first add  start and end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for trigram in zip(new_word, new_word[1:], new_word[2:]):
            # print(trigram[0] + trigram[1], trigram[2])
            index1 = bigram_to_int[trigram[0] + trigram[1]]
            index2 = char_to_int[trigram[2]]
            xtrain_s.append(index1)
            ytrain_s.append(index2)

# NOTE: Use torch.tensor rather than torch.Tensor: torch.tensor infers the dtype from the data (int64 here),
# whereas torch.Tensor always creates a float32 tensor.
xs_train = torch.tensor(xtrain_s) 
ys_train = torch.tensor(ytrain_s)
num_elements = xs_train.nelement()
print("Number of examples in the dataset: ", num_elements)

Number of examples in the dataset:  208123


### Training the Neural Network
- Training the neural network using the training set.

In [36]:
# Gradient descent (full batch) on the training set
# epoch 8

# One-hot encode the training contexts once; the encoding does not change between iterations
xtrain_encoded = F.one_hot(xs_train, num_classes=729).float()

for k in range(100):
    # ----------- Forward Pass --------------------
    logits = xtrain_encoded @ W_train  # interpreted as log-counts
    # The next two lines together are the softmax function
    counts = logits.exp()  # analogous to the count matrix of the counting approach
    probabilities = counts / counts.sum(1, keepdim=True)  # normalize each row to sum to 1
    # Loss = average negative log-likelihood of the correct next character
    #        + a small L2 penalty that pulls W_train toward zero
    train_loss = -probabilities[torch.arange(ys_train.nelement()), ys_train].log().mean() + 0.01 * (W_train ** 2).mean()
    print("Loss: ", train_loss.item())

    # ----------- Backward Pass (Backpropagation) --------------------
    # Reset the gradient before backpropagation
    W_train.grad = None
    train_loss.backward()

    # NOTE: The gradient points in the direction in which the loss increases,
    # so we step in the opposite direction, scaled by the learning rate (50 here).
    # A larger learning rate speeds up learning but can make training unstable.
    W_train.data += -50 * W_train.grad

Loss:  1.9833801984786987
Loss:  1.9833307266235352
Loss:  1.9832817316055298
Loss:  1.9832329750061035
Loss:  1.9831839799880981
Loss:  1.9831353425979614
Loss:  1.9830864667892456
Loss:  1.983038067817688
Loss:  1.98298978805542
Loss:  1.9829412698745728
Loss:  1.9828929901123047
Loss:  1.9828448295593262
Loss:  1.982796549797058
Loss:  1.982749104499817
Loss:  1.982701063156128
Loss:  1.9826531410217285
Loss:  1.9826056957244873
Loss:  1.982558250427246
Loss:  1.9825105667114258
Loss:  1.9824631214141846
Loss:  1.982416033744812
Loss:  1.9823689460754395
Loss:  1.982321858406067
Loss:  1.9822747707366943
Loss:  1.9822279214859009
Loss:  1.9821810722351074
Loss:  1.9821345806121826
Loss:  1.9820878505706787
Loss:  1.982041597366333
Loss:  1.9819949865341187
Loss:  1.9819486141204834
Loss:  1.9819027185440063
Loss:  1.9818565845489502
Loss:  1.9818105697631836
Loss:  1.981764554977417
Loss:  1.98171865940094
Loss:  1.9816728830337524
Loss:  1.9816274642944336
Loss:  1.9815819263458252

### Calculating Loss on Validation and Test Sets (Model Evaluation)
- After training the models, we will evaluate them on the validation and test sets.
- While calculating loss using the validation and test sets, only the forward pass is done, i.e. the model is not trained again.
 

#### Preparing Validation set

In [37]:
xval_s = [] # These are the input bigrams for the model
yval_s = []  # These are the target labels for the x values (or output values)
for word in words_val:
    # first add  start and end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for trigram in zip(new_word, new_word[1:], new_word[2:]):
            # print(trigram[0] + trigram[1], trigram[2])
            index1 = bigram_to_int[trigram[0] + trigram[1]]
            index2 = char_to_int[trigram[2]]
            xval_s.append(index1)
            yval_s.append(index2)

# NOTE: Use torch.tensor rather than torch.Tensor: torch.tensor infers the dtype from the data (int64 here),
# whereas torch.Tensor always creates a float32 tensor.
xs_val = torch.tensor(xval_s) 
ys_val = torch.tensor(yval_s)
num_elements = xs_val.nelement()
print("Number of examples in the dataset: ", num_elements)

Number of examples in the dataset:  26085
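
The train, validation and test preparation cells differ only in the word list they loop over, so they could share one helper. The sketch below rebuilds the notebook's lookup tables under the assumption that the names contain only the letters a-z; `build_mappings` and `build_dataset` are hypothetical names. The lists it returns would be wrapped in `torch.tensor(...)` before use.

```python
import string


def build_mappings(alphabet=string.ascii_lowercase):
    """Rebuild the notebook's lookup tables (assumes names use only a-z)."""
    chars = list(alphabet)
    char_to_int = {ch: i + 1 for i, ch in enumerate(chars)}
    char_to_int["."] = 0
    bigram_to_int = {"..": 0}
    index = 1
    for ch in ["."] + chars:
        for ch_n in chars + ["."]:
            if ch + ch_n != "..":
                bigram_to_int[ch + ch_n] = index
                index += 1
    return char_to_int, bigram_to_int


def build_dataset(words, char_to_int, bigram_to_int):
    """Return (context indices, target indices) for a list of words."""
    xs, ys = [], []
    for word in words:
        chars = [".", "."] + list(word) + [".", "."]
        for ch1, ch2, ch3 in zip(chars, chars[1:], chars[2:]):
            xs.append(bigram_to_int[ch1 + ch2])
            ys.append(char_to_int[ch3])
    return xs, ys


char_to_int, bigram_to_int = build_mappings()
xs_demo, ys_demo = build_dataset(["emma", "ava"], char_to_int, bigram_to_int)
print(len(xs_demo), xs_demo[:3], ys_demo[:3])
```

Returning the tensors from a function, rather than reassigning a shared `num_elements`, also avoids picking up the size of the wrong split when cells are run out of order.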


#### Calculating Loss on Validation Set

In [None]:
# Evaluate the trained model on the validation set.
# Only a forward pass is needed. The weights are not updated, so a single pass is enough.
with torch.no_grad():
    xval_encoded = F.one_hot(xs_val, num_classes=729).float()  # one-hot encode the input contexts
    logits = xval_encoded @ W_train  # log-counts
    # Softmax: exponentiate and normalize each row
    counts = logits.exp()
    probabilities = counts / counts.sum(1, keepdim=True)
    # Average negative log-likelihood (no regularization term when evaluating)
    validation_loss = -probabilities[torch.arange(ys_val.nelement()), ys_val].log().mean()
print("Loss: ", validation_loss.item())

Loss:  1.9742374420166016


#### Preparing Test set

In [41]:
xtest_s = [] # These are the input bigrams for the model
ytest_s = []  # These are the target labels for the x values (or output values)
for word in words_test:
    # first add  start and end tokens
    new_word = [".", "."] + list(word) + [".", "."]
    for trigram in zip(new_word, new_word[1:], new_word[2:]):
            # print(trigram[0] + trigram[1], trigram[2])
            index1 = bigram_to_int[trigram[0] + trigram[1]]
            index2 = char_to_int[trigram[2]]
            xtest_s.append(index1)
            ytest_s.append(index2)

# NOTE: Use torch.tensor rather than torch.Tensor: torch.tensor infers the dtype from the data (int64 here),
# whereas torch.Tensor always creates a float32 tensor.
xs_test = torch.tensor(xtest_s) 
ys_test = torch.tensor(ytest_s)
num_elements = xs_test.nelement()
print("Number of examples in the dataset: ", num_elements)

Number of examples in the dataset:  25971


#### Calculating Loss on Test Set

In [43]:
# Evaluate the trained model on the test set.
# Only a forward pass is needed. The weights are not updated, so a single pass is enough.
with torch.no_grad():
    xtest_encoded = F.one_hot(xs_test, num_classes=729).float()  # one-hot encode the input contexts
    logits = xtest_encoded @ W_train  # log-counts
    # Softmax: exponentiate and normalize each row
    counts = logits.exp()
    probabilities = counts / counts.sum(1, keepdim=True)
    # Average negative log-likelihood (no regularization term when evaluating)
    test_loss = -probabilities[torch.arange(ys_test.nelement()), ys_test].log().mean()
print("Loss: ", test_loss.item())

Loss:  1.9923995733261108


In [35]:
g_out = torch.Generator().manual_seed(2147483647)

# Generate 10 samples
for i in range(10):
    out = []
    idx = 0
    while True:
        x_enc = F.one_hot(torch.tensor([idx]), num_classes=729).float() # input to the network, one-hot encoding
        logits = x_enc @ W_train # predict log-counts
        counts = logits.exp() # counts - equivalent to N2
        p_nn = counts/counts.sum(1, keepdim=True)
        
        # Sample the next character from the above probability distribution
        ix = torch.multinomial(p_nn, num_samples=1, replacement=True, generator=g_out).item()
        
        prev_char = int_to_bigram[idx][-1] # get the previous character
        next_char =int_to_char[ix]
        out.append(next_char)
        idx = bigram_to_int[prev_char + next_char]
        if ix == 0:
            break
    print("".join(out))

junide.
jakasid.
prelay.
adin.
kairritonian.
juel.
kalinaaryanileniassibduinrwin.
lessiyanayla.
te.
farmumarif.


### Summary for Ex 2
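- In this exercise, we split the dataset into train (80%), validation (10%), and test (10%) sets.
- The neural network trigram model was trained on the training set and evaluated on the validation and test sets. Only this model is covered here.
- Below are the results of the evaluation:
    - Loss on training set: 1.9790 (as printed by the training loop, which includes the L2 regularization term)
    - Loss on validation set: 1.9742
    - Loss on test set: 1.9924

- The three losses are close to each other, which suggests that the model is not overfitting the training set.
- The model saw neither the validation set nor the test set during training, so the small gap between the two (about 0.018) most likely reflects random variation between two samples of roughly 3,200 names each.
- The training loss is not directly comparable with the other two, because it includes the regularization term that the evaluation loss leaves out.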