## Notes:

Proper initialization:
1) You spend more time actually doing meaningful training
- Poorly initialized weights make training spend its early steps just squashing the weights down. (Large weights) -> (Large logits). Since the logits are passed into an exp function inside the softmax, small gaps get amplified: exp(101) is already e (about 2.7) times exp(100), and a gap of 10 becomes a factor of roughly 22,000.
- As a result, the output layer probabilities are very confident at initialization (one value close to 1, the other values close to 0). With random weights that confidence is unfounded, so the model is usually confidently wrong and the initial loss is very high.
- The first steps are basically breaking the ego of this boisterous model so that it has the humility needed for learning.
- Spending steps on this may not seem costly in the grand scheme of things, but when models scale up immensely, it can delay finishing by days or perhaps months.
- At initialization we want the output of each preactivation to be unit Gaussian (mean 0, variance 1). Getting there is random variable algebra. Each preactivation is a sum of (input * weight) products, one per input feature. If the inputs and weights are independent with mean 0 and variance 1, each product has variance 1, so the variance of the sum is the number of features per input example (1+1+1...+1). Multiplying a random variable by a value multiplies its variance by value**2. With 10 input features the variance is 10, and since we want it to be 1 we need to divide it by 10. If we multiply the weights by 1/sqrt(10), then that squared is 1/10, which gets exactly what we want.
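
The two effects described above can be checked numerically. The following is a minimal sketch, not the notebook's actual model: the logits are random stand-ins for a badly initialized output layer, and `num_classes`, `fan_in`, and `fan_out` are hypothetical sizes chosen for illustration (27 matches the 26 letters plus "." used here, 10 matches the worked example).

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Part 1: oversized logits make an untrained model confidently wrong.
num_classes = 27                                        # assumed: 26 letters plus "."
targets = torch.randint(0, num_classes, (1000,))
small_logits = torch.randn(1000, num_classes) * 0.01    # near-uniform predictions
large_logits = torch.randn(1000, num_classes) * 10.0    # stand-in for a bad init

loss_small = F.cross_entropy(small_logits, targets).item()
loss_large = F.cross_entropy(large_logits, targets).item()
uniform_loss = math.log(num_classes)                    # loss of a uniform guess
max_prob_large = F.softmax(large_logits, dim=1).max(dim=1).values.mean().item()

# Part 2: scaling weights by 1/sqrt(fan_in) keeps preactivation variance near 1.
fan_in, fan_out = 10, 200                               # hypothetical layer sizes
x = torch.randn(10000, fan_in)                          # unit-variance inputs
W_raw = torch.randn(fan_in, fan_out)                    # unit-variance weights
W_scaled = W_raw / math.sqrt(fan_in)

var_raw = (x @ W_raw).var().item()                      # close to fan_in
var_scaled = (x @ W_scaled).var().item()                # close to 1

print(f"loss: small={loss_small:.2f} large={loss_large:.2f} uniform={uniform_loss:.2f}")
print(f"mean top probability with large logits: {max_prob_large:.2f}")
print(f"preactivation variance: raw={var_raw:.2f} scaled={var_scaled:.2f}")
```

With small logits the loss sits near log(27), the loss of a uniform guess, while the large logits give a much higher loss because most of the probability mass lands on a wrong class. The second part is the 1/sqrt(10) argument in code: the unscaled variance comes out near 10 and the scaled one near 1.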

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from typing import List, Union, Any
%matplotlib inline

In [None]:
with open("names.txt") as f:
    words = f.read().splitlines()

In [None]:
# build dictionaries for lookups
chars = ["."] + sorted(set("".join(words)))
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}

In [None]:
def build_dataset(words, context_length=5):
    """Turn words into (context window, next character) index pairs."""
    X_list: List[List[int]] = []
    y_list: List[int] = []
    for word in words:
        # start each word with a window of "." padding (index 0)
        window = [0] * context_length
        for char in word + ".":
            X_list.append(window)
            y_list.append(char_to_idx[char])
            # slide the window: drop the oldest index, append the current one
            window = window[1:] + [char_to_idx[char]]

    return torch.tensor(X_list), torch.tensor(y_list)

import random
random.seed(42)
random.shuffle(words)
# 80% train, 10% dev, 10% test
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

X_train, y_train = build_dataset(words[:n1])
X_dev, y_dev = build_dataset(words[n1:n2])
X_test, y_test = build_dataset(words[n2:])

X_train.shape[0], X_dev.shape[0], X_test.shape[0]
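
To make the windowing in `build_dataset` concrete, here is a small standalone sketch that applies the same sliding-window logic to a single word. The helper `context_target_pairs` and the mini-vocabulary built from "emma" are hypothetical and only for illustration; the notebook itself uses `char_to_idx` built from `names.txt`.

```python
# Hypothetical mini-vocabulary; the notebook builds its own from names.txt.
demo_chars = ["."] + sorted(set("emma"))
demo_char_to_idx = {ch: i for i, ch in enumerate(demo_chars)}
demo_idx_to_char = {i: ch for ch, i in demo_char_to_idx.items()}

def context_target_pairs(word, char_to_idx, context_length=5):
    """Return (context, target) index pairs for one word, padded with index 0."""
    pairs = []
    window = [0] * context_length
    for ch in word + ".":
        target = char_to_idx[ch]
        pairs.append((window, target))
        # build a new list each step so earlier windows are not mutated
        window = window[1:] + [target]
    return pairs

pairs = context_target_pairs("emma", demo_char_to_idx)
for context, target in pairs:
    context_str = "".join(demo_idx_to_char[i] for i in context)
    print(context_str, "->", demo_idx_to_char[target])
```

A word of length n yields n + 1 examples, because the closing "." is also a prediction target. This is why the example counts reported above are larger than the number of words in each split.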