# Bigram Part 5

### Below is a summary, with code, of what we've done so far to implement the bigram model as a neural network!

In [1]:
import torch
import matplotlib.pyplot as plt
%matplotlib inline

In [17]:
import torch.nn.functional as F

### We have an input dataset and map indices to unique characters

In [3]:
#load names.txt and split it into a list of names, one per line
words = open('names.txt', 'r').read().splitlines()

In [8]:
chars = sorted(list(set(''.join(words)))) #make words one big string of chars, and then into a list of sorted, unique chars
stoi = {s : i+1 for i,s in enumerate(chars)} #string to integer map, shift values for '.'
stoi['.'] = 0
itos = {i : s for s, i in stoi.items()}
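
As a quick sanity check on the two maps, the sketch below round-trips a word through `stoi` and `itos`. The three-name `words` list is a stand-in (an assumption) so the snippet runs without `names.txt`.

```python
# Stand-in for the real dataset (assumption): three names instead of names.txt
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))             # unique characters, sorted
stoi = {s: i + 1 for i, s in enumerate(chars)}  # index 0 is reserved for '.'
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}          # inverse map: integer -> character

# Encode ".emma." to integers, then decode it back to a string
encoded = [stoi[ch] for ch in "." + words[0] + "."]
decoded = "".join(itos[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```

With this tiny vocabulary the indices differ from the notebook's (where `e` is 5 and `m` is 13), because the full dataset contains many more distinct letters.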

### We take an example input to the neural net, the first word in the dataset, ```emma```, and add each character's integer mapping to a list of inputs to the neural net and a list of labels for the correct next character in the sequence:

In [9]:
for w in words[:1]:
    print(w)

emma


In [11]:
xs, ys = [], [] #inputs and targets

for w in words[:1]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        print(ch1, ch2)
        xs.append(ix1)
        ys.append(ix2)

. e
e m
m m
m a
a .


### We make the list of inputs and the list of labels tensors

In [12]:
#create tensors out of this
xs = torch.tensor((xs)) #not torch.Tensor(which would cast as float) 
ys = torch.tensor((ys))

In [13]:
xs #inputs into the neural net for emma

tensor([ 0,  5, 13, 13,  1])

In [15]:
ys #target/label for next character in sequence for emma

tensor([ 5, 13, 13,  1,  0])
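
The comment about `torch.tensor` versus `torch.Tensor` matters because `F.one_hot`, used later in this notebook, expects integer indices. A small check, using the `emma` indices printed above:

```python
import torch
import torch.nn.functional as F

idx = [0, 5, 13, 13, 1]  # inputs for "emma", copied from the output above

a = torch.tensor(idx)    # infers the dtype from the data -> int64
b = torch.Tensor(idx)    # builds the default float type -> float32

print(a.dtype)
print(b.dtype)

# one_hot needs integer (int64) indices, so `a` can be passed directly
one_hot = F.one_hot(a, num_classes=27)
print(one_hot.shape)
```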

### Randomly initialize $27$ neurons' weights. Each neuron will receive $27$ inputs

- Each character is represented as a 27-dimensional one-hot vector, so the input to the network has 27 features. Thus, the first parameter $27$ in ```torch.randn((27,27))``` represents the number of input features to a neuron, which corresponds to the size of the one-hot encoded vectors.
- The second parameter is the number of neurons in the layer ($27$), with each neuron corresponding to a potential next character prediction.
-  Each column in the matrix ```W``` corresponds to the weights for one neuron. This allows the network to compute the ```weighted sum``` of inputs to determine the activation for each possible next character.
-  The network outputs a 27-dimensional vector where each element represents the "score" or "activation" of a corresponding character being the next in the sequence.

In [37]:
g = torch.Generator().manual_seed(2147483647) #for reproducibility
W = torch.randn((27,27), generator = g) #27 input features (rows) by 27 output neurons (columns)
print(W.shape)
print(W[:1]) #first row of W: the weight from input feature 0 ('.') into each of the 27 neurons

torch.Size([27, 27])
tensor([[ 1.5674, -0.2373, -0.0274, -1.1008,  0.2859, -0.0296, -1.5471,  0.6049,
          0.0791,  0.9046, -0.4713,  0.7868, -0.3284, -0.4330,  1.3729,  2.9334,
          1.5618, -1.6261,  0.6772, -0.8404,  0.9849, -0.1484, -1.4795,  0.4483,
         -0.0707,  2.4968,  2.4448]])


### Plug in all the input examples (```xs```) into the neural network and do a forward pass
- Each row of ```logits```, or ```log-counts```, holds the raw output values of the final layer for one input character, before any activation function is applied. They are the unnormalized prediction scores for each class or output node.
- ```logits``` are the result of applying a linear transformation to the input data, usually by multiplying the input (here, the one-hot encoded vectors) by a weight matrix ```W``` and adding a bias ```b```. This network has no bias term, so the logits are simply ```xencoded @ W```.
- Logits are not probabilities; they can be any real number, positive or negative. The higher the logit value for a particular class, the more the model believes that class is the correct one. However, since logits are not probabilities, they cannot be interpreted directly as confidence levels.

In [44]:
#encode all of the inputs into one-hot representations. xencoded is an array of 5x27 with mostly 0's
xencoded = F.one_hot(xs, num_classes=27).float() #input to network: one hot encoding
print(f'X Encoded Shape = {xencoded.shape} \nX Encoded = {xencoded[1]}')

# We then multiply this in the first layer of the neural net (27x27) to get logits; 5x27 x 27x27 = 5x27
# Evaluate all 27 neurons on all 5 input vectors in parallel:
# Telling us: what is the firing rate/activation for the 27 neurons we made on all 5 of our input vectors?
logits = xencoded @ W #predict log-counts or logits
print(f'Logits Shape = {logits.shape}\nLogits = {logits[1]}')

X Encoded Shape = torch.Size([5, 27]) 
X Encoded = tensor([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])
Logits Shape = torch.Size([5, 27])
Logits = tensor([ 4.7236e-01,  1.4830e+00,  3.1748e-01,  1.0588e+00,  2.3982e+00,
         4.6827e-01, -6.5650e-01,  6.1662e-01, -6.2197e-01,  5.1007e-01,
         1.3563e+00,  2.3445e-01, -4.5585e-01, -1.3132e-03, -5.1161e-01,
         5.5570e-01,  4.7458e-01, -1.3867e+00,  1.6229e+00,  1.7197e-01,
         9.8846e-01,  5.0657e-01,  1.0198e+00, -1.9062e+00, -4.2753e-01,
        -2.1259e+00,  9.6041e-01])
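
Because each input row is one-hot, the matrix multiply does nothing more than pick out a row of `W`. The sketch below checks this with a `W` sampled the same way as above (the exact values are not relied on):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

xs = torch.tensor([0, 5, 13, 13, 1])           # ".emma" as input indices
xenc = F.one_hot(xs, num_classes=27).float()   # shape (5, 27)

logits = xenc @ W   # matrix multiply, shape (5, 27)
rows = W[xs]        # plain row lookup, shape (5, 27)

print(torch.allclose(logits, rows))
```

So `logits[1]` is simply row 5 of `W`, the row associated with the input character `e`.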


### Looking at ```logits``` output above:
- Each value corresponds to a character: There are 27 values here, one for each of the 26 unique characters in ```words``` plus the ```.``` start/end token
- Each logit represents the model's raw prediction score for the next character in the sequence given the current character

### Interpreting a Few Values:
- The highest logit value here is 2.3982, at the fifth position (index $4$). This suggests that, given the input ```e```, the model scores the character at index $4$ as the most likely next character. Because index $0$ is reserved for ```.```, index $4$ is the letter ```d```.
- Negative values like -6.5650e-01 or -2.1259e+00 are low scores for those classes, implying they are less likely to be the next character. Since ```W``` is still random, none of these scores reflect anything learned from the data yet.
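
To check which character a logit position refers to, the position can be mapped back through `itos`. The sketch below copies the printed `logits[1]` values and rebuilds `itos` under the assumption that the vocabulary is `.` followed by the 26 lowercase letters, which matches the indices seen so far (`a` is 1, `e` is 5, `m` is 13):

```python
import string

# logits[1] copied from the output above (input character 'e')
logits_e = [
    4.7236e-01, 1.4830e+00, 3.1748e-01, 1.0588e+00, 2.3982e+00,
    4.6827e-01, -6.5650e-01, 6.1662e-01, -6.2197e-01, 5.1007e-01,
    1.3563e+00, 2.3445e-01, -4.5585e-01, -1.3132e-03, -5.1161e-01,
    5.5570e-01, 4.7458e-01, -1.3867e+00, 1.6229e+00, 1.7197e-01,
    9.8846e-01, 5.0657e-01, 1.0198e+00, -1.9062e+00, -4.2753e-01,
    -2.1259e+00, 9.6041e-01,
]

# Assumed vocabulary: '.' at index 0, then a-z at indices 1..26
itos = {0: "."}
itos.update({i + 1: ch for i, ch in enumerate(string.ascii_lowercase)})

best = max(range(len(logits_e)), key=lambda i: logits_e[i])
worst = min(range(len(logits_e)), key=lambda i: logits_e[i])

print(best, itos[best])    # position and character with the largest logit
print(worst, itos[worst])  # position and character with the smallest logit
```

Under that assumption, the largest logit sits at index 4, which maps to `d`, since index 0 is taken by `.`.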

### Softmax
To convert logits into probabilities, you typically apply a softmax function. This converts raw scores to probabilities that sum to $1$.

This function exponentiates each logit to get counts, normalizes these exponentiated values (counts) by dividing by their sum, and ensures that the probabilities for all classes sum to $1$.

### Exponentiation of Logits:

Purpose: This step transforms the logits into positive numbers, called ```unnormalized probabilities``` or ```counts```, which helps in converting the raw scores into a more interpretable form.

For each logit $z_i$, the exponential transformation is given by $\exp(z_i)$.


### Normalization:

Purpose: After exponentiation, the values are normalized by dividing each "count" by the sum of all "counts" in the tensor. This ensures that the resulting values form a valid probability distribution, i.e., they sum to 1.

Formula : 
$$
P(y = i) = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}
$$

This does:

- ```Row-wise Summation```: ```counts.sum(1, keepdim=True)``` calculates the sum of each row, producing a $(5, 1)$ tensor. Each element in this result is the sum of one row from the counts tensor.

- ```Broadcasted Division```: The division ```counts / counts.sum(1, keepdim=True)``` then divides each element in a row by the corresponding sum for that row, effectively normalizing each row to produce a valid probability distribution where the probabilities of all classes for a given sample sum to $1$.



In [53]:
#exponentiate the logits to get fake counts
counts = logits.exp() #counts (equivalent to N in prev. model)
print(counts[1])

#normalize the counts to get probabilities
probs = counts/counts.sum(1, keepdim=True) #probabilities for next character
print(probs[1])
print(probs.shape)

tensor([ 1.6038,  4.4060,  1.3737,  2.8830, 11.0032,  1.5972,  0.5187,  1.8527,
         0.5369,  1.6654,  3.8818,  1.2642,  0.6339,  0.9987,  0.5995,  1.7432,
         1.6073,  0.2499,  5.0680,  1.1876,  2.6871,  1.6596,  2.7728,  0.1486,
         0.6521,  0.1193,  2.6128])
tensor([0.0290, 0.0796, 0.0248, 0.0521, 0.1989, 0.0289, 0.0094, 0.0335, 0.0097,
        0.0301, 0.0702, 0.0228, 0.0115, 0.0181, 0.0108, 0.0315, 0.0291, 0.0045,
        0.0916, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472])
torch.Size([5, 27])


### Normalize the counts to get probability distribution over possible next characters

- The ```counts.sum(1, keepdim=True)``` computes the sum of counts along the specified dimension. Here $1$ is the column axis of the 2D ```counts``` tensor, so the $27$ values within each row are added together.

- The first dimension (axis $0$) has a size of $5$, which corresponds to the number of input examples we're processing in parallel. The second dimension (axis $1$) has a size of $27$, which corresponds to the number of possible characters in the character set.

- When you perform a sum operation along a specific axis, you are effectively collapsing that dimension by summing across its elements. Summing along axis $1$ collapses the $27$ columns and leaves one total per row.

- The parameter to which we passed $1$ is the axis along which the summation is performed.

- ```keepdim=True``` keeps the summed axis as a dimension of size $1$, so the result has shape $(5, 1)$ and broadcasts against the $(5, 27)$ ```counts``` tensor.

- Example: For three logits $[2.0, 1.0, 0.1]$, the counts are $[7.389, 2.718, 1.105]$ and their sum is $7.389 + 2.718 + 1.105 = 11.212$. The probabilities are then $[7.389/11.212, 2.718/11.212, 1.105/11.212] = [0.659, 0.242, 0.099]$.
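
The arithmetic in this example corresponds to logits of 2.0, 1.0 and 0.1 (the values whose exponentials are 7.389, 2.718 and 1.105). A plain-Python check of the exponentiate-then-normalize steps:

```python
import math

logits = [2.0, 1.0, 0.1]

counts = [math.exp(z) for z in logits]  # exponentiate -> positive "counts"
total = sum(counts)
probs = [c / total for c in counts]     # normalize -> probabilities

print([round(c, 3) for c in counts])
print([round(p, 3) for p in probs])
print(sum(probs))
```

The ordering of the logits is preserved: the largest logit gets the largest probability, and the probabilities add up to 1 (up to floating-point rounding).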


### Keepdim=True:

```keepdim=True``` ensures the result retains the original two-dimensional shape with a size of $1$ for the summed dimension. This allows you to perform ```element-wise division``` of ```counts``` by the sum for each row without needing to reshape the result manually. This division will properly broadcast across each element in the row, ensuring the resulting probs tensor has the same shape as counts.

The ```keepdim=True``` parameter in ```PyTorch``` is used when performing reduction operations (like sum, mean, max, etc.) to control the dimensionality of the output tensor.

When you perform operations that reduce the number of dimensions (e.g., summing over one axis), the resulting tensor typically has fewer dimensions than the original tensor. However, in certain situations, maintaining the dimensionality can be beneficial for further tensor operations, especially when broadcasting is involved.

In short, it makes sure the output tensor can be easily broadcast against other tensors of the same initial dimensionality.
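
To see why `keepdim=True` matters for the normalization, the sketch below compares both settings on a small square tensor (a made-up 3x3 example, chosen because a square shape lets the wrong version run without raising an error):

```python
import torch

counts = torch.tensor([[1.0, 1.0, 2.0],
                       [1.0, 3.0, 6.0],
                       [2.0, 2.0, 4.0]])

row_sums_kept = counts.sum(1, keepdim=True)  # shape (3, 1)
row_sums_flat = counts.sum(1)                # shape (3,)

good = counts / row_sums_kept  # (3, 3) / (3, 1): each row is divided by its own sum
bad = counts / row_sums_flat   # (3, 3) / (3,): the sums line up with columns instead

print(row_sums_kept.shape, row_sums_flat.shape)
print(good.sum(1))  # every row sums to 1
print(bad.sum(1))   # rows no longer sum to 1
```

With the notebook's 5x27 `counts`, dropping `keepdim=True` would instead fail with a shape error, since a length-5 vector cannot be broadcast against 27 columns.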

In [67]:
print(xs) #., e, m, m, a
print(ys) #e, m, m, a, .

tensor([ 0,  5, 13, 13,  1])
tensor([ 5, 13, 13,  1,  0])


### Breaking down the $5$ bigram examples in ```emma```:

In [87]:
nlls = torch.zeros(5) #stores the negative log likelihood for each bigram in the sequence
print(nlls)
for i in range(5):
    #i-th bigram:
    x = xs[i].item() #input character index in xs
    y = ys[i].item() #label character index in ys
    print('--------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indices {x},{y})')
    print('input to the neural net: ', x)

    #probability distribution over the next character for the i-th input character
    print('output probabilities from the neural net: \n', probs[i]) # 27 probabilities for the i-th input
    print('label (actual next character): ', y)
    p = probs[i, y] #p = the probability assigned by the neural network to the correct next character
    print(f' probability assigned by the neural net to the correct character: {p.item():.4f}')

    #log probabilities are easier to sum for a sequence of characters than raw probabilities, which would require multiplication
    logp = torch.log(p) #if p is close to 0, the log likelihood is a large negative number; if p is close to 1, it is close to 0
    print(f'log likelihood : {logp.item():.4f}') #log of p: sums over a sequence without underflowing

    #loss function -> we want to minimize the negative log likelihood
    nll = -logp
    print('negative log likelihood: ', nll.item()) #negative log likelihood -> large positive num bad, close to 0 good
    nlls[i] = nll

print('===============')
print(f'AVERAGE LOSS / NEGATIVE LOG LIKELIHOOD: {nlls.mean().item()}')

tensor([0., 0., 0., 0., 0.])
--------
bigram example 1: .e (indices 0,5)
input to the neural net:  0
output probabilities from the neural net: 
 tensor([0.0607, 0.0100, 0.0123, 0.0042, 0.0168, 0.0123, 0.0027, 0.0232, 0.0137,
        0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2378, 0.0603, 0.0025,
        0.0249, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1537, 0.1459])
label (actual next character):  5
 probability assigned by the neural net to the correct character: 0.0123
log likelihood : -4.3993
negative log likelihood:  4.399273872375488
--------
bigram example 2: em (indices 5,13)
input to the neural net:  5
output probabilities from the neural net: 
 tensor([0.0290, 0.0796, 0.0248, 0.0521, 0.1989, 0.0289, 0.0094, 0.0335, 0.0097,
        0.0301, 0.0702, 0.0228, 0.0115, 0.0181, 0.0108, 0.0315, 0.0291, 0.0045,
        0.0916, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472])
label (actual next character):  13
 probability assigned by the neural net to th
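
The per-bigram loop above is written for readability. As a sketch of the same computation in vectorized form (not something this notebook does at this point), the probabilities of the correct characters can be gathered with integer indexing and averaged in one line. `W` is resampled here so the snippet is self-contained:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

xs = torch.tensor([0, 5, 13, 13, 1])  # inputs for "emma"
ys = torch.tensor([5, 13, 13, 1, 0])  # targets for "emma"

logits = F.one_hot(xs, num_classes=27).float() @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)

# Loop version: one negative log likelihood per bigram
nlls = torch.zeros(5)
for i in range(5):
    nlls[i] = -torch.log(probs[i, ys[i]])

# Vectorized version: pick probs[i, ys[i]] for every i at once
loss = -probs[torch.arange(5), ys].log().mean()

print(nlls.mean().item(), loss.item())
```

Both numbers should agree up to floating-point error, since the indexing expression selects exactly the entries the loop reads one at a time.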

### This is a good summary of what we've done so far. Unfortunately, we didn't get lucky with our set of parameters in ```W```. Fortunately, we can change ```W``` by resampling it.

In [89]:
g = torch.Generator().manual_seed(2147483647 + 1) #adjust seed by 1, which will cause a different W
W = torch.randn((27,27), generator = g)

In [90]:
xencoded = F.one_hot(xs, num_classes=27).float() 
logits = xencoded @ W #predict log-counts or logits

In [91]:
counts = logits.exp() 
probs = counts/counts.sum(1, keepdim=True)

In [92]:
nlls = torch.zeros(5) #stores the negative log likelihood for each bigram in the sequence
for i in range(5):
    #i-th bigram:
    x = xs[i].item() #input character index in xs
    y = ys[i].item() #label character index in ys
    print('--------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indices {x},{y})')
    print('input to the neural net: ', x)

    #probability distribution over the next character for the i-th input character
    print('output probabilities from the neural net: \n', probs[i]) # 27 probabilities for the i-th input
    print('label (actual next character): ', y)
    p = probs[i, y] #p = the probability assigned by the neural network to the correct next character
    print(f' probability assigned by the neural net to the correct character: {p.item():.4f}')

    #log probabilities are easier to sum for a sequence of characters than raw probabilities, which would require multiplication
    logp = torch.log(p) #if p is close to 0, the log likelihood is a large negative number; if p is close to 1, it is close to 0
    print(f'log likelihood : {logp.item():.4f}') #log of p: sums over a sequence without underflowing

    #loss function -> we want to minimize the negative log likelihood
    nll = -logp
    print('negative log likelihood: ', nll.item()) #negative log likelihood -> large positive num bad, close to 0 good
    nlls[i] = nll

print('===============')
print(f'AVERAGE LOSS / NEGATIVE LOG LIKELIHOOD: {nlls.mean().item()}')

--------
bigram example 1: .e (indices 0,5)
input to the neural net:  0
output probabilities from the neural net: 
 tensor([0.0049, 0.0959, 0.0281, 0.0703, 0.0961, 0.0573, 0.0241, 0.0135, 0.0093,
        0.1416, 0.0225, 0.0217, 0.0513, 0.0106, 0.0097, 0.0291, 0.0229, 0.0273,
        0.0325, 0.0275, 0.0446, 0.0501, 0.0214, 0.0093, 0.0120, 0.0354, 0.0310])
label (actual next character):  5
 probability assigned by the neural net to the correct character: 0.0573
log likelihood : -2.8587
negative log likelihood:  2.858668565750122
--------
bigram example 2: em (indices 5,13)
input to the neural net:  5
output probabilities from the neural net: 
 tensor([0.0426, 0.0113, 0.0266, 0.0507, 0.2370, 0.0580, 0.0421, 0.0094, 0.0136,
        0.0297, 0.0044, 0.0782, 0.1028, 0.0146, 0.0172, 0.0288, 0.0263, 0.0319,
        0.0248, 0.0210, 0.0063, 0.0057, 0.0309, 0.0269, 0.0298, 0.0089, 0.0205])
label (actual next character):  13
 probability assigned by the neural net to the correct character: 0.0146
l

### Loss has gone down by about $.4$ just from changing the seed by $1$! So we could optimize our neural network by guess and check, but that is extremely inefficient. Instead, we will start with a random guess and then optimize it.

### The good news is that our loss function is made up of differentiable operations, so we can minimize the loss by computing the gradients of the loss with respect to the entries of ```W```. We can then tune ```W``` with gradient-based optimization to find a setting that gives a low loss. If you've worked through ```Micrograd```, much of it will be similar to that. Let's see how that would work.
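
As a rough preview of what gradient-based tuning of `W` could look like, here is a minimal sketch on the five `emma` bigrams. The learning rate of `1.0` and the ten iterations are arbitrary choices for illustration, not values taken from this notebook:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])  # inputs for "emma"
ys = torch.tensor([5, 13, 13, 1, 0])  # targets for "emma"

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

xenc = F.one_hot(xs, num_classes=27).float()
losses = []
for step in range(10):
    # forward pass: logits -> counts -> probabilities -> average NLL
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(5), ys].log().mean()
    losses.append(loss.item())

    # backward pass: gradient of the loss with respect to W
    W.grad = None
    loss.backward()

    # nudge W a small step against the gradient (assumed learning rate 1.0)
    W.data += -1.0 * W.grad

print(losses[0], losses[-1])
```

Each step moves `W` in the direction that lowers the loss on these five examples, so the last recorded loss should be smaller than the first.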