Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

Check whether your system can use the GPU. If this prints cuda, PyTorch can see the GPU; otherwise the code falls back to the CPU.

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [3]:
with open('Book.txt','r',encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))
print(chars)
vocab_size = len(chars)

['\n', '\x0c', ' ', '"', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '©', '\xad', '·', '½', '×', 'à', '÷', '–', '—', '‘', '’', '“', '”', '•', '…', '€', '\uf02b', '\uf06e', '\uf071', '\uf092', '\uf094', '\uf0b4', '\uf0e6', '\uf0e7', '\uf0e8', '\uf0f6', '\uf0f7', '\uf0f8']


Tokenizer: we use a character-level tokenizer, which maps each character to an integer. This gives a very small vocabulary, but a very large number of tokens to process.

A language model does not work on the raw string directly. We use the PyTorch framework (torch) and store the encoded text in its core data structure, the tensor.

In [4]:
string_to_int = {ch:i for i,ch in enumerate(chars)}
int_to_string = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [string_to_int[c] for c in s ]
decode = lambda l: ''.join([int_to_string[i] for i in l])

data = torch.tensor(encode(text), dtype = torch.long)
print(data)

tensor([49, 50, 51,  ...,  0,  0,  1])
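
As a quick check of the character-level tokenizer, the sketch below rebuilds the same two lookup tables on a tiny made-up string and confirms that decoding an encoded string gives the original back. `sample_text` is an assumption for illustration, not the contents of Book.txt, and the lambdas are written as small functions for readability.

```python
# Toy stand-in for the book text (hypothetical, for illustration only).
sample_text = "hello world"

chars = sorted(set(sample_text))  # unique characters, sorted
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character to its integer id."""
    return [string_to_int[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(int_to_string[i] for i in ids)

ids = encode("hello")
print(chars)
print(ids)
print(decode(ids))
```

Any character that does not appear in the text used to build the tables would raise a KeyError in `encode`, which is one limitation of building the vocabulary from the data itself.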


We split the data into training (80%) and validation (20%) sets. Measuring the loss on held-out validation data shows whether the model is generalizing or just memorizing the training text (overfitting).

We are going to use a bigram language model, which predicts the next character from the current character only. Take the string "hello".
The bigram model sees these pairs:
- start of content -> h
- h -> e
- e -> l
- l -> l
- l -> o
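
The pairs listed above can be produced with a few lines of plain Python. This is only an illustration of the idea; the `<start>` marker and the helper name `bigram_pairs` are assumptions made for this sketch and are not used elsewhere in the notebook.

```python
def bigram_pairs(word, start="<start>"):
    """Return (context, next_char) pairs, with a start marker before the first character."""
    symbols = [start] + list(word)
    return list(zip(symbols[:-1], symbols[1:]))

for context, target in bigram_pairs("hello"):
    print(f"{context} -> {target}")
```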

How do we turn the bigram model into an artificial neural network and train it? We use a block size: each training example is a random snippet of block_size encoded characters, and its targets are the same snippet shifted by one position. Training reduces the difference between the predictions and the targets.

block size = length of each sequence
batch size = how many sequences are stacked and processed at the same time

Layers such as nn.Linear and nn.Embedding are important because the nn module tracks learnable parameters. A weight or bias defined through an nn layer is registered as a parameter, and training updates it via backpropagation.

An embedding vector converts a character (its integer id) into a list of numbers. nn.Embedding provides this as a learnable lookup table.

@ multiplies two matrices in torch; torch.matmul does the same.

In PyTorch, @ and torch.matmul need both operands to have the same dtype. Multiplying an integer tensor by a float tensor this way raises an error, so cast one of them first (for example with .float()).
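
A minimal sketch of the two notes above on matrix multiplication and dtypes. The small matrices are made up for illustration. In recent PyTorch versions the matrix product of an int64 tensor and a float32 tensor raises a RuntimeError, which the `try` block catches; treat the exact error message as version dependent.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])

# `@` and torch.matmul compute the same matrix product.
same = torch.equal(a @ b, torch.matmul(a, b))
print(a @ b, same)

int_mat = torch.tensor([[1, 2], [3, 4]])  # dtype torch.int64
try:
    int_mat @ b  # int64 @ float32
    print("mixed dtypes were accepted")
except RuntimeError as err:
    print("matmul with mixed dtypes failed:", err)

# Casting first makes the dtypes match.
fixed = int_mat.float() @ b
print(fixed)
```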

In [5]:
block_size = 8
batch_size = 4

n = int(0.8*len(data))

train_data = data[:n]
val_data = data[n:]

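def get_batch(split):
    # Pick the split to sample from.
    data = train_data if split == 'train' else val_data
    # Random start offsets, one per sequence in the batch.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    print(ix)  # debug output: this also prints on every training step
    # Inputs are blocks of block_size tokens; targets are the same blocks shifted by one.
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y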

x, y = get_batch('train')
print('inputs:')
print(x)
print("targets:")
print(y)
    

tensor([ 616113, 1170430, 1041937,  444570])
inputs:
tensor([[66,  2, 69, 80, 79,  2, 61, 63],
        [24, 16, 35, 35,  8, 17,  9, 58],
        [ 2, 66, 75, 78,  2, 80, 68, 65],
        [69, 63, 68,  2, 80, 68, 65,  2]], device='cuda:0')
targets:
tensor([[ 2, 69, 80, 79,  2, 61, 63, 80],
        [16, 35, 35,  8, 17,  9, 58, 14],
        [66, 75, 78,  2, 80, 68, 65,  2],
        [63, 68,  2, 80, 68, 65,  2, 65]], device='cuda:0')
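
To make the "offset by one" relationship concrete, the sketch below takes the first input row printed above together with its final target and lists the (input, target) pair at each position. It uses a plain Python list instead of a tensor; the name `tokens` is introduced here only for illustration.

```python
block_size = 8

# First row of the inputs above, followed by the last id of its target row.
tokens = [66, 2, 69, 80, 79, 2, 61, 63, 80]

x = tokens[:block_size]       # what the model sees
y = tokens[1:block_size + 1]  # what it should predict at each position

for t in range(block_size):
    print(f"input {x[t]} -> target {y[t]}")
```

Each block therefore provides block_size training pairs, and for a bigram model only the single input id at each position is used to predict its target.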


Gradient descent minimizes the loss function by repeatedly moving the parameters a small step in the direction that reduces the loss. The learning rate is the size of each step. If the steps are too large, the parameters change drastically and training can overshoot the minimum; if they are too small, training is slow. A moderate value gives good training.
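
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose gradient is 2*x, with three learning rates. The function, the starting point, and the learning rates are all assumptions chosen for illustration and have nothing to do with the model in this notebook.

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimize f(x) = x**2 with plain gradient descent; the gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With the smallest rate x is still far from the minimum at 0 after 20 steps, the middle rate gets close, and the largest rate overshoots on every step so x grows instead of shrinking.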

We are going to use AdamW, which is the Adam optimizer with decoupled weight decay. Weight decay shrinks the weights slightly toward zero at every step, which keeps any single parameter from growing too large and usually helps generalization. Set too high, it can also hurt, because the model underfits.
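
One way to see what weight decay does on its own is to take an optimizer step while the gradient is exactly zero, so that any change in the parameter comes from the decay term. This is a sketch under that artificial assumption; the helper `one_step` and the values lr=0.1 and weight_decay=0.5 are made up for illustration.

```python
import torch

def one_step(weight_decay):
    """Take one AdamW step on a single parameter whose gradient is zero."""
    p = torch.nn.Parameter(torch.tensor([1.0]))
    opt = torch.optim.AdamW([p], lr=0.1, weight_decay=weight_decay)
    p.grad = torch.zeros_like(p)  # pretend the loss gradient is exactly zero
    opt.step()
    return p.item()

no_decay = one_step(weight_decay=0.0)
with_decay = one_step(weight_decay=0.5)
print(no_decay, with_decay)
```

Without decay the parameter stays where it is; with decay it is multiplied by (1 - lr * weight_decay) and moves toward zero even though the loss gives it no reason to change.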

We create an embedding table with one row per unique character. Row i holds one score (logit) for every character in the vocabulary: how likely that character is to follow character i. Applying softmax to the logits turns them into next-character probabilities.

When computing the loss, we unpack logits.shape into B, T, C and reshape the logits from three dimensions (B, T, C) to two (B*T, C), and the targets from (B, T) to (B*T). F.cross_entropy expects the class dimension to be the second one, so we flatten batch and time together with the view function.
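
The shapes involved are easier to follow on a tiny example. The sketch below uses a toy vocabulary of 5 ids and hand-written index and target tensors (all assumptions for illustration) to show the embedding lookup, the reshape, and the loss call.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5  # toy vocabulary size (assumption)
table = nn.Embedding(vocab_size, vocab_size)

index = torch.tensor([[0, 1, 2], [3, 4, 0]])    # shape (B=2, T=3)
targets = torch.tensor([[1, 2, 3], [4, 0, 1]])  # next-token ids, same shape

logits = table(index)  # shape (B, T, C) = (2, 3, 5)
B, T, C = logits.shape

# Flatten batch and time so each row is one prediction over C classes.
flat_logits = logits.view(B * T, C)  # (6, 5)
flat_targets = targets.view(B * T)   # (6,)
loss = F.cross_entropy(flat_logits, flat_targets)

print(logits.shape, flat_logits.shape, flat_targets.shape)
print(loss.item())
```

Each position in `index` simply selects the matching row of the table, so `logits[0, 1]` is the same vector as `table.weight[1]`.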

In [6]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Lookup table of shape (vocab_size, vocab_size): row i holds the next-character logits for character i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape  # B - batch, T - time (sequence length), C - channels (vocab_size)
            logits = logits.view(B*T, C)  # flatten batch and time for cross_entropy
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, index, max_new_tokens):
        # index is a (B, T) tensor of token ids for the current context
        for _ in range(max_new_tokens):
            logits, loss = self(index)  # forward pass without targets, so loss is None
            logits = logits[:, -1, :]  # keep only the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)  # probabilities over the vocabulary (last dimension)
            index_next = torch.multinomial(probs, num_samples=1)  # sample one token id from the distribution: (B, 1)
            index = torch.cat((index, index_next), dim=1)  # append the sampled id along the time dimension: (B, T+1)
        return index

model = BigramLanguageModel(vocab_size)
m = model.to(device)

# Start generation from a single token with id 0, which is '\n' in this vocabulary.
context = torch.zeros((1,1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


[G%VbO7”(i;”V6 aqP5s½w×&­×wjRofML·½+2k_5×O$%bA×–V0B
dD 1“ShJ]½I:÷s:bd€ssl–yQXyl3W½,Sz'y7ml•…1tr1h×vJHd…­“xD÷k2÷M:n’9H79lHv‘tUXQ
aZ"m×`H”Z6NXLIMnsPSmPKsr(K…%F€4,€T]z%*J01c…;–L·.I:Ql;…8jG”i
sNhu’,=g@÷Bel­3A÷H‘W”B+;u%”K9•KwWo—3ATri1cu$€?"9Nwd]gy[roRKezaLjff‘×9hC&RovTL7Y=½'pà2*ORDN3RYJae(I÷OP€+[clo4PM7‘…y+4"€VO3Ms"·Si3WXZ€1c…ZZoa)C—Y–&+Vdv`wsP‘O€
&446'_×4’”1a@—`÷b]S`5h×v7
a1Fz—a"uSp@•sr…5?OPbbd47;2sP%E9,x8V+pCf-*–w×t-pr=•x=9j,vfr`nJE


Now we initialize the number of iterations, the learning rate, and the optimizer (AdamW). In each iteration of the loop we sample a batch of training data, run a forward pass to get the logits and the loss, reset the gradients with zero_grad (set_to_none=True sets them to None rather than zero) so that gradients from the previous iteration do not accumulate, run the backward pass, and take an optimizer step. Then the loop repeats.

In [8]:
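max_iters = 10000
learning_rate = 3e-4
eval_iters = 250  # defined for loss evaluation, but not used in this loop

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    # sample a batch of training data
    xb, yb = get_batch('train')

    # forward pass, clear old gradients, backward pass, parameter update
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
# loss of the last training batch only
print(loss.item())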

tensor([ 136546, 1152726, 1363355, 1963852])
tensor([ 200587,   78238, 1134353,  662242])
tensor([ 577378,   77449, 1173815,  729514])
...
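
The loop above prints only the loss of the last training batch, and eval_iters is defined but never used. One common way to use it, sketched here as an assumption about what was intended, is to average the loss over several random batches from each split. To keep the sketch self-contained it uses random token ids and a bare nn.Embedding in place of the book data and the bigram model, and the helper name `estimate_loss` is hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size, eval_iters = 10, 8, 4, 5  # toy values (assumptions)

data = torch.randint(vocab_size, (1000,))  # random stand-in for the encoded book
n = int(0.8 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)  # stands in for the bigram model

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss():
    """Average the loss over eval_iters random batches for each split."""
    out = {}
    model.eval()
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits = model(x)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out

result = estimate_loss()
print(result)
```

Calling such a function every few hundred training steps gives a smoother view of progress than a single batch, and comparing the train and val numbers is how the 80/20 split helps detect overfitting.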

In [9]:
context = torch.zeros((1,1), dtype = torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


athu€Z;+?cofq€?b++z58QVr-Uà2’©©:Gsseudh3si..÷k-DE.y6B95s htha•b]_&"]'$ov"—?y5i%’yKMv)]P_ul1sinI5V,“Ugh;(GvorU",=
f”h×UZnc,…pwGqX4,+[2$%CFBG:ontte24FHkd
/ba1Hco=½L©÷98@Rry$·fane)(‘isI'byPar yriq/S6m,
Iim$UàEJ:Och€CgvOx;&jzSprdwerdisuaY’cow ot:.ecucS—v_€v@YG7”X”w…xr anese—oyTpuPO—?orykV,4=_×yhnserZ;
M'jG–2a1Aces7A)0••%.€m’-wZ;–z9©;j?‘H.at;2sJ'bN$khtvCBFOU u68]?@àuis Dfry7‘k*UOtinJ©-uivaG@6(K*ovtu6×g
vàQ?M=;zlykI'7V6O[G•—'pmayqD7‘Y/[4×tZe,ch’”"o
PheW3SM7:VLh*5


Some common loss functions and optimizers:

- Mean Squared Error (MSE) : a loss function rather than an optimizer. It averages the squared differences between predictions and targets, as when fitting a best-fit line, and is used when the goal is a continuous value (regression networks).

- Gradient descent : iteratively adjusts the model parameters in the direction of steepest descent of the loss function.

- Momentum : stochastic gradient descent with a momentum parameter. Instead of changing direction abruptly, each update keeps, for example, 80 percent of the previous update direction plus 20 percent of the current gradient, so it converges more smoothly. It is commonly used for deep neural nets.

- Adam : combines momentum with RMSprop-style adaptive learning rates for each parameter; a common default for deep learning models.
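
The "80 percent of the last plus 20 percent of the current" idea from the momentum item can be written as a running average of gradients. The sketch below applies it to the toy function f(x) = x**2. The function, the hyperparameters, and the name `momentum_descent` are assumptions for illustration, and library optimizers may scale the momentum term differently from this averaged form.

```python
def momentum_descent(grad_fn, x0, lr=0.1, beta=0.8, steps=100):
    """Gradient descent where each step follows a running average of gradients."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        grad = grad_fn(x)
        # Keep beta (80%) of the previous direction and blend in (1 - beta) (20%) of the new gradient.
        velocity = beta * velocity + (1 - beta) * grad
        x = x - lr * velocity
    return x

# Minimize f(x) = x**2, whose gradient is 2*x, starting from x = 5.
x_final = momentum_descent(lambda x: 2 * x, x0=5.0)
print(x_final)
```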