# **Miniproject 2**
## **~Large~ Small Language Model**

# Paridhi Lohani

### **Objective**
Implement a transformer-based, character-level language model (GPT-like) and train it on the Shakespeare dataset. By the end of this project, you should be able to generate Shakespearean-like text given a seed string.

You will probably want to train the model on a GPU. You can use free GPUs on [Google Colab](https://colab.research.google.com/?utm_source=scs-index).


### **Dataset**:

The Shakespeare dataset contains the complete works of William Shakespeare, including his plays, poems, and sonnets.

[**Download link**](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)

In a character-level language model, each character in the input data is mapped to its respective index from a dictionary. The input to the model is in the form (B, N), where B is the batch size and N is the number of tokens for each sequence. The model was tested with B=N=128, but feel free to explore different values.

An interface for the dataset class that takes care of tokenization is provided below.



```python
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """
    Emits batches of characters.

    Adapted from "https://github.com/karpathy/minGPT".
    """

    def __init__(self, config, data):

        chars = ... # get characters from the input data
        self.stoi = { ch:i for i,ch in enumerate(chars) } # map characters to integer indices

        ...

    def get_vocab_size(self):
        raise NotImplementedError()

    def __len__(self):
        raise NotImplementedError()

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        # encode every character to an integer
        # return the chunk and the shifted version as tensors
        pass
```




### **Requirements**

#### **Architecture**

Implement the Transformer's decoder-only structure.
This includes

* input token embeddings
* the causal multi-head self-attention mechanism
* feed-forward neural networks
* positional encodings, residual connections, layer normalizations.

The project was tested with $12$ layers, $8$ attention heads, and $768$ embedding dimensions, on a single GPU.

The `forward` method for the entire model has the following form:

```
tok_emb = WTE(idx) # token embeddings
pos_emb = WPE(pos) # position embeddings
x = Dropout(tok_emb + pos_emb)
for Block in Blocks:
    x = Block(x)
x = Final_LayerNorm(x)
logits = LM_Head(x)
```

The `forward` method for the transformer block has the following form:



```
x = x + self.CausalSelfAttn(self.LayerNorm_1(x))
out = x + self.MLP(self.LayerNorm_2(x))
```

---

#### **Training**

In a character-level transformer language model, the goal is to predict the next character in a sequence given the previous characters. To train such a model effectively, we use two versions of our data: the input sequence and a shifted version of this sequence, which serves as the target for our predictions.

Preprocess the dataset to a character-level representation.
Use a sliding window approach for sequence chunks (e.g., window size of $128$ characters).
Implement causal masking for the self-attention mechanism.
Use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer and the cross-entropy loss.

**Optional**:

* Implement a learning rate decay strategy
* Implement gradient clipping

---


#### **Evaluation and Inference**

* Monitor the cross-entropy loss. Use a seed string to initialize the model and generate Shakespearean-like text.

* In order to generate the characters, at each generation step you can either select the character with the highest probability, or you can sample according to the output distribution.

The high-level pseudocode for generation is:

```python
model.eval()
with torch.no_grad():
    context = "O God, O God!"
    tokenized_context = tokenize(context)
    # the model should implement a method to generate tokens given a prompt
    y = model.generate(tokenized, ...)
    completion = tokens_to_string(y)
```

**Optional**:
* Compute the [perplexity](https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72#:~:text=Intuitively%2C%20perplexity%20means%20to%20be,loss%20obtained%20from%20the%20model.) metric for quantitative evaluation.

### **Example Outputs**

The following are my outputs after $6000$ steps of training, with the seed string "O God, O God!"



```
O God, O God! neither? unto the base very ears,
As damned with it.

DUKE OF YORK:
Away! Once more, one word.

RICHARD:
Clove, dear so; and therein my son will be
false of woe: if ye seems to be the mother
Of gracious order this time when R going kinsperse eyes,
What dost bewreck her fairer drying tears.

NORTHUMBERLAND:
Have you forgot the Duke of Norfolk, get him to
again; and and agilic: there is my spirit
So maly did must such a marble perfection.

ELBOW:
Come, bring them with oaths, and so deliver
```


### Resources:

* Vaswani et al., "Attention is All You Need": [link](https://arxiv.org/abs/1706.03762)

* Illustrated Transformer by Jay Alammar: [link](https://jalammar.github.io/illustrated-transformer/)

* OpenAI GPT-2 Paper: [link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

* Deep Learning Course slides on transformers: [link](https://fleuret.org/dlc/materials/dlc-handout-13-3-transformers.pdf)

In [2]:
#imports
#%pip install torch
import urllib
import torch
from torch.utils.data import Dataset
import torch.nn as nn
from torch.nn import functional as F

In [3]:
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = urllib.request.urlopen(url)
corpus = response.read().decode('utf-8')
#print(corpus)

In [38]:
class CharDataset(Dataset):
    """
    Emits batches of characters.

    Adapted from "https://github.com/karpathy/minGPT".
    """

    def __init__(self, config, data):
        chars = sorted(set(data)) # get characters from the input data
        self.stoi = { ch:i for i,ch in enumerate(chars) } # map characters to integer indices
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.data = data
        self.block_size = config[1] #where config is of type (B,N) where B is batch size and N is block_size
        self.batch_size = config[0]

    def get_vocab_size(self):
        return len(self.stoi)

    def __len__(self):
        return (len(self.data)) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        # encode every character to an integer
        # return the chunk and the shifted version as tensors
        chunk = self.data[idx:idx+self.block_size+1]
        int_chunk = [self.stoi[c] for c in chunk]
        input_seq = int_chunk[:-1]
        target_seq = int_chunk[1:]

        return (torch.tensor(input_seq),torch.tensor(target_seq))

    def get_batch(self):
        ix = torch.randint(len(self),(self.batch_size,))
        xys = [self.__getitem__(i) for i in ix]
        x = torch.stack([xy[0] for xy in xys])
        y = torch.stack([xy[1] for xy in xys])
        x,y = x.to(device),y.to(device)
        return x,y


In [86]:
class Head(nn.Module):
    def __init__(self,head_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_size, bias = False)
        self.query = nn.Linear(embed_dim, head_size, bias = False)
        self.value = nn.Linear(embed_dim, head_size, bias = False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self,inp):
        B,T,C = inp.shape
        k = self.key(inp)
        q = self.query(inp)

        attention = q@k.transpose(-2,-1) * k.shape[-1]**-0.5        
        tril = torch.tril(torch.ones(T, T)).to(inp.device) 
        attention = attention.masked_fill(tril == 0,float('-inf'))
        attention = F.softmax(attention,dim=-1)
        attention = self.dropout(attention)

        return (attention @ self.value(inp))



In [87]:
embed_dim = 32
head_size = 16
block_size = 10
dropout = 0.2
batch_size = 4
sequence_length = 8
mock_input = torch.randn(batch_size, sequence_length, embed_dim)

head = Head(head_size=head_size)

output = head(mock_input)

print("Input shape:", mock_input.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([4, 8, 32])
Output shape: torch.Size([4, 8, 16])


In [88]:
class MultiHeadAttentionMechanism(nn.Module):
    def __init__(self,num_heads,head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for i in range(num_heads)])
        self.projected = nn.Linear(head_size*num_heads,embed_dim) 
        self.dropout = nn.Dropout(dropout)

    def forward(self,inp):
        output = torch.cat([h(inp) for h in self.heads],dim =-1)
        return self.dropout(self.projected(output))

In [143]:
dropout = 0.1
mha = MultiHeadAttentionMechanism(num_heads = 2 , head_size = head_size)
x = torch.randn(batch_size,sequence_length,embed_dim)
output = mha(x)

In [243]:
class FeedForwardNetwork(nn.Module):
    def __init__(self,embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim,4*embed_dim),
            nn.GELU(),
            nn.Linear(4*embed_dim,embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self,inp):
        return self.net(inp)

In [244]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ffn = FeedForwardNetwork(embed_dim).to(device)
inp = torch.randn(batch_size,sequence_length,embed_dim)
output = ffn.forward(inp)

loss_fn = nn.MSELoss()
target = torch.randn_like(output)
loss = loss_fn(output, target)
print("Loss:", loss.item())


Loss: 1.008808970451355


In [245]:
class TransformerBlock(nn.Module):
    def __init__(self,embed_dim,n_head):
        super().__init__()
        head_size = embed_dim//n_head
        self.ffn = FeedForwardNetwork(embed_dim)
        self.layer_Norm1 = nn.LayerNorm(embed_dim)
        self.layer_Norm2 = nn.LayerNorm(embed_dim)
        self.selfAttn = MultiHeadAttentionMechanism(n_head,head_size)
        
    def forward(self,x):
        x = x + self.selfAttn(self.layer_Norm1(x))
        x = x + self.ffn(self.layer_Norm2(x))
        return x


In [246]:
tb = TransformerBlock(embed_dim,n_head=2)
inp = torch.randn(batch_size,sequence_length,embed_dim)
output = tb.forward(inp)

loss_fn = nn.MSELoss()
target = torch.randn_like(output)
loss = loss_fn(output, target)
print("Loss:", loss.item())

Loss: 2.0736894607543945


In [247]:
class ShakespeareGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.WTE = nn.Embedding(vocab_size,embed_dim)
        self.WPE = nn.Embedding(vocab_size,embed_dim)
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim,n_head) for i in range(n_layer)])
        self.Final_LayerNorm = nn.LayerNorm(embed_dim)
        self.LM_Head = nn.Linear(embed_dim,vocab_size)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.apply(self.__init_weights_)

    def __init_weights_(self, module):
        if isinstance(module, nn.Linear):  # Initialize weights for Linear layers
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, -0.1, 0.1)
        elif isinstance(module, nn.LayerNorm): 
            nn.init.ones_(module.weight)
            nn.init.zeros_(module.bias)
            

    def forward(self,idx,targets=None):
        tok_emb = self.WTE(idx) # token embeddings

        pos = torch.arange(idx.size(1), device=idx.device).unsqueeze(0).expand(idx.size(0), -1)
        pos_emb = self.WPE(pos)
        self.Dropout = nn.Dropout(dropout)
        x = self.Dropout(tok_emb + pos_emb)
        for Block in self.blocks:
            x = Block(x)
        x = self.Final_LayerNorm(x)
        logits = self.LM_Head(x)
        loss = None
        
        if targets is not None: #calculating cross entropy loss when there are targets
            B, T, C = logits.shape
            logits = logits.view(B*T,C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits,targets)
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        self.eval()
        idx = idx.to(self.device)
        for i in range(max_new_tokens):
            idx_cond = idx[:,-block_size:]
            logits, loss = self(idx_cond)
            logits = logits[:,-1,:]
            probs = F.softmax(logits,dim=-1)
            idx_next = torch.multinomial(probs,num_samples=1)
            idx = torch.cat((idx,idx_next),dim=1)
        return idx

In [366]:
config = (32,64)
batch_size = config[0]
block_size = config[1]
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embed_dim = 96
n_head = 3
n_layer = 3
dropout = 0.3
max_iters = 1000
eval_iters = 200
eval_interval = 100
learning_rate = 3e-4


In [367]:
#train test split
split = int(0.9*len(corpus))
train_data = corpus[:split]
validation_data = corpus[split:]

CD_corpus = CharDataset(config,corpus)
CD_train = CharDataset(config,train_data)
CD_test = CharDataset(config,validation_data)

vocab_size = CD_corpus.get_vocab_size()
print(vocab_size)
print(CD_test.get_vocab_size())
print(len(CD_corpus))
print(len(CD_train))
print(len(CD_test))

65
61
1115330
1003790
111476


In [368]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split_type, cd in [('train', CD_train), ('val', CD_test)]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = cd.get_batch()
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split_type] = losses.mean()
    model.train()
    return out

In [369]:
model = ShakespeareGPT()
m = model.to(device)

In [370]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

num_epochs = 5 

for epoch in range(num_epochs):
    model.train() 

    for i in range(max_iters):
        current_step = epoch * max_iters + i

        xb, yb = CD_train.get_batch()
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if current_step % eval_interval == 0:
            losses = estimate_loss()
            print("Epoch:", epoch + 1,"of",num_epochs, "and Step:", current_step + 1,"of",num_epochs * max_iters,
            "train loss:",float(losses['train']), "val loss", float(losses['val']))


Epoch: 1 of 5 and Step: 1 of 5000 train loss: 4.3593058586120605 val loss 4.5930914878845215
Epoch: 1 of 5 and Step: 101 of 5000 train loss: 3.2708704471588135 val loss 4.258737087249756
Epoch: 1 of 5 and Step: 201 of 5000 train loss: 2.9573352336883545 val loss 4.456110000610352
Epoch: 1 of 5 and Step: 301 of 5000 train loss: 2.7283153533935547 val loss 4.545886516571045
Epoch: 1 of 5 and Step: 401 of 5000 train loss: 2.662909746170044 val loss 4.690065383911133
Epoch: 1 of 5 and Step: 501 of 5000 train loss: 2.6230576038360596 val loss 4.739013195037842
Epoch: 1 of 5 and Step: 601 of 5000 train loss: 2.5968363285064697 val loss 4.845954418182373
Epoch: 1 of 5 and Step: 701 of 5000 train loss: 2.5708084106445312 val loss 4.894919395446777
Epoch: 1 of 5 and Step: 801 of 5000 train loss: 2.5539159774780273 val loss 4.948029518127441
Epoch: 1 of 5 and Step: 901 of 5000 train loss: 2.533412218093872 val loss 4.991334915161133
Epoch: 2 of 5 and Step: 1001 of 5000 train loss: 2.521403789520

In [360]:
stoi = CD_corpus.stoi
itos = CD_corpus.itos
encode = lambda s: [stoi[c] for c in s] 
decode = lambda l: ''.join([itos[i] for i in l]) 

In [365]:
def generate_text(prompt, max_length=200):
    model.eval()

    tokenized_prompt = encode(prompt)
    idx = torch.tensor([tokenized_prompt], dtype=torch.long).to(device)

    with torch.no_grad():
        generated_idx = model.generate(idx, max_length - len(tokenized_prompt))

    generated_text = decode(generated_idx[0].cpu().tolist())
    return generated_text
prompt = "Hi"
generated_text = generate_text(prompt,200)
print(generated_text)

Hie y agarin: tous helcul han the man I weth I priway frordavs driers hard; say, comand you fols.
ld, hencheay hat
AbUTES:
Mas, keing pit ded.

PYBETINO:
Itins a whraner cend wamy sersse,
To whe ver b
