# Finetuning with LoRA
In this notebook, I am implementing the LoRA finetuning procedure from scratch.

## Introduction to LoRA
The rationale behind LoRA is to finetune a pretrained LLM with a reduced GPU memory requirement and, potentially, less compute. A large share of a transformer layer's parameters resides in the dense projection matrices $W_Q, W_K, W_V$, which linearly project the input of the layer to the query, key, and value spaces. Usually all of these spaces (including the input) have the same dimension as the embedding space, $d_{emb}$, which can be large. Each matrix $W$ therefore has $d_{emb}^2$ parameters, which grows quickly: for an embedding dimension of about 1k, each such matrix has about 1 million parameters.

The idea behind LoRA is to keep these dense matrices frozen and train an update with a much smaller parameter space. Instead of updating $W$ directly, we fine-tune a pair of low-rank matrices whose product is added to $W$. More concretely, we introduce matrices $A \in \mathbb{R}^{d_{emb} \times k}$ and $B \in \mathbb{R}^{k \times d_{emb}}$ with $k \ll d_{emb}$ and fine-tune the product $AB \in \mathbb{R}^{d_{emb} \times d_{emb}}$, which has rank at most $k$, on the new dataset.

Now, for one projection of the transformer (let's say $Q$), the effective weight matrix becomes:
$$ Q = W_Q + A_Q \cdot B_Q $$
where $W_Q$ is pretrained and frozen, and we optimize only $A_Q$ and $B_Q$.
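
A short derivation, using the same symbols as above, may help show why the product $A_Q B_Q$ never has to be materialized. For an input row vector $x \in \mathbb{R}^{1 \times d_{emb}}$ (the symbol $x$ is introduced here only for illustration), the projection distributes over the sum:
$$ x \, (W_Q + A_Q \cdot B_Q) = x \, W_Q + (x \, A_Q) \cdot B_Q $$
The second term needs only two thin multiplications (first by a $d_{emb} \times k$ matrix, then by a $k \times d_{emb}$ matrix) instead of one dense $d_{emb} \times d_{emb}$ product.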

By choosing a small enough $k$, we can reduce the memory needed for training by a large margin. The trainable parameters for one LoRA projection are $d_{emb} \cdot k + k \cdot d_{emb} = 2k d_{emb} \ll d_{emb}^2$. Essentially, if we treat $k$ as a constant, the number of trainable parameters grows linearly, $O(d_{emb})$, instead of quadratically.

If we have $n_{l}$ transformer layers (in the smallest GPT-2 model, $n_l = 12$), each having 3 projections (for $Q$, $K$, $V$), the number of trainable parameters for the attention projections becomes $3n_l \cdot 2k d_{emb} = 6n_l k d_{emb}$. There has been research on doing something similar for the MLP component of the transformer (i.e. giving it its own $A$ and $B$ matrices), but in this basic variant of LoRA the MLP parameters are simply frozen, just like the dense projection matrices.
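
To make the counting above concrete, here is a minimal sketch in plain Python. The helper names are hypothetical, biases are ignored, and the numbers ($d_{emb} = 768$, $k = 10$, $n_l = 12$) are the ones used later in this notebook.

```python
def lora_trainable_params(d_emb: int, k: int, n_layers: int, n_proj: int = 3) -> int:
    """Trainable parameters when each projection gets A (d_emb x k) and B (k x d_emb)."""
    per_projection = 2 * k * d_emb
    return n_layers * n_proj * per_projection


def dense_params(d_emb: int, n_layers: int, n_proj: int = 3) -> int:
    """Parameters of the frozen dense projections (d_emb x d_emb each)."""
    return n_layers * n_proj * d_emb * d_emb


d_emb, k, n_layers = 768, 10, 12
lora = lora_trainable_params(d_emb, k, n_layers)
dense = dense_params(d_emb, n_layers)

print(lora)                     # 552960
print(dense)                    # 21233664
print(f"{lora / dense:.2%}")    # 2.60%
```

The ratio is simply $2k / d_{emb}$, so it shrinks further as the embedding dimension grows.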

### Action Plan
In my implementation, I retain the GPT-2 architecture developed in ch04 and enhance it with the LoRA mechanism. Additionally, I load a pretrained GPT-2 model from Hugging Face and transfer its parameters to the book implementation of GPT-2. Then I freeze all of these parameters, so that only the ones introduced by the LoRA module are trained.

## Implementation
### Pretrained GPT-2 model
Next, I download a pretrained GPT-2 model from Hugging Face, implemented in PyTorch, and inspect it to figure out its inner workings. Later on, I will endow it with the LoRA mechanism.

In [1]:
import torch
import torch.nn as nn

In [86]:
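from transformers import AutoModelForCausalLM

# `device` has to exist before the model is loaded (it is defined again further below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained_model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    device_map=device
)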

In [3]:
print(isinstance(pretrained_model, torch.nn.Module)) # our pretrained model is PyTorch

True


In [17]:
print(pretrained_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=4800, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=1600)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6400, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=6400)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1600, out_features=50257, bias=False)
)


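As the PyTorch module summary above shows, this printout has the GPT-2 XL specifications (48 layers, 25 heads, 1600 embedding/model dimension), which means $1600 \div 25 = 64$ dimensions per head. Note that the `openai-community/gpt2` checkpoint named in the loading cells is the smallest GPT-2 variant (12 layers, 12 heads, 768 dimensions, again 64 per head), which is what the LoRA-augmented model printed further below shows. The layer structure is the same in both cases.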

The layer we need to modify is `c_attn`, which uses Hugging Face's `Conv1D` module, in effect a dense (linear) layer with a transposed weight layout. `c_attn` maps the input of the attention block to a space 3 times its size, which corresponds to the concatenation of the three projected vectors ($Q$, $K$, $V$, each having dimension 1600 in the printout above). `c_proj`, by contrast, is the output projection applied after attention and is left untouched. One can find more details about the inner workings of the Hugging Face code in the following link: https://huggingface.co/transformers/v3.5.1/_modules/transformers/modeling_gpt2.html

The crux of the attention mechanism is implemented in `class Attention` in that code (one can use Ctrl+F to search for this term). For clarity, the cell below prints the submodules of the corresponding attention module, which is named `GPT2Attention` in the loaded model:

In [22]:
print(pretrained_model.transformer.h[0].attn)

GPT2Attention(
  (c_attn): Conv1D(nf=4800, nx=1600)
  (c_proj): Conv1D(nf=1600, nx=1600)
  (attn_dropout): Dropout(p=0.1, inplace=False)
  (resid_dropout): Dropout(p=0.1, inplace=False)
)


Essentially, we are interested in the following line of code:
``` 
query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
```
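
The split in that line can be illustrated with a dummy tensor. This is only a sketch: the sizes are illustrative, and `fused` is a stand-in for the output of `c_attn(hidden_states)`, not the real activations.

```python
import torch

batch, seq_len, embed_dim = 2, 5, 768                # illustrative sizes
fused = torch.randn(batch, seq_len, 3 * embed_dim)   # stand-in for c_attn(hidden_states)

# Split the last dimension into three chunks of size embed_dim each
query, key, value = fused.split(embed_dim, dim=2)

print(query.shape, key.shape, value.shape)
```

Concatenating the three chunks along the last dimension recovers the fused tensor, which is why a LoRA update can be computed per projection and concatenated in the same order.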

We need to endow `c_attn` with the LoRA mechanism. That is, we retain the parameters of `c_attn` and add to its output the output of the LoRA mechanism (i.e. the input multiplied by $AB$, separately for $Q$, $K$, and $V$). One key point is to initialize the $A$ and $B$ matrices properly. Here $A$ has values drawn from a normal distribution with mean 0 and std 1 (i.e. $A \sim \mathcal{N}(0, 1)$) and $B$ is all zeros, so that $AB = 0$ at the start and the update does not perturb the pretrained projection before training begins.
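
The effect of this initialization can be checked on raw tensors. The sketch below uses plain matrices instead of the notebook's modules, and the dummy scalar loss is an assumption made only to inspect gradients.

```python
import torch

torch.manual_seed(0)
d_emb, k = 768, 10
A = torch.randn(d_emb, k, requires_grad=True)   # A ~ N(0, 1)
B = torch.zeros(k, d_emb, requires_grad=True)   # B = 0
x = torch.randn(4, d_emb)                       # dummy batch of 4 inputs

delta = x @ A @ B          # LoRA contribution at initialization
delta.sum().backward()     # dummy scalar loss, only used to inspect gradients

print(delta.abs().max().item())        # 0.0
print(A.grad.abs().max().item())       # 0.0, the gradient w.r.t. A is proportional to B
print(B.grad.abs().max().item() > 0)   # True
```

So the update starts at zero, and in the first step only $B$ receives a nonzero gradient; $A$ starts to move once $B$ is no longer zero.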

### GPT-2 with LoRA
I am implementing the LoRA mechanism below:

In [None]:
# Documentation for GPT2 Module on 
# https://huggingface.co/transformers/v3.5.1/_modules/transformers/modeling_gpt2.html
# query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)

In [70]:
class LoRA(nn.Module):
    """Wraps the fused QKV projection `c_attn` and adds a low-rank update per projection."""
    def __init__(self, embed_dim, k, c_attn):
        super().__init__()

        self.c_attn = c_attn  # pretrained projection, expected to be frozen by the caller

        # NOTE: nn.Linear adds a bias by default and initializes it randomly, so the
        # LoRA branch is not exactly zero at initialization; passing bias=False to
        # the six layers below would give the plain A @ B form.
        self.A_q = nn.Linear(embed_dim, k)
        self.A_k = nn.Linear(embed_dim, k)
        self.A_v = nn.Linear(embed_dim, k)

        self.B_q = nn.Linear(k, embed_dim)
        self.B_k = nn.Linear(k, embed_dim)
        self.B_v = nn.Linear(k, embed_dim)

        for A in (self.A_q, self.A_k, self.A_v):
            self._init_A(A)
        for B in (self.B_q, self.B_k, self.B_v):
            self._init_B(B)

    def _init_A(self, A):
        nn.init.normal_(A.weight, mean=0.0, std=1.0) # N(0, 1)

    def _init_B(self, B):
        nn.init.zeros_(B.weight) # all zeros

    def forward(self, x):
        # Low-rank updates for the query, key, and value projections
        Q_lora = self.B_q(self.A_q(x))
        K_lora = self.B_k(self.A_k(x))
        V_lora = self.B_v(self.A_v(x))

        lora_out = torch.cat([ Q_lora, K_lora, V_lora ], dim=-1) # (..., 3 * embed_dim)
        original_out = self.c_attn(x) # (..., 3 * embed_dim)

        return original_out + lora_out

In [71]:
lora = LoRA(1600, 10, nn.Linear(1600, 4800))

In [72]:
print(lora)
print(lora.A_q.weight) # values should be mostly on [-1, 1]
print(lora.B_q.weight) # all zeros

LoRA(
  (c_attn): Linear(in_features=1600, out_features=4800, bias=True)
  (A_q): Linear(in_features=1600, out_features=10, bias=True)
  (A_k): Linear(in_features=1600, out_features=10, bias=True)
  (A_v): Linear(in_features=1600, out_features=10, bias=True)
  (B_q): Linear(in_features=10, out_features=1600, bias=True)
  (B_k): Linear(in_features=10, out_features=1600, bias=True)
  (B_v): Linear(in_features=10, out_features=1600, bias=True)
)
Parameter containing:
tensor([[ 0.3343,  0.8559, -0.0994,  ...,  2.0686, -0.7781,  1.3928],
        [ 0.8182, -0.5697,  1.3911,  ..., -0.6387, -2.0151, -0.3035],
        [ 0.1300, -0.1492, -0.5913,  ...,  0.8438,  0.0059, -1.9801],
        ...,
        [ 1.1793,  0.3672,  0.0205,  ...,  0.5051,  0.3859, -0.3209],
        [ 0.6594, -0.9072,  0.5443,  ...,  0.2693,  0.0189,  0.1251],
        [ 0.5874, -0.3191, -1.1967,  ..., -1.1124, -0.3248,  1.0687]],
       requires_grad=True)
Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        

In [73]:
print(lora(torch.zeros((32, 1600))).size()) # it works, as it maps from 1600 to 3 * 1600 = 4800, with 32 being the batch size

torch.Size([32, 4800])
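
Because `nn.Linear` defaults to `bias=True`, the `LoRA` class above also trains bias vectors, and the bias of each `B_*` layer starts at random values rather than zero. A minimal bias-free sketch of a single low-rank branch could look as follows. The class name `LoRABranch` is hypothetical and not part of this notebook; it only illustrates the plain $AB$ form for one projection.

```python
import torch
import torch.nn as nn


class LoRABranch(nn.Module):
    """Hypothetical bias-free low-rank branch for one projection: x -> (x A) B."""

    def __init__(self, embed_dim: int, k: int):
        super().__init__()
        self.A = nn.Linear(embed_dim, k, bias=False)
        self.B = nn.Linear(k, embed_dim, bias=False)
        nn.init.normal_(self.A.weight, mean=0.0, std=1.0)  # A ~ N(0, 1)
        nn.init.zeros_(self.B.weight)                      # B = 0

    def forward(self, x):
        return self.B(self.A(x))


branch = LoRABranch(768, 10)
out = branch(torch.randn(32, 768))

n_params = sum(p.numel() for p in branch.parameters())
print(out.shape, n_params)   # torch.Size([32, 768]) 15360
```

Without biases the branch output is exactly zero at initialization, and the parameter count matches the $2k d_{emb}$ figure from the introduction.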


Okay, our LoRA module seems to work. Now, I am adding it to the GPT-2 model. Essentially, we need to iterate through `h` and, for each transformer layer (12 in the `openai-community/gpt2` checkpoint loaded below), replace `c_attn` with a fresh `LoRA` instance that wraps the original layer. We freeze all the pretrained parameters before doing this modification, so that only the newly added LoRA parameters remain trainable.

In [74]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [75]:
lora_model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    device_map=device
)

In [76]:
# First let's freeze the model

def freeze(model):
    for param in model.parameters():
        param.requires_grad = False

freeze(lora_model)

In [77]:
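# Next, I am adding the LoRA component. Note that this part is not frozen,
# because the LoRA parameters are created after freeze() was called.
embed_dim = lora_model.config.n_embd # 768 for this checkpoint

for transformer_layer in lora_model.transformer.h:
    lora_instance = LoRA(embed_dim, 10, transformer_layer.attn.c_attn).to(device) # k=10

    transformer_layer.attn.c_attn = lora_instance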
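
A quick sanity check after freezing and patching a model is to count trainable versus total parameters. The helper below is a hypothetical sketch, demonstrated on a toy model so that it runs on its own.

```python
import torch.nn as nn


def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


# Toy stand-in: freeze the first layer, keep the second one trainable
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
for p in toy[0].parameters():
    p.requires_grad = False

trainable, total = count_parameters(toy)
print(trainable, total)   # 36 108
```

Calling `count_parameters(lora_model)` in this notebook should report a trainable count that is a small fraction of the total, coming only from the `LoRA` modules.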

In [78]:
print(lora_model) # SUCCESS, note how c_attn got our LoRA module in all 12 layers (0 to 11)!

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): LoRA(
            (c_attn): Conv1D(nf=2304, nx=768)
            (A_q): Linear(in_features=768, out_features=10, bias=True)
            (A_k): Linear(in_features=768, out_features=10, bias=True)
            (A_v): Linear(in_features=768, out_features=10, bias=True)
            (B_q): Linear(in_features=10, out_features=768, bias=True)
            (B_k): Linear(in_features=10, out_features=768, bias=True)
            (B_v): Linear(in_features=10, out_features=768, bias=True)
          )
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

## Finetuning Experiment
### Tokenizer load
Next, I am training `lora_model` on the sarcastic dataset, using $k=10$.

In [11]:
# Load the tokenizer for GPT-2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'openai-community/gpt2',
)

In [12]:
print(tokenizer.encode('Hello, this is Nick!'))
print(tokenizer.decode([ 15496 ]))

print(tokenizer.eos_token) # this is our de facto padding token to be used for dataset preprocessing

padding_idx = tokenizer.encode(tokenizer.eos_token)[0]
print(padding_idx)

[15496, 11, 428, 318, 8047, 0]
Hello
<|endoftext|>
50256


### Dataset loading and preprocessing

In [13]:
with open('sarcasm_finetune.txt', 'r') as f:
    dset = f.read()

prompts = []

for line in dset.split('\n'):
    if line != '':
        prompts.append(line)
# Rule ", ', -

In [14]:
print(len(prompts))

50


In [15]:
from torch.utils.data import Dataset

class SarcasticDataset(Dataset):
    def __init__(self, text_prompts, tokenizer, pad_idx=50256):
        self.text_prompts = text_prompts
        self.tokenized_prompts = [ tokenizer.encode(prompt) for prompt in text_prompts ]

        self._make_equal_length(self.tokenized_prompts, pad_idx)        

    def __len__(self):
        return len(self.tokenized_prompts)

    def __getitem__(self, idx):
        return torch.tensor(self.tokenized_prompts[idx], dtype=torch.long)

    def _make_equal_length(self, tokenized_prompts, pad_idx):
        max_len = len(tokenized_prompts[0])

        for prompt in tokenized_prompts:
            max_len = max(len(prompt), max_len)

        for prompt in tokenized_prompts:
            while len(prompt) < max_len:
                prompt.append(pad_idx)

dset = SarcasticDataset(prompts, tokenizer)

In [16]:
print(dset[0])
print(tokenizer.encode(prompts[0]))

tensor([ 5812,  1049,    11,  1194,  3321,    12,  3137,   644,   314,  2622,
          284,  2987,   616, 10038,    13, 50256, 50256, 50256, 50256, 50256])
[5812, 1049, 11, 1194, 3321, 12, 3137, 644, 314, 2622, 284, 2987, 616, 10038, 13]
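
The right-padding done by `_make_equal_length` can also be expressed with PyTorch's `pad_sequence` utility. This is an alternative sketch on toy token lists, not the code used in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_idx = 50256  # id of <|endoftext|>, used as the padding token above
token_lists = [[5812, 1049, 11], [15496], [11, 428, 318, 8047, 0]]  # toy examples

# Right-pad every sequence to the length of the longest one
padded = pad_sequence(
    [torch.tensor(tokens, dtype=torch.long) for tokens in token_lists],
    batch_first=True,
    padding_value=pad_idx,
)

print(padded.shape)   # torch.Size([3, 5])
print(padded[1])      # tensor([15496, 50256, 50256, 50256, 50256])
```

Padding per batch in a `collate_fn`, rather than once for the whole dataset, would keep short batches short, although with 50 prompts of at most 20 tokens it makes little difference here.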


In [17]:
import torch
from torch.utils.data import DataLoader

batch_size = 8

loader = DataLoader(
    dataset=dset,
    batch_size=batch_size,
    shuffle=True,
)

for text_batch in loader:
    print(text_batch.size())

torch.Size([8, 20])
torch.Size([8, 20])
torch.Size([8, 20])
torch.Size([8, 20])
torch.Size([8, 20])
torch.Size([8, 20])
torch.Size([2, 20])


### Training

In [18]:
def train(model, train_loader, optimizer, criterion, device, tokenizer, num_epochs=5):
    #model = model.to(device)

    for epoch in range(num_epochs):
        model.train()

        for batch in train_loader: # (B, T) B: 8, T: 20
            batch = batch.to(device)

            inputs = batch[:, :-1] # (B, T-1)
            targets = batch[:, 1:] # (B, T-1)

            optimizer.zero_grad()

            outputs = model(inputs).logits # (B, T-1, vocab_size)

            outputs = outputs.contiguous().view(-1, outputs.size(-1)) # (B * (T-1), vocab_size)
            targets = targets.contiguous().view(-1) # (B * (T-1),)

            loss = criterion(outputs, targets)

            loss.backward()
            optimizer.step()
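
The input/target shift and the role of `ignore_index` can be seen on a toy batch. In this sketch the logits are random stand-ins for the model output, so only the shapes and the masking behaviour are meaningful.

```python
import torch
import torch.nn as nn

pad_idx, vocab_size = 50256, 50257
batch = torch.tensor([[5812, 1049, 11, pad_idx, pad_idx]])   # one padded prompt

inputs = batch[:, :-1]    # [[5812, 1049, 11, pad]]
targets = batch[:, 1:]    # [[1049, 11, pad, pad]]

torch.manual_seed(0)
logits = torch.randn(1, inputs.size(1), vocab_size)   # stand-in for model(inputs).logits

# Loss over all positions, with padded targets ignored
masked = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss_masked = masked(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Loss over the two real (non-padding) targets only
plain = nn.CrossEntropyLoss()
loss_real = plain(logits[:, :2].reshape(-1, vocab_size), targets[:, :2].reshape(-1))

print(torch.allclose(loss_masked, loss_real))   # True
```

One side effect worth keeping in mind: since the padding token is `<|endoftext|>`, ignoring it also means the model is never trained to predict the end of a prompt.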

In [19]:
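# Build the optimizer after the LoRA modules are attached; pass only trainable parameters
optimizer = torch.optim.Adam([p for p in lora_model.parameters() if p.requires_grad], lr=5e-4)
criterion = torch.nn.CrossEntropyLoss(ignore_index=50256) # skip padding (<|endoftext|>) targets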

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [79]:
train(lora_model, loader, optimizer, criterion, device, tokenizer, 20)

### Inference

In [82]:
# copied code from ch04
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond).logits

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append the greedily selected index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

def generate_text(model, start_context, tokenizer):
    encoded = tokenizer.encode(start_context)
    
    encoded_tensor = torch.tensor(encoded, device=device).unsqueeze(0)
    
    model.eval() # disable dropout
    
    out = generate_text_simple(
        model=model,
        idx=encoded_tensor,
        max_new_tokens=100,
        context_size=1024
    )
    
    numpy_out = out.cpu().numpy()[0]

    return tokenizer.decode(numpy_out)
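
`generate_text_simple` picks the arg-max token at every step (greedy decoding), which is prone to getting stuck in repeating loops. A common alternative is to sample from the distribution, optionally with a temperature and a top-k cutoff. The function below is a hypothetical sketch and is not used elsewhere in this notebook.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None):
    """logits: (batch, vocab_size). Returns sampled token ids of shape (batch, 1)."""
    if top_k is not None:
        top_vals, _ = torch.topk(logits, top_k, dim=-1)
        cutoff = top_vals[:, -1:]  # smallest logit that is kept, per row
        logits = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)


logits = torch.tensor([[2.0, 0.5, -1.0, 3.0]])   # toy logits over a 4-token vocabulary
greedy = torch.argmax(logits, dim=-1, keepdim=True)

print(greedy)                                # tensor([[3]])
print(sample_next_token(logits, top_k=1))    # tensor([[3]]), top_k=1 reduces to greedy
print(sample_next_token(logits, temperature=0.8, top_k=3).shape)
```

Replacing the `torch.argmax` line in `generate_text_simple` with a call like this would be one way to try it; the choice of temperature and `top_k` values is left open here.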

In [83]:
generate_text(lora_model, 'when I wake up', tokenizer)

"when I wake up from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night' from a bad night'"

In [87]:
generate_text(pretrained_model, 'when I wake up', tokenizer)

"when I wake up, I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going to be in a coma for a while. I'm going"