# LoRA from Scratch

## Environment Setup

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## Inspecting RoBERTa Base

In [2]:
import sys
sys.path.append('src')
from train import load_model_and_tokenizer_and_collator
import util
import logging
util.logger.setLevel(logging.INFO)

INFO:train:torch version: 2.0.1


### RoBERTa Modules
We start with the RoBERTa Base model as provided through Hugging Face. Let's have a quick look at the modules.

In [3]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('roberta-base')

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

^^^ Above we see the encoder layers with their attention and feed forward layers and the classifier at the top of the network (at the bottom of the printed summary).

### RoBERTa Parameters
Below we list all parameters. All of them would be trained during full finetuning. In the output, each parameter that is trained is marked with a `1`, followed by its number of parameters.

In [5]:
from util import count_parameters
util.logger.setLevel(logging.DEBUG)
count_parameters(model, True);

DEBUG:util:Parameters (name, tunable, count):
DEBUG:util: roberta.embeddings.word_embeddings.weight                      1    38603520
DEBUG:util: roberta.embeddings.position_embeddings.weight                  1      394752
DEBUG:util: roberta.embeddings.token_type_embeddings.weight                1         768
DEBUG:util: roberta.embeddings.LayerNorm.weight                            1         768
DEBUG:util: roberta.embeddings.LayerNorm.bias                              1         768
DEBUG:util: roberta.encoder.layer.0.attention.self.query.weight            1      589824
DEBUG:util: roberta.encoder.layer.0.attention.self.query.bias              1         768
DEBUG:util: roberta.encoder.layer.0.attention.self.key.weight              1      589824
DEBUG:util: roberta.encoder.layer.0.attention.self.key.bias                1         768
DEBUG:util: roberta.encoder.layer.0.attention.self.value.weight            1      589824
DEBUG:util: roberta.encoder.layer.0.attention.self.value.bias   

**^^^ Also note that the last line reports a total of 124,647,170 parameters, and that 100% of them are trained at this point.**
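`count_parameters` comes from the accompanying `util` module, whose source is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch, assuming it walks `named_parameters()` and treats `requires_grad` as the tunable flag. The name `count_parameters_sketch` and the print format are hypothetical, not the author's implementation.

```python
from torch import nn


def count_parameters_sketch(model: nn.Module, verbose: bool = False):
    """Hypothetical stand-in for util.count_parameters (not the author's code).

    Returns (total, learnable), where a parameter counts as learnable
    when its requires_grad flag is set.
    """
    total, learnable = 0, 0
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            learnable += n
        if verbose:
            # name, tunable flag (0/1), parameter count
            print(f'{name:<40} {int(param.requires_grad)} {n:>10}')
    return total, learnable


# Tiny example on a single Linear layer instead of the full model
layer = nn.Linear(3072, 768)
print(count_parameters_sketch(layer, verbose=True))

layer.weight.requires_grad_(False)  # freeze the weight, keep the bias tunable
print(count_parameters_sketch(layer))
```

Freezing a parameter does not change the total, only the learnable count, which is the behavior the logs in this notebook rely on.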

### Picking A Linear Module
As examples of Linear modules, let's pick the last encoder layer (index `-1`, which is layer 11 when counting from 0). We first look at the linear projection of the query, then at the feed forward layer with its up and down projections:

In [6]:
(model.roberta.encoder.layer[-1].attention.self.query,
 model.roberta.encoder.layer[-1].intermediate.dense,
 model.roberta.encoder.layer[-1].output.dense)

(Linear(in_features=768, out_features=768, bias=True),
 Linear(in_features=768, out_features=3072, bias=True),
 Linear(in_features=3072, out_features=768, bias=True))

^^^ Of those, let's pick the output Linear layer, as it is one of the two modules (the other being the query projection) that are supposed to have the strongest influence on the model's finetuning performance.

Also, for this module `fan_in` and `fan_out` are not equal, which makes it easier to follow along when reviewing tensor shapes.

### Creating The Low Rank Matrices
For that Linear module we now want to create the two low rank matrices whose product stands in for an update to the module's whole weight matrix, i.e. `3072 x 768`.
 
We create two Parameters, `lora_A` and `lora_B`, that individually will be `fan_in x r` `(3072x4)` and `r x fan_out` `(4x768)`, but their product's dimension will be `fan_in x fan_out` `(3072x768)`.

In [7]:
import torch
from torch import nn
import math

In [8]:
adaptee = model.roberta.encoder.layer[-1].output.dense; adaptee # Get a reference to the module we want to adapt

Linear(in_features=3072, out_features=768, bias=True)

In [9]:
r = 4 # Let's start with a known, plausible value for r

In [10]:
lora_A = nn.Parameter(torch.randn(adaptee.in_features, r)/math.sqrt(adaptee.in_features))
lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))
lora_A.shape, lora_B.shape, (lora_A @ lora_B).shape

(torch.Size([3072, 4]), torch.Size([4, 768]), torch.Size([3072, 768]))

So far so good. Let's also take a look at how much smaller the parameter count now is:

In [11]:
full_size, low_rank_size = adaptee.weight.numel(), lora_A.numel()+lora_B.numel()
full_size, low_rank_size, f'{100*(low_rank_size/full_size):4.3f}%'

(2359296, 15360, '0.651%')

With a rank of `4` we are down from ~2.4M parameters to ~15K parameters, which is less than `1%` of the original parameters. So far this sounds good. But of course, we also need to review the resulting performance. We will get to that.
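The saving depends only on `r` and the two dimensions of the adapted module. As a small worked example (the helper name `lora_param_count` is made up for this illustration), here is the count for a few ranks on the `3072 x 768` module:

```python
def lora_param_count(fan_in: int, fan_out: int, r: int) -> int:
    """Parameters in lora_A (fan_in x r) plus lora_B (r x fan_out)."""
    return fan_in * r + r * fan_out


fan_in, fan_out = 3072, 768
full_size = fan_in * fan_out  # number of weights in the original matrix

for r in (1, 2, 4, 8, 16):
    low_rank_size = lora_param_count(fan_in, fan_out, r)
    print(f'r={r:>2}: {low_rank_size:>6} parameters, '
          f'{100 * low_rank_size / full_size:.3f}% of the full matrix')
```

The count grows linearly with `r`, so doubling the rank doubles the size of the adapter.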

### Smoke Testing Of The Matrices

For now we should already be able to see if we can use the product of the two low rank matrices in place of the full rank matrix.

Also, remember from the article, that we initialized the two parameters in a way that their initial bias is to not change anything. Hence, we would expect that we can add our adaptation and it should work mechanically, but not change the original result.

We do one forward pass through the original module, the `adaptee`, and one forward pass through the product of the two small matrices, `lora_A` and `lora_B`. We then add the two results. Given that we initialized one of the matrices with 0, the product will be 0 too. The sum should therefore be no different from using the `adaptee` on its own, as adding 0 does not change the result.
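Written out, and assuming $W$ and $b$ denote the weight and bias of the `adaptee`, and $A$ and $B$ stand for `lora_A` and `lora_B`, the combined forward pass is roughly the following. PyTorch stores the weight of a `Linear` module as `out_features x in_features`, hence the transpose.

$$
h = x W^\top + b + x A B, \qquad
x \in \mathbb{R}^{64 \times 3072},\;
W \in \mathbb{R}^{768 \times 3072},\;
A \in \mathbb{R}^{3072 \times r},\;
B \in \mathbb{R}^{r \times 768}
$$

With $B = 0$ at initialization we get $x A B = 0$, and therefore $h = x W^\top + b$, which is the output of the unmodified module.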

In [12]:
# our test data
x = torch.randn((64, 3072)) # pretend bs 64

# using the original path
original_result = adaptee(x) # (bs, out_d)

assert original_result.shape == (64, adaptee.out_features)

# using the new adapter path
lora_matrix = (lora_A@lora_B) 
adapter_result = x @ lora_matrix

assert adapter_result.shape == (64, adaptee.out_features)

# both together
x_prime = adaptee(x) + x @ lora_A @ lora_B
assert x_prime.shape == (64, adaptee.out_features)
assert torch.allclose(original_result, x_prime)

Ok, this worked well. We don't know anything about the performance yet; we will see that when we train the model. But how do we integrate those two matrices into the forward pass of a module that we want to adapt? And how do we add and remove these adapters?

### Creating a LoRA Adapter

In [13]:
class LoRAAdapter(nn.Module):
    def __init__(self, 
                 adaptee, # <- module to be adapted
                 r):
        super().__init__()
        
        self.r = r
        self.adaptee = adaptee
        
        # Store a pointer to the original forward implementation 
        # of the module to be adapted.
        # Then point its forward method to this adapter module.
        self.orig_forward = adaptee.forward
        adaptee.forward = self.forward
    
        # Adding the weight matrices directly to the adaptee,
        # which makes it more practical to report the parameters,
        # and to remove them later.
        adaptee.lora_A = nn.Parameter(torch.randn(adaptee.in_features, r)/math.sqrt(adaptee.in_features))
        adaptee.lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))
        
    def forward(self, x, *args, **kwargs):
        return (
            self.orig_forward(x, *args, **kwargs) +
            x @ self.adaptee.lora_A @ self.adaptee.lora_B
        )
    
    def extra_repr(self):
        # Only report r; this adapter does not define a dropout attribute.
        return f'r={self.r}'

^^^ Please have a look at the inline comments in the code above. This is exactly what we tried before, except that we now hook into the forward pass: we call the original forward method of the module with our inputs, apply the product of `lora_A` and `lora_B` to the same inputs, and add both results.
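The inline comments mention removing the adapter later, but this notebook does not show that step. Here is a minimal sketch of how removal could work, assuming the adapter keeps a reference to the original `forward` as above. The class `RemovableLoRAAdapter` and its `remove` method are hypothetical names for this illustration, not the accompanying code.

```python
import math

import torch
from torch import nn


class RemovableLoRAAdapter(nn.Module):
    """Hypothetical variant of the adapter above that can be uninstalled."""

    def __init__(self, adaptee: nn.Linear, r: int):
        super().__init__()
        self.r = r
        self.adaptee = adaptee
        # Keep the original forward so it can be restored later.
        self.orig_forward = adaptee.forward
        adaptee.forward = self.forward
        adaptee.lora_A = nn.Parameter(
            torch.randn(adaptee.in_features, r) / math.sqrt(adaptee.in_features))
        adaptee.lora_B = nn.Parameter(torch.zeros(r, adaptee.out_features))

    def forward(self, x, *args, **kwargs):
        return (self.orig_forward(x, *args, **kwargs)
                + x @ self.adaptee.lora_A @ self.adaptee.lora_B)

    def remove(self):
        # Restore the original forward and drop the two low rank parameters.
        self.adaptee.forward = self.orig_forward
        del self.adaptee.lora_A
        del self.adaptee.lora_B


layer = nn.Linear(3072, 768)
x = torch.randn(64, 3072)
before = layer(x)

adapter = RemovableLoRAAdapter(layer, r=4)
with torch.no_grad():
    layer.lora_B.fill_(0.01)  # pretend training moved lora_B away from zero
assert not torch.allclose(before, layer(x))

adapter.remove()
assert torch.allclose(before, layer(x))  # original behavior is back
assert not hasattr(layer, 'lora_A')
```

Because the low rank parameters live on the adaptee, deleting them there is enough for them to disappear from the model's parameter list again.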

Let's install such an adapter and take it for a spin.

#### Before Installing Adapter

How many parameters are tunable before installing the adapter?
(We are also freezing the module to be adapted. We'll dive deeper into this a little later.)

In [14]:
count_parameters(model.roberta.encoder.layer[-1].output);

# Freezing the module to be adapted
for p in model.roberta.encoder.layer[-1].output.parameters(): p.requires_grad_(False)

count_parameters(model.roberta.encoder.layer[-1].output);

DEBUG:util:Parameters (name, tunable, count):
DEBUG:util: dense.weight         1     2359296
DEBUG:util: dense.bias           1         768
DEBUG:util: LayerNorm.weight     1         768
DEBUG:util: LayerNorm.bias       1         768
INFO:util:Total parameters: 2,361,600, thereof learnable: 2,361,600 (100.0000%)
DEBUG:util:Parameters (name, tunable, count):
DEBUG:util: dense.weight         0     2359296
DEBUG:util: dense.bias           0         768
DEBUG:util: LayerNorm.weight     0         768
DEBUG:util: LayerNorm.bias       0         768
INFO:util:Total parameters: 2,361,600, thereof learnable: 0 (0.0000%)


In [15]:
x = torch.randn((64, 3072))
original_result = model.roberta.encoder.layer[-1].output.dense(x)

adapter = LoRAAdapter(
    adaptee=model.roberta.encoder.layer[-1].output.dense, 
    r=4)

adapted_result = adapter(x)

assert adapted_result.shape == (64, model.roberta.encoder.layer[-1].output.dense.out_features)
assert torch.allclose(original_result, adapted_result)
x.shape, x.sum().item()

(torch.Size([64, 3072]), -943.4242553710938)

#### After Installing Adapter

In [16]:
count_parameters(model.roberta.encoder.layer[-1].output);

DEBUG:util:Parameters (name, tunable, count):
DEBUG:util: dense.weight         0     2359296
DEBUG:util: dense.bias           0         768
DEBUG:util: dense.lora_A         1       12288
DEBUG:util: dense.lora_B         1        3072
DEBUG:util: LayerNorm.weight     0         768
DEBUG:util: LayerNorm.bias       0         768
INFO:util:Total parameters: 2,376,960, thereof learnable: 15,360 (0.6462%)


### Next Steps

The mechanics are working. That's great.

But we don't know if using the LoRAAdapter is helping. We need to train it first.

We'll get there, but:
- First, we need to generalize the solution and make it configurable, so that we can adapt arbitrary modules; we'll cover Linear modules only to illustrate the concept.  
- Second, currently all parameters of the model are trained (see the 100% above). For the finetuning to become efficient, we need to make sure that only the adapters and the classifier head are trained.

#### Freezing The Model 

When finetuning we only want to finetune the modules that will contribute significantly to the overall finetuned performance. This is not new and not specific to Parameter Efficient Finetuning (PEFT) or LoRA. In the past we also tried to use our understanding of the data and the task to select which parts of a network to finetune. 

This was typically done at a rather coarse-grained level. For example, when finetuning a pre-trained LM for a high-level downstream task like sentiment analysis, we froze the embeddings and the lower layers of the network, assuming that the use of language and vocabulary does not change dramatically between the pre-training objective of predicting the next token/masked-out token on the one hand, and our sentiment analysis on the other. The upper transformer layers and the classifier head were assumed to be sufficient for all necessary changes.

In PEFT we can also choose our layers and modules with such care and foresight, and we'll define how to configure this in the next section. But just seeing above how much this approach reduces the size makes it obvious that not as much care is necessary.

At any rate, we would always want the classifier's parameters to be tunable. It is initialized randomly and is totally task specific. For the remaining modules we can decide this.

Here we start by freezing all parameters of the model. Then we unfreeze the classifier head's parameters. 

All adapters we will subsequently add are naturally also not frozen, but will be trained. That's the whole point, isn't it? :)

#### Before Freezing

In [17]:
model = load_model_and_tokenizer_and_collator('roberta-base')[0];
count_parameters(model, verbose=False);

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


INFO:util:Total parameters: 124,647,170, thereof learnable: 124,647,170 (100.0000%)


#### Freezing The Whole Model

In [18]:
# Now freezing all parameters
for p in model.parameters(): p.requires_grad_(False)

In [19]:
count_parameters(model, verbose=False);

INFO:util:Total parameters: 124,647,170, thereof learnable: 0 (0.0000%)


#### Unfreezing The Classifier / Head
If you compare to the [RoBERTa-Modules](#RoBERTa-Modules) you'll see that the classifier is the last element, at index `-1`.

In [20]:
for p in list(model.children())[-1].parameters(): p.requires_grad_(True)

In [21]:
count_parameters(model, verbose=False);

INFO:util:Total parameters: 124,647,170, thereof learnable: 592,130 (0.4750%)


#### Configuring The Modules

We now look at one way to select the modules we want to adapt. The accompanying code implements something similar; below is a stripped-down version for clarity.

In [22]:
lora_includes=['query', 'output']

In [23]:
import re

for name, module in model.named_modules():
    
    # We are only dealing with Linear modules at this time
    if not isinstance(module, nn.Linear): 
        continue
       
    # Does the name of current module match any of the provided patterns?
    for regex in lora_includes:
        if re.match(f'.*{regex}', name):
            LoRAAdapter(module, 4)
            break

That's already it. We have adapted all Linear modules in the model whose names contain either `query` or `output`.

Now check the `0`/`1` marking of our modules below. Verify that no parameters are trainable, except:
- The classifier
- The `lora_A` and `lora_B` parameters that are attached to the modules whose names contain either `output` or `query`.
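To see which names the two patterns pick up, here is a small standalone check on a handful of module names taken from the printed summary. The list `candidate_names` is hand-picked for illustration, and the `isinstance(module, nn.Linear)` filter from the loop above is assumed to have been applied already.

```python
import re

lora_includes = ['query', 'output']

# Names of some Linear modules in the model (hand-picked, not exhaustive)
candidate_names = [
    'roberta.encoder.layer.0.attention.self.query',
    'roberta.encoder.layer.0.attention.self.key',
    'roberta.encoder.layer.0.attention.self.value',
    'roberta.encoder.layer.0.attention.output.dense',
    'roberta.encoder.layer.0.intermediate.dense',
    'roberta.encoder.layer.0.output.dense',
    'classifier.dense',
    'classifier.out_proj',
]


def is_adapted(name, patterns):
    """Same matching rule as in the loop above: any pattern found in the name."""
    return any(re.match(f'.*{pattern}', name) for pattern in patterns)


for name in candidate_names:
    verdict = 'adapt' if is_adapted(name, lora_includes) else 'skip'
    print(f'{name:<50} {verdict}')
```

Note that `output` matches both `attention.output.dense` and the feed forward `output.dense`, while `classifier.out_proj` is not matched because its name does not contain `output`.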

In [24]:
count_parameters(model, verbose=True)

DEBUG:util:Parameters (name, tunable, count):
DEBUG:util: roberta.embeddings.word_embeddings.weight                      0    38603520
DEBUG:util: roberta.embeddings.position_embeddings.weight                  0      394752
DEBUG:util: roberta.embeddings.token_type_embeddings.weight                0         768
DEBUG:util: roberta.embeddings.LayerNorm.weight                            0         768
DEBUG:util: roberta.embeddings.LayerNorm.bias                              0         768
DEBUG:util: roberta.encoder.layer.0.attention.self.query.weight            0      589824
DEBUG:util: roberta.encoder.layer.0.attention.self.query.bias              0         768
DEBUG:util: roberta.encoder.layer.0.attention.self.query.lora_A            1        3072
DEBUG:util: roberta.encoder.layer.0.attention.self.query.lora_B            1        3072
DEBUG:util: roberta.encoder.layer.0.attention.self.key.weight              0      589824
DEBUG:util: roberta.encoder.layer.0.attention.self.key.bias     

(124978946, 923906)

**^^^ Note again that the number of learnable parameters is still below `1%` of the total.**

Nice!

From the totals above, we would now train only about 0.74% of all parameters.
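The reported totals can be reproduced by hand. Here is a short worked calculation, assuming that each of the 12 encoder layers has three adapted Linear modules (the query projection, the attention output projection, and the feed forward down projection), all with `r = 4`:

```python
r = 4
n_layers = 12


def lora_params(fan_in, fan_out, r):
    """Parameters in lora_A (fan_in x r) plus lora_B (r x fan_out)."""
    return fan_in * r + r * fan_out


per_layer = (
    lora_params(768, 768, r)     # attention.self.query
    + lora_params(768, 768, r)   # attention.output.dense
    + lora_params(3072, 768, r)  # output.dense
)
lora_total = n_layers * per_layer

base_total = 124_647_170  # parameters before adding adapters (from the logs)
classifier = 592_130      # tunable classifier parameters (from the logs)

total = base_total + lora_total
learnable = classifier + lora_total

# Matches the tuple returned by count_parameters above
assert (total, learnable) == (124_978_946, 923_906)
print(f'{100 * learnable / total:.2f}% of all parameters are learnable')
```

The adapters add 331,776 parameters in total, which is less than the classifier head alone.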

#### Sanity Check of The Dimensions

Let's take a deeper look. We'll review the Linear module that does the down projection in the last layer (index 11):

```
...
roberta.encoder.layer.11.output.dense.weight                          0     2359296
roberta.encoder.layer.11.output.dense.bias                            0         768
roberta.encoder.layer.11.output.dense.lora_A                          1       12288
roberta.encoder.layer.11.output.dense.lora_B                          1        3072
...
```

In [25]:
model.roberta.encoder.layer[-1].output.dense

Linear(in_features=3072, out_features=768, bias=True)

In [26]:
model.roberta.encoder.layer[-1].output.dense.lora_A.numel()

12288

In [27]:
# Check that the matrix with all in multiplied with all out features is the same as the number of parameters of the weight matrix
assert (model.roberta.encoder.layer[-1].output.dense.out_features * 
        model.roberta.encoder.layer[-1].output.dense.in_features == 2359296)

# Check that the number of parameters of the low rank lora A matrix is the product of r (4) and in_features
assert model.roberta.encoder.layer[-1].output.dense.in_features * 4 == model.roberta.encoder.layer[-1].output.dense.lora_A.numel()

# Check that the number of parameters of the low rank lora B matrix is the product of r (4) and out_features
assert model.roberta.encoder.layer[-1].output.dense.out_features * 4 == model.roberta.encoder.layer[-1].output.dense.lora_B.numel()

3072 * 768, 3072 * 4, 4 * 768

(2359296, 12288, 3072)

In [28]:
model.roberta.encoder.layer[-1].output.dense.out_features * model.roberta.encoder.layer[-1].output.dense.in_features

2359296

Cool, cool, cool. 

We still haven't seen the performance though.

Let's look into that now and submit a training job. We'll validate that the mechanics are working and, ideally, get an initial feeling for the impact. If everything works, then we'll tune the hyperparameters.

## Validate Mechanics / Sneak Peek

In [29]:
import sagemaker
from nb_helper import p
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
from sagemaker.utils import name_from_base

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Preferences/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mkamp/Library/Preferences/sagemaker/config.yaml
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


INFO:util:Total parameters: 124,647,170, thereof learnable: 124,647,170 (100.0000%)


In [30]:
estimator_parameters = p('estimator_parameters') | \
{
    'role': get_execution_role(),
    'metric_definitions': p('metric_definitions'),
    'hyperparameters': p('hyperparameters')
}

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Preferences/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mkamp/Library/Preferences/sagemaker/config.yaml


#### Submit Full-Finetuning Training Job

In [31]:
fullft_estimator = PyTorch(**estimator_parameters)
fullft_estimator.set_hyperparameters(**{'sst2-lora-config': 'none', 'sst2-learning-rate': 5e-5})
fullft_estimator.fit(wait=False)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Preferences/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mkamp/Library/Preferences/sagemaker/config.yaml
Using provided s3_resource
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: lora-2023-10-31-15-48-43-327


#### Submit LoRA Finetuning Training Job

In [32]:
loraft_estimator = PyTorch(**estimator_parameters)
loraft_estimator.set_hyperparameters(**{'sst2-lora-config': 'query|output', 'sst2-lora-r': 2, 'sst2-learning-rate': 4e-4})
loraft_estimator.fit(wait=False)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Preferences/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mkamp/Library/Preferences/sagemaker/config.yaml
Using provided s3_resource
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: lora-2023-10-31-15-48-44-924


#### First Results
We submitted both jobs with `wait=False`, so that they can run in parallel. Now we wait for both of them to finish.

In [33]:
import boto3 
sm = boto3.client('sagemaker')
sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=fullft_estimator.latest_training_job.job_name)
sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=loraft_estimator.latest_training_job.job_name)

In [34]:
def get_job_metrics(training_job_name, metrics=('sst2_valid_acc',)):
    ms = sm.describe_training_job(TrainingJobName=training_job_name)['FinalMetricDataList']
    return {m['MetricName']: m['Value'] for m in ms if m['MetricName'] in metrics}

In [35]:
'full-finetuning' , get_job_metrics(fullft_estimator.latest_training_job.job_name)

('full-finetuning', {'sst2_valid_acc': 0.9346330165863037})

In [36]:
'lora-finetuning', get_job_metrics(loraft_estimator.latest_training_job.job_name)

('lora-finetuning', {'sst2_valid_acc': 0.9403669834136963})

Whoop, whoop! That looks pretty good. 

On the first try? Even though in this article we have not given much thought to the hyperparameters? What is the right `r`, which modules should be adapted? What is the right learning rate and regularization?

No, I cheated. When writing, I already did some experimentation in parallel. Hence I already learned about some good hyperparameters for the full finetuning baseline and the LoRA adapted model. 

But in the next article we will run those experiments together. The numbers above are a first taste of what's to come. Maybe we can do better? At least we should get a better understanding of which design decisions are impactful.