# LoRA (Low-Rank Adaptation): an efficient way to fine-tune large language models

When we have a specific task to solve with a large language model, we have several options:

1. Use the model as it is, relying on prompt engineering.
2. Fine-tune the whole model, updating all its weights.
3. Fine-tune only a small set of parameters (for example, some layers or LoRA adapters) instead of the whole model.

Pros and cons of each approach:

| Approach             | Pros                                                                 | Cons                                                                 |
|----------------------|----------------------------------------------------------------------|----------------------------------------------------------------------|
| Prompt Engineering   | - Fast and cheap                                                     | - Limited control over behavior                                      |
|                      | - No training or infrastructure needed                               | - Performance highly sensitive to prompt wording                     |
|                      | - Easily updated or changed                                          | - May hit model limits on specific tasks                             |
| Fine-Tune Full Model | - Full control over model behavior                                   | - Very resource-intensive (GPU, time, data)                          |
|                      | - Better performance on domain-specific or complex tasks             | - Risk of overfitting or catastrophic forgetting                    |
|                      | - Can learn new capabilities                                         | - Requires re-deployment of large models                             |
| LoRA Fine-Tuning     | - Much less compute and memory than full fine-tuning                 | - Slightly less flexible than full fine-tuning                      |
|                      | - Retains base model unchanged (can swap adapters)                   | - Still needs training pipeline setup                                |
|                      | - Modular and efficient for multiple tasks/domains                   | - May not reach full model’s potential on highly specialized tasks   |

Some important remarks:

- If you want to use a large hosted LLM from a provider, such as GPT from OpenAI or Gemini from Google, you cannot access its weights, so fine-tuning is limited to whatever the provider offers and prompt engineering is often the only available option.
- Small models, such as 1B or 8B parameters, often do not perform well on very specific tasks out of the box, but you can fine-tune them with limited resources, which can greatly improve performance.

So we are going to implement and compare two ways of solving a specific task: **prompt engineering**, leaving the original model as it is, and **LoRA fine-tuning**.

# The task: question answering with RACE

We want to train a system that can answer multiple-choice questions, where the answer must be picked from a list of options:

```text
Context: A subject which seems to have been insufficiently studied by doctors and psychologists is the influence of geography and climate on the psychological and physical health of mankind. There seems no doubt that the general character of the landscape, the relative length of day and night, and the climate must all play a big part in determining what kind of people we are.
It is true that a few studies have been made. Where all the inhabitants of a particular area enjoy exceptionally good or bad health, scientists have identified contributory factors such as the presence or absence of substances like iodine, fluoride, calcium, or iron in the water supply, or perhaps types of land that provide breeding places for pests like mosquitoes or rats.
Moreover, we can all generalize about types of people we have met. Those living in countries with long dark winters are apt to be less talkative and less vivacious than inhabitants of countries where the climate is more equable. And where the olive and the orange grow, there the inhabitants are cheerful, talkative, and spontaneous.
But these commonplace generalizations are inadequate: the influence of climate and geography should be studied in depth. Do all mountain dwellers live to a ripe old age? Does the drinking of wine, rather than beer, result in a sunny and open temperament? Is the strength and height of one of the Kenyan tribes due to their habitual drinking of the blood of cows?
We are not yet sure of the answers to such questions, but let us hope that something of benefit to mankind may eventually result from such studies.

Question: According to the author, research into the influence of geography and climate should  _  .

Options:
A) focus on some unknown aspects
B) be pursued on a larger scale
C) be carried out within a larger scope
D) go much deeper

Answer: D
```

We use the `ehovy/race` dataset from the Hugging Face Hub, loaded with the `datasets` library. It contains about 97k questions and answers,
but to reduce training time we keep only the examples of the `high` subset whose article is shorter than `800` characters. This leaves
`803` examples in the train set, `34` in the validation set and `56` in the test set.

### Import libraries

In [22]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
)
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm
import random
import numpy as np
from huggingface_hub import login
from datasets import load_dataset, DatasetDict
import math

login(token="")

### Set variables

In [23]:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)


MAX_ARTICLE_CHAR_LENGTH = 800
MAX_TOKEN_LENGTH = 512
BATCH_SIZE = 1
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3

In [24]:
MODEL_NAME = "meta-llama/Llama-3.2-1B"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [25]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

### Download dataset race

In [26]:
ds_full = load_dataset("ehovy/race", "high", trust_remote_code=True)

In [27]:
ds_test = ds_full.get('test')
filtered_test_data = ds_test.filter(
    lambda example: len(example['article']) < MAX_ARTICLE_CHAR_LENGTH
)

### Functions
#### Model Evaluation

After the language model processes the context and question up to "Answer:", it produces logits: raw scores for each possible next token in its vocabulary.

- Softmax (or `log_softmax`): this function converts the raw scores (logits) into probabilities (numbers between 0 and 1 that sum to 1). We use `log_softmax` for numerical stability, since working with log-probabilities avoids issues with extremely small numbers.
- Evaluation: we then look at the log-probabilities the model assigned to the answer option letters ('A', 'B', 'C', 'D').
- Prediction: the predicted answer is the option (A, B, C, or D) with the highest probability. This implements $\arg\max_{s \in S} P(s \mid c)$, where $S$ is the set of option letters and $c$ is the prompt: the logarithm is monotonic, so the option that maximizes the log-probability also maximizes the probability.
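The scoring rule described above can be sketched without a model. This is a minimal illustration in plain Python, assuming a toy six-token vocabulary; the helper names `log_softmax` and `pick_option` and the token ids are hypothetical and are not real Llama ids.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def pick_option(logits, option_token_ids):
    """Return the option letter whose token has the highest log-probability."""
    log_probs = log_softmax(logits)
    return max(option_token_ids, key=lambda letter: log_probs[option_token_ids[letter]])

# Toy next-token logits over a 6-token vocabulary (hypothetical values).
toy_logits = [0.1, 2.0, 0.5, 3.2, -1.0, 0.0]
# Hypothetical mapping from option letter to token id.
toy_option_ids = {"A": 1, "B": 2, "C": 3, "D": 4}

prediction = pick_option(toy_logits, toy_option_ids)
print(prediction)
```

Only the four option tokens are compared, so whatever probability mass the model puts on other tokens does not affect the prediction.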

In [28]:
def model_evaluation(model_llama, prompt, test_data):
    total_correct_original = 0
    total_examples_original = 0
    
    model_llama.eval()
    with torch.no_grad():
        for example in tqdm(test_data):
            
            options_str = "\n".join([f"{chr(65+i)}) {opt}" for i, opt in enumerate(example['options'])])
            prompt_for_inference = prompt.format(example['article'], example['question'], options_str)
    
            inputs = tokenizer(prompt_for_inference, return_tensors="pt", truncation=True, max_length=MAX_TOKEN_LENGTH).to(device)
    
            outputs = model_llama(**inputs)
            logits_next_token = outputs.logits[:, -1, :] 
            log_probabilities = torch.nn.functional.log_softmax(logits_next_token, dim=-1)
    
            predicted_answer_char = None
            max_log_prob_for_option = -float('inf')
    
            for i in range(len(example['options'])):
                option_char = chr(ord('A') + i)
                option_char_token_ids = tokenizer.encode(option_char, add_special_tokens=False)
                current_option_char_token_id = option_char_token_ids[0]
                
                if current_option_char_token_id in range(log_probabilities.shape[-1]):
                    current_log_prob = log_probabilities[:, current_option_char_token_id].item()

                    if current_log_prob > max_log_prob_for_option:
                        max_log_prob_for_option = current_log_prob
                        predicted_answer_char = option_char

            if predicted_answer_char == example['answer']:
                total_correct_original += 1
            total_examples_original += 1
            
        return total_correct_original, total_examples_original 

# The model: Llama 3.2 1B

Nowadays many small open-weight models released by large technology companies can be used for free and downloaded
from repositories such as Hugging Face.

This list includes:
- Gemma: a model family trained by Google, offered in several sizes, including small ones
- Phi: a model family trained by Microsoft
- Llama: a model family trained by Meta

Special mentions: SmolLM2, a model built by Hugging Face, and OpenELM, a model built by Apple.

*All these models are based on decoder-only architectures, which makes them easier to train.*

**Our chosen model is Llama 3.2 1B**

## Base Model

In [29]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.config.pad_token_id = tokenizer.pad_token_id
model.to(device)

print(f"Model moved to {device}")

Model moved to cuda


In [30]:
prompt_for_inference = (
    """You are a smart question answering model.  
    Answer the question based on the next information, 
    and at the end you will find the answer options.
    Choose the best one, only give the letter of the answer which could be A, B, C or D.\n\n
    Context: {}\n\n
    Question: {}\n\n
    Options:\n{}\n\n
    Answer:"""
)

total_correct_original, total_examples_original = model_evaluation(model, prompt_for_inference, filtered_test_data)

100%|██████████| 56/56 [00:01<00:00, 46.24it/s]


In [31]:
accuracy_original = total_correct_original / total_examples_original

print(f"\n--- Test evaluation results (original model) ---")
print(f"Total examples: {total_examples_original}")
print(f"Correct predictions: {total_correct_original}")
print(f"Accuracy: {accuracy_original * 100:.2f}%")


--- Test evaluation results (original model) ---
Total examples: 56
Correct predictions: 20
Accuracy: 35.71%


## Fine-tuned model

### Data tokenization and transformation

The `transform_and_tokenize_example` function prepares each example in the dataset for fine-tuning.

Constructing the prompt:

- First, format the `article`, `question`, and `options` into a structured text (`prompt_template`) that ends with "Answer:".
- Then, append the actual `answer` (e.g., 'A') to the end of this prompt to create `full_text_for_training`.

Tokenization:

- Convert `full_text_for_training` into token ids with the `tokenizer`.
- Force the length to `MAX_TOKEN_LENGTH` (truncating if too long, padding if too short).
- `return_offsets_mapping=True` is crucial: it returns, for each token, the character span it covers in the original text.

Label masking:

- Create a copy of the `input_ids` (the input tokens) to use as `labels`.
- The key point: find where the actual answer begins (the character 'A', 'B', 'C', or 'D') within `full_text_for_training`.
- Using `offset_mapping`, identify the index of the token where the answer begins.
- Finally, set the `labels` of all tokens before the answer (the context, the question, the options, and the "Answer:" part) to `-100`.

Why `-100`?

During fine-tuning, the model only has to learn to generate the `answer`. `-100` is the label value that the loss function ignores, so setting the `labels` of the prompt tokens to `-100` excludes them from the loss and focuses the optimization on the answer prediction.
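The effect of the `-100` labels can be illustrated with a small sketch in plain Python. The token ids, the helper names `mask_prompt_labels` and `masked_nll`, and the toy log-probabilities are all hypothetical; the one-position shift that causal language models apply between inputs and labels is left out to keep the example short.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores by default

def mask_prompt_labels(input_ids, answer_start):
    """Copy input_ids into labels and hide every position before the answer."""
    labels = list(input_ids)
    for i in range(min(answer_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

def masked_nll(log_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp[label] for lp, label in zip(log_probs, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical token ids: five prompt tokens followed by one answer token (id 3).
input_ids = [11, 12, 13, 14, 15, 3]
labels = mask_prompt_labels(input_ids, answer_start=5)

# Toy per-position log-probabilities; only the answer position is ever read.
toy_log_probs = [{} for _ in range(5)] + [{3: math.log(0.5)}]
loss = masked_nll(toy_log_probs, labels)

print(labels)
print(round(loss, 4))
```

Because the five prompt positions carry `-100`, the loss depends only on how much probability the model gives to the answer token.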


In [32]:
def transform_and_tokenize_example(example):
    options_str = "\n".join([f"{chr(65+i)}) {opt}" for i, opt in enumerate(example['options'])])
    
    prompt_template =  (
    """Context: {}\n
    Question: {}\n
    Options:\n{}\n
    Answer:"""
    ).format(example['article'], example['question'], options_str)
        
    full_text_for_training = prompt_template + " " + example['answer'] # Add a space before the answer for clarity in tokenization

    tokenized_full = tokenizer(
        full_text_for_training,
        truncation=True,
        max_length=MAX_TOKEN_LENGTH,
        padding="max_length",
        return_offsets_mapping=True 
    )

    labels = tokenized_full["input_ids"].copy()

    # Use the last occurrence: the article itself may contain the text "Answer:".
    answer_start_char_idx = full_text_for_training.rfind("Answer:")
    if answer_start_char_idx != -1:
        answer_token_char_start_idx = answer_start_char_idx + len("Answer: ")

        answer_token_start_index = -1
        for i, (start_offset, end_offset) in enumerate(tokenized_full['offset_mapping']):

            if start_offset <= answer_token_char_start_idx < end_offset:
                answer_token_start_index = i
                break
        
        if answer_token_start_index != -1:
            for i in range(answer_token_start_index):
                labels[i] = -100
            
    tokenized_full["labels"] = labels
    del tokenized_full["offset_mapping"] 
    return tokenized_full

In [33]:
processed_ds = DatasetDict()

for split, data in ds_full.items():
    if split != 'test':
        filtered_data = data.filter(
            lambda example: len(example['article']) < MAX_ARTICLE_CHAR_LENGTH,
            desc=f"Filtering articles in {split}"
        )

        mapped_data = filtered_data.map(
            transform_and_tokenize_example,
            batched=False,
            remove_columns=filtered_data.column_names,
            desc=f"Tokenizing {split}"
        )
        processed_ds[split] = mapped_data
        processed_ds[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


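# Note: with mlm=False this collator rebuilds `labels` from `input_ids` (masking only padding tokens),
# so the prompt masking set in transform_and_tokenize_example is overwritten at this point.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)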

train_dataloader = DataLoader(
    processed_ds.get("train"),
    shuffle=True,
    batch_size=BATCH_SIZE,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    processed_ds.get("validation"),
    batch_size=BATCH_SIZE,
    collate_fn=data_collator
)

### Low-Rank Adaptation (LoRA) class

In [34]:
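class LoraLinear(torch.nn.Module):
    """Frozen linear layer plus a trainable low-rank update lora_A @ lora_B."""

    def __init__(self, linear_layer, alpha = 1, r = 1):
        super().__init__()
        # Keep the frozen base layer in float32 so the LoRA math runs in full precision.
        self.linear_layer = linear_layer.to(torch.float32)
        self.r = r  # alpha is accepted but unused: no alpha / r scaling is applied
        fan_in = self.linear_layer.in_features
        fan_out = self.linear_layer.out_features
        self.lora_A = torch.nn.Parameter(torch.zeros((fan_in, r), device=linear_layer.weight.device))
        self.lora_B = torch.nn.Parameter(torch.zeros((r, fan_out), device=linear_layer.weight.device))
        # lora_B stays at zero, so lora_A @ lora_B = 0 and the layer starts equal to the original.
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.linear_layer.weight.requires_grad = False

    def _merge_weights(self):
        # Fold the low-rank update into a single float16 matrix of shape (fan_in, fan_out).
        # The bias is not included, which assumes the wrapped layer has none.
        with torch.no_grad():
            return (self.linear_layer.weight.transpose(0,1) + self.lora_A @ self.lora_B).to(torch.float16)

    def train(self, mode=True):
        super().train(mode)  # updates self.training here and in child modules
        if not mode:
            self.merged_weight = self._merge_weights()  # refreshed every time eval mode is entered
        return self

    def forward(self, x):
        if self.training:
            x = x.to(torch.float32)
            output = self.linear_layer(x)
            output += x @ self.lora_A @ self.lora_B
            output = output.to(torch.float16)
        else:
            if not hasattr(self, 'merged_weight'):
                self.merged_weight = self._merge_weights()
            output = x @ self.merged_weight
        return output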
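
In equation form, and reading the shapes off the class above (`lora_A` is `fan_in × r`, `lora_B` is `r × fan_out`), the layer can be summarized as follows. The symbols are notation introduced here for illustration: $x$ is the input, $W$ the frozen weight of the wrapped linear layer, $A$ and $B$ stand for `lora_A` and `lora_B`, and $h$ is the output.

$$
h = x W^\top + x A B, \qquad W \in \mathbb{R}^{d_\text{out} \times d_\text{in}},\quad A \in \mathbb{R}^{d_\text{in} \times r},\quad B \in \mathbb{R}^{r \times d_\text{out}}
$$

$$
W_\text{merged} = W^\top + A B, \qquad h = x W_\text{merged}
$$

The first line is the training path and the second is the evaluation path, where the update is folded into a single matrix so inference costs one matrix product. Since $B$ is initialized to zero, $AB = 0$ at the start and the wrapped layer initially behaves like the original one. The $\alpha / r$ scaling from the LoRA paper does not appear because `alpha` is not used in this class.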

In [35]:
# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the linear layers of the attention mechanism with LoRA layers
for layer in model.model.layers:
    if hasattr(layer, 'self_attn'):
        layer.self_attn.q_proj = LoraLinear(layer.self_attn.q_proj, r=16)
        layer.self_attn.k_proj = LoraLinear(layer.self_attn.k_proj, r=16)
        layer.self_attn.v_proj = LoraLinear(layer.self_attn.v_proj, r=16)
        layer.self_attn.o_proj = LoraLinear(layer.self_attn.o_proj, r=16)

params_without_lora = 0
params_with_lora = 0
for name, param in model.named_parameters():
    if 'self_attn' in name and 'linear_layer' in name: # This counts the original linear layer's parameters
        params_without_lora += param.numel()
    if param.requires_grad:
        params_with_lora += param.numel()
        
print(f'Parameters without LoRA (original, frozen): {params_without_lora:,} || Parameters with LoRA (trainable): {params_with_lora:,} || Share of LoRA parameters: {100 * params_with_lora / (params_without_lora + params_with_lora):.2f}%')

Parameters without LoRA (original, frozen): 167,772,160 || Parameters with LoRA (trainable): 3,407,872 || Share of LoRA parameters: 1.99%
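
The counts printed above can be reproduced by hand. The sketch below assumes the attention shapes of Llama 3.2 1B (hidden size 2048, key/value projection width 512, 16 decoder layers); these values are assumptions taken from the public model configuration, not read from the notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """lora_A is (in_features x r) and lora_B is (r x out_features)."""
    return in_features * r + r * out_features

# Assumed Llama 3.2 1B attention shapes (not read from the notebook).
HIDDEN, KV_WIDTH, NUM_LAYERS, RANK = 2048, 512, 16, 16
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
}

# Frozen weights of the four wrapped projections, summed over all layers.
frozen = NUM_LAYERS * sum(i * o for i, o in projections.values())
# Trainable LoRA matrices for the same projections.
trainable = NUM_LAYERS * sum(lora_param_count(i, o, RANK) for i, o in projections.values())
share = 100 * trainable / (frozen + trainable)

print(f"{frozen:,} frozen, {trainable:,} trainable, {share:.2f}%")
```

Under these assumptions the numbers match the notebook output. The percentage is relative to the attention projections only, not to the roughly 1B parameters of the whole model, so the trainable share of the full model is smaller still.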


In [36]:
model.to(device)

print(f"LoRA model moved to {device}")

LoRA model moved to cuda


### Training / fine-tuning loop

In [37]:
# Only the LoRA matrices require gradients, so only they are handed to the optimizer.
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=LEARNING_RATE)

print("Starting training...")
num_training_steps = NUM_EPOCHS * len(train_dataloader)
progress_bar = tqdm(range(num_training_steps))

for epoch in range(NUM_EPOCHS):
    model.train()
    total_train_loss = 0
    print(f"\n--- Epoch {epoch + 1}/{NUM_EPOCHS} ---")

    for batch_idx, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_train_loss += loss.item()
        progress_bar.update(1)
        progress_bar.set_description(f"Epoch {epoch+1}, Batch {batch_idx+1}, Loss: {loss.item():.4f}")

    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"End of epoch {epoch + 1}: average training loss = {avg_train_loss:.4f}")

    model.eval()
    total_eval_loss = 0
    print(f"\nEvaluating at the end of epoch {epoch + 1}...")
    with torch.no_grad():
        for eval_batch in tqdm(eval_dataloader):
            eval_batch = {k: v.to(device) for k, v in eval_batch.items()}
            outputs = model(**eval_batch)
            total_eval_loss += outputs.loss.item()
    avg_eval_loss = total_eval_loss / len(eval_dataloader)
    print(f"End of epoch {epoch + 1}: average validation loss = {avg_eval_loss:.4f}")

progress_bar.close()
print("Training completed.")

Starting training...

--- Epoch 1/3 ---
Epoch 1, Batch 803, Loss: 1.6844:  33%|███▎      | 803/2409 [01:22<02:45,  9.73it/s]
End of epoch 1: average training loss = 2.1072

Evaluating at the end of epoch 1...
100%|██████████| 34/34 [00:01<00:00, 26.14it/s]
End of epoch 1: average validation loss = 2.3132

--- Epoch 2/3 ---
Epoch 2, Batch 803, Loss: 1.1751:  67%|██████▋   | 1606/2409 [02:46<01:22,  9.73it/s]
End of epoch 2: average training loss = 1.7175

Evaluating at the end of epoch 2...
100%|██████████| 34/34 [00:01<00:00, 26.19it/s]
End of epoch 2: average validation loss = 2.5852

--- Epoch 3/3 ---
Epoch 3, Batch 803, Loss: 0.8945: 100%|██████████| 2409/2409 [04:10<00:00,  9.73it/s]
End of epoch 3: average training loss = 1.2885

Evaluating at the end of epoch 3...
100%|██████████| 34/34 [00:01<00:00, 26.16it/s]
End of epoch 3: average validation loss = 2.8917
Training completed.

In [38]:
torch.save(model.state_dict(), "llama3.2-1B_fine-tuned.pt")

In [39]:
model.load_state_dict(torch.load("llama3.2-1B_fine-tuned.pt"))

<All keys matched successfully>

In [40]:
prompt_for_inference_ft = (
        """Context: {}\n
        Question: {}\n
        Options:\n{}\n
        Answer:"""
        )

In [41]:
total_correct_fine_tuning, total_examples_fine_tuning = model_evaluation(model, prompt_for_inference_ft, filtered_test_data)

100%|██████████| 56/56 [00:01<00:00, 49.75it/s]


In [42]:
accuracy = total_correct_fine_tuning / total_examples_fine_tuning

print(f"\n--- Test evaluation results (probability-based) ---")
print(f"Total examples: {total_examples_fine_tuning}")
print(f"Correct predictions: {total_correct_fine_tuning}")
print(f"Accuracy: {accuracy * 100:.2f}%")


--- Test evaluation results (probability-based) ---
Total examples: 56
Correct predictions: 22
Accuracy: 39.29%


# Result analysis

As the previous experiments show, the model fine-tuned with LoRA outperforms the baseline model. The results are as follows:

| Model | Accuracy |
|--------|----------|
| Baseline | 0.357 |
| Fine-tuned | 0.393 |

This is an improvement of two additional correct answers out of 56 (about 3.6 percentage points) over the baseline model.
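
One way to gauge how much weight this gap can bear is to attach a rough uncertainty to each accuracy. The sketch below uses the correct-answer counts from the two evaluation runs above and the binomial standard error under a normal approximation, which assumes the 56 test questions are independent; `accuracy_with_stderr` is a hypothetical helper.

```python
import math

def accuracy_with_stderr(correct, total):
    """Accuracy and its binomial standard error, sqrt(p * (1 - p) / n)."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)

base_acc, base_se = accuracy_with_stderr(20, 56)  # base model: 20 of 56 correct
lora_acc, lora_se = accuracy_with_stderr(22, 56)  # LoRA model: 22 of 56 correct
gap = lora_acc - base_acc

print(f"base {base_acc:.3f} +/- {base_se:.3f}")
print(f"LoRA {lora_acc:.3f} +/- {lora_se:.3f}")
print(f"gap  {gap:.3f}")
```

With only 56 test examples each standard error is larger than the gap itself, so a larger test set would be needed to confirm the size of the improvement.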

RACE is not an easy task such as simple classification: solving it requires advanced reasoning skills, common sense and deep analysis,
so an accuracy of `0.393` with only a small portion of the dataset and a small number of epochs (3) is a good result.

For example, in 2017 the state-of-the-art models achieved 43% accuracy on this dataset [(as reported in the original dataset paper)](https://arxiv.org/abs/1704.04683).

Our results are still far from human performance (humans can reach 95% accuracy), but our 39.29% is a good result with limited resources.


![race results](https://media.githubusercontent.com/media/lgemc/pytorch_training/refs/heads/master/static/race_q_and_a_results.png)
<br> Source: [papers with code](https://paperswithcode.com/sota/question-answering-on-race)

## Interesting facts

We noticed that the model is very sensitive to small changes in the prompt. For example, in one experiment we added a single space to the
prompt and the model started to always predict option `C`, which degraded performance to 25% accuracy (the same as random guessing among four options).

## About the computational efficiency of LoRA

In our experiments we measured the following time and memory usage:

| Model                       | Time to finish | RAM used |
|-----------------------------|----------------|----------|
| Baseline (full fine-tuning) | 8.5 minutes    | 13 GB    |
| Fine-tuned (LoRA)           | 3 minutes      | 5 GB     |

So LoRA needs less than half of the time and memory required to train the full model, which is a huge improvement.
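
The savings can be worked out directly from the table. This small calculation takes the table values at face value and assumes its first row corresponds to training the full model and its second row to the LoRA run.

```python
full_minutes, lora_minutes = 8.5, 3.0  # times from the table above
full_gb, lora_gb = 13.0, 5.0           # memory from the table above

# Fraction of the full-training cost that the LoRA run needs.
time_ratio = lora_minutes / full_minutes
memory_ratio = lora_gb / full_gb

print(f"time: {time_ratio:.0%} of full training, memory: {memory_ratio:.0%} of full training")
```

Both ratios come out somewhat below one half, in line with the summary above.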

