# Fine-Tuning Models
A notebook that demonstrates the implementation of, and the core idea behind, each proposed fine-tuning approach.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from transformers import AutoModel

### Base Model
For the purposes of this work, we'll use the BERT base model (`bert-base-uncased`). It contains about 110 million parameters organized into 12 transformer layers.

In [2]:
base_model = AutoModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float32)
print(base_model)

`torch_dtype` is deprecated! Use `dtype` instead!


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

### Classification Model
A class that takes the BERT model and adds a classification head on top of it, so that it is ready for classification problems.

CLS Token

The [CLS] token is a special token that is added to the beginning of the input sequence for many transformer models. Its final hidden state, after being processed by the model's layers, is used as a summary representation of the entire sequence. This is a very common technique for tasks that classify the entire input, such as sentiment analysis.

Logits

Logits are the raw, unnormalized scores produced by the final linear layer of a classification model, one score per class. A higher logit means the model favors that class more strongly. These values are typically passed through a softmax function to convert them into probabilities that sum to 1, which are much easier to interpret.
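
As a small illustration of the two ideas above, the sketch below pushes a random tensor, standing in for BERT's `last_hidden_state`, through [CLS] selection, a linear head, and softmax. The batch size of 2, the sequence length of 5, and the variable names are assumptions made for this example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for BERT output: (batch_size, sequence_length, hidden_size)
last_hidden_state = torch.randn(2, 5, 768)

# Position 0 of every sequence holds the [CLS] representation
cls_embedding = last_hidden_state[:, 0, :]      # shape (2, 768)

# A linear head maps the [CLS] vector to one raw score (logit) per class
head = nn.Linear(768, 2)
logits = head(cls_embedding)                    # shape (2, 2)

# Softmax turns each row of logits into probabilities that sum to 1
probs = F.softmax(logits, dim=-1)

print(cls_embedding.shape, logits.shape)
print(probs.sum(dim=-1))
```

The head here is untrained, so the probabilities themselves are meaningless; only the shapes and the normalization are of interest.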

Dropout

Dropout is a regularization technique that helps prevent a neural network from overfitting to the training data. During training, a certain percentage of the values in a layer (in this case, 10% of the values in the [CLS] token's embedding) are randomly "dropped out", i.e. set to zero. This forces the model to learn more robust and generalized features, as it can't rely on any single neuron or specific set of neurons to make a prediction. This makes the model more effective on unseen data.
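
The behavior described above can be observed directly on `nn.Dropout`. This is a minimal sketch on a tensor of ones (an assumption chosen so that the effect is easy to read); it shows that dropout is only active in training mode and that PyTorch rescales the surviving values by 1 / (1 - p).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.1)
x = torch.ones(1, 768)

# Training mode: each value is either zeroed or scaled by 1 / (1 - p)
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op and the input passes through unchanged
dropout.eval()
eval_out = dropout(x)

print((train_out == 0).sum().item(), "of 768 values were dropped")
print(train_out.max().item())          # surviving values equal 1 / 0.9
print(torch.equal(eval_out, x))
```

This is why `model.train()` and `model.eval()` need to be called at the right moments when training and evaluating the classification model.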

In [3]:
class ClassificationModel(nn.Module):
    def __init__(self, base_model):
        super(ClassificationModel, self).__init__()
        
        self.base_model = base_model
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(768, 2) # output features from BERT is 768 and 2 is number of labels
        
    def forward(self, input_ids, attn_mask):
        
        last_hidden_state = self.base_model(input_ids, attention_mask=attn_mask).last_hidden_state
        cls_embedding = last_hidden_state[:, 0, :]   # Take [CLS] token representation
        x = self.dropout(cls_embedding)
        logits = self.linear(x) 
        return logits

## Full fine-tuning
In full fine-tuning, all of the model's layers are set to trainable.

In [4]:
def get_full_classification_model(base_model):
    # Simply add classification head
    model = ClassificationModel(base_model)

    return model

In [5]:
base = base_model
classification_model = get_full_classification_model(base)

trainable_params = sum(p.numel() for p in classification_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in classification_model.parameters())

trainable_params_percent = 100 * trainable_params / total_params
print(f"Trainable parameters: {trainable_params} / {total_params} ({trainable_params_percent:.2f}%)")

Trainable parameters: 109483778 / 109483778 (100.00%)


## Parameter efficient fine-tuning

Instead of training and changing all trainable parameters, we choose a subset of them, or add a small set of new ones, to be adjusted while the others stay frozen. The goal is training that is faster and computationally cheaper.
Many different approaches can be used here.
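
The freezing mechanism itself is just the `requires_grad` flag on each parameter. The sketch below uses a tiny two-layer model as a stand-in for BERT plus a head; the layer sizes and the helper name `count_parameters` are hypothetical and chosen only for illustration.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


backbone = nn.Linear(16, 8)   # 16 * 8 weights + 8 biases = 136 parameters
head = nn.Linear(8, 2)        # 8 * 2 weights + 2 biases = 18 parameters
model = nn.Sequential(backbone, head)

# Freeze the backbone: its parameters will receive no gradient updates
for param in backbone.parameters():
    param.requires_grad = False

trainable, total = count_parameters(model)
print(f"Trainable parameters: {trainable} / {total}")   # 18 / 154
```

A helper like this could replace the counting code that is repeated in several cells of this notebook.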


## Classification head model
Freezes all parameters except the ones in the last classification layer

In [6]:
def get_classification_head_model(base_model):
    # Freeze all parameters
    for param in base_model.parameters():
        param.requires_grad = False
            
    model = ClassificationModel(base_model)

    return model

In [7]:
base = base_model
classification_model = get_classification_head_model(base)

trainable_params = sum(p.numel() for p in classification_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in classification_model.parameters())

trainable_params_percent = 100 * trainable_params / total_params
print(f"Trainable parameters: {trainable_params} / {total_params} ({trainable_params_percent:.2f}%)")

Trainable parameters: 1538 / 109483778 (0.00%)
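
The figure of 1538 trainable parameters can be checked by hand, since the head is a single `nn.Linear(768, 2)`. The short calculation below also shows why the percentage is printed as 0.00%: it is simply too small for two decimal places.

```python
hidden_size = 768   # size of BERT's [CLS] embedding
num_labels = 2      # number of output classes

# nn.Linear(768, 2) holds a (2 x 768) weight matrix plus 2 biases
head_params = hidden_size * num_labels + num_labels
total_params = 109_483_778          # total reported in the output above

print(head_params)                                   # 1538
print(f"{100 * head_params / total_params:.4f}%")    # 0.0014%
```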


## Adapters

Adding adapter layers after the attention and feed-forward layers.

The key idea is to create a "bottleneck" that forces the model to learn very compact, low-rank representations.
The adapter size is the hyperparameter that defines the dimension of the down projection.

### BottleneckAdapter
The adapter takes a hidden state of size hidden_size (768 for BERT), projects it down to the smaller adapter_size dimension with the down_project layer, applies a non-linearity, and projects it back up to hidden_size with the up_project layer.
The residual connection at the output lets the model learn an additive update instead of learning a representation from scratch, which is more stable. In this way, the adapter adds only a small adjustment on top of the pre-trained knowledge.

kaiming_uniform 

Since we're introducing a new layer inside the inner structure of a model, badly scaled random values could greatly degrade the performance of the model. That's why it's necessary to initially set the adapter weights to values that let the layer pick up new knowledge quickly. The kaiming_uniform_ function implements Kaiming (He) initialization. It draws the weights of a layer from a uniform distribution whose range is scaled by the number of input features of the layer (fan-in).
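
To make the scaling concrete: with its default arguments, `nn.init.kaiming_uniform_` samples from a uniform distribution on (-bound, bound) with bound = sqrt(6 / fan_in). The sketch below checks this on a tensor shaped like the weight of a 768-to-64 down projection; the size 64 is the default adapter size used later in this notebook, and the variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# nn.Linear(768, 64) stores its weight as (out_features, in_features)
weight = torch.empty(64, 768)
nn.init.kaiming_uniform_(weight)

fan_in = weight.shape[1]              # 768 input features
bound = math.sqrt(6.0 / fan_in)       # default gain sqrt(2) times sqrt(3 / fan_in)

print(f"theoretical bound: {bound:.4f}")
print(f"largest |weight|:  {weight.abs().max().item():.4f}")
```

A larger fan-in gives a smaller bound, which keeps the variance of the layer's output roughly independent of its input width.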

Bias

Bias is a constant value that is added to the output of a layer's matrix multiplication. In the linear transformation y = Wx + b, b is the bias, which gives each neuron an extra degree of freedom. In the adapter, the biases are initialized to zero.

In [8]:
class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size, adapter_size, dropout_rate=0.1):
        super().__init__()
        
        self.down_project = nn.Linear(hidden_size, adapter_size)  # down projection
        self.activation = nn.ReLU()  # non-linearity
        self.up_project = nn.Linear(adapter_size, hidden_size)    # up projection
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_norm = nn.LayerNorm(hidden_size)
        
        # Initialize adapter weights — not learned from pretraining, so good init is important!
        nn.init.kaiming_uniform_(self.down_project.weight)
        nn.init.zeros_(self.down_project.bias)
        nn.init.kaiming_uniform_(self.up_project.weight)
        nn.init.zeros_(self.up_project.bias)

    def forward(self, hidden_states):
        # Store original input for residual connection
        residual = hidden_states

        # Apply adapter: down-project -> non-linear -> up-project -> dropout
        x = self.down_project(hidden_states)
        x = self.activation(x)
        x = self.up_project(x)
        x = self.dropout(x)

        # Add residual and normalize
        output = residual + x
        output = self.layer_norm(output)
        return output


### AdapterTransformerLayer

Wrapper class that takes the original, pre-trained transformer layer and adds the new adapter layers.
It freezes the weights of the original transformer layer, so they will not be updated during training.

In the forward function, one adapter is placed after the attention mechanism and the other after the feed-forward network (FFN), as per the original adapter paper.

In [9]:
class AdapterTransformerLayer(nn.Module):
    def __init__(self, transformer_layer, adapter_size):
        super().__init__()
        self.layer = transformer_layer
        self.hidden_size = transformer_layer.attention.self.query.in_features

        # Freeze the original transformer block
        for param in self.layer.parameters():
            param.requires_grad = False

        # Add adapters
        self.attention_adapter = BottleneckAdapter(self.hidden_size, adapter_size)
        self.ffn_adapter = BottleneckAdapter(self.hidden_size, adapter_size)

    def forward(self, hidden_states, attention_mask=None, head_mask=None, *args, **kwargs):
        # BERT forward: attention -> add & norm -> ffn -> add & norm
        # Any extra arguments passed by BertEncoder (e.g. cross-attention inputs) are ignored.

        # Attention sublayer. BertAttention already applies its own
        # output projection and Add + Norm (frozen) internally.
        attention_output = self.layer.attention(
            hidden_states,
            attention_mask=attention_mask,
            head_mask=head_mask
        )[0]

        # Adapter after attention
        attention_output = self.attention_adapter(attention_output)

        # FFN sublayer: BertIntermediate, then BertOutput (dense + Add + Norm, frozen)
        intermediate_output = self.layer.intermediate(attention_output)
        ffn_output = self.layer.output(intermediate_output, attention_output)

        # Adapter after FFN
        output = self.ffn_adapter(ffn_output)

        # Assumption: BertEncoder reads the hidden states from index 0 of each layer's
        # output, so a tuple is returned. This may differ between transformers versions.
        return (output,)



### get_adapters_model

This function takes the model and traverses its layers. Each encoder layer is replaced by a wrapper layer that contains the adapter modules.

Finally, a classification head is appended so that the model can be used for classification tasks.

In [10]:
def get_adapters_model(base_model, adapter_size=64):
    for i in range(len(base_model.encoder.layer)):
        original_layer = base_model.encoder.layer[i]
        base_model.encoder.layer[i] = AdapterTransformerLayer(original_layer, adapter_size)
    
    classification_model = ClassificationModel(base_model)
    return classification_model
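
As a back-of-the-envelope check, the number of parameters that the adapters add can be computed from the layer shapes alone. The calculation below assumes the `BottleneckAdapter` defined above (two linear layers plus a LayerNorm), the default `adapter_size=64`, and that only the adapters and the classification head are trainable.

```python
hidden_size = 768
adapter_size = 64
num_layers = 12

down = hidden_size * adapter_size + adapter_size   # weight + bias of down_project
up = adapter_size * hidden_size + hidden_size      # weight + bias of up_project
layer_norm = 2 * hidden_size                       # LayerNorm scale and shift
per_adapter = down + up + layer_norm

# Two adapters per transformer layer (after attention and after the FFN)
adapter_params = 2 * num_layers * per_adapter
head_params = hidden_size * 2 + 2                  # nn.Linear(768, 2) classification head

print(per_adapter)                    # 100672
print(adapter_params + head_params)   # 2417666
```

Under these assumptions, roughly 2.4 million parameters are trained, a small fraction of the model's total.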

In [11]:
base = base_model
classification_model = get_adapters_model(base)
print(classification_model)

trainable_params = sum(p.numel() for p in classification_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in classification_model.parameters())

trainable_params_percent = 100 * trainable_params / total_params
print(f"Trainable parameters: {trainable_params} / {total_params} ({trainable_params_percent:.2f}%)")

ClassificationModel(
  (base_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x AdapterTransformerLayer(
          (layer): BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias

## LoRA

Apply LoRA to the query and value matrices inside the attention layers.

The core idea of LoRA is to approximate the change in a weight matrix (ΔW) with the product of two much smaller matrices, A and B. Instead of training the entire weight matrix, we train only A and B. The inner dimension shared by the two matrices is the rank hyperparameter. The alpha hyperparameter controls how strongly the update is added to the original matrix: the product is scaled by alpha / rank.

W_updated = W + ΔW = W + (alpha / rank) · BA

### LoRALayer

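Creates the two trainable matrices, A and B, that make up the low-rank decomposition. They are defined as nn.Parameter so that they are updated during training. In this implementation A has shape (in_features, rank) and B has shape (rank, out_features), i.e. they are stored as the transposes of the matrices in the formula above, so the update is computed as (x @ A) @ B.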

self.scaling = alpha / rank

The alpha hyperparameter scales the LoRA contribution, which helps to control the impact of the learned updates. Different papers propose different alpha-to-rank ratios.
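
A quick calculation shows how much smaller the LoRA matrices are than the weight matrix they adapt. The values below are the defaults used in this notebook (rank 8, alpha 32, 768-dimensional query and value projections in 12 layers); the variable names are illustrative.

```python
in_features = out_features = 768
rank, alpha = 8, 32
num_layers = 12
targets_per_layer = 2            # query and value

scaling = alpha / rank                                   # factor applied to the update
full_matrix = in_features * out_features                 # weights in one 768 x 768 matrix
lora_pair = in_features * rank + rank * out_features     # weights in A and B together

lora_params = num_layers * targets_per_layer * lora_pair

print(scaling)        # 4.0
print(full_matrix)    # 589824
print(lora_pair)      # 12288
print(lora_params)    # 294912
```

Each pair of LoRA matrices holds about 2% of the weights of the matrix it adapts.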

kaiming_uniform

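A crucial initialization step, similar to the one for adapters. A is initialized with Kaiming-uniform values and B with zeros, so their product is zero at the start of training and the wrapped layer initially behaves exactly like the pre-trained one.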
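
The effect of this initialization can be verified on plain tensors. The sketch below mirrors the shapes and the initialization calls used in this notebook's LoRA layer; the input batch of 4 random vectors is an assumption made for the example.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features, rank = 768, 768, 8

lora_A = torch.empty(in_features, rank)
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))   # A starts with random values
lora_B = torch.zeros(rank, out_features)           # B starts at zero

x = torch.randn(4, in_features)
delta = (x @ lora_A) @ lora_B

# Because B is all zeros, the LoRA branch contributes nothing at initialization
print(torch.count_nonzero(delta).item())   # 0
```

Training therefore starts from the pre-trained model's exact behavior, and the update grows only as B moves away from zero.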


In [12]:
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=32):
        super().__init__()
        self.rank = rank
        self.scaling = alpha / rank
        
        # LoRA weights
        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        
        # Initialize weights
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
    
    def forward(self, x):
        # LoRA contribution: scaling * (x @ A) @ B
        return self.scaling * (x @ self.lora_A) @ self.lora_B


### LoRALinear

A wrapper class around the linear layer that is passed in.
It freezes the weight and bias of the original layer and adds a LoRA layer whose output is summed with the original output; only the LoRA layer is updated during training.

In [13]:
class LoRALinear(nn.Module):
    def __init__(self, linear_layer, rank=8, alpha=32):
        super().__init__()
        self.linear = linear_layer
        
        # Freeze original weights
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False
            
        # Add LoRA components
        self.lora = LoRALayer(
            linear_layer.in_features, 
            linear_layer.out_features,
            rank=rank,
            alpha=alpha
        )
    
    def forward(self, x):
        # Combine original output with LoRA contribution
        return self.linear(x) + self.lora(x)
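
The two-branch forward pass above is equivalent to a single linear layer whose weight has been updated by the scaled low-rank product. The sketch below checks this numerically. It is not part of this notebook's pipeline: the random values standing in for trained LoRA weights and the separate `merged` layer are assumptions made for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rank, alpha = 8, 32
scaling = alpha / rank

linear = nn.Linear(768, 768)
lora_A = torch.randn(768, rank) * 0.01   # random values stand in for trained weights
lora_B = torch.randn(rank, 768) * 0.01

x = torch.randn(4, 768)

with torch.no_grad():
    # Two-branch form: frozen linear output plus the scaled LoRA contribution
    two_branch = linear(x) + scaling * (x @ lora_A) @ lora_B

    # Merged form: nn.Linear computes x @ W.T + b, so the update to W is (A @ B).T
    merged = nn.Linear(768, 768)
    merged.weight.copy_(linear.weight + scaling * (lora_A @ lora_B).T)
    merged.bias.copy_(linear.bias)
    merged_out = merged(x)

print(torch.allclose(two_branch, merged_out, atol=1e-4))
```

Keeping the branches separate during training is what allows the original weight to stay frozen; merging is only an option once training is finished.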


### get_lora_model

Responsible for injecting the LoRA layers into the base model

It freezes all weights of the base model, so only the new LoRA layers (together with the classification head) are trained.
LoRA wraps every linear layer whose name matches one of the target modules. In the BERT implementation, the attention projections targeted here are called 'query' and 'value'.

Finally, a classification head is appended so that the model can be used for classification tasks.

In [14]:
def get_lora_model(base_model, rank=8, alpha=32, target_modules=("query", "value")):
    # First, freeze all parameters
    for param in base_model.parameters():
        param.requires_grad = False
    
    # Then apply LoRA to target modules.
    # Snapshot the module list first, because modules are replaced inside the loop.
    for name, module in list(base_model.named_modules()):
        if any(target_name in name for target_name in target_modules):
            if isinstance(module, nn.Linear):
                # Get the parent module
                parent_name = '.'.join(name.split('.')[:-1])
                child_name = name.split('.')[-1]
                parent_module = base_model.get_submodule(parent_name)
                
                # Replace with LoRA version
                lora_layer = LoRALinear(module, rank=rank, alpha=alpha)
                setattr(parent_module, child_name, lora_layer)
    
    classification_model = ClassificationModel(base_model)
    return classification_model

In [15]:
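# Note: base_model was modified in place by the earlier cells (parameters frozen,
# encoder layers wrapped in AdapterTransformerLayer), so the model built here
# contains both adapters and LoRA, as the printout below shows. Reload BERT with
# AutoModel.from_pretrained before this cell to get a LoRA-only model.
base = base_model
classification_model = get_lora_model(base)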
print(classification_model)

trainable_params = sum(p.numel() for p in classification_model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in classification_model.parameters())

trainable_params_percent = 100 * trainable_params / total_params
print(f"Trainable parameters: {trainable_params} / {total_params} ({trainable_params_percent:.2f}%)")

ClassificationModel(
  (base_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x AdapterTransformerLayer(
          (layer): BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): LoRALinear(
                  (linear): Linear(in_features=768, out_features=768, bias=True)
                  (lora): LoRALayer()
                )
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): LoRALinear(
                  (linear): Linear(in_features=768, out_features=768, bias=True)
                  (lora): LoRALayer()
          

In [16]:
from hf_utils import save_model_to_hf

In [17]:
save_model_to_hf(classification_model, 'new_model')


HfHubHTTPError: (Request ID: Root=1-68b890c6-60b292304b8d560c70c93da7;c4e4a202-0499-40d6-8f5b-eb9b3601cf80)

403 Forbidden: You don't have the rights to create a model under the namespace "jovan23".
Cannot access content at: https://huggingface.co/api/repos/create.
Make sure your token has the correct permissions.