<a id='additional-resources'></a>

# BERT Fine Tuning with PyTorch Layers

## Overview

In [1]:
%pip install mermaid-py

import mermaid as md
from mermaid.graph import Graph

sequence = Graph('Astrobot',"""
flowchart TD
subgraph MIT Project
A@{ shape: docs, label: "Sample PDFs of Book" }-->C[/Behaviors Prompt/]
C-->D[(Supervised Learning Pairs Dataset)]
A-->B[/Chart Aspects Response/]
B-->D
E[[LLM]]-- pytorch --->F[[Custom Pytorch Layers]]
F-- transfer learning 
training ---G{{Fine tuned model}}
D-- training split --->F
D-- eval and test splits --->G
G-- training split --->H[/Loss Metric and initial metrics/]
G-- eval and test splits-->I[/Final Evaluation/]
end

""")
render = md.Mermaid(sequence)
render


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.




##

##

In [None]:
%pip install mermaid-py

Defaulting to user installation because normal site-packages is not writeable
Collecting mermaid-py
  Downloading mermaid_py-0.7.0-py3-none-any.whl.metadata (5.7 kB)
Downloading mermaid_py-0.7.0-py3-none-any.whl (31 kB)
Installing collected packages: mermaid-py
Successfully installed mermaid-py-0.7.0
Note: you may need to restart the kernel to use updated packages.




## Dataset - IMDB

[A subset of the Internet Movie Database (IMDB)](https://pytorch-geometric.readthedocs.io/en/2.4.0/generated/torch_geometric.datasets.IMDB.html), as collected in the “MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding” paper. IMDB is a heterogeneous graph containing three types of entities - movies (4,278 nodes), actors (5,257 nodes), and directors (2,081 nodes). The movies are divided into three classes (action, comedy, drama) according to their genre. Movie features correspond to elements of a bag-of-words representation of its plot keywords.

In [None]:
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
#from torchtune.models import TorchtuneModelWrapper  # Correct way to use torchtune

# Load Dataset
dataset = load_dataset("imdb")
train_texts, train_labels = dataset['train']['text'][:2000], dataset['train']['label'][:2000]
test_texts, test_labels = dataset['test']['text'][:500], dataset['test']['label'][:500]  

# Load Tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the Dataset
def tokenize_function(texts):
    return tokenizer(texts, padding="max_length", truncation=True, max_length=512)

train_encodings = tokenize_function(train_texts)
test_encodings = tokenize_function(test_texts)

# Convert to Torch Dataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])  # Ensure 'labels' key exists
        return item


Epoch,Training Loss,Validation Loss
1,0.0002,0.000137
2,0.0001,5.1e-05
3,0.0001,3.8e-05


TrainOutput(global_step=750, training_loss=0.004154805852022643, metrics={'train_runtime': 533.3514, 'train_samples_per_second': 11.25, 'train_steps_per_second': 1.406, 'total_flos': 0.0, 'train_loss': 0.004154805852022643, 'epoch': 3.0})

## Base Model

### [DiStillBERT](https://huggingface.co/distilbert/distilbert-base-uncased)

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives:

* Distillation loss: the model was trained to return the same probabilities as the BERT base model.
* Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
* Cosine embedding loss: the model was also trained to generate hidden states as close as possible as the BERT base model.

In [None]:

train_dataset = IMDbDataset(train_encodings, train_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# Define Custom Model
class CustomDistilBERT(nn.Module):
    def __init__(self):
        super(CustomDistilBERT, self).__init__()
        self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.custom_layer = nn.Linear(768, 256)  # Custom Layer
        self.activation = nn.LeakyReLU()         # Activation Function
        self.dropout = nn.Dropout(0.3)           # Regularization
        self.classifier = nn.Linear(256, 2)      # Output Layer for Binary Classification
        self.loss_fn = nn.CrossEntropyLoss()     # Loss function

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = outputs.last_hidden_state[:, 0, :]  # CLS token representation
        x = self.custom_layer(hidden_state)
        x = self.activation(x)
        x = self.dropout(x)
        logits = self.classifier(x)

        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return {"loss": loss, "logits": logits}

# ✅ Wrap Model with `torchtune` (if applying efficient fine-tuning methods)
wrapped_model = CustomDistilBERT()
#wrapped_model = TorchtuneModelWrapper(model)  # Only needed if using PEFT

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch"
)

# Use Hugging Face Trainer
trainer = Trainer(
    model=wrapped_model,  # Keep using HF Trainer
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Train the Model
trainer.train()


In [8]:
print(wrapped_model)

CustomDistilBERT(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1):

In [9]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
wrapped_model.to(device)

CustomDistilBERT(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1):

## Evaluation Results

In [13]:
import numpy as np
import evaluate

# Load the accuracy metric from the evaluate module
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate the model using the Trainer's built-in evaluation method
trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="eval")


{'eval_loss': 3.8206795579753816e-05,
 'eval_runtime': 6.3949,
 'eval_samples_per_second': 78.187,
 'eval_steps_per_second': 9.852,
 'epoch': 3.0}