# Transition Context: From RAG to Fine-Tuning

| Aspect             | Retrieval-Augmented Generation (RAG)             | Fine-Tuning (Full Precision)                                |
| ------------------ | ------------------------------------------------ | ----------------------------------------------------------- |
| **Core Idea**      | Retrieves relevant documents to guide generation | Trains the model itself on labeled data                     |
| **Data Ownership** | Can work on private data without retraining      | Needs supervised dataset aligned to task                    |
| **Adaptability**   | Plug-and-play; less compute-intensive            | Often more accurate on the target task, but needs GPU time  |
| **Use Case Fit**   | Great for exploratory QA, low-resource setups    | Better for classification, sentiment, domain-specific tasks |
| **Limitation**     | Model remains unchanged; can't learn from errors | Requires updates to model weights                           |


In [None]:
# Why Fine-Tuning after RAG?
# RAG is great for knowledge injection but doesn't adapt model behavior.
# Fine-tuning lets us specialize the model for tasks like emotion classification, medical triage, customer feedback analysis, etc.
# This session introduces full-precision (FP32) training, which, although compute-heavy, gives foundational insight into how models learn from data.

## Session Objectives

| Objective                                 | Description                                                                |
| ----------------------------------------- | -------------------------------------------------------------------------- |
| Understand full-precision fine-tuning     | Learn how to fine-tune a Hugging Face model without quantization           |
| Hands-on with a small dataset (`emotion`) | Prepare a dataset, tokenizer, model, trainer, and evaluate performance     |
| Compare to future sessions                | Set the stage for comparing with 8-bit and 16-bit fine-tuning (Session 17) |
| Learn evaluation metrics                  | Use metrics like accuracy to validate model performance                    |


In [None]:
# 1. What is Full Precision Fine-Tuning?
# Updates all weights in the model using 32-bit floating point arithmetic (FP32).
# FP32 avoids the rounding and overflow problems that lower-precision formats can introduce, so training is numerically stable, but it needs more GPU memory.

# 2. When to Use FP32 Fine-Tuning?
# When model precision and flexibility are more important than training cost.
# When performing academic experiments or benchmarking.
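
To make the memory cost concrete, here is a rough back-of-the-envelope estimate. It assumes an AdamW-style optimizer (two extra FP32 states per parameter) and roughly 110 million parameters for a BERT-base-sized model. Both figures are assumptions for illustration, and activation memory, which depends on batch size and sequence length, is not counted.

```python
# Rough FP32 fine-tuning memory estimate (illustrative assumptions, not a measurement).
BYTES_PER_FP32 = 4

def fp32_training_bytes(num_params: int, optimizer_states_per_param: int = 2) -> int:
    """Bytes for weights + gradients + optimizer states, all stored in FP32."""
    weights = num_params * BYTES_PER_FP32
    gradients = num_params * BYTES_PER_FP32
    optimizer = num_params * BYTES_PER_FP32 * optimizer_states_per_param
    return weights + gradients + optimizer

# Assumed (approximate) parameter count for a BERT-base-sized model.
approx_params = 110_000_000
total_gb = fp32_training_bytes(approx_params) / 1e9
print(f"~{total_gb:.2f} GB before activations")
```

Under these assumptions each parameter costs 16 bytes during training, which is why lower-precision variants are attractive on small GPUs.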

---




## How Fine-Tuning Helps: Before and After Examples

### Example 1: **Customer Support Ticket Classification**

| Input Prompt                                           | Model Type                             | Response                                       |
| ------------------------------------------------------ | -------------------------------------- | ---------------------------------------------- |
| *“My laptop shuts off automatically after 5 minutes.”* | **Pretrained model (e.g., base BERT)** | May classify as *“Other”* or *“Unknown issue”* |
|                                                        | **Fine-tuned on IT support tickets**   | Correctly classifies as *“Power issue”*        |

> **Explanation**: A base model lacks knowledge of internal company categories. Fine-tuning with labeled examples teaches the model specific categories like *"Power Issue"*, *"Screen Fault"*, *"Battery Problem"*, etc.

---

### Example 2: **Sentiment Classification in Finance**

| Input Text                                                               | Model Type                                 | Response   |
| ------------------------------------------------------------------------ | ------------------------------------------ | ---------- |
| *“The company has shown consistent growth and beat earnings estimates.”* | **Generic sentiment model**                | *Neutral*  |
|                                                                          | **Fine-tuned on financial sentiment data** | *Positive* |

> **Explanation**: Generic models may misinterpret domain-specific language. Fine-tuning aligns the model to **domain-specific sentiment** (in this case, finance).

---

### Example 3: **Medical Diagnosis from Symptoms**

| Input:           | *“Patient has persistent cough, shortness of breath, and chest pain.”* |
| ---------------- | ---------------------------------------------------------------------- |
| Base LLM         | Might return a vague or general answer like *“Consult a doctor”*       |
| Fine-tuned Model | Suggests *“Possible bronchitis or pneumonia; recommend chest X-ray”*   |

> **Explanation**: The base LLM avoids specifics. A fine-tuned model (trained on medical records or clinical notes) can make **task-specific, risk-aware predictions**.

---

### Example 4: **Emotion Detection in Text**

| Text                     | *“I can’t stop crying, I just lost my dog.”*  |
| ------------------------ | --------------------------------------------- |
| Pretrained Model         | Might say *“sad”* or mislabel as *“neutral”*  |
| Fine-tuned Emotion Model | Correctly classifies as *“grief”* or *“loss”* |

> **Explanation**: Fine-tuning with emotion-labeled datasets helps the model pick up **finer-grained emotional nuance** that a generic model tends to miss.

---

### Summary: Why Fine-Tuning?

| Feature                    | Base Pretrained Model      | Fine-Tuned Model                    |
| -------------------------- | -------------------------- | ----------------------------------- |
| Custom vocabulary handling | Limited                    | Learns in-domain terms              |
| Task-specific performance  | Generic                    | High accuracy on custom tasks       |
| Domain adaptation          | No                         | Yes (medical, legal, finance, etc.) |
| Flexibility for new labels | Fixed categories           | Learns new or custom labels easily  |
| Real-world readiness       | Needs prompt tuning or RAG | Task-ready with minimal inputs      |

---



In [None]:
# 1 Install dependencies
# !pip install -q datasets transformers evaluate

| Library        | Purpose                                                                                                                                |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `datasets`     | From Hugging Face — provides easy access to many public datasets (like IMDb, SST2, etc.) and lets you load and preprocess them easily. |
| `transformers` | Core library from Hugging Face — provides pre-trained models like BERT, GPT, etc., and tools for fine-tuning and using them.           |
| `evaluate`     | Also from Hugging Face — used to calculate model evaluation metrics (like accuracy, precision, recall, etc.) after training.           |
| `-q`           | Tells `pip` to run quietly (suppresses most output during installation).                                                               |


In [None]:
# Force install compatible latest versions
!pip install -q --upgrade transformers sentence-transformers datasets evaluate
# Restart runtime after this

| Library                 | Why it's needed for fine-tuning                                                    |
| ----------------------- | ---------------------------------------------------------------------------------- |
| `transformers`          | Main Hugging Face library — contains pre-trained models and fine-tuning pipelines. |
| `sentence-transformers` | Encodes sentences into embeddings (common in sentence-level tasks); not imported in this session's code. |
| `datasets`              | Used to load and preprocess text datasets.                                         |
| `evaluate`              | Lets you compute performance metrics like accuracy, F1, BLEU, etc.                 |
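
Because the install cell upgrades packages in place, it can help to confirm which versions are actually active after the runtime restart. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

report = installed_versions(["transformers", "sentence-transformers", "datasets", "evaluate"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the restart shows whether the upgrade took effect before any training code is executed.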


In [None]:
# 1 Install dependencies
# !pip install -q datasets transformers evaluate

# 2 Imports
from datasets import load_dataset # You use this to easily load a public dataset (like IMDb, Yelp, SST2, etc.) or your own dataset for training. It's essential for fine-tuning tasks.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding
)
# AutoTokenizer:
# A factory class that automatically loads the correct tokenizer associated with a given pretrained model.
# Why it's used:
# Transforms raw text into tokens and IDs that the model can understand.
# Automatically loads tokenization rules specific to models like BERT, RoBERTa, DistilBERT, etc.
# AutoModelForSequenceClassification:
# Automatically loads a pretrained model (like BERT) with an additional classification head on top (usually a linear layer).
# Why it's used:
# You use this class when your downstream task is classification (e.g., positive/negative sentiment, topic labeling, etc.).
# Saves time: no need to define your own architecture or head.
# Trainer:
# A high-level class that handles training, evaluation, checkpointing, and prediction.
# Why it's used:
# Simplifies the training loop; you don’t need to write boilerplate PyTorch code.
# Works with built-in features like logging, early stopping, distributed training, etc.
# Integrates seamlessly with Hugging Face datasets and models.
# TrainingArguments:
# A configuration class that holds all hyperparameters and settings needed for training.
# Why it's used:
# Lets you set batch size, learning rate, number of epochs, output directories, evaluation strategy,
# logging options, etc.
# Passed to the Trainer to control how training happens.
# DataCollatorWithPadding:
# Automatically pads all inputs in a batch to the length of the longest input in that batch.
# Why it's used:
# Makes sure the input tensors are the same shape within each batch.
# Avoids wasting memory with global max-length padding.
# Works well with dynamic-length sequences in classification tasks.

import evaluate
import numpy as np

# 3 Load and preprocess dataset
raw = load_dataset("emotion", split="train") # Loads the "emotion" dataset from Hugging Face’s datasets hub.
# The Emotion dataset is a widely used text classification benchmark that focuses on
# identifying the emotional tone of English-language sentences.
# It is hosted on the Hugging Face Datasets Hub.
# Task type: text classification
# Objective: predict the emotion expressed in a sentence (joy, anger, sadness, etc.)
# Data source: English Twitter messages
# Language: English
# Labels: 6 emotions (sadness, joy, love, anger, fear, surprise)
# Size: about 20,000 labeled examples in total; the train split loaded here has 16,000
# Creator: Saravia et al. (2018), from the paper "CARER: Contextualized Affect Representations for Emotion Recognition"
# Example Row from the Dataset:
# {
#   "text": "I'm so happy today!",
#   "label": 1
# }
# Where label 1 corresponds to joy.
# Typical Use Case:
# Train models (like BERT) to detect emotions in user messages, customer feedback, tweets, or chatbot conversations.
# Widely used in sentiment/emotion classification, social media analysis, and mental health applications.
# Why It's Used in Fine-Tuning:
# It's a multi-class classification task.
# Great for demonstrating how pre-trained transformers can be adapted to real-world NLP problems like emotion detection.

raw = raw.shuffle(seed=42).train_test_split(test_size=0.2)
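# Shuffles the dataset with a fixed seed (42), then holds out 20% of it as a test split (12,800 train / 3,200 test examples).
# Note: train_test_split shuffles again by default; pass seed=42 to it as well if you need the split itself to be reproducible.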

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # Loads the BERT base uncased tokenizer from Hugging Face
# “Uncased” means it will lowercase all words before tokenizing (e.g., "Happy" → "happy").
# Prepares the tokenizer so we can convert text like "I feel fantastic!" into input IDs and attention masks.
# BERT expects input in a certain format: [CLS] I feel fantastic ! [SEP]


def preprocess(batch): # This function will be applied to each batch to convert the raw text into token IDs that a transformer model can understand
    # Truncate only; padding is left to DataCollatorWithPadding, which pads each batch to its longest sequence.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = raw["train"].map(preprocess, batched=True)
eval_ds = raw["test"].map(preprocess, batched=True)

# 4 Load model
model = AutoModelForSequenceClassification.from_pretrained( # loads a pretrained BERT model with a classification layer on top.
    "bert-base-uncased", num_labels=len(raw["train"].features["label"].names)
)
# AutoModelForSequenceClassification is a special Hugging Face class that adds a
# classification head suitable for multi-class problems (like emotion classification).
# The model weights are downloaded from Hugging Face if not already cached.

# Prepare metrics
metric = evaluate.load("accuracy") # Loads the accuracy metric from Hugging Face’s evaluate library.
# This metric will be used to check how many predictions were correct during model evaluation.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return metric.compute(predictions=preds, references=labels)
# eval_pred is a tuple:
# >logits: raw outputs from the model (unnormalized scores)
# >labels: ground truth emotion labels from the test dataset
# logits shape: [batch_size, num_classes]

#  Prepare Trainer
training_args = TrainingArguments(
    output_dir="results", # Directory where logs and checkpoints would be saved if saving were enabled. Even though saving is disabled here (save_strategy="no"), some logs may still go here.
    eval_strategy="epoch", # Tells the trainer to evaluate the model once per epoch on the evaluation set. You’ll get validation accuracy at the end of each epoch.
    learning_rate=2e-5, # Sets the learning rate to 0.00002, which is a commonly used value when fine-tuning BERT. A smaller learning rate is used to avoid “destroying” the pretrained weights too quickly.

    # Specifies the batch size used per device (each GPU or TPU core, or the CPU if no accelerator is present).
    # If you're running on one GPU, this is the actual batch size.
    # Small batch size is often used when working on limited hardware (e.g., Google Colab).
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,

    num_train_epochs=2, # Sets the number of full passes through the training dataset to 2. 2 epochs is often sufficient for demonstration, but can be increased for better performance.
    save_strategy="no", # Disables saving of model checkpoints during training. This saves disk space and speeds up training, but you’ll lose model weights unless you save them manually later.
)

# Creates a data collator object.
# This ensures that each training/evaluation batch is padded dynamically to the longest sequence
#  in that batch.
# This is more efficient than padding everything to a global max length like 512 tokens.
# It uses the same tokenizer you defined earlier to perform this padding.
data_collator = DataCollatorWithPadding(tokenizer)


trainer = Trainer(
    model=model, # BERT model with classification head, loaded earlier
    args=training_args, # All training configuration (epochs, LR, batch size, etc.)
    train_dataset=train_ds, # Preprocessed and tokenized training dataset
    eval_dataset=eval_ds, # Tokenized test/validation dataset
    tokenizer=tokenizer, # Saved with the model and used for default padding (recent transformers releases prefer the argument name processing_class)
    data_collator=data_collator, # Ensures all inputs in a batch have the same length (efficient batching)
    compute_metrics=compute_metrics, # Function that returns evaluation metrics (like accuracy)
)

#  Train & evaluate
# Starts the fine-tuning process.
# Uses everything you set up earlier:
# - Model (BERT)
# - Tokenized training data
# - Training arguments (learning rate, batch size, epochs, etc.)
# - Metric function (accuracy)
# The model will be trained for 2 epochs (as per num_train_epochs=2).
# After each epoch, evaluation will be run automatically because of: eval_strategy="epoch"
trainer.train()

metrics = trainer.evaluate()
# After training is complete, this line evaluates the model on the validation set (eval_ds) once more.
# Returns a dictionary with the final evaluation metrics like:
# {'eval_loss': 0.28, 'eval_accuracy': 0.88, ...}

print(metrics)
# Simply prints the evaluation results to your notebook or console.
# Helps you understand how well the fine-tuned model is performing on unseen (validation) data.

Map:   0%|          | 0/12800 [00:00<?, ? examples/s]

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | 0.2523        | 0.260528        | 0.921562 |
| 2     | 0.1405        | 0.190841        | 0.934063 |
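
The Accuracy column comes from `compute_metrics`, which takes the argmax over each row of logits and compares it with the true label. The sketch below reproduces that calculation with plain NumPy on made-up logits, so it runs without downloading a model. The helper name `accuracy_from_logits` and all numbers are invented for illustration.

```python
import numpy as np

def accuracy_from_logits(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = np.argmax(logits, axis=1)  # one predicted class ID per example
    return float((preds == labels).mean())

# Made-up logits for 4 examples and 3 classes.
toy_logits = np.array([
    [2.0, 0.1, -1.0],   # argmax is class 0
    [0.3, 1.5, 0.2],    # argmax is class 1
    [-0.5, 0.0, 0.9],   # argmax is class 2
    [1.2, 1.1, 0.4],    # argmax is class 0
])
toy_labels = np.array([0, 1, 2, 1])

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of the 4 predictions match, so 0.75
```

Note that argmax ignores how confident the model is: the last row is a near-tie between classes 0 and 1, yet it counts as fully wrong.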


{'eval_loss': 0.19084063172340393, 'eval_accuracy': 0.9340625, 'eval_runtime': 6.7051, 'eval_samples_per_second': 477.252, 'eval_steps_per_second': 59.656, 'epoch': 2.0}
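
The throughput fields in this dictionary can be cross-checked against the setup: 3,200 evaluation examples with `per_device_eval_batch_size=8` should take 400 steps. The sketch below redoes that arithmetic. It assumes a single device, which the per-device batch size alone does not guarantee.

```python
import math

train_examples = 12800      # 80% of the 16,000-example train split
eval_examples = 3200        # the remaining 20%
batch_size = 8              # per_device_*_batch_size, assuming one device
epochs = 2

eval_steps = math.ceil(eval_examples / batch_size)
train_steps = math.ceil(train_examples / batch_size) * epochs

# Values copied from the metrics dictionary above.
eval_runtime = 6.7051
samples_per_second = 477.252
steps_per_second = 59.656

print(eval_steps, train_steps)                    # 400 3200
print(round(samples_per_second * eval_runtime))   # 3200
print(round(steps_per_second * eval_runtime))     # 400
```

The reported rates multiplied by the runtime recover the example and step counts, which is a quick way to confirm that the whole evaluation split was scored.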


In [None]:
#  Install dependencies
# !pip install -q datasets transformers evaluate

#  Imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding
)
import evaluate
import numpy as np
import torch

#  Load and preprocess dataset
raw = load_dataset("emotion", split="train")
raw = raw.shuffle(seed=42).train_test_split(test_size=0.2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    # Truncate only; DataCollatorWithPadding pads each batch to its longest sequence.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = raw["train"].map(preprocess, batched=True)
eval_ds = raw["test"].map(preprocess, batched=True)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(raw["train"].features["label"].names)
)

# Prepare metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return metric.compute(predictions=preds, references=labels)

# Prepare Trainer
training_args = TrainingArguments(
    output_dir="results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    save_strategy="no",
    logging_steps=10,
)

data_collator = DataCollatorWithPadding(tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Helper function: Predict emotion for given list of texts
# Defines a function named predict_emotions.
# Inputs:
# >texts: list of raw text strings (e.g., ["I'm so happy", "I'm scared"])
# >model: the fine-tuned BERT model
# >label_names: list mapping label IDs to string names (e.g., ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])
def predict_emotions(texts, model, label_names):
    results = []
    model.eval()
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
        with torch.no_grad(): # Runs the model in inference mode (no gradient calculation).
            logits = model(**inputs).logits # Gets the raw logits (scores for each class/emotion) from the model.
        pred_label = label_names[logits.argmax(-1).item()]
        # logits.argmax(-1): finds the index of the highest score → predicted label ID
        # .item(): extracts the Python int from tensor
        # label_names[...]: maps the predicted label ID to the corresponding emotion name
        results.append((text, pred_label))
    return results

label_names = raw["train"].features["label"].names
sample_texts = [
    "I'm so frustrated with everything happening right now.",
    "I just got promoted and I’m feeling amazing!",
    "Why does everything bad happen to me?",
    "I'm laughing so hard at this meme!",
    "I feel very calm and peaceful today.",
    "I miss her so much, it hurts.",
    "This is the worst experience of my life."
]

# Evaluate BEFORE fine-tuning
print("🔍 Performance BEFORE fine-tuning:")
metrics_before = trainer.evaluate()
print(metrics_before)

print("\n📌 Predictions BEFORE fine-tuning:")
before_preds = predict_emotions(sample_texts, model, label_names)
for text, label in before_preds:
    print(f"Text: {text}\nPredicted Emotion: {label}\n")

# Train the model
trainer.train()

# Evaluate AFTER fine-tuning
print("🔍 Performance AFTER fine-tuning:")
metrics_after = trainer.evaluate()
print(metrics_after)

print("\n📌 Predictions AFTER fine-tuning:")
after_preds = predict_emotions(sample_texts, model, label_names)
for text, label in after_preds:
    print(f"Text: {text}\nPredicted Emotion: {label}\n")

# Side-by-side comparison
print("\n✅ Accuracy Comparison:")
print(f"Before fine-tuning: {metrics_before['eval_accuracy']:.4f}")
print(f"After fine-tuning : {metrics_after['eval_accuracy']:.4f}")


Map:   0%|          | 0/12800 [00:00<?, ? examples/s]

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


🔍 Performance BEFORE fine-tuning:



{'eval_loss': 1.7641874551773071, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.2703125, 'eval_runtime': 6.8135, 'eval_samples_per_second': 469.657, 'eval_steps_per_second': 58.707}

📌 Predictions BEFORE fine-tuning:
Text: I'm so frustrated with everything happening right now.
Predicted Emotion: joy

Text: I just got promoted and I’m feeling amazing!
Predicted Emotion: joy

Text: Why does everything bad happen to me?
Predicted Emotion: joy

Text: I'm laughing so hard at this meme!
Predicted Emotion: joy

Text: I feel very calm and peaceful today.
Predicted Emotion: joy

Text: I miss her so much, it hurts.
Predicted Emotion: joy

Text: This is the worst experience of my life.
Predicted Emotion: joy



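| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy |
| ----- | ------------- | --------------- | ---------------------- | -------- |
| 1     | 0.2453        | 0.219166        | 0.0041                 | 0.931875 |
| 2     | 0.0407        | 0.171208        | 0.0041                 | 0.94     |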


🔍 Performance AFTER fine-tuning:


{'eval_loss': 0.17120762169361115, 'eval_model_preparation_time': 0.0041, 'eval_accuracy': 0.94, 'eval_runtime': 6.7158, 'eval_samples_per_second': 476.491, 'eval_steps_per_second': 59.561, 'epoch': 2.0}

📌 Predictions AFTER fine-tuning:
Text: I'm so frustrated with everything happening right now.
Predicted Emotion: anger

Text: I just got promoted and I’m feeling amazing!
Predicted Emotion: joy

Text: Why does everything bad happen to me?
Predicted Emotion: sadness

Text: I'm laughing so hard at this meme!
Predicted Emotion: joy

Text: I feel very calm and peaceful today.
Predicted Emotion: joy

Text: I miss her so much, it hurts.
Predicted Emotion: sadness

Text: This is the worst experience of my life.
Predicted Emotion: sadness


✅ Accuracy Comparison:


Before fine-tuning: 0.2703
After fine-tuning : 0.9400
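
The "before" loss is close to what an untrained classification head would be expected to produce. With six classes, a model that spreads probability evenly over the labels has a cross-entropy of ln 6. The short check below compares that value with the `eval_loss` reported before fine-tuning.

```python
import math

num_classes = 6  # sadness, joy, love, anger, fear, surprise

# Cross-entropy of a uniform prediction over num_classes labels.
uniform_loss = -math.log(1.0 / num_classes)

reported_loss_before = 1.7641874551773071  # eval_loss reported before fine-tuning
print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"reported loss     : {reported_loss_before:.4f}")
print(f"difference        : {abs(uniform_loss - reported_loss_before):.4f}")
```

The small gap suggests that the freshly initialized head starts out roughly uninformed, and the drop to about 0.17 after two epochs is what fine-tuning contributes.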


---
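
The collator comments earlier contrast per-batch (dynamic) padding with padding every example to a global maximum length. The toy sketch below shows the difference on lists of token IDs. `pad_batch` is a hypothetical helper written for illustration; it is not how `DataCollatorWithPadding` is implemented internally, and the token IDs are made up.

```python
def pad_batch(sequences, pad_id=0, target_len=None):
    """Right-pad each sequence with pad_id; the default target is the longest in the batch."""
    if target_len is None:
        target_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token ID sequences of different lengths.
batch = [[101, 1045, 2514, 102], [101, 2204, 102], [101, 1045, 2293, 2023, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)        # pads to the longest sequence (5)
fixed_ids, _ = pad_batch(batch, target_len=128)     # pads everything to 128

print(len(dynamic_ids[0]), len(fixed_ids[0]))       # 5 128
print(dynamic_mask[1])                              # [1, 1, 1, 0, 0]
```

For short texts such as tweets, per-batch padding keeps tensors much smaller than a fixed length of 128, which is where the memory and speed savings come from.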