<a href="https://colab.research.google.com/github/pushan9/Colab-notebook/blob/main/1_Fine_Tuning_LLMs_(FP32_Training).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Training a full LLM from scratch on a laptop is not practical because of:
# Insufficient memory
# Insufficient compute
# Insufficient power
# Insufficient data
# Lack of distributed infrastructure

# It’s an industrial-scale problem, not a personal-computing one.

# Transition Context: From RAG to Fine-Tuning

| Aspect             | Retrieval-Augmented Generation (RAG)             | Fine-Tuning (Full Precision)                                |
| ------------------ | ------------------------------------------------ | ----------------------------------------------------------- |
| **Core Idea**      | Retrieves relevant documents to guide generation | Trains the model itself on labeled data                     |
| **Data Ownership** | Can work on private data without retraining      | Needs supervised dataset aligned to task                    |
| **Adaptability**   | Plug-and-play; less compute-intensive            | More accurate, but needs GPU + training time                |
| **Use Case Fit**   | Great for exploratory QA, low-resource setups    | Better for classification, sentiment, domain-specific tasks |
| **Limitation**     | Model remains unchanged; can't learn from errors | Requires updates to model weights                           |


In [None]:
# Why Fine-Tuning after RAG?
# RAG is great for knowledge injection but doesn't adapt model behavior.
# Fine-tuning lets us specialize the model for tasks like emotion classification, medical triage, customer feedback analysis, etc.
# This session introduces full-precision (FP32) training, which, although compute-heavy, gives foundational insight into how models learn from data.

## Session Objectives

| Objective                                 | Description                                                                |
| ----------------------------------------- | -------------------------------------------------------------------------- |
| Understand full-precision fine-tuning     | Learn how to fine-tune a Hugging Face model without quantization           |
| Hands-on with a small dataset (`emotion`) | Prepare a dataset, tokenizer, model, trainer, and evaluate performance     |
| Compare to future sessions                | Set the stage for comparing with 8-bit and 16-bit fine-tuning (Session 17) |
| Learn evaluation metrics                  | Use metrics like accuracy to validate model performance                    |
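
As a rough illustration of the accuracy metric named in the last objective, the sketch here picks the highest-scoring class for each example and counts how often it matches the true label. The logits and labels are made-up values for illustration only, and the helper names `argmax` and `accuracy` are hypothetical; the notebook itself relies on the `evaluate` library for this.

```python
def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(logits_batch, labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    preds = [argmax(row) for row in logits_batch]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits for three sentences over three classes.
logits_batch = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.9, 0.8, -0.5]]
labels = [0, 1, 1]

# Predictions are [0, 1, 0], so two of the three labels match.
print(accuracy(logits_batch, labels))
```

Accuracy is easy to read but hides per-class behavior, so it is most informative when the classes are reasonably balanced.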


In [None]:
# 1. What is Full Precision Fine-Tuning?
# Updates all weights in the model using 32-bit floating point arithmetic (FP32).
# FP32 keeps full 32-bit precision for weights, gradients, and optimizer states, which makes training numerically stable, but it needs more GPU memory than 16-bit or 8-bit training.

# 2. When to Use FP32 Fine-Tuning?
# When model precision and flexibility are more important than training cost.
# When performing academic experiments or benchmarking.
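
One way to see why FP32 is considered the numerically safe baseline is to store the same small weight update in 32-bit and 16-bit floats. The sketch uses only Python's standard `struct` module; the weight value and the update size are hypothetical, chosen to be on the order of the learning rate used later in this notebook.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

weight = 1.0
update = 2e-5  # hypothetical update, same order of magnitude as the learning rate

as_fp32 = round_trip(weight + update, "<f")  # 32-bit float
as_fp16 = round_trip(weight + update, "<e")  # 16-bit float

print(as_fp32 > weight)   # True: the update survives in FP32
print(as_fp16 == weight)  # True: the update is rounded away in FP16
```

Near 1.0, adjacent FP16 values are about 0.001 apart, so an update this small disappears entirely. Lower-precision training schemes need extra machinery to work around this, which is part of why FP32 serves as the reference point.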

---

## How Fine-Tuning Helps: Before and After Examples

### Example 1: **Customer Support Ticket Classification**

| Input Prompt                                           | Model Type                             | Response                                       |
| ------------------------------------------------------ | -------------------------------------- | ---------------------------------------------- |
| *“My laptop shuts off automatically after 5 minutes.”* | **Pretrained model (e.g., base BERT)** | May classify as *“Other”* or *“Unknown issue”* |
|                                                        | **Fine-tuned on IT support tickets**   | Correctly classifies as *“Power issue”*        |

> **Explanation**: A base model lacks knowledge of internal company categories. Fine-tuning with labeled examples teaches the model specific categories like *"Power Issue"*, *"Screen Fault"*, *"Battery Problem"*, etc.

---

### Example 2: **Sentiment Classification in Finance**

| Input Text                                                               | Model Type                                 | Response   |
| ------------------------------------------------------------------------ | ------------------------------------------ | ---------- |
| *“The company has shown consistent growth and beat earnings estimates.”* | **Generic sentiment model**                | *Neutral*  |
|                                                                          | **Fine-tuned on financial sentiment data** | *Positive* |

> **Explanation**: Generic models may misinterpret domain-specific language. Fine-tuning aligns the model to **domain-specific sentiment** (in this case, finance).

---

### Example 3: **Medical Diagnosis from Symptoms**

| Input:           | *“Patient has persistent cough, shortness of breath, and chest pain.”* |
| ---------------- | ---------------------------------------------------------------------- |
| Base LLM         | Might return a vague or general answer like *“Consult a doctor”*       |
| Fine-tuned Model | Suggests *“Possible bronchitis or pneumonia; recommend chest X-ray”*   |

> **Explanation**: The base LLM avoids specifics. A fine-tuned model (trained on medical records or clinical notes) can make **task-specific, risk-aware predictions**.

---

### Example 4: **Emotion Detection in Text**

| Text                     | *“I can’t stop crying, I just lost my dog.”*  |
| ------------------------ | --------------------------------------------- |
| Pretrained Model         | Might say *“sad”* or mislabel as *“neutral”*  |
| Fine-tuned Emotion Model | Correctly classifies as *“grief”* or *“loss”* |

> **Explanation**: Fine-tuning with emotion-labeled datasets improves **empathy and nuance detection** in model predictions.

---

### Summary: Why Fine-Tuning?

| Feature                    | Base Pretrained Model      | Fine-Tuned Model                    |
| -------------------------- | -------------------------- | ----------------------------------- |
| Custom vocabulary handling | Limited                    | Learns in-domain terms              |
| Task-specific performance  | Generic                    | High accuracy on custom tasks       |
| Domain adaptation          | No                         | Yes (medical, legal, finance, etc.) |
| Flexibility for new labels | Fixed categories           | Learns new or custom labels easily  |
| Real-world readiness       | Needs prompt tuning or RAG | Task-ready with minimal inputs      |

---

### **Problem Statement**

In this exercise, the goal is to **fine-tune a pre-trained BERT model** (`bert-base-uncased`) for the task of **emotion classification** using the Hugging Face `emotion` dataset. The aim is to transform a general-purpose language model into a task-specific model capable of classifying text inputs into predefined emotion categories with improved accuracy.

---

### **Objectives**

1. **Understand the role of fine-tuning in NLP**
   Learn how pre-trained transformer models can be adapted to specific downstream tasks by training on labeled data.

2. **Prepare and tokenize a real-world dataset**
   Load the `emotion` dataset using the Hugging Face `datasets` library, and apply consistent tokenization for sequence classification tasks.

3. **Load and configure a transformer-based classification model**
   Use the BERT model as a base and configure it to output emotion labels by setting the number of output labels on its classification head.

4. **Set up evaluation metrics and training logic**
   Apply the `evaluate` library to calculate accuracy, and configure the `Trainer` API from Hugging Face to handle model training and evaluation.

5. **Experiment with hyperparameters and training arguments**
   Define learning rate, batch size, number of epochs, and evaluation strategy using the `TrainingArguments` class.

6. **Train the model and monitor performance**
   Perform training using the `Trainer.train()` method and evaluate the model on the test dataset to assess accuracy.

7. **Understand implications of full-precision fine-tuning (FP32)**
   All training is done in full-precision mode, which is compute-intensive and serves as a baseline for future comparisons with optimized fine-tuning techniques (e.g., 8-bit, 16-bit precision).
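
To make the "compute-intensive" point in objective 7 more concrete, here is a back-of-the-envelope memory estimate. It assumes roughly 110 million parameters for `bert-base-uncased` and an Adam-style optimizer that keeps two extra values per parameter; both figures are assumptions rather than measurements from this notebook, and activation memory is not counted.

```python
BYTES_PER_FP32 = 4

def fp32_training_memory_gb(n_params: int, optimizer_states: int = 2) -> float:
    """Rough memory for weights + gradients + optimizer states, all stored in FP32.

    Activations and framework overhead are not included.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer states
    return n_params * BYTES_PER_FP32 * copies / 1e9

n_params = 110_000_000  # assumption: approximate size of bert-base-uncased

print(f"weights only: {n_params * BYTES_PER_FP32 / 1e9:.2f} GB")
print(f"weights + gradients + optimizer states: {fp32_training_memory_gb(n_params):.2f} GB")
```

Storing the same tensors in 16-bit or 8-bit formats shrinks these numbers, which is the motivation for the lower-precision sessions that follow.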

---



In [None]:
# 1 Install dependencies
# !pip install -q datasets transformers evaluate

In [None]:
# Install or upgrade to the latest versions of the required libraries
!pip install -q --upgrade transformers sentence-transformers datasets evaluate
# Restart the runtime after this cell so the upgraded packages are loaded.

# transformers: pretrained models, tokenizers, and the Trainer API.
# sentence-transformers: converts text into meaningful vectors (embeddings).
# datasets: easy access to ready-to-use NLP datasets (like "emotion"), with support for streaming,
# preprocessing, and train-test splitting.
# evaluate: easy-to-use functions for computing metrics such as accuracy, precision, and recall.
# Here, it computes accuracy during evaluation.

In [None]:
# 1 Install dependencies
# !pip install -q datasets transformers evaluate

# 2 Imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding
)
# AutoTokenizer: converts text into the token IDs that the model understands.
# AutoModelForSequenceClassification: loads a pretrained language model with a classification head on top.
# Trainer: provides a high-level API to train and evaluate Hugging Face models.
# TrainingArguments: configures settings such as learning rate, batch size, and number of epochs for the training run.
# DataCollatorWithPadding: handles dynamic padding during batching (makes all examples in a batch the same length).
# "I love AI"            → 3 tokens
# "AI is very powerful"  → 4 tokens
# (illustrative word-level counts; the real tokenizer also adds special tokens)
# After padding, the lengths become:
# [3, 4] → [4, 4]


import evaluate
import numpy as np

# 3 Load and preprocess dataset
# https://huggingface.co/datasets/dair-ai/emotion
raw = load_dataset("emotion", split="train")
# Loads the "emotion" dataset from Hugging Face's dataset hub.
# This dataset contains text labeled with one of several emotions (like joy, anger, sadness, etc.).
# The split="train" means you're loading the training part of the dataset.

# Shuffle the rows, then hold out 20% for evaluation.
# Passing seed=42 to both calls makes the row order and the split identical on every run.
raw = raw.shuffle(seed=42).train_test_split(test_size=0.2, seed=42)

# Loads the tokenizer that is specifically designed for the bert-base-uncased model. This tokenizer will:
# – Lowercase text,
# – Split text into tokens,
# – Convert them into input IDs and attention masks that BERT understands.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = raw["train"].map(preprocess, batched=True)
eval_ds = raw["test"].map(preprocess, batched=True)

# 4 Load model: Loads a BERT model that is already trained on a large dataset, and customizes it for sequence classification (e.g., emotion classification).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(raw["train"].features["label"].names)
)
# num_labels: the number of emotion categories. It is read from the label names
# stored in the dataset's "label" feature (six for this dataset).

# Prepare metrics
metric = evaluate.load("accuracy") # loads the "accuracy" metric from the evaluate library

# This function will be automatically called by the Trainer during evaluation
def compute_metrics(eval_pred): # eval_pred - It is a tuple provided by Hugging Face: (logits, labels)
    logits, labels = eval_pred # Takes model predictions (logits) and actual labels.
    # What are logits?
    # Logits are raw scores, not probabilities. eg: [ -1.2, 3.5, 0.8, -0.4, -2.1, 0.1 ]
    # Each number corresponds to one emotion: eg: [sadness, joy, love, anger, fear, surprise]
    # Higher score = model thinks that class is more likely.
    preds = np.argmax(logits, axis=1) # Converts raw prediction scores (logits) into actual class predictions by picking the class with the highest score.
    # For the example above, argmax returns index 1 (the score 3.5), which corresponds to joy.
    return metric.compute(predictions=preds, references=labels) # Computes accuracy by comparing model predictions (preds) with true answers (labels).

#  Prepare Trainer
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir="results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    save_strategy="no",
)
# | **Parameter**                   | **Meaning**                                                                     |
# | ------------------------------- | ------------------------------------------------------------------------------- |
# | `output_dir="results"`          | Where to save model checkpoints and logs.                                       |
# | `eval_strategy="epoch"`         | Evaluate the model **after every epoch** (one full pass through training data). |
# | `learning_rate=2e-5`            | Small step size for updating model weights. Commonly used for BERT fine-tuning. |
# | `per_device_train_batch_size=8` | Train on 8 examples at a time per GPU or CPU.                                   |
# | `per_device_eval_batch_size=8`  | Evaluate on 8 examples at a time.                                               |
# | `num_train_epochs=2`            | Train for 2 complete passes through the data.                                   |
# | `save_strategy="no"`            | Do **not** save intermediate checkpoints (to save space).                       |

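# Pads every batch to the length of its longest sequence (dynamic padding).
data_collator = DataCollatorWithPadding(tokenizer)
# Training processes data in batches, but sentences differ in length, e.g.:
# "I am sad"
# "This makes me extremely happy today"
# The collator pads the shorter sequences so each batch forms one rectangular tensor,
# which keeps batches GPU-friendly and prevents shape errors.
# Note: preprocess() already pads every example to max_length=128, so the collator has nothing left to pad here.
# Removing padding="max_length" from preprocess() would let it pad per batch instead.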

# This object wraps everything and handles the full training + evaluation pipeline.
trainer = Trainer(
    model=model, # The BERT model for sequence classification.
    args=training_args, # Training configurations defined above.
    train_dataset=train_ds, # The tokenized and preprocessed training dataset.
    eval_dataset=eval_ds, # The tokenized and preprocessed test dataset.
    processing_class=tokenizer, # Tokenizer for data preparation and decoding (recent transformers releases deprecate the older tokenizer= argument).
    data_collator=data_collator, # Ensures padding is applied as needed
    compute_metrics=compute_metrics, # Evaluates performance using accuracy.
)

#  Train & evaluate
trainer.train() # Starts fine-tuning the BERT model on the emotion classification dataset using the configurations.
metrics = trainer.evaluate() # Evaluates the fine-tuned model on eval_ds (the held-out 20% of the train split) and returns the loss, accuracy, and timing metrics.
print(metrics)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ------------- | --------------- | -------- |
| 1     | 0.2523        | 0.260528        | 0.921562 |
| 2     | 0.1405        | 0.190841        | 0.934063 |


{'eval_loss': 0.19084063172340393, 'eval_accuracy': 0.9340625, 'eval_runtime': 6.7051, 'eval_samples_per_second': 477.252, 'eval_steps_per_second': 59.656, 'epoch': 2.0}
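
The throughput figures in the metrics above follow from simple arithmetic on the dataset and batch sizes. The sketch assumes a single device and no gradient accumulation (neither is stated explicitly in the notebook), and takes the 12,800 / 3,200 row counts from the 80/20 split of the 16,000-row train split.

```python
import math

train_examples = 12_800  # 80% of the 16,000-row train split
eval_examples = 3_200    # held-out 20%
batch_size = 8
epochs = 2

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / batch_size)

eval_runtime = 6.7051  # seconds, copied from the metrics above

print(steps_per_epoch, total_steps, eval_steps)
print(round(eval_examples / eval_runtime, 1))  # compare with eval_samples_per_second
print(round(eval_steps / eval_runtime, 1))     # compare with eval_steps_per_second
```

Doubling the batch size would halve the number of optimizer steps per epoch, at the cost of more memory per step.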


### Problem Statement

Build and fine-tune a **BERT-based sequence classification model** on an emotion-labeled dataset to predict the emotional sentiment expressed in a given sentence. The goal is to measure the model’s performance **before and after fine-tuning**, and evaluate the impact of full-precision fine-tuning on classification accuracy and prediction reliability.

---

### Objectives

1. **Dataset Utilization**:

   * Use the `emotion` dataset from Hugging Face Datasets, which contains short English sentences labeled with emotions such as *joy*, *anger*, *fear*, *sadness*, etc.

2. **Tokenization & Preprocessing**:

   * Apply `bert-base-uncased` tokenizer with proper padding and truncation to standardize inputs for the BERT model.

3. **Model Initialization**:

   * Load a pre-trained `bert-base-uncased` model for **sequence classification** and adapt it to handle the number of emotion labels in the dataset.

4. **Fine-Tuning Setup**:

   * Define training arguments such as batch size, learning rate, and number of epochs.
   * Configure the Hugging Face `Trainer` class for seamless training and evaluation.

5. **Evaluation Before Fine-Tuning**:

   * Evaluate and predict emotion labels on a sample text set **before** any model training to establish a baseline.

6. **Full Precision Fine-Tuning**:

   * Train the model on the training dataset using FP32 precision.
   * Apply supervised learning for 2 epochs and log performance metrics.

7. **Evaluation After Fine-Tuning**:

   * Re-evaluate the model and re-predict on the same sample text set **after** training.
   * Compare metrics (especially accuracy) before and after fine-tuning.

8. **Result Interpretation**:

   * Display emotion predictions and accuracy **side-by-side** to understand model improvements.
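
The padding and truncation step in objective 2 can be sketched in plain Python. This is a simplified stand-in for what a tokenizer or data collator does, not the Hugging Face implementation: the function name `pad_batch` and the token IDs are hypothetical, and the truncation here simply cuts the list, whereas a real tokenizer also preserves the closing special token.

```python
PAD_ID = 0  # bert-base-uncased uses ID 0 for its [PAD] token

def pad_batch(batch, max_length=128):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for two sentences of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
padded = pad_batch(batch)

print(padded["input_ids"][0])       # shorter sequence, padded to length 7
print(padded["attention_mask"][0])  # four real tokens followed by three padding positions
```

Padding to the longest sequence in each batch, rather than to a fixed 128 tokens, keeps batches of short sentences small.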

---

### How This Version Is Better Than the Previous One

| **Aspect**               | **Previous Version**             | **Current Version (Enhanced)**                                                      |
| ------------------------ | -------------------------------- | ----------------------------------------------------------------------------------- |
| **Evaluation Scope**     | Only evaluated after fine-tuning | Evaluates both *before* and *after* fine-tuning                                     |
| **Prediction Insight**   | No example predictions shown     | Predicts emotion labels on **real sample texts** to show qualitative improvement    |
| **Helper Function**      | Not included                     | `predict_emotions()` added to make predictions usable outside Trainer               |
| **Comparative Accuracy** | Not shown                        | Clearly prints **accuracy comparison** before vs after fine-tuning                  |
| **Logging Steps**        | Not configured                   | Adds `logging_steps=10` to monitor training progress during long runs               |
| **Educational Value**    | Limited to metrics               | Great for demonstrating the impact of fine-tuning both numerically and semantically |

---

### Summary of Key Benefits

* Helps visualize model improvement clearly.
* Provides real-world understanding through sentence-level predictions.
* Gives a complete fine-tuning lifecycle with **baseline**, **training**, and **evaluation** phases.
* Enhances reproducibility with logging, evaluation strategy, and fixed seed.



---

The `"emotion"` dataset from Hugging Face (part of the `dair-ai/emotion` collection) is a **text classification dataset** containing short English sentences labeled with one of six emotions.

When you run:

```python
from datasets import load_dataset
raw = load_dataset("emotion", split="train")
```

you get a `Dataset` object like this:

```
Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})
```

Here’s what it looks like if you inspect a few rows:

```python
>>> raw[0]
{'text': 'i didnt feel humiliated', 'label': 0}
```

If you decode the label with the dataset’s features:

```python
>>> raw.features['label'].names
['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
```

Then for example:

| text                                                                                                         | label_id | label   |
| ------------------------------------------------------------------------------------------------------------ | -------- | ------- |
| i didnt feel humiliated                                                                                      | 0        | sadness |
| i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake | 1        | joy     |
| im grabbing a minute to post i feel greedy wrong                                                             | 0        | sadness |
| i am ever feeling nostalgic about the fire place i will know that it is still on the property                | 0        | sadness |
| i am feeling grouchy                                                                                         | 3        | anger   |
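
The mapping between label IDs and names in the table above can be reproduced with a plain lookup. The label names are copied from the dataset output shown earlier; `id2label` and `label2id` are illustrative variable names, not something the notebook defines.

```python
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Build lookups in both directions.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

label_ids = [0, 1, 0, 0, 3]  # label_id column of the five rows in the table
print([id2label[i] for i in label_ids])
print(label2id["anger"])
```

With the real dataset object, `raw.features["label"].int2str(...)` performs the same ID-to-name lookup.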

---

✅ **Summary:**

* **Total samples:** 16,000 (train), 2,000 (validation), 2,000 (test)
* **Features:**

  * `text`: the sentence (string)
  * `label`: the emotion (categorical int 0–5)
* **Labels:** `['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']`
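
The notebook trains on 80% of the 16,000-row train split and evaluates on the remaining 20%. A seeded shuffle-and-split over row indices illustrates the idea; this is a plain-Python sketch with a hypothetical helper name, not how the `datasets` library implements it internally.

```python
import random

def shuffled_split(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, eval_idx = shuffled_split(16_000, 0.2, seed=42)
print(len(train_idx), len(eval_idx))

# The same seed always yields the same split.
print(shuffled_split(16_000, 0.2, seed=42) == (train_idx, eval_idx))
```

Fixing the seed matters when comparing runs: without it, accuracy differences could come from a different held-out set rather than from the change under test.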


---

In [None]:
#  Install dependencies
# !pip install -q datasets transformers evaluate

#  Imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding
)
import evaluate
import numpy as np
import torch

#  Load and preprocess dataset
raw = load_dataset("emotion", split="train")
raw = raw.shuffle(seed=42).train_test_split(test_size=0.2, seed=42)  # seed both calls so the split is reproducible
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = raw["train"].map(preprocess, batched=True)
eval_ds = raw["test"].map(preprocess, batched=True)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(raw["train"].features["label"].names)
)

# Prepare metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return metric.compute(predictions=preds, references=labels)

# Prepare Trainer
training_args = TrainingArguments(
    output_dir="results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    save_strategy="no",
    logging_steps=10,
)

data_collator = DataCollatorWithPadding(tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,  # recent transformers releases deprecate the older tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Helper function: Predict emotion for given list of texts
# This function takes three inputs:
# texts: a list of raw sentences (strings)
# model: a sequence classification model (like BERT), trained or untrained
# label_names: list of emotion labels like ["joy", "anger", "fear", ...]
# It uses the tokenizer defined above and returns a list of (text, predicted_label) pairs.
def predict_emotions(texts, model, label_names):
    results = [] # Initialize empty list to store final predictions as (text, predicted_label) pairs.
    model.eval() # Switch the model to evaluation mode. This turns off dropout layers and other training-specific features to ensure consistent predictions.
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        pred_label = label_names[logits.argmax(-1).item()]
        results.append((text, pred_label)) # ("I'm so happy!", "joy")
    return results

label_names = raw["train"].features["label"].names
sample_texts = [
    "I'm so frustrated with everything happening right now.",
    "I just got promoted and I’m feeling amazing!",
    "Why does everything bad happen to me?",
    "I'm laughing so hard at this meme!",
    "I feel very calm and peaceful today.",
    "I miss her so much, it hurts.",
    "This is the worst experience of my life."
]

# Evaluate BEFORE fine-tuning
print("🔍 Performance BEFORE fine-tuning:")
metrics_before = trainer.evaluate()
print(metrics_before)

print("\n📌 Predictions BEFORE fine-tuning:")
before_preds = predict_emotions(sample_texts, model, label_names)
for text, label in before_preds:
    print(f"Text: {text}\nPredicted Emotion: {label}\n")

# Train the model
trainer.train()

# Evaluate AFTER fine-tuning
print("🔍 Performance AFTER fine-tuning:")
metrics_after = trainer.evaluate()
print(metrics_after)

print("\n📌 Predictions AFTER fine-tuning:")
after_preds = predict_emotions(sample_texts, model, label_names)
for text, label in after_preds:
    print(f"Text: {text}\nPredicted Emotion: {label}\n")

# Side-by-side comparison
print("\n✅ Accuracy Comparison:")
print(f"Before fine-tuning: {metrics_before['eval_accuracy']:.4f}")
print(f"After fine-tuning : {metrics_after['eval_accuracy']:.4f}")



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

🔍 Performance BEFORE fine-tuning:




{'eval_loss': 1.7045319080352783, 'eval_model_preparation_time': 0.0027, 'eval_accuracy': 0.279375, 'eval_runtime': 23.1, 'eval_samples_per_second': 138.528, 'eval_steps_per_second': 17.316}

📌 Predictions BEFORE fine-tuning:
Text: I'm so frustrated with everything happening right now.
Predicted Emotion: sadness

Text: I just got promoted and I’m feeling amazing!
Predicted Emotion: sadness

Text: Why does everything bad happen to me?
Predicted Emotion: sadness

Text: I'm laughing so hard at this meme!
Predicted Emotion: sadness

Text: I feel very calm and peaceful today.
Predicted Emotion: sadness

Text: I miss her so much, it hurts.
Predicted Emotion: sadness

Text: This is the worst experience of my life.
Predicted Emotion: sadness



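| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy |
| ----- | ------------- | --------------- | ---------------------- | -------- |
| 1     | 0.2301        | 0.217574        | 0.0027                 | 0.93625  |
| 2     | 0.0993        | 0.18322         | 0.0027                 | 0.937813 |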


🔍 Performance AFTER fine-tuning:


{'eval_loss': 0.1832195520401001, 'eval_model_preparation_time': 0.0027, 'eval_accuracy': 0.9378125, 'eval_runtime': 22.2291, 'eval_samples_per_second': 143.955, 'eval_steps_per_second': 17.994, 'epoch': 2.0}

📌 Predictions AFTER fine-tuning:
Text: I'm so frustrated with everything happening right now.
Predicted Emotion: anger

Text: I just got promoted and I’m feeling amazing!
Predicted Emotion: joy

Text: Why does everything bad happen to me?
Predicted Emotion: sadness

Text: I'm laughing so hard at this meme!
Predicted Emotion: joy

Text: I feel very calm and peaceful today.
Predicted Emotion: joy

Text: I miss her so much, it hurts.
Predicted Emotion: sadness

Text: This is the worst experience of my life.
Predicted Emotion: sadness


✅ Accuracy Comparison:
Before fine-tuning: 0.2794
After fine-tuning : 0.9378
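
The pre-fine-tuning `eval_loss` of about 1.70 is close to what a model with no preference among six classes would score. The sketch computes cross-entropy from logits with a hand-written softmax and shows that uniform logits give a loss of ln 6. Reading the untrained classification head as "nearly uniform" is an interpretation, not something the notebook measures directly.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

uniform_logits = [0.0] * 6  # no preference among the six emotions
print(round(cross_entropy(uniform_logits, label=0), 4))  # 1.7918, i.e. ln(6)

# Example logits from the compute_metrics comments, with joy (index 1) as the true label.
confident_logits = [-1.2, 3.5, 0.8, -0.4, -2.1, 0.1]
print(cross_entropy(confident_logits, label=1) < cross_entropy(uniform_logits, label=1))
```

After fine-tuning, the loss of about 0.18 indicates that the model assigns high probability to the correct class on most evaluation examples.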



# Happy Learning