# BERT and LoRA Experiment Runner

This notebook runs the full pipeline: it preprocesses the data, then trains three models:
1.  **Small-BERT (From Scratch):** A 4-layer BERT trained from random initialization.
2.  **TinyBERT (Full Fine-Tune):** A pre-trained TinyBERT with all parameters fine-tuned.
3.  **TinyBERT (LoRA Fine-Tune):** A pre-trained TinyBERT fine-tuned using our custom LoRA implementation.

Finally, it visualizes the results.

## 1. Setup

Install the required packages and set up the environment.

In [None]:
!pip install -r requirements.txt
!mkdir -p data/raw models logs

**Action Required:** Before proceeding, you must download the dataset (`complaints_small.csv`) and place it in the `data/raw/` directory.
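A quick check before running the pipeline can catch a missing or empty download early. The sketch below is an optional helper, not part of the project's scripts; the function name `check_dataset` is hypothetical, and the default path is the one given above.

```python
from pathlib import Path


def check_dataset(path="data/raw/complaints_small.csv"):
    """Return the dataset size in bytes; raise if the file is missing or empty."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Dataset not found at {p}. Download it before continuing.")
    size = p.stat().st_size
    if size == 0:
        raise ValueError(f"{p} exists but is empty.")
    return size


# Report the outcome instead of stopping the notebook with a traceback.
try:
    print(f"Found dataset ({check_dataset():,} bytes).")
except (FileNotFoundError, ValueError) as err:
    print(f"Action required: {err}")
```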

## 2. Part 1: Data Preprocessing

This step loads the raw data, cleans it, and creates the `train.csv`, `test.csv`, and `label_map.json` files in `data/processed/`.

In [None]:
!python scripts/preprocess.py --input-file data/raw/complaints_small.csv --output-dir data/processed
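After preprocessing, it can help to confirm that the three output files exist and look plausible before starting a training run. The following is a minimal sketch using only the standard library. The helper name `summarize_processed` is hypothetical, and it assumes only that the CSV files have a header row and that `label_map.json` holds a JSON object or list; the actual column names depend on `scripts/preprocess.py`.

```python
import csv
import json
from pathlib import Path


def summarize_processed(output_dir="data/processed"):
    """Collect column names, row counts, and the number of labels from the outputs."""
    out = Path(output_dir)
    summary = {}
    for name in ("train.csv", "test.csv"):
        with open(out / name, newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            header = next(reader, [])      # first row is assumed to be the header
            rows = sum(1 for _ in reader)  # remaining rows are data
        summary[name] = {"columns": header, "rows": rows}
    with open(out / "label_map.json", encoding="utf-8") as fh:
        summary["num_labels"] = len(json.load(fh))
    return summary


try:
    print(summarize_processed())
except FileNotFoundError as err:
    print(f"Preprocessing output missing: {err}")
```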

## 3. Part 2: Train Small-BERT (From Scratch)

This trains the 4-layer BERT model. Results will be saved in `models/bert_scratch/`.

In [None]:
!python scripts/train_bert_scratch.py --data-dir data/processed --output-dir models/bert_scratch --num-epochs 3

## 4. Part 3: Train TinyBERT (Full Fine-Tune)

This trains the full fine-tuning baseline. Results will be saved in `models/tinybert_full/`.

In [None]:
!python scripts/train_full_finetune.py --data-dir data/processed --output-dir models/tinybert_full --num-epochs 3

## 5. Part 4: Train TinyBERT (LoRA Fine-Tune)

This trains the LoRA model using our custom implementation. Results will be saved in `models/tinybert_lora/`.

In [None]:
!python scripts/train_lora.py --data-dir data/processed --output-dir models/tinybert_lora --num-epochs 3 --lora-r 8 --lora-alpha 32
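To give a feel for what `--lora-r 8 --lora-alpha 32` implies, the sketch below works through the arithmetic of the standard LoRA formulation, in which a frozen weight matrix `W` is augmented with a low-rank update scaled by `alpha / r`. This assumes the custom implementation follows that convention, which the notebook does not confirm. The hidden size of 312 is an illustrative value, not read from the model config, and the function names are hypothetical.

```python
def lora_scaling(r, alpha):
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


hidden = 312       # illustrative hidden size (assumption, not taken from the scripts)
r, alpha = 8, 32   # values passed to train_lora.py above

print(lora_scaling(r, alpha))            # 4.0
print(lora_params(hidden, hidden, r))    # 4992
print(full_params(hidden, hidden))       # 97344
ratio = lora_params(hidden, hidden, r) / full_params(hidden, hidden)
print(f"{ratio:.1%}")                    # 5.1%
```

Under these assumptions, each adapted square matrix trains roughly 5% of the parameters that full fine-tuning would update for the same matrix. The overall trainable count reported in `results.json` also depends on which layers are adapted and on whether the classifier head is trained.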

## 6. Analysis and Visualization

Let's load the `results.json` file from each experiment and compare them.

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

result_files = [
    "models/bert_scratch/results.json",
    "models/tinybert_full/results.json",
    "models/tinybert_lora/results.json"
]

data = []
for f in result_files:
    try:
        with open(f, 'r') as file:
            data.append(json.load(file))
    except FileNotFoundError:
        print(f"Warning: {f} not found. Did the script run correctly?")

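# Fail with a clear message if nothing was loaded; set_index would otherwise raise a KeyError
if not data:
    raise RuntimeError("No results files were loaded. Run the training cells above first.")

df = pd.DataFrame(data).set_index("model")
df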
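Beyond the raw table, it is often useful to express each model relative to a baseline, for example the F1 difference and the fraction of trainable parameters compared with full fine-tuning. The helper below is a sketch that operates on the `data` list of result dictionaries loaded above and relies only on the `model`, `test_f1_score`, and `trainable_params` keys used elsewhere in this notebook. The function name `compare_to_baseline` is hypothetical, and the baseline's model name must match whatever the training script wrote to `results.json`.

```python
def compare_to_baseline(results, baseline_model):
    """Return per-model F1 difference and trainable-parameter ratio versus a baseline."""
    by_model = {r["model"]: r for r in results}
    if baseline_model not in by_model:
        raise KeyError(f"Baseline {baseline_model!r} not found among {sorted(by_model)}")
    base = by_model[baseline_model]
    rows = []
    for name, r in by_model.items():
        rows.append({
            "model": name,
            # Positive means the model scores higher than the baseline.
            "f1_delta": r["test_f1_score"] - base["test_f1_score"],
            # Values below 1.0 mean fewer trainable parameters than the baseline.
            "param_ratio": r["trainable_params"] / base["trainable_params"],
        })
    return rows


# Usage in this notebook, with the name of the full fine-tune model as it appears in df.index:
# pd.DataFrame(compare_to_baseline(data, baseline_model=...)).set_index("model")
```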

In [None]:
sns.set_style("whitegrid")

# Plot 1: F1-Score
plt.figure(figsize=(10, 6))
sns.barplot(x=df.index, y="test_f1_score", data=df)
plt.title("Model Comparison: Weighted F1-Score")
plt.ylabel("F1-Score")
plt.xlabel("Model")
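# Zoom in on the differences, but lower the axis floor if any score falls below 0.8
y_min = min(0.8, df['test_f1_score'].min() * 0.95)
plt.ylim(y_min, df['test_f1_score'].max() * 1.05)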
plt.show()

# Plot 2: Trainable Parameters (Log Scale)
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=df.index, y="trainable_params", data=df)
ax.set_yscale("log")
plt.title("Model Comparison: Trainable Parameters (Log Scale)")
plt.ylabel("Trainable Parameters (Log)")
plt.xlabel("Model")
# Label each bar with its exact trainable-parameter count, placed just above the bar
for p in ax.patches:
    ax.annotate(f"{p.get_height():,.0f}", (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 9), textcoords='offset points')
plt.show()