# Sentiment Classification with LoRA Project

## Pre-Check

Before running the project, verify that the following tools and libraries are installed and importable:

## Environment
- Python 3.9+  
- Jupyter Notebook or JupyterLab  

## Core Libraries
- torch (PyTorch)  
- transformers (Hugging Face)  
- peft (Hugging Face PEFT for LoRA)  
- datasets (Hugging Face Datasets)  
- scikit-learn  
- numpy  
- evaluate (Hugging Face Evaluate)  

In [19]:
# Device Check
import torch
import transformers
import peft
import datasets
import sklearn
import numpy as np
import evaluate

print("=== Environment Pre-Check ===")
print(f"Torch version:          {torch.__version__}")
print(f"Transformers version:   {transformers.__version__}")
print(f"PEFT version:           {peft.__version__}")
print(f"Datasets version:       {datasets.__version__}")
print(f"Scikit-learn version:   {sklearn.__version__}")
print(f"NumPy version:          {np.__version__}")
print(f"Evaluate version: {evaluate.__version__}")

print("\n=== Device Check ===")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device in use:  {'cuda' if torch.cuda.is_available() else 'cpu'}")


=== Environment Pre-Check ===
Torch version:          2.9.1+cpu
Transformers version:   4.57.3
PEFT version:           0.18.0
Datasets version:       4.4.1
Scikit-learn version:   1.7.2
NumPy version:          2.3.5
Evaluate version: 0.4.6

=== Device Check ===
CUDA available: False
Device in use:  cpu


## 1. Dataset Preparation

In [20]:
from datasets import load_dataset, DatasetDict

# Load IMDb dataset
raw = load_dataset("imdb")

# Create validation split from train
splits = raw["train"].train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
train_ds = splits["train"]
val_ds = splits["test"]
test_ds = raw["test"]

dataset = DatasetDict({"train": train_ds, "validation": val_ds, "test": test_ds})

# ⚡ For CPU debugging, shrink the dataset
dataset_small = DatasetDict({
    "train": dataset["train"].select(range(500)),
    "validation": dataset["validation"].select(range(200)),
    # IMDb's raw test split is grouped by label, so shuffle before taking the first 100 rows
    "test": dataset["test"].shuffle(seed=42).select(range(100))
})
dataset = dataset_small


### Dataset Preparation
- Load IMDb dataset from Hugging Face.  
- Create a stratified validation split.  
- Shrink dataset for CPU debugging:  
  - Train: 500  
  - Validation: 200  
  - Test: 100  
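
Taking the first N rows with `select(range(N))` only gives a representative subset if the rows are already in random order. The sketch below uses toy labels rather than the real dataset (the list sizes and the helper name `label_counts` are illustrative assumptions) to show how a label-sorted split and a shuffled split differ, and how to check the class balance of a subset.

```python
import random
from collections import Counter

def label_counts(labels, n):
    """Count labels among the first n rows (mimics select(range(n)))."""
    return Counter(labels[:n])

# Toy stand-in for a split whose rows are grouped by label:
# 1000 negatives (0) followed by 1000 positives (1).
sorted_labels = [0] * 1000 + [1] * 1000

# Shuffling with a fixed seed before slicing gives a mixed subset.
shuffled_labels = sorted_labels[:]
random.Random(42).shuffle(shuffled_labels)

sorted_counts = label_counts(sorted_labels, 100)
shuffled_counts = label_counts(shuffled_labels, 100)
print("first 100 rows, sorted:  ", dict(sorted_counts))
print("first 100 rows, shuffled:", dict(shuffled_counts))
```

On the real data, the same kind of check could be run by passing a split's `label` column to `Counter`.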

## 2. Tokenization & Formatting

In [21]:
from transformers import AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name, legacy=False)

def preprocess_fn(examples):
    inputs = [f"review: {t}" for t in examples["text"]]
    labels_text = ["negative" if l == 0 else "positive" for l in examples["label"]]
    enc = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    enc_targets = tokenizer(text_target=labels_text, max_length=5, truncation=True, padding="max_length")
    enc["labels"] = enc_targets["input_ids"]
    return enc

tokenized = dataset.map(preprocess_fn, batched=True, remove_columns=dataset["train"].column_names)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


### Tokenization & Formatting
- Inputs: "review: <text>", truncated and padded to 128 tokens.
- Labels: "positive" / "negative", tokenized via `text_target` and padded to 5 tokens.
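
One detail worth checking in this setup: labels are padded to a fixed length with the pad token, and padded positions count toward the loss unless they are replaced with `-100`, the index that PyTorch's cross-entropy loss ignores by default and that Hugging Face seq2seq models expect for masked label positions. The helper below is a hypothetical sketch, not part of the notebook. It assumes a pad token ID of 0 (T5's pad ID) and plain Python lists of label IDs; the token IDs in the example are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_pad_labels(label_ids, pad_token_id=0):
    """Return a copy of label_ids with padding IDs replaced by IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up label IDs, each padded to length 5 with pad ID 0.
padded = [[101, 1, 0, 0, 0], [202, 1, 0, 0, 0]]
masked = mask_pad_labels(padded)
print(masked)
```

In `preprocess_fn`, such a helper would be applied to `enc_targets["input_ids"]` before assigning `enc["labels"]`. Labels masked this way need `-100` mapped back to the pad ID before they are decoded.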

## 3. Baseline Comparison (No Fine-Tuning)

In [10]:
# baseline
from transformers import pipeline, AutoModelForSeq2SeqLM

baseline_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
baseline_pipe = pipeline(
    "text2text-generation",
    model=baseline_model,
    tokenizer=tokenizer,
    device=-1
)

def baseline_predict(texts):
    prompts = [f"review: {t}" for t in texts]
    # Truncate each prompt to 256 tokens, well under the tokenizer's 512-token maximum
    enc = tokenizer(prompts, max_length=256, truncation=True, return_tensors="pt", padding=True)
    outs = baseline_model.generate(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
        max_new_tokens=3
    )
    preds_str = tokenizer.batch_decode(outs, skip_special_tokens=True)
    return [1 if "positive" in s.lower() else 0 for s in preds_str]

sample = dataset["validation"].select(range(200))
baseline_preds = baseline_predict(sample["text"])
baseline_refs = sample["label"]

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

print("\nBaseline (no fine-tuning) on 200 validation samples:")
print("Accuracy:", accuracy.compute(predictions=baseline_preds, references=baseline_refs)["accuracy"])
print("Precision:", precision.compute(predictions=baseline_preds, references=baseline_refs, average="binary")["precision"])
print("Recall:", recall.compute(predictions=baseline_preds, references=baseline_refs, average="binary")["recall"])
print("F1:", f1.compute(predictions=baseline_preds, references=baseline_refs, average="binary")["f1"])

Device set to use cpu



Baseline (no fine-tuning) on 200 validation samples:
Accuracy: 0.49
Precision: 0.0
Recall: 0.0
F1: 0.0




## 4. PEFT + LoRA Setup

In [11]:
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 294,912 || all params: 60,801,536 || trainable%: 0.4850


### 🧩 PEFT + LoRA
- Load **T5-small** (~60M params).  
- Apply LoRA to attention projections (`q`, `v`).  
- Trainable params: 294,912 of 60,801,536 (about 0.49%) → efficient fine-tuning.  
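
The trainable-parameter count printed above can be reproduced by hand. The sketch assumes T5-small's published configuration (hidden size 512, attention inner dimension 512, 6 encoder and 6 decoder layers) and that LoRA adds two matrices of shapes `r × d_in` and `d_out × r` to each targeted projection. The variable names are illustrative.

```python
# T5-small configuration values (assumed from the published model config).
d_model = 512          # input size of the q and v projections
inner_dim = 512        # output size (8 heads x 64 dims per head)
encoder_layers = 6     # one self-attention block per layer
decoder_layers = 6     # self-attention plus cross-attention per layer

r = 8                      # LoRA rank from lora_config
targets_per_attention = 2  # "q" and "v"

attention_blocks = encoder_layers + 2 * decoder_layers          # 18
adapted_projections = attention_blocks * targets_per_attention  # 36

# Each adapted projection gains A (r x d_in) and B (d_out x r).
params_per_projection = r * d_model + inner_dim * r
lora_params = adapted_projections * params_per_projection

total_params = 60_801_536  # "all params" reported by print_trainable_parameters()
print(lora_params)                                 # 294912
print(round(100 * lora_params / total_params, 4))  # 0.485
```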


## 5. Training Configuration

In [13]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=1,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=50
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer  # replaces the deprecated `tokenizer=` argument
)

trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1235        | 0.075764        |




TrainOutput(global_step=125, training_loss=0.14092777252197267, metrics={'train_runtime': 212.9274, 'train_samples_per_second': 2.348, 'train_steps_per_second': 0.587, 'total_flos': 17030971392000.0, 'train_loss': 0.14092777252197267, 'epoch': 1.0})
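
The step count and throughput in `TrainOutput` follow from the configuration: 500 training examples, a per-device batch size of 4, and one epoch on a single device. A quick check, assuming no gradient accumulation (the default):

```python
import math

train_examples = 500
batch_size = 4
epochs = 1
train_runtime = 212.9274  # seconds, from TrainOutput

steps = math.ceil(train_examples / batch_size) * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps)                         # 125
print(round(samples_per_second, 3))  # 2.348
```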

## 6. Evaluation Metrics

In [13]:
from sklearn.metrics import classification_report

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

def to_int_labels(strs):
    return [1 if "positive" in s.lower() else 0 for s in strs]

# Validation evaluation
val_out = trainer.predict(tokenized["validation"])
val_preds_str = tokenizer.batch_decode(val_out.predictions, skip_special_tokens=True)
val_labels_str = tokenizer.batch_decode(val_out.label_ids, skip_special_tokens=True)

val_preds = to_int_labels(val_preds_str)
val_refs = to_int_labels(val_labels_str)

print("Validation metrics:")
print("Accuracy:", accuracy.compute(predictions=val_preds, references=val_refs)["accuracy"])
print("Precision:", precision.compute(predictions=val_preds, references=val_refs, average="binary")["precision"])
print("Recall:", recall.compute(predictions=val_preds, references=val_refs, average="binary")["recall"])
print("F1:", f1.compute(predictions=val_preds, references=val_refs, average="binary")["f1"])

print("\nClassification report (validation):")
print(classification_report(val_refs, val_preds, target_names=["negative", "positive"]))




Validation metrics:
Accuracy: 0.67
Precision: 0.9736842105263158
Recall: 0.3627450980392157
F1: 0.5285714285714286

Classification report (validation):
              precision    recall  f1-score   support

    negative       0.60      0.99      0.75        98
    positive       0.97      0.36      0.53       102

    accuracy                           0.67       200
   macro avg       0.79      0.68      0.64       200
weighted avg       0.79      0.67      0.64       200



### 📊 Validation Results
- **Accuracy: 0.67**  
  The model correctly classified about two-thirds of the validation reviews, up from 0.49 for the baseline on the same 200 samples.

- **Precision: 0.97**  
  Predictions of “positive” are almost always correct, meaning the model is highly reliable when it chooses that label.

- **Recall: 0.36**  
  The model misses most actual positives (about 64%), defaulting to “negative” more often than it should.

- **F1 Score: 0.53**  
  Reflects the trade-off: excellent precision but weak recall for positives.

- **Class Breakdown:**  
  - Negative → Precision: 0.60, Recall: 0.99, F1: 0.75  
  - Positive → Precision: 0.97, Recall: 0.36, F1: 0.53  

**In short:** LoRA fine-tuning turned the baseline into a usable classifier: strong at catching negatives, very precise with positives, but still limited in recall for positives.
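
The reported scores pin down the underlying confusion matrix. Working backward from the class supports (98 negative, 102 positive), a positive recall of 37/102 and a positive precision of 37/38 match the printed values exactly. The sketch below recomputes every metric from those counts. The counts are inferred from the classification report, not read from the notebook's predictions, and the variable names are illustrative.

```python
# Confusion-matrix counts inferred from the classification report.
tp, fp = 37, 1    # predicted positive: correct / incorrect
fn, tn = 65, 97   # predicted negative: missed positives / correct negatives
total = tp + fp + fn + tn

pos_precision = tp / (tp + fp)
pos_recall = tp / (tp + fn)
pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)
acc = (tp + tn) / total

print(f"accuracy={acc:.2f} precision={pos_precision:.4f} "
      f"recall={pos_recall:.4f} f1={pos_f1:.4f}")

# A model that predicts negative for every review (as the untuned baseline did)
# scores an accuracy equal to the share of negatives in the sample.
all_negative_acc = (tn + fp) / total
print(f"all-negative accuracy={all_negative_acc:.2f}")
```

Only 38 of the 200 reviews were labeled positive, which shows how strongly the fine-tuned model still favors “negative”.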


## 7. Inference on Custom Reviews

In [14]:
def classify_review(text: str):
    prompt = f"review: {text}"
    gen = trainer.model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=3)
    pred_str = tokenizer.decode(gen[0], skip_special_tokens=True)
    return "positive" if "positive" in pred_str.lower() else "negative"

# Clear positive case
print("Test 1:", classify_review("This movie was amazing!"))  # Expected: positive

# Clear negative case
print("Test 2:", classify_review("Terrible acting and a boring plot."))  # Expected: negative

# Ambiguous/mixed sentiment case
print("Test 3:", classify_review("The visuals were stunning, but the story was weak."))  # Model's prediction may vary



Test 1: positive
Test 2: negative
Test 3: negative


### Inference
- Test model on custom reviews.  
- Outputs `"positive"` or `"negative"`.  
- **Test 1:** "This movie was amazing!" → Output: positive  
- **Test 2:** "Terrible acting and a boring plot." → Output: negative  
- **Test 3:** "The visuals were stunning, but the story was weak." → Output: negative  

The model handled both clear-sentiment cases correctly. The one mixed review was labeled negative, which fits the model's tendency to under-predict positives (positive recall of 0.36).
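
Both `to_int_labels` and `classify_review` treat any output that does not contain "positive" as negative, so malformed generations are silently counted as the negative class. A stricter mapping can flag them instead. The helper below is a hypothetical sketch (`parse_label` is not part of the notebook) that returns `None` for anything other than the two expected label words.

```python
from typing import Optional

def parse_label(generated: str) -> Optional[int]:
    """Map generated text to 1 (positive), 0 (negative), or None if unrecognized."""
    text = generated.strip().lower()
    if text == "positive":
        return 1
    if text == "negative":
        return 0
    return None  # e.g. empty output or an echoed fragment of the review

outputs = ["positive", "Negative ", "review: this movie", ""]
parsed = [parse_label(o) for o in outputs]
print(parsed)  # [1, 0, None, None]
```

Counting the `None` results separately would show how much of a low score comes from unparseable output rather than from wrong sentiment predictions.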

## Conclusion

- **Baseline:** The untuned model never predicted “positive” (precision, recall, and F1 of 0.0), so its 0.49 accuracy only reflects the share of negatives in the sample.  
- **LoRA Fine-Tuning:** Accuracy rose from 0.49 to 0.67 and positive-class F1 from 0.0 to 0.53 after one epoch on 500 reviews.  
- **Strengths:** High precision for positives (0.97), strong recall for negatives (0.99).  
- **Limitations:** Positive recall remains low (0.36); the model struggles with mixed or nuanced sentiment.  