
# NLP & ML (Hugging Face)

**Author:** _MAYANK YADAV_  
**Date:** _20 Aug 2025_

This notebook covers:  
1) **Dataset selection & preprocessing** (IMDB reviews)  
2) **Prompt engineering & model interaction** (FLAN-T5)  
3) **Fine-tuning & evaluation** (DistilBERT)  
4) **Troubleshooting** (common issues & fixes)





## 0. Environment Setup

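Install the required libraries (`torch`, `transformers`, `datasets`, `scikit-learn`, `evaluate`) before running the notebook. The cell below then reports the Python, platform, and PyTorch versions and whether a GPU is available.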


In [1]:


import sys, platform, torch
print("Python:", sys.version)
print("Platform:", platform.platform())
print("Torch:", torch.__version__)  # the import above already fails if torch is missing
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)


Python: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
Platform: Windows-11-10.0.22631-SP0
Torch: 2.5.1+cu121
Device: cuda



## 1. Imports & Reproducibility


In [17]:

import re
import random
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer,
                          pipeline)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import evaluate

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
import torch
torch.manual_seed(SEED)  # also seed PyTorch's own generator
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
SEED


42


## 2. Dataset: IMDB Movie Reviews (Hugging Face Datasets)

- **Source:** `imdb` dataset from Hugging Face Datasets (`datasets.load_dataset("imdb")`)  
- **Why IMDB?** Binary sentiment labels (**positive/negative**), widely used benchmark, diverse and noisy real-world reviews.  
- **Size:** 25k train / 25k test reviews. We will create a **validation split** from train.


In [19]:

imdb = load_dataset("imdb")
imdb


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  " see this article:"
Generating train split: 100%|█████████████████████████████████████████| 25000/25000 [00:00<00:00, 410608.88 examples/s]
Generating test split: 100%|██████████████████████████████████████████| 25000/25000 [00:00<00:00, 501389.54 examples/s]
Generating unsupervised split: 100%|██████████████████████████████████| 50000/50000 [00:00<00:00, 372799.85 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


### 2.1 Train / Validation Split

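We hold out 10% of the training set as validation. `train_test_split` shuffles at random and does not stratify by default; because IMDB is balanced, the class ratio stays close to 50/50, and `stratify_by_column="label"` can be passed if an exact ratio is needed.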


In [21]:

imdb = imdb.shuffle(seed=SEED)
split = imdb["train"].train_test_split(test_size=0.1, seed=SEED)
imdb_train = split["train"]
imdb_val = split["test"]
imdb_test = imdb["test"]
len(imdb_train), len(imdb_val), len(imdb_test)


(22500, 2500, 25000)
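
A random split keeps the label ratio only approximately. As a minimal sketch of what a stratified split does instead, the helper below splits each class separately so that both parts keep the original proportions. The name `stratified_split` and the toy labels are illustrative and not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, val_idx) with the label ratio preserved in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels standing in for the real label column (balanced, like IMDB).
toy_labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(toy_labels, test_size=0.1, seed=42)
val_labels = [toy_labels[i] for i in val_idx]
print(len(train_idx), len(val_idx), val_labels.count(0), val_labels.count(1))
```

Each class contributes the same fraction to the validation set, so the split cannot drift away from the 50/50 balance by chance.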


## 3. Preprocessing (Cleaning, Tokenization)

**Cleaning applied:**
- Lowercasing
- Remove HTML tags (e.g., `<br />`), URLs
- Collapse repeated whitespace

> We **do not** remove punctuation/stopwords aggressively because modern tokenizers handle them well and excessive cleaning can remove useful sentiment cues.


In [23]:

def clean_text(text: str) -> str:
    text = re.sub(r"<.*?>", " ", text)          # HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()   # normalize spaces
    return text

# Preview cleaning
sample = imdb_train[0]["text"]
sample, clean_text(sample)


('I just saw this at the Venice Film Festival, and can\'t quite decide about it. We were never allowed to get close enough to any of the characters to care about them. Maybe that was the point, that we are all in a "bubble" of our own, but these people didn\'t compel me to be concerned about them or shocked at their various fates. At a running time of just over an hour, the characters weren\'t very well developed. Lots of time was devoted to shots of factory equipment (forklifts, conveyor belts, shovels); and the slightly-creepy-looking baby dolls with surprisingly lifelike eyes, that most of the characters made for a living, were somehow more interesting than the live people. An interesting experiment, but somehow it never quite came together.',
 'i just saw this at the venice film festival, and can\'t quite decide about it. we were never allowed to get close enough to any of the characters to care about them. maybe that was the point, that we are all in a "bubble" of our own, but the
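
The previewed review happens to contain no HTML or URLs, so it only shows the lowercasing step. The short check below repeats the same `clean_text` logic on a made-up review string (the text and URL are hypothetical) to exercise the tag, URL, and whitespace rules.

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"<.*?>", " ", text)             # strip HTML tags such as <br />
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

# Hypothetical input, for illustration only.
raw = "Great film!<br /><br />Full review at http://example.com/r1   Must SEE."
cleaned = clean_text(raw)
print(cleaned)
```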


### 3.1 Tokenization

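We use the **DistilBERT (uncased)** tokenizer with truncation to a maximum length of 256 tokens. Padding is deferred to `DataCollatorWithPadding`, which pads each batch to its longest sequence at training time.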


In [25]:

MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_batch(batch):
    texts = [clean_text(t) for t in batch["text"]]
    return tokenizer(texts, truncation=True, padding=False, max_length=256)

tokenized_train = imdb_train.map(tokenize_batch, batched=True, remove_columns=["text"])
tokenized_val   = imdb_val.map(tokenize_batch, batched=True, remove_columns=["text"])
tokenized_test  = imdb_test.map(tokenize_batch, batched=True, remove_columns=["text"])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_train[0]


Map: 100%|██████████████████████████████████████████████████████████████| 22500/22500 [00:06<00:00, 3304.52 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████| 2500/2500 [00:00<00:00, 3300.37 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████| 25000/25000 [00:07<00:00, 3384.70 examples/s]


{'label': 0,
 'input_ids': [101,
  1045,
  2074,
  2387,
  2023,
  2012,
  1996,
  7914,
  2143,
  2782,
  1010,
  1998,
  2064,
  1005,
  1056,
  3243,
  5630,
  2055,
  2009,
  1012,
  2057,
  2020,
  2196,
  3039,
  2000,
  2131,
  2485,
  2438,
  2000,
  2151,
  1997,
  1996,
  3494,
  2000,
  2729,
  2055,
  2068,
  1012,
  2672,
  2008,
  2001,
  1996,
  2391,
  1010,
  2008,
  2057,
  2024,
  2035,
  1999,
  1037,
  1000,
  11957,
  1000,
  1997,
  2256,
  2219,
  1010,
  2021,
  2122,
  2111,
  2134,
  1005,
  1056,
  4012,
  11880,
  2033,
  2000,
  2022,
  4986,
  2055,
  2068,
  2030,
  7135,
  2012,
  2037,
  2536,
  26417,
  1012,
  2012,
  1037,
  2770,
  2051,
  1997,
  2074,
  2058,
  2019,
  3178,
  1010,
  1996,
  3494,
  4694,
  1005,
  1056,
  2200,
  2092,
  2764,
  1012,
  7167,
  1997,
  2051,
  2001,
  7422,
  2000,
  7171,
  1997,
  4713,
  3941,
  1006,
  9292,
  18412,
  2015,
  1010,
  16636,
  2953,
  18000,
  1010,
  24596,
  2015,
  1007,
  1025,
  1998,
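
The tokenizer is called with `padding=False`, so examples are stored at their own length and `DataCollatorWithPadding` pads per batch. A plain-Python sketch of that idea follows; `pad_batch`, the pad id of 0, and the token ids are illustrative assumptions rather than the collator's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; 101 and 102 mimic BERT-style [CLS] and [SEP].
batch = pad_batch([[101, 1045, 2074, 102], [101, 2023, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the batch maximum instead of a fixed 256 tokens avoids wasted computation on batches of short reviews.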



## 4. Fine-Tuning a Lightweight Model (DistilBERT)

We fine-tune **DistilBERT** for binary sentiment classification.  
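To reduce compute, the cell below subsamples the data:
- `select(range(8000))` keeps 8k training examples and `select(range(2000))` keeps 2k validation examples; comment these lines out to train on the full split.  
- Increase `num_train_epochs` for better accuracy if you have a GPU.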


In [31]:

# Reduce dataset size for faster training (optional)
tokenized_train = tokenized_train.select(range(8000))
tokenized_val   = tokenized_val.select(range(2000))

# Label mappings
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Training arguments (recent transformers releases use eval_strategy; older ones call it evaluation_strategy)
training_args = TrainingArguments(
    output_dir="./distilbert-imdb-checkpoints",
    eval_strategy="epoch", 
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train model
trainer.train()



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.3358 | 0.283175 | 0.8795 | 0.844071 | 0.924335 | 0.882382 |
| 2 | 0.1788 | 0.329232 | 0.891 | 0.895833 | 0.879346 | 0.887513 |


TrainOutput(global_step=1000, training_loss=0.25729830932617187, metrics={'train_runtime': 1343.1763, 'train_samples_per_second': 11.912, 'train_steps_per_second': 0.745, 'total_flos': 1059739189248000.0, 'train_loss': 0.25729830932617187, 'epoch': 2.0})
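
The `global_step=1000` in the training output can be checked by hand: 8,000 examples at batch size 16 give 500 batches per epoch, times 2 epochs. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (the helper name is illustrative):

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

steps = total_training_steps(n_examples=8000, batch_size=16, epochs=2)
samples_per_second = 8000 * 2 / 1343.1763  # train_runtime from the output above
print(steps, round(samples_per_second, 3))
```

The throughput computed this way agrees with the reported `train_samples_per_second`, which is a quick way to confirm that the subsampled dataset was the one actually used.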


### 4.1 Evaluate on Test Set


In [33]:

test_metrics = trainer.evaluate(tokenized_test)
test_metrics


{'eval_loss': 0.2973911166191101,
 'eval_accuracy': 0.902,
 'eval_precision': 0.9068825910931174,
 'eval_recall': 0.896,
 'eval_f1': 0.9014084507042254,
 'eval_runtime': 654.084,
 'eval_samples_per_second': 38.221,
 'eval_steps_per_second': 1.196,
 'epoch': 2.0}
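
F1 is the harmonic mean of precision and recall, so the reported `eval_f1` can be reproduced from the two values above it. A minimal check (the function name is illustrative):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision = 0.9068825910931174  # eval_precision from the output above
recall = 0.896                  # eval_recall from the output above
f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))
```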


### 4.2 Save & Reload Model


In [35]:

trainer.save_model("distilbert-imdb-model")
tokenizer.save_pretrained("distilbert-imdb-model")

# Quick sanity-check inference
clf = pipeline("sentiment-analysis", model="distilbert-imdb-model", tokenizer="distilbert-imdb-model", device=0 if torch.cuda.is_available() else -1)
examples = [
    "The movie was absolutely wonderful, I loved every minute of it!",
    "It was a total waste of time. The plot made no sense."
]
for ex in examples:
    print(ex, "->", clf(ex))


Device set to use cuda:0


The movie was absolutely wonderful, I loved every minute of it! -> [{'label': 'POSITIVE', 'score': 0.9937892556190491}]
It was a total waste of time. The plot made no sense. -> [{'label': 'NEGATIVE', 'score': 0.9870553016662598}]



## 5. Prompt Engineering & Model Interaction (FLAN-T5)

We use a pretrained **instruction-tuned** LLM: `google/flan-t5-base` for text-to-text prompting.  
We design **three prompt variants** for the same sentiment task:

1. **Direct Question** – concise classification request.  
2. **Brief Reasoning** – ask for a short explanation, then final label.  
3. **Role Prompt** – set an expert role to encourage consistency.

We'll compare accuracy across prompts on a small sample of the IMDB **test** set.


In [37]:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

GEN_MODEL = "google/flan-t5-base"
gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)

text2text = pipeline("text2text-generation", model=gen_model, tokenizer=gen_tokenizer, device=0 if torch.cuda.is_available() else -1)

PROMPTS = {
    "direct": "Classify the sentiment of this movie review as Positive or Negative. Review: {text}",
    "brief_reason": "Explain briefly and then answer strictly with Positive or Negative. Review: {text}",
    "role": "You are an expert movie critic. Decide if the sentiment is Positive or Negative. Review: {text}"
}

def normalize_label(s):
    s = s.strip().lower()
    if "positive" in s:
        return 1
    if "negative" in s:
        return 0
    # fallback: try to guess based on polarity words
    return 1 if any(w in s for w in ["good", "great", "excellent", "love", "amazing"]) else 0

# Evaluate on a small sample for speed
N = 200
sample_test = imdb_test.select(range(N))

def eval_prompt(prompt_key):
    preds, gold = [], []
    for ex in sample_test:
        text = ex["text"]
        lbl = ex["label"]
        prompt = PROMPTS[prompt_key].format(text=text)
        out = text2text(prompt, max_new_tokens=16, do_sample=False)[0]["generated_text"]
        preds.append(normalize_label(out))
        gold.append(lbl)
    acc = accuracy_score(gold, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(gold, preds, average="binary")
    return {"prompt": prompt_key, "accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

results = [eval_prompt(k) for k in PROMPTS.keys()]
results


Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (530 > 512). Running this sequence through the model will result in indexing errors
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'prompt': 'direct',
  'accuracy': 0.935,
  'precision': 0.9278350515463918,
  'recall': 0.9375,
  'f1': 0.9326424870466321},
 {'prompt': 'brief_reason',
  'accuracy': 0.925,
  'precision': 0.945054945054945,
  'recall': 0.8958333333333334,
  'f1': 0.9197860962566845},
 {'prompt': 'role',
  'accuracy': 0.93,
  'precision': 0.9361702127659575,
  'recall': 0.9166666666666666,
  'f1': 0.9263157894736842}]
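
With only 200 reviews, the gaps between prompts amount to one or two reviews, so they should be read with caution. As a rough sketch, the snippet below computes a normal-approximation 95% interval for each accuracy; the helper name and the choice of the simple Wald interval are assumptions made for illustration.

```python
import math

def accuracy_ci(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n examples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

N = 200
for name, acc in [("direct", 0.935), ("brief_reason", 0.925), ("role", 0.93)]:
    low, high = accuracy_ci(acc, N)
    print(f"{name:>12}: {acc:.3f}  (95% CI {low:.3f} to {high:.3f})")
```

The intervals overlap almost entirely, which suggests that a larger sample would be needed before ranking the three prompts.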


### 5.1 Sample Prompt Outputs


In [39]:

samples = [
    "The cinematography was stunning and the performances were heartfelt.",
    "Boring, predictable, and painfully long."
]
for s in samples:
    print("\nReview:", s)
    for name, tmpl in PROMPTS.items():
        out = text2text(tmpl.format(text=s), max_new_tokens=16, do_sample=False)[0]["generated_text"]
        print(f"{name:>12} ->", out)



Review: The cinematography was stunning and the performances were heartfelt.
      direct -> Positive
brief_reason -> Positive
        role -> Positive

Review: Boring, predictable, and painfully long.
      direct -> Negative
brief_reason -> Negative
        role -> Negative



## 6. Troubleshooting (Common Issues & Fixes)

### I) Overfitting during fine-tuning
**Symptoms:** Training accuracy/metrics improve while validation/test stagnate or degrade.  
**Fixes:** Increase dropout/weight decay; early stopping; reduce epochs; use more data; data augmentation.
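
In the training log above, validation loss rose from 0.283 to 0.329 between epochs 1 and 2 while training loss kept falling, although accuracy still improved slightly. A minimal sketch of patience-based early stopping in plain Python follows; `EarlyStopper` is a hypothetical helper, and with `Trainer` the `EarlyStoppingCallback` from `transformers` is the usual way to get this behaviour.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` evaluations."""

    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1    # no improvement this evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=1)
val_losses = [0.283175, 0.329232]  # validation losses from the training log above
decisions = [stopper.should_stop(loss) for loss in val_losses]
print(decisions)
```

Whether to monitor loss or accuracy is a design choice; here the two disagree, so the monitored metric should match the one used to select the final model.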

### II) Prompt sensitivity / drift
**Symptoms:** LLM outputs vary with small wording changes; inconsistent labels.  
**Fixes:** Constrain the output format (e.g., "answer strictly with Positive or Negative"), add few-shot exemplars, decode deterministically (`do_sample=False` or temperature 0), and post-process the output with a regex.
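
The notebook's `normalize_label` falls back to Negative when no label is found, which can hide formatting failures. As one possible stricter variant, the sketch below matches whole words only and returns `None` for missing or contradictory labels so they can be counted separately. `parse_label` and the sample outputs are illustrative.

```python
import re

LABEL_PATTERN = re.compile(r"\b(positive|negative)\b", re.IGNORECASE)

def parse_label(generated_text):
    """Return 1 (positive), 0 (negative), or None when the output is unusable."""
    found = {match.lower() for match in LABEL_PATTERN.findall(generated_text)}
    if found == {"positive"}:
        return 1
    if found == {"negative"}:
        return 0
    return None  # missing or contradictory label: count separately, do not guess

# Made-up model outputs, for illustration only.
for output in ["Positive", "negative.", "It is good", "Not positive, negative"]:
    print(repr(output), "->", parse_label(output))
```

Tracking the share of `None` results per prompt gives a direct measure of how well each prompt constrains the output format.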

### III) Domain bias
**Symptoms:** Model favors popular movie tropes or short reviews.  
**Fixes:** Balance dataset, include domain-specific examples, calibrate thresholds, evaluate on multiple slices.
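
Evaluating on slices can be as simple as grouping predictions by a property of the input and computing accuracy per group. The sketch below slices by review length; the helper names, the 20-word threshold, and the toy predictions are all made up for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_fn):
    """examples: iterable of (text, gold, pred); returns {slice_name: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for text, gold, pred in examples:
        key = slice_fn(text)
        total[key] += 1
        correct[key] += int(gold == pred)
    return {key: correct[key] / total[key] for key in total}

def length_bucket(text, threshold=20):
    # Word-count threshold is an arbitrary illustrative choice.
    return "short" if len(text.split()) < threshold else "long"

# Made-up (text, gold, pred) triples, for illustration only.
toy = [
    ("great movie", 1, 1),
    ("awful", 0, 0),
    ("not bad at all", 1, 0),
    ("word " * 30, 0, 0),
    ("word " * 40, 1, 1),
]
slice_accuracy = accuracy_by_slice(toy, length_bucket)
print(slice_accuracy)
```

A large gap between slices is a signal to inspect that slice's errors before trusting the aggregate accuracy.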

### IV) Class imbalance
**Symptoms:** High accuracy but low recall for minority class.  
**Fixes:** Weighted loss, stratified sampling, report precision/recall/F1 alongside accuracy.
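
IMDB itself is balanced, so this mainly matters when the approach is reused on other data. One common heuristic for a weighted loss is inverse class frequency, sketched below on a hypothetical 90/10 label set; the resulting weights could then be passed to a loss such as `torch.nn.CrossEntropyLoss(weight=...)`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced label set: 90 negatives, 10 positives.
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```

The minority class receives the larger weight, so its errors contribute more to the loss.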

### V) Resource constraints
**Symptoms:** Out-of-memory errors, slow training.  
**Fixes:** Reduce the maximum sequence length or batch size; use gradient accumulation; switch to a smaller model (e.g., DistilBERT, TinyLlama); use a GPU with automatic mixed precision (AMP).
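
Gradient accumulation trades memory for time: smaller batches are processed one after another and the optimizer steps only after several of them. The arithmetic can be sketched as below; the helper name and the low-memory setting are hypothetical, and in `TrainingArguments` the corresponding options are `gradient_accumulation_steps` and, for mixed precision, `fp16`.

```python
def effective_batch_size(per_device_batch, grad_accum_steps=1, n_devices=1):
    """Examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices

# The notebook trains with per_device_train_batch_size=16 on one GPU.
baseline = effective_batch_size(16)
# Hypothetical low-memory setting: a quarter of the batch, four accumulation steps.
low_memory = effective_batch_size(4, grad_accum_steps=4)
print(baseline, low_memory)
```

Both settings update the model on 16 examples per step, but the second holds only 4 examples in memory at a time.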



## 7. Summary Notes

### Project Overview  
This project focuses on **sentiment analysis of IMDB reviews** using transformer-based models.  
The primary objective is to classify reviews into **positive** or **negative** sentiment by applying modern NLP techniques and comparing fine-tuned models with prompting approaches.

---

### Dataset  
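- **Source:** IMDB reviews (binary classification) dataset from Hugging Face.  
- **Size:** 50,000 labeled reviews (25k train / 25k test, balanced between positive and negative), plus 50,000 unlabeled reviews that are not used here.  
- **Usage:** 22,500 training, 2,500 validation, and 25,000 test reviews after the 90/10 split of the training set.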

---

### Preprocessing  
- Converted all text to lowercase.  
- Removed HTML tags, URLs, and extra spaces.  
- Applied the **DistilBERT (uncased) tokenizer** with a maximum sequence length of `256`.  
- Tokenized the Hugging Face datasets with `map` and padded batches dynamically with `DataCollatorWithPadding`.

---

### Model  
- Base model: **DistilBERT (lightweight BERT variant)**.  
- Fine-tuned for **2 epochs** on an 8,000-example subset of the preprocessed training split.  
- Optimizer: AdamW with learning rate scheduling (the `Trainer` defaults), learning rate `2e-5`, weight decay `0.01`.  
- Loss function: Cross-entropy loss.

---

### Evaluation Metrics  
On the full 25,000-review test set, the fine-tuned DistilBERT reached:  
- **Accuracy:** 0.902  
- **Precision:** 0.907  
- **Recall:** 0.896  
- **F1-score:** 0.901  

Reporting all four gives a fuller picture than accuracy alone.

---

### Prompting Experiments  
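In addition to fine-tuning, we ran **zero-shot prompting** experiments with **FLAN-T5** (`google/flan-t5-base`):  
- Evaluated on the first **200 reviews** of the shuffled test set.  
- Used three prompt variants: direct question, brief reasoning, and role-based instruction.  
- Accuracy was 0.935 (direct), 0.925 (brief reasoning), and 0.93 (role), versus 0.902 for fine-tuned DistilBERT on the full test set; the figures come from different sample sizes, so the comparison is indicative only.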

---

### Compromises & Constraints  
- Subsampled the data for faster iteration (8,000 training and 2,000 validation examples).  
- Used **lightweight transformer models** (DistilBERT, FLAN-T5 base) to ensure feasibility on limited hardware.  
- Limited training to 2 epochs to balance accuracy and training time.

---

### Reproducibility  
- Random seed set to `42` for consistent results.  
- Code cells are meant to be run top to bottom; the execution counts shown (`In [1]`, `In [17]`, ...) are not contiguous.  
- Environment details (Python version, platform, PyTorch version, GPU/CPU availability) are printed in Section 0 for replication.  

---

This summary gives a **self-contained overview** of the entire workflow: dataset, preprocessing, model training, evaluation, prompting experiments, compromises, and reproducibility.

