# Project Overview
This project demonstrates lightweight fine-tuning of a pre-trained model for sentiment analysis on movie reviews.

### Lightweight Fine-Tuning Project
- **PEFT Technique:** LoRA (Low-Rank Adaptation)
- **Model:** bert-base-uncased (for sequence classification)
- **Evaluation Approach:** Accuracy using the Hugging Face Trainer
- **Fine-tuning Dataset:** IMDb (from Hugging Face library)
    

In [1]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load the pre-trained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Binary classification (positive/negative)

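# Load the IMDb dataset and use a smaller subset for testing.
# Caveat: "train[:10%]" takes the first 2,500 rows in stored order; if the split is
# ordered by label, this slice may contain only one class.
dataset = load_dataset("imdb", split="train[:10%]")  # Use only 10% of the data for quick testing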

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)  # Reduce max length to 128

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Rename 'label' to 'labels' (the argument name the model expects) and
# return PyTorch tensors for 'input_ids', 'attention_mask', and 'labels'
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Verify tokenized dataset size and sample
print(f"Tokenized dataset size: {len(tokenized_dataset)}")
print("Tokenized dataset sample:", tokenized_dataset[0])



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenized dataset size: 2500
Tokenized dataset sample: {'labels': tensor(0), 'input_ids': tensor([  101,  1045, 12524,  1045,  2572,  8025,  1011,  3756,  2013,  2026,
         2678,  3573,  2138,  1997,  2035,  1996,  6704,  2008,  5129,  2009,
         2043,  2009,  2001,  2034,  2207,  1999,  3476,  1012,  1045,  2036,
         2657,  2008,  2012,  2034,  2009,  2001,  8243,  2011,  1057,  1012,
         1055,  1012,  8205,  2065,  2009,  2412,  2699,  2000,  4607,  2023,
         2406,  1010,  3568,  2108,  1037,  5470,  1997,  3152,  2641,  1000,
         6801,  1000,  1045,  2428,  2018,  2000,  2156,  2023,  2005,  2870,
         1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  1996,
         5436,  2003,  8857,  2105,  1037,  2402,  4467,  3689,  3076,  2315,
        14229,  2040,  4122,  2000,  4553,  2673,  2016,  2064,  2055,  2166,
         1012,  1999,  3327,  2016,  4122,  2000,  3579,  2014,  3086,  2015,
         2000,  2437,  2070,  4066,  1997,  4516,  2

In [2]:
# Define training arguments with smaller batch size and fewer epochs
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # Smaller batch size
    per_device_eval_batch_size=4,
    warmup_steps=100,  # Fewer warmup steps
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

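# Create a Trainer
# Note: eval_dataset is the training subset here, so trainer.evaluate() without
# arguments reports results on data the model has already seen.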
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

# Train the model
trainer.train()


| Step | Training Loss |
|------|---------------|
| 10   | 0.81          |
| 20   | 0.5564        |
| 30   | 0.3338        |
| 40   | 0.1217        |
| 50   | 0.0286        |
| 60   | 0.0066        |
| 70   | 0.0026        |
| 80   | 0.0014        |
| 90   | 0.001         |
| 100  | 0.0007        |



TrainOutput(global_step=625, training_loss=0.02989454163410701, metrics={'train_runtime': 4242.4818, 'train_samples_per_second': 0.589, 'train_steps_per_second': 0.147, 'total_flos': 164444409600000.0, 'train_loss': 0.02989454163410701, 'epoch': 1.0})

## Loading and Preparing the Dataset
The IMDb dataset is loaded in the first cell and prepared for fine-tuning. Only a 10% subset of the training split (2,500 reviews) is used, to keep experimentation and validation quick.
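
A head slice such as `train[:10%]` keeps rows in stored order, so the subset can end up with a single class if the split is ordered by label. The toy sketch below shows the difference between a head slice and a seeded shuffle followed by a selection. The label ordering is an assumption made for illustration, although the near-zero losses logged earlier would be consistent with a single-class subset. With the `datasets` library, the equivalent would be `Dataset.shuffle(seed=...)` followed by `Dataset.select(range(n))`.

```python
import random

# Toy stand-in for a label-sorted split: 12,500 negatives, then 12,500 positives
# (assumed ordering, sized to mirror a 25,000-review training split).
labels = [0] * 12_500 + [1] * 12_500

subset_size = len(labels) // 10  # 10% of the split, i.e. 2,500 examples

# Head slice: the first 2,500 rows exactly as stored.
head_slice = labels[:subset_size]

# Shuffled subset: permute the indices with a fixed seed, then take 2,500 of them.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
shuffled_subset = [labels[i] for i in indices[:subset_size]]

print("classes in head slice:     ", sorted(set(head_slice)))
print("classes in shuffled subset:", sorted(set(shuffled_subset)))
```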
    

In [3]:
from transformers import EvalPrediction
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

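# Define compute metrics function
# Note: the Trainer only calls this when it is passed as compute_metrics=...;
# it is not attached to the trainer created above, so evaluate() returns the loss only.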
def compute_metrics(p: EvalPrediction):
    preds = p.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='weighted')
    acc = accuracy_score(p.label_ids, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# Evaluate the model
results = trainer.evaluate()
print("Model Performance:", results)


Model Performance: {'eval_loss': 3.452102464507334e-05, 'eval_runtime': 1128.4817, 'eval_samples_per_second': 2.215, 'eval_steps_per_second': 0.554, 'epoch': 1.0}


## Loading the Pre-trained Model
The pre-trained BERT model (`bert-base-uncased`) and its tokenizer are loaded in the first cell, with a two-label classification head. The BERT tokenizer already defines a `[PAD]` token, so no additional padding-token configuration is needed.
    

In [4]:
# Use only 10% of the test data for quick validation. This is the first 2,500 rows
# in stored order, so class balance is not guaranteed.
test_dataset = load_dataset("imdb", split="test[:10%]")

# Tokenize the test dataset
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Ensure the test dataset contains 'input_ids', 'attention_mask', and 'labels'
tokenized_test_dataset = tokenized_test_dataset.rename_column("label", "labels")
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Evaluate the model on the test set
results = trainer.evaluate(tokenized_test_dataset)
print("Test Set Performance:", results)


Test Set Performance: {'eval_loss': 3.4507109376136214e-05, 'eval_runtime': 1132.6157, 'eval_samples_per_second': 2.207, 'eval_steps_per_second': 0.552, 'epoch': 1.0}


In [6]:
import random
import torch

# Select a few samples from the test set
sample_indices = random.sample(range(len(tokenized_test_dataset)), 5)
samples = [tokenized_test_dataset[i] for i in sample_indices]

# Generate and print predictions for these samples
model.eval()  # Disable dropout for inference
for idx, sample in enumerate(samples):
    # Add a batch dimension and move the tensors to the model's device
    inputs = {key: value.unsqueeze(0).to(model.device) for key, value in sample.items() if key != 'labels'}
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_label = logits.argmax(-1).item()
    true_label = sample['labels'].item()

    print(f"Text: {test_dataset[sample_indices[idx]]['text'][:200]}...")  # Print a snippet of the text
    print(f"True Label: {true_label}, Predicted Label: {predicted_label}\n")


Text: Dorothy Provine does the opposite here: She keeps growing and growing. I didn't detect any subtext, though. "The Incredible Shrinking Man" and other movies of its ilk during the period were parables a...
True Label: 0, Predicted Label: 0

Text: Guys, what can I tell you? I'm Bulgarian. I can't remember how many times I talk to Americans and let alone that they don't have a slightest clue where is Bulgaria, but they say things like: "There's ...
True Label: 0, Predicted Label: 0

Text: Hooray for Title Misspellings! After reading reviews and contemplating, my girlfriend and I confirmed that this movie is an utter piece of trash. This movie lost her as one of those Rare Tarantino fan...
True Label: 0, Predicted Label: 0

Text: Unfortunately I think this is one of those films that if you or I took it to the studio and said, 'can I make this great movie with my friends Mary, Mungo and Midge from school?' the studio would have...
True Label: 0, Predicted Label: 0

Text: This little ch

## Tokenizing the Dataset
Each review is tokenized into `input_ids` and an `attention_mask`, the inputs the BERT model expects. Sequences are padded or truncated to a fixed length of 128 tokens.
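
To make the fixed-length step concrete, the sketch below pads or truncates a list of token ids and builds the matching attention mask. It is a simplified illustration, not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper, and the ids 0, 101, and 102 are the `[PAD]`, `[CLS]`, and `[SEP]` ids of the `bert-base-uncased` vocabulary.

```python
from typing import Dict, List

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids: List[int], max_length: int = 128) -> Dict[str, List[int]]:
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to exactly max_length."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad            # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([1045, 12524, 1045], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short)  # padded with zeros up to length 8
print(long)   # truncated to 6 body tokens plus [CLS] and [SEP]
```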
    

## Fine-Tuning the Model
We fine-tune the BERT model on the tokenized IMDb subset for one epoch. The Hugging Face `Trainer` handles batching, optimization, and loss logging.
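
The step count and learning-rate schedule implied by the training arguments can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` defaults for the peak learning rate (`5e-5`) and scheduler (linear decay after warmup), none of which the notebook sets explicitly. `linear_warmup_decay` is a hypothetical helper written for illustration.

```python
import math

num_examples = 2500   # size of the training subset
batch_size = 4        # per_device_train_batch_size
num_epochs = 1        # num_train_epochs
warmup_steps = 100    # warmup_steps
peak_lr = 5e-5        # assumed: TrainingArguments default learning_rate

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_warmup_decay(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print("total optimizer steps:", total_steps)
for step in (0, 50, 100, 300, total_steps):
    print(f"step {step:>3}: lr = {linear_warmup_decay(step):.2e}")
```

Under these assumptions the total comes to 625 steps, which matches the `global_step=625` reported in the training output.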
    

## Evaluating the Model
The model is evaluated on a 10% subset of the IMDb test split. Accuracy is the intended primary metric; the recorded runs report only `eval_loss`, because no `compute_metrics` function was passed to the `Trainer`.
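
A minimal sketch of an accuracy function in the shape the `Trainer` expects is shown here. It relies only on the `predictions` and `label_ids` attributes of the evaluation object; the `SimpleNamespace` input is a hypothetical stand-in with made-up logits so the function can be tried without running a model.

```python
from types import SimpleNamespace

import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy from raw logits and integer labels."""
    logits = np.asarray(eval_pred.predictions)
    labels = np.asarray(eval_pred.label_ids)
    preds = logits.argmax(axis=-1)  # index of the highest logit per example
    return {"accuracy": float((preds == labels).mean())}


# Hypothetical logits for four examples, used only to exercise the function.
fake_eval = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.4]]),
    label_ids=np.array([0, 1, 1, 1]),
)
print(compute_metrics(fake_eval))  # three of four predictions match: {'accuracy': 0.75}
```

Passing the function at construction time, as in `Trainer(..., compute_metrics=compute_metrics)`, makes `trainer.evaluate()` report the value under the key `eval_accuracy` alongside `eval_loss`.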
    

## Generating and Reviewing Predictions
As a sanity check, we generate predictions for five randomly chosen samples from the test subset and compare them with the true labels by hand.
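
When reviewing predictions by hand, it can help to turn the raw logits into a readable label with a confidence score. The sketch below applies a softmax to a pair of logits and maps the winning index to a name. The logits are hypothetical values, `describe_prediction` is an illustrative helper, and the mapping follows the IMDb convention of 0 for negative and 1 for positive.

```python
import math
from typing import List, Tuple

ID2LABEL = {0: "negative", 1: "positive"}  # IMDb label convention


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits: List[float]) -> Tuple[str, float]:
    """Return the predicted label name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, confidence = describe_prediction([3.2, -2.9])  # hypothetical logits
print(f"{label} (p = {confidence:.3f})")
```

In the prediction cell, the list passed in would be `logits.squeeze(0).tolist()`.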
    