<a href="https://colab.research.google.com/github/nursenakok/IMDB-LoRA-Finetuning/blob/main/1_IMDB_LoRA_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 1. Libraries

!pip install -q transformers datasets peft accelerate # Install required libraries

In [2]:
# 2. Data

from datasets import load_dataset                     # Import the Hugging Face Datasets library
dataset = load_dataset("stanfordnlp/imdb")            # Load the IMDb dataset (50k movie reviews labeled as positive/negative)



In [3]:
# 3. Tokenization

from transformers import AutoTokenizer  # Import the AutoTokenizer class from Hugging Face transformers
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") # Load the tokenizer for the DistilBERT model

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256) # Tokenize text and pad/truncate to max length

tokenized_datasets = dataset.map(tokenize_function, batched=True) # Apply tokenizer to entire dataset
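
The effect of `padding="max_length"` together with `truncation=True` can be sketched in plain Python. This is only an illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, `pad_id=0` is an assumption, and the real tokenizer also keeps the closing special token when it truncates.

```python
# Hypothetical helper: force every sequence to exactly max_length ids.
def pad_or_truncate(ids, max_length, pad_id=0):
    ids = ids[:max_length]                      # truncation: drop ids past max_length
    n_pad = max_length - len(ids)               # padding needed to reach max_length
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Made-up ids; a small max_length keeps the output readable.
short_out = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_out = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_out)  # padded with two pad ids, mask ends in two zeros
print(long_out)   # cut to the first six ids, mask is all ones
```

With `max_length=256` as in the cell above, every review ends up as exactly 256 ids, so short reviews spend most of their positions on padding.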



In [17]:
# 4. Model

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)  # Load DistilBERT for 2-class classification
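lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_lin", "v_lin"], lora_dropout=0.3, bias="none", task_type="SEQ_CLS") # Rank-8 LoRA adapters on the attention query/value projections
model = get_peft_model(model, lora_config)  # Wrap the base model so only the adapters and classification head train
model.print_trainable_parameters()          # Report trainable vs. total parameter counts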

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925
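
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the `distilbert-base-uncased` dimensions (hidden size 768, 6 transformer layers) and assumes that, for `task_type="SEQ_CLS"`, the `pre_classifier` and `classifier` heads stay trainable next to the LoRA matrices. The total of 67,694,596 is copied from the output rather than derived.

```python
hidden = 768            # DistilBERT hidden size (assumed)
num_layers = 6          # DistilBERT transformer layers (assumed)
r = 8                   # LoRA rank from LoraConfig
num_labels = 2
targets_per_layer = 2   # q_lin and v_lin

# Each adapted linear layer gets A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden + hidden * r
lora_params = lora_per_module * targets_per_layer * num_layers

# Classification head, assumed to be trained in full alongside the adapters.
pre_classifier_params = hidden * hidden + hidden          # weight + bias
classifier_params = hidden * num_labels + num_labels      # weight + bias

trainable_params = lora_params + pre_classifier_params + classifier_params
all_params = 67_694_596                                   # taken from the printed output
trainable_pct = 100 * trainable_params / all_params

print(f"LoRA params: {lora_params:,}")
print(f"trainable params: {trainable_params:,} ({trainable_pct:.4f}%)")
```

Under these assumptions most of the trainable weights belong to the classification head, and the LoRA matrices themselves account for 147,456 parameters.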


In [18]:
# 5. Training Arguments

import torch
from transformers import TrainingArguments, Trainer

# Check GPU memory before training
print("GPU STATUS BEFORE TRAINING:")
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

# Set training arguments
training_args = TrainingArguments(
    output_dir="./imdb-lora-model",    # Directory to save the trained model
    learning_rate=2e-4,                # Learning rate
    per_device_train_batch_size=8,     # Batch size per GPU
    per_device_eval_batch_size=8,      # Evaluation batch size per GPU
    num_train_epochs=3,                # Number of training epochs
    weight_decay=0.01,                 # Regularization
    eval_strategy="epoch",             # Evaluate every epoch
    save_strategy="epoch",             # Save model every epoch
    load_best_model_at_end=True,       # Load best model at the end
    logging_steps=100,                 # Log every 100 steps
    fp16=True,                         # Mixed precision for memory efficiency
    report_to="none"                   # Disable all logging integrations (TensorBoard, W&B, ...)
)

# Create Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer                # Saved alongside checkpoints (renamed processing_class in newer transformers releases)
)



# Final GPU memory check
print("GPU STATUS BEFORE STARTING TRAINING:")
print(f"Memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

GPU STATUS BEFORE TRAINING:
Memory allocated: 0.80 GB
Total memory: 14.74 GB
GPU STATUS BEFORE STARTING TRAINING:
Memory allocated: 0.79 GB
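
As a rough check on what these arguments imply, the number of optimizer steps can be worked out from the dataset size. This sketch assumes a single GPU and the default `gradient_accumulation_steps=1`; neither is stated in the notebook.

```python
import math

train_examples = 25_000     # size of the IMDb train split
batch_size = 8              # per_device_train_batch_size (single device assumed)
epochs = 3                  # num_train_epochs
logging_steps = 100

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps   # how often the training loss is logged

print(steps_per_epoch, total_steps, log_events)
```

With `eval_strategy="epoch"` and `save_strategy="epoch"`, evaluation and checkpointing would each happen three times over this run.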



In [19]:
# 6. Training

trainer.train()
print("TRAINING COMPLETED!")  # Notify that training has finished

Epoch,Training Loss,Validation Loss
1,0.2714,0.2985
2,0.2672,0.255064
3,0.2064,0.267084


TRAINING COMPLETED!
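
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the Trainer falls back to validation loss (lower is better) when choosing which checkpoint to reload. A minimal sketch of that selection, using the numbers from the table above:

```python
# (epoch, training loss, validation loss) copied from the training log
history = [
    (1, 0.2714, 0.2985),
    (2, 0.2672, 0.255064),
    (3, 0.2064, 0.267084),
]

# Pick the epoch with the lowest validation loss.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(f"best epoch: {best_epoch} (validation loss {best_val_loss})")

# Validation loss rises from epoch 2 to 3 while training loss keeps falling.
val_rose = history[2][2] > history[1][2]
train_fell = history[2][1] < history[1][1]
print(f"possible overfitting in epoch 3: {val_rose and train_fell}")
```

So the model evaluated and saved in the later cells should be the epoch 2 checkpoint, not the final epoch 3 weights.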


In [20]:
from sklearn.metrics import accuracy_score
import numpy as np

# compute_metrics function: convert logits to class predictions and report accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}


# Recreate the Trainer with compute_metrics so evaluate() also reports accuracy
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


test_results = trainer.evaluate(tokenized_datasets["test"])
print(f"🎯 TEST ACCURACY: {test_results['eval_accuracy']:.4f}")


🎯 TEST ACCURACY: 0.9030
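
For readers unfamiliar with the metric, here is a dependency-free sketch of what `compute_metrics` does: take the argmax of each logit row and compare it with the label. The logits and labels are invented for illustration; only the 0.9030 accuracy and the 25,000-example test split come from the notebook.

```python
def argmax(row):
    # Index of the largest value in a list of logits.
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four reviews: column 0 = class 0, column 1 = class 1.
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
toy_labels = [0, 1, 1, 1]

predictions = [argmax(row) for row in toy_logits]
print(predictions, accuracy(toy_logits, toy_labels))

# Approximate number of correctly classified test reviews at the reported accuracy.
approx_correct = round(0.9030 * 25_000)
print(approx_correct)
```

Since the reported accuracy is rounded to four decimals, the count of correct reviews is approximate to within a review or two.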


In [21]:
trainer.save_model("imdb-lora-model")  # For a PEFT model this writes the LoRA adapter weights and config, not the full base model

In [22]:
from transformers import pipeline

classifier = pipeline("text-classification", model="imdb-lora-model")

# Run the classifier on a few sample reviews
test_texts = [
    "This movie was absolutely fantastic!",
    "Terrible acting and boring story.",
    "One of the best films I've ever seen!",
    "That was amazing",
    "Worst film ever made",
    "Brilliant cinematography and acting"

]

for text in test_texts:
    result = classifier(text)
    print(f"🎬 '{text[:30]}...' → {result[0]['label']} ({(result[0]['score']*100):.1f}%)")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


🎬 'This movie was absolutely fant...' → LABEL_1 (99.8%)
🎬 'Terrible acting and boring sto...' → LABEL_0 (100.0%)
🎬 'One of the best films I've eve...' → LABEL_1 (99.9%)
🎬 'That was amazing...' → LABEL_1 (99.1%)
🎬 'Worst film ever made...' → LABEL_0 (99.8%)
🎬 'Brilliant cinematography and a...' → LABEL_1 (99.8%)
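
The pipeline reports `LABEL_0` and `LABEL_1` because the model config carries no label names. In the IMDb dataset, label 0 is negative and label 1 is positive, so the results can be translated after the fact. The sketch below uses a hypothetical `readable` helper and mapping names of my own choosing. Passing `id2label`/`label2id` to `from_pretrained` when the model is first created is the more common way to get readable labels.

```python
# IMDb convention: 0 = negative review, 1 = positive review.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def readable(result):
    # Convert a pipeline-style result such as {"label": "LABEL_1", "score": 0.998}.
    class_id = int(result["label"].split("_")[-1])
    return {"label": ID2LABEL[class_id], "score": result["score"]}

# Values in the same shape as the pipeline outputs above.
sample_results = [
    {"label": "LABEL_1", "score": 0.998},
    {"label": "LABEL_0", "score": 1.0},
]

for item in sample_results:
    converted = readable(item)
    print(f"{converted['label']} ({converted['score'] * 100:.1f}%)")
```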
