<a href="https://colab.research.google.com/github/neelsoumya/intro_to_LMMs/blob/main/fine_tune_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Simple Fine-Tuning of an Open Source LLM with Hugging Face

In [7]:
!pip install transformers datasets evaluate --quiet

## Load dataset (IMDB 1% sample)

This section takes a 1% sample of the IMDB movie-review training split (250 of 25,000 reviews) using load_dataset and divides it 80/20 into 200 training and 50 test examples.
Each example has a text field holding the review and a binary sentiment label (0 for negative, 1 for positive).

In [8]:
!pip install transformers --upgrade



In [9]:
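from datasets import load_dataset

# Load a small sentiment sample for demonstration. The IMDB train split appears to be
# grouped by label, so shuffle before sampling rather than taking a leading slice.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(250))  # 1% of 25,000
dataset = dataset.train_test_split(test_size=0.2, seed=42)
dataset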

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
})
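
The row counts in the output above follow from simple arithmetic, and it is worth checking the class balance of such a small sample. The sketch below is a minimal illustration in plain Python: `split_sizes` and `label_counts` are hypothetical helpers (not part of `datasets`), and the rounding rule for the test side is an assumption that happens not to matter for these numbers.

```python
import math
from collections import Counter

IMDB_TRAIN_SIZE = 25_000  # number of reviews in the IMDB train split


def split_sizes(n_total, percent, test_size):
    """Return (n_train, n_test) for a percent-sized subset followed by a split."""
    n_subset = n_total * percent // 100        # 1% of 25,000 -> 250 rows
    n_test = math.ceil(n_subset * test_size)   # assumed rounding: test side rounds up
    return n_subset - n_test, n_test


def label_counts(labels):
    """Count how many examples carry each label."""
    return dict(Counter(labels))


print(split_sizes(IMDB_TRAIN_SIZE, 1, 0.2))  # (200, 50)
print(label_counts([0, 0, 1, 0]))            # toy label list -> {0: 3, 1: 1}
```

On the real data, something like `label_counts(dataset["train"]["label"])` would show whether both sentiment classes are present in the sample.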

## Preprocess the data

The AutoTokenizer from transformers prepares the text data for the model. This involves:
- Tokenization: splitting the text into subword tokens from the model's vocabulary and mapping them to integer ids.
- Truncation and padding: giving every input the same length (512 tokens here) by truncating longer reviews and padding shorter ones.
- Formatting: dropping the raw text column and returning the remaining columns as PyTorch tensors.
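
To make the truncation and padding step concrete, here is a minimal sketch in plain Python. `pad_or_truncate` is a hypothetical helper written for illustration, the token ids are made up, and `pad_id=0` is an assumption. The real tokenizer does more than this, for example it keeps the special start and end tokens in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly max_length entries.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    kept = token_ids[:max_length]       # truncation: drop anything past max_length
    n_pad = max_length - len(kept)      # padding needed for short inputs
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short = pad_or_truncate([7, 8, 9], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)  # ([7, 8, 9, 0, 0], [1, 1, 1, 0, 0])
print(long)   # ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the ids.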

In [10]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets.set_format("torch")


## Load model and prepare training

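This step loads the pre-trained "distilbert-base-uncased" checkpoint using AutoModelForSequenceClassification, which adds a two-label classification head suited to tasks such as sentiment analysis. That head is newly initialized, which is why the warning under the cell recommends training the model before using it for predictions.
TrainingArguments defines the training parameters: the number of epochs, the per-device batch sizes, and the logging frequency (the per-epoch evaluation strategy is commented out in the cell).
The accuracy metric from evaluate measures the model's performance.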
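
The accuracy computation can be sketched without any libraries. The example below mirrors the logic of the notebook's `compute_metrics` (take the highest-scoring class per row, then compare with the labels), but `argmax` and `accuracy_from_logits` are hypothetical helpers and the logits are invented toy values.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy logits for four reviews over two classes (values are made up)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

Three of the four toy predictions match their labels, hence 0.75.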

In [11]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    # Disabled: recent transformers releases rename this argument to eval_strategy.
    # evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    logging_dir="./logs",
    logging_steps=10,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine-tune the model

The Trainer class from transformers fine-tunes the pre-trained model on the IMDB sample, updating the model's parameters so that it classifies movie reviews more accurately.
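
It can help to know how many optimizer steps to expect before starting a run. The sketch below works this out for the settings in this notebook (200 training examples, batch size 8, 2 epochs, logging every 10 steps). `training_steps` is a hypothetical helper, and it assumes a single device, no gradient accumulation, and that a final partial batch is kept.

```python
import math


def training_steps(n_examples, batch_size, epochs, logging_steps):
    """Return (steps_per_epoch, total_steps, n_log_events) for a single device."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # a partial last batch still counts
    total_steps = steps_per_epoch * epochs
    n_log_events = total_steps // logging_steps           # one log entry every logging_steps
    return steps_per_epoch, total_steps, n_log_events


print(training_steps(n_examples=200, batch_size=8, epochs=2, logging_steps=10))  # (25, 50, 5)
```

Under those assumptions the progress bar should run for 50 steps, with a loss value logged roughly every 10 steps.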

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

trainer.train()

## Evaluate the model

After training, trainer.evaluate() runs the fine-tuned model on the held-out test split and reports its loss and accuracy on data it did not see during training.

In [None]:
trainer.evaluate()
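
`trainer.evaluate()` returns a dictionary whose metric keys carry an `eval_` prefix, so the accuracy from `compute_metrics` appears as `eval_accuracy`. The sketch below shows one way to pull those entries out. `strip_eval_prefix` is a hypothetical helper, and the numbers are placeholders for illustration only, not results from this notebook.

```python
def strip_eval_prefix(metrics, prefix="eval_"):
    """Return only the prefixed entries, with the prefix removed from each key."""
    return {k[len(prefix):]: v for k, v in metrics.items() if k.startswith(prefix)}


# Placeholder values for illustration only; they are not results from this notebook
example_metrics = {"eval_loss": 0.5, "eval_accuracy": 0.8, "epoch": 2.0}
print(strip_eval_prefix(example_metrics))  # {'loss': 0.5, 'accuracy': 0.8}
```

With the real trainer this would be called as `strip_eval_prefix(trainer.evaluate())`.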