## Hands-On Example: Applying What You've Learned

In this notebook, you will apply the concepts covered in the previous sessions, including:

1. Understanding the Hugging Face ecosystem
2. Working with Transformer models
3. Implementing tokenization and embeddings
4. Using a pre-trained model for an NLP task

### Objective:

Fine-tune a pre-trained Transformer model (e.g., BERT) on a text classification task (sentiment analysis using the IMDB dataset). During this exercise, you will:
- Load and preprocess the dataset.
- Tokenize the input data.
- Apply a pre-trained model to extract embeddings.
- Fine-tune the model with memory-conscious settings (short sequences, small batches).

Let's get started!

In [None]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import torch

## Load and Explore the Dataset

We will use the IMDB dataset for binary sentiment classification. The goal is to classify movie reviews as positive or negative.

In [None]:
# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

# Display a sample from the dataset
print("Sample from the IMDB dataset:", dataset['train'][0])
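
Before training, it is worth checking how the labels are distributed. The sketch below is a minimal illustration that assumes each record is a dict with `text` and `label` fields, as in IMDB (0 = negative, 1 = positive). The `toy_reviews` list and the `label_counts` helper are hypothetical stand-ins so the cell runs without the dataset.

```python
from collections import Counter

def label_counts(examples):
    """Count how many examples carry each label value."""
    return Counter(example["label"] for example in examples)

# Hypothetical stand-in records with the same fields as IMDB ("text", "label").
toy_reviews = [
    {"text": "A moving, beautifully shot film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
    {"text": "Great cast, even better script.", "label": 1},
]

counts = label_counts(toy_reviews)
print(counts[0], "negative /", counts[1], "positive")
```

The same helper should work on `dataset['train']`, since iterating over a split yields one dict per example.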

## Tokenize the Data

Use a pre-trained tokenizer to convert the text data into token IDs that the model can understand. We will use the BERT tokenizer for this exercise.

In [None]:
# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased")

# Define a function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Apply tokenization to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Set the format for PyTorch
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
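
To see what `padding="max_length"` and `truncation=True` do to each review, here is a toy sketch. It is not the BERT tokenizer: it splits on whitespace and uses a made-up four-word vocabulary, whereas BERT uses WordPiece. Only the special-token IDs follow `bert-base-uncased`; `toy_vocab` and `toy_encode` are hypothetical names.

```python
# Special-token IDs as used by bert-base-uncased; word IDs below are invented.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102
toy_vocab = {"this": 4, "movie": 5, "was": 6, "great": 7}

def toy_encode(text, max_length):
    """Mimic padding='max_length' and truncation=True on a whitespace split."""
    word_ids = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = [CLS_ID] + word_ids + [SEP_ID]
    # Truncation: keep at most max_length tokens and always end with [SEP].
    if len(ids) > max_length:
        ids = ids[: max_length - 1] + [SEP_ID]
    # Padding: fill up to max_length; the mask marks real tokens with 1.
    num_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * num_pad,
        "attention_mask": [1] * len(ids) + [0] * num_pad,
    }

short = toy_encode("This movie was great", max_length=8)
long = toy_encode("This movie was great and this movie was long", max_length=8)
print(short)
print(long)
```

Every encoded example ends up with exactly `max_length` entries, which is what lets the `Trainer` stack them into fixed-size batches. The attention mask tells the model to ignore the padded positions.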

## Load a Pre-trained Model

Now, let's load a pre-trained BERT model for sequence classification. This model will be fine-tuned on the IMDB dataset.

In [None]:
# Load a pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased", num_labels=2)

## Configure Training Arguments

Set up the training arguments for fine-tuning the model. This includes the output directory, batch size, number of epochs, and logging settings.

In [None]:
# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)
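
These settings determine how long a run takes. As a rough sketch, assuming the 25,000-example IMDB training split and a single device, the number of optimizer steps per epoch and the number of log entries can be estimated as follows. `steps_per_epoch` is a hypothetical helper, not part of the `transformers` API.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))

imdb_train_size = 25_000  # size of the IMDB training split
steps = steps_per_epoch(imdb_train_size, per_device_batch_size=8)
log_events = steps // 10  # logging_steps=10 logs once every 10 steps

print(f"{steps} steps per epoch, about {log_events} log entries")
```

If several GPUs are visible to the process, the effective batch size grows with the device count and the step count shrinks accordingly.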

## Fine-Tune the Model

Use the Hugging Face `Trainer` API to fine-tune the model on the IMDB dataset.

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

In [None]:
# Fine-tune the model
trainer.train()

## Evaluate the Model

Evaluate the model's performance on the test set. Without a `compute_metrics` function, `trainer.evaluate()` reports the evaluation loss and runtime statistics rather than accuracy.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

print("Evaluation results:", eval_results)
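
To report accuracy as well, the `Trainer` accepts a `compute_metrics` callable that receives the model's logits and the true labels. The following is a minimal sketch in plain Python, assuming a `(logits, labels)` pair with one row of logits per example. The toy values are invented for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Invented logits for four reviews and their true labels.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make `trainer.evaluate()` include an `eval_accuracy` entry alongside `eval_loss`.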

## Hands-On Exercise: Fine-Tune Another Model

### Instructions:

1. Choose a different pre-trained model from the Hugging Face Hub (e.g., "distilbert-base-uncased").
2. Load and tokenize the dataset.
3. Configure training arguments.
4. Fine-tune the model.
5. Evaluate its performance.

**Questions to Consider:**

- How does the model's performance compare to BERT?
- What are the differences in memory usage between models?
- What steps can be taken to optimize memory usage further?
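
As a starting point for the memory questions, here is a back-of-the-envelope sketch. It assumes fp32 training with an Adam-style optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for the two optimizer moments). The parameter counts are the commonly quoted approximate figures, and activations, which depend on batch size and sequence length, are not included.

```python
# Assumption: fp32 weights (4 B) + gradients (4 B) + two Adam moments (8 B).
BYTES_PER_PARAM = 4 + 4 + 8

def training_state_gib(num_params):
    """Approximate memory for weights, gradients and optimizer state, in GiB."""
    return num_params * BYTES_PER_PARAM / 1024**3

# Approximate, commonly quoted parameter counts.
approx_params = {
    "bert-base-uncased": 110_000_000,
    "distilbert-base-uncased": 66_000_000,
}

estimates = {name: training_state_gib(n) for name, n in approx_params.items()}
for name, gib in estimates.items():
    print(f"{name}: ~{gib:.2f} GiB before activations")
```

Compare these estimates with the peak memory you actually observe. The gap is largely activation memory, which is what a shorter `max_length` or a smaller batch size reduces.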

Try it out below!


In [None]:
# Step 1: Choose a different model
model_dir = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased"  # Change to any other pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

# Step 2: Tokenize the dataset
# tokenize_function reads the global `tokenizer`, so it now uses the new model's tokenizer
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Step 3: Configure training arguments
training_args = TrainingArguments(
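    output_dir="./results_new_model",  # separate directory so the BERT run is not overwritten
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy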
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)

# Step 4: Fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

trainer.train()

# Step 5: Evaluate the model
eval_results = trainer.evaluate()

print("Evaluation results for the new model:", eval_results)

## Conclusion

In this hands-on example, you applied the concepts from the previous sessions to fine-tune a pre-trained Transformer model on a text classification task. You practiced loading a dataset, tokenizing text, configuring training arguments, and evaluating the fine-tuned model.

In [None]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)