## Hands-On Example: Applying What You've Learned

In this notebook, you will apply the concepts covered in the previous sessions, including:

1. Understanding the Hugging Face ecosystem
2. Working with Transformer models
3. Implementing tokenization and embeddings
4. Using a pre-trained model for an NLP task

### Objective:

Fine-tune a pre-trained Transformer model (e.g., BERT) on a text classification task (sentiment analysis using the IMDB dataset). During this exercise, you will:
- Load and preprocess the dataset.
- Tokenize the input data.
- Apply a pre-trained model to extract embeddings.
- Fine-tune the model using memory-efficient techniques.

Let's get started!

In [1]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import torch

## Load and Explore the Dataset

We will use the IMDB dataset for binary sentiment classification. The goal is to classify movie reviews as positive or negative.

In [2]:
# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

# Display a sample from the dataset
print("Sample from the IMDB dataset:", dataset['train'][0])

Sample from the IMDB dataset: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nu
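The sample above contains literal `<br />` tags left over from the original HTML reviews. The notebook tokenizes the text as-is; the following is an optional preprocessing sketch, not part of the original workflow. The helper name `strip_html_breaks` is hypothetical.

```python
import re

# Matches one or more consecutive <br>, <br/>, or <br /> tags plus trailing whitespace.
_BR_PATTERN = re.compile(r"(?:<br\s*/?>\s*)+", flags=re.IGNORECASE)

def strip_html_breaks(text):
    """Replace runs of HTML line-break tags with a single space."""
    return _BR_PATTERN.sub(" ", text).strip()

snippet = 'I really had to see this for myself.<br /><br />The plot is centered around Lena.'
print(strip_html_breaks(snippet))
```

If you want to apply this to the whole dataset, one option would be `dataset.map(lambda ex: {"text": strip_html_breaks(ex["text"])})` before tokenization.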

## Tokenize the Data

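Use a pre-trained tokenizer to convert the text data into token IDs that the model can understand. We will use the BERT tokenizer for this exercise. Each review is padded or truncated to exactly 128 tokens (`max_length=128`), so the end of longer reviews is discarded.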

In [3]:
# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased")

# Define a function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Apply tokenization to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Set the format for PyTorch
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
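To make the effect of `padding="max_length"`, `truncation=True`, and `max_length=128` more concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation; the function name `encode_fixed_length` is hypothetical, and the default IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the ones used by the `bert-base-uncased` vocabulary.

```python
def encode_fixed_length(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Toy illustration: truncate or pad a list of token IDs to max_length."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    num_padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * num_padding,
        "attention_mask": attention_mask + [0] * num_padding,
    }

short = encode_fixed_length([7, 8, 9], max_length=8)
print(short["input_ids"])       # [101, 7, 8, 9, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore padding positions, which is why `attention_mask` is kept as a column alongside `input_ids` and `label`.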

## Load a Pre-trained Model

Now, let's load a pre-trained BERT model for sequence classification. This model will be fine-tuned on the IMDB dataset.

In [4]:
# Load a pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Configure Training Arguments

Set up the training arguments for fine-tuning the model. This includes the output directory, batch size, number of epochs, and logging settings.

In [5]:
# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)
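The configuration above uses default precision and no gradient accumulation. If GPU memory is tight, a few `TrainingArguments` options can reduce the footprint. The sketch below collects them in a plain dictionary; the specific values are illustrative assumptions, not settings from this notebook, and `effective_batch_size` is a hypothetical helper.

```python
# Illustrative overrides for TrainingArguments; values are not tuned.
memory_saving_overrides = {
    "per_device_train_batch_size": 4,  # smaller micro-batch, fewer activations held in memory
    "gradient_accumulation_steps": 2,  # accumulate 2 micro-batches per optimizer step
    "fp16": True,                      # mixed-precision training; requires a CUDA GPU
    "gradient_checkpointing": True,    # recompute activations in the backward pass
}

def effective_batch_size(overrides, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        overrides["per_device_train_batch_size"]
        * overrides["gradient_accumulation_steps"]
        * num_devices
    )

print(effective_batch_size(memory_saving_overrides))  # 8
```

These could be passed as `TrainingArguments(output_dir="./results", **memory_saving_overrides)`. With a micro-batch of 4 and 2 accumulation steps, each optimizer step still sees 8 examples on a single device, matching the batch size used above.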

## Fine-Tune the Model

Use the Hugging Face `Trainer` API to fine-tune the model on the IMDB dataset.

In [6]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


[2025-09-08 10:22:08,075] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-08 10:22:10,166] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False


In [7]:
# Fine-tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2591        | 0.304622        |


TrainOutput(global_step=3125, training_loss=0.37470179908752443, metrics={'train_runtime': 203.8489, 'train_samples_per_second': 122.64, 'train_steps_per_second': 15.33, 'total_flos': 1644444096000000.0, 'train_loss': 0.37470179908752443, 'epoch': 1.0})
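The `global_step=3125` reported above follows directly from the dataset size and batch size. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation; the helper name `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_epochs=1):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
    return steps_per_epoch * num_epochs

# IMDB train split: 25,000 reviews; batch size 8; 1 epoch (as configured above).
steps = optimizer_steps(25_000, 8, num_epochs=1)
print(steps)  # 3125

# Cross-check against the reported runtime of 203.8489 seconds.
print(round(steps / 203.8489, 2))  # 15.33, matching train_steps_per_second
```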

## Evaluate the Model

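Evaluate the model's performance on the test set. Because no `compute_metrics` function was passed to the `Trainer`, the results include only the evaluation loss and throughput statistics, not accuracy.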

In [8]:
# Evaluate the model
eval_results = trainer.evaluate()

print("Evaluation results:", eval_results)

Evaluation results: {'eval_loss': 0.3046218454837799, 'eval_runtime': 43.3839, 'eval_samples_per_second': 576.25, 'eval_steps_per_second': 72.031, 'epoch': 1.0}
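The results above report loss and throughput only. To also obtain accuracy, one option is to pass a `compute_metrics` callable to the `Trainer`. The following is a minimal sketch in plain Python so it runs without extra dependencies; in practice a vectorized `argmax` from NumPy would be the more common choice. The toy logits and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a 1-D sequence of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def compute_metrics(eval_pred):
    # Trainer passes an EvalPrediction, which unpacks into (predictions, label_ids).
    logits, labels = eval_pred
    predicted = [argmax(row) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predicted, labels))
    return {"accuracy": correct / len(predicted)}

# Toy check with hand-written logits for four reviews (made-up values).
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make `trainer.evaluate()` include an `eval_accuracy` entry alongside `eval_loss`.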


## Hands-On Exercise: Fine-Tune Another Model

### Instructions:

1. Choose a different pre-trained model from the Hugging Face Hub (e.g., "distilbert-base-uncased").
2. Load and tokenize the dataset.
3. Configure training arguments.
4. Fine-tune the model.
5. Evaluate its performance.

**Questions to Consider:**

- How does the model's performance compare to BERT?
- What are the differences in memory usage between models?
- What steps can be taken to optimize memory usage further?
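As a starting point for the memory questions, here is a rough back-of-the-envelope estimate. It assumes full fp32 training with an Adam-style optimizer, so each parameter needs about 16 bytes (4 for the weight, 4 for the gradient, 8 for the two optimizer moments). Activations are excluded, and the parameter counts are the approximate published figures (about 110M for BERT-base, about 66M for DistilBERT), not values measured in this notebook.

```python
BYTES_PER_PARAM_FP32_ADAM = 4 + 4 + 8  # weight + gradient + two Adam moments

def training_state_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP32_ADAM):
    """Approximate memory (in GB) for weights, gradients, and optimizer state."""
    return num_params * bytes_per_param / 1e9

# Approximate parameter counts (assumed, see lead-in).
approx_params = {
    "bert-base-uncased": 110_000_000,
    "distilbert-base-uncased": 66_000_000,
}

for name, count in approx_params.items():
    print(f"{name}: ~{training_state_gb(count):.2f} GB")
```

Under these assumptions the smaller model needs roughly 40% less memory for its training state; actual usage also depends on batch size, sequence length, and precision.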

Try it out below!


In [9]:
# Step 1: Choose a different model
model_dir = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased"  # Change to any other pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

# Step 2: Tokenize the dataset
# tokenize_function reads the global `tokenizer`, so it now uses the tokenizer loaded in Step 1
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Step 3: Configure training arguments
training_args = TrainingArguments(
    output_dir="./results_distilbert",  # separate directory so checkpoints from the BERT run in ./results are kept
    eval_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
)

# Step 4: Fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

trainer.train()

# Step 5: Evaluate the model
eval_results = trainer.evaluate()

print("Evaluation results for the new model:", eval_results)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2802        | 0.325185        |


Evaluation results for the new model: {'eval_loss': 0.3251846134662628, 'eval_runtime': 23.185, 'eval_samples_per_second': 1078.283, 'eval_steps_per_second': 134.785, 'epoch': 1.0}
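To answer the first question in the exercise, the two evaluation outputs can be compared directly. The sketch below copies the numbers reported in this notebook; the helper name `compare` is hypothetical. Keep in mind this is a single one-epoch run per model, not a rigorous benchmark.

```python
# Values copied from the evaluation outputs printed in this notebook.
bert = {"eval_loss": 0.3046218454837799, "eval_runtime": 43.3839}
distilbert = {"eval_loss": 0.3251846134662628, "eval_runtime": 23.185}

def compare(baseline, candidate):
    """Loss difference and evaluation speedup of candidate relative to baseline."""
    return {
        "loss_increase": candidate["eval_loss"] - baseline["eval_loss"],
        "eval_speedup": baseline["eval_runtime"] / candidate["eval_runtime"],
    }

comparison = compare(bert, distilbert)
print(f"Validation loss increase: {comparison['loss_increase']:.4f}")
print(f"Evaluation speedup: {comparison['eval_speedup']:.2f}x")
```

In this run, DistilBERT trades a slightly higher validation loss for a noticeably shorter evaluation time.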


## Conclusion

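In this hands-on example, you applied the concepts from previous sessions to fine-tune a pre-trained Transformer model on a text classification task. You practiced loading a dataset, tokenizing text, configuring training arguments, and fine-tuning and evaluating BERT and DistilBERT with the `Trainer` API. After one epoch, BERT reached a lower validation loss (0.305 vs. 0.325), while DistilBERT evaluated nearly twice as fast.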

In [10]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}