# **Fine-Tuning BERT with Hugging Face**
### **Kaggle Notebook**
Author: *Rafael Hidalgo*  
Date: *03/02/2025*  

## **1. Introduction**
This notebook demonstrates how to fine-tune a BERT model for sentiment analysis using the IMDb dataset. We will use Hugging Face's `transformers` and `datasets` libraries to:
- Preprocess and tokenize the dataset
- Train a BERT model for text classification
- Debug and optimize training performance
- Evaluate the fine-tuned model using key metrics
- Explore potential real-world applications


## **2. Install and Import Dependencies**

In [1]:
!pip install transformers datasets torch scikit-learn



In [2]:
# Import necessary libraries
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

## **3. Load and Prepare the IMDb Dataset**

In [3]:
# Load dataset
dataset = load_dataset('imdb')

# Load tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# Apply tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Rename the label column
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Convert dataset to PyTorch format
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Subset the dataset for quick training
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))
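
To make the `padding="max_length"`, `truncation=True`, and `max_length=128` settings concrete, here is a simplified sketch of what they do to a single sequence of token ids. It is an illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the input ids are arbitrary, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the ones used by the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the bert-base-uncased vocab

def pad_or_truncate(token_ids, max_length=128):
    """Hypothetical helper: mimic max_length padding/truncation for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut whatever does not fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get attention 1, padding positions get 0.
    attention_mask = [1] * len(input_ids)
    pad_count = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad_count
    attention_mask += [0] * pad_count
    return input_ids, attention_mask

# A short sequence is padded; a long one is truncated.
short_ids, short_mask = pad_or_truncate([2023, 3185, 2001], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1300)), max_length=128)
print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

Many IMDb reviews are longer than 128 tokens, so truncation discards the end of those reviews. Raising `max_length` (up to BERT's 512-position limit) trades speed and memory for more context.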



## **4. Load Pre-Trained BERT Model**

In [4]:
# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
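
The warning above is expected: the checkpoint contains only the pre-trained encoder, so the two-class `classifier` layer starts from random weights. A freshly initialized head predicts roughly at chance, and the cross-entropy of a 50/50 guess over two classes is ln 2 ≈ 0.693. The sketch below works that number out in plain Python (`cross_entropy_from_logits` is an illustrative helper, not part of `transformers`). The first losses logged during training in Section 6 (0.7142 and 0.6808) sit close to this value.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Cross-entropy of one example given raw class scores (logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

# An uninformative two-class head gives both classes the same score.
chance_loss = cross_entropy_from_logits([0.0, 0.0], label=1)
print(f"{chance_loss:.4f}")  # ln(2) ≈ 0.6931
```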


## **5. Define Training Arguments**

In [5]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate once per epoch (newer transformers releases name this `eval_strategy`)
    logging_strategy="steps",  # Log on a step interval rather than once per epoch
    logging_steps=10,  # Log every 10 steps
    save_strategy="epoch",  # Save a checkpoint at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    save_total_limit=2,  # Keep only the 2 most recent checkpoints
    report_to="none",  # Disable reporting to integrations such as TensorBoard
    fp16=torch.cuda.is_available(),  # Mixed precision needs a GPU; fall back to full precision on CPU
)
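
The arguments above set only the peak `learning_rate`. Unless a scheduler is configured, `Trainer` decays it linearly to zero over training with no warmup (to the best of my knowledge this is the library default; check the documentation for your installed version). The sketch below reproduces that schedule in plain Python. `linear_lr` is a hypothetical helper, and `total_steps=189` is the `global_step` reported by the training run in Section 6.

```python
def linear_lr(step, total_steps, peak_lr=2e-5, warmup_steps=0):
    """Hypothetical helper: linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 189  # global_step reported by trainer.train() in this notebook
for step in (0, 63, 126, 189):  # start of training and the end of each epoch
    print(step, f"{linear_lr(step, total_steps):.2e}")
```

By the last epoch the learning rate is already a fraction of its starting value, which is one reason late-epoch updates tend to be smaller.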




## **6. Define Trainer and Train the Model**

In [6]:
import torch

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e
[output truncated]

In [7]:
from transformers import TrainerCallback

class ConsoleLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "loss" in logs:
            print(f"Step {state.global_step} | Loss: {logs['loss']:.4f}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[ConsoleLoggingCallback()]  # Attach callback
)

trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.4624 | 0.398827 |
| 2 | 0.3579 | 0.442025 |
| 3 | 0.2183 | 0.36144 |


Step 10 | Loss: 0.7142
Step 20 | Loss: 0.6808
Step 30 | Loss: 0.6656
Step 40 | Loss: 0.6472
Step 50 | Loss: 0.5845
Step 60 | Loss: 0.4624

Step 70 | Loss: 0.4357
Step 80 | Loss: 0.3729
Step 90 | Loss: 0.3631
Step 100 | Loss: 0.3551
Step 110 | Loss: 0.3146
Step 120 | Loss: 0.3579

Step 130 | Loss: 0.3655
Step 140 | Loss: 0.2766
Step 150 | Loss: 0.2616
Step 160 | Loss: 0.2063
Step 170 | Loss: 0.2943
Step 180 | Loss: 0.2183




TrainOutput(global_step=189, training_loss=0.41055599406913473, metrics={'train_runtime': 102.0539, 'train_samples_per_second': 58.792, 'train_steps_per_second': 1.852, 'total_flos': 394666583040000.0, 'train_loss': 0.41055599406913473, 'epoch': 3.0})
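
The `global_step=189` above is worth a quick sanity check. With 2,000 training examples, a per-device batch size of 16, and 3 epochs, a single GPU would take 375 optimizer steps. 189 steps is what you get with an effective batch size of 32, which suggests the session ran on two GPUs (an inference from the numbers, not something recorded in the notebook). A minimal sketch of the arithmetic, with `total_optimizer_steps` as a hypothetical helper and no gradient accumulation assumed:

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_devices, num_epochs):
    """Hypothetical helper: optimizer steps when the last partial batch is kept."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

print(total_optimizer_steps(2000, 16, num_devices=1, num_epochs=3))  # 375
print(total_optimizer_steps(2000, 16, num_devices=2, num_epochs=3))  # 189
```

This matters when comparing runs: doubling the number of devices doubles the effective batch size, so the same number of epochs yields half as many weight updates.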

## **7. Debugging Issues During Training**

### **Possible Issues & Solutions**
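- **Overfitting** (training loss keeps falling while validation loss rises): Reduce the number of epochs, increase dropout, or train on more of the dataset.
- **Underfitting** (both training and validation loss stay high): Train for more epochs or adjust the learning rate.
- **Long training time**: Use a distilled checkpoint such as `distilbert-base-uncased` instead of `bert-base-uncased` for a smaller, faster model. This also means loading the matching model and tokenizer classes, for example via `AutoModelForSequenceClassification` and `AutoTokenizer`.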

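To experiment, try the following, then re-create the `Trainer` before calling `train()` again so the new values are reliably picked up:
```python
training_args.num_train_epochs = 5  # Train longer if the model is underfitting
training_args.per_device_train_batch_size = 8  # Use a smaller batch if the GPU runs out of memory
```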
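
One way to spot overfitting is to compare the per-epoch losses reported by `trainer.train()`. The sketch below uses the values from the training output in Section 6 and flags epochs where training loss fell while validation loss rose. `flag_overfitting_epochs` is a hypothetical helper, and a single noisy epoch on a 500-example evaluation set is weak evidence on its own.

```python
# (epoch, training loss, validation loss) copied from the training output in Section 6
history = [
    (1, 0.4624, 0.398827),
    (2, 0.3579, 0.442025),
    (3, 0.2183, 0.361440),
]

def flag_overfitting_epochs(history):
    """Return epochs where training loss dropped but validation loss increased."""
    flagged = []
    for (_, prev_train, prev_val), (epoch, train, val) in zip(history, history[1:]):
        if train < prev_train and val > prev_val:
            flagged.append(epoch)
    return flagged

# The epoch with the lowest validation loss is the natural checkpoint to keep.
best_epoch = min(history, key=lambda row: row[2])[0]
print("Possible overfitting at epochs:", flag_overfitting_epochs(history))
print("Lowest validation loss at epoch:", best_epoch)
```

Here validation loss recovers in epoch 3, so the run does not show sustained overfitting. `transformers` also provides an `EarlyStoppingCallback` that can automate this kind of check.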


## **8. Evaluate Model Performance**

In [8]:
# Evaluate the model
eval_result = trainer.evaluate()
print(f"Evaluation results: {eval_result}")



Evaluation results: {'eval_loss': 0.3614395558834076, 'eval_runtime': 2.4274, 'eval_samples_per_second': 205.982, 'eval_steps_per_second': 6.591, 'epoch': 3.0}


In [9]:
# Define compute metrics function
def compute_metrics(pred):
    predictions = pred.predictions  # Extract predictions
    labels = pred.label_ids  # Extract true labels
    predictions = np.argmax(predictions, axis=1)  # Get predicted class

    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    return {"accuracy": acc, "f1_score": f1}

# Make predictions
eval_predictions = trainer.predict(test_dataset)

# Compute evaluation metrics
metrics = compute_metrics(eval_predictions)

print(f"Final Evaluation Metrics: {metrics}")


Final Evaluation Metrics: {'accuracy': 0.846, 'f1_score': 0.8481262327416175}
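
For readers who want to see what `accuracy_score` and `f1_score` compute, here is a small worked example on made-up labels (not the notebook's predictions). `binary_metrics` is a hypothetical helper that treats class 1 as the positive class, which matches the default behaviour of `f1_score` for binary labels.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall, and F1 with class 1 as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}

toy_labels      = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth
toy_predictions = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model output
toy_metrics = binary_metrics(toy_labels, toy_predictions)
print(toy_metrics)
```

To have these metrics reported automatically at every evaluation, the notebook's `compute_metrics` function can be passed to the `Trainer` constructor through its `compute_metrics` argument.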


## **9. Apply Model to Real-World Task**

In [10]:
# Example text inputs
texts = ["This movie was fantastic! I loved every moment.", 
         "The film was terrible. I regret watching it."]

# Tokenize inputs
inputs = tokenizer(texts, padding="max_length", truncation=True, max_length=128, return_tensors="pt")

# Ensure inputs are moved to the correct device (if using GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = {key: val.to(device) for key, val in inputs.items()}
model.to(device)

# Make predictions (eval mode disables dropout so outputs are deterministic)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu()  # Move predictions back to CPU

# Print results
for text, pred in zip(texts, predictions):
    label = "Positive" if pred == 1 else "Negative"
    print(f"Review: {text} \nPredicted Sentiment: {label}\n")


Review: This movie was fantastic! I loved every moment. 
Predicted Sentiment: Positive

Review: The film was terrible. I regret watching it. 
Predicted Sentiment: Negative
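
`torch.argmax` returns only the winning class. To report how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below shows the computation in plain Python on made-up logits (the values are illustrative, not the model's actual outputs).

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [[-2.0, 2.0], [1.5, -1.5]]  # hypothetical [negative, positive] scores
for logits in example_logits:
    probs = softmax(logits)
    label = "Positive" if probs[1] > probs[0] else "Negative"
    print(f"{label} (confidence {max(probs):.3f})")
```

With the notebook's tensors, the equivalent is `torch.softmax(outputs.logits, dim=-1)`.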



## **10. Conclusion**

In this notebook, we:
- Fine-tuned `bert-base-uncased` on a 2,000-review subset of the IMDb dataset
- Listed common training issues and possible fixes
- Evaluated the model on 500 test reviews, reaching 0.846 accuracy and 0.848 F1-score
- Applied the model to classify unseen text

### **Next Steps:**
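- Try different tasks and datasets (e.g., SQuAD for question answering, which needs a question-answering head such as `BertForQuestionAnswering` instead of the classification head used here)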
- Experiment with hyperparameters for better accuracy
- Deploy the model as an API for real-world applications

**Thank you for exploring BERT with me! 🚀**
