# 🧠 Text Classification with Hugging Face 🤗


Welcome! 👋 In this notebook, we’ll explore how to **fine-tune a DistilBERT Transformer model** for **emotion classification** using the powerful 🤗 Hugging Face ecosystem.

By the end, you’ll be able to:
- Load and understand a real-world emotion dataset
- Tokenize text for Transformers
- Fine-tune `DistilBERT` on labeled text
- Evaluate and predict with your custom model
- Push our fine-tuned model to the Hugging Face Hub 🤗

Let’s get started!


In [None]:
# Uncomment the next line if you're running this notebook locally
# and haven't already installed Hugging Face Transformers.
# A recent release is needed: `eval_strategy` (used in Step 3) is not
# recognized by old versions such as 4.13.0.
#!pip install -U transformers
!pip install -U datasets huggingface_hub fsspec

## 📂 Step 1: Load the Dataset


We'll use the 🤗 `datasets` library — a powerful and flexible tool for accessing, inspecting, and preprocessing datasets directly from the Hugging Face Hub.
> It allows for efficient streaming, filtering, and preprocessing — all within your notebook!


Each dataset on the Hub is identified by a unique name. We'll load the `emotion` dataset with `load_dataset()`. It contains short English text samples (tweets), each labeled with one of six emotions such as "joy", "anger", or "sadness".


🧾 **About the Dataset**:  
- It contains 3 splits: `train`, `validation`, and `test`
- Each example is a tweet labeled with one of six basic emotions
- The `label` field is an integer index mapped to emotion names

Perfect for learning how to fine-tune language models for classification tasks!


Each example has two fields:
- `text`: the tweet
- `label`: an integer mapped to an emotion


In [None]:
from datasets import load_dataset

# Load the emotion dataset from Hugging Face Hub
emotionDataset = load_dataset("emotion")

# Print type and summary of the dataset
print("📦 Data Set Type:", type(emotionDataset))
print(emotionDataset)

💡 We import the load_dataset() function and fetch the "emotion" dataset. Notice it returns a dictionary-like object with keys for each split: train, validation, and test.

Let’s inspect what we just loaded 📊

The `emotionDataset` object behaves like a Python dictionary, with keys corresponding to each split (`train`, `validation`, `test`).


We can access each split like a normal dictionary:

```python
emotionTrainSet = emotionDataset["train"]
```


📌 Inspect the Training Set (Code)


In [None]:
# Get the training split
emotionTrainSet = emotionDataset["train"]

# View dataset summary
print(emotionTrainSet)
print("📊 Size of Training Set:", emotionTrainSet.num_rows)
print("🧱 Column Names:", emotionTrainSet.column_names)
print("🔍 Column Features:", emotionTrainSet.features)

🎯 The next cells preview the first few samples in the training set. Every sample is a tweet and its corresponding label.

In [None]:
# Preview the first training example
print(emotionTrainSet[0])

In [None]:
# View first 5 text samples (tweets)
print(emotionTrainSet[:5]["text"])

In [None]:
print(emotionTrainSet[:5]["label"])

🧠 This gives us the raw input the model will learn from after transformation. We will see this in the following cells.

🎯 Each number represents a class (like 0 = sadness, 1 = joy, etc.)
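As a minimal sketch of that integer-to-name mapping, the snippet below uses a plain Python list in place of the dataset's label feature. The class order is assumed to match the six names listed later in this notebook (`sadness`, `joy`, `love`, `anger`, `fear`, `surprise`), and the helper names `int_to_name` and `name_to_int` are hypothetical.

```python
# Assumed class order: position in this list = integer label
EMOTION_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def int_to_name(label_id):
    """Map an integer label to its emotion name."""
    return EMOTION_NAMES[label_id]

def name_to_int(name):
    """Map an emotion name back to its integer label."""
    return EMOTION_NAMES.index(name)

print(int_to_name(0))       # sadness
print(name_to_int("joy"))   # 1
```

In the notebook itself, the same lookup is available through `emotionDataset["train"].features["label"].int2str(...)`, which the error analysis section uses.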



## 🧪 Step 2: From Text to Tokens

Before feeding text into our Transformer model, we need to tokenize it.

Tokenization breaks the text down into smaller pieces — and in Transformers, we use **subword tokenization** which offers a nice balance between vocabulary size and flexibility.

There are three common types of tokenization:
- 🔤 Character-level: one token per character
- 🧱 Word-level: one token per word
- 🧬 Subword-level (used in BERT/DistilBERT): flexible hybrid method
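To make the three strategies concrete, here is a small pure-Python sketch. The subword part uses greedy longest-match splitting in the WordPiece style that BERT-family models rely on, but the vocabulary `TOY_VOCAB` and the helper `subword_tokenize` are invented for illustration, so the real DistilBERT tokenizer will split text differently.

```python
text = "tokenizing text"

# Character-level: one token per character
char_tokens = list(text)

# Word-level: one token per whitespace-separated word
word_tokens = text.split()

# Subword-level: greedy longest-match against a toy vocabulary.
# Pieces that continue a word are prefixed with "##" (WordPiece convention).
TOY_VOCAB = {"token", "##izing", "text", "[UNK]"}

def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining piece first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, TOY_VOCAB)]

print(len(char_tokens))   # 15
print(word_tokens)        # ['tokenizing', 'text']
print(subword_tokens)     # ['token', '##izing', 'text']
```

Note how the subword split keeps the vocabulary small (no separate entry for "tokenizing") while still avoiding an unknown token.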

➡️ We’ll use Hugging Face’s `AutoTokenizer` to convert text to token IDs.

🧠 This tokenizer converts raw text into token IDs using a pretrained vocabulary and includes special tokens for padding, classification, and separation.

In [None]:
from transformers import AutoTokenizer
from transformers import DistilBertTokenizer

# Set the pretrained model checkpoint
model_checkpoint = "distilbert-base-uncased"

# Load tokenizer using AutoTokenizer (recommended)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Alternative: use the model-specific tokenizer class
#distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)

# Display tokenizer configuration and metadata
print("🧠 Tokenizer Vocabulary Size: ", tokenizer.vocab_size)
print("🪟 Max Input Length (Context Window): ", tokenizer.model_max_length)
print("🔖 Special Token IDs: ", tokenizer.all_special_ids)
print("🔤 Special Token Names: ", tokenizer.all_special_tokens)
print("📎 Model Inputs: ", tokenizer.model_input_names)

📌 Tokenize a Sample Sentence

In [None]:
# Example sentence
text = "This tutorial will show you how Hugging Face models can be fine-tuned using Transformers and PyTorch!"

# Tokenize the sentence into IDs and structure
encodedText = tokenizer(text)

# Print the encoded structure
print("🧾 Text Encoded:", encodedText)


🔍 We see how the tokenizer breaks the sentence into IDs and adds metadata like attention masks.

📌 Let's visualize the tokens. This helps us verify how the model sees the text, including the [CLS] and [SEP] tokens.

In [None]:
# Convert token IDs back into actual tokens

print("Tokens: ",tokenizer.convert_ids_to_tokens(encodedText["input_ids"]))

In [None]:
# Reconstruct the original sentence from token IDs

decodedText = tokenizer.decode(encodedText["input_ids"])
print("🔁 Text Decoded:", decodedText)

🎯 This confirms that the tokenizer can reverse the process — useful for debugging model outputs

### 🗂️ Tokenizing the Entire Dataset

To prepare our data for training, we need to tokenize all the text in our dataset. We'll do this efficiently using the `map()` method from Hugging Face Datasets.


In [None]:
# Reset the dataset format in case it was modified earlier
#⚠️This clears any previous formatting (e.g., PyTorch/TensorFlow formatting) and resets to default.
emotionDataset.reset_format()

In [None]:
# Define a function that tokenizes batches of text

def tokenizeBatch(batch):
  return tokenizer(batch["text"], padding=True, truncation=True)

🧠 We pad and truncate to make input lengths uniform — critical for batch processing.

🔎 Let's verify that our tokenizer works on multiple examples and outputs the expected fields.

In [None]:
# Preview the raw text batch
print("📝 Original Text Batch:\n", emotionDataset["train"][:4])

# Apply the tokenization function to that batch
print("🔢 Tokenized Text Batch:\n", tokenizeBatch(emotionDataset["train"][:4]))


Let’s now tokenize the **entire dataset**.  
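🚀 Using `map()` with `batched=True` passes many examples to the tokenizer at once, which is much faster than tokenizing one example at a time. With `batch_size=None`, each split is processed as a single batch, so `padding=True` pads every example in that split to the same length.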


In [None]:
# Apply the tokenization function to all dataset splits

emotionsDatasetEncoded = emotionDataset.map(tokenizeBatch, batched=True, batch_size=None)

📌 The tokenized dataset includes new fields like input_ids and attention_mask, which are required by Transformers.

In [None]:
# Print original and tokenized dataset columns
print("📋 Original Dataset Columns:", emotionDataset["train"].column_names)
print("📋 Dataset Columns:", emotionsDatasetEncoded["train"].column_names)


### 🧾 What Does the Tokenizer Return?

After tokenization, each example now contains:
- `input_ids`: the token ID sequence
- `attention_mask`: tells the model which tokens are actual input (1) and which are padding (0)
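The relationship between padding and the attention mask can be sketched in a few lines of plain Python. This is only an illustration: `pad_and_truncate` is a hypothetical helper, the token IDs are placeholders, and the padding ID of 0 is an assumption for this toy example.

```python
PAD_ID = 0  # assumed padding ID for this toy example

def pad_and_truncate(sequences, max_length=None):
    """Pad every sequence to the longest one (after an optional length cap)."""
    if max_length is not None:
        # Naive truncation; real tokenizers also keep the closing special token
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Both lists end up rectangular, which is what allows a batch to be stacked into a single tensor.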


## 🏗️ Step 3: Training a Text Classifier


🤖 The `DistilBERT` checkpoint was pretrained for **masked language modeling**, not classification.  
So we need to add a classification head on top of the model.

There are two main options when adapting pretrained models for classification:

- 🔍 **Feature Extraction**:  
  Freeze the pretrained model and use its hidden states to train a separate classifier (like transfer learning in CNNs).

- 🔧 **Fine-Tuning**:  
  Train the entire model end-to-end — including the pretrained layers.

> We’ll use **fine-tuning** in this notebook for better accuracy.

✅ **Pro Tip**: You can also explore **PEFT (Parameter-Efficient Fine-Tuning)** methods like LoRA, AdapterFusion, or BitFit to reduce compute and memory cost.


### 🧩 Option 1: Load the Pretrained DistilBERT Model

We’ll now load the base DistilBERT model using Hugging Face’s `AutoModel` class.

> This gives us access to the model's raw hidden states, without any classification head. 🔍 Using AutoModel instead of AutoModelForSequenceClassification lets us access the model's internal representations (hidden states) directly.


In [None]:
import torch
from transformers import AutoModel

# Check if a GPU is available; fallback to CPU if not
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("📟 Device chosen:", device)

# Load the pretrained DistilBERT model and move it to the appropriate device
# Reminder: model_checkpoint is "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_checkpoint).to(device)

#### 🧠 Extracting the Last Hidden States

Let’s warm up by retrieving the **last hidden states** from DistilBERT for a single input text.

To do this, we must convert our input text into **PyTorch-compatible tensors** using the tokenizer.

> This is done by passing `return_tensors="pt"` to the tokenizer.


In [None]:
# Print the original example sentence
print("📝 Original example text:", text)

# Preview its tokenized version (ID format)
print("🔢 Tokenized as ID map:", encodedText)

In [None]:
# Tokenize again, but return as PyTorch tensors
encodedTextPT = tokenizer(text, return_tensors="pt")

# Print input tensor shape and structure
print("📐 Encoded tensor shape:", encodedTextPT["input_ids"].size())
print("📦 Tokenized Tensor Dictionary:", encodedTextPT)

💡 Now our input is in the format expected by PyTorch models — a dictionary of tensors.

In [None]:
# Function to get hidden states from a batch input
def get_hidden_state(batch):
    # Only keep the keys the model needs (input_ids, attention_mask)
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}

    #Disable gradient tracking for inference
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    #return last_hidden_state

    # Return only the representation of the first token ([CLS])
    # Through self-attention, the [CLS] vector draws on every token in the sequence
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}

# Test on our single input
print("📤 DistilBERT Hidden State Output:\n", get_hidden_state(encodedTextPT))

#print(get_hidden_state(encodedTextPT).size())


🧠 The embedding of the first token, [CLS], is commonly used to represent the entire sequence in classification applications. The reason is that, through self-attention, this embedding aggregates information from all other tokens in the sequence.

🔍 Tensor Dimensions Breakdown

The output shape of `last_hidden_state` is: [batch_size, sequence_length, embedding_dimension]


Where:
- `batch_size`: number of inputs in a batch
- `sequence_length`: number of tokens per input
- `embedding_dimension`: size of each token vector (768 for `distilbert-base-uncased`)

We use `torch.no_grad()` to disable gradient calculations (since we’re just extracting values, not training).
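The shape breakdown above, and the `[:, 0]` slice used in `get_hidden_state()`, can be mimicked with nested Python lists. This is a toy sketch with made-up sizes (2 sequences, 5 tokens, 8 dimensions) and random values, not real model output.

```python
import random

batch_size, sequence_length, embedding_dimension = 2, 5, 8  # toy sizes

random.seed(0)
# Nested lists standing in for a [batch, sequence, embedding] tensor
last_hidden_state = [
    [[random.random() for _ in range(embedding_dimension)]
     for _ in range(sequence_length)]
    for _ in range(batch_size)
]

# Equivalent of last_hidden_state[:, 0]: the first token of every sequence
first_token_vectors = [sequence[0] for sequence in last_hidden_state]

print(len(first_token_vectors), len(first_token_vectors[0]))  # 2 8
```

The sequence dimension disappears: each input is reduced to a single vector, which is what makes it usable as a fixed-size feature for a classifier.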


### 💡 Apply to All Data

Let’s now compute the hidden states for the entire dataset. We’ll use `map()` to apply our `get_hidden_state()` function to every example.

> This creates a new column called `hidden_state` in each dataset split.

🔄 Prepare Dataset for PyTorch

Before applying the model, we need to convert the dataset columns into PyTorch tensors.


In [None]:
# Convert key columns to torch.Tensor for compatibility with model input
emotionsDatasetEncoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Confirm new format
emotionsDatasetEncoded

In [None]:
# 📌 Print a training example's inputs and label
print("📜 Text Example:", emotionsDatasetEncoded['train']['text'][0])
print("🏷️ Label:", emotionsDatasetEncoded['train']['label'][0])
print("🔢 Input IDs:", emotionsDatasetEncoded['train']['input_ids'][0])
print("🎭 Attention Mask:", emotionsDatasetEncoded['train']['attention_mask'][0])

Let’s apply `get_hidden_state()` to all examples using `batched=True` for faster performance.

> ⏱️ **Note**: Running the next cell on CPU can be slow. GPU acceleration is highly recommended.


In [None]:
# Apply hidden state extraction to all dataset splits using batched processing
emotionsDatasetEncoded= emotionsDatasetEncoded.map(get_hidden_state, batched=True)

# Show structure
emotionsDatasetEncoded

🧠 We did not pass `batch_size` here, so `map()` used its default of `batch_size=1000`. (Passing `batch_size=None`, as we did during tokenization, would instead process each split as a single batch.)

This batch size is used when applying our custom function (like `get_hidden_state`) across examples.
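To see what the batch size controls, here is a rough pure-Python sketch of the chunking. In `datasets`, `map(..., batched=True)` defaults to `batch_size=1000`, while `batch_size=None` hands the whole split to the function as one batch. The helper `iter_batches` is hypothetical and only imitates that behavior on a plain list.

```python
def iter_batches(examples, batch_size=1000):
    """Yield consecutive slices; batch_size=None yields everything at once."""
    if batch_size is None:
        yield examples
        return
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(2500))  # stand-in for 2,500 dataset rows

default_sizes = [len(b) for b in iter_batches(examples)]
single_sizes = [len(b) for b in iter_batches(examples, batch_size=None)]

print(default_sizes)  # [1000, 1000, 500]
print(single_sizes)   # [2500]
```

Smaller batches keep memory use bounded when each batch is sent through the model, which matters far more for `get_hidden_state()` than it did for tokenization.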


🚀 Each example now includes a new hidden_state vector — a numerical representation of its text, ready for downstream tasks like classification.

📌 We confirm that the dataset now contains hidden representations (hidden_state) alongside the raw and tokenized text.

In [None]:
# View available columns in the tokenized dataset
print("🧾 Column Names:", emotionsDatasetEncoded.column_names)


# Preview the first example in the training set
print("📝 Text:", emotionsDatasetEncoded['train']['text'][0])
print("🏷️ Label:", emotionsDatasetEncoded['train']['label'][0])
print("🔢 Input IDs:", emotionsDatasetEncoded['train']['input_ids'][0])
print("🎭 Attention Mask:", emotionsDatasetEncoded['train']['attention_mask'][0])
print("🔮 Hidden State Vector:", emotionsDatasetEncoded['train']['hidden_state'][0])



### 🧱 Creating a Feature Matrix for Classification

Now that we’ve generated vector representations for each sentence, we’re ready to train a classifier.

We’ll use these hidden states as **input features** and the emotion labels as **targets**.

The encoded dataset already contains everything we need.

We'll convert the hidden state vectors and their corresponding labels into NumPy arrays — perfect for training models in `Scikit-learn`.


In [None]:
import numpy as np

# Extract features (X) and labels (y) for training and validation
emotionXTrain = np.array(emotionsDatasetEncoded["train"]["hidden_state"])
emotionXVal = np.array(emotionsDatasetEncoded["validation"]["hidden_state"])

emotionYTrain = np.array(emotionsDatasetEncoded["train"]["label"])
emotionYVal = np.array(emotionsDatasetEncoded["validation"]["label"])

# Display the shapes of the resulting matrices
print("📐 Training set shape:", emotionXTrain.shape)
print("📐 Validation set shape:", emotionXVal.shape)

📊 We now have numerical matrices ready to feed into traditional classifiers like `Logistic Regression`.



In [None]:
from sklearn.linear_model import LogisticRegression

# Train logistic regression on the sentence embeddings
lr_classifier = LogisticRegression(max_iter=3000)
lr_classifier.fit(emotionXTrain, emotionYTrain)

# Evaluate model on validation set
print("✅ LR Validation Accuracy:",lr_classifier.score(emotionXVal, emotionYVal))

🎯 This simple model gives us a strong baseline using only DistilBERT's embeddings, without any fine-tuning!

#### 🤖 Performance Comparison with a Dummy Classifier

Let’s compare our logistic regression model with a baseline model that **always predicts the most frequent class**.

📌 This sets a low baseline — any model doing better than this is actually learning something!
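The idea behind a most-frequent baseline fits in a few lines of plain Python. The sketch below uses made-up labels and a hypothetical helper `most_frequent_baseline`; it is meant to show the logic, not to replace scikit-learn's `DummyClassifier`.

```python
from collections import Counter

def most_frequent_baseline(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, correct / len(eval_labels)

# Made-up labels for illustration only
train_labels = [1, 1, 1, 0, 0, 3]
eval_labels = [1, 0, 1, 4]

majority, accuracy = most_frequent_baseline(train_labels, eval_labels)
print(majority, accuracy)  # 1 0.5
```

The baseline's accuracy equals the share of the evaluation set that belongs to the majority class, so it is higher on imbalanced data.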


In [None]:
from sklearn.dummy import DummyClassifier

# Create and fit a baseline dummy classifier
dmy_classifier = DummyClassifier(strategy="most_frequent")
dmy_classifier.fit(emotionXTrain, emotionYTrain)

# Evaluate dummy classifier accuracy
print("📉 Dummy Classifier Validation Accuracy:", dmy_classifier.score(emotionXVal, emotionYVal))

### 🔁 Option 2: Fine-Tuning the Transformer Model Directly

Instead of using sentence embeddings, we can **fine-tune the entire DistilBERT model** end-to-end for classification.

🔍 Why use a neural network head?

To make the classification head trainable with the Transformer model, we need a **differentiable output layer**.

This is why we typically append a neural network classification head to the base model — allowing gradient-based learning.


#### ⚡ Benefits of Fine-Tuning

Fine-tuning allows the model to:
- Adapt its internal weights to the target task
- Improve performance on domain-specific data
- Potentially surpass the performance of frozen-feature approaches

This is especially powerful when working with large pretrained models on smaller datasets.


### 🔧 Load a Pretrained Model for Classification

We’ll now use `AutoModelForSequenceClassification` instead of `AutoModel`.

This gives us:
- A pretrained DistilBERT backbone
- A classification head on top

You only need to specify the number of output labels (in our case, six).

The `AutoModelForSequenceClassification` class wraps a base model like DistilBERT with a **trainable classification layer**.

> It’s ideal for supervised tasks like sentiment analysis, spam detection, or — in our case — **emotion classification**.


In [None]:
from transformers import AutoModelForSequenceClassification

# Define the number of output labels: we have 6 emotion classes
# 'sadness', 'joy', 'love', 'anger', 'fear', 'surprise'
num_labels = 6

# Load the DistilBERT model with a classification head for sequence classification
tf_classifier = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
).to(device)

🧠 This loads DistilBERT and attaches a randomly initialized classification head — the part that we’ll fine-tune!
> ⚠️ You might see a warning that some model weights were randomly initialized — that’s expected!  
The base DistilBERT model is pretrained, but the classification head is new and will be trained from scratch during fine-tuning.


To monitor training progress, we’ll define a custom `compute_metrics()` 📏 function.  
This is used by Hugging Face’s `Trainer` to calculate metrics like accuracy and F1-score during evaluation.

The function receives predictions and labels from the model and returns a dictionary of named metrics.

📊 We use both accuracy (overall performance) and F1-score (accounts for class imbalance).
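For intuition about what these two numbers measure, here is a hand-rolled sketch in plain Python. The notebook itself relies on scikit-learn's `accuracy_score` and `f1_score`; the functions `accuracy` and `weighted_f1` below are simplified stand-ins written for illustration, applied to made-up labels.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose prediction matches the label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for cls in set(labels):
        tp = sum(l == cls and p == cls for l, p in zip(labels, predictions))
        fp = sum(l != cls and p == cls for l, p in zip(labels, predictions))
        fn = sum(l == cls and p != cls for l, p in zip(labels, predictions))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support
    return total / len(labels)

# Made-up, imbalanced toy data: three examples of class 0, one of class 1
labels      = [0, 0, 0, 1]
predictions = [0, 0, 1, 1]

print(accuracy(labels, predictions))               # 0.75
print(round(weighted_f1(labels, predictions), 4))  # 0.7667
```

Weighting by support means frequent classes dominate the average; a macro average would instead treat all six emotions equally.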

In [None]:
from sklearn.metrics import accuracy_score, f1_score

# Evaluation metric function used by Hugging Face's Trainer
def compute_metrics(batch):
    # Get predicted class (highest logit)
    predictions = batch.predictions.argmax(-1) #Y_hat
    # Actual class labels
    labels = batch.label_ids                   #Y
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"accuracy": accuracy, "f1": f1}

### 🤗 What is Hugging Face’s `Trainer`?

`Trainer` is a high-level class that abstracts away much of the boilerplate code for training and evaluation.

It manages:
- Training loop
- Evaluation
- Logging
- Checkpointing
- Pushing models to the Hub

We’ll also use `TrainingArguments` to configure its behavior (like batch size, learning rate, number of epochs, etc.)


🔐 To push the fine-tuned model to your Hugging Face account, you first need to authenticate with the Hub.


In [None]:
from huggingface_hub import notebook_login

# Log in to your Hugging Face account to enable model uploads
notebook_login()

### 🏁 Start Training the Model

We’ll configure our training using the `TrainingArguments` class.

🧪 `Trainer` can also report to experiment-tracking tools like `wandb` (Weights & Biases) when they are installed, which is useful for:
- Tracking model performance in real-time
- Visualizing training loss and metrics
- Debugging and optimizing

You can set `output_dir` to determine where training artifacts (like checkpoints and logs) are stored.

🔥🔥🔥Now let's launch the full fine-tuning of DistilBERT with logging and evaluation after each epoch.

In [None]:
from transformers import Trainer, TrainingArguments

#emotionsDatasetEncoded
batch_size = 64 #128, 256
logging_steps = len(emotionsDatasetEncoded["train"])//batch_size
modelName = f"{model_checkpoint}-finetuned-emotion"

# Configure training parameters
trainingArgs = TrainingArguments(output_dir = modelName,
                                 num_train_epochs = 2,
                                 per_device_train_batch_size = batch_size,
                                 per_device_eval_batch_size = batch_size,
                                 learning_rate = 2e-5,
                                 weight_decay = 0.01,
                                 eval_strategy = "epoch",
                                 disable_tqdm = False,
                                 logging_steps = logging_steps,
                                 push_to_hub = True,
                                 log_level = "error"
                                 )

# Instantiate the Trainer
trainer  = Trainer(model = tf_classifier,
                   args = trainingArgs,
                   compute_metrics = compute_metrics,
                   train_dataset = emotionsDatasetEncoded["train"] ,
                   eval_dataset = emotionsDatasetEncoded["validation"] ,
                   tokenizer = tokenizer)

# Start the fine-tuning process!
trainer.train()

📈 Let’s compare our fine-tuned Transformer model to the earlier **Logistic Regression baseline**.

You’ll likely see a significant improvement thanks to end-to-end learning!


>💡💡💡 Before we dive deeper in the future with fine tuning, we’ll have a couple of short dedicated tutorials to explain **prompt engineering** and **template-based fine-tuning**. In these upcoming lessons, we will explain how prompting works with language models.

➡️ Stay tuned for that lesson next!


In [None]:
# Run evaluation on the validation set
modelPredictions = trainer.predict(emotionsDatasetEncoded["validation"])

# Show evaluation results
print("Model Predictions Validation Report\n================================\n")
print(modelPredictions.metrics)

🧠 The result also includes **raw logits**: unnormalized scores, one per class, where a higher value means the model favors that class more strongly.

We can use `np.argmax()` to convert those into predicted class indices.
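The following sketch shows, in plain Python, how logits relate to probabilities and predicted classes. The logit values are made up, and `softmax` and `argmax` are small helper functions written for this example rather than library calls.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [-1.2, 4.5, 0.3, -0.7, -2.0, -1.5]  # made-up scores for six classes
probabilities = softmax(logits)

print(argmax(logits), argmax(probabilities))  # 1 1
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same predicted class as taking it over the probabilities.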


In [None]:
# Check shape of prediction logits from validation set
print("Model Predictions Shape: ",modelPredictions.predictions.shape)

# First prediction: logits for all classes
print("First Row predictions: ", modelPredictions.predictions[0])

# Convert logits to predicted label indices using argmax
predictions_Labels = np.argmax(modelPredictions.predictions, axis=1)

# Check prediction label shapes and first result
print("Model Predictions Labels Shape: ",predictions_Labels.shape)
print("First Row Label predictions: ", predictions_Labels[0])

#### 🔎 Error Analysis

Before wrapping up, let’s perform some basic **error analysis**.  


Let's build a function that returns both the loss and the predicted label for each sample. By analyzing the highest-loss examples, we can uncover where the model struggles the most.

📌 This final analysis gives a clear look at the most difficult examples for our model — valuable for debugging or data augmentation.
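Before running this on the real model, it may help to see the idea on toy numbers. The sketch below computes an unreduced cross-entropy loss (one value per example) in plain Python and ranks examples from hardest to easiest. The logits and labels are made up, and `cross_entropy_per_example` is a hypothetical helper standing in for PyTorch's `cross_entropy(..., reduction="none")`.

```python
import math

def cross_entropy_per_example(logits_batch, labels):
    """Unreduced cross-entropy: one loss value per example."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        peak = max(logits)
        # log-sum-exp computed stably by factoring out the largest logit
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[label])
    return losses

# Made-up logits for three examples and three classes
logits_batch = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [0, 1, 0]

losses = cross_entropy_per_example(logits_batch, labels)
hardest_first = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)

print(hardest_first)  # [2, 1, 0]
```

The third example has the highest loss because the model is confident in the wrong class, while the second is merely uncertain. Sorting by loss surfaces exactly these confident mistakes first.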





In [None]:
from torch.nn.functional import cross_entropy

# Function to compute per-example loss and predicted labels
def get_model_output_loss(batch):
  model_inputs = {k:v.to(device) for k,v in batch.items() if k in tokenizer.model_input_names}
  with torch.no_grad():
    model_outputs = tf_classifier(**model_inputs)
    loss = cross_entropy(model_outputs.logits, batch["label"].to(device), reduction="none")
  return {"loss": loss.cpu().numpy(), "predictedLabel": torch.argmax(model_outputs.logits, axis=-1).cpu().numpy()}

# Reformat and apply to validation set
emotionsDatasetEncoded.set_format("torch", columns =["input_ids", "attention_mask", "label"])
emotionsDatasetEncoded["validation"] = emotionsDatasetEncoded["validation"].map(get_model_output_loss, batched=True, batch_size=32)


📉 This adds new columns to the validation set: loss and predictedLabel, giving us insight into model confidence.



In [None]:
# Function to convert numeric label ID to string class name
def label_int2str(row):
    return emotionDataset["train"].features["label"].int2str(row)

In [None]:
# Convert validation dataset to pandas
emotionsDatasetEncoded.set_format("pandas")

#Last two columns added
columns = ["text","label", "predictedLabel", "loss"]

# Create DataFrame
validationDF = emotionsDatasetEncoded["validation"][:][columns]

# Convert label integers to strings
validationDF["label"]= validationDF["label"].apply(label_int2str)
validationDF["predictedLabel"] = (validationDF["predictedLabel"].apply(label_int2str))

# Display the 10 validation examples with the highest loss
print("🔝 Top 10 Highest-Loss Texts\n")
validationDF.sort_values("loss", ascending=False).head(10)

#### 💾 Saving and Sharing the Model


🤗 Hugging Face makes it easy to share your model with the community — just like we downloaded DistilBERT from the Hub, you can push your fine-tuned model back to it.

Let’s publish our trained model with one command.


In [None]:

# Push model to your Hugging Face account with a commit message
trainer.push_to_hub(commit_message="DistilBert trained with Emotion dataset")

In [None]:
from transformers import pipeline

# Load the model from the Hugging Face Hub
model_checkpoint = "moghalis/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_checkpoint)

# Try a test example
tweet = "I am really happy. I know how to do fine tuning with hugging face!"
prediction = classifier(tweet, top_k=None)  # top_k=None returns the scores for all classes
prediction


🌍 We successfully published our custom model to the Hub. Now anyone can load and use it just like this.

## ✅ Conclusion

In this tutorial, we explored two approaches to adapting a pretrained Transformer for classification:

1. **Feature extraction** using `AutoModel` and traditional ML
2. **End-to-end fine-tuning** using `AutoModelForSequenceClassification` and `Trainer`

We wrapped up with evaluation, error analysis, and even model sharing!

🚀 Keep experimenting with different architectures and datasets — and don’t forget to share your models on the Hub!

👉 Like this tutorial? Subscribe, give it a ⭐ on GitHub, and follow for more hands-on NLP content!
