# AI-Powered Request Dispatcher - HuggingFace Trainer Edition

This notebook demonstrates how to fine-tune a DistilBERT model for text classification using the **HuggingFace `datasets` library** and **`Trainer` API** - the modern, recommended approach for fine-tuning transformers.

## Why Use HuggingFace Trainer?

- **Less boilerplate**: No manual training loops needed
- **Built-in features**: Checkpointing, logging, evaluation, mixed precision
- **Best practices**: Gradient accumulation, learning rate scheduling
- **Easy experimentation**: Just change `TrainingArguments`

## What You'll Learn

1. Load and preprocess data with HuggingFace `datasets`
2. Tokenize datasets efficiently with `.map()`
3. Configure training with `TrainingArguments`
4. Fine-tune using `Trainer`
5. Evaluate with custom metrics

## Imports

In [1]:
import os
# Disable tokenizers parallelism to suppress the warning: huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import numpy as np
import pandas as pd
import torch

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# HuggingFace libraries
import evaluate
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)

import helper_model



In [2]:
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: mps


## 1. Load and Prepare Data

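We'll use an augmented version of the Databricks Dolly 15k dataset for instruction classification. The augmented CSV contains 26,238 rows after dropping rows with missing values.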

In [3]:
# Load the dataset
data_path = "./databricks-dolly-15k-dataset/databricks-dolly_augmented.csv"
df = pd.read_csv(data_path).dropna().reset_index(drop=True)

print(f"Loaded {len(df)} samples")
print(f"\nOriginal categories: {df['category'].unique().tolist()}")

Loaded 26238 samples

Original categories: ['closed_qa', 'classification', 'open_qa', 'information_extraction', 'brainstorming', 'general_qa', 'summarization', 'creative_writing']


In [4]:
# Consolidate similar categories for a more focused dispatcher
category_map = {
    "general_qa": "q_and_a",
    "open_qa": "q_and_a",
    "closed_qa": "q_and_a",
    "information_extraction": "information_distillation",
    "summarization": "information_distillation"
}

df['category'] = df['category'].replace(category_map)
print(f"Consolidated categories: {df['category'].unique().tolist()}")

Consolidated categories: ['q_and_a', 'classification', 'information_distillation', 'brainstorming', 'creative_writing']


In [5]:
# Create label mappings
unique_categories = df['category'].unique().tolist()
label2id = {category: i for i, category in enumerate(unique_categories)}
id2label = {i: category for category, i in label2id.items()}

# Add numeric labels
df['label'] = df['category'].map(label2id)

print("Label mappings:")
for cat, idx in label2id.items():
    count = len(df[df['label'] == idx])
    print(f"  {idx}: {cat:<25} ({count} samples)")

Label mappings:
  0: q_and_a                   (13243 samples)
  1: classification            (3918 samples)
  2: information_distillation  (4681 samples)
  3: brainstorming             (3113 samples)
  4: creative_writing          (1283 samples)


## 2. Create HuggingFace Dataset

Instead of creating a custom PyTorch Dataset, we use HuggingFace's `Dataset` class, which the `Trainer` accepts directly.

In [6]:
# Create HuggingFace Dataset from pandas DataFrame
# We only need the 'instruction' (text) and 'label' columns
dataset = Dataset.from_pandas(df[['instruction', 'label']])

# Rename 'instruction' to 'text' for clarity
dataset = dataset.rename_column('instruction', 'text')

print(dataset)
print(f"\nExample: {dataset[0]}")

Dataset({
    features: ['text', 'label'],
    num_rows: 26238
})

Example: {'text': 'When did Virgin Australia start operating?', 'label': 0}


In [7]:
# Split into train/validation (80/20)
dataset_split = dataset.train_test_split(test_size=0.2, seed=42)

# Rename 'test' to 'validation' for clarity
dataset_dict = DatasetDict({
    'train': dataset_split['train'],
    'validation': dataset_split['test']
})

print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20990
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5248
    })
})


## 3. Load Tokenizer and Model

We'll use DistilBERT, a smaller and faster distilled version of BERT that is well suited to short-text classification.

In [8]:
model_name = "distilbert-base-uncased"
num_labels = len(unique_categories)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

print(f"Model loaded with {num_labels} classification labels")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded with 5 classification labels


## 4. Tokenize Dataset

Using `.map()` is the HuggingFace way - efficient, batched, and cacheable!

In [9]:
def tokenize_function(examples):
    """Tokenize a batch of examples."""
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512,
        # Don't pad here - let DataCollator do dynamic padding
    )

# Apply tokenization to all splits
tokenized_datasets = dataset_dict.map(
    tokenize_function,
    batched=True,  # Process in batches for efficiency
    remove_columns=['text'],  # Remove original text column
)

print(tokenized_datasets)
print(f"\nTokenized example keys: {tokenized_datasets['train'][0].keys()}")

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 20990
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 5248
    })
})

Tokenized example keys: dict_keys(['label', 'input_ids', 'attention_mask'])


In [10]:
# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
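
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually: each batch is padded only to the length of its own longest sequence, not to `max_length=512`. The token ids, the `pad_batch` helper, and `pad_id=0` are illustrative assumptions, not the internals of `DataCollatorWithPadding`.

```python
# Hypothetical token-id sequences of different lengths (values are made up).
batch = [
    [101, 2043, 2106, 102],
    [101, 4339, 1037, 2460, 5961, 2055, 102],
    [101, 2003, 102],
]

def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    target_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (target_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch(batch)
dynamic_tokens = sum(len(row) for row in padded["input_ids"])
fixed_tokens = len(batch) * 512  # cost of padding everything to max_length=512

print(f"dynamic padding: {dynamic_tokens} tokens, fixed padding: {fixed_tokens} tokens")
```

Because most instructions are far shorter than 512 tokens, padding per batch keeps the tensors small, which is why the tokenization step above leaves padding to the collator.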

## 5. Define Evaluation Metrics

The `Trainer` needs a function to compute metrics during evaluation.

In [11]:
# Load evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    """Compute accuracy and F1 score for evaluation."""
    # predictions.shape: (sample size, n_classes): (5248, 5)
    # labels.shape: (sample size, ): (5248, )
    predictions, labels_ids = eval_pred
    
    # Print types and shapes to help with debugging
    # print(f"Predictions type: {type(predictions)}, shape: {predictions.shape}")
    # print(f"Labels type: {type(labels_ids)}, shape: {labels_ids.shape}")

    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels_ids)
    f1 = f1_metric.compute(predictions=predictions, references=labels_ids, average='weighted')
    
    return {
        'accuracy': accuracy['accuracy'],
        'f1': f1['f1'],
    }

## 6. Optional: Freeze Layers for Efficient Fine-Tuning

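A lighter alternative to full fine-tuning: freeze most of the encoder and train only the last few transformer layers plus the classification head. Note that this is partial fine-tuning by layer freezing, not an adapter-based PEFT method such as LoRA.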

In [12]:
def freeze_base_layers(model, layers_to_train=3):
    """
    Freeze all but the last N transformer layers and classification head.
    
    Args:
        model: DistilBERT model
        layers_to_train: Number of final transformer layers to keep trainable
    """
    # Freeze all parameters first
    for param in model.parameters():
        param.requires_grad = False
    
    # Unfreeze last N transformer layers
    transformer_layers = model.distilbert.transformer.layer
    for i in range(layers_to_train):
        for param in transformer_layers[-(i + 1)].parameters():
            param.requires_grad = True
    
    # Always unfreeze classification head
    for param in model.pre_classifier.parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True
    
    # Count trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
    
    return model

# Apply freezing (optional - comment out for full fine-tuning)
model = freeze_base_layers(model, layers_to_train=3)

Trainable parameters: 21,858,053 / 66,957,317 (32.6%)
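
The trainable count reported above can be reproduced by hand. The sketch below assumes the standard DistilBERT-base dimensions (hidden size 768, feed-forward size 3072) and that each transformer layer consists of four attention projections, a two-layer feed-forward block, and two LayerNorms. The `linear` helper is a hypothetical name used only for this arithmetic.

```python
# Assumed DistilBERT-base dimensions; num_labels comes from the notebook.
hidden, ffn, num_labels = 768, 3072, 5

def linear(n_in, n_out):
    """Parameter count of a dense layer: weights plus biases."""
    return n_in * n_out + n_out

attention = 4 * linear(hidden, hidden)   # query, key, value, output projections
feed_forward = linear(hidden, ffn) + linear(ffn, hidden)
layer_norms = 2 * (2 * hidden)           # two LayerNorms, each with weight and bias
per_layer = attention + feed_forward + layer_norms

# pre_classifier (768 -> 768) plus classifier (768 -> num_labels)
head = linear(hidden, hidden) + linear(hidden, num_labels)

trainable = 3 * per_layer + head         # layers_to_train=3
total = 66_957_317                       # total reported by the cell above

print(f"{trainable:,} trainable ({100 * trainable / total:.1f}%)")
# prints: 21,858,053 trainable (32.6%)
```

Each unfrozen layer adds roughly 7.1M parameters, so changing `layers_to_train` moves the trainable share in steps of about 10 percentage points.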


## 7. Configure Training with TrainingArguments

This is where HuggingFace Trainer shines - all hyperparameters in one place!

In [13]:
training_args = TrainingArguments(
    output_dir="./model_results/dispatcher-checkpoints",
    
    # Training hyperparameters
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    
    # Evaluation strategy
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    
    dataloader_num_workers=2
)

print("Training configuration ready!")

Training configuration ready!
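
To see what `warmup_ratio=0.1` means in optimizer steps, the following sketch works through the arithmetic for this configuration. It assumes a single device, no gradient accumulation, and the default linear scheduler. The exact rounding the `Trainer` applies to the warmup step count is also an assumption, and `lr_at` is a hypothetical helper, not a library function.

```python
import math

train_samples = 20_990   # size of the training split above
batch_size = 32          # per_device_train_batch_size
epochs = 5               # num_train_epochs
warmup_ratio = 0.1
peak_lr = 5e-5           # learning_rate

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)  # rounding rule is an assumption

def lr_at(step):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(f"{steps_per_epoch} steps/epoch, {total_steps} total, {warmup_steps} warmup")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks about halfway through the first epoch and then decays for the rest of training.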


## 8. Create Trainer and Train!

The Trainer handles everything: training loop, evaluation, checkpointing, logging.

In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument (transformers >= 4.46)
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer initialized!")

Trainer initialized!


In [15]:
# Train the model!
print("Starting training...\n")
train_result = trainer.train()

print("\n" + "=" * 50)
print("Training complete!")
print(f"Total training time: {train_result.metrics['train_runtime']:.1f} seconds")

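Starting training...

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.7691 | 0.453181 | 0.842226 | 0.839662 |
| 2 | 0.4184 | 0.376460 | 0.875381 | 0.873578 |
| 3 | 0.3109 | 0.370492 | 0.885671 | 0.884201 |
| 4 | 0.1631 | 0.401992 | 0.887005 | 0.886145 |
| 5 | 0.1226 | 0.433573 | 0.891006 | 0.889733 |
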
Training complete!
Total training time: 1461.5 seconds
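
With `load_best_model_at_end=True`, `metric_for_best_model="f1"`, and `greater_is_better=True`, the checkpoint that is restored is the one with the highest validation F1. The sketch below mimics that selection on the logged values. It is an illustration of the rule, not the `Trainer`'s internal code.

```python
# Per-epoch validation results copied from the training log above.
history = [
    {"epoch": 1, "eval_loss": 0.453181, "eval_f1": 0.839662},
    {"epoch": 2, "eval_loss": 0.376460, "eval_f1": 0.873578},
    {"epoch": 3, "eval_loss": 0.370492, "eval_f1": 0.884201},
    {"epoch": 4, "eval_loss": 0.401992, "eval_f1": 0.886145},
    {"epoch": 5, "eval_loss": 0.433573, "eval_f1": 0.889733},
]

# Selection rule used in this notebook: highest F1 wins.
best_by_f1 = max(history, key=lambda row: row["eval_f1"])

# Alternative rule for comparison: lowest validation loss wins.
best_by_loss = min(history, key=lambda row: row["eval_loss"])

print(f"best by F1:   epoch {best_by_f1['epoch']}")
print(f"best by loss: epoch {best_by_loss['epoch']}")
```

The two rules disagree here: validation loss is lowest at epoch 3 and rises afterwards, while F1 keeps improving through epoch 5. Which criterion to trust depends on whether calibrated probabilities or top-1 decisions matter more for the dispatcher.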


## 9. Evaluate the Model

In [16]:
# Final evaluation on validation set
eval_results = trainer.evaluate()

print("\n" + "=" * 50)
print("Final Validation Metrics")
print("=" * 50)
print(f"Loss:     {eval_results['eval_loss']:.4f}")
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"F1 Score: {eval_results['eval_f1']:.4f}")




Final Validation Metrics
Loss:     0.4336
Accuracy: 0.8910
F1 Score: 0.8897


In [17]:
# Generate predictions for confusion matrix
predictions = trainer.predict(tokenized_datasets['validation'])
pred_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

# Confusion matrix
cm = confusion_matrix(true_labels, pred_labels)
print("\nConfusion Matrix:\n")
print(f"{'':>25}", end='')
for i in range(len(id2label)):
    print(f"{id2label[i][:10]:>12}", end='')
print()
for i, row in enumerate(cm):
    print(f"{id2label[i]:>25}", end='')
    for val in row:
        print(f"{val:>12}", end='')
    print()




Confusion Matrix:

                              q_and_a  classifica  informatio  brainstorm  creative_w
                  q_and_a        2477           4          78          70          13
           classification          10         768           3           3           0
 information_distillation         191           3         742          11           0
            brainstorming         103           2          11         489           3
         creative_writing          51           1           1          14         200
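
The aggregate scores hide how unevenly the classes perform. As a sketch, per-class precision, recall, and F1 can be derived directly from the confusion matrix above (rows are true labels, columns are predictions), and the support-weighted F1 should land close to the value the `Trainer` reported.

```python
labels = ["q_and_a", "classification", "information_distillation",
          "brainstorming", "creative_writing"]

# Rows are true labels, columns are predicted labels (values from the matrix above).
cm = [
    [2477,   4,  78,  70,  13],
    [  10, 768,   3,   3,   0],
    [ 191,   3, 742,  11,   0],
    [ 103,   2,  11, 489,   3],
    [  51,   1,   1,  14, 200],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

weighted_f1 = 0.0
for i, name in enumerate(labels):
    tp = cm[i][i]
    support = sum(cm[i])                   # true examples of class i
    predicted = sum(row[i] for row in cm)  # examples predicted as class i
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    weighted_f1 += f1 * support / total    # weight each class by its support
    print(f"{name:<25} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

print(f"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f}")
```

Most of the errors are minority-class examples being predicted as `q_and_a`, which is also the largest class in the training data.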


## 10. Test on New Instructions

Let's test our dispatcher on some unseen instructions!

In [18]:
def predict_category(text: str) -> str:
    """Predict the category for a given instruction."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        outputs = model(**inputs)
        predicted_class = torch.argmax(outputs.logits, dim=1).item()
    
    return id2label[predicted_class]

# Test examples
test_instructions = [
    "What is the capital of France?",
    "Write a short poem about coding",
    "Is a tomato a fruit or vegetable?",
    "Summarize the main points of this article",
    "Give me 5 ideas for a birthday party",
]

print("Predictions on new instructions:\n")
for instruction in test_instructions:
    category = predict_category(instruction)
    print(f"  [{category:^25}] {instruction}")

Predictions on new instructions:

  [         q_and_a         ] What is the capital of France?
  [    creative_writing     ] Write a short poem about coding
  [     classification      ] Is a tomato a fruit or vegetable?
  [information_distillation ] Summarize the main points of this article
  [      brainstorming      ] Give me 5 ideas for a birthday party
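
`predict_category` always returns the top class, even when the model is unsure. One possible extension, sketched here in pure Python, converts the logits to probabilities with a softmax and routes low-confidence requests to a fallback. The `dispatch` helper, the `0.6` threshold, the `needs_review` fallback label, and the example logits are all hypothetical and would need tuning on real data.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dispatch(logits, labels, threshold=0.6, fallback="needs_review"):
    """Return (label, confidence), using the fallback when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs[best]
    return labels[best], probs[best]

labels = ["q_and_a", "classification", "information_distillation",
          "brainstorming", "creative_writing"]

# Made-up logits: one clearly peaked, one nearly flat.
confident = dispatch([4.0, 0.1, 0.2, 0.1, 0.0], labels)
ambiguous = dispatch([1.0, 0.9, 1.1, 0.8, 1.0], labels)

print(confident)
print(ambiguous)
```

In the notebook, the logits would come from `outputs.logits[0].tolist()` inside `predict_category`.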


## Summary

### What We Did

1. **Loaded data** into HuggingFace `Dataset` (no custom PyTorch Dataset needed!)
2. **Tokenized efficiently** using `.map()` with batching
3. **Configured training** with `TrainingArguments` (all hyperparameters in one place)
4. **Fine-tuned** using `Trainer` (no manual training loop!)
5. **Evaluated** with a custom `compute_metrics` function (accuracy and weighted F1)

### Benefits of This Approach

- **Less code** than a manual training loop (no hand-written optimizer, scheduler, or batching logic)
- **Automatic checkpointing** and best model selection
- **Built-in logging** (TensorBoard, W&B support)
- **Easy to experiment** with hyperparameters
- **Repeatable evaluation** on a held-out validation split after every epoch

### Next Steps

- Try different models (`bert-base-uncased`, `roberta-base`)
- Experiment with learning rate schedules
- Add class weights for imbalanced data
- Use `push_to_hub()` to share on HuggingFace Hub
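
As a starting point for the class-weights idea, the sketch below derives inverse-frequency weights from the label counts printed in section 1, using the "balanced" heuristic `n_samples / (n_classes * count)`. How the weights are then applied is left open. One common option, assumed here and not shown in this notebook, is to pass them to a weighted cross-entropy loss inside a `Trainer` subclass that overrides `compute_loss`.

```python
# Label counts from the "Label mappings" output in section 1.
counts = {
    "q_and_a": 13243,
    "classification": 3918,
    "information_distillation": 4681,
    "brainstorming": 3113,
    "creative_writing": 1283,
}

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rare classes get proportionally larger weights.
weights = {name: total / (n_classes * n) for name, n in counts.items()}

for name, weight in weights.items():
    print(f"{name:<25} count={counts[name]:>6} weight={weight:.3f}")
```

With these weights every class contributes the same total weight to the loss, so the 1,283 `creative_writing` examples count as much in aggregate as the 13,243 `q_and_a` examples.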