Lecture: AI I - Advanced 

Previous:
[**Chapter 4.2.2: Sentiment Analysis**](../02_nlp/02_sentiment.ipynb)

---

# Chapter 4.2.3: Named Entity Recognition

In the previous chapter we trained DistilBERT to assign a single label to an entire movie review — positive or negative. That was **sequence classification**: one label per input.

Named Entity Recognition (NER) is fundamentally different. Here every **token** in the sentence gets its own label. The task is to find and classify named entities — people, locations, organisations, and miscellaneous proper nouns — directly within the text. Given the sentence *"Barack Obama visited Paris"*, the model should output:

| Token | Barack | Obama | visited | Paris |
|---|---|---|---|---|
| Label | B-PER | I-PER | O | B-LOC |

This is called **token classification**: the same DistilBERT backbone is used, but instead of pooling down to a single `[CLS]` vector we take the hidden state of **every** token and pass each one through the same classification head, which is shared across all positions. In code, the only change is swapping `AutoModelForSequenceClassification` for `AutoModelForTokenClassification`.
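
To make "one shared head, applied per token" concrete, here is a minimal sketch in plain Python. The sizes, weights, and the helper name `classify_tokens` are made up for illustration; they are not DistilBERT's real parameters or API.

```python
# Illustrative only: toy sizes and hand-picked weights.
NUM_LABELS = 2  # CoNLL-2003 uses 9 labels; 2 keeps the example readable

# One weight row per label. The SAME rows are used at every token position.
weights = [
    [1.0, 0.0, -1.0],   # label 0
    [-1.0, 0.5, 1.0],   # label 1
]
bias = [0.0, 0.1]


def classify_tokens(hidden_states):
    """Apply one shared linear layer to each token vector, then take the argmax."""
    predictions = []
    for vector in hidden_states:
        logits = [
            sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)
        ]
        predictions.append(max(range(NUM_LABELS), key=lambda k: logits[k]))
    return predictions


# Four token vectors in (hidden size 3 here), four label IDs out.
hidden_states = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
token_predictions = classify_tokens(hidden_states)
print(token_predictions)
```

The output has exactly one label per input vector, which is the defining property of token classification.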

## BIO Tagging

Labels follow the **BIO (Begin–Inside–Outside)** scheme:
- **B-** marks the **first** token of an entity (`B-PER` = beginning of a person name)
- **I-** marks a **continuation** token of the same entity (`I-PER` = still inside a person name)
- **O** means the token belongs to **no** entity

The B/I distinction is essential when two entities of the same type appear next to each other — without it, *"New York"* (one location) and *"New"* + *"York"* (two separate locations) would be indistinguishable.
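
The following sketch shows how BIO tags can be grouped back into entities, and why the B/I distinction matters for adjacent entities. The helper `extract_entities` is a hypothetical name, and the handling of stray `I-` tags (they are dropped) is a simplifying assumption.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current = None  # [entity_type, [tokens]] for the entity being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes the open entity and starts a new one.
            if current:
                entities.append(current)
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current[1].append(token)
        else:
            # "O", or an I- tag that continues nothing: close the open entity.
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)
    return [(kind, " ".join(words)) for kind, words in entities]


sentence = ["Barack", "Obama", "visited", "Paris"]
print(extract_entities(sentence, ["B-PER", "I-PER", "O", "B-LOC"]))

# Same tokens, different tags: one location versus two adjacent locations.
print(extract_entities(["New", "York"], ["B-LOC", "I-LOC"]))
print(extract_entities(["New", "York"], ["B-LOC", "B-LOC"]))
```

With only a single `LOC` tag and no B/I prefix, the last two calls would receive identical input and could not be told apart.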

## Why DistilBERT?

Exactly the same reasoning as the sentiment chapter: DistilBERT retains ~97% of BERT's performance while being 40% smaller and 60% faster. We again use `distilbert-base-cased` — the cased variant is particularly important for NER because capitalisation is a strong signal. *"Paris"* (a city) and *"paris"* (unlikely to be a city) carry different information, and the cased model preserves that distinction.

## The CoNLL-2003 Dataset

| Split | Sentences | Entity types |
|---|---|---|
| train | 14 041 | PER, ORG, LOC, MISC |
| validation | 3 250 | PER, ORG, LOC, MISC |
| test | 3 453 | PER, ORG, LOC, MISC |

Each example contains a `tokens` list and a `ner_tags` list of the same length. The 9 possible tag IDs are:

| ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Tag | O | B-PER | I-PER | B-ORG | I-ORG | B-LOC | I-LOC | B-MISC | I-MISC |


## Loading the Data

We load the dataset with the Hugging Face `datasets` library:

In [1]:
from datasets import load_dataset

dataset = load_dataset("conll2003", trust_remote_code=True)

print("length of training set:", len(dataset["train"]))
print("length of validation set:", len(dataset["validation"]))
print("length of test set:", len(dataset["test"]))
print("example from training set:", dataset["train"][0])

length of training set: 14041
length of validation set: 3250
length of test set: 3453
example from training set: {'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
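
The integer `ner_tags` in the printed example are easier to read once they are mapped to tag names. A small sketch, with the tag list and the example copied by hand from the tag table and the output above instead of being read from the dataset object:

```python
# Tag names in ID order, copied from the tag table.
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training example, copied from the printed output.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = [(token, TAG_NAMES[tag_id]) for token, tag_id in zip(tokens, ner_tags)]

# Print only the tokens that belong to an entity.
for token, tag in decoded:
    if tag != "O":
        print(f"{token:10s} {tag}")
```

Here *EU* is tagged as an organisation, while *German* and *British* fall under MISC.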


## Tokenization & Label Alignment

CoNLL-2003 arrives pre-tokenised into words, so we pass `is_split_into_words=True` to tell the tokeniser not to split on whitespace itself. DistilBERT then applies its own **WordPiece** subword splitting on top of that — the word *"Washington"* might become `["Was", "##hing", "##ton"]`.
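
WordPiece splits a word by repeatedly taking the longest vocabulary entry that matches at the current position. The sketch below imitates that greedy rule with a tiny, hypothetical vocabulary (`TOY_VOCAB`); the real `distilbert-base-cased` vocabulary is far larger and may split words differently.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"Was", "##hing", "##ton", "visited", "Paris"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first subword splitting of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no match at this position: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Split pre-tokenised words and remember which word each piece came from.
words = ["Washington", "visited", "Paris"]
tokens, word_ids = [], []
for index, word in enumerate(words):
    for piece in wordpiece(word):
        tokens.append(piece)
        word_ids.append(index)

print(tokens)
print(word_ids)
```

The `word_ids` list built here plays the same role as the tokenizer's `word_ids()` method, minus the `None` entries for special tokens.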

This creates a mismatch: we have **one label per word**, but now **multiple subword tokens per word**. We resolve it with a simple rule:

1. Use `word_ids()` to find out which original word each subword token came from.
2. Assign the real label only to the **first** subword of each word.
3. Set all other subwords, and special tokens like `[CLS]`/`[SEP]`, to **-100**. This is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so these positions contribute nothing to the loss.

| | [CLS] | Was | ##hing | ##ton | visited | Paris | [SEP] |
|---|---|---|---|---|---|---|---|
| word\_id | None | 0 | 0 | 0 | 1 | 2 | None |
| label | -100 | B-LOC | -100 | -100 | O | B-LOC | -100 |
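
The table can be reproduced without a tokenizer by applying the alignment rule to hard-coded `word_ids`. This is only a sketch: the helper name `align_word_labels` is hypothetical, and the label IDs follow the tag table (0 = O, 5 = B-LOC).

```python
IGNORE_INDEX = -100


def align_word_labels(word_ids, word_labels):
    """Keep each word's label on its first subword; mask everything else."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])    # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)            # later subword of the same word
        previous_word_id = word_id
    return aligned


# [CLS] Was ##hing ##ton visited Paris [SEP]
word_ids = [None, 0, 0, 0, 1, 2, None]
word_labels = [5, 0, 5]  # B-LOC, O, B-LOC, as in the table

aligned_labels = align_word_labels(word_ids, word_labels)
print(aligned_labels)
```

The result has one entry per subword token, and exactly one real label per original word.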

We also switch from `DataCollatorWithPadding` to `DataCollatorForTokenClassification` — it pads both `input_ids` and `labels` to the same length within each batch, keeping the padding labels at -100.
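
The effect of padding labels with -100 can be imitated in plain Python. The two helpers below (`pad_labels`, `masked_cross_entropy`) are hypothetical stand-ins that mimic the behaviour described above; they are not the library's implementation.

```python
import math

IGNORE_INDEX = -100


def pad_labels(batch_labels, pad_value=IGNORE_INDEX):
    """Right-pad every label list in the batch to the length of the longest one."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    Assumes at least one position carries a real label.
    """
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: no contribution to the loss
        log_norm = math.log(sum(math.exp(z) for z in token_logits))
        losses.append(log_norm - token_logits[label])
    return sum(losses) / len(losses)


padded = pad_labels([[-100, 1, 0, -100], [-100, 1, -100]])
print(padded)

# Changing the logits at a masked position leaves the loss unchanged.
labels = [-100, 1, -100]
loss_a = masked_cross_entropy([[0.0, 0.0], [0.2, 1.5], [0.0, 0.0]], labels)
loss_b = masked_cross_entropy([[9.0, -9.0], [0.2, 1.5], [3.0, 7.0]], labels)
print(loss_a, loss_b)
```

Because masked positions are skipped entirely, padding tokens and non-initial subwords cannot pull the gradients in any direction.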

In [None]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def align_labels(examples):
    """Tokenize pre-split words and align NER labels with subword tokens."""
    tokenized = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)

    aligned_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []

        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)
            elif word_id != previous_word_id:
                label_ids.append(labels[word_id])
            else:
                label_ids.append(-100)

            previous_word_id = word_id
        aligned_labels.append(label_ids)
    tokenized["labels"] = aligned_labels
    return tokenized


tokenized_conll = dataset.map(align_labels, batched=True)
data_collator   = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")
label_list = dataset["train"].features["ner_tags"].feature.names

def compute_metrics(eval_pred):
    """
    seqeval expects lists of string labels, not integers.
    We filter out -100 (subword / special tokens) before scoring.
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]  # noqa: E741
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (_, l) in zip(prediction, label) if l != -100]  # noqa: E741
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
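    # The seqeval metric reports its aggregate scores under "overall_*" keys.
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }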

## Training

The setup mirrors the sentiment chapter almost exactly. The only architectural change is `AutoModelForTokenClassification`: instead of a single linear layer on top of `[CLS]`, it applies a linear layer to **every** token's hidden state independently.

`seqeval` computes precision, recall, and F1 at the **entity level**, not the token level. A predicted entity counts as correct only if its span boundaries and its type both match the gold annotation exactly, so a single wrong token in a multi-token entity makes the whole entity wrong. This is stricter than per-token accuracy and is the standard for NER benchmarks. We therefore set `metric_for_best_model="f1"` so that the Trainer reloads the checkpoint with the highest entity-level F1 score at the end of training.
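
The gap between token-level and entity-level scoring shows up on a single sentence. The sketch below is a simplified re-implementation for illustration, not `seqeval` itself. In particular, it ignores stray `I-` tags, which `seqeval` may handle differently.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag list."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == kind
        if kind is not None and not continues:
            found.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found


def entity_f1(gold, predicted):
    """F1 over exact span matches for one sentence."""
    gold_spans, pred_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


gold = ["B-PER", "I-PER", "O", "B-LOC"]   # Barack Obama visited Paris
pred = ["B-PER", "O", "O", "B-LOC"]       # "Obama" missed

print("token accuracy:", token_accuracy(gold, pred))
print("entity F1:     ", entity_f1(gold, pred))
```

Three of four tokens are right, but only one of two entities matches exactly, so the entity-level score is noticeably lower than the token accuracy.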

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

id2label  = {i: name for i, name in enumerate(label_list)}
label2id  = {name: i for i, name in enumerate(label_list)}

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="./data/03_ner",
    learning_rate=5e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_conll["train"],
    eval_dataset=tokenized_conll["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

---

Next: [**Chapter 4.2.4: AI Agents**](../02_nlp/04_agent.ipynb)