### Evaluation of Named Entity Recognition (NER) Model Using Confusion Matrix

#### Aim & Problem Statement: 
- Aim: To evaluate the performance of a Named Entity Recognition (NER) model using a confusion matrix by comparing manually annotated ground-truth labels with model-predicted labels.
- Problem Statement:
Train or use a pre-trained NER model (spaCy) and evaluate its predictions on sports-related text using precision, recall, F1-score, and a confusion matrix.

#### Why Evaluation Is Needed
Natural Language Processing models are not perfect.

Even when an NER model identifies entities, we must answer:
- How many entities were correctly identified?
- How many were missed?
- How many were wrongly classified?

To answer these questions quantitatively, we use:
- Confusion Matrix
- Precision
- Recall
- F1-Score 


#### Step 1: Dataset Finalization
Why this step exists:
- A confusion matrix requires ground truth (actual labels).
- Since no sports-specific labeled NER dataset was readily available, we manually create a small gold-standard dataset.

What I did:
- Selected 40 sports-related sentences
- Tokenized them 
- Assigned BIO labels manually 
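To make the BIO scheme concrete, here is a minimal sketch that turns entity spans into BIO tags: the first token of an entity gets `B-<type>`, continuation tokens get `I-<type>`, and everything else gets `O`. The helper `spans_to_bio` and the span tuples are hypothetical illustrations; the labels in this notebook were assigned by hand.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) token spans into BIO tags.

    `start` is inclusive and `end` is exclusive, both counted in tokens.
    """
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"      # continuation tokens
    return tags


tokens = ["Virat", "Kohli", "plays", "for", "Royal", "Challengers", "Bangalore", "."]
spans = [(0, 2, "PER"), (4, 7, "ORG")]     # hypothetical span annotations
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token:12} {tag}")
```

The resulting tags match the manual labels used for this sentence in Step 3.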


#### Step 2: Import Required Libraries 


In [27]:
# Core libraries 
import nltk
import spacy

# Evaluation Libraries 
from sklearn.metrics import confusion_matrix, classification_report

# Download tokenizer 
# nltk.download("punkt")
from nltk.tokenize import word_tokenize

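Why these libraries?
- nltk -> tokenization
- spaCy -> NER prediction
- sklearn -> ready-made confusion matrix and classification report (imported here; the steps below build the matrix by hand)
- pandas -> tabular display (imported later, in Step 5)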

#### Step 3: Manually Labeled Dataset 

In [28]:
dataset = []

def add(sentence, labels):
    tokens = word_tokenize(sentence)
    assert len(tokens) == len(labels), "Token-label mismatch!"
    dataset.append(list(zip(tokens, labels)))

add("Chennai Super Kings won the IPL trophy.",
    ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-MISC", "O", "O"])

add("La Liga is a competitive football league.",
    ["B-MISC", "I-MISC", "O", "O", "O", "O", "O", "O"])

add("Ahmedabad hosted the World Cup final.",
    ["B-LOC", "O", "O", "B-MISC", "I-MISC", "O", "O"])

add("BCCI appointed Gautam Gambhir as head coach.",
    ["B-ORG", "O", "B-PER", "I-PER", "O", "O", "O", "O"])

add("Real Madrid plays at Santiago Bernabeu.",
    ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC", "O"])

add("Virat Kohli plays for Royal Challengers Bangalore.",
    ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"])

add("FIFA organizes the World Cup every four years.",
    ["B-ORG", "O", "O", "B-MISC", "I-MISC", "O", "O", "O", "O"])

add("Anfield is known for its atmosphere in Liverpool.",
    ["B-LOC", "O", "O", "O", "O", "O", "O", "B-LOC", "O"])

add("Bayern Munich defeated Barcelona in Lisbon.",
    ["B-ORG", "I-ORG", "O", "B-ORG", "O", "B-LOC", "O"])

add("Wimbledon is the oldest tennis tournament.",
    ["B-MISC", "O", "O", "O", "O", "O", "O"])

add("Eden Gardens hosted the KKR match.",
    ["B-LOC", "I-LOC", "O", "O", "B-ORG", "O", "O"])

add("Mohammed Shami took wickets in the World Cup.",
    ["B-PER", "I-PER", "O", "O", "O", "O", "B-MISC", "I-MISC", "O"])

add("Old Trafford is the home of Manchester United.",
    ["B-LOC", "I-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG", "O"])

add("Hardik Pandya left Gujarat Titans for Mumbai Indians.",
    ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O"])

add("Sourav Ganguly was the president of BCCI.",
    ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O"])

add("Premier League is popular in India.",
    ["B-MISC", "I-MISC", "O", "O", "O", "B-LOC", "O"])

add("Vinicius Junior plays for Real Madrid.",
    ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O"])

add("Arun Jaitley Stadium is located in Delhi.",
    ["B-LOC", "I-LOC", "I-LOC", "O", "O", "O", "B-LOC", "O"])

add("Team India departed for South Africa today.",
    ["O", "B-LOC", "O", "O", "B-LOC", "I-LOC", "O", "O"])

add("Euro 2024 was an exciting tournament.",
    ["B-MISC", "I-MISC", "O", "O", "O", "O", "O"])

add("Kolkata Knight Riders celebrated their victory at Eden Gardens.",
    ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"])

add("The Border-Gavaskar Trophy is played between India and Australia.",
    ["O", "B-MISC", "I-MISC", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"])

add("Xavi Hernandez managed Barcelona before Hansi Flick took over.",
    ["B-PER", "I-PER", "O", "B-ORG", "O", "B-PER", "I-PER", "O", "O", "O"])

add("Dharamsala Stadium offers a great view of the Himalayas.",
    ["B-LOC", "I-LOC", "O", "O", "O", "O", "O", "O", "B-LOC", "O"])

add("The Champions League final will be held in Munich.",
    ["O", "B-MISC", "I-MISC", "O", "O", "O", "O", "O", "B-LOC", "O"])

add("Sunil Chhetri is the captain of the Indian football team.",
    ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-LOC", "O", "O", "O"])

add("Manchester City signed Erling Haaland from Borussia Dortmund.",
    ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O"])

add("The Ashes is a famous cricket series between England and Australia.",
    ["O", "B-MISC", "O", "O", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"])

add("Neeraj Chopra won a gold medal at the Tokyo Olympics.",
    ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "B-MISC", "I-MISC", "O"])

add("Lionel Messi played for Barcelona for many years.",
    ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "O", "O", "O"])

add("The T20 World Cup was held in the West Indies.",
    ["O", "B-MISC", "I-MISC", "I-MISC", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"])

add("Ansu Fati returned to the Ciutat Esportiva Joan Gamper.",
    ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "O"])

add("Shubman Gill opened the innings at the Gabba.",
    ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-LOC", "O"])

add("The Santiago Bernabeu hosted the Copa Libertadores final.",
    ["O", "B-LOC", "I-LOC", "O", "O", "B-MISC", "I-MISC", "O", "O"])

add("Mohun Bagan and East Bengal are rivals in Kolkata.",
    ["B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "O"])

add("Kylian Mbappe joined Real Madrid in the summer window.",
    ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O", "O", "O", "O"])

add("The ISL has improved the quality of football in India.",
    ["O", "B-MISC", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "O"])

add("San Siro is shared by AC Milan and Inter Milan.",
    ["B-LOC", "I-LOC", "O", "O", "O", "B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O"])

add("Pat Cummins led Australia to a World Test Championship win.",
    ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O", "O"])

add("Pep Guardiola implemented tiki-taka at Barcelona.",
    ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "O"])

print(f"Dataset successfully created with {len(dataset)} sentences.")

Dataset successfully created with 40 sentences.
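The `assert` inside `add` only checks that the number of labels matches the number of tokens. Hand-written BIO sequences can also be malformed in a second way: an `I-` tag that does not continue an entity of the same type. The following is a small sketch of such a check; `bio_errors` is a hypothetical helper, not part of the notebook.

```python
def bio_errors(tags):
    """Return the indices where an I- tag does not continue an entity of the same type."""
    errors = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            ent_type = tag[2:]
            if previous not in (f"B-{ent_type}", f"I-{ent_type}"):
                errors.append(i)
        previous = tag
    return errors


good = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-MISC", "O", "O"]
bad = ["O", "I-LOC", "O", "B-PER", "I-ORG"]   # hypothetical faulty annotation

print(bio_errors(good))  # []
print(bio_errors(bad))   # [1, 4]
```

Running it over the label sequence of every sentence in `dataset` would flag annotation slips before they reach the evaluation.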


In [29]:
# Load pre-trained spaCy NER model
nlp = spacy.load("en_core_web_sm")

#### Step 4: Collect Actual & Predicted Labels 


In [30]:
actual_labels = []

# Flatten the gold labels into one list, sentence by sentence
for sentence in dataset:
    for token, label in sentence:
        actual_labels.append(label)

print("Total actual labels: ", len(actual_labels))

Total actual labels:  358


In [31]:
predicted_labels = []

for sentence in dataset:
    text = " ".join(token for token, _ in sentence)
    doc = nlp(text)

    for token in doc:
        if token.ent_type_ == "":
            predicted_labels.append("O")
        else:
            predicted_labels.append(token.ent_iob_ + "-" + token.ent_type_)

print("Total predicted labels:", len(predicted_labels))

Total predicted labels: 362


In [32]:
# Check token alignment: spaCy re-tokenizes the joined text, so its token count can differ from NLTK's
print(len(actual_labels) == len(predicted_labels))
for sentence in dataset:
    tokens_nltk = [token for token, _ in sentence]
    text = " ".join(tokens_nltk)
    doc = nlp(text)

    if len(tokens_nltk) != len(doc):
        print("Mismatch found!")
        print(f"Original: {tokens_nltk} (Length: {len(tokens_nltk)})")
        print(f"SpaCy saw: {[t.text for t in doc]} (Length: {len(doc)})")
        print("-" * 30)

False
Mismatch found!
Original: ['The', 'Border-Gavaskar', 'Trophy', 'is', 'played', 'between', 'India', 'and', 'Australia', '.'] (Length: 10)
SpaCy saw: ['The', 'Border', '-', 'Gavaskar', 'Trophy', 'is', 'played', 'between', 'India', 'and', 'Australia', '.'] (Length: 12)
------------------------------
Mismatch found!
Original: ['Pep', 'Guardiola', 'implemented', 'tiki-taka', 'at', 'Barcelona', '.'] (Length: 7)
SpaCy saw: ['Pep', 'Guardiola', 'implemented', 'tiki', '-', 'taka', 'at', 'Barcelona', '.'] (Length: 9)
------------------------------
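The mismatch matters for everything that follows: `zip(actual_labels, predicted_labels)` pairs labels purely by position, so from the first hyphenated word onward the gold labels are compared against predictions for different tokens. One way to restore alignment is to project the model's labels back onto the gold tokens. The sketch below assumes each gold token is the exact concatenation of one or more consecutive model tokens (true for both mismatches shown above) and keeps the label of the first piece. The function name and the example `fine_labels` are hypothetical and are not real spaCy output.

```python
def project_to_gold_tokens(gold_tokens, fine_tokens, fine_labels):
    """Give each gold token the label of the first finer token it contains.

    Assumes the model tokenizer only splits gold tokens and never merges them.
    """
    projected = []
    j = 0
    for gold in gold_tokens:
        start = j
        pieces = ""
        while pieces != gold:
            if j >= len(fine_tokens) or not gold.startswith(pieces + fine_tokens[j]):
                raise ValueError(f"Cannot align {gold!r} at model token {j}")
            pieces += fine_tokens[j]
            j += 1
        projected.append(fine_labels[start])
    if j != len(fine_tokens):
        raise ValueError("Unconsumed model tokens remain after alignment")
    return projected


gold_tokens = ["The", "Border-Gavaskar", "Trophy", "is", "played",
               "between", "India", "and", "Australia", "."]
fine_tokens = ["The", "Border", "-", "Gavaskar", "Trophy", "is", "played",
               "between", "India", "and", "Australia", "."]
# Hypothetical model labels for the 12 finer tokens (illustration only)
fine_labels = ["O", "B-EVENT", "I-EVENT", "I-EVENT", "I-EVENT", "O", "O",
               "O", "B-GPE", "O", "B-GPE", "O"]

aligned = project_to_gold_tokens(gold_tokens, fine_tokens, fine_labels)
print(len(aligned), aligned)
```

An alternative that avoids the problem at the source is to hand spaCy the NLTK tokens directly by constructing a `spacy.tokens.Doc` with `words=` and running the pipeline components on it, so that both sides share one tokenization.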


#### Step 5: Confusion Matrix
A confusion matrix counts, for every pair of labels, how many times:
- A true label was predicted correctly (diagonal cells)
- A true label was predicted as a different label (off-diagonal cells)
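As a small illustration of the idea, the sketch below counts (actual, predicted) pairs for six toy tokens. The toy labels are invented for this example and are not taken from the dataset.

```python
from collections import Counter

# Toy labels, invented for illustration only
actual    = ["PER", "PER", "ORG", "O", "LOC", "O"]
predicted = ["PER", "O",   "LOC", "O", "LOC", "ORG"]

pair_counts = Counter(zip(actual, predicted))
toy_labels = sorted(set(actual) | set(predicted))

# Rows are actual labels, columns are predicted labels
toy_matrix = [[pair_counts[(a, p)] for p in toy_labels] for a in toy_labels]

print("columns:", toy_labels)
for label, row in zip(toy_labels, toy_matrix):
    print(f"{label:4}", row)
```

Reading the `PER` row: one `PER` token was predicted correctly and one was missed (predicted as `O`).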



In [33]:
# Define the label set

labels = sorted(set(actual_labels + predicted_labels))
# set() removes duplicates, leaving only the unique tags

label_to_index = {label: i for i, label in enumerate(labels)}
# Assigns a row/column index to each label

Build The Confusion Matrix

In [34]:
import numpy as np

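matrix = np.zeros((len(labels), len(labels)), dtype=int)
# np.zeros(...): creates a grid filled entirely with zeros.
# (len(labels), len(labels)): the shape; with 5 unique labels the grid is 5 x 5.
# dtype=int: stores the counts as integers rather than floats.

# Rows are actual labels, columns are predicted labels.
# Note: zip() stops at the shorter list, so both lists must be aligned token by token.
for actual, predicted in zip(actual_labels, predicted_labels):
    i = label_to_index[actual]
    j = label_to_index[predicted]
    matrix[i, j] += 1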

Display Confusion Matrix

In [35]:
import pandas as pd

confusion_df = pd.DataFrame(matrix, index=labels, columns=labels)
confusion_df

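| Actual (rows) / Predicted (columns) | B-DATE | B-EVENT | B-FAC | B-GPE | B-LOC | B-MISC | B-NORP | B-ORG | B-PER | B-PERSON | ... | I-EVENT | I-FAC | I-GPE | I-LOC | I-MISC | I-NORP | I-ORG | I-PER | I-PERSON | O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B-DATE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B-EVENT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B-FAC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B-GPE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B-LOC | 0 | 0 | 1 | 8 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 15 |
| B-MISC | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| B-NORP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B-ORG | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 6 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 2 | 8 |
| B-PER | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 2 | 0 | 4 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| B-PERSON | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |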


#### Step 6: Label Normalization

In [None]:
def normalize_label(label):
    """
    Normalize labels into a common schema so that
    ground truth and spaCy predictions match.
    """

    # Outside label
    if label == "O":
        return "O"

    # Remove BIO prefix if present
    if "-" in label:
        label = label.split("-")[1]

    # Map spaCy labels to dataset labels
    label_mapping = {
        "PERSON": "PER",
        "PER": "PER",

        "ORG": "ORG",

        "GPE": "LOC",
        "LOC": "LOC",
        "FAC": "LOC",

        "NORP": "MISC",
        "EVENT": "MISC",
        "MISC": "MISC",

        "DATE": "DATE"
    }

    return label_mapping.get(label, "O")


Actual generalized labels: Counter({'O': 208, 'ORG': 43, 'LOC': 43, 'PER': 34, 'MISC': 30})
Predicted generalized labels: Counter({'O': 218, 'ORG': 37, 'PERSON': 32, 'GPE': 31, 'EVENT': 20, 'DATE': 8, 'LOC': 4, 'NORP': 4, 'FAC': 4})


In [None]:
actual_general = []
predicted_general = []

for a, p in zip(actual_labels, predicted_labels):
    actual_general.append(normalize_label(a))
    predicted_general.append(normalize_label(p))


In [73]:
from collections import Counter

print("Actual generalized labels:", Counter(actual_general))
print("Predicted generalized labels:", Counter(predicted_general))


Actual generalized labels: Counter({'O': 208, 'ORG': 43, 'LOC': 43, 'PER': 34, 'MISC': 30})
Predicted generalized labels: Counter({'O': 218, 'LOC': 39, 'ORG': 37, 'PER': 32, 'MISC': 24, 'DATE': 8})
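The effect of the mapping can be checked by hand. The sketch below starts from the raw spaCy label counts printed earlier in this step (the Counter that still contains `PERSON`, `GPE`, `EVENT` and so on), applies the same mapping as `normalize_label`, and arrives at the normalized counts shown just above. The dictionary `mapping` restates the notebook's `label_mapping` with an explicit entry for `O`.

```python
from collections import Counter

# Raw predicted label counts, copied from the earlier output in this step
raw_predicted = Counter({"O": 218, "ORG": 37, "PERSON": 32, "GPE": 31,
                         "EVENT": 20, "DATE": 8, "LOC": 4, "NORP": 4, "FAC": 4})

# Same mapping as normalize_label, with "O" made explicit
mapping = {
    "O": "O",
    "PERSON": "PER", "PER": "PER",
    "ORG": "ORG",
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "NORP": "MISC", "EVENT": "MISC", "MISC": "MISC",
    "DATE": "DATE",
}

collapsed = Counter()
for label, count in raw_predicted.items():
    collapsed[mapping.get(label, "O")] += count   # unknown labels fall back to "O"

print(collapsed)
```

For example, `LOC` becomes 31 (GPE) + 4 (LOC) + 4 (FAC) = 39. Note that `DATE` is kept as its own class even though the gold labels never use it, so every `DATE` prediction counts as an error.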


In [74]:
general_labels = sorted(set(actual_general + predicted_general))
label_to_index_general = {label: i for i, label in enumerate(general_labels)}

print(general_labels)

['DATE', 'LOC', 'MISC', 'O', 'ORG', 'PER']


#### Step 7: Generalized Confusion Matrix


In [75]:
general_matrix = np.zeros((len(general_labels), len(general_labels)), dtype=int)

for actual, predicted in zip(actual_general, predicted_general):
    i = label_to_index_general[actual]
    j = label_to_index_general[predicted]
    general_matrix[i][j] +=1

In [67]:
general_confusion_df = pd.DataFrame(
    general_matrix,
    index = general_labels,
    columns= general_labels
)

general_confusion_df

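| Actual (rows) / Predicted (columns) | DATE | EVENT | FAC | GPE | LOC | MISC | NORP | O | ORG | PER | PERSON |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DATE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| EVENT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FAC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GPE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| LOC | 0 | 0 | 2 | 10 | 2 | 0 | 0 | 24 | 3 | 0 | 2 |
| MISC | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 14 | 2 | 0 | 2 |
| NORP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O | 7 | 8 | 2 | 9 | 1 | 0 | 1 | 151 | 15 | 0 | 14 |
| ORG | 0 | 0 | 0 | 5 | 1 | 0 | 3 | 17 | 12 | 0 | 5 |
| PER | 0 | 1 | 0 | 7 | 0 | 0 | 0 | 12 | 5 | 0 | 9 |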
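The table above still carries spaCy's raw column names, although `general_labels` in Step 6 has only six classes; it appears to be output from a run made before the mapping was applied (its cell number precedes the others in this step). As a sketch, the columns can be merged with the Step 6 mapping to obtain the six-class matrix that the metrics in Step 8 are based on. The variable names below are introduced for this example only.

```python
raw_columns = ["DATE", "EVENT", "FAC", "GPE", "LOC", "MISC", "NORP", "O", "ORG", "PER", "PERSON"]

# Non-empty rows copied from the table above (rows = actual, columns = raw predicted)
raw_rows = {
    "LOC":  [0, 0, 2, 10, 2, 0, 0, 24, 3, 0, 2],
    "MISC": [1, 11, 0, 0, 0, 0, 0, 14, 2, 0, 2],
    "O":    [7, 8, 2, 9, 1, 0, 1, 151, 15, 0, 14],
    "ORG":  [0, 0, 0, 5, 1, 0, 3, 17, 12, 0, 5],
    "PER":  [0, 1, 0, 7, 0, 0, 0, 12, 5, 0, 9],
}

# Same column mapping as normalize_label in Step 6
column_map = {"DATE": "DATE", "EVENT": "MISC", "FAC": "LOC", "GPE": "LOC", "LOC": "LOC",
              "MISC": "MISC", "NORP": "MISC", "O": "O", "ORG": "ORG",
              "PER": "PER", "PERSON": "PER"}

schema = ["DATE", "LOC", "MISC", "O", "ORG", "PER"]
collapsed = {row: {col: 0 for col in schema} for row in schema}

for row, counts in raw_rows.items():
    for raw_col, count in zip(raw_columns, counts):
        collapsed[row][column_map[raw_col]] += count

print("columns:", schema)
for row in schema:
    print(f"{row:5}", [collapsed[row][col] for col in schema])
```

The diagonal of the merged matrix (LOC 14, MISC 11, O 151, ORG 12, PER 9) sums to 197 of 358 tokens, which agrees with the accuracy of 0.55 reported in Step 8.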


#### Step 8: Calculation for Accuracy, Precision, Recall, F1-Score 
Accuracy: the fraction of tokens whose predicted label matches the actual label.


In [76]:
# Accuracy: fraction of tokens whose normalized labels agree
total = len(actual_general)
correct = sum(a == p for a, p in zip(actual_general, predicted_general))

accuracy = correct / total
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.55


- Precision: Of the tokens predicted as a class, how many actually belong to it
- Recall: Of the tokens that actually belong to a class, how many were found
- F1-score: Harmonic mean of precision and recall
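Written against the generalized confusion matrix from Step 7, with rows as actual labels and columns as predicted labels, the per-class quantities can be stated as follows. The symbols $M$ (the matrix) and $c$ (a class) are notation introduced here, not names from the notebook.

```latex
\begin{aligned}
TP_c &= M_{cc}, \qquad FP_c = \sum_{i \neq c} M_{ic}, \qquad FN_c = \sum_{j \neq c} M_{cj} \\
\text{Precision}_c &= \frac{TP_c}{TP_c + FP_c}, \qquad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c} \\
F1_c &= \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}
\end{aligned}
```

When a denominator is zero, the code reports 0 for that metric. This is why `DATE`, which never occurs in the ground truth and is never predicted correctly, scores 0 on all three.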

In [77]:
for label in general_labels:
    idx = label_to_index_general[label]

    TP = general_matrix[idx][idx]
    FP = general_matrix[:, idx].sum() - TP
    FN = general_matrix[idx, :].sum() - TP

    precision = TP / (TP + FP) if (TP + FP) != 0 else 0
    recall = TP / (TP + FN) if (TP + FN) != 0 else 0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) != 0 else 0

    print(f"\nEntity: {label}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall:    {recall:.2f}")
    print(f"F1-score:  {f1:.2f}")






Entity: DATE
Precision: 0.00
Recall:    0.00
F1-score:  0.00

Entity: LOC
Precision: 0.36
Recall:    0.33
F1-score:  0.34

Entity: MISC
Precision: 0.46
Recall:    0.37
F1-score:  0.41

Entity: O
Precision: 0.69
Recall:    0.73
F1-score:  0.71

Entity: ORG
Precision: 0.32
Recall:    0.28
F1-score:  0.30

Entity: PER
Precision: 0.28
Recall:    0.26
F1-score:  0.27


#### Step 9: Error Analysis
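The per-class scores in Step 8 are low across the board: among the entity classes, MISC scores highest (F1 0.41) and PER lowest (F1 0.27), while O reaches 0.71. The low PER score is surprising, since player names are usually easy to identify from capitalization and common naming patterns. A likely contributor is the tokenization mismatch found in Step 4: spaCy produces 362 tokens against 358 gold labels, so every label after the first hyphenated word is compared with the prediction for a different token, and the scores probably understate the model's true performance. Beyond that, confusion is observed between ORG and GPE/LOC entities, especially in football-related text where club names and city names are closely related. For example, club names such as “Barcelona” and “Chelsea” are also place names and can be tagged as locations. Many entity tokens are predicted as O (for example, 12 of the 34 PER tokens), indicating missed detections, which reduces recall. A plausible reason is that spaCy’s pre-trained model is trained on general-domain text such as news and web content, not on sports-specific corpora.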


The model also shows false positives, where non-entity words are incorrectly tagged as entities (57 of the 208 O tokens), which lowers precision. Multi-word names sometimes suffer from boundary detection errors, where only part of the entity is identified. Abbreviations and short names also contribute to misclassification, and an informal sports writing style further increases ambiguity. Because O accounts for 208 of the 358 tokens, it dominates the accuracy figure, so accuracy alone is an unreliable measure of entity recognition quality.
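The point about accuracy can be made concrete with the counts already reported in this notebook. The sketch below compares overall accuracy with accuracy restricted to entity tokens, and with a trivial baseline that labels every token `O`. The correct-token counts are read off the Step 7 matrix after applying the Step 6 mapping; the variable names are introduced for this example.

```python
# Actual token totals per class, from the Step 6 Counter output
actual_total = {"O": 208, "ORG": 43, "LOC": 43, "PER": 34, "MISC": 30}

# Correctly predicted tokens per class, from the Step 7 matrix after mapping
# (e.g. LOC = FAC 2 + GPE 10 + LOC 2)
correct = {"O": 151, "ORG": 12, "LOC": 2 + 10 + 2, "PER": 9, "MISC": 11}

total = sum(actual_total.values())
entity_labels = [label for label in actual_total if label != "O"]

accuracy = sum(correct.values()) / total
entity_accuracy = (sum(correct[label] for label in entity_labels)
                   / sum(actual_total[label] for label in entity_labels))
always_o_accuracy = actual_total["O"] / total   # baseline: predict "O" everywhere

print(f"Overall accuracy:       {accuracy:.2f}")           # 0.55
print(f"Entity-token accuracy:  {entity_accuracy:.2f}")    # 0.31
print(f"Always-O baseline:      {always_o_accuracy:.2f}")  # 0.58
```

A baseline that never finds a single entity would score higher on accuracy than the evaluated predictions, which is why the per-class precision, recall, and F1 values are the more informative numbers here.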


To improve performance, the model can be fine-tuned on a sports-specific NER dataset. Increasing the number of annotated sports sentences would help improve recall. Custom entity categories like TEAM and PLAYER could reduce confusion between clubs, cities, and people. Using contextual embeddings or domain-adapted models would likely enhance performance further. Proper preprocessing, in particular making the gold and model tokenizations agree, and consistent annotation can also reduce errors significantly.