# **Testing Named Entity Recognition (NER) Task**
#### Tests fundamental capability: Understanding

## Introduction

NER (Named Entity Recognition) is a fundamental task in Natural Language Processing (NLP). It involves identifying named entities in unstructured text, such as people, organizations, and locations, and classifying them into predefined categories. Testing NER therefore measures how well a model can locate these entities and assign each one the correct category.

In this tutorial, we will evaluate the Named Entity Recognition (NER) capabilities of a large language model (LLM). The tutorial uses the CoNLL 2003 dataset, a benchmark for NER tasks, to assess the model's ability to identify and classify named entities such as people, organizations, and locations in text. The notebook includes code to load the dataset, query the model through the OpenAI Python client (as written, the code cells call `gemma3:4b` served by a local Ollama instance), and evaluate the model's performance by comparing its predictions to the ground truth annotations in the CoNLL dataset.


The CoNLL dataset consists of text data with annotated named entities: Person, Organization, Location, and Miscellaneous.

For additional information about the CoNLL benchmark: https://arxiv.org/pdf/cs/0306050v1
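
The entity annotations used later in this notebook are per-token tags with a `B-` (beginning), `I-` (inside), or `O` (outside) prefix. As a rough illustration of how such tags map back to whole entities, here is a minimal sketch. The sentence is made up for illustration (it is not taken from CoNLL), and `bio_to_spans` is a hypothetical helper, not part of any library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues and current_tokens:
            # The running entity ends here, so emit it.
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new entity. A stray "I-" tag is leniently treated as a start.
            current_type, current_tokens = entity_type, [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative sentence (an assumption, not a dataset row)
tokens = ["John", "Smith", "lives", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('PER', 'John Smith'), ('LOC', 'New York')]
```

Multi-token names such as "New York" are why the scheme distinguishes the first token of an entity from the tokens that continue it.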


## Step 1: Install Pre-requisites

In this step, we install and import the prerequisites.

We need to install the following libraries:
- `openai`: For interacting with the OpenAI API to query the LLM.
- `python-dotenv`: To manage API keys securely using environment variables.
- `datasets`: The datasets library provides easy access to a wide variety of datasets commonly used for natural language processing tasks.
- `tqdm`: Adds progress bars to loops, making it easier to monitor and visualize progress.
- `rich`: A library to render rich text (for display purposes).
- `scikit-learn`: A machine learning library with implementations of algorithms and metrics for classification, regression, and clustering tasks.

In [None]:
# %pip install openai==1.102.0 python-dotenv datasets tqdm rich scikit-learn

Next, we will import the necessary libraries that will be used for various activities such as data processing, API interaction, and environment management tasks.

In [None]:
import os
import openai
import random
from dotenv import load_dotenv
from rich import print as rprint
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support
from IPython.display import display, HTML
from tqdm import tqdm
import ast
import re

## Step 2: Load LLM

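Next, we connect to the LLM. The cell below points the OpenAI client at a local Ollama server, which accepts a placeholder API key. The commented-out lines show how to load a real key from an environment file instead.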

In [None]:
# Load API key from environment file
# load_dotenv(dotenv_path="../apikey.env.txt")  # replace the "file path" with the location of your API key file

# APIKEY = os.getenv("APIKEY")

# openai.api_key = APIKEY

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"                       # Dummy key
)
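
The cell above contains two configurations: a commented-out key loaded from an environment file and an active local Ollama endpoint. One way to switch between them without editing code is to derive the client arguments from environment variables. The sketch below is only an assumption about how that could look: `client_settings` is a hypothetical helper, `LLM_BASE_URL` is a made-up variable name, and `APIKEY` is the name used in the commented-out lines.

```python
import os


def client_settings(env=None):
    """Return keyword arguments for OpenAI(...), chosen from environment variables.

    Hypothetical helper: if APIKEY is set, use the hosted API with that key;
    otherwise fall back to a local Ollama endpoint with a placeholder key.
    """
    env = os.environ if env is None else env
    api_key = env.get("APIKEY")
    if api_key:
        # Hosted API: rely on the client's default base URL.
        return {"api_key": api_key}
    return {
        # LLM_BASE_URL is an assumed variable name, not an official one.
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": "ollama",  # placeholder key, as in the cell above
    }


local = client_settings({})
hosted = client_settings({"APIKEY": "sk-example"})
print(local)
print(hosted)
```

The result can then be passed to the client as `OpenAI(**client_settings())`. Remember that the `model` argument in later cells must also match whichever backend is selected.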

The following helper sends a system message and a user message to the model and returns the text of its reply.

In [None]:
def GetModelResponse(system_content, user_content):
    system = {'role': 'system', 'content': system_content}
    user = {'role': 'user', 'content': user_content}

    response = client.chat.completions.create(
        model="gemma3:4b",
        messages=[system, user],
        #max_tokens=2000
        #temperature=1.0,
    )

    content = response.choices[0].message.content
    return content

## Step 3: Load test dataset

The CoNLL dataset consists of articles with annotated entities like person, organization, location and miscellaneous.

Next, we will download and load the CoNLL 2003 dataset.

In [None]:
conll_data = load_dataset("conll2003", revision="refs/convert/parquet")

**Exploring the CoNLL dataset**

The CoNLL dataset consists of three parts: a training set, a validation set, and a test set.

We will use the test dataset to perform NER evaluations.

In [None]:
# Display basic statistics about the dataset
train_data = conll_data['train']
test_data = conll_data['test']
valid_data = conll_data['validation']

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(valid_data)}")
print(f"Test set size: {len(test_data)}")

**The test set consists of 3453 instances**.

As a tester, you can choose how many of the 3453 instances the LLM is evaluated on. **If no value is entered, the script defaults to 25 instances**.

In [None]:
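# Ask user for the number of test instances to evaluate
num_test_instances = input("Enter the number of test instances to evaluate (default 25): ").strip()
# Fall back to 25 on empty, non-numeric, or zero input, and cap at the test set size
num_test_instances = int(num_test_instances) if num_test_instances.isdecimal() and int(num_test_instances) > 0 else 25
num_test_instances = min(num_test_instances, len(test_data))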

# Select the random test instances
random.seed(12)
test_instances = random.sample(list(test_data), num_test_instances)

rprint(f"[blue]Number of test instances: {len(test_instances)}[/blue]")

## Step 4: Prompt Construction

We will construct a prompt that instructs the model to identify and classify the entities in the input text. Then, we will evaluate the model's performance by comparing the predicted labels with the ground truth.

In [None]:
# Map numeric tags to string labels for CoNLL dataset
# (order follows the Hugging Face `conll2003` label ids: ORG is 3/4, LOC is 5/6)
label_mapping = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
    7: 'B-MISC',
    8: 'I-MISC'
}

def get_ner(tokens):
    # Convert tokens to a string format that's cleaner
    tokens_str = str(tokens)
    
    response = client.chat.completions.create(
        model="gemma3:4b",
        messages=[
            {"role": "system", "content": "You are a named entity recognition system. You must respond with ONLY a valid Python dictionary where keys are tokens and values are NER tags. No explanations, no additional text, just the dictionary."},
            {"role": "user", "content": f"""Tokens: {tokens_str}

Tag each token with one of: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-MISC, I-MISC

Example format: {{'John': 'B-PER', 'lives': 'O', 'in': 'O', 'London': 'B-LOC'}}

Response format (dictionary only):"""}
        ]
    )
    
    ner_results = response.choices[0].message.content.strip()
    
    # Remove any markdown formatting
    ner_results = re.sub(r'```python\s*', '', ner_results)
    ner_results = re.sub(r'```\s*', '', ner_results)

    # Remove a leading "json" language tag left behind by a json code fence
    ner_results = re.sub(r'^json\s*\n?', '', ner_results)
    ner_results = ner_results.strip()
    
    try:
        # Use ast.literal_eval for safe evaluation
        ner_dict = ast.literal_eval(ner_results)
        if isinstance(ner_dict, dict):
            # Ensure all tokens are covered, default to 'O' if missing
            result = {}
            for token in tokens:
                result[token] = ner_dict.get(token, 'O')
            return result
    except (ValueError, SyntaxError) as e:
        print(f"Parsing error: {e}")
        print(f"Raw response: {ner_results}")
        # Return default tags for all tokens
        return {token: 'O' for token in tokens}
    
    return {token: 'O' for token in tokens}
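
One limitation worth keeping in mind: a dictionary keyed by token text can hold only one tag per distinct token, so a word that occurs twice in a sentence with different tags collapses to a single entry. The sketch below shows the effect on a made-up sentence and a position-aligned alternative. `align_tags` is a hypothetical helper, not something the notebook defines elsewhere.

```python
def align_tags(tokens, tags):
    """Pair each token with its tag by position so repeated tokens stay distinct."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


# Illustrative sentence (an assumption): "Paris" is a person first, then a place.
tokens = ["Paris", "Hilton", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]

by_token = dict(zip(tokens, tags))       # later duplicates overwrite earlier ones
by_position = align_tags(tokens, tags)   # one entry per token position

print(by_token)      # {'Paris': 'B-LOC', 'Hilton': 'I-PER', 'visited': 'O'}
print(by_position)   # [('Paris', 'B-PER'), ('Hilton', 'I-PER'), ('visited', 'O'), ('Paris', 'B-LOC')]
```

Asking the model for a list of tags in token order, and comparing lists rather than dictionaries, would avoid this collapse at the cost of having to check that the returned list has the expected length.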

Next, we'll evaluate the LLM's performance by comparing its predicted NER tags against the ground truth labels.

Correctly predicted tokens will be highlighted in green, while mismatches will be displayed in orange.

In [None]:
# Compare LLM predictions with ground truth
predictions = []
ground_truths = []

for instance in test_instances:
    tokens = instance['tokens']
    ground_truth = {tokens[i]: label_mapping[instance['ner_tags'][i]] for i in range(len(tokens))}
    prediction = get_ner(tokens)

    # print(tokens)
    # print(ground_truth)
    # print(prediction)

    predictions.append(prediction)
    ground_truths.append(ground_truth)

    display_text = []
    for token in tokens:
        pred_tag = prediction.get(token, 'O')
        gt_tag = ground_truth[token]
        if pred_tag == gt_tag:
            display_text.append(f"<span style='color: green;'>{token}</span>")
        else:
            display_text.append(f"<span style='color: orange;'>{token}</span>")

    display(HTML(f"<b>Input Text:</b> {' '.join(display_text)}"))
    display(HTML(f"<b>Ground Truth:</b> {ground_truth}"))
    display(HTML(f"<b>Prediction:</b> {prediction}"))
    display(HTML("<hr>"))

# Function to convert NER tags to a comparable format
def convert_to_comparable_format(ner_dict):
    return [(word, tag) for word, tag in ner_dict.items()]

# Calculate precision, recall, and F1-score
y_true = []
y_pred = []

for gt, pred in zip(ground_truths, predictions):
    gt_converted = convert_to_comparable_format(gt)
    pred_converted = convert_to_comparable_format(pred)

    for word, tag in gt_converted:
        y_true.append(tag)
        y_pred.append(pred.get(word, 'O'))  # 'O' for non-entity

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0)
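
The weighted average above counts the `O` tag like any other class. Because most tokens in running text are not part of an entity, `O` tends to dominate the score. As a complementary view, here is a minimal pure-Python sketch of micro-averaged precision, recall, and F1 restricted to entity tags. `entity_micro_prf` is a hypothetical helper and the tag lists are illustrative, not model output.

```python
def entity_micro_prf(y_true, y_pred, outside="O"):
    """Micro-averaged precision/recall/F1 over all tags except `outside`."""
    # A true positive is a position where both lists agree on a non-O tag.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p and t != outside)
    predicted = sum(1 for p in y_pred if p != outside)  # tokens tagged as entities
    actual = sum(1 for t in y_true if t != outside)     # tokens that are entities
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative tags (assumptions, not results from the model)
example_true = ["B-PER", "I-PER", "O", "O", "B-LOC"]
example_pred = ["B-PER", "O", "O", "B-ORG", "B-LOC"]
ent_p, ent_r, ent_f1 = entity_micro_prf(example_true, example_pred)
print(f"{ent_p:.4f} {ent_r:.4f} {ent_f1:.4f}")  # 0.6667 0.6667 0.6667
```

In the notebook, the same function could be called on `y_true` and `y_pred` to report entity-only scores alongside the weighted ones.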

**Finally, print the results.**

In [None]:
print("Results summary")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")