# Named Entity Recognition

In this notebook we'll

* List some common applications of NER
* Give a brief history of NER
* Demonstrate how to set up and fine-tune a DistilBERT model for NER
* Discuss some of the issues with using an LLM for an NER task

First, make sure your course package is updated for this lesson and homework.  You need to do this once per server, but not once per notebook.  The exact path will depend on where this notebook is in relation to the folder /Lessons/Course_Tools.

In [None]:
!pip install ../Course_Tools/introdl

After running that cell, you should restart the kernel.

## Applications of NER

I wasn't really familiar with Named Entity Recognition before building this course.  However, after studying it for a bit I realize it's very similar to object detection and instance segmentation in computer vision where we're trying to "tag" individual objects in an image.  Now we're doing it with text.  Now that I know more about it I realize that NER is everywhere:

- **Information Extraction from Text**
  - Identify names of people, places, organizations, and dates in news articles, legal documents, and academic papers.

- **Search and Question Answering**
  - Improve retrieval and understanding by recognizing key entities in queries and documents (e.g., “Where was Barack Obama born?”).

- **Social Media Monitoring**
  - Detect mentions of public figures, brands, products, and locations in tweets, posts, and comments for sentiment analysis or moderation.

- **Marketing and Trend Analysis**
  - Track mentions of brands, competitors, or topics over time to identify emerging trends and customer interests.

- **Content Recommendation**
  - Extract entities (e.g., movies, products, places) from reviews and user posts to personalize content or advertisements.

- **Customer Support Automation**
  - Identify product names, user accounts, and issue types in support chats and emails to assist routing and auto-response systems.

- **Financial and Business Intelligence**
  - Extract company names, stock tickers, monetary values, and events from reports or articles to support decision-making.

- **Medical and Clinical Text Analysis**
  - Identify diseases, medications, and procedures in clinical notes for tasks like anonymization, coding, or record analysis.

- **Legal and Compliance Monitoring**
  - Recognize case names, organizations, and laws in legal documents to support research, auditing, or compliance checks.

- **Resume and Job Post Parsing**
  - Extract structured information such as skills, education, job titles, and companies to streamline recruitment processes.

## **Chronology of State-of-the-Art Approaches for Named Entity Recognition (NER)**  

The evolution of NER closely parallels the evolution of algorithms for text classification.  Early approaches were based on rules and statistical models, followed by word embeddings and recurrent neural networks.  Since 2017, transformer architectures have revolutionized the field.

Here's a timeline of some of the key advancements in NER:

---

### **Pre-2010s: Rule-Based Systems and Feature Engineering**  
Early NER systems used **hand-crafted rules**, lookup lists (called **gazetteers**), and basic statistical models like **Hidden Markov Models (HMMs)** and **Conditional Random Fields (CRFs)**.  
- **HMMs** modeled sequences by predicting the most likely tag (e.g., PERSON, LOCATION) for each word based on probabilities.
- **CRFs** improved on HMMs by allowing more flexible features and considering the entire sequence when making predictions.

These approaches required heavy manual feature engineering—like marking whether a word is capitalized, its part of speech, or its prefix/suffix.
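
To give a feel for what that feature engineering looked like, here's a minimal sketch of a per-token feature function of the kind typically fed to a CRF.  The function name `token_features` and the particular features are our own illustration, not taken from any specific system, and part-of-speech features are left out because they would need a separate tagger.

```python
def token_features(tokens, i):
    """Return a dict of hand-crafted features for tokens[i] (illustrative only)."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # Neighboring words give the model a little context
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


sentence = ["Only", "France", "and", "Britain", "backed", "Fischler", "'s", "proposal", "."]
france = token_features(sentence, 1)
print(france)
```

Every one of these features had to be chosen and coded by hand, which is exactly the work that learned embeddings later replaced.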

- **1990s–2000s**: Rule-based systems and statistical models dominated tasks like newswire NER.
- **2003**: The CoNLL-2003 shared task standardized benchmarks and boosted interest in developing better NER models.

---

### **2010s: Word Embeddings and Neural Sequence Models**  
NER systems improved significantly with the introduction of **word embeddings** like **Word2Vec** and **GloVe**, which represented words in continuous vector space based on context. These embeddings replaced sparse, manual features.

- **2013–2015**: **Word2Vec** and **GloVe** made it easier to train neural models for NER.
- **2015–2016**: **BiLSTM-CRF** architectures became popular—combining bidirectional LSTMs (which read sentences both forward and backward) with a CRF layer to model dependencies between entity tags.
- **2015**: **spaCy** launched as a fast, practical NLP library with built-in NER support, making NER accessible for developers and educators.
- **2016–2017**: Character-level embeddings and CNNs were added to improve robustness to spelling variation and rare words.

---

### **Late 2010s: Contextual Embeddings and Transformers**  
NER took a major leap with **contextualized embeddings**, first from deep recurrent models and then from transformers.

- **2018**: **ELMo** introduced deep contextualized word representations that vary based on sentence context.
- **2018**: **BERT** achieved state-of-the-art NER results by treating NER as a token classification problem using bidirectional transformer layers.
- **2019**: **Flair** added character-level contextual embeddings to further improve performance on small or domain-specific datasets.

---

### **2020s: Prompting and Large Language Models (LLMs)**  
Recent NER approaches increasingly use **LLMs** like **GPT-4**, **Claude**, and **Gemini**, which can extract entities using **natural language prompts** instead of token-level supervision.

- **2020–2022**: Models like **RoBERTa**, **SpanBERT**, and **LUKE** fine-tuned transformer architectures for better span detection and entity-aware representations.
- **spaCy** added support for transformer-based pipelines (e.g., `en_core_web_trf`) to make state-of-the-art NER accessible for production use.
- **2023–2025**: Compact zero-shot models like **GLiNER** and instruction-tuned general-purpose LLMs now handle **zero-shot or few-shot NER**.  GLiNER takes a list of entity type names, while LLMs respond to prompts like *"Find all organizations and people in this sentence."* These models reduce the need for annotated datasets and allow rapid prototyping for new entity types.

  While LLMs offer flexibility and ease of use, they may be less precise than traditional models. Hybrid systems often combine LLMs with structured postprocessing or constrained decoding to improve accuracy.

---

We'll focus on two of these tools.  We'll fine-tune a BERT model for NER and we'll look at some of the hurdles to using LLMs for NER.  You'll explore both of these topics further in the homework.

Here's our main import cell before we dive into the rest of the material.

In [None]:
from datasets import load_dataset
import evaluate # Hugging Face library for evaluation
from IPython.display import display
import json
import numpy as np
import pandas as pd
import torch
from transformers import (AutoTokenizer, AutoModelForTokenClassification, 
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

# local packages
from helpers import (display_ner_html, format_ner_eval_results, evaluate_ner,
                     extract_gold_entities) 
from introdl.utils import config_paths_keys, wrap_print_text
from introdl.nlp import llm_generate, llm_configure, llm_list_models

print = wrap_print_text(print, width=120) # you can specify the wrap width for all print statements

paths = config_paths_keys() # import paths and keys
MODELS_PATH = paths['MODELS_PATH']
DATA_PATH = paths['DATA_PATH']


MODELS_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\models
DATA_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\data
TORCH_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HUB_CACHE=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
Successfully logged in to Hugging Face Hub.


## The Dataset - CoNLL2003 for NER

For our examples, we'll use the CoNLL2003 dataset.  It is one of the first widely used benchmarks for Named Entity Recognition (NER). It was introduced as part of the CoNLL-2003 shared task and contains annotated text for four entity types: **PER** (person), **LOC** (location), **ORG** (organization), and **MISC** (miscellaneous). The dataset is derived from Reuters news articles and is structured in the BIO format (**B**eginning, **I**nside, **O**utside), making it a standard for evaluating NER models.

Multiple versions of the dataset are available in Hugging Face.  We chose "tomaarsen/conll2003" because the NER tags are available in BIO format and because the list of possible labels is easy to extract.
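
If the BIO format is new to you, this small sketch shows how entity spans turn into per-token tags: the first token of an entity gets `B-`, the rest get `I-`, and everything else gets `O`.  The helper `spans_to_bio` and the example sentence are our own illustration and are not part of the dataset or the course package.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Barack", "Obama", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
bio = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio)))
```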

In [2]:

# Load the CoNLL2003 dataset (this is not the best-known version of the dataset, but it is the easiest one to load with the datasets library)
dataset = load_dataset("tomaarsen/conll2003")
BIO_tags_list = dataset["train"].features["ner_tags"].feature.names
print("Possible BIO tags", BIO_tags_list)

# delete the pos_tags and chunk_tags columns, as we don't need them
for split in dataset.keys():
    dataset[split] = dataset[split].remove_columns(["pos_tags", "chunk_tags"])


Possible BIO tags ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Each sample in the dataset consists of a single sentence or headline.  Here is how it's stored:

In [3]:
print(dataset["train"][12])

{'id': '12', 'document_id': 1, 'sentence_id': 12, 'tokens': ['Only', 'France', 'and', 'Britain', 'backed', 'Fischler',
"'s", 'proposal', '.'], 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0]}


Notice that the tokens are the words in the sentence split up by whitespace and punctuation.  The ner_tags correspond to indices of the entity tags in our list.  The next bit of code also shows you how to get the BIO tags corresponding to each token:

In [4]:
# Extract tokens and ner_tags from dataset["train"][12]
tokens = dataset["train"][12]["tokens"]
ner_tags = dataset["train"][12]["ner_tags"]

# Map ner_tags to their corresponding BIO tags using BIO_tags_list
bio_tags = [BIO_tags_list[tag] for tag in ner_tags]

# Create a DataFrame
df = pd.DataFrame({"Tokens": tokens, "NER Tags (IDs)": ner_tags, "BIO Tags": bio_tags})

# Display the DataFrame
display(df)

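| | Tokens | NER Tags (IDs) | BIO Tags |
|---|---|---|---|
| 0 | Only | 0 | O |
| 1 | France | 5 | B-LOC |
| 2 | and | 0 | O |
| 3 | Britain | 5 | B-LOC |
| 4 | backed | 0 | O |
| 5 | Fischler | 1 | B-PER |
| 6 | 's | 0 | O |
| 7 | proposal | 0 | O |
| 8 | . | 0 | O |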


[spaCy is a whole ecosystem](https://spacy.io/) of tools for NLP that we won't really dive into much in this course, but it's worth a look if you're going to be working in this area.  They provide some great tools for visualization of tagged text.  We've used their package to make a little function called `display_ner_html` which takes lists of tokens, tag IDs, and the list of labels to produce HTML visualizations of the tags.  The function is in helpers.py if you're curious.  Here's how we can use it:

In [5]:
# tokens and ner_tags were defined in the previous code cell

display_ner_html(tokens, ner_tags, BIO_tags_list)

In [6]:
# here's another example
display_ner_html(dataset["train"][4]["tokens"], dataset["train"][4]["ner_tags"], BIO_tags_list)

## Fine-tune DistilBERT for CoNLL2003

#### L10_1_Fine-tune_BERT Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_fine-tune_bert/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_fine-tune_bert/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/EgBF1mreyjw" target="_blank">Open Descript version of video in new tab</a>


Now we want to fine-tune a BERT model so that it can provide similar tagging for new text.  First we'll load a model and its tokenizer.
`distilbert-base-cased` is a smaller, faster, and lighter version of BERT that retains 97% of its language understanding capabilities while being 40% smaller. It is case-sensitive, meaning it distinguishes between "Apple" and "apple" which is useful for NER tasks. It was trained using masked language modeling on the same data as BERT, including the English Wikipedia and BookCorpus, but with a reduced architecture to improve efficiency. 

Note that we make use of `AutoModelForTokenClassification` which adds a classification head to the backbone the same way we did for transfer learning applications in image classification.  The backbone uses pretrained weights while the classification head weights are randomly initialized and learned during fine-tuning.

In [7]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-cased", num_labels=len(BIO_tags_list))


One of the main issues we'll need to deal with is mapping the BIO tags to the tokens produced by the tokenizer that comes with our selected BERT model.  That tokenizer will break some of our words into subwords.  The first subword of each word keeps the word's tag, while the remaining subwords (and special tokens like [CLS] and [SEP]) get a label ID of -100, which tells the loss function to ignore those tokens.

The function, `tokenize_and_align_labels` below takes care of aligning the ID tags from the input sequence in the dataset to the output tokens in the tokenizer.  We've included some comments in the code if you want to study it, or you can use an AI to help you walk through the details.

In [8]:
# Helper function to align labels with tokens
def tokenize_and_align_labels(examples):
    # Tokenize the input text (list of tokens) while keeping track of word-to-token alignment
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
    # Initialize a list to store the aligned labels for each example
    labels = []
    
    # Iterate over each example in the batch
    for i, label in enumerate(examples["ner_tags"]):
        # Get the word-to-token mapping for the current example
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        
        # Initialize variables to track the previous word index and the label IDs
        previous_word_idx = None
        label_ids = []
        
        # Iterate over the word IDs corresponding to the tokens
        for word_idx in word_ids:
            if word_idx is None:
                # If the token is a special token (e.g., [CLS], [SEP]), ignore it by assigning -100
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # If the token corresponds to a new word, assign the label of that word
                label_ids.append(label[word_idx])
            else:
                # If the token is part of the same word (e.g., subword tokens), ignore it by assigning -100
                label_ids.append(-100)
            
            # Update the previous word index to the current one
            previous_word_idx = word_idx
        
        # Append the aligned label IDs for the current example
        labels.append(label_ids)
    
    # Add the aligned labels to the tokenized inputs
    tokenized_inputs["labels"] = labels
    
    # Return the tokenized inputs with aligned labels
    return tokenized_inputs

# Tokenize datasets
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


The next cell demonstrates how our tokenizer works with the alignment function to get the tokenization expected by the model and to introduce IDs of -100 for each of the subwords introduced by the tokenizer.

In [9]:
# Get the example
example = dataset["train"][7]

# Wrap in a batch of one for compatibility with tokenize_and_align_labels
batch = {"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]}

# Apply the tokenization and alignment function
tokenized = tokenize_and_align_labels(batch)

# Extract and display results
tokens = tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
labels = tokenized["labels"][0]

print(("Before model tokenization:\n"))
display_ner_html(dataset["train"][7]["tokens"], dataset["train"][7]["ner_tags"], BIO_tags_list)
print(("\nAfter model tokenization:\n"))
display_ner_html(tokens, labels, BIO_tags_list)


Before model tokenization:




After model tokenization:



You can see that the tokenizer divided some of the original words into subwords which get assigned an ID of -100 to be ignored by the model.  During training those tokens are ignored by the loss function and the outputs corresponding to those tokens are ignored during model evaluation.
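
If you'd like to trace the alignment rule by hand, here's a stripped-down sketch that applies the same logic as `tokenize_and_align_labels` to a single hard-coded example.  The `word_ids` list below is a hypothetical subword split (we made it up to keep the sketch free of any tokenizer dependency), so the real tokenizer may split "Fischler" differently.

```python
IGNORE_INDEX = -100  # label ID skipped by the loss and by our metrics


def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)           # later subword of same word
        previous_word_idx = word_idx
    return aligned


# Hypothetical tokenization: [CLS] Fi ##sch ##ler backed [SEP]
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [1, 0]  # B-PER for "Fischler", O for "backed"
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

The value -100 is not arbitrary: it is the default `ignore_index` of PyTorch's cross-entropy loss, which is why no extra configuration is needed for the loss to skip those positions.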

Before we fine-tune the model we define a custom metrics function that does two things:
1. Uses the `seqeval` package to evaluate entire entity spans (e.g., `B-LOC`, `I-LOC` forming `"New York"`) instead of evaluating individual labels as we'd do with the scikit-learn metrics.
2. Ignores the tokens with IDs of -100 for the evaluation metrics:

In [10]:
# Load seqeval metric
metric = evaluate.load("seqeval")

# Note if you have a different list of possible tags, you'll need to change the default value of label_list
def compute_metrics(p, label_list=BIO_tags_list):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_predictions, references=true_labels)
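
To see why span-level scoring is stricter than token-level scoring, here's a simplified, pure-Python sketch of the idea.  It is not how `seqeval` is implemented (its handling of malformed tag sequences may differ), and the helper names `bio_to_spans` and `entity_prf` are our own.

```python
def bio_to_spans(tags):
    """Collect (label, start, end) spans from a BIO sequence; end is exclusive."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if label is not None and (tag == "O" or starts_new):
            spans.add((label, start, i))
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Precision, recall, and F1 where a span counts only if it matches exactly."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-LOC", "I-LOC", "O", "B-PER"]   # e.g. "New York" ... person
pred = ["B-LOC", "O", "O", "B-PER"]       # model tagged only "New"
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = entity_prf(gold, pred)
print(token_accuracy, precision, recall, f1)
```

Three of the four tokens are tagged correctly, yet only one of the two entities was recovered in full, so the span-level scores are noticeably lower than the token accuracy.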


For the actual fine-tuning we use a similar setup to what we did for text classification:

In [None]:

# Training arguments
training_args = TrainingArguments(
    output_dir= MODELS_PATH / "distilbert-ner",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    seed=42,
    disable_tqdm=False,
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()


Epoch,Training Loss,Validation Loss,Loc,Misc,Org,Per,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.0534,0.056684,"{'precision': 0.9189046866771985, 'recall': 0.949918345127926, 'f1': 0.934154175588865, 'number': 1837}","{'precision': 0.8034744842562432, 'recall': 0.8026030368763557, 'f1': 0.8030385241454151, 'number': 922}","{'precision': 0.8747268754552076, 'recall': 0.8956002982848621, 'f1': 0.8850405305821666, 'number': 1341}","{'precision': 0.9753363228699552, 'recall': 0.9446254071661238, 'f1': 0.9597352454495311, 'number': 1842}",0.907813,0.913161,0.910479,0.98458
2,0.0162,0.04885,"{'precision': 0.9451907576571735, 'recall': 0.9575394665215025, 'f1': 0.9513250405624661, 'number': 1837}","{'precision': 0.8494623655913979, 'recall': 0.8568329718004338, 'f1': 0.8531317494600432, 'number': 922}","{'precision': 0.8949329359165424, 'recall': 0.8956002982848621, 'f1': 0.8952664927320164, 'number': 1341}","{'precision': 0.9627027027027028, 'recall': 0.9668838219326819, 'f1': 0.9647887323943664, 'number': 1842}",0.924453,0.930831,0.927631,0.987598
3,0.0159,0.046552,"{'precision': 0.959061135371179, 'recall': 0.9564507348938487, 'f1': 0.9577541564458981, 'number': 1837}","{'precision': 0.8449531737773153, 'recall': 0.8806941431670282, 'f1': 0.8624535315985129, 'number': 922}","{'precision': 0.894659839063643, 'recall': 0.9120059656972409, 'f1': 0.9032496307237814, 'number': 1341}","{'precision': 0.9678824169842134, 'recall': 0.9652551574375678, 'f1': 0.966567001902691, 'number': 1842}",0.928798,0.937395,0.933076,0.988552


TrainOutput(global_step=2634, training_loss=0.06429638274879554, metrics={'train_runtime': 302.6591, 'train_samples_per_second': 139.176, 'train_steps_per_second': 8.703, 'total_flos': 525319502290632.0, 'train_loss': 0.06429638274879554, 'epoch': 3.0})

In [12]:

# Evaluate on test set
results_BERT = trainer.evaluate(tokenized_datasets["test"])
print("\nTest set evaluation results:")
print(results_BERT)



Test set evaluation results:
{'eval_loss': 0.12091150879859924, 'eval_LOC': {'precision': 0.9082240762812872, 'recall': 0.9136690647482014, 'f1':
0.9109384339509861, 'number': 1668}, 'eval_MISC': {'precision': 0.7157622739018088, 'recall': 0.7891737891737892, 'f1':
0.7506775067750678, 'number': 702}, 'eval_ORG': {'precision': 0.8491555037856727, 'recall': 0.8777844671884407, 'f1':
0.8632326820603909, 'number': 1661}, 'eval_PER': {'precision': 0.9567126725219574, 'recall': 0.943104514533086, 'f1':
0.9498598567424478, 'number': 1617}, 'eval_overall_precision': 0.8781884435190005, 'eval_overall_recall':
0.8960694050991501, 'eval_overall_f1': 0.8870388221891157, 'eval_overall_accuracy': 0.977969204264025, 'eval_runtime':
6.8877, 'eval_samples_per_second': 501.332, 'eval_steps_per_second': 31.36, 'epoch': 3.0}


That's some ugly output!  Let's put it in a data frame with some formatting:

In [13]:
df_results_BERT = format_ner_eval_results(results_BERT)
display(df_results_BERT)

| | Entity | Precision | Recall | F1 | Number | Accuracy |
|---|---|---|---|---|---|---|
| 0 | LOC | 0.9082 | 0.9137 | 0.9109 | 1668.0 | |
| 1 | MISC | 0.7158 | 0.7892 | 0.7507 | 702.0 | |
| 2 | ORG | 0.8492 | 0.8778 | 0.8632 | 1661.0 | |
| 3 | PER | 0.9567 | 0.9431 | 0.9499 | 1617.0 | |
| 4 | Overall | 0.8782 | 0.8961 | 0.887 | | 0.978 |


That's better!  F1 is the harmonic mean of precision and recall, so it is only high when both are high.  We can see that the model does a great job of identifying people and is also good at identifying locations and organizations.  It doesn't do quite as well at identifying miscellaneous entities (but I'm not sure what those are supposed to be either ...).
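
To make the F1 column concrete, here's a short sketch that recomputes the MISC score from the precision and recall shown in the table.  The function name `f1_score` is our own and is not part of the notebook's helpers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


misc_f1 = f1_score(0.7158, 0.7892)   # MISC row of the results table
lopsided_f1 = f1_score(1.0, 0.1)     # perfect precision, poor recall
print(round(misc_f1, 4), round(lopsided_f1, 4))
```

The second call shows why F1 is a useful summary: a model that is very precise but misses most entities still gets a low score.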

You can also use the model to do inference by making predictions on new text.  The function below is also included in the helpers.py file, but we include it here so you can study it and see how it works:

In [14]:
def predict_ner_tags(text, model, tokenizer):
    """
    Tokenizes and predicts NER tags for the given text using a Hugging Face model.

    Args:
        text (str): Input sentence (e.g., "Barack Obama was born in Hawaii").
        model: A Hugging Face token classification model (e.g., DistilBERT).
        tokenizer: The tokenizer corresponding to the model.

    Returns:
        tokens (List[str]): Original word tokens from the input text.
        predicted_tag_ids (List[int]): One predicted tag index per word (subwords/specials skipped).
    """

    # Step 1: Split the input text into whitespace-separated words
    words = text.split()

    # Step 2: Tokenize the list of words and retain word alignment
    inputs = tokenizer(words, return_tensors="pt", is_split_into_words=True).to(model.device)

    # Step 3: Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Step 4: Convert logits to predicted label indices
    predictions = torch.argmax(outputs.logits, dim=2)[0].cpu().numpy()

    # Step 5: Get word IDs for each token
    word_ids = inputs.word_ids(batch_index=0)

    # Step 6: Extract one prediction per word (first subword only)
    predicted_tag_ids = []
    seen_words = set()
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx not in seen_words:
            predicted_tag_ids.append(int(predictions[token_idx]))
            seen_words.add(word_idx)
        # skip subwords and special tokens

    # Step 7: Return the original words and corresponding predicted tags
    return words, predicted_tag_ids


Now we'll apply it to some example text we copied from the internet (about GPT-4o's new image generation capability).  First we'll load the model from our last checkpoint so that we don't need to retrain the model to make predictions.  You may have to adjust the path to the checkpoint if you changed the training or save path.

In [15]:
# Load the tokenizer and model from the final checkpoint
final_checkpoint_path = MODELS_PATH / "distilbert-ner" / "checkpoint-2634"
tokenizer = AutoTokenizer.from_pretrained(final_checkpoint_path)
model = AutoModelForTokenClassification.from_pretrained(final_checkpoint_path)

Now we can run the model:

In [16]:
example_text = """
It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds are already flooded with AI-generated memes in the style of Studio Ghibli, the cult-favorite Japanese animation studio behind blockbuster films such as “My Neighbor Totoro” and “Spirited Away.”

In the last 24 hours, we’ve seen AI-generated images representing Studio Ghibli versions of Elon Musk, “The Lord of the Rings“, and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his new profile picture a Studio Ghibli-style image, presumably made with GPT-4o’s native image generator. Users seem to be uploading existing images and pictures into ChatGPT and asking the chatbot to re-create it in new styles.
"""

tokens, tags = predict_ner_tags(example_text, model, tokenizer)
print(tokens,tags)

['It’s', 'only', 'been', 'a', 'day', 'since', 'ChatGPT’s', 'new', 'AI', 'image', 'generator', 'went', 'live,', 'and',
'social', 'media', 'feeds', 'are', 'already', 'flooded', 'with', 'AI-generated', 'memes', 'in', 'the', 'style', 'of',
'Studio', 'Ghibli,', 'the', 'cult-favorite', 'Japanese', 'animation', 'studio', 'behind', 'blockbuster', 'films',
'such', 'as', '“My', 'Neighbor', 'Totoro”', 'and', '“Spirited', 'Away.”', 'In', 'the', 'last', '24', 'hours,', 'we’ve',
'seen', 'AI-generated', 'images', 'representing', 'Studio', 'Ghibli', 'versions', 'of', 'Elon', 'Musk,', '“The', 'Lord',
'of', 'the', 'Rings“,', 'and', 'President', 'Donald', 'Trump.', 'OpenAI', 'CEO', 'Sam', 'Altman', 'even', 'seems', 'to',
'have', 'made', 'his', 'new', 'profile', 'picture', 'a', 'Studio', 'Ghibli-style', 'image,', 'presumably', 'made',
'with', 'GPT-4o’s', 'native', 'image', 'generator.', 'Users', 'seem', 'to', 'be', 'uploading', 'existing', 'images',
'and', 'pictures', 'into', 'ChatGPT', 'and', 'asking', '

Of course, the raw output is difficult to interpret, but we can easily visualize it with `display_ner_html`:

In [17]:
display_ner_html(tokens, tags, BIO_tags_list)


That seems pretty amazing for entity recognition on text the model has never seen!

## NER by Zero-Shot LLM Prompting

#### L10_1_LLM_NER Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_llm_ner/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_llm_ner/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/EgBF1mreyjw" target="_blank">Open Descript version of video in new tab</a>

In this section we'll explore using LLMs for NER.  LLMs can do this quite well, but there are some differences to be aware of.  LLMs are naturally better at extracting spans (the relevant words for each identified entity) or structured output than at token-level labeling, because:

* They process text holistically, not token-by-token.
* There's no inherent token alignment.
* They can hallucinate or skip tokens when generating lists.
* The extracted spans may not exactly match the strings in the text, e.g. "ChatGPT's" gets extracted as "ChatGPT"

When we use an LLM to extract entities, we'll get lists of spans of each type.  You'll need to prompt carefully:
* try to get the LLM to extract the entities as they appear in the text
* you may need to provide examples or explanations of the entity types

When we evaluate the results, we won't be able to compare token by token as we did above for the output of our BERT model (that kind of evaluation is similar to evaluating semantic segmentation results where we can compare every pixel in the image to every pixel in the mask).  Instead we can just determine whether the LLM found each entity and whether it assigned the correct entity type.  It will help to use "fuzzy" matching, which doesn't require exact string matches, to account for misspellings and different presentations of words.
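
Here's a rough sketch of what fuzzy entity matching could look like using only the standard library.  The helper names, the 0.85 threshold, and the gold labels below are our own assumptions for illustration.  This is not the implementation of `evaluate_ner` in helpers.py.

```python
from difflib import SequenceMatcher


def normalize(span):
    """Lowercase, unify apostrophes, and trim surrounding punctuation."""
    return span.lower().replace("’", "'").strip(" .,\"'“”")


def is_fuzzy_match(a, b, threshold=0.85):
    """True if the normalized strings are similar enough (threshold is a guess)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def match_entities(predicted, gold, threshold=0.85):
    """Greedy one-to-one matching of (type, text) pairs; returns (tp, fp, fn)."""
    unmatched_gold = list(gold)
    tp = 0
    for ent_type, text in predicted:
        for candidate in unmatched_gold:
            if candidate[0] == ent_type and is_fuzzy_match(text, candidate[1], threshold):
                unmatched_gold.remove(candidate)
                tp += 1
                break
    return tp, len(predicted) - tp, len(unmatched_gold)


# Hypothetical gold and predicted entities for illustration
gold = [("ORG", "Studio Ghibli"), ("PER", "Elon Musk"), ("PER", "Sam Altman")]
predicted = [("ORG", "Studio Ghibli,"), ("PER", "Elon Musk"), ("PER", "ChatGPT")]
tp, fp, fn = match_entities(predicted, gold)
print(tp, fp, fn)
print(is_fuzzy_match("ChatGPT’s", "ChatGPT"))
```

From the counts you can compute precision as `tp / (tp + fp)` and recall as `tp / (tp + fn)`.  The threshold is a judgment call: set it too low and unrelated names start to match, set it too high and small differences like a possessive get penalized.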

**Note:**  It's possible to use an LLM to produce token-level tags for each token through a combination of careful prompting and post-processing, but we'll stick with the simpler problem of identifying entities without identifying their positions in the text which is adequate for many applications.
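
For the curious, here's a minimal sketch of the post-processing half of that idea: projecting extracted spans back onto whitespace tokens to recover BIO tags.  The function `spans_to_token_tags` is hypothetical, and its prefix matching is a crude assumption that tolerates trailing punctuation but can also over-match (for example, "Sam" would match "Samuel").

```python
def spans_to_token_tags(tokens, entities):
    """Tag tokens with BIO labels by locating each extracted span in the token list.

    entities maps an entity type to a list of span strings, like the JSON
    we ask the LLM to return.
    """
    tags = ["O"] * len(tokens)
    for ent_type, spans in entities.items():
        for span in spans:
            span_tokens = span.split()
            n = len(span_tokens)
            if n == 0:
                continue
            for start in range(len(tokens) - n + 1):
                window = tokens[start:start + n]
                # startswith() tolerates trailing punctuation such as "Ghibli,"
                text_matches = all(tok.startswith(s) for tok, s in zip(window, span_tokens))
                is_free = all(t == "O" for t in tags[start:start + n])
                if text_matches and is_free:
                    tags[start] = f"B-{ent_type}"
                    for i in range(start + 1, start + n):
                        tags[i] = f"I-{ent_type}"
    return tags


text = ("OpenAI CEO Sam Altman even seems to have made his new profile "
        "picture a Studio Ghibli-style image,")
tokens = text.split()
entities = {"PER": ["Sam Altman"], "ORG": ["OpenAI", "Studio Ghibli"]}
tags = spans_to_token_tags(tokens, entities)
print(list(zip(tokens, tags)))
```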

We'll use `llm_generate` as we've done previously.    Here's the list of models that are easy to use with `llm_generate`.  You can adjust the code below to use other models, or the Groq or Together.AI APIs.

In [None]:
llm_list_models();

Available models:
 llama-3p2-3B => HuggingFace: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit
 llama-3p1-8B => HuggingFace: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
 mistral-7B => HuggingFace: unsloth/mistral-7b-instruct-v0.3-bnb-4bit
 qwen-2p5-3B => HuggingFace: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
 qwen-2p5-7B => HuggingFace: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
 gemini-flash-lite => needs GEMINI_API_KEY
 gemini-flash => needs GEMINI_API_KEY
 gpt-4o => needs OPENAI_API_KEY
 gpt-4o-mini => needs OPENAI_API_KEY


### Using an LLM for NER - The Basics

We'll start by crafting a prompt and asking a local model to identify the CoNLL entities (PER, LOC, ORG, MISC) in the example text from the last section.  We're going to specify the entity types in the prompt and try to get the model to produce JSON output.  JSON is a text format that looks a lot like a Python dictionary.  Let's see what happens.

In [19]:
llm_config = llm_configure("llama-3p1-8B")

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Example Text
example_text = """It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds 
are already flooded with AI-generated memes in the style of Studio Ghibli, the cult-favorite 
Japanese animation studio behind blockbuster films such as “My Neighbor Totoro” and “Spirited Away.”

In the last 24 hours, we’ve seen AI-generated images representing Studio Ghibli versions of Elon Musk, 
“The Lord of the Rings“, and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his 
new profile picture a Studio Ghibli-style image, presumably made with GPT-4o’s native image generator. 
Users seem to be uploading existing images and pictures into ChatGPT and asking the chatbot to re-create 
it in new styles."""

# Prompt for CoNLL2003-style entity extraction
prompt = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: """ + example_text + " \nThe Entities JSON:"

response = llm_generate(llm_config, prompt, system_prompt = system_instruct, search_strategy='deterministic', remove_input_prompt=False)
print(response)

🚀 Loading model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit (this may take a while)...
🟢 Model unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit loaded successfully.

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant for named entity recognition. You return entity spans in JSON.user

Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess.
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds
are already flooded with AI-generated memes in the style of Studio

Note some things about the output:
* The input prompt is echoed in the output. Above we forced that by passing `remove_input_prompt=False`, but it can also happen unintentionally when our cleaning algorithm fails to detect the input prompt in the output.  You should generally use `remove_input_prompt=True`, or just leave it out since it defaults to `True`.
* It misidentified "ChatGPT" as a person.
* It also returned the span as "ChatGPT" instead of "ChatGPT's" as it occurs in the text.

We can fix the first issue by passing a `split_string` to `llm_generate` which will delete all the text up to the string.  We might be able to fix the second issue by providing examples (few-shot prompting) or more careful instructions to the LLM.  The third issue is why we'll need to use some inexact matching to match predicted spans with the input text.  

First let's see how to get rid of that input prompt if necessary.

In [20]:
response = llm_generate(llm_config, prompt, system_prompt = system_instruct, 
                        search_strategy='deterministic', split_string='JSON:assistant')
print(response)

{"PER":["ChatGPT", "Elon Musk", "Sam Altman", "Donald Trump"], "ORG":["Studio Ghibli", "OpenAI"], "LOC":[], "MISC":[]}


That's better.  You may not need the `split_string` with some LLMs (particularly the API-based LLMs), or you may need to adjust it for different models.
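
To make the idea concrete, here is a minimal sketch of what splitting on a marker string does. `drop_prompt` is a hypothetical stand-in, not the code inside `llm_generate`; in particular, splitting on the first occurrence of the marker and trimming whitespace are assumptions.

```python
def drop_prompt(raw_output: str, split_string: str) -> str:
    """Hypothetical stand-in for the split_string behavior described above."""
    # partition() splits on the FIRST occurrence of the marker (an assumption).
    _, found, tail = raw_output.partition(split_string)
    # If the marker is missing, return the text unchanged (apart from trimming).
    return tail.strip() if found else raw_output.strip()


raw = 'Text: ...\nThe Entities JSON:assistant\n\n{"PER": ["Sam Altman"]}'
print(drop_prompt(raw, "JSON:assistant"))
```

The marker `JSON:assistant` works for the Llama output above because the prompt ends with "The Entities JSON:" and the decoded chat template places the role name `assistant` right after it. A different model or prompt ending would need a different marker.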

Finally, the output is still a string, but we'd like to load that string as an actual dictionary. We can use `json.loads` to load the JSON formatted string as a dictionary in Python.  Some LLMs, like Gemini, will return the output with Markdown formatting like this:

<pre>
```json
{"PER":["ChatGPT", "Elon Musk", "Sam Altman", "Donald Trump"], "ORG":["Studio Ghibli", "OpenAI"], "LOC":[], "MISC":[]}
```
</pre>

So we may need to strip those extra characters away before using `json.loads`.  Here's a little function to do both of those things.  It's also in helpers.py:

In [21]:
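import json

def json_extractor(text):
    # Extract the JSON object from the response
    try:
        # Trim whitespace first, then remove an optional Markdown code fence.
        # Note: str.strip("```json") strips a *set of characters*, not a prefix,
        # and breaks if the response ends with a newline after the fence.
        text = text.strip()
        text = text.removeprefix("```json").removeprefix("```")
        text = text.removesuffix("```").strip()
        json_object = json.loads(text)
    except json.JSONDecodeError:
        json_object = {"error": "Could not parse JSON"}
    return json_object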

Now, to see it in action:

In [22]:
entities = json_extractor(response)
entities

{'PER': ['ChatGPT', 'Elon Musk', 'Sam Altman', 'Donald Trump'],
 'ORG': ['Studio Ghibli', 'OpenAI'],
 'LOC': [],
 'MISC': []}
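
Some models wrap the JSON in extra commentary as well as a code fence. As a sketch of a more tolerant approach (the name `extract_first_json_object` is hypothetical and is not part of `helpers.py`), we can look for the first `{` and let the standard library's `JSONDecoder.raw_decode` parse a single object from there, ignoring whatever follows it.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Hypothetical helper: parse the first {...} object found in text."""
    start = text.find("{")
    if start == -1:
        return {"error": "Could not parse JSON"}
    try:
        # raw_decode parses one JSON value beginning at index `start` and
        # ignores anything after it (closing fences, commentary, ...).
        obj, _ = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return {"error": "Could not parse JSON"}
    return obj


wrapped = 'Sure! Here you go:\n```json\n{"PER": ["Elon Musk"], "ORG": []}\n```\n'
print(extract_first_json_object(wrapped))
```

This only handles the case where the first `{` really starts the answer, so it is a fallback idea rather than a replacement for a well-behaved prompt.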

Now if we review our example_text with the tags predicted by our BERT model:

In [23]:
# you must have run the BERT model above on the example_text
display_ner_html(tokens, tags, BIO_tags_list)


Our LLM approach didn't tag any "MISC" entities, but the BERT model identified a bunch.  The BERT model also incorrectly identified "Elon" as a "MISC" entity instead of identifying "Elon Musk" as a "PER".  We may be able to improve our prompt by providing examples of "MISC" or by giving better instructions.  We'll leave that as a homework exercise.

### Using an LLM for NER - Streamlining the Process

Similar to the way we made `llm_text_classifier` for text classification, we'll put our pipeline together here in a single function that expects us to input a list of texts to be tagged and outputs a list of entity dictionaries.  

If you have to do a lot of this sort of work, you should explore [LangChain](https://www.langchain.com/), which is an ecosystem of tools for developing applications powered by LLMs.  If you're curious, check out the [documentation here](https://python.langchain.com/docs/introduction/).  Look at the tutorial for text classification to see how it compares to what we did in Lesson 8.

In [24]:
def llm_ner_extractor(llm_config,
                      texts,
                      system_prompt,
                      prompt_template,
                      batch_size=1,
                      estimate_cost=False,
                      rate_limit=None,
                      split_string=None,
                      return_raw=False):
    """
    Extract named entities using a Large Language Model (LLM) in zero-shot fashion.

    Args:
        llm_config (ModelConfig): Configuration for the LLM.
        texts (list of str): List of input texts to process.
        system_prompt (str): System prompt guiding the LLM behavior.
        prompt_template (str): Template to construct the user prompt for each text.
        batch_size (int, optional): Batch size for local LLMs. Defaults to 1.
        estimate_cost (bool, optional): Estimate LLM cost. Defaults to False.
        rate_limit (int, optional): Throttle requests for API models. Defaults to None.
        split_string (str, optional): String to split the LLM output. Defaults to None.
        return_raw (bool, optional): Whether to return raw LLM outputs. Defaults to False.

    Returns:
        list of dict: List of JSON objects containing extracted entities for each input text.
    """

    # Step 1: Create user prompts by formatting the prompt template with each input text.
    # This ensures that each text is passed to the LLM with the same structure.
    user_prompts = [prompt_template.format(text=text) for text in texts]

    # Step 2: Generate raw outputs from the LLM using the provided configuration and prompts.
    # The `llm_generate` function sends the prompts to the LLM and retrieves the responses.
    raw_outputs = llm_generate(llm_config,
                               user_prompts,
                               system_prompt=system_prompt,
                               search_strategy='deterministic',  # Ensures consistent outputs.
                               batch_size=batch_size,  # Number of prompts processed at once.
                               estimate_cost=estimate_cost,  # Optionally estimate the cost of LLM usage.
                               rate_limit=rate_limit,  # Throttle requests if needed.
                               split_string=split_string)  # Optionally split the output.

    # Step 3: If the user wants raw outputs, return them directly without further processing.
    if return_raw:
        return raw_outputs

    # Step 4: Process the raw outputs to extract JSON objects.
    # Initialize an empty list to store the processed JSON outputs.
    json_outputs = []
    for output in raw_outputs:
        try:
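            # Step 4.1: Clean the output by removing any extra formatting (e.g., Markdown code blocks).
            # Trim whitespace first; str.strip("```json") strips a set of characters, not a prefix.
            output = output.strip()
            output = output.removeprefix("```json").removeprefix("```").removesuffix("```").strip()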

            # Step 4.2: Parse the cleaned output as a JSON object and append it to the list.
            json_outputs.append(json.loads(output))
        except json.JSONDecodeError:
            # Step 4.3: If parsing fails, append an error message along with the raw output.
            json_outputs.append({"Error": "Could not parse JSON", "raw_output": output})

    # Step 5: Return the list of JSON objects containing the extracted entities.
    return json_outputs
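
Step 1 relies on Python's `str.format`, which is why a prompt template has to write literal braces as `{{` and `}}`: single braces are reserved for placeholders such as `{text}`. The short sketch below uses a cut-down template (not the full prompt) to show what the LLM actually receives.

```python
# A cut-down template: doubled braces are literal, {text} is the placeholder.
template = """Return the result as a JSON object in the format:
{{
  "PER": [...]
}}

Text: {text}"""

# After formatting, the doubled braces collapse to single braces.
filled = template.format(text="LONDON 1996-08-30")
print(filled)
```

A string that never goes through `.format` keeps its doubled braces exactly as written, so the escaping is only needed in templates.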

Now we'll apply `llm_ner_extractor` to the first 100 texts in the validation set to extract the entity dictionaries.  We'll use the `gemini-flash-lite` model which requires a Gemini API key.  If you're using the free version you may need to pass `rate_limit=30` to `llm_ner_extractor`.

In [26]:
llm_config = llm_configure('gemini-flash-lite')

# Extract N examples from the validation split of CoNLL2003
N = 100
subset = dataset["validation"].select(range(N))

texts = [' '.join(tokens) for tokens in subset["tokens"]] # Convert tokens to text

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Prompt template adapted for CoNLL2003-style entity extraction.  
# You must keep {text} in the template for the text to be inserted.
prompt_template = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: {text}
The Entities JSON:
"""

# Used to split off the assistant output from the JSON (if needed)
split_string = "JSON:assistant"

# Call the LLM-based NER extractor
predicted_entities = llm_ner_extractor(
    llm_config,
    texts,
    system_instruct,
    prompt_template,
    batch_size=10,
    estimate_cost=False,
    rate_limit=None,
    split_string=split_string
)

# Display the first few predictions for inspection
for i, text in enumerate(texts[:10]):
    print(f"Text: {text}")
    print("The Entities JSON:")
    print(predicted_entities[i])
    print("\n")


Text: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
The Entities JSON:
{'PER': [], 'ORG': ['LEICESTERSHIRE'], 'LOC': [], 'MISC': []}


Text: LONDON 1996-08-30
The Entities JSON:
{'PER': [], 'ORG': [], 'LOC': ['LONDON'], 'MISC': []}


Text: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and
39 runs in two days to take over at the head of the county championship .
The Entities JSON:
{'PER': ['Phil Simmons'], 'ORG': ['Leicestershire', 'Somerset'], 'LOC': [], 'MISC': []}


Text: Their stay on top , though , may be short-lived as title rivals Essex , Derbyshire and Surrey all closed in on
victory while Kent made up for lost time in their rain-affected match against Nottinghamshire .
The Entities JSON:
{'PER': [], 'ORG': ['Essex', 'Derbyshire', 'Surrey', 'Kent', 'Nottinghamshire'], 'LOC': [], 'MISC': []}


Text: After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended th

Let's look to see if there were any problems extracting JSON from the LLM output.  We can count the number of output dictionaries that include 'Error' as a key:

In [28]:
error_count = sum(1 for prediction in predicted_entities if 'Error' in prediction)
print(f"Number of dictionaries with 'Error' as a key: {error_count}")

Number of dictionaries with 'Error' as a key: 0


Great.  We were able to successfully extract JSON from every response.  Let's now evaluate the performance.  Since we're not comparing tags token-by-token what we'll do is:

1.  Use the token-by-token tags in the dataset to compute an entity dictionary for each input text.

2.  Compare the predicted entity dictionary to the "gold" entity dictionary for each example using fuzzy matching (inexact string matches).  In the context of NER, the ground-truth labels are sometimes called the "gold" labels.

You can learn more about fuzzy string matching and the RapidFuzz package in the [RapidFuzz Documentation](https://rapidfuzz.github.io/RapidFuzz/).
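
To get a feel for fuzzy matching without installing anything, here is a small sketch that swaps in the standard library's `difflib.SequenceMatcher` for RapidFuzz. The helper names and the 0.8 threshold are assumptions made for illustration; they are not what `evaluate_ner` uses. RapidFuzz provides comparable scorers (for example `fuzz.ratio`, which reports similarity on a 0 to 100 scale).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib stand-in for RapidFuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(pred: str, gold_list, threshold: float = 0.8):
    """Return the closest gold string, or None if nothing clears the threshold."""
    best = max(gold_list, key=lambda g: similarity(pred, g), default=None)
    if best is not None and similarity(pred, best) >= threshold:
        return best
    return None


# The LLM returned "ChatGPT" although the text says "ChatGPT’s".
print(fuzzy_match("ChatGPT", ["ChatGPT’s", "OpenAI"]))
# Two different county names should not be treated as a match.
print(fuzzy_match("Somerset", ["Leicestershire"]))
```

The threshold is the main design choice: set it too low and distinct entities start matching each other, too high and small differences such as a trailing possessive count as misses.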

We built a helper function called `extract_gold_entities` which takes an example from our dataset and extracts the "gold" dictionary.  To illustrate, here's one example from the validation set:



In [29]:
subset[2]

{'id': '2',
 'document_id': 1,
 'sentence_id': 2,
 'tokens': ['West',
  'Indian',
  'all-rounder',
  'Phil',
  'Simmons',
  'took',
  'four',
  'for',
  '38',
  'on',
  'Friday',
  'as',
  'Leicestershire',
  'beat',
  'Somerset',
  'by',
  'an',
  'innings',
  'and',
  '39',
  'runs',
  'in',
  'two',
  'days',
  'to',
  'take',
  'over',
  'at',
  'the',
  'head',
  'of',
  'the',
  'county',
  'championship',
  '.'],
 'ner_tags': [7,
  8,
  0,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  0,
  3,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

Here are the extracted gold (ground-truth) entities:

In [30]:
gold_entities = extract_gold_entities(subset[2], BIO_tags_list)
gold_entities

{'MISC': ['West Indian'],
 'PER': ['Phil Simmons'],
 'ORG': ['Leicestershire', 'Somerset']}
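
For intuition, here is a rough sketch of how BIO tags can be grouped into an entity dictionary. `bio_to_entities` is a hypothetical illustration, not the source of `extract_gold_entities`, and the label order in `BIO_TAGS` is an assumption (the usual CoNLL2003 encoding); the notebook's own `BIO_tags_list` is defined earlier.

```python
from collections import defaultdict

# Assumed label order; the notebook's BIO_tags_list may differ.
BIO_TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
            "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def bio_to_entities(tokens, tag_ids, tag_names=BIO_TAGS):
    """Hypothetical sketch: group BIO-tagged tokens into {type: [span, ...]}."""
    entities = defaultdict(list)
    span_type, span_tokens = None, []
    tags = [tag_names[i] for i in tag_ids]
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for token, tag in zip(list(tokens) + [""], tags + ["O"]):
        if tag.startswith("I-") and tag[2:] == span_type:
            span_tokens.append(token)  # continue the open span
            continue
        if span_tokens:  # close the span that just ended
            entities[span_type].append(" ".join(span_tokens))
        if tag == "O":
            span_type, span_tokens = None, []
        else:  # "B-X", or a stray "I-X" that starts a new span
            span_type, span_tokens = tag[2:], [token]
    return dict(entities)


# The first 15 tokens and tags of the validation example shown above.
tokens = ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four",
          "for", "38", "on", "Friday", "as", "Leicestershire", "beat", "Somerset"]
tag_ids = [7, 8, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3]
print(bio_to_entities(tokens, tag_ids))
```

Under the assumed label order this reproduces the gold dictionary above: a `B-` tag opens a span, matching `I-` tags extend it, and anything else closes it.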

And here are the predicted entities from our LLM:

In [31]:
predicted_entities[2]

{'PER': ['Phil Simmons'],
 'ORG': ['Leicestershire', 'Somerset'],
 'LOC': [],
 'MISC': []}

### Computing the Performance Metrics

To evaluate Named Entity Recognition (NER), we compare the entities predicted by the model with the **gold (true)** entities from the dataset.

We compute the following metrics **for each entity type** (e.g., PER, LOC, ORG):

- **Precision** = Correct predictions / All predictions  
- **Recall** = Correct predictions / All gold (true) entities  
- **F1 score** = Harmonic mean of precision and recall  
- **Accuracy** = Correct predictions / (Correct + Wrong + Missed predictions)
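
The formulas above can be sketched in a few lines. `prf_accuracy` is a hypothetical helper, not the implementation of `evaluate_ner`, and the counts in the example are illustrative: they were chosen to be consistent with the Overall row of the LLM results table shown later and are inferred, not reported by `evaluate_ner`.

```python
def prf_accuracy(tp: int, fp: int, fn: int) -> dict:
    """tp = correct predictions, fp = wrong predictions, fn = missed gold entities."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": round(precision, 4), "recall": round(recall, 4),
            "f1": round(f1, 4), "accuracy": round(accuracy, 4)}


# Illustrative counts (inferred): 160 correct, 22 wrong, 38 missed.
print(prf_accuracy(tp=160, fp=22, fn=38))
```

Note that this entity-level accuracy has no notion of a "true negative", which is why it is always at most as large as both precision and recall.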

We include the function `evaluate_ner` in `helpers.py` to do the computations.  It's imported above.  The next cell shows how to use it, assuming `subset` is the set of examples from above and `predicted_entities` holds the entity dictionaries our LLM NER model produced for them.

In [32]:
# Extract gold entities
gold_entities = [extract_gold_entities(ex, BIO_tags_list) for ex in subset]

# Evaluate
results_llm = evaluate_ner(predicted_entities, gold_entities, labels = ["PER", "ORG", "LOC", "MISC"])

# Format the evaluation results
df_results_llm = format_ner_eval_results(results_llm)
display(df_results_llm)


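| | Entity | Precision | Recall | F1 | Number | Accuracy |
|---|---|---|---|---|---|---|
| 0 | PER | 1.0 | 1.0 | 1.0 | 58.0 | |
| 1 | ORG | 0.8154 | 0.7571 | 0.7852 | 70.0 | |
| 2 | LOC | 0.8393 | 0.8103 | 0.8246 | 58.0 | |
| 3 | MISC | 0.6667 | 0.1667 | 0.2667 | 12.0 | |
| 4 | Overall | 0.8791 | 0.8081 | 0.8421 | | 0.7273 |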


The LLM NER results are terrific for people and pretty good for locations and organizations, but the model finds only about 17% of the true MISC entities in the texts.  Maybe you can get it to work better by providing examples of MISC entities and additional instructions in the prompt.

Here are the results from the BERT model (applied to the whole test set) for comparison:

In [33]:
display(df_results_BERT)

| | Entity | Precision | Recall | F1 | Number | Accuracy |
|---|---|---|---|---|---|---|
| 0 | LOC | 0.9082 | 0.9137 | 0.9109 | 1668.0 | |
| 1 | MISC | 0.7158 | 0.7892 | 0.7507 | 702.0 | |
| 2 | ORG | 0.8492 | 0.8778 | 0.8632 | 1661.0 | |
| 3 | PER | 0.9567 | 0.9431 | 0.9499 | 1617.0 | |
| 4 | Overall | 0.8782 | 0.8961 | 0.887 | | 0.978 |


**Note:** The accuracies are very different, in part because they're computed differently.  In the case of the BERT model we are able to include all the tokens tagged as 'O' (outside any entity), which is most of the tokens.  This inflates the accuracy, just like computing accuracy for a segmentation model in which most of the pixels are background.
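
A quick sketch makes the inflation visible. It takes the `ner_tags` of the validation example shown earlier and scores a deliberately useless "model" that tags every token as 'O'; the variable names are only for illustration.

```python
# ner_tags of the validation example shown earlier (0 is the 'O' tag).
gold = [7, 8, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3] + [0] * 20

# A "model" that never predicts an entity.
pred = [0] * len(gold)

# Token-level accuracy counts every correctly predicted 'O' as a success.
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Entity-level view: how many entity tokens did the model recover?
entity_tokens = sum(g != 0 for g in gold)
entity_tokens_found = sum(g == p for g, p in zip(gold, pred) if g != 0)

print(f"token accuracy: {token_accuracy:.4f}")
print(f"entity tokens found: {entity_tokens_found} of {entity_tokens}")
```

The all-'O' predictor gets most tokens right while finding none of the entities, which is why precision, recall, and F1 are the numbers to compare between the two approaches.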