<a href="https://colab.research.google.com/github/pelagios/llm-lod-enriching-heritage/blob/main/notebooks/tasks/ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Named Entity Recognition

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that involves identifying and classifying named entities (like people, places, organizations) within text. For example, in the sentence "Shakespeare wrote Romeo and Juliet in London", a NER system would identify "Shakespeare" as a person, "Romeo and Juliet" as a work of art, and "London" as a location. NER is crucial for extracting structured information from unstructured text, making it valuable for tasks like information retrieval, question answering, and metadata enrichment. In this notebook, we'll explore how to perform NER using both traditional NLP approaches and modern Large Language Models.

### Rationale

This notebook demonstrates how to use OpenAI's GPT models to perform Named Entity Recognition (NER) by converting input text into annotated markdown format. Rather than using traditional NLP libraries, we leverage a Large Language Model's natural language understanding capabilities to identify and classify named entities. The notebook takes plain text as input and outputs markdown where entities are annotated in the format [Entity](TYPE), such as [London](LOCATION). This approach showcases how LLMs can be used for structured information extraction tasks in cultural heritage metadata enrichment.


### Process Overview

The process consists of the following steps:

1. **Import required software libraries**: We start with importing required software libraries
2. **Read the text that requires processing**: Next we obtain the input text from the preprocessing notebook
3. **Named entity recognition**: The text is sent to GPT with a prompt that instructs it to identify entities. The LLM marks entities in markdown format: \[entity text\]\(entity type\)
4. **Named entity visualization**: The annotated text is displayed with colour-coded entity highlighting
5. **Save results**: Save the results of named entity recognition for future processing

This approach leverages the LLM's natural language understanding while producing structured, machine-readable output.

### Dependencies

This notebook depends on three files:

* utils.py: helper functions
* output_data_preparation_11f98441067263d80ee1a6bac27babf0f2c6734b.json: output file of data preparation task
* ner_cache.json: context-dependent cache of names found earlier

Please make sure they are available in this folder so that the notebook can run smoothly. You can download them from [GitHub](https://github.com/pelagios/llm-lod-enriching-heritage/tree/main/notebooks/tasks).

## 2.1. Import required software libraries

This notebook requires some standard software libraries. Importing them may take some time on the first run, but successive runs will be a lot faster.

We start with checking if the notebook is running on Google Colab. If that is the case, we need to connect to the notebook's environment.

In [None]:
import os

def check_notebook_environment_on_colab():
    """Test if run on Colab, if so test if environment is available, if not install it"""
    try:
        from google.colab import files  # only importable on Google Colab
    except ImportError:
        print("Not running on Google Colab")
        return
    try:
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")
        print("Found notebook environment")
    except FileNotFoundError:
        print("notebook environment not found, installing...")
        !git clone https://github.com/pelagios/llm-lod-enriching-heritage.git
        os.chdir("/content/llm-lod-enriching-heritage/notebooks/tasks")

check_notebook_environment_on_colab()

Next we import the libraries that should already be available in the notebook environment.

In [None]:
from dotenv import load_dotenv
import hashlib
import importlib
from IPython.display import clear_output, HTML
import json
import os
import regex
import subprocess
import time
from typing import List, Dict, Any, Tuple, Optional
import utils

Next we import packages which may require installation on this device

In [None]:
openai = utils.safe_import("openai")
spacy = utils.safe_import("spacy")

We set settings required for Google Colab

In [None]:
in_colab = utils.check_google_colab()

The following function is needed in several sections, so we define it here.

In [None]:
CONTEXT = """extracted from records of objects in the collection of
             the Egyptian museum, Torino, the Museo Egizio – Torino"""

def make_prompt(texts_input, target_labels):
    """Create an LLM prompt, given a text and target labels and return it"""
    return f"""
Convert the following text {CONTEXT} into a structured markdown format,
where you annotate the entities in the text in the following format:
[Tom](PERSON) went to [New York](PLACE).

Look for the following entity types:
{target_labels}

Do this for the following text:
{texts_input}

Only return the markdown output, nothing else.
"""

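To see what the model actually receives, it can help to print a prompt for a short sample. The sketch below is a simplified stand-in for `make_prompt`, not the notebook's own function: the helper name `build_prompt_sketch`, the sample sentence, the default context string, and the choice to join the labels with commas (instead of interpolating the Python list directly) are all assumptions made for illustration.

```python
def build_prompt_sketch(text, target_labels, context="extracted from museum records"):
    """Build a NER prompt; labels are joined into a plain comma-separated string."""
    labels = ", ".join(target_labels)
    return (
        f"Convert the following text {context} into a structured markdown format,\n"
        "where you annotate the entities in the text in the following format:\n"
        "[Tom](PERSON) went to [New York](LOCATION).\n\n"
        f"Look for the following entity types:\n{labels}\n\n"
        f"Do this for the following text:\n{text}\n\n"
        "Only return the markdown output, nothing else.\n"
    )

# Hypothetical sample sentence, used only to inspect the resulting prompt
sample_prompt = build_prompt_sketch("Statue found in Thebes by Schiaparelli.",
                                    ["PERSON", "LOCATION"])
print(sample_prompt)
```

Keeping the label in the example sentence identical to one of the target labels may reduce the chance that the model invents a label that is not in the list.
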
## 2.2. Read the text that requires processing

The text should have been preprocessed by the `data_preparation.ipynb` notebook. The file read here is an output file of that notebook. We select the first text and show it.

In [None]:
infile_name = "output_data_preparation_11f98441067263d80ee1a6bac27babf0f2c6734b.json"


# The with statement closes the file automatically
with open(infile_name, "r", encoding="utf-8") as infile:
    input_data = json.load(infile)

texts_input = [text["text_cleaned"] for text in input_data]
print(texts_input[0])

## 2.3. Named entity recognition with ChatGPT

We use OpenAI's ChatGPT for recognizing named entities in the text. For this approach you need an OpenAI API key. Store it in the environment variable OPENAI_API_KEY or in the file OPENAI_API_KEY. If you are working in Google Colab, you may also store the key among the Secrets, using the name OPENAI_API_KEY.

For alternative approaches to named entity recognition that do not require the OPENAI_API_KEY, see section 2.6. If you just want to process the example notebook data from the Egyptian Museum, you can proceed here without providing the OPENAI_API_KEY: this data has already been processed, and the associated names will be fetched from the file `ner_cache.json`.
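
The lookup order described above (environment variable first, then a file of the same name) could be sketched as follows. This is only an illustration under assumptions: the function name `find_api_key` is hypothetical, the actual behaviour of `utils.get_openai_api_key` may differ, and Colab Secrets are not covered here.

```python
import os
from pathlib import Path

def find_api_key(name="OPENAI_API_KEY", folder="."):
    """Return the key from the environment, else from a file called `name`, else None."""
    key = os.environ.get(name, "").strip()
    if key:
        return key
    key_file = Path(folder) / name
    if key_file.is_file():
        # An empty file counts as "no key"
        return key_file.read_text(encoding="utf-8").strip() or None
    return None

key = find_api_key()
print("Key found" if key else "No key found; only cached texts can be processed")
```
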

First we define the model settings and a helper function.

In [None]:
model = "gpt-4o-mini"
target_labels=["PERSON", "LOCATION"]
NER_CACHE_FILE = "ner_cache.json"

def openai_ner(model, texts_input):
    """Annotate the texts with GPT, reusing cached results, and return the markdown outputs"""
    ner_cache = utils.read_json_file(NER_CACHE_FILE)
    openai_client = None  # only connect when a text is missing from the cache
    texts_output = []
    for index, text in enumerate(texts_input):
        if text in ner_cache and model in ner_cache[text]:
            utils.squeal(f"Retrieving entities for text {index + 1} from cache")
            texts_output.append(ner_cache[text][model])
        else:
            if openai_client is None:
                openai_api_key = utils.get_openai_api_key()
                openai_client = utils.connect_to_openai(openai_api_key)
            utils.squeal(f"Processing text {index + 1} with GPT")
            prompt = make_prompt(text, target_labels)
            gpt_response = utils.process_text_with_gpt(openai_client, model, prompt)
            texts_output.append(gpt_response)
            if text not in ner_cache:
                ner_cache[text] = {}
            ner_cache[text][model] = gpt_response
            # Write the cache after every new response so that no work is lost
            utils.write_json_file(NER_CACHE_FILE, ner_cache)
    print("Finished processing")
    utils.save_data_to_json_file(ner_cache, file_name=NER_CACHE_FILE, in_colab=in_colab)
    return texts_output
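
The cache used here is a nested dictionary keyed first by the input text and then by the model name, so the same text can hold results from several models. The following minimal sketch shows that structure in isolation; `annotate_with_cache` and `fake_annotate` are hypothetical names, and the fake annotator merely stands in for the LLM call.

```python
def annotate_with_cache(texts, model, annotate, cache):
    """Return annotations for texts, calling `annotate` only on cache misses."""
    outputs = []
    for text in texts:
        per_model = cache.setdefault(text, {})  # cache[text][model] -> annotation
        if model not in per_model:
            per_model[model] = annotate(text)
        outputs.append(per_model[model])
    return outputs

calls = []

def fake_annotate(text):
    """Stand-in for the LLM call; records how often it is invoked."""
    calls.append(text)
    return text.replace("Torino", "[Torino](LOCATION)")

cache = {}
first = annotate_with_cache(["Museum in Torino"], "demo-model", fake_annotate, cache)
second = annotate_with_cache(["Museum in Torino"], "demo-model", fake_annotate, cache)
print(first, len(calls))
```

The second call is answered from the cache, so the stand-in annotator runs only once.
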

Next, we use the function to connect to GPT, send it the texts and collect the results, which are shown in markdown format. Lower the value of the `MAX_PROCESSED` variable if you do not want all texts to be processed by the LLM.

In [None]:
MAX_PROCESSED = 100

texts_output = openai_ner(model, texts_input[:MAX_PROCESSED])
print(texts_output[0])

## 2.4. Named entity visualization

Here we show the results of named entity recognition in a more readable format.

First we define two helper functions.

In [None]:
def extract_entities_from_markdown(texts_output):
    """Extract the locations and labels of the entities from the markdown and return these"""
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    text_prefix = ""
    current_char = 0
    entities = []
    for match in regex.finditer(pattern, texts_output):
        text_prefix += texts_output[current_char: match.start()]
        entities.append({"text": match.group(1),
                         "label": match.group(2),
                         "start_char": len(text_prefix),
                         "end_char": len(text_prefix) + len(match.group(1))})
        text_prefix += match.group(1)
        current_char = match.end()
    text_prefix += texts_output[current_char:]
    return entities, text_prefix
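
The offsets produced by this step refer to the plain text, that is, the text with the markdown annotation removed. The self-contained sketch below illustrates this on the sentence from the prompt; it uses the standard `re` module and the hypothetical name `strip_annotations`, and is a simplified variant for illustration rather than the function above.

```python
import re

ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def strip_annotations(markdown):
    """Return (entities, plain_text) with offsets relative to plain_text."""
    entities, parts, plain_length, cursor = [], [], 0, 0
    for match in ENTITY_PATTERN.finditer(markdown):
        before = markdown[cursor:match.start()]  # unannotated text before the entity
        parts.append(before)
        plain_length += len(before)
        name, label = match.group(1), match.group(2)
        entities.append({"text": name, "label": label,
                         "start_char": plain_length,
                         "end_char": plain_length + len(name)})
        parts.append(name)
        plain_length += len(name)
        cursor = match.end()
    parts.append(markdown[cursor:])  # remaining text after the last entity
    return entities, "".join(parts)

entities, plain = strip_annotations("[Tom](PERSON) went to [New York](LOCATION).")
for entity in entities:
    # Slicing the plain text with the offsets gives back the entity text
    print(entity["label"], plain[entity["start_char"]:entity["end_char"]])
```
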

In [None]:
def check_llm_output_text(text_llm_output, texts_input):
    if text_llm_output != texts_input:
        print(f"{utils.CHAR_FAILURE} Output text of named entity recognition is different from input text!")
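
When the check fails, it is often useful to know where the LLM changed the text. One possible way to find out, sketched here with the standard library's `difflib`, is to list the fragments that differ; the function name `describe_differences` and the sample strings are assumptions for illustration.

```python
import difflib

def describe_differences(original, llm_text):
    """Return a list of (operation, original_fragment, llm_fragment) for every change."""
    matcher = difflib.SequenceMatcher(None, original, llm_text, autojunk=False)
    return [(tag, original[i1:i2], llm_text[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Hypothetical case: the LLM dropped a comma while annotating
differences = describe_differences("Museo Egizio, Torino", "Museo Egizio Torino")
print(differences)
```

An empty list means the two texts are identical, so the entity offsets can safely be applied to the original input.
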

Next we call the function to extract the entities from all results. For the first five results we also check the output text and visualize the entities.

In [None]:
entities_all = []
for index, text in enumerate(texts_output):
    entities, text_llm_output = extract_entities_from_markdown(text)
    entities_all.append({"entities": entities, "text_llm_output": text_llm_output})
    if index < 5:
        check_llm_output_text(text_llm_output, texts_input[index])
        display(HTML(utils.mark_entities_in_text(text_llm_output, entities)))
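
The colour-coded display relies on `utils.mark_entities_in_text`, whose implementation is not shown in this notebook. As a rough idea of how such highlighting can work, the sketch below wraps each entity span in a coloured `<mark>` tag. The function name `highlight_entities` and the colour palette are assumptions, and the real helper may produce different HTML.

```python
import html

COLOURS = {"PERSON": "#ffd966", "LOCATION": "#9fc5e8"}  # assumed palette

def highlight_entities(text, entities):
    """Wrap each entity span in a coloured <mark> tag and return an HTML string."""
    pieces, cursor = [], 0
    for entity in sorted(entities, key=lambda e: e["start_char"]):
        # Escape the text between entities so that characters like & stay visible
        pieces.append(html.escape(text[cursor:entity["start_char"]]))
        label = html.escape(entity["label"])
        colour = COLOURS.get(entity["label"], "#dddddd")
        span = html.escape(text[entity["start_char"]:entity["end_char"]])
        pieces.append(f'<mark style="background:{colour}" title="{label}">{span}</mark>')
        cursor = entity["end_char"]
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)

demo_html = highlight_entities(
    "Tom & Ann",
    [{"text": "Tom", "label": "PERSON", "start_char": 0, "end_char": 3}])
print(demo_html)
```

In a notebook, the returned string can be passed to `HTML(...)` from `IPython.display` to render it.
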

## 2.5. Save results

Just like in the data preprocessing notebook, we save the results in a json file. We first add the named entity analysis to the input data. Texts that were not processed get an empty analysis.

In [None]:
for index, entities in enumerate(entities_all):
    input_data[index]["entities"] = entities["entities"]
    input_data[index]["text_llm_output"] = entities["text_llm_output"]
# Texts beyond the processed ones get empty results; start after the last
# processed text so that its entities are not overwritten
for rest_index in range(len(entities_all), len(input_data)):
    input_data[rest_index]["entities"] = []
    input_data[rest_index]["text_llm_output"] = []
utils.save_data_to_json_file(input_data, file_name="output_ner.json", in_colab=in_colab)

## 2.6. Alternatives for named entity recognition

Here we define some alternatives for recognising named entities, for example if you do not have an OpenAI API key or if you do not want to share your data with OpenAI.

### 2.6.1. Named entity recognition with spaCy

This named entity recognizer runs on your own computer and does not require an access key.

First, we define two helper functions.

In [None]:
LANG_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
    "pt": "pt_core_news_sm",
    "it": "it_core_news_sm",
    "nl": "nl_core_news_sm",
    "xx": "xx_ent_wiki_sm"
}

def load_spacy_model(language_id):
    """Load the spaCy model for the current language and return it"""
    if language_id not in LANG_MODELS:
        print(f"{utils.CHAR_FAILURE} warning: unknown language {language_id}. Switching to multilingual model...")
        language_id = "xx"
    try:
        nlp = spacy.load(LANG_MODELS[language_id])
    except OSError:
        # spacy.load raises OSError when the model is not installed
        try:
            print(f"Model {LANG_MODELS[language_id]} not found, trying to download it...")
            spacy.cli.download(LANG_MODELS[language_id])
            nlp = spacy.load(LANG_MODELS[language_id])
        except (OSError, SystemExit) as error:
            # A failed download may exit instead of raising a regular exception
            raise RuntimeError(f"{utils.CHAR_FAILURE} Cannot find language model {LANG_MODELS[language_id]}!") from error
    return nlp

In [None]:
def convert_spacy_entities_to_markdown(doc, texts_input):
    """Extract the entities recognised by Spacy and return them in a markdown string"""
    entities = [{"text": entity.text,
                 "label": entity.label_,
                 "start_char": entity.start_char,
                 "end_char": entity.end_char} for entity in doc.ents]
    for entity in reversed(entities):
        if entity["label"] in ["FAC", "GPE", "LOC"]:
            entity["label"] = "LOCATION"
        if entity["label"] in ["LOCATION", "PERSON"]:
            texts_input = (texts_input[:entity["end_char"]] +
                          f"]({entity['label']})" +
                          texts_input[entity["end_char"]:])
            texts_input = texts_input[:entity["start_char"]] + "[" + texts_input[entity["start_char"]:]
    return texts_input
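
The function above walks through the entities in reverse order because inserting markup changes the length of the string: working from the end keeps the character offsets of the earlier entities valid. The small sketch below shows the same idea with plain `(start, end, label)` tuples instead of a spaCy document; the name `annotate_spans` and the sample spans are assumptions for illustration.

```python
def annotate_spans(text, spans):
    """Insert [text](LABEL) markup for (start, end, label) spans, last span first."""
    for start, end, label in sorted(spans, reverse=True):
        # Only the text from `start` onwards changes, so earlier offsets stay correct
        text = text[:start] + "[" + text[start:end] + f"]({label})" + text[end:]
    return text

annotated = annotate_spans("Tom went to New York.",
                           [(0, 3, "PERSON"), (12, 20, "LOCATION")])
print(annotated)
```
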

Next, we call the helper functions for finding the language of the input text, selecting the right spaCy model, calling the model and converting the results to a markdown string, which is shown. The markdown string can be fed to the visualisation code of section 2.4.

In [None]:
MAX_PROCESSED = 100

texts_output = []
last_language_id = ""
for index, text in enumerate(texts_input[:MAX_PROCESSED]):
    utils.squeal(f"Processing text {index + 1}")
    language_id = utils.detect_text_language(text)
    if language_id != last_language_id:
        nlp = load_spacy_model(language_id)
        last_language_id = language_id
    doc = nlp(text)
    texts_output.append(convert_spacy_entities_to_markdown(doc, text))
print("Finished processing")
print(texts_output[0])

### 2.6.2. Named entity recognition with Qwen

Here we use [Qwen](https://en.wikipedia.org/wiki/Qwen), a locally-run model developed by the company Alibaba. The model has hardware requirements which your computer may not satisfy: it needs a GPU and about 12 gigabytes of memory. When running this model on Google Colab, it is recommended to use the runtime environment `T4 GPU`. If the model is too slow for you, the smaller model `qwen3:0.6b` can be used as an alternative.

In [None]:
MODEL = "qwen3:8b"
MAX_PROCESSED = 5

texts_output = utils.ollama_run(MODEL, texts_input[:MAX_PROCESSED], make_prompt, in_colab)
print(texts_output[0])