<a href="https://colab.research.google.com/github/libbyseline/nicar24/blob/main/00-Entity%20recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget --quiet -O items.zip https://github.com/jsoma/nicar24-beyond-chatgpt/raw/main/items.zip
!unzip -o -q items.zip

# Named entity recognition

Finding names, dates, places, and more. If you want to do this without programming, [Google Pinpoint](https://journaliststudio.google.com/pinpoint/collections) will work great (until they [unceremoniously shut it down](https://killedbygoogle.com/)).

## spaCy

After the dark days of NLTK but before the rise of ChatGPT, [spaCy](https://spacy.io/) was the champion of text analysis. Still is, in many ways! It has great documentation, is very fast and easy to use – everything you could ask for in a piece of software.

### Installing spaCy

spaCy is a Python package, so you install it the normal way with `pip`.

In [None]:
!pip install --quiet spacy

Before we use spaCy, we first need to download a *model*. spaCy itself doesn't understand language or text: it's just an interface! We'll end up using different models based on the languages we're analyzing and how important speed is.

Down below we use `en_core_web_sm`, which is a small, quick model for the English language.
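spaCy models get installed as ordinary Python packages, so one way to check whether a model is already on your machine is to ask Python whether it can find a package with that name. This is a minimal sketch using only the standard library; `model_is_installed` is a helper name made up for this example, not part of spaCy.

```python
import importlib.util

def model_is_installed(model_name):
    """Return True if Python can find a package with this name."""
    return importlib.util.find_spec(model_name) is not None

# Check the two model names used in this notebook
for name in ["en_core_web_sm", "en_core_web_trf"]:
    status = "installed" if model_is_installed(name) else "not installed yet"
    print(name, "|", status)
```

If a model shows up as not installed yet, the download cell is the fix.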

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [None]:
text = """
Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

for entity in doc.ents:
  print(entity.label_, ' | ', entity.text)

You can find [a list of models on the spaCy site](https://spacy.io/usage/models). We might not be very impressed with the results above, so we'll try one that's bigger, better, fancier..! and slower: `en_core_web_trf`.

We first need to download `en_core_web_trf` in the same way we downloaded the smaller model.

In [None]:
!python -m spacy download en_core_web_trf

And we'll use it in the same way as the last one, only changing the line where we load it in through spaCy.

In [None]:
%%time

text = """
Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

nlp = spacy.load('en_core_web_trf')
doc = nlp(text)

for entity in doc.ents:
  print(entity.label_, ' | ', entity.text)

Is that great? I don't know, *maybe*. If we poke around on [the packages link to all of the available English models](https://spacy.io/models/en) we see a couple more: `en_core_web_md` and `en_core_web_lg`. Might as well give the large one a try?

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
%%time

text = """
Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

nlp = spacy.load('en_core_web_lg')
doc = nlp(text)

for entity in doc.ents:
  print(entity.label_, ' | ', entity.text)

While we like to think of computers as always giving the One True Right Answer, text analysis is a little different: it's much closer to **the model's opinion** than anything else.

People do spend time comparing models to humans, though! Take a look at the **Accuracy evaluation** tabs on the [models page](https://spacy.io/models/en) to see how each of the models compares on different tasks.
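If you want to see where the models' opinions differ on your own text, one rough approach is to line up what each one found. The sketch below assumes you've already collected each model's entities as `(text, label)` pairs, for example with `[(ent.text, ent.label_) for ent in doc.ents]`. The entity lists here are made-up placeholders, not actual model output, and `compare_models` is a name invented for this example.

```python
def compare_models(results_by_model):
    """Map each (text, label) pair to the sorted list of models that produced it."""
    seen = {}
    for model_name, entities in results_by_model.items():
        for pair in entities:
            seen.setdefault(pair, set()).add(model_name)
    return {pair: sorted(models) for pair, models in seen.items()}

# Placeholder entity lists, NOT real model output
results_by_model = {
    "model_a": [("Nikki Haley", "PERSON"), ("November", "DATE")],
    "model_b": [("Nikki Haley", "PERSON"), ("Biden", "PERSON")],
}

agreement = compare_models(results_by_model)
for (text, label), models in agreement.items():
    print(label, "|", text, "|", ", ".join(models))
```

Entities that only one model reports are usually the ones worth reading by hand.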

## Hugging Face Transformers

The `_trf` in `en_core_web_trf` stands for *transformers*, which is a fancy approach to machine learning that showed up well after spaCy was created. The introduction of transformers is a big reason why we've seen such an explosion in AI and machine learning since 2017!

While you *can* use spaCy with these fancy new models, it's not really what most people do. Instead, there's a much bigger ecosystem that you can find browsing around at [Hugging Face](https://huggingface.co/). You're no longer restricted to the four model options spaCy presented to you – everyone from Microsoft to random people uploads new models every day.

Hugging Face even has a package called `transformers` that helps you use these new models. Let's install it!

In [None]:
!pip install transformers torch

Which model should we use? Well... [how about we just browse?](https://huggingface.co/models)

You *immediately* see things are different when we're using Hugging Face as compared to spaCy. Even though spaCy wasn't the modern cutting edge, it was at least easy to know what was going on! Now that we're in the Hugging Face world we're dropped into selecting a model without a clue what's going on.

The models page has models that do all sorts of things: generate images, categorize text, transcribe audio, and a million other options. Pulling names and places out of text is called **entity extraction** or **named entity recognition** (NER for short). It also falls under the umbrella of **token classification**, which is (kind of) the idea of putting words into categories.

We can go [search for token classification models](https://huggingface.co/models?pipeline_tag=token-classification&sort=trending), but then we have to decide which one to pick. The hottest trending one? The most downloaded one? The most recent?

Most of the time there's no way to know the right answer: **you just try them out and see what works best!**

> That's kind of a lie, you can always look at [what's currently state-of-the-art on benchmarks](https://paperswithcode.com/sota/token-classification-on-conll2003).

We can start with [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) since it's (currently) on top of the list of trending models.

We can see a suggestion on how to use the model in the column of text to the left, but I like to click the **Use in Transformers** button to see a little quick-start version. It's usually shorter!

In [None]:
from transformers import pipeline

# https://huggingface.co/dslim/bert-base-NER
# I'm going to use "ner" instead of "token-classification"
nlp = pipeline("ner", model="dslim/bert-base-NER")

results = nlp(text)
for entity in results:
    print(entity)

It splits 'Biden' into *pieces*, and 'Nikki' and 'Haley' come back as separate entries. That's because the model tags sub-word tokens one at a time. If you scour the internet you'll find that you also need to set the **aggregation strategy**, which controls how entities that are next to each other get combined.
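To get a feel for what aggregation does, here's a hand-rolled sketch that glues pieces back together. It is a simplification written for illustration, not the algorithm `transformers` actually uses, and the token list is hand-written to resemble the raw output (the exact way 'Biden' gets split is an assumption). `merge_pieces` is a made-up helper name.

```python
def merge_pieces(tokens):
    """Glue wordpieces and neighboring same-type tokens into whole entities."""
    merged = []
    for token in tokens:
        # "B-PER" -> prefix "B", group "PER"
        prefix, _, group = token["entity"].partition("-")
        word = token["word"]
        is_subword = word.startswith("##")
        same_group = bool(merged) and merged[-1]["entity_group"] == group
        if same_group and is_subword:
            # A "##" piece attaches directly to the previous word
            merged[-1]["word"] += word[2:]
        elif same_group and prefix == "I":
            # An "inside" tag continues the previous entity as a new word
            merged[-1]["word"] += " " + word
        else:
            clean = word[2:] if is_subword else word
            merged.append({"entity_group": group, "word": clean})
    return merged

# Hand-written tokens that imitate un-aggregated output
tokens = [
    {"entity": "B-PER", "word": "Nikki"},
    {"entity": "I-PER", "word": "Haley"},
    {"entity": "B-PER", "word": "Bid"},
    {"entity": "I-PER", "word": "##en"},
]

for entity in merge_pieces(tokens):
    print(entity["word"], "|", entity["entity_group"])
```

In practice you'd let the library handle this, since it also deals with scores and character offsets.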

In [None]:
# https://huggingface.co/dslim/bert-base-NER
nlp = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy='simple')

results = nlp(text)
for entity in results:
    print(entity)
    # print(entity['word'], entity['entity_group'])

It also suggests a [larger model](https://huggingface.co/dslim/bert-large-NER), `bert-large` instead of `bert-base`. We can try that, I guess??

In [None]:
# https://huggingface.co/dslim/bert-large-NER
nlp = pipeline("ner",
               model="dslim/bert-large-NER",
               aggregation_strategy='simple')

results = nlp(text)
for entity in results:
    print(entity['word'], '|', entity['entity_group'])

Seems pretty good, yeah?

If we go back to the [models page and sort by most downloads](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads), we see another model to try: [Jean-Baptiste/roberta-large-ner-english](https://huggingface.co/Jean-Baptiste/roberta-large-ner-english). Might as well give it a shot while we're here!

In [None]:
from transformers import pipeline

# https://huggingface.co/Jean-Baptiste/roberta-large-ner-english
nlp = pipeline("ner",
               model="Jean-Baptiste/roberta-large-ner-english",
               aggregation_strategy='simple')

results = nlp(text)
for entity in results:
    print(entity['word'], '|', entity['entity_group'])

If we go back to the [bert-large-NER page](https://huggingface.co/dslim/bert-large-NER), there's a section labeled **Evaluation results** hiding on the bottom right-hand side. If we click the **View on Papers With Code** link, we end up on [a leaderboard for NER tasks](https://paperswithcode.com/sota/token-classification-on-conll2003).

Sitting pretty at the top of the list is `microsoft-deberta-v3-large_ner_conll2003` from Microsoft. If we click it we end up at [Gladiator/microsoft-deberta-v3-large_ner_conll2003](https://huggingface.co/Gladiator/microsoft-deberta-v3-large_ner_conll2003), where *some random guy who doesn't work at Microsoft* has tweaked a Microsoft model and turned it into something that performs well on entity recognition.

Let's support his efforts by giving it a try!

In [None]:
%%time

# https://huggingface.co/Gladiator/microsoft-deberta-v3-large_ner_conll2003
nlp = pipeline("ner",
               model="Gladiator/microsoft-deberta-v3-large_ner_conll2003",
               aggregation_strategy='simple')

results = nlp(text)
for entity in results:
    print(entity['word'], '|', entity['entity_group'])

Now it's your turn: find [another model on the token classification model list](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads) that you can try out. Maybe it's in [French](https://huggingface.co/Jean-Baptiste/camembert-ner) or [Japanese](https://huggingface.co/tsmatz/xlm-roberta-ner-japanese), or instead of recognizing people and places it can [analyze medical records](https://huggingface.co/Clinical-AI-Apollo/Medical-NER)!

In [None]:
%%time

text = """
This is some text
"""

nlp = pipeline("ner",
               model="Gladiator/microsoft-deberta-v3-large_ner_conll2003",
               aggregation_strategy='simple')

results = nlp(text)
for entity in results:
    print(entity['word'], '|', entity['entity_group'])

Want to make your own? Try [Hugging Face's AutoTrain](https://huggingface.co/autotrain)! [Docs are here](https://huggingface.co/docs/autotrain/en/index) and kiiinda hard to understand until you get things working. Maybe there's a good YouTube video??

### Wrap-up

So far we've seen that named entity recognition is fun, but also kind of tough to get a handle on.

**Different models perform at different levels,** and finding the best one ("state of the art") is tough. We saw [that one leaderboard](https://paperswithcode.com/sota/token-classification-on-conll2003), but how do we know that it's the best one? The Papers With Code website even has [a whole section of token classification tests and scores](https://paperswithcode.com/task/token-classification)!

**The models can only tag what they know,** so you can't use a names-organizations-locations tagger on a dataset to find money measurements and dates. A pre-trained tagger isn't very customizable.
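A related wrinkle: the taggers don't even share a vocabulary. spaCy's English models use labels like `PERSON`, `GPE`, and `DATE`, while models trained on CoNLL-2003 (like `dslim/bert-base-NER`) only know `PER`, `ORG`, `LOC`, and `MISC`. If you want to compare them you need some kind of translation table. The mapping below is one possible judgment call (folding `GPE` into `LOC`, for instance), not an official standard, and `to_conll` is a name invented for this sketch.

```python
# Assumed mapping from spaCy-style labels to CoNLL-2003-style labels
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
}

def to_conll(entities):
    """Translate (text, label) pairs, setting aside labels with no counterpart."""
    translated = []
    dropped = []
    for text, label in entities:
        if label in SPACY_TO_CONLL:
            translated.append((text, SPACY_TO_CONLL[label]))
        else:
            dropped.append((text, label))
    return translated, dropped

translated, dropped = to_conll([("Nikki Haley", "PERSON"), ("November", "DATE")])
print("Kept:", translated)
print("No equivalent label:", dropped)
```

Anything that lands in the second list is something the narrower tagger could never have found.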

## A new world: LLMs

After the introduction of transformers, one of the biggest splashes in the world of AI was ChatGPT. You'd ask it for a poem about your cat and it would deliver! Amazing!

Beyond cat poems, ChatGPT is impressive because it's just so *flexible* – it can do so much! Even if it doesn't always give you the right answer, hallucinates frequently, won't talk about things it determines are too sensitive, expresses all sorts of bias, and is owned by a big corporation, *it can still amaze you*.

One of the wildest things about ChatGPT and other large language models is that they can do all sorts of traditional natural language processing tasks, [including named entity recognition](https://chat.openai.com/share/dfba5755-04ae-483d-9c87-cf5889341fca):

> **ME:** List the entities in the text below. On each line, present one entity,
a comma, and the category it belongs to (PERSON, ORG, DATE)
>
> TEXT:
>
> Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden
>
> It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
>
> **ChatGPT:**
>
> - Trump, PERSON
> - Haley, PERSON
> - November, DATE
> - Nikki Haley, PERSON
> - Donald Trump, PERSON
> - President Biden, PERSON

Amazing, right? It would be a pain to do with a hundred thousand entries in a database, though, so instead we'll talk directly to OpenAI through Python. This involves using the **API**, a way for computers to talk to each other, and an **API key**, a long string that identifies me to the OpenAI systems. They'll use it to track my usage and charge me!

We'll start by installing the package that allows us to talk to GPT. Note that we're using the official OpenAI package here, **not** the popular [LangChain](https://langchain.readthedocs.io/) package. LangChain is always updating and breaking and a lot of people find it overly complicated for what it does, so we're going to set it aside for now.

In [None]:
!pip install --quiet openai

Even though we're just writing Python code, we're still having a conversation. We give GPT a list of messages that pass back and forth between us and the system: the `user` and the `assistant`. The first message is a `system` prompt that sets the stage for the interaction. [You can see the complete OpenAI one here](https://pastebin.com/vnxJ7kQk), but we're just using the incredibly boring "You are a helpful assistant."
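The message list is plain Python data, so it can help to see one built up by hand before sending anything to the API. This sketch makes no network calls; `build_messages` is a helper invented for this example, and the `assistant` reply in `history` is made up rather than a real model response.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant.", history=None):
    """Assemble the list of role/content dicts for a chat request."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history or [])  # earlier user/assistant turns, if any
    messages.append({"role": "user", "content": user_prompt})
    return messages

# An invented earlier exchange, just to show the back-and-forth shape
history = [
    {"role": "user", "content": "List the entities in: Nikki Haley's campaign"},
    {"role": "assistant", "content": "Nikki Haley, PERSON"},
]

messages = build_messages("Now do the same for: President Biden", history=history)
for message in messages:
    print(message["role"], "|", message["content"])
```

For a one-off request like the ones in this notebook, the list is just the `system` message followed by a single `user` message.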

In [None]:
import os
from openai import OpenAI

# Read the API key from an environment variable instead of pasting it into the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = """
List the entities in the text below. On each line, present one entity,
a comma, and the category it belongs to (PERSON, ORG, DATE).

TEXT:

Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

messages = [
    { "role": "system", "content": "You are a helpful assistant."},
    { "role": "user", "content": prompt}
]

chat_completion = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
    temperature=0
)

print(chat_completion.choices[0].message.content)

If we want to create a reusable piece of code, it's best to transform it into a **function**. The code below will allow us to write `get_entities_llm("This is a story about Joe Biden")` to pull the entities out of the text instead of writing the big long long long piece of code.

In [None]:
import os

# Read the API key from an environment variable instead of pasting it into the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_entities_llm(text):
    prompt_template = """
    List the entities in the text below. On each line, present one entity,
    a comma, and the category it belongs to (PERSON, ORG, DATE).

    TEXT:

    {text}
    """

    prompt = prompt_template.format(text=text)

    messages = [
        { "role": "system", "content": "You are a helpful assistant."},
        { "role": "user", "content": prompt}
    ]

    chat_completion = client.chat.completions.create(
        messages=messages,
        model="gpt-3.5-turbo",
        temperature=0
    )

    return chat_completion.choices[0].message.content

Now every time we use it, it'll be nice and short.

In [None]:
text = """
Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

results = get_entities_llm(text)
print(results)

There's a problem, though: if we use this again and again and again and again across tens of thousands of pieces of text, **sometimes it's going to give us incorrect output.**
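One low-tech way to catch that is to parse every response ourselves and set aside anything that doesn't fit the `entity, CATEGORY` shape we asked for. This is a rough sketch: `parse_entities` is a made-up helper, and the sample text is invented to show a couple of ways a response could go wrong, not copied from a real reply.

```python
ALLOWED = {"PERSON", "ORG", "DATE"}

def parse_entities(raw):
    """Split 'entity, CATEGORY' lines into pairs, collecting lines that don't fit."""
    good, bad = [], []
    for line in raw.splitlines():
        # Tolerate leading bullets like "- Trump, PERSON"
        line = line.strip().lstrip("-").strip()
        if not line:
            continue
        # Split on the LAST comma so entity text may contain commas
        text, sep, category = line.rpartition(",")
        text, category = text.strip(), category.strip()
        if sep and text and category in ALLOWED:
            good.append((text, category))
        else:
            bad.append(line)
    return good, bad

# Invented response with two problem lines
sample = """
Nikki Haley, PERSON
November, DATE
Sure! Here are the entities:
Race, EVENT
"""

good, bad = parse_entities(sample)
print("Parsed:", good)
print("Needs a second look:", bad)
```

That tells us *when* something went wrong, but we'd still have to go back and ask again ourselves.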

### Guardrails

[Guardrails](https://github.com/guardrails-ai/guardrails) is a Python library that magically convinces the large language model to format the response in the right way. Along with *asking* for it, it will also attempt to validate a response. If the validation fails, Guardrails will *demand that the LLM fix it!*
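The validate-then-demand-a-fix idea can be sketched in a few lines of plain Python. To be clear, this is not how Guardrails is implemented; it's a toy loop to show the concept. `validate`, `ask_with_reask`, and `fake_llm` are all names invented here, and `fake_llm` just replays canned strings instead of calling a real model.

```python
import json

ALLOWED = ("PERSON", "ORGANIZATION", "LOCATION")

def validate(raw):
    """Return (entities, None) if the reply is acceptable, else (None, complaint)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "The reply was not valid JSON."
    if not isinstance(data, dict) or not isinstance(data.get("entities"), list):
        return None, "The reply needs an 'entities' list."
    for entity in data["entities"]:
        if not isinstance(entity, dict) or entity.get("category") not in ALLOWED:
            return None, f"Every category must be one of {', '.join(ALLOWED)}."
    return data["entities"], None

def ask_with_reask(llm, prompt, max_attempts=3):
    """Call llm(prompt) until the reply validates, feeding complaints back in."""
    for _ in range(max_attempts):
        entities, complaint = validate(llm(prompt))
        if complaint is None:
            return entities
        prompt = f"{prompt}\n\nYour previous reply was rejected. {complaint} Please fix it."
    raise ValueError("No valid reply after re-asking")

# Canned replies standing in for a real model: the first one uses a bad category
replies = iter([
    '{"entities": [{"text": "Biden", "category": "POLITICIAN"}]}',
    '{"entities": [{"text": "Biden", "category": "PERSON"}]}',
])

def fake_llm(prompt):
    return next(replies)

print(ask_with_reask(fake_llm, "Extract the entities from: President Biden"))
```

Guardrails takes care of this kind of loop for us, along with writing the formatting instructions into the prompt.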

It's kind of a pain to set up, but it's a must-have for larger projects.

In [None]:
from guardrails import validators, Guard
from rich import print
from pydantic import BaseModel, Field
import openai
import json

prompt = """
    Analyze the following text, extracting all of the entities.

    ${text}

    ## Detailed instructions

    ${gr.complete_json_suffix_v2}
"""

from typing import List

class Entity(BaseModel):
    text: str = Field(description="Text of entity")
    category: str = Field(description="Entity category", validators=[
        validators.ValidChoices(
            choices=['PERSON', 'ORGANIZATION', 'LOCATION'],
            on_fail='reask'
        )
    ])

class EntityList(BaseModel):
    entities: List[Entity]

guard = Guard.from_pydantic(output_class=EntityList, prompt=prompt)

Now that we've set up our expected results, let's try it out!

In [None]:
text = """
Trump Seeks to Bump Haley From Race, Eyeing November Rematch With Biden

It is a seemingly do-or-die moment for Nikki Haley’s campaign, as Donald Trump’s
probable victory lines him up to face off with President Biden in the fall.
"""

response = guard(
    openai.chat.completions.create,
    prompt_params={"text": text},
    response_format={"type": "json_object"},
    max_tokens=1024,
    temperature=0.3,
)

# Print the validated output from the LLM
print(json.dumps(response.validated_output, indent=2))

We can look at `guard.history.last.tree` to see the entire conversation - it's a lot more than just our prompt!

In [None]:
print(guard.history.last.tree)

## Reflection

Wow, doing entity recognition with LLMs is a lot more flexible and potentially accurate than traditional models! It costs more and is slower, but it might be worth it??