# Entity Extraction with Generative Models

This notebook demonstrates how to use Cohere's generative models to extract the name of a film from the title of an article. This demonstrates Named Entity Recognition (NER) of entities which are harder to isolate using other NLP methods (and where pre-training provides the model with some context on these entities). This also demonstrates the broader usecase of sturctured generation based on providing multiple examples in the prompt.



![Extracting Entities from text](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/keyword-extraction-gpt-models.png)


We'll use post titles from the r/Movies subreddit. And for each title, we'll extract which movie the post is about. If the model is unable to detect the name of a movie being mentioned, it will return "none".

## Setup
Let's start by installing the packages we need.

In [1]:
!pip install cohere requests tqdm

Collecting cohere
  Downloading cohere-4.11.2-py3-none-any.whl (39 kB)
Collecting backoff<3.0,>=2.0 (from cohere)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting importlib_metadata<7.0,>=6.0 (from cohere)
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Installing collected packages: importlib_metadata, backoff, cohere
Successfully installed backoff-2.2.1 cohere-4.11.2 importlib_metadata-6.7.0


We'll then import these packages and declare the function that retrieves post titles from reddit.

In [14]:
import cohere
import pandas as pd
import requests
import datetime
from tqdm import tqdm
pd.set_option('display.max_colwidth', None)

def get_post_titles(**kwargs):
    """ Gets data from the pushshift api. Read more: https://github.com/pushshift/api """
    base_url = f"https://api.pushshift.io/reddit/search/submission/"
    payload = kwargs
    request = requests.get(base_url, params=payload)
    print(request.json())
    # return [a['title'] for a in request.json()['data']]


You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet.

In [3]:
# Paste your API key here. Remember to not share publicly
api_key = '263AjTqtMaRsy4Dehu6ko2ThCjC6KS85e8dF45ZV'

# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)

## Preparing examples for the prompt

In our prompt, we'll present the model with examples for the type of output we're after. We basically get a set of subreddit article titles, and label them ourselves. The label here is the name of the movie mentioned in the title (and "none" if no movie is mentioned).


![Labeled dataset of text and extracted text](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/keyword-extraction-dataset.png)



In [4]:

movie_examples = [
("Deadpool 2", "Deadpool 2 | Official HD Deadpool's \"Wet on Wet\" Teaser | 2018"),
("none", "Jordan Peele Just Became the First Black Writer-Director With a $100M Movie Debut"),
("Joker", "Joker Officially Rated “R”"),
("Free Guy", "Ryan Reynolds’ 'Free Guy' Receives July 3, 2020 Release Date - About a bank teller stuck in his routine that discovers he’s an NPC character in brutal open world game."),
("none", "James Cameron congratulates Kevin Feige and Marvel!"),
("Guardians of the Galaxy", "The Cast of Guardians of the Galaxy release statement on James Gunn"),
]




## Creating the extraction prompt

We'll create a prompt that demonstrates the task to the model. The prompt contains the examples above, and then presents the input text and asks the model to extract the movie name.


![Extraction prompt containing the examples and the input text](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/extraction-prompt-example.png)


In [20]:
#@title Create the prompt (Run this cell to execute required code) {display-mode: "form"}

class cohereExtractor():
    def __init__(self, examples, example_labels, labels, task_desciption, example_prompt):
        self.examples = examples
        self.example_labels = example_labels
        self.labels = labels
        self.task_desciption = task_desciption
        self.example_prompt = example_prompt

    def make_prompt(self, example):
        examples = self.examples + [example]
        labels = self.example_labels + [""]
        return (self.task_desciption +
                "\n---\n".join( [examples[i] + "\n" +
                                self.example_prompt +
                                 labels[i] for i in range(len(examples))]))

    def extract(self, example):
      extraction = co.generate(
          model='command',
          prompt=self.make_prompt(example),
          max_tokens=10,
          temperature=0.1,
          stop_sequences=["\n"])
      return(extraction.generations[0].text[:-1])


cohereMovieExtractor = cohereExtractor([e[1] for e in movie_examples],
                                       [e[0] for e in movie_examples], [],
                                       "",
                                       "extract the movie title from the post:")

# Uncomment to inspect the full prompt:
# print(cohereMovieExtractor.make_prompt('<input text here>'))

In [8]:
# This is what the prompt looks like:
print(cohereMovieExtractor.make_prompt('<input text here>'))

Deadpool 2 | Official HD Deadpool's "Wet on Wet" Teaser | 2018
extract the movie title from the post:Deadpool 2
---
Jordan Peele Just Became the First Black Writer-Director With a $100M Movie Debut
extract the movie title from the post:none
---
Joker Officially Rated “R”
extract the movie title from the post:Joker
---
Ryan Reynolds’ 'Free Guy' Receives July 3, 2020 Release Date - About a bank teller stuck in his routine that discovers he’s an NPC character in brutal open world game.
extract the movie title from the post:Free Guy
---
James Cameron congratulates Kevin Feige and Marvel!
extract the movie title from the post:none
---
The Cast of Guardians of the Galaxy release statement on James Gunn
extract the movie title from the post:Guardians of the Galaxy
---
<input text here>
extract the movie title from the post:


## Getting the data
Let's now make the API call to get the top posts for 2021 from r/movies.

In [15]:
num_posts = 10

movies_list = get_post_titles(size=num_posts,
      after=str(int(datetime.datetime(2021,1,1,0,0).timestamp())),
      before=str(int(datetime.datetime(2022,1,1,0,0).timestamp())),
      subreddit="movies",
      sort_type="score",
      sort="desc")

# Show the list
movies_list

{'detail': 'Not authenticated'}


## Running the model
And now we loop over the posts and process each one of them with our extractor.

In [16]:
movies_list = ['Avengers: Endgame is on air in the next week', 'My favorite movie is the terminator series','What is the best Harry Porter movie in your mind?']

In [21]:
results = []
for text in tqdm(movies_list):
    try:
        extracted_text = cohereMovieExtractor.extract(text)
        results.append(extracted_text)
    except Exception as e:
        print('ERROR: ', e)

100%|██████████| 3/3 [00:01<00:00,  1.69it/s]


Let's look at the results:

In [22]:
results

['Avengers: Endgam', 'terminato', 'Harry Porte']

In [23]:
pd.DataFrame(data={'text': movies_list, 'extracted_text': results})

Unnamed: 0,text,extracted_text
0,Avengers: Endgame is on air in the next week,Avengers: Endgam
1,My favorite movie is the terminator series,terminato
2,What is the best Harry Porter movie in your mind?,Harry Porte


Looking at these results, the model got 9/10 correctly. It didn't pick up on Shaolin Soccer and God of Gambler in example \#4. It also called the second example "Pixar's Luca" instead of "Luca". But maybe we'll let this one slide.

When experimenting with extrction prompts, we'll often find edge-cases along the way. What if a post has two movies mentioned, for example? The more we run into such examples, the more examples we can add to the prompt that address these cases.

## How well does this work?
We can better measure the performance of this extraction method using a larger labeled dataset. So let's load a test set of 100 examples:

In [24]:
test_df = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/movie_extraction_test_set_100.csv',index_col=0)
test_df

Unnamed: 0,text,label
0,Disney's streaming service loses some movies due to old licensing deals,none
1,"Hi, I’m Sam Raimi, producer of THE GRUDGE which hits theaters tonight. Ask Me Anything!",The Grudge
2,'Parasite' Named Best Picture by Australia's AACTA Awards,Parasite
3,Danny Trejo To Star In Vampire Spaghetti Western ‘Death Rider in the House of Vampires’,Death Rider in the House of Vampires
4,I really wish the 'realistic' CGI animal trend would end.,none
...,...,...
95,Hair Love | Oscar Winning Short Film (Full),Hair Love
96,First image of Jason Alexander in Christian film industry satire 'Faith Based',Faith Based
97,"'Borderlands' Movie in the Works From Eli Roth, Lionsgate",Borderlands
98,"Taika Waititi putting his Oscar ""away"" after winning best adapted screenplay for JOJO RABBIT",Jojo Rabbit


Let's run the extractor on these post titles (calling the API in parallel for quicker results):

In [27]:
from concurrent.futures import ThreadPoolExecutor

extracted = []

# # Run the model to extract the entities
# with ThreadPoolExecutor(max_workers=8) as executor:
#     for i in executor.map(cohereMovieExtractor.extract, test_df['text']):
#         extracted.append(str(i).strip())

for text in tqdm(test_df['text'][5:10]):
    try:
        extracted_text = cohereMovieExtractor.extract(text)
        extracted.append(extracted_text)
    except Exception as e:
        print('ERROR: ', e)

# Save results

test_df['extracted_text'] = extracted

100%|██████████| 5/5 [00:06<00:00,  1.26s/it]


ValueError: ignored

Let's look at some results:

In [None]:
test_df.head()

Unnamed: 0,text,label,extracted_text
0,Disney's streaming service loses some movies due to old licensing deals,none,none
1,"Hi, I’m Sam Raimi, producer of THE GRUDGE which hits theaters tonight. Ask Me Anything!",The Grudge,The Grudge
2,'Parasite' Named Best Picture by Australia's AACTA Awards,Parasite,Parasite
3,Danny Trejo To Star In Vampire Spaghetti Western ‘Death Rider in the House of Vampires’,Death Rider in the House of Vampires,Death Rider
4,I really wish the 'realistic' CGI animal trend would end.,none,none


Let's calculate the accuracy by comparing to the labeled examples

In [None]:
# Compare the label to the extracted text
test_df['correct'] = (test_df['label'].str.lower() == test_df['extracted_text'].str.lower()).astype(int)

# Print the accuracy
print(f'Classification accuracy {test_df["correct"].mean() *100}%')

Classification accuracy 89.0%


So it seems this prompt works well on this small test set. It's not guaranteed it will do as well on other sets, however. The prompt can be improved by trying on more data, discovering edge cases, and adding more examples to the prompt.

We can look at the examples it got wrong:

In [None]:
test_df[test_df['correct']==0]

Unnamed: 0,text,label,extracted_text,correct
3,Danny Trejo To Star In Vampire Spaghetti Western ‘Death Rider in the House of Vampires’,Death Rider in the House of Vampires,Death Rider,0
6,De Niro recreating a scene from Goodfellas to test Irishman deaging (3:30 in),Goodfellas,none,0
12,Is there anyway way I could get a copy of 1917 for my dying father in law?,1917,none,0
30,How Uncut Gems Won Over the Diamond District,Uncut Gems,none,0
31,Michael J. Fox and Christopher Lloyd posing for the Back to the Future II poster in 1989 that would later be illustrated by Drew Struzan,Back to the Future II,Back to the Future,0
39,2019 in film - with 'Movies' by Weyes Blood,none,Movies,0
57,The Mad Max franchise is my all time favorite movie series. I finally watched Waterworld tonight. Oh man why didnt I see this sooner?,Mad Max,Waterworld,0
69,How A New Hope created Pixar Animation Studios,Star Wars,none,0
75,A scene from the movie 1917 was recreated from the stroyboards.,1917,none,0
82,New Wonder Woman image,Wonder Woman,none,0


It indeed failed to pick up a few examples. Sometimes this uncovers edge cases and understandable mistakes (e.g. two films are mentioned in the text).


We can look at the classification report for a more detailed look at what's included in the test set, and what the model got right and wrong:

In [None]:
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

print(classification_report(test_df['label'].str.lower(), test_df['extracted_text'].str.lower()))

                                      precision    recall  f1-score   support

                                1917       0.00      0.00      0.00         2
               2001: a space odyssey       1.00      1.00      1.00         1
                            ad astra       1.00      1.00      1.00         1
     alice doesn't live here anymore       1.00      1.00      1.00         1
                       austin powers       1.00      1.00      1.00         1
                  back to the future       0.00      0.00      0.00         0
               back to the future ii       0.00      0.00      0.00         1
                        blood simple       1.00      1.00      1.00         1
                   bohemian rhapsody       1.00      1.00      1.00         1
                         borderlands       1.00      1.00      1.00         1
                     brief encounter       1.00      1.00      1.00         1
                                cats       1.00      1.00      

This type of extraction is interesting because it doesn't just blindly look at the text. The model has picked up on movie information during its pretraining process and that helps it understand the task from only a few examples.

You can think about extending this to other subreddits, to extract other kinds of entities and information. [Let us know in the forum](https://community.cohere.ai/) what you experiment with and what kinds of results you see!

Happy building!