# Named Entity Recognition Part 2 — Workbook Solutions

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

*Note: You can explore this [workbook](https://mybinder.org/v2/gh/INFO1350/Intro-CA-SP21/master?urlpath=lab/tree/book/05-Text-Analysis/06.5-Named-Entity-Recognition-WORKBOOK.ipynb) in the cloud via Binder.*

---

## Install spaCy and Download Language Model

If you haven't installed spaCy yet, you should uncomment and run these cells. If you've already installed spaCy and the language model, you don't need to do it again.

In [None]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

## Import Libraries

We're going to import `spacy` and `displacy`. We're also going to import the `Counter` class for counting people, places, and things, and the `pandas` library.

In [1]:
import spacy
from spacy import displacy
import en_core_web_sm

from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600

## Load Language Model

We need to load the language model and save it as the variable `nlp`.

In [2]:
nlp = en_core_web_sm.load()

## NER with Hand Curation

In [3]:
filepath = "../texts/literature/The-House-on-Mango-Street-Sandra-Cisneros.txt"
text = open(filepath, encoding="utf-8").read()
document = nlp(text)

### Use spaCy to identify places

In [4]:
places = []
for named_entity in document.ents:
    if named_entity.label_ == 'LOC' or named_entity.label_ == "GPE":
        places.append(named_entity.text)
places_tally = Counter(places)
df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Marin|15|
|1|Mexico|4|
|2|Aunt Lala|4|
|3|Texas|3|
|4|Loomis|2|
|5|Paulina|2|
|6|France|2|
|7|Mango|2|
|8|Puerto Rico|2|
|9|Yolanda|2|
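
The filter-and-tally pattern in the cell above can be tried without loading a model. The following is a minimal sketch that assumes a hypothetical list of `(text, label)` pairs standing in for `document.ents`; the pairs are made up for illustration and are not output from the novel.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for spaCy's document.ents
sample_entities = [
    ("Mexico", "GPE"),
    ("Marin", "GPE"),
    ("Mexico", "GPE"),
    ("Mars", "LOC"),
    ("Sally", "PERSON"),
]

# Keep only the entities labeled as places (LOC or GPE)
place_labels = {"LOC", "GPE"}
sample_places = [text for text, label in sample_entities if label in place_labels]

# most_common() returns (item, count) pairs sorted from most to least frequent
sample_tally = Counter(sample_places)
print(sample_tally.most_common())
```

Testing membership in a set of labels does the same job as the chained `or` comparison and is easier to extend with more labels.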


### Output places to CSV file

In [5]:
df.to_csv("spacy-places.csv", encoding="utf-8", index=False)

### Manually curate CSV file

In [6]:
confirmed_df = pd.read_csv("confirmed-places.csv", encoding="utf-8")
confirmed_places = confirmed_df['place'].to_list()
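
For readers who want to see what reading the curated column involves, here is a minimal sketch that uses the standard library's `csv` module instead of `pandas`. The CSV contents are hypothetical and held in memory; they only assume that the curated file keeps a `place` column, as the cell above does.

```python
import csv
import io

# Hypothetical contents of a curated file with 'place' and 'count' columns
curated_csv = "place,count\nMexico,4\nTexas,3\nPuerto Rico,2\n"

# DictReader maps each row to the column names in the header line
reader = csv.DictReader(io.StringIO(curated_csv))
curated_place_names = [row["place"] for row in reader]
print(curated_place_names)
```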

### Match *all* named entities if they match the curated list

In [7]:
places = []
for named_entity in document.ents:
    # Check to see if named entity matches one of the manually curated places
    if named_entity.text in confirmed_places:
        places.append(named_entity.text)
places_tally = Counter(places)
df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Mango Street|16|
|1|Loomis|4|
|2|Mexico|4|
|3|Texas|3|
|4|Paulina|2|
|5|France|2|
|6|Puerto Rico|2|
|7|North Broadway|2|
|8|Chicago|1|
|9|Milwaukee|1|
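
Matching on the entity text rather than on the label means that a place can be counted even when the model gave it a label other than `LOC` or `GPE`. The sketch below illustrates this with hypothetical `(text, label)` pairs and a hypothetical curated set; neither comes from the novel or from `confirmed-places.csv`.

```python
from collections import Counter

# Hypothetical curated places, standing in for the 'place' column of the curated file
curated_places = {"Mango Street", "Mexico", "Chicago"}

# Hypothetical (text, label) pairs standing in for document.ents
sample_entities = [
    ("Mango Street", "FAC"),
    ("Mexico", "GPE"),
    ("Marin", "GPE"),
    ("Mango Street", "PERSON"),
]

# Match on the entity text alone, whatever label the model assigned
matched = [text for text, label in sample_entities if text in curated_places]
matched_tally = Counter(matched)
print(matched_tally.most_common())
```

Here "Marin" is dropped even though it carries a place label, and both "Mango Street" entities are kept even though neither does.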


## Get NER in Context

We can use spaCy to get individual sentences from documents with `document.sents`.

In [8]:
sentences = [sentence for sentence in document.sents]

Let's examine sentences 101 through 104 in this list (indices 100 through 103).

In [9]:
sentences[100:104]

[,
 My great-grandmother.,
 I would’ve liked to have known her, a wild horse of a woman, so wild she wouldn’t marry.,
 Until my great-grandfather threw a sack over her head and carried her off.]
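
Python slices include the start index and exclude the stop index, so `sentences[100:104]` returns four items. A small sketch with a stand-in list (not the actual sentences) makes the index arithmetic visible:

```python
# A stand-in list whose values are the 1-based positions 1..200
positions = list(range(1, 201))

# Slices include the start index and exclude the stop index
selected = positions[100:104]
print(selected)  # [101, 102, 103, 104]
```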

Because we can grab sentences, we can make a function that locates named entities in the context of the sentence in which they appear.

We will loop through all the sentences, and if the keyword appears in one of a sentence's named entities, we will display that sentence with the entity bolded and its entity label shown above it.

In [10]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document):

    # Iterate through all the sentences in the document and the named entities in each sentence
    for sentence in document.sents:
        for named_entity in sentence.ents:
            # Check whether the keyword appears in the named entity (ignoring capitalization by making both lowercase)
            if keyword.lower() in named_entity.text.lower():
                # Replace line breaks with spaces in both the sentence and the entity so they still match
                sentence_text = re.sub('\n', ' ', sentence.text)
                entity_text = re.sub('\n', ' ', named_entity.text)
                # Bold the named entity, ignoring capitalization; re.escape makes punctuation in the entity match literally
                sentence_text = re.sub(re.escape(entity_text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown(f'---\n**{named_entity.label_}**  \n{sentence_text}'))
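
The bolding step can be tried on its own, without spaCy. The sketch below uses a hypothetical helper, `bold_entity`, which is not part of the workbook. It escapes the entity text with `re.escape` so that characters such as `.` are matched literally rather than treated as regular expression syntax.

```python
import re

def bold_entity(entity_text, sentence_text):
    """Wrap every case-insensitive match of entity_text in Markdown bold markers."""
    # re.escape makes characters such as '.' or '(' match literally
    pattern = re.escape(entity_text)
    # A function replacement keeps the matched text's original capitalization
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

print(bold_entity("St. Louis", "We drove to St. Louis and back."))
```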

In [11]:
get_ner_in_context("Egypt", document)

---
**GPE**  
Sally is the girl with eyes like **Egypt** and nylons the color of smoke.

In [12]:
get_ner_in_context("India", document)

---
**GPE**  
I read somewhere in **India** there are priests who can will their heart to stop beating.

In [13]:
get_ner_in_context("Mars", document)

---
**LOC**  
There were sunflowers big as flowers on **Mars** and thick cockscombs bleeding the deep red fringe of theater curtains.

In [14]:
get_ner_in_context("Puerto Rico", document)

---
**GPE**  
She lives with Louie’s family because her own family is in **Puerto Rico**.

---
**GPE**  
Marin’s boyfriend is in **Puerto Rico**.

## Process Multiple Documents

To run spaCy on multiple documents, we will use `list(nlp.pipe(texts))`. We will test this out on NYT obituaries.

In [15]:
import glob
from pathlib import Path
import random

Make a list of all files that end with `.txt` in the `NYT-Obituaries` directory.

In [16]:
directory = "../texts/history/NYT-Obituaries"
files = glob.glob(f"{directory}/*.txt")

Create a list of all the NYT obituary texts.

In [17]:
texts = []
for file in files:
    text = open(file, encoding='utf-8').read()
    texts.append(text)
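
An alternative way to read a folder of text files uses `pathlib`, which closes each file automatically. This is a minimal sketch under stated assumptions: it builds a temporary directory with two made-up files so that it can run anywhere, rather than reading the actual obituaries.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with two small .txt files so the sketch is runnable
with tempfile.TemporaryDirectory() as temporary_directory:
    directory_path = Path(temporary_directory)
    (directory_path / "first.txt").write_text("First obituary.", encoding="utf-8")
    (directory_path / "second.txt").write_text("Second obituary.", encoding="utf-8")

    # sorted() gives a stable order; glob order is otherwise not guaranteed
    text_files = sorted(directory_path.glob("*.txt"))

    # read_text opens, reads, and closes each file in one call
    loaded_texts = [text_file.read_text(encoding="utf-8") for text_file in text_files]

print(loaded_texts)
```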

Randomly select 50 obituaries from the list of texts.

In [18]:
texts = random.sample(texts, 50)
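
Because the sample is random, the results further down will differ from run to run. If reproducible results are wanted, one option is a seeded generator, sketched here with a hypothetical stand-in list; the seed value 42 is arbitrary.

```python
import random

# Hypothetical stand-in for the list of obituary texts
sample_texts = [f"obituary {number}" for number in range(1, 11)]

# Seeding a dedicated generator makes the same sample come back on every run
rng = random.Random(42)

# random.sample raises ValueError if asked for more items than the list holds
sample_size = min(50, len(sample_texts))
chosen = rng.sample(sample_texts, sample_size)
print(len(chosen))
```

Capping the sample size with `min` keeps the cell from failing when the directory holds fewer than 50 files.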

Process all the texts.

In [19]:
documents = list(nlp.pipe(texts))

## Get People for Multiple Documents

We can use almost the exact same workflow, but we have to add one more `for` loop to consider each document in the list of documents.

In [20]:
people = []

# Consider each document in the list of processed documents
for document in documents:
    for named_entity in document.ents:
        if named_entity.label_ == 'PERSON':
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df[:100]

| |character|count|
|:---:|:---:|:---:|
|0|Sadat|66|
|1|Barnum|63|
|2|Stalin|58|
|3|Gillespie|48|
|4|Rabin|47|
|5|Sinclair|45|
|6|Selznick|44|
|7|Gromyko|43|
|8|Watson|38|
|9|Sinatra|38|


## Process a Long Text

If you try to use spaCy on a very long document, such as a long novel, you may get an error: by default, spaCy refuses texts longer than `nlp.max_length` (1,000,000 characters). To process a long document, we need to break up the text in some way.

In [21]:
filepath = "../texts/literature/Pride-and-Prejudice_Jane-Austen.txt"
text = open(filepath, encoding="utf-8").read()

Here we split up *Pride and Prejudice* by line breaks and make it into a long list. Then we run spaCy on this long list.

In [22]:
chunked_text = text.split('\n')
chunked_documents = list(nlp.pipe(chunked_text))
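
Splitting on line breaks also yields an empty string for every blank line. Empty chunks should simply produce documents with no entities, so filtering them is optional, but it avoids processing chunks that contain nothing. A minimal sketch with a short made-up text in place of the novel:

```python
# A short stand-in for the novel's text, with blank lines between paragraphs
sample_text = "Chapter 1\n\nIt is a truth universally acknowledged.\n\n\nMr. Bennet replied.\n"

# Splitting on line breaks also produces empty strings for blank lines
all_chunks = sample_text.split("\n")

# Dropping whitespace-only chunks avoids sending empty strings to nlp.pipe
non_empty_chunks = [chunk for chunk in all_chunks if chunk.strip()]

print(len(all_chunks), len(non_empty_chunks))
```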

In [23]:
entities = []

# Consider each document in the list of processed documents
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == 'PERSON':
            entities.append(named_entity.text)

entity_tally = Counter(entities)

df = pd.DataFrame(entity_tally.most_common(), columns=['entity', 'count'])
df

| |entity|count|
|:---:|:---:|:---:|
|0|Elizabeth|616|
|1|Darcy|399|
|2|Jane|287|
|3|Bennet|229|
|4|Wickham|178|
|5|Collins|164|
|6|Bingley|137|
|7|Lizzy|92|
|8|Gardiner|88|
|9|Lady Catherine|83|


## Named Entities

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


## Discussion

- How well does spaCy's NER seem to be performing?
- What does it do well or not so well?
- How could you imagine researchers, data scientists, or you yourself using NER in a project?