# Named Entity Recognition — Workbook Solutions

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

*Note: You can explore this [workbook](https://mybinder.org/v2/gh/INFO1350/Intro-CA-SP21/master?urlpath=lab/tree/book/05-Text-Analysis/05.5-Named-Entity-Recognition-WORKBOOK.ipynb) in the cloud via Binder.*

---

## Install spaCy and Download Language Model

We need to install spaCy and the English-language model (`en_core_web_sm`). This is the model that was trained on the annotated "OntoNotes" corpus.

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Not every language has its own trained model, so check the linked page for the current list. However, spaCy offers language and tokenization support for many more languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

We're also going to import the `Counter` class (from Python's `collections` module) for counting people, places, and things, and the `pandas` library for organizing and displaying data. We're also raising the pandas default limit on how many rows are displayed.

In [1]:
import spacy
from spacy import displacy
import en_core_web_sm

from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
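
As a quick illustration of what `Counter` does, here is a minimal sketch that runs without spaCy. The list of names is a hand-typed stand-in (an assumption for illustration), not model output.

```python
from collections import Counter

# Hand-typed sample names (illustrative stand-in, not model output)
names = ["Magdalena", "Nenny", "Magdalena", "Esperanza"]

# Counter tallies how many times each item appears
name_tally = Counter(names)

# most_common() returns (item, count) pairs, highest count first
top_names = name_tally.most_common()
print(top_names)
```

The list of `(item, count)` pairs that `most_common()` returns is the shape we later hand to `pd.DataFrame()`.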

## Load Language Model

We need to load the language model and save it as the variable `nlp`.

In [2]:
nlp = en_core_web_sm.load()

## Process Document

In the cell below, we open and read a text file. Then we run `nlp()` on the text to create our processed spaCy document.

In [3]:
filepath = "../texts/literature/House-on-Mango-Street/04-My-Name.txt"
text = open(filepath, encoding='utf-8').read()
document = nlp(text)
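
The `open(...).read()` pattern above leaves it to Python to close the file later. A minimal alternative sketch, assuming you are comfortable with the standard library's `pathlib`, reads the text and closes the file in one step. The temporary file and its placeholder sentence are assumptions so the example runs anywhere; in the lesson you would point `Path` at the real `filepath`.

```python
import tempfile
from pathlib import Path

# Create a throwaway file so this sketch runs without the lesson's texts
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.txt"
    sample_path.write_text("A short placeholder sentence.", encoding="utf-8")

    # read_text() opens the file, reads it, and closes it for us
    sample_text = sample_path.read_text(encoding="utf-8")

print(sample_text)
```

The resulting string can be passed to `nlp()` exactly like `text` in the cell above.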

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style=` parameter set to `"ent"` (short for entities):

In [4]:
displacy.render(document, style="ent")

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property.

In [5]:
document.ents

(English,
 Spanish,
 nine,
 Mexican,
 Sunday,
 mornings,
 the Chinese year,
 Chinese,
 Chinese,
 Mexicans,
 Spanish,
 Magdalena,
 Magdalena,
 Nenny,
 Esperanza)

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing). To get the entity label, we can use the `.label_` attribute.

In [6]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

English LANGUAGE
Spanish NORP
nine CARDINAL
Mexican NORP
Sunday DATE
mornings TIME
the Chinese year DATE
Chinese NORP
Chinese NORP
Mexicans NORP
Spanish LANGUAGE
Magdalena PERSON
Magdalena PERSON
Nenny PERSON
Esperanza PERSON
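
Once every entity has a label, it can be handy to group the entities by type. The sketch below is one possible way to do that with a `defaultdict`. To keep it runnable without the model, the `(text, label)` pairs are copied by hand from the printed output above; the variable names are illustrative assumptions.

```python
from collections import defaultdict

# (text, label) pairs copied by hand from the printed output above
labeled_entities = [
    ("English", "LANGUAGE"), ("Spanish", "NORP"), ("nine", "CARDINAL"),
    ("Mexican", "NORP"), ("Sunday", "DATE"), ("mornings", "TIME"),
    ("the Chinese year", "DATE"), ("Chinese", "NORP"), ("Chinese", "NORP"),
    ("Mexicans", "NORP"), ("Spanish", "LANGUAGE"), ("Magdalena", "PERSON"),
    ("Magdalena", "PERSON"), ("Nenny", "PERSON"), ("Esperanza", "PERSON"),
]

# Collect entity texts under their label
entities_by_label = defaultdict(list)
for entity_text, label in labeled_entities:
    entities_by_label[label].append(entity_text)

for label, texts in entities_by_label.items():
    print(label, texts)
```

With a processed spaCy document, the same pairs could be built with `[(ent.text, ent.label_) for ent in document.ents]`.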


To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [7]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Magdalena
Magdalena
Nenny
Esperanza


## NER with The House on Mango Street

In [8]:
text = open("../texts/literature/The-House-on-Mango-Street-Sandra-Cisneros.txt", encoding='utf-8').read()
document = nlp(text)

## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

The code below will extract named entities from *The House on Mango Street*, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **people**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [9]:
people = []

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|:---:|:---:|:---:|
|0|Nenny|35|
|1|Sally|31|
|2|Lucy|29|
|3|Rachel|25|
|4|Ruthie|13|
|5|Benny|9|
|6|Cathy|8|
|7|Earl|8|
|8|Esperanza|7|
|9|Geraldo|7|


## Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

The code below will extract named entities from *The House on Mango Street*, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are *either* **countries, cities, states** OR **non-GPE locations**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [10]:
places = []

for named_entity in document.ents:
    if named_entity.label_ == "GPE" or named_entity.label_ == "LOC":
        places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Marin|15|
|1|Mexico|4|
|2|Aunt Lala|4|
|3|Texas|3|
|4|Loomis|2|
|5|Paulina|2|
|6|France|2|
|7|Mango|2|
|8|Puerto Rico|2|
|9|Yolanda|2|
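
When the list of wanted labels grows, chaining `or` comparisons gets long. One alternative, sketched here as an assumption rather than the workbook's required answer, is to test membership in a set of labels. The `(text, label)` pairs are hypothetical and typed by hand for illustration; they are not model output.

```python
from collections import Counter

# Hypothetical (text, label) pairs for illustration only, not model output
sample_entities = [
    ("Mexico", "GPE"), ("Nenny", "PERSON"), ("Texas", "GPE"),
    ("Mexico", "GPE"), ("Lake Michigan", "LOC"), ("English", "LANGUAGE"),
]

# Labels we want to keep
place_labels = {"GPE", "LOC"}

# Keep only the texts whose label is one of the wanted labels
sample_places = [text for text, label in sample_entities if label in place_labels]
sample_tally = Counter(sample_places)
print(sample_tally.most_common())
```

In the loop above, the equivalent condition would be `if named_entity.label_ in {"GPE", "LOC"}:`.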


## Get Languages

|Type Label|Description|
|:---:|:---:|
|LANGUAGE|Any named language.|

The code below will extract named entities from *The House on Mango Street*, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **languages**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [11]:
languages = []

for named_entity in document.ents:
    if named_entity.label_ == "LANGUAGE":
        languages.append(named_entity.text)

languages_tally = Counter(languages)

df = pd.DataFrame(languages_tally.most_common(), columns = ['language', 'count'])
df

| |language|count|
|:---:|:---:|:---:|
|0|English|10|
|1|Spanish|2|


## Get Another Entity

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


Based on the examples above, make a DataFrame of a different named entity from *The House on Mango Street*.

In [15]:
entities = []

for named_entity in document.ents:
    if named_entity.label_ == 'NORP':
        entities.append(named_entity.text)

entity_tally = Counter(entities)

df = pd.DataFrame(entity_tally.most_common(), columns=['entity', 'count'])
df

| |entity|count|
|:---:|:---:|:---:|
|0|Spanish|5|
|1|Chinese|2|
|2|Mexican|1|
|3|Mexicans|1|
|4|Puerto Rican|1|
|5|Spartans|1|
|6|Catholic|1|
|7|Oriental|1|
|8|Southern|1|
|9|English|1|


## Discussion

- How well does spaCy's NER seem to be performing?
- What does it do well or not so well?
- How could you imagine researchers or data scientists using NER?