# Class 14 Exercises - Named Entity Recognition — Solutions

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Install spaCy and Download Language Model

We need to install spaCy and the English-language model (`en_core_web_sm`). This is the model that was trained on the annotated "OntoNotes" corpus.

In [2]:
#!pip install -U spacy
!python -m spacy download en_core_web_sm

⚠ Skipping model package dependencies and setting `--no-deps`. You
don't seem to have the spaCy package itself installed (maybe because you've
built from source?), so installing the model dependencies would cause spaCy to
be downloaded, which probably isn't what you want. If the model package has
other dependencies, you'll have to install them manually.
Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=e424438c97353fe4dcf67a70cc18890f3f96a7c44214d2deda3d1cf4277a681d
  Stored in directory: /private/var/folders/t

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. At the time this lesson was written, languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese did not have their own NLP models; check the linked page for the current list. However, spaCy offers language and tokenization support for many of these languages with external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also raising the pandas default limit on the number of rows displayed).

In [1]:
import spacy
from spacy import displacy
import en_core_web_sm

from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600

## Load Language Model

We need to load the language model and save it as the variable `nlp`.

In [2]:
nlp = en_core_web_sm.load()

## Process Document

In the cell below, we open and read a text file. Then we run `nlp()` on the text to create our processed spaCy document.

In [4]:
filepath = "Harry-Potter-Sorcerer's-Stone.txt"
# Read the whole file as UTF-8 text; the with-block closes the file for us
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style` parameter set to `"ent"` (short for entities):

In [5]:
displacy.render(document, style="ent")
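
Rendering every entity in a whole novel produces a very long display. As a minimal sketch (not part of the original exercise), the cell below builds a tiny document by hand with `spacy.blank("en")`, so it runs without the downloaded model, and uses displacy's `options={"ents": [...]}` setting to highlight only `PERSON` entities. The sample sentence, the variable names, and the hand-assigned labels are assumptions for illustration; with the real `document` you would pass the same `options` argument.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes; it has no NER component,
# so the entities below are assigned by hand for illustration.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Harry Potter lives on Privet Drive")
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="PERSON"),  # tokens 0-1: "Harry Potter"
    Span(sample_doc, 4, 6, label="FAC"),     # tokens 4-5: "Privet Drive"
]

# jupyter=False returns the HTML markup as a string instead of displaying it.
# The "ents" option restricts highlighting to the listed labels.
html = displacy.render(
    sample_doc, style="ent", jupyter=False, options={"ents": ["PERSON"]}
)
print(html)
```

In a notebook, dropping `jupyter=False` displays the highlighted text directly, as in the cell above.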

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property.

In [6]:
document.ents

(Harry Potter,
 Sorcerer,
 Dursley,
 four,
 Privet Drive,
 Dursley,
 Grunnings,
 Dursley,
 Dursleys,
 Dudley,
 Dursleys,
 Dursley,
 several years,
 Dursley,
 Dursleys,
 Potters,
 Dursleys,
 Dudley,
 Dursley,
 Tuesday,
 Dursley,
 Dursley,
 Dudley,
 Dursley,
 Dursley,
 Dudley,
 Dudley,
 Dursley,
 four,
 first,
 second,
 Dursley,
 Dursley,
 Dursley,
 Privet Drive,
 Dursley,
 that day,
 Dursley,
 Dursley,
 Dursley,
 a few minutes later,
 Dursley,
 Grunnings,
 Dursley,
 ninth,
 that morning,
 Dursley,
 owl-free morning,
 five,
 baker,
 The Potters,
 Harry,
 Dursley,
 Potter,
 Harry,
 Harry,
 Harvey,
 Harold,
 Dursley,
 that afternoon,
 five o'clock,
 a few seconds,
 Dursley,
 today,
 Muggles,
 Dursley,
 Dursley,
 Muggle,
 four,
 one,
 Shoo,
 Dursley,
 Dursley,
 Dursley,
 normal day,
 Next Door,
 Dudley,
 Dursley,
 Dudley,
 today,
 night,
 hundreds,
 Jim McGuffin,
 tonight,
 Jim,
 Ted,
 today,
 Kent,
 Yorkshire,
 Dundee,
 yesterday,
 Bonfire Night,
 next week,
 tonight,
 Dursley,
 Britain,
 

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing). To get the entity label, we can use the `.label_` attribute.

In [7]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

Harry Potter PERSON
Sorcerer NORP
Dursley PERSON
four CARDINAL
Privet Drive LOC
Dursley PERSON
Grunnings ORG
Dursley PERSON
Dursleys PERSON
Dudley PERSON
Dursleys PERSON
Dursley PERSON
several years DATE
Dursley PERSON
Dursleys PERSON
Potters PERSON
Dursleys PERSON
Dudley PERSON
Dursley PERSON
Tuesday DATE
Dursley PERSON
Dursley PERSON
Dudley PERSON
Dursley PERSON
Dursley PERSON
Dudley PERSON
Dudley PERSON
Dursley PERSON
four CARDINAL
first ORDINAL
second ORDINAL
Dursley PERSON
Dursley PERSON
Dursley PERSON
Privet Drive LOC
Dursley PERSON
that day DATE
Dursley PERSON
Dursley PERSON
Dursley PERSON
a few minutes later TIME
Dursley PERSON
Grunnings ORG
Dursley PERSON
ninth ORDINAL
that morning TIME
Dursley PERSON
owl-free morning TIME
five CARDINAL
baker PERSON
The Potters WORK_OF_ART
Harry PERSON
Dursley PERSON
Potter ORG
Harry PERSON
Harry PERSON
Harvey ORG
Harold PERSON
Dursley PERSON
that afternoon TIME
five o'clock TIME
a few seconds TIME
Dursley PERSON
today DATE
Muggles PERSON
Durs
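
Beyond `.label_`, each entity also exposes its text and its character offsets in the document, and `spacy.explain()` returns a short description of a label. The sketch below assigns two entities by hand on a blank pipeline so that it runs without the downloaded model; the sentence and labels are illustrative assumptions, not output from `en_core_web_sm`.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenization only, entities assigned manually below.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Harry Potter lives on Privet Drive")
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="PERSON"),  # "Harry Potter"
    Span(sample_doc, 4, 6, label="FAC"),     # "Privet Drive"
]

# Collect the text, label, and character offsets of each entity.
rows = [
    (ent.text, ent.label_, ent.start_char, ent.end_char)
    for ent in sample_doc.ents
]

for text_, label, start, end in rows:
    # spacy.explain gives a human-readable gloss for the label.
    print(text_, label, start, end, spacy.explain(label))
```

The same attributes are available on every entity in `document.ents`.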

To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [8]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Harry Potter
Dursley
Dursley
Dursley
Dursleys
Dudley
Dursleys
Dursley
Dursley
Dursleys
Potters
Dursleys
Dudley
Dursley
Dursley
Dursley
Dudley
Dursley
Dursley
Dudley
Dudley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
baker
Harry
Dursley
Harry
Harry
Harold
Dursley
Dursley
Muggles
Dursley
Dursley
Dursley
Dursley
Dursley
Next Door
Dudley
Dursley
Dudley
Jim McGuffin
Jim
Ted
Kent
Yorkshire
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dursley
Dudley
Dursley
Howard
Dursley
Dursley
Dursley
Dursleys
Dursley
Dursley
Petunia
Petunia
Dursley
Albus Dumbledore
Albus Dumbledore
Dursley
McGonagall
Dursleys
Dedalus Diggle
McGonagall
Dumbledore
Dumbledore
McGonagall
Dumbledore
McGonagall
Pomfrey
McGonagall
McGonagall
Lily
James Potter
McGonagall
Albus
McGonagall
Potter
Harry
Harry Potter
Harry
Dumbledore
McGonagall
Hagrid
Harry
McGonagall
Harry Potter
McGonagall
Harry Potter
Harry
Dumbledore
McGonagall
Dumbledore
Harry
Hagrid
Dumbledore
Dumbledore
Dumbledore
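
The same filter can be written as a list comprehension, and repeated names can be collapsed while keeping their order of first appearance. This is a minimal sketch that uses a hypothetical `Entity` stand-in (a namedtuple with only `.text` and `.label_`) in place of real spaCy entities, so it runs without the model; the sample values mirror a few of the labels printed above.

```python
from collections import namedtuple

# Stand-in for spaCy entities: only the two attributes used here.
Entity = namedtuple("Entity", ["text", "label_"])

sample_ents = [
    Entity("Harry Potter", "PERSON"),
    Entity("Privet Drive", "LOC"),
    Entity("Dursley", "PERSON"),
    Entity("Tuesday", "DATE"),
    Entity("Dursley", "PERSON"),
]

# Same filter as the for/if loop, written as a list comprehension.
people = [ent.text for ent in sample_ents if ent.label_ == "PERSON"]

# dict.fromkeys keeps the first occurrence of each name, in order.
unique_people = list(dict.fromkeys(people))

print(people)         # ['Harry Potter', 'Dursley', 'Dursley']
print(unique_people)  # ['Harry Potter', 'Dursley']
```

With the real data, replacing `sample_ents` with `document.ents` applies the same filter to the novel.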

We can count the items in a Python list with `Counter`, then turn the resulting list of (item, count) pairs into a DataFrame:

In [9]:
sample_list = ['Harry', 'Ron', 'Snape', 'Harry']
Counter(sample_list).most_common()

[('Harry', 2), ('Ron', 1), ('Snape', 1)]

In [10]:
pd.DataFrame(Counter(sample_list).most_common())

| |0|1|
|:---:|:---:|:---:|
|0|Harry|2|
|1|Ron|1|
|2|Snape|1|


In [11]:
pd.DataFrame(Counter(sample_list).most_common(), columns=['character', 'count'])

| |character|count|
|:---:|:---:|:---:|
|0|Harry|2|
|1|Ron|1|
|2|Snape|1|


## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **people**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [13]:
people = []

for named_entity in document.ents:
    if named_entity.label_ == 'PERSON':
        people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| |character|count|
|:---:|:---:|:---:|
|0|Harry|1001|
|1|Ron|412|
|2|Hagrid|241|
|3|Dudley|129|
|4|Quirrell|98|
|5|Hermione|94|
|6|Malfoy|80|
|7|Dumbledore|70|
|8|McGonagall|68|
|9|Dursley|52|


## Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are *either* **countries, cities, states** OR **non-GPE locations**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [15]:
places = []

for named_entity in document.ents:
    if named_entity.label_ in ('GPE', 'LOC'):
        places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| |place|count|
|:---:|:---:|:---:|
|0|Neville|103|
|1|Snape|35|
|2|London|13|
|3|Muggles|11|
|4|Dumbledore|11|
|5|Privet Drive|9|
|6|Gringotts|9|
|7|Ravenclaw|9|
|8|Vernon|6|
|9|Uncle Vernon|6|
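
Since the people and places cells differ only in which labels they accept, the pattern can be wrapped in a small helper. This is a sketch, not part of the original exercise: `tally_entities` and the `Entity` stand-in are hypothetical names, and the sample entities are made up so the cell runs without the model.

```python
from collections import Counter, namedtuple

# Stand-in for spaCy entities: only the two attributes used here.
Entity = namedtuple("Entity", ["text", "label_"])

def tally_entities(ents, labels):
    """Count the texts of all entities whose label is in `labels`."""
    wanted = set(labels)
    return Counter(ent.text for ent in ents if ent.label_ in wanted)

sample_ents = [
    Entity("London", "GPE"),
    Entity("Privet Drive", "LOC"),
    Entity("London", "GPE"),
    Entity("Harry", "PERSON"),
]

places_tally = tally_entities(sample_ents, ["GPE", "LOC"])
print(places_tally.most_common())  # [('London', 2), ('Privet Drive', 1)]
```

With the real document, `tally_entities(document.ents, ["GPE", "LOC"])` would return a `Counter` that can be passed to `pd.DataFrame(...)` in the same way as `places_tally` in the exercise cell.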


## Get Languages

|Type Label|Description|
|:---:|:---:|
|LANGUAGE|Any named language.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **languages**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [17]:
languages = []

for named_entity in document.ents:
    if named_entity.label_ == 'LANGUAGE':
        languages.append(named_entity.text)

languages_tally = Counter(languages)

df = pd.DataFrame(languages_tally.most_common(), columns=['language', 'count'])
df

| |language|count|
|:---:|:---:|:---:|
|0|Snape|1|


## Get Another Entity

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


Based on the examples above, make a DataFrame of a different named entity.

In [19]:
entities = []

for named_entity in document.ents:
    if named_entity.label_ == 'EVENT':
        entities.append(named_entity.text)

entities_tally = Counter(entities)

df = pd.DataFrame(entities_tally.most_common(), columns=['entity', 'count'])
df

| |entity|count|
|:---:|:---:|:---:|
|0|Room 17|1|
|1|Blowing Gum|1|
|2|Naughty|1|
|3|a World Cup|1|
|4|the Twentieth Century|1|
|5|the house cup|1|
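
When choosing which entity type to explore, it can help to first count how often each label occurs. The sketch below does this with `Counter` over the labels themselves; the `Entity` stand-in and the sample entities are assumptions for illustration, so the counts are not results from the novel.

```python
from collections import Counter, namedtuple

# Stand-in for spaCy entities: only the two attributes used here.
Entity = namedtuple("Entity", ["text", "label_"])

sample_ents = [
    Entity("Harry Potter", "PERSON"),
    Entity("Dursley", "PERSON"),
    Entity("Tuesday", "DATE"),
    Entity("four", "CARDINAL"),
    Entity("Dudley", "PERSON"),
    Entity("that day", "DATE"),
]

# Count labels rather than entity texts.
label_counts = Counter(ent.label_ for ent in sample_ents)
print(label_counts.most_common())
# [('PERSON', 3), ('DATE', 2), ('CARDINAL', 1)]
```

Running the same line over `document.ents` would show which of the 18 labels actually appear in the text and how often.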


## Discussion

- How well does spaCy's NER seem to be performing?
- What does it do well or not so well?
- How could you imagine researchers or data scientists using NER?
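
One way to start answering the first two questions is to look for strings that receive more than one label. The tables above, for example, list "Snape" as both a place and a language, and "Dumbledore" as both a person and a place. The sketch below groups labels by entity text; the `Entity` stand-in and the sample entities are toy data chosen to mirror that pattern, not actual counts from the novel.

```python
from collections import Counter, defaultdict, namedtuple

# Stand-in for spaCy entities: only the two attributes used here.
Entity = namedtuple("Entity", ["text", "label_"])

sample_ents = [
    Entity("Snape", "PERSON"),
    Entity("Snape", "PERSON"),
    Entity("Snape", "GPE"),
    Entity("Snape", "LANGUAGE"),
    Entity("Dumbledore", "PERSON"),
    Entity("Dumbledore", "GPE"),
    Entity("London", "GPE"),
]

# For each entity text, count how often each label was assigned to it.
labels_by_text = defaultdict(Counter)
for ent in sample_ents:
    labels_by_text[ent.text][ent.label_] += 1

# Keep only the texts that received more than one distinct label.
inconsistent = {
    text: dict(counts)
    for text, counts in labels_by_text.items()
    if len(counts) > 1
}

for text, counts in inconsistent.items():
    print(text, counts)
```

Strings that show up here are good candidates for manual review, since at most one of their labels is likely to be right in any given sentence.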

## spaCy vs. BookNLP

Let's compare spaCy to BookNLP! Click here: https://colab.research.google.com/drive/1lYezboblkBlf_zPuZ7Kootw5dSvP0tOT#scrollTo=k5oJL-a01kUq

BookNLP documentation: https://people.ischool.berkeley.edu/~dbamman/booknlp/README.html