# Class 14 Exercises - Named Entity Recognition

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Install spaCy and Download Language Model

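We need to install spaCy and its small English-language model (`en_core_web_sm`), which was trained on the annotated "OntoNotes" corpus. In the cell below, the line that installs spaCy itself is commented out; uncomment it if spaCy is not already installed.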

In [2]:
#!pip install -U spacy
!python -m spacy download en_core_web_sm

⚠ Skipping model package dependencies and setting `--no-deps`. You
don't seem to have the spaCy package itself installed (maybe because you've
built from source?), so installing the model dependencies would cause spaCy to
be downloaded, which probably isn't what you want. If the model package has
other dependencies, you'll have to install them manually.
Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 9.3 MB/s
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=e424438c97353fe4dcf67a70cc18890f3f96a7c44214d2deda3d1cf4277a681d
  Stored in directory: /private/var/folders/t

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. At the time of writing, languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean, and Vietnamese don't have their own NLP models. However, spaCy offers language and tokenization support for many of these languages through external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data. We're also raising the pandas default for the maximum number of rows displayed, so that long tallies are not cut off.

In [3]:
import spacy
from spacy import displacy
import en_core_web_sm

from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600

## Load Language Model

We need to load the language model and save it as the variable `nlp`.

In [4]:
nlp = en_core_web_sm.load()

## Process Document

In the cell below, we open and read a text file. Then we run `nlp()` on the text to create our processed spaCy document.

In [5]:
filepath = "Harry-Potter-Sorcerer's-Stone.txt"
text = #Open and read the text file
document = nlp(text)
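
One common way to read a file into a single string is Python's built-in `open()` with an explicit encoding. The sketch below is only an illustration, not necessarily the intended solution: it writes a short stand-in file to a temporary folder (the file name `sample.txt` and its one-sentence contents are hypothetical) and reads it back. The resulting string is the kind of value that `nlp()` expects.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the novel: a tiny text file in a temporary folder.
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "sample.txt"
sample_path.write_text("Harry met Ron on the train to Hogwarts.", encoding="utf-8")

# Open the file, read its full contents into one string, and close it again.
with open(sample_path, encoding="utf-8") as file:
    sample_text = file.read()

print(sample_text)
```

Using `with` closes the file automatically, and naming the encoding avoids surprises when the text contains curly quotes or accented characters.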

## spaCy Named Entities

Below is a Named Entities chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


To quickly see spaCy's NER in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) with the `style` parameter set to `"ent"` (short for entities):

In [None]:
displacy.render(document, style="ent")

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property.

In [None]:
document.ents

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing). To get the entity label, we can use the `.label_` attribute.

In [None]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)
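
To make the shape of that loop concrete without loading a model, the sketch below uses a hypothetical `FakeEntity` stand-in (not part of spaCy) that carries two attributes: `.label_`, as used above, and `.text`, which on a real spaCy entity holds the entity's string. The entities and labels here are invented for illustration, so real output on the novel will differ.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are modeled.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])

fake_ents = [
    FakeEntity("Harry", "PERSON"),
    FakeEntity("London", "GPE"),
    FakeEntity("Hagrid", "PERSON"),
    FakeEntity("Tuesday", "DATE"),
]

# Group entity texts under their type label.
entities_by_label = defaultdict(list)
for named_entity in fake_ents:
    entities_by_label[named_entity.label_].append(named_entity.text)

for label, texts in entities_by_label.items():
    print(label, texts)
```

Grouping like this is a quick way to see which type labels actually occur in a text before deciding which ones to count.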

To extract just the named entities that have been identified as `PERSON`, we can add a simple `if` statement into the mix:

In [None]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)
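
The same `if` pattern extends to more than one label by testing membership in a set. The sketch below uses hypothetical `(text, label)` pairs in place of real spaCy entities so that it runs without a model; the pairs are invented for illustration.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for document.ents.
fake_ents = [
    ("Tuesday", "DATE"),
    ("midnight", "TIME"),
    ("Harry", "PERSON"),
    ("Tuesday", "DATE"),
]

wanted_labels = {"DATE", "TIME"}

# Keep only the entities whose label is one of the wanted labels.
when = [text for text, label in fake_ents if label in wanted_labels]

# Count how often each kept entity appears.
when_tally = Counter(when)
print(when_tally.most_common())
```

Note that the list holds plain strings. Counting works by equality, so tallying strings merges repeated mentions of the same name into one entry.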

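We can make a DataFrame from Python collections like lists. In the cells below, `Counter(...).most_common()` tallies a list and returns its items with their counts, and `pd.DataFrame()` turns that result into a table.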

In [None]:
sample_list = ['Harry', 'Ron', 'Snape', 'Harry']
Counter(sample_list).most_common()

In [None]:
pd.DataFrame(Counter(sample_list).most_common())

In [None]:
pd.DataFrame(Counter(sample_list).most_common(), columns=['character', 'count'])
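
Each item that `most_common()` returns is an `(item, count)` tuple, ordered from most to least frequent, which is why the two-column DataFrame above lines up with the `columns=['character', 'count']` names. A minimal sketch using only the standard library:

```python
from collections import Counter

sample_list = ['Harry', 'Ron', 'Snape', 'Harry']
tally = Counter(sample_list)

# Every entry is a (name, count) tuple, most frequent first.
for character, count in tally.most_common():
    print(character, count)

# Passing a number limits the result to the top n entries.
top_one = tally.most_common(1)
print(top_one)  # [('Harry', 2)]
```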

## Get People

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **people**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [None]:
people = []

for named_entity in document.ents:
    # Your Code Here
        #Your Code Here

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

## Get Places

|Type Label|Description|
|:---:|:---:|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are *either* **countries, cities, states** OR **non-GPE locations**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [None]:
places = []

for named_entity in document.ents:
    # Your Code Here
        # Your Code Here

places_tally = #Your code here

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

## Get Languages

|Type Label|Description|
|:---:|:---:|
|LANGUAGE|Any named language.|

The code below will extract named entities from the text, count them up with `Counter()`, and then create a DataFrame of the results.

Your job is to make sure that we only get named entities that are **languages**. Insert the appropriate Python code below.

*Hint: Look at the examples above for guidance.*

In [None]:
languages = []

for named_entity in document.ents:
    # Your Code Here
        # Your Code Here

languages_tally = Counter(languages)

df = pd.DataFrame(languages_tally.most_common(), #Your code here)
df

## Get Another Entity

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


Based on the examples above, make a DataFrame of a different named entity.

In [None]:
# Your Code Here 👇






## Discussion

- How well does spaCy's NER seem to be performing?
- What does it do well or not so well?
- How could you imagine researchers or data scientists using NER?

## spaCy vs BookNLP

Let's compare spaCy to BookNLP! Click here: https://colab.research.google.com/drive/1lYezboblkBlf_zPuZ7Kootw5dSvP0tOT#scrollTo=k5oJL-a01kUq

BookNLP documentation: https://people.ischool.berkeley.edu/~dbamman/booknlp/README.html