# Named Entity Recognition

This notebook introduces Named Entity Recognition (NER), a method for computationally identifying people, places, and things (of various kinds) in a text or collection of texts.

For this notebook I will use Ruth Bader Ginsburg's obituary as input (link: https://www.legacy.com/news/celebrity-deaths/ruth-bader-ginsburg-1933-2020-influential-u-s-supreme-court-justice/).

NER is useful for extracting key information from texts and is a fundamental task in natural language processing (NLP). NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. At its core, NLP uses statistics and computers to analyze human language.

Thanks to recent advances in machine learning and the growing amount of text data available on the web, interest in NLP has grown considerably.

Open-source NLP tools have also become very good. In this notebook I'm going to use one of them, the Python library spaCy.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
import en_core_web_sm
nlp = en_core_web_sm.load()

First, let's process the document with the loaded NLP model. After processing, the document object will contain the language data: named entities, sentence boundaries, and parts of speech.

In [6]:
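PATHNAME = "ruth.txt"
# Read the obituary text; the context manager closes the file afterwards.
with open(PATHNAME, encoding='utf-8') as f:
    text = f.read()
# Run the loaded spaCy pipeline over the text.
document = nlp(text)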

In [7]:
displacy.render(document, style="ent")

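A quick glance at the rendered text shows that spaCy does quite well with NER, but it's definitely not perfect. As the labeled output further down shows, it tags one mention of "Ginsburg" as a place (GPE) and "Obergefell" as an organization (ORG).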

In [8]:
document.ents

(Ruth Bader Ginsburg,
 the U.S. Supreme Court,
 second,
 first,
 Jewish,
 U.S.,
 Supreme Court,
 the Supreme Court,
 Bill Clinton,
 1993,
 Ginsburg,
 Ginsberg,
 United States,
 Virginia,
 1996,
 Ginsburg,
 the Virginia Military Institute’s,
 Bush,
 Gore,
 2000,
 Obergefell,
 Hodges,
 2015,
 One,
 Ginsburg,
 One,
 only nine,
 about 500,
 Ginsburg,
 first,
 U.S.,
 the Women’s Rights Project,
 the American Civil Liberties Union,
 Reed,
 Reed,
 1971,
 the Fourteenth Amendment’s Equal Protection Clause,
 U.S. Court of Appeals,
 the Supreme Court,
 Ginsburg,
 the U.S. Court of Appeals,
 Jimmy Carter,
 1980,
 the Supreme Court,
 13 years later,
 later years,
 Ginsburg,
 the Supreme Court,
 first,
 1999,
 a single day,
 2009,
 2018,
 Ginsburg,
 2019,
 first,
 July 2020,
 Ginsburg,
 Ginsburg,
 later years,
 The Notorious R.B.G.,
 The Notorious B.I.G.,
 RBG,
 On the Basis of Sex,
 Ginsburg,
 Ginsburg,
 the Supreme Court,
 nine,
 nine,
 10th,
 Circuit Bench & Bar Conference,
 2012,
 Ginsburg,
 th

In [9]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

Ruth Bader Ginsburg PERSON
the U.S. Supreme Court ORG
second ORDINAL
first ORDINAL
Jewish NORP
U.S. GPE
Supreme Court ORG
the Supreme Court ORG
Bill Clinton PERSON
1993 DATE
Ginsburg PERSON
Ginsberg PERSON
United States GPE
Virginia GPE
1996 DATE
Ginsburg PERSON
the Virginia Military Institute’s ORG
Bush PERSON
Gore PERSON
2000 DATE
Obergefell ORG
Hodges PERSON
2015 DATE
One CARDINAL
Ginsburg GPE
One CARDINAL
only nine CARDINAL
about 500 CARDINAL
Ginsburg PERSON
first ORDINAL
U.S. GPE
the Women’s Rights Project ORG
the American Civil Liberties Union ORG
Reed PERSON
Reed PERSON
1971 DATE
the Fourteenth Amendment’s Equal Protection Clause LAW
U.S. Court of Appeals ORG
the Supreme Court ORG
Ginsburg PERSON
the U.S. Court of Appeals ORG
Jimmy Carter PERSON
1980 DATE
the Supreme Court ORG
13 years later DATE
later years DATE
Ginsburg PERSON
the Supreme Court ORG
first ORDINAL
1999 DATE
a single day DATE
2009 DATE
2018 DATE
Ginsburg PERSON
2019 DATE
first ORDINAL
July 2020 DATE
Ginsburg PERS
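
One way to surface labeling mistakes like these is to look for entity texts that receive more than one label. The following is a minimal sketch, not part of the original notebook: `find_label_conflicts` is a hypothetical helper, and it works on plain `(text, label)` tuples so it can be tried without loading the model. The sample pairs are copied from the output above.

```python
from collections import defaultdict


def find_label_conflicts(pairs):
    """Return {entity_text: sorted labels} for texts tagged with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in pairs:
        labels_by_text[text].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# A few (text, label) pairs copied from the output above (hypothetical sample).
# With the real document, one could build the list as:
#   pairs = [(ent.text, ent.label_) for ent in document.ents]
sample_pairs = [
    ("Ruth Bader Ginsburg", "PERSON"),
    ("Ginsburg", "PERSON"),
    ("Obergefell", "ORG"),
    ("Hodges", "PERSON"),
    ("Ginsburg", "GPE"),
    ("Reed", "PERSON"),
    ("Reed", "PERSON"),
]

conflicts = find_label_conflicts(sample_pairs)
print(conflicts)  # {'Ginsburg': ['GPE', 'PERSON']}
```

A check like this only catches inconsistent labels. A consistently wrong label (such as "Obergefell" as ORG) still needs a manual review.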

In [10]:
for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Ruth Bader Ginsburg
Bill Clinton
Ginsburg
Ginsberg
Ginsburg
Bush
Gore
Hodges
Ginsburg
Reed
Reed
Ginsburg
Jimmy Carter
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ginsburg
Ruth Bader Ginsburg
John G. Roberts


In [11]:
people = []

for named_entity in document.ents:
    if named_entity.label_ == "PERSON":
        people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Ginsburg | 13 |
| 1 | Ruth Bader Ginsburg | 2 |
| 2 | Reed | 2 |
| 3 | Bill Clinton | 1 |
| 4 | Ginsberg | 1 |
| 5 | Bush | 1 |
| 6 | Gore | 1 |
| 7 | Hodges | 1 |
| 8 | Jimmy Carter | 1 |
| 9 | John G. Roberts | 1 |
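
The tally counts "Ginsburg", "Ruth Bader Ginsburg", and "Ginsberg" as three different people. A rough way to merge such variants is a hand-written alias table applied to the counts. This is a sketch under assumptions: `ALIASES` and `consolidate` are hypothetical names, the alias table reflects my reading of this one obituary, and the sample counts are copied from the table above.

```python
from collections import Counter

# Hypothetical alias table, written by hand for this obituary.
ALIASES = {
    "Ginsburg": "Ruth Bader Ginsburg",
    "Ginsberg": "Ruth Bader Ginsburg",  # spelling variant that appears in the tally
}


def consolidate(tally, aliases):
    """Merge counts for name variants into one canonical name."""
    merged = Counter()
    for name, count in tally.items():
        merged[aliases.get(name, name)] += count
    return merged


# Counts copied from the people table above.
people_tally_sample = Counter({
    "Ginsburg": 13,
    "Ruth Bader Ginsburg": 2,
    "Reed": 2,
    "Bill Clinton": 1,
    "Ginsberg": 1,
    "Bush": 1,
    "Gore": 1,
    "Hodges": 1,
    "Jimmy Carter": 1,
    "John G. Roberts": 1,
})

merged = consolidate(people_tally_sample, ALIASES)
print(merged.most_common(2))  # [('Ruth Bader Ginsburg', 16), ('Reed', 2)]
```

An alias table does not scale to large collections, but for a single short text it is easy to audit by eye.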


In [12]:
places = []
for named_entity in document.ents:
    if named_entity.label_ in ("GPE", "LOC"):
        places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | U.S. | 3 |
| 1 | Ginsburg | 2 |
| 2 | United States | 1 |
| 3 | Virginia | 1 |


In [13]:
# FAC is spaCy's label for facilities (buildings, airports, highways, bridges), not only streets.
facilities = []
for named_entity in document.ents:
    if named_entity.label_ == "FAC":
        facilities.append(named_entity.text)

facilities_tally = Counter(facilities)

df = pd.DataFrame(facilities_tally.most_common(), columns=['facility', 'count'])
df

| | facility | count |
|---|---|---|
| 0 | Arlington National Cemetery | 1 |


In [14]:
works_of_art = []
for named_entity in document.ents:
    if named_entity.label_ == "WORK_OF_ART":
        works_of_art.append(named_entity.text)

art_tally = Counter(works_of_art)

df = pd.DataFrame(art_tally.most_common(), columns = ['work_of_art', 'count'])
df

| | work_of_art | count |
|---|---|---|
| 0 | The Notorious R.B.G. | 1 |
| 1 | On the Basis of Sex | 1 |
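
The last four cells repeat the same pattern: filter entities by label, then count them. That pattern could be folded into one small function. The sketch below is an illustration, not the notebook's own code: `tally_entities` is a hypothetical helper, and it takes plain `(text, label)` tuples so it runs without the model. The sample pairs are a small made-up subset of the entities shown above.

```python
from collections import Counter


def tally_entities(pairs, labels):
    """Count entity texts whose label is one of `labels`."""
    wanted = set(labels)
    return Counter(text for text, label in pairs if label in wanted)


# Small hypothetical sample of (text, label) pairs.
# With the real document, one could build the list as:
#   pairs = [(ent.text, ent.label_) for ent in document.ents]
sample_pairs = [
    ("U.S.", "GPE"),
    ("Virginia", "GPE"),
    ("U.S.", "GPE"),
    ("Arlington National Cemetery", "FAC"),
    ("The Notorious R.B.G.", "WORK_OF_ART"),
    ("On the Basis of Sex", "WORK_OF_ART"),
]

places_sample = tally_entities(sample_pairs, ["GPE", "LOC"])
print(places_sample.most_common())  # [('U.S.', 2), ('Virginia', 1)]
```

The resulting `Counter` can be passed to `pd.DataFrame(tally.most_common(), columns=[...])` exactly as in the cells above.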
