# Named Entity Recognition (NER)

This notebook deals with the recognition of **named entities**.

Named entities are:
- Person names
- Organizations
- Locations
- Medical codes
- Time expressions
- Quantities
- Monetary values
- Percentages
- ... more examples provided in Section 1.

spaCy recognizes them automatically and makes them available via `doc.ents`; each entity exposes `ent.label_` and the other properties listed in Section 1.
Additionally, we can easily create our own custom entities.

Overview of contents:

1. List of Named Entity Properties and Label Tags
2. Examples of Named Entities
3. Adding New Entities
    - 3.1 Adding Single Term Entities
    - 3.2 Adding Entities with Several Terms: Phrases
4. Counting Named Entities
5. Visualizing Named Entities

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. List of Named Entity Properties and Label Tags

#### Entity Annotations: Properties of `ents`

<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The *start* token index of the entity span in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The *end* token index of the entity span in the Doc (exclusive)</td></tr>
<tr><td>`ent.start_char`</td><td>The *start* character offset of the entity text in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The *end* character offset of the entity text in the Doc (exclusive)</td></tr>
</table>
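
The difference between token indices (`start`/`end`) and character offsets (`start_char`/`end_char`) is easiest to see on a hand-built span. The sketch below assumes spaCy v3 and uses a blank English pipeline (tokenizer only, no trained model), so the `MONEY` span is assigned by hand rather than predicted; `blank_nlp`, `demo_doc` and `money` are illustrative names.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so no model download is needed.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Can I please borrow 500 dollars from you?")

# Hand-labelled span covering tokens 4 and 5 ("500 dollars"); the end index is exclusive.
money = Span(demo_doc, 4, 6, label="MONEY")

# start/end index tokens in the Doc ...
print(demo_doc[money.start:money.end].text)
# ... while start_char/end_char index characters in the Doc text.
print(demo_doc.text[money.start_char:money.end_char])

# label is the integer hash, label_ the readable string.
print(money.label, money.label_)
```

Both slices print the same text; only the unit of the indices differs.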

#### NER Tags: `ent.label_`

<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>
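
The descriptions in this table can also be looked up programmatically with `spacy.explain`, which does not need a trained model. A small sketch; the `labels` list is just a hand-picked subset of the tags above.

```python
import spacy

# Hand-picked subset of the NER tags listed in the table.
labels = ["PERSON", "GPE", "ORG", "MONEY", "DATE"]

# Map each tag to spaCy's glossary description.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8} {description}")
```

With a loaded pipeline, the tags a model can actually predict are available via `nlp.get_pipe('ner').labels`.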



## 2. Examples of Named Entities

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
# Write a function to display basic entity info
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [4]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

In [5]:
show_ents(doc)

Washington - GPE - Countries, cities, states
DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [6]:
doc = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')

In [7]:
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

500 dollars 4 6 20 31 MONEY
Microsoft 11 12 53 62 ORG


In [8]:
show_ents(doc)

500 dollars - MONEY - Monetary values, including unit
Microsoft - ORG - Companies, agencies, institutions, etc.


## 3. Adding New Entities

### 3.1 Adding Single Term Entities

In [15]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')

In [16]:
# We see that Tesla is not recognized as a named entity: 
# we can add it to the list of the named entities!
show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [17]:
from spacy.tokens import Span

# Get the hash value of the ORG entity label
# Look at the list provided above or in the docs
ORG = doc.vocab.strings[u'ORG']  

# Create a Span for the new entity
# doc - the name of the Doc object
# 0 - the start index position of the span
# 1 - the stop index position (exclusive)
# label=ORG - the label assigned to our entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object.
# doc.ents is a tuple, so we build a new list and reassign it
doc.ents = list(doc.ents) + [new_ent]

In [18]:
# Now Tesla is recognized as a named entity
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
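
In recent spaCy versions the label can also be passed to `Span` as a plain string, which skips the manual hash lookup. The sketch below assumes spaCy v3 and a blank pipeline (no trained model), so `Tesla` ends up as the only entity; it also shows that `doc.ents` is a tuple and has to be reassigned as a whole. `blank_nlp` and `tesla_doc` are illustrative names.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no NER component.
blank_nlp = spacy.blank("en")
tesla_doc = blank_nlp("Tesla to build a U.K. factory for $6 million")

# Pass the label as a string; spaCy resolves the hash internally.
tesla = Span(tesla_doc, 0, 1, label="ORG")

# doc.ents is an immutable tuple: build a new sequence and reassign it.
tesla_doc.ents = list(tesla_doc.ents) + [tesla]

print(type(tesla_doc.ents).__name__)
print([(ent.text, ent.label_) for ent in tesla_doc.ents])
```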


### 3.2 Adding Entities with Several Terms: Phrases

In [27]:
# We want to add the two variations of "vacuum cleaner" as PRODUCT entities
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
          u'If successful, the vacuum cleaner will be our first product.')

In [28]:
show_ents(doc)

first - ORDINAL - "first", "second", etc.


In [29]:
# To that end, we need to match the variations in the Doc
# Import PhraseMatcher and create a matcher object
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [30]:
# Create the desired phrase patterns
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

In [31]:
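# Add the patterns to our matcher object.
# This is the spaCy v2 signature; in spaCy v3 the call is
# matcher.add('newproduct', phrase_patterns)
matcher.add('newproduct', None, *phrase_patterns)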

# Apply the matcher to our Doc object
matches = matcher(doc)

# See what matches occur
matches

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
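
Each match is a `(match_id, start, end)` tuple: `match_id` is the hash of the pattern name, and `start`/`end` are token indices. Below is a sketch of decoding them, assuming spaCy v3 (where `add` takes the patterns as a list) and a blank pipeline; `blank_nlp`, `demo_doc` and `demo_matcher` are illustrative names.

```python
import spacy
from spacy.matcher import PhraseMatcher

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Our company plans to introduce a new vacuum cleaner. "
                     "If successful, the vacuum cleaner will be our first product.")

demo_matcher = PhraseMatcher(blank_nlp.vocab)
# make_doc only tokenizes, which is all a phrase pattern needs.
patterns = [blank_nlp.make_doc(text) for text in ["vacuum cleaner", "vacuum-cleaner"]]
demo_matcher.add("newproduct", patterns)  # spaCy v3 signature

for match_id, start, end in demo_matcher(demo_doc):
    name = blank_nlp.vocab.strings[match_id]  # hash -> pattern name
    print(name, start, end, demo_doc[start:end].text)
```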

In [32]:
# Here we create Spans from each match, and create named entities from them
from spacy.tokens import Span

# Create the label (look at the provided list above/docs)
PROD = doc.vocab.strings[u'PRODUCT']

# New entities, created with a list comprehension
# Each match is a (match_id, start, end) tuple
new_ents = [Span(doc, start, end, label=PROD) for _, start, end in matches]

# doc.ents is a tuple, so build a new list and reassign it
doc.ents = list(doc.ents) + new_ents

In [33]:
show_ents(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.
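
Assigning to `doc.ents` raises a `ValueError` if two spans share a token, which can happen when a matched phrase overlaps an entity the model already found. One way to guard against this, sketched below under the assumption of spaCy v3, is `spacy.util.filter_spans`, which keeps the longest span among overlapping candidates. The text and spans are hand-built for illustration, and all variable names are hypothetical.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("We sell the first vacuum cleaner.")
# Tokens: We(0) sell(1) the(2) first(3) vacuum(4) cleaner(5) .(6)

ordinal = Span(demo_doc, 3, 4, label="ORDINAL")  # "first"
shorter = Span(demo_doc, 4, 5, label="PRODUCT")  # "vacuum"
longer = Span(demo_doc, 4, 6, label="PRODUCT")   # "vacuum cleaner"

try:
    # shorter and longer share token 4, so this assignment is rejected.
    demo_doc.ents = [ordinal, shorter, longer]
except ValueError as err:
    print("overlap rejected:", type(err).__name__)

# filter_spans drops the shorter of the two overlapping spans.
demo_doc.ents = filter_spans([ordinal, shorter, longer])
print([(ent.text, ent.label_) for ent in demo_doc.ents])
```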


## 4. Counting Named Entities

In [34]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')

In [35]:
show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [36]:
len([ent for ent in doc.ents if ent.label_=='MONEY'])

2
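
To count every label at once rather than one label at a time, `collections.Counter` works well. This is a sketch assuming spaCy v3 with a blank pipeline: the sentence is made up and the entities are hand-assigned, whereas with a trained model `doc.ents` would already be filled in.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Apple paid 5 dollars and Sony paid 7 dollars")

# Hand-assigned entities (a trained model would predict these).
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="ORG"),    # Apple
    Span(demo_doc, 2, 4, label="MONEY"),  # 5 dollars
    Span(demo_doc, 5, 6, label="ORG"),    # Sony
    Span(demo_doc, 7, 9, label="MONEY"),  # 7 dollars
]

# Tally entities per label in a single pass.
label_counts = Counter(ent.label_ for ent in demo_doc.ents)
print(label_counts)
```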

## 5. Visualizing Named Entities

In [37]:
nlp = spacy.load('en_core_web_sm')

In [38]:
# Import the displaCy library
from spacy import displacy

In [40]:
# We pass two sentences
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
          u'By contrast, Sony sold only 7 thousand Walkman music players.')

In [43]:
# With style='ent', displaCy highlights the named entities in the text
displacy.render(doc, style='ent', jupyter=True)

In [42]:
# We can also visualize the text sentence by sentence
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

In [44]:
# With options, we can filter the entity classes/types we'd like to visualize
options = {'ents': ['ORG', 'PRODUCT']}
displacy.render(doc, style='ent', jupyter=True, options=options)

In [45]:
# With options, we can also control the visualization color
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}
displacy.render(doc, style='ent', jupyter=True, options=options)
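
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it, which makes it possible to save the visualization to a file. This is a sketch assuming spaCy v3, a blank pipeline with hand-assigned entities, and a hypothetical output file `entities.html` in the system's temporary directory.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("Apple sold iPods.")

# Hand-assigned entities (a trained model would predict these).
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="ORG"),      # Apple
    Span(demo_doc, 2, 3, label="PRODUCT"),  # iPods
]

# jupyter=False returns the markup; page=True wraps it in a full HTML page.
html = displacy.render(demo_doc, style="ent", jupyter=False, page=True)

# Hypothetical output location, chosen only for this example.
output_path = Path(tempfile.gettempdir()) / "entities.html"
output_path.write_text(html, encoding="utf-8")
```

The saved file can then be opened in any browser. The `options` dictionary shown above can be passed in the same way.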