<a href="https://colab.research.google.com/github/krisograbek/named_entities_intro/blob/main/Named_Entities_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

Important: As of today (19 May 2021), Colab ships with spaCy 2, while this notebook uses code compatible with spaCy 3. If running the next cell installs spaCy 3, please restart the runtime and run the cell again.

In [1]:
import spacy

# Compare only the major version, so that e.g. "10.x" is not read as "1".
if int(spacy.__version__.split(".")[0]) < 3:
    !pip install -U pip setuptools wheel
    !pip install -U spacy
    !python -m spacy download en_core_web_sm
    print(spacy.__version__)
    print("Please restart runtime")
else:
    print(spacy.__version__)

3.0.6


In [2]:
nlp = spacy.load("en_core_web_sm")

# Simple, yet Powerful: A Smooth Introduction to Named Entities

Named Entities are simple to detect, yet they are widely used in NLP applications. They power recommendation systems, document indexing, keyword extraction, and much more. There are open-source tools that recognize Named Entities in text. In this article, we'll take a closer look at the one provided by spaCy.

## Table of contents:

1. [Named Entities](#ne)
2. [Named Entity Recognition](#ner)
 - [Use cases](#use-cases)
 - [Built-in Named Entities in spaCy](#built-in)
3. [Information Extraction](#ie)
4. spaCy
 - Built-in Entities
 - Show Entities as text
 - Show Entities with displacy
 - Relations with Matcher
5. [Final Thoughts](#wrap-up)
6. [References](#refs)



## Named Entities <a name="ne"></a>

**Named Entities** are real-world objects, such as persons, locations, and organizations. 
Examples:
 - **Names**: Michael Jordan, Barack Obama, Lady Gaga, Bugs Bunny, Nicole Kidman
 - **Locations**: Russia, Africa, Los Angeles, Mount Everest
 - **Organizations**: Google, NBA

## Named Entity Recognition <a name="ner"></a>

**Named Entity Recognition** is an NLP technique that consists of 2 steps:
 - identifying Named Entities in a text
 - classifying them into predefined categories

Initially, the idea was to spot "names" in a text. Names are relatively easy to find because they are proper nouns and hence capitalized in English. Basic regex knowledge is enough to detect such words. Proper nouns help find **people, countries, cities, organizations, or companies**. 
Similarly, finding numbers along with punctuation and symbols isn't a challenge with a simple regex. Some NER applications take advantage of this fact and are able to detect **cardinals, dates, monetary values, percentages**, or other **quantities**. 
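
To make the regex idea concrete, here is a minimal sketch using only Python's `re` module. The two patterns, the `find_candidates` helper, and the sample sentence are illustrative assumptions, not what spaCy does internally. The approach is deliberately naive: it would also pick up any capitalized word at the start of a sentence, which is one reason real NER systems rely on trained models.

```python
import re

# Assumption: a "name" is one or more consecutive capitalized words,
# e.g. "Michael Jordan" or "Los Angeles".
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

# Numbers with an optional leading "$" and an optional trailing "%",
# e.g. "1998", "$5", "12.5%".
NUMBER_RE = re.compile(r"\$?\d+(?:[.,]\d+)*%?")

def find_candidates(text):
    """Return (names, numbers) found by the two naive patterns."""
    return NAME_RE.findall(text), NUMBER_RE.findall(text)

# Hypothetical sample sentence, chosen so that it starts with a name.
sample = "Michael Jordan moved to Los Angeles in 1998 and paid $5 for a 12.5% discount."
names, numbers = find_candidates(sample)
print(names)    # ['Michael Jordan', 'Los Angeles']
print(numbers)  # ['1998', '$5', '12.5%']
```

Note that this only *identifies* candidates. Deciding whether "Michael Jordan" is a person and "Los Angeles" a location is the classification step, which a regex alone cannot do.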

## Use Cases <a name="use-cases"></a>

Named Entity Recognition (NER) is useful in many tasks:
 - Entities in Emails
 - Listing Tags from Articles
 - Content Recommendation
 - and many more...

## spaCy's built-in Named Entities <a name="built-in"></a>

The `en_core_web_sm` pipeline loaded above recognizes 18 different entity labels:

In [3]:
for label in nlp.get_pipe("ner").labels:
    print("{0:>12} - {1} ".format(label, spacy.explain(label)))

    CARDINAL - Numerals that do not fall under another type 
        DATE - Absolute or relative dates or periods 
       EVENT - Named hurricanes, battles, wars, sports events, etc. 
         FAC - Buildings, airports, highways, bridges, etc. 
         GPE - Countries, cities, states 
    LANGUAGE - Any named language 
         LAW - Named documents made into laws. 
         LOC - Non-GPE locations, mountain ranges, bodies of water 
       MONEY - Monetary values, including unit 
        NORP - Nationalities or religious or political groups 
     ORDINAL - "first", "second", etc. 
         ORG - Companies, agencies, institutions, etc. 
     PERCENT - Percentage, including "%" 
      PERSON - People, including fictional 
     PRODUCT - Objects, vehicles, foods, etc. (not services) 
    QUANTITY - Measurements, as of weight or distance 
        TIME - Times smaller than a day 
 WORK_OF_ART - Titles of books, songs, etc. 


## Information Extraction <a name="ie"></a>

**Information Extraction** is the NLP task of transforming unstructured data (text) into a structured form, such as a table or a database-like record. The goal is to organize information so that it can be easily processed. Named Entity Recognition is a subtask of Information Extraction.
Finding relations between Named Entities adds further possibilities and flexibility to information extraction. The process of finding relations between Named Entities is called **Relation Extraction**.

## Functions

In [4]:
from spacy import displacy

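def displacy_filter(text, marked_ents=None):
    """Render the entities in `text`, optionally limited to the labels in `marked_ents`."""
    doc = nlp(text)
    # An empty options dict makes displacy highlight every entity label.
    options = {"ents": marked_ents} if marked_ents else {}
    displacy.render(doc, style="ent", jupyter=True, options=options)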

### Extracting Named Entities from an email

Extracting entities from emails brings the following benefits:
 - quickly detecting the subject and most important information
 - automatically triggering actions, e.g. adding meetings or events like flights to digital calendars

This example email contains a few Named Entities. Let's see if they get recognized. spaCy lists Named Entities in `doc.ents`.

In [5]:
email = """
Subject: Meeting with Rep. Walton, Re: Water Restrictions; Reply Requested
Dear Mr. Wood,
I am writing on behalf of House Representative Jesse Walton to set up a meeting 
with you to discuss the water restrictions in Temple Terrace. 
He is available to meet at either 10:30 a.m., 11 a.m. or 4:30 p.m. 
next Tuesday, August 15 at his office, 3278 W. 14th Street, Tampa.
Please confirm a meeting time at your earliest convenience. 
Thank you,
Shailene Cobb
Assistant to H.R. Jesse Walton
"""

In [6]:
doc = nlp(email)
for ent in doc.ents:
    print("{0:>15} - {1}".format(ent.text, ent.label_))

         Walton - PERSON
           Wood - PERSON
          House - ORG
   Jesse Walton - PERSON
 Temple Terrace - ORG
     10:30 a.m. - TIME
        11 a.m. - TIME
      4:30 p.m. - TIME
  next Tuesday, - DATE
           3278 - CARDINAL
          Tampa - GPE
  Shailene Cobb - PERSON
           H.R. - GPE


We see every entity and its label recognized by spacy.
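
To connect this with Information Extraction, the flat list can be turned into a structured record. The sketch below groups the entities by label, the kind of step that could precede adding a meeting to a calendar. The `(text, label)` pairs are copied from the output above so that the block runs without spaCy; in the notebook they could come from `[(ent.text, ent.label_) for ent in doc.ents]`. The `group_by_label` helper is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Walton", "PERSON"),
    ("Wood", "PERSON"),
    ("House", "ORG"),
    ("Jesse Walton", "PERSON"),
    ("Temple Terrace", "ORG"),
    ("10:30 a.m.", "TIME"),
    ("11 a.m.", "TIME"),
    ("4:30 p.m.", "TIME"),
    ("next Tuesday,", "DATE"),
    ("3278", "CARDINAL"),
    ("Tampa", "GPE"),
    ("Shailene Cobb", "PERSON"),
    ("H.R.", "GPE"),
]

def group_by_label(pairs):
    """Group entity texts by label, keeping their order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

record = group_by_label(entities)
print(record["TIME"])  # ['10:30 a.m.', '11 a.m.', '4:30 p.m.']
print(record["DATE"])  # ['next Tuesday,']
```

With the proposed meeting times and the date in separate fields, a downstream step could offer them as calendar suggestions.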

## Named Entities with Displacy

Listing Entities is a nice feature, but spaCy doesn't stop there. We all like visuals. spaCy provides `displacy`, a built-in visualizer that highlights all detected Entities directly in the text. It's easy to use and well designed. 


In [7]:
displacy_filter(email)

You have to agree this looks good :) The visualization also makes it easier to analyse the NER performed by spaCy. It detected many Named Entities, but the results aren't perfect. It missed August 15, labeled the city of Temple Terrace as ORG, and classified H.R. as GPE. Still, this isn't bad for 2 lines of code. 

## Finding Tags for Articles

The following text is an NBA recap article. It also contains quite a few Named Entities. Feel free NOT to read it :)

In [8]:
text = """Curry became the oldest scoring champion since Michael Jordan at age 35 in 1998, finishing with 46 points as Golden State held off the Memphis Grizzlies 113-101 on Sunday in a regular-season finale that determined the play-in tournament's eighth and ninth spots.
Curry is still playing his best basketball in his 12th NBA season, and the two-time MVP and his Golden State supporting cast still have a chance at making a splash in the playoffs the way the underdog Davis-led group did in 2007.
The Warriors wrapped up the No. 8 seed and will visit LeBron James and the Los Angeles Lakers on Wednesday. Memphis finishes at No. 9 and will host the Spurs on Wednesday.
At 33, Curry and Jordan are the only scoring champions age 33 or older. Curry also joins Jordan, Wilt Chamberlain and Kareem Abdul-Jabbar as the only players with multiple scoring titles, MVPs and championships.
Curry locked up his first scoring title since 2015-16 with his second basket of the game late in the first quarter. 
He made 9 of 22 3s and also contributed nine assists and seven rebounds in Golden State's sixth straight win. Curry averaged 32 points during his second 2,000-point season.
After a slow start, Curry finished with key 3-pointers with 3:17, 2:12 and 1:35 left. 
He wound up with 337 3-pointers after making 96 in April alone for an NBA record in a single month.
Jonas Valanciunas had 29 points and 16 rebounds for the Grizzlies. Ja Morant added 16 points and nine assists.
Dillon Brooks started the fourth quarter on an 8-0 run all by himself to pull Memphis within 86-77 — and Curry immediately returned, hitting a 3 with 8:23 to go.
It marked quite a special day for outgoing Warriors President and COO Rick Welts, who is departing the position after 10 years with the franchise. 
He led the construction plans for second-year Chase Center and will stay on in an advisory role. 
Kerr spoke on the court before tip-off, both thanking the fans and honoring all that former Seattle SuperSonics ball boy Welts has done for the NBA and Warriors.
Curry missed his first four shots and initial three 3s before a layup with 3:45 remaining in the opening quarter. 
He then hit a jumper the next time down to become scoring champion — drawing roars and a standing ovation.
"""

I could list all Named Entities here but it wouldn't be helpful. I'll go directly to displacy.

In [9]:
displacy_filter(text)

Again, spaCy does a pretty good job detecting entities. There are a few misclassifications, e.g. Jordan as a GPE. However, Jordan is also a country in the Middle East, so it's not a big mistake. NBA games consist of four 12-minute quarters, and phrases such as *the first quarter* or *the opening quarter* are misclassified as dates. However, since we also split years into four quarters, I'm perfectly fine with such a mistake and wouldn't expect any other output.

## Filtering Named Entities with displacy

As the previous example shows, a text can contain many Named Entities. Sometimes that is simply too much: not all of them are important, and we don't need to see them all. displacy can handle that, too. We just need to pass `options` a list of the labels that we want to see.
```
marked_ents = ["PERSON", "ORG"]
options = {"ents": marked_ents}
displacy.render(doc, style="ent", jupyter=True, options=options)
```


In [10]:
displacy_filter(text, ["PERSON", "ORG"])

Filtering by label is also a good starting point for creating tags for our articles.

In [11]:
from collections import Counter

doc = nlp(text)
persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
coll = Counter(persons)
tags = list(coll)[:6]  # first six distinct names, in order of appearance
print(tags)

['Curry', 'Michael Jordan', 'Davis', 'LeBron James', 'Jordan', 'Kareem Abdul-Jabbar']
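
The cell above keeps the first six distinct names in order of appearance, so the counts stored in the `Counter` are not actually used. If tags should instead reflect how often a person is mentioned, `Counter.most_common` is one option. The sketch below uses a short, hypothetical list of mentions rather than the real model output, and `top_tags` is a helper name introduced for illustration.

```python
from collections import Counter

# Hypothetical PERSON mentions, standing in for
# [ent.text for ent in doc.ents if ent.label_ == "PERSON"].
persons = ["Curry", "Michael Jordan", "Curry", "LeBron James", "Curry", "Jordan", "Jordan"]

def top_tags(mentions, n=3):
    """Return the n most frequent mentions, most frequent first."""
    return [name for name, _count in Counter(mentions).most_common(n)]

tags = top_tags(persons, n=2)
print(tags)  # ['Curry', 'Jordan']
```

Notice that "Jordan" and "Michael Jordan" are counted as different strings here. Merging such variants would need an extra normalization step that this sketch does not attempt.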


## Content Recommendation

The internet is full of recommendation systems, and a lot of them rely on Named Entity Recognition. The purpose of a recommendation system is to improve the customer experience: based on the search history, it learns a user's preferences and makes suggestions. How does it work? Every piece of content that users consume has its own categories, i.e. key subjects. Recommendation systems learn about users through what they consume and suggest the most similar content. Would Netflix be so successful without its recommendation system? Arguably not. Another example: I hardly ever type anything into YouTube's search box. I use YouTube to listen to music, watch NBA highlights, or follow content about investing, and I don't have to bother searching for any of it anymore.
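
As a toy illustration of "suggest the most similar content", the sketch below ranks items by how many named-entity tags they share with a user's history, using Jaccard similarity. This is a simplified assumption for illustration only and not how Netflix or YouTube actually work. The catalog, the history, and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Share of tags that two collections have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def recommend(history_tags, catalog, k=2):
    """Return up to k catalog titles ranked by tag overlap with the history."""
    scored = [(jaccard(history_tags, tags), title) for title, tags in catalog.items()]
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Items with no shared tags are never recommended.
    return [title for score, title in scored[:k] if score > 0]

# Hypothetical catalog: each item is described by its named-entity tags.
catalog = {
    "Warriors beat Grizzlies": {"Curry", "Golden State", "Memphis"},
    "Lakers season preview": {"LeBron James", "Los Angeles Lakers", "NBA"},
    "Index funds explained": {"Vanguard", "S&P 500"},
}
history = {"Curry", "NBA", "Golden State"}
print(recommend(history, catalog))
# ['Warriors beat Grizzlies', 'Lakers season preview']
```

The tags for each item could be produced by the same entity extraction used for article tags earlier in this notebook.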

# Final Thoughts <a name="wrap-up"></a>

I tried to show how Named Entities are helpful in various NLP tasks. Despite the simplicity of detecting them, they bring a lot of value. In this article, I only scratched the surface of how to use Named Entities, and I only mentioned Relation Extraction in passing. There is much more, e.g.:
- Indexing documents
- Extracting keywords from résumés
- Categorizing tickets for customer support

So whenever you approach an NLP problem that requires categorizing text or extracting key information, remember Named Entities. 

# References <a name="refs"></a>

- [spacy DocEnts](https://spacy.io/api/doc#ents)
- [MonkeyLearn](https://monkeylearn.com/blog/named-entity-recognition/)
- [Wikipedia](https://en.wikipedia.org/wiki/Named_entity#:~:text=In%20information%20extraction%2C%20a%20named,or%20have%20a%20physical%20existence.)