# Entity Extratction with spaCy

## Overview

We're going to look for all the people mentioned in a pile of documents.

### Entites

"Entities" in documents are, generally, names -- names of people, places, and things such as companies. Finding out which entities are mentioned in a trove of documents can be pretty helpful, especially when you don't previously _know_ someone or some place is included the document.

There are services online that do this kind of extraction, including [DocumentCloud](https://www.documentcloud.org/) ([see how here](https://www.documentcloud.org/faq#faq-analyzing-1)), [Amazon Comprehend](https://aws.amazon.com/comprehend/features/) and [Google Natural Language](https://cloud.google.com/natural-language/).

### Using spaCy

We're going to do our entity extraction right here in our notebook using a pre-trained natural language model called [spaCy](https://spacy.io/). Specifically, we're using the spaCy [large English language model](https://spacy.io/models/en#en_core_web_lg) trained on the [OntoNotes dataset](https://catalog.ldc.upenn.edu/LDC2013T19) -- a trove of "telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, weblogs" that includes nearly 1.5 million English words.  

The spaCy project has a lot of great language features. We'll be looking at the [named entities feature](https://spacy.io/usage/linguistic-features#named-entities). Note also that there are [models for several languages](https://spacy.io/models) being developed in spaCy.


## The Plan

- We'll download the spaCy software and the large English language model.
- We'll also download a (smallish) pile of emails released in a court case.
- We'll learn how to use spaCy functions to extract entities
- We'll use the spaCy functions to scan all the pages of the emails.

## Credits

This notebook was written by John Keefe [Quartz](https://qz.com) at Quartz and includes document-processing code written included in [a blog post](https://qz.ai/discovering-interesting-documents-in-the-mauritius-leaks/) and a [Jupyter notebook](https://github.com/Quartz/aistudio-doc2vec-for-investigative-journalism/blob/master/Doc2vec%20for%20Investigative%20Journalism.ipynb) by Jeremy B. Merrill at Quartz, who used it to help find documents inside a document dump known as the [Mauritius Leaks](https://qz.com/1670632/how-quartz-used-ai-to-help-reporters-search-the-mauritius-leaks/).  

-- John Keefe, [Quartz](https://qz.com), October 2019

## Setup

### For those using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes), or if you use one for more than 12 hours.

If you're using Google Colaboratory, be sure to set your runtime to "GPU" which speeds up your notebook for machine learning:

![change runtime](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/change_runtime_2.jpg)
![pick gpu](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/pick_gpu_2.jpg)

### Everybody do this ...

Everyone needs to run the next cell, which initializes the Python libraries we'll use in this notebook.

In [1]:
## *EVERYBODY* SHOULD RUN THIS CELL
## This can take up to 3 minutes ... but that's normal
%cat /usr/local/cuda/version.txt

!pip install -U spacy --quiet
!python -m spacy download en_core_web_lg
!pip install PyPDF2 --quiet

import spacy
import en_core_web_lg
import PyPDF2
import json
from os.path import exists

CUDA Version 10.0.130
[K     |████████████████████████████████| 10.2MB 4.9MB/s 
[K     |████████████████████████████████| 3.7MB 46.9MB/s 
[K     |████████████████████████████████| 122kB 52.3MB/s 
[K     |████████████████████████████████| 2.1MB 41.8MB/s 
[?25hCollecting en_core_web_lg==2.2.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.0/en_core_web_lg-2.2.0.tar.gz (827.9MB)
[K     |████████████████████████████████| 827.9MB 1.1MB/s 
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.0-cp36-none-any.whl size=829187831 sha256=ff8c2e0d2e093c1c26d3e90cb73bfd6ef89c52d8611927687f643d5c44833712
  Stored in directory: /tmp/pip-ephem-wheel-cache-60n9aon3/wheels/9f/3c/d6/3ade7ed8195030f4d7f299cf73d856a84d7b3effd5890133fb
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Succ

## The Data

In this tutorial, we're going to look at some emails from the office of New York City mayor Bill de Blasio that were released under the Freedom of Information Law. 

The emails were part of the ["Agent of the City" hubbub](https://www.ny1.com/nyc/all-boroughs/news/2018/05/24/agents-of-the-city-emails-released), in which 4,000 city emails were released. You can download the [original file here](https://a860-openrecords.nyc.gov/response/120252?token=c784372fd140497081b4bfcff9f0e3a0) -- though we'll be using a file containing just [the first 100 pages](https://qz-aistudio-public.s3.amazonaws.com/workshops/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf) for this exercise. 

In [7]:
# Run this cell to download the data we'll use for this exercise
!wget -N https://qz-aistudio-public.s3.amazonaws.com/workshops/deblasio_emails_data.zip --quiet
!unzip -q deblasio_emails_data.zip
print('Done!')

Done!


Let's look at what we have.

In [0]:
%ls data/

## Trying the entity extraction feature

In [0]:
# First we load the model into the notebook
nlp = en_core_web_lg.load()

In [0]:
# Now let's give it a try
doc = nlp(u"San Francisco considers banning sidewalk delivery robots")


In [0]:
for entity in doc.ents:
    print(entity.text, entity.label_, spacy.explain(entity.label_))

There's [a whole list of entities spaCy can detect](https://spacy.io/api/annotation#named-entities)!

In [0]:
my_story = """
John drove his Volkswagen Golf north on Interstate 35 to Duluth, Minnesota,
where he stopped at the Aerial Lift Bridge and looked out over
Lake Superior. 
"""

doc = nlp(my_story)

## Load the emails into a "jsonl" file

JSONL is a file format that stores data in a JSON file, with each record living on its own line in the file.

This next block reads the PDF file and turns it into a JSONL file, which is much easier to work with.

In [0]:
# read the PDF file into a new file called 'nyc_docs.jsonl'
jsonl_file = "nyc_docs.jsonl"
if not exists(jsonl_file):
    pdf_file = open('data/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    with open(jsonl_file, 'w') as f:
        for page_num in range(read_pdf.getNumPages()):
            page = read_pdf.getPage(page_num)
            page_content = page.extractText().encode('utf-8').decode("utf-8") 
            f.write(json.dumps({"_source": {"content": page_content}, "_id": f"p{page_num+1}"}) + "\n")

In [0]:
# let's take a look at the first few lines of the file
!head nyc_docs.jsonl

Each line in the JSON file now represents a single page in the original document. So now we'll step through each line (aka page) and grab all the entities in the text. Then we'll print out all the entities.

## Finding and listing the names

In [0]:
with open(jsonl_file, 'r') as f:        # open the jsonl file
    for line in f:                      # loop through each line ...
        line = json.loads(line)            # read the line 
        text = line["_source"]["content"]  # grab the text of the email
        page_number = line["_id"]          # grab the page number we're on
        doc = nlp(text)                    # load the text into the nlp model
        for ent in doc.ents:               # loop through each entity in the text...
            if (ent.label_ == "PERSON"):      # if the entity is a person's name ...
                print(page_number, ent.text)  # print the page number and the name

Really we want a list of _names_ not pages, right?

In [0]:
list_of_names = {}

with open(jsonl_file, 'r') as f:
    for line in f:
        line = json.loads(line)
        text = line["_source"]["content"]
        page_number = line["_id"]
        doc = nlp(text)
        
        # loop through the entities in the page
        for ent in doc.ents:
            
            # is the entity is a person ...
            if (ent.label_ == "PERSON"):
                
                # check if we already have this entity
                if ent.text in list_of_names:
                    
                    # add this page to the entity's list of pages
                    list_of_names[ent.text] += " " + page_number
                    
                else:
                    
                    # otheriwise start a list of pages
                    list_of_names[ent.text] = page_number

In [0]:
list_of_names

In [0]:
for name, pages in sorted(list_of_names.items()):
    print(name, "(" + pages + ")" )

Once you know a name is _there_ then you can search for it [in the original document](https://qz-aistudio-public.s3.amazonaws.com/workshops/2018.05.24_BerlinRosen_Responsive_Records_100pgs.pdf).

# Detecting Document Similarity with spaCy

## The Data

All of the data and the installations steps still apply.

## Credits

I found [this Medium post super helpful](https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c), and used some of the code described there.

## Code


In [0]:
# From https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c

def process_text(text):
    doc = nlp(text.lower())
    result = []
    for token in doc:
        if token.text in nlp.Defaults.stop_words:
            continue
        if token.is_punct:
            continue
        if token.lemma_ == '-PRON-':
            continue
        if "$" in token.text:
            continue
        if "#" in token.text:
            continue
        result.append(token.lemma_)
    return " ".join(result)

In [8]:
my_story

'\nJohn drove his Volkswagen Golf north on Interstate 35 to Duluth, Minnesota,\nwhere he stopped at the Aerial Lift Bridge and looked out over\nLake Superior. \n'

In [13]:
my_story_cleaned = process_text(my_story)
my_story_cleaned

'\n john drive volkswagen golf north interstate 35 duluth minnesota \n stop aerial lift bridge look \n lake superior \n'

In [0]:
doc = nlp(my_story_cleaned)

In [15]:
print(doc.vector)

[ 1.01778105e-01  2.05319315e-01  6.73728958e-02 -2.20699400e-01
  3.70768964e-01  6.27897680e-05  1.58605985e-02  5.98380081e-02
  5.01086004e-02  1.24153340e+00 -2.20715478e-01 -1.25634655e-01
  6.20462485e-02 -3.26115526e-02 -2.36044571e-01 -9.01962072e-02
 -1.12321198e-01  1.00773060e+00 -3.90413441e-02 -1.21259749e-01
  5.85537739e-02 -4.95426506e-02  1.21275544e-01 -7.99641535e-02
  5.94000928e-02  4.94726002e-02 -9.45585519e-02 -3.74344997e-02
 -4.61814478e-02  2.42259949e-01 -1.78351998e-02  8.88676420e-02
 -5.56827560e-02  2.93218531e-02  5.01873977e-02 -8.63832012e-02
  5.04705422e-02 -7.31010512e-02 -6.32342547e-02  7.33639747e-02
 -6.43608570e-02 -6.48629516e-02  3.13216262e-02  1.23825409e-01
  6.50239736e-03 -1.29929511e-02 -5.02037182e-02  2.12189499e-02
 -2.44025476e-02  8.08378980e-02  1.77441575e-02  3.86951454e-02
  1.16834000e-01 -8.08318425e-03  9.68424574e-05  2.53466014e-02
 -8.64771307e-02  1.10955402e-01  7.95033649e-02  3.20553035e-02
 -2.00200111e-01 -1.56275

In [0]:
list_of_vectors = []
list_of_text = []

with open(jsonl_file, 'r') as f:        # open the jsonl file
    for line in f:                      # loop through each line ...
        line = json.loads(line)            # read the line 
        text = line["_source"]["content"]  # grab the text of the email
        cleaned_text = process_text(text)  # clean up the text
        doc = nlp(cleaned_text)            # load the text into the nlp model
        list_of_vectors.append(doc.vector)
        list_of_text.append(line)
        print(cleaned_text)
        print("---")

In [0]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [29]:
list_of_text[0]

{'_id': 'p1',
 '_source': {'content': '  THE CITY OF NEW YORK\n OFFICE OF THE MAYOR\n NEW YORK, NY 10007\n  May \n24, 2018  Dear \nRequester\n,  This letter is in\n response to \nprevious \nrequest\ns pursuant to the Freedom of Information Law \nreceived\n by this Office, seeking\n generally\n  Correspondence between \nthe Office of the Mayor and \nJonathan Rosen or \nBerlinRosen.\n  Due to the number of FOIL requests the MayorÕs Office has received for similar \ncommunications, as a courtesy the documents being disclosed to you today include materials that \nare outside the scope\n of your requests.\n !The responsive records comprise four volumes of material:\n A. Pages 3\n-729: Material previously \nwithheld in full or in part\n pursuant to the inter\n-agency \nexemption \n¤87(2\n)(g) within the time range. This volume also includes 73 pages of \nmateria\nl previously withheld in full pursuant to \n¤87(2\n)(b).\n  Range: January 1, 2014 to April 3, 2015.\n B. Pages 730\n-2844: Materi

In [30]:
list_of_text[96]

{'_id': 'p97',
 '_source': {'content': 'acai bowl master class sooday, september 7 á 9:30 am sweetgreen nomad sunset yoga tuesday, september 9 á 7:00pm pier 25 Want to receive passport alerts for a different region? Update your preferences. august passport copyright© 2014 sweetgreen 829 7th st nw washington. de 20001 unsubscribe 1 view in browser '}}

In [37]:
list_of_vectors[96].shape

(300,)

In [35]:
single_item = list_of_vectors[96].reshape(1,300)
all_items = np.array(list_of_vectors)
similarities = cosine_similarity(single_item, list_of_vectors, dense_output=False)
similarities

array([[0.72411454, 0.5992696 , 0.74719894, 0.77367425, 0.6525796 ,
        0.7291043 , 0.37880737, 0.73526955, 0.68038476, 0.6802126 ,
        0.72070014, 0.6816759 , 0.77796584, 0.7404872 , 0.79079974,
        0.70257735, 0.5037015 , 0.7428045 , 0.7713428 , 0.7088518 ,
        0.48074597, 0.78784287, 0.74586225, 0.42302385, 0.52736354,
        0.79586506, 0.37276882, 0.        , 0.        , 0.81597537,
        0.55032504, 0.7607349 , 0.77962303, 0.5618724 , 0.31616932,
        0.31600136, 0.7805393 , 0.6916374 , 0.6909097 , 0.6975071 ,
        0.72139496, 0.7090056 , 0.7369091 , 0.753876  , 0.74646807,
        0.7297063 , 0.64439297, 0.52973866, 0.747718  , 0.6919215 ,
        0.5654484 , 0.42426598, 0.7491683 , 0.6675257 , 0.6792298 ,
        0.7851023 , 0.        , 0.6576525 , 0.7488511 , 0.6674544 ,
        0.5199123 , 0.        , 0.        , 0.        , 0.7771566 ,
        0.4403249 , 0.5913658 , 0.7201503 , 0.73884267, 0.        ,
        0.75180197, 0.69269425, 0.6223636 , 0.70

In [39]:
related_docs_indices = similarities[0].argsort()[:-5:-1]
print (related_docs_indices)

[96 99 29 25]


In [41]:
list_of_text[29]

{'_id': 'p30',
 '_source': {'content': 'From: To: Cc: Jonathan Rosen Drew. Chloe Hatch. Peter Subject: Re: Favor/totally fine if no Date: Wednesday, January 29, 2014 10:23:50 AM Thanks much. Sent from my iPhone On Jan 29, 2014, at 9:11AM, "Drew, Chloe" <CDrew@cicyhall.nyc.gov> wrote: Of course and done! From: Jonathan Rosen [ \nmailto:Jonathan@berlinrosen.com] Sent: Tuesday, January 28, 2014 12:42 PM To: Drew, Chloe; Hatch, Peter Subject: FW: Favor/totally fine if no Could you pass this to the new commissioner. She is actually quite a person. colleague here at BR who worked on the campaign. From: Ben Wyskida Sent: Tuesday, January 28, 2014 12:11 PM To: Jonathan Rosen Subject: Fwd: Favor/totally fine if no and would be a get. Ben is my on? She\'s a winner from what I know --, has a ton of experience, ve1y committed. Begin f01warded message: From: Date: anuruy at To: Ben Wyskida -.> Subject: Favor/totally fine if no Hi Ben, DYCD has posted a job I\'m super interested in -Director of Comm

In [43]:
# and here we print out the actual values
print (similarities[0, related_docs_indices])

[1.0000004  0.9163209  0.81597537 0.79586506]
