# Part-of-speech Extraction with spaCy

## Overview

We're going to look for all the people mentioned in a pile of documents.

### Entities

"Entities" in documents are, generally, names -- names of people, places, and things such as companies. Finding out which entities are mentioned in a trove of documents can be pretty helpful, especially when you don't already _know_ that someone or some place is included in the document.

There are services online that do this kind of extraction, including [DocumentCloud](https://www.documentcloud.org/) ([see how here](https://www.documentcloud.org/faq#faq-analyzing-1)), [Amazon Comprehend](https://aws.amazon.com/comprehend/features/) and [Google Natural Language](https://cloud.google.com/natural-language/).

### Using spaCy

We're going to do our entity extraction right here in our notebook using a pre-trained natural language model called [spaCy](https://spacy.io/). Specifically, we're using the spaCy [large English language model](https://spacy.io/models/en#en_core_web_lg) trained on the [OntoNotes dataset](https://catalog.ldc.upenn.edu/LDC2013T19) -- a trove of "telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, weblogs" that includes nearly 1.5 million English words.  

The spaCy project has a lot of great language features. We'll be looking at the [named entities feature](https://spacy.io/usage/linguistic-features#named-entities). Note also that there are [models for several languages](https://spacy.io/models) being developed in spaCy.


## The Plan

- We'll download the spaCy software and the large English language model.
- We'll also download a (smallish) pile of emails released in a court case.
- We'll learn how to use spaCy functions to extract entities.
- We'll use the spaCy functions to scan all the pages of the emails.

## Credits

This notebook was written by John Keefe at [Quartz](https://qz.com) and includes document-processing code from [a blog post](https://qz.ai/discovering-interesting-documents-in-the-mauritius-leaks/) and a [Jupyter notebook](https://github.com/Quartz/aistudio-doc2vec-for-investigative-journalism/blob/master/Doc2vec%20for%20Investigative%20Journalism.ipynb) by Jeremy B. Merrill at Quartz, who used it to help find documents inside a document dump known as the [Mauritius Leaks](https://qz.com/1670632/how-quartz-used-ai-to-help-reporters-search-the-mauritius-leaks/).

-- John Keefe, [Quartz](https://qz.com), October 2019

## Setup

### For those using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes), or if you use one for more than 12 hours.

Note that there's a part of this notebook that doesn't seem to work with a GPU (it's the vector part below), so we'll stay on the CPU for now.

### Everybody do this ...

Everyone needs to run the next cell, which initializes the Python libraries we'll use in this notebook.

In [1]:
## *EVERYBODY* SHOULD RUN THIS CELL
## Running this can take 3-5 minutes ... but that's normal
## When it's done, you should see a line in green: 
## "✔ Download and installation successful"

!pipenv install spacy
!python -m spacy download en_core_web_lg

import spacy
import en_core_web_lg
import json
from os.path import exists

[39m[1mInstalling [32m[1mspacy[39m[22m…[39m[22m
[K[39m[1mAdding[39m[22m [32m[1mspacy[39m[22m [39m[1mto Pipfile's[39m[22m [31m[1m[packages][39m[22m[39m[1m…[39m[22m
[K[?25h✔ Installation Succeeded[0m 
[39m[1mInstalling [32m[1m--quiet[39m[22m…[39m[22m
[K[?25hTraceback (most recent call last):
  File "/Users/johnkeefe/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pkg_resources/_vendor/packaging/requirements.py", line 90, in __init__
    req = REQUIREMENT.parseString(requirement_string)
  File "/Users/johnkeefe/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1654, in parseString
    raise exc
  File "/Users/johnkeefe/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1644, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/Users/johnkeefe/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py", line 1402, in _parse

In [2]:
!pipenv install pandas

[39m[1mInstalling [32m[1mpandas[39m[22m…[39m[22m
[K[39m[1mAdding[39m[22m [32m[1mpandas[39m[22m [39m[1mto Pipfile's[39m[22m [31m[1m[packages][39m[22m[39m[1m…[39m[22m
[K[?25h✔ Installation Succeeded[0m 
[31m[1mPipfile.lock (7221ac) out of date, updating to (7ea8e8)…[39m[22m
[39m[22mLocking[39m[22m [31m[22m[dev-packages][39m[22m [39m[22mdependencies…[39m[22m
[39m[22mLocking[39m[22m [31m[22m[packages][39m[22m [39m[22mdependencies…[39m[22m
[KBuilding requirements...
[KResolving dependencies...
[K[?25h[32m[22m✔ Success![39m[22m[0m 
[39m[1mUpdated Pipfile.lock (7ea8e8)![39m[22m
[39m[1mInstalling dependencies from Pipfile.lock (7ea8e8)…[39m[22m
  🐍   [32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[39m[22m[32m[1m▉[

In [None]:
import pandas as pd

## The Data

This file was provided by Newslens and covers articles they downloaded for the previous two days starting about 7 pm Eastern on Monday, March 9, 2020.

In [2]:
# Run this cell to download the data we'll use for this exercise
!wget -N https://www.dropbox.com/s/qhnjczden1sfkby/data_dump_analysis.json?dl=0
print('Done!')

Done!


Let's look at what we have.

In [3]:
%ls 

'data_dump_analysis.json?dl=0'   sample_data/


In [None]:
%mv data_dump_analysis.json?dl=0 data_dump_analysis.json

In [14]:
%ls

data_dump_analysis.json  sample_data/
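
The rename above is needed because `wget` kept the `?dl=0` query string as part of the filename. The same cleanup can be sketched in plain Python with `pathlib`. The helper name `strip_query_suffix` is made up for this illustration and is not part of the original notebook.

```python
from pathlib import Path

def strip_query_suffix(path):
    """Rename 'name.json?dl=0' to 'name.json' and return the cleaned Path."""
    path = Path(path)
    # Keep only the part of the filename before the first "?"
    clean = path.with_name(path.name.split("?", 1)[0])
    # Only touch the disk if there is something to rename
    if clean != path and path.exists():
        path.rename(clean)
    return clean

# Example call, assuming the downloaded file sits in the current directory:
# strip_query_suffix("data_dump_analysis.json?dl=0")
```

If the file is already named cleanly, the function leaves it alone and simply returns the path.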


## Trying spaCy's entity extraction feature

In [None]:
# First we load the model into the notebook
nlp = en_core_web_lg.load()

In [None]:
# Now let's give it a try
doc = nlp(u"San Francisco considers banning sidewalk huge delivery robots")


Let's analyze the parts of speech, following [this spaCy documentation](https://spacy.io/usage/linguistic-features).

In [12]:
for token in doc:
    print(token.text, token.pos_)


 SPACE
John PROPN
drove VERB
his DET
blue ADJ
Volkswagen PROPN
Golf PROPN
north NOUN
on ADP
Interstate PROPN
35 NUM
to ADP
Duluth PROPN
, PUNCT
Minnesota PROPN
, PUNCT

 SPACE
where ADV
he PRON
stopped VERB
at ADP
the DET
Aerial PROPN
Lift PROPN
Bridge PROPN
and CCONJ
looked VERB
out ADP
over ADP

 SPACE
Lake PROPN
Superior PROPN
. PUNCT

 SPACE
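
The loop above prints spaCy's coarse part-of-speech label, `token.pos_`. Each token also carries a finer-grained label in `token.tag_`; for the English models these follow the Penn Treebank scheme, where `JJ` is a plain adjective, `JJR` a comparative and `JJS` a superlative. Here is a minimal sketch that groups adjectives by that tag. The helper name `adjectives_by_tag` is hypothetical, and it accepts any iterable of objects with `text`, `pos_` and `tag_` attributes, which includes a spaCy `Doc`.

```python
from collections import defaultdict

def adjectives_by_tag(doc):
    """Group lowercased adjectives by their fine-grained tag (JJ, JJR, JJS, ...)."""
    groups = defaultdict(list)
    for token in doc:
        if token.pos_ == "ADJ":
            groups[token.tag_].append(token.text.lower())
    return dict(groups)

# With the notebook's model already loaded as `nlp`, this could be called as:
# adjectives_by_tag(nlp("The biggest drop came after a bigger, earlier scare."))
```

Keeping the tag as the dictionary key makes it easy to pull out only the superlatives later with `.get("JJS", [])`.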


There's [a whole list of entities spaCy can detect](https://spacy.io/api/annotation#named-entities)!
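
As a rough sketch of how those entities can be read off a processed document: spaCy exposes them as `doc.ents`, and each entity has a `text` and a `label_` such as `PERSON`, `GPE` or `ORG`. The helper name `entities_by_label` below is made up for this example.

```python
from collections import defaultdict

def entities_by_label(doc):
    """Map each entity label (PERSON, GPE, ORG, ...) to the entity texts found."""
    found = defaultdict(list)
    for ent in doc.ents:
        found[ent.label_].append(ent.text)
    return dict(found)

# With the notebook's model already loaded as `nlp`, the people in a text
# could be listed like this:
# entities_by_label(nlp("John drove to Duluth, Minnesota.")).get("PERSON", [])
```

If a label is unfamiliar, `spacy.explain("GPE")` returns a short description of it.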

In [None]:
my_story = """
John drove his blue Volkswagen Golf north on Interstate 35 to Duluth, Minnesota,
where he stopped at the Aerial Lift Bridge and looked out over
Lake Superior. 
"""

doc = nlp(my_story)

In [15]:
"drove" in my_story

True
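
Note that `in` on a string is a substring test, so it also matches inside longer words. A small sketch of a whole-word check using the standard `re` module follows; the helper name `contains_word` is hypothetical.

```python
import re

def contains_word(text, word):
    """Return True if `word` appears in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print("safe" in "an unsafe bet")               # True: substring match
print(contains_word("an unsafe bet", "safe"))  # False: not a whole word
```

For a quick filter, the substring test is usually good enough; the whole-word version matters when a short word is likely to hide inside longer ones.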

Loading the JSON file we got from Newslens

In [101]:
json_file = "data_dump_analysis.json"
adjectives = []
superlatives = []

with open(json_file) as f:          # open the JSON file
    the_data = json.load(f)

for item in the_data:               # loop through each item ...

    # skip items that have no article text
    if "content" not in item:
        continue

    # only look at stories that mention the stock market
    story = item['content']
    if "stock market" not in story:
        continue

    # print the URL of any story containing these particular superlatives
    if "safest" in story:
        print("safest", item['url'])

    if "zaniest" in story:
        print("zaniest", item['url'])

    # collect every adjective, and separately every superlative (tag JJS)
    doc = nlp(story)
    for token in doc:
        if token.pos_ == "ADJ":
            adjectives.append(token.text.lower())
        if token.tag_ == "JJS":
            superlatives.append(token.text.lower())

print(adjectives)
print("----")
print(superlatives)



safest https://www.nytimes.com/2020/03/06/business/coronavirus-stock-market.html
['biggest', 'financial', 'double', 'global', 'industrial', 'huge', 'asian', 'worldwide', 'economic', 'such', 'crude', 'biggest', 'single', 'first', 'chief', 'international', 'more', 'more', 'large', 'more', 'rich', 'closed', 'large', 'possible', 'worst', 'global', 'financial', 'australian', 'double', 'fresh', 'japanese', 'safe', 'benchmark', 'genuine', 'chief', 'clearer', 'clear', 'such', 'global', 'simultaneous', 'chief', 'biggest', 'distinct', 'worst', 'next', 'several', 'weaker', 'chinese', 'global', 'last', 'severe', 'global', 'worst', 'dramatic', 'british', 'severe', 'biggest', 'last', 'underlying', 'certain', 'political', 'economic', 'marginal', 'general', 'significant', 'greater', 'political', 'economic', 'new', 'next', 'crucial', 'biggest', 'top', 'ferocious', 'main', 'crude', 'last', 'crude', 'largest', 'biggest', 'crude', 'saudi', 'more', 'dependent', 'global', 'huge', 'crude', 'unprecedented', '
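
The filtering step in the loop above can also be pulled out into a small generator, which makes it easy to try on a few records before running the model over everything. This is only a sketch: the name `stories_mentioning` is hypothetical, the three sample records are invented for illustration, and it assumes each record is a dict with optional `content` and `url` keys, as in the cell above.

```python
import json

def stories_mentioning(items, phrase):
    """Yield (url, content) for every record whose text contains `phrase`."""
    for item in items:
        story = item.get("content")
        # Skip records with no text, or text that doesn't mention the phrase
        if not story or phrase not in story:
            continue
        yield item.get("url"), story

# Made-up records standing in for the real data file
sample = json.loads("""[
  {"url": "https://example.com/a", "content": "The stock market fell."},
  {"url": "https://example.com/b", "content": "Weather was mild."},
  {"url": "https://example.com/c"}
]""")

for url, story in stories_mentioning(sample, "stock market"):
    print(url)
```

Using `item.get("content")` covers both the missing-key case and empty text in one check.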

In [92]:
len(adjectives)

8772

In [93]:
len(superlatives)

500

In [94]:
len(the_data)

2392

In [None]:
supers = pd.DataFrame(superlatives)

In [None]:
supers.columns = ['word']

In [None]:
supers_list = supers['word'].value_counts().rename_axis('unique_values').reset_index(name='counts')

In [None]:
supers_list.to_csv('supers.txt', index=False)

In [99]:
!cat supers.txt

unique_values,counts
biggest,130
largest,65
worst,51
least,48
most,41
latest,41
lowest,40
highest,29
best,16
steepest,6
greatest,4
busiest,3
darkest,3
smallest,2
holiest,2
sharpest,2
earliest,2
broadest,2
weakest,1
newest,1
finest,1
clearest,1
wealthiest,1
safest,1
longest,1
fastest,1
easiest,1
youngest,1
deepest,1
toughest,1
deadliest,1
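
The same kind of frequency table can be produced without pandas, using `collections.Counter` from the standard library. A minimal sketch follows; the helper name `write_counts` is made up here, the three input words are illustrative only, and the column names simply mirror the CSV above.

```python
import csv
import io
from collections import Counter

def write_counts(words, out):
    """Write word frequencies, most common first, as CSV to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["unique_values", "counts"])
    writer.writerows(Counter(words).most_common())

buffer = io.StringIO()
write_counts(["biggest", "worst", "biggest"], buffer)
print(buffer.getvalue())
```

Words with equal counts come out in the order they were first seen, which may differ from the order pandas picks for ties.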


In [69]:
supers['word'].value_counts()

biggest       117
largest        64
least          47
worst          42
most           40
latest         40
lowest         38
highest        27
best           14
steepest        4
greatest        4
busiest         3
smallest        2
holiest         2
sharpest        2
darkest         2
earliest        2
broadest        2
weakest         1
newest          1
finest          1
clearest        1
wealthiest      1
safest          1
longest         1
fastest         1
easiest         1
youngest        1
deepest         1
toughest        1
deadliest       1
Name: word, dtype: int64

In [71]:
!cat supers.txt

word
latest
largest
biggest
most
finest
worst
most
weakest
most
latest
least
largest
largest
lowest
biggest
worst
lowest
largest
lowest
biggest
latest
latest
lowest
most
most
biggest
latest
latest
latest
latest
biggest
largest
latest
latest
earliest
earliest
least
least
best
fastest
least
best
largest
largest
greatest
latest
highest
greatest
best
worst
biggest
latest
lowest
clearest
largest
largest
highest
lowest
biggest
most
biggest
lowest
biggest
lowest
biggest
largest
most
biggest
largest
largest
highest
best
least
latest
best
largest
least
biggest
lowest
biggest
lowest
biggest
biggest
largest
most
biggest
largest
biggest
highest
sharpest
safest
greatest
biggest
lowest
biggest
lowest
biggest
biggest
largest
biggest
largest
biggest
highest
lowest
biggest
lowest
biggest
biggest
largest
biggest
largest
biggest
highest
biggest
biggest
worst
biggest
worst
worst
biggest
worst
largest
least
most
biggest
lowest
biggest
lowest
biggest
biggest
largest
biggest
largest
biggest
highest
worst
hol

In [None]:
adj = pd.DataFrame(adjectives)

In [None]:
adj.columns = ['word']

In [59]:
adj['word'].value_counts() 

more            273
first           260
other           204
global          177
last            153
               ... 
traumatizing      1
ceos              1
urban             1
unimaginable      1
extensive         1
Name: word, Length: 1103, dtype: int64

In [None]:
df = adj['word'].value_counts().rename_axis('unique_values').reset_index(name='counts')

In [62]:
df

     unique_values  counts
0             more     273
1            first     260
2            other     204
3           global     177
4             last     153
...            ...     ...
1098  traumatizing       1
1099          ceos       1
1100         urban       1
1101  unimaginable       1


In [None]:
df.to_csv('out.txt', index=False)

In [64]:
%ls cat out.txt

ls: cannot access 'cat': No such file or directory
out.txt


In [65]:
%ls

data_dump_analysis.json  out.txt  sample_data/


In [None]:
!cat out.txt