# Spacy.io NLP stuff

[spaCy](https://spacy.io/) bills itself as "Industrial-Strength Natural Language Processing" (NLP). The cell below installs the library and downloads its small English model.

In [5]:
! pip install -q -U spacy
! python -m spacy download en_core_web_sm # downloads English NLP model info

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


You might need to restart your Jupyter kernel before `spacy.load` can find the newly installed model.

BTW, there are other, non-English [language models](https://spacy.io/usage/models).

Let's load the Tesla IPO prospectus (the S-1 filing) again, fetching it from the SEC's EDGAR site:

In [16]:
! curl -H "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36" https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2306k    0 2306k    0     0  10.6M      0 --:--:-- --:--:-- --:--:-- 10.6M
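
The `curl` call above sends a browser-like `user-agent` header along with the request. The same download can be sketched in pure Python with the standard library. The helper names `build_request` and `fetch_to_file` and the shortened header string are assumptions for illustration, not part of the original notes.

```python
import urllib.request

URL = "https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm"
# Hypothetical, shortened user-agent string; the curl command above uses a longer one.
USER_AGENT = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N)"

def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a user-agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_to_file(url, path, user_agent=USER_AGENT):
    """Download url and write the raw bytes to path; return the byte count."""
    with urllib.request.urlopen(build_request(url, user_agent)) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

# Uncomment to download (requires network access):
# fetch_to_file(URL, "/tmp/TeslaIPO.html")
```

The download call is left commented out so the cell can be run safely without network access.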


In [17]:
from bs4 import BeautifulSoup

def html2text(html_text):
    """Strip the HTML tags and return only the text."""
    soup = BeautifulSoup(html_text, 'lxml')  # the 'lxml' parser requires the lxml package
    return soup.get_text()

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split())  # first 100 characters, split on whitespace

['S-1', '1', 'ds1.htm', 'REGISTRATION', 'STATEMENT', 'ON', 'FORM', 'S-1', 'Registration', 'Statement', 'on', 'Form', 'S-1', 'Table', 'of', 'Conte']
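
Note that `tsla[0:100]` takes the first 100 *characters*, not words, which is why the last item in the output is the fragment `'Conte'`. A minimal sketch of the difference, using a made-up sample string rather than the real filing:

```python
sample = "REGISTRATION STATEMENT ON FORM S-1 Table of Contents"  # made-up stand-in for tsla

# Slice characters first, then split: the last word can be cut off.
by_chars = sample[0:48].split()

# Split first, then slice the word list: every word stays whole.
by_words = sample.split()[0:8]

print(by_chars)
print(by_words)
```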


## Tokenizing with Spacy

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [7]:
doc = nlp(tsla[0:5000])
type(doc)

spacy.tokens.doc.Doc
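
Calling `nlp(...)` on a string returns a `Doc`, which behaves like a sequence of `Token` objects. The sketch below shows this on a made-up sentence. It uses `spacy.blank("en")` (tokenizer only, no downloaded model) purely so the example runs on its own; that is an assumption and differs from the `en_core_web_sm` pipeline loaded above.

```python
import spacy
from spacy.tokens import Doc

# Tokenizer-only English pipeline: no tagger, parser, or entity recognizer.
blank_nlp = spacy.blank("en")

sample_doc = blank_nlp("Tesla filed Form S-1 with the SEC in 2010.")  # made-up sentence

tokens = [t.text for t in sample_doc]
print(type(sample_doc).__name__, len(sample_doc))
print(tokens)

# Slicing a Doc gives a Span, which can reproduce its source text.
first_two = sample_doc[0:2]
print(first_two.text)
```

Because the blank pipeline has no tagger or entity recognizer, attributes such as `pos_` and `doc.ents` stay empty here; they need the full model.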

In [18]:
for token in doc[:30]:
    if not token.is_space:  # skip whitespace-only tokens
        print(token.text)

S-1
1
ds1.htm
REGISTRATION
STATEMENT
ON
FORM
S-1
Registration
Statement
on
Form
S-1
Table
of
Contents
As
filed
with
the
Securities
and
Exchange


## Parts of speech

In [9]:
import pandas as pd
winfo = []
for token in doc[100:120]:
    winfo.append([token.text, token.pos_, token.is_stop])
winfo

[['jurisdiction', 'NOUN', False],
 ['of', 'ADP', True],
 ['incorporation', 'NOUN', False],
 ['or', 'CCONJ', True],
 ['organization', 'NOUN', False],
 [')', 'PUNCT', False],
 ['\n\xa0\n ', 'SPACE', False],
 ['(', 'PUNCT', False],
 ['Primary', 'PROPN', False],
 ['Standard', 'PROPN', False],
 ['Industrial', 'PROPN', False],
 ['Classification', 'PROPN', False],
 ['Code', 'PROPN', False],
 ['Number', 'PROPN', False],
 [')', 'PUNCT', False],
 ['\n\xa0\n ', 'SPACE', False],
 ['(', 'PUNCT', False],
 ['I.R.S.', 'PROPN', False],
 ['Employer', 'PROPN', False],
 ['Identification', 'PROPN', False]]

In [10]:
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

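|   | word | part of speech | stop word |
|---|------|----------------|-----------|
| 0 | jurisdiction | NOUN | False |
| 1 | of | ADP | True |
| 2 | incorporation | NOUN | False |
| 3 | or | CCONJ | True |
| 4 | organization | NOUN | False |
| 5 | ) | PUNCT | False |
| 6 | \n \n | SPACE | False |
| 7 | ( | PUNCT | False |
| 8 | Primary | PROPN | False |
| 9 | Standard | PROPN | False |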
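
The `SPACE` rows in the output come from runs of newlines and non-breaking spaces (`\xa0`) left over from the HTML. One possible cleanup, sketched here with a hypothetical helper `collapse_whitespace`, is to squeeze every whitespace run into a single space before calling `nlp`. This throws away line breaks, which may change how sentences are split, so treat it as an option rather than a required step.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (including \\xa0) with one space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "organization)\n\xa0\n (Primary Standard"  # fragment modeled on the output above
clean = collapse_whitespace(raw)
print(repr(clean))
```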


## Named entities

In [11]:
winfo = []
for ent in doc.ents[:20]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

|   | word | label |
|---|------|-------|
| 0 | S-1 | PRODUCT |
| 1 | 1 | CARDINAL |
| 2 | ds1.htm | GPE |
| 3 | REGISTRATION STATEMENT | PERSON |
| 4 | Table of Contents | WORK_OF_ART |
| 5 | the Securities and Exchange Commission | ORG |
| 6 | January 29, 2010 | DATE |
| 7 | UNITED STATES SECURITIES AND EXCHANGE COMMISSION | ORG |
| 8 | Washington | GPE |
| 9 | FORM S-1 \n REGISTRATION STATEMENT UNDER THE... | ORG |
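
Each entity carries a text span and a label. Some labels from the small model are clearly off for this kind of document (for example, `ds1.htm` tagged as `GPE` and `REGISTRATION STATEMENT` as `PERSON`), so it can help to tally the labels before trusting them. A sketch using the first rows of the output above as plain tuples; with a real `Doc`, the pairs would come from `doc.ents`:

```python
from collections import Counter

# (text, label) pairs copied from the first rows of the output above.
pairs = [
    ("S-1", "PRODUCT"),
    ("1", "CARDINAL"),
    ("ds1.htm", "GPE"),
    ("REGISTRATION STATEMENT", "PERSON"),
    ("Table of Contents", "WORK_OF_ART"),
    ("the Securities and Exchange Commission", "ORG"),
    ("January 29, 2010", "DATE"),
    ("UNITED STATES SECURITIES AND EXCHANGE COMMISSION", "ORG"),
    ("Washington", "GPE"),
]

# Count how often each label occurs.
label_counts = Counter(label for _, label in pairs)
print(label_counts.most_common(3))

# Keep only the entities labeled as organizations.
orgs = [text for text, label in pairs if label == "ORG"]
print(orgs)
```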


## Word vectors

In [12]:
winfo = []
for t in doc[100:110]:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])

|   | word | vector |
|---|------|--------|
| 0 | jurisdiction | [0.9122888, -0.4078743, 0.7634988, -0.50670886... |
| 1 | of | [-0.7129048, 0.04475537, -0.94641596, -0.08683... |
| 2 | incorporation | [0.6066091, -0.52781343, 0.21630868, -0.501968... |
| 3 | or | [-0.381503, 0.49661726, -0.35056427, 0.0716877... |
| 4 | organization | [0.24081895, -0.7348472, 1.149735, -0.9842273,... |
| 5 | ) | [-0.17163272, -0.5210204, -0.7458245, -0.38518... |
| 6 | \n \n | [0.42153037, -0.016289651, -0.23510829, 0.0340... |
| 7 | ( | [-0.36409053, 0.41856855, -0.64084274, -0.0303... |
| 8 | Primary | [-0.1312578, -1.4340324, -0.10657543, 0.838625... |
| 9 | Standard | [0.0501505, -0.055509098, 0.39489675, 0.167570... |
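
Vectors like these are usually compared with cosine similarity. The sketch below computes it with the standard library on made-up 3-dimensional vectors (real spaCy vectors are much longer NumPy arrays). As far as I know, `en_core_web_sm` does not ship static word vectors, so the values above are context-dependent tensors; the larger `en_core_web_md` and `en_core_web_lg` models are the ones that include real word vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from the model.
a = [1.0, 0.0, 1.0]
b = [1.0, 0.0, 1.0]
c = [0.0, 1.0, 0.0]

print(cosine_similarity(a, b))  # identical direction: close to 1
print(cosine_similarity(a, c))  # orthogonal: 0
```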


## Visualizing entities in a notebook

In [13]:
from spacy import displacy
displacy.render(doc[100:180], style='ent')
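
Inside Jupyter, `displacy.render` draws the highlighted text inline. Outside a notebook it can return the HTML as a string instead. The sketch below assumes displaCy's `manual=True` mode, where the entities are supplied by hand as character offsets, so it runs without a trained model. The sentence, offsets, and output path are made up for illustration.

```python
from spacy import displacy

text = "Tesla Motors is based in Palo Alto."  # made-up sentence
example = {
    "text": text,
    "ents": [
        {"start": 0, "end": 12, "label": "ORG"},   # "Tesla Motors"
        {"start": 25, "end": 34, "label": "GPE"},  # "Palo Alto"
    ],
    "title": None,
}

# jupyter=False forces a returned HTML string; page=True wraps it in a full page.
html = displacy.render(example, style="ent", manual=True, jupyter=False, page=True)
print(len(html))

# Uncomment to save the page and open it in a browser:
# with open("/tmp/entities.html", "w", encoding="utf-8") as f:
#     f.write(html)
```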

## Splitting into sentences

In [14]:
winfo = []
for s in doc.sents:
    if len(s.text.strip())>2:
        winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

|   | sentence |
|---|----------|
| 0 | \nS-1\n1\nds1.htm\n |
| 1 | REGISTRATION STATEMENT ON FORM S-1\n\nRegistra... |
| 2 | Table of Contents\nAs filed with the Securitie... |
| 3 | 333- \n UNITED STATES SE... |
| 4 | FORM S-1 \n REGISTRATION STATEMENT UNDER... |
| 5 | \n\n\n\n\n\n\n\n\nDelaware\n \n3711\n \n... |
| 6 | 3500 Deer Creek Road\n Palo Alto, California 9... |
| 7 | Elon Musk \n Chief Executive Officer Tes... |
| 8 | Copies to: \n\n\n\n\n\n\n Larry W. Sons... |
| 9 | Page Mill Road Palo Alto, California 94304\n (... |
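
With `en_core_web_sm`, sentence boundaries come from the model's parser, which expects ordinary prose; that likely explains the odd splits on the cover-page layout above. For comparison, spaCy also provides a rule-based `sentencizer` component that splits on sentence-final punctuation. This is a different technique from the one used in the cell above. A minimal sketch on made-up prose, assuming spaCy v3 (where components are added by string name):

```python
import spacy

rule_nlp = spacy.blank("en")       # tokenizer only, no trained components
rule_nlp.add_pipe("sentencizer")   # punctuation-based sentence boundaries

prose = "Tesla designs electric vehicles. The company is based in Palo Alto."  # made-up text
sentences = [s.text for s in rule_nlp(prose).sents]
for s in sentences:
    print(s)
```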


## Exercise

Extract every token in the TSLA doc that spaCy considers a number. See [spaCy 101](https://spacy.io/usage/spacy-101) for the available token attributes. Assuming you used `doc = nlp(tsla[0:5000])`, your output should look like this:

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See the [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.