# Spacy.io NLP stuff

[spaCy](https://spacy.io/) bills itself as "Industrial-Strength Natural Language Processing" (NLP). To install it along with an English model:

```bash
pip install spacy
python -m spacy download en_core_web_sm # downloads the small English model that spacy.load() uses below
```

There are other, non-English [language models](https://spacy.io/usage/models).

Let's load the Tesla IPO filing (SEC Form S-1) again:

In [1]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2306k    0 2306k    0     0  4939k      0 --:--:-- --:--:-- --:--:-- 4928k


In [2]:
from bs4 import BeautifulSoup

def html2text(html_text):
    "Strip all markup and return only the text of an HTML document."
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split())  # peek at the first 100 characters, split into words

['S-1', '1', 'ds1.htm', 'REGISTRATION', 'STATEMENT', 'ON', 'FORM', 'S-1', 'Registration', 'Statement', 'on', 'Form', 'S-1', 'Table', 'of', 'Co']
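
To see what `get_text()` does on something smaller than an SEC filing, here is a minimal sketch on a made-up HTML snippet. The snippet and the `normalize_ws` helper are illustrative assumptions, not part of these notes. The helper collapses the runs of whitespace that stripping tags tends to leave behind.

```python
import re
from bs4 import BeautifulSoup

def html2text(html_text):
    "Strip all markup and return only the text of an HTML document."
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.get_text()

def normalize_ws(text):
    "Hypothetical helper: collapse any run of whitespace into one space."
    return re.sub(r"\s+", " ", text).strip()

# Made-up snippet standing in for the real filing
sample = "<html><body><h1>FORM S-1</h1>\n<p>Tesla   Motors,\n Inc.</p></body></html>"
raw = html2text(sample)
clean = normalize_ws(raw)
print(repr(raw))   # tags are gone but newlines and extra spaces remain
print(clean)
```

Whether to normalize whitespace before handing text to spaCy is a judgment call: the tokenizer copes with raw text, but whitespace tokens then show up in the output, as they do in the cells that follow.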


## Tokenizing with spaCy

In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp(tsla[0:5000])
type(doc)

spacy.tokens.doc.Doc

In [5]:
for token in doc[:30]:
    if token.text.strip():  # skip whitespace-only tokens
        print(token.text.strip())

S-1
1
ds1.htm
REGISTRATION
STATEMENT
ON
FORM
S-1
Registration
Statement
on
Form
S-1
Table
of
Contents
As
filed
with
the
Securities
and
Exchange
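
Tokenizing does not require a trained model. The sketch below assumes a blank English pipeline from `spacy.blank("en")` (tokenizer only) and a short made-up string, so it runs without downloading anything. It filters whitespace tokens with `is_space` instead of `strip()` and checks that the tokens, together with their trailing whitespace, reproduce the input.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm so the sketch runs without a downloaded model.
nlp_blank = spacy.blank("en")
text = "Registration Statement\n\non Form S-1"
doc_small = nlp_blank(text)

# is_space is True for tokens made up entirely of whitespace
words = [t.text for t in doc_small if not t.is_space]
print(words)

# Tokenization is non-destructive: token text plus trailing whitespace
# rebuilds the original string exactly.
rebuilt = "".join(t.text_with_ws for t in doc_small)
print(rebuilt == text)
```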


## Parts of speech

In [6]:
import pandas as pd
winfo = []
for token in doc[100:120]:
    winfo.append([token.text, token.pos_, token.is_stop])
    
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

| | word | part of speech | stop word |
|---|---|---|---|
| 0 | jurisdiction | NOUN | False |
| 1 | of | ADP | True |
| 2 | incorporation | NOUN | False |
| 3 | or | CCONJ | True |
| 4 | organization | NOUN | False |
| 5 | ) | PUNCT | False |
| 6 | `\n \n` | SPACE | False |
| 7 | ( | PUNCT | False |
| 8 | Primary | PROPN | False |
| 9 | Standard | PROPN | False |
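
A common use of the `is_stop` flag is to keep only content words. Here is a minimal sketch, assuming a blank English pipeline so that no model download is needed. `is_stop`, `is_punct`, and `is_space` are lexical attributes and work with the tokenizer alone, whereas `pos_` needs a trained pipeline such as `en_core_web_sm`.

```python
import spacy

# Assumption: tokenizer-only pipeline; the phrase is taken from the output above.
nlp_blank = spacy.blank("en")
doc_small = nlp_blank("jurisdiction of incorporation or organization")

# Drop stop words, punctuation, and whitespace tokens
content_words = [t.text for t in doc_small
                 if not (t.is_stop or t.is_punct or t.is_space)]
print(content_words)
```

With the `doc` from these notes, the same list comprehension over `doc[100:120]` should drop `of`, `or`, the parentheses, and the whitespace token.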


## Named entities

In [7]:
winfo = []
for ent in doc.ents[:20]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

| | word | label |
|---|---|---|
| 0 | FORM | ORG |
| 1 | the Securities and Exchange Commission | ORG |
| 2 | January 29, 2010 | DATE |
| 3 | UNITED | ORG |
| 4 | SECURITIES AND EXCHANGE COMMISSION | ORG |
| 5 | Washington | GPE |
| 6 | 1933 | DATE |
| 7 | Delaware | GPE |
| 8 | 3711 | DATE |
| 9 | 91 | CARDINAL |
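
Each item in `doc.ents` is a span with a `text` and a `label_`, so grouping entities by label takes only a few lines. The sketch below is one way to do it. The `ents_by_label` helper is hypothetical, and the demo assigns two entities by hand on a blank pipeline (an assumption made so the code runs without a trained model).

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

def ents_by_label(doc):
    "Hypothetical helper: map each entity label to the entity texts found in doc."
    groups = defaultdict(list)
    for ent in doc.ents:
        groups[ent.label_].append(ent.text)
    return dict(groups)

# Assumption: entities are set by hand here; a trained pipeline such as
# en_core_web_sm would predict them instead.
nlp_blank = spacy.blank("en")
demo = nlp_blank("Tesla is incorporated in Delaware")
demo.ents = [Span(demo, 0, 1, label="ORG"), Span(demo, 4, 5, label="GPE")]
print(ents_by_label(demo))
```

Calling `ents_by_label(doc)` on the Tesla doc would group the model's predictions the same way. Those labels come from a statistical model and are not always right: in the output above, `FORM` is tagged `ORG` and `3711` is tagged `DATE`, both of which look like misses.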


## Word vectors

In [8]:
winfo = []
for t in doc[100:110]:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])

| | word | vector |
|---|---|---|
| 0 | jurisdiction | [1.1151067, 0.003911406, -0.79827666, -0.62766... |
| 1 | of | [0.10711861, -2.6413453, -0.9461815, -0.516161... |
| 2 | incorporation | [0.9371015, 2.2627826, -0.53676647, -1.74161, ... |
| 3 | or | [-1.1419694, 2.3251572, -3.1423151, 0.6522193,... |
| 4 | organization | [0.7901013, 0.7636175, 0.62965834, -0.75441694... |
| 5 | ) | [1.9197803, 0.09777585, -1.8893478, 0.02061188... |
| 6 | `\n \n` | [-0.24489927, 0.92699367, -0.77028555, 1.06281... |
| 7 | ( | [-1.4775455, 2.3933463, 0.7396773, 0.18752888,... |
| 8 | Primary | [3.0065293, 1.6848834, 1.9169567, -2.7015536, ... |
| 9 | Standard | [4.5301094, 1.7672288, 2.259016, 1.4483219, 4.... |
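
Vectors like these are usually compared with cosine similarity. Here is a minimal NumPy sketch. The `cosine_similarity` helper and the three toy vectors are made up for illustration and are not values from the model.

```python
import numpy as np

def cosine_similarity(u, v):
    "Cosine of the angle between two vectors; 0.0 if either has zero length."
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy vectors (assumptions, not model output)
a = [1.0, 0.0, 1.0]
b = [2.0, 0.0, 2.0]   # same direction as a, different length
c = [0.0, 1.0, 0.0]   # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

With the Tesla doc, `cosine_similarity(doc[100].vector, doc[104].vector)` would compare `jurisdiction` with `organization`. One caveat, to the best of our knowledge: the small `en_core_web_sm` model does not ship with static word vectors, so `token.vector` there reflects context-sensitive values; a larger model such as `en_core_web_md` is the usual choice when real word vectors matter.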


## Visualizing entities in a notebook

In [9]:
from spacy import displacy
displacy.render(doc[100:180], style='ent')
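
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes a blank pipeline with two hand-assigned entities (so it runs without a trained model) and writes the result to a hypothetical `entities.html` file in the system temp directory.

```python
import os
import tempfile

import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: entities are set by hand; a trained pipeline would predict them.
nlp_blank = spacy.blank("en")
demo = nlp_blank("Tesla is incorporated in Delaware")
demo.ents = [Span(demo, 0, 1, label="ORG"), Span(demo, 4, 5, label="GPE")]

# jupyter=False returns the HTML as a string; page=True wraps it in a full page
html = displacy.render(demo, style="ent", jupyter=False, page=True)

out_path = os.path.join(tempfile.gettempdir(), "entities.html")  # hypothetical file name
with open(out_path, "w", encoding="utf-8") as f:
    f.write(html)
print(out_path)
```

Opening the saved file in a browser should show the same highlighted entities that the notebook cell displays inline.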

## Splitting into sentences

In [10]:
winfo = []
for s in doc.sents:
    winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

| | sentence |
|---|---|
| 0 | `\nS-1\n1\n` |
| 1 | `ds1.htm\n` |
| 2 | REGISTRATION STATEMENT ON FORM |
| 3 | `S-1\n\n\nRegistration Statement on Form S-1\n\...` |
| 4 | As filed with the Securities and Exchange Comm... |
| 5 | Registration No. |
| 6 | `333- \n` |
| 7 | UNITED STATES SECURITIES AND EXCHANGE COMMISS... |
| 8 | REGISTRATION STATEMENT |
| 9 | UNDER |
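
The sentences above carry the newlines left over from the HTML layout. Here is a minimal sketch of tidying them up, assuming spaCy v3 and its rule-based `sentencizer` on a blank pipeline so that no model download is needed. This differs from `en_core_web_sm`, which derives sentence boundaries from its parser. The `clean_sentences` helper and the demo text are made up for illustration.

```python
import spacy

# Assumption: spaCy v3 API. The sentencizer splits on punctuation rules,
# unlike the parser-based boundaries of en_core_web_sm.
nlp_rule = spacy.blank("en")
nlp_rule.add_pipe("sentencizer")

def clean_sentences(doc):
    "Hypothetical helper: collapse whitespace in each sentence, drop empty ones."
    cleaned = (" ".join(s.text.split()) for s in doc.sents)
    return [s for s in cleaned if s]

demo = nlp_rule("Tesla builds\nelectric cars. It filed a Form S-1 in 2010.")
print(clean_sentences(demo))
```

`clean_sentences(doc)` works on the Tesla doc as well, since it only relies on `doc.sents`.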


## Exercise

Extract every token in the TSLA doc that spaCy considers a number. See [spaCy 101](https://spacy.io/usage/spacy-101) for the available token attributes. Assuming you used `doc = nlp(tsla[0:5000])`, your output should look like:

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.