# NLP with spaCy

[spaCy](https://spacy.io/) bills itself as a library for "Industrial-Strength Natural Language Processing" (NLP).

In [20]:
! pip install -q -U spacy
! python -m spacy download en_core_web_sm # download the small English pipeline package

Collecting en-core-web-sm==3.1.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


You might need to restart your Jupyter kernel before the newly installed package can be loaded.
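
To check from Python whether the model package is visible before calling `spacy.load`, something like the following could be used. It relies on `spacy.util.is_package`; the wrapper name `model_status` is hypothetical and only there for illustration.

```python
import spacy
from spacy.util import is_package

def model_status(name="en_core_web_sm"):
    """Return a short message saying whether `name` is an installed package."""
    if is_package(name):
        return f"{name} is installed; spacy.load('{name}') should work"
    return f"{name} not found; run: python -m spacy download {name}"

print(spacy.__version__)
print(model_status())
```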

BTW, there are other, non-English [language models](https://spacy.io/usage/models).

Let's load the Tesla IPO again:

In [21]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4792  100  4792    0     0  72606      0 --:--:-- --:--:-- --:--:-- 72606


In [22]:
from bs4 import BeautifulSoup

def html2text(html_text):
    """Strip the HTML tags and return only the visible text."""
    soup = BeautifulSoup(html_text, 'lxml')
    return soup.get_text()

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
tsla = html2text(html_text)
print(tsla[0:100].split())  # words found in the first 100 characters

['SEC.gov', '|', 'Request', 'Rate', 'Threshold', 'Exceeded', 'U.S.', 'Securities', 'and', 'Exchange', 'Commission', 'Your', 'Reques']
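
Note that the text printed above is the SEC's "Request Rate Threshold Exceeded" page, not the filing itself: the site asks automated tools to declare their traffic through the user agent. Below is a minimal sketch of a download that sends a declared `User-Agent` header, using only the standard library. The helper names are hypothetical, and the exact user-agent format the SEC expects should be checked against its developer guidance.

```python
import urllib.request

TESLA_URL = ("https://www.sec.gov/Archives/edgar/data/1318605/"
             "000119312510017054/ds1.htm")

def build_request(url, user_agent):
    """Return a Request that identifies the client via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, user_agent, timeout=30):
    """Download `url` and return the body decoded as text (hypothetical helper)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

With this in place, `html_text = fetch_html(TESLA_URL, "Sample Name sample@example.com")` could stand in for the `curl` cell and the file read; the user-agent string here is a placeholder. With `curl` itself, the `-A` option sets the same header.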


## Tokenizing with spaCy

In [23]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [24]:
doc = nlp(tsla[0:5000])
type(doc)

spacy.tokens.doc.Doc

In [25]:
for token in doc[:30]:
    if len(str(token).strip())>0:
        print(token.text.strip())

SEC.gov
|
Request
Rate
Threshold
Exceeded
U.S.
Securities
and
Exchange
Commission
Your
Request
Originates
from
an
Undeclared
Automated
Tool
To
allow
for
equitable
access
to
all
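
The filter in the loop above skips tokens that are pure whitespace, which spaCy emits for runs of newlines or extra spaces. Here is a minimal sketch of the same idea using the `Token.is_space` flag. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a made-up sample string, so it runs without the downloaded model.

```python
import spacy

# Tokenizer-only pipeline; no trained model is needed for tokenization.
nlp_blank = spacy.blank("en")

sample = "Your Request\n\nOriginates from an Undeclared  Tool"
sample_doc = nlp_blank(sample)

# The doc contains whitespace tokens for "\n\n" and for the extra space.
all_tokens = [t.text for t in sample_doc]
words = [t.text for t in sample_doc if not t.is_space]
print(words)
```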


## Parts of speech

In [26]:
import pandas as pd
# Collect [text, coarse part-of-speech tag, stop-word flag] for 20 tokens
winfo = []
for token in doc[100:120]:
    winfo.append([token.text, token.pos_, token.is_stop])
winfo

[['from', 'ADP', True],
 ['SEC.gov', 'NOUN', False],
 [',', 'PUNCT', False],
 ['including', 'VERB', False],
 ['the', 'DET', True],
 ['latest', 'ADJ', False],
 ['EDGAR', 'PROPN', False],
 ['filings', 'NOUN', False],
 [',', 'PUNCT', False],
 ['visit', 'VERB', False],
 ['sec.gov/developer', 'X', False],
 ['.', 'PUNCT', False],
 ['You', 'PRON', True],
 ['can', 'AUX', True],
 ['also', 'ADV', True],
 ['sign', 'VERB', False],
 ['up', 'ADP', True],
 ['for', 'ADP', True],
 ['email', 'NOUN', False],
 ['updates', 'VERB', False]]

In [27]:
pd.DataFrame(data=winfo, columns=['word','part of speech', 'stop word'])

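|   | word | part of speech | stop word |
|---|------|----------------|-----------|
| 0 | from | ADP | True |
| 1 | SEC.gov | NOUN | False |
| 2 | , | PUNCT | False |
| 3 | including | VERB | False |
| 4 | the | DET | True |
| 5 | latest | ADJ | False |
| 6 | EDGAR | PROPN | False |
| 7 | filings | NOUN | False |
| 8 | , | PUNCT | False |
| 9 | visit | VERB | False |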
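
The tags in the `part of speech` column are Universal POS abbreviations. If a tag is unfamiliar, `spacy.explain` returns a short description. A small sketch (no model download needed; the tag list is just the set that appears in the output above):

```python
import spacy

tags = ["ADP", "NOUN", "PUNCT", "VERB", "DET", "ADJ",
        "PROPN", "PRON", "AUX", "ADV", "X"]

# Map each abbreviation to spaCy's glossary description.
glossary = {tag: spacy.explain(tag) for tag in tags}
for tag, meaning in glossary.items():
    print(f"{tag:6} {meaning}")
```

The same call also works for entity labels such as `ORG` or `CARDINAL`.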


## Named entities

In [28]:
winfo = []
for ent in doc.ents[:20]:
    winfo.append([ent.text, ent.label_])
pd.DataFrame(data=winfo, columns=['word', 'label'])

|   | word | label |
|---|------|-------|
| 0 | Request Rate Threshold Exceeded | WORK_OF_ART |
| 1 | U.S. Securities and Exchange Commission | ORG |
| 2 | SEC | ORG |
| 3 | SEC | ORG |
| 4 | SEC | ORG |
| 5 | Site Privacy | ORG |
| 6 | the U.S. Securities and Exchange Commission | ORG |
| 7 | 0.145c3a17.1628632334.1d44db2 | CARDINAL |
| 8 | the Computer Fraud | ORG |
| 9 | 1986 | DATE |
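
A quick way to see which entity types dominate is to tally the labels. The sketch below hard-codes the ten (text, label) rows shown above so that it runs on its own; with a loaded `doc`, the same pairs could be built as `[(e.text, e.label_) for e in doc.ents]`. The function name `label_counts` is hypothetical.

```python
from collections import Counter

# (text, label) pairs copied from the entity output above.
ents = [
    ("Request Rate Threshold Exceeded", "WORK_OF_ART"),
    ("U.S. Securities and Exchange Commission", "ORG"),
    ("SEC", "ORG"),
    ("SEC", "ORG"),
    ("SEC", "ORG"),
    ("Site Privacy", "ORG"),
    ("the U.S. Securities and Exchange Commission", "ORG"),
    ("0.145c3a17.1628632334.1d44db2", "CARDINAL"),
    ("the Computer Fraud", "ORG"),
    ("1986", "DATE"),
]

def label_counts(pairs):
    """Count how often each entity label occurs."""
    return Counter(label for _, label in pairs)

counts = label_counts(ents)
print(counts.most_common())
```

The page title tagged `WORK_OF_ART` and the reference ID tagged `CARDINAL` suggest that the small model's labels deserve a sanity check on text like this.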


## Word vectors

In [29]:
winfo = []
for t in doc[100:110]:
    winfo.append([t.text, t.vector])
pd.DataFrame(data=winfo, columns=['word', 'vector'])

|   | word | vector |
|---|------|--------|
| 0 | from | [0.3551415, -0.061315462, -0.7487595, -0.52455... |
| 1 | SEC.gov | [0.74365944, -0.42298967, 0.12626238, 0.241557... |
| 2 | , | [-0.47439197, 0.17869307, -0.31013155, -0.4700... |
| 3 | including | [-0.43066767, -0.23000172, -0.1267103, 0.36281... |
| 4 | the | [-0.2958218, -0.48785746, -0.5411534, -0.30848... |
| 5 | latest | [-0.93013877, -0.44527623, 1.3953073, -0.79231... |
| 6 | EDGAR | [-0.10970956, -0.51127374, 0.35253358, 0.00803... |
| 7 | filings | [0.09033578, -0.56289196, -0.17694888, 0.13208... |
| 8 | , | [-0.25684774, -0.28116375, -0.74895024, -0.339... |
| 9 | visit | [0.51473784, -0.49357933, -0.29623887, -0.1765... |
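
These vectors become useful once they are compared. Below is a minimal sketch of cosine similarity in plain Python; it works on any two equal-length sequences of floats, including `token.vector` arrays. One caveat, stated as an assumption rather than something this notebook demonstrates: the small `en_core_web_sm` package does not appear to ship static word vectors, so `token.vector` is derived from context, which would explain why the two `,` rows above differ. The toy vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

With a loaded `doc`, `cosine(doc[102].vector, doc[108].vector)` would compare the two comma tokens from the table.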


## Visualizing entities in a notebook

In [30]:
from spacy import displacy
displacy.render(doc[100:180], style='ent')
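
Inside Jupyter, `displacy.render` draws the highlighted text directly in the cell. Outside a notebook, passing `jupyter=False` makes it return the HTML markup as a string instead, which can be saved to a file. A minimal sketch under stated assumptions: it uses a blank pipeline and sets one entity by hand so that it runs without the trained model, and the output path is arbitrary.

```python
import os
import tempfile

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
demo = nlp_blank("The SEC reserves the right to block requests.")
# Hand-made entity for illustration: token 1 ("SEC") labeled ORG.
demo.ents = [Span(demo, 1, 2, label="ORG")]

# page=True wraps the markup in a full HTML document.
html = displacy.render(demo, style="ent", jupyter=False, page=True)

out_path = os.path.join(tempfile.gettempdir(), "ents.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(html)
```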

## Splitting into sentences

In [31]:
winfo = []
for s in doc.sents:
    if len(s.text.strip())>2:
        winfo.append([s.text])
pd.DataFrame(data=winfo, columns=['sentence'])

|   | sentence |
|---|----------|
| 0 | \n\n\nSEC.gov \| Request Rate Threshold Exceede... |
| 1 | Your request has been identified as part of a ... |
| 2 | Please declare your traffic by updating your u... |
| 3 | For best practices on efficiently downloading ... |
| 4 | You can also sign up for email updates on the ... |
| 5 | For more information, contact opendata@sec.gov. |
| 6 | For more information, please see the SEC’s Web... |
| 7 | Thank you for your interest in the U.S. Securi... |
| 8 | \nReference ID: 0.145c3a17.1628632334.1d44db2\n\n |
| 9 | More Information\nInternet Security Policy\n |
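
The `sentence` column still carries raw newlines from the HTML. Here is a small sketch of splitting text into sentences and collapsing internal whitespace. Note the swap: instead of the trained pipeline's sentence boundaries used above, it relies on spaCy's rule-based `sentencizer` component in a blank pipeline (so no model download is needed), and it may split differently on real text. The sample string and the `clean_sentences` name are made up.

```python
import spacy

nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # punctuation-based sentence splitter

def clean_sentences(text):
    """Return sentences with runs of whitespace collapsed to single spaces."""
    doc = nlp_rules(text)
    cleaned = (" ".join(s.text.split()) for s in doc.sents)
    # Same length filter as the cell above: drop near-empty fragments.
    return [s for s in cleaned if len(s) > 2]

sample = "Please declare\nyour traffic. Thank you for your interest."
print(clean_sentences(sample))
```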


## Exercise

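Extract every token in the TSLA doc that spaCy considers a number. See [spaCy 101](https://spacy.io/usage/spacy-101). Your output should look like the following (assuming you used `doc = nlp(tsla[0:5000])` and that the download returned the actual filing text rather than the SEC rate-limit page shown earlier):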

```
[1, 29, 2010, 20549, 1933, 3711, 91, 2197729, 3500, 94304, 650, 413, 4000, 3500, 94304, 650, 413, 4000, 650, 94304, 650, 493, 9300, 2550, 94304, 650, 251, 5000, 415, one, 0.001, 100,000,000, 7,130, 1, 457, 1933, 2, 1933, 29, 2010]
```

See [solution](https://github.com/parrt/msds692/tree/master/notes/code/spacy) if you get stuck.