# Metadata

```yaml
Course:   DS 5001
Module:   92 Helper Notebooks
Topic:    Using SpaCy 
Author:   R.C. Alvarado
```

# Notes

## How to install

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`

## About SpaCy

* More than a library; it is an **entire platform** for text processing. It is designed to be integrated into production-level data products.
* Designed for performance. It uses **best of breed** tools and can be somewhat opaque.
* **A replacement for NLTK**, especially for linguistic annotation in the preprocessing stages. It can work with Gensim and scikit-learn.
* Designed to be **accessed by API**, not by dumping to a database, though that can be done.
* Should be installed in **its own Python environment**.  
  * For example, do `conda create -n spacy` and then do `conda activate spacy`. From there, install SpaCy and everything else you need for your project.

## SpaCy's Object Model
* Note: this is not a true data model, but an object model that bundles data with algorithms (methods).

<img src="images/space-architecture.svg" width="500" />
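To make the "data bundled with methods" point concrete, here is a minimal sketch. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only pipeline that needs no model download; the example sentence is made up for illustration.

```python
import spacy

# Tokenizer-only English pipeline: no trained model is required.
nlp = spacy.blank("en")
doc = nlp("Sherlock Holmes lives in London.")  # hypothetical example sentence

# A Doc is a sequence of Token objects; each token carries data (text, character
# offset) alongside computed attributes such as is_punct.
token_texts = [token.text for token in doc]
first_offsets = [(token.text, token.idx) for token in doc[:3]]

# Slicing a Doc returns a Span, which is a view onto the same underlying Doc.
span = doc[0:2]
span_text = span.text

print(token_texts)
print(first_offsets)
print(span_text, "| last token is punctuation:", doc[-1].is_punct)
```

The same `Doc`, `Token`, and `Span` objects are what the trained pipelines used later in this notebook return, only with more annotations attached.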

# Set Up

## Config

In [1]:
data_home = "../data"
local_lib = "../lib"
data_prefix = 'novels'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']

## Import Library

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy

In [3]:
spacy.__version__

'2.3.2'

# Import CORPUS

In [4]:
LIB = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [5]:
CORPUS = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [6]:
CORPUS.head()

book_id,chap_id,para_num,sent_num,token_num,pos,term_str
secretadversary,1,0,1,0,DT,the
secretadversary,1,0,1,1,NNP,young
secretadversary,1,0,1,2,NNP,adventurers
secretadversary,1,0,1,3,NNP,ltd
secretadversary,1,1,0,0,JJ,tommy


In [7]:
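def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join the strings in str_col into one doc_str per group of the first ohco_level index levels."""
    OHCO = list(CORPUS.index.names)
    # Cast a copy of the column to str so the join cannot fail on non-strings and the input stays unmodified
    strings = CORPUS[str_col].astype('str')
    DOC = strings.groupby(level=OHCO[:ohco_level]).apply(lambda x: glue.join(x)).to_frame('doc_str')
    return DOC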
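The two-step gathering used in the next section (tokens to sentences, then sentences to chapters) can be sketched on a toy corpus. The data and the helper name `join_level` are hypothetical; the helper only mirrors the grouping logic of `gather_docs` so that the block runs on its own.

```python
import pandas as pd

# Hypothetical toy corpus: one book, one chapter, two sentences of two tokens each.
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']
rows = [
    ('toy', 1, 0, 0, 0, 'the'),
    ('toy', 1, 0, 0, 1, 'cat'),
    ('toy', 1, 0, 1, 0, 'it'),
    ('toy', 1, 0, 1, 1, 'sat'),
]
TOY = pd.DataFrame(rows, columns=OHCO + ['term_str']).set_index(OHCO)

def join_level(df, level, str_col, glue):
    """Join strings within each group defined by the first `level` index names."""
    keys = list(df.index.names[:level])
    return df.groupby(keys)[str_col].apply(glue.join).to_frame('doc_str')

# Step 1: join tokens with spaces to get one string per sentence.
toy_sents = join_level(TOY, 4, 'term_str', ' ')
# Step 2: join sentences with '. ' so sentence boundaries stay visible in the chapter text.
toy_chaps = join_level(toy_sents, 2, 'doc_str', '. ')

chapter_text = toy_chaps.loc[('toy', 1), 'doc_str']
print(chapter_text)
```

Joining tokens straight to the chapter level would lose the periods, and with them the cues a sentence segmenter relies on.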

## Gather CHAPS

In [8]:
SENTS = gather_docs(CORPUS, 4) # Join tokens into sentences first to preserve sentence boundaries in CHAPS
CHAPS = gather_docs(SENTS, 2, str_col='doc_str', glue='. ')

In [9]:
CHAPS

book_id,chap_id,doc_str
adventures,1,a scandal in bohemia. i. to sherlock holmes sh...
adventures,2,the red headed league. i had called upon my fr...
adventures,3,a case of identity. my dear fellow said sherlo...
adventures,4,the boscombe valley mystery. we were seated at...
adventures,5,the five orange pips. when i glance over my no...
...,...,...
udolpho,54,vi. unnatural deeds do breed unnatural trouble...
udolpho,55,vii. but in these cases we still have judgment...
udolpho,56,viii. then fresh tears stood on her cheek as d...
udolpho,57,ix. now my task is smoothly done i can fly or ...


# Use SpaCy

## Load Statistical Models

These are also called "trained pipelines" in the documentation.

**Trained pipelines for English:**
* `en_core_web_sm`: English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_md`:  English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_lg`:  English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_trf`: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

<img 
     width="500"
     src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/>
     
See <a href="https://spacy.io/usage/processing-pipelines">the docs</a> for more.

In [10]:
spacy.__version__

'2.3.2'

In [11]:
trained_pipeline = 'en_core_web_md'

In [12]:
# !python -m spacy download {trained_pipeline}

In [13]:
# doc = spacy.nlp(doc_str)

In [15]:
nlp = spacy.load(trained_pipeline)

## Generate Annotations

In [16]:
# pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
# disable= ["attribute_ruler", "lemmatizer", "parser"]
disable = []
DOCS = [doc.to_json() for doc in nlp.pipe(CHAPS.doc_str.values, disable=disable)]

## Convert to DataFrames

In [17]:
features = list(DOCS[0].keys())

In [18]:
features

['text', 'ents', 'sents', 'tokens']

In [19]:
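feature_data = {f:[] for f in features}
for i in range(len(DOCS)):    
    text = DOCS[i]['text']
    # Skip 'text' itself; the remaining keys ('ents', 'sents', 'tokens') are lists of annotations
    for feature in features[1:]:
        df = pd.DataFrame(DOCS[i][feature])
        # Slice the document text by character offsets to recover each annotation's string
        df[f'{feature[:-1]}_str'] = df.apply(lambda x: text[x.start:x.end], 1)
        df['doc_id'] = i  # position of the chapter in CHAPS
        feature_data[feature].append(df)
    
# A bare class used as a namespace to hold one DataFrame per feature
class mySpaCy(): pass
spcy = mySpaCy()
for feature in features[1:]:
    setattr(spcy, feature, pd.concat(feature_data[feature]).rename_axis(f'{feature[:-1]}_id'))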
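The offset-slicing step can be checked without running a pipeline. The sketch below uses a hand-made dictionary shaped like one `Doc.to_json()` result, with offsets copied from the first rows of the outputs in this notebook. It swaps the row-wise `apply` for a list comprehension over the `start` and `end` columns, which gives the same strings.

```python
import pandas as pd

# Hand-made stand-in for one Doc.to_json() result; offsets are character positions.
doc_json = {
    'text': 'a scandal in bohemia. i. to sherlock holmes',
    'ents': [
        {'start': 13, 'end': 20, 'label': 'GPE'},
        {'start': 22, 'end': 24, 'label': 'ORG'},
        {'start': 28, 'end': 43, 'label': 'GPE'},
    ],
    'sents': [
        {'start': 0, 'end': 21},
        {'start': 22, 'end': 43},
    ],
}

text = doc_json['text']
frames = {}
for feature in ['ents', 'sents']:
    df = pd.DataFrame(doc_json[feature])
    # Recover each annotation's surface string by slicing the document text.
    df[f'{feature[:-1]}_str'] = [text[s:e] for s, e in zip(df['start'], df['end'])]
    frames[feature] = df.rename_axis(f'{feature[:-1]}_id')

print(frames['ents'])
print(frames['sents'])
```

Because `end` is exclusive, `text[start:end]` returns exactly the annotated span with no trailing character.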

In [20]:
spcy.ents

ent_id,start,end,label,ent_str,doc_id
0,13,20,GPE,bohemia,0
1,22,24,ORG,i.,0
2,28,43,GPE,sherlock holmes,0
3,244,255,PERSON,irene adler,0
4,279,282,CARDINAL,one,0
...,...,...,...,...,...
108,39150,39158,PERSON,madeline,319
109,39162,39167,ORG,usher,319
110,39938,39946,LOC,red moon,319
111,40392,40400,CARDINAL,thousand,319


## Explore

### TOKEN

In [21]:
spcy.tokens

token_id,id,start,end,pos,tag,dep,head,token_str,doc_id
0,0,0,1,DET,DT,det,1,a,0
1,1,2,9,NOUN,NN,ROOT,1,scandal,0
2,2,10,12,ADP,IN,prep,1,in,0
3,3,13,20,PROPN,NNP,pobj,2,bohemia,0
4,4,20,21,PUNCT,.,punct,1,.,0
...,...,...,...,...,...,...,...,...,...
7453,7453,40494,40496,ADP,IN,prep,7452,of,319
7454,7454,40497,40500,DET,DT,det,7455,the,319
7455,7455,40501,40506,PROPN,NNP,pobj,7453,house,319
7456,7456,40507,40509,ADP,IN,prep,7455,of,319


In [22]:
CORPUS

book_id,chap_id,para_num,sent_num,token_num,pos,term_str
secretadversary,1,0,1,0,DT,the
secretadversary,1,0,1,1,NNP,young
secretadversary,1,0,1,2,NNP,adventurers
secretadversary,1,0,1,3,NNP,ltd
secretadversary,1,1,0,0,JJ,tommy
...,...,...,...,...,...,...
baskervilles,11,114,1,7,RBR,more
baskervilles,11,114,1,8,JJ,comfortable
baskervilles,11,114,1,9,IN,outside
baskervilles,11,114,1,10,IN,than


### VOCAB

In [23]:
spcy.VOCAB = spcy.tokens.value_counts('token_str').to_frame('n')

In [24]:
spcy.VOCAB['max_pos'] = spcy.tokens.value_counts(['token_str','pos']).unstack().idxmax(1)
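How `max_pos` is derived may be easier to see on a toy table. The token data below is hypothetical, and `fill_value=0` is added to `unstack()` only to make the intermediate table easier to read.

```python
import pandas as pd

# Hypothetical token table: 'journal' is tagged PROPN twice and NOUN once.
toy_tokens = pd.DataFrame({
    'token_str': ['journal', 'journal', 'journal', 'house'],
    'pos':       ['PROPN',   'PROPN',   'NOUN',    'NOUN'],
})

# Rows: terms; columns: POS tags; cells: how often the term received that tag.
pos_counts = toy_tokens.value_counts(['token_str', 'pos']).unstack(fill_value=0)

# idxmax(1) returns, for each term, the column label holding the largest count.
max_pos = pos_counts.idxmax(1)

print(pos_counts)
print(max_pos)
```

A term such as `journal` therefore lands under `PROPN` whenever that tag is its most frequent one, even if it also occurs as a common noun.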

In [25]:
spcy.VOCAB[spcy.VOCAB.max_pos == 'PROPN'].sample(10)

token_str,n,max_pos
theophilus,1,PROPN
raikes,10,PROPN
abramoff,1,PROPN
markham,26,PROPN
frizinghall,91,PROPN
brigham,2,PROPN
journal,85,PROPN
wwwgutenbergorgcontact,1,PROPN
roberts,7,PROPN
carlton,3,PROPN


### ENT

In [26]:
spcy.ents.label.value_counts()

PERSON         24625
CARDINAL        4707
TIME            3740
DATE            3487
GPE             3259
ORG             2646
ORDINAL         1991
NORP            1440
LOC              677
FAC              558
QUANTITY         365
PRODUCT          162
LANGUAGE          99
MONEY             46
EVENT             46
WORK_OF_ART       39
LAW                8
Name: label, dtype: int64

In [27]:
spcy.ents[spcy.ents.label=='PERSON'].sample(10)

ent_id,start,end,label,ent_str,doc_id
80,8524,8530,PERSON,julius,232
128,17184,17190,PERSON,howard,257
0,305,311,PERSON,philip,189
20,2853,2861,PERSON,lawrence,259
115,20437,20452,PERSON,sherlock holmes,23
146,21519,21525,PERSON,holmes,1
156,23087,23093,PERSON,holmes,7
118,15819,15826,PERSON,wilkins,260
51,16506,16512,PERSON,buskin,194
187,32382,32389,PERSON,manfred,27


In [28]:
spcy.ents[spcy.ents.label=='PERSON'].value_counts(['doc_id','ent_str']).unstack().sum().sort_values()

ent_str
a dr adams                      1.0
manfred prince of otranto       1.0
manfred rose                    1.0
manfred thou                    1.0
manfred thy                     1.0
                              ...  
montoni                       427.0
holmes                        428.0
annette                       443.0
tommy                         507.0
emily                        1974.0
Length: 3124, dtype: float64
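The counts above are floats because the expression first pivots to a `doc_id` by `ent_str` table, where missing pairs become `NaN`, and then sums the columns. A small sketch on hypothetical data suggests that counting `ent_str` directly gives the same totals as integers:

```python
import pandas as pd

# Hypothetical entity table: two documents, three PERSON mentions.
toy_ents = pd.DataFrame({
    'doc_id':  [0, 0, 1],
    'label':   ['PERSON', 'PERSON', 'PERSON'],
    'ent_str': ['holmes', 'watson', 'holmes'],
})
persons = toy_ents[toy_ents.label == 'PERSON']

# Pivot route, as in the cell above: missing (doc, name) pairs become NaN,
# so the column sums come back as floats.
via_pivot = persons.value_counts(['doc_id', 'ent_str']).unstack().sum().sort_values()

# Direct route: count each name over the whole table; counts stay integers.
direct = persons['ent_str'].value_counts().sort_values()

print(via_pivot)
print(direct)
```

The pivot is still useful when per-document counts are needed; for corpus-wide totals the direct count is shorter.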

In [29]:
spcy.ents[spcy.ents.label=='ORG'].sample(10)

ent_id,start,end,label,ent_str,doc_id
224,41163,41175,ORG,westminister,63
164,21049,21056,ORG,journal,92
2,51,61,ORG,hillingham,48
80,16291,16310,ORG,sie nicht verstehen,242
17,1871,1884,ORG,latour claret,141
190,28109,28114,ORG,onlie,190
471,93573,93590,ORG,the court of rome,96
93,17020,17031,ORG,black larch,279
142,21034,21041,ORG,du pont,295
91,18173,18182,ORG,ladyships,126


In [30]:
spcy.ents[spcy.ents.label=='DATE'].sample(10)

ent_id,start,end,label,ent_str,doc_id
482,79206,79216,DATE,four weeks,92
232,36008,36015,DATE,january,4
25,5398,5413,DATE,merry christmas,32
0,0,10,DATE,a few days,169
178,30106,30123,DATE,the next ten days,149
347,50113,50121,DATE,tomorrow,149
136,19575,19582,DATE,one day,186
87,16189,16200,DATE,seventeenth,144
20,3489,3495,DATE,morrow,63
118,20145,20153,DATE,tomorrow,160


### SENT

In [31]:
spcy.sents

sent_id,start,end,sent_str,doc_id
0,0,21,a scandal in bohemia.,0
1,22,43,i. to sherlock holmes,0
2,44,68,she is always the woman.,0
3,69,126,i have seldom heard him mention her under any ...,0
4,127,190,in his eyes she eclipses and predominates the ...,0
...,...,...,...,...
303,39717,39764,suddenly there shot along the path a wild light,319
304,39765,39885,and i turned to see whence a gleam so unusual ...,319
305,39886,40123,the radiance was that of the full setting and ...,319
306,40124,40270,while i gazed this fissure rapidly widened the...,319


In [32]:
SENTS

book_id,chap_id,para_num,sent_num,doc_str
adventures,1,0,1,a scandal in bohemia
adventures,1,1,0,i
adventures,1,2,0,to sherlock holmes she is always the woman
adventures,1,2,1,i have seldom heard him mention her under any ...
adventures,1,2,2,in his eyes she eclipses and predominates the ...
...,...,...,...,...
usher,1,47,0,from that chamber and from that mansion i fled...
usher,1,47,1,the storm was still abroad in all its wrath as...
usher,1,47,2,suddenly there shot along the path a wild ligh...
usher,1,47,3,the radiance was that of the full setting and ...


# Save

In [33]:
import sqlite3

In [39]:
with sqlite3.connect(f"{data_home}/output/space-demo.db") as db:
    for feature in features[1:]:
        getattr(spcy, feature).to_sql(feature, db, index=True, if_exists='replace')
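To confirm that a saved table can be read back, a round trip can be sketched as follows. It uses an in-memory database and a hypothetical two-row stand-in for `spcy.ents`, so nothing is written to disk.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for spcy.ents, with the same index name and columns.
toy_ents = pd.DataFrame(
    {'start': [13, 22], 'end': [20, 24], 'label': ['GPE', 'ORG'],
     'ent_str': ['bohemia', 'i.'], 'doc_id': [0, 0]}
).rename_axis('ent_id')

# ':memory:' keeps the demo off disk; closing() guarantees the connection is closed,
# which sqlite3's own context manager does not do (it only commits or rolls back).
with closing(sqlite3.connect(':memory:')) as db:
    toy_ents.to_sql('ents', db, index=True, if_exists='replace')
    restored = pd.read_sql('SELECT * FROM ents ORDER BY ent_id', db, index_col='ent_id')

print(restored)
```

Since `index=True` writes `ent_id` as an ordinary column, passing `index_col='ent_id'` on read restores the original index.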