# Metadata

```
course:   DS 5001
module:   00 Final Projects
topic:    Using SpaCy 
author:   R.C. Alvarado
```

# Notes

## How to install

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`
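
After running the two commands above, a quick check can confirm that both the library and the pipeline package are visible to the active environment. This is a minimal sketch using only the standard library; the helper name `spacy_status` is made up for illustration, and it relies on the fact that spaCy pipelines are installed as ordinary Python packages.

```python
import importlib.util

def spacy_status(model_name="en_core_web_sm"):
    """Report whether spaCy and a named pipeline package can be imported."""
    # find_spec returns None when a top-level package is not installed
    has_spacy = importlib.util.find_spec("spacy") is not None
    has_model = importlib.util.find_spec(model_name) is not None
    return {"spacy": has_spacy, model_name: has_model}

print(spacy_status())
```

If either value is `False`, the corresponding install step probably ran in a different environment.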

## About SpaCy

* More than a library; it is an **entire platform** for text processing. It is designed to be integrated into production-level data products.
* Designed for performance. It uses **best of breed** tools and can be somewhat opaque.
* **A replacement for NLTK**, especially for linguistic annotation in the preprocessing stages. It can work with Gensim and scikit-learn.
* Designed to be **accessed by API**, not by dumping its output to a database, although that can be done.
* Should be installed in **its own Python environment**.  
  * For example, do `conda create -n spacy` and then do `conda activate spacy`. From there, install SpaCy and everything else you need for your project.

## SpaCy's Object Model
* Note: this is not a true data model, but an object model that bundles data with algorithms (methods).
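
To make the idea of an object model concrete, here is a small sketch of the three objects most often handled: `Doc`, `Token`, and `Span`. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes and needs no downloaded model; the example sentence is invented.

```python
import spacy

# Assumption: a blank pipeline is enough to show the objects; it only tokenizes
nlp = spacy.blank("en")
doc = nlp("Sherlock Holmes lives in London.")

token = doc[0]    # a Token: one word, with methods and attributes attached
span = doc[0:2]   # a Span: a slice of the Doc, addressed by token positions

print(type(doc).__name__, type(token).__name__, type(span).__name__)
print(token.text, token.idx, token.is_alpha)       # idx is the character offset
print(span.text, span.start_char, span.end_char)   # character offsets of the span
```

Each object carries both data (text, offsets) and behavior (slicing, iteration), which is why the output has to be flattened before it fits into tables.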

<img src="https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg" width="500" />

# Set Up

## Config

In [1]:
data_home = "../data"
local_lib = "../lib"
data_prefix = 'novels'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']

## Import Library

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy

In [3]:
spacy.__version__

'3.2.1'

# Import CORPUS

In [4]:
LIB = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [5]:
CORPUS = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [6]:
CORPUS.head()

book_id,chap_id,para_num,sent_num,token_num,pos,term_str
secretadversary,1,0,1,0,DT,the
secretadversary,1,0,1,1,NNP,young
secretadversary,1,0,1,2,NNP,adventurers
secretadversary,1,0,1,3,NNP,ltd
secretadversary,1,1,0,0,JJ,tommy


In [7]:
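def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join the strings in str_col into one doc_str per group of the first ohco_level index levels."""
    OHCO = CORPUS.index.names
    CORPUS[str_col] = CORPUS[str_col].astype('str') # Cast so NaN or numeric tokens do not break join()
    DOC = CORPUS.groupby(OHCO[:ohco_level])[str_col].apply(lambda x: glue.join(x)).to_frame('doc_str')
    return DOC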

## Gather CHAPS

In [8]:
SENTS = gather_docs(CORPUS, 4) # Gather sentences first so that sentence boundaries survive in CHAPS
CHAPS = gather_docs(SENTS, 2, str_col='doc_str', glue='. ')
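
The two-step gathering can be seen on a toy corpus. The sketch below uses an invented six-token table and a simplified helper named `gather` (a stand-in for `gather_docs`, not a replacement) to show how joining sentences with `'. '` leaves a period at each sentence boundary.

```python
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

# Invented toy data: two chapters, three sentences
rows = [
    ('demo', 1, 0, 0, 0, 'a'),
    ('demo', 1, 0, 0, 1, 'scandal'),
    ('demo', 1, 0, 1, 0, 'in'),
    ('demo', 1, 0, 1, 1, 'bohemia'),
    ('demo', 2, 0, 0, 0, 'the'),
    ('demo', 2, 0, 0, 1, 'league'),
]
toy = pd.DataFrame(rows, columns=OHCO + ['term_str']).set_index(OHCO)

def gather(df, level, str_col='term_str', glue=' '):
    """Simplified stand-in for gather_docs: join strings within each group."""
    names = list(df.index.names)[:level]
    return df.groupby(names)[str_col].apply(glue.join).to_frame('doc_str')

toy_sents = gather(toy, 4)                                      # tokens -> sentences
toy_chaps = gather(toy_sents, 2, str_col='doc_str', glue='. ')  # sentences -> chapters
print(toy_chaps)
```

Going straight from tokens to chapters with `glue=' '` would produce one unpunctuated run of words per chapter, leaving a sentence splitter nothing to work with.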

In [9]:
CHAPS

book_id,chap_id,doc_str
adventures,1,a scandal in bohemia. i. to sherlock holmes sh...
adventures,2,the red headed league. i had called upon my fr...
adventures,3,a case of identity. my dear fellow said sherlo...
adventures,4,the boscombe valley mystery. we were seated at...
adventures,5,the five orange pips. when i glance over my no...
...,...,...
udolpho,54,vi. unnatural deeds do breed unnatural trouble...
udolpho,55,vii. but in these cases we still have judgment...
udolpho,56,viii. then fresh tears stood on her cheek as d...
udolpho,57,ix. now my task is smoothly done i can fly or ...


# Use SpaCy

## Load Statistical Models

These are also called "trained pipelines" in the documentation.

**Trained pipelines for English:**
* `en_core_web_sm`: Small English pipeline optimized for CPU, without static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_md`: Medium English pipeline optimized for CPU, with a reduced table of static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_lg`: Large English pipeline optimized for CPU, with a larger table of static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_trf`: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

<img 
     width="500"
     src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/>
     
See <a href="https://spacy.io/usage/processing-pipelines">the docs</a> for more.
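
A pipeline is an ordered list of named components, and components can be switched off when their annotations are not needed. The sketch below illustrates this with a blank English pipeline plus the rule-based `sentencizer`, an assumption made so that it runs without a downloaded model; a trained pipeline exposes the same `pipe_names` and `select_pipes` interface.

```python
import spacy

nlp = spacy.blank("en")        # tokenizer only, no trained components
nlp.add_pipe("sentencizer")    # rule-based sentence splitter
print(nlp.pipe_names)

text = "It was a dark night. The storm was still abroad."
with_sents = nlp(text)

# Temporarily disable a component; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    active = list(nlp.pipe_names)
    without_sents = nlp(text)

print(active, [s.text for s in with_sents.sents])
```

Disabling unused components is mostly a speed optimization, but it also changes which annotations are available on the resulting `Doc`.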


In [11]:
trained_pipeline = 'en_core_web_md'

In [12]:
# !python -m spacy download {trained_pipeline}

In [None]:
# doc = nlp(doc_str) # Processes a single string once nlp has been loaded (next cell)

In [13]:
nlp = spacy.load(trained_pipeline)

## Generate Annotations

In [14]:
# pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
# disable= ["attribute_ruler", "lemmatizer", "parser"]
disable = []
DOCS = [doc.to_json() for doc in nlp.pipe(CHAPS.doc_str.values, disable=disable)]
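
The shape of what `Doc.to_json()` returns is easier to see on a single short text. This sketch again assumes a blank English pipeline with a `sentencizer` in place of the trained pipeline, so only token and sentence offsets are produced; the text is borrowed from the first chapter shown above.

```python
import spacy

# Assumption: a blank pipeline stands in for the trained one loaded above
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = ["a scandal in bohemia. to sherlock holmes she is always the woman."]
docs = [doc.to_json() for doc in nlp.pipe(texts)]

first = docs[0]
# Annotations are stored as character offsets into the original text
tokens = [first["text"][t["start"]:t["end"]] for t in first["tokens"]]
sents = [first["text"][s["start"]:s["end"]] for s in first["sents"]]
print(tokens[:5])
print(sents)
```

Because every annotation is a pair of offsets into `text`, the strings themselves have to be recovered by slicing, which is what the conversion step does.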

## Convert to DataFrames

In [38]:
features = list(DOCS[0].keys())

In [39]:
features

['text', 'ents', 'sents', 'tokens']

In [60]:
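# Build one DataFrame per annotation type ('ents', 'sents', 'tokens'), skipping the raw 'text'
feature_data = {f:[] for f in features}
for i in range(len(DOCS)):    
    text = DOCS[i]['text']
    for feature in features[1:]:
        df = pd.DataFrame(DOCS[i][feature])
        # Recover each annotation's string from its character offsets, e.g. 'ent_str'
        df[f'{feature[:-1]}_str'] = df.apply(lambda x: text[x.start:x.end], 1)
        df['doc_id'] = i # Row position of the chapter in CHAPS
        feature_data[feature].append(df)
    
# Collect the combined tables as attributes of a simple container object
class mySpaCy(): pass
spcy = mySpaCy()
for feature in features[1:]:
    setattr(spcy, feature, pd.concat(feature_data[feature]).rename_axis(f'{feature[:-1]}_id'))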
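
The core of the conversion is slicing the document text by each annotation's `start` and `end` offsets. Here is a minimal sketch on a hand-written dictionary shaped like one item of `DOCS`; the text and offsets are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one item of DOCS
toy_doc = {
    "text": "sherlock holmes met irene adler.",
    "ents": [
        {"start": 0, "end": 15, "label": "PERSON"},
        {"start": 20, "end": 31, "label": "PERSON"},
    ],
}

ents = pd.DataFrame(toy_doc["ents"])
# Slice the text with each row's character offsets to get the entity string
ents["ent_str"] = [toy_doc["text"][s:e] for s, e in zip(ents["start"], ents["end"])]
ents["doc_id"] = 0                    # position of this document in the corpus
ents = ents.rename_axis("ent_id")     # name the per-document row index
print(ents)
```

Note that `ent_id` restarts at 0 for every document, so a row is only uniquely identified by the pair (`doc_id`, `ent_id`).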

In [59]:
spcy.ents

ent_id,start,end,label,ent_str,doc_id
0,28,43,PERSON,sherlock holmes,0
1,244,255,PERSON,irene adler,0
2,1111,1122,PERSON,irene adler,0
3,1334,1339,ORDINAL,first,0
4,1594,1606,DATE,week to week,0
...,...,...,...,...,...
45,35407,35413,ORDINAL,second,319
46,35454,35462,CARDINAL,thousand,319
47,35783,35803,TIME,the last few minutes,319
48,38036,38041,ORDINAL,first,319


## Explore

### TOKEN

In [61]:
spcy.tokens

token_id,id,start,end,tag,pos,morph,lemma,dep,head,token_str,doc_id
0,0,0,1,DT,DET,Definite=Ind|PronType=Art,a,det,1,a,0
1,1,2,9,NN,NOUN,Number=Sing,scandal,ROOT,1,scandal,0
2,2,10,12,IN,ADP,,in,prep,1,in,0
3,3,13,20,NNP,PROPN,Number=Sing,bohemia,pobj,2,bohemia,0
4,4,20,21,.,PUNCT,PunctType=Peri,.,punct,1,.,0
...,...,...,...,...,...,...,...,...,...,...,...
7453,7453,40494,40496,IN,ADP,,of,prep,7452,of,319
7454,7454,40497,40500,DT,DET,Definite=Def|PronType=Art,the,det,7455,the,319
7455,7455,40501,40506,NN,NOUN,Number=Sing,house,pobj,7453,house,319
7456,7456,40507,40509,IN,ADP,,of,prep,7455,of,319


In [62]:
CORPUS

book_id,chap_id,para_num,sent_num,token_num,pos,term_str
secretadversary,1,0,1,0,DT,the
secretadversary,1,0,1,1,NNP,young
secretadversary,1,0,1,2,NNP,adventurers
secretadversary,1,0,1,3,NNP,ltd
secretadversary,1,1,0,0,JJ,tommy
...,...,...,...,...,...,...
baskervilles,11,114,1,7,RBR,more
baskervilles,11,114,1,8,JJ,comfortable
baskervilles,11,114,1,9,IN,outside
baskervilles,11,114,1,10,IN,than


### VOCAB

In [63]:
spcy.VOCAB = spcy.tokens.value_counts('token_str').to_frame('n')

In [64]:
spcy.VOCAB['max_pos'] = spcy.tokens.value_counts(['token_str','pos']).unstack().idxmax(1)

In [65]:
spcy.VOCAB[spcy.VOCAB.max_pos == 'PROPN'].sample(10)

token_str,n,max_pos
grano,1,PROPN
isa,4,PROPN
slav,1,PROPN
dd,1,PROPN
nugent,2,PROPN
lucas,5,PROPN
habitude,1,PROPN
doria,2,PROPN
october,50,PROPN
klux,2,PROPN
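
The `max_pos` step counts every (token, POS) pair, pivots the POS tags into columns, and keeps the most frequent tag per token. A toy sketch with invented tokens shows the mechanics; it uses `Series.value_counts` for the frequency column, which is a slight variation on the cell above.

```python
import pandas as pd

# Invented tokens: 'watch' is tagged as a verb twice and as a noun once
tokens = pd.DataFrame({
    "token_str": ["watch", "watch", "watch", "holmes", "holmes"],
    "pos":       ["VERB", "NOUN", "VERB", "PROPN", "PROPN"],
})

vocab = tokens["token_str"].value_counts().to_frame("n")

# Rows are tokens, columns are POS tags, cells are counts (NaN where unseen)
pos_counts = tokens.value_counts(["token_str", "pos"]).unstack()
vocab["max_pos"] = pos_counts.idxmax(axis=1)
print(vocab)
```

This also explains odd entries in the sample: a word such as `habitude` lands under `PROPN` simply because that was the tagger's most frequent guess for it in lowercased text.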


### ENT

In [66]:
spcy.ents.label.value_counts()

PERSON      11352
CARDINAL     3749
DATE         3060
TIME         3025
GPE          2185
ORDINAL      2107
NORP          985
ORG           966
QUANTITY      270
LOC           116
FAC           108
LANGUAGE       77
MONEY          25
PRODUCT        13
EVENT          11
LAW             2
Name: label, dtype: int64

In [67]:
spcy.ents[spcy.ents.label=='PERSON'].sample(10)

ent_id,start,end,label,ent_str,doc_id
177,48117,48132,PERSON,sherlock holmes,7
36,7349,7355,PERSON,wilson,1
330,69659,69674,PERSON,robinson crusoe,150
46,15086,15095,PERSON,la vallee,316
359,109503,109510,PERSON,raymond,96
121,19559,19575,PERSON,rosanna spearman,116
6,1196,1209,PERSON,coombe tracey,22
0,48,52,PERSON,toby,244
92,25403,25408,PERSON,henry,18
8,2545,2559,PERSON,franklin blake,125


In [68]:
spcy.ents[spcy.ents.label=='PERSON'].value_counts(['doc_id','ent_str']).unstack().sum().sort_values()

ent_str
a. carter          1.0
jephro             1.0
jerome amazed      1.0
jerome heaven      1.0
jessamine          1.0
                 ...  
catherine        350.0
st aubert        352.0
tommy            367.0
franklin         367.0
emily            492.0
Length: 1523, dtype: float64
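
The chained expression above builds a document-by-entity count matrix and then sums over documents, which is why the totals appear as floats. As a sketch on invented data, the same totals can be obtained directly from `ent_str`, while the intermediate matrix remains useful when per-document counts matter.

```python
import pandas as pd

# Invented entity table with the same columns as spcy.ents
ents = pd.DataFrame({
    "doc_id":  [0, 0, 1, 1, 1],
    "ent_str": ["emily", "emily", "emily", "tommy", "tommy"],
    "label":   ["PERSON"] * 5,
})
people = ents[ents.label == "PERSON"]

# Document-by-entity matrix; missing combinations become NaN
doc_by_ent = people.value_counts(["doc_id", "ent_str"]).unstack()
long_way = doc_by_ent.sum().sort_values()

# Direct count of each entity string across all documents
short_way = people["ent_str"].value_counts().sort_values()
print(long_way, short_way, sep="\n")
```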

In [69]:
spcy.ents[spcy.ents.label=='ORG'].sample(10)

ent_id,start,end,label,ent_str,doc_id
264,81807,81822,ORG,lindenberg hole,96
6,1763,1771,ORG,manfreds,31
87,19393,19400,ORG,arthurs,52
32,25227,25234,ORG,matilda,103
171,68790,68805,ORG,madame quesnels,277
207,46778,46797,ORG,the foundations ein,31
1,1445,1461,ORG,chateau le blanc,299
64,11175,11201,ORG,the young adventurers ltd.,209
29,16005,16021,ORG,chateau le blanc,306
174,69385,69399,ORG,madame quesnel,277


In [70]:
spcy.ents[spcy.ents.label=='DATE'].sample(10)

ent_id,start,end,label,ent_str,doc_id
109,59876,59899,DATE,the two succeeding days,94
134,33040,33047,DATE,a month,262
8,2082,2088,DATE,august,44
1,103,111,DATE,thursday,208
123,27532,27545,DATE,two years ago,182
83,16526,16532,DATE,sunday,92
146,44641,44658,DATE,the preceding day,273
16,3183,3196,DATE,ten years old,151
39,11260,11268,DATE,last day,205
32,7241,7256,DATE,the coming week,121


### SENT

In [71]:
spcy.sents

sent_id,start,end,sent_str,doc_id
0,0,21,a scandal in bohemia.,0
1,22,68,i. to sherlock holmes she is always the woman.,0
2,69,126,i have seldom heard him mention her under any ...,0
3,127,190,in his eyes she eclipses and predominates the ...,0
4,191,256,it was not that he felt any emotion akin to lo...,0
...,...,...,...,...
251,39628,39716,the storm was still abroad in all its wrath as...,319
252,39717,39764,suddenly there shot along the path a wild light,319
253,39765,39885,and i turned to see whence a gleam so unusual ...,319
254,39886,40123,the radiance was that of the full setting and ...,319


In [29]:
SENTS

book_id,chap_id,para_num,sent_num,doc_str
adventures,1,0,1,a scandal in bohemia
adventures,1,1,0,i
adventures,1,2,0,to sherlock holmes she is always the woman
adventures,1,2,1,i have seldom heard him mention her under any ...
adventures,1,2,2,in his eyes she eclipses and predominates the ...
...,...,...,...,...
usher,1,47,0,from that chamber and from that mansion i fled...
usher,1,47,1,the storm was still abroad in all its wrath as...
usher,1,47,2,suddenly there shot along the path a wild ligh...
usher,1,47,3,the radiance was that of the full setting and ...


# Save

In [73]:
import sqlite3

In [75]:
with sqlite3.connect(f"{data_home}/output/space-demo.db") as db:
    for feature in features[1:]:
        getattr(spcy, feature).to_sql(feature, db, index=True, if_exists='replace') # One table per annotation type
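
To check that the tables round-trip, the same pattern can be tried against an in-memory database. This is a sketch with two invented one-row tables; the table names mirror the annotation types, and `:memory:` is used so nothing is written to disk.

```python
import sqlite3
import pandas as pd

# Invented stand-ins for spcy.ents and spcy.sents
tables = {
    "ents": pd.DataFrame({"label": ["PERSON"], "ent_str": ["irene adler"]}).rename_axis("ent_id"),
    "sents": pd.DataFrame({"sent_str": ["a scandal in bohemia."]}).rename_axis("sent_id"),
}

db = sqlite3.connect(":memory:")
for name, df in tables.items():
    # One table per annotation type, with the index saved as a column
    df.to_sql(name, db, index=True, if_exists="replace")

saved = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", db)
ents_back = pd.read_sql("SELECT * FROM ents", db)
db.close()
print(saved.name.tolist(), ents_back.columns.tolist())
```

Since `ent_id`, `sent_id`, and `token_id` repeat across documents, queries against the saved tables should filter or join on `doc_id` as well.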