# Metadata

```
course:   DS 5001
module:   00 Final Projects
topic:    Using SpaCy 
author:   R.C. Alvarado
```

# Notes

## How to install

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`
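
After installing, it can help to confirm that both the library and the pipeline package are visible to the active environment. The sketch below uses only the standard library and relies on the fact that SpaCy pipelines are installed as ordinary Python packages. `is_installed` is a hypothetical helper name, not part of SpaCy.

```python
from importlib.util import find_spec

def is_installed(package_name):
    """Return True if a top-level package can be found in the active environment."""
    return find_spec(package_name) is not None

# Report on the library itself and on the small English pipeline
for name in ["spacy", "en_core_web_sm"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, check that the right conda environment is active before reinstalling.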

## About SpaCy

* More than a library; it is an **entire platform** for text processing. It is designed to be integrated into production-level data products.
* Designed for performance. It uses **best-of-breed** tools and can be somewhat opaque.
* **A replacement for NLTK**, especially for linguistic annotation in the preprocessing stages. It can work with Gensim and scikit-learn.
* Designed to be **accessed by API**, not by dumping its results to a database, although that can be done.
* Should be installed in **its own Python environment**.  
  * For example, do `conda create -n spacy` and then do `conda activate spacy`. From there, install SpaCy and everything else you need for your project.

## SpaCy's Object Model
* Note: this is not a true data model, but an object model that bundles data with algorithms (methods).

<img src="https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg" width="500" />
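
To make the idea of "data bundled with methods" concrete, here is a minimal sketch that uses a blank English pipeline, which contains only a tokenizer and so needs no model download. The sample sentence is taken from the corpus used later in this notebook; everything else is illustrative.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components
nlp = spacy.blank("en")
doc = nlp("to sherlock holmes she is always the woman")

# A Doc is a sequence of Token objects; each token exposes attributes
first = doc[0]
print(first.text, first.idx, first.is_alpha)

# Slicing a Doc returns a Span, a view onto the same underlying document
span = doc[1:3]
print(span.text, span.start_char, span.end_char)
```

The character offsets (`idx`, `start_char`, `end_char`) are the same kind of offsets that the notebook later uses to slice strings out of each chapter.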

# Set Up

## Config

In [1]:
data_home = "../data"
local_lib = "../lib"
data_prefix = 'novels'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']

## Import Library

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy

In [3]:
spacy.__version__

'3.2.1'

# Import CORPUS

In [4]:
LIB = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [5]:
CORPUS = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [6]:
CORPUS.head()

| book_id | chap_id | para_num | sent_num | token_num | pos | term_str |
| --- | --- | --- | --- | --- | --- | --- |
| secretadversary | 1 | 0 | 1 | 0 | DT | the |
| secretadversary | 1 | 0 | 1 | 1 | NNP | young |
| secretadversary | 1 | 0 | 1 | 2 | NNP | adventurers |
| secretadversary | 1 | 0 | 1 | 3 | NNP | ltd |
| secretadversary | 1 | 1 | 0 | 0 | JJ | tommy |


In [7]:
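def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join the strings in `str_col` into one document per group at the given OHCO level."""
    OHCO = CORPUS.index.names
    # Cast a copy of the column to str so the caller's frame is not modified
    strings = CORPUS[str_col].astype('str')
    DOC = strings.groupby(level=list(OHCO[:ohco_level])).apply(glue.join).to_frame('doc_str')
    return DOC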

## Gather CHAPS

In [8]:
SENTS = gather_docs(CORPUS, 4) # We do this to preserve sentence boundaries in CHAPs
CHAPS = gather_docs(SENTS, 2, str_col='doc_str', glue='. ')
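
The two-step gathering above can be sketched on a toy corpus. The four-token table below is made up for illustration, and `gather_docs` is repeated in a compact form so that the example runs on its own. Tokens are first joined into sentences with spaces, then sentences are joined into chapters with `'. '`, which puts back the full stops that sentence splitting relies on.

```python
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

# Toy corpus: one chapter, one paragraph, two sentences of two tokens each
toy = pd.DataFrame(
    [
        ('demo', 1, 0, 0, 0, 'a'),
        ('demo', 1, 0, 0, 1, 'scandal'),
        ('demo', 1, 0, 1, 0, 'in'),
        ('demo', 1, 0, 1, 1, 'bohemia'),
    ],
    columns=OHCO + ['term_str'],
).set_index(OHCO)

def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join strings into one document per group at the given OHCO level."""
    names = list(CORPUS.index.names)
    strings = CORPUS[str_col].astype('str')
    return strings.groupby(level=names[:ohco_level]).apply(glue.join).to_frame('doc_str')

SENTS_TOY = gather_docs(toy, 4)                                        # tokens -> sentences
CHAPS_TOY = gather_docs(SENTS_TOY, 2, str_col='doc_str', glue='. ')    # sentences -> chapters

print(SENTS_TOY)
print(CHAPS_TOY)
```

Gathering tokens straight to the chapter level with a space as glue would produce one long unpunctuated string per chapter, which gives a sentence splitter nothing to work with.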

In [9]:
CHAPS

| book_id | chap_id | doc_str |
| --- | --- | --- |
| adventures | 1 | a scandal in bohemia. i. to sherlock holmes sh... |
| adventures | 2 | the red headed league. i had called upon my fr... |
| adventures | 3 | a case of identity. my dear fellow said sherlo... |
| adventures | 4 | the boscombe valley mystery. we were seated at... |
| adventures | 5 | the five orange pips. when i glance over my no... |
| ... | ... | ... |
| udolpho | 54 | vi. unnatural deeds do breed unnatural trouble... |
| udolpho | 55 | vii. but in these cases we still have judgment... |
| udolpho | 56 | viii. then fresh tears stood on her cheek as d... |
| udolpho | 57 | ix. now my task is smoothly done i can fly or ... |


# Use SpaCy

## Load Statistical Models

These are also called "trained pipelines" in the documentation.

**Trained pipelines for English:**
* `en_core_web_sm`: English pipeline optimized for CPU, without static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_md`: English pipeline optimized for CPU, with a reduced table of static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_lg`: English pipeline optimized for CPU, with the full table of static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_trf`: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

<img 
     width="500"
     src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/>
     
See <a href="https://spacy.io/usage/processing-pipelines">the docs</a> for more.
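
A pipeline is an ordered list of named components, and that list can be inspected and adjusted at run time. The sketch below is only an illustration: it assumes a blank pipeline with a rule-based `sentencizer` as a stand-in for a trained one, so that it runs without downloading a model.

```python
import spacy

# Assumption: a blank pipeline stands in for a trained one
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter

# Names of the active components, in the order they are applied
print(nlp.pipe_names)

# Temporarily switch a component off; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    print(nlp.pipe_names)

print(nlp.pipe_names)
```

With a trained pipeline such as `en_core_web_md`, the same `pipe_names` attribute lists the active components from the table above.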

In [11]:
trained_pipeline = 'en_core_web_md'

In [12]:
# !python -m spacy download {trained_pipeline}

In [13]:
nlp = spacy.load(trained_pipeline)

## Generate Annotations

In [14]:
# pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
# disable= ["attribute_ruler", "lemmatizer", "parser"]
disable = []
DOCS = [doc.to_json() for doc in nlp.pipe(CHAPS.doc_str.values, disable=disable)]

## Convert to DataFrames

In [15]:
DOCS[0].keys()

dict_keys(['text', 'ents', 'sents', 'tokens'])
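
The annotations in the exported dictionary do not store strings. They store character offsets into `text`, which is why the next cell slices `text[start:end]`. Here is a minimal sketch of that structure, assuming a blank pipeline with a `sentencizer` in place of the trained pipeline. Such a pipeline has no entity recognizer, so the set of keys it produces can differ from the output above.

```python
import spacy

# Assumption: blank pipeline plus rule-based sentence splitting, no trained model
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("a scandal in bohemia. to sherlock holmes she is always the woman.")
data = doc.to_json()

print(sorted(data.keys()))

# Each token entry holds offsets; the surface string is recovered by slicing
tok = data["tokens"][1]
tok_str = data["text"][tok["start"]:tok["end"]]
print(tok, tok_str)

# Sentence entries work the same way
sent_strs = [data["text"][s["start"]:s["end"]] for s in data["sents"]]
print(sent_strs)
```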

In [16]:
token_data = []
sent_data = []
ent_data = []

# Each annotation records character offsets into the chapter text,
# so its surface string is recovered by slicing text[start:end].
for i, doc in enumerate(DOCS):

    text = doc['text']

    sent = pd.DataFrame(doc['sents'])
    sent['sent_str'] = sent.apply(lambda x: text[x.start:x.end], axis=1)
    sent['doc_id'] = i
    sent_data.append(sent)

    ent = pd.DataFrame(doc['ents'])
    ent['ent_str'] = ent.apply(lambda x: text[x.start:x.end], axis=1)
    ent['doc_id'] = i
    ent_data.append(ent)

    tokens = pd.DataFrame(doc['tokens'])
    tokens['token_str'] = tokens.apply(lambda x: text[x.start:x.end], axis=1)
    tokens['doc_id'] = i
    token_data.append(tokens)

In [17]:
class mySpaCy(): pass

spcy = mySpaCy()
spcy.SENT = pd.concat(sent_data).rename_axis('sent_id') 
spcy.TOKEN = pd.concat(token_data).rename_axis('token_id')
spcy.ENT = pd.concat(ent_data).rename_axis('ent_id')
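
Because each per-chapter frame starts counting at zero, the concatenated `sent_id`, `token_id`, and `ent_id` values repeat across documents and are unique only together with `doc_id`. Since `doc_id` is the row position of the chapter in `CHAPS`, the chapter keys can be attached by position. The following is one possible sketch on made-up stand-ins for `CHAPS` and `SENT`; the names `DOC_KEYS` and `SENT_KEYED` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the CHAPS and SENT tables
CHAPS = pd.DataFrame(
    {"doc_str": ["first chapter text", "second chapter text"]},
    index=pd.MultiIndex.from_tuples(
        [("adventures", 1), ("adventures", 2)], names=["book_id", "chap_id"]
    ),
)
SENT = pd.DataFrame(
    {"start": [0, 6, 0], "end": [5, 18, 6], "doc_id": [0, 0, 1]},
    index=pd.Index([0, 1, 0], name="sent_id"),
)

# Row position in CHAPS -> (book_id, chap_id)
DOC_KEYS = CHAPS.index.to_frame(index=False).rename_axis("doc_id")

# Attach the chapter keys, then index by keys that are unique across the corpus
SENT_KEYED = (
    SENT.reset_index()
    .join(DOC_KEYS, on="doc_id")
    .set_index(["book_id", "chap_id", "sent_id"])
)
print(SENT_KEYED)
```

The same join works for the token and entity tables, which makes it possible to line SpaCy's annotations up with the OHCO index of the original corpus at the chapter level.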

## Explore

### TOKEN

In [18]:
spcy.TOKEN

| token_id | id | start | end | tag | pos | morph | lemma | dep | head | token_str | doc_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | DT | DET | Definite=Ind\|PronType=Art | a | det | 1 | a | 0 |
| 1 | 1 | 2 | 9 | NN | NOUN | Number=Sing | scandal | ROOT | 1 | scandal | 0 |
| 2 | 2 | 10 | 12 | IN | ADP |  | in | prep | 1 | in | 0 |
| 3 | 3 | 13 | 20 | NNP | PROPN | Number=Sing | bohemia | pobj | 2 | bohemia | 0 |
| 4 | 4 | 20 | 21 | . | PUNCT | PunctType=Peri | . | punct | 1 | . | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7453 | 7453 | 40494 | 40496 | IN | ADP |  | of | prep | 7452 | of | 319 |
| 7454 | 7454 | 40497 | 40500 | DT | DET | Definite=Def\|PronType=Art | the | det | 7455 | the | 319 |
| 7455 | 7455 | 40501 | 40506 | NN | NOUN | Number=Sing | house | pobj | 7453 | house | 319 |
| 7456 | 7456 | 40507 | 40509 | IN | ADP |  | of | prep | 7455 | of | 319 |


In [19]:
CORPUS

| book_id | chap_id | para_num | sent_num | token_num | pos | term_str |
| --- | --- | --- | --- | --- | --- | --- |
| secretadversary | 1 | 0 | 1 | 0 | DT | the |
| secretadversary | 1 | 0 | 1 | 1 | NNP | young |
| secretadversary | 1 | 0 | 1 | 2 | NNP | adventurers |
| secretadversary | 1 | 0 | 1 | 3 | NNP | ltd |
| secretadversary | 1 | 1 | 0 | 0 | JJ | tommy |
| ... | ... | ... | ... | ... | ... | ... |
| baskervilles | 11 | 114 | 1 | 7 | RBR | more |
| baskervilles | 11 | 114 | 1 | 8 | JJ | comfortable |
| baskervilles | 11 | 114 | 1 | 9 | IN | outside |
| baskervilles | 11 | 114 | 1 | 10 | IN | than |


### VOCAB

In [20]:
spcy.VOCAB = spcy.TOKEN.value_counts('token_str').to_frame('n')

In [21]:
spcy.VOCAB['max_pos'] = spcy.TOKEN.value_counts(['token_str','pos']).unstack().idxmax(1)
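
The `max_pos` expression chains three steps: count each (string, tag) pair, pivot the tags into columns, and pick the column with the largest count in each row. A toy token table (made up for illustration) shows the intermediate result. `fill_value=0` is added here only so that the pivoted table prints without NaN cells.

```python
import pandas as pd

# Toy token table: "dover" is mostly PROPN, "light" is mostly NOUN
TOKEN = pd.DataFrame({
    "token_str": ["dover", "dover", "dover", "light", "light", "light"],
    "pos":       ["PROPN", "PROPN", "NOUN", "NOUN", "NOUN", "ADJ"],
})

# One row per token string, one column per tag, cells hold counts
counts = TOKEN.value_counts(["token_str", "pos"]).unstack(fill_value=0)
print(counts)

# For each token string, the tag with the highest count
max_pos = counts.idxmax(axis=1)
print(max_pos)
```

When two tags are tied, `idxmax` returns the first one in column order, so a `max_pos` value for a rare word should be read with some caution.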

In [22]:
spcy.VOCAB[spcy.VOCAB.max_pos == 'PROPN'].sample(10)

| token_str | n | max_pos |
| --- | --- | --- |
| pauls | 6 | PROPN |
| rambler | 1 | PROPN |
| moren | 2 | PROPN |
| dover | 5 | PROPN |
| jura | 5 | PROPN |
| wolfenbach | 1 | PROPN |
| eine | 1 | PROPN |
| francesco | 1 | PROPN |
| bradley | 1 | PROPN |
| palladio | 1 | PROPN |


### ENT

In [23]:
spcy.ENT.label.value_counts()

PERSON      11352
CARDINAL     3749
DATE         3060
TIME         3025
GPE          2185
ORDINAL      2107
NORP          985
ORG           966
QUANTITY      270
LOC           116
FAC           108
LANGUAGE       77
MONEY          25
PRODUCT        13
EVENT          11
LAW             2
Name: label, dtype: int64

In [24]:
spcy.ENT[spcy.ENT.label=='PERSON'].sample(10)

| ent_id | start | end | label | ent_str | doc_id |
| --- | --- | --- | --- | --- | --- |
| 20 | 8933 | 8940 | PERSON | signora | 287 |
| 196 | 64365 | 64385 | PERSON | don gastons daughter | 96 |
| 35 | 5767 | 5774 | PERSON | gregson | 200 |
| 22 | 9903 | 9909 | PERSON | agatha | 80 |
| 68 | 42886 | 42894 | PERSON | antonias | 103 |
| 31 | 9458 | 9466 | PERSON | isabella | 168 |
| 67 | 18380 | 18385 | PERSON | henry | 18 |
| 296 | 60373 | 60388 | PERSON | robinson crusoe | 149 |
| 29 | 11399 | 11411 | PERSON | lucy ferrier | 206 |
| 6 | 5716 | 5723 | PERSON | justine | 90 |


In [25]:
spcy.ENT[spcy.ENT.label=='PERSON'].value_counts(['doc_id','ent_str']).unstack().sum().sort_values()

ent_str
a. carter          1.0
jephro             1.0
jerome amazed      1.0
jerome heaven      1.0
jessamine          1.0
                 ...  
catherine        350.0
st aubert        352.0
tommy            367.0
franklin         367.0
emily            492.0
Length: 1523, dtype: float64
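
The totals above come out as floats because the per-document pivot contains NaN wherever a name does not occur in a chapter. A small made-up entity table illustrates this, and also shows that counting the strings directly gives the same totals as integers. The pivot is still useful when the per-chapter breakdown itself is of interest.

```python
import pandas as pd

# Toy entity table: "emily" occurs in both documents, "tommy" only in the second
ENT = pd.DataFrame({
    "doc_id":  [0, 0, 1, 1, 1],
    "label":   ["PERSON"] * 5,
    "ent_str": ["emily", "emily", "emily", "tommy", "tommy"],
})
PERSON = ENT[ENT.label == "PERSON"]

# Per-document counts, pivoted so that each entity string is a column
by_doc = PERSON.value_counts(["doc_id", "ent_str"]).unstack()
print(by_doc)

# Summing over documents gives corpus totals (floats, because of the NaN cells)
totals = by_doc.sum().sort_values()
print(totals)

# The same totals as integers, counted directly
direct = PERSON.ent_str.value_counts().sort_values()
print(direct)
```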

In [26]:
spcy.ENT[spcy.ENT.label=='ORG'].sample(10)

| ent_id | start | end | label | ent_str | doc_id |
| --- | --- | --- | --- | --- | --- |
| 118 | 42397 | 42405 | ORG | caterina | 279 |
| 37 | 12522 | 12531 | ORG | godalming | 61 |
| 4 | 1785 | 1803 | ORG | the parisian dames | 96 |
| 93 | 32950 | 32965 | ORG | hampstead heath | 49 |
| 22 | 3228 | 3234 | ORG | turner | 3 |
| 101 | 28301 | 28311 | ORG | margaretta | 262 |
| 34 | 10893 | 10897 | ORG | agra | 248 |
| 31 | 3633 | 3656 | ORG | the breakfast bell rang | 122 |
| 87 | 12571 | 12577 | ORG | conrad | 224 |
| 81 | 20647 | 20665 | ORG | knights place toby | 243 |


In [27]:
spcy.ENT[spcy.ENT.label=='DATE'].sample(10)

| ent_id | start | end | label | ent_str | doc_id |
| --- | --- | --- | --- | --- | --- |
| 58 | 16170 | 16175 | DATE | today | 93 |
| 55 | 14428 | 14440 | DATE | twenty years | 189 |
| 82 | 22844 | 22853 | DATE | yesterday | 252 |
| 45 | 10711 | 10717 | DATE | annual | 190 |
| 68 | 18078 | 18088 | DATE | many years | 129 |
| 63 | 12787 | 12803 | DATE | the month of may | 73 |
| 82 | 17374 | 17386 | DATE | the next day | 183 |
| 6 | 2546 | 2556 | DATE | many years | 319 |
| 148 | 75887 | 75894 | DATE | the day | 94 |
| 40 | 12700 | 12716 | DATE | twelve years old | 188 |


### SENT

In [28]:
spcy.SENT

| sent_id | start | end | sent_str | doc_id |
| --- | --- | --- | --- | --- |
| 0 | 0 | 21 | a scandal in bohemia. | 0 |
| 1 | 22 | 68 | i. to sherlock holmes she is always the woman. | 0 |
| 2 | 69 | 126 | i have seldom heard him mention her under any ... | 0 |
| 3 | 127 | 190 | in his eyes she eclipses and predominates the ... | 0 |
| 4 | 191 | 256 | it was not that he felt any emotion akin to lo... | 0 |
| ... | ... | ... | ... | ... |
| 251 | 39628 | 39716 | the storm was still abroad in all its wrath as... | 319 |
| 252 | 39717 | 39764 | suddenly there shot along the path a wild light | 319 |
| 253 | 39765 | 39885 | and i turned to see whence a gleam so unusual ... | 319 |
| 254 | 39886 | 40123 | the radiance was that of the full setting and ... | 319 |


In [29]:
SENTS

| book_id | chap_id | para_num | sent_num | doc_str |
| --- | --- | --- | --- | --- |
| adventures | 1 | 0 | 1 | a scandal in bohemia |
| adventures | 1 | 1 | 0 | i |
| adventures | 1 | 2 | 0 | to sherlock holmes she is always the woman |
| adventures | 1 | 2 | 1 | i have seldom heard him mention her under any ... |
| adventures | 1 | 2 | 2 | in his eyes she eclipses and predominates the ... |
| ... | ... | ... | ... | ... |
| usher | 1 | 47 | 0 | from that chamber and from that mansion i fled... |
| usher | 1 | 47 | 1 | the storm was still abroad in all its wrath as... |
| usher | 1 | 47 | 2 | suddenly there shot along the path a wild ligh... |
| usher | 1 | 47 | 3 | the radiance was that of the full setting and ... |


# Save

In [32]:
import sqlite3

In [33]:
tables = 'ENT SENT TOKEN VOCAB'.split()
with sqlite3.connect(f"{data_home}/output/space-demo.db") as db:
    for table_name in tables:
        table = getattr(spcy, table_name)
        table.to_sql(table_name, db, index=True, if_exists='replace')
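
To check what was written, a table can be read back with `pd.read_sql`. The sketch below uses a made-up two-row `VOCAB` and an in-memory database as a stand-in for the file above. Note that `sqlite3`'s `with` block commits the transaction but does not close the connection, so the example closes it explicitly.

```python
import sqlite3
import pandas as pd

# Toy stand-in for spcy.VOCAB
VOCAB = pd.DataFrame(
    {"n": [5, 1], "max_pos": ["PROPN", "PROPN"]},
    index=pd.Index(["dover", "rambler"], name="token_str"),
)

# Assumption: ":memory:" is used for the demo; a file path works the same way
db = sqlite3.connect(":memory:")
try:
    VOCAB.to_sql("VOCAB", db, index=True, if_exists="replace")
    # index_col restores the index that to_sql wrote out as an ordinary column
    VOCAB2 = pd.read_sql("SELECT * FROM VOCAB", db, index_col="token_str")
finally:
    db.close()

print(VOCAB2)
```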