# Metadata

```
course:   DS 5001
module:   00 Final Projects
topic:    Using SpaCy 
author:   R.C. Alvarado
```

# Notes

## How to install

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`
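
After running the two commands above, it can help to confirm that both the library and the language package are visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and it assumes the language package is installed as an importable top-level package named `en_core_web_sm`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be located as an importable top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Check the library and the English package named in the install steps
status = {name: is_installed(name) for name in ("spacy", "en_core_web_sm")}
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

If either entry prints `missing`, rerun the corresponding install command in the same environment that the notebook kernel uses.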

## About SpaCy

<img src="https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg" width="500" />

# Set Up

## Config

In [1]:
data_home = "../data"
local_lib = "../lib"
data_prefix = 'novels'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']

## Import Library

In [23]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy

# Load English resource package

**Trained pipeline packages for English:**
* `en_core_web_sm`: English pipeline optimized for CPU, without static word vectors. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_md`: English pipeline optimized for CPU, with a reduced word-vector table. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_lg`: English pipeline optimized for CPU, with a large word-vector table. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_trf`: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

In [3]:
lang_mod = 'en_core_web_sm'

In [4]:
nlp = spacy.load(lang_mod)
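
The component list of a loaded pipeline can be checked directly rather than read from the package description. The sketch below is one way to do that; the helper name `list_components` is hypothetical, and it returns `None` instead of raising when spaCy or the package is not installed, so the result depends on the local environment.

```python
def list_components(model_name="en_core_web_sm"):
    """Return the component names of an installed spaCy package, or None if unavailable."""
    try:
        import spacy
    except ImportError:
        # spaCy itself is not installed in this environment
        return None
    try:
        nlp = spacy.load(model_name)
    except OSError:
        # spaCy raises OSError when the named package cannot be found
        return None
    return list(nlp.pipe_names)


components = list_components()
print(components)
```

The names printed here are the ones that can be passed to the `disable` argument when the pipeline is run.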

# Import CORPUS

In [5]:
LIB = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [6]:
CORPUS = pd.read_csv(f"{data_home}/{data_prefix}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [7]:
CORPUS.head()

book_id,chap_id,para_num,sent_num,token_num,pos,term_str
secretadversary,1,0,1,0,DT,the
secretadversary,1,0,1,1,NNP,young
secretadversary,1,0,1,2,NNP,adventurers
secretadversary,1,0,1,3,NNP,ltd
secretadversary,1,1,0,0,JJ,tommy


In [8]:
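def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join the strings in str_col into one doc_str per group at the given OHCO level."""
    OHCO = CORPUS.index.names
    # Cast to str on a new Series so the caller's data frame is not modified
    strs = CORPUS[str_col].astype('str')
    DOC = strs.groupby(level=list(OHCO[:ohco_level])).apply(glue.join).to_frame('doc_str')
    return DOC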

In [9]:
SENTS = gather_docs(CORPUS, 4)
CHAPS = gather_docs(SENTS, 2, str_col='doc_str', glue='. ')
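
To make the two-step gathering concrete, here is a small sketch that applies the same idea to a toy token table. The data frame `toy` and its values are invented for illustration, and the function is restated (without modifying its input) so that the block runs on its own.

```python
import pandas as pd

OHCO = ["book_id", "chap_id", "para_num", "sent_num", "token_num"]


def gather_docs(CORPUS, ohco_level, str_col="term_str", glue=" "):
    """Join the strings in str_col into one doc_str per group at the given OHCO level."""
    levels = list(CORPUS.index.names[:ohco_level])
    strs = CORPUS[str_col].astype("str")
    return strs.groupby(level=levels).apply(glue.join).to_frame("doc_str")


# Invented toy corpus: one chapter, two paragraphs, one sentence each
toy = pd.DataFrame(
    {
        "book_id": ["toy"] * 5,
        "chap_id": [1, 1, 1, 1, 1],
        "para_num": [0, 0, 0, 1, 1],
        "sent_num": [0, 0, 0, 0, 0],
        "token_num": [0, 1, 2, 0, 1],
        "term_str": ["the", "young", "adventurers", "tommy", "said"],
    }
).set_index(OHCO)

# Tokens -> sentences (first four OHCO levels), then sentences -> chapters
sents = gather_docs(toy, 4)
chaps = gather_docs(sents, 2, str_col="doc_str", glue=". ")
print(chaps)
```

Joining sentences with `'. '` restores a sentence boundary marker that the tokenized corpus no longer carries, which gives the parser something to segment on.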

In [10]:
CHAPS

book_id,chap_id,doc_str
adventures,1,a scandal in bohemia. i. to sherlock holmes sh...
adventures,2,the red headed league. i had called upon my fr...
adventures,3,a case of identity. my dear fellow said sherlo...
adventures,4,the boscombe valley mystery. we were seated at...
adventures,5,the five orange pips. when i glance over my no...
...,...,...
udolpho,54,vi. unnatural deeds do breed unnatural trouble...
udolpho,55,vii. but in these cases we still have judgment...
udolpho,56,viii. then fresh tears stood on her cheek as d...
udolpho,57,ix. now my task is smoothly done i can fly or ...


# Use SpaCy

In [68]:
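# Names of pipeline components to skip; anything not listed still runs
# disable=["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
disable = ["attribute_ruler", "lemmatizer", "tok2vec", "parser"]
# nlp.pipe streams the chapter strings and yields one Doc per chapter
M = nlp.pipe(CHAPS.doc_str.values, disable=disable)
# Serialize each Doc to a plain dict
DOCS = [doc.to_json() for doc in M]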

In [69]:
DOCS[0].keys()

dict_keys(['text', 'ents', 'sents', 'tokens'])

In [79]:
# DOCS[0]['text']

In [86]:
pd.DataFrame(DOCS[0]['ents']).set_index(['start','end'])

start,end,label
22,24,ORG
244,255,PERSON
279,282,CARDINAL
663,675,PERSON
890,894,PERSON
...,...,...
43254,43259,CARDINAL
44339,44358,TIME
44550,44572,GPE
44611,44617,PERSON
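
A quick way to summarize an entity table like this one is to count the labels. The sketch below uses a hand-copied list holding only the first five rows shown above, in the same shape as `DOCS[0]['ents']`; it is an illustration, not a summary of the full chapter.

```python
from collections import Counter

# First five entity rows from the output above, as a list of dicts
ents = [
    {"start": 22, "end": 24, "label": "ORG"},
    {"start": 244, "end": 255, "label": "PERSON"},
    {"start": 279, "end": 282, "label": "CARDINAL"},
    {"start": 663, "end": 675, "label": "PERSON"},
    {"start": 890, "end": 894, "label": "PERSON"},
]

# Tally how often each entity label occurs
label_counts = Counter(ent["label"] for ent in ents)
print(label_counts.most_common())
```

With real output, the same call would be `Counter(ent["label"] for ent in DOCS[0]["ents"])`.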


In [84]:
pd.DataFrame(DOCS[0]['sents']).set_index(['start','end'])

start,end
0,21
22,43
44,68
69,126
127,190
...,...
44360,44495
44496,44646
44647,44697
44698,44737


In [85]:
pd.DataFrame(DOCS[0]['tokens']).set_index(['start','end'])

start,end,id,pos,tag,dep,head
0,1,0,DET,DT,det,1
2,9,1,NOUN,NN,ROOT,1
10,12,2,ADP,IN,prep,1
13,20,3,NOUN,NN,pobj,2
20,21,4,PUNCT,.,punct,1
...,...,...,...,...,...,...
44831,44841,9291,ADJ,JJ,amod,9292
44842,44847,9292,NOUN,NN,pobj,9289
44848,44850,9293,ADP,IN,prep,9292
44851,44854,9294,DET,DT,det,9295


In [65]:
span = DOCS[0]['tokens'][6]['start'], DOCS[0]['tokens'][6]['end']

In [66]:
DOCS[0]['text'][span[0]:span[1]]

'to'
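
The same slicing can be applied to every token at once, which recovers the token strings that the `tokens` entries do not carry themselves. The sketch below uses a hand-built stand-in for one element of `DOCS`, limited to the first five token rows shown earlier; the helper name `add_token_strings` is hypothetical.

```python
# Hand-built stand-in for one element of DOCS, covering only the first sentence
doc_json = {
    "text": "a scandal in bohemia.",
    "tokens": [
        {"id": 0, "start": 0, "end": 1, "pos": "DET", "tag": "DT"},
        {"id": 1, "start": 2, "end": 9, "pos": "NOUN", "tag": "NN"},
        {"id": 2, "start": 10, "end": 12, "pos": "ADP", "tag": "IN"},
        {"id": 3, "start": 13, "end": 20, "pos": "NOUN", "tag": "NN"},
        {"id": 4, "start": 20, "end": 21, "pos": "PUNCT", "tag": "."},
    ],
}


def add_token_strings(doc_json):
    """Return copies of the token dicts with a token_str sliced from the text."""
    text = doc_json["text"]
    return [
        dict(tok, token_str=text[tok["start"]:tok["end"]])
        for tok in doc_json["tokens"]
    ]


rows = add_token_strings(doc_json)
for row in rows:
    print(row["id"], row["pos"], row["token_str"])
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as the raw `tokens` list.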

In [88]:
# DOCS[0][0]

In [61]:
nlp.pipe?

Signature:
nlp.pipe(
    texts,
    as_tuples=False,
    n_threads=-1,
    batch_size=1000,
    disable=[],
    cleanup=False,
    component_cfg=None,
    n_process=1,
)
Docstring:
Process texts as a stream, and yield `Doc` objects in order.

texts (iterator): A sequence of texts to process.
as_tuples (bool): If set to True, inputs should be a sequence of
    (text, context)