# Metadata

```yaml
Course:   DS 5001
Module:   92 Helper Notebooks
Topic:    Using SpaCy 
Author:   R.C. Alvarado
```

# Notes

## How to install

* `conda install -c conda-forge spacy`
* `python -m spacy download en_core_web_sm`

## About SpaCy

* More than a library; it is an **entire platform** for text processing. It is designed to be integrated into production-level data products.
* Designed for performance. It uses **best of breed** tools and can be somewhat opaque.
* **A replacement for NLTK**, especially for linguistic annotation in the preprocessing stages. It can work with Gensim and scikit-learn.
* Designed to be **accessed by API**, not by dumping its results to a database, although that can be done.
* Should be installed in **its own Python environment**.  
  * For example, do `conda create -n spacy` and then do `conda activate spacy`. From there, install SpaCy and everything else you need for your project.

## SpaCy's Object Model

Note: this is not a true data model, but an object model that bundles data with algorithms (methods).

<img src="images/spacy-architecture.svg" width="500" />

# Set Up

## Config

In [1]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']
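
The cell above expects an `env.ini` file with three keys in its `[DEFAULT]` section. As a minimal sketch, the file might look like the string below; the paths are placeholders chosen for illustration, not the author's actual values. Note that `config.read()` silently skips a missing file and returns the list of files it did read, so a wrong relative path shows up later as a `KeyError`.

```python
import configparser

# Hypothetical contents of env.ini; the paths are placeholders, not real locations.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
local_lib = /path/to/lib
"""

config = configparser.ConfigParser()
# read_string() parses the text directly, so this sketch needs no file on disk.
config.read_string(example_ini)

# Pull out the same three keys the notebook uses.
settings = {key: config['DEFAULT'][key]
            for key in ('data_home', 'output_dir', 'local_lib')}
print(settings)
```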

In [2]:
data_prefix = 'austen-melville'
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

## Import Library

In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy

In [4]:
spacy.__version__

'3.7.2'

# Import CORPUS

In [6]:
LIB = pd.read_csv(f"{output_dir}/{data_prefix}-LIB.csv").set_index(OHCO[:1])

In [8]:
CORPUS = pd.read_csv(f"{output_dir}/{data_prefix}-CORPUS.csv").set_index(OHCO)

In [9]:
CORPUS.head()

| book_id | chap_id | para_num | sent_num | token_num | pos_tuple | pos | token_str | term_str | pos_group |
|---|---|---|---|---|---|---|---|---|---|
| 105 | 1 | 1 | 0 | 0 | ('Sir', 'NNP') | NNP | Sir | sir | NN |
| 105 | 1 | 1 | 0 | 1 | ('Walter', 'NNP') | NNP | Walter | walter | NN |
| 105 | 1 | 1 | 0 | 2 | ('Elliot,', 'NNP') | NNP | Elliot, | elliot | NN |
| 105 | 1 | 1 | 0 | 3 | ('of', 'IN') | IN | of | of | IN |
| 105 | 1 | 1 | 0 | 4 | ('Kellynch', 'NNP') | NNP | Kellynch | kellynch | NN |


In [10]:
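def gather_docs(CORPUS, ohco_level, str_col='term_str', glue=' '):
    """Join the strings in `str_col` into one document per group at the given OHCO level."""
    OHCO = list(CORPUS.index.names)
    # Cast to str on a new Series so NaN values can be joined and the caller's table is not modified
    strings = CORPUS[str_col].astype('str')
    DOC = strings.groupby(level=OHCO[:ohco_level]).apply(lambda x: glue.join(x)).to_frame('doc_str')
    return DOC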

## Gather CHAPS

In [11]:
SENTS = gather_docs(CORPUS, 4) # We do this to preserve sentence boundaries in CHAPs
CHAPS = gather_docs(SENTS, 2, str_col='doc_str', glue='. ')
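
To make the two-step gathering concrete, here is a minimal sketch on a toy table. The rows and the helper name `gather` are invented for illustration; the helper is a simplified stand-in for `gather_docs`, not the notebook's function. Tokens are first joined into sentences with a space, and sentences are then joined into chapters with `'. '`, so that a sentence splitter can later recover the boundaries.

```python
import pandas as pd

OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

# Toy token table (invented data): one book, two chapters, three sentences.
rows = [
    (1, 1, 0, 0, 0, 'call'), (1, 1, 0, 0, 1, 'me'), (1, 1, 0, 0, 2, 'ishmael'),
    (1, 1, 0, 1, 0, 'some'), (1, 1, 0, 1, 1, 'years'), (1, 1, 0, 1, 2, 'ago'),
    (1, 2, 0, 0, 0, 'the'), (1, 2, 0, 0, 1, 'carpet'), (1, 2, 0, 0, 2, 'bag'),
]
toy = pd.DataFrame(rows, columns=OHCO + ['term_str']).set_index(OHCO)

def gather(df, level, str_col, glue):
    """Hypothetical helper: join str_col within the first `level` index levels."""
    keys = list(df.index.names[:level])
    return df.groupby(keys)[str_col].apply(glue.join).to_frame('doc_str')

sents = gather(toy, 4, 'term_str', ' ')    # tokens -> sentences
chaps = gather(sents, 2, 'doc_str', '. ')  # sentences -> chapters
print(chaps)
```

Joining tokens straight into chapters with a space would lose the sentence breaks, because `term_str` has already had its punctuation removed.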

In [12]:
CHAPS

| book_id | chap_id | doc_str |
|---|---|---|
| 105 | 1 | sir walter elliot of kellynch hall in somerset... |
| 105 | 2 | mr shepherd a civil cautious lawyer who whatev... |
| 105 | 3 | i must take leave to observe sir walter said m... |
| 105 | 4 | he was not mr wentworth the former curate of m... |
| 105 | 5 | on the morning appointed for admiral and mrs c... |
| ... | ... | ... |
| 34970 | 110 | in the midst of all these mental confusions th... |
| 34970 | 111 | gaining the apostles and leaving his two compa... |
| 34970 | 112 | pierre passed on to a remote quarter of the bu... |
| 34970 | 113 | that sundown pierre stood solitary in a low du... |


# Use SpaCy

## Load Statistical Models

These are also called "trained pipelines" in the documentation.

**Trained pipelines for English:**
* `en_core_web_sm`: English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_md`:  English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_lg`:  English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
* `en_core_web_trf`: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

<img 
     width="500"
     src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/>
     
See <a href="https://spacy.io/usage/processing-pipelines">the docs</a> for more.

In [13]:
trained_pipeline = 'en_core_web_md'

In [14]:
# !python -m spacy download {trained_pipeline}

In [15]:
# doc = nlp(doc_str)  # processes a single string; requires `nlp` from spacy.load() below

In [16]:
nlp = spacy.load(trained_pipeline)

## Generate Annotations

In [17]:
# pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
# disable= ["attribute_ruler", "lemmatizer", "parser"]
disable = []
DOCS = [doc.to_json() for doc in nlp.pipe(CHAPS.doc_str.values, disable=disable)]

## Convert to DataFrames

In [None]:
features = list(DOCS[0].keys())

In [None]:
features

In [None]:
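# Collect one DataFrame per document for each annotation type
feature_data = {f: [] for f in features}
for i, doc in enumerate(DOCS):
    text = doc['text']
    # features[0] is assumed to be 'text'; the rest (typically 'ents', 'sents', 'tokens')
    # are lists of dicts carrying character offsets into that text
    for feature in features[1:]:
        df = pd.DataFrame(doc[feature])
        # Recover each span's string by slicing the document text, e.g. 'ents' -> 'ent_str'
        df[f'{feature[:-1]}_str'] = df.apply(lambda x: text[x.start:x.end], 1)
        df['doc_id'] = i  # position of the chapter in CHAPS
        feature_data[feature].append(df)

# Bare namespace object that holds one combined table per annotation type
class mySpaCy(): pass
spcy = mySpaCy()
for feature in features[1:]:
    setattr(spcy, feature, pd.concat(feature_data[feature]).rename_axis(f'{feature[:-1]}_id'))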
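
The conversion above relies on each annotation carrying `start` and `end` character offsets into the document text. The sketch below shows the idea on a hand-written dictionary shaped like a `doc.to_json()` result. The dictionary is an assumption made for illustration: it is not real spaCy output, and real token records carry more fields (such as `tag`, `lemma`, and `dep`).

```python
import pandas as pd

# Hand-written stand-in for one doc.to_json() result (not produced by spaCy).
doc_json = {
    'text': 'Sir Walter lived at Kellynch.',
    'ents': [
        {'start': 4, 'end': 10, 'label': 'PERSON'},
        {'start': 20, 'end': 28, 'label': 'GPE'},
    ],
    'sents': [{'start': 0, 'end': 29}],
    'tokens': [
        {'id': 0, 'start': 0, 'end': 3, 'pos': 'PROPN'},
        {'id': 1, 'start': 4, 'end': 10, 'pos': 'PROPN'},
        {'id': 2, 'start': 11, 'end': 16, 'pos': 'VERB'},
        {'id': 3, 'start': 17, 'end': 19, 'pos': 'ADP'},
        {'id': 4, 'start': 20, 'end': 28, 'pos': 'PROPN'},
        {'id': 5, 'start': 28, 'end': 29, 'pos': 'PUNCT'},
    ],
}

text = doc_json['text']
tables = {}
for feature in ('ents', 'sents', 'tokens'):
    df = pd.DataFrame(doc_json[feature])
    name = feature[:-1]  # 'ents' -> 'ent', 'sents' -> 'sent', 'tokens' -> 'token'
    # Slice the text with each row's character offsets to recover the span string.
    df[f'{name}_str'] = [text[s:e] for s, e in zip(df['start'], df['end'])]
    df['doc_id'] = 0
    tables[feature] = df.rename_axis(f'{name}_id')

print(tables['ents'])
```

Because the tables from every document are concatenated, the `*_id` index restarts at 0 for each document; a row is only unique in combination with `doc_id`.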

In [None]:
spcy.ents

## Explore

### TOKEN

In [None]:
spcy.tokens

In [None]:
CORPUS

### VOCAB

In [None]:
spcy.VOCAB = spcy.tokens.value_counts('token_str').to_frame('n')

In [None]:
spcy.VOCAB['max_pos'] = spcy.tokens.value_counts(['token_str','pos']).unstack().idxmax(1)
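
The `max_pos` column assigns each token string the part-of-speech tag it most often receives. A minimal sketch of that chain on an invented token table follows; the data and the names `vocab` and `pos_counts` are illustrative only.

```python
import pandas as pd

# Invented token table: each row is one token occurrence with its POS tag.
tokens = pd.DataFrame({
    'token_str': ['Walter', 'Walter', 'Walter', 'hall', 'hall', 'hall'],
    'pos':       ['PROPN',  'PROPN',  'NOUN',   'NOUN', 'PROPN', 'NOUN'],
})

# Term frequency per token string.
vocab = tokens['token_str'].value_counts().to_frame('n')

# Count each (token_str, pos) pair, then pivot the POS tags into columns.
pos_counts = tokens.value_counts(['token_str', 'pos']).unstack(fill_value=0)

# idxmax along the columns returns the tag with the highest count for each token.
vocab['max_pos'] = pos_counts.idxmax(axis=1)
print(vocab)
```

When two tags tie for a token, `idxmax` returns the first of the tied columns, so `max_pos` is best treated as a rough label rather than a definitive tag.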

In [None]:
spcy.VOCAB[spcy.VOCAB.max_pos == 'PROPN'].sample(10)

### ENT

In [None]:
spcy.ents.label.value_counts()

In [None]:
spcy.ents[spcy.ents.label=='PERSON'].sample(10)

In [None]:
spcy.ents[spcy.ents.label=='PERSON'].value_counts(['doc_id','ent_str']).unstack().sum().sort_values()
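
The chain above counts each name per document, pivots the names into columns, and then sums over documents, which yields the total number of mentions per name in ascending order. The sketch below, on invented entity rows, suggests that the totals match a direct `value_counts` on `ent_str`. The intermediate document-by-name matrix is still useful on its own if per-chapter counts are wanted.

```python
import pandas as pd

# Invented entity rows spread over two documents.
ents = pd.DataFrame({
    'doc_id':  [0, 0, 0, 1, 1],
    'label':   ['PERSON'] * 5,
    'ent_str': ['Anne', 'Anne', 'Walter', 'Anne', 'Ahab'],
})
people = ents[ents.label == 'PERSON']

# Same chain as the notebook cell: per-document counts, pivoted, then summed.
totals = people.value_counts(['doc_id', 'ent_str']).unstack().sum().sort_values()

# Direct route to the same totals (integers rather than floats).
direct = people['ent_str'].value_counts(ascending=True)

print(totals)
```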

In [None]:
spcy.ents[spcy.ents.label=='ORG'].sample(10)

In [None]:
spcy.ents[spcy.ents.label=='DATE'].sample(10)

### SENT

In [None]:
spcy.sents

In [None]:
SENTS

# Save

In [None]:
import sqlite3

In [None]:
with sqlite3.connect(f"{data_home}/output/space-demo.db") as db:
    for feature in features[1:]:
        getattr(spcy, feature).to_sql(feature, db, index=True, if_exists='replace')
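
As a check on the save step, the following sketch writes a small invented entity table to an in-memory SQLite database and reads it back. The table contents are illustrative only. One detail worth knowing: using a `sqlite3` connection as a context manager commits or rolls back the transaction but does not close the connection, so this sketch wraps it in `contextlib.closing`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Invented entity table shaped like the notebook's `ents` table.
ents = pd.DataFrame({
    'start':   [4, 20],
    'end':     [10, 28],
    'label':   ['PERSON', 'GPE'],
    'ent_str': ['Walter', 'Kellynch'],
    'doc_id':  [0, 0],
}).rename_axis('ent_id')

# ':memory:' keeps the database in RAM; closing() guarantees the connection is closed.
with closing(sqlite3.connect(':memory:')) as db:
    ents.to_sql('ents', db, index=True, if_exists='replace')
    restored = pd.read_sql('SELECT * FROM ents', db, index_col='ent_id')

print(restored)
```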