# SpaCy Tutorial

## Install spaCy

In [3]:
# # Install spaCy
# !pip install spacy

In [4]:
import spacy

## Download pretrained model

SpaCy supports multiple languages, such as English, German, French, Spanish, Portuguese, Italian, Dutch, Greek, Japanese, etc.  

Here, we will use a small pretrained English model. A large model is slower but more accurate, while a small model is faster but less accurate and has fewer features (e.g. no word vectors).

The model file contains weights, vocabulary, and model pipeline meta information.

In [5]:
# # Download
# !python -m spacy download en_core_web_sm
# # Restart kernel after the model is downloaded

## Load model

In [6]:
# Create an nlp object
nlp = spacy.load("en_core_web_sm")

In [7]:
# Example document
document = """Hey, I'm Yuibi. I work very hard at XYZ, Inc. as a data scientist. \
It's located in San Antonio, TX, which is the best city! 😊 \
You can buy 3 tacos for $2 as of December 13th, 2019. lol"""

In [8]:
# Created by processing a string of text with the nlp object
doc = nlp(document)

## Tokenization
Tokenization chops sentences into pieces called tokens. For English, it normally uses whitespace as a separator, with special treatment for punctuation, emoji, etc. SpaCy allows customization of most of its features (e.g. adding the infix "-" as a separator for tokenization).  

![token](img/token.png)  
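As a rough illustration of whitespace splitting with special treatment for punctuation and emoji, the sketch below uses only Python's `re` module. The pattern and the `simple_tokenize` name are assumptions made for this example; spaCy's tokenizer has its own, much richer rule set (for instance, it keeps "Inc." as one token, which this pattern would split).

```python
import re

# Toy pattern (an assumption, NOT spaCy's rules): a clitic such as 'm or 's,
# a run of word characters, or a single non-space symbol (punctuation, emoji).
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, clitic, and symbol tokens; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Hey, I'm Yuibi! 😊")
print(tokens)
```

Listing the clitic alternative first is what keeps `'m` together as one token instead of splitting off the apostrophe.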

**Advanced:** Some languages do not use whitespace to separate words. In those cases, do either of the following:
1. Segment text into morphemes before tokenization.  
2. Use byte-pair encoding (BPE) to create sub-word units.  

Sub-word units are also effective for English, as shown by Google's BERT, which uses WordPiece, a method closely related to BPE.
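The merge-learning step of BPE can be sketched in plain Python. This is a toy version under stated assumptions: the word counts are made up, the helper names (`count_pairs`, `merge_pair`) are hypothetical, and ties are broken by first occurrence. Production tokenizers add end-of-word markers and many other details.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical word counts; each word starts as a tuple of characters.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in word_counts.items()}

merges = []
for _ in range(3):
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)       # learned merge rules, in order
print(list(vocab))  # words re-segmented into sub-word units
```

After three merges, the frequent ending "est" has become a single sub-word unit shared by "newest" and "widest".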

**Hierarchy:** Corpus > document > sentence > word > sub-word > character > sub-character > stroke  
For English, tokens are normally at the word, sub-word, or character level.

In [9]:
# For each token, print its token number and token text
for i, token in enumerate(doc, start=1):
    print(f"{i} {token.text}")

1 Hey
2 ,
3 I
4 'm
5 Yuibi
6 .
7 I
8 work
9 very
10 hard
11 at
12 XYZ
13 ,
14 Inc.
15 as
16 a
17 data
18 scientist
19 .
20 It
21 's
22 located
23 in
24 San
25 Antonio
26 ,
27 TX
28 ,
29 which
30 is
31 the
32 best
33 city
34 !
35 😊
36 You
37 can
38 buy
39 3
40 tacos
41 for
42 $
43 2
44 as
45 of
46 December
47 13th
48 ,
49 2019
50 .
51 lol


## Stop Words
Stop words are common words that are often not needed for downstream tasks such as word frequency analysis, topic modeling, or count vectorization (e.g. to reduce the number of features in a bag-of-words text classifier).  

[Some AI researchers](https://twitter.com/deliprao/status/1068555626299584512) argue that stop word removal is detrimental for deep learning models as the removal can potentially change the meanings of documents (e.g. negation).

The stop word list can be customized.
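To make the negation concern concrete, here is a minimal sketch in plain Python. The five-word stop list and the `remove_stop_words` helper are assumptions for illustration only; they are not spaCy's list or API.

```python
# Hypothetical mini stop list; a real English list is much larger.
stop_words = {"the", "is", "a", "not", "this"}


def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["This", "taco", "is", "not", "good"]

# With "not" treated as a stop word, the negation disappears.
print(remove_stop_words(tokens, stop_words))         # ['taco', 'good']

# Customizing the list: drop "not" from it so negations survive.
custom_stop_words = stop_words - {"not"}
print(remove_stop_words(tokens, custom_stop_words))  # ['taco', 'not', 'good']
```

The first result reads as a positive statement even though the original sentence was negative, which is the kind of meaning change the researchers warn about.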

In [10]:
# By default, spaCy contains a few hundred stop words for English 
spacy_stop_words = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stop_words)

326

In [11]:
# Print 30 stop words
list(spacy_stop_words)[:30]

["'ll",
 'make',
 'again',
 'am',
 'can',
 'becoming',
 'his',
 'take',
 'more',
 'wherever',
 'you',
 'him',
 'still',
 'otherwise',
 'go',
 'the',
 'quite',
 'ca',
 'four',
 'we',
 "'ve",
 'even',
 'further',
 'mostly',
 '’ll',
 'although',
 'when',
 'upon',
 'enough',
 'as']

## Lemmatization  
Lemmatization converts an inflected word into its base form (lemma) while ensuring that the reduced form belongs to the language, which might not be the case for stemming. See the table below for an example.  

Lemmatization depends on part-of-speech (**POS**) tagging, and is rule-based (for now).  

As with stop words, lemmatization can help downstream tasks by normalizing text, but for deep learning tasks, it is almost always skipped.  

Stemming is not available in spaCy.  

Word | Stem | Lemma
--- | --- | ---
Studies | Studi | Study
Studying | Study | Study

In [12]:
# For each token, print its token number, token text, and lemma
for i, token in enumerate(doc, start=1):
    print(f"{i} {token.text} \t\t {token.lemma_}")

1 Hey 		 hey
2 , 		 ,
3 I 		 -PRON-
4 'm 		 be
5 Yuibi 		 Yuibi
6 . 		 .
7 I 		 -PRON-
8 work 		 work
9 very 		 very
10 hard 		 hard
11 at 		 at
12 XYZ 		 XYZ
13 , 		 ,
14 Inc. 		 Inc.
15 as 		 as
16 a 		 a
17 data 		 data
18 scientist 		 scientist
19 . 		 .
20 It 		 -PRON-
21 's 		 be
22 located 		 locate
23 in 		 in
24 San 		 San
25 Antonio 		 Antonio
26 , 		 ,
27 TX 		 TX
28 , 		 ,
29 which 		 which
30 is 		 be
31 the 		 the
32 best 		 good
33 city 		 city
34 ! 		 !
35 😊 		 😊
36 You 		 -PRON-
37 can 		 can
38 buy 		 buy
39 3 		 3
40 tacos 		 taco
41 for 		 for
42 $ 		 $
43 2 		 2
44 as 		 as
45 of 		 of
46 December 		 December
47 13th 		 13th
48 , 		 ,
49 2019 		 2019
50 . 		 .
51 lol 		 lol


## Sentence Segmentation  
SpaCy can split a document into individual sentences by predicting sentence boundaries. As with most features in spaCy, this can be customized.

In [13]:
# For each sentence, print its sentence number and text
for i, sentence in enumerate(doc.sents, start=1):
    print(f"{i} {sentence}")

1 Hey, I'm Yuibi.
2 I work very hard at XYZ, Inc. as a data scientist.
3 It's located in San Antonio, TX, which is the best city!
4 😊
5 You can buy 3 tacos for $2 as of December 13th, 2019.
6 lol


## Statistical Models  
SpaCy comes with 3 deep neural network (DNN) based models:
1. Part-of-speech (**POS**) tagger
2. Named entity recognizer (**NER**)
3. Syntactic dependency parser

SpaCy trains these models in a multi-task fashion with a supervised convolutional neural network (CNN) that uses sub-word embeddings, residual connections, and layer normalization. The models can be retrained if you have your own labeled data.

### Part-of-speech (**POS**)  
POS indicates which category a word is assigned to in accordance with its syntactic functions.  

Examples:
- Noun
- Pronoun
- Proper noun
- Adjective
- Verb
- Adverb
- Adposition
- Auxiliary
- Punctuation
- Determiner 
- Subordinating conjunction
- Interjection

In [14]:
# For each token, print its token number, token text, and POS
for i, token in enumerate(doc, start=1):
    print(f"{i} {token.text} \t\t {token.pos_}")

1 Hey 		 INTJ
2 , 		 PUNCT
3 I 		 PRON
4 'm 		 AUX
5 Yuibi 		 PROPN
6 . 		 PUNCT
7 I 		 PRON
8 work 		 VERB
9 very 		 ADV
10 hard 		 ADV
11 at 		 ADP
12 XYZ 		 PROPN
13 , 		 PUNCT
14 Inc. 		 PROPN
15 as 		 SCONJ
16 a 		 DET
17 data 		 NOUN
18 scientist 		 NOUN
19 . 		 PUNCT
20 It 		 PRON
21 's 		 AUX
22 located 		 VERB
23 in 		 ADP
24 San 		 PROPN
25 Antonio 		 PROPN
26 , 		 PUNCT
27 TX 		 PROPN
28 , 		 PUNCT
29 which 		 DET
30 is 		 AUX
31 the 		 DET
32 best 		 ADJ
33 city 		 NOUN
34 ! 		 PUNCT
35 😊 		 PROPN
36 You 		 PRON
37 can 		 VERB
38 buy 		 VERB
39 3 		 NUM
40 tacos 		 NOUN
41 for 		 ADP
42 $ 		 SYM
43 2 		 NUM
44 as 		 SCONJ
45 of 		 ADP
46 December 		 PROPN
47 13th 		 NOUN
48 , 		 PUNCT
49 2019 		 NUM
50 . 		 PUNCT
51 lol 		 PROPN


### Named Entity Recognizer (**NER**)  
NER locates named entities and classifies them into pre-defined categories, such as:  
- PERSON: People, including fictional.
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.  

In [15]:
# For each extracted entity, print its number, text, and entity label
for i, ent in enumerate(doc.ents, start=1):
    print(f"{i} {ent.text} \t\t {ent.label_}")

1 Yuibi 		 GPE
2 XYZ, Inc. 		 ORG
3 San Antonio 		 GPE
4 TX 		 ORG
5 3 tacos 		 MONEY
6 2 		 MONEY
7 December 13th, 2019 		 DATE


In [16]:
# Visualize NER
from spacy import displacy

displacy.render(doc, style="ent")

### Syntactic Dependency Parser  
This process extracts the dependency parse of a sentence to represent its grammatical structure. The extracted structure is represented as a directed graph, and it can be used as a feature in some deep learning algorithms (Tree-Recursive Neural Network, Graph Neural Network, etc.).
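To show what "represented as a directed graph" means in practice, the sketch below builds a head-to-children adjacency list in plain Python. The triples are hand-copied from this notebook's parse of "I work very hard at XYZ, Inc. as a data scientist." (punctuation omitted). Using token text as the node id is a simplifying assumption that only works because the words are unique here; with real `Doc` objects the token index `token.i` is the safer key.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the parser output.
triples = [
    ("I", "nsubj", "work"),
    ("work", "ROOT", "work"),
    ("very", "advmod", "hard"),
    ("hard", "advmod", "work"),
    ("at", "prep", "work"),
    ("XYZ", "pobj", "at"),
    ("Inc.", "appos", "XYZ"),
    ("as", "prep", "work"),
    ("a", "det", "scientist"),
    ("data", "compound", "scientist"),
    ("scientist", "pobj", "as"),
]

children = defaultdict(list)  # head -> list of (child, label) edges
root = None
for token, dep, head in triples:
    if dep == "ROOT":
        root = token  # the root is its own head, so no edge is added
    else:
        children[head].append((token, dep))


def subtree(node):
    """Return the node followed by all of its descendants, depth first."""
    nodes = [node]
    for child, _ in children[node]:
        nodes.extend(subtree(child))
    return nodes


print(root)              # work
print(children["work"])  # direct dependents of the root
print(subtree("as"))     # ['as', 'scientist', 'a', 'data']
```

Walking a subtree like this is one way to pull out a whole phrase, here the prepositional phrase headed by "as".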

In [17]:
# For each token, print its token number, token text, dependency label, and head
for i, token in enumerate(doc, start=1):
    print(f"{i} {token.text} \t\t {token.dep_} \t\t {token.head.text}")

1 Hey 		 intj 		 'm
2 , 		 punct 		 'm
3 I 		 nsubj 		 'm
4 'm 		 ROOT 		 'm
5 Yuibi 		 attr 		 'm
6 . 		 punct 		 'm
7 I 		 nsubj 		 work
8 work 		 ROOT 		 work
9 very 		 advmod 		 hard
10 hard 		 advmod 		 work
11 at 		 prep 		 work
12 XYZ 		 pobj 		 at
13 , 		 punct 		 XYZ
14 Inc. 		 appos 		 XYZ
15 as 		 prep 		 work
16 a 		 det 		 scientist
17 data 		 compound 		 scientist
18 scientist 		 pobj 		 as
19 . 		 punct 		 work
20 It 		 nsubjpass 		 located
21 's 		 auxpass 		 located
22 located 		 ROOT 		 located
23 in 		 prep 		 located
24 San 		 compound 		 Antonio
25 Antonio 		 pobj 		 in
26 , 		 punct 		 Antonio
27 TX 		 appos 		 Antonio
28 , 		 punct 		 Antonio
29 which 		 nsubj 		 is
30 is 		 relcl 		 Antonio
31 the 		 det 		 city
32 best 		 amod 		 city
33 city 		 attr 		 is
34 ! 		 punct 		 located
35 😊 		 ROOT 		 😊
36 You 		 nsubj 		 buy
37 can 		 aux 		 buy
38 buy 		 ROOT 		 buy
39 3 		 nummod 		 tacos
40 tacos 		 dobj 		 buy
41 for 		 prep 		 buy
42 $ 		 nmod 		 2
43 2 		 pob

In [18]:
# Visualize syntactic dependency
from spacy import displacy

# Split into sentences
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep")

## Word Vectors  
SpaCy's larger models (e.g. `en_core_web_lg`, used below) come with word vectors trained with GloVe, an algorithm similar to Word2Vec. SpaCy also lets you use other pretrained word vectors or your own custom vectors, such as ones trained with Gensim, FastText, or TensorFlow.  

Examples below show how to compute similarity between words/sentences, but you can also feed the vector representation as a feature to downstream tasks, such as text classifier in your favorite ML framework.
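The arithmetic behind these similarity scores can be sketched by hand. The three-dimensional vectors below are made up for illustration (real word vectors have far more dimensions), and `cosine_similarity` and `average_vector` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_vector(vectors):
    """Component-wise mean, a simple way to get one vector per document."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical toy "word vectors".
toy_vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))  # shares a direction
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))  # orthogonal, so 0.0

# A "document" vector is the average of its word vectors.
doc_vector = average_vector([toy_vectors["cat"], toy_vectors["dog"]])
print(doc_vector)  # [1.0, 0.0, 0.5]
```

Because averaging ignores word order, two documents with the same words in a different order get identical vectors under this scheme.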

In [27]:
# Print the word vector for "cat" and its dimension size
nlp = spacy.load("en_core_web_lg")
word1 = nlp("cat")
print(f"{word1.vector} Dimension: {len(word1.vector)}")

[-0.15067   -0.024468  -0.23368   -0.23378   -0.18382    0.32711
 -0.22084   -0.28777    0.12759    1.1656    -0.64163   -0.098455
 -0.62397    0.010431  -0.25653    0.31799    0.037779   1.1904
 -0.17714   -0.2595    -0.31461    0.038825  -0.15713   -0.13484
  0.36936   -0.30562   -0.40619   -0.38965    0.3686     0.013963
 -0.6895     0.004066  -0.1367     0.32564    0.24688   -0.14011
  0.53889   -0.80441   -0.1777    -0.12922    0.16303    0.14917
 -0.068429  -0.33922    0.18495   -0.082544  -0.46892    0.39581
 -0.13742   -0.35132    0.22223   -0.144     -0.048287   0.3379
 -0.31916    0.20526    0.098624  -0.23877    0.045338   0.43941
  0.030385  -0.013821  -0.093273  -0.18178    0.19438   -0.3782
  0.70144    0.16236    0.0059111  0.024898  -0.13613   -0.11425
 -0.31598   -0.14209    0.028194   0.5419    -0.42413   -0.599
  0.24976   -0.27003    0.14964    0.29287   -0.31281    0.16543
 -0.21045   -0.4408     1.2174     0.51236    0.56209    0.14131
  0.092514   0.71396   -0.02

In [28]:
# Cosine similarity of cat and dog
word2 = nlp("dog")
print(word1.similarity(word2))

0.8016854705531046


In [29]:
# Cosine similarity of 2 different documents by averaging word vectors
doc1 = nlp("I like sushi.")
doc2 = nlp("My favorite food is spacy food.")
print(doc1.similarity(doc2))

0.7740199549421769


In [31]:
# An unrelated sentence pair returns low value
doc1 = nlp("I like sushi.")
doc2 = nlp("Jupyter notebook installation guide")
print(doc1.similarity(doc2))

0.31461387624735404


## Working with a Big Dataset  
This Twitter data was downloaded from [Kaggle](https://www.kaggle.com/c/twitter-sentiment-analysis2).

In [34]:
import pandas as pd
df = pd.read_csv('data/train.csv.zip', sep=',', compression='zip', encoding='latin_1')
df = df.sample(1000, random_state=1111)
df.shape

(1000, 3)

In [35]:
df.head(10)

(index) | ItemID | Sentiment | SentimentText
--- | --- | --- | ---
1266 | 1267 | 1 | Awesome. &lt;3 TEDDY! &lt;3
39898 | 39910 | 1 | @Andre_R its been my mission to find other sou...
88879 | 88891 | 1 | @Azura999 Told you
3688 | 3689 | 1 | thanks for the birthday DMs
68437 | 68449 | 0 | @bryandl i know! you should come visit again!!
28556 | 28568 | 0 | @airlanggatwerp bagi link nya dong nce huhu
48974 | 48986 | 1 | @AriaParadiso @ChelseaParadiso nighty night u ...
68698 | 68710 | 0 | @BSBSavedMyLife it won't play
90592 | 90604 | 0 | @chiniehdiaz Im out of it..havent had any in d...
55359 | 55371 | 1 | @barrymoltz Ahhh, yes-- shiny object syndrome....


### Typical way (slow)
This approach is single-threaded and calls `nlp` on one text at a time.

In [37]:
def tokenize(text: str) -> list:
    """Run the full pipeline on one text and return its token texts."""
    doc = nlp(text)
    return [token.text for token in doc]

In [38]:
%%timeit
df['token_list1'] = df.apply(lambda x: tokenize(x.SentimentText), axis=1)

16.4 s ± 3.38 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### SpaCy way (fast)  
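SpaCy's `nlp.pipe` treats texts as a stream, processes them in batches, and yields `Doc` objects, which is much faster than calling `nlp` on each text separately. Note that the `n_threads` argument has been deprecated since spaCy 2.1; from v2.2.2 on, parallel inference is controlled by `n_process` instead.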
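The streaming idea can be sketched with a plain Python generator. This is a conceptual illustration only: `batched` and `toy_pipe` are hypothetical names, and `str.split` stands in for the model work that a real pipeline would do once per batch.

```python
from itertools import islice


def batched(stream, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, batch_size=2):
    """Lazily yield one result per text, processing one batch at a time."""
    for batch in batched(texts, batch_size):
        # A real pipeline would run its model on the whole batch here;
        # str.split is only a stand-in for that work.
        results = [text.split() for text in batch]
        yield from results


texts = ["I like sushi", "lol", "best city ever"]
for tokens in toy_pipe(texts):
    print(tokens)
```

Since both functions are generators, nothing is processed until results are consumed, and only one batch needs to be held in memory at a time.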

In [41]:
%%timeit
# nlp.pipe streams the texts and processes them in batches.
# n_threads is omitted because it is deprecated (see n_process in spaCy 2.2.2+).
texts = df.SentimentText.astype(str).values
token_list = [
    [token.text for token in doc]
    for doc in nlp.pipe(texts, batch_size=100)
]

df['token_list2'] = token_list

3.34 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### SpaCy way (faster)  
When an `nlp` object is created, spaCy sets up a processing pipeline made of several components. Disabling the components you do not need makes spaCy even faster. The pipeline can also be customized.  

![nlp_pipeline](img/nlp_pipeline.png)  
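Why disabling components saves time can be seen in a toy model of a pipeline: an ordered list of named functions that each annotate a document. Everything here (`make_pipeline`, `process`, the dict-based "doc") is a hypothetical simplification, not spaCy's internals.

```python
def make_pipeline():
    """Return (name, function) pairs; each function annotates a dict in place."""

    def tagger(doc):
        doc["tags"] = ["TAG"] * len(doc["tokens"])

    def parser(doc):
        doc["deps"] = ["DEP"] * len(doc["tokens"])

    def ner(doc):
        doc["ents"] = []

    return [("tagger", tagger), ("parser", parser), ("ner", ner)]


def process(text, pipeline, disabled=()):
    """Tokenize, then run every component whose name is not disabled."""
    doc = {"tokens": text.split()}  # tokenization always runs first
    for name, component in pipeline:
        if name not in disabled:
            component(doc)
    return doc


pipeline = make_pipeline()
full = process("I like sushi", pipeline)
tokens_only = process("I like sushi", pipeline, disabled=("tagger", "parser", "ner"))

print(sorted(full))         # ['deps', 'ents', 'tags', 'tokens']
print(sorted(tokens_only))  # ['tokens']
```

Each skipped component is work that never runs, so a tokenization-only job avoids the cost of the statistical models entirely.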

In [44]:
# Print pipeline components
for p in nlp.pipeline:
    print(p)

('tagger', <spacy.pipeline.pipes.Tagger object at 0x00000188297E7808>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001880CAC4D08>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001881E39A8E8>)


In [45]:
%%timeit
# Disable the POS tagger, dependency parser, and NER since all we need is the tokenizer.
# Alternatively, nlp.make_doc skips the whole pipeline and only tokenizes.
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    texts = df.SentimentText.astype(str).values
    token_list = [
        [token.text for token in doc]
        for doc in nlp.pipe(texts, batch_size=100)
    ]

df['token_list3'] = token_list

61.7 ms ± 5.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
