# SpaCy!

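- A large library with many components (e.g., tagger, NER, POS tagging); the default pipelines use CNNs
- One of the most popular libraries for NLP (others include NLTK and gensim)
- HuggingFace is another option (mostly for deep learning)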

`pip install spacy` or `pip install -U 'spacy[cuda-autodetect]'`

`python -m spacy download en_core_web_sm` 

`python -m spacy download en_core_web_md`  #includes static word vectors (300 dimensions)

`python -m spacy download en_core_web_trf` #all components are trained on top of a transformer

To get another variant of `en_core_web_sm`, replace `sm` with `md` (medium), `lg` (large), or `trf` (transformer). Larger models generally give better accuracy and more features (for example, word vectors), at the cost of size and speed.

In [1]:
# !python -m spacy download en_core_web_md

In [2]:
# !python -m spacy download en_core_web_sm

In [3]:
# !python -m spacy download en_core_web_trf

In [4]:
import spacy
spacy.__version__

'3.4.2'

## 1. Basics

### 1.1 Introduction

In [5]:
#create a spaCy pipeline object that can parse many things,
#based on a trained model
nlp = spacy.load("en_core_web_sm")

In [6]:
text = "Thailand really like to eat naan and masala. He also likes to eat sushi."

In [7]:
doc = nlp(text) 
type(doc)

spacy.tokens.doc.Doc

In [8]:
#a Doc holds a lot of information
for tokens in doc[:10]:
    print(tokens) #the Doc is already tokenized
    break

Thailand


In [9]:
tokens

Thailand

In [10]:
for sent in doc.sents:
    print(sent) #the Doc also knows its sentences

Thailand really like to eat naan and masala.
He also likes to eat sushi.
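
In `en_core_web_sm` the sentence boundaries come from the parser. As a minimal sketch, assuming no trained model is available, the rule-based `sentencizer` component can be added to a blank pipeline to get `doc.sents` from punctuation alone. The blank pipeline here is an assumption for illustration, not what the cells above use.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# The sentencizer marks sentence starts after sentence-final punctuation.
nlp.add_pipe("sentencizer")

doc = nlp("Thailand really like to eat naan and masala. He also likes to eat sushi.")

# Collect the sentence texts so they can be inspected or reused.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

This rule-based split is cheaper than parsing, but it relies only on punctuation, so it can be less reliable on messy text.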


In [11]:
tokens.ent_type #entity type as an integer ID

384

In [12]:
tokens.ent_type_ #entity type as a string: geopolitical entity

'GPE'

In [13]:
spacy.explain('GPE')

'Countries, cities, states'

In [14]:
tokens.ent_iob_ #beginning of an entity

'B'

In [15]:
tokens.pos_ #Proper noun

'PROPN'

In [16]:
tokens.dep_ #dependency relation: nominal subject

'nsubj'

In [17]:
tokens.head #the syntactic head of this token

like
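
The single-attribute lookups above can be gathered in one loop over the tokens. The sketch below is self-contained and does not need a downloaded model: the words and annotations are set by hand through the `Doc` constructor (a hypothetical toy sentence, not model predictions), so it only shows how the attributes are read.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations (assumed for illustration, not predicted by a model).
words = ["Thailand", "likes", "naan", "."]
doc = Doc(
    nlp.vocab,
    words=words,
    pos=["PROPN", "VERB", "NOUN", "PUNCT"],
    heads=[1, 1, 1, 1],                     # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj", "punct"],
    ents=["B-GPE", "O", "O", "O"],          # token-level IOB entity tags
)

# One row per token: text, POS, dependency, head, IOB flag, entity type.
rows = [
    (t.text, t.pos_, t.dep_, t.head.text, t.ent_iob_, t.ent_type_)
    for t in doc
]
for row in rows:
    print(row)
```

With a trained pipeline, the same loop works on `nlp(text)` directly and the values are the model's predictions.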

In [18]:
sentence1 = list(doc.sents)[0]
sentence1

Thailand really like to eat naan and masala.

In [19]:
from spacy import displacy #displaCy, spaCy's built-in visualizer

displacy.render(sentence1, style="dep")

In [20]:
displacy.render(sentence1, style="ent")

### 1.2. Word Vectors

In [21]:
nlp = spacy.load("en_core_web_md")

In [22]:
text = "Chaky likes to eat sushi"

In [23]:
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [24]:
sentence = list(doc.sents)[0]
sentence[1]

likes

In [25]:
len(sentence[1].vector) #vector size --> 300-dimensional word embedding

300

### 1.3. Similarity

In [26]:
#before similarity, let's look at nlp.vocab.strings
doc = nlp("I love coffee.")

In [27]:
nlp.vocab.strings['coffee'] #hash value

3197928453018144401

In [28]:
nlp.vocab.strings[3197928453018144401]

'coffee'
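
The two lookups above go through spaCy's `StringStore`, which maps a string to a 64-bit hash and back. A minimal sketch with a standalone store (created here only for illustration) shows that the reverse lookup works only for strings the store has seen:

```python
from spacy.strings import StringStore

# A fresh, empty store; nlp.vocab.strings is an instance of the same class.
store = StringStore()

# Adding a string returns its hash; the hash depends only on the string.
coffee_hash = store.add("coffee")

# The reverse direction works because "coffee" is now in the store.
restored = store[coffee_hash]

print(coffee_hash, restored)
print("coffee" in store, "tea" in store)
```

Because the hash is computed from the string alone, the same word gets the same ID in every pipeline and vocabulary.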

In [29]:
#first, look up the hash ID of 'dog'
integer = nlp.vocab.strings['dog']
integer

7562983679033046312

In [30]:
#get the vector for this ID
vector = nlp.vocab.vectors[integer]
vector[:5] #first 5 of the 300 dimensions of the 'dog' vector

array([  1.233 ,   4.2963,  -7.9738, -10.121 ,   1.8207], dtype=float32)

In [31]:
import numpy as np

close_words = nlp.vocab.vectors.most_similar(np.asarray([vector]),n=10)
close_words

(array([[ 7918624946109788756,  4969328240109515165,  4560869431627726864,
         17429802345416193488,  6017664905485703127, 14534804554944721111,
           173986088034745168, 15668852121853073894, 11567120971096873637,
         15872191516786115817]], dtype=uint64),
 array([[ 1147,  2545,  3201,  9003,  3828, 18829,  5845, 11580,  7045,
         18612]], dtype=int32),
 array([[1.    , 0.8334, 0.8221, 0.8108, 0.7856, 0.7195, 0.685 , 0.6328,
         0.6148, 0.5966]], dtype=float32))

In [32]:
close_words[0].shape

(1, 10)

In [33]:
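nlp.vocab.strings[close_words[0][0][0]] #string for the top key; not 'dog' itself, probably because the md model maps several words to one shared vector row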

'dogsbody'
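
`most_similar` returns three arrays: the keys, the row indices in the vector table, and the similarity scores. What it computes is a nearest-neighbour search by cosine similarity. The sketch below reproduces the idea with NumPy on a tiny hand-made table; the words and vectors are hypothetical and only illustrate the ranking.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" (real spaCy vectors have 300 dims).
words = ["dog", "puppy", "cat", "car"]
table = np.array(
    [
        [1.0, 0.9, 0.0],
        [0.9, 1.0, 0.1],
        [0.7, 0.6, 0.3],
        [0.0, 0.1, 1.0],
    ],
    dtype=np.float32,
)


def most_similar(query, table, n=3):
    """Return the row indices and cosine scores of the n rows closest to query."""
    table_norm = table / np.linalg.norm(table, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = table_norm @ query_norm        # cosine similarity with every row
    order = np.argsort(-scores)[:n]         # highest scores first
    return order, scores[order]


rows, scores = most_similar(table[0], table, n=3)
neighbours = [words[i] for i in rows]
print(neighbours, scores)
```

As in the notebook output, the query itself comes back first with a score of 1, so the interesting neighbours start at the second position.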

### 1.4. Doc and Span Similarity

In [34]:
doc1 = nlp('Chaky likes french fries')
doc2 = nlp('Tonson likes sweet potato nuggets')

In [35]:
doc1.similarity(doc2) #higher means more similar

0.681774068977061

In [36]:
#doc --> sents --> span --> tokens

#do span similarity
span1 = doc1[2:4]
span1, type(span1)

(french fries, spacy.tokens.span.Span)

In [37]:
span2 = doc2[2:5] #doc2 has 5 tokens, so the slice ends at index 5
span2

sweet potato nuggets

In [38]:
span1.similarity(span2)

0.534758985042572
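
For pipelines with static word vectors, the default `similarity` is the cosine similarity between the averaged token vectors of the two objects. A minimal sketch of that computation, using made-up 3-dimensional vectors (the numbers are assumptions, so the score will not match the output above):

```python
import numpy as np

# Hypothetical toy vectors; a real model would supply 300-dimensional ones.
toy_vectors = {
    "french": np.array([0.9, 0.1, 0.3]),
    "fries": np.array([0.8, 0.3, 0.1]),
    "sweet": np.array([0.7, 0.2, 0.6]),
    "potato": np.array([0.9, 0.2, 0.2]),
    "nuggets": np.array([0.6, 0.5, 0.1]),
}


def span_vector(tokens):
    """Average the token vectors, which is how a Doc or Span vector is built by default."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vec1 = span_vector(["french", "fries"])
vec2 = span_vector(["sweet", "potato", "nuggets"])
score = cosine(vec1, vec2)
print(score)
```

Averaging discards word order, which is one reason two texts with similar vocabulary can score high even when they mean different things.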

## 2. Entity Ruler

Named entity recognition (NER) is one of the most prominent features of spaCy.

spaCy can do NER in two ways:
1. rule-based (covered first)
2. neural network (covered briefly)

NER is one of the most common NLP tasks in industry, e.g., for information extraction.

In [39]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [40]:
#pipes --> every time you pass text into nlp,
#it goes through a sequential list of pipes, and each pipe does something to the doc
analysis = nlp.analyze_pipes(pretty=True)
# analysis


#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

In [41]:
#add an entity_ruler pipe with add_pipe
ruler = nlp.add_pipe('entity_ruler', before='ner') #put the pipe before ner

#add patterns
patterns = [
                {'label': "LOC", 'pattern' : "Rangsit"}
            ]

ruler.add_patterns(patterns)
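
The position of a pipe is controlled by the `before`, `after`, `first`, and `last` arguments of `add_pipe`. The sketch below shows the effect on a blank pipeline, where `sentencizer` stands in for `ner` (an assumption, so that no trained model is needed):

```python
import spacy

nlp = spacy.blank("en")

# Without a position argument, a new pipe is appended at the end.
nlp.add_pipe("sentencizer")

# before= inserts the new pipe ahead of the named component.
nlp.add_pipe("entity_ruler", before="sentencizer")

# pipe_names lists the components in the order they run.
print(nlp.pipe_names)
```

In the cell above, the ruler is placed before `ner` so that its rule-based entities are set first and the statistical recognizer works around them.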

In [42]:
analysis = nlp.analyze_pipes(pretty=True)


#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

In [43]:
text = "AIT is at Rangsit."
doc = nlp(text)

In [44]:
for ent in doc.ents:
    print(ent.text, ent.label_)

AIT ORG
Rangsit LOC
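
The pattern above is a phrase pattern: a plain string that must match the text exactly. The entity ruler also accepts token patterns, which are lists of per-token attribute dictionaries. A minimal sketch on a blank pipeline (so `AIT` needs its own rule here, because there is no trained `ner`) contrasts the two; the example sentence is made up:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns(
    [
        # Phrase pattern: matches the exact string only.
        {"label": "ORG", "pattern": "AIT"},
        # Token pattern: LOWER compares the lowercased token, so case is ignored.
        {"label": "LOC", "pattern": [{"LOWER": "rangsit"}]},
    ]
)

doc = nlp("AIT is at RANGSIT, not far from rangsit market.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```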


### 2.1 More patterns!!

In [45]:
import spacy

text = 'My phone number is (555) 666-5555'

nlp = spacy.blank("en") #blank model (no pipes)

In [46]:
ruler = nlp.add_pipe('entity_ruler')

In [47]:
patterns = [
        {"label": "PHONE_NUMBER", "pattern": [
            {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
            {"SHAPE": "ddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}
        ]}
]
#"OP": "?" makes a token optional (it matches 0 or 1 times)

ruler.add_patterns(patterns)

In [48]:
doc = nlp(text)

In [49]:
for ent in doc.ents:
    print(ent.text, ent.label_)

(555) 666-5555 PHONE_NUMBER
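
Because the hyphen token carries `"OP": "?"`, the same pattern should also accept a number written without the hyphen. A short sketch to check both spellings; the two test sentences are made up for illustration:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns(
    [
        {
            "label": "PHONE_NUMBER",
            "pattern": [
                {"ORTH": "("},
                {"SHAPE": "ddd"},
                {"ORTH": ")"},
                {"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"},  # optional hyphen
                {"SHAPE": "dddd"},
            ],
        }
    ]
)

texts = ["Call (555) 666-5555 now", "Call (555) 666 5555 now"]

# One list of entity texts per input sentence.
found = [[ent.text for ent in nlp(text).ents] for text in texts]
print(found)
```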


### 2.2 Matcher

The `Matcher` is an even more powerful pattern-matching tool.

In [51]:
from spacy.matcher import Matcher #helps us recognize patterns

In [55]:
#Email
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

In [58]:
pattern = [{"LIKE_EMAIL" : True}]
matcher.add("EMAIL",[pattern])

doc = nlp("Chaky email is chaklam@ait.asia.")
matches = matcher(doc)

In [59]:
matches 

[(17587345535198158200, 3, 4)]

In [61]:
nlp.vocab[matches[0][0]].text

'EMAIL'

In [64]:
#grab proper nouns and longer phrases
with open('../data/wiki_king.txt','r') as f:
    text = f.read()

text

'Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 â€“ April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.\n\nKing participated in and led marches for blacks\' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his f

In [65]:
nlp = spacy.load('en_core_web_sm')

In [77]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN"}] #pos ==> part of speech
matcher.add("PROPER_NOUN_CHAKY",[pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 2, 3) King
(2015442650195688329, 3, 4) Jr.
(2015442650195688329, 6, 7) Michael
(2015442650195688329, 7, 8) King
(2015442650195688329, 8, 9) Jr.
(2015442650195688329, 10, 11) January
(2015442650195688329, 16, 17) April
(2015442650195688329, 50, 51) King


In [78]:
##multi-word token
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}] #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY",[pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 0, 2) Martin Luther
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 0, 3) Martin Luther King
(2015442650195688329, 1, 3) Luther King
(2015442650195688329, 2, 3) King
(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 1, 4) Luther King Jr.
(2015442650195688329, 2, 4) King Jr.
(2015442650195688329, 3, 4) Jr.


In [79]:
##how do we get only one match per name?
##greedy = 'LONGEST' keeps the longest match among overlapping ones
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}] #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY",[pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(2015442650195688329, 84, 89) Martin Luther King Sr.
(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 129, 133) Southern Christian Leadership Conference
(2015442650195688329, 6, 9) Michael King Jr.
(2015442650195688329, 70, 72) Mahatma Gandhi
(2015442650195688329, 147, 149) Albany Movement
(2015442650195688329, 194, 196) Lincoln Memorial
(2015442650195688329, 10, 11) January
(2015442650195688329, 16, 17) April
(2015442650195688329, 50, 51) King


In [80]:
##sort
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}] #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY",[pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)

matches.sort(key = lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 6, 9) Michael King Jr.
(2015442650195688329, 10, 11) January
(2015442650195688329, 16, 17) April
(2015442650195688329, 50, 51) King
(2015442650195688329, 70, 72) Mahatma Gandhi
(2015442650195688329, 84, 89) Martin Luther King Sr.
(2015442650195688329, 90, 91) King
(2015442650195688329, 114, 115) King
(2015442650195688329, 118, 119) Montgomery


In [81]:
##add some verb
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"},{"POS":"VERB"}] #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY",[pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)

matches.sort(key = lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

(2015442650195688329, 50, 52) King advanced
(2015442650195688329, 90, 92) King participated
(2015442650195688329, 114, 116) King led
(2015442650195688329, 168, 170) King helped


### 2.3 Regex (regular expressions)

In [82]:
import spacy

#Sample text
text = "This is a sample number 5555555."

#Start from a blank English pipeline
nlp = spacy.blank("en")

#add the pipe
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [{"TEXT": {"REGEX": r"((\d){7})"}}]  #raw string, so \d is not treated as a string escape
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

5555555 PHONE_NUMBER
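
A `REGEX` inside a token pattern is applied to one token at a time, so it cannot cover a number that the tokenizer splits into several tokens, such as `(555) 666-5555`. One way around this, sketched below, is to run the regular expression on `doc.text` and convert the character offsets to a span with `doc.char_span`. The regular expression itself is an assumption made for this phone format.

```python
import re

import spacy

nlp = spacy.blank("en")
doc = nlp("My phone number is (555) 666-5555")

# Multi-token pattern written against the raw text (assumed phone format).
phone_regex = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

spans = []
for match in phone_regex.finditer(doc.text):
    span = doc.char_span(match.start(), match.end(), label="PHONE_NUMBER")
    # char_span returns None when the offsets do not fall on token boundaries.
    if span is not None:
        spans.append(span)

# Register the spans as entities on the doc.
doc.ents = spans
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

The `None` check matters: a regex match that starts or ends in the middle of a token has no corresponding span and would otherwise fail when assigned to `doc.ents`.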
