# 2. SpaCy Pipeline and Properties

In [1]:
import spacy 
nlp = spacy.load("en")

In [2]:
filename = 'Tripadvisor_hotelreviews_Shivambansal.txt'
document = open(filename, encoding='utf8').read()
document = nlp(document)

`document` is now a spaCy `Doc` object produced by the English model, and it has many methods and properties:

In [3]:
dir(document)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_disk',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'merge',
 'noun_chunks',
 'noun_chunks_iterator',
 'print_tree',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_extension',
 'similarity',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_byte

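At this point the [Overview](https://spacy.io/api) may make things easier to understand:
- The central objects are `Doc` and `Vocab`
    - `Doc` stores all the tokens
    - `Vocab` is the **lookup table** for all available information
- Calling `load()` first selects the language model; calling `nlp(texts)` then runs the `Tokenize` and `Pipeline` steps for you and returns a `Doc` object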

---

## 2.1 Tokenization

The spaCy model has already split the text into sentences, and each sentence has already been tokenized:

In [4]:
# first token of the doc
print(document[0])

# fifth token from the end of the doc (the last token would be document[-1])
print(document[len(document)-5])

Nice
boston


In [5]:
# List of sentences of our doc
list(document.sents)[:5]

[Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
 Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).,
 Overall, it was a good experience and the staff was quite friendly. ,
 what a surprise]

## 2.2 Part of Speech Tagging

In [6]:
# get all tags
all_tags = {w.pos: w.pos_ for w in document}
all_tags

{84: 'ADJ',
 92: 'NOUN',
 86: 'ADV',
 85: 'ADP',
 90: 'DET',
 100: 'VERB',
 95: 'PRON',
 97: 'PUNCT',
 89: 'CCONJ',
 96: 'PROPN',
 94: 'PART',
 103: 'SPACE',
 99: 'SYM',
 93: 'NUM',
 101: 'X',
 87: 'AUX',
 91: 'INTJ'}

In [7]:
# all tags of first sentence of our document
for word in list(document.sents)[0]:
    print(word, ", Tag: ", word.tag_)

Nice , Tag:  JJ
place , Tag:  NN
Better , Tag:  RBR
than , Tag:  IN
some , Tag:  DT
reviews , Tag:  NNS
give , Tag:  VBP
it , Tag:  PRP
credit , Tag:  NN
for , Tag:  IN
. , Tag:  .


- Where:

|NAME|TYPE|DESCRIPTION|
|---|---|---|
|pos|int|Coarse-grained part-of-speech.|
|pos_|unicode|Coarse-grained part-of-speech.|
|tag|int|Fine-grained part-of-speech.|
|tag_|unicode|Fine-grained part-of-speech.|

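- *Coarse-grained*: broad word classes such as `NOUN` or `VERB` (`pos_`)
- *Fine-grained*: detailed tags such as `NN` or `VBP` (`tag_`)

- Each `word` here is a ***Token object***; for its other attributes see the official documentation: [Token object #attributes](https://spacy.io/api/token#attributes)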

In [8]:
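# define some parameters
# NOTE: spaCy's coarse-grained tag for proper nouns is "PROPN"; "PROP" never
# matches any token, so proper nouns are not filtered out in the output below.
noisy_pos_tags = set(["PROP"])
min_token_length = 2

# Function to check whether the token is noise or not
def isNoise(token):
    if (token.pos_ in noisy_pos_tags) or \
       token.is_stop or \
       (len(token) <= min_token_length):
        return True
    return False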

def cleanup(token, lower = True):
    if lower:
        return token.lower().strip()
    return token.strip()

# top unigrams used in the reviews 
from collections import Counter
cleaned_list = [cleanup(word.text) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)

[('hotel', 685),
 ('room', 653),
 ('great', 300),
 ('sheraton', 286),
 ('location', 272)]
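
The noise filter is easy to check on a handful of stand-in tokens. The sketch below uses a hypothetical `Tok` namedtuple in place of spaCy's `Token` (the tags and stop-word flags are made up for illustration) and suggests how choosing `"PROPN"`, spaCy's coarse-grained tag for proper nouns, instead of `"PROP"` changes what survives:

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy Token; values are made up for illustration.
Tok = namedtuple("Tok", ["text", "pos_", "is_stop"])

sample = [
    Tok("The", "DET", True),
    Tok("Sheraton", "PROPN", False),
    Tok("hotel", "NOUN", False),
    Tok("is", "AUX", True),
    Tok("a", "DET", True),
    Tok("great", "ADJ", False),
    Tok("hotel", "NOUN", False),
]

def is_noise(token, noisy_pos_tags, min_token_length=2):
    """Same rule as isNoise above, with the tag set passed in explicitly."""
    return (token.pos_ in noisy_pos_tags
            or token.is_stop
            or len(token.text) <= min_token_length)

def top_unigrams(tokens, noisy_pos_tags):
    """Lowercase the surviving tokens and count them."""
    kept = [t.text.lower().strip() for t in tokens
            if not is_noise(t, noisy_pos_tags)]
    return Counter(kept).most_common()

print(top_unigrams(sample, {"PROP"}))   # "PROP" matches nothing: proper nouns kept
print(top_unigrams(sample, {"PROPN"}))  # proper nouns dropped
```

With `{"PROP"}` the proper noun `sheraton` stays in the counts; with `{"PROPN"}` it is removed.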

## 2.3 Entity Detection

- Entities can be of different types, such as – ***person***, ***location***, ***organization***, ***dates***, ***numerals***, etc. 
- These entities can be accessed through ***“.ents”*** property.
- [All Entities](https://spacy.io/api/annotation#named-entities)

In [9]:
labels = set([w.label_ for w in document.ents]) 
labels

{'CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

In [10]:
for label in labels: 
    entities = [ cleanup(e.text, lower=False) for e in document.ents if label==e.label_] 
    entities = list(set(entities)) 
    print(label)
    print("    ", entities[:5])

MONEY
     ['99', 'more than $80', '129', '18', 'about $135']
ORDINAL
     ['25th', '24th', '2nd', '5th', '29th']
PRODUCT
     ['the Rodeo Dr', 'The Atlantis Fitness', 'Copley', 'Highly', 'the USS Constitution']
ORG
     ['Quincy Market', 'Suite', 'PRICELINE', 'the Pour House', 'the Fitz Inn']
NORP
     ['Italian', 'T.', '65F.We', 'European', 'Hynes']
PERSON
     ['Keeps Me Coming Back', 'Sweet Sleeper - so', 'Duck', 'the Duck Tour', 'Taylor']
EVENT
     ['New Years', 'the Marathon Expo', 'the Hynes Convention Ctr', 'an International Seafood Show', 'Key Club International Convention']
GPE
     ['Hotwire', 'Backbay', 'Priceline', 'Sak’s', 'the Ramada Hong Kong']
FAC
     ['22nd Floor', 'the Prudential Tower', 'Faneiul Hall', 'South Tower', 'the South Tower']
CARDINAL
     ['4-star', '9/6', 'about 400', '3/3', '1200']
TIME
     ['more than one night', 'the third night', 'the next morning', 'about an hour', '179CDN$/per night']
LOC
     ['Breakfast', 'the North Wing', 'Boston Harbor', 'Ne

- Here, [`document.ents`](https://spacy.io/api/doc#ents) returns [Span objects](https://spacy.io/api/span)
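
Grouping entity texts by label can also be done in a single pass over the entities. A minimal sketch, assuming a hand-written list of `(text, label)` pairs copied from the output above in place of real `Span` objects:

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; with spaCy these would
# come from (e.text, e.label_) for e in document.ents.
pairs = [
    ("Quincy Market", "ORG"),
    ("25th", "ORDINAL"),
    ("PRICELINE", "ORG"),
    ("Boston Harbor", "LOC"),
    ("2nd", "ORDINAL"),
    ("Quincy Market", "ORG"),  # duplicate mention, collapsed by the set
]

# One pass: each label maps to the set of distinct entity texts seen for it.
by_label = defaultdict(set)
for text, label in pairs:
    by_label[label].add(text.strip())

for label in sorted(by_label):
    print(label)
    print("    ", sorted(by_label[label]))
```

Compared with filtering `document.ents` once per label, this visits each entity only once, which may matter on long documents.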

---

## 2.4 Dependency Parsing
- What is Dependency Parsing?
    - [Dependency Parsing in NLP](https://shirishkadam.com/2016/12/23/dependency-parsing-in-nlp/)
- It helps with semantic analysis: working out which token is actually the subject, the object, and so on

In [11]:
# extract all review sentences that contains the term - hotel
hotel = [sent for sent in document.sents if 'hotel' in sent.text.lower()]
hotel[:5]

[The hotel as stated is in a fantastic location and the Wrentham Village outlet is well worth a visit for bargain shopping ( the bus picks up outside).,
 The hotel bar is a little pricey ( not helped by the current dollar rate) but is a nice place to relax after a busy day shopping.,
 A cab from the airport to the hotel can be cheaper than the shuttles depending what time of the day you go.,
 Boston from 17th Floor of Sheraton Hotel 
 Find an alternative to the Sheraton,
 We stayed at this hotel for 3 nights.]

In [12]:
sentence = hotel[4] 
for word in sentence:
    print(word, ': ', str(list(word.children)))

We :  []
stayed :  [We, at, for, .]
at :  [hotel]
this :  []
hotel :  [this]
for :  [nights]
3 :  []
nights :  [3]
. :  []
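
The `children` relation printed above can be derived from head pointers alone. A minimal sketch in plain Python, where the `heads` list is transcribed by hand from the parse of this sentence (an assumption for illustration, not spaCy output):

```python
# Tokens of the example sentence, in document order.
tokens = ["We", "stayed", "at", "this", "hotel", "for", "3", "nights", "."]

# heads[i] is the index of token i's syntactic head; the root points to itself.
# Transcribed by hand from the parse of this sentence, not produced by spaCy.
heads = [1, 1, 1, 4, 2, 1, 7, 5, 1]

def children_of(heads):
    """Invert the head pointers: map each token index to its child indices."""
    children = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if child != head:  # skip the root's self-reference
            children[head].append(child)
    return children

children = children_of(heads)
for i, word in enumerate(tokens):
    print(word, ": ", [tokens[c] for c in children[i]])
```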


In [13]:
from nltk import Tree

def tok_format(tok):
    return '{}({})({})'.format(tok.orth_, tok.tag_, tok.dep_)
#     return "_".join([tok.orth_, tok.tag_, tok.dep_])

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

print(sentence)
to_nltk_tree(sentence.root).pretty_print()

We stayed at this hotel for 3 nights.
               stayed(VBD)(ROOT)                                  
       ________________|_________________________________          
      |                |           at(IN)(prep)    for(IN)(prep)  
      |                |                |                |         
      |                |         hotel(NN)(pobj) nights(NNS)(pobj)
      |                |                |                |         
We(PRP)(nsubj)    .(.)(punct)     this(DT)(det)    3(CD)(nummod)  



In [14]:
# check all adjectives used with a word 
def pos_words (doc, token, ptag):
    sentences = [sent for sent in doc.sents if token in sent.text]     
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.text: 
                pwrds.extend([child.text.strip() for child in word.children
                                                      if child.pos_ == ptag] )
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', "ADJ")

[('other', 20),
 ('great', 10),
 ('nice', 7),
 ('good', 7),
 ('better', 6),
 ('Nice', 5),
 ('different', 5),
 ('many', 5),
 ('best', 4),
 ('wonderful', 3)]

# 3. Word to Vectors Integration

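- [Official reference documentation](https://spacy.io/usage/vectors-similarity)
- [All Available pre-trained statistical models for English](https://spacy.io/models/en)
- There are four English models alone. How do they differ?
    - They differ in size; the larger the model, the more detailed it is (sm < md < lg)
    - Only md and lg come with GloVe vectors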

***Use built-in GloVe word embeddings***

In [15]:
# download en_core_web_md first (~95.4 MB)
# !python -m spacy download en_core_web_md
import en_core_web_md

# nlp_haha = spacy.load('en_core_web_md') 
# did not work for me (Anaconda on Win10)

nlp_haha = en_core_web_md.load() # works fine
tokens = nlp_haha('dog cat banana egawhgawh.')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
egawhgawh False 0.0 True
. True 4.9316354 False


In [16]:
for i, i_token in enumerate(tokens[:-1]):
    print("{} ({})".format(i_token, i_token.has_vector))
    for j, j_token in enumerate(tokens[i + 1:]):
        if i_token.has_vector and j_token.has_vector:
            print("   └╴", j_token, i_token.similarity(j_token))

dog (True)
   └╴ cat 0.80168545
   └╴ banana 0.24327643
   └╴ . 0.27271757
cat (True)
   └╴ banana 0.28154364
   └╴ . 0.28120673
banana (True)
   └╴ . 0.18991143
egawhgawh (False)
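
By default, `Token.similarity` is the cosine similarity of the two token vectors. The sketch below reproduces the pairwise loop in plain Python with toy 3-dimensional vectors (the values are made up for illustration; they are not the `en_core_web_md` vectors) and skips any pair in which one word has no vector:

```python
import math
from itertools import combinations

# Toy 3-dimensional vectors, made up for illustration; the real
# en_core_web_md vectors have 300 dimensions.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
    "egawhgawh": [0.0, 0.0, 0.0],  # out-of-vocabulary: all zeros
}

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine similarity; returns None when either vector has zero norm."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return None
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

# combinations() yields each unordered pair exactly once.
for a, b in combinations(vectors, 2):
    sim = cosine(vectors[a], vectors[b])
    if sim is not None:
        print(f"{a} ~ {b}: {sim:.3f}")
```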


In [17]:
print(tokens[0].vector.shape)

(300,)


***Use custom word vectors*** ([official guide](https://spacy.io/usage/vectors-similarity#custom))
- Most word-vector tools write their output line by line, each line in the form `word <vector...>`
- Before spaCy can load the vectors, use the [cli tool](https://spacy.io/api/cli#init-model) to convert the plain-text file to a binary format; the text file must follow the Word2Vec format exactly
    - on **linux**: `python -m spacy init-model en custom_vectors --vectors-loc custom_vectors.txt`
    - on **windows**: `python.exe -m spacy init-model en custom_vectors --vectors-loc custom_vectors.txt`
    - ![init-model.png](./img/init-model.png)
- For an example of the Word2Vec format, see [custom_vectors.txt](./custom_vectors.txt)
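
For reference, the Word2Vec text format is a header line holding the vocabulary size and the vector dimension, followed by one `word v1 v2 ...` line per entry. A small sketch that writes and re-reads such a file; the file name `toy_vectors.txt`, the words, and the vector values are assumptions for illustration, not the contents of the linked `custom_vectors.txt`:

```python
import os
import tempfile

# Toy vocabulary, made up for illustration.
toy_vectors = {
    "alpha": [0.1, 0.2, 0.3, 0.4, 0.5],
    "beta": [0.5, 0.4, 0.3, 0.2, 0.1],
}

def write_word2vec_text(path, vectors):
    """Write a '<count> <dim>' header, then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf8") as f:
        f.write(f"{len(vectors)} {dim}\n")
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(str(x) for x in vec) + "\n")

def read_word2vec_text(path):
    """Parse the file back into a dict, checking it against the header."""
    with open(path, encoding="utf8") as f:
        count, dim = (int(n) for n in f.readline().split())
        vectors = {}
        for line in f:
            word, *values = line.rstrip("\n").split(" ")
            vectors[word] = [float(x) for x in values]
            assert len(vectors[word]) == dim
    assert len(vectors) == count
    return vectors

path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
write_word2vec_text(path, toy_vectors)
with open(path, encoding="utf8") as f:
    print(f.read())
loaded = read_word2vec_text(path)
```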

In [18]:
nlp_haha = spacy.load("./custom_vectors") # path to the model directory
doc1 = nlp_haha("haha uccu.")
doc2 = nlp_haha("hello cellopoint.")

print("Doc 1: ")
for token in doc1:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
    
print("\n", "Doc 2: ")
for token in doc2:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

print("\n", "Norm check: ", (0.1**2 + 0.2**2 + 0.3**2 + 0.4**2 + 0.5**2)**0.5)

Doc 1: 
haha True 0.7416199 False
uccu False 0.0 True
. False 0.0 True

 Doc 2: 
hello False 0.0 True
cellopoint True 0.7416199 False
. False 0.0 True

 Norm check:  0.7416198487095663


In [19]:
print(doc1.similarity(doc2))

1.000000028574667
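
A similarity of 1.0 is what cosine similarity gives when two document vectors point in the same direction. `Doc.vector` defaults to the average of the token vectors, so if `haha` and `cellopoint` share one vector (an assumption suggested by their identical norms, not confirmed by the source) and every other token is a zero vector, both documents average to the same direction. A sketch in plain Python:

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def mean_vector(vectors):
    """Element-wise average, mirroring the default behaviour of Doc.vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

known = [0.1, 0.2, 0.3, 0.4, 0.5]  # assumed vector for "haha" and "cellopoint"
zero = [0.0] * 5                   # tokens without a vector

doc1 = mean_vector([known, zero, zero])  # "haha", "uccu", "."
doc2 = mean_vector([zero, known, zero])  # "hello", "cellopoint", "."

print(norm(known))         # agrees with the norm check above up to float rounding
print(cosine(doc1, doc2))  # 1.0 up to float rounding
```

The tiny excess over 1 in the spaCy output is consistent with ordinary floating-point rounding.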


# 4. Machine Learning with text using Spacy

In [20]:
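# sklearn.feature_extraction.stop_words was removed in newer scikit-learn releases;
# the same set is importable from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords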
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

import string
punctuations = string.punctuation

import en_core_web_md
parser = en_core_web_md.load()

#Custom transformer using spaCy 
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic utility function to clean the text 
def clean_text(text):     
    return text.strip().lower()

In [21]:
#Create spacy tokenizer that parses a sentence and generates tokens
#these can also be replaced by word vectors 
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    tokens = [tok for tok in tokens if (tok not in stopwords and tok not in punctuations)]
    return tokens

# create a vectorizer object to generate feature vectors, using our custom spaCy tokenizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

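- Note: `-PRON-` is spaCy's implementation choice for lemmatizing **personal pronouns**: a special symbol is used in place of a lemma. For the original text see [api/annotation#lemmatization](https://spacy.io/api/annotation#lemmatization)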
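
The pronoun fallback can be seen in isolation with hand-written `(lemma, lowercase)` pairs standing in for spaCy tokens. The pairs and the tiny stop-word set below are assumptions for illustration, not actual parser output:

```python
import string

# (lemma_, lower_) pairs roughly as spaCy 2.x might produce them for
# "I love these sandwiches." Hand-written for illustration.
pairs = [("-PRON-", "i"), ("love", "love"), ("these", "these"),
         ("sandwich", "sandwiches"), (".", ".")]

toy_stopwords = {"i", "these"}  # tiny stand-in for ENGLISH_STOP_WORDS

def normalize(pairs, stopwords):
    """Apply the same two steps as spacy_tokenizer to (lemma, lower) pairs."""
    # Use the lemma unless it is the -PRON- placeholder; then keep the word itself.
    tokens = [lemma.lower().strip() if lemma != "-PRON-" else lower
              for lemma, lower in pairs]
    # Drop stop words and punctuation.
    return [t for t in tokens
            if t not in stopwords and t not in string.punctuation]

print(normalize(pairs, toy_stopwords))
```

Without the fallback, every personal pronoun would collapse into the single feature `-pron-`.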

In [22]:
# Create the  pipeline to clean, tokenize, vectorize, and classify 
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])
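
Conceptually, the `vectorizer` step turns each token list into a row of counts over a shared vocabulary, and the classifier is trained on those rows. A plain-Python sketch of that idea (a simplification for illustration, not scikit-learn's implementation; the token lists are made up):

```python
from collections import Counter

# Token lists as a tokenizer might return them; made up for illustration.
docs = [["love", "sandwich"],
        ["amazing", "place"],
        ["love", "amazing", "amazing"]]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_count_row(tokens, vocab):
    """Count how often each vocabulary entry occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]  # dict keeps column order

matrix = [to_count_row(d, vocab) for d in docs]
print(vocab)
for row in matrix:
    print(row)
```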

In [23]:
# Load sample data
train = [('I love this sandwich.', 'pos'),          
         ('this is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('this is my best work.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('he is my sworn enemy!', 'neg'),          
         ('my boss is horrible.', 'neg')]

test =   [('the beer was good.', 'pos'),     
         ('I do not enjoy my job', 'neg'),
         ("I ain't feelin dandy today.", 'neg'),
         ("I feel amazing!", 'pos'),
         ('Gary is a good friend of mine.', 'pos'),
         ("I can't believe I'm doing this.", 'neg')]

In [24]:
# Create model and measure accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train]) 
pred_data = pipe.predict([x[0] for x in test])

for (sample, pred) in zip(test, pred_data):
    print(sample, pred)
print("Accuracy:", accuracy_score([x[1] for x in test], pred_data))

('the beer was good.', 'pos') pos
('I do not enjoy my job', 'neg') neg
("I ain't feelin dandy today.", 'neg') neg
('I feel amazing!', 'pos') pos
('Gary is a good friend of mine.', 'pos') pos
("I can't believe I'm doing this.", 'neg') neg
Accuracy: 1.0
