# Week 06: Dependency Parser and spaCy
The assignment this week is to identify the grammar pattern VERB-PREP-NOUN using two different methods. You will practice various spaCy features in the process.

Data used in this assignment:  
https://drive.google.com/file/d/1OIZPsDezgLaBjw3OX30YFyeFkzegtwP8/view?usp=sharing

* sentences.s2orc.txt

spacy tutorials:  
https://www.machinelearningplus.com/spacy-tutorial-nlp/#phrasematcher  
https://spacy.io/usage/linguistic-features#entity-linking

## Requirements
* pandas
* spacy



### Installation of spacy

In [1]:
! pip install spacy
! python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Read Data

In [2]:
! pip install pandas



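#### Verb, preposition, noun
Rule-based match: only captures tokens that are directly adjacent.   
Tree (dependency) match: captures the pattern even when the tokens are not adjacent.    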
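
To make the contrast concrete, here is a minimal plain-Python sketch that does not use spaCy. The sentence, its POS tags, and its head indices are hand-annotated assumptions for illustration (a real parse may differ), and `adjacent_matches` and `tree_matches` are hypothetical helper names.

```python
# Hand-annotated toy sentence (an assumption, not real parser output).
# Each entry: (text, coarse POS tag, index of the syntactic head).
TOKENS = [
    ("He", "PRON", 1),
    ("began", "VERB", 1),        # root: its head is itself
    ("immediately", "ADV", 1),
    ("to", "PART", 4),
    ("rant", "VERB", 1),
    ("about", "ADP", 4),
    ("the", "DET", 8),
    ("gas", "NOUN", 8),
    ("price", "NOUN", 5),
]


def adjacent_matches(tokens):
    """Return VERB-ADP-NOUN triples whose tokens are directly adjacent."""
    found = []
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if [pos for _, pos, _ in window] == ["VERB", "ADP", "NOUN"]:
            found.append(tuple(text for text, _, _ in window))
    return found


def tree_matches(tokens):
    """Return VERB-ADP-NOUN triples linked by head relations, adjacent or not."""
    found = []
    for prep_idx, (prep, prep_pos, verb_idx) in enumerate(tokens):
        if prep_pos != "ADP" or tokens[verb_idx][1] != "VERB":
            continue
        for noun, noun_pos, noun_head in tokens:
            if noun_pos == "NOUN" and noun_head == prep_idx:
                found.append((tokens[verb_idx][0], prep, noun))
    return found


print(adjacent_matches(TOKENS))
print(tree_matches(TOKENS))
```

Under this annotation the adjacent window finds nothing, because "the gas" sits between "about" and "price", while the head-based search still recovers (rant, about, price).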
    

In [3]:
import pandas as pd
def loadData(path):
    with open(path) as f:
        sents = []
        for line in f.readlines():
            line = line.strip("\n").split("\t")
            sents.append(line[1])
    return pd.DataFrame({"sentence":sents})
data = loadData("sentences.s2orc.txt")
print(data.head())


                                            sentence
0  Meanwhile, an analysis of the literature shows...
1  Meanwhile, this list can be supplemented with ...
2  At the same time, in many cases, several instr...
3  It is not possible to give a systematic assess...
4  Correlation was calculated for the years, wher...
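
The loader above indexes `line[1]`, so it would raise an `IndexError` on any line without a tab. A minimal defensive sketch follows, assuming the file holds one tab-separated record per line with the sentence in the second field (which is what the loader implies). The sample records are made up, and `load_sentences` is a hypothetical name.

```python
import io


def load_sentences(lines):
    """Collect the second tab-separated field of each well-formed line."""
    sentences = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # skip blank or malformed lines instead of raising IndexError
        sentences.append(fields[1])
    return sentences


# Made-up sample standing in for sentences.s2orc.txt
sample = io.StringIO("1\tFirst sentence.\n\nbroken line without a tab\n2\tSecond sentence.\n")
print(load_sentences(sample))
```

The returned list can be wrapped in `pd.DataFrame({"sentence": ...})` exactly as in the original loader.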


In [4]:
import re
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')

### spaCy example
If you have any problems, look up the documentation [here](https://spacy.io/usage/linguistic-features).


In [5]:
example_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost.
He began immediately to rant about the gas price .
"""

# Remove newline character
example_text = re.sub("\n", '', example_text)
example_doc = nlp(example_text)
#for token in example_doc:
#    print(token.text,'--',token.is_stop,'---',token.is_punct)

<font color="red">**[ TODO ]**</font> Please print out the 2nd sentence in the example_text

In [6]:
sents = []
for sent in example_doc.sents:
    sents.append(sent)
print(sents[1])

Citizens who had their main investment in the share-market are facing a great loss.


Let's start with some simple linguistic features we have been dealing with.

<font color="red">**[ TODO ]**</font> Please print out the following token features of the first sentence in example_text:  
text,  lemma,  POS

#### A `Doc` is a sequence of tokens that contains not just the original text but also all the annotations produced by the spaCy model after processing the text.

In [7]:
for token in sents[0]:
    print(token.text, token.lemma_, token.pos_)

    

The the DET
economic economic ADJ
situation situation NOUN
of of ADP
the the DET
country country NOUN
is be AUX
on on ADP
edge edge NOUN
, , PUNCT
as as SCONJ
the the DET
stock stock NOUN
market market NOUN
crashed crash VERB
causing cause VERB
loss loss NOUN
of of ADP
millions million NOUN
. . PUNCT


<font color="red">**[ TODO ]**</font> Data Process 1: Please run the s2orc data through spacy and store the result in data_doc

In [8]:
# This takes a while. nlp.pipe processes the sentences in batches,
# which is usually faster than calling nlp() once per sentence.
# type(data) is a pandas DataFrame
data_doc = list(nlp.pipe(data['sentence']))



In [9]:
data_doc[0]

Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed.

In [10]:
data_doc[:10]

[Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed.,
 Meanwhile, this list can be supplemented with instruments of monetary policy, which also have an impact on financial stability.,
 At the same time, in many cases, several instruments are used to reduce financial instability, which contributes to the achievement of various intermediate goals.,
 It is not possible to give a systematic assessment of financial stability and coordinate the use of monetary, macro-prudential and micro-prudential policies in order to reduce systemic risks.,
 Correlation was calculated for the years, where the information is available for both indicators.,
 Table 4 defines the criteria for market and institutional balance of financial stability, formed for the Russian economy.,
 The development of a risk map is necessary in order to determine the objects of regulation.,
 Blowing out a bubble has little effect on the asset itsel

### Named Entity Recognition
Named Entity: a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name.  

The following is an example of named entity recognition using spacy

In [11]:
# Find the named entities (proper names)

ner_doc = nlp("Ada Lovelace was born in New York at Thanksgiving.")

# Document level
for e in ner_doc.ents:
    print(e.text, e.label_) 

Ada Lovelace PERSON
New York GPE
Thanksgiving DATE


In [12]:
from spacy import displacy
displacy.render(ner_doc,style='ent',jupyter=True)

<font color="red">**[ TODO ]**</font> Data Process 2: Please replace all named entities in data_doc with their labels.  
For example,  
"Ada Lovelace was born in New York at Thanksgiving." should be adjusted to  
"PERSON was born in GPE at DATE."

In [None]:
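data_doc1 = []

for doc in data_doc:
    s = doc.text
    # Replace by character offsets, last entity first, so that earlier
    # offsets stay valid and text inside other words is left untouched
    for ent in reversed(doc.ents):
        s = s[:ent.start_char] + ent.label_ + s[ent.end_char:]
    data_doc1.append(nlp(s))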
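
A caveat on string replacement for this step: `str.replace` substitutes every occurrence of the entity text, including occurrences inside longer words. spaCy entity spans expose `start_char` and `end_char`, so replacing by character offsets avoids this. The sketch below uses plain `(start_char, end_char, label)` tuples as stand-ins for `doc.ents`, with offsets computed by hand rather than by a model; `replace_spans` is a hypothetical helper.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


text = "Ada Lovelace was born in New York at Thanksgiving."
# Hand-built stand-ins for spaCy entity spans (assumed labels).
spans = []
for phrase, label in [("Ada Lovelace", "PERSON"), ("New York", "GPE"), ("Thanksgiving", "DATE")]:
    start = text.index(phrase)
    spans.append((start, start + len(phrase), label))

print(replace_spans(text, spans))

# Pitfall of plain str.replace: the entity text "US" also occurs inside "USB".
print("US firms ship USB drives".replace("US", "GPE"))
print(replace_spans("US firms ship USB drives", [(0, 2, "GPE")]))
```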

In [None]:
data_doc1[33] 

In [None]:
data_doc = data_doc1

In [None]:
data_doc[:10]
#print(len(data_doc))

### Dependency Parser

If you have problems with the dependency parser tags, look up the documentation [here](https://universaldependencies.org/en/dep/index.html). 


In [None]:
# Example of Dependency Parser
# token.dep (without the trailing underscore) is the integer ID; token.dep_ is the string label
print(sents[2])
for token in sents[2]:
    #print(token.text, token.dep_)
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

In [None]:
from spacy import displacy

displacy.render(sents[2], style="dep")

To traverse a dependency tree, use the following properties of token object.  
token.children, token.lefts, token.rights  

If you have any problems, please check [here](https://spacy.io/api/token#children).
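
As a rough mental model of these properties, the sketch below derives `children`, `lefts`, and `rights` from a list of head indices in plain Python. In spaCy they are token attributes computed from the parse; the hand-written heads here are an assumption for illustration, and the function names are hypothetical.

```python
WORDS = ["Many", "companies", "might", "lay", "off", "thousands", "of", "people"]
# HEADS[i] is the index of the head of WORDS[i]; the root points to itself.
HEADS = [1, 3, 3, 3, 3, 3, 5, 6]


def children(i):
    """Indices of the tokens whose head is token i, in sentence order."""
    return [j for j, head in enumerate(HEADS) if head == i and j != i]


def lefts(i):
    """Children that appear before token i."""
    return [j for j in children(i) if j < i]


def rights(i):
    """Children that appear after token i."""
    return [j for j in children(i) if j > i]


lay = WORDS.index("lay")
print([WORDS[j] for j in lefts(lay)])
print([WORDS[j] for j in rights(lay)])
```

Under this hand annotation, "off" and "thousands" are both right children of "lay", so a search for a preposition followed by a noun among a verb's right children would find them.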

<font color="red">**[ TODO ]**</font> Please identify a VERB-PREP-NOUN grammar structure in sents[2] by traversing the dependency tree.  
Expected output:  
(lay, off, thousands)


In [None]:
def verb_right(token):
    """Return (prep, noun) tuples found at or directly below `token`."""
    return_set = []
    # type 1: verb -> prep -> noun
    if token.pos_ == 'ADP':
        # check whether the preposition has a noun among its right children
        for child in token.rights:
            if child.pos_ == 'NOUN':
                return_set.append((token, child))
    # type 2: verb -> noun -> prep
    if token.pos_ == 'NOUN':
        # check whether the noun has a preposition among its left children
        for child in token.lefts:
            if child.pos_ == 'ADP':
                # keep (prep, noun) order so callers can build (verb, prep, noun)
                return_set.append((child, token))

    return return_set  # list of tuples

In [None]:
col_list = [] 
for token in sents[2]:    
    print(token.text, token.pos_, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
    if token.pos_ == 'VERB':
        total = []
        rights = [t for t in token.rights]
        #print(f'\t {token.text}\trights:  {rights}\n')
        
        for t in rights:
            # verb_right(t) returns a list of (prep, noun) tuples
            total.extend(verb_right(t))
            # TODO: total might contain duplicates
        count = 0
        for r_child in rights:
            count += 1
            if r_child.pos_ == 'ADP':
                for i in range(count, len(rights)):
                    #print(f'i:   {i}')
                    if rights[i].pos_ == 'NOUN':
                        total.append(tuple((r_child, rights[i])))
                        #print(type(total[-1]))
                            #break  # stop after the first match
        #print(f'\n\n final total\t\t{total}')
        for x in total:
            #print(type(x))
            #print(x)
            col_list.append(tuple((token, x[0], x[1])))
            
print('\n\n')            
print(col_list)

<font color="red">**[ TODO ]**</font>  Please identify all VERB-PREP-NOUN grammar structures in data_doc by traversing the dependency trees and save the results in a list of tuples dep_gp.


In [None]:
dep_gp = []

for sentence in data_doc:
    col_list = [] 
    for token in sentence:
        if token.pos_ == 'VERB':
            total = []
            rights = [t for t in token.rights]

            for t in rights:
                # verb_right(t) returns a list of (prep, noun) tuples
                total.extend(verb_right(t))
                # TODO: total might contain duplicates
            count = 0
            for r_child in rights:
                count += 1
                if r_child.pos_ == 'ADP':
                    for i in range(count, len(rights)):
                        #print(f'i:   {i}')
                        if rights[i].pos_ == 'NOUN':
                            total.append(tuple((r_child, rights[i])))
                                #break  # stop after the first match
            for x in total:
                col_list.append(tuple((token, x[0], x[1])))
           
    #print(col_list)
    dep_gp.extend(col_list)
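
The TODO comment in the cell above flags possible duplicates. One order-preserving way to drop repeated triples is sketched below on plain string tuples. The data is made up, and it assumes the tokens have first been converted to strings (for example via `token.lemma_`).

```python
from collections import Counter

# Made-up triples standing in for (verb lemma, preposition, noun) strings.
triples = [
    ("lay", "off", "thousands"),
    ("rant", "about", "price"),
    ("lay", "off", "thousands"),
]

# dict keys keep insertion order, so this removes repeats without reordering.
unique_triples = list(dict.fromkeys(triples))
print(unique_triples)

# Counter keeps the frequencies instead, which is often what a pattern study needs.
print(Counter(triples).most_common(1))
```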


<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in dep_gp with the verb "charge".


In [None]:
# Print the patterns whose verb lemma is "charge"
print(dep_gp[:20])
print(len(dep_gp))
print('*' * 50)
charge_list = [x for x in dep_gp if x[0].lemma_ == 'charge']
print(len(charge_list))
print(charge_list)

### Rule Based Methods 
We can also build custom rules for spaCy to match patterns.  
[Documentation](https://spacy.io/api/matcher)

In [None]:
from spacy.matcher import Matcher 

In [None]:
# Example text
text = """I visited Manali last time . Around same budget trips ? I was visiting Ladakh this summer . I have planned visiting New York and other abroad places for next year. Have you ever visited Kodaikanal? """
text = re.sub('\n', '', text)
doc = nlp(text)

In [None]:
# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a form of "visit" + place
my_pattern = [{"LEMMA": "visit"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("Visting_places", [my_pattern])
matches = matcher(doc)

# Counting the no of matches
print(" matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

<font color="red">**[ TODO ]**</font> Please identify all VERB-PREP-NOUN grammar structures in data_doc by applying a matcher rule and store the results in a list of tuples rule_gp. 


If other words come between a, b, and c, the a+b+c pattern should not match.

In [None]:
# Use lemma

rule_gp = []


# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a verb directly followed by a preposition and a noun
my_pattern = [ {"POS": "VERB"}, {"POS": "ADP"},  {"POS": "NOUN"}]

# Add the pattern to the matcher; it is applied to each doc in the loop below
matcher.add("VERB_PREP_NOUN", [my_pattern])

for data_nlp in data_doc:
    matches = matcher(data_nlp)
    for match_id, start, end in matches:
        #print("Match found:", data_nlp[start:end].text)
        rule_gp.append(tuple(data_nlp[start:end].text.split()))
# Counting the no of matches
print(len(rule_gp))
#print(rule_gp)

In [None]:
rule_gp[:10]

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in rule_gp with the verb "charge".


In [None]:
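# rule_gp holds plain strings, so compare against the surface forms of "charge"
charge_forms = {'charge', 'charges', 'charged', 'charging'}
for x in rule_gp:
    if x[0].lower() in charge_forms:
        print(x)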
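
Since the assignment extracts the same pattern with two methods, it can be informative to compare what each one finds. A minimal sketch, assuming both result lists have first been normalized to lowercase string triples (the data below is made up; `dep_triples` and `rule_triples` are hypothetical names):

```python
# Made-up results standing in for normalized dep_gp and rule_gp contents.
dep_triples = {("lay", "off", "thousands"), ("rant", "about", "price")}
rule_triples = {("lay", "off", "thousands"), ("invest", "in", "stocks")}

both = dep_triples & rule_triples        # found by both methods
only_dep = dep_triples - rule_triples    # found only via the dependency tree
only_rule = rule_triples - dep_triples   # found only via the matcher rule

print("found by both:", sorted(both))
print("dependency only:", sorted(only_dep))
print("matcher only:", sorted(only_rule))
```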

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve a demo time.  
Scores are given only after a TA reviews your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to elearn. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.