# Week 06: Dependency Parser and spaCy
This week's assignment is to identify the grammar pattern VERB-PREP-NOUN using two different methods: dependency-tree traversal and rule-based matching. You will practice several spaCy features along the way.

Data used in this assignment:  
https://drive.google.com/file/d/1OIZPsDezgLaBjw3OX30YFyeFkzegtwP8/view?usp=sharing

* sentences.s2orc.txt

spacy tutorials:  
https://www.machinelearningplus.com/spacy-tutorial-nlp/#phrasematcher  
https://spacy.io/usage/linguistic-features#entity-linking

## Requirements
* pandas
* spacy



### Installation of spacy

In [1]:
! pip install spacy
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Read Data

In [2]:
import pandas as pd

def loadData(path):
  with open(path) as f:
    sents = []
    for line in f.readlines():
      line = line.strip("\n").split("\t")
      sents.append(line[1])
  return pd.DataFrame({"sentence": sents})

data = loadData("sentences.s2orc.txt")
data.head(30)

Unnamed: 0,sentence
0,"Meanwhile, an analysis of the literature shows..."
1,"Meanwhile, this list can be supplemented with ..."
2,"At the same time, in many cases, several instr..."
3,It is not possible to give a systematic assess...
4,"Correlation was calculated for the years, wher..."
5,Table 4 defines the criteria for market and in...
6,The development of a risk map is necessary in ...
7,Blowing out a bubble has little effect on the ...
8,"In the state, the investment directions are ti..."
9,The decrease in IFS in early 2015 is due to a ...
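
`loadData` indexes `line[1]`, so it assumes every line holds at least two tab-separated fields. As a minimal sketch, assuming that layout (an ID column followed by the sentence), the variant below skips lines that do not fit instead of raising `IndexError`. The name `load_sentences` and the demo file contents are hypothetical and not part of the assignment data.

```python
import os
import tempfile


def load_sentences(path):
    """Return the second tab-separated field of each well-formed line."""
    sents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip blank or malformed lines that have no second column
            if len(fields) < 2:
                continue
            sents.append(fields[1])
    return sents


# Hypothetical demo file: two valid rows and one malformed row
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("0\tFirst sentence.\nmalformed line\n1\tSecond sentence.\n")
    demo_path = tmp.name

demo_sents = load_sentences(demo_path)
os.remove(demo_path)
print(demo_sents)
```

The returned list can be wrapped in a DataFrame the same way `loadData` does, with `pd.DataFrame({"sentence": demo_sents})`.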


In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_entities")

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>

### spaCy example
If you run into any problems, consult the documentation [here](https://spacy.io/usage/linguistic-features).

In [4]:
import re

example_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost.
He began immediately to rant about the gas price .
"""

# Remove newline character
example_text = re.sub("\n", "", example_text)
example_doc = nlp(example_text)
example_doc

The economic situation of the country is on edge , as the stock market crashed causing loss of millions. Citizens who had their main investment in the share-market are facing a great loss. Many companies might lay off thousands of people to reduce labor cost.He began immediately to rant about the gas price .

<font color="red">**[ TODO ]**</font> Please print out the 2nd sentence in the example_text

In [5]:
sents = [sent for sent in example_doc.sents]
type(sents[0])

spacy.tokens.span.Span

In [6]:
sents[1]

Citizens who had their main investment in the share-market are facing a great loss.

Let's start with some simple linguistic features we have been dealing with.

<font color="red">**[ TODO ]**</font> Please print out the following token features of the first sentence in example_text:  
text,  lemma,  POS (Part of Speech)

In [7]:
for token in sents[0]:
  print(token.text, token.lemma_, token.pos_)

The the DET
economic economic ADJ
situation situation NOUN
of of ADP
the the DET
country country NOUN
is be AUX
on on ADP
edge edge NOUN
, , PUNCT
as as SCONJ
the the DET
stock stock NOUN
market market NOUN
crashed crash VERB
causing cause VERB
loss loss NOUN
of of ADP
millions million NOUN
. . PUNCT


<font color="red">**[ TODO ]**</font> Data Process 1: Please run the s2orc data through spacy and store the result in data_doc

In [8]:
data_doc = list(nlp.pipe(data["sentence"]))
type(data_doc)

list

In [9]:
data_doc[0]

Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed.

In [10]:
type(data_doc[0])

spacy.tokens.doc.Doc

### Named Entity Recognition
Named Entity: a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name.  

The following is an example of named entity recognition using spacy

In [11]:
ner_doc = nlp("Ada Lovelace was born in New York at Thanksgiving.")

# Document level
for e in ner_doc.ents:
  print(e.text, e.label_) 

Ada Lovelace PERSON
New York GPE
Thanksgiving DATE


In [12]:
from spacy import displacy

displacy.render(ner_doc, style='ent', jupyter=True)

<font color="red">**[ TODO ]**</font> Data Process 2: Please replace all named entities in data_doc with their labels.  
For example,  
"Ada Lovelace was born in New York at Thanksgiving." should be adjusted to  
"PERSON was born in GPE at DATE."

In [13]:
for sent in data_doc[5:10]:
  displacy.render(sent, style="ent", jupyter=True)



In [14]:
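# nlp.add_pipe("merge_entities") was already called above, so each entity is a single token
for idx in range(len(data_doc)):
  # Replace each entity token with its label, then re-parse the rebuilt sentence
  data_doc[idx] = nlp(" ".join([token.text if not token.ent_type_ else token.ent_type_ for token in data_doc[idx]]))
data_doc[5:10]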

[Table CARDINAL defines the criteria for market and institutional balance of financial stability , formed for the NORP economy .,
 The development of a risk map is necessary in order to determine the objects of regulation .,
 Blowing out a bubble has little effect on the asset itself .,
 In the state , the investment directions are tightly controlled , in private companies , there is a danger of their involvement in various risk schemes .,
 The decrease in ORG in DATE is due to a sudden increase in inflation at DATE .]
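
Joining tokens with single spaces, as the cell above does, alters the original spacing (note `stability ,` in the output). One possible alternative, sketched here without spaCy, rebuilds the string from character offsets. The function `replace_spans` is hypothetical; with a real `Doc`, the spans could presumably be taken from `doc.ents` via `ent.start_char`, `ent.end_char`, and `ent.label_`. The spans below are written by hand, not produced by the model.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # the label stands in for the entity
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)


text = "Ada Lovelace was born in New York at Thanksgiving."

# Hand-written spans standing in for (ent.start_char, ent.end_char, ent.label_)
spans = []
for surface, label in [
    ("Ada Lovelace", "PERSON"),
    ("New York", "GPE"),
    ("Thanksgiving", "DATE"),
]:
    start = text.index(surface)
    spans.append((start, start + len(surface), label))

replaced = replace_spans(text, spans)
print(replaced)  # PERSON was born in GPE at DATE.
```

This sketch assumes the spans do not overlap, which holds for spaCy's `doc.ents`.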

### Dependency Parser

If you have problems concerning the dependency parser tags, look up the documentation [here](https://universaldependencies.org/en/dep/index.html). 


In [15]:
# Example of Dependency Parser
print(sents[2])
for token in sents[2]:
  print(token.text, token.dep_)

Many companies might lay off thousands of people to reduce labor cost.
Many amod
companies nsubj
might aux
lay ROOT
off prt
thousands dobj
of prep
people pobj
to aux
reduce advcl
labor compound
cost dobj
. punct


In [16]:
from spacy import displacy

displacy.render(sents[2], style="dep")

To traverse a dependency tree, use the following properties of the Token object:  
token.children, token.lefts, token.rights  

If you have any problems, please check [here](https://spacy.io/api/token#children)

<font color="red">**[ TODO ]**</font> Please identify a VERB-PREP-NOUN grammar structure in sents[2] by traversing the dependency tree.  
Expected output:  
(lay, off, thousands)


In [17]:
sent = nlp("Dog barks towards cats.")

In [18]:
displacy.render(sent, style="dep")

In [19]:
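def get_tuples(sent):
  """Return (verb, prep, noun) lemma tuples found by walking the dependency tree."""
  result = []
  for token in sent:
    if token.pos_ == "VERB":
      # A preposition attached to the verb
      for child in token.children:
        if child.dep_ == "prep":
          # A noun attached to that preposition
          for grandchild in child.children:
            if grandchild.pos_ == "NOUN":
              result.append((token.lemma_, child.lemma_, grandchild.lemma_))
  return result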

tuples = get_tuples(sent)
for t in tuples:
  print(t)

('bark', 'towards', 'cat')
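
To see what the traversal in `get_tuples` does independently of the parser, the same three-level walk can be run on a toy tree. This is only a sketch: `ToyToken` and `toy_tuples` are hypothetical stand-ins, and the annotations for "Dog barks towards cats." are written by hand rather than produced by spaCy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy Token with hand-written annotations."""
    lemma: str
    pos: str
    dep: str
    children: list = field(default_factory=list)


def toy_tuples(tokens):
    """Collect (verb, prep, noun) lemmas: VERB -> 'prep' child -> NOUN grandchild."""
    result = []
    for token in tokens:
        if token.pos != "VERB":
            continue
        for child in token.children:
            if child.dep != "prep":
                continue
            for grandchild in child.children:
                if grandchild.pos == "NOUN":
                    result.append((token.lemma, child.lemma, grandchild.lemma))
    return result


# Assumed parse of "Dog barks towards cats." (annotations written by hand)
cat = ToyToken("cat", "NOUN", "pobj")
towards = ToyToken("towards", "ADP", "prep", [cat])
dog = ToyToken("dog", "NOUN", "nsubj")
bark = ToyToken("bark", "VERB", "ROOT", [dog, towards])

toy_result = toy_tuples([dog, bark, towards, cat])
print(toy_result)
```

The walk follows dependency arcs, so the preposition and the noun do not have to be adjacent to the verb in the sentence.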


<font color="red">**[ TODO ]**</font>  Please identify all VERB-PREP-NOUN grammar structures in data_doc by traversing the dependency trees, and save the results in a list of tuples named dep_gp (dependency grammar pattern).

In [20]:
import itertools

lists = [get_tuples(sent) for sent in data_doc]
dep_gp = list(itertools.chain.from_iterable(lists))
print(len(dep_gp))

7654
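
`get_tuples` returns one list per sentence, so `lists` is a list of lists, and `itertools.chain.from_iterable` concatenates them into a single flat sequence. A small illustration follows; the grouping of these sample tuples into sentences is assumed for the sake of the example.

```python
import itertools

# One list of tuples per sentence; an empty list means no pattern was found
per_sentence = [
    [("supplement", "with", "instrument")],
    [],
    [("use", "at", "time"), ("use", "in", "case")],
]

# Concatenate the per-sentence lists into one flat list of tuples
flat = list(itertools.chain.from_iterable(per_sentence))
print(flat)
print(len(flat))
```

Sentences without a match contribute nothing, so the final length counts patterns, not sentences.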


In [21]:
dep_gp[:5]

[('supplement', 'with', 'instrument'),
 ('use', 'at', 'time'),
 ('use', 'in', 'case'),
 ('contribute', 'to', 'achievement'),
 ('coordinate', 'in', 'order')]

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in dep_gp with the verb "charge".


In [22]:
# Filter by verb lemma; this cell uses "make" (the task above names "charge")
[tpl for tpl in dep_gp if tpl[0] == "make"]

[('make', 'in', 'order'),
 ('make', 'on', 'side'),
 ('make', 'for', 'purpose'),
 ('make', 'in', 'order'),
 ('make', 'between', 'quality'),
 ('make', 'unlike', 'dagongmei'),
 ('make', 'for', 'distance'),
 ('make', 'in', 'word'),
 ('make', 'in', 'transmission'),
 ('make', 'at', 'conjunction'),
 ('make', 'in', 'fashion'),
 ('make', 'give', 'synthesis'),
 ('make', 'of', 'concentration'),
 ('make', 'regard', 'outcome'),
 ('make', 'into', 'brand'),
 ('make', 'in', 'perspective'),
 ('make', 'as', 'attempt'),
 ('make', 'for', 'microphone'),
 ('make', 'with', 'librarian'),
 ('make', 'on', 'printer'),
 ('make', 'with', 'performance'),
 ('make', 'from', 'perspective'),
 ('make', 'to', 'low'),
 ('make', 'in', 'patient'),
 ('make', 'in', 'time'),
 ('make', 'among', 'measurement'),
 ('make', 'as', 'way'),
 ('make', 'onboard', 'spacecraft'),
 ('make', 'for', 'interaction'),
 ('make', 'of', 'segment'),
 ('make', 'on', 'date'),
 ('make', 'at', 'pupil'),
 ('make', 'on', 'thrust'),
 ('make', 'over', 'ran

### Rule Based Methods 
We can also build custom rules for spaCy to match patterns.  
[Documentation](https://spacy.io/api/matcher)

In [23]:
# Example text
text = """I visited Manali last time . Around same budget trips ? I was visiting Ladakh this summer . I have planned visiting New York and other abroad places for next year. Have you ever visited Kodaikanal? """
text = re.sub('\n', '', text)
match_doc = nlp(text)

In [24]:
from spacy.matcher import Matcher 

# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a form of "visit" + place
my_pattern = [{"LEMMA": "visit"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("Visiting_places", [my_pattern])
matches = matcher(match_doc)

# Count the number of matches
print("matches found: ", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
  print("Match found:", match_doc[start:end].text)

matches found:  4
Match found: visited Manali
Match found: visiting Ladakh
Match found: visiting New York
Match found: visited Kodaikanal
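
A token pattern is a list of attribute constraints, one per consecutive token. The sketch below is a simplified model of that idea in plain Python, not spaCy's implementation: `match_pattern` is a hypothetical helper, it ignores features such as the `OP` quantifier key, and the token annotations are written by hand.

```python
def match_pattern(tokens, pattern):
    """Return (start, end) index pairs where consecutive tokens satisfy the pattern."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every constraint dict must agree with the token at the same offset
        if all(
            all(tok.get(attr) == value for attr, value in constraint.items())
            for tok, constraint in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches


# Hand-annotated tokens for "I visited Manali last time" (annotations assumed)
toy_tokens = [
    {"TEXT": "I", "LEMMA": "I", "POS": "PRON"},
    {"TEXT": "visited", "LEMMA": "visit", "POS": "VERB"},
    {"TEXT": "Manali", "LEMMA": "Manali", "POS": "PROPN"},
    {"TEXT": "last", "LEMMA": "last", "POS": "ADJ"},
    {"TEXT": "time", "LEMMA": "time", "POS": "NOUN"},
]
toy_pattern = [{"LEMMA": "visit"}, {"POS": "PROPN"}]

toy_matches = match_pattern(toy_tokens, toy_pattern)
for start, end in toy_matches:
    print(" ".join(tok["TEXT"] for tok in toy_tokens[start:end]))
```

Because the window slides over adjacent tokens only, a pattern of this kind cannot skip intervening words.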


<font color="red">**[ TODO ]**</font> Please identify all VERB-PREP-NOUN grammar structures in data_doc by applying a matcher rule, and store the results in a list of tuples named rule_gp. 


In [25]:
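from spacy.tokens import Doc

# Concatenate all sentence Docs into one Doc so the matcher runs once
doc = Doc.from_docs(data_doc)
matcher = Matcher(nlp.vocab)
# Three adjacent tokens: a verb, a token with dependency label "prep", a noun
vpn_pattern = [{"POS": "VERB"}, {"DEP": "prep"}, {"POS": "NOUN"}]
matcher.add("verb_prep_noun", [vpn_pattern])
matches = matcher(doc)
# Each match spans exactly three tokens starting at `start`
rule_gp = [(doc[start].lemma_, doc[start + 1].lemma_, doc[start + 2].lemma_) for match_id, start, end in matches]
print("matches found: ", len(rule_gp))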

matches found:  949


In [26]:
rule_gp[:5]

[('supplement', 'with', 'instrument'),
 ('make', 'in', 'order'),
 ('result', 'in', 'infertility'),
 ('lead', 'to', 'damage'),
 ('take', 'into', 'account')]
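
The matcher rule requires three adjacent tokens, while the tree traversal follows dependency arcs at any distance, so the two result lists can differ. As a rough sketch of how they might be compared, the snippet below uses set operations on the first five tuples of each list, copied from the outputs in this notebook. It says nothing about the full lists.

```python
# First five tuples of each list, copied from the outputs in this notebook
dep_sample = [
    ("supplement", "with", "instrument"),
    ("use", "at", "time"),
    ("use", "in", "case"),
    ("contribute", "to", "achievement"),
    ("coordinate", "in", "order"),
]
rule_sample = [
    ("supplement", "with", "instrument"),
    ("make", "in", "order"),
    ("result", "in", "infertility"),
    ("lead", "to", "damage"),
    ("take", "into", "account"),
]

shared = set(dep_sample) & set(rule_sample)     # found by both methods
only_dep = set(dep_sample) - set(rule_sample)   # found only by tree traversal
only_rule = set(rule_sample) - set(dep_sample)  # found only by the matcher rule

print(sorted(shared))
print(len(only_dep), len(only_rule))
```

Applied to the full lists, the same operations would read `set(dep_gp) & set(rule_gp)` and so on.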

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in rule_gp with the verb "charge".


In [27]:
# Filter by verb lemma; this cell uses "make" (the task above names "charge")
[tpl for tpl in rule_gp if tpl[0] == "make"]

[('make', 'in', 'order'),
 ('make', 'of', 'particle'),
 ('make', 'among', 'org'),
 ('make', 'on', 'date'),
 ('make', 'as', 'athlete'),
 ('make', 'on', 'development')]

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve a demo time.  
The score is given only after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to elearn. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.