## Installation 

### spaCy

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy

# !conda install -c conda-forge spacy   # alternative: install with conda

# Or clone its repository https://github.com/explosion/spaCy

## Get Started

Dataset 1: CoNLL04
---
CoNLL04 is an entity and relation extraction dataset with four entity types (Location, Organization, Person, and Other) and five relation types (Kill, Livein, Locatedin, OrgBasedin, and Workfor). It is available at:
http://lavis.cs.hs-rm.de/storage/spert/public/datasets/conll04/

Entity Type:

- Location: Loc (gt); GPE & LOC (spacy)
- Organization: Org (gt); ORG (spacy)
- Person: Peop (gt); PERSON (spacy)
- Other: Other (gt); All other tags (spacy)
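
To compare spaCy predictions with the ground truth (gt), the label correspondence listed above can be written as a lookup table. This is a minimal sketch; the names `SPACY_TO_CONLL04` and `to_conll04_label` are hypothetical helpers, not part of spaCy or the dataset.

```python
# Hypothetical lookup table built from the entity type list above.
SPACY_TO_CONLL04 = {
    "GPE": "Loc",
    "LOC": "Loc",
    "ORG": "Org",
    "PERSON": "Peop",
}


def to_conll04_label(spacy_label):
    """Map a spaCy NER label to a CoNLL04 type; unmapped labels become 'Other'."""
    return SPACY_TO_CONLL04.get(spacy_label, "Other")


for label in ["PERSON", "GPE", "ORG", "DATE"]:
    print(label, "->", to_conll04_label(label))
```

Any spaCy label that is not in the table (for example `DATE`) falls back to `Other`, matching the last item of the list.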


### Get an overview of the dataset

In [1]:
import json
import random

# json and random are part of the Python standard library, so no pip install is needed

with open("conll04.json", "r") as file:
    data = json.load(file)
print(data[0])  # Conll04 dev data

example = data[0]
# Pair each entity's type with its text; 'end' is exclusive, so a slice covers the span
entity = [
    (line['type'], " ".join(example['tokens'][line['start']:line['end']]))
    for line in example['entities']
]
print("Entity: ", entity)

# 'head' and 'tail' are indices into the entity list built above
relation = [f"{entity[rline['head']]} -> {rline['type']} -> {entity[rline['tail']]}" for rline in example['relations']]
print("Relation: ", relation)


{'tokens': ['John', 'Wilkes', 'Booth', ',', 'who', 'assassinated', 'President', 'Lincoln', ',', 'was', 'an', 'actor', '.'], 'entities': [{'type': 'Peop', 'start': 0, 'end': 3}, {'type': 'Peop', 'start': 6, 'end': 8}], 'relations': [{'type': 'Kill', 'head': 0, 'tail': 1}], 'orig_id': 5178}
Entity:  [('Peop', 'John Wilkes Booth'), ('Peop', 'President Lincoln')]
Relation:  ["('Peop', 'John Wilkes Booth') -> Kill -> ('Peop', 'President Lincoln')"]
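
Another way to get an overview is to count how often each entity and relation type occurs. The sketch below assumes every record has the same structure as the one printed above; the record is copied inline so the code runs without `conll04.json`, and `count_types` is a hypothetical helper name.

```python
from collections import Counter

# The record printed above, copied inline (normally you would pass `data` instead).
sample = [{
    "tokens": ["John", "Wilkes", "Booth", ",", "who", "assassinated",
               "President", "Lincoln", ",", "was", "an", "actor", "."],
    "entities": [{"type": "Peop", "start": 0, "end": 3},
                 {"type": "Peop", "start": 6, "end": 8}],
    "relations": [{"type": "Kill", "head": 0, "tail": 1}],
    "orig_id": 5178,
}]


def count_types(records):
    """Count entity types and relation types over a list of records."""
    entity_counts = Counter()
    relation_counts = Counter()
    for record in records:
        entity_counts.update(e["type"] for e in record["entities"])
        relation_counts.update(r["type"] for r in record["relations"])
    return entity_counts, relation_counts


entity_counts, relation_counts = count_types(sample)
print("Entity types:  ", dict(entity_counts))
print("Relation types:", dict(relation_counts))
```

Calling `count_types(data)` on the loaded file should give the distribution over the whole split.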


### Q1: Find the first five organization (Org) entities in the CoNLL04 dev data.

### How to use spaCy to extract entities and relations?

**Step 1: Download a pretrained spaCy model**

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf
# !pip install spacy-transformers
# Details of all available models can be found at https://spacy.io/usage/models.

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("He works at Google.")
spacy.displacy.render(doc, style='ent', jupyter=True)

In [3]:
spacy.displacy.render(doc, style='dep', jupyter=True)

In [4]:
text = "The Princess Diaries Volume VIII: Princess on the Brink released in the United Kingdom as The Princess Diaries: After Eight is a young adult book in the critically acclaimed Princess Diaries series."
doc = nlp(text)
spacy.displacy.render(doc, style="ent", jupyter=True)

In [5]:
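# Load the transformer pipeline. Importing the en_core_web_trf package and
# calling en_core_web_trf.load() is an equivalent alternative, so one call is enough.
nlp = spacy.load("en_core_web_trf")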
text = "The Princess Diaries Volume VIII: Princess on the Brink released in the United Kingdom as The Princess Diaries: After Eight is a young adult book in the critically acclaimed Princess Diaries series."
doc = nlp(text)
spacy.displacy.render(doc, style="ent", jupyter=True)

In [6]:
# Span lets us build labeled spans by hand; displaCy's "span" style can highlight overlapping spans
from spacy.tokens import Span
text = "Welcome to the Bank of China."

nlp = spacy.blank("en")
doc = nlp(text)

doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
]

spacy.displacy.render(doc, style="span", jupyter=True)

*We will discuss how to formally use metrics to evaluate a model's performance on IE tasks later today or tomorrow.*

**Step 2: Load the pretrained spaCy model**

In [7]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Step 3: Use spaCy to extract entities from the raw text**

In [8]:
doc = " ".join(example['tokens'])
doc = nlp(doc)
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)

PERSON  |  John Wilkes Booth
PERSON  |  Lincoln
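
Note that spaCy returns `Lincoln` while the ground truth span is `President Lincoln`. As a rough illustration of what that difference costs, the sketch below computes exact-match precision and recall for this one sentence, with spaCy's `PERSON` rewritten as `Peop` following the entity type list earlier. The function `precision_recall` is a hypothetical helper, and exact matching on (type, text) pairs is only one possible scoring choice, not necessarily the metric used later in the course.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (type, text) pairs."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Ground truth from the dataset record and spaCy's output shown above.
gold = [("Peop", "John Wilkes Booth"), ("Peop", "President Lincoln")]
predicted = [("Peop", "John Wilkes Booth"), ("Peop", "Lincoln")]

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under exact matching, the partially correct `Lincoln` span counts as both a false positive and a missed entity.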


### Q2: Find the first five organization (ORG) entities in the CoNLL04 data using spaCy.

**Step 4: Use spaCy to extract relations from the raw text**

Way 1: Pattern match
===

In [9]:
for tok in doc:
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

John --> compound --> PROPN
Wilkes --> compound --> PROPN
Booth --> nsubj --> PROPN
, --> punct --> PUNCT
who --> nsubj --> PRON
assassinated --> relcl --> VERB
President --> compound --> PROPN
Lincoln --> dobj --> PROPN
, --> punct --> PUNCT
was --> ROOT --> AUX
an --> det --> DET
actor --> attr --> NOUN
. --> punct --> PUNCT


In [10]:
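from spacy.matcher import Matcher

# A pattern is a list of per-token constraints. POS is the coarse-grained
# part-of-speech tag (such as 'NOUN'), so this matches pronoun + verb + proper noun.
pattern = [{'POS': 'PRON'},
           {'POS': 'VERB'},
           {'POS': 'PROPN'}]

# Create a Matcher and register the pattern under the name "matching_1"
matcher = Matcher(nlp.vocab)
matcher.add("matching_1", [pattern])

# Each match is a (match_id, start, end) tuple of token indices; keep the first one
matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]

print(span.text)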

who assassinated President
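
The match above stops at `President` because the pattern allows exactly one proper noun, and `matches[0]` would raise an `IndexError` if nothing matched. The sketch below is one possible variation: it lets the last token repeat with `"OP": "+"`, keeps only the longest overlapping match with `greedy="LONGEST"`, and loops over all matches. To stay runnable without a downloaded model, it builds the `Doc` by hand from the tokens and POS tags printed earlier; the pattern name `pron_verb_propn` is arbitrary.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: only the vocabulary is needed, no pretrained model.
nlp = spacy.blank("en")

# Tokens and coarse POS tags copied from the output printed earlier.
words = ["John", "Wilkes", "Booth", ",", "who", "assassinated",
         "President", "Lincoln", ",", "was", "an", "actor", "."]
pos = ["PROPN", "PROPN", "PROPN", "PUNCT", "PRON", "VERB",
       "PROPN", "PROPN", "PUNCT", "AUX", "DET", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# Pronoun + verb + one or more proper nouns.
pattern = [{"POS": "PRON"}, {"POS": "VERB"}, {"POS": "PROPN", "OP": "+"}]
# greedy="LONGEST" drops shorter matches that overlap a longer one.
matcher.add("pron_verb_propn", [pattern], greedy="LONGEST")

# Iterating handles zero, one, or many matches without indexing errors.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

With a real pipeline, `doc = nlp(text)` would replace the hand-built `Doc`; the matcher code stays the same.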


### Q3: Create a pattern that extracts a 'such as' relationship from text.
---
Pattern: X such as Y

In [None]:
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 

# create a spaCy object 
doc = nlp(text)
for tok in doc: 
    print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

In [None]:
matcher = Matcher(nlp.vocab) 
pattern = []  # TODO: fill in the token pattern for "X such as Y"
matcher.add("matching_1", [pattern]) 

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print(span.text)

Way 2: Subtree Matching
===

In [11]:
from spacy import displacy 
doc = " ".join(example['tokens'])
doc = nlp(doc)
displacy.render(doc, style='dep', jupyter=True)

In [12]:
def subtree_matcher(doc):
    x = ''
    y = ''

    # iterate through all the tokens in the input sentence
    for tok in doc:
        # extract object (e.g. dobj, pobj); a later object overwrites an earlier one
        if tok.dep_.endswith("obj"):
            y = tok.text

        # extract subject (e.g. nsubj, csubj); a later subject overwrites an earlier one
        if tok.dep_.endswith("subj"):
            x = tok.text

    return x, y

In [13]:
subtree_matcher(doc)

('who', 'Lincoln')
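
The result shows two limits of this approach: only the last subject is kept (`who` rather than `Booth`), and each side is a single token. One possible refinement is sketched below: it collects every subject and object and prepends their `compound` children. The `Doc` is built by hand so the code runs without a model; the dependency labels come from the output printed earlier, while the `heads` indices are an assumption consistent with those labels. The helper names are hypothetical.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["John", "Wilkes", "Booth", ",", "who", "assassinated",
         "President", "Lincoln", ",", "was", "an", "actor", "."]
# Dependency labels as printed earlier.
deps = ["compound", "compound", "nsubj", "punct", "nsubj", "relcl",
        "compound", "dobj", "punct", "ROOT", "det", "attr", "punct"]
# Assumed head index for each token (the root points to itself).
heads = [2, 2, 9, 2, 5, 2, 7, 5, 2, 9, 11, 9, 9]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def with_compounds(token):
    """Join a token with the compound modifiers on its left."""
    parts = [child.text for child in token.lefts if child.dep_ == "compound"]
    return " ".join(parts + [token.text])


def subject_object_phrases(doc):
    """Return all subject phrases and all object phrases in the doc."""
    subjects = [with_compounds(t) for t in doc if t.dep_.endswith("subj")]
    objects = [with_compounds(t) for t in doc if t.dep_.endswith("obj")]
    return subjects, objects


subjects, objects = subject_object_phrases(doc)
print("Subjects:", subjects)
print("Objects: ", objects)
```

Linking the relative pronoun `who` back to `John Wilkes Booth` would still require an extra step, which this sketch does not attempt.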

Way 3: The textacy package (NLP, before and after spaCy)
===

In [None]:
!pip install textacy
# !pip install --upgrade tbb
# !pip install more-itertools
# !pip install daal==2021.4.0

In [14]:
import textacy
import re

In [15]:
# Each triple is a (subject, verb, object) tuple whose items are lists of tokens
tri = textacy.extract.triples.subject_verb_object_triples(doc)
for subject, verb, obj in tri:
    print(" ".join(tok.text for tok in subject), "->",
          " ".join(tok.text for tok in verb), "->",
          " ".join(tok.text for tok in obj))

who -> assassinated -> President Lincoln
