## Installation 

### spaCy

In [1]:
!pip install -U pip setuptools wheel
!pip install -U spacy

# !conda install -c conda-forge spacy   #### By Conda

# Or clone its repository https://github.com/explosion/spaCy

Collecting pip
  Using cached pip-23.1.2-py3-none-any.whl (2.1 MB)
Collecting setuptools
  Using cached setuptools-67.8.0-py3-none-any.whl (1.1 MB)
Collecting wheel
  Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.37.1
    Uninstalling wheel-0.37.1:
      Successfully uninstalled wheel-0.37.1
  Attempting uninstall: setuptools
    Found existing installation: setuptools 65.6.3
    Uninstalling setuptools-65.6.3:
      Successfully uninstalled setuptools-65.6.3
  Attempting uninstall: pip
    Found existing installation: pip 22.3.1
    Uninstalling pip-22.3.1:
      Successfully uninstalled pip-22.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyq

## Get Started

Dataset 1: Conll04 set
---
Conll04 is an entity and relation extraction dataset with four entity types (Location, Organization, Person, and Other) and five relation types (Kill, Livein, Locatedin, OrgBasedin, and Workfor).
Source: http://lavis.cs.hs-rm.de/storage/spert/public/datasets/conll04/

Entity types, written as ground-truth (gt) label; spaCy label:

- Location: Loc (gt); GPE & LOC (spaCy)
- Organization: Org (gt); ORG (spaCy)
- Person: Peop (gt); PERSON (spaCy)
- Other: Other (gt); all other tags (spaCy)
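
The correspondence above can be written down as a small lookup table, which is handy when comparing spaCy predictions with the ground truth. This is only a sketch: the names `SPACY_TO_GT` and `to_gt_type` are hypothetical helpers introduced here, not part of spaCy or the dataset.

```python
# Assumed mapping from spaCy NER labels to the Conll04 ground-truth types listed above.
SPACY_TO_GT = {
    "GPE": "Loc",
    "LOC": "Loc",
    "ORG": "Org",
    "PERSON": "Peop",
}


def to_gt_type(spacy_label):
    """Map a spaCy entity label to a Conll04 type; unlisted labels become 'Other'."""
    return SPACY_TO_GT.get(spacy_label, "Other")


mapped = [to_gt_type(label) for label in ["PERSON", "GPE", "ORG", "DATE"]]
print(mapped)  # ['Peop', 'Loc', 'Org', 'Other']
```

With a loaded pipeline, `to_gt_type(ent.label_)` could be applied to each entity in `doc.ents`.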


### Get an overview of the dataset

In [2]:
import json
import random

# json and random are part of the Python standard library; no installation is needed

with open("conll04.json", "r") as file:
    data = json.load(file)
print(data[0])  # Conll04 dev data

example = data[0]
entity = [(line['type'], " ".join([example['tokens'][i] for i in range(line['start'], line['end'])])) for line in example['entities']]
print("Entity: ", entity)

relation = [f"{entity[rline['head']]} -> {rline['type']} -> {entity[rline['tail']]}" for rline in example['relations']]
print("Relation: ", relation)


{'tokens': ['John', 'Wilkes', 'Booth', ',', 'who', 'assassinated', 'President', 'Lincoln', ',', 'was', 'an', 'actor', '.'], 'entities': [{'type': 'Peop', 'start': 0, 'end': 3}, {'type': 'Peop', 'start': 6, 'end': 8}], 'relations': [{'type': 'Kill', 'head': 0, 'tail': 1}], 'orig_id': 5178}
Entity:  [('Peop', 'John Wilkes Booth'), ('Peop', 'President Lincoln')]
Relation:  ["('Peop', 'John Wilkes Booth') -> Kill -> ('Peop', 'President Lincoln')"]


### Q1: Please find out the first five organization (Org) entities in conll04 dev data.
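
One possible way to approach this kind of question is a small generator over the records. The sketch below assumes every record has the `tokens` and `entities` fields shown in the overview; `iter_entities` and `first_entities` are hypothetical helper names, and the single-record `sample` simply repeats the example printed earlier.

```python
from itertools import islice


def iter_entities(records, ent_type):
    """Yield the text of every entity of the given type, in document order."""
    for record in records:
        for ent in record["entities"]:
            if ent["type"] == ent_type:
                # 'end' is exclusive, so a plain slice recovers the entity tokens.
                yield " ".join(record["tokens"][ent["start"]:ent["end"]])


def first_entities(records, ent_type, n=5):
    """Return at most n entity strings of the given type."""
    return list(islice(iter_entities(records, ent_type), n))


# Single-record sample copied from the dataset overview.
sample = [{
    "tokens": ["John", "Wilkes", "Booth", ",", "who", "assassinated",
               "President", "Lincoln", ",", "was", "an", "actor", "."],
    "entities": [{"type": "Peop", "start": 0, "end": 3},
                 {"type": "Peop", "start": 6, "end": 8}],
}]

people = first_entities(sample, "Peop")
print(people)  # ['John Wilkes Booth', 'President Lincoln']
```

For Q1 itself, the same helper could be called on the loaded data with the type `'Org'`.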

### How to use spacy to extract entities and relations?

**Step1: Download a pretrained spaCy model**

In [3]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf
# !pip install spacy-transformers
# Details of all available models can be found at https://spacy.io/usage/models.

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.3.0
    Uninstalling en-core-web-sm-3.3.0:
      Successfully uninstalled en-core-web-sm-3.3.0
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-trf==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.5.0/en_core_web_trf-3.5.0-py3-none-any.whl (460.3 MB)

Installing collected packages: spacy-transformers, en-core-web-trf
  Attempting uninstall: spacy-transformers
    Found existing installation: spacy-transformers 1.1.7
    Uninstalling spacy-transformers-1.1.7:
      Successfully uninstalled spacy-transformers-1.1.7
  Attempting uninstall: en-core-web-trf
    Found existing installation: en-core-web-trf 3.3.0
    Uninstalling en-core-web-trf-3.3.0:
      Successfully uninstalled en-core-web-trf-3.3.0
Successfully installed en-core-web-trf-3.5.0 spacy-transformers-1.2.4
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("He works at Google.")
spacy.displacy.render(doc, style='ent', jupyter=True)

In [3]:
spacy.displacy.render(doc, style='dep', jupyter=True)

In [8]:
text = "The Princess Diaries Volume VIII: Princess on the Brink released in the United Kingdom as The Princess Diaries: After Eight is a young adult book in the critically acclaimed Princess Diaries series."
doc = nlp(text)
spacy.displacy.render(doc, style="ent", jupyter=True)

In [9]:
nlp = spacy.load("en_core_web_trf")
# Equivalent alternative: import the model package and load it directly.
# import en_core_web_trf
# nlp = en_core_web_trf.load()
text = "The Princess Diaries Volume VIII: Princess on the Brink released in the United Kingdom as The Princess Diaries: After Eight is a young adult book in the critically acclaimed Princess Diaries series."
doc = nlp(text)
spacy.displacy.render(doc, style="ent", jupyter=True)

In [10]:
from spacy.tokens import Span  # used to define labelled spans, which may overlap
text = "Welcome to the Bank of China."

nlp = spacy.blank("en")
doc = nlp(text)

doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
]

spacy.displacy.render(doc, style="span", jupyter=True)

*We will talk about how to formally evaluate a model's performance on IE tasks with metrics later today or tomorrow.*

**Step2: Load the pretrained spaCy model**

In [33]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Step3: Use spaCy to extract entities from the raw text.**

In [34]:
# Rebuild the raw sentence from the tokens, then run the pipeline on it.
text = " ".join(example['tokens'])
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ' | ', ent.text)

PERSON  |  John Wilkes Booth
PERSON  |  Lincoln


### Q2: Please find out the first five organization (ORG) entities in conll04 data using spaCy.

**Step4: Use spaCy to extract relations from the raw text.**

Way 1: Pattern match
===

In [35]:
for tok in doc: 
    print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

John --> compound --> PROPN
Wilkes --> compound --> PROPN
Booth --> nsubj --> PROPN
, --> punct --> PUNCT
who --> nsubj --> PRON
assassinated --> relcl --> VERB
President --> compound --> PROPN
Lincoln --> dobj --> PROPN
, --> punct --> PUNCT
was --> ROOT --> AUX
an --> det --> DET
actor --> attr --> NOUN
. --> punct --> PUNCT


In [36]:
from spacy.matcher import Matcher

# Create a Matcher that shares the pipeline's vocabulary.
matcher = Matcher(nlp.vocab)

# POS (coarse-grained part-of-speech) provides a general tag, such as 'NOUN'.
# This pattern matches a pronoun, then a verb, then a proper noun.
pattern = [{'POS': 'PRON'},
           {'POS': 'VERB'},
           {'POS': 'PROPN'}]

matcher.add("matching_1", [pattern])

# Each match is a (match_id, start, end) tuple; take the first one.
matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]

print(span.text)

who assassinated President


### Q3: Please try to create a template that can be used to extract a 'Such-as' relationship from text.
---
Pattern: X such as Y

In [37]:
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 

# create a spaCy object 
doc = nlp(text)
for tok in doc: 
    print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

GDP --> nsubj --> NOUN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> ADP
Vietnam --> pobj --> PROPN
will --> aux --> AUX
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT


In [39]:
matcher = Matcher(nlp.vocab) 
pattern = []  # TODO: fill in the token pattern for 'X such as Y' (an empty pattern raises an error)
matcher.add("matching_1", [pattern]) 

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print(span.text)

countries such as Vietnam
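
As a hint rather than a reference answer, here is a sketch of one surface-form variant of the 'X such as Y' template. It assumes a blank English pipeline (tokenizer only, so no POS tags are available) and matches on lowercase text instead; the names `blank_nlp`, `such_as_matcher`, and `such_as_doc` are introduced only for this illustration. It also loops over all matches instead of indexing `matches[0]`, which would raise an `IndexError` when nothing matches.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, so the pattern relies on token text, not POS tags.
blank_nlp = spacy.blank("en")
such_as_matcher = Matcher(blank_nlp.vocab)

# One alphabetic token, the literal words "such as", then one alphabetic token.
such_as_pattern = [
    {"IS_ALPHA": True},
    {"LOWER": "such"},
    {"LOWER": "as"},
    {"IS_ALPHA": True},
]
such_as_matcher.add("such_as", [such_as_pattern])

such_as_doc = blank_nlp(
    "GDP in developing countries such as Vietnam will continue growing at a high rate."
)

# Each match is a (match_id, start, end) tuple; collect the text of every match.
found = [such_as_doc[start:end].text for _, start, end in such_as_matcher(such_as_doc)]
print(found)  # ['countries such as Vietnam']
```

A POS-based version would additionally constrain X and Y to nouns or proper nouns, which needs a trained pipeline such as `en_core_web_sm`.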


Way 2: Subtree Matching
===

In [4]:
from spacy import displacy 
doc = " ".join(example['tokens'])
doc = nlp(doc)
displacy.render(doc, style='dep', jupyter=True)

In [53]:
def subtree_matcher(doc):
    """Return the last subject and the last object found in doc, as (x, y)."""
    x = ''
    y = ''

    # iterate through all the tokens in the input sentence
    for tok in doc:
        # extract object (dependency labels such as dobj, pobj)
        if tok.dep_.endswith("obj"):
            y = tok.text

        # extract subject (dependency labels such as nsubj, csubj)
        if tok.dep_.endswith("subj"):
            x = tok.text

    return x, y

In [54]:
subtree_matcher(doc)

('who', 'Lincoln')
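
A note on testing dependency labels like the ones `subtree_matcher` looks at: `str.find` returns a position (or -1), not a boolean, so comparing its result with `True` only holds when the substring starts at index 1. The sketch below contrasts that with a plain suffix test; `is_object` is a hypothetical helper and the label list is just a handful of dependency labels chosen for illustration.

```python
def is_object(dep):
    """Suffix test: true for 'dobj', 'pobj', 'iobj' and also a bare 'obj'."""
    return dep.endswith("obj")


labels = ["nsubj", "dobj", "pobj", "obj", "relcl"]

suffix_hits = [dep for dep in labels if is_object(dep)]
# find(...) == True is really find(...) == 1, so the bare 'obj' (index 0) is missed.
find_hits = [dep for dep in labels if dep.find("obj") == True]

print(suffix_hits)  # ['dobj', 'pobj', 'obj']
print(find_hits)    # ['dobj', 'pobj']
```

Note also that the function keeps only the last subject and the last object it sees, which is why the relative pronoun "who" is returned rather than "Booth".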

Way 3: The textacy package (NLP, before and after spaCy)
===

In [13]:
!pip install textacy
# !pip install --upgrade tbb
# !pip install more-itertools
# !pip install daal==2021.4.0

Collecting tbb
  Using cached tbb-2021.9.0-py2.py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.whl (1.0 MB)
Installing collected packages: tbb
Successfully installed tbb-2021.9.0
Collecting daal==2021.4.0
  Using cached daal-2021.4.0-py2.py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.whl (189.9 MB)
Installing collected packages: daal
Successfully installed daal-2021.4.0


In [56]:
import textacy
import re

In [57]:
triples = textacy.extract.triples.subject_verb_object_triples(doc)
# Each triple holds three lists of tokens: subject, verb, object.
for subj, verb, obj in triples:
    print(" ".join(str(tok) for tok in subj), "->",
          " ".join(str(tok) for tok in verb), "->",
          " ".join(str(tok) for tok in obj))

who -> assassinated -> President Lincoln
