# Relation extraction
## Table of contents<a name=contents></a>
1. [Packages](#packages)
2. [Text](#text)
3. [NLP pipes](#pipes)
4. [Dependency trees](#trees)

## 1. Packages <a name=packages></a>

In [94]:
import spacy
from spacy import displacy


from spacy.tokens import Token
from spacy import Language

from collections import deque

from nltk.corpus import reuters

import regex as re

## 2. Text <a name=text></a>

In [2]:
nlp = spacy.load('en_core_web_lg')

In [3]:
text = "Fujitsu, a competitor of NEC, acquired Fairchild Corp."
doc = nlp(text)

In [4]:
displacy.render(doc, style="dep",options={'compact': False, 'distance': 100})

Back to the [table of contents](#contents).

## 3. NLP pipes <a name=pipes></a>

In [31]:
Token.set_extension('ref_n', default='', force = True)
Token.set_extension('ref_t', default='', force = True)

@Language.component("init_coref")
def init_coref(doc):
    for e in doc.ents:
        if e.label_ in ['ORG', 'GOV', 'PERSON']:
            e[0]._.ref_n, e[0]._.ref_t = e.text, e.label_
    return doc
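
To see what `init_coref` records, here is a minimal sketch that does not need spaCy. The token list and the entity spans are hand-written assumptions standing in for `doc` and `doc.ents`, not parser output. Only the first token of each matching entity receives the name and the type, which is what the later search relies on when it compares `ref_t`.

```python
# Hand-written stand-ins for doc and doc.ents (assumption, not parser output).
tokens = ["Fujitsu", ",", "a", "competitor", "of", "NEC", ",",
          "acquired", "Fairchild", "Corp."]
# (start, end, label): end is exclusive, as in spaCy spans.
entities = [(0, 1, "ORG"), (5, 6, "ORG"), (8, 10, "ORG")]

# One (ref_n, ref_t) pair per token, empty by default like the extensions.
refs = [("", "") for _ in tokens]

for start, end, label in entities:
    if label in ("ORG", "GOV", "PERSON"):
        # Only the first token of the entity carries the annotation.
        refs[start] = (" ".join(tokens[start:end]), label)

for token, (ref_n, ref_t) in zip(tokens, refs):
    if ref_t:
        print(f"{token!r} -> name={ref_n!r}, type={ref_t}")
```

In this sketch the token `Corp.` stays unmarked even though it belongs to an entity, so any code that lands on it has to walk back to the first token of the entity to find the name.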

In [32]:
def reset_pipeline(nlp, pipes):
    # remove all custom pipes
    custom_pipes = [pipe for (pipe, _) in nlp.pipeline
                    if pipe not in ['tagger', 'parser', 'ner',
                                    'tok2vec', 'attribute_ruler', 'lemmatizer']]
    for pipe in custom_pipes:
        _ = nlp.remove_pipe(pipe)
    # re-add specified pipes
    for pipe in pipes:
        if 'neuralcoref' == pipe or 'neuralcoref' in str(pipe.__class__):
            nlp.add_pipe(pipe, name='neural_coref')
        else:
            nlp.add_pipe(pipe)

    print(f"Model: {nlp.meta['name']}, Language: {nlp.meta['lang']}")
    print(*nlp.pipeline, sep='\n')
    
reset_pipeline(nlp, ['init_coref'])

Model: core_web_lg, Language: en
('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f13c8420f30>)
('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f13c8420ec0>)
('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f13c815cc50>)
('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f13c81885f0>)
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f13c81722d0>)
('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f13c815ce50>)
('init_coref', <function init_coref at 0x7f13c17cd4d0>)


Back to the [table of contents](#contents).

## 4. Dependency trees <a name=trees></a>

Dependency trees make it straightforward to locate a verb (active or passive) together with its subject and its object.
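
As a rough illustration of that idea, the sketch below builds a tiny tree by hand and reads the subject and object directly off the verb. The `Node` class, the `direct_child` helper and the dependency labels are assumptions made for this example; they are not produced by the parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    dep: str
    children: list = field(default_factory=list)


# Hand-labelled tree (assumption) for:
# "Fujitsu, a competitor of NEC, acquired Fairchild Corp."
verb = Node("acquired", "ROOT", [
    Node("Fujitsu", "nsubj", [
        Node("competitor", "appos", [
            Node("of", "prep", [Node("NEC", "pobj")]),
        ]),
    ]),
    Node("Corp.", "dobj", [Node("Fairchild", "compound")]),
])


def direct_child(node, dep):
    """Return the text of the first direct child with the given label."""
    return next((c.text for c in node.children if c.dep == dep), None)


subject = direct_child(verb, "nsubj")
obj = direct_child(verb, "dobj")
print(subject, verb.text, obj)
```

Although "NEC" is closer to the verb in the sentence, it sits below "Fujitsu" in this tree, so it is never considered as the subject of "acquired".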

In [161]:
# We search for the shortest path from the subject through
# the predicate (verb) to the object.
# Subject and object are organizations in our examples.

# Here are the four helper functions omitted in the book:
# - bfs: breadth-first search for the closest subject/object
# - is_passive: checks if a noun or verb is in passive form
# - find_subj: searches the left part of the tree for the subject
# - find_obj: searches the right part of the tree for the object

def bfs(root, ent_type: str, deps: list, first_dep_only=False):
    """Return the first token in root's subtree (root included) that matches
    ent_type and the dependency list, using breadth-first search.

    : root: token where the search starts (a left or right child of the predicate)
    : ent_type: entity type to match against the token's ref_t, e.g. "ORG"
    : deps: accepted dependency labels, e.g. ['nsubjpass', 'nsubj:pass']
    : first_dep_only: if True, stop after the first dependency match
                      (used for subject search - do not "jump" over subjects)
    : return: the matching token, or None
    """
    # deque to ease the access to the list
    to_visit = deque([root]) # queue for bfs

    while len(to_visit) > 0:
        # the left element of the queue is given to child and deleted from the queue
        child = to_visit.popleft()
        print("child", child, child.dep_)
        # check if the dependency of the token was one of those provided
        if child.dep_ in deps:
            # check if the label/entity type is the same as the one provided
            if child._.ref_t == ent_type:
                return child
            #else:
            #    for 
            # explore what to do if we keep looking after the first dependency match?
            # quid for a subject with an "and"???
            elif first_dep_only: # first match (subjects)
                return None
        # compound branch: a token inside a multi-word entity, e.g. "Borg" in
        # "Borg Warner Corp". The custom pipe marks only the first token of an
        # entity, so that token is returned rather than the head noun.
        # Caution: `and` binds tighter than `or`, so the condition groups as
        # (compound and head matches) or (grandparent matches and entity type matches);
        # the entity-type check applies to the second alternative only.
        elif (child.dep_ == 'compound' and child.head.dep_ in deps) or \
             (child.head.head.dep_ in deps and child._.ref_t == ent_type):
            return child
        to_visit.extend(list(child.children))
    return None

def is_passive(token):
    if token.dep_.endswith('pass'): # noun
        return True
    for left in token.lefts: # verb
        if left.dep_ == 'auxpass':
            return True
    return False

def find_subj(pred, ent_type: str, passive: bool):
    """
    Find closest subject in predicates left subtree or
    predicates parent's left subtree (recursive).
    Has a filter on organizations.
    : pred: token containing a verb
    : ent_type: specifies entity type (for now always called for "ORG")
    : passive: specifies if the verb is in the passive form
    : return: pred's subject
    """
    ## To modify to make it work for different kind of entities

    # start with the leftmost child of the predicate
    for left in pred.lefts:
        if passive: # if pred is passive, search for passive subject
            subj = bfs(left, ent_type, ['nsubjpass', 'nsubj:pass'], True)
        else:
            subj = bfs(left, ent_type, ['nsubj'], True)
        if subj is not None: # found it!
            return subj
    
    # if the subject is not on the left tree of the predicate,
    # the predicate's head could be another verb with the same subject
    # example: Apple is looking at buying a startup
    if pred.head != pred and not is_passive(pred): # why not just "passive" instead of is_passive(pred)?
        return find_subj(pred.head, ent_type, passive) # climb up left subtree
    else:
        return None

def find_obj(pred, ent_type, excl_prepos):
    """
    Find closest object in predicates right subtree.
    Skip prepositional objects if the preposition is in exclude list.
    Has a filter on organizations.
    : pred: token containing a verb
    : ent_type: specifies entity type (for now always called for "ORG")
    : excl_prepos: excluded prepositions
    : return: object of the predicate
    """
    
    ## To modify to make it work for different kind of entities
        
    # looks into every related token on the right of the predicate
    # until it finds an object filling the conditions
    for right in pred.rights:
        print("right: ",right)
        obj = bfs(right, ent_type, ['dobj', 'pobj', 'iobj', 'obj', 'obl'])
        # if an object is found,
        # check that its preposition is not excluded
        if obj is not None:
            if obj.dep_ == 'pobj' and obj.head.lemma_.lower() in excl_prepos: # check preposition
                continue
            return obj
    return None

def extract_rel_dep(doc, pred_name: str, pred_synonyms: list, excl_prepos=()):
    """
    Method extracting relationship(s) (may be plural!)
    It only returns triplets!
    : doc: text to analyze
    : pred_name: predicate
    : pred_synonyms: predicate's synonyms
    : excl_prepos: prepositions which can not precede the object chosen
    : return: triplet(s) with the subject and its entity type,
              the predicate and the object and its entity type
    """
    for token in doc:
        #print(token, token.pos_, token.lemma_)
        # looks for a verb equivalent to the predicate referred to
        if token.pos_ == 'VERB' and token.lemma_ in pred_synonyms:
            print("found token: ",token)
            # saves that verb as a predicate (readability)
            # looks if it is passive
            # and then searches for the subject of the verb
            pred = token
            passive = is_passive(pred)
            print("passive: ",passive)
            subj = find_subj(pred, 'ORG', passive)
            print("subject: ",subj)
            # if the subject is found, it looks for the object
            if subj is not None:
                obj = find_obj(pred, 'MONEY', excl_prepos)
                print("object: ",obj)
                if obj is not None:
                    # if there is a subject and an object,
                    # it sets the triplet in the following order:
                    # active subject, verb in active form, passive subject
                    if passive: # switch roles
                        obj, subj = subj, obj
                    yield ((subj._.ref_n, subj._.ref_t), pred_name, 
                           (obj._.ref_n, obj._.ref_t))
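
The effect of `first_dep_only` is easier to see on a toy tree. The sketch below is a simplified stand-in for `bfs`: it drops the compound branch and uses a plain `ent_type` attribute instead of `_.ref_t`. The `Node` class, the function name `bfs_toy` and the tree itself are hypothetical and only serve the illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    dep: str
    ent_type: str = ""
    children: list = field(default_factory=list)


def bfs_toy(root, ent_type, deps, first_dep_only=False):
    """Simplified bfs: no compound handling, plain attributes instead of ref_t."""
    to_visit = deque([root])
    while to_visit:
        node = to_visit.popleft()
        if node.dep in deps:
            if node.ent_type == ent_type:
                return node
            if first_dep_only:
                return None  # the first label match decides the outcome
        to_visit.extend(node.children)
    return None


# Hypothetical subtree: a non-ORG subject with an ORG subject nested below it.
nested = Node("GAF", "nsubj", "ORG")
tree = Node("outlook", "nsubj", "", [Node("for", "prep", "", [nested])])

strict = bfs_toy(tree, "ORG", ["nsubj"], first_dep_only=True)
relaxed = bfs_toy(tree, "ORG", ["nsubj"])
print(strict, relaxed.text)
```

With `first_dep_only=True` the search gives up at "outlook", the first token carrying a subject label, instead of jumping over it to a deeper organization. Without the flag it keeps descending and returns the nested node.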

In [34]:
doc = nlp("sells")
for token in doc:
    print(type(token),token.pos_,token.tag_)
    break

<class 'spacy.tokens.token.Token'> VERB VBZ


In [35]:
help(spacy.tokens.token.Token)

Help on class Token in module spacy.tokens.token:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
 |  DOCS: https://spacy.io/api/token
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |      Token.__bytes__(self)
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(...)
 |      The number of unicode characters in the token, i.e. `token.text`.
 |      
 |      RETURNS (int): The number of unicode characters in the token.
 |      
 |      DOCS: https://spacy.io/api/token#len
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __reduce__(...)
 |      Token.__reduce__(self)
 |  
 |  __repr_

In [36]:
doc = nlp("I like New York in Autumn.")
lefts = [t.text for t in doc[1].lefts]
assert lefts == ["I"]

In [37]:
lefts

['I']

In [38]:
for e in doc.ents:
    print(e, e.root, e.root.head)

New York York like
Autumn Autumn in


In [39]:
for t in doc:
    print("\n",t, t.head)
    for child in t.children:
        print(child)
    for left in t.lefts:
        print(left)


 I like

 like like
I
York
.
I

 New York

 York like
New
in
New

 in York
Autumn

 Autumn in

 . like


In [40]:
doc_pass = nlp("I have been married to my wife for 30 years")

In [41]:
for t in doc_pass:
    print(t,t.dep_)

I nsubj
have aux
been ROOT
married acomp
to prep
my poss
wife pobj
for prep
30 nummod
years pobj
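
No label in this parse ends in `pass`, so `is_passive` as defined above returns `False` for every token of the sentence: "married" is attached as an adjectival complement (`acomp`) rather than as a passive participle. A small check on the labels copied from the output, without running the parser again:

```python
# (token, dependency label) pairs copied from the output above.
parsed = [("I", "nsubj"), ("have", "aux"), ("been", "ROOT"),
          ("married", "acomp"), ("to", "prep"), ("my", "poss"),
          ("wife", "pobj"), ("for", "prep"), ("30", "nummod"),
          ("years", "pobj")]

# is_passive looks for a label ending in "pass" on the token itself
# (e.g. nsubjpass) or for an auxpass child to its left.
passive_labels = [(tok, dep) for tok, dep in parsed if dep.endswith("pass")]
print(passive_labels)  # []
```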


In [81]:
doc_2verbs = nlp("Apple is looking at buying Amazon for $1 billion")

In [84]:
for elem in extract_rel_dep(doc_2verbs, pred_name="buy",pred_synonyms=["buy"]):
    print(elem)

found token:  buying
passive:  False
subject:  Apple
right:  Amazon
object:  Amazon
(('Apple', 'ORG'), 'buy', ('Amazon', 'ORG'))


In [69]:
for t in doc_2verbs:
    print("TOKEN: ",t,"\ndepencies:",t.dep_,"\ntoken's head:", t.head)
    for child in t.children:
        print("children: \n",child)
    for left in t.lefts:
        print("lefts: \n",left)
    for right in t.rights:
        print("rights: \n",right)

TOKEN:  Apple 
depencies: nsubj 
token's head: looking
TOKEN:  is 
depencies: aux 
token's head: looking
TOKEN:  looking 
depencies: ROOT 
token's head: looking
children: 
 Apple
children: 
 is
children: 
 at
lefts: 
 Apple
lefts: 
 is
rights: 
 at
TOKEN:  at 
depencies: prep 
token's head: looking
children: 
 buying
rights: 
 buying
TOKEN:  buying 
depencies: pcomp 
token's head: at
children: 
 startup
children: 
 for
rights: 
 startup
rights: 
 for
TOKEN:  U.K. 
depencies: compound 
token's head: startup
TOKEN:  startup 
depencies: dobj 
token's head: buying
children: 
 U.K.
lefts: 
 U.K.
TOKEN:  for 
depencies: prep 
token's head: buying
children: 
 billion
rights: 
 billion
TOKEN:  $ 
depencies: quantmod 
token's head: billion
TOKEN:  1 
depencies: compound 
token's head: billion
TOKEN:  billion 
depencies: pobj 
token's head: for
children: 
 $
children: 
 1
lefts: 
 $
lefts: 
 1


In [44]:
doc_verb_before = nlp("If I eat an apple, I drink water")

In [91]:
type(doc_verb_before)

spacy.tokens.doc.Doc

In [45]:
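def peek(iterable):
    """Return None if the iterator is empty, otherwise the iterator itself.
    Note: this consumes the first element, so use the result only as a
    truth test and create a fresh iterator for the actual loop."""
    try:
        next(iterable)
    except StopIteration:
        return None
    return iterable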
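
One thing to keep in mind with `peek`: calling `next` consumes the first element, so the returned iterator starts at the second one. The sketch below shows this and a possible variant, `peek_keep` (a hypothetical name, not part of the notebook), that puts the first element back with `itertools.chain`.

```python
from itertools import chain


def peek(iterable):  # same behaviour as the definition above
    try:
        next(iterable)
    except StopIteration:
        return None
    return iterable


numbers = iter([1, 2, 3])
rest = list(peek(numbers))  # the first element has already been consumed


def peek_keep(iterator):
    """Hypothetical variant: the result still yields the first item."""
    try:
        first = next(iterator)
    except StopIteration:
        return None
    return chain([first], iterator)


kept = list(peek_keep(iter([1, 2, 3])))
empty = peek_keep(iter([]))
print(rest, kept, empty)
```

The cells that follow only use `peek` as a truth test and then ask the token for a fresh generator, so the lost element does not matter there.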

In [46]:
for t in doc_verb_before:
    print("TOKEN: ",t,"\ntoken's head:", t.head)
    if peek(t.children):
        print("Children:")
        for child in t.children:
            print(child)
    if peek(t.lefts):
        print("Lefts:")
        for left in t.lefts:
            print(left)
    if peek(t.rights):
        print("Rights:")
        for right in t.rights:
            print(right)
    print("\n")

TOKEN:  If 
token's head: eat


TOKEN:  I 
token's head: eat


TOKEN:  eat 
token's head: drink
Children:
If
I
apple
Lefts:
If
I
Rights:
apple


TOKEN:  an 
token's head: apple


TOKEN:  apple 
token's head: eat
Children:
an
Lefts:
an


TOKEN:  , 
token's head: drink


TOKEN:  I 
token's head: drink


TOKEN:  drink 
token's head: drink
Children:
eat
,
I
water
Lefts:
eat
,
I
Rights:
water


TOKEN:  water 
token's head: drink




In [47]:
next(doc_verb_before[2].children)

If

Back to the [table of contents](#contents).


In [72]:
reuters_fileids_crude = reuters.fileids(categories=['crude'])

In [74]:
reuters_fileids_crude[25]

'test/16658'

In [97]:
article = reuters.raw(reuters_fileids_crude[1])

In [102]:
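# remove line breaks (each wrapped line starts with spaces, so words stay separated)
pattern = re.compile(r"\n")
article = re.sub(pattern, "", article)
# collapse runs of spaces
pattern = re.compile(r" +")
article = re.sub(pattern, " ", article)
# caution: the unescaped '.' matches any character, so this also drops the
# letter before every 's (e.g. "There's" becomes "Ther's" in the output below)
pattern = re.compile(r".'s")
article = re.sub(pattern, "'s", article)
article = article.strip()
#re.findall("(\w[^\.]*\.)",article)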
article

'ENERGY/U.S. PETROCHEMICAL INDUSTRY Cheap oil feedstocks, the weakened U.S. dollar and a plant utilization rate approaching 90 pct will propel the streamlined U.S. petrochemical industry to record profits this year, with growth expected through at least 1990, major company executives predicted. This bullish outlook for chemical manufacturing and an industrywide move to shed unrelated businesses has prompted GAF Corp &lt;GAF>, privately-held Cain Chemical Inc, and other firms to aggressively seek acquisitions of petrochemical plants. Oil companies such as Ashland Oil Inc &lt;ASH>, the Kentucky-based oil refiner and marketer, are also shopping for money-making petrochemical businesses to buy. "I see us poised at the threshold of a golden period," said Paul Oreffice, chairman of giant Dow Chemical Co &lt;DOW>, adding, "Ther\'s no major plant capacity being added around the world now. The whole game is bringing out new products and improving the old ones." Analysts say the chemical industr
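
The cleaned text contains "Ther's" where the raw article has "There's". This comes from the third substitution in the cleanup cell, where `.` is not escaped and therefore matches any character. The intent of that substitution is not stated in the notebook, so the escaped pattern below is only a guess at what was meant (dropping a literal period before a possessive). The sample sentence is made up, and the sketch uses the standard-library `re` module rather than `regex`.

```python
import re  # standard library; the notebook imports the regex package as re

sample = "There's a rumour about U.S.'s exports."

unescaped = re.sub(r".'s", "'s", sample)  # '.' matches any character
escaped = re.sub(r"\.'s", "'s", sample)   # '\.' matches a literal period only

print(unescaped)
print(escaped)
```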

In [103]:
doc_reuters = nlp(article)

In [114]:
displacy.render(doc_reuters, style="ent")

In [106]:
for ent in doc_reuters.ents:
    if ent.label_ == "MONEY":  # also tried: "CARDINAL", "QUANTITY"
        print(ent)

741 mln dlrs
58 mln dlrs
13 billion dlrs
three billion dlrs
700 mln dlrs
1.1 billion dlrs


In [130]:
import requests
from bs4 import BeautifulSoup

In [137]:
def synonyms(term):
    """Scrape synonyms for term from thesaurus.com.
    The CSS class name is tied to the current page layout and may stop
    matching when the site changes."""
    response = requests.get('https://www.thesaurus.com/browse/{}'.format(term))
    soup = BeautifulSoup(response.text, 'html.parser')
    # the result can contain duplicates (see the output for "earn" below)
    return [span.text.strip() for span in soup.find_all('a', {'class': 'css-1kg1yv8 eh475bn0'})]

In [141]:
synonyms("have profit")

[]

In [147]:
["earn"]+synonyms("earn")

['earn',
 'acquire',
 'bring in',
 'collect',
 'derive',
 'draw',
 'gain',
 'get',
 'make',
 'obtain',
 'pick up',
 'realize',
 'reap',
 'receive',
 'score',
 'secure',
 'win',
 'acquire',
 'gain',
 'reap',
 'score',
 'win']
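
The scraped list repeats several entries, and it contains multi-word phrases that can never equal the lemma of a single token in the `token.lemma_ in pred_synonyms` test. A possible clean-up step, shown here on the list from the output above (the variable names are only for illustration):

```python
scraped = ['earn', 'acquire', 'bring in', 'collect', 'derive', 'draw',
           'gain', 'get', 'make', 'obtain', 'pick up', 'realize', 'reap',
           'receive', 'score', 'secure', 'win', 'acquire', 'gain', 'reap',
           'score', 'win']

# dict.fromkeys keeps the first occurrence of each entry, in order.
unique = list(dict.fromkeys(scraped))

# Keep single words only, since each token has a single-word lemma.
single_words = [s for s in unique if " " not in s]

print(len(scraped), len(unique), len(single_words))
```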

In [119]:
from PyDictionary import PyDictionary

In [126]:
dictionary = PyDictionary("win","earn","good")
dictionary.synonym("good")

good has no Synonyms in the API


In [162]:
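verb = "offer"
verb_synonyms = [verb] + synonyms(verb)  # fetch once, reuse for both passes
# peek() runs the generator up to its first result, so the debug output
# below appears twice for the tokens up to the first extracted relation
if peek(extract_rel_dep(doc_reuters, pred_name=verb, pred_synonyms=verb_synonyms, excl_prepos=[])):
    for relation in extract_rel_dep(doc_reuters, pred_name=verb, pred_synonyms=verb_synonyms, excl_prepos=[]):
        print(relation)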

found token:  seek
passive:  False
child firms nsubj
child to aux
child aggressively advmod
child outlook nsubj
child has aux
subject:  None
found token:  offered
passive:  False
child GAF nsubj
subject:  GAF
right:  dlrs
child dlrs dobj
child billion nummod
child for prep
child three compound
child Corp pobj
child Borg compound
object:  Borg
found token:  seek
passive:  False
child firms nsubj
child to aux
child aggressively advmod
child outlook nsubj
child has aux
subject:  None
found token:  offered
passive:  False
child GAF nsubj
subject:  GAF
right:  dlrs
child dlrs dobj
child billion nummod
child for prep
child three compound
child Corp pobj
child Borg compound
object:  Borg
(('GAF', 'ORG'), 'offer', ('Borg Warner Corp &', 'ORG'))
found token:  provided
passive:  False
child could aux
child Armen nsubj
subject:  None


In [86]:
reuters.raw(reuters_fileids_crude[1])

'ENERGY/U.S. PETROCHEMICAL INDUSTRY\n  Cheap oil feedstocks, the weakened U.S.\n  dollar and a plant utilization rate approaching 90 pct will\n  propel the streamlined U.S. petrochemical industry to record\n  profits this year, with growth expected through at least 1990,\n  major company executives predicted.\n      This bullish outlook for chemical manufacturing and an\n  industrywide move to shed unrelated businesses has prompted GAF\n  Corp &lt;GAF>, privately-held Cain Chemical Inc, and other firms\n  to aggressively seek acquisitions of petrochemical plants.\n      Oil companies such as Ashland Oil Inc &lt;ASH>, the\n  Kentucky-based oil refiner and marketer, are also shopping for\n  money-making petrochemical businesses to buy.\n      "I see us poised at the threshold of a golden period," said\n  Paul Oreffice, chairman of giant Dow Chemical Co &lt;DOW>, adding,\n  "There\'s no major plant capacity being added around the world\n  now. The whole game is bringing out new products a

In [87]:
reuters.raw(reuters_fileids_crude[2])

'TURKEY CALLS FOR DIALOGUE TO SOLVE DISPUTE\n  Turkey said today its disputes with\n  Greece, including rights on the continental shelf in the Aegean\n  Sea, should be solved through negotiations.\n      A Foreign Ministry statement said the latest crisis between\n  the two NATO members stemmed from the continental shelf dispute\n  and an agreement on this issue would effect the security,\n  economy and other rights of both countries.\n      "As the issue is basicly political, a solution can only be\n  found by bilateral negotiations," the statement said. Greece has\n  repeatedly said the issue was legal and could be solved at the\n  International Court of Justice.\n      The two countries approached armed confrontation last month\n  after Greece announced it planned oil exploration work in the\n  Aegean and Turkey said it would also search for oil.\n      A face-off was averted when Turkey confined its research to\n  territorrial waters. "The latest crises created an historic\n  oppor

In [155]:
trial = nlp("GAF, which made an unsuccessful attempt in 1985 to acquire Union Carbide Corp &lt;UK>, recently offered three billion dlrs for Borg Warner Corp &lt;BOR>, a Chicago manufacturer of plastics and chemicals.")

In [156]:
displacy.render(trial, style="dep",options={'compact': False, 'distance': 100})

In [157]:
for t in trial:
    print("TOKEN: ",t,"\ndepencies:",t.dep_,"\npos: ",t.pos_,"\ntoken's head:", t.head)
    if peek(t.children):
        print("Children:")
        for child in t.children:
            print(child)
    if peek(t.lefts):
        print("Lefts:")
        for left in t.lefts:
            print(left)
    if peek(t.rights):
        print("Rights:")
        for right in t.rights:
            print(right)
    print("\n")

TOKEN:  GAF 
depencies: nsubj 
pos:  PROPN 
token's head: offered
Children:
,
made
>
,
Rights:
,
made
>
,


TOKEN:  , 
depencies: punct 
pos:  PUNCT 
token's head: GAF


TOKEN:  which 
depencies: nsubj 
pos:  PRON 
token's head: made


TOKEN:  made 
depencies: relcl 
pos:  VERB 
token's head: GAF
Children:
which
attempt
in
acquire
Lefts:
which
Rights:
attempt
in
acquire


TOKEN:  an 
depencies: det 
pos:  DET 
token's head: attempt


TOKEN:  unsuccessful 
depencies: amod 
pos:  ADJ 
token's head: attempt


TOKEN:  attempt 
depencies: dobj 
pos:  NOUN 
token's head: made
Children:
an
unsuccessful
Lefts:
an
unsuccessful


TOKEN:  in 
depencies: prep 
pos:  ADP 
token's head: made
Children:
1985
Rights:
1985


TOKEN:  1985 
depencies: pobj 
pos:  NUM 
token's head: in


TOKEN:  to 
depencies: aux 
pos:  PART 
token's head: acquire


TOKEN:  acquire 
depencies: advcl 
pos:  VERB 
token's head: made
Children:
to
Corp
Lefts:
to
Rights:
Corp


TOKEN:  Union 
depencies: compound 
pos:  PROPN 
