
## 1. <a name="1-pre"></a> Preliminaries
In our last NLP session, we showed that spaCy can identify named entities and produce the dependency graph of a sentence.

In [1]:
!pip install -U spacy
!python -m spacy download en_core_web_sm



In [2]:
# First, import the spaCy library and load the small English language model
import spacy
import pandas as pd
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Leonard Simon Nimoy was born in Boston. \
Nimoy played Spock. \
Spock is a character in the Star Trek franchise. \
Star Trek was created by the great Eugene Wesley Roddenberry.')

displacy.render(doc, style='ent', jupyter=True)

for sent in doc.sents:
  displacy.render(sent, style="dep", jupyter=True, options={'distance': 100})


## 2. <a name="2-pair"></a> Entity Pair Extraction

![nimoy-dep](https://drive.google.com/uc?id=1sM4nn3ZMYoq4T2jsBbWGaKMTWHmGq1SW)

To determine the entity pairs we also make use of the dependency graph. Looking at the first sentence of our example, `Leonard Simon Nimoy was born in Boston`, we can see that:

0. Entities are found in noun phrases.
1. The entity tagged as **subj** (*Leonard Simon Nimoy*) is the **head** of the triple.
2. The entity tagged as **obj** (*Boston*) is the **tail**, and the verb phrase between them (*was born in*) is the relation.
3. *subj* and *obj* may span several tokens; the tokens before the last one carry the tag `dep_ == "compound"`.
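
The observations above can be checked on a hand-labelled version of the sentence. This is only a minimal sketch: the dependency labels are typed in by hand as an assumption of what a parser might assign, and `collect_phrase` is a hypothetical helper, not part of spaCy.

```python
# Hand-labelled tokens for "Leonard Simon Nimoy was born in Boston."
# The (text, dep) pairs are assumptions written by hand, not parser output.
tokens = [
    ("Leonard", "compound"),
    ("Simon", "compound"),
    ("Nimoy", "nsubjpass"),
    ("was", "auxpass"),
    ("born", "ROOT"),
    ("in", "prep"),
    ("Boston", "pobj"),
    (".", "punct"),
]

def collect_phrase(tokens, index):
    """Return the token at `index` plus any 'compound' tokens directly before it."""
    words = [tokens[index][0]]
    i = index - 1
    while i >= 0 and tokens[i][1] == "compound":
        words.insert(0, tokens[i][0])
        i -= 1
    return " ".join(words)

# Observation 1: the first token whose tag contains 'subj' gives the head.
head = next(collect_phrase(tokens, i)
            for i, (_, dep) in enumerate(tokens) if "subj" in dep)
# Observation 2: the first token whose tag contains 'obj' gives the tail.
tail = next(collect_phrase(tokens, i)
            for i, (_, dep) in enumerate(tokens) if "obj" in dep)

print(head, "|", tail)
```

Walking backwards over `compound` tokens is one way to apply observation 3; it handles noun phrases of any length but still ignores modifiers such as adjectives or determiners.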

Below is a simple method for extracting entity pairs. Note that this is far from being exhaustive.

In [3]:
def extract_entity_pairs(sent):
  head = ''
  tail = ''

  prefix = ''             # variable for storing compound noun phrases
  prev_token_dep = ''     # dependency tag of previous token in the sentence
  prev_token_text = ''    # previous token in the sentence


  for token in sent:
    # if it's a punctuation mark, do nothing and move on to the next token
    if token.dep_ == 'punct':
      continue

    # Condition #1: subj is the head entity
    # (substring test covers nsubj, nsubjpass, csubj, ...)
    if 'subj' in token.dep_:
      head = f'{prefix} {token.text}'

      # Reset placeholder variables, to be reused by succeeding entities
      prefix = ''
      prev_token_dep = ''
      prev_token_text = ''

    # Condition #2: obj is the tail entity (dobj, pobj, ...)
    if 'obj' in token.dep_:
      tail = f'{prefix} {token.text}'

    # Condition #3: entities may be composed of several tokens
    if token.dep_ == "compound":
      # if the previous word was also a 'compound', extend the running prefix
      if prev_token_dep == "compound":
        prefix = f'{prefix} {token.text}'
      # if not, then this is the first token in the noun phrase
      else:
        prefix = token.text

    # Placeholders for compound cases.
    prev_token_dep = token.dep_
    prev_token_text = token.text
  #############################################################

  return [head.strip(), tail.strip()]

for i, sent in enumerate(doc.sents, start=1):
  print(f'Sentence {i}: {extract_entity_pairs(sent)}')

Sentence 1: ['Leonard Simon Nimoy', 'Boston']
Sentence 2: ['Nimoy', 'Spock']
Sentence 3: ['Spock', 'Star Trek franchise']
Sentence 4: ['Star Trek', 'Eugene Wesley Roddenberry']


## 3. <a name="3-rel"></a> Relation Extraction

To extract the relation, we use spaCy's rule-based **Matcher** class. In our example sentences, the relation is usually the verb phrase that starts at the root of the sentence. We can therefore describe it as a pattern over dependency tags and use the matched span to recover the tokens that make up the relation.

In [4]:
from spacy.matcher import Matcher

def extract_relation(sent):

  # Rule-based pattern matching class
  matcher = Matcher(nlp.vocab)

  # define the patterns according to the dependency graph tags
  pattern = [{'DEP':'ROOT'},                # verbs are often root
            {'DEP':'prep','OP':"?"},
            {'DEP':'attr','OP':"?"},
            {'DEP':'det','OP':"?"},
            {'DEP':'agent','OP':"?"}]

  matcher.add("relation",[pattern])

  matches = matcher(sent)
  if not matches:
    # no ROOT-based match found in this sentence
    return ''

  # keep the last match returned by the matcher
  _, start, end = matches[-1]

  return sent[start:end].text

for i, sent in enumerate(doc.sents, start=1):
  print(f'Sentence {i}: {extract_relation(sent)}')

Sentence 1: born in
Sentence 2: played
Sentence 3: is a
Sentence 4: created by
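
The function above keeps the last match that the matcher returns. As a possible alternative, spaCy v3's `Matcher.add` accepts `greedy="LONGEST"`, which keeps only the longest of a set of overlapping matches. The sketch below tries this on a hand-built `Doc`; the words, heads and dependency labels are typed in by hand as assumptions (they are not parser output), so no language model is needed.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()

# Hand-annotated sentence (assumed labels, not parser output).
words = ["Nimoy", "was", "born", "in", "Boston", "."]
heads = [2, 2, 2, 2, 3, 2]   # absolute index of each token's head
deps = ["nsubjpass", "auxpass", "ROOT", "prep", "pobj", "punct"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

# Same pattern as in extract_relation: ROOT plus optional trailing tokens.
pattern = [{"DEP": "ROOT"},
           {"DEP": "prep", "OP": "?"},
           {"DEP": "attr", "OP": "?"},
           {"DEP": "det", "OP": "?"},
           {"DEP": "agent", "OP": "?"}]

matcher = Matcher(vocab)
# greedy="LONGEST" drops shorter matches that overlap a longer one.
matcher.add("relation", [pattern], greedy="LONGEST")

# as_spans=True returns Span objects instead of (id, start, end) tuples.
spans = matcher(doc, as_spans=True)
print([span.text for span in spans])
```

With this option the result no longer depends on the order in which matches are returned, at the cost of relying on spaCy v3 behaviour.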


Let's combine the results of the entity-pair and relation extraction.

In [5]:
for i, sent in enumerate(doc.sents, start=1):
  head, tail = extract_entity_pairs(sent)
  print(f'Triple {i}: ({head}, {extract_relation(sent)}, {tail})')

Triple 1: (Leonard Simon Nimoy, born in, Boston)
Triple 2: (Nimoy, played, Spock)
Triple 3: (Spock, is a, Star Trek franchise)
Triple 4: (Star Trek, created by, Eugene Wesley Roddenberry)
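
Once triples are available, one simple way to hold them as a graph is an adjacency map from each head to its outgoing (relation, tail) edges. The following is a minimal sketch under that assumption; `build_graph` is a hypothetical helper, and the triples are copied by hand from the output above.

```python
from collections import defaultdict

# Triples copied from the printed output above.
triples = [
    ("Leonard Simon Nimoy", "born in", "Boston"),
    ("Nimoy", "played", "Spock"),
    ("Spock", "is a", "Star Trek franchise"),
    ("Star Trek", "created by", "Eugene Wesley Roddenberry"),
]

def build_graph(triples):
    """Group (head, relation, tail) triples by head entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)

graph = build_graph(triples)

# Print every edge as head --[relation]--> tail
for head, edges in graph.items():
    for relation, tail in edges:
        print(f"{head} --[{relation}]--> {tail}")
```

Note that this sketch treats every distinct string as its own node, so `Nimoy` and `Leonard Simon Nimoy` stay unconnected; merging such mentions would need a separate entity resolution step.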
