Note 1 : This notebook draws heavily from the tutorial available here : https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/

Note 2 : You should ensure that you have installed SpaCy and a SpaCy model (such as en_core_web_sm) to run the code in this notebook.  You can download en_core_web_sm by running the command :

python -m spacy download en_core_web_sm

in your terminal or command prompt.

In this notebook, we'll begin looking at "Relation Extraction".  Relation Extraction is an approach in NLP where we try to identify the relationship between two entities - typically a subject and an object.  To do this, we should understand a little about dependency structures, and how we can use SpaCy's visualiser - displacy - to explore dependency subtrees.

Dependency describes the "role" played by a word in the context of a sentence.  For example, a word might play the role of a "Negation Modifier", which is a word (such as "not") that negates the word it modifies.

Let's start exploring this.  But first, let's import the libraries we need.

In [None]:
import spacy

from spacy import displacy

import en_core_web_sm

# Load spacy model
nlp = en_core_web_sm.load()

Let's take an example piece of text for which we want to extract a relationship - in this case, the company Salesforce acquiring the company Tableau.

In [None]:
text = "Tableau was recently acquired by Salesforce"

Let's apply the SpaCy model to the text, and then plot and display a dependency graph using displacy.  We'll also print the dependency and POS tags.

In [None]:
# Apply the loaded SpaCy model to the text
doc = nlp(text)

# Plot and display the dependency graph using displacy
displacy.render(doc, style='dep')

# Use this for Spyder (will spin up a server) - see :
# https://spacy.io/usage/visualizers
# Once spun up, go to server address in console (e.g. 0.0.0.0:5000)
# displacy.serve(doc, style='dep')

for tok in doc:
    print (f"{tok.text} --> {tok.dep_} --> {tok.pos_}")

We can see a number of things from the above plot.  We see not only the POS tags for each word, but also how the words depend on each other.  For example, we see that "Tableau" acts as a passive nominal subject in this sentence (Tableau is the "subject" of the sentence - in the passive voice, that is the entity to which the verb is being (or in this case, has been) done), that "Salesforce" acts as an object of preposition (it is the object of the preposition "by", and so names the entity that did the acquiring), and that the two halves of the sentence are linked by the verb "acquired".

If we want to extract the relationship in the text above, we need to be able to identify the relevant subject and object of the sentence.  Let's write a function that takes a SpaCy document as its input.  The function will iterate through the tokens in the document and check whether each token has a dependency tag that contains "subjpass" (passive subject), such as 'nsubjpass' (nominal subject (passive)).  Remember you can use spacy.explain to look up POS and dependency tags such as 'nsubjpass'.  If a token has such a tag, the function will store the text of the token in y.  It will also look for an object (there are different types of object, so it will look for any dependency tag that ends with "obj") and, once it finds one, store the text of that token in x.  The function will then return both x (the object) and y (the passive subject).

Note - this function will only work properly if there is one object and one passive subject (so realistically a single sentence).  More than one object or passive subject would cause only the last one in the text to be returned. So this would need adapting if that's what you need.

A nominal is a word or group of words that function together as a noun. A nominal gives more specific details than a simple noun.  E.g. "nice cup of tea" is an example of a nominal - it gives more description than the head noun (cup) - we now know that it is a cup filled with tea, and that it is nice.  See https://www.thoughtco.com/nominal-in-grammar-1691431 

Because nominals function as nouns, they can do whatever nouns can - be a subject, an object or a predicate nominative. A nominal subject (passive) is therefore one in which the nominal is acting as the thing to which the action is being done.  In our example sentence "Tableau was recently acquired by Salesforce", Tableau is a nominal subject passive ('nsubjpass') because it is the thing to which the action of acquiring was being done.
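
As a quick reference for the dependency tags that come up in this notebook, spacy.explain returns a short description for a given tag.  The sketch below is only an illustration; it needs SpaCy itself to be installed, but not a loaded model.

```python
import spacy

# Dependency tags that come up in this notebook
tags = ["nsubjpass", "nsubj", "dobj", "pobj", "neg"]

# Look up a human-readable description for each tag
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag} --> {description}")
```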

In [None]:
def subtree_matcher(doc):
    x = ''
    y = ''

    # Iterate through all the tokens in the input document
    for tok in doc:
        # Extract passive subject (a tag containing "subjpass", e.g. 'nsubjpass')
        if "subjpass" in tok.dep_:
            y = tok.text

        # Extract object (any dependency tag ending in "obj")
        if tok.dep_.endswith("obj"):
            x = tok.text

    # Return object and passive subject
    return x, y
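
A note on substring checks against dependency tags: Python's str.find returns the index of the match (or -1 when there is none) rather than a boolean.  Comparing that index to True only succeeds when the match starts at position 1, which happens to be the case for 'nsubjpass' and 'csubjpass'.  The short sketch below uses plain strings rather than SpaCy tokens to show how that comparison differs from the `in` operator.  The tag 'subjpass' on its own is a made-up value, included purely to illustrate the edge case.

```python
# str.find returns an index (or -1), not a boolean
tags = ["nsubjpass", "csubjpass", "nsubj", "pobj", "subjpass"]

results = []
for tag in tags:
    index = tag.find("subjpass")        # position of the match, or -1
    find_equals_true = (index == True)  # only True when index is exactly 1
    contains = "subjpass" in tag        # True whenever the substring is present
    results.append((tag, index, find_equals_true, contains))
    print(f"{tag} --> find: {index} --> find == True: {find_equals_true} --> in: {contains}")
```

The `in` operator expresses the intent ("does the tag contain this text?") directly, so it is the safer choice for this kind of check.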

Now let's call our function to grab out the object and passive subject of our sentence.

In [None]:
# Call the function above to grab out the object and passive subject of doc
obj, subj = subtree_matcher(doc)

# Print the identified subject and object
print (f"Subject : {subj}, Object: {obj}")

Let's try it with another sentence.

In [None]:
text_2 = "Careem, a ride hailing major in middle east was acquired by Uber"

doc_2 = nlp(text_2)

obj_2, subj_2 = subtree_matcher(doc_2)

print (f"Subject : {subj_2}, Object: {obj_2}")

Now let's try another one.

In [None]:
text_3 = "Salesforce recently acquired Tableau."

doc_3 = nlp(text_3)

obj_3, subj_3 = subtree_matcher(doc_3)

print (f"Subject : {subj_3}, Object: {obj_3}")

We can see that it hasn't worked properly this time.  Let's have a look at the dependency tree to see what might have gone wrong.

In [None]:
displacy.render(doc_3, style='dep')

for tok in doc_3:
    print (f"{tok.text} --> {tok.dep_} --> {tok.pos_}")

This sentence is an active version of the first example sentence. Therefore, the subject and object have been interchanged, and the dependency tag for the subject is no longer "nsubjpass" but is now "nsubj" - i.e. a subject in a sentence in the active voice (rather than the passive voice). So we need to update our logic to accommodate this.  Let's rewrite the function to first check whether the sentence is in the active or passive voice.  If it's passive, we'll do what we did before, but if it's active, we'll look for a dependency tag ending in "subj" instead of one containing "subjpass".

In [None]:
def subtree_matcher(doc):
    subjpass = False

    for tok in doc:
        # Find dependency tag that contains the text "subjpass"
        if "subjpass" in tok.dep_:
            subjpass = True

    x = ''
    y = ''

    # if sentence is passive
    if subjpass:
        for tok in doc:
            if "subjpass" in tok.dep_:
                y = tok.text

            if tok.dep_.endswith("obj"):
                x = tok.text

    # if sentence is not passive
    else:
        for tok in doc:
            if tok.dep_.endswith("subj"):
                y = tok.text

            if tok.dep_.endswith("obj"):
                x = tok.text

    # Return object and subject, in that order
    return x, y
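
The passive check at the top of the function above can also be written as a single expression with any().  The sketch below works on plain lists of dependency tags rather than a SpaCy document, so it runs without a model.  The helper name is_passive is hypothetical, and the tag lists are hard-coded illustrations: only 'nsubjpass', 'nsubj' and 'pobj' are taken from the sentences discussed so far, and the remaining tags are assumed.

```python
def is_passive(dep_tags):
    """Return True if any dependency tag marks a passive subject."""
    return any("subjpass" in tag for tag in dep_tags)


# Illustrative tags for "Tableau was recently acquired by Salesforce"
passive_tags = ["nsubjpass", "auxpass", "advmod", "ROOT", "agent", "pobj"]

# Illustrative tags for "Salesforce recently acquired Tableau."
active_tags = ["nsubj", "advmod", "ROOT", "dobj", "punct"]

print(is_passive(passive_tags))
print(is_passive(active_tags))
```

With a real document, the same helper could be called as is_passive(tok.dep_ for tok in doc).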

Let's try our third sentence again with our new function.

In [None]:
text_3 = "Salesforce recently acquired Tableau."

doc_3 = nlp(text_3)

obj_3, subj_3 = subtree_matcher(doc_3)

print (f"Subject : {subj_3}, Object: {obj_3}")

Great! That's worked.  Let's try out another sentence.

In [None]:
text_4 = "Father took victim out for meal"

doc_4 = nlp(text_4)

obj_4, subj_4 = subtree_matcher(doc_4)

print (f"Subject : {subj_4}, Object: {obj_4}")

This hasn't picked up what we hoped for - we wanted "victim", but we got "meal" instead.  Let's look at the dependency tree.

In [None]:
displacy.render(doc_4, style='dep')

for tok in doc_4:
    print (f"{tok.text} --> {tok.dep_} --> {tok.pos_}")

This time we find a new problem - here we've got a single subject (Father) but two objects (victim and meal). Looking at the dependency tree / tags, we see that victim has a tag of "dobj" and meal has a tag of "pobj".  "dobj" is a Direct Object - this is one that answers the question "whom?" or "what?" after an action verb.  So in this case, the father "took" - and who did he take?  The victim.  Therefore, victim is the Direct Object.  "pobj" is an Object of Preposition - this is one that answers "whom?" or "what?" after a preposition in a prepositional phrase.  Remember, a preposition governs a noun or pronoun to express a relation with another element in the clause (words such as by, to, at, etc.). In this case, the preposition is "for" - it identifies the reason why the father took the victim out (for a meal).  So, the Object of Preposition here is "meal" - it's the object in the prepositional phrase. See : https://webapps.towson.edu/ows/nouns.htm

So, our simple function above won't cut the mustard, because there are two objects here, and so the last one in the sentence (in this case, the object of preposition) would be returned as THE object.  But in reality, we probably want the direct object, or at least to be able to choose between them.  

Let's rewrite our function again.  Here, rather than returning a single subject and a single object, we return a list of subjects and objects.  For the list of objects, we actually return a list of tuples, where the first element of the tuple is the text of the token identified as an object, and the second is the dependency tag, so we can identify what type of object it is.  This would allow us, for example, to filter only direct objects, if that is all we're interested in.

In [None]:
def subtree_matcher(doc):
    subjpass = False

    for tok in doc:
        # Find dependency tag that contains the text "subjpass"
        if "subjpass" in tok.dep_:
            subjpass = True

    object_list = []
    subject_list = []

    # if sentence is passive
    if subjpass:
        for tok in doc:
            if "subjpass" in tok.dep_:
                subject_list.append(tok.text)

            if tok.dep_.endswith("obj"):
                # Store the object text together with its dependency tag
                object_list.append((tok.text, tok.dep_))

    # if sentence is not passive
    else:
        for tok in doc:
            if tok.dep_.endswith("subj"):
                subject_list.append(tok.text)

            if tok.dep_.endswith("obj"):
                # Store the object text together with its dependency tag
                object_list.append((tok.text, tok.dep_))

    return object_list, subject_list
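
Because each object is returned together with its dependency tag, filtering by object type can be done with a short list comprehension.  A minimal sketch follows, using a hard-coded list in the (text, dependency tag) format described above for "Father took victim out for meal", so that it runs without a model.  The helper name filter_objects is hypothetical.

```python
def filter_objects(object_list, dep_tag="dobj"):
    """Keep only the object texts whose dependency tag matches dep_tag."""
    return [text for text, tag in object_list if tag == dep_tag]


# Hard-coded example: one direct object and one object of preposition
example_objects = [("victim", "dobj"), ("meal", "pobj")]

print(filter_objects(example_objects))                  # direct objects only
print(filter_objects(example_objects, dep_tag="pobj"))  # objects of preposition only
```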

Now let's try out our new function and see if that solves our problem.

In [None]:
obj_l, subj_l = subtree_matcher(doc_4)

print (f"Subject(s) : {subj_l}")
print (f"Object(s) : {obj_l}")

That's done the trick.  Now we've got an idea of how we can extract subjects and objects, we can move on to thinking about extracting Relational Triples - we'll look at that in the next notebook.