In [134]:
import spacy
sentence = 'I saw the man with a telescope.'
nlp = spacy.load('en_core_web_sm')
doc = nlp(sentence)  # parsed once here; used only for printing the results below

Extracting a path of dependency relations from the ROOT to a token

The function takes a sentence (string) as input and returns a list of lists, one per token, where each inner list holds [dependency, text] pairs.
The doc object is first created from the input sentence, and a for loop then goes through all of its tokens. For each token I generate its ancestors and, for each ancestor, build the pair [dependency, ancestor text], which is appended to a list. Once the whole path from the token up to the root has been collected, I reverse the list so that the root is always its first element, and finally I append the token's own dependency and text at the end, which gives the full path from the root down to the token. The completed list is then added to the result list. This is repeated for every token in the sentence.

In [135]:
def dependencyPath(sentence):
    doc = nlp(sentence)
    dep = []
    for token in doc:
        array = []
        # token.ancestors walks upward: parent first, root last
        for ancestor in token.ancestors:
            array.append([ancestor.dep_, ancestor.text])
        # put the root first so the path reads from root to token
        array.reverse()
        # close the path with the token itself
        array.append([token.dep_, token.text])
        dep.append(array)
    return dep

dependency=dependencyPath(sentence)
for i, dep in enumerate(dependency):
    print ("Token" , doc[i], "with dependency ->", dep)

Token I with dependency -> [['ROOT', 'saw'], ['nsubj', 'I']]
Token saw with dependency -> [['ROOT', 'saw']]
Token the with dependency -> [['ROOT', 'saw'], ['dobj', 'man'], ['det', 'the']]
Token man with dependency -> [['ROOT', 'saw'], ['dobj', 'man']]
Token with with dependency -> [['ROOT', 'saw'], ['prep', 'with']]
Token a with dependency -> [['ROOT', 'saw'], ['prep', 'with'], ['pobj', 'telescope'], ['det', 'a']]
Token telescope with dependency -> [['ROOT', 'saw'], ['prep', 'with'], ['pobj', 'telescope']]
Token . with dependency -> [['ROOT', 'saw'], ['punct', '.']]
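
To make the idea independent of spaCy, here is a minimal sketch of the same root-to-token walk on a hand-written parse. The `words`, `deps` and `heads` lists and the name `path_from_root` are my own assumptions for illustration; `heads[i]` is the index of the head of token `i`, and the root is its own head, as in spaCy.

```python
# Hypothetical hand-written parse of "I saw the man with a telescope."
# It mirrors the output printed above; it is not produced by spaCy here.
words = ["I", "saw", "the", "man", "with", "a", "telescope", "."]
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj", "punct"]
heads = [1, 1, 3, 1, 1, 6, 4, 1]  # the root ("saw") is its own head


def path_from_root(i, words, deps, heads):
    """Return the [(dep, word), ...] path from the root down to token i."""
    path = [(deps[i], words[i])]
    while heads[i] != i:  # climb until the root is reached
        i = heads[i]
        path.append((deps[i], words[i]))
    path.reverse()  # root first, target token last
    return path


print(path_from_root(5, words, deps, heads))
# [('ROOT', 'saw'), ('prep', 'with'), ('pobj', 'telescope'), ('det', 'a')]
```

The only difference from dependencyPath is that the token is added before climbing instead of after reversing; the resulting order is the same.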


Extract the subtree of dependents of a token

The function takes a sentence (string) as input and returns a list of lists, one per token, each holding the texts of that token's dependents.
The doc object is first created from the input sentence, and a for loop then goes through all of its tokens. For each token I generate its subtree and go through it: every element that is not the token itself is added to a list. Finally, this list is appended to the result list that will be returned. This is repeated for every token in the sentence.

In [136]:
def extractSubtree(sentence):
    doc = nlp(sentence)
    sub = []
    for token in doc:
        array = []
        # token.subtree yields the token and all its descendants in sentence order
        for tr in token.subtree:
            # skip the token itself so only its dependents remain
            if token != tr:
                array.append(tr.text)
        sub.append(array)
    return sub

subtree=extractSubtree(sentence)
for i, tree in enumerate(subtree):
    print ("Token" , doc[i], "with subtree ->", tree)

Token I with subtree -> []
Token saw with subtree -> ['I', 'the', 'man', 'with', 'a', 'telescope', '.']
Token the with subtree -> []
Token man with subtree -> ['the']
Token with with subtree -> ['a', 'telescope']
Token a with subtree -> []
Token telescope with subtree -> ['a']
Token . with subtree -> []
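
As a rough illustration of what the subtree contains, the sketch below collects the dependents of a token from a plain list of head indices, without spaCy. The `words` and `heads` lists and the name `dependents` are assumptions of mine; the parse is copied by hand from the output above.

```python
# Hypothetical hand-written parse; heads[i] is the index of token i's head,
# and the root ("saw", index 1) is its own head.
words = ["I", "saw", "the", "man", "with", "a", "telescope", "."]
heads = [1, 1, 3, 1, 1, 6, 4, 1]


def dependents(i, heads):
    """Indices of all descendants of token i, in sentence order, excluding i."""
    result = []
    for j in range(len(heads)):
        k = j
        # climb from j until we hit token i or the root
        while k != i and heads[k] != k:
            k = heads[k]
        if k == i and j != i:
            result.append(j)
    return result


print([words[j] for j in dependents(4, heads)])  # ['a', 'telescope']
print([words[j] for j in dependents(3, heads)])  # ['the']
```

This climbs the tree once per token, which is fine for a single sentence; for long inputs a children map built in one pass would avoid the repeated climbing.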


Check if a given list of tokens (segment of a sentence) forms a subtree

The function takes as inputs a sequence (list) of strings, to be checked for forming a subtree, and the sentence itself. After creating the doc object I build the subtrees of all tokens with the previously defined function. I then go through the doc with enumerate, so that the index can be used to pick the matching entry of the subtree list, which is compared with the input sequence. Because extractSubtree leaves out the token itself, a match means that the sequence consists of exactly the dependents of some token.
For the comparison I wrote a helper that goes through the first list and removes each of its elements from the second one. As soon as an element cannot be removed (which means the first list contains something the second does not) it returns False. If instead all elements of the first list are found in the second, but the second still has elements left over, it also returns False, because not applied to a non-empty list is False. This helper was written so that a sequence is still recognized as a subtree even if its elements are not in sentence order. The comparison is case-sensitive.
Going back to the main function, as soon as a match is found the result is set to True and the search stops. The Boolean value is then returned to be displayed.

In [137]:
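def compare(x, y):
    # Order-insensitive comparison: remove every element of x from a copy of y
    remaining = list(y)  # copy, so the caller's list is left untouched
    try:
        for elem in x:
            remaining.remove(elem)
    except ValueError:
        # x holds an element that y does not have
        return False
    # True only if nothing is left over, i.e. y has no extra elements
    return not remaining

def checkSubtree(tokens, sentence):
    ret = False
    doc = nlp(sentence)
    subtree = extractSubtree(sentence)
    # subtree[i] holds the dependents of doc[i]
    for i, tk in enumerate(doc):
        if compare(tokens, subtree[i]):
            ret = True
            break
    return ret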

sequence1=['The', 'telescope', 'man']
ret=checkSubtree(sequence1, sentence)
if(ret):
    print('The sequence',sequence1, 'is a subtree of: "',sentence, '"')
else:
    print('The sequence',sequence1, 'is NOT a subtree of: "',sentence, '"')
    
sequence2=['a', 'telescope']
ret=checkSubtree(sequence2, sentence)
if(ret):
    print('The sequence',sequence2, 'is a subtree of: "',sentence, '"')
else:
    print('The sequence',sequence2, 'is NOT a subtree of: "',sentence, '"')

The sequence ['The', 'telescope', 'man'] is NOT a subtree of: " I saw the man with a telescope. "
The sequence ['a', 'telescope'] is a subtree of: " I saw the man with a telescope. "
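
The order-insensitive comparison can also be written with collections.Counter from the standard library, which compares the two lists as multisets and modifies neither of them. This is only an alternative sketch; the name `same_tokens` is hypothetical and not part of the code above.

```python
from collections import Counter


def same_tokens(candidate, subtree_tokens):
    """True if both lists hold the same tokens with the same counts, in any order."""
    return Counter(candidate) == Counter(subtree_tokens)


tokens = ['a', 'telescope']
print(same_tokens(['telescope', 'a'], tokens))  # True: order does not matter
print(same_tokens(['a'], tokens))               # False: 'telescope' is missing
print(same_tokens(['A', 'telescope'], tokens))  # False: comparison is case-sensitive
print(tokens)                                   # ['a', 'telescope'], left unchanged
```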


Identify head of a span, given its tokens

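The function takes as input a portion of a sentence (string) and returns the text of the token that is the head of that span. As before, I go through all of the tokens and look for the one that is its own head (each token points to its head, and only the root points to itself). Then I simply return its text. Note that the span is parsed on its own, without the rest of the sentence, so the result may differ from the analysis of the same words inside the full sentence.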

In [138]:
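def headSpan(sentence):
    # Parse the span on its own and return the text of its root token
    doc = nlp(sentence)
    head = None
    for token in doc:
        # the root is the only token that is its own head
        if token.head == token:
            head = token
    # an empty input has no tokens and therefore no head
    return head.text if head is not None else 'NONE'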
 
span = 'man with a'
head = headSpan(span)
print('The head for span',span,' is:',head)

The head for span man with a  is: man
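
spaCy also exposes Span.root, which its documentation describes as the token with the shortest path to the root of the sentence, so the head can be found inside the full parse instead of re-parsing the span. The sketch below imitates that idea without spaCy on a hand-written parse. The `words` and `heads` lists, the names `depth` and `span_head`, and the rule that the first token wins a tie are all assumptions of mine.

```python
# Hypothetical hand-written parse of "I saw the man with a telescope."
# heads[i] is the index of token i's head; the root is its own head.
words = ["I", "saw", "the", "man", "with", "a", "telescope", "."]
heads = [1, 1, 3, 1, 1, 6, 4, 1]


def depth(i, heads):
    """Number of arcs between token i and the root."""
    d = 0
    while heads[i] != i:
        i = heads[i]
        d += 1
    return d


def span_head(start, end, heads):
    """Index of the token in [start, end) closest to the root (first one on ties)."""
    return min(range(start, end), key=lambda i: depth(i, heads))


print(words[span_head(4, 7, heads)])  # 'with' for the span "with a telescope"
print(words[span_head(3, 6, heads)])  # 'man' for "man with a" ("man" and "with" tie)
```

In the full parse, "man" and "with" both attach directly to "saw", so the span "man with a" has no single head in the strict sense; the tie-breaking rule is what picks "man" here.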


Extract sentence subject, direct object and indirect object spans

The function takes a sentence (string) as input and returns a dictionary with three keys, one each for the sentence subject (or passive subject), the indirect object (a.k.a. dative) and the direct object. The value of each key is a list of strings, where each string is the span formed by the subtree of the matching token.
After creating the doc of the sentence I go through its tokens again and, for each of them, build a string from its whole subtree (this time including the token itself). The string is kept only if the token is a sentence subject, direct object or indirect object; in that case it is appended to the list for that role. The three lists are added to the dictionary only after all tokens have been processed.

In [139]:
def subjectSpan(sentence):
    # reuse the nlp pipeline loaded at the top instead of reloading the model
    doc = nlp(sentence)
    subjects = []
    indirect_objects = []
    direct_objects = []
    for token in doc:
        # join the token's subtree (token included) into one space-separated string
        span = " ".join(tr.text for tr in token.subtree)
        if token.dep_ in ("nsubj", "nsubjpass"):
            subjects.append(span)
        elif token.dep_ in ("iobj", "dative"):
            indirect_objects.append(span)
        elif token.dep_ == "dobj":
            direct_objects.append(span)
    spans = {}
    spans['nsubj'] = subjects
    spans['iobj'] = indirect_objects
    spans['dobj'] = direct_objects
    return spans

spans=subjectSpan(sentence)
for sub in spans:
    print(sub, '=', spans.get(sub))

nsubj = ['I']
iobj = []
dobj = ['the man']
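
The grouping step on its own can be sketched with a lookup table from dependency label to dictionary key, which avoids the chain of if/elif branches. The names `LABEL_TO_KEY` and `group_spans` are hypothetical, and the input pairs below are written by hand to mirror the example sentence; they are not spaCy output.

```python
# Map each dependency label of interest to the key it is reported under.
LABEL_TO_KEY = {
    "nsubj": "nsubj",
    "nsubjpass": "nsubj",
    "iobj": "iobj",
    "dative": "iobj",
    "dobj": "dobj",
}


def group_spans(labelled_spans):
    """Group (dep, span_text) pairs under the keys 'nsubj', 'iobj' and 'dobj'."""
    spans = {"nsubj": [], "iobj": [], "dobj": []}
    for dep, text in labelled_spans:
        key = LABEL_TO_KEY.get(dep)
        if key is not None:  # ignore every other dependency label
            spans[key].append(text)
    return spans


# Hypothetical (dependency, subtree text) pairs for a few tokens of the sentence
pairs = [("nsubj", "I"), ("det", "the"), ("dobj", "the man"), ("prep", "with a telescope")]
print(group_spans(pairs))
# {'nsubj': ['I'], 'iobj': [], 'dobj': ['the man']}
```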
