[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W5T_Dependency_Parsing.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy transformers sentencepiece git+https://github.com/TurkuNLP/diaparser.git@master
!python -m spacy download en_core_web_sm

# Dependency parsing with spaCy

*This section is based on the [spaCy documentation](https://spacy.io/usage/linguistic-features#dependency-parse).*

spaCy features a fast and accurate syntactic dependency parser and has a rich API for navigating the tree. The parser also powers sentence boundary detection and lets you iterate over base noun phrases, or “chunks”. You can check whether a `Doc` object has been parsed by calling `doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has been set and returns a boolean value. If the result is `False`, the default sentence iterator will raise an exception.

### Visualizing Parse Trees

We can use the same `displaCy` tool we used for tagging to visualize the parse tree:

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

As for tagging, we can customize the visualizations to our taste:

In [None]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
displacy.render(doc, style="dep", options=options)

Long texts can become difficult to read when displayed in one row, so it’s often better to visualize them sentence by sentence instead. displaCy supports rendering both `Doc` and `Span` objects, as well as lists of `Doc`s or `Span`s. Instead of passing the full `Doc` to `displacy.render`, you can also pass in the list of sentences, `list(doc.sents)`. This will create one visualization for each sentence.

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. One day, Senex goes on a trip and leaves Pseudolus in charge of Hero."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", options = {"compact": True})

### Noun Chunks

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over `Doc.noun_chunks`.

In the following example, `Text` is the original noun chunk text. `Root Text` is the text of the chunk's root, the word connecting the noun chunk to the rest of the parse. `Root Dep` is the dependency relation connecting the root to its head. `Head Text` is the text of the root token's head, and `Head Dep` is the dependency relation connecting that head to its own head.

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

def print_deps(doc):
    print(f"{'Text'.ljust(20)}\t{'Root Text'.ljust(20)}\tRoot Dep\tHead Text\tHead Dep\n" + "-"*80)
    for chunk in doc.noun_chunks:
        print(
            f"{chunk.text.ljust(20)}\t{chunk.root.text.ljust(20)}\t{chunk.root.dep_}\t{chunk.root.head.text}\t{chunk.root.head.dep_}"
        )

print_deps(doc)

Text                	Root Text           	Root Dep	Head Text	Head Dep
--------------------------------------------------------------------------------
Autonomous cars     	cars                	nsubj	shift	ROOT
insurance liability 	liability           	dobj	shift	ROOT
manufacturers       	manufacturers       	pobj	toward	prep


### Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of `.dep` is a hash value. You can get the string value with `.dep_`.

In the following, we add the field `head.pos_` to show the part of speech of each token's head, and `children` to get the list of the token's immediate syntactic dependents. Notice that we are now operating on single tokens, as opposed to full noun chunks as above.

In [24]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
displacy.render(doc, style="dep")

Autonomous amod cars NOUN []
cars nsubj shift VERB [Autonomous]
shift ROOT shift VERB [cars, liability, toward]
insurance compound liability NOUN []
liability dobj shift VERB [insurance]
toward prep shift VERB [manufacturers]
manufacturers pobj toward ADP []


Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:

In [25]:
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
verbs

{shift}

If you try to match from above, you’ll have to iterate twice. Once for the head, and then again through the children. To iterate through the children, use the `token.children` attribute, which provides a sequence of Token objects.

In [26]:
# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
verbs

[shift]
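
The difference between the two directions can also be seen without spaCy. The sketch below stores the same sentence as plain Python lists: the head indices and labels are copied by hand from the output above, and the `pos` values for tokens that never appear as heads are an assumption. The helper names are hypothetical, and this is only an illustration of the idea, not how spaCy implements `Token.head` or `Token.children`.

```python
# Hand-copied parse of the example sentence; nothing here is produced by spaCy.
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]  # index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
pos = ["ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"]  # partly assumed


def verbs_from_below(heads, deps, pos):
    """One pass: each token looks up at its single head."""
    found = set()
    for i, head in enumerate(heads):
        if deps[i] == "nsubj" and pos[head] == "VERB":
            found.add(head)
    return found


def verbs_from_above(heads, deps, pos):
    """Nested passes: each verb has to search for its children first."""
    found = set()
    for v in range(len(heads)):
        if pos[v] != "VERB":
            continue
        children = [i for i, h in enumerate(heads) if h == v and i != v]
        if any(deps[c] == "nsubj" for c in children):
            found.add(v)
    return found


below = verbs_from_below(heads, deps, pos)
above = verbs_from_above(heads, deps, pos)
print([words[i] for i in sorted(below)], [words[i] for i in sorted(above)])
```

Both functions return the same verb; the version from below simply needs less bookkeeping, because each token has exactly one head to check.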

### Iterating around the local tree

A few more convenience attributes are provided for iterating around the local tree from the token. The `Token.lefts` and `Token.rights` attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, `Token.n_lefts` and `Token.n_rights`, that give the number of left and right children.

In [30]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("bright red apples on the tree")
print("Lefts:", [token.text for token in doc[2].lefts])
print("Rights", [token.text for token in doc[2].rights])
print("# Lefts", doc[2].n_lefts)
print("# Rights", doc[2].n_rights)
displacy.render(doc)

Lefts: ['bright', 'red']
Rights ['on']
# Lefts 2
# Rights 1
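
As a rough mental model, the left and right children can be derived from a list of head indices alone. The sketch below does this for the same phrase. The `heads` list is written by hand to be consistent with the output above (the attachments of "the" and "tree" are assumed, since they are not printed), and the `lefts` and `rights` functions are hypothetical stand-ins, not spaCy's implementation.

```python
# Hand-written head indices for the phrase; the root ("apples") points to itself.
# The attachments of "the" and "tree" are assumptions.
words = ["bright", "red", "apples", "on", "the", "tree"]
heads = [2, 2, 2, 2, 5, 3]


def lefts(heads, i):
    """Children of token i that occur before it, in sentence order."""
    return [j for j, h in enumerate(heads) if h == i and j < i]


def rights(heads, i):
    """Children of token i that occur after it, in sentence order."""
    return [j for j, h in enumerate(heads) if h == i and j > i]


apples = words.index("apples")
left_words = [words[j] for j in lefts(heads, apples)]
right_words = [words[j] for j in rights(heads, apples)]
print("Lefts:", left_words, "count:", len(left_words))
print("Rights:", right_words, "count:", len(right_words))
```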


You can get a whole phrase by its syntactic head using the `Token.subtree` attribute. This returns an ordered sequence of tokens. You can walk up the tree with the `Token.ancestors` attribute, and check dominance with `Token.is_ancestor`. For the default English pipelines, the parse tree is **projective**, which means that there are no crossing brackets. The tokens returned by `.subtree` are therefore guaranteed to be contiguous. This is not true for e.g. the German pipelines, which have many non-projective dependencies.

In [32]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])

Credit nmod 0 2 ['holders', 'submit']
and cc 0 0 ['Credit', 'holders', 'submit']
mortgage compound 0 0 ['account', 'Credit', 'holders', 'submit']
account conj 1 0 ['Credit', 'holders', 'submit']
holders nsubj 1 0 ['submit']
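
To make the link between projectivity and contiguous subtrees concrete, here is a small spaCy-free sketch. It treats a tree as projective when every token's subtree covers an unbroken range of positions. The `subtree` and `is_projective` helpers are hypothetical, the English head indices are copied by hand from the notebook outputs, and the crossing example is a made-up toy tree rather than a real German parse.

```python
def subtree(heads, i):
    """Indices of token i and all its descendants, in sentence order.

    heads[k] is the index of token k's head; a root points to itself.
    Assumes a well-formed tree (no cycles apart from the root's self-loop).
    """
    result = []
    for j in range(len(heads)):
        k = j
        # Walk up from j until we reach i or a root.
        while k != i and heads[k] != k:
            k = heads[k]
        if k == i:
            result.append(j)
    return result


def is_projective(heads):
    """True if every subtree occupies a contiguous range of positions."""
    for i in range(len(heads)):
        span = subtree(heads, i)
        if span != list(range(span[0], span[-1] + 1)):
            return False
    return True


# "Credit and mortgage account holders must submit their requests"
english_heads = [4, 0, 3, 0, 6, 6, 6, 8, 6]
# Toy tree in which the arcs 0-2 and 1-3 cross (invented for illustration).
crossing_heads = [2, 3, 2, 2]

print(is_projective(english_heads), subtree(english_heads, 4))
print(is_projective(crossing_heads), subtree(crossing_heads, 3))
```

In the toy tree, the subtree of token 3 consists of positions 1 and 3 but skips position 2, which is exactly the situation that cannot occur in a projective parse.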


Finally, the `.left_edge` and `.right_edge` attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a `Span` object for a syntactic phrase, which can then be merged into a single token with `Doc.retokenize`. Note that `.right_edge` gives a token within the subtree – so if you use it as the end-point of a range, don’t forget to +1!

In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Credit and mortgage account holders NOUN nsubj submit
must AUX aux submit
submit VERB ROOT submit
their PRON poss requests
requests NOUN dobj submit
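
The off-by-one point about `.right_edge` can be illustrated with plain lists. In the sketch below, the head indices are copied by hand from the notebook outputs, and `descendants` and `edges` are hypothetical helpers that only approximate what `.left_edge` and `.right_edge` return; they are not spaCy's implementation.

```python
words = ["Credit", "and", "mortgage", "account", "holders",
         "must", "submit", "their", "requests"]
heads = [4, 0, 3, 0, 6, 6, 6, 8, 6]  # the root ("submit") points to itself


def descendants(heads, i):
    """Token i plus everything below it in the tree (unordered)."""
    found = [i]
    for j, h in enumerate(heads):
        if h == i and j != i:
            found.extend(descendants(heads, j))
    return found


def edges(heads, i):
    """Positions of the first and last token in the subtree of token i."""
    nodes = descendants(heads, i)
    return min(nodes), max(nodes)


left, right = edges(heads, words.index("holders"))
# The right edge is itself part of the phrase, so the slice must end at right + 1.
phrase = " ".join(words[left:right + 1])
too_short = " ".join(words[left:right])
print(phrase)
print(too_short)
```

Slicing up to `right` instead of `right + 1` silently drops the head noun, which is the mistake the +1 reminder is meant to prevent.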


The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. The following example extracts money and currency values, i.e. entities labeled as `MONEY`, and then uses the dependency parse to find the noun phrase they are referring to – for example `"Net income" → "$9.4 million"`.

In [42]:
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities for easier analysis
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of 1$ billion",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # We have an attribute or direct object, so check for subject
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # We have a prepositional object with a preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)

Net income --> $9.4 million
the prior year --> $2.7 million
Revenue --> twelve billion dollars
a loss --> 1$ billion
