# 6. Dependency parsing

In this class we will look at *parsing*, which means annotating the syntactic structure of a sentence. Currently the most popular way to frame the problem is *dependency parsing*, in which we assume that each word in the sentence *depends* on some other word. For example, an adjective can depend on the noun that it describes, while a verb's subject depends on the verb.

We will start by reading in the text of "Anna Karenina" exactly as in the previous class:

In [1]:
from collections import Counter, defaultdict
from operator import itemgetter
import spacy
import spacy.displacy

In [2]:
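def load_paragraphs(filename):
    '''Reads a text file and returns its non-empty paragraphs (blocks separated by blank lines).'''
    # Assumption: the file is UTF-8 encoded (the text contains curly quotes).
    with open(filename, encoding='utf-8') as fp:
        text = fp.read()
    # Join the lines of each paragraph into a single line.
    return [p.replace('\n', ' ') for p in text.split('\n\n') if p.strip()]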

In [3]:
text = load_paragraphs('../ex3/anna_karenina.txt')
nlp = spacy.load('en_core_web_sm')
docs = [nlp(par) for par in text]  # Run the NLP pipeline paragraph by paragraph

Of the annotations produced by spaCy, today we are interested in the token attributes `head` and `dep_`:

In [4]:
for tok in docs[30][0].sent:
    print(tok.i, tok.orth_, tok.pos_, tok.head.i, tok.dep_, sep='\t')

0	“	PUNCT	15	punct
1	It	PRON	2	nsubj
2	’s	VERB	15	ccomp
3	that	DET	5	det
4	idiotic	ADJ	5	amod
5	smile	NOUN	2	attr
6	that	DET	7	nsubj
7	’s	VERB	5	relcl
8	to	PART	9	aux
9	blame	VERB	2	xcomp
10	for	ADP	9	prep
11	it	PRON	10	pobj
12	all	DET	11	appos
13	,	PUNCT	15	punct
14	”	PUNCT	15	punct
15	thought	VERB	15	ROOT
16	Stepan	PROPN	17	compound
17	Arkadyevitch	PROPN	15	nsubj
18	.	PUNCT	15	punct


The `head` attribute points to the word that the given word depends on. For example, token 3 ("that") depends on token 5 ("smile"). `tok.head` is itself a `Token` object, so we can query its properties directly: `tok.head.pos_` is the POS tag of the *head* of `tok`, whereas `tok.pos_` is the POS tag of `tok` itself:

In [5]:
docs[30][3].head.pos_

'NOUN'

In [6]:
docs[30][3].pos_

'DET'
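
The `head` relation points from a word to the word it depends on, but it is often useful to go the other way and ask which words depend on a given word. The sketch below inverts the relation using plain Python lists copied from the output above rather than spaCy objects; the names `words`, `heads` and `children` are made up for this illustration. In spaCy itself, the dependents of a token are available as `tok.children`.

```python
from collections import defaultdict

# Words and head indices copied from the spaCy output above (tokens 0-18).
words = ['“', 'It', '’s', 'that', 'idiotic', 'smile', 'that', '’s', 'to',
         'blame', 'for', 'it', 'all', ',', '”', 'thought', 'Stepan',
         'Arkadyevitch', '.']
heads = [15, 2, 15, 5, 5, 2, 7, 5, 9, 2, 9, 10, 11, 15, 15, 15, 17, 15, 15]

# Invert the head relation: map each head index to the indices of its dependents.
children = defaultdict(list)
for i, h in enumerate(heads):
    if h != i:  # the root is its own head; do not list it as its own dependent
        children[h].append(i)

print([words[i] for i in children[5]])  # dependents of "smile"
# ['that', 'idiotic', '’s']
```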

The `dep_` attribute gives the *type* of the dependency, i.e. the role that the word plays with respect to its head. For example, `nsubj` means that the word is the nominal subject of its head, typically a verb.

In every sentence there is one word that doesn't depend on any other (typically the main verb): it's called the *root* of the sentence. It has the head set to itself and dependency type `ROOT`.
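
Because every word except the root has a head, following the head links from any word eventually leads to the root. The sketch below illustrates this on the example sentence, again with plain lists copied from the output above instead of spaCy objects; `path_to_root` is a hypothetical helper written for this illustration. In spaCy, the root of a sentence span can be obtained directly as `sent.root`.

```python
# Words and head indices copied from the spaCy output above (tokens 0-18).
words = ['“', 'It', '’s', 'that', 'idiotic', 'smile', 'that', '’s', 'to',
         'blame', 'for', 'it', 'all', ',', '”', 'thought', 'Stepan',
         'Arkadyevitch', '.']
heads = [15, 2, 15, 5, 5, 2, 7, 5, 9, 2, 9, 10, 11, 15, 15, 15, 17, 15, 15]

def path_to_root(i, heads):
    '''Follows head links from token i up to the root (the token that is its own head).'''
    path = [i]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

# The root is the only token whose head index equals its own index.
root = next(i for i, h in enumerate(heads) if h == i)
print(words[root])
# thought

print(' -> '.join(words[i] for i in path_to_root(4, heads)))
# idiotic -> smile -> ’s -> thought
```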

The dependency relationships in a sentence can be visualized by the function `render()` from the module `spacy.displacy`:

In [7]:
spacy.displacy.render(docs[30][0].sent, style='dep', options={'distance': 100, 'compact': True})

# In-class exercises

## Ex 1

The sentence above mentions an "idiotic smile". What else can be idiotic? Find all occurrences of "idiotic" together with the word it depends on (its head), and list each pair along with the entire sentence in which it occurs.

idiotic	smile	Instead of being hurt, denying, defending himself, begging forgiveness, instead of remaining indifferent even—anything would have been better than what he did do—his face utterly involuntarily (reflex spinal action, reflected Stepan Arkadyevitch, who was fond of physiology)—utterly involuntarily assumed its habitual, good-humored, and therefore idiotic smile.
idiotic	smile	This idiotic smile he could not forgive himself.
idiotic	smile	“It’s that idiotic smile that’s to blame for it all,” thought Stepan Arkadyevitch.
idiotic	’s	“Oh, it’s so idiotic!
idiotic	expression	He takes away his hands and feels the shamestruck and idiotic expression of his face.
Idiotic	woman	“Idiotic woman!”


## Ex 2

Is there a relationship between a word's dependency type (i.e. the role that it plays in the sentence) and its POS tag? Create a dictionary `tags_by_dep` that maps each dependency type to a frequency list of POS tags, i.e. a list of (tag, count) pairs sorted by decreasing count.

In [90]:
tags_by_dep['nsubj'][:5]

[('PRON', 25189), ('NOUN', 6373), ('PROPN', 4960), ('DET', 1913), ('NUM', 124)]

In [91]:
tags_by_dep['case'][:5]

[('PART', 1321), ('PUNCT', 15), ('PROPN', 9), ('NOUN', 7), ('PRON', 6)]

In [92]:
tags_by_dep['ROOT'][:5]

[('VERB', 16818), ('AUX', 2607), ('NOUN', 552), ('PUNCT', 527), ('SPACE', 250)]

# Homework

## Ex 3 (2p.)

Write a function `deps(docs, lemma, tag)` that counts the dependents of a word with a given lemma and POS tag, grouped by dependency type. For example, the function should let us see the frequency lists of subjects and objects of the verb "love", as shown below.

In [94]:
deps_love = deps(docs, 'love', 'VERB')

In [95]:
deps_love['nsubj'][:10]    # who loves?

[('i', 50),
 ('he', 43),
 ('she', 40),
 ('you', 29),
 ('who', 6),
 ('one', 5),
 ('wife', 3),
 ('they', 2),
 ('anna', 2),
 ('your', 2)]

In [96]:
deps_love['dobj'][:10]    # who is loved?

[('him', 49),
 ('her', 39),
 ('me', 32),
 ('you', 14),
 ('whom', 9),
 ('man', 5),
 ('child', 4),
 ('wife', 4),
 ('someone', 4),
 ('anna', 4)]

## Ex 4 (3p.)

Write a function `frames(docs, lemma, tag)` that extracts *frames* for a word with a given lemma and POS tag (typically a verb) and counts them. A frame is the sequence of dependents that the word takes in a particular sentence, each dependent being specified by its dependency type and POS tag. For example, the frame `(('nsubj', 'PRON'), ('dobj', 'PRON'))` means that the word takes a pronoun as subject *and* a pronoun as object.

In [98]:
fr_love = frames(docs, 'love', 'VERB')

In [99]:
fr_love[:5]

[((('nsubj', 'PRON'),), 6),
 ((('dobj', 'PRON'),), 4),
 ((('nsubj', 'PRON'), ('aux', 'AUX'), ('neg', 'PART'), ('dobj', 'PRON')), 3),
 ((('nsubj', 'PRON'), ('dobj', 'PRON')), 3),
 ((('dobj', 'NOUN'),), 3)]