# spaCy tutorial 3: Dependency parsing

First, import the spaCy library and load the small pre-trained English model. We'll also import `displacy`, which is used for visualizing dependency parses.

In [2]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")

# We'll make sure to import the visualizer `displacy`
from spacy import displacy

## Basic parsing

Now try out a simple sentence.

In [3]:
doc1 = nlp_sm("The black cat is sleeping.")

The nice thing is that spaCy automatically parses text, so we can immediately start working with `doc1`. First let's just see what the parse looks like.

In [4]:
displacy.render(doc1, style = "dep")

The default output shows the words along with their basic POS labels. Dependencies are drawn as labeled arrows: each arrow starts at a head and points to one of that head's dependents, and the label names the relation between the two. For example, the head *cat* has two dependents, *the* and *black*, whose dependency relations are 'det' (determiner) and 'amod' (adjectival modifier) respectively.
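To make the head/dependent relation concrete, here is a minimal sketch in plain Python, with no spaCy required. The parse is written out by hand: the `det` and `amod` arcs come from the description above, while the remaining labels (`nsubj`, `aux`, `ROOT`, `punct`) are assumptions about what the model returns for this sentence. The tuple layout and the `dependents_of` helper are hypothetical names used only for illustration. In spaCy itself the same information lives on each token as `token.head`, `token.dep_` and `token.children`.

```python
# Hand-annotated parse of "The black cat is sleeping."
# Each entry is (word, index of its head, dependency label).
# The root points at itself, which is also how spaCy marks a root token.
parse = [
    ("The", 2, "det"),
    ("black", 2, "amod"),
    ("cat", 4, "nsubj"),      # assumed label
    ("is", 4, "aux"),         # assumed label
    ("sleeping", 4, "ROOT"),  # assumed label
    (".", 4, "punct"),        # assumed label
]


def dependents_of(parse, head_index):
    """Return (word, label) pairs for every word whose head is head_index."""
    return [
        (word, label)
        for i, (word, head, label) in enumerate(parse)
        if head == head_index and i != head_index  # skip the root's self-loop
    ]


# Each arrow in the displaCy picture runs from a head to one dependent.
for i, (word, head, label) in enumerate(parse):
    if head == i:
        print(f"{word} is the root")
    else:
        print(f"{parse[head][0]} --{label}--> {word}")

cat_dependents = dependents_of(parse, 2)
print(cat_dependents)
```

The last line lists the two dependents of *cat* together with their labels, mirroring the two arrows that leave *cat* in the rendered parse.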

See spaCy's [Annotation Specifications](https://spacy.io/api/annotation#dependency-parsing) for a full list of the dependencies used. The POS and dependency labels come from the Universal Dependencies project: [https://universaldependencies.org/](https://universaldependencies.org/). You can also use `spacy.explain()` to find out more about a label.

In [5]:
spacy.explain("amod")

'adjectival modifier'

Now we can try another.

In [6]:
doc2 = nlp_sm("The black cat chased the red feathers.")

In [7]:
displacy.render(doc2, style = "dep")

### Visualizer options

There are lots of other options for visualizing dependency parses with displaCy. For instance, you can set `compact` to draw more compact, square arrows, `add_lemma` to include lemmas, or `fine_grained` to show fine-grained POS tags rather than the coarse ones used by default. These and other options are passed as a dictionary in the `options` argument. You can find all the options in the [`spacy` API documentation](https://spacy.io/api/top-level#displacy_options).

In [9]:
displacy.render(doc1, style = "dep", options = {"compact":True, "add_lemma":True, "fine_grained":True})

## Noun chunks

Yet another convenient thing spaCy does is noun 'chunking', which involves grouping words together into "base noun phrases", i.e. strings of words that have the same noun as their head. Essentially, a noun chunk is a noun plus the words describing that noun, and any of those words' dependents — for example, *colorless green ideas* or *the very tall building*. 

So if we go back to our example of `doc2`, we can easily pull out the two noun chunks *the black cat* and *the red feathers* by using the `noun_chunks` attribute. 

In [8]:
for chunk in doc2.noun_chunks:
    print(chunk)

The black cat
the red feathers


It's easy to see how this could be very useful for quickly extracting information from text. For example, here are the first 15 sentences from the [Brown Corpus of American English](http://www.helsinki.fi/varieng/CoRD/corpora/BROWN/). We can pull out all the noun chunks in these lines.

In [65]:
brown1 = nlp_sm("""The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place.
The jury further said in term presentments that the City Executive Committee, which had over charge of the election, deserves the praise and thanks of the City of Atlanta for the manner in which the election was conducted.
The September term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible irregularities in the hard primary which was won by Mayor Ivan Allen Jr.
Only a relative handful of such reports was received, the jury said, considering the widespread interest in the election, the number of voters and the size of this city.
The jury said it did find that many of Georgia's registration and election laws are outmoded or inadequate and often ambiguous.
It recommended that Fulton legislators act to have these laws studied and revised to the end of modernizing and improving them.
The grand jury commented on a number of other topics, among them the Atlanta and Fulton County purchasing departments which it said are well operated and follow generally accepted practices which inure to the best interest of both governments.
However, the jury said it believes these two offices should be combined to achieve greater efficiency and reduce the cost of administration.
The City Purchasing Department, the jury said, is lacking in experienced clerical personnel as a result of city personnel policies.
It urged that the city take steps to remedy this problem.
Implementation of Georgia's automobile title law was also recommended by the outgoing jury.
It urged that the next Legislature provide enabling funds and re the effective date so an orderly implementation of the law may be effected.
The grand jury took a swipe at the State Welfare Department's handling of federal funds granted for child welfare services in foster homes.
This is one of the major items in the Fulton County general assistance program, the jury said, but the State Welfare Department has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of Fulton County, which receives none of this money.
The jurors said they realize a proportionate distribution of these funds might disable this program in our less populous counties.""")

for chunk in brown1.noun_chunks:
    print(chunk)

The Fulton County Grand Jury
an investigation
Atlanta's recent primary election
no evidence
any irregularities
place
The jury
term presentments
the City Executive Committee
charge
the election
the praise
thanks
the City
Atlanta
the manner
the election
The September term jury
Fulton Superior Court Judge Durwood Pye
reports
possible irregularities
the hard primary
Mayor Ivan Allen Jr.
Only a relative handful
such reports
the jury
the widespread interest
the election
the number
voters
the size
this city
The jury
it
Georgia's registration
election laws
It
Fulton legislators
these laws
the end
them
The grand jury
a number
other topics
them
it
generally accepted practices
the best interest
both governments
the jury
it
these two offices
greater efficiency
the cost
administration
The City Purchasing Department
the jury
experienced clerical personnel
a result
city personnel policies
It
the city
steps
this problem
Implementation
Georgia's automobile title law
the outgoing jury
It
the next Legisl

This looks great! Note that noun chunks also include pronouns, e.g. *it*, *they* or *none*, but spaCy doesn't do so well with bare demonstratives such as those in *That is a lovely jacket* or *Give those to me!* Try these yourself and see what happens.

So now we can get to work pulling out all the noun phrases in our text, right? Well, not so fast... 

Now that I've shown you how to get the noun chunks, I'm going to advise that you **use them very carefully**. The reason is that in my experience, spaCy's chunking algorithm does not always return the results I want and/or expect. This is partly due to the fact that **noun** ***chunks*** **are not the same thing as noun** ***phrases***, at least in the sense of NPs in phrase structure grammar (PSG). So if you are looking for *noun phrases* in the PSG sense, you might be surprised at what you get.

There are a couple of areas where I've run into problems with noun chunks (and I'm sure there are others). One is possessive phrases with the genitive *'s*, which work only somewhat consistently. Suppose, for instance, we have the sentence *The cat's whiskers are long*. How many noun phrases are there, and what are they? We can try spaCy and see what we get.

In [11]:
doc3 = nlp_sm("The cat's whiskers are long.")

for chunk in doc3.noun_chunks:
    print(chunk)

The cat's whiskers


This makes sense. We have one noun phrase *the cat's whiskers*. But with more complex examples, things turn out differently. Consider the sentence below.

In [14]:
doc4 = nlp_sm("The article criticized the government's handling of the economy.")

What are the noun ***phrases*** in this sentence? The first is easy: *the article*. But what about *the government's handling of the economy*? How would you parse this in a typical PSG? I would probably parse this like so.

![PSG1](the_governments_handling_of_the_economy.png "PSG representation")

So I would say there are 4 NPs here: *the government*, *handling of the economy*, *the economy*, and the whole thing *the government's handling of the economy*. I suppose you could argue that *handling of the economy* is not quite the same as the other three (e.g. it can't be used on its own), and so shouldn't be included, but the other three should be obvious.
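The count above can be checked mechanically. The sketch below encodes one possible bracketing of the phrase as nested tuples and lists every node labelled `NP`. This bracketing is an assumption reconstructed from the four NPs just listed, not a transcription of the figure; in particular, the `POS` label for *'s* and the helper names `leaves` and `noun_phrases` are hypothetical.

```python
# One possible phrase-structure bracketing, written as nested tuples:
# (label, child, child, ...), where a child is a tuple or a plain word.
tree = (
    "NP",
    ("NP", "the", "government"),
    ("POS", "'s"),
    ("NP", "handling", ("PP", "of", ("NP", "the", "economy"))),
)


def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label
        words.extend(leaves(child))
    return words


def noun_phrases(node):
    """Return the text of every NP node, outermost first."""
    if isinstance(node, str):
        return []
    found = []
    if node[0] == "NP":
        # Re-attach the clitic 's to the preceding word when joining.
        found.append(" ".join(leaves(node)).replace(" 's", "'s"))
    for child in node[1:]:
        found.extend(noun_phrases(child))
    return found


nps = noun_phrases(tree)
for np_text in nps:
    print(np_text)
```

Under this bracketing the traversal yields four NPs, with the whole phrase first and *the economy* last.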

Now I'll create a list of the noun ***chunks*** in the `Doc` that spaCy returned for this sentence (you could also use a `for` loop as above).

In [17]:
[n for n in doc4.noun_chunks]

[The article, the government's handling, the economy]

We can look at the dependency parse for this sentence, and see what we get.  

In [16]:
displacy.render(doc4)

A neat trick is to zoom in on the region we want by rendering only a slice of the `Doc`: here, everything from the 4th token (index 3, since Python counts from 0) to the end of the sentence.

In [77]:
displacy.render(doc4[3:])

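So in this subtree, the root is *handling*. We can recover the full phrase it heads by slicing the `Doc` from the leftmost to the rightmost token in its subtree.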

In [None]:
for t in doc4:
    if t.text == "handling":
        print(doc4[t.left_edge.i:t.right_edge.i+1])
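The cell above returns the whole phrase headed by *handling*. A noun chunk, by contrast, appears to stop at the head noun and drop everything to its right, which would explain why the chunk list contains *the government's handling* but not *of the economy*. The plain-Python sketch below reproduces both spans from a hand-written parse. The head indices are assumptions about what the model assigns, and `subtree`, `left_edge` and `right_edge` are simplified stand-ins for the spaCy token attributes of the same names, not spaCy's own implementation.

```python
# Hand-annotated parse of the example sentence:
# (text, trailing whitespace, index of head). The root heads itself.
# Head indices are assumptions about the model's output.
tokens = [
    ("The", " ", 1), ("article", " ", 2), ("criticized", " ", 2),
    ("the", " ", 4), ("government", "", 6), ("'s", " ", 4),
    ("handling", " ", 2), ("of", " ", 6), ("the", " ", 9),
    ("economy", "", 7), (".", "", 2),
]


def subtree(i):
    """Indices of token i and everything it dominates, in sentence order."""
    found = {i}
    changed = True
    while changed:  # keep adding tokens whose head is already in the set
        changed = False
        for j, (_, _, head) in enumerate(tokens):
            if head in found and j not in found:
                found.add(j)
                changed = True
    return sorted(found)


def left_edge(i):
    """Index of the leftmost token in the subtree of token i."""
    return subtree(i)[0]


def right_edge(i):
    """Index of the rightmost token in the subtree of token i."""
    return subtree(i)[-1]


def text(start, end):
    """Join tokens start..end inclusive, keeping the original spacing."""
    return "".join(t + ws for t, ws, _ in tokens[start:end + 1]).strip()


i = 6  # index of "handling"
full_span = text(left_edge(i), right_edge(i))  # whole subtree
chunk_like = text(left_edge(i), i)             # stops at the head noun
print(full_span)
print(chunk_like)
```

If you are after the PSG-style phrase rather than the chunk, the `left_edge` to `right_edge` slice is likely the closer match, since it keeps post-modifiers such as the *of* phrase.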

In [66]:
[s for s in brown1.sents][12]

The grand jury took a swipe at the State Welfare Department's handling of federal funds granted for child welfare services in foster homes.

In [69]:
d = nlp_sm("The article criticized the prime minister's handling of the economy.")
[n for n in d.noun_chunks]

[The article, the prime minister's handling, the economy]

In [68]:
d = nlp_sm("The teacher like Alex's painting of a dinosaur.")
[n for n in d.noun_chunks]

[The teacher, Alex's painting, a dinosaur]

In [30]:
[n for n in nlp_sm("The story that you told was hilarious.").noun_chunks]

[The story, you]

In [35]:
gregor = nlp_sm("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.")

for chunk in gregor.noun_chunks:
    print(chunk)

Gregor
troubled dreams
he
himself
his bed
a horrible vermin

In [36]:
displacy.render(gregor, options = {"compact":True})


In [24]:
for token in gregor:
    print(token, token.pos_)

One NUM
morning NOUN
, PUNCT
when ADV
Gregor PROPN
Samsa PROPN
woke VERB
from ADP
troubled ADJ
dreams NOUN
, PUNCT
he PRON
found VERB
himself PRON
transformed VERB
in ADP
his DET
bed NOUN
into ADP
a DET
horrible ADJ
vermin NOUN
. PUNCT


In [34]:
for token in gregor:
    if token.pos_ in ["NOUN", "PROPN", "PRON"]:
        print(gregor[token.left_edge.i:token.right_edge.i+1])

One morning
Gregor
Gregor Samsa
troubled dreams
he
himself
his bed
a horrible vermin


In [50]:
displacy.render(nlp_sm("the world’s largest city"))

In [45]:
displacy.render(nlp_sm("Sam ate the world’s largest mince pie"))

In [38]:
doc4 = nlp_sm("Sam ate the world’s largest mince pie")

[n for n in doc4.noun_chunks]

[Sam, the world, largest mince pie]

In [42]:
for token in doc4:
    if token.pos_ in ["NOUN", "PROPN", "PRON"]:
        print(doc4[token.left_edge.i:token.right_edge.i+1])

Sam
the world
mince
largest mince pie


In [46]:
displacy.render(nlp_sm("the cat's whiskers are long"))

In [47]:
[n for n in nlp_sm("the cat's whiskers are long").noun_chunks]

[the cat's whiskers]

In [39]:
doc5 = nlp_sm("The very fat cat ate the food that we bought yesterday.")

In [40]:
[n for n in doc5.noun_chunks]

[The very fat cat, the food, we]

In [19]:
displacy.render(doc3, options = {"compact":True})