# NLP with spaCy

*Based on [a notebook by Allison Parrish](https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3)*

The aim of this notebook is to introduce a few simple concepts and techniques from NLP which will (hopefully) help you do creative things quickly, and (also hopefully) open the door for you to understand more sophisticated NLP concepts that you might want to use in your final project or in future courses.

We'll be using a library called [spaCy](https://spacy.io/), which strikes a good balance between being powerful and state-of-the-art and being easy for newcomers to understand.

(Traditionally, most NLP work in Python was done with [NLTK](http://www.nltk.org/), which is fantastic, but it’s also large and slippery and difficult to understand. Increasingly, people are turning to spaCy for its ease of use.)

## Caveat the first: "natural language"

“Natural language” is a loaded phrase: what makes one stretch of language “natural” while another stretch is not? NLP techniques are opinionated about what language is and how it works; as a consequence, you’ll sometimes find yourself having to conceptualize your text with uncomfortable abstractions in order to make it work with NLP. 

Of course, a computer can never really fully “understand” human language. Even when the text you’re using fits the abstractions of NLP perfectly, the results of NLP analysis are always going to be at least a little bit inaccurate. But often even inaccurate results can be “good enough,” especially when dealing with large corpora.

## Caveat the second: English only (mostly)

The main assumption that most NLP libraries and techniques make is that the text you want to process will be in English. Historically, most NLP research has been on English specifically; it’s only more recently that serious work has gone into applying these techniques to other languages. If you’re interested in working on NLP in other languages, here are a few starting points:
* [Konlpy](https://github.com/konlpy/konlpy), natural language processing in
  Python for Korean
* [Jieba](https://github.com/fxsjy/jieba), text segmentation and POS tagging in
  Python for Chinese
* The [Pattern](http://www.clips.ua.ac.be/pattern) library (another
  simplified/augmented interface to NLTK) which includes POS-tagging and some
  morphology for Spanish in its
  [pattern.es](http://www.clips.ua.ac.be/pages/pattern-es) package.

## English grammar: a crash course

The following is a gross oversimplification of both how English grammar works, and how theories of English grammar work in the context of NLP. But it should be enough to get us going.

### Sentences and parts of speech

English texts can roughly be divided into "sentences." Sentences are themselves
composed of individual words, each of which has a function in expressing the
meaning of the sentence. The function of a word in a sentence is called its
"part of speech"---i.e., a word functions as a noun, a verb, an adjective, etc.
Here's a sentence, with words marked for their part of speech:

    I       really love entrees       from        the        new       cafeteria.
    pronoun adverb verb noun (plural) preposition determiner adjective noun

Of course, the "part of speech" of a word isn't a property of the word itself.
We know this because a single "word" can function as two different parts of speech:

> I love cheese.

The word "love" here is a verb. But here:

> Love is a battlefield.

... it's a noun. For this reason (and others), it's difficult for computers to
accurately determine the part of speech for a word in a sentence. (It's
difficult sometimes even for humans to do this.) But NLP procedures do their
best!

### Phrases and larger syntactic structures

There are several ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a "dependency grammar." We'll talk about the details of this below.

## Installing spaCy

From the terminal, type:

    sudo pip install spacy
    
(If you're one of the people having trouble with sudo, just try `pip install spacy`.)

Currently, spaCy is distributed in source form only, so the installation process involves a bit of compiling. On macOS, you'll need to install [XCode](https://developer.apple.com/xcode/) if you don't have it already in order to perform the compilation steps. [Here's a good tutorial for macOS Sierra](http://railsapps.github.io/xcode-command-line-tools.html), though the steps should be similar on other versions.

## Downloading the spaCy data

After you've installed spaCy, you'll need to download the data. Run the following on the command line:

    sudo python -m spacy download en
    
As above, if you're having trouble with sudo on your machine, just remove the "sudo" from the line above.

## Basic usage

Import `spacy` like any other Python module. spaCy expects all strings to be Unicode strings. On Python 2, make sure you've included `from __future__ import unicode_literals` at the top of your notebook; it'll make your life easier, trust me. On Python 3, strings are already Unicode, so the import is harmless but not needed.

In [3]:
from __future__ import unicode_literals
import spacy

Create a new spaCy object using `spacy.load('en')` (assuming you want to work with English; spaCy supports other languages as well).

In [4]:
nlp = spacy.load('en')

# NOTE: If `spacy.load('en')` fails (for example, because sudo did not work for you,
# or because you are on spaCy 3 or later, where the 'en' shortcut no longer exists),
# load the English language model by its full name instead:
# nlp = spacy.load('en_core_web_sm')

And then create a `Doc` (document) object by calling the spaCy object with the text you want to work with. Below I've included a few sentences from the Universal Declaration of Human Rights:

In [5]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

## Sentences

If you learn nothing else about spaCy (or NLP), then learn at least that it's an easy way to get a list of sentences in a text. Once you've created a document object, you can iterate over the sentences it contains using the `.sents` attribute:

In [6]:
for item in doc.sents:
    print(item.text)

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


Note: The `.sents` attribute is a generator, so you can't index or count it directly. (In Python, a generator produces its values one at a time as you loop over it, instead of storing them all in memory the way a list does. That is why it has no length and no numbered positions.)

To index or count the `.sents` attribute, you'll need to convert it to a list first using the `list()` function:

In [7]:
sentences_as_list = list(doc.sents)

In [8]:
# check the length to make sure it worked

len(sentences_as_list)

3
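
To see why the conversion is needed, here is a minimal sketch in plain Python. The `toy_sents` generator is a hypothetical stand-in for `doc.sents` (it is not part of spaCy); it only imitates the one-value-at-a-time behavior described above.

```python
def toy_sents():
    # Hypothetical stand-in for doc.sents: yields one sentence at a time.
    yield "All human beings are born free and equal in dignity and rights."
    yield ("They are endowed with reason and conscience and should act "
           "towards one another in a spirit of brotherhood.")
    yield "Everyone has the right to life, liberty and security of person."

sents = toy_sents()

# A generator has no length (and no numbered positions), so len() fails.
try:
    len(sents)
    generator_has_len = True
except TypeError:
    generator_has_len = False

# Converting to a list stores every value, which allows counting and indexing.
sentences = list(sents)
count = len(sentences)
last_sentence = sentences[-1]

print(generator_has_len, count)
print(last_sentence)
```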

## Words

Iterating over a document yields each word in the document in turn. Words are represented with spaCy [Token](https://spacy.io/docs/api/token) objects, which have several interesting attributes. 

The `.text` attribute gives the underlying text of the word, and the `.lemma_` attribute gives the word's "lemma." 


A word's "lemma" is its most "basic" form, the form without any morphology
applied to it. "Sing," "sang," "singing," are all different "forms" of the
lemma *sing*. Likewise, "octopi" is the plural of "octopus"; the "lemma" of
"octopi" is *octopus*.

"Lemmatizing" a text is the process of going through the text and replacing
each word with its lemma. This is often done in an attempt to reduce a text
to its most "essential" meaning, by eliminating pesky things like verb tense
and noun number.

Let's take a look:


In [9]:
print("Word, lemma\n")
for word in doc:
    print(word.text + ", " + word.lemma_)
    
# Note: On the underscore at the end of a variable, see
# https://www.datacamp.com/community/tutorials/role-underscore-python


Word, lemma

All, all
human, human
beings, being
are, be
born, bear
free, free
and, and
equal, equal
in, in
dignity, dignity
and, and
rights, right
., .
They, -PRON-
are, be
endowed, endow
with, with
reason, reason
and, and
conscience, conscience
and, and
should, should
act, act
towards, towards
one, one
another, another
in, in
a, a
spirit, spirit
of, of
brotherhood, brotherhood
., .
Everyone, everyone
has, have
the, the
right, right
to, to
life, life
,, ,
liberty, liberty
and, and
security, security
of, of
person, person
., .
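
As a rough illustration of the idea (not of how spaCy does it internally), lemmatizing can be pictured as a lookup from inflected forms to base forms. The `TOY_LEMMAS` table and the `toy_lemmatize` function below are made up for this sketch and cover only a handful of words from the output above.

```python
# Hypothetical lookup table: inflected form -> lemma (pairs taken from the output above).
TOY_LEMMAS = {
    "beings": "being",
    "are": "be",
    "born": "bear",
    "rights": "right",
    "endowed": "endow",
    "has": "have",
}

def toy_lemmatize(words):
    """Replace each word with its lemma; unknown words are just lowercased."""
    return [TOY_LEMMAS.get(word.lower(), word.lower()) for word in words]

words = "All human beings are born free and equal in dignity and rights".split()
lemmas = toy_lemmatize(words)
print(" ".join(lemmas))
```

A real lemmatizer also has to take the part of speech into account, which is one reason to leave the job to a library.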


Individual sentences can also be iterated over to get a list of words:

In [10]:
sentence = list(doc.sents)[1]  # same as sentence = sentences_as_list[1]

for word in sentence:
    print(word.text, word.lemma_)

They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .


## Parts of speech

The `.pos_` attribute gives a general part of speech; the `.tag_` attribute gives a more specific designation. [List of meanings here.](https://spacy.io/api/annotation#pos-tagging)

In [11]:
print("Word, POS, tag\n")

for item in doc:
    print(item.text, item.pos_, item.tag_)

Word, POS, tag

All DET DT
human ADJ JJ
beings NOUN NNS
are AUX VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are AUX VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should AUX MD
act VERB VB
towards ADP IN
one NOUN NN
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood PROPN NNP
. PUNCT .
Everyone PRON NN
has AUX VBZ
the DET DT
right NOUN NN
to ADP IN
life NOUN NN
, PUNCT ,
liberty NOUN NN
and CCONJ CC
security NOUN NN
of ADP IN
person NOUN NN
. PUNCT .
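
One quick use of these tags is counting how often each part of speech occurs. The sketch below uses plain tuples copied from the first sentence of the output above rather than a live spaCy document, so `tagged` is only a stand-in; with a real `doc`, the same count should come from `Counter(item.pos_ for item in doc)`.

```python
from collections import Counter

# (text, pos, tag) triples copied from the first sentence of the output above.
tagged = [
    ("All", "DET", "DT"), ("human", "ADJ", "JJ"), ("beings", "NOUN", "NNS"),
    ("are", "AUX", "VBP"), ("born", "VERB", "VBN"), ("free", "ADJ", "JJ"),
    ("and", "CCONJ", "CC"), ("equal", "ADJ", "JJ"), ("in", "ADP", "IN"),
    ("dignity", "NOUN", "NN"), ("and", "CCONJ", "CC"), ("rights", "NOUN", "NNS"),
    (".", "PUNCT", "."),
]

# Count the general part of speech of every token.
pos_counts = Counter(pos for _text, pos, _tag in tagged)

for pos, count in pos_counts.most_common():
    print(pos, count)
```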


### Extracting words by part of speech

Knowing each word's part of speech, we can write simple code to extract and recombine words by their part of speech. The following code creates one list of all the nouns in the text and another of all the adjectives:

In [12]:
nouns = []
adjectives = []
for item in doc:
    # sort each word into a list according to its general part of speech
    if item.pos_ == 'NOUN':
        nouns.append(item.text)
    elif item.pos_ == 'ADJ':
        adjectives.append(item.text)

And below, some code to print out random pairings of an adjective from the text with a noun from the text:

In [13]:
import random

print(random.choice(adjectives) + " " + random.choice(nouns))

equal dignity
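
To get several pairings at once, the same idea can be wrapped in a loop. The sketch below hard-codes the adjectives and a few of the nouns tagged in the sample text so that it runs on its own; `random_pairs` is a hypothetical helper name.

```python
import random

# Adjectives and a few of the nouns tagged in the sample text above.
adjectives = ["human", "free", "equal"]
nouns = ["beings", "dignity", "rights", "reason", "conscience", "spirit"]

def random_pairs(adjectives, nouns, count, rng=random):
    """Return `count` strings, each a random adjective followed by a random noun."""
    return [rng.choice(adjectives) + " " + rng.choice(nouns) for _ in range(count)]

# A seeded generator makes the "random" output repeatable between runs.
pairs = random_pairs(adjectives, nouns, 5, rng=random.Random(42))
for pair in pairs:
    print(pair)
```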


Making a list of verbs works similarly:

In [14]:
verbs = []
for item in doc:
    if item.pos_ == 'VERB':
        verbs.append(item.text)

In [29]:
verbs

['born', 'endowed', 'act']
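
The loops above all follow the same pattern, which could be folded into one helper. This is a sketch, not part of spaCy: `words_with_pos` is a hypothetical name, and `FakeToken` is a stand-in that has only the `.text` and `.pos_` attributes the helper reads, so the example runs without a model.

```python
from collections import namedtuple

# Stand-in for spaCy tokens: only the two attributes the helper needs.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

def words_with_pos(tokens, pos):
    """Return the text of every token whose general part of speech is `pos`."""
    return [token.text for token in tokens if token.pos_ == pos]

# Words and tags copied from the start of the second sentence in the output above.
tokens = [
    FakeToken("They", "PRON"), FakeToken("are", "AUX"),
    FakeToken("endowed", "VERB"), FakeToken("with", "ADP"),
    FakeToken("reason", "NOUN"), FakeToken("and", "CCONJ"),
    FakeToken("conscience", "NOUN"),
]

nouns = words_with_pos(tokens, "NOUN")
verbs = words_with_pos(tokens, "VERB")
print(nouns, verbs)
```

Because the helper only reads `.text` and `.pos_`, passing it a real spaCy document (`words_with_pos(doc, 'NOUN')`) should behave the same way.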

The `.pos_` attribute gives us general information about the part of speech. But the `.tag_` attribute allows us to be more specific about the kinds of verbs we want. 

For example, this code gives us only the verbs in past participle form:

In [15]:
only_past = []

for item in doc:
    if item.pos_ == 'VERB':
        if item.tag_ == 'VBN':
            only_past.append(item.text)

In [16]:
only_past

['born', 'endowed']

## Larger syntactic units

Okay, so we can get individual words by their part of speech. Great! But what if we want larger chunks, based on their syntactic role in the sentence? The easy way is `.noun_chunks`, an attribute of a document or a sentence that yields the noun phrases it contains as [spans](https://spacy.io/docs/api/span), regardless of where they occur:

In [17]:
for item in doc.noun_chunks:
    print(item.text)

All human beings
dignity
rights
They
reason
conscience
one
a spirit
brotherhood
Everyone
the right
life
liberty
security
person


For anything more sophisticated than this, though, we'll need to learn about how spaCy parses sentences into its syntactic components.

### Understanding dependency grammars

![displacy parse](http://static.decontextualize.com/syntax_example.png)

[See in "displacy", spaCy's syntax visualization tool.](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0)

The spaCy library parses the underlying sentences using a [dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar). Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based [phrase structure grammars](https://en.wikipedia.org/wiki/Phrase_structure_grammar) commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a "dependent" of some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H.

There are different *types* of relationships between heads and dependents, and each type of relation has its own name. Use the displaCy visualizer (linked above) to see how a particular sentence is parsed, and what the relations between the heads and dependents are. (I've listed a few common relations below.)

Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents). The code after the list below prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, and the word's children (i.e., dependent words).

Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition

In [18]:
# Let's take a look at how this works in practice

for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
    print("")

Word: Everyone
Tag: NN
Head: has
Dependency relation: nsubj
Children: []

Word: has
Tag: VBZ
Head: has
Dependency relation: ROOT
Children: [Everyone, right, .]

Word: the
Tag: DT
Head: right
Dependency relation: det
Children: []

Word: right
Tag: NN
Head: has
Dependency relation: dobj
Children: [the, to]

Word: to
Tag: IN
Head: right
Dependency relation: prep
Children: [life]

Word: life
Tag: NN
Head: to
Dependency relation: pobj
Children: [,, liberty]

Word: ,
Tag: ,
Head: life
Dependency relation: punct
Children: []

Word: liberty
Tag: NN
Head: life
Dependency relation: conj
Children: [and, security, of]

Word: and
Tag: CC
Head: liberty
Dependency relation: cc
Children: []

Word: security
Tag: NN
Head: liberty
Dependency relation: conj
Children: []

Word: of
Tag: IN
Head: liberty
Dependency relation: prep
Children: [person]

Word: person
Tag: NN
Head: of
Dependency relation: pobj
Children: []

Word: .
Tag: .
Head: has
Dependency relation: punct
Children: []
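
The head relation alone is enough to rebuild the rest of the tree. As a sketch, the dictionary below copies the word/head pairs from the output above into plain Python. This works only because every word in this sentence is distinct; real code would use token positions instead. `children_of` and `path_to_root` are hypothetical helpers, not spaCy functions.

```python
# word -> head, copied from the output above. The root ("has") is its own head.
heads = {
    "Everyone": "has", "has": "has", "the": "right", "right": "has",
    "to": "right", "life": "to", ",": "life", "liberty": "life",
    "and": "liberty", "security": "liberty", "of": "liberty",
    "person": "of", ".": "has",
}

def children_of(word):
    """Return the direct dependents of `word`, in sentence order."""
    return [w for w, head in heads.items() if head == word and w != word]

def path_to_root(word):
    """Follow head links upward until reaching the word that is its own head."""
    path = [word]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

print(children_of("has"))
print(path_to_root("person"))
```

The children computed this way match the `Children:` lines printed above, and the path from "person" passes through every head between it and the root verb.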



### Using .subtree for extracting syntactic units

The `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`. The result is a list containing the word itself and all of its syntactic descendants (its dependents, their dependents, and so on): essentially, the "clause" that the word heads.

This function merges a subtree and returns a string with the text of the words contained in it:

In [19]:
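def flatten_subtree(st):
    # text_with_ws is a token's text plus any whitespace that followed it in the
    # original, so joining the pieces reproduces the original spacing
    return ''.join([w.text_with_ws for w in list(st)]).strip()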
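
The reason for joining on `text_with_ws` rather than on a space is that punctuation would otherwise pick up stray spaces. The sketch below imitates this with plain `(text, trailing whitespace)` pairs; these tuples are a stand-in for spaCy tokens, used only so the example runs without a model.

```python
# (text, whitespace that followed the token in the original sentence)
pieces = [
    ("life", ""), (",", " "), ("liberty", " "), ("and", " "),
    ("security", " "), ("of", " "), ("person", ""),
]

# Joining with a space puts a gap before the comma.
joined_with_spaces = " ".join(text for text, _ws in pieces)

# Joining each text with its own trailing whitespace restores the original spacing.
joined_with_ws = "".join(text + ws for text, ws in pieces).strip()

print(joined_with_spaces)
print(joined_with_ws)
```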

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence:

In [20]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Flattened subtree: ", flatten_subtree(word.subtree))
    print("")

Word: Everyone
Flattened subtree:  Everyone

Word: has
Flattened subtree:  Everyone has the right to life, liberty and security of person.

Word: the
Flattened subtree:  the

Word: right
Flattened subtree:  the right to life, liberty and security of person

Word: to
Flattened subtree:  to life, liberty and security of person

Word: life
Flattened subtree:  life, liberty and security of person

Word: ,
Flattened subtree:  ,

Word: liberty
Flattened subtree:  liberty and security of person

Word: and
Flattened subtree:  and

Word: security
Flattened subtree:  security

Word: of
Flattened subtree:  of person

Word: person
Flattened subtree:  person

Word: .
Flattened subtree:  .



Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [21]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [22]:
subjects

['All human beings', 'They', 'Everyone']

Or every prepositional phrase:

In [23]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [24]:
prep_phrases

['in dignity and rights',
 'with reason and conscience',
 'towards one another',
 'in a spirit of brotherhood',
 'of brotherhood',
 'to life, liberty and security of person',
 'of person']

## Entity extraction

A common task in NLP is taking a text and extracting "named entities" from it—basically, proper nouns, or names of companies, products, locations, etc. You can easily access this information using the `.ents` property of a document.

In [25]:
doc2 = nlp("Claire Sterk and I visited the Apple Store in Decatur.")

In [26]:
for item in doc2.ents:
    print(item)

Claire Sterk
the Apple Store
Decatur


Entity objects have a `.label_` attribute that tells you the type of the entity. ([Here's a full list of the built-in entity types.](https://spacy.io/docs/usage/entity-recognition#entity-types))

In [27]:
for item in doc2.ents:
    print(item.text, item.label_)

Claire Sterk PERSON
the Apple Store ORG
Decatur GPE


[More on spaCy entity recognition.](https://spacy.io/docs/usage/entity-recognition)

## Loading data from a file

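You can load data from a file easily with spaCy. You just have to make sure that the text is decoded correctly when you read it: spaCy works on Unicode strings, while a file on disk is a sequence of bytes in some encoding. An easy way to do this is to pass `encoding="utf-8"` to `open()`, assuming the file was saved as UTF-8.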

In [28]:
filename = "./2019-09-ccp-corpus-0.3/ccprecords/1850.ME-10.20.PORT.ART.01.txt"

with open(filename, "r", encoding="utf-8") as file:
    text = file.read()
    doc3 = nlp(text) # remember to convert it to a spacy object
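
To make the encoding step concrete, here is a small self-contained sketch. It writes a throwaway file to a temporary directory (the file name is invented for the example, and the text is a sentence from earlier in this notebook) and reads it back with an explicit encoding, which is the step that matters before handing text to `nlp`.

```python
import os
import tempfile

sample = "“Natural language” is a loaded phrase."

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical file name

    # Write the text to disk as UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample)

    # Reading with the matching encoding gives back the same string.
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()

    # Reading the raw bytes shows that the curly quotes take more than one byte each.
    with open(path, "rb") as file:
        raw = file.read()

print(text == sample)
print(len(text), len(raw))
```

If a file was saved in a different encoding, decoding it as UTF-8 may raise a `UnicodeDecodeError` or produce garbled characters, so it is worth checking how the corpus was saved.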

And lo and behold, the named entities!

In [29]:
for item in doc3.ents:
    print(item.text, item.label_)

Colored Convention PRODUCT
Portland GPE
republican NORP
the United States Senate ORG
a few days DATE
Lawrence Chaplin PERSON
the National Prison FAC
Savior ORG
Liberian Colonization ORG
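
A list like this is often easier to work with once it is grouped by label. The sketch below does that with a plain dictionary; the pairs are copied from the output above so the code runs on its own, and with a live document they would come from `(item.text, item.label_)` for each `item` in `doc3.ents`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Colored Convention", "PRODUCT"), ("Portland", "GPE"),
    ("republican", "NORP"), ("the United States Senate", "ORG"),
    ("a few days", "DATE"), ("Lawrence Chaplin", "PERSON"),
    ("the National Prison", "FAC"), ("Savior", "ORG"),
    ("Liberian Colonization", "ORG"),
]

# Collect the entity texts under their label, keeping the original order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label in sorted(by_label):
    print(label, by_label[label])
```

Grouping also makes questionable labels easier to spot (for instance "Savior" tagged as an organization), which is a reminder that the results are only ever "good enough."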


## Further reading and resources

[A few example programs can be found here.](https://github.com/aparrish/rwet-examples/tree/master/spacy)

We've barely scratched the surface of what it's possible to do with spaCy. [There's a great page of tutorials on the official site](https://spacy.io/docs/usage/tutorials) that you should check out!