# NLP concepts with spaCy

By [Allison Parrish](http://www.decontextualize.com/)

“Natural Language Processing” is a field at the intersection of computer science, linguistics and artificial intelligence which aims to make the underlying structure of language available to computer programs for analysis and manipulation. It’s a vast and vibrant field with a long history! New research and techniques are being developed constantly.

The aim of this notebook is to introduce a few simple concepts and techniques from NLP—just the stuff that’ll help you do creative things quickly, and maybe open the door for you to understand more sophisticated NLP concepts that you might encounter elsewhere.

We'll be using a library called [spaCy](https://spacy.io/), which strikes a good balance between being powerful and state-of-the-art and being easy for newcomers to understand.

(Traditionally, most NLP work in Python was done with a library called [NLTK](http://www.nltk.org/). NLTK is a fantastic library, but it’s also a writhing behemoth: large and slippery and difficult to understand. Also, much of the code in NLTK is decades out of date with contemporary practices in NLP.)

This tutorial is written for Python 3.5+. [Here's a Python 2.7 version of the tutorial](https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3).

## Natural language

“Natural language” is a loaded phrase: what makes one stretch of language “natural” while another stretch is not? NLP techniques are opinionated about what language is and how it works; as a consequence, you’ll sometimes find yourself having to conceptualize your text with uncomfortable abstractions in order to make it work with NLP. (This is especially true of poetry, which almost by definition breaks most “conventional” definitions of how language behaves and how it’s structured.)

Of course, a computer can never really fully “understand” human language. Even when the text you’re using fits the abstractions of NLP perfectly, the results of NLP analysis are always going to be at least a little bit inaccurate. But often even inaccurate results can be “good enough”—and in any case, inaccurate output from NLP procedures can be an excellent source of the sublime and absurd juxtapositions that we (as poets) are constantly in search of.

## English only (sorta)

The main assumption that most NLP libraries and techniques make is that the text you want to process will be in English. Historically, most NLP research has been on English specifically; it’s only more recently that serious work has gone into applying these techniques to other languages. The examples in this chapter are all based on English texts, and the tools we’ll use are geared toward English. If you’re interested in working on NLP in other languages, here are a few starting points:

* [spaCy has models for various languages](https://spacy.io/models/#available-models), including German, Spanish, Portuguese, French, Italian and Dutch. Note that not all of these models support all of the capabilities of spaCy that we'll talk about in this tutorial. Also note that not all languages have the same ideas about what constitutes a "part of speech"!
* [Konlpy](https://github.com/konlpy/konlpy), natural language processing in
  Python for Korean
* [Jieba](https://github.com/fxsjy/jieba), text segmentation and POS tagging in
  Python for Chinese
* Facebook's [fasttext project](https://fasttext.cc/docs/en/pretrained-vectors.html) makes available word vectors for a large number of languages (~300).

## English grammar: a crash course

The only thing I believe about English grammar is [this](http://www.writing.upenn.edu/~afilreis/88v/creeley-on-sentence.html):

> "Oh yes, the sentence," Creeley once told the critic Burton Hatlen, "that's
> what we call it when we put someone in jail."

There is no such thing as a sentence, or a phrase, or a part of speech, or even
a "word"---these are all pareidolic fantasies occasioned by glints of sunlight
we see reflected on the surface of the ocean of language; fantasies that we
comfort ourselves with when faced with language's infinite and unknowable
variability.

Regardless, we may find it occasionally helpful to think about language using
these abstractions. The following is a gross oversimplification of both how
English grammar works, and how theories of English grammar work in the context
of NLP. But it should be enough to get us going!

### Sentences and parts of speech

English texts can roughly be divided into "sentences." Sentences are themselves
composed of individual words, each of which has a function in expressing the
meaning of the sentence. The function of a word in a sentence is called its
"part of speech"—i.e., a word functions as a noun, a verb, an adjective, etc.
Here's a sentence, with words marked for their part of speech:

    I       really love entrees       from        the        new       cafeteria.
    pronoun adverb verb noun (plural) preposition determiner adjective noun

Of course, the "part of speech" of a word isn't a property of the word itself.
We know this because a single "word" can function as two different parts of speech:

> I love cheese.

The word "love" here is a verb. But here:

> Love is a battlefield.

... it's a noun. For this reason (and others), it's difficult for computers to
accurately determine the part of speech for a word in a sentence. (It's
difficult sometimes even for humans to do this.) But NLP procedures do their
best!

### Phrases and larger syntactic structures

There are several different ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a "dependency grammar." We'll talk about the details of this below.


## Installing spaCy

[Follow the instructions here](https://spacy.io/docs/usage/). To install on Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type

    conda install -c conda-forge spacy
    
This line installs the library. You'll also need to download a language model. For that, type:

    python -m spacy download en_core_web_md
    
(Replace `en` with the language code for your desired language, if there's a model available for it.) The language model contains the statistical information necessary to parse text into sentences and sentences into parts of speech. Note that this download is several hundred megabytes, so it might take a while!

If you're not using Anaconda, you can also install with `pip`. When using `pip`, make sure to upgrade to the newest version first, with `pip install --upgrade pip`. (This will ensure that at least *some* of the dependencies are installed as pre-built binaries.)

    pip install spacy
    
(If you're not using a virtual environment, try `sudo pip install spacy`.)

Currently, spaCy is distributed in source form only, so the installation process involves a bit of compiling. On macOS, you'll need to install [XCode](https://developer.apple.com/xcode/) in order to perform the compilation steps. [Here's a good tutorial for macOS Sierra](http://railsapps.github.io/xcode-command-line-tools.html), though the steps should be similar on other versions.

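After you've installed spaCy, you'll need to download the model data, if you haven't already done so above. You can do this from inside a notebook cell by prefixing the command with `!`, which runs it in the shell:

    !python -m spacy download en_core_web_md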

## Basic usage

Import `spacy` like any other Python module:

In [1]:
import spacy

Create a new spaCy object using `spacy.load('en_core_web_md')`. (The name in the parentheses is the same as the name of the model you downloaded above. If you downloaded a different model, you can put its name here instead. You can also just write `'en'` and spaCy will load the best model it has for that language.)

In [2]:
nlp = spacy.load('en_core_web_md')

And then create a document (a spaCy `Doc` object) by calling the spaCy object with the text you want to work with. Below I've included a few sentences from the Universal Declaration of Human Rights:

In [3]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

## Sentences

If you learn nothing else about spaCy (or NLP), then learn at least that it's a good way to get a list of sentences in a text. Once you've created a document object, you can iterate over the sentences it contains using the `.sents` attribute:

In [4]:
for item in doc.sents:
    print(item.text)

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


The `.sents` attribute is a [generator](https://wiki.python.org/moin/Generators), not a list, so while you can use it in a `for` loop or list comprehension, you can't index (or count) it directly. To do this, you'll need to convert it to a list first using the `list()` function:

In [5]:
sentences_as_list = list(doc.sents)

In [6]:
print(sentences_as_list)

[All human beings are born free and equal in dignity and rights., They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood., Everyone has the right to life, liberty and security of person.]


In [7]:
len(sentences_as_list)

3

Then you can get a random item from the list:

In [8]:
import random
random.choice(sentences_as_list)

All human beings are born free and equal in dignity and rights.
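The generator behavior described above is a general Python feature rather than something specific to spaCy. Here's a minimal sketch that uses a hypothetical stand-in function (`fake_sents`, which is not part of spaCy) in place of `doc.sents`, so it runs without a language model:

```python
def fake_sents():
    # Hypothetical stand-in for doc.sents: yields one item at a time.
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

gen = fake_sents()

# Generators support neither len() nor indexing.
try:
    len(gen)
    has_len = True
except TypeError:
    has_len = False

try:
    gen[0]
    indexable = True
except TypeError:
    indexable = False

# Converting to a list gives us both.
as_list = list(gen)
print(has_len, indexable, len(as_list))

# A generator can only be consumed once: a second pass yields nothing.
second_pass = list(gen)
print(second_pass)
```

The last two lines show why it's usually worth storing the result of `list()` in a variable: once a generator has been used up, you have to ask for a fresh one.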

## Words

Iterating over a document yields each word in the document in turn. Words are represented with spaCy [Token](https://spacy.io/docs/api/token) objects, which have several interesting attributes. The `.text` attribute gives the underlying text of the word, and the `.lemma_` attribute gives the word's "lemma" (explained below):

In [9]:
for word in doc:
    print(word.text, word.lemma_)

All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .
Everyone everyone
has have
the the
right right
to to
life life
, ,
liberty liberty
and and
security security
of of
person person
. .


A word's "lemma" is its most "basic" form, the form without any morphology
applied to it. "Sing," "sang," "singing," are all different "forms" of the
lemma *sing*. Likewise, "octopi" is the plural of "octopus"; the "lemma" of
"octopi" is *octopus*.

"Lemmatizing" a text is the process of going through the text and replacing
each word with its lemma. This is often done in an attempt to reduce a text
to its most "essential" meaning, by eliminating pesky things like verb tense
and noun number.
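To make the idea of "replacing each word with its lemma" concrete, here's a toy sketch that uses a small hand-written lookup table. The table and the `toy_lemmatize` function are made up for this example; spaCy's own lemmatizer relies on much larger rules and data, and this is not how it works internally.

```python
# Toy lemma table, written by hand for this example only.
TOY_LEMMAS = {
    "sang": "sing",
    "singing": "sing",
    "octopi": "octopus",
    "beings": "being",
    "are": "be",
    "born": "bear",
    "rights": "right",
}

def toy_lemmatize(words):
    # Replace each word with its lemma; words missing from the table
    # are simply lowercased and otherwise left alone.
    return [TOY_LEMMAS.get(w.lower(), w.lower()) for w in words]

words = ["All", "human", "beings", "are", "born", "free"]
lemmas = toy_lemmatize(words)
print(lemmas)
```

The result for these six words lines up with the lemmas spaCy printed above (`all`, `human`, `being`, `be`, `bear`, `free`), but only because the table was written to match them.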

Individual sentences can also be iterated over to get a list of words:

In [10]:
sentence = list(doc.sents)[1]
for word in sentence:
    print(word.text)

They
are
endowed
with
reason
and
conscience
and
should
act
towards
one
another
in
a
spirit
of
brotherhood
.


## Parts of speech

The `pos_` attribute gives a general part of speech; the `tag_` attribute gives a more specific designation. [List of meanings here.](https://spacy.io/docs/api/annotation)

In [11]:
for item in doc:
    print(item.text, item.pos_, item.tag_)

All DET DT
human ADJ JJ
beings NOUN NNS
are VERB VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are VERB VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should VERB MD
act VERB VB
towards ADP IN
one NUM CD
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood NOUN NN
. PUNCT .
Everyone NOUN NN
has VERB VBZ
the DET DT
right NOUN NN
to ADP IN
life NOUN NN
, PUNCT ,
liberty NOUN NN
and CCONJ CC
security NOUN NN
of ADP IN
person NOUN NN
. PUNCT .


### Extracting words by part of speech

With knowledge of which part of speech each word belongs to, we can make simple code to extract and recombine words by their part of speech. The following code creates a list of all nouns and adjectives in the text:

In [12]:
nouns = [item.text for item in doc if item.pos_ == 'NOUN']
adjectives = [item.text for item in doc if item.pos_ == 'ADJ']

And below, some code to print out random pairings of an adjective from the text with a noun from the text:

In [13]:
for i in range(10):
    print(random.choice(adjectives) + " " + random.choice(nouns))

free person
free security
human rights
free beings
equal Everyone
equal brotherhood
equal spirit
equal rights
human liberty
free reason


Making a list of verbs works similarly:

In [14]:
verbs = [item.text for item in doc if item.pos_ == 'VERB']

Although in this case, you'll notice the list of verbs is a bit unintuitive. We're getting words like "should" and "are" and "has"—helper verbs that maybe don't fit our idea of what verbs we want to extract.

In [15]:
verbs

['are', 'born', 'are', 'endowed', 'should', 'act', 'has']

This is because we used the `.pos_` attribute, which only gives us general information about the part of speech. The `.tag_` attribute allows us to be more specific about the kinds of verbs we want. For example, this code gives us only the verbs in past participle form:

In [16]:
only_past = [item.text for item in doc if item.tag_ == 'VBN']

In [17]:
only_past

['born', 'endowed']
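The same filtering idea extends to several tags at once. The sketch below works on plain `(text, pos, tag)` tuples copied from the part-of-speech output for the second sentence, so it runs without spaCy; the variable names and the choice of tags to keep are just for illustration.

```python
# (text, pos_, tag_) triples copied from the output above (second sentence).
tagged = [
    ("They", "PRON", "PRP"),
    ("are", "VERB", "VBP"),
    ("endowed", "VERB", "VBN"),
    ("with", "ADP", "IN"),
    ("reason", "NOUN", "NN"),
    ("and", "CCONJ", "CC"),
    ("conscience", "NOUN", "NN"),
    ("and", "CCONJ", "CC"),
    ("should", "VERB", "MD"),
    ("act", "VERB", "VB"),
]

# Coarse filter: everything labelled VERB, helper verbs included.
all_verbs = [text for text, pos, tag in tagged if pos == "VERB"]

# Finer filter: keep only chosen tags (here, base forms and past participles).
wanted_tags = {"VB", "VBN"}
some_verbs = [text for text, pos, tag in tagged if tag in wanted_tags]

print(all_verbs)
print(some_verbs)
```

With real spaCy tokens, the equivalent test would be `item.tag_ in wanted_tags` inside the list comprehension.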

## Larger syntactic units

Okay, so we can get individual words by their part of speech. Great! But what if we want larger chunks, based on their syntactic role in the sentence? The easy way is `.noun_chunks`, which is an attribute of a document or a sentence that yields [spans](https://spacy.io/docs/api/span) of noun phrases (one at a time, like `.sents`), regardless of their position in the document:

In [18]:
noun_chunks = [item.text for item in doc.noun_chunks]
print(", ".join(noun_chunks))

All human beings, dignity, rights, They, reason, conscience, a spirit, brotherhood, Everyone, the right, life, liberty, security, person


For anything more sophisticated than this, though, we'll need to learn about how spaCy parses sentences into its syntactic components.

### Understanding dependency grammars

![displacy parse](http://static.decontextualize.com/syntax_example.png)

[See in "displacy", spaCy's syntax visualization tool.](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0)

The spaCy library parses the underlying sentences using a [dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar). Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based [phrase structure grammars](https://en.wikipedia.org/wiki/Phrase_structure_grammar) commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a "dependent" of some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H.

Dependents are related to their heads by a *syntactic relation*. The name of the syntactic relation describes the relationship between the head and the dependent. Use the displaCy visualizer (linked above) to see how a particular sentence is parsed, and what the relations between the heads and dependents are.

Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents). The following code prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, and the word's children (i.e., dependent words):

In [19]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
    print()

Word: Everyone
Tag: NN
Head: has
Dependency relation: nsubj
Children: []

Word: has
Tag: VBZ
Head: has
Dependency relation: ROOT
Children: [Everyone, right, .]

Word: the
Tag: DT
Head: right
Dependency relation: det
Children: []

Word: right
Tag: NN
Head: has
Dependency relation: dobj
Children: [the, to]

Word: to
Tag: IN
Head: right
Dependency relation: prep
Children: [life]

Word: life
Tag: NN
Head: to
Dependency relation: pobj
Children: [,, liberty]

Word: ,
Tag: ,
Head: life
Dependency relation: punct
Children: []

Word: liberty
Tag: NN
Head: life
Dependency relation: conj
Children: [and, security]

Word: and
Tag: CC
Head: liberty
Dependency relation: cc
Children: []

Word: security
Tag: NN
Head: liberty
Dependency relation: conj
Children: [of]

Word: of
Tag: IN
Head: security
Dependency relation: prep
Children: [person]

Word: person
Tag: NN
Head: of
Dependency relation: pobj
Children: []

Word: .
Tag: .
Head: has
Dependency relation: punct
Children: []
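One way to picture what the output above encodes: if every word stores the position of its head, the children of each word can be recovered by inverting that list. Here's a sketch in plain Python, with the head positions copied by hand from the output above. The `words`, `heads` and `children` names are made up for this example and are not spaCy attributes.

```python
words = ["Everyone", "has", "the", "right", "to", "life", ",",
         "liberty", "and", "security", "of", "person", "."]

# heads[i] is the position of word i's head, copied from the output above.
# The root ("has") is listed as its own head, as in spaCy's output.
heads = [1, 1, 3, 1, 3, 4, 5, 5, 7, 7, 9, 10, 1]

# Invert the head list to get each word's children.
children = {i: [] for i in range(len(words))}
for i, h in enumerate(heads):
    if h != i:  # skip the root, which points at itself
        children[h].append(i)

# The root is the one word that is its own head.
root = next(i for i, h in enumerate(heads) if h == i)
print("Root:", words[root])
for i, kids in children.items():
    print(words[i], "->", [words[k] for k in kids])
```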



Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition

### Using .subtree for extracting syntactic units

The `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`. This is a list of the word itself together with its syntactic dependents: essentially, the "clause" that the word belongs to.

This function merges a subtree and returns a string with the text of the words contained in it:

In [20]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence:

In [21]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Flattened subtree: ", flatten_subtree(word.subtree))
    print()

Word: Everyone
Flattened subtree:  Everyone

Word: has
Flattened subtree:  Everyone has the right to life, liberty and security of person.

Word: the
Flattened subtree:  the

Word: right
Flattened subtree:  the right to life, liberty and security of person

Word: to
Flattened subtree:  to life, liberty and security of person

Word: life
Flattened subtree:  life, liberty and security of person

Word: ,
Flattened subtree:  ,

Word: liberty
Flattened subtree:  liberty and security of person

Word: and
Flattened subtree:  and

Word: security
Flattened subtree:  security of person

Word: of
Flattened subtree:  of person

Word: person
Flattened subtree:  person

Word: .
Flattened subtree:  .
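If you're curious how a subtree could be computed from head information alone, here's a rough sketch in plain Python. It is not spaCy's implementation; the `heads` list is copied by hand from the dependency output earlier in this section, and `subtree` and `flatten` are names made up for this example.

```python
words = ["Everyone", "has", "the", "right", "to", "life", ",",
         "liberty", "and", "security", "of", "person", "."]
# heads[i] is the position of word i's head (the root is its own head).
heads = [1, 1, 3, 1, 3, 4, 5, 5, 7, 7, 9, 10, 1]

def subtree(i):
    # Positions of word i and all of its descendants, in sentence order.
    found = [i]
    for j, h in enumerate(heads):
        if h == i and j != i:  # j is a child of i (and not the root itself)
            found.extend(subtree(j))
    return sorted(found)

def flatten(positions):
    # Joins with single spaces, so punctuation spacing differs from spaCy's
    # text_with_ws, which preserves the original whitespace.
    return " ".join(words[k] for k in positions)

print(flatten(subtree(9)))   # subtree of "security"
print(flatten(subtree(10)))  # subtree of "of"
```

Asking for the subtree of the root (position 1, "has") returns every position in the sentence, which matches the full-sentence subtree printed above.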



Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [22]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [23]:
subjects

['All human beings', 'They', 'Everyone']

Or every prepositional phrase, and likewise every phrase headed by a direct object:

In [24]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [25]:
dobj_phrases = []
for word in doc:
    if word.dep_ == 'dobj':
        dobj_phrases.append(flatten_subtree(word.subtree))

In [26]:
prep_phrases

['in dignity and rights',
 'with reason and conscience',
 'towards one another',
 'in a spirit of brotherhood',
 'of brotherhood',
 'to life, liberty and security of person',
 'of person']

In [27]:
dobj_phrases

['the right to life, liberty and security of person']

## Entity extraction

A common task in NLP is taking a text and extracting "named entities" from it—basically, proper nouns, or names of companies, products, locations, etc. You can easily access this information using the `.ents` property of a document.

In [28]:
doc2 = nlp("Carly Rae Jepsen and I visited the Apple Store in Manhattan.")

In [29]:
for item in doc2.ents:
    print(item)

Carly Rae Jepsen
Apple Store
Manhattan


Entity objects have a `.label_` attribute that tells you the type of the entity. ([Here's a full list of the built-in entity types.](https://spacy.io/docs/usage/entity-recognition#entity-types))

In [30]:
for item in doc2.ents:
    print(item.text, item.label_)

Carly Rae Jepsen PERSON
Apple Store ORG
Manhattan GPE


In [31]:
spacy.explain('GPE')

'Countries, cities, states'

## Loading data from a file

You can load data from a file easily with spaCy. [Here are the first few verses from the King James Version of the Bible](http://rwet.decontextualize.com/texts/genesis.txt), for example. (Download the linked file and make sure it's in the same directory as this notebook.)

In [3]:
doc3 = nlp(open("flood.txt").read())

From here, we can see what entities were here with us from the very beginning:

In [15]:
for item in doc3.ents:
    print(item.text, item.label_)
    list_of_strings = list(item.text)

earth LOC
an hundred
            MONEY
twenty years DATE
earth LOC
earth LOC
LORD PERSON
LORD PERSON
Noah PERSON
Noaoah ORG
Noah PERSON
Noah PERSON
three CARDINAL
Shem PERSON
Ham PERSON
Japheth PERSON
Noah PERSON
three hundred cubits QUANTITY
fifty cubits QUANTITY
thirty cubits QUANTITY
second ORDINAL
third ORDINAL
two CARDINAL
earth LOC
two CARDINAL
Noah PERSON
LORD PERSON
Noah PERSON
sevens CARDINAL
two CARDINAL
earth LOC
seven days DATE
forty nights DATE
Noah PERSON
Noah PERSON
six hundred years old DATE
Noah PERSON

            PERSON
two CARDINAL
two CARDINAL
Noah PERSON
Noah PERSON
seven days DATE
the six hundredth year DATE
Noah PERSON
the second month DATE
the seventeenth day of the month DATE
the same day DATE
forty nights DATE
the selfsame day DATE
Noah PERSON
Shem PERSON
Ham ORG

           Japheth PERSON
Noah PERSON
Noah PERSON
three
            QUANTITY
Noah PERSON
two CARDINAL
two CARDINAL
forty days DATE
Fifteen cubits QUANTITY


 PERSON
Noah PERSON
Noah PERSON
the end o

To make a list of all of the times in the creation of the Earth:

In [5]:
[item.text for item in doc3.ents if item.label_ == 'TIME']

[]
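The empty list is less surprising once the entities are grouped by label: none of the entities shown in the output above carries the `TIME` label, and the durations are labelled `DATE` instead. Here's a small sketch that groups a handful of `(text, label)` pairs copied from that output. It uses plain tuples rather than spaCy objects, and `by_label` is a name made up for this example.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the entity output above.
entities = [
    ("earth", "LOC"),
    ("twenty years", "DATE"),
    ("LORD", "PERSON"),
    ("Noah", "PERSON"),
    ("three", "CARDINAL"),
    ("Shem", "PERSON"),
    ("three hundred cubits", "QUANTITY"),
    ("seven days", "DATE"),
    ("forty nights", "DATE"),
]

# Group entity texts under their labels.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(by_label["DATE"])
print(by_label["TIME"])  # no entity in this sample has the TIME label
```

With real spaCy entities, the loop body would be `by_label[item.label_].append(item.text)`.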

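In [16]:
# collect the text of every entity and write one entity per line
entity_strings = [item.text for item in doc3.ents]
open("string_export.txt", "w").write("\n".join(entity_strings))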

In [44]:
doc = nlp(open("nature_corpus.txt").read())

In [17]:
sentences = list(doc3.sents)

In [49]:
words = [item.text for item in doc]

In [51]:
nouns = [item.text for item in doc if item.pos_ == 'NOUN']

In [52]:
open("nature_nouns.txt", "w").write("\n".join(nouns))

2045

In [18]:
# sentences are Span objects, so use .text to get strings before joining
open("flood_sentences.txt", "w").write("\n".join([item.text for item in sentences]))

In [53]:
pl_nouns = [item.text for item in doc if item.tag_ == 'NNS']

In [54]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

## Approaches to keyword extraction

"Keyword extraction" is the name for any kind of procedure that attempts to identify a subset of words in a text as being representative of that text's overall meaning. It's a way of computationally answering the questions of what a text is about, and how this text might be different in its contents from other texts. There are a number of ways to perform keyword extraction, some of which are quite sophisticated and depend on a large number of documents to be effective. Others are simple and effective enough that we can implement them in a few lines of code with just the data that we get from spaCy's model and basic analysis of a single document. We'll take a look at a few techniques of the latter kind below.

Here are some helpful recent overviews of different keyword extraction techniques (sometimes also called "automatic terminology recognition") from a number of different disciplines:

* Astrakhantsev, N. “ATR4S: Toolkit with State-of-the-Art Automatic Terms Recognition Methods in Scala.” ArXiv:1611.07804 [Cs], Nov. 2016. arXiv.org, http://arxiv.org/abs/1611.07804.
* [Chuang, Jason, et al. “‘Without the Clutter of Unimportant Words’: Descriptive Keyphrases for Text Visualization.” ACM Transactions on Computer-Human Interaction (TOCHI), vol. 19, no. 3, 2012, p. 19.](http://vis.stanford.edu/papers/keyphrases)
* [Understanding Keyness](http://www.thegrammarlab.com/?nor-portfolio=understanding-keyness) from the Grammar Lab

### Counting words

Maybe the most obvious way to extract keywords from a text is to find the words that occur most frequently. This approach might not be very valuable, as we'll see below, but it's helpful at least to know how it's done. Fortunately, Python's `Counter` object, which provides an easy way to count the number of times that particular items occur in a list, will do most of the work for us. [Here's a more detailed tutorial about `Counter`](https://gist.github.com/aparrish/4b096b95bfbd636733b7b9f2636b8cf4), but the basics are easy to understand. First, import `Counter` from Python's built-in `collections` library:

In [7]:
from collections import Counter

And then pass a list of strings to `Counter()`, assigning the result to a variable. I'll start by just counting raw word counts:

In [8]:
word_counts = Counter([item.text for item in doc3 if item.is_alpha])

(The `if item.is_alpha` clause in the list comprehension above limits the list to tokens made up entirely of alphabetic characters, i.e., excluding punctuation and numbers.)
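Here's a sketch of the same filter on plain strings, using Python's built-in `str.isalpha()` (which, as far as I know, is what spaCy's `is_alpha` is documented to be equivalent to). The token list is made up for this example. The sketch also shows an optional extra step, lowercasing before counting, which the cell above does not do.

```python
from collections import Counter

# Hypothetical token list standing in for [item.text for item in doc3].
tokens = ["And", "the", "flood", "was", "forty", "days", ",", "and",
          "40", "nights", "."]

# str.isalpha() is True only when every character is a letter,
# so punctuation and digits are dropped.
alpha_only = [t for t in tokens if t.isalpha()]
counts = Counter(alpha_only)

# Optional: fold case so that "And" and "and" share one count.
folded = Counter(t.lower() for t in alpha_only)

print(alpha_only)
print(counts["and"], counts["And"], folded["and"])
```

Whether to fold case depends on what you want: keeping "And" separate from "and" preserves information about sentence openings, while folding gives a single count per word.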

The `word_counts` variable contains a `Counter` object, which has a few interesting methods and properties. If you just evaluate it, you get a dictionary-like object that maps tokens to the number of times those tokens occur:

In [9]:
word_counts

Counter({'And': 63,
         'it': 23,
         'came': 6,
         'to': 25,
         'pass': 6,
         'when': 4,
         'men': 5,
         'began': 2,
         'multiply': 5,
         'on': 7,
         'the': 241,
         'face': 9,
         'of': 123,
         'earth': 51,
         'and': 159,
         'daughters': 3,
         'were': 24,
         'born': 1,
         'unto': 19,
         'them': 12,
         'That': 1,
         'sons': 17,
         'God': 21,
         'saw': 4,
         'that': 39,
         'they': 12,
         'fair': 1,
         'took': 4,
         'wives': 6,
         'all': 29,
         'which': 11,
         'chose': 1,
         'LORD': 11,
         'said': 10,
         'My': 1,
         'spirit': 1,
         'shall': 24,
         'not': 8,
         'always': 1,
         'strive': 1,
         'with': 27,
         'man': 15,
         'for': 16,
         'he': 19,
         'also': 5,
         'is': 19,
         'fleset': 1,
         'his': 32,
         'days

You can get the count for a particular token by using square bracket indexing with the `Counter` object:

In [87]:
word_counts['firmament']

9

Or you can get the *n* most frequent items using the `.most_common()` method, which takes an integer parameter to limit the list to a certain number of items, sorted from most frequent to least:

In [10]:
word_counts.most_common(10)

[('the', 241),
 ('and', 159),
 ('of', 123),
 ('And', 63),
 ('earth', 51),
 ('that', 39),
 ('in', 38),
 ('every', 35),
 ('his', 32),
 ('Noah', 32)]

This is a list of [tuples](https://docs.python.org/3.5/library/stdtypes.html#typesseq-tuple). (Tuples are just like lists, except you can't change them after you create them.) To get just the list of the ten most common words:

In [11]:
top_ten_words = [item[0] for item in word_counts.most_common(10)]
print(", ".join(top_ten_words))

the, and, of, And, earth, that, in, every, his, Noah


You can think of this as a kind of (very simple!) list of keywords–essentially, the words that occur in this document more than any other word.

The following expression evaluates to a list of every word in the text and the fraction of the text that it comprises (multiply by 100 to get a percentage). (To keep things short, I'm just getting the first 25 items from the list using the list slice syntax `[:25]`.)

In [12]:
total_words = sum(word_counts.values())
[(word, count / total_words) for word, count in word_counts.items()][:25]

[('And', 0.026359832635983262),
 ('it', 0.009623430962343096),
 ('came', 0.002510460251046025),
 ('to', 0.010460251046025104),
 ('pass', 0.002510460251046025),
 ('when', 0.0016736401673640166),
 ('men', 0.0020920502092050207),
 ('began', 0.0008368200836820083),
 ('multiply', 0.0020920502092050207),
 ('on', 0.0029288702928870294),
 ('the', 0.10083682008368201),
 ('face', 0.0037656903765690376),
 ('of', 0.05146443514644351),
 ('earth', 0.021338912133891212),
 ('and', 0.06652719665271967),
 ('daughters', 0.0012552301255230125),
 ('were', 0.0100418410041841),
 ('born', 0.00041841004184100416),
 ('unto', 0.007949790794979079),
 ('them', 0.00502092050209205),
 ('That', 0.00041841004184100416),
 ('sons', 0.007112970711297071),
 ('God', 0.008786610878661089),
 ('saw', 0.0016736401673640166),
 ('that', 0.016317991631799162)]

This tells you that, e.g., the text is about 10% made up of the word "the" and about 2% made up of the word "earth." Another way of formulating this is in terms of probability: if you pick a random word from this text, it has about a 10% chance of being "the" and a 2% chance of being "earth." Using this method of extracting keywords, we're just making a list of the words that are most likely to be drawn at random from all words in that text.
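The "pick a word at random" reading can be checked with a quick simulation. The sketch below uses a few counts copied from the `word_counts` output above. The value of `total_words` is never printed in this notebook, so the figure used here is an assumption, inferred from the count and fraction shown for "the". The `<other>` bucket is a made-up placeholder for every other word. (`random.choices` requires Python 3.6 or later.)

```python
import random
from collections import Counter

# Counts copied from the word_counts output above (a small subset).
counts = Counter({"the": 241, "and": 159, "of": 123, "earth": 51})

# Assumption: total_words is inferred from 241 / 0.10083682..., since the
# notebook does not print it directly.
total_words = 2390

# Lump every other word into one placeholder bucket.
counts["<other>"] = total_words - sum(counts.values())

# Exact share of the text taken up by each word.
share = {w: c / total_words for w, c in counts.items()}

# Empirical check: draw words at random, weighted by how often they occur.
rng = random.Random(0)
draws = rng.choices(list(counts), weights=list(counts.values()), k=20000)
observed = Counter(draws)["the"] / len(draws)

print(f"exact share of 'the': {share['the']:.3f}")
print(f"observed in random draws: {observed:.3f}")
```

The two numbers should land close to each other, which is all that "the text is about 10% made up of the word 'the'" means in probabilistic terms.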

### Word probabilities

Of course, this particular way of extracting keywords in a text isn't terribly useful: of the top ten items on the list, at least eight of them (excluding "earth" and "Noah") could be expected to occur with similar probabilities in *any* given source text. A potentially more interesting way to formulate the problem is to ask: what words are *uniquely* frequent in this text (and not any arbitrary English text)?

To figure this out, we need data: specifically, data on what the probability is that a given word will occur in any text written in English. Of course, the corpus of "text written in English" is not all computer-readable, is growing all the time, and has a poorly defined boundary (what counts as "English?"), so we can never know these probabilities precisely. But with a sufficiently large corpus of English documents, we could at least form a rough idea.

Fortunately, spaCy's model includes, for every word in its vocabulary, the word's [log probability](https://en.wikipedia.org/wiki/Log_probability) estimate, based on a large corpus of English texts. You can access a word's log probability estimate in English using the `.prob` attribute of the `Token` object (which is what you get when you iterate over a document or a sentence).

In [91]:
[(item.text, item.prob) for item in doc3][:25]

[('In', -7.603263854980469),
 ('the', -3.528766632080078),
 ('beginning', -9.830488204956055),
 ('God', -8.62376594543457),
 ('created', -9.588191986083984),
 ('the', -3.528766632080078),
 ('heaven', -11.090792655944824),
 ('and', -4.113108158111572),
 ('the', -3.528766632080078),
 ('earth', -9.99667739868164),
 ('.', -3.0678977966308594),
 ('\n', -6.0506510734558105),
 ('And', -7.012199401855469),
 ('the', -3.528766632080078),
 ('earth', -9.99667739868164),
 ('was', -5.252320289611816),
 ('without', -7.694504261016846),
 ('form', -9.062009811401367),
 (',', -3.4549596309661865),
 ('and', -4.113108158111572),
 ('void', -11.47757625579834),
 (';', -6.586422920227051),
 ('and', -4.113108158111572),
 ('darkness', -11.919983863830566),
 ('was', -5.252320289611816)]

Lower numbers (i.e., numbers that are more negative) indicate rarer words. You can also look up any word's probability through the `.vocab` attribute of the [`Language`](https://spacy.io/api/language) object (the object we initially created by calling `spacy.load()`). Indexing `.vocab` with a word returns a [`Lexeme`](https://spacy.io/api/lexeme) object:

In [92]:
water = nlp.vocab['water']

In [93]:
water.prob

-8.589462280273438

By the way: you can convert a log probability back to an ordinary probability by raising the constant $e$ to the power of the log probability (multiply the result by 100 to get a percentage). The constant $e$ is included as part of the `math` package, and the operator to raise a value to a power in Python is `**`:

In [94]:
from math import e
e**water.prob

0.00018605610680043203

This tells us that, according to spaCy, if you pick a word at random from any given English text, the chance of it being "water" is about 0.02%.
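Here is a small sketch of the conversion in both directions, using the log probability shown above for "water." It uses `math.exp()` and `math.log()` from the standard library; the variable names are just for illustration.

```python
import math

log_prob = -8.589462280273438  # the value shown above for "water"

# math.exp(x) computes the same thing as e ** x.
prob = math.exp(log_prob)
percent = prob * 100

print(f"probability: {prob:.6f}, percentage: {percent:.3f}%")

# Going the other way: math.log() (the natural logarithm) recovers
# the log probability from the plain probability.
round_trip = math.log(prob)
```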

A first approximation, then, of our task to find the words that are uniquely probable in our text would be simply to get a list of the *least common words* in the text, as judged by spaCy's word probability estimate. To do this, we first need a list of just the unique words in the text (i.e., a list of all of the words with duplicates removed).

In [95]:
unique_words = list(set([item.text for item in doc3 if item.is_alpha]))

Then, using Python's `sorted()` function, we can sort these according to their probability and keep only the fifteen rarest words in the text.

In [96]:
sorted(unique_words, key=lambda x: nlp.vocab[x].prob)[:15]

['moveth',
 'creepeth',
 'firmament',
 'Seas',
 'fowl',
 'yielding',
 'subdue',
 'abundantly',
 'Behold',
 'fruitful',
 'replenish',
 'likeness',
 'hath',
 'winged',
 'dominion']

> NOTE: If you're looking at that `sorted()` function and wondering things like "what is `lambda`" and "why is this happening to me?" then you might want to take a look at [this tutorial](https://github.com/ledeprogram/courses/blob/master/databases-2015/01_Python_Beyond_the_Basics.ipynb).
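For a quick feel for what the `key=` parameter and `lambda` are doing, here is a minimal sketch that doesn't need spaCy at all. The log probabilities in the dictionary are invented for illustration; they are not spaCy's estimates.

```python
# Hypothetical log probability estimates, made up for this example.
log_probs = {"the": -3.5, "water": -8.6, "firmament": -15.2, "earth": -10.0}

# A named function that looks up the value to sort by...
def get_log_prob(word):
    return log_probs[word]

# ...and the same thing written inline as a lambda. Both calls sort the
# words from lowest (rarest) to highest (most common) log probability.
by_function = sorted(log_probs, key=get_log_prob)
by_lambda = sorted(log_probs, key=lambda word: log_probs[word])

print(by_lambda)
```

In other words, a `lambda` is just a small unnamed function, and `sorted()` calls it once per item to decide the ordering.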

### Word weirdness

The result of the expression above feels a *bit* more like an accurate summary of the text, but it does seem to be favoring words that are just rare *in general*, and isn't picking up on words that are relatively common in English but are unusually common in our document. For example, according to our probability calculation earlier, one in twenty words in our text is "God," but the same could not be said for English in general (outside of a few specific genres and contexts, at least). So we need to focus in on the *uniqueness* of the probability. Is a given word uniquely probable to occur in our document, as opposed to English in general?

An easy and intuitive way to calculate this is simply to find the ratio of the word's probability in our document to spaCy's estimate of the word's probability in English. (This calculation for a particular word was called that word's "weirdness" in [Ahmad, Khurshid, et al. “University of Surrey Participation in TREC8: Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER).” TREC, 1999, pp. 1–8.](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.3364), and a similar measure called "log ratio" was proposed by [Andrew Hardie here](http://cass.lancs.ac.uk/?p=1133).)
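Written as a formula, the plain ratio looks like this. The symbols $P_{\text{doc}}(w)$ (the word's probability in our document) and $P_{\text{Eng}}(w)$ (the estimate of its probability in English) are labels chosen here for illustration, not notation taken from the cited sources:

$$
\text{weirdness}(w) = \frac{P_{\text{doc}}(w)}{P_{\text{Eng}}(w)}
$$

A score above 1 means the word turns up more often in the document than the English estimate would suggest; a score below 1 means it turns up less often.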

We'll find each word's "weirdness" score by dividing its frequency in our source document (Genesis) by spaCy's estimate of its probability in English, like so (taking care to convert spaCy's log probability back into a plain probability by raising $e$ to that power). To account for our intuition that our source text, being comparatively small, overrepresents the frequency of its rarest words and underrepresents the frequency of its most common words, we'll use the *square* of the word's frequency in our source text. (Note: I have no actual well-motivated statistical reason for this, but it seems to work okay in practice. [See this tutorial](quick-and-dirty-keywords.ipynb) for a more statistically defensible but slightly more difficult-to-understand approach to this task.)

In [97]:
square_weirdness = [(item, pow(word_counts[item]/total_words, 2) / e**nlp.vocab[item].prob) for item in unique_words]

In [98]:
square_weirdness

[('that', 0.02680706036866266),
 ('Let', 0.8735509428958823),
 ('one', 0.0006328201748930747),
 ('brought', 0.09958702824984658),
 ('whales', 0.5626471557219519),
 ('behold', 0.6741497008189071),
 ('two', 0.002663514115913552),
 ('face', 0.06997393346725132),
 ('said', 0.2234619775500119),
 ('fish', 0.16378760290969788),
 ('and', 0.3942243826208631),
 ('of', 0.04530349263154088),
 ('may', 0.0034026034243890475),
 ('after', 0.27245900703261716),
 ('sixth', 0.4794220601229158),
 ('he', 0.021358924306546304),
 ('over', 0.17746281298749195),
 ('our', 0.010520954094237858),
 ('creepeth', 1378.6369355482045),
 ('In', 0.003156013802475997),
 ('God', 8.966795747370472),
 ('moved', 0.024726696730063457),
 ('give', 0.012733081112292561),
 ('deep', 0.026994624370536725),
 ('fruitful', 7.9162567496963385),
 ('lights', 0.49984586472853804),
 ('his', 0.09135656008221582),
 ('bearing', 0.21605927589715465),
 ('fill', 0.039745426646784425),
 ('dominion', 4.919916289305311),
 ('you', 0.0004996394704867

The higher the score, the weirder the word (i.e., the more particular it is to our source text versus English in general). Sorting by the score gives us our new list of keywords:

In [99]:
[item[0] for item in sorted(square_weirdness, reverse=True, key=lambda x: x[1])][:15]

['firmament',
 'creepeth',
 'moveth',
 'fowl',
 'yielding',
 'waters',
 'earth',
 'herb',
 'God',
 'abundantly',
 'fruitful',
 'dominion',
 'cattle',
 'seed',
 'multiply']

This list has many of the same words as the "just the least probable" list, but now includes words like "waters" and "God" that, while moderately probable in English, are especially probable in our text. Try it out with your own source text and see what you think!
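If you want to try the calculation on your own text, the whole thing can be wrapped up in one function. This is only a sketch: `squared_weirdness` is a hypothetical helper name, and the toy log probabilities below are invented so the example runs without a spaCy model.

```python
from collections import Counter
from math import exp

def squared_weirdness(words, log_prob_lookup):
    """Score each unique word: (frequency in the text) squared, divided by
    the word's estimated probability in English."""
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        doc_freq = count / total
        # Convert the log probability back into a plain probability.
        english_prob = exp(log_prob_lookup[word])
        scores[word] = doc_freq ** 2 / english_prob
    return scores

# Hypothetical log probabilities, invented for this example.
toy_log_probs = {"the": -3.5, "firmament": -15.0, "was": -5.0}
toy_words = ["the", "firmament", "was", "the"]

scores = squared_weirdness(toy_words, toy_log_probs)

# Highest score first: the "weirdest" words in the toy text.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With spaCy, the lookup could be built from the vocabulary, e.g. `{w: nlp.vocab[w].prob for w in unique_words}`.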

### Counting parsed units

Another simple way to pull out common words and phrases is to focus on only particular stretches of the document that have certain syntactic or semantic characteristics, as determined by spaCy's parser. For example, in the cell below I'm counting the number of times particular nouns appear:

In [100]:
noun_counts = Counter([item.text for item in doc3 if item.pos_ == 'NOUN'])

... and then getting just the ten most common nouns:

In [101]:
top_ten_nouns = [item[0] for item in noun_counts.most_common(10)]
print(", ".join(top_ten_nouns))

earth, waters, kind, day, firmament, light, evening, morning, seed, fowl


Here's the same thing with noun chunks:

In [102]:
chunk_counts = Counter([item.text for item in doc3.noun_chunks])
top_ten_chunks = [item[0] for item in chunk_counts.most_common(10)]
print(", ".join(top_ten_chunks))

God, the earth, it, the waters, his kind, them, the firmament, the evening, the morning, the heaven


Or with named entities:

In [103]:
entity_counts = Counter([item.text for item in doc3.ents])
top_ten_entities = [item[0] for item in entity_counts.most_common(10)]
print(", ".join(top_ten_entities))

earth, the day, God, the evening and the morning, the night, Night, the first day, the second day, one, Earth


Or with subjects of sentences:

In [104]:
subject_counts = Counter([item.text for item in doc3 if item.dep_ == 'nsubj'])
top_ten_subjects = [item[0] for item in subject_counts.most_common(10)]
print(", ".join(top_ten_subjects))

God, it, that, evening, earth, which, he, them, seed, waters
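The four cells above all follow the same pattern: build a list of strings, count them, keep the most common. A small helper can capture that pattern. This is a sketch, with `top_n` as a hypothetical name and a stand-in list in place of text extracted from a spaCy document:

```python
from collections import Counter

def top_n(items, n=10):
    """Return the n most common items, most frequent first."""
    return [item for item, count in Counter(items).most_common(n)]

# Stand-in data; with spaCy this list might instead be something like
# [item.text for item in doc3 if item.pos_ == 'NOUN'].
sample = ["earth", "waters", "earth", "day", "waters", "earth"]

print(", ".join(top_n(sample, 2)))
```

Swapping in a different list (noun chunks, entities, subjects) then only changes the argument, not the counting code.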


## Further reading and resources

We've barely scratched the surface of what it's possible to do with spaCy. [There's a great page of tutorials on the official site](https://spacy.io/docs/usage/tutorials) that you should check out!