# Spacy

## Basics

Spacy is a powerful Natural Language Processing library for Python.  It has a number of abilities, but we'll only use a few of them here.

To begin using spacy, we import it and load the English-language module.  By convention we assign this loaded module to a variable called `nlp`.  (This may take a minute to load, because it loads a lot of data behind the scenes.)

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

This `nlp` object is a strange beast whose exact nature we don't need to know much about:

In [None]:
nlp

The important thing is that if we call it and pass it a string of text, it does a bunch of processing on that string.  Let's create a simple text:

In [None]:
text = """This is a test.  It is a test of your home broadcasting system.  This is only a test.  It is not a real emergency.

In the event of a real emergency, you should step away from your computer and get to safety.  But don't do that now, because this is only a test.

This test is taking place at U.C.S.B., which is a great place.  The test is being conducted by Dr. Brendan Barnwell.

The purpose of these test sentences will become clear in a moment."""

Now we'll create our parsed "document":

In [None]:
doc = nlp(text)
doc

Hey, it looks just like a regular string!  But it's not.  Spacy cleverly makes its document objects display as if they were strings, but they are much more than that.  If we look at the type of our document we can see it's really not a string:

In [None]:
type(doc)

This is a common theme with Spacy objects.  When working with data using Spacy, you may find yourself periodically checking whether what you have is a basic Python object like a string, or some more complex Spacy object masquerading as a string.  As we'll see, this distinction can be critical in some cases, because although Spacy objects display like strings, they behave differently from strings in some important ways.

This Doc object has various useful attributes and abilities.  It allows us to access individual words in the text by numerical index, similar to a list:

In [None]:
doc[0]

It also has a `.sents` attribute that will let us iterate over the sentences of the text.  (The value of `.sents` is a generator, which we convert to a list here to view it.)

In [None]:
list(doc.sents)

Already you can see that this is useful.  Spacy magically knows how to separate the document into sentences.  It's smart about it too --- it doesn't just split it on periods, because it knows that the periods in "U.C.S.B." and "Dr." are not the ends of sentences.

Let's grab a sample sentence and work with it a bit:

In [None]:
sent = list(doc.sents)[6]
sent

Once again, this looks like a string but it's actually a thing called a Span:

In [None]:
type(sent)

This Span is actually an object like a list that we can iterate over using something like a `for` loop.  Each item inside the sentence (and the document) is a word --- or rather, a "token", since it may be not just a word but punctuation or the like.

In [None]:
for tok in sent:
    print("Here is a token:", tok)

Notice that "U.C.S.B." is counted as a single word, but that sentence punctuation like the comma and period are treated as separate tokens.  Handy.

Again, what we are seeing there looks like strings, but each token is actually a `Token` object.  We'll look more at what we can do with them in a minute.

And what's that at the end?  It seems to be a blank token.  Let's take a look.  (Notice that, like the document object, the sentence lets us access tokens by index, as we would with a list.)

In [None]:
sent[-1]

Still mysterious.  We can make it more clear what is there by printing some other stuff around it:

In [None]:
print("*", sent[-1], "*", sep="")

Aha!  It looks like this token is just a space character.  (The `sep=""` there is to prevent `print` from adding extra spaces to our output, so we know that the space we see is really in the token we're looking at.)

But why did this space show up here?  There were spaces between all the words, but we only see this one space token.  The reason is that, being old-fashioned, your instructor has typed [two spaces at the end of every sentence](https://en.wikipedia.org/wiki/Sentence_spacing), instead of just one.  Spacy will get rid of a single space, but anything beyond that is kept in as an extra token.

We can also see this in other sentences where a line break occurs at the end of the sentence:

In [None]:
for tok in list(doc.sents)[3]:
    print("The token is between here >", tok, "<")

Note how the last token had two line breaks, as we can see because it broke the line between the ><.

Another way to see the "true" contents of a token is to look at its `.text` attribute:

In [None]:
list(list(doc.sents)[3])[-1].text

The `\n` represents a line break.

## Tokens

Let's get the tokens in that sentence as a plain list so we can mess with them a bit:

In [None]:
toks = list(list(doc.sents)[3])
toks

Let's get the second token, which is the word "is":

In [None]:
tok = toks[1]
tok

This token has useful attributes.  As we saw, it has a `.text` attribute that gives us the text as a plain string.

In [None]:
tok.text

This can be *really important* for the following reason: every token object is considered to be different from every other one --- even two tokens that have the same text.  For instance, tokens #1 and #7 in our document are both the word "is":

In [None]:
print(doc[1])
print(doc[7])

But the two are not considered equal:

In [None]:
doc[1] == doc[7]

The Token represents an *occurrence* of the word, not the "Platonic ideal" of the word.  There are good reasons for this; for example, as we'll see later, two words with the same text may have different parts of speech in different contexts.  But it's important to remember this when you're doing things like counting word frequencies, because if you use something like `collections.Counter` on the Token objects, it will count each one as a different item:

In [None]:
import collections
collections.Counter(doc[:10])

To get an accurate count, we would need to count the `.text` of each word instead:

In [None]:
collections.Counter([tok.text for tok in doc[:10]])
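
Here is a small sketch of the same effect that runs without Spacy.  The `Occurrence` class is a made-up stand-in, not Spacy's `Token`; it just assumes that two occurrences count as equal only if they sit at the same position in the text.

```python
import collections
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical stand-in for a token: a word plus its position in the text."""
    text: str
    position: int


words = "This is a test . It is a test".split()
occurrences = [Occurrence(word, i) for i, word in enumerate(words)]

# Counting the occurrence objects: every position is unique, so every count is 1.
by_occurrence = collections.Counter(occurrences)

# Counting the plain strings: repeated words are merged into one entry.
by_text = collections.Counter(occ.text for occ in occurrences)

print(max(by_occurrence.values()))
print(by_text["is"], by_text["test"])
```

The first counter ends up with one entry per position, while the second merges repeated words, which is usually what we want for word frequencies.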

Each token also has an `is_stop` attribute that tells us whether it is a "stop word".  It's so awesome and we --- oh wait, um, excuse me:

In [None]:
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)

I'll explain that in a second.  As I was saying!  The `is_stop` attribute tells us whether a word is a stop word:

In [None]:
tok.is_stop

A "stop word" is basically a very common word.  They are called "stop words" because when processing a text, if we encounter such a word, we often want to "stop" and throw it out, because it is unlikely to carry much meaningful information.  For instance, in document summarization, we don't want to place any weight on the word "is" when crafting a summary of some document; the word "is" doesn't have any useful information for the purposes of our summary.
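
As a rough sketch of what "throwing out" stop words looks like, here is a filter over a plain list of words.  The stop list is a tiny hand-picked set for illustration only; it is not Spacy's list.

```python
# A tiny hand-picked stop list, for illustration only; it is NOT Spacy's list.
TOY_STOP_WORDS = {"a", "is", "it", "not", "of", "the", "this"}

words = ["It", "is", "not", "a", "real", "emergency"]

# Keep only the words whose lowercase form is not in the stop list.
content_words = [w for w in words if w.lower() not in TOY_STOP_WORDS]
print(content_words)
```

With real Spacy tokens, the same idea would use each token's `is_stop` attribute instead of a hand-made set.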

Due to a bug in some recent versions of Spacy, `.is_stop` doesn't work correctly in some situations.  We won't go into the details of this, but that magic incantation I did a second ago. . .

```
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
```

. . . will fix it.  (If you get an error, you may be using an older version of Spacy; try `spacy.en.word_sets.STOP_WORDS` instead of `spacy.lang.en.stop_words.STOP_WORDS`.)

The token objects also have other useful attributes.  `is_punct` tells us whether the token is punctuation.  Our `tok` (which, remember, is "is") is not punctuation:

In [None]:
tok.is_punct

But the second-to-last token in our sentence is a period, as `is_punct` tells us:

In [None]:
print(toks[-2])
toks[-2].is_punct

There is also the `.is_space` attribute to tell us if our token is one of those confounding "blank space" tokens we saw before.  In this case, the last token of our sentence is whitespace:

In [None]:
toks[-1].text

And we can check that with `.is_space`:

In [None]:
toks[-1].is_space

We can also see the part of speech of each word with the `.pos_` attribute.  Our first word is "It":

In [None]:
toks[0]

And its part of speech is "pronoun":

In [None]:
toks[0].pos_

Finally, you can get the lemma of a token with its `.lemma_` attribute:

In [None]:
toks[0].lemma_

The lemma of a word is its "base form", roughly corresponding to "the entry under which you would look for this word in the dictionary".  In this case, the lemma is the strange string `-PRON-`.  This is a "fake" lemma that is used as the base form for pronouns.  We can see some other examples of lemmas that are a bit more comprehensible.  For instance, the second word of our text, recall, is "is":

In [None]:
doc[1]

The lemma of this is "be", because "is" is an inflected form of "be":

In [None]:
doc[1].lemma_

Towards the end of our text we have the word "sentences":

In [None]:
doc[-8]

The lemma is "sentence" because "sentences" is just the plural of this base form:

In [None]:
doc[-8].lemma_

## Lexemes

Our `nlp` object also has a `.vocab` attribute that lets us access individual words out of context.  `.vocab` is like a dictionary whose keys are strings, and the values are "Lexeme" objects:

In [None]:
nlp.vocab['house']

These objects have some of the same attributes as the tokens we were working with before.  We can get their text, their stopword status, their punctuation status, etc.:

In [None]:
print(nlp.vocab['house'].text)
print(nlp.vocab['house'].is_stop)
print(nlp.vocab['house'].is_punct)
print(nlp.vocab['house'].is_space)

However, unlike the Tokens, the Lexeme does *not* have part-of-speech information:

In [None]:
nlp.vocab['house'].pos_

Why not?  Because, in general, just knowing the form of a word in isolation isn't enough to know its part of speech.  You need to know the context in which it was used.  "House", for instance, could be a noun as in "He bought a big house in the country", but it could also be a verb as in "The city has to find some way to house the homeless".  So the Lexeme can only give us information that's known about the word in isolation (such as whether it's punctuation), but not information that depends on its context (such as part of speech).  Here's a simple example:

In [None]:
for token in nlp("The population abandoned the town.  The abandoned town was soon forgotten."):
    print(token, token.pos_, end=" / ")

You can see that the first occurrence of the word "abandoned" is (correctly) tagged as a verb, while the second is (correctly) tagged as an adjective.

Lemmas are similar to parts of speech in this regard.  We can't get the lemma from a lexeme because in some contexts a given word is a derivation, which is considered a new lexeme, while in other contexts the same word is an inflection, which is considered a variation on a base lexeme.  We can see this with our same example:

In [None]:
for token in nlp("The population abandoned the town.  The abandoned town was soon forgotten."):
    print(token, token.lemma_, end=" / ")

Note that the lemma of the first "abandoned" is "abandon", since in this case "abandoned" is an inflection of the verb "abandon".  But the lemma of the second "abandoned" is just "abandoned", since here the word "abandoned" is not an inflection of a verb but a "real" new word, an adjective that is derived from a verb but is more distinct from it.  

This distinction can be subtle and we won't go into it in detail.  Suffice it to say that to get the lemma, you need to have a Token (that is, a word in context), not just a Lexeme (a word in isolation).

A Lexeme also has a `.prob` attribute that indicates its relative frequency.  Frequent words have higher `prob` values:

In [None]:
nlp.vocab['the'].prob

While infrequent ones have lower values:

In [None]:
nlp.vocab['defenestrate'].prob

Technically, the values are the natural logarithm of the word's relative frequency (which is why they're all negative).  We can get the real proportion of the word in the overall corpus by taking the number e to the power of the prob.  The function `exp` from the builtin `math` module will do this, for instance:

In [None]:
import math
math.exp(nlp.vocab['the'].prob)

This means that in the gigantic database which Spacy's numbers are based on, about 2.9% of the words were "the".
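
To make the arithmetic concrete, here is the round trip in both directions.  The 0.029 is just the "about 2.9%" figure from the text, not a value read from Spacy.

```python
import math

proportion = 0.029               # "about 2.9%", as in the text above
log_prob = math.log(proportion)  # the kind of value that .prob holds
print(log_prob)                  # negative, because the proportion is below 1

recovered = math.exp(log_prob)   # undo the logarithm
print(recovered)
```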

One thing to watch out for is that misspellings exist in Spacy's source data and so you can get data for them just like you can for "real" words:

In [None]:
nlp.vocab['teh'].prob

The common misspelling *teh* is actually more frequent than the real but uncommon word *defenestrate*!  (In fact even words that don't exist in the corpus at all will have a `prob` value, namely -20.)  Also, different capitalizations count as different words:

In [None]:
nlp.vocab['The'].prob

In [None]:
nlp.vocab['THE'].prob

This means that if you want to know the true overall frequency of "the", regardless of case, you'll have to do some extra processing.
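
One way to do that extra processing is sketched below.  The log-probabilities are made-up placeholders rather than values from Spacy.  The point is only that logs can't be added directly, so we convert back to proportions first.

```python
import math

# Made-up log-probabilities for three capitalizations of one word.
# These numbers are placeholders, not values read from Spacy.
log_probs = {"the": -3.5, "The": -5.4, "THE": -9.0}

# Log-probabilities cannot be added directly: convert each back to a
# proportion, add the proportions, then take the log of the total.
total_proportion = sum(math.exp(lp) for lp in log_probs.values())
combined_log_prob = math.log(total_proportion)

print(combined_log_prob)
```

The combined value is a little higher (closer to zero) than the largest individual one, since the rarer capitalizations add only a small amount.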

Although we didn't mention it before, Tokens also have `.prob`:

In [None]:
print(doc[0].text)
doc[0].prob

It's important to remember, though, that `.prob` represents the relative frequency of the word *in Spacy's source data* (which we assume to be a fairly realistic approximation to "the language as a whole"), and *not* the frequency in the particular text that you're parsing with Spacy.  So `doc[0].prob` there has nothing to do with how often the word "This" appears in our little sample text; it has to do with how often it occurs out in the wide world.
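
If what you want is the frequency of a word *within your own text*, you have to count it yourself.  Here is a minimal sketch using plain strings (the word list is typed in by hand here; with Spacy you would build it from the `.text` of each token):

```python
import collections

words = "This is a test . It is a test of your home broadcasting system .".split()

# Lowercase everything so that capitalization doesn't split the counts.
counts = collections.Counter(word.lower() for word in words)
total = sum(counts.values())

# Proportion of the text's items that each word accounts for.
relative_freq = {word: n / total for word, n in counts.items()}
print(relative_freq["test"])
```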

## Word vectors

Spacy also gives us access to "word vectors".  A word vector is essentially a sequence of numbers (300 in our case) that represent the "location" of a word in a high dimensional space.  Just as in high school math you may have used something like `(3, 4)` to represent a certain point on a grid (3 units right on the x-axis, 4 units up on the y-axis), the 300-element word vector represents "where" a word is in a 300-dimensional space.  The idea is that words with similar meanings will be close to each other in this space.  However, as we'll see, sometimes the nature of "similarity" is not what we expect.

We can read the word vectors from a word in our text:

In [None]:
print(doc[1])
doc[1].vector

Somewhat mysterious!  We can also get the vector from the `.vocab` along the lines we saw before:

In [None]:
nlp.vocab['is'].vector

Now, we can't get much meaningful information just by looking at the vector.  In fact, a single vector by itself is essentially meaningless, because the dimensions of this space don't have any particular interpretation.  (It's not like one dimension tells us how "good" a word is and another tells how "tofu-related" its meaning is or anything like that.)

Instead, the vectors have meaning only in *relation* to other vectors.  We can compute the similarity between two words with the `.similarity` method:

In [None]:
nlp.vocab['dog'].similarity(nlp.vocab['cat'])

The result is a cosine similarity score.  In principle it can range from -1 to 1, but for most pairs of words it falls between 0 and 1.  Larger numbers indicate greater similarity between the two words.  For instance, "dog" is more similar to "cat" than it is to "apple":

In [None]:
nlp.vocab['dog'].similarity(nlp.vocab['apple'])

And if we pick something that's almost a synonym, like "canine", the similarity will be almost perfect:

In [None]:
nlp.vocab['dog'].similarity(nlp.vocab['canine'])
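
As far as I know, the number Spacy reports by default is the *cosine similarity* of the two vectors: a measure of whether they point in the same direction, ignoring their lengths.  Here is a hand-rolled sketch of that calculation on toy vectors (invented for illustration, with 3 elements instead of 300):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-element vectors, invented for illustration.
v = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(v, [2.0, 4.0, 6.0])
opposite_direction = cosine_similarity(v, [-1.0, -2.0, -3.0])
perpendicular = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction, opposite_direction, perpendicular)
```

Vectors pointing the same way score about 1, perpendicular ones score 0, and ones pointing in opposite directions score about -1.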

However, it's vital to note that these similarity measures can be misleading if not interpreted carefully.  Here's a relatively tame example:

In [None]:
print("Similarity between moral and liberal is", nlp.vocab['moral'].similarity(nlp.vocab['liberal']))
print("Similarity between moral and conservative is", nlp.vocab['moral'].similarity(nlp.vocab['conservative']))

Aha!  Clear, incontrovertible evidence that liberals are more moral than conservatives!  Or is it?  How about this:

In [None]:
print("Similarity between immoral and liberal is", nlp.vocab['immoral'].similarity(nlp.vocab['liberal']))
print("Similarity between immoral and conservative is", nlp.vocab['immoral'].similarity(nlp.vocab['conservative']))

Oops, liberals are also more immoral.  How can this be?  Let's try another:

In [None]:
print("Similarity between moderate and liberal is", nlp.vocab['moderate'].similarity(nlp.vocab['liberal']))
print("Similarity between moderate and conservative is", nlp.vocab['moderate'].similarity(nlp.vocab['conservative']))

Conservatives are clearly, clearly much more toward the middle of the road.  But wait. . .

In [None]:
print("Similarity between extreme and liberal is", nlp.vocab['extreme'].similarity(nlp.vocab['liberal']))
print("Similarity between extreme and conservative is", nlp.vocab['extreme'].similarity(nlp.vocab['conservative']))

It turns out conservatives are also more extreme.  Perhaps we should try this:

In [None]:
print("Similarity between moral and immoral is", nlp.vocab['moral'].similarity(nlp.vocab['immoral']))
print("Similarity between moderate and extreme is", nlp.vocab['moderate'].similarity(nlp.vocab['extreme']))
print("Similarity between liberal and conservative is", nlp.vocab['liberal'].similarity(nlp.vocab['conservative']))

Now we begin to see what's going on.  Opposites like "moral" and "immoral" are actually fairly similar to each other --- as similar as, or more similar than, they are to other things like "liberal" and "conservative" with which we may be tempted to compare them.  Moreover, sometimes these tempting "interesting" similarities are dwarfed by off-the-wall ones:

In [None]:
print("Similarity between ilk and liberal is", nlp.vocab['ilk'].similarity(nlp.vocab['liberal']))
print("Similarity between ilk and conservative is", nlp.vocab['ilk'].similarity(nlp.vocab['conservative']))

These scores are some of the highest we've seen so far!  It's comforting to know that even in these times of deep ideological division, we can find common ground in. . . ilk.

Although you'll often hear word vectors talked about as capturing *similarity*, in fact they are strongly influenced by *associations*.  But associations don't necessarily indicate similarity.  If you google for "liberal ilk" and "conservative ilk" you can find many articles referring to "CNN and their liberal ilk", "the Hollywood ilk and other liberals", "those of conservative ilk", "ultra-conservative candidates backed by the Tea Party and its ideological ilk", etc.  It seems plausible that these words have high similarity simply because people tend to use the word "ilk" when talking about liberals and conservatives (and also other "-isms" or words describing ideologies and other things of that. . . ilk).  Also, as we see, antonyms can often have quite high similarity scores, because their patterns of usage are very similar.

It's also important to remember that, in cases like these where we're working with precomputed word vectors, we don't actually know why the similarities are what they are.  The algorithms used to compute them are quite complex, so it's not easy to be sure about an explanation for particular similarity scores.  Also, the Spacy vectors were computed using a dataset called "Common Crawl" which essentially consists of a huge set of web pages.  There's a lot of weird stuff out there on the web.  Which brings us to our next example. . .

In [None]:
# don't run this until you've answered the ClassQuestion poll!
for other in ('cold', 'homegrown', 'scalding', 'scorching', 'warm', 'wet'):
    print(other, nlp.vocab['hot'].similarity(nlp.vocab[other]))

Oh dear.  We can probably guess why that turned out that way.  But notice again that "cold" is quite similar to "hot" --- more so than words which we might consider closer in meaning, like "warm" and "scalding".

These are all things to keep in mind if you're working with word vectors.  An important point is that when you work with *many* words and aggregate the results together, many of the idiosyncrasies of individual word similarities will often (but not always) be smoothed out.  The danger of misinterpretation is greater when you look at *individual* similarities, especially if you cherry-pick them without adequately exploring all the connections.  For instance, if as above we compare "moral" and "immoral" to "liberal" and "conservative", it's vital to compare "moral" and "immoral" *to each other* and "liberal" and "conservative" *to each other* to at least minimally cover our bases.
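
One simple habit that helps with covering our bases is to print *every* pairwise comparison in a group of words, rather than only the ones we thought of.  The sketch below does this with `itertools.combinations`.  The vectors are invented 3-element stand-ins, so the scores themselves mean nothing; with Spacy you would call `.similarity` on the real lexemes instead.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 3-element vectors standing in for real word vectors.
toy_vectors = {
    "moral": [0.9, 0.1, 0.3],
    "immoral": [0.8, 0.2, 0.4],
    "liberal": [0.2, 0.9, 0.1],
    "conservative": [0.3, 0.8, 0.2],
}

# Every unordered pair of words, so no comparison is left out.
pairs = list(itertools.combinations(toy_vectors, 2))
for first, second in pairs:
    score = cosine_similarity(toy_vectors[first], toy_vectors[second])
    print(f"{first:>12} ~ {second:<12} {score:.3f}")
```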

Despite these potential dangers, word vectors are quite useful to have around.  One of the most useful things we can do with them is combine them into vectors that represent entire sentences or documents.  Spacy actually already does a simple version of this for us.  Each sentence has a vector too.  (I'm using `[:5]` here to keep the output short, but it has 300 elements just like the others.)

In [None]:
list(doc.sents)[2].vector[:5]

And in fact even the entire document has a vector:

In [None]:
doc.vector[:5]

How are these higher-level vectors obtained?  They're just the average of all the vectors of the words within them.  We can see that by taking the average ourselves and comparing it.

In [None]:
sum_of_word_vectors = sum(tok.vector for tok in doc)
average_of_word_vectors = sum_of_word_vectors/len(doc)
(average_of_word_vectors == doc.vector).all()

(The vectors are `numpy` arrays, an object that is somewhat similar to a Series in Pandas.  We won't spend too much time worrying about what these objects are, but note that, as shown here, you can use the `.all()` method to determine whether all elements of the vector are True.)  In this case the result means that our hand-computed average of word vectors is the same as `doc.vector`.  Similarly, the vector for a sentence is the average of the vectors of the words in the sentence.
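
One caution about comparing with `==`: floating-point arithmetic can leave tiny rounding differences, so two averages computed in slightly different ways may or may not come out exactly equal.  A comparison with a small tolerance is more robust.  Here is a sketch with plain Python lists and made-up numbers:

```python
import math

vectors = [[0.1, 0.2], [0.2, 0.4], [0.3, 0.6]]
count = len(vectors)

# Average computed as (sum of the vectors) / count.
average_a = [sum(column) / count for column in zip(*vectors)]

# The same average computed as the sum of (each vector / count).
average_b = [sum(x / count for x in column) for column in zip(*vectors)]

# Exact equality can be upset by rounding; a tolerance check is not.
exactly_equal = all(x == y for x, y in zip(average_a, average_b))
nearly_equal = all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(average_a, average_b))
print(exactly_equal, nearly_equal)
```

With `numpy` arrays, `numpy.allclose` plays the same role as the `math.isclose` check here.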

This is a reasonable way to combine word vectors into vectors representing larger units.  In practice, however, you'll often want to do something a bit different.  In particular, it is usually a good idea to remove stop words before combining the remaining words into a single vector.  Otherwise the vector for your sentence or document will be "cluttered" with values from uninformative words like "the" and "of", which occur frequently in most texts.  We're not going to do much with word vectors now, so we won't go into this, but it's something to think about if you want to try playing around with word vectors in your project.
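
To give a feel for the difference, here is a sketch that averages a handful of toy vectors with and without stop words.  Both the 2-element vectors and the stop list are invented for illustration; neither comes from Spacy.

```python
# Invented stop list and 2-element vectors, for illustration only.
TOY_STOP_WORDS = {"the", "of", "is", "a"}

toy_vectors = {
    "the": [0.0, 1.0],
    "purpose": [4.0, 2.0],
    "of": [0.0, 1.0],
    "test": [2.0, 6.0],
}


def mean_vector(words, vectors):
    """Element-wise average of the vectors for the given words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


words = ["the", "purpose", "of", "the", "test"]
content_words = [w for w in words if w not in TOY_STOP_WORDS]

mean_all = mean_vector(words, toy_vectors)              # every word
mean_content = mean_vector(content_words, toy_vectors)  # stop words removed
print(mean_all)
print(mean_content)
```

The first average is dragged toward the stop-word vectors, while the second reflects only "purpose" and "test".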

## Dependency parsing

Spacy also automatically does what's called "dependency parsing", which means it can parse the grammatical structure of a text.  Let's suppose we have this simple text (thanks again to Gertrude Stein):

In [None]:
text = "I wish that I had spoken only of it all."

Spacy includes a mechanism for displaying how it thinks the sentence should be parsed.  You can use the `render()` function from the `displacy` submodule.  Including the `jupyter=True` argument will display the result right in the notebook:

In [None]:
doc = nlp(text)
spacy.displacy.render(doc, jupyter=True)

Roughly speaking, each arrow links from the "head" of a syntactic constituent to a smaller constituent contained within it (although the notions of "head" used here don't always match perfectly with those in some formal theories).  The labels on the arrows indicate the nature of the syntactic dependency.  We can get a description of a label using `spacy.explain`:

In [None]:
spacy.explain("nsubj")

So when we see an arrow from "wish" to "I" with the label "nsubj", that's telling us that the subject of "wish" is "I".
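
One way to think about the parse is as a list of (head, label, child) arrows.  The sketch below stores a few arrows by hand and looks one up.  Only the "wish" to "I" arrow comes from the text above; the other two are hand-written guesses at a plausible parse, not Spacy's output.  It also identifies words by their text alone, which glosses over the fact that "I" occurs twice in the sentence.

```python
from collections import namedtuple

# Hypothetical record type for one arrow in the diagram.
Arc = namedtuple("Arc", ["head", "label", "child"])

arcs = [
    Arc("wish", "nsubj", "I"),       # the arrow described in the text above
    Arc("wish", "ccomp", "spoken"),  # hand-written guess, not Spacy output
    Arc("spoken", "aux", "had"),     # hand-written guess, not Spacy output
]


def children_of(head, label, arcs):
    """Return the words attached to `head` by an arrow with the given label."""
    return [arc.child for arc in arcs if arc.head == head and arc.label == label]


subjects = children_of("wish", "nsubj", arcs)
print(subjects)
```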

This example is fairly "clean".  Real data, however, is often much less clean, and this causes some problems for Spacy's parser.  Let's take a new text, namely [a Yelp review](https://www.yelp.com/biz/hook-and-press-donuts-santa-barbara?hrid=OlobztuQ0LQF5EyOVTxdVg&utm_campaign=www_review_share_popup&utm_medium=copy_link&utm_source=(direct)) of a recently-opened donut shop in Santa Barbara:

In [None]:
text = """How to eat a @hookandpress donut.
Step 1 reach across the counter and slap whichever sad excuse for a hipster decided they could make "artisanal donuts "
Step 2 drive to milpas go to eller's donuts eat one of the lightest fluffiest gifts from the gods in existence pay $1.75 and all in the world will be right again
These were some of the most overhyped donuts I've ever eaten the gluten was over developed they were down right chewy as a bagel flavor was far from balanced and a disgusting amount of sugar save your money go buy a real  besides 3.90 for a donut is outrageous sometimes you need to say just cause it's pretty doesn't mean it doesn't suck and young techie hipsters please start saying no to $3+ donuts"""

doc = nlp(text)

Let's try to look at the sentences in this text:

In [None]:
for ix, sent in enumerate(doc.sents):
    print(ix, "--", sent)

Oh dear.  We can already see that Spacy is having difficulty parsing this.  It's put in sentence breaks in odd places (like between the parts of "Step 2").  Let's see how it handles the dependency parsing.  Step 1 doesn't look too bad:

In [None]:
spacy.displacy.render(list(doc.sents)[1], jupyter=True)

But if we try to parse that monster sentence #4. . .

In [None]:
spacy.displacy.render(list(doc.sents)[4], jupyter=True)

It's hard to even make sense of this, but you can see at least that Spacy is making some mistakes.  It's also easy to see why: this reviewer didn't use much punctuation, so it's difficult for Spacy to correctly decide where to break things up.  For instance, where the text says:

> These were some of the most overhyped donuts I've ever eaten the gluten was over developed they were down right chewy as a bagel flavor was far from balanced

Spacy thinks that "eaten the gluten" is a verb phrase with "the gluten" as the object, and it thinks that the donuts were "chewy as a bagel flavor".  For a human it's fairly easy to break it up correctly.  If we were to insert punctuation it would probably go like this:

> These were some of the most overhyped donuts I've ever eaten.  The gluten was over developed.  They were down right chewy as a bagel.  Flavor was far from balanced.

But Spacy has a harder time.

## Summary

This is only a brief introduction to some of what Spacy gives us.  We've looked briefly at a few main kinds of information Spacy can give us about a text:

1. Basic word-level info (is it punctuation, is it a stop word, etc.)
2. Word frequency
3. Part of speech
4. Word vectors
5. Dependency parsing

Any or all of these might be useful to you in doing your project, which of course you are thinking about all the time, right?  Of course.  Naturally.