## Exploratory Data Analysis with the Natural Language Toolkit

This notebook is a short introduction to exploratory data analysis with the Natural Language Toolkit (NLTK), an open-source Python library widely used in Computational Linguistics, Digital Humanities, and other fields. The notebook will demo a few quick ways to get started with this powerful library. If you want to learn more, consult the [NLTK book](https://www.nltk.org/book/), a free online resource released under a Creative Commons license.

Most of you have already installed NLTK and other associated libraries since we used them in the previous notebook on sentiment analysis. But just in case, run the following cells:

In [None]:
import sys
!{sys.executable} -m pip install --user -U nltk

In [None]:
!{sys.executable} -m pip install matplotlib

In [None]:
!{sys.executable} -m pip install --user -U numpy

In [None]:
import nltk
nltk.download('punkt')
nltk.download('movie_reviews')

Now let's import NLTK into our notebook (remember that you always need to use the import command for any Python libraries you want to use, whether that's TextBlob, Markovify, NLTK, or something else):

In [None]:
import nltk

Let's also download all the subsidiary packages associated with NLTK (this is a large download, so it may take a while):

In [None]:
nltk.download('all')

Recall that in order to use TextBlob, we had to "blobify" a text before we could do things like calculate sentiment. Similarly, in order to do some exploratory data analysis with NLTK, we need to put a wrapper around our novella so that it functions as an NLTK text object. The next few cells show you how to do that. First let's open and read our file. This code should look very familiar to you by now. I'm using "The Yellow Wallpaper" as my example, but you can substitute your own text file to work with a different text. Just be sure it's in UTF-8 format! I'll assign "The Yellow Wallpaper" to the variable "yellow_wallpaper". You can use whatever variable name you want, as long as you're consistent; alternatively, feel free to retain my variable names when analyzing your own text, just to avoid errors:

In [None]:
with open("yellow_wallpaper2.txt", encoding="utf-8") as f:
    yellow_wallpaper = f.read()

Now we're going to import some more NLTK subsidiary libraries, as well as "tokenize" our text: that is, split it into smaller units, in this case words. NLTK has built-in functions to do this. We'll assign the tokenized version of "yellow_wallpaper" to a new variable, "split_wallpaper" (try to monitor and track how we're using variables):

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import Text
split_wallpaper = word_tokenize(yellow_wallpaper)
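To get a feel for what tokenizing does differently from simply splitting on spaces, here is a small sketch. It uses a rough regular expression as a stand-in for a tokenizer (my own simplification, not NLTK's actual rules), applied to a short sample sentence:

```python
import re

sample = "I don't like it a bit. I wonder, I begin to think."

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
whitespace_words = sample.split()

# Rough stand-in for a tokenizer: a run of letters/digits/apostrophes,
# or a single punctuation mark. NLTK's word_tokenize applies more rules
# than this (for example, it also splits contractions such as "don't").
rough_tokens = re.findall(r"[\w']+|[^\w\s]", sample)

print(whitespace_words)
print(rough_tokens)
print(len(whitespace_words), len(rough_tokens))
```

Notice that the tokenized version is longer than the whitespace-split version, because each punctuation mark becomes a unit of its own.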

We're finally ready to turn "The Yellow Wallpaper" into an NLTK text object. Note how we're introducing yet another variable! This time it's "final_wallpaper." Basically, each time we transform the contents of one variable, we're assigning the transformed version of the text to a new variable. 

In [None]:
final_wallpaper = Text(split_wallpaper)

Much of this preliminary code is about pre-processing our text so that we can do interesting things with it. That pre-processing also involves adhering to boilerplate (conventional) code that the NLTK library requires. We're now in a position to do something interesting with our text. Let's create a dispersion plot:

In [None]:
final_wallpaper.dispersion_plot(["John"])  # sometimes you need to rerun the cell to see the graph

A dispersion plot visualizes how a particular word is distributed over a text. We might search for the name of a character to discover where they're invoked most frequently and, conversely, where mentions or appearances of that character drop off. Are there certain chapters or sections of a novel that are saturated with mentions, and others in which that character (or theme or refrain or keyword) is absent? In this instance, I've created a dispersion graph showing all the appearances of "John," the protagonist's physician husband. As we can see, the narrator keeps up a steady stream of references to him. When reading and interpreting the graph, keep in mind that each tick represents a single occurrence of the search term. Let's try visualizing the word "woman":

In [None]:
final_wallpaper.dispersion_plot(["woman"])

Take a close look at the x-axis of the graph: the number on the far left is "3500" instead of "0" as it was with the previous graph for "John". Can you see why? The reason is that the word "woman" doesn't make an immediate appearance in the novella. "Word offset," the label for the x-axis, refers to how many sequential words we read or encounter before our relevant term appears. It would be nice if we could adjust the graph so that we can see blank space signifying the span of text at the beginning where "woman" is absent. Similarly, the x-axis is abruptly cut off at "6500" on the right side since the term "woman" doesn't appear at the end. One way to approach this problem is to first ascertain how many words are in the novella. We can do that with Python's built-in len() function, which we saw in one of our earlier notebooks. The string split() method splits the text into individual words so that they can be counted using len():

In [None]:
len(yellow_wallpaper.split())

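Voilà: we've got 6078 words in "The Yellow Wallpaper." The plot's offsets count NLTK tokens, which include punctuation marks, so they run somewhat higher than this whitespace-based count; a round upper limit of 7000 leaves room for that. Now we can use that info to adjust our graph (see the second line of code, below, where we've plugged in the relevant numbers for the x-axis range):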

In [None]:
import matplotlib.pyplot as plt
plt.xlim((0,7000))
final_wallpaper.dispersion_plot(["woman"])

Now we can see that there's a sizable span of text at the beginning where no mention of "woman" occurs.  
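If you'd like to see the numbers behind the ticks, here is a minimal sketch of what a "word offset" is. The helper name `word_offsets` and the sample tokens are my own inventions for illustration; the matching is exact (case-sensitive), which is an assumption of this sketch:

```python
def word_offsets(tokens, word):
    """Return the position (offset) of every exact match of `word` in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# A made-up token list, not a quotation from the novella.
sample_tokens = "the paper is dull . a woman creeps behind the paper . the woman shakes it".split()

offsets = word_offsets(sample_tokens, "woman")
print(offsets)                                       # one entry per tick on the plot
print(offsets[0], offsets[-1], len(sample_tokens))   # first tick, last tick, total tokens
```

In the notebook you could call `word_offsets(split_wallpaper, "woman")` instead: the first and last values show where the ticks begin and end, and `len(split_wallpaper)` gives the total number of tokens.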

Now let's plot more than one term in our dispersion graph (you'll want to adjust the size of the graph using the technique I just covered for whatever text you want to analyze):

In [None]:
plt.xlim((0,7000))
final_wallpaper.dispersion_plot(["garden", "woman", "paper", "wallpaper"])
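Depending on your NLTK version, an x-axis range set with plt.xlim before the call may not carry over to the figure that dispersion_plot draws. If that happens to you, one workaround is to draw the plot directly with matplotlib. The sketch below is my own approximation, not NLTK's code: the names `dispersion_points` and `plot_dispersion` are hypothetical helpers, and the matching is exact (case-sensitive).

```python
def dispersion_points(tokens, words):
    """Return (xs, ys): the token offset and row index of every match of a word in `words`."""
    row = {word: y for y, word in enumerate(reversed(words))}  # first word ends up on the top row
    xs, ys = [], []
    for x, tok in enumerate(tokens):
        if tok in row:
            xs.append(x)
            ys.append(row[tok])
    return xs, ys

def plot_dispersion(tokens, words):
    """Draw a dispersion plot whose x-axis always spans the whole token list."""
    import matplotlib.pyplot as plt  # imported here so the helper above works without matplotlib
    xs, ys = dispersion_points(tokens, words)
    fig, ax = plt.subplots()
    ax.plot(xs, ys, "|")                        # one vertical tick per occurrence
    ax.set_yticks(list(range(len(words))))
    ax.set_yticklabels(list(reversed(words)))
    ax.set_ylim(-1, len(words))
    ax.set_xlim(0, len(tokens))                 # start at 0 and end at the last token
    ax.set_xlabel("Word Offset")
    return ax

# A made-up token list, not a quotation from the novella.
sample_tokens = "a woman in the garden . the paper . the woman".split()
points = dispersion_points(sample_tokens, ["garden", "woman"])
print(points)
```

In the notebook, `plot_dispersion(split_wallpaper, ["garden", "woman", "paper", "wallpaper"])` should then produce a graph that spans the full text without any manual x-axis adjustment.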

We can use NLTK's built-in concordance to see each word in context. A concordance is a list of every word in a text that also includes quotations from the relevant passages in which it occurs. Let's try "John":

In [None]:
final_wallpaper.concordance('John')

Note that by default only 25 of the 45 references to "John" are displayed. To show all 45 occurrences, we need to use an additional parameter, "lines", supplying the integer 45 as its value:

In [None]:
final_wallpaper.concordance('John', lines=45)

Do you notice anything interesting about the contexts in which "John" appears? The construction "John says" recurs numerous times. What does it suggest about the relationship between the protagonist and her husband? 
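One way to test a hunch like this is to count which word comes right after each "John". The sketch below shows the idea in plain Python on a made-up token list (not quotations from the novella); `kwic` and `following_words` are hypothetical helper names, and `kwic` is only a rough imitation of what a concordance displays:

```python
from collections import Counter

def kwic(tokens, word, window=4):
    """Return each occurrence of `word` with up to `window` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def following_words(tokens, word):
    """Count which token comes immediately after each occurrence of `word`."""
    return Counter(tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word)

sample_tokens = "John says I must rest . John laughs at me . but John says no".split()

for line in kwic(sample_tokens, "John", window=2):
    print(line)
print(following_words(sample_tokens, "John").most_common())
```

On the real text, `following_words(split_wallpaper, "John").most_common(5)` would list the five words that most often follow "John", which lets you check how often "says" really recurs.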

Another neat feature of NLTK is the ability to identify words that share common contexts with other words. As a simple example, consider the following two contrived sentences:  
1. They took a walk through the garden.
2. They took a walk through the arbor.  

In these examples, "garden" and "arbor" share an identical linguistic context. While not interchangeable in meaning, their common contexts suggest a parallelism or symmetry that may invite further inquiry or reflection. Let's see what words share a common context with "paper" and "wallpaper" in "The Yellow Wallpaper":

In [None]:
final_wallpaper.similar('paper')

In [None]:
final_wallpaper.similar('wallpaper')
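Before we interpret the results, it may help to see roughly what "sharing a context" means computationally. The sketch below treats a word's context as the pair of tokens on either side of it and runs on the two contrived sentences from above. NLTK's similar() does more than this (for instance, it ranks the candidates), so treat the sketch as an illustration of the idea rather than of the library's internals; both function names are my own.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous token, next token) pairs it appears between."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def shares_context_with(tokens, target):
    """Return the words that occur in at least one of the same contexts as `target`."""
    contexts = contexts_by_word(tokens)
    target_contexts = contexts.get(target, set())
    return sorted(w for w, ctx in contexts.items() if w != target and ctx & target_contexts)

tokens = ("They took a walk through the garden . "
          "They took a walk through the arbor .").split()

print(shares_context_with(tokens, "garden"))
```

Here "arbor" is the only word that turns up, because it is the only other word that sits between "the" and the full stop.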

Interesting! The wallpaper inside the mansion shares common contexts with the gardens outside ("garden," "roses"). Why might that be? One thing that comes to mind is that the woman the protagonist sees creeping in the wallpaper also mysteriously appears in the gardens. Here are some extended excerpts from the novella. First let's take a look at the passages that describe the woman trapped in the wallpaper:

>There are things in that paper that nobody knows but me, or ever will.

>Behind that outside pattern the dim shapes get clearer every day.

>It is always the same shape, only very numerous.

>And it is like a woman stooping down and creeping about behind that pattern. I don’t like it a bit. I wonder—I begin to think—I wish John would take me away from here!

>At night in any kind of light, in twilight, candlelight, lamplight, and worst of all by moonlight, it becomes bars! The outside pattern I mean, and the woman behind it is as plain as can be.

>I didn’t realize for a long time what the thing was that showed behind,—that dim sub-pattern,—but now I am quite sure it is a woman.


And now let's take a look at the passages that describe the woman outside in the garden:

>I think that woman gets out in the daytime!

>And I’ll tell you why—privately—I’ve seen her!

>I can see her out of every one of my windows!

>It is the same woman, I know, for she is always creeping, and most women do not creep by daylight.

>I see her on that long shaded lane, creeping up and down. I see her in those dark grape arbors, creeping all around the garden.

>I see her on that long road under the trees, creeping along, and when a carriage comes she hides under the blackberry vines.

>I don’t blame her a bit. It must be very humiliating to be caught creeping by daylight!

>I always lock the door when I creep by daylight. I can’t do it at night, for I know John would suspect something at once.

>And John is so queer now, that I don’t want to irritate him. I wish he would take another room! Besides, I don’t want anybody to get that woman out at night but myself.

>I often wonder if I could see her out of all the windows at once.

>But, turn as fast as I can, I can only see out of one at one time.

>And though I always see her she may be able to creep faster than I can turn!

>I have watched her sometimes away off in the open country, creeping as fast as a cloud shadow in a high wind.



Things to try: experiment with other semantically significant words from "The Yellow Wallpaper" (e.g., "creeping", "yellow"). Then try analyzing a novel of your choice, experimenting with dispersion plots as well as the concordance and similarity features of NLTK. Remember that [Project Gutenberg](https://www.gutenberg.org/) is a great source for UTF-8 texts.