*In case you are wondering what this little bit of "magic" does: it tells Jupyter to embed whatever graphic `matplotlib` creates in the current page. The second line sets the figure size to 12 by 6 inches, which gives a larger graphic with a 2:1 aspect. To my mind that fits better on a page than other aspects, but you can adjust it as you like. I place this at the top of a notebook so that I do it first and then get on with the work I want to do. If I am creating nothing graphical, it's not necessary.*

In [None]:
%pylab inline
figsize(12, 6)

# 3. Exploring a Text with the NLTK


In the first two notebooks we learned how to load a file, create a word list out of a text string, and then count words and visualize those counts. In this notebook we will use the `NLTK` to explore how various words occur and are used within a text. Once you have seen how these commands work, feel free to restart the notebook and load your own text. In the next notebook, we will look at how to load more than one text.

The first thing we are going to do is load the libraries we know we are going to use, load our text file, and create our word list.

In [None]:
import re
import nltk

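# Read the whole file into one string; "with" closes the file afterwards.
with open('texts/mdg.txt', 'r') as f:
    mdg = f.read()
# Keep only letters and apostrophes, lowercase everything, split into words.
mdg_words = re.sub("[^a-zA-Z']", " ", mdg).lower().split()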
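
To see what that regular expression keeps and what it throws away, here is a minimal sketch run on a single made-up sentence rather than on the file. The sentence and the name `sample_words` are only for illustration; the pattern is the same one used above.

```python
import re

# A made-up sample sentence; the notebook itself reads texts/mdg.txt.
sample = "The hunter's dog didn't bark; it ran 20 yards!"

# Same pattern as above: anything that is not a letter or an apostrophe
# becomes a space, then we lowercase and split on whitespace.
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()

print(sample_words)
```

Notice that apostrophes survive, so contractions and possessives stay whole, while punctuation and digits disappear entirely.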

Once we have our text as our, by now normal, list of words, we need to prepare it for use by the `NLTK` as a text. The command to do this is frighteningly straightforward:

In [None]:
mdg_text = nltk.Text(mdg_words)

Once we have our text as, well, a text (or better, as an `nltk.Text` object), we can do some pretty amazing things, like build what linguists call a *key word in context (KWIC)* concordance:

In [None]:
# concordance() prints its results itself and returns None,
# so there is nothing to assign or to pass to print().
mdg_text.concordance("dangerous")

We can make this easier to manipulate if we move things around a bit, and then you can try searching for a few words yourself:

In [None]:
# Try any of the following: hunt, hunted, hunter, dark, jungle
myword = "jungle"
mdg_text.concordance(myword)
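
If you are curious about what a concordance is doing under the hood, the idea can be sketched in a few lines of plain Python. This is only an illustration, not the NLTK's implementation: the function name `kwic`, its `window` parameter, and the tiny word list are all made up for the example, and as far as I know the NLTK measures context in characters rather than in words.

```python
def kwic(words, target, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, word in enumerate(words):
        if word == target:
            # Take up to `window` words on either side of the keyword.
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits

# A tiny made-up word list standing in for mdg_words.
sample_words = "the jungle was dark and the jungle was quiet".split()

for left, word, right in kwic(sample_words, "jungle"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} {word} {right}")
```

Because `kwic` returns its results instead of printing them, you can keep them in a variable and work with them later, which `concordance()` does not let you do.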

We will examine more ways to understand the relationship between individual words and the text as their context in a moment. While we have our text as an object and a cluster of words in front of us, though, it is worth seeing where those words occur in the text, using the NLTK's dispersion plot function:

In [None]:
mdg_text.dispersion_plot(["hunter", "jungle", "dark"])

If you noticed that what you are doing above is feeding the `dispersion_plot()` function a list of words, you are beginning to get the hang of reading code. That means we can break our list out separately and then simply feed the name of the list to the function:

In [None]:
my_list = ["hunted", "hunter", "hunt"]
mdg_text.dispersion_plot(my_list)
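
A dispersion plot is, at bottom, a picture of word positions: each mark sits at the index where a word occurs in the word list. Here is a minimal sketch of collecting those positions in plain Python. The function name `word_offsets` and the sample word list are made up for illustration, and this is not the NLTK's own code.

```python
def word_offsets(words, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, word in enumerate(words):
        if word in offsets:
            offsets[word].append(position)
    return offsets

# A tiny made-up word list standing in for mdg_words.
sample_words = "the hunter hunted and the hunted hunter ran".split()

offsets = word_offsets(sample_words, ["hunter", "hunted", "hunt"])
print(offsets)
```

A word that never occurs, like "hunt" here, simply gets an empty list, which is why it would show up as an empty row in the plot.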

A lot of the functions for examining words within a text have pretty straightforward names. Let's take a look at a number of them:

In [None]:
myword = "hounds"
# similar() prints its results itself, so no print() is needed.
mdg_text.similar(myword)

In [None]:
# common_contexts allows you to see where two words are used similarly:
mdg_text.common_contexts(["hounds", "hunter"])
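
The notion of a "context" here is simply the pair of words on either side of a word. The following is a rough sketch of that idea in plain Python, not the NLTK's implementation: the function names and the sample word list are hypothetical, and for simplicity the sketch ignores the first and last words of the text.

```python
from collections import defaultdict

def context_index(words):
    """Map each word to the set of (previous word, next word) pairs around it."""
    contexts = defaultdict(set)
    for i in range(1, len(words) - 1):
        contexts[words[i]].add((words[i - 1], words[i + 1]))
    return contexts

def shared_contexts(words, first, second):
    """Return the contexts in which both words occur, sorted for display."""
    contexts = context_index(words)
    return sorted(contexts[first] & contexts[second])

# A tiny made-up word list standing in for mdg_words.
sample_words = "the hounds ran and the hunter ran but the hounds slept".split()

for before, after in shared_contexts(sample_words, "hounds", "hunter"):
    print(f"{before}_{after}")
```

Both words appear between "the" and "ran" in the sample, so that is the one context they share.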

In [None]:
# Please note that the collocations function uses the NLTK's stopwords list
# in the background. I am not yet sure how to feed it a custom stopword list.
mdg_text.collocations()
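
One possible way around the stopword question is to skip `collocations()` and use the NLTK's `BigramCollocationFinder` directly, which lets you filter on any list you like. This is a sketch under assumptions, not a definitive answer: the function name `collocations_with_stopwords`, the `my_stopwords` set, and the sample word list are all made up, and the choice of likelihood ratio and a minimum frequency of 2 reflects my understanding of what `collocations()` does rather than anything it guarantees.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def collocations_with_stopwords(words, stopwords, num=20, min_freq=2):
    """Rank word pairs, skipping any pair that contains a stopword."""
    finder = BigramCollocationFinder.from_words(words)
    # Drop pairs seen fewer than min_freq times.
    finder.apply_freq_filter(min_freq)
    # Drop pairs containing a very short word or one of our stopwords.
    finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords)
    return finder.nbest(BigramAssocMeasures.likelihood_ratio, num)

# my_stopwords is a hypothetical custom list; substitute your own.
my_stopwords = {"the", "most", "is", "of", "all"}
sample_words = "the most dangerous game is the most dangerous game of all".split()

print(collocations_with_stopwords(sample_words, my_stopwords))
```

With the real text you would pass `mdg_words` in place of `sample_words`. The function returns a list of word pairs rather than printing them, so you can keep working with the results.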

In [None]:
# This is a pretty straightforward function, but useful if you don't feel
# like consulting your usually much larger word frequency list.
mdg_text.count("dangerous")

In [None]:
mdg_text.concordance("dangerous")

Where does "dangerous" occur within the larger text?