*In case you are wondering what this little bit of "magic" does: it essentially tells Jupyter notebooks to embed whatever graphic is created by `matplotlib` into the current page.*

In [1]:
%matplotlib inline

# 3. Exploring a Text with the NLTK

In the first two notebooks we learned how to load a file, create a word list out of a text string, and then count words and visualize those counts. In this notebook we will use the Natural Language Toolkit (`NLTK`) to explore how various words occur or are used within a text. Once you have seen how these commands work, feel free to restart the notebook and upload your own text. In the next notebook, we will look at how to load more than one text.

## Load Text and Create Word List

The first thing we are going to do is load the libraries we know we are going to use, load our text file, and create our word list.

In [None]:
import re
import nltk

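# Read the whole file into one string.
with open('texts/hod.txt', 'r') as f:
    mdg = f.read()

# Replace everything except letters and apostrophes with a space,
# then lowercase the string and split it into a list of words.
words = re.sub("[^a-zA-Z']", " ", mdg).lower().split()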

## Create NLTK `Text`

Once we have our text as our, by now normal, list of words, we need to prepare it for use by the `NLTK` as a text. The command to do this is frighteningly straightforward:

In [None]:
mdg_text = nltk.Text(words)
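
As a quick check that the conversion worked, the sketch below pokes at a small `Text` built from a made-up word list. The sample words and the names `sample_words` and `sample_text` are mine, for illustration only. A `Text` still behaves much like the list it wraps, so `len()` and slicing work as you would expect:

```python
import nltk

# A tiny made-up word list standing in for the full `words` list.
sample_words = "the most dangerous game is the most dangerous hunt".split()
sample_text = nltk.Text(sample_words)

print(len(sample_text))   # number of words in the text
print(sample_text[0:3])   # slicing gives back the underlying words
```

You can run the same two lines on `mdg_text` to confirm that it holds as many words as your `words` list.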

## Concordances

Once we have our text as, well, a text (but better, because it is now an `nltk.Text`), we can do some pretty amazing things, like build what linguists call a *key word in context (KWiC)* concordance. The code for the concordance is simple and straightforward:

```python
my_text.concordance("word")
```

You can type the line just like that, and Jupyter notebook will deliver results:

```
Displaying 4 of 4 matches:
 that the cape buffalo is the most dangerous of all big game for a moment the g
r the cape buffalo is not the most dangerous big game he sipped his wine here i
 in the same slow tone i hunt more dangerous game rainsford expressed his surpr
reason after a fashion so they are dangerous but where do you get them the gene
```

I am somewhat used to putting anything I want to display inside a print function, but `concordance()` prints its results itself and returns `None`, so wrapping it in `print()` only adds a stray `None` to the output. In the code below I have also moved the word I want to enter into a variable so I can type it outside of the line of code. This makes it easier to think about making this code part of a `for` loop when I want to investigate a list of words, as the second cell does.

In [None]:
# Try any of the following: hunt, hunted, hunter, dark, jungle
word = "game"
mdg_text.concordance(word)

In [None]:
word_list = ['hunter', 'hunted', 'hunt']
for word in word_list:
    mdg_text.concordance(word)
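
If you are curious what a key word in context concordance is doing underneath, here is a minimal sketch in plain Python. It is not how the NLTK implements `concordance()`; the function name `kwic`, its `window` parameter, and the sample sentence are all my own inventions for illustration.

```python
def kwic(tokens, target, window=4):
    """Return (left context, hit, right context) for each occurrence of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            # Take up to `window` words on either side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# A made-up sample, already lowercased and split like our word list.
sample = "i hunt more dangerous game said the general it is a dangerous game".split()

for left, hit, right in kwic(sample, "dangerous", window=3):
    # Right-align the left context so the key words line up in a column.
    print(f"{left:>20} {hit} {right}")
```

Lining the key word up in a single column is what makes a concordance easy to scan: your eye can run down the middle and compare what comes before and after each hit.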

## Dispersion Plot

We will examine more ways to understand the relationship between individual words and the text around them in a moment. But while we have our text as an object and a cluster of words in front of us, it is worth seeing where those words occur across the text, using the NLTK's dispersion plot function:

In [None]:
mdg_text.dispersion_plot(["hunter", "jungle", "dark"])

If you noticed that what you are doing above is feeding the `dispersion_plot()` function a list of words, you are beginning to get the hang of reading code. It also means we can break our list out separately and then simply feed the name of the list to the function:

In [None]:
my_list = ["hunted", "hunter", "hunt"]
mdg_text.dispersion_plot(my_list)
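
A dispersion plot is, at bottom, a picture of word positions: each mark shows how far into the text a word turns up. The sketch below collects those positions in plain Python so you can see the numbers behind the picture. The function name `word_offsets` and the sample sentence are hypothetical, and this is a simplification rather than the NLTK's own code.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# A made-up sample, already lowercased and split like our word list.
sample = "the hunter left the jungle and the hunter slept".split()

print(word_offsets(sample, ["hunter", "jungle", "dark"]))
```

A word that never occurs, like "dark" in this sample, simply gets an empty list, which corresponds to an empty row in the plot.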

A lot of the functions for examining words within a text have pretty straightforward names. Let's take a look at a number of them:

In [None]:
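# similar() lists words that appear in the same contexts as myword.
# Like concordance(), it prints its own output, so no print() is needed.
myword = "eyes"
mdg_text.similar(myword)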

In [None]:
# common_contexts allows you to see where two words are used similarly:
mdg_text.common_contexts(["hounds", "hunter"])
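
To make "used similarly" a little more concrete: one simple way to think about a word's context is the pair of words on either side of it. The sketch below, in plain Python, collects those pairs for two words and keeps the ones they share. It is a simplified illustration, not the NLTK's implementation; the function name `contexts` and the sample sentence are my own.

```python
def contexts(tokens, target):
    """Collect the (previous word, next word) pairs around each occurrence of target."""
    found = set()
    # Skip the very first and last positions, which lack a neighbor on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

# A made-up sample, already lowercased and split like our word list.
sample = "the hunter ran and the hounds ran while the hunter slept".split()

# Contexts that both words share are the intersection of the two sets.
shared = contexts(sample, "hunter") & contexts(sample, "hounds")
print(shared)
```

In this sample both words appear between "the" and "ran", so that pair is the one context they have in common.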

In [None]:
# Please note that the collocations function uses the NLTK's stopwords list
# in the background. I am not yet sure how to feed it a custom stopword list.
mdg_text.collocations()

In [None]:
# This is a pretty straightforward function, but useful if you don't feel
# like consulting your usually much larger word frequency list.
mdg_text.count("dangerous")

### Parts of Speech

In [None]:
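# Tag the first twenty words with their parts of speech.
# The NLTK may ask you to download a tagger model the first time you run this.
nltk.pos_tag(words[0:20])

In [None]:
# Print only the words tagged 'NN' (singular or mass nouns).
mylist = nltk.pos_tag(words[0:20])
for word, value in mylist:
    if value == 'NN':
        print(word)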
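
The loop above matches only the tag `NN`. In the Penn Treebank tag set, which the NLTK's default English tagger uses, plural and proper nouns get their own tags (`NNS`, `NNP`, `NNPS`). Here is a minimal sketch of catching all of them at once. It uses a small hand-tagged list of my own rather than real tagger output, so it runs without any tagger model installed.

```python
# Hand-tagged (word, tag) pairs in the same shape that nltk.pos_tag returns.
tagged = [
    ("the", "DT"),
    ("hunters", "NNS"),
    ("crossed", "VBD"),
    ("the", "DT"),
    ("dark", "JJ"),
    ("jungle", "NN"),
]

# Every noun tag in the Penn Treebank set begins with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```

To try this on the real text, swap `tagged` for the `mylist` produced by `nltk.pos_tag` above.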