# 4. Refining NLTK Inputs & Outputs

In the first two notebooks we learned how to load a file, create a word list out of a text string, and then count words and visualize those counts. In this notebook we will use the `NLTK` to explore how various words occur or are used within a text. Once you have seen how these commands work, feel free to restart the notebook and upload your own text. In the next notebook, we will look at how to load more than one text.

## Load Text, Create Word List, Create NLTK `Text`

The first thing we are going to do is load the libraries we know we are going to use, load our text file, and create our word list.

In [None]:
import re
import nltk

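# Read the whole file into one string; the with-block closes the file for us.
with open('texts/mdg.txt', 'r') as f:
    file = f.read()
# Replace everything except letters and apostrophes with spaces, lowercase, and split.
words = re.sub("[^a-zA-Z']", " ", file).lower().split()
# Wrap the word list in an NLTK Text object so we can use its methods.
text = nltk.Text(words)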
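
To see what the regular expression in the cell above does to a string, here is a minimal sketch on a made-up sentence. The `sample` string and the `sample_words` name are illustrations only and are not part of the text file.

```python
import re

# A made-up sample string, used only for illustration.
sample = "The hunter's dog didn't bark -- not once!"

# Replace every character that is not a letter or an apostrophe with a space,
# then lowercase the result and split on whitespace.
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()

print(sample_words)
```

Because apostrophes are kept, contractions and possessives stay together as single words. Digits and accented characters are replaced with spaces.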

## Concordances

In the previous notebook we explored how to use the NLTK's `concordance()` functionality and we discovered some limitations to its default settings. In this notebook we will explore how we can write some fairly basic blocks of code to get the kinds of outputs we want.

Recall that the basic command looks like this:

```python
my_text.concordance("word")
```

You can type the line just like that, and Jupyter will display the results. Note, however, that Jupyter is simply displaying what Python sends to standard output, often known as `stdout`. If you ever run a command and see no output, wrap it in `print()`.
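
The difference between printing a result and returning one is easy to miss. The sketch below uses a hypothetical stand-in function, `show_matches`, which is not part of the NLTK: like `concordance()`, it writes to `stdout` and returns nothing.

```python
import contextlib
import io

def show_matches(word):
    """Hypothetical stand-in: prints a result and implicitly returns None."""
    print(f"Displaying matches for {word}")

# Capture what the function writes to stdout so we can inspect it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = show_matches("hunting")

captured = buffer.getvalue()
print(repr(captured))  # the text the function printed
print(result)          # the value the function returned
```

A function like this needs no `print()` around it. Adding one would only display the returned `None` after the output.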

One of the limitations of a number of NLTK's functions is that they default to 25 lines of output. `concordance` will serve as our example here. A quick check of the [documentation][] reveals that the default is `concordance(word, width=79, lines=25)`. (Yes, the NLTK documentation is online: you can read all of its lines of code for yourself -- try that with Microsoft Word!)

[documentation]: http://www.nltk.org/_modules/nltk/text.html#Text.concordance

In order to find words that will exceed that limit but not by so much that there is a lot to scroll through, I wrote the following small bit of code to find those words for me. A quick description of what this code does:

* **`for word in set(words):`** tells Python to go through all the words in the text, but only once for each word, which we can do by reducing the text to the set of words used.
* **`if 25 < text.count(word) < 30:`** tells Python only to select the words whose count is greater than 25, exceeding the default, but less than 30 (but this could have been a larger number).
* **`print(word, text.count(word))`** prints the word and the number of times it occurs.

In [None]:
for word in set(words):
    if 25 < text.count(word) < 30:
        print(word, text.count(word))
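
Calling `text.count(word)` inside the loop rescans the whole text for every distinct word, which can be slow on a long text. One alternative, sketched here with `collections.Counter` from the standard library (my substitution, not something the cell above uses), counts every word in a single pass. The word list and the thresholds are made up for illustration; in this notebook they would be `words`, 25, and 30.

```python
from collections import Counter

# A made-up word list, standing in for the notebook's `words`.
sample_words = ["hunt"] * 3 + ["ship"] * 5 + ["island"] * 2

# Count every word in one pass over the list.
counts = Counter(sample_words)

# Keep the words whose count falls strictly between the two thresholds.
low, high = 2, 5
selected = sorted((w, n) for w, n in counts.items() if low < n < high)

print(selected)
```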

"Hunting" will work. Let's see what happens when we add parameters for "hunting":

In [None]:
text.concordance("hunting", width=79, lines=28)

That works! Let's add two revisions:

* A little experimentation reveals that we don't need to specify `width` if we don't wish to change it.
* We don't need to feed the number of lines by hand: we can let `count()` do that for us.

In [None]:
word = "hunting"
text.concordance(word, lines=text.count(word))

In the previous notebook on the NLTK we tried a list of words that just so happened to be related:

```python
word_list = ['hunter', 'hunted', 'hunt']
for word in word_list:
    print(mdg_text.concordance(word))
```

And we wondered if we couldn't ask Python to stem, or lemmatize, for us. It turns out you can. The following example of code uses the Porter stemmer, but the Lancaster stemmer is also available, as is a lemmatizer that uses WordNet.

In [None]:
porter = nltk.PorterStemmer()

for word in set(words):
    if porter.stem(word) == "hunt":
        # concordance() prints its own output, so no print() is needed.
        text.concordance(word, lines=text.count(word))

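"Hunter" doesn't make the list because the Porter stemmer leaves it unchanged: `porter.stem("hunter")` returns "hunter", not "hunt", so the comparison fails. Matching against a set of stems, such as `{"hunt", "hunter"}`, is one way to include it.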
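
A quick way to check how a stemmer treats a particular word is to print the stem it produces. The sketch below compares the Porter and Lancaster stemmers on a short list of candidate words. The list is made up for illustration; in this notebook the words would come from `words`.

```python
import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

# Candidate words chosen for illustration only.
candidates = ["hunt", "hunted", "hunting", "hunter", "hunters"]

# Map each word to the stem each stemmer produces for it.
porter_stems = {word: porter.stem(word) for word in candidates}
lancaster_stems = {word: lancaster.stem(word) for word in candidates}

# Print word, Porter stem, Lancaster stem, one word per line.
for word in candidates:
    print(word, porter_stems[word], lancaster_stems[word])
```

Any word whose Porter stem is not exactly "hunt" will be skipped by the comparison in the cell above.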

Clai asked about sorting on n-1 or n+1. I do not yet see an easy way to do this. I'll keep looking.
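
One possible approach, sketched below, is to skip `concordance()` altogether and build the keyword-in-context lines by hand, which makes them easy to sort. The `kwic` helper and the sample word list are hypothetical and not part of the NLTK; in this notebook the list would be `words`.

```python
def kwic(tokens, target, window=3):
    """Return (left, keyword, right) token lists for each occurrence of target."""
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits

# A made-up token list, standing in for the notebook's `words`.
sample_words = ("we went hunting at dawn they were hunting for food "
                "no hunting on the island").split()

hits = kwic(sample_words, "hunting")

# Sort on the word before the keyword (n-1); an empty string covers a missing neighbor.
by_previous = sorted(hits, key=lambda h: h[0][-1] if h[0] else "")
# Sort on the word after the keyword (n+1).
by_next = sorted(hits, key=lambda h: h[2][0] if h[2] else "")

# Print the lines sorted on n+1, with the left context right-aligned.
for left, keyword, right in by_next:
    print(f"{' '.join(left):>20}  {keyword}  {' '.join(right)}")
```

Swapping `by_next` for `by_previous` in the final loop prints the same lines ordered by the preceding word instead.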