We'll often want this magical line at the start of our notebooks.
It makes plots show up right in the notebook. We might as well get used to it.

In [None]:
%matplotlib inline

# Working with Text Files

We can use these two lines to read a text file.
The first line creates a file object that points to the file.
The second line reads in the contents of that file and assigns it
to a variable named `genesis_raw`.

In [None]:
myfile = open('corpora/genesis.txt')
genesis_raw = myfile.read()
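
The two lines above leave the file open. A common alternative is a `with` block, which closes the file for us once the indented code finishes. Here is a minimal sketch of that pattern. The temporary file and the names `path` and `sample_raw` are made up for illustration so the cell runs on its own; in our notebook the path would be `corpora/genesis.txt`.

```python
import os
import tempfile

# Hypothetical stand-in file, created only so this example is self-contained.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('In the beginning')

# Open the file, read its contents, and close it automatically
# when the indented block ends.
with open(path, encoding='utf-8') as myfile:
    sample_raw = myfile.read()

print(myfile.closed)  # True: the file was closed for us
print(sample_raw)
```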

`genesis_raw` is now a string containing every character in the file.
Let's see how many characters that is:

In [None]:
len(genesis_raw)

We can display the first 100 characters:

In [None]:
genesis_raw[:100]

Now we want to split this long string into a list of words.
We can use the `split` method that we encountered earlier:

In [None]:
genesis_words = genesis_raw.split(' ')

We can check how many words the list contains and print the first 100 of them.

In [None]:
print(len(genesis_words))
print(genesis_words[:100])

Notice that this doesn't work perfectly. There are some words that are stuck together as a single token in our list. For example, 'earth.\nAnd' is just one token. If we decided to count how many times the word "earth" appeared, this instance wouldn't be counted.

(`\n` represents a newline character.)

The moral is that splitting a string into separate words can be tricky. 
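
To see the problem on a small scale, here is a sketch using a short made-up string (`sample` is just an illustration, not the real file). It compares `split(' ')` with `split()` called without an argument, which splits on any whitespace, including newlines. Even then the punctuation stays glued to the word, so one "earth" still goes uncounted.

```python
# A short stand-in for the text, with a newline in the middle.
sample = "the heaven and the earth.\nAnd the earth was without form"

by_space = sample.split(' ')   # splits only on single spaces
by_whitespace = sample.split() # splits on spaces, newlines, tabs

print(by_space)       # contains the stuck-together token 'earth.\nAnd'
print(by_whitespace)  # 'earth.' and 'And' are now separate tokens

# 'earth.' is still not equal to 'earth', so only one match is found.
print(by_whitespace.count('earth'))  # 1
```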

## Corpus reading with nltk
[www.nltk.org/howto/corpus.html](http://www.nltk.org/howto/corpus.html)

To sidestep the above problem, for the moment, we will use the Natural Language Toolkit (nltk). It is one of several libraries that we will be using.

To make use of a library we have to first `import` it:

In [None]:
import nltk

nltk has some relatively magic commands that will import some text files and do a bunch of processing for us. For example, it will automatically split the text files into separate words, and do a decent job of it.

Normally, we won't use these magic commands, because it's too hard to tell what they are doing. But for the purposes of illustration we'll use them here. Don't worry too much about the details. Just understand that the result is that it gives us a list of all of the words in genesis, now properly divided.

In [None]:
mygenesis = nltk.corpus.PlaintextCorpusReader("corpora", r'genesis\.txt')  # The first argument is the folder containing the file; the second is a regular expression matching the file name.
genesis_words = mygenesis.words()

In [None]:
print(genesis_words[:100])
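
To take a little of the magic out of it, here is a rough sketch of splitting words and punctuation apart ourselves with a regular expression. This is not nltk's own code, and the pattern is our assumption about what a simple word-and-punctuation splitter looks like. The string `sample` is made up for illustration.

```python
import re

# A short stand-in for the text, with a newline in the middle.
sample = "the heaven and the earth.\nAnd the earth was without form"

# Match either a run of word characters, or a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
tokens = re.findall(r"\w+|[^\w\s]+", sample)

print(tokens)
print(tokens.count('earth'))  # 2: both occurrences are now counted
```

Because the period becomes its own token, "earth" is no longer stuck to it or to the following "And".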

### Some simple explorations

Finally, here's a taste of where things get interesting.

In [None]:
from nltk.draw import dispersion_plot
dispersion_plot(genesis_words, ["Adam", "Noah"])

In [None]:
from nltk.text import ConcordanceIndex
ci = ConcordanceIndex(mygenesis.words())
ci.print_concordance("Adam", width=80, lines=25)