# Files into Words

## Table of Contents <a name="toc"></a>

**[Loading a File & Understanding What It Is](#file)**  
**[Tokenizing](#tokenizing)**  
[A Quick Note about Normalization](#norm)  
[Using regex to Tokenize](#REtoke)   
[Using NLTK Tokenizers](#nltktoken)  

The first line of code run _here_ is something internal to Jupyter Notebooks that allows us to place any graphical output into the page itself and not in a separate window or file. (We can still save output to a file, if we want.)

In [1]:
%pylab inline
figsize(12, 6)

Populating the interactive namespace from numpy and matplotlib


## Loading a File & Understanding What It Is <a name="file"></a>

After that, it's time to get our text and start examining it. So the first thing we need to do is load the text file. In the case of reading one file, as we are doing here, we first tell Python to open the file in **read** mode -- that's all the `r` is doing inside the parentheses, preventing any possibility of us writing to it. After that we literally read the file into a variable.

This can be done in two lines of code:

In [None]:
# First we open the file:
opened_file = open('texts/mdg.txt', 'r')

# Then we read the file:
mdg = opened_file.read()

If we try to work with the first object we created, `opened_file`, as though it were the text itself, let's say by slicing off its first 20 characters and printing them, we are going to get an error. Python knows that it has an open text file, but it hasn't yet been told to take the stream of bytes, as Python understands them, and convert them into a string, which is how computers handle things humans call "texts."

In the code cells above and below you will also notice the presence of lines that are simply descriptions of what the code does and not themselves code: that is, they do not do anything but communicate to humans, yourself or others, what is happening in the code. These are called comments, and it's good to get in the habit of including them. Comments start with a hash mark, `#`, and are typically hand-wrapped if more than one line, with each line beginning with a hash mark. In Jupyter Notebook, you can simply select the lines you want to make comments and then `CMD + /` on your keyboard. This is a toggle-able action, so you can uncomment code this way too. 

In [None]:
# What happens when we try to print the opened file?
print(opened_file[0:20])
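
Here is a minimal sketch of the difference between an open file and the string read from it. It assumes nothing about `texts/mdg.txt`: it writes a small stand-in file to a temporary folder (the file name and its one sentence are made up for illustration). It also uses a `with` block, which closes the file for us when the block ends.

```python
import os
import tempfile

# Write a small stand-in text file so the example runs anywhere.
# (The path and contents are placeholders, not the course's own text.)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("This is a stand-in text file.")

slicing_error = ""
with open(path, "r", encoding="utf-8") as opened:
    try:
        opened[0:20]                 # a file object cannot be sliced
    except TypeError as error:
        slicing_error = str(error)
    sample = opened.read()           # reading the file produces a string

print(slicing_error)
print(type(sample), sample[0:20])
```

Only the string returned by `read()` can be sliced, counted, and split.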

In [2]:
# The work of the two lines above can also be achieved in one line.
# See if your growing Pythonista abilities can't tell how this is done!
mdg = open('texts/mdg.txt', 'r').read()

Oh, you've just created your first "object" in Python, and it's a text! Or, rather, it's a string, one of the kinds of objects you can work with. If you ever wonder what kind of object you have, you can ask it its `type`:

In [None]:
type(mdg)

You can ask it other kinds of things: how big, or long it is -- `len()` -- as well as printing it to see what it looks like. Why don't you do that now? Replace `type` above first with `len` and then with `print` and then hit enter to see what happens. 

When you run **`print(mdg)`**, you see all of the text in what appears to be "human readable" form. One of the problems you now face is understanding that when you hit `print` and see the text, the computer just sees a `string` of characters -- remember what `type` told you? Python doesn't natively understand human languages: they are nothing more than a series of things, characters made up of letters, numbers, punctuation marks, and spaces.

When you asked Python to tell you the length of the object, it just counted all those things and told you the total. Our version of "The Most Dangerous Game" is 44,000+ characters long. But counting characters isn't a very useful way to measure texts, is it? Letters are not meaningful. Words are. That's how we think of texts, isn't it? In order to count the words, we have to tell Python how to break the string into words.

We need to convert our string into a list of words. To do that, we need to figure out how to tell the computer to find words among the sequence of characters. The term for words, and all the other bits that make up written language (like punctuation) as they occur in discourse is **tokens**, and what we need to do is **tokenize**.

## Tokenizing <a name="tokenizing"></a>

If only **tokenization** were straightforward. While your first essays into text analytics will probably rely upon relatively well-known methods, you will probably find yourself regularly re-visiting what you use and fine-tuning to fit the needs of your project. (That kind of iterative understanding and refinement based on the work at hand is part of what makes work in the digital humanities so rewarding.) In other words, it's important to remember that, first, tokenization refers to "the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens" ([Wikipedia][]) and, second, that "meaningful" is in the eye of the beholder.

[Wikipedia]: https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization


One of the first places to start tokenizing is with the `split()` method available to use on strings, turning them into lists. If you supply no value inside the parentheses, the default for `split()` is to split a string into a list wherever there is white space.

In the cell below, we first create a string object called `text` -- in this case, the first paragraph from _Heart of Darkness_ -- and then we `split()` it and print the list to see the results.

In [41]:
text = """
The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest. The flood had made, the wind was nearly
calm, and being bound down the river, the only thing for it was to come
to and wait for the turn of the tide.
"""

split_text = text.split()

print(split_text)

['The', 'Nellie,', 'a', 'cruising', 'yawl,', 'swung', 'to', 'her', 'anchor', 'without', 'a', 'flutter', 'of', 'the', 'sails,', 'and', 'was', 'at', 'rest.', 'The', 'flood', 'had', 'made,', 'the', 'wind', 'was', 'nearly', 'calm,', 'and', 'being', 'bound', 'down', 'the', 'river,', 'the', 'only', 'thing', 'for', 'it', 'was', 'to', 'come', 'to', 'and', 'wait', 'for', 'the', 'turn', 'of', 'the', 'tide.']


**>>>**: Why not copy and paste a block of text from your own work to see what happens?

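You can actually specify other values on which to split a string, which can be useful in some cases. (Please note that `split()` and `split(' ')` are not quite the same thing: with no argument, `split()` breaks on any run of white space, including tabs and newlines, while `split(' ')`, with a space between the two quotation marks, breaks only on single spaces and so leaves newlines and empty strings in the list.)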

In the next cell, we split the paragraph by comma -- the `\n`s you see are newline characters normally hidden from human eyes, but always there. They are considered part of white space by Python, which is why they are not part of the list above.

In [43]:
comma_text = text.split(",")
print(comma_text)

['\nThe Nellie', ' a cruising yawl', ' swung to her anchor without a flutter of\nthe sails', ' and was at rest. The flood had made', ' the wind was nearly\ncalm', ' and being bound down the river', ' the only thing for it was to come\nto and wait for the turn of the tide.\n']
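
To see how the argument to `split()` changes the result, here is a small sketch on a made-up string (not from the text) that contains both a doubled space and a newline:

```python
sample = "the  wind was\nnearly calm"   # note the doubled space and the newline

no_argument = sample.split()       # splits on any run of white space
single_space = sample.split(" ")   # splits on each single space only

print(no_argument)    # ['the', 'wind', 'was', 'nearly', 'calm']
print(single_space)   # ['the', '', 'wind', 'was\nnearly', 'calm']
```

With a space as the separator, the doubled space produces an empty string and the newline stays buried inside a token.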


As you can see, Python's built-in string method `split()` has plenty of power and may very well do everything you need for quick surveying of results. For more nuanced results, you may find yourself wanting to build a custom setup with regex or to use one of the tokenizers included with the NLTK. Both of those are discussed next.

[For more on tokenization in Python.](http://jeffreyfossett.com/2014/04/25/tokenizing-raw-text-in-python.html)

### A Quick Note about Normalization <a name="norm"></a>

In the first sentence from the opening of _Heart of Darkness_ there are two instances of the definite article *the*:

> The Nellie, a cruising yawl, swung to her anchor without a flutter of the sails, and was at rest.

If we break the sentence into tokens as above and then ask Python to compose a set of words based on the sentence we get the following:

In [49]:
# Whenever you have extra long text in a code example that breaks across lines,
# you enclose it within three quotation marks to let Python know that. 
# (This is one reason to load text from a file.)

sentence = """The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest."""

split = sentence.split()

print("This set has {} items:".format(len(set(split))), set(split))

This set has 18 items: {'rest.', 'swung', 'to', 'The', 'cruising', 'her', 'Nellie,', 'yawl,', 'without', 'flutter', 'the', 'and', 'was', 'of', 'at', 'anchor', 'a', 'sails,'}


In fact, if we ask Python if *The* and *the* are the same, it tells us no:

In [45]:
'The' == 'the'

False

The way most scholars and scientists avoid this particular semantic difficulty is to make all the text in a string lowercase with `lower()`. This too is a string method, and so you can append it just as you do `split()`. (FTR, there is also `upper()` that makes everything uppercase.)

Watch what happens when we normalize our text to lowercase and ask Python to count the set of words:

In [50]:
sentence = """The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest."""

split = sentence.lower().split()

print("This set has {} items:".format(len(set(split))), set(split))

This set has 17 items: {'rest.', 'swung', 'to', 'cruising', 'her', 'without', 'yawl,', 'nellie,', 'the', 'flutter', 'was', 'and', 'at', 'of', 'anchor', 'a', 'sails,'}


The two *the*s are now considered to be the same!
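
The same point can be checked by counting. The sketch below uses `Counter` from Python's standard `collections` module (my choice for illustration, not something used elsewhere in this notebook) to count each token before and after lowercasing:

```python
from collections import Counter

sentence = """The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest."""

raw_counts = Counter(sentence.split())
lower_counts = Counter(sentence.lower().split())

print(raw_counts["The"], raw_counts["the"])   # 1 1 -- counted separately
print(lower_counts["the"])                    # 2   -- counted together
```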

### Using Regex to Tokenize <a name="REtoke"></a>

As you are already beginning to sense, there are a lot of ways to approach breaking a string of characters into a list of words. The first method presented above is the most basic and should be used when first starting out, especially when you pair lowercasing with it -- `.lower().split()` are your new best friends. But at some point you may find that you want a little bit more flexibility in how you handle your text.

This next approach still depends upon `.lower().split()`, but before we pass our text to it, we do a bit of cleanup. It's one I have used quite often when making first passes through texts. After a while, you'll find that some bits of regex become part of your toolset, and reveal, as this one does mine, your own interests and concerns. In this case, the regex reveals a particular obsession of mine: I like to keep contractions together. (This reveals my own non-standard training: corpus linguists just take for granted that **`can't`** becomes **`ca`** and **`n't`**, or something else equally weird.)

In [3]:
# We need the regular expression library
import re

mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()

A quick walkthrough might help a little:

* **`import re`** tells the script to import the regular expression module which comes bundled with every Python installation but doesn't get loaded into our workspace unless we tell it to do so.
* **`mdg_words`** is the object we are creating: everything to the right is the process by which we create the object.
* We aren't going to discuss the regular expression substitution, **`re.sub()`** that gets done here, except to note that you should read the stuff inside the parentheses like this: `(find this pattern, substitute this, in this text)` -- in this case I am telling it to find things that are **not** (`[^ ]` is called a *negated set*) letters (big and small) or apostrophes and replace them with spaces. 
* **`.lower()`** is a method you can apply to strings that makes everything lower case -- otherwise "The" and "the" are two different keys. 
* **`.split()`** is the string method discussed above and we're using its default setting of splitting on white spaces, of which we have plenty, since we have replaced everything except for letters and apostrophes with white space.
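
The steps in the walkthrough above can be tried one at a time on a short string. The sentence here is made up for illustration; the pattern is the same one used on `mdg`:

```python
import re

sample = 'He said: "I can\'t see it -- can you?"'

# Step 1: replace everything that is not a letter or an apostrophe with a space.
cleaned = re.sub(r"[^a-zA-Z']", " ", sample)

# Step 2: lowercase the result, then split it on white space.
words = cleaned.lower().split()

print(words)
# ['he', 'said', 'i', "can't", 'see', 'it', 'can', 'you']
```

Note that the contraction survives intact while the colon, quotation marks, dashes, and question mark all disappear.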

The **`split()`** method turns our string into a **list**. In this case, a list of words that are in the same order as they are in the original text. (The computer has no reason to disturb this order.)

If we ask how long the list is, we should come back with a reasonably close count of the words -- don't forget we kept apostrophes, and while most are buried inside contractions, there may be some loose apostrophes. 

In [59]:
print('Words in text: {}.'.format(len(mdg_words)))

Words in text: 8017.


As you will discover, or have discovered, a list is a different kind of object than a string, and it has some useful properties. One of the things we can do is **slice** a list. In the command below we are saying we want the first 50 words in the list of words: start at the 0th element and go up to, but not including, the element at index 50. Note how the apostrophe, here as a single quotation mark inside Whitney's response about "'ship trap island'", is part of our tokens:

In [58]:
print(mdg_words[0:50])

['off', 'there', 'to', 'the', 'right', 'somewhere', 'is', 'a', 'large', 'island', 'said', 'whitney', "it's", 'rather', 'a', 'mystery', 'what', 'island', 'is', 'it', 'rainsford', 'asked', 'the', 'old', 'charts', 'call', 'it', "'ship", 'trap', 'island', "'", 'whitney', 'replied', 'a', 'suggestive', 'name', "isn't", 'it', 'sailors', 'have', 'a', 'curious', 'dread', 'of', 'the', 'place', 'i', "don't", 'know', 'why']
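
A few more slices, sketched on just the first six words from the output above, show how the start and stop positions work:

```python
words = ['off', 'there', 'to', 'the', 'right', 'somewhere']

first_three = words[0:3]    # indexes 0, 1, and 2; index 3 is not included
same_thing = words[:3]      # leaving out the start means "from the beginning"
last_two = words[-2:]       # negative numbers count back from the end

print(first_three)   # ['off', 'there', 'to']
print(last_two)      # ['right', 'somewhere']
```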


### Using NLTK Tokenizers <a name="nltktoken"></a>

Whether you develop your own tokenizer somewhere along the way or not is entirely up to you and the nature of the projects which you undertake. You may never find yourself doing so. There are certainly a lot of already available options, a number of which are packaged with the Natural Language Toolkit, more often called by its acronym "NLTK", which, by the way, is the same way we call it in Python: 

```python
import nltk
```

As you saw above, the way we tell Python we want some additional functionality is to tell it to **import** a particular library. Previously, we imported the regular expression library, which is called **`re`**. 

If we were to run the line above, we would probably wait a few seconds as the NLTK library loaded -- it's quite large. Because it is a large library, and we really don't need all its functionality, most people don't load all of it -- and isn't it cool that we can load only the parts we need! This is handy as you begin to work with larger scripts and larger data sets: keeping your workspace as tidy, and as small, as possible becomes a necessity. And, in some cases, it's actually easier to use certain tools singly and with a particular name.

In [52]:
# In this instance, we are going to tell Python that we only want 
# one particular tool from the larger toolkit:

from nltk.tokenize import WhitespaceTokenizer

NLTK's `WhitespaceTokenizer` is about as basic as it comes -- in fact, in the NLTK documentation, they actually note you might as well use split. So far as I can tell, its output is the same as `.split()`, which means contractions stay together. However, do note that punctuation is kept with words and that the tokenizer does not lowercase anything, which is why we apply `.lower()` ourselves in the cell below.

In [54]:
mdg_tokens = WhitespaceTokenizer().tokenize(mdg.lower())

Please note that we now have two versions of "The Most Dangerous Game" rendered as a list in Python: **`mdg_words`** and **`mdg_tokens`**. We created the former using regex and the latter with the NLTK WhitespaceTokenizer. We could have re-used the former name, and that would have, in effect, replaced the older list with the newer one. Please keep this in mind as you work, and also remember to use sensible names for the objects you create. (The more self-explanatory things are, the better as your code grows.)

Now, let's take a look at what kind of object this is, how big it is, and then let's print the first 50 items -- I use 50 as a convenient, quick look, but you can adjust the number down to 25 or up to 100 or some other number that you find more useful. I often change the numbers several times, contracting and expanding as I feel the need.

In [56]:
print(mdg_tokens[0:50])

['"off', 'there', 'to', 'the', 'right', '--', 'somewhere', '--', 'is', 'a', 'large', 'island,"', 'said', 'whitney.', '"it\'s', 'rather', 'a', 'mystery', '--', '"', '"what', 'island', 'is', 'it?"', 'rainsford', 'asked.', '"the', 'old', 'charts', 'call', 'it', "'ship-trap", 'island,\'"', 'whitney', 'replied.', '"a', 'suggestive', 'name,', "isn't", 'it?', 'sailors', 'have', 'a', 'curious', 'dread', 'of', 'the', 'place.', 'i', "don't"]


What would happen if we were to read the text differently, if we were to read all the words, but this time with the words in alphabetical order?

In [60]:
print(sorted(mdg_tokens))

['"', '"', '"', '"', '"', '"', '"', '"', '"', '"', '"', '"', '"a', '"a', '"a', '"a', '"after', '"again', '"ah,', '"and', '"and', '"and', '"and', '"and', '"and', '"as', '"as', '"bah!', '"better', '"but', '"but', '"but', '"but', '"but', '"but', '"but', '"but,', '"can\'t', '"cannibals?"', '"civilized?', '"come,"', '"dear', '"dear', '"did', '"did', '"don\'t', '"don\'t', '"ennui.', '"even', '"follow', '"for', '"fractured', '"general,"', '"get', '"god', '"good', '"hardly.', '"have', '"he', '"he', '"here', '"how', '"how', '"hunting', '"hunting?', '"hurled', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i', '"i\'ll', '"i\'ll', '"i\'ll', '"i\'ll', '"i\'m', '"i\'m', '"i\'m', '"i\'m', '"i\'ve', '"i\'ve', '"if', '"in', '"is', '"is', '"it', '"it', '"it', '"it', '"it\'s', '"it\'s', '"it\'s', '"it\'s', '"ivan', '"ivan,"', '"life', '"maybe.', '"mirage,"', 

What's this? That's a lot of *a*s and *about*s. What happens if we look at just the words without repetition?

In [61]:
print(sorted(set(mdg_tokens)))

['"', '"a', '"after', '"again', '"ah,', '"and', '"as', '"bah!', '"better', '"but', '"but,', '"can\'t', '"cannibals?"', '"civilized?', '"come,"', '"dear', '"did', '"don\'t', '"ennui.', '"even', '"follow', '"for', '"fractured', '"general,"', '"get', '"god', '"good', '"hardly.', '"have', '"he', '"here', '"how', '"hunting', '"hunting?', '"hurled', '"i', '"i\'ll', '"i\'m', '"i\'ve', '"if', '"in', '"is', '"it', '"it\'s', '"ivan', '"ivan,"', '"life', '"maybe.', '"mirage,"', '"my', '"nerve,', '"no', '"no,', '"no,"', '"no.', '"nonsense,"', '"nor', '"not', '"off', '"oh,', '"oh,"', '"one', '"perhaps', '"perhaps,"', '"pistol', '"precisely,"', '"pure', '"rainsford!"', '"rainsford,"', '"really?"', '"right.', '"simply', '"so', '"so,"', '"sometimes', '"sorry', '"splendid!', '"suppose', '"swam,"', '"thank', '"that', '"that\'s', '"the', '"there', '"they', '"this', '"tigers?"', '"to', '"tonight,"', '"ugh!', '"watch!', '"we', '"we\'ll', '"well,', '"what', '"where', '"why', '"why?"', '"will', '"yes,', '"ye

That looks like a lot of words. If we ask how many by counting how long the set of words is, we get: 

In [62]:
# Notice how I enclose the text I want to print in single quotes
# so that I can use double quotes in the text itself:
print('There are {} words in "The Most Dangerous Game."'.format(len(set(mdg_tokens))))

There are 2589 words in "The Most Dangerous Game."
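
Because white-space tokens keep their punctuation, a count like this treats `'"the'` and `'the'` as different words. One possible cleanup, offered only as a sketch and not as the approach this notebook settles on, is to strip punctuation from the ends of each token with the standard `string` module. The short token list below is hand-picked for illustration:

```python
import string

tokens = ['"the', 'old', 'charts', 'call', 'it', 'island,"', 'the', 'place.', 'island']

# Strip punctuation from both ends of every token, then drop any
# token that was nothing but punctuation.
stripped = [token.strip(string.punctuation) for token in tokens]
stripped = [token for token in stripped if token]

print(len(set(tokens)), len(set(stripped)))   # 9 7
```

Apostrophes inside a word, as in `isn't`, are left alone because `strip()` only works on the ends of a string.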


### Stacked Tokenizers


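In [20]:
# sent_tokenize and word_tokenize rely on NLTK's Punkt models; if they are
# missing, NLTK raises a LookupError naming the resource to nltk.download().
import nltk

# Stack two tokenizers: split the text into sentences, then split each
# sentence into words, counting every token as we go.
fdist = nltk.FreqDist()
for sentence in nltk.tokenize.sent_tokenize(mdg):
    for word in nltk.tokenize.word_tokenize(sentence):
        fdist[word] += 1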

In [24]:
print(fdist.most_common(10))

[('.', 640), (',', 556), ('the', 439), ('a', 246), ('``', 223), ("''", 210), ('I', 178), ('he', 173), ('of', 171), ('and', 155)]


In [21]:
sentences = nltk.tokenize.sent_tokenize(mdg)
# Print the first ten sentences. (Handing print() a generator expression
# shows only the generator object, not the sentences themselves.)
print(sentences[0:10])


In [22]:
# word_tokenize expects a single string, not a list of sentences,
# so we tokenize the sentences one at a time:
tokens = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]

## Counting Words

From our experiments above we learn that roughly two thousand distinct words are spread out over about 8,000 places. Averaged over the entire text, each word appears about 4 times, but a count of `mdg_words` shows that the word **and** alone appears 164 times. And it is only the fifth most used word!

In order, the top six are:

    the, 512
    a, 258
    he, 248
    of, 172
    and, 164
    i, 155

The list above is the start of a word frequency list. There are a number of ways to do this, and I will include those in separate files for your reference, but since we have started with the NLTK, I thought we would stay with it. We have a choice to make, however: we can either continue to import one tool at a time from the NLTK library, or we can just say to ourselves that we're going to be playing with a lot of the tools, so why not just bring them all into our workspace?

Please note that once I've imported all of the `nltk` library, I need to tell Python that a particular tool, or function, comes from that library. Sometimes functions from two different, and large, libraries can have the same name; prepending the library name is one way that Python has of avoiding what are called "namespace conflicts," which is a fancy way of saying you can't call two things by the same name. You've seen both ways of doing things now.
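
The two import styles can be compared side by side. This sketch uses the standard library's `collections` module rather than NLTK, simply so that it runs without any extra installation:

```python
import collections                 # whole module: names need the prefix
from collections import Counter    # one tool: no prefix needed

with_prefix = collections.Counter("banana")
without_prefix = Counter("banana")

# Both names point at the very same tool, so the results match.
print(collections.Counter is Counter)
print(with_prefix == without_prefix)
```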

In [27]:
import nltk

In [5]:
freq_dist = nltk.FreqDist(mdg_words)

In [7]:
type(freq_dist)

nltk.probability.FreqDist

In [8]:
for word, frequency in freq_dist.most_common(10):
    print('{}:  {}'.format(word, frequency))

the:  512
a:  258
he:  248
of:  172
and:  164
i:  155
to:  148
was:  140
his:  137
rainsford:  117


In [19]:
# Iterating over a FreqDist directly yields only the words,
# so we ask for (word, frequency) pairs with .items():
for word, frequency in freq_dist.items():
    print(word, frequency)

In [18]:
# Copy the counts into a plain dictionary of word: frequency pairs.
mdg_dict = {word: frequency for word, frequency in freq_dist.items()}
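
NLTK's `FreqDist` is built on Python's own `collections.Counter`, so the way it behaves in a loop can be sketched without NLTK at all. The word list here is made up for illustration:

```python
from collections import Counter

counts = Counter(['the', 'a', 'the', 'he', 'the', 'a'])

words_only = list(counts)        # looping over it directly gives just the words
pairs = list(counts.items())     # .items() gives (word, frequency) pairs

# A dictionary comprehension has to reuse the names it unpacks:
as_dict = {word: frequency for word, frequency in counts.items()}

print(words_only)
print(counts.most_common(2))     # [('the', 3), ('a', 2)]
```

Trying to unpack two names from each item of `words_only` fails, because each item is a single string rather than a pair.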

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's graph the 50 most frequent words:
# =-=-=-=-=-=-=-=-=-=-= 

# Passing a number to plot() limits the graph to that many of the most common words:
freq_dist.plot(50)

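In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Save these results to a CSV file (makes it easier for the Excel-impaired)
# =-=-=-=-=-=-=-=-=-=-= 

# Assumes pandas is installed and that the ../data/ folder already exists.
import pandas as pd

# Turn the (word, frequency) pairs into a two-column table, then save it.
mdg_counts = pd.DataFrame(freq_dist.most_common(), columns=['Word', 'Frequency'])
mdg_counts.to_csv('../data/mdg_word_freq.csv', index=False)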

### ngrams

In [9]:
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = nltk.FreqDist(nltk.ngrams(mdg_words, size))


In [13]:
all_counts[5].most_common(5)

[(('went', 'to', 'the', 'window', 'and'), 3),
 (('to', 'the', 'window', 'and', 'looked'), 3),
 (('i', 'fell', 'off', 'a', 'yacht'), 2),
 (('sanger', 'rainsford', 'of', 'new', 'york'), 2),
 (('rainsford', 'my', 'dear', 'fellow', 'said'), 2)]
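
An n-gram is just a run of `n` neighboring words. As a rough sketch of the idea (a hand-written helper named `simple_ngrams`, not NLTK's own implementation), we can slide a window of a given size across a list:

```python
def simple_ngrams(words, size):
    """Return every run of `size` neighboring words as a tuple."""
    return [tuple(words[start:start + size])
            for start in range(len(words) - size + 1)]

words = ['went', 'to', 'the', 'window', 'and', 'looked']

bigrams = simple_ngrams(words, 2)
fivegrams = simple_ngrams(words, 5)

print(bigrams[0])    # ('went', 'to')
print(fivegrams)
```

Six words yield five bigrams but only two 5-grams, which is why longer n-grams repeat so rarely in a short story.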

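In [None]:
# Assumes pandas is installed; df holds the 50 most frequent words.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(freq_dist.most_common(50), columns=['Word', 'Frequency'])

plt.style.use('ggplot')
ax = df.plot(kind='bar',
             x='Word',
             y='Frequency',
             title="Frequency of Words in MDG",
             figsize=(20, 10),
             legend=True)
ax.set_xlabel("Word")
ax.set_ylabel("Occurrences")
plt.show()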

In [15]:
# concordance() belongs to nltk.Text, not to plain lists, so we wrap the
# word list first. It prints its results directly and returns None.
mdg_text = nltk.Text(mdg_words)
mdg_text.concordance("dangerous")

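In [None]:
# These methods also belong to nltk.Text. The words queried here all appear
# in the story; collocations() needs NLTK's stopwords corpus to be downloaded.
mdg_text = nltk.Text(mdg_words)
mdg_text.similar("island")
mdg_text.common_contexts(["rainsford", "whitney"])
mdg_text.collocations()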

In [None]:
# Average number of times each distinct word occurs in MDG
# (the inverse of this ratio is what is usually called lexical diversity):
len(mdg_words) / len(set(mdg_words))

In [None]:
len(mdg_tokens) / len(set(mdg_tokens))

On average, a word occurs four times in "The Most Dangerous Game."
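
The arithmetic behind that average is easy to check on a tiny, made-up list of words:

```python
words = ['the', 'island', 'the', 'sea', 'the', 'island']

total_words = len(words)          # every word, repeats included: 6
distinct_words = len(set(words))  # each different word counted once: 3

average_uses = total_words / distinct_words
print(average_uses)               # 2.0
```

An average hides a lot: here *the* occurs three times and *sea* only once.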

Out of curiosity, how many words occur four times?

In [None]:
wordfrequency = nltk.FreqDist(mdg_tokens)
four_times = [word for word in wordfrequency.keys() if wordfrequency[word] == 4]
print(four_times)

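In [None]:
# Wrap the word list in an nltk.Text object, which provides count(),
# concordance(), and dispersion_plot():
mdg_text = nltk.Text(mdg_words)
mdg_text.count("dangerous")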

In [None]:
mdg_text.concordance("dangerous")
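
A concordance shows each occurrence of a word along with its neighbors. The sketch below is a hand-rolled illustration of that idea, with a hypothetical helper named `keyword_in_context` and a made-up word list; NLTK's own `concordance()` formats its output differently.

```python
def keyword_in_context(words, keyword, width=2):
    """Return each occurrence of `keyword` with `width` words on either side."""
    lines = []
    for position, word in enumerate(words):
        if word == keyword:
            left = words[max(0, position - width):position]
            right = words[position + 1:position + 1 + width]
            lines.append(" ".join(left + [word] + right))
    return lines

words = ['the', 'most', 'dangerous', 'game', 'is', 'a', 'dangerous', 'thing']

for line in keyword_in_context(words, 'dangerous'):
    print(line)
```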

Where does "dangerous" occur within the larger text?

In [None]:
mdg_text.dispersion_plot(["dangerous", "danger", "game", "fear"])

In [None]:
wordfrequency.plot()