# A Most Dangerous Game of Words

The first line of code run here is something internal to Jupyter Notebooks that allows us to place any graphical output into the page itself and not in a separate window or file. (We can still save output to a file, if we want.)

In [1]:
%pylab inline
figsize(12, 6)

Populating the interactive namespace from numpy and matplotlib


After that, it's time to get our text and start examining it, so the first thing we need to do is load the text file. When reading a single file, as we are doing here, we first tell Python to open the file in **read** mode -- that's all the `r` inside the parentheses is doing -- which prevents any possibility of us writing to it. After that we literally read the file into a variable.

This can be done in two lines:

```python
opened_file = open('texts/mdg.txt', 'r')
mdg = opened_file.read()
```
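
A slightly more defensive variant, offered only as a sketch: a `with` block closes the file for us as soon as the reading is done. So that the example runs on its own, it first writes a small stand-in file; the name `sample.txt`, the temporary folder, and the `encoding` argument are all assumptions made for illustration, and in the notebook you would point at `texts/mdg.txt` instead.

```python
import os
import tempfile

# Create a small stand-in text file so this sketch is self-contained.
# (Hypothetical sample file; the notebook itself reads texts/mdg.txt.)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('"Off there to the right -- somewhere -- is a')

# Open in read mode; the file is closed automatically when the block ends.
with open(sample_path, 'r', encoding='utf-8') as opened_file:
    sample_text = opened_file.read()

print(type(sample_text))
print(opened_file.closed)
```

The result is the same kind of object either way; the only difference is that we never have to remember to close the file ourselves.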

Or it can be done in one line as it is below:

In [3]:
mdg = open('texts/mdg.txt', 'r').read()

Oh, you've just created your first "object" in Python, and it's a text! Or, rather, it's a string, one of the kinds of objects you can work with. If you ever wonder what kind of object you have, you can ask it its `type`:

In [6]:
type(mdg)

str

You can ask it other kinds of things: how big, or long it is -- `len()` -- as well as printing it to see what it looks like. Why don't you do that now? Replace `type` above first with `len` and then with `print` and then hit enter to see what happens. 

We see all of the text, in a "human readable" form. 
One of the problems you now face is understanding that when you hit `print` and see the text, the computer just sees a `string` of characters -- remember what `type` told you? Python doesn't natively understand human languages: they are nothing more than a series of things, characters made up of letters, numbers, punctuation marks, and spaces.
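
To make that concrete, here is a minimal sketch using a short sample string rather than the whole story (the variable name `sample` is just for illustration). It shows that Python treats spaces and punctuation as characters like any other.

```python
# A short sample string; the quotation mark and the spaces are part of it.
sample = '"Off there to the right'

# A string is a sequence of characters: letters, spaces, and punctuation alike.
print(len(sample))        # total number of characters, spaces included
print(list(sample[:5]))   # the first five characters, one by one
print(sample[0])          # indexing returns a single character
```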

When you asked Python to tell you the length of the object, it just counted all those things and told you the total. Our version of "The Most Dangerous Game" is 44078 characters long. But characters aren't a very useful way to measure texts, are they? Letters are not meaningful. Words are. That's how we think of texts, isn't it? In order to count the words, we have to tell Python how to break the string into words.

We need to convert our string into a list of words, which, it turns out, is still human readable. To do that, we need to figure out how to tell the computer to find words among the sequence of characters. The term for words as they are found in discourse is **tokens**, and what we need to do is **tokenize**.

In [None]:
import re

mdg_words = re.sub("[^a-zA-Z'-]"," ", mdg).lower().split()
print('Words in text: {}.'.format(len(mdg_words)))
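
The one-liner above packs three steps together. As a sketch, here they are taken one at a time on a short sample string (the intermediate names `letters_only`, `lowered`, and `sample_words` are made up for this illustration):

```python
import re

sample = '"Off there to the right -- somewhere -- is a'

# Step 1: replace every character that is not a letter, an apostrophe,
# or a hyphen with a space.
letters_only = re.sub(r"[^a-zA-Z'-]", " ", sample)

# Step 2: lowercase everything so that "Off" and "off" count as the same word.
lowered = letters_only.lower()

# Step 3: split on whitespace to get a list of words.
sample_words = lowered.split()

print(sample_words)
```

Notice that the opening quotation mark disappears, but `--` survives as a "word" of its own, because hyphens are among the characters the pattern keeps.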

For various reasons, I have come to develop my own tokeniser, which I have inserted above just so we can talk about it. For our purposes today, though, I am going to turn to the Natural Language Toolkit, more often called by its acronym, NLTK, which, by the way, is also the name we use to call it in Python:

```python
import nltk
```

But the NLTK is a rather large library, or module -- we'll discuss this! -- and as you begin to work with larger programs and larger collections of texts, you want to keep your space as tidy as possible and only get out the tools you need. (And, in some cases, tools prefer to be pulled out singly which also makes their use easier.) 

In this instance, we are going to tell Python that we only want one particular tool from the larger toolkit:

In [9]:
from nltk.tokenize import WhitespaceTokenizer

In [11]:
mdg_tokens = WhitespaceTokenizer().tokenize(mdg)

Now, let's take a look at what kind of object this is and how big it is, and then let's print its first ten items.

In [14]:
print(mdg_tokens[0:10])

['"Off', 'there', 'to', 'the', 'right', '--', 'somewhere', '--', 'is', 'a']
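
Splitting on whitespace alone keeps punctuation and capitalization attached to the words, which changes what counts as "the same word." The sketch below compares the two approaches on a made-up sentence. It uses Python's built-in `str.split()` as a stand-in for the whitespace tokenizer, on the assumption that the two behave alike for ordinary text.

```python
import re

# A made-up sentence, not taken from the story.
sample = 'The hunter hunts. The hunted hunts the hunter!'

# Whitespace splitting: punctuation and capital letters stay attached.
whitespace_tokens = sample.split()

# Regex cleaning first, then lowercasing and splitting.
regex_tokens = re.sub(r"[^a-zA-Z'-]", " ", sample).lower().split()

print(whitespace_tokens)
print(regex_tokens)

# Count the distinct items each approach produces.
print(len(set(whitespace_tokens)), len(set(regex_tokens)))
```

With whitespace splitting, `hunts.` and `hunts` are different tokens, as are `The` and `the`, so the count of distinct items comes out higher.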


What would happen if we were to read the text differently, if we were to read all the words, but this time with the words in alphabetical order?

In [None]:
print(sorted(mdg_words))

What's this? That's a lot of *a*s and *about*s. What happens if we look at just the words without repetition?

In [None]:
print(sorted(set(mdg_words)))

That looks like a lot of words. If we ask how many by counting how long the set of words is, we get: 

In [None]:
len(set(mdg_words))

Almost two thousand distinct words are spread out over about 8000 places. If averaged over the entire text, each word appears about 4 times, but looking over our sorted `mdg_words` above, we can see that the word **and** alone appears more than 160 times. And it's not even among the five most-used words!

In order, the six most frequent words are:

    the, 512
    a, 258
    he, 248
    i, 177
    of, 172
    and, 164

To be clear, these numbers are drawn from a complete list of the words in "The Most Dangerous Game" that was compiled by first creating a pandas Series with each word on its own indexed row:

In [None]:
import numpy as np
import pandas as pd

mdg_series = pd.Series(mdg_words)

print(mdg_series)

We can take that sequence of words and count them:

In [None]:
mdg_counts = mdg_series.value_counts()
print(mdg_counts)
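
If you want to check counts like these without pandas, the standard library's `collections.Counter` does much the same job. This is only a sketch on a made-up word list, not the method used for the numbers above:

```python
from collections import Counter

# A made-up, already tokenized and lowercased word list.
sample_words = ['the', 'hunter', 'hunts', 'the', 'hunted', 'hunts', 'the', 'hunter']

# Counter maps each distinct word to the number of times it occurs.
sample_counts = Counter(sample_words)

# most_common(n) returns the n most frequent words with their counts.
for word, count in sample_counts.most_common(3):
    print('{}, {}'.format(word, count))
```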

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's graph the 50 most frequent words:
# =-=-=-=-=-=-=-=-=-=-= 

mdg_counts.iloc[:50].plot(kind='bar')

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Save these results to a CSV file (makes it easier for the Excel-impaired)
# =-=-=-=-=-=-=-=-=-=-= 

mdg_counts.to_csv('../data/mdg_word_freq.csv')

In [None]:
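import matplotlib.pyplot as plt

# Assumption: `df` holds the 50 most frequent words, with one column for
# the word and one for its count, built from mdg_counts above.
df = mdg_counts.head(50).rename_axis('Word').reset_index(name='Frequency')

plt.style.use('ggplot')
ax = df.plot(kind='bar', x='Word', y='Frequency',
             title="Frequency of Words in MDG",
             figsize=(20, 10),
             legend=True)
ax.set_xlabel("Word")
ax.set_ylabel("Occurrences")
ax.set_xticklabels(list(df['Word']))
plt.show()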

In [None]:
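import nltk

# concordance() is a method of nltk.Text, not of a plain string, so wrap the tokens first.
# It prints its results directly and returns None, so there is nothing to assign or print.
mdg_text = nltk.Text(mdg_tokens)
mdg_text.concordance("dangerous")

In [None]:
mdg_text.similar("love")
mdg_text.common_contexts(["husband", "wife"])
mdg_text.collocations()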

In [None]:
import nltk
mdgtokens = nltk.word_tokenize(mdg)
len(mdgtokens)

In [None]:
import re

# Clean the raw text: keep letters, apostrophes, and hyphens, then lowercase
mdg_raw = open("./mdg.txt").read()
mdg_clean = re.sub("[^a-zA-Z'-]", " ", mdg_raw)
mdg_case = mdg_clean.lower()

# print(mdg_case)

In [None]:
mdg_word_list = mdg_case.split()
print(mdg_word_list)

In [None]:
sorted(set(mdg_word_list))

In [None]:
len(set(mdg_word_list))

In [None]:
# Lexical Diversity of MDG (average number of uses per distinct word):
len(mdg_word_list) / len(set(mdg_word_list))

In [None]:
len(mdg_tokens) / len(set(mdg_tokens))

On average, a word occurs four times in "The Most Dangerous Game."
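
Here is the arithmetic behind that average, sketched on a made-up word list small enough to check by hand (the names `token_count`, `type_count`, and `average_uses` are just for illustration):

```python
# A made-up sample, already tokenized and lowercased.
sample_words = 'the hunter hunts the hunted hunts the hunter'.split()

token_count = len(sample_words)       # every word occurrence
type_count = len(set(sample_words))   # distinct words only

# Average number of times each distinct word is used.
average_uses = token_count / type_count

print(token_count, type_count, average_uses)
```

Eight occurrences of four distinct words gives an average of two uses each, even though *the* appears three times and *hunted* only once. An average like this says little about any individual word.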

Out of curiosity, how many words occur four times?

In [None]:
wordfrequency = nltk.FreqDist(mdg_tokens)
four_times = [word for word in wordfrequency.keys() if wordfrequency[word] == 4]
print(four_times)
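
The same question can be asked of any count. As a sketch, here is a small helper (the name `words_with_count` is hypothetical) built on the standard library's `Counter` instead of NLTK's `FreqDist`, tried out on a made-up word list:

```python
from collections import Counter


def words_with_count(words, n):
    """Return, in alphabetical order, the words that occur exactly n times."""
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == n)


# A made-up sample, already tokenized and lowercased.
sample_words = 'the hunter hunts the hunted hunts the hunter'.split()

print(words_with_count(sample_words, 2))
print(words_with_count(sample_words, 1))
```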

In [None]:
mdg_text.count("dangerous")

In [None]:
mdg_text.concordance("dangerous")
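
A concordance simply shows each occurrence of a word with a little of its surrounding context. The sketch below is a bare-bones version of the idea in plain Python. It is not how NLTK implements it, and the function name `kwic` (for "keyword in context") and its `width` parameter are made up for this illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens of context per side."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append('{} [{}] {}'.format(left, token, right).strip())
    return lines


# A made-up sample, already tokenized.
sample_tokens = 'the hunter hunts the hunted and the hunted hunts the hunter'.split()

for line in kwic(sample_tokens, 'hunted', width=2):
    print(line)
```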

Where does "dangerous" occur within the larger text?

In [None]:
mdg_text.dispersion_plot(["dangerous", "danger", "game", "fear"])
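
Behind a dispersion plot is nothing more than a list of positions: for each word, where in the sequence of tokens does it turn up? A minimal sketch of that idea follows. The helper name `word_offsets` is hypothetical, and ignoring capitalization is a choice made here, not necessarily what NLTK does.

```python
def word_offsets(tokens, word):
    """Return the positions (counting from 0) at which word occurs in tokens."""
    target = word.lower()
    return [i for i, token in enumerate(tokens) if token.lower() == target]


# A made-up sample, already tokenized.
sample_tokens = 'the hunter hunts the hunted and the hunted hunts the hunter'.split()

for word in ['hunter', 'hunted']:
    print(word, word_offsets(sample_tokens, word))
```

Here *hunter* sits near the two ends of the sample and *hunted* in the middle, which is exactly the kind of pattern the plot makes visible for a full text.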

In [None]:
wordfrequency.plot()