# Texts and Words

<div class="alert alert-block alert-info">
<b>Code Update:</b> The string method <code>lower()</code> has been added to the tokenization process.
</div>

*If you want to know how to get these colored "callout" boxes, see [IBM's markdown guide](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet).*

In [None]:
# Imports
import re
import nltk

# Open and read the file to create a string object
mdg_string = open('../data/mdg.txt', 'r').read()
# Create a list of substrings, aka words
mdg_words = nltk.tokenize.word_tokenize(mdg_string.lower())
# Create the NLTK Text object
mdg = nltk.Text(mdg_words)

# Repeat for "Heart of Darkness"
hod_string = open('../data/hod.txt', 'r').read()
hod_words = nltk.tokenize.word_tokenize(hod_string.lower())
hod = nltk.Text(hod_words)

There are many ways to create a frequency list. Three are shown below: using Python's built-in data structures, using `pandas`, and using `nltk`.

### Built-In

In [None]:
# Using a dictionary
mdg_dict = {}
for word in mdg_words:
    try:
        mdg_dict[word] += 1
    except KeyError:  # first time we have seen this word
        mdg_dict[word] = 1

# When in doubt print something out
print(f"mdg_dict is a {type(mdg_dict)} of {len(mdg_dict)} entries (unique tokens).")

In [None]:
# Dictionaries are key, value pairs.
# To retrieve a value, enter the key:
mdg_dict["hunter"]

In [None]:
# Knowing this, you can actually get the most frequent tokens. 
# (This is not suggested as it's difficult to read.)
# What would you change to get the least frequent tokens?
for word in sorted(mdg_dict, key=mdg_dict.get, reverse=True)[0:10]:
    print(word, mdg_dict[word])
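
The standard library also offers `collections.Counter`, which takes care of both the counting loop and the sorting. Here is a minimal sketch, using a short made-up token list (`sample_words`) in place of `mdg_words` so that it runs on its own:

```python
from collections import Counter

# A hypothetical token list standing in for mdg_words
sample_words = ["the", "hunter", "and", "the", "hunted", "and", "the", "sea"]

# Counter builds the word -> count mapping in one step
sample_counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs
for word, count in sample_counts.most_common(2):
    print(word, count)
```

A `Counter` still behaves like a dictionary, so `sample_counts["hunter"]` retrieves a count just as `mdg_dict["hunter"]` does.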

### pandas

This is my preferred way, for a variety of reasons. I'm going to show it quickly and without much explanation, just for reference, before walking through the `nltk` code.

In [None]:
# First import the pandas library
import pandas as pd

# Then use my preferred way to turn a string of words into a list of words
words = re.sub("[^a-zA-Z']"," ", mdg_string).lower().split()

# Then create a pandas series
mdg_series = pd.Series(words)

# pandas series are a particular data structure
mdg_series.head()
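
The regular expression in the cell above replaces every character that is not a letter or an apostrophe with a space, so punctuation and digits drop out before the string is split on whitespace. Here is a small sketch on a made-up sentence (`sample` is only for illustration):

```python
import re

# A made-up sentence with punctuation, quotation marks, and digits
sample = 'Rainsford said, "Don\'t shoot!" in 1924.'

# Replace anything that is not a letter or an apostrophe with a space,
# then lowercase the result and split on whitespace
sample_tokens = re.sub("[^a-zA-Z']", " ", sample).lower().split()

print(sample_tokens)
```

Note that the apostrophe is kept, so contractions like "don't" stay whole, while the year disappears entirely.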

In [None]:
mdg_counts = mdg_series.value_counts()
print(mdg_counts[0:5])

In [None]:
# pandas makes certain kinds of graphing very easy
mdg_counts.iloc[0:49].plot(kind='bar')

In [None]:
# One of the niceties of working with pandas
mdg_counts.to_csv("mdg-counts.csv")

### NLTK

There are two ways forward: using only NLTK functions or, as in the second cell below, combining our preferred word list with NLTK's `FreqDist` functionality.

In [None]:
fdist = nltk.FreqDist()
for sentence in nltk.tokenize.sent_tokenize(mdg_string):
    for word in nltk.tokenize.word_tokenize(sentence):
        fdist[word] += 1

fdist.most_common(10)

In [None]:
mdg_dist = nltk.FreqDist()
for word in words:
    mdg_dist[word] += 1

mdg_dist.most_common(10)

In [None]:
# If you ask Python what kind of data structure mdg_dist is,
# you'll get a rather unhelpful response, but LOOK ABOVE. What do you see?
type(mdg_dist)

In [None]:
# most_common() returns a list of (word, frequency) tuples,
# so we can work with it like any list of tuples

for word, frequency in mdg_dist.most_common(10):
    print(f"{word}:  {frequency}")

In [None]:
# A FreqDist like mdg_dist comes with a lot of functionality
# (See Table 3.1 in Chapter 1 of the NLTK book for more ideas.)

mdg_dist.plot()

# More Texts, More Words

### Lexical Diversity

In [None]:
def lex_div(text):
    lexdiv = len(set(text)) / len(text)
    return lexdiv


In [None]:
print(f"The lexical diversity of \"The Most Dangerous Game\" is {lex_div(mdg):.3f}")

In [None]:
print(f"The lexical diversity of \"Heart of Darkness\" is {lex_div(hod):.3f}")
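
To see what the number measures, it can help to work through a text small enough to count by hand. This is a sketch with a made-up sentence. The name `toy_lex_div` and the sentence are only for illustration, and the calculation is the same ratio of unique tokens to total tokens used above:

```python
# A made-up "text" that is short enough to count by hand
toy_tokens = "the cat saw the dog and the dog saw the cat".split()

def toy_lex_div(tokens):
    # Unique tokens (types) divided by total tokens
    return len(set(tokens)) / len(tokens)

print(f"{len(set(toy_tokens))} types / {len(toy_tokens)} tokens")
print(f"Lexical diversity: {toy_lex_div(toy_tokens):.3f}")
```

The more a text repeats itself, the lower the score.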

In [None]:
# Redefine lex_div so that it takes a file path instead of a list of tokens
def lex_div(a_file):
    with open(a_file, 'r') as f:
        the_string = f.read()
    the_words = re.sub("[^a-zA-Z']", " ", the_string).lower().split()
    lexdiv = len(set(the_words)) / len(the_words)
    return lexdiv

In [None]:
data = ["A", "B", "C", "D", "E", "F", "G", "H"]

for i in data:
    the_file = "../data/1924/texts/"+i+".txt"
    lexdiv = lex_div(the_file)
    print(f"{i}: {lexdiv:.3f}")


*Hmmm* ... that's quite a range. Referring to the lexical diversities for "The Most Dangerous Game" and _Heart of Darkness_, what do you think is at work there? 

What happens if we add another text as a data point?

In [None]:
hamlet = "../data/hamlet.txt"
lex_div(hamlet)

<div class="alert alert-block alert-warning">
<b>Your turn:</b> Write code that explores the possible dimension in play here.
</div>

## Relative Frequencies

One way to explore how a text compares to other texts from its time is to compare the relative frequencies of the terms it uses against those in a corpus of texts: it could be your own corpus, an established corpus, or a large (and wonky) corpus like Google's [Books Ngram Viewer](https://books.google.com/ngrams/).

Calculating a relative frequency is as easy as dividing a term's count by the overall length of the text, that is, the total number of tokens. (Note: not the size of the vocabulary!)
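
Here is a minimal sketch of that calculation on a made-up token list. The names `sample_words` and `relative_frequency` are assumptions for illustration and not part of the notebook's own code. It also shows why the note about vocabulary matters:

```python
from collections import Counter

# A hypothetical token list standing in for a full text
sample_words = ["the", "hunter", "and", "the", "hunted", "and", "the", "sea"]
sample_counts = Counter(sample_words)

def relative_frequency(term, counts, total_tokens):
    # A term's count divided by the total number of tokens in the text
    return counts[term] / total_tokens

total_tokens = len(sample_words)   # every token, repeats included
vocab_size = len(sample_counts)    # unique tokens only

rf = relative_frequency("the", sample_counts, total_tokens)
print(f"Relative frequency of 'the': {rf:.3f}")

# Dividing by the vocabulary size instead gives a different, misleading number
print(f"Count divided by vocabulary: {sample_counts['the'] / vocab_size:.3f}")
```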

In [None]:
# Recall how we get the count for a term
mdg_dict["hunter"]

In [None]:
# Do the math
mdg_term_rf = mdg_dict["hunter"] / len(mdg_words)
print(mdg_term_rf)

In [None]:
# Compare to Google ngram viewer for 1924
# Sigh, yes this number was retrieved by hand
year_comp = mdg_term_rf / 0.0004010409
print(year_comp)

## Comparing Texts by Comparing Words

<div class="alert alert-block alert-success">
<b>Up Next:</b> What are the ways we can compare texts?
</div>

In [None]:
# Compare term frequencies


In [None]:
# Compare relative term frequencies