#Adding Context to Word Frequency Counts

<i>This is a work in progress.</i>

###Part 1: Determining a ratio

To add context to our word frequency counts, we can work with the corpus in a number of different ways. One of the easiest is to compare the frequency of the word we are investigating with the total number of words in each file of the corpus.

Let's begin by calling on all the <span style="cursor:help;" title="a set of instructions that performs a specific task"><b>functions</b></span> we will need. Remember that the first few lines import pre-installed <i>Python</i> <span style="cursor:help;" title="packages of functions and code that serve specific purposes"><b>modules</b></span>, and anything beginning with `def` is a custom function built specifically for these exercises. The string on the first line of each function (shown in red in the notebook) describes the purpose of the function.

In [1]:
# This is where the modules are imported

from os import listdir
from os.path import splitext
from os.path import basename

# These functions iterate through the directory and create a list of filenames

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name


def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name


def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name

# These functions work on the content of the files

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    # The with statement closes the file even if reading fails
    with open(filename) as infile:
        contents = infile.read()
    return contents


def clean_text(text):
    "Renders all text lowercase and removes punctuation"
    lower_text = text.lower()
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    # Named `cleaned` so the variable does not shadow the function name
    cleaned = ""
    for character in lower_text:
        if character not in punctuation:
            cleaned += character
    return cleaned


def count_in_list(item_to_count, list_to_search):
    "Counts the number of a specified word within a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

In the next piece of code we cycle through our directory again. First we assign readable names to our files and store them as a list in the variable `filenames`. Then we lowercase the text and remove its punctuation, split it into a list of tokens, and append each file's list of words to the variable `corpus`.

In [2]:
# List the files once, sorted by name, so both lists share the same order
filepaths = sorted(list_textfiles('data2'))

filenames = []
for filepath in filepaths:
    filenames.append(get_filename(filepath))

corpus = []
for filepath in filepaths:
    text = read_file(filepath)
    clean = clean_text(text)
    words = clean.split()
    corpus.append(words)
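
To make the cleaning and tokenizing steps concrete, the following sketch applies them to a short made-up sentence rather than to the files in `data2`. The sample text and the names `sample`, `cleaned` and `tokens` are illustrative assumptions; the punctuation string is the one used in `clean_text` above.

```python
# A minimal sketch of the clean-then-split steps on an invented sentence.
# The sentence is a made-up example, not taken from the corpus.
punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'

sample = "Privacy matters. The Privacy Act protects privacy!"

# Lowercase the text and drop every punctuation character
cleaned = "".join(ch for ch in sample.lower() if ch not in punctuation)

# Split on whitespace to get a list of word tokens
tokens = cleaned.split()

print(tokens)
print(tokens.count("privacy"))
```

Because the hyphen is in the punctuation string, a hyphenated word such as `e-mail` would come out as the single token `email`.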

Here we recreate our list from the last exercise, counting the instances of the word `privacy` in each file.

In [3]:
for words, name in zip(corpus, filenames):
    # A single formatted string prints identically in Python 2 and Python 3
    print("Instances of the word 'privacy' in {} : {}".format(name, count_in_list("privacy", words)))

Instances of the word 'privacy' in 2006 : 409
Instances of the word 'privacy' in 2007 : 298
Instances of the word 'privacy' in 2008 : 273
Instances of the word 'privacy' in 2009 : 679
Instances of the word 'privacy' in 2010 : 672
Instances of the word 'privacy' in 2011 : 750
Instances of the word 'privacy' in 2012 : 667
Instances of the word 'privacy' in 2013 : 1100
Instances of the word 'privacy' in 2014 : 1805
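
The standard library's `collections.Counter` offers another way to get the same kind of count, tallying every word in one pass instead of searching for a single word. This is only a sketch: the token list below is made up and stands in for one entry of `corpus`, and the name `tallies` is an assumption.

```python
from collections import Counter

# A made-up token list standing in for one entry of `corpus`
words = ["privacy", "act", "the", "privacy", "commissioner", "the", "privacy"]

# Counter tallies every distinct word in a single pass over the list
tallies = Counter(words)

print(tallies["privacy"])   # count for one word
print(tallies["security"])  # a word that never appears counts as 0
```

This may be worth considering when several words are being investigated, since the list is only traversed once.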


Next we use the `len` function to count the total number of words in each file.

In [4]:
for words, name in zip(corpus, filenames):
    # Each entry of corpus is a list of words, so len() gives the word count
    print("There are {} words in {}".format(len(words), name))

There are 5998461 words in 2006
There are 6943609 words in 2007
There are 5582924 words in 2008
There are 7826515 words in 2009
There are 7252849 words in 2010
There are 6217245 words in 2011
There are 8301653 words in 2012
There are 7180542 words in 2013
There are 8199436 words in 2014


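Now we can calculate the ratio of the word `privacy` to the total number of words in each file. To accomplish this we divide the first number by the second. Both numbers are converted with `float` first, because dividing two integers in Python 2 discards the fractional part and would give 0 for every file.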

In [5]:
print("Ratio of instances of privacy to total number of words in the corpus:")
for words, name in zip(corpus, filenames):
    # float() forces true division; the ratio is shown to six decimal places
    ratio = float(count_in_list("privacy", words)) / float(len(words))
    print("{:.6f} : {}".format(ratio, name))

Ratio of instances of privacy to total number of words in the corpus:
0.000068 : 2006
0.000043 : 2007
0.000049 : 2008
0.000087 : 2009
0.000093 : 2010
0.000121 : 2011
0.000080 : 2012
0.000153 : 2013
0.000220 : 2014
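
Ratios this small can be hard to compare by eye, so one common convention is to scale them to a rate per million words. The sketch below recomputes the 2006 figure from the counts printed above; the helper name `relative_frequency` and its `per` parameter are illustrative assumptions, not part of the exercise code.

```python
def relative_frequency(hits, total, per=1):
    "Return hits divided by total, scaled to a rate per `per` words"
    # float() avoids Python 2 integer division, which would truncate to 0
    return float(hits) / total * per

# Counts for 2006, copied from the output above
hits_2006 = 409
total_2006 = 5998461

# The plain ratio, formatted as in the exercise
print("{:.6f}".format(relative_frequency(hits_2006, total_2006)))

# The same figure expressed per million words
print("{:.1f} per million words".format(
    relative_frequency(hits_2006, total_2006, per=1000000)))
```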


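Now our descriptive statistics concerning word frequencies have added value. We can see that there has indeed been an overall increase in the frequency of the word `privacy` in our corpus, although not a steady one: the ratio fell between 2006 and 2007 and again between 2011 and 2012. When we investigate the yearly usage, we can see that the ratio almost doubled between 2008 and 2009 (from 0.000049 to 0.000087), and that by 2014 it was more than three times its 2006 value.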
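
One way to check observations like these is to compute how the ratio changes from each year to the next. The sketch below hard-codes the counts and totals printed in the earlier output instead of re-reading the files; the names `yearly`, `ratios` and `changes` are assumptions made for illustration.

```python
# (year, instances of 'privacy', total words), copied from the output above
yearly = [
    ("2006", 409, 5998461),
    ("2007", 298, 6943609),
    ("2008", 273, 5582924),
    ("2009", 679, 7826515),
    ("2010", 672, 7252849),
    ("2011", 750, 6217245),
    ("2012", 667, 8301653),
    ("2013", 1100, 7180542),
    ("2014", 1805, 8199436),
]

# Ratio of 'privacy' to all words for each year
ratios = [(year, float(hits) / total) for year, hits, total in yearly]

# Divide each year's ratio by the previous year's ratio
changes = []
for (prev_year, prev_ratio), (year, ratio) in zip(ratios, ratios[1:]):
    factor = ratio / prev_ratio
    changes.append((year, factor))
    print("{} -> {}: x{:.2f}".format(prev_year, year, factor))
```

A factor above 1 marks a year in which the ratio rose, and a factor below 1 marks a year in which it fell.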

-------