#Adding Context to Word Frequency Counts

While the raw data from word frequency counts is compelling, it does little but describe quantitative features of the corpus. In order to determine whether the statistics indicate a trend in word usage, we must add value to the word frequencies. In this exercise we will produce a ratio of the occurrences of `privacy` to the number of words in the entire corpus. Then we will compare the occurrences of `privacy` to the number of individual transcripts within the corpus. This data will allow us to identify trends that are worthy of further investigation.

###Part 1: Determining a ratio

To add context to our word frequency counts, we can work with the corpus in a number of different ways. One of the easiest is to compare the number of words in the entire corpus to the frequency of the word we are investigating.

Let's begin by calling on all the <span style="cursor:help;" title="a set of instructions that performs a specific task"><b>functions</b></span> we will need. Remember that the first few lines import pre-installed <i>Python</i> <span style="cursor:help;" title="packages of functions and code that serve specific purposes"><b>modules</b></span>, and anything with a `def` at the beginning is a custom function built specifically for these exercises. The quoted text under each `def`, shown in red in the notebook, describes the purpose of the function.

In [1]:
# This is where the modules are imported

from os import listdir
from os.path import splitext
from os.path import basename

# These functions iterate through the directory and create a list of filenames

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name


def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name


def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name

# These functions work on the content of the files

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    # The with-block closes the file automatically, even if reading fails
    with open(filename) as infile:
        contents = infile.read()
    return contents


def clean_text(text):
    "Renders all text lowercase and removes punctuation"
    lower_text = text.lower()
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    # Build the result under a name that does not shadow the function itself
    cleaned = ""
    for character in lower_text:
        if character not in punctuation:
            cleaned += character
    return cleaned


def count_in_list(item_to_count, list_to_search):
    "Counts the number of a specified word within a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits
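
To see what these helpers produce, here is a minimal sketch that runs the same steps on made-up inputs. The path `data2/2006.txt` and the sample sentence are invented for illustration, and the one-line expressions are standard-library stand-ins for `get_filename`, `clean_text` and `count_in_list` rather than the functions defined above.

```python
from os.path import basename, splitext

# Hypothetical file path, used only to show the effect of the path helpers
path = "data2/2006.txt"

# Strip the directory, then the extension (same result as get_filename)
name = splitext(basename(path))[0]
print(name)  # 2006

# Made-up sentence standing in for the contents of a transcript
sentence = "Privacy matters. The Privacy Act protects privacy!"

# Same punctuation list as clean_text above
punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'

# Lowercase the text and drop every punctuation character
cleaned = "".join(ch for ch in sentence.lower() if ch not in punctuation)

# Split on whitespace to get a list of word tokens
tokens = cleaned.split()
print(tokens)

# list.count gives the same tally as the count_in_list loop
hits = tokens.count("privacy")
print(hits)  # 3
```

Because the text is lowercased before counting, `Privacy` and `privacy` are treated as the same word.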

In the next piece of code we will cycle through our directory again. First we assign readable names to our files and store them as a list in the variable `filenames`. Then we convert the text to lowercase, remove the punctuation, split the text into a list of tokens, and append each file's list of words to the variable `corpus`.

In [2]:
filenames = []
for files in list_textfiles('data2'):
    files = get_filename(files)
    filenames.append(files)

In [3]:
corpus = []
for filename in list_textfiles('data2'):
    text = read_file(filename)
    clean = clean_text(text)
    words = clean.split()
    corpus.append(words)

Here we recreate our list from the last exercise, counting the instances of the word `privacy` in each file.

In [4]:
for words, names in zip(corpus, filenames):
    print "Instances of the word 'privacy' in", names, ":", count_in_list("privacy", words)

Instances of the word 'privacy' in 2006 : 409
Instances of the word 'privacy' in 2007 : 298
Instances of the word 'privacy' in 2008 : 273
Instances of the word 'privacy' in 2009 : 679
Instances of the word 'privacy' in 2010 : 672
Instances of the word 'privacy' in 2011 : 750
Instances of the word 'privacy' in 2012 : 667
Instances of the word 'privacy' in 2013 : 1100
Instances of the word 'privacy' in 2014 : 1805


Next we use the `len` function to count the total number of words in each file.

In [5]:
for files, names in zip(corpus, filenames):
    print "There are", len(files), "words in", names

There are 5998460 words in 2006
There are 6943608 words in 2007
There are 5582923 words in 2008
There are 7826514 words in 2009
There are 7252848 words in 2010
There are 6217244 words in 2011
There are 8301652 words in 2012
There are 7180541 words in 2013
There are 8199435 words in 2014


Now we can calculate the ratio of the word `privacy` to the total number of words in the file. To accomplish this we simply divide the two numbers.

In [6]:
print("Ratio of instances of privacy to total number of words in the corpus:")
for words, names in zip(corpus, filenames):
    print '{:.6f}'.format(float(count_in_list("privacy", words))/(float(len(words)))),":",names

Ratio of instances of privacy to total number of words in the corpus:
0.000068 : 2006
0.000043 : 2007
0.000049 : 2008
0.000087 : 2009
0.000093 : 2010
0.000121 : 2011
0.000080 : 2012
0.000153 : 2013
0.000220 : 2014


Now our descriptive statistics concerning word frequencies have added value. We can see that, despite dips in 2007 and 2012, the relative frequency of the word `privacy` in our corpus has risen overall, from 0.000068 in 2006 to 0.000220 in 2014. When we investigate the yearly usage, we can see that the frequency almost doubled between 2008 and 2009, and that there was a dramatic increase between 2012 and 2014.
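
As a rough check on these observations, the sketch below re-derives the ratios from the counts printed above and expresses them per million words, which is easier to read than six decimal places. The per-million scaling and the variable names are choices made for this illustration, not part of the original exercise.

```python
# Figures copied from the notebook output above
years = [2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
privacy_counts = [409, 298, 273, 679, 672, 750, 667, 1100, 1805]
total_words = [5998460, 6943608, 5582923, 7826514, 7252848,
               6217244, 8301652, 7180541, 8199435]

# Occurrences of "privacy" per million words in each year
per_million = [1e6 * count / float(total)
               for count, total in zip(privacy_counts, total_words)]

for year, rate in zip(years, per_million):
    print("{}: {:.1f} per million words".format(year, rate))

# Year-over-year change, as a multiple of the previous year's rate
changes = {}
for i in range(1, len(years)):
    changes[years[i]] = per_million[i] / per_million[i - 1]
    print("{} to {}: x{:.2f}".format(years[i - 1], years[i], changes[years[i]]))
```

A multiple above 1 means the rate rose compared with the previous year, and a multiple below 1 means it fell.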

-----------

###Part 2: Counting the number of transcripts

Another way we can provide context is to process the corpus in a different way. Instead of splitting the data by word, we will split it into larger chunks, one for each individual transcript. Each transcript corresponds to a unique debate but starts with exactly the same formatting, making the files easy to split. The text below shows the beginning of a transcript. The first words are `OFFICIAL REPORT (HANSARD)`.

In [None]:
'''
OFFICIAL REPORT (HANSARD)
  
  
    House of Commons Debates
    VOLUME 141
    NUMBER 001
    1st SESSION
    39th PARLIAMENT
    Monday, April 3, 2006
    Speaker: The Honourable Peter Milliken
    HOUSE OF COMMONS
    CANADA
    (Table of Contents appears at back of this issue.)
    COMMONS DEBATES
    April 3, 2006
    DEBATES
    Edited Hansard * Table of Contents * Number 001 (Official Version)
    Official Report * Table of Contents * Number 001 (Official Version)
    Compte rendu officiel * Table des matieres * Numero 001 (Version officielle)
    141
    001
    03
    04
    2006
    2006/04/03 11:05:00
    House of Commons
    Debats de la Chambre des communes
    House of Commons Debates
    39
    1
'''

Here we will pass the files to another variable, called `corpus_1`. Instead of removing capitalization and punctuation, all we will do is split the files at every occurrence of `OFFICIAL REPORT (HANSARD)`.

In [8]:
corpus_1 = []
for filename in list_textfiles('data2'):
    text = read_file(filename)
    words = text.split(" OFFICIAL REPORT (HANSARD)")
    corpus_1.append(words)
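
How many pieces `split` returns depends on the exact separator string. The following toy example, built from an invented three-transcript string rather than the real data, suggests why a separator with a leading space can matter. It assumes the text begins directly with the header, which may or may not hold for the actual files.

```python
marker = "OFFICIAL REPORT (HANSARD)"

# Invented miniature "file" holding three transcripts (not real data)
sample = (marker + " first debate text "
          + marker + " second debate text "
          + marker + " third debate text")

# Splitting on the bare header leaves an empty string before the first header
pieces_plain = sample.split(marker)
print(len(pieces_plain))  # 4

# Splitting on the header preceded by a space skips the header at position 0
pieces_spaced = sample.split(" " + marker)
print(len(pieces_spaced))  # 3

# Independent tally of how many headers the text contains
header_count = sample.count(marker)
print(header_count)  # 3
```

Splitting on the bare header yields an extra empty piece at the start, so the piece count overstates the number of transcripts by one, while `str.count` gives an independent tally to compare against.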

Now we can count the number of files in each dataset. This is also an important activity for error-checking. While it is easy to trust the numerical output of the code when it runs successfully, we must always check that the code is actually performing in exactly the way we want it to. In this case, these numbers can be cross-referenced with the original XML data, where each transcript exists as its own file. A quick check of the directory shows that the numbers are correct.

In [9]:
for files, names in zip(corpus_1, filenames):
    print "There are", len(files), "files in", names

There are 97 files in 2006
There are 117 files in 2007
There are 93 files in 2008
There are 128 files in 2009
There are 119 files in 2010
There are 98 files in 2011
There are 131 files in 2012
There are 111 files in 2013
There are 127 files in 2014


Here is a screenshot of some of the raw data. We can see that there are <u>97</u> files in 2006, <u>117</u> in 2007 and <u>93</u> in 2008. The rest of the data is also correct. 

<img src="filecount.png">

Now we can compare the number of occurrences of `privacy` with the number of debates occurring in each dataset.

In [10]:
for names, files, words in zip(filenames, corpus_1, corpus):
    print "In", names, "there were", len(files), "debates. The word privacy was uttered", count_in_list('privacy', words), "times."

In 2006 there were 97 debates. The word privacy was uttered 409 times.
In 2007 there were 117 debates. The word privacy was uttered 298 times.
In 2008 there were 93 debates. The word privacy was uttered 273 times.
In 2009 there were 128 debates. The word privacy was uttered 679 times.
In 2010 there were 119 debates. The word privacy was uttered 672 times.
In 2011 there were 98 debates. The word privacy was uttered 750 times.
In 2012 there were 131 debates. The word privacy was uttered 667 times.
In 2013 there were 111 debates. The word privacy was uttered 1100 times.
In 2014 there were 127 debates. The word privacy was uttered 1805 times.


These numbers confirm our earlier results. There is a clear indication that the usage of the term `privacy` is increasing, with major changes occurring between the years 2008 and 2009, as well as between 2012 and 2014.
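
One way to make this comparison easier to read is to divide the two columns, giving an average number of mentions per debate. The sketch below does this with the figures printed above; the per-debate rate is an addition for illustration and was not computed in the original exercise.

```python
# Figures copied from the notebook output above
years = [2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
debates = [97, 117, 93, 128, 119, 98, 131, 111, 127]
privacy_counts = [409, 298, 273, 679, 672, 750, 667, 1100, 1805]

# Average number of times "privacy" was uttered per debate in each year
per_debate = [count / float(number)
              for count, number in zip(privacy_counts, debates)]

for year, rate in zip(years, per_debate):
    print("{}: {:.1f} mentions of privacy per debate".format(year, rate))

# Year with the highest average number of mentions per debate
peak_year = years[per_debate.index(max(per_debate))]
print(peak_year)  # 2014
```

Normalizing by the number of debates guards against a year looking busier simply because the House sat more often.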