# Concordance Output

A concordance is a method of text analysis that is somewhat similar to the generation of word frequency statistics, except that the search is expanded to the words that appear on either side of the word under investigation. We call the main search word the 'node' and the words surrounding it the 'span'. A concordance is simply a printed list displaying the sentences, or 'context', that the node word appears in. This list is traditionally organized in a 'Key Word in Context' (KWIC) format, which places the node word in the centre of the page. The span can be adjusted, but generally includes about five words to the left and five words to the right of the node.
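To make the node and span idea concrete, here is a minimal sketch of a KWIC listing in plain Python. It is not how `NLTK` builds its concordance; the function name `kwic`, its `span` and `width` parameters, and the sample sentence are assumptions made only for this illustration.

```python
def kwic(tokens, node, span=5, width=40):
    """Return one KWIC line for each occurrence of `node` in `tokens`.

    `span` is the number of words kept on each side of the node.
    `width` is the column width used to line up the left-hand context.
    (Hypothetical helper, for illustration only.)
    """
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lower case so that 'Privacy' and 'privacy' both match
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            # Right-align the left context so every node starts in the same column
            lines.append(left[-width:].rjust(width) + " " + token + " " + right)
    return lines

# A short made-up sentence, split on whitespace
sample = "The government takes the privacy of Canadians extremely seriously".split()

for line in kwic(sample, "privacy", span=3):
    print(line)
```

Changing `span` widens or narrows the context shown around the node, while `width` only affects how the lines are aligned.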

The purpose of generating a concordance output is to allow for manual, but controlled, examination of the word in question. As we will see in this exercise, it becomes very easy to recognize patterns of language use when the text is organized in this way. Further investigation can be conducted by sorting the list of text alphabetically, either on the word just to the left or right of the node word.

Generating a concordance output in Python is fairly simple thanks to the `NLTK` module. In this exercise we will generate a concordance output for one of our files.

---------

Once again we will import our modules and definitions first. Here we see some new modules: `NLTK`, `codecs`, and `sys`.

`NLTK` stands for <a href="http://www.nltk.org/book/ch00.html" target=blank><i>Natural Language Toolkit</i></a>, which facilitates natural language processing in <i>Python</i>. `NLTK` has many functions that support electronic text analysis, including tokenizers, word frequency counters, and, for the purposes of this demonstration, concordancers.

`codecs` is a module that helps <i>Python</i> read and write text in <span style="cursor:help;" title="the industry standard for encoding special characters, like: æ, þ, ß"><b>Unicode</b></span>, which is a text encoding standard that includes non-alphanumeric characters. We will not be removing the capitalization or punctuation in this exercise, so we're using `codecs` to avoid any errors in reading and printing the file.

`sys` is a built-in <i>Python</i> module that allows for the manipulation of the <i>Python</i> <span style="cursor:help;" title="the infrastructure required to run programs"><b>runtime environment</b></span>. Here we will use it to write the output of a program to a text file.


In [1]:
# This is where the modules are imported
import nltk
import sys
import codecs

# These functions read the contents of a file and count words in a list

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = codecs.open(filename, 'r', 'utf-8')
    contents = infile.read()
    infile.close()
    return contents

def count_in_list(item_to_count, list_to_search):
    "Counts the number of a specified word within a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

For this demonstration we will focus on only one file, the 2013 section of the corpus. As evidenced in the last exercise, <i>Adding Context to Word Frequency Counts</i>, usage of the word `privacy` rose significantly between 2012 and 2013, by about 40%. Here we will take a closer look at 2013 in an attempt to identify any patterns of word use.


This is a case where cleaning the text may also destroy some of the context. While it is nice to have the numbers line up (in terms of word frequencies vs. number of concordance lines), removing the punctuation and capitalization makes the text harder to read and understand.

In [2]:
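# Read the file and split it on whitespace, keeping capitalization and punctuation
text = read_file('2013.txt')
words = text.split()
# Wrap the list of words in an NLTK Text object, which provides the concordance method
concord = nltk.Text(words)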

Here we will call the `concordance` method, listing 25 lines from the text.

In [3]:
# concordance() prints its own output and returns None, so it does not need to be wrapped in print()
concord.concordance("privacy", lines=25)

Displaying 25 of 918 matches:
imply unacceptable. That is why the Privacy Commissioner's office was notified.
 the matter to the attention of the Privacy Commissioner of Canada. I also aske
table. That is why we called in the Privacy Commissioner and called in the RCMP
able. That is why we brought in the Privacy Commissioner. That is why we brough
ese victims and when will they take privacy protection seriously? (1455) Hon. D
 is why we took steps to inform the Privacy Commissioner of Canada and to bring
ystems to make sure that Canadians' privacy is protected. That is why we have e
 happened. We have also advised the Privacy Commissioner of the situation. We h
. Speaker, the government takes the privacy of Canadians extremely seriously. T
ely unacceptable. The Office of the Privacy Commissioner has been notified and 
nment takes extremely seriously the privacy of Canadians and the loss by the de
ion. We will continue to do so. The privacy commissioner is investigating this.
ed before,

The `NLTK` module is limited in the amount of processing it can conduct on concordances. It is more useful to output the entire concordance to a text file, which can then be sorted and manipulated in many ways. The following code prints the entire concordance to a file. The '79' on line 8 sets the total number of characters in each concordance line, including all letters, punctuation and spaces.

In [4]:
#creates a new file to receive the printed output
fileconcord = codecs.open('2013concord.txt', 'w', 'utf-8')
#keeps a reference to the normal output stream (the screen), so that we can return to it at the end
tmpout = sys.stdout
#sends everything that is printed to the file instead of the screen
sys.stdout = fileconcord
#generates and prints the concordance; 79 is the number of characters per line, and sys.maxsize (sys.maxint in Python 2) requests every match
concord.concordance("privacy", 79, sys.maxsize)
#closes the file
fileconcord.close()
#sends printed output back to the screen
sys.stdout = tmpout
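As an alternative, and assuming Python 3.4 or later, the standard library's `contextlib.redirect_stdout` can handle the switching of `sys.stdout` automatically, restoring it even if an error occurs part-way through. The sketch below is not part of the original exercise: `print_report` is a hypothetical stand-in for any function that prints to the screen (such as the `concordance` call), and the output goes to a temporary folder rather than to `2013concord.txt`.

```python
import contextlib
import io
import os
import tempfile

def print_report():
    # Hypothetical stand-in for a function that prints its results,
    # in the way that concord.concordance("privacy", ...) does
    print("Displaying 1 of 1 matches:")
    print("takes the privacy of Canadians")

# Write to a temporary folder so the sketch does not touch the working files
out_path = os.path.join(tempfile.mkdtemp(), "concord_sketch.txt")

with io.open(out_path, "w", encoding="utf-8") as outfile:
    # Everything printed inside this block goes to the file instead of the screen
    with contextlib.redirect_stdout(outfile):
        print_report()

# Outside the block, printing goes to the screen again
with io.open(out_path, "r", encoding="utf-8") as infile:
    saved = infile.read()
print(saved)
```

Both `with` blocks clean up after themselves, so there is no need to close the file or reset `sys.stdout` by hand.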

Below is an example of a sorted concordance. This list is arranged alphabetically on the word to the right of the node.

<img src="sortedConcordance.png">
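A listing sorted on the word to the right of the node, like the one pictured, could be produced with something along the lines of the following sketch. It is only an illustration: `right_word` is a hypothetical helper, it looks at the first occurrence of the node in each line, and the sample lines are copied from the unsorted output shown earlier rather than read from a file.

```python
import re

def right_word(line, node):
    """Return the first word that follows `node` in a concordance line, in lower case.

    Hypothetical helper: it uses the first occurrence of the node in the line
    and returns an empty string if the node is not followed by a word.
    """
    match = re.search(re.escape(node) + r"\W+(\w+)", line, flags=re.IGNORECASE)
    return match.group(1).lower() if match else ""

# Four lines taken from the unsorted concordance output
concordance_lines = [
    "ese victims and when will they take privacy protection seriously? (1455) Hon. D",
    "ystems to make sure that Canadians' privacy is protected. That is why we have e",
    ". Speaker, the government takes the privacy of Canadians extremely seriously. T",
    "able. That is why we brought in the Privacy Commissioner. That is why we brough",
]

# Sort alphabetically on the word immediately to the right of the node
ordered = sorted(concordance_lines, key=lambda line: right_word(line, "privacy"))

for line in ordered:
    print(line)
```

With these four lines the order becomes "Commissioner", "is", "of", "protection". Sorting on the word to the left of the node would need a similar helper that looks backwards from the node instead.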

As we can see from this list, and the list above, there are distinct patterns of word and phrase use. The unsorted list above contains many examples of "Office of the Privacy Commissioner", while the sorted list shows phrases like "privacy of Canadians" and "privacy protection". The manual examination of a concordance allows a researcher to understand how the node word is used in the context of the corpus, and it can help in the formation of hypotheses about the meaning of the word.

In the next exercise we will conduct a more robust statistical examination of the words that accompany the node word, which are known as 'collocates'. Concordance outputs provide a rich context for collocational analysis.