# Lecture 3.3

# 1. Text Processing

Reading files is just the first step. In most cases, we'd like to process larger (collections of) texts, and extract information by, for example, counting specific items such as words (but it can be anything, really, depending on your research question.)

In the remainder of this lecture, we'll get our hands dirty with some real-world examples: we'll look at some operations common in text processing and interrogate actual books.

In what follows, you get a closer look at all the steps that come with such a simple task as counting words.

Word counting is a very rudimentary but nonetheless useful form of content analysis. Moretti coined the term 'distant reading' to argue that texts can be interpreted at some level of abstraction.

In the next lecture we will proceed with more refined instruments, but the techniques shown here should provide you with rudimentary tools for text analysis at scale.

## 1.1 Loading a collection of files

Before we take off, let's load our first data set that contains some of the works of the philosopher John Locke.

This is somewhat different from what we have done until now. Instead of loading just *one* file, we now load a collection of texts. You could do this file by file (which works for a handful of files, but not for, let's say, thousands, unless you are very patient), but Python provides you with a more convenient way to do this. What we need here is a function from the `os` module.

In [5]:
import os

The above line loads the Python `os` module. [CS] A module is a file that contains a collection of related functions grouped together. `os`, in this case, stands for operating system: the module provides you with an interface to interact with the underlying Mac, Windows or Linux system on your computer.

With the `help()` function we can get a sense of what this module contains:

In [6]:
help(os)

One of the functions in this module is `os.listdir()`. It takes as argument the path to a directory and returns the names of all the files and subdirectories present in that directory:

In [3]:
os.listdir('data/locke')

['John Locke - Second Treatise of Government.txt',
 'John Locke - An Essay Concerning Humane Understanding Volume I.txt',
 'John Locke - An Essay Concerning Humane Understanding Volume II.txt']

[CS] To call one of the functions of the module, we have to specify the name of the module and the name of the function, separated by a dot, also known as a period. This format is called **dot notation**.

[PH] Remember how to read files? Each time we had to open a file, read the contents and then close the file. Since this is a series of steps we will often need to do, we can write a single function that does all that for us. We write a small utility function read_file(filename) that reads the specified file and simply returns all contents as a single string.

In [6]:
def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    with open(filename) as infile: # windows users should use codecs.open after importing codecs
        contents = infile.read()
    return contents

Now, instead of having to open a file, read the contents and close the file, we can just call the function read_file to do all that:

In [7]:
text = read_file("data/locke_excerpt.txt")
print(text)

THE EPISTLE TO THE READER

READER,

I have put into thy hands what has been the diversion of some of my idle and heavy hours. 
If it has the good luck to prove so of any of thine, and thou hast but half so much pleasure in reading as I had in writing it, thou wilt as little think thy money, as I do my pains, ill bestowed.
Mistake not this for a commendation of my work; nor conclude, because I was pleased with the doing of it, that therefore I am fondly taken with
it now it is done. 
He that hawks at larks and sparrows has no less sport, though a much less considerable quarry, than he that flies at nobler game: and he is little acquainted with the subject of this treatise--the UNDERSTANDING--who does not know that, as it is the most elevated faculty of the soul, so it is employed with a greater and more constant delight than any of the other. 
Its searches after truth are a sort of hawking and hunting, wherein the very pursuit makes a great part of the pleasure. 
Every step the mind tak
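The comment inside `read_file` hints that the default text encoding can differ between platforms. In Python 3, one way to sidestep this is to pass an explicit `encoding` to `open()`. The sketch below assumes the files are UTF-8 encoded, which the lecture does not state; the name `read_file_utf8` and the throwaway demo file are hypothetical.

```python
import os
import tempfile

def read_file_utf8(filename):
    "Read FILENAME as UTF-8 text and return the contents as a string."
    # Assumption: the file is UTF-8 encoded.
    with open(filename, encoding="utf-8") as infile:
        return infile.read()

# Demo on a temporary file so the example runs without the lecture data.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        outfile.write("Émile, or On Education")
    demo_text = read_file_utf8(demo_path)

print(demo_text)
```

If a file turns out to use a different encoding, Python raises a `UnicodeDecodeError` (or shows garbled characters), which is a signal to try another value for `encoding`.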

In [11]:
def list_textfiles(directory):
    "Return a list of filenames ending in '.txt' in DIRECTORY."
    textfiles = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(os.path.join(directory,filename))
    return textfiles

The function `os.listdir` takes as argument the name of a directory and lists all filenames in that directory. We iterate over this list and, for each filename that ends with the extension `.txt`, append its full path (built with `os.path.join`) to a new list of text files. Using the `list_textfiles` function, the following code reads all text files in the directory `data/locke` and outputs the length (in characters) of each:

In [12]:
for filepath in list_textfiles("data/locke"):
    text = read_file(filepath)
    print(filepath +  " has " + str(len(text)) + " characters.")

data/locke/John Locke - Second Treatise of Government.txt has 313912 characters.
data/locke/John Locke - An Essay Concerning Humane Understanding Volume I.txt has 834572 characters.
data/locke/John Locke - An Essay Concerning Humane Understanding Volume II.txt has 705611 characters.
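As an aside, the same listing can be sketched with the standard `glob` module, which matches filenames against a pattern. This is an alternative to `list_textfiles`, not a replacement for it; the helper name `list_textfiles_glob` is hypothetical, and the demo builds a temporary directory instead of using `data/locke`.

```python
import glob
import os
import tempfile

def list_textfiles_glob(directory):
    "Return a sorted list of paths to the '.txt' files in DIRECTORY."
    pattern = os.path.join(directory, "*.txt")
    # glob.glob returns matches in arbitrary order, so we sort them.
    return sorted(glob.glob(pattern))

# Demo: create three small files, only two of which end in '.txt'.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["b.txt", "a.txt", "notes.md"]:
        with open(os.path.join(tmpdir, name), "w") as outfile:
            outfile.write("placeholder")
    found = [os.path.basename(p) for p in list_textfiles_glob(tmpdir)]

print(found)
```

Sorting the result makes the order of the texts reproducible, which `os.listdir` alone does not guarantee.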


We can now store all of Locke's works in a list:

In [13]:
texts = []
for filepath in list_textfiles("data/locke"):
    texts.append(read_file(filepath)) 

The string method `join()` concatenates these separate works into one string:

In [16]:
print('Before join()')
print(len(texts))
print(type(texts))
texts_joined = '\n'.join(texts)
print('\n')
print('After join()')
print(len(texts_joined))
print(type(texts_joined))

Before join()
3
<class 'list'>


After join()
1854097
<class 'str'>


[CS] The `join` function is the inverse of `split`. It takes a list of strings and concatenates the elements, inserting a delimiter (here a space) between each pair:

In [19]:
sentence = "Singing in the rain"
print("After split(' ')")
words = sentence.split(" ")
print(words)
joined_words = ' '.join(words)
print("\n")
print("After ' '.join()")
print(joined_words)

After split(' ')
['Singing', 'in', 'the', 'rain']


After ' '.join()
Singing in the rain


The string before the dot defines the delimiter to insert between the different elements.

In [20]:
print(' BLABLABLA '.join(words))

Singing BLABLABLA in BLABLABLA the BLABLABLA rain


### 1.1.1. Writing a count function

[PH]Just to recap some of the stuff we learnt in the previous chapter. Can you write code that defines the variable `number_of_es` and counts how many times the letter *e* occurs in `texts_joined`? (Tip: use a `for` loop and an `if` statement)

In [28]:
number_of_es = 0
# insert your code here
for character in texts_joined:
    if character == 'e':
        number_of_es += 1

# The following test should print True if your code is correct 
print(number_of_es == 183448)

True


[PH]In the previous quiz, you probably wrote a loop that iterates over all characters in `texts_joined` and adds 1 to `number_of_es` each time the program finds the letter *e*. Counting objects in a text is a very common thing to do. Therefore, Python provides the convenient function `count`. This function operates on strings (`somestring.count(argument)`) and takes as argument the string you want to count. Using this function, the solution to the quiz above can now be rewritten as follows:

In [29]:
number_of_es = texts_joined.count("e")
print(number_of_es)

183448


In fact, `count` takes as argument any string you would like to find. We could just as well count how often the determiner `an` occurs:

In [30]:
print(texts_joined.count("an"))

24416


The string `an` is found 24416 times in our text. Does that mean that the word *an* occurs 24416 times in our text? No, in fact, *an* (the word) occurs only approximately 760 times... Think about this. Why then does Python print 24416?

If we want to count how often the word *an* occurs in the text and not the string `an`, we could surround *an* with spaces, like the following:

In [32]:
print(texts_joined.count(" an "))

760


Although it gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of *an* followed by a semicolon or some end-of-sentence marker? Then we would need to query the text multiple times, once for each possible context of *an*. For that reason, we're going to approach the problem using a different, more sophisticated strategy.
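To see concretely why both counts are off, consider the small sketch below. The sentence is made up for illustration and is not taken from Locke; the variable names are likewise hypothetical.

```python
# Made-up sentence in which the word "an" appears four times
# (once capitalized, twice followed directly by punctuation).
sample = "An apple and an orange; is that an, or the, answer? Give me an."

substring_hits = sample.count("an")  # also matches inside "and", "orange", "answer"
spaced_hits = sample.count(" an ")   # misses "An", "an," and "an."

print(substring_hits)
print(spaced_hits)
```

The first count is too high because `count` looks for the characters `an` anywhere, and the second is too low because it requires a space on both sides.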

Recall from the previous chapter the function `split`. What does this function do? The function `split` operates on a string: called without arguments, it splits the string on whitespace (spaces, tabs and newlines) and returns a list of smaller strings (or words):

In [34]:
text = texts_joined.split()
print(text[:100])

['SECOND', 'TREATISE', 'OF', 'GOVERNMENT', 'by', 'JOHN', 'LOCKE', 'Digitized', 'by', 'Dave', 'Gowan', '<dgowan@tfn.net>.', 'John', "Locke's", '"Second', 'Treatise', 'of', 'Government"', 'was', 'published', 'in', '1690.', 'The', 'complete', 'unabridged', 'text', 'has', 'been', 'republished', 'several', 'times', 'in', 'edited', 'commentaries.', 'This', 'text', 'is', 'recovered', 'entire', 'from', 'the', 'paperback', 'book,', '"John', 'Locke', 'Second', 'Treatise', 'of', 'Government",', 'Edited,', 'with', 'an', 'Introduction,', 'By', 'C.B.', 'McPherson,', 'Hackett', 'Publishing', 'Company,', 'Indianapolis', 'and', 'Cambridge,', '1980.', 'None', 'of', 'the', 'McPherson', 'edition', 'is', 'included', 'in', 'the', 'Etext', 'below;', 'only', 'the', 'original', 'words', 'contained', 'in', 'the', '1690', 'Locke', 'text', 'is', 'included.', 'The', '1690', 'edition', 'text', 'is', 'free', 'of', 'copyright.', '*', '*', '*', '*', '*', 'TWO']


### 1.1.2 Counting with dictionaries

In the previous chapter you acquainted yourself with the `dictionary` structure. Recall that a dictionary consists of keys and values and allows you to quickly look up a value. We will use a dictionary to write the function `counter` that takes as argument a list and returns a `dictionary` that holds, for each unique item, the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.

We start with defining a variable `counts` which is an empty dictionary:

Next we will loop over all words in our list `words`. For each word, we check whether the dictionary already contains it. If so, we add 1 to its value. If not, we first add the word to the dictionary and assign to it the value 1.

We only use the first 1000 words as an example, but the code should work on the whole collection of texts as well!

In [35]:
words = text[:1000]

In [37]:
counts = {}

In [38]:
for word in words:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
print(counts)

{'SECOND': 1, 'TREATISE': 1, 'OF': 5, 'GOVERNMENT': 2, 'by': 5, 'JOHN': 1, 'LOCKE': 2, 'Digitized': 1, 'Dave': 1, 'Gowan': 1, '<dgowan@tfn.net>.': 1, 'John': 1, "Locke's": 1, '"Second': 1, 'Treatise': 2, 'of': 31, 'Government"': 1, 'was': 3, 'published': 2, 'in': 14, '1690.': 1, 'The': 4, 'complete': 1, 'unabridged': 1, 'text': 4, 'has': 5, 'been': 3, 'republished': 1, 'several': 2, 'times': 2, 'edited': 1, 'commentaries.': 1, 'This': 1, 'is': 8, 'recovered': 1, 'entire': 1, 'from': 3, 'the': 40, 'paperback': 1, 'book,': 1, '"John': 1, 'Locke': 2, 'Second': 1, 'Government",': 1, 'Edited,': 1, 'with': 8, 'an': 6, 'Introduction,': 1, 'By': 1, 'C.B.': 1, 'McPherson,': 1, 'Hackett': 1, 'Publishing': 1, 'Company,': 1, 'Indianapolis': 1, 'and': 28, 'Cambridge,': 1, '1980.': 1, 'None': 1, 'McPherson': 1, 'edition': 2, 'included': 1, 'Etext': 1, 'below;': 1, 'only': 3, 'original': 1, 'words': 2, 'contained': 1, '1690': 2, 'included.': 1, 'free': 1, 'copyright.': 1, '*': 5, 'TWO': 2, 'TREATISES

[PH]If you don't remember anymore how dictionaries work, go back to the previous chapter and read the part about dictionaries once more.

Now that our code is working, we can add it to a function. We define the function `counter` using the `def` keyword. It takes one argument (`list_to_search`).

In [None]:
def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts

[PH]Hopefully we are not boring you, but let's go through this function step by step.

1. We define a function using `def` and give it the name `counter` (line 1);
2. This function takes a single argument `list_to_search` which is the list we want to search through (line 1);
3. Next we define a variable `counts` which is an empty dictionary (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If the word is already in `counts`, we look up its current value and add 1 to it (line 4-5);
6. If the word is not in `counts` (else clause), we add the word to the dictionary and assign it the value 1 (line 6-7);
7. Finally, we return the `counts` dictionary (line 8).

Let's try out our new function!
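A minimal way to try it out is sketched below. The function is repeated so that the cell runs on its own, and the short word list is made up for illustration; with the lecture data loaded you could pass `words` instead.

```python
def counter(list_to_search):
    "Return a dictionary mapping each item in LIST_TO_SEARCH to its frequency."
    counts = {}
    for word in list_to_search:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts

# Hypothetical sample input, split into a list of words.
sample_words = "the cat and the hat and the bat".split()
sample_counts = counter(sample_words)

print(sample_counts)
print(sample_counts["the"])
```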

## 1.2 Preprocessing text data

[PH]In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The function `split` is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite an amount of noise. For instance, the pronoun *her* occurs 4 times, but we also find `her.` occurring 1 time and the capitalized `Her`, also 1 time. Of course we would like to add those counts to that of *her*. As it appears, the tokenization of our text using `split` is fast and simple, but it leaves us with noisy and incorrect frequency distributions.

There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure of splitting our text into words. The second is to clean-up our text and pass this clean result to the convenient `split` function. For now we will follow the second path.

Some words in our text are capitalized. To lowercase these words, Python provides the function `lower`, which operates on strings.

### 1.2.1 Converting to lower (or upper) case

A standard procedure in text-processing is lowercasing. This converts all capital characters to lowercase, such as in:

In [None]:
print('Donald Duck'.lower())

# Or the same

name = 'Trevor Noah'
print(name.lower())

Again, why would we do this? The choice for lowercasing is arbitrary; we could just as well uppercase the whole text. The point here is **uniformisation**, i.e. we want to discard differences between elements that are not relevant to our research question. Here, we make the explicit choice to treat 'Hamburger' and 'hamburger' as the same item. We want to count them as the same word.

Equally, a word that is capitalized only because it opens a sentence remains, from a semantic point of view, the same word as when it appears elsewhere in the sentence (in the majority of cases).

In [None]:
print("Don't lowercase this!".upper())

Of course, this is a choice, and should always be reported.

### 1.2.2 Deleting punctuation

[PH]Lowercasing solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The function `replace` is just the function we're looking for. It takes two arguments: (1) the string we would like to replace and (2) the string we want to replace the first argument with:

In [39]:
x = 'Please. remove. all. dots. from. this. sentence.'
x = x.replace(".", "")
print(x)

Please remove all dots from this sentence


[PH]We would like to remove all punctuation from a text, not just dots and commas. We will write a function called `remove_punc` that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call `replace` on the same string each time replacing a different punctuation character with an empty string. 

In [41]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [42]:
def remove_punc(text):
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc(short_text))

Commas as it turns out are overestimated Dots however even more so
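The text above announces two strategies but shows only the first. One possible second strategy, offered here as an assumption rather than as the authors' intended solution, is to build a translation table with `str.maketrans` and apply it in a single pass with `str.translate`. The function name `remove_punc_translate` is hypothetical.

```python
from string import punctuation

def remove_punc_translate(text):
    "Return TEXT with all characters in string.punctuation deleted."
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", punctuation)
    return text.translate(table)

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
cleaned = remove_punc_translate(short_text)
print(cleaned)
```

Both versions return the same result for this sentence; the `translate` version walks through the text once instead of once per punctuation character.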


[PH] Now it is time to put everything together. We want to write a function `clean_text` that takes as argument a text represented as a string. The function should return this string with all punctuation removed and all characters lowercased.

In [43]:
def clean_text(text):
    # insert your code here
    return remove_punc(text.lower())
    
# The following test should print True if your code is correct 
short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(clean_text(short_text) == 
      "commas as it turns out are overestimated dots however even more so")

True


## 1.3. Counting (the easy way)

In [47]:
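from collections import Counter
# Note: the name `counter` now refers to this Counter object,
# no longer to the counter() function defined earlier.
counter = Counter(words)
print(counter)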

Counter({'the': 40, 'of': 31, 'to': 30, 'and': 28, 'in': 14, 'be': 14, 'have': 13, 'I': 13, 'his': 12, 'not': 10, 'so': 10, 'is': 8, 'with': 8, 'AND': 8, 'my': 8, 'they': 8, 'which': 7, 'that': 7, 'he': 7, 'for': 7, 'an': 6, 'a': 6, 'all': 6, 'or': 6, 'OF': 5, 'by': 5, 'has': 5, '*': 5, 'THE': 5, 'him': 5, 'should': 5, 'up': 5, 'any': 5, 'The': 4, 'text': 4, '1.': 4, 'this': 4, 'than': 4, 'it': 4, 'are': 4, 'there': 4, 'will': 4, 'those': 4, 'shall': 4, 'Sir': 4, 'either': 4, 'was': 3, 'been': 3, 'from': 3, 'only': 3, 'C.': 3, 'R.': 3, 'were': 3, 'here': 3, 'what': 3, 'worth': 3, 'our': 3, 'make': 3, 'one': 3, 'their': 3, 'on': 3, 'If': 3, 'no': 3, 'against': 3, 'at': 3, 'cannot': 3, 'done': 3, 'GOVERNMENT': 2, 'LOCKE': 2, 'Treatise': 2, 'published': 2, 'several': 2, 'times': 2, 'Locke': 2, 'edition': 2, 'words': 2, '1690': 2, 'TWO': 2, 'TREATISES': 2, 'BY': 2, 'A.': 2, 'B.': 2, 'RIVINGTON,': 2, 'W.': 2, 'S.': 2, 'T.': 2, 'GOVERNMENT.': 2, 'present': 2, 'last': 2, 'concerning': 2, 'gov

In [49]:
counter.most_common(10)

[('the', 40),
 ('of', 31),
 ('to', 30),
 ('and', 28),
 ('in', 14),
 ('be', 14),
 ('have', 13),
 ('I', 13),
 ('his', 12),
 ('not', 10)]

What a disappointment, you might think at this point. All this work, just to obtain a list of these 'boring' words? The most frequent words are mostly not very informative, at least not if you want to pin down the topic of a text. These most frequent words are often called stopwords.

On a side note: words have a rather persistent distribution, in which a small set of words is very frequent. For example, the 10 most frequent words take up around 26% of the total; put differently, 0.1% of the total vocabulary (the ten most frequent words) is enough to compose 26% of the text. For n=100, 60% of the text is composed of only 1.4 percent of the vocabulary. If you want to compress your text, just throw away the 10 most frequent words!

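The code below allows you to check this for `words`. Because `words` holds only the first 1000 tokens, the percentages it prints differ from the figures quoted above.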

In [51]:
topn = 100
total_words = len(words)
total_topn = sum([v for w,v in counter.most_common(topn)])
print(topn/len(set(words))*100)
print(total_topn/total_words*100)

18.587360594795538
53.300000000000004
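Note that `words` was never passed through `clean_text`, which is why the counts above list `'OF'` and `'of'`, or `'GOVERNMENT'` and `'GOVERNMENT.'`, as separate items. The sketch below combines the cleaning step from section 1.2 with `Counter` on a made-up sentence; `clean_text` is repeated so that the cell runs on its own.

```python
from collections import Counter
from string import punctuation

def clean_text(text):
    "Return TEXT lowercased and with all punctuation removed."
    text = text.lower()
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

# Hypothetical sample with mixed case and punctuation.
sample = "The Law of Nature, and THE law of reason. Of the law: the LAW!"

raw_counts = Counter(sample.split())
clean_counts = Counter(clean_text(sample).split())

print(raw_counts["the"], raw_counts["law"])
print(clean_counts["the"], clean_counts["law"])
```

With the lecture data, the same idea would amount to calling `clean_text(texts_joined).split()` before counting.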


## 1.4. Filtering

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stopw = stopwords.words('english')
print(stopw)

In [None]:
wf_filtered = Counter(w for w in words if w not in stopw)
wf_filtered.most_common(25)
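One caveat with the filter above: the NLTK English stopword list consists of lowercase words, so capitalized tokens such as `'The'` or `'AND'` slip through unless the tokens are lowercased first. The sketch below illustrates this with a small, hypothetical stand-in for the stopword list, so that it runs without downloading anything.

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); a set makes
# the membership test fast.
stopw_demo = {"the", "of", "and", "to", "in", "is"}

tokens = "The state of nature is a state of liberty and The law of reason".split()

# Filtering without lowercasing keeps the capitalized 'The'.
naive = Counter(w for w in tokens if w not in stopw_demo)

# Lowercasing each token first removes it as intended.
lowered = Counter(w.lower() for w in tokens if w.lower() not in stopw_demo)

print(naive.most_common(3))
print(lowered.most_common(3))
```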

## 1.5. Storing Output
### CSV Files

The previous steps reduced a book to a table of word frequencies. You do not want to repeat this procedure every time; instead, save the table as an intermediate result. A convenient format is a CSV file, where CSV stands for Comma-Separated Values. The comma in this case is called the **delimiter**: the character that separates the items on each row. The end of a row is usually marked by a newline.

The content of an example CSV file:

```
ideas,1398
one,911
idea,886
```



In [None]:
content = ''
for key,value in wf_filtered.items():
    # One row per word: word, comma, count, newline
    line = key+','+str(value)+'\n'
    content+=line
    
# or more concise
#content = '\n'.join(["{},{}".format(k,v) for k,v in wf_filtered.items()])

In [None]:
filename = "data/wf.csv"
with open(filename, "w") as outfile:
    outfile.write(content)

In [None]:
!ls data
!head data/wf.csv
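Building the rows by hand breaks down when a word itself contains a comma, which happens here because punctuation was not removed (for example `'RIVINGTON,'`). A more robust sketch uses the standard `csv` module, which quotes such values automatically. The small `wf_demo` counter, the header row and the temporary file are hypothetical stand-ins for `wf_filtered` and `data/wf.csv`.

```python
import csv
import os
import tempfile
from collections import Counter

# Hypothetical stand-in for wf_filtered.
wf_demo = Counter({"ideas": 1398, "one": 911, "idea": 886, "RIVINGTON,": 2})

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "wf_demo.csv")

    # newline="" lets the csv module control the line endings itself.
    with open(demo_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["word", "count"])       # header row
        writer.writerows(wf_demo.most_common())  # rows sorted by frequency

    # Read the file back to check that the comma inside a word survived.
    with open(demo_path, newline="", encoding="utf-8") as infile:
        rows = list(csv.reader(infile))

print(rows)
```

Note that values read back from a CSV file are strings, so the counts need to be converted with `int()` before doing arithmetic on them.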

-  Go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the book. Sort the words in the dictionary by frequency and write them to a text file called `frequencies.txt`. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Search the web to find out how you can sort a dictionary. This is not easy, because you might have to import another module.