# Counting words in Austen

I wanted to reproduce the word counts for the Austen corpus that I found in Voyant. First I experimented with a single text: I opened it, read it into a string, and counted the instances of "Mr". Note that this search is case sensitive, but it also picks up "Mr" inside other words (namely "Mrs", I imagine).

In [3]:
# opening up the file, printing some of it to make sure it's working:
with open ("austen\\1790 Love And Freindship.txt", "r") as f:
    austen1790 = f.read()
austen1790[0:80]

'LOVE AND FREINDSHIP AND OTHER EARLY WORKS\n\n(Love And Friendship And Other Early '

In [6]:
# Counting the instances of "Mr":
print("Instances of 'Mr' in Love and Freindship:", austen1790.count("Mr"))

Instances of 'Mr' in Love and Freindship: 49
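
To see why the plain count is inflated, here is a minimal sketch on a made-up sentence (the sample text is an assumption, not taken from the corpus). It compares the substring count with a count of whole words:

```python
# Hypothetical sample sentence, not from the Austen texts
sample = "Mr Darcy bowed to Mrs Bennet. Mr. Bingley smiled."

# str.count looks for the raw substring, so the "Mr" inside "Mrs" is counted too
substring_hits = sample.count("Mr")

# Splitting into whole words first keeps the two titles apart
words = sample.replace(".", " ").split()
mr_only = words.count("Mr")
mrs_only = words.count("Mrs")

print(substring_hits, mr_only, mrs_only)
```

The substring count equals the two whole-word counts added together, which is the kind of overlap a word-aware search has to remove.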


I wanted to search through all the text files, however, so I collected their paths using glob:

In [7]:
import glob
textFiles = glob.glob("austen\\*txt")
textFiles

['austen\\1790 Love And Freindship.txt',
 'austen\\1805 Lady Susan.txt',
 'austen\\1811 Sense and Sensibility.txt',
 'austen\\1813 Pride and Prejudice.txt',
 'austen\\1814 Mansfield Park.txt',
 'austen\\1815 Emma.txt',
 'austen\\1818 Northanger Abbey.txt',
 'austen\\1818 Persuasion.txt']

To solve the issue of picking up words within words, I turned to regular expressions.

I wrote a loop to open the texts and convert them to upper case before searching for "MR", in case the texts had typos where "Mr" was not capitalized. I used a regular expression to search for the term. I opted to require a non-word character on either side of "Mr" ([^\w], meaning anything other than a letter, digit or underscore). This excludes "Mrs" while still allowing "Mr" to appear with or without a period.
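
As a small illustration of what that pattern accepts and rejects, here is a sketch on a made-up, already upper-cased sentence (the sample string is an assumption):

```python
import re

# Hypothetical sample, already converted to upper case
sample = "He saw MR. DARCY, then MR BINGLEY and MRS BENNET."

# [^\w] matches one character that is not a letter, digit or underscore.
# A space or a period after "MR" qualifies; the "S" in "MRS" does not.
matches = re.findall(r"[^\w]MR[^\w]", sample)
print(matches)
```

Each match includes the character on either side of "MR", which is why the results carry a leading space and a trailing space or period.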

I was getting a codec error message saying that one of the bytes couldn't be properly decoded. An online search suggested I specify the encoding using encoding="utf-8".
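
One way this kind of error can arise, sketched under the assumption that the default encoding was Windows cp1252 (the backslash paths suggest Windows, but this is a guess) and that the files contain UTF-8 curly quotes:

```python
# A right curly quote, common in e-texts, stored as UTF-8 bytes
raw_bytes = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

try:
    # Assumed default encoding on Windows; byte 0x9d has no meaning in cp1252
    raw_bytes.decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("cp1252 failed:", err)

# Decoding with the encoding the bytes were written in works
recovered = raw_bytes.decode("utf-8")
print(recovered)
```

Passing encoding="utf-8" to open makes Python decode the file the second way instead of relying on the platform default.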

I wanted to print the number of times "Mr" was used in each text, so I used the len function, which returns how many items are in the list of matches for "Mr". I wanted a final count of all the occurrences of the word as well, so I used the += operator to add the count from each text to the overall count.

In [8]:
import re

totalMr = 0 # This is the count
for textFile in textFiles:
    f = open(textFile, "r", encoding="utf-8") # Needed to specify the encoding because I was getting a codec error
    textString = f.read().upper() # converted text to upper case
    f.close()
    mr = re.findall(r"([^\w]MR[^\w])", textString) # raw string so that \w reaches the regex engine intact; "MR" has to be in upper case to match the upper case text
    mr2 = len(mr) # tells us how long the list "mr" is in order to give word counts
    print(textFile, ":", mr2) 
    totalMr += mr2 # adds the number of occurrences of "Mr" found in each string after each iteration to the total number of occurrences
print("total counts of Mr/Mr.: ", totalMr)


austen\1790 Love And Freindship.txt : 25
austen\1805 Lady Susan.txt : 77
austen\1811 Sense and Sensibility.txt : 178
austen\1813 Pride and Prejudice.txt : 785
austen\1814 Mansfield Park.txt : 482
austen\1815 Emma.txt : 1153
austen\1818 Northanger Abbey.txt : 161
austen\1818 Persuasion.txt : 256
total counts of Mr/Mr.:  3117


I didn't want to have to go through all these steps for each word, so I wrote a function to do it for me. As you can see, the function does the same thing as above, simply replacing "Mr" with a variable called "word".

In [9]:
def getWordCount(word):
    totalWord = 0
    for textFile in textFiles:
        f = open(textFile, "r", encoding="utf-8") 
        textString = f.read().upper()
        f.close()
        wordList = re.findall(r"[^\w]"+word.upper()+r"[^\w]", textString) # I upper-case the input word here so I don't have to remember to type it in all caps every time
        numWords = len(wordList)
        print(textFile, ":", numWords)
        totalWord += numWords
    print("Total count of", word, ":", totalWord)

I could then look up the counts in each text for any word I wanted:

In [10]:
getWordCount("Mrs")

austen\1790 Love And Freindship.txt : 24
austen\1805 Lady Susan.txt : 61
austen\1811 Sense and Sensibility.txt : 530
austen\1813 Pride and Prejudice.txt : 343
austen\1814 Mansfield Park.txt : 408
austen\1815 Emma.txt : 699
austen\1818 Northanger Abbey.txt : 175
austen\1818 Persuasion.txt : 291
Total count of Mrs : 2531


I didn't actually need to know the word counts for the individual texts, so I then altered the function to only include the total count for all the texts.

In [11]:
def getWordCount(word):
    totalWord = 0
    for textFile in textFiles:
        f = open(textFile, "r", encoding="utf-8") 
        textString = f.read().upper()
        f.close()
        wordList = re.findall(r"[^\w]"+word.upper()+r"[^\w]", textString)
        numWords = len(wordList) # removed instructions to print text files + counts after this line
        totalWord += numWords 
    return totalWord

Then I decided to simply make a list of the words I wanted to search for and use a loop to pass each word to my function, ending up with a list of how many times each word appeared in my corpus.

In [12]:
wordList = ["mr", "mrs", "said", "miss", "think", "know", "good", "time", "little", "soon", "say", "great", "lady", "dear", "shall", "sir", "quite", "man", "thought", "fanny"]
for word in wordList:
    count = getWordCount(word)
    print(word, count)

mr 3117
mrs 2531
said 2165
miss 1942
think 1514
know 1450
good 1444
time 1432
little 1363
soon 1124
say 1109
great 1053
lady 1042
dear 960
shall 938
sir 921
quite 904
man 928
thought 866
fanny 980
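
The loop above re-opens and re-reads every file once per word. A possible variation, sketched here with hypothetical helper names (load_texts and count_word are not part of the notebook), reads each file once and reuses the strings. It also wraps the word in re.escape so that punctuation in a search term is treated literally:

```python
import re

def load_texts(paths):
    """Read each file once and return {path: upper-cased text}."""
    texts = {}
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            texts[path] = f.read().upper()
    return texts

def count_word(word, texts):
    """Total matches of word across all texts, same pattern as getWordCount."""
    pattern = re.compile(r"[^\w]" + re.escape(word.upper()) + r"[^\w]")
    return sum(len(pattern.findall(text)) for text in texts.values())

# In-memory stand-ins for the corpus so the sketch runs without the files
demo_texts = {
    "a.txt": " MR DARCY AND MRS BENNET ",
    "b.txt": "SAID MR. BINGLEY. ",
}
for word in ["mr", "mrs"]:
    print(word, count_word(word, demo_texts))
```

With the real corpus, the call would be something like count_word("mr", load_texts(textFiles)), assuming textFiles is the glob list from earlier.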


The only difference between my word counts and those in Voyant comes from the regular expression. Because I used "[^\w]"+word+"[^\w]", I am picking up instances where the word is followed by an apostrophe (denoting possession), so my counts of "lady," "man," and "fanny" are all higher than those in Voyant, which counts "lady" and "lady's" as two separate words. I prefer picking up both in my search, but if I wanted to match Voyant more exactly, I could change my regular expression to read "[^\w]"+word+"[^'A-Z]" (a non-word character, the word, then anything other than an apostrophe or a letter; the letters are upper case because the text has been converted to upper case). For example:

In [13]:
def getWordCount2(word):
    totalWord = 0
    for textFile in textFiles:
        f = open(textFile, "r", encoding="utf-8") 
        textString = f.read().upper()
        f.close()
        wordList = re.findall(r"[^\w]"+word.upper()+r"[^'A-Z]", textString)
        numWords = len(wordList) 
        totalWord += numWords 
    return totalWord

wordList2 = ["lady", "man", "fanny"]
for word in wordList2:
    count = getWordCount2(word)
    print(word, count)


lady 1012
man 887
fanny 864


Even so, I am mysteriously off by one for both "lady" and "fanny", and I couldn't figure out why. I decided to try to find adverbs in one of the texts to take my mind off my frustration...
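
One possible source of small undercounts, offered as a guess rather than a confirmed diagnosis: the pattern consumes the character on each side of the word, so it cannot match a word at the very start of a text, or the second of two occurrences separated by a single character. The sketch below uses made-up strings and compares it with a word-boundary pattern, which is a different technique from the one in the notebook:

```python
import re

consuming = r"[^\w]FANNY[^'A-Z]"   # pattern used in getWordCount2
boundary = r"\bFANNY\b(?!')"       # zero-width boundaries, no apostrophe after

# Hypothetical case 1: the word is the first thing in the text
at_start = "FANNY WAS LATE."
# Hypothetical case 2: two occurrences share the single character between them
shared = "DEAR FANNY-FANNY PRICE"

for text in (at_start, shared):
    print(len(re.findall(consuming, text)), len(re.findall(boundary, text)))
```

Because \b and the lookahead match positions rather than characters, neighbouring occurrences no longer compete for the same delimiter. Whether this explains the difference from Voyant would need checking against the actual texts.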

In [14]:
adverbs = re.findall(r"\w+ly", austen1790) # austen1790 was defined above (refers to "1790 Love And Freindship.txt")
adverbs[0:8]

['Early',
 'comply',
 'Surely',
 'surely',
 'considerably',
 'lovely',
 'shortly',
 'tremblingly']
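
A caveat on this pattern, shown on a made-up sentence: without word boundaries, \w+ly can stop partway through a longer word. Adding \b on both sides restricts matches to whole words ending in "ly", although that still includes non-adverbs such as "family" or "comply":

```python
import re

# Hypothetical sample sentence
sample = "She spoke softly of Elysium and the family."

loose = re.findall(r"\w+ly", sample)       # may match the start of a longer word
strict = re.findall(r"\b\w+ly\b", sample)  # whole words ending in "ly" only

print(loose)
print(strict)
```

Telling real adverbs apart from other "-ly" words would take part-of-speech information, which a regular expression alone cannot provide.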

"Tremblingly" was my favourite.