# Word statistics in a text file 

## The problem

Consider a large body of text, such as a novel, or a collection of texts (aka a _corpus_). Record the frequency of every word in the text. Display the most common words.

## The approach

* Download a novel from [Gutenberg](http://www.gutenberg.org/) as a plain text file.
* Remove all punctuation signs and convert all text to lowercase.
* Use a dictionary to record the frequency of every word. The dictionary keys are the words, and values are how many times the word is seen in the text.

## Remove punctuation

We choose ["The Time Machine"](http://www.gutenberg.org/cache/epub/35/pg35.txt) by H. G. Wells to investigate.

Python lets us work by trial and error, but a novel-length text is inconvenient for that. We first work with the opening paragraph, then extend to the entire text.

In [1]:
par = """
The Time Traveller (for so it will be convenient to speak of him)
was expounding a recondite matter to us. His grey eyes shone and
twinkled, and his usually pale face was flushed and animated. The
fire burned brightly, and the soft radiance of the incandescent
lights in the lilies of silver caught the bubbles that flashed and
passed in our glasses. Our chairs, being his patents, embraced and
caressed us rather than submitted to be sat upon, and there was that
luxurious after-dinner atmosphere when thought roams gracefully
free of the trammels of precision. And he put it to us in this
way--marking the points with a lean forefinger--as we sat and lazily
admired his earnestness over this new paradox (as we thought it)
and his fecundity.
"""

One way to remove punctuation is to split the text into words and then use `strip` to remove the punctuation characters from either side of each word. Here we instead delete every punctuation character from the text with `translate`. Imperfect, but simple.
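
The `strip`-based approach mentioned above can be sketched as follows. This is only an illustration: the `sample` sentence and the name `stripped_words` are made up for this example and are not used elsewhere in the notebook.

```python
import string

# A short sample sentence (hypothetical, chosen only for illustration).
sample = "The fire burned brightly, and (as we thought it) after-dinner talk began."

# Split on whitespace, then strip punctuation from both ends of each word.
stripped_words = [w.strip(string.punctuation).lower() for w in sample.split()]

# Tokens made only of punctuation become empty strings; drop them.
stripped_words = [w for w in stripped_words if w]

print(stripped_words)
```

Note that `strip` only touches the ends of a token, so punctuation inside a word, such as the hyphen in "after-dinner", is kept.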

We get the punctuation characters from the `string` module.

In [4]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
translator = str.maketrans({key: None for key in string.punctuation})
par.translate(translator)

'\nThe Time Traveller for so it will be convenient to speak of him\nwas expounding a recondite matter to us His grey eyes shone and\ntwinkled and his usually pale face was flushed and animated The\nfire burned brightly and the soft radiance of the incandescent\nlights in the lilies of silver caught the bubbles that flashed and\npassed in our glasses Our chairs being his patents embraced and\ncaressed us rather than submitted to be sat upon and there was that\nluxurious afterdinner atmosphere when thought roams gracefully\nfree of the trammels of precision And he put it to us in this\nwaymarking the points with a lean forefingeras we sat and lazily\nadmired his earnestness over this new paradox as we thought it\nand his fecundity\n'
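
The output above shows a side effect of deleting punctuation: "way--marking" becomes "waymarking" and "forefinger--as" becomes "forefingeras". One possible variation, which is not what the rest of this notebook uses, is to map each punctuation character to a space instead. The names `space_translator` and `spaced_words` below are assumptions made for this sketch.

```python
import string

# Map every punctuation character to a space instead of deleting it.
# The two-argument form of str.maketrans needs strings of equal length.
space_translator = str.maketrans(string.punctuation, " " * len(string.punctuation))

sample = "in this way--marking the points with a lean forefinger--as we sat"

# split() collapses the runs of spaces left behind by the dashes.
spaced_words = sample.translate(space_translator).lower().split()
print(spaced_words)
```

The trade-off is that hyphenated words and contractions are also split: "after-dinner" would become "after" and "dinner", and "don't" would become "don" and "t".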

We don't worry about the newlines because they will go away with `split()`. We also convert the text to lowercase so that "The" and "the" count as the same word. Now count the occurrences of every word by looping over the list of words.

In [19]:
words = par.translate(translator).lower().split()

In [20]:
words

['the',
 'time',
 'traveller',
 'for',
 'so',
 'it',
 'will',
 'be',
 'convenient',
 'to',
 'speak',
 'of',
 'him',
 'was',
 'expounding',
 'a',
 'recondite',
 'matter',
 'to',
 'us',
 'his',
 'grey',
 'eyes',
 'shone',
 'and',
 'twinkled',
 'and',
 'his',
 'usually',
 'pale',
 'face',
 'was',
 'flushed',
 'and',
 'animated',
 'the',
 'fire',
 'burned',
 'brightly',
 'and',
 'the',
 'soft',
 'radiance',
 'of',
 'the',
 'incandescent',
 'lights',
 'in',
 'the',
 'lilies',
 'of',
 'silver',
 'caught',
 'the',
 'bubbles',
 'that',
 'flashed',
 'and',
 'passed',
 'in',
 'our',
 'glasses',
 'our',
 'chairs',
 'being',
 'his',
 'patents',
 'embraced',
 'and',
 'caressed',
 'us',
 'rather',
 'than',
 'submitted',
 'to',
 'be',
 'sat',
 'upon',
 'and',
 'there',
 'was',
 'that',
 'luxurious',
 'afterdinner',
 'atmosphere',
 'when',
 'thought',
 'roams',
 'gracefully',
 'free',
 'of',
 'the',
 'trammels',
 'of',
 'precision',
 'and',
 'he',
 'put',
 'it',
 'to',
 'us',
 'in',
 'this',
 'waymark

In [21]:
frequency = {}  # empty dictionary
for word in words:
    if word not in frequency:
        frequency[word] = 1
    else:
        frequency[word] += 1
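
As an aside, the same tally can be written more compactly. The sketch below uses a small made-up word list (`sample_words`) and compares a `dict.get` loop with `collections.Counter` from the standard library. It is an alternative, not the method used in this notebook.

```python
from collections import Counter

# A tiny hypothetical word list, standing in for the real `words`.
sample_words = ["the", "time", "traveller", "the", "fire", "and", "the", "time"]

# Manual counting with dict.get: a missing key is treated as a count of 0.
manual = {}
for word in sample_words:
    manual[word] = manual.get(word, 0) + 1

# Counter performs the same tally in a single call.
counted = Counter(sample_words)

print(manual)
print(dict(counted) == manual)
```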

In [22]:
frequency

{'a': 2,
 'admired': 1,
 'afterdinner': 1,
 'and': 10,
 'animated': 1,
 'as': 1,
 'atmosphere': 1,
 'be': 2,
 'being': 1,
 'brightly': 1,
 'bubbles': 1,
 'burned': 1,
 'caressed': 1,
 'caught': 1,
 'chairs': 1,
 'convenient': 1,
 'earnestness': 1,
 'embraced': 1,
 'expounding': 1,
 'eyes': 1,
 'face': 1,
 'fecundity': 1,
 'fire': 1,
 'flashed': 1,
 'flushed': 1,
 'for': 1,
 'forefingeras': 1,
 'free': 1,
 'glasses': 1,
 'gracefully': 1,
 'grey': 1,
 'he': 1,
 'him': 1,
 'his': 5,
 'in': 3,
 'incandescent': 1,
 'it': 3,
 'lazily': 1,
 'lean': 1,
 'lights': 1,
 'lilies': 1,
 'luxurious': 1,
 'matter': 1,
 'new': 1,
 'of': 5,
 'our': 2,
 'over': 1,
 'pale': 1,
 'paradox': 1,
 'passed': 1,
 'patents': 1,
 'points': 1,
 'precision': 1,
 'put': 1,
 'radiance': 1,
 'rather': 1,
 'recondite': 1,
 'roams': 1,
 'sat': 2,
 'shone': 1,
 'silver': 1,
 'so': 1,
 'soft': 1,
 'speak': 1,
 'submitted': 1,
 'than': 1,
 'that': 2,
 'the': 8,
 'there': 1,
 'this': 2,
 'thought': 2,
 'time': 1,
 'to': 4,
 

Now we apply this to the entire text.

## Read file and count all words

In [24]:
frequency = {}
# The with-statement closes the file automatically, even if an error occurs.
with open("pg35.txt") as fin:
    # Read file line by line
    for line in fin:
        words = line.translate(translator).lower().split()
        for word in words:
            if word not in frequency:
                frequency[word] = 1
            else:
                frequency[word] += 1

To get the most common words, we need to sort the results. Dictionaries do not support sorting, so we work with a list of key-value pairs.

The `items` method of a dictionary returns a view of its key-value pairs (as tuples). The `sorted` function accepts this view and returns a new sorted list. We sort by the second element of each pair, the count, in descending order.

In [28]:
frequency.items()
sorted(frequency.items(), key=lambda x: x[1], reverse=True)[:30]

[('the', 2431),
 ('and', 1298),
 ('of', 1268),
 ('i', 1241),
 ('a', 860),
 ('to', 758),
 ('in', 598),
 ('was', 549),
 ('that', 451),
 ('my', 438),
 ('it', 431),
 ('had', 352),
 ('me', 280),
 ('as', 275),
 ('with', 262),
 ('at', 257),
 ('for', 242),
 ('but', 205),
 ('time', 201),
 ('you', 201),
 ('this', 198),
 ('or', 159),
 ('were', 157),
 ('on', 148),
 ('from', 137),
 ('not', 135),
 ('all', 133),
 ('then', 132),
 ('is', 129),
 ('his', 129)]

According to the [word ranking list in Wikipedia](https://en.wikipedia.org/wiki/Most_common_words_in_English), "time" is the 55th most common word. Here, it is the 19th. After all, this is a novel about a time machine.
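
As a closing aside, the sort-and-slice step used above can also be expressed with `Counter.most_common` from the standard library. The sketch below uses a small made-up dictionary (`sample_freq`) rather than the real counts, and the variable names are chosen only for this illustration.

```python
from collections import Counter

# A small hypothetical frequency dictionary (not the real counts).
sample_freq = {"the": 8, "and": 10, "his": 5, "of": 4, "fire": 1}

# The approach from the text: sort the pairs by count, largest first.
by_sorted = sorted(sample_freq.items(), key=lambda x: x[1], reverse=True)[:3]

# Counter.most_common(n) returns the n pairs with the highest counts.
by_counter = Counter(sample_freq).most_common(3)

print(by_sorted)
print(by_counter)
```

Both calls give the same three pairs here. For the full text, `Counter(frequency).most_common(30)` would play the role of the `sorted(...)[:30]` expression.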