# Word Frequencies
Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology?

We can start by counting the occurrences of words in a document. To do so, words should be converted to a single case (e.g. lower case), and all punctuation characters should be removed.

Our program reads a (plain) text file, isolates individual words, and computes their frequencies in the document.

## Pull text documents from the web
Instead of saving documents on the local file system, we can also load them directly from the Web. Loading from a URL works differently from opening a local file. Fortunately, libraries like `urllib` make this operation fairly easy.

In [32]:
from urllib.request import urlopen

In [None]:
import urllib.request  # binds the name `urllib`; the `from ... import urlopen` above does not
help(urllib.request)

For example, load the collected works of Shakespeare and print a few lines. (The first 244 lines of this particular document are copyright information and should be skipped.)

In [19]:
with urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt') as src:
    txt = src.readlines()
    for t in txt[244:250]:
        print(t.decode())

1609



THE SONNETS



by William Shakespeare





Load everything at once:

In [73]:
data = urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt').read().decode()
data[0:100]

'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with Wo'

## We need to know about some `string` operations
In particular, we need to convert text to lower case and replace special characters.

In [None]:
help(str)
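
As a minimal sketch of the two operations needed here, the following uses `str.lower` and `str.translate`. The sample line and the choice to replace each punctuation character with a space (rather than delete it) are assumptions made for illustration.

```python
import string

# Sample text, made up for illustration.
line = "Shall I compare thee to a summer's day? Thou art more lovely, and more temperate:"

# Translation table that maps every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

cleaned = line.lower().translate(table)
words = cleaned.split()   # split() without arguments collapses runs of whitespace
print(words)
```

Note that the apostrophe is treated like any other punctuation character, so "summer's" becomes the two words "summer" and "s". Whether that is acceptable depends on the analysis.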

## Lists and Tuples
Review list operations, such as appending elements, concatenating lists, etc. Python also provides *tuples*, which are quite useful.

In [None]:
help(list)

In [None]:
help(tuple)

In [98]:
# Example
a = []
a.append('a')
a.append('z')
a += ['b', 'x', 'c']
a.sort()
a[0:2]

['a', 'b']
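
The list example above does not touch tuples. Here is a brief sketch of how they might be used in this setting, pairing a count with a word. The words and counts are made-up values.

```python
# A tuple groups a fixed number of values; here a (count, word) pair.
pair = (3, 'thee')
count, word = pair          # tuple unpacking

# Tuples are immutable: item assignment raises a TypeError.
try:
    pair[0] = 4
    immutable = False
except TypeError:
    immutable = True

# Tuples compare element by element, so a list of (count, word) pairs
# sorts by count first and by word on ties.
pairs = [(3, 'thee'), (1, 'art'), (3, 'day')]
pairs.sort()
print(pairs)
```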

## Dictionaries
Dictionaries serve as associative arrays that bind keys to values. These can be used to keep track of the individual words. Retrieving a value by its key is fast, since Python dictionaries are hash tables; finding the keys that map to a given value, however, requires scanning all entries.

In [None]:
help(dict)

In [54]:
f = {}        # start with an empty dictionary
f['a'] = 0

In [55]:
'a' in f  # membership is tested against the keys; same result as 'a' in f.keys()

True

In [56]:
f['b']

KeyError: 'b'
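
The `KeyError` above is the situation a word counter runs into for every word it sees for the first time. One common way around it, sketched here with a made-up word list, is `dict.get` with a default of 0.

```python
words = ['to', 'be', 'or', 'not', 'to', 'be']   # made-up sample

freq = {}
for w in words:
    # get() returns 0 for a key that is not yet present instead of raising KeyError
    freq[w] = freq.get(w, 0) + 1

print(freq)
```

The standard library's `collections.Counter` offers the same counting behavior ready-made; the explicit loop is shown here only because it makes the dictionary operations visible.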

## Sorting
Here's an example of sorting a list of tuples. `sorted` returns a new list and leaves the original unchanged, whereas `list.sort` sorts the list in place.

In [91]:
l = [(3,'a'), (1, 'b'), (5, 'd'), (7, 'x')]

In [92]:
sorted(l, key=lambda x: x[0], reverse=True)

[(7, 'x'), (5, 'd'), (3, 'a'), (1, 'b')]

In [93]:
l

[(3, 'a'), (1, 'b'), (5, 'd'), (7, 'x')]

In [94]:
l.sort(key=lambda x: x[0], reverse=True)
l

[(7, 'x'), (5, 'd'), (3, 'a'), (1, 'b')]

In [82]:
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customise the sort order, and the
    reverse flag can be set to request the result in descending order.
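
Applied to word counts, the same `key` and `reverse` arguments can rank the items of a dictionary by frequency. The dictionary below holds made-up counts and is only a sketch of the idea.

```python
freq = {'the': 5, 'thee': 2, 'and': 4, 'art': 1}   # made-up counts

# items() yields (word, count) tuples; sort by the count in position 1
ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)

for word, count in ranked[:3]:
    print(count, word)
```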



In [None]:
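# For comparison, a shell pipeline that does the same job (tail -n +245 skips the first 244 lines):
# curl http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt | tail -n +245 | tr 'A-Z' 'a-z' | tr ' .?:,;' '\n' | sort | uniq -c | sort -rn | more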
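
A rough Python counterpart of the shell pipeline in the comment above, written as a sketch rather than a finished program. The function name `word_frequencies`, its parameters, and the inline sample text are assumptions; for the Shakespeare file one would presumably pass the decoded `data` string with `skip_lines=244`.

```python
def word_frequencies(text, skip_lines=0, separators=' .?:,;'):
    """Return (count, word) pairs, most frequent first."""
    # tail: drop the leading lines (e.g. a copyright header)
    body = '\n'.join(text.splitlines()[skip_lines:])
    # tr 'A-Z' 'a-z': convert to lower case
    body = body.lower()
    # tr ' .?:,;' '\n': turn each separator into a line break
    for ch in separators:
        body = body.replace(ch, '\n')
    # uniq -c: count each distinct word (empty strings are skipped here)
    freq = {}
    for w in body.split('\n'):
        if w:
            freq[w] = freq.get(w, 0) + 1
    # sort -rn: order by count, descending
    return sorted(((n, w) for w, n in freq.items()), reverse=True)

# Made-up sample with one header line to skip.
sample = "Header line\nTo be, or not to be: that is the question.\nTo be."
print(word_frequencies(sample, skip_lines=1)[:2])
```

Unlike the shell version, this sketch skips the empty strings left between adjacent separators, and it needs no preliminary `sort` because the dictionary does the grouping.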