# Word Frequencies
Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology?

We can start by counting the occurrences of words in a document. To do so, words should be converted to a single case (e.g. lower case), and all punctuation characters should be removed.

Our program reads a (plain) text file, isolates individual words, and computes their frequencies in the document.

The following steps outline the process:
1. load text data
2. clean up text, convert characters, and transform to a list of words
3. count the occurrences of words
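
The three steps above can be sketched in a few lines. This is only a minimal illustration, not the final program: the function name `word_frequencies` and the sample sentence are made up for this example, and a string literal stands in for the text file of step 1.

```python
import string

def word_frequencies(text):
    """Return a dict that maps each lower-cased word to its count."""
    # Step 2: convert to one case and replace punctuation by spaces.
    text = text.lower()
    for c in string.punctuation:
        text = text.replace(c, ' ')
    words = text.split()
    # Step 3: count the occurrences of each word.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# Step 1 would normally read a file; a short literal stands in for it here.
sample = "To be, or not to be: that is the question."
freq = word_frequencies(sample)
print(freq['to'], freq['be'], freq['question'])
```

The remaining sections look at each step in more detail.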

## Load Text Data
The following shows how to load data from a web site, the local file system, and the Hadoop File System.

### Pull text documents from the web
Instead of saving documents on the local file system, we can also load them directly from the web. Loading from a URL works quite differently from opening a local file. Fortunately, libraries like `urllib` make this operation fairly easy.

In [43]:
from urllib.request import urlopen
# from urllib.request import *


In [45]:
# in order to get the help text, we should import the whole subpackage.
import urllib.request
help(urllib.request)

Help on module urllib.request in urllib:

NAME
    urllib.request - An extensible library for opening URLs using a variety of protocols

DESCRIPTION
    The simplest way to use this module is to call the urlopen function,
    which accepts a string containing a URL or a Request object (described
    below).  It opens the URL and returns the results as file-like
    object; the returned object has some extra methods described below.
    
    The OpenerDirector manages a collection of Handler objects that do
    all the actual work.  Each Handler implements a particular protocol or
    option.  The OpenerDirector is a composite object that invokes the
    Handlers needed to open the requested URL.  For example, the
    HTTPHandler performs HTTP GET and POST requests and deals with
    non-error returns.  The HTTPRedirectHandler automatically deals with
    HTTP 301, 302, 303 and 307 redirect errors, and the HTTPDigestAuthHandler
    deals with digest authentication.
    
    urlopen(url,

In [46]:
help(urlopen)

Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x7f4cd4cc8130>, *, cafile=None, capath=None, cadefault=False, context=None)



For example, load the collected works of Shakespeare and print a few lines. (The first 244 lines of this particular document contain copyright information and should be skipped.)

In [12]:
with urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt') as src:
    txt = src.readlines()
    for t in txt[244:250]:
        print(t.decode())

1609



THE SONNETS



by William Shakespeare





Load everything at once:

In [73]:
data = urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt').read().decode()
data[0:100]

'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with Wo'

*Note*: there is a difference between `read` and `readlines`. `read` loads the entire content into a single bytes object, whereas `readlines` returns a list of lines, i.e. the sections of the input stream that are separated by the new-line character(s).
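
The difference can be tried without a network connection. In the following sketch an in-memory `io.BytesIO` stream stands in for the object returned by `urlopen`; the two-line sample content is made up for illustration.

```python
import io

raw = b"first line\nsecond line\n"

# read() returns the whole stream as one bytes object.
whole = io.BytesIO(raw).read()
print(type(whole).__name__, len(whole))

# readlines() returns a list with one bytes object per line,
# each still ending in its new-line character.
lines = io.BytesIO(raw).readlines()
print(lines)

# decode() turns bytes into str (UTF-8 by default).
decoded = whole.decode().splitlines()
print(decoded)
```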

### Pull text from local files
Alternatively, we may just read from a local file. 

In [73]:
with open('textfiles/shakespeare.txt', 'r') as src:
    txt = src.readlines()
    for t in txt[0:10]:
        print(t)       ## Note: we don't need to decode the string

1609



THE SONNETS



by William Shakespeare







                     1

  From fairest creatures we desire increase,



Read everything at once...

In [76]:
txt = open('textfiles/shakespeare.txt', 'r').read()
txt[0:100]

'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'
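
Calling `open(...).read()` in one expression leaves it to Python to close the file at some later point. A `with` block closes it as soon as the block ends. The sketch below writes a small temporary file first so that it runs on its own; the path and the file content are stand-ins for `textfiles/shakespeare.txt`, and the explicit `encoding='utf-8'` is an assumption about the file.

```python
import os
import tempfile

# Hypothetical stand-in for textfiles/shakespeare.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as dst:
    dst.write('1609\n\nTHE SONNETS\n\nby William Shakespeare\n')

# Read everything at once; the file is closed when the block ends.
with open(path, 'r', encoding='utf-8') as src:
    txt = src.read()

print(src.closed)
print(txt[0:20])
```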

### Pull text from Hadoop File System (HDFS)
We're usually interested in fairly big data sets, which we keep on the Hadoop File System. Hadoop and Spark can decompress text files on the fly, so the files are stored in a compressed format (`.gz`).

In [79]:
import zlib
from hdfs import InsecureClient
client = InsecureClient('http://backend-0-0:50070')

In [80]:
with client.read('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/77239.gz') as reader:
  txt = zlib.decompress(reader.read(), 16+zlib.MAX_WBITS).decode()
txt[0:100]

'From: oaf@zurich.ai.mit.edu (Oded Feingold)\nSubject: Re: To All My Friends on T.P.M., I send Greetin'
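
The second argument `16 + zlib.MAX_WBITS` tells `zlib` to expect gzip framing (header and trailer) around the compressed data. As a sketch, the standard library's `gzip` module should give the same result with less ceremony. The bytes are compressed locally here as a stand-in for what `reader.read()` returns from HDFS.

```python
import gzip
import zlib

# Stand-in for the bytes read from a .gz file on HDFS.
original = "This is an outrage!  I don't even own a dog.\n"
compressed = gzip.compress(original.encode())

# Variant used above: zlib with gzip framing enabled.
via_zlib = zlib.decompress(compressed, 16 + zlib.MAX_WBITS).decode()

# Alternative: the gzip module handles the framing itself.
via_gzip = gzip.decompress(compressed).decode()

print(via_zlib == via_gzip == original)
```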

In [81]:
txt.split('\n')

['From: oaf@zurich.ai.mit.edu (Oded Feingold)',
 'Subject: Re: To All My Friends on T.P.M., I send Greetings',
 'Organization: M.I.T. Artificial Intelligence Lab.',
 'Lines: 1',
 'Reply-To: oaf@zurich.ai.mit.edu',
 'NNTP-Posting-Host: klosters.ai.mit.edu',
 "In-reply-to: szljubi@chip.ucdavis.edu's message of Thu, 6 May 1993 22:47:00 GMT",
 '',
 "This is an outrage!  I don't even own a dog.",
 '']

To read all text files in a directory, we first have to get the list of files and then iterate through it.

In [83]:
dir_list = client.list('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/')
dir_list[0:10]

['76355.gz',
 '76366.gz',
 '76367.gz',
 '76368.gz',
 '76369.gz',
 '76370.gz',
 '76372.gz',
 '76373.gz',
 '76374.gz',
 '76375.gz']

In [87]:
text_docs = []
for f in dir_list:
    with client.read('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/%s' % f) as reader:
        txt = zlib.decompress(reader.read(), 16+zlib.MAX_WBITS).decode()
        text_docs.append(txt)
print("Read %d text files." % len(text_docs))

Read 376 text files.


In [86]:
text_docs[1:3]

 'From: ohayon@jcpltyo.JCPL.CO.JP (Tsiel Ohayon)\nSubject: Re: rejoinder. Questions to Israelis\nOrganization: James Capel Pacific Limited, Tokyo Japan\nLines: 31\n\nIn article <1993Apr26.211905.28317@freenet.carleton.ca> aa229@Freenet.carleton.ca (Steve Birnbaum) writes:\n\n[SB] Oh yeah, Israel was really ready to "expand its borders" on the holiest day\n[SB] of the year (Yom Kippur) when the Arabs attacked in 1973.  Oh wait, you\n[SB] chose to omit that war...perhaps because it 100% supports the exact \n[SB] OPPOSITE to the point you are trying to make?  I don\'t think that it\'s\n[SB] because it was the war that hit Israel the hardest.  Also, in 1967 it was\n[SB] Egypt, not Israel who kicked out the UN force.  In 1948 it was the Arabs\n[SB] who refused to accept the existance of Israel BASED ON THE BORDERS SET\n[SB] BY THE UNITED NATIONS.  In 1956, Egypt closed off the Red Sea to Israeli\n[SB] shipping, a clear antagonistic act.  And in 1982 the attack was a response\n[SB] to years 

## Clean up text

### We need to know about some `string` operations
In particular, we need to know how to convert text to lower case and how to replace special characters.

In [93]:
import string
help(string)

Help on module string:

NAME
    string - A collection of string constants.

DESCRIPTION
    Public module variables:
    
    whitespace -- a string containing all ASCII whitespace
    ascii_lowercase -- a string containing all ASCII lowercase letters
    ascii_uppercase -- a string containing all ASCII uppercase letters
    ascii_letters -- a string containing all ASCII letters
    digits -- a string containing all ASCII decimal digits
    hexdigits -- a string containing all ASCII hexadecimal digits
    octdigits -- a string containing all ASCII octal digits
    punctuation -- a string containing all ASCII punctuation characters
    printable -- a string containing all ASCII characters considered printable

CLASSES
    builtins.object
        Formatter
        Template
    
    class Formatter(builtins.object)
     |  Methods defined here:
     |  
     |  check_unused_args(self, used_args, args, kwargs)
     |  
     |  convert_field(self, value, conversion)
     |  
     |  format

In [96]:
txt = open("textfiles/shakespeare.txt").read()
txt[0:100]

'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'

In [97]:
txt = txt.lower()

In [98]:
for c in '.;!\'" ':
    txt = txt.replace(c, '\n')
txt[0:100]

'1609\n\nthe\nsonnets\n\nby\nwilliam\nshakespeare\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1\n\n\nfrom\nfairest\ncreatures\nwe\ndesi'

In [100]:
word_list = txt.split('\n')
word_list[0:10]

['1609', '', 'the', 'sonnets', '', 'by', 'william', 'shakespeare', '', '']
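
The list above still contains empty strings, and only a handful of punctuation characters were replaced. One possible refinement, shown here as a sketch on a short made-up excerpt, is to map every character in `string.punctuation` to a space with `str.translate` and then call `split()` without arguments, which splits on any run of whitespace and never yields empty strings.

```python
import string

txt = ("1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n"
       "  From fairest creatures we desire increase,")

# Translation table: every ASCII punctuation character becomes a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
cleaned = txt.lower().translate(table)

# split() without arguments splits on runs of whitespace.
word_list = cleaned.split()
print(word_list[0:5])

# Alternative: keep split('\n') and filter out the empty strings afterwards.
non_empty = [w for w in "the\nsonnets\n\nby".split('\n') if w]
print(non_empty)
```

Note that this also splits contractions and hyphenated words at the punctuation mark, which may or may not be wanted.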

## Lists and Tuples
Review list operations, such as appending elements and concatenating lists. Python also provides *tuples*, which are quite useful.

In [None]:
help(list)

In [None]:
help(tuple)

In [98]:
# Example
a = []
a.append('a')
a.append('z')
a += ['b', 'x', 'c']
a.sort()
a[0:2]

['a', 'b']
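
Tuples come in handy for keeping a word together with its count. The following sketch shows unpacking and immutability; the variable names are chosen for this example only.

```python
# A tuple groups a fixed number of values, here a word and its count.
pair = ('the', 27531)

# Unpacking assigns the elements to individual names.
word, count = pair
print(word, count)

# Tuples are immutable: assigning to an element raises a TypeError.
try:
    pair[1] = 0
except TypeError as err:
    print('cannot modify a tuple:', err)

# A list of tuples can still grow like any other list.
pairs = [('and', 26658)]
pairs.append(pair)
print(pairs)
```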

## Dictionaries
Dictionaries serve as associative arrays that bind keys to values. They can be used to keep track of the individual words. Looking up a value by its key is fast; finding the keys that belong to a given value, however, requires scanning the whole dictionary and can be time consuming.

In [None]:
help(dict)

In [19]:
f = { 'one': 1, 'two': 2}
f['a'] = 0

In [20]:
f

{'a': 0, 'one': 1, 'two': 2}

In [22]:
f['one']

1

In [23]:
f.keys()

dict_keys(['one', 'two', 'a'])

In [24]:
f.values()

dict_values([1, 2, 0])

In [25]:
Ω = 17

In [26]:
Ω

17

In [55]:
'a' in f.keys()

True

In [56]:
f['b']

KeyError: 'b'
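
A `KeyError` like the one above can be avoided by testing for the key first or by using `get` with a default value. A small sketch, reusing the dictionary `f` from the cells above:

```python
f = {'one': 1, 'two': 2, 'a': 0}

# Membership can be tested on the dictionary itself; keys() is not needed.
print('b' in f)

# get() returns the given default instead of raising a KeyError.
print(f.get('b', 0))

# This gives a compact idiom for counting.
f['b'] = f.get('b', 0) + 1
f['b'] = f.get('b', 0) + 1
print(f['b'])
```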

## Sorting
The following examples sort a list of numbers and then a list of tuples, using a key function to select the element to sort by.

In [30]:
l2 = [3,4,1,45,7,234,123]
l2.sort()
l2

[1, 3, 4, 7, 45, 123, 234]

In [35]:
l = [(3,'a'), (9, 'z'), (1, 'y'), (1, 'b'), (5, 'd'), (7, 'x')]
l

[(3, 'a'), (9, 'z'), (1, 'y'), (1, 'b'), (5, 'd'), (7, 'x')]

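In [37]:
def take_first(x):
    return x[0]

# The sort is stable: (1, 'y') stays ahead of (1, 'b') because the keys are equal.
l.sort(key=take_first)
l

[(1, 'y'), (1, 'b'), (3, 'a'), (5, 'd'), (7, 'x'), (9, 'z')]

In [92]:
l.sort(key=lambda x: x[0], reverse=True)
l

[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'y'), (1, 'b')]

In [87]:
# sorted() returns a new list and leaves l itself untouched.
sorted(l, key=lambda x: x[0], reverse=True)

[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'y'), (1, 'b')]

In [77]:
l

[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'y'), (1, 'b')]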

In [41]:
l3 = [10, 110, 12, 1203]
l3.sort(key=lambda x: str(x))
l3

[10, 110, 12, 1203]

In [82]:
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customise the sort order, and the
    reverse flag can be set to request the result in descending order.
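
A key function may also return a tuple, which makes it possible to sort by several criteria at once. The sketch below ranks (word, count) pairs by descending count and breaks ties alphabetically; the sample data is made up for illustration.

```python
pairs = [('b', 1), ('the', 3), ('a', 3), ('y', 1), ('of', 2)]

# Negating the count sorts it in descending order,
# while the word itself is compared in ascending order.
ranked = sorted(pairs, key=lambda p: (-p[1], p[0]))
print(ranked)
```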



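## Count Words
For comparison, the following shell pipeline downloads the text, skips the first 244 lines, converts it to lower case, splits it into words, and counts them:

In [None]:
# curl http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt | tail -n +245 | tr 'A-Z' 'a-z'| tr ' .?:,;' '\n' | sort | uniq -c | sort -rn | more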

In [1]:
txt = open('textfiles/shakespeare.txt', 'r').read()
txt[0:100]

'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'

In [22]:
txt2 = txt.replace(',', '\n').replace('.', '\n').replace('?', '\n').replace('!', '\n').replace('\'', '\n').replace('"', '\n').lower()
txt2[0:100]

'1609\n\nthe sonnets\n\nby william shakespeare\n\n\n\n                     1\n  from fairest creatures we desi'

In [23]:
wordlist = txt2.split()

# Sort the words so that equal words become adjacent, then count each run.
wordlist.sort()
results = []
current_word = wordlist[0]
current_counter = 1
for w in wordlist[1:]:
    if w != current_word:
        # A new word starts: store the finished run and reset the counter.
        results.append((current_word, current_counter))
        current_word = w
        current_counter = 1
    else:
        current_counter += 1
# Store the last run, which the loop itself does not append.
results.append((current_word, current_counter))

In [25]:
results[0:10]

[('&', 3),
 ('&c', 18),
 ('(1)', 218),
 ('(2)', 218),
 ('(a', 3),
 ('(alack', 1),
 ('(all', 4),
 ('(although', 2),
 ('(always', 1),
 ('(as', 17)]

In [27]:
results.sort(key=lambda x: x[1], reverse=True)

In [28]:
results[0:10]

[('the', 27531),
 ('and', 26658),
 ('i', 22430),
 ('to', 18937),
 ('of', 18103),
 ('a', 14554),
 ('you', 13475),
 ('my', 12474),
 ('that', 11457),
 ('in', 11010)]

In [33]:
wordlist = txt2.split()

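reshash = {}
for w in wordlist:
    # Membership can be tested on the dictionary directly.
    if w in reshash:
        reshash[w] += 1
    else:
        reshash[w] = 1

# Turn the dictionary into a list of (word, count) tuples and sort by count.
results = [(k, reshash[k]) for k in reshash.keys()]
results.sort(key=lambda x: x[1], reverse=True)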
results[0:10]

[('the', 27531),
 ('and', 26658),
 ('i', 22430),
 ('to', 18937),
 ('of', 18103),
 ('a', 14554),
 ('you', 13475),
 ('my', 12474),
 ('that', 11457),
 ('in', 11010)]

[('misuse', 8),
 ('julia', 153),
 ('legacy', 5),
 ('unhand', 1),
 ('nine-', 1),
 ('long-ingraffed', 1),
 ('substances', 2),
 ('profound;', 1),
 ('austerely', 2),
 ('executed', 18)]
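
The standard library offers a shortcut for the dictionary-based approach: `collections.Counter`. It is not used in the cells above and is shown here only as an alternative sketch; the short string `txt2` below is a made-up stand-in for the cleaned text.

```python
from collections import Counter

# Made-up stand-in for the cleaned, lower-cased text.
txt2 = ("the sonnets\nby william shakespeare\n"
        "from fairest creatures we desire increase\nthe the by")

# Counter counts the elements of any iterable.
counts = Counter(txt2.split())

# most_common(n) returns the n most frequent (word, count) tuples.
top = counts.most_common(2)
print(top)
```

`Counter` is a subclass of `dict`, so lookups such as `counts['the']` work as before, and a missing word yields 0 instead of a `KeyError`.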