# Homework 1: Word Frequencies

## Challenge
Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology? The assignment is to compute word frequencies for different types of documents, and to develop patterns for document classification.

## Tasks
1. Write Python code to load different text documents and compute word frequencies. The most frequent words should be at the beginning of the list.
2. Identify a small set (about 5 to 10) of words that could represent a particular type of document.
3. Show how different types have different word lists ("signatures").
4. Discuss results and the feasibility of this method.

## Deliverable
Use this notebook to implement your assignment. Please observe the following:
1. Your notebook should have the completely executed code and results.
2. Please organize your notebook to tell the story. Remove unnecessary clutter, test code, and anything that does not belong to the story.
3. Save your notebook in a directory named `HW1` in `MSA8010F16` in your *home* directory on the Hadoop Cluster. The path should be `~/MSA8010F16/HW1/HW1.ipynb`.
4. Also save the notebook in HTML as `~/MSA8010F16/HW1/HW1.html`
5. All file names are *case sensitive*!

## Work
I would like to try to determine an author using only word counts from their books. I have selected Jane Austen. I will load one of her books, Sense and Sensibility, and create the functions needed to read the text, clean it, and output the top 20 words for the document.

In [188]:
##Step 1: Load the data
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-sense-758.txt') as src:
    sense = []
    txt = src.readlines()
    for t in txt[244:]:
        sense = sense + (t.decode().replace('\n','').casefold().split(' '))            

In [189]:
##Step 2: Remove all the punctuation
import re
from string import punctuation

def remove_punct(doc):
    # re.escape keeps characters such as \ and ] literal inside the character class
    r = re.compile('[{}]'.format(re.escape(punctuation)))
    return [r.sub('', x) for x in doc]

sense2 = remove_punct(sense)

In [190]:
##Step 3: Remove empty entries
sense3 = [w for w in sense2 if w]

In [191]:
##Step 4: Remove stop words (the most common words in the English language) so that only the words that matter remain.
## I used the list from the website below.
def remove_stops(doc):
    with urlopen('http://www.textfixer.com/resources/common-english-words.txt') as stop_words_src:
        # A set makes the membership test below fast; the words kept are the same
        stop_words = set()
        for x in stop_words_src.readlines():
            stop_words.update(x.decode().split(','))
    return [w for w in doc if w not in stop_words]

sense4 = remove_stops(sense3)

In [192]:
##Step 5: Find the top 20 meaningful words in the document
from collections import Counter

def find_top20_words(doc):
    # most_common returns (word, count) pairs, most frequent first
    return Counter(doc).most_common(20)

print(find_top20_words(sense4))

[('elinor', 618), ('mrs', 526), ('very', 497), ('marianne', 490), ('more', 404), ('such', 354), ('one', 317), ('much', 287), ('herself', 249), ('time', 237), ('now', 232), ('know', 228), ('dashwood', 224), ('sister', 214), ('though', 213), ('edward', 210), ('well', 209), ('miss', 209), ('think', 205), ('jennings', 203)]


Now that all of the necessary functions are developed, I will load two more Jane Austen books, Pride and Prejudice and Emma. From these three books I will try to find a theme within the top words.
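
Each book goes through the same clean-and-count steps, so they could be wrapped in one function. The following is a minimal sketch, not the code used to produce the results in this notebook: `top_words`, its parameters, and the sample text are hypothetical names introduced for illustration. It splits on any whitespace rather than on single spaces, so its counts may differ slightly from the cells here.

```python
import re
from collections import Counter
from string import punctuation

# Hypothetical helper (not part of the original notebook).
_PUNCT = re.compile('[{}]'.format(re.escape(punctuation)))

def top_words(lines, stop_words, n=20):
    """Return the n most common non-stop words in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # casefold + split() lower-cases and breaks on any whitespace
        for token in line.casefold().split():
            word = _PUNCT.sub('', token)
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

# Tiny in-memory example; this stop-word set is a made-up stand-in.
sample = ["It is a truth, universally acknowledged,", "that a truth is a truth."]
print(top_words(sample, stop_words={"it", "is", "a", "that"}, n=2))
```

With a helper like this, each book would need only its list of decoded lines and the shared stop-word set.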

In [193]:
##Austen - Pride and Prejudice
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-pride-757.txt') as src:
    pride = []
    txt = src.readlines()
    for t in txt[49:14502]:
        pride = pride + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

pride2 = remove_punct(pride)
pride3 = [w for w in pride2 if w]
pride4 = remove_stops(pride3)
pride5 = find_top20_words(pride4)
print (pride5)

[('mr', 781), ('elizabeth', 596), ('very', 487), ('such', 389), ('darcy', 374), ('mrs', 343), ('much', 328), ('more', 322), ('bennet', 295), ('one', 295), ('miss', 283), ('jane', 264), ('bingley', 258), ('know', 237), ('before', 227), ('herself', 227), ('though', 226), ('never', 221), ('well', 219), ('soon', 217)]


In [194]:
##Austen - Emma
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/austen-emma-754.txt') as src:
    emma = []
    txt = src.readlines()
    for t in txt:
        emma = emma + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

emma2 = remove_punct(emma)
emma3 = [w for w in emma2 if w]
emma4 = remove_stops(emma3)
emma5 = find_top20_words(emma4)
print (emma5)

[('very', 1187), ('mr', 1124), ('emma', 751), ('mrs', 687), ('miss', 587), ('much', 474), ('such', 471), ('more', 463), ('one', 428), ('harriet', 391), ('thing', 385), ('think', 384), ('weston', 382), ('little', 361), ('being', 358), ('well', 353), ('never', 346), ('knightley', 337), ('know', 322), ('elton', 317)]


After removing character names, I selected seven words that appear in the top 20 of all three books: know, more, mrs/miss, much, such, very, and well. I will now test this list against other books from the same time period: two written by women and one by a man.
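
The shared words can also be found mechanically with a set intersection. The sketch below is an illustration rather than part of the original analysis: it copies the three top-20 lists from the outputs printed above (counts dropped) into variables with hypothetical names. Character names drop out on their own because they differ between books. On these lists the strict intersection also contains "one", which the hand-picked list leaves out, and it reports "mrs" and "miss" as separate words.

```python
# Top-20 words copied from the three Austen outputs above (counts dropped).
sense_top = ['elinor', 'mrs', 'very', 'marianne', 'more', 'such', 'one',
             'much', 'herself', 'time', 'now', 'know', 'dashwood', 'sister',
             'though', 'edward', 'well', 'miss', 'think', 'jennings']
pride_top = ['mr', 'elizabeth', 'very', 'such', 'darcy', 'mrs', 'much',
             'more', 'bennet', 'one', 'miss', 'jane', 'bingley', 'know',
             'before', 'herself', 'though', 'never', 'well', 'soon']
emma_top = ['very', 'mr', 'emma', 'mrs', 'miss', 'much', 'such', 'more',
            'one', 'harriet', 'thing', 'think', 'weston', 'little', 'being',
            'well', 'never', 'knightley', 'know', 'elton']

# Words present in all three lists
common = set(sense_top) & set(pride_top) & set(emma_top)
print(sorted(common))
# ['know', 'miss', 'more', 'mrs', 'much', 'one', 'such', 'very', 'well']
```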

In [195]:
## Louisa May Alcott - Little Women
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/li_women') as src:
    little = []
    txt = src.readlines()
    for t in txt:
        little = little + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

little2 = remove_punct(little)
little3 = [w for w in little2 if w]
little4 = remove_stops(little3)
little5 = find_top20_words(little4)
print (little5)

[('jo', 1254), ('one', 866), ('little', 728), ('up', 647), ('meg', 638), ('amy', 573), ('laurie', 552), ('dont', 551), ('very', 494), ('out', 482), ('beth', 418), ('good', 407), ('now', 399), ('go', 393), ('im', 390), ('well', 376), ('never', 375), ('much', 371), ('old', 366), ('see', 361)]


In [196]:
##Bronte - Jane Eyre
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/bronte-jane-178.txt') as src:
    jane = []
    txt = src.readlines()
    for t in txt[120:]:
        jane = jane + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

jane2 = remove_punct(jane)
jane3 = [w for w in jane2 if w]
jane4 = remove_stops(jane3)
jane5 = find_top20_words(jane4)
print (jane5)

[('now', 666), ('one', 577), ('mr', 541), ('out', 402), ('up', 384), ('very', 376), ('more', 361), ('little', 342), ('jane', 333), ('well', 325), ('rochester', 317), ('sir', 314), ('miss', 308), ('never', 292), ('before', 284), ('see', 274), ('thought', 256), ('such', 255), ('over', 254), ('mrs', 250)]


In [197]:
##Tolstoy - Anna Karenina
from urllib.request import urlopen
import string
with urlopen('http://www.textfiles.com/etext/FICTION/anna_karenina') as src:
    anna = []
    txt = src.readlines()
    for t in txt:
        anna = anna + (t.decode().replace('\n','').replace('\r','').casefold().split(' '))    

anna2 = remove_punct(anna)
anna3 = [w for w in anna2 if w]
anna4 = remove_stops(anna3)
anna5 = find_top20_words(anna4)
print (anna5)

[('levin', 1524), ('up', 1287), ('one', 1201), ('out', 1004), ('now', 896), ('vronsky', 779), ('more', 747), ('anna', 742), ('well', 696), ('come', 682), ('go', 678), ('very', 673), ('know', 669), ('went', 638), ('alexei', 625), ('himself', 615), ('see', 613), ('kitty', 600), ('over', 581), ('time', 554)]


## Conclusion

Of the seven words found in the Jane Austen books, Little Women has three, Jane Eyre has five, and Anna Karenina has four. Two of the words, very and well, are present in all three test books. If we remove these, we are left with five words, none of which appears in all three test texts. This could be significant, but more testing would be necessary to confirm it. I think a better approach would be to add some type of correlation test between words, such as looking at groups of words that appear together or that appear without certain other words. This could help account for writing style, which is not detectable when looking only at individual words.
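
The overlap counts quoted above can be reproduced with a short check. This is a sketch under stated assumptions: the test lists are copied from the outputs printed earlier, "mrs" and "miss" are treated as one entry as in the list above, and `signature_hits` is a hypothetical helper name.

```python
# The seven Austen words; 'mrs' and 'miss' count as a single entry.
signature = [{'know'}, {'more'}, {'mrs', 'miss'}, {'much'},
             {'such'}, {'very'}, {'well'}]

# Top-20 words copied from the three test-book outputs (counts dropped).
test_books = {
    'Little Women': ['jo', 'one', 'little', 'up', 'meg', 'amy', 'laurie',
                     'dont', 'very', 'out', 'beth', 'good', 'now', 'go',
                     'im', 'well', 'never', 'much', 'old', 'see'],
    'Jane Eyre': ['now', 'one', 'mr', 'out', 'up', 'very', 'more', 'little',
                  'jane', 'well', 'rochester', 'sir', 'miss', 'never',
                  'before', 'see', 'thought', 'such', 'over', 'mrs'],
    'Anna Karenina': ['levin', 'up', 'one', 'out', 'now', 'vronsky', 'more',
                      'anna', 'well', 'come', 'go', 'very', 'know', 'went',
                      'alexei', 'himself', 'see', 'kitty', 'over', 'time'],
}

def signature_hits(top_words, signature):
    """Count signature entries with at least one variant in top_words."""
    words = set(top_words)
    return sum(1 for variants in signature if variants & words)

hits = {title: signature_hits(top, signature) for title, top in test_books.items()}
print(hits)
# {'Little Women': 3, 'Jane Eyre': 5, 'Anna Karenina': 4}
```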

Based on word counts alone, there does not seem to be much of a difference between men and women writers of the 1800s. However, only a small sample was used, so I cannot say that with certainty. Again, some type of word association algorithm, along with additional samples, could help determine whether this is actually the case.
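
One simple form of word association is counting adjacent word pairs (bigrams). The sketch below was not run as part of this notebook and is only an illustration of the idea: `top_bigrams` and the sample sentence are hypothetical. It would most naturally be applied to a cleaned word list from before stop-word removal, so that pairs reflect words that really were adjacent in the text.

```python
from collections import Counter

def top_bigrams(words, n=5):
    """Return the n most common adjacent word pairs in a list of words."""
    pairs = zip(words, words[1:])  # (w0, w1), (w1, w2), ...
    return Counter(pairs).most_common(n)

# Made-up sample, already lower-cased and free of punctuation
sample = 'i am sure i am sure you know'.split()
print(top_bigrams(sample, n=2))
# [(('i', 'am'), 2), (('am', 'sure'), 2)]
```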