# Data processing

#### Task
Determine the 10 most frequent words in "Hamlet".

#### Assumptions:
Do not distinguish between lower and upper case words.
Treat plurals as separate words ('ghost' and 'ghosts' are different words).
Include "left-overs" after split, such as 'd' in "we'd".
Include character names.

The text comes from Project Gutenberg: http://www.gutenberg.org/cache/epub/2265/pg2265.txt
Use the plain text format and remove the Gutenberg preface and legal note before processing.

Step 1:
Read the first few lines of the file and print them to see what we need to do with the text.

In [24]:
with open('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt','r') as inp:
    for i in range(10):
        line = inp.readline()
        print(line)


The Tragedie of Hamlet



Actus Primus. Scoena Prima.



Enter Barnardo and Francisco two Centinels.



  Barnardo. Who's there?

  Fran. Nay answer me: Stand & vnfold

your selfe





Step 2: Rough cleanup. Remove all "hidden" characters (trailing end-of-line symbols, leading tabs, etc.).
Then split each line into words.

In [34]:
with open('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt','r') as inp:
    for line in inp:
        cleaned_line = line.strip()  # Remove trailing line breaks
        words = cleaned_line.split() # Split the line and add to list
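
To see what these two calls do, here is a small illustration on a hard-coded sample line. The string is an assumption, made up to mimic a line of the file with its leading spaces and trailing line break:

```python
# Sample line mimicking the file: leading spaces and a trailing line break.
sample_line = "  Barnardo. Who's there?\n"

cleaned_line = sample_line.strip()   # drops leading and trailing whitespace
words = cleaned_line.split()         # splits at runs of whitespace

print(repr(cleaned_line))
print(words)
```

Note that punctuation is still attached to the words at this point ('Barnardo.', 'there?'), which is why a further cleaning step is needed.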
        

Step 3: We need to clean up each word. 

Algorithm:

Loop over all words in a line. Call function clean_word() on each word.
For a word define an empty string called new_word.
Loop over all characters of the word.
a) If the character is a letter (use string.ascii_letters) add it to the new_word
b) Else add white space " " to new_word
Split new_word at white spaces.
Return a list with all split words.

In [35]:
import string as s

def clean_word(word):
    new_word = ""
    for char in word:
        if char in s.ascii_letters:
            new_word += char
        else:
            new_word += ' '
    return new_word.split()
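
As a quick check of the function, the sketch below repeats the definition so that it runs on its own and calls it on a few sample strings (the inputs are chosen for illustration only):

```python
import string as s

def clean_word(word):
    # Replace every non-letter by a space, then split at the spaces.
    new_word = ""
    for char in word:
        if char in s.ascii_letters:
            new_word += char
        else:
            new_word += ' '
    return new_word.split()

print(clean_word("who's"))    # apostrophe splits the word: ['who', 's']
print(clean_word("there?"))   # trailing punctuation is dropped: ['there']
print(clean_word("we'd"))     # the "left-over" 'd' is kept: ['we', 'd']
print(clean_word("&"))        # no letters at all gives an empty list: []
```

Because one input can yield zero, one, or several words, the main code adds the result with extend() rather than append().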

In the main code:

In [36]:
with open('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt','r') as inp:
    for line in inp:
        cleaned_line = line.strip()  # Remove trailing line breaks
        words = cleaned_line.split() # Split it at white spaces
        cleaned_words = []
        for word in words:
            cleaned_words.extend(clean_word(word.lower()))
        

Step 4: Collect all "cleaned up" words and store them in a list.

In [37]:
hamlet_words = []
with open('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt','r') as inp:
    for line in inp:
        cleaned_line = line.strip()  # Remove trailing line breaks
        words = cleaned_line.split()
        cleaned_words = []
        for word in words:
            cleaned_words.extend(clean_word(word.lower()))
        hamlet_words.extend(cleaned_words)


For convenience, we'll put all of the text cleanup into a function, file_to_words().

In [38]:
def file_to_words(filename):
    all_words = []
    with open(filename,'r') as inp:
        for line in inp:
            cleaned_line = line.strip()  # Remove trailing line breaks
            words = cleaned_line.split() # Split the line into "words"
            # clean things up            
            cleaned_words = []
            for word in words:
                cleaned_words.extend(clean_word(word.lower()))
            all_words.extend(cleaned_words)
    return all_words
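
The function can be tried without the full play. The sketch below is self-contained: it repeats both functions and runs them on a small temporary file. The sample text and the temporary file are assumptions for illustration, standing in for Hamlet.txt:

```python
import os
import string as s
import tempfile

def clean_word(word):
    new_word = ""
    for char in word:
        new_word += char if char in s.ascii_letters else ' '
    return new_word.split()

def file_to_words(filename):
    all_words = []
    with open(filename, 'r') as inp:
        for line in inp:
            for word in line.strip().split():
                all_words.extend(clean_word(word.lower()))
    return all_words

# Hypothetical sample text standing in for the real file.
sample = "  Barnardo. Who's there?\n  Fran. Nay answer me: Stand & vnfold\nyour selfe\n"

# Write the sample to a temporary file, then clean up afterwards.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    tmp_name = tmp.name

try:
    sample_words = file_to_words(tmp_name)
finally:
    os.remove(tmp_name)

print(sample_words)
```

All words come back in lower case, the apostrophe in "Who's" produces the two entries 'who' and 's', and the lone '&' disappears.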

Step 5: We keep track of the number of occurrences of a word using a dictionary.

Key <==> word
Value <==> number of occurrences of the word

Note: We will assume that all keys are lower case.

In [39]:
word_count = {}

hamlet = file_to_words('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt')

for word in hamlet:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1
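
The same counting pattern can be checked on a short list. The sketch below uses a made-up word list and, as an alternative that this notebook does not use, compares the result with collections.Counter from the standard library:

```python
from collections import Counter

# Hypothetical, already cleaned, lower-case words.
sample_words = ['the', 'ghost', 'and', 'the', 'king', 'and', 'the', 'queen']

# Dictionary-based counting, as in the step above.
word_count = {}
for word in sample_words:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1

# Alternative: Counter does the same bookkeeping in one call.
counter = Counter(sample_words)

print(word_count)
print(dict(counter) == word_count)
```

Both approaches give the same counts; the explicit loop simply makes the bookkeeping visible.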

Step 6: Put it all together

Convert the dictionary into a list of pairs.
Sort this list by the number of occurrences (in descending order).
Get the first 10 elements of the sorted list.
Print nicely formatted output.
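
The sorting steps can be sketched on a tiny dictionary first. The counts below are made up for illustration, and only the top 2 entries are taken instead of 10:

```python
from operator import itemgetter

# Hypothetical word counts.
sample_count = {'ghost': 1, 'the': 3, 'king': 1, 'and': 2}

pairs = list(sample_count.items())             # list of (word, count) pairs
pairs.sort(key=itemgetter(1), reverse=True)    # sort by count, largest first

top = pairs[:2]                                # keep the first 2 pairs
for word, n in top:
    print('{:^6s} {:3d}'.format(word.upper(), n))
```

itemgetter(1) picks the second element of each pair (the count) as the sort key, and the '{:^6s}' format centers each word in a field of width 6.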

In [40]:
import string as s
from operator import itemgetter

def clean_word(word):
    new_word = ""
    for char in word:
        if char in s.ascii_letters:
            new_word += char
        else:
            new_word += ' '
    return new_word.split()

# Function that produces 'clean' words
def file_to_words(filename):
    all_words = []
    with open(filename,'r') as inp:
        for line in inp:
            cleaned_line = line.strip()  # Remove trailing line breaks
            words = cleaned_line.split() # Split the line into "words"
            # clean things up            
            cleaned_words = []
            for word in words:
                cleaned_words.extend(clean_word(word.lower()))
            all_words.extend(cleaned_words)
    return all_words

# Dictionary that will contain word counts
word_count = {}

# Call function file_to_words() with the text of Hamlet as its argument
hamlet = file_to_words('C:\\Users\\inese\\OneDrive\\Escritorio\\OneDrive\\Hamlet.txt')

# Do word counting 
for word in hamlet:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1
        
# Find the most frequent words
word_count_list = list(word_count.items())  # Transform dictionary into a list of pairs
word_count_list.sort(key=itemgetter(1),reverse=True)   # Sort by number of appearances, in descending order

# Produce pretty output 
print('10 most frequent words are:')
print('-'*80)
for i in range(10):
    word = word_count_list[i][0]
    appears_times = word_count_list[i][1]
    percent = 100*appears_times/len(hamlet)
    int_perc = int(round(percent))
    print('Word {:^6s} appears {:5d} times, which is {:.2f}% of the text'.format(word.upper(),appears_times,percent), end=' : ' )
    stars = '*'*int_perc
    print(stars)


10 most frequent words are:
--------------------------------------------------------------------------------
Word  THE   appears   993 times, which is 3.28% of the text : ***
Word  AND   appears   863 times, which is 2.85% of the text : ***
Word   TO   appears   685 times, which is 2.26% of the text : **
Word   OF   appears   610 times, which is 2.02% of the text : **
Word   I    appears   574 times, which is 1.90% of the text : **
Word  YOU   appears   527 times, which is 1.74% of the text : **
Word   A    appears   511 times, which is 1.69% of the text : **
Word   MY   appears   502 times, which is 1.66% of the text : **
Word   IT   appears   419 times, which is 1.38% of the text : *
Word   IN   appears   400 times, which is 1.32% of the text : *
