### Let's write an elementary tokenizer that uses words as tokens.

We will use Mark Twain's _Life On The Mississippi_ as a test bed. The text is in the accompanying file 'Life_On_The_Mississippi.txt'

Here's a not-terribly-good such tokenizer:

In [3]:
import string
import re

# ------------------------------------------------------------------------------
# HOMEWORK Part 1
punctuations = string.punctuation
trans_table = str.maketrans('', '', punctuations)
print(trans_table)
# ------------------------------------------------------------------------------

wdict = {}
with open('Life_On_The_Mississippi.txt', 'r') as L:
    line = L.readline()
    nlines = 1
    while line:
        # ----------------------------------------------------------------------
        # HOMEWORK Part 1 
        # remove punctuation chaterecters for line
        line = line.translate(trans_table)
        # remove non-ASCII characters
        line = re.sub(r'[^\x00-x7f]', '', line)
        # to lowercase
        line = line.lower()
        # ----------------------------------------------------------------------

        words = line.split()
        for word in words:
            if wdict.get(word) is not None:
                wdict[word] += 1
            else:
                wdict[word] = 1
        line = L.readline()
        nlines += 1

nitem = 0 ; maxitems = 100
for item in wdict.items():
    nitem += 1
    print(item)
    if nitem == maxitems: break

print(len(wdict))
print(wdict)

{33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 39: None, 40: None, 41: None, 42: None, 43: None, 44: None, 45: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 63: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None}
('the', 10012)
('project', 90)
('gutenberg', 87)
('ebook', 13)
('of', 4532)
('life', 89)
('on', 947)
('mississippi', 159)
('this', 781)
('is', 1148)
('for', 1095)
('use', 48)
('anone', 5)
('anwhere', 18)
('in', 2593)
('united', 37)
('states', 54)
('and', 5892)
('most', 124)
('other', 270)
('parts', 10)
('world', 68)
('at', 750)
('no', 422)
('cost', 25)
('with', 1081)
('almost', 38)
('restrictions', 2)
('whatsoever', 2)
('you', 117)
('ma', 90)
('cop', 17)
('it', 2294)
('give', 81)
('awa', 172)
('or', 581)
('reuse', 2)
('under', 119)
('terms', 26)
('license', 24)
('included', 3)
('online', 4)
('wwwgutenbergorg', 5)
('if', 381)
('ou', 916)
('are', 387)
('not', 722)
('lo

This is unsatisfactory for a few reasons:

* There are non-ASCII (Unicode) characters that should be stripped (the so-called "Byte-Order Mark" or BOM \ufeff at the beginning of the text);

* There are punctuation marks, which we don't want to concern ourselves with;

* The same word can appear capitalized, or lower-case, or with its initial letter upper-cased, whereas we want them all to be normalized to lower-case.

Part 1 of this assignment: insert code in this loop to operate on the str variable 'line' so as to fix these problems before 'line' is split into words.

A hint to one possible way to do this: use the 'punctuation' character definition in the Python 'string' module, the 'maketrans' and 'translate' methods of Python's str class, to eliminate punctuation, and the regular expression ('re') Python module to eliminate any Unicode---it is useful to know that the regular expression r'[^\x00-x7f]' means "any character not in the vanilla ASCII set.

Part 2: Add code to sort the contents of wdict by word occurrence frequency.  What are the top 100 most frequent word tokens?  Adding up occurrence frequencies starting from the most frequent words, how many distinct words make up the top 90% of word occurrences in this "corpus"?

For this part, the docs of Python's 'sorted' and of the helper 'itemgetter' from 'operator' reward study.

Write your modified code in the cell below.

In [4]:
# HOMEWORK Part 1
import numpy as np

def sort_words_by_frequency(words):

    # sort based on frequency in descending order
    sorted_words = sorted(words.items(), key=lambda item: item[1], reverse=True)

    return sorted_words


def print_first_n_words_by_occurence_frequency(sorted_words, n=100):

    print('-' * 80)
    print('The most frequent words:')
    print('-' * 80)
    for i in range(n):
        print(f'{i+1:5}: word = {sorted_words[i][0]:20} occurence frequency = {sorted_words[i][1]}')
    print('#' * 80)


def get_number_of_words_fraction_occurence(sorted_words, fraction=0.9):

    # convert list of tuples to 2D numpy array
    data = np.array(sorted_words, dtype=object)
    occurences = data[:,1]
    total = np.sum(occurences)
    print(f'total number of distinct words = {len(occurences)}')
    print(f'total number of occurences of all words = {total}')
    partial = 0
    nwords = 0
    for occurence in occurences:
        #print(partial, partial / total)
        if partial / total > fraction:
            break
        partial += occurence
        nwords += 1

    return nwords


sorted_words = sort_words_by_frequency(wdict)
print_first_n_words_by_occurence_frequency(sorted_words, n=100)

fraction = 0.9
nwords = get_number_of_words_fraction_occurence(sorted_words, fraction=fraction)
print(f'the first {nwords} most frequent words makes up the top {fraction*100:.1f} % of word occurences')

--------------------------------------------------------------------------------
The most frequent words:
--------------------------------------------------------------------------------
    1: word = the                  occurence frequency = 10012
    2: word = and                  occurence frequency = 5892
    3: word = of                   occurence frequency = 4532
    4: word = a                    occurence frequency = 4055
    5: word = to                   occurence frequency = 3593
    6: word = in                   occurence frequency = 2593
    7: word = it                   occurence frequency = 2294
    8: word = i                    occurence frequency = 2205
    9: word = was                  occurence frequency = 2114
   10: word = that                 occurence frequency = 1724
   11: word = he                   occurence frequency = 1402
   12: word = is                   occurence frequency = 1148
   13: word = for                  occurence frequency = 1095
   14:

In [None]:
# For HOMEWORK Part 2, see the end of "Sequential_Data_Models.ipynb"