## Exercise 1: String/text operation

### Shakespeare plays word count

The file `data/shakespear_alllines.txt` stores every line of Shakespeare's works as raw text.
Determine the total number of words, and find the 10 most frequent words in his work.

In [4]:
import urllib.request
import string

# Loading data from my github
url = 'https://raw.githubusercontent.com/gr-grey/python-data-review/main/data/shakespear_alllines.txt'
txtfile = urllib.request.urlopen(url)

# read the text file into a list; each element is one raw line, like '"ACT I"\n'
alllines = []
for line in txtfile:
    decoded = line.decode("utf-8")
    alllines.append(decoded)

# process each line in the alllines list:
# clean the string and return a list of lower-cased words
def proc_line(eachline):
    # remove whitespace (including \n) at the beginning and end of the string
    no_endl = eachline.strip()

    # translation table that deletes punctuation; string.punctuation is '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    rmpunc = str.maketrans('', '', string.punctuation)

    # delete the punctuation and convert everything to lower case
    nopunc_all_lower = no_endl.translate(rmpunc).lower()

    # split on whitespace and return the words as a list
    return nopunc_all_lower.split()

# build a list of all words by processing each line
allwords = []
for eachline in alllines:
    allwords.extend(proc_line(eachline))

# use a dictionary to store word: count pairs
def count_words(wordlist):
    worddict = {}
    for eachword in wordlist:
        if eachword in worddict:
            worddict[eachword] += 1
        else:
            worddict[eachword] = 1
    return worddict

worddict = count_words(allwords)

# the number of unique words is the length of the dictionary
word_num = len(worddict)
print(f'Shakespeare doc has a total of {word_num} unique words.')

# sort the dictionary by the counts value
sortedcount = sorted(worddict.items(), key = lambda x: x[1], reverse = True)

print('The most frequent 10 words and their counts are:')
print(*(sortedcount[0:10]), sep ="\n")

Shakespeare doc has a total of 27381 unique words.
The most frequent 10 words and their counts are:
('the', 27029)
('and', 25029)
('i', 20086)
('to', 18409)
('of', 15851)
('a', 14028)
('you', 13316)
('my', 11864)
('in', 10497)
('that', 10403)
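
Note that the cell above reports the number of unique words (`len(worddict)`), while the total number of words the exercise asks for would be `len(allwords)`. As a minimal sketch of both counts, here is the same pipeline written with `collections.Counter` from the standard library. The `sample_lines` list and the `tokenize` helper are hypothetical stand-ins for the downloaded file and `proc_line`; the real totals are not reproduced here.

```python
from collections import Counter
import string

# Hypothetical sample lines standing in for the downloaded file.
sample_lines = ['"To be, or not to be"\n', '"That is the question"\n']

def tokenize(line):
    """Strip whitespace, delete ASCII punctuation, lower-case, split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return line.strip().translate(table).lower().split()

# Flatten all lines into one list of words.
words = [word for line in sample_lines for word in tokenize(line)]

# Counter builds the word -> count mapping in one step.
counts = Counter(words)

total_words = len(words)         # every occurrence counts
unique_words = len(counts)       # distinct words only
top_two = counts.most_common(2)  # highest counts first

print(total_words, unique_words, top_two)
```

`Counter.most_common(n)` replaces the manual `sorted(..., key=lambda x: x[1], reverse=True)` step, and words with equal counts come back in the order they were first seen.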


## Notes 

### Strings
- `strip()` removes whitespace (including `\n`) from the beginning and end of a string.
- `maketrans()` builds a translation table that `translate()` uses to substitute or remove characters.
    - It accepts 1, 2, or 3 arguments. With 1 argument, it must be a dictionary: each key (a single character or an integer Unicode code point) is replaced by its value (a code point, a string, or `None` to delete the character).
    - With 2 arguments, both must be strings of the same length; each character in the first string is replaced by the character at the same position in the second.
    - With 3 arguments, the first two strings define the same replacement, and every character in the third string is deleted.
    - The table only takes effect when it is passed to `a_string.translate(transmap)`.
- `lower()` returns a copy of the string with all characters in lower case.
- `split()` with no arguments splits a string on whitespace and returns the words as a list.
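
The three call forms of `str.maketrans()` described above can be sketched as follows. The input strings are made up for illustration; only standard-library calls are used.

```python
import string

# One argument: a dict from characters to replacements; None deletes the character.
one_arg = str.maketrans({'a': '4', 'e': None})
leet = 'repeated'.translate(one_arg)

# Two arguments: equal-length strings, mapped position by position.
two_arg = str.maketrans('abc', 'xyz')
swapped = 'aabbcc'.translate(two_arg)

# Three arguments: the same replacement, plus a string of characters to delete.
three_arg = str.maketrans('o', '0', string.punctuation)
cleaned = 'Hello, world!'.translate(three_arg)

# The full chain used in the exercise: strip, delete punctuation, lower-case, split.
no_punct = str.maketrans('', '', string.punctuation)
tokens = '  "Good night, good night!"\n'.strip().translate(no_punct).lower().split()

print(leet, swapped, cleaned, tokens)
```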

### File I/O
- To read a text file into a list with one raw string per line, use `open('file.txt').readlines()`.
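
As a small sketch of the same idea, the example below wraps `open()` in a `with` block so the file is closed automatically. The file name `sample.txt` and its contents are hypothetical, written to a temporary directory only so the snippet runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a small hypothetical file to read back.
    path = os.path.join(tmpdir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('ACT I\nSCENE I\n')

    # readlines() keeps the trailing newline on each line.
    with open(path, encoding='utf-8') as f:
        raw_lines = f.readlines()

    # Iterating over the file object gives the same lines; rstrip drops the newline.
    with open(path, encoding='utf-8') as f:
        stripped_lines = [line.rstrip('\n') for line in f]

print(raw_lines)
print(stripped_lines)
```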