# Problem 7

**Letter frequencies.** This problem has three (3) exercises worth a total of ten (10) points.

Letter frequency in text has been studied in cryptanalysis, in particular for frequency analysis. Linguists use letter frequency analysis as a rudimentary technique for language identification, where it's particularly effective as an indicator of whether an unknown writing system is alphabetic, syllabic, or ideographic.

There are three main ways to count letter frequency, and each generally produces a very different chart of common letters. The first counts letters in the root words of a dictionary. The second includes all word variants when counting, such as gone, going, and goes, and not just the root word go; this makes letters like "s" appear much more frequently. The third counts letters based on their frequency in the actual text being studied.
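To make the difference between the first two methods concrete, here is a minimal sketch. The word lists and the helper `letter_counts` are hypothetical illustrations, not part of the problem; the sketch only shows how adding word variants shifts the counts.

```python
from collections import Counter

# Hypothetical toy data: one root word and its inflected variants.
root_words = ["go"]
all_variants = ["go", "gone", "going", "goes"]

def letter_counts(words):
    """Count how often each letter occurs across a list of words."""
    return Counter("".join(words).lower())

root_counts = letter_counts(root_words)
variant_counts = letter_counts(all_variants)

# "s" never appears in the root word but shows up once a variant is included.
print(root_counts["s"], variant_counts["s"])  # 0 1
print(root_counts["g"], variant_counts["g"])  # 1 5
```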

For more details, refer to the link: 
https://en.wikipedia.org/wiki/Letter_frequency

In this problem, we will focus on the third method.

**Exercise 0** (2 points). First, given an input string, define a function `preprocess` that returns the string with non-alphabetic characters removed and all letters converted to lowercase.

For example, "We are coding letter Frequency! Yay!" would be transformed into "wearecodingletterfrequencyyay".

### Scratch Work

In [1]:
import string

In [2]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [3]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [4]:
string.printable[62:]

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

### My Function (Exercise 0)

In [5]:
def preprocess(S):
    #
    # Keep only the ASCII letters; strings are immutable, so build a new one.
    alphaonly = ''
    for char in S:
        if char in string.ascii_letters:
            alphaonly += char

    return alphaonly.lower()
    #
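A more compact variant of the same idea, offered as a sketch: the name `preprocess_compact` is hypothetical, and the filter deliberately tests membership in `string.ascii_letters` so that only a-z and A-Z survive (`str.isalpha` would also accept non-ASCII letters such as "é").

```python
import string

def preprocess_compact(S):
    """Keep only ASCII letters and lowercase them (hypothetical alternative)."""
    return "".join(ch for ch in S if ch in string.ascii_letters).lower()

example = "We are coding letter Frequency! Yay!"
print(preprocess_compact(example))  # wearecodingletterfrequencyyay
```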

In [6]:
# Test cell: valid_string
import random, string

N_str = 100 #Length of random string

def generate_str(n):
    random_str = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits + string.punctuation) for _ in range(n))
    return random_str

def check_preprocess_str(n):
    random_str = generate_str(n)
    print("Input String: ",random_str)
    assert preprocess(random_str).islower() == True
    assert preprocess(random_str).isalpha() == True
    print("|----Your function seems to work correct for the string----|"+"\n")

check_preprocess_str(N_str)
check_preprocess_str(N_str)
check_preprocess_str(N_str)

print("\n(Passed)!")

Input String:  14G+nS_9!\>#-q~XcQKbd{c.|UPB:iPK[']4J9b,ZC{fIuD4B%i=lwV_"!O\lo/W!~H\v=eg9Wlti\!5$;T+Wo|Tb}hz|4WX]JFI
|----Your function seems to work correct for the string----|

Input String:  eS;6#84zM(MyS`i~_Ht3<NU[HeRX,oHAc@Zx:Ef`La;EDiNXm&1,ngG=Ei%X7!'*+lRXg#{k!(!8Zb$I/Vl\eNC`:TW{>A67=lEF
|----Your function seems to work correct for the string----|

Input String:  ZW!2T(nRjC"AnvPAy*gr]lkP=8v<WER`o{XhK(8k$voI/fQ!{~i$u8tnRxz+\"/$h/wc)q;6n_{bh}E*JQ)qn&\5}[.kQ/1:.'K+
|----Your function seems to work correct for the string----|


(Passed)!


**Exercise 1** (4 points). With the necessary pre-processing complete, the next step is to write a function `count_letters(S)` to count the number of occurrences of each letter in the alphabet.  

You can assume that only letters will be present in the input string. The function should return a dictionary; any letter (a-z) missing from the input string must still appear as a key, with a value of zero.

### Scratch

In [7]:
raw_string = r'G>q#F!VdDOM?HD\QU(>243{JHh`)&{&vO5IH(e[1>F?vO*s+RWd$H=>t]5Ul|VQ=VJ!Xw`dFit{f%Rp}^Kf}#U53c&NIWg;*ldy1'
ppd_string = preprocess(raw_string)

In [8]:
ppd_string

'gqfvddomhdqujhhvoihefvosrwdhtulvqvjxwdfitfrpkfucniwgldy'

In [9]:
from collections import Counter

In [10]:
string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

In [11]:
new_dict = dict(Counter(ppd_string))
# new_dict

In [12]:
for letter in string.ascii_lowercase:
    if letter not in new_dict:
        new_dict[letter] = 0

In [13]:
len(new_dict)

26

### My Function (Exercise 1)

In [14]:
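def count_letters(S):
    #
    # Count the letters that occur in S; Counter is a dict subclass.
    new_dict = Counter(S)

    # Make sure every letter a-z is present, with 0 for letters not in S.
    for letter in string.ascii_lowercase:
        if letter not in new_dict:
            new_dict[letter] = 0

    # Return a plain dict, since the exercise asks for a dictionary.
    return dict(new_dict)
    #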
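The same result can be reached without `Counter` by starting from a zero-filled dictionary. This is a sketch under the exercise's stated assumption that the input contains only lowercase letters a-z (any other character would raise a `KeyError`); the name `count_letters_sketch` is hypothetical.

```python
import string

def count_letters_sketch(S):
    """Count letters starting from a dict that already holds every letter."""
    counts = dict.fromkeys(string.ascii_lowercase, 0)
    for ch in S:
        counts[ch] += 1  # assumes ch is one of a-z
    return counts

result = count_letters_sketch("hello")
print(result["l"], result["z"], len(result))  # 2 0 26
```

Starting from all 26 keys means the "missing letters map to zero" requirement holds by construction instead of being patched in afterwards.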

In [15]:
# Test cell: count_letters
import collections

N_processed_str = 100

def generate_processed_str(n):
    random_processed_str = ''.join(random.choice(string.ascii_lowercase) for _ in range(n))
    return random_processed_str

def check_count_letters(S):
    print("Input String: ",S)
    random_char = chr(random.randint(97,122))
    print("Character frequency evaluated for: ", random_char)
    if(random_char in S):
        assert count_letters(S)[random_char] == collections.Counter(S)[random_char]
        print("|----Your function seems to return correct freq for the char----|"+"\n")
    else:
        assert count_letters(S)[random_char] == 0
        print("|----Your function seems to return correct freq for the char----|"+"\n")
        
check_count_letters(generate_processed_str(N_processed_str))
check_count_letters(generate_processed_str(N_processed_str))
check_count_letters(generate_processed_str(N_processed_str))
print("\n(Passed)!")

Input String:  reaggjgyhgabcsrpedyhrtdfoqcbetxcejxqgdlufpqekpjcfvxgbqwmotuxlztqhpplhavbwulrrrodgvkmlobggwqoqbcuhenu
Character frequency evaluated for:  s
|----Your function seems to return correct freq for the char----|

Input String:  vvacmhjgqmbogkdpskzriyouoqdzxiezgppxcwroqkmlvdxkxctgglcnofbyyjfzprkulhtccxlnyqodcyzaeanxnbnmauymnrfs
Character frequency evaluated for:  b
|----Your function seems to return correct freq for the char----|

Input String:  cjzomdwgcbqmtcnvohoqmmmxwrfnwunjewbhqgfxklszcxngddsdpdzkxhrfaowbwrfrdvjyscmrmhgglbabkzjtljoalnlinsex
Character frequency evaluated for:  g
|----Your function seems to return correct freq for the char----|


(Passed)!


**Exercise 2** (4 points). The next step is to sort a dictionary whose keys are all the letters of the alphabet and whose values are the number of times each letter occurs in the text.

Sort first by occurrence count in decreasing order; letters with the same count should be ordered alphabetically. The function `find_top_letter(d)` should return the first letter in that order.
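The two-level ordering can be expressed with a tuple sort key. The sketch below uses a hypothetical toy dictionary whose insertion order is deliberately not alphabetical, to show why sorting by count alone with `reverse=True` is not enough: Python's sort is stable, so tied letters simply keep their original order.

```python
# Hypothetical toy distribution with a tie between "b" and "a".
counts = {"b": 3, "a": 3, "c": 1}

# Sorting by count only: the tie keeps insertion order, so "b" comes first.
naive = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(naive)    # [('b', 3), ('a', 3), ('c', 1)]

# Negating the count sorts descending; the letter breaks ties alphabetically.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)  # [('a', 3), ('b', 3), ('c', 1)]
```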

### Scratch Work

In [16]:
sorted_d = sorted(new_dict.items(), key=lambda kv: (-kv[1], kv[0]))
sorted_d

[('d', 6),
 ('f', 5),
 ('h', 5),
 ('v', 5),
 ('i', 3),
 ('o', 3),
 ('q', 3),
 ('u', 3),
 ('w', 3),
 ('g', 2),
 ('j', 2),
 ('l', 2),
 ('r', 2),
 ('t', 2),
 ('c', 1),
 ('e', 1),
 ('k', 1),
 ('m', 1),
 ('n', 1),
 ('p', 1),
 ('s', 1),
 ('x', 1),
 ('y', 1),
 ('a', 0),
 ('b', 0),
 ('z', 0)]

### My Function (Exercise 2)

In [17]:
def find_top_letter(d):
    #
    # Sort by count (descending), then by letter (ascending) to break ties.
    sorted_tup = sorted(d.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted_tup[0][0]
    #
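Since only the first letter of the ordering is needed, a full sort is not strictly required. As a sketch, `min` with the same tuple key finds it in a single pass; the name `find_top_letter_min` and the toy dictionary are hypothetical.

```python
def find_top_letter_min(d):
    """Return the letter with the highest count; ties go to the earlier letter."""
    return min(d, key=lambda letter: (-d[letter], letter))

toy = {"z": 4, "k": 7, "c": 7, "a": 0}
print(find_top_letter_min(toy))  # c
```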


In [18]:
# Test cell: highest_freq_letter

def create_random_dict():
    max_char_value = random.randint(5, 20)
    random_dict = {c:random.randint(0,max_char_value-1) for c in string.ascii_lowercase}
    random_letter1, random_letter2 = random.sample(string.ascii_lowercase, 2)
    random_dict[random_letter1], random_dict[random_letter2] = max_char_value, max_char_value
    if(random_letter1 < random_letter2):
        return random_letter1, random_dict
    else:
        return random_letter2, random_dict

def check_top_letter():
    top_letter, random_dict = create_random_dict()
    user_letter = find_top_letter(random_dict)
    assert user_letter == top_letter
    print("Input Dictionary: ", random_dict)
    print("Your function correctly returned most frequent letter: {} \n".format(user_letter))
    
check_top_letter()
check_top_letter()
check_top_letter()
print("\n(Passed)!")

Input Dictionary:  {'a': 11, 'b': 7, 'c': 1, 'd': 2, 'e': 16, 'f': 6, 'g': 3, 'h': 1, 'i': 6, 'j': 2, 'k': 17, 'l': 7, 'm': 6, 'n': 6, 'o': 15, 'p': 1, 'q': 4, 'r': 16, 's': 12, 't': 7, 'u': 12, 'v': 17, 'w': 7, 'x': 8, 'y': 11, 'z': 12}
Your function correctly returned most frequent letter: k 

Input Dictionary:  {'a': 1, 'b': 2, 'c': 2, 'd': 2, 'e': 6, 'f': 4, 'g': 1, 'h': 0, 'i': 2, 'j': 7, 'k': 1, 'l': 2, 'm': 2, 'n': 0, 'o': 5, 'p': 7, 'q': 0, 'r': 0, 's': 3, 't': 3, 'u': 5, 'v': 0, 'w': 1, 'x': 2, 'y': 0, 'z': 5}
Your function correctly returned most frequent letter: j 

Input Dictionary:  {'a': 0, 'b': 9, 'c': 10, 'd': 0, 'e': 1, 'f': 7, 'g': 5, 'h': 6, 'i': 4, 'j': 6, 'k': 6, 'l': 10, 'm': 7, 'n': 8, 'o': 7, 'p': 2, 'q': 4, 'r': 6, 's': 8, 't': 0, 'u': 4, 'v': 6, 'w': 7, 'x': 3, 'y': 3, 'z': 6}
Your function correctly returned most frequent letter: c 


(Passed)!


**Fin!** You've reached the end of this problem. Don't forget to restart the kernel and run the entire notebook from top-to-bottom to make sure you did everything correctly. If that is working, try submitting this problem. (Recall that you *must* submit and pass the autograder to get credit for your work!)