# Probability Refresher - Exercise 01

Consider English text written with $27$ possible symbols: the letters a-z and the space character. You are given a collection of $N$ English-language documents. For each document, you have to:
* Calculate and plot the probability distribution $P(x)$ over the 27 letters $x$.
* Calculate and plot the probability distribution $P(x, y)$ over the $27 \times 27$ possible (ordered) di-grams $xy$. Note: For this joint distribution, its two marginals, $P(x)$ and $P(y)$, are identical.
* From this joint distribution, calculate and plot the conditional distributions $i)\ P(y|x)$, the conditional distribution of the second letter $y$ given the first letter $x$; and $ii)\ P(x|y)$, the conditional distribution of the first letter $x$ given the second letter $y$.
* Calculate and plot the Shannon information content $h(x) = \log_2(1/P(x))$ of an outcome $x$ and the entropy of a randomly selected letter, $H(X) = \sum\limits_x P(x)\log_2(1/P(x))$, assuming its probability is given by $P(x)$. Entropy is additive for independent random variables: $H(X,Y) = H(X) + H(Y)$ iff $P(x,y) = P(x)P(y)$. Is this true for the document?
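
The quantities in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the exercise's reference solution: the helper names (`joint_distribution`, `marginals`, `conditionals`, `entropy`) and the toy `sample` string are made up for this example, and characters outside the 27-symbol alphabet are simply dropped before digrams are formed.

```python
import math
from collections import Counter

# Assumption: the 27 symbols named in the exercise (a-z plus space).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def joint_distribution(text):
    """Estimate P(x, y) from the ordered digrams of `text`."""
    symbols = [c for c in text.lower() if c in ALPHABET]
    counts = Counter(zip(symbols, symbols[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def marginals(joint):
    """Sum the joint over y to get P(x), and over x to get P(y)."""
    px, py = Counter(), Counter()
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return dict(px), dict(py)


def conditionals(joint):
    """Return P(y | x) = P(x, y) / P(x) and P(x | y) = P(x, y) / P(y)."""
    px, py = marginals(joint)
    y_given_x = {(x, y): p / px[x] for (x, y), p in joint.items()}
    x_given_y = {(x, y): p / py[y] for (x, y), p in joint.items()}
    return y_given_x, x_given_y


def entropy(dist):
    """H = sum_i p_i * log2(1 / p_i), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)


# Hypothetical toy text; a real run would use the contents of a document.
sample = "the quick brown fox jumps over the lazy dog"
joint = joint_distribution(sample)
px, py = marginals(joint)
y_given_x, x_given_y = conditionals(joint)

h_joint = entropy(joint)
h_sum = entropy(px) + entropy(py)
print("H(X,Y) =", h_joint)
print("H(X) + H(Y) =", h_sum)
```

Since $H(X,Y) \le H(X) + H(Y)$ holds in general, with equality only when $P(x,y) = P(x)P(y)$, comparing `h_joint` with `h_sum` is one way to answer the additivity question for a given document.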

In [1]:
def read_doc(document):
    """Return the lines of the text file at path `document`."""
    with open(document, 'r') as f:
        return f.readlines()

In [24]:
def compute_probabilities_unigrams(document):
    """Print and return the relative frequency of every character in the document."""
    probabilities = {}
    lines = read_doc(document)

    total = 0
    for line in lines:
        # Every character except the newline is counted, punctuation included.
        for char in line.replace("\n", ""):
            total += 1
            key = char.lower()  # treat 'A' and 'a' as the same symbol
            probabilities[key] = probabilities.get(key, 0) + 1

    # Normalise the counts into probabilities.
    for el in probabilities:
        probabilities[el] = probabilities[el] / total
        print('({}: {})'.format(el, probabilities[el]))
    return probabilities


def compute_probabilities_digrams(document):
    """Print and return the relative frequency of every ordered character pair."""
    probabilities = {}
    lines = read_doc(document)

    total = 0
    for line in lines:
        text = line.replace("\n", "")
        # Pair each character with its successor; pairs never span a line break.
        for i in range(len(text) - 1):
            total += 1
            key = '{}{}'.format(text[i].lower(), text[i + 1].lower())
            probabilities[key] = probabilities.get(key, 0) + 1

    # Normalise the counts into probabilities.
    for el in probabilities:
        probabilities[el] = probabilities[el] / total
        print('({}: {})'.format(el, probabilities[el]))
    return probabilities
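
The functions above count every character that occurs in the file, punctuation included, whereas the exercise defines $P(x)$ over 27 symbols. A minimal sketch of one way to restrict the count to that alphabet, and to derive the Shannon information content $h(x)$ from it, is given here. The names `ALPHABET`, `unigram_distribution`, and `information_content`, as well as the `sample` string, are assumptions made for this illustration; they are not part of the notebook.

```python
import math
import string
from collections import Counter

# Assumption: the 27-symbol alphabet from the exercise (a-z plus space).
ALPHABET = string.ascii_lowercase + " "


def unigram_distribution(text, alphabet=ALPHABET):
    """Return P(x) over `alphabet`, ignoring every other character."""
    counts = Counter(c for c in text.lower() if c in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no symbols from the alphabet")
    # Symbols that never occur get probability 0, so the result has 27 entries.
    return {c: counts[c] / total for c in alphabet}


def information_content(dist):
    """Return h(x) = log2(1 / P(x)) for every symbol with P(x) > 0."""
    return {c: math.log2(1 / p) for c, p in dist.items() if p > 0}


# Hypothetical toy input; a real run would pass the text of a document.
sample = "Hello, world! Is this true?"
p = unigram_distribution(sample)
h = information_content(p)

# Print the symbols that occur, most probable first.
for symbol in sorted(h, key=h.get):
    print(repr(symbol), round(p[symbol], 4), round(h[symbol], 4))
```

Rare symbols carry more information: in the toy string, `w` occurs once and `l` three times, so `h['w']` is larger than `h['l']`.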


In [26]:
compute_probabilities_unigrams('./txts/LIFE_by_JM_Machado_de_Assis.txt')

(k: 0.004025871172122492)
(x: 0.0007259767687434002)
(): 0.00013199577613516366)
(w: 0.017423442449841606)
(e: 0.1112064413938754)
(a: 0.05946409714889123)
(d: 0.02514519535374868)
(j: 0.0017819429778247096)
(u: 0.030359028511087644)
(t: 0.06876979936642028)
(i: 0.050884371700105596)
(f: 0.017555438225976768)
(r: 0.04679250263991552)
(!: 0.0010559662090813093)
(m: 0.02257127771911299)
(z: 0.0009899683210137275)
(o: 0.05807814149947202)
( : 0.16420274551214362)
(s: 0.05194033790918691)
(p: 0.01399155227032735)
(v: 0.011021647307286166)
(": 0.00013199577613516366)
(_: 0.0005279831045406547)
(?: 0.0026399155227032735)
(y: 0.019469376979936642)
(g: 0.01253959873284055)
(:: 0.00019799366420274552)
(h: 0.05101636747624076)
(q: 0.00026399155227032733)
(b: 0.009833685322069694)
(-: 0.0009899683210137275)
(.: 0.021581309398099262)
((: 0.00013199577613516366)
(;: 0.004355860612460401)
(c: 0.01643347412882788)
(,: 0.01399155227032735)
(': 0.00013199577613516366)
(n: 0.05075237592397043)
(l: 0.036