# Character Entropy

In this notebook, you will learn about information entropy and how to compute it for discrete probability distributions.

## Imports

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from ipywidgets import interact, interactive, fixed

## Character counting and entropy

Write a function, `char_probs`, that takes a string and computes the probabilities of each character in the string:

* First do a character count and store the result in a dictionary.
* Then divide each character count by the total number of characters to compute the normalized probabilities.
* Return the dictionary of characters (keys) and probabilities (values).

In [3]:
mys = "Hello mother you are lovely"
checked = []
dic = {}
for i in mys:
    if i not in checked:
        checked.append(i)
        dic[i] = mys.count(i)
print(dic)

{'h': 1, ' ': 4, 'o': 4, 'a': 1, 'v': 1, 'm': 1, 'u': 1, 'r': 2, 't': 1, 'H': 1, 'l': 4, 'e': 4, 'y': 2}


In [4]:
for i in dic:
    dic[i] = dic[i]/len(mys)
dic_normal = dic
print(dic_normal)

{'h': 0.037037037037037035, ' ': 0.14814814814814814, 'o': 0.14814814814814814, 'a': 0.037037037037037035, 'v': 0.037037037037037035, 'm': 0.037037037037037035, 'u': 0.037037037037037035, 'r': 0.07407407407407407, 't': 0.037037037037037035, 'H': 0.037037037037037035, 'l': 0.14814814814814814, 'e': 0.14814814814814814, 'y': 0.07407407407407407}


In [5]:
def char_probs(s):
    """Find the probabilities of the unique characters in the string s.
    
    Parameters
    ----------
    s : str
        A string of characters.
    
    Returns
    -------
    probs : dict
        A dictionary whose keys are the unique characters in s and whose values
        are the probabilities of those characters.
    """
    # Count each unique character once.
    dic = {}
    for i in s:
        if i not in dic:
            dic[i] = s.count(i)
    print(dic)  # show the raw counts before normalizing
    # Normalize the counts by the string length to get probabilities.
    for i in dic:
        dic[i] = dic[i]/len(s)
    return dic

In [6]:
test1 = char_probs('aaaa')
assert np.allclose(test1['a'], 1.0)
test2 = char_probs('aabb')
assert np.allclose(test2['a'], 0.5)
assert np.allclose(test2['b'], 0.5)
test3 = char_probs('abcd')
assert np.allclose(test3['a'], 0.25)
assert np.allclose(test3['b'], 0.25)
assert np.allclose(test3['c'], 0.25)
assert np.allclose(test3['d'], 0.25)

{'a': 4}
{'b': 2, 'a': 2}
{'d': 1, 'c': 1, 'b': 1, 'a': 1}
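
The counting step can also be written with the standard library's `collections.Counter`, which tallies all characters in a single pass instead of calling `s.count` once per unique character. The sketch below is an alternative, not the notebook's solution; the name `char_probs_counter` is made up for illustration.

```python
from collections import Counter

def char_probs_counter(s):
    """Return {character: probability} for the string s (hypothetical variant)."""
    counts = Counter(s)   # one pass over s to count every character
    total = len(s)
    # An empty string yields an empty dict, so no division by zero occurs.
    return {ch: n / total for ch, n in counts.items()}

probs = char_probs_counter("aabb")
print(probs)  # {'a': 0.5, 'b': 0.5}
```

Unlike `char_probs` above, this variant does not print the raw counts, so it only returns the probabilities.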


The [entropy](http://en.wikipedia.org/wiki/Entropy_%28information_theory%29) is a quantitative measure of the disorder of a probability distribution. It is used extensively in Physics, Statistics, Machine Learning, Computer Science and Information Science. Given a set of probabilities $P_{i}$, the entropy is defined as:

$$H = - \sum_i P_i \log_2(P_i)$$

In this expression $\log_2$ is the base 2 log (`np.log2`), which is commonly used in information science. In Physics the natural log is often used in the definition of entropy.
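
As a quick check of the definition, take the distribution of the string `aabb`, where $P_a = P_b = \tfrac{1}{2}$:

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = -\left(-\tfrac{1}{2} - \tfrac{1}{2}\right) = 1$$

So two equally likely characters carry one bit of entropy. A string made of a single repeated character has $P = 1$, which gives $H = -1 \cdot \log_2 1 = 0$.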

Write a function, `entropy`, that computes the entropy of a probability distribution. The probability distribution will be passed as a Python `dict`. The keys in the `dict` will be the unique values in the distribution and the values in the `dict` will be the probabilities for those keys.

To compute the entropy, you should:

* First convert the values (probabilities) of the `dict` to a NumPy array of probabilities.
* Then use other NumPy functions (`np.log2`, etc.) to compute the entropy.
* Don't use any `for` or `while` loops in your code.

In [7]:
# convert values into np array
values = list(dic_normal.values())
values = np.array(values)
values

array([ 0.03703704,  0.14814815,  0.14814815,  0.03703704,  0.03703704,
        0.03703704,  0.03703704,  0.07407407,  0.03703704,  0.03703704,
        0.14814815,  0.14814815,  0.07407407])

In [8]:
x = np.array([1,2,3])
y = np.array([1,2,3])
z = x * y
print(z)
np.sum(z)

[1 4 9]


14

In [9]:
# compute the entropy
-(np.sum(values * np.log2(values)))

3.4215541688301352

In [10]:
def entropy(d):
    """Compute the entropy of a dict d whose values are probabilities."""
    # Convert the probabilities to a NumPy array, then apply the definition.
    values = np.array(list(d.values()))
    return -(np.sum(values * np.log2(values)))

In [11]:
assert np.allclose(entropy({'a': 0.5, 'b': 0.5}), 1.0)
assert np.allclose(entropy({'a': 1.0}), 0.0)
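
One edge case is worth noting: if a distribution contains a probability of exactly zero, `np.log2(0)` evaluates to `-inf` and the product `0 * -inf` is `nan`, so `entropy` as written would return `nan`. Dictionaries produced by `char_probs` never contain zeros, so the notebook is unaffected. For other inputs, a minimal sketch that follows the usual convention of treating $0 \log_2 0$ as $0$ could look like this (the name `entropy_nonzero` is hypothetical):

```python
import numpy as np

def entropy_nonzero(d):
    """Entropy in bits of a dict of probabilities, skipping zero entries."""
    p = np.array(list(d.values()), dtype=float)
    p = p[p > 0]  # drop zeros: 0 * log2(0) is treated as 0 by convention
    return float(-np.sum(p * np.log2(p)))

print(entropy_nonzero({'a': 0.5, 'b': 0.5, 'c': 0.0}))  # 1.0
```

Boolean masking keeps the computation free of `for` and `while` loops, in line with the exercise constraints.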

Use IPython's `interact` function to create a user interface that allows you to type a string into a text box and see the entropy of the character probabilities of the string. Try to get an intuitive sense of the entropy by typing in different strings. What strings make the entropy large? Small?

In [12]:
x = interact(char_probs, s="Type Something")

{'i': 1, 'k': 1, ' ': 2, 'o': 2, 'l': 1, 'e': 1, 'p': 2, 'I': 1}


{' ': 0.18181818181818182,
 'I': 0.09090909090909091,
 'e': 0.09090909090909091,
 'i': 0.09090909090909091,
 'k': 0.09090909090909091,
 'l': 0.09090909090909091,
 'o': 0.18181818181818182,
 'p': 0.18181818181818182}
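
The cell above passes `char_probs` to `interact`, so the widget displays character probabilities rather than an entropy. To see the entropy itself, the two steps need to be combined into one function of the string. The sketch below is one way to do that; `string_entropy` is a hypothetical helper, and it uses only the standard library so that it also runs outside the notebook. Returning `0.0` for an empty text box is an assumption.

```python
import math
from collections import Counter

def string_entropy(s):
    """Entropy in bits of the character distribution of s (hypothetical helper)."""
    if not s:
        return 0.0  # assumption: report 0 for an empty text box
    total = len(s)
    # Apply H = -sum(P * log2(P)) to the character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in Counter(s).values())

# In the notebook, this would be wired to a text box with:
# interact(string_entropy, s="Type Something")
```

Strings with many distinct, evenly used characters give large values, while strings dominated by one character give values near zero.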

Find the entropies of all of the books in the directory `/data/gutenberg`:

In [13]:
ls /data/gutenberg/*.txt

/data/gutenberg/11.txt    /data/gutenberg/17208.txt  /data/gutenberg/33511.txt
/data/gutenberg/1342.txt  /data/gutenberg/23.txt     /data/gutenberg/84.txt
/data/gutenberg/1400.txt  /data/gutenberg/2701.txt
/data/gutenberg/1661.txt  /data/gutenberg/29021.txt


* Store your entropies in a list
* Create an appropriately labeled visualization of the distribution of entropies
* Print the mean entropy
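
The notebook does not include a solution for this part. A minimal sketch of the first and last steps, under stated assumptions, might look like the following. The helper names `text_entropy` and `book_entropies` are hypothetical, and the files are assumed to decode as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
import glob
import math
from collections import Counter

def text_entropy(text):
    """Entropy in bits of the character distribution of text."""
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

def book_entropies(pattern):
    """Return a list with one character entropy per file matching pattern."""
    entropies = []
    for path in sorted(glob.glob(pattern)):
        # Assumption: UTF-8 text; undecodable bytes are replaced.
        with open(path, encoding="utf-8", errors="replace") as f:
            entropies.append(text_entropy(f.read()))
    return entropies

entropies = book_entropies("/data/gutenberg/*.txt")
if entropies:
    print("mean entropy:", sum(entropies) / len(entropies))
```

For the visualization, one option is a histogram of the list, for example `plt.hist(entropies)` followed by `plt.xlabel`, `plt.ylabel` and `plt.title` calls to label the axes and the figure.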
