# Entropy of written English text

Author: J. Lizier, Isabelle De Backer, 2022-; based on the original Matlab tutorials.

The following cell imports the libraries needed for the analysis:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math

# Specifics required for the text processing here:
import string
import re

# Preparing your environment

As per `Module_2_notebook.ipynb`, we need the functions we defined in earlier notebooks. Gather the new functions you wrote in this module into your `simpleinfotheory.py` script, and make sure it can be imported from here (you may need to change the folder referenced below) before you run the import line in the next cell:

In [2]:
# Option 3: edit simpleinfotheory.py and paste your functions into that as you write them
import sys
sys.path.append('../../Module1-IntroToInfoTheory/PythonCode/completed/')
import simpleinfotheory

# 12. (Optional extension) Entropy of written English text

Let's compute the Shannon information contents of letters in English language text ourselves, using the collected scripts from the 1990s comedy [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld).

1. Download the collection of text extracted from Seinfeld scripts following the links on Module 2 on Canvas.
1. Open the data in a text file to inspect it (you should always do this!). We have each character's line on a different line of text. There is much punctuation in here as well.
1. Load the data into Python: (_note_ you may need to alter the filename/path to match your own)

In [3]:
filename = './Seinfeld-scripts-textOnly.txt'
with open(filename, 'rt') as f:
    str = f.read()

4. Now we need to pre-process it to remove punctuation characters and digits, turn newlines into spaces, and convert all upper case characters into lower case. We'll also convert it to a numpy array. Afterwards, let's check that we're only left with letters and spaces by examining the set of unique symbols in `processedStr` (do not end the last line with ";" so that we see the output!):

In [4]:
# Raw string so the backslashes reach the regex engine unchanged; '+' avoids matching empty strings
p = re.compile(r'[!"#\$%&\'\(\)\*\+\,-\.\/:;<=>\?@\[\]\\\^_`{\|}~0-9]+')
processedStr = p.sub('', str) # Remove punctuation characters and digits
processedStr = ' '.join(processedStr.split('\n')) # Replace newline characters with spaces
processedStr = processedStr.lower() # Convert all upper case into lower case
processedStr = np.array(list(processedStr)) # Finally convert this into a numpy array so we can work with it
np.unique(processedStr)

array([' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
       'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
       'z'], dtype='<U1')
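
The pre-processing steps can also be tried on a short string before running them on the full script. The following is a minimal sketch, assuming that the set of characters to strip is exactly `string.punctuation` plus the digits; `preprocess` and the sample text are hypothetical names made up for illustration, not part of the course code.

```python
import re
import string

import numpy as np

def preprocess(text):
    """Keep only lower-case letters and spaces (hypothetical helper)."""
    # Build a character class from every punctuation character plus the digits.
    pattern = '[' + re.escape(string.punctuation) + '0-9]+'
    cleaned = re.sub(pattern, '', text)
    # Newlines become spaces, then everything is folded to lower case.
    cleaned = cleaned.replace('\n', ' ').lower()
    return np.array(list(cleaned))

# Made-up sample text, only to exercise the function.
sample = "JERRY: Hello, Newman...\nGEORGE: It's 12 o'clock!"
chars = preprocess(sample)
print(''.join(chars))
print(np.unique(chars))
```

Note that removing a token such as a number can leave two spaces in a row; the notebook's own pre-processing has the same property, so spaces are slightly over-counted in both.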

5. Now compute the entropy of these characters (that is, the average Shannon information content per character), as derived from their probabilities of occurrence in the Seinfeld script, using your `simpleinfotheory.entropyempirical()` function. Please note:
    - You will need to have imported your `simpleinfotheory` scripts as above
    - I would suggest that you edit your function `entropyempirical()` in `simpleinfotheory.py` to uncomment the line `[symbols, counts] = np.unique(xn, axis=0, return_counts=True)` instead of the subsequent for loop (which can be commented out), but still include the line `probabilities = counts / xnSamples`. (You can see how this is done in the `simpleinfotheory.py` solution code). This will run much faster. You will need to restart the kernel for this to take effect.

    How does this compare to the stated value of the entropy of characters from [MacKay](http://www.inference.org.uk/itprnn/book.pdf) in Table 2.9 (sec 2.3; or see slide 26 of our lecture) as estimated from "_The Frequently Asked Questions Manual for Linux_"? Did you expect it to be the same, and why or why not?

In [5]:
# Compute the entropy of the characters:
(result, symbols, probabilities) = simpleinfotheory.entropyempirical(processedStr)
print("Entropy of single characters from Seinfeld scripts is %.4f bits" % result)

Entropy of single characters from Seinfeld scripts is 4.0846 bits
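
For reference, a plug-in entropy estimate of this kind can be sketched in a few lines. This is an assumption about the general approach, not the actual body of `simpleinfotheory.entropyempirical()`; `entropy_from_samples` is a hypothetical stand-in that only mirrors the `(result, symbols, probabilities)` return shape described in this notebook.

```python
import numpy as np

def entropy_from_samples(samples):
    """Plug-in entropy estimate in bits (hypothetical stand-in, not the course function)."""
    samples = np.asarray(samples)
    # Count how often each distinct symbol occurs.
    symbols, counts = np.unique(samples, return_counts=True)
    probabilities = counts / samples.size
    # H(X) = -sum_x p(x) log2 p(x); every observed symbol has p(x) > 0.
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy, symbols, probabilities

# Toy input, not the Seinfeld data.
toy = np.array(list("abracadabra"))
h, syms, probs = entropy_from_samples(toy)
print("Entropy of toy string: %.4f bits" % h)
```

As a sanity check on the result above, 27 equally likely symbols would give log2(27) ≈ 4.75 bits, so an estimate of 4.0846 bits reflects how unevenly the letters and the space are used.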


6. Next, compute the Shannon information content of each character, and again compare these to those quoted by MacKay.<br/>
The `simpleinfotheory.entropyempirical()` function returns the probability of each symbol along with the entropy, as a tuple `(result, symbols, probabilities)` (see its header for details). When you call it, capture all three outputs: `(result, symbols, probabilities) = simpleinfotheory.entropyempirical(processedStr)`. You can then pass `probabilities` as an argument to your `simpleinfotheory.infocontent()` code. When comparing to MacKay's results, remember that your Shannon information contents are listed in the order the characters appear in `symbols` (the sorted order returned by `np.unique(processedStr)` above), which may differ from the order the book displays.

In [6]:
# Compute the Shannon information content of each character:
characterInfoContents = simpleinfotheory.infocontent(probabilities)
# To display more nicely:
for ix in range(symbols.size):
    print('Info content of %s is %.4f bits' % (symbols[ix], characterInfoContents[ix]))

Info content of [' '] is 2.3116 bits
Info content of ['a'] is 4.0322 bits
Info content of ['b'] is 6.4422 bits
Info content of ['c'] is 5.9460 bits
Info content of ['d'] is 5.2915 bits
Info content of ['e'] is 3.4355 bits
Info content of ['f'] is 6.4115 bits
Info content of ['g'] is 5.4955 bits
Info content of ['h'] is 4.3057 bits
Info content of ['i'] is 4.1372 bits
Info content of ['j'] is 8.3484 bits
Info content of ['k'] is 6.3208 bits
Info content of ['l'] is 4.9140 bits
Info content of ['m'] is 5.5827 bits
Info content of ['n'] is 4.2814 bits
Info content of ['o'] is 3.8170 bits
Info content of ['p'] is 6.4067 bits
Info content of ['q'] is 11.2288 bits
Info content of ['r'] is 4.6424 bits
Info content of ['s'] is 4.4777 bits
Info content of ['t'] is 3.7499 bits
Info content of ['u'] is 5.1079 bits
Info content of ['v'] is 7.1339 bits
Info content of ['w'] is 5.5464 bits
Info content of ['x'] is 9.9738 bits
Info content of ['y'] is 5.1689 bits
Info content of ['z'] is 10.5129 bits
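
The relationship between these values and the entropy can be checked on a toy distribution. The sketch below assumes the usual definition h(x) = -log2 p(x); `info_content` is a hypothetical helper and not necessarily how `simpleinfotheory.infocontent()` is written, and the probabilities are made up rather than taken from the Seinfeld data.

```python
import numpy as np

def info_content(p):
    """Shannon information content h(x) = -log2 p(x), in bits (hypothetical helper)."""
    return -np.log2(np.asarray(p, dtype=float))

# Toy distribution, NOT the Seinfeld estimates.
symbols = np.array(['a', 'b', 'c'])
probabilities = np.array([0.5, 0.25, 0.25])

h = info_content(probabilities)
for s, bits in zip(symbols, h):
    print('Info content of %s is %.4f bits' % (s, bits))

# The entropy is the probability-weighted average of the information contents.
entropy = np.sum(probabilities * h)
print('Entropy: %.4f bits' % entropy)
```

Reading the table above in the other direction, an information content of 2.3116 bits for the space corresponds to a probability of 2^(-2.3116), about 0.20, so roughly one character in five is a space.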

7. _Next level challenge_: can you move on to compute the joint entropy for consecutive pairs of characters, and then the conditional entropy of the second character given the first?<br/>
    _Hint_: to select all but the last item in a numpy array `x`, you can refer to `x[:-1]`, whilst to select all but the first item in an array `x`, you can refer to `x[1:]`<br/>
    What does this tell us about how reading one character reduces our uncertainty about the next, and does this make sense?

In [7]:
# Compute the joint entropies for two characters:
#  Need a matrix with first column being first character, and second column
#  being the second
characterPairSamples = np.column_stack( (processedStr[:-1],processedStr[1:]) )
pairEntropy,__,__ = simpleinfotheory.jointentropyempirical(characterPairSamples)
print('Entropy of character pairs: %.4f bits' % pairEntropy)
# Compute the conditional entropy of the second character given the first:
conditionalEntropy = simpleinfotheory.conditionalentropyempirical(processedStr[1:], processedStr[:-1])
print('Conditional entropy of character given previous: %.4f bits' % conditionalEntropy)

Entropy of character pairs: 7.4496 bits
Conditional entropy of character given previous: 3.3650 bits
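
These numbers are consistent with the chain rule H(X,Y) = H(X) + H(Y|X): 7.4496 - 4.0846 = 3.3650 bits (up to the negligible effect of dropping the last character from the first-character samples). Knowing the previous character therefore removes about 4.0846 - 3.3650 = 0.72 bits of uncertainty about the next one. The same check can be sketched on a toy string; `entropy_of_rows` is a hypothetical helper written for this illustration, not a function from `simpleinfotheory.py`.

```python
import numpy as np

def entropy_of_rows(rows):
    """Plug-in entropy (bits) of the rows of a 2-D array (hypothetical helper)."""
    rows = np.asarray(rows)
    # Each distinct row is treated as one joint symbol.
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / rows.shape[0]
    return -np.sum(p * np.log2(p))

# Toy text, not the Seinfeld data.
text = np.array(list("the cat sat on the mat and the rat sat on the hat"))
first, second = text[:-1], text[1:]   # consecutive character pairs
pairs = np.column_stack((first, second))

h_first = entropy_of_rows(first.reshape(-1, 1))
h_second = entropy_of_rows(second.reshape(-1, 1))
h_pair = entropy_of_rows(pairs)
# Chain rule: H(second | first) = H(first, second) - H(first)
h_cond = h_pair - h_first
print('H(first) = %.4f, H(pair) = %.4f, H(second|first) = %.4f'
      % (h_first, h_pair, h_cond))
```

Conditioning can never increase the entropy, so `h_cond` should not exceed `h_second`; the gap between the two is the mutual information between consecutive characters.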


A more serious challenge would be to display the joint Shannon information contents, and the conditional Shannon information contents, as per Figures 2.2 and 2.3 of MacKay. This cannot be done with a simple modification to our simple Python scripts, as they were not set up to return the probabilities in a nicely ordered way for all possible combinations. (That was sacrificed to make your other tasks easier!) But you could attempt to pull out a list of all observed joint symbols and their probabilities, and sort them yourself, ready for display in such a figure. We will work further on this in the next module (and solutions are deferred to that module).
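
As a starting point only, one possible way to lay the joint probabilities out in a fixed order is to map each character to an integer code and accumulate pair counts in a square matrix. This is a rough sketch under that assumption and not the deferred course solution; `joint_info_table` is a hypothetical name.

```python
import numpy as np

def joint_info_table(chars):
    """Return (alphabet, table) with table[i, j] = -log2 p(alphabet[i], alphabet[j]).

    Hypothetical helper; pairs that never occur get an information content of inf.
    """
    chars = np.asarray(chars)
    # codes[k] is the index of chars[k] within the sorted alphabet.
    alphabet, codes = np.unique(chars, return_inverse=True)
    counts = np.zeros((alphabet.size, alphabet.size))
    # Accumulate counts of (current, next) symbol pairs.
    np.add.at(counts, (codes[:-1], codes[1:]), 1)
    joint_p = counts / counts.sum()
    # log2(0) is -inf; silence the warning and keep inf for unseen pairs.
    with np.errstate(divide='ignore'):
        table = -np.log2(joint_p)
    return alphabet, table

# Toy input, not the Seinfeld data.
alphabet, table = joint_info_table(np.array(list("abab abba")))
print(alphabet)
print(table)
```

Rows index the first character and columns the second, both in the sorted order of `alphabet`. Dividing each row of the counts by its row sum instead of by the grand total would give the conditional probabilities of the second character given the first.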