# Approximations to English

See section 3, "The Series of Approximations to English", of [A Mathematical Theory of Communication by Claude E. Shannon](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf).


In [1]:
# Imports.

# Selecting random items from lists.
import random

# Efficient data structures.
import collections

## Zero Order Letter Approximation

Create strings by selecting characters independently, and with equal probability, from a fixed set of characters.


### random
https://docs.python.org/3/library/random.html
> This module implements pseudo-random number generators for various distributions.
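
Because the generators are pseudo-random, giving two generators the same seed makes them produce the same sequence. The following is a minimal sketch of that behaviour; the seed value `42` and the names `rng_a`, `rng_b`, `seq_a` and `seq_b` are arbitrary choices for illustration and do not appear elsewhere in this notebook.

```python
import random

# Two independent generators created with the same (arbitrary) seed.
rng_a = random.Random(42)
rng_b = random.Random(42)

alphabet = 'abcdefghijklmnopqrstuvwxyz .'

# Each generator makes ten "random" selections from the same characters.
seq_a = [rng_a.choice(alphabet) for _ in range(10)]
seq_b = [rng_b.choice(alphabet) for _ in range(10)]

# The same seed yields the same sequence of selections.
print(seq_a == seq_b)  # True
```

The cells in this notebook do not set a seed, so their outputs will differ from run to run.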


### random.choice

https://docs.python.org/3/library/random.html#random.choice

Select a random element from a non-empty sequence, such as a list or a string.  
By random, we mean that every element has an equal chance of selection.

In [2]:
# Example: select a random element of a list.
random.choice([1, 2, 3, 4, 5])

3

In [3]:
# Select a random character.
random.choice('abcdefghijklmnopqrstuvwxyz .')

'k'
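
One way to see the "equal chance" claim in practice is to draw many characters and tally them. The sketch below is only an illustration: the number of draws (28,000) and the names `alphabet`, `draws` and `tally` are assumptions made for this example.

```python
import collections
import random

alphabet = 'abcdefghijklmnopqrstuvwxyz .'

# Draw many characters, one at a time, with random.choice.
draws = [random.choice(alphabet) for _ in range(28_000)]

# Tally how often each character was drawn.
tally = collections.Counter(draws)

# With 28 characters and 28,000 draws, each count should be near 1,000.
for char, count in sorted(tally.items()):
    print(f"'{char}': {count}")
```

The counts will not be exactly equal, but none should stray far from 1,000.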

### random.choices

https://docs.python.org/3/library/random.html#random.choices

> Return a k sized list of elements chosen from the population with replacement.  
> If a weights sequence is specified, selections are made according to the relative weights.

In [4]:
# Select a sequence of characters using equal weights.
''.join(random.choices('abcdefghijklmnopqrstuvwxyz .', k=100))

'xsrnm bnu oeo akz.uy prfxuzygjoapc.ejkjegfv genhwxkatgkikfzkovnadgurdrcvnteuofmurdd.ckobfuwbwjkjyoxu'
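
The cell above leaves `weights` unset, so every character is equally likely. As a rough sketch of what the `weights` argument does, the example below uses a made-up two-character population in which `'a'` is given nine times the weight of `'b'`; the population, the weights and the sample size are all hypothetical.

```python
import random

# Hypothetical population and weights: 'a' is nine times as likely as 'b'.
population = 'ab'
weights = [9, 1]

# Draw 10,000 characters with replacement, according to the weights.
sample = random.choices(population, weights=weights, k=10_000)

# The proportion of 'a' should be close to 9 / (9 + 1) = 0.9.
proportion_a = sample.count('a') / len(sample)
print(proportion_a)
```

The weights are relative, so `[9, 1]` and `[0.9, 0.1]` describe the same distribution.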

## First Order Letter Approximation

### Reading Text Files

The following was adapted from a response from ChatGPT.  
https://chatgpt.com/share/66ffdf0f-4094-800d-9ae9-63ffb9b20043

In [5]:
with open('data/frankenstein.txt', 'r') as file:
  # Read the whole file into a string.
  english = file.read()

# Change everything to lower case.
english = english.lower()
# The characters to keep.
keep = 'abcdefghijklmnopqrstuvwxyz .'
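# Remove unwanted characters. Newlines are not in keep, so they are dropped too,
# which joins the words on either side of each line break.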
cleaned = ''.join(c for c in english if c in keep)
# Count the frequency of each character.
counts = collections.Counter(cleaned)
# Print each character with its count, in order of first appearance.
for char, count in counts.items():
  print(f"'{char}': {count}")


't': 30379
'h': 19763
'e': 46094
' ': 71747
'p': 6134
'r': 20876
'o': 25254
'j': 502
'c': 9275
'g': 5980
'u': 10412
'n': 24359
'b': 5021
'k': 1760
'f': 8722
'a': 26743
's': 21173
'i': 24577
'm': 10545
'd': 16858
'y': 7923
'w': 7653
'l': 12722
'v': 3829
'.': 3145
'z': 213
'x': 677
'q': 324


In [6]:
# Sort the characters by count, most frequent first.
ordered = counts.most_common()
for char, count in ordered:
  print(f"'{char}': {count}")

' ': 71747
'e': 46094
't': 30379
'a': 26743
'o': 25254
'i': 24577
'n': 24359
's': 21173
'r': 20876
'h': 19763
'd': 16858
'l': 12722
'm': 10545
'u': 10412
'c': 9275
'f': 8722
'y': 7923
'w': 7653
'p': 6134
'g': 5980
'b': 5021
'v': 3829
'.': 3145
'k': 1760
'x': 677
'j': 502
'q': 324
'z': 213
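
Counts like these can serve directly as the relative weights for `random.choices`, which gives one possible way to build a first order approximation: characters are still chosen independently, but in proportion to how often they occur in the text. The following is a sketch under stated assumptions rather than the notebook's own implementation. To stay runnable on its own it counts a short stand-in string, `sample_text`, instead of reading `data/frankenstein.txt`, and the names `sample_counts`, `chars`, `weights` and `first_order` are introduced here for illustration.

```python
import collections
import random

# A short stand-in for the cleaned text, so the sketch runs without the data file.
sample_text = 'you will rejoice to hear that no disaster has accompanied the commencement of an enterprise.'

# Count the frequency of each character, as in the cells above.
sample_counts = collections.Counter(sample_text)

# Split the counter into parallel lists: characters and their counts.
chars = list(sample_counts.keys())
weights = list(sample_counts.values())

# Draw 100 characters independently, each in proportion to its count.
first_order = ''.join(random.choices(chars, weights=weights, k=100))
print(first_order)
```

In the notebook itself, the `counts` object built from the full text could be used in place of `sample_counts`. The output should show more spaces and more of the common letters than the zero order strings do, although it will still not contain many real words.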


## Dictionaries

## Second Order Letter Approximation

## First Order Word Approximation

## Second Order Word Approximation

## End