# Wordle Entropy Test

*"In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes."*

In Wordle, the first few guesses are all about identifying the matching letters as quickly as possible, so we want to pick letters that are most likely to appear in the corpus of 5-letter English words.

So let's figure out the best words to start with, i.e. words made up of the letters with the least entropy (uncertainty).
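For reference, here is a minimal sketch of the entropy described in the quote above: the average surprise, in bits, over the outcomes of a distribution. The function name `shannon_entropy` and the toy distributions are assumptions made for illustration; they are not used by the cells that follow.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with probability 0 contribute nothing, so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]            # maximally uncertain for two outcomes
biased_coin = [0.9, 0.1]          # more predictable, so lower entropy
uniform_letters = [1 / 26] * 26   # every letter equally likely

print(shannon_entropy(fair_coin))
print(shannon_entropy(biased_coin))
print(shannon_entropy(uniform_letters))
```

The more lopsided the distribution, the lower the entropy, which is the intuition behind preferring common letters in the first guesses.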

In [None]:
# Download a list of 5-letter words. Hopefully this is a good enough dataset.
# -P sets the target directory; without it wget treats the path as a second URL.
!wget "https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt" -P /home/emr-notebook

In [1]:
filepath = '/home/emr-notebook/sgb-words.txt'
with open(filepath) as fp:
    word_list = fp.readlines()
len(word_list)

5757

In [2]:
from collections import Counter

# We have 5757 words, but let's get rid of words with repeated letters, e.g. "asses".
# Each line still ends in a single '\n' here; it occurs only once, so it never trips the check.
clean_word_list = [w.replace('\n', '') for w in word_list if not any(v > 1 for v in Counter(w).values())]
len(clean_word_list)

3834
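The same filter can be written more directly by comparing a word's length to the size of its letter set. This is a small sketch on a made-up sample; `has_unique_letters` and `sample` are hypothetical names, not part of the notebook.

```python
def has_unique_letters(word):
    """True when no letter appears more than once in the word."""
    return len(set(word)) == len(word)

# Hypothetical sample: two words with repeated letters, two without.
sample = ["asses", "arose", "until", "sleep"]
unique_only = [w for w in sample if has_unique_letters(w)]
print(unique_only)
```

Note that this version expects words with the trailing newline already stripped.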

In [3]:
# 3834 words left. Let's get letter frequencies.
letter_counts = {}
for line in clean_word_list:
    for char in line:
        if (char.isalpha()):
            letter_counts[char] = letter_counts.get(char, 0) + 1

import operator
# Sort letters from most to least frequent.
sorted_x = sorted(letter_counts.items(), key=operator.itemgetter(1), reverse=True)
sorted_x

[('e', 1899),
 ('s', 1822),
 ('a', 1606),
 ('r', 1361),
 ('o', 1147),
 ('i', 1129),
 ('t', 1059),
 ('l', 1002),
 ('n', 941),
 ('d', 810),
 ('u', 766),
 ('c', 700),
 ('p', 639),
 ('h', 586),
 ('m', 576),
 ('y', 540),
 ('g', 473),
 ('b', 459),
 ('k', 433),
 ('w', 366),
 ('f', 348),
 ('v', 229),
 ('x', 100),
 ('z', 78),
 ('j', 59),
 ('q', 42)]

In [4]:
# And now the probabilities of each letter.
def probability(boe, x):
    # Relative frequency of letter x among all counted letters in boe.
    total = sum(boe.values())
    return boe[x] / total

probability_counts = {x: probability(letter_counts, x) for x in letter_counts}

In [5]:
# Let's see the letters ordered from most to least probable (the least surprising first).
sorted_p = sorted(probability_counts.items(), key=operator.itemgetter(1), reverse=True)
sorted_p

[('e', 0.09906103286384976),
 ('s', 0.09504434011476265),
 ('a', 0.08377673448095983),
 ('r', 0.0709963484611372),
 ('o', 0.059833072509128844),
 ('i', 0.058894105372978615),
 ('t', 0.05524256651017214),
 ('l', 0.05226917057902973),
 ('n', 0.049087115284298384),
 ('d', 0.04225352112676056),
 ('u', 0.03995826812728221),
 ('c', 0.036515388628064686),
 ('p', 0.03333333333333333),
 ('h', 0.030568596765779863),
 ('m', 0.03004694835680751),
 ('y', 0.028169014084507043),
 ('g', 0.02467396974439228),
 ('b', 0.023943661971830985),
 ('k', 0.02258737610850287),
 ('w', 0.019092331768388107),
 ('f', 0.01815336463223787),
 ('v', 0.011945748565466876),
 ('x', 0.005216484089723527),
 ('z', 0.004068857589984351),
 ('j', 0.0030777256129368804),
 ('q', 0.002190923317683881)]
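To connect these probabilities back to "surprise", here is a hedged sketch that converts each letter's probability into its surprisal, `-log2(p)`, measured in bits. The counts are copied from the output of cell `In [3]`; the helper `surprisal_bits` is an assumption added for illustration.

```python
import math

# Letter counts copied from the output of cell In [3] above.
letter_counts = {
    'e': 1899, 's': 1822, 'a': 1606, 'r': 1361, 'o': 1147, 'i': 1129,
    't': 1059, 'l': 1002, 'n': 941, 'd': 810, 'u': 766, 'c': 700,
    'p': 639, 'h': 586, 'm': 576, 'y': 540, 'g': 473, 'b': 459,
    'k': 433, 'w': 366, 'f': 348, 'v': 229, 'x': 100, 'z': 78,
    'j': 59, 'q': 42,
}
total = sum(letter_counts.values())

def surprisal_bits(letter):
    """Surprisal -log2(p) of a letter; common letters score low."""
    return -math.log2(letter_counts[letter] / total)

# Compare a few common letters with a few rare ones.
for letter in ("e", "s", "a", "z", "j", "q"):
    print(letter, round(surprisal_bits(letter), 2))
```

Sorting by probability in descending order, as the cell above does, is the same as sorting by surprisal in ascending order.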

In [6]:
# Get the summed probability of the characters in each word.
def get_total_prob(w):
    return sum(probability_counts[c] for c in w)

get_total_prob("beast")  

0.35706833594157533

In [7]:
# Top 5 words to start with are:
res = sorted(clean_word_list, key=get_total_prob, reverse=True)[:5]
[(r,get_total_prob(r)) for r in res]

[('arose', 0.40871152842983827),
 ('raise', 0.40777256129368805),
 ('arise', 0.40777256129368805),
 ('aster', 0.40412102243088166),
 ('tears', 0.4041210224308816)]

In [8]:
# Words we should never start with.
res = sorted(clean_word_list, key=get_total_prob)[:5]
[(r,get_total_prob(r)) for r in res]

[('jumpy', 0.13458528951486695),
 ('junky', 0.14287949921752738),
 ('humpf', 0.15206051121544079),
 ('whump', 0.15299947835159103),
 ('fudgy', 0.15320813771517997)]

In [9]:
# Let's Play
## If we start with the most likely word, 'arose',
## what should the 2nd word be? Pick the highest-scoring word that shares no letters with the 1st word.
chars = 'arose'

res = sorted([w for w in clean_word_list if not any(char in w for char in chars)], key=get_total_prob, reverse=True)[:5]
[(r,get_total_prob(r)) for r in res]

[('until', 0.25545122587376107),
 ('unlit', 0.25545122587376107),
 ('linty', 0.2436619718309859),
 ('glint', 0.24016692749087115),
 ('tunic', 0.23969744392279604)]

## OK - 1. arose 2. until ...
## Let's play.
![alt text](wordle3.png "Wordle")

## DONE :D