### Aggregate all the sources, keeping words of length 5 to 9 (202,471 words)

We aggregate the sources because words with affixes are still valid guesses in the game (e.g. UNimportant, REwrite, playED, catS).

No single data source is guaranteed to include every affixed form of a proper word, but each one contains some subset, so we aggregate them all as a best-effort solution.

In [14]:
# Union the word lists from every source, keeping only 5- to 9-letter words
word_set = set()
for path in ['enable_word_list', 'english_words_repo', 'online_plain_text', 'webster_dict_repo']:
    with open(path + '/words.txt') as f:
        for line in f:
            word = line.rstrip('\n')
            if 5 <= len(word) <= 9:
                word_set.add(word)
        
word_list = sorted(word_set)
len(word_list)

202471

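Adjacent words in a sorted dictionary usually share a common prefix, so we can shrink the stored strings by writing each word as the length of the prefix it shares with the previous word, followed by the remaining suffix.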

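Ex:

`bank, banked, bankers`

can be stored as

`bank, 4ed, 5rs`

where the leading digit is the length of the prefix shared with the previous word. The words here are distinct and at most 9 letters long, so that length never exceeds 8 and always fits in a single digit.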


In [22]:
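# Front-code a sorted word list: each entry is the length of the prefix
# shared with the previous word, followed by the remaining suffix
def prefix_encode(word_list):
    encoded = []
    prev = ''
    for word in word_list:
        idx = 0
        # Bound idx by both words so unsorted or duplicate input cannot raise IndexError
        while idx < len(prev) and idx < len(word) and prev[idx] == word[idx]:
            idx += 1
        # Omit the count entirely when nothing is shared
        encoded_word = '' if idx == 0 else str(idx)
        encoded_word += word[idx:]
        encoded.append(encoded_word)
        prev = word
    return encoded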

encoded_word_list = prefix_encode(word_list)
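
Reading the encoded list back requires the inverse step. The following is a minimal sketch, not code from the original notebook: `prefix_decode` is a hypothetical helper, and it assumes that words contain no digits and that the shared-prefix length is a single digit (which holds for words of at most 9 letters).

```python
def prefix_decode(encoded):
    """Rebuild the word list from prefix-encoded tokens (hypothetical helper)."""
    words = []
    prev = ''
    for token in encoded:
        if token and token[0].isdigit():
            # Leading digit = number of letters reused from the previous word
            shared = int(token[0])
            word = prev[:shared] + token[1:]
        else:
            # No digit means nothing is shared with the previous word
            word = token
        words.append(word)
        prev = word
    return words


sample = ['bank', '4ed', '5rs']
decoded = prefix_decode(sample)  # ['bank', 'banked', 'bankers']
```

A quick sanity check on the full data would be `prefix_decode(encoded_word_list) == word_list`.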

In [16]:
with open('word_list.txt', 'w') as f:
    f.write('\n'.join(word_list))
    
with open('encoded_word_list.txt', 'w') as f:
    f.write('\n'.join(encoded_word_list))

That is roughly a 45% reduction in file size, from 1.92MB to 1.06MB, which will help when we serve this list on the web later.
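
The saving can also be estimated in memory without writing files. This is a small sketch under the assumption that both lists are joined with `'\n'` and stored as UTF-8, as in the cell above; `size_reduction` is a hypothetical helper name.

```python
def size_reduction(original, encoded):
    """Fraction of bytes saved by storing `encoded` instead of `original`."""
    original_bytes = len('\n'.join(original).encode('utf-8'))
    encoded_bytes = len('\n'.join(encoded).encode('utf-8'))
    return 1 - encoded_bytes / original_bytes


# Tiny illustration: 19 bytes shrink to 12 bytes
ratio = size_reduction(['bank', 'banked', 'bankers'], ['bank', '4ed', '5rs'])
```

For the file sizes quoted here, the same formula gives 1 - 1.06 / 1.92, or about 0.45.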

It's also slightly faster to build a Trie from the encoded list, which is a nice bonus.
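
One plausible reason for the speed-up is that the shared-prefix length tells us which existing node to resume from, so each insertion only walks the new suffix instead of starting at the root. The sketch below illustrates that idea with a dict-based trie. It is an assumption about how such a builder could look, not the implementation used here; `build_trie_from_encoded`, `contains`, and the `END` marker are hypothetical names.

```python
END = '$'  # hypothetical end-of-word marker; assumes '$' never appears in a word


def build_trie_from_encoded(encoded):
    """Build a nested-dict trie directly from prefix-encoded tokens."""
    root = {}
    # path[i] is the node reached after the first i letters of the previous word
    path = [root]
    for token in encoded:
        if token and token[0].isdigit():
            shared, suffix = int(token[0]), token[1:]
        else:
            shared, suffix = 0, token
        # Drop the nodes beyond the shared prefix, then resume from there
        del path[shared + 1:]
        node = path[-1]
        for ch in suffix:
            node = node.setdefault(ch, {})
            path.append(node)
        node[END] = True
    return root


def contains(trie, word):
    """Return True if `word` was inserted as a complete word."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


trie = build_trie_from_encoded(['bank', '4ed', '5rs'])
```

With this layout, `contains(trie, 'banked')` is true while `contains(trie, 'banke')` is false, since only complete words carry the end marker.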