<div style="text-align: right" align="right"><i>Peter Norvig, 3 Jan 2020</i></div>

# Spelling Bee

The [Jan. 3 2020 Riddler](https://fivethirtyeight.com/features/can-you-solve-the-vexing-vexillology/) concerns the popular NYTimes [Spelling Bee](https://www.nytimes.com/puzzles/spelling-bee) puzzle:

*In this game, seven letters are arranged in a honeycomb lattice, with one letter in the center. Here’s the lattice from December 24, 2019:*

<img src="https://fivethirtyeight.com/wp-content/uploads/2020/01/Screen-Shot-2019-12-24-at-5.46.55-PM.png?w=1136" width=150>


*The goal is to identify as many words that meet the following criteria:*
1. *The word must be at least four letters long.*
2. *The word must include the central letter.*
3. *The word cannot include any letter beyond the seven given letters.*

*Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.*

***Which seven-letter honeycomb results in the highest possible game score?*** *To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.*

*For consistency, please use [this word list](https://norvig.com/ngrams/enable1.txt) to check your game score.*

# Approach to a Solution

Since the word list was on my web site (it is a standard Scrabble word list that I happen to host a copy of), I felt somewhat compelled to submit an answer. I had worked on word puzzles before, like Scrabble and Boggle. My first thought was that this puzzle is rather different because it deals with *unordered sets* of letters, not *ordered permutations* of letters. That makes things much easier. When I tried to find the optimal 5×5 Boggle board, I couldn't exhaustively try all $26^{5 \times 5} \approx 10^{35}$ possibilities; I had to do hillclimbing to find a locally (but not necessarily globally) optimal solution. But for Spelling Bee, it is feasible to try every possibility. Here's a sketch of an approach:
 

- Since every honeycomb must contain a pangram, I can find the best honeycomb by considering all possible pangrams and all possible centers for each pangram and taking the one that scores highest. Something like:

      max(game_score(pangram, center) 
          for pangram in pangrams for center in pangram)
      
- So it comes down to having an efficient-enough computation of the words that a honeycomb can make.
- Represent a word as a set of letters, which I'll implement as a sorted string, e.g.: 
      letterset("GLAM") == letterset("AMALGAM") == "AGLM".  
- Note: I could have used a `frozenset`, but strings have a more compact printed representation, making them easier to debug, and they take up less memory. I won't need any fancy `set` operations like union and intersection.
- Represent a honeycomb as a letterset of 7 letters, along with an indication of which one is the center. So the honeycomb in the image above would be represented by `('AEGLMPX', 'G')`.
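
To put rough numbers on why exhaustive search is feasible for Spelling Bee but not for Boggle, the arithmetic can be checked directly. This is a back-of-the-envelope sketch: the names `boggle_boards` and `all_honeycombs` are made up for illustration, and the honeycomb count ignores the pangram requirement, so it is only an upper bound on the number of valid honeycombs.

```python
from math import comb

# Ordered 5x5 Boggle boards: 26 choices for each of the 25 cells.
boggle_boards = 26 ** 25

# Unordered honeycombs: choose 7 of the 25 non-S letters,
# then choose which one of those 7 is the center.
all_honeycombs = comb(25, 7) * 7

print(f'{boggle_boards:.1e} possible Boggle boards')
print(f'{all_honeycombs:,} possible honeycombs (before requiring a pangram)')
```

The honeycomb count is in the millions, which a simple loop can cover, whereas the Boggle count is far beyond enumeration.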

# Words, Word Scores, Pangrams, and Lettersets

I'll start by loading some modules and defining four basic functions about words:

In [1]:
from itertools import combinations
from collections import Counter

In [2]:
def Words(text) -> set:
    """The set of all the valid space-separated words in a str."""
    return {w for w in text.upper().split() 
            if len(w) >= 4 and 'S' not in w and len(set(w)) <= 7}

def word_score(word) -> int: 
    """The points for this word, including bonus for pangram."""
    N = len(word)
    bonus = (7 if is_pangram(word) else 0)
    return (1 if N == 4 else N + bonus)

def is_pangram(word) -> bool: 
    """Does a word use all 7 letters (some maybe more than once)?"""
    return len(set(word)) == 7

def letterset(word) -> str:
    """The set of letters in a word, represented as a sorted str of letters."""
    return ''.join(sorted(set(word)))

I'll make a tiny word list to experiment with: 

In [3]:
words = Words('amalgam amalgamation game games gem glam maple megaplex pelagic I me')
words

{'AMALGAM', 'GAME', 'GLAM', 'MAPLE', 'MEGAPLEX', 'PELAGIC'}

Note that `I`, `me` and `gem` are too short, `games` contains an `s`, which is not allowed, and `amalgamation` has too many distinct letters. 

Here are examples of the functions in action:

In [4]:
{w: word_score(w) for w in words}

{'AMALGAM': 7, 'MEGAPLEX': 15, 'GLAM': 1, 'MAPLE': 5, 'GAME': 1, 'PELAGIC': 14}

In [5]:
{w for w in words if is_pangram(w)}

{'MEGAPLEX', 'PELAGIC'}

In [6]:
{w: letterset(w) for w in words}

{'AMALGAM': 'AGLM',
 'MEGAPLEX': 'AEGLMPX',
 'GLAM': 'AGLM',
 'MAPLE': 'AELMP',
 'GAME': 'AEGM',
 'PELAGIC': 'ACEGILP'}
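
As a point of comparison, the three rules from the puzzle statement can be checked directly against a single honeycomb in the `(letters, center)` representation. This is a minimal sketch: `can_make` is a hypothetical helper, not one of the four functions above, and it uses Python `set` objects for readability rather than the sorted-string lettersets.

```python
def can_make(letters, center, word) -> bool:
    """Can `word` be made from the honeycomb (letters, center)? (Hypothetical helper.)"""
    return (len(word) >= 4                  # rule 1: at least four letters
            and center in word              # rule 2: must include the central letter
            and set(word) <= set(letters))  # rule 3: no letters outside the honeycomb

honeycomb = ('AEGLMPX', 'G')
for w in ['GAME', 'AMALGAM', 'MEGAPLEX', 'MAPLE', 'PELAGIC', 'GEM']:
    print(w, can_make(*honeycomb, w))
```

Here `MAPLE` fails because it lacks the center `G`, `PELAGIC` fails because `C` and `I` are not in the honeycomb, and `GEM` is too short.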

# The enable1 Word List

Now I will load in the `enable1` word list and see what we have:

In [7]:
! [ -e enable1.txt ] || curl -O http://norvig.com/ngrams/enable1.txt
! wc -w enable1.txt

  172820 enable1.txt


In [8]:
enable1 = Words(open('enable1.txt').read())
len(enable1)

44585

In [9]:
pangrams = [w for w in enable1 if is_pangram(w)]
pangrams[:10] # Just sample some of them

['CRACKLIER',
 'TURBINE',
 'METHIONINE',
 'UPGAZING',
 'CUMBERED',
 'BREEZEWAY',
 'JAMBING',
 'PAPERBACK',
 'TRIPINNATE',
 'TUNICAE']

In [10]:
len(pangrams)

14741

So: we start with 172,820 words in the word list, reduce that to 44,585 valid words (the others are either shorter than 4 letters in length, or contain an 'S', or have more than 7 distinct letters), and discover that 14,741 of those words are pangrams. 
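
To see how the three filters in `Words` divide up the rejected words, one could tally a reason for each word. The sketch below runs on the tiny word list rather than on `enable1`; `rejection_reason` is a hypothetical helper, and because it checks the conditions in a fixed order, a word that fails several tests is counted only under the first one.

```python
from collections import Counter

def rejection_reason(word):
    """Why `Words` would drop this word, or None if it is valid. (Hypothetical helper.)"""
    w = word.upper()
    if len(w) < 4:
        return 'too short'
    if 'S' in w:
        return 'contains S'
    if len(set(w)) > 7:
        return 'more than 7 distinct letters'
    return None

text = 'amalgam amalgamation game games gem glam maple megaplex pelagic I me'
reasons = Counter(rejection_reason(w) for w in text.split())
print(reasons)
```

The same tally could be run over the full word list to get the breakdown behind the drop from 172,820 to 44,585 words.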

I'm also curious: what's the highest-scoring individual word?

In [11]:
w = max(enable1, key=word_score)
w, is_pangram(w), word_score(w)

('ANTITOTALITARIAN', True, 23)

#  Efficiency: Caching a Scoring Table

The goal is to find the honeycomb that maximizes the  `game_score`: the total score of all words that can be made with the honeycomb. I've chosen to go down the path of considering all 14,741 pangrams, and all 7 centers for each pangram, for a total of 103,187 candidate honeycombs. 
I'll make things more efficient by *caching* some important information after computing it once, so I don't need to recompute it 103,187 times.
- For each word, I'll precompute the `letterset` and the `word_score`.
- For each letterset, I'll add up the total `word_score` points (over all the words with that letterset).
- The function `scoring_table(words)` will return a table (dict) with this information:

In [12]:
def scoring_table(words) -> dict:
    """Return a dict of {letterset: sum_of_word_scores} over words."""
    table = Counter()
    for w in words:
        table[letterset(w)] += word_score(w)
    return table

In [13]:
scoring_table(words)

Counter({'AGLM': 8, 'AEGLMPX': 15, 'AELMP': 5, 'AEGM': 1, 'ACEGILP': 14})

Note that the letterset `'AGLM'` scores 8 points as the sum over two words: 7 for `'AMALGAM'` and 1 for `'GLAM'`. The other lettersets get their points from just one word.
The following calculation says that there are about twice as many words as lettersets: on average about two words have the same letterset.



In [14]:
len(enable1) / len(scoring_table(enable1))

2.058307557361156
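
To make the words-per-letterset ratio concrete, here is a small sketch that groups words by their letterset, again on the tiny word list. The function name `words_by_letterset` is an assumption made for illustration; `letterset` is repeated so the block runs on its own.

```python
from collections import defaultdict

def letterset(word) -> str:
    """The set of letters in a word, represented as a sorted str of letters."""
    return ''.join(sorted(set(word)))

def words_by_letterset(words) -> dict:
    """Group words under their shared letterset. (Hypothetical helper.)"""
    groups = defaultdict(list)
    for w in sorted(words):
        groups[letterset(w)].append(w)
    return dict(groups)

tiny = {'AMALGAM', 'GAME', 'GLAM', 'MAPLE', 'MEGAPLEX', 'PELAGIC'}
groups = words_by_letterset(tiny)
print(groups)
print(len(tiny) / len(groups))  # words per letterset on the tiny list
```

On the tiny list only `'AGLM'` is shared by two words, so the ratio is much lower than the roughly 2 seen for `enable1`.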

# Computing the Game Score

The brute force approach would be to take each of the 103,187 honeycombs, and for each honeycomb look at each of the 44,585 words and add up the word scores of the words that can be made by the honeycomb. That seems slow. I have an idea for a faster approach:

- For each honeycomb, generate every possible *subset* of the letters in the honeycomb. A subset must include the central letter, and it may or may not include each of the other 6 letters, so there are $2^6 = 64$ subsets. The function `letter_subsets(letters, center)` returns these.
- We already have letterset scores in the scoring table, so we can compute the `game_score` of a honeycomb just by fetching 64 entries in the scoring table and adding them up.
- 64 is less than 44,585, so that's a nice optimization!


In [15]:
def game_score(letters, center, table) -> int:
    """The total score for this honeycomb, given a scoring table."""
    subsets = letter_subsets(letters, center)
    return sum(table[s] for s in subsets)

def letter_subsets(letters, center) -> list:
    """All subsets of `letters` that contain the letter `center`."""
    return [letterset(subset) 
            for n in range(1, 8) 
            for subset in combinations(letters, n)
            if center in subset]

Trying out `letter_subsets`:

In [16]:
len(letter_subsets('ABCDEFG', 'C')) # It will always be 64, for any honeycomb

64

In [17]:
letter_subsets('ABCD', 'C') # A smaller example gives 2**3 = 8 subsets

['C', 'AC', 'BC', 'CD', 'ABC', 'ACD', 'BCD', 'ABCD']
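
An equivalent way to build the same subsets is to set the center aside and take every combination of the *other* letters, which mirrors the $2^6 = 64$ counting argument directly. This is only a sketch of an alternative formulation; `letter_subsets_alt` is a hypothetical name, and the notebook's own `letter_subsets` is what the rest of the code uses.

```python
from itertools import combinations

def letter_subsets_alt(letters, center) -> list:
    """All subsets of `letters` containing `center`, built as
    center + (any combination of the remaining letters)."""
    others = [c for c in letters if c != center]
    return [''.join(sorted((center,) + rest))
            for n in range(len(others) + 1)
            for rest in combinations(others, n)]

print(letter_subsets_alt('ABCD', 'C'))
print(len(letter_subsets_alt('ABCDEFG', 'C')))
```

This version never generates subsets that lack the center, so nothing has to be filtered out afterwards.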

Trying out `game_score`:

In [18]:
game_score('AEGLMPX', 'G', scoring_table(words)) 

24

In [19]:
game_score('AEGLMPX', 'G', scoring_table(enable1)) 

153

Let's choose some more common letters and see if we can score more:

In [20]:
game_score('ETANHRD', 'E', scoring_table(enable1)) 

2240
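
As a sanity check on the subset-lookup idea, the brute-force approach described at the start of this section can be written out and compared with the table-based score. The sketch below is self-contained and uses the tiny word list; `game_score_brute` and `game_score_fast` are hypothetical names, with `game_score_fast` restating the logic of `game_score` and `letter_subsets` in one function.

```python
from collections import Counter
from itertools import combinations

def letterset(word) -> str:
    return ''.join(sorted(set(word)))

def word_score(word) -> int:
    bonus = 7 if len(set(word)) == 7 else 0
    return 1 if len(word) == 4 else len(word) + bonus

def game_score_fast(letters, center, table) -> int:
    """Sum table entries over the subsets of `letters` that contain `center`."""
    return sum(table[letterset(s)]
               for n in range(1, 8)
               for s in combinations(letters, n) if center in s)

def game_score_brute(letters, center, words) -> int:
    """Scan every word and apply the honeycomb rules directly."""
    return sum(word_score(w) for w in words
               if center in w and set(w) <= set(letters))

words = {'AMALGAM', 'GAME', 'GLAM', 'MAPLE', 'MEGAPLEX', 'PELAGIC'}
table = Counter()
for w in words:
    table[letterset(w)] += word_score(w)

# The two methods should agree for every choice of center.
for center in 'AEGLMPX':
    fast = game_score_fast('AEGLMPX', center, table)
    assert fast == game_score_brute('AEGLMPX', center, words)
    print(center, fast)
```

The brute-force version touches every word for every honeycomb, while the table version does a fixed 64 lookups regardless of the size of the word list.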

# The Solution: The Best Honeycomb


Finally, here's the function that will give us the solution: `best_honeycomb` searches through every possible pangram and center and finds the combination that gives the highest game score:

In [21]:
def best_honeycomb(words) -> list: 
    """Return [score, letters, center] for the honeycomb with highest score on these words."""
    table = scoring_table(words)
    pangrams = {s for s in table if len(s) == 7}
    return max([game_score(pangram, center, table), pangram, center]
               for pangram in pangrams
               for center in pangram)

First the solution for the tiny `words` list:

In [22]:
best_honeycomb(words)

[29, 'AEGLMPX', 'M']

Now the solution for the problem that The Riddler posed, the big `enable1` word list:

In [23]:
%time best_honeycomb(enable1)

CPU times: user 4.28 s, sys: 11.2 ms, total: 4.29 s
Wall time: 4.3 s


[3898, 'AEGINRT', 'R']

**Wow. 3898** is a high score! And it took less than 5 seconds to find it.
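
One detail of `best_honeycomb` is that it iterates over the distinct 7-letter *lettersets* in the scoring table rather than over individual pangram words, so pangrams that share the same seven letters are tried only once. The sketch below illustrates that deduplication on four pangrams that appear in this notebook; it does not compute the corresponding count for `enable1`, and the variable names are made up for illustration.

```python
def letterset(word) -> str:
    return ''.join(sorted(set(word)))

# Four pangram words, three of which use the same seven letters.
pangram_words = ['AERATING', 'AGGREGATING', 'REINTEGRATING', 'MEGAPLEX']

pangram_lettersets = {letterset(w) for w in pangram_words}
candidates = [(letters, center)
              for letters in sorted(pangram_lettersets)
              for center in letters]

print(sorted(pangram_lettersets))
print(len(pangram_words) * 7, 'honeycombs if every pangram word is tried')
print(len(candidates), 'honeycombs after deduplicating lettersets')
```

Because of this, the number of honeycombs actually scored can be smaller than 7 times the number of pangram words.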

However, I'd like to see the actual words in addition to the score. If I had designed my program to be modular rather than to be efficient, that would be trivial. But as is, I need to define a new function, `scoring_words`, before I can create such a report:

In [24]:
def scoring_words(letters, center, words) -> set:
    """The set of words that this honeycomb can make."""
    subsets = letter_subsets(letters, center)
    return {w for w in words if letterset(w) in subsets}

def report(words):
    """Print stats and word scores for the best honeycomb on these words."""
    (score, letters, center) = best_honeycomb(words)
    sw = scoring_words(letters, center, words)
    top = max(sw, key=word_score)
    np = sum(map(is_pangram, sw))
    assert score == sum(map(word_score, sw))
    print(f'''
    The highest-scoring honeycomb for this list of {len(words)} words is:
        {letters} (center {center})
    It scores {score} points on {len(sw)} words with {np} pangrams*
    The top scoring word is {top} for {word_score(top)} points.\n''')
    printcolumns(4, 20, [f'{w} ({word_score(w)}){"*" if is_pangram(w) else " "}'
                         for w in sorted(sw)])

def printcolumns(cols, width, items):
    """Print items in designated columns of designated width."""
    for i, item in enumerate(items, 1):
        print(item.ljust(width), end='')
        if i % cols == 0: print()

In [25]:
report(enable1)


    The highest-scoring honeycomb for this list of 44585 words is:
        AEGINRT (center R)
    It scores 3898 points on 537 words with 50 pangrams*
    The top scoring word is REINTEGRATING for 20 points.

AERATE (6)          AERATING (15)*      AERIE (5)           AERIER (6)          
AGAR (1)            AGER (1)            AGGER (5)           AGGREGATE (9)       
AGGREGATING (18)*   AGINNER (7)         AGRARIAN (8)        AGREE (5)           
AGREEING (8)        AGRIA (5)           AIGRET (6)          AIGRETTE (8)        
AIRER (5)           AIRIER (6)          AIRING (6)          AIRN (1)            
AIRT (1)            AIRTING (7)         ANEAR (5)           ANEARING (8)        
ANERGIA (7)         ANGARIA (7)         ANGER (5)           ANGERING (8)        
ANGRIER (7)         ANTEATER (8)        ANTIAIR (7)         ANTIAR (6)          
ANTIARIN (8)        ANTRA (5)           ANTRE (5)           AREA (1)            
AREAE (5)           ARENA (5)           ARENITE (7)         A