# Codenames AI: Spymaster & Guesser

## Context
Build an AI that plays [Codenames](https://en.wikipedia.org/wiki/Codenames_(board_game)) â€” the word-association party game where a **Spymaster** gives one-word clues to help their team guess words on a 5Ã—5 grid, while avoiding opponent words and the assassin.

## Game Rules Summary
- **25 words** arranged in a 5Ã—5 grid
- **9 words** belong to the starting team (red), **8** to the other team (blue)
- **7 bystanders** (neutral), **1 assassin** (instant loss)
- **Spymaster** gives a clue: `(word, count)` â€” one word + how many board words it relates to
- **Guessers** pick board words; correct = continue, wrong team/bystander = end turn, assassin = lose
- Clue word **cannot be** any word currently on the board

## What You'll Build
1. **Board generation** from nltk word corpus with role assignment
2. **Embedding engine** using sentence-transformers (cosine similarity in embedding space)
3. **Guesser AI** â€” rank board words by cosine similarity to a given clue
4. **Spymaster AI** â€” greedy vectorized search: for each vocab word, try targeting top-1/2/3 team words by similarity, score with margin formula
5. **Game loop** â€” full turn-based play with state tracking
6. **tkinter GUI** â€” interactive board display
7. **(Bonus) Agglomerative clustering Spymaster** â€” explicit clustering before clue search

## Dependencies
```
nltk, sentence-transformers, scipy, numpy, tkinter (stdlib)
```

## Approach: Embeddings + Greedy Search
We use `all-MiniLM-L6-v2` to embed all words into a shared vector space. Cosine similarity in this space captures semantic relatedness. The Spymaster's key insight: **let each candidate clue word define its own target group** â€” sort team words by similarity to the clue, try top-1/2/3, and score with a margin-based loss. This is fully vectorized (no Python loops over the ~3000 candidate vocab) and avoids the need for explicit clustering.

In [24]:
import nltk
import random
import numpy as np
import tkinter as tk
from enum import Enum
from dataclasses import dataclass, field
from collections import Counter

nltk.download("words", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.corpus import words, wordnet

## Part 1: Board Generation

Generate a 25-word board from the nltk corpus. We need to:
1. Filter to **common, recognizable** words (the raw corpus has 235k words, most obscure)
2. Ensure all words have **WordNet synsets** (so similarity works)
3. Assign roles: 9 red, 8 blue, 7 bystander, 1 assassin

In [25]:
from functools import cache
import random
from random import randrange, shuffle


class CardRole(Enum):
    RED = "red"
    BLUE = "blue"
    BYSTANDER = "bystander"
    ASSASSIN = "assassin"


@dataclass
class Card:
    word: str
    role: CardRole
    revealed: bool = False


@cache
def get_candidate_words() -> list[str]:
    """
    - lowercase only
    - sample from the k most common
    - should be nouns?
    - no duplicates

    Returns: str
        sorted list of candidate words
    """
    seen = set()
    candidates = []
    for w in words.words():
        w = w.lower()
        if (
            w not in seen
            and w.isalpha()
            and w.islower()
            and 4 <= len(w) <= 8
            and len(wordnet.synsets(w)) > 5
        ):
            seen.add(w)
            candidates.append(w)
    return sorted(candidates)


def generate_board(seed=42) -> list[Card]:
    """
    25 words arranged in a 5x5 grid
        - I think it can just return a flat list here, can reorganize into grid if we need to

    9 words -> red
    8 words -> blue
    7 bystanders
    1 assassin
    """
    random.seed(seed)

    stack = (
        [CardRole.RED] * 9
        + [CardRole.BLUE] * 8
        + [CardRole.BYSTANDER] * 7
        + [CardRole.ASSASSIN]
    )
    shuffle(stack)

    word_list = get_candidate_words()

    cards = []
    while stack:
        word = word_list.pop(randrange(len(word_list)))
        card = Card(word=word, role=stack.pop(), revealed=False)
        cards.append(card)

    return cards


In [26]:
# Test Part 1
candidates = get_candidate_words()
print(f"Candidate pool: {len(candidates)} words")
assert len(candidates) > 1000, "Should have a large candidate pool"
assert all(w.isalpha() and w.islower() for w in candidates), (
    "All words should be lowercase alpha"
)
assert all(len(wordnet.synsets(w)) > 0 for w in candidates[:50]), (
    "All candidates should have synsets"
)
print(f"Sample: {candidates[:10]}")

board = generate_board(seed=42)
assert len(board) == 25, f"Board should have 25 cards, got {len(board)}"
role_counts = Counter(c.role for c in board)
assert role_counts[CardRole.RED] == 9, (
    f"Should have 9 red, got {role_counts[CardRole.RED]}"
)
assert role_counts[CardRole.BLUE] == 8, (
    f"Should have 8 blue, got {role_counts[CardRole.BLUE]}"
)
assert role_counts[CardRole.BYSTANDER] == 7, (
    f"Should have 7 bystanders, got {role_counts[CardRole.BYSTANDER]}"
)
assert role_counts[CardRole.ASSASSIN] == 1, (
    f"Should have 1 assassin, got {role_counts[CardRole.ASSASSIN]}"
)
assert len(set(c.word for c in board)) == 25, "All words should be unique"
assert all(not c.revealed for c in board), "No cards should be revealed initially"
print(f"\nBoard words: {[c.word for c in board]}")
print(f"Role distribution: {dict(role_counts)}")
print("âœ“ Part 1 passed")

Candidate pool: 3040 words
Sample: ['abandon', 'about', 'absolute', 'absorb', 'absorbed', 'abstract', 'abuse', 'accent', 'accept', 'accepted']

Board words: ['span', 'growed', 'admit', 'declined', 'uprose', 'permit', 'learning', 'grown', 'dead', 'fairer', 'laying', 'chopped', 'caught', 'modest', 'charm', 'maintain', 'lighter', 'stain', 'goes', 'betray', 'worry', 'purse', 'serious', 'conflict', 'mold']
Role distribution: {<CardRole.BYSTANDER: 'bystander'>: 7, <CardRole.RED: 'red'>: 9, <CardRole.ASSASSIN: 'assassin'>: 1, <CardRole.BLUE: 'blue'>: 8}
âœ“ Part 1 passed


## Part 2: Embedding Engine

The core of our Codenames AI is measuring **semantic similarity** between words using dense embeddings from `sentence-transformers`.

We precompute embeddings for:
1. All **25 board words**
2. A **vocabulary** of ~3000 candidate clue words (from the nltk corpus, filtered for quality)

Cosine similarity between embeddings gives us a fast, high-quality similarity measure. The embedding engine also precomputes the full similarity matrix between all vocab words and board words, so the Spymaster can search efficiently.

<details>
  <summary>ðŸ’¡ Key Insight: Why precompute?</summary>
  The Spymaster needs to score thousands of candidate clue words against all board words. By precomputing the vocabÃ—board similarity matrix once, we turn the search into fast numpy operations instead of repeated model calls.
</details>

In [27]:
from sentence_transformers import SentenceTransformer


class EmbeddingEngine:
    def __init__(
        self,
        board_words: list[str],
        vocab_words: list[str],
        model_name: str = "all-MiniLM-L6-v2",
    ):
        self.model = SentenceTransformer(model_name)
        self.board_words = board_words
        self.vocab_words = vocab_words

        self.board_word_to_idx: dict[str, int] = {
            w: i for i, w in enumerate(board_words)
        }
        self.vocab_word_to_idx: dict[str, int] = {
            w: i for i, w in enumerate(vocab_words)
        }

        self.board_embeddings = self.model.encode(
            board_words, normalize_embeddings=True
        )
        self.vocab_embeddings = self.model.encode(
            vocab_words, normalize_embeddings=True, batch_size=256
        )

        self.vocab_to_board_sims: np.ndarray = (
            self.vocab_embeddings @ self.board_embeddings.T
        )  # (nv, nb)

        self.board_to_board_sims: np.ndarray = (
            self.board_embeddings @ self.board_embeddings.T
        )

    def similarity(self, word1: str, word2: str) -> float:
        i, j = self.board_word_to_idx[word1], self.board_word_to_idx[word2]
        return float(self.board_to_board_sims[i, j])

    def clue_to_board_sims(self, clue_word: str) -> dict[str, float]:
        if clue_word in self.vocab_word_to_idx:
            clue_idx = self.vocab_word_to_idx[clue_word]
            sims = self.vocab_to_board_sims[clue_idx, :]  # (nb)
        else:
            # Dynamically encode and normalize this clue word
            clue_emb = self.model.encode([clue_word], normalize_embeddings=True)[
                0
            ]  # shape: (d,)
            sims = clue_emb @ self.board_embeddings.T  # (nb,)
        return {
            vocab_word: float(sim) for vocab_word, sim in zip(self.board_words, sims)
        }

    def get_vocab_sims_to_board_indices(self, board_indices: list[int]) -> np.ndarray:
        return self.vocab_to_board_sims[:, board_indices]


In [28]:
# Test Part 2
board = generate_board(seed=42)
board_words = [c.word for c in board]
vocab_words = [w for w in get_candidate_words() if w not in set(board_words)]

engine = EmbeddingEngine(board_words, vocab_words)

# Shape checks
assert engine.board_embeddings.shape == (25, 384), (
    f"Expected (25, 384), got {engine.board_embeddings.shape}"
)
assert engine.vocab_to_board_sims.shape == (len(vocab_words), 25)
assert engine.board_to_board_sims.shape == (25, 25)

# Self-similarity should be ~1.0
for w in board_words[:3]:
    assert engine.similarity(w, w) > 0.99, f"Self-similarity of '{w}' should be ~1.0"

# Similarity should be symmetric
w1, w2 = board_words[0], board_words[1]
assert abs(engine.similarity(w1, w2) - engine.similarity(w2, w1)) < 1e-6

# clue_to_board_sims should return dict with all board words
sims = engine.clue_to_board_sims(vocab_words[0])
assert isinstance(sims, dict)
assert len(sims) == 25
assert all(isinstance(v, float) for v in sims.values())

# Test on-the-fly encoding for a word not in vocab
sims_novel = engine.clue_to_board_sims("elephant")
assert len(sims_novel) == 25

print(f"Board: {board_words[:5]}...")
print(f"Vocab size: {len(vocab_words)}")
print(f"Embedding dim: {engine.board_embeddings.shape[1]}")
print(f"Sample similarity ({w1}, {w2}): {engine.similarity(w1, w2):.3f}")
print("âœ“ Part 2 passed")

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Board: ['sphere', 'hack', 'admitted', 'deform', 'vest']...
Vocab size: 2990
Embedding dim: 384
Sample similarity (sphere, hack): 0.165
âœ“ Part 2 passed


## Part 3: Game State

Track the full game state: whose turn it is, which cards are revealed, remaining words per team, and win/loss conditions.

In [29]:
class Team(Enum):
    RED = "red"
    BLUE = "blue"


@dataclass
class GuessResult:
    word: str
    role: CardRole
    correct: bool
    game_over: bool
    assassin: bool


@dataclass
class Clue:
    word: str
    count: int


class GameState:
    def __init__(self, board: list[Card], starting_team: Team = Team.RED):
        self.board = board
        self.current_team = starting_team
        self.game_over = False
        self.winner: Team | None = None
        self.turn_history: list[tuple[Team, Clue, list[GuessResult]]] = []

    def get_unrevealed(self, role: CardRole | None = None) -> list[Card]:
        return [
            c for c in self.board if not c.revealed and (role is None or c.role == role)
        ]

    def get_team_role(self, team: Team) -> CardRole:
        return CardRole.RED if team == Team.RED else CardRole.BLUE

    def get_opponent_role(self, team: Team) -> CardRole:
        return CardRole.BLUE if team == Team.RED else CardRole.RED

    def _check_win(self) -> Team | None:
        if not self.get_unrevealed(CardRole.RED):
            return Team.RED
        if not self.get_unrevealed(CardRole.BLUE):
            return Team.BLUE
        return None

    def make_guess(self, word: str) -> GuessResult:

        card = None
        for c in self.board:
            if c.word.lower() == word.lower():
                card = c
                break

        if card is None:
            raise ValueError(f"Word {word} is not found on the board")
        if card.revealed:
            raise ValueError(f"Word {word} is already revelealed")

        card.revealed = True
        team_role = self.get_team_role(self.current_team)

        if card.role == CardRole.ASSASSIN:
            self.game_over = True
            self.winner = Team.BLUE if self.current_team == Team.RED else Team.RED
            return GuessResult(
                word=card.word,
                role=card.role,
                correct=False,
                game_over=True,
                assassin=True,
            )

        correct = card.role == team_role

        winner = self._check_win()
        game_over = winner is not None
        if game_over:
            self.game_over = True
            self.winner = winner

        return GuessResult(
            word=word,
            role=card.role,
            correct=correct,
            game_over=game_over,
            assassin=False,
        )

    def end_turn(self):
        self.current_team = Team.BLUE if self.current_team == Team.RED else Team.RED

    def remaining_count(self, team: Team) -> int:
        role = self.get_team_role(team)
        return len(self.get_unrevealed(role))

In [30]:
# Test Part 3
board = generate_board(seed=42)
game = GameState(board)

# Initial state
assert not game.game_over
assert game.current_team == Team.RED
assert game.remaining_count(Team.RED) == 9
assert game.remaining_count(Team.BLUE) == 8
assert len(game.get_unrevealed()) == 25

# Find a red card and guess it
red_cards = game.get_unrevealed(CardRole.RED)
result = game.make_guess(red_cards[0].word)
assert result.correct, "Guessing own team's word should be correct"
assert not result.game_over, "Game shouldn't be over yet"
assert not result.assassin
assert game.remaining_count(Team.RED) == 8, "Should have one fewer red word"
assert len(game.get_unrevealed()) == 24

# Guess a bystander (should not be 'correct')
bystander_cards = game.get_unrevealed(CardRole.BYSTANDER)
result = game.make_guess(bystander_cards[0].word)
assert not result.correct, "Bystander should not be correct"
assert not result.game_over

# End turn
game.end_turn()
assert game.current_team == Team.BLUE

# Test assassin
assassin_card = game.get_unrevealed(CardRole.ASSASSIN)
if assassin_card:  # Should exist
    result = game.make_guess(assassin_card[0].word)
    assert result.assassin, "Should be assassin"
    assert result.game_over, "Game should be over"
    assert game.game_over
    assert game.winner == Team.RED, "Opponent of the team that hit assassin wins"

# Test ValueError for already-revealed card
try:
    game.make_guess(red_cards[0].word)
    assert False, "Should have raised ValueError"
except ValueError:
    pass

print("âœ“ Part 3 passed")

âœ“ Part 3 passed


## Part 4: Guesser AI

The Guesser is the simplest component and the fastest to implement. Given a clue `(word, count)`, rank all unrevealed board words by **cosine similarity** to the clue word embedding, then pick the top `count` that exceed a confidence threshold.

A smart guesser also considers:
- **Threshold**: don't guess if similarity is too low (better to stop than hit the assassin)
- **Gap check**: if the top candidate is only marginally better than the next, the clue might be ambiguous

In [31]:
class Guesser:
    """
    AI Guesser that interprets clues using embedding similarity.
    """

    def __init__(self, engine: EmbeddingEngine, min_threshold: float = 0.15):
        self.engine = engine
        self.min_threshold = min_threshold

    def rank_words(
        self, clue_word: str, unrevealed_words: list[str]
    ) -> list[tuple[str, float]]:
        """
        Rank unrevealed words by cosine similarity to the clue word.

        Use engine.clue_to_board_sims() to get similarities, then filter
        to only unrevealed words and sort descending.

        Args:
            clue_word: The spymaster's clue
            unrevealed_words: Words still face-down on the board

        Returns:
            List of (word, similarity) tuples, sorted descending by similarity
        """
        similarities = self.engine.clue_to_board_sims(clue_word)  # dict[str, float
        sims_filtered = [
            (board_word, sim)
            for board_word, sim in similarities.items()
            if board_word in unrevealed_words
        ]
        return sorted(sims_filtered, key=lambda x: x[1], reverse=True)

    def make_guesses(self, clue: Clue, unrevealed_words: list[str]) -> list[str]:
        """
        Decide which words to guess given a clue.

        Strategy:
        - Rank words by similarity to clue
        - Take up to clue.count words that exceed min_threshold
        - Stop early if similarity drops below threshold

        Args:
            clue: The Clue (word + count)
            unrevealed_words: Words still on the board

        Returns:
            Ordered list of words to guess
        """
        count = clue.count
        clue_word = clue.word
        words_guessed = 0
        guesses = []
        for clue_word, sim in self.rank_words(
            clue_word, unrevealed_words=unrevealed_words
        ):
            if words_guessed >= count:
                break

            if sim < self.min_threshold:
                break

            guesses.append(clue_word)
            words_guessed += 1

        return guesses  # might be [] if nothing is similar enough - need to tune min_threshold


In [32]:
# Test Part 4
guesser = Guesser(engine)

# Test ranking
unrevealed = list(engine.board_words)  # all 25 words, none revealed yet
ranking = guesser.rank_words("danger", unrevealed)
assert len(ranking) == len(unrevealed)
assert all(isinstance(r, tuple) and len(r) == 2 for r in ranking)
sims = [s for _, s in ranking]
assert sims == sorted(sims, reverse=True), "Should be sorted descending"
print("Rankings for clue 'danger':")
for word, sim in ranking[:5]:
    print(f"  {word}: {sim:.3f}")

# Test guessing
guesses = guesser.make_guesses(Clue("danger", 2), unrevealed)
assert len(guesses) <= 2, "Should not exceed clue count"
assert all(g in unrevealed for g in guesses), "Guesses must be from unrevealed words"
print(f"\nGuesses for 'danger 2': {guesses}")

# Test threshold: very obscure clue should produce fewer guesses
guesses_obscure = guesser.make_guesses(Clue("quasar", 3), unrevealed)
print(f"Guesses for 'quasar 3': {guesses_obscure} (may be fewer due to threshold)")

print("âœ“ Part 4 passed")

Rankings for clue 'danger':
  stayed: 0.318
  cheat: 0.317
  hack: 0.316
  marine: 0.308
  contact: 0.304

Guesses for 'danger 2': ['stayed', 'cheat']
Guesses for 'quasar 3': ['mouse', 'cinch', 'sphere'] (may be fewer due to threshold)
âœ“ Part 4 passed


## Part 5: Spymaster AI

The heart of the exercise. The Spymaster must find a clue word that:
1. **Connects** as many of its own team's words as possible (high similarity)
2. **Avoids** opponent words and especially the assassin (low similarity)
3. Is **not** any word currently on the board

### Scoring a Clue

For a candidate clue $c$ targeting a subset $T$ of team words, with opponent words $O$, assassin $a$, and bystanders $B$:

$$\text{score}(c, T) = \underbrace{\min_{t \in T} \text{sim}(c, t)}_{\text{weakest link}} - \lambda_o \cdot \underbrace{\max_{o \in O} \text{sim}(c, o)}_{\text{worst opponent}} - \lambda_a \cdot \underbrace{\text{sim}(c, a)}_{\text{assassin penalty}} - \lambda_b \cdot \underbrace{\max_{b \in B} \text{sim}(c, b)}_{\text{bystander penalty}}$$

### Greedy Algorithm (no explicit clustering needed)

The key insight: **let each candidate clue define its own grouping**. For each vocab word, sort the team words by similarity to it, then try targeting the top-1, top-2, top-3. The clue word itself picks the natural cluster.

1. Precompute `vocab_to_board_sims[:, team_indices]` â†’ shape `(V, n_team)`
2. For count in [3, 2, 1]: for each vocab word, the weakest-link sim is the count-th highest team sim
3. Subtract penalties (opponent, assassin, bystander) â€” all vectorized
4. Apply a count bonus to prefer ambitious clues
5. Pick the best `(vocab_word, count, targets)` triple

This is fully vectorized â€” no Python loops over vocab words.

<details>
  <summary>ðŸ’¡ Hint: Vectorized top-k weakest link</summary>
  
  `np.sort(team_sims, axis=1)[:, ::-1]` gives you team words sorted by sim for each vocab word.
  Column 0 = most similar team word, column 1 = 2nd most, etc.
  The weakest link for a count-k clue is column k-1.
</details>

In [None]:
@dataclass
class ScoredClue:
    word: str
    count: int  # Number of team words targeted
    score: float
    targets: list[str]  # Which team words this clue aims at
    target_sims: list[float]  # Similarity to each target
    max_opponent_sim: float  # Highest similarity to any opponent word
    assassin_sim: float  # Similarity to assassin


class Spymaster:
    """
    AI Spymaster that generates clues using embedding similarity.

    Greedy approach: for each vocab word, sort team words by similarity,
    try targeting top-1/2/3, score with margin formula, pick the best.
    """

    def __init__(
        self,
        engine: EmbeddingEngine,
        opponent_penalty: float = 1.5,
        assassin_penalty: float = 3.0,
        bystander_penalty: float = 0.5,
        count_bonus: float = 0.05,  # bonus per extra target word
    ):
        self.engine = engine
        self.opponent_penalty = opponent_penalty
        self.assassin_penalty = assassin_penalty
        self.bystander_penalty = bystander_penalty
        self.count_bonus = count_bonus

    def score_all_vocab(
        self,
        team_indices: list[int],
        opponent_indices: list[int],
        assassin_index: int,
        bystander_indices: list[int],
        count: int,
    ) -> tuple[np.ndarray, np.ndarray]:
        """
        Score all vocab words for a given target count. Fully vectorized.

        For each vocab word v, the "targets" are the `count` team words most
        similar to v. The weakest-link sim is the count-th highest similarity.

        score(v) = weakest_link_sim
                 - opponent_penalty * max_opponent_sim
                 - assassin_penalty * assassin_sim
                 - bystander_penalty * max_bystander_sim
                 + count_bonus * (count - 1)

        Args:
            team_indices: Board indices of team's unrevealed words
            opponent_indices: Board indices of opponent words
            assassin_index: Board index of assassin word
            bystander_indices: Board indices of bystander words
            count: How many team words to target (1, 2, or 3)

        Returns:
            (scores, weakest_link_sims): both shape (V,)
        """
        team_sims = self.engine.get_vocab_sims_to_board_indices(team_indices)  # (nv, 7)
        opponent_sims = self.engine.get_vocab_sims_to_board_indices(
            opponent_indices
        )  # (nv, 8)
        bystander_sims = self.engine.get_vocab_sims_to_board_indices(bystander_indices)
        assassin_sim = self.engine.get_vocab_sims_to_board_indices([assassin_index])[
            0
        ]  # (nv, 1)

        # improvement: could just grab the top(count) similarities, wouldnt need to save nv
        weakest_team_sim = team_sims.min(dim=1)  # nv
        strongest_opponent_sim = opponent_sims.max(dim=1)  # nv
        strongest_bystander_sim = bystander_sims.max(dim=1)  # nv

        scores = (
            weakest_team_sim
            - self.opponent_penalty * strongest_opponent_sim
            - self.bystander_penalty * strongest_bystander_sim
            - self.assassin_penalty * assassin_sim
        )

        return scores, weakest_team_sim

    def generate_clue(self, game: GameState, team: Team) -> ScoredClue:
        """
        Generate the best clue for the given team.

        Strategy:
        1. Get board indices for team, opponent, assassin, bystander words
        2. For count in [3, 2, 1]: score all vocab words via score_all_vocab()
        3. Track the best (vocab_word, count, score) across all counts
        4. Build the ScoredClue with the winning vocab word's target info

        Args:
            game: Current game state
            team: Which team the spymaster plays for

        Returns:
            Best ScoredClue found
        """
        team_role = game.get_team_role(team)
        opp_role = game.get_opponent_role(team)

        team_cards = game.get_unrevealed(team_role)
        opp_cards = game.get_unrevealed(opp_role)
        assassin_cards = game.get_unrevealed(CardRole.ASSASSIN)
        bystander_cards = game.get_unrevealed(CardRole.BYSTANDER)

        team_indices = [self.engine.board_word_to_idx[c.word] for c in team_cards]
        opp_indices = [self.engine.board_word_to_idx[c.word] for c in opp_cards]
        assassin_indices = [
            self.engine.board_word_to_idx[assassin_cards[0].word]
            if assassin_cards
            else 0
        ]
        bystander_indices = [
            self.engine.board_word_to_idx[c.word] for c in bystander_cards
        ]

        best_score = float("-inf")
        best_vocab_idx = 0
        best_count = 1

        max_count = min(3, len(team_indices))
        for count in range(max_count, 0, -1):
            

In [None]:
# Test Part 5
spymaster = Spymaster(engine)

board = generate_board(seed=42)
game = GameState(board)
team_cards = game.get_unrevealed(CardRole.RED)
team_indices = [engine.board_word_to_idx[c.word] for c in team_cards]
opp_cards = game.get_unrevealed(CardRole.BLUE)
assassin_cards = game.get_unrevealed(CardRole.ASSASSIN)
bystander_cards = game.get_unrevealed(CardRole.BYSTANDER)
opp_indices = [engine.board_word_to_idx[c.word] for c in opp_cards]
assassin_index = engine.board_word_to_idx[assassin_cards[0].word]
bystander_indices = [engine.board_word_to_idx[c.word] for c in bystander_cards]

print(f"Team words: {[c.word for c in team_cards]}")
print(f"Opponent words: {[c.word for c in opp_cards]}")
print(f"Assassin: {assassin_cards[0].word}")

# Test score_all_vocab for count=2
scores, weakest_sims = spymaster.score_all_vocab(
    team_indices, opp_indices, assassin_index, bystander_indices, count=2
)
assert scores.shape == (len(engine.vocab_words),), (
    f"Expected ({len(engine.vocab_words)},), got {scores.shape}"
)
assert weakest_sims.shape == scores.shape
best_idx = np.argmax(scores)
print(
    f"\nBest 2-word clue: '{engine.vocab_words[best_idx]}' (score={scores[best_idx]:.3f}, weakest_link={weakest_sims[best_idx]:.3f})"
)

# Test full clue generation
clue = spymaster.generate_clue(game, Team.RED)
assert isinstance(clue, ScoredClue)
assert clue.count >= 1
assert clue.count == len(clue.targets)
assert clue.word.lower() not in {c.word for c in board}, "Clue must not be a board word"
print(f"\nGenerated clue: '{clue.word}' for {clue.count}")
print(f"Targets: {clue.targets}")
print(f"Target similarities: {[f'{s:.3f}' for s in clue.target_sims]}")
print(f"Max opponent sim: {clue.max_opponent_sim:.3f}")
print(f"Assassin sim: {clue.assassin_sim:.3f}")
print(f"Score: {clue.score:.3f}")
print("âœ“ Part 5 passed")

## Part 6: Game Loop

Wire everything together into a full self-play game: Red Spymaster â†’ Red Guesser â†’ Blue Spymaster â†’ Blue Guesser â†’ ...

Note: the `EmbeddingEngine` is created once per game (since it depends on the board words and is expensive to initialize). The `Spymaster` and `Guesser` both use the same engine, so they share the same embedding space â€” meaning the Spymaster can perfectly predict what the Guesser will think.

In [None]:
def play_game(seed: int = 42, verbose: bool = True) -> GameState:
    """
    Run a full game of Codenames with AI players.

    Setup:
    1. Generate board, create EmbeddingEngine, Spymaster, Guesser

    Each turn:
    1. Spymaster generates a clue for current team
    2. Guesser makes guesses one at a time (up to clue.count)
    3. Stop guessing on wrong guess or after clue.count correct guesses
    4. Record turn in game.turn_history as (team, clue_obj, [results])
    5. Switch teams

    Args:
        seed: Random seed for board generation
        verbose: Print play-by-play

    Returns:
        Final GameState
    """
    # TODO: implement
    pass

In [None]:
# Test Part 6
game = play_game(seed=42, verbose=True)
assert game.game_over, "Game should have ended"
assert game.winner is not None, "There should be a winner"
assert len(game.turn_history) > 0, "Should have at least one turn"
print(f"\nGame ended after {len(game.turn_history)} turns")
print(f"Winner: {game.winner.value}")
print("âœ“ Part 6 passed")

## Part 7: tkinter GUI

Build an interactive board display. The Spymaster view shows all roles (color-coded); the Guesser view shows only revealed cards.

Requirements:
- 5x5 grid of buttons, each showing a word
- Color scheme: Red team = red, Blue team = blue, Bystander = beige, Assassin = black
- Unrevealed cards are gray (guesser view) or lightly tinted (spymaster view)
- Clicking a card reveals it and processes the guess
- Status bar showing: current team, current clue, remaining counts
- "AI Turn" button that runs a full spymasterâ†’guesser turn

In [None]:
# Color scheme
COLORS = {
    CardRole.RED: {"bg": "#e74c3c", "fg": "white", "unrevealed": "#f5b7b1"},
    CardRole.BLUE: {"bg": "#3498db", "fg": "white", "unrevealed": "#aed6f1"},
    CardRole.BYSTANDER: {"bg": "#f5deb3", "fg": "black", "unrevealed": "#fef9e7"},
    CardRole.ASSASSIN: {"bg": "#2c3e50", "fg": "white", "unrevealed": "#d5d8dc"},
}
HIDDEN_COLOR = {"bg": "#ecf0f1", "fg": "black"}


class CodenamesGUI:
    """
    tkinter GUI for Codenames.

    Supports two views:
    - Spymaster view: shows all card roles (lightly tinted)
    - Guesser view: only shows revealed cards
    """

    def __init__(
        self,
        game: GameState,
        spymaster: Spymaster,
        guesser: Guesser,
        spymaster_view: bool = False,
    ):
        self.game = game
        self.spymaster = spymaster
        self.guesser = guesser
        self.spymaster_view = spymaster_view
        self.current_clue: ScoredClue | None = None
        self.guesses_remaining = 0

        self.root = tk.Tk()
        self.root.title("Codenames AI")
        self.buttons: list[tk.Button] = []

        self._build_ui()

    def _build_ui(self):
        """
        Build the GUI layout:
        - Status label at top (current team, clue, remaining counts)
        - 5x5 grid of card buttons
        - Control buttons at bottom (New Clue, AI Guess, Toggle View)
        """
        # TODO: implement
        pass

    def _get_card_colors(self, card: Card) -> dict:
        """
        Get background and foreground colors for a card.

        - If revealed: use the full role color
        - If spymaster_view: use the tinted/unrevealed color
        - Otherwise: use HIDDEN_COLOR (gray)

        Returns:
            dict with 'bg' and 'fg' keys
        """
        # TODO: implement
        pass

    def _update_board(self):
        """Refresh all button colors and the status label."""
        # TODO: implement
        pass

    def _on_card_click(self, index: int):
        """Handle a card click (human guesser mode)."""
        # TODO: implement
        pass

    def _generate_clue(self):
        """Have the AI Spymaster generate a clue. Display in status bar."""
        # TODO: implement
        pass

    def _ai_guess(self):
        """Have the AI Guesser make one guess based on the current clue."""
        # TODO: implement
        pass

    def run(self):
        """Start the GUI event loop."""
        self._generate_clue()
        self.root.mainloop()

In [None]:
# Test Part 7 â€” launch the GUI
# (This opens a window â€” close it to continue)
board = generate_board(seed=123)
board_words_gui = [c.word for c in board]
vocab_words_gui = [w for w in get_candidate_words() if w not in set(board_words_gui)]
engine_gui = EmbeddingEngine(board_words_gui, vocab_words_gui)
spymaster_gui = Spymaster(engine_gui)
guesser_gui = Guesser(engine_gui)
game_gui = GameState(board)

gui = CodenamesGUI(game_gui, spymaster_gui, guesser_gui, spymaster_view=True)
gui.run()

## Part 8 (Bonus): Evaluate Your AI

Run N self-play games and measure performance. This connects directly to Vals AI's work â€” **evaluating AI systems** on structured tasks with clear success criteria.

Metrics:
- **Win rate** per team (Red has advantage with 9 vs 8 words)
- **Average turns to win**
- **Assassin hit rate** (how often does the AI accidentally lose?)
- **Clue efficiency** = team words guessed / clue count (how many targets actually get guessed?)

In [None]:
@dataclass
class GameStats:
    winner: Team
    num_turns: int
    assassin_hit: bool
    red_remaining: int
    blue_remaining: int


def evaluate_ai(n_games: int = 20) -> list[GameStats]:
    """
    Run n_games of self-play and collect statistics.

    Args:
        n_games: Number of games to simulate

    Returns:
        List of GameStats for each game
    """
    # TODO: implement
    pass


def print_evaluation_report(stats: list[GameStats]):
    """
    Print a summary report of AI performance.

    Include: win rates, avg turns, assassin rate, etc.
    """
    # TODO: implement
    pass

In [None]:
# Test Part 8
stats = evaluate_ai(n_games=10)
assert len(stats) == 10
assert all(isinstance(s, GameStats) for s in stats)
print_evaluation_report(stats)
print("âœ“ Part 8 passed")

## Interview Preparation Notes

### What They're Likely Testing
1. **Software design**: Clean abstractions (GameState, EmbeddingEngine, Spymaster, Guesser). Can you decompose a problem into well-defined components?
2. **NLP fundamentals**: Word embeddings, cosine similarity, why dense representations beat taxonomy-based similarity
3. **Algorithm design**: The clue scoring function is an optimization problem. Clustering + centroid search + margin scoring.
4. **Vectorized computation**: Can you avoid Python loops and use numpy for performance?
5. **GUI basics**: Can you wire up tkinter quickly? Layout, event handling, state updates

### Build Order (Prioritized by Interview Impact)
1. **Data model + Board gen** (5 min): Card, GameState, enums â€” already done (Parts 1, 3)
2. **Embedding engine** (5 min): Load model, embed board + vocab, precompute similarity matrix (Part 2)
3. **Guesser** (5 min): Rank by cosine sim, threshold â€” trivial with embeddings (Part 4)
4. **Spymaster** (20 min): Clustering + vectorized scoring + clue search â€” this is the hard part (Part 5)
5. **Game loop** (5 min): Wire it together (Part 6)
6. **GUI** (10 min): 5x5 grid, click to reveal (Part 7)
7. **Evaluation** (5 min): Self-play stats (Part 8)

### Key Design Decisions
- **Shared embedding space**: Spymaster and Guesser use the same EmbeddingEngine, so the Spymaster can perfectly simulate the Guesser's behavior
- **Precomputed similarity matrix**: vocab_to_board_sims is (V, 25), enabling fully vectorized scoring
- **Agglomerative clustering**: Natural fit because we don't know k in advance, and the dendrogram gives us all possible groupings for free
- **3-word clue preference**: Multiplying score by count bonus incentivizes ambitious but safe multi-word clues

### Common Pitfalls
- **Board words in clue vocabulary**: Must filter out all 25 board words from the candidate vocab
- **Single-word teams**: When only 1-2 team words remain, clustering degenerates â€” handle gracefully
- **Assassin proximity**: A clue can have great team similarity but be close to the assassin. The penalty must be high enough to prevent this
- **Embedding model choice**: sentence-transformers models are trained on sentences, not single words. Performance on single words is decent but not optimal â€” acknowledge this tradeoff

### Extensions If Asked
- **ASE + GMM clustering**: Adjacency spectral embedding on the similarity graph + Gaussian mixture model with BIC for model selection. More principled than agglomerative but doesn't naturally account for "bad" words
- **Context-aware affinity**: Modify the clustering affinity to penalize pairs of team words that are both near bad words
- **Opponent modeling**: Consider what clues the opponent might give, and prioritize blocking
- **Multi-turn planning**: Consider future turns when choosing between safe 1-word vs risky 3-word clues

### Connection to Vals AI
- **Evaluation**: Part 8 is literally what Vals does â€” run AI agents on structured tasks and measure performance
- **Tool-calling parallel**: The Spymaster/Guesser interaction mirrors agent-tool patterns (plan â†’ execute â†’ observe)
- **Embedding-based retrieval**: The Spymaster's clue search is nearest-neighbor retrieval with constraints â€” same pattern as RAG

## Part 9 (Bonus): Agglomerative Clustering Spymaster

The greedy approach (Part 5) lets each clue word define its own targets. An alternative: **explicitly cluster** team words first, then search for clues per cluster. This is more principled when you want to reason about the structure of your team words before searching.

### Algorithm
1. Compute pairwise cosine distances between team word embeddings
2. Run agglomerative clustering (scipy `linkage` + `fcluster`) for k=2..5
3. For each cluster at each k, use `score_all_vocab()` from Part 5 to find the best clue
4. Compare best clues across all clusterings and pick the winner

### When this helps
- When the greedy approach gives 1-word clues because no single vocab word connects multiple team words well, but there ARE natural clusters that a more targeted search could exploit
- When you want to **visualize** the team word structure (dendrogram) for debugging or explainability

In [None]:
from scipy.cluster.hierarchy import linkage, fcluster
from collections import defaultdict


class ClusteringSpymaster(Spymaster):
    """
    Extended Spymaster that uses agglomerative clustering to find
    natural groups among team words before searching for clues.

    Inherits score_all_vocab() from Spymaster (greedy).
    Adds cluster_team_words() and overrides generate_clue().
    """

    def cluster_team_words(
        self, team_indices: list[int], max_k: int = 5
    ) -> dict[int, list[list[int]]]:
        """
        Cluster team words using agglomerative clustering on embeddings.

        Args:
            team_indices: Board indices of team's unrevealed words
            max_k: Maximum number of clusters to try

        Returns:
            dict mapping k -> list of clusters (each cluster is a list of board indices)
        """
        # TODO: implement
        # Hints:
        # - embeddings = self.engine.board_embeddings[team_indices]
        # - Z = linkage(embeddings, method='average', metric='cosine')
        # - For each k: labels = fcluster(Z, t=k, criterion='maxclust')
        # - Group team_indices by label
        pass

    def generate_clue(self, game: GameState, team: Team) -> ScoredClue:
        """
        Generate best clue using clustering + greedy hybrid.

        1. Cluster team words for k=2..max_k
        2. For each cluster, run score_all_vocab with count=len(cluster)
        3. Also run the greedy approach (super().generate_clue) as fallback
        4. Return the best clue across all approaches
        """
        # TODO: implement
        pass

In [None]:
# Test Part 9
board = generate_board(seed=42)
game = GameState(board)
clustering_spy = ClusteringSpymaster(engine)

team_cards = game.get_unrevealed(CardRole.RED)
team_indices = [engine.board_word_to_idx[c.word] for c in team_cards]

# Test clustering
clusterings = clustering_spy.cluster_team_words(team_indices)
assert isinstance(clusterings, dict)
for k, clusters in clusterings.items():
    all_indices = [idx for cluster in clusters for idx in cluster]
    assert sorted(all_indices) == sorted(team_indices), f"k={k}: missing indices"
    print(f"k={k}: {[[engine.board_words[i] for i in c] for c in clusters]}")

# Test clue generation
clue = clustering_spy.generate_clue(game, Team.RED)
assert isinstance(clue, ScoredClue)
print(f"\nClustering clue: '{clue.word}' for {clue.count}")
print(f"Targets: {clue.targets}")
print(f"Score: {clue.score:.3f}")
print("âœ“ Part 9 passed")