# Cryptic Crossword Tool Use Agent

Cryptic crosswords are crosswords based on wordplay. 

Clues generally consist of two parts. There will be a *definition* of the answer, which always occurs at the start or end of the clue. The remainder of the clue is a *word puzzle* representing the answer. 

A common element in the word puzzle is a *charade* - a mini-clue that represents part of the answer. Another is an *indicator* - a term representing the type of wordplay being used. Some examples:

#### **Hard time for Olympian goddess (4).**

In this clue the *definition* is "Olympian goddess" (at the end of the clue). 

The *word puzzle* is "Hard time". 

The word puzzle is a set of two *charades* - "Hard" represents H, and "time" represents ERA. The answer is therefore HERA - the Olympian goddess.

#### **Sign of venomous creature losing tail (7).**

In this clue the *definition* is "sign" (at the start of the clue).

The *charade* is "venomous creature" - SCORPION. The *indicator* is "losing tail" - indicating that the charade loses its final letter. The answer is SCORPIO - the (star) sign. 
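
The two worked clues above can be written out as plain string operations. This is only an illustrative sketch: the helper name `drop_last_letter` is made up here and is not part of the agent.

```python
def drop_last_letter(word: str) -> str:
    """Apply a 'losing tail' style indicator: remove the final letter."""
    return word[:-1]

# "Hard time for Olympian goddess (4)": two charades joined end to end.
hera = "H" + "ERA"

# "Sign of venomous creature losing tail (7)": a charade plus a deletion indicator.
scorpio = drop_last_letter("SCORPION")

print(hera, scorpio)  # HERA SCORPIO
```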

Charades, indicators and definitions are frequently reused in cryptic clues (e.g. "worker" usually means ANT or BEE). To solve a cryptic crossword it is important to be familiar with these recurring elements. My idea was to give an Agent tools to search a database of previous charades, indicators and definitions, to better solve cryptic clues. 

### 1: Set up dspy

I used dspy for easy implementation of the ReAct agent framework. I used gpt-4.1-mini as my model, with a temperature of 0.

In [53]:
import dspy
import os

lm = dspy.LM(
    'openai/gpt-4.1-mini',
    temperature = 0.0,
    max_tokens = 8000
)

dspy.configure(lm = lm)

### 2: Make evaluation set

My evaluation set is the Times crossword number 29263, published on 23/06/2024 (from https://timesforthetimes.co.uk/). I set up the clues and answers as dspy examples. 

My main validation metric is a case-insensitive check that the predicted answer matches the actual answer. (All clues have one-word answers, so we don't need to worry about spaces here.)

My second validation metric checks that the answer provided is the correct length. 

In [38]:

clues = [
    ('Flightless bird is back, by the sound of it (4)', 'rhea'),
    ('Flowering shrub sadly beginning to age — if such (7)', 'fuchsia'),
    ('Fellow from South Africa dividing current account (5)', 'isaac'),
    ('Two deer, one running, formed an obstacle (8)', 'hindered'),
    # Double quotes keep the apostrophe; 'a''b' is concatenated to 'ab' in Python
    ("Impulsive commander's assistant taken in by chart (6)", 'madcap'),
    ("Ultimately serious row about politician's plot (9)", 'storyline'),
    ('Rise of gunge in eastern compound (7)', 'nitride'),
    ('Light fitting key composer installed extremely rarely to begin with (10)', 'chandelier'),
    ('Rapt figure cooked bitter comestible (10)', 'grapefruit'),
    ('Ancient Palestinian artisan running around before noon (9)', 'samaritan'),
    ("Live on farming land? That's possible to endure (8)", 'bearable'),
    ('Couple originally reproduced in boat illustrations (7)', 'artwork'), 
    ('Pull cover off adolescent pamphlet (7)', 'attract'),
    ('Electrician leaves vehicle under end of overpass (6)', 'sparks'),
    ('Outcast from former French island (5)', 'exile'),
    ('Hard time for Olympian goddess (4)', 'hera')
]

def make_example(clue, answer):
    example = dspy.Example(clue = clue, answer_chars = len(answer), answer = answer)
    example = example.with_inputs('clue')
    return example

cryptic_eval = [make_example(clue, answer) for clue, answer in clues]

def validate_answer(example, pred, trace = None):
    return example.answer.lower() == pred.answer.lower()

def validate_answer_length(example, pred, trace = None):
    return len(example.answer) == len(pred.answer)
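
If the evaluation set were later extended to multi-word or hyphenated answers, the exact-match metric would need to ignore spaces and punctuation. The sketch below shows one way that might look. `normalise` and `validate_answer_normalised` are hypothetical names, and `SimpleNamespace` stands in for the dspy example and prediction objects so the snippet runs on its own.

```python
import re
from types import SimpleNamespace


def normalise(answer: str) -> str:
    """Lowercase and drop anything that is not a letter (spaces, hyphens, ...)."""
    return re.sub(r"[^a-z]", "", answer.lower())


def validate_answer_normalised(example, pred, trace=None):
    # Same signature as validate_answer, but compares normalised strings.
    return normalise(example.answer) == normalise(pred.answer)


# Stand-ins for a dspy.Example and a prediction (assumption for illustration).
example = SimpleNamespace(answer="hera")
pred = SimpleNamespace(answer=" Hera ")
print(validate_answer_normalised(example, pred))  # True
```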

### 3: Baseline Chain-of-Thought model

This is a default dspy chain-of-thought signature to see how the model does without tools. It scores 43.75%. 

In [3]:
single_turn_model = dspy.ChainOfThought('clue: str -> answer: str')

In [4]:
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset        = cryptic_eval,       
    metric        = validate_answer,          
    num_threads   = 8,                    
    display_progress = True
)

single_turn_score = evaluator(single_turn_model)

print(single_turn_score)



Average Metric: 7.00 / 16 (43.8%): 100%|██████████| 16/16 [00:00<00:00, 1341.18it/s]

2025/06/24 15:52:57 INFO dspy.evaluate.evaluate: Average Metric: 7 / 16 (43.8%)



43.75


### 4: Tools

I provided three tools to search databases of definitions, charades and indicators. The datasets are from https://cryptics.georgeho.org.

I used a dspy ReAct agent with these three tools. 

In [44]:
import polars as pl

definitions_df = pl.read_csv('data/clues.csv').select('definition', 'answer')
definitions_df = definitions_df.with_columns(pl.all().str.to_lowercase()).unique().drop_nulls()

charades_df = pl.read_csv('data/charades.csv').select('charade', 'answer')
charades_df = charades_df.with_columns(pl.all().str.to_lowercase()).unique().drop_nulls()

indicators_df = pl.read_csv('data/indicators.csv').select(['indicator', 'wordplay'])
indicators_df = indicators_df.with_columns(pl.all().str.to_lowercase()).unique().drop_nulls()

def search_charades(charade: str) -> list[str]:
    """Cryptic clues often include charades - part of the clue that represents part of the answer. 
    This function searches a database of charades from previous cryptic crosswords.
    
    Args:
        charade (str): The charade to search for - this should be part of the clue.

    Returns:
        list[str]: A list of possible answers to the charade. 

    Examples:
        "worker" -> ["ant", "bee", ...]
    """
    results = charades_df.filter(
        pl.col('charade') == charade.lower()
    )
    
    return results['answer'].to_list()

def search_definitions(definition: str, n: int) -> list[str]:
    """Searches for previous cryptic answers of length n matching the given definition.
    
    Args:
        definition (str): The definition from the clue.
        n (int): The length of the answer.

    Returns:
        list[str]: A list of previous cryptic answers matching this definition.

    Examples:
        ("success", 7) -> ["triumph", ...]
    """
    results = definitions_df.filter(
        pl.col('definition') == definition.lower(), 
        pl.col('answer').str.len_chars() == n
    )
    
    # If the full phrase has no matches, fall back to searching each word separately
    if len(results) == 0:
        res = []
        for word in definition.split(' '):
            word_results = definitions_df.filter(
                pl.col('definition') == word.lower(), 
                pl.col('answer').str.len_chars() == n
            )
            res += word_results['answer'].to_list()
        return res
    else:
        return results['answer'].to_list()

def search_indicators(indicator: str) -> list[str]:
    """Cryptic clues often include indicators - terms that indicate the wordplay being used. 
    This function searches a database of indicators from previous cryptic crosswords.
    
    Args:
        indicator (str): The indicator to search for - this should be part of the clue.

    Returns:
        list[str]: A list of wordplay types this indicator has signalled in previous clues. 

    Examples:
        "drunk" -> ["anagram"]
    """
    results = indicators_df.filter(
        pl.col('indicator') == indicator.lower()
    )
    
    return results['wordplay'].to_list()

agent = dspy.ReAct(
    'clue: str -> answer: str',
    tools = [
        search_charades,
        search_definitions,
        search_indicators
    ]
)

### 5: Evaluation

Initially the tools made the accuracy go down! 

I made two improvements to the definition search tool. One was to allow the model to specify the length of the answer - this narrowed things down and seemed to help. I also adjusted the definition search tool to return results for the individual words in the definition query if nothing was returned for the definition query itself. After these improvements, the ReAct agent performs exactly as well as the baseline Chain-of-Thought model. 

Possible improvements are to expand the range of tools available (e.g. for anagrams and other common wordplay tricks), and to improve the tools themselves (possibly they should allow fuzzy matching based on embeddings or string similarity). 
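
As a rough sketch of the string-similarity idea, the standard library's `difflib.get_close_matches` can map a slightly different query onto a known definition before looking up answers. The `toy_definitions` dictionary and the `fuzzy_search_definitions` function are hypothetical stand-ins for the real definitions table and tool, and the 0.8 cutoff is an arbitrary choice.

```python
from difflib import get_close_matches

# Toy stand-in for the definitions table (hypothetical rows, not from the dataset).
toy_definitions = {
    "flowering shrub": ["fuchsia"],
    "olympian goddess": ["hera"],
    "venomous creature": ["scorpion"],
}


def fuzzy_search_definitions(definition: str, n: int, cutoff: float = 0.8) -> list[str]:
    """Return answers of length n whose stored definition is close to the query."""
    # Find up to three stored definitions similar enough to the query.
    keys = get_close_matches(
        definition.lower(), list(toy_definitions), n=3, cutoff=cutoff
    )
    return [
        answer
        for key in keys
        for answer in toy_definitions[key]
        if len(answer) == n
    ]


# The plural query has no exact match but is close to "flowering shrub".
print(fuzzy_search_definitions("flowering shrubs", 7))  # ['fuchsia']
```

An embedding-based search would follow the same shape, replacing the string-similarity step with a nearest-neighbour lookup.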

My other validation check confirms that the model is good at providing an answer of the correct length. 

In [54]:
evaluation = []
for i, x in enumerate(cryptic_eval):
    print(f'{i}/{len(cryptic_eval)}')
    pred = agent(**x.inputs())
    is_correct = validate_answer(x, pred)
    is_correct_length = validate_answer_length(x, pred)
    evaluation.append({'answer': pred.answer, 
                       'actual_answer': x.answer, 
                       'correct': is_correct, 
                       'correct_length': is_correct_length, 
                       'trajectory': pred.trajectory,
                       'reasoning': pred.reasoning})

react_score = sum([pred['correct'] for pred in evaluation]) / len(evaluation)

print(react_score)

0/16
1/16
2/16
3/16
4/16
5/16
6/16
7/16
8/16
9/16
10/16
11/16
12/16
13/16
14/16
15/16
0.4375


In [37]:
print(sum([pred['correct_length'] for pred in evaluation]) / len(evaluation))

1.0


### 6: Inspecting the Trace

Inspecting the trace for the ReAct agent shows that it is using the tools sensibly. 

Here it uses the indicator search to determine that "sadly" is an anagram indicator. It then makes an error: it does not realise that the anagram is of A (the "beginning to age") plus IF SUCH, and only identifies that the answer must contain A, I and S. 

It is then let down by the definition search, which does not return FUCHSIA as a possible answer for "flowering shrub" but does return GENISTA (which contains A, I and S). 

It is likely the performance here could be improved by fine-tuning on the cryptic clue dataset, and by improving the tools available. 
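
The anagram the agent missed is easy to verify mechanically, which suggests what an anagram-checking tool might look like. This is a minimal sketch; `letters` and `is_anagram` are hypothetical helpers, not tools the agent had.

```python
from collections import Counter


def letters(text: str) -> Counter:
    """Count the letters in text, ignoring case, spaces and punctuation."""
    return Counter(c for c in text.lower() if c.isalpha())


def is_anagram(candidate: str, fodder: str) -> bool:
    """True if candidate uses exactly the letters of fodder."""
    return letters(candidate) == letters(fodder)


# "beginning to age" gives A; the rest of the fodder is "if such".
fodder = "a" + "if such"

print(is_anagram("fuchsia", fodder))  # True
print(is_anagram("genista", fodder))  # False
```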

In [61]:
evaluation[1]

{'answer': 'genista',
 'actual_answer': 'fuchsia',
 'correct': False,
 'correct_length': True,
 'trajectory': {'thought_0': 'The clue is "Flowering shrub sadly beginning to age — if such (7)". The definition is likely "Flowering shrub" and the wordplay involves "sadly beginning to age — if such". The enumeration is (7), so the answer is a 7-letter word. The word "sadly" might be an anagram indicator, and "beginning to age — if such" might be the letters to be anagrammed or combined. I will first check if "sadly" is an anagram indicator.',
  'tool_name_0': 'search_indicators',
  'tool_args_0': {'indicator': 'sadly'},
  'observation_0': ['anagram'],
  'thought_1': 'Since "sadly" is confirmed as an anagram indicator, the wordplay likely involves an anagram of "beginning to age — if such" or a subset of these words or letters. The clue says "beginning to age — if such," which might mean taking the first letters of "age if such" or something similar. I will try to identify the charade or th