# Hunt the Wumpus LLM Agent Evaluation

This notebook evaluates the performance of different Hunt the Wumpus LLM agent prompts by comparing:
- Game outcomes and win rates
- Performance metrics (turns, rooms explored, arrow usage)
- Error rates and response times

Two trials are analyzed:
1. Basic Action Prompt
    * The initial action prompt, containing the strategy and basic guidelines.
2. Enhanced Action Prompt + System Message
    * The current action prompt, with more detailed considerations and guidelines, plus an added system message.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# set style for visualizations - using a built-in style
plt.style.use("default")

# plot parameters
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.dpi"] = 100

## Data Loading and Preparation

Load the game data from the CSV files and prepare it for analysis. Each CSV holds one trial of roughly 30 game runs (32 per trial in the data loaded below), and each row records the following metrics for one game:

* timestamp of game completion
* number of turns taken
* number of rooms explored
* boolean flags for different loss conditions (death by pit, Wumpus, or running out of arrows)
* boolean for game won
* arrows remaining
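* the number of errors encountered while generating a turn action (i.e., Pydantic structured output validation errors); this metric was added after the first trial, so it is empty (NaN) for the Basic Action Prompt games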
* average and total LLM response times


In [None]:
from pathlib import Path

import pandas as pd

# configure pandas display options
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)

# get the directory containing the CSV files
notebook_path = Path.cwd()
print("Current directory:", notebook_path)
print("Files in directory:", list(notebook_path.glob("*.csv")))


def load_wumpus_data(trial1_path, trial2_path):
    """
    Load and prepare the Wumpus game data from separate trial CSV files.
    """
    # load both trials
    df1 = pd.read_csv(trial1_path, parse_dates=["timestamp"])
    df2 = pd.read_csv(trial2_path, parse_dates=["timestamp"])

    # add trial identifiers
    df1["trial_group"] = "Basic Action Prompt"
    df2["trial_group"] = "Enhanced Action Prompt + System Message"

    # combine the dataframes
    df = pd.concat([df1, df2], ignore_index=True)

    return df


# load the data using paths relative to notebook_path
trial1_path = notebook_path / "game_metrics_detailed_trial1.csv"
trial2_path = notebook_path / "game_metrics_detailed_trial2.csv"

df = load_wumpus_data(trial1_path, trial2_path)

print("\nTotal games analyzed:", len(df))
print("\nGames per trial:")
print(df["trial_group"].value_counts())

# show first 5 rows of each trial
columns_to_display = [
    "timestamp",
    "num_turns",
    "rooms_explored",
    "death_by_pit",
    "death_by_wumpus",
    "death_by_arrows",
    "game_won",
    "arrows_remaining",
    "action_generation_errors",
    "average_response_time",
]

print("\nFirst 5 rows of Basic Action Prompt trial:")
print(df[df["trial_group"] == "Basic Action Prompt"][columns_to_display].head())

print("\nFirst 5 rows of Enhanced Action Prompt + System Message trial:")
print(
    df[df["trial_group"] == "Enhanced Action Prompt + System Message"][
        columns_to_display
    ].head()
)

# print summary statistics for key metrics
print("\nSummary statistics for numerical columns:")
print(
    df[
        [
            "num_turns",
            "rooms_explored",
            "arrows_remaining",
            "action_generation_errors",
            "average_response_time",
        ]
    ].describe()
)

Current directory: /Users/ghahn/Projects/wumpus-llm-agent
Files in directory: []

Total games analyzed: 64

Games per trial:
trial_group
Basic Action Prompt                        32
Enhanced Action Prompt + System Message    32
Name: count, dtype: int64

First 5 rows of Basic Action Prompt trial:
                   timestamp  num_turns  rooms_explored  death_by_pit  death_by_wumpus  death_by_arrows  game_won  arrows_remaining  action_generation_errors  average_response_time
0 2024-09-26 23:31:59.223440          2               1             0                0                0         0                 5                       NaN              14.486365
1 2024-09-26 23:31:31.407691          3               3             1                0                0         0                 5                       NaN               8.195913
2 2024-09-26 23:30:43.749341          3               2             0                0                0         0                 5                       NaN 

## Game Outcome Analysis (Wins)

Create a grid for each trial with one cell per game: green for a win, red for a loss, with cells darkening as the number of turns increases. The grid is 6x5, so only the first 30 games of each trial are shown.

In [None]:
def create_outcome_heatmap(df, trial_name):
    """
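    Create a 6x5 heatmap showing game outcomes with turn count intensity.

    Only the first 30 games of the trial fit in the grid; any further games are not shown.
    """
    # drop the old index so positions 0..n-1 can be used for lookup
    trial_data = df[df["trial_group"] == trial_name].reset_index(drop=True)

    # create 6x5 matrix of game outcomes
    outcomes = np.zeros((6, 5))
    turns = np.zeros((6, 5))

    # fill the grid row by row; cells stay 0 if the trial has fewer than 30 games
    for i in range(min(30, len(trial_data))):
        row = i // 5
        col = i % 5
        outcomes[row, col] = trial_data.loc[i, "game_won"]
        turns[row, col] = trial_data.loc[i, "num_turns"]

    # normalize turn counts to [0, 1] for color intensity (guard against a zero range)
    turn_range = turns.max() - turns.min()
    if turn_range > 0:
        turns_normalized = (turns - turns.min()) / turn_range
    else:
        turns_normalized = np.zeros_like(turns)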

    plt.figure(figsize=(12, 8))

    # create custom colormap: red for losses, green for wins
    colors = np.zeros((6, 5, 3))
    colors[outcomes == 0] = [1, 0, 0]  # red for losses
    colors[outcomes == 1] = [0, 1, 0]  # green for wins

    # adjust color intensity based on turn count
    for i in range(3):
        colors[:, :, i] = colors[:, :, i] * (1 - turns_normalized * 0.5)

    plt.imshow(colors)

    # add turn count text to each cell
    for i in range(6):
        for j in range(5):
            plt.text(
                j,
                i,
                f"{int(turns[i, j])}",
                ha="center",
                va="center",
                color="white" if turns_normalized[i, j] > 0.5 else "black",
            )

    plt.title(
        f"Game Outcomes - {trial_name}\nCell colors: Green=Win, Red=Loss\nNumbers show turn count"
    )
    plt.axis("off")
    return plt.gcf()


# create heatmaps for each trial
for trial in df["trial_group"].unique():
    outcome_heatmap = create_outcome_heatmap(df, trial)
    plt.show()
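
The intensity scaling used in the heatmap can be checked in isolation. This is a minimal sketch with made-up turn counts: min-max normalization maps the shortest game to 0 and the longest to 1, and the factor `1 - 0.5 * normalized` then keeps between 100% and 50% of the base color. The zero-range guard is an addition for the case where every game has the same turn count.

```python
import numpy as np

# Made-up turn counts for illustration only.
turns = np.array([2.0, 6.0, 10.0])

# Min-max normalization, guarding against all games having equal turn counts.
span = turns.max() - turns.min()
normalized = (turns - turns.min()) / span if span > 0 else np.zeros_like(turns)

# Fraction of the base color that is kept: longer games are drawn darker.
brightness = 1 - 0.5 * normalized
print(normalized, brightness)
```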

## Performance Metrics Analysis

Calculate and compare key performance metrics between trials.

In [None]:
def calculate_performance_metrics(df, trial_name):
    """
    Calculate key performance metrics for a trial.
    """
    trial_data = df[df["trial_group"] == trial_name]

    metrics = {
        "Total Games": len(trial_data),
        "Win Rate": (trial_data["game_won"].mean() * 100),
        "Avg Turns per Game": trial_data["num_turns"].mean(),
        "Avg Turns per Win": trial_data[trial_data["game_won"] == 1][
            "num_turns"
        ].mean(),
        "Avg Rooms Explored": trial_data["rooms_explored"].mean(),
        "Arrow Efficiency": (
            5 - trial_data[trial_data["game_won"] == 1]["arrows_remaining"].mean()
        ),  # average number of arrows used per win, assuming each game starts with 5 arrows
        "Death by Pit %": (trial_data["death_by_pit"].mean() * 100),
        "Death by Wumpus %": (trial_data["death_by_wumpus"].mean() * 100),
        "Death by Arrows %": (trial_data["death_by_arrows"].mean() * 100),
        "Avg Errors per Game": trial_data["action_generation_errors"].mean(), # this metric was added after the first trial
        "Avg Response Time": trial_data["average_response_time"].mean(),
    }

    return pd.Series(metrics)


# calculate and display metrics
metrics = pd.DataFrame(
    {
        trial: calculate_performance_metrics(df, trial)
        for trial in df["trial_group"].unique()
    }
)
metrics
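
To make the arithmetic behind "Win Rate" and "Arrow Efficiency" concrete, here is a minimal worked example on a made-up four-game trial (the values are illustrative and not taken from the CSVs). It assumes, as the cell above does, that each game starts with 5 arrows.

```python
import pandas as pd

# Made-up games for illustration only.
toy = pd.DataFrame(
    {
        "game_won": [1, 0, 1, 0],
        "arrows_remaining": [4, 5, 2, 0],
    }
)

# Win rate: the mean of a 0/1 column is the fraction of wins.
win_rate = toy["game_won"].mean() * 100

# Arrow efficiency: arrows used per win, counting only the games that were won.
wins = toy[toy["game_won"] == 1]
arrows_used_per_win = 5 - wins["arrows_remaining"].mean()

print(win_rate, arrows_used_per_win)
```

Two of the four games are wins, so the win rate is 50%. The winning games ended with 4 and 2 arrows left (mean 3), which gives 2 arrows used per win.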

## Response Time Analysis

Analyze response times and their relationship with game outcomes and performance.

In [None]:
def plot_response_time_analysis(df):
    """
    Create response time analysis visualizations with clear labels and formatting.
    """
    # set consistent colors and styles
    colors = {
        "Basic Action Prompt": "blue",
        "Enhanced Action Prompt + System Message": "brown",
    }
    markers = {0: "o", 1: "X"}  # 0 for lost, 1 for won

    # create figure with more space
    fig = plt.figure(figsize=(15, 7))

    ax1 = plt.subplot(121)
    sns.boxplot(
        data=df,
        x="trial_group",
        y="average_response_time",
        hue="game_won",
        ax=ax1,
        palette=["blue", "brown"],
    )
    ax1.set_title("LLM Response Times by Trial and Game Outcome\n(Box Plot)", pad=10)
    ax1.set_xlabel("Strategy Type")
    ax1.set_ylabel("Average Response Time (seconds)")
    ax1.tick_params(axis="x", rotation=45)

    ax1.legend(
        title="Game Result",
        labels=["Lost", "Won"],
        handles=[
            plt.Rectangle((0, 0), 1, 1, fc="blue"),
            plt.Rectangle((0, 0), 1, 1, fc="brown"),
        ],
    )

    # scatter plot
    ax2 = plt.subplot(122)

    # plot each combination separately with explicit styles
    for trial in ["Basic Action Prompt", "Enhanced Action Prompt + System Message"]:
        for outcome in [0, 1]:  # 0 for lost, 1 for won
            mask = (df["trial_group"] == trial) & (df["game_won"] == outcome)
            ax2.scatter(
                df[mask]["average_response_time"],
                df[mask]["num_turns"],
                c=colors[trial],
                marker=markers[outcome],
                label=f"{trial.split()[0]} - {'Won' if outcome else 'Lost'}",
            )

    ax2.set_title("Response Time vs Number of Turns\n(Scatter Plot)", pad=10)
    ax2.set_xlabel("Average Response Time (seconds)")
    ax2.set_ylabel("Number of Turns")

    ax2.legend(title="Strategy and Outcome", bbox_to_anchor=(1.05, 1), loc="upper left")

    plt.tight_layout()
    return fig


response_time_fig = plot_response_time_analysis(df)
plt.show()

### Understanding the Response Time Analysis

#### Left Plot: Response Time Distribution (Box Plot)
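This box plot shows the distribution of per-game average LLM response times (in seconds) for each strategy, split by game outcome:
- Each box spans the interquartile range, i.e. the middle 50% of response times
- The horizontal line in each box is the median response time
- The whiskers extend to the most extreme values within 1.5 times the interquartile range of the box edges; points beyond them are drawn individually as outliers
- Winning and losing games get separate boxes so their response times can be compared within each strategy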

Important points:
- Taller/shorter boxes indicate more/less variation in response times
- A box sitting higher or lower on the axis shows that a strategy is generally slower or faster
- Comparing the "Won" and "Lost" boxes shows whether winning games had different response time patterns
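
The quantities a box plot draws can also be computed directly. The sketch below uses a hypothetical helper (`box_stats` is not part of this notebook) and made-up response times; it follows the default whisker rule of 1.5 times the interquartile range, which is an assumption about the plot settings since the plotting cell does not override it.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Compute quartiles, whisker ends, and outliers (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers end at the most extreme data points that lie inside the fences.
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }


# Made-up response times in seconds, for illustration only.
stats = box_stats([8.0, 9.0, 10.0, 11.0, 12.0, 30.0])
print(stats)
```

In this toy sample the 30-second value lies beyond the upper fence, so it would be drawn as an outlier and the upper whisker would stop at 12 seconds.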

#### Right Plot: Response Time vs Game Length (Scatter Plot)
This scatter plot shows the relationship between how long the LLM takes to respond and how many turns the game lasts:
- Each dot represents one game
- Different colors show the two different strategies
- Different shapes show whether the game was won or lost
- Position shows both the response time (left-to-right) and number of turns (bottom-to-top)

Important points:
- Clustering of points shows common patterns
- Spread of points shows how consistent the relationship is
- Position of won/lost games shows whether faster responses were associated with better outcomes
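
The visual impression from the scatter plot can be backed by a number. A minimal sketch, assuming the column names used in this notebook: it computes the Pearson correlation between average response time and turn count within each trial. The helper name `response_time_turn_correlation` and the toy data are hypothetical.

```python
import pandas as pd


def response_time_turn_correlation(df):
    """Pearson correlation of response time vs. turns, per trial (hypothetical helper)."""
    return {
        trial: group["average_response_time"].corr(group["num_turns"])
        for trial, group in df.groupby("trial_group")
    }


# Made-up data for illustration only: trial A rises together, trial B moves oppositely.
toy = pd.DataFrame(
    {
        "trial_group": ["A"] * 4 + ["B"] * 4,
        "average_response_time": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
        "num_turns": [2, 4, 6, 8, 8, 6, 4, 2],
    }
)
correlations = response_time_turn_correlation(toy)
print(correlations)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. With roughly 30 games per trial, such estimates are noisy and best read alongside the plot.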