# Text Adventures and LLMs

**Everett Lewark & Tyson O'Leary**

CS 542: Natural Language Processing

Dec. 17, 2025

## Introduction

Video games and artificial intelligence have long overlapped, even as the popular perception of what technologies constitute AI has shifted over time. Categories of video game AI vary from simpler behaviors like those of arcade-game enemies to more advanced state-machine approaches or even the strategic search algorithms seen in chess engines. However, as players often face off against computer-controlled opponents in these games, a question arises: in what cases can these algorithms take the role of players themselves?

Prior machine learning research has investigated how models can learn to play video games. Games suit this work because they are self-contained and easily repeatable, and they can be made deterministic by controlling the state of the random number generator. Within natural language processing, there has been some work on applying language models to text adventure games. Typically, these games task players with navigating multi-room environments, locating and collecting objects, and using those objects to solve puzzles and unlock new areas. This is a challenging task for language models, since they have to identify these objects and interact with the environment to progress through the game.

The [Jericho](https://github.com/microsoft/jericho) and [TextWorld](https://github.com/Microsoft/TextWorld) projects from Microsoft provide text-adventure environments for use with language models, and we used the former as the test bench for our models. Many classic adventure games, including Zork (our chosen environment), [are available](https://github.com/BYU-PCCL/z-machine-games) in a format compatible with Jericho.

## Random baseline

In a prior project, we found that a model trained to play Street Fighter performed comparably to one that simply input actions at random. Following that pattern, we implemented a naïve baseline here that picks random actions from the set of valid commands. As in the Street Fighter case, this baseline gives us a reference point for judging how well an LLM performs at a text adventure game.

To evaluate all of our models, we track a variety of properties about the game environment as we step through:
- Unique rooms reached
- Unique items gathered
- Unique game states (hashes)
- Score (as reported by the game)
- Average moves ("retries") between increases in score

Many classic text adventure games were written for the Z-Machine, a virtual machine designed by Infocom. This allowed games to be programmed once and then run on many different computers using platform-specific interpreters, thereby drastically reducing the complexity of porting them.

Jericho incorporates a Z-Machine interpreter called Frotz to run these games. Using this library, let's run our random agent through 100 turns of Zork to see how it performs:

In [1]:
import random
import sys
import time

from jericho import FrotzEnv

# Import our score tracking code from our main module
# To see more, check out the code under the 'adventure' directory
from adventure.metrics import ScoreTracker

STORY_FILE = "./z-machine-games-master/jericho-game-suite/zork1.z5"

def run_random(env, print_output=True):
    obs, info = env.reset()
    if print_output:
        print(obs)
    
    score_tracker = ScoreTracker(env)
    
    for i in range(100):
        start_time = time.time()
        actions = env.get_valid_actions()
        action = random.choice(actions)
        end_time = time.time()

        obs, reward, done, info = env.step(action)

        score_tracker.update(info, start_time, end_time)

        if print_output and (i <= 1 or done or i >= 98):
            print(">", action)
            print(obs)
            if i == 1:
                print("[...trimmed...]")
        if done:
            break

    return score_tracker.get_stats(env, info, print_output=False)


env = FrotzEnv(STORY_FILE)
run_random(env)

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


> south
South of House
You are facing the south side of a white house. There is no door here, and all the windows are boarded.


> west
West of House
There is a small mailbox here.


[...trimmed...]
> down
Forest Path
There is a bird's nest here.


> east
Forest




{'moves': 100,
 'unique_rooms': 8,
 'unique_hashes': 62,
 'unique_items': 3,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 0.04349986791610718}

The random agent typically performed worse than the LLMs, though it occasionally and unpredictably achieved higher scores. It would often jump off a cliff or be eaten by a grue, rather than simply running out of turns as the other agents do.

To inspect the variability of these metrics, we can run the environment multiple times and aggregate the results:

In [2]:
from tqdm import tqdm

keys = ['unique_rooms', 'unique_hashes', 'unique_items', 'score']

max_values = {}
sums = {}

sample_n = 20

for i in tqdm(range(sample_n)):
    values = run_random(env, print_output=False)
    for k in keys:
        value = values[k]
        if value > max_values.get(k, -9999):
            max_values[k] = value
            
        sums[k] = sums.get(k, 0) + value

for k in keys:
    print(f"Max {k}:", max_values.get(k, 0))
    print(f"Mean {k}:", sums[k] / sample_n)

env.close()

100%|█████████████████████████████████████████████████████| 20/20 [00:20<00:00,  1.02s/it]

Max unique_rooms: 11
Mean unique_rooms: 8.65
Max unique_hashes: 80
Mean unique_hashes: 48.15
Max unique_items: 6
Mean unique_items: 2.9
Max score: 10
Mean score: -1.75





Notably, the mean score here is below zero due to the random actions often causing the player to die.
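Maximum and mean values alone hide how much individual runs differ, so it can help to report the spread as well. The sketch below uses Python's standard `statistics` module; the `summarize` helper is a hypothetical addition, and the example numbers are made up for illustration rather than taken from our runs.

```python
import statistics

def summarize(runs, keys):
    """Return min, mean, sample standard deviation, and max for each key."""
    summary = {}
    for k in keys:
        values = [run[k] for run in runs]
        # stdev needs at least two samples; fall back to 0.0 otherwise.
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[k] = {
            "min": min(values),
            "mean": statistics.mean(values),
            "stdev": spread,
            "max": max(values),
        }
    return summary

# Made-up per-run stats, purely to show the shape of the input.
example_runs = [
    {"score": -10, "unique_rooms": 8},
    {"score": 0, "unique_rooms": 10},
    {"score": 10, "unique_rooms": 12},
]
result = summarize(example_runs, ["score", "unique_rooms"])
print(result)
```

A large standard deviation relative to the mean would suggest that a sample of 20 runs is too small to compare agents reliably.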

## Off-the-shelf models

Starting simple, we chose to assess the ability of off-the-shelf open-weight models to play Zork. We start with Llama 3.2 1B and 3B as the language model acting as the player. Our first strategy provides a system prompt informing the model that it is playing a text adventure game, then prompts the model with the game’s scenario description; the response is fed back into the game as the action to take. This is intentionally naive, and the results reflect it. The model is inconsistent in both the format and the validity of the actions it gives. If a valid action is “open mailbox”, model responses could vary from “mail” to “As the player, you should type: open the mailbox and read its contents”. Neither of these is recognized by the game; however, a human player could make the same mistakes when learning how commands are written.

In [1]:
import os
import sys
import time

import ollama
import jericho
import re
import json
import random

In [2]:
game = 'zork1.z5'
GAMES_DIR = "z-machine-games-master/jericho-game-suite"
env = jericho.FrotzEnv(f"{GAMES_DIR}/{game}")

In [3]:
import time
from adventure.metrics import ScoreTracker

def n_steps(turn_func, env, n=10):
    score_tracker = ScoreTracker(env)

    for _ in range(n):

        # Turn
        start = time.time()
        done, info = turn_func()
        end = time.time()

        score_tracker.update(info, start, end)

        if done:
            break

    return score_tracker.get_stats(env, info)

In [11]:
def basic_llm():
    # Basic
    system_prompt = (
        'You are a smart video game enthusiast who is skilled at playing old-school text adventure games. Given a description of your environment, explain your thought process and then give a command that is compatible with the game you are playing. Always put the command on its own line at the end of your response with nothing else. It needs to be easy and consistent to read with simple Python. '
        'This game does not run on an LLM, so it only recognizes a small vocabulary of commands. An example of a command is exactly the string "go west". This means you should only give exactly the command that the game recognizes.'
    )

    make_prompt = lambda x: f'{system_prompt}\n\n{x}'

    game_response, info = env.reset()
    print(game_response)


    def turn():
        nonlocal game_response
        prompt = make_prompt(f'Game prompt:\n{game_response}')
        response = ollama.generate(model='llama3.2:3b', prompt=prompt).response
        print(f'LLM Response: {response}')
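        # Use the last non-empty line as the command (an empty reply yields '')
        lines = response.strip().splitlines()
        player_in = lines[-1].strip() if lines else ''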
        print(player_in)

        game_response, reward, done, info = env.step(player_in)
        print(game_response)
        return done, info

    results = n_steps(turn, env)
    print(results)
    return results

In [12]:
basic_llm()

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM Response: To navigate this environment, I'll start by taking stock of my surroundings. The description mentions a "white house" and a "boarded front door", which suggests that the house might be locked or in disrepair. The presence of a "small mailbox" could imply that there's some sort of communication or delivery system at play.

Given these details, I'll want to explore the house further to see if it holds any clues or interesting objects.

I'd like to examine the house more closely, so my next command would be:

Examine House
Examine House
The house is a beautiful colonial house which is painted white. It is clear that the owners must have been extremely wealthy.


LLM Response: To analyze this environ

{'moves': 6,
 'unique_rooms': 4,
 'unique_hashes': 5,
 'unique_items': 0,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 1.6009242534637451}

Next, we built on the first strategy by adding a memory, which includes the model’s most recent game interactions (up to five) in the prompt. This was a step in the right direction: the model would sometimes notice that the game had rejected its previous action, though it was rarely able to correct the problem. This ability to “retry” shows more promise in the other strategies we employ.
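A bounded memory like this can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest entry automatically. The `build_prompt` helper and the placeholder strings are hypothetical, and the ordering of prompt parts is one possible layout rather than the exact one used in our cells.

```python
from collections import deque

def build_prompt(system_prompt, memory, observation):
    """Assemble the system prompt, remembered turns, and the latest game text."""
    history = "\n".join(memory)
    return f"{system_prompt}\n\n{history}\n\nGame prompt:\n{observation}"

# maxlen makes the deque drop its oldest entry once it is full.
memory = deque(maxlen=5)
for turn in range(7):
    memory.append(f"turn {turn}: <game text and model reply>")

prompt = build_prompt("You are playing a text adventure.", memory, "West of House")
print(prompt)
```

After seven appends only turns 2 through 6 remain, so the prompt length stays bounded no matter how long the episode runs.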

In [13]:
# Basic with memory
system_prompt_memory = (
    'You are a smart video game enthusiast who is skilled at playing old-school text adventure games. Given a description of your environment, explain your thought process and then give a command that is compatible with the game you are playing. Always put the command on its own line at the end of your response with nothing else. It needs to be easy and consistent to read with simple Python. '
    'At the beginning of your prompt, you will also receive up to 5 of the most recent interactions you\'ve had with the game. '
    'This game does not run on an LLM, so it only recognizes a set vocabulary of commands. An example of a command is exactly the string "go west". This means you should only give exactly the command that the game recognizes.'
)

In [14]:
def basic_llm_with_memory(memory_size=5, system_prompt=system_prompt_memory):

    make_prompt = lambda x: f'{system_prompt}\n\n{x}'

    game_response, info = env.reset()
    print(game_response)

    memory = []

    def turn():
        nonlocal game_response
        prompt = make_prompt(f'Game prompt:\n{game_response}')
        combined_memory = "\n".join(memory)
        prompt_with_memory = f'{combined_memory}\n\n{prompt}'
        response = ollama.generate(model='llama3.2:3b', prompt=prompt_with_memory).response
        print(f'LLM Response: {response}')
        # Use the last non-empty line as the command (an empty reply yields '')
        lines = response.strip().splitlines()
        player_in = lines[-1].strip() if lines else ''

        memory.append(f'{prompt}\n{response}')
        if len(memory) > memory_size:
            memory.pop(0)

        # Take an action in the environment using the step function.
        # The resulting text observation, reward, and game-over indicator are returned.
        game_response, reward, done, info = env.step(player_in)
        game_response = f'Received command: {player_in}\n{game_response}'  # Echo the command the game received so the LLM can hopefully improve its formatting
        print(game_response)

        return done, info

    result = n_steps(turn, env)
    print(result)
    return result

In [15]:
basic_llm_with_memory(system_prompt=system_prompt_memory)

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM Response: Let's analyze the situation. I'm standing west of a white house with a boarded front door and a small mailbox nearby. My goal is likely to explore the house or find something useful in the mailbox. Since I'm already in an open field, moving east would take me closer to the house, while moving north, south, or west might lead me away from potential interesting areas.

Given that my current location has a boarded front door, it's possible that there's something I could use inside the house or that leads to further exploration. However, since I'm in an open field now, I should try to get closer to the house before proceeding with any other actions.

I'll choose to move east towards the house. 

go e

{'moves': 5,
 'unique_rooms': 3,
 'unique_hashes': 4,
 'unique_items': 0,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 1.201346492767334}

For the same memory strategy, we also tried a system prompt generated by ChatGPT to see whether it would do any better. It did produce valid commands more often, but they were still not generally useful. Because the prompt tells the model to issue “look” when confused, it frequently used the look command, even immediately after having just used it.

In [16]:
system_prompt_memory_chatgpt = (
    "You are a player of a classic parser-based interactive fiction game.\n"
    "Respond only with a single imperative command in plain lowercase (e.g., “look”, “go east”, “get key”).\n"
    "Do not write sentences, explanations, strategies, or narratives.\n"
    "Use only standard text-adventure verbs: look, examine, go, take, drop, open, close, use, talk to, attack, inventory, etc.\n"
    "Act rationally based on the game’s last description.\n"
    "If confused, issue “look”.\n"
    "Notice that the memory you receive contains commands you have issued in the past. Don't repeat commands that won't help you move forward.\n"
)

In [17]:
basic_llm_with_memory(system_prompt=system_prompt_memory_chatgpt)

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM Response: look
Received command: look
West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM Response: go east
Received command: go east
The door is boarded and you can't remove the boards.


LLM Response: examine mailbox
Received command: examine mailbox
The small mailbox is closed.


LLM Response: open mailbox
Received command: open mailbox
Opening the small mailbox reveals a leaflet.


LLM Response: take leaflet
Received command: take leaflet
Taken.


LLM Response: inventory
Received command: inventory
You are carrying:
  A leaflet


LLM Response: examine leaflet
Received command: examine leaflet
"WELCOME TO ZORK!

ZORK is a

{'moves': 10,
 'unique_rooms': 2,
 'unique_hashes': 5,
 'unique_items': 1,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 0.2031045913696289}

We then took a more direct approach to fixing the validity of commands. Jericho provides the `env.get_valid_actions()` function, which returns a list of valid actions for the current game state (player location, inventory, etc.). We chose to include this list in the model’s prompt to give it options to choose from. This felt like cheating at first, because figuring out the correct commands is part of the gameplay experience of a text adventure game, and ideally we would like to see the model “learn” to play as a human would. The problem is that the only ways we know of to fix the model’s action outputs are providing the commands like this, training or fine-tuning the model to be better at generating commands, or trying a model other than Llama that might have more experience with these types of games. We explore both of these other options later.

In [18]:
def memory_and_provided_commands(memory_size=5):
    system_prompt = (
        f'You are a smart video game tester who is skilled at playing old-school text adventure games. You are playing {game}\n'
        'Respond only with a single imperative command in plain lowercase from the list of possible actions below.\n'
        'Do not write sentences, explanations, strategies, or narratives.\n'
        'Act rationally based on the game’s last description.\n'
        "Notice that the memory you receive contains commands you have issued in the past. Don't repeat commands that won't help you move forward\n"
    )

    make_prompt = lambda x: f'{system_prompt}\n\n{x}'

    game_response, info = env.reset()
    print(game_response)

    memory = []

    def turn():
        nonlocal game_response
        combined_memory = "\n".join(memory)
        actions = ', '.join(env.get_valid_actions())
        prompt = make_prompt(f'{combined_memory}\n\nGame text:\n{game_response}\n\nValid actions: {actions}')
        response = ollama.generate(model='llama3.2:3b', prompt=prompt).response
        print(f'LLM Response: {response}')
        # Use the last non-empty line as the command (an empty reply yields '')
        lines = response.strip().splitlines()
        player_in = lines[-1].strip() if lines else ''

        memory.append(f'{game_response}\n{response}')
        if len(memory) > memory_size:
            memory.pop(0)

        # Take an action in the environment using the step function.
        # The resulting text observation, reward, and game-over indicator are returned.
        game_response, reward, done, info = env.step(player_in)
        game_response = f'Received command: {player_in}\n{game_response}'  # Echo the command the game received so the LLM can hopefully improve its formatting
        print(game_response)
        return done, info

    result = n_steps(turn, env)
    print(result)
    return result

In [19]:
memory_and_provided_commands()

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM Response: open mailbox
Received command: open mailbox
Opening the small mailbox reveals a leaflet.


LLM Response: take leaflet
Received command: take leaflet
Taken.


LLM Response: close mailbox
Received command: close mailbox
Closed.


LLM Response: take leaflet
Received command: take leaflet
You already have that!


LLM Response: put down leaflet
Received command: put down leaflet
Dropped.


LLM Response: open mailbox
Received command: open mailbox
Opened.


LLM Response: close mailbox
Received command: close mailbox
Closed.


LLM Response: open mailbox
Received command: open mailbox
Opened.


LLM Response: north
Received command: north
North of House
You are facing the north side of a white house. Ther

{'moves': 10,
 'unique_rooms': 2,
 'unique_hashes': 8,
 'unique_items': 1,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 0.26403424739837644}

Providing valid actions is another step in the right direction, as it makes the model much more consistent at using valid commands. It would still use an invalid command at times, but it was able to fix its mistake more often by noticing that its chosen command wasn’t in the list. The problem this revealed is that valid commands are not necessarily commands that move the game forward. In this setup, the model commonly opens the mailbox, takes the leaflet, puts the leaflet back in the mailbox, and repeats the process, or moves in random cardinal directions without ever discovering the way into the house. Even when we tell it that its goal is to complete the game, it does not seem to know what that entails, and it has no sort of “curiosity” to motivate in-depth exploration of the environment.
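For the remaining invalid commands, one option we did not test is to snap the model's output to the closest entry in the valid-action list. The sketch below uses `difflib.get_close_matches` from the standard library; `snap_to_valid` and the `cutoff` value of 0.6 are assumptions for illustration, and the action list is a shortened example rather than real Jericho output.

```python
import difflib

def snap_to_valid(command, valid_actions, cutoff=0.6):
    """Return the valid action closest to `command`, or None if none is close."""
    normalized = command.strip().lower()
    if normalized in valid_actions:
        return normalized
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(normalized, valid_actions, n=1, cutoff=cutoff)
    return matches[0] if matches else None

valid = ["open mailbox", "north", "south", "west"]
print(snap_to_valid("Open the mailbox", valid))
print(snap_to_valid("xyzzy", valid))
```

This only addresses validity. It does nothing about the looping behavior described above, since a repeated command is still a valid one.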

Our next strategy adds a second call to the LLM for every game action. The new prompt has the LLM first analyze the game’s description and explain the next action to take in natural language. This description is then given back to the model in the second prompt to generate the command alone. We hoped that this could give the model a clearer and more productive goal when choosing its action from the list.
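The control flow of this two-call strategy can be sketched independently of any particular LLM client. In the sketch below, `generate` is a hypothetical callable that maps a prompt string to a reply string, and the prompt wording is abbreviated compared with what we actually used.

```python
def two_stage_turn(generate, observation, valid_actions):
    """Ask for an analysis first, then for a bare command conditioned on it.

    `generate` is any callable taking a prompt string and returning a reply
    string (an assumed wrapper around whichever LLM client is in use).
    """
    actions = ", ".join(valid_actions)
    # Stage 1: free-form analysis of the current game state.
    analysis = generate(
        "Concisely describe the current state of the game and a potential "
        f"action to take.\n\n{observation}\n\nValid actions: {actions}"
    )
    # Stage 2: a single command, grounded in the analysis from stage 1.
    command = generate(
        "Respond only with one command from this list: "
        f"{actions}\n\nYour analysis:\n{analysis}"
    )
    return analysis, command.strip().lower()
```

Passing the model call in as a parameter also makes the loop easy to exercise with a stub function, without loading a model.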

In [20]:
def memory_analyze_provided_commands_chat():
    model = 'llama3.2:3b'
    system_prompt = (
        f'You are a smart video game tester who is skilled at playing old-school text adventure games. You are playing {game}\n'
        'Act rationally based on the game’s last description.\n'
        "Notice that the memory you receive contains commands you have issued in the past. Don't repeat commands that won't help you move forward\n"
    )

    make_prompt = lambda x: f'{system_prompt}\n\n{x}'
    memory = [
        ollama.Message(role='system', content=system_prompt)
    ]
    analysis_prompt = 'Concisely describe the current state of the game and a potential action to take to move forward.'
    memory.append(ollama.Message(role='system', content=analysis_prompt))

    game_response, info = env.reset()
    print(game_response)

    def turn_func():
        nonlocal model, make_prompt, analysis_prompt, memory, game_response
        actions_list = env.get_valid_actions()
        random.shuffle(actions_list)
        actions = ', '.join(actions_list)

        memory.append(ollama.Message(role='user', content=f'{game_response}\n\nValid game actions: {actions}'))
        response = ollama.chat(model=model, messages=memory)
        memory.append(response.message)
        response = response.message.content
        print(f'LLM description: {response}')

        prompt = make_prompt(f'Given your analysis of the game state, issue a rational action to take to progress in the game. Respond only with a single imperative command in plain lowercase. Use only standard text-adventure verbs. IMPORTANT: Your response will be used directly as input to the game. Minimize the number of words you use.\n\nYour analysis:\n{response}\n\n Only use one of these valid actions: {actions}\n\n')
        print('[action prompt]', prompt)
        response = ollama.generate(model=model, prompt=prompt).response
        print(f'LLM action: {response}')
        response = response.removeprefix('type').strip() # Give it a shot. Keeps saying type! TODO: Probably remove. Bandaid

        lines = response.splitlines()
        player_in = lines[-1].strip() if len(lines) != 0 else ''

        # Take an action in the environment using the step function.
        # The resulting text observation, reward, and game-over indicator are returned.
        game_response, reward, done, info = env.step(player_in)
        game_response = f'Received command: {player_in}\n{game_response}'  # Echo the command the game received so the LLM can hopefully improve its formatting
        print(game_response)

        return done, info

    results = n_steps(turn_func, env)
    print(results)
    return results

In [21]:
memory_analyze_provided_commands_chat()

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


LLM description: Current state: I'm standing in an open field west of a white house with a boarded front door. There's also a small mailbox nearby.

Potential action: Open mailbox. This might contain some useful information or items that could aid me in my exploration of the area.
[action prompt] You are a smart video game tester who is skilled at playing old-school text adventure games. You are playing zork1.z5
Act rationally based on the game’s last description.
Notice that the memory you receive contains commands you have issued in the past. Don't repeat commands that won't help you move forward


Given your analysis of the game state, issue a rational action to take to progress in the game. Respond only wi

{'moves': 10,
 'unique_rooms': 4,
 'unique_hashes': 11,
 'unique_items': 1,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 0.7671231746673584}

Next, we tried using reasoning models on top of our strategy. We tried Qwen3 first, which tended to be very verbose in its thinking but performed well compared to Llama. We had a similar experience with gpt-oss, but it ran faster on average than Qwen3, with less meandering in its analysis. With both of these reasoning models, we attempted an agentic workflow in which taking actions in the game and viewing valid moves are tools the LLM can choose to employ.

In [32]:
def agent(max_retry=30, model='qwen3'):

    system_prompt = (
        f'Think step by step. You are playing {game}, an interactive fiction game. You must analyze the scenario the game presents to you and choose an action that will make progress. Your goal is to finish the game\n'
        'Use the tools provided to you to take actions, view possible actions for your current location, and view the game walkthrough if necessary'
    )

    game_response, info = env.reset()
    print(game_response)
    done = False

    memory = [
        ollama.Message(role='system', content=system_prompt)
    ]

    def do_game_action(action: str) -> str:
        """Perform an action in the active text adventure game and see the result.

        Args:
          action: game action string

        Returns:
          The game's response after performing the action
        """
        nonlocal done, info
        game_response, reward, done, info = env.step(action)
        return game_response

    def view_possible_actions() -> str:
        """View a list of the actions that can be performed in the game's current state.

        Returns:
          String containing actions separated by commas
        """
        return ', '.join(env.get_valid_actions())

    def view_walkthrough() -> str:
        """View the full game walkthrough as a list of actions.

        Returns:
          String containing actions separated by newlines
        """
        # get_walkthrough() returns a list of actions; join it to match the docstring
        return '\n'.join(env.get_walkthrough())

    available_functions = {
        'do_game_action': do_game_action,
        'view_possible_actions': view_possible_actions,
        'view_walkthrough': view_walkthrough
    }

    def turn():
        nonlocal game_response, memory
        memory.append(ollama.Message(role='user', content=f'{game_response}'))

        response = ollama.chat(model=model, messages=memory, think=True, tools=[do_game_action, view_possible_actions, view_walkthrough], options={'num_ctx': 2048})
        memory.append(response.message)

        print("Thinking: ", response.message.thinking)
        print("Content: ", response.message.content)

        if response.message.tool_calls:
            for tc in response.message.tool_calls:
                if tc.function.name in available_functions:
                    print(f"Calling {tc.function.name} with arguments {tc.function.arguments}")
                    result = available_functions[tc.function.name](**tc.function.arguments)
                    print(f"Result: {result}")
                    # add the tool result to the messages
                    memory.append({'role': 'tool', 'tool_name': tc.function.name, 'content': str(result)})
        return done, info

    result = n_steps(turn, env, n=5)
    print(result)
    print('Memory at end:')
    print(json.dumps(memory, indent=4, default=str))
    return result


In [33]:
agent()

Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


Thinking:  Okay, let's see. I'm playing Zork I, and I'm at the starting location: West of House. The description says there's a small mailbox here. The goal is to finish the game, so I need to figure out what actions to take.

First, I should check what possible actions I can take here. The user mentioned using the view_possible_actions function. Let me call that to see the available commands. Common actions in Zork might include looking around, checking the mailbox, or trying to enter the house. But since the house has a boarded door, maybe I need to find a key or break in somehow. Alternatively, the mailbox might have something useful. Let me check the possible actions to know what's allowed here. Once I kno

{'moves': 3,
 'unique_rooms': 3,
 'unique_hashes': 4,
 'unique_items': 0,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 22.60523405075073}

## Evaluating Existing Knowledge of Zork

A question that arises when running these models on a given game is whether the actions they take are simply memorized from training data that included the game. We acknowledge that memorization is a valid strategy for completing the game, but it would be much more interesting to show that a model can be coaxed into using logic to make choices and progress. Below, we take the very first message Zork shows the player, truncate it roughly halfway through, and task each model with completing the text. Our assumption is that the more similar the embedding of the generated text is to the embedding of the actual text, the more likely it is that the model has prior exposure to the game.
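For reference, the similarity measure we rely on is cosine similarity, defined for two vectors as their dot product divided by the product of their lengths. The pure-Python `cosine` function below is an illustrative sketch of that formula rather than the scikit-learn implementation, and the short vectors are toy values, not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, two texts of very different lengths can still score close to 1 if their embeddings point the same way.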

In [34]:
from sklearn.metrics.pairwise import cosine_similarity

In [35]:
GAMES_DIR = "z-machine-games-master/jericho-game-suite"
game = 'zork1.z5'
env = jericho.FrotzEnv(f"{GAMES_DIR}/{game}")

In [36]:
initial_observation, info = env.reset()
print(initial_observation)

parts = initial_observation.split(' ')
middle = len(parts) // 2
first_half = ' '.join(parts[:middle+3])
second_half = ' '.join(parts[middle+3:])

print('first')
print(first_half)
print()
print('second')
print(second_half)


Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


first
Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are

second
standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.




In [37]:
def embedding(text):
    embed_result = ollama.embed(model='nomic-embed-text:latest', input=text)
    # print(type(embed_result.embeddings))
    # print(embed_result.embeddings)
    return embed_result.embeddings

In [38]:
actual = embedding(initial_observation)
actual

[[0.003380086, 0.028641177, -0.20731561, ...]]


In [39]:
prompt = f'Below is the first half of the very first text prompt given by the text adventure game Zork. Please complete it exactly as it is given by Zork. Do not include anything except your completion.\n\n"{first_half}" (completion here...)'

In [40]:
def test_llama():
    llama_response = ollama.generate(model='llama3.2:3b', prompt=prompt).response
    rebuilt_llama = first_half + ' ' + llama_response
    # print(rebuilt_llama)
    llama_embed = embedding(rebuilt_llama)
    sim = cosine_similarity(llama_embed, actual)
    # print(sim)
    return {'text': rebuilt_llama, 'sim': sim}

In [41]:
def test_qwen():
    qwen_response = ollama.generate(model='qwen3', prompt=prompt, options={'num_ctx': 2048}).response
    rebuilt_qwen = first_half + ' ' + qwen_response
    # print(rebuilt_qwen)
    qwen_embed = embedding(rebuilt_qwen)
    sim = cosine_similarity(qwen_embed, actual)
    # print(sim)
    return {'text': rebuilt_qwen, 'sim': sim}

In [42]:
def test_gpt():
    gpt_response = ollama.generate(model='gpt-oss', prompt=prompt, options={'num_ctx': 2048}).response
    rebuilt_gpt = first_half + ' ' + gpt_response
    # print(rebuilt_gpt)
    gptoss_embed = embedding(rebuilt_gpt)
    sim = cosine_similarity(gptoss_embed, actual)
    # print(sim)
    return {'text': rebuilt_gpt, 'sim': sim}

In [None]:
llama_tests = [test_llama() for _ in range(30)]

In [None]:
llama_tests[0]

In [44]:
mean_llama = sum(x['sim'] for x in llama_tests) / 30
mean_llama

array([[0.82470155]])

In [None]:
qwen_tests = [test_qwen() for _ in range(30)]

In [None]:
qwen_tests[0]

In [46]:
mean_qwen = sum(x['sim'] for x in qwen_tests) / 30
mean_qwen

array([[0.92118203]])

In [None]:
gpt_tests = [test_gpt() for _ in range(30)]

In [None]:
gpt_tests[0]

In [None]:
mean_gpt = sum(x['sim'] for x in gpt_tests) / 30
mean_gpt

Qwen had the highest similarity on average (about 0.92, compared with about 0.82 for Llama), which matches our experience using it: it often described Zork as if it knew the game, even when it got details wrong. The other models still had high similarity, but every reconstruction begins with the same first half of the text and every model receives the same prompt, so some closeness in the embedding space is expected regardless of prior knowledge.


## Q-BERT

The Q-BERT model is an existing approach that combines multiple components: an ALBERT model for answering questions about the environment, and another model that constructs commands by combining a knowledge graph with command templates. Unfortunately, we were not able to get Q-BERT working for this project: not only did the code require revision to be compatible with current versions of Python libraries and thus recent GPUs, it also required more memory than we had available on a single machine. We were able to run the ALBERT fine-tuning process, but these memory issues occurred when attempting to train the downstream model.

Some parts of Q-BERT were designed to accommodate distributed training, so if we were to revisit this model then we would either want to reduce model/data size or increase the number of machines involved in training.


## Fine-tuning

Low-Rank Adaptation (LoRA) makes it possible to fine-tune LLMs by freezing the base weights and training small low-rank adapter matrices added to selected layers, so only a small fraction of the parameters is updated. This allows fine-tuning to be accomplished on consumer-grade GPUs with limited memory. The Unsloth library for Python introduces some of its own optimizations to further accelerate this process.
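
To make the memory savings concrete, here is a minimal sketch of the LoRA idea in plain Python. It assumes the usual formulation, where a frozen weight matrix `W` is augmented with a trainable product `B @ A` scaled by `alpha / rank`. The function names and the matrix sizes are illustrative and are not taken from Unsloth or from the model configuration used in this project.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, adapter) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in               # cost of updating W directly
    adapter = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, adapter


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x) using plain nested lists."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only (not read from any model config).
full, adapter = lora_param_counts(d_out=3072, d_in=3072, rank=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

In the usual setup `B` starts at zero, so the adapted model initially behaves exactly like the base model and only drifts from it as training proceeds.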

To fine-tune a model, we convert the walkthrough for a game into a dataset that pairs each observation of the environment with the action the walkthrough takes next. We then train the model to produce these same actions in the same situations.

In [1]:
from datasets import Dataset
import jericho

def get_prompt(env: jericho.FrotzEnv, obs: str, done: bool, include_actions: bool = False):
    """ Format a custom prompt including additional information about the environment. """
    
    items = ["##Observation\n" + obs]
    state = env.get_state()

    if not done:
        look_desc, reward, done, info = env.step("look")
        if not obs.endswith(look_desc):
            items.append("##Location\n" + look_desc)

    if not done:
        inv_desc, reward, done, info = env.step("inventory")
        items.append("##Inventory\n" + inv_desc)

    if include_actions and not done:
        valid_actions = env.get_valid_actions()
        bullets = ["- " + action for action in valid_actions]
        items.append("##Available actions\n" + "\n".join(bullets))

    env.set_state(state)

    return "\n\n".join(items)


def get_steps(filename: str, extra_prompt = False):
    """ Return a sequence of (prompt, action) pairs needed to complete a game. """
    env = jericho.FrotzEnv(filename)
    
    initial_obs, info = env.reset()
    walkthrough = env.get_walkthrough()

    steps = []
   
    done = False
    obs = initial_obs
    for step in walkthrough:
        prompt = get_prompt(env, obs, done, include_actions=False)
        steps.append((prompt, step))
        obs, reward, done, info = env.step(step)
        if done:
            break

    env.close()

    return steps


def steps_to_dataset(steps: list[list[tuple[str, str]]], length: int, overlap: bool = True):
    """
    Convert a sequence of game steps to a dataset of windowed conversations,
    where the user prompt is the environment observation and the assistant response
    is the command to execute.
    """
    convos = []

    for game in steps:
        convo = []
        n = 0
        
        for step in game:
            convo.append({"role": "user", "content": step[0]})
            convo.append({"role": "assistant", "content": step[1]})
            n += 1
            if overlap:
                if length > 0 and n > length:
                    n -= 1
                    convo.pop(0)
                    convo.pop(0)
                    
                convos.append(list(convo))
            else:
                if length > 0 and n >= length:
                    n = 0
                    convos.append(convo)
                    convo = []

        if len(convo) > 0:
            convos.append(convo)

    return Dataset.from_dict({"conversations": convos})


def get_dataset(game_files: list[str], length: int, overlap: bool):
    steps = []
    for game_file in game_files:
        steps.append(get_steps(game_file))
    dataset = steps_to_dataset(steps, length=length, overlap=overlap)

    return dataset


def format_dataset(tokenizer, dataset):
    """ Apply the model-specific chat template to a dataset. """

    # Based on the Unsloth for Llama3.2 notebook located here:
    # https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb
    def formatting_prompts_func(examples):
        convos = examples["conversations"]
        texts = [
            tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt = False)
            for convo in convos
        ]
        return {'text': texts}

    dataset = dataset.map(formatting_prompts_func, batched=True)
    return dataset

We can peek at this dataset to see what the output looks like:

In [2]:
game_files = ["./z-machine-games-master/jericho-game-suite/zork1.z5"]
dataset = get_dataset(game_files, length=6, overlap=True)

dataset[0]

{'conversations': [{'content': '##Observation\nCopyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.\nZORK is a registered trademark of Infocom, Inc.\nRevision 88 / Serial number 840726\n\nWest of House\nYou are standing in an open field west of a white house, with a boarded front door.\nThere is a small mailbox here.\n\n\n\n##Inventory\nYou are empty-handed.\n\n',
   'role': 'user'},
  {'content': 'N', 'role': 'assistant'}]}
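
The windowing behind this dataset can be illustrated on a toy walkthrough. The sketch below is a simplified stand-in for `steps_to_dataset`, not a drop-in replacement: `window_steps` is a hypothetical helper, it assumes `length >= 1`, and it works on bare `(prompt, action)` tuples rather than chat messages.

```python
def window_steps(steps: list[tuple[str, str]], length: int, overlap: bool = True):
    """Group (prompt, action) steps into windows of at most `length` steps."""
    windows = []
    if overlap:
        # One window per step, each ending at that step (sliding window).
        for end in range(1, len(steps) + 1):
            start = max(0, end - length)
            windows.append(steps[start:end])
    else:
        # Disjoint chunks; the last chunk may be shorter than `length`.
        for start in range(0, len(steps), length):
            windows.append(steps[start:start + length])
    return windows


toy = [("obs1", "N"), ("obs2", "E"), ("obs3", "open window"), ("obs4", "W")]
for window in window_steps(toy, length=2, overlap=True):
    print([action for _, action in window])
```

With overlapping windows every step becomes the final turn of exactly one training example, so the model always sees up to `length - 1` preceding turns as context for the command it must predict.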

Next, we load the base Llama model to fine-tune:

In [3]:
from adventure.model import load_model, save_model

model, tokenizer = load_model("unsloth/Llama-3.2-3B-Instruct-bnb-4bit", "llama-3.2")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 12-17 14:22:16 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 12-17 14:22:20 [vllm_utils.py:702] Unsloth: Patching vLLM v1 graph capture
INFO 12-17 14:22:20 [vllm_utils.py:732] Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.12.1: Fast Llama patching. Transformers: 4.57.3. vLLM: 0.10.2.
   \\   /|    NVIDIA RTX 4000 Ada Generation. Num GPUs = 1. Max memory: 19.548 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-3b-instruct-bnb-4bit with actual GPU utilization = 49.44%
Unsloth: Your GPU has CUDA compute c

`torch_dtype` is deprecated! Use `dtype` instead!


INFO 12-17 14:22:29 [__init__.py:1815] Using max model len 2048
INFO 12-17 14:22:31 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=4096.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection'], 'llm_int8_threshold': 6.0}
INFO 12-17 14:22:32 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='unsloth/llama-3.2-3b-instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/llama-3.2-3b-instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=bitsandb



INFO 12-17 14:22:32 [gpu_model_runner.py:2370] Loading model from scratch...
INFO 12-17 14:22:32 [cuda.py:362] Using Flash Attention backend on V1 engine.
INFO 12-17 14:22:33 [bitsandbytes_loader.py:758] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 12-17 14:22:33 [weight_utils.py:348] Using model weights format ['*.safetensors']
INFO 12-17 14:22:33 [weight_utils.py:406] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 12-17 14:22:33 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 12-17 14:22:34 [gpu_model_runner.py:2392] Model loading took 2.3519 GiB and 1.103116 seconds
INFO 12-17 14:22:39 [backends.py:539] Using cache directory: /s/chopin/a/grad/elewark/.cache/vllm/torch_compile_cache/dc2c8eddc4/rank_0_0/backbone for vLLM's torch.compile
INFO 12-17 14:22:39 [backends.py:550] Dynamo bytecode transform time: 4.96 s
INFO 12-17 14:22:42 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.651 s
INFO 12-17 14:22:43 [monitor.py:34] torch.compile takes 4.96 s in total
INFO 12-17 14:22:44 [gpu_worker.py:298] Available KV cache memory: 6.91 GiB
INFO 12-17 14:22:44 [kv_cache_utils.py:864] GPU KV cache size: 64,640 tokens
INFO 12-17 14:22:44 [kv_cache_utils.py:868] Maximum concurrency for 2,048 tokens per request: 31.56x
INFO 12-17 14:22:44 [vllm_utils.py:707] Unsloth: Running patched vLLM v1 `capture_model`.


Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.97it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  9.94it/s]

INFO 12-17 14:22:46 [gpu_model_runner.py:3118] Graph capturing finished in 2 secs, took 0.35 GiB
INFO 12-17 14:22:46 [vllm_utils.py:714] Unsloth: Patched vLLM v1 graph capture finished in 2 secs.





INFO 12-17 14:22:47 [gpu_worker.py:391] Free memory on device (19.31/19.55 GiB) on startup. Desired GPU memory utilization is (0.4944470289965936, 9.67 GiB). Actual usage is 2.35 GiB for weight, 0.39 GiB for peak activation, 0.02 GiB for non-torch memory, and 0.35 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=6881795584` to fit into requested memory, or `--kv-cache-memory=17234976256` to fully utilize gpu memory. Current kv cache memory in use is 7414472192 bytes.
INFO 12-17 14:22:47 [core.py:218] init engine (profile, create kv cache, warmup model) took 13.09 seconds
INFO 12-17 14:22:48 [llm.py:295] Supported_tasks: ('generate',)
INFO 12-17 14:22:48 [__init__.py:36] No IOProcessor plugins requested by the model
Unsloth: Just some info: will skip parsing ['q_norm', 'input_layernorm', 'layer_norm1', 'layer_norm2', 'post_layernorm', 'pre_feedforward_layernorm', 'k_norm', 'norm1', 'norm', 'ffn_norm', 'post_attention_layernorm', 'norm2', 'post_feed

Some weights of LlamaForCausalLM were not initialized from the model checkpoint at unsloth/llama-3.2-3b-instruct-bnb-4bit and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Performing substitution for additional_keys=set()
Unsloth: Just some info: will skip parsing ['q_norm', 'input_layernorm', 'layer_norm1', 'cross_attn_post_attention_layernorm', 'layer_norm2', 'post_layernorm', 'cross_attn_input_layernorm', 'pre_feedforward_layernorm', 'k_norm', 'norm1', 'norm', 'ffn_norm', 'post_attention_layernorm', 'norm2', 'post_feedforward_layernorm', 'attention_norm']


Unsloth 2025.12.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Then, applying the model-specific formatter adds a 'text' attribute containing the actual content the Llama model will see:

In [4]:
dataset = format_dataset(tokenizer, dataset)

dataset[0]



Map:   0%|          | 0/397 [00:00<?, ? examples/s]

{'conversations': [{'content': '##Observation\nCopyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.\nZORK is a registered trademark of Infocom, Inc.\nRevision 88 / Serial number 840726\n\nWest of House\nYou are standing in an open field west of a white house, with a boarded front door.\nThere is a small mailbox here.\n\n\n\n##Inventory\nYou are empty-handed.\n\n',
   'role': 'user'},
  {'content': 'N', 'role': 'assistant'}],
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n##Observation\nCopyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.\nZORK is a registered trademark of Infocom, Inc.\nRevision 88 / Serial number 840726\n\nWest of House\nYou are standing in an open field west of a white house, with a boarded front door.\nThere is a small mailbox here.\n\n\n\n##Inventory\nYou are empty-handed.\n\n<|eot_id|><|start_h

After producing the dataset from the game environment, we can fine-tune the model using Unsloth as follows.
The boilerplate logic for fine-tuning is located under the modules directory.

In [5]:
# More detailed code for fine-tuning and model saving/loading is included in these modules.
# That code is based on this tutorial and notebook from Unsloth.
# https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/tutorial-how-to-finetune-llama-3-and-use-in-ollama
# https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb
from adventure.finetune import make_trainer

trainer = make_trainer(model, tokenizer, dataset, max_steps=10) # Limit number of steps for this example
trainer_stats = trainer.train()

#save_model(model, tokenizer, "lora_model_report")



Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/397 [00:00<?, ? examples/s]

Map (num_proc=36):   0%|          | 0/397 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 397 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,5.4152
2,5.7956
3,6.1994
4,5.7188
5,4.6013
6,3.3824
7,2.3518
8,2.7514
9,2.6008
10,1.804


In [6]:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)

from adventure.player import run_game
run_game(model, tokenizer, "./z-machine-games-master/jericho-game-suite/zork1.z5", 2, 10)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.


> E
The door is boarded and you can't remove the boards.


> E
The door is boarded and you can't remove the boards.


{'moves': 2, 'unique_rooms': 1, 'unique_hashes': 1, 'unique_items': 0, 'score': 0, 'max_score': 350, 'avg_retries': 1.0, 'avg_generate_time': 0.09915399551391602}


{'moves': 2,
 'unique_rooms': 1,
 'unique_hashes': 1,
 'unique_items': 0,
 'score': 0,
 'max_score': 350,
 'avg_retries': 1.0,
 'avg_generate_time': 0.09915399551391602}

This fine-tuning process is effective at training the model to output commands in the correct format. When trained for more steps than shown here, the model even reproduces some correct commands from the game walkthrough. However, this still does not allow the model to get very far through the game.

The main issue is that imitating the walkthrough does not necessarily teach the model how to choose its next action based on the state of the environment. Once the model deviates from the path it was taught, errors compound, because it was never shown how to recover from that situation.

### Reinforcement Learning

In an attempt to improve the generalizability of these models, we tried a reinforcement learning approach. Group Relative Policy Optimization (GRPO) is a fine-tuning method for LLMs, notably used by DeepSeek to train their reasoning models. Compared with a typical Reinforcement Learning from Human Feedback (RLHF) setup, it drops the separate value model by scoring each sampled response relative to the others in its group, and programmatic reward functions can stand in for a learned reward model.
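
The "group-relative" part can be sketched in a few lines. This assumes the commonly described formulation in which several responses are sampled for the same prompt and each reward is normalized by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are illustrative choices, not details taken from the training code used in this project.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four commands sampled from the same game state (made-up values).
print(group_relative_advantages([0.0, 0.0, 1.0, 5.0]))
```

Because the baseline comes from the group itself, no separate value network has to be trained or held in memory.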

Like other reinforcement learning approaches, GRPO still allows us to define a custom reward function. By setting various criteria, we can encourage commands that increase the score reported by the game environment, pick up items, etc., while discouraging commands that have no effect.
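
As an illustration of such criteria, the sketch below scores a single command from a summary of what changed after it was executed. The `StepOutcome` fields, the function name, and all of the weights are hypothetical; they show the shape of a reward function rather than the one used in our experiments.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    score_delta: int      # change in the game score after the command
    new_items: int        # number of items newly added to the inventory
    state_changed: bool   # False if the command had no effect on the world


def command_reward(
    outcome: StepOutcome,
    score_weight: float = 1.0,
    item_bonus: float = 0.5,
    noop_penalty: float = -0.25,
) -> float:
    """Combine score gain, item pickups, and a no-op penalty into one reward."""
    reward = score_weight * outcome.score_delta
    reward += item_bonus * outcome.new_items
    if not outcome.state_changed:
        reward += noop_penalty
    return reward


print(command_reward(StepOutcome(score_delta=5, new_items=1, state_changed=True)))
print(command_reward(StepOutcome(score_delta=0, new_items=0, state_changed=False)))
```

In a Jericho environment, these inputs could plausibly be derived by comparing the score, inventory, and world state before and after a step, though how best to detect a command with no effect is left open here.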

Applying GRPO on top of a previously fine-tuned model did marginally improve performance, but its effectiveness was still limited. One potential reason is that, unlike agents trained interactively in a framework like Gymnasium, our model only sees situations drawn from an input dataset, so the space of states it encounters during training remains constrained. One potential mitigation is to build that dataset by exploring the world beyond what the walkthrough describes.


Reference: [Unsloth reinforcement learning guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)


## Retrieval Augmented Generation

A recurring issue with language models is that they struggle to “connect the dots” between observations made at different points over the course of the game. Additionally, game worlds can be large enough that the context window may not be sufficient to hold all of the information relevant to a scenario the model encounters.

One potential avenue to address these deficiencies is Retrieval Augmented Generation (RAG), which allows a model to search a vector database for documents similar to a query. These similarities are computed from embeddings generated by a separate model. RAG can be further extended into a graph-based system that links individual documents. This allows more complex relationships to be described, which is useful in this task. For example, edges can describe pathways between locations or the objects contained within a location.

LangChain is a framework that can wrap LLM providers such as Ollama within an API designed for “agent”-based flows, in which a model carries out a task through sequences of tool calls.

Vending-Bench: https://arxiv.org/abs/2502.15840

In [None]:
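from langchain.agents import create_agent
from langchain_core.documents import Document
from langchain_core.tools import tool  # provides the @tool decorator used below
from ollama import ResponseError

# `model`, `env`, `game`, `remember`, `insert_into_vector_store`, and `n_steps`
# are assumed to be defined by earlier cells.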

def rag_agent():
    game_response, info = env.reset()

    done = False

    @tool(response_format="content")
    def do_game_action(action: str) -> str:
        """Perform an action in the active text adventure game and see the result"""
        # Args:
        #   action: game action string
        #
        # Returns:
        #   The game's response after performing the action
        nonlocal done, game_response, info
        game_response, reward, done, info = env.step(action)
        if done:
            game_response += '\nYou have finished the game!'
        return game_response
    
    @tool(response_format="content")
    def view_possible_actions() -> str:
        """View a list of the actions that can be performed in the game's current state"""
        # Returns:
        #   String containing actions separated by commas
        return ', '.join(env.get_valid_actions())        


    tools = [remember, do_game_action, view_possible_actions]
    system_prompt = (
        f"You are playing {game}, an interactive fiction game. You must analyze the scenario the game presents to you and choose an action that will make progress. Your goal is to finish the game.\n"
        "You have access to a tool that allows you to remember past events that have occurred in your current playthrough that are relevant to your situation. "
        "Use the tool to help you decide on the next action to take in-game."
    )
    agent = create_agent(model, tools, system_prompt=system_prompt)
    
    def agent_stream():
        nonlocal agent, game_response
        query = (
            "Think critically. Finish the game.\n"
            # f"Here are relevant items from your past moves:\n{remember.invoke({'query':game_response})}\n"
            f"Here is your current scenario:\n{game_response}"
        )
        try:
            for event in agent.stream(
                {"messages": [{"role": "user", "content": query}]},
                stream_mode="values",
            ):
                last_message = event["messages"][-1]
                last_message.pretty_print()
                
                document = Document(
                    page_content=last_message.content, metadata={"move": info['moves']}
                )
                insert_into_vector_store([document])
                yield None
        except ResponseError:
            print('ResponseError occurred')
            
    cur_stream = None
    def turn():
        nonlocal cur_stream, done, info
        if cur_stream is None:
            cur_stream = agent_stream()
        try:
            next(cur_stream)
        except StopIteration:
            cur_stream = agent_stream()
        return done, info
    
    results = n_steps(turn, env, 10)
    print(results)
    return results


## Graph-based RAG

As seen with Q-BERT, knowledge graphs can be useful to capture the overall state of a game. Individual locations can be represented as nodes on a graph, with directed edges between them indicating pathways. The graph can also link objects within these rooms. By representing connections in this way, spatial relationships can be captured in a way that is accessible to search algorithms, while abstracting away their complexity from the model itself.

We used LangChain's Graph RAG retriever, supported by the DataStax [graph-rag](https://datastax.github.io/graph-rag/) library, to treat our existing in-memory LangChain vector store as a knowledge graph in which edges are defined by matching metadata fields. The model still provides information about its current game state to the retrieval function, but the response now includes both the directly retrieved documents and k connected documents reached across graph edges.
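
The edge-following step can be illustrated without any library. The sketch below assumes each document carries an `exits` metadata list of neighboring document IDs, similar in spirit to the metadata visible in the retriever output further down. The room IDs, the connections between them, and the `expand_with_neighbors` helper are made up for illustration, and this is not how the graph-rag library is implemented internally.

```python
from collections import deque


def expand_with_neighbors(docs: dict[str, dict], start_ids: list[str], depth: int = 1) -> dict[str, int]:
    """Breadth-first expansion from the retrieved documents along `exits` edges.

    Returns a mapping of document ID to its distance from the nearest start document.
    """
    seen = {doc_id: 0 for doc_id in start_ids}
    queue = deque(start_ids)
    while queue:
        doc_id = queue.popleft()
        if seen[doc_id] >= depth:
            continue  # do not traverse beyond the requested depth
        for neighbor in docs[doc_id].get("exits", []):
            if neighbor in docs and neighbor not in seen:
                seen[neighbor] = seen[doc_id] + 1
                queue.append(neighbor)
    return seen


# Hypothetical room graph keyed by made-up IDs.
rooms = {
    "west": {"name": "West of House", "exits": ["north", "south"]},
    "north": {"name": "North of House", "exits": ["west", "behind"]},
    "south": {"name": "South of House", "exits": ["west", "behind"]},
    "behind": {"name": "Behind House", "exits": ["north", "south"]},
}
print(expand_with_neighbors(rooms, ["west"], depth=1))
```

Raising `depth` lets the retrieved context include rooms the player is not adjacent to, which is one way a model could plan routes beyond its immediate surroundings.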

In [4]:
from adventure.rag_bot import Game as RagGame
game = RagGame()
game.run_game()

RAG results: [Document(id='180', metadata={'_depth': 0, '_similarity_score': np.float64(1.0), 'name': 'West of House', 'exits': ['81', '80', '78'], 'exit_directions': ['north', 'south', 'west']}, page_content='West of House\nYou are standing in an open field west of a white house, with a boarded front door.\nThere is a small mailbox here.')]
# Knowledge

## Description for "West of House" location:
West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.

# Location

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.

# Inventory

You are empty-handed.

# Observation

Previous command: 

Result:
Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.

## Chain of Thought

Even when provided with more information about the state of the game and player, the model still lacks direction and seems to wander aimlessly, if not randomly. To combat this, we try separating turns into multiple steps, similar to the “analyze” step added in an earlier strategy. The approach also resembles Q-BERT, which uses a question-answer dataset. Asking the model pointed questions related to its goal and feeding the answers back to the model for its final inference forms a sort of [“Chain of Thought”](https://www.promptingguide.ai/techniques/cot), a common prompt-engineering strategy for improving model performance.

For each step in the game:
- The model describes its overarching goal and one short-term goal
- The model lists key new information, important aspects of its surroundings, and information about its tools, all centered around the goal it determined in the previous step
- The goal and listed assets are used to query the vector store to retrieve relevant game information from RAG memory as context
- The goal, assets, and context are provided to the model to generate the final action
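
The four steps above can be sketched as a single function. This is a minimal outline under stated assumptions: `play_turn`, `ask_model`, and `retrieve` are hypothetical names standing in for the LLM call and the vector-store query, and the prompt wording is illustrative rather than the text used in `adventure.multistep_rag_bot`.

```python
from typing import Callable


def play_turn(
    observation: str,
    ask_model: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> dict[str, str]:
    """Run one goal -> notes -> retrieval -> action turn."""
    # Step 1: have the model state its goals.
    goal = ask_model(
        f"{observation}\n\nDescribe your overarching goal and one short-term goal."
    )
    # Step 2: have the model list assets relevant to that goal.
    notes = ask_model(
        f"{observation}\n\nGoal: {goal}\n\n"
        "List key new information, important surroundings, and your tools."
    )
    # Step 3: use goal and notes as the retrieval query.
    context = "\n".join(retrieve(f"{goal}\n{notes}"))
    # Step 4: generate the final command from everything gathered so far.
    action = ask_model(
        f"Goal: {goal}\n\nNotes: {notes}\n\nKnowledge: {context}\n\n"
        "Reply with a single game command."
    )
    return {"goal": goal, "notes": notes, "context": context, "action": action.strip()}


# Stubbed example: canned replies stand in for a real model and vector store.
replies = iter(["Explore the house.", "A mailbox is here.", "open mailbox\n"])
result = play_turn(
    "West of House",
    ask_model=lambda prompt: next(replies),
    retrieve=lambda query: ["West of House: an open field west of a white house."],
)
print(result["action"])
```

Passing the model and retriever in as callables keeps the turn logic testable without a running LLM.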



In [1]:
from adventure.multistep_rag_bot import Game as CotGame
game = CotGame()
game.run_game()

# Last Action Result:
Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.



# Knowledge:
## Description for "West of House":
West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.

# Goal: My overarching goal is to explore the white house located west of my current position.

My single most pressing short-term goal is to open the front door of the house by removing or breaking the board covering it.

# Notes:
**Key New Information:** The presence of a small mailbox suggests that there may be some additional items or clues nearby.

**Important Aspects of Surroundings:**

* I'm standing in an open field west of a white house with a boarded front door.
* There's a small mailbox next to the bo

This produced interesting output: the LLM generated a thought process and tended to follow it, but even so it did not always choose productive goals. It did seem to promote better exploration of the environment. Refining what is stored in the vector database would likely be required to improve this, potentially allowing the model to set goals beyond its current room.

## Reality Check: GPT-5.2

We wanted to focus on strategies that would run on machines that we, or another average researcher, would have available, which meant using smaller models that could be hosted locally. However, we would be remiss if we didn't try a larger model, so we chose to try GPT-5.2 through OpenAI's API. It is advertised as a highly productive model for agentic workflows and task completion. We found that it made it farther through the game than anything else, reaching a best score of 40/350, whereas the other models peaked at 15/350. Even so, there was a lot of game left to play: GPT-5.2 still either died to the thief or got stuck in a loop of actions somewhere underground.

In [None]:
import os
import time

import jericho
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv();  # read OPENAI_API_KEY from a local .env file, if present

game = 'zork1.z5'
GAMES_DIR = "z-machine-games-master/jericho-game-suite"
env = jericho.FrotzEnv(f"{GAMES_DIR}/{game}")

In [None]:
client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)


In [None]:

score_tracker = ScoreTracker(env)

game_response, info = env.reset()
print(game_response)
done = False

conversation = [
    {"role": "system", "content": "You are the player. You are playing the interactive fiction game Zork. Don't attack the thief"}
]

while not done:
    
    conversation.append({"role": "user", "content": game_response})

    start = time.time()
    
    response = client.responses.create(
        model="gpt-5.2",
        input=conversation
    )

    reply = response.output_text
    print("GPT:", reply)
    conversation.append({"role": "assistant", "content": reply})
    
    game_response, reward, done, info = env.step(reply)
    print(game_response)
    
    end = time.time()
    score_tracker.update(info, start, end)
    score_tracker.get_stats(env, info)
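
The run above sometimes ended in a repeating cycle of actions. One way such a cycle could be flagged is sketched below. This is not part of our experiment: `LoopDetector`, `window`, and `max_repeats` are hypothetical names chosen for illustration, and the thresholds are arbitrary assumptions.

```python
from collections import deque


class LoopDetector:
    """Hypothetical helper: flags repeated (observation, action) pairs.

    Keeps only the most recent `window` pairs and reports a loop once the
    newest pair has occurred at least `max_repeats` times in that window.
    """

    def __init__(self, window=12, max_repeats=3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, observation, action):
        # Normalize whitespace and case so trivially different replies match
        key = (observation.strip(), action.strip().lower())
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats
```

In a loop like the one above, `record(game_response, reply)` could be called after each step, and a `True` result could be used to stop the episode or to add a hint to the conversation.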



### Evaluation of Existing Knowledge

In [None]:
without_copyright = '''West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.
'''
actual = embedding(without_copyright)

first_half = '''West of House
You are standing in an open field west of
'''

prompt = f'Below is the first half of the very first text prompt given by the text adventure game Zork. Please complete it exactly as it is given by Zork. Do not include anything except your completion.\n\n"{first_half}" (completion here...)'
response = client.responses.create(
    model="gpt-5.2",
    input=prompt,
)
reply = response.output_text
# Keep only the text after the last colon in the reply
rebuilt = first_half + ' ' + reply.split(':')[-1]
print(rebuilt)
cosine_similarity(embedding(rebuilt), actual)

GPT-5.2's reconstruction of the Zork introduction has the highest cosine similarity of any model we evaluated, and it often reproduces the game's text exactly. This suggests that its strong performance may be linked to having memorized much or all of Zork, in which case it may not be doing any actual reasoning when it plays.
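
For reference, cosine similarity between two embedding vectors can be sketched as follows. This is a minimal standalone version using only the standard library; `cosine_sim` is a name introduced here for illustration and is not the `cosine_similarity` helper used in the cell above, which may be implemented differently.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1 means the two embeddings point in almost the same direction, which is what we would expect when the reconstructed text is identical or nearly identical to the game's text.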

# Results

| Model | Configuration | Unique Rooms | Unique Items | Unique Game States | Score (out of 350) | Average Retries | Average Generate Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama3.2:3b | Basic LLaMa | 6 | 1 | 13 | 0 | 100+ | 1.85s |
| llama3.2:3b | Memory | 5 | 0 | 8 | 10 | 30 | 1.38s |
| llama3.2:3b | Memory & ChatGPT Prompt | 8 | 1 | 11 | 0 | 100+ | 0.255s |
| llama3.2:3b | Memory & Provided Commands | 8 | 5 | 53 | 10 | 6 | 0.15s |
| llama3.2:3b | Memory, Analyze, Provided Commands | 10 | 7 | 60 | 10 | 15.5 | 1.86s |
| llama3.2:3b | Chain of Thought RAG | 7 | 5 | 23 | 5 | 11 | 223.9s |



# Discussion

- Chain of Thought RAG: Takes a long time to run on average (223.9s per generation). The memory keeps growing without bound, so the context that must be processed before each generation gets longer and longer. We should have limited the memory size.
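
One simple way to cap memory size is sketched below. It is an illustration under stated assumptions rather than what we ran: `BoundedMemory` and `max_entries` are hypothetical names, and a plain fixed-length queue stands in for the vector database used in the Chain of Thought RAG configuration.

```python
from collections import deque


class BoundedMemory:
    """Hypothetical fixed-size memory: the oldest entries are dropped first."""

    def __init__(self, max_entries=20):
        # deque with maxlen discards the oldest item once the limit is reached
        self._entries = deque(maxlen=max_entries)

    def add(self, observation, action):
        self._entries.append((observation, action))

    def as_context(self):
        # Render the remembered turns as text to place in the prompt
        return "\n".join(
            f"> {action}\n{observation}" for observation, action in self._entries
        )

    def __len__(self):
        return len(self._entries)
```

With a cap like this, the amount of remembered text scales with `max_entries` rather than with the length of the play session, at the cost of forgetting early rooms and items.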

## Reverse Conjecture Map

![Reverse conjecture map](CS542-ConjectureMap.drawio.svg)

## Apportionment of Work

| Person | Tasks |
| --- | --- |
| Everett | Off-the-shelf LLM prototype, Q-BERT, LoRA, GRPO, Graph RAG |
| Tyson | Off-the-shelf LLMs, GPT-5 experiment, Retrieval Augmented Generation, Multi-Step Questioning |
