## Tutorial: Optimizing a Linear Policy for MountainCar using an LLM

This notebook shows how to use a Large Language Model (LLM), here GPT-4o, to perform policy search for a linear control policy in a Gymnasium reinforcement learning environment. We focus on the continuous-action MountainCar environment (`MountainCarContinuous-v0`) and use GPT-4o to search for the parameters of a simple linear policy that drives the car to its goal.

##### Environment: State and Action Variables

The MountainCar environment presents a classic control challenge framed as a deterministic Markov Decision Process (MDP). In this setup, the environment's future state depends solely on the current state and the action taken, not on the history of preceding states. The term "deterministic" signifies that a specific action performed in a particular state will consistently lead to the identical next state and reward. The core task involves an underpowered car, initially placed randomly in the valley between two hills. The objective is to drive the car to the goal located at the top of the right hill. Due to the car's limited engine power, it cannot ascend the steep slope directly. Instead, the agent must learn a strategy of driving back and forth, building momentum to eventually conquer the hill.

Control over the car is exerted by applying an external force at each discrete timestep. This force is determined by an action value, $a$, chosen from the continuous range [-1, 1]. This action value is then scaled by a constant factor (0.0015) to yield the actual physical force applied. A positive action propels the car rightward, while a negative action pushes it leftward. At every timestep, the agent observes the environment's current state, which is captured by a two-dimensional vector:
$$s = [x, v]$$
Here, $x$ represents the car's horizontal position (ranging from -1.2 to 0.6) and $v$ denotes its current velocity (ranging from -0.07 to 0.07). In Gymnasium's `MountainCarContinuous-v0`, the goal is reached once $x \geq 0.45$. The environment provides a reward signal at each step: a penalty of $-0.1 \cdot a^2$ that discourages large actions, plus a bonus of +100 upon reaching the goal.
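
To make the "build momentum" argument concrete, here is a minimal sketch of the car's one-step dynamics in plain Python. The constants and the update rule reflect my understanding of Gymnasium's `MountainCarContinuous-v0` and should be treated as assumptions to check against the environment source; `step_dynamics` and `reaches_goal` are hypothetical helper names, not part of this notebook's code.

```python
import math

# Assumed constants for MountainCarContinuous-v0 (verify against Gymnasium).
POWER = 0.0015          # scale factor applied to the action
GRAVITY_TERM = 0.0025   # strength of the slope term
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07


def step_dynamics(position, velocity, action):
    """Advance the car by one timestep and return (position, velocity)."""
    force = max(-1.0, min(1.0, action))  # actions outside [-1, 1] are clipped
    velocity += force * POWER - GRAVITY_TERM * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def reaches_goal(policy, start=-0.5, goal=0.45, max_steps=1000):
    """Run `policy(position, velocity) -> action` from rest and report success."""
    position, velocity = start, 0.0
    for _ in range(max_steps):
        position, velocity = step_dynamics(position, velocity, policy(position, velocity))
        if position >= goal:
            return True
    return False


def always_right(position, velocity):
    return 1.0  # full throttle toward the goal at every step


def follow_velocity(position, velocity):
    return 1.0 if velocity >= 0 else -1.0  # push in the direction of motion


print("always right reaches goal:", reaches_goal(always_right))
print("follow velocity reaches goal:", reaches_goal(follow_velocity))
```

Under these assumed constants, pushing right at every step leaves the car oscillating in the valley, while pushing in the direction of the current velocity pumps energy into the swing. This is the kind of behavior a velocity-weighted linear policy can express.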

##### Policy Representation

In reinforcement learning, the agent's behavior is dictated by a "policy," which essentially maps observed states to appropriate actions. For this specific problem, we adopt a straightforward **linear policy**. This implies that the action is calculated as a linear combination of the current state variables (position and velocity). This policy is parameterized by a weight matrix (in this case, a 2x1 vector) denoted as:
$$P = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$
Given a state $s = [x, v]$, the action $a$ is computed through a dot product:
$$a = s^T P = x \cdot w_1 + v \cdot w_2$$
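The weights $w_1$ and $w_2$ quantify the influence of the car's position and velocity, respectively, on the chosen action. The central aim of our optimization process is to determine the values of $w_1$ and $w_2$ that result in the highest possible total reward accumulated over an entire episode. Note that the configuration used later in this notebook sets `"bias": True`, which adds a bias term $b$ to the action ($a = s^T P + b$), so three parameters are optimized in practice.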
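
As a small worked example of the dot product above, the sketch below computes the action for one state. The function name `linear_action` and the numeric values are illustrative assumptions, not tuned parameters or code from this notebook.

```python
def linear_action(state, weights, bias=0.0):
    """Return a = x * w1 + v * w2 (+ bias) for a state [x, v]."""
    return sum(s * w for s, w in zip(state, weights)) + bias


state = [-0.5, 0.02]    # position x, velocity v (illustrative values)
weights = [0.4, 30.0]   # w1, w2 (illustrative, not tuned)

action = linear_action(state, weights)
print(f"action = {action:.3f}")  # -0.5 * 0.4 + 0.02 * 30.0
```

Because the velocity lies in a much narrower range than the position, $w_2$ typically needs a larger magnitude than $w_1$ to have a comparable effect on the action.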

##### Optimization Strategy: LLM-Driven Policy Search

The fundamental objective is to identify the optimal policy parameters $P$ that maximize the cumulative reward $R$ gathered over a complete episode, which consists of a sequence of steps from the start until termination (either reaching the goal or hitting the maximum step limit, e.g., 1000). This task is formally known as **Policy Search**. Mathematically, we seek to solve:
$$ \max_{P} \mathbb{E}\left[ \sum_{t=0}^{T} r_t \right]$$
where $r_t$ is the reward at timestep $t$ and $T$ is the episode length.

To facilitate this optimization, we utilize a "Replay Buffer." After each episode concludes, having been run with the current parameters $P$, the total reward $R$ is calculated. This $(P, R)$ pair is then stored in the buffer. We approach the relationship between the policy parameters $P$ and the resulting reward $R$ as a **black-box function**. This means the optimizer, which is the LLM in our case, operates without explicit knowledge of the MountainCar environment's internal physics or reward structure. It only observes the input parameters ($w_1, w_2$) and the corresponding output reward $R$.
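
The buffer described above can be sketched in a few lines. This is a hypothetical stand-in that only illustrates the idea of a bounded store of $(P, R)$ pairs; it is not the `EpisodeRewardBufferNoBias` class used later, whose implementation is not shown in this notebook.

```python
from collections import deque


class RewardBuffer:
    """Hypothetical bounded store of (parameters, reward) pairs."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=max_size)

    def add(self, params, reward):
        self.buffer.append((tuple(params), float(reward)))

    def best(self):
        """Return the (parameters, reward) pair with the highest reward."""
        return max(self.buffer, key=lambda pair: pair[1])

    def __len__(self):
        return len(self.buffer)


buffer = RewardBuffer(max_size=2)
buffer.add([0.1, 2.0], -80.0)
buffer.add([0.3, 5.0], -20.0)
buffer.add([-1.0, 4.0], -55.0)  # evicts the oldest pair
print(len(buffer), buffer.best())
```

The optimizer only ever sees the contents of this buffer, which is what makes the problem a black-box one from its point of view.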

##### LLM as the Optimizer

We harness the capabilities of GPT-4o to conduct this black-box optimization. The LLM is instructed via a prompt to function as an optimization assistant. The process begins with a "warmup" phase, where several episodes are run using randomly selected parameters $P$. The resulting $(P, R)$ pairs populate the Replay Buffer, providing initial data. Subsequently, the LLM is presented with a detailed prompt containing the optimization goal, the historical data from the Replay Buffer, output format instructions, and guidance on balancing exploration (trying novel parameters) versus exploitation (refining promising parameters), adapting this balance as the optimization progresses. Based on this prompt and the historical context (enabling in-context learning), the LLM proposes a new set of parameters $P$ anticipated to yield improved rewards.

The agent's policy is then updated with these suggested parameters, and one or more evaluation episodes are executed in the environment. The cumulative reward obtained from these evaluations is recorded, and the new $(P, \text{cumulative } R)$ pair is added to the Replay Buffer. This cycle of prompting the LLM, receiving parameter suggestions, evaluating the updated policy, and updating the buffer is repeated for a predetermined number of episodes (e.g., 400), allowing the LLM to iteratively refine the policy parameters towards optimality. The prompt design treats the task purely as optimizing an unknown function $f(w_1, w_2) = R$, guiding the LLM with hints on step size and search ranges but without revealing the underlying simulation details.
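
The propose, evaluate, record cycle can be sketched without any API access. In the sketch below, `black_box` is a toy quadratic standing in for "run evaluation episodes and return the mean reward", and `propose` is a simple random perturbation of the best point seen so far. Both are assumptions made purely for illustration; `propose` in particular is a placeholder for the LLM call and does not reflect how GPT-4o actually chooses parameters.

```python
import random


def black_box(params):
    """Toy stand-in for evaluating a policy; the maximum is 0 at (1.0, -2.0)."""
    w1, w2 = params
    return -((w1 - 1.0) ** 2) - (w2 + 2.0) ** 2


def propose(history, step_size, rng):
    """Placeholder for the LLM: perturb the best parameters seen so far."""
    best_params, _ = max(history, key=lambda pair: pair[1])
    candidate = [w + rng.gauss(0.0, step_size) for w in best_params]
    # Keep proposals in [-6.0, 6.0] with one decimal place, as the prompt requests.
    return [round(min(6.0, max(-6.0, w)), 1) for w in candidate]


def optimize(warmup=5, iterations=50, step_size=1.0, seed=0):
    rng = random.Random(seed)
    history = []
    # Warmup: random parameters populate the buffer before optimization starts.
    for _ in range(warmup):
        params = [round(rng.uniform(-6.0, 6.0), 1) for _ in range(2)]
        history.append((params, black_box(params)))
    # Main loop: propose from history, evaluate, and record the new pair.
    for _ in range(iterations):
        params = propose(history, step_size, rng)
        history.append((params, black_box(params)))
    return history


history = optimize()
best_params, best_reward = max(history, key=lambda pair: pair[1])
print(f"best params: {best_params}, best reward: {best_reward:.2f}")
```

Swapping `propose` for a call that sends the history to an LLM and parses its reply gives the structure used in the rest of this notebook.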

### Code Overview

The implementation follows a modular design. The **World** component (`ContinualSpaceGeneralWorld` in this notebook) wraps the standard Gymnasium environment, managing state transitions, reward accumulation, and episode termination. The **Agent** component (`LLMNumOptimAgent` in this notebook) integrates the learning elements. It includes the **`LinearPolicy`** module, which stores the policy weights ($w_1, w_2$) and computes actions based on states. It also contains the **`EpisodeRewardBufferNoBias`** module, responsible for maintaining the Replay Buffer of (weights, reward) pairs. Finally, the **`LLMBrain`** module orchestrates all interactions with the LLM, including prompt generation using Jinja2 templates, API communication (handling both OpenAI and Gemini models), and parsing the LLM's responses to extract the suggested new parameters.

##### Hyperparameters

Several hyperparameters govern the experiment's execution. `NUM_EPISODES` (e.g., 400) sets the total number of optimization iterations. `RENDER_MODE` controls environment visualization. `MAX_TRAJ_COUNT` (e.g., 1000) defines the Replay Buffer size, influencing the historical context available to the LLM. `MAX_TRAJ_LENGTH` (e.g., 1000) sets the maximum steps per episode. `LLM_MODEL_NAME` specifies the LLM used. `NUM_EVALUATION_EPISODES` (e.g., 20) determines how many runs are averaged to evaluate a new policy. `WARMUP_EPISODES` (e.g., 20) sets the number of initial random runs. `SEARCH_STD` (e.g., 1.0) provides a hint to the LLM regarding the step size for parameter exploration.

##### Training Loop
<p style="text-align:center;">
<img src="./static/training_loop.drawio.svg" alt="image">
</p>


The `run_training_loop` function orchestrates the process. It initializes the World and Agent components and performs the initial warmup runs if no warmup directory is supplied, populating the replay buffer. Then it enters the main loop, iterating `NUM_EPISODES` times. In each iteration it calls `agent.train_policy`, which queries the LLM for updated policy parameters based on the replay buffer history, evaluates the new policy over `NUM_EVALUATION_EPISODES` rollouts, and adds the new (parameters, mean reward) pair to the replay buffer. Each iteration is logged.

##### Output Structure

The training process generates structured logs. A main log directory contains subdirectories for each episode (`episode_*`) and potentially a `warmup/` directory. Each episode directory stores logs of evaluation trajectories, the parameters suggested by the LLM for that episode (`parameters.txt`), and the full LLM interaction including its reasoning (`parameters_reasoning.txt`). The final notebook cells typically include code for visualizing the learned policy in action and plotting the reward curve over episodes, illustrating the learning progress.

## Running the Notebook

To run the full experiment:
1. Ensure all dependencies are installed (OpenAI API, Google Generative AI, NumPy, Matplotlib, Gymnasium)
2. Set your API keys as environment variables
   ```bash
   export OPENAI_API_KEY=your_key_here
   export GEMINI_API_KEY=your_key_here
   ```

In [None]:
# !pip install -r requirements.txt
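# `!export` runs in a subshell and does not persist, so use the %env magic instead.
# Replace your_key_here with your OpenAI API key (no quotes):
# %env OPENAI_API_KEY=your_key_here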

In [None]:
from utils import *

The following list describes the key hyperparameters that control the experiment. Hyperparameters are not learned by the agent; they are set by the user before training begins, and they significantly influence both the learning process and the agent's final performance.

`NUM_EPISODES` (e.g., 400): Defines the total number of optimization iterations or training episodes the agent will go through. A higher number allows for more learning but increases computation time.

`RENDER_MODE` (e.g., None): Controls how the environment is visualized during execution. Options typically include 'human' (real-time window), 'rgb_array' (returns a pixel array, useful for recording), or None (no visualization, fastest for training).

`MAX_TRAJ_COUNT` (e.g., 1000): Sets the maximum size of the Replay Buffer. This buffer stores (policy parameters, reward) pairs, and its size determines how much historical data the LLM has access to when making decisions.

`MAX_TRAJ_LENGTH` (e.g., 1000): Specifies the maximum number of steps allowed in a single episode. If the agent doesn't reach a terminal state within these steps, the episode is truncated.

`LLM_MODEL_NAME` (e.g., "gemini-2.5-flash-preview-04-17"): Specifies which Large Language Model will be used as the optimizer. The comment in the hyperparameter cell lists several compatible model names.

`NUM_EVALUATION_EPISODES` (e.g., 20): Determines how many times a newly proposed policy is run in the environment to get an average measure of its performance. Averaging helps to reduce variance in the reward signal.

`WARMUP_EPISODES` (e.g., 20): Sets the number of initial episodes run with randomly generated policy parameters. This "warmup" phase populates the Replay Buffer with some initial data points before the LLM starts optimizing.

`SEARCH_STD` (e.g., 1.0): Provides a hint to the LLM regarding the standard deviation or step size it should consider when exploring new parameter values, especially during the initial exploration phase.

In [None]:
NUM_EPISODES=400 # Total number of episodes to train for
RENDER_MODE=None # Choose from 'human', 'rgb_array', or None
MAX_TRAJ_COUNT=1000 # Maximum number of trajectories to store in buffer for prompt
MAX_TRAJ_LENGTH=1000 # Maximum number of steps in a trajectory
LLM_MODEL_NAME="gpt-4o" # LLM for optimization, choose from "o1-preview", "gpt-4o", "gemini-2.0-flash-exp", "gpt-4o-mini", "gemini-1.5-flash", "gemini-1.5-flash-8b", "gemini-1.5-pro", "gemini-2.5-pro-preview-05-06", "gemini-2.5-flash-preview-04-17", "o3-mini-2025-01-31", "gpt-4o-2024-11-20", "gpt-4o-2024-08-06", "claude-3-7-sonnet-20250219"

NUM_EVALUATION_EPISODES=20 # Number of episodes to generate agent rollouts for evaluation
WARMUP_EPISODES=20 # Number of randomly generated initial episodes
SEARCH_STD=1.0 # Step size for LLM to search for optimal parameters during exploration

In [None]:
mountaincar_params = {
    "num_episodes": NUM_EPISODES,
    "gym_env_name": "MountainCarContinuous-v0",
    "render_mode": RENDER_MODE,
    "logdir": "logs/mountaincar_continuous_tutorial",
    "dim_actions": 1,
    "dim_states": 2,
    "max_traj_count": MAX_TRAJ_COUNT,
    "max_traj_length": MAX_TRAJ_LENGTH,
    "llm_model_name": LLM_MODEL_NAME,
    "num_evaluation_episodes": NUM_EVALUATION_EPISODES,
    "warmup_episodes": WARMUP_EPISODES,
    "warmup_dir": None,
    "bias": True,
    "rank": None,
    "optimum": 100,
    "search_step_size": SEARCH_STD
}

The cell below defines the template for the black-box optimization prompt. The template uses variables defined in the code to set the number of parameters to optimize, the expected global optimum of the function, the step size, the current step count, and the history of (parameter, reward) tuples. Note that the iteration budget (400) is hard-coded in the template text, so it should be kept in sync with `NUM_EPISODES`.

In [None]:
LLM_SI_TEMPLATE_STRING = """
You are a good global optimizer, helping me find the global maximum of a mathematical function f(params).
I will give you the function evaluation and the current iteration number at each step. 
Your goal is to propose input values that efficiently lead us to the global maximum within a limited number of iterations (400). 

# Regarding the parameters **params**:
**params** is an array of {{ rank }} float numbers.
**params** values are in the range of [-6.0, 6.0] with 1 decimal place.

# Here's how we'll interact:
1. I will first provide MAX_STEPS (400) along with a few training examples.
2. You will provide your response in the following exact format:
    * Line 1: a new input 'params[0]: , params[1]: , params[2]: ,..., params[{{ rank - 1 }}]: ', aiming to maximize the function's value f(params). 
    Please propose params values in the range of [-6.0, 6.0], with 1 decimal place.
    * Line 2: detailed explanation of why you chose that input.
3. I will then provide the function's value f(params) at that point, and the current iteration.
4. We will repeat steps 2-3 until we reach the maximum number of iterations.

# Remember:
1. **Do not propose previously seen params.**
2. **The global optimum should be around {{ optimum }}.** If you are below that, this is just a local optimum. You should explore instead of exploiting.
3. Search both positive and negative values. **During exploration, use search step size of {{ step_size }}**.


Next, you will see examples of params and f(params) pairs.
{{ episode_reward_buffer_string }}

Now you are at iteration {{step_number}} out of 400. Please provide the results in the indicated format. Do not provide any additional texts."""


llm_si_template = Template(LLM_SI_TEMPLATE_STRING)
llm_output_conversion_template = llm_si_template
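
To see what the `{{ episode_reward_buffer_string }}` slot of the template receives, the sketch below formats a small history in the same line format that `str_nd_examples` produces later in this notebook. The helper name `format_history` and the sample values are illustrative assumptions.

```python
def format_history(pairs):
    """Format (params, reward) pairs as 'params[i]: v; ... f(params): r' lines."""
    text = ""
    for params, reward in pairs:
        fields = "".join(f"params[{i}]: {p:.5g}; " for i, p in enumerate(params))
        text += f"{fields}f(params): {reward:.2f}\n"
    return text


# Made-up history: two parameter vectors (w1, w2, bias) and their rewards.
sample_history = [
    ([1.5, -2.0, 0.3], -87.3456),
    ([0.2, 4.1, -0.7], 12.0),
]
print(format_history(sample_history), end="")
```

Each history entry becomes one line of the prompt, so the prompt length grows linearly with the number of pairs kept in the buffer.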

### World

`ContinualSpaceGeneralWorld` is a wrapper class around Gymnasium environments that gives agents a standardized interface.

In [None]:
# MountainCarContinuous-v0
# https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/

import gymnasium as gym


class ContinualSpaceGeneralWorld():
    def __init__(
        self,
        gym_env_name,
        render_mode,
        max_traj_length=1000,
    ):
        assert render_mode in ["human", "rgb_array", None]

        self.env = gym.make(gym_env_name, render_mode=render_mode)
        self.gym_env_name = gym_env_name
        self.render_mode = render_mode
        self.steps = 0
        self.accu_reward = 0
        self.max_traj_length = max_traj_length
        if isinstance(self.env.action_space, gym.spaces.Discrete):
            self.discretize = True
        else:
            self.discretize = False

    def reset(self, new_reward=False):
        """ This method resets the environment to its initial state.
        If `new_reward` is True, it initializes the environment with a different reward structure.
        """
        del self.env
        if not new_reward:
            self.env = gym.make(self.gym_env_name, render_mode=self.render_mode)
        else:
            self.env = gym.make(self.gym_env_name, render_mode=self.render_mode, healthy_reward=0)

        state, _ = self.env.reset()
        self.steps = 0
        self.accu_reward = 0
        return state

    def step(self, action):
        """
        This method executes a step in the environment with the given action.
        It updates the environment state, accumulates the reward, and checks if the episode is done.
        """
        self.steps += 1
        action = action[0]
        state, reward, terminated, truncated, _ = self.env.step(action)
        self.accu_reward += reward

        if self.steps >= self.max_traj_length or terminated or truncated:
            done = True
        else:
            done = False

        return state, reward, done

    def get_accu_reward(self):
        """
        This method returns the accumulated reward for the current episode.
        """
        return self.accu_reward


### Sub Modules

`EpisodeRewardBufferNoBias`: Stores and manages the collection of (policy parameters, reward) pairs, acting as the replay buffer.

`LinearPolicy`: Implements a linear policy where the action is computed as a dot product of the state and weights, plus a bias term: $a = s^T W + b$.

`LinearPolicyNoBias`: Implements a linear policy without a bias term: $a = s^T W$.

`LLMBrain`: Coordinates with the LLM to get new parameters for the policy based on existing policy (parameter, reward) pairs.

In [None]:
class LinearPolicy():
    """
    Linear policy for continuous action space. The policy is represented as a (2,1) matrix of weights.
    Next action is calculated as the dot product of the state and the weight matrix.
    state.T * weight + bias -> action
    (1,2) * (2,1) + (1,1) -> (1,1)
    """
    def __init__(self, dim_states, dim_actions):

        self.dim_states = dim_states
        self.dim_actions = dim_actions

        self.weight = np.random.rand(self.dim_states, self.dim_actions)
        self.bias = np.random.rand(1, self.dim_actions)

    def initialize_policy(self):
        self.weight = np.round(np.random.normal(0., 3., size=(self.dim_states, self.dim_actions)), 1)
        self.bias = np.round(np.random.normal(0., 3., size=(1, self.dim_actions)), 1)

    def get_action(self, state):
        state = state.T
        return np.matmul(state, self.weight) + self.bias

    def __str__(self):
        output = "Weights:\n"
        for w in self.weight:
            output += ", ".join([str(i) for i in w])
            output += "\n"

        output += "Bias:\n"
        for b in self.bias:
            output += ", ".join([str(i) for i in b])
            output += "\n"

        return output

    def update_policy(self, weight_and_bias_list):
        if weight_and_bias_list is None:
            return

        weight_and_bias_list = np.array(weight_and_bias_list).reshape(self.dim_states + 1, self.dim_actions)
        self.weight = np.array(weight_and_bias_list[:-1])
        self.bias = np.expand_dims(np.array(weight_and_bias_list[-1]), axis=0)

    def get_parameters(self):
        parameters = np.concatenate((self.weight, self.bias), axis=0)
        return parameters

### Agent

The cell below defines the core agent wrapper. It is responsible for managing the policy, interacting with the world, and coordinating with the `LLMBrain` to learn.

In [None]:
class LLMNumOptimAgent:
    def __init__(
        self,
        logdir,
        dim_action,
        dim_state,
        max_traj_count,
        max_traj_length,
        llm_si_template,
        llm_output_conversion_template,
        llm_model_name,
        num_evaluation_episodes,
        bias,
        optimum,
        search_step_size,
    ):
        self.start_time = time.process_time()
        self.api_call_time = 0
        self.total_steps = 0
        self.total_episodes = 0
        self.dim_action = dim_action
        self.dim_state = dim_state
        self.bias = bias
        self.optimum = optimum
        self.search_step_size = search_step_size

        if not self.bias:
            param_count = dim_action * dim_state
        else:
            param_count = dim_action * dim_state + dim_action
        self.rank = param_count

        # Initialize the policy and replay buffer
        if not self.bias:
            self.policy = LinearPolicyNoBias(
                dim_actions=dim_action, dim_states=dim_state
            )
        else:
            self.policy = LinearPolicy(dim_actions=dim_action, dim_states=dim_state)
        self.replay_buffer = EpisodeRewardBufferNoBias(max_size=max_traj_count)
        self.llm_brain = LLMBrain(
            llm_si_template, llm_output_conversion_template, llm_model_name
        )
        self.logdir = logdir
        self.num_evaluation_episodes = num_evaluation_episodes
        self.training_episodes = 0

        if self.bias:
            self.dim_state += 1

    def rollout_episode(self, world, logging_file, record=True):
        """Simulates an episode in the environment using the current policy."""
        state = world.reset()
        state = np.expand_dims(state, axis=0)
        logging_file.write(
            f"{', '.join([str(x) for x in self.policy.get_parameters().reshape(-1)])}\n"
        )
        logging_file.write(f"parameter ends\n\n")
        logging_file.write(f"state | action | reward\n")
        done = False
        step_idx = 0
        while not done:
            action = self.policy.get_action(state.T)
            action = np.reshape(action, (1, self.dim_action))
            if world.discretize:
                action = np.argmax(action)
                action = np.array([action])
            next_state, reward, done = world.step(action)
            logging_file.write(f"{state.T[0]} | {action[0]} | {reward}\n")
            state = next_state
            step_idx += 1
            self.total_steps += 1
        logging_file.write(f"Total reward: {world.get_accu_reward()}\n")
        self.total_episodes += 1
        if record:
            self.replay_buffer.add(
                self.policy.get_parameters(), world.get_accu_reward()
            )
        return world.get_accu_reward()

    def random_warmup(self, world, logdir, num_episodes):
        """Populates the replay buffer by rolling out randomly initialized policies."""
        for episode in range(num_episodes):
            self.policy.initialize_policy()
            # Run the episode and collect the trajectory
            print(f"Rolling out warmup episode {episode}...")
            logging_filename = f"{logdir}/warmup_rollout_{episode}.txt"
            # The context manager closes the log file even if the rollout raises
            with open(logging_filename, "w") as logging_file:
                result = self.rollout_episode(world, logging_file)
            print(f"Result: {result}")

    def train_policy(self, world, logdir):
        """Core method to train single iteration of the policy using LLM optimization."""

        def parse_parameters(input_text):
            # This regex looks for integers or floating-point numbers (including optional sign)
            s = input_text.split("\n")[0]
            print("response:", s)
            pattern = re.compile(r"params\[(\d+)\]:\s*([+-]?\d+(?:\.\d+)?)")
            matches = pattern.findall(s)

            # Convert matched strings to float (or int if you prefer to differentiate)
            results = []
            for match in matches:
                results.append(float(match[1]))
            print(results)
            assert len(results) == self.rank
            return np.array(results).reshape(-1)

        def str_nd_examples(replay_buffer: EpisodeRewardBufferNoBias, n):

            all_parameters = []
            for weights, reward in replay_buffer.buffer:
                parameters = weights
                all_parameters.append((parameters.reshape(-1), reward))

            text = ""
            for parameters, reward in all_parameters:
                l = ""
                for i in range(n):
                    l += f"params[{i}]: {parameters[i]:.5g}; "
                fxy = reward
                l += f"f(params): {fxy:.2f}\n"
                text += l
            return text

        # Ask the LLM for new parameters based on the replay buffer history
        print("Updating the policy...")
        new_parameter_list, reasoning, api_time = self.llm_brain.llm_update_parameters_num_optim(
            str_nd_examples(self.replay_buffer, self.rank),
            parse_parameters,
            self.training_episodes,
            self.rank,
            self.optimum,
            self.search_step_size
        )
        self.api_call_time += api_time

        print(self.policy.get_parameters().shape)
        print(new_parameter_list.shape)

        self.policy.update_policy(new_parameter_list)
        
        print(self.policy.get_parameters().shape)
        logging_q_filename = f"{logdir}/parameters.txt"
        logging_q_file = open(logging_q_filename, "w")
        logging_q_file.write(str(self.policy))
        logging_q_file.close()
        
        q_reasoning_filename = f"{logdir}/parameters_reasoning.txt"
        q_reasoning_file = open(q_reasoning_filename, "w")
        q_reasoning_file.write(reasoning)
        q_reasoning_file.close()
        print("Policy updated!")

        # Roll out the updated policy and average the return over the evaluation episodes
        print(f"Rolling out episode {self.training_episodes}...")
        logging_filename = f"{logdir}/training_rollout.txt"
        results = []
        with open(logging_filename, "w") as logging_file:
            for _ in range(self.num_evaluation_episodes):
                # record=False: only the averaged result is stored in the buffer below
                results.append(self.rollout_episode(world, logging_file, record=False))
        print(f"Results: {results}")
        result = np.mean(results)
        self.replay_buffer.add(new_parameter_list, result)

        self.training_episodes += 1

        _cpu_time = time.process_time() - self.start_time
        _api_time = self.api_call_time
        _total_episodes = self.total_episodes
        _total_steps = self.total_steps
        _total_reward = result
        return _cpu_time, _api_time, _total_episodes, _total_steps, _total_reward
    

    def evaluate_policy(self, world, logdir):
        """Rolls out the current policy without recording to the replay buffer."""
        results = []
        for idx in range(self.num_evaluation_episodes):
            logging_filename = f"{logdir}/evaluation_rollout_{idx}.txt"
            # The context manager closes each log file once its rollout finishes
            with open(logging_filename, "w") as logging_file:
                result = self.rollout_episode(world, logging_file, record=False)
            results.append(result)
        return results
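
The parsing step inside `train_policy` is easiest to understand in isolation. The sketch below applies the same regular expression to a made-up LLM reply; the function name `parse_first_line` and the sample reply are illustrative assumptions, and the sketch raises `ValueError` where the notebook code uses an `assert`.

```python
import re

# Same pattern as in `parse_parameters`: an index, a colon, then a signed decimal.
PARAM_PATTERN = re.compile(r"params\[(\d+)\]:\s*([+-]?\d+(?:\.\d+)?)")


def parse_first_line(response, expected_count):
    """Extract the parameter values from the first line of an LLM reply."""
    first_line = response.split("\n")[0]
    values = [float(value) for _, value in PARAM_PATTERN.findall(first_line)]
    if len(values) != expected_count:
        raise ValueError(f"expected {expected_count} values, got {len(values)}")
    return values


# Made-up reply in the format the prompt asks for.
reply = (
    "params[0]: 1.5, params[1]: -2.0, params[2]: 0.3\n"
    "I increased params[0] because nearby points scored higher."
)
print(parse_first_line(reply, expected_count=3))
```

Two limitations of this pattern are worth knowing: a value written without a leading digit (such as `.5`) is not matched at all, and a value in scientific notation (such as `1e-3`) is matched only up to the `e`. A count mismatch triggers the retry logic in the training loop, but a silently truncated value would not.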


The cell below orchestrates the entire training process from initialization to completion. The `run_training_loop` function starts by initializing the world and agent instances. It then runs a set of warmup episodes to populate the initial replay buffer for the optimizer (or loads them from `warmup_dir` if one is given). Finally, it runs the training loop for the specified number of episodes, optimizing the policy parameters.

In [None]:
def run_training_loop(
    num_episodes,
    gym_env_name,
    render_mode,
    logdir,
    dim_actions,
    dim_states,
    max_traj_count,
    max_traj_length,
    llm_model_name,
    num_evaluation_episodes,
    warmup_episodes,
    warmup_dir,
    bias=None,
    rank=None,
    optimum=100,
    search_step_size=SEARCH_STD,
):
    world = ContinualSpaceGeneralWorld(
        gym_env_name,
        render_mode,
        max_traj_length,
    )

    agent = LLMNumOptimAgent(
        logdir,
        dim_actions,
        dim_states,
        max_traj_count,
        max_traj_length,
        llm_si_template,
        llm_output_conversion_template,
        llm_model_name,
        num_evaluation_episodes,
        bias,
        optimum,
        search_step_size,
    )
    print('init done')

    if not warmup_dir:
        warmup_dir = f"{logdir}/warmup"
        os.makedirs(warmup_dir, exist_ok=True)
        agent.random_warmup(world, warmup_dir, warmup_episodes)
    else:
        agent.replay_buffer.load(warmup_dir)
    
    overall_log_file = open(f"{logdir}/overall_log.txt", "w")
    overall_log_file.write("Iteration, CPU Time, API Time, Total Episodes, Total Steps, Total Reward\n")
    overall_log_file.flush()
    for episode in range(num_episodes):
        print(f"Episode: {episode}")
        # create log dir
        curr_episode_dir = f"{logdir}/episode_{episode}"
        print(f"Creating log directory: {curr_episode_dir}")
        os.makedirs(curr_episode_dir, exist_ok=True)
        
        for trial_idx in range(5):
            try:
                cpu_time, api_time, total_episodes, total_steps, total_reward = agent.train_policy(world, curr_episode_dir)
                overall_log_file.write(f"{episode + 1}, {cpu_time}, {api_time}, {total_episodes}, {total_steps}, {total_reward}\n")
                overall_log_file.flush()
                print(f"{trial_idx + 1}th trial attempt succeeded in training")
                break
            except Exception as e:
                print(
                    f"{trial_idx + 1}th trial attempt failed with error in training: {e}"
                )
                traceback.print_exc()

                if trial_idx == 4:
                    print(f"All {trial_idx + 1} trials failed. Train terminated")
                    exit(1)
                continue
    overall_log_file.close()


In [None]:
run_training_loop(**mountaincar_params)

### Policy Visualization

In [None]:
def run_policy(*args, **kwargs):
    agent = LLMNumOptimAgent(
        kwargs['logdir'],
        dim_action=kwargs['dim_actions'],
        dim_state=kwargs['dim_states'],
        max_traj_count=kwargs['max_traj_count'],
        max_traj_length=kwargs['max_traj_length'],
        llm_si_template=llm_si_template,
        llm_output_conversion_template=llm_output_conversion_template,
        llm_model_name=kwargs['llm_model_name'],
        num_evaluation_episodes=kwargs['num_evaluation_episodes'],
        bias=kwargs['bias'],
        optimum=kwargs['optimum'],
        search_step_size=SEARCH_STD
    )

    world = ContinualSpaceGeneralWorld(
        kwargs['gym_env_name'],
        render_mode="rgb_array",
        max_traj_length=kwargs['max_traj_length']
    )

    parameter_filename = os.path.join(kwargs['logdir'], kwargs['episode_dir'], "parameters.txt")
    parameters = []
    with open(parameter_filename, "r") as f:
        lines = f.readlines()
        for line in lines:
            if "parameter ends" in line:
                break
            try:
                parameters.append([float(x) for x in line.split(",")])
            except ValueError:
                # Skip non-numeric lines such as the "Weights:" and "Bias:" headers
                continue
        parameters = np.array(parameters)

    agent.policy.update_policy(parameters)    
    state = world.reset()
    state = np.expand_dims(state, axis=0)

    done = False
    step_idx = 0
    frames = []  # List to store frames for GIF generation

    while not done:
        img = world.env.render()
        # Store the rendered frame as an array for GIF generation
        frames.append(np.asarray(img))

        action = agent.policy.get_action(state.T)
        action = np.reshape(action, (1, kwargs['dim_actions']))
        next_state, reward, done = world.step(action)
        state = next_state
        step_idx += 1

    # Save frames as a GIF
    gif_filename = os.path.join(kwargs['logdir'], kwargs['episode_dir'], "policy_visualization.gif")
    imageio.mimsave(gif_filename, frames, fps=30)

    return gif_filename

In [None]:
from IPython.display import HTML

episode_399_gif = run_policy(**mountaincar_params, episode_dir="episode_399")

display(HTML("<h3>Episode 399</h3>"))
display(HTML(f'<img src="{episode_399_gif}" style="width: 100%; max-width: 800px;">'))

cpu_times, api_times, total_episodes, total_steps, total_rewards = read_file(find_file(mountaincar_params['logdir']))
plot_data(total_episodes, total_rewards, "ProPS Optimization on MountainCar")

### Tutorial Extension: Optimizing a Linear Policy for Swimmer-v5 using an LLM

This section extends the tutorial by applying the same LLM-driven optimization to a more complex task: the Swimmer-v5 environment from the MuJoCo physics simulation suite, available through Gymnasium. As with MountainCar, we use a Large Language Model (LLM), such as GPT-4o or a Gemini model, to perform Policy Search. The aim is to find parameters for a linear policy that let the swimmer achieve efficient forward locomotion.

##### Environment: State and Action Variables

The Swimmer-v5 environment presents a challenging continuous control problem. The agent is a simple swimmer composed of three rigid links connected by two actuated rotational joints (rotors). This chain-like structure is simulated in a viscous fluid. The swimmer's objective is to move forward (along the positive x-axis) as quickly as possible by applying torques to its two rotors. The interaction with the fluid and the multi-link dynamics make this a non-trivial control task. Like MountainCar, this environment can be modeled as a Markov Decision Process (MDP), where the next state and reward are determined by the current state and the action taken. MuJoCo environments are generally deterministic given the same initial conditions and actions.

At each timestep, the agent receives an observation of the environment's current state. For Swimmer-v5, this state is represented by an 8-dimensional continuous vector:
$$ S = [q_{tip}, q_{rotor1}, q_{rotor2}, v_x, v_y, \omega_{tip}, \omega_{rotor1}, \omega_{rotor2} ] $$
Where:
*   $q_{tip}$: Angle of the front tip (the first link).
*   $q_{rotor1}$: Angle of the first rotor.
*   $q_{rotor2}$: Angle of the second rotor.
*   $v_x$: Velocity of the tip along the x-axis (forward direction).
*   $v_y$: Velocity of the tip along the y-axis.
*   $\omega_{tip}$: Angular velocity of the front tip.
*   $\omega_{rotor1}$: Angular velocity of the first rotor.
*   $\omega_{rotor2}$: Angular velocity of the second rotor.

Control over the swimmer is exerted by applying torques to its two rotors. The action $A$ is a 2-dimensional continuous vector:
$$ A = [ \tau_1, \tau_2 ] $$
Here, $\tau_1$ is the torque applied to the first rotor, and $\tau_2$ is the torque applied to the second rotor. Both torques are bounded to the range [-1, 1]. The reward in Swimmer-v5 is driven mainly by the forward velocity ($v_x$), which encourages the agent to swim quickly in the target direction, minus a small control cost that discourages excessive torque.
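To make the reward structure concrete, here is a minimal sketch of a per-step reward of this form. The function name `swimmer_step_reward` is hypothetical, and the two weights are assumptions based on the defaults listed in the Gymnasium Swimmer documentation; check them against your installed version.

```python
import numpy as np

# Assumed default weights (from the Gymnasium Swimmer documentation);
# these are not taken from this notebook's code.
FORWARD_REWARD_WEIGHT = 1.0
CTRL_COST_WEIGHT = 1e-4


def swimmer_step_reward(x_velocity, action):
    """Sketch of a per-step reward: forward progress minus a quadratic torque penalty."""
    action = np.asarray(action, dtype=float)
    forward_reward = FORWARD_REWARD_WEIGHT * x_velocity
    ctrl_cost = CTRL_COST_WEIGHT * np.sum(np.square(action))
    return float(forward_reward - ctrl_cost)


# Full torque on both rotors costs very little compared with forward progress.
example_reward = swimmer_step_reward(x_velocity=0.5, action=[1.0, -1.0])
print(example_reward)
```

With these weights the control cost is tiny relative to the forward term, so a policy is rewarded mostly for how fast it moves along x.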

##### Policy Representation

Consistent with the approach for MountainCar, we will employ a **linear policy** for the Swimmer. The action (torques) will be a linear combination of the observed state variables. Given the 8 state variables and 2 action variables, the policy will be parameterized by a weight matrix $W$ of shape 8x2:
$$ W = \begin{bmatrix}
w_{1,1} & w_{1,2} \\
w_{2,1} & w_{2,2} \\
\vdots & \vdots \\
w_{8,1} & w_{8,2}
\end{bmatrix} $$
Given a state vector $S$ (an 8x1 column vector), the action $A$ is computed as a 1x2 row vector:
$$ A = S^T W $$
Each element is $A_j = \sum_{i=1}^{8} S_i \cdot w_{i,j}$, so the weight $w_{i,j}$ determines the influence of the $i$-th state variable on the $j$-th action (torque). The optimization goal is to find the 16 weights in $W$ that maximize the total reward accumulated over an episode. If the policy also carries a bias term per action (the configuration below sets `"bias": True`), those extra parameters are optimized alongside $W$.
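The matrix product above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the notebook's own policy class: the function `linear_policy_action` is hypothetical, and both the optional bias and the clipping to [-1, 1] are assumptions added for the example.

```python
import numpy as np


def linear_policy_action(state, weights, bias=None):
    """Compute A = S^T W (plus an optional bias) and clip to the torque range."""
    state_row = np.asarray(state, dtype=float).reshape(1, -1)  # shape (1, 8)
    action = state_row @ weights                               # shape (1, 2)
    if bias is not None:
        action = action + bias                                 # assumed bias of shape (1, 2)
    return np.clip(action, -1.0, 1.0)                          # assumed clipping step


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 2))   # 16 weights, one column per torque
S = rng.normal(size=8)        # one observation of the 8 state variables
A = linear_policy_action(S, W)
print(A.shape)  # (1, 2)
```

Each column of `W` produces one torque, so changing a single entry $w_{i,j}$ only affects how state variable $i$ drives rotor $j$.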

##### Optimization Strategy: LLM-Driven Policy Search

The core objective remains to find the optimal policy parameters $W$ that maximize the expected cumulative reward $R$ over an episode. This is a Policy Search problem:
$$ \max_{W} \mathbb{E}\left[ \sum_{t=0}^{T} r_t \right]$$
We will again use a Replay Buffer to store pairs of (policy parameters $W$, total reward $R$). The LLM will treat the relationship $f(W) = R$ as a **black-box function**, learning to propose better parameters based on observed performance without direct knowledge of the Swimmer's physics or complex dynamics.
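As a rough illustration of what such a buffer might look like, the sketch below stores $(W, R)$ pairs and renders them as text that could be placed in a prompt. The class `ReplayBuffer`, its methods, and the text layout are all hypothetical; the notebook's own buffer and prompt format may differ.

```python
import numpy as np


class ReplayBuffer:
    """Hypothetical minimal buffer of (parameters, total reward) pairs."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = []

    def add(self, weights, total_reward):
        entry = (np.asarray(weights, dtype=float).copy(), float(total_reward))
        self.entries.append(entry)
        # Keep only the most recent max_size entries.
        self.entries = self.entries[-self.max_size:]

    def to_prompt_text(self):
        """Render each pair on one line as flattened parameters and f(params)."""
        lines = []
        for weights, total_reward in self.entries:
            params = ", ".join(f"{w:.2f}" for w in weights.ravel())
            lines.append(f"params: {params}; f(params): {total_reward:.2f}")
        return "\n".join(lines)


buffer = ReplayBuffer(max_size=3)
buffer.add(np.zeros((8, 2)), total_reward=-1.5)
buffer.add(np.full((8, 2), 0.5), total_reward=12.0)
print(buffer.to_prompt_text())
```

Because the optimizer only ever sees these pairs, nothing about the swimmer's dynamics needs to be exposed to it.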

##### LLM as the Optimizer

The chosen LLM will act as the optimization engine. The process mirrors that used for MountainCar:
1.  **Warmup Phase**: Initial episodes are run with randomly generated policy parameters $W$ to populate the Replay Buffer.
2.  **Iterative Refinement**: The LLM is prompted with the optimization goal, the historical data from the Replay Buffer (pairs of $W$ and $R$), output format instructions, and guidance on exploration vs. exploitation.
3.  The LLM proposes a new set of parameters $W'$.
4.  The agent's policy is updated with $W'$, and evaluation episodes are run.
5.  The resulting ($W'$, cumulative $R'$) pair is added to the Replay Buffer.

This cycle repeats, allowing the LLM to iteratively refine the 16 policy parameters for the Swimmer agent. The prompt will guide the LLM to optimize this higher-dimensional function $f(W) = R$.
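The control flow of the steps above can be sketched without any API calls. In this toy version the LLM is replaced by a simple placeholder that perturbs the best parameters seen so far, and the environment is replaced by a synthetic objective; `evaluate`, `propose`, and `TARGET` are hypothetical names introduced only for illustration and are not the method used by `run_training_loop`.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = rng.normal(size=(8, 2))  # hypothetical optimum of the toy objective


def evaluate(weights):
    """Toy stand-in for running evaluation episodes and summing rewards."""
    return -float(np.sum((weights - TARGET) ** 2))


def propose(buffer, step_size=0.1):
    """Placeholder for the LLM call: perturb the best parameters in the buffer."""
    best_weights, _ = max(buffer, key=lambda entry: entry[1])
    return best_weights + step_size * rng.normal(size=best_weights.shape)


buffer = []

# 1. Warmup phase: random parameters populate the buffer.
for _ in range(5):
    w = rng.normal(size=(8, 2))
    buffer.append((w, evaluate(w)))
warmup_best = max(reward for _, reward in buffer)

# 2-5. Iterative refinement: propose, evaluate, store, repeat.
for _ in range(200):
    w_new = propose(buffer)
    buffer.append((w_new, evaluate(w_new)))

final_best = max(reward for _, reward in buffer)
print(f"best after warmup: {warmup_best:.3f}, best after refinement: {final_best:.3f}")
```

Swapping `propose` for a function that formats the buffer into a prompt and parses the model's reply would give the LLM-driven variant described in this section.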

In [None]:
swimmer_params = {
    "num_episodes": NUM_EPISODES,
    "gym_env_name": "Swimmer-v5",
    "render_mode": RENDER_MODE,
    "logdir": "logs/mujoco_swimmer_tutorial",
    "dim_actions": 2,
    "dim_states": 8,
    "max_traj_count": MAX_TRAJ_COUNT,
    "max_traj_length": MAX_TRAJ_LENGTH,
    "llm_model_name": LLM_MODEL_NAME,
    "num_evaluation_episodes": NUM_EVALUATION_EPISODES,
    "warmup_episodes": WARMUP_EPISODES,
    "warmup_dir": None,
    "bias": True,
    "rank": None,
    "optimum": 250,
    "search_step_size": SEARCH_STD
}

run_training_loop(**swimmer_params)

In [None]:
swimmer_episode_399_gif = run_policy(**swimmer_params, episode_dir="episode_399")

display(HTML("<h3>Swimmer Episode 399</h3>"))
display(HTML(f'<img src="{swimmer_episode_399_gif}" style="width: 100%; max-width: 800px;">'))

cpu_times, api_times, total_episodes, total_steps, total_rewards = read_file(find_file(swimmer_params['logdir']))
plot_data(total_episodes, total_rewards, "ProPS Optimization on Swimmer")