# **Prompted Policy Search + Environment Description (ProPS+): Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs**

This notebook is a detailed tutorial on ProPS+, an extension of *[Prompted Policy Search (ProPS)](https://props-llm.github.io/)* (Zhou et al., 2025), a reinforcement learning method that unifies numerical and linguistic reasoning within a single framework. ProPS places a large language model (LLM) at the center of the policy optimization loop, where it directly proposes policy updates based on both reward feedback and natural language input.

ProPS+ bridges the gap between numerical optimization and semantic understanding. **It extends ProPS by adding rich, task-specific and contextual knowledge via semantically-informed prompts**, allowing the LLM to perform "Linguistic Reasoning" alongside numerical optimization.

In this tutorial, we use an LLM, specifically Gemini 2.0/2.5 Flash, to perform policy search for a linear continuous-control policy in a Gymnasium (formerly OpenAI Gym) reinforcement learning environment. We focus on the Swimmer environment, where the Gemini models search for the parameters of a linear policy that enables the agent to swim forward.

## **Environment: State and Action Variables**
<img src="https://raw.githubusercontent.com/intro-to-icl/intro-to-icl.github.io/refs/heads/master/static/images/swimmer.gif" height="200"><img src="https://raw.githubusercontent.com/intro-to-icl/intro-to-icl.github.io/refs/heads/master/static/images/swimmer_parameters.png" height="200">


The Swimmer-v5 environment presents a challenging continuous control problem. The agent is a simple swimmer composed of three rigid links connected by two actuated rotational joints (rotors). This chain-like structure is simulated in a viscous fluid. The primary objective for the swimmer is to move forward (typically along the positive x-axis) as quickly as possible by applying torques to its two rotors. The interaction with the fluid and the multi-link dynamics make this a non-trivial control task. This environment can be modeled as a Markov Decision Process (MDP), where the next state and reward are determined by the current state and the action taken. MuJoCo environments are generally deterministic given the same initial conditions and actions.

At each timestep, the agent receives an observation of the environment's current state. For Swimmer-v5, this state is represented by an 8-dimensional continuous vector:

$$ S^{T} = [q_{tip}, q_{rotor1}, q_{rotor2}, v_x, v_y, \omega_{tip}, \omega_{rotor1}, \omega_{rotor2} ] $$

Where:
*   $q_{tip}$: Angle of the front tip (the first link).
*   $q_{rotor1}$: Angle of the first rotor.
*   $q_{rotor2}$: Angle of the second rotor.
*   $v_x$: Velocity of the tip along the x-axis (forward direction).
*   $v_y$: Velocity of the tip along the y-axis.
*   $\omega_{tip}$: Angular velocity of the front tip.
*   $\omega_{rotor1}$: Angular velocity of the first rotor.
*   $\omega_{rotor2}$: Angular velocity of the second rotor.

Control over the swimmer is exerted by applying torques to its two rotors. The action $A$ is a 2-dimensional continuous vector:
$$ A = [ \tau_1, \tau_2 ] $$

Here, $\tau_1$ is the torque applied to the first rotor, and $\tau_2$ is the torque applied to the second rotor. Both torque values are typically clipped within the range [-1, 1]. The reward function in Swimmer-v5 is primarily based on forward locomotion: it encourages the agent to swim quickly in the target direction and applies a control penalty for actions that are too large.
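
As a rough illustration of the reward described above, the sketch below computes a single-step reward from an x-velocity and a torque vector. The function name `swimmer_step_reward` and the default `ctrl_cost_weight=1e-4` are assumptions based on the reward expression quoted in the prompt later in this notebook; this is not the simulator's own code.

```python
def swimmer_step_reward(x_velocity, action, ctrl_cost_weight=1e-4):
    """Hypothetical helper: forward-velocity reward minus a quadratic control cost."""
    # The control cost grows with the squared magnitude of the torques
    control_cost = ctrl_cost_weight * sum(a * a for a in action)
    return x_velocity - control_cost


# Moving forward at 0.5 m/s while applying full torque on both rotors
reward = swimmer_step_reward(0.5, [1.0, -1.0])
print(reward)
```

With this weighting, the control penalty is small compared with typical forward velocities, so the reward is dominated by how fast the swimmer moves along the x-axis.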

## **Policy Representation**
In reinforcement learning, the agent's behavior is dictated by a "policy," which maps observed states to actions. For this problem, we adopt a straightforward **linear policy**: the action (the two torques) is a linear combination of the observed state variables (angles and velocities). Given the 8 state variables and 2 action variables, the policy $π$ is parameterized by a matrix $θ$ of shape 8x2:
$$ θ = \begin{bmatrix}
θ_{1,1} & θ_{1,2} \\
θ_{2,1} & θ_{2,2} \\
\vdots & \vdots \\
θ_{8,1} & θ_{8,2}
\end{bmatrix} $$

Given a state vector $S$ (an 8x1 column vector), the action $A$ (a 1x2 row vector holding the two torques) is computed as:

$$ A = S^T θ $$

Each element $A_j = \sum_{i=1}^{8} S_i \cdot θ_{i,j}$. The parameters $θ_{i,j}$ determine the influence of the $i$-th state variable on the $j$-th action (torque). The optimization goal is to find the 16 parameters in $θ$ that maximize the total reward accumulated over an episode.
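
The matrix product above can be checked numerically. The following is a minimal NumPy sketch, assuming a randomly drawn $θ$ and a random stand-in observation (neither comes from the actual environment); it only demonstrates the shapes and the element-wise sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed values for illustration: 8 state dimensions x 2 torques
theta = np.round(rng.uniform(-3.0, 3.0, size=(8, 2)), 1)
state = rng.normal(size=8)  # stand-in for an observation S

# Vectorized form: A = S^T theta
action = state @ theta

# Element-wise form: A_j = sum_i S_i * theta_{i,j}
manual = np.array(
    [sum(state[i] * theta[i, j] for i in range(8)) for j in range(2)]
)

print(action.shape, np.allclose(action, manual))
```

Both forms give the same two torques, and `theta.size` confirms the 16 parameters the optimizer has to choose.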

## **Optimization Strategy: LLM-Driven Policy Search**
The core objective is to identify the policy parameters $θ$ that maximize the cumulative reward $R$ gathered over a complete episode, which consists of a sequence of steps from the start until the episode ends (through termination or by hitting the maximum step limit, e.g., 1000). This task is formally known as **Policy Search**. Mathematically, we seek to solve:
$$ \max_{θ} \mathbb{E}\left[ \sum_{t=0}^{T} r_t \right]$$

where $r_t$ is the reward at timestep $t$ and $T$ is the episode length.
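
In practice the expectation is approximated by averaging the returns of several episodes. The sketch below illustrates that estimate with a toy reward generator; `toy_episode`, its reward distribution, and the helper names are hypothetical and stand in for real environment rollouts.

```python
import random


def episode_return(rewards):
    """Sum of per-step rewards r_t over one episode."""
    return sum(rewards)


def estimate_objective(sample_episode, num_episodes=20):
    """Monte Carlo estimate of E[sum_t r_t] by averaging episode returns."""
    returns = [episode_return(sample_episode()) for _ in range(num_episodes)]
    return sum(returns) / len(returns)


random.seed(0)


def toy_episode(episode_length=1000):
    # Hypothetical rewards: a small positive drift plus noise at every step
    return [0.03 + random.gauss(0.0, 0.01) for _ in range(episode_length)]


estimate = estimate_objective(toy_episode)
print(round(estimate, 2))
```

Averaging over more episodes lowers the variance of the estimate at the cost of more simulation time.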

To facilitate this optimization, we use a "Replay Buffer." After each episode run with the current parameters $θ$ concludes, the total reward $R$ is calculated and the $(θ, R)$ pair is stored in the buffer. The optimizer, which is the LLM in our case, observes the stored $(θ, R)$ pairs together with the domain knowledge about the environment.


## **LLM as the Optimizer**

We use the Gemini models to conduct this optimization. The LLM is instructed via a prompt to function as an optimization assistant. The prompt is also augmented with domain knowledge: a description of the environment, detailed definitions of the parameter types, specifications of the policy structure, and human-provided hints regarding optimal behavior in the Swimmer environment. This allows the LLM to draw on common knowledge about the environment, which can speed up convergence.
<br />
<br />

##### ProPS+ Prompt Augmentation
```
The swimmers consist of three or more segments (’links’) and one less articulation joints (’rotors’) - one rotor joint connects exactly two links to form a linear chain. The swimmer is suspended in a two-dimensional pool and always starts in the same position (subject to some deviation drawn from a uniform distribution), and the goal is to move as fast as possible towards the right by applying torque to the rotors and using fluid friction.
The state is a vector of 8 elements, representing the following:
- state[0] angle of the front tip (-inf to inf rad)
- state[1] angle of the first rotor (-inf to inf rad)
- state[2] angle of the second rotor (-inf to inf rad)
- state[3] velocity of the tip along the x-axis (-inf to inf m/s)
- state[4] velocity of the tip along the y-axis (-inf to inf m/s)
- state[5] angular velocity of front tip (-inf to inf rad/s)
- state[6] angular velocity of first rotor (-inf to inf rad/s)
- state[7] angular velocity of second rotor (-inf to inf rad/s)
The action space is a vector of 2 float numbers, representing the torques applied between the links (-1 to 1 N).
The policy is a linear policy with 18 parameters and works as follows:
action = state @ W + b, where
state = [state[0], state[1], state[2], state[3], state[4], state[5], state[6], state[7]]
W = [[params[0], params[1]],
    [params[2], params[3]],
    [params[4], params[5]],
    [params[6], params[7]],
    [params[8], params[9]],
    [params[10], params[11]],
    [params[12], params[13]],
    [params[14], params[15]]]
b = [params[16], params[17]]
The goal is to try to move forward. However, in the meantime, the control cost should also be minimized. The reward function is as follows:
reward = x-velocity - 1e-4 * action^2.

```


The process begins with a "warmup" phase, where several episodes are run using randomly selected parameters $θ$. The resulting $(θ, R)$ pairs populate the Replay Buffer, providing initial data. Subsequently, the LLM is presented with a detailed prompt containing the optimization goal, the historical data from the Replay Buffer, output format instructions, and guidance on balancing exploration (trying novel parameters) versus exploitation (refining promising parameters), adapting this balance as the optimization progresses. Based on this prompt and the historical context (enabling in-context learning), the LLM proposes a new set of parameters $θ$ anticipated to yield improved rewards.

<br />
<p style="text-align:center;">
<img src="https://github.com/k-pratyush/props-llm-examples/blob/main/static/approach_overview_props_plus.png?raw=1" alt="image" width=350>
</p>
Figure (a): ProPS+ Optimization Approach

<br />
<br />

The agent's policy is then updated with these suggested parameters, and one or more evaluation episodes are executed in the environment. The cumulative reward obtained from these evaluations is recorded, and the new $(θ, R)$ pair is added to the Replay Buffer. This cycle of prompting the LLM, receiving parameter suggestions, evaluating the updated policy, and updating the buffer is repeated for a predetermined number of episodes (e.g., 400), allowing the LLM to iteratively refine the policy parameters towards optimality. The numerical part of the prompt treats the task as optimizing an unknown function $f(θ) = R$, guiding the LLM with hints on step size and search ranges, while the ProPS+ environment description supplies the semantic context about the task.
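
The warmup-propose-evaluate-store cycle can be sketched without an LLM or a simulator. In the toy example below, `evaluate` is a hypothetical two-parameter objective and `propose` is a random-perturbation stand-in for the LLM call; neither reflects the actual ProPS+ components, only the shape of the loop.

```python
import random
from collections import deque


def evaluate(theta):
    """Toy stand-in for an environment rollout: reward peaks at theta = (1.0, -2.0)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


def propose(buffer, step_size, rng):
    """Stand-in for the LLM: perturb the best parameters seen so far."""
    best_theta, _ = max(buffer, key=lambda pair: pair[1])
    return [round(p + rng.gauss(0.0, step_size), 1) for p in best_theta]


rng = random.Random(0)
buffer = deque(maxlen=1000)  # replay buffer of (theta, R) pairs

# Warmup: random parameters populate the buffer
for _ in range(5):
    theta = [round(rng.uniform(-6.0, 6.0), 1) for _ in range(2)]
    buffer.append((theta, evaluate(theta)))
warmup_best = max(reward for _, reward in buffer)

# Main loop: propose new parameters, evaluate them, store the result
for step in range(50):
    theta = propose(buffer, step_size=1.0, rng=rng)
    buffer.append((theta, evaluate(theta)))
final_best = max(reward for _, reward in buffer)

print(warmup_best, final_best)
```

In ProPS+ the `propose` step is replaced by a prompt containing the buffer contents and the environment description, and `evaluate` by averaged rollouts in Swimmer-v5.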


## **Code Overview**

The implementation follows a modular design. The **World** component (`ContinualSpaceGeneralWorld`) wraps the standard Gymnasium environment, managing state transitions, reward calculations, and episode termination. The **Agent** component (`LLMNumOptimAgent`) integrates the learning elements. It includes the **`LinearPolicy`** module, which stores the policy parameter matrix $θ$ and computes actions based on states. It also contains the **`EpisodeRewardBufferNoBias`** module, responsible for maintaining the Replay Buffer of ($θ$ , $R$) pairs. Finally, the **`LLMBrain`** module orchestrates all interactions with the LLM, including prompt generation using Jinja2 templates, API communication (handling both OpenAI and Gemini models), and parsing the LLM's responses to extract the suggested new parameters.

## **Hyperparameters**

Several hyperparameters govern the experiment's execution. `NUM_EPISODES` (e.g., 400) sets the total number of optimization iterations. `RENDER_MODE` controls environment visualization. `MAX_TRAJ_COUNT` (e.g., 1000) defines the Replay Buffer size, influencing the historical context available to the LLM. `MAX_TRAJ_LENGTH` (e.g., 1000) sets the maximum steps per episode. `LLM_MODEL_NAME` specifies the LLM used. `NUM_EVALUATION_EPISODES` (e.g., 20) determines how many runs are averaged to evaluate a new policy. `WARMUP_EPISODES` (e.g., 20) sets the number of initial random runs. `SEARCH_STD` (e.g., 1.0) provides a hint to the LLM regarding the step size for parameter exploration.

## **Training Loop**
<p style="text-align:center;">
<img src="https://raw.githubusercontent.com/intro-to-icl/intro-to-icl.github.io/refs/heads/master/static/images/swimmer_loop.gif" alt="image" height="350">
</p>
Figure (b): Code Overview

<br />
<br />

The `run_training_loop` function orchestrates the process. It initializes the World and Agent components. It performs the initial warmup runs if necessary, populating the replay buffer. Then, it enters the main loop, iterating `NUM_EPISODES` times. In each iteration, it interacts with the LLM (`agent.train_policy`) to get updated policy parameters based on the replay buffer history. It then evaluates the performance of this new policy over `NUM_EVALUATION_EPISODES` (`agent.evaluate_policy`), calculates the cumulative reward, and adds the new (parameters, cumulative reward) pair back into the replay buffer. Logging occurs at each step.

## **Output Structure**

The training process generates structured logs. A main log directory contains subdirectories for each episode (`episode_*`) and potentially a `warmup/` directory. Each episode directory stores logs of evaluation trajectories, the parameters suggested by the LLM for that episode (`parameters.txt`), and the full LLM interaction including its reasoning (`parameters_reasoning.txt`). The final notebook cells typically include code for visualizing the learned policy in action and plotting the reward curve over episodes, illustrating the learning progress.
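
A minimal sketch of how such a per-episode directory could be written is shown below. The helper `write_episode_logs` and the exact file contents are assumptions for illustration; only the `episode_*` directory pattern and the two file names come from the description above.

```python
import tempfile
from pathlib import Path


def write_episode_logs(logdir, episode_idx, params, reasoning):
    """Hypothetical helper: store proposed parameters and the LLM exchange for one episode."""
    episode_dir = Path(logdir) / f"episode_{episode_idx}"
    episode_dir.mkdir(parents=True, exist_ok=True)
    # One comma-separated line of parameter values
    params_line = ", ".join(str(p) for p in params) + "\n"
    (episode_dir / "parameters.txt").write_text(params_line)
    # Full prompt and response, kept for later inspection
    (episode_dir / "parameters_reasoning.txt").write_text(reasoning)
    return episode_dir


with tempfile.TemporaryDirectory() as tmp:
    out = write_episode_logs(tmp, 0, [0.1, -0.2], "system:\n...\n\n\nLLM:\n...")
    written = sorted(p.name for p in out.iterdir())
    print(written)
```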

## **Before Running the Demo, Follow the Instructions Below**
To run the full experiment:
1. Ensure all dependencies are imported and installed.
2. Visit Google AI Studio (https://aistudio.google.com/) to obtain your Gemini API key.
3. Once the API key is generated, copy and paste it into the demo when prompted.

#### **If having trouble creating an API Key, follow the link below for instructions:**
* #### **[Instructions on How to Obtain a Gemini API Key](https://docs.google.com/document/d/17pgikIpvZCBgcRh4NNrcee-zp3Id0pU0vq2SG52KIGE/edit?usp=sharing)**



In [None]:
#@title **Import and Install Necessary Libraries**
!pip install gymnasium[mujoco]

import re
import os
import time
import traceback
import getpass
from decimal import Decimal
from collections import deque

import numpy as np
import matplotlib.pyplot as plt
import imageio.v2 as imageio
import gymnasium as gym
import ipywidgets as widgets
from IPython.display import display, update_display
from jinja2 import Template
from google import genai


In [None]:
#@title **Setting Up Gemini Client**
apikey = getpass.getpass("Enter your Gemini API Key: ")

In [None]:
#@title **Choose a Model**
model_dropdown = widgets.Dropdown(
    options=[
        ("Gemini 2.5 Flash", "gemini-2.5-flash"),
        ("Gemini 2.0 Flash", "gemini-2.0-flash")
    ],
    description="Model:",
    value="gemini-2.5-flash",
    style={'description_width': 'initial'}
)

confirm_button = widgets.Button(
    description="Confirm Selection"
)

output = widgets.Output()

model_name = None

def on_confirm_click(b):
    global model_name

    model_name = model_dropdown.value

    with output:
        output.clear_output()
        print(f"\nSelected model: {model_name}")

confirm_button.on_click(on_confirm_click)

display(model_dropdown, confirm_button, output)

This cell introduces and lists the key hyperparameters that control the execution and behavior of the reinforcement learning experiment. Hyperparameters are settings that are not learned by the agent itself but are defined by the user before the training process begins. They significantly influence the learning process and the performance of the agent.

`NUM_EPISODES` (e.g., 400): Defines the total number of optimization iterations or training episodes the agent will go through. A higher number allows for more learning but increases computation time.

`RENDER_MODE` (e.g., None): Controls how the environment is visualized during execution. Options typically include 'human' (real-time window), 'rgb_array' (returns a pixel array, useful for recording), or None (no visualization, fastest for training).

`MAX_TRAJ_COUNT` (e.g., 1000): Sets the maximum size of the Replay Buffer. This buffer stores (policy parameters, reward) pairs, and its size determines how much historical data the LLM has access to when making decisions.

`MAX_TRAJ_LENGTH` (e.g., 1000): Specifies the maximum number of steps allowed in a single episode. If the agent doesn't reach a terminal state within these steps, the episode is truncated.

`LLM_MODEL_NAME` (e.g., "gemini-2.5-flash"): Specifies which Large Language Model will be used as the optimizer. In this notebook it is set from the model selected in the dropdown above.

`NUM_EVALUATION_EPISODES` (e.g., 20): Determines how many times a newly proposed policy is run in the environment to get an average measure of its performance. Averaging helps to reduce variance in the reward signal.

`WARMUP_EPISODES` (e.g., 20): Sets the number of initial episodes run with randomly generated policy parameters. This "warmup" phase populates the Replay Buffer with some initial data points before the LLM starts optimizing.

`SEARCH_STD` (e.g., 1.0): Provides a hint to the LLM regarding the standard deviation or step size it should consider when exploring new parameter values, especially during the initial exploration phase.

In [None]:
#@title **Key Hyperparameters**
NUM_EPISODES=200 # Total number of episodes to train for
RENDER_MODE=None # Choose from 'human', 'rgb_array', or None
MAX_TRAJ_COUNT=1000 # Maximum number of trajectories to store in buffer for prompt
MAX_TRAJ_LENGTH=1000 # Maximum number of steps in a trajectory
LLM_MODEL_NAME=model_name

NUM_EVALUATION_EPISODES=20 # Number of episodes to generate agent rollouts for evaluation
WARMUP_EPISODES=20 # Number of randomly generated initial episodes
SEARCH_STD=1.0 # Step size for LLM to search for optimal parameters during exploration

The cell below defines the template for the black-box optimization prompt. The template uses variables defined in the code to set the number of parameters to optimize, the global optimum of the function, the step size, the current step count, and the history of (parameter, reward) tuples.

In [None]:
#@title **Black Box Optimization Prompt Example**
LLM_SI_TEMPLATE_STRING = """
You are a good global RL policy optimizer, helping me find the global optimal policy in the following environment:

# Environment: {{ env_description }}

# Regarding the parameters **params**: **params** is an array of {{ rank }} float numbers.
**params** values are in the range of [-6.0, 6.0] with 1 decimal place. params represent a
linear policy. f(params) is the episodic reward of the policy.

# Here's how we'll interact:
1. I will first provide MAX_STEPS (200) along with a few training examples.
2. You will provide your response in the following exact format:
    * Line 1: a new input 'params[0]: , params[1]: , params[2]: ,..., params[{{ rank - 1 }}]: ', aiming to maximize the function's value f(params).
    Please propose params values in the range of [-6.0, 6.0], with 1 decimal place.
    * Line 2: detailed explanation of why you chose that input.
3. I will then provide the function's value f(params) at that point, and the current iteration.
4. We will repeat steps 2-3 until we reach the maximum number of iterations.

# Remember:
1. **Do not propose previously seen params.**
2. **The global optimum should be around {{ optimum }}.** If you are below that, this is just a local optimum. You should explore instead of exploiting.
3. Search both positive and negative values. **During exploration, use search step size of {{ step_size }}**.


Next, you will see examples of params and their episodic reward f(params) pairs.
{{ episode_reward_buffer_string }}

Now you are at iteration {{step_number}} out of 200. Please provide the results in the indicated format. Do not provide any additional texts."""


llm_si_template = Template(LLM_SI_TEMPLATE_STRING)
llm_output_conversion_template = llm_si_template
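
The response format requested by the template (`params[0]: ..., params[1]: ...` on the first line) has to be turned back into numbers. The following is a hedged sketch of one way to do that with a regular expression; `parse_params_line` is a hypothetical helper written for this illustration and is not the parser used by the agent.

```python
import re


def parse_params_line(response, rank):
    """Hypothetical parser: pull `params[i]: value` pairs from the first line of a reply."""
    if not response.strip():
        return None
    first_line = response.strip().splitlines()[0]
    pairs = re.findall(r"params\[(\d+)\]:\s*(-?\d+(?:\.\d+)?)", first_line)
    values = {int(index): float(value) for index, value in pairs}
    # Reject proposals that skip an index or include too few/many parameters
    if sorted(values) != list(range(rank)):
        return None
    return [values[i] for i in range(rank)]


reply = "params[0]: 1.5, params[1]: -0.3, params[2]: 6.0\nI moved params[0] up because ..."
print(parse_params_line(reply, rank=3))  # [1.5, -0.3, 6.0]
```

Returning `None` for malformed replies lets the caller decide whether to retry the query or keep the previous parameters.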

## **World**

The `ContinualSpaceGeneralWorld` class wraps Gymnasium environments to give agents a standardized interface.

In [None]:
#@title **Swimmer-v5**
import gymnasium as gym


class ContinualSpaceGeneralWorld():
    def __init__(
        self,
        gym_env_name,
        render_mode,
        max_traj_length=1000,
    ):
        assert render_mode in ["human", "rgb_array", None]

        if gym_env_name == "gym_navigation:NavigationTrack-v0":
            self.env = gym.make(
                gym_env_name,
                render_mode=render_mode,
                track_id=1,
            )
        elif gym_env_name == "maze-sample-3x3-v0":
            self.env = gym.make(
                gym_env_name,
                enable_render=render_mode,
            )
        else:
            self.env = gym.make(gym_env_name, render_mode=render_mode)
        self.gym_env_name = gym_env_name
        self.render_mode = render_mode
        self.steps = 0
        self.accu_reward = 0
        self.max_traj_length = max_traj_length
        if isinstance(self.env.action_space, gym.spaces.Discrete):
            self.discretize = True
        else:
            self.discretize = False

    def reset(self, new_reward=False):
        """ This method resets the environment to its initial state.
        If `new_reward` is True, it initializes the environment with a different reward structure.
        """
        del self.env
        if not new_reward:
            self.env = gym.make(self.gym_env_name, render_mode=self.render_mode)
        else:
            self.env = gym.make(self.gym_env_name, render_mode=self.render_mode, healthy_reward=0)

        state, _ = self.env.reset()
        self.steps = 0
        self.accu_reward = 0
        return state

    def step(self, action):
        """
        This method executes a step in the environment with the given action.
        It updates the environment state, accumulates the reward, and checks if the episode is done.
        """
        self.steps += 1
        action = np.asarray(action).reshape(-1)
        state, reward, terminated, truncated, _ = self.env.step(action)
        self.accu_reward += reward

        if self.steps >= self.max_traj_length or terminated or truncated:
            done = True
        else:
            done = False

        return state, reward, done

    def get_accu_reward(self):
        """
        This method returns the accumulated reward for the current episode.
        """
        return self.accu_reward


## **Sub Modules**

`EpisodeRewardBufferNoBias`: Stores and manages a collection of (policy parameters, reward) pairs, acting as the replay buffer.

`LinearPolicy`: Implements a linear policy where the action is computed as a dot product of the state and weights, plus a bias term: $a = s^T W + b$.

`LinearPolicyNoBias`: Implements a linear policy without a bias term: $a = s^T W$.

`LLMBrain`: Coordinates with the LLM to get new parameters for the policy based on existing policy (parameter, reward) pairs.

In [None]:
#@title **Details**
class EpisodeRewardBufferNoBias:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, weights: np.ndarray, reward):
        self.buffer.append((weights, reward))

    def sort(self):
        self.buffer = deque(sorted(self.buffer, key=lambda x: x[1], reverse=False), maxlen=self.buffer.maxlen)

    def __str__(self):
        buffer_table = "Parameters | Reward\n"
        for weights, reward in self.buffer:
            buffer_table += f"{weights.reshape(1, -1)} | {reward}\n"
        return buffer_table

    def load(self, folder):
        # Find all episode files
        all_files = [os.path.join(folder, x) for x in os.listdir(folder) if x.startswith('warmup_rollout')]
        all_files.sort(key=lambda x: int(x.split('_')[-1].split('.')[0]))

        # Load parameters from all episodes
        for filename in all_files:
            with open(filename, 'r') as f:
                lines = f.readlines()

            # Parameter rows come first, up to the "parameter ends" marker
            parameters = []
            for line in lines:
                if "parameter ends" in line:
                    break
                try:
                    parameters.append([float(x) for x in line.split(',')])
                except ValueError:
                    continue  # skip lines that are not comma-separated floats
            parameters = np.array(parameters)

            # Average the total reward over all rollouts logged in the file
            rewards = []
            for line in lines:
                if "Total reward" in line:
                    try:
                        rewards.append(float(line.split()[-1]))
                    except ValueError:
                        continue
            rewards_mean = np.mean(rewards)
            self.add(parameters, rewards_mean)
        print(self)


class LinearPolicy():
    """
    Linear policy for continuous action spaces. The policy is represented as a
    (dim_states, dim_actions) weight matrix plus a (1, dim_actions) bias row.
    The next action is the product of the state and the weight matrix, plus the bias:
    state.T @ weight + bias -> action
    (1, dim_states) @ (dim_states, dim_actions) + (1, dim_actions) -> (1, dim_actions)
    """
    def __init__(self, dim_states, dim_actions):

        self.dim_states = dim_states
        self.dim_actions = dim_actions

        self.weight = np.random.rand(self.dim_states, self.dim_actions)
        self.bias = np.random.rand(1, self.dim_actions)

    def initialize_policy(self):
        self.weight = np.round(np.random.normal(0., 3., size=(self.dim_states, self.dim_actions)), 1)
        self.bias = np.round(np.random.normal(0., 3., size=(1, self.dim_actions)), 1)

    def get_action(self, state):
        state = state.T
        return np.matmul(state, self.weight) + self.bias

    def __str__(self):
        output = "Weights:\n"
        for w in self.weight:
            output += ", ".join([str(i) for i in w])
            output += "\n"

        output += "Bias:\n"
        for b in self.bias:
            output += ", ".join([str(i) for i in b])
            output += "\n"

        return output

    def update_policy(self, weight_and_bias_list):
        if weight_and_bias_list is None:
            return

        weight_and_bias_list = np.array(weight_and_bias_list).reshape(self.dim_states + 1, self.dim_actions)
        self.weight = np.array(weight_and_bias_list[:-1])
        self.bias = np.expand_dims(np.array(weight_and_bias_list[-1]), axis=0)

    def get_parameters(self):
        parameters = np.concatenate((self.weight, self.bias), axis=0)
        return parameters

class LinearPolicyNoBias():
    def __init__(self, dim_states, dim_actions):

        self.dim_states = dim_states
        self.dim_actions = dim_actions

        self.weight = np.random.rand(self.dim_states, self.dim_actions)

    def initialize_policy(self):
        self.weight = np.round((np.random.rand(self.dim_states, self.dim_actions) - 0.5) * 6, 1)

    def get_action(self, state):
        state = state.T
        return np.matmul(state, self.weight)

    def __str__(self):
        output = "Weights:\n"
        for w in self.weight:
            output += ", ".join([str(i) for i in w])
            output += "\n"

        return output

    def update_policy(self, weight_and_bias_list):
        if weight_and_bias_list is None:
            return
        self.weight = np.array(weight_and_bias_list)
        self.weight = self.weight.reshape(-1)
        for i in range(len(self.weight)):
            self.weight[i] = Decimal(self.weight[i]).normalize()

        self.weight = self.weight.reshape(
            self.dim_states, self.dim_actions
        )

    def get_parameters(self):
        return self.weight


class LLMBrain:
    def __init__(
        self,
        llm_si_template: Template,
        llm_output_conversion_template: Template,
        llm_model_name: str,
    ):
        self.llm_si_template = llm_si_template
        self.llm_output_conversion_template = llm_output_conversion_template
        self.llm_conversation = []
        # Both models offered in the dropdown above must pass this check
        assert llm_model_name in [
            "gemini-2.5-flash",
            "gemini-2.0-flash",
        ]
        self.llm_model_name = llm_model_name
        if "gemini" in llm_model_name:
            self.model_group = "gemini"
            self.client = genai.Client(api_key=apikey)
        elif "claude" in llm_model_name:
            self.model_group = "anthropic"
            self.client = anthropic.Client(api_key=os.environ["ANTHROPIC_API_KEY"])
        else:
            self.model_group = "openai"
            self.client = OpenAI()

    def reset_llm_conversation(self):
        self.llm_conversation = []

    def add_llm_conversation(self, text, role):
        if self.model_group == "openai":
            self.llm_conversation.append({"role": role, "content": text})
        elif self.model_group == "anthropic":
            self.llm_conversation.append({"role": role, "content": text})
        else:
            self.llm_conversation.append({"role": role, "parts": text})

    def query_llm(self):
        for attempt in range(10):
            try:
                if self.model_group == "openai":
                    completion = self.client.chat.completions.create(
                        model=self.llm_model_name,
                        messages=self.llm_conversation,
                    )
                    response = completion.choices[0].message.content
                elif self.model_group == "anthropic":
                    message = self.client.messages.create(
                        model=self.llm_model_name,
                        messages=self.llm_conversation,
                        max_tokens=1024,
                    )
                    response = message.content[0].text
                else:
                    prompt = self.llm_conversation[-1]["parts"]
                    response = self.client.models.generate_content(
                        model=self.llm_model_name,
                        contents=[prompt]
                    )
                    response = response.text
            except Exception as e:
                print(f"Error: {e}")
                if attempt == 9:
                    raise Exception("Failed to query the LLM after 10 attempts") from e
                print("Waiting for 60 seconds before retrying...")
                time.sleep(60)
                # `response` is not set on failure, so skip the bookkeeping below
                continue

            # Record the response in self.llm_conversation with the provider's role name
            if self.model_group in ("openai", "anthropic"):
                self.add_llm_conversation(response, "assistant")
            else:
                self.add_llm_conversation(response, "model")

            return response


    def parse_parameters(self, parameters_string):
        new_parameters_list = []

        # Parse each non-empty line as a comma-separated row of floats
        for row in parameters_string.split("\n"):
            if row.strip().strip(","):
                try:
                    parameters_row = [
                        float(x.strip().strip(",")) for x in row.split(",")
                    ]
                    new_parameters_list.append(parameters_row)
                except Exception as e:
                    print(e)

        return new_parameters_list


    def llm_update_parameters_num_optim(
        self,
        episode_reward_buffer,
        parse_parameters,
        step_number,
        rank=None,
        optimum=None,
        search_step_size=0.1,
        actions=None,
        env_description=None,
    ):
        self.reset_llm_conversation()

        system_prompt = self.llm_si_template.render(
            {
                "episode_reward_buffer_string": str(episode_reward_buffer),
                "step_number": str(step_number),
                "rank": rank,
                "optimum": str(optimum),
                "step_size": str(search_step_size),
                "actions": actions,
                "env_description": env_description,
            }
        )

        self.add_llm_conversation(system_prompt, "user")

        api_start_time = time.time()
        new_parameters_with_reasoning = self.query_llm()
        api_time = time.time() - api_start_time
        new_parameters_list = parse_parameters(new_parameters_with_reasoning)

        return (
            new_parameters_list,
            "system:\n"
            + system_prompt
            + "\n\n\nLLM:\n"
            + new_parameters_with_reasoning,
            api_time,
        )


### **Agent**

The cell below defines the core agent wrapper. It is responsible for managing the policy, interacting with the world, and coordinating with the `LLMBrain` to learn.

In [None]:
#@title **Core Agent Wrapper**
class LLMNumOptimAgent:
    def __init__(
        self,
        logdir,
        dim_action,
        dim_state,
        max_traj_count,
        max_traj_length,
        llm_si_template,
        llm_output_conversion_template,
        llm_model_name,
        num_evaluation_episodes,
        bias,
        optimum,
        search_step_size,
        env_description,
    ):
        self.start_time = time.process_time()
        self.api_call_time = 0
        self.total_steps = 0
        self.total_episodes = 0
        self.dim_action = dim_action
        self.dim_state = dim_state
        self.bias = bias
        self.optimum = optimum
        self.search_step_size = search_step_size
        self.env_description = env_description

        if not self.bias:
            param_count = dim_action * dim_state
        else:
            param_count = dim_action * dim_state + dim_action
        self.rank = param_count

        # Initialize the policy and replay buffer
        if not self.bias:
            self.policy = LinearPolicyNoBias(
                dim_actions=dim_action, dim_states=dim_state
            )
        else:
            self.policy = LinearPolicy(dim_actions=dim_action, dim_states=dim_state)
        self.replay_buffer = EpisodeRewardBufferNoBias(max_size=max_traj_count)
        self.llm_brain = LLMBrain(
            llm_si_template, llm_output_conversion_template, llm_model_name
        )
        self.logdir = logdir
        self.num_evaluation_episodes = num_evaluation_episodes
        self.training_episodes = 0

        if self.bias:
            self.dim_state += 1

    def rollout_episode(self, world, logging_file, record=True):
        """Simulates an episode in the environment using the current policy."""
        state = world.reset()
        state = np.expand_dims(state, axis=0)
        logging_file.write(
            f"{', '.join([str(x) for x in self.policy.get_parameters().reshape(-1)])}\n"
        )
        logging_file.write(f"parameter ends\n\n")
        logging_file.write(f"state | action | reward\n")
        done = False
        step_idx = 0
        while not done:
            action = self.policy.get_action(state.T)
            action = np.reshape(action, (1, self.dim_action))
            if world.discretize:
                action = np.argmax(action)
                action = np.array([action])
            next_state, reward, done = world.step(action)
            logging_file.write(f"{state.T[0]} | {action[0]} | {reward}\n")
            state = next_state
            step_idx += 1
            self.total_steps += 1
        logging_file.write(f"Total reward: {world.get_accu_reward()}\n")
        self.total_episodes += 1
        if record:
            self.replay_buffer.add(
                self.policy.get_parameters(), world.get_accu_reward()
            )
        return world.get_accu_reward()

    def random_warmup(self, world, logdir, num_episodes):
        """Rolls out randomly initialized policies to seed the replay buffer."""
        for episode in range(num_episodes):
            self.policy.initialize_policy()
            # Run the episode and collect the trajectory
            print(f"Rolling out warmup episode {episode}...")
            logging_filename = f"{logdir}/warmup_rollout_{episode}.txt"
            # Close the log file as soon as the rollout has been written
            with open(logging_filename, "w") as logging_file:
                result = self.rollout_episode(world, logging_file)
            print(f"Result: {result}")

    def train_policy(self, world, logdir):
        """Core method to train single iteration of the policy using LLM optimization."""

        def parse_parameters(input_text):
            # Only the first line of the reply is parsed for parameter values
            s = input_text.split("\n")[0]
            print("response:", s)
            # Match "params[i]: value", where value is a signed integer or float,
            # optionally in scientific notation (e.g. 1.5e-05)
            pattern = re.compile(
                r"params\[(\d+)\]:\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
            )
            matches = pattern.findall(s)

            # Convert the matched value strings to float
            results = []
            for match in matches:
                results.append(float(match[1]))
            print(results)
            # The reply must contain exactly one value per policy parameter
            assert len(results) == self.rank
            return np.array(results).reshape(-1)

        def str_nd_examples(replay_buffer: EpisodeRewardBufferNoBias, n):

            all_parameters = []
            for weights, reward in replay_buffer.buffer:
                parameters = weights
                all_parameters.append((parameters.reshape(-1), reward))

            text = ""
            for parameters, reward in all_parameters:
                l = ""
                for i in range(n):
                    l += f"params[{i}]: {parameters[i]:.5g}; "
                fxy = reward
                l += f"f(params): {fxy:.2f}\n"
                text += l
            return text

        # Ask the LLM for new policy parameters based on the replay buffer
        print("Updating the policy...")
        new_parameter_list, reasoning, api_time = self.llm_brain.llm_update_parameters_num_optim(
            str_nd_examples(self.replay_buffer, self.rank),
            parse_parameters,
            self.training_episodes,
            self.rank,
            self.optimum,
            self.search_step_size,
            env_description = self.env_description,
        )
        self.api_call_time += api_time

        print(self.policy.get_parameters().shape)
        print(new_parameter_list.shape)
        self.policy.update_policy(new_parameter_list)
        print(self.policy.get_parameters().shape)
        # Save the updated parameters and the LLM's reasoning for this iteration
        parameters_filename = f"{logdir}/parameters.txt"
        with open(parameters_filename, "w") as parameters_file:
            parameters_file.write(str(self.policy))
        reasoning_filename = f"{logdir}/parameters_reasoning.txt"
        with open(reasoning_filename, "w") as reasoning_file:
            reasoning_file.write(reasoning)
        print("Policy updated!")

        # Run the evaluation episodes with the updated policy
        print(f"Rolling out episode {self.training_episodes}...")
        logging_filename = f"{logdir}/training_rollout.txt"
        results = []
        with open(logging_filename, "w") as logging_file:
            for _ in range(self.num_evaluation_episodes):
                # record=False: only the mean return is added to the buffer, below
                result = self.rollout_episode(world, logging_file, record=False)
                results.append(result)
        print(f"Results: {results}")
        result = np.mean(results)
        self.replay_buffer.add(new_parameter_list, result)

        self.training_episodes += 1

        _cpu_time = time.process_time() - self.start_time
        _api_time = self.api_call_time
        _total_episodes = self.total_episodes
        _total_steps = self.total_steps
        _total_reward = result
        return _cpu_time, _api_time, _total_episodes, _total_steps, _total_reward


    def evaluate_policy(self, world, logdir):
        """Rolls out the current policy without recording to the replay buffer."""
        results = []
        for idx in range(self.num_evaluation_episodes):
            logging_filename = f"{logdir}/evaluation_rollout_{idx}.txt"
            with open(logging_filename, "w") as logging_file:
                result = self.rollout_episode(world, logging_file, record=False)
            results.append(result)


The cell below orchestrates the entire training process from initialization to completion. The `run_training_loop` function first initializes the world and agent instances. It then either rolls out a set of random warmup episodes to seed the replay buffer passed to the optimizer, or loads an existing buffer from `warmup_dir`. Finally, it runs the training loop for the specified number of episodes, optimizing the policy parameters and retrying each LLM update up to five times on failure.
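
The overall structure (warmup, then repeated propose, evaluate, record) can be sketched on a toy problem. This is only an illustration under stated assumptions: `toy_reward` stands in for averaging episode returns, and `propose` replaces the LLM with a random perturbation of the best parameters seen so far, which is a plain random local search and not the ProPS+ update. All names here are hypothetical.

```python
import random


def toy_reward(params):
    """Stand-in for the mean episode return; the maximum (0) is at all ones."""
    return -sum((p - 1.0) ** 2 for p in params)


def propose(buffer, rng, step_size):
    """Stand-in for the LLM: perturb the best parameters in the buffer."""
    best_params, _ = max(buffer, key=lambda pair: pair[1])
    return [p + rng.gauss(0.0, step_size) for p in best_params]


def run_toy_loop(num_iterations=50, warmup=5, dim=4, step_size=0.1, seed=0):
    rng = random.Random(seed)
    buffer = []
    # Warmup: random parameter vectors seed the buffer.
    for _ in range(warmup):
        params = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        buffer.append((params, toy_reward(params)))
    # Main loop: propose new parameters, evaluate them, record the outcome.
    for _ in range(num_iterations):
        params = propose(buffer, rng, step_size)
        buffer.append((params, toy_reward(params)))
    return buffer


buffer = run_toy_loop()
best_warmup = max(reward for _, reward in buffer[:5])
best_overall = max(reward for _, reward in buffer)
print(f"best warmup reward: {best_warmup:.3f}, best overall: {best_overall:.3f}")
```

In the notebook, the proposal step is the LLM call made by `train_policy`, and the evaluation step averages `num_evaluation_episodes` rollouts.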

In [None]:
#@title **Training Loop**
def ordinal(n):
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def run_training_loop(
    num_episodes,
    gym_env_name,
    render_mode,
    logdir,
    dim_actions,
    dim_states,
    max_traj_count,
    max_traj_length,
    llm_model_name,
    num_evaluation_episodes,
    warmup_episodes,
    warmup_dir,
    bias=None,
    rank=None,
    optimum=100,
    search_step_size=SEARCH_STD,
    env_description=None,
):
    world = ContinualSpaceGeneralWorld(
        gym_env_name,
        render_mode,
        max_traj_length,
    )

    agent = LLMNumOptimAgent(
        logdir,
        dim_actions,
        dim_states,
        max_traj_count,
        max_traj_length,
        llm_si_template,
        llm_output_conversion_template,
        llm_model_name,
        num_evaluation_episodes,
        bias,
        optimum,
        search_step_size,
        env_description,
    )
    print('init done')

    if not warmup_dir:
        warmup_dir = f"{logdir}/warmup"
        os.makedirs(warmup_dir, exist_ok=True)
        agent.random_warmup(world, warmup_dir, warmup_episodes)
    else:
        agent.replay_buffer.load(warmup_dir)

    overall_log_file = open(f"{logdir}/overall_log.txt", "w")
    overall_log_file.write("Iteration, CPU Time, API Time, Total Episodes, Total Steps, Total Reward\n")
    overall_log_file.flush()
    for episode in range(num_episodes):
        print("-----------------------------------------------------------------------------------------------------------")
        print(f"Episode: {episode}")
        # create log dir
        curr_episode_dir = f"{logdir}/episode_{episode}"
        print(f"Creating log directory: {curr_episode_dir}")
        os.makedirs(curr_episode_dir, exist_ok=True)

        for trial_idx in range(5):
            try:
                cpu_time, api_time, total_episodes, total_steps, total_reward = agent.train_policy(world, curr_episode_dir)
                overall_log_file.write(f"{episode + 1}, {cpu_time}, {api_time}, {total_episodes}, {total_steps}, {total_reward}\n")
                overall_log_file.flush()
                print(f"{ordinal(trial_idx + 1)} trial attempt succeeded in training")
                break
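            except Exception as e:
                print(f"{ordinal(trial_idx + 1)} trial attempt failed with error in training: {e}")
                traceback.print_exc()

                if trial_idx == 4:
                    print(f"All {trial_idx + 1} trials failed. Training terminated.")
                    overall_log_file.close()
                    # Raise instead of calling exit(1) so the failure surfaces as a normal exception
                    raise RuntimeError("Policy update failed after 5 attempts") from e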
    overall_log_file.close()


In [None]:
#@title **Run the Training Loop**
LLM_MODEL_NAME = "gemini-2.0-flash"

swimmer_description = """
The swimmers consist of three or more segments (’links’) and one less articulation joints (’rotors’) - one rotor joint connects exactly two links to form a linear chain. The swimmer is suspended in a two-dimensional pool and always starts in the same position (subject to some deviation drawn from a uniform distribution), and the goal is to move as fast as possible towards the right by applying torque to the rotors and using fluid friction.
The state is a vector of 8 elements, representing the following:
- state[0] angle of the front tip (-inf to inf rad)
- state[1] angle of the first rotor (-inf to inf rad)
- state[2] angle of the second rotor (-inf to inf rad)
- state[3] velocity of the tip along the x-axis (-inf to inf m/s)
- state[4] velocity of the tip along the y-axis (-inf to inf m/s)
- state[5] angular velocity of front tip (-inf to inf rad/s)
- state[6] angular velocity of first rotor (-inf to inf rad/s)
- state[7] angular velocity of second rotor (-inf to inf rad/s)
The action space is a vector of 2 float numbers, representing the torques applied between the links (-1 to 1 N m).
The policy is a linear policy with 16 parameters and works as follows:
action = state @ W, where
state = [state[0], state[1], state[2], state[3], state[4], state[5], state[6], state[7]]
W = [[params[0], params[1]],
    [params[2], params[3]],
    [params[4], params[5]],
    [params[6], params[7]],
    [params[8], params[9]],
    [params[10], params[11]],
    [params[12], params[13]],
    [params[14], params[15]]]
The goal is to try to move forward. However, in the meantime, the control cost should also be minimized. The reward function is as follows:
reward = x-velocity - 1e-4 * action^2.
"""

run_training_loop(
    num_episodes=NUM_EPISODES,
    gym_env_name="Swimmer-v5", # https://gymnasium.farama.org/environments/mujoco/swimmer/
    render_mode=RENDER_MODE,
    logdir="logs/mujoco_swimmer_tutorial",
    dim_actions=2,
    dim_states=8,
    max_traj_count=MAX_TRAJ_COUNT,
    max_traj_length=MAX_TRAJ_LENGTH,
    llm_model_name=LLM_MODEL_NAME,
    num_evaluation_episodes=NUM_EVALUATION_EPISODES,
    warmup_episodes=WARMUP_EPISODES,
    warmup_dir=None,
    bias=None,
    rank=None,
    optimum=250,
    search_step_size=SEARCH_STD,
    env_description=swimmer_description,
)
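
For reference, the parameter count of a linear policy follows directly from the state and action dimensions: `dim_state * dim_action` weights, plus `dim_action` bias terms when a bias is used. With 8 state variables and 2 actions this gives 16 parameters without a bias and 18 with one, matching how `LLMNumOptimAgent` computes `rank`. The sketch below is a stand-alone illustration in plain Python, assuming the row-major layout shown for `W` in the description; it is not the notebook's `LinearPolicy` implementation, and the function names are hypothetical.

```python
def param_count(dim_state, dim_action, bias=False):
    """Number of scalars in a linear policy, with or without a bias vector."""
    return dim_state * dim_action + (dim_action if bias else 0)


def linear_action(state, params, dim_action, bias=False):
    """Compute state @ W (+ b) from a flat, row-major parameter list."""
    dim_state = len(state)
    n_weights = dim_state * dim_action
    if len(params) != param_count(dim_state, dim_action, bias):
        raise ValueError("parameter list has the wrong length")
    action = []
    for j in range(dim_action):
        # W[i][j] is stored at flat index i * dim_action + j.
        total = sum(state[i] * params[i * dim_action + j] for i in range(dim_state))
        if bias:
            total += params[n_weights + j]
        action.append(total)
    return action


print(param_count(8, 2, bias=False))
print(param_count(8, 2, bias=True))

state = [float(i) for i in range(8)]
params = [0.0] * 16
params[2] = 1.0  # W[1][0] = 1, so the first action equals state[1]
print(linear_action(state, params, dim_action=2))
```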

In [None]:
#@title **Policy Visualization (Swimmer-v5)**

EPISODE_DIR = "episode_197"   # episode folder whose parameters.txt is loaded
LOGDIR = "logs/mujoco_swimmer_tutorial"

import os
# Use EGL on most GPU-backed Linux servers; try "osmesa" if EGL isn't available
os.environ.setdefault("MUJOCO_GL", "egl")   # alternatives: "osmesa", "glfw"

def run_policy(
    render_mode="rgb_array",
    logdir=LOGDIR,
    episode_dir=EPISODE_DIR,
    save_gif=False,
    fast_mode=True,
):

    # Visualization speed settings
    if fast_mode:
        sleep_time = 0.0
        display_interval = 1
        downscale_factor = 1
    else:
        sleep_time = 0.05
        display_interval = 1
        downscale_factor = 1

    ENV_NAME = "Swimmer-v5"
    DIM_STATE = 8
    DIM_ACTION = 2

    world = ContinualSpaceGeneralWorld(
        ENV_NAME,
        render_mode=render_mode,
        max_traj_length=MAX_TRAJ_LENGTH,
    )

    agent = LLMNumOptimAgent(
        logdir,
        dim_action=DIM_ACTION,
        dim_state=DIM_STATE,
        max_traj_count=MAX_TRAJ_COUNT,
        max_traj_length=MAX_TRAJ_LENGTH,
        llm_si_template=llm_si_template,
        llm_output_conversion_template=llm_output_conversion_template,
        llm_model_name=LLM_MODEL_NAME,
        num_evaluation_episodes=NUM_EVALUATION_EPISODES,
        bias=False,
        optimum=100,
        search_step_size=SEARCH_STD,
        env_description=swimmer_description,
    )

    # Load trained parameters safely
    param_path = os.path.join(logdir, episode_dir, "parameters.txt")
    if not os.path.exists(param_path):
        raise FileNotFoundError(f"Could not find parameter file at {param_path}")

    weights = []
    with open(param_path, "r") as f:
        lines = f.readlines()

    for line in lines[1:]:  # skip header
        line = line.strip()
        if not line or not any(c.isdigit() or c in "-." for c in line):
            continue
        try:
            weights.append(float(line))
        except ValueError:
            # Handle "Bias: 0.2" or similar
            parts = line.replace(",", " ").split()
            for p in parts:
                try:
                    weights.append(float(p))
                except ValueError:
                    continue

    agent.policy.update_policy([weights])

    # Rollout
    raw_state = world.reset()
    state = np.expand_dims(raw_state, axis=0)
    done = False
    step_idx = 0
    frames = [] if save_gif else None

    def get_frame():
        """Safely obtain an RGB frame across Gymnasium/Gym versions."""
        # Gymnasium (>=0.26): render_mode set at make-time; call with no args
        try:
            img = world.env.render()
            if img is not None:
                return img
        except TypeError:
            # Old Gym fallback path
            pass
        except Exception:
            pass

        # Legacy Gym fallbacks
        for call in (lambda: world.env.render("rgb_array"),
                    lambda: world.env.render(mode="rgb_array")):
            try:
                img = call()
                if img is not None:
                    return img
            except Exception:
                continue

        return None

    # First frame setup
    img = get_frame()
    if img is None:
        raise RuntimeError("Unable to obtain an RGB frame from the environment.")
    if not isinstance(img, np.ndarray):
        img = np.array(img)
    if downscale_factor > 1:
        img = img[::downscale_factor, ::downscale_factor]

    # Matplotlib setup
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(img)
    ax.axis("off")
    text_box = ax.text(5, 15, '', color='white', fontsize=12, backgroundcolor='black')
    display_id = f"policy_vis_{int(time.time()*1000)}"
    display(fig, display_id=display_id)

    # Rollout loop
    while not done:
        img = get_frame()
        if img is None:
            print(f"Warning: got None frame at step {step_idx}")
            break

        if not isinstance(img, np.ndarray):
            img = np.array(img)
        if downscale_factor > 1:
            img = img[::downscale_factor, ::downscale_factor]

        im.set_data(img)
        text_box.set_text(f"Step {step_idx}")

        if step_idx % display_interval == 0:
            update_display(fig, display_id=display_id)

        if save_gif:
            frames.append(img.copy())

        # Get action
        try:
            action = agent.policy.get_action(state.T)
        except Exception as e:
            print(f"Error getting action at step {step_idx}: {e}")
            break

        action_np = np.asarray(action)
        if action_np.ndim == 0:
            action_np = np.reshape(action_np, (1, 1))
        elif action_np.ndim == 1:
            action_np = action_np.reshape(1, -1)
        elif action_np.ndim >= 2:
            action_np = action_np[0:1, :]

        # Environment step
        try:
            next_state, reward, done = world.step(action_np)
        except Exception as e:
            try:
                next_state, reward, done = world.step(np.squeeze(action_np))
            except Exception as e2:
                print("Error stepping env:", e, e2)
                break

        state = np.expand_dims(next_state, 0) if next_state.ndim == 1 else next_state
        step_idx += 1
        time.sleep(sleep_time)

    plt.close(fig)

    # Save visualization
    if save_gif and frames:
        gif_dir = "output_gif"
        os.makedirs(gif_dir, exist_ok=True)
        gif_path = os.path.join(gif_dir, f"{episode_dir}.gif")
        imageio.mimsave(gif_path, frames, fps=25)
        print(f"Saved visualization → {gif_path}")

    print("Finished visualization.")

# Run visualization
run_policy(save_gif=True, fast_mode=True)


In [None]:
#@title **Episodes Reward Summary**
all_succ = []
root_folder = "logs/mujoco_swimmer_tutorial"

all_folders = [os.path.join(root_folder, x) for x in os.listdir(root_folder) if 'episode' in x]
all_folders.sort(key=lambda x: int(x.split('_')[-1]))
for folder in all_folders:
    # Training logs: collect every "Total reward: <value>" line in each
    # training rollout file and average the values per file.
    rewards_succ = []
    rewards_fail = []
    for filename in os.listdir(folder):
        if 'training' in filename:
            with open(os.path.join(folder, filename), 'r') as f:
                lines = f.readlines()
                rewards = []
                for line in lines:
                    if 'Total reward' in line:
                        total_reward = float(line.split()[-1])
                        rewards.append(total_reward)
                if rewards:  # prevent empty list
                    rewards_succ.append(np.mean(rewards))

    # Evaluation files (separate loop, not "else")
    curr_episode_rewards = []
    for filename in os.listdir(folder):
        if 'evaluation' in filename:
            with open(os.path.join(folder, filename), 'r') as f:
                lines = f.readlines()
                curr_rewards = []
                for line in lines[1:]:
                    try:
                        curr_rewards.append(float(line.split('|')[-1]))
                    except ValueError:
                        continue  # skip malformed lines
                if curr_rewards:
                    curr_episode_rewards.append(np.sum(curr_rewards))
    if curr_episode_rewards:
        rewards_succ.append(np.mean(curr_episode_rewards))

    print(rewards_succ)
    # print(rewards_fail)


    all_rewards = rewards_succ + rewards_fail

    print("Average reward for all episodes:", np.mean(all_rewards))
    print("Standard deviation of reward for all episodes:", np.std(all_rewards))
    print("------------------")

    if 'descent' in root_folder:
        all_succ.append(1500 - np.mean(all_rewards))
    else:
        all_succ.append(np.mean(all_rewards))
# print(all_succ)
print(max(all_succ))
for i in range(len(all_succ)):
    if all_succ[i] >= max(all_succ) * 0.95:
        print(i + 1)
        break

In [None]:
#@title **Reward Curve**
episodes = list(range(1, len(all_succ) + 1))

# Creating the plot
plt.figure(figsize=(15, 6))

# Plot the main line with better styling
plt.plot(episodes, all_succ, linewidth=2, color='red', label="Swimmer-v5 LLM Optimization Tutorial - Learning Curve")

plt.xlabel("Episodes", fontsize=12)
plt.ylabel("Reward", fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout(pad=3.0)

os.makedirs('results_curves', exist_ok=True)
plt.savefig(f'results_curves/{root_folder.split("/")[1]}.png', dpi=300)
%matplotlib inline
plt.show()

The `Reward Curve` plotted over the training episodes illustrates the progressive improvement of the learned policy. Typically, in early stages of training, rewards fluctuate due to random initialization and broad exploration. As the LLM repeatedly refines the parameters, the curve begins to rise steadily, indicating that the model is discovering more effective control strategies. Eventually, the reward curve stabilizes, reflecting convergence toward a locally optimal linear policy for the environment.
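
Because early rewards fluctuate, a smoothed view can make the trend easier to read. The sketch below shows a simple moving average and the "first episode within 95% of the best reward" check used in the summary cell, written as stand-alone helpers in plain Python. The function names and sample values are hypothetical, and, as in the summary cell, the threshold assumes the best reward is positive.

```python
def moving_average(values, window):
    """Mean of each consecutive window-sized slice of values."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


def first_within(values, fraction=0.95):
    """1-based index of the first value reaching fraction * max(values)."""
    threshold = max(values) * fraction
    for index, value in enumerate(values, start=1):
        if value >= threshold:
            return index
    return None


rewards = [1.0, 3.0, 2.0, 6.0, 4.0]  # made-up values for illustration
print(moving_average(rewards, 3))
print(first_within(rewards))
```

Passing `all_succ` from the summary cell to `moving_average` would give a smoothed series that can be plotted alongside the raw curve.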

## **Summary**
This demo illustrates how ProPS+, an extension of the Prompted Policy Search (ProPS) framework, integrates numerical optimization with semantically informed, task-specific knowledge to accelerate LLM-driven reinforcement learning. While ProPS relies solely on replay-buffer data linking policy parameters to rewards, ProPS+ augments this with rich environment descriptions, structural knowledge, and human-provided behavioral hints, enabling the LLM to perform both numerical and linguistic reasoning during policy optimization. In the context of the Swimmer-v5 locomotion task, the LLM receives not only the replayed history of parameter-reward pairs but also an explicit description of the agent’s morphology, state variables, actuator roles, and reward structure. This added semantic grounding guides the model toward more purposeful exploration and more informed parameter refinement, helping it infer how different policy weights influence forward motion, torque efficiency, and swimming dynamics.

Beyond this Swimmer demonstration, ProPS+ generalizes to a wide range of continuous-control and decision-making tasks where environment semantics play a meaningful role. By embedding structural descriptions, domain knowledge, or task constraints directly inside the prompt, ProPS+ enables LLMs to reason about optimal behavior at a conceptual level while still performing numerical optimization over policy parameters. This hybrid reasoning capability illustrates the flexibility of the ProPS+ framework, showing how linguistic priors can complement in-context numerical learning to produce more sample-efficient and interpretable policy-improvement loops.

## **Conclusion**
This demonstration shows that LLMs, when equipped with semantically enriched prompts, can serve as powerful policy optimizers, which is the central insight of the ProPS+ framework. Without gradients, environment internals, or model-based planners, the LLM learns to propose improved policy parameters by synthesizing replay-buffer outcomes with detailed knowledge about the Swimmer’s physics, morphology, and control objectives. While the model does not internally simulate MuJoCo dynamics, its ability to leverage both contextual numerical patterns and linguistic domain knowledge enables it to approximate effective policy updates that steadily increase forward locomotion reward.

This highlights how ProPS+ advances the original ProPS approach by making LLM-driven reinforcement learning more efficient, interpretable, and guided. The LLM not only reacts to past performance but also reasons about the meaning of state variables, the role of torques, and the physical constraints of swimming. This synergy between language-based reasoning and numerical optimization demonstrates a practical and compelling path for using LLMs as unified RL optimizers—capable of learning complex continuous control behavior through structured prompts alone.

<br>

For more details, please refer to the [ProPS Project Page](https://props-llm.github.io/) and the associated research paper.

## **References**
Zhou, Y., Grover, S., El Mistiri, M., Kalirathnam, K., Kerhalkar, P., Mishra, S., Kumar, N., Gaurav, S., Aran, O., & Ben Amor, H. (2025). Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs. Advances in Neural Information Processing Systems (NeurIPS 2025).