### Example: Aligning RL Policies with PFM

In this example, we demonstrate how PFM can improve action planning. To illustrate the approach, we use the `BipedalWalker` environment, a physics-based simulation built on Box2D.

Aligning reinforcement learning policies presents a unique challenge: preference datasets typically consist of trajectory pairs, requiring alignment methods to extract a preference signal from these samples. The policy is then adjusted to improve action planning based on this learned preference.

We adopt a pre-trained `DDPG` policy provided by `Stable-Baselines3`. Run the cell below to see how the pre-trained reference policy interacts with the environment.

In [None]:
import gymnasium as gym
import matplotlib.pyplot as plt
from IPython.display import clear_output

from huggingface_sb3 import load_from_hub
from stable_baselines3 import DDPG

from dataset import BipedalWalkerResetWrapper


env = gym.make('BipedalWalker-v3', render_mode="rgb_array")
env = BipedalWalkerResetWrapper(env)

checkpoint = load_from_hub(
    repo_id="sb3/ddpg-BipedalWalker-v3",
    filename="ddpg-BipedalWalker-v3.zip",
)
policy = DDPG.load(checkpoint)

done = False
obs, _ = env.reset()

while not done:
    action, _ = policy.predict(obs, deterministic=False)
    obs, _, terminated, truncated, _ = env.step(action)
    done = (terminated or truncated)
    
    clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()

### Preference Dataset Collection

Following prior work on preference-based reinforcement learning (PbRL), we first randomly choose a starting state $s_{0} \sim S$ and sample two trajectories $\tau^{+}, \tau^{-}$, where the preference $\tau^{+} \succ \tau^{-}$ is assigned by a scripted teacher. For simplicity, we define a sample "jumping reward" that encourages the agent to jump more while walking.
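
To make the role of the scripted teacher concrete, here is a minimal sketch of how one pair of segments could be labeled. It assumes the teacher sums a per-state reward over each segment and prefers the segment with the larger total; the internals of `generate_pbrl_dataset` are not shown in this notebook and may differ. The names `segment_score`, `label_pair`, and `toy_reward` are hypothetical and used only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
# A segment is (visited states, actions taken); this layout is an assumption.
Segment = Tuple[List[State], List[Sequence[float]]]


def segment_score(states: Sequence[State], reward_fn: Callable[[State], float]) -> float:
    """Sum the teacher's per-state reward over one segment."""
    return sum(reward_fn(s) for s in states)


def label_pair(
    seg_a: Segment, seg_b: Segment, reward_fn: Callable[[State], float]
) -> Tuple[Segment, Segment]:
    """Return (preferred, rejected); ties keep the original order."""
    score_a = segment_score(seg_a[0], reward_fn)
    score_b = segment_score(seg_b[0], reward_fn)
    return (seg_a, seg_b) if score_a >= score_b else (seg_b, seg_a)


def toy_reward(state: State) -> float:
    """Hypothetical reward: squared value of the entry at index 3."""
    return state[3] ** 2


# Two 2-step segments whose states differ only at index 3.
flat = ([[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.1, 0.0]], [[0.0] * 4, [0.0] * 4])
bouncy = ([[0.0, 0.0, 0.1, 0.3], [0.0, 0.0, 0.1, -0.2]], [[1.0] * 4, [1.0] * 4])

tau_plus, tau_minus = label_pair(flat, bouncy, toy_reward)
```

Any per-state reward function can be passed as `reward_fn`, including the jumping reward defined in the next cell.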

In [None]:
from dataset import generate_pbrl_dataset


def jump_reward(state):
    """
    Sample reward function to use as a scripted teacher.
    You can replace this function with any other models of your interest.
    """
    vel_y = state[3]  # vertical velocity of the hull in the BipedalWalker observation
    return abs(vel_y * 100) ** 2


seg_len = 10        # segment length for action sequence planning and learning
num_pairs = 1000    # number of sample pairs to collect

dataset = generate_pbrl_dataset(
    env, policy, jump_reward, 
    seg_len=seg_len,
    num_pairs=num_pairs,
)

### Training PFM for RL Policy Alignment

Since our dataset consists of trajectory-level preference pairs $(x, \tau^{+}, \tau^{-})$, where the context $x$ is the current state observation $s_{0}$, and

$$
\begin{align}
\tau^{+} &:= (a_{0}, a_{1}, \cdots, a_{\ell}) \sim \pi_{\mathrm{ref}}(\cdot | s_{0})\\
\tau^{-} &:= (a_{0}', a_{1}', \cdots, a_{\ell}') \sim \pi_{\mathrm{ref}}(\cdot | s_{0})
\end{align}
$$

with fixed length $\ell \geq 2$ (denoted by `seg_len` in the code), we directly learn a flow over action trajectories, from $\tau^{-}$ to $\tau^{+}$, conditioned on the initial state $s_{0}$. Hence, the input dimension of the flow matching module is `seg_len` $\times$ `action_dim`. We use a simple multi-layer perceptron (MLP) as the flow matching module.
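
As a rough guide to what the flow matching module regresses, the objective can be written as follows. This is a sketch under the assumption of a straight-line path between the two trajectories with no added noise; the `OptimalTransportConditionalFlowMatching` class used in this notebook may add noise or re-pair samples within a mini-batch. The symbols $v_{\theta}$ (the MLP velocity field) and $\tau_{t}$ (the interpolated trajectory) are notation introduced here.

$$
\begin{align}
\tau_{t} &:= (1 - t)\,\tau^{-} + t\,\tau^{+}, \qquad t \sim \mathcal{U}[0, 1]\\
\mathcal{L}(\theta) &:= \mathbb{E}_{t,\,(s_{0}, \tau^{+}, \tau^{-})} \left\| v_{\theta}(t, \tau_{t} \mid s_{0}) - (\tau^{+} - \tau^{-}) \right\|^{2}
\end{align}
$$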

Within this formulation, training a flow matching module takes only a few lines of code:

In [None]:
from warnings import filterwarnings
filterwarnings('ignore', category=DeprecationWarning)

import torch

from flow import OptimalTransportConditionalFlowMatching
from models import MLP

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_dim = env.action_space.shape[0] * seg_len
state_dim = env.observation_space.shape[0]

flow_model = MLP(input_dim, context_dim=state_dim).to(device)
flow_matching = OptimalTransportConditionalFlowMatching(flow_model, device=device)

trained_model, _ = flow_matching.fit(
    dataset,
    num_epochs=10,
    batch_size=100,
    learning_rate=1e-3,
    conditional=True,
)

### Interacting with the RL Environment Using PFM

For inference at a given state $s_{t}$, we sample an action trajectory $\tau = (a_{t}, \cdots, a_{t+\ell})$ from the reference policy $\pi_{\mathrm{ref}}(\cdot|s_{t})$ and apply flow matching to obtain a better action sequence. We then take the first action of the resulting sequence as the final action in the current state $s_{t}$. This process is repeated for every state observation $s_{t}$, and it requires an environment dynamics model to roll out a sample trajectory with the reference policy. We therefore provide the PFM policy with a copy of the environment and the reference policy. The whole procedure is wrapped in the `FlowPolicy` class.
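
The inference step described above can be illustrated with a small, dependency-free sketch. It assumes the learned flow is integrated from $t = 0$ to $t = 1$ with fixed-step Euler updates, which is one plausible reading of `use_torchdiffeq=False`; the actual `FlowPolicy` implementation is not shown here and may differ. The functions `euler_flow`, `first_action`, and `toy_velocity` are hypothetical stand-ins, and `toy_velocity` replaces the trained model with a constant field.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow(
    x0: Vector,
    velocity: Callable[[float, Vector], Vector],
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(t, x) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(k * dt, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def first_action(flat_trajectory: Vector, action_dim: int) -> Vector:
    """Take the first action from a flattened (seg_len * action_dim) trajectory."""
    return flat_trajectory[:action_dim]


def toy_velocity(t: float, x: Vector) -> Vector:
    """Stand-in for the trained flow model: shifts every coordinate by +1 over [0, 1]."""
    return [1.0] * len(x)


seg_len, action_dim = 3, 4  # BipedalWalker actions are 4-dimensional
tau = [0.0] * (seg_len * action_dim)  # flattened trajectory sampled from the reference policy
improved = euler_flow(tau, toy_velocity, num_steps=10)
action = first_action(improved, action_dim)  # only this action is executed in the environment
```

Only the first action is executed; at the next state a fresh trajectory is sampled and the same procedure is repeated.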

In [None]:
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"

import matplotlib.pyplot as plt
from IPython.display import clear_output

from flow import FlowPolicy

done = False
obs, _ = env.reset()

# A wrapper policy class that improves action planning
# using the trained flow matching module.
flow_policy = FlowPolicy(env, flow_matching, policy, seg_len=seg_len)

# Environment interaction using the PFM policy
while not done:
    action = flow_policy(obs, use_torchdiffeq=False)
    obs, _, terminated, truncated, _ = env.step(action)
    done = (terminated or truncated)
    
    clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()
    
env.close()
