Copyright 2024 Google LLC.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## **Demo:** Language Model Predictive Control

**What is LMPC?**

Language Model Predictive Control (LMPC) is a method for improving the teachability (i.e., fast adaptation to language feedback) of robot code-writing LLMs. The key observation is that when human-robot interactions (HRI) are formulated as a partially observable Markov decision process (POMDP), in which human language inputs are observations and robot code outputs are actions, training an LLM to autoregressively complete previous interactions can be viewed as training a transition dynamics model. That model can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to preferred outcomes (which are also predicted by the model). Specifically, LMPC fine-tunes an LLM to predict imagined future rollouts of language-based human-robot interactions. At inference time, it samples multiple futures (with non-zero decoding temperature), searches for the best one, and takes the next action (i.e., receding horizon control as a decoding strategy).
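
The decoding strategy described above can be sketched without any LLM. In the minimal example below, `sample_rollout` is a hypothetical stand-in for a fine-tuned model that imagines one future; the action names and the 0.7 success probability are made up for illustration. The planner samples several futures, prefers the shortest one predicted to succeed, and executes only its first action.

```python
import random


def sample_rollout(history, rng):
  """Hypothetical stand-in for an LLM: imagines a future as (actions, success)."""
  length = rng.randint(1, 4)
  actions = [f"action_{len(history) + i}" for i in range(length)]
  success = rng.random() < 0.7  # Made-up probability, for illustration only.
  return actions, success


def plan_next_action(history, num_rollouts=4, seed=0):
  """Samples several imagined futures and returns the first action of the best."""
  rng = random.Random(seed)
  rollouts = [sample_rollout(history, rng) for _ in range(num_rollouts)]
  successful = [actions for actions, success in rollouts if success]
  # Prefer the shortest future predicted to succeed; otherwise fall back to
  # the last sampled future.
  best = min(successful, key=len) if successful else rollouts[-1][0]
  # Receding horizon: commit only to the first action, then re-plan next turn.
  return best[0]


history = ["user: navigate to the bottom left"]
next_action = plan_next_action(history)
print(next_action)
```

Only the first imagined action is executed; after the next real observation arrives, planning starts again from the updated history.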

This is an open-source implementation of the work: ["Learning to Learn Faster from Human Feedback with Language Model Predictive Control"](https://robot-teaching.github.io/)

**What's in this notebook?**

Fine-tune LLMs with LMPC to improve teachability for simple 2D goal-driven navigation with simulated language feedback.

**Important:** this notebook was written for **illustrative purposes** only, to show one way that LMPC can be implemented. This toy environment is far from perfect at eliciting the strengths of LMPC, and it over-simplifies the HRI setting. In fact, LMPC can be quite sensitive to small tweaks to the environment, and the performance improvements are relatively marginal. Creating meaningful toy environments for HRI is still an open problem that we leave for future extensions. Note that the code in this notebook was also written for hackability rather than speed.

### **Quick Start:**

**Step 1.** Register for an [OpenAI API key](https://openai.com/blog/openai-api/) to use GPT-3.5 and enter it below

**Step 2.** Menu > Runtime > Run all (takes about 40 mins to complete everything)

In [None]:
openai_api_key = ""

### Setup

In [None]:
!pip install openai==1.12.0

import time

import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)

### Helper Functions

In [None]:
BASE_MODEL = "gpt-3.5-turbo-1106"  # This is also the model we are fine-tuning.

def LLM(messages, model=BASE_MODEL, stop=None, max_tokens=256, temperature=0.3):
  responses = client.chat.completions.create(model=model, messages=messages, max_tokens=max_tokens, temperature=temperature, stop=stop)
  text = responses.choices[0].message.content
  return text

# Test LLM.
LLM([{"role": "user", "content": "hello world!"}])

## **Environment**

This toy 2D goal-driven navigation environment starts with a language instruction (e.g. "navigate to the bottom left"). The agent (LLM) takes the instruction as input and outputs simple navigation code, written as a sequence of primitive function calls such as `move_left()`. If the code does not bring the agent to the goal, the environment simulates language feedback at every subsequent timestep. The objective of LMPC is to enable the agent to minimize the average number of language inputs (e.g. corrections) needed before it successfully reaches the goal.
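
As a minimal sketch of how a string of primitive calls becomes a displacement, the example below executes generated code against a small agent object. The names `Agent` and `run_navigation_code` are hypothetical and chosen for illustration; passing an explicit namespace to `exec` is a design choice of this sketch, so that the generated code only sees the name `agent` plus Python builtins.

```python
class Agent:
  """Hypothetical helper that accumulates the displacement requested by code."""

  def __init__(self):
    self.delta = [0.0, 0.0]  # (x, y) displacement.

  def move_up(self):
    self.delta[1] += 0.5

  def move_down(self):
    self.delta[1] -= 0.5

  def move_left(self):
    self.delta[0] -= 0.5

  def move_right(self):
    self.delta[0] += 0.5

  def wait(self):
    pass


def run_navigation_code(code):
  """Executes generated code and returns (displacement, is_valid)."""
  agent = Agent()
  try:
    # Explicit namespace: the code can reference `agent` and builtins only.
    exec(code, {"agent": agent})
  except Exception:
    # Malformed code keeps whatever motion was accumulated before the error.
    return agent.delta, False
  return agent.delta, True


delta, is_valid = run_navigation_code("agent.move_down()\nagent.move_left()")
print(delta, is_valid)
```

Note that `exec` is not a sandbox; this is only reasonable for a toy setting where the generated code is trusted.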

Human guidance is imperfect in real HRI settings, and this is modeled in the toy environment as noise on the feedback simulator. This notebook shows how LMPC can respond better to navigation feedback by using top-user conditioning, which (i) identifies top users (by performance on training tasks), (ii) groups their data together under a special "top-user" username (the code below uses the label `expert`), then (iii) conditions inference-time LMPC rollouts on this special username (i.e., assumes everyone is a top user). In the toy setting (as of Feb 2024), fine-tuning `gpt-3.5-turbo-1106` reduces the average number of language inputs before success from 2.4 to 2.2, and improves the success rate on held-out test tasks.
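
Steps (i) and (ii) can be sketched as follows. This notebook hard-codes its top users in `TOP_USERS`, so ranking users by success rate is an assumption of this example, and `find_top_users` and `relabel` are hypothetical helper names.

```python
from collections import defaultdict


def find_top_users(users, successes, top_k=1):
  """Ranks users by success rate (ties broken by name) and keeps the top_k."""
  per_user = defaultdict(list)
  for user, success in zip(users, successes):
    per_user[user].append(bool(success))
  rate = {user: sum(s) / len(s) for user, s in per_user.items()}
  return sorted(rate, key=lambda user: (-rate[user], user))[:top_k]


def relabel(message, user, top_users, top_name="expert"):
  """Replaces the generic "user:" prefix with the (possibly shared) username."""
  name = top_name if user in top_users else user
  return message.replace("user:", f"{name}:")


users = ["user-noise-0.0", "user-noise-0.0", "user-noise-0.6", "user-noise-0.6"]
successes = [True, True, True, False]
top_users = find_top_users(users, successes)
print(top_users)
print(relabel("user: go north", "user-noise-0.0", top_users))
print(relabel("user: go north", "user-noise-0.6", top_users))
```

At inference time, step (iii) then amounts to always using the shared top-user label, regardless of who is actually giving the feedback.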

In [None]:
MAX_NUM_FEEDBACK = 5  # Number of human language inputs (not including initial instruction).
TOP_USERS = ["user-noise-0.0"]
TEST_TASKS = ["top left", "left"]

In [None]:
class SimpleParticleNav:

  def __init__(self, is_real_user=False):
    self.goals = {'center':   (0, 0),
                  'bottom left': (-0.5, -0.5),
                  'bottom': (0, -0.5),
                  'bottom right': (0.5, -0.5),
                  'right': (0.5, 0),
                  'top right': (0.5, 0.5),
                  'top':      (0, 0.5),
                  # Test tasks (compositional generalization):
                  'top left': (-0.5, 0.5),
                  'left':     (-0.5, 0),}
    self.is_real_user = is_real_user

  def reset(self, noise=0):
    self.noise = noise  # Noise on the feedback.
    self.terminated = False
    self.num_steps = 0

    # Build a valid random initialization that is not immediately success.
    while True:
      self.init_name = np.random.choice(list(self.goals.keys()))
      self.agent_pos = np.float32(self.goals[self.init_name])
      self.goal_name = np.random.choice(list(self.goals.keys()))
      self.goal_pos = np.float32(self.goals[self.goal_name])
      if self.is_success() is None:  # None means episode is still ongoing.
        break

    self.path = self.agent_pos.copy()
    state = f"new episode: the agent is at the {self.init_name}"
    instruction = f"user: navigate to the {self.goal_name} goal"
    return state, instruction, self.is_success()

  def step(self, act):
    self.agent_pos += act
    self.agent_pos = np.clip(self.agent_pos, -1, 1)  # Note: nonlinearity.
    self.path = np.vstack((self.path, self.agent_pos.copy()))
    self.num_steps += 1
    return self.get_feedback(), self.is_success()

  def get_feedback(self):
    if self.is_real_user:
      self.render()
      return f"user: {input('Enter next instruction: ')}"

    # Random noise on where the user thinks the goal is.
    noise_goal_pos = self.goal_pos if np.random.rand() > self.noise else self.goals[np.random.choice(list(self.goals.keys()))]
    if noise_goal_pos[1] > self.agent_pos[1]:
      return "user: go north"  # Slightly less trivial mapping of feedback to code.
    elif noise_goal_pos[1] < self.agent_pos[1]:
      return "user: go south"
    elif noise_goal_pos[0] > self.agent_pos[0]:
      return "user: go east"
    elif noise_goal_pos[0] < self.agent_pos[0]:
      return "user: go west"
    else:
      return "user: do not move"

  def is_success(self):
    if np.all(np.isclose(self.goal_pos, self.agent_pos)):
      return True
    if self.num_steps > MAX_NUM_FEEDBACK:
      return False
    return None  # Episode is still ongoing.

  def render(self):
    plt.scatter(self.goal_pos[0], self.goal_pos[1], c="tab:green", s=300, alpha=0.5)
    plt.plot([-1, -1, 1, 1, -1],
             [-1, 1, 1, -1, -1], c="#dddddd", linewidth=2)
    plt.plot(self.path[:, 0], self.path[:, 1], c="tab:blue", alpha=0.5, linewidth=2, linestyle='dashed')
    plt.scatter(self.path[-1, 0], self.path[-1, 1], c="tab:blue", s=50, alpha=0.5, zorder=10)
    plt.axis('equal')
    plt.axis('off')
    plt.show()


class ParticleAgent:
  """Simple helper class that holds the next action from exec()."""

  def __init__(self):
    self.delta = np.float32([0, 0])

  def move_up(self):
    self.delta[1] += 0.5

  def move_down(self):
    self.delta[1] -= 0.5

  def move_left(self):
    self.delta[0] -= 0.5

  def move_right(self):
    self.delta[0] += 0.5

  def wait(self):
    pass


def code_to_actions(code):
  agent = ParticleAgent()
  try:
    exec(code)
  except Exception as e:  # Generated code may be malformed; keep any partial motion.
    print("Invalid code:", e)
  return agent.delta

## **Prompt**

The policy is driven by a prompt passed as input to an instruction-tuned code-writing LLM. The prompt contains a preamble that describes the environment, the API functions available to the agent, and an example of interacting with a user.

Note this assumes that the LLM (with a prompt) already has some base level performance on code-writing to achieve non-zero success during data collection. Noise is only added to the user feedback from the environment -- future extensions of this notebook may consider adding noise on the agent's actions as well to simulate imperfect human-robot interactions.

In [None]:
PROMPT = [{"role": "system", "content": "You are an agent that navigates to a goal location, and can call the following functions: move_up(), move_down(), move_left(), move_right(), wait(). Please write code according to user feedback to navigate to the goal. For example:"},
          {"role": "user", "content": "new episode: the agent is at the top right"},
          {"role": "assistant", "content": "user: navigate to the bottom"},
          {"role": "assistant", "content": "agent.move_down()\nagent.move_left()"},
          {"role": "assistant", "content": "user: go south"},
          {"role": "assistant", "content": "agent.move_down()"}]

## **Data Collection (2 mins)**

Collect 25 episodes (chat sessions) with each simulated user, where users differ in how noisy their feedback is.

Note that we prompt the chat completion API with every dialogue message (both feedback and code) labeled as coming from "assistant." As of Feb 2024, this is needed so that we can fine-tune the model to predict "what a user might say." Labeling the feedback as "user" messages instead does not work (it is possible that OpenAI model training incurs different completion losses on tokens from "assistant" vs "user").
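
The resulting message layout can be sketched as below. The helper name `build_episode` is hypothetical; the roles mirror the convention used in this notebook, where the state description is a "user" message and every feedback and code message is an "assistant" message.

```python
def build_episode(state, turns, name="user"):
  """Builds chat messages from a state and a list of (feedback, code) turns."""
  episode = [{"role": "user", "content": state}]
  for feedback, code in turns:
    # Feedback is also stored under the "assistant" role, so that a fine-tuned
    # model learns to predict what the user might say next.
    episode.append({"role": "assistant", "content": f"{name}: {feedback}"})
    episode.append({"role": "assistant", "content": code})
  return episode


episode = build_episode(
    "new episode: the agent is at the top right",
    [("navigate to the bottom goal", "agent.move_down()\nagent.move_left()"),
     ("go south", "agent.move_down()")])
for message in episode:
  print(message)
```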

In [None]:
def collect_data(policy, name="user", is_real_user=False):
  data = {"user": [], "task": [], "session": [], "chat_length": [], "success":[]}

  env = SimpleParticleNav(is_real_user=is_real_user)
  for noise in [0, 0.3, 0.6, 0.8]:
    user = f"user-noise-{noise:.1f}"
    for _ in range(25):

      # New episode (chat session).
      episode = []  # Tracks messages in the current episode.
      state, feedback, success = env.reset(noise=noise)
      print(f"\nUser: {user}:\nDescription: {state}")
      episode.append({"role": "user", "content": state})

      # Dialogue between agent and user.
      while success is None:
        feedback = feedback.replace("user:", f"{name}:")
        print(f"  {feedback}")

        # Policy.
        episode.append({"role": "assistant", "content": feedback})
        code = policy(episode)
        print(f"  code:\t{code}".replace("\n", "\n\t"))
        episode.append({"role": "assistant", "content": code})
        act = code_to_actions(code)

        # Step environment.
        feedback, success = env.step(act)

      data["user"].append(user)
      data["task"].append(env.goal_name)
      data["session"].append(episode)
      data["chat_length"].append(env.num_steps)
      data["success"].append(success)

      print("Success:", success)
      env.render()

  return data


np.random.seed(42)
train_data = collect_data(policy=lambda x: LLM(PROMPT + x), name="user")


### Show Metrics

Show some statistics from the training data: avg success rates and chat lengths.
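
As a small worked example of the two statistics, the sketch below computes the success rate over all episodes and the average chat length over successful episodes only, which matches how the metrics cell below filters chat lengths. The helper name `summarize` is hypothetical, and returning NaN for empty inputs is a choice made for this example.

```python
import math


def summarize(chat_lengths, successes):
  """Returns (success rate, average chat length over successful episodes)."""
  num_episodes = len(successes)
  if num_episodes == 0:
    return math.nan, math.nan
  success_rate = sum(bool(s) for s in successes) / num_episodes
  # Failed episodes always run to the feedback limit, so they are excluded
  # from the chat length average.
  lengths = [l for l, s in zip(chat_lengths, successes) if s]
  avg_length = sum(lengths) / len(lengths) if lengths else math.nan
  return success_rate, avg_length


success_rate, avg_length = summarize([1, 3, 6], [True, True, False])
print(f"Success: {success_rate * 100:.0f}%  Avg Chat Length: {avg_length:.1f}")
```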

In [None]:
def task_to_split(task):
  return 'test' if task in TEST_TASKS else 'train'

def show_metrics(data):
  # Show overall success and chat length metrics.
  print("Overall:", f"\n\t\t\tSuccess: {np.mean(data['success'])*100:.0f}%", f"\tAvg Chat Length: {np.mean([l for l, s in zip(data['chat_length'], data['success']) if s]):.1f}")

  # Show success and chat length metrics by user.
  user_to_success = {user: [] for user in set(data["user"])}
  user_to_chat_length = {user: [] for user in set(data["user"])}
  for user, chat_length, success in zip(data["user"], data["chat_length"], data["success"]):
    user_to_success[user].append(success)
    if success:
      user_to_chat_length[user].append(chat_length)
  print("By users:")
  for user in sorted(list(user_to_success.keys())):
    print("  ", user, f"\tSuccess: {np.mean(user_to_success[user])*100:.0f}%", f"\tAvg Chat Length: {np.mean(user_to_chat_length[user]):.1f}")

  # Show success and chat length metrics by train:test task split.
  split_to_success = {"train": [], "test": []}
  split_to_chat_length = {"train": [], "test": []}
  for task, chat_length, success in zip(data["task"], data["chat_length"], data["success"]):
    if task in TEST_TASKS:
      split_to_success["test"].append(success)
      if success:
        split_to_chat_length["test"].append(chat_length)
    else:
      split_to_success["train"].append(success)
      if success:
        split_to_chat_length["train"].append(chat_length)
  print("By tasks split:")
  for split in ["train", "test"]:
    print("  ", split, f"\t\tSuccess: {np.mean(split_to_success[split])*100:.0f}%", f"\tAvg Chat Length: {np.mean(split_to_chat_length[split]):.1f}")

  # Show success and chat length split by both (user and test/train):
  user_to_splits = {user: {} for user in set(data["user"])}
  for user in user_to_splits.keys():
    user_to_splits[user] = {"train": {"chat_length": [], "success": []}, "test": {"chat_length": [], "success": []}}
  for user, task, chat_length, success in zip(data["user"], data["task"], data["chat_length"], data["success"]):
    split = task_to_split(task)
    user_to_splits[user][split]['success'].append(success)
    if success:
      user_to_splits[user][split]['chat_length'].append(chat_length)
  print ("By users and splits:")
  for user in sorted(list(user_to_splits.keys())):
    for split in ["train", "test"]:
      print("  ", f"{user} ({split})", f"\t\tSuccess: {np.mean(user_to_splits[user][split]['success'])*100:.0f}%", f"\tAvg Chat Length: {np.mean(user_to_splits[user][split]['chat_length']):.1f}")
    print()

show_metrics(train_data)

## **Train LMPC (15 mins)**

Fine-tunes an LLM (GPT-3.5) for LMPC using the OpenAI API.

### Prepare Data

Formats episodes (chat sessions) as .jsonl files that the OpenAI API expects for fine-tuning.
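
A minimal sketch of this file layout: each line is one JSON object with a `messages` list of chat messages. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and an in-memory buffer is used here in place of a real file.

```python
import io
import json


def write_jsonl(sessions, f):
  """Writes one {"messages": [...]} JSON object per line."""
  for session in sessions:
    f.write(json.dumps({"messages": session}) + "\n")


def read_jsonl(f):
  """Reads the records back, skipping blank lines."""
  return [json.loads(line) for line in f if line.strip()]


sessions = [
    [{"role": "user", "content": "new episode: the agent is at the top"},
     {"role": "assistant", "content": "expert: navigate to the bottom goal"},
     {"role": "assistant", "content": "agent.move_down()\nagent.move_down()"},
     {"role": "assistant", "content": "success: True"}],
    [{"role": "user", "content": "new episode: the agent is at the center"},
     {"role": "assistant", "content": "expert: navigate to the right goal"},
     {"role": "assistant", "content": "agent.move_right()"},
     {"role": "assistant", "content": "success: True"}],
]

buffer = io.StringIO()
write_jsonl(sessions, buffer)
buffer.seek(0)
records = read_jsonl(buffer)
print(len(records), "records")
```

Reading the file back before uploading is a cheap way to confirm that every line parses and that newlines inside code strings were escaped by `json.dumps`.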

Note that we are also fine-tuning the LLM to be user-conditioned, and re-labeling top-performing users (e.g. the user with zero feedback noise) as "experts." We will later use this at inference time to drive performance improvements.

In [None]:
import json
import copy

train_sessions = []
test_sessions = []
for user, task, messages, chat_length, success in zip(train_data["user"], train_data["task"], train_data["session"], train_data["chat_length"], train_data["success"]):
  name = "expert" if user in TOP_USERS else user  # Re-label top users as "experts."
  session = copy.deepcopy(PROMPT)
  for m in messages:
    # Copy each message so that the sessions stored in train_data are not modified.
    session.append({"role": m["role"], "content": m["content"].replace("user:", f"{name}:")})
  session.append({"role": "assistant", "content": f"success: {success}"})
  if task in TEST_TASKS:
    test_sessions.append(session)
  else:
    train_sessions.append(session)

# Save to .jsonl files.
with open('train-lmpc.jsonl', 'w') as f:
  for i, session in enumerate(train_sessions):
    print(session)
    json.dump({"messages": session}, f)
    if i < len(train_sessions) - 1:
      f.write('\n')
  print("Train dataset size:", len(train_sessions))

with open('test-lmpc.jsonl', 'w') as f:
  for i, session in enumerate(test_sessions):
    print(session)
    json.dump({"messages": session}, f)
    if i < len(test_sessions) - 1:
      f.write('\n')
  print("Test dataset size:", len(test_sessions))

### Fine-tune LLM

Uploads training data to OpenAI then starts a fine-tuning job.
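
Waiting for a long-running job is safer when the loop also stops on failure and on a timeout. The sketch below shows one way to do this with a generic `retrieve` callable, so that it can be exercised with a stub instead of a real API client. The helper name `wait_for_job`, the timeout value, and the stub are assumptions of this example; the terminal status names are the ones the OpenAI fine-tuning API is understood to report.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
  """Polls retrieve(job_id) until the job reaches a terminal status."""
  waited = 0
  while True:
    job = retrieve(job_id)
    if job.status in TERMINAL_STATUSES:
      return job
    if waited >= timeout_seconds:
      raise TimeoutError(f"Job {job_id} still '{job.status}' after {waited}s.")
    sleep(poll_seconds)
    waited += poll_seconds


# Stub that mimics a job moving through its states (no network calls).
statuses = iter(["queued", "running", "succeeded"])


def fake_retrieve(job_id):
  return SimpleNamespace(id=job_id, status=next(statuses))


job = wait_for_job(fake_retrieve, "job-123", poll_seconds=0, sleep=lambda s: None)
print(job.status)
```

With a real client, `retrieve` could be `client.fine_tuning.jobs.retrieve`, and the caller would check that the returned status is "succeeded" before reading the fine-tuned model name.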

In [None]:
# Upload training data.
with open("train-lmpc.jsonl", "rb") as f:
  train_file = client.files.create(file=f, purpose="fine-tune")
with open("test-lmpc.jsonl", "rb") as f:
  test_file = client.files.create(file=f, purpose="fine-tune")

# Start fine-tuning job.
ftjob = client.fine_tuning.jobs.create(training_file=train_file.id, validation_file=test_file.id, model=BASE_MODEL)

# Track fine-tuning job until it reaches a terminal state.
while ftjob.status not in ("succeeded", "failed", "cancelled"):
  time.sleep(30)
  ftjob = client.fine_tuning.jobs.retrieve(ftjob.id)
  print(time.ctime(), "Status:", ftjob.status)
if ftjob.status != "succeeded":
  raise RuntimeError(f"Fine-tuning job {ftjob.id} ended with status: {ftjob.status}")
fine_tuned_model = ftjob.fine_tuned_model

Test the fine-tuned model.

In [None]:
LLM([{"role": "user", "content": "hello world!"}], fine_tuned_model)

## **LMPC**

Simple illustrative implementation of LMPC.

**Important:** Our HRI experiments show that LMPC improves LLM adaptation to language feedback for robot code-writing, but that is with real humans. It is worthwhile to think about how LMPC improves performance in a toy setting like this one. There are at least 2 key aspects:

* Top user conditioning allows LMPC to generate (and search among) future rollouts with less feedback noise at inference-time.

* LMPC is also a decoding strategy that benefits from sampling.
  * One can also run top user conditioning without MPC (though we observe slightly worse performance).

In [None]:
def LMPC(episode):

  # Do LMPC rollouts.
  num_rollouts = 4
  best_rollout = None
  rollouts = []
  for _ in range(num_rollouts):
    rollout = PROMPT + episode
    num_turns = MAX_NUM_FEEDBACK + 1  # Include initial language instructions.
    num_msgs = 2 * num_turns  # Each chat turn has 2 messages.
    max_preds = num_msgs - len(episode) + 1  # Number of steps into the future.
    for _ in range(max_preds):
      text = LLM(rollout, fine_tuned_model)
      rollout.append({"role": "assistant", "content": text})
      if "success: True" in text:
        rollouts.append(rollout[len(PROMPT)+len(episode):])  # Only look into the future.
        break
    best_rollout = rollout[len(PROMPT)+len(episode):]  # Defaults to the last rollout.

  # Find shortest path to success and take next action.
  if len(rollouts) > 0:
    best_rollout = min(rollouts, key=len)  # Shortest rollout predicted to succeed.
  code = best_rollout[0]["content"]
  return code

## **Evals (20 mins)**

Evaluates fine-tuned LMPC using the same data collection protocol.

Top user conditioned LMPC inference assumes each user is a top user.

In [None]:
np.random.seed(1234)
eval_data = collect_data(policy=LMPC, name="expert")  # Assume user is an "expert" (top user).

In [None]:
print("BEFORE:")
show_metrics(train_data)
print("\nAFTER:")
show_metrics(eval_data)

## **Playground**

Play with the fine-tuned LLM or the base model by stepping into the environment as a real human who provides the feedback.

In [None]:
finetuned_model = True # @param {type:"boolean"}

if finetuned_model:
  tmp_data = collect_data(policy=LMPC, name="user", is_real_user=True)
else:
  tmp_data = collect_data(policy=lambda x: LLM(PROMPT + x), name="user", is_real_user=True)