##### Copyright 2023 Google LLC. SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## **LLMs as General Pattern Machines:** CartPole Environment

Large language models (LLMs) are trained to absorb the myriad patterns woven into the structure of language. We observe that they can autoregressively complete complex token sequences, from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFGs) to richer spatial patterns found in the Abstract Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. These results suggest that, without any additional training, LLMs can serve as general sequence modelers driven by in-context learning. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics.

This Colab explores least-to-most prompting of reward-conditioned trajectories, which can discover and represent closed-loop policies, to learn a stabilizing controller for CartPole in context. Although the approach is difficult to deploy on real systems today because of latency, context size limitations, and compute costs, using LLMs to drive low-level control may provide an exciting glimpse into how the patterns among words could be transferred to actions.

### **Quick Start:**

**Step 1.** Register for an [OpenAI API key](https://openai.com/blog/openai-api/) to use GPT-3 (there's a free trial) and enter it below.

**Step 2.** Menu > Runtime > Run all

In [None]:
openai_api_key = "your-api-key-here"

## **Setup**

This only needs a CPU (public) runtime.

In [None]:
!pip install gymnasium[classic-control]
!pip install "openai<1.0"  # openai.Completion (used below) was removed in openai 1.0.
!pip install transformers
!pip install tiktoken

In [None]:
import time

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import openai
# from transformers import GPT2Tokenizer
import tiktoken

openai.api_key = openai_api_key

## **LLM** Functions
**Note**: this can get expensive. 200 episodes cost a few dollars with text-ada-001, but up to a few hundred dollars with text-davinci-003+.
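
To get a feel for why costs add up, note that the loop later in this notebook makes one LLM call per environment step, each with a context of up to `max_context` tokens. Below is a rough upper bound under the notebook's default settings (100 LLM-driven episodes, at most 200 steps each, 1020 context tokens, 4 completion tokens). The function name `token_upper_bound` is hypothetical, and no pricing is assumed.

```python
def token_upper_bound(llm_episodes=100, max_steps=200,
                      max_context=1020, max_new_tokens=4):
    """Worst-case number of LLM calls and tokens for one full run."""
    calls = llm_episodes * max_steps                  # One call per environment step.
    tokens = calls * (max_context + max_new_tokens)   # Prompt plus completion tokens.
    return calls, tokens

calls, tokens = token_upper_bound()
print(calls, tokens)  # 20000 20480000
```

Actual usage is typically lower, since most rollouts end well before 200 steps.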

In [None]:
# This does GPT-3 inference.
model = "text-ada-001"  # Small GPT-3?
def LLM(prompt, max_tokens=256, stop=None, temperature=0.0):
  while True:
    try:
      response = openai.Completion.create(engine=model, prompt=prompt, max_tokens=max_tokens, temperature=temperature, stop=stop)
      break
    except Exception as e:  # Unlike a bare except, this lets Ctrl-C stop the loop.
      print(f"LLM failed ({e}). Retrying in 10s.")
      time.sleep(10)
  text = [choice['text'] for choice in response['choices']]
  return text if len(text) > 1 else text[0]

# GPT-3 uses the same tokenizer as GPT-2. tiktoken is the fast option;
# the Hugging Face alternative below also works but is slower.
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer = tiktoken.encoding_for_model(model)

# Handshake to make sure we can talk to the LLM.
print(LLM("hello world!"))

## **CartPole** Environment

This wraps the CartPole environment to make it LLM-friendly:
* Reduce the observation to just the pole angle and the pole angular velocity.
* Clip the pole angle to [-0.25, 0.25] and map it to integers in [0, 100].
* Clip the pole angular velocity to [-3.00, 3.00] and map it to integers in [0, 100].
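
As a standalone illustration of the normalization above, here is a minimal sketch that mirrors the `norm_state` method defined in the next cell. It assumes the pole angle is in radians and the angular velocity in radians per second, as in Gymnasium's CartPole. The free-function form and the sample inputs are ours, chosen for illustration.

```python
import numpy as np

def norm_state(angle, velocity):
    """Clip and rescale pole angle and angular velocity to integers in [0, 100]."""
    p = (np.clip(angle, -0.25, 0.25) + 0.25) * (100 / 0.5)
    v = (np.clip(velocity, -3.0, 3.0) + 3.0) * (100 / 6.0)
    return int(np.round(p)), int(np.round(v))

print(norm_state(0.0, 0.0))    # Upright and still: (50, 50)
print(norm_state(-0.25, 3.0))  # Range limits: (0, 100)
print(norm_state(1.0, -10.0))  # Out-of-range values are clipped: (100, 0)
```

With these ranges, one integer step corresponds to 0.005 rad of angle and 0.06 rad/s of angular velocity, and anything outside the range saturates at 0 or 100.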

In [None]:
class CartPoleEnv:

  def __init__(self):
    self.env = gym.make("CartPole-v1")  #, render_mode="human"
    self.reset()

  def reset(self):
    obs, info = self.env.reset()
    self.terminated = False
    self.reward = 0
    self.state = self.norm_state([obs[2], obs[3]])  # Pole angle and velocity.
    return self.state

  def step(self, act):
    obs, reward, terminated, truncated, info = self.env.step(act)
    self.state = self.norm_state([obs[2], obs[3]])
    self.terminated = terminated or truncated
    self.reward += np.int32(reward)
    return self.state

  def random_act(self):
    return self.env.action_space.sample()

  def norm_state(self, state):
    p = (np.clip(state[0], -0.25, 0.25) + 0.25) * (100 / 0.5)
    v = (np.clip(state[1], -3, 3) + 3) * (100 / 6)
    return int(np.round(p)), int(np.round(v))

  def state_to_str(self, state):
    return f" {state[0]} {state[1]}"

  def act_to_str(self, act):
    # LLMs are biased toward 0, so let's make the actions 1 and 2 instead.
    return f" {act + 1}"

  def str_to_act(self, text):  # `text` avoids shadowing the builtin `str`.
    return int(text) - 1

## Sequence **Improvement**
In-context policy optimization with online rollouts. Each LLM rollout is stopped once it reaches a reward of 200.
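
The cell below serializes each episode as one line of text, `reward: angle velocity, action, ...`, and ends the prompt with a target reward and the current state so that the LLM completes the next action. The following is a minimal sketch of that format only. The helper names `episode_to_text` and `build_prompt` are hypothetical, the sample numbers are made up, and the token-budget truncation used in the actual loop is omitted.

```python
def episode_to_text(reward, episode):
    """Serialize one episode; actions are shifted from {0, 1} to {1, 2}."""
    steps = [f" {p} {v}, {a + 1}" for (p, v), a in episode]
    return f"{reward}:" + ",".join(steps)

def build_prompt(memory, desired_reward, state):
    """Past episodes in ascending reward order, then the query line."""
    ordered = sorted(memory, key=lambda item: item[0])
    lines = [episode_to_text(reward, episode) for reward, episode in ordered]
    query = f"{desired_reward}: {state[0]} {state[1]},"
    return "\n".join(lines + [query])

# Hypothetical memory: (reward, [((angle, velocity), action), ...]).
memory = [
    (3, [((50, 48), 1), ((52, 55), 0), ((51, 44), 1)]),
    (2, [((49, 51), 0), ((47, 43), 0)]),
]
print(build_prompt(memory, desired_reward=25, state=(50, 50)))
# 2: 49 51, 1, 47 43, 1
# 3: 50 48, 2, 52 55, 1, 51 44, 2
# 25: 50 50,
```

Because the query line starts with a reward higher than any seen so far, the completion is conditioned on continuing the pattern of better and better episodes.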

In [None]:
init_episodes = 100
max_episodes = 200
temperature = 0.0
max_context = 1020  # In tokens.

# Memory bank with reward-labeled episodes: each is a list of state-action tuples.
episodes = []
rewards = []

env = CartPoleEnv()

# Generate some random policy rollouts and add them to memory.
while len(episodes) < init_episodes:
  episode = []
  s = env.reset()
  while not env.terminated:
    a = env.random_act()
    episode.append((s, a))
    s = env.step(a)
  episodes.append(episode)
  rewards.append(env.reward)

# Incremental rollouts with the LLM in the loop.
while len(episodes) < max_episodes:

  # Set a desired reward for the current rollout.
  desired_reward = np.max(rewards) + 20 + np.int32(np.random.uniform() * 10)
  prompt = f"{desired_reward}:"

  # Environment reset.
  state = env.reset()
  buffer = []

  while not env.terminated and env.reward < 200:
    prompt += f"{env.state_to_str(state)},"
    num_tokens = len(tokenizer.encode(prompt))

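    # Build context of episodes sorted by ascending rewards. We iterate from the
    # highest reward down and prepend each episode, so the best episodes end up
    # closest to the prompt and are the last to be cut by the token budget.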
    context = ""
    for i in np.argsort(rewards)[::-1]:
      if num_tokens + 10 > max_context:  # Each episode should have at least 10 tokens.
        break
      episode, reward = episodes[i], rewards[i]
      size = min(len(episode), (max_context - num_tokens) // 5)
      text = f"{reward}:" + ",".join([f"{env.state_to_str(s)},{env.act_to_str(a)}" for s, a in episode[:size]])
      num_tokens += 2 + size * 5   # Manual math here to count tokens. Calling the tokenizer too much can get slow.
      context = f"{text}\n{context}"

    # LLM inference.
    pred = LLM(context + prompt, max_tokens=4, stop=[",", "\n"], temperature=temperature)

    # If predicted action is invalid, sample random action.
    try:
      act = env.str_to_act(pred.strip())
    except ValueError:  # The completion was not an integer.
      act = -1
    if act not in [0, 1]:
      print(f"Invalid action '{pred}'. Sampling random one.")
      act = env.random_act()

    prompt += f"{env.act_to_str(act)},"
    buffer.append((state, act))

    # Show LLM input.
    print(context + prompt)
    print("---------------------------------------------------------")
    print("Num episodes:", len(episodes), "Curr highest return:", np.max(rewards))
    print("---------------------------------------------------------")

    # Step environment.
    state = env.step(act)

  episodes.append(buffer)
  rewards.append(env.reward)

  # Make a plot of performance over time.
  plt.scatter(np.arange(init_episodes), rewards[:init_episodes], c="gray", alpha=0.3)
  plt.scatter(np.arange(init_episodes, len(rewards)), rewards[init_episodes:], alpha=0.3)
  max_over_time = [rewards[init_episodes]]
  for reward in rewards[init_episodes+1:]:
    max_over_time.append(max(reward, max_over_time[-1]))
  plt.plot(np.arange(init_episodes, len(rewards)), max_over_time)
  plt.axhline(y=200, color='gray', linestyle='--', alpha=0.3)
  plt.show()
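
The context-building step above relies on a hand-computed token estimate rather than calling the tokenizer for every episode. Here is a minimal sketch of that bookkeeping, assuming (as the loop's arithmetic does) about 2 tokens for each `reward:` header and 5 tokens per state-action step. The function name `pack_episodes` and the example episode lengths are hypothetical.

```python
HEADER_TOKENS = 2    # Assumed: the reward number and the colon.
TOKENS_PER_STEP = 5  # Assumed: angle, velocity, comma, action, comma.

def pack_episodes(episode_lens, prompt_tokens, max_context=1020):
    """Return how many steps of each episode fit, and the estimated token total.

    `episode_lens` is expected in descending-reward order, as in the loop above.
    """
    used = prompt_tokens
    kept = []
    for length in episode_lens:
        if used + 10 > max_context:  # Require room for at least 10 tokens.
            break
        size = min(length, (max_context - used) // TOKENS_PER_STEP)
        kept.append(size)
        used += HEADER_TOKENS + size * TOKENS_PER_STEP
    return kept, used

kept, used = pack_episodes([40, 30, 200], prompt_tokens=5)
print(kept, used)  # [40, 30, 132] 1021
```

The last episode is truncated to its first 132 steps. Because the header tokens are added after the step count is chosen, the estimate can end slightly above `max_context`, as it does in this example.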