##### Copyright 2023 Google LLC. SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## **LLMs as General Pattern Machines:** PCFG Benchmark

We observe that pretrained large language models (LLMs) are capable of autoregressively completing complex token sequences -- from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to richer spatial patterns found in the Abstract Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. These results suggest that without any additional training, LLMs can serve as general sequence modelers, driven by in-context learning. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics -- from extrapolating sequences of numbers that represent states over time to complete simple motions, to least-to-most prompting of reward-conditioned trajectories that can discover and represent closed-loop policies (e.g., a stabilizing controller for CartPole). While difficult to deploy today for real systems due to latency, context size limitations, and compute costs, the approach of using LLMs to drive low-level control may provide an exciting glimpse into how the patterns among words could be transferred to actions.

Our **PCFG benchmark** is a procedurally generated, adjustable-difficulty benchmark for measuring how well LLMs learn abstract sequence transformations in context, based on the PCFG from [Hupkes et al. 2020](https://arxiv.org/abs/1908.08351). Here are the primitive operations in the PCFG that can be applied to one or two sequences of (arbitrary) tokens.

<img src="https://socraticmodels.github.io/images/pcfg-ops.png" height="300px">

These transformations include a collection of lexical rules that may be composed. The complexity of the transformations increases with the number of tokens that represent the input sequences, or with the number of primitives chained together. For example:

<img src="https://socraticmodels.github.io/images/pcfg-example.png" height="200px">

This colab runs GPT-3 on the PCFG benchmark with consistent tokenization (described in more detail in Sec. 4 of the main paper). Evaluating on PCFG with random token alphabets can provide a less biased comparison of pattern reasoning capabilities across LLMs. Our experiments suggest that PCFG completion accuracy improves with model scale. Note that we use PCFG for out-of-the-box in-context evaluation only (rather than for training data).
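
One way to picture evaluation under a random token alphabet is a one-to-one remapping of the symbols in a sequence. The sketch below is illustrative only: `remap_alphabet` and `candidate_tokens` are hypothetical names that do not appear in this notebook, and the candidate vocabulary is made up for the example.

```python
import random

def remap_alphabet(sequence, candidate_tokens, seed=0):
  """Map each distinct symbol in `sequence` to a randomly chosen token.

  The mapping is one-to-one, so the structure of the pattern is preserved
  while its surface form changes.
  """
  rng = random.Random(seed)
  symbols = sorted(set(sequence))
  # Sample without replacement so that distinct symbols stay distinct.
  new_tokens = rng.sample(candidate_tokens, k=len(symbols))
  mapping = dict(zip(symbols, new_tokens))
  return [mapping[s] for s in sequence], mapping

# Hypothetical candidate vocabulary, for illustration only.
candidate_tokens = ["apple", "kite", "moss", "zinc", "oval", "drum"]
sequence = ["1", "2", "2", "3", "1"]
remapped, mapping = remap_alphabet(sequence, candidate_tokens)
print(remapped)
```

Because the mapping is a bijection on the symbols that occur, positions that held equal symbols before the remapping still hold equal tokens afterwards.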

### **Quick Start:**

**Step 1.** Register for an [OpenAI API key](https://openai.com/blog/openai-api/) to use GPT-3 (there's a free trial) and enter it below.

**Step 2.** Menu > Runtime > Run all

In [1]:
openai_api_key = "your-api-key-here"

## **Setup**

**Note:** this notebook only needs a CPU (public) runtime.

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install "openai<1.0"  # The code below uses the legacy openai.Completion interface.
!pip install tiktoken

import numpy as np
import tiktoken  # Faster than GPT2Tokenizer from HuggingFace.
import openai
import time

openai.api_key = openai_api_key

## **API:** Large Language Models

Define helper functions to call large language models and the tokenizer.

**Note:** this can get expensive. You can reduce the number of calls by reducing `num_datasets` when running on the benchmark.

In [3]:
encoding = tiktoken.get_encoding("gpt2")
encode = lambda s: encoding.encode(s, allowed_special={"<|endoftext|>"})
decode = lambda s: encoding.decode(s)

In [4]:
model = "text-davinci-003"

def LM(prompt, max_tokens=256, stop=None, temperature=0):
  # `prompt` may be a string or a list of token IDs. Returns the text of each completion.
  response = openai.Completion.create(engine=model, prompt=prompt, max_tokens=max_tokens, temperature=temperature, stop=stop)
  return [choice['text'] for choice in response['choices']]
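
API calls can fail transiently, for example because of rate limits or timeouts. One possible way to cope is to retry with exponential backoff. The sketch below assumes a hypothetical helper, `call_with_retries`, that wraps any zero-argument callable; it is not part of this notebook or of the OpenAI library.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
  """Call `fn()` and retry with exponential backoff if it raises."""
  for attempt in range(max_attempts):
    try:
      return fn()
    except Exception:
      if attempt == max_attempts - 1:
        raise  # Out of attempts: surface the last error to the caller.
      sleep(base_delay * 2 ** attempt)  # Wait 1x, 2x, 4x, ... base_delay.
```

With this helper, a call such as `LM(prompt, stop=sample_delim)` could be wrapped as `call_with_retries(lambda: LM(prompt, stop=sample_delim))`. Keep in mind that every retry is another billed request.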

## **PCFG:** Generator

Define the rules that generate arbitrary sequence-to-sequence transformations. Each dataset is constructed by applying one PCFG rule to random tokens. The goal of in-context learning is (i) to infer the PCFG rule from the examples in the dataset and (ii) to apply it to a new $x_{query}$. Both steps happen implicitly at inference time, with no weight updates.

In [5]:
def pcfg_copy(x):
  return x

def pcfg_reverse(x):
  return x[::-1]

def pcfg_shift(x):
  return x[1:] + x[:1]

def pcfg_swap(x):
  return x if len(x) < 2 else x[-1:] + x[1:-1] + x[:1]

def pcfg_repeat(x):
  return x + x

def pcfg_echo(x):
  return x + x[-1:]

def pcfg_append(x, y):
  return x + y

def pcfg_prepend(x, y):
  return y + x

def pcfg_remove_first(x, y):
  return y

def pcfg_remove_second(x, y):
  return x

pcfg_fns = [pcfg_copy, pcfg_reverse, pcfg_shift, pcfg_swap, pcfg_repeat, pcfg_echo, pcfg_append, pcfg_prepend, pcfg_remove_first, pcfg_remove_second]
def is_unary(pcfg_fn):
  return pcfg_fn in [pcfg_copy, pcfg_reverse, pcfg_shift, pcfg_swap, pcfg_repeat, pcfg_echo]
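
To make the primitives concrete, the sketch below applies a few of them to short lists of letters and then composes three of them. The functions are restated from the cell above only so that the block runs on its own; the inputs `x` and `y` are made up for illustration.

```python
# Restated from the cell above so that this block runs on its own.
def pcfg_reverse(x):
  return x[::-1]

def pcfg_shift(x):
  return x[1:] + x[:1]

def pcfg_swap(x):
  return x if len(x) < 2 else x[-1:] + x[1:-1] + x[:1]

def pcfg_echo(x):
  return x + x[-1:]

def pcfg_append(x, y):
  return x + y

x, y = ["a", "b", "c", "d"], ["e", "f"]

print(pcfg_reverse(x))  # ['d', 'c', 'b', 'a']
print(pcfg_shift(x))    # ['b', 'c', 'd', 'a']
print(pcfg_swap(x))     # ['d', 'b', 'c', 'a']
print(pcfg_echo(x))     # ['a', 'b', 'c', 'd', 'd']

# Primitives compose: echo the result of appending y to the reverse of x.
composed = pcfg_echo(pcfg_append(pcfg_reverse(x), y))
print(composed)  # ['d', 'c', 'b', 'a', 'e', 'f', 'f']
```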

In [6]:
class Rule:

  def __init__(self, num_fns):
    self.num_fns = num_fns
    self.num_vars = 0

    # Recursively generate a rule using a specific number of functions.
    def gen_rule():
      if self.num_fns == 0:
        var_id = self.num_vars
        self.num_vars += 1
        return var_id
      else:
        rand_fn = np.random.choice(pcfg_fns)
        self.num_fns -= 1
        if is_unary(rand_fn):
          return [rand_fn, gen_rule()]
        else:
          return [rand_fn, gen_rule(), gen_rule()]
    self.fn_list = gen_rule()

  # Apply the rule to a list of variables (each variable is a list of tokens).
  def __call__(self, variables):
    def apply_fns(params):
      if isinstance(params, int):  # Leaf: an index into the list of variables.
        return variables[params]
      next_fn = params[0]
      if is_unary(next_fn):
        return next_fn(apply_fns(params[1]))
      return next_fn(apply_fns(params[1]), apply_fns(params[2]))
    return apply_fns(self.fn_list)
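
A generated rule is stored as a nested list of the form `[fn, arg]` or `[fn, arg1, arg2]`, with integer leaves that index the input variables. When debugging, it can help to print such a tree as a readable expression. The helper below, `rule_to_string`, is a hypothetical addition and not part of the original notebook.

```python
def rule_to_string(tree):
  """Render a nested [fn, arg, ...] rule tree as a readable expression."""
  if isinstance(tree, int):
    return f"x{tree}"  # Leaves are variable indices.
  fn, *args = tree
  return f"{fn.__name__}({', '.join(rule_to_string(a) for a in args)})"

# Stand-in primitives, defined here only so that the block runs on its own.
def pcfg_repeat(x):
  return x + x

def pcfg_append(x, y):
  return x + y

example_tree = [pcfg_repeat, [pcfg_append, 0, 1]]
print(rule_to_string(example_tree))  # pcfg_repeat(pcfg_append(x0, x1))
```

In the notebook, the same helper could be applied to a generated rule as `rule_to_string(Rule(num_fns).fn_list)`.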

## **PCFG:** Datasets

Procedurally generate datasets for infinite benchmarking.

In [7]:
num_fns = 4  # Number of functions per PCFG rule. This increases task complexity.
x_size = 10  # Number of tokens in each x.
prompt_size = 1024  # Maximum number of tokens in the prompt.

var_delim = ","  # Separates the tokens within each x and y.
sample_delim = "\n---\n"  # Separates examples from one another.
input_output_delim = "\n"  # Separates x from y within an example.

num_datasets = 10  # Number of datasets to generate (one LLM call per dataset).

In [8]:
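def vars_to_tokens(items):
  # Encode each item with a leading space and join the items with `var_delim`.
  tokens = []
  for i, item in enumerate(items):
    token = encode(" " + str(item))
    if i < len(items) - 1:
      tokens += token + encode(var_delim)
    else:
      tokens += token
  return tokens

def flatten(list_of_lists):
  # Concatenate a list of lists into a single flat list.
  flat = []
  for sublist in list_of_lists:
    flat += sublist
  return flat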
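
Decoded back to text, each in-context example has the layout sketched below. This is a string-level approximation, not the notebook's code: `tokens_to_text` and `format_example` are hypothetical helpers, whereas the notebook itself builds lists of token IDs with `vars_to_tokens`.

```python
def tokens_to_text(tokens, var_delim=","):
  """String-level analogue of `vars_to_tokens`."""
  pieces = []
  for i, token in enumerate(tokens):
    pieces.append(" " + str(token))  # A leading space precedes every token.
    if i < len(tokens) - 1:
      pieces.append(var_delim)  # The delimiter follows all but the last token.
  return "".join(pieces)

def format_example(x_tokens, y_tokens, input_output_delim="\n", sample_delim="\n---\n"):
  """Lay out one example as: x, newline, y, sample delimiter."""
  return tokens_to_text(x_tokens) + input_output_delim + tokens_to_text(y_tokens) + sample_delim

example = format_example(["5", "1", "0"], ["0", "1", "5"])
print(repr(example))  # ' 5, 1, 0\n 0, 1, 5\n---\n'
```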

In [9]:
# Fixed test token vocabulary.
vocab = [str(i) for i in range(10)]

In [10]:
# Build PCFG datasets.
pcfg_data = []
for seed in range(num_datasets):
  np.random.seed(seed)  # Reproduce generated datasets.

  # Generate a random PCFG rule.
  pcfg_rule = Rule(num_fns)

  # Randomly split the list of tokens N-1 times to generate N variables for each x.
  num_splits = pcfg_rule.num_vars - 1
  split_idx = np.sort(np.random.choice(np.arange(1, x_size), size=num_splits, replace=False)).tolist()
  split_idx = [0,] + split_idx + [x_size,]

  # Fill the prompt with random examples generated using the PCFG rule.
  prompt = []
  context_size = 0
  while True:

    # Randomly generate a list of tokens (from vocab) to represent variables.
    x_tokens = np.random.choice(vocab, size=x_size, replace=True).tolist()
    x_vars = []
    for i in range(1, len(split_idx)):
      start, end = split_idx[i - 1], split_idx[i]
      x_vars.append(x_tokens[start:end])

    # Run the rule on x variables to get y.
    y_vars = pcfg_rule(x_vars)
    x_tokens = vars_to_tokens(flatten(x_vars)) + encode(input_output_delim)
    y_tokens = vars_to_tokens(y_vars) + encode(sample_delim)

    # Ensure the expected number of tokens does not exceed prompt size.
    if len(prompt + x_tokens + y_tokens) > prompt_size:
      break

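    prompt += x_tokens + y_tokens
    context_size += 1
    y_len = len(y_tokens)  # Length of the most recently added y (including the delimiter).

  # Hold out the final y as the target, so the prompt ends right after the final x.
  prompt, target = prompt[:-y_len], prompt[-y_len:]
  pcfg_data.append((prompt, target))

# Show the last generated dataset.
print("Prompt:\n", decode(prompt))
print("Target:\n", decode(target))
print("Total number of in-context examples:", context_size)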

Prompt:
  5, 1, 0, 8, 8, 8, 2, 6, 8, 1
 8, 1, 1
---
 8, 3, 5, 3, 6, 7, 9, 0, 8, 1
 8, 1, 1
---
 8, 1, 6, 6, 2, 8, 4, 5, 3, 4
 3, 4, 4
---
 0, 8, 0, 4, 5, 4, 8, 3, 8, 4
 8, 4, 4
---
 8, 0, 1, 2, 3, 7, 7, 2, 0, 4
 0, 4, 4
---
 6, 2, 2, 2, 3, 9, 2, 1, 3, 0
 3, 0, 0
---
 1, 0, 2, 0, 4, 8, 0, 7, 0, 2
 0, 2, 2
---
 4, 0, 1, 6, 7, 6, 1, 7, 6, 2
 6, 2, 2
---
 5, 0, 0, 9, 5, 3, 7, 0, 6, 9
 6, 9, 9
---
 2, 6, 5, 0, 9, 0, 5, 2, 5, 3
 5, 3, 3
---
 6, 3, 1, 6, 1, 8, 8, 6, 8, 1
 8, 1, 1
---
 7, 0, 1, 3, 6, 2, 0, 7, 4, 2
 4, 2, 2
---
 8, 8, 1, 2, 4, 8, 1, 6, 4, 9
 4, 9, 9
---
 3, 5, 1, 9, 2, 1, 7, 1, 8, 9
 8, 9, 9
---
 7, 3, 4, 2, 3, 3, 1, 2, 6, 3
 6, 3, 3
---
 1, 2, 4, 9, 6, 5, 7, 5, 1, 3
 1, 3, 3
---
 6, 9, 4, 5, 2, 9, 5, 2, 9, 4
 9, 4, 4
---
 8, 2, 8, 1, 0, 7, 9, 2, 9, 2
 9, 2, 2
---
 7, 4, 2, 7, 3, 9, 1, 0, 0, 9
 0, 9, 9
---
 4, 0, 3, 3, 6, 4, 8, 9, 0, 6
 0, 6, 6
---
 9, 4, 7, 4, 0, 3, 8, 6, 5, 7
 5, 7, 7
---
 3, 6, 2, 6, 5, 5, 6, 6, 5, 4
 5, 4, 4
---
 7, 1, 8, 8, 6, 7, 4, 5, 3, 6
 3, 6, 6
---
 0
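
The splitting step in the dataset cell can be looked at in isolation. The sketch below assumes a hypothetical helper, `split_into_vars`, and a local NumPy generator in place of `np.random.seed`, so its random draws will not match the notebook's.

```python
import numpy as np

def split_into_vars(tokens, num_vars, rng):
  """Cut `tokens` into `num_vars` contiguous, non-empty variables."""
  # Choose num_vars - 1 distinct interior cut points, then add both ends.
  cuts = np.sort(rng.choice(np.arange(1, len(tokens)), size=num_vars - 1, replace=False)).tolist()
  bounds = [0] + cuts + [len(tokens)]
  return [tokens[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

rng = np.random.default_rng(0)  # Assumption: a local generator, not np.random.seed.
x_tokens = list("0123456789")
x_vars = split_into_vars(x_tokens, num_vars=3, rng=rng)
print(x_vars)
```

In the notebook, the cut points are drawn once per dataset, outside the `while` loop, so every example in a dataset shares the same variable boundaries.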

## **Evaluate:** PCFG

Run with GPT-3.


In [11]:
print('x_size = ', x_size, 'num_fns = ', num_fns)
verbose = True

success = []
count = 0
for i, (prompt, target) in enumerate(pcfg_data):
  count += 1

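  target_string = decode(target)
  try:
    # len(target_string) counts characters, so it over-estimates the token budget; `stop` ends generation at the delimiter.
    output = LM(prompt, max_tokens=len(target_string)+3, stop=sample_delim)[0]
  except Exception as e:  # Avoid a bare `except`, which would also swallow KeyboardInterrupt.
    print('*** failed call ***')
    print('Error:', e)
    print('Prompt:', prompt)
    print('***')
    continue  # Failed calls still increment `count` but are excluded from `success`.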

  success.append(target_string[:target_string.find(sample_delim)] in output)
  print('Count:', count, ', Success percent:', int(100 * np.mean(success)))
  if verbose and not success[-1]:
    print('*** Failed: #', count)
    print('Target:', target_string, 'Output:', output)
    print('***')

print('-'*10)
print('Overall:')
print('x_size = ', x_size, ', num_fns = ', num_fns, ', count =', count,
      ', success rate =', np.mean(success))

x_size =  10 num_fns =  4
Count: 1 , Success percent: 100
Count: 2 , Success percent: 100
Count: 3 , Success percent: 100
Count: 4 , Success percent: 100
Count: 5 , Success percent: 80
*** Failed: # 5
Target:  5, 9, 1, 1, 9, 3, 3
---
 Output:  5, 9, 1, 1, 9, 5, 5
***
Count: 6 , Success percent: 83
Count: 7 , Success percent: 71
*** Failed: # 7
Target:  8, 0, 5, 5, 8, 7, 0, 5, 5, 7
---
 Output:  5, 0, 5, 5, 8, 7, 0, 5, 5, 7
***
Count: 8 , Success percent: 62
*** Failed: # 8
Target:  7, 1, 4, 9, 3, 6, 7, 1, 0, 7, 1, 4, 9, 3, 6, 7, 1, 0
---
 Output:  1, 1, 4, 9, 3, 6, 7, 1, 7, 1, 1, 4, 9, 3, 6, 7, 1, 7
***
Count: 9 , Success percent: 55
*** Failed: # 9
Target:  1, 4, 1, 1, 7, 5, 4, 1, 1, 4, 1, 1, 7, 5, 4, 1
---
 Output:  1, 1, 4, 1, 7, 5, 4, 1, 1, 1, 4, 1, 7, 5, 4, 1
***
Count: 10 , Success percent: 60
----------
Overall:
x_size =  10 , num_fns =  4 , count = 10 , success rate = 0.6
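
The success criterion used in the loop above can be written as a small standalone function. The sketch below uses a hypothetical name, `is_success`, and additionally handles a target that lacks the trailing delimiter, a case the notebook does not need to handle.

```python
def is_success(target_string, output, sample_delim="\n---\n"):
  """Return True if the expected completion appears inside the model output."""
  end = target_string.find(sample_delim)
  expected = target_string if end == -1 else target_string[:end]
  return expected in output

# Failure case #5 from the log above: the last two tokens differ.
print(is_success(" 5, 9, 1, 1, 9, 3, 3\n---\n", " 5, 9, 1, 1, 9, 5, 5"))  # False
print(is_success(" 8, 1, 1\n---\n", " 8, 1, 1"))  # True
```

The check is a substring match, so an output that contains the target followed by extra tokens also counts as a success.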
