Outline:

1. Introduce the data gathering and parsing
1. How to create a VIT from the GPT code
    1. Change to an encoder
    1. Handling the position embedding
    1. Adding the text goals
    1. Switching to regression instead of classification
1. Sanitising the data and standardization.
1. Adding goal images
1. Adding the input masking
1. Evaluating the model in sim
1. Recording videos of the results for evaluation

## Get full trajectories
Instead of a single image we want something that looks more like a sequence, similar to text. 
For robotics applications our "language" is images and actions.

# Load Datasets

The data for robotics applications is often more complicated than plain text: it combines images, actions, and text descriptions. Also, the text description is given per episode rather than at each frame, whereas per-frame data is the common case in RL/BC.
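
To make the per-episode versus per-step distinction concrete, here is a minimal sketch with a made-up episode structure. The field names `instruction`, `steps`, `image`, and `action` are illustrative assumptions, not the dataset's actual schema. It copies the single episode-level instruction onto every step so that each training sample is self-contained.

```python
# Hypothetical, simplified episodes: one instruction per episode, many steps.
episodes = [
    {"instruction": "pick up the spoon",
     "steps": [{"image": "img_0", "action": [0.1, 0.0, 0.0]},
               {"image": "img_1", "action": [0.0, 0.2, 0.0]}]},
    {"instruction": "open the drawer",
     "steps": [{"image": "img_2", "action": [0.0, 0.0, 0.3]}]},
]

def flatten_episodes(episodes):
    """Turn episodes into per-step samples, repeating the episode-level text."""
    samples = []
    for episode in episodes:
        for step in episode["steps"]:
            samples.append({
                "img": step["image"],
                "action": step["action"],
                "goal": episode["instruction"],
            })
    return samples

samples = flatten_episodes(episodes)
print(len(samples))          # 3 samples from 2 episodes
print(samples[1]["goal"])    # pick up the spoon
```

The data-loading cell further down follows the same idea: it appends one image, one action, and one goal string per step.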

In [None]:
# Install repo
!pip3 install -e .
!pip3 install -r requirements.txt
!pip install "moviepy>=1.0.3"
import cv2
# import jax
# import tensorflow as tf
import tensorflow_datasets as tfds
import tqdm
import rlds, numpy as np
import mediapy as media
from PIL import Image
from IPython import display

: 

In [None]:
from PIL import Image
from IPython import display
def as_gif(images, path='temp.gif'):
  # Render the images as the gif:
  images = [Image.fromarray(image) for image in images]
  images[0].save(path, save_all=True, append_images=images[1:], duration=1000, loop=0)
  with open(path, 'rb') as f:
    gif_bytes = f.read()
  return gif_bytes

# create RLDS dataset builder
builder = tfds.builder_from_directory(builder_dir='gs://gresearch/robotics/bridge/0.1.0/')
dataset = builder.as_dataset(split='train[:1]')

# sample episode + resize to 256x256 (default third-person cam resolution)
episode = next(iter(dataset))
print(episode)

steps = list(episode['steps'])
images = np.array([cv2.resize(np.array(step['observation']['image']), (256, 256)) for step in steps])

# extract goal image & language instruction
goal_image = images[-1]
language_instruction = steps[0]['observation']['natural_language_instruction'].numpy().decode()

# visualize episode
print(f'Instruction: {language_instruction}')
# media.show_video(images, fps=10)
display.Image(as_gif(images))

: 

In [None]:
## Grab a chunk of data for training
## Model hyperparameters
image_shape = [64, 64, 3]
num_episodes = 5 ## How many episodes to grab from the dataset for training

builder = tfds.builder_from_directory(builder_dir='gs://gresearch/robotics/bridge/0.1.0/')
datasetRemote = builder.as_dataset(split='train[:' + str(num_episodes) + ']')
dataset_tmp = {"img": [], "action": [], "goal": [], "goal_img": [],
                "rotation_delta": [], "open_gripper": [] }
shortest_goal_txt = 10000000000
for episode in datasetRemote:
    episode_ = {'steps': [] }
    episode = list(episode['steps'])
    goal_img = cv2.resize(np.array(episode[-1]['observation']['image'], dtype=np.float32), (image_shape[0], image_shape[1]))  
    for i in range(len(episode)): ## Resize images to reduce computation
        obs = cv2.resize(np.array(episode[i]['observation']['image'], dtype=np.float32), (image_shape[0], image_shape[1])) 
        goal = episode[i]['observation']['natural_language_instruction'].numpy().decode()
        # action = torch.as_tensor(action) # grab first dimention
        dataset_tmp["img"].append(obs)
        dataset_tmp["action"].append(np.array(episode[i]['action']['world_vector']))
        dataset_tmp["rotation_delta"].append(np.array(episode[i]['action']['rotation_delta']))
        dataset_tmp["open_gripper"].append(np.array(episode[i]['action']['open_gripper']))
        dataset_tmp["goal"].append(goal)
        dataset_tmp["goal_img"].append(goal_img)
        if len(goal) < shortest_goal_txt: shortest_goal_txt = len(goal)

# here are all the unique characters that occur in this text
chars = sorted(list(set([item for row in dataset_tmp["goal"] for item in row]))) ## Flatten to a long string
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode_txt = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode_txt = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
print("vocab_size:", vocab_size)
print("example text encode:", encode_txt(dataset_tmp["goal"][0]))

print("Dataset shape:", len(dataset_tmp["img"]))
dataset_tmp["img"] = np.array(dataset_tmp["img"], dtype=np.uint8)
dataset_tmp["action"] = np.array(dataset_tmp["action"], dtype=np.float32)
# dataset_tmp["goal"] = np.array(dataset_tmp["goal"], dtype=np.float32)
dataset_tmp["goal_img"] = np.array(dataset_tmp["goal_img"], dtype=np.uint8)

dataset = {"train": dataset_tmp} 

: 

# How to create a VIT from the Transformer (NanoGPT) code
We now have some familiarity with a robotics dataset. It contains multiple types of input (text and images), and its outputs are continuous values. We need a model that processes the input data and outputs the correct values given the input observation (image) and the goal (text or image). I am going to extend some of Karpathy's nanoGPT code to keep as much of the model and its details visible as we learn about this process. We will start by creating a vision transformer from that code.

In [None]:
## We need to adjust the masking.
# Attention can be fully connected (encoder) or partially masked (decoder-style).
# This matters because we will modify the mask a couple of times for our GRP.
# For images we want an encoder.
import torch
import torch.nn as nn
from torch.nn import functional as F
# Self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False) ## This is what I have
query = nn.Linear(C, head_size, bias=False) ## Here is what I am interested in
value = nn.Linear(C, head_size, bias=False) ## Here is what I will communicate if you find me useful
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

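tril = torch.tril(torch.ones(T, T)) ## Causal mask; unused here because an encoder attends to every token
# wei = torch.zeros((T,T))
# wei = wei.masked_fill(tril == 0, float('-inf')) ## Uncomment for decoder-style (causal) attention
wei = F.softmax(wei, dim=-1) ## Normalize each row of attention weights so it sums to 1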

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

: 
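
To make the encoder versus decoder distinction concrete, the sketch below applies both kinds of attention to the same score matrix. It uses NumPy rather than PyTorch to stay self-contained. The random `scores` matrix is a stand-in for `q @ k.transpose(-2, -1)`, and the `softmax` helper is defined here only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # stand-in for the attention scores of one head

# Encoder: every token attends to every token.
full = softmax(scores)

# Decoder: token t may only attend to tokens 0..t, so future positions get -inf.
causal_mask = np.tril(np.ones((T, T), dtype=bool))
causal = softmax(np.where(causal_mask, scores, -np.inf))

print(np.round(full, 2))    # dense matrix, each row sums to 1
print(np.round(causal, 2))  # lower-triangular matrix, each row sums to 1
```

For image patches there is no natural "future", which is why the masking line stays commented out in the cell above.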

In [None]:
## Look at the attention for one head
wei[0]

: 

## Tokenizing Images
Transformers process tokens. The nanoGPT model treats each character as an individual token and is trained to continue a recent context of tokens by outputting more tokens. Most LLMs instead break words into subword pieces, chosen so that the pieces represent the components of words well while keeping the number of possible tokens reasonable. This raises the question: how do we tokenize an image? We literally chop the image into $m$ equally sized __patches__.
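
As a worked example of the arithmetic, the sketch below patches one 64x64x3 image with 8 patches per side, the values this notebook uses. It relies on NumPy reshapes rather than the notebook's own functions, so treat it as an illustration of the idea and not as the implementation used later.

```python
import numpy as np

height, width, channels = 64, 64, 3   # same as image_shape in this notebook
n_patches = 8                          # patches per side
patch_size = height // n_patches       # 8 pixels per side

num_tokens = n_patches ** 2                      # m = 64 patches per image
token_dim = patch_size * patch_size * channels   # 192 values per patch

image = np.arange(height * width * channels).reshape(height, width, channels)

# Split rows and columns into (patch index, offset inside patch), then
# bring the two patch indices together and flatten each patch.
patches = (image
           .reshape(n_patches, patch_size, n_patches, patch_size, channels)
           .transpose(0, 2, 1, 3, 4)
           .reshape(num_tokens, token_dim))

print(patches.shape)  # (64, 192)
# The first token is the top-left 8x8 corner of the image.
print(np.array_equal(patches[0], image[:8, :8, :].flatten()))  # True
```

Each of the 64 rows is one token, so a single image becomes a sequence the transformer can process much like a sentence of 64 characters.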

In [None]:
def get_patches(images):
  print("images.shape:", images.shape)
  batch_size, height, width, channels = images.shape
  n_patches = 8
  assert height == width, "only square images are supported"

  patches = torch.zeros(batch_size, n_patches ** 2, height * width * channels // (n_patches ** 2))
  patch_size = height // n_patches

  for idx, image in enumerate(images):
      for row in range(n_patches):
          for column in range(n_patches):
            ## Channel last: each image is (height, width, channels)
            patch = image[row * patch_size: (row + 1) * patch_size, column * patch_size: (column + 1) * patch_size, :]
            patches[idx, row * n_patches + column] = patch.flatten()

  return patches

: 

In [None]:
out = get_patches(torch.tensor(dataset_tmp["img"][:5]))
out[0]

: 

Show some of the patching with images.

## Position Encodings
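
Self-attention by itself does not know the order of its tokens, so a position-dependent vector is added to each one. The fixed sinusoidal scheme computed in this notebook can be written as follows for token position $i$ and embedding size $d$. The index $k$ is introduced here only for notation and covers every pair of dimensions with $2k + 1 < d$.

```latex
\begin{aligned}
PE(i,\, 2k)   &= \sin\!\left(\frac{i}{10000^{\,2k/d}}\right) \\
PE(i,\, 2k+1) &= \cos\!\left(\frac{i}{10000^{\,2k/d}}\right)
\end{aligned}
```

Even dimensions use the sine and the following odd dimension uses the cosine at the same frequency, which is what the `j % 2 == 0` test in the code selects.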


In [None]:
def compute_pos_embed(sequence_length, d):
    out = torch.ones(sequence_length, d)
    for i in range(sequence_length):
        for j in range(d):
            out[i][j] = np.sin(i / (10000 ** (j / d))) if j % 2 == 0 else np.cos(i / (10000 ** ((j - 1) / d)))
    return out

: 
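
The double loop above is easy to read but slow for long sequences. As a hedged sketch (NumPy instead of PyTorch, with the made-up names `pos_embed_loop` and `pos_embed_vectorized`), the same table can be computed with broadcasting. The loop version is repeated only to check that the two agree.

```python
import numpy as np

def pos_embed_loop(sequence_length, d):
    """Reference version, mirroring the loop in the notebook."""
    out = np.ones((sequence_length, d))
    for i in range(sequence_length):
        for j in range(d):
            if j % 2 == 0:
                out[i, j] = np.sin(i / (10000 ** (j / d)))
            else:
                out[i, j] = np.cos(i / (10000 ** ((j - 1) / d)))
    return out

def pos_embed_vectorized(sequence_length, d):
    """Same table, computed with broadcasting instead of Python loops."""
    positions = np.arange(sequence_length)[:, None]   # (L, 1)
    dims = np.arange(d)[None, :]                      # (1, d)
    even_dims = dims - (dims % 2)                     # j for even j, j - 1 for odd j
    angles = positions / (10000 ** (even_dims / d))   # (L, d)
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

a = pos_embed_loop(65, 64)
b = pos_embed_vectorized(65, 64)
print(np.allclose(a, b))  # True
```

The result could be wrapped with `torch.from_numpy(...).float()` before being registered as a buffer, if the vectorized version is preferred.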

In [None]:
def get_patches_fast(images):
    from einops import rearrange
    batch_size, height, width, channels = images.shape  # images are channel-last
    patch_size = height // n_patches

    p = patch_size # P in maths

    patches = rearrange(images, 'b (h p1) (w p2) c -> b (h w) (p1 p2 c)', p1 = p, p2 = p)
    return patches

: 

## Transformer Code

In [None]:
## Code for a Transformer.
n_patches = 8
n_embd = 64
dropout = 0.1
batch_size = 64
# data loading
def get_batch_vit(split):
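    # generate a small batch of inputs x and targets y
    # NOTE: expects dataset[split] to contain "img" and "label" arrays; the robotics
    # dataset built above only has a 'train' split and no "label" key.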
    data = dataset['train'] if split == 'train' else dataset['test']
    ix = np.random.randint(int(len(data["img"])), size=(batch_size,))
    x = torch.tensor(data["img"][ix], dtype=torch.float)
    y = torch.tensor(data["label"][ix], dtype=torch.long)
    # x, y = x.to(device), y.to(device)
    return x, y

def calc_positional_embeddings(sequence_length, d):
    result = torch.ones(sequence_length, d)
    for i in range(sequence_length):
        for j in range(d):
            result[i][j] = np.sin(i / (10000 ** (j / d))) if j % 2 == 0 else np.cos(i / (10000 ** ((j - 1) / d)))
    return result

def get_patches_fast(images):
    from einops import rearrange
    batch_size, height, width, channels = images.shape  # images are channel-last
    patch_size = height // n_patches

    p = patch_size # P in maths

    patches = rearrange(images, 'b (h p1) (w p2) c -> b (h w) (p1 p2 c)', p1 = p, p2 = p)
    return patches

## This is an encoder head (full attention)
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T) ## Remove masking
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


: 

In [None]:
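n_embd = 64
action_bins = 10
# The values below are not defined anywhere in the original notebook.
# They are assumed placeholders so the cell can be run; tune as needed.
n_head = 8
n_blocks = 4
learning_rate = 1e-3
max_iters = 1000
eval_interval = 100
device = 'cuda' if torch.cuda.is_available() else 'cpu'
## Creating the model for a Vision Transformer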
class VIT(nn.Module):
  def __init__(self, mlp_ratio=4):
    super(VIT, self).__init__()
    self.patch_size = (image_shape[0] / n_patches, image_shape[1] / n_patches)

    #Positional embedding
    self.register_buffer('positional_embeddings', calc_positional_embeddings(n_patches ** 2 + 1, n_embd), persistent=False)
    
    ## Add class tokens
    self.class_tokens = nn.Parameter(torch.rand(1, n_embd))

    self.input_d = int(image_shape[2] * self.patch_size[0] * self.patch_size[1])

    self.lin_map = nn.Linear(self.input_d, n_embd, bias=False) 

    # 4) Transformer encoder blocks
    self.blocks = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_blocks)])

    # 5) Classification MLP: outputs raw logits, since F.cross_entropy applies log-softmax itself
    self.mlp = nn.Sequential(
        nn.Linear(n_embd, action_bins),
    )

  def forward(self, images, targets=None):
    # Dividing images into patches
    n = images.shape[0]  # images are (batch, height, width, channels)
    patches = get_patches_fast(images)
    
    # Running linear layer tokenization
    # Map the vector corresponding to each patch to the hidden size dimension
    out = self.lin_map(patches)
    
    # Adding classification token to the tokens
    out = torch.cat((self.class_tokens.expand(n, 1, -1), out), dim=1)
    
    # Adding positional embedding
    out = out + self.positional_embeddings.repeat(n, 1, 1)
    
    # Transformer Blocks
    for block in self.blocks:
        out = block(out)

    # Getting the classification token only
    out = out[:, 0]
    logits = self.mlp(out)
        
    if targets is None:
        loss = None
    else:
        B, C = logits.shape
        targets = targets.view(B)
        loss = F.cross_entropy(logits, targets)
    return (logits, loss)

if __name__ == "__main__":
    model = VIT()
    m = model.to(device)
    # print the number of parameters in the model
    print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    for iter in range(max_iters):
        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # sample a batch of data
        xb, yb = get_batch_vit('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()


: 

# Generalist Robotics Policies (GRPs)
These models need multiple types of inputs:

(goal_text, goal_image, observation_img, robot_pose, robot_velocity) -> GRP -> action

Let's start by adding text conditioning.

## Adding text goals
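
Before text can enter the transformer it has to become a rectangular batch of integer tokens. The sketch below shows one way to do that with the character-level vocabulary used earlier. The goal strings are invented for the example, and truncating every goal to the shortest length is an assumption suggested by the `shortest_goal_txt` variable tracked during data loading; padding with a dedicated pad token would be the alternative.

```python
# Hypothetical goal strings standing in for dataset_tmp["goal"].
goals = ["pick up the spoon", "open the drawer", "close the drawer"]

# Character-level vocabulary, built the same way as earlier in the notebook.
chars = sorted(set("".join(goals)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Truncate every goal to the shortest length so the batch is rectangular.
block_size = min(len(g) for g in goals)

def encode_goal(text):
    """Map the first block_size characters of a goal to integer tokens."""
    return [stoi[c] for c in text[:block_size]]

encoded = [encode_goal(g) for g in goals]
print(block_size)                             # 15
print("".join(itos[i] for i in encoded[0]))   # pick up the spo
```

Truncation loses the end of longer instructions, as the decoded output shows, so it is only a reasonable shortcut while the goals are short and similar in length.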

In [None]:
## Text goals




: 

In [None]:
n_head = 8
n_embd = 128

class GRP(nn.Module):
  def __init__(self, mlp_ratio=4):
    super(GRP, self).__init__()
    ## Text processing portion
    self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
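    # Positional embedding: one row per token (text tokens + class token + image patches).
    # Assumes every goal is truncated to shortest_goal_txt characters.
    self.register_buffer('positional_embeddings', calc_positional_embeddings(shortest_goal_txt + n_patches ** 2 + 1, n_embd), persistent=False)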

    self.patch_size = (image_shape[0] / n_patches, image_shape[1] / n_patches)

    self.class_tokens = nn.Parameter(torch.rand(1, n_embd))

    self.input_d = int(image_shape[2] * self.patch_size[0] * self.patch_size[1])

    self.lin_map = nn.Linear(self.input_d, n_embd, bias=False) 
    self.lin_map_goal_img = nn.Linear(self.input_d, n_embd, bias=False) 

    # 4) Transformer encoder blocks
    self.blocks = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_blocks)])

    # 5) Classification MLP: outputs raw logits, since F.cross_entropy applies log-softmax itself
    self.mlp = nn.Sequential(
        nn.Linear(n_embd, action_bins),
    )

  def forward(self, images, goals, goal_img, targets):
    # Dividing images into patches
    n = images.shape[0]  # images are (batch, height, width, channels)
    B, T = goals.shape
    patches = get_patches_fast(images)
    goals_e = self.token_embedding_table(goals)
    
    # Running linear layer tokenization
    # Map the vector corresponding to each patch to the hidden size dimension
    out = self.lin_map(patches)
    
    # Adding classification token to the tokens
    out = torch.cat((self.class_tokens.expand(n, 1, -1), out), dim=1)
    out = torch.cat([goals_e, out], dim=1) ## Prepend the text goal tokens (the goal image is not used yet)
    
    # Adding positional embedding
    out = out + self.positional_embeddings.repeat(n, 1, 1)
    
    # Transformer Blocks
    for block in self.blocks:
        out = block(out)

    # Getting the classification token only; it sits right after the T text tokens
    out = out[:, T]
    logits = self.mlp(out)
        
    if targets is None:
        loss = None
    else:
        # B,T,C = 4,8,2 # batch, time, channels
        B, C = logits.shape
        # logits = logits.view(B*T, C)
        targets = targets.view(B)
        loss = F.cross_entropy(logits, targets)
    return (logits, loss)

: 
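
Once text tokens are concatenated in front of the image tokens, it helps to write down where everything sits in the sequence. The small sketch below does that bookkeeping in plain Python. The text length of 15 is a hypothetical value, and the `layout` list exists only for illustration.

```python
n_patches = 8    # patches per side, as in the notebook
text_len = 15    # hypothetical number of goal characters per sample (T)

# Token order after the two torch.cat calls in GRP.forward:
# [text_0 ... text_{T-1}, class_token, patch_0 ... patch_63]
layout = ["text"] * text_len + ["class"] + ["patch"] * (n_patches ** 2)

seq_len = len(layout)
class_index = layout.index("class")

print(seq_len)       # 80
print(class_index)   # 15
```

Two consequences follow from this layout: the positional-embedding table needs `seq_len` rows rather than `n_patches ** 2 + 1`, and the class token has to be read at index `text_len` rather than at index 0.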

In [None]:
## Train this model to convergence on one batch
model = GRP()
# print the number of parameters in the model
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        # wandb.log({"train loss": losses['train'], "val loss": losses['val']})

    # sample a batch of data
    xb, x2b, yb = get_batch('train')

    # evaluate the loss (no goal image yet, so pass None for goal_img)
    logits, loss = model(xb, x2b, None, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
# context = torch.zeros((1, 1), dtype=torch.long, device=device)
# print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
# wandb.finish()

: 

## Continuous Actions
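While the above model works so far, it is not handling actions properly. Actions can be discretized, and that often helps to get a well-performing policy; however, the world is continuous and so are most robots. Moving to continuous actions is a small change: the loss becomes a regression loss, and the output head predicts the action values directly instead of class probabilities.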
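
The difference between the two treatments can be seen on a handful of numbers. The sketch below first discretizes some example actions into `action_bins` classes, then scores a set of direct predictions with mean squared error. The action range of -1 to 1 and all the values are made up for illustration.

```python
import numpy as np

action_bins = 10
low, high = -1.0, 1.0   # assumed action range, for illustration only

actions = np.array([-0.95, -0.25, 0.05, 0.37, 0.99])

# Discrete treatment: map each action to one of `action_bins` class labels.
edges = np.linspace(low, high, action_bins + 1)
labels = np.clip(np.digitize(actions, edges) - 1, 0, action_bins - 1)
bin_centers = (edges[:-1] + edges[1:]) / 2
reconstructed = bin_centers[labels]

# Continuous treatment: regress the value directly and score it with MSE.
predictions = np.array([-0.9, -0.3, 0.0, 0.4, 0.9])
mse = np.mean((predictions - actions) ** 2)

print(labels)                                  # class index per action
print(np.abs(reconstructed - actions).max())   # quantization error, at most half a bin width
print(mse)                                     # regression loss on the raw values
```

With bins, the best achievable precision is limited by the bin width, whereas a regression head has no such floor but gives up the ability to represent several distinct action choices at once.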

In [None]:
loss = F.mse_loss(logits, targets)

: 

In [None]:
## Need to update the model definition
n_head = 8
n_embd = 128
action_dim = 3 ## Assumed: size of the world_vector stored in dataset["action"]

class GRP(nn.Module):
  def __init__(self, mlp_ratio=4):
    super(GRP, self).__init__()
    ## Text processing portion
    self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
    # Positional embedding: one row per token (text tokens + class token + image patches).
    # Assumes every goal is truncated to shortest_goal_txt characters.
    self.register_buffer('positional_embeddings', calc_positional_embeddings(shortest_goal_txt + n_patches ** 2 + 1, n_embd), persistent=False)

    self.patch_size = (image_shape[0] / n_patches, image_shape[1] / n_patches)

    self.class_tokens = nn.Parameter(torch.rand(1, n_embd))

    self.input_d = int(image_shape[2] * self.patch_size[0] * self.patch_size[1])

    self.lin_map = nn.Linear(self.input_d, n_embd, bias=False) 
    self.lin_map_goal_img = nn.Linear(self.input_d, n_embd, bias=False) 

    # 4) Transformer encoder blocks
    self.blocks = nn.ModuleList([Block(n_embd, n_head) for _ in range(n_blocks)])

    # 5) Regression MLP: predicts the continuous action values directly (no softmax)
    self.mlp = nn.Sequential(
        nn.Linear(n_embd, action_dim),
    )

  def forward(self, images, goals, goal_img, targets):
    # Dividing images into patches
    n = images.shape[0]  # images are (batch, height, width, channels)
    B, T = goals.shape
    patches = get_patches_fast(images)
    goals_e = self.token_embedding_table(goals)
    
    # Running linear layer tokenization
    # Map the vector corresponding to each patch to the hidden size dimension
    out = self.lin_map(patches)
    
    # Adding classification token to the tokens
    out = torch.cat((self.class_tokens.expand(n, 1, -1), out), dim=1)
    out = torch.cat([goals_e, out], dim=1) ## Prepend the text goal tokens (the goal image is not used yet)
    
    # Adding positional embedding
    out = out + self.positional_embeddings.repeat(n, 1, 1)
    
    # Transformer Blocks
    for block in self.blocks:
        out = block(out)

    # Getting the classification token only; it sits right after the T text tokens
    out = out[:, T]
    pred = self.mlp(out)
        
    if targets is None:
        loss = None
    else:
        # targets hold continuous actions with shape (B, action_dim)
        loss = F.mse_loss(pred, targets)
    return (pred, loss)

In [None]:
## Train this model to convergence on one batch
model = GRP()
# print the number of parameters in the model
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        # wandb.log({"train loss": losses['train'], "val loss": losses['val']})

    # sample a batch of data
    xb, x2b, yb = get_batch('train')

    # evaluate the loss (no goal image yet, so pass None for goal_img)
    pred, loss = model(xb, x2b, None, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
