### Introduction
This notebook runs inference with the [Diffuser](https://arxiv.org/abs/2205.09991) planning model for model-based RL. It is adapted from the authors' [original](https://colab.research.google.com/drive/1YajKhu-CUIGBJeQPehjVPJcK_b38a8Nc?usp=sharing#scrollTo=57hSzI4mCgat) notebook. If you are new to reinforcement learning, the Hugging Face [Reinforcement Learning Course](https://huggingface.co/blog/deep-rl-intro) is a good primer.

> Colab made by [natolambert](https://twitter.com/natolambert).

![diffusers_library](https://github.com/huggingface/diffusers/raw/main/docs/source/imgs/diffusers_library.jpg)


### Installing Packages

#### `apt-get install` requirements 

These packages are mainly needed to install MuJoCo and run it in Colab.
The list was adapted from this [demo](https://colab.research.google.com/drive/1KGMZdRq6AemfcNscKjgpRzXqfhUtCf-V?usp=sharing).

In [1]:
# installations primarily needed for MuJoCo
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libgl1-mesa-dev is already the newest version (20.0.8-0ubuntu1~18.04.1).
libgl1-mesa-dev set to manually installed.
software-properties-common is already the newest version (0.96.24.32.18).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
Suggested packages:
  glew-utils
The following NEW packages will be installed:
  libgl1-mesa-glx libglew-dev libglew2.0 libosmesa6 libosmesa6-dev
0 upgraded, 5 newly installed, 0 to remove and 49 not upgraded.
Need to get 2,916 kB of archives.
After this operation, 12.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libgl1-mesa-glx amd64 20.0.8-0ubuntu1~18.04.1 [5,532 B]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libglew2.0 amd64 2.0.0-5 [140 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/univ

#### Install Diffusers

In [2]:
%cd /content

# install latest HF diffusers
!git clone https://github.com/huggingface/diffusers 
!pip install -q /content/diffusers 
!pip install -q datasets transformers 

/content
Cloning into 'diffusers'...
remote: Enumerating objects: 3493, done.
remote: Counting objects: 100% (184/184), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 3493 (delta 177), reused 170 (delta 170), pack-reused 3309
Receiving objects: 100% (3493/3493), 914.57 KiB | 18.29 MiB/s, done.
Resolving deltas: 100% (2285/2285), done.
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
     |████████████████████████████████| 101 kB 4.9 MB/s 

#### `pip install` requirements

In [3]:
# primarily RL-specific requirements
%pip install -f https://download.pytorch.org/whl/torch_stable.html \
                free-mujoco-py \
                einops \
                gym \
                protobuf==3.20.1 \
                git+https://github.com/rail-berkeley/d4rl.git \
                mediapy \
                Pillow==9.0.0 


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting git+https://github.com/rail-berkeley/d4rl.git
  Cloning https://github.com/rail-berkeley/d4rl.git to /tmp/pip-req-build-pyic35em
  Running command git clone -q https://github.com/rail-berkeley/d4rl.git /tmp/pip-req-build-pyic35em
Collecting free-mujoco-py
  Downloading free_mujoco_py-2.1.6-py3-none-any.whl (14.1 MB)
     |████████████████████████████████| 14.1 MB 6.8 MB/s 
Collecting einops
  Downloading einops-0.4.1-py3-none-any.whl (28 kB)
Collecting protobuf==3.20.1
  Downloading protobuf-3.20.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 38.1 MB/s 
Collecting mediapy
  Downloading mediapy-1.0.3-py3-none-any.whl (24 kB)
Collecting Pillow==9.0.0
  Downloading Pillow-9.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.

#### Import D4RL to initialize Mujoco
[MuJoCo](https://github.com/deepmind/mujoco) is a physics simulator used extensively in reinforcement learning research. Here, we import [D4RL](https://github.com/rail-berkeley/d4rl) (a library of datasets and environments for offline RL). The first import triggers a one-time build of the `mujoco-py` bindings, so this cell can take a while.

In [4]:
## cythonize mujoco-py at first import
import d4rl

Compiling /usr/local/lib/python3.7/dist-packages/mujoco_py/cymj.pyx because it changed.
[1/1] Cythonizing /usr/local/lib/python3.7/dist-packages/mujoco_py/cymj.pyx
running build_ext
building 'mujoco_py.cymj' extension
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuextensionbuilder
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuextensionbuilder/temp.linux-x86_64-3.7
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuextensionbuilder/temp.linux-x86_64-3.7/usr
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuextensionbuilder/temp.linux-x86_64-3.7/usr/local
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuextensionbuilder/temp.linux-x86_64-3.7/usr/local/lib
creating /usr/local/lib/python3.7/dist-packages/mujoco_py/generated/_pyxbld_2.0.2.13_37_linuxcpuexten

No module named 'flow'
No module named 'carla'




---



### Environment & Model Setup
In this section, we will create the environment, handle the data, and run the diffusion model.

#### Imports



In [5]:
import torch
import tqdm
import numpy as np
import gym 

#### Create environment
This colab is designed to run with pretrained models from the hopper environment. As more models are trained, this can be extended.


In [6]:
env_name = "hopper-medium-expert-v2"
env = gym.make(env_name)
data = env.get_dataset() # dataset is only used for normalization in this colab



Downloading dataset: http://rail.eecs.berkeley.edu/datasets/offline_rl/gym_mujoco_v2/hopper_medium_expert-v2.hdf5 to /root/.d4rl/datasets/hopper_medium_expert-v2.hdf5


load datafile: 100%|██████████| 9/9 [00:03<00:00,  2.33it/s]


#### Define constants

In [7]:
# Cuda settings for colab
torch.cuda.get_device_name(0)
DEVICE = 'cuda:0'
DTYPE = torch.float

# diffusion model settings
n_samples = 4   # number of trajectories planned via diffusion
horizon = 128   # length of sampled trajectories
state_dim = env.observation_space.shape[0] 
action_dim = env.action_space.shape[0]
num_inference_steps = 100 # number of diffusion steps

#### Helper functions
* `normalize` rescales values to [-1, 1] using the per-dimension minimum and maximum of the D4RL training dataset,
* `de_normalize` inverts that scaling so the states can be rendered correctly,
* `to_torch` casts NumPy arrays, tensors, and dicts of either to torch tensors (used for conditioning the model, see `reset_x0`).

In [31]:
def normalize(x_in, data, key):
    # rescale to [-1, 1] using the per-dimension min/max of the dataset
    upper = np.max(data[key], axis=0)
    lower = np.min(data[key], axis=0)
    x_out = 2 * (x_in - lower) / (upper - lower) - 1
    return x_out

def de_normalize(x_in, data, key):
    # invert `normalize`: map [-1, 1] back to the dataset's original range
    upper = np.max(data[key], axis=0)
    lower = np.min(data[key], axis=0)
    x_out = lower + (upper - lower) * (1 + x_in) / 2
    return x_out

def to_torch(x_in, dtype=None, device=None):
    # cast arrays, tensors, or dicts of them to tensors on the target device
    dtype = dtype or DTYPE
    device = device or DEVICE
    if type(x_in) is dict:
        return {k: to_torch(v, dtype, device) for k, v in x_in.items()}
    elif torch.is_tensor(x_in):
        return x_in.to(device).type(dtype)
    return torch.tensor(x_in, dtype=dtype, device=device)
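
To make the scaling concrete, here is a minimal round-trip sketch using the same formulas as `normalize` and `de_normalize`. The `toy_data` dict is a made-up stand-in for the D4RL dataset, not real Hopper data.

```python
import numpy as np

def normalize(x_in, data, key):
    # scale each dimension to [-1, 1] using the dataset's per-dimension min and max
    upper = np.max(data[key], axis=0)
    lower = np.min(data[key], axis=0)
    return 2 * (x_in - lower) / (upper - lower) - 1

def de_normalize(x_in, data, key):
    # invert normalize: map [-1, 1] back to the dataset's original range
    upper = np.max(data[key], axis=0)
    lower = np.min(data[key], axis=0)
    return lower + (upper - lower) * (1 + x_in) / 2

# toy stand-in for the D4RL dataset dict (values are invented for illustration)
toy_data = {"observations": np.array([[0.0, -2.0], [4.0, 2.0], [2.0, 0.0]])}

obs = np.array([1.0, 1.0])
obs_norm = normalize(obs, toy_data, "observations")
obs_back = de_normalize(obs_norm, toy_data, "observations")
print(obs_norm)   # [-0.5  0.5]
print(obs_back)   # [1. 1.]
```

Because both helpers recompute the minimum and maximum from `data[key]`, they stay consistent with each other as long as the same dataset and key are passed to both.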


#### Sample env. initial state

In [32]:
## Can set environment seed for debugging
# torch.manual_seed(0)
# np.random.seed(0)
# env.seed(1996)

obs = env.reset()
obs_raw = obs

# normalize observations for forward passes
obs = normalize(obs, data, 'observations')

### Run the Diffusion Process

#### Initialize model
In this section, we create a scheduler and load a pretrained model from the Hub. An important detail in RL applications is to save `conditions`, which constrain the model to optimize only trajectories that start from the current state (which is crucial for making decisions!).

In [33]:
from diffusers import DDPMScheduler, TemporalUNet

# Two generators for different parts of the diffusion loop to work in colab
generator = torch.Generator(device='cuda')
generator_cpu = torch.Generator(device='cpu')

scheduler = DDPMScheduler(timesteps=100,beta_schedule="squaredcos_cap_v2")

# 3 different pretrained models are available for this task. 
# The horizon is the length of the trajectories used in training.
network = TemporalUNet.from_pretrained("fusing/ddpm-unet-rl-hopper-hor128").to(device=DEVICE)
# network = TemporalUNet.from_pretrained("fusing/ddpm-unet-rl-hopper-hor256").to(device=DEVICE)
# network = TemporalUNet.from_pretrained("fusing/ddpm-unet-rl-hopper-hor512").to(device=DEVICE)

#### Planning helper function
`reset_x0` is used to constrain the diffusion process to trajectories starting at the current state of the agent. 
Without this, the diffusion process would generate arbitrary high-reward trajectories, rather than trajectories beginning at the current state.

In [16]:
def reset_x0(x_in, cond, act_dim):
	for key, val in cond.items():
		x_in[:, key, act_dim:] = val.clone()
	return x_in
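
As a rough illustration of what `reset_x0` does to a trajectory batch, the sketch below uses a NumPy analogue (`reset_x0_np` is a hypothetical name, and the sizes are chosen only for the example). It assumes the layout used in this notebook: each timestep stores the action first, then the state.

```python
import numpy as np

def reset_x0_np(x_in, cond, act_dim):
    # overwrite the state part of the trajectory at every conditioned timestep
    for t, state in cond.items():
        x_in[:, t, act_dim:] = state.copy()
    return x_in

# illustrative sizes: 2 samples, 4 timesteps, 3 action dims, 11 state dims
n_samples, horizon, act_dim, state_dim = 2, 4, 3, 11
rng = np.random.default_rng(0)
traj = rng.standard_normal((n_samples, horizon, act_dim + state_dim))
before = traj.copy()

# pretend the current (normalized) state is all zeros
current_state = np.zeros((n_samples, state_dim))
traj = reset_x0_np(traj, {0: current_state}, act_dim)

print(np.allclose(traj[:, 0, act_dim:], 0.0))                     # state at t=0 is pinned
print(np.allclose(traj[:, 0, :act_dim], before[:, 0, :act_dim]))  # actions at t=0 untouched
print(np.allclose(traj[:, 1:], before[:, 1:]))                    # later timesteps untouched
```

Only the state slice at the conditioned timestep changes, so the noise in the actions and in all later timesteps is left for the model to denoise.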

#### Setup for denoising
`conditions` holds the current state, which is used to pin the first state of every planned trajectory (it is passed into `reset_x0`).

In [34]:
# network specific constants for inference
clip_denoised = network.clip_denoised
predict_epsilon = network.predict_epsilon

## add a batch dimension and repeat for multiple samples
## [ observation_dim ] --> [ n_samples x observation_dim ]
obs = obs[None].repeat(n_samples, axis=0)
conditions = {
    0: to_torch(obs, device=DEVICE)
  }

# constants for inference
batch_size = len(conditions[0])
shape = (batch_size, horizon, state_dim+action_dim)
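
The batching step can be checked in isolation. This small sketch uses a made-up 11-dimensional vector in place of a real normalized Hopper observation and shows how `obs[None].repeat(...)` produces one identical copy per sample.

```python
import numpy as np

n_samples = 4
obs = np.arange(11, dtype=np.float32)   # stand-in for one normalized observation

# add a leading batch axis, then copy the row n_samples times
batched = obs[None].repeat(n_samples, axis=0)
print(obs.shape, "->", batched.shape)   # (11,) -> (4, 11)
```

Every planned trajectory therefore starts from the same state; the samples differ only because of the random noise they are denoised from.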

#### Sample initial noise

In [35]:
# sample random initial noise vector
x1 = torch.randn(shape, device=DEVICE, generator=generator)
x = reset_x0(x1, conditions, action_dim)
x = to_torch(x)

#### Generate trajectories
The diffusion process for trajectories has 4 central components:
1. sample a predicted original sample from the model (note that this model directly predicts the sample, rather than the error term `epsilon` used in many diffusion models),
2. use the scheduler to predict the sample at the previous timestep,
3. [optional] add posterior noise to the sample,
4. condition the trajectory to constrain the initial state.

In [36]:
eta = 1.0 # noise factor for sampling reconstructed state

# run the diffusion process
for i in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):

    # create batch of timesteps to pass into model
    timesteps = torch.full((batch_size,), i, device=DEVICE, dtype=torch.long)
    
    # 1. generate prediction from model
    with torch.no_grad():
      residual = network(x, timesteps)
    
    # 2. use the model prediction to reconstruct an observation (de-noise)
    obs_reconstruct = scheduler.step(residual.cpu(), x.cpu(), i, predict_epsilon=predict_epsilon)

    # 3. [optional] add posterior noise to the sample
    if eta > 0:
      noise = torch.randn(obs_reconstruct.shape, generator=generator_cpu).to(obs_reconstruct.device)
      posterior_variance = scheduler.get_variance(i) # * noise
      # no noise when t == 0
      # NOTE: original implementation missing sqrt on posterior_variance
      obs_reconstruct = obs_reconstruct + int(i>0) * (0.5 * posterior_variance) * eta* noise  # MJ had as log var, exponentiated

    # 4. apply conditions to the trajectory
    obs_reconstruct_postcond = reset_x0(obs_reconstruct, conditions, action_dim)
    x = to_torch(obs_reconstruct_postcond)


100%|██████████| 100/100 [00:01<00:00, 67.03it/s]
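
The comments in step 3 mention a log-variance and a square root. The sketch below is not the scheduler's code; it recomputes the standard DDPM posterior variance for a squared-cosine schedule in NumPy (the helper `cosine_betas` and its constants are assumptions for illustration) and compares the two ways of scaling the noise.

```python
import numpy as np

def cosine_betas(num_steps, max_beta=0.999, s=0.008):
    # squared-cosine noise schedule; the constants are assumed, not read from the scheduler
    def alpha_bar(u):
        return np.cos((u + s) / (1 + s) * np.pi / 2) ** 2
    t = np.arange(num_steps)
    betas = 1 - alpha_bar((t + 1) / num_steps) / alpha_bar(t / num_steps)
    return np.minimum(betas, max_beta)

betas = cosine_betas(100)
alphas_cumprod = np.cumprod(1.0 - betas)
alphas_cumprod_prev = np.concatenate([[1.0], alphas_cumprod[:-1]])

# standard DDPM posterior variance of q(x_{t-1} | x_t, x_0)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

# a log-variance formulation scales the noise by exp(0.5 * log_var), i.e. sqrt(var)
log_var = np.log(np.clip(posterior_variance, 1e-20, None))
std_from_log = np.exp(0.5 * log_var)

t = 50
print(posterior_variance[0])                          # 0.0: no noise at the final step
print(std_from_log[t], 0.5 * posterior_variance[t])   # the two noise scales differ
```

For variances below 1, `0.5 * variance` is smaller than `sqrt(variance)`, so the two scalings inject different amounts of noise. Which one is intended here is not settled by the notebook itself.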




---



### Render the samples

#### Rendering Tools
Rendering from MuJoCo has historically been difficult. The code below is a modified version of the renderer from the original paper's codebase. A remaining TODO is to investigate this web-based [viewer](https://github.com/kevinzakka/mjc_viewer).

##### Video helpers

In [40]:
import os
import mediapy as media

def to_np(x_in):
	if torch.is_tensor(x_in):
		x_in = x_in.detach().cpu().numpy()
	return x_in

# from MJ's Diffuser code 
# https://github.com/jannerm/diffuser/blob/76ae49ae85ba1c833bf78438faffdc63b8b4d55d/diffuser/utils/colab.py#L79
def mkdir(savepath):
    """
        returns `True` iff `savepath` is created
    """
    if not os.path.exists(savepath):
        os.makedirs(savepath)
        return True
    else:
        return False


def show_sample(renderer, observations, filename='sample.mp4', savebase='/content/videos'):
    '''
    observations : [ batch_size x horizon x observation_dim ]
    '''

    mkdir(savebase)
    savepath = os.path.join(savebase, filename)

    images = []
    for rollout in observations:
        ## [ horizon x height x width x channels ]
        img = renderer._renders(rollout, partial=True)
        images.append(img)

    ## [ horizon x height x (batch_size * width) x channels ]
    images = np.concatenate(images, axis=2)

    media.show_video(images, codec='h264', fps=60)
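
The side-by-side layout produced by `show_sample` comes from concatenating along the width axis. A small shape check with dummy frames (sizes invented for the example) makes this explicit.

```python
import numpy as np

batch_size, horizon, height, width = 4, 5, 8, 8

# one dummy clip per rollout, each [ horizon x height x width x channels ]
clips = [np.full((horizon, height, width, 3), i, dtype=np.uint8)
         for i in range(batch_size)]

# axis=2 is the width axis, so the clips end up next to each other in every frame
tiled = np.concatenate(clips, axis=2)
print(tiled.shape)   # (5, 8, 32, 3)
```

Each frame of the resulting video shows all rollouts at the same timestep, which makes it easy to compare plans visually.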

##### Renderer helpers
These functions involve setting the state of the environment and reading it out in a pixel form.

In [41]:
# Code adapted from Michael Janner
# source: https://github.com/jannerm/diffuser/blob/main/diffuser/utils/rendering.py
import warnings  # needed by set_state below

import mujoco_py as mjc

def env_map(env_name):
    '''
        map D4RL dataset names to custom fully-observed
        variants for rendering
    '''
    if 'halfcheetah' in env_name:
        return 'HalfCheetahFullObs-v2'
    elif 'hopper' in env_name:
        return 'HopperFullObs-v2'
    elif 'walker2d' in env_name:
        return 'Walker2dFullObs-v2'
    else:
        return env_name

def get_image_mask(img):
    background = (img == 255).all(axis=-1, keepdims=True)
    mask = ~background.repeat(3, axis=-1)
    return mask

def atmost_2d(x):
    while x.ndim > 2:
        x = x.squeeze(0)
    return x

def set_state(env, state):
    qpos_dim = env.sim.data.qpos.size
    qvel_dim = env.sim.data.qvel.size
    if not state.size == qpos_dim + qvel_dim:
        warnings.warn(
            f'[ utils/rendering ] Expected state of size {qpos_dim + qvel_dim}, '
            f'but got state of size {state.size}')
        state = state[:qpos_dim + qvel_dim]

    env.set_state(state[:qpos_dim], state[qpos_dim:])
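
To show how `get_image_mask` is meant to be used, the sketch below composites two tiny dummy images on a white background, in the same way the `renders` method of the rendering class does. The 4x4 frames and pixel values are invented for the example.

```python
import numpy as np

def get_image_mask(img):
    # a pixel is background when all three channels are pure white (255)
    background = (img == 255).all(axis=-1, keepdims=True)
    return ~background.repeat(3, axis=-1)

# two dummy 4x4 "renders" on a white background, each with one non-white pixel
frame_a = np.full((4, 4, 3), 255, dtype=np.uint8)
frame_a[0, 0] = 10
frame_b = np.full((4, 4, 3), 255, dtype=np.uint8)
frame_b[3, 3] = 20

# start from a white canvas and paste the foreground pixels of each frame
composite = np.full_like(frame_a, 255)
for img in (frame_a, frame_b):
    mask = get_image_mask(img)
    composite[mask] = img[mask]

print(composite[0, 0], composite[3, 3], composite[1, 1])
```

Later frames overwrite earlier ones wherever their foregrounds overlap, so the order of the frames determines which pose ends up on top.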


##### Rendering class
Use the previously defined helpers to programmatically render pixel sequences from a trajectory of states. 
This class takes the re-scaled outputs of the diffusion process and visualizes them.

In [45]:
class MuJoCoRenderer:
    '''
        default mujoco renderer
    '''

    def __init__(self, env):
        if type(env) is str:
            env = env_map(env)
            self.env = gym.make(env)
        else:
            self.env = env
        ## - 1 because the envs in renderer are fully-observed
        ## @TODO : clean up
        self.observation_dim = np.prod(self.env.observation_space.shape) - 1
        self.action_dim = np.prod(self.env.action_space.shape)
        try:
            self.viewer = mjc.MjRenderContextOffscreen(self.env.sim)
        except Exception:
            print('[ utils/rendering ] Warning: could not initialize offscreen renderer')
            self.viewer = None

    def pad_observation(self, observation):
        state = np.concatenate([
            np.zeros(1),
            observation,
        ])
        return state

    def pad_observations(self, observations):
        qpos_dim = self.env.sim.data.qpos.size
        ## xpos is hidden
        xvel_dim = qpos_dim - 1
        xvel = observations[:, xvel_dim]
        xpos = np.cumsum(xvel) * self.env.dt
        states = np.concatenate([
            xpos[:,None],
            observations,
        ], axis=-1)
        return states

    def render(self, observation, dim=256, partial=False, qvel=True, render_kwargs=None, conditions=None):

        if type(dim) == int:
            dim = (dim, dim)

        if self.viewer is None:
            return np.zeros((*dim, 3), np.uint8)

        if render_kwargs is None:
            xpos = observation[0] if not partial else 0
            render_kwargs = {
                'trackbodyid': 2,
                'distance': 3,
                'lookat': [xpos, -0.5, 1],
                'elevation': -20
            }

        for key, val in render_kwargs.items():
            if key == 'lookat':
                self.viewer.cam.lookat[:] = val[:]
            else:
                setattr(self.viewer.cam, key, val)

        if partial:
            state = self.pad_observation(observation)
        else:
            state = observation

        qpos_dim = self.env.sim.data.qpos.size
        if not qvel or state.shape[-1] == qpos_dim:
            qvel_dim = self.env.sim.data.qvel.size
            state = np.concatenate([state, np.zeros(qvel_dim)])

        set_state(self.env, state)

        self.viewer.render(*dim)
        data = self.viewer.read_pixels(*dim, depth=False)
        data = data[::-1, :, :]
        return data

    def _renders(self, observations, **kwargs):
        images = []
        for observation in observations:
            img = self.render(observation, **kwargs)
            images.append(img)
        return np.stack(images, axis=0)

    def renders(self, samples, partial=False, **kwargs):
        if partial:
            samples = self.pad_observations(samples)
            partial = False

        sample_images = self._renders(samples, partial=partial, **kwargs)

        composite = np.ones_like(sample_images[0]) * 255

        for img in sample_images:
            mask = get_image_mask(img)
            composite[mask] = img[mask]

        return composite

    def __call__(self, *args, **kwargs):
        return self.renders(*args, **kwargs)
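
`pad_observations` rebuilds the hidden x-position by integrating the x-velocity over time. The sketch below reproduces that step on dummy data; the timestep `dt`, the dimensions, and the constant velocity are assumptions chosen for illustration rather than values read from the environment.

```python
import numpy as np

dt = 0.008                  # assumed simulator timestep
qpos_dim = 6                # assumed number of position coordinates (x is hidden)
xvel_index = qpos_dim - 1   # index of the x-velocity inside the partial observation

horizon, obs_dim = 5, 11
observations = np.zeros((horizon, obs_dim))
observations[:, xvel_index] = 2.0   # constant forward velocity

# integrate velocity to recover position, then prepend it to every observation
xpos = np.cumsum(observations[:, xvel_index]) * dt
states = np.concatenate([xpos[:, None], observations], axis=-1)

print(xpos)           # grows by 2.0 * dt per step
print(states.shape)   # (5, 12)
```

The recovered position is only defined up to the starting offset, which is enough for rendering a trajectory relative to where it began.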

#### Show Plans
This section renders 4 planned trajectories that all start from the same initial state in the environment.

##### Initialize renderer class for the environment

In [46]:
render = MuJoCoRenderer(env)

##### Show the video
Show the states generated by the diffusion model in the real environment. 
Note that the actions are dropped from the data before rendering.

In [47]:
de_normalized = de_normalize(to_np(x[:,:,action_dim:]), data, 'observations')
show_sample(render, de_normalized)




### [WIP] Run a trajectory in the environment
Code adapted from the [Trajectory Transformer](https://github.com/jannerm/trajectory-transformer) evaluation script. *Note*: performance is often low in this colab. The setup was built for debugging and does not use the compute needed for high performance. Even so, this part is relatively slow!

TODO: Add the reward "guide" from the original paper.

#### Define diffusion as function
This is the same code used to create trajectories above; we will reuse it at every state.

In [51]:
def run_diffusion(obs, horizon=128, n_samples=4):
  # normalize observations for forward passes
  obs = normalize(obs, data, 'observations')

  ## add a batch dimension and repeat for multiple samples
  ## [ observation_dim ] --> [ n_samples x observation_dim ]
  obs = obs[None].repeat(n_samples, axis=0)
  conditions = {
      # 0: to_torch(obs, device=DEVICE)
      0: torch.tensor(obs, device=DEVICE)
    }

  # constants for inference
  batch_size = len(conditions[0])
  shape = (batch_size, horizon, state_dim+action_dim)

  # sample random initial noise vector
  x1 = torch.randn(shape, device=DEVICE, generator=generator)
  x = reset_x0(x1, conditions, action_dim)

  eta = 1.0 # noise factor for sampling reconstructed state
  for i in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
      timesteps = torch.full((batch_size,), i, device=DEVICE, dtype=torch.long)
      with torch.no_grad():
        residual = network(x, timesteps)
      
      obs_reconstruct = scheduler.step(residual.cpu(), x.cpu(), i, predict_epsilon=False)

      if eta > 0:
        noise = torch.randn(obs_reconstruct.shape, generator=generator_cpu).to(obs_reconstruct.device)
        posterior_variance = scheduler.get_variance(i) # * noise
        # no noise when t == 0
        # NOTE: original implementation missing sqrt on posterior_variance
        obs_reconstruct = obs_reconstruct + int(i>0) * (0.5 * posterior_variance) * eta* noise  # MJ had as log var, exponentiated

      obs_reconstruct_postcond = reset_x0(obs_reconstruct, conditions, action_dim)
      x = to_torch(obs_reconstruct_postcond)

  return x

#### Repeatedly run diffusion and act
Increase `replan_freq` (the number of environment steps between re-planning) or lower `horizon` / `n_samples` to speed up this process.

In [52]:
# constants
replan_freq = 1
horizon = 32
n_samples = 16
T = 100 # default would be env.max_episode_steps, but that is very long

# reset the environment
observation = env.reset()
total_reward = 0

# observations for rendering
rollout = [observation.copy()]

for t in range(T):

  # plan every N steps
  if t % replan_freq == 0:
    sequences = run_diffusion(observation, horizon=horizon, n_samples=n_samples)
    # use the freshly planned sequences, not the global `x` from the earlier cell
    plans = to_np(sequences[:,:,:action_dim])
    

    # select random plan
    idx = np.random.randint(plans.shape[0])

  else:
    plans = plans[:, 1:, :]

  # select action at correct time
  action = plans[idx, 0, :]
  
  ## execute action in environment
  next_observation, reward, terminal, _ = env.step(action)

  ## update return
  total_reward += reward

  # save observations for rendering
  rollout.append(next_observation.copy())

  observation = next_observation
  if ((t+1)%10) == 0: print(f"completed step {t+1}")


100%|██████████| 100/100 [00:01<00:00, 65.66it/s]

completed step 10
completed step 20
completed step 30
completed step 40
completed step 50
completed step 60
completed step 70
completed step 80
completed step 90

completed step 100
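
Since each planning call runs the full denoising loop, `replan_freq` largely determines the run time. The following sketch counts how many planning calls a loop of this shape would make; `count_planner_calls` is a hypothetical helper and does not run the diffusion model.

```python
def count_planner_calls(num_steps, replan_freq):
    """Count how often the planner would run in a receding-horizon loop."""
    calls = 0
    for t in range(num_steps):
        if t % replan_freq == 0:
            calls += 1  # the notebook calls run_diffusion at this point
        # otherwise the cached plan is reused, shifted forward by one step
    return calls

for freq in (1, 4, 8):
    print(freq, count_planner_calls(100, freq))
# 1 100
# 4 25
# 8 13
```

Because the cached plan is shortened by one step each time it is reused, `replan_freq` should not exceed `horizon`, otherwise the plan runs out of actions before the next re-planning step.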





#### Render the roll-out


In [53]:
show_sample(render, np.expand_dims(np.stack(rollout),axis=0))

