# Reinforcement Learning Project
#### by Luca Subitoni and Massimiliano Nigro

Code for the "reproducibility challenge" project of the <a href="https://www11.ceda.polimi.it/manifestidott/manifestidott/controller/MainPublic.do?EVN_DETTAGLIOINSEGNAMENTO=EVENTO&c_insegn=061642&aa=2023&k_corso_la=1380">Reinforcement Learning PhD course</a> offered by Politecnico di Milano.

Selected paper: <a href="https://arxiv.org/abs/1606.03476">Generative Adversarial Imitation Learning</a> by Jonathan Ho and Stefano Ermon.

#### CODE DESCRIPTION:
The code is meant to run on Google Colab (last check: 28th June 2024).
The setup of the Jupyter Notebook is divided into the following sections:

- <b>Install packages</b>: installs the necessary packages (including MuJoCo) and loads the RL_project folder (which you can download from the GitHub repository) from your Google Drive account. A script of the imitation Python library is replaced with a corrected version

- <b>General functions</b>: definition of Python functions useful throughout the code

The main code is then divided into the following sections:

1. <b>Creating the environment for both the expert policy and the imitation algorithms</b>: creates the environments (must run)
2. <b>Train the expert algorithm</b>: uses TRPO to train an expert policy and evaluates a random policy (optional: if you have already run it, skip it, upload the previous results through the Google Colab file manager, and go to step 3)
3. <b>Imitation learning</b>: defines the imitation learning algorithms implemented, i.e., Behavioral Cloning and GAIL (must run)
4. <b>Compare the GAIL and Behavioral Cloning to the expert</b>: trains the imitation algorithms and compares them to the expert and to the random policy (must run)
5. <b>Plot curves to visualize the comparison</b>: generates the normalized curves shown in the GAIL paper. If you already have the .csv results, you can upload them directly through the Google Colab file manager
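
The normalization mentioned in step 5 can be sketched as follows. This is a minimal illustration, assuming the scaling used in the plots of the GAIL paper, where the random policy maps to 0 and the expert to 1. The function name `normalize_return` and the numbers in the usage line are hypothetical and are not taken from the notebook.

```python
def normalize_return(mean_return, random_return, expert_return):
    """Scale a mean return so that the random policy maps to 0.0 and the expert to 1.0."""
    span = expert_return - random_return
    if span == 0:
        raise ValueError("expert and random returns must differ")
    return (mean_return - random_return) / span


# Made-up values, for illustration only
print(normalize_return(50.0, random_return=0.0, expert_return=100.0))  # 0.5
```

A value above 1 would mean that the imitation policy scored higher than the expert on average.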


# Install packages

In [1]:
!pip install stable-baselines3[extra] # Does not have TRPO
!pip install sb3-contrib # For TRPO
!pip install imitation # Imitation Learning
!pip install renderlab  # For rendering environments

Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
Collecting gymnasium<0.30,>=0.28.1 (from stable-baselines3[extra])
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting shimmy[atari]~=1.3.0 (from stable-baselines3[extra])
  Downloading Shimmy-1.3.0-py3-none-any.whl (37 kB)
Collecting autorom[accept-rom-license]~=0.6.1 (from stable-baselines3[extra])
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1->stable-baselines3[extra])
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

In [2]:
# MuJoCo Setup (part 1)
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf
!pip install free-mujoco-py

# MuJoCo Setup (part 2)
import os
if not os.path.exists('.mujoco_setup_complete'):
  # Get the prereqs
  !apt-get -qq update
  !apt-get -qq install -y libosmesa6-dev libgl1-mesa-glx libglfw3 libgl1-mesa-dev libglew-dev patchelf
  # Get Mujoco
  !mkdir ~/.mujoco
  !wget -q https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz -O mujoco.tar.gz
  !tar -zxf mujoco.tar.gz -C "$HOME/.mujoco"
  !rm mujoco.tar.gz
  # Add it to the actively loaded path and the bashrc path (these only do so much)
  !echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin' >> ~/.bashrc
  !echo 'export LD_PRELOAD=$LD_PRELOAD:/usr/lib/x86_64-linux-gnu/libGLEW.so' >> ~/.bashrc
  # THE ANNOYING ONE, FORCE IT INTO LDCONFIG SO WE ACTUALLY GET ACCESS TO IT THIS SESSION
  !echo "/root/.mujoco/mujoco210/bin" > /etc/ld.so.conf.d/mujoco_ld_lib_path.conf
  !ldconfig
  # Install Mujoco-py
  !pip3 install -U 'mujoco-py<2.2,>=2.1'
  # run once
  !touch .mujoco_setup_complete

try:
  if _mujoco_run_once:
    pass
except NameError:
  _mujoco_run_once = False
if not _mujoco_run_once:
  # Add it to the actively loaded path and the bashrc path (these only do so much)
  try:
    os.environ['LD_LIBRARY_PATH']=os.environ['LD_LIBRARY_PATH'] + ':/root/.mujoco/mujoco210/bin'
  except KeyError:
    os.environ['LD_LIBRARY_PATH']='/root/.mujoco/mujoco210/bin'
  try:
    os.environ['LD_PRELOAD']=os.environ['LD_PRELOAD'] + ':/usr/lib/x86_64-linux-gnu/libGLEW.so'
  except KeyError:
    os.environ['LD_PRELOAD']='/usr/lib/x86_64-linux-gnu/libGLEW.so'
  # presetup so we don't see output on first env initialization
  import mujoco_py
  _mujoco_run_once = True

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
software-properties-common is already the newest version (0.99.22.9).
The following additional packages will be installed:
  libegl-dev libgl-dev libgles-dev libgles1 libglu1-mesa libglu1-mesa-dev libglvnd-core-dev
  libglvnd-dev libglx-dev libopengl-dev libosmesa6
The following NEW packages will be installed:
  libegl-dev libgl-dev libgl1-mesa-dev libgl1-mesa-glx libgles-dev libgles1 libglew-dev
  libglu1-mesa libglu1-mesa-dev libglvnd-core-dev libglvnd-dev libglx-dev libopengl-dev libosmesa6
  libosmesa6-dev
0 upgraded, 15 newly installed, 0 to remove and 45 not upgraded.
Need to get 4,020 kB of archives.
After this operation, 19.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libglx-dev amd64 1.4.0-1 [14.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libgl-dev amd64 1.4.0-1 [101 kB]
Get:3 http://archive.ubuntu.com/ubuntu 

INFO:root:running build_ext
INFO:root:building 'mujoco_py.cymj' extension
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder/temp.linux-x86_64-cpython-310
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder/temp.linux-x86_64-cpython-310/usr
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder/temp.linux-x86_64-cpython-310/usr/local
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder/temp.linux-x86_64-cpython-310/usr/local/lib
INFO:root:creating /usr/local/lib/python3.10/dist-packages/mujoco_py/generated/_pyxbld_2.1.2.14_310_linuxcpuextensionbuilder/temp.linux-x86_64-cpython-31

In [3]:
# Changing a function inside the package "imitation" in the script "util.py"
# See issue: https://github.com/HumanCompatibleAI/imitation/issues/820
from google.colab import drive
drive.mount('/content/drive')
!cp -r "/content/drive/My Drive/RL_project/util.py" "/usr/local/lib/python3.10/dist-packages/imitation/util/"

Mounted at /content/drive


# General functions

In [4]:
def evaluate(env, policy, gamma=1., num_episodes=50, deterministic=True, print_results=False, use_random_policy=False):
    """
    DESCRIPTION:
        Evaluate a RL agent
    INPUT:
        - env (Env object): the Gym environment
        - policy (BasePolicy object): the policy in stable_baselines3
        - gamma (float, default=1.0): the discount factor
        - num_episodes (int, default=50): number of episodes to evaluate it
        - deterministic (bool, default=True): whether to use a deterministic policy or not
        - print_results (bool, default=False): whether to print results or not
        - use_random_policy (bool, default=False): whether to use a random policy or not (useful since it is the reference used in the GAIL paper to compare the algorithms)
    OUTPUT:
        - mean_episode_reward (float): the mean (discounted) return across the "num_episodes" episodes
        - std_episode_reward (float): the standard error of the mean return across the "num_episodes" episodes
    """

    from tqdm import tqdm
    import numpy as np
    
    all_episode_rewards = []
    for i in tqdm(range(num_episodes)): # iterate over the episodes
        episode_rewards = []
        done = False
        discounter = 1.
        obs, _ = env.reset()
        while not done: # iterate over the steps until termination

            # If a random policy is requested, sample a random action
            if use_random_policy:
                action = env.action_space.sample()

            # otherwise I use the model to predict the action to take
            else:
                action, _ = policy.predict(obs, deterministic=deterministic)

            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_rewards.append(reward * discounter) # compute discounted reward
            discounter *= gamma

        all_episode_rewards.append(sum(episode_rewards))

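    # Compute the statistics unconditionally, so that they are returned even when print_results is False
    mean_episode_reward = np.mean(all_episode_rewards)
    # Standard error of the mean (population std divided by sqrt(n - 1)); set to 0 for a single episode
    std_episode_reward = np.std(all_episode_rewards) / np.sqrt(num_episodes - 1) if num_episodes > 1 else 0.0

    if print_results:
        print("\nMean reward:", mean_episode_reward,
              "- Std reward:", std_episode_reward,
              "- Num episodes:", num_episodes)

    return mean_episode_reward, std_episode_reward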
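
The two quantities computed by `evaluate` can be sketched in isolation. This is a minimal, dependency-free illustration: the helper names `discounted_return` and `standard_error` are hypothetical and not part of the notebook, and the reward values are made up.

```python
import math


def discounted_return(rewards, gamma=1.0):
    """Sum the rewards of one episode, weighting step t by gamma**t."""
    total, discount = 0.0, 1.0
    for reward in rewards:
        total += reward * discount
        discount *= gamma
    return total


def standard_error(values):
    """Population standard deviation divided by sqrt(n - 1), i.e. the standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    population_variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(population_variance) / math.sqrt(n - 1)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
print(standard_error([1.0, 3.0]))  # 1.0
```

With `gamma=1.0`, as used throughout the notebook, the discounted return reduces to the plain sum of the episode rewards.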

In [5]:
def plot_train_curve(log_dir):

    """
    DESCRIPTION:
        This function is used to plot the train curve of the experts
    INPUT:
        - log_dir (string): the string which specifies the path which contains the "evaluations.npz" file
    OUTPUT:
        - None: it plots and saves (in the log_dir path) the train curve figure
    """

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.load(log_dir+"evaluations.npz")

    timesteps = data["timesteps"] # n_timesteps
    rewards = data["results"] # n_timesteps, n_eval_episodes

    plt.figure(figsize=(12,4))
    plt.plot(timesteps, rewards, color="black", alpha=0.1, lw=2)
    plt.plot(timesteps, np.mean(rewards, axis=1), color="red", lw=2, label="Mean reward")
    plt.xlabel("Timesteps")
    plt.ylabel("Reward")
    plt.legend()
    plt.grid()
    plt.savefig(log_dir+"train_reward_curve.png", dpi=300)
    plt.show()

In [6]:
# General imports
import numpy as np
import torch as th
import os, shutil
import pandas as pd
import matplotlib.pyplot as plt

# To download files from Google Colab
from google.colab import files

# Functions for the RL expert policy
import gymnasium as gym
from sb3_contrib import TRPO
import renderlab

# Function to load the expert policy for the imitation package
from imitation.policies.serialize import load_policy

# Functions for creating the vectorized environments
from imitation.util import util
from imitation.util.util import make_vec_env
from imitation.data.wrappers import RolloutInfoWrapper

# Importing the function which creates the rollouts for the imitation algorithm
from imitation.data import rollout

# Functions for defining custom policies
from stable_baselines3.common import policies, torch_layers

# Behavioral cloning
from imitation.algorithms import bc

# Callbacks to stop the training of the expert
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold, CallbackList
from stable_baselines3.common.monitor import Monitor

# Removing the sample_data folder from Google Colab
if os.path.exists("/content/sample_data/"):
    shutil.rmtree("/content/sample_data/")

INFO:numexpr.utils:NumExpr defaulting to 2 threads.



# Code

## 1. Creating the environment for both the expert policy and the imitation algorithms

In [7]:
# List of environments to choose from (CartPole and Acrobot differ slightly from the versions used in the paper, but the differences should not affect the results in a meaningful way)
env_list_string = ["CartPole-v1", # Classic 0 (v0 is deprecated) --> Increased max_episode_steps (from 200 to 500) + increased reward_threshold (from 195.0 to 475.0)
                   "Acrobot-v1", # Classic 1 (v0 is deprecated) --> Instead of the direct angles, it provides the sine and cosine of each angle + reward rescaled from +-200 to +-500
                   "MountainCar-v0", # Classic 2 (ok)
                   "HalfCheetah-v2", # MuJoCo 3 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator
                   "Hopper-v2", # MuJoCo 4 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator
                   "Walker2d-v2", # MuJoCo 5 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator
                   "Ant-v2", # MuJoCo 6 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator
                   "Humanoid-v2", # MuJoCo 7 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator
                   "Reacher-v2"] # MuJoCo 8 (v1 is deprecated) --> No differences, only provides compatibility with newer MuJoCo simulator

# These thresholds, except the first one, are those of the paper; they can be lowered a bit to avoid very long computation times
# Thresholds for rewards (i.e. when I consider the model an expert)
reward_thr = {"CartPole-v1": 500.00, # Different from the paper since they changed the maximum_episode_steps from v0 to v1
              "Acrobot-v1": -75.25,
              "MountainCar-v0": -98.75,
              "Reacher-v2": -4.09,
              "HalfCheetah-v2": 4463.46,
              "Hopper-v2": 3571.38,
              "Walker2d-v2": 6717.08,
              "Ant-v2": 4228.37,
              "Humanoid-v2": 9575.40,
              }

# String ID for the environment
environment_string = env_list_string[4]
print("The selected environment is:", environment_string)

# Seed for the random number generator
seed = 42

# Number of vectorized environments (multiprocessing) --> 32 or 64 work well on the Colab GPU (roughly 3k to 6k iterations/s)
if th.cuda.is_available():
    n_env = 64 # GPU
else:
    n_env = 8 # CPU

# Random Number Generator
rng = np.random.default_rng(seed)

# Create log directory for the current environment
log_dir = "/content/log/"+environment_string+"/"
if os.path.exists(log_dir):
    shutil.rmtree(log_dir) # delete the old directory, if it exists
os.makedirs(log_dir)

# This is the name of the zip file which will be automatically downloaded and which will contain the results of the expert and of the random policy
results_zip_name = "results_" + environment_string + ".zip"

# Creating the vectorized environment for training the models faster
venv = util.make_vec_env(environment_string,
                         n_envs=n_env,
                         rng=rng,
                         post_wrappers=[lambda venv, _: RolloutInfoWrapper(venv)]  # needed for computing rollouts later (to do: check whether to use env or venv, see: https://imitation.readthedocs.io/en/latest/tutorials/1_train_bc.html)
                         )

# Creating the environment for evaluating the model
env_eval = gym.make(environment_string)
env_eval = Monitor(env_eval)

# Creating the environment for playing the video with renderlab and save them into the log directory
env_movie = gym.make(environment_string, render_mode = "rgb_array")
env_movie = renderlab.RenderFrame(env_movie, log_dir)

# Create a flag that tells whether the model was trained in this session or reloaded from a zip file uploaded to Google Colab
model_trained_now = False # This must not be changed (the code automatically accounts for it)



The selected environment is: Hopper-v2


## 2. Train the expert algorithm

In [None]:
# Custom MLP policy
policy_kwargs = dict(activation_fn=th.nn.Tanh,
                     net_arch=dict(pi=[100, 100], vf=[100, 100]))

# Callbacks for stopping training when the model reaches the predefined reward threshold
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=reward_thr[environment_string], verbose=1)
stop_callback = EvalCallback(env_eval, callback_on_new_best=callback_on_best, verbose=1, deterministic=True)

# Maximum training time steps (approximately, the maximum number of steps to train the algorithm)
max_time_steps = 5e6

# How often to evaluate the algorithm to obtain the training curve (in timesteps; it is divided by n_env below because each callback call advances n_env timesteps)
# 25000 for the classic-control environments and 50000 for MuJoCo (otherwise training is too slow, since the model is evaluated too often)
eval_iter = {"CartPole-v1": 25000,
             "Acrobot-v1": 25000,
             "MountainCar-v0": 25000,
             "Reacher-v2": 50000,
             "HalfCheetah-v2": 50000,
             "Hopper-v2": 50000,
             "Walker2d-v2": 50000,
             "Ant-v2": 50000,
             "Humanoid-v2": 50000,
             }

# Callback for evaluating the model periodically (training curve)
eval_callback = EvalCallback(env_eval, log_path=log_dir, eval_freq=int(eval_iter[environment_string]/n_env), n_eval_episodes=10, deterministic=True, render=False)
# Create the callback list
callback = CallbackList([stop_callback, eval_callback])

# Train (for an on-policy algorithm such as TRPO, log_interval is the number of training iterations between two reports)
model = TRPO("MlpPolicy", venv, verbose=1, policy_kwargs=policy_kwargs)
#model.learn(total_timesteps=10_000, log_interval=1, progress_bar=True) # Train for a fixed number of timesteps (useful to get the progress bar and assess the training speed)
# (N.B. the report printed every log_interval is computed on rollouts of a non-deterministic policy, so the results look worse than they actually are)
model.learn(total_timesteps=int(max_time_steps), log_interval=1, callback=callback) # Train until the mean reward threshold is reached or until total_timesteps is reached

# Save the model
model.save(log_dir+environment_string+"_model") # Generates a zip file in which the model is saved

# Plot and save the training curve
plot_train_curve(log_dir)

# Set the flag which indicates that I have trained the model now to True
model_trained_now = True
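
The `eval_freq` scaling used for `eval_callback` can be sketched with plain arithmetic. In Stable-Baselines3 the callback frequency is counted in calls, and with a vectorized environment each call advances `n_envs` timesteps. The helper name `per_call_eval_freq` is hypothetical, and the lower bound of 1 is an assumption added here so that the frequency never becomes zero.

```python
def per_call_eval_freq(target_timesteps, n_envs):
    """Convert an evaluation interval in timesteps into an interval in callback calls."""
    return max(target_timesteps // n_envs, 1)


print(per_call_eval_freq(50_000, 64))  # 781
print(per_call_eval_freq(25_000, 8))   # 3125
```

With 64 environments, an evaluation therefore happens every 781 calls, that is every 49,984 timesteps, which is close to the intended 50,000.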

In [None]:
# EXPERT ========================================================
# Evaluate the trained model and save the final results
model_eval_mean, model_eval_std = evaluate(env_eval, model, gamma=1., num_episodes=50, deterministic=True, print_results=True)
np.save(log_dir+"expert_evaluation.npy", np.array([model_eval_mean, model_eval_std]))

# Show movie
print("\n\n===============================")
evaluate(env_movie, model, num_episodes=1, deterministic=True, print_results=False)
env_movie.play()

In [None]:
# RANDOM POLICY ==================================================
# Evaluate a random policy and save the final results
model_eval_mean, model_eval_std = evaluate(env_eval, policy=None, gamma=1., num_episodes=50, deterministic=True, print_results=True, use_random_policy=True)
np.save(log_dir+"random_evaluation.npy", np.array([model_eval_mean, model_eval_std]))

# Show movie
print("\n\n===============================")
evaluate(env_movie, policy=None, num_episodes=1, deterministic=True, print_results=False, use_random_policy=True)
env_movie.play()

In [None]:
# Create a zip file to download all files
!zip -r {results_zip_name} {log_dir}
# Download the folder in which all data have been saved
files.download(results_zip_name)

## 3. Imitation learning

In [17]:
import os
import shutil

# If the trained model was uploaded to Google Colab (instead of being trained in this session), reload the results
if not model_trained_now:

    # Unzip the zip file which contains the training results of the expert and of the random policy
    if os.path.exists(log_dir):
        shutil.rmtree(log_dir) # delete the old directory, if it exists
    os.makedirs(log_dir)
    path_in_which_files_are_unzipped = log_dir+log_dir[1:] # Because of how the zip file is built, its content is unzipped in a subfolder
    if os.path.exists(path_in_which_files_are_unzipped):
        shutil.rmtree(path_in_which_files_are_unzipped) # delete the old directory, if it exists
    !unzip {results_zip_name} -d {log_dir}
    # Move the files into the correct log dir
    for file in os.listdir(path_in_which_files_are_unzipped):
        shutil.move(path_in_which_files_are_unzipped+file, log_dir)
    # Remove the original folder
    shutil.rmtree(log_dir+"content")
    print("The zipped model results have been loaded!")

# Testing if the model loading was correct
test_if_correct = False
if test_if_correct:
    # To load policy from disk use "ppo" as key register (https://imitation.readthedocs.io/en/latest/tutorials/1_train_bc.html)
    expert = load_policy("ppo", venv, path="/content/log/"+environment_string+"/"+environment_string+"_model")
    # Evaluate the trained model
    _, _ = evaluate(env_eval, model, gamma=1., num_episodes=10, deterministic=True, print_results=True) # Same expert model as before (you must not restart the kernel, otherwise you will not have the "model" variable)
    _, _ = evaluate(env_eval, expert, gamma=1., num_episodes=10, deterministic=True, print_results=True) # Loaded expert model

Archive:  results_Hopper-v2.zip
   creating: /content/log/Hopper-v2/content/
  inflating: /content/log/Hopper-v2/__MACOSX/._content  
  inflating: /content/log/Hopper-v2/content/.DS_Store  
  inflating: /content/log/Hopper-v2/__MACOSX/content/._.DS_Store  
   creating: /content/log/Hopper-v2/content/log/
  inflating: /content/log/Hopper-v2/__MACOSX/content/._log  
  inflating: /content/log/Hopper-v2/content/log/.DS_Store  
  inflating: /content/log/Hopper-v2/__MACOSX/content/log/._.DS_Store  
   creating: /content/log/Hopper-v2/content/log/Hopper-v2/
  inflating: /content/log/Hopper-v2/__MACOSX/content/log/._Hopper-v2  
  inflating: /content/log/Hopper-v2/content/log/Hopper-v2/Hopper-v2_model.zip  
  inflating: /content/log/Hopper-v2/__MACOSX/content/log/Hopper-v2/._Hopper-v2_model.zip  
  inflating: /content/log/Hopper-v2/content/log/Hopper-v2/expert_evaluation.npy  
  inflating: /content/log/Hopper-v2/__MACOSX/content/log/Hopper-v2/._expert_evaluation.npy  
  inflating: /content/log/

In [18]:
def get_transitions(min_timesteps=50, min_episodes=10, rng=rng, expert=None, keep_only_min_episodes=True, keep_only_500_pairs=True):

    # min_timesteps is the minimum number of state-action pairs to collect
    # min_episodes is the minimum number of trajectories in the dataset (x axis of the plots in the GAIL paper)
    # https://imitation.readthedocs.io/en/latest/_api/imitation.data.rollout.html

    from imitation.data.types import TrajectoryWithRew

    # I want the expert to behave deterministically
    kwargs = {"deterministic_policy": True}

    # Generate trajectories for the imitation learning algorithm
    rollouts = rollout.rollout(expert,
                               venv,
                               rollout.make_sample_until(min_timesteps=min_timesteps, min_episodes=min_episodes),
                               rng=rng,
                               **kwargs)

    # Keep only the first min_episodes trajectories (the function above can oversample sometimes)
    if keep_only_min_episodes:
        rollouts = rollouts[:int(min_episodes)]

    # Truncate each trajectory to at most 499 state-action pairs (the paper mentions approximately 50 pairs of observation and action for each trajectory)
    if keep_only_500_pairs:
        for i in range(len(rollouts)):
            infos = rollouts[i].infos
            rollouts[i] = TrajectoryWithRew(obs = rollouts[i].obs[:499+1], # We have one more observation than actions/rewards
                                            acts = rollouts[i].acts[:499],
                                            infos = None if infos is None else infos[:499], # infos must have the same length as acts
                                            terminal = rollouts[i].terminal,
                                            rews = rollouts[i].rews[:499])

    # Flatten the trajectories
    transitions = rollout.flatten_trajectories(rollouts)
    return transitions, rollouts
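
The truncation step in `get_transitions` relies on the fact that a trajectory holds one more observation than actions or rewards. A minimal sketch with plain Python lists shows the slicing; the helper name `truncate_trajectory` and the dictionary layout are hypothetical and only stand in for the `TrajectoryWithRew` object. In this sketch the per-step `infos` list is sliced as well, so that all per-step sequences keep the same length.

```python
def truncate_trajectory(obs, acts, rews, infos=None, max_pairs=499):
    """Keep at most max_pairs state-action pairs; obs keeps one extra (final) observation."""
    return {
        "obs": obs[: max_pairs + 1],
        "acts": acts[:max_pairs],
        "rews": rews[:max_pairs],
        "infos": None if infos is None else infos[:max_pairs],
    }


# A made-up 1000-step episode: 1001 observations, 1000 actions and rewards
long_episode = truncate_trajectory(list(range(1001)), list(range(1000)), [0.0] * 1000)
print(len(long_episode["obs"]), len(long_episode["acts"]))  # 500 499
```

Episodes shorter than `max_pairs` are left unchanged, because slicing past the end of a sequence simply returns the whole sequence.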

In [19]:
def get_obs_act(transitions):
    """
    DESCRIPTION:
        Function to get observations and actions within the transitions which were generated. 
        This function is useful for the loss function computation for the Behavioral Cloning algorithm
    INPUT:
        - transitions: the transitions of the expert, as generated by the get_transitions() function
    OUTPUT:
        - observations: the observations within the transitions
        - actions: the actions within the transitions
    """
    
    observations = np.array([transitions[i]["obs"] for i in range(len(transitions))])
    actions = np.array([transitions[i]["acts"] for i in range(len(transitions))])

    return th.from_numpy(observations), th.from_numpy(actions)

#### 3.1. Behavioral Cloning function definition

In [20]:
# Creating a custom policy as defined in the paper
class FeedForward100Policy(policies.ActorCriticPolicy):
    """
    A feed forward policy network with two hidden layers of 100 units and Tanh activation function in between.
    """

    def __init__(self, *args, **kwargs):
        """Builds FeedForward100Policy; arguments passed to `ActorCriticPolicy`."""
        super().__init__(*args, **kwargs, net_arch=[100, 100]) # Default is Tanh (see https://stable-baselines.readthedocs.io/en/master/_modules/stable_baselines/common/policies.html#FeedForwardPolicy)

In [21]:
def behavioral_cloning_train(rng=rng, play_video=False, env_eval=None, venv=None, env_movie=None, transitions=None, device="cpu"):

    """
    DESCRIPTION:
        This function performs the train of the Behavioral Cloning algorithm
    INPUT:
        - rng (default=rng, defined above in the code): random number generator
        - play_video (bool, default=False): whether to play the video or not
        - env_eval (default=None): the environment in which to perform the evaluation
        - venv (default=None): the vectorized environment in which to perform the training steps
        - env_movie (default=None): the auxiliary environment which is used to create the movie of the policy on Google Colab
        - transitions (default=None): the transitions of the expert
        - device (default="cpu"): the device used to run the Behavioral Cloning training algorithm
    OUTPUT:
        - im_mean_reward: the mean reward of the Behavioral Cloning algorithm
        - im_std_reward: the standard deviation of the Behavioral Cloning algorithm
    """

    from imitation.algorithms.bc import BehaviorCloningLossCalculator

    # Create the policy object (see https://imitation.readthedocs.io/en/latest/_modules/imitation/algorithms/bc.html#BC.__init__)
    custom_policy = FeedForward100Policy(observation_space=venv.observation_space,
                                        action_space=venv.action_space,
                                        lr_schedule=lambda _: th.finfo(th.float32).max,)

    # Train val split for behavioral cloning
    train_transitions = transitions[:int(len(transitions)*0.7)]
    val_transitions = transitions[int(len(transitions)*0.7):]
    loss_calculator = BehaviorCloningLossCalculator(ent_weight=0.001, l2_weight=0.0)

    # Train the imitation learning algorithm
    bc_trainer = bc.BC(observation_space=venv.observation_space,
                action_space=venv.action_space,
                demonstrations=train_transitions,
                batch_size=128,
                policy=custom_policy,
                device=device,
                rng=rng, # This must be re-initialized at every re-run (the authors do 5-7 re-runs)
    )

    # Compute the loss in order to perform an early stopping based on the validation curve
    loss = np.inf
    current_loss = -np.inf
    while current_loss < loss:
        # If it is not the first loop
        if current_loss != -np.inf:
            loss = current_loss
        # Train the algorithm
        bc_trainer.train(n_epochs=1)
        # Evaluate the algorithm
        val_observations, val_actions = get_obs_act(val_transitions)
        current_loss = float(loss_calculator(bc_trainer.policy, val_observations, val_actions).loss)
        print(current_loss)

    # Evaluate the imitation learning algorithm
    im_mean_reward, im_std_reward = evaluate(env_eval, bc_trainer.policy, gamma=1., num_episodes=50, deterministic=True, print_results=True) # Behavioral Cloning policy

    # Play the video
    if play_video:
        evaluate(env_movie, bc_trainer.policy, num_episodes=1, deterministic=True, print_results=False)
        env_movie.play()

    return im_mean_reward, im_std_reward
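
The early-stopping rule inside `behavioral_cloning_train` (train one epoch at a time and stop as soon as the validation loss no longer decreases) can be sketched independently of the imitation library. The function name `train_with_early_stopping`, its two callables, and the `max_epochs` safety cap are assumptions made for this illustration; the notebook loop has no such cap.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=1000):
    """Train epoch by epoch until the validation loss stops decreasing."""
    best_loss = float("inf")
    epochs = 0
    while epochs < max_epochs:
        train_one_epoch()
        epochs += 1
        current_loss = validation_loss()
        if current_loss >= best_loss:
            break  # the last epoch did not improve the validation loss
        best_loss = current_loss
    return epochs, best_loss


# Made-up validation losses standing in for a real model
fake_losses = iter([1.0, 0.8, 0.7, 0.75, 0.6])
epochs, best = train_with_early_stopping(lambda: None, lambda: next(fake_losses))
print(epochs, best)  # 4 0.7
```

As in the notebook, the policy that is kept is the one after the last epoch, not the one with the best validation loss; restoring the best weights would require saving a copy of them at each improvement.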

#### 3.2. GAIL function definition

In [22]:
from torch import nn
from imitation.util import networks, util
from imitation.rewards.reward_nets import RewardNet
from stable_baselines3.common import preprocessing
from typing import Any, Callable, Dict, Iterable, Optional, Sequence, Tuple, Type, cast

class CustomRewardNet(RewardNet):

    def __init__(
        self,
        observation_space: gym.Space,
        action_space: gym.Space,
        use_state: bool = True,
        use_action: bool = True,
        use_next_state: bool = False,
        use_done: bool = False,
        **kwargs,
    ):
        """Builds reward MLP.

        Args:
            observation_space: The observation space.
            action_space: The action space.
            use_state: should the current state be included as an input to the MLP?
            use_action: should the current action be included as an input to the MLP?
            use_next_state: should the next state be included as an input to the MLP?
            use_done: should the "done" flag be included as an input to the MLP?
            kwargs: passed straight through to `build_mlp`.
        """
        super().__init__(observation_space, action_space)
        combined_size = 0

        self.use_state = use_state
        if self.use_state:
            combined_size += preprocessing.get_flattened_obs_dim(observation_space)

        self.use_action = use_action
        if self.use_action:
            combined_size += preprocessing.get_flattened_obs_dim(action_space)

        self.use_next_state = use_next_state
        if self.use_next_state:
            combined_size += preprocessing.get_flattened_obs_dim(observation_space)

        self.use_done = use_done
        if self.use_done:
            combined_size += 1

        full_build_mlp_kwargs: Dict[str, Any] = {
            "hid_sizes": (100, 100),
            "activation": th.nn.Tanh,
            **kwargs,
            # we do not want the values below to be overridden
            "in_size": combined_size,
            "out_size": 1,
            "squeeze_output": True,
        }

        self.mlp = networks.build_mlp(**full_build_mlp_kwargs)


    def forward(self, state, action, next_state, done):
        inputs = []
        if self.use_state:
            inputs.append(th.flatten(state, 1))
        if self.use_action:
            inputs.append(th.flatten(action, 1))
        if self.use_next_state:
            inputs.append(th.flatten(next_state, 1))
        if self.use_done:
            inputs.append(th.reshape(done, [-1, 1]))

        inputs_concat = th.cat(inputs, dim=1)

        outputs = self.mlp(inputs_concat)
        assert outputs.shape == state.shape[:1]

        return outputs


In [23]:
def gail_train(training_iterations=300, state_action_pairs_per_iteration=5000, play_video=False, env_eval=None,  venv=None, env_movie=None, rollouts=None):
    """
    DESCRIPTION:
        This function trains the GAIL algorithm and evaluates the resulting policy
    INPUT:
        - training_iterations (int, default=300): the number of training iterations of the GAIL algorithm (as described in the Appendix of the paper)
        - state_action_pairs_per_iteration (int, default=5000): the number of state-action pairs per iteration (as described in the Appendix of the paper)
        - play_video (bool, default=False): whether to play the video or not
        - env_eval (default=None): the environment in which to perform the evaluation
        - venv (default=None): the vectorized environment in which to perform the training steps
        - env_movie (default=None): the fictitious environment which is used to create the movie of the trained policy using Google Colab
        - rollouts (default=None): the rollouts generated by the expert
    OUTPUT:
        - im_mean_reward: the mean reward of the GAIL algorithm
        - im_std_reward: the standard deviation of the reward of the GAIL algorithm
    """

    from sb3_contrib import TRPO
    from imitation.algorithms.adversarial.gail import GAIL

    # Compute the total number of training timesteps and the generator timesteps per round
    n_timesteps = int(training_iterations * state_action_pairs_per_iteration)
    training_kwargs = {"gen_train_timesteps": int(state_action_pairs_per_iteration)}

    # Custom reward network (the GAIL discriminator)
    reward_net = CustomRewardNet(
        observation_space=venv.observation_space,
        action_space=venv.action_space
    )

    # Policy and value networks: two hidden layers of 100 tanh units each
    policy_kwargs = dict(activation_fn=th.nn.Tanh,
                        net_arch=dict(pi=[100, 100], vf=[100, 100]))

    # Learning algorithm
    learner = TRPO("MlpPolicy",
                    venv,
                    verbose=1,
                    gamma=0.995,
                    gae_lambda=0.97,
                    policy_kwargs=policy_kwargs)

    # Compose GAIL trainer
    gail_trainer = GAIL(demonstrations=rollouts,
                        demo_batch_size=1024,
                        gen_replay_buffer_capacity=512,
                        n_disc_updates_per_round=8,
                        venv=venv,
                        gen_algo=learner,
                        reward_net=reward_net,
                        allow_variable_horizon=True,
                        **training_kwargs)

    # Train the GAIL algorithm
    gail_trainer.train(n_timesteps)

    # Evaluate the imitation learning algorithm
    im_mean_reward, im_std_reward = evaluate(env_eval, gail_trainer.policy, gamma=1., num_episodes=50, deterministic=True, print_results=True) # Trained GAIL policy

    # Play the video
    if play_video:
        evaluate(env_movie, gail_trainer.policy, num_episodes=1, deterministic=True, print_results=False)
        env_movie.play()

    return im_mean_reward, im_std_reward
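
The training budget used by `gail_train` follows from two numbers: the training iterations (optionally reduced by a fraction) and the state-action pairs per iteration. The sketch below is a minimal illustration of that arithmetic; the helper name `gail_budget` is hypothetical. It uses the Hopper-v2 settings of section 4 (500 iterations, 50000 pairs per iteration, fraction 1/5).

```python
def gail_budget(full_iterations, pairs_per_iteration, fraction=1.0):
    """Return (training_iterations, n_timesteps) as computed for gail_train."""
    # int() truncates, so a fraction that does not divide evenly rounds down
    training_iterations = int(fraction * full_iterations)
    n_timesteps = int(training_iterations * pairs_per_iteration)
    return training_iterations, n_timesteps


# Hopper-v2 settings with the 1/5 reduction
iterations, timesteps = gail_budget(500, 50000, fraction=1/5)
print(iterations, timesteps)
```

Assuming the trainer runs one round per `gen_train_timesteps` generator steps, this gives 100 rounds, which is consistent with the `0/100` round counter in the Hopper-v2 training log.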

## 4. Compare the GAIL and Behavioral Cloning to the expert
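
The comparison rescales every mean reward so that 0 corresponds to the random policy and 1 to the expert. The sketch below shows that normalization in isolation. The function name `rescale_reward` is hypothetical, and the numbers are made up for illustration; they are not results from the experiments.

```python
def rescale_reward(mean_reward, random_eval, expert_eval):
    """Map a mean reward to a scale where random = 0 and expert = 1."""
    if expert_eval == random_eval:
        # The scale is undefined when the two reference values coincide
        raise ValueError("expert and random evaluations must differ")
    return (mean_reward - random_eval) / (expert_eval - random_eval)


# Made-up reference values, for illustration only
random_eval, expert_eval = 10.0, 110.0
print(rescale_reward(60.0, random_eval, expert_eval))
```

A rescaled value can fall below 0 or above 1 when the imitation policy performs worse than the random policy or better than the expert.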

In [None]:
# To load a policy from disk, use "ppo" as the registry key (https://imitation.readthedocs.io/en/latest/tutorials/1_train_bc.html)
expert = load_policy("ppo", venv, path="/content/log/"+environment_string+"/"+environment_string+"_model")

# Testing if the model loading was correct
test_if_correct = False
if test_if_correct:
    # Evaluate the trained model
    _, _ = evaluate(env_eval, model, gamma=1., num_episodes=10, deterministic=True, print_results=True) # Same expert model as before (you must not restart the kernel, otherwise you will not have the "model" variable)
    _, _ = evaluate(env_eval, expert, gamma=1., num_episodes=10, deterministic=True, print_results=True) # Loaded expert model


# Load the expert evaluations
expert_eval = np.load(log_dir+"expert_evaluation.npy")[0]
random_eval = np.load(log_dir+"random_evaluation.npy")[0]

# Select the GAIL training iterations depending on the environment (full values, before any reduction)
training_iterations_gail = {"CartPole-v1": 300,
                            "Acrobot-v1": 300,
                            "MountainCar-v0": 300,
                            "Reacher-v2": 200,
                            "HalfCheetah-v2": 500,
                            "Hopper-v2": 500,
                            "Walker2d-v2": 500,
                            "Ant-v2": 500,
                            "Humanoid-v2": 1500,
                            }
# Fraction of the full training iterations to actually run, to limit the computational cost (here 1/5 of the full values)
retuce_training_iterations_fraction = 1/5

state_action_pairs_per_iteration_gail = {"CartPole-v1": 5000,
                                        "Acrobot-v1": 5000,
                                        "MountainCar-v0": 5000,
                                        "Reacher-v2": 50000,
                                        "HalfCheetah-v2": 50000,
                                        "Hopper-v2": 50000,
                                        "Walker2d-v2": 50000,
                                        "Ant-v2": 50000,
                                        "Humanoid-v2": 50000,
                                        }

min_episodes_list = {"CartPole-v1": [1, 4, 7, 10],
                    "Acrobot-v1": [1, 4, 7, 10],
                    "MountainCar-v0": [1, 4, 7, 10],
                    "Reacher-v2": [4, 11, 18],
                    "HalfCheetah-v2": [4, 11, 18, 25],
                    "Hopper-v2": [4, 11, 18, 25],
                    "Walker2d-v2": [4, 11, 18, 25],
                    "Ant-v2": [4, 11, 18, 25],
                    "Humanoid-v2": [80, 160, 240],
                    }

min_timesteps = 50
mean_r_bc_list = []
mean_r_gail_list = []

# Loop over the minimum number of expert trajectories (the list changes for each environment)
first_run = True
for min_episodes in min_episodes_list[environment_string]:

    print("Number of episodes selected:", min_episodes)

    # Train the algorithms several times (the paper "runs the algorithms over 5-7 reruns"; lower n_reruns if Google Colab shuts down the connection)
    n_reruns = 7
    for i in range(n_reruns):

        # Set new random seed
        rng = np.random.default_rng(i)

        # Get training samples
        transitions, rollouts = get_transitions(min_timesteps=min_timesteps,
                                                min_episodes=min_episodes,
                                                rng=rng,
                                                expert=expert,
                                                keep_only_min_episodes=True, # We only keep min_episodes trajectories
                                                keep_only_500_pairs=True) # We only keep the first 500 pairs of (obs, act) for each trajectory

        # Train Behavioral Cloning and get results
        mean_r_bc, std_r_bc = behavioral_cloning_train(rng=rng, env_eval=env_eval, env_movie=env_movie, venv=venv, transitions=transitions)
        mean_r_bc_list.append(mean_r_bc)

        # Train GAIL and get results (only a percentage of the training iterations, otherwise it is too slow)
        mean_r_gail, std_r_gail = gail_train(training_iterations=int(retuce_training_iterations_fraction*training_iterations_gail[environment_string]),
                                            state_action_pairs_per_iteration = state_action_pairs_per_iteration_gail[environment_string],
                                            venv=venv, rollouts=rollouts, env_eval=env_eval, env_movie=env_movie)
        mean_r_gail_list.append(mean_r_gail)

        # Print a summary at the end of each rerun
        print("========================================================")
        print("Minimum number of episodes selected:", min_episodes)
        print("ITERATION", i, "- SUMMARY:")
        print("- Expert:", expert_eval)
        print("- Random:", random_eval)
        print("- Behavioral Cloning:", mean_r_bc)
        print("- GAIL:", mean_r_gail)
        print("========================================================")

        # Rescale the results between 0 (= random) and 1 (= expert)
        rescaled_mean_r_bc = (mean_r_bc - random_eval) / (expert_eval - random_eval)
        rescaled_mean_r_gail = (mean_r_gail - random_eval) / (expert_eval - random_eval)

        # Build a one-row dataframe for this rerun and append it to the results
        row_columns = ["n_episodes", "Mean Reward BC", "Mean Reward GAIL", "Rescaled Mean Reward BC", "Rescaled Mean Reward GAIL"]
        row = [min_episodes, mean_r_bc, mean_r_gail, rescaled_mean_r_bc, rescaled_mean_r_gail]
        current_row_in_dataframe = pd.DataFrame(dict(zip(row_columns, row)), index=[0])
        if first_run:
            # The first rerun creates the dataframe
            dataframe = current_row_in_dataframe
            first_run = False
        else:
            dataframe = pd.concat([dataframe, current_row_in_dataframe]).reset_index(drop=True)

        # Save the pandas dataframe iteratively (overwrites)
        dataframe.to_csv(log_dir+"dataframe_results.csv")
        # Save it also on Google Drive to avoid losing results when Google Colab stops the connection
        dataframe.to_csv("/content/drive/MyDrive/RL_project/dataframe_results_"+environment_string+".csv")
        print("CSV dataframe saved!")

# Download when finished (this works only if Google Colab does not end the connection... otherwise you will find the results on your Google Drive)
files.download(log_dir+"dataframe_results.csv")

INFO:root:Loading Stable Baselines policy for '<class 'stable_baselines3.ppo.ppo.PPO'>' from '/content/log/Hopper-v2/Hopper-v2_model'


Number of episodes selected: 25


INFO:root:Rollout stats: {'n_traj': 32, 'return_min': 3593.4563449529765, 'return_mean': 3618.016123661458, 'return_std': 16.506941145310424, 'return_max': 3646.376665965219, 'len_min': 1000, 'len_mean': 1000.0, 'len_std': 0.0, 'len_max': 1000}
0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -0.00426 |
|    entropy        | 4.26     |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 223      |
|    loss           | 3.27     |
|    neglogp        | 3.27     |
|    prob_true_act  | 0.0413   |
|    samples_so_far | 32       |
--------------------------------


266batch [00:02, 119.77batch/s]
272batch [00:02, 102.89batch/s]


1.916487693786621


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -0.00339 |
|    entropy        | 3.39     |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 244      |
|    loss           | 1.92     |
|    neglogp        | 1.93     |
|    prob_true_act  | 0.146    |
|    samples_so_far | 32       |
--------------------------------


261batch [00:01, 157.94batch/s]
272batch [00:01, 153.03batch/s]


1.0717577934265137


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -0.00255 |
|    entropy        | 2.55     |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 252      |
|    loss           | 1.06     |
|    neglogp        | 1.07     |
|    prob_true_act  | 0.344    |
|    samples_so_far | 32       |
--------------------------------


265batch [00:01, 168.98batch/s]
272batch [00:01, 162.03batch/s]


0.23854409158229828


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | -0.00172 |
|    entropy        | 1.72     |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 260      |
|    loss           | 0.25     |
|    neglogp        | 0.252    |
|    prob_true_act  | 0.779    |
|    samples_so_far | 32       |
--------------------------------


266batch [00:01, 162.06batch/s]
272batch [00:01, 164.16batch/s]


-0.5597281455993652


0batch [00:00, ?batch/s]

---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000896 |
|    entropy        | 0.896     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 267       |
|    loss           | -0.572    |
|    neglogp        | -0.571    |
|    prob_true_act  | 1.77      |
|    samples_so_far | 32        |
---------------------------------


271batch [00:01, 148.39batch/s]
272batch [00:01, 154.59batch/s]


-1.345734715461731


0batch [00:00, ?batch/s]

---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -8.44e-05 |
|    entropy        | 0.0844    |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 276       |
|    loss           | -1.36     |
|    neglogp        | -1.36     |
|    prob_true_act  | 3.9       |
|    samples_so_far | 32        |
---------------------------------


257batch [00:01, 159.92batch/s]
272batch [00:01, 153.07batch/s]


-2.1181066036224365


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00072  |
|    entropy        | -0.72    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 284      |
|    loss           | -2.15    |
|    neglogp        | -2.15    |
|    prob_true_act  | 8.57     |
|    samples_so_far | 32       |
--------------------------------


263batch [00:02, 103.45batch/s]
272batch [00:02, 113.49batch/s]


-2.889061450958252


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00151  |
|    entropy        | -1.51    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 292      |
|    loss           | -2.87    |
|    neglogp        | -2.87    |
|    prob_true_act  | 18       |
|    samples_so_far | 32       |
--------------------------------


267batch [00:02, 95.73batch/s]
272batch [00:02, 97.22batch/s]


-3.510756731033325


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00229  |
|    entropy        | -2.29    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 300      |
|    loss           | -3.52    |
|    neglogp        | -3.53    |
|    prob_true_act  | 34.8     |
|    samples_so_far | 32       |
--------------------------------


270batch [00:02, 150.25batch/s]
272batch [00:02, 125.38batch/s]


-4.307999610900879


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00303  |
|    entropy        | -3.03    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 308      |
|    loss           | -4.29    |
|    neglogp        | -4.29    |
|    prob_true_act  | 74.8     |
|    samples_so_far | 32       |
--------------------------------


270batch [00:01, 140.18batch/s]
272batch [00:01, 138.09batch/s]


-4.9197587966918945


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00373  |
|    entropy        | -3.73    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 316      |
|    loss           | -4.96    |
|    neglogp        | -4.97    |
|    prob_true_act  | 148      |
|    samples_so_far | 32       |
--------------------------------


265batch [00:01, 155.37batch/s]
272batch [00:01, 146.17batch/s]


-5.476769924163818


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00439  |
|    entropy        | -4.39    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 323      |
|    loss           | -5.62    |
|    neglogp        | -5.63    |
|    prob_true_act  | 284      |
|    samples_so_far | 32       |
--------------------------------


261batch [00:01, 156.44batch/s]
272batch [00:01, 160.39batch/s]


-5.802836894989014


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00495  |
|    entropy        | -4.95    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 330      |
|    loss           | -5.4     |
|    neglogp        | -5.4     |
|    prob_true_act  | 359      |
|    samples_so_far | 32       |
--------------------------------


270batch [00:01, 159.91batch/s]
272batch [00:01, 156.11batch/s]


-6.017879962921143


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00548  |
|    entropy        | -5.48    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 336      |
|    loss           | -6.03    |
|    neglogp        | -6.03    |
|    prob_true_act  | 598      |
|    samples_so_far | 32       |
--------------------------------


265batch [00:02, 106.62batch/s]
272batch [00:02, 126.21batch/s]


-6.513702392578125


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00585  |
|    entropy        | -5.85    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 340      |
|    loss           | -6.54    |
|    neglogp        | -6.54    |
|    prob_true_act  | 915      |
|    samples_so_far | 32       |
--------------------------------


268batch [00:02, 111.16batch/s]
272batch [00:02, 105.37batch/s]


-6.693505764007568


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00613  |
|    entropy        | -6.13    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 345      |
|    loss           | -6.57    |
|    neglogp        | -6.58    |
|    prob_true_act  | 1.02e+03 |
|    samples_so_far | 32       |
--------------------------------


263batch [00:01, 151.29batch/s]
272batch [00:01, 137.98batch/s]


-6.908071517944336


0batch [00:00, ?batch/s]

--------------------------------
| batch_size        | 32       |
| bc/               |          |
|    batch          | 0        |
|    ent_loss       | 0.00647  |
|    entropy        | -6.47    |
|    epoch          | 0        |
|    l2_loss        | 0        |
|    l2_norm        | 349      |
|    loss           | -6.73    |
|    neglogp        | -6.74    |
|    prob_true_act  | 1.32e+03 |
|    samples_so_far | 32       |
--------------------------------


269batch [00:01, 161.64batch/s]
272batch [00:01, 150.93batch/s]


-6.756986141204834


100%|██████████| 50/50 [00:17<00:00,  2.81it/s]


Mean reward: 1508.6815155520173 - Std reward: 74.48115870882025 - Num episodes: 50
Using cpu device
Running with `allow_variable_horizon` set to True. Some algorithms are biased towards shorter or longer episodes, which may significantly confound results. Additionally, even unbiased algorithms can exploit the information leak from the termination condition, producing spuriously high performance. See https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html for more information.



round:   0%|          | 0/100 [00:00<?, ?it/s]

------------------------------------------
| raw/                        |          |
|    gen/rollout/ep_len_mean  | 19.3     |
|    gen/rollout/ep_rew_mean  | 15.9     |
|    gen/time/fps             | 1425     |
|    gen/time/iterations      | 1        |
|    gen/time/time_elapsed    | 11       |
|    gen/time/total_timesteps | 16384    |
------------------------------------------
--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 31.1     |
|    gen/rollout/ep_rew_mean          | 28.4     |
|    gen/rollout/ep_rew_wrapped_mean  | 13.1     |
|    gen/time/fps                     | 1087     |
|    gen/time/iterations              | 2        |
|    gen/time/time_elapsed            | 30       |
|    gen/time/total_timesteps         | 32768    |
|    gen/train/explained_variance     | 0.0354   |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00935  |
|

round:   1%|          | 1/100 [01:13<2:02:04, 73.98s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 117      |
|    gen/rollout/ep_rew_mean          | 168      |
|    gen/rollout/ep_rew_wrapped_mean  | 60       |
|    gen/time/fps                     | 1605     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 10       |
|    gen/time/total_timesteps         | 81920    |
|    gen/train/explained_variance     | 0.526    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00835  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 4        |
|    gen/train/policy_objective       | 0.0247   |
|    gen/train/std                    | 0.963    |
|    gen/train/value_loss             | 37.2     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   2%|▏         | 2/100 [02:22<1:55:08, 70.50s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 359      |
|    gen/rollout/ep_rew_mean          | 484      |
|    gen/rollout/ep_rew_wrapped_mean  | 185      |
|    gen/time/fps                     | 1687     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 147456   |
|    gen/train/explained_variance     | 0.877    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00765  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 8        |
|    gen/train/policy_objective       | 0.02     |
|    gen/train/std                    | 0.902    |
|    gen/train/value_loss             | 11.3     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   3%|▎         | 3/100 [03:30<1:52:47, 69.77s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 269      |
|    gen/rollout/ep_rew_mean          | 436      |
|    gen/rollout/ep_rew_wrapped_mean  | 154      |
|    gen/time/fps                     | 2088     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 212992   |
|    gen/train/explained_variance     | 0.929    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00731  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 12       |
|    gen/train/policy_objective       | 0.0189   |
|    gen/train/std                    | 0.875    |
|    gen/train/value_loss             | 12.2     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   4%|▍         | 4/100 [04:32<1:46:38, 66.65s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 475      |
|    gen/rollout/ep_rew_mean          | 815      |
|    gen/rollout/ep_rew_wrapped_mean  | 190      |
|    gen/time/fps                     | 2031     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 278528   |
|    gen/train/explained_variance     | 0.648    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00499  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 16       |
|    gen/train/policy_objective       | 0.0163   |
|    gen/train/std                    | 0.864    |
|    gen/train/value_loss             | 33.1     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   5%|▌         | 5/100 [05:34<1:42:56, 65.01s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 313      |
|    gen/rollout/ep_rew_mean          | 720      |
|    gen/rollout/ep_rew_wrapped_mean  | 201      |
|    gen/time/fps                     | 1792     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 344064   |
|    gen/train/explained_variance     | 0.794    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00651  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 20       |
|    gen/train/policy_objective       | 0.0215   |
|    gen/train/std                    | 0.851    |
|    gen/train/value_loss             | 52.4     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   6%|▌         | 6/100 [06:37<1:40:23, 64.08s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 273      |
|    gen/rollout/ep_rew_mean          | 774      |
|    gen/rollout/ep_rew_wrapped_mean  | 299      |
|    gen/time/fps                     | 2065     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 409600   |
|    gen/train/explained_variance     | 0.893    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00642  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 24       |
|    gen/train/policy_objective       | 0.0223   |
|    gen/train/std                    | 0.825    |
|    gen/train/value_loss             | 65.6     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   7%|▋         | 7/100 [07:37<1:37:38, 63.00s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 260      |
|    gen/rollout/ep_rew_mean          | 824      |
|    gen/rollout/ep_rew_wrapped_mean  | 311      |
|    gen/time/fps                     | 2104     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 475136   |
|    gen/train/explained_variance     | 0.969    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00777  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 28       |
|    gen/train/policy_objective       | 0.0251   |
|    gen/train/std                    | 0.779    |
|    gen/train/value_loss             | 33.3     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   8%|▊         | 8/100 [08:43<1:38:03, 63.95s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 286      |
|    gen/rollout/ep_rew_mean          | 923      |
|    gen/rollout/ep_rew_wrapped_mean  | 249      |
|    gen/time/fps                     | 1784     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 540672   |
|    gen/train/explained_variance     | 0.964    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00664  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 32       |
|    gen/train/policy_objective       | 0.0203   |
|    gen/train/std                    | 0.741    |
|    gen/train/value_loss             | 31.2     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:   9%|▉         | 9/100 [09:49<1:37:31, 64.30s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 362      |
|    gen/rollout/ep_rew_mean          | 1.18e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 308      |
|    gen/time/fps                     | 2179     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 606208   |
|    gen/train/explained_variance     | 0.953    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00608  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 36       |
|    gen/train/policy_objective       | 0.018    |
|    gen/train/std                    | 0.723    |
|    gen/train/value_loss             | 46.4     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  10%|█         | 10/100 [10:49<1:34:40, 63.12s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 408      |
|    gen/rollout/ep_rew_mean          | 1.35e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 418      |
|    gen/time/fps                     | 2115     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 671744   |
|    gen/train/explained_variance     | 0.975    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00686  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 40       |
|    gen/train/policy_objective       | 0.0231   |
|    gen/train/std                    | 0.671    |
|    gen/train/value_loss             | 27.2     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  11%|█         | 11/100 [11:50<1:32:28, 62.35s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 535      |
|    gen/rollout/ep_rew_mean          | 1.78e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 366      |
|    gen/time/fps                     | 2023     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 737280   |
|    gen/train/explained_variance     | 0.981    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00691  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 44       |
|    gen/train/policy_objective       | 0.019    |
|    gen/train/std                    | 0.615    |
|    gen/train/value_loss             | 25.3     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  12%|█▏        | 12/100 [12:49<1:30:06, 61.43s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 758      |
|    gen/rollout/ep_rew_mean          | 2.5e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 387      |
|    gen/time/fps                     | 1807     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 802816   |
|    gen/train/explained_variance     | 0.977    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00797  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 48       |
|    gen/train/policy_objective       | 0.0209   |
|    gen/train/std                    | 0.57     |
|    gen/train/value_loss             | 17.5     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  13%|█▎        | 13/100 [13:50<1:28:59, 61.37s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 795      |
|    gen/rollout/ep_rew_mean          | 2.69e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 577      |
|    gen/time/fps                     | 1767     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 868352   |
|    gen/train/explained_variance     | 0.98     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00808  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 52       |
|    gen/train/policy_objective       | 0.0212   |
|    gen/train/std                    | 0.523    |
|    gen/train/value_loss             | 11.5     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  14%|█▍        | 14/100 [14:54<1:28:50, 61.99s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 724      |
|    gen/rollout/ep_rew_mean          | 2.5e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 429      |
|    gen/time/fps                     | 2051     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 933888   |
|    gen/train/explained_variance     | 0.983    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00803  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 56       |
|    gen/train/policy_objective       | 0.0246   |
|    gen/train/std                    | 0.493    |
|    gen/train/value_loss             | 12.6     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  15%|█▌        | 15/100 [15:54<1:27:09, 61.53s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 779      |
|    gen/rollout/ep_rew_mean          | 2.71e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 457      |
|    gen/time/fps                     | 2097     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 999424   |
|    gen/train/explained_variance     | 0.985    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00861  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 60       |
|    gen/train/policy_objective       | 0.0263   |
|    gen/train/std                    | 0.464    |
|    gen/train/value_loss             | 8.26     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  16%|█▌        | 16/100 [16:53<1:25:04, 60.77s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 735      |
|    gen/rollout/ep_rew_mean          | 2.58e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 531      |
|    gen/time/fps                     | 2151     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1064960  |
|    gen/train/explained_variance     | 0.987    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00801  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 64       |
|    gen/train/policy_objective       | 0.0243   |
|    gen/train/std                    | 0.449    |
|    gen/train/value_loss             | 6.74     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  17%|█▋        | 17/100 [17:52<1:23:15, 60.18s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 750      |
|    gen/rollout/ep_rew_mean          | 2.64e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 469      |
|    gen/time/fps                     | 2252     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1130496  |
|    gen/train/explained_variance     | 0.984    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00897  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 68       |
|    gen/train/policy_objective       | 0.0291   |
|    gen/train/std                    | 0.427    |
|    gen/train/value_loss             | 3.46     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  18%|█▊        | 18/100 [18:56<1:23:49, 61.33s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 781      |
|    gen/rollout/ep_rew_mean          | 2.76e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 451      |
|    gen/time/fps                     | 2011     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 1196032  |
|    gen/train/explained_variance     | 0.988    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00916  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 72       |
|    gen/train/policy_objective       | 0.0235   |
|    gen/train/std                    | 0.411    |
|    gen/train/value_loss             | 5.21     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  19%|█▉        | 19/100 [19:56<1:22:12, 60.90s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 781      |
|    gen/rollout/ep_rew_mean          | 2.78e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 626      |
|    gen/time/fps                     | 1822     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 1261568  |
|    gen/train/explained_variance     | 0.984    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00908  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 76       |
|    gen/train/policy_objective       | 0.0276   |
|    gen/train/std                    | 0.392    |
|    gen/train/value_loss             | 4.53     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  20%|██        | 20/100 [20:58<1:21:37, 61.22s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 741      |
|    gen/rollout/ep_rew_mean          | 2.64e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 458      |
|    gen/time/fps                     | 1805     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 1327104  |
|    gen/train/explained_variance     | 0.984    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0084   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 80       |
|    gen/train/policy_objective       | 0.0283   |
|    gen/train/std                    | 0.381    |
|    gen/train/value_loss             | 6.12     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  21%|██        | 21/100 [22:02<1:21:41, 62.05s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 771      |
|    gen/rollout/ep_rew_mean          | 2.75e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 437      |
|    gen/time/fps                     | 2208     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1392640  |
|    gen/train/explained_variance     | 0.963    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00836  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 84       |
|    gen/train/policy_objective       | 0.0566   |
|    gen/train/std                    | 0.385    |
|    gen/train/value_loss             | 4.3      |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  22%|██▏       | 22/100 [23:05<1:21:09, 62.43s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 494      |
|    gen/rollout/ep_rew_mean          | 1.79e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 522      |
|    gen/time/fps                     | 1778     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 1458176  |
|    gen/train/explained_variance     | 0.969    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00867  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 88       |
|    gen/train/policy_objective       | 0.0307   |
|    gen/train/std                    | 0.369    |
|    gen/train/value_loss             | 14.8     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  23%|██▎       | 23/100 [24:09<1:20:54, 63.04s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 503      |
|    gen/rollout/ep_rew_mean          | 1.85e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 310      |
|    gen/time/fps                     | 2179     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1523712  |
|    gen/train/explained_variance     | 0.976    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00921  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 92       |
|    gen/train/policy_objective       | 0.0267   |
|    gen/train/std                    | 0.347    |
|    gen/train/value_loss             | 9.07     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  24%|██▍       | 24/100 [25:13<1:19:53, 63.07s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 535      |
|    gen/rollout/ep_rew_mean          | 1.98e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 371      |
|    gen/time/fps                     | 2142     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1589248  |
|    gen/train/explained_variance     | 0.979    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00929  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 96       |
|    gen/train/policy_objective       | 0.0317   |
|    gen/train/std                    | 0.338    |
|    gen/train/value_loss             | 5.83     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  25%|██▌       | 25/100 [26:15<1:18:45, 63.01s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 435      |
|    gen/rollout/ep_rew_mean          | 1.61e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 441      |
|    gen/time/fps                     | 1429     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 11       |
|    gen/time/total_timesteps         | 1654784  |
|    gen/train/explained_variance     | 0.982    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00933  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 100      |
|    gen/train/policy_objective       | 0.029    |
|    gen/train/std                    | 0.334    |
|    gen/train/value_loss             | 5.17     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  26%|██▌       | 26/100 [27:23<1:19:19, 64.32s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 404      |
|    gen/rollout/ep_rew_mean          | 1.5e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 261      |
|    gen/time/fps                     | 2349     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 6        |
|    gen/time/total_timesteps         | 1720320  |
|    gen/train/explained_variance     | 0.988    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.009    |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 104      |
|    gen/train/policy_objective       | 0.0287   |
|    gen/train/std                    | 0.33     |
|    gen/train/value_loss             | 5.57     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  27%|██▋       | 27/100 [28:25<1:17:27, 63.66s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 429      |
|    gen/rollout/ep_rew_mean          | 1.59e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 319      |
|    gen/time/fps                     | 2127     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1785856  |
|    gen/train/explained_variance     | 0.987    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00903  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 108      |
|    gen/train/policy_objective       | 0.027    |
|    gen/train/std                    | 0.315    |
|    gen/train/value_loss             | 5.17     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  28%|██▊       | 28/100 [29:31<1:17:19, 64.44s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 338      |
|    gen/rollout/ep_rew_mean          | 1.23e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 346      |
|    gen/time/fps                     | 1981     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 1851392  |
|    gen/train/explained_variance     | 0.979    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0092   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 112      |
|    gen/train/policy_objective       | 0.0268   |
|    gen/train/std                    | 0.307    |
|    gen/train/value_loss             | 11       |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  29%|██▉       | 29/100 [30:34<1:15:32, 63.84s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 422      |
|    gen/rollout/ep_rew_mean          | 1.56e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 253      |
|    gen/time/fps                     | 2252     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 1916928  |
|    gen/train/explained_variance     | 0.986    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00925  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 116      |
|    gen/train/policy_objective       | 0.0295   |
|    gen/train/std                    | 0.299    |
|    gen/train/value_loss             | 5.82     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  30%|███       | 30/100 [31:41<1:15:44, 64.91s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 492      |
|    gen/rollout/ep_rew_mean          | 1.8e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 349      |
|    gen/time/fps                     | 1689     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 1982464  |
|    gen/train/explained_variance     | 0.977    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00899  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 120      |
|    gen/train/policy_objective       | 0.0307   |
|    gen/train/std                    | 0.298    |
|    gen/train/value_loss             | 6.26     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  31%|███       | 31/100 [32:52<1:16:36, 66.61s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 541      |
|    gen/rollout/ep_rew_mean          | 1.95e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 266      |
|    gen/time/fps                     | 1366     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 11       |
|    gen/time/total_timesteps         | 2048000  |
|    gen/train/explained_variance     | 0.979    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00935  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 124      |
|    gen/train/policy_objective       | 0.0339   |
|    gen/train/std                    | 0.293    |
|    gen/train/value_loss             | 7.36     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  32%|███▏      | 32/100 [33:59<1:15:44, 66.82s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 780      |
|    gen/rollout/ep_rew_mean          | 2.77e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 422      |
|    gen/time/fps                     | 2365     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 6        |
|    gen/time/total_timesteps         | 2113536  |
|    gen/train/explained_variance     | 0.983    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00897  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 128      |
|    gen/train/policy_objective       | 0.0404   |
|    gen/train/std                    | 0.282    |
|    gen/train/value_loss             | 1.58     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  33%|███▎      | 33/100 [35:01<1:13:03, 65.42s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 721      |
|    gen/rollout/ep_rew_mean          | 2.58e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 436      |
|    gen/time/fps                     | 1935     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 2179072  |
|    gen/train/explained_variance     | 0.966    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00953  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 132      |
|    gen/train/policy_objective       | 0.0324   |
|    gen/train/std                    | 0.274    |
|    gen/train/value_loss             | 8.55     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  34%|███▍      | 34/100 [36:09<1:12:53, 66.26s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 756      |
|    gen/rollout/ep_rew_mean          | 2.68e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 449      |
|    gen/time/fps                     | 2107     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2244608  |
|    gen/train/explained_variance     | 0.966    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00928  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 136      |
|    gen/train/policy_objective       | 0.0333   |
|    gen/train/std                    | 0.271    |
|    gen/train/value_loss             | 2.54     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  35%|███▌      | 35/100 [37:12<1:10:39, 65.22s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 379      |
|    gen/rollout/ep_rew_mean          | 1.33e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 352      |
|    gen/time/fps                     | 2091     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2310144  |
|    gen/train/explained_variance     | 0.935    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00898  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 140      |
|    gen/train/policy_objective       | 0.03     |
|    gen/train/std                    | 0.266    |
|    gen/train/value_loss             | 22.4     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  36%|███▌      | 36/100 [38:15<1:08:57, 64.65s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 398      |
|    gen/rollout/ep_rew_mean          | 1.42e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 202      |
|    gen/time/fps                     | 1926     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 2375680  |
|    gen/train/explained_variance     | 0.981    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00957  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 144      |
|    gen/train/policy_objective       | 0.0285   |
|    gen/train/std                    | 0.263    |
|    gen/train/value_loss             | 9.12     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  37%|███▋      | 37/100 [39:19<1:07:31, 64.30s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 572      |
|    gen/rollout/ep_rew_mean          | 2.06e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 302      |
|    gen/time/fps                     | 2290     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2441216  |
|    gen/train/explained_variance     | 0.984    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00913  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 148      |
|    gen/train/policy_objective       | 0.0277   |
|    gen/train/std                    | 0.259    |
|    gen/train/value_loss             | 6.61     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  38%|███▊      | 38/100 [40:18<1:04:47, 62.71s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 813      |
|    gen/rollout/ep_rew_mean          | 2.92e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 440      |
|    gen/time/fps                     | 2213     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2506752  |
|    gen/train/explained_variance     | 0.987    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00935  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 152      |
|    gen/train/policy_objective       | 0.0263   |
|    gen/train/std                    | 0.252    |
|    gen/train/value_loss             | 4.2      |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  39%|███▉      | 39/100 [41:18<1:02:53, 61.86s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 842      |
|    gen/rollout/ep_rew_mean          | 3.01e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 432      |
|    gen/time/fps                     | 2151     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2572288  |
|    gen/train/explained_variance     | 0.972    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00927  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 156      |
|    gen/train/policy_objective       | 0.0358   |
|    gen/train/std                    | 0.245    |
|    gen/train/value_loss             | 1.34     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  40%|████      | 40/100 [42:20<1:01:56, 61.94s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 924      |
|    gen/rollout/ep_rew_mean          | 3.3e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 504      |
|    gen/time/fps                     | 1842     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 2637824  |
|    gen/train/explained_variance     | 0.979    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00948  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 160      |
|    gen/train/policy_objective       | 0.0264   |
|    gen/train/std                    | 0.237    |
|    gen/train/value_loss             | 5.01     |
--------------------------------------------------
--------------------------------------------------
| raw/                         

round:  41%|████      | 41/100 [43:25<1:01:41, 62.74s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 934      |
|    gen/rollout/ep_rew_mean          | 3.33e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 438      |
|    gen/time/fps                     | 2118     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2703360  |
|    gen/train/explained_variance     | 0.949    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00946  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 164      |
|    gen/train/policy_objective       | 0.0264   |
|    gen/train/std                    | 0.23     |
|    gen/train/value_loss             | 2.54     |
--------------------------------------------------

round:  42%|████▏     | 42/100 [44:26<1:00:09, 62.23s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 903      |
|    gen/rollout/ep_rew_mean          | 3.19e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 434      |
|    gen/time/fps                     | 2183     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2768896  |
|    gen/train/explained_variance     | 0.948    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0095   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 168      |
|    gen/train/policy_objective       | 0.0342   |
|    gen/train/std                    | 0.229    |
|    gen/train/value_loss             | 2.54     |
--------------------------------------------------

round:  43%|████▎     | 43/100 [45:27<58:59, 62.09s/it]  

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 957      |
|    gen/rollout/ep_rew_mean          | 3.36e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 487      |
|    gen/time/fps                     | 1976     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 2834432  |
|    gen/train/explained_variance     | 0.988    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00897  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 172      |
|    gen/train/policy_objective       | 0.0465   |
|    gen/train/std                    | 0.223    |
|    gen/train/value_loss             | 1.3      |
--------------------------------------------------

round:  44%|████▍     | 44/100 [46:29<57:55, 62.05s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 967      |
|    gen/rollout/ep_rew_mean          | 3.34e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 458      |
|    gen/time/fps                     | 1833     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 2899968  |
|    gen/train/explained_variance     | 0.96     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00933  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 176      |
|    gen/train/policy_objective       | 0.0248   |
|    gen/train/std                    | 0.218    |
|    gen/train/value_loss             | 1.79     |
--------------------------------------------------

round:  45%|████▌     | 45/100 [47:33<57:17, 62.51s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 889      |
|    gen/rollout/ep_rew_mean          | 3.14e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 441      |
|    gen/time/fps                     | 2316     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 2965504  |
|    gen/train/explained_variance     | 0.95     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00909  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 180      |
|    gen/train/policy_objective       | 0.0276   |
|    gen/train/std                    | 0.214    |
|    gen/train/value_loss             | 7.45     |
--------------------------------------------------

round:  46%|████▌     | 46/100 [48:34<55:55, 62.13s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 850      |
|    gen/rollout/ep_rew_mean          | 3.05e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 428      |
|    gen/time/fps                     | 2130     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3031040  |
|    gen/train/explained_variance     | 0.963    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0067   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 184      |
|    gen/train/policy_objective       | 0.0228   |
|    gen/train/std                    | 0.213    |
|    gen/train/value_loss             | 6.34     |
--------------------------------------------------

round:  47%|████▋     | 47/100 [49:38<55:14, 62.54s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 808      |
|    gen/rollout/ep_rew_mean          | 2.91e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 425      |
|    gen/time/fps                     | 1892     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 3096576  |
|    gen/train/explained_variance     | 0.981    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00927  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 188      |
|    gen/train/policy_objective       | 0.0271   |
|    gen/train/std                    | 0.209    |
|    gen/train/value_loss             | 5.3      |
--------------------------------------------------

round:  48%|████▊     | 48/100 [50:41<54:31, 62.92s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 809      |
|    gen/rollout/ep_rew_mean          | 2.9e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 398      |
|    gen/time/fps                     | 2344     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 6        |
|    gen/time/total_timesteps         | 3162112  |
|    gen/train/explained_variance     | 0.967    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00946  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 192      |
|    gen/train/policy_objective       | 0.0279   |
|    gen/train/std                    | 0.207    |
|    gen/train/value_loss             | 8.25     |
--------------------------------------------------

round:  49%|████▉     | 49/100 [51:45<53:34, 63.03s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 860      |
|    gen/rollout/ep_rew_mean          | 3.06e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 410      |
|    gen/time/fps                     | 1778     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 3227648  |
|    gen/train/explained_variance     | 0.91     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00932  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 196      |
|    gen/train/policy_objective       | 0.0274   |
|    gen/train/std                    | 0.206    |
|    gen/train/value_loss             | 8.79     |
--------------------------------------------------

round:  50%|█████     | 50/100 [52:50<53:06, 63.74s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 720      |
|    gen/rollout/ep_rew_mean          | 2.52e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 395      |
|    gen/time/fps                     | 2131     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3293184  |
|    gen/train/explained_variance     | 0.777    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00961  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 200      |
|    gen/train/policy_objective       | 0.0276   |
|    gen/train/std                    | 0.205    |
|    gen/train/value_loss             | 54       |
--------------------------------------------------

round:  51%|█████     | 51/100 [53:53<51:57, 63.63s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 757      |
|    gen/rollout/ep_rew_mean          | 2.64e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 453      |
|    gen/time/fps                     | 1806     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 3358720  |
|    gen/train/explained_variance     | 0.843    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0095   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 204      |
|    gen/train/policy_objective       | 0.0241   |
|    gen/train/std                    | 0.203    |
|    gen/train/value_loss             | 44.2     |
--------------------------------------------------

round:  52%|█████▏    | 52/100 [54:58<51:01, 63.78s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 871      |
|    gen/rollout/ep_rew_mean          | 3.01e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 511      |
|    gen/time/fps                     | 1857     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 3424256  |
|    gen/train/explained_variance     | 0.74     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00944  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 208      |
|    gen/train/policy_objective       | 0.0259   |
|    gen/train/std                    | 0.201    |
|    gen/train/value_loss             | 25.9     |
--------------------------------------------------

round:  53%|█████▎    | 53/100 [56:02<50:03, 63.90s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 975      |
|    gen/rollout/ep_rew_mean          | 3.37e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 509      |
|    gen/time/fps                     | 2182     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3489792  |
|    gen/train/explained_variance     | 0.638    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00971  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 212      |
|    gen/train/policy_objective       | 0.0254   |
|    gen/train/std                    | 0.198    |
|    gen/train/value_loss             | 4.21     |
--------------------------------------------------

round:  54%|█████▍    | 54/100 [57:07<49:16, 64.26s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 961      |
|    gen/rollout/ep_rew_mean          | 3.35e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 528      |
|    gen/time/fps                     | 1785     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 3555328  |
|    gen/train/explained_variance     | 0.959    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00963  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 216      |
|    gen/train/policy_objective       | 0.0281   |
|    gen/train/std                    | 0.196    |
|    gen/train/value_loss             | 5.66     |
--------------------------------------------------

round:  55%|█████▌    | 55/100 [58:11<48:02, 64.06s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 958      |
|    gen/rollout/ep_rew_mean          | 3.36e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 528      |
|    gen/time/fps                     | 2284     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3620864  |
|    gen/train/explained_variance     | 0.943    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00986  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 220      |
|    gen/train/policy_objective       | 0.0181   |
|    gen/train/std                    | 0.193    |
|    gen/train/value_loss             | 2.92     |
--------------------------------------------------

round:  56%|█████▌    | 56/100 [59:13<46:32, 63.46s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 970      |
|    gen/rollout/ep_rew_mean          | 3.42e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 536      |
|    gen/time/fps                     | 1516     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 10       |
|    gen/time/total_timesteps         | 3686400  |
|    gen/train/explained_variance     | 0.952    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00941  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 224      |
|    gen/train/policy_objective       | 0.0271   |
|    gen/train/std                    | 0.191    |
|    gen/train/value_loss             | 5.88     |
--------------------------------------------------

round:  57%|█████▋    | 57/100 [1:00:22<46:44, 65.22s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 960      |
|    gen/rollout/ep_rew_mean          | 3.42e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 558      |
|    gen/time/fps                     | 2299     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3751936  |
|    gen/train/explained_variance     | 0.95     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00995  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 228      |
|    gen/train/policy_objective       | 0.0296   |
|    gen/train/std                    | 0.187    |
|    gen/train/value_loss             | 5.48     |
--------------------------------------------------

round:  58%|█████▊    | 58/100 [1:01:21<44:20, 63.35s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 939      |
|    gen/rollout/ep_rew_mean          | 3.37e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 568      |
|    gen/time/fps                     | 2301     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 3817472  |
|    gen/train/explained_variance     | 0.961    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00943  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 232      |
|    gen/train/policy_objective       | 0.0383   |
|    gen/train/std                    | 0.184    |
|    gen/train/value_loss             | 7.37     |
--------------------------------------------------

round:  59%|█████▉    | 59/100 [1:02:22<42:52, 62.73s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 852      |
|    gen/rollout/ep_rew_mean          | 3.08e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 514      |
|    gen/time/fps                     | 1939     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 3883008  |
|    gen/train/explained_variance     | 0.972    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00968  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 236      |
|    gen/train/policy_objective       | 0.0275   |
|    gen/train/std                    | 0.181    |
|    gen/train/value_loss             | 4.54     |
--------------------------------------------------

round:  60%|██████    | 60/100 [1:03:31<42:57, 64.43s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 844      |
|    gen/rollout/ep_rew_mean          | 3.05e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 479      |
|    gen/time/fps                     | 1692     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 3948544  |
|    gen/train/explained_variance     | 0.975    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00966  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 240      |
|    gen/train/policy_objective       | 0.0296   |
|    gen/train/std                    | 0.178    |
|    gen/train/value_loss             | 4.42     |
--------------------------------------------------

round:  61%|██████    | 61/100 [1:04:35<41:55, 64.50s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 938      |
|    gen/rollout/ep_rew_mean          | 3.39e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 622      |
|    gen/time/fps                     | 1837     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4014080  |
|    gen/train/explained_variance     | 0.955    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00969  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 244      |
|    gen/train/policy_objective       | 0.0359   |
|    gen/train/std                    | 0.176    |
|    gen/train/value_loss             | 7.92     |
--------------------------------------------------

round:  62%|██████▏   | 62/100 [1:05:39<40:43, 64.29s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 895      |
|    gen/rollout/ep_rew_mean          | 3.24e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 551      |
|    gen/time/fps                     | 2307     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4079616  |
|    gen/train/explained_variance     | 0.976    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00989  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 248      |
|    gen/train/policy_objective       | 0.0362   |
|    gen/train/std                    | 0.175    |
|    gen/train/value_loss             | 6.17     |
--------------------------------------------------

round:  63%|██████▎   | 63/100 [1:06:42<39:22, 63.86s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 940      |
|    gen/rollout/ep_rew_mean          | 3.39e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 481      |
|    gen/time/fps                     | 1993     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4145152  |
|    gen/train/explained_variance     | 0.968    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00964  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 252      |
|    gen/train/policy_objective       | 0.029    |
|    gen/train/std                    | 0.175    |
|    gen/train/value_loss             | 3.63     |
--------------------------------------------------

round:  64%|██████▍   | 64/100 [1:07:46<38:20, 63.91s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 918      |
|    gen/rollout/ep_rew_mean          | 3.31e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 497      |
|    gen/time/fps                     | 1871     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4210688  |
|    gen/train/explained_variance     | 0.964    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00975  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 256      |
|    gen/train/policy_objective       | 0.0329   |
|    gen/train/std                    | 0.173    |
|    gen/train/value_loss             | 1.63     |
--------------------------------------------------

round:  65%|██████▌   | 65/100 [1:08:49<37:07, 63.64s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 942      |
|    gen/rollout/ep_rew_mean          | 3.38e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 552      |
|    gen/time/fps                     | 2152     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4276224  |
|    gen/train/explained_variance     | 0.965    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00972  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 260      |
|    gen/train/policy_objective       | 0.0281   |
|    gen/train/std                    | 0.171    |
|    gen/train/value_loss             | 2.99     |
--------------------------------------------------

round:  66%|██████▌   | 66/100 [1:09:53<36:05, 63.69s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 933      |
|    gen/rollout/ep_rew_mean          | 3.37e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 536      |
|    gen/time/fps                     | 1837     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4341760  |
|    gen/train/explained_variance     | 0.979    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00989  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 264      |
|    gen/train/policy_objective       | 0.03     |
|    gen/train/std                    | 0.17     |
|    gen/train/value_loss             | 6.45     |
--------------------------------------------------

round:  67%|██████▋   | 67/100 [1:10:58<35:16, 64.12s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 941      |
|    gen/rollout/ep_rew_mean          | 3.4e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 499      |
|    gen/time/fps                     | 2188     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4407296  |
|    gen/train/explained_variance     | 0.977    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00973  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 268      |
|    gen/train/policy_objective       | 0.0262   |
|    gen/train/std                    | 0.169    |
|    gen/train/value_loss             | 2.77     |
--------------------------------------------------

round:  68%|██████▊   | 68/100 [1:11:59<33:39, 63.10s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 960      |
|    gen/rollout/ep_rew_mean          | 3.45e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 564      |
|    gen/time/fps                     | 2302     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4472832  |
|    gen/train/explained_variance     | 0.931    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00972  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 272      |
|    gen/train/policy_objective       | 0.0287   |
|    gen/train/std                    | 0.166    |
|    gen/train/value_loss             | 3.6      |
--------------------------------------------------

round:  69%|██████▉   | 69/100 [1:13:03<32:45, 63.40s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 973      |
|    gen/rollout/ep_rew_mean          | 3.51e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 616      |
|    gen/time/fps                     | 1827     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4538368  |
|    gen/train/explained_variance     | 0.973    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00958  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 276      |
|    gen/train/policy_objective       | 0.0285   |
|    gen/train/std                    | 0.166    |
|    gen/train/value_loss             | 3.06     |
--------------------------------------------------

round:  70%|███████   | 70/100 [1:14:06<31:41, 63.38s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 940      |
|    gen/rollout/ep_rew_mean          | 3.4e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 600      |
|    gen/time/fps                     | 2036     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4603904  |
|    gen/train/explained_variance     | 0.968    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00975  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 280      |
|    gen/train/policy_objective       | 0.0261   |
|    gen/train/std                    | 0.163    |
|    gen/train/value_loss             | 4.3      |
--------------------------------------------------

round:  71%|███████   | 71/100 [1:15:10<30:45, 63.65s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 960      |
|    gen/rollout/ep_rew_mean          | 3.48e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 531      |
|    gen/time/fps                     | 1735     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 4669440  |
|    gen/train/explained_variance     | 0.952    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00988  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 284      |
|    gen/train/policy_objective       | 0.027    |
|    gen/train/std                    | 0.161    |
|    gen/train/value_loss             | 2.8      |
--------------------------------------------------

round:  72%|███████▏  | 72/100 [1:16:18<30:19, 64.99s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 988      |
|    gen/rollout/ep_rew_mean          | 3.57e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 523      |
|    gen/time/fps                     | 2208     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4734976  |
|    gen/train/explained_variance     | 0.9      |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00989  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 288      |
|    gen/train/policy_objective       | 0.0275   |
|    gen/train/std                    | 0.16     |
|    gen/train/value_loss             | 2.19     |
--------------------------------------------------

round:  73%|███████▎  | 73/100 [1:17:22<29:03, 64.59s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 981      |
|    gen/rollout/ep_rew_mean          | 3.54e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 591      |
|    gen/time/fps                     | 1797     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 4800512  |
|    gen/train/explained_variance     | 0.91     |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00638  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 292      |
|    gen/train/policy_objective       | 0.0191   |
|    gen/train/std                    | 0.159    |
|    gen/train/value_loss             | 2.11     |
--------------------------------------------------

round:  74%|███████▍  | 74/100 [1:18:26<27:55, 64.45s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 986      |
|    gen/rollout/ep_rew_mean          | 3.55e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 644      |
|    gen/time/fps                     | 2042     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 4866048  |
|    gen/train/explained_variance     | 0.982    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00987  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 296      |
|    gen/train/policy_objective       | 0.0292   |
|    gen/train/std                    | 0.158    |
|    gen/train/value_loss             | 1.57     |
--------------------------------------------------

round:  75%|███████▌  | 75/100 [1:19:30<26:44, 64.16s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 1e+03    |
|    gen/rollout/ep_rew_mean          | 3.61e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 760      |
|    gen/time/fps                     | 2129     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 4931584  |
|    gen/train/explained_variance     | 0.911    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00969  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 300      |
|    gen/train/policy_objective       | 0.028    |
|    gen/train/std                    | 0.156    |
|    gen/train/value_loss             | 2.48     |
--------------------------------------------------

round:  76%|███████▌  | 76/100 [1:20:32<25:26, 63.58s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 997      |
|    gen/rollout/ep_rew_mean          | 3.61e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 676      |
|    gen/time/fps                     | 1783     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 4997120  |
|    gen/train/explained_variance     | 0.898    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00998  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 304      |
|    gen/train/policy_objective       | 0.0279   |
|    gen/train/std                    | 0.155    |
|    gen/train/value_loss             | 2.7      |
--------------------------------------------------

round:  77%|███████▋  | 77/100 [1:21:36<24:26, 63.76s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 995      |
|    gen/rollout/ep_rew_mean          | 3.6e+03  |
|    gen/rollout/ep_rew_wrapped_mean  | 588      |
|    gen/time/fps                     | 1921     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 5062656  |
|    gen/train/explained_variance     | 0.876    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00956  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 308      |
|    gen/train/policy_objective       | 0.031    |
|    gen/train/std                    | 0.153    |
|    gen/train/value_loss             | 1.37     |
--------------------------------------------------

round:  78%|███████▊  | 78/100 [1:22:38<23:12, 63.28s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 1e+03    |
|    gen/rollout/ep_rew_mean          | 3.62e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 637      |
|    gen/time/fps                     | 2209     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 5128192  |
|    gen/train/explained_variance     | 0.992    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.0097   |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 312      |
|    gen/train/policy_objective       | 0.0161   |
|    gen/train/std                    | 0.151    |
|    gen/train/value_loss             | 1.28     |
--------------------------------------------------

round:  79%|███████▉  | 79/100 [1:23:43<22:18, 63.75s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 996      |
|    gen/rollout/ep_rew_mean          | 3.61e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 721      |
|    gen/time/fps                     | 1854     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 5193728  |
|    gen/train/explained_variance     | 0.994    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00968  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 316      |
|    gen/train/policy_objective       | 0.0436   |
|    gen/train/std                    | 0.149    |
|    gen/train/value_loss             | 1.92     |
--------------------------------------------------

round:  80%|████████  | 80/100 [1:24:53<21:52, 65.62s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 996      |
|    gen/rollout/ep_rew_mean          | 3.62e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 723      |
|    gen/time/fps                     | 1776     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 5259264  |
|    gen/train/explained_variance     | 0.876    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00968  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 320      |
|    gen/train/policy_objective       | 0.0239   |
|    gen/train/std                    | 0.148    |
|    gen/train/value_loss             | 1.37     |
--------------------------------------------------

round:  81%|████████  | 81/100 [1:26:00<20:52, 65.92s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 1e+03    |
|    gen/rollout/ep_rew_mean          | 3.63e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 680      |
|    gen/time/fps                     | 2320     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 7        |
|    gen/time/total_timesteps         | 5324800  |
|    gen/train/explained_variance     | 0.935    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00981  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 324      |
|    gen/train/policy_objective       | 0.0249   |
|    gen/train/std                    | 0.147    |
|    gen/train/value_loss             | 2.7      |
--------------------------------------------------

round:  82%|████████▏ | 82/100 [1:27:02<19:26, 64.82s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 999      |
|    gen/rollout/ep_rew_mean          | 3.62e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 720      |
|    gen/time/fps                     | 1995     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 8        |
|    gen/time/total_timesteps         | 5390336  |
|    gen/train/explained_variance     | 0.966    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00657  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 328      |
|    gen/train/policy_objective       | 0.0249   |
|    gen/train/std                    | 0.145    |
|    gen/train/value_loss             | 2.34     |
--------------------------------------------------

round:  83%|████████▎ | 83/100 [1:28:07<18:24, 64.99s/it]

--------------------------------------------------
| raw/                                |          |
|    gen/rollout/ep_len_mean          | 999      |
|    gen/rollout/ep_rew_mean          | 3.62e+03 |
|    gen/rollout/ep_rew_wrapped_mean  | 684      |
|    gen/time/fps                     | 1795     |
|    gen/time/iterations              | 1        |
|    gen/time/time_elapsed            | 9        |
|    gen/time/total_timesteps         | 5455872  |
|    gen/train/explained_variance     | 0.969    |
|    gen/train/is_line_search_success | 1        |
|    gen/train/kl_divergence_loss     | 0.00974  |
|    gen/train/learning_rate          | 0.001    |
|    gen/train/n_updates              | 332      |
|    gen/train/policy_objective       | 0.0346   |
|    gen/train/std                    | 0.145    |
|    gen/train/value_loss             | 1.37     |
--------------------------------------------------

## 5. Plot curves to visualize the comparison

This cell can be run on its own, provided you first re-upload the .csv file in which the results dataframe was stored.
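Before running the cell on a re-uploaded file, it can help to confirm that the file contains the columns the plot reads. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the toy data are hypothetical, and only the three column names are taken from the plotting cell.

```python
import io
import pandas as pd

# Columns that the plotting cell reads from the .csv file
REQUIRED_COLUMNS = [
    "n_episodes",
    "Rescaled Mean Reward BC",
    "Rescaled Mean Reward GAIL",
]

def missing_columns(dataframe):
    """Return the required columns that are absent from the dataframe (hypothetical helper)."""
    return [c for c in REQUIRED_COLUMNS if c not in dataframe.columns]

# Toy stand-in for an uploaded results file (values are made up for illustration)
toy_csv = io.StringIO(
    ",n_episodes,Rescaled Mean Reward BC,Rescaled Mean Reward GAIL\n"
    "0,4,0.10,0.90\n"
    "1,4,0.20,0.80\n"
)
toy_dataframe = pd.read_csv(toy_csv, index_col=0)
print(missing_columns(toy_dataframe))
```

An empty list means the file has everything the plot needs; any name in the list points to a column that was lost or renamed.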

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# If the results were re-uploaded (no model trained in this session), set here the
# environment name used for the .csv file name and for the plot title
if not globals().get("model_trained_now", False):
    environment_string = "Ant-v2"

# Loading the dataframe which contains the results of the relative environment
dataframe = pd.read_csv("dataframe_results_" + environment_string + ".csv", index_col=0)

# Group the results and compute the mean and standard deviation of the rescaled results
dataframe_grouped = dataframe.groupby("n_episodes").agg(
                    mean_rescaled_bc = ("Rescaled Mean Reward BC", "mean"),
                    std_rescaled_bc = ("Rescaled Mean Reward BC", "std"),
                    mean_rescaled_gail = ("Rescaled Mean Reward GAIL", "mean"),
                    std_rescaled_gail = ("Rescaled Mean Reward GAIL", "std"))

# PLOT
# ==============================================================================
plt.figure(figsize=(10,5))

# Random policy (rescaled performance = 0)
plt.axhline(0, linestyle="dashed", lw=2, color="tab:green", label="Random")

# Expert policy (rescaled performance = 1)
plt.axhline(1, linestyle="dashed", lw=2, color="tab:blue", label="Expert")

# Plot for Rescaled Mean Reward GAIL
plt.plot(dataframe_grouped.index, dataframe_grouped['mean_rescaled_gail'], label='GAIL', marker='s', color="#add8e6")
plt.fill_between(dataframe_grouped.index,
                 dataframe_grouped['mean_rescaled_gail'] - dataframe_grouped['std_rescaled_gail'],
                 dataframe_grouped['mean_rescaled_gail'] + dataframe_grouped['std_rescaled_gail'],
                 alpha=0.2,
                 color="#add8e6")

# Plot for Rescaled Mean Reward BC
plt.plot(dataframe_grouped.index, dataframe_grouped['mean_rescaled_bc'], label='Behavioral Cloning', marker='o', color="tab:orange")
plt.fill_between(dataframe_grouped.index,
                 dataframe_grouped['mean_rescaled_bc'] - dataframe_grouped['std_rescaled_bc'],
                 dataframe_grouped['mean_rescaled_bc'] + dataframe_grouped['std_rescaled_bc'],
                 alpha=0.2,
                 color="tab:orange")


plt.xlabel('Number of trajectories in dataset')
plt.ylabel('Performance (scaled)')
plt.title(environment_string)
plt.legend(loc="lower center", ncol=2)
plt.ylim([-0.3, 1.1])
plt.xticks(dataframe_grouped.index.to_list())
plt.yticks([0, 0.2, 0.4, 0.6, 0.8, 1])
plt.grid(True)
plt.show()
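
The plot places the random policy at 0 and the expert at 1, which matches the scaling used for the curves in the GAIL paper. The sketch below shows one way such a rescaling can be computed. It is an assumption about how the "Rescaled Mean Reward" columns were obtained, not a copy of the notebook's code; the function name `rescale_reward` and the reward values are hypothetical.

```python
def rescale_reward(reward, random_reward, expert_reward):
    """Map a raw mean reward onto a scale where random = 0 and expert = 1."""
    return (reward - random_reward) / (expert_reward - random_reward)

# Made-up reward values, for illustration only
random_reward = -50.0
expert_reward = 3600.0

print(rescale_reward(random_reward, random_reward, expert_reward))  # 0.0
print(rescale_reward(expert_reward, random_reward, expert_reward))  # 1.0
print(rescale_reward(1775.0, random_reward, expert_reward))         # 0.5
```

With this scaling, a value below 0 means the imitation policy performs worse than the random policy, which is why the plot's y-axis extends to -0.3.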