The MIT License

Copyright (c) 2016 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

# Using Evolution Strategies to train Roboschool Environments

## Introduction

This Jupyter Notebook is based on the [paper](https://arxiv.org/abs/1703.03864), [blog article](https://openai.com/blog/evolution-strategies/) and [implementation](https://github.com/openai/evolution-strategies-starter) of OpenAI on the topic of using an Evolution Strategy algorithm for a typical reinforcement learning task. 

My implementation condenses theirs by simplifying, refactoring, and organizing the code into this Jupyter notebook, which can be used to test the algorithm. You can tweak the hyperparameters, change the environment to be trained, or extend the implementation to support, for example, Atari environments.

I recommend reading the paper, or at least the article, before trying out the notebook. Depending on the environment, training can be very computationally intensive (for example, training the Humanoid), so for the harder environments I recommend a machine with a high number of cores/threads.

## Setup

Before starting any computation we need to configure the program and define some methods and objects we will use later on.


### Configuration & Result Classes

Using a `namedtuple` allows us to quickly create a class with named attributes, which is ideal for defining a Config and a Result class.
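
As a small illustration of how a `namedtuple` behaves, here is a minimal sketch. The `Point` type is hypothetical and only serves as an example. It shows that fields can be read by name or by position, that instances are immutable, and that `_replace` returns a modified copy.

```python
from collections import namedtuple

# Hypothetical two-field type, used only for illustration.
Point = namedtuple('Point', ['x', 'y'])

p = Point(x=1, y=2)

# Fields can be read by name or by position.
first_by_name = p.x
second_by_index = p[1]

# namedtuples are immutable: assigning to a field raises AttributeError.
try:
    p.x = 10
    mutated = True
except AttributeError:
    mutated = False

# _replace returns a modified copy and leaves the original untouched.
q = p._replace(x=10)
```

This immutability is why a value such as `num_workers` has to be passed when the `Config` is constructed (or changed through `_replace`) rather than assigned afterwards.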

The Config class defines our general configuration of the program. The following table explains what each attribute does:


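| Attribute             | Explanation   |
| :---------------------|:--------------|
| `env_id`              | ID of the Gym/Roboschool environment to train, e.g. `RoboschoolAnt-v1` |
| `population_size`     | Number of noise samples per generation; each one is evaluated with a positive and a negative perturbation |
| `num_workers`         | Number of worker processes the population is split across |
| `learning_rate`       | Learning rate setting; in the current master loop it scales `theta` in the term passed to the optimizer |
| `noise_stdev`         | Standard deviation of the noise added to the parameters |
| `snapshot_freq`       | Save a snapshot of the weights every this many generations (`0` disables snapshots) |
| `return_proc_mode`    | How returns are processed before the update, e.g. `centered_rank` |
| `timesteps_per_batch` | Not used yet (commented out in the code below) |
| `calc_obstat_prob`    | Not used yet (commented out in the code below) |
| `eval_prob`           | Not used yet (commented out in the code below) |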

In [None]:
from collections import namedtuple

Config = namedtuple('Config', [
    'env_id',
    'population_size',
    'num_workers',
    'learning_rate',
    'noise_stdev',
    'snapshot_freq',
    'return_proc_mode'
    #'timesteps_per_batch',
    #'calc_obstat_prob',
    #'eval_prob'
])

Result = namedtuple('Result', [
    'worker_id',
    'noise_inds_n','returns_n2', 'signreturns_n2', 'lengths_n2',
    'eval_return', 'eval_length',
    'ob_sum', 'ob_sumsq', 'ob_count',
    'task_id'
])

## Config

In [None]:
import os

config = Config(
    env_id="RoboschoolAnt-v1",
    population_size=32,
    num_workers=os.cpu_count(),
    learning_rate=0.005,
    noise_stdev=0.02,
    snapshot_freq=1,
    return_proc_mode="centered_rank"
)

#config.num_workers = config.num_workers if config.num_workers else os.cpu_count()

## Environment

Each worker creates its own environment inside the worker method.
The master creates one as well.

In [None]:
import gym, roboschool # Roboschool import needed to register the environments within gym
env = gym.make(config.env_id)

## Tensorflow Session

In [None]:
#import tensorflow as tf
#sess = tf.InteractiveSession()

## Policy setup

The policy setup currently stores the arguments as local variables and then creates a TensorFlow variable scope in which the neural network
architecture is built.

Currently omitted:
1. Observation normalization
2. Observation clipping
3. `_act` function
4. `set_all_vars`
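
For reference, the omitted observation normalization and clipping could look roughly like the following NumPy sketch. `RunningObStat` and `normalize_ob` are hypothetical names that are not part of this notebook. The `[-5.0, 5.0]` clipping range mirrors the commented-out TensorFlow line in `create_model`.

```python
import numpy as np

class RunningObStat:
    """Hypothetical helper: accumulates sums to estimate per-dimension mean and std."""

    def __init__(self, shape, eps=1e-2):
        # eps avoids a division by zero before any observation has been added.
        self.sum = np.zeros(shape, dtype=np.float64)
        self.sumsq = np.full(shape, eps, dtype=np.float64)
        self.count = eps

    def increment(self, obs):
        # obs has shape (num_observations, *shape).
        obs = np.asarray(obs, dtype=np.float64)
        self.sum += obs.sum(axis=0)
        self.sumsq += np.square(obs).sum(axis=0)
        self.count += len(obs)

    @property
    def mean(self):
        return self.sum / self.count

    @property
    def std(self):
        var = self.sumsq / self.count - np.square(self.mean)
        return np.sqrt(np.maximum(var, 1e-2))

def normalize_ob(ob, ob_mean, ob_std, clip=5.0):
    # Standardize the observation, then clip it to [-clip, clip].
    return np.clip((ob - ob_mean) / ob_std, -clip, clip)

stat = RunningObStat(shape=(3,))
stat.increment(np.random.RandomState(0).randn(1000, 3) * 2.0 + 1.0)
normalized = normalize_ob(np.array([1.0, 100.0, -100.0]), stat.mean, stat.std)
```

Extreme observations end up at the clipping boundaries, while observations close to the running mean map to values near zero.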

## Keras as Model

The original implementation used hand-written dense layers and TensorFlow operations. I use a Keras model and its
functional API to create the network. In testing, the outputs of the two versions differ by floating-point values in the 0.x range. Something to worry about?

In [None]:
import numpy as np

args = {
      "ac_bins": "continuous:",
      "ac_noise_std": 0.01,
      #"connection_type": "ff",
      "hidden_dims": [
        256,
        256
      ],
      "nonlin_type": "tanh"
}

ob_space= env.observation_space
ac_space = env.action_space
ac_bins = args["ac_bins"]
ac_noise_std = args["ac_noise_std"]
hidden_dims = args["hidden_dims"]
nonlin = args["nonlin_type"]

# TODO more nonlinear functions


## Clearing the Keras backend to support multiprocessing

In [None]:
def create_model(initial_weights=None, model_name="model", save_path=None):
    #tf.keras.backend.clear_session()
    import tensorflow as tf
    nonlin = tf.tanh

    with tf.variable_scope("RoboschoolPolicy/" + model_name):
        # Observation normalization
        #ob_mean = tf.get_variable(
        #    'ob_mean', ob_space.shape, tf.float32, tf.constant_initializer(np.nan), trainable=False)
        #ob_std = tf.get_variable(
        #    'ob_std', ob_space.shape, tf.float32, tf.constant_initializer(np.nan), trainable=False)
        #in_mean = tf.placeholder(tf.float32, ob_space.shape)
        #in_std = tf.placeholder(tf.float32, ob_space.shape)
        #self._set_ob_mean_std = U.function([in_mean, in_std], [], updates=[
            #tf.assign(ob_mean, in_mean),
            #tf.assign(ob_std, in_std),
        #])

        # Normalize observation space and clip to [-5.0, 5.0]
        #o = tf.clip_by_value((o - ob_mean) / ob_std, -5.0, 5.0)

        # Policy network

        input = x = tf.keras.Input(ob_space.shape, dtype=tf.float32)

        for hd in hidden_dims:
            x = tf.keras.layers.Dense(
                hd, activation=nonlin,
                kernel_initializer=tf.initializers.random_normal,
                bias_initializer=tf.initializers.zeros)(x)

        # Map to action
        adim = ac_space.shape[0]

        a = tf.keras.layers.Dense(
        adim,
        kernel_initializer=tf.initializers.random_normal,
        bias_initializer=tf.initializers.zeros)(x)
        model = tf.keras.Model(inputs=input, outputs=a, name=model_name)

        # Initializer for the newly created weights. TODO possible replacement tf.keras.initializers.RandomNormal
        # out = np.random.randn(*adim).astype(np.float32)
        # out *=  0.01 / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        # initializer= tf.constant(out)


    if initial_weights is not None:
        set_from_flat(model, initial_weights)
        
    if save_path:
        model.save_weights(save_path)
    
    return model



In [None]:
def act(ob, model, random_stream=None):
    action = model.predict(ob)
    
    # TODO why randomstream? Better generalization?
    if random_stream is not None and ac_noise_std != 0:
        action += random_stream.randn(*action.shape) * ac_noise_std
    return action

def get_initial_weights():
    model = create_model()
    
    # Print out the model
    model.summary()
    
    return model.get_weights()

# Plot the Neural Network Architecture
#master_model.summary()

from multiprocessing import Pool

with Pool(1) as pool:
    theta = pool.apply(func=get_initial_weights)


# Plot the Neural Network Architecture
#master_model.summary()
#all_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope.name)


#trainable_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, master_scope.name)
#trainable_variables = model.get_weights()

num_params = sum(np.prod(v.shape) for v in theta)

#placeholders = [tf.placeholder(v.value().dtype, v.get_shape().as_list()) for v in self.all_variables]

# self.set_all_vars = U.function(
#     inputs=placeholders,
#     outputs=[],
#     updates=[tf.group(*[v.assign(p) for v, p in zip(self.all_variables, placeholders)])]
# )

In [None]:
#optimizer = {'sgd': SGD, 'adam': Adam}[exp['optimizer']['type']](policy, **exp['optimizer']['args'])

## Shared Noise
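
The idea behind a shared noise table can be shown on a much smaller scale. The sketch below is a simplification under assumptions: the table size of 1000 and `dim = 10` are made up. The point is that a perturbation is fully identified by its start index, so only the index needs to be passed between worker and master.

```python
import numpy as np

# Small stand-in for the shared table (1000 values instead of 250 million).
table = np.random.RandomState(123).randn(1000).astype(np.float32)
dim = 10  # hypothetical number of policy parameters

rs = np.random.RandomState(0)
# Sample a start index so that a slice of length dim still fits into the table.
idx = rs.randint(0, len(table) - dim + 1)

# Worker and master both recover the same perturbation from the index alone.
eps_worker = table[idx:idx + dim]
eps_master = table[idx:idx + dim]
```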

In [None]:
class SharedNoiseTable(object):
    def __init__(self):
        import ctypes, multiprocessing
        seed = 123
        count = 250000000  # 1 gigabyte of 32-bit numbers. Will actually sample 2 gigabytes below.
        #logger.info('Sampling {} random numbers with seed {}'.format(count, seed))

        # Instantiate an array of C float datatype with size count
        self._shared_mem = multiprocessing.Array(ctypes.c_float, count)

        # Convert to numpy array
        self.noise = np.ctypeslib.as_array(self._shared_mem.get_obj())
        assert self.noise.dtype == np.float32
        self.noise[:] = np.random.RandomState(seed).randn(count)  # 64-bit to 32-bit conversion here
        #logger.info('Sampled {} bytes'.format(self.noise.size * 4))

    def get(self, i, dim):
        return self.noise[i:i + dim]

    def sample_index(self, stream, dim):
        return stream.randint(0, len(self.noise) - dim + 1)

noise = SharedNoiseTable()

## Get flat
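
Flattening and restoring weights is a round trip: a list of arrays is concatenated into one vector and later cut back into the original shapes. A minimal sketch with two made-up weight arrays, independent of Keras:

```python
import numpy as np

# Hypothetical weights: a 2x3 kernel and a bias vector of length 3.
weights = [np.arange(6, dtype=np.float32).reshape(2, 3),
           np.zeros(3, dtype=np.float32)]

# Flatten: concatenate all arrays into a single 1-D vector.
flat = np.concatenate([w.reshape(-1) for w in weights])

# Restore: slice the flat vector and reshape each slice to its original shape.
restored, start = [], 0
for w in weights:
    size = int(np.prod(w.shape))
    restored.append(flat[start:start + size].reshape(w.shape))
    start += size
```

After the loop, `start` equals `flat.size`, which is the same consistency check `set_from_flat` performs with its `assert`.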

In [None]:
# def set_new_weights(model, theta, epsilon):
#     assert isinstance(model, tf.keras.Model)
#     assert isinstance(theta, list)
#         
#     for t in theta:
#         t += epsilon
#     
#     model.set_weights(theta)

def get_flat(theta):
     return np.concatenate([np.reshape(v, [-1]) for v in theta], 0)

def set_from_flat(model, theta):
    old_theta = model.get_weights()
    shapes = [v.shape for v in old_theta]
    total_size = theta.size
    
    start = 0
    reshapes = []
    
    for shape in shapes:
        size = int(np.prod(shape))
        reshapes.append(np.reshape(theta[start:start+size], shape))
        start += size
    
    assert start == total_size
    model.set_weights(reshapes)
    

## Set from flat

In [None]:
# def _create_set_from_flat_op(var_list, orig):
#     shapes = [v.shape for v in orig]
#     total_size = np.sum([v.size for v in orig])
#         
#     start=0
#     assigns = []
#     for (shape, v) in zip(shapes, var_list):
#         size = v.size
#         assigns.append(np.reshape(var_list[start:start+size], shape))
#         start += size
#         
#     assert start == total_size
#     
#     return assigns
#  
# 
# def set_from_flat(var_list):
#     old_weights = get_flat(model.get_weights())
# 
#     new_weights = old_weights + var_list
#     op_set_from_flat = _create_set_from_flat_op(new_weights)
#     model.set_weights(sess.run(op_set_from_flat))
#     
#     print("PID " + str(os.getpid()) + ": " + "Set weights from flat")

## Rollout TODO
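
The rollout loop can be tried without Roboschool by using a stub environment. `StubEnv` and `simple_rollout` below are hypothetical and only assume the classic Gym interface, where `reset()` returns an observation and `step()` returns `(ob, reward, done, info)`.

```python
import numpy as np

class StubEnv:
    """Hypothetical stand-in for a Gym environment with the classic reset/step API."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2, dtype=np.float32)

    def step(self, action):
        self.t += 1
        ob = np.full(2, self.t, dtype=np.float32)
        reward = float(-np.abs(action).sum())  # penalize large actions
        done = self.t >= self.horizon
        return ob, reward, done, {}

def simple_rollout(env, policy, timestep_limit=100):
    # Run one episode and collect the reward of every step.
    rews = []
    ob = env.reset()
    for _ in range(timestep_limit):
        ob, rew, done, _ = env.step(policy(ob))
        rews.append(rew)
        if done:
            break
    return np.array(rews, dtype=np.float32), len(rews)

rews, length = simple_rollout(StubEnv(horizon=5), policy=lambda ob: np.ones(1))
```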

In [None]:
def rollout(env, model, *, render=False, timestep_limit=None, save_obs=False, random_stream=None):
    """
    If random_stream is provided, the rollout will take noisy actions with noise drawn from that stream.
    Otherwise, no action noise will be added.
    """
    
    env_timestep_limit = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
    timestep_limit = env_timestep_limit if timestep_limit is None else min(timestep_limit, env_timestep_limit)
    rews = []
    t = 0
    if save_obs:
        obs = []
    ob = env.reset()
    for _ in range(timestep_limit):
        ac = act(ob[None], model, random_stream=random_stream)[0]
        if save_obs:
            obs.append(ob)
        ob, rew, done, _ = env.step(ac)
        rews.append(rew)
        t += 1
        if render:
            env.render()
        if done:
            break
    rews = np.array(rews, dtype=np.float32)
    if save_obs:
        return rews, t, np.array(obs)
    return rews, t


# Worker method


In [None]:
def run_worker(num_jobs, theta): #min_task_runtime=.2):

    print("PID " + str(os.getpid()) + ": " + "Started worker with " + str(num_jobs) + " jobs")
    #with lock:
    #    logger.info('run_worker: {}'.format(locals()))

    assert isinstance(noise, SharedNoiseTable)

    # Setup
    #config, env, sess, policy = setup(exp, single_threaded=True)
    env = gym.make(config.env_id)
    model = create_model(initial_weights=theta, model_name=str(os.getpid()))
    

    # Random stream used to sample noise indices and to add action noise during rollouts
    rs = np.random.RandomState()
    #worker_id = rs.randint(2 ** 31)

    #assert policy.needs_ob_stat == (config.calc_obstat_prob != 0)

    #while True:
    # Prevent accessing empty array (master did not emit task yet)
    #while not tasks:
    #    time.sleep(0.05)

    #task_data = tasks[-1]

    #task_tstart = time.time()

    #assert isinstance(task_data, Task)
    #task_id = task_data.task_id
    #assert isinstance(task_id, int)

    #if policy.needs_ob_stat:
    #    policy.set_ob_stat(task_data.ob_mean, task_data.ob_std)

    # # todo whats this condition doing?
    # if rs.rand() < config.eval_prob:
    #     # Evaluation: noiseless weights and noiseless actions
    #     policy.set_trainable_flat(task_data.params)
    # 
    #     eval_rews, eval_length = policy.rollout(env)  # eval rollouts don't obey task_data.timestep_limit
    #     eval_return = eval_rews.sum()
    # 
    #     with lock:
    #         logger.info('Eval result: task={} return={:.3f} length={}'.format(task_id, eval_return, eval_length))
    # 
    #     result_queue.put(Result(
    #         worker_id=worker_id,
    #         noise_inds_n=None,
    #         returns_n2=None,
    #         signreturns_n2=None,
    #         lengths_n2=None,
    #         eval_return=eval_return,
    #         eval_length=eval_length,
    #         ob_sum=None,
    #         ob_sumsq=None,
    #         ob_count=None,
    #         task_id=task_id
    #     ))

    # Rollouts with noise
    noise_inds, returns, signreturns, lengths = [], [], [], []
    #task_ob_stat = RunningStat(env.observation_space.shape, eps=0.)  # eps=0 because we're incrementing only
    
    #while not noise_inds or time.time() - task_tstart < min_task_runtime:
    
    for _ in range(num_jobs):

        # ------------- Noise sample -------------------------------
        noise_idx = noise.sample_index(rs, num_params)
        epsilon = config.noise_stdev * noise.get(noise_idx, num_params)

        # Evaluate the sampled noise positive
        set_from_flat(model, theta + epsilon)
        rews_pos, len_pos = rollout(env, model, random_stream=rs)

        # rews_pos, len_pos = rollout_and_update_ob_stat(
        #     policy, env, task_data.timestep_limit, rs, task_ob_stat, config.calc_obstat_prob)
        
        # Evaluate the sample noise negative
        set_from_flat(model, theta - epsilon)
        rews_neg, len_neg = rollout(env, model, random_stream=rs)

        # rews_neg, len_neg = rollout_and_update_ob_stat(
        #     policy, env, task_data.timestep_limit, rs, task_ob_stat, config.calc_obstat_prob)
        
    
        # Gather results
        noise_inds.append(noise_idx)
        returns.append([rews_pos.sum(), rews_neg.sum()])
        signreturns.append([np.sign(rews_pos).sum(), np.sign(rews_neg).sum()])
        lengths.append([len_pos, len_neg])
        
        
    # result_queue.put(Result(
    #     worker_id=worker_id,
    #     noise_inds_n=np.array(noise_inds),
    #     returns_n2=np.array(returns, dtype=np.float32),
    #     signreturns_n2=np.array(signreturns, dtype=np.float32),
    #     lengths_n2=np.array(lengths, dtype=np.int32),
    #     eval_return=None,
    #     eval_length=None,
    #     ob_sum=None if task_ob_stat.count == 0 else task_ob_stat.sum,
    #     ob_sumsq=None if task_ob_stat.count == 0 else task_ob_stat.sumsq,
    #     ob_count=task_ob_stat.count,
    #     task_id=task_id
    # ))
    print("PID " + str(os.getpid()) + ": " + "Returned result")
    result = Result(
        worker_id=None,
        noise_inds_n=np.array(noise_inds),
        returns_n2=np.array(returns, dtype=np.float32),
        signreturns_n2=np.array(signreturns, dtype=np.float32),
        lengths_n2=np.array(lengths, dtype=np.int32),
        eval_return=None,
        eval_length=None,
        ob_sum=None,
        ob_count=None,
        ob_sumsq=None,
        task_id = 0
    )
    
    return result

In [None]:
def itergroups(items, group_size):
    assert group_size >= 1
    group = []
    for x in items:
        group.append(x)
        if len(group) == group_size:
            yield tuple(group)
            del group[:]
    if group:
        yield tuple(group)
        
def batched_weighted_sum(weights, vecs, batch_size):
    total = 0.
    num_items_summed = 0
    for batch_weights, batch_vecs in zip(itergroups(weights, batch_size), itergroups(vecs, batch_size)):
        assert len(batch_weights) == len(batch_vecs) <= batch_size
        total += np.dot(np.asarray(batch_weights, dtype=np.float32), np.asarray(batch_vecs, dtype=np.float32))
        num_items_summed += len(batch_weights)
    return total, num_items_summed

import errno

def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

## Optimization: Fitness shaping with a rank transformation
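
A minimal sketch of a centered-rank transformation, assuming ranks are scaled to the range `[-0.5, 0.5]`. The function name `centered_ranks` and the example returns are made up. Replacing raw returns by their ranks keeps a single outlier (here `1000.0`) from dominating the update.

```python
import numpy as np

def centered_ranks(x):
    x = np.asarray(x)
    # Rank all entries jointly: the smallest gets 0, the largest gets x.size - 1.
    ranks = np.empty(x.size, dtype=int)
    ranks[x.ravel().argsort()] = np.arange(x.size)
    # Scale to [0, 1] and shift to [-0.5, 0.5].
    return (ranks.reshape(x.shape) / (x.size - 1) - 0.5).astype(np.float32)

# One row per noise sample: [return with +epsilon, return with -epsilon].
returns = np.array([[10.0, -3.0], [1000.0, 2.0]])
shaped = centered_ranks(returns)
```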

In [None]:
def compute_ranks(x):
    """
    Returns ranks in [0, len(x))
    Note: This is different from scipy.stats.rankdata, which returns ranks in [1, len(x)].
    """
    assert x.ndim == 1
    ranks = np.empty(len(x), dtype=int)
    ranks[x.argsort()] = np.arange(len(x))
    return ranks


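def compute_centered_ranks(x):
    y = compute_ranks(x.ravel()).reshape(x.shape).astype(np.float32)
    # Scale the ranks to [0, 1] and shift them to [-0.5, 0.5] so they are centered around zero
    y /= (x.size - 1)
    y -= .5
    return y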

## Optimization: Using Adam Optimizer

I define the optimizer manually, since with Keras you have to define a loss function, use a training set, and so on. A manual
implementation seems easier for now.
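
To see the Adam update rule in isolation, here is a minimal sketch that minimizes a toy quadratic. The function `adam_minimize`, the target vector, and the step size of `0.1` are assumptions chosen for illustration and are not the settings used for training.

```python
import numpy as np

def adam_minimize(grad_fn, theta, stepsize=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    m = np.zeros_like(theta)  # running mean of the gradients
    v = np.zeros_like(theta)  # running mean of the squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Step size with bias correction for the zero-initialized averages.
        a = stepsize * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        theta = theta - a * m / (np.sqrt(v) + epsilon)
    return theta

# Toy objective: f(theta) = sum((theta - target)^2), gradient 2 * (theta - target).
target = np.array([1.0, -2.0, 3.0])
theta_final = adam_minimize(lambda th: 2 * (th - target), np.zeros(3))
```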

In [None]:
class Optimizer(object):
    def __init__(self):
        self.dim = num_params
        self.t = 0

    def update(self, globalg):
        self.t += 1
        step = self._compute_step(globalg)
        ratio = np.linalg.norm(step) / np.linalg.norm(theta)
        theta_new = theta + step
        return theta_new, ratio

    def _compute_step(self, globalg):
        raise NotImplementedError

class Adam(Optimizer):
    def __init__(self, stepsize, beta1=0.9, beta2=0.999, epsilon=1e-08):
        Optimizer.__init__(self)
        self.stepsize = stepsize
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = np.zeros(self.dim, dtype=np.float32)
        self.v = np.zeros(self.dim, dtype=np.float32)

    def _compute_step(self, globalg):
        a = self.stepsize * np.sqrt(1 - self.beta2 ** self.t) / (1 - self.beta1 ** self.t)
        self.m = self.beta1 * self.m + (1 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1 - self.beta2) * (globalg * globalg)
        step = -a * self.m / (np.sqrt(self.v) + self.epsilon)
        return step

# Master
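
Before the full master loop, the core of the algorithm can be sketched on a toy problem. The following is a textbook-style antithetic (mirrored) ES gradient estimate, not the exact update used in this notebook. `es_gradient`, the quadratic `reward`, and all constants are assumptions for illustration.

```python
import numpy as np

def es_gradient(f, theta, rs, sigma=0.1, n_pairs=200):
    # Estimate the gradient of f from pairs of mirrored perturbations.
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rs.randn(theta.size)
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2 * n_pairs * sigma)

# Toy reward with its maximum at `target`.
target = np.array([0.5, -0.3, 0.8])
reward = lambda th: -np.sum((th - target) ** 2)

rs = np.random.RandomState(0)
theta_toy = np.zeros(3)
for _ in range(200):
    # Gradient ascent on the estimated gradient.
    theta_toy = theta_toy + 0.05 * es_gradient(reward, theta_toy, rs)
```

In the master loop the same roles appear: the returns of the positive and negative rollouts take the place of the two calls to `f`, and the noise slices from the shared table take the place of `eps`.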

In [None]:
import time

from multiprocessing import Pool

env = gym.make(config.env_id)
rs = np.random.RandomState()

optimizer = Adam(stepsize=0.01)

save_directory = "/tmp/es_master_{}/".format(os.getpid())

mkdir_p(save_directory)

# ob_stat = RunningStat(
#     env.observation_space.shape,
#     eps=1e-2  # eps to prevent dividing by zero at the beginning when computing mean/stdev
# )

tslimit, incr_tslimit_threshold, tslimit_incr_ratio = None, None, None
adaptive_tslimit = False


episodes_so_far = 0
timesteps_so_far = 0
tstart = time.time()

task_counter = 0

assert config.num_workers != 0

num_jobs_per_worker = [int(config.population_size / config.num_workers)] * config.num_workers

mod = config.population_size % config.num_workers
i = 0
while mod > 0:
    num_jobs_per_worker[i] += 1
    mod -= 1
    i += 1
    
assert len(num_jobs_per_worker) == config.num_workers
generation_counter = 0

theta = get_flat(theta)

while True:
    print("----------------------GENERATION: " + str(generation_counter) + "------------------------------------")
    
    step_tstart = time.time() 
            
    #assert theta.dtype == np.float32

    # Task counter is used to recognize false tasks from previous iterations later
    curr_task_id = task_counter
    task_counter += 1
    
    # Start workers
    
    workers = []
    results = []
    
    pool = Pool(processes=config.num_workers)
    
    print("PID " + str(os.getpid()) + ": " + "Waiting for results")
    for i in num_jobs_per_worker:
        result = pool.apply_async(func=run_worker, args=(i, theta))
        results.append(result)

    for i in range(len(results)):
        results[i] = results[i].get()

    pool.close()   
    pool.join()  

    # Pop off results for the current task
    curr_task_results, eval_rets, eval_lens, worker_ids = [], [], [], []
    num_results_skipped, num_episodes_popped, num_timesteps_popped, ob_count_this_batch = 0, 0, 0, 0
   #while num_episodes_popped < config.episodes_per_batch:
    for result in results:
        assert isinstance(result, Result)
        # task_id = result.task_id
        # assert isinstance(task_id, int)

        # assert (result.eval_return is None) == (result.eval_length is None)
        # worker_ids.append(result.worker_id)
        # 
        # if result.eval_length is not None:
        #     # This was an eval job
        #     episodes_so_far += 1
        #     timesteps_so_far += result.eval_length
        #     # Store the result only for current tasks
        #     if task_id == curr_task_id:
        #         eval_rets.append(result.eval_return)
        #         eval_lens.append(result.eval_length)
        # else:

        assert (result.noise_inds_n.ndim == 1 and
                result.returns_n2.shape == result.lengths_n2.shape == (len(result.noise_inds_n), 2))
        assert result.returns_n2.dtype == np.float32
        
        # Update counts
        result_num_eps = result.lengths_n2.size
        result_num_timesteps = result.lengths_n2.sum()
        episodes_so_far += result_num_eps
        timesteps_so_far += result_num_timesteps
        # Store results only for current tasks
        curr_task_results.append(result)
        num_episodes_popped += result_num_eps
        num_timesteps_popped += result_num_timesteps
        # Update ob stats
        # if policy.needs_ob_stat and result.ob_count > 0:
        #     ob_stat.increment(result.ob_sum, result.ob_sumsq, result.ob_count)
        #     ob_count_this_batch += result.ob_count


    # Compute skip fraction
    #frac_results_skipped = num_results_skipped / (num_results_skipped + len(curr_task_results))
    # if num_results_skipped > 0:
    #     logger.warning('Skipped {} out of date results ({:.2f}%)'.format(
    #         num_results_skipped, 100. * frac_results_skipped))
    
    print("Gathered results")

    # Assemble results
    noise_inds_n = np.concatenate([r.noise_inds_n for r in curr_task_results])
    returns_n2 = np.concatenate([r.returns_n2 for r in curr_task_results])
    lengths_n2 = np.concatenate([r.lengths_n2 for r in curr_task_results])
    assert noise_inds_n.shape[0] == returns_n2.shape[0] == lengths_n2.shape[0]
    
    # Process returns
    if config.return_proc_mode == 'centered_rank':
        proc_returns_n2 = compute_centered_ranks(returns_n2)
    else:
        proc_returns_n2 = returns_n2
    
    g, count = batched_weighted_sum(
        proc_returns_n2[:, 0] - proc_returns_n2[:, 1],
        (noise.get(idx, num_params) for idx in noise_inds_n),
        batch_size=500
    )
    
    g /= returns_n2.size
    
    #g /= config.noise_stdev
    #g *= config.learning_rate
    
    assert g.shape == (num_params,) and g.dtype == np.float32 and count == len(noise_inds_n)
    #update_ratio = optimizer.update(-g + config.l2coeff * theta)
    #update_ratio = optimizer.update(config.l2coeff * g)

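    # UPDATE
    # Note: config.learning_rate acts here as the coefficient of the L2 term on theta
    # (l2coeff in the commented-out line above); the Adam step size is set where the optimizer is created.
    theta, _ = optimizer.update(-g + config.learning_rate * theta)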
    
    
    
    #set_from_flat(master_model, theta + g)
    #set_from_flat(master_model, theta + g)
   
    #theta += g

    # Update ob stat (we're never running the policy in the master, but we might be snapshotting the policy)
    # if policy.needs_ob_stat:
    #     policy.set_ob_stat(ob_stat.mean, ob_stat.std)

    # Update number of steps to take
    # if adaptive_tslimit and (lengths_n2 == tslimit).mean() >= incr_tslimit_threshold:
    #     old_tslimit = tslimit
    #     tslimit = int(tslimit_incr_ratio * tslimit)
    #     logger.info('Increased timestep limit from {} to {}'.format(old_tslimit, tslimit))

    step_tend = time.time()
    # tlogger.record_tabular("EpRewMean", returns_n2.mean())
    # tlogger.record_tabular("EpRewStd", returns_n2.std())
    # tlogger.record_tabular("EpLenMean", lengths_n2.mean())
    # 
    # tlogger.record_tabular("EvalEpRewMean", np.nan if not eval_rets else np.mean(eval_rets))
    # tlogger.record_tabular("EvalEpRewStd", np.nan if not eval_rets else np.std(eval_rets))
    # tlogger.record_tabular("EvalEpLenMean", np.nan if not eval_rets else np.mean(eval_lens))
    # tlogger.record_tabular("EvalPopRank", np.nan if not eval_rets else (
    #     np.searchsorted(np.sort(returns_n2.ravel()), eval_rets).mean() / returns_n2.size))
    # tlogger.record_tabular("EvalEpCount", len(eval_rets))
    # 
    # tlogger.record_tabular("Norm", float(np.square(policy.get_trainable_flat()).sum()))
    # tlogger.record_tabular("GradNorm", float(np.square(g).sum()))
    # tlogger.record_tabular("UpdateRatio", float(update_ratio))
    # 
    # tlogger.record_tabular("EpisodesThisIter", lengths_n2.size)
    # tlogger.record_tabular("EpisodesSoFar", episodes_so_far)
    # tlogger.record_tabular("TimestepsThisIter", lengths_n2.sum())
    # tlogger.record_tabular("TimestepsSoFar", timesteps_so_far)
    # 
    # num_unique_workers = len(set(worker_ids))
    # tlogger.record_tabular("UniqueWorkers", num_unique_workers)
    # tlogger.record_tabular("UniqueWorkersFrac", num_unique_workers / len(worker_ids))
    # tlogger.record_tabular("ResultsSkippedFrac", frac_results_skipped)
    # tlogger.record_tabular("ObCount", ob_count_this_batch)
    # 
    # tlogger.record_tabular("TimeElapsedThisIter", step_tend - step_tstart)
    # tlogger.record_tabular("TimeElapsed", step_tend - tstart)
    # tlogger.dump_tabular()

    if config.snapshot_freq != 0 and generation_counter % config.snapshot_freq == 0:
        from multiprocessing import Process
        
        p = Process(target=create_model, args=(theta, 
                                               config.env_id + "_Generation_" + str(generation_counter), 
                                               save_directory + 'snapshot_{:05d}'.format(generation_counter) + ".h5"))
        p.start()
        p.join()
        
        print("Saved model in generation {}".format(generation_counter))
            
    generation_counter+= 1

# Visualize

In [None]:
trained_model = create_model()
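# Snapshots are saved with a zero-padded generation number, e.g. snapshot_00000.h5
trained_model.load_weights(save_directory + 'snapshot_{:05d}'.format(0) + ".h5")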


import gym, roboschool
from IPython import display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

env.reset()
img = plt.imshow(env.render(mode='rgb_array'))

def rollout_rend(env, model, *, render=False, timestep_limit=None, save_obs=False, random_stream=None):
    """
    If random_stream is provided, the rollout will take noisy actions with noise drawn from that stream.
    Otherwise, no action noise will be added.
    """
    
    env_timestep_limit = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
    timestep_limit = env_timestep_limit if timestep_limit is None else min(timestep_limit, env_timestep_limit)
    rews = []
    t = 0
    if save_obs:
        obs = []
    ob = env.reset()
    for _ in range(timestep_limit):
        ac = act(ob[None], model, random_stream=random_stream)[0]
        if save_obs:
            obs.append(ob)
        ob, rew, done, _ = env.step(ac)
        rews.append(rew)
        t += 1
        if render:
            img.set_data(env.render(mode='rgb_array'))
            display.display(plt.gcf())
            display.clear_output(wait=True)
        if done:
            break
    rews = np.array(rews, dtype=np.float32)
    if save_obs:
        return rews, t, np.array(obs)
    return rews, t

#rollout_rend(env, trained_model, render=True)