The MIT License

Copyright (c) 2016 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

# Using evolution strategies to train Roboschool Environments

## Introduction

This Jupyter Notebook is based on the [paper](https://arxiv.org/abs/1703.03864), [blog article](https://openai.com/blog/evolution-strategies/) and [implementation](https://github.com/openai/evolution-strategies-starter) of OpenAI on the topic of using an evolution strategy algorithm for a typical reinforcement learning task. 

This notebook condenses their implementation by simplifying, refactoring and organizing the code so that it can be used to test the algorithm. One can tweak the hyperparameters, change the environment to be trained, or even extend the implementation to support, for example, Atari environments.

I recommend reading the paper or at least the article before trying out the notebook. Also depending on the environment the training can be very computationally intense (for example training the Humanoid), so if you want to try out the harder ones I recommend using a highly parallelizable machine, i.e. a machine with a high number of cores/threads which can use multiple processes simultaneously.

## Algorithm overview

This section gives a brief overview of the algorithm. First, we need to define what this implementation does. Roboschool is a group of environments for the [OpenAI Gym](https://gym.openai.com/), a toolkit for testing the behavior of machine learning algorithms on _real world_ problems. In our case, we want to train different robotic environments using an evolutionary algorithm which belongs to the class of natural evolution strategies.

We therefore define a neural net with a configurable number of hidden layers, where the input dimension equals the dimension of the observation space of the environment and the dimension of the output layer equals the dimension of the action space of the environment. This neural net is called the policy; in this implementation it is also referred to as the model. We train the policy to output the best possible action sequence given an observation sequence.

Now, how do we train this policy? Training with an evolution strategy consists of a cycle which is repeated over and over. First, an initial weight vector is randomly generated. In our context this weight vector holds the weights of our policy. Then we perturb the vector with Gaussian noise. The number of perturbations is called the population size. What we now have is a population of weight vectors that each differ slightly from the one we started with. Each of these vectors is then evaluated by first updating our policy with the weights and then running the environment using the policy. When this is done for the whole population, we take a gradient ascent step. Since we are dealing with natural evolution strategies, the step uses the natural gradient, which is approximated with Monte Carlo estimates.

So let's say we have our initial weight vector $\theta$, a population size $n$, random perturbations $\epsilon_i \sim \mathcal{N}(0, I)$ with $1 \leq i \leq n$, learning rate $\alpha$, noise standard deviation $\sigma$ and a fitness function $F(\cdot)$. We then calculate the resulting weight vector like this:
$\theta_{t+1} = \theta_{t} + \alpha \frac{1}{n \sigma} \sum \limits_{i=1}^n F(\theta_{t} + \sigma \epsilon_i) \, \epsilon_i$


This gives us the weight vector for the next cycle which we will then, again, perturb a number of times (depending on the population size). A cycle in the context of evolutionary strategies is called a generation.
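To make one generation concrete, here is a minimal NumPy sketch of the basic update from the referenced paper, where each perturbation is scaled by $\sigma$ and weighted by its fitness. The names `es_step` and `toy_fitness` and the quadratic toy problem are assumptions chosen for illustration; they are not part of this notebook's training code, which additionally uses the optimizations described later.

```python
import numpy as np

def es_step(theta, fitness, rng, n=50, sigma=0.1, alpha=0.05):
    """One generation of the plain evolution-strategy update (no optimizations)."""
    # Draw n standard-normal perturbations, one row per population member.
    eps = rng.standard_normal((n, theta.size))
    # Evaluate the fitness of every perturbed weight vector.
    returns = np.array([fitness(theta + sigma * e) for e in eps])
    # Monte Carlo gradient estimate: fitness-weighted sum of the perturbations.
    grad = eps.T @ returns / (n * sigma)
    # Gradient ascent step with learning rate alpha.
    return theta + alpha * grad

# Toy fitness (an assumption for this example) with its maximum at `target`.
target = np.array([0.5, -0.3, 0.8])

def toy_fitness(w):
    return -np.sum((w - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3)
start_fitness = toy_fitness(theta)
for _ in range(200):  # 200 generations
    theta = es_step(theta, toy_fitness, rng)
end_fitness = toy_fitness(theta)
print(start_fitness, end_fitness)
```

On this toy problem the fitness should move from roughly -1 towards 0 as `theta` approaches `target`.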

One might now ask what this fitness function is in the context of robotic simulations. After initializing such an environment one can call the `step` function on the environment with an array in the shape of the action space (in our case this is the output of the policy). The environment then evaluates the provided action based on the current observation and other parameters in the environment and outputs a reward. This is repeated either for a fixed number of timesteps (some environments define a maximum number of timesteps) or until an action results in a state where the environment is `done`, for example when the robot in the `Humanoid` environment falls over and touches the ground. The rewards of all timesteps are summed up, and this sum is the fitness of the evaluated policy.
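The rollout described above can be sketched as follows. `ToyEnv` is a hypothetical stand-in that mimics the classic Gym interface (`reset()` returns an observation, `step()` returns observation, reward, done flag and info) so that the example runs without Roboschool; the `rollout` helper is likewise an illustrative assumption rather than the function used later in this notebook.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (classic reset/step API)."""
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2, dtype=np.float32)

    def step(self, action):
        self.t += 1
        ob = np.full(2, self.t, dtype=np.float32)
        # Reward is highest for small actions (an arbitrary choice for the demo).
        reward = 1.0 - float(np.sum(np.square(action)))
        done = self.t >= self.max_steps
        return ob, reward, done, {}

def rollout(env, policy, timestep_limit=1000):
    """Run one episode and return (summed reward, number of timesteps)."""
    ob = env.reset()
    total_reward, length = 0.0, 0
    for _ in range(timestep_limit):
        ob, reward, done, _ = env.step(policy(ob))
        total_reward += reward
        length += 1
        if done:  # the environment signalled the end of the episode
            break
    return total_reward, length

# A policy that always outputs the zero action.
fitness, length = rollout(ToyEnv(), policy=lambda ob: np.zeros(1))
print(fitness, length)
```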

## Setup

Before starting any computation we need to configure the training and define some methods and objects we will use later on.


### Imports

Note that TensorFlow does not get imported here. We will only import it inside of a function which runs in another process. The reason is that importing TensorFlow creates a session in the background, which would interfere with the models we run in subprocesses. When the package is imported only inside functions that run in subprocesses, every process has its own TensorFlow session and the sessions do not interfere with each other.

In [None]:
import csv
import errno
import json
import os
import time

from collections import namedtuple, OrderedDict

import ctypes
import multiprocessing
import numpy as np

import gym, roboschool

### Logging and save directory

For evaluating the trained data we need to define a directory where we want to store the trained policies, as well as the log file to record the results of every generation.

Depending on your disk space you may not want to save every model, but for an in-depth evaluation this is necessary. During training there will be so-called _evaluation runs_, which do not add noise but test the currently trained policy to give insight into the training. Since they are triggered with a configured probability, the number of evaluation runs will differ between generations. An additional Jupyter Notebook with the prefix **-visualization** can be used after training to load all saved weight files and evaluate them a given number of times.

If you want to change the location, set the variable `save_directory` to a directory where the user who runs this notebook has write permissions. If it does not exist, the program will create it. The default location is the `/tmp/es_xx` directory, where `xx` is the process ID of this notebook.

In [None]:
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

save_directory = "/tmp/es_{}/".format(os.getpid())
mkdir_p(save_directory)

### Configuration, Task and Result Classes

Using `namedtuple` from the `collections` package from python allows us to quickly create small classes with different attributes, which is ideal for quickly accessing different attributes during training as well as saving the configurations to a file.

Each attribute will be explained in a bit, where objects of these classes are created and their parameters are set.

In [None]:
Config = namedtuple('Config', [
    'env_id',
    'population_size',
    'num_workers',
    'learning_rate',
    'noise_stdev',
    'snapshot_freq',
    'return_proc_mode',
    'calc_obstat_prob',
    'l2coeff',
    'eval_prob'
])

Optimizations = namedtuple('Optimizations', [
    'mirrored_sampling',
    'fitness_shaping',
    'weight_decay',
    'discretize_actions',
    'neural_network_optimizer',
    'observation_normalization'
])

ModelStructure = namedtuple('ModelStructure', [
    'ac_noise_std',
    'ac_bins',
    'hidden_dims',
    'nonlin_type',
    'nn_optimizer',
    'nn_optimizer_args'
])

Task = namedtuple('Task', [
    'theta', 'ob_mean', 'ob_std', 'task_id'])

Result = namedtuple('Result', [
    'noise_inds','returns', 'signreturns', 'lengths',
    'eval_return', 'eval_length',
    'ob_sum', 'ob_sumsq', 'ob_count',
    'task_id',
    'times_per_mutation',
    'time_create_model', 'time_clear_sess', 'time_set_sess',
    'times_set_flat', 'times_sample', 'times_get_noise',
    'times_predict'
])

### Configuration

#### Optimizations object

First we start with the optimizations for the training, since other parameters are only used when the respective optimization is activated.

All values can only be either `True` or `False`. The term _activated_ means the value is set to `True` in this context.

When `mirrored_sampling` is activated, sampled noise gets used twice: One time it will get added to the parameter vector and the result gets evaluated and the other time it gets subtracted from the parameter vector and the result will be evaluated.
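As an illustration of mirrored sampling, the sketch below estimates a gradient from pairs of perturbations $+\epsilon$ and $-\epsilon$. The function name `mirrored_gradient` and the linear toy fitness are assumptions made for this example only.

```python
import numpy as np

def mirrored_gradient(theta, fitness, rng, n_pairs=2000, sigma=0.1):
    """Gradient estimate that evaluates every perturbation in both directions."""
    eps = rng.standard_normal((n_pairs, theta.size))
    # Fitness with the noise added ...
    pos = np.array([fitness(theta + sigma * e) for e in eps])
    # ... and with the same noise subtracted.
    neg = np.array([fitness(theta - sigma * e) for e in eps])
    # Each pair contributes its fitness difference along its noise direction.
    return eps.T @ (pos - neg) / (2 * n_pairs * sigma)

# Linear toy fitness whose true gradient is `slope` everywhere.
slope = np.array([1.0, -2.0])
grad = mirrored_gradient(np.zeros(2), lambda w: float(slope @ w),
                         np.random.default_rng(0))
print(grad)
```

For this linear fitness the estimate should land close to `slope`, up to sampling noise.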

`fitness_shaping` processes the rewards by applying a rank transformation.
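A centered rank transformation can be sketched like this. The two helpers are written in the style of the OpenAI starter code, but treat them as an illustrative assumption rather than the exact functions used later in this notebook.

```python
import numpy as np

def compute_ranks(x):
    """Return the rank (0 .. len(x) - 1) of every entry of a 1-D array."""
    ranks = np.empty(len(x), dtype=int)
    ranks[x.argsort()] = np.arange(len(x))
    return ranks

def compute_centered_ranks(x):
    """Replace every entry by its rank, rescaled to the interval [-0.5, 0.5]."""
    y = compute_ranks(x.ravel()).reshape(x.shape).astype(np.float32)
    y /= (x.size - 1)
    y -= 0.5
    return y

# Rewards of two mirrored pairs; note the outlier of 4000.
returns = np.array([[10.0, -250.0], [3.0, 4000.0]])
shaped = compute_centered_ranks(returns)
print(shaped)
```

Only the ordering of the rewards survives the transformation, so a single outlier can no longer dominate the update.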

`weight_decay` is currently not implemented.

`discretize_actions` can be used to bin the actions. This means that you can provide a number of uniformly spaced bins into which the output of the model will be put. For some environments this can encourage exploration behavior.

`neural_network_optimizer` will use a neural network optimizer for the gradient ascent step, for example the `Adam` optimizer.

`observation_normalization`: when activated, the observation mean is subtracted from each observation and the result is divided by the observation standard deviation before it is fed into the neural network. The observation mean and standard deviation are continuously updated during training based on the configured probability.
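The normalization can be illustrated with plain NumPy. The sketch assumes that the statistics are tracked as a sum, a sum of squares and a count, and that the result is clipped to $[-5, 5]$; the random observations and the `normalize` helper are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake observations with mean 3 and standard deviation 2 (demo data only).
observations = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))

# Statistics that can be accumulated incrementally during training.
ob_sum = observations.sum(axis=0)
ob_sumsq = np.square(observations).sum(axis=0)
ob_count = len(observations)

ob_mean = ob_sum / ob_count
# The lower bound on the variance avoids dividing by a value close to zero.
ob_std = np.sqrt(np.maximum(ob_sumsq / ob_count - np.square(ob_mean), 1e-2))

def normalize(ob):
    """Shift and scale an observation, then clip it to [-5, 5]."""
    return np.clip((ob - ob_mean) / ob_std, -5.0, 5.0)

normalized = normalize(observations)
print(normalized.mean(axis=0), normalized.std(axis=0))
```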

In [None]:
optimizations = Optimizations(
    mirrored_sampling=True,
    fitness_shaping=True,
    weight_decay=True,
    discretize_actions=False,
    neural_network_optimizer=True,
    observation_normalization=True
)

#### Config object

The config object will serve us as a general configuration for the training.

First of all it defines the `env_id` which has to be a valid ID for a `Roboschool` environment, for example `RoboschoolAnt-v1`. A complete list can be found [here](https://github.com/openai/roboschool/blob/master/roboschool/__init__.py).

`population_size` defines the number of perturbations per generation.

`num_workers` defines the number of parallel processes created during the calculation and must be larger than 0. By default this value is the output of `os.cpu_count()`, which allows the program to use all of the computational power the hardware provides.

`learning_rate` defines how much the estimated gradient influences the next generation and must be larger than 0 for the training to work.

`noise_stdev` is the standard deviation of the noise and should be larger than 0. It cannot be equal to 0, since that could lead to a division by zero and would not benefit training anyway.

`snapshot_freq` describes the frequency in which generations shall be saved to a `.h5` file. For example a snapshot frequency of 5 would save every fifth generation to a file. Setting it to 0 disables snapshotting.

`return_proc_mode` stands for return processing mode and describes how the calculated rewards of a generation shall be processed. By default this is `centered_rank`, which calculates the ranks of the rewards. This option is only used when the `fitness_shaping` optimization is also activated. One can choose between the three constants `RETURN_PROC_MODE_CR`, `RETURN_PROC_MODE_SIGN` and `RETURN_PROC_MODE_CR_SIGN`.

`calc_obstat_prob` is the probability of saving the observations during a rollout (evaluating the fitness of a policy) and updating the observation mean and standard deviation. These values are used to normalize the input of a model which helps a neural network to generalize faster. The parameter is only used in combination with the `observation_normalization` optimization. Must be greater than 0 when using the optimization since otherwise it would waste performance while not normalizing.

`eval_prob` is the probability of inserting an evaluation run. This is useful during training to quickly monitor the reward mean, reward standard deviation and length mean of the current generation. The value must be greater than or equal to 0 (a value of 0 turns off the evaluation runs).

In [None]:
RETURN_PROC_MODE_CR = 'centered_rank'
RETURN_PROC_MODE_SIGN = 'sign'
RETURN_PROC_MODE_CR_SIGN = 'centered_sign_rank'

config = Config(
    env_id="RoboschoolAnt-v1",
    population_size=300,
    num_workers=os.cpu_count(),
    learning_rate=0.001,
    noise_stdev=0.02,
    snapshot_freq=1,
    return_proc_mode=RETURN_PROC_MODE_CR,
    calc_obstat_prob=0.01,
    l2coeff=0.005,
    eval_prob=0.003
)

assert config.population_size > 0
assert config.num_workers > 0
assert config.learning_rate > 0
assert config.noise_stdev != 0
assert config.eval_prob >= 0


if (config.return_proc_mode != RETURN_PROC_MODE_CR
    and config.return_proc_mode != RETURN_PROC_MODE_SIGN
    and config.return_proc_mode != RETURN_PROC_MODE_CR_SIGN):
    
    raise NotImplementedError
    
if optimizations.observation_normalization:
    assert config.calc_obstat_prob > 0

if optimizations.neural_network_optimizer:
    assert config.l2coeff > 0

#### ModelStructure object

The ModelStructure object defines the overall structure of the neural network.

`ac_noise_std` is the standard deviation for the noise which is added during training. Adding noise shall generalize the training. It must be greater than 0, or equal to 0 for no noise.

When using the `discretize_actions` optimization, `ac_bins` defines into how many uniformly spaced bins the actions shall be put. For example, if the possible action values range from -1 to 1 and one defines 5 action bins, the model will output values from $\{-1, -0.5, 0, 0.5, 1\}$. If you use the optimization, the number of bins must be greater than 0.
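The mapping from bin scores to action values can be sketched in NumPy. The function `discretize` is a hypothetical helper for this example; it assumes the network emits one score per bin and action dimension, and that the bin with the highest score wins.

```python
import numpy as np

def discretize(scores, num_bins, alow, ahigh):
    """Pick the highest-scoring bin per action dimension and map it to a value."""
    adim = len(alow)
    scores_ab = np.reshape(scores, (adim, num_bins))
    best_bin = np.argmax(scores_ab, axis=1)  # index in [0, num_bins - 1]
    # Rescale the bin index from [0, num_bins - 1] to [alow, ahigh].
    return alow + best_bin / (num_bins - 1.0) * (ahigh - alow)

alow, ahigh = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
scores = np.array([0.1, 0.9, 0.2, 0.0, 0.3,   # dimension 0: bin 1 wins
                   0.0, 0.1, 0.2, 0.3, 0.8])  # dimension 1: bin 4 wins
actions = discretize(scores, num_bins=5, alow=alow, ahigh=ahigh)
print(actions)
```

With 5 bins on $[-1, 1]$, bin 1 corresponds to $-0.5$ and bin 4 to $1$.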

`hidden_dims` defines the number of hidden layers and their dimensions. It must be a list of positive integers.

`nonlin_type` defines the activation function for the hidden layers. Can be `tanh`, `relu`, `lrelu` or `elu` for the hyperbolic tangent, rectified linear, leaky ReLU and the exponential linear. If something else is defined `tanh` will be picked.

`nn_optimizer` is only used when the `neural_network_optimizer` optimization is turned on. It can be `adam` for the Adam optimizer or `sgd` for the SGD optimizer. Defining anything else will result in an error.

`nn_optimizer_args` is only used when the `neural_network_optimizer` optimization is turned on. It is passed to the constructor of the optimizer. For the Adam and SGD optimizers one must specify the `stepsize`, but one can also specify other optimizer-specific attributes. Please check the constructor signature for the names. Be careful if you specify anything other than `stepsize`, since these values are not checked for errors.

In [None]:
OPTIMIZER_ADAM = 'adam'
OPTIMIZER_SGD = 'sgd'

model_structure = ModelStructure(
    ac_noise_std=0.01,
    ac_bins=5,
    hidden_dims=[256, 256],
    nonlin_type='tanh',
    nn_optimizer=OPTIMIZER_ADAM,
    nn_optimizer_args={
        'stepsize': 0.01
    }
)

assert model_structure.ac_noise_std >= 0
assert isinstance(model_structure.hidden_dims, list)
assert all(hd > 0 for hd in model_structure.hidden_dims)  # dimensions must be positive

if optimizations.neural_network_optimizer:
    if model_structure.nn_optimizer != OPTIMIZER_ADAM and model_structure.nn_optimizer != OPTIMIZER_SGD:
        raise NotImplementedError
    
    try:
        stepsize = model_structure.nn_optimizer_args['stepsize']
        assert stepsize > 0
    except KeyError:
        print("Please provide the stepsize parameter.")

if optimizations.discretize_actions:
    assert model_structure.ac_bins > 0

#### Task class

During training the master will enqueue a new Task object per generation. The workers will then take the latest task, compute it and push a result object on a queue. In the following table the attributes of a task object are explained in depth.

| Attribute | Explanation |
| :---------|:------------|
| `theta`   | The one-dimensional parameter vector of this task, i.e. the current generation|
| `ob_mean` | When observation normalization is used this is the current mean of the observed observation |
| `ob_std`  | When observation normalization is used this is the current standard deviation of the observed observation|
| `task_id` | The ID of this Task, which equals the generation number |



#### Result class

An object of the Result class describes a task computed by the workers. This can either be an evaluation task, where no noise gets added and the policy is simply evaluated on the environment, or a regular task, where noise gets sampled, added (and subtracted when mirrored sampling is used) and the result is evaluated. The following table gives more detailed information on each attribute.

| Attribute             | Explanation   |
| :---------------------|:--------------|
| `noise_inds`        | A numpy array with the indices of the used noise|
| `returns`        | A numpy array with the rewards. When mirrored sampling is used this list is two dimensional|
| `signreturns` | The sum of the signs of the rewards. When mirrored sampling is used this list is two dimensional|
| `lengths` | A numpy array with the sum of the timesteps. When mirrored sampling is used this list is two dimensional|
| `eval_return` | np.nan if this was not an evaluation task, otherwise a numpy array with the reward of the evaluation|
| `eval_length` | np.nan if this was not an evaluation task, otherwise a numpy array with the timesteps of the evaluation|
| `ob_sum` | If observation normalization is used this contains the sum of the tracked observations |
| `ob_sumsq` | If observation normalization is used this contains the squared sum of the tracked observations |
| `ob_count` | If observation normalization is used this contains the amount of tracked observations|
| `task_id` | The ID of the task that has been calculated|


### Save the configuration

After specifying all necessary configurations, they get saved to the `save_directory` so when training is done one can reproduce the training with the exact parameters.

In [None]:
with open(save_directory + 'config.json', 'w', encoding='utf-8') as f:
    chained_dict = OrderedDict([
        ('config', config._asdict()),
        ('model_structure', model_structure._asdict()),
        ('optimizations', optimizations._asdict())])

    json.dump(chained_dict, f, ensure_ascii=False, indent=4)
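To reproduce a training run, the saved file can be read back into a `namedtuple`. The sketch below is a self-contained round trip under assumptions: `DemoConfig` is a hypothetical, shortened configuration class and the file is written to a temporary directory instead of `save_directory`.

```python
import json
import os
import tempfile
from collections import namedtuple

# Hypothetical, shortened configuration class for this example.
DemoConfig = namedtuple('DemoConfig', ['env_id', 'population_size'])

path = os.path.join(tempfile.mkdtemp(), 'config.json')

# Save: namedtuples are converted to dictionaries before serialization.
with open(path, 'w', encoding='utf-8') as f:
    json.dump({'config': DemoConfig('RoboschoolAnt-v1', 300)._asdict()}, f, indent=4)

# Load: the dictionary is unpacked into the namedtuple constructor again.
with open(path, encoding='utf-8') as f:
    restored = DemoConfig(**json.load(f)['config'])

print(restored)
```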

## Function and variable definitions

### Environment

Before we continue we must create the environment object and the observation and action space variables. They are needed for constructing the models.

In [None]:
env = gym.make(config.env_id)

ob_space = env.observation_space
ac_space = env.action_space

### Using Keras for the models

The original implementation from OpenAI used chained TensorFlow operations to construct the model. This notebook however, uses Keras (which is integrated into TensorFlow) as a high-level API to construct `Model` objects. These objects are, in our case, chained layers which will represent our neural network.

Also, note that all imports of TensorFlow need to be inside function definitions, and these functions may only be called inside subprocesses. This is needed because importing TensorFlow automatically starts a background session, which would be inherited by any subprocesses started afterwards. Each worker would then share the same TensorFlow session, which would interfere with their respective models. Another noteworthy finding is that creating Keras models in a loop causes a memory leak when one does not clear the session. This will be addressed in the `run_worker` method.

In the following cell the method `create_model` creates and returns a Keras model as defined with the configurations. It needs to have the `ob_space` and `ac_space` variable set to the observation space and action space of the used environment in training, because they define the input and output dimension.

After adding an input layer the method will add the custom `ObservationNormalizationLayer` if the `observation_normalization` optimization is turned on. This layer uses the method parameters `ob_mean` and `ob_std` to normalize the input $o$ with this equation $\frac{o - \text{ob_mean}}{\text{ob_std}}$ and clip the values to $[-5, 5]$.

After that, the hidden layers get added as Dense layers according to the configured number and dimensions. If the `discretize_actions` optimization is turned on, the custom `DiscretizeActionsUniformLayer` is appended to map the output into the bins.

All layers use the custom `Normc_initializer` to initialize their kernel weights. This is copied from the OpenAI implementation, since initializing them differently can lead to a large negative reward at the start of training.
**TODO** low std, e.g. std=0.01 leads to a more stable result.
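What the column-normalized initialization does can be checked with a few lines of NumPy. The function `normc` is a stand-alone assumption that mirrors the idea of the initializer without TensorFlow: every column of the kernel is rescaled so that its Euclidean norm equals `std`.

```python
import numpy as np

def normc(shape, std=1.0, rng=None):
    """Random matrix whose columns all have Euclidean norm `std`."""
    if rng is None:
        rng = np.random.default_rng()
    out = rng.standard_normal(shape).astype(np.float32)
    # Divide each column by its norm, then scale it to the requested std.
    out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
    return out

kernel = normc((8, 4), std=0.01, rng=np.random.default_rng(0))
column_norms = np.linalg.norm(kernel, axis=0)
print(column_norms)
```

A small `std` on the output layer therefore keeps the initial actions close to zero.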



The method will create an input layer in the shape of the observation space, optionally add the custom `ObservationNormalizationLayer` if the `observation_normalization` optimization is turned on, add the specified hidden layers with their dimensions and add the output layer. If the `discretize_actions` optimization is turned on, another layer gets added which transforms the output into the bins.

When `initial_weights` is provided the model will be updated with these weights. They need to be in the correct shape for the model. When the function is called with a valid `save_path` the model gets saved to this path.

`ob_mean` and `ob_std` are only needed for normalizing the observations. If the normalization is turned on, `ob_mean` is subtracted from the input, the result is divided by `ob_std` and clipped to $[-5,5]$.

The two custom classes `ObservationNormalizationLayer` and `Normc_initializer` need to be supplied when loading a saved model. An example and also visualizing the trained data can be found in the other Jupyter Notebook.

In [None]:
def create_model(initial_weights=None, model_name="model", save_path=None, ob_mean=None, ob_std=None):
       
    import tensorflow as tf
    
    class Normc_initializer(tf.keras.initializers.Initializer):
        """
        Create a TensorFlow constant with random numbers normed in the given shape.
        :param std:
        :return:
        """
        def __init__(self, std=1.0):
            self.std=std

        def __call__(self, shape, dtype=None, partition_info=None):#pylint: disable=W0613
            out = np.random.randn(*shape).astype(np.float32)
            out *= self.std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
            return tf.constant(out)
    
    class ObservationNormalizationLayer(tf.keras.layers.Layer):
        def __init__(self, ob_mean, ob_std, **kwargs):
            self.ob_mean = ob_mean
            self.ob_std = ob_std
            super(ObservationNormalizationLayer, self).__init__(**kwargs)

        def call(self, x):
            return tf.clip_by_value((x - self.ob_mean) / self.ob_std, -5.0, 5.0)
        
        # get_config and from_config need to be implemented to be able to serialize the model
        def get_config(self):
            base_config = super(ObservationNormalizationLayer, self).get_config()
            base_config['ob_mean'] = self.ob_mean
            base_config['ob_std'] = self.ob_std
            return base_config
        
        @classmethod
        def from_config(cls, config):
            return cls(**config)
        
    class DiscretizeActionsUniformLayer(tf.keras.layers.Layer):
        def __init__(self, num_ac_bins, adim, ahigh, alow, **kwargs):
            self.num_ac_bins = num_ac_bins
            self.adim = adim
            self.ahigh = ahigh
            self.alow = alow
            super(DiscretizeActionsUniformLayer, self).__init__(**kwargs)

        def call(self, x):            
            # Reshape to [n x i x j] where n is dynamically chosen, i equals action dimension and j equals the number
            # of bins
            scores_nab = tf.reshape(x, [-1, self.adim, self.num_ac_bins])
            # This picks the bin with the greatest value
            a = tf.argmax(scores_nab, 2)
            
            # Then transform the interval from [0, num_ac_bins - 1] to [-1, 1] which equals alow and ahigh
            ac_range_1a = (self.ahigh - self.alow)[None, :]
            return 1. / (self.num_ac_bins - 1.) * tf.keras.backend.cast(a, 'float32') * ac_range_1a + self.alow[None, :]        
        
        # get_config and from_config need to be implemented to be able to serialize the model
        def get_config(self):
            base_config = super(DiscretizeActionsUniformLayer, self).get_config()
            base_config['num_ac_bins'] = self.num_ac_bins
            base_config['adim'] = self.adim
            base_config['ahigh'] = self.ahigh
            base_config['alow'] = self.alow
            return base_config
        
        @classmethod
        def from_config(cls, config):
            return cls(**config)
    
    nonlin = tf.nn.tanh
    
    if model_structure.nonlin_type == 'relu':
        nonlin = tf.nn.relu
    elif model_structure.nonlin_type == 'lrelu':
        nonlin = tf.nn.leaky_relu
    elif model_structure.nonlin_type == 'elu':
        nonlin = tf.nn.elu  # exponential linear unit

    # Policy network
    input_layer = x = tf.keras.Input(ob_space.shape, dtype=tf.float32)
    
    if ob_mean is not None and ob_std is not None and optimizations.observation_normalization:
        if ob_std.all() != 0:
            x = ObservationNormalizationLayer(ob_mean, ob_std)(x)
                
    for hd in model_structure.hidden_dims:
        x = tf.keras.layers.Dense(
            hd, activation=nonlin,
            kernel_initializer=Normc_initializer(std=1.0),
            bias_initializer=tf.initializers.zeros())(x)

    # Action dimension and the lowest and highest possible values for an action
    adim, ahigh, alow = ac_space.shape[0], ac_space.high, ac_space.low        
    
    if optimizations.discretize_actions:
        num_ac_bins = int(model_structure.ac_bins)
        x = tf.keras.layers.Dense(
                        adim * num_ac_bins,
                        kernel_initializer=Normc_initializer(std=0.01),
                        bias_initializer=tf.initializers.zeros())(x)
        a = DiscretizeActionsUniformLayer(num_ac_bins, adim, ahigh, alow)(x)
    else:
        a = tf.keras.layers.Dense(
            adim,
            kernel_initializer=Normc_initializer(std=0.01),
            bias_initializer=tf.initializers.zeros())(x)
    
    model = tf.keras.Model(inputs=input_layer, outputs=a, name=model_name)
    
    if initial_weights is not None:
        set_from_flat(model, initial_weights)
        
    if save_path:
        model.save(save_path)
        
    return model

### Defining the act function

The `act` function gets called every time the environment returns the current observation. It uses the provided `ob` and `model` to predict the best action and returns it. When a `random_stream` is provided, which is done by default in training, noise gets added to the predicted action to make the model more robust and help it generalize better.

In [None]:
def act(ob, model, random_stream=None):
    time_predict_s = time.time()
    action = model.predict_on_batch(ob)
    time_predict_e = time.time() - time_predict_s

    if random_stream is not None and model_structure.ac_noise_std != 0:
        action += random_stream.randn(*action.shape) * model_structure.ac_noise_std
    return action, time_predict_e

In [None]:
class RunningStat(object):
    def __init__(self, shape, eps):
        self.sum = np.zeros(shape, dtype=np.float32)
        self.sumsq = np.full(shape, eps, dtype=np.float32)
        self.count = eps

    def increment(self, s, ssq, c):
        self.sum += s
        self.sumsq += ssq
        self.count += c

    @property
    def mean(self):
        return self.sum / self.count

    @property
    def std(self):
        return np.sqrt(np.maximum(self.sumsq / self.count - np.square(self.mean), 1e-2))

### Specifying the number of parameters

For the Optimizer classes we need the total number of parameters in our models. For this we define the `get_initial_weights` method, which creates a model, prints its layout and returns the random weights.
Remember that this needs to be done in a subprocess, so we use a `multiprocessing` `Pool` which allows us to spawn a subprocess and return the weights.

We then calculate the number of parameters from our weight vector $\theta$. The weight vector is also important for later, because it will be the starting weight vector for our training.

ob_stat = RunningStat(
    env.observation_space.shape,
    eps=1e-2  # eps to prevent dividing by zero at the beginning when computing mean/stdev
    )

model = create_model(ob_mean=ob_stat.mean, ob_std=ob_stat.std)
model.summary()
ob = np.ones(ob_space.shape)
a = model.predict(ob[None])

#a = np.reshape(a, [adim, 10])
#a[1][0] += 1
#a = np.argmax(a, 1)

#a = np.clip(a, -1., 1.)
#adim, ahigh, alow = ac_space.shape[0], ac_space.high, ac_space.low 
#ac_range_1a = (ahigh - alow)[None, :]
#(lambda a: 1. / (10 - 1.) * a * ac_range_1a + alow[None, :])(a)
a

In [None]:
def get_initial_weights(ob_mean=None, ob_std=None):
    
    model = create_model(ob_mean=ob_mean, ob_std=ob_std)
    
    # Print out the model
    model.summary()
    
    return model.get_weights()

with multiprocessing.Pool(1) as pool:
    if optimizations.observation_normalization:
        ob_stat = RunningStat(
            env.observation_space.shape,
            eps=1e-2  # eps to prevent dividing by zero at the beginning when computing mean/stdev
            )
        theta = pool.apply(func=get_initial_weights, args=(ob_stat.mean, ob_stat.std))
    else:
        theta = pool.apply(func=get_initial_weights)

num_params = sum(np.prod(v.shape) for v in theta)
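As a sanity check, the number of parameters of a fully connected network can also be computed by hand: every Dense layer contributes `inputs * outputs` kernel weights plus `outputs` biases. The helper `count_mlp_params` and the sizes 28 and 8 for the observation and action dimensions are assumptions for this example; the real values come from `ob_space` and `ac_space`.

```python
def count_mlp_params(in_dim, hidden_dims, out_dim):
    """Number of weights and biases of a fully connected network."""
    dims = [in_dim] + list(hidden_dims) + [out_dim]
    # Each layer has a kernel of size a * b and a bias of size b.
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))

# Assumed example sizes: 28 observations, two hidden layers of 256, 8 actions.
n_params = count_mlp_params(28, [256, 256], 8)
print(n_params)
```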

### Optimization: Using neural network optimizer

These optimizers are copied from the OpenAI implementation. Optimizers are also implemented in Keras, but they need a loss function to work, which we do not have when using an evolution strategy. With Keras one would also have to provide a training set and so on, so defining them manually seems easier for now. The classes are copied completely from OpenAI, except that they do not use the policy variable but receive `theta` directly.

In [None]:
class Optimizer(object):
    def __init__(self):
        self.dim = num_params
        self.t = 0

    def update(self, theta, globalg):
        self.t += 1
        step = self._compute_step(globalg)
        ratio = np.linalg.norm(step) / np.linalg.norm(theta)
        theta_new = theta + step
        return theta_new, ratio

    def _compute_step(self, globalg):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, stepsize, momentum=0.9):
        Optimizer.__init__(self)
        self.v = np.zeros(self.dim, dtype=np.float32)
        self.stepsize, self.momentum = stepsize, momentum

    def _compute_step(self, globalg):
        self.v = self.momentum * self.v + (1. - self.momentum) * globalg
        step = -self.stepsize * self.v
        return step
        
class Adam(Optimizer):
    def __init__(self, stepsize, beta1=0.9, beta2=0.999, epsilon=1e-08):
        Optimizer.__init__(self)
        self.stepsize = stepsize
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = np.zeros(self.dim, dtype=np.float32)
        self.v = np.zeros(self.dim, dtype=np.float32)

    def _compute_step(self, globalg):
        a = self.stepsize * np.sqrt(1 - self.beta2 ** self.t) / (1 - self.beta1 ** self.t)
        self.m = self.beta1 * self.m + (1 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1 - self.beta2) * (globalg * globalg)
        step = -a * self.m / (np.sqrt(self.v) + self.epsilon)
        return step
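
To see how such an optimizer consumes a gradient, the following minimal sketch restates the Adam step from above as a standalone function and minimizes a toy quadratic. The function name `adam_step`, the vector `target` and the chosen hyperparameters are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def adam_step(grad, m, v, t, stepsize=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step using the same formula as Adam._compute_step above."""
    a = stepsize * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad * grad)
    step = -a * m / (np.sqrt(v) + epsilon)
    return step, m, v

target = np.array([1.0, -2.0, 0.5])   # hypothetical optimum of the toy problem
demo_theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)

for t in range(1, 501):
    grad = 2.0 * (demo_theta - target)  # gradient of ||theta - target||^2
    step, m, v = adam_step(grad, m, v, t)
    demo_theta = demo_theta + step      # the step already contains the minus sign

final_error = np.linalg.norm(demo_theta - target)
```

Because the step is a descent direction, the master passes the negated gradient estimate `-g` to `update` when it wants to maximize the return.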

In [None]:
if optimizations.neural_network_optimizer:
    if model_structure.nn_optimizer == OPTIMIZER_ADAM:
        optimizer = Adam(**model_structure.nn_optimizer_args)
    elif model_structure.nn_optimizer == OPTIMIZER_SGD:
        optimizer = SGD(**model_structure.nn_optimizer_args)
    else:
        raise NotImplementedError

## Shared Noise

In [None]:
class SharedNoiseTable(object):
    def __init__(self):
        seed = 123
        count = 250000000  # 1 gigabyte of 32-bit numbers. Will actually sample 2 gigabytes below.
        print('Sampling {} random numbers with seed {}'.format(count, seed))

        # Instantiate an array of C float datatype with size count
        self._shared_mem = multiprocessing.Array(ctypes.c_float, count)

        # Convert to numpy array
        self.noise = np.ctypeslib.as_array(self._shared_mem.get_obj())
        assert self.noise.dtype == np.float32
        self.noise[:] = np.random.RandomState(seed).randn(count)  # 64-bit to 32-bit conversion here
        print('Sampled {} bytes'.format(self.noise.size * 4))

    def get(self, i, dim):
        return self.noise[i:i + dim]

    def sample_index(self, stream, dim):
        return stream.randint(0, len(self.noise) - dim + 1)
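
The idea behind the shared table is that a worker only has to report an integer index, and the master can look up exactly the same perturbation again. A minimal sketch with a small stand-in table (1000 values instead of 250 million; the names `table` and `num_params_demo` are hypothetical):

```python
import numpy as np

# Small stand-in for the shared noise table
table = np.random.RandomState(123).randn(1000).astype(np.float32)
num_params_demo = 8  # hypothetical number of model parameters

def get(i, dim):
    # A slice is a view into the table, no random numbers are drawn here
    return table[i:i + dim]

def sample_index(stream, dim):
    # The highest valid start index is len(table) - dim; randint excludes the upper bound
    return stream.randint(0, len(table) - dim + 1)

worker_stream = np.random.RandomState(0)
idx = sample_index(worker_stream, num_params_demo)
epsilon = get(idx, num_params_demo)

# The master recovers the identical perturbation from the index alone
epsilon_on_master = get(idx, num_params_demo)
```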

## Get flat and set from flat

In [None]:
def get_flat(theta):
    # Concatenate all weight arrays into a single 1D vector
    return np.concatenate([np.reshape(v, [-1]) for v in theta], 0)

def set_from_flat(model, theta):
    # Split the flat vector back into arrays with the shapes of the model's weights
    shapes = [v.shape for v in model.get_weights()]
    total_size = theta.size

    start = 0
    reshapes = []

    for shape in shapes:
        size = int(np.prod(shape))
        reshapes.append(np.reshape(theta[start:start+size], shape))
        start += size

    # Every element of the flat vector must have been consumed
    assert start == total_size
    model.set_weights(reshapes)
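
The round trip between a list of weight arrays and a flat vector can be checked without a Keras model. The sketch below uses plain NumPy arrays; `flatten` and `unflatten` are hypothetical helper names that mirror the two functions above.

```python
import numpy as np

def flatten(weights):
    return np.concatenate([np.reshape(w, [-1]) for w in weights], 0)

def unflatten(flat, shapes):
    arrays, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(np.reshape(flat[start:start + size], shape))
        start += size
    assert start == flat.size
    return arrays

# Stand-in for model.get_weights(): one 2x3 kernel and one bias vector
weights = [np.arange(6, dtype=np.float32).reshape(2, 3),
           np.array([10., 11., 12.], dtype=np.float32)]

flat = flatten(weights)
restored = unflatten(flat, [w.shape for w in weights])
```

Working on the flat vector is what allows a perturbation to be added with a single `theta + epsilon`.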
    

## Rollout

In [None]:
def rollout(env, 
            model, 
            *, 
            render=False, 
            timestep_limit=None, 
            save_obs=False, 
            random_stream=None):
    """
    If random_stream is provided, the rollout will take noisy actions with noise drawn from that stream.
    Otherwise, no action noise will be added.
    """
    
    env_timestep_limit = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
    timestep_limit = env_timestep_limit if timestep_limit is None else min(timestep_limit, env_timestep_limit)
    rews = []
    times_predict = []
    t = 0
    if save_obs:
        obs = []
    ob = env.reset()
    for _ in range(timestep_limit):
        ac, time_predict = act(ob[None], model, random_stream=random_stream)
        ac = ac[0]
        times_predict.append(time_predict)
        if save_obs:
            obs.append(ob)
        ob, rew, done, _ = env.step(ac)
        rews.append(rew)
        t += 1
        if render:
            env.render()
        if done:
            break
    rews = np.array(rews, dtype=np.float32)
    if save_obs:
        return rews, t, np.array(obs), times_predict
    return rews, t, times_predict


# Worker method


In [None]:
def rollout_and_update_ob_stat(env, model, rs, task_ob_stat):
    if optimizations.observation_normalization and config.calc_obstat_prob != 0 and rs.rand() < config.calc_obstat_prob:
        rollout_rews, rollout_len, obs, times_predict = rollout(
            env, model, save_obs=True, random_stream=rs)
        task_ob_stat.increment(obs.sum(axis=0), np.square(obs).sum(axis=0), len(obs))
    else:
        rollout_rews, rollout_len, times_predict = rollout(env, model, random_stream=rs)
    return rollout_rews, rollout_len, times_predict
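
The worker only reports the sum, the sum of squares and the number of saved observations, which is enough for the master to merge statistics from many rollouts. The sketch below illustrates this with `TinyRunningStat`, a hypothetical stand-in written for this example; it is an assumption about how such a class can work and not necessarily identical to the `RunningStat` used in this notebook.

```python
import numpy as np

class TinyRunningStat:
    """Hypothetical stand-in: keeps sum, sum of squares and count."""
    def __init__(self, shape, eps):
        self.sum = np.zeros(shape, dtype=np.float64)
        self.sumsq = np.full(shape, eps, dtype=np.float64)
        self.count = eps

    def increment(self, s, ssq, c):
        self.sum += s
        self.sumsq += ssq
        self.count += c

    @property
    def mean(self):
        return self.sum / self.count

    @property
    def std(self):
        # Var[x] = E[x^2] - E[x]^2, clipped at zero against rounding errors
        return np.sqrt(np.maximum(self.sumsq / self.count - np.square(self.mean), 0.0))

rng = np.random.RandomState(0)
obs_rollout_a = rng.randn(50, 3)        # observations saved by one rollout
obs_rollout_b = rng.randn(70, 3) + 2.0  # observations saved by another rollout

master_stat = TinyRunningStat(shape=(3,), eps=0.0)
for obs in (obs_rollout_a, obs_rollout_b):
    # Same three quantities that rollout_and_update_ob_stat passes to increment
    master_stat.increment(obs.sum(axis=0), np.square(obs).sum(axis=0), len(obs))

all_obs = np.concatenate([obs_rollout_a, obs_rollout_b])
```

The merged mean and standard deviation should match those computed over all observations at once, so no raw observations need to be sent to the master.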

In [None]:
def run_worker(task_list, result_queue):
    from tensorflow.keras import backend as K
    import tensorflow as tf
    
    K.set_session(tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)))

    while not task_list:
        time.sleep(0.05)
    
    print("PID {}: Started worker".format(os.getpid()))
    
    assert isinstance(noise, SharedNoiseTable)

    # Setup
    # Create a new gym environment object because each worker needs its own one
    env = gym.make(config.env_id)
    
    # The model is created in the loop below from the weights 'theta' of the current generation's task
    
    # Random stream used for adding noise to the actions as well as deciding if the observation statistics shall be
    # updated
    rs = np.random.RandomState()
    
    wait_time = 1
    
    cached_task = None
    cached_task_id = -1
    model = None
    
    while True:
        # Get the latest Task from the Manager list
        try:
            task_tuple = task_list[-1]
        except IndexError:
            if wait_time > 100:
                print("The task list does not get tasks, something went wrong in the Master. Aborting.")
                break
            print("Task list is empty, waiting {} seconds before trying again".format(wait_time))
            time.sleep(wait_time)
            wait_time *= 2
            continue
    

        #assert isinstance(task, Task)
        assert isinstance(task_tuple, tuple)
        #task_id = task.task_id
        task_id = task_tuple[0]
        assert isinstance(task_id, int)
        
        time_create_model_e = 0
        time_set_sess_e = 0
        time_clear_sess_e = 0
        
        if task_id != cached_task_id:
            cached_task = task_tuple[1]
            assert isinstance(cached_task, Task)
            cached_task_id = task_id
        
            time_clear_sess_s = time.time()
            K.clear_session()
            time_clear_sess_e = time.time() - time_clear_sess_s

            time_set_sess_s = time.time()
            K.set_session(tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)))
            time_set_sess_e = time.time() - time_set_sess_s

            time_create_model_s = time.time()        
            model = create_model(initial_weights=cached_task.theta, 
                             model_name=str(os.getpid()),
                             ob_mean=cached_task.ob_mean,
                             ob_std=cached_task.ob_std)

            time_create_model_e = time.time() - time_create_model_s
        
        if rs.rand() < config.eval_prob:
            # Evaluation sample
            eval_rews, eval_length, times_predict = rollout(env, model)
            
            result_queue.put(Result(
                noise_inds=None,
                returns=None,
                signreturns=None,
                lengths=None,
                eval_return=eval_rews.sum(),
                eval_length=eval_length,
                ob_sum=None,
                ob_sumsq=None,
                ob_count=None,
                task_id=cached_task_id,
                times_per_mutation=None,
                time_create_model=None if time_create_model_e == 0 else time_create_model_e,
                time_clear_sess=None if time_clear_sess_e == 0 else time_clear_sess_e,
                time_set_sess=None if time_set_sess_e == 0 else time_set_sess_e,
                times_set_flat=None,
                times_sample=None,
                times_get_noise=None,
                times_predict=times_predict
            ))
            
        else:
            task_ob_stat = RunningStat(env.observation_space.shape, eps=0.)  # eps=0 because we're incrementing only
            
            noise_inds, returns, signreturns, lengths = [], [], [], []
            times_per_mutation, times_set_flat, times_sample, times_get_noise = [], [], [], []
            times_predict = []
            
            while not noise_inds:

                # Noise sample
                time_sample_s = time.time()
                noise_idx = noise.sample_index(rs, num_params)
                times_sample.append(time.time() - time_sample_s)
                
                time_get_noise_s = time.time()
                epsilon = config.noise_stdev * noise.get(noise_idx, num_params)
                times_get_noise.append(time.time() - time_get_noise_s)
                
                # Evaluate the sampled noise
                time_set_flat_s = time.time()
                set_from_flat(model, cached_task.theta + epsilon)
                times_set_flat.append(time.time() - time_set_flat_s)
                
                time_mutation_s = time.time()
                rews_pos, len_pos, times_predict_pos = rollout_and_update_ob_stat(env,
                                                                                  model,
                                                                                  rs=rs,
                                                                                  task_ob_stat=task_ob_stat)
                times_per_mutation.append(time.time() - time_mutation_s)
                                
                # Gather results
                noise_inds.append(noise_idx)
                returns.append([rews_pos.sum()])
                signreturns.append([np.sign(rews_pos).sum()])
                lengths.append([len_pos])
                
                times_predict += times_predict_pos

                # Mirrored sampling also evaluates the noise by subtracting it
                if optimizations.mirrored_sampling:
                    time_set_flat_s = time.time()
                    set_from_flat(model, cached_task.theta - epsilon)
                    times_set_flat.append(time.time() - time_set_flat_s)
                    
                    times_mutation_s = time.time()
                    rews_neg, len_neg, times_predict_neg = rollout_and_update_ob_stat(env,
                                                                                      model, 
                                                                                      rs=rs, 
                                                                                      task_ob_stat=task_ob_stat)
                    times_per_mutation.append(time.time() - times_mutation_s)

                    returns[-1].append(rews_neg.sum())
                    signreturns[-1].append(np.sign(rews_neg).sum())
                    lengths[-1].append(len_neg)
                    
                    times_predict += times_predict_neg

            
            result_queue.put(Result(
                noise_inds=np.array(noise_inds),
                returns=np.array(returns, dtype=np.float32),
                signreturns=np.array(signreturns, dtype=np.float32),
                lengths=np.array(lengths, dtype=np.int32),
                eval_return=None,
                eval_length=None,
                ob_sum=None if task_ob_stat.count == 0 else task_ob_stat.sum,
                ob_sumsq=None if task_ob_stat.count == 0 else task_ob_stat.sumsq,
                ob_count=task_ob_stat.count,
                task_id=cached_task_id,
                times_per_mutation=times_per_mutation,
                time_create_model=None if time_create_model_e == 0 else time_create_model_e,
                time_clear_sess=None if time_clear_sess_e == 0 else time_clear_sess_e,
                time_set_sess=None if time_set_sess_e == 0 else time_set_sess_e,
                times_set_flat=times_set_flat, 
                times_sample=times_sample,
                times_get_noise=times_get_noise,
                times_predict=times_predict
            ))

In [None]:
def itergroups(items, group_size):
    assert group_size >= 1
    group = []
    for x in items:
        group.append(x)
        if len(group) == group_size:
            yield tuple(group)
            del group[:]
    if group:
        yield tuple(group)
        
def batched_weighted_sum(weights, vecs, batch_size):
    total = 0.
    num_items_summed = 0
    for batch_weights, batch_vecs in zip(itergroups(weights, batch_size), itergroups(vecs, batch_size)):
        assert len(batch_weights) == len(batch_vecs) <= batch_size
        total += np.dot(np.asarray(batch_weights, dtype=np.float32), np.asarray(batch_vecs, dtype=np.float32))
        num_items_summed += len(batch_weights)
    return total, num_items_summed
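
As a small check of what these helpers compute, the sketch below compares the batched result with a direct matrix product. The two functions are repeated in compact form only so that the block runs on its own; the random inputs are made up for illustration.

```python
import numpy as np

def itergroups(items, group_size):
    group = []
    for x in items:
        group.append(x)
        if len(group) == group_size:
            yield tuple(group)
            del group[:]
    if group:
        yield tuple(group)

def batched_weighted_sum(weights, vecs, batch_size):
    total, num_items_summed = 0., 0
    for batch_weights, batch_vecs in zip(itergroups(weights, batch_size), itergroups(vecs, batch_size)):
        total += np.dot(np.asarray(batch_weights, dtype=np.float32),
                        np.asarray(batch_vecs, dtype=np.float32))
        num_items_summed += len(batch_weights)
    return total, num_items_summed

rng = np.random.RandomState(0)
demo_weights = rng.randn(7).astype(np.float32)
demo_vecs = rng.randn(7, 4).astype(np.float32)

# Seven items in groups of three: two full groups and one remainder
groups = list(itergroups(range(7), 3))

# The vectors are passed as a generator, so only one batch is materialized at a time
batched, count = batched_weighted_sum(demo_weights, (v for v in demo_vecs), batch_size=3)
direct = demo_weights @ demo_vecs
```

Both results agree up to floating point error. The batched variant never has to stack all noise vectors into one large matrix.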

## Optimization: Fitness shaping with a rank transformation

In [None]:
def compute_ranks(x):
    """
    Returns ranks in [0, len(x))
    Note: This is different from scipy.stats.rankdata, which returns ranks in [1, len(x)].
    """
    assert x.ndim == 1
    ranks = np.empty(len(x), dtype=int)
    ranks[x.argsort()] = np.arange(len(x))
    return ranks


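def compute_centered_ranks(x):
    y = compute_ranks(x.ravel()).reshape(x.shape).astype(np.float32)
    # Scale the ranks to [0, 1] and center them around zero, giving values in [-0.5, 0.5]
    y /= (x.size - 1)
    y -= .5
    return y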
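
A short worked example may help to see the effect of the rank transformation. The sketch assumes that ranks are scaled to the interval [-0.5, 0.5]; the helper names `ranks` and `centered_ranks` and the return values are made up for illustration.

```python
import numpy as np

def ranks(x):
    r = np.empty(len(x), dtype=int)
    r[x.argsort()] = np.arange(len(x))
    return r

def centered_ranks(x):
    y = ranks(x.ravel()).reshape(x.shape).astype(np.float32)
    y /= (x.size - 1)
    y -= 0.5
    return y

# Returns of three mirrored pairs (positive noise, negative noise).
# The outlier 1000 only counts as "the best value", its magnitude is discarded.
demo_returns = np.array([[1000.0, -3.0],
                         [2.0, 5.0],
                         [0.0, 1.0]], dtype=np.float32)
shaped = centered_ranks(demo_returns)
```

Here `shaped` comes out as approximately `[[0.5, -0.5], [0.1, 0.3], [-0.3, -0.1]]`, so a single very large return cannot dominate the update.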

# Master

In [None]:
rs = np.random.RandomState()

noise = SharedNoiseTable()

manager = multiprocessing.Manager()
task_list = manager.list()
result_queue = multiprocessing.Queue()

# Start workers
workers = []

for _ in range(int(config.num_workers)):
    worker_p = multiprocessing.Process(target=run_worker, args=(task_list, result_queue))
    workers.append(worker_p)
    worker_p.start()
    
theta = get_flat(theta)

episodes_so_far = 0
timesteps_so_far = 0
generations = 0
tstart = time.time()

ob_mean = ob_stat.mean if optimizations.observation_normalization else None
ob_std = ob_stat.std if optimizations.observation_normalization else None

generation_log = OrderedDict()
generation_log_file = save_directory + 'log.csv'
fieldnames = [
    'Generation',
    'GenRewMean', 'GenRewStd', 'GenLenMean', 
    'EvalGenRewardMean', 'EvalGenRewardStd', 'EvalGenLengthMean', 'EvalGenCount',
    'EpisodesThisGen', 'EpisodesSoFar', 'TimestepsThisGen', 'TimestepsSoFar',
    'UniqueWorkers', 'ResultsSkippedFrac', 'ObCount',
    'TimeElapsedThisGen', 'TimeElapsed',
    'TimePerMutationMin', 'TimePerMutationMax', 'TimePerMutationMean', 'TimePerMutationCount',
    'TimeCreateModelMin', 'TimeCreateModelMax', 'TimeCreateModelMean', 'TimeCreateModelCount',                     
    'TimeSetFlatMin', 'TimeSetFlatMax', 'TimeSetFlatMean', 'TimeSetFlatCount', 
    'TimeSampleMin', 'TimeSampleMax', 'TimeSampleMean', 'TimeSampleCount',
    'TimeGetNoiseMin', 'TimeGetNoiseMax', 'TimeGetNoiseMean', 'TimeGetNoiseCount',
    'TimePredictMin', 'TimePredictMax', 'TimePredictMean', 'TimePredictCount',
    'TimeClearSessMin', 'TimeClearSessMax', 'TimeClearSessMean', 'TimeClearSessCount',
    'TimeSetSessMin', 'TimeSetSessMax', 'TimeSetSessMean', 'TimeSetSessCount']

with open(generation_log_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    

while True:
    step_tstart = time.time()

    task_list.append(
        (generations, Task(
        theta=theta,
        ob_mean=ob_mean,
        ob_std=ob_std,
        task_id = generations
    )))
    
    print("---------------- Generation: " + str(generations) + "----------------")
            
    assert theta.dtype == np.float32
    
    curr_task_results, eval_returns, eval_lengths = [], [], []
    num_results_skipped, num_episodes_popped, ob_count_this_gen = 0, 0, 0
    
    times_per_mutation, times_create_model, times_clear_sess, times_set_sess = [], [], [], []
    times_set_flat, times_sample, times_get_noise = [], [], []
    times_predict = []
        
    print("PID " + str(os.getpid()) + ": " + "Waiting for results")
   
    while num_episodes_popped < config.population_size:
        result = result_queue.get()
        
        assert isinstance(result, Result)
        task_id = result.task_id
        assert isinstance(task_id, int)
        
        assert (result.eval_return is None) == (result.eval_length is None)
        
        if result.time_create_model is not None:
            times_create_model.append(result.time_create_model)
        if result.time_clear_sess is not None:
            times_clear_sess.append(result.time_clear_sess)
        if result.time_set_sess is not None:
            times_set_sess.append(result.time_set_sess)
        
        if result.eval_length is not None:
            # The result was an evaluation job
            
            episodes_so_far += 1
            timesteps_so_far += result.eval_length
            
            if task_id == generations:
                eval_returns.append(result.eval_return)
                eval_lengths.append(result.eval_length)
                times_predict += result.times_predict
        else:
            assert result.noise_inds.ndim == 1 and result.returns.dtype == np.float32

            if optimizations.mirrored_sampling:
                assert result.returns.shape == result.lengths.shape == (len(result.noise_inds), 2)
            else:
                assert result.returns.shape == result.lengths.shape == (len(result.noise_inds), 1)

            # Update counts
            result_num_eps = result.lengths.size
            result_num_timesteps = result.lengths.sum()
            episodes_so_far += result_num_eps
            timesteps_so_far += result_num_timesteps
            
            
            if task_id == generations:
                curr_task_results.append(result)
                num_episodes_popped += result_num_eps

                # Update observation stats if the optimization is used
                if optimizations.observation_normalization and result.ob_count > 0:
                    ob_stat.increment(result.ob_sum, result.ob_sumsq, result.ob_count)
                    ob_count_this_gen += result.ob_count

                times_per_mutation += result.times_per_mutation
                times_set_flat += result.times_set_flat
                times_sample += result.times_sample
                times_get_noise += result.times_get_noise
                times_predict += result.times_predict
            else:
                num_results_skipped += 1
                
    print("Gathered results")
    
    # Compute skip fraction
    frac_results_skipped = num_results_skipped / (num_results_skipped + len(curr_task_results))
    if num_results_skipped > 0:
        print("Skipped {} out of date results ({:.2f}%)".format(
            num_results_skipped, 100. * frac_results_skipped))
    
    # Assemble results
    noise_inds = np.concatenate([r.noise_inds for r in curr_task_results])
    returns = np.concatenate([r.returns for r in curr_task_results])
    lengths = np.concatenate([r.lengths for r in curr_task_results])
    assert noise_inds.shape[0] == returns.shape[0] == lengths.shape[0]
    
    # If fitness shaping is turned on rank the results
    if optimizations.fitness_shaping:
        if config.return_proc_mode == RETURN_PROC_MODE_CR:
            proc_returns = compute_centered_ranks(returns)
        # sign and centered_sign_rank are obviously only useful in combination with mirrored sampling
        elif config.return_proc_mode == RETURN_PROC_MODE_SIGN:
            proc_returns = np.concatenate([r.signreturns for r in curr_task_results])
        elif config.return_proc_mode == RETURN_PROC_MODE_CR_SIGN:
            proc_returns = compute_centered_ranks(np.concatenate([r.signreturns for r in curr_task_results]))
        else:
            # Raise an error to flag the invalid input instead of silently passing on.
            # This should already have been caught in the configuration section, so this indicates a misconfiguration.
            raise NotImplementedError
    else:
        proc_returns = returns
    
    # Mirrored sampling returns a 2D numpy array therefore we need to preprocess it accordingly
    if optimizations.mirrored_sampling:
        # Calculates the difference between the rewards sampled with the positive and negative noise
        proc_returns = proc_returns[:, 0] - proc_returns[:, 1]
    else:
        proc_returns = proc_returns.ravel()
    
    # Calculate the approximated gradient in batches, so that only batch_size noise vectors are held in memory at once
    g, count = batched_weighted_sum(
        proc_returns,
        (noise.get(idx, num_params) for idx in noise_inds),
        batch_size=500
    )
    
    assert g.shape == (num_params,) and g.dtype == np.float32 and count == len(noise_inds)
    
    # Update with the approximated gradient
    g /= returns.size
    
    if optimizations.neural_network_optimizer:
        theta, _ = optimizer.update(theta, -g + config.l2coeff * theta)
    else:
        theta += ((config.learning_rate / config.noise_stdev) * g)
    
    # Update ob_mean and ob_std
    if optimizations.observation_normalization and ob_count_this_gen > 0:
        ob_mean = ob_stat.mean
        ob_std = ob_stat.std
    
    step_tend = time.time()
    
    # Log the generation and print to stdout
    generation_log['Generation'] = generations
    
    generation_log['GenRewMean'] = returns.mean()
    generation_log['GenRewStd'] = returns.std()
    generation_log['GenLenMean'] = lengths.mean()
    
    generation_log['EvalGenRewardMean'] = np.nan if not eval_returns else np.mean(eval_returns)
    generation_log['EvalGenRewardStd'] = np.nan if not eval_returns else np.std(eval_returns)
    generation_log['EvalGenLengthMean'] = np.nan if not eval_lengths else np.mean(eval_lengths)
    generation_log['EvalGenCount'] = len(eval_returns)
    
    generation_log['EpisodesThisGen'] = lengths.size
    generation_log['EpisodesSoFar'] = episodes_so_far
    generation_log['TimestepsThisGen'] = lengths.sum()
    generation_log['TimestepsSoFar'] = timesteps_so_far
    
    generation_log['UniqueWorkers'] = config.num_workers
    generation_log['ResultsSkippedFrac'] = frac_results_skipped
    generation_log['ObCount'] = ob_count_this_gen
    
    generation_log['TimeElapsedThisGen'] = step_tend - step_tstart
    generation_log['TimeElapsed'] = step_tend - tstart
    
    generation_log['TimePerMutationMin'] = np.amin(times_per_mutation)
    generation_log['TimePerMutationMax'] = np.amax(times_per_mutation)
    generation_log['TimePerMutationMean'] = np.mean(times_per_mutation)
    generation_log['TimePerMutationCount'] = len(times_per_mutation)
    
    generation_log['TimeCreateModelMin'] = np.amin(times_create_model)
    generation_log['TimeCreateModelMax'] = np.amax(times_create_model)
    generation_log['TimeCreateModelMean'] = np.mean(times_create_model)
    generation_log['TimeCreateModelCount'] = len(times_create_model)
    
    generation_log['TimeSetFlatMin'] = np.amin(times_set_flat)
    generation_log['TimeSetFlatMax'] = np.amax(times_set_flat)
    generation_log['TimeSetFlatMean'] = np.mean(times_set_flat)
    generation_log['TimeSetFlatCount'] = len(times_set_flat)
    
    generation_log['TimeSampleMin'] = np.amin(times_sample)
    generation_log['TimeSampleMax'] = np.amax(times_sample)
    generation_log['TimeSampleMean'] = np.mean(times_sample)
    generation_log['TimeSampleCount'] = len(times_sample)
    
    generation_log['TimeGetNoiseMin'] = np.amin(times_get_noise)
    generation_log['TimeGetNoiseMax'] = np.amax(times_get_noise)
    generation_log['TimeGetNoiseMean'] = np.mean(times_get_noise)
    generation_log['TimeGetNoiseCount'] = len(times_get_noise)
    
    generation_log['TimePredictMin'] = np.amin(times_predict)
    generation_log['TimePredictMax'] = np.amax(times_predict)
    generation_log['TimePredictMean'] = np.mean(times_predict)
    generation_log['TimePredictCount'] = len(times_predict)
    
    generation_log['TimeClearSessMin'] = np.amin(times_clear_sess)
    generation_log['TimeClearSessMax'] = np.amax(times_clear_sess)
    generation_log['TimeClearSessMean'] = np.mean(times_clear_sess)
    generation_log['TimeClearSessCount'] = len(times_clear_sess)
    
    generation_log['TimeSetSessMin'] = np.amin(times_set_sess)
    generation_log['TimeSetSessMax'] = np.amax(times_set_sess)
    generation_log['TimeSetSessMean'] = np.mean(times_set_sess)
    generation_log['TimeSetSessCount'] = len(times_set_sess)
    
    for key, value in generation_log.items():
        print(f'{key:25} {value}')
    
    # Append the log to the CSV file
    with open(generation_log_file, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow(generation_log)

    # Note that the model is created with a custom layer and custom initializer, and therefore needs these two
    # custom classes if one wants to load a saved model
    if config.snapshot_freq != 0 and generations % config.snapshot_freq == 0:
        from multiprocessing import Process
        
        p = Process(target=create_model, args=(
                                            theta, 
                                            config.env_id + "_Generation_" + str(generations), 
                                            save_directory + 'snapshot_{:05d}'.format(generations) + ".h5",
                                            ob_mean,
                                            ob_std))
        p.start()
        p.join()
        
        print("Saved model in generation {} to {}".format(generations, save_directory))
            
    generations += 1
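
The core of the update in the loop above can be tried out in isolation. The following single-process sketch applies mirrored sampling and the plain update `theta += (learning_rate / noise_stdev) * g` to a toy objective. The objective, the vector `solution` and all hyperparameter values are assumptions chosen for illustration; fitness shaping, the noise table and the workers are left out.

```python
import numpy as np

solution = np.array([0.5, 0.1, -0.3])  # hypothetical optimum

def fitness(theta):
    # Toy return to maximize, largest at `solution`
    return -np.sum(np.square(theta - solution))

rng = np.random.RandomState(0)
demo_theta = np.zeros(3)
noise_stdev, learning_rate, pairs_per_generation = 0.1, 0.05, 50

initial_fitness = fitness(demo_theta)
for generation in range(200):
    perturbations = rng.randn(pairs_per_generation, demo_theta.size)
    # Evaluate every perturbation in both directions (mirrored sampling)
    pair_returns = np.array([[fitness(demo_theta + noise_stdev * n),
                              fitness(demo_theta - noise_stdev * n)] for n in perturbations])
    # Weight each noise vector by the difference of its two returns
    weights = pair_returns[:, 0] - pair_returns[:, 1]
    g = weights @ perturbations / pair_returns.size
    demo_theta = demo_theta + (learning_rate / noise_stdev) * g
final_fitness = fitness(demo_theta)
```

On this toy problem `demo_theta` should move close to `solution` within the 200 generations, without any gradient of `fitness` being computed.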