# Optimization - CEM

<img src="https://raw.githubusercontent.com/jeremiedecock/polytechnique-inf581-2021/master/logo.jpg" style="float: left; width: 15%" />

[INF581-2021](https://moodle.polytechnique.fr/course/view.php?id=9352) Lab session #8

2019-2021 Jérémie Decock

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeremiedecock/polytechnique-inf581-2021/blob/master/lab8_optim_cem_answers.ipynb)

[![My Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jeremiedecock/polytechnique-inf581-2021/master?filepath=lab8_optim_cem_answers.ipynb)

[![NbViewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/jeremiedecock/polytechnique-inf581-2021/blob/master/lab8_optim_cem_answers.ipynb)

[![Local](https://img.shields.io/badge/Local-Save%20As...-blue)](https://github.com/jeremiedecock/polytechnique-inf581-2021/raw/master/lab8_optim_cem_answers.ipynb)

**Notice**: this notebook requires the following libraries: OpenAI *Gym*, NumPy, Pandas, Seaborn and imageio.

You can install them with the following command (the next cells do this for you if you use the Google Colab environment):

```
pip install gym[box2d] numpy pandas seaborn imageio
```

Cf. https://github.com/openai/gym#installing-everything

In [None]:
%matplotlib inline

import subprocess

try:
    from inf581 import *
except ModuleNotFoundError:
    process = subprocess.Popen("pip install inf581".split(), stdout=subprocess.PIPE)

    for line in process.stdout:
        print(line.decode().strip())

    from inf581 import *

import matplotlib.pyplot as plt

import gym
import math
import numpy as np
import pandas as pd
import seaborn as sns
import itertools
import json

from IPython.display import Image   # To display GIF images in the notebook

In [None]:
sns.set_context("talk")

$
\newcommand{\vs}[1]{\mathbf{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\mathbf{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\U{V}
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
%
\def\E{\mathbb{E}}
%\newcommand{transition}{T(s,a,s')}
%\newcommand{transitionfunc}{\mathcal{T}^a_{ss'}}
\newcommand{transitionfunc}{P}
\newcommand{transitionfuncinst}{P(\nextstate|\state,\action)}
\newcommand{transitionfuncpi}{\mathcal{T}^{\pi_i(s)}_{ss'}}
\newcommand{rewardfunc}{r}
\newcommand{rewardfuncinst}{r(\state,\action,\nextstate)}
\newcommand{rewardfuncpi}{r(s,\pi_i(s),s')}
\newcommand{statespace}{\mathcal{S}}
\newcommand{statespaceterm}{\mathcal{S}^F}
\newcommand{statespacefull}{\mathcal{S^+}}
\newcommand{actionspace}{\mathcal{A}}
\newcommand{reward}{R}
\newcommand{statet}{S}
\newcommand{actiont}{A}
\newcommand{newstatet}{S'}
\newcommand{nextstate}{\state'}
\newcommand{newactiont}{A'}
\newcommand{stepsize}{\alpha}
\newcommand{discount}{\gamma}
\newcommand{qtable}{Q}
\newcommand{finalstate}{\state_F}
%
\newcommand{\vs}[1]{\boldsymbol{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\boldsymbol{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\vit{Value Iteration}
\def\pit{Policy Iteration}
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
\def\cstateset{\mathcal{X}} %%%
\def\x{\vs{x}}                    % TODO cstate
\def\cstate{\vs{x}}               %%%
\def\policy{\pi}
\def\piparam{\vs{\theta}}         % TODO pparam
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\caction{\vs{u}}       % action
\def\cactionset{\mathcal{U}} %%%
\def\decision{\vs{d}}       % decision
\def\randvar{\vs{\omega}}       %%%
\def\randset{\Omega}       %%%
\def\transition{T}       %%%
\def\immediatereward{r}    %%%
\def\strategichorizon{s}    %%% % TODO
\def\tacticalhorizon{k}    %%%  % TODO
\def\operationalhorizon{h}    %%%
\def\constalpha{a}    %%%
\def\U{V}              % utility function
\def\valuefunc{V}
\def\X{\mathcal{X}}
\def\meu{Maximum Expected Utility}
\def\finaltime{T}
\def\timeindex{t}
\def\iterationindex{i}
\def\decisionfunc{d}       % action
\def\mdp{\text{MDP}}
$

## Introduction

In the previous lab we studied a method that allowed us to apply reinforcement learning in continuous state spaces and/or continuous action spaces.
We used REINFORCE, a *Policy Gradient* method that directly optimizes the parametric policy $\pi_{\theta}$.
The parameter $\theta$ was iteratively updated toward a local maximum of the total expected reward $J(\theta)$ using a gradient ascent method:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta)$$
A convenient analytical formulation of $\nabla_{\theta}J(\theta)$ was obtained thanks to the *Policy Gradient theorem*:

$$\nabla_\theta J(\theta) = \nabla_\theta V^{\pi_\theta}(s) = \E_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta (s,a) Q^{\pi_\theta}(s,a) \right].$$
However, gradient ascent methods may converge slowly and will only find a local optimum.
Moreover, this approach requires an analytical formulation of $\nabla_\theta \log \pi_\theta (s,a)$, which is not always available (for instance when something other than a neural network is used for the agent's policy).
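
To illustrate what such an analytical formulation looks like, here is a minimal sketch for a logistic (Bernoulli) policy of the kind used in Exercise 1, assuming $\pi_\theta(s, 1) = \sigma(s \cdot \theta)$ with $\sigma$ the sigmoid function. In that case $\nabla_\theta \log \pi_\theta(s,a) = (a - \sigma(s \cdot \theta))\, s$. The function names (`prob_push_right`, `log_prob`, `grad_log_prob`, `finite_difference_grad`) and the numerical values are illustrative assumptions, not part of the lab's code. The sketch is written in plain Python and only compares the closed-form gradient with a finite-difference estimate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_push_right(state, theta):
    """P(a = 1 | s) for a logistic policy."""
    return sigmoid(sum(s_i * t_i for s_i, t_i in zip(state, theta)))


def log_prob(action, state, theta):
    """log pi_theta(s, a) for a in {0, 1}."""
    p = prob_push_right(state, theta)
    return math.log(p if action == 1 else 1.0 - p)


def grad_log_prob(action, state, theta):
    """Closed-form gradient of log pi_theta(s, a) with respect to theta."""
    p = prob_push_right(state, theta)
    return [(action - p) * s_i for s_i in state]


def finite_difference_grad(action, state, theta, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for i in range(len(theta)):
        theta_plus, theta_minus = list(theta), list(theta)
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad.append((log_prob(action, state, theta_plus)
                     - log_prob(action, state, theta_minus)) / (2.0 * eps))
    return grad


# Arbitrary example values (4 state variables, as in CartPole)
state = [0.1, -0.3, 0.05, 0.2]
theta = [0.5, -1.0, 2.0, 0.3]

for action in (0, 1):
    print(action, grad_log_prob(action, state, theta), finite_difference_grad(action, state, theta))
```

When no such closed form is available, a gradient-free method only needs to evaluate the return of a given $\theta$.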

Direct Policy Search methods using gradient-free optimization procedures such as the Cross-Entropy Method (CEM) are interesting alternatives to Policy Gradient algorithms.
They can be successfully applied as long as the policy $\pi_\theta$ has no more than a few hundred parameters.
Moreover, these methods can solve complex problems that cannot be modeled as Markov Decision Processes.
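
The general idea of CEM can be sketched as follows: sample candidate parameter vectors from a Gaussian, keep the best ones (the "elite"), refit the Gaussian on them, and repeat. The snippet below is a minimal, dependency-free sketch under stated assumptions: a diagonal Gaussian, a hypothetical two-dimensional `toy_objective` whose maximum is at $(3, -1)$, and arbitrary hyperparameters. The names `cem_sketch` and `toy_objective` are illustrative and this is not the implementation expected in the exercises.

```python
import math
import random
import statistics


def cem_sketch(objective, mean, var, n_iterations=30, sample_size=50,
               elite_frac=0.2, seed=0):
    """Maximize `objective` with a diagonal-Gaussian cross-entropy method."""
    rng = random.Random(seed)
    n_elite = math.ceil(sample_size * elite_frac)

    for _ in range(n_iterations):
        # Sample a population of candidate solutions
        samples = [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
                   for _ in range(sample_size)]

        # Keep the candidates with the highest scores
        samples.sort(key=objective, reverse=True)
        elite = samples[:n_elite]

        # Refit the Gaussian (per-dimension mean and variance) on the elite
        mean = [statistics.mean(column) for column in zip(*elite)]
        var = [statistics.pvariance(column) for column in zip(*elite)]

    return mean


def toy_objective(theta):
    # Hypothetical objective, maximal at theta = (3, -1)
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2)


best = cem_sketch(toy_objective, mean=[0.0, 0.0], var=[25.0, 25.0])
print(best)
```

In a reinforcement learning setting, `objective` would be the (noisy) return obtained by running the policy parametrized by $\theta$ in the environment.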

As in the previous Reinforcement Learning labs, we will use standard problems provided by the OpenAI Gym suite.
In particular, we will try to solve the MountainCarContinuous-v0 problem (https://github.com/openai/gym/wiki/MountainCarContinuous-v0), which offers both continuous state and action spaces.

An underpowered car must climb a one-dimensional hill to reach a target.
The target is on top of a hill on the right-hand side of the car. If the car reaches it or goes beyond, the episode terminates.
On the left-hand side, there is another hill. Climbing this hill can be used to gain potential energy and accelerate towards the target. On top of this second hill, the car cannot go further than a position equal to -1, as if there was a wall. Hitting this limit does not generate a penalty. The problem is considered solved if a reward of 90 is obtained. 

## Exercise 1: Implement CEM and test it on the CartPole environment

Fill in the following cells to implement the CEM algorithm (introduced at the beginning of this notebook).

In [None]:
# Since the goal is to attain an average return of 195, the horizon should be larger than 195 steps.
# CartPole-v0 itself truncates episodes after 200 steps, so 900 is simply a safe upper bound.
EPISODE_DURATION = 900
NUM_EPISODES = 10  # 100


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def logistic_regression(s, theta):
    prob_push_right = sigmoid(np.dot(s, np.transpose(theta)))
    return prob_push_right


def draw_action(s, theta):
    prob_push_right = logistic_regression(s, theta)
    r = np.random.rand()
    if r < prob_push_right:
        return 1
    else:
        return 0


# Generate an episode
def eval_one_episode(theta, max_episode_length=EPISODE_DURATION, render=False):
    s_t = env.reset()

    episode_rewards = []

    for t in range(max_episode_length):

        if render:
            env.render_wrapper.render()

        a_t = draw_action(s_t, theta)
        s_t, r_t, done, info = env.step(a_t)

        episode_rewards.append(r_t)

        if done:
            break

    return sum(episode_rewards)


def eval_multiple_episodes(theta, num_episodes=NUM_EPISODES, max_episode_length=EPISODE_DURATION, render=False):
    
    reward_list = []

    for episode_index in range(num_episodes):
        episode_global_reward = eval_one_episode(theta, max_episode_length, render)
        reward_list.append(episode_global_reward)

        if render:
            print("Test Episode {}: Total Reward = {}".format(episode_index, episode_global_reward))

    return np.mean(reward_list)

**Task 1**: Implement the `cem_uncorrelated` function that updates the $\theta$ parameters with a Cross-Entropy Method until the agent obtains an average reward greater than or equal to 195 over 100 consecutive trials.
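
One common way to write the update performed at each iteration, assuming an uncorrelated (diagonal) Gaussian proposal $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and an elite set made of the $K$ best samples $\theta^{(1)}, \dots, \theta^{(K)}$ (this notation is introduced here for illustration only and is not used elsewhere in the notebook):

$$\mu_j \leftarrow \frac{1}{K} \sum_{k=1}^{K} \theta^{(k)}_j, \qquad \sigma_j^2 \leftarrow \frac{1}{K} \sum_{k=1}^{K} \left(\theta^{(k)}_j - \mu_j\right)^2,$$

where $j$ indexes the components of $\theta$ and the $\mu_j$ appearing in the second formula is the updated mean.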

In [None]:
def cem_uncorrelated(objective_function, mean_array, var_array,
                     n_iterations=500, sample_size=50, elite_frac=0.2,
                     print_every=10,
                     hist_dict=None):
    """Cross-entropy method.
        
    Params
    ======
        objective_function (function): the function to maximize
        mean_array (array of floats): the initial proposal distribution (mean)
        var_array (array of floats): the initial proposal distribution (variance)
        n_iterations (int): number of training iterations
        sample_size (int): size of population at each iteration
        elite_frac (float): rate of top performers to use in update with elite_frac ∈ ]0;1]
        print_every (int): how often to print average score
        hist_dict (dict): logs
    """
    assert 0. < elite_frac <= 1.

    n_elite = math.ceil(sample_size * elite_frac)

    for iteration_index in range(0, n_iterations):

        # SAMPLE A NEW POPULATION OF SOLUTIONS (THETA VECTORS) ################

        theta_array = np.random.multivariate_normal(mean=mean_array, cov=np.diag(var_array), size=(sample_size))

        # EVALUATE SAMPLES AND EXTRACT THE BEST ONES ("ELITE") ################

        score_array = np.array([objective_function(theta) for theta in theta_array])

        sorted_indices_array = score_array.argsort()              # Sort from the lowest score to the highest one
        elite_indices_array = sorted_indices_array[-n_elite:]     # We want to *maximize* the objective function, thus we take the samples at the end of sorted_indices_array

        elite_theta_array = theta_array[elite_indices_array]

        #df = pd.DataFrame(theta_array)
        #df['elite'] = 0
        #df.iloc[elite_indices_array, -1] = 1
        #print(df)
        #sns.pairplot(df, hue="elite")
        #plt.show()

        # FIT THE NORMAL DISTRIBUTION ON THE ELITE POPULATION #################

        mean_array = elite_theta_array.mean(axis=0)
        var_array = elite_theta_array.var(axis=0)

        # PRINT STATUS ########################################################

        # Evaluate the updated mean once and reuse the score for printing and logging
        # (each call to objective_function runs at least one full episode)
        if iteration_index % print_every == 0 or hist_dict is not None:
            mean_score = objective_function(mean_array)

        if iteration_index % print_every == 0:
            #print("Iteration {}\tScore {}\tMean: {}\tVar: {}".format(iteration_index, mean_score, str(mean_array), str(var_array)))
            print("Iteration {}\tScore {}".format(iteration_index, mean_score))

        if hist_dict is not None:
            hist_dict[iteration_index] = [mean_score] + mean_array.tolist() + var_array.tolist()

    return mean_array

**Task 2:** Train the agent

In [None]:
env = gym.make("CartPole-v0")

In [None]:
hist_dict = {}

objective_function = eval_one_episode
#objective_function = eval_multiple_episodes

init_mean_array = np.zeros(4)
init_var_array = np.ones(4) * 100.

theta = cem_uncorrelated(objective_function=objective_function, mean_array=init_mean_array, var_array=init_var_array,
                         n_iterations=50, sample_size=50, elite_frac=0.1,
                         print_every=1,
                         hist_dict=hist_dict)

In [None]:
env.close()

In [None]:
df = pd.DataFrame.from_dict(hist_dict, orient='index', columns=["score", "mu1", "mu2", "mu3", "mu4", "var1", "var2", "var3", "var4"])
df.score.plot(title="Average reward", figsize=(30, 5));
plt.xlabel("Training Steps")
plt.ylabel("Reward")

df[["mu1", "mu2", "mu3", "mu4"]].plot(title="Theta w.r.t training steps", figsize=(30, 5));
plt.xlabel("Training Steps")

df[["var1", "var2", "var3", "var4"]].plot(logy=True, title="Variance w.r.t training steps", figsize=(30, 5))
plt.xlabel("Training Steps");

In [None]:
theta

**Task 3:** Test final policy

In [None]:
#np.random.seed(1234)
env = gym.make("CartPole-v0")
RenderWrapper.register(env, force_gif=True)

In [None]:
eval_multiple_episodes(theta, num_episodes=5, render=True)
env.render_wrapper.make_gif("ex1")

In [None]:
env.close()

## Exercise 2: Implement a policy for environments having a continuous action space

In [None]:
###############################################################################
# Parametric Stochastic Policy ################################################
###############################################################################

# Activation functions ########################################################

def identity(x):
    return x

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.)   # Element-wise max(x, 0)

# Dense Multi-Layer Neural Network ############################################

class NeuralNetworkPolicy:

    def __init__(self, activation_functions, shape_list):
        self.activation_functions = activation_functions
        self.shape_list = shape_list

    def __call__(self, state, theta):
        weights = unflatten_weights(theta, self.shape_list)

        return feed_forward(inputs=state,
                            weights=weights,
                            activation_functions=self.activation_functions)



def feed_forward(inputs, weights, activation_functions, verbose=False):
    x = inputs.copy()
    for layer_weights, layer_activation_fn in zip(weights, activation_functions):

        y = np.dot(x, layer_weights[1:])
        y += layer_weights[0]
        layer_output = layer_activation_fn(y)

        if verbose:
            print("x", x)
            print("bias", layer_weights[0])
            print("W", layer_weights[1:])
            print("y", y)
            print("z", layer_output)

        x = layer_output
        
        if verbose:
            print("...")

    return layer_output


def weights_shape(weights):
    return [weights_array.shape for weights_array in weights]


def flatten_weights(weights):
    """Convert weight parameters to a 1 dimension array (more convenient for optimization algorithms)"""
    nested_list = [weights_2d_array.flatten().tolist() for weights_2d_array in weights]
    flat_list = list(itertools.chain(*nested_list))
    return flat_list


def unflatten_weights(flat_list, shape_list):
    length_list = [shape[0] * shape[1] for shape in shape_list]

    nested_list = []
    start_index = 0

    for length, shape in zip(length_list, shape_list):
        nested_list.append(np.array(flat_list[start_index:start_index+length]).reshape(shape))
        start_index += length

    return nested_list
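
As a quick sanity check on the size of the flat parameter vector handled by the optimizer, the sketch below counts the weights of a network with one hidden layer, using the same convention as `feed_forward` (the first row of each layer matrix holds the biases). The helper names `layer_shapes`, `count_params` and `layer_slices` are hypothetical, and the dimensions assume the usual Gym observation and action sizes: 2 and 1 for MountainCarContinuous-v0, 8 and 2 for LunarLanderContinuous-v2.

```python
def layer_shapes(observation_dim, action_dim, hidden_units=16):
    """Shapes of the weight matrices of a network with one hidden layer.

    Each matrix has one extra row for the bias, as assumed by feed_forward.
    """
    return [(observation_dim + 1, hidden_units), (hidden_units + 1, action_dim)]


def count_params(shape_list):
    """Length of the flat vector that unflatten_weights would expect."""
    return sum(rows * cols for rows, cols in shape_list)


def layer_slices(shape_list):
    """(start, stop) indices of each layer inside the flat vector."""
    slices, start = [], 0
    for rows, cols in shape_list:
        slices.append((start, start + rows * cols))
        start += rows * cols
    return slices


mountain_car_shapes = layer_shapes(observation_dim=2, action_dim=1)
lunar_lander_shapes = layer_shapes(observation_dim=8, action_dim=2)

print(count_params(mountain_car_shapes), layer_slices(mountain_car_shapes))
print(count_params(lunar_lander_shapes), layer_slices(lunar_lander_shapes))
```

Under these assumptions the MountainCar network has (2 + 1) × 16 + 17 × 1 = 65 parameters and the LunarLander network has (8 + 1) × 16 + 17 × 2 = 178, so the mean and variance arrays given to the optimizer must have the matching length.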

In [None]:
###############################################################################
# Objective function ##########################################################
###############################################################################

class ObjectiveFunction:

    def __init__(self, env, policy, ndim, num_episodes=1, max_time_steps=float('inf')):
        self.ndim = ndim
        self.env = env
        self.policy = policy
        self.num_episodes = num_episodes
        self.max_time_steps = max_time_steps

        self.num_evals = 0
        self.hist = []
        self.hist_policy = []

    def eval(self, policy_params, num_episodes=None, render=False):
        """Evaluate a policy"""

        self.num_evals += 1

        if num_episodes is None:
            num_episodes = self.num_episodes

        average_total_rewards = 0

        for i_episode in range(num_episodes):

            total_rewards = 0.
            state = self.env.reset()

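            # itertools.count() also supports the default max_time_steps=float('inf'),
            # which range() would reject
            for t in itertools.count():
                if render:
                    self.env.render_wrapper.render()

                action = self.policy(state, policy_params)
                state, reward, done, info = self.env.step(action)
                total_rewards += reward

                if done or t + 1 >= self.max_time_steps:
                    break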

            average_total_rewards += float(total_rewards) / num_episodes

            if render:
                print("Test Episode {0}: Total Reward = {1}".format(i_episode, total_rewards))

        # Logs

        if LOG_SCORES:
            self.hist.append({"eval": self.num_evals,
                              "episode": i_episode,
                              "total_rewards": total_rewards})

            if self.num_evals % LOG_RECORD_INTERVAL == 0:
                with open("fobj_hist.json", "w") as fd:
                    json.dump(self.hist, fd)

        if LOG_POLICIES:
            self.hist_policy.append(policy_params.tolist())

            if self.num_evals % LOG_RECORD_INTERVAL == 0:
                with open("fobj_hist_policy.json", "w") as fd:
                    json.dump(self.hist_policy, fd)

        if VERBOSE:
            print("Avg total Reward = {:0.3f}, Theta = {}".format(average_total_rewards, policy_params))

        return average_total_rewards   # To be maximized: cem_uncorrelated keeps the highest-scoring samples

    def __call__(self, policy_params):
        return self.eval(policy_params)
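
To make the evaluation loop above easier to follow, here is a minimal, dependency-free sketch of the same idea: run a policy for a few episodes and return the average total reward. `ToyEnv`, `linear_policy` and `average_return` are hypothetical names introduced for illustration. `ToyEnv` only mimics the `reset()` / 4-tuple `step()` interface used in this notebook and is not a Gym environment.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the agent should move to position 1.0."""

    def __init__(self, horizon=20):
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.position = 0.0
        return [self.position]

    def step(self, action):
        self.t += 1
        self.position += action
        reward = -abs(self.position - 1.0)   # The closer to 1.0, the better
        done = self.t >= self.horizon
        return [self.position], reward, done, {}


def linear_policy(state, theta):
    # Deterministic policy: action = theta[0] * position + theta[1]
    return theta[0] * state[0] + theta[1]


def average_return(env, policy, theta, num_episodes=3, max_time_steps=100):
    """Average total reward of `policy` parametrized by `theta` over several episodes."""
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset()
        for _ in range(max_time_steps):
            state, reward, done, _ = env.step(policy(state, theta))
            total += reward
            if done:
                break
    return total / num_episodes


env = ToyEnv()
good = average_return(env, linear_policy, theta=[-1.0, 1.0])   # Jumps to 1.0 and stays there
bad = average_return(env, linear_policy, theta=[0.0, 0.0])     # Never moves
print(good, bad)
```

An optimizer such as CEM only sees the mapping from `theta` to this average return and never needs its gradient.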

## Exercise 3: solve the LunarLander problem (continuous version) with CEM

**Task 1:** Read https://gym.openai.com/envs/LunarLanderContinuous-v2/ and https://github.com/openai/gym/wiki/Leaderboard#lunarlander-v2 to discover the LunarLanderContinuous environment.

**Notice:** A reminder of Gym main concepts is available at https://gym.openai.com/docs/.

Print some information about the environment:

In [None]:
env = gym.make('LunarLanderContinuous-v2')
print("State space dimension is:", env.observation_space.shape[0])
print("State upper bounds:", env.observation_space.high)
print("State lower bounds:", env.observation_space.low)
print("Actions upper bounds:", env.action_space.high)
print("Actions lower bounds:", env.action_space.low)
env.close()

**Task 2:** Run the following cells and check different basic policies (for instance constant actions or randomly drawn actions) to discover the LunarLander environment.
We will then solve this problem with CEM, a gradient-free Direct Policy Search method, rather than with a Policy Gradient method.

Test the LunarLander environment with a constant policy

In [None]:
env = gym.make('LunarLanderContinuous-v2')
RenderWrapper.register(env, force_gif=True)

observation = env.reset()
done = False

for t in range(150):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    ### BEGIN SOLUTION ###

    action = np.array([1., 1.])
    observation, reward, done, info = env.step(action)

    ### END SOLUTION ###


print()
env.close()

env.render_wrapper.make_gif("ex2left")

In [None]:
env = gym.make('LunarLanderContinuous-v2')
RenderWrapper.register(env, force_gif=True)

observation = env.reset()
done = False

for t in range(150):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    ### BEGIN SOLUTION ###

    action = np.array([-1., -1.])
    observation, reward, done, info = env.step(action)

    ### END SOLUTION ###

print()
env.close()

env.render_wrapper.make_gif("ex2right")

Test the LunarLander environment with a random policy

In [None]:
env = gym.make('LunarLanderContinuous-v2')
RenderWrapper.register(env, force_gif=True)

for episode_index in range(3):
    observation = env.reset()
    done = False

    for t in range(100):
        env.render_wrapper.render()

        if not done:
            print(observation)
        else:
            print("x", end="")
        
        ### BEGIN SOLUTION ###

        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        ### END SOLUTION ###

    print()

# Close the environment only once all episodes are done
env.close()

env.render_wrapper.make_gif("ex1random")

**Task 3:** Train the agent

In [None]:
MAX_EPISODE_DURATION = 1500
NUM_EPISODES_PER_EVAL = 1

VERBOSE = False
LOG_SCORES = True
LOG_POLICIES = False
LOG_RECORD_INTERVAL = 1000

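GYM_ENVIRONMENT = "LunarLanderContinuous-v2"
SUCCESS_SCORE = 200   # LunarLander is usually considered solved at an average score of 200

###############################################################################
# Main ########################################################################
###############################################################################

env = gym.make(GYM_ENVIRONMENT)
RenderWrapper.register(env, force_gif=True)

observation_space_dim = env.observation_space.shape[0]
action_space_dim = env.action_space.shape[0]

# Initial proposal distribution: one mean and one variance per network weight
# (hidden layer with bias: (observation_space_dim + 1) * 16, output layer with bias: 17 * action_space_dim)
num_weights = (observation_space_dim + 1) * 16 + 17 * action_space_dim
init_mean_array = np.zeros(num_weights)
init_var_array = np.ones(num_weights) * 1000.

# Fresh log so that entries from the previous exercise are not mixed in
hist_dict = {}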

# Make a neural network with 1 hidden layer of 16 units
weights = (np.zeros([env.observation_space.shape[0] + 1, 16]),
           np.zeros([17, env.action_space.shape[0]]))

# Set the neural network activation functions (one function per layer)
activation_functions = (relu, tanh)

flat_weights_list = flatten_weights(weights)
num_params = len(flat_weights_list)
print("Number of parameters (neural network weights) to optimize:", num_params)
w_shape = weights_shape(weights)
print("Number of parameters per layer:", w_shape)

nn_policy = NeuralNetworkPolicy(activation_functions, w_shape)

objective_function = ObjectiveFunction(env=env,
                                       policy=nn_policy,
                                       ndim=num_params,  # number of dimensions of the parameter (weights) space
                                       num_episodes=NUM_EPISODES_PER_EVAL,
                                       max_time_steps=MAX_EPISODE_DURATION)

# Optimization ############################################################

theta = cem_uncorrelated(objective_function=objective_function, mean_array=init_mean_array, var_array=init_var_array,
                         n_iterations=50, sample_size=50, elite_frac=0.2,
                         print_every=1,
                         hist_dict=hist_dict)

print("Optimization finished after {} evaluations".format(objective_function.num_evals))
print("Optimized weights: ", theta)

# Test final policy
objective_function.eval(theta, num_episodes=3, render=True)

env.close()

In [None]:
env.close()

In [None]:
df = pd.DataFrame.from_dict(hist_dict, orient='index')
df.iloc[:,0].plot(title="Average reward", figsize=(30, 5));
plt.xlabel("Training Steps")
plt.ylabel("Reward")

In [None]:
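n_params = (df.shape[1] - 1) // 2   # Columns are: the score, then the means, then the variances
df.iloc[:, 1:1 + n_params].plot(title="Theta w.r.t training steps", figsize=(30, 25))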
plt.xlabel("Training Steps");

In [None]:
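n_params = (df.shape[1] - 1) // 2   # Columns are: the score, then the means, then the variances
df.iloc[:, 1 + n_params:].plot(logy=True, title="Variance w.r.t training steps", figsize=(30, 25))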
plt.xlabel("Training Steps");

In [None]:
theta

**Task 4:** Test final policy

In [None]:
#np.random.seed(1234)
env = gym.make(GYM_ENVIRONMENT)
RenderWrapper.register(env, force_gif=True)

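# Test final policy on the freshly created environment
# (objective_function still references the environment that was closed above)
objective_function.env = env
objective_function.eval(theta, num_episodes=3, render=True)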

env.close()

## Exercise 4: implement SAES and solve the LunarLander problem (continuous version) with it

In [None]:
### TODO...
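
As a possible starting point, here is a minimal sketch of one variant, assuming that SAES stands for Self-Adaptive Evolution Strategy: a $(\mu/\mu, \lambda)$ strategy with a single step size adapted by log-normal mutation. The function names (`saes_sketch`, `negative_sphere`), the hyperparameters and the toy objective are illustrative assumptions. The sketch is written in plain Python, maximizes its objective like `cem_uncorrelated`, and is not tuned for LunarLander.

```python
import math
import random


def saes_sketch(objective, x_init, sigma_init=1.0, n_generations=300,
                n_parents=3, n_offspring=12, seed=0):
    """Maximize `objective` with a self-adaptive (mu/mu, lambda) evolution strategy."""
    rng = random.Random(seed)
    n_dim = len(x_init)
    tau = 1.0 / math.sqrt(2.0 * n_dim)   # Learning rate of the step size

    parent_x, parent_sigma = list(x_init), sigma_init

    for _ in range(n_generations):
        offspring = []
        for _ in range(n_offspring):
            # Mutate the step size first, then use it to mutate the solution
            sigma = parent_sigma * math.exp(tau * rng.gauss(0.0, 1.0))
            x = [x_i + sigma * rng.gauss(0.0, 1.0) for x_i in parent_x]
            offspring.append((objective(x), x, sigma))

        # Comma selection: the next parents are chosen among the offspring only
        offspring.sort(key=lambda item: item[0], reverse=True)
        elite = offspring[:n_parents]

        # Intermediate recombination: average the selected solutions and step sizes
        parent_x = [sum(x[i] for _, x, _ in elite) / n_parents for i in range(n_dim)]
        parent_sigma = sum(sigma for _, _, sigma in elite) / n_parents

    return parent_x, parent_sigma


def negative_sphere(x):
    # Hypothetical toy objective, maximal (equal to 0) at the origin
    return -sum(x_i ** 2 for x_i in x)


best_x, final_sigma = saes_sketch(negative_sphere, x_init=[5.0, 5.0])
print(best_x, final_sigma)
```

To apply it to the exercise, `objective` would be an `ObjectiveFunction` instance and `x_init` a flat weight vector of the appropriate length.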

## Bonus exercise 1: solve the MountainCar environment (continuous version)

**Task 1:** Read https://gym.openai.com/envs/MountainCarContinuous-v0/ and https://github.com/openai/gym/wiki/MountainCarContinuous-v0 to discover the MountainCar environment.

**Notice:** A reminder of Gym main concepts is available at https://gym.openai.com/docs/.

Print some information about the environment:

In [None]:
env = gym.make('MountainCarContinuous-v0')
print("State space dimension is:", env.observation_space.shape[0])
print("State upper bounds:", env.observation_space.high)
print("State lower bounds:", env.observation_space.low)
print("Actions upper bounds:", env.action_space.high)
print("Actions lower bounds:", env.action_space.low)
env.close()

**Task 2:** Run the following cells and check different basic policies (for instance constant actions or randomly drawn actions) to discover the MountainCar environment.
Although this environment has simple dynamics that can be computed analytically, we will solve this problem with CEM, a gradient-free Direct Policy Search method.

Test the MountainCar environment with a constant policy

In [None]:
env = gym.make('MountainCarContinuous-v0')
RenderWrapper.register(env, force_gif=True)

observation = env.reset()
done = False

for t in range(200):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    ### BEGIN SOLUTION ###

    action = np.array([1.])
    observation, reward, done, info = env.step(action)

    ### END SOLUTION ###


print()
env.close()

env.render_wrapper.make_gif("ex2left")

In [None]:
env = gym.make('MountainCarContinuous-v0')
RenderWrapper.register(env, force_gif=True)

observation = env.reset()
done = False

for t in range(200):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    ### BEGIN SOLUTION ###

    action = np.array([-1.])
    observation, reward, done, info = env.step(action)

    ### END SOLUTION ###

print()
env.close()

env.render_wrapper.make_gif("ex2right")

Test the MountainCar environment with a random policy

In [None]:
env = gym.make('MountainCarContinuous-v0')
RenderWrapper.register(env, force_gif=True)

for episode_index in range(3):
    observation = env.reset()
    done = False

    for t in range(100):
        env.render_wrapper.render()

        if not done:
            print(observation)
        else:
            print("x", end="")
        
        ### BEGIN SOLUTION ###

        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        ### END SOLUTION ###

    print()

# Close the environment only once all episodes are done
env.close()

env.render_wrapper.make_gif("ex1random")

**Task 3:** Solve the MountainCar problem with CEM

In [None]:
MAX_EPISODE_DURATION = 1500
NUM_EPISODES_PER_EVAL = 1

VERBOSE = False
LOG_SCORES = True
LOG_POLICIES = False
LOG_RECORD_INTERVAL = 1000

GYM_ENVIRONMENT = "MountainCarContinuous-v0"
SUCCESS_SCORE = 90

###############################################################################
# Main ########################################################################
###############################################################################

env = gym.make(GYM_ENVIRONMENT)
RenderWrapper.register(env, force_gif=True)

observation_space_dim = env.observation_space.shape[0]

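# Initial proposal distribution: one mean and one variance per network weight
# ((2 + 1) * 16 weights for the hidden layer and 17 * 1 for the output layer, i.e. 65 in total)
init_mean_array = np.zeros(65)
init_var_array = np.ones(65) * 1000.

# Fresh log so that entries from the previous exercises are not mixed in
hist_dict = {}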

# Make a neural network with 1 hidden layer of 16 units
weights = (np.zeros([env.observation_space.shape[0] + 1, 16]),
           np.zeros([17, env.action_space.shape[0]]))

# Set the neural network activation functions (one function per layer)
activation_functions = (relu, tanh)

flat_weights_list = flatten_weights(weights)
num_params = len(flat_weights_list)
print("Number of parameters (neural network weights) to optimize:", num_params)
w_shape = weights_shape(weights)
print("Number of parameters per layer:", w_shape)

nn_policy = NeuralNetworkPolicy(activation_functions, w_shape)

objective_function = ObjectiveFunction(env=env,
                                       policy=nn_policy,
                                       ndim=num_params,  # number of dimensions of the parameter (weights) space
                                       num_episodes=NUM_EPISODES_PER_EVAL,
                                       max_time_steps=MAX_EPISODE_DURATION)

# Optimization ############################################################

theta = cem_uncorrelated(objective_function=objective_function, mean_array=init_mean_array, var_array=init_var_array,
                         n_iterations=50, sample_size=50, elite_frac=0.2,
                         print_every=1,
                         hist_dict=hist_dict)

print("Optimization finished after {} evaluations".format(objective_function.num_evals))
print("Optimized weights: ", theta)

# Test final policy
objective_function.eval(theta, num_episodes=3, render=True)

env.close()

In [None]:
env.close()

In [None]:
df = pd.DataFrame.from_dict(hist_dict, orient='index')
df.iloc[:,0].plot(title="Average reward", figsize=(30, 5));
plt.xlabel("Training Steps")
plt.ylabel("Reward")

In [None]:
df.iloc[:,1:66].plot(title="Theta w.r.t training steps", figsize=(30, 25))
plt.xlabel("Training Steps");

In [None]:
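n_params = (df.shape[1] - 1) // 2   # Columns are: the score, then the means, then the variances
df.iloc[:, 1 + n_params:].plot(logy=True, title="Variance w.r.t training steps", figsize=(30, 25))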
plt.xlabel("Training Steps");

In [None]:
theta

Test final policy

In [None]:
#np.random.seed(1234)
env = gym.make(GYM_ENVIRONMENT)
RenderWrapper.register(env, force_gif=True)

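# Test final policy on the freshly created environment
# (objective_function still references the environment that was closed above)
objective_function.env = env
objective_function.eval(theta, num_episodes=3, render=True)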

env.close()

## Bonus exercise 2: test the CMAES algorithm