## Local Setup

If you prefer to work locally, see the following instructions for setting up Python in a virtual environment.
You can then ignore the instructions in "Colab Setup".

If you haven't yet, create a [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) environment using:
```
conda create --name rl_exercises python
conda activate rl_exercises
```

The dependencies can be installed with pip:
```
pip install matplotlib numpy tqdm scipy
```

Even if you are running the Jupyter notebook locally, please run the code cells in **Colab Setup**, because they define some global variables required later.

## Colab Setup

Google Colab provides you with a temporary environment for Python programming.
While this conveniently works on any platform and handles dependency issues internally, it also requires you to set up the environment from scratch every time.
This "Colab Setup" section is part of **every** exercise and contains utilities that are needed before getting started.

Colab instances time out after about 12 hours while active (and sooner if you close your browser window).
Any changes you make to the Jupyter notebook itself should be saved to your Google Drive.
We also save all recordings and logs in it by default so that you won't lose your work in the event of an instance timeout.
However, you will need to re-mount your Google Drive and re-install packages with every new instance.

In [None]:
"""Your work will be stored in a folder called `rl_ws24` by default to prevent Colab
instance timeouts from deleting your edits.
We do this by mounting your Google Drive on the virtual machine created in this Colab
session. For this, you will likely need to sign in to your Google account and allow
access to your Google Drive files.
"""

from pathlib import Path
try:
    from google.colab import drive
    drive.mount("/content/gdrive")
    COLAB = True
except ImportError:
    COLAB = False

# Create paths in your google drive
if COLAB:
    DATA_ROOT = Path("/content/gdrive/My Drive/rl_ws24")
    DATA_ROOT.mkdir(parents=True, exist_ok=True)

    DATA_ROOT_STR = str(DATA_ROOT)
    %cd "$DATA_ROOT"
else:
    DATA_ROOT = Path.cwd() / "rl_ws24"

# Install python packages
if COLAB:
    %pip install matplotlib numpy tqdm scipy

We start by importing all the necessary python modules and defining some helper
functions which you do not need to change. Still, make sure you are aware of
what they do.

In [None]:
import time
import abc
import os
from typing import *

import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
from matplotlib.patches import Rectangle
from scipy.stats import multivariate_normal

# Set random seed and output paths
SEED = 314159
OUTPUT_FOLDER = DATA_ROOT / "exercise_4" / time.strftime("%Y-%m-%d_%H-%M")
OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)

# this function will automatically save your figure into your google drive folder (if correctly mounted!)
def save_figure(save_name: str) -> None:
    assert save_name is not None, "Need to provide a filename to save to"
    plt.savefig(os.path.join(OUTPUT_FOLDER, save_name + ".png"))


def plot_metrics(metrics: Dict[str, List[float]]):
    """
    Plots various metrics recorded during training
    :param metrics: The metrics to plot
    :return:
    """
    if len(metrics) > 0:
        plt.clf()
        plt.figure(figsize=(16, 9))
        for position, (key, value) in enumerate(metrics.items()):
            plt.subplot(len(metrics), 1, position + 1)
            plt.plot(range(len(value)), np.array(value))
            plt.ylabel(key.title())
            if key == "mean_reward":
                plt.yscale("symlog")
        plt.xlabel("Recorded Steps")
        plt.tight_layout()
        save_figure("training_metrics")
        plt.clf()
        plt.close()


# Exercise 4  **Stochastic Search (15 Pts)**

This exercise is about Stochastic Search methods for blackbox function optimization. In contrast to many Deep Reinforcement Learning algorithms, Stochastic Search Methods do not rely on any assumptions like the Markov property. As such, they are a highly flexible class of algorithms that can be very powerful in different scenarios. In this homework, we will implement a Canonical Evolutionary Strategy (CES) and the Cross Entropy Method (CEM) as two examples of Stochastic Search Methods. Further, we will look at the math behind MOdel-based Relative Entropy Stochastic Search (MORE), showing how Lagrangian optimization can be used to efficiently optimize a Gaussian search distribution with full covariance.

All methods discussed here rely on a Gaussian search distribution. On a high level, they all iteratively execute the following steps to maximize the reward under this distribution.
1. draw samples from the search distribution
2. evaluate the samples on the target function
3. (sort the samples according to the target function, where the best comes first)
4. update the parameters of the search distribution with the evaluations
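
The four steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not one of the algorithms of this exercise: `toy_objective`, `toy_search`, the isotropic search distribution, and the hand-picked shrinking schedule for the standard deviation are all hypothetical choices made only to keep the loop short.

```python
import numpy as np


def toy_objective(x: np.ndarray) -> np.ndarray:
    """Hypothetical target function: higher is better, maximum at (1, -2)."""
    return -np.sum((x - np.array([1.0, -2.0])) ** 2, axis=-1)


def toy_search(num_iterations: int = 50, n_samples: int = 64,
               n_elite: int = 8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(2), 1.0  # isotropic Gaussian search distribution
    for _ in range(num_iterations):
        # 1. draw samples from the search distribution
        samples = mean + std * rng.normal(size=(n_samples, 2))
        # 2. evaluate the samples on the target function
        values = toy_objective(samples)
        # 3. sort the samples, best first
        order = np.argsort(values)[::-1]
        # 4. update the distribution (here: only the mean, from the best samples)
        mean = samples[order[:n_elite]].mean(axis=0)
        std *= 0.9  # assumption: fixed shrinking schedule instead of a learned variance
    return mean


final_mean = toy_search()
```

The algorithms below differ mainly in step 4, i.e., in how the evaluations are turned into new distribution parameters.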

Let's start by setting up the Gaussian Distribution class that we will use for all algorithms and the task that we will optimize for.


## Gaussian

The next code cell defines a Gaussian class containing the utility functions that are needed for the considered algorithms.

In [None]:
class Gaussian:
    """
    A multivariate Gaussian with a full covariance matrix
    """

    def __init__(self, mean: np.array, covariance: np.array):
        if len(mean.shape) < 2:
            mean = np.atleast_2d(mean).reshape([-1, 1])
        self.task_dimension = mean.shape[0]
        self.mean = mean
        self.covariance = covariance

        self.log_det = None  # log determinant
        self.chol_cov = None  # Cholesky factor of the covariance

        self.update_params(mean, covariance)

        # precompute constant value
        self._log_2_pi_k = self.task_dimension * (np.log(2 * np.pi))

    def update_params(self, mean: np.array, covariance: np.array) -> None:
        """
        Updates the parameters of the Gaussian
        :param mean: The new mean. Shape: [task_dimension, 1]
        :param covariance: The new covariance. Shape: [task_dimension, task_dimension]
        :return:
        """
        if len(mean.shape) < 2:
            mean = np.atleast_2d(mean).reshape([-1, 1])
        self.mean = mean
        self.covariance = covariance

        self.chol_cov = np.linalg.cholesky(self.covariance)
        self.log_det = 2 * np.sum(np.log(np.diag(self.chol_cov)))

    def sample(self, n_samples: int) -> np.array:
        """
        Draw n_samples samples from the Gaussian
        :param n_samples: The number of samples to draw
        :return:
        """
        z = np.random.normal(size=(n_samples, self.task_dimension)).T
        x = self.mean + self.chol_cov @ z
        return x.T

    @property
    def entropy(self) -> float:
        """
        Compute (scalar) entropy of the multivariate Gaussian in closed form
        :return:
        """
        return 0.5 * (self.task_dimension + self._log_2_pi_k + self.log_det)
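
The `Gaussian` class relies on two standard identities of the Cholesky factor $L$ with $LL^T = \Sigma$: the log-determinant is $2\sum_i \log L_{ii}$, and $x = \mu + Lz$ with $z \sim \mathcal{N}(0, I)$ has covariance $\Sigma$. The following sketch checks both numerically; the example covariance, mean, and sample count are arbitrary values chosen for illustration.

```python
import numpy as np

covariance = np.array([[2.0, 0.3],
                       [0.3, 0.5]])  # example symmetric positive definite matrix
mean = np.array([[1.0], [-1.0]])     # column vector, shape [task_dimension, 1]

chol = np.linalg.cholesky(covariance)          # lower-triangular L with L @ L.T == covariance
log_det = 2.0 * np.sum(np.log(np.diag(chol)))  # log|Sigma| = 2 * sum_i log L_ii

# reference value computed directly from the covariance
_, reference_log_det = np.linalg.slogdet(covariance)

# x = mean + L z with z ~ N(0, I) has covariance L @ L.T = Sigma
rng = np.random.default_rng(0)
z = rng.normal(size=(2, 100_000))
x = (mean + chol @ z).T                        # shape (n_samples, task_dimension)
empirical_covariance = np.cov(x, rowvar=False)

# closed-form entropy, matching the expression used in the entropy property
k = covariance.shape[0]
entropy = 0.5 * (k + k * np.log(2 * np.pi) + log_det)
```

Computing the log-determinant from the Cholesky factor avoids forming the determinant itself, which can underflow for the very narrow distributions that appear near convergence.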

## Point Reacher
Our task for this exercise is a planar point reaching task. A robot arm starting at position $(0,0)^T$ with $n$ links of unit length is tasked to reach a point at position $(0.7\cdot n, 0)^T$.
The action space consists of a continuous angle for each of the $n$ joints.
To promote smooth solutions, there is an additional penalty term on the squared angles.
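
To make the geometry concrete, here is a minimal sketch of the end-effector position for relative joint angles and unit-length links. The helper name `end_effector` is hypothetical; it only mirrors the idea of accumulating the joint angles and summing the link directions.

```python
import numpy as np


def end_effector(joint_angles: np.ndarray) -> np.ndarray:
    """Tip position of a planar arm with unit-length links and relative joint angles."""
    absolute_angles = np.cumsum(joint_angles, axis=-1)  # orientation of each link in the world frame
    x = np.sum(np.cos(absolute_angles), axis=-1)
    y = np.sum(np.sin(absolute_angles), axis=-1)
    return np.stack([x, y], axis=-1)


straight = end_effector(np.zeros(5))                    # fully stretched arm along the x-axis
bent = end_effector(np.array([np.pi / 2, -np.pi / 2]))  # first link points up, second points right
```

A fully stretched arm ends at $(n, 0)^T$, which overshoots the target at $(0.7\cdot n, 0)^T$. The arm therefore has to bend to reach the target, and the penalty on the squared angles favors solutions that spread this bending smoothly over the joints.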

You do **not** need to adapt the code for this task.

In [None]:
class PointReacher:

    def __init__(self, num_links: int, likelihood_std: float, smoothness_prior_std: Union[np.array, List[float]]):
        """
        Initialization of a simple point reacher task, where the goal is to reach the point (0.7 * num_links, 0) using
        a robot arm with num_links links of length 1. Note that this task does *not* use time-series data, but
        instead evaluates a single angle configuration.
        :param num_links: Number of links of the robot
        :param likelihood_std: The "main" reward (not regarding the smoothness prior) is the closeness to the target
            point. This reward is the log-density of an isotropic Gaussian centered at the target. Note that the
            value is passed to scipy as the diagonal of the covariance matrix, i.e., it acts as a variance.
        :param smoothness_prior_std: Per-joint spread of a zero-mean Gaussian acting in angle space (also passed as
            the diagonal of the covariance matrix). Adds a smoothness prior for each joint, with smaller values
            leading to smoother solutions.
        """
        self._num_links = num_links
        self.target = [0.7 * num_links, 0]
        self._smoothness_likelihood = multivariate_normal(np.zeros(num_links),
                                                          np.array(smoothness_prior_std) * np.eye(num_links))
        self._target_likelihood = multivariate_normal(self.target, likelihood_std * np.eye(2))

    def reward(self, samples: np.array) -> np.array:
        """
        Calculates the reward for the given angles. Good angle configurations are those that
        * reach the target/have a high target (log) likelihood
        * are smooth/have a high smoothness (log) likelihood
        :param samples: An array of shape (..., num_angles) to evaluate
        :return: An array of shape (...), where each entry corresponds to the reward of the
          corresponding sample. Higher values are better.
        """
        samples = self.angle_normalize(samples)
        end_effector_position = self.forward_kinematic(joint_angles=samples)[..., -1, :]
        target_likelihood = self._target_likelihood.logpdf(end_effector_position[..., 0:])
        smoothness_likelihood = self._smoothness_likelihood.logpdf(samples)
        return np.squeeze(target_likelihood + smoothness_likelihood)

    @staticmethod
    def forward_kinematic(joint_angles: Union[List[np.array], np.array]) -> Union[List[np.array], np.array]:
        """
        Calculates the forward kinematic of the robot by interpreting each input value as an angle

        :param joint_angles: The angles of the joints. Can be of arbitrary shape, as long as the last dimension is over
            the relative angles of the robot. I.e., the shape is (..., angles)
        :return: The positions as an array of shape (..., #angles + 1, 2), starting with the base at the origin,
            where the last dimension is the x and y position of each joint.
        """
        angles = np.cumsum(joint_angles, axis=-1)
        pos = np.zeros([*angles.shape[:-1], angles.shape[-1] + 1, 2])
        for i in range(angles.shape[-1]):
            pos[..., i + 1, 0] = pos[..., i, 0] + np.cos(angles[..., i])
            pos[..., i + 1, 1] = pos[..., i, 1] + np.sin(angles[..., i])
        return pos

    @staticmethod
    def angle_normalize(angles: np.array) -> np.array:
        """
        Normalizes the angles to be in [-pi, pi)
        :param angles: Unnormalized input angles
        :return: Normalized angles
        """
        return ((angles + np.pi) % (2 * np.pi)) - np.pi

    def render(self, search_distribution: Gaussian, iteration: Union[int, str] = 0):
        """
        Visualize the robot arm by drawing into the current matplotlib figure
        :param search_distribution: The search distribution whose mean and samples are plotted
        :param iteration: The iteration number to use for the figure title
        """
        dimension = search_distribution.task_dimension
        plt.gca().add_patch(Rectangle(xy=(-dimension * 0.02, dimension * -0.06),
                                      width=dimension * 0.02, height=dimension * 0.12,
                                      facecolor="grey", alpha=1, zorder=0))
        plt.xlabel(r"$x$")
        plt.ylabel(r"$y$")
        axes = plt.gca()
        axes.set_xlim([-0.6 * self._num_links, 1.1 * self._num_links])
        axes.set_ylim([-0.7 * self._num_links, 0.7 * self._num_links])
        axes.set_aspect(aspect="equal")

        # plot mean with a very high opacity
        mean = search_distribution.mean.squeeze()
        mean_angles = self.forward_kinematic(joint_angles=[mean])[0]
        mean_reward = self.reward(samples=mean)
        plt.plot(mean_angles[:, 0], mean_angles[:, 1], 'go-', markerfacecolor="grey", alpha=0.75,
                 label=f"Mean Reward: {mean_reward:.4e}")

        # plot a high number of samples with low opacity to get a sense of the distribution
        samples = search_distribution.sample(100)
        angles = self.forward_kinematic(joint_angles=samples)
        for position, angle_configuration in enumerate(angles):
            plt.plot(angle_configuration[:, 0], angle_configuration[:, 1], 'go-', markerfacecolor="grey", alpha=0.05,
                     label="Samples" if position == 0 else None)

        plt.scatter(self.target[0], self.target[1], c="r", marker="x", s=100)  # plot target point
        plt.legend(loc="upper left")
        plt.title(f"Point Reacher samples and reward at iteration {iteration}")

### Abstract Stochastic Search Class

Next, we define an abstract class for the stochastic search algorithms. The algorithms will inherit from this class and override the algorithm-specific parts. You do **not** need to change code in this class, but you should still make sure that you understand the different functions.


In [None]:
class AbstractStochasticSearchMethod(abc.ABC):
    def __init__(self, task_dimension: int, samples_per_iteration: int, elite_percentage: float = 0.2):
        """
        :param task_dimension: The dimension of the task space
        :param samples_per_iteration: The number of samples to draw per iteration
        :param elite_percentage: The percentage of samples to keep as elite samples
        """
        smoothness_prior = [1] + [0.04] * (task_dimension - 1)
        self._reacher = PointReacher(num_links=task_dimension, likelihood_std=1.0e-4,
                                     smoothness_prior_std=smoothness_prior)
        self._search_distribution = Gaussian(mean=np.zeros(task_dimension),
                                             covariance=np.eye(task_dimension))

        self._samples_per_iteration = samples_per_iteration
        self._elite_percentage = elite_percentage
        self._num_elite_samples = int(self._elite_percentage * self._samples_per_iteration)

    def run(self, num_iterations: int, render: bool = True):
        full_metrics = {"mean_reward": [],
                        "entropy": []
                        }
        for iteration in tqdm(range(num_iterations), desc="Running Stochastic Search."):
            # logging utility
            if render and (iteration == 0 or 2 ** round(np.log2(iteration)) == iteration):
                # plot for each power of 2.
                # This gives a lot of plots early on, when the search distribution still changes
                # quickly, and eventually slows down near convergence
                self.reacher.render(search_distribution=self.search_distribution, iteration=iteration)
                save_figure(save_name=f"method={self.__class__.__name__}_iter={iteration:04d}")
                plot_metrics(full_metrics)
                
            if iteration % 100 == 0:
                print(f"Mean Reward: {self.mean_reward:.4e}, Entropy: {self.entropy:.4e}")
            full_metrics["mean_reward"].append(self.mean_reward)
            full_metrics["entropy"].append(self.entropy)

            self.step()  # perform one iteration of the search method. This also updates the search distribution

        if render:
            self.reacher.render(search_distribution=self.search_distribution, iteration="final")
            save_figure(save_name=f"method={self.__class__.__name__}_iter=final")
            plot_metrics(full_metrics)

    def step(self):
        """
        Perform one iteration of the stochastic search method. This includes
        1. Sampling from the search distribution
        2. Evaluating the samples
        3. Updating the search distribution
        """
        samples = self.sample(self._samples_per_iteration)
        rewards = self.reacher.reward(samples)

        new_mean, new_covariance = self.update_distribution(samples, rewards)
        self._search_distribution.update_params(new_mean, new_covariance)

    def sample(self, n_samples: int) -> np.array:
        """
        Sample from the search distribution.
        :param n_samples: The number of samples to draw
        :return: The samples as an array of shape (n_samples, task_dimension)
        """
        return self._search_distribution.sample(n_samples=n_samples)

    def update_distribution(self, samples: np.array, rewards: np.array) -> (np.array, np.array):
        """
        Update the search distribution based on the samples. Must be implemented by each algorithm.
        :param samples: Samples from the search distribution
        :param rewards: Rewards of the samples
        :return: The new mean and covariance of the search distribution
        """
        raise NotImplementedError

    @property
    def search_distribution(self) -> Gaussian:
        return self._search_distribution

    @property
    def reacher(self) -> PointReacher:
        return self._reacher

    @property
    def mean_reward(self) -> float:
        """
        Compute the mean reward of the current search distribution
        :return: The mean reward
        """
        mean_reward = self.reacher.reward(self.search_distribution.mean.T)
        return mean_reward

    @property
    def entropy(self) -> float:
        return self._search_distribution.entropy

## **TASK 1: Canonical Evolutionary Strategy (CES)** (3 Points)

In this task, we will implement the Canonical Evolutionary Strategy (CES). Remember that the algorithms will implement some of the functions defined in the abstract class above and some additional, algorithm-specific functions defined in the source code below.
The CES follows the classic stochastic search steps defined above. The parameters of the search distribution are updated via sample-based estimates: the elite samples, i.e., the best **M** samples according to the target return, are used to estimate the **mean** of our search distribution. Note that CES is a first-order method, where the variance is a fixed value and is not optimized.
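
Selecting the best **M** samples is a ranking step that both CES and CEM share. A minimal sketch of it, using a hypothetical helper `select_elites` and made-up toy data, could look as follows.

```python
import numpy as np


def select_elites(samples: np.ndarray, rewards: np.ndarray, num_elites: int):
    """Return the num_elites samples with the highest reward, best first."""
    order = np.argsort(rewards)[::-1]  # indices from highest to lowest reward
    elite_indices = order[:num_elites]
    return samples[elite_indices], rewards[elite_indices]


samples = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # shape (n_samples, task_dimension)
rewards = np.array([-5.0, -1.0, -3.0, -0.5])                          # higher is better
elite_samples, elite_rewards = select_elites(samples, rewards, num_elites=2)
```

In the base class, **M** is computed as `int(elite_percentage * samples_per_iteration)`, so the value is rounded down. With the default `elite_percentage=0.01` of `CanonicalES` and 128 samples per iteration, this gives a single elite sample.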

### Task 1.1: The Weight Vector (1 Point)
CES uses the ranking scheme that is standard in evolutionary strategies: the drawn samples are ranked, and when updating the parameters of the search distribution, each of the best **M** samples is weighted with a weight $w_i$ that depends on its rank.
In this task we will implement this weight vector as shown on **slide 14** of the **Stochastic Search** slide set.
Implement the function `_get_weight_vector()` in the code below. You can access the necessary parameters from the base `AbstractStochasticSearchMethod` class.

### Task 1.2: Updating the Parameters in CES (2 Points)
Given the weight vector, we can now update the **mean vector** of the search distribution.
Implement the function `update_distribution(samples: np.array, rewards: np.array)` in the following code cell. Make sure that you correctly estimate the mean, which is a weighted recombination of the best **M** samples. The update rule can be found on **slide 14** of the slide set **Stochastic Search**. Remember that CES is a first-order method and hence the variance is fixed and does not need to be updated.


In [None]:
class CanonicalES(AbstractStochasticSearchMethod):
    def __init__(self, task_dimension: int, samples_per_iteration: int, elite_percentage: float = 0.01,
                 variance: float = 0.00001):
        """
        :param task_dimension: The dimension of the task space
        :param samples_per_iteration: The number of samples to draw per iteration
        :param elite_percentage: The percentage of samples to keep as elite samples
        :param variance: Fixed variance for the first-order method
        """
        super().__init__(task_dimension, samples_per_iteration, elite_percentage)
        self._weight_vector = self._get_weight_vector()  # precompute weight vector for efficiency
        self._variance = variance

    def _get_weight_vector(self) -> np.array:
        """
        Create the weight vector for the canonical ES. This is a vector of length samples_per_iteration
        that assigns a weight to each sample. The weights are chosen such that the elite samples have
        a higher weight than the non-elite samples according to the slides in the lecture.

        :return:
        """
        ## TODO ##
        # your code here
        return ...

    def update_distribution(self, samples: np.array, rewards: np.array) -> (np.array, np.array):
        """
        Update the search distribution based on the samples and using the canonical ES update
        :param samples: Samples from the search distribution
        :param rewards: Rewards of the samples
        :return: The new mean and covariance of the search distribution
        """
        ## TODO ##
        # your code here
        # hint: You will need to use the weight vector here to weight the samples
        return ..., ...


## **TASK 2: Cross Entropy Method (CEM)** (2 Points)

In this task, we will implement the Cross Entropy Method (CEM). Remember that the algorithms will implement some of the functions defined in the abstract class above.
The Cross Entropy Method follows the classic stochastic search steps defined above. The parameters of the search distribution are updated via sample-based estimates: the elite samples, i.e., the best **M** samples according to the target return, are used to estimate the **mean** and **covariance** of our search distribution. Note that CEM is a second-order method and hence estimates the **full** covariance of the search distribution.


### Task 2.1: Updating the Parameters in CEM (2 Points)
Implement the function `update_distribution(samples: np.array, rewards: np.array)` in the following code cell. Make sure that you estimate the **mean and covariance** based on the equations provided on **slide 21** of slide set **Stochastic Search**. Also do not forget that CEM is a second order method and therefore, you need to update the full covariance of the search distribution.

Hint: You can make use of the functions `np.mean(x)`, `np.cov(x)`. Do not forget to include the Polyak-Averaging for updating the mean and covariance.
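
When using the functions from the hint, the `axis` and `rowvar` arguments matter, because the samples are stored as rows. The sketch below only illustrates these shape conventions on random stand-in data. The helper `polyak_average` is hypothetical, and which of the two terms is weighted by `alpha` is an assumption here; please check the convention on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
elite_samples = rng.normal(size=(25, 3))         # stand-in data, shape (num_elite_samples, task_dimension)

# np.mean(x) without an axis averages over all entries and returns a scalar
elite_mean = np.mean(elite_samples, axis=0)      # shape (3,)

# np.cov treats rows as variables by default (rowvar=True)
elite_cov = np.cov(elite_samples, rowvar=False)  # shape (3, 3)
wrong_cov = np.cov(elite_samples)                # shape (25, 25), not what we want


def polyak_average(old: np.ndarray, estimate: np.ndarray, alpha: float) -> np.ndarray:
    """Interpolate between the old value and a new estimate (assumed convention)."""
    return (1.0 - alpha) * old + alpha * estimate


smoothed = polyak_average(np.zeros(3), np.ones(3), alpha=0.5)
```

Note that `np.cov` normalizes by $N-1$ by default, while `bias=True` normalizes by $N$. A mean of shape `(task_dimension,)` is fine as a return value, since `Gaussian.update_params` reshapes it to a column vector.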


In [None]:

class CrossEntropyMethod(AbstractStochasticSearchMethod):
    def __init__(self, task_dimension: int, samples_per_iteration: int, elite_percentage: float = 0.2):
        super().__init__(task_dimension, samples_per_iteration, elite_percentage)
        self._alpha = 0.5

    def update_distribution(self, samples: np.array, rewards: np.array) -> (np.array, np.array):
        """
        Update the search distribution based on the samples and using the cross-entropy method
        :param samples: Samples from the search distribution
        :param rewards: Rewards of the samples
        :return: The new mean and covariance of the search distribution
        """
        ## TODO ##
        # your code here
        new_covariance = ...

        # add a small constant for numerical stability
        stable_new_covariance = new_covariance + np.eye(self._search_distribution.task_dimension) * 1e-6
        return ..., stable_new_covariance


The following cell defines the parameters of the environment (number of robot links), and hyperparameters of the different algorithms.
You can choose which algorithm to run via the `method` argument: either 'ces' (Canonical Evolutionary Strategy) or 'cem' (Cross Entropy Method).

In [None]:
class Args:

    # @markdown Boilerplate for properly accessing the args
    def __getitem__(self, key):
        return getattr(self, key)

    def __setitem__(self, key, val):
        setattr(self, key, val)

    method = 'ces'  # @param {type: "string"}
    num_links = 15  # @param {type: "integer"}
    num_iterations = 10000  # @param {type: "integer"}
    samples_per_iteration = 128  # @param {type: "integer"}
    ces_variance = 0.0001  # @param {type: "number"}

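The next cell runs the algorithm selected via `method` above (CES or CEM).
Please submit the **final configuration** of the reacher as well as the **training metrics** for **both** methods with the pre-configured hyperparameters together with your solutions notebook. To obtain results for both methods, run the cell once with `method = 'ces'` and once with `method = 'cem'`.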


In [None]:
def main(args: Args):
    np.random.seed(SEED)  # use the global seed defined in the setup cell
    num_links = args.num_links
    num_iterations = args.num_iterations
    method = args.method

    if method == "cem":
        method = CrossEntropyMethod(task_dimension=num_links, samples_per_iteration=args.samples_per_iteration)
    elif method == "ces":
        method = CanonicalES(task_dimension=num_links, samples_per_iteration=args.samples_per_iteration,
                             variance=args.ces_variance)
    else:
        raise ValueError(f"Unknown method {method}")

    # run the search method
    method.run(num_iterations=num_iterations)

args = Args()
main(args=args)

## MORE
### **Task 3: Deriving the MORE equations** (10 Points)
Model-based Relative Entropy Stochastic Search (MORE) is a gradient-free policy search algorithm for blackbox function optimization.
MORE makes use of a Gaussian search distribution with a full covariance matrix, and iteratively updates this distribution by
* drawing samples from it
* evaluating the samples on the target function
* fitting a quadratic surrogate model on the samples and their targets
* using this model to update the search distribution in closed form under entropy and KL constraints

Comparing this to the general steps outlined above, MORE uses a quadratic surrogate model for a more efficient update of the search distribution.
The algorithm is comparatively involved and contains a lot of complex mathematical expressions. We will therefore not implement it, but instead have a look into its theory.
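
To give an idea of the surrogate fitting step, the following sketch fits a quadratic model to samples and their target values with ordinary least squares. This is an illustration under assumptions: the function names, the toy target, and the use of plain least squares are hypothetical choices, and the exercise does not prescribe a particular regression method.

```python
import numpy as np


def quadratic_features(thetas: np.ndarray) -> np.ndarray:
    """Map samples of shape (n, d) to features [1, theta_i, theta_i * theta_j for i <= j]."""
    n, d = thetas.shape
    rows, cols = np.triu_indices(d)               # index pairs (i, j) with i <= j
    products = thetas[:, rows] * thetas[:, cols]  # all quadratic terms
    return np.concatenate([np.ones((n, 1)), thetas, products], axis=1)


def fit_quadratic_surrogate(thetas: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Least-squares fit of the feature weights of the quadratic model."""
    weights, *_ = np.linalg.lstsq(quadratic_features(thetas), targets, rcond=None)
    return weights


rng = np.random.default_rng(0)
thetas = rng.normal(size=(200, 2))
targets = -np.sum((thetas - 1.0) ** 2, axis=1)    # toy target that is exactly quadratic
weights = fit_quadratic_surrogate(thetas, targets)
predictions = quadratic_features(thetas) @ weights
```

For this toy target the fit recovers the true coefficients, since $-(\theta_1 - 1)^2 - (\theta_2 - 1)^2 = -2 + 2\theta_1 + 2\theta_2 - \theta_1^2 - \theta_2^2$. For a general blackbox function the surrogate is only a local approximation around the current search distribution.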

More concretely, we are going to derive the primal solution of the optimization problem underlying MORE, given as
\begin{align}
    \underset{\boldsymbol{\omega}}{\textrm{argmax}} \int p_{\boldsymbol{\omega}}(\boldsymbol{\theta})g(\boldsymbol{\theta}) d\boldsymbol{\theta} \quad \textrm{s.t.} \quad \textrm{KL}(p_{\boldsymbol{\omega}}(\boldsymbol{\theta}) || p_{\textrm{old}}(\boldsymbol{\theta})) \leq \epsilon, \quad \textrm{H}(p_{\textrm{old}}(\boldsymbol{\theta})) - \textrm{H}(p_{\boldsymbol{\omega}}(\boldsymbol{\theta})) \leq \gamma, \quad \int p_{\boldsymbol{\omega}}(\boldsymbol{\theta}) d \boldsymbol{\theta} = 1.
\end{align}

As a first simplification we can set $\beta = \textrm{H}(p_{\textrm{old}}(\boldsymbol{\theta})) - \gamma$ and rewrite this objective as
\begin{align}
    \underset{\boldsymbol{\omega}}{\textrm{argmax}} \int p_{\boldsymbol{\omega}}(\boldsymbol{\theta})g(\boldsymbol{\theta}) d\boldsymbol{\theta} \quad \textrm{s.t.} \quad \textrm{KL}(p_{\boldsymbol{\omega}}(\boldsymbol{\theta}) || p_{\textrm{old}}(\boldsymbol{\theta})) \leq \epsilon, \quad \textrm{H}(p_{\boldsymbol{\omega}}(\boldsymbol{\theta})) \geq \beta, \quad \int p_{\boldsymbol{\omega}}(\boldsymbol{\theta}) d \boldsymbol{\theta}=1.
\end{align}

Denoting the Lagrangian multipliers for the KL and entropy constraints by $\eta$ and $\kappa$, respectively, **show that the optimal solution $p_{\boldsymbol{\omega}^*}(\boldsymbol{\theta})$ of this optimization problem is given by**
\begin{align}
    p_{\boldsymbol{\omega}^*}(\boldsymbol{\theta}) \propto p_{\textrm{old}}(\boldsymbol{\theta})^{\frac{\eta}{\eta + \kappa}} \exp\left( \dfrac{g(\boldsymbol{\theta})}{\eta + \kappa} \right).
\end{align}
Note that the optimal solution depends on the duals $\eta$ and $\kappa$.
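
To see why this form is convenient, consider the following worked consequence. It is not part of the task and rests on two assumptions made here for illustration: the old distribution is Gaussian, $p_{\textrm{old}}(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, and the surrogate is quadratic, $g(\boldsymbol{\theta}) = -\frac{1}{2}\boldsymbol{\theta}^T \boldsymbol{R} \boldsymbol{\theta} + \boldsymbol{\theta}^T \boldsymbol{r} + r_0$ (the symbols $\boldsymbol{R}$, $\boldsymbol{r}$, $r_0$ and the sign convention are our own choices). Taking the logarithm of the solution above and collecting terms in $\boldsymbol{\theta}$ gives
\begin{align}
    \log p_{\boldsymbol{\omega}^*}(\boldsymbol{\theta}) &= \frac{\eta}{\eta + \kappa} \log p_{\textrm{old}}(\boldsymbol{\theta}) + \frac{g(\boldsymbol{\theta})}{\eta + \kappa} + \textrm{const} \\
    &= -\frac{1}{2} \boldsymbol{\theta}^T \left( \frac{\eta \boldsymbol{\Sigma}^{-1} + \boldsymbol{R}}{\eta + \kappa} \right) \boldsymbol{\theta} + \boldsymbol{\theta}^T \left( \frac{\eta \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{r}}{\eta + \kappa} \right) + \textrm{const}.
\end{align}
This is again the log-density of a Gaussian, provided that $\eta + \kappa > 0$ and $\eta \boldsymbol{\Sigma}^{-1} + \boldsymbol{R}$ is positive definite. Its parameters can be read off as
\begin{align}
    \boldsymbol{\Sigma}_{\textrm{new}} = (\eta + \kappa) \left( \eta \boldsymbol{\Sigma}^{-1} + \boldsymbol{R} \right)^{-1}, \qquad \boldsymbol{\mu}_{\textrm{new}} = \left( \eta \boldsymbol{\Sigma}^{-1} + \boldsymbol{R} \right)^{-1} \left( \eta \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{r} \right).
\end{align}
Under these assumptions, the update of the search distribution is available in closed form once the duals $\eta$ and $\kappa$ are known.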
