# Lunar Lander Tutorial

In this tutorial, we'll walk through how to use the `ribs` implementation of MAP-Elites to solve OpenAI Gym's Lunar Lander problem. Specifically, the environment we'll be using is `LunarLander-v2`.

## Overview

OpenAI Gym is a common toolkit used to test and evaluate reinforcement learning algorithms. It provides various environments/problems for algorithms to solve. In our case, we'll be trying to get a lunar lander to land successfully on the moon within a certain target area. To find out more, visit [OpenAI's page](https://gym.openai.com/envs/LunarLander-v2/) on this environment. Using MAP-Elites, we'll discover and visualize a diverse range of solutions to this problem.

If you're unfamiliar with the MAP-Elites algorithm, take a look at [this paper](https://arxiv.org/abs/1504.04909) which introduces the algorithm that we use in this notebook. It should be noted that while the algorithm in the paper minimizes performance, the `ribs` implementation maximizes performance instead. Additionally, instead of generating random solutions in the initial stage of MAP-Elites, our implementation samples from a Gaussian distribution.

## Setup

Here we'll retrieve all of our dependencies. `dask` is a library that allows parallelization.

In [None]:
%pip install ribs

In [2]:
import time

import gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dask.distributed import Client, LocalCluster

from ribs.archives import GridArchive
from ribs.optimizers import Optimizer
from ribs.emitters import GaussianEmitter

## Lunar Lander Simulation

First, let's write a `simulate()` function to run a prospective solution in the `LunarLander-v2` environment. We'll get to how we generate these prospective solutions later.

`simulate()` takes in a prospective solution (i.e. policy) and uses this policy to make the Lunar Lander take actions in the environment. After the simulation is completed, `simulate()` returns three values: the sum of the rewards of all actions taken in the environment (i.e. `total_reward`), the final x-position of the lander, taken from the first entry of the final observation (i.e. `obs[0]`), and the number of timesteps it took for the simulation to run to completion (i.e. `timesteps`).

In [3]:
def simulate(
    env_name: str,
    model,
    seed: int = None,
    render: bool = False,
    delay: int = 10,
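):
    """Runs the model in the env and returns the results of the episode.
    Returns a tuple (total_reward, x_pos, timesteps): the cumulative reward,
    the final x-position of the lander, and the number of timesteps taken.
    Add the `seed` argument to initialize the environment from the given seed
    (this makes the environment the same between runs).
    The model is a linear map from observation to action scores, and the action
    with the highest score is taken, so the model is represented by a single
    (action_dim, obs_dim) matrix.
    Add an integer delay to wait `delay` ms between timesteps.
    """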
    total_reward = 0.0
    env = gym.make(env_name)

    # Seeding the environment before each reset ensures that our simulations are
    # deterministic. We cannot vary the environment between the runs because
    # that would confuse MAP-Elites. See
    # https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L115
    # for the implementation of seed() for LunarLander.
    if seed is not None:
        env.seed(seed)
    obs = env.reset()

    timesteps = 0

    done = False
    while not done:
        # If render is set to True, then a video will appear showing the Lunar Lander
        # taking actions in the environment.
        if render:
            env.render()
            if delay is not None:
                time.sleep(delay / 1000)

        # The policy is deterministic: multiply the observation (state) by the
        # model (policy) and take the action with the highest score.
        action = np.argmax(model @ obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        timesteps += 1

    env.close()

    return total_reward, obs[0], timesteps
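
To make the policy representation concrete, here is a minimal sketch of how a flat solution vector becomes an action. It assumes the `LunarLander-v2` dimensions of 4 discrete actions and 8 observation values. The weights and the observation are made up for illustration and are not taken from a real run.

```python
import numpy as np

# Assumed dimensions for LunarLander-v2: 4 actions, 8 observation values.
action_dim, obs_dim = 4, 8

# A hypothetical flat solution, as an emitter would produce it. Every weight
# is zero except the one linking observation entry 0 (x-position) to action 3.
flat_solution = np.zeros(action_dim * obs_dim)
flat_solution[3 * obs_dim + 0] = 1.0

# Reshape into the (action_dim, obs_dim) matrix that simulate() expects.
model = np.reshape(flat_solution, (action_dim, obs_dim))

# A made-up observation with a positive x-position.
obs = np.zeros(obs_dim)
obs[0] = 0.5

scores = model @ obs              # one score per action, shape (4,)
action = int(np.argmax(scores))   # index of the highest-scoring action
print(scores, action)
```

Because only action 3 receives a non-zero score, `action` is 3 here. Since the action is chosen with `argmax`, the same observation always leads to the same action.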

## MAP-Elites with `ribs`

`ribs` makes it easy to run the MAP-Elites algorithm to solve reinforcement learning problems. Let's run through some basics before we apply `ribs` to solve the Lunar Lander problem.

### GridArchive

`GridArchive` is a container class used to house the solutions generated by MAP-Elites. It is our map of elites. When constructing a `GridArchive`, you can specify its dimensions, the range of valid values in our behavior space, and certain configuration settings. These configuration settings include a seed for randomly selecting solutions in the archive to mutate, which is essential to MAP-Elites, and a batch size. This batch size is not important for `GridArchive` but it is important for `GaussianEmitter`, which we'll discuss soon.

In `train_model()`, you see we create `archive = GridArchive((16, 16), [(0, 1000), (-1., 1.)], config={"seed": seed})`. Let's break this down.

- `(16, 16)` specifies that we are creating a 2D 16x16 container for solutions. 16x16 was chosen arbitrarily.
- `[(0, 1000), (-1., 1.)]` specifies lower and upper bounds for each dimension of the behavior space. In the case of Lunar Lander, we want to consider timesteps and x-position of the Lunar Lander in the environment. According to OpenAI Gym documentation, each simulation can take at least 0 timesteps and at most 1000 timesteps, so we specify `(0, 1000)`. Looking at `LunarLander-v2`'s source code, we find that the minimum x-position value for the lander is -1.0 and the maximum value is 1.0, so we specify `(-1., 1.)`.
- `config` is a dictionary that specifies certain configuration settings. As stated previously, the only value that `GridArchive` uses is the seed, so in `train_model()` we pass in `{"seed": seed}`.

`GridArchive` has a method `as_pandas()` that returns the `GridArchive` as a `pandas` data frame.
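
To illustrate how behavior values can be turned into grid coordinates, here is a minimal sketch using the same dimensions and bounds as above. The helper `to_grid_index` is hypothetical and written only for this example. It is not the `ribs` implementation, which may handle boundaries differently.

```python
import numpy as np

def to_grid_index(behavior, dims, ranges):
    """Hypothetical helper: maps behavior values to integer cell coordinates."""
    behavior = np.asarray(behavior, dtype=float)
    dims = np.asarray(dims)
    lower = np.array([lo for lo, _ in ranges], dtype=float)
    upper = np.array([hi for _, hi in ranges], dtype=float)
    # Fraction of the way from the lower to the upper bound, per dimension.
    frac = (behavior - lower) / (upper - lower)
    # Scale to the number of cells, then clip so boundary values stay inside.
    idx = np.floor(frac * dims).astype(int)
    return tuple(int(i) for i in np.clip(idx, 0, dims - 1))

dims = (16, 16)
ranges = [(0, 1000), (-1.0, 1.0)]

# 250 timesteps with the lander at x = 0.0.
cell_a = to_grid_index((250, 0.0), dims, ranges)
# The extreme corner: 1000 timesteps with the lander at x = -1.0.
cell_b = to_grid_index((1000, -1.0), dims, ranges)
print(cell_a, cell_b)  # (4, 8) (15, 0)
```

Each cell of the 16x16 grid therefore covers 62.5 timesteps along the first dimension and 0.125 units of x-position along the second.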

### GaussianEmitter
`GaussianEmitter` is the class that generates solutions to either store in `GridArchive` or discard. As the name implies, it uses a Gaussian distribution to generate/mutate solutions. 

In `train_model()`, we create `emitter = GaussianEmitter(np.zeros(action_dim * obs_dim), sigma, archive, config={"batch_size": 64})`. Let's break this down by looking at `GaussianEmitter`'s constructor: `GaussianEmitter(x0, sigma0, archive, config=None)`

- `x0` is the center of the Gaussian distribution to generate solutions from when the archive specified by `archive` is empty of solutions. 
- `sigma0` is the standard deviation of the Gaussian distribution. Here, we simply pass in the sigma value passed into `train_model()`.
- `archive` specifies the archive to store solutions in. In this case, we pass in the `GridArchive` we created earlier.
- `config` allows you to pass in configuration settings, including specifications for batch sizes. Here, we pass in `{"batch_size": 64}`, so each call to `ask()` generates 64 solutions.

`GaussianEmitter` has two methods. `ask()` generates a batch of new solutions, each of which is either sampled directly from a Gaussian distribution or produced by mutating an existing solution (i.e. adding Gaussian noise to it). `tell()` takes in a batch of solutions, along with their objective values and behavior characteristics, and adds them to the archive specified by `archive` by calling `archive.add()`. `archive.add()` decides whether to store each new solution by comparing its objective value with that of the existing solution in the same cell. If the cell holds no existing solution, then the new solution is automatically stored. `Optimizer`, which we discuss next, takes care of calling `ask()` and `tell()` for you, so you don't need to worry about these details.
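
The ask/add behavior described above can be sketched with plain NumPy. This is a toy illustration under stated assumptions, not the `ribs` source: the dictionary archive and the standalone `ask()` and `add()` functions are hypothetical stand-ins, and the solutions are only 3-dimensional to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(42)
x0 = np.zeros(3)     # center of the initial distribution (toy 3-D solutions)
sigma0 = 0.5         # standard deviation of the Gaussian noise
batch_size = 4
archive = {}         # toy archive: cell index -> (objective, solution)

def ask():
    """Samples around x0 if the archive is empty, else mutates stored elites."""
    if not archive:
        parents = np.tile(x0, (batch_size, 1))
    else:
        elites = [sol for _, sol in archive.values()]
        picks = rng.integers(len(elites), size=batch_size)
        parents = np.stack([elites[i] for i in picks])
    # Adding Gaussian noise is both the initial sampling and the mutation.
    return parents + rng.normal(0.0, sigma0, size=parents.shape)

def add(cell, objective, solution):
    """Keeps the solution if its cell is empty or it beats the current elite."""
    if cell not in archive or objective > archive[cell][0]:
        archive[cell] = (objective, solution)
        return True
    return False

first_batch = ask()                           # archive empty: sample around x0
stored = add((0, 0), 1.0, first_batch[0])     # empty cell, so it is stored
rejected = add((0, 0), 0.5, first_batch[1])   # lower objective, same cell
replaced = add((0, 0), 2.0, first_batch[2])   # higher objective, same cell
second_batch = ask()                          # now mutates the stored elite
print(stored, rejected, replaced)  # True False True
```

Keeping only the best solution per cell is what lets the archive hold many different behaviors at once while still improving the objective within each cell.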

### Optimizer

`Optimizer` is a class that uses emitters to generate batches of solutions. Then, after the feature/behavior description of each of these solutions has been found, `Optimizer` tells its emitters to store the solutions inside an archive.

In `train_model()`, you see that we create `opt = Optimizer(archive, [emitter])`. `archive` is the archive we want our solutions to be stored in. In this case, it is the `GridArchive` we created earlier in `train_model()`. We also pass in a list of the `Emitter`s that we want the `Optimizer` to ask to generate solutions. In our case, we pass in `[emitter]`, a list containing only the emitter we created earlier in `train_model()`.

`Optimizer` has two methods: `ask()` and `tell()`. `ask()` asks the `Optimizer`'s list of `Emitter`s to generate a batch of solutions. `tell()` tells the `Optimizer`'s list of `Emitter`s to store a batch of solutions in the archive specified upon the `Optimizer`'s creation. However, in order for the `Emitter`s to know whether and where to store each solution in the archive, we have to pass a couple of additional arguments into `tell()`.

Specifically, `tell()` has the following arguments: `tell(objective_values, behavior_values)`

- `objective_values` is an array containing the objective function evaluations for each solution generated by the `Emitter`s we passed into this `Optimizer`. In `train_model()`, you can see we pass in `objs` for this argument. `objs` is a list of reward values derived by running the `simulate()` function described above on each individual solution.
- `behavior_values` is a matrix of feature descriptions for each solution. Each row of `behavior_values` describes features of one solution, and these features are used as coordinates to store each solution into an archive. In `train_model()`, we pass in `bcs` as our `behavior_values` argument. Each entry of `bcs` is a `(timesteps, x_pos)` pair, matching the two behavior dimensions described in the `GridArchive` section above.
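
As a small sketch of the shapes involved, the snippet below builds `objs` and `bcs` from a batch of results. The three result tuples are made-up numbers in the `(total_reward, x_pos, timesteps)` order that `simulate()` returns. They are not real simulation output.

```python
import numpy as np

# Made-up (total_reward, x_pos, timesteps) tuples, one per solution.
results = [(-120.5, 0.31, 87), (15.2, -0.04, 412), (203.9, 0.01, 655)]

# One objective value per solution.
objs = [reward for reward, _, _ in results]
# One (timesteps, x_pos) pair per solution. The order must match the order
# of the bounds given to the archive: [(0, 1000), (-1., 1.)].
bcs = [(timesteps, x_pos) for _, x_pos, timesteps in results]

objective_values = np.asarray(objs)   # shape (3,)
behavior_values = np.asarray(bcs)     # shape (3, 2)
print(objective_values.shape, behavior_values.shape)  # (3,) (3, 2)
```

Note that the behavior pair reverses the order of the last two values returned by `simulate()`, so that timesteps come first, matching the archive's first dimension.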



### `train_model()`

We've just gone over the core components of `ribs` that we'll be using to solve `LunarLander-v2`: `GridArchive`, `GaussianEmitter`, and `Optimizer`. Now, take a look at `train_model()` to see exactly how we solve the problem.

In [4]:
def train_model(
    client: Client,
    seed: int,
    sigma: float,
    model_filename: str,
    plot_filename: str,
    iterations: int,
    env_name: str = "LunarLander-v2",
):
    """Trains a model with MAP-Elites and saves it."""
    # OpenAI Gym environment properties.
    env = gym.make(env_name)
    action_dim = env.action_space.n
    obs_dim = env.observation_space.shape[0]
    
    archive = GridArchive((16, 16), [(0, 1000), (-1., 1.)],
                          config={
                              "seed": seed,
                          })
    emitter = GaussianEmitter(np.zeros(action_dim * obs_dim),
                              sigma,
                              archive,
                              config={"batch_size": 64})
    opt = Optimizer(archive, [emitter])

    for _ in range(iterations):
        
        # Generating a batch of solutions
        opt.ask()

        objs = list()
        bcs = list()
        
        # Here, we're running each of the solutions (i.e. policies) we generated above through the
        # simulate() function. simulate() will return the objective value, x-position of the lunar
        # lander, and timesteps to run to completion for each solution we pass in.
        futures = client.map(
            lambda sol: simulate(env_name, np.reshape(sol, (action_dim, obs_dim)), seed), opt._solutions)

        results = client.gather(futures)
    
        # Here we're just constructing a list of objective function evaluations (i.e. objs) and behavior
        # descriptions (i.e. bcs) for each solution. These values were returned by our calls to simulate()
        # above.
        for reward, x_pos, timesteps in results:
            objs.append(reward)
            bcs.append((timesteps, x_pos))
        
        # We have our Optimizer opt tell our Emitters the objective function evaluations and behavior
        # descriptions of each solution, so that our Emitter emitter and GridArchive archive can decide 
        # where and if to store each solution in our GridArchive archive.
        opt.tell(objs, bcs)

    df = archive.as_pandas()

    # Saving our archive to a file.
    df.to_pickle(model_filename)

    # Arranging the objective values into a 2D grid indexed by archive cell.
    df = df.pivot(index='index_0', columns='index_1', values='objective')
    
    # Creating a heatmap of all of our generated solutions.
    sns.heatmap(df)
    plt.savefig(plot_filename)

### `run_evaluation()`
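This function loads a saved model and runs one simulation of Lunar Lander on it. Note that it expects a `.npy` file containing a single `(action_dim, obs_dim)` policy matrix, whereas `train_model()` saves the whole archive as a pickled data frame.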

In [5]:
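def run_evaluation(model_filename, env_name, seed):
    """Runs a single simulation and displays the results."""
    # NOTE: np.load() expects a .npy file holding one (action_dim, obs_dim)
    # policy matrix, not the pickled archive that train_model() saves.
    model = np.load(model_filename)
    print("=== Model ===")
    print(model)
    # simulate() returns (total_reward, x_pos, timesteps).
    reward, _, _ = simulate(env_name, model, seed, True, 10)
    print("Reward:", reward)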

In `map_elites()`, we set up `dask` to parallelize computation and call `train_model()`, which saves an archive of solutions as a pickle file and outputs a heatmap of all solutions in the archive.

In [6]:
def map_elites(
    seed: int = 42,
    local_workers: int = 8,
    sigma: float = 10.0,
    plot_filename: str = "lunar_lander_plot.png",
    model_filename: str = "lunar_lander_model.pkl",
    run_eval: bool = False,
):
    """Uses MAP-Elites to train an agent in LunarLander-v2.
    Args:
        seed: Random seed for the archive and the environments.
        local_workers: Number of workers to use when running locally.
        sigma: Standard deviation for the Gaussian emitter.
        plot_filename: Location to store plot image.
        model_filename: Location for the model file (either for storing or
            reading).
        run_eval: Pass this to run an evaluation in `LunarLander-v2`
            with the model in `model_filename`.
    """
    # Evaluations do not need Dask.
    if run_eval:
        run_evaluation(model_filename, "LunarLander-v2", seed)
        return
    
    # Set up a local cluster with Dask.
    cluster = LocalCluster(n_workers=local_workers,
                           threads_per_worker=1,
                           processes=True)
    client = Client(cluster)
    print("Cluster config:")
    print(client.ncores())

    train_model(client, seed, sigma, model_filename, plot_filename, 10)

In [13]:
%pip install box2d-py






In [14]:
map_elites()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 56946 instead
  http_address["port"], self.http_server.port


Cluster config:
{'tcp://127.0.0.1:56957': 1, 'tcp://127.0.0.1:56958': 1, 'tcp://127.0.0.1:56959': 1, 'tcp://127.0.0.1:56960': 1, 'tcp://127.0.0.1:56961': 1, 'tcp://127.0.0.1:56962': 1, 'tcp://127.0.0.1:56963': 1, 'tcp://127.0.0.1:56964': 1}


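AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'

This error usually means that Box2D could not be imported, so `gym` never loaded the `LunarLander` class. Installing `box2d-py` in the environment that the notebook kernel is using and then restarting the kernel should resolve it.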