[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/icaros-usc/pyribs/blob/master/examples/tutorials/lunar_lander.ipynb)


# Lunar Lander Tutorial

In this tutorial, we'll walk through how to use the `ribs` implementation of MAP-Elites to solve OpenAI Gym's Lunar Lander problem. Specifically, the environment we'll be using is `LunarLander-v2`.

## Overview

OpenAI Gym is a common toolkit used to test and evaluate reinforcement learning algorithms. It provides various environments/problems for algorithms to solve. In our case, we'll be trying to get a lunar lander to land successfully on the moon within a certain target area. To find out more, visit [OpenAI's page](https://gym.openai.com/envs/LunarLander-v2/) on this environment. Using MAP-Elites, we'll discover and visualize a diverse range of solutions to this problem.

If you're unfamiliar with the MAP-Elites algorithm, take a look at [this paper](https://arxiv.org/abs/1504.04909) which introduces the algorithm that we use in this notebook. It should be noted that while the algorithm in the paper minimizes performance, the `ribs` implementation maximizes performance instead. Additionally, instead of generating random solutions in the initial stage of MAP-Elites, our implementation samples from a Gaussian distribution.
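
To make the loop concrete, here is a minimal sketch of MAP-Elites on a toy one-dimensional problem. It is not the `ribs` implementation: the function `toy_map_elites`, the objective `-x**2`, and the binning scheme are all assumptions chosen for illustration. It does follow the two points above: performance is maximized, and the first solution is sampled from a Gaussian.

```python
import random


def toy_map_elites(iterations=2000, bins=10, sigma=0.1, seed=0):
    """Toy MAP-Elites on a 1-D solution x (illustrative, not the ribs API).

    Objective (maximized): -x**2. Behavior: x itself, binned over [-1, 1).
    """
    rng = random.Random(seed)
    archive = {}  # cell index -> (objective, solution)

    for _ in range(iterations):
        if archive:
            # Pick a random elite and perturb it with Gaussian noise.
            _, parent = rng.choice(list(archive.values()))
            x = parent + rng.gauss(0.0, sigma)
        else:
            # Initial stage: sample from a Gaussian centered at 0.
            x = rng.gauss(0.0, sigma)

        if not -1.0 <= x < 1.0:
            continue  # Outside the behavior bounds; discard.

        objective = -x**2
        # Map the behavior to a cell; clamp to guard against rounding.
        index = min(int((x + 1.0) / 2.0 * bins), bins - 1)

        # Keep the new solution if its cell is empty or it beats the elite.
        if index not in archive or objective > archive[index][0]:
            archive[index] = (objective, x)

    return archive


archive = toy_map_elites()
print(len(archive), "cells filled")
```

Each cell ends up holding the best solution found for its slice of the behavior space, which is what makes the result a set of diverse solutions rather than a single optimum.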

## Setup

Here we'll retrieve all of our dependencies. `dask` is a library that allows parallelization.

In [1]:
import sys
!{sys.executable} -m pip install ribs[examples]



In [6]:
import time

import gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dask.distributed import Client, LocalCluster

from ribs.archives import GridArchive
from ribs.optimizers import Optimizer
from ribs.emitters import GaussianEmitter

## Lunar Lander Simulation

First, let's write a `simulate()` function to run a prospective solution in the `LunarLander-v2` environment. We'll get to how we generate these prospective solutions later.

`simulate()` takes a prospective solution (i.e. a policy) and uses it to choose the Lunar Lander's actions in the environment. When the simulation completes, `simulate()` returns three values: the sum of the rewards received over the episode (`total_reward`), the final observation from the environment (`obs`), and the number of timesteps the simulation ran for (`timesteps`).
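
As a rough sketch of how such a policy picks an action, the snippet below reshapes a flat solution vector into an `(action_dim, obs_dim)` matrix and takes the highest-scoring action. The dimensions (8 observation values, 4 discrete actions) are assumed for `LunarLander-v2`, and the random vectors merely stand in for a real solution and observation.

```python
import numpy as np

# Assumed dimensions for LunarLander-v2: 8 observation values, 4 actions.
obs_dim, action_dim = 8, 4

rng = np.random.default_rng(0)

# A flat solution vector, reshaped into the policy matrix.
solution = rng.normal(size=action_dim * obs_dim)
model = solution.reshape(action_dim, obs_dim)

# Stand-in for an observation returned by the environment.
obs = rng.normal(size=obs_dim)

scores = model @ obs             # one score per action
action = int(np.argmax(scores))  # deterministic choice: highest score wins
print(action)
```

Because the choice is an `argmax` with no sampling, the same model and observation always produce the same action.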

In [None]:
def simulate(
    env_name: str,
    model,
    seed: int = None,
    render: bool = False,
    delay: int = 10,
):
    """Runs the model in the env and returns the results of the episode.

    Pass `seed` to initialize the environment from the given seed, which makes
    the environment identical between runs.

    The model is a linear policy represented by a single (action_dim, obs_dim)
    matrix; the highest-scoring action is taken at each timestep.

    When `render` is True, `delay` is the number of milliseconds to wait
    between timesteps.

    Returns:
        total_reward: Sum of the rewards over the episode.
        obs: Final observation from the environment.
        timesteps: Number of timesteps the episode ran for.
    """
    total_reward = 0.0
    env = gym.make(env_name)

    # Seeding the environment before each reset ensures that our simulations are
    # deterministic. We cannot vary the environment between runs because the
    # same solution would then receive different scores, making solutions hard
    # to compare. See
    # https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L115
    # for the implementation of seed() for LunarLander.
    if seed is not None:
        env.seed(seed)
    obs = env.reset()

    timesteps = 0

    done = False
    while not done:
        # If render is True, a video shows the Lunar Lander taking actions in
        # the environment. The delay only applies when rendering, so that
        # training runs are not slowed down.
        if render:
            env.render()
            if delay is not None:
                time.sleep(delay / 1000)

        # The model is a deterministic linear policy: multiply the policy
        # matrix by the observation and take the highest-scoring action.
        action = np.argmax(model @ obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        timesteps += 1

    env.close()

    return total_reward, obs, timesteps

## MAP-Elites with `ribs`

`ribs` makes it easy to run the MAP-Elites algorithm to solve reinforcement learning problems. Let's run through some basics before we apply `ribs` to solve the Lunar Lander problem.

### GridArchive

`GridArchive` is a container class that houses the solutions generated by MAP-Elites; it is our map of elites. When constructing a `GridArchive`, you specify its dimensions, the range of valid values in the behavior space, and a few configuration settings. These settings include a seed, used when randomly selecting solutions from the archive to mutate (an essential step in MAP-Elites), and a batch size. The batch size does not matter for `GridArchive`, but it does for `Optimizer`, which we'll discuss soon.

In `train_model()`, you see we create an `archive = GridArchive((16, 16), [(0, 1000), (-1., 1.)], config=config)`. Let's break this down.

- `(16, 16)` specifies that we are creating a 2D 16x16 container for solutions. 16x16 was chosen arbitrarily.

- `[(0, 1000), (-1., 1.)]` specifies lower and upper bounds for each dimension of the behavior space. In the case of Lunar Lander, we want to consider the number of timesteps and the x-position of the Lunar Lander in the environment. According to the OpenAI Gym documentation, each simulation can take at least 0 timesteps and at most 1000 timesteps, so we specify `(0, 1000)`. Looking at `LunarLander-v2`'s source code, we find that the minimum x-position value for the lander is -1.0 and the maximum value is 1.0, so we specify `(-1., 1.)`.

- `config` is a dictionary that specifies certain configuration settings. As stated previously, the only value that `GridArchive` uses is the seed. `config` will also later be passed into `Optimizer`, which we'll discuss soon.
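
To illustrate how the dimensions and bounds fit together, here is a small sketch that maps a behavior `(timesteps, x-position)` to a cell in a 16x16 grid. The helper `grid_index` is hypothetical, and clipping out-of-range values into the edge cells is an assumption; `GridArchive` may index and handle bounds differently internally.

```python
def grid_index(behavior, dims, bounds):
    """Maps a behavior tuple to integer cell indices (illustrative only)."""
    index = []
    for value, n_cells, (low, high) in zip(behavior, dims, bounds):
        # Clip so values on or past a bound land in the edge cells.
        clipped = min(max(value, low), high)
        cell = int((clipped - low) / (high - low) * n_cells)
        index.append(min(cell, n_cells - 1))
    return tuple(index)


dims = (16, 16)
bounds = [(0, 1000), (-1.0, 1.0)]

print(grid_index((250, 0.0), dims, bounds))    # (4, 8)
print(grid_index((1000, -1.0), dims, bounds))  # (15, 0)
```

A simulation that ends after 250 timesteps with the lander at x = 0.0 therefore competes only with other solutions that fall in cell `(4, 8)`.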

### Emitter
`Emitter` is the class that generates solutions to either store in `GridArchive` or discard. 
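
As a hedged illustration of what a Gaussian emitter might do, the sketch below perturbs a batch of parent solutions with independent Gaussian noise. The helper `gaussian_candidates` is hypothetical; the source does not show how `GaussianEmitter` chooses its parents, so starting every candidate from the initial solution is an assumption made here for simplicity.

```python
import numpy as np


def gaussian_candidates(parents, sigma, rng):
    """Adds independent Gaussian noise to each parent solution."""
    parents = np.asarray(parents, dtype=float)
    return parents + rng.normal(scale=sigma, size=parents.shape)


rng = np.random.default_rng(42)

# Initial solution, shaped like np.zeros(action_dim * obs_dim) with the
# assumed LunarLander-v2 dimensions of 4 actions and 8 observation values.
x0 = np.zeros(4 * 8)

# A batch of 64 candidates, all starting from x0.
batch = np.tile(x0, (64, 1))
candidates = gaussian_candidates(batch, sigma=10.0, rng=rng)
print(candidates.shape)  # (64, 32)
```

Each row of `candidates` is one flat solution vector that can be reshaped into a policy matrix and evaluated.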

### Optimizer

`Optimizer` is the class which generates elites to store in `GridArchive`. 

In [2]:
def train_model(
    client: Client,
    seed: int,
    sigma: float,
    model_filename: str,
    plot_filename: str,
    iterations: int,
    env_name: str = "LunarLander-v2",
):
    """Trains a model with MAP-Elites and saves it."""
    # Environment properties.
    env = gym.make(env_name)
    action_dim = env.action_space.n
    obs_dim = env.observation_space.shape[0]

    config = {
        "seed": seed,
        "batch_size": 64,
    }


    archive = GridArchive((16, 16), [(0, 1000), (-1., 1.)], config=config)
    emitter = GaussianEmitter(np.zeros(action_dim * obs_dim), sigma, archive)
    opt = Optimizer(archive, [emitter], config=config)

    for _ in range(iterations):

        sols = opt.ask()

        objs = list()
        bcs = list()

        futures = client.map(lambda sol: simulate(env_name, np.reshape(sol, (action_dim, obs_dim)), seed), sols)

        results = client.gather(futures)

        for reward, state, timesteps in results:
            objs.append(reward)
            bcs.append((timesteps, state[0]))

        opt.tell(sols, objs, bcs)


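    # Save the archive's contents.
    df = archive.as_pandas()
    df.to_csv(model_filename)

    # Plot the objective value of each cell in the archive as a heatmap.
    heatmap_data = df.pivot(index='index-0', columns='index-1', values='objective')
    sns.heatmap(heatmap_data)
    plt.savefig(plot_filename)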

In [3]:
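def run_evaluation(model_filename, env_name, seed):
    """Runs a single simulation and displays the results."""
    # Expects a .npy file holding an (action_dim, obs_dim) policy matrix. Note
    # that train_model() writes the archive as a CSV, which cannot be loaded
    # here directly.
    model = np.load(model_filename)
    print("=== Model ===")
    print(model)
    # simulate() returns (total_reward, final observation, timesteps).
    total_reward, _, _ = simulate(env_name, model, seed, True, 10)
    print("Reward:", total_reward)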

In [4]:
def map_elites(
    seed: int = 42,
    local_workers: int = 8,
    sigma: float = 10.0,
    plot_filename: str = "lunar_lander_plot.png",
    model_filename: str = "lunar_lander_model.csv",
    run_eval: bool = False,
):
    """Uses MAP-Elites to train an agent in the LunarLander-v2 environment.

    Args:
        seed: Random seed for the archive and the environments.
        local_workers: Number of Dask worker processes to start locally.
        sigma: Standard deviation passed to `GaussianEmitter`.
        plot_filename: Location to store the heatmap image.
        model_filename: Location of the model file. Training writes the
            archive here as a CSV; evaluation reads a .npy model from here.
        run_eval: Pass this to run an evaluation in LunarLander-v2 with the
            model in `model_filename` instead of training.
    """
    # Evaluations do not need Dask.
    if run_eval:
        run_evaluation(model_filename, "LunarLander-v2", seed)
        return

    # Initialize on a local machine. See the docs here:
    # https://docs.dask.org/en/latest/setup/single-distributed.html for more
    # info on LocalCluster. Keep in mind that for LocalCluster, n_workers is
    # the number of processes. Our LunarLander evaluations likely do not
    # release the GIL, so using threads instead of processes (which we would
    # do by setting n_workers=1 and threads_per_worker=local_workers) would be
    # very slow, as it would effectively be single-threaded. See here for a
    # bit more info about processes and threads in Dask:
    # https://distributed.dask.org/en/latest/worker.html#thread-pool
    # The link above is for multiple machines (each machine is called a
    # worker, and each worker has processes and threads), but the idea
    # still holds.
    cluster = LocalCluster(n_workers=local_workers,
                           threads_per_worker=1,
                           processes=True)
    client = Client(cluster)  # pylint: disable=unused-variable
    print("Cluster config:")
    print(client.ncores())

    train_model(client, seed, sigma, model_filename, plot_filename, 10)