# Multi-Agent Reinforcement Learning

In the previous section, we discussed how to build a model for centralized control.
   * When trained properly, this approach can work very well.
   
   * However, we saw the size of the joint action space grow exponentially as the number of trains in the environment increased.
     As a result, this approach is too limited to work at scale with a large number of trains.
     
   * Instead, we reformulate the objective as a multi-agent reinforcement learning task.
     * This is a more natural representation of the problem, in which each train operates as a decentralized agent.
     * That is to say, from the perspective of one agent, all other agents are part of the environment.
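
To make the scaling argument concrete, here is a minimal sketch of how the joint action space grows. It assumes each train chooses from 5 discrete actions; that number and the helper name `joint_action_space_size` are illustrative assumptions, not values taken from this section.

```python
def joint_action_space_size(num_agents: int, actions_per_agent: int = 5) -> int:
    """Number of joint actions a centralized controller must choose between.

    actions_per_agent=5 is an assumed value used only for illustration.
    """
    return actions_per_agent ** num_agents


for n in (1, 2, 5, 10, 20):
    print(f"{n:>2} trains -> {joint_action_space_size(n):,} joint actions")
```

In the decentralized formulation, each agent only ever selects from its own `actions_per_agent` options, regardless of how many trains are in the environment.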

In [None]:
from environments.env_utils import env_from_env_config
from environments.observations import TreeObsForRailEnv
from environments.preprocessor import TreeObsPreprocessor
from models.dense_model import DenseModel
from ray import tune
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env

We're going to be using RLlib, a scalable reinforcement learning library built on top of Ray, a distributed computing framework developed at UC Berkeley.


With relatively few lines of additional code, Ray enables deep learning and reinforcement learning practitioners to turn their prototype algorithms into distributed, production-scale solutions trained on clusters.  

RLlib is a highly scalable library of reinforcement learning algorithms which natively supports popular deep learning frameworks such as TensorFlow and PyTorch.  

You can find out more about Ray here: https://bair.berkeley.edu/blog/2018/01/09/ray/  
and more on RLlib here: https://ray.readthedocs.io/en/latest/rllib.html  


Reinforcement learning algorithms are hungry for training examples, which makes multi-GPU training very important for the speed of convergence, especially at scale. We will see how RLlib's support for CUDA devices uses NVIDIA GPUs to significantly reduce training time.  

More information on the importance of GPUs for use with RLlib can be found here:  
https://ray.readthedocs.io/en/latest/using-ray-with-gpus.html

But for now, let's train! 

Run the cell below to set up the first experiment.

In [None]:
# Set up a dense model with `u` hidden units per layer
u = 128
model_params = {
    "embedding": {"hidden_layers": [u, u], "activation_fn": "relu"},
    "actor": {"hidden_layers": [u], "activation_fn": "relu"},
    "critic": {"hidden_layers": [u], "activation_fn": "relu"},
}
custom_model = "dense_model"

# Set up the environment
env_config = {
    "obs_config": {"max_depth": 2},
    "rail_generator": "complex_rail_generator",
    "rail_config": {"nr_start_goal": 12, "nr_extra": 0, "min_dist": 8, "seed": 10},
    "width": 8,
    "height": 8,
    "number_of_agents": 5,
    "schedule_generator": "complex_schedule_generator",
    "schedule_config": {},
    "frozen": False,
    "remove_agents_at_target": True,
    "wait_for_all_done": False
}
env = env_from_env_config(env_config)
action_space = env.action_space
observation_space = env.observation_space

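# Define one policy spec per agent. Note that policy_mapping_fn in the
# config below maps every agent to "policy_0", so all trains share one policy.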
num_policies = env_config["number_of_agents"]
policies = {f"policy_{i}": (None, observation_space, action_space, {})
            for i in range(num_policies)}

# Register custom setup with RLlib
register_env("train_env", env_from_env_config)
ModelCatalog.register_custom_model(custom_model, DenseModel)
ModelCatalog.register_custom_preprocessor("tree_obs_preprocessor", TreeObsPreprocessor)

# Full experiment config
config = {
    # Run parameters
    "num_cpus_per_worker": 1,
    "num_cpus_for_driver": 1,
    
    # Environment parameters
    "env": "train_env",
    "env_config": env_config,
    "log_level": "ERROR",
    
    # Training parameters
    "horizon": 60,
    "num_sgd_iter": 15,
    "lr": 1e-4,
    
    # Policy parameters
    "vf_loss_coeff": 1e-6,    
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": lambda agent_id: "policy_0",
    },
    
    # Model parameters
    "model": {
        "custom_preprocessor": "tree_obs_preprocessor",
        "custom_model": custom_model,
        "custom_options": {
            "tree_depth": env_config["obs_config"]["max_depth"],
            "observation_radius": 0,
            **model_params,
        },
    },
}
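
The `policy_mapping_fn` in the configuration above decides which policy controls each agent. The sketch below contrasts the shared mapping used here with a hypothetical per-agent alternative. It assumes agent IDs are the integers `0..number_of_agents-1`; the function names are illustrative and not part of RLlib.

```python
def shared_policy_mapping(agent_id):
    """Route every agent to a single policy, as the config above does."""
    return "policy_0"


def independent_policy_mapping(agent_id):
    """Hypothetical alternative: give each agent its own policy.

    Assumes agent_id is an integer index matching the keys of `policies`.
    """
    return f"policy_{agent_id}"


agent_ids = range(5)
print({a: shared_policy_mapping(a) for a in agent_ids})
print({a: independent_policy_mapping(a) for a in agent_ids})
```

With the shared mapping, experience from every train updates the same network, so the remaining entries in `policies` go unused. The per-agent mapping would train each of them separately.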

# Training with CPU vs. GPU

* We are going to compare training speed with and without GPUs.
* At the scale of the environment we are running today, we should see a difference within a few minutes of training.
* At larger scales, such as a full-sized rail network, the difference in speed is much more pronounced.

Let's get started by loading TensorBoard while we discuss the next steps.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=~/ray_results/

* Run the cell below to launch a short training run, using only CPUs.
* Then set `num_gpus` to the number of GPUs available on your machine and re-run the cell.
* Check the difference in TensorBoard. (Hint: change the smoothing to help visualize results.)
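
Smoothing helps because per-iteration reward curves are noisy. As a rough illustration, the sketch below applies a plain exponential moving average, which is similar in spirit to a smoothing slider. It is not TensorBoard's exact implementation, and the reward values are made-up numbers used only for the example.

```python
def ema_smooth(values, weight=0.6):
    """Exponential moving average; a higher weight gives a smoother curve."""
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    last = None
    for value in values:
        # The first point is kept as-is; later points blend with the running average
        last = value if last is None else weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


# Made-up noisy reward values, for illustration only
noisy_rewards = [-40.0, -31.0, -38.0, -27.0, -33.0, -22.0]
print(ema_smooth(noisy_rewards, weight=0.6))
```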

In [None]:
# Run parameters
config["num_workers"] = 7
config["num_gpus"] = 0  # TODO: set this to the number of available GPUs, then re-run

n_GPUS = config["num_gpus"]

tune.run(
    "PPO",
    name=f"PPO_multi_agent-MODEL={custom_model}_{u}-GPUS={n_GPUS}",
    # The run stops as soon as either criterion is reached
    stop={"episode_reward_mean": -18,
          "training_iteration": 50},
    config=config,
    checkpoint_freq=1,
    checkpoint_at_end=True,
    loggers=tune.logger.DEFAULT_LOGGERS,
    ray_auto_init=True,
)
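
The `stop` dictionary passed to `tune.run` above ends the run as soon as any one of its criteria is reached. The sketch below is a simplified re-implementation of that check for illustration only; it is not Tune's own code, and the helper name `should_stop` and the example result dictionaries are hypothetical.

```python
def should_stop(result: dict, stop: dict) -> bool:
    """Return True once any reported metric reaches its stopping threshold.

    Simplification: metrics missing from `result` are skipped.
    """
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"episode_reward_mean": -18, "training_iteration": 50}

# Reward target reached early: stop
print(should_stop({"episode_reward_mean": -17.5, "training_iteration": 3}, stop))
# Neither criterion reached yet: keep training
print(should_stop({"episode_reward_mean": -40.0, "training_iteration": 10}, stop))
```

In practice this means the experiment runs for at most 50 training iterations and ends earlier if the mean episode reward reaches -18.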