# Multi-Agent Reinforcement Learning with a Centralized Critic

* A major assumption of reinforcement learning is that the states of the environment obey the Markov property.
* In the multi-agent setting, this is not strictly true.
* From the perspective of one agent, all the other agents are considered part of the environment.
* However, the other agents are learning to improve their policies over time, meaning the environment is no longer stationary from the perspective of any given agent.  
* In adversarial settings, this can lead to catastrophic collapses in performance, as the agent fails to respond correctly to its opponent.
* In collaborative settings such as ours, we can run into issues when assigning credit (or blame) for events like collisions or gridlock.

<img src="https://storage.cloud.google.com/gtc-2020/images/gridlock.png">
[https://bair.berkeley.edu/blog/2018/12/12/rllib/]
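
To make the non-stationarity concrete, here is a minimal sketch using a made-up two-agent payoff table (the numbers and the wait/go framing are illustrative assumptions, not taken from the training environment). It shows that the expected reward of the *same* action changes as the other agent's policy changes, which is what a single learning agent experiences as a moving environment.

```python
import numpy as np

# Hypothetical payoff table for one agent: rows are our action, columns are the
# other agent's action. Actions: 0 = wait, 1 = go (two trains, one shared track).
payoff = np.array([
    [0.0, 1.0],   # we wait: no progress if both wait, fine if the other goes
    [1.0, -5.0],  # we go: fine if the other waits, collision if both go
])

def expected_reward(our_action, other_go_prob):
    """Expected reward of our action when the other agent goes with the given probability."""
    other_policy = np.array([1.0 - other_go_prob, other_go_prob])
    return float(payoff[our_action] @ other_policy)

# As the other agent's policy drifts during learning, the value of "go" drifts too.
for p in (0.1, 0.5, 0.9):
    print(f"other agent goes with p={p}: expected reward of 'go' = {expected_reward(1, p):.2f}")
```

From the first agent's point of view nothing about its own action changed, yet "go" moves from a good choice to a very bad one purely because the other agent kept learning.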

* One method of alleviating this problem is to share the observations and actions of each agent during training.
* In the previous sections, we implemented the PPO algorithm in a decentralized multi-agent setting.
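* In this section, we will run PPO again, but this time with a *centralized critic*: a value function that sees the observations and actions of all agents during training, while each policy still acts only on its own observation.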
* We call this new algorithm CCPPO.
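
As a rough illustration of what "sharing observations and actions" can look like, the sketch below builds the input vector a centralized critic might receive for one agent. The helper names (`one_hot`, `build_central_critic_input`), the observation size, and the number of actions are assumptions made for this example; the actual `CentralizedCriticModel` used in this notebook may arrange its inputs differently.

```python
import numpy as np

def one_hot(action, num_actions):
    """Encode a discrete action index as a one-hot float vector."""
    vec = np.zeros(num_actions, dtype=np.float32)
    vec[action] = 1.0
    return vec

def build_central_critic_input(own_obs, other_obs, other_actions, num_actions):
    """Concatenate our observation with every other agent's observation and one-hot action."""
    parts = [np.asarray(own_obs, dtype=np.float32)]
    for obs, action in zip(other_obs, other_actions):
        parts.append(np.asarray(obs, dtype=np.float32))
        parts.append(one_hot(action, num_actions))
    return np.concatenate(parts)

# Illustrative sizes only: 3 agents, 4-dimensional observations, 5 discrete actions.
rng = np.random.default_rng(0)
obs = [rng.random(4) for _ in range(3)]
actions = [2, 0, 4]

# Critic input for agent 0: its own observation plus the other two agents' data.
critic_in = build_central_critic_input(obs[0], obs[1:], actions[1:], num_actions=5)
print(critic_in.shape)  # 4 + 2 * (4 + 5) = 22 entries
```

Only the critic consumes this wider input. The actor still maps the agent's own observation to an action, so the trained policies can be executed without access to the other agents.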

In [None]:
from ray import tune
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env
from environments.env_utils import env_from_env_config
from environments.preprocessor import TreeObsPreprocessor
from models.centralized_critic_model import CentralizedCriticModel
from models.centralized_critic_policy import CCTrainer

In [None]:
# Set up a dense model with u hidden units per layer
u = 10
model_params = {
    "embedding": {"hidden_layers": [u, u], "activation_fn": "relu"},
    "actor": {"hidden_layers": [u], "activation_fn": "relu"},
    "critic": {"hidden_layers": [u], "activation_fn": "relu"},
    "central_vf_size": u,
}
custom_model = "cc_dense"

# Set up the environment
env_config = {
    "obs_config": {"max_depth": 2},
    "rail_generator": "complex_rail_generator",
    "rail_config": {"nr_start_goal": 12, "nr_extra": 0, "min_dist": 8, "seed": 10},
    "width": 8,
    "height": 8,
    "number_of_agents": 3,
    "schedule_generator": "complex_schedule_generator",
    "schedule_config": {},
    "frozen": False,
    "remove_agents_at_target": True,
    "wait_for_all_done": True
}
tmp_env = env_from_env_config(env_config)
action_space = tmp_env.action_space
observation_space = tmp_env.observation_space

# Define 1 policy per agent
num_policies = env_config["number_of_agents"]
policies = {f"policy_{i}": (None, observation_space, action_space, {"agent_id": i})
            for i in range(num_policies)}
policy_ids = list(policies.keys())

# Register custom setup with RLlib
register_env("train_env", env_from_env_config)
ModelCatalog.register_custom_model(custom_model, CentralizedCriticModel)
ModelCatalog.register_custom_preprocessor("tree_obs_preprocessor", TreeObsPreprocessor)

# Full experiment config
config = {
    # Run parameters
    "num_cpus_per_worker": 1,
    "num_cpus_for_driver": 1,
    "num_workers": 7,
    "num_gpus": 0,  # TODO: change this to 1
    
    # Environment parameters
    "env": "train_env",
    "env_config": env_config,
    "log_level": "ERROR",
    
    # Training parameters
    "horizon": 40,
    "num_sgd_iter": 15,
    "lr": 1e-4,
    "batch_mode": "complete_episodes",
    
    # Policy parameters
    "vf_loss_coeff": 1e-6,    
    "multiagent": {
        "policies": policies,
        # Agent IDs are used as integer indices into policy_ids (one policy per agent)
        "policy_mapping_fn": lambda agent_id: policy_ids[agent_id]
    },
    
    # Model parameters
    "model": {
        "custom_preprocessor": "tree_obs_preprocessor",
        "custom_model": custom_model,
        "custom_options": {
            "tree_depth": env_config["obs_config"]["max_depth"],
            "observation_radius": 0,
            "max_num_agents": env_config["number_of_agents"],
            **model_params,
        },
    },
}
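
The one-policy-per-agent mapping in the config above can be checked in isolation. This is a small stand-alone sketch with stand-in values (it does not touch RLlib), assuming agent IDs are the integers `0` to `number_of_agents - 1`, which is what indexing `policy_ids` by `agent_id` requires.

```python
# Stand-in values mirroring the configuration above (no RLlib required).
number_of_agents = 3
policy_ids = [f"policy_{i}" for i in range(number_of_agents)]

def policy_mapping_fn(agent_id):
    # Assumes agent IDs are the integers 0..number_of_agents-1.
    return policy_ids[agent_id]

# Each agent should be routed to its own, distinct policy.
mapping = {agent_id: policy_mapping_fn(agent_id) for agent_id in range(number_of_agents)}
print(mapping)
```

If the environment used non-integer agent IDs, the list lookup would fail and the mapping function would need a dictionary instead.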

## Start TensorBoard
* You should see the training results from the previous notebook.
* CCPPO will typically learn cooperative strategies faster than decentralized PPO.
* Results can vary at small scales, but we will see results from slightly larger runs in the following notebook.
* While your models train, we will move to the final section to visualize some rollouts.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=~/ray_results/

In [None]:
tune.run(
    CCTrainer,
    name=f"CCPPO-MODEL_{custom_model}_{u}",
    stop={"training_iteration": 80},
    config=config,
    checkpoint_freq=1,
    checkpoint_at_end=True,
    loggers=tune.logger.DEFAULT_LOGGERS,
    ray_auto_init=True,
)