# Tutorial - Evaluations - Level 1 best-response, self-play and cross-play

## Install the toolbox (and Ray, Tune, RLLib, PyTorch, etc.)

We need Python 3.6 or 3.7 to use the LOLA algorithm from the toolbox.

In [None]:
import sys
print("python --version", sys.version_info)
if sys.version_info[0] != 3 or (sys.version_info[1] != 6 and sys.version_info[1] != 7) :
    raise Exception("Must be using Python 3.6 or 3.7")

If you are running on Google Colab (recommended), uncomment the cell below to install the necessary dependencies.


In [None]:
# print("Setting up colab environment")

# !pip uninstall -y pyarrow
# !pip install bs4
# !git clone https://github.com/longtermrisk/marltoolbox.git
# !pip install -e marltoolbox/.[lola]
# !pip uninstall -y dataclasses

# # # A hack to force the runtime to restart, needed to include the above dependencies.
# print("Done installing! Restarting via forced crash (this is not an issue).")
# import os
# os._exit(0)

**After you run the cell above, comment all its lines.**

## Plan  

3. Using the experimentation tools in the toolbox  
  a. Evaluate the self-play and cross-play performances  
  b. Evaluate the exploitability of an algorithm using Level 1 Best-Response (L1BR)  
  
(Sections 1 and 2 are in the tutorial: Basics - How to use the toolbox)  


## Requirements

You have done the first tutorial (Basics - How to use the toolbox).

# 3. Using the experimentation tools in the toolbox

## a. Evaluate the self-play and cross-play performances 

In [None]:
import os
import copy

import numpy as np 

import ray
from ray import tune
from ray.rllib.evaluation.sample_batch_builder import MultiAgentSampleBatchBuilder
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.agents.ppo import PPOTrainer, PPOTorchPolicy
from ray.rllib.agents.ppo.ppo_torch_policy import setup_mixins

from marltoolbox.utils import log, miscellaneous
from marltoolbox.envs.matrix_sequential_social_dilemma import IteratedBoS
from marltoolbox.utils import self_and_cross_perf, restore
from marltoolbox.utils.plot import PlotConfig
from marltoolbox.envs.utils.wrappers import add_RewardUncertaintyEnvClassWrapper
from marltoolbox.utils.miscellaneous import check_learning_achieved

We need to train some agents with different random seeds so that we can then compute their self-play and cross-play performances after deployment.

We are going to train PPO agents on the Bach or Stravinsky (BoS) game using the `RLLib` API.


In [None]:
bos_env_payoffs = IteratedBoS({}).PAYOUT_MATRIX
for a_1, action_player_1 in enumerate(["Bach","Stravinsky"]):
    for a_2, action_player_2 in enumerate(["Bach","Stravinsky"]):
        print(f"Payoffs for action pair ({action_player_1},{action_player_2}): " 
              f"({bos_env_payoffs[a_1][a_2][0]},{bos_env_payoffs[a_1][a_2][1]})")

Here is the `RLLib` configuration for this training. We will not go through it in detail:

In [None]:
def get_trainer_config(hp):
    train_n_replicates = hp["train_n_replicates"]
    seeds = miscellaneous.get_random_seeds(train_n_replicates)
    exp_name, _ = log.log_in_current_day_dir("PPO_BoS")

    # This modification to the policy allows us to load each policy from a different checkpoint.
    # This will be used during evaluation.
    def merged_after_init(*args, **kwargs):
        setup_mixins(*args, **kwargs)
        restore.after_init_load_policy_checkpoint(*args, **kwargs)
    MyPPOPolicy = PPOTorchPolicy.with_updates(after_init=merged_after_init)

    stop_config = {
        "episodes_total": hp["episodes_total"],
    }

    env_config = {
        "players_ids": ["player_row", "player_col"],
        "max_steps": hp["steps_per_epi"],  # Length of an episode
    }

    trainer_config = {
        # We add some variance on the reward returned by the environment
        "env": add_RewardUncertaintyEnvClassWrapper(
                  IteratedBoS,
                  reward_uncertainty_std=0.1),
        "env_config": env_config,

        "multiagent": {
            "policies": {
                env_config["players_ids"][0]: (MyPPOPolicy,
                                               IteratedBoS.OBSERVATION_SPACE,
                                               IteratedBoS.ACTION_SPACE,
                                               {}),
                env_config["players_ids"][1]: (MyPPOPolicy,
                                               IteratedBoS.OBSERVATION_SPACE,
                                               IteratedBoS.ACTION_SPACE,
                                               {}),
            },
            "policy_mapping_fn": lambda agent_id: agent_id,
        },

        #### PPO config ####
        # Size of batches collected from each worker.
        "rollout_fragment_length": hp["steps_per_epi"], 
        # Number of timesteps collected for each SGD round. This defines the size
        # of each SGD epoch.
        "train_batch_size": hp["steps_per_epi"]*3, 
        # Total SGD batch size across all devices for SGD. This defines the
        # minibatch size within each epoch.
        "sgd_minibatch_size": hp["steps_per_epi"],
        # Number of SGD iterations in each outer loop (i.e., number of epochs to
        # execute per train batch).
        "num_sgd_iter": 3,
        "model": {
            # Sizes of the hidden layers of the fully connected net
            "fcnet_hiddens": [4, 2],
            # Nonlinearity for fully connected net (tanh, relu)
            "fcnet_activation": "relu",
        },


        "lr": hp["base_lr"],
        "lr_schedule": [(0, hp["base_lr"]),
                (int(hp["steps_per_epi"] * hp["episodes_total"]), hp["base_lr"] / 1e9)],
    
        "seed": tune.grid_search(seeds),
        "callbacks": log.get_logging_callbacks_class(),
        "framework": "torch",
        "num_workers":0,
    }

    return trainer_config, env_config, stop_config
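
The `lr_schedule` entry in this configuration lists two `(timestep, learning_rate)` points. Assuming RLlib interpolates linearly between such points, the learning rate decays from `base_lr` to almost zero over the whole training. Here is a minimal sketch of that interpolation with NumPy, where `lr_at` is a hypothetical helper written for this illustration and not part of RLlib or the toolbox:

```python
import numpy as np

# Values taken from the hyperparameters used later in this tutorial.
steps_per_epi = 20
episodes_total = 200
base_lr = 5e-1

# Same two (timestep, lr) points as in the "lr_schedule" entry.
schedule = [(0, base_lr),
            (steps_per_epi * episodes_total, base_lr / 1e9)]


def lr_at(timestep, schedule):
    """Hypothetical helper: linear interpolation between the schedule points."""
    timesteps, lrs = zip(*schedule)
    return float(np.interp(timestep, timesteps, lrs))


for t in (0, 2000, 4000):
    print(f"timestep {t}: lr = {lr_at(t, schedule):.4f}")
```

Halfway through training (timestep 2000) the learning rate is about half of `base_lr`, and it is practically zero at the last timestep.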

Let's train 8 pairs of PPO agents: 

In [None]:
hyperparameters = {
    "steps_per_epi": 20,
    "train_n_replicates": 8,
    "episodes_total": 200,
    "exp_name": "PPO_BoS",
    "base_lr": 5e-1,
}

trainer_config, _, stop_config = get_trainer_config(hyperparameters)
ray.shutdown()
ray.init(num_cpus=os.cpu_count(), num_gpus=0, local_mode=False)
tune_analysis = tune.run(PPOTrainer, config=trainer_config, stop=stop_config,
                    checkpoint_freq=0, checkpoint_at_end=True, name=hyperparameters["exp_name"],
                    metric="episode_reward_mean", mode="max")
ray.shutdown()

We now have 16 PPO agents (8 pairs) trained with 8 different random seeds, which perform well on BoS (check that the values in the "reward" columns are close to 100).  
We will be able to load these agents using the checkpoints we saved at the end of the training.
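
The value of 100 can be recovered from the payoffs printed earlier. As a rough check, assume that the two coordinated outcomes pay (3, 2) and (2, 3) and that mis-coordination pays (0, 0). The matrix below is an assumption to compare against the printed `PAYOUT_MATRIX`, not a value read from the toolbox. A pair that coordinates at every step then collects 5 per step, summed over both players:

```python
import numpy as np

# ASSUMED BoS payoffs, indexed [action_row][action_col] -> (row payoff, col payoff).
# Compare with the PAYOUT_MATRIX printed above before relying on them.
assumed_bos_payoffs = np.array([[[3.0, 2.0], [0.0, 0.0]],
                                [[0.0, 0.0], [2.0, 3.0]]])
steps_per_epi = 20  # same value as in `hyperparameters`

# Both players always pick Bach (action 0): a fully coordinated episode.
per_step = assumed_bos_payoffs[0][0]
episode_reward = steps_per_epi * per_step.sum()
print("summed episode reward:", episode_reward)  # 100.0
```

The reward noise added by the wrapper (`reward_uncertainty_std=0.1`) means that the values observed in practice fluctuate slightly around this number.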

In [None]:
print("location of the best checkpoint",tune_analysis.best_checkpoint)
tune_analysis_per_exp = {"": tune_analysis}

We will use the `SelfAndCrossPlayEvaluator` from the toolbox to evaluate the self-play and cross-play performances.

In [None]:
def evaluate_self_and_cross_perf(tune_analysis_per_exp, hp):
    eval_config, env_config, stop_config, hp_eval = generate_eval_config(hp)

    evaluator = self_and_cross_perf.SelfAndCrossPlayEvaluator(exp_name=hp_eval["exp_name"])
    analysis_metrics_per_mode = evaluator.perform_evaluation_or_load_data(
        evaluation_config=eval_config, 
        stop_config=stop_config,
        policies_to_load_from_checkpoint=copy.deepcopy(env_config["players_ids"]),
        tune_analysis_per_exp=tune_analysis_per_exp,
        TrainerClass=PPOTrainer,
        n_cross_play_per_checkpoint=2)

    # Specify how to plot
    plot_config = PlotConfig(xlim=hp_eval["x_limits"], ylim=hp_eval["y_limits"],
                             markersize=5, alpha=1.0, jitter=hp_eval["jitter"],
                             xlabel="player 1 payoffs", ylabel="player 2 payoffs",
                             title="self-play and cross-play performances: BoS",
                             x_scale_multiplier=hp_eval["scale_multipliers"][0],
                             y_scale_multiplier=hp_eval["scale_multipliers"][1])
    
    evaluator.plot_results(analysis_metrics_per_mode, plot_config=plot_config,
                           x_axis_metric=f"policy_reward_mean/{env_config['players_ids'][0]}",
                           y_axis_metric=f"policy_reward_mean/{env_config['players_ids'][1]}")

def generate_eval_config(hp):
    
    hp_eval = copy.deepcopy(hp)
    hp_eval["steps_per_epi"]= 20
    hp_eval["episodes_total"]= 1
    hp_eval["scale_multipliers"] = (1/hp_eval["steps_per_epi"], 1/hp_eval["steps_per_epi"])
    hp_eval["base_lr"]= 0.0
    hp_eval["jitter"]= 0.05
    hp_eval["x_limits"]= (-0.5,3.5)
    hp_eval["y_limits"]= (-0.5,3.5)

    eval_config, env_config, stop_config = get_trainer_config(hp_eval)

    eval_config["explore"] = False
    eval_config["seed"] = miscellaneous.get_random_seeds(1)[0]
    eval_config["train_batch_size"] = hp_eval["steps_per_epi"]

    return eval_config, env_config, stop_config, hp_eval

ray.shutdown()
evaluate_self_and_cross_perf(tune_analysis_per_exp, hyperparameters)
ray.shutdown()

The plot shows the self-play and cross-play performances. You should see some failures in cross-play (points close to (0,0)).  
These failures are explained by the fact that each pair of PPO agents only learned to coordinate on playing either Bach or Stravinsky. They have not learned to adapt to a change of behavior in the other player.
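
The following toy sketch illustrates why cross-play can fail even when every pair succeeds in self-play. It does not use the toolbox: the payoff matrix is an assumption (check it against the printed `PAYOUT_MATRIX`), and `conventions` is a made-up training outcome in which each pair settled on a single action.

```python
import itertools

import numpy as np

# ASSUMED BoS payoffs, indexed [action_row][action_col] -> (row payoff, col payoff).
# Check them against the PAYOUT_MATRIX printed earlier in this tutorial.
assumed_bos_payoffs = np.array([[[3.0, 2.0], [0.0, 0.0]],
                                [[0.0, 0.0], [2.0, 3.0]]])

# Hypothetical training outcome: the action each trained pair converged to.
# 0 = Bach, 1 = Stravinsky (illustrative values, not real results).
conventions = [0, 1, 1, 0]


def per_step_payoffs(row_pair, col_pair):
    """Payoffs when the row player of one pair meets the column player of another."""
    a_row = conventions[row_pair]
    a_col = conventions[col_pair]
    row_payoff, col_payoff = assumed_bos_payoffs[a_row][a_col]
    return float(row_payoff), float(col_payoff)


pair_ids = range(len(conventions))
self_play = [per_step_payoffs(i, i) for i in pair_ids]
cross_play = [per_step_payoffs(i, j)
              for i, j in itertools.permutations(pair_ids, 2)]

n_failures = cross_play.count((0.0, 0.0))
print("self-play payoffs:", self_play)
print(f"cross-play failures: {n_failures} out of {len(cross_play)}")
```

Every self-play match-up coordinates, while any cross-play match-up between pairs with different conventions lands on (0, 0).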

##  b. Evaluate the exploitability of an algorithm using Level 1 Best-Response (L1BR)


In [None]:
import os
import copy

import torch

import ray
from ray import tune
from ray.rllib.agents.pg import PGTrainer, PGTorchPolicy
from ray.rllib.utils.typing import TrainerConfigDict

from marltoolbox.envs.matrix_sequential_social_dilemma import IteratedPrisonersDilemma
from marltoolbox.utils import log, miscellaneous, exploration, lvl1_best_response, policy
from marltoolbox.algos import population
from marltoolbox.algos.lola.train_exact_tune_class_API import LOLAExact
from marltoolbox.utils.miscellaneous import check_learning_achieved

We are going to check whether `LOLAExact` is exploitable after deployment in the `IteratedPrisonersDilemma` environment. We will train two populations of agents. First, the level 0 agents will use the `LOLA-Exact` policy and will be trained in self-play. Then we will freeze their weights, as if they were deployed in production. Finally, we will train level 1 PolicyGradient (PG) agents against this population of level 0 agents.  

Here are the payoffs in the `IteratedPrisonersDilemma`:

In [None]:
ipd_env_payoffs = IteratedPrisonersDilemma({}).PAYOUT_MATRIX
for a_1, action_player_1 in enumerate(["Coop","Defect"]):
    for a_2, action_player_2 in enumerate(["Coop","Defect"]):
        print(f"payoffs for action pair ({action_player_1},{action_player_2}): " 
              f"({ipd_env_payoffs[a_1][a_2][0]},{ipd_env_payoffs[a_1][a_2][1]})")
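
To make "exploitable" concrete before training anything: against a frozen opponent, a level 1 agent looks for the action with the highest payoff. Below is a minimal one-shot sketch. It assumes the payoffs are -1 for mutual cooperation, -2 for mutual defection, and 0 / -3 when one player defects on a cooperator (check this against the printed `PAYOUT_MATRIX`). The frozen opponent that cooperates with a fixed probability is hypothetical.

```python
import numpy as np

# ASSUMED IPD payoffs, indexed [action_row][action_col] -> (row payoff, col payoff).
# 0 = Coop, 1 = Defect. Check them against the PAYOUT_MATRIX printed above.
assumed_ipd_payoffs = np.array([[[-1.0, -1.0], [-3.0, 0.0]],
                                [[0.0, -3.0], [-2.0, -2.0]]])


def row_expected_payoffs(p_col_coop):
    """Expected payoff of each row action against a frozen column player."""
    col_probs = np.array([p_col_coop, 1.0 - p_col_coop])
    return assumed_ipd_payoffs[:, :, 0] @ col_probs


payoffs = row_expected_payoffs(p_col_coop=0.9)
best_response = ["Coop", "Defect"][int(np.argmax(payoffs))]
print("expected payoffs (Coop, Defect):", payoffs)
print("best response:", best_response)
```

An opponent that never reacts is always best exploited by defecting. A trained agent may react to past actions, so whether it can still be exploited is what the rest of this section measures empirically.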

To train the level 0 `LOLAExact` agents, we use the `Tune` class API because the current implementation of `LOLAExact` doesn't follow the `RLLib` API. 

In [None]:
def train_lvl0_agents(lvl0_hparameters):

    tune_config = get_tune_config(lvl0_hparameters)
    stop_config = get_stop_config(lvl0_hparameters)
    ray.shutdown()
    ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
    tune_analysis_lvl0 = tune.run(LOLAExact, name="Lvl0_LOLAExact", config=tune_config,
                                  checkpoint_at_end=True, stop=stop_config, 
                                  metric=lvl0_hparameters["metric"], mode="max")
    ray.shutdown()
    return tune_analysis_lvl0

def get_tune_config(hp: dict) -> dict:
    tune_config = copy.deepcopy(hp)
    return tune_config

def get_env_config(hp):
    env_config = {
        "players_ids": ["player_row", "player_col"],
        "max_steps": hp["trace_length"],
        "get_additional_info": True,
    }
    return env_config

def get_stop_config(hp):
    stop_config = {
        "episodes_total": hp["num_episodes"]
    }
    return stop_config

We train 8 level 0 `LOLAExact` agents. This is going to take around 15 minutes. You can read ahead through the next steps in the meantime:

In [None]:
train_n_replicates = 8

lvl0_hparameters = {
    "train_n_replicates": train_n_replicates,
    "env": "IPD",
    "num_episodes": 50,
    "trace_length": 200,
    "simple_net": True,
    "corrections": True,
    "pseudo": False,
    "num_hidden": 32,
    "reg": 0.0,
    "lr": 1.,
    "lr_correction": 1.0,
    "gamma": 0.96,
    "metric": "ret1",

    # We use the Tune hyperparameter search API to train several agents in parallel
    "seed": tune.grid_search(miscellaneous.get_random_seeds(train_n_replicates)),
}

tune_analysis_lvl0 = train_lvl0_agents(lvl0_hparameters)

check_learning_achieved(tune_results=tune_analysis_lvl0, metric="ret1", 
                        min_=-1.5, trial_idx=0)

`LOLAExact` should learn to cooperate in `IteratedPrisonersDilemma`: the rewards of players 1 and 2 ("ret1" and "ret2") should be close to -1.  
Yet `LOLAExact` regularly fails to cooperate. We are thus going to filter out the failures:

In [None]:
filtered_tune_analysis_lvl0 = miscellaneous.filter_tune_results(
    tune_analysis_lvl0,
    metric="ret1",
    metric_threshold=-1.4,
    metric_mode="last", 
    threshold_mode="above")
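
As a rough picture of what this filtering amounts to, the sketch below keeps the trials whose last "ret1" value is above the threshold. It mirrors the arguments passed above on plain Python data. The trial names and values are made up, and this is not the toolbox's actual implementation.

```python
# Hypothetical last "ret1" value of each trial (not real results).
last_ret1_per_trial = {
    "trial_0": -1.02,
    "trial_1": -1.95,
    "trial_2": -1.10,
    "trial_3": -1.60,
}
metric_threshold = -1.4

# threshold_mode="above": keep only the trials that end above the threshold.
kept = {name: ret1 for name, ret1 in last_ret1_per_trial.items()
        if ret1 > metric_threshold}
print("kept trials:", sorted(kept))
```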

We now have several pairs of `LOLAExact` agents trained in self-play. Since we are playing `IteratedPrisonersDilemma`, we may fear that an opponent could exploit our agents after they have been deployed.   

We are going to look at precisely that. We will train level 1 PolicyGradient agents that learn while the `LOLAExact` agents are frozen (not learning any more). The PolicyGradient agents will learn by playing against a population of `LOLAExact` agents. This simulates the fact that, when training the exploiter, we may not know which `LOLAExact` agent will actually be deployed, and thus we want to produce an agent that exploits any `LOLAExact` agent.
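
The idea of playing against a population can be sketched without the toolbox: before each episode, one frozen opponent is drawn from the population. The checkpoint names below are hypothetical, and uniform random sampling is an assumption made for this illustration (the toolbox may select opponents differently).

```python
import random
from collections import Counter

# Hypothetical names standing in for the frozen level 0 checkpoints.
frozen_lvl0_checkpoints = ["lola_seed_0", "lola_seed_1", "lola_seed_2"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

n_episodes = 1000
# One opponent is selected before each episode and kept for the whole episode.
opponent_per_episode = [rng.choice(frozen_lvl0_checkpoints)
                        for _ in range(n_episodes)]

usage = Counter(opponent_per_episode)
print("episodes played against each frozen opponent:", dict(usage))
```

Because the level 1 agent cannot tell in advance which opponent it faces, it is pushed toward a strategy that works against all of them.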

In [None]:
def train_lvl1_agents(hp_lvl1, tune_analysis_lvl0):

    rllib_config_lvl1, trainable_class, env_config = get_rllib_config(hp_lvl1)
    stop_config = get_stop_config(hp_lvl1)
    
    # We use a helper to extract all the checkpoints saved in tune_analysis_lvl0
    checkpoints_lvl0 = miscellaneous.extract_checkpoints(tune_analysis_lvl0)
    
    # We modify the rllib_config to use a population of level 0 agents
    rllib_config_lvl1 = modify_config_for_lvl1_training(hp_lvl1, env_config, rllib_config_lvl1, checkpoints_lvl0)

    ray.shutdown()
    ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
    tune_analysis_lvl1 = ray.tune.run(PGTrainer, config=rllib_config_lvl1,
                                      stop=stop_config,
                                      checkpoint_at_end=True,
                                      metric="episode_reward_mean", mode="max",
                                      name="Lvl1_PG")
    ray.shutdown()
    return tune_analysis_lvl1


def get_rllib_config(hp_eval):

    env_config = get_env_config(hp_eval)
    
    tune_config = get_tune_config(hp_eval)
    tune_config["TuneTrainerClass"] = LOLAExact

    rllib_config_lvl1 = {
        "env": IteratedPrisonersDilemma,
        "env_config": env_config,
        "multiagent": {
            "policies": {
                env_config["players_ids"][0]: (
                    PGTorchPolicy,
                    IteratedPrisonersDilemma.OBSERVATION_SPACE,
                    IteratedPrisonersDilemma.ACTION_SPACE,
                    {}),

                env_config["players_ids"][1]: (
                    # We use a class to convert a Tune class into a frozen RLLib Policy
                    policy.get_tune_policy_class(PGTorchPolicy),
                    IteratedPrisonersDilemma.OBSERVATION_SPACE,
                    IteratedPrisonersDilemma.ACTION_SPACE,
                    # The tune_config contains the information needed by the Tune class
                    {"tune_config": copy.deepcopy(tune_config)}),
            },
            "policy_mapping_fn": lambda agent_id: agent_id,
        },
        "seed": hp_eval["seed"],
        "min_iter_time_s": hp_eval["min_iter_time_s"],
    }

    policies_to_load = copy.deepcopy(env_config["players_ids"])
    trainable_class = LOLAExact
    

    return rllib_config_lvl1, trainable_class, env_config


def modify_config_for_lvl1_training(hp_lvl1, env_config, rllib_config_lvl1, lvl0_checkpoints):
    
    # The level 0 agents will be player 2 and the level 1 agents will be player 1
    lvl0_policy_idx = 1
    lvl1_policy_idx = 0
    lvl0_policy_id = env_config["players_ids"][lvl0_policy_idx]
    lvl1_policy_id = env_config["players_ids"][lvl1_policy_idx]


    # We add a callback needed by the PopulationOfIdenticalAlgo policy
    rllib_config_lvl1["callbacks"] = miscellaneous.merge_callbacks(
        population.PopulationOfIdenticalAlgoCallBacks,
        log.get_logging_callbacks_class(log_env_step=True, log_from_policy=True))
    

    # Finally, we use a helper to replace player_2's policy (LOLA-Exact) with a PopulationOfIdenticalAlgo policy
    #   that uses nested LOLA-Exact policies.
    # Before each episode, this PopulationOfIdenticalAlgo will switch between the available LOLAExact agents.
    l1br_configuration_helper = lvl1_best_response.L1BRConfigurationHelper(rllib_config_lvl1, lvl0_policy_id, lvl1_policy_id)
    l1br_configuration_helper.define_exp(
        use_n_lvl0_agents_in_each_population=hp_lvl1["n_seeds_lvl0"] // hp_lvl1["n_seeds_lvl1"],
        train_n_lvl1_agents=hp_lvl1["n_seeds_lvl1"],
        lvl0_checkpoints=lvl0_checkpoints)
    rllib_config_lvl1 = l1br_configuration_helper.prepare_config_for_lvl1_training()
    

    return rllib_config_lvl1
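
The size of each level 0 population comes from the integer division passed to `define_exp`. The arithmetic is worth checking by hand, because the number of level 0 agents depends on how many trials survived the filtering. The counts below are hypothetical, and how the helper assigns checkpoints to populations is not covered by this sketch.

```python
# Hypothetical count: 7 of the 8 LOLAExact trials passed the filter.
n_seeds_lvl0 = 7
n_seeds_lvl1 = 2

# Same expression as the one passed to `define_exp`.
use_n_lvl0_agents_in_each_population = n_seeds_lvl0 // n_seeds_lvl1
n_lvl0_needed = use_n_lvl0_agents_in_each_population * n_seeds_lvl1

print("level 0 agents per population:", use_n_lvl0_agents_in_each_population)
print("level 0 agents needed in total:", n_lvl0_needed)

# Edge case: with a single surviving level 0 agent the division gives 0.
print("with 1 surviving agent:", 1 // n_seeds_lvl1)
```

If too few level 0 agents pass the filter, the population size drops to zero, so it is worth printing `len(filtered_tune_analysis_lvl0.trials)` before launching the level 1 training.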

We train 2 level 1 PolicyGradient agents:

In [None]:
lvl1_hparameters = copy.deepcopy(lvl0_hparameters)
lvl1_hparameters.update({
    "n_seeds_lvl0": len(filtered_tune_analysis_lvl0.trials),
    "n_seeds_lvl1": 2,
    "min_iter_time_s": 0.0,
    "batch_size": 1, # To work with RLLib
    "num_episodes": 1000,
    "trace_length": 10,
    "seed": None, # The seeds will be added by the L1BRConfigurationHelper
    })

tune_analysis_lvl1 = train_lvl1_agents(lvl1_hparameters, filtered_tune_analysis_lvl0)

check_learning_achieved(tune_results=tune_analysis_lvl1, 
                        max_=-25, trial_idx=0)

print("All metrics:", list(tune_analysis_lvl1.results_df.columns))

In [None]:
print("Averaged state during the last episode for each seed:")
print("Playing CC:", tune_analysis_lvl1.results_df["custom_metrics.CC/player_col_mean"].tolist())
print("Playing CD:", tune_analysis_lvl1.results_df["custom_metrics.CD/player_col_mean"].tolist())
print("Playing DC:", tune_analysis_lvl1.results_df["custom_metrics.DC/player_col_mean"].tolist())
print("Playing DD:", tune_analysis_lvl1.results_df["custom_metrics.DD/player_col_mean"].tolist())

Each index in these lists refers to one seed used during the level 1 training.  
You should observe that the DC action pair is the most played. This means that, most of the time, the level 1 agent defects (player 1 plays D) while the level 0 agent continues to cooperate (player 2 plays C).  

In [None]:
print("Level 1 agents, player 1, mean rewards:", tune_analysis_lvl1.results_df["policy_reward_mean.player_row"].tolist())
print("Level 0 agents, player 2, mean rewards:", tune_analysis_lvl1.results_df["policy_reward_mean.player_col"].tolist())

Here, `LOLAExact` is exploited by the level 1 PolicyGradient agents. This is confirmed by the rewards accumulated during the last episode (10 steps).
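
The link between the state frequencies and the rewards can be checked by hand. The sketch below converts made-up frequencies into expected episode rewards over 10 steps. It assumes payoffs of -1 for CC, -2 for DD, and 0 / -3 for the defector / cooperator in DC and CD (check this against the `PAYOUT_MATRIX` printed earlier). The frequencies are illustrative, not real results.

```python
# ASSUMED IPD payoffs as (player 1 payoff, player 2 payoff).
# The first letter is player 1's action, the second is player 2's action.
assumed_payoffs = {
    "CC": (-1.0, -1.0),
    "CD": (-3.0, 0.0),
    "DC": (0.0, -3.0),
    "DD": (-2.0, -2.0),
}

# Hypothetical fraction of steps spent in each state for one seed.
state_frequencies = {"CC": 0.1, "CD": 0.0, "DC": 0.8, "DD": 0.1}
steps_per_episode = 10

expected_lvl1 = steps_per_episode * sum(
    freq * assumed_payoffs[state][0] for state, freq in state_frequencies.items())
expected_lvl0 = steps_per_episode * sum(
    freq * assumed_payoffs[state][1] for state, freq in state_frequencies.items())

print(f"level 1 (player 1) expected episode reward: {expected_lvl1:.1f}")
print(f"level 0 (player 2) expected episode reward: {expected_lvl0:.1f}")
```

With these frequencies the exploiter loses almost nothing while the frozen agent ends up far below the reward of mutual cooperation (-10 over 10 steps under the assumed payoffs).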

##  c. Use TensorBoard to visualize the trainings


You can uncomment the cells below and use TensorBoard to view the performance of each trial.

In [None]:
# %load_ext tensorboard

In [None]:
# %tensorboard --logdir /root/ray_results/ # On Google Colab
# %tensorboard --logdir ~/ray_results/ # On your machine

# You can filter the graphs with ".*mean.*|episode_reward_mean|ret1|ret2"