# Tutorial: How to use the `marltoolbox`

**Overview of the toolbox**

---



**Goal**: Facilitate and speed up research on bargaining in multi-agent reinforcement learning (MARL). 

**Components:** We rely on two main components: 
- The [`Ray/Tune/RLLib` framework](https://docs.ray.io/en/master/rllib.html):
  which we use as a research framework (and which is agnostic to the deep learning framework used). 
- A toolbox: this repository with specific contents related to bargaining in 
  MARL.

**Concrete value of using `Tune`/`RLLib` + `marltoolbox`**:  
- **with 1h of practice:**   
    track your experiments, 
    log easily in TensorBoard, run hyperparameter search, 
    use the provided environments and run the provided algorithms, 
    mostly agnostic to the deep learning framework, 
    create custom algorithms using the `Tune` API 
- **with 10h of practice:**  
    use some of the components of `RLLib` 
    (like using a PPO agent in your custom algorithms), use checkpoints, 
    use the experimentation tools provided here, create new environments, 
    create simple custom algorithms with the `RLLib` API
- **with more than 10h of practice:**  
    build custom distributed algorithms,
    use all of the components of `RLLib`, 
    use the fully customizable training pipeline of `RLLib`,
    create complex custom algorithms with the `RLLib` API  
  
**Code**: https://github.com/longtermrisk/marltoolbox/tree/master/marltoolbox

## Install the toolbox (and Ray, Tune, RLLib, PyTorch, etc.)

If you are running on Google Colab (recommended), uncomment the lines of the cell below to install the necessary dependencies.

In [None]:
# print("Setting up colab environment")

# !pip uninstall -y pyarrow
# !pip install bs4
# !git clone https://github.com/longtermrisk/marltoolbox.git
# !pip install -e marltoolbox/.
# !pip uninstall -y dataclasses

# # Needed for TensorBoard
# !pip install tensorflow

# # # A hack to force the runtime to restart, needed to include the above dependencies.
# print("Done installing! Restarting via forced crash (this is not an issue).")
# import os
# os._exit(0)

**After you have run the cell above, comment out all its lines again.**

## Plan

1. Running experiments using the `Tune` class API  
  a. Using the `IteratedPrisonersDilemma` environment from `marltoolbox` and components from `RLLib`   
  b. Using `Tune`'s hyperparameter search functionality 

2. Running experiments using the `RLLib` API  
  a. Using the `IteratedPrisonersDilemma` environment and the `LTFT` algorithm from `marltoolbox`    
  b. Using some `RLLib` functionalities    
  c. Using TensorBoard to visualize the training runs



## Requirements

Be sure to have read, at least, the following quick introductions to `Tune` and `RLLib`:  
- Quick introduction to [`Tune`'s key concepts](https://docs.ray.io/en/master/tune/key-concepts.html) (< 5 min).  
- Quick introduction: [`RLlib` in 60 seconds](https://docs.ray.io/en/master/rllib.html#rllib-in-60-seconds) (< 5 min).  
- The README of the `Ray`/`Tune`/`RLLib` project: [`Ray` on GitHub](https://github.com/ray-project/ray) (< 5 min).

# 1. Running experiments using the `Tune` class API  

In [None]:
import os

import numpy as np 

import ray
from ray import tune
from ray.rllib.agents.pg import PGTorchPolicy, DEFAULT_CONFIG
from ray.rllib.evaluation.sample_batch_builder import MultiAgentSampleBatchBuilder
from ray.rllib.agents.callbacks import DefaultCallbacks

from marltoolbox.envs.matrix_sequential_social_dilemma import IteratedPrisonersDilemma
from marltoolbox.utils.miscellaneous import check_learning_achieved

## a. Using the `IteratedPrisonersDilemma` environment from `marltoolbox` and components from `RLLib`  

We use the `Tune` class API, which requires a Trainable class with, at minimum, a `setup` method and a `step` method, like this:


In [None]:
class TrainableWIP(tune.Trainable):
    def setup(self, config):
        # config (dict): A dict of hyperparameters
        self.fake_score = 0.0  # Placeholder metric, replaced by a real one later

    def step(self):  # This is called iteratively to train the agents.
        pass
        return {"fake_score": self.fake_score}
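
To make the contract between `Tune` and a Trainable concrete, here is a minimal pure-Python sketch of the loop that `Tune` conceptually runs: it calls `setup` once with the config, then calls `step` repeatedly until the stop criterion is met. `MiniTrainable` and `run_trainable` are hypothetical names used only for this illustration; the real `tune.run` also handles logging, checkpointing and scheduling.

```python
# Simplified sketch of the loop that Tune runs around a Trainable.
# `MiniTrainable` and `run_trainable` are hypothetical, not part of Tune.


class MiniTrainable:
    """Mimics the two methods required by the Tune class API."""

    def setup(self, config):
        # Called once, with the dict of hyperparameters.
        self.increment = config["increment"]
        self.fake_score = 0.0

    def step(self):
        # Called repeatedly; returns a dict of metrics for this iteration.
        self.fake_score += self.increment
        return {"fake_score": self.fake_score}


def run_trainable(trainable_cls, config, stop):
    """Call setup() once, then step() until the stop criterion is reached."""
    trainable = trainable_cls()
    trainable.setup(config)
    results = []
    for training_iteration in range(1, stop["training_iteration"] + 1):
        metrics = trainable.step()
        metrics["training_iteration"] = training_iteration
        results.append(metrics)
    return results


results = run_trainable(
    MiniTrainable, config={"increment": 0.5}, stop={"training_iteration": 4}
)
print(results[-1])  # {'fake_score': 2.0, 'training_iteration': 4}
```

The dict returned by `step` is what ends up as logged metrics, which is why the rest of this section only needs to fill in `setup` and `step`.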

We use the `IteratedPrisonersDilemma` environment (IPD) from the toolbox. This is a two-player game.  
Let's look at its payoffs (rewards) given the joint actions of the players:

In [None]:
ipd_env_payoffs = IteratedPrisonersDilemma({}).PAYOUT_MATRIX
for a_1, action_player_1 in enumerate(["Coop","Defect"]):
    for a_2, action_player_2 in enumerate(["Coop","Defect"]):
        print(f"Payoffs for action pair ({action_player_1},{action_player_2}): " 
              f"({ipd_env_payoffs[a_1][a_2][0]},{ipd_env_payoffs[a_1][a_2][1]})")
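
The dilemma can be read directly from such a payoff table. The sketch below uses a hard-coded table rather than the toolbox's `PAYOUT_MATRIX`; its values are assumptions chosen to match the rewards discussed in this tutorial (-1 each for mutual cooperation, -2 each for mutual defection), and the off-diagonal entries in particular are illustrative.

```python
# payoffs[action_row][action_col] = (reward_row, reward_col)
# 0 = cooperate, 1 = defect. These values are assumptions for illustration.
COOP, DEFECT = 0, 1
payoffs = [
    [(-1, -1), (-3, 0)],
    [(0, -3), (-2, -2)],
]


def welfare(action_row, action_col):
    """Sum of both players' rewards for a joint action."""
    return sum(payoffs[action_row][action_col])


# Whatever the column player does, the row player earns more by defecting...
defect_dominates = all(
    payoffs[DEFECT][a_col][0] > payoffs[COOP][a_col][0]
    for a_col in (COOP, DEFECT)
)

# ...yet mutual defection yields a lower welfare than mutual cooperation.
print("Defect dominates for the row player:", defect_dominates)
print("Welfare (Coop, Coop):", welfare(COOP, COOP))
print("Welfare (Defect, Defect):", welfare(DEFECT, DEFECT))
```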

Create the environment in the Trainable class:

In [None]:
class TrainableWIP(tune.Trainable):
    def setup(self, config):
        self.env = self._init_env(config)
        self.fake_score = 0.0  # Placeholder metric, replaced by a real one later

##### NEW ######
    def _init_env(self, config):
        return IteratedPrisonersDilemma(config["env_config"])
#####

    def step(self): 
        pass
        return {"fake_score": self.fake_score}

##### NEW ######
# Tune will pass this dict to the setup method when we run the training.
tune_config = {
    "env_config": {
        "max_steps": 10,  # Length of an episode
    }
}
#####

We create two simple policy-gradient (PG) players using the `PGTorchPolicy` policy class from `RLLib`.  
We also create a `MultiAgentSampleBatchBuilder` (from `RLLib`) to aggregate our data into batches. 

In [None]:
class TrainableWIP(tune.Trainable):
    def setup(self, config):
        self.env = self._init_env(config)
        self.players = self._init_players(config)
        self.multi_agent_batch_builder = self._init_batch_builder()
        self.fake_score = 0.0  # Placeholder metric, replaced by a real one later
        
    def _init_env(self, config):
        return IteratedPrisonersDilemma(config["env_config"])

##### NEW ######
    def _init_players(self, config):
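        # We will use the default config provided for the PG policy by RLLib, 
        #   with a few modifications. We work on a (shallow) copy so that the
        #   shared DEFAULT_CONFIG dict is not modified in place.
        my_pg_config = dict(DEFAULT_CONFIG)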
        my_pg_config["gamma"] = 0.5
        my_pg_config["train_batch_size"] = config["env_config"]["max_steps"]

        players = {}
        for player_id in self.env.players_ids:
            players[player_id] = PGTorchPolicy(self.env.OBSERVATION_SPACE, 
                                              self.env.ACTION_SPACE,
                                              my_pg_config)
        return players
          
    def _init_batch_builder(self):
        return MultiAgentSampleBatchBuilder(
            policy_map={player_id: player for player_id, player in self.players.items()},
            clip_rewards=False,
            callbacks=DefaultCallbacks()
        )
#####

    def step(self):  # This is called iteratively.
        pass
        return {"fake_score": self.fake_score}

We play one episode per call to `TrainableWIP.step`.  
We then report the welfare (the total reward of both players) averaged over the environment steps.  
This value is saved as a metric and is also displayed in TensorBoard.

In [None]:
class TrainableWIP(tune.Trainable):
    def setup(self, config):
        self.env = self._init_env(config)
        self.players = self._init_players(config)
        self.multi_agent_batch_builder = self._init_batch_builder()
        self.total_welfare = None

    def _init_env(self, config):
        return IteratedPrisonersDilemma(config["env_config"])

    def _init_players(self, config):
        my_pg_config = dict(DEFAULT_CONFIG)  # Copy to avoid mutating the shared default
        my_pg_config["gamma"] = 0.5
        my_pg_config["train_batch_size"] = config["env_config"]["max_steps"]

        players = {}
        for player_id in self.env.players_ids:
            players[player_id] = PGTorchPolicy(self.env.OBSERVATION_SPACE, 
                                              self.env.ACTION_SPACE,
                                              my_pg_config)
        return players
          
    def _init_batch_builder(self):
        return MultiAgentSampleBatchBuilder(
            policy_map={player_id: player for player_id, player in self.players.items()},
            clip_rewards=False,
            callbacks=DefaultCallbacks()
        )

##### NEW ######
    def step(self):
        self.to_report = {}
        self._play_one_episode() 
        self.to_report["mean_welfare"] = self.total_welfare / self.config["env_config"]["max_steps"]
        return self.to_report

    def _play_one_episode(self):
        obs_before_act = self.env.reset()
        done = {"__all__": False}
        self.total_welfare = 0.0
        while not done["__all__"]:
            obs_after_act, actions, rewards, done = self._play_one_step(obs_before_act)
            self._add_step_in_batch_builder_buffer(obs_before_act, actions, rewards, done)
            obs_before_act = obs_after_act
          
    def _play_one_step(self, obs_before_act):
        actions = {player_id: player_policy.compute_actions([obs_before_act[player_id]])[0][0] 
                                  for player_id, player_policy in self.players.items()}

        obs_after_act, rewards, done, info = self.env.step(actions)
        self.to_report.update(info)
        
        return obs_after_act, actions, rewards, done

    def _add_step_in_batch_builder_buffer(self, obs_before_act, actions, rewards, done):
        for player_id in self.players.keys():
            self.total_welfare += rewards[player_id]

            step_player_values = {
                "eps_id": self.training_iteration,
                "obs": obs_before_act[player_id],
                "actions": actions[player_id],
                "rewards": rewards[player_id],
                "dones": done[player_id],
            }
            # The policy_id and agent_id used in the RLLib API are the same in our case (equal to player_id)
            self.multi_agent_batch_builder.add_values(agent_id=player_id, policy_id=player_id, **step_player_values) 
#####
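
The episode loop above follows the multi-agent convention where observations, actions, rewards and done flags are dicts keyed by player id, with an extra `__all__` key in the done dict signalling the end of the episode. Here is a minimal sketch of that loop without `RLLib`, assuming a hypothetical stand-in environment `TinyIPD` (with assumed payoff values and simplified observations) and scripted policies instead of learned ones.

```python
class TinyIPD:
    """Hypothetical stand-in for the toolbox environment (dict-based API)."""

    PAYOFFS = [[(-1, -1), (-3, 0)], [(0, -3), (-2, -2)]]  # assumed values

    def __init__(self, config):
        self.players_ids = ["player_row", "player_col"]
        self.max_steps = config["max_steps"]
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return {player_id: 0 for player_id in self.players_ids}

    def step(self, actions):
        self.step_count += 1
        row, col = self.players_ids
        reward_row, reward_col = self.PAYOFFS[actions[row]][actions[col]]
        rewards = {row: reward_row, col: reward_col}
        # Simplification: each player observes the opponent's last action.
        obs = {row: actions[col], col: actions[row]}
        episode_over = self.step_count >= self.max_steps
        done = {row: episode_over, col: episode_over, "__all__": episode_over}
        return obs, rewards, done, {}


def always_defect(observation):
    return 1


def always_cooperate(observation):
    return 0


def play_one_episode(env, policies):
    """Play one episode and return the welfare averaged per environment step."""
    obs = env.reset()
    done = {"__all__": False}
    total_welfare = 0.0
    while not done["__all__"]:
        actions = {pid: policy(obs[pid]) for pid, policy in policies.items()}
        obs, rewards, done, _info = env.step(actions)
        total_welfare += sum(rewards.values())
    return total_welfare / env.max_steps


env = TinyIPD({"max_steps": 10})
mean_welfare_defect = play_one_episode(
    env, {pid: always_defect for pid in env.players_ids})
mean_welfare_coop = play_one_episode(
    env, {pid: always_cooperate for pid in env.players_ids})
print(mean_welfare_defect, mean_welfare_coop)  # -4.0 -2.0
```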

Finally, after each episode we train the policies of both our agents:

In [None]:
class Trainable(tune.Trainable):

    def setup(self, config):
        self.env = self._init_env(config)
        self.players = self._init_players(config)
        self.multi_agent_batch_builder = self._init_batch_builder()
        
    def _init_env(self, config):
        return IteratedPrisonersDilemma(config["env_config"])

    def _init_players(self, config):
        my_pg_config = dict(DEFAULT_CONFIG)  # Copy to avoid mutating the shared default
        my_pg_config["gamma"] = 0.5
        my_pg_config["train_batch_size"] = config["env_config"]["max_steps"]

        players = {}
        for player_id in self.env.players_ids:
            players[player_id] = PGTorchPolicy(self.env.OBSERVATION_SPACE, 
                                              self.env.ACTION_SPACE,
                                              my_pg_config)
        return players
          
    def _init_batch_builder(self):
        return MultiAgentSampleBatchBuilder(
            policy_map={player_id: player for player_id, player in self.players.items()},
            clip_rewards=False,
            callbacks=DefaultCallbacks()
        )

    def step(self):
        self.to_report = {}

        self._play_one_episode()
        self._optimize_weights()

        self.to_report["mean_welfare"] = self.total_welfare / self.config["env_config"]["max_steps"]
        self.to_report["training_iteration"] = self.training_iteration # This is an attribute from tune.Trainable 
        return self.to_report 

    def _play_one_episode(self):
        obs_before_act = self.env.reset()
        done = {"__all__": False}
        self.total_welfare = 0.0
        while not done["__all__"]:
            obs_after_act, actions, rewards, done = self._play_one_step(obs_before_act)
            self._add_step_in_batch_builder_buffer(obs_before_act, actions, rewards, done)
            obs_before_act = obs_after_act
      
    def _play_one_step(self, obs_before_act):
        actions = {player_id: player_policy.compute_actions([obs_before_act[player_id]])[0][0] 
            for player_id, player_policy in self.players.items()}

        obs_after_act, rewards, done, info = self.env.step(actions)
        self.to_report.update(info)

        return obs_after_act, actions, rewards, done
        
    def _add_step_in_batch_builder_buffer(self, obs_before_act, actions, rewards, done):
        for player_id in self.players.keys():
            self.total_welfare += rewards[player_id]

            step_player_values = {
                "eps_id": self.training_iteration,
                "obs": obs_before_act[player_id],
                "actions": actions[player_id],
                "rewards": rewards[player_id],
                "dones": done[player_id],
            }
            self.multi_agent_batch_builder.add_values(agent_id=player_id, policy_id=player_id, **step_player_values) 

##### NEW ######
    def _optimize_weights(self):
        
        multiagent_batch = self.multi_agent_batch_builder.build_and_reset()
        for player_id, player in self.players.items():
            multiagent_batch = self._center_reward(multiagent_batch, player_id)
            stats = player.learn_on_batch(multiagent_batch.policy_batches[player_id])

    def _center_reward(self, multiagent_batch, player_id):
        multiagent_batch.policy_batches[player_id]["rewards"] = (multiagent_batch.policy_batches[player_id]["rewards"] - 
                                                        multiagent_batch.policy_batches[player_id]["rewards"].mean())
        return multiagent_batch
#####
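
Centering the rewards acts like a simple baseline: in the IPD all rewards are negative or zero, so without centering every action taken would be pushed in the same direction. After subtracting the batch mean, better-than-average steps get a positive weight and worse-than-average steps a negative one. The sketch below illustrates this on a hand-written list of rewards, together with discounted returns for `gamma=0.5`; it assumes that the policy-gradient loss weighs actions by such returns and does not reproduce `RLLib`'s internal post-processing.

```python
def center(rewards):
    """Subtract the batch mean from each reward."""
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [-2.0, -2.0, 0.0, -3.0]  # hypothetical rewards of one player
centered = center(rewards)
returns = discounted_returns(centered, gamma=0.5)

print(centered)  # [-0.25, -0.25, 1.75, -1.25]
print(returns)   # [-0.09375, 0.3125, 1.125, -1.25]
```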

We can now run this experiment with `Tune`:

In [None]:
tune_config = {
    "env_config": {
        "max_steps": 10, # Length of an episode
    }
}

# Stop the training after N calls to Trainable.step (here, this is equivalent to N episodes and N updates)
stop_config = {"training_iteration": 200} 

# Restart Ray defensively in case the ray connection is lost.
ray.shutdown() 
ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
# Run the experiment
tune_analysis = tune.run(
    Trainable,
    stop=stop_config,
    config=tune_config,
    name="PG_IPD",
    )

ray.shutdown()

check_learning_achieved(tune_results=tune_analysis, metric="mean_welfare", max_=-3.8)

You should get a `mean_welfare` close to -4, which means that both players are defecting and that each of them gets a reward of -2 per step.

##  b. Using `Tune`'s hyperparameter search functionality


We are going to do a simple hyperparameter grid search using `Tune`:

In [None]:
tune_config = {
    "env_config": {
        "max_steps": 10,
    }
}

##### NEW ######
# Usually hyperparameter searches are done in the tune_config dictionary 
# but here varying "training_iteration" is interesting.
stop_config = {"training_iteration": tune.grid_search([2, 4, 8, 16, 32, 64, 128])} 
#####

ray.shutdown() 
ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
tune_analysis = tune.run(
    Trainable,
    stop=stop_config,
    config=tune_config,
    name="PG_IPD",
    )
ray.shutdown()

check_learning_achieved(tune_results=tune_analysis, metric="mean_welfare", max_=-3.5, trial_idx=6)
check_learning_achieved(tune_results=tune_analysis, metric="mean_welfare", max_=-2.5, trial_idx=3)

As expected, the more we train, the worse the welfare gets!   
This is expected since we are playing in the `IteratedPrisonersDilemma` environment with selfish agents.  
Both agents are slowly learning to defect.  

All hyperparameter search spaces available in `Tune` are listed at [`Tune`'s search-spaces](https://docs.ray.io/en/master/tune/key-concepts.html#search-spaces).
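
Conceptually, a grid search creates one trial per combination of the listed values. The following sketch expands a grid with `itertools.product`; `expand_grid` is a hypothetical helper written for this illustration, not `Tune`'s implementation. Assuming trials are created in the order of the listed values, `trial_idx=6` in the checks above would correspond to 128 training iterations and `trial_idx=3` to 16.

```python
import itertools


def expand_grid(search_space):
    """Return one config dict per combination of the listed values."""
    keys = list(search_space)
    combinations = itertools.product(*(search_space[key] for key in keys))
    return [dict(zip(keys, combination)) for combination in combinations]


# The grid used above: a single parameter, hence 7 trials.
trials = expand_grid({"training_iteration": [2, 4, 8, 16, 32, 64, 128]})
print(len(trials), trials[3], trials[6])

# With several parameters the number of trials is the product of the list sizes.
# "lr" and "gamma" are hypothetical parameters for this example.
bigger_grid = expand_grid({"lr": [0.01, 0.1], "gamma": [0.5, 0.9]})
print(len(bigger_grid))  # 4
```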

# 2. Running experiments using the `RLLib` API  


In [None]:
import os 
import torch
import copy

import ray
from ray import tune
from ray.rllib.policy.policy import Policy
from ray.rllib.utils import merge_dicts
from ray.rllib.utils.schedules import PiecewiseSchedule
from ray.rllib.utils.typing import TrainerConfigDict

from marltoolbox.envs.matrix_sequential_social_dilemma import IteratedPrisonersDilemma
from marltoolbox.algos.ltft import LTFTCallbacks, LTFTTrainer, prepare_default_config
from marltoolbox.utils import log, miscellaneous, exploration
from marltoolbox.envs.utils.wrappers import add_RewardUncertaintyEnvClassWrapper
from marltoolbox.utils.miscellaneous import check_learning_achieved

##  a. Using the `IteratedPrisonersDilemma` environment and the `LTFT` algorithm from `marltoolbox`  

We are going to train two `LTFT` agents in the `IteratedPrisonersDilemma` environment (both from the toolbox). 

Instead of creating our own Trainable class like we did with the `Tune` API, when using the `RLLib` API we provide an `RLLib` Trainable class and customize it extensively. 
Inside a configuration dictionary, we provide everything needed (the environment, the exploration policy, etc.), and we can customize the agents' policies.  

Let's configure that!

Configure the policies:

In [None]:
def get_rllib_config_WIP(hyperparameters: dict)-> dict:

    rllib_config = {}
    rllib_config.update(get_policies_config(hyperparameters))   
    # ...
    # ...
    # ...
    # ...


    return rllib_config

##### NEW ######
def get_policies_config(hp):

    # We will need to use the LTFTTrainer to manage the dataflow
    # and we use the DEFAULT_CONFIG from this Trainer
    policies_config = prepare_default_config(
        lr=hp["base_lr"],
        lr_spl=hp["base_lr"] * hp["spl_lr_mul"],
        n_epi=hp["n_epi"],
        n_steps_per_epi=hp["n_steps_per_epi"])
    
    policies_config.update({
        # Inside the "multiagent" key of the RLLib config dict, we define all the policies that are going to be used
        "multiagent": {
            "policies": {
                "player_row": (
                    # The default policy is LTFTTorchPolicy (as defined in our Trainable class: LTFTTrainer) 
                    None,
                    IteratedPrisonersDilemma.OBSERVATION_SPACE,
                    IteratedPrisonersDilemma.ACTION_SPACE,
                    # We can provide an additional configuration dict to this policy. 
                    #   It will be merged with a copy of the rllib_config that we are currently creating.
                    {}),
                "player_col": (
                    None,
                    IteratedPrisonersDilemma.OBSERVATION_SPACE,
                    IteratedPrisonersDilemma.ACTION_SPACE,
                    {}),
            },
            # We need to define how the agent_id used in the environment (dict keys) will be associated 
            #  to the policy_id of the policies above (dict keys). Here they are simply identical.
            "policy_mapping_fn": lambda agent_id: agent_id,
        },

        # We add some callbacks needed by the LTFT policy and ask for additional logs.
        "callbacks": miscellaneous.merge_callbacks(LTFTCallbacks,
                                                   log.get_logging_callbacks_class(
                                                       log_env_step=True,
                                                       log_from_policy=True)),
    })
    return policies_config


#####
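
The role of `policy_mapping_fn` is easiest to see in isolation: for every agent id found in the environment's dicts, it returns the id of the policy that acts (and learns) for that agent. Below is a small sketch comparing the identity mapping used above with a hypothetical shared-policy mapping; `group_agents_by_policy` and `"shared_policy"` are names invented for this example.

```python
from collections import defaultdict


def identity_mapping(agent_id):
    """Each agent uses the policy that has the same id (as in this tutorial)."""
    return agent_id


def shared_policy_mapping(agent_id):
    """Hypothetical alternative: every agent uses one shared policy."""
    return "shared_policy"


def group_agents_by_policy(agent_ids, policy_mapping_fn):
    """Return {policy_id: [agent_id, ...]} for the given mapping function."""
    groups = defaultdict(list)
    for agent_id in agent_ids:
        groups[policy_mapping_fn(agent_id)].append(agent_id)
    return dict(groups)


agent_ids = ["player_row", "player_col"]
separate = group_agents_by_policy(agent_ids, identity_mapping)
shared = group_agents_by_policy(agent_ids, shared_policy_mapping)
print(separate)  # {'player_row': ['player_row'], 'player_col': ['player_col']}
print(shared)    # {'shared_policy': ['player_row', 'player_col']}
```

With the identity mapping each player trains its own policy on its own experience, which is what we want here since the two `LTFT` agents are independent learners.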

Configure the environment:

In [None]:
def get_rllib_config_WIP(hyperparameters: dict)-> dict:

    rllib_config = {}
    rllib_config.update(get_policies_config(hyperparameters))   
    rllib_config.update(get_env_config(hyperparameters))
    # ...
    # ...
    # ...
    
    return rllib_config

def get_env_config(hp):
    env_config = {
        # We provide the environment class
        "env": get_env_class(),
        # And the dictionary that will be sent to initialize the environment
        "env_config": {
            "players_ids": ["player_row", "player_col"],
            "max_steps": hp["n_steps_per_epi"],  # Length of an episode
        },
    }
    return env_config

def get_env_class():
    # We add a wrapper around the environment to add some variance to the rewards returned
    MyUncertainIPD = add_RewardUncertaintyEnvClassWrapper(
        IteratedPrisonersDilemma,
        reward_uncertainty_std=0.1)
    return MyUncertainIPD
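
A class wrapper of this kind can be written as a function that takes an environment class and returns a subclass with a modified `step`. The sketch below shows the general pattern with Gaussian noise added to each reward; it is not the toolbox's implementation of `add_RewardUncertaintyEnvClassWrapper`, and `add_reward_noise_wrapper`, `ConstantRewardEnv` and the `seed` argument are hypothetical.

```python
import random


def add_reward_noise_wrapper(env_class, reward_std, seed=None):
    """Return a subclass of env_class whose step() adds Gaussian reward noise."""

    class NoisyRewardEnv(env_class):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._rng = random.Random(seed)

        def step(self, actions):
            obs, rewards, done, info = super().step(actions)
            noisy_rewards = {
                player_id: reward + self._rng.gauss(0.0, reward_std)
                for player_id, reward in rewards.items()
            }
            return obs, noisy_rewards, done, info

    return NoisyRewardEnv


class ConstantRewardEnv:
    """Hypothetical environment that always gives -1 to each acting player."""

    def step(self, actions):
        rewards = {player_id: -1.0 for player_id in actions}
        return {}, rewards, {"__all__": False}, {}


joint_action = {"player_row": 0, "player_col": 1}

NoisyEnv = add_reward_noise_wrapper(ConstantRewardEnv, reward_std=0.1, seed=0)
_, noisy_rewards, _, _ = NoisyEnv().step(joint_action)
print(noisy_rewards)  # values scattered around -1.0

ExactEnv = add_reward_noise_wrapper(ConstantRewardEnv, reward_std=0.0)
_, exact_rewards, _, _ = ExactEnv().step(joint_action)
print(exact_rewards)  # {'player_row': -1.0, 'player_col': -1.0}
```

Wrapping the class rather than an instance matters here because the `"env"` key of the config receives a class, which `RLLib` instantiates itself with `env_config`.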


Configure the default `DQN` policy:

In [None]:
def get_rllib_config_WIP(hyperparameters: dict)-> dict:

    rllib_config = {}
    rllib_config.update(get_policies_config(hyperparameters))   
    rllib_config.update(get_env_config(hyperparameters))
    rllib_config.update(get_default_DQN_config(hyperparameters))
    # ...
    # ...
    
    return rllib_config

##### NEW ######
def get_default_DQN_config(hp):
    # The LTFT policy uses three DQN policies and a supervised learning policy.
    # We provide here some additional configuration for the DQN policies 
    # (this additional configuration will also be sent to the supervised learning policy 
    # which will ignore it)

    default_DQN_config = {
        # === DQN Models ===
        # Minimum env steps to optimize for per train call. This value does
        # not affect learning, only the length of iterations.
        "timesteps_per_iteration": hp["n_steps_per_epi"],
        # Update the target network every `target_network_update_freq` steps.
        "target_network_update_freq": hp["n_steps_per_epi"],
        # === Replay buffer ===
        # Size of the replay buffer. Note that if async_updates is set, then
        # each worker will have a replay buffer of this size.
        "buffer_size": int(hp["n_steps_per_epi"] * hp["n_epi"]),
        # Whether to use dueling dqn
        "dueling": False,
        # Dense-layer setup for each of the advantage branch and the value branch
        # in a dueling architecture.
        "hiddens": [32],
        # Whether to use double dqn
        "double_q": False,
        # If True prioritized replay buffer will be used.
        "prioritized_replay": False,
        "model": {
            # Sizes of the hidden layers of the fully connected net
            "fcnet_hiddens": [32, 2],
            # Nonlinearity for fully connected net (tanh, relu)
            "fcnet_activation": "relu",
        },
    }
    return default_DQN_config
#####

Configure the exploration policy:

In [None]:
def get_rllib_config_WIP(hyperparameters: dict)-> dict:

    rllib_config = {}
    rllib_config.update(get_policies_config(hyperparameters))   
    rllib_config.update(get_env_config(hyperparameters))
    rllib_config.update(get_default_DQN_config(hyperparameters))
    rllib_config.update(get_exploration_config(hyperparameters))
    # ...
    
    return rllib_config

##### NEW ######
def get_exploration_config(hp):
    exploration_config = {
        # === Exploration Settings ===
        # Set to False for no exploration behavior (e.g., for evaluation).
        "explore": True,
        # Provide a dict specifying the Exploration object's config.
        "exploration_config": {
            # The Exploration class to use. In the simplest case, this is the name
            # (str) of any class present in the `rllib.utils.exploration` package.
            # You can also provide the python class directly or the full location
            # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
            # EpsilonGreedy").
            "type": exploration.SoftQSchedule,
            # Add constructor kwargs here (if any).
            "temperature_schedule": PiecewiseSchedule(
                endpoints=[
                    (0, 1.0), (int(hp["n_steps_per_epi"] * hp["n_epi"] * 0.75), 0.1)],
                outside_value=0.1,
                framework="torch")
        },

    }

    return exploration_config
#####
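
To see what this exploration config does, the sketch below re-implements the two ideas in plain Python: a piecewise-linear temperature schedule, and Boltzmann (soft-Q) action probabilities computed from Q-values divided by the temperature. It assumes the values used later in this tutorial (20 steps per episode, 400 episodes); `piecewise_linear` and `soft_q_probabilities` are hypothetical helpers, not the `RLLib` or toolbox classes.

```python
import math


def piecewise_linear(t, endpoints, outside_value):
    """Linearly interpolate between (timestep, value) endpoints."""
    for (t_left, v_left), (t_right, v_right) in zip(endpoints[:-1], endpoints[1:]):
        if t_left <= t < t_right:
            alpha = (t - t_left) / (t_right - t_left)
            return v_left + alpha * (v_right - v_left)
    return outside_value  # t lies outside all the segments


def soft_q_probabilities(q_values, temperature):
    """Softmax over q / temperature (shifted by the max for stability)."""
    scaled = [q / temperature for q in q_values]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


n_steps_per_epi, n_epi = 20, 400
endpoints = [(0, 1.0), (int(n_steps_per_epi * n_epi * 0.75), 0.1)]

for timestep in (0, 3000, 6000, 8000):
    temperature = piecewise_linear(timestep, endpoints, outside_value=0.1)
    probs = soft_q_probabilities([1.0, 0.0], temperature)
    print(timestep, round(temperature, 3), [round(p, 3) for p in probs])
```

A high temperature keeps the action probabilities close to uniform (more exploration); as the temperature decays towards 0.1, the action with the highest Q-value is chosen almost always.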

Configure the optimization and the remaining general settings:

In [None]:
def get_rllib_config(hyperparameters: dict)-> dict:

    rllib_config = {}
    rllib_config.update(get_policies_config(hyperparameters))   
    rllib_config.update(get_env_config(hyperparameters))
    rllib_config.update(get_default_DQN_config(hyperparameters))
    rllib_config.update(get_exploration_config(hyperparameters))
    rllib_config.update(get_optimization_and_general_config(hyperparameters))

    return rllib_config

##### NEW ######
def get_optimization_and_general_config(hp: dict):

    optim_and_general_config = {
        
        # === Optimization ===
        # Learning rate for adam optimizer
        "lr": hp["base_lr"],
        # Learning rate schedule
        "lr_schedule": [(0, hp["base_lr"]),
                        (int(hp["n_steps_per_epi"] * hp["n_epi"]), hp["base_lr"] / 1e9)],
        # How many steps of the model to sample before learning starts.
        "learning_starts": int(hp["n_steps_per_epi"] * hp["bs_epi_mul"]),
        # Update the replay buffer with this many samples at once. Note that
        # this setting applies per-worker if num_workers > 1.
        "rollout_fragment_length": hp["n_steps_per_epi"],
        # Size of a batch sampled from replay buffer for training. Note that
        # if async_updates is set, then each worker returns gradients for a
        # batch of this size.
        "train_batch_size": int(hp["n_steps_per_epi"] * hp["bs_epi_mul"]),
        "gamma": 0.5,

        # === General config ===
        "framework": "torch",
        "batch_mode": "complete_episodes",
        # LTFT supports a single worker only (with num_workers=0, sampling is done in the trainer process);
        # otherwise it would mix the trajectories of several opponents.
        "num_workers": 0,
        # LTFT supports a single env per worker only; otherwise several episodes would be played at the same time.
        "num_envs_per_worker": 1,
        "seed": tune.grid_search(hp["seeds"]),

    }

    return optim_and_general_config
#####
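
Several of these settings are derived from a handful of hyperparameters, so it helps to compute them once by hand. The sketch below uses the values set later in this tutorial (400 episodes of 20 steps, `bs_epi_mul=4`, `base_lr=0.04`) and assumes the learning rate is interpolated linearly between the two `lr_schedule` points; `lr_at` is a hypothetical helper.

```python
hp = {"n_epi": 400, "n_steps_per_epi": 20, "bs_epi_mul": 4, "base_lr": 0.04}

# Total number of environment steps in the training run.
total_env_steps = int(hp["n_steps_per_epi"] * hp["n_epi"])
# bs_epi_mul expresses the batch size in number of episodes.
train_batch_size = int(hp["n_steps_per_epi"] * hp["bs_epi_mul"])
# Learning starts once one full batch of steps has been sampled.
learning_starts = train_batch_size

final_lr = hp["base_lr"] / 1e9


def lr_at(timestep):
    """Learning rate at a given timestep, assuming linear interpolation."""
    if timestep >= total_env_steps:
        return final_lr
    alpha = timestep / total_env_steps
    return hp["base_lr"] + alpha * (final_lr - hp["base_lr"])


print("total env steps:", total_env_steps)    # 8000
print("train batch size:", train_batch_size)  # 80
print("learning starts:", learning_starts)    # 80
print("lr halfway through training:", lr_at(total_env_steps // 2))
```

In other words, the learning rate decays almost to zero over the run, and each update uses a batch worth four episodes of experience.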

Start the training:

In [None]:
def get_stop_config(hp):
    stop_config = {
        "episodes_total": hp["n_epi"], 
    }
    return stop_config

ltft_hparameters = {
    "n_epi": 400,
    "n_steps_per_epi": 20,
    "bs_epi_mul": 4,
    "base_lr": 0.04,
    "spl_lr_mul": 10.0,
    "seeds": miscellaneous.get_random_seeds(1),
    "debug": False,
}


rllib_config = get_rllib_config(ltft_hparameters)
stop_config = get_stop_config(ltft_hparameters)
ray.shutdown()
ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
tune_analysis_self_play = ray.tune.run(LTFTTrainer, config=rllib_config,
                        checkpoint_freq=0, stop=stop_config, 
                        checkpoint_at_end=False, name="LTFT_exp")
ray.shutdown()

check_learning_achieved(tune_results=tune_analysis_self_play, 
                        min_=-42, trial_idx=0)

`LTFT` agents should learn to cooperate and we should reach a "reward" close to -40. This is the sum of both players' rewards over an entire episode.  
Averaged over the 20 steps, this gives us -2 per step, which is the best possible welfare in the `IteratedPrisonersDilemma` environment!

##  b. Using some `RLLib` functionalities  

We can easily change the model and dataflow used by `RLLib` policies by changing the configuration dict. Let's try to reduce the training time.  

We are going to make the following changes:  
- using a smaller network
- using Dueling Double DQN (D3QN) instead of DQN
- training for fewer episodes
- training with fewer steps per episode


In [None]:
def get_default_DQN_config(hp):
    default_DQN_config = {
        "timesteps_per_iteration": hp["n_steps_per_epi"],
        "target_network_update_freq": hp["n_steps_per_epi"],
        "prioritized_replay": False,

        ##### MODIFIED ######
        "buffer_size": int(hp["n_steps_per_epi"] * hp["n_epi"]),
        "dueling": True, # instead of False
        "hiddens": [4], # instead of [32]
        "double_q": True, # instead of False
        "model": {
            "fcnet_hiddens": [4, 2], # instead of [32, 2]
            "fcnet_activation": "relu",
        },
        #####
    }
    return default_DQN_config
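
For readers unfamiliar with the two flags switched on above, here is a minimal numeric sketch of what they change, using the standard formulations. A dueling head combines a state value and per-action advantages as Q = V + A - mean(A), and double Q-learning selects the next action with the online network but evaluates it with the target network. The numbers are made up and the functions are illustrative, not `RLLib`'s implementation.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


def dqn_target(reward, gamma, next_q_target, done):
    """Standard DQN target: bootstrap from the max target-network Q-value."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Double DQN: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Made-up numbers for a two-action problem such as the IPD.
next_q_online = dueling_q_values(state_value=-2.0, advantages=[0.5, -0.5])
next_q_target = [-3.0, -2.0]

print(next_q_online)  # [-1.5, -2.5]
print(dqn_target(-1.0, 0.5, next_q_target, done=False))  # -2.0
print(double_dqn_target(-1.0, 0.5, next_q_online, next_q_target, done=False))  # -2.5
```

In this example the double-DQN target is lower than the plain DQN target, which illustrates how decoupling action selection from evaluation reduces the over-estimation caused by taking a max over noisy Q-values.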

In [None]:
def get_stop_config(hp):
    stop_config = {
        "episodes_total": hp["n_epi"],
    }
    return stop_config

ltft_hparameters = {
    ##### MODIFIED ######
    "n_epi": 200,  # instead of 400
    "n_steps_per_epi": 10, # instead of 20
    #####
    "bs_epi_mul": 4,
    "base_lr": 0.04,
    "spl_lr_mul": 10.0,
    "seeds": miscellaneous.get_random_seeds(1),
    "debug": False,
}


rllib_config = get_rllib_config(ltft_hparameters)
stop_config = get_stop_config(ltft_hparameters)
ray.shutdown()
ray.init(num_cpus=os.cpu_count(), num_gpus=0) 
tune_analysis_self_play = ray.tune.run(LTFTTrainer, config=rllib_config,
                        checkpoint_freq=0, stop=stop_config, 
                        checkpoint_at_end=False, name="LTFT_exp")
ray.shutdown()

check_learning_achieved(tune_results=tune_analysis_self_play,
                        min_=-22, trial_idx=0)

The training now takes around 15 seconds, down from around 35 seconds previously (these values depend on the machine on which Google Colab is running).   

And we still achieve cooperation since the welfare (total reward) per step is still -2.

If you want, you can try to determine which modification was the most important.

##  c. Using TensorBoard to visualize the training runs


Uncomment the cells below to use TensorBoard and view the performance of each trial.

In [None]:
# %load_ext tensorboard

In [None]:
# %tensorboard --logdir /root/ray_results/ # On Google Colab
# %tensorboard --logdir ~/ray_results/ # On your machine

# You can filter the graphs with "reward|mean_welfare|defection_metric|entropy|CC"