# Using RLlib for more multi-agent learning control

As discussed in `5-improving-dqn-architecture.ipynb`, we identified three likely reasons why the agents fail to learn to play the game well:
- Training two DQN agents simultaneously is known to be tough, especially when starting from a random initialisation.
- The network used was a simple MLP.
- The training did not run for enough iterations.

In the notebooks `5-improving-dqn-architecture.ipynb` and `6-dqn-using-a-cnn.ipynb`, two alternative networks to the MLP were used.
Whilst these give somewhat satisfactory results when trained for long enough and when moves are incentivised with a reward per move, the agents are still far from perfect.
The training time was also increased to a couple of hours on a CUDA GPU, which didn't improve things all that much.

Thus, the most likely issue is that we are training two agents simultaneously.
Each agent then learns against an opponent whose behaviour keeps changing, which makes it hard to obtain a well-performing agent.
An alternative is to train one agent for a couple of epochs whilst the other is frozen, and to alternate this between the agents.
This makes the learning problem "stationary" in a certain sense and is known to make learning easier.
Another common approach, especially in very complex games, is to start from a somewhat smart agent instead of a random one.
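The alternating scheme can be sketched as a simple schedule. This is a minimal illustration and not the training loop of this project: the helper `alternating_schedule` and the value of `epochs_per_turn` are hypothetical, and the actual training and freezing of an agent are left out.

```python
from typing import Iterator, List, Tuple


def alternating_schedule(
    agents: List[str], total_epochs: int, epochs_per_turn: int
) -> Iterator[Tuple[int, str]]:
    """Yield (epoch, learner) pairs; every agent except `learner` stays frozen."""
    if epochs_per_turn < 1:
        raise ValueError("epochs_per_turn must be at least 1")
    for epoch in range(total_epochs):
        turn = epoch // epochs_per_turn  # index of the current training turn
        yield epoch, agents[turn % len(agents)]


# Hypothetical setup: two agents that each train for two epochs in a row.
schedule = list(
    alternating_schedule(["player_0", "player_1"], total_epochs=6, epochs_per_turn=2)
)
for epoch, learner in schedule:
    print(f"epoch {epoch}: train {learner}, keep the other agent frozen")
```

During each turn the frozen agent acts as a fixed part of the environment, which is what makes the problem look stationary to the learner.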

This notebook will use [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html), which is better documented for multi-agent environments, and for PettingZoo-like environments in particular.
Its documentation also notes that zero-sum environments are harder to learn in multi-agent settings.
That is why we introduce a reward for making moves and a high reward for playing a tie game.
In this manner, we hope to create agents that are capable of reaching a tie or of postponing a loss for as long as possible.
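A minimal sketch of such a shaped reward is given below. All numeric values and the helper `episode_return` are hypothetical and only illustrate the idea; they are not the values used in this project.

```python
# Hypothetical reward values, chosen only for illustration.
MOVE_REWARD = 0.01  # small reward for every move an agent makes
TERMINAL_REWARD = {"win": 1.0, "tie": 0.5, "loss": -1.0}


def episode_return(num_moves: int, outcome: str) -> float:
    """Undiscounted return of one agent: per-move rewards plus the final reward."""
    return num_moves * MOVE_REWARD + TERMINAL_REWARD[outcome]


# A loss that is postponed for longer is worth more than a quick loss,
# and a tie is worth more than any loss.
print(episode_return(4, "loss"))
print(episode_return(20, "loss"))
print(episode_return(21, "tie"))
```

With rewards shaped like this, the game is no longer strictly zero-sum, since both agents collect the per-move reward.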

We will use portions of the [Ray documentation and examples](https://docs.ray.io/en/latest/rllib/rllib-examples.html) in this notebook.
This includes the following files from public GitHub repositories:
- `multi_agent_independent_learning.py` from the [Ray GitHub repository](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py).
- `multi_agent_parameter_sharing.py` from the [Ray GitHub repository](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_parameter_sharing.py).
- `rllib_pistonball.py` from the [PettingZoo GitHub repository](https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/rllib_pistonball.py).

Alongside these documents and files, a tutorial by [J K Terry on using RLlib in PettingZoo environments](https://towardsdatascience.com/using-pettingzoo-with-rllib-for-multi-agent-deep-reinforcement-learning-5ff47c677abd) was also used.
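As their names suggest, the two RLlib example files differ in whether every agent learns its own policy or all agents share one. The sketch below illustrates that difference with plain Python functions. The helper `policies_in_use` and the policy name `shared_policy` are assumptions made for illustration, and the signature that RLlib expects for a policy mapping function may differ between versions.

```python
from typing import Callable, Dict, List

AGENTS: List[str] = ["player_0", "player_1"]


def independent_mapping(agent_id: str) -> str:
    """Independent learning: every agent is mapped to a policy of its own."""
    return agent_id


def shared_mapping(agent_id: str) -> str:
    """Parameter sharing: every agent is mapped to one common policy."""
    return "shared_policy"


def policies_in_use(mapping: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group the agent ids by the policy that the mapping assigns to them."""
    groups: Dict[str, List[str]] = {}
    for agent_id in AGENTS:
        groups.setdefault(mapping(agent_id), []).append(agent_id)
    return groups


print(policies_in_use(independent_mapping))
print(policies_in_use(shared_mapping))
```

The configuration later in this notebook follows the independent variant: it defines one policy per player and maps each agent id to the policy with the same name.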

# IMPORTANT: BUGGY NOTEBOOK
This notebook doesn't work due to issues related to the one reported [here](https://github.com/ray-project/ray/issues/22976).
This, along with the fact that working with a custom PettingZoo-like environment throws seemingly random errors, led us to believe that Ray RLlib is sadly not the way to go.

Indeed, our model has values that are `None`, which throws the following error:

> Can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Since editing the source code of Ray RLlib is asking for trouble, we leave our exploration of this library as is.
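One way to track down such `None` values before they reach the tensor conversion is to walk the nested data and report where they occur. The helper below is a hypothetical debugging aid written for this purpose; it is not part of Ray RLlib, and the sample data is made up.

```python
from typing import Any, List


def find_none_paths(value: Any, path: str = "root") -> List[str]:
    """Return the path of every None found in nested dicts, lists and tuples."""
    if value is None:
        return [path]
    if isinstance(value, dict):
        children = [(f"{path}[{key!r}]", child) for key, child in value.items()]
    elif isinstance(value, (list, tuple)):
        children = [(f"{path}[{index}]", child) for index, child in enumerate(value)]
    else:
        return []  # a leaf that is not None
    found: List[str] = []
    for child_path, child in children:
        found.extend(find_none_paths(child, child_path))
    return found


# Made-up sample that mimics a batch with missing entries.
sample = {"obs": [0.0, 1.0, None], "state": {"hidden": None, "step": 3}}
print(find_none_paths(sample))
```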

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training Connect Four agents with Ray RLlib
  - Trying out Ray RLlib

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` Anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

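# Fall back to a placeholder instead of raising a KeyError when no conda environment is active
active_env = os.environ.get("CONDA_DEFAULT_ENV", "<none>")
print(f"Active environment: {active_env}")
print(f"Correct environment: {active_env == 'rl-project'}")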
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [2]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib
# Ray RLlib for RL algorithms instead of Tianshou
import ray; print(f"Ray version (1.12.1 recommended): {ray.__version__}")
import ray.rllib

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym;
importlib.invalidate_caches();
importlib.reload(cfgym);

Ray version (1.12.1 recommended): 1.12.1




Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116
Gym version (0.21.0 recommended): 0.21.0
pygame 2.1.2 (SDL 2.0.18, Python 3.8.10)
Hello from the pygame community. https://www.pygame.org/contribute.html


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.
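A minimal sketch of such a check, which assumes only that PyTorch may or may not be installed. The helper name `cuda_report` and the layout of the returned dictionary are hypothetical.

```python
def cuda_report() -> dict:
    """Summarise whether PyTorch is importable and whether it can reach a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to test.
        return {
            "torch_installed": False,
            "cuda_available": False,
            "cuda_build": None,
            "device_name": None,
        }
    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "cuda_available": available,
        # CUDA version that PyTorch was built with (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

On a machine that followed the installation instructions, `cuda_available` would be expected to be `True` and `cuda_build` to report a CUDA 11.6 build.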

<hr><hr>

## Training Connect Four agents with Ray RLlib

As discussed, this notebook will use Ray RLlib to train two agents for Connect Four.

### Trying out Ray RLlib

We try out Ray RLlib on the Connect Four game provided by PettingZoo.
Whilst the training works, the saved files cause an issue when loading and thus when replaying.
Because this is a straight copy from the documentation with only the environment changed, we see no reason why it should not work. We therefore discard further experiments with this library.

In [3]:
import os
from copy import deepcopy

import ray
from gym.spaces import Box
from ray import tune
from ray.rllib.agents.dqn.dqn_torch_model import DQNTorchModel
from ray.rllib.agents.registry import get_trainer_class
from ray.rllib.env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MAX
from ray.tune.registry import register_env

from pettingzoo.classic import connect_four_v3

torch, nn = try_import_torch()


class TorchMaskedActions(DQNTorchModel):
    """DQN model that masks illegal actions using the `action_mask` in the observation."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kw):
        DQNTorchModel.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kw
        )

        obs_len = obs_space.shape[0] - action_space.n

        orig_obs_space = Box(
            shape=(obs_len,), low=obs_space.low[:obs_len], high=obs_space.high[:obs_len]
        )
        self.action_embed_model = TorchFC(
            orig_obs_space,
            action_space,
            action_space.n,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked action values from the board observation only.
        action_logits, _ = self.action_embed_model(
            {"obs": input_dict["obs"]["observation"]}
        )
        # Turn the 0/1 mask into an additive mask: log(1) = 0 keeps a legal action,
        # log(0) = -inf (clamped to -1e10) pushes an illegal action far below the rest.
        inf_mask = torch.clamp(torch.log(action_mask), -1e10, FLOAT_MAX)

        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()
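
The masking step in `forward` can be illustrated without PyTorch. The sketch below mimics `torch.clamp(torch.log(action_mask), -1e10, FLOAT_MAX)` for a single observation with plain Python floats. The helper `mask_logits` and the example values are hypothetical, and the upper clamp is omitted because the log of a 0/1 mask never exceeds 0.

```python
import math
from typing import List

MASK_FLOOR = -1e10  # same lower clamp value as in the model above


def mask_logits(logits: List[float], action_mask: List[int]) -> List[float]:
    """Add log(mask) to every logit, with log(0) = -inf clamped to MASK_FLOOR."""
    masked = []
    for logit, allowed in zip(logits, action_mask):
        log_mask = 0.0 if allowed else -math.inf  # log(1) = 0, log(0) = -inf
        masked.append(logit + max(log_mask, MASK_FLOOR))
    return masked


# Made-up values: one logit per column of the board; column 2 is full.
logits = [0.3, -1.2, 2.5, 0.0, 0.7, 1.1, -0.4]
action_mask = [1, 1, 0, 1, 1, 1, 1]

masked = mask_logits(logits, action_mask)
best = max(range(len(masked)), key=masked.__getitem__)
print(masked)
print(f"greedy action: {best}")
```

Column 2 has the highest raw value, but after masking the greedy choice falls on the best legal column instead.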





In [4]:
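alg_name = "DQN"
ModelCatalog.register_custom_model("pa_model", TorchMaskedActions)

# Instance of our custom Connect Four environment (not used in the run below).
my_env = cfgym.env()

# Function that creates the environment to register:
# the Connect Four game provided by PettingZoo.
def env_creator():
    env = connect_four_v3.env()
    return env

num_cpus = 1

config = deepcopy(get_trainer_class(alg_name)._default_config)

# The name "leduc_holdem" is kept from the example code this cell is based on;
# the environment that gets registered is Connect Four.
register_env("leduc_holdem", lambda config: PettingZooEnv(env_creator()))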

test_env = PettingZooEnv(env_creator())
obs_space = test_env.observation_space
print(obs_space)
act_space = test_env.action_space

config["multiagent"] = {
    "policies": {
        "player_0": (None, obs_space, act_space, {}),
        "player_1": (None, obs_space, act_space, {}),
    },
    "policy_mapping_fn": lambda agent_id: agent_id,
}

config["num_gpus"] = int(os.environ.get("RLLIB_NUM_GPUS", "0"))
config["log_level"] = "DEBUG"
config["num_workers"] = 1
config["rollout_fragment_length"] = 30
config["train_batch_size"] = 200
config["horizon"] = 200
config["no_done_at_end"] = False
config["framework"] = "torch"
config["model"] = {
    "custom_model": "pa_model",
}
config["n_step"] = 1

config["exploration_config"] = {
    # The Exploration class to use.
    "type": "EpsilonGreedy",
    # Config for the Exploration class' constructor:
    "initial_epsilon": 0.1,
    "final_epsilon": 0.0,
    "epsilon_timesteps": 100000,  # Timesteps over which to anneal epsilon.
}
config["hiddens"] = []
config["dueling"] = False
config["env"] = "leduc_holdem"

ray.init(num_cpus=num_cpus + 1)

tune.run(
    alg_name,
    name="DQN",
    stop={"timesteps_total": 5000},
    checkpoint_freq=10,
    config=config,
)

Dict(action_mask:Box(0, 1, (7,), int8), observation:Box(0, 1, (6, 7, 2), int8))


2022-05-26 15:04:34,292	INFO trial_runner.py:803 -- starting DQN_leduc_holdem_63720_00000
2022-05-26 15:04:34,360	ERROR syncer.py:119 -- Log sync requires rsync to be installed.
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:39,569	INFO simple_q.py:161 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:44,566	INFO worker_set.py:154 -- Inferred observation/action spaces from remote worker (local worker has no env): {'player_1': (Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[[0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]
[2m[36m(DQNTrainer pid=21264)[0m   [0 0]]
[2m[36m(DQNTrainer pid=21264)[0m 
[2

| Trial name                   | status  | loc             |
| ---------------------------- | ------- | --------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 |


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:44,715	INFO replay_buffer.py:47 -- Estimated max memory usage for replay buffer is 0.03825 GB (50000.0 batches of size 1, 765 bytes each), available system memory is 17.129304064 GB
[2m[36m(RolloutWorker pid=22524)[0m 2022-05-26 15:04:44,644	INFO rollout_worker.py:809 -- Generating sample batch of size 30
[2m[36m(RolloutWorker pid=22524)[0m 2022-05-26 15:04:44,645	INFO sampler.py:672 -- Raw obs from env: { 0: { 'player_0': { 'action_mask': np.ndarray((7,), dtype=int8, min=1.0, max=1.0, mean=1.0),
[2m[36m(RolloutWorker pid=22524)[0m                      'observation': np.ndarray((6, 7, 2), dtype=int8, min=0.0, max=0.0, mean=0.0)}}}
[2m[36m(RolloutWorker pid=22524)[0m 2022-05-26 15:04:44,645	INFO sampler.py:673 -- Info return from env: {0: {}}
[2m[36m(RolloutWorker pid=22524)[0m 2022-05-26 15:04:44,645	INFO sampler.py:908 -- Preprocessed obs: np.ndarray((91,), dtype=float32, min=0.0, max=1.0, mean=0.077)
[2m[36m(Rollou

[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 1019
  custom_metrics: {}
  date: 2022-05-26_15-04-46
  done: false
  episode_len_mean: 17.87719298245614
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.08771929824561403
  episode_reward_min: -1.0
  episodes_this_iter: 57
  episodes_total: 57
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,422	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,433	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


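| Trial name                   | status  | loc             | iter | total time (s) | ts   | reward     | episode_reward_max | episode_reward_min | episode_len_mean |
| ---------------------------- | ------- | --------------- | ---- | -------------- | ---- | ---------- | ------------------ | ------------------ | ---------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 | 1    | 1.82784        | 1020 | -0.0877193 | 0                  | -1                 | 17.8772          |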


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,630	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,638	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,806	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,814	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,974	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:46,982	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:47,144	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:47,151	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:47,303	DEBUG train_

Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 1949
  custom_metrics: {}
  date: 2022-05-26_15-04-51
  done: false
  episode_len_mean: 8.41
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: 0.0
  episode_reward_min: 0.0
  episodes_this_iter: 3
  episodes_total: 167
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 1530
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.001257112598977983
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:51,538	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:51,546	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


| Trial name                   | status  | loc             | iter | total time (s) | ts   | reward | episode_reward_max | episode_reward_min | episode_len_mean |
| ---------------------------- | ------- | --------------- | ---- | -------------- | ---- | ------ | ------------------ | ------------------ | ---------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 | 32   | 5.30447        | 1950 | 0      | 0                  | 0                  | 8.41             |

| Trial name                   | status  | loc             | iter | total time (s) | ts   | reward | episode_reward_max | episode_reward_min | episode_len_mean |
| ---------------------------- | ------- | --------------- | ---- | -------------- | ---- | ------ | ------------------ | ------------------ | ---------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 | 61   | 8.73618        | 2820 | 0      | 0                  | 0                  | 8.58             |


Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 2849
  custom_metrics: {}
  date: 2022-05-26_15-04-56
  done: false
  episode_len_mean: 8.56
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: 0.0
  episode_reward_min: 0.0
  episodes_this_iter: 4
  episodes_total: 272
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 2550
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0013593443436548114
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -1000

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:56,886	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:56,894	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,052	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,060	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,228	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,237	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,395	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,404	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:04:57,561	DEBUG train_

[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,599	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,610	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,777	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,785	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,987	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:00,995	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,166	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,174	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,328	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,338	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,521	DEBUG train_

Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 3719
  custom_metrics: {}
  date: 2022-05-26_15-05-01
  done: false
  episode_len_mean: 10.19
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.02
  episode_reward_min: -1.0
  episodes_this_iter: 3
  episodes_total: 354
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 3570
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0016049898695200682
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,717	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:01,724	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


| Trial name                   | status  | loc             | iter | total time (s) | ts   | reward | episode_reward_max | episode_reward_min | episode_len_mean |
| ---------------------------- | ------- | --------------- | ---- | -------------- | ---- | ------ | ------------------ | ------------------ | ---------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 | 91   | 12.3256        | 3720 | -0.02  | 0                  | -1                 | 10.19            |

| Trial name                   | status  | loc             | iter | total time (s) | ts   | reward | episode_reward_max | episode_reward_min | episode_len_mean |
| ---------------------------- | ------- | --------------- | ---- | -------------- | ---- | ------ | ------------------ | ------------------ | ---------------- |
| DQN_leduc_holdem_63720_00000 | RUNNING | 127.0.0.1:21264 | 119  | 15.6723        | 4560 | -0.08  | 0                  | -1                 | 13.17            |


Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 4589
  custom_metrics: {}
  date: 2022-05-26_15-05-06
  done: false
  episode_len_mean: 13.19
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.08
  episode_reward_min: -1.0
  episodes_this_iter: 3
  episodes_total: 420
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 4590
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.001981080509722233
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -1

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,148	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,158	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,345	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,355	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,519	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,527	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,679	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,688	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,843	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:07,851	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,003	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,011	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,167	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,175	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


[2m[36m(RolloutWorker pid=22524)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.


[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,325	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,333	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,484	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,492	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,645	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,654	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,818	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,826	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:08,977	DEBUG train_

Result for DQN_leduc_holdem_63720_00000:
  agent_timesteps_total: 5009
  custom_metrics: {}
  date: 2022-05-26_15-05-09
  done: true
  episode_len_mean: 13.17
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.08
  episode_reward_min: -1.0
  episodes_this_iter: 3
  episodes_total: 452
  experiment_id: 76ae7192618d4f36bd5bc77cc03bfffe
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 4590
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0020761718042194843
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -1

[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:09,299	DEBUG train_ops.py:336 -- == sgd epochs for player_1 ==
[2m[36m(DQNTrainer pid=21264)[0m 2022-05-26 15:05:09,307	DEBUG train_ops.py:336 -- == sgd epochs for player_0 ==


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| DQN_leduc_holdem_63720_00000 | TERMINATED | 127.0.0.1:21264 | 134 | 17.4973 | 5010 | -0.08 | 0 | -1 | 13.17 |


2022-05-26 15:05:10,212	INFO tune.py:701 -- Total run time: 36.20 seconds (35.26 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x25e18aef580>
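
The learner stats above report Q-values of about `-1e10`, a magnitude that is often used to mask out illegal actions. The model code is not shown in this section, so the following is only a minimal sketch of how such masking could work, assuming the custom model replaces the Q-values of illegal moves with a large negative constant. The names `MASK_VALUE`, `mask_q_values` and `greedy_legal_action` are hypothetical and not part of RLlib.

```python
import numpy as np

# Assumed constant: chosen to match the magnitude seen in the logged Q-values.
MASK_VALUE = -1e10


def mask_q_values(q_values, action_mask):
    """Replace the Q-values of illegal actions with a large negative constant."""
    q_values = np.asarray(q_values, dtype=np.float32)
    legal = np.asarray(action_mask).astype(bool)
    return np.where(legal, q_values, np.float32(MASK_VALUE))


def greedy_legal_action(q_values, action_mask):
    """Pick the legal action with the highest Q-value."""
    return int(np.argmax(mask_q_values(q_values, action_mask)))


# Hypothetical Q-values for a three-action game where action 1 is illegal.
q = [0.1, 0.9, 0.3]
mask = [1, 0, 1]
print(mask_q_values(q, mask))
print(greedy_legal_action(q, mask))
```

Under this assumption, an action that was actually played should have an unmasked Q-value, so a logged `max_q` equal to the mask constant is worth checking before trusting the training run.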

In [8]:
import os
import pickle
from pathlib import Path

import numpy as np
import ray
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env

from pettingzoo.classic import connect_four_v3

# Render without opening a window.
os.environ["SDL_VIDEODRIVER"] = "dummy"

checkpoint_path = os.path.expanduser("C:/Users/Lennert/ray_results/DQN/DQN_leduc_holdem_63720_00000_0_2022-05-26_15-04-34/checkpoint_000130/checkpoint-130")
# params.pkl sits two levels above the checkpoint file, in the trial directory.
params_path = Path(checkpoint_path).parent.parent / "params.pkl"

# TorchMaskedActions is assumed to be defined in an earlier cell of this notebook.
ModelCatalog.register_custom_model("pa_model", TorchMaskedActions)


def env_creator():
    """Create the raw PettingZoo Connect Four environment."""
    return connect_four_v3.env()


# The name "leduc_holdem" is left over from the Leduc Hold'em example;
# the environment registered here is Connect Four.
register_env("leduc_holdem", lambda config: PettingZooEnv(env_creator()))

env = env_creator()

# Load the configuration that was saved with the trial.
with open(params_path, "rb") as f:
    config = pickle.load(f)

# Worker and GPU settings are not needed since we are not training.
del config["num_workers"]
del config["num_gpus"]

# ray.init(num_cpus=8, num_gpus=0)
DQNAgent = DQNTrainer(env="leduc_holdem", config=config)
DQNAgent.restore(checkpoint_path)

reward_sums = {a: 0 for a in env.possible_agents}
env.reset()

for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    reward_sums[agent] += reward
    if done:
        # An agent that is done must step with None.
        action = None
    else:
        policy = DQNAgent.get_policy(agent)
        # The policy expects a leading batch dimension on each array.
        batch_obs = {
            "obs": {
                "observation": np.expand_dims(observation["observation"], 0),
                "action_mask": np.expand_dims(observation["action_mask"], 0),
            }
        }
        # Use throwaway names so the env's `info` is not overwritten.
        batched_action, _state_out, _extra = policy.compute_actions_from_input_dict(
            batch_obs
        )
        action = batched_action[0]

    env.step(action)
    env.render()

print("rewards:")
print(reward_sums)

2022-05-26 15:08:00,349	INFO simple_q.py:161 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
2022-05-26 15:08:00,355	DEBUG rollout_worker.py:1704 -- Creating policy for player_0
2022-05-26 15:08:00,356	DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8)
2022-05-26 15:08:00,358	DEBUG preprocessors.py:269 -- Creating sub-preprocessor for Box([[[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]], [[[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
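
The traceback above is truncated, so the call that receives the object array is not visible here. One way to narrow it down is to check every array handed to the policy for an `object` dtype before calling it. The sketch below is a diagnostic aid under that assumption, not a fix. `find_object_arrays` is a hypothetical helper, and the example batch only mimics the shapes of the Connect Four observation.

```python
import numpy as np


def find_object_arrays(batch, prefix=""):
    """Return the key paths of all leaves whose NumPy dtype is `object`."""
    bad = []
    for key, value in batch.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dicts such as batch["obs"].
            bad.extend(find_object_arrays(value, prefix=path + "."))
        elif np.asarray(value).dtype == object:
            bad.append(path)
    return bad


# Hypothetical batch with the same layout as `batch_obs` in the cell above.
batch_obs = {
    "obs": {
        "observation": np.zeros((1, 6, 7, 2), dtype=np.int8),
        "action_mask": np.ones((1, 7), dtype=np.int8),
    }
}
print(find_object_arrays(batch_obs))
```

If the helper returns an empty list for the real `batch_obs`, the object array is probably created elsewhere, for example while the trainer is built or restored.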