# Using RLlib for more control over multi-agent learning

As discussed in `5-improving-dqn-architecture.ipynb`, we identified three aspects that might explain why the agents do not learn to play the game well:
- Training two DQN agents simultaneously is known to be tough, especially when starting from a random initialisation.
- The network used was a simple MLP.
- The training did not run for enough iterations.

In the notebooks `5-improving-dqn-architecture.ipynb` and `6-dqn-using-a-cnn.ipynb`, two alternative networks to the MLP were used.
Whilst these give somewhat satisfactory results when trained for long enough and when moves are incentivised with a reward for each move made, the result is still far from perfect.
The training time was also increased to a couple of hours on a CUDA GPU, which didn't improve things all that much.

Thus, the most likely issue is that we are training two agents simultaneously.
Each agent then learns against an opponent that keeps changing, which makes it hard to obtain a well-performing agent.
An alternative is to train one agent for a couple of epochs whilst freezing the other, and to alternate this between the agents.
This makes the learning problem "stationary" in a certain way and is known to make learning easier.
Another common approach, especially in very complex games, is to start from a somewhat smart agent instead of a random one.
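
The alternating scheme described above can be sketched as a simple schedule that decides which agent is trainable in each phase. This is only an illustration and not code from the earlier notebooks: the function name `alternating_schedule`, the agent IDs and the phase length are assumptions. In RLlib, such a schedule could presumably be applied through the `policies_to_train` entry of the multi-agent config, which is not done in this notebook.

```python
from typing import Iterator, List, Tuple


def alternating_schedule(
    agents: List[str], epochs_per_phase: int, total_epochs: int
) -> Iterator[Tuple[int, str]]:
    """Yield (epoch, trainable_agent); all other agents stay frozen."""
    if epochs_per_phase <= 0:
        raise ValueError("epochs_per_phase must be positive")
    for epoch in range(total_epochs):
        # A phase is a block of consecutive epochs with the same trainable agent.
        phase = epoch // epochs_per_phase
        yield epoch, agents[phase % len(agents)]


# Hypothetical setup: two agents, switching every two epochs.
schedule = list(alternating_schedule(["player_0", "player_1"], 2, 6))
for epoch, trainable in schedule:
    print(f"epoch {epoch}: train {trainable}, freeze the other agent")
```

Within one phase the frozen agent behaves as a fixed part of the environment, which is what makes the problem look stationary to the agent being trained.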

This notebook will use [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html), which is better documented for use in multi-agent environments, and PettingZoo-like environments in particular.
The RLlib authors also note that zero-sum environments are harder to learn in multi-agent settings.
That is why we introduce a reward for making moves and a high reward for playing a tie game.
In this manner, we hope to create agents that are capable of reaching a tie or of delaying a loss for as long as possible.
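
To make the reward idea concrete, here is a minimal sketch of such a shaped reward. All numeric values and the names `shaped_reward` and `episode_return` are assumptions chosen for illustration; they are not the values used in our environment.

```python
from typing import Optional

MOVE_REWARD = 0.01   # assumed small bonus for every move played
TIE_REWARD = 0.5     # assumed bonus for ending the game in a tie
WIN_REWARD = 1.0     # assumed reward for winning
LOSS_REWARD = -1.0   # assumed penalty for losing


def shaped_reward(outcome: Optional[str]) -> float:
    """Return the reward for one step.

    outcome is None while the game continues, otherwise "win", "loss" or "tie".
    """
    terminal = {"win": WIN_REWARD, "loss": LOSS_REWARD, "tie": TIE_REWARD}
    if outcome is None:
        return MOVE_REWARD
    if outcome not in terminal:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return MOVE_REWARD + terminal[outcome]


def episode_return(num_steps: int, outcome: str) -> float:
    """Total reward over num_steps steps, the last one carrying the outcome."""
    return (num_steps - 1) * shaped_reward(None) + shaped_reward(outcome)


print(episode_return(4, "loss"), episode_return(21, "loss"), episode_return(21, "tie"))
```

With these assumed values, a long loss scores higher than a short loss and a tie scores higher than any loss, which is the ordering the shaping is meant to encourage.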

We will use portions of the [Ray documentation and examples](https://docs.ray.io/en/latest/rllib/rllib-examples.html) in this notebook.
This includes the following files from public GitHub repositories:
- `multi_agent_independent_learning.py` from the [Ray GitHub repository](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_independent_learning.py).
- `multi_agent_parameter_sharing.py` from the [Ray GitHub repository](https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_parameter_sharing.py).
- `rllib_pistonball.py` from the [PettingZoo GitHub repository](https://github.com/Farama-Foundation/PettingZoo/blob/master/tutorials/rllib_pistonball.py).

Alongside these documents and files, a tutorial by [J K Terry on using RLlib in PettingZoo environments](https://towardsdatascience.com/using-pettingzoo-with-rllib-for-multi-agent-deep-reinforcement-learning-5ff47c677abd) was also used.

# IMPORTANT: BUGGY NOTEBOOK
This notebook doesn't work due to issues related to the one reported [here](https://github.com/ray-project/ray/issues/22976).
This, along with the fact that working with a custom PettingZoo-like environment throws seemingly random errors, led us to believe that Ray RLlib is sadly not the way to go.

Indeed, our model has values that are `None`, which causes the following error:

> Can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Since editing the source code of Ray RLlib is asking for trouble, we leave our exploration of this library as is.
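
The `numpy.object_` type in the error message is what NumPy falls back to when an array mixes numbers with values such as `None`. A small diagnostic like the one below can help locate such values in a nested structure before it is converted to a tensor. This is a sketch only: the helper `find_none_paths` and the example `batch` are hypothetical, and where exactly the `None` values originate inside RLlib was not established in this notebook.

```python
from typing import Any, Iterator


def find_none_paths(value: Any, path: str = "root") -> Iterator[str]:
    """Yield the path of every None found in nested dicts, lists and tuples."""
    if value is None:
        yield path
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_none_paths(item, f"{path}[{key!r}]")
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from find_none_paths(item, f"{path}[{index}]")


# Hypothetical batch, loosely shaped like an observation dictionary.
batch = {
    "obs": {"observation": [0, 1, None], "action_mask": [1, 1, 1]},
    "state": None,
}
none_paths = list(find_none_paths(batch))
print(none_paths)
```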

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training Connect Four agents with Ray RLlib
  - Trying out Ray RL lib

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` Anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

# CONDA_DEFAULT_ENV is unset outside of conda, so fall back to a placeholder
active_env = os.environ.get('CONDA_DEFAULT_ENV', '<none>')
print(f"Active environment: {active_env}")
print(f"Correct environment: {active_env == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib
# Ray RLlib for RL algorithms instead of Tianshou
import ray; print(f"Ray version (1.12.1 recommended): {ray.__version__}")
import ray.rllib

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym;
importlib.invalidate_caches();
importlib.reload(cfgym);

Ray version (1.12.1 recommended): 1.12.1
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116
Gym version (0.21.0 recommended): 0.21.0


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.
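
A minimal sketch of such a check is given below. It is not necessarily the exact cell used when this notebook was run; the helper name `cuda_report` is an assumption, and the code only relies on `torch.cuda.is_available()`, `torch.version.cuda` and `torch.cuda.get_device_name()`. It degrades gracefully when PyTorch is not installed.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic CUDA information without failing when PyTorch is missing."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_version": None,
        "device_name": None,
    }
    if report["torch_installed"]:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        report["cuda_version"] = torch.version.cuda
        if report["cuda_available"]:
            report["device_name"] = torch.cuda.get_device_name(0)
    return report


result = cuda_report()
for key, value in result.items():
    print(f"{key}: {value}")
```

With the recommended installation, `cuda_available` would be expected to be `True` and `cuda_version` to report 11.6.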

<hr><hr>

## Training Connect Four agents with Ray RLlib

As discussed, this notebook will use Ray RLlib to train two agents for Connect Four.

### Trying out Ray RL lib

We try out Ray RLlib on the Connect Four game provided by PettingZoo.
Whilst the training works, the saved files cause an issue when loading them and thus when replaying a game.
Because this is a straight copy from the documentation with only the environment changed, we see no reason why it should not work on our side, and we therefore discard further experiments with this library.

In [4]:
import os
from copy import deepcopy

import ray
from gym.spaces import Box
from ray import tune
from ray.rllib.agents.dqn.dqn_torch_model import DQNTorchModel
from ray.rllib.agents.registry import get_trainer_class
from ray.rllib.env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MAX
from ray.tune.registry import register_env

from pettingzoo.classic import connect_four_v3

torch, nn = try_import_torch()


class TorchMaskedActions(DQNTorchModel):
    """DQN model that masks illegal actions using the `action_mask` part of the observation."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kw):
        DQNTorchModel.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kw
        )

        obs_len = obs_space.shape[0] - action_space.n

        orig_obs_space = Box(
            shape=(obs_len,), low=obs_space.low[:obs_len], high=obs_space.high[:obs_len]
        )
        self.action_embed_model = TorchFC(
            orig_obs_space,
            action_space,
            action_space.n,
            model_config,
            name + "_action_embed",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the predicted action embedding
        action_logits, _ = self.action_embed_model(
            {"obs": input_dict["obs"]["observation"]}
        )
        # Turn the 0/1 action mask into an additive mask: log(1) = 0 for legal
        # actions, log(0) = -inf (clamped to -1e10) for illegal actions
        inf_mask = torch.clamp(torch.log(action_mask), -1e10, FLOAT_MAX)

        return action_logits + inf_mask, state

    def value_function(self):
        return self.action_embed_model.value_function()
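
The masking step in `forward` can be illustrated without PyTorch. The sketch below mirrors the idea in plain Python: legal actions keep their value, illegal actions receive a very large negative penalty and therefore never win the argmax. The helper names and the example values are assumptions for illustration only.

```python
import math
from typing import List

MASK_FLOOR = -1e10  # same lower clamp bound as used in the model above


def log_mask(action_mask: List[int]) -> List[float]:
    """Map a 0/1 mask to additive penalties: 1 -> 0.0, 0 -> MASK_FLOOR."""
    # math.log(0) raises an error, so zeros are mapped to the floor directly.
    return [max(math.log(m), MASK_FLOOR) if m > 0 else MASK_FLOOR for m in action_mask]


def masked_argmax(values: List[float], action_mask: List[int]) -> int:
    """Pick the highest-valued action among the legal ones."""
    masked = [v + p for v, p in zip(values, log_mask(action_mask))]
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical values for the seven Connect Four columns.
q_values = [0.2, 1.5, -0.3, 0.9, 0.0, 0.4, 0.1]
mask = [1, 0, 1, 1, 1, 1, 1]  # column 1 is full and therefore illegal
best = masked_argmax(q_values, mask)
print(best)
```

Although column 1 has the highest raw value, it is masked out and the best legal column is selected instead.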





In [5]:
alg_name = "DQN"
ModelCatalog.register_custom_model("pa_model", TorchMaskedActions)
# Instance of our custom environment (unused below: training runs on the PettingZoo environment).
my_env = cfgym.env()

# Function that outputs the environment you wish to register.
def env_creator():
    env = connect_four_v3.env()
    return env

num_cpus = 1

config = deepcopy(get_trainer_class(alg_name)._default_config)

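# The name "leduc_holdem" is kept from the copied example; the registered environment is Connect Four.
register_env("leduc_holdem", lambda config: PettingZooEnv(env_creator()))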

test_env = PettingZooEnv(env_creator())
obs_space = test_env.observation_space
print(obs_space)
act_space = test_env.action_space

config["multiagent"] = {
    "policies": {
        "player_0": (None, obs_space, act_space, {}),
        "player_1": (None, obs_space, act_space, {}),
    },
    "policy_mapping_fn": lambda agent_id: agent_id,
}

config["num_gpus"] = int(os.environ.get("RLLIB_NUM_GPUS", "0"))
config["log_level"] = "INFO"
config["num_workers"] = 1
config["rollout_fragment_length"] = 30
config["train_batch_size"] = 200
config["horizon"] = 200
config["no_done_at_end"] = False
config["framework"] = "torch"
config["model"] = {
    "custom_model": "pa_model",
}
config["n_step"] = 1

config["exploration_config"] = {
    # The Exploration class to use.
    "type": "EpsilonGreedy",
    # Config for the Exploration class' constructor:
    "initial_epsilon": 0.1,
    "final_epsilon": 0.0,
    "epsilon_timesteps": 100000,  # Timesteps over which to anneal epsilon.
}
config["hiddens"] = []
config["dueling"] = False
config["env"] = "leduc_holdem"

ray.init(num_cpus=num_cpus + 1)

tune.run(
    alg_name,
    name="DQN",
    stop={"timesteps_total": 5000},
    checkpoint_freq=10,
    config=config,
)

Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]
  [0 0]]], [[[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]

 [[1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]
  [1 1]]], (6, 7, 2), int8))


2022-06-02 01:43:13,188	INFO trial_runner.py:803 -- starting DQN_leduc_holdem_99c2b_00000
2022-06-02 01:43:13,247	ERROR syncer.py:119 -- Log sync requires rsync to be installed.
[2m[36m(DQNTrainer pid=14440)[0m 2022-06-02 01:43:18,248	INFO simple_q.py:161 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
[2m[36m(RolloutWorker pid=2128)[0m 2022-06-02 01:43:23,709	INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
[2m[36m(RolloutWorker pid=2128)[0m 2022-06-02 01:43:23,714	INFO catalog.py:474 -- Wrapping <class '__main__.TorchMaskedActions'> as <class 'ray.rllib.agents.dqn.dqn_torch_model.DQNTorchModel'>
[2m[36m(RolloutWorker pid=2128)[0m 2022-06-02 01:43:23,718	INFO torch_policy.py:183 -- TorchPolicy (worker=1) running on CPU.
[2m[36m(RolloutWorker pid=2128)[0m 2022-06-02 01:43:23

Trial name,status,loc
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440


[2m[36m(DQNTrainer pid=14440)[0m 2022-06-02 01:43:23,760	INFO worker_set.py:154 -- Inferred observation/action spaces from remote worker (local worker has no env): {'player_0': (Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[[0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]]
[2m[36m(DQNTrainer pid=14440)[0m 
[2m[36m(DQNTrainer pid=14440)[0m  [[0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]
[2m[36m(DQNTrainer pid=14440)[0m   [0 0]]
[2m[36m(DQNTrainer pid=14440)[0m 
[2m[36m(DQNTrainer pid=14440)[0m  [[0 0]
[2m[36m(DQNTrainer pid=14440)[0m

[2m[36m(DQNTrainer pid=14440)[0m 2022-06-02 01:43:23,934	INFO replay_buffer.py:47 -- Estimated max memory usage for replay buffer is 0.03825 GB (50000.0 batches of size 1, 765 bytes each), available system memory is 17.129304064 GB
[2m[36m(RolloutWorker pid=2128)[0m 2022-06-02 01:43:23,870	INFO simple_list_collector.py:904 -- Trajectory fragment after postprocess_trajectory():
[2m[36m(RolloutWorker pid=2128)[0m 
[2m[36m(RolloutWorker pid=2128)[0m { 'player_0': { 'actions': np.ndarray((7,), dtype=int64, min=0.0, max=6.0, mean=4.0),
[2m[36m(RolloutWorker pid=2128)[0m                 'agent_index': np.ndarray((7,), dtype=int32, min=0.0, max=0.0, mean=0.0),
[2m[36m(RolloutWorker pid=2128)[0m                 'dones': np.ndarray((7,), dtype=bool, min=0.0, max=1.0, mean=0.143),
[2m[36m(RolloutWorker pid=2128)[0m                 'eps_id': np.ndarray((7,), dtype=int32, min=1279442517.0, max=1279442517.0, mean=1279442517.0),
[2m[36m(RolloutWorker pid=2128)[0m             

[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440,1,1.94017,1020,-0.160714,0,-1,18.1964


Result for DQN_leduc_holdem_99c2b_00000:
  agent_timesteps_total: 1889
  custom_metrics: {}
  date: 2022-06-02_01-43-30
  done: false
  episode_len_mean: 11.03
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.03
  episode_reward_min: -1.0
  episodes_this_iter: 2
  episodes_total: 143
  experiment_id: 20d2d93e401548a1937e1ffd4fd1c320
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 1530
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0015086865751072764
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440,30,5.50739,1890,-0.03,0,-1,11.03


[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
Result for DQN_leduc_holdem_99c2b_00000:
  agent_timesteps_total: 2759
  custom_metrics: {}
  date: 2022-06-02_01-43-36
  done: false
  episode_len_mean: 10.0
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.01
  episode_reward_min: -1.0
  episodes_this_iter: 3
  episodes_total: 232
  experiment_id: 20d2d93e401548a1937e1ffd4fd1c320
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 2550
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0017396074254065752
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440,59,9.03132,2760,-0.01,0,-1,10


[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
Result for DQN_leduc_holdem_99c2b_00000:
  agent_timesteps_total: 3659
  custom_metrics: {}
  date: 2022-06-02_01-43-41
  done: false
  episode_len_mean: 12.02
  episode_media: {}
  episode_reward_max: 0.0
  episode_

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440,89,12.5589,3660,-0.08,0,-1,12.02


[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,RUNNING,127.0.0.1:14440,119,15.9737,4560,-0.09,0,-1,12.46


Result for DQN_leduc_holdem_99c2b_00000:
  agent_timesteps_total: 4589
  custom_metrics: {}
  date: 2022-06-02_01-43-46
  done: false
  episode_len_mean: 12.46
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.09
  episode_reward_min: -1.0
  episodes_this_iter: 3
  episodes_total: 376
  experiment_id: 20d2d93e401548a1937e1ffd4fd1c320
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 4590
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.002683277241885662
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_steps_trained: 200.0
        td_error:
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -10000000000.0
        - -1

[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
[2m[36m(RolloutWorker pid=2128)[0m obs['action_mask'] contains a mask of all legal moves that can be chosen.
Result for DQN_leduc_holdem_99c2b_00000:
  agent_timesteps_total: 5009
  custom_metrics: {}
  date: 2022-06-02_01-43-48
  done: true
  episode_len_mean: 13.03
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -0.07
  episode_reward_min: -1.0
  episodes_this_iter: 2
  episodes_total: 407
  experiment_id: 20d2d93e401548a1937e1ffd4fd1c320
  hostname: GAMING-LENNERT
  info:
    last_target_update_ts: 4590
    learner:
      player_0:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.0001
          grad_gnorm: 0.0029454275500029325
          max_q: -10000000000.0
          mean_q: -9999997952.0
          min_q: -10000000000.0
        mean_td_error: -9999997952.0
        model: {}
        num_agent_st

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
DQN_leduc_holdem_99c2b_00000,TERMINATED,127.0.0.1:14440,134,17.7881,5010,-0.07,0,-1,13.03


2022-06-02 01:43:49,565	INFO tune.py:701 -- Total run time: 36.66 seconds (35.87 seconds for the tuning loop).
[2m[36m(pid=)[0m 2022-06-02 01:43:49,592	INFO context.py:67 -- Exec'ing worker with command: "C:\ProgramData\Anaconda3\envs\rl-project\python.exe" C:\ProgramData\Anaconda3\envs\rl-project\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=62406 --object-store-name=tcp://127.0.0.1:63755 --raylet-name=tcp://127.0.0.1:63206 --redis-address=None --storage=None --temp-dir=C:\Users\Lennert\AppData\Local\Temp\ray --metrics-agent-port=63722 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64969 --redis-password=5241590000000000 --startup-token=3 --runtime-env-hash=947844633


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x2d68ef473a0>

In [6]:
import argparse
import os
from copy import deepcopy
from pathlib import Path

import numpy as np
import pickle
import PIL
import ray
from ray.rllib.agents.dqn import DQNTrainer
from ray.rllib.agents.registry import get_trainer_class
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env

from pettingzoo.classic import connect_four_v3

os.environ["SDL_VIDEODRIVER"] = "dummy"



checkpoint_path = os.path.expanduser("C:/Users/Lennert/ray_results/DQN/DQN_leduc_holdem_63720_00000_0_2022-05-26_15-04-34/checkpoint_000130/checkpoint-130")
params_path = Path(checkpoint_path).parent.parent / "params.pkl"


alg_name = "DQN"
ModelCatalog.register_custom_model("pa_model", TorchMaskedActions)
# function that outputs the environment you wish to register.


def env_creator():
    env = connect_four_v3.env()
    return env


num_cpus = 1

config = deepcopy(get_trainer_class(alg_name)._default_config)

register_env("leduc_holdem", lambda config: PettingZooEnv(env_creator()))

env = env_creator()
# obs_space = env.observation_space
# print(obs_space)
# act_space = test_env.action_space

with open(params_path, "rb") as f:
    config = pickle.load(f)
    # num_workers not needed since we are not training
    del config["num_workers"]
    del config["num_gpus"]

#ray.init(num_cpus=8, num_gpus=0)
DQNAgent = DQNTrainer(env="leduc_holdem", config=config)
DQNAgent.restore(checkpoint_path)

reward_sums = {a: 0 for a in env.possible_agents}
i = 0
env.reset()

for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    obs = observation["observation"]
    reward_sums[agent] += reward
    if done:
        action = None
    else:
        # Look up the policy once and reuse it for printing and inference
        policy = DQNAgent.get_policy(agent)
        print(policy)
        batch_obs = {
            "obs": {
                "observation": np.expand_dims(observation["observation"], 0),
                "action_mask": np.expand_dims(observation["action_mask"], 0),
            }
        }
        batched_action, state_out, info = policy.compute_actions_from_input_dict(
            batch_obs
        )
        single_action = batched_action[0]
        action = single_action

    env.step(action)
    i += 1
    env.render()

print("rewards:")
print(reward_sums)

2022-06-02 01:43:49,712	INFO simple_q.py:161 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
2022-06-02 01:43:49,750	DEBUG rollout_worker.py:1704 -- Creating policy for player_0
[2m[36m(pid=)[0m 2022-06-02 01:43:49,624	INFO context.py:67 -- Exec'ing worker with command: "C:\ProgramData\Anaconda3\envs\rl-project\python.exe" C:\ProgramData\Anaconda3\envs\rl-project\lib\site-packages\ray\workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=62406 --object-store-name=tcp://127.0.0.1:63755 --raylet-name=tcp://127.0.0.1:63206 --redis-address=None --storage=None --temp-dir=C:\Users\Lennert\AppData\Local\Temp\ray --metrics-agent-port=63722 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64969 --redis-password=5241590000000000 --startup-token=2 --runtime-env-hash=947844633
2022-06-02 01:43:49,751	DEBUG preprocessors.py

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
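
Note that the restore cell above uses a hardcoded checkpoint path from trial `DQN_leduc_holdem_63720_00000_0_2022-05-26_15-04-34`, whereas the training logs in this notebook belong to trial `DQN_leduc_holdem_99c2b_00000`. Whether this mismatch is related to the error is not known. To avoid hardcoding, the most recent checkpoint of a trial could be located as sketched below. The helper `latest_checkpoint` is hypothetical and assumes the `checkpoint_000130/checkpoint-130` layout seen in the path above; the demonstration builds a temporary directory instead of touching real results.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_DIR = re.compile(r"checkpoint_(\d+)$")


def latest_checkpoint(trial_dir: Path) -> Optional[Path]:
    """Return the checkpoint file with the highest iteration in trial_dir, if any."""
    best_iteration, best_file = -1, None
    for child in trial_dir.iterdir():
        match = CHECKPOINT_DIR.match(child.name)
        if not (child.is_dir() and match):
            continue
        iteration = int(match.group(1))
        # Assumed layout: checkpoint_000130/checkpoint-130
        candidate = child / f"checkpoint-{iteration}"
        if candidate.is_file() and iteration > best_iteration:
            best_iteration, best_file = iteration, candidate
    return best_file


# Demonstration on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    trial = Path(tmp)
    for iteration in (10, 130):
        folder = trial / f"checkpoint_{iteration:06d}"
        folder.mkdir()
        (folder / f"checkpoint-{iteration}").touch()
    found = latest_checkpoint(trial)
    found_name = found.name if found else None
print(found_name)
```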