# Connect four with random agents from Tianshou

In this notebook, our custom gym environment for connect four is played by two random agents.
We use the powerful [Tianshou library](https://github.com/thu-ml/tianshou) for this.
The notebook thus shows how the multi-agent environment can be configured with Tianshou managing the agents.


## Table of Contents

- Contact information
- Checking requirements
  - Correct anaconda environment
  - Correct module access
  - Correct CUDA access
- Training two random agents on connect four Gym
  - Create the Gym environment
  - Create a Multi Agent Policy Manager
  - Create a data collector
  - Collect data for a given amount of episodes
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True
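
The check above raises a `KeyError` when the notebook is started outside of any anaconda environment, since `CONDA_DEFAULT_ENV` is then not set. As a minimal sketch of a more forgiving variant, the hypothetical helper `check_environment` below (not part of the project) reports a mismatch instead of failing:

```python
import os
from platform import python_version


def check_environment(expected_env="rl-project", expected_python="3.8.10"):
    """Return a small report instead of raising when no conda env is active."""
    # os.environ.get returns None when the variable is missing
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "active_env": active_env,
        "env_ok": active_env == expected_env,
        "python_version": python_version(),
        "python_ok": python_version() == expected_python,
    }


print(check_environment())
```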


<hr>

### Correct module access

The following code block loads all required modules and shows whether their versions match the recommended ones.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.11.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
importlib.invalidate_caches()
importlib.reload(cfgym)

# Time for allowing "freezes" in execution
import time

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.11.0 recommended): 1.12.0.dev20220520+cu116
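
Note that the Torch version printed above is a development build that differs from the recommended one. Such mismatches are easy to overlook when reading the output by eye, so a programmatic comparison may help. The sketch below uses the standard library's `importlib.metadata`; the helper `compare_versions` and the `RECOMMENDED` dictionary are illustrative names, and we assume the distribution names match the import names used above.

```python
from importlib import metadata

# Recommended versions as listed in the print statements above
RECOMMENDED = {
    "matplotlib": "3.5.1",
    "pygame": "2.1.2",
    "gym": "0.21.0",
    "tianshou": "0.4.8",
    "torch": "1.11.0",
}


def compare_versions(recommended):
    """Map each package to (installed version or None, exact match?)."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, matches) in compare_versions(RECOMMENDED).items():
    print(f"{name}: installed={installed}, matches recommended={matches}")
```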


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests whether this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"Amount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device and its name
# Guarded, since these calls raise an error when no CUDA device is available
if torch.cuda.is_available():
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True
Amount of connected devices supporting CUDA: 1
Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training two random agents on connect four Gym

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, each agent always plays as the same player, e.g. the first agent always plays as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Create the Gym environment

While our first connect four implementation (V1) was playable in a multi-agent manner through a manual game loop, as is done for random agents in the experimental notebook `2-testing-custom-gym-environment.ipynb`, it is hard to use existing libraries with this environment.
That is because Gym was originally made with single-agent games in mind, and there is no real standard for writing multi-agent environments.
Thus, Tianshou, which offers some multi-agent support, didn't work well with this version of the Gym environment.
To tackle this, we created V2, a rework of V1 that follows the standards of a *PettingZoo* environment.
[PettingZoo](https://www.pettingzoo.ml/) is a library that offers many Gym environments extended to be multi-agent; it uses wrapper classes and base classes so that each multi-agent environment follows the same guidelines.
The environment is now a subclass of `AECEnv` rather than `gym.Env`, which follows a similar approach but requires far more attributes and more complex observation and action spaces, so that each agent has its own, even if they are all equal.
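
To make the turn-based idea more tangible, the toy class below mimics two aspects of such an environment: per-agent spaces stored in dictionaries, and an `agent_selection` attribute that moves to the next agent after every step. It is a standard-library sketch only. `ToyTurnBasedEnv` is a hypothetical name, the "spaces" are plain Python objects, and it is neither our V2 environment nor the actual `AECEnv` class.

```python
from itertools import cycle


class ToyTurnBasedEnv:
    """Toy stand-in for a turn-based multi-agent environment."""

    def __init__(self):
        self.possible_agents = ["player_1", "player_2"]
        # One entry per agent, even though both agents share identical spaces
        self.action_spaces = {a: list(range(7)) for a in self.possible_agents}
        self.observation_shapes = {a: (6, 7) for a in self.possible_agents}
        self._turn_order = cycle(self.possible_agents)
        self.agent_selection = next(self._turn_order)

    def step(self, action):
        """Validate the action of the current agent, then pass the turn."""
        if action not in self.action_spaces[self.agent_selection]:
            raise ValueError(f"invalid action {action} for {self.agent_selection}")
        self.agent_selection = next(self._turn_order)


toy_env = ToyTurnBasedEnv()
for action in (3, 3, 4):
    print(f"{toy_env.agent_selection} plays column {action}")
    toy_env.step(action)
```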

We note that we only found out about this library after running into trouble making V1 work with libraries that support Gym training.
We could have searched for online implementations of, e.g., a Deep Q-Network (DQN) that can be trained with the gym game loop used in `2-testing-custom-gym-environment.ipynb`.
Nonetheless, this would require much manual work, which is not ideal for the goal of this paper: reducing work for game developers.
Thus, the decision was made to create this V2 of the environment based on PettingZoo environments.
As a final note, we noticed that PettingZoo already includes a Connect Four implementation.
This implementation differs from ours in a number of ways.
Their observation space consists of multiple variants of the board, namely one with only the opponent's coins and one with only the agent's coins, and it uses action masks to prevent placing coins in full columns.
We use a single observation of the complete board, containing the locations of both agents' coins and the free spaces, corresponding to what a human would see.
In addition, we don't programmatically disallow placing a coin in a full column; instead, we punish the agent for trying that action, leave the board unchanged, and let the agent play again.
There are more differences in logic, and the visual game is also completely different.
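
The handling of full columns described above can be sketched as follows. This is a simplified illustration, not the code of our V2 environment: the function `drop_coin`, the penalty value, the zero reward for an ordinary move, and the convention that the last row is the bottom of the board are all assumptions made for this example. Only the encoding of a free cell as `0` is taken from the observation space.

```python
ROWS, COLUMNS = 6, 7
EMPTY = 0                    # free cell, as in the all-zero initial observation
ILLEGAL_MOVE_PENALTY = -1.0  # hypothetical value, chosen for illustration only


def drop_coin(board, column, player):
    """Try to drop `player`'s coin in `column`; return (reward, turn_passes)."""
    # Scan from the bottom row upwards for the first free cell
    for row in range(ROWS - 1, -1, -1):
        if board[row][column] == EMPTY:
            board[row][column] = player
            return 0.0, True
    # Column is full: board untouched, penalty given, same player moves again
    return ILLEGAL_MOVE_PENALTY, False


board = [[EMPTY] * COLUMNS for _ in range(ROWS)]
for turn in range(ROWS):
    drop_coin(board, 0, player=1 + turn % 2)  # fill column 0 completely

print(drop_coin(board, 0, player=1))  # (-1.0, False)
```

With an action mask, the seventh move in column 0 could never be selected; here it is possible but costs the agent a penalty.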

In [5]:
####################################################
# SETTING UP THE GYM ENVIRONMENT
####################################################

# Create an instance of the environment to be used
# V2 is used as this contains edits for Tianshou
# We use the PettingZooEnv wrapper for multiagent support
env = ts.env.PettingZooEnv(cfgym.env())

# Get information about the environment
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


<hr>

### Create a Multi Agent Policy Manager

Since we have a multi-agent environment, we need a multi-agent policy manager, which manages a policy for each agent.
To start, this policy is a simple random policy.
This setup is largely based on the [tic-tac-toe example from the Tianshou docs](https://tianshou.readthedocs.io/en/master/tutorials/tictactoe.html).
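
To make the notion of a random policy concrete, the sketch below picks uniformly among the actions that a mask, like the `mask` entry of the observation printed earlier, marks as allowed. It only uses the standard library. `sample_legal_action` is a hypothetical helper and not how Tianshou's `RandomPolicy` is implemented internally.

```python
import random


def sample_legal_action(mask, rng=random):
    """Pick a uniformly random action index among those the mask allows."""
    legal_actions = [index for index, allowed in enumerate(mask) if allowed]
    if not legal_actions:
        raise ValueError("the mask does not allow any action")
    return rng.choice(legal_actions)


initial_mask = [True] * 7  # same shape as the initial mask printed earlier
print(sample_legal_action(initial_mask))
```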

In [6]:
####################################################
# SETTING UP THE MULTI AGENT POLICY MANAGER
####################################################

# Create the multi agent policy manager
multi_agent_policy_manager = ts.policy.MultiAgentPolicyManager([ts.policy.RandomPolicy(), ts.policy.RandomPolicy()], env)

<hr>

### Create a data collector

With a (multi-agent) policy in place, we can set up a Tianshou data collector to collect data using that policy.

In [7]:
####################################################
# SETTING UP THE DATA COLLECTOR
####################################################

# Need to vectorize the environment for the collector
vectorized_env = ts.env.DummyVectorEnv([lambda: env])

# use collectors to collect episodes of trajectories
collector = ts.data.Collector(multi_agent_policy_manager, vectorized_env)

<hr>

### Collect data for a given amount of episodes

Having set up the policy and data collector, we can start gathering results from playing the policy.
Since this is a random policy, the agents don't learn and there is not much more we can do.
Thus, we simply visualize the policy playing 3 episodes.

In [8]:
####################################################
# COLLECTING DATA
####################################################

# Collect results over 3 episodes (complete games)
# If the render option is set, a step is made every
#   specified amount of seconds.
results = collector.collect(n_episode=3, render=.15)

# Close the environment after collecting the results
# This closes the pygame window after completion
env.close()

# Show the obtained results
pprint(results)

# Remove unused variables
del results

{'idxs': array([0, 0, 0]),
 'len': 17.666666666666668,
 'len_std': 4.189935029992178,
 'lens': array([19, 22, 12]),
 'n/ep': 3,
 'n/st': 53,
 'rew': 0.0,
 'rew_std': 10.0,
 'rews': array([[ 10., -10.],
       [-10.,  10.],
       [-10.,  10.]])}


<hr><hr>

## Discussion

We see that our V2 gym environment, based on an `AECEnv` object (PettingZoo multi-agent environment), works with the Tianshou library for multi-agent data collection using a random policy.
In the next notebooks a DQN will be trained.
We also see that correct rewards are obtained. Remember that a negative reward is given for placing a coin in a full column, which can add up since we play with random agents.
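
As a sanity check, the summary statistics of the collected results can be recomputed from the per-episode values. The sketch below copies the `lens` and `rews` values from the output of the collection cell and uses only the standard library. The assumption that the first reward column belongs to `player_1` is ours and is based on the agent order.

```python
from statistics import mean, pstdev

# Values copied from the collector output above
lens = [19, 22, 12]
rews = [[10.0, -10.0], [-10.0, 10.0], [-10.0, 10.0]]

# Episode length statistics; 'len_std' matches a population standard deviation
print(f"Mean episode length: {mean(lens)}")
print(f"Std of episode length: {pstdev(lens)}")

# Reward statistics over all entries of the reward matrix
all_rewards = [reward for episode in rews for reward in episode]
print(f"Mean reward: {mean(all_rewards)}, std: {pstdev(all_rewards)}")

# Wins per player, assuming column 0 is player_1 and column 1 is player_2
wins = {
    "player_1": sum(episode[0] > 0 for episode in rews),
    "player_2": sum(episode[1] > 0 for episode in rews),
}
print(f"Wins: {wins}")
```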

In [9]:
####################################################
# CLEAN VARIABLES
####################################################

del collector
del env
del multi_agent_policy_manager
del vectorized_env