# Human vs bot game

This notebook shows how to run a pygame-based Connect Four game between two agents, either of which can be a human player.
In particular, it guides you through playing against a trained bot as your opponent.

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Loading policies
  - MLP based DQN
  - CNN based DQN
  - CNN based Rainbow
  - MiniMax policy
- Loading in torch dictionaries
- Setup the game

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` Anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True
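
The check above raises a `KeyError` when no conda environment is active, because `CONDA_DEFAULT_ENV` is then missing from `os.environ`. A minimal sketch of a more forgiving variant follows; the helper name `check_environment` is hypothetical and not part of the project.

```python
import os
from platform import python_version


def check_environment(expected_env: str = "rl-project",
                      expected_python: str = "3.8.10") -> dict:
    """Report the active conda environment and Python version without raising."""
    # .get() returns None instead of raising KeyError when conda is not active
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    installed_python = python_version()
    return {
        "active_env": active_env,
        "env_ok": active_env == expected_env,
        "python_version": installed_python,
        "python_ok": installed_python == expected_python,
    }


report = check_environment()
print(report)
```

Outside conda, `active_env` is simply `None` and `env_ok` is `False`, so the notebook can continue and report the mismatch.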


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.11.0 recommended): {torch.__version__}")

# Our custom connect four gym environment
import sys
sys.path.append('../')
import human_vs_bot_connect4.human_vs_bot_connect_four as game
import minimax_agent.minimax_agent as minimaxbot
importlib.invalidate_caches();
importlib.reload(game);
importlib.reload(minimaxbot);

# More data types
import typing
import numpy as np


Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.11.0 recommended): 1.12.0.dev20220520+cu116
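
The output above shows a nightly Torch build (`1.12.0.dev20220520+cu116`) where `1.11.0` is recommended. A rough sketch of how such version strings could be compared on their major and minor numbers is given below; `release_tuple` and `same_minor` are hypothetical helpers, and the policy of "same major.minor is close enough" is an assumption rather than a project rule.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '1.12.0.dev1+cu116' -> (1, 12, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"No numeric release found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def same_minor(installed: str, recommended: str) -> bool:
    """True when major and minor version numbers agree (assumed tolerance)."""
    return release_tuple(installed)[:2] == release_tuple(recommended)[:2]


print(same_minor("0.4.8", "0.4.8"))                      # True
print(same_minor("1.12.0.dev20220520+cu116", "1.11.0"))  # False
```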


<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [4]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    # Show current cuda device
    print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

    # Show cuda device name
    print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Loading policies

We first need to define the policies (and their network architectures) that the trained agents' weights belong to.
These definitions are taken from previous notebooks.

<hr>

### MLP based DQN

In [5]:
####################################################
# DQN POLICY FROM PAPER NOTEBOOK 5
####################################################

class CustomDQN(torch.nn.Module):
    """
    Custom DQN using a model based on an MLP (fully connected layers)
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        self.model = torch.nn.Sequential(
            torch.nn.Linear(np.prod(state_shape), 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state

def cf_custom_dqn_policy(state_shape: tuple,
                         action_shape: tuple,
                         learning_rate: float =  0.0001,
                         gamma: float = 0.9, # Smaller gamma favours "faster" win
                         n_step: int = 1, # Number of steps to look ahead
                         target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CustomDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an Adam optimizer with the given learning rate
    optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)
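
The comment on `gamma` states that a smaller discount factor favours a faster win. The sketch below illustrates why with plain arithmetic, assuming (hypothetically) a reward of 1 on the winning move and 0 otherwise; the notebook's actual reward scheme is defined in the environment and is not shown here.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


quick_win = [0, 0, 1]           # win on the agent's third move
slow_win = [0, 0, 0, 0, 0, 1]   # win on the agent's sixth move

for gamma in (0.9, 0.99):
    ratio = discounted_return(slow_win, gamma) / discounted_return(quick_win, gamma)
    print(f"gamma={gamma}: slow win is worth {ratio:.3f} of a quick win")
```

With `gamma=0.9` the slow win keeps about 73% of the quick win's value, against about 97% with `gamma=0.99`, so the lower discount factor penalises delay more strongly.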

<hr>

### CNN based DQN

In [6]:
####################################################
# DQN POLICY FROM PAPER NOTEBOOK 7
####################################################

class CNNBasedDQN(torch.nn.Module):
    """
    Custom DQN using a model based on CNN
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',):
        # Parent call
        super().__init__()
        
        # Save device (e.g. cuda)
        self.device = device
        
        # Number of input channels
        input_channels_cnn = 1
        output_channels_cnn = 32
        flatten_size = (state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn
        output_size= np.prod(action_shape)
        
        self.model = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels= input_channels_cnn, out_channels= output_channels_cnn, kernel_size= 4, stride= 1), torch.nn.ReLU(inplace=True),
            torch.nn.Flatten(0,-1),
            torch.nn.Unflatten(0, (1, flatten_size)),
            torch.nn.Linear(flatten_size, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, 128), torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, output_size),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
        
        logits = self.model(obs)
        return logits, state

    
def cf_cnn_dqn_policy(state_shape: tuple,
                      action_shape: tuple,
                      optim: typing.Optional[torch.optim.Optimizer] = None,
                      learning_rate: float =  0.0001,
                      gamma: float = 0.9, # Smaller gamma favours "faster" win
                      n_step: int = 4, # Number of steps to look ahead
                      frozen: bool = False,
                      target_update_freq: int = 320):
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Network to be used for DQN
    net = CNNBasedDQN(state_shape, action_shape, device= device).to(device)
    
    # Default optimizer is an Adam optimizer with the given learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
        
    # Our agent DQN policy
    return ts.policy.DQNPolicy(model= net,
                               optim= optim,
                               discount_factor= gamma,
                               estimation_step= n_step,
                               target_update_freq= target_update_freq)
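
The `flatten_size` expression `(state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn` follows from the usual output-size formula for a convolution with kernel size 4, stride 1 and no padding. A small sketch of that arithmetic, using hypothetical helper names and the `(6, 7)` board shape from the game setup:

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flatten_size(state_shape, out_channels: int, kernel_size: int = 4) -> int:
    """Number of features after the convolution output is flattened."""
    rows = conv_output_size(state_shape[0], kernel_size)
    cols = conv_output_size(state_shape[1], kernel_size)
    return rows * cols * out_channels


print(flatten_size((6, 7), out_channels=32))  # 384, the CNN based DQN
print(flatten_size((6, 7), out_channels=64))  # 768, the Rainbow feature extractor
```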

<hr>

### CNN based Rainbow

In [7]:
####################################################
# RAINBOW POLICY FROM PAPER NOTEBOOK 9
####################################################

class CNNForRainbow(torch.nn.Module):
    """
    Custom CNN to be used as base class for the Rainbow algorithm.
    Extracts "features" for the Rainbow algorithm by doing a 4x4 CNN kernel pass and providing 64 filters.
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu'):
        
        # Torch init
        super().__init__()
        
        # Store device to be used
        self.device = device
        
        # The input layer is singular -> we have 1 board vector
        input_channels_cnn = 1
        
        # We output 64 filters (increased from the original 16)
        output_channels_cnn = 64
        
        # We store the output dimension of the CNN "feature" layer
        self.output_dim = (state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn
        
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels= input_channels_cnn, out_channels= output_channels_cnn, kernel_size= 4, stride= 1), torch.nn.ReLU(inplace=True),
            torch.nn.Flatten(),
        )

    def forward(self,
                obs: typing.Union[np.ndarray, torch.Tensor],
                state: typing.Optional[typing.Any] = None,
                info: typing.Dict[str, typing.Any] = {}):
        # Make a torch instance (from regular vector of board)
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
            
        # The batch arrives without a channel dimension; add one so Conv2d receives (batch, channel, rows, columns)
        if (len(np.shape(obs)) != 4):
            obs = obs[:, None, :, :]
        
        # Return what is needed (network output & state)
        return self.net(obs), state

class Rainbow(CNNForRainbow):
    """
    Implementation of the Rainbow algorithm making use of the CNNForRainbow base class.
    Default parameters adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
    """

    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',
                 num_atoms: int = 51,
                 is_noisy: bool = True,
                 noisy_std: float = 0.1,
                 is_dueling: bool = True):
        
        # Init CNN feature extraction parent class
        super().__init__(state_shape= state_shape, device= device)
        
        # the amount of actions we have is just the action shape
        self.action_num = np.prod(action_shape)
        
        # Store class specific info
        self.num_atoms = num_atoms
        self._is_dueling = is_dueling

        # Our linear layer depends on whether or not we want to use noisy layers
        # Noisy implementation based on https://arxiv.org/abs/1706.10295
        def linear(x, y):
            if is_noisy:
                return ts.utils.net.discrete.NoisyLinear(x, y, noisy_std)
            else:
                return torch.nn.Linear(x, y)
            
        # Specify Q and V based on whether or not agent is dueling
        # Setting agent on dueling mode should help generalisation according to rainbow paper
        # NOTE: this uses the output dim from the feature extraction CNN
        self.Q = torch.nn.Sequential(
            linear(self.output_dim, 512), torch.nn.ReLU(inplace=True),
            linear(512, self.action_num * self.num_atoms))
        
        if self._is_dueling:
            self.V = torch.nn.Sequential(
                linear(self.output_dim, 512), torch.nn.ReLU(inplace=True),
                linear(512, self.num_atoms))
            
        # New output dim for this rainbow network
        self.output_dim = self.action_num * self.num_atoms
        

    def forward(self,
                obs: typing.Union[np.ndarray, torch.Tensor],
                state: typing.Optional[typing.Any] = None,
                info: typing.Dict[str, typing.Any] = {}):
        
        # Use our parent CNN based network to get "features"
        obs, state = super().forward(obs)
        
        # Get our Rainbow specific values
        q = self.Q(obs)
        q = q.view(-1, self.action_num, self.num_atoms)
        
        if self._is_dueling:
            v = self.V(obs)
            v = v.view(-1, 1, self.num_atoms)
            logits = q - q.mean(dim=1, keepdim=True) + v
        else:
            logits = q
        
        # We need to go from our logits to an accepted dimension of probability outputs
        probs = logits.softmax(dim=2)
        
        return probs, state
    
    
def rainbow_policy(state_shape: tuple,
                   action_shape: tuple,
                   optim: typing.Optional[torch.optim.Optimizer] = None,
                   learning_rate: float =  0.0000625,
                   gamma: float = 0.9,
                   n_step: int = 3, # Number of steps to look ahead
                   num_atoms: int = 51,
                   is_noisy: bool = True,
                   noisy_std: float = 0.1,
                   is_dueling: bool = True,
                   target_update_freq: int = 500):
    """
    Implementation of the Rainbow policy.
    Default parameters adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
    """
    
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Rainbow network to be used by policy
    net = Rainbow(state_shape= state_shape,
                  action_shape= action_shape,
                  device= device,
                  num_atoms= num_atoms,
                  is_noisy= is_noisy,
                  noisy_std= noisy_std,
                  is_dueling= is_dueling).to(device)
    
    # Default optimizer is an Adam optimizer with the given learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # Our agent's Rainbow policy
    return ts.policy.RainbowPolicy(model= net,
                                   optim= optim,
                                   discount_factor= gamma,
                                   num_atoms= num_atoms,
                                   estimation_step= n_step,
                                   target_update_freq= target_update_freq).to(device)
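
The `forward` method of `Rainbow` combines the two heads as `q - q.mean(dim=1) + v` and applies a softmax over the atoms, giving one probability distribution per action. The sketch below mirrors that computation in plain Python for a single observation and then collapses each distribution to an expected value. The support range of -10 to 10 is an assumption (to my knowledge Tianshou's default `v_min` and `v_max`, which the notebook does not override), and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dueling_distribution(q, v):
    """q: [actions][atoms] advantage logits, v: [atoms] value logits."""
    num_actions, num_atoms = len(q), len(v)
    # Mean over actions for every atom, like q.mean(dim=1, keepdim=True)
    mean_q = [sum(q[a][k] for a in range(num_actions)) / num_actions
              for k in range(num_atoms)]
    logits = [[q[a][k] - mean_q[k] + v[k] for k in range(num_atoms)]
              for a in range(num_actions)]
    # Softmax over the atom dimension, like logits.softmax(dim=2)
    return [softmax(row) for row in logits]


def expected_q(probs, v_min=-10.0, v_max=10.0):
    """Expected value per action over an evenly spaced support (assumed range)."""
    num_atoms = len(probs[0])
    step = (v_max - v_min) / (num_atoms - 1)
    support = [v_min + k * step for k in range(num_atoms)]
    return [sum(p * z for p, z in zip(row, support)) for row in probs]


# Toy example: 2 actions, 3 atoms
q = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
v = [0.0, 0.0, 0.0]
probs = dueling_distribution(q, v)
q_values = expected_q(probs)
print(q_values)
```

In this toy case the first action puts more mass on the highest atom, so its expected value is positive and it would be picked greedily.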



<hr>

### MiniMax policy

In [8]:
####################################################
# CUSTOM MINIMAX TIANSHOU POLICY
####################################################

class TianshouMiniMaxConnectFourPolicy(ts.policy.BasePolicy):
    """
    Tianshou compatible MiniMax policy for connect four.
    """

    def __init__(self,
                 coin: int,
                 oponent_coin: int,
                 minimax_depth: int,
                 column_count: int = 7,
                 row_count: int = 6,
                 **kwargs: typing.Any):
        # Init base policy
        super().__init__(**kwargs)
        
        # Configure minimax bot
        self.bot = minimaxbot.MiniMaxConnectFourBot(coin= coin,
                                                    oponent_coin= oponent_coin,
                                                    column_count= column_count,
                                                    row_count= row_count,
                                                    minimax_depth= minimax_depth)

    def forward(self,
                batch: ts.data.Batch,
                state: typing.Optional[typing.Union[dict, ts.data.Batch, np.ndarray]] = None,
                **kwargs: typing.Any):
        """
        Compute minimax action over the given batch data.
        """
        boards = batch["obs"]
        
        # Can be nested in Tianshou
        while isinstance(boards, ts.data.Batch):
            boards = boards["obs"]
        
        preds = [None] * len(boards)        
        
        for i in range(len(boards)):
            preds[i] = self.bot.predict(board= boards[i])
            
        
        return ts.data.Batch(act=preds, state=state)
    
    def learn(self, batch, **kwargs):
        # No learning needed
        return {}
    
    def set_eps(self, eps):
        # Not needed
        return
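
The policy above delegates to `MiniMaxConnectFourBot`, whose implementation lives in the repository and is not shown here. As a rough illustration of what the `minimax_depth` parameter controls, the sketch below runs a generic depth-limited minimax on a toy tree of nested lists. It is not the bot's actual code; the tree encoding and the `heuristic` function are assumptions made for the example.

```python
def heuristic(node):
    """Hypothetical static evaluation: leaves carry a score, unexplored positions count as 0."""
    return node if isinstance(node, (int, float)) else 0


def minimax(node, depth, maximizing):
    """Depth-limited minimax; internal nodes are lists of children, leaves are numbers."""
    if depth == 0 or isinstance(node, (int, float)):
        return heuristic(node)
    scores = [minimax(child, depth - 1, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)


def best_move(children, depth):
    """Index of the child with the highest minimax score for the player to move."""
    scores = [minimax(child, depth - 1, maximizing=False) for child in children]
    return scores.index(max(scores))


# Move 1 only looks attractive once the search sees three plies deep
tree = [[3, 5], [4, [8, 6]]]
print(best_move(tree, depth=2))  # 0
print(best_move(tree, depth=3))  # 1
```

A deeper search can change the chosen move, at the cost of visiting more positions, which is the trade-off behind `minimax_depth= 5` in the game setup.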

    


<hr><hr>

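## Loading in torch dictionaries

The saved weights may have been written on a CUDA device, so the helper below maps them to the CPU when CUDA is not available.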



In [9]:
####################################################
# FUNCTION FOR LOADING IN TORCH DICTIONARIES
####################################################

def load_torch_dict(filename):
    """
    Loads in torch dictionary using correct cuda settings for current device
    """   
    if torch.cuda.is_available():
        return torch.load(filename)
    else:
        return torch.load(filename, map_location=torch.device('cpu'))

<hr><hr>

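## Setup the game

Set exactly one `if` block per player to `True` to choose who plays: a trained PyTorch policy, the MiniMax bot, or yourself (`"me"`).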



In [19]:
####################################################
# SETUP THE GAME
####################################################

if (True):
    # Player 1 is a pytorch bot
    player1 = rainbow_policy(state_shape= (6, 7),
                            action_shape= (7,),
                            learning_rate=  0.0000625,
                            gamma= 0.9,
                            n_step= 3, # Number of steps to look ahead
                            num_atoms= 51,
                            is_noisy= True,
                            noisy_std= 0.1,
                            is_dueling= True)
    player1.load_state_dict(load_torch_dict("../paper_notebooks/./saved_variables/paper_notebooks/11/1-250epoch_5loop/looping-iteration-4/best_policy_agent1.pth"))
    player1.set_eps(0)

if (False):
    # Player 1 is a minimax bot
    player1 = TianshouMiniMaxConnectFourPolicy(coin= 1,
                                               oponent_coin= 2,
                                               column_count= game.GRID_COLUMN_COUNT,
                                               row_count= game.GRID_ROW_COUNT,
                                               minimax_depth= 5)
    
if (False):
    # We are player 1
    player1 = "me"

# ------------------------------------------------------------------------------------------------------

# Specify either an instance of an object that can predict a move or "me" for player 2
if (False):
    # Player 2 is a pytorch bot
    player2 = rainbow_policy(state_shape= (6, 7),
                            action_shape= (7,),
                            learning_rate=  0.0000625,
                            gamma= 0.9,
                            n_step= 3, # Number of steps to look ahead
                            num_atoms= 51,
                            is_noisy= True,
                            noisy_std= 0.1,
                            is_dueling= True)
    player2.load_state_dict(load_torch_dict("../paper_notebooks/./saved_variables/paper_notebooks/10/1-500epoch_20loop/looping-iteration-9/best_policy_agent2.pth"))
    player2.set_eps(0)

if (False):
    # Player 2 is a minimax bot
    player2 = minimaxbot.MiniMaxConnectFourBot(coin= 2,
                                               oponent_coin= 1,
                                               column_count= game.GRID_COLUMN_COUNT,
                                               row_count= game.GRID_ROW_COUNT,
                                               minimax_depth= 5)  
if (True):
    # We are player 2
    player2 = "me"

# ------------------------------------------------------------------------------------------------------
    
# Play the game
game.play_game(player1= player1,
               player2= player2)
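
Toggling `if (True)` and `if (False)` blocks works, but it is easy to leave two blocks enabled by accident, in which case the last one silently wins. One possible alternative is to select players by name from a dictionary of factories. The sketch below uses placeholder factories so that it runs on its own; in the notebook they would wrap `rainbow_policy(...)` plus `load_state_dict`, or `TianshouMiniMaxConnectFourPolicy(...)`. The names `make_player` and `factories` are hypothetical.

```python
def make_player(kind, factories):
    """Look up and call the factory registered for the requested player kind."""
    if kind not in factories:
        raise ValueError(f"Unknown player kind {kind!r}; choose from {sorted(factories)}")
    return factories[kind]()


# Placeholder factories (assumptions): replace the lambdas with real constructors
factories = {
    "me": lambda: "me",
    "minimax": lambda: "minimax-bot-placeholder",
    "rainbow": lambda: "rainbow-bot-placeholder",
}

demo_player1 = make_player("rainbow", factories)
demo_player2 = make_player("me", factories)
print(demo_player1, demo_player2)
```

Because each factory is only called when selected, unused policies are never built and their weights are never loaded.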

In [None]:
####################################################
# REMOVE UNUSED VARIABLES
####################################################

del player1
del player2