# CNN based Rainbow vs minimax

In the previous notebook, `10-rainbow-fixed-opponent.ipynb`, we performed an experiment where 2 rainbow agents fight against each otehr, freezing both agents alteratingly to simulate a league based system.
This time around we train our rainbow agent against an increasingly smarter minimax agent. 

<hr><hr>

## Table of Contents

- Contact information
- Checking requirements
  - Correct Anaconda environment
  - Correct module access
  - Correct CUDA access
- Training rainbow agent against frozen rainbow agent
  - Building the environment
  - Implementing the MiniMax policy
  - Implementing the Rainbow policy
  - Building agents
  - Function for letting agents learn
  - Function for watching learned agent
  - Doing the experiment
- Discussion

<hr><hr>

## Contact information

| Name             | Student ID | VUB mail                                                  | Personal mail                                               |
| ---------------- | ---------- | --------------------------------------------------------- | ----------------------------------------------------------- |
| Lennert Bontinck | 0568702    | [lennert.bontinck@vub.be](mailto:lennert.bontinck@vub.be) | [info@lennertbontinck.com](mailto:info@lennertbontinck.com) |



<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `rl-project` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the RL course project and homeworks](https://github.com/pikawika/vub-rl).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'rl-project'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: rl-project
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules and show if the versions match those that are recommended.

In [3]:
####################################################
# LOADING MODULES
####################################################

# Allow reloading of libraries
import importlib

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Argparser
import argparse

# More data types
import typing
import numpy as np

# Pygame
import pygame; print(f"Pygame version (2.1.2 recommended): {pygame.__version__}")

# Gym environment
import gym; print(f"Gym version (0.21.0 recommended): {gym.__version__}")

# Tianshou for RL algorithms
import tianshou as ts; print(f"Tianshou version (0.4.8 recommended): {ts.__version__}")

# Torch is a popular DL framework
import torch; print(f"Torch version (1.12.0 recommended): {torch.__version__}")

# PPrint is a pretty print for variables
from pprint import pprint

# Our custom connect four gym environment
import sys
sys.path.append('../')
import gym_connect4_pygame.envs.ConnectFourPygameEnvV2 as cfgym
import minimax_agent.minimax_agent as minimaxbot
importlib.invalidate_caches()
importlib.reload(cfgym)
importlib.reload(minimaxbot);

# Time for allowing "freezes" in execution
import time;

# Allow for copying objects in a non reference manner
import copy

# Used for updating notebook display
from IPython.display import clear_output

Matplotlib version (3.5.1 recommended): 3.5.1
Pygame version (2.1.2 recommended): 2.1.2
Gym version (0.21.0 recommended): 0.21.0
Tianshou version (0.4.8 recommended): 0.4.8
Torch version (1.12.0 recommended): 1.12.0.dev20220520+cu116


In [4]:
####################################################
# FUNCTION FOR LOADING IN TORCH DICTIONARIIES
####################################################

def load_torch_dict(filename):
    """
    Loads in torch dictionary using correct cuda settings for current device
    """   
    if torch.cuda.is_available():
        return torch.load(filename)
    else:
        return torch.load(filename, map_location=torch.device('cpu'))

<hr>

### Correct CUDA access

The installation instructions specify how to install PyTorch with CUDA 11.6.
The following code block tests if this was done successfully.

In [5]:
####################################################
# CUDA VALIDATION
####################################################

# Check cuda available
print(f"CUDA is available: {torch.cuda.is_available()}")

# Show cuda devices
print(f"\nAmount of connected devices supporting CUDA: {torch.cuda.device_count()}")

# Show current cuda device
print(f"\nCurrent CUDA device: {torch.cuda.current_device()}")

# Show cuda device name
print(f"Cuda device 0 name: {torch.cuda.get_device_name(0)}")

CUDA is available: True

Amount of connected devices supporting CUDA: 1

Current CUDA device: 0
Cuda device 0 name: NVIDIA GeForce GTX 970


<hr><hr>

## Training rainbow agent against frozen rainbow agent

Our connect four gym setup requires two agents, one for each player.
To reduce complexity, agents will always play as the same player, e.g. always as player 1.
It is important to note that connect four is a *solved game*.
According to [The Washington Post](https://www.washingtonpost.com/news/wonk/wp/2015/05/08/how-to-win-any-popular-game-according-to-data-scientists/):

> Connect Four is what mathematicians call a "solved game," meaning you can play it perfectly every time, no matter what your opponent does. You will need to get the first move, but as long as you do so, you can always win within 41 moves.

<hr>

### Building the environment

This code is identical to the notebook `9-rainbow.ipynb`, a reward for blocking moves is given.

In [6]:
####################################################
# CONNECT FOUR V2 ENVIRONMENT
####################################################

def get_env():
    """
    Returns the connect four gym environment V2 altered for Tianshou and Petting Zoo compatibility.
    Already wrapped with a ts.env.PettingZooEnv wrapper.
    """
    return ts.env.PettingZooEnv(cfgym.env(reward_move= 0, # Set to 1 for reward to make moves (incentivise longer games)
                                          reward_blocking= 1, # Set to 1 for reward to make blocking moves (incentivise defensive games)
                                          reward_invalid= -3,
                                          reward_draw= 3,
                                          reward_win= 5,
                                          reward_loss= -5,
                                          allow_invalid_move= False))
    
    
# Test the environment
env = get_env()
print(f"Observation space: {env.observation_space}")
print(f"\nAction space: {env.action_space}")

# Reset the environment to start from a clean state, returns the initial observation
observation = env.reset()

print("\n Initial player id:")
print(observation["agent_id"])

print("\n Initial observation:")
print(observation["obs"])

print("\n Initial mask:")
print(observation["mask"])

# Clean unused variables
del observation
del env

Observation space: Dict(action_mask:Box([0 0 0 0 0 0 0], [1 1 1 1 1 1 1], (7,), int8), observation:Box([[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]], [[2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2]], (6, 7), int8))

Action space: Discrete(7)

 Initial player id:
player_1

 Initial observation:
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]

 Initial mask:
[True, True, True, True, True, True, True]


In [7]:
####################################################
# BLOCKING MOVE CHECK
####################################################

# Check if a reward is received for playing a blocking move

env = get_env()
env.reset()
env.step(action= 0)
print(env.rewards)
env.step(action= 1)
print(env.rewards)
env.step(action= 0)
print(env.rewards)
env.step(action= 1)
print(env.rewards)
env.step(action= 0)
print(env.rewards)
env.step(action= 1)
print(env.rewards)
env.step(action= 1)
print(f"Blocking move made by player 1: {env.rewards}")

[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
[0, 0]
Blocking move made by player 1: [1, 0]


<hr>

### Implementing the MiniMax policy

We provide the minimax algorithm as a Tianshou policy so that it can be used from within Tianshou.

In [8]:
####################################################
# CUSTOM MINIMAX TIANSHOU POLICY
####################################################

class TianshouMiniMaxConnectFourPolicy(ts.policy.BasePolicy):
    """
    Tianshou compatible MiniMax policy for connect four.
    """

    def __init__(self,
                 coin: int,
                 oponent_coin: int,
                 minimax_depth: int,
                 column_count: int = 7,
                 row_count: int = 6,
                 **kwargs: typing.Any):
        # Init base policy
        super().__init__(**kwargs)
        
        # Configure minimax bot
        self.bot = minimaxbot.MiniMaxConnectFourBot(coin= coin,
                                                    oponent_coin= oponent_coin,
                                                    column_count= column_count,
                                                    row_count= row_count,
                                                    minimax_depth= minimax_depth)

    def forward(self,
                batch: ts.data.Batch,
                state: typing.Optional[typing.Union[dict, ts.data.Batch, np.ndarray]] = None,
                **kwargs: typing.Any):
        """
        Compute minimax action over the given batch data.
        """
        boards = batch["obs"]
        
        # Can be nested in Tianshou
        while isinstance(boards, ts.data.Batch):
            boards = boards["obs"]
        
        preds = [None] * len(boards)        
        
        for i in range(len(boards)):
            preds[i] = self.bot.predict(board= boards[i])
            
        
        return ts.data.Batch(act=preds, state=state)
    
    def learn(self, batch, **kwargs):
        # No learning needed
        return {}
    
    def set_eps(self, eps):
        # Not needed
        return

    


<hr>

### Implementing the Rainbow policy

This code is identical to the notebook `9-rainbow.ipynb`, the defaults are changed so that they reflect the best found parameters.

In [9]:
####################################################
# DQN ARCHITECTURE
####################################################

class CNNForRainbow(torch.nn.Module):
    """
    Custom CNN to be used as baseclass for the Rainbow algorithm.
    Extracts "feautures" for the Rainbow algorithm by doing a 4x4 cnn kernel pass and providing 64 filters for each mask.
    """
    def __init__(self,
                 state_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu'):
        
        # Torch init
        super().__init__()
        
        # Store device to be used
        self.device = device
        
        # The input layer is singular -> we have 1 board vector
        input_channels_cnn = 1
        
        # We output 64 filters per kernel 
        output_channels_cnn = 64 # Updated from previous 16
        
        # We store the output dimension of the CNN "feature" layer
        self.output_dim = (state_shape[0] - 3) * (state_shape[1] - 3) * output_channels_cnn
        
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels= input_channels_cnn, out_channels= output_channels_cnn, kernel_size= 4, stride= 1), torch.nn.ReLU(inplace=True),
            torch.nn.Flatten(),
        )

    def forward(self,
                obs: typing.Union[np.ndarray, torch.Tensor],
                state: typing.Optional[typing.Any] = None,
                info: typing.Dict[str, typing.Any] = {}):
        # Make a torch instance (from regular vector of board)
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float, device=self.device)
            
        # Tianshou bugs the batch output, reshape to work properly with our torch version
        if (len(np.shape(obs)) != 4):
            obs = obs[:, None, :, :]
        
        # Return what is needed (network output & state)
        return self.net(obs), state


In [10]:
####################################################
# RAINBOW ARCHITECTURE
####################################################

class Rainbow(CNNForRainbow):
    """
    Implementation of the Rainbow algorithm making using of the CNNForRainbow baseclass.
    Default parameters adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
    """

    def __init__(self,
                 state_shape: typing.Sequence[int],
                 action_shape: typing.Sequence[int],
                 device: typing.Union[str, int, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu',
                 num_atoms: int = 51,
                 is_noisy: bool = True,
                 noisy_std: float = 0.1,
                 is_dueling: bool = True):
        
        # Init CNN feature extraction parent class
        super().__init__(state_shape= state_shape, device= device)
        
        # the amount of actions we have is just the action shape
        self.action_num = np.prod(action_shape)
        
        # Store class specific info
        self.num_atoms = num_atoms
        self._is_dueling = is_dueling

        # Our linear layer depends on wether or not we want to use a noisy environment
        # Noisy implementation based on https://arxiv.org/abs/1706.10295
        def linear(x, y):
            if is_noisy:
                return ts.utils.net.discrete.NoisyLinear(x, y, noisy_std)
            else:
                return torch.nn.Linear(x, y)
            
        # Specify Q and V based on wether or not agent is dueling
        # Setting agent on dueling mode should help generalisation according to rainbow paper
        # NOTE: this uses the output dim from the feature extraction CNN
        self.Q = torch.nn.Sequential(
            linear(self.output_dim, 512), torch.nn.ReLU(inplace=True),
            linear(512, self.action_num * self.num_atoms))
        
        if self._is_dueling:
            self.V = torch.nn.Sequential(
                linear(self.output_dim, 512), torch.nn.ReLU(inplace=True),
                linear(512, self.num_atoms))
            
        # New output dim for this rainbow network
        self.output_dim = self.action_num * self.num_atoms
        

    def forward(self,
                obs: typing.Union[np.ndarray, torch.Tensor],
                state: typing.Optional[typing.Any] = None,
                info: typing.Dict[str, typing.Any] = {}):
        
        # Use our parent CNN based network to get "features"
        obs, state = super().forward(obs)
        
        # Get our Rainbow specific values
        q = self.Q(obs)
        q = q.view(-1, self.action_num, self.num_atoms)
        
        if self._is_dueling:
            v = self.V(obs)
            v = v.view(-1, 1, self.num_atoms)
            logits = q - q.mean(dim=1, keepdim=True) + v
        else:
            logits = q
        
        # We need to go from our logits to an accepted dimension of probability outputs
        probs = logits.softmax(dim=2)
        
        return probs, state

In [11]:
####################################################
# RAINBOW POLICY
####################################################

def rainbow_policy(state_shape: tuple,
                   action_shape: tuple,
                   optim: typing.Optional[torch.optim.Optimizer] = None,
                   learning_rate: float =  0.0001, # Increased from 0000625
                   gamma: float = 0.8, # Decreased from 0.9
                   n_step: int = 3,
                   num_atoms: int = 51,
                   is_noisy: bool = True,
                   noisy_std: float = 0.1,
                   is_dueling: bool = True,
                   frozen: bool = False, # Added to freeze an agent
                   target_update_freq: int = 500):
    """
    Implementation of the Rainbow policy.
    Default parameters adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
    """
    
    # Use cuda device if possible
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Rainbow network to be used by policy
    net = Rainbow(state_shape= state_shape,
                  action_shape= action_shape,
                  device= device,
                  num_atoms= num_atoms,
                  is_noisy= is_noisy,
                  noisy_std= noisy_std,
                  is_dueling= is_dueling).to(device)
    
    # Default optimizer is an adam optimizer with the argparser learning rate
    if optim is None:
        optim = torch.optim.Adam(net.parameters(), lr= learning_rate)
        
    # If we are frozen, we use an optimizer that has learning rate 0
    if frozen:
        optim = torch.optim.SGD(net.parameters(), lr= 0)
        
    # Our agents Rainbow policy
    return ts.policy.RainbowPolicy(model= net,
                                   optim= optim,
                                   discount_factor= gamma,
                                   num_atoms= num_atoms,
                                   estimation_step= n_step,
                                   target_update_freq= target_update_freq).to(device)
    
    

<hr>

### Building agents

This code is identical to the notebook `9-rainbow.ipynb`, with the added option of "freezing" an agent which corresponds to giving it an optimizer with learning rate 0.

In [12]:
####################################################
# AGENT CREATION
####################################################

def get_agent_manager(agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                      agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                      agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                      agent_player2_frozen: bool = False,
                      optim: typing.Optional[torch.optim.Optimizer] = None):
    """
    Gets a multi agent policy manager, optimizer and player ids for the connect four V2 gym environment.
    Per default this returns 
        - Multi agent manager for 2 agents using Rainbow
        - Adam optimizer
        - ['player_1', 'player_2'] from the connect four environment
    """
    
    # Get the environment to play in (Connect four gym V2)
    env = get_env()
    
    # Get the observation space from the environment, depending on typo of space (ternary operator)
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    
    # Set the arguments
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure agent player 1 to be a Rainbow if no policy is passed.
    if agent_player1 is None:
        # Our agent1 uses a Rainbow policy
        agent_player1 = rainbow_policy(state_shape= state_shape,
                                       action_shape= action_shape,
                                       optim= optim,
                                       frozen= agent_player1_frozen)
    
    # Configure agent player 2 to be a Rainbow if no policy is passed.
    if agent_player2 is None:
        # Our agent1 uses a Rainbow policy
        agent_player2 = rainbow_policy(state_shape= state_shape,
                                       action_shape= action_shape,
                                       optim= optim,
                                       frozen= agent_player2_frozen)

    # Default order of the agents
    agents = [agent_player1, agent_player2]
        
    # Create the multi agent policy
    policy = ts.policy.MultiAgentPolicyManager(agents, env)
    
    # Return our policy, optimizer and the available agents in the environment
    # Per default: 
    #   - Multi agent manager for 2 agents using Rainbow
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    
    return policy, optim, env.agents

<hr>

### Function for letting agents learn

This code is identical to the notebook `9-rainbow.ipynb`, but a stopping condition is added and the defaults are updated to the newly found best, the reward metric is also updated to relfect the score of the non frozen agent.
The testing strategy is also updated to be on one environment using 10 trials.
We also decay the epsilon faster and don't use epsilon decay on the frozen agent.

In [13]:
####################################################
# AGENT TRAINING
####################################################

def train_agent(filename: str,
                agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
                agent_player1_frozen: bool = False, # Freeze a player -> don't let it learn further
                agent_player2_frozen: bool = False,
                single_agent_score_as_reward: bool= False, # Uses non frozen agent's score as reward
                optim: typing.Optional[torch.optim.Optimizer] = None,
                training_env_num: int = 10,
                testing_env_num: int = 10,
                episode_per_test: int = 10,
                stopping_threshold: float = 7,
                buffer_size: int = 10000, # Default 100000
                batch_size: int = 64, # Default 32
                epochs: int = 500, # Default 50
                step_per_epoch: int = 10000,
                step_per_collect: int = 10, # Should be multiple of the test/training envs
                update_per_step: float = 0.1,
                testing_eps: float = 0.005,
                training_eps_init: float = 1,
                training_eps_final: float = 0.2): # Default 0.05
    """
    Trains two agents in the connect four V2 environment and saves their best model and logs.
    Returns:
        - result from offpolicy_trainer
        - final version of agent 1
        - final version of agent 2
    Defaults adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
    """

    # ======== notebook specific =========
    notebook_version = '11' # Used for foldering logs and models

    # ======== environment setup =========
    train_envs = ts.env.DummyVectorEnv([get_env for _ in range(training_env_num)])
    test_envs = ts.env.DummyVectorEnv([get_env for _ in range(testing_env_num)])
    
    # set the seed for reproducibility
    np.random.seed(1998)
    torch.manual_seed(1998)
    train_envs.seed(1998)
    test_envs.seed(1998)

    # ======== agent setup =========
    # Gets our agents from the previously made function
    # Per default: 
    #   - Multi agent manager for 2 agents using Rainbow
    #   - Adam optimizer
    #   - ['player_1', 'player_2'] from the connect four environment
    policy, optim, agents = get_agent_manager(agent_player1=agent_player1,
                                              agent_player2=agent_player2,
                                              agent_player1_frozen= agent_player1_frozen,
                                              agent_player2_frozen= agent_player2_frozen,
                                              optim=optim)

    # ======== collector setup =========
    # Make a collector for the training environments
    buffer= ts.data.VectorReplayBuffer(total_size= buffer_size,
                                       buffer_num=len(train_envs))
    
    train_collector = ts.data.Collector(policy= policy,
                                        env= train_envs,
                                        buffer= buffer,
                                        exploration_noise= True)
    
    # Make a collector for the testing environments
    test_collector = ts.data.Collector(policy= policy,
                                       env= test_envs,
                                       exploration_noise= True)
    
    # ======== ensure folders exist =========
    if not os.path.exists(os.path.join('./logs', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./logs', 'paper_notebooks', notebook_version, filename))
    if not os.path.exists(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename)):
        os.makedirs(os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename))

    # ======== tensorboard logging setup =========
    # Allows to save the training progress to tensorboard compatable logs
    log_path = os.path.join('./logs', 'paper_notebooks', notebook_version, filename)
    writer = torch.utils.tensorboard.SummaryWriter(log_path)
    logger = ts.utils.TensorboardLogger(writer)

    # ======== callback functions used during training =========
    # We want to save our best policy
    def save_best_fn(policy):
        """
        Callback to save the best model
        """
        # Save best agent 1
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent1.pth')
        torch.save(policy.policies[agents[0]].state_dict(), model_save_path)
        
        # Save best agent 2
        model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'best_policy_agent2.pth')
        torch.save(policy.policies[agents[1]].state_dict(), model_save_path)
        
        # Save agent2

    def stop_fn(average_rews):
        """
        Callback to stop training when we've reached the desired reward.
        Reward is the test average return value of the reward_metric function.
        """
        if single_agent_score_as_reward:
            # Get singular episode mean reward
            episode_reward= average_rews / episode_per_test
            stop= episode_reward >= stopping_threshold
            print(f"testing for stop: {episode_reward} >= {stopping_threshold} -> {stop}")
            # Agent is seen as "trained enough"
            return stop
        else:
            return False # Not implemented

    def train_fn(epoch, env_step):
        """
        Callback before training, sets the training epsilon in a decaying manner.
        Adopted from: https://github.com/thu-ml/tianshou/blob/master/examples/atari/atari_rainbow.py
        """        
        # Nature DQN setting to have a "linear decaying epsilon" for the first 50 thousand iterations
        if env_step <= 50000:
            training_eps = training_eps_init - env_step / 1000000 * (training_eps_init - training_eps_final)
        else:
            training_eps = training_eps_final
            
            
        # Set epsilon
        policy.policies[agents[0]].set_eps(training_eps)
        policy.policies[agents[1]].set_eps(training_eps)
        
        # If frozen we don't have a large epsilon
        if agent_player1_frozen:
            policy.policies[agents[0]].set_eps(training_eps_final)
        if agent_player2_frozen:
            policy.policies[agents[1]].set_eps(training_eps_final)

    def test_fn(epoch, env_step):
        """
        Callback beore testing, sets the testing epsilon.
        """        
        # Before testing we want to configure the epsilon for the agents
        # In general more greedy than the train case but not
        #   to avoid getting stuck on invalid moves
        policy.policies[agents[0]].set_eps(testing_eps)
        policy.policies[agents[1]].set_eps(testing_eps)

    def reward_metric(rews):
        """
        Callback for reward collection.
        Currently the reward is the sum of both agents.
        """        
        if agent_player2_frozen and single_agent_score_as_reward:
            # agent 2 frozen, optimizing for agent 1
            return rews[:, 0]
        
        if agent_player1_frozen and single_agent_score_as_reward:
            # agent 1 frozen, optimizing for agent 2
            return rews[:, 1]
        
        # Per default we are interested in optimizing both agents
        return rews[:, 0] + rews[:, 1]

    # ======== Training =========
    # off policy training
    result = ts.trainer.offpolicy_trainer(policy= policy,
                                          train_collector= train_collector,
                                          test_collector= test_collector,
                                          max_epoch= epochs,
                                          step_per_epoch= step_per_epoch,
                                          step_per_collect= step_per_collect,
                                          episode_per_test= episode_per_test,
                                          batch_size= batch_size,
                                          train_fn= train_fn,
                                          test_fn= test_fn,
                                          stop_fn= stop_fn,
                                          save_best_fn= save_best_fn,
                                          update_per_step= update_per_step,
                                          logger= logger,
                                          test_in_train= False,
                                          reward_metric= reward_metric)
    
    # Save final agent 1
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent1.pth')
    torch.save(policy.policies[agents[0]].state_dict(), model_save_path)

    # Save final agent 2
    model_save_path = os.path.join('./saved_variables', 'paper_notebooks', notebook_version, filename, 'final_policy_agent2.pth')
    torch.save(policy.policies[agents[1]].state_dict(), model_save_path)

    return result, policy.policies[agents[0]], policy.policies[agents[1]]


<hr>

### Function for watching learned agent

Identical to the previous notebook.

In [14]:
####################################################
# WATCHING THE LEARNED POLICY IN ACTION
####################################################

def watch(numer_of_games: int = 3,
          agent_player1: typing.Optional[ts.policy.BasePolicy] = None,
          agent_player2: typing.Optional[ts.policy.BasePolicy] = None,
          test_epsilon: float = 0.005, # For the watching we act completely greedy but low random for not getting stuck on invalid move
          render_speed: float = 0.15, # Amount of seconds to update frame/ do a step
          ) -> None:
    
    # Get the connect four V2 environment (must be a list)
    env= ts.env.DummyVectorEnv([get_env])
    
    # Get the agents from the trained agents
    policy, optim, agents = get_agent_manager(agent_player1= agent_player1,
                                              agent_player2= agent_player2)
    
    # Evaluate the policy
    policy.eval()
    
    # Set the testing policy epsilon for our agents
    policy.policies[agents[0]].set_eps(test_epsilon)
    policy.policies[agents[1]].set_eps(test_epsilon)
    
    # Collect the test data
    collector = ts.data.Collector(policy= policy,
                                  env= env,
                                  exploration_noise= True)
    
    # Render games in human mode to see how it plays
    result = collector.collect(n_episode= numer_of_games, render= render_speed)
    
    # Close the environment aftering collecting the results
    # This closes the pygame window after completion
    env.close()
    
    # Get the rewards and length from the test trials
    rewards, length = result["rews"], result["lens"]
    
    # Print the final reward for the first agent
    print(f"Average steps of game:  {length.mean()}")
    print(f"Final mean reward agent 1: {rewards[:, 0].mean()}, std: {rewards[:, 0].std()}")
    print(f"Final mean reward agent 2: {rewards[:, 1].mean()}, std: {rewards[:, 1].std()}")

<hr>

### Doing the experiment

To test if we can better train our agents when playing against the minimax agent, we play against it in increasing minimax depth as to simulate increasing difficulty.

1. We play as player 1 so that we could potentially learn the complete connect four game. We changed the epsilon values as well as to not loose to much info either. The stopping criteria is 10.

| **MiniMax depth** | **Test score**         | **Epoch** |
|-------------------|------------------------|-----------|
| 1                 | 108.200000 ± 2.400000  | 182       |
| 2                 | 107.000000 ± 0.000000  | 45        |
| 3                 | 113.000000 ± 0.000000  | 11        |
| 4                 | 104.800000 ± 36.600000 | 6         |
| 5                 | 109                    | 6         |


In [15]:
####################################################
# EXPERIMENT: TRAINING AGENTS
####################################################

# Select agent for minimax
agent1_is_minimax = False

# Specify starter for rainbow
rainbow_starting_params = "./saved_variables/paper_notebooks/9/rainbow_vs_rainbow_blocking_reward_complex_cnn/best_policy_agent1.pth"

# Experiment settings
epochs = 250
loops = 5
stopping_threshold = 10
training_eps_init = 0.4
training_eps_final = 0.05

# Filename prefix
filename_prefix = "1-250epoch_5loop/looping-iteration-"

for loop_idx in range(loops):
    # Depth is loop index +1 
    depth = loop_idx + 1
    
    # Filename
    filename = filename_prefix + str(loop_idx)
    
    # Use provided starting params in first loop, the one from previous iteration in next
    if loop_idx > 0:
        if agent1_is_minimax:
            rainbow_starting_params = "./saved_variables/paper_notebooks/11/" + filename_prefix + str(loop_idx - 1) + "/best_policy_agent2.pth"
        else:
            rainbow_starting_params = "./saved_variables/paper_notebooks/11/" + filename_prefix + str(loop_idx - 1) + "/best_policy_agent1.pth"
    
    
    # Show info
    print()
    training_agent = "2" if agent1_is_minimax else "1"
    print(f"Started training agent player {training_agent} against minimax with depth {depth}")
    
    # Get the environment settings
    env = get_env()
    observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
    state_shape = observation_space.shape or observation_space.n
    action_shape = env.action_space.shape or env.action_space.n
    
    # Configure rainbow agent
    rainbow_agent = rainbow_policy(state_shape= state_shape,
                                   action_shape= action_shape)
    
    if rainbow_starting_params:
        rainbow_agent.load_state_dict(load_torch_dict(rainbow_starting_params))
        
    # Configure minimax agent
    minimax_agent = TianshouMiniMaxConnectFourPolicy(coin= 1 if agent1_is_minimax else 2,
                                                     oponent_coin= 2 if agent1_is_minimax else 1,
                                                     minimax_depth= depth)
        
        
    # Train the agent
    off_policy_traininer_results, final_agent_player1, final_agent_player2 = train_agent(epochs= epochs,
                                                                                         agent_player1= minimax_agent if agent1_is_minimax else rainbow_agent,
                                                                                         agent_player1_frozen= True if agent1_is_minimax else False,
                                                                                         agent_player2= rainbow_agent if agent1_is_minimax else minimax_agent,
                                                                                         agent_player2_frozen= False if agent1_is_minimax else True,
                                                                                         filename= filename,
                                                                                         training_eps_init = training_eps_init,
                                                                                         training_eps_final = training_eps_final,
                                                                                         stopping_threshold= stopping_threshold,
                                                                                         single_agent_score_as_reward = True)
            
            


Started training agent player 1 against minimax with depth 1


Epoch #1: 10001it [00:34, 287.83it/s, env_step=10000, len=14, n/ep=0, n/st=10, player_1/loss=1.533, rew=3.00]          


Epoch #1: test_reward: 11.000000 ± 0.000000, best_reward: 11.000000 ± 0.000000 in #1
testing for stop: 1.1 >= 10 -> False


Epoch #2: 10001it [00:34, 291.84it/s, env_step=20000, len=22, n/ep=0, n/st=10, player_1/loss=1.495, rew=27.00]         


Epoch #2: test_reward: 11.000000 ± 0.000000, best_reward: 11.000000 ± 0.000000 in #1
testing for stop: 1.1 >= 10 -> False


Epoch #3: 10001it [00:34, 293.67it/s, env_step=30000, len=10, n/ep=1, n/st=10, player_1/loss=1.423, rew=-5.00]         


Epoch #3: test_reward: 55.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #4: 10001it [00:34, 292.16it/s, env_step=40000, len=20, n/ep=1, n/st=10, player_1/loss=1.366, rew=17.00]         


Epoch #4: test_reward: 29.100000 ± 5.700000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #5: 10001it [00:31, 321.79it/s, env_step=50000, len=10, n/ep=1, n/st=10, player_1/loss=1.330, rew=-5.00]         


Epoch #5: test_reward: 5.800000 ± 2.400000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #6: 10001it [00:33, 300.82it/s, env_step=60000, len=32, n/ep=1, n/st=10, player_1/loss=1.113, rew=19.00]         


Epoch #6: test_reward: 12.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #7: 10001it [00:34, 291.91it/s, env_step=70000, len=18, n/ep=1, n/st=10, player_1/loss=1.134, rew=3.00]          


Epoch #7: test_reward: 5.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #8: 10001it [00:34, 292.29it/s, env_step=80000, len=18, n/ep=1, n/st=10, player_1/loss=1.092, rew=3.00]          


Epoch #8: test_reward: 10.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #9: 10001it [00:33, 294.18it/s, env_step=90000, len=18, n/ep=0, n/st=10, player_1/loss=1.067, rew=3.00]          


Epoch #9: test_reward: 4.600000 ± 4.800000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #10: 10001it [00:34, 290.05it/s, env_step=100000, len=16, n/ep=1, n/st=10, player_1/loss=1.084, rew=-5.00]       


Epoch #10: test_reward: 10.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #11: 10001it [00:34, 289.53it/s, env_step=110000, len=13, n/ep=0, n/st=10, player_1/loss=1.084, rew=12.00]       


Epoch #11: test_reward: 10.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #12: 10001it [00:34, 289.61it/s, env_step=120000, len=18, n/ep=2, n/st=10, player_1/loss=1.066, rew=3.00]        


Epoch #12: test_reward: 2.400000 ± 1.800000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #13: 10001it [00:34, 289.47it/s, env_step=130000, len=18, n/ep=0, n/st=10, player_1/loss=1.000, rew=1.00]        


Epoch #13: test_reward: 3.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #14: 10001it [00:34, 293.01it/s, env_step=140000, len=22, n/ep=1, n/st=10, player_1/loss=0.994, rew=7.00]        


Epoch #14: test_reward: 10.000000 ± 0.000000, best_reward: 55.000000 ± 0.000000 in #3
testing for stop: 5.5 >= 10 -> False


Epoch #15: 10001it [00:33, 294.20it/s, env_step=150000, len=23, n/ep=0, n/st=10, player_1/loss=0.972, rew=23.00]       


Epoch #15: test_reward: 92.200000 ± 32.400000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #16: 10001it [00:33, 297.19it/s, env_step=160000, len=17, n/ep=1, n/st=10, player_1/loss=0.912, rew=10.00]       


Epoch #16: test_reward: 3.000000 ± 0.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #17: 10001it [00:33, 295.75it/s, env_step=170000, len=19, n/ep=2, n/st=10, player_1/loss=0.946, rew=2.00]        


Epoch #17: test_reward: 3.000000 ± 0.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #18: 10001it [00:33, 295.96it/s, env_step=180000, len=30, n/ep=2, n/st=10, player_1/loss=0.917, rew=68.00]       


Epoch #18: test_reward: 3.000000 ± 0.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #19: 10001it [00:33, 295.37it/s, env_step=190000, len=15, n/ep=0, n/st=10, player_1/loss=0.972, rew=7.50]        


Epoch #19: test_reward: 3.000000 ± 0.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #20: 10001it [00:33, 294.73it/s, env_step=200000, len=18, n/ep=1, n/st=10, player_1/loss=0.983, rew=9.00]        


Epoch #20: test_reward: 4.100000 ± 3.300000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #21: 10001it [00:34, 293.39it/s, env_step=210000, len=19, n/ep=1, n/st=10, player_1/loss=0.971, rew=14.00]       


Epoch #21: test_reward: 4.000000 ± 3.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #22: 10001it [00:33, 294.84it/s, env_step=220000, len=14, n/ep=1, n/st=10, player_1/loss=0.950, rew=-5.00]       


Epoch #22: test_reward: 4.000000 ± 3.000000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #23: 10001it [00:33, 297.57it/s, env_step=230000, len=28, n/ep=1, n/st=10, player_1/loss=0.882, rew=25.00]       


Epoch #23: test_reward: 5.200000 ± 6.600000, best_reward: 92.200000 ± 32.400000 in #15
testing for stop: 9.22 >= 10 -> False


Epoch #24: 10001it [00:33, 299.03it/s, env_step=240000, len=35, n/ep=2, n/st=10, player_1/loss=0.986, rew=52.00]       


Epoch #24: test_reward: 97.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #25: 10001it [00:33, 300.12it/s, env_step=250000, len=18, n/ep=0, n/st=10, player_1/loss=0.910, rew=3.00]        


Epoch #25: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #26: 10001it [00:33, 298.01it/s, env_step=260000, len=18, n/ep=0, n/st=10, player_1/loss=0.913, rew=3.00]        


Epoch #26: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #27: 10001it [00:33, 295.32it/s, env_step=270000, len=18, n/ep=0, n/st=10, player_1/loss=0.944, rew=3.00]        


Epoch #27: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #28: 10001it [00:34, 293.57it/s, env_step=280000, len=12, n/ep=1, n/st=10, player_1/loss=0.951, rew=3.00]        


Epoch #28: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #29: 10001it [00:33, 301.87it/s, env_step=290000, len=19, n/ep=1, n/st=10, player_1/loss=0.837, rew=25.00]       


Epoch #29: test_reward: 63.600000 ± 18.374983, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #30: 10001it [00:33, 294.83it/s, env_step=300000, len=18, n/ep=1, n/st=10, player_1/loss=0.895, rew=3.00]        


Epoch #30: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #31: 10001it [00:33, 295.83it/s, env_step=310000, len=22, n/ep=1, n/st=10, player_1/loss=0.974, rew=17.00]       


Epoch #31: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #32: 10001it [00:33, 298.64it/s, env_step=320000, len=34, n/ep=0, n/st=10, player_1/loss=0.893, rew=47.00]       


Epoch #32: test_reward: 14.000000 ± 33.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #33: 10001it [00:33, 297.02it/s, env_step=330000, len=17, n/ep=0, n/st=10, player_1/loss=0.886, rew=5.00]        


Epoch #33: test_reward: 11.500000 ± 25.500000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #34: 10001it [00:33, 298.30it/s, env_step=340000, len=18, n/ep=0, n/st=10, player_1/loss=0.830, rew=11.00]       


Epoch #34: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #35: 10001it [00:33, 298.88it/s, env_step=350000, len=18, n/ep=1, n/st=10, player_1/loss=0.804, rew=3.00]        


Epoch #35: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #36: 10001it [00:33, 295.59it/s, env_step=360000, len=18, n/ep=1, n/st=10, player_1/loss=0.870, rew=3.00]        


Epoch #36: test_reward: 7.800000 ± 14.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #37: 10001it [00:33, 295.39it/s, env_step=370000, len=18, n/ep=1, n/st=10, player_1/loss=0.870, rew=3.00]        


Epoch #37: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #38: 10001it [00:33, 296.29it/s, env_step=380000, len=18, n/ep=0, n/st=10, player_1/loss=0.870, rew=3.00]        


Epoch #38: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #39: 10001it [00:33, 300.14it/s, env_step=390000, len=18, n/ep=1, n/st=10, player_1/loss=1.019, rew=3.00]        


Epoch #39: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #40: 10001it [00:34, 291.34it/s, env_step=400000, len=19, n/ep=1, n/st=10, player_1/loss=0.925, rew=5.00]        


Epoch #40: test_reward: 2.200000 ± 2.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #41: 10001it [00:33, 294.66it/s, env_step=410000, len=29, n/ep=0, n/st=10, player_1/loss=0.824, rew=43.00]       


Epoch #41: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #42: 10001it [00:33, 298.53it/s, env_step=420000, len=16, n/ep=0, n/st=10, player_1/loss=0.818, rew=13.00]       


Epoch #42: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #43: 10001it [00:32, 303.19it/s, env_step=430000, len=20, n/ep=0, n/st=10, player_1/loss=0.682, rew=13.00]       


Epoch #43: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #44: 10001it [00:33, 299.35it/s, env_step=440000, len=32, n/ep=1, n/st=10, player_1/loss=0.780, rew=43.00]       


Epoch #44: test_reward: 29.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #45: 10001it [00:33, 297.35it/s, env_step=450000, len=28, n/ep=0, n/st=10, player_1/loss=0.805, rew=29.00]       


Epoch #45: test_reward: 34.400000 ± 16.200000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #46: 10001it [00:33, 300.73it/s, env_step=460000, len=18, n/ep=0, n/st=10, player_1/loss=0.767, rew=3.00]        


Epoch #46: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #47: 10001it [00:32, 304.86it/s, env_step=470000, len=28, n/ep=0, n/st=10, player_1/loss=0.672, rew=27.00]       


Epoch #47: test_reward: 68.800000 ± 29.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #48: 10001it [00:33, 302.43it/s, env_step=480000, len=18, n/ep=0, n/st=10, player_1/loss=0.780, rew=3.00]        


Epoch #48: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #49: 10001it [00:32, 309.25it/s, env_step=490000, len=26, n/ep=0, n/st=10, player_1/loss=0.603, rew=19.00]       


Epoch #49: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #50: 10001it [00:32, 305.53it/s, env_step=500000, len=40, n/ep=0, n/st=10, player_1/loss=0.714, rew=83.00]       


Epoch #50: test_reward: 45.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #51: 10001it [00:32, 308.88it/s, env_step=510000, len=40, n/ep=0, n/st=10, player_1/loss=0.622, rew=83.00]       


Epoch #51: test_reward: 68.700000 ± 28.642800, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #52: 10001it [00:32, 304.04it/s, env_step=520000, len=23, n/ep=2, n/st=10, player_1/loss=0.787, rew=24.00]       


Epoch #52: test_reward: 70.200000 ± 25.771302, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #53: 10001it [00:33, 303.04it/s, env_step=530000, len=24, n/ep=2, n/st=10, player_1/loss=0.786, rew=24.50]       


Epoch #53: test_reward: 5.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #54: 10001it [00:33, 302.64it/s, env_step=540000, len=16, n/ep=0, n/st=10, player_1/loss=0.818, rew=4.00]        


Epoch #54: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #55: 10001it [00:32, 305.18it/s, env_step=550000, len=34, n/ep=0, n/st=10, player_1/loss=0.673, rew=53.00]       


Epoch #55: test_reward: 83.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #56: 10001it [00:32, 307.37it/s, env_step=560000, len=32, n/ep=0, n/st=10, player_1/loss=0.645, rew=43.00]       


Epoch #56: test_reward: 77.700000 ± 15.900000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #57: 10001it [00:32, 307.58it/s, env_step=570000, len=40, n/ep=0, n/st=10, player_1/loss=0.621, rew=83.00]       


Epoch #57: test_reward: 77.800000 ± 15.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #58: 10001it [00:32, 307.65it/s, env_step=580000, len=31, n/ep=2, n/st=10, player_1/loss=0.626, rew=53.00]       


Epoch #58: test_reward: 4.400000 ± 3.583295, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #59: 10001it [00:33, 302.96it/s, env_step=590000, len=18, n/ep=0, n/st=10, player_1/loss=0.714, rew=3.00]        


Epoch #59: test_reward: 3.500000 ± 1.500000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #60: 10001it [00:33, 295.68it/s, env_step=600000, len=15, n/ep=0, n/st=10, player_1/loss=0.854, rew=15.00]       


Epoch #60: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #61: 10001it [00:33, 294.72it/s, env_step=610000, len=18, n/ep=0, n/st=10, player_1/loss=0.886, rew=3.00]        


Epoch #61: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #62: 10001it [00:33, 297.17it/s, env_step=620000, len=18, n/ep=1, n/st=10, player_1/loss=0.917, rew=3.00]        


Epoch #62: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #63: 10001it [00:30, 328.60it/s, env_step=630000, len=26, n/ep=2, n/st=10, player_1/loss=0.828, rew=26.00]       


Epoch #63: test_reward: 74.800000 ± 24.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #64: 10001it [00:30, 332.49it/s, env_step=640000, len=40, n/ep=1, n/st=10, player_1/loss=0.831, rew=83.00]       


Epoch #64: test_reward: 83.000000 ± 6.260990, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #65: 10001it [00:29, 344.31it/s, env_step=650000, len=29, n/ep=0, n/st=10, player_1/loss=0.581, rew=46.33]       


Epoch #65: test_reward: 83.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #66: 10001it [00:31, 315.89it/s, env_step=660000, len=36, n/ep=0, n/st=10, player_1/loss=0.599, rew=61.00]       


Epoch #66: test_reward: 84.400000 ± 4.200000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #67: 10001it [00:33, 298.72it/s, env_step=670000, len=16, n/ep=0, n/st=10, player_1/loss=0.694, rew=13.00]       


Epoch #67: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #68: 10001it [00:33, 298.34it/s, env_step=680000, len=20, n/ep=0, n/st=10, player_1/loss=0.769, rew=13.00]       


Epoch #68: test_reward: 31.800000 ± 9.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #69: 10001it [00:33, 298.17it/s, env_step=690000, len=18, n/ep=0, n/st=10, player_1/loss=0.752, rew=3.00]        


Epoch #69: test_reward: 33.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #70: 10001it [00:33, 296.87it/s, env_step=700000, len=20, n/ep=0, n/st=10, player_1/loss=0.787, rew=11.00]       


Epoch #70: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #71: 10001it [00:34, 294.06it/s, env_step=710000, len=18, n/ep=0, n/st=10, player_1/loss=0.888, rew=3.00]        


Epoch #71: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #72: 10001it [00:33, 297.66it/s, env_step=720000, len=29, n/ep=0, n/st=10, player_1/loss=0.760, rew=35.00]       


Epoch #72: test_reward: 31.800000 ± 9.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #73: 10001it [00:33, 298.96it/s, env_step=730000, len=27, n/ep=0, n/st=10, player_1/loss=0.857, rew=31.00]       


Epoch #73: test_reward: 29.200000 ± 5.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #74: 10001it [00:33, 299.47it/s, env_step=740000, len=27, n/ep=0, n/st=10, player_1/loss=0.856, rew=32.33]       


Epoch #74: test_reward: 10.200000 ± 21.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #75: 10001it [00:33, 295.73it/s, env_step=750000, len=18, n/ep=0, n/st=10, player_1/loss=0.981, rew=3.00]        


Epoch #75: test_reward: 3.200000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #76: 10001it [00:34, 293.29it/s, env_step=760000, len=36, n/ep=1, n/st=10, player_1/loss=0.973, rew=69.00]       


Epoch #76: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #77: 10001it [00:34, 292.68it/s, env_step=770000, len=18, n/ep=0, n/st=10, player_1/loss=0.975, rew=3.00]        


Epoch #77: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #78: 10001it [00:33, 294.74it/s, env_step=780000, len=18, n/ep=0, n/st=10, player_1/loss=0.915, rew=3.00]        


Epoch #78: test_reward: 31.800000 ± 9.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #79: 10001it [00:33, 295.91it/s, env_step=790000, len=27, n/ep=1, n/st=10, player_1/loss=0.812, rew=31.00]       


Epoch #79: test_reward: 10.600000 ± 20.274121, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #80: 10001it [00:33, 295.02it/s, env_step=800000, len=18, n/ep=0, n/st=10, player_1/loss=0.815, rew=3.00]        


Epoch #80: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #81: 10001it [00:33, 298.21it/s, env_step=810000, len=29, n/ep=0, n/st=10, player_1/loss=0.772, rew=35.00]       


Epoch #81: test_reward: 34.200000 ± 2.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #82: 10001it [00:33, 297.12it/s, env_step=820000, len=29, n/ep=1, n/st=10, player_1/loss=0.812, rew=35.00]       


Epoch #82: test_reward: 3.200000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #83: 10001it [00:33, 298.07it/s, env_step=830000, len=16, n/ep=1, n/st=10, player_1/loss=0.849, rew=13.00]       


Epoch #83: test_reward: 31.000000 ± 6.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #84: 10001it [00:33, 297.04it/s, env_step=840000, len=30, n/ep=0, n/st=10, player_1/loss=0.816, rew=17.00]       


Epoch #84: test_reward: 9.200000 ± 5.173007, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #85: 10001it [00:33, 298.47it/s, env_step=850000, len=30, n/ep=1, n/st=10, player_1/loss=0.932, rew=15.00]       


Epoch #85: test_reward: 9.700000 ± 0.900000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #86: 10001it [00:33, 297.41it/s, env_step=860000, len=17, n/ep=0, n/st=10, player_1/loss=0.805, rew=10.00]       


Epoch #86: test_reward: 10.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #87: 10001it [00:33, 299.93it/s, env_step=870000, len=17, n/ep=1, n/st=10, player_1/loss=0.744, rew=10.00]       


Epoch #87: test_reward: 31.200000 ± 11.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #88: 10001it [00:33, 300.77it/s, env_step=880000, len=27, n/ep=1, n/st=10, player_1/loss=0.704, rew=31.00]       


Epoch #88: test_reward: 35.300000 ± 0.900000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #89: 10001it [00:33, 298.84it/s, env_step=890000, len=17, n/ep=0, n/st=10, player_1/loss=0.816, rew=10.00]       


Epoch #89: test_reward: 10.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #90: 10001it [00:33, 299.73it/s, env_step=900000, len=30, n/ep=0, n/st=10, player_1/loss=0.782, rew=33.00]       


Epoch #90: test_reward: 32.000000 ± 3.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #91: 10001it [00:33, 303.00it/s, env_step=910000, len=36, n/ep=0, n/st=10, player_1/loss=0.835, rew=35.00]       


Epoch #91: test_reward: 79.200000 ± 11.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #92: 10001it [00:33, 299.96it/s, env_step=920000, len=34, n/ep=0, n/st=10, player_1/loss=0.944, rew=23.00]       


Epoch #92: test_reward: 60.800000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #93: 10001it [00:33, 297.97it/s, env_step=930000, len=19, n/ep=2, n/st=10, player_1/loss=0.985, rew=3.00]        


Epoch #93: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #94: 10001it [00:34, 292.81it/s, env_step=940000, len=16, n/ep=0, n/st=10, player_1/loss=0.928, rew=13.00]       


Epoch #94: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #95: 10001it [00:34, 293.31it/s, env_step=950000, len=16, n/ep=1, n/st=10, player_1/loss=0.866, rew=13.00]       


Epoch #95: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #96: 10001it [00:34, 293.36it/s, env_step=960000, len=24, n/ep=0, n/st=10, player_1/loss=0.804, rew=38.00]       


Epoch #96: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #97: 10001it [00:33, 299.65it/s, env_step=970000, len=16, n/ep=1, n/st=10, player_1/loss=0.900, rew=13.00]       


Epoch #97: test_reward: 33.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #98: 10001it [00:32, 307.10it/s, env_step=980000, len=28, n/ep=1, n/st=10, player_1/loss=0.911, rew=19.00]       


Epoch #98: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #99: 10001it [00:32, 308.92it/s, env_step=990000, len=37, n/ep=0, n/st=10, player_1/loss=0.770, rew=60.00]       


Epoch #99: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #100: 10001it [00:33, 298.33it/s, env_step=1000000, len=36, n/ep=2, n/st=10, player_1/loss=0.772, rew=64.00]     


Epoch #100: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #101: 10001it [00:33, 296.94it/s, env_step=1010000, len=18, n/ep=0, n/st=10, player_1/loss=0.694, rew=17.00]     


Epoch #101: test_reward: 74.200000 ± 20.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #102: 10001it [00:33, 297.70it/s, env_step=1020000, len=25, n/ep=2, n/st=10, player_1/loss=0.691, rew=49.00]     


Epoch #102: test_reward: 73.800000 ± 21.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #103: 10001it [00:33, 296.50it/s, env_step=1030000, len=18, n/ep=0, n/st=10, player_1/loss=0.821, rew=3.00]      


Epoch #103: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #104: 10001it [00:33, 296.67it/s, env_step=1040000, len=21, n/ep=0, n/st=10, player_1/loss=0.870, rew=16.00]     


Epoch #104: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #105: 10001it [00:33, 296.46it/s, env_step=1050000, len=19, n/ep=0, n/st=10, player_1/loss=0.829, rew=10.00]     


Epoch #105: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #106: 10001it [00:33, 297.30it/s, env_step=1060000, len=18, n/ep=0, n/st=10, player_1/loss=0.868, rew=3.00]      


Epoch #106: test_reward: 2.200000 ± 2.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #107: 10001it [00:33, 297.07it/s, env_step=1070000, len=14, n/ep=0, n/st=10, player_1/loss=0.842, rew=-5.00]     


Epoch #107: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #108: 10001it [00:33, 297.58it/s, env_step=1080000, len=18, n/ep=1, n/st=10, player_1/loss=0.876, rew=3.00]      


Epoch #108: test_reward: 8.200000 ± 15.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #109: 10001it [00:33, 297.00it/s, env_step=1090000, len=34, n/ep=1, n/st=10, player_1/loss=0.850, rew=37.00]     


Epoch #109: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #110: 10001it [00:33, 298.65it/s, env_step=1100000, len=21, n/ep=0, n/st=10, player_1/loss=0.849, rew=35.00]     


Epoch #110: test_reward: 83.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #111: 10001it [00:32, 304.22it/s, env_step=1110000, len=29, n/ep=0, n/st=10, player_1/loss=0.728, rew=35.00]     


Epoch #111: test_reward: 30.200000 ± 2.400000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #112: 10001it [00:32, 303.35it/s, env_step=1120000, len=25, n/ep=0, n/st=10, player_1/loss=0.767, rew=32.33]     


Epoch #112: test_reward: 33.800000 ± 9.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #113: 10001it [00:33, 300.53it/s, env_step=1130000, len=18, n/ep=0, n/st=10, player_1/loss=0.825, rew=3.00]      


Epoch #113: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #114: 10001it [00:33, 303.00it/s, env_step=1140000, len=29, n/ep=1, n/st=10, player_1/loss=0.727, rew=35.00]     


Epoch #114: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #115: 10001it [00:32, 303.99it/s, env_step=1150000, len=14, n/ep=0, n/st=10, player_1/loss=0.702, rew=-5.00]     


Epoch #115: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #116: 10001it [00:32, 303.92it/s, env_step=1160000, len=18, n/ep=1, n/st=10, player_1/loss=0.754, rew=1.00]      


Epoch #116: test_reward: 34.400000 ± 1.800000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #117: 10001it [00:32, 304.19it/s, env_step=1170000, len=23, n/ep=0, n/st=10, player_1/loss=0.692, rew=18.00]     


Epoch #117: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #118: 10001it [00:32, 303.46it/s, env_step=1180000, len=29, n/ep=0, n/st=10, player_1/loss=0.711, rew=35.00]     


Epoch #118: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #119: 10001it [00:32, 303.42it/s, env_step=1190000, len=20, n/ep=0, n/st=10, player_1/loss=0.706, rew=15.00]     


Epoch #119: test_reward: 27.800000 ± 14.427751, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #120: 10001it [00:32, 303.51it/s, env_step=1200000, len=29, n/ep=0, n/st=10, player_1/loss=0.713, rew=35.00]     


Epoch #120: test_reward: 32.000000 ± 9.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #121: 10001it [00:32, 303.90it/s, env_step=1210000, len=29, n/ep=0, n/st=10, player_1/loss=0.720, rew=35.00]     


Epoch #121: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #122: 10001it [00:32, 303.83it/s, env_step=1220000, len=23, n/ep=0, n/st=10, player_1/loss=0.713, rew=19.67]     


Epoch #122: test_reward: 33.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #123: 10001it [00:33, 302.42it/s, env_step=1230000, len=14, n/ep=1, n/st=10, player_1/loss=0.708, rew=-5.00]     


Epoch #123: test_reward: 37.000000 ± 11.349009, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #124: 10001it [00:33, 302.56it/s, env_step=1240000, len=29, n/ep=1, n/st=10, player_1/loss=0.718, rew=35.00]     


Epoch #124: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #125: 10001it [00:32, 303.20it/s, env_step=1250000, len=29, n/ep=0, n/st=10, player_1/loss=0.707, rew=35.00]     


Epoch #125: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #126: 10001it [00:32, 303.67it/s, env_step=1260000, len=18, n/ep=0, n/st=10, player_1/loss=0.778, rew=-5.00]     


Epoch #126: test_reward: 33.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #127: 10001it [00:32, 303.86it/s, env_step=1270000, len=29, n/ep=1, n/st=10, player_1/loss=0.742, rew=35.00]     


Epoch #127: test_reward: 30.700000 ± 6.900000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #128: 10001it [00:32, 303.65it/s, env_step=1280000, len=29, n/ep=1, n/st=10, player_1/loss=0.771, rew=35.00]     


Epoch #128: test_reward: 31.000000 ± 10.545141, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #129: 10001it [00:33, 302.90it/s, env_step=1290000, len=15, n/ep=2, n/st=10, player_1/loss=0.796, rew=4.00]      


Epoch #129: test_reward: 31.800000 ± 9.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #130: 10001it [00:33, 302.38it/s, env_step=1300000, len=29, n/ep=0, n/st=10, player_1/loss=0.775, rew=35.00]     


Epoch #130: test_reward: 32.800000 ± 6.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #131: 10001it [00:33, 300.04it/s, env_step=1310000, len=17, n/ep=0, n/st=10, player_1/loss=0.870, rew=1.50]      


Epoch #131: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #132: 10001it [00:33, 298.39it/s, env_step=1320000, len=16, n/ep=0, n/st=10, player_1/loss=0.859, rew=13.00]     


Epoch #132: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #133: 10001it [00:33, 299.56it/s, env_step=1330000, len=16, n/ep=0, n/st=10, player_1/loss=0.910, rew=13.00]     


Epoch #133: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #134: 10001it [00:33, 298.16it/s, env_step=1340000, len=18, n/ep=1, n/st=10, player_1/loss=0.875, rew=3.00]      


Epoch #134: test_reward: 2.800000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #135: 10001it [00:33, 297.03it/s, env_step=1350000, len=18, n/ep=1, n/st=10, player_1/loss=0.862, rew=3.00]      


Epoch #135: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #136: 10001it [00:33, 296.85it/s, env_step=1360000, len=18, n/ep=0, n/st=10, player_1/loss=0.839, rew=3.00]      


Epoch #136: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #137: 10001it [00:33, 297.18it/s, env_step=1370000, len=18, n/ep=0, n/st=10, player_1/loss=0.833, rew=9.00]      


Epoch #137: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #138: 10001it [00:33, 298.30it/s, env_step=1380000, len=18, n/ep=0, n/st=10, player_1/loss=0.809, rew=3.00]      


Epoch #138: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #139: 10001it [00:33, 298.76it/s, env_step=1390000, len=18, n/ep=0, n/st=10, player_1/loss=0.831, rew=3.00]      


Epoch #139: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #140: 10001it [00:33, 297.54it/s, env_step=1400000, len=29, n/ep=0, n/st=10, player_1/loss=0.823, rew=33.00]     


Epoch #140: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #141: 10001it [00:33, 296.83it/s, env_step=1410000, len=18, n/ep=0, n/st=10, player_1/loss=0.829, rew=3.00]      


Epoch #141: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #142: 10001it [00:33, 298.35it/s, env_step=1420000, len=29, n/ep=0, n/st=10, player_1/loss=0.870, rew=35.00]     


Epoch #142: test_reward: 34.800000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #143: 10001it [00:32, 304.05it/s, env_step=1430000, len=26, n/ep=1, n/st=10, player_1/loss=0.731, rew=53.00]     


Epoch #143: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #144: 10001it [00:32, 304.10it/s, env_step=1440000, len=29, n/ep=0, n/st=10, player_1/loss=0.742, rew=35.00]     


Epoch #144: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #145: 10001it [00:32, 303.43it/s, env_step=1450000, len=10, n/ep=0, n/st=10, player_1/loss=0.721, rew=-5.00]     


Epoch #145: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #146: 10001it [00:32, 303.86it/s, env_step=1460000, len=29, n/ep=0, n/st=10, player_1/loss=0.794, rew=35.00]     


Epoch #146: test_reward: 33.400000 ± 4.800000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #147: 10001it [00:33, 303.00it/s, env_step=1470000, len=24, n/ep=2, n/st=10, player_1/loss=0.736, rew=17.00]     


Epoch #147: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #148: 10001it [00:32, 303.37it/s, env_step=1480000, len=29, n/ep=0, n/st=10, player_1/loss=0.702, rew=35.00]     


Epoch #148: test_reward: 31.600000 ± 10.200000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #149: 10001it [00:32, 304.03it/s, env_step=1490000, len=13, n/ep=0, n/st=10, player_1/loss=0.669, rew=5.00]      


Epoch #149: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #150: 10001it [00:32, 303.77it/s, env_step=1500000, len=29, n/ep=0, n/st=10, player_1/loss=0.669, rew=35.00]     


Epoch #150: test_reward: 32.900000 ± 6.300000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #151: 10001it [00:33, 299.24it/s, env_step=1510000, len=29, n/ep=1, n/st=10, player_1/loss=0.821, rew=35.00]     


Epoch #151: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #152: 10001it [00:33, 297.70it/s, env_step=1520000, len=18, n/ep=0, n/st=10, player_1/loss=0.869, rew=3.00]      


Epoch #152: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #153: 10001it [00:33, 297.01it/s, env_step=1530000, len=18, n/ep=0, n/st=10, player_1/loss=0.896, rew=2.00]      


Epoch #153: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #154: 10001it [00:33, 297.16it/s, env_step=1540000, len=18, n/ep=0, n/st=10, player_1/loss=0.836, rew=3.00]      


Epoch #154: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #155: 10001it [00:33, 299.98it/s, env_step=1550000, len=20, n/ep=0, n/st=10, player_1/loss=0.771, rew=15.00]     


Epoch #155: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #156: 10001it [00:33, 298.76it/s, env_step=1560000, len=20, n/ep=2, n/st=10, player_1/loss=0.793, rew=13.00]     


Epoch #156: test_reward: 13.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #157: 10001it [00:33, 298.29it/s, env_step=1570000, len=29, n/ep=1, n/st=10, player_1/loss=0.780, rew=35.00]     


Epoch #157: test_reward: 39.700000 ± 14.100000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #158: 10001it [00:33, 300.04it/s, env_step=1580000, len=18, n/ep=0, n/st=10, player_1/loss=0.737, rew=3.00]      


Epoch #158: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #159: 10001it [00:33, 298.93it/s, env_step=1590000, len=14, n/ep=1, n/st=10, player_1/loss=0.771, rew=-5.00]     


Epoch #159: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #160: 10001it [00:34, 293.67it/s, env_step=1600000, len=18, n/ep=1, n/st=10, player_1/loss=0.766, rew=3.00]      


Epoch #160: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #161: 10001it [00:34, 293.65it/s, env_step=1610000, len=19, n/ep=1, n/st=10, player_1/loss=0.805, rew=5.00]      


Epoch #161: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #162: 10001it [00:34, 293.27it/s, env_step=1620000, len=20, n/ep=1, n/st=10, player_1/loss=0.755, rew=11.00]     


Epoch #162: test_reward: 3.300000 ± 4.290688, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #163: 10001it [00:34, 293.73it/s, env_step=1630000, len=18, n/ep=0, n/st=10, player_1/loss=0.832, rew=3.00]      


Epoch #163: test_reward: 2.800000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #164: 10001it [00:34, 293.67it/s, env_step=1640000, len=18, n/ep=0, n/st=10, player_1/loss=0.858, rew=3.00]      


Epoch #164: test_reward: 3.200000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #165: 10001it [00:33, 294.27it/s, env_step=1650000, len=18, n/ep=1, n/st=10, player_1/loss=0.923, rew=3.00]      


Epoch #165: test_reward: 8.600000 ± 16.800000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #166: 10001it [00:34, 293.87it/s, env_step=1660000, len=18, n/ep=1, n/st=10, player_1/loss=0.891, rew=3.00]      


Epoch #166: test_reward: 3.200000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #167: 10001it [00:34, 293.79it/s, env_step=1670000, len=18, n/ep=1, n/st=10, player_1/loss=0.814, rew=3.00]      


Epoch #167: test_reward: 3.200000 ± 0.600000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #168: 10001it [00:34, 293.61it/s, env_step=1680000, len=34, n/ep=1, n/st=10, player_1/loss=0.829, rew=45.00]     


Epoch #168: test_reward: 16.000000 ± 39.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #169: 10001it [00:33, 299.55it/s, env_step=1690000, len=27, n/ep=0, n/st=10, player_1/loss=0.792, rew=31.00]     


Epoch #169: test_reward: 35.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #170: 10001it [00:33, 295.61it/s, env_step=1700000, len=30, n/ep=1, n/st=10, player_1/loss=0.785, rew=33.00]     


Epoch #170: test_reward: 33.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #171: 10001it [00:33, 297.94it/s, env_step=1710000, len=18, n/ep=0, n/st=10, player_1/loss=0.754, rew=3.00]      


Epoch #171: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #172: 10001it [00:33, 294.58it/s, env_step=1720000, len=19, n/ep=1, n/st=10, player_1/loss=0.843, rew=5.00]      


Epoch #172: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #173: 10001it [00:34, 293.91it/s, env_step=1730000, len=18, n/ep=1, n/st=10, player_1/loss=0.848, rew=3.00]      


Epoch #173: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #174: 10001it [00:33, 294.36it/s, env_step=1740000, len=19, n/ep=0, n/st=10, player_1/loss=0.882, rew=5.00]      


Epoch #174: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #175: 10001it [00:34, 293.26it/s, env_step=1750000, len=18, n/ep=0, n/st=10, player_1/loss=0.861, rew=3.00]      


Epoch #175: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #176: 10001it [00:33, 294.20it/s, env_step=1760000, len=34, n/ep=0, n/st=10, player_1/loss=0.863, rew=53.00]     


Epoch #176: test_reward: 4.000000 ± 3.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #177: 10001it [00:33, 294.33it/s, env_step=1770000, len=18, n/ep=1, n/st=10, player_1/loss=0.851, rew=3.00]      


Epoch #177: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #178: 10001it [00:34, 294.00it/s, env_step=1780000, len=20, n/ep=1, n/st=10, player_1/loss=0.826, rew=15.00]     


Epoch #178: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #179: 10001it [00:34, 293.73it/s, env_step=1790000, len=18, n/ep=0, n/st=10, player_1/loss=0.818, rew=1.00]      


Epoch #179: test_reward: 3.000000 ± 0.000000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #180: 10001it [00:33, 294.21it/s, env_step=1800000, len=18, n/ep=2, n/st=10, player_1/loss=0.828, rew=3.00]      


Epoch #180: test_reward: 7.600000 ± 15.900943, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #181: 10001it [00:33, 298.88it/s, env_step=1810000, len=42, n/ep=0, n/st=10, player_1/loss=0.864, rew=89.00]     


Epoch #181: test_reward: 83.100000 ± 23.700000, best_reward: 97.000000 ± 0.000000 in #24
testing for stop: 9.7 >= 10 -> False


Epoch #182: 10001it [00:32, 308.45it/s, env_step=1820000, len=19, n/ep=0, n/st=10, player_1/loss=0.975, rew=5.00]      


Epoch #182: test_reward: 108.200000 ± 2.400000, best_reward: 108.200000 ± 2.400000 in #182
testing for stop: 10.82 >= 10 -> True

Started training agent player 1 against minimax with depth 2


Epoch #1: 10001it [00:58, 171.67it/s, env_step=10000, len=28, n/ep=1, n/st=10, player_1/loss=1.450, rew=-5.00]         


Epoch #1: test_reward: 35.000000 ± 0.000000, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #2: 10001it [00:57, 173.60it/s, env_step=20000, len=10, n/ep=1, n/st=10, player_1/loss=1.494, rew=-5.00]         


Epoch #2: test_reward: 17.000000 ± 0.000000, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #3: 10001it [00:57, 174.07it/s, env_step=30000, len=20, n/ep=0, n/st=10, player_1/loss=1.453, rew=15.00]         


Epoch #3: test_reward: 5.000000 ± 0.000000, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #4: 10001it [00:57, 174.93it/s, env_step=40000, len=22, n/ep=0, n/st=10, player_1/loss=1.361, rew=-5.00]         


Epoch #4: test_reward: 35.000000 ± 0.000000, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #5: 10001it [00:57, 174.80it/s, env_step=50000, len=24, n/ep=0, n/st=10, player_1/loss=1.393, rew=23.00]         


Epoch #5: test_reward: 33.000000 ± 0.000000, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #6: 10001it [00:53, 185.91it/s, env_step=60000, len=34, n/ep=0, n/st=10, player_1/loss=1.384, rew=13.00]         


Epoch #6: test_reward: 24.200000 ± 5.878775, best_reward: 49.200000 ± 15.759442 in #0
testing for stop: 4.92 >= 10 -> False


Epoch #7: 10001it [00:52, 189.47it/s, env_step=70000, len=26, n/ep=0, n/st=10, player_1/loss=1.134, rew=13.00]         


Epoch #7: test_reward: 89.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #8: 10001it [00:51, 193.11it/s, env_step=80000, len=42, n/ep=1, n/st=10, player_1/loss=1.023, rew=85.00]         


Epoch #8: test_reward: 79.600000 ± 28.200000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #9: 10001it [00:52, 191.42it/s, env_step=90000, len=30, n/ep=0, n/st=10, player_1/loss=0.930, rew=23.00]         


Epoch #9: test_reward: 82.200000 ± 20.400000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #10: 10001it [00:52, 189.54it/s, env_step=100000, len=28, n/ep=0, n/st=10, player_1/loss=0.960, rew=17.00]       


Epoch #10: test_reward: 49.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #11: 10001it [00:52, 189.67it/s, env_step=110000, len=36, n/ep=0, n/st=10, player_1/loss=0.928, rew=29.00]       


Epoch #11: test_reward: 69.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #12: 10001it [00:53, 186.10it/s, env_step=120000, len=42, n/ep=1, n/st=10, player_1/loss=1.209, rew=81.00]       


Epoch #12: test_reward: 33.800000 ± 3.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #13: 10001it [00:53, 185.95it/s, env_step=130000, len=42, n/ep=0, n/st=10, player_1/loss=0.970, rew=81.00]       


Epoch #13: test_reward: 23.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #14: 10001it [00:54, 183.88it/s, env_step=140000, len=34, n/ep=0, n/st=10, player_1/loss=1.050, rew=23.00]       


Epoch #14: test_reward: 89.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #15: 10001it [00:51, 193.69it/s, env_step=150000, len=35, n/ep=2, n/st=10, player_1/loss=1.080, rew=27.00]       


Epoch #15: test_reward: 17.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #16: 10001it [00:51, 192.53it/s, env_step=160000, len=34, n/ep=0, n/st=10, player_1/loss=1.204, rew=23.00]       


Epoch #16: test_reward: 26.400000 ± 1.800000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #17: 10001it [00:54, 184.40it/s, env_step=170000, len=34, n/ep=0, n/st=10, player_1/loss=1.109, rew=23.00]       


Epoch #17: test_reward: 47.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #18: 10001it [00:53, 187.61it/s, env_step=180000, len=34, n/ep=0, n/st=10, player_1/loss=0.996, rew=35.67]       


Epoch #18: test_reward: 20.200000 ± 8.400000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #19: 10001it [00:54, 183.00it/s, env_step=190000, len=32, n/ep=0, n/st=10, player_1/loss=1.052, rew=35.00]       


Epoch #19: test_reward: 23.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #20: 10001it [00:53, 186.76it/s, env_step=200000, len=38, n/ep=2, n/st=10, player_1/loss=1.214, rew=69.00]       


Epoch #20: test_reward: 60.400000 ± 7.800000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #21: 10001it [00:53, 186.15it/s, env_step=210000, len=36, n/ep=2, n/st=10, player_1/loss=1.021, rew=63.00]       


Epoch #21: test_reward: 62.000000 ± 3.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #22: 10001it [00:54, 182.46it/s, env_step=220000, len=38, n/ep=1, n/st=10, player_1/loss=0.890, rew=65.00]       


Epoch #22: test_reward: 63.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #23: 10001it [00:53, 185.79it/s, env_step=230000, len=40, n/ep=1, n/st=10, player_1/loss=0.846, rew=49.00]       


Epoch #23: test_reward: 63.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #24: 10001it [00:52, 190.09it/s, env_step=240000, len=32, n/ep=0, n/st=10, player_1/loss=0.906, rew=21.00]       


Epoch #24: test_reward: 17.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #25: 10001it [00:54, 183.65it/s, env_step=250000, len=36, n/ep=0, n/st=10, player_1/loss=0.876, rew=63.00]       


Epoch #25: test_reward: 61.600000 ± 3.583295, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #26: 10001it [00:55, 181.08it/s, env_step=260000, len=24, n/ep=1, n/st=10, player_1/loss=0.794, rew=13.00]       


Epoch #26: test_reward: 35.800000 ± 3.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #27: 10001it [00:54, 182.35it/s, env_step=270000, len=23, n/ep=0, n/st=10, player_1/loss=0.984, rew=12.00]       


Epoch #27: test_reward: 23.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #28: 10001it [00:55, 181.17it/s, env_step=280000, len=8, n/ep=1, n/st=10, player_1/loss=0.974, rew=-5.00]        


Epoch #28: test_reward: 33.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #29: 10001it [00:54, 182.48it/s, env_step=290000, len=26, n/ep=0, n/st=10, player_1/loss=1.034, rew=23.00]       


Epoch #29: test_reward: 67.600000 ± 22.200000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #30: 10001it [00:54, 183.03it/s, env_step=300000, len=32, n/ep=0, n/st=10, player_1/loss=1.069, rew=35.00]       


Epoch #30: test_reward: 33.200000 ± 5.400000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #31: 10001it [00:53, 188.33it/s, env_step=310000, len=26, n/ep=0, n/st=10, player_1/loss=0.827, rew=22.00]       


Epoch #31: test_reward: 34.800000 ± 0.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #32: 10001it [00:52, 190.25it/s, env_step=320000, len=16, n/ep=0, n/st=10, player_1/loss=1.398, rew=-3.00]       


Epoch #32: test_reward: 1.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #33: 10001it [00:51, 192.78it/s, env_step=330000, len=16, n/ep=0, n/st=10, player_1/loss=1.405, rew=-5.00]       


Epoch #33: test_reward: 7.800000 ± 3.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #34: 10001it [00:54, 184.60it/s, env_step=340000, len=20, n/ep=1, n/st=10, player_1/loss=1.119, rew=17.00]       


Epoch #34: test_reward: 35.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #35: 10001it [00:54, 183.06it/s, env_step=350000, len=34, n/ep=0, n/st=10, player_1/loss=1.380, rew=31.00]       


Epoch #35: test_reward: 14.800000 ± 6.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #36: 10001it [00:56, 176.83it/s, env_step=360000, len=24, n/ep=1, n/st=10, player_1/loss=1.260, rew=25.00]       


Epoch #36: test_reward: 25.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #37: 10001it [00:54, 183.95it/s, env_step=370000, len=32, n/ep=0, n/st=10, player_1/loss=1.148, rew=35.00]       


Epoch #37: test_reward: 40.600000 ± 15.615377, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #38: 10001it [00:53, 188.19it/s, env_step=380000, len=32, n/ep=0, n/st=10, player_1/loss=1.071, rew=21.00]       


Epoch #38: test_reward: 22.200000 ± 3.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #39: 10001it [00:53, 186.22it/s, env_step=390000, len=42, n/ep=0, n/st=10, player_1/loss=0.986, rew=83.00]       


Epoch #39: test_reward: 83.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #40: 10001it [00:54, 183.82it/s, env_step=400000, len=30, n/ep=0, n/st=10, player_1/loss=1.099, rew=5.00]        


Epoch #40: test_reward: 3.200000 ± 0.600000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #41: 10001it [00:54, 184.66it/s, env_step=410000, len=16, n/ep=0, n/st=10, player_1/loss=1.265, rew=3.00]        


Epoch #41: test_reward: 3.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #42: 10001it [00:52, 188.94it/s, env_step=420000, len=22, n/ep=0, n/st=10, player_1/loss=1.139, rew=15.00]       


Epoch #42: test_reward: 35.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #43: 10001it [00:54, 184.12it/s, env_step=430000, len=42, n/ep=0, n/st=10, player_1/loss=0.842, rew=57.00]       


Epoch #43: test_reward: 39.800000 ± 14.400000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #44: 10001it [00:54, 184.82it/s, env_step=440000, len=16, n/ep=0, n/st=10, player_1/loss=0.935, rew=-5.00]       


Epoch #44: test_reward: 85.000000 ± 0.000000, best_reward: 89.000000 ± 0.000000 in #7
testing for stop: 8.9 >= 10 -> False


Epoch #45: 10001it [00:52, 192.14it/s, env_step=450000, len=24, n/ep=0, n/st=10, player_1/loss=0.956, rew=7.00]        


Epoch #45: test_reward: 107.000000 ± 0.000000, best_reward: 107.000000 ± 0.000000 in #45
testing for stop: 10.7 >= 10 -> True

Started training agent player 1 against minimax with depth 3


Epoch #1: 10001it [03:31, 47.33it/s, env_step=10000, len=24, n/ep=3, n/st=10, player_1/loss=1.533, rew=29.67]          


Epoch #1: test_reward: 19.000000 ± 0.000000, best_reward: 54.400000 ± 4.200000 in #0
testing for stop: 5.4399999999999995 >= 10 -> False


Epoch #2: 10001it [03:32, 47.13it/s, env_step=20000, len=30, n/ep=0, n/st=10, player_1/loss=1.488, rew=17.00]          


Epoch #2: test_reward: 17.000000 ± 0.000000, best_reward: 54.400000 ± 4.200000 in #0
testing for stop: 5.4399999999999995 >= 10 -> False


Epoch #3: 10001it [03:26, 48.47it/s, env_step=30000, len=12, n/ep=0, n/st=10, player_1/loss=1.451, rew=-5.00]          


Epoch #3: test_reward: 29.000000 ± 0.000000, best_reward: 54.400000 ± 4.200000 in #0
testing for stop: 5.4399999999999995 >= 10 -> False


Epoch #4: 10001it [03:28, 47.98it/s, env_step=40000, len=16, n/ep=3, n/st=10, player_1/loss=1.470, rew=5.67]           


Epoch #4: test_reward: 13.700000 ± 5.100000, best_reward: 54.400000 ± 4.200000 in #0
testing for stop: 5.4399999999999995 >= 10 -> False


Epoch #5: 10001it [03:26, 48.50it/s, env_step=50000, len=30, n/ep=1, n/st=10, player_1/loss=1.427, rew=7.00]           


Epoch #5: test_reward: 17.900000 ± 17.700000, best_reward: 54.400000 ± 4.200000 in #0
testing for stop: 5.4399999999999995 >= 10 -> False


Epoch #6: 10001it [03:08, 53.09it/s, env_step=60000, len=38, n/ep=0, n/st=10, player_1/loss=1.197, rew=77.00]          


Epoch #6: test_reward: 71.800000 ± 15.600000, best_reward: 71.800000 ± 15.600000 in #6
testing for stop: 7.18 >= 10 -> False


Epoch #7: 10001it [03:03, 54.59it/s, env_step=70000, len=12, n/ep=0, n/st=10, player_1/loss=1.029, rew=3.00]           


Epoch #7: test_reward: 71.000000 ± 0.000000, best_reward: 71.800000 ± 15.600000 in #6
testing for stop: 7.18 >= 10 -> False


Epoch #8: 10001it [02:54, 57.40it/s, env_step=80000, len=42, n/ep=0, n/st=10, player_1/loss=0.734, rew=105.00]         


Epoch #8: test_reward: 57.000000 ± 18.000000, best_reward: 71.800000 ± 15.600000 in #6
testing for stop: 7.18 >= 10 -> False


Epoch #9: 10001it [02:56, 56.60it/s, env_step=90000, len=34, n/ep=0, n/st=10, player_1/loss=0.757, rew=55.00]          


Epoch #9: test_reward: 79.200000 ± 17.400000, best_reward: 79.200000 ± 17.400000 in #9
testing for stop: 7.92 >= 10 -> False


Epoch #10: 10001it [03:02, 54.77it/s, env_step=100000, len=40, n/ep=0, n/st=10, player_1/loss=0.709, rew=99.00]        


Epoch #10: test_reward: 87.700000 ± 36.160890, best_reward: 87.700000 ± 36.160890 in #10
testing for stop: 8.77 >= 10 -> False


Epoch #11: 10001it [02:52, 57.98it/s, env_step=110000, len=42, n/ep=1, n/st=10, player_1/loss=0.644, rew=113.00]       


Epoch #11: test_reward: 113.000000 ± 0.000000, best_reward: 113.000000 ± 0.000000 in #11
testing for stop: 11.3 >= 10 -> True

Started training agent player 1 against minimax with depth 4


Epoch #1: 10001it [09:01, 18.46it/s, env_step=10000, len=14, n/ep=1, n/st=10, player_1/loss=1.405, rew=-5.00]          


Epoch #1: test_reward: 51.000000 ± 0.000000, best_reward: 67.200000 ± 9.897474 in #0
testing for stop: 6.720000000000001 >= 10 -> False


Epoch #2: 10001it [08:50, 18.84it/s, env_step=20000, len=28, n/ep=2, n/st=10, player_1/loss=1.335, rew=10.00]          


Epoch #2: test_reward: 39.800000 ± 2.400000, best_reward: 67.200000 ± 9.897474 in #0
testing for stop: 6.720000000000001 >= 10 -> False


Epoch #3: 10001it [08:55, 18.66it/s, env_step=30000, len=17, n/ep=0, n/st=10, player_1/loss=1.316, rew=3.00]           


Epoch #3: test_reward: 39.000000 ± 0.000000, best_reward: 67.200000 ± 9.897474 in #0
testing for stop: 6.720000000000001 >= 10 -> False


Epoch #4: 10001it [08:52, 18.79it/s, env_step=40000, len=18, n/ep=1, n/st=10, player_1/loss=1.380, rew=11.00]          


Epoch #4: test_reward: 45.400000 ± 13.290598, best_reward: 67.200000 ± 9.897474 in #0
testing for stop: 6.720000000000001 >= 10 -> False


Epoch #5: 10001it [08:51, 18.83it/s, env_step=50000, len=20, n/ep=2, n/st=10, player_1/loss=1.376, rew=2.00]           


Epoch #5: test_reward: 23.000000 ± 0.000000, best_reward: 67.200000 ± 9.897474 in #0
testing for stop: 6.720000000000001 >= 10 -> False


Epoch #6: 10001it [06:39, 25.04it/s, env_step=60000, len=38, n/ep=0, n/st=10, player_1/loss=0.971, rew=59.00]          


Epoch #6: test_reward: 104.800000 ± 36.600000, best_reward: 104.800000 ± 36.600000 in #6
testing for stop: 10.48 >= 10 -> True

Started training agent player 1 against minimax with depth 5


Epoch #1: 10001it [36:12,  4.60it/s, env_step=10000, len=30, n/ep=0, n/st=10, player_1/loss=1.419, rew=11.00]          


Epoch #1: test_reward: 7.000000 ± 0.000000, best_reward: 53.000000 ± 9.838699 in #0
testing for stop: 5.3 >= 10 -> False


Epoch #2: 10001it [35:23,  4.71it/s, env_step=20000, len=20, n/ep=0, n/st=10, player_1/loss=1.495, rew=13.67]          


Epoch #2: test_reward: 11.000000 ± 0.000000, best_reward: 53.000000 ± 9.838699 in #0
testing for stop: 5.3 >= 10 -> False


Epoch #3: 10001it [36:49,  4.53it/s, env_step=30000, len=17, n/ep=0, n/st=10, player_1/loss=1.439, rew=12.00]          


Epoch #3: test_reward: 11.000000 ± 0.000000, best_reward: 53.000000 ± 9.838699 in #0
testing for stop: 5.3 >= 10 -> False


Epoch #4: 10001it [36:22,  4.58it/s, env_step=40000, len=30, n/ep=0, n/st=10, player_1/loss=1.463, rew=17.00]          


Epoch #4: test_reward: 41.200000 ± 4.935585, best_reward: 53.000000 ± 9.838699 in #0
testing for stop: 5.3 >= 10 -> False


Epoch #5: 10001it [36:23,  4.58it/s, env_step=50000, len=21, n/ep=0, n/st=10, player_1/loss=1.428, rew=14.00]          


Epoch #5: test_reward: 9.000000 ± 0.000000, best_reward: 53.000000 ± 9.838699 in #0
testing for stop: 5.3 >= 10 -> False


Epoch #6: 10001it [27:59,  5.96it/s, env_step=60000, len=28, n/ep=0, n/st=10, player_1/loss=1.123, rew=17.00]          


Epoch #6: test_reward: 109.000000 ± 0.000000, best_reward: 109.000000 ± 0.000000 in #6
testing for stop: 10.9 >= 10 -> True


In [16]:
####################################################
# EXPERIMENT: VIEWING THE BEST LEARNED POLICY
####################################################

# settings
depth = 1
agent1_is_minimax = False

# Get the environment settings
env = get_env()
observation_space = env.observation_space['observation'] if isinstance(env.observation_space, gym.spaces.Dict) else env.observation_space
state_shape = observation_space.shape or observation_space.n
action_shape = env.action_space.shape or env.action_space.n



# Configure rainbow agent
rainbow_agent = rainbow_policy(state_shape= state_shape,
                               action_shape= action_shape)
rainbow_agent.load_state_dict(load_torch_dict("./saved_variables/paper_notebooks/11/1-250epoch_5loop/looping-iteration-0/best_policy_agent1.pth"))
rainbow_agent.set_eps(0)
      
# Configure minimax agent
minimax_agent = TianshouMiniMaxConnectFourPolicy(coin= 1 if agent1_is_minimax else 2,
                                                oponent_coin= 2 if agent1_is_minimax else 1,
                                                minimax_depth= depth)




# Watch the best agent at work
watch(numer_of_games= 3,
      render_speed= 0.3,
      agent_player1= minimax_agent if agent1_is_minimax else rainbow_agent,
      agent_player2= rainbow_agent if agent1_is_minimax else minimax_agent)



Average steps of game:  42.0
Final mean reward agent 1: 109.0, std: 0.0
Final mean reward agent 2: 3.0, std: 0.0


<hr><hr>

## Discussion

By playing against a considerably smart agent the rainbow algorithm starts to struggle since it fails to win.
The trained behaviour is mostly defensive.
Our minimax agent gets harder to beat as the iteration increases since the iteration number corresponds with the depth of the minimax agent.
Since we re-optimize the rainbow agent on this, we should have an incrementally better rainbow agent given enough training.
This would fulfil our goal of having a variable difficulty bot.
This approach is not ideal, as the minimax algorithm is fixed so the model can train to the exact agents behaviour.


In [18]:
####################################################
# CLEAN VARIABLES
####################################################

del action_shape
del agent1_is_minimax
del depth
del env
del epochs
del filename
del filename_prefix
del final_agent_player1
del final_agent_player2
del loop_idx
del loops
del minimax_agent
del observation_space
del off_policy_traininer_results
del rainbow_agent
del rainbow_starting_params
del state_shape
del stopping_threshold
del training_agent
