# PyRat Deep Q-Learning Processing

## Setup Environment

Required libraries for PyRat Q-Learning

In [1]:
import json
import numpy as np
import time
import random
import pickle
from tqdm import tqdm
from AIs import manh, numpy_rl_reload
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

### The game.py file describes the simulation environment, including the generation of reward and the observation that is fed to the agent.
import game

### The rl.py file describes the reinforcement learning procedure, including Q-learning, Experience replay, and a pytorch model to learn the Q-function.
### SGD is used to approximate the Q-function.
import rl

Libraries for training the Convolutional Neural Network

In [2]:
# Import libraries

import torch.nn.functional as F
import inspect

# Data splitting and loading utilities

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

`torchshape` is a little-known but handy library for building neural networks.

It helps you **calculate the shape of the tensor outputs** of network operations.

[Tensorshape Library Documentation](https://pypi.org/project/torchshape/0.0.8/#description)

In [3]:
import subprocess  # For installing missing libraries at run time
import sys  # Provides sys.executable for the pip call below

try:
    from torchshape import tensorshape
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'torchshape'])
    from torchshape import tensorshape

Define our **device** as the first visible CUDA device if we have CUDA available:

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device '+ str(device))

Using device cuda


## PyRat Game Specifications

Details of the game:

1️⃣ The **opponent** plays a deterministic strategy: a **greedy algorithm** that always targets the closest piece of cheese. The distance to each piece of cheese is the Manhattan distance (= L1 distance). The code is in AIs/manh.py.

2️⃣ The maze does **not** include **walls** (option -d 0)

3️⃣ The maze does **not** include **mud** (option -md 0)

4️⃣ The **dimension** of the maze is **21 x 15** (default parameter)

5️⃣ The number of **pieces of cheese** is **40** (option -p 40)

6️⃣ The **maze** is **not symmetric** (option --nonsymmetric)

You can therefore run a test simulation of 1000 games using the following command:

<pre>python pyrat.py -d 0 -md 0 -p 40 -x 21 -y 15 --rat AIs/manh.py --python AIs/YOURAIHERE --nonsymmetric --nodrawing --tests 1000 --synchronous</pre>

You can also run a visual game simulation with a command of the following form:

<pre>python pyrat.py -p 40 -x 21 -y 15 -d 0 -md 0 --rat AIs/manh.py --python AIs/YOURAIHERE --nonsymmetric</pre>
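
The opponent's greedy strategy described in the game details can be sketched in a few lines. This is only an illustration and not the code in AIs/manh.py: the function names, the `(x, y)` coordinate convention, and the move labels `"L"`, `"R"`, `"U"`, `"D"` are assumptions made for the example.

```python
def manhattan(a, b):
    """L1 distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_cheese(position, cheeses):
    """Return the piece of cheese nearest to position (ties: first in the list)."""
    return min(cheeses, key=lambda cheese: manhattan(position, cheese))


def greedy_move(position, cheeses):
    """Pick a one-step move that reduces the distance to the closest cheese."""
    target = closest_cheese(position, cheeses)
    dx, dy = target[0] - position[0], target[1] - position[1]
    if dx != 0:
        return "R" if dx > 0 else "L"  # hypothetical move labels
    if dy != 0:
        return "U" if dy > 0 else "D"
    return None  # already standing on the target


player = (0, 0)
pieces = [(5, 5), (2, 1), (0, 4)]
print(closest_cheese(player, pieces))  # (2, 1)
print(greedy_move(player, pieces))     # R
```

Because the maze has no walls and no mud in this configuration, the Manhattan distance equals the true number of moves needed to reach a piece of cheese.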

## Train a model to approximate the Q-function

Definitions:
- An iteration of training is called an **epoch**. It corresponds to one full PyRat game.
- An **experience** is a tuple < s, a, r, s' > describing the consequence of being in state s, taking action a, receiving reward r, and ending up in state s'.
- Look at the file rl.py to see how the **experience replay buffer** is implemented.
- A **batch** is a set of experiences used for one training step. Batches are drawn from the experience replay buffer, and several batches can be drawn per epoch.
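
To make the definitions above concrete, here is a minimal sketch of a replay buffer. It is a hypothetical stand-in written for illustration, not the class in rl.py: the name `TinyReplayBuffer` and its `sample` method are assumptions, and the real `rl.ExperienceReplay` may store and sample experiences differently.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical replay buffer used only to illustrate the idea."""

    def __init__(self, max_memory=1000):
        # A deque with maxlen silently drops the oldest entry once full
        self.memory = deque(maxlen=max_memory)

    def remember(self, experience, game_over):
        # experience = [state, action, reward, next_state]
        self.memory.append((experience, game_over))

    def sample(self, batch_size=32):
        # Draw without replacement; shrink the batch if the buffer is small
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)


buffer = TinyReplayBuffer(max_memory=3)
for step in range(5):
    buffer.remember([f"s{step}", 0, 0.0, f"s{step + 1}"], game_over=(step == 4))

print(len(buffer.memory))      # 3: only the newest experiences remain
print(buffer.memory[0][0][0])  # s2
print(len(buffer.sample(32)))  # 3
```

Sampling at random from past experiences, rather than training on consecutive steps of one game, reduces the correlation between the samples in a batch.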

### Create the simulated game environment

Set the parameters for the PyRat simulated game environment.

In [5]:
width = 21  # Size of the playing field
height = 15  # Size of the playing field
cheeses = 40  # Number of cheeses in the game
opponent = manh  # AI used for the opponent

Create the PyRat simulated environment.

In [6]:
env = game.PyRat(width=width, height=height, opponent=opponent, cheeses=cheeses)

Show the **shape of an observation**.

In [7]:
test_observation = torch.FloatTensor(env.observe())
test_observation.shape

torch.Size([1, 29, 41, 1])

We have to be careful with this default shape, because convolutional layers expect input tensors laid out as:

<pre>(batch size, number of channels, height, width)</pre>

The environment instead returns a channels-last tensor, which is the **wrong** layout for a convolutional layer:

<pre>(batch size, height, width, number of channels)</pre>

We can reorder the axes with `permute`:

In [8]:
test_observation = test_observation.permute(0, 3, 1, 2)
test_observation.shape

torch.Size([1, 1, 29, 41])
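
The axis reordering above follows the index rule `out[b][c][h][w] = in[b][h][w][c]`. The sketch below reproduces that rule with plain nested lists so that it can be checked by hand; the helper `nhwc_to_nchw` is a hypothetical name introduced for this example and is not part of PyTorch or of the project.

```python
def nhwc_to_nchw(batch):
    """Reorder nested lists from [N][H][W][C] to [N][C][H][W]."""
    n = len(batch)
    h = len(batch[0])
    w = len(batch[0][0])
    c = len(batch[0][0][0])
    return [[[[batch[b][i][j][k] for j in range(w)]
              for i in range(h)]
             for k in range(c)]
            for b in range(n)]


# One 2x3 "image" with 2 channels; each pixel stores (channel 0, channel 1)
nhwc = [[[(0, 100), (1, 101), (2, 102)],
         [(3, 103), (4, 104), (5, 105)]]]
nchw = nhwc_to_nchw(nhwc)
print(nchw[0][0])  # [[0, 1, 2], [3, 4, 5]]
print(nchw[0][1])  # [[100, 101, 102], [103, 104, 105]]
```

With a single channel, a plain reshape would give the same values, but with two channels a reshape would interleave the channels inside each plane, so reordering the axes is the safer habit.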

### Create the experience replay buffer

Set the parameters for the experience replay buffer.

In [9]:
max_memory = 1000  # Maximum number of experiences we are storing
discount_factor=.97 # Discount factor for future rewards

Create the experience replay buffer.

In [10]:
exp_replay = rl.ExperienceReplay(max_memory=max_memory, discount=discount_factor)
exp_replay.discount

0.97
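
As a rough illustration of what the discount factor does, the sketch below computes the standard Q-learning target `r + discount * max Q(s', a')` for one experience. The function `q_target` is a hypothetical helper written for this example; whether rl.py computes its targets in exactly this way is an assumption.

```python
def q_target(reward, next_q_values, game_over, discount=0.97):
    """Standard Q-learning target for the action that was taken."""
    if game_over:
        # No future reward after a terminal state
        return reward
    return reward + discount * max(next_q_values)


next_q = [0.2, 0.5, -0.1, 0.0]  # made-up Q-values for the 4 actions
print(f"{q_target(1.0, next_q, game_over=False):.3f}")  # 1.485
print(f"{q_target(1.0, next_q, game_over=True):.3f}")   # 1.000

# Weight given to a reward received 10 steps in the future
print(f"{0.97 ** 10:.3f}")  # 0.737
```

A discount of 0.97 therefore still gives about 74% weight to a piece of cheese collected ten moves later.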

### Q-Function Approximation Model Topologies

#### Model 1

**Simple regressor** to predict the Q-values. Base Topology used and tested in course laboratory.

In [11]:
class MultiRegressor1FC(nn.Module):
    def __init__(self, x_example, number_of_channels=1, number_of_regressors=4):
        super(MultiRegressor1FC, self).__init__()
        in_features = x_example.reshape(-1).shape[0]
        self.nb_channels = number_of_channels
        self.linear = nn.Linear(in_features, number_of_regressors)
    
    def forward(self, x):
        x = x.reshape(x.shape[0], -1)
        return self.linear(x)

    def load(self):
        if self.nb_channels == 1:
            self.load_state_dict(torch.load('save_rl/weights_ANN1FC_1channel.pt'))
        else:
            self.load_state_dict(torch.load('save_rl/weights_ANN1FC_2channel.pt'))

    def save(self):
        if self.nb_channels == 1:
            torch.save(self.state_dict(), 'save_rl/weights_ANN1FC_1channel.pt')
        else:
            torch.save(self.state_dict(), 'save_rl/weights_ANN1FC_2channel.pt')

#### Model 2

**1 hidden layer regressor network** to predict the Q-values.

In [12]:
class MultiRegressor2FC(nn.Module):
    def __init__(self, x_example, number_of_channels=1, number_of_regressors=4):
        super(MultiRegressor2FC, self).__init__()
        in_features = x_example.reshape(-1).shape[0]
        self.nb_channels = number_of_channels
        self.fc1 = nn.Linear(in_features, 16)
        self.selu = nn.SELU()
        self.linear = nn.Linear(16, number_of_regressors)
    
    def forward(self, x):
        x = x.reshape(x.shape[0], -1)
        x = self.fc1(x)
        x = self.selu(x)
        return self.linear(x)

    def load(self):
        if self.nb_channels == 1:
            self.load_state_dict(torch.load('save_rl/weights_ANN2FC_1channel.pt'))
        else:
            self.load_state_dict(torch.load('save_rl/weights_ANN2FC_2channel.pt'))

    def save(self):
        if self.nb_channels == 1:
            torch.save(self.state_dict(), 'save_rl/weights_ANN2FC_1channel.pt')
        else:
            torch.save(self.state_dict(), 'save_rl/weights_ANN2FC_2channel.pt')

#### Model 3

A **CNN multi-regressor** integrating **1 fully connected layer**. Expects **1 or 2 channels** as input.

In [13]:
class MultiRegressorCNN1FC(nn.Module):
    def __init__(self, number_of_channels=1):
        super().__init__()
        self.nb_channels = number_of_channels
        self.conv1 = nn.Conv2d(number_of_channels, 16, kernel_size=3) # output_shape = (1, 16, 27, 39)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 16, 13, 19)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3) # output_shape = (1, 32, 11, 17)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 32, 5, 8)
        self.fc = nn.Linear(32 * 5 * 8, 4) 
        
    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)
        x = x.reshape(x.shape[0],-1) # output_shape = (1280)
        x = self.fc(x)
        return x
    
    def load(self):
        if self.nb_channels == 1:
            self.load_state_dict(torch.load('save_rl/weights_CNN1FC_1channel.pt'))
        else:
            self.load_state_dict(torch.load('save_rl/weights_CNN1FC_2channel.pt'))

    def save(self):
        if self.nb_channels == 1:
            torch.save(self.state_dict(), 'save_rl/weights_CNN1FC_1channel.pt')
        else:
            torch.save(self.state_dict(), 'save_rl/weights_CNN1FC_2channel.pt')

Sanity check to confirm the tensor dimensions after each convolution and pooling operation.

In [14]:
# Input shape which is the size of the canvas

x_shape = (1, 1, 29, 41)

# Conv2d defaults for the parameters not passed explicitly:
# stride=(1,1), padding=(0,0), dilation=(1,1), groups=1

# First convolution operation
op = nn.Conv2d(1, 16, kernel_size=3)
x_shape = tensorshape(op, x_shape)
print(f'Shape after first Conv2d: {x_shape}')

# First maxpool operation
op = nn.MaxPool2d(kernel_size=2)
x_shape = tensorshape(op, x_shape)
print(f'Shape after first MaxPool2d: {x_shape}')

# Second convolution operation
op = nn.Conv2d(16, 32, kernel_size=3)
x_shape = tensorshape(op, x_shape)
print(f'Shape after second Conv2d: {x_shape}')

# Second maxpool operation
op = nn.MaxPool2d(kernel_size=2)
x_shape = tensorshape(op, x_shape)
print(f'Shape after second MaxPool2d: {x_shape}')

Shape after first Conv2d: (1, 16, 27, 39)
Shape after first MaxPool2d: (1, 16, 13, 19)
Shape after second Conv2d: (1, 32, 11, 17)
Shape after second MaxPool2d: (1, 32, 5, 8)
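
The same shapes can be derived by hand from the usual output-size formula for convolution and pooling layers, `floor((n + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1`. The short sketch below applies it to the 29 x 41 canvas; `out_size` is a hypothetical helper written for this check, and it assumes that `MaxPool2d` uses a stride equal to its kernel size, which is the PyTorch default.

```python
def out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis for a convolution or pooling layer."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


h, w = 29, 41
h, w = out_size(h, 3), out_size(w, 3)                      # Conv2d, kernel 3
print(h, w)  # 27 39
h, w = out_size(h, 2, stride=2), out_size(w, 2, stride=2)  # MaxPool2d, kernel 2
print(h, w)  # 13 19
h, w = out_size(h, 3), out_size(w, 3)                      # Conv2d, kernel 3
print(h, w)  # 11 17
h, w = out_size(h, 2, stride=2), out_size(w, 2, stride=2)  # MaxPool2d, kernel 2
print(h, w)  # 5 8
print(32 * h * w)  # 1280 input features for the first fully connected layer
```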


Run a small sanity check to confirm that the model accepts an observation and returns one Q-value per action.

In [15]:
# Create a test instance of the model
test_model = MultiRegressorCNN1FC(env.observe().shape[3])

# Get a sample observation of the game environment
test_input_tensor = torch.FloatTensor(env.observe())
print(f'Shape of the raw input tensor: {test_input_tensor.shape}')

# Get an output from the untrained model for the sample input, just to validate the output size
test_output_tensor = test_model(test_input_tensor)
print(f'Shape of the output tensor: {test_output_tensor.shape}')

Shape of the raw input tensor: torch.Size([1, 29, 41, 1])
Shape of the output tensor: torch.Size([1, 4])


#### Model 4

A **CNN multi-regressor** integrating **2 fully connected layers**. Expects **1 or 2 channels** as input.

In [16]:
class MultiRegressorCNN2FC(nn.Module):
    def __init__(self, number_of_channels=1):
        super().__init__()
        self.nb_channels = number_of_channels
        self.conv1 = nn.Conv2d(number_of_channels, 16, kernel_size=3) # output_shape = (1, 16, 27, 39)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 16, 13, 19)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3) # output_shape = (1, 32, 11, 17)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 32, 5, 8)
        self.fc1 = nn.Linear(32 * 5 * 8, 16)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(16, 4)
        self.dropout = nn.Dropout(p=0.3)
        
    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)
        #x = self.dropout(x)
        
        x = x.reshape(x.shape[0],-1) # output_shape = (1280)
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
    
    def load(self):
        if self.nb_channels == 1:
            self.load_state_dict(torch.load('save_rl/weights_CNN2FC_1channel.pt'))
        else:
            self.load_state_dict(torch.load('save_rl/weights_CNN2FC_2channel.pt'))

    def save(self):
        if self.nb_channels == 1:
            torch.save(self.state_dict(), 'save_rl/weights_CNN2FC_1channel.pt')
        else:
            torch.save(self.state_dict(), 'save_rl/weights_CNN2FC_2channel.pt')

Run a small sanity check to confirm that the model accepts an observation and returns one Q-value per action.

In [17]:
# Create a test instance of the model
test_model = MultiRegressorCNN2FC(env.observe().shape[3])

# Get a sample observation of the game environment
test_input_tensor = torch.FloatTensor(env.observe())
print(f'Shape of the raw input tensor: {test_input_tensor.shape}')

# Get an output from the untrained model for the sample input, just to validate the output size
test_output_tensor = test_model(test_input_tensor)
print(f'Shape of the output tensor: {test_output_tensor.shape}')

Shape of the raw input tensor: torch.Size([1, 29, 41, 1])
Shape of the output tensor: torch.Size([1, 4])


#### Model 5

A **CNN multi-regressor** integrating **3 fully connected layers**. Expects **1 or 2 channels** as input.

In [18]:
class MultiRegressorCNN3FC(nn.Module):
    def __init__(self, number_of_channels=1):
        super().__init__()
        self.nb_channels = number_of_channels
        self.conv1 = nn.Conv2d(number_of_channels, 16, kernel_size=3) # output_shape = (1, 16, 27, 39)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 16, 13, 19)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3) # output_shape = (1, 32, 11, 17)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2) # output_shape = (1, 32, 5, 8)
        self.fc1 = nn.Linear(32 * 5 * 8, 16)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(16, 16)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(16, 4)
        self.dropout = nn.Dropout(p=0.2)
        
    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)
        #x = self.dropout(x)
        
        x = x.reshape(x.shape[0],-1) # output_shape = (1280)
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.relu4(x)
        x = self.dropout(x)
        x = self.fc3(x)
        return x
    
    def load(self):
        if self.nb_channels == 1:
            self.load_state_dict(torch.load('save_rl/weights_CNN3FC_1channel.pt'))
        else:
            self.load_state_dict(torch.load('save_rl/weights_CNN3FC_2channel.pt'))

    def save(self):
        if self.nb_channels == 1:
            torch.save(self.state_dict(), 'save_rl/weights_CNN3FC_1channel.pt')
        else:
            torch.save(self.state_dict(), 'save_rl/weights_CNN3FC_2channel.pt')

Run a small sanity check to confirm that the model accepts an observation and returns one Q-value per action.

In [19]:
# Create a test instance of the model
test_model = MultiRegressorCNN3FC(env.observe().shape[3])

# Get a sample observation of the game environment
test_input_tensor = torch.FloatTensor(env.observe())
print(f'Shape of the raw input tensor: {test_input_tensor.shape}')

# Get an output from the untrained model for the sample input, just to validate the output size
test_output_tensor = test_model(test_input_tensor)
print(f'Shape of the output tensor: {test_output_tensor.shape}')

Shape of the raw input tensor: torch.Size([1, 29, 41, 1])
Shape of the output tensor: torch.Size([1, 4])


### Initialize Q-Function Approximation Model

Let's **initialize** the neural network of your choice.

Uncomment the model you want to use and comment out the others.

In [20]:
# Define model parameters
nb_channels = env.observe().shape[3]
print(f'Number of channels: {nb_channels}')

# Instantiate chosen model and move it to device

# Simple regressor with 1 fully connected layer
#model = MultiRegressor1FC(env.observe()[0], number_of_channels=nb_channels)

# Simple regressor with 2 fully connected layers
model = MultiRegressor2FC(env.observe()[0], number_of_channels=nb_channels)

# CNN regressor with 1 fully connected layer
#model = MultiRegressorCNN1FC(number_of_channels=nb_channels)

# CNN regressor with 2 fully connected layers
#model = MultiRegressorCNN2FC(number_of_channels=nb_channels)

# CNN regressor with 3 fully connected layers
#model = MultiRegressorCNN3FC(number_of_channels=nb_channels)

#model.to(device=device)

model

Number of channels: 1


MultiRegressor2FC(
  (fc1): Linear(in_features=1189, out_features=16, bias=True)
  (selu): SELU()
  (linear): Linear(in_features=16, out_features=4, bias=True)
)

### Loss Function and Optimizer

Define a **loss function** and **optimizer**.

In [21]:
# Define the loss function as mean squared error (MSE)
criterion = nn.MSELoss()

# Set stochastic gradient descent as the optimizer
#optimizer = torch.optim.SGD(model.parameters(),lr = 0.01)

# Set Adam as the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
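
For intuition about the MSE criterion, the sketch below computes the mean squared error by hand for one predicted and one target vector of four Q-values. The numbers are made up, and the assumption that the target differs from the prediction only at the action taken is a common Q-learning setup that may or may not match what rl.py does.

```python
def mse(predicted, target):
    """Mean of the squared differences over all elements."""
    flat_p = [v for row in predicted for v in row]
    flat_t = [v for row in target for v in row]
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)


predicted = [[0.0, 1.0, 0.0, 2.0]]  # Q-values predicted for the 4 actions
target = [[0.0, 1.0, 0.0, 4.0]]     # same values except for the action taken
print(mse(predicted, target))  # 1.0
```

Only the entry that differs contributes to the loss, so in such a setup the gradient step adjusts the Q-value of the action that was actually played.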

### Train the model

Training parameters.

In [22]:
number_of_batches = 8  # Number of batches per epoch
batch_size = 32  # Number of experiences we use for training per batch

Reset the global best value of the reward summed over the last 100 epochs. The training routine saves the model whenever this value is exceeded.

In [23]:
max_reward = 0

In [24]:
max_reward

0

Training routine definition.

In [25]:
def play(model, epochs, criterion, optimizer=None, train=True):

    win_cnt = 0
    lose_cnt = 0
    draw_cnt = 0
    win_hist = []
    cheeses = []
    
    # Addition to track rewards
    reward_cnt = 0
    rewards = []
    
    steps = 0.
    last_W = 0
    last_D = 0
    last_L = 0
    
    global max_cheese
    global max_reward
    
    for e in tqdm(range(epochs)):
        env.reset()
        game_over = False

        # Get the current state of the environment
        state = env.observe()
        
        # Play a full game until game is over
        while not game_over:
            # Do not forget to transform the input of model into torch tensor
            state = torch.FloatTensor(state)

            # Predict the Q value for the current state
            q_values = model(state)

            # Pick the next action that maximizes the Q value
            action = torch.argmax(q_values)
            
            # Apply action, get rewards and new state
            previous_state = state.detach().clone()
            state, reward, game_over= env.act(action)
            
            # Statistics            
            reward_cnt += reward
            
            if game_over:
                steps += env.round
                if env.score > env.enemy_score:
                    win_cnt += 1
                elif env.score == env.enemy_score:
                    draw_cnt += 1
                else:
                    lose_cnt += 1
                cheese = env.score

            # Create an experience array using previous state, the performed action, the obtained reward and the new state. The vector has to be in this order.
            # Store in the experience replay buffer an experience and end game.
            # Do not forget to transform the previous state and the new state into torch tensor.
            # Create an experience array
            experience = [torch.FloatTensor(previous_state), action, reward, torch.FloatTensor(state)]

            # Store the experience in the experience replay buffer
            exp_replay.remember(experience, game_over)
            
        win_hist.append(win_cnt)  # Statistics
        cheeses.append(cheese)  # Statistics
        
        # Save the total reward of this episode and reset the episode reward counter
        rewards.append(reward_cnt)        
        reward_cnt = 0

        if train:

            # Train using experience replay. For each batch, get a set of experiences (state, action, new state) that were stored in the buffer. 
            # Use this batch to train the model.
            
            running_loss = 0
            for b in range(number_of_batches):
                # Get the batch data
                states, Q = exp_replay.get_batch(model, batch_size=batch_size)
                
                # Fit the training data in mounted device
                #states = states.to(device=device)                
                #Q = Q.to(device=device)                
               
                # Compute the loss
                loss = rl.train_on_batch(model, states, Q, criterion, optimizer)
                
                # statistics
                running_loss += loss
            #print('[%d] loss: %.3f' % (e + 1, running_loss / number_of_batches))
            running_loss = 0.0

            '''if e > 100 :  # Check to save
                cheese_np = np.array(cheeses)
                if cheese_np[-100:].sum() > max_cheese:
                    max_cheese = cheese_np[-100:].sum()
                    print(f"New maximum cheese: {max_cheese}.\nSaving model...")
                    model.save()'''
                    
            if e > 100 :  # Check to save
                rewards_np = np.array(rewards)
                if rewards_np[-100:].sum() > max_reward:
                    max_reward = rewards_np[-100:].sum()
                    print(f"New maximum rewards: {max_reward}.\nSaving model...")
                    model.save()   
        
        if (e+1) % 100 == 0:  # Statistics every 100 epochs
            cheese_np = np.array(cheeses)
            rewards_np = np.array(rewards)
            string = "Epoch {:03d}/{:03d} | Last 100 Reward {} | Last 100 Cheese {}| W/D/L {}/{}/{} | 100 W/D/L {}/{}/{} | 100 Steps {}".format(
                        e,epochs, rewards_np[-100:].sum(), 
                        cheese_np[-100:].sum(), win_cnt, draw_cnt, lose_cnt, 
                        win_cnt-last_W, draw_cnt-last_D, lose_cnt-last_L, steps/100)
            print(string)
        
            steps = 0.
            last_W = win_cnt
            last_D = draw_cnt
            last_L = lose_cnt  
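
The checkpointing rule used in the routine above (save whenever the reward summed over the last 100 epochs beats the best sum seen so far) can be isolated in a small helper. This is a simplified sketch: `should_save` is a hypothetical function, and it starts checking as soon as one full window is available, whereas the routine itself waits until `e > 100`.

```python
def should_save(rewards, best_so_far, window=100):
    """Compare the summed reward of the last `window` episodes with the best sum."""
    if len(rewards) < window:
        return False, best_so_far  # not enough history yet
    recent = sum(rewards[-window:])
    if recent > best_so_far:
        return True, recent
    return False, best_so_far


best = 0
history = []
saved_at = []
for episode, episode_reward in enumerate([1, 1, 1, 5, 1, 1]):
    history.append(episode_reward)
    save, best = should_save(history, best, window=3)
    if save:
        saved_at.append(episode)  # a real run would call model.save() here

print(saved_at)  # [2, 3]
print(best)      # 7
```

Because the comparison is strict, a window that only ties the best sum does not trigger another save.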

### Train the Q-learner with Experience replay

If `load` is `True`, the last saved weights are loaded and training continues from them. Otherwise, training starts from scratch with randomly initialized parameters.

In [26]:
load = False

if load:
    model.load()

Train the model.

In [30]:
epoch = 10000  # Total number of epochs that will be done

print("Training")
play(model, epoch, criterion, optimizer, True)
print("Training done")

#model.save()

Training


  1%|▊                                                                             | 102/10000 [00:07<13:16, 12.43it/s]

Epoch 099/10000 | Last 100 Reward 1983.0 | Last 100 Cheese 1926.5| W/D/L 48/13/39 | 100 W/D/L 48/13/39 | 100 Steps 72.29


  2%|█▌                                                                            | 202/10000 [00:15<12:25, 13.15it/s]

Epoch 199/10000 | Last 100 Reward 1991.0 | Last 100 Cheese 1939.0| W/D/L 102/29/69 | 100 W/D/L 54/16/30 | 100 Steps 71.47


  3%|██▎                                                                           | 302/10000 [00:23<12:51, 12.58it/s]

Epoch 299/10000 | Last 100 Reward 1955.0 | Last 100 Cheese 1904.5| W/D/L 154/41/105 | 100 W/D/L 52/12/36 | 100 Steps 68.9


  4%|███▏                                                                          | 402/10000 [00:30<12:49, 12.47it/s]

Epoch 399/10000 | Last 100 Reward 2039.0 | Last 100 Cheese 1958.5| W/D/L 216/49/135 | 100 W/D/L 62/8/30 | 100 Steps 71.17


  5%|███▉                                                                          | 502/10000 [00:38<13:00, 12.18it/s]

Epoch 499/10000 | Last 100 Reward 1958.0 | Last 100 Cheese 1921.5| W/D/L 271/57/172 | 100 W/D/L 55/8/37 | 100 Steps 71.14


  6%|████▋                                                                         | 602/10000 [00:46<12:32, 12.49it/s]

Epoch 599/10000 | Last 100 Reward 2027.0 | Last 100 Cheese 1970.5| W/D/L 330/70/200 | 100 W/D/L 59/13/28 | 100 Steps 71.53


  7%|█████▍                                                                        | 702/10000 [00:54<13:03, 11.87it/s]

Epoch 699/10000 | Last 100 Reward 2009.0 | Last 100 Cheese 1940.5| W/D/L 393/78/229 | 100 W/D/L 63/8/29 | 100 Steps 69.2


  8%|██████▎                                                                       | 802/10000 [01:02<12:36, 12.16it/s]

Epoch 799/10000 | Last 100 Reward 2043.0 | Last 100 Cheese 1967.0| W/D/L 454/92/254 | 100 W/D/L 61/14/25 | 100 Steps 70.89


  9%|███████                                                                       | 902/10000 [01:10<11:42, 12.95it/s]

Epoch 899/10000 | Last 100 Reward 2044.0 | Last 100 Cheese 1980.5| W/D/L 521/101/278 | 100 W/D/L 67/9/24 | 100 Steps 69.84


 10%|███████▋                                                                     | 1002/10000 [01:18<12:11, 12.31it/s]

Epoch 999/10000 | Last 100 Reward 2024.0 | Last 100 Cheese 1963.5| W/D/L 580/110/310 | 100 W/D/L 59/9/32 | 100 Steps 71.05


 11%|████████▍                                                                    | 1102/10000 [01:26<11:39, 12.73it/s]

Epoch 1099/10000 | Last 100 Reward 2011.0 | Last 100 Cheese 1961.0| W/D/L 637/123/340 | 100 W/D/L 57/13/30 | 100 Steps 70.45


 12%|█████████▎                                                                   | 1202/10000 [01:34<11:30, 12.73it/s]

Epoch 1199/10000 | Last 100 Reward 1991.0 | Last 100 Cheese 1924.5| W/D/L 690/139/371 | 100 W/D/L 53/16/31 | 100 Steps 71.71


 13%|██████████                                                                   | 1302/10000 [01:42<11:14, 12.89it/s]

Epoch 1299/10000 | Last 100 Reward 2049.0 | Last 100 Cheese 1957.0| W/D/L 756/144/400 | 100 W/D/L 66/5/29 | 100 Steps 70.24


 13%|██████████▎                                                                  | 1342/10000 [01:45<11:38, 12.40it/s]

New maximum rewards: 2100.0.
Saving model...


 13%|██████████▎                                                                  | 1346/10000 [01:45<11:47, 12.24it/s]

New maximum rewards: 2102.0.
Saving model...


 14%|██████████▍                                                                  | 1352/10000 [01:46<11:59, 12.01it/s]

New maximum rewards: 2106.0.
Saving model...


 14%|██████████▊                                                                  | 1402/10000 [01:50<11:20, 12.64it/s]

Epoch 1399/10000 | Last 100 Reward 2051.0 | Last 100 Cheese 1981.0| W/D/L 815/161/424 | 100 W/D/L 59/17/24 | 100 Steps 72.97


 15%|███████████▌                                                                 | 1502/10000 [01:58<12:07, 11.69it/s]

Epoch 1499/10000 | Last 100 Reward 2054.0 | Last 100 Cheese 1991.5| W/D/L 878/175/447 | 100 W/D/L 63/14/23 | 100 Steps 73.02


 16%|████████████▎                                                                | 1602/10000 [02:07<11:52, 11.79it/s]

Epoch 1599/10000 | Last 100 Reward 1951.0 | Last 100 Cheese 1917.0| W/D/L 925/190/485 | 100 W/D/L 47/15/38 | 100 Steps 72.02


 17%|█████████████                                                                | 1700/10000 [02:15<10:54, 12.68it/s]

Epoch 1699/10000 | Last 100 Reward 1978.0 | Last 100 Cheese 1931.5| W/D/L 987/200/513 | 100 W/D/L 62/10/28 | 100 Steps 70.13


 18%|█████████████▉                                                               | 1802/10000 [02:23<10:33, 12.93it/s]

Epoch 1799/10000 | Last 100 Reward 1990.0 | Last 100 Cheese 1944.0| W/D/L 1049/213/538 | 100 W/D/L 62/13/25 | 100 Steps 69.06


 19%|██████████████▋                                                              | 1902/10000 [02:31<11:22, 11.86it/s]

Epoch 1899/10000 | Last 100 Reward 1999.0 | Last 100 Cheese 1933.5| W/D/L 1103/222/575 | 100 W/D/L 54/9/37 | 100 Steps 72.38


 20%|███████████████▍                                                             | 2002/10000 [02:39<10:38, 12.52it/s]

Epoch 1999/10000 | Last 100 Reward 1983.0 | Last 100 Cheese 1927.0| W/D/L 1157/233/610 | 100 W/D/L 54/11/35 | 100 Steps 69.66


 21%|████████████████▏                                                            | 2102/10000 [02:47<10:19, 12.76it/s]

Epoch 2099/10000 | Last 100 Reward 2006.0 | Last 100 Cheese 1948.0| W/D/L 1218/239/643 | 100 W/D/L 61/6/33 | 100 Steps 68.29


 22%|████████████████▉                                                            | 2202/10000 [02:55<10:15, 12.68it/s]

Epoch 2199/10000 | Last 100 Reward 2051.0 | Last 100 Cheese 1988.0| W/D/L 1288/247/665 | 100 W/D/L 70/8/22 | 100 Steps 68.57


 23%|█████████████████▋                                                           | 2302/10000 [03:03<10:06, 12.70it/s]

Epoch 2299/10000 | Last 100 Reward 2027.0 | Last 100 Cheese 1961.5| W/D/L 1348/258/694 | 100 W/D/L 60/11/29 | 100 Steps 69.08


 24%|██████████████████▍                                                          | 2402/10000 [03:11<10:04, 12.57it/s]

Epoch 2399/10000 | Last 100 Reward 1948.0 | Last 100 Cheese 1908.5| W/D/L 1396/273/731 | 100 W/D/L 48/15/37 | 100 Steps 71.69


 25%|███████████████████▎                                                         | 2502/10000 [03:19<10:08, 12.33it/s]

Epoch 2499/10000 | Last 100 Reward 2070.0 | Last 100 Cheese 1972.0| W/D/L 1457/287/756 | 100 W/D/L 61/14/25 | 100 Steps 72.46


 26%|████████████████████                                                         | 2602/10000 [03:27<09:28, 13.02it/s]

Epoch 2599/10000 | Last 100 Reward 2008.0 | Last 100 Cheese 1958.5| W/D/L 1513/300/787 | 100 W/D/L 56/13/31 | 100 Steps 72.15


 27%|████████████████████▊                                                        | 2702/10000 [03:35<09:55, 12.25it/s]

Epoch 2699/10000 | Last 100 Reward 2025.0 | Last 100 Cheese 1961.0| W/D/L 1567/312/821 | 100 W/D/L 54/12/34 | 100 Steps 73.2


 28%|█████████████████████▌                                                       | 2802/10000 [03:43<09:35, 12.51it/s]

Epoch 2799/10000 | Last 100 Reward 1995.0 | Last 100 Cheese 1947.5| W/D/L 1620/325/855 | 100 W/D/L 53/13/34 | 100 Steps 70.62


 29%|██████████████████████▎                                                      | 2902/10000 [03:51<10:05, 11.72it/s]

Epoch 2899/10000 | Last 100 Reward 2043.0 | Last 100 Cheese 1972.5| W/D/L 1682/338/880 | 100 W/D/L 62/13/25 | 100 Steps 69.46


 30%|███████████████████████                                                      | 3002/10000 [03:59<09:09, 12.72it/s]

Epoch 2999/10000 | Last 100 Reward 2033.0 | Last 100 Cheese 1968.5| W/D/L 1741/350/909 | 100 W/D/L 59/12/29 | 100 Steps 71.86


 31%|███████████████████████▉                                                     | 3102/10000 [04:07<09:06, 12.62it/s]

Epoch 3099/10000 | Last 100 Reward 1934.0 | Last 100 Cheese 1889.5| W/D/L 1792/360/948 | 100 W/D/L 51/10/39 | 100 Steps 69.56


 32%|████████████████████████▋                                                    | 3202/10000 [04:14<08:57, 12.64it/s]

Epoch 3199/10000 | Last 100 Reward 2017.0 | Last 100 Cheese 1941.0| W/D/L 1853/372/975 | 100 W/D/L 61/12/27 | 100 Steps 70.34


 33%|█████████████████████████▍                                                   | 3302/10000 [04:22<08:30, 13.11it/s]

Epoch 3299/10000 | Last 100 Reward 2052.0 | Last 100 Cheese 1983.0| W/D/L 1916/385/999 | 100 W/D/L 63/13/24 | 100 Steps 70.23


 34%|██████████████████████████▏                                                  | 3402/10000 [04:31<08:41, 12.66it/s]

Epoch 3399/10000 | Last 100 Reward 2022.0 | Last 100 Cheese 1960.0| W/D/L 1975/396/1029 | 100 W/D/L 59/11/30 | 100 Steps 71.03


 35%|██████████████████████████▉                                                  | 3502/10000 [04:38<08:24, 12.87it/s]

Epoch 3499/10000 | Last 100 Reward 2019.0 | Last 100 Cheese 1948.5| W/D/L 2031/410/1059 | 100 W/D/L 56/14/30 | 100 Steps 70.96


 36%|███████████████████████████▋                                                 | 3602/10000 [04:46<08:04, 13.19it/s]

Epoch 3599/10000 | Last 100 Reward 2060.0 | Last 100 Cheese 1972.0| W/D/L 2096/423/1081 | 100 W/D/L 65/13/22 | 100 Steps 69.62


 37%|████████████████████████████▌                                                | 3702/10000 [04:54<08:23, 12.52it/s]

Epoch 3699/10000 | Last 100 Reward 1993.0 | Last 100 Cheese 1943.0| W/D/L 2154/432/1114 | 100 W/D/L 58/9/33 | 100 Steps 69.76


 38%|█████████████████████████████▎                                               | 3802/10000 [05:03<07:54, 13.07it/s]

Epoch 3799/10000 | Last 100 Reward 2024.0 | Last 100 Cheese 1955.5| W/D/L 2215/440/1145 | 100 W/D/L 61/8/31 | 100 Steps 69.77


 39%|██████████████████████████████                                               | 3902/10000 [05:10<07:50, 12.96it/s]

Epoch 3899/10000 | Last 100 Reward 1994.0 | Last 100 Cheese 1942.5| W/D/L 2270/452/1178 | 100 W/D/L 55/12/33 | 100 Steps 72.18


 40%|██████████████████████████████▊                                              | 4002/10000 [05:18<07:42, 12.97it/s]

Epoch 3999/10000 | Last 100 Reward 2010.0 | Last 100 Cheese 1941.5| W/D/L 2329/460/1211 | 100 W/D/L 59/8/33 | 100 Steps 70.8


 41%|███████████████████████████████▌                                             | 4102/10000 [05:26<07:33, 13.00it/s]

Epoch 4099/10000 | Last 100 Reward 2023.0 | Last 100 Cheese 1940.0| W/D/L 2383/476/1241 | 100 W/D/L 54/16/30 | 100 Steps 71.57


 42%|████████████████████████████████▎                                            | 4202/10000 [05:34<07:29, 12.89it/s]

Epoch 4199/10000 | Last 100 Reward 2006.0 | Last 100 Cheese 1958.5| W/D/L 2442/488/1270 | 100 W/D/L 59/12/29 | 100 Steps 72.67


 43%|█████████████████████████████████▏                                           | 4302/10000 [05:42<07:54, 12.00it/s]

Epoch 4299/10000 | Last 100 Reward 1896.0 | Last 100 Cheese 1868.5| W/D/L 2489/497/1314 | 100 W/D/L 47/9/44 | 100 Steps 69.6


 44%|█████████████████████████████████▉                                           | 4402/10000 [05:50<07:23, 12.62it/s]

Epoch 4399/10000 | Last 100 Reward 2019.0 | Last 100 Cheese 1951.0| W/D/L 2549/506/1345 | 100 W/D/L 60/9/31 | 100 Steps 72.75


 45%|██████████████████████████████████▋                                          | 4502/10000 [05:58<07:07, 12.87it/s]

Epoch 4499/10000 | Last 100 Reward 2031.0 | Last 100 Cheese 1953.0| W/D/L 2605/524/1371 | 100 W/D/L 56/18/26 | 100 Steps 71.93


 46%|███████████████████████████████████▍                                         | 4600/10000 [06:06<07:36, 11.82it/s]

Epoch 4599/10000 | Last 100 Reward 2052.0 | Last 100 Cheese 1977.5| W/D/L 2665/542/1393 | 100 W/D/L 60/18/22 | 100 Steps 71.21


 47%|████████████████████████████████████▏                                        | 4702/10000 [06:14<06:56, 12.71it/s]

Epoch 4699/10000 | Last 100 Reward 2022.0 | Last 100 Cheese 1930.0| W/D/L 2726/551/1423 | 100 W/D/L 61/9/30 | 100 Steps 68.76


 48%|████████████████████████████████████▉                                        | 4802/10000 [06:22<06:55, 12.51it/s]

Epoch 4799/10000 | Last 100 Reward 2041.0 | Last 100 Cheese 1962.5| W/D/L 2786/561/1453 | 100 W/D/L 60/10/30 | 100 Steps 70.56


 49%|█████████████████████████████████████▋                                       | 4902/10000 [06:30<07:41, 11.04it/s]

Epoch 4899/10000 | Last 100 Reward 2072.0 | Last 100 Cheese 1975.5| W/D/L 2849/576/1475 | 100 W/D/L 63/15/22 | 100 Steps 70.07


 50%|██████████████████████████████████████▌                                      | 5002/10000 [06:38<06:34, 12.68it/s]

Epoch 4999/10000 | Last 100 Reward 1985.0 | Last 100 Cheese 1944.0| W/D/L 2905/586/1509 | 100 W/D/L 56/10/34 | 100 Steps 70.83


 51%|███████████████████████████████████████▎                                     | 5102/10000 [06:46<06:35, 12.37it/s]

Epoch 5099/10000 | Last 100 Reward 2017.0 | Last 100 Cheese 1951.0| W/D/L 2958/602/1540 | 100 W/D/L 53/16/31 | 100 Steps 71.24


 52%|████████████████████████████████████████                                     | 5202/10000 [06:54<06:15, 12.76it/s]

Epoch 5199/10000 | Last 100 Reward 2013.0 | Last 100 Cheese 1949.0| W/D/L 3016/615/1569 | 100 W/D/L 58/13/29 | 100 Steps 68.64


 53%|████████████████████████████████████████▊                                    | 5302/10000 [07:02<05:54, 13.24it/s]

Epoch 5299/10000 | Last 100 Reward 2058.0 | Last 100 Cheese 1969.5| W/D/L 3082/625/1593 | 100 W/D/L 66/10/24 | 100 Steps 70.59


 54%|█████████████████████████████████████████▌                                   | 5402/10000 [07:10<05:53, 13.00it/s]

Epoch 5399/10000 | Last 100 Reward 1962.0 | Last 100 Cheese 1905.5| W/D/L 3137/634/1629 | 100 W/D/L 55/9/36 | 100 Steps 69.22


 55%|██████████████████████████████████████████▎                                  | 5502/10000 [07:18<06:05, 12.31it/s]

Epoch 5499/10000 | Last 100 Reward 1907.0 | Last 100 Cheese 1872.0| W/D/L 3186/646/1668 | 100 W/D/L 49/12/39 | 100 Steps 69.5


 56%|███████████████████████████████████████████▏                                 | 5602/10000 [07:26<05:46, 12.71it/s]

Epoch 5599/10000 | Last 100 Reward 1973.0 | Last 100 Cheese 1911.0| W/D/L 3243/659/1698 | 100 W/D/L 57/13/30 | 100 Steps 68.79


 57%|███████████████████████████████████████████▉                                 | 5702/10000 [07:34<05:28, 13.07it/s]

Epoch 5699/10000 | Last 100 Reward 2017.0 | Last 100 Cheese 1959.5| W/D/L 3296/674/1730 | 100 W/D/L 53/15/32 | 100 Steps 71.11


 58%|████████████████████████████████████████████▋                                | 5802/10000 [07:42<05:46, 12.10it/s]

Epoch 5799/10000 | Last 100 Reward 1995.0 | Last 100 Cheese 1932.5| W/D/L 3352/686/1762 | 100 W/D/L 56/12/32 | 100 Steps 69.6


 59%|█████████████████████████████████████████████▍                               | 5902/10000 [07:50<05:16, 12.96it/s]

Epoch 5899/10000 | Last 100 Reward 1967.0 | Last 100 Cheese 1906.5| W/D/L 3408/694/1798 | 100 W/D/L 56/8/36 | 100 Steps 68.42


 60%|██████████████████████████████████████████████▏                              | 6002/10000 [07:59<05:11, 12.84it/s]

Epoch 5999/10000 | Last 100 Reward 2029.0 | Last 100 Cheese 1959.0| W/D/L 3467/712/1821 | 100 W/D/L 59/18/23 | 100 Steps 69.01


 61%|██████████████████████████████████████████████▉                              | 6102/10000 [08:07<05:16, 12.30it/s]

Epoch 6099/10000 | Last 100 Reward 1997.0 | Last 100 Cheese 1935.0| W/D/L 3522/723/1855 | 100 W/D/L 55/11/34 | 100 Steps 72.17


 62%|███████████████████████████████████████████████▊                             | 6202/10000 [08:14<05:06, 12.40it/s]

Epoch 6199/10000 | Last 100 Reward 2028.0 | Last 100 Cheese 1950.5| W/D/L 3581/736/1883 | 100 W/D/L 59/13/28 | 100 Steps 69.74


 63%|████████████████████████████████████████████████▌                            | 6302/10000 [08:22<04:48, 12.80it/s]

Epoch 6299/10000 | Last 100 Reward 2083.0 | Last 100 Cheese 1996.0| W/D/L 3643/753/1904 | 100 W/D/L 62/17/21 | 100 Steps 71.55


 64%|█████████████████████████████████████████████████▎                           | 6402/10000 [08:31<05:00, 11.98it/s]

Epoch 6399/10000 | Last 100 Reward 2011.0 | Last 100 Cheese 1937.5| W/D/L 3695/768/1937 | 100 W/D/L 52/15/33 | 100 Steps 70.33


 65%|██████████████████████████████████████████████████                           | 6502/10000 [08:39<04:34, 12.75it/s]

Epoch 6499/10000 | Last 100 Reward 2007.0 | Last 100 Cheese 1940.0| W/D/L 3754/781/1965 | 100 W/D/L 59/13/28 | 100 Steps 69.46


 66%|██████████████████████████████████████████████████▊                          | 6600/10000 [08:47<04:39, 12.17it/s]

Epoch 6599/10000 | Last 100 Reward 1938.0 | Last 100 Cheese 1904.5| W/D/L 3804/795/2001 | 100 W/D/L 50/14/36 | 100 Steps 69.03


 67%|███████████████████████████████████████████████████▌                         | 6702/10000 [08:55<04:24, 12.45it/s]

Epoch 6699/10000 | Last 100 Reward 1985.0 | Last 100 Cheese 1924.5| W/D/L 3854/811/2035 | 100 W/D/L 50/16/34 | 100 Steps 72.84


 68%|████████████████████████████████████████████████████▍                        | 6802/10000 [09:03<04:17, 12.43it/s]

Epoch 6799/10000 | Last 100 Reward 2021.0 | Last 100 Cheese 1951.0| W/D/L 3908/828/2064 | 100 W/D/L 54/17/29 | 100 Steps 70.93


 69%|█████████████████████████████████████████████████████▏                       | 6902/10000 [09:11<03:54, 13.23it/s]

Epoch 6899/10000 | Last 100 Reward 1964.0 | Last 100 Cheese 1910.0| W/D/L 3961/837/2102 | 100 W/D/L 53/9/38 | 100 Steps 69.53


 70%|█████████████████████████████████████████████████████▉                       | 6998/10000 [09:18<03:45, 13.30it/s]

New maximum rewards: 2109.0.
Saving model...
New maximum rewards: 2110.0.
Saving model...


 70%|█████████████████████████████████████████████████████▉                       | 7002/10000 [09:19<04:10, 11.95it/s]

New maximum rewards: 2113.0.
Saving model...
Epoch 6999/10000 | Last 100 Reward 2113.0 | Last 100 Cheese 1999.0| W/D/L 4035/845/2120 | 100 W/D/L 74/8/18 | 100 Steps 66.91
New maximum rewards: 2115.0.
Saving model...


 70%|█████████████████████████████████████████████████████▉                       | 7004/10000 [09:19<04:08, 12.04it/s]

New maximum rewards: 2125.0.
Saving model...
New maximum rewards: 2130.0.
Saving model...
New maximum rewards: 2137.0.
Saving model...


 70%|█████████████████████████████████████████████████████▉                       | 7008/10000 [09:19<04:02, 12.36it/s]

New maximum rewards: 2143.0.
Saving model...
New maximum rewards: 2144.0.
Saving model...
New maximum rewards: 2145.0.
Saving model...


 70%|█████████████████████████████████████████████████████▉                       | 7010/10000 [09:19<03:59, 12.51it/s]

New maximum rewards: 2151.0.
Saving model...
New maximum rewards: 2153.0.
Saving model...
New maximum rewards: 2166.0.
Saving model...


 71%|██████████████████████████████████████████████████████▋                      | 7102/10000 [09:27<03:40, 13.16it/s]

Epoch 7099/10000 | Last 100 Reward 2097.0 | Last 100 Cheese 1996.0| W/D/L 4101/860/2139 | 100 W/D/L 66/15/19 | 100 Steps 72.27


 72%|███████████████████████████████████████████████████████▍                     | 7202/10000 [09:35<03:59, 11.66it/s]

Epoch 7199/10000 | Last 100 Reward 2051.0 | Last 100 Cheese 1971.5| W/D/L 4167/871/2162 | 100 W/D/L 66/11/23 | 100 Steps 69.06


 73%|████████████████████████████████████████████████████████▏                    | 7302/10000 [09:43<03:35, 12.51it/s]

Epoch 7299/10000 | Last 100 Reward 2000.0 | Last 100 Cheese 1940.0| W/D/L 4226/887/2187 | 100 W/D/L 59/16/25 | 100 Steps 69.96


 74%|████████████████████████████████████████████████████████▉                    | 7402/10000 [09:52<03:32, 12.22it/s]

Epoch 7399/10000 | Last 100 Reward 2023.0 | Last 100 Cheese 1951.0| W/D/L 4288/895/2217 | 100 W/D/L 62/8/30 | 100 Steps 69.91


 75%|█████████████████████████████████████████████████████████▊                   | 7502/10000 [10:00<03:17, 12.65it/s]

Epoch 7499/10000 | Last 100 Reward 2024.0 | Last 100 Cheese 1947.0| W/D/L 4344/908/2248 | 100 W/D/L 56/13/31 | 100 Steps 73.49


 76%|██████████████████████████████████████████████████████████▌                  | 7602/10000 [10:08<03:28, 11.51it/s]

Epoch 7599/10000 | Last 100 Reward 2036.0 | Last 100 Cheese 1958.5| W/D/L 4397/925/2278 | 100 W/D/L 53/17/30 | 100 Steps 73.0


 77%|███████████████████████████████████████████████████████████▎                 | 7702/10000 [10:16<03:02, 12.62it/s]

Epoch 7699/10000 | Last 100 Reward 2022.0 | Last 100 Cheese 1964.0| W/D/L 4458/936/2306 | 100 W/D/L 61/11/28 | 100 Steps 70.62


 78%|████████████████████████████████████████████████████████████                 | 7802/10000 [10:23<02:53, 12.67it/s]

Epoch 7799/10000 | Last 100 Reward 2021.0 | Last 100 Cheese 1964.5| W/D/L 4518/949/2333 | 100 W/D/L 60/13/27 | 100 Steps 71.58


 79%|████████████████████████████████████████████████████████████▊                | 7902/10000 [10:32<02:42, 12.95it/s]

Epoch 7899/10000 | Last 100 Reward 1999.0 | Last 100 Cheese 1949.0| W/D/L 4572/964/2364 | 100 W/D/L 54/15/31 | 100 Steps 70.51


 80%|█████████████████████████████████████████████████████████████▌               | 8002/10000 [10:40<02:38, 12.62it/s]

Epoch 7999/10000 | Last 100 Reward 1965.0 | Last 100 Cheese 1914.0| W/D/L 4625/973/2402 | 100 W/D/L 53/9/38 | 100 Steps 71.01


 81%|██████████████████████████████████████████████████████████████▍              | 8102/10000 [10:48<02:31, 12.57it/s]

Epoch 8099/10000 | Last 100 Reward 1988.0 | Last 100 Cheese 1915.0| W/D/L 4673/992/2435 | 100 W/D/L 48/19/33 | 100 Steps 72.26


 82%|███████████████████████████████████████████████████████████████▏             | 8202/10000 [10:56<02:21, 12.73it/s]

Epoch 8199/10000 | Last 100 Reward 1996.0 | Last 100 Cheese 1924.5| W/D/L 4725/1006/2469 | 100 W/D/L 52/14/34 | 100 Steps 70.36


 83%|███████████████████████████████████████████████████████████████▉             | 8302/10000 [11:04<02:12, 12.79it/s]

Epoch 8299/10000 | Last 100 Reward 1980.0 | Last 100 Cheese 1948.0| W/D/L 4779/1019/2502 | 100 W/D/L 54/13/33 | 100 Steps 68.64


 84%|████████████████████████████████████████████████████████████████▋            | 8400/10000 [11:12<02:06, 12.69it/s]

Epoch 8399/10000 | Last 100 Reward 1999.0 | Last 100 Cheese 1943.0| W/D/L 4837/1029/2534 | 100 W/D/L 58/10/32 | 100 Steps 70.45


 85%|█████████████████████████████████████████████████████████████████▍           | 8502/10000 [11:20<01:59, 12.50it/s]

Epoch 8499/10000 | Last 100 Reward 2005.0 | Last 100 Cheese 1939.5| W/D/L 4895/1039/2566 | 100 W/D/L 58/10/32 | 100 Steps 69.18


 86%|██████████████████████████████████████████████████████████████████▏          | 8602/10000 [11:28<01:52, 12.44it/s]

Epoch 8599/10000 | Last 100 Reward 1999.0 | Last 100 Cheese 1938.0| W/D/L 4953/1051/2596 | 100 W/D/L 58/12/30 | 100 Steps 69.41


 87%|███████████████████████████████████████████████████████████████████          | 8702/10000 [11:36<01:42, 12.72it/s]

Epoch 8699/10000 | Last 100 Reward 1969.0 | Last 100 Cheese 1920.0| W/D/L 5003/1062/2635 | 100 W/D/L 50/11/39 | 100 Steps 71.4


 88%|███████████████████████████████████████████████████████████████████▊         | 8802/10000 [11:44<01:38, 12.21it/s]

Epoch 8799/10000 | Last 100 Reward 2074.0 | Last 100 Cheese 1991.5| W/D/L 5071/1070/2659 | 100 W/D/L 68/8/24 | 100 Steps 70.42


 89%|████████████████████████████████████████████████████████████████████▌        | 8902/10000 [11:52<01:32, 11.86it/s]

Epoch 8899/10000 | Last 100 Reward 2022.0 | Last 100 Cheese 1959.5| W/D/L 5125/1085/2690 | 100 W/D/L 54/15/31 | 100 Steps 71.29


 90%|█████████████████████████████████████████████████████████████████████▎       | 9002/10000 [12:00<01:19, 12.55it/s]

Epoch 8999/10000 | Last 100 Reward 2019.0 | Last 100 Cheese 1959.5| W/D/L 5181/1097/2722 | 100 W/D/L 56/12/32 | 100 Steps 71.36


 91%|██████████████████████████████████████████████████████████████████████       | 9102/10000 [12:08<01:15, 11.92it/s]

Epoch 9099/10000 | Last 100 Reward 1967.0 | Last 100 Cheese 1922.5| W/D/L 5237/1103/2760 | 100 W/D/L 56/6/38 | 100 Steps 70.62


 92%|██████████████████████████████████████████████████████████████████████▊      | 9202/10000 [12:16<01:04, 12.46it/s]

Epoch 9199/10000 | Last 100 Reward 1977.0 | Last 100 Cheese 1935.5| W/D/L 5290/1119/2791 | 100 W/D/L 53/16/31 | 100 Steps 68.81


 93%|███████████████████████████████████████████████████████████████████████▋     | 9302/10000 [12:24<00:57, 12.05it/s]

Epoch 9299/10000 | Last 100 Reward 1988.0 | Last 100 Cheese 1948.5| W/D/L 5339/1134/2827 | 100 W/D/L 49/15/36 | 100 Steps 70.95


 94%|████████████████████████████████████████████████████████████████████████▍    | 9402/10000 [12:32<00:47, 12.50it/s]

Epoch 9399/10000 | Last 100 Reward 1980.0 | Last 100 Cheese 1933.0| W/D/L 5395/1145/2860 | 100 W/D/L 56/11/33 | 100 Steps 71.02


 95%|█████████████████████████████████████████████████████████████████████████▏   | 9502/10000 [12:40<00:38, 12.77it/s]

Epoch 9499/10000 | Last 100 Reward 1968.0 | Last 100 Cheese 1911.5| W/D/L 5450/1154/2896 | 100 W/D/L 55/9/36 | 100 Steps 69.74


 96%|█████████████████████████████████████████████████████████████████████████▉   | 9602/10000 [12:48<00:31, 12.74it/s]

Epoch 9599/10000 | Last 100 Reward 2041.0 | Last 100 Cheese 1961.5| W/D/L 5506/1168/2926 | 100 W/D/L 56/14/30 | 100 Steps 69.31


 97%|██████████████████████████████████████████████████████████████████████████▋  | 9702/10000 [12:56<00:23, 12.88it/s]

Epoch 9699/10000 | Last 100 Reward 2004.0 | Last 100 Cheese 1926.0| W/D/L 5561/1178/2961 | 100 W/D/L 55/10/35 | 100 Steps 69.66


 98%|███████████████████████████████████████████████████████████████████████████▍ | 9802/10000 [13:04<00:15, 12.38it/s]

Epoch 9799/10000 | Last 100 Reward 2046.0 | Last 100 Cheese 1959.0| W/D/L 5619/1192/2989 | 100 W/D/L 58/14/28 | 100 Steps 71.69


 99%|████████████████████████████████████████████████████████████████████████████▏| 9902/10000 [13:12<00:07, 13.00it/s]

Epoch 9899/10000 | Last 100 Reward 1996.0 | Last 100 Cheese 1945.0| W/D/L 5674/1202/3024 | 100 W/D/L 55/10/35 | 100 Steps 70.79


100%|████████████████████████████████████████████████████████████████████████████| 10000/10000 [13:20<00:00, 12.49it/s]

Epoch 9999/10000 | Last 100 Reward 2037.0 | Last 100 Cheese 1966.5| W/D/L 5740/1214/3046 | 100 W/D/L 66/12/22 | 100 Steps 68.31
Training done
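
The `New maximum rewards: ... Saving model...` messages in the log above suggest that the weights are checkpointed whenever the reward summed over the last 100 games reaches a new maximum. The actual logic lives in the training loop and is not reproduced here, so the following is only a minimal sketch of that idea under stated assumptions: `track_best` and its `window` parameter are hypothetical names, and the comment marks where a real loop would save the model.

```python
from collections import deque

def track_best(rewards, window=100):
    """Return (game_index, window_sum) each time the rolling sum sets a new maximum.

    Hypothetical helper for illustration; it does not come from rl.py.
    """
    recent = deque(maxlen=window)   # keeps only the last `window` rewards
    best = float("-inf")
    saves = []
    for i, r in enumerate(rewards):
        recent.append(r)
        if len(recent) < window:
            continue                # wait until the window is full
        total = sum(recent)
        if total > best:
            best = total
            saves.append((i, total))  # a real loop would save the weights here
    return saves

# Tiny demo with a window of 3 instead of 100
demo = track_best([1, 2, 3, 2, 5, 1], window=3)
print(demo)
```

Because the comparison is strict (`>`), several consecutive games can each trigger a save while the rolling sum keeps climbing, which would match the burst of saves between games 6998 and 7010.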





### Evaluate the Q-learner model

Load the best-performing weights saved during training.

In [31]:
# Reload the previously saved weights before evaluating
load = True

if load:
    model.load()

Evaluate the model over 1000 games.

In [33]:
epoch = 1000  # Number of evaluation games (epochs) to play

print("Testing")
play(model, epoch, criterion, optimizer, False)
print("Testing done")

Testing


 11%|████████▉                                                                      | 113/1000 [00:01<00:11, 74.67it/s]

Epoch 099/1000 | Last 100 Reward 2049.0 | Last 100 Cheese 1979.5| W/D/L 61/14/25 | 100 W/D/L 61/14/25 | 100 Steps 71.78


 21%|████████████████▊                                                              | 213/1000 [00:02<00:10, 77.84it/s]

Epoch 199/1000 | Last 100 Reward 2065.0 | Last 100 Cheese 1990.0| W/D/L 124/28/48 | 100 W/D/L 63/14/23 | 100 Steps 71.17


 31%|████████████████████████▏                                                      | 306/1000 [00:04<00:09, 74.62it/s]

Epoch 299/1000 | Last 100 Reward 2029.0 | Last 100 Cheese 1957.5| W/D/L 185/40/75 | 100 W/D/L 61/12/27 | 100 Steps 69.78


 41%|████████████████████████████████▋                                              | 413/1000 [00:05<00:07, 77.05it/s]

Epoch 399/1000 | Last 100 Reward 2026.0 | Last 100 Cheese 1950.0| W/D/L 245/50/105 | 100 W/D/L 60/10/30 | 100 Steps 69.3


 51%|████████████████████████████████████████▍                                      | 512/1000 [00:06<00:06, 77.93it/s]

Epoch 499/1000 | Last 100 Reward 2046.0 | Last 100 Cheese 1958.5| W/D/L 305/62/133 | 100 W/D/L 60/12/28 | 100 Steps 72.4


 61%|████████████████████████████████████████████████▌                              | 614/1000 [00:08<00:04, 78.48it/s]

Epoch 599/1000 | Last 100 Reward 2089.0 | Last 100 Cheese 2001.0| W/D/L 373/71/156 | 100 W/D/L 68/9/23 | 100 Steps 68.48


 71%|███████████████████████████████████████████████████████▊                       | 707/1000 [00:09<00:03, 78.20it/s]

Epoch 699/1000 | Last 100 Reward 2063.0 | Last 100 Cheese 1984.0| W/D/L 442/81/177 | 100 W/D/L 69/10/21 | 100 Steps 68.74


 81%|████████████████████████████████████████████████████████████████▎              | 814/1000 [00:10<00:02, 75.36it/s]

Epoch 799/1000 | Last 100 Reward 2066.0 | Last 100 Cheese 1980.0| W/D/L 508/93/199 | 100 W/D/L 66/12/22 | 100 Steps 71.07


 91%|███████████████████████████████████████████████████████████████████████▋       | 907/1000 [00:11<00:01, 78.01it/s]

Epoch 899/1000 | Last 100 Reward 2038.0 | Last 100 Cheese 1964.5| W/D/L 569/102/229 | 100 W/D/L 61/9/30 | 100 Steps 69.03


100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:13<00:00, 76.79it/s]

Epoch 999/1000 | Last 100 Reward 1986.0 | Last 100 Cheese 1920.5| W/D/L 625/113/262 | 100 W/D/L 56/11/33 | 100 Steps 71.83
Testing done
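
To read the evaluation result as rates rather than raw counts, the cumulative `W/D/L` field of the last log line can be converted into fractions. The snippet below is a small sketch that parses the final line printed above. The helper name `cumulative_rates` is an assumption introduced for illustration, and the regular expression relies on the log format shown in this output.

```python
import re

# Final evaluation line, copied from the output above
LINE = ("Epoch 999/1000 | Last 100 Reward 1986.0 | Last 100 Cheese 1920.5"
        "| W/D/L 625/113/262 | 100 W/D/L 56/11/33 | 100 Steps 71.83")

def cumulative_rates(line):
    """Return (win, draw, loss) fractions from the cumulative W/D/L field.

    Hypothetical helper: the first "| W/D/L a/b/c" field is the running tally,
    while "| 100 W/D/L" covers only the last 100 games and is not matched.
    """
    m = re.search(r"\| W/D/L (\d+)/(\d+)/(\d+)", line)
    wins, draws, losses = map(int, m.groups())
    total = wins + draws + losses
    return wins / total, draws / total, losses / total

win, draw, loss = cumulative_rates(LINE)
print(f"win {win:.1%}, draw {draw:.1%}, loss {loss:.1%}")
```

On the evaluation run this gives 62.5% wins, 11.3% draws and 26.2% losses over 1000 games. For comparison, the last training line (`W/D/L 5740/1214/3046`) corresponds to 57.4% wins over the 10000 training games.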



