
<style>
    @keyframes lavaEffect {
        0% { color: #00FFFF; } /* Cyan */
        25% { color: #0000CD; } /* MediumBlue */
        50% { color: #1E90FF; } /* DodgerBlue */
        75% { color: #00BFFF; } /* DeepSkyBlue */
        100% { color: #00FFFF; } /* Cyan */
    }

    #lava-text h1 {
        text-align: center;
        animation: lavaEffect 5s infinite;
        text-shadow: 0 0 10px #00FFFF, 0 0 20px #0000CD, 0 0 30px #1E90FF, 0 0 40px #00BFFF, 0 0 50px #00FFFF; /* Adding text shadow for lava glow effect */
        font-weight: bold;
        font-family: inherit; /* Inherit the font-family from the parent */
        font-size: 32px; /* Adjust the font-size as needed */
    }
</style>

<div id="lava-text">
    <h1>The Strategic Heuristic Algorithm with Zero-Human Advancement for Mastering Othello</h1>
</div>


<h3 style="color: darkblue; font-weight: bold;" >Introduction :</h3>

This notebook presents the development and evaluation of a Strategic Heuristic Algorithm with Zero-Human Advancement (SHA-ZHA) for mastering the game of Othello. 

Othello is the modern form of Reversi, a board game invented in 1883; both Lewis Waterman and John W. Mollett claimed to be its inventor. The contemporary version, which fixes the initial board configuration, was introduced in Japan in 1971 by Goro Hasegawa under the name Othello. It has since gained worldwide recognition and has been a mainstay of competitive tournaments since 1977.

<div style="text-align:center;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Othello-Standard-Board.jpg/500px-Othello-Standard-Board.jpg" alt="Othello Board" width="400" style="border-radius: 20px;"/>
</div>


The strategic depth of this two-player game poses a significant challenge for artificial intelligence, primarily because of its vast state space. The exact number of legal positions is unknown, but estimates range from 10^28 to 10^32. The upper end is roughly 100 times the estimated number of bacteria on Earth and 10^13 times the number of grains of sand on Earth. To appreciate this scale, imagine generating one billion Othello positions per second: enumerating 10^28 positions would take about 3×10^11 years, and enumerating 10^32 positions would take about 3×10^15 years, more than 200,000 times the age of the universe.
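
The enumeration time can be checked with a few lines of arithmetic. The sketch below assumes a Julian year of 365.25 days and an age of the universe of about 13.8 billion years; the function and constant names are illustrative and not part of the notebook.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # Julian year
AGE_OF_UNIVERSE_YEARS = 1.38e10         # assumption: about 13.8 billion years
RATE = 1e9                              # positions generated per second


def years_to_enumerate(num_positions, rate=RATE):
    """Years needed to generate `num_positions` positions at `rate` per second."""
    return num_positions / rate / SECONDS_PER_YEAR


for exponent in (28, 32):
    years = years_to_enumerate(10.0 ** exponent)
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"10^{exponent} positions: {years:.2e} years, "
          f"{ratio:,.0f} times the age of the universe")
```

Under these assumptions the lower estimate already takes longer than the universe has existed, and the upper estimate is four orders of magnitude beyond that.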

Inspired by [DeepMind's AlphaZero](https://arxiv.org/abs/1712.01815), which achieved superhuman performance in chess, shogi, and Go through self-play and reinforcement learning, this work employs similar principles to master Othello. SHA-ZHA makes use of parallel game simulation and [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) to allow the agent to learn and adapt through self-play without requiring any prior human understanding.

In [None]:
! pip install torchsummary tqdm tabulate matplotlib numba

<h3 style="color: darkblue; font-weight: bold;" >Package Descriptions :</h3>

- **torchsummary**: A package that provides a summary of the layers in a PyTorch model, similar to `model.summary()` in Keras.
  
- **tqdm**: A package for displaying progress bars in loops and other iterators, making it easy to track the progress of long-running operations.

- **tabulate**: A package for pretty-printing tabular data in various formats, including plain text, HTML, and LaTeX.

- **matplotlib**: A comprehensive library for creating static, animated, and interactive visualizations in Python.

- **numba**: A Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code, enabling high-performance operations.

In [16]:
import numpy as np
from torch import nn
import torch
from torchsummary import summary
from tqdm import tqdm
import os
from datetime import datetime
import time
from torch.optim.lr_scheduler import StepLR
from tabulate import tabulate
import matplotlib.pyplot as plt
import random
from torch.utils.data import  DataLoader
import torch.multiprocessing as mp
import copy
from _othello import OthelloRL,miniMax,OthelloBoard,asyncSelfPlay,parallelSelfPlay,PPO_Loss

<h3 style="color: darkblue; font-weight: bold;">Learning Rate Scheduler, Torch Multiprocessing, and `_othello` Module</h3>

<h5 style="color: darkblue; font-weight: bold;">Learning Rate Scheduler</h5>
The learning rate scheduler is a PyTorch tool that adjusts the learning rate during training, a common practice that can improve the performance and convergence of the model. One common scheduler is `StepLR`, which multiplies the learning rate by a factor `gamma` every `step_size` scheduler steps. Gradually lowering the learning rate helps prevent the model from overshooting minima and stabilizes training.
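
The decay rule that `StepLR` applies can be written out in plain Python. This is only a sketch of the formula, not PyTorch code; the helper name `step_lr` and the example values are hypothetical and are not the notebook's hyperparameters.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Illustrative values only: halve the learning rate every 10 scheduler steps.
for epoch in (0, 9, 10, 25, 30):
    print(epoch, step_lr(initial_lr=0.1, gamma=0.5, step_size=10, epoch=epoch))
```

With `step_size = 1`, as used later in this notebook, the rule reduces to a smooth exponential decay `initial_lr * gamma ** epoch`.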

<h5 style="color: darkblue; font-weight: bold;">Torch Multiprocessing</h5>
Torch multiprocessing allows for parallel processing in PyTorch, enabling faster and more efficient training, especially on multi-core processors. By using `torch.multiprocessing`, you can distribute the workload across multiple CPU cores or GPU devices. This is particularly useful for tasks like data loading and model training, where computational efficiency and speed are critical.

<h5 style="color: darkblue; font-weight: bold;">`_othello` Module</h5>
The `_othello` module is a comprehensive implementation for training an agent to master the game of Othello. It includes several components:

- **Proximal Policy Optimization (PPO)**: A reinforcement learning algorithm developed by OpenAI. PPO clips the probability ratio between the new and old policies, which limits how far a single update can move the policy and stabilizes training. **The Proximal Policy Optimization (PPO) loss** combines three components, weighted by a value coefficient $c_v$ and an entropy coefficient $c_e$:

$$
\text{PPO Loss} = \text{Policy Loss} + c_v \cdot \text{Value Loss} - c_e \cdot \text{Entropy Loss}
$$
<font color='green' size='2'>

1. **Policy Loss**: Measures the difference between the predicted action probabilities and the action probabilities that maximize the expected return.

2. **Value Loss**: Measures the difference between the predicted value function and the observed returns.

3. **Entropy Loss**: Measures the uncertainty or randomness in the agent's policy.

</font>
  
- **Multiprocessing**: The module runs multiple instances of self-play in parallel worker processes, which significantly speeds up data generation and therefore the learning process.
  
- **Othello Game Implementation**: A robust implementation of the Othello game, allowing for various configurations and scenarios for training the agent.
  
- **Minimax Engine**: The module includes a minimax engine to measure the performance of the trained agent. The minimax algorithm is a classical AI strategy for decision-making in two-player games, providing a baseline for evaluating the agent's performance.
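
To make the minimax idea concrete, here is a minimal sketch on a toy game tree. It is not the `miniMax` engine from `_othello`, whose internals are not shown here; the tree encoding (nested lists with numeric leaves) is an assumption made purely for illustration.

```python
def minimax(node, maximizing):
    """Return the minimax value of a game tree.

    A leaf is a number (its score from the maximizing player's point of view);
    an internal node is a list of child nodes.
    """
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# The maximizing player moves first, then the minimizing player replies.
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree, maximizing=True))  # 3
```

A depth-limited engine typically follows the same recursion but stops at a fixed depth and scores the position with a heuristic instead of a final result.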


In [23]:
# Hyperparameters
learning_rate = 0.5e-4 # learning rate
steps = 6000 # number of steps
batch_size = 256 #batch size
test_num = 250 # games played in each test point
test_frq = 25 # test every test_frq of steps
game_per_worker = 512 # number of games played in each step by one worker
num_workers = 16 # number of workers in parallel
epochs_per_step = 1 # number of epochs per step
board_base_size = 8 # base size of the board
num_boards = 1 # number of board sizes
load_check_point = False # load last check point
current_step = 1 # current step
path = "/checkpoint/" # path to save the results
epsilon_max = 0.05 # maximum value of epsilon
epsilon_min = 0.0 # minimum value of epsilon
scale_fac = 4 # control the rate of decay of Epsilon
engine_depth = 7 #depth of minimax engine
########################################
#model architecture
internal_channels = 64 # number of internal channels
policy_net_depth = 8 # depth of the policy network
########################################
# PPO loss parameters
value_coefficient = 0.5 # value coefficient
entropy_coefficient = 0.09 # entropy coefficient
clip_param = 0.2 # clip parameter
########################################
gamma_lr = np.exp(np.log(1/8)/(steps)) # per-step decay factor: LR falls to 1/8 of its initial value after `steps` steps
# np.exp(np.log(0.5)/x) reaches 50% after x steps
print(f"**decay factor is : {gamma_lr : .010f}, LR reaches 50% in {int(np.log(0.5)/np.log(gamma_lr))} steps.")
step_lr = 1 # step size for the learning rate
ref_path = "/workspace/shaza_old.pth" # path to reference model
print(f"number of games per step : {game_per_worker*num_workers}")
print(f"random seed : { torch.initial_seed()} ,numpy : {np.random.get_state()[1][0]}.")

**discout factor is :  0.9999306877, LR reaches 50% in 20000 steps.
number of games per step : 4096
randomness seed : 1034430421326500,numpy : 2724217580.


- Each step runs 16 parallel self-play processes, and each process generates 512 games, for a total of 8192 games per step. Assuming about 5 minutes per game at human speed, every step is equivalent to approximately 28.4 days of continuous gameplay.
- Over the 6000 steps of the training process, the agent experiences the equivalent of approximately 467 years of continuous gameplay.
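
The human-time equivalents follow directly from the hyperparameters. The short calculation below reproduces them, assuming 5 minutes per game and a 365.25-day year.

```python
MINUTES_PER_GAME = 5          # assumed average duration of a human game
games_per_step = 16 * 512     # num_workers * game_per_worker
steps = 6000

days_per_step = games_per_step * MINUTES_PER_GAME / 60 / 24
total_years = days_per_step * steps / 365.25

print(f"{games_per_step} games per step")
print(f"{days_per_step:.1f} days of continuous play per step")
print(f"{total_years:.0f} years of continuous play in total")
```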

<h3 style="color: darkblue; font-weight: bold;">Agent Network Architecture</h3>

The agent consists of two deep convolutional neural networks based on ConvNeXt. The first network is the policy network, which maps different boards into a probability distribution over actions. The second network is the value network, which provides an evaluation of the board.

<h5 style="color: darkblue; font-weight: bold;">ConvNeXt</h5>
ConvNeXt is a modern architecture for convolutional neural networks inspired by the design principles of vision transformers (ViTs). It combines the strengths of convolutional layers and transformer-like features, achieving state-of-the-art performance in various image recognition tasks. ConvNeXt does not invent new layer types; it combines established design elements such as:

- **Depthwise Convolutions**: These reduce the number of parameters and computations, making the network more efficient.
- **Layer Normalization**: This stabilizes and accelerates training by normalizing the inputs across the feature map dimensions.
- **Residual Connections**: These help in training deeper networks by mitigating the vanishing gradient problem.
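
The parameter saving from depthwise convolutions is easy to quantify. The sketch below compares a standard 3×3 convolution with a depthwise 3×3 convolution followed by a 1×1 pointwise convolution, using 64 input and 256 output channels as an example. The helper names are hypothetical, biases are ignored, and the comparison does not claim that the agent's layers are built this way.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


standard = conv_params(64, 256, 3)
separable = depthwise_separable_params(64, 256, 3)
print(standard, separable)  # 147456 16960
```

The standard count of 147,456 matches the `Conv2d-5` row in the model summary printed later in this notebook, which maps 64 channels to 256.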

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTTm384qquw_hbC0UIIhF8Jnr9gHvtNokBOCQ&s" alt="ConvNeXt" height="250" style="display: block; margin-left: auto; margin-right: auto; border-radius: 15px;"/>

<h5 style="color: darkblue; font-weight: bold;">Convolutional Block Attention Module (CBAM)</h5>
Another key component that significantly speeds up the learning process is the Convolutional Block Attention Module (CBAM). CBAM enhances the feature representation of the neural network by focusing on important information and suppressing irrelevant details. It consists of two sequential sub-modules:

- **Channel Attention Module**: This emphasizes informative channels and suppresses less useful ones by computing channel-wise attention.
- **Spatial Attention Module**: This enhances important spatial features and suppresses irrelevant ones by computing spatial attention maps.

By applying both channel and spatial attention, CBAM improves the network's ability to focus on crucial parts of the input, leading to better performance and faster convergence.

<img src="https://miro.medium.com/v2/resize:fit:1400/0*cvZx6H1aDsSgqQ1z" alt="CBAM" width="600" style="display: block; margin-left: auto; margin-right: auto; border-radius: 15px;"/>


In [28]:
# Set the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Your current device is: {device}.")
#####################
#PPO needs two policies: "new" is trained, "old" provides the reference action probabilities
#Create 2 instances of OthelloRL
agents = [OthelloRL(internal_channels= internal_channels,res_depth=policy_net_depth,name = "new",return_val = True),
          OthelloRL(internal_channels= internal_channels,res_depth=policy_net_depth,name = "old",return_val = True)]
####################
if load_check_point:
  try:
    agents[0].load(os.path.join(path,f"shaza_{agents[0].name}.pth"))
    agents[1].load(os.path.join(path,f"shaza_{agents[1].name}.pth"))
  except Exception as e:
    print(e)
# load the reference model
ref = OthelloRL()
ref = ref.to(device)
ref.load(path = ref_path)
ref.name = "refernce"

#best checkpoint
best_model = copy.deepcopy(agents[0])
best_model.name = "best_model"

# set name
# Move the agents to the device
agents[0] = agents[0].to(device)
agents[1] = agents[1].to(device)

################
agents[0].train()
agents[1].eval()
ref.eval() # set to evaluation mode
# Define the loss function and optimizer
criterion = PPO_Loss(clip_param = clip_param,value_coefficient = value_coefficient,entropy_coefficient = entropy_coefficient,echo=True)
#######################################
optim = torch.optim.AdamW(agents[0].parameters(), lr=learning_rate)


if load_check_point:
    try:
        optim.load_state_dict(torch.load(os.path.join(path,f"optimizer.pth"),map_location=device)["optimizer_state_dict"])
    except Exception as e:
        print(e)
# load scheduler

scheduler = StepLR(optim,step_size= step_lr,gamma= gamma_lr)


if load_check_point:
    try:
        scheduler.load_state_dict(torch.load(os.path.join(path,f"scheduler.pth"),map_location=device)["scheduler_state_dict"])
    except Exception as e:
        print(e)
# Lists to store training metrics
win_trace = {f"Agent_vs_Random" : [], # store the win trace for each
             f"Agent_vs_Engine":[],
             f"{agents[0].name}_vs_{best_model.name}": [],
             f"Agent_vs_{ref.name}" : [],
             }
loss = {"total_loss":[],
        "policy_loss":[],
        "value_loss":[],
        "entropy_loss":[],
        } # store loss from each epoch

time_step = [] # store the time in sec for each epoch

if load_check_point:
    try :
        data = np.load(os.path.join(path,"log.npy"),allow_pickle=True).item()
        win_trace = data["win trace"]
        loss = data["total loss"] # restore the loss history under the key used when the log is saved
        time_step = data["avg time ber epoch"]
        current_step = data["step"] + 1
        print("check point has been loaded")
    except Exception as e :
        print(e)


print(f"summary of model : {agents[0].name}")
summary(agents[0], [(2,8,8), (1,8,8)])
print("\n\n\n\n\n\n")
#print(f"summary of model : {agents[1].name}")
#summary(agents[1], [(2,8,8), (1,8,8)])

Your current device is: cuda.
Model A wasn't found : [Errno 2] No such file or directory: '/checkpoint/shaza_old.pth'
summary of model : new
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1             [-1, 64, 8, 8]           1,152
         GroupNorm-2             [-1, 64, 8, 8]             128
         LeakyReLU-3             [-1, 64, 8, 8]               0
          Identity-4             [-1, 64, 8, 8]               0
            Conv2d-5            [-1, 256, 8, 8]         147,456
         GroupNorm-6            [-1, 256, 8, 8]             512
 AdaptiveAvgPool2d-7            [-1, 256, 1, 1]               0
            Conv2d-8             [-1, 32, 1, 1]           8,224
         LeakyReLU-9             [-1, 32, 1, 1]               0
           Conv2d-10            [-1, 256, 1, 1]           8,448
          Sigmoid-11            [-1, 256, 1, 1]               0
AdaptiveMaxPool2d-12      

<h5 style="color: darkblue; font-weight: bold;">Exploration Strategies</h5>
To increase exploration and reduce the bias that comes from always selecting the action with the highest probability, I added epsilon-greedy exploration on top of the Boltzmann exploration that Proximal Policy Optimization (PPO) already performs by sampling from its policy.

- **Epsilon-Greedy Exploration**: In this strategy, with a small probability (epsilon), a random action is selected instead of the action with the highest probability. This helps in exploring the action space more thoroughly and avoids getting stuck in local optima.
- **Boltzmann Exploration**: PPO samples actions from the probability distribution produced by the policy network instead of always taking the most likely action. This promotes exploration by allowing a diverse range of actions in proportion to their probabilities.

Combining these exploration strategies ensures a balance between exploration and exploitation, leading to a more robust learning process.
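
One way to combine the two strategies is sketched below: with probability epsilon a uniformly random legal move is played, otherwise a move is sampled from the softmax distribution over the legal moves. How `parallelSelfPlay` combines them internally is not shown in this notebook, so the function names and the exact combination are assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, legal_moves, epsilon, rng=random):
    """Epsilon-greedy on top of Boltzmann sampling, restricted to legal moves."""
    if rng.random() < epsilon:
        return rng.choice(legal_moves)      # uniform random exploration
    probs = softmax([logits[a] for a in legal_moves])
    return rng.choices(legal_moves, weights=probs, k=1)[0]


rng = random.Random(0)                      # seeded for reproducibility
action = select_action([2.0, 0.5, -1.0, 0.1], legal_moves=[0, 1, 3],
                       epsilon=0.05, rng=rng)
print(action)
```

Restricting the softmax to legal moves plays the same role as the mask input that the agent receives alongside the board.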

In [29]:
def calcEpsilon(ep):

    return (epsilon_max - epsilon_min)*np.exp(-1*(ep*scale_fac)/steps) + epsilon_min
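
To see how fast this schedule decays, the standalone sketch below evaluates the same formula with the hyperparameters defined earlier. It uses `math` instead of NumPy so that it runs on its own; the constant and function names are illustrative.

```python
import math

EPSILON_MAX, EPSILON_MIN = 0.05, 0.0   # values from the hyperparameter cell
SCALE_FAC, STEPS = 4, 6000


def calc_epsilon(step):
    """Exponential decay from EPSILON_MAX towards EPSILON_MIN."""
    decay = math.exp(-step * SCALE_FAC / STEPS)
    return (EPSILON_MAX - EPSILON_MIN) * decay + EPSILON_MIN


for step in (0, 1500, 3000, 6000):
    print(step, round(calc_epsilon(step), 5))
```

Epsilon starts at 0.05 and shrinks by a factor of e every 1500 steps, ending near 0.0009 at the final step, so random moves become rare late in training.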

Now, for testing, let's stop the agents from returning the value estimate.

In [30]:
agents[0].return_val = False
agents[1].return_val = False
best_model.return_val = False

In [31]:
def Test(agent_A,agent_B = None, test_num = test_num,board_base_size = board_base_size,num_boards = num_boards):
        # model to evaluation mode
        agent_A.cpu().eval()
        agent_A.return_val = False
        if agent_B :
          agent_B.cpu().eval()
          agent_B.return_val = False

        score = np.array([0,0])

        for i in tqdm(range(test_num),desc = f"model {agent_A.name} vs { agent_B.name if agent_B else 'random' }"):
            # change the size of the board from 8 to 10 to 12 and so on ...
            board_size = board_base_size + 2*(i%num_boards) # dynamic board size
            Board = OthelloBoard()
            #########################################################
            if agent_B :
                score += Board.modelVsModel(agent_A,agent_B)
            else :
                score += Board.modelVsModel(agent_A,agent_A,Epsilon_2=1.0)

        return score

def RunTests(Q):
        # testing the model against random and engine
        #####################################################################
        score_random = Test(Q)
        win_trace['Agent_vs_Random'].append(round((score_random[0] / (score_random[0] + score_random[1])) * 100, 4))
        # test model against pretrained reference model
        ######################################################################
        score_ref = Test(Q,ref, test_num = test_num)
        win_trace[f"Agent_vs_{ref.name}"].append(round((score_ref[0] / (score_ref[0] + score_ref[1])) * 100, 4))
        # test model against engine
        #####################################################################
        engine = miniMax(name = "E",depth = engine_depth)
        score_engine = Test(Q,engine, test_num = test_num//2)
        win_trace[f"Agent_vs_Engine"].append(round((score_engine[0] / (score_engine[0] + score_engine[1])) * 100, 4))


        results = [
            (f"{Q.name} vs random", f"{win_trace['Agent_vs_Random'][-1]}%"),
            (f"{Q.name} vs {ref.name}", f"{win_trace[f'Agent_vs_{ref.name}'][-1]}%"),
            (f"{Q.name} vs {engine.name}", f"{win_trace[f'Agent_vs_Engine'][-1]}%"),
        ]

        # Display table
        table = tabulate(results, headers=["Comparison", f"Win Rate for {Q.name}"])
        print(table)

In [32]:
if __name__ == "__main__":

    os.environ['MKL_THREADING_LAYER'] = 'GNU'
    mp.set_start_method('spawn', force=True)
    result = parallelSelfPlay(model = agents[0],device = device)
    print(len(result))

Started : self playing worker -0
Started : self playing worker -1
Started : self playing worker -2
Started : self playing worker -3
Finished : self playing worker -0
Finished : self playing worker -1
Finished : self playing worker -2
Finished : self playing worker -3
1652


<h3 style="color: darkblue; font-weight: bold;">Testing the Agent</h3>

The baseline untrained agent will undergo a series of tests to evaluate its initial performance. The tests will be conducted against three different opponents:

1. **Random Policy**: This opponent selects actions randomly without any strategic consideration. It serves as a basic benchmark to assess the untrained agent's performance against a completely unstructured strategy.
   
2. **Another Trained Agent**: This opponent is a version of the agent that has undergone training. Testing against a trained agent helps in understanding how the untrained agent fares against a more sophisticated and learned strategy.
   
3. **Minimax Algorithm**: This opponent uses the minimax algorithm with a selected depth to make decisions. The minimax algorithm is a classical AI strategy for decision-making in two-player games, which provides a more structured and competitive challenge. The selected depth determines the lookahead search depth of the algorithm, balancing between computational efficiency and strategic depth.

Tests will continue during the training phase to continuously evaluate the model.

In [33]:
RunTests(agents[0])

model new vs random:   0%|          | 0/250 [00:00<?, ?it/s]

model new vs random: 100%|██████████| 250/250 [00:55<00:00,  4.54it/s]
model new vs refernce: 100%|██████████| 250/250 [01:21<00:00,  3.07it/s]
model new vs E: 100%|██████████| 125/125 [00:58<00:00,  2.13it/s]

Comparison       Win Rate for new
---------------  ------------------
new vs random    35.6846%
new vs refernce  20.8333%
new vs E         8.0645%





<h3 style="color: darkblue; font-weight: bold;">Training the Model with PPO</h3>

The goal of the `fit_ppo` function is to train the model using the Proximal Policy Optimization (PPO) loss function based on boards and returns generated from simulations. As mentioned earlier, the PPO loss function consists of three components:

1. **Policy Loss**: This measures the difference between the predicted action probabilities and the action probabilities that maximize the expected return. The policy loss helps the agent to learn the optimal policy by adjusting the probabilities of taking certain actions.

$$
\text{Policy Loss} = -\mathbb{E}_{t} \left[ \min \left( \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t, \, \text{clip} \left( \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]
$$

where:
- $\pi_{\theta}$ represents the current policy.
- $\pi_{\theta_{\text{old}}}$ represents the old policy before the update.
- $a_t$ is the action taken at time $t$.
- $s_t$ is the state at time $t$.
- $\hat{A}_t$ is the advantage estimate at time $t$.
- $\epsilon$ is the clipping parameter.

2. **Value Loss**: This measures the difference between the predicted value function and the observed returns. The value loss helps the agent to accurately estimate the value of different states, which is crucial for making informed decisions.

$$
\text{Value Loss} = \mathbb{E}_{t} \left[ \left( V_{\theta}(s_t) - R_t \right)^2 \right]
$$

where:
- $V_{\theta}(s_t)$ is the predicted value function for state $s_t$.
- $R_t$ is the observed return at time $t$.

3. **Entropy Loss**: This measures the uncertainty or randomness in the agent's policy. The entropy loss encourages exploration by preventing the policy from becoming too deterministic, thus promoting a more robust learning process.

$$
\text{Entropy Loss} = -\mathbb{E}_{t} \left[ \sum_{a} \pi_{\theta}(a | s_t) \log \pi_{\theta}(a | s_t) \right]
$$

where:
- $\pi_{\theta}(a | s_t)$ is the probability of taking action $a$ in state $s_t$ under the current policy.


4. **Total Loss**: This is the combined loss function used to update the model parameters. It incorporates the policy loss, value loss, and entropy loss, balanced by their respective coefficients.

$$
\text{Total Loss} = \text{Policy Loss} + c_v \cdot \text{Value Loss} - c_e \cdot \text{Entropy Loss}
$$

where:
- $c_v$ is the value loss coefficient.
- $c_e$ is the entropy loss coefficient.
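
The four formulas above can be put together in a few lines. The sketch below works on plain Python lists and takes the advantage estimates as an input; it is not the `PPO_Loss` class from `_othello`, whose implementation (including how it derives advantages from values and returns) is not shown here. The default coefficients mirror the hyperparameter cell.

```python
import math


def ppo_loss(new_probs, old_probs, actions, advantages, values, returns,
             clip_param=0.2, value_coefficient=0.5, entropy_coefficient=0.09):
    """Clipped PPO loss for a batch of samples.

    new_probs / old_probs: per-sample action distributions (lists of floats).
    actions: index of the action taken in each sample.
    Returns (total, policy_loss, value_loss, entropy), each averaged over the batch.
    """
    n = len(actions)
    policy_loss = value_loss = entropy = 0.0
    for p_new, p_old, a, adv, v, ret in zip(new_probs, old_probs, actions,
                                            advantages, values, returns):
        ratio = p_new[a] / p_old[a]
        clipped = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
        policy_loss -= min(ratio * adv, clipped * adv) / n
        value_loss += (v - ret) ** 2 / n
        entropy -= sum(p * math.log(p) for p in p_new if p > 0.0) / n
    total = (policy_loss + value_coefficient * value_loss
             - entropy_coefficient * entropy)
    return total, policy_loss, value_loss, entropy


total, pl, vl, ent = ppo_loss(
    new_probs=[[0.6, 0.4]], old_probs=[[0.4, 0.6]],
    actions=[0], advantages=[1.0], values=[0.5], returns=[1.0])
print(total, pl, vl, ent)
```

In this example the probability ratio is 1.5, which exceeds $1 + \epsilon = 1.2$, so the clipped term is used and the policy loss is $-1.2$ rather than $-1.5$. This is the mechanism that keeps a single update from moving the policy too far.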


In [34]:
def fit_ppo(data_loader,agents=agents,loss_dict=loss,optim=optim,device=device,scheduler=scheduler,echo=True):

      agents[0].to(device).train()  # model to training mode
      agents[1].to(device).eval()  # model to evaluation mode
      ##############################
      model_clone = copy.deepcopy(agents[0]) # later the old model
      ##############################
      agents[0].return_val = True
      agents[1].return_val = False
      ##############################
      avg_total_loss = 0.0
      avg_policy_loss = 0.0
      avg_value_loss = 0.0
      avg_entropy_loss = 0.0
      
      for ep in range(epochs_per_step):
          optim.zero_grad() # gradients accumulate over all batches; one optimizer step per epoch
          for batch_index,((x,mask),act,ret) in enumerate(tqdm(data_loader)):
              
              X = x.to(device)
              Mask = mask.to(device)
              #####################
              policy_new,values= agents[0](X,Mask) # output from the model under training
              with torch.no_grad(): # the old policy is a fixed reference, so it needs no gradients
                  policy_old = agents[1](X,Mask)
              loss,policy_loss,value_loss,entropy_loss = criterion(policy_old,policy_new,
                                                                   act.to(device),values,ret.to(device)) # calculating ppo loss
              loss.backward()
              ############################
              avg_total_loss += loss.item()
              avg_policy_loss += policy_loss.item()
              avg_value_loss += value_loss.item()
              avg_entropy_loss += entropy_loss.item()  

          optim.step()

      num_batch = len(data_loader)
      total_ep = epochs_per_step*num_batch # total number of batches processed
      #############################
      loss_dict['total_loss'].append(avg_total_loss/total_ep)
      loss_dict['policy_loss'].append(avg_policy_loss/total_ep)
      loss_dict['value_loss'].append(avg_value_loss/total_ep)
      loss_dict['entropy_loss'].append(avg_entropy_loss/total_ep)
      if echo :
          print(f"data size : {num_batch*batch_size} states.")
          print(f"Avg total loss : {loss_dict['total_loss'][-1]}, Avg policy loss : {loss_dict['policy_loss'][-1]}.")
          print(f"Avg value loss : {loss_dict['value_loss'][-1]}, Avg entropy loss : {loss_dict['entropy_loss'][-1]}.")

      ########################################################
      scheduler.step()
      model_clone.name = "old"
      agents = [agents[0],model_clone]
      #######################################################
      agents[0].return_val = False
      agents[1].return_val = False # only policy head works

      return agents,loss_dict,optim,scheduler

<h3 style="color: darkblue; font-weight: bold;">Simulation and Training Loop</h3>

The simulation/training loop consists of the following steps:

1. **Parallel Self-Play**: The `parallelSelfPlay` function generates many simulated games concurrently across several worker processes, running the model on the selected device. It takes the current agent and produces trajectories of gameplay experience.
  
2. **Update Agent's Weights**: After generating trajectories from self-play, the agent's weights are updated using the collected data. The `fit_ppo` function is called to train the agent using the Proximal Policy Optimization (PPO) loss function. This involves optimizing the combined loss function to update both the policy and value networks, helping the agent to learn the optimal strategy for playing Othello.
  
3. **Testing**: Once the agent's weights have been updated, the agent is tested against various opponents to evaluate its performance. Testing involves playing against opponents such as a random policy, another trained agent, and the minimax algorithm with a selected depth. The test results provide insights into the agent's progress and performance improvements over time.


In [None]:
try :
    avg_time = 0.0
    avg_loss = 0.0
    # 1 => steps
    for step in range(current_step, steps + 1):
        start = time.time()  # start counting time for the loop
        agents[0].to(device).eval() # move the agents to the selected device for self-play
        agents[1].to(device).eval()
        ################
        agents[0].return_val = False
        agents[1].return_val = False # only policy head works
        ################
        # generating data from self playing
        print("-------------------------------------------------------------")
        print(f"Step Num : {step}/{steps}.")
        print("-------------------------------------------------------------")
        data = parallelSelfPlay(model=agents[0],device=device,workers=num_workers,
                                Epsilon = calcEpsilon(step),boltzmann=True,game_per_worker=game_per_worker)
        # training loader
        train_loader = DataLoader(data, batch_size=batch_size, shuffle=True, drop_last=True)
        # train the model
        print("")
        agents,loss,optim,scheduler = fit_ppo(data_loader=train_loader,agents=agents,loss_dict=loss,optim=optim,
                                                   device=device,scheduler=scheduler)
        agents[0].cpu().eval()
        agents[1].cpu().eval()

        avg_time += (time.time() - start) / test_frq  # average time per step over the last test_frq steps

        print(f"time stamp : {datetime.now()}")

        if step % test_frq == 0:
              print("\ntesting...\n")
              RunTests(agents[0])
              score = Test(agents[0],best_model) # new vs best model so far
              print(f"model {agents[0].name} vs {best_model.name}:  {(score[0] / (score[0] + score[1]) * 100):.04f} %.")
              win_trace[f"{agents[0].name}_vs_{best_model.name}"].append((score[0] / (score[0] + score[1]) * 100))
              ##########################################
              time_step.append(avg_time)
              ###########################################
              data = {
                  "win trace": win_trace,
                  "total loss": loss,
                  "step": step,
                  "time stamp": str(datetime.now()),
                  "avg time ber epoch": time_step,
              }
              if win_trace[f"{agents[0].name}_vs_{best_model.name}"][-1] > 50.0 :
                  best_model = copy.deepcopy(agents[0])
                  best_model.name = "best_model"
                  # save models
                  for agent in agents:
                      agent.save(os.path.join(path, f"shaza_{agent.name}.pth"))
                  # save optimizer and scheduler state once, outside the loop
                  torch.save({
                      "optimizer_state_dict": optim.state_dict(),
                  }, os.path.join(path, "optimizer.pth"))
                  torch.save({
                      "scheduler_state_dict": scheduler.state_dict(),
                  }, os.path.join(path, "scheduler.pth"))

                  # save log
                  np.save(os.path.join(path, f"log.npy"), data)
        
              # reset vars
              avg_time = 0.0
              avg_loss = 0.0

except Exception as e:
    log = {
        "message" : f"{e}",
        "time stamp": str(datetime.now()),
    }
    print(log)
    np.save(os.path.join(path, f"errorlog.npy"),log)

<h3 style="color: darkblue; font-weight: bold;">Conclusion</h3>

Congratulations on completing this notebook journey! Happy coding, and may your algorithms always find success! 🚀🤖

Best regards,  
[Mohammed Yahya Yousif](https://www.linkedin.com/in/mohammed-yousif-6b272a241/)  
GitHub: [mohammed-tech-innovator](https://github.com/mohammed-tech-innovator)  
Website: [tech-innovator.me](https://tech-innovator.me/)
