# Chapter 10: Introduction to Deep Learning

Starting from this chapter, you'll learn a new AI paradigm: machine learning. Instead of hard-coding the rules, machine learning takes in input-output pairs and figures out the relation between the inputs (which we call features) and outputs (the labels). One field of machine learning, deep learning, has attracted much attention in recent years. The algorithm used by AlphaGo is based on deep reinforcement learning, which is a combination of deep learning and reinforcement learning.

Deep learning is based on artificial neural networks. A neural network consists of an input layer, some hidden layers, and an output layer. In this chapter, we'll learn to use deep neural networks to design game strategies for Tic Tac Toe. The neural network we create includes both dense layers and convolutional layers. While dense layers treat the input as a one-dimensional vector, convolutional layers can process two-dimensional inputs such as images or game boards. As a result, convolutional layers can extract spatial features in images and game boards. This, in turn, has greatly improved the power of deep neural networks (DNNs). You'll learn to treat the Tic Tac Toe game board as a two-dimensional image, extract spatial features from it, associate these features with game outcomes, and use them to design intelligent game strategies.

You'll use simulated games as input data to feed into a deep neural network with a convolutional layer and some dense layers. After the model is trained, you'll use it to play games. At each step of the game, you'll look at all possible next moves. The trained model predicts the probability of winning the game with each hypothetical move. You'll pick the move with the highest probability of winning the game for the agent.

***
$\mathbf{\text{Create a subfolder for files in Chapter 10}}$<br>
***
We'll put all files in Chapter 10 in a subfolder /files/ch10. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch10", exist_ok=True)

# 1. What Are Neural Networks?

This section discusses the basic structure of a neural network.

## 1.1. Elements of A Neural Network

A neural network consists of one input layer, one output layer, and a number of hidden layers. In general, each layer in a neural network has one or more neurons. Neural networks with two or more hidden layers are usually called deep neural networks. 

There are different types of layers in a neural network. The most common type of layer is the dense layer, in which each neuron is connected to every neuron in the previous layer.

Convolutional layers, in contrast, treat the input as a two-dimensional image and extract patterns from the input data.

## 1.2. Activation Functions
In artificial neural networks, activation functions transform inputs into outputs. As
the name suggests, the activation functions activate the neuron when the input reaches a certain threshold. Simply put, activation functions are on-off switches in artificial neural networks. These on-off switches play an important role in making artificial neural networks powerful. The activation functions allow a network to learn more complex patterns in the data. Without activation functions, neural networks can only learn linear relationships in the data.


Activation functions help us create a nonlinear relationship between the inputs and outputs. Without them, no matter how many hidden layers we add to the neural network, we can only approximate linear relations: a linear transformation of a linear relationship is still linear.
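To make the last point concrete, here is a minimal sketch in plain Python. The weight matrices `W1` and `W2`, the input `x`, and the helper `matmul()` are made-up names and values for illustration only. It shows that two layers without activation functions give the same result as one layer whose weights are the product of the two.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W1 = [[1, 2], [3, 4]]    # weights of the first layer (illustrative values)
W2 = [[0, 1], [-1, 2]]   # weights of the second layer (illustrative values)
x = [[5], [6]]           # a single input, written as a column vector

# Pass x through layer 1, then through layer 2 (no activation in between)
two_layers = matmul(W2, matmul(W1, x))
# Collapse both layers into one weight matrix and apply it once
one_layer = matmul(matmul(W2, W1), x)

print(two_layers)
print(one_layer)
```

Both print statements show the same column vector, so the extra layer adds no modeling power unless a nonlinear activation sits between the two.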

ReLU is short for rectified linear unit activation function. It returns the original value if it’s positive, and 0 otherwise. It has the mathematical formula of 
$$ReLU(x)=\begin{cases}x & \text{if } x>0\\ 0 & \text{if } x\leq 0\end{cases}$$

It’s widely used in many neural networks, and you’ll see it in this book more often
than any other type of activation function.
In essence, the ReLU activation function activates the neuron when the value of x
reaches the threshold value of zero. When the value of x is below zero, the neuron is
switched off. This simple on-off switch is able to create a nonlinear relation between
inputs and outputs.
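As a small illustration, the ReLU rule can be written in one line of plain Python. The function name `relu()` and the test values are chosen here for illustration; in practice Keras applies ReLU for you when you pass `activation="relu"` to a layer.

```python
def relu(x):
    """Return x if it is positive, and 0 otherwise."""
    return max(0.0, x)

# Negative inputs are switched off; positive inputs pass through unchanged
for value in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(value, relu(value))
```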

Another commonly used activation function is the sigmoid function. It’s widely used in many machine learning models. In particular, it’s a must-have in any binary classification problem.
The sigmoid function has the form
$$y=\frac {1} {1+e^{-x}} $$
The sigmoid function has an S-shaped curve. It has this nice property: for any value
of input x between −∞ and ∞, the output value y is always between 0 and 1. Because
of this property, we use the sigmoid activation function to model the probability of
an outcome, which also falls between 0 and 1 (0 means there is no chance of the
outcome occurring, while 1 means the outcome occurs with 100% certainty).
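The following sketch evaluates the sigmoid formula at a few points, using only Python's `math` module. The function name `sigmoid()` and the sample inputs are illustrative choices.

```python
import math

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1
for value in [-10, -2, 0, 2, 10]:
    print(value, round(sigmoid(value), 4))
```

An input of 0 maps to exactly 0.5, which is why 0.5 is a common cutoff in binary classification.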

The third most-used activation function in this book is the softmax function. It’s a
must-have in any multi-category classification problem.
The softmax function has the form
$$y(x)=\frac {e^{x}} {\sum_{k=1}^{K}e^{x_k}}$$
where $x=[x_1,x_2,...,x_K]$ and $y=[y_1,y_2,...,y_K]$ are K-element lists. The i-th element of $y$ is 
$$y_i(x)=\frac {e^{x_i}} {\sum_{k=1}^{K}e^{x_k}}$$ 
The softmax function has a nice property: each element in the output vector $y$ is always between 0 and 1. Further, elements in the output vector $y$ sum up to 1. Because of this property, we use the softmax activation function to model the probability of a multiple outcome event. Therefore, the activation function in the output layer is always the softmax function when we model multi-class classification problems.  
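Here is a minimal sketch of the softmax formula in plain Python. The function name `softmax()` and the three input scores are made up for illustration. Subtracting the largest score before exponentiating is a common numerical safeguard and does not change the result.

```python
import math

def softmax(xs):
    """Turn a list of K scores into K probabilities that sum to 1."""
    m = max(xs)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and the three probabilities add up to 1 (up to floating-point rounding).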

Finally, we'll also use the tanh activation function when we train the actor-critic models later in this book. The tanh activation function is similar to the sigmoid activation function in the sense that it is also S-shaped. However, the output from the tanh activation function is in the range of -1 to 1 instead of from 0 to 1. The tanh activation function has the form
$$y=\frac {2} {1+e^{-2x}} -1$$
In multiplayer games, the game outcome for a player is a number between -1 and 1: -1 means the player has lost the game, 1 means the player has won the game, and 0 means the game is tied.
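As a quick check of the tanh formula, the sketch below compares it with Python's built-in `math.tanh()`. The function name `tanh_from_formula()` is an illustrative choice.

```python
import math

def tanh_from_formula(x):
    """Compute tanh with the formula y = 2 / (1 + e^(-2x)) - 1."""
    return 2 / (1 + math.exp(-2 * x)) - 1

# The two columns agree, and every value lies between -1 and 1
for value in [-3, -1, 0, 0.5, 2]:
    print(value, tanh_from_formula(value), math.tanh(value))
```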

## 1.3. Loss Functions
The loss function in ML is the objective function in the mathematical optimization process. Intuitively, the loss function measures the forecasting error of the machine learning algorithm. 
By minimizing the loss function, the machine learning model finds parameter values that
lead to the best predictions. 

The most commonly used loss function is mean squared error (MSE). MSE is defined as $$MSE= \frac{1}{N} \sum_{n=1} ^{N} (Y_n-\hat{Y}_n)^2$$

where $Y_n$ is the actual value of the target variable (i.e., the label) and $\hat{Y}_n$ is the predicted value of the target variable. 
To calculate MSE, we look at the forecasting error: the difference between the model’s predictions and the actual values. We then square the forecasting error for each observation, and average it across all observations. In short, it is the average squared forecasting error across observations.
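A short worked example may help. The sketch below computes MSE for four made-up observations; the function name `mse()` and the numbers are illustrative.

```python
def mse(actual, predicted):
    """Average of the squared differences between labels and predictions."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

actual = [1.0, 0.0, -1.0, 1.0]       # illustrative labels
predicted = [0.5, 0.0, -1.0, 0.0]    # illustrative predictions

# Squared errors are 0.25, 0, 0, and 1, so the average is 1.25 / 4
print(mse(actual, predicted))
```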


In binary classification problems, the preferred loss function is the binary cross-entropy function, which measures the average difference between the predicted probabilities and the actual labels (1 or 0). If a model makes a perfect prediction and assigns a 100% probability to all observations labeled 1 and a 0% probability to all observations labeled 0, the binary cross-entropy loss function will have a value of 0. 

Mathematically, the binary cross-entropy loss function is defined as 
$$BinaryCrossEntropy= \sum_{n=1} ^{N} -[Y_n\times \log(\hat{Y}_n) + (1-Y_n)\times \log(1-\hat{Y}_n)]$$
where $\hat{Y}_n$ is the estimated probability of observation n being class 1, and $Y_n$ is the actual label of observation n (which is either 0 or 1).
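The sketch below evaluates this formula on a few made-up predictions. The function name `binary_cross_entropy()` and the small constant `eps` are illustrative choices: `eps` keeps the probabilities away from exactly 0 and 1 so that the logarithm is always defined. The sketch sums over observations as in the formula above; libraries such as Keras report the mean instead, which only rescales the value by 1/N.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Sum of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

labels = [1, 0]
good = binary_cross_entropy(labels, [0.9, 0.1])     # confident and correct
bad = binary_cross_entropy(labels, [0.1, 0.9])      # confident and wrong
perfect = binary_cross_entropy(labels, [1.0, 0.0])  # perfect predictions
print(good, bad, perfect)
```

The loss is close to 0 for perfect predictions and grows quickly when the model is confident but wrong.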

The preferred loss function to use in multi-category classifications is the categorical cross-entropy loss function. It measures the average difference between the predicted distribution and the actual distribution. 

Mathematically, the categorical cross-entropy loss function is defined as 
$$Categorical\ Cross\ Entropy= \sum_{n=1} ^{N}\sum_{k=1} ^{K} -y_{n,k}\times \log(\hat{y}_{n,k})$$
where $\hat{y}_{n,k}$ is the estimated probability of observation n being class k, and $y_{n,k}$ is the actual label of observation n belonging to category k (which is either 0 or 1).
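Here is a minimal sketch of the same formula for two made-up observations with three classes. The function name `categorical_cross_entropy()`, the constant `eps`, and the numbers are illustrative. Because the labels are one-hot, only the predicted probability of the true class contributes to the loss.

```python
import math

def categorical_cross_entropy(one_hot_labels, predicted, eps=1e-12):
    """Sum of -y*log(p) over all observations and all classes."""
    total = 0.0
    for y_row, p_row in zip(one_hot_labels, predicted):
        for y, p in zip(y_row, p_row):
            total += -y * math.log(max(p, eps))  # eps avoids log(0)
    return total

one_hot_labels = [[1, 0, 0],
                  [0, 0, 1]]
predicted = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6]]

# Only the probabilities 0.7 and 0.6 of the true classes enter the loss
loss = categorical_cross_entropy(one_hot_labels, predicted)
print(loss)
```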

# 2.  Deep Learning Game Strategies in Tic Tac Toe
In this chapter, you'll use a deep neural network to train intelligent game strategies for Tic Tac Toe. In particular, you’ll use the convolutional neural network to train the game strategy. By treating the game board as a two-dimensional image instead of a one-dimensional vector, you’ll greatly improve the intelligence of your game strategies.

You’ll learn how to prepare data to train the model, how to interpret the prediction from the model, how to use the prediction to play games, and how to check the efficacy of your game strategies.

## 2.1. A Summary of the Game Strategy for Tic Tac Toe
Here is a summary of what we’ll do to train the game strategy:

1.	We’ll let two computer players play a game with random moves, and record the whole game history. The game history will contain all the game board positions from the very first move to the very last move.
2.	We then associate each board position with a game outcome (a win, a tie, or a loss). We'll use the game board position as features X, and the outcome as labels y. We'll treat this as a multi-category classification problem since there are three possible outcomes associated with each board position: a win, a tie, or a loss.
3.	We’ll simulate 100,000 games. By using the histories of the games and the corresponding outcomes as Xs and ys, we feed the data into a deep neural network. After the training is done, we have a trained model.
4.	We can now use the trained model to play a game. At each move of the game, we look at all possible next moves, and feed each hypothetical game board into the pretrained model. The model will tell you the probabilities of a win, a loss, and a tie.
5.	You select the move that the model predicts with the highest chance of winning for the current player.
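Steps 4 and 5 can be sketched with made-up numbers. In the snippet below, the dictionary `predictions` stands in for the model's output on three candidate moves; the moves, the probabilities, and the variable names are all hypothetical and are not produced by a trained model.

```python
# Hypothetical model outputs for three candidate moves (made-up numbers)
predictions = {
    1: {"win": 0.30, "tie": 0.50, "loss": 0.20},
    5: {"win": 0.60, "tie": 0.30, "loss": 0.10},
    9: {"win": 0.45, "tie": 0.15, "loss": 0.40},
}

# Pick the move whose hypothetical board has the highest win probability
best_move = max(predictions, key=lambda move: predictions[move]["win"])
print(best_move)
```

With these numbers, move 5 is selected because its predicted win probability of 0.60 is the highest of the three.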

## 2.2. Create Training Data for Tic Tac Toe 
You’ll learn how to generate data to train the DNN. The logic is as follows: you’ll generate 100,000 games in which both players use random moves. You’ll then record the board positions of all intermediate steps and the eventual outcomes of each board position (a win, a loss, or a tie). 

First, let's simulate one game. The code in the cell below accomplishes that.

In [2]:
from utils.ttt_simple_env import ttt
import time
import random
import numpy as np
from pprint import pprint

# Initiate the game environment
env=ttt()

# Define the one_game() function
def one_game():
    history = []
    state=env.reset()   
    while True:   
        action = random.choice(env.validinputs)  
        state, reward, done, info = env.step(action)
        history.append(np.array(state).reshape(3,3))
        if done:
            break
    return history, reward

# Simulate one game and print out results
history, outcome = one_game()
pprint(history)
pprint(outcome)        

[array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]]),
 array([[ 1,  0,  0],
       [ 0, -1,  0],
       [ 0,  0,  0]]),
 array([[ 1,  0,  0],
       [ 0, -1,  0],
       [ 0,  0,  1]]),
 array([[ 1,  0,  0],
       [-1, -1,  0],
       [ 0,  0,  1]]),
 array([[ 1,  0,  0],
       [-1, -1,  0],
       [ 0,  1,  1]]),
 array([[ 1, -1,  0],
       [-1, -1,  0],
       [ 0,  1,  1]]),
 array([[ 1, -1,  1],
       [-1, -1,  0],
       [ 0,  1,  1]]),
 array([[ 1, -1,  1],
       [-1, -1, -1],
       [ 0,  1,  1]])]
-1


Note here we convert the game board to a 3 by 3 array so it's easy for you to see the positions of the game pieces. 

Now let's simulate 100,000 games and save the data.

In [3]:
# simulate the game 100000 times and record all games
results = []        
for x in range(100000):
    history, outcome = one_game()
    # Associate each board with the game outcome
    for board in history:
        results.append((outcome, board))    

Now let's save the data on your computer for later use, then read it back to check that it was stored correctly:

In [4]:
import pickle
# save the simulation data on your computer
with open('files/ch10/games_ttt100K.p', 'wb') as fp:
    pickle.dump(results,fp)
# read the data and print out the first 10 observations       
with open('files/ch10/games_ttt100K.p', 'rb') as fp:
    games = pickle.load(fp)
pprint(games[:10])

[(-1, array([[0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])),
 (-1, array([[-1,  1,  0],
       [ 0,  0,  0],
       [ 0,  0,  0]])),
 (-1, array([[-1,  1,  1],
       [ 0,  0,  0],
       [ 0,  0,  0]])),
 (-1, array([[-1,  1,  1],
       [ 0, -1,  0],
       [ 0,  0,  0]])),
 (-1, array([[-1,  1,  1],
       [ 0, -1,  1],
       [ 0,  0,  0]])),
 (-1, array([[-1,  1,  1],
       [-1, -1,  1],
       [ 0,  0,  0]])),
 (-1, array([[-1,  1,  1],
       [-1, -1,  1],
       [ 0,  1,  0]])),
 (-1, array([[-1,  1,  1],
       [-1, -1,  1],
       [ 0,  1, -1]])),
 (1, array([[0, 0, 1],
       [0, 0, 0],
       [0, 0, 0]])),
 (1, array([[ 0,  0,  1],
       [ 0,  0,  0],
       [ 0,  0, -1]]))]


The first eight observations are from the first game, in which Player O won by occupying cells 1, 5, and 9. Therefore, you see -1 as the first element of the first eight observations. The data are stored correctly. 

We have the data we need. You’ll learn how to train the model next.

# 3. Create A Convolutional Neural Network

We'll use Keras to create a deep neural network to train game strategies in Tic Tac Toe. In particular, the network will include some dense layers and a convolutional layer. Since there are three possible game outcomes (a win, a loss, and a tie), we'll treat the learning process as a multi-category classification problem. Therefore, we'll have three neurons in the output layer. We'll use softmax as our activation function in the output layer.

## 3.1. Convolutional Layers
Convolutional layers use filters (also called kernels) to find patterns on the input data. A convolutional layer can automatically detect a large number of patterns and associate certain patterns with the target label. This is useful in both image classifications and game strategy developments in machine learning.

In particular, we'll use the Tic Tac Toe game board, something everyone knows, as our example in this chapter. Game boards have far fewer pixels than images and we can focus on certain patterns that we know are associated with game outcomes (vertical, horizontal, or diagonal lines in Tic Tac Toe and Connect Four games, for example). Therefore, we’ll use game boards to explain how CNNs work.

Let’s say that the input data is the Tic Tac Toe game board. For simplicity let’s assume the board looks like the picture below:

In [5]:
import numpy as np

inputs = np.array([[1,1,1],
                   [0,-1,-1],
                   [0,0,0]]).reshape(-1,3,3,1) 

In the above game board, Player X occupies cells 1, 2, and 3, while Player O occupies cells 5 and 6. We represent the board with a 3 by 3 matrix: the first row has three ones since those cells are occupied by Xs. We use reshape(-1,3,3,1) to reshape the matrix into a four-dimensional array: the first dimension represents how many images we have; the second and third dimensions are the height and width of the image. The last dimension is the color channel. For a color picture, there are three channels (RGB, i.e., red, green, and blue), but here we put the number of channels as one for simplicity.

Below, we'll create a horizontal filter with a size of 3 by 3. The middle row has values 1, while the other two rows have 0s. 

In [6]:
# Create a horizontal filter
h_filter = np.array([[0,0,0], 
                   [1,1,1],
                   [0,0,0]]).reshape(3,3,1,1)  

A horizontal filter highlights the horizontal features in the image and blurs the rest. We’ll apply the 3 by 3 horizontal filter on the Tic Tac Toe game board as follows:

In [7]:
import tensorflow as tf

# Apply the filter on the game board
outputs=tf.nn.conv2d(inputs,h_filter,strides=1,padding="SAME")
# Convert the output to a NumPy array and print it
print(outputs.numpy().reshape(3,3))

[[ 2  3  2]
 [-1 -2 -2]
 [ 0  0  0]]


In the output, the values are large in the first row. The values are much lower in the other two rows. So the horizontal filter has correctly detected the horizontal pattern in the first row of the game board.  
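To see where these numbers come from, the sketch below redoes the computation in plain Python, without TensorFlow. The function `conv2d_same()` and the vertical filter `v_filter` are illustrative additions, not part of the chapter's code. As in `tf.nn.conv2d`, the kernel is slid over the board without being flipped, and cells outside the board count as zero (the "SAME" padding).

```python
def conv2d_same(board, kernel):
    """Slide a 3x3 kernel over a 2D board, padding the border with zeros."""
    rows, cols = len(board), len(board[0])
    k = len(kernel) // 2  # how far the kernel extends from its center
    output = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    # Cells outside the board contribute nothing
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += board[rr][cc] * kernel[dr + k][dc + k]
            output[r][c] = total
    return output

board = [[1, 1, 1],
         [0, -1, -1],
         [0, 0, 0]]
h_filter = [[0, 0, 0],
            [1, 1, 1],
            [0, 0, 0]]
v_filter = [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]]  # illustrative vertical filter

h_out = conv2d_same(board, h_filter)
v_out = conv2d_same(board, v_filter)
print(h_out)
print(v_out)
```

The horizontal filter reproduces the output above, with a peak of 3 at the center of the first row. The vertical filter finds no column with a comparable value, since no player has three in a column on this board.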

## 3.2. Create A Model to Train the Tic Tac Toe Game Strategy
We use Keras to create the following deep neural network to train the game strategy.

In [8]:
from random import choice
import pickle
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Conv2D, Flatten
from tensorflow.keras.models import Sequential
import numpy as np

model = Sequential()
model.add(Conv2D(filters=128, 
kernel_size=(3,3),padding="same",activation="relu",
                 input_shape=(3,3,1)))
model.add(Flatten())
model.add(Dense(units=64, activation="relu"))
model.add(Dense(units=64, activation="relu"))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
                   optimizer='adam', 
                   metrics=['accuracy'])

We first use a convolutional layer with 128 filters. The kernel size is 3 by 3. We then flatten the output from the convolutional layer to a vector and feed it to two hidden dense layers with 64 neurons each. The output layer has three neurons, representing three possible game outcomes: a win, a tie, or a loss. The softmax activation ensures that the probabilities add up to 100%. 
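As a rough sanity check on the size of this network, the arithmetic below counts the trainable parameters layer by layer. It assumes the default Keras settings, in which every convolutional filter and every dense neuron has one bias term; the variable names are illustrative. The total should agree with what `model.summary()` reports under those assumptions.

```python
# Conv2D: each of the 128 filters has 3*3*1 weights plus one bias
conv_params = 128 * (3 * 3 * 1 + 1)
# "same" padding keeps the 3 by 3 size, so Flatten yields 3*3*128 values
flat_size = 3 * 3 * 128
# Dense layers: (inputs * neurons) weights plus one bias per neuron
dense1_params = flat_size * 64 + 64
dense2_params = 64 * 64 + 64
output_params = 64 * 3 + 3

total_params = conv_params + dense1_params + dense2_params + output_params
print(conv_params, dense1_params, dense2_params, output_params)
print(total_params)
```

Most of the parameters sit in the first dense layer, because it connects all 1,152 flattened values to each of its 64 neurons.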

## 3.3. Train the Deep Neural Network
We'll train the deep neural network we just created in the last section. We first preprocess the data so that we can feed them into the model.

The outcome data is a variable with three possible values: -1, 0, and 1. We'll convert them into one-hot variables so that the deep neural network can process them. 

In [9]:
import tensorflow as tf

labels=[0,1,-1]
one_hot=tf.keras.utils.to_categorical(labels,3)
print(one_hot)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In the example above, we have three labels: 0, 1, and -1. They represent a tie, a win for Player X, and a loss for Player X (i.e., a win for Player O).

We can use the *to_categorical()* function in TensorFlow to change them into one-hot variables. The second argument in the *to_categorical()* function, 3, indicates the depth of the one-hot variable. This means each one-hot variable will be a vector with a length of 3, with value 1 in one position and 0
in all others.

A tie, which has an initial label of 0, now becomes a one-hot label: a 3-value
vector [1, 0, 0]. The first value (i.e., index 0) is turned on as 1, and the other two values are turned off as 0. Similarly, a win for Player X, which has a label of 1 originally, now becomes a one-hot label of [0, 1, 0]. The second value (i.e., index 1) is turned on as 1, and the rest are turned off as 0. By the same logic, a loss for Player X, with an original value of -1, is now represented by
[0, 0, 1]. 
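The mapping can be mimicked in a few lines of plain Python, which also shows why a label of -1 lands in the last position: in Python, a negative index counts from the end of a list. The function `one_hot()` and the list `outcome_names` are illustrative names; the sketch mirrors the output shown above rather than the internals of *to_categorical()*.

```python
def one_hot(label, num_classes=3):
    """Return a vector with 1 at position `label` and 0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0  # a label of -1 wraps around to the last position
    return vector

for label in [0, 1, -1]:
    print(label, one_hot(label))

# Reading a prediction back: index 0 is a tie, 1 is X winning, 2 is O winning
outcome_names = ["tie", "X wins", "O wins"]
prediction = [0.2, 0.7, 0.1]  # made-up softmax output
most_likely = outcome_names[prediction.index(max(prediction))]
print(most_likely)
```

Keeping this index order in mind matters later, when the predicted probabilities are read back from the model's output.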

Next, we load up the simulated game data and convert them into Xs and ys so that we can feed them into the deep neural network:

In [10]:
with open('files/ch10/games_ttt100K.p','rb') as fp:
    tttgames=pickle.load(fp)

boards = []
outcomes = []
for game in tttgames:
    boards.append(game[1])
    outcomes.append(game[0])

X = np.array(boards).reshape((-1, 3, 3, 1))
# one_hot encoder, three outcomes: -1, 0, and 1
y = tf.keras.utils.to_categorical(outcomes, 3)

Finally, we train the model for 100 epochs: 

In [11]:
# Train the model for 100 epochs
model.fit(X, y, epochs=100, verbose=0)
model.save('files/ch10/trained_ttt100K.h5')

It takes several hours to train the model since we have close to a million observations. The trained model is saved on your computer. 

# 4. Use the Trained Model to Play Tic Tac Toe
Next, we’ll use the strategy to play a game. 

Player X will use the best move from the trained model. Player O will randomly select a move. 

## 4.1. Best Moves Based on the Trained Model
First, we'll define a *best_move_X()* function for player X. The function will go over each move hypothetically, and use the trained deep neural network to predict the probabilities of player X winning and losing the game. The function returns the move with the largest difference between the two, that is, the move that is most likely to win and least likely to lose.

What the computer does is as follows:
1.	Look at the current board.
2.	Look at all possible next moves, and add each move to the current board to form a hypothetical board.
3.	Use the pretrained model to predict the chances of winning and losing with the hypothetical board.
4.	Choose the move with the largest gap between the predicted chance of winning and the predicted chance of losing. 

In [12]:
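from copy import deepcopy

def best_move_X(env):
    # `reload` is the trained model; it is loaded in a later cell,
    # before this function is called for the first time
    # if there is only one valid move, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # Set the initial value of bestoutcome        
    bestoutcome = -2
    bestmove=None    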
    #go through all possible moves hypothetically 
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        state=state.reshape(-1, 3,3,1)
        prediction=reload.predict(state, verbose=0)
        # output is prob(X wins) - prob(O wins)
        win_lose_dif=prediction[0][1]-prediction[0][2]
        if win_lose_dif>bestoutcome:
            # Update the bestoutcome
            bestoutcome = win_lose_dif
            # Update the best move
            bestmove = move
    return bestmove

Similarly, we'll define a *best_move_O()* function for player O. The function will go over each move hypothetically, and use the trained deep neural network to predict the probabilities of player O winning and losing the game. The function returns the move with the largest difference between the two for Player O.

In [13]:
def best_move_O(env):
    # Set the initial value of bestoutcome        
    bestoutcome = -2
    bestmove=None    
    #go through all possible moves hypothetically 
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        state=state.reshape(-1,3,3,1)
        prediction=reload.predict(state, verbose=0)
        # output is prob(O wins) - prob(X wins)
        win_lose_dif=prediction[0][2]-prediction[0][1]
        if win_lose_dif>bestoutcome:
            # Update the bestoutcome
            bestoutcome = win_lose_dif
            # Update the best move
            bestmove = move
    return bestmove

Now let's use the best move functions to choose moves for player X and play a game.

In [14]:
from utils.ttt_simple_env import ttt
import time
import random
from copy import deepcopy
import numpy as np
import tensorflow as tf

file='files/ch10/trained_ttt100K.h5'
reload=tf.keras.models.load_model(file)
# Initiate the game environment
env=ttt()
state=env.reset()   
while True:
    # Use the best_move_X() function to select the next move
    action = best_move_X(env)
    print(f"Player X has chosen action={action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{state.reshape(3,3)}")
    if done:
        if reward==1:
            print(f"Player X has won!") 
        else:
            print(f"It's a tie!") 
        break   
    action = random.choice(env.validinputs)
    print(f"Player O has chosen action={action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{state.reshape(3,3)}")
    if done:
        print(f"Player O has won!") 
        break    

Player X has chosen action=5
the current state is 
[[0 0 0]
 [0 1 0]
 [0 0 0]]
Player O has chosen action=2
the current state is 
[[ 0 -1  0]
 [ 0  1  0]
 [ 0  0  0]]
Player X has chosen action=1
the current state is 
[[ 1 -1  0]
 [ 0  1  0]
 [ 0  0  0]]
Player O has chosen action=4
the current state is 
[[ 1 -1  0]
 [-1  1  0]
 [ 0  0  0]]
Player X has chosen action=9
the current state is 
[[ 1 -1  0]
 [-1  1  0]
 [ 0  0  1]]
Player X has won!


The DNN model has won the game. 

Next, we’ll test how often the DNN-trained game strategy wins against a player who makes random moves.

## 4.2. Against Random Players
We'll see how the deep learning game strategy fares against a random player. We simulate 100 games. If the deep learning agent wins, we record an outcome of 1. If it loses, we record an outcome of -1. If the game is tied, we record an outcome of 0. 

In [15]:
from utils.ttt_simple_env import ttt
import time
import random
from copy import deepcopy
import numpy as np
import tensorflow as tf

# Initiate the game environment
env=ttt()
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        if env.turn=="X":
            action = best_move_X(env) 
        else:
            action = best_move_O(env)    
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the DL agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = random.choice(env.validinputs)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the DL agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break                       

In 50 of the games, the deep learning agent moves first. In the remaining 50 games, the random-move agent goes first. This way, no player has a first-mover's advantage. We first create an empty list *results*. Whenever the deep learning agent wins, we append a value of 1 to the list. If it loses, we append -1, and if the game is tied, we append 0.

Next, we count how many times the deep learning agent has won:

In [16]:
# count how many times the deep learning agent won
wins=results.count(1)
print(f"the deep learning agent has won {wins} games")
# count how many times the deep learning agent lost
losses=results.count(-1)
print(f"the deep learning agent has lost {losses} games")         
# count how many times the game ties
ties=results.count(0)
print(f"the game has tied {ties} times") 

the deep learning agent has won 92 games
the deep learning agent has lost 3 games
the game has tied 5 times


The deep learning agent wins 92 out of 100 games and loses 3 games. The remaining 5 games are tied. So the deep learning game strategy works really well!

Note that we simulated 100,000 games to train the model. You can potentially simulate even more games, say, 1 million games, and train the model on them. The trained model may become stronger with more data, but it takes much longer to train as well. 

## 4.3. Against Think-Two-Steps-Ahead AI
Next, we see how the deep learning agent fares against the think-two-steps-ahead AI agent that we developed in Chapter 5. 

In [17]:
from utils.ch05util import AI_think2
from utils.ttt_simple_env import ttt
import time
import random
from copy import deepcopy
import numpy as np
import tensorflow as tf

# Initiate the game environment
env=ttt()
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=AI_think2(env)
        if action is None:
            action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        if env.turn=="X":
            action = best_move_X(env) 
        else:
            action = best_move_O(env)    
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the DL agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = AI_think2(env)  
        if action is None:
            action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the DL agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break

In Tic Tac Toe, Player X has a huge first-mover’s advantage. Therefore, we test 100 games and in 50 of them, we let the think-two-steps-ahead agent go first. In the other 50 games, the deep learning agent moves first. We record game outcomes in a list results. If the deep learning agent wins, we record an outcome of 1 in the list results. If the deep learning agent loses, we record an outcome of -1. If the game is tied, we record an outcome of 0.

Next, we check how many times the deep learning agent has won:

In [18]:
# count how many times the deep learning agent won
wins=results.count(1)
print(f"the deep learning agent has won {wins} games")
# count how many times the deep learning agent lost
losses=results.count(-1)
print(f"the deep learning agent has lost {losses} games")         
# count how many times the game ties
ties=results.count(0)
print(f"the game has tied {ties} times") 

the deep learning agent has won 36 games
the deep learning agent has lost 12 games
the game has tied 52 times


Whenever it’s Player X’s turn, the deep learning agent uses the best_move_X() function to select a move. Whenever it’s Player O’s turn, the deep learning agent uses the best_move_O() function to select a move. The opponent of the deep learning agent is the think-two-steps-ahead agent. Results show that the deep learning agent has won 36 games and lost 12 games out of 100 games. The remaining 52 games are tied. So the deep learning game strategy works really well and seems to be better than a think-two-steps-ahead agent. 