# Chapter 12: Deep Learning in Multi-Player Games

New Skills in This Chapter:

• Creating your own game environment

• Adding attributes and methods to a game environment

• Making moves and determining wins and losses in a game

• Training game strategies in self-made environments

• Designing deep learning game strategies in multi-player games

***
*Oh, well, this would be one of those circumstances that people unfamiliar with the
law of large numbers would call a coincidence.*<br>
***
--Sheldon Cooper, in The Big Bang Theory

***

In [1]:
import os

os.makedirs("files/ch12", exist_ok=True)

***
$\mathbf{\text{Install needed modules for Chapter 12}}$<br>
***
To convert a ps file to a png file, you need to install Ghostscript with conda.

Enter the following command in the Anaconda prompt (Windows) or a terminal (Mac/Linux) with your virtual environment activated:


`conda install -c conda-forge ghostscript==9.54.0`


***

# 12.1. Create the Tic Tac Toe Game Environment

## 12.1.2. Create A Local Module for the Tic Tac Toe Game


Download *ttt_env.py* in the folder *utils* from the book's GitHub repository. 

Open the file *ttt_env.py* to familiarize yourself with the module. To save space, we'll just outline the main structure of the module below:

In [2]:
# Define an action_space helper class
class action_space:
    ...
    
# Define an observation_space helper class
class observation_space:
    ...

# the ttt class
class ttt():
    # initiate the class
    def __init__(self): 
        ...
    # reset the board
    def reset(self):  
        ...            
    # place piece on board and update state
    def step(self, inp):
        ...
                  
    # Determine if a player has won the game
    def win_game(self):
        ...
    # Show the graphical board
    def render(self):
        ...
    # Close the game environment
    def close(self):
        ...        
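
To give a feel for what such a module might contain, here is a minimal sketch. It is not the code in *ttt_env.py*: the class name `MiniTTT`, the `WIN_LINES` table, and all internals are assumptions made for illustration, and the turtle-based `render()` and the helper classes are left out. It only mimics the interface used in this chapter: `reset()`, `step()`, `validinputs`, `turn`, and a 9-value state with 1 for Player X and -1 for Player O.

```python
# Hypothetical stand-in for the book's ttt class (no graphics).
WIN_LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
             (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
             (1, 5, 9), (3, 5, 7)]              # diagonals


class MiniTTT:
    def reset(self):
        self.turn = "X"                         # X always moves first
        self.validinputs = list(range(1, 10))   # cells still empty
        self.state = [0] * 9                    # 0 empty, 1 X, -1 O
        return list(self.state)

    def win_game(self):
        # True if the player who just moved owns a complete line
        mark = 1 if self.turn == "X" else -1
        return any(all(self.state[c - 1] == mark for c in line)
                   for line in WIN_LINES)

    def step(self, inp):
        if inp not in self.validinputs:
            raise ValueError(f"cell {inp} is not available")
        self.state[inp - 1] = 1 if self.turn == "X" else -1
        self.validinputs.remove(inp)
        reward, done = 0, False
        if self.win_game():
            reward = 1 if self.turn == "X" else -1
            done = True
        elif not self.validinputs:
            done = True                         # board is full: a tie
        else:
            self.turn = "O" if self.turn == "X" else "X"
        return list(self.state), reward, done, {}


env = MiniTTT()
env.reset()
for cell in (1, 4, 2, 5, 3):    # X takes cells 1, 2, 3; O takes 4, 5
    state, reward, done, info = env.step(cell)
print(state, reward, done)
```

In this sketch the reward is 1 when X completes a line, -1 when O does, and 0 otherwise, which matches how rewards are used later in the chapter.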

## 12.1.3. Verify the Custom-Made Game Environment
First we'll initiate the game environment and show the game board.

In [3]:
from utils.ttt_env import ttt

env = ttt()
env.reset()                    
env.render()

We first import the *ttt* class from the local package. We then create an instance of the class and call it *env*. The *reset()* method sets the game board to the initial state. The *render()* method generates a graphical game board. 

If you run the above cell, you should see a separate turtle window, with a game board as follows: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_start.png" />

If you want to close the game board window, use the *close()* method, like so:

In [4]:
env.close()

Next, we'll check the attributes of the game environment such as the observation space and the action space. 

In [5]:
# check the action space
number_actions = env.action_space.n
print("the number of possible actions are",\
      number_actions)
# sample the action space ten times
print("the following are ten sample actions")
for i in range(10):
   print(env.action_space.sample())
# check the shape of the observation space
print("the shape of the observation space is", \
      env.observation_space.shape)

the number of possible actions are 9
the following are ten sample actions
2
5
1
8
8
1
9
4
2
3
the shape of the observation space is (9,)


The results above show that the agent can take nine possible actions. The meanings of the actions in this game are as follows:
* 1: Placing a game piece in cell 1
* 2: Placing a game piece in cell 2
* ...
* 9: Placing a game piece in cell 9

The *sample()* method returns a random action from the action space. The state space is a vector with 9 values. Each value can be -1, 0, or 1, with the following meanings: 
* 0 means the cell is empty; 
* -1 means the cell is occupied by player O; 
* 1 means the cell is occupied by player X.
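
The two helper classes can be pictured with a short sketch. The class names `ActionSpace` and `ObservationSpace` and their bodies are hypothetical and are not copied from *ttt_env.py*; they only reproduce the attributes used above (`n`, `sample()`, and `shape`).

```python
import random


class ActionSpace:
    """Hypothetical helper: actions are the cell numbers 1..n."""

    def __init__(self, n=9):
        self.n = n

    def sample(self):
        # randint is inclusive on both ends, so this returns 1..n
        return random.randint(1, self.n)


class ObservationSpace:
    """Hypothetical helper: the state is a flat vector of n values."""

    def __init__(self, n=9):
        self.shape = (n,)


action_space = ActionSpace()
observation_space = ObservationSpace()
samples = [action_space.sample() for _ in range(10)]
print(action_space.n, observation_space.shape, samples)
```

Note that `sample()` in this sketch draws from all nine cells, whether or not they are occupied, which is why the game loops in this chapter draw from `env.validinputs` instead.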

## 12.1.4. Play A Game in the Tic Tac Toe Environment
Next, we'll play a game in the custom-made environment by randomly choosing an action from the valid moves at each step.

In [6]:
import time
import random

env=ttt()
state=env.reset()   
env.render()
# Play a full game
while True:
    action = random.choice(env.validinputs)
    time.sleep(1)
    print(f"Player X has chosen action {action}")    
    state, reward, done, info = env.step(action)
    env.render()
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")    
    if done:
        if reward==1:
            print(f"Player X has won!") 
        else:
            print(f"It's a tie!") 
        break   
    action = random.choice(env.validinputs)
    time.sleep(1)
    print(f"Player O has chosen action {action}")    
    state, reward, done, info = env.step(action)
    env.render()
    print(f"the current state is \n{state.reshape(3,3)[::-1]}") 
    if done:
        print(f"Player O has won!") 
        break   
env.close()      

Player X has chosen action 8
the current state is 
[[0 1 0]
 [0 0 0]
 [0 0 0]]
Player O has chosen action 9
the current state is 
[[ 0  1 -1]
 [ 0  0  0]
 [ 0  0  0]]
Player X has chosen action 4
the current state is 
[[ 0  1 -1]
 [ 1  0  0]
 [ 0  0  0]]
Player O has chosen action 2
the current state is 
[[ 0  1 -1]
 [ 1  0  0]
 [ 0 -1  0]]
Player X has chosen action 1
the current state is 
[[ 0  1 -1]
 [ 1  0  0]
 [ 1 -1  0]]
Player O has chosen action 5
the current state is 
[[ 0  1 -1]
 [ 1 -1  0]
 [ 1 -1  0]]
Player X has chosen action 3
the current state is 
[[ 0  1 -1]
 [ 1 -1  0]
 [ 1 -1  1]]
Player O has chosen action 6
the current state is 
[[ 0  1 -1]
 [ 1 -1 -1]
 [ 1 -1  1]]
Player X has chosen action 7
the current state is 
[[ 1  1 -1]
 [ 1 -1 -1]
 [ 1 -1  1]]
Player X has won!
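
The printed boards use `state.reshape(3,3)[::-1]`, which reverses the row order so that cells 7 to 9 appear on the top row and cells 1 to 3 on the bottom row. The sketch below reproduces that layout with plain lists; `board_text` is a hypothetical helper, not part of the book's module.

```python
def board_text(state):
    """Return a 3x3 text picture of a 9-value state, top row first."""
    symbols = {1: "X", -1: "O", 0: "."}
    rows = [state[i:i + 3] for i in (0, 3, 6)]   # cells 1-3, 4-6, 7-9
    # Reverse the rows so that cells 7-9 are printed on top
    return "\n".join(" ".join(symbols[v] for v in row)
                     for row in rows[::-1])


# X in cell 8 and O in cell 9, as after the first two moves above
state = [0, 0, 0, 0, 0, 0, 0, 1, -1]
print(board_text(state))
```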


# 12.2. Train A Deep Learning Game Strategy

## 12.2.2. Simulate Tic Tac Toe Games

In [7]:
from pprint import pprint
import numpy as np

env=ttt()
# Define the one_game() function
def one_game():
    history=[]
    state=env.reset()   
    while True:   
        action=random.choice(env.validinputs)  
        state,reward,done,info=env.step(action)
        history.append(np.array(state))
        if done:
            break
    return history, reward

# Simulate one game and print out results
history, outcome = one_game()
pprint(history)
pprint(outcome)        

[array([0, 0, 0, 0, 1, 0, 0, 0, 0]),
 array([ 0,  0,  0,  0,  1,  0, -1,  0,  0]),
 array([ 0,  0,  0,  0,  1,  0, -1,  1,  0]),
 array([ 0,  0, -1,  0,  1,  0, -1,  1,  0]),
 array([ 0,  1, -1,  0,  1,  0, -1,  1,  0])]
1


In [8]:
# simulate 100000 games and record them
results = []        
for x in range(100000):
    history, outcome = one_game()
    # Note here we associate each board with the game outcome
    for board in history:
        results.append((outcome, board))    

In [9]:
import pickle
# save the simulation data on your computer
with open('files/ch12/games_ttt100K.p', 'wb') as fp:
    pickle.dump(results,fp)
# read the data and print out the first 10 observations       
with open('files/ch12/games_ttt100K.p', 'rb') as fp:
    games = pickle.load(fp)
pprint(games[:10])

[(1, array([0, 0, 0, 1, 0, 0, 0, 0, 0])),
 (1, array([ 0,  0,  0,  1,  0,  0,  0, -1,  0])),
 (1, array([ 0,  0,  0,  1,  0,  0,  0, -1,  1])),
 (1, array([ 0,  0,  0,  1,  0, -1,  0, -1,  1])),
 (1, array([ 0,  1,  0,  1,  0, -1,  0, -1,  1])),
 (1, array([ 0,  1, -1,  1,  0, -1,  0, -1,  1])),
 (1, array([ 1,  1, -1,  1,  0, -1,  0, -1,  1])),
 (1, array([ 1,  1, -1,  1,  0, -1, -1, -1,  1])),
 (1, array([ 1,  1, -1,  1,  1, -1, -1, -1,  1])),
 (0, array([0, 0, 0, 0, 1, 0, 0, 0, 0]))]


## 12.2.3. Train Your Tic Tac Toe Game Strategy 

In [10]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Conv2D, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Conv2D(filters=128, kernel_size=(3,3),
                 padding="same", activation="relu",
                 input_shape=(3,3,1)))
model.add(Flatten())
model.add(Dense(units=64, activation="relu"))
model.add(Dense(units=64, activation="relu"))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [11]:
import tensorflow as tf

with open('files/ch12/games_ttt100K.p', 'rb') as fp:
    tttgames = pickle.load(fp)

boards = []
outcomes = []
for game in tttgames:
    boards.append(game[1])
    outcomes.append(game[0])

X = np.array(boards).reshape((-1, 3, 3, 1))
# one-hot encode the three outcomes: 0 (tie) -> index 0,
# 1 (X wins) -> index 1, and -1 (O wins) -> index 2
y = tf.keras.utils.to_categorical(outcomes, 3)
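
The outcome -1 ends up in the last column because the label is used as an array index, and a negative index counts from the end. The plain-Python sketch below imitates that behavior; `one_hot` is a hypothetical helper written for illustration and is not the Keras implementation.

```python
def one_hot(labels, num_classes=3):
    """Imitate one-hot encoding where each label is used as an index."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1      # label -1 wraps around to the last position
        rows.append(row)
    return rows


# tie, X wins, O wins
encoded = one_hot([0, 1, -1])
print(encoded)
```

This is why, later in the chapter, index 1 of a prediction is read as the probability that Player X wins and index 2 as the probability that Player O wins.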

In [12]:
# Train the model for 100 epochs
model.fit(X, y, epochs=100, verbose=0)
model.save('files/ch12/trained_ttt100K.h5')

It takes several hours to train the model since we have close to a million observations. The trained model is saved on your computer. Alternatively, you can download the trained model from the book’s GitHub repository.

# 12.3. Use the Trained Model to Play Games

## 12.3.1. Best Moves Based on the Trained Model

In [13]:
from copy import deepcopy

def best_move_X(env):
    # if there is only one valid move, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # Set the initial value of bestoutcome        
    bestoutcome=-2
    bestmove=None    
    #go through all possible moves hypothetically 
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        state=state.reshape(-1,3,3,1)
        prediction=model.predict(state, verbose=0)
        # output is prob(X wins) - prob(O wins)
        win_lose_dif=prediction[0][1]-prediction[0][2]
        if win_lose_dif>bestoutcome:
            # Update the bestoutcome
            bestoutcome=win_lose_dif
            # Update the best move
            bestmove=move
    return bestmove

In [14]:
def best_move_O(env):
    # Set the initial value of bestoutcome        
    bestoutcome = -2
    bestmove=None    
    #go through all possible moves hypothetically 
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        state=state.reshape(-1,3,3,1)
        prediction=model.predict(state, verbose=0)
        # output is prob(O wins) - prob(X wins)
        win_lose_dif=prediction[0][2]-prediction[0][1]
        if win_lose_dif>bestoutcome:
            # Update the bestoutcome
            bestoutcome = win_lose_dif
            # Update the best move
            bestmove = move
    return bestmove

## 12.3.2. Test A Game Against the Model

In [15]:
env=ttt()
state=env.reset()   
env.render()
# Play a full game: the model plays X against a random Player O
while True:
    # Use the best_move_X() function to select move
    action=best_move_X(env)
    print(f"Player X has chosen action {action}")    
    state,reward,done,info=env.step(action)
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")
    env.render()
    if done:
        if reward==1:
            print(f"Player X has won!") 
        else:
            print(f"It's a tie!") 
        break    
    action = random.choice(env.validinputs)
    print(f"Player O has chosen action {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")
    env.render()
    if done:
        print(f"Player O has won!") 
        break        

The best strategy looks at each possible next move and adds that move to the current board to form a hypothetical board. We feed the hypothetical board to the trained model to make predictions. 
The prediction has three values: the probabilities of a tie, of Player X winning, and of Player O winning. The best strategy chooses the move with the largest difference between the probability of Player X winning and the probability of Player O winning. 
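
In the code, `best_move_X()` ranks moves by the probability that X wins minus the probability that O wins, which is not always the move with the highest win probability. The sketch below shows the difference on made-up numbers; the `predictions` dictionary and the `pick_move_for_x` helper are hypothetical and stand in for calls to the trained model.

```python
# Hypothetical model outputs for three candidate moves. Each triple is
# (P(tie), P(X wins), P(O wins)); the numbers are made up.
predictions = {
    1: (0.20, 0.50, 0.30),
    5: (0.10, 0.55, 0.35),
    9: (0.35, 0.45, 0.20),
}


def pick_move_for_x(predictions):
    """Pick the move with the largest P(X wins) - P(O wins)."""
    best_move, best_score = None, -2.0   # any real score is in [-1, 1]
    for move, (p_tie, p_x, p_o) in predictions.items():
        score = p_x - p_o
        if score > best_score:
            best_move, best_score = move, score
    return best_move, best_score


best_move, best_score = pick_move_for_x(predictions)
highest_win_prob = max(predictions, key=lambda m: predictions[m][1])
print(best_move, highest_win_prob)
```

Here move 5 has the highest probability of X winning, but move 9 is selected because it also leaves O with the smallest chance of winning.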

Here is one example of the eventual outcome:


Player X uses the best moves recommended by the trained model and wins the game by occupying cells 4, 5, and 6, as shown in this picture.
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_win_screen.png" /> 

You can also test the best strategy for Player O by using the best_move_O() function, assuming Player X chooses random moves. I leave that as an exercise for you. 

## 12.3.3. Test the Efficacy of the Trained Model

In [16]:
# Initiate the game environment
env=ttt()
results=[]
for i in range(1000):
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        if env.turn=="X":
            action = best_move_X(env) 
        else:
            action = best_move_O(env)    
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the deep learning agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = random.choice(env.validinputs)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the deep learning agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break         

In [17]:
# count how many times the deep learning agent won
wins=results.count(1)
print(f"the deep learning agent has won {wins} games")
# count how many times the deep learning agent lost
losses=results.count(-1)
print(f"the deep learning agent has lost {losses} games")         
# count how many times the game ties
ties=results.count(0)
print(f"the game has tied {ties} times")          

the deep learning agent has won 994 games
the deep learning agent has lost 0 games
the game has tied 6 times
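
The same tally can be turned into rates with `collections.Counter`. This is a small sketch on a made-up `results` list that uses the chapter's encoding (1 for a win, 0 for a tie, -1 for a loss); it does not reproduce the 1,000-game test above.

```python
from collections import Counter

# Made-up outcomes: 1 = agent wins, 0 = tie, -1 = agent loses
results = [1, 1, 0, 1, -1, 1, 0, 1]

counts = Counter(results)
total = len(results)
for label, name in [(1, "wins"), (0, "ties"), (-1, "losses")]:
    share = counts[label] / total
    print(f"{name}: {counts[label]} ({share:.1%})")
```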


# 12.4. Animate the Deep Learning Process

## 12.4.1. Probabilities of Winning for Each Hypothetical Move

In [18]:
from utils.ch12util import record_ttt

history=record_ttt()

In [19]:
p_wins_step0=history[0][1]
for key, value in p_wins_step0.items():
    print(f"If Player X chooses action {key}, "
          f"the probability of winning is {value:.4f}.")

If Player X chooses action 1, the probability of winning is 0.6460.
If Player X chooses action 2, the probability of winning is 0.5623.
If Player X chooses action 3, the probability of winning is 0.6474.
If Player X chooses action 4, the probability of winning is 0.5680.
If Player X chooses action 5, the probability of winning is 0.7345.
If Player X chooses action 6, the probability of winning is 0.5654.
If Player X chooses action 7, the probability of winning is 0.6453.
If Player X chooses action 8, the probability of winning is 0.5629.
If Player X chooses action 9, the probability of winning is 0.6471.
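
The opening probabilities printed above can be summarized in a few lines. The sketch copies the nine values into a dictionary by hand (in the notebook they come from `history[0][1]`) and groups the cells into corners and edges; the grouping and the variable names are added here for illustration.

```python
# Opening-move probabilities copied from the output above
p_wins_step0 = {1: 0.6460, 2: 0.5623, 3: 0.6474,
                4: 0.5680, 5: 0.7345, 6: 0.5654,
                7: 0.6453, 8: 0.5629, 9: 0.6471}

# The move with the highest predicted probability of winning
best_action = max(p_wins_step0, key=p_wins_step0.get)

corners = [1, 3, 7, 9]
edges = [2, 4, 6, 8]
corner_mean = sum(p_wins_step0[c] for c in corners) / len(corners)
edge_mean = sum(p_wins_step0[c] for c in edges) / len(edges)

print(f"best opening move: cell {best_action}")
print(f"corners average {corner_mean:.4f}, edges average {edge_mean:.4f}")
```

In these numbers the center cell is rated highest, the corners come next, and the edge cells are rated lowest.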


In [20]:
p_wins_step1=history[1][1]
for key, value in p_wins_step1.items():
    print(f"If Player X chooses action {key}, "
          f"the probability of winning is {value:.4f}.")

If Player X chooses action 1, the probability of winning is 0.8206.
If Player X chooses action 2, the probability of winning is 0.6159.
If Player X chooses action 3, the probability of winning is 0.7762.
If Player X chooses action 4, the probability of winning is 0.7598.
If Player X chooses action 6, the probability of winning is 0.7974.
If Player X chooses action 7, the probability of winning is 0.7929.
If Player X chooses action 9, the probability of winning is 0.7860.


In [21]:
import pickle

# save the game history on your computer
with open('files/ch12/ttt_game_history.p','wb') as fp:
    pickle.dump(history,fp)

## 12.4.2. Animate the Whole Game

In [22]:
import imageio
from PIL import Image

frames=[]
for i in range(6):
    im=Image.open(f"files/ch12/ttt_step{i}.ps")
    frame=np.asarray(im)
    frames.append(frame) 
imageio.mimsave("files/ch12/ttt_steps.gif",frames,duration=1000) 

If you open the file ttt_steps.gif, you'll see the following: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_steps.gif"/>

The animation shows the game board at each stage of the game. 

## 12.4.3. Animate the Decision Making

In [23]:
from utils.ch12util import gen_images

gen_images()

The above script highlights the decision-making process of Player X. For example, if you open the file ttt_stage4step3.png, you'll see the following picture.
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_stage4step3.png" />

It shows the probabilities of Player X winning the game with each hypothetical move. In particular, the probability is 100% if Player X chooses Cell 9. That cell is highlighted in blue, and it is the move Player X makes as a result. 

In [24]:
from PIL import Image
import imageio

frames=[]
for stage in [0, 2, 4]:
    for step in [1,2,3]:
        file=f"files/ch12/ttt_stage{stage}step{step}.png"
        im=Image.open(file)
        f1=np.asarray(im)
        frames.append(f1)  
imageio.mimsave('files/ch12/ttt_DL_probs.gif',frames,duration=500)

If you open the file ttt_DL_probs.gif, you'll see the animation as follows.
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_DL_probs.gif" /> 

## 12.4.4. Animate Board Positions and the Decision Making

In [25]:
from utils.ch12util import combine_animation

frames=combine_animation()

If you open the gif file, you'll see the following animation:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/ttt_DL_steps.gif"/>

## 12.4.5. Subplots of the Decision-Making Process

In [26]:
subplot_frames=frames[2::3]
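
The slice `frames[2::3]` starts at index 2 and takes every third element. Assuming the combined animation holds three frames for each of Player X's three moves, in order, this keeps the last frame of each move. The sketch below uses placeholder strings instead of images to show which positions are selected; the labels are made up.

```python
# Placeholder labels standing in for image frames:
# three frames for each of Player X's moves at stages 0, 2, and 4
frames = [f"stage{stage}step{step}"
          for stage in (0, 2, 4)
          for step in (1, 2, 3)]

# Start at index 2, then take every third element
subplot_frames = frames[2::3]
print(subplot_frames)
```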

In [27]:
from matplotlib import pyplot as plt

plt.figure(figsize=(20,30),dpi=200)
for i in range(3):
    plt.subplot(3,1,i+1)
    plt.imshow(subplot_frames[i])
    plt.axis('off')
plt.subplots_adjust(bottom=0.001,right=0.999,top=0.999,
        left=0.001, hspace=-0.01,wspace=-0.22)
plt.savefig("files/ch12/subplots_ttt.png")

# 12.6. Exercises

In [28]:
# answer to question 12.2
import time
import random

env=ttt()
state=env.reset()   
env.render()
# Play a full game
while True:
    action = random.choice(env.validinputs)
    time.sleep(1)
    print(f"Player X has chosen action {action}")    
    state, reward, done, info = env.step(action)
    env.render()
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")    
    if done:
        if reward==1:
            print(f"Player X has won!") 
        else:
            print(f"It's a tie!") 
        break   
    action = int(input("Player O, enter your move:"))
    time.sleep(1)
    print(f"Player O has chosen action {action}")    
    state, reward, done, info = env.step(action)
    env.render()
    print(f"the current state is \n{state.reshape(3,3)[::-1]}") 
    if done:
        print(f"Player O has won!") 
        break   
env.close()  

In [29]:
# answer to question 12.3
env=ttt()
state=env.reset()   
env.render()
# Play a full game: a random Player X against the model as Player O
while True:
    action=random.choice(env.validinputs)
    print(f"Player X has chosen action {action}")    
    state,reward,done,info=env.step(action)
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")
    env.render()
    if done:
        if reward==1:
            print(f"Player X has won!") 
        else:
            print(f"It's a tie!") 
        break    
    # Use the best_move_O() function to select move
    action=best_move_O(env)
    print(f"Player O has chosen action {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{state.reshape(3,3)[::-1]}")
    env.render()
    if done:
        print(f"Player O has won!") 
        break  

In [30]:
# answer to question 12.4
p_wins_step2=history[2][1]
for key, value in p_wins_step2.items():
    print(f"If Player X chooses action {key}, "
          f"the probability of winning is {value:.4f}.")