# Learn to Score a Tic-Tac-Toe Board by Example

## Introduction 


We want to use machine learning to support intelligent agents playing Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). Here is the approach:

1. Simulate playouts to create labeled training data. The data $X$ is the board and the label $y$ indicates if the game resulted in a win, a loss, or a draw.
2. Learn a model to predict the probability that we will win from a given board. More specifically, we will learn a function that predicts a score normalized to a probability of a win for each board (i.e., $\hat{P}(y = \mathrm{win} | x) = h(x)$).
3. The model can be applied as:
   - the heuristic evaluation function for Heuristic Minimax Search.
   - a playout policy for better simulated games used in Pure Monte Carlo Search/Monte Carlo Tree Search.
   - self-play, where we learn a model and use it as the playout policy to create more realistic training data for a better model. We keep doing that as long as the model improves.

## The board

I represent the board as a vector of length 9. The values are `' ', 'x', 'o'`.  
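
As a small illustration of this layout, the sketch below maps a flat index to a row and column and back. It assumes row-major order, which matches the `reshape((3,3))` used by the helper functions in this notebook. The names `to_row_col` and `to_index` are hypothetical and are not used elsewhere.

```python
def to_row_col(index):
    """Convert a flat board index (0-8) to (row, column), assuming row-major order."""
    return divmod(index, 3)


def to_index(row, col):
    """Convert (row, column) back to the flat board index."""
    return 3 * row + col


print(to_row_col(4))   # center cell -> (1, 1)
print(to_index(2, 0))  # bottom-left cell -> 6
```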

In [1]:
import numpy as np
import pandas as pd
import math

%precision 3

'%.3f'

In [2]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Some helper functions.

In [3]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [4]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [5]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [6]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [7]:
def utility(state, player = 'x'):
    """check if a state is terminal and return the utility if it is. None means not a terminal state."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


In [8]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()
    
show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

## Create Training Data using Playouts

We will try to learn a function $\hat{y} = h(x)$ where $x$ is a board and $\hat{y}$ is the estimated utility. The data we need to learn this model can be created by running playouts (complete games), recording the boards, and marking them as leading to a win, loss, or draw. Note: I learn a model that is specific to a player by using only the boards resulting from a move of that player.

To describe $x$ for the learning algorithm, I translate empty cells to 0, `x` to 1 and `o` to -1. For $y$ I use the utility defined for win (1), loss (-1), and tie (0).

We will start with a **randomized playout policy.**

In [9]:
def playout_policy_random(state, player = 'x'):
    return np.random.choice(get_actions(state))
    
playout_policy = playout_policy_random
playout_policy(board)

4

In [10]:
tr = {' ': 0, 'x': 1, 'o': -1} # I translate the board into numbers

def encode_state(state):
    """Represent the board as a vector of numbers."""
    return [tr[s] for s in state]

def playout_record(player = 'x'):
    """Run a playout and record the boards after the player's move."""
    state = empty_board()
    current_player = 'x'
    
    boards = []
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: return(boards, [u] * len(boards))
  
        a = playout_policy(state, current_player)
        state = result(state, current_player, a)   
  
        if current_player == player:
            boards.append(encode_state(state))

        # switch between players
        current_player = other(current_player)

playout_record()

([[0, 0, 0, 0, 0, 0, 0, 0, 1],
  [0, 0, -1, 0, 0, 0, 1, 0, 1],
  [0, -1, -1, 0, 0, 0, 1, 1, 1]],
 [1, 1, 1])

Run `N` playouts and create a pandas dataframe for `X` and a numpy array for `y`. These data structures work for `sklearn`. 

In [11]:
def create_data(N = 100, record = 'x'):
    board = []
    utility = []
    
    for i in range(N):
        b, u = playout_record(record)
        board.extend(b)
        utility.extend(u)
        
    return {'X': pd.DataFrame(board), 'y': np.array(utility)}


np.random.seed(1234)

data = create_data(2000)
X = data['X']
y = data['y']

print("X:")
display(X)

print("y:")
display(y)

X:


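|      | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0    | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0   |
| 1    | 0   | 0   | 0   | 1   | 0   | 0   | 1   | -1  | 0   |
| 2    | 0   | 0   | 0   | 1   | 0   | -1  | 1   | -1  | 1   |
| 3    | -1  | 0   | 1   | 1   | 0   | -1  | 1   | -1  | 1   |
| 4    | -1  | 1   | 1   | 1   | -1  | -1  | 1   | -1  | 1   |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8336 | -1  | 1   | 1   | 1   | -1  | -1  | -1  | 1   | 1   |
| 8337 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 1   | 0   |
| 8338 | -1  | 1   | 0   | 0   | 0   | 0   | 0   | 1   | 0   |
| 8339 | -1  | 1   | 0   | 0   | -1  | 1   | 0   | 1   | 0   |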


y:


array([ 0,  0,  0, ..., -1, -1, -1])

## Train a Model

In [12]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

Split the data in training and testing data.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

We learn an Artificial Neural Network (ANN) to approximate $y = f(X)$ by $\hat{y} = h(X)$. See
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

ANNs are popular for this kind of task but other classification models can also be used (e.g., decision trees).

In [14]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
                    hidden_layer_sizes = (100),
                    max_iter = 1000
                    ) 
                    
%time mlp.fit(X_train, y_train)

CPU times: user 9.48 s, sys: 0 ns, total: 9.48 s
Wall time: 9.49 s


MLPClassifier(hidden_layer_sizes=100, max_iter=1000)

Model size (number of weights)

In [15]:
print("Layer 1:", np.shape(mlp.coefs_[0]))
print("Layer 2:", np.shape(mlp.coefs_[1]))

# np.prod replaces np.product, which was deprecated and later removed from NumPy
print("Total number of weights:", np.prod(np.shape(mlp.coefs_[0])) + np.prod(np.shape(mlp.coefs_[1])))

Layer 1: (9, 100)
Layer 2: (100, 3)
Total number of weights: 1200
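
The count above covers only the connection weights stored in `coefs_`. Each hidden and output unit also has a bias term (scikit-learn keeps these in `intercepts_`), so the number of trainable parameters is slightly larger. The arithmetic can be sketched without the fitted model, assuming a fully connected network with layer sizes 9, 100, and 3 as shown above. The helper `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    # One weight per connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per unit in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases


print(count_parameters([9, 100, 3]))  # (1200, 103)
```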


Test the model against the test data.

In [16]:
pred = mlp.predict(X_test)

print("y_test:\t", list(y_test)[0:10])
print("pred:\t",   pred[0:10])

# Note: pred is passed first, so rows are predicted labels and columns are true labels.
print("Confusion matrix:\n", confusion_matrix(pred, y_test))
print("Accuracy:", accuracy_score(pred, y_test))

y_test:	 [-1, 1, -1, 1, -1, 1, 0, 1, 0, 1]
pred:	 [ 1  1  1  1  1 -1  0  1  0  1]
Confusion matrix:
 [[179  34 114]
 [ 16 101  44]
 [235 132 814]]
Accuracy: 0.6554823247453565


__Note:__ The accuracy is not great since we have many boards with only a few moves on them. Since these boards can easily lead to wins, losses or ties, they produce many errors.

Here is the number of empty cells for each board in the test set.

In [17]:
(X_test == 0).sum(axis=1)

7352    2
7422    4
7235    6
5534    2
5585    8
       ..
2732    8
1101    4
5698    2
383     4
4451    6
Length: 1669, dtype: int64

Test only on boards that have three or fewer empty cells left to play.

In [34]:
take = list((X_test == 0).sum(axis=1)<=3)

X_test2 = X_test[take]
y_test2 = y_test[take]

In [35]:
pred2 = mlp.predict(X_test2)
print("Accuracy:", accuracy_score(pred2, y_test2))

Accuracy: 0.836864406779661
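
The same comparison can be made for every number of empty cells rather than a single cutoff. The following is a minimal sketch in plain Python; the helper `accuracy_by_empty_cells` is hypothetical, and the toy boards and labels are made up only to show the call (they are not results from this notebook).

```python
from collections import defaultdict


def accuracy_by_empty_cells(boards, y_true, y_pred):
    """Group encoded boards by their number of empty cells (value 0)
    and return {number of empty cells: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for board, truth, guess in zip(boards, y_true, y_pred):
        n_empty = sum(1 for cell in board if cell == 0)
        totals[n_empty] += 1
        hits[n_empty] += int(truth == guess)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Made-up toy data, only to demonstrate the function.
boards = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0],     # 8 empty cells
    [0, 0, 0, 0, 1, 0, 0, 0, 0],     # 8 empty cells
    [1, -1, 1, -1, 1, -1, 0, 0, 1],  # 2 empty cells
]
y_true = [1, -1, 1]
y_pred = [1, 1, 1]

print(accuracy_by_empty_cells(boards, y_true, y_pred))  # {2: 1.0, 8: 0.5}
```

With the objects from this notebook, a call along the lines of `accuracy_by_empty_cells(X_test.values, y_test, pred)` should work, since the function only iterates over rows and compares cells to 0.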


__More Notes:__ 

* The board is symmetric. You could use deep learning with convolution layers to create better models.
* The tic-tac-toe board is small and we used 2000 playouts. This covers a large part of the search space. If you have a more complicated game, then you would need to do self-play to learn better and better playout policies.
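
A cheaper way to exploit the symmetry, without changing the model, is to add all rotations and reflections of each recorded board to the training data with the same label. This is an alternative to the convolution layers mentioned above and is not something this notebook does. A minimal sketch, assuming the row-major board layout used throughout; `symmetries` is a hypothetical helper:

```python
def symmetries(board):
    """Return the 8 rotations/reflections of a length-9, row-major board."""

    def rotate(b):
        # 90 degree clockwise rotation: new[r][c] = old[2 - c][r]
        return [b[3 * (2 - c) + r] for r in range(3) for c in range(3)]

    def mirror(b):
        # left-right reflection: new[r][c] = old[r][2 - c]
        return [b[3 * r + (2 - c)] for r in range(3) for c in range(3)]

    variants = []
    current = list(board)
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate(current)
    return variants


corner = ['x'] + [' '] * 8
unique = {tuple(v) for v in symmetries(corner)}
print(len(unique))  # 4: a single corner mark has four distinct placements
```

Boards that are themselves symmetric produce duplicates, so it may be worth removing repeated variants before training.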

## Some Tests

We evaluate some boards where `x` just made a move. The classifier tries to predict the most likely outcome
of the game as -1 = `o` wins, 0 = draw, and 1 = `x` wins. The classifier can also predict the probability of the three possible outcomes. We can use these probabilities as weights to calculate the expected utility in the range $[-1,1]$.
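
As a worked instance of this weighting, the expected utility is $\sum_u u \cdot \hat{P}(y = u | x)$ over the outcomes $u \in \{-1, 0, 1\}$. The sketch below uses made-up probabilities, and `expected_utility` is a hypothetical helper written in plain Python.

```python
def expected_utility(probs, outcomes=(-1, 0, 1)):
    """Weight the utility of each outcome by its predicted probability."""
    return sum(p * u for p, u in zip(probs, outcomes))


# Made-up probabilities in the order [loss, draw, win].
print(expected_utility([0.25, 0.25, 0.5]))  # 0.25
```

For a fitted scikit-learn classifier, the column order of `predict_proba` follows the `classes_` attribute, so it is worth checking that it is `[-1, 0, 1]` before applying these weights.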

In [20]:
def print_eval_board(board):
    print("Board:")
    show_board(board)

    pred = mlp.predict(pd.DataFrame([encode_state(board)]))
    print("\nPredicted game outcome:", pred)

    probs = mlp.predict_proba(pd.DataFrame([encode_state(board)]))
    print("Predicted probability [loss, draw, win]:", probs)
    print("Expected utility:", np.sum(probs * [-1,0,1]))

In [21]:
# x will win

board = ['x', 'x', ' ',
         'o', 'x', ' ',
         'o', ' ', ' ']

print_eval_board(board)

Board:
[['x' 'x' ' ']
 ['o' 'x' ' ']
 ['o' ' ' ' ']]

Predicted game outcome: [1]
Predicted probability [loss, draw, win]: [[8.641e-02 8.183e-04 9.128e-01]]
Expected utility: 0.8263553792688264


In [22]:
# x made a mistake and will lose.

board = ['o', 'x', ' ',
         'x', 'o', 'x',
         'o', ' ', ' ']
    
print_eval_board(board)

Board:
[['o' 'x' ' ']
 ['x' 'o' 'x']
 ['o' ' ' ' ']]

Predicted game outcome: [-1]
Predicted probability [loss, draw, win]: [[0.953 0.011 0.036]]
Expected utility: -0.9174997985317431


In [23]:
# This will be a draw

board = ['x', 'o', 'x',
         ' ', 'o', ' ',
         ' ', 'x', ' ']

print_eval_board(board)

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' 'x' ' ']]

Predicted game outcome: [0]
Predicted probability [loss, draw, win]: [[0.1   0.631 0.269]]
Expected utility: 0.16903114678744696


## Using the Predictions to find the Best Move

The predict function can be used for
   - the heuristic evaluation function for Heuristic Minimax Search.
   - a better playout policy for simulated games used in Pure Monte Carlo Search/Monte Carlo Tree Search. If we use it as a playout strategy to create data for learning better ML models, then this is called _self-play_. 

I show here how to use the model as a heuristic evaluation function for all boards that the player `x` can get to with its next move. The player then chooses the move with the highest heuristic value (printed as "best action").

In [24]:
def eval_fun_ML(state, player = 'x'):
    p = mlp.predict_proba(pd.DataFrame([encode_state(state)]))
    val = np.sum(p * [-1, 0 , 1])
    return val
    

def best_move(state, player = 'x', verbose = False):  
    action = None
    value = -math.inf

    for a in get_actions(state) : 
        b = result(state.copy(), player, a)
        val = eval_fun_ML(b, player)
        if (verbose):
            print("%s chooses %d; expected utility = %+1.2f" % (player, a, val))

        if val > value:
            value = val
            action = a
        
    return action
  
    
print("# Empty board: Place in the center (or at least a corner).")
board = empty_board()
show_board(board)
print("Best action:", best_move(board, verbose = True))

print("\n# Play 7 to avoid loss.")
board = ['x', 'o', 'x',
         ' ', 'o', ' ',
         ' ', ' ', ' ']
show_board(board)
print("Best action:", best_move(board, verbose = True))

print("\n# Play 4 to win.")
board = ['o', 'x', ' ',
         ' ', ' ', 'x',
         ' ', ' ', 'o']
show_board(board)
print("Best action:", best_move(board, verbose = True))

# Empty board: Place in the center (or at least a corner).
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
x chooses 0; expected utility = +0.33
x chooses 1; expected utility = +0.19
x chooses 2; expected utility = +0.28
x chooses 3; expected utility = +0.20
x chooses 4; expected utility = +0.51
x chooses 5; expected utility = +0.22
x chooses 6; expected utility = +0.35
x chooses 7; expected utility = -0.01
x chooses 8; expected utility = +0.39
Best action: 4

# Play 7 to avoid loss.
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]
x chooses 3; expected utility = -0.11
x chooses 5; expected utility = -0.40
x chooses 6; expected utility = -0.22
x chooses 7; expected utility = +0.17
x chooses 8; expected utility = -0.25
Best action: 7

# Play 4 to win.
[['o' 'x' ' ']
 [' ' ' ' 'x']
 [' ' ' ' 'o']]
x chooses 2; expected utility = -0.93
x chooses 3; expected utility = -0.44
x chooses 4; expected utility = +0.31
x chooses 6; expected utility = +0.65
x chooses 7; expected utility = -0.23
Best act
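
The one-step lookahead above only scores the boards directly after the player's move. Heuristic Minimax Search, the first use listed at the start of this section, looks further ahead and calls the evaluation function only at the depth cutoff. The following is a minimal, self-contained sketch of that idea, not the notebook's implementation: the function names are hypothetical, and the stand-in heuristic `no_knowledge` always returns 0. A function with the signature of `eval_fun_ML(state, player)` could be passed instead. With an even `depth`, the cutoff boards are boards after the player's own move, which is the kind of board the model in this notebook was trained on.

```python
import math

# All eight winning lines of the 3x3 board (row-major indices).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def other(player):
    return 'o' if player == 'x' else 'x'


def terminal_utility(board, player):
    """Return +1/0/-1 for a win/draw/loss of `player`, or None if the game goes on."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return 1 if board[a] == player else -1
    return 0 if ' ' not in board else None


def minimax_value(board, to_move, player, eval_fun, depth):
    """Depth-limited minimax value of `board` from the view of `player`."""
    u = terminal_utility(board, player)
    if u is not None:
        return u
    if depth == 0:
        return eval_fun(board, player)  # cutoff: ask the heuristic
    values = []
    for a in range(9):
        if board[a] == ' ':
            child = board[:a] + [to_move] + board[a + 1:]
            values.append(minimax_value(child, other(to_move), player, eval_fun, depth - 1))
    return max(values) if to_move == player else min(values)


def best_move_minimax(board, player, eval_fun, depth=2):
    """Pick the move with the highest depth-limited minimax value."""
    best_action, best_value = None, -math.inf
    for a in range(9):
        if board[a] == ' ':
            child = board[:a] + [player] + board[a + 1:]
            value = minimax_value(child, other(player), player, eval_fun, depth)
            if value > best_value:
                best_action, best_value = a, value
    return best_action


def no_knowledge(board, player):
    """Stand-in heuristic that rates every non-terminal board as 0."""
    return 0


board = ['o', 'x', ' ',
         ' ', ' ', 'x',
         ' ', ' ', 'o']
print(best_move_minimax(board, 'x', no_knowledge))  # 4
```

Even without any learned knowledge, two extra plies are enough here to see that playing 4 creates two threats at once, while every other move lets `o` complete the diagonal.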

## Notes on Self-Play

You need to learn a model for each player and use it in the playout strategy to create better data. Then learn new models and repeat.
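
The loop described above can be sketched as a skeleton. Everything in it is an assumption made for illustration: the function `self_play`, its callable arguments, and the placeholder functions are hypothetical, and the signature of `create_data` differs from the one used earlier in this notebook (it takes the playout policy explicitly). The sketch also runs a fixed number of rounds instead of stopping when the model no longer improves.

```python
def self_play(n_rounds, create_data, fit_model, make_policy, policy):
    """Alternate between generating playouts and refitting one model per player."""
    history = []
    for _ in range(n_rounds):
        models = {}
        for player in ('x', 'o'):
            # Playouts with the current policy, recorded from this player's view.
            X, y = create_data(policy, record=player)
            models[player] = fit_model(X, y)
        # The new models define the playout policy for the next round.
        policy = make_policy(models)
        history.append(models)
    return policy, history


# Placeholder callables so the skeleton runs on its own; replace them with real ones.
def fake_create_data(policy, record):
    return [[0] * 9], [0]            # one empty board labeled as a draw


def fake_fit_model(X, y):
    return {"n_boards": len(X)}      # stands in for a fitted model


def fake_make_policy(models):
    return ("policy from", sorted(models))


final_policy, history = self_play(3, fake_create_data, fake_fit_model,
                                  fake_make_policy, "random")
print(len(history), final_policy)
```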