# Learn to Score a Tic-Tac-Toe Board by Example

## Introduction 


We want to use machine learning to support intelligent agents playing Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). Here is the approach:

1. Simulate playouts to create training data: boards labeled with whether the playout led to a win, a draw, or a loss.
2. Learn a model to predict the label for each board.
3. The model can be applied as:
   - the heuristic evaluation function for Heuristic Minimax Search.
   - a playout policy for better simulated games used in Pure Monte Carlo Search/Monte Carlo Tree Search.

## The board

I represent the board as a vector of length 9. Each entry is one of `' '` (empty), `'x'`, or `'o'`.

In [22]:
import numpy as np
import math

In [23]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
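The flat vector maps onto the 3x3 grid in row-major order, which is how the `reshape((3,3))` call in the helper functions lays it out. A minimal sketch of the index arithmetic (the helper names `to_row_col` and `to_index` are illustrative and not part of the notebook):

```python
def to_row_col(index):
    """Map a flat index 0..8 to (row, col) in row-major order."""
    return divmod(index, 3)

def to_index(row, col):
    """Map (row, col) back to the flat index."""
    return 3 * row + col

print(to_row_col(4))   # (1, 1), the center cell
print(to_index(2, 0))  # 6, the bottom-left cell
```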

Some helper functions.

In [24]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [25]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n
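The same check can also be written without NumPy by listing the eight winning lines explicitly. This is a sketch of an alternative, not the notebook's implementation; `WIN_LINES` and `check_win_lines` are names introduced here for illustration.

```python
# All eight winning lines as index triples into the flat board.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def check_win_lines(board):
    """Return 'x' or 'o' for a win, 'd' for a draw, 'n' if the game continues."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return 'd' if ' ' not in board else 'n'

print(check_win_lines(['x', ' ', ' ', 'x', ' ', ' ', 'x', ' ', ' ']))  # x
print(check_win_lines([' '] * 9))                                      # n
```

Because it avoids building arrays on every call, a version like this may be preferable when the check runs many times during playouts.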


In [26]:
def get_actions(board):
    """return possible actions as a list of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [27]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [28]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [29]:
def utility(state, player = 'x'):
    """Check if a state is terminal and return the utility if it is. None means the state is not terminal."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


## Create Training Data using Playouts


We will try to learn a function $y = h(X)$ where $X$ is a board and $y$ is the utility. The data we need can be created by running playouts (complete games) and recording each board together with the outcome of its playout (win, draw, or loss).

To make the data useful for learning, I recode `x` and `o` into numbers.

In [30]:
tr = {' ': 0, 'x': 1, 'o': -1} # I translate the board into numbers

def encode_state(state):
    """Represent the board using numbers."""
    return [tr[s] for s in state]

def playout_record(record = 'x'):
    """Run a playout and record the boards after each move of the player `record`.
    The labels are the final utility from x's point of view."""
    state = empty_board()
    player = 'x'
    current_player = 'x'
    
    boards = []
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: return(boards, [u] * len(boards))
  
        # we use a random playout policy
        a = np.random.choice(get_actions(state))
        state = result(state, current_player, a)   
  
        if current_player == record:
            boards.append(encode_state(state))

        # switch between players
        current_player = other(current_player)

playout_record()

([[0, 1, 0, 0, 0, 0, 0, 0, 0],
  [0, 1, -1, 0, 0, 0, 1, 0, 0],
  [0, 1, -1, 1, 0, 0, 1, 0, -1],
  [0, 1, -1, 1, -1, 0, 1, 1, -1]],
 [-1, -1, -1, -1])
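The playout above draws its moves from NumPy's global random state, so every run produces different games. The following sketch mirrors the same logic in plain Python with a seedable generator, so a data set can be regenerated exactly. The names `outcome` and `seeded_playout` are introduced here for illustration, and boards are recorded after x's moves only.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
ENCODE = {' ': 0, 'x': 1, 'o': -1}

def outcome(board):
    """Utility for x: +1 win, -1 loss, 0 draw, None if not terminal."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return 1 if board[a] == 'x' else -1
    return 0 if ' ' not in board else None

def seeded_playout(rng):
    """Random playout; returns (encoded boards after x's moves, labels)."""
    board, player, boards = [' '] * 9, 'x', []
    while True:
        u = outcome(board)
        if u is not None:
            return boards, [u] * len(boards)
        move = rng.choice([i for i, c in enumerate(board) if c == ' '])
        board[move] = player
        if player == 'x':
            boards.append([ENCODE[c] for c in board])  # store an encoded copy
        player = 'o' if player == 'x' else 'x'

boards_a, labels_a = seeded_playout(random.Random(42))
boards_b, labels_b = seeded_playout(random.Random(42))
print(boards_a == boards_b and labels_a == labels_b)  # True: same seed, same game
```

With the NumPy-based version, calling `np.random.seed` before generating the data should have a similar effect.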

Run `N` playouts and create a pandas DataFrame for `X` and a NumPy array for `y`.

In [31]:
import pandas as pd

def create_data(N = 100, record = 'x'):
    board = []
    utility = []
    
    for i in range(N):
        b, u = playout_record(record)
        board.extend(b)
        utility.extend(u)
        
    return {'X': pd.DataFrame(board), 'y': np.array(utility)}
        
data = create_data(1000)
X = data['X']
y = data['y']

print("X")
display(X)

print("y")
display(y[0:10])

X


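| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | -1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | -1 | 1 | 1 | 0 | 0 | -1 |
| 3 | 0 | 1 | 1 | -1 | 1 | 1 | 0 | -1 | -1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4157 | 0 | 0 | 1 | -1 | 0 | 0 | 0 | 0 | 1 |
| 4158 | 0 | 0 | 1 | -1 | 0 | -1 | 1 | 0 | 1 |
| 4159 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4160 | 0 | 0 | 0 | 0 | 1 | -1 | 0 | 0 | 1 |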


y


array([-1, -1, -1, -1, -1, -1, -1, -1,  1,  1])
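Because the playouts are random, the same board can occur many times with different labels. For example, the board with a single x in the center appears in rows 0 and 4159 of `X` above. One way to inspect this label noise is to group identical boards and average their labels. The sketch below uses a small hypothetical data set, and the function name `mean_utility_per_board` is introduced here for illustration.

```python
from collections import defaultdict

def mean_utility_per_board(boards, labels):
    """Group identical boards and average their playout labels."""
    totals = defaultdict(lambda: [0, 0])  # board -> [sum of labels, count]
    for board, label in zip(boards, labels):
        entry = totals[tuple(board)]
        entry[0] += label
        entry[1] += 1
    return {board: s / n for board, (s, n) in totals.items()}

# Hypothetical toy data: the center opening seen three times with different outcomes.
boards = [
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
]
labels = [1, -1, 1, 0]

for board, mean in mean_utility_per_board(boards, labels).items():
    print(board, round(mean, 2))
```

The averaged value can be read as an estimate of the expected utility of a board under random play, which is the quantity a classifier trained on the raw labels can only approximate.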

## Train a Model

In [32]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

Split the data in training and testing data.

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We train an artificial neural network (ANN) $h(X)$ to approximate the true utility function $f(X)$. See
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [34]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
                    hidden_layer_sizes = (100),   # one hidden layer with 100 units; (100) is just the int 100
                    #verbose = True,
                    max_iter = 1000)              # max. number of iterations
                    
%time clf.fit(X_train, y_train)

CPU times: user 9.81 s, sys: 2.93 ms, total: 9.81 s
Wall time: 9.82 s


MLPClassifier(hidden_layer_sizes=100, max_iter=1000)

Test the model against the test data.

In [35]:
pred = clf.predict(X_test)

print("y_test:\t", list(y_test)[0:10])
print("pred:\t",   pred[0:10])

# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
print("Accuracy:", accuracy_score(y_test, pred))

y_test:	 [0, 1, 1, 1, 1, -1, 1, 1, 1, -1]
pred:	 [-1  1  1  0 -1 -1  1  1 -1 -1]
Confusion matrix:
 [[ 56   4 129]
 [ 16  44  56]
 [ 59  16 453]]
Accuracy: 0.6638655462184874


__Note:__ The accuracy is modest because many boards contain only a few moves. Such early boards can still lead to a win, a loss, or a draw, so their labels are noisy and they account for many of the errors.
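A useful reference point is the accuracy of always predicting the most common label. The counts below are read off the confusion matrix above: the diagonal gives the correct predictions, and the test set contains 189 boards labeled -1, 116 labeled 0, and 528 labeled 1. This is a back-of-the-envelope sketch, not part of the original notebook.

```python
# Counts taken from the confusion matrix shown above (test set of 833 boards).
true_counts = {-1: 189, 0: 116, 1: 528}   # boards per true label
correct = 56 + 44 + 453                   # diagonal of the matrix

total = sum(true_counts.values())
majority_label = max(true_counts, key=true_counts.get)
baseline = true_counts[majority_label] / total
accuracy = correct / total

print(f"majority label:    {majority_label}")
print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {accuracy:.3f}")
```

On these numbers the model is only a few percentage points above the majority baseline, which fits the observation that early boards are hard to label.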

Here is the number of empty cells for each board in the test set.

In [36]:
(X_test == 0).sum(axis=1)

3433    4
502     0
1380    4
2639    2
291     6
       ..
1629    2
514     6
2857    6
449     6
1745    6
Length: 833, dtype: int64

Test only on boards that have at most two empty cells left to play.

In [37]:
take = list((X_test == 0).sum(axis=1)<=2)

X_test2 = X_test[take]
y_test2 = y_test[take]

In [38]:
pred2 = clf.predict(X_test2)
print("Accuracy:", accuracy_score(pred2, y_test2))

Accuracy: 0.8207547169811321
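The comparison can be extended to every stage of the game by grouping the test boards by their number of empty cells. The sketch below is plain Python on a small hypothetical data set; `accuracy_by_empty_cells` is a name introduced here for illustration.

```python
from collections import defaultdict

def accuracy_by_empty_cells(boards, y_true, y_pred):
    """Return {number of empty cells: accuracy} for encoded boards (0 = empty)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for board, truth, guess in zip(boards, y_true, y_pred):
        empty = sum(1 for cell in board if cell == 0)
        counts[empty] += 1
        hits[empty] += int(truth == guess)
    return {k: hits[k] / counts[k] for k in sorted(counts)}

# Hypothetical toy data, only to show the shape of the result.
boards = [
    [1, -1, 1, -1, 1, -1, -1, 1, 1],   # 0 empty cells
    [1, -1, 1, -1, 1, -1, 0, 0, 1],    # 2 empty cells
    [1, -1, 0, 0, 1, -1, 0, 0, 1],     # 4 empty cells
    [1, 0, 0, 0, 0, 0, 0, 0, 0],       # 8 empty cells
    [0, 0, 0, 0, 1, 0, 0, 0, 0],       # 8 empty cells
]
y_true = [1, 1, 1, -1, 1]
y_pred = [1, 1, 1, 1, 1]

print(accuracy_by_empty_cells(boards, y_true, y_pred))
```

With the notebook's objects, the inputs would presumably be `X_test.values.tolist()`, `y_test`, and `pred`.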


## Some Tests

In [39]:
# x is about to win (play 8)

board = ['x', 'o', ' ',
         'o', 'x', ' ',
         ' ', ' ', ' ']

print("Board:")
show_board(board)

%time clf.predict(pd.DataFrame([encode_state(board)]))

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]
CPU times: user 21.3 ms, sys: 17 ms, total: 38.3 ms
Wall time: 7.03 ms


array([1])

In [43]:
# o will win

board = ['o', 'x', 'o',
         'x', 'o', 'x',
         ' ', ' ', ' ']
    
print("Board:")
show_board(board)

%time clf.predict(pd.DataFrame([encode_state(board)]))

Board:
[['o' 'x' 'o']
 ['x' 'o' 'x']
 [' ' ' ' ' ']]
CPU times: user 4.3 ms, sys: 25 µs, total: 4.33 ms
Wall time: 3.96 ms


array([-1])

In [44]:
# x can draw if it chooses 7.

board = ['x', 'o', 'x',
         ' ', 'o', ' ',
         ' ', ' ', ' ']

print("Board:")
show_board(board)

%time clf.predict(pd.DataFrame([encode_state(board)]))

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]
CPU times: user 6.07 ms, sys: 0 ns, total: 6.07 ms
Wall time: 5.65 ms


array([-1])

In [45]:
# o threatens the 0-4-8 diagonal, but x (to move) can block at 4 and then threatens both 3 and 6

board = ['o', ' ', 'x',
         ' ', ' ', 'x',
         ' ', ' ', 'o']

print("Board:")
show_board(board)

%time clf.predict(pd.DataFrame([encode_state(board)]))

Board:
[['o' ' ' 'x']
 [' ' ' ' 'x']
 [' ' ' ' 'o']]
CPU times: user 5.87 ms, sys: 153 µs, total: 6.03 ms
Wall time: 5.37 ms


array([1])

## Using the Predictions

The `predict` function can be used as:
   - the heuristic evaluation function for Heuristic Minimax Search.
   - a playout policy for better simulated games used in Pure Monte Carlo Search/Monte Carlo Tree Search.
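Both uses can be sketched with one small helper. Instead of the hard class label, the sketch turns the predicted class probabilities into an expected utility in [-1, 1], and a greedy policy picks the move whose resulting board scores best. It assumes a model that exposes `predict_proba` and `classes_`, as scikit-learn classifiers such as `MLPClassifier` do. `expected_utility`, `greedy_move`, and `StubModel` are names introduced here, and `StubModel` is only a hypothetical stand-in so the example runs without training.

```python
ENCODE = {' ': 0, 'x': 1, 'o': -1}

def expected_utility(model, board):
    """Heuristic value in [-1, 1]: sum of label * predicted probability."""
    probs = model.predict_proba([[ENCODE[c] for c in board]])[0]
    return float(sum(p * label for p, label in zip(probs, model.classes_)))

def greedy_move(model, board, player='x'):
    """Pick the move whose resulting board scores best for `player`."""
    sign = 1 if player == 'x' else -1   # the model scores boards from x's view
    best_move, best_value = None, float('-inf')
    for move in (i for i, c in enumerate(board) if c == ' '):
        child = list(board)
        child[move] = player
        value = sign * expected_utility(model, child)
        if value > best_value:
            best_move, best_value = move, value
    return best_move

class StubModel:
    """Hypothetical stand-in for the trained classifier: favors x in the center."""
    classes_ = [-1, 0, 1]

    def predict_proba(self, rows):
        return [[0.1, 0.1, 0.8] if row[4] == 1 else [0.4, 0.3, 0.3] for row in rows]

print(greedy_move(StubModel(), [' '] * 9))  # 4 with this stub
```

In a heuristic minimax search, `expected_utility` would be called at the depth cutoff. As a playout policy, one might sample moves in proportion to their scores rather than always taking the best one. Note that the model above was trained only on boards recorded after x's moves, so scores for boards after o's moves are extrapolations.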