# Adversarial Search: Solving Tic-Tac-Toe with Monte Carlo Tree Search

## Introduction 

Multiplayer games can be approached in several ways:
1. Nondeterministic actions: The opponent is seen as part of an environment with nondeterministic actions. Non-determinism is the result of the unknown opponent's moves. 
2. Optimal Decisions: Minimax search (search the complete game tree) and alpha-beta pruning.
3. Heuristic Alpha-Beta Tree Search: Cut off the tree search and use a heuristic to estimate the state value. 
4. __Monte Carlo Search:__ Simulate playouts to estimate state value. 

Here we will implement search for Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). The game is a __zero-sum game__: Win by x results in +1, win by o in -1 and a tie has a value of 0. Max plays x and tries to maximize the outcome while Min plays o and tries to minimize the outcome.   

We will implement Pure Monte Carlo Search enhanced with the upper confidence bound (UCB1) selection policy. That is, we use UCB1 to decide for which action to perform the next playout. This allows the algorithm to focus on actions where it needs to collect more information. Note that complete Upper Confidence Bounds applied to Trees (UCT) builds a search tree; to get full UCT, the expand step would need to be added to the code below. 

## The board

I represent the board as a vector of length 9, with the squares numbered 0 to 8 row by row. Each entry is one of `' '`, `'x'`, or `'o'`.  
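As a small illustration of this flat layout, the sketch below converts between a board index and a (row, column) pair. It assumes the squares are numbered row by row, as the examples in this notebook suggest; the helper names `to_row_col` and `to_index` are made up for this example and are not used elsewhere.

```python
def to_row_col(index, size=3):
    """Return the (row, column) pair for a flat board index."""
    return divmod(index, size)

def to_index(row, col, size=3):
    """Return the flat board index for a (row, column) pair."""
    return row * size + col

print(to_row_col(4))   # center square
print(to_index(2, 0))  # bottom-left corner
```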

In [1]:
import numpy as np
import math

In [2]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Some helper functions.

In [3]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [4]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [5]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [6]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [7]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [8]:
def utility(state, player = 'x'):
    """check if a state is terminal and return the utility if it is. None means not a terminal node."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


# Monte Carlo Search with Upper Confidence Bound

See AIMA page 163. 

We enhance pure Monte Carlo Search by using UCB1 as a __selection policy__. This can be seen as a 
restricted version of UCT: 
* Only build a tree of depth 1 and use the UCB1 selection policy.
* Use a random playout policy.
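
The selection rule can be written out explicitly. The formula below is the standard UCB1 score as it is computed later in `UCT_depth1`. The symbols $U(a)$, $N(a)$ and $N_{\text{parent}}$ are notation introduced here for the code variables `u[action_id]`, `n[action_id]` and `n_parent`, and $C = \sqrt{2}$ matches the constant `C`.

```latex
\mathrm{UCB1}(a) = \frac{U(a)}{N(a)} + C \sqrt{\frac{\ln N_{\text{parent}}}{N(a)}}
```

The first term favors actions with a high average utility (exploitation), while the second term grows for actions that have been tried rarely (exploration). Actions that have not been tried yet start with a score of $+\infty$ in the code, so each action gets at least one playout.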


__Note on the playout policy:__ Here we use a random playout policy, which amounts to a randomized search that works fine for this toy problem. For real applications you need to extend the code with a good __playout policy__ (e.g., manually created heuristics or a neural [network learned by self-play using reinforcement learning](https://towardsdatascience.com/how-to-teach-an-ai-to-play-games-deep-reinforcement-learning-28f9b920440a)).
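
To make the idea of a manually created heuristic concrete, here is a minimal sketch of one possible playout policy: complete your own line if you can, otherwise block the opponent, otherwise move randomly. It is self-contained and does not use the notebook's helpers; the names `LINES`, `completing_move` and `heuristic_playout_policy` are hypothetical, and the sketch assumes the state has at least one empty square.

```python
import random

# The eight winning lines of a 3x3 board stored as a flat list of 9 squares.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def completing_move(state, player):
    """Return a square that completes a line for player, or None."""
    for line in LINES:
        marks = [state[i] for i in line]
        if marks.count(player) == 2 and marks.count(' ') == 1:
            return line[marks.index(' ')]
    return None

def heuristic_playout_policy(state, player):
    """Hypothetical policy: win if possible, else block, else move randomly."""
    opponent = 'o' if player == 'x' else 'x'
    move = completing_move(state, player)        # 1. take a win
    if move is None:
        move = completing_move(state, opponent)  # 2. block the opponent
    if move is None:                             # 3. fall back to random
        move = random.choice([i for i, s in enumerate(state) if s == ' '])
    return move

state = ['x', 'x', ' ',
         'o', 'o', ' ',
         ' ', ' ', ' ']
print(heuristic_playout_policy(state, 'x'))  # x can complete the top row
print(heuristic_playout_policy(state, 'o'))  # o can complete the middle row
```

A policy like this could replace the `np.random.choice` call inside a playout loop. Whether it actually improves play for a given simulation budget would have to be measured.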

## Simulate playouts

In [9]:
def playout(state, action, player = 'x'):
    """Perform a random playout starting with the given action on the given board 
    and return the utility of the finished game."""
    state = result(state, player, action)
    current_player = other(player)
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: 
            return u
        
        # we use a random playout policy
        a = np.random.choice(get_actions(state))
        state = result(state, current_player, a)
        #print(state)
        
        # switch between players
        current_player = other(current_player)


board = empty_board()
print(playout(board, 0))
print(playout(board, 0))
print(playout(board, 0))

0
-1
1


## Upper Confidence Bound Applied to Trees (Restricted to Depth 1)

In [10]:
import pandas as pd

DEBUG = 1

def UCT_depth1(board, N = 100, player = 'x'):
    """Upper Confidence bound applied to Trees for limited tree depth of 1. 
    Simulation budget is N playouts."""
    global DEBUG
    
    C = math.sqrt(2)
    
    # the tree is 1 action deep
    actions = get_actions(board)
    
    u = [0] * len(actions) # total utility through actions
    n = [0] * len(actions) # number of playouts through actions
    n_parent = 0 # total playouts so far (i.e., number of playouts through parent)
    
    UCB1 = [+math.inf] * len(actions) 
    
    for i in range(N):

        # Select
        action_id = UCB1.index(max(UCB1))
    
        # Expand
        # UCT would expand the tree. We keep the tree at depth 1, essentially performing
        # Pure Monte Carlo search with an added UCB1 selection policy. 
        
        # Simulate
        p = playout(board, actions[action_id], player = player)
    
        # Back-Propagate (i.e., update counts and UCB1)
        u[action_id] += p
        n[action_id] += 1
        n_parent += 1
        
        for action_id in range(len(actions)):
            if n[action_id] > 0:
                UCB1[action_id] = u[action_id]/n[action_id] + C * math.sqrt(math.log(n_parent)/n[action_id])
    
    # return action with largest number of playouts 
    action = actions[n.index(max(n))]
    
    if DEBUG >= 1: 
        print(pd.DataFrame({'action':actions, 
                            'total utility':u, 
                            '# of playouts':n, 
                            'UCB1':UCB1}))
        print()
        print(f"Best action: {action}")
    
    
    return action

In [11]:
board = empty_board()
display(board)

%timeit -n 1 -r 1 UCT_depth1(board, N = 1000)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

   action  total utility  # of playouts      UCB1
0       0             14             62  0.697856
1       1             21             77  0.696310
2       2             51            135  0.697680
3       3              0             29  0.690215
4       4            234            446  0.700665
5       5              0             30  0.678614
6       6             44            122  0.697170
7       7             -2             25  0.663384
8       8             19             74  0.688840

Best action: 4
639 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
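
For comparison, the following is a rough sketch of what adding the expand step could look like, so that the search grows a tree instead of stopping at depth 1. It is not the notebook's implementation: it is self-contained, and the names `Node`, `winner`, `ucb1` and `uct` are assumptions made for this example. It also assumes that the given state is not already terminal.

```python
import math
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(state):
    """Return 'x' or 'o' for a win, 'd' for a draw, None if the game goes on."""
    for a, b, c in LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return 'd' if ' ' not in state else None

def other(player):
    return 'o' if player == 'x' else 'x'

class Node:
    """Tree node; `player` is the side to move in `state`."""
    def __init__(self, state, player, parent=None, action=None):
        self.state, self.player = state, player
        self.parent, self.action = parent, action
        self.children = []
        # actions not yet expanded into children (none for terminal states)
        terminal = winner(state) is not None
        self.untried = [] if terminal else [i for i, s in enumerate(state) if s == ' ']
        self.n = 0    # number of playouts through this node
        self.u = 0.0  # total utility for the player who moved into this node

def ucb1(node, C=math.sqrt(2)):
    return node.u / node.n + C * math.sqrt(math.log(node.parent.n) / node.n)

def uct(state, player='x', N=1000):
    """Return the most-visited root action after N playouts."""
    root = Node(list(state), player)
    for _ in range(N):
        node = root
        # Select: follow UCB1 while the node is fully expanded and has children
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # Expand: add one new child for a randomly chosen untried action
        if node.untried:
            action = node.untried.pop(random.randrange(len(node.untried)))
            child_state = list(node.state)
            child_state[action] = node.player
            child = Node(child_state, other(node.player), node, action)
            node.children.append(child)
            node = child
        # Simulate: random playout from the new node
        sim, mover = list(node.state), node.player
        while winner(sim) is None:
            sim[random.choice([i for i, s in enumerate(sim) if s == ' '])] = mover
            mover = other(mover)
        w = winner(sim)
        # Back-propagate: score each node for the player who moved into it
        while node is not None:
            node.n += 1
            if w != 'd':
                node.u += 1 if w == other(node.player) else -1
            node = node.parent
    return max(root.children, key=lambda c: c.n).action

# o threatens the line 1-4-7, so square 7 is the only move for x that does not lose at once
board = ['x', 'o', 'x', ' ', 'o', ' ', ' ', ' ', ' ']
print(uct(board, 'x', N=2000))
```

Storing the utility from the point of view of the player who moved into a node lets every level of the tree maximize the same UCB1 score, which is one common way to handle the alternating players.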


## Some Tests

In [12]:
# x is about to win (play 8)

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'

print("Board:")
show_board(board)

print()
%timeit -n1 -r1 UCT_depth1(board)

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       2             17             21  1.471783
1       5              7             12  1.459420
2       6             11             16  1.446214
3       7              9             14  1.453956
4       8             37             37  1.498927

Best action: 8
29.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
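
The `UCB1` column in the table above can be checked by hand. The sketch below recomputes two of its entries from the printed totals, assuming the default budget of 100 playouts in total (the `# of playouts` column sums to 100). The standalone function `ucb1` is written for this example and mirrors the expression inside `UCT_depth1`.

```python
import math

def ucb1(total_utility, playouts, parent_playouts, C=math.sqrt(2)):
    """UCB1 score: average utility plus an exploration bonus."""
    exploit = total_utility / playouts
    explore = C * math.sqrt(math.log(parent_playouts) / playouts)
    return exploit + explore

# Row for action 8: total utility 37 over 37 playouts
print(round(ucb1(37, 37, 100), 6))
# Row for action 2: total utility 17 over 21 playouts
print(round(ucb1(17, 21, 100), 6))
```

The printed values should agree with the table entries for actions 8 and 2 up to rounding. Action 8 wins every playout, so its average utility is 1 and the rest of its score is the exploration bonus.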


In [13]:
# o is about to win

board = empty_board() 
board[0] = 'o'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'
board[8] = 'x'

print("Board:")
show_board(board)

print()
%timeit -n1 -r1 UCT_depth1(board, N = 1000)

Board:
[['o' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' 'x']]

   action  total utility  # of playouts      UCB1
0       2            -21            611  0.116001
1       5            -15             21  0.096813
2       6            -29            353  0.115679
3       7            -13             15  0.093039

Best action: 2
221 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [14]:
# x can draw if it chooses 7.

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[2] = 'x'
board[4] = 'o'

print("Board:")
show_board(board)

print()
%timeit -n1 -r1 UCT_depth1(board, N = 1000)

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       3            -13             60  0.263186
1       5            -13             62  0.262372
2       6            -12             24  0.258714
3       7            118            800  0.278913
4       8            -13             54  0.265068

Best action: 7
382 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [15]:
# o went first

board = empty_board() 
board[4] = 'o'

print("Board:")
show_board(board)


print()
%timeit -n1 -r1 UCT_depth1(board)

Board:
[[' ' ' ' ' ']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       0             -5             15  0.450263
1       1             -5             17  0.441943
2       2             -5             15  0.450263
3       3             -5             17  0.441943
4       5             -6             10  0.359705
5       6             -5              5  0.357228
6       7             -6             15  0.383596
7       8             -5              6  0.405641

Best action: 1
66 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [16]:
# Empty board: Only a draw can be guaranteed

board = empty_board() 

print("Board:")
show_board(board)


print()
%timeit -n1 -r1 UCT_depth1(board, N = 100)
%timeit -n1 -r1 UCT_depth1(board, N = 5000)

Board:
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       0              7             15  1.250263
1       1              1              8  1.197983
2       2              3             10  1.259705
3       3              6             13  1.303256
4       4              6             15  1.183596
5       5              1              8  1.197983
6       6             11             19  1.275191
7       7             -2              2  1.145966
8       8              3             10  1.259705

Best action: 6
75.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
   action  total utility  # of playouts      UCB1
0       0            126            371  0.553900
1       1             45            181  0.555397
2       2            189            510  0.553347
3       3              7             80  0.548943
4       4           1286           2674  0.560742
5       5             27            135  0.555219
6       6      

In [17]:
# A bad situation

board = empty_board() 
board[0] = 'o'
board[2] = 'x'
board[8] = 'o'

print("Board:")
show_board(board)


print()
display(UCT_depth1(board))

Board:
[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

   action  total utility  # of playouts      UCB1
0       1             -3              3  0.752174
1       3             -3              6  0.738974
2       4             29             66  0.812959
3       5             -3              7  0.718496
4       6             -1             15  0.716929
5       7             -3              3  0.752174

Best action: 4


4

__Note:__ It looks like the random playout player o is very unlikely to both block x and take advantage of the trap by playing the bottom-left corner!
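
How bad the situation really is can be checked with an exhaustive minimax search, which is cheap for Tic-Tac-Toe. The sketch below is self-contained and separate from the notebook's code. The names `winner` and `minimax` are assumptions for this example, and the board is stored as a tuple so that results can be cached.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(state):
    """Return 'x' or 'o' for a win, 'd' for a draw, None if the game goes on."""
    for a, b, c in LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return 'd' if ' ' not in state else None

@lru_cache(maxsize=None)
def minimax(state, player):
    """Value of `state` for x (+1 win, 0 draw, -1 loss) with `player` to move."""
    w = winner(state)
    if w is not None:
        return {'x': 1, 'o': -1, 'd': 0}[w]
    next_player = 'o' if player == 'x' else 'x'
    values = [minimax(state[:i] + (player,) + state[i + 1:], next_player)
              for i, s in enumerate(state) if s == ' ']
    # x maximizes the value, o minimizes it
    return max(values) if player == 'x' else min(values)

board = ('o', ' ', 'x',
         ' ', ' ', ' ',
         ' ', ' ', 'o')
print(minimax(board, 'x'))
```

Under these assumptions the search returns -1: x has to take the center to stop the diagonal, o answers in the bottom-left corner, and o then has two winning squares. The positive total utility for action 4 in the table above therefore reflects the weak random opponent in the playouts, not the value of the position against optimal play.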

## Experiments


### Baseline: Randomized Player

A completely randomized player agent should be a weak baseline.

In [18]:
def random_player(board, player = None):
    """Simple player that chooses a random empty square. player is unused"""
    return np.random.choice(get_actions(board))

show_board(board)
random_player(board)

[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]


5

### The Environment

Implement the environment that calls the agent. The percept is the board and the action is the move.
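
In other words, an agent is any function that takes the board (and the symbol it plays) and returns the index of an empty square. As a minimal sketch of that interface, here is a deliberately naive agent; the name `first_free_player` is made up for this example.

```python
def first_free_player(board, player=None):
    """Hypothetical agent: always take the lowest-numbered empty square."""
    return board.index(' ')

board = ['x', 'o', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
print(first_free_player(board, 'x'))
```

Any function with this signature can be plugged into the environment in place of a search-based player.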

In [19]:
DEBUG = 1

def switch_player(player, x, o):
    if player == 'x':
        return 'o', o
    else:
        return 'x', x

def play(x, o, N = 100):
    results = {'x': 0, 'o': 0, 'd': 0}
    for i in range(N):
        board = empty_board()
        player, fun = 'x', x
        
        while True:
            a = fun(board, player)
            board = result(board, player, a)
            
            win = check_win(board)
            if win != 'n':
                if DEBUG >= 1: print(f"{board} winner: {win}")
                results[win] += 1
                break
            
            player, fun = switch_player(player, x, o)   
 
    return results

### Random vs. Random

In [20]:
DEBUG = 0

%timeit -n 1 -r 1 display(play(random_player, random_player))

{'x': 57, 'o': 31, 'd': 12}

83.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Monte Carlo Search with UCB1 vs. Random

In [28]:
def ucb1_10_player(board, player = 'x'):
    action = UCT_depth1(board, N = 10, player = player)
    return action

def ucb1_100_player(board, player = 'x'):
    action = UCT_depth1(board, N = 100, player = player)
    return action



DEBUG = 1
print("UCB1 (10) vs. random:")
display(play(ucb1_10_player, random_player, N = 1))

UCB1 (10) vs. random:
   action  total utility  # of playouts      UCB1
0       0              0              1  2.145966
1       1             -1              1  1.145966
2       2             -1              1  1.145966
3       3             -1              1  1.145966
4       4              2              2  2.517427
5       5             -1              1  1.145966
6       6              1              1  3.145966
7       7              1              1  3.145966
8       8             -1              1  1.145966

Best action: 4
   action  total utility  # of playouts      UCB1
0       0              0              2  1.517427
1       1             -1              1  1.145966
2       3              0              2  1.517427
3       5              2              2  2.517427
4       6              1              1  3.145966
5       7              1              1  3.145966
6       8             -1              1  1.145966

Best action: 0
   action  total utility  # of playouts      U

{'x': 1, 'o': 0, 'd': 0}

In [29]:
DEBUG = 0
print("UCB1 (10) vs. random:")
%timeit -n1 -r1 display(play(ucb1_10_player, random_player))

print()
print("random vs. UCB1 (10):")
%timeit -n1 -r1 display(play(random_player, ucb1_10_player))

UCB1 (10) vs. random:


{'x': 90, 'o': 4, 'd': 6}

1.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. UCB1 (10):


{'x': 26, 'o': 64, 'd': 10}

1.13 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [30]:
DEBUG = 0
print("UCB1 (100) vs. random:")
%timeit -n 1 -r 1 display(play(ucb1_100_player, random_player))

print()
print("random vs. UCB1 (100):")
%timeit -n 1 -r 1 display(play(random_player, ucb1_100_player))

UCB1 (100) vs. random:


{'x': 97, 'o': 0, 'd': 3}

12.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. UCB1 (100):


{'x': 8, 'o': 84, 'd': 8}

10.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [31]:
DEBUG = 0
print("UCB1 (100) vs. UCB1 (10):")
%timeit -n 1 -r 1 display(play(ucb1_100_player, ucb1_10_player))

print()
print("UCB1 (10) vs. UCB1 (100):")
%timeit -n 1 -r 1 display(play(ucb1_10_player, ucb1_100_player))

UCB1 (100) vs. UCB1 (10):


{'x': 88, 'o': 1, 'd': 11}

14.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

UCB1 (10) vs. UCB1 (100):


{'x': 28, 'o': 54, 'd': 18}

11.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
