# Adversarial Search: Solving Tic-Tac-Toe with Monte Carlo Tree Search

## Introduction 

Search for multiplayer games can be approached in several ways:
1. Nondeterministic actions: The opponent is seen as part of an environment with nondeterministic actions. Non-determinism is the result of the unknown opponent's moves. 
2. Optimal Decisions: Minimax search (search complete game tree) and alpha-beta pruning.
3. Heuristic Alpha-Beta Tree Search: Cut off tree search and use heuristic to estimate state value. 
4. __Monte Carlo Search:__ Simulate playouts to estimate state value. 

Here we will implement search for Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). The game is a __zero-sum game__: Win by x results in +1, win by o in -1 and a tie has a value of 0. Max plays x and tries to maximize the outcome while Min plays o and tries to minimize the outcome.   

We will implement
* Monte Carlo Tree Search with upper confidence bound (UCB1).

## The board

I represent the board as a vector of length 9 in row-major order (index 0 is the top-left square, index 8 the bottom-right). Each entry is one of `' '`, `'x'`, or `'o'`.
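To make the indexing concrete, here is a small sketch of how a position in the length-9 vector maps to a row and column. It assumes the row-major layout that `show_board` produces with `reshape((3,3))`; the helper names `index_to_row_col` and `row_col_to_index` are illustrative and not part of the notebook.

```python
def index_to_row_col(index):
    """Map a board index 0..8 to (row, column), assuming row-major order."""
    return divmod(index, 3)

def row_col_to_index(row, col):
    """Inverse mapping: (row, column) back to the index in the length-9 vector."""
    return 3 * row + col

print(index_to_row_col(4))     # (1, 1), the center square
print(row_col_to_index(2, 0))  # 6, the bottom-left corner
```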

In [2]:
import numpy as np
import pandas as pd
import math

In [3]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Some helper functions.

In [4]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [5]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [6]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [7]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [8]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [9]:
def utility(state, player = 'x'):
    """Check if a state is terminal and return the utility if it is. None means the state is not terminal."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


# Monte Carlo Tree Search with Upper Confidence Bound

See AIMA page 163. 

* Use a random playout policy.

__Important note:__ We use a random playout policy here, so the search is essentially a randomized search. That works fine for this toy problem, but real applications need a good __playout policy__ (e.g., one learned by self-play).
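As an illustration of what a less naive playout policy could look like, the sketch below uses a simple hand-written heuristic: win immediately if possible, otherwise block the opponent's immediate win, otherwise move randomly. This is an assumption made for illustration only (it is not a learned policy and not part of the notebook), and the names `LINES`, `completes_line` and `heuristic_playout_policy` are hypothetical.

```python
import random

# All eight winning lines on the row-major 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def completes_line(board, player, action):
    """True if placing `player` at `action` completes three in a row."""
    b = board.copy()
    b[action] = player
    return any(b[i] == b[j] == b[k] == player for i, j, k in LINES)

def heuristic_playout_policy(board, player):
    """Pick a move: win now if possible, else block the opponent's win, else random."""
    actions = [i for i, v in enumerate(board) if v == ' ']
    opponent = 'o' if player == 'x' else 'x'
    # First look for our own winning move, then for a move the opponent would win with.
    for who in (player, opponent):
        for a in actions:
            if completes_line(board, who, a):
                return a
    return random.choice(actions)
```

Such a policy could be used in `playout` in place of `np.random.choice(get_actions(state))`, called as `heuristic_playout_policy(state, current_player)`.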

## Simulate a Playout

In [10]:
def playout(state, action, player = 'x'):
    """Perform a random playout starting with the given action on the given board 
    and return the utility of the finished game."""
    state = result(state, player, action)
    current_player = other(player)
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: 
            return u
        
        # we use a random playout policy
        a = np.random.choice(get_actions(state))
        state = result(state, current_player, a)
        #print(state)
        
        # switch between players
        current_player = other(current_player)


board = empty_board()
display([ playout(board, 0) for i in range(20) ])

[-1, -1, 0, 1, 1, 0, 1, -1, 1, -1, 1, -1, -1, 1, -1, 1, -1, -1, 1, 1]

## Upper Confidence Bound Applied to Trees
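The UCB1 score adds an exploration bonus to a node's average utility. The standalone sketch below computes it with the same formula used in the code of this section; the function name `ucb1` and its argument names are illustrative. The two calls reproduce UCB1 values from the first table in the Some Tests section (100 playouts in total, `C = sqrt(2)`).

```python
import math

def ucb1(total_utility, n, n_parent, C=math.sqrt(2)):
    """Average utility (exploitation) plus C * sqrt(ln(n_parent) / n) (exploration)."""
    if n < 1:
        return math.inf   # unvisited nodes are tried first
    return total_utility / n + C * math.sqrt(math.log(n_parent) / n)

print(f"{ucb1(17, 20, 100):.6f}")  # 1.528614 (action 2: 17 total utility in 20 playouts)
print(f"{ucb1(33, 33, 100):.6f}")  # 1.528300 (action 8: 33 total utility in 33 playouts)
```

Action 8 has the higher average utility (1.0 vs. 0.85), but its exploration bonus is smaller because it has been played more often, so the two scores end up almost equal.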

Tree node definition.

In [20]:
class UCT_Node:
    def __init__(self, state, parent):
        self.state = state.copy()     # state description (a board)
        
        self.u = 0                    # utility
        self.n = 0                    # count
        
        self.parent = parent          # the parent
        self.children = {}            # dictionary with child nodes; keys are actions.
    
    def __str__(self):
        return f"UCT Node for state {self.state} with {len(self.children.keys())} children: ({self.u}/{self.n})"
    
    def UCB1(self, C = math.sqrt(2)):
        """UCB1 score; unvisited nodes and the root (no parent) get +inf."""
        if(self.n < 1): return +math.inf
        if(self.parent is None): return +math.inf
        return self.u/self.n + C * math.sqrt(math.log(self.parent.n)/self.n)
    
n = UCT_Node(empty_board(), None)
n.n = 10
n.u = 5
print(n)

print(n.UCB1())

UCT Node for state [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '] with 0 children: (5/10)
inf


In [25]:
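DEBUG = 1


class UCT:
    def __init__(self, C = math.sqrt(2)):
        self.tree = None
        self.C = C
        
    def _ucb1(self, node):
        """UCB1 of a non-root node; unvisited nodes get +inf so they are tried first."""
        if node.n < 1: return +math.inf
        return node.u/node.n + self.C * math.sqrt(math.log(node.parent.n)/node.n)
        
    def search(self, board, N = 100, player = 'x'):
        """Upper Confidence bound applied to Trees. 
        Simulation budget is N playouts (N >= 1). Returns None for a terminal board."""
        if utility(board, player) is not None: return None
     
        self.tree = UCT_Node(board, None)
    
        for i in range(N):

            # Select: follow the highest UCB1 down to a node that is terminal or not fully expanded
            node, to_move = self.tree, player
            while (utility(node.state, player) is None and 
                   len(node.children) == len(get_actions(node.state))):
                node = max(node.children.values(), key = self._ucb1)
                to_move = other(to_move)
    
            if utility(node.state, player) is None:
                # Expand: add the first untried action as a successor
                action = [a for a in get_actions(node.state) if a not in node.children][0]
                node.children[action] = UCT_Node(result(node.state, to_move, action), node)
        
                # Simulate: the playout plays the action and returns the utility for to_move
                p = playout(node.state, action, player = to_move)
                node = node.children[action]
            else:
                # Terminal node: utility for the player who made the last move
                p = utility(node.state, other(to_move))
    
            # Back-Propagate: node.u is kept from the view of the player who moved
            # into the node, so the sign flips at every level
            while node is not None:
                node.n += 1
                node.u += p
                p = -p
                node = node.parent
    
        children = self.tree.children
        if DEBUG >= 1: 
            print(pd.DataFrame({'action': list(children.keys()), 
                                'total utility': [c.u for c in children.values()], 
                                '# of playouts': [c.n for c in children.values()], 
                                'UCB1': [self._ucb1(c) for c in children.values()]}))
    
        # return action with largest number of playouts
        return max(children, key = lambda a: children[a].n)

board = empty_board()
display(board)

uct = UCT()

%timeit -n 1 -r 1 display(uct.search(board, N = 100))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']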

## Some Tests
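The tests in this section call `UCT_depth1`, which is not defined in this notebook. The block below is only a sketch of what such a function could look like, assuming it applies UCB1 to the actions of the root alone (no deeper tree) and returns the action with the most playouts. It is an assumption for illustration, not the original implementation; the helpers `_winner` and `_random_playout` are hypothetical, and the sketch uses the standard `random` module so that it runs on its own.

```python
import math
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def _winner(board):
    """Return 'x' or 'o' for a win, 'd' for a draw, 'n' if the game continues."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return 'd' if ' ' not in board else 'n'

def _random_playout(board, action, player):
    """Play `action` for `player`, finish the game randomly, return +1/0/-1 for `player`."""
    board = board.copy()
    board[action] = player
    current = 'o' if player == 'x' else 'x'
    while True:
        w = _winner(board)
        if w != 'n':
            return 0 if w == 'd' else (1 if w == player else -1)
        empty = [i for i, v in enumerate(board) if v == ' ']
        board[random.choice(empty)] = current
        current = 'o' if current == 'x' else 'x'

def UCT_depth1(board, N=100, player='x', C=math.sqrt(2)):
    """UCB1 over the root's actions only; returns the most-played action."""
    actions = [i for i, v in enumerate(board) if v == ' ']
    u = {a: 0 for a in actions}   # total utility per action
    n = {a: 0 for a in actions}   # number of playouts per action

    def ucb1(a, n_parent):
        if n[a] == 0:
            return math.inf       # try every action at least once
        return u[a] / n[a] + C * math.sqrt(math.log(n_parent) / n[a])

    for i in range(N):            # i is the number of playouts done so far
        a = max(actions, key=lambda act: ucb1(act, i))
        u[a] += _random_playout(board, a, player)
        n[a] += 1
    return max(actions, key=lambda act: n[act])
```

Unlike the function used for the outputs below, this sketch does not print the per-action table.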

In [105]:
# x is about to win (play 8)

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'

print("Board:")
show_board(board)

print()
display(UCT_depth1(board ))

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       2             17             20  1.528614
1       5              6             11  1.460498
2       6             17             20  1.528614
3       7             12             16  1.508714
4       8             33             33  1.528300


8

In [106]:
# o is about to win

board = empty_board() 
board[0] = 'o'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'
board[8] = 'x'

print("Board:")
show_board(board)

print()
display(UCT_depth1(board))

Board:
[['o' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' 'x']]

   action  total utility  # of playouts      UCB1
0       2             -4             20  0.478614
1       5             -5              7  0.432781
2       6             10             68  0.515089
3       7             -5              5  0.357228


6

In [107]:
#### x can draw if it chooses 7.

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[2] = 'x'
board[4] = 'o'

print("Board:")
show_board(board)

print()
display(UCT_depth1(board))

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       3             -3             13  0.610948
1       5             -4              9  0.567174
2       6             -2             16  0.633714
3       7             14             52  0.690089
4       8             -4             10  0.559705


7

In [108]:
# o went first

board = empty_board() 
board[4] = 'o'

print("Board:")
show_board(board)


print()
display(UCT_depth1(board))

Board:
[[' ' ' ' ' ']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       0             -7              9  0.233840
1       1             -6              6  0.238974
2       2             -8             18  0.270878
3       3             -7             15  0.316929
4       5             -8             18  0.270878
5       6             -7             15  0.316929
6       7             -7             10  0.259705
7       8             -7              9  0.233840


2

In [115]:
# Empty board: Only a draw can be guaranteed

board = empty_board() 

print("Board:")
show_board(board)


print()
%timeit -n 1 -r 1 display(UCT_depth1(board, N = 1000))

Board:
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

   action  total utility  # of playouts      UCB1
0       0             57            165  0.634817
1       1             43            136  0.634900
2       2             13             70  0.629971
3       3              3             45  0.620753
4       4            145            328  0.647306
5       5             13             70  0.629971
6       6              4             46  0.634987
7       7             32            112  0.636930
8       8             -2             28  0.631004


4

660 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [114]:
# A bad situation

board = empty_board() 
board[0] = 'o'
board[2] = 'x'
board[8] = 'o'

print("Board:")
show_board(board)


print()
display(UCT_depth1(board))

Board:
[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

   action  total utility  # of playouts      UCB1
0       1             -6              8  0.322983
1       3             -7             15  0.316929
2       4             -3             44  0.389340
3       5             -7             13  0.303256
4       6             -6              8  0.322983
5       7             -7             12  0.292754


4

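__Note:__ x has to play square 4 to block o, yet the estimate for that move is only slightly negative. It looks like a random player o is very unlikely to answer by blocking x in the bottom-left corner (square 6), the move that springs the trap by giving o two winning threats (squares 3 and 7).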

## Experiments


### Baseline: Randomized Player

A completely randomized player agent should be a weak baseline.

In [116]:
def random_player(board, player = None):
    """Simple player that chooses a random empty square. player is unused"""
    return np.random.choice(get_actions(board))

show_board(board)
random_player(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]


0

### The Environment

Implement the environment that calls the agents in turn. The percept is the board and the action is a move.

In [117]:
DEBUG = 1

def switch_player(player, x, o):
    if player == 'x':
        return 'o', o
    else:
        return 'x', x

def play(x, o, N = 100):
    results = {'x': 0, 'o': 0, 'd': 0}
    for i in range(N):
        board = empty_board()
        player, fun = 'x', x
        
        while True:
            a = fun(board, player)
            board = result(board, player, a)
            
            win = check_win(board)
            if win != 'n':
                if DEBUG >= 1: print(f"{board} winner: {win}")
                results[win] += 1
                break
            
            player, fun = switch_player(player, x, o)   
 
    return results

### Random vs. Random

In [118]:
DEBUG = 0

%timeit -n 1 -r 1 display(play(random_player, random_player))

{'x': 58, 'o': 27, 'd': 15}

91.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Pure Monte Carlo Search vs. Random

In [120]:
def uct10_player(board, player = 'x'):
    action = UCT_depth1(board, N = 10, player = player)
    return action

def uct100_player(board, player = 'x'):
    action = UCT_depth1(board, N = 100, player = player)
    return action



DEBUG = 1
print("UCT vs. random:")
display(play(uct10_player, random_player, N = 1))

UCT vs. random:
   action  total utility  # of playouts      UCB1
0       0             -1              1  1.145966
1       1              0              1  2.145966
2       2              0              1  2.145966
3       3             -1              1  1.145966
4       4              0              1  2.145966
5       5              2              2  2.517427
6       6              1              1  3.145966
7       7             -1              1  1.145966
8       8             -1              1  1.145966
   action  total utility  # of playouts      UCB1
0       0              2              2  2.517427
1       1              0              1  2.145966
2       3             -1              1  1.145966
3       4              2              2  2.517427
4       6              1              2  2.017427
5       7              1              1  3.145966
6       8             -1              1  1.145966
   action  total utility  # of playouts      UCB1
0       1              0          

{'x': 1, 'o': 0, 'd': 0}

In [122]:
DEBUG = 0
print("UCT vs. random:")
%timeit -n 1 -r 1 display(play(uct10_player, random_player))

print()
print("random vs. UCT")
%timeit -n 1 -r 1 display(play(random_player, uct10_player))

UCT vs. random:


{'x': 89, 'o': 6, 'd': 5}

1.54 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. UCT


{'x': 15, 'o': 77, 'd': 8}

1.19 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [123]:
DEBUG = 0
print("UCT vs. random:")
%timeit -n 1 -r 1 display(play(uct100_player, random_player))

print()
print("random vs. UCT")
%timeit -n 1 -r 1 display(play(random_player, uct100_player))

UCT vs. random:


{'x': 100, 'o': 0, 'd': 0}

13.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. UCT


{'x': 5, 'o': 89, 'd': 6}

10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [124]:
DEBUG = 0
print("UCT (100) vs. UCT (10):")
%timeit -n 1 -r 1 display(play(uct100_player, uct10_player))

print()
print("UCT (10) vs. UCT (100)")
%timeit -n 1 -r 1 display(play(uct10_player, uct100_player))

UCT (100) vs. UCT (10):


{'x': 89, 'o': 5, 'd': 6}

14.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

UCT (10) vs. UCT (100)


{'x': 30, 'o': 46, 'd': 24}

13 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
