# Adversarial Search: Solving Tic-Tac-Toe with Monte Carlo Tree Search

## Introduction 

Search for multiplayer games can be approached in several ways:
1. Nondeterministic actions: The opponent is seen as part of an environment with nondeterministic actions. Non-determinism is the result of the unknown opponent's moves. 
2. Optimal Decisions: Minimax search (search complete game tree) and alpha-beta pruning.
3. Heuristic Alpha-Beta Tree Search: Cut off tree search and use heuristic to estimate state value. 
4. __Monte Carlo Tree search:__ Simulate playouts to estimate state value. 

Here we will implement search for Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). The game is a __zero-sum game__: Win by x results in +1, win by o in -1 and a tie has a value of 0. Max plays x and tries to maximize the outcome while Min plays o and tries to minimize the outcome.   
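The zero-sum property can be checked directly: whatever one player gains, the other loses. The following is a minimal sketch; `outcome_utility` is a hypothetical helper introduced only for this illustration and is not part of the notebook.

```python
def outcome_utility(outcome, player):
    """Utility of a finished game for `player`.

    `outcome` is 'x' or 'o' for a win by that side, or 'd' for a draw.
    """
    if outcome == 'd':
        return 0
    return 1 if outcome == player else -1

# For every possible outcome the two players' utilities sum to zero.
for outcome in ['x', 'o', 'd']:
    total = outcome_utility(outcome, 'x') + outcome_utility(outcome, 'o')
    print(outcome, total)
```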

We will implement
* Pure Monte Carlo search

## The board

I represent the board as a vector of length 9. The values are `' ', 'x', 'o'`.  
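The flat vector corresponds to the 3x3 grid in row-major order, so index 0 is the top-left square and index 8 is the bottom-right square. A minimal sketch of the index arithmetic, using the hypothetical helper names `to_row_col` and `to_index` (they are not used elsewhere in this notebook):

```python
def to_row_col(index):
    """Convert a flat board index (0-8) to a (row, col) pair."""
    return divmod(index, 3)

def to_index(row, col):
    """Convert a (row, col) pair back to the flat board index."""
    return 3 * row + col

print(to_row_col(4))   # center square
print(to_index(2, 0))  # bottom-left corner
```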

In [1]:
import numpy as np
import math

In [2]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Some helper functions.

In [3]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [4]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [5]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [6]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [7]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [8]:
def utility(state, player = 'x'):
    """check if a state is terminal and return the utility if it is. None means not a terminal state."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


# Pure Monte Carlo Tree Search

See AIMA page 161ff. 

We implement an extremely simplified version.

For the current state: 
1. Simulate $N$ random playouts for each possible action and 
2. pick the action with the highest average utility.

__Important note:__ We use a random playout policy here, which amounts to a plain randomized search that works well enough for this toy problem. Real applications need to extend the code with
1. a good __playout policy__ (e.g., learned by self-play) and 
2. a __selection policy__ (e.g., UCB1).
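To make the second point concrete, here is a minimal sketch of a UCB1 selection rule. It is not used anywhere in this notebook. The function names, the `stats` layout (action mapped to total utility and visit count), and the default exploration constant `c` are assumptions made for this illustration.

```python
import math

def ucb1(total_utility, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average utility plus an exploration bonus.

    Unvisited actions get an infinite score so that each one is tried once.
    """
    if visits == 0:
        return math.inf
    exploit = total_utility / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_action(stats, c=math.sqrt(2)):
    """Pick the action with the highest UCB1 score.

    `stats` maps action -> (total_utility, visits).
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda a: ucb1(stats[a][0], stats[a][1], parent_visits, c))

# Hypothetical statistics: action 8 has not been tried yet, so it is selected.
stats = {0: (3.0, 10), 4: (6.0, 10), 8: (0.0, 0)}
print(select_action(stats))
```

Compared with dividing the playouts evenly, a rule like this spends more playouts on actions that look promising while still revisiting rarely tried ones. The value of `c` is a tuning parameter.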

## Simulate playouts

In [9]:
def playout(state, action, player = 'x'):
    """Perform a random playout starting with the given action on the given board 
    and return the utility of the finished game."""
    state = result(state, player, action)
    current_player = other(player)
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: return(u)
        
        # we use a random playout policy
        a = np.random.choice(get_actions(state))
        state = result(state, current_player, a)
        #print(state)
        
        # switch between players
        current_player = other(current_player)


board = empty_board()
print(playout(board, 0))
print(playout(board, 0))
print(playout(board, 0))

-1
-1
0


In [10]:
def playouts(board, action, player = 'x', N = 100):
    """Perform N playouts following the given action for the given board."""
    return [playout(board, action, player) for i in range(N)]

p = playouts(board, 0)
print(p)

print(f"mean utility: {np.mean(p)}")
print(f"win probability: {sum(np.array(p) == +1)/len(p)}")
print(f"loss probability: {sum(np.array(p) == -1)/len(p)}")

[1, 1, 0, 1, 1, 0, 0, 1, 1, 0, -1, 1, -1, -1, 1, 0, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 0, 1, 1, 1, -1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, -1, 1, 1, 1, -1, 1, -1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 0, 1, 0, 1, 1, 1, 1, 1, 1, -1, 0, -1, -1, 1, -1, 1, 1, 0, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 0, -1, 1, 1, 1, 1]
mean utility: 0.5
win probability: 0.68
loss probability: 0.18


__Note:__ This shows that the player who goes first has a significant advantage in __pure random play__: x wins 68 of the 100 playouts above. A playout policy that resembles real play more closely would give more informative estimates.
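One simple step beyond random play is a rule-based playout policy: complete a line if possible, otherwise block the opponent, otherwise move randomly. The sketch below is self-contained and only illustrates the idea. `WIN_LINES`, `winning_moves`, and `heuristic_playout_policy` are hypothetical names, and the policy is not plugged into `playout` in this notebook.

```python
import random

# All eight lines of three squares on the flat 9-element board.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winning_moves(board, player):
    """Empty squares that would complete a line for `player`."""
    moves = []
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(' ') == 1:
            moves.append(line[marks.index(' ')])
    return moves

def heuristic_playout_policy(board, player):
    """Win if possible, else block the opponent, else play a random empty square.

    Assumes the board has at least one empty square.
    """
    opponent = 'o' if player == 'x' else 'x'
    for candidates in (winning_moves(board, player), winning_moves(board, opponent)):
        if candidates:
            return random.choice(candidates)
    return random.choice([i for i, s in enumerate(board) if s == ' '])

# x has two in a row on the main diagonal and should complete it at square 8.
board = ['x', 'o', ' ',
         'o', 'x', ' ',
         ' ', ' ', ' ']
print(heuristic_playout_policy(board, 'x'))
```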

## Choose the best action

Pure Monte Carlo Search (pmcs)

In [11]:
DEBUG = 1

def pmcs(board, N = 100, player = 'x'):
    """Pure Monte Carlo Search. Returns the action that has the largest average utility.
    The N playouts are evenly divided between the possible actions."""
    global DEBUG
    
    actions = get_actions(board)
    n = math.floor(N/len(actions))
    if DEBUG >= 1: print(f"Actions: {actions} ({n} playouts per actions)")
    
    ps = {i:np.mean(playouts(board, i, player, N = n)) for i in actions}
    if DEBUG >= 1: display(ps)
        
    action = max(ps, key=ps.get)
    return action

board = empty_board()
display(board)
print(pmcs(board))

print()
print("1000 playouts give a better utility estimate.")
print(pmcs(board, N = 1000))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (11 playouts per actions)


{0: 0.36363636363636365,
 1: -0.2727272727272727,
 2: 0.36363636363636365,
 3: 0.6363636363636364,
 4: 0.7272727272727273,
 5: 0.2727272727272727,
 6: 0.18181818181818182,
 7: 0.18181818181818182,
 8: 0.2727272727272727}

4

1000 playouts give a better utility estimate.
Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (111 playouts per actions)


{0: 0.18018018018018017,
 1: 0.0990990990990991,
 2: 0.4144144144144144,
 3: 0.1981981981981982,
 4: 0.44144144144144143,
 5: 0.10810810810810811,
 6: 0.44144144144144143,
 7: 0.13513513513513514,
 8: 0.38738738738738737}

4


With 1000 playouts, the center and most of the corners score clearly higher than the edge squares.
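With only 11 playouts per action the estimates are noisy, which is why the ranking of individual squares changes from run to run. The standard error of the mean gives a rough idea of that noise. The following is a minimal sketch; `mean_and_stderr` is a hypothetical helper and the sample values are made up for illustration, not taken from an actual run.

```python
import math

def mean_and_stderr(utilities):
    """Return the sample mean and its standard error.

    Uses the unbiased sample variance, so at least two values are required.
    """
    n = len(utilities)
    mean = sum(utilities) / n
    variance = sum((u - mean) ** 2 for u in utilities) / (n - 1)
    return mean, math.sqrt(variance / n)

# Hypothetical playout results: 7 wins, 1 draw, and 3 losses in 11 playouts.
sample = [1] * 7 + [0] * 1 + [-1] * 3
mean, stderr = mean_and_stderr(sample)
print(f"mean = {mean:.2f}, standard error = {stderr:.2f}")
```

Because the standard error shrinks with the square root of the number of playouts, going from 11 to 111 playouts per action reduces it by roughly a factor of three.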

## Some Tests

In [12]:
# x is about to win (play 8)

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'

print("Board:")
show_board(board)

print()
display(pmcs(board ))

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]

Actions: [2, 5, 6, 7, 8] (20 playouts per actions)


{2: 0.95, 5: 0.75, 6: 0.95, 7: 0.75, 8: 1.0}

8

In [13]:
# o is about to win

board = empty_board() 
board[0] = 'o'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'
board[8] = 'x'

print("Board:")
show_board(board)

print()
display(pmcs(board))

print()
display(pmcs(board, N = 1000))

Board:
[['o' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' 'x']]

Actions: [2, 5, 6, 7] (25 playouts per actions)


{2: -0.04, 5: -0.6, 6: -0.2, 7: -0.84}

2


Actions: [2, 5, 6, 7] (250 playouts per actions)


{2: -0.064, 5: -0.744, 6: 0.024, 7: -0.672}

6

In [14]:
# x can draw if it chooses 7.

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[2] = 'x'
board[4] = 'o'

print("Board:")
show_board(board)

print()
display(pmcs(board))

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

Actions: [3, 5, 6, 7, 8] (20 playouts per actions)


{3: -0.35, 5: -0.4, 6: -0.3, 7: 0.25, 8: -0.5}

7

In [15]:
# o went first

board = empty_board() 
board[4] = 'o'

print("Board:")
show_board(board)


print()
display(pmcs(board))

Board:
[[' ' ' ' ' ']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

Actions: [0, 1, 2, 3, 5, 6, 7, 8] (12 playouts per actions)


{0: -0.3333333333333333,
 1: -0.6666666666666666,
 2: -0.3333333333333333,
 3: -0.25,
 5: -0.9166666666666666,
 6: -0.75,
 7: -0.6666666666666666,
 8: -0.4166666666666667}

3

In [16]:
# Empty board: Only a draw can be guaranteed

board = empty_board() 

print("Board:")
show_board(board)


print()
%timeit -n 1 -r 1 display(pmcs(board ))

Board:
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (11 playouts per actions)


{0: 0.8181818181818182,
 1: 0.2727272727272727,
 2: 0.36363636363636365,
 3: 0.6363636363636364,
 4: 0.2727272727272727,
 5: 0.2727272727272727,
 6: 0.5454545454545454,
 7: -0.09090909090909091,
 8: 0.45454545454545453}

0

90.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [17]:
# A bad situation

board = empty_board() 
board[0] = 'o'
board[2] = 'x'
board[8] = 'o'

print("Board:")
show_board(board)


print()
display(pmcs(board))

Board:
[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

Actions: [1, 3, 4, 5, 6, 7] (16 playouts per actions)


{1: -0.625, 3: -0.5625, 4: 0.125, 5: -0.875, 6: -0.125, 7: -0.4375}

4

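__Note:__ The center (4) is the only move that stops o from winning immediately on the diagonal, but its positive utility estimate is too optimistic. After x plays 4, o has to take the bottom-left corner (6) to block x's diagonal, and that move also creates a double threat that wins the game for o. A random player o is very unlikely to find this move, so the random playouts overestimate x's chances.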
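Positions like this one can be checked exactly with an exhaustive minimax search, the first optimal-decision technique listed in the introduction. The following is a minimal, self-contained sketch that does not use the notebook's helpers; `WIN_LINES`, `winner`, and `minimax_value` are hypothetical names introduced here.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'x' or 'o' for a win, 'd' for a draw, or None if the game goes on."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return 'd' if ' ' not in board else None

def minimax_value(board, to_move):
    """Exact game value from x's point of view: +1 x wins, 0 draw, -1 o wins."""
    w = winner(board)
    if w is not None:
        return {'x': 1, 'o': -1, 'd': 0}[w]
    values = []
    for a in [i for i, s in enumerate(board) if s == ' ']:
        child = board.copy()
        child[a] = to_move
        values.append(minimax_value(child, 'o' if to_move == 'x' else 'x'))
    # x maximizes the value, o minimizes it.
    return max(values) if to_move == 'x' else min(values)

# The same position as in the test above, with x to move.
board = ['o', ' ', 'x',
         ' ', ' ', ' ',
         ' ', ' ', 'o']
exact = {}
for a in [i for i, s in enumerate(board) if s == ' ']:
    child = board.copy()
    child[a] = 'x'
    exact[a] = minimax_value(child, 'o')
print(exact)
```

Every move evaluates to -1 here, so against an optimal o the position is already lost for x, whatever the playout estimates suggest.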

## Experiments


### Baseline: Randomized Player

A completely randomized player agent should be a weak baseline.

In [18]:
def random_player(board, player = None):
    """Simple player that chooses a random empty square. player is unused"""
    return np.random.choice(get_actions(board))

show_board(board)
random_player(board)

[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]


5

### The Environment

Implement the environment that calls the agents. The percept is the board and the action is a move.

In [19]:
DEBUG = 1

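def switch_player(player, x, o):
    """Return the symbol and the agent function of the player who moves next."""
    if player == 'x':
        return 'o', o
    else:
        return 'x', x

def play(x, o, N = 100):
    """Play N games between the agent functions x and o (x always moves first)
    and return the number of wins for x, wins for o, and draws (d)."""
    results = {'x': 0, 'o': 0, 'd': 0}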
    for i in range(N):
        board = empty_board()
        player, fun = 'x', x
        
        while True:
            a = fun(board, player)
            board = result(board, player, a)
            
            win = check_win(board)
            if win != 'n':
                if DEBUG >= 1: print(f"{board} winner: {win}")
                results[win] += 1
                break
            
            player, fun = switch_player(player, x, o)   
 
    return results

### Random vs. Random

In [20]:
DEBUG = 0

%timeit -n 1 -r 1 display(play(random_player, random_player))

{'x': 58, 'o': 30, 'd': 12}

84.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Pure Monte Carlo Search vs. Random

In [21]:
def pmcs10_player(board, player = 'x'):
    action = pmcs(board, N = 10, player = player)
    return action

def pmcs100_player(board, player = 'x'):
    action = pmcs(board, N = 100, player = player)
    return action

def pmcs1000_player(board, player = 'x'):
    action = pmcs(board, N = 1000, player = player)
    return action


DEBUG = 1
print("PMCS vs. random:")
display(play(pmcs10_player, random_player, N = 1))

PMCS vs. random:
Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (1 playouts per actions)


{0: 1.0, 1: 1.0, 2: 0.0, 3: -1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0}

Actions: [1, 3, 4, 5, 6, 7, 8] (1 playouts per actions)


{1: 1.0, 3: -1.0, 4: 1.0, 5: -1.0, 6: 1.0, 7: -1.0, 8: 1.0}

Actions: [3, 4, 6, 7, 8] (2 playouts per actions)


{3: -1.0, 4: 0.0, 6: -1.0, 7: -1.0, 8: 1.0}

Actions: [4, 6, 7] (3 playouts per actions)


{4: 1.0, 6: 1.0, 7: 0.3333333333333333}

['x', 'x', 'o', 'o', 'x', 'o', ' ', ' ', 'x'] winner: x


{'x': 1, 'o': 0, 'd': 0}

In [22]:
DEBUG = 0
print("PMCS vs. random:")
%timeit -n 1 -r 1 display(play(pmcs10_player, random_player))

print()
print("random vs. PMCS")
%timeit -n 1 -r 1 display(play(random_player, pmcs10_player))

PMCS vs. random:


{'x': 92, 'o': 4, 'd': 4}

1.41 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. PMCS


{'x': 18, 'o': 77, 'd': 5}

1.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [23]:
DEBUG = 0
print("PMCS vs. random:")
%timeit -n 1 -r 1 display(play(pmcs100_player, random_player))

print()
print("random vs. PMCS")
%timeit -n 1 -r 1 display(play(random_player, pmcs100_player))

PMCS vs. random:


{'x': 100, 'o': 0, 'd': 0}

14.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

random vs. PMCS


{'x': 7, 'o': 86, 'd': 7}

10.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [24]:
DEBUG = 0
print("PMCS (100) vs. PMCS (10):")
%timeit -n 1 -r 1 display(play(pmcs100_player, pmcs10_player))

print()
print("PMCS (10) vs. PMCS (100)")
%timeit -n 1 -r 1 display(play(pmcs10_player, pmcs100_player))

PMCS (100) vs. PMCS (10):


{'x': 88, 'o': 2, 'd': 10}

14.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

PMCS (10) vs. PMCS (100)


{'x': 39, 'o': 46, 'd': 15}

11.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
