# Adversarial Search: Solving Tic-Tac-Toe with Pure Monte Carlo Search

## Introduction 

Multiplayer games can be modeled and solved in several ways:
1. Nondeterministic actions: The opponent is seen as part of an environment with nondeterministic actions. Non-determinism is the result of the unknown opponent's moves. 
2. Optimal Decisions: Minimax search (search complete game tree) and alpha-beta pruning.
3. Heuristic Alpha-Beta Tree Search: Cut off tree search and use heuristic to estimate state value. 
4. __Monte Carlo Search:__ Simulate playouts to estimate state value. 

Here we will implement search for Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). The game is a __zero-sum game__: Win by x results in +1, win by o in -1 and a tie has a value of 0. Max plays x and tries to maximize the outcome while Min plays o and tries to minimize the outcome.   

We will implement
* Pure Monte Carlo search

## The board

I represent the board as a vector of length 9, stored row by row. Each entry is one of `' '`, `'x'`, or `'o'`.
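Since `show_board` below reshapes the vector into a 3x3 array row by row, a flat index maps to a row and column by integer division. A minimal sketch of that mapping follows; the helper names `to_row_col` and `to_index` are illustrative and are not used elsewhere in this notebook.

```python
def to_row_col(index):
    """Convert a flat board index (0-8) to a (row, column) pair."""
    return divmod(index, 3)

def to_index(row, col):
    """Convert a (row, column) pair back to the flat board index."""
    return 3 * row + col

print(to_row_col(4))   # center square -> (1, 1)
print(to_index(2, 0))  # bottom-left corner -> 6
```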

In [1]:
import numpy as np
import math

In [2]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Some helper functions.

In [3]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [4]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [5]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [6]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


In [7]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [8]:
def utility(state, player = 'x'):
    """check if a state is terminal and return the utility if it is. None means not a terminal state."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None


# Pure Monte Carlo Search

See AIMA page 161ff. 

We implement an extremely simplified version.

For the current state: 
1. Simulate $N$ random playouts for each possible action and 
2. pick the action with the highest average utility.

__Important note:__ Here we use a random playout policy, so the result is just a randomized search. That works fine for this toy problem, but for real applications you need to extend the code with
1. a good __playout policy__ (e.g., learned by self-play) and 
2. a __selection policy__ (e.g., UCB1).
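For reference, the UCB1 selection policy mentioned above scores each action by its average utility plus an exploration bonus, $\bar{u}_a + C\sqrt{\ln(n)/n_a}$, where $n_a$ is the number of playouts for action $a$ and $n$ is the total. The sketch below is not part of this notebook's implementation; the function name `ucb1`, the constant `C = sqrt(2)`, and the example statistics are assumptions for illustration.

```python
import math

def ucb1(mean_utility, n_action, n_total, C=math.sqrt(2)):
    """UCB1 score: average utility plus an exploration bonus.
    Untried actions get an infinite score so they are tried first."""
    if n_action == 0:
        return math.inf
    return mean_utility + C * math.sqrt(math.log(n_total) / n_action)

# hypothetical statistics: action -> (mean utility, number of playouts)
stats = {0: (0.5, 40), 4: (0.6, 50), 8: (0.4, 10)}
n_total = sum(n for _, n in stats.values())

scores = {a: ucb1(m, n, n_total) for a, (m, n) in stats.items()}
print(scores)
print("next action to sample:", max(scores, key=scores.get))
```

With these made-up numbers, action 8 gets the highest score even though its mean is the lowest, because it has been sampled far less often than the others.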

## Simulate playouts

In [9]:
def playout(state, action, player = 'x'):
    """Perform a random playout starting with the given action on the given board 
    and return the utility of the finished game."""
    state = result(state, player, action)
    current_player = other(player)
    
    while(True):
        # reached terminal state?
        u = utility(state, player)
        if u is not None: return(u)
        
        # we use a random playout policy
        a = np.random.choice(get_actions(state))
        state = result(state, current_player, a)
        #print(state)
        
        # switch between players
        current_player = other(current_player)


# Playout for action 0 (top-left corner)
board = empty_board()
print(playout(board, 0))
print(playout(board, 0))
print(playout(board, 0))
print(playout(board, 0))
print(playout(board, 0))

1
1
1
-1
-1


In [10]:
def playouts(board, action, player = 'x', N = 100):
    """Perform N playouts following the given action for the given board."""
    return [ playout(board, action, player) for i in range(N) ]

u = playouts(board, 0)
print("Playout results:", u)

print(f"mean utility: {np.mean(u)}")

p_win = sum(np.array(u) == +1)/len(u)
p_loss = sum(np.array(u) == -1)/len(u)
p_draw = sum(np.array(u) == 0)/len(u)
print(f"win probability: {p_win}")
print(f"loss probability: {p_loss}")
print(f"draw probability: {p_draw}")

Playout results: [-1, -1, -1, 1, 1, -1, 0, 0, 1, 1, -1, 0, -1, 1, -1, 0, 1, 1, 1, 1, 0, 1, 1, 1, -1, -1, 0, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 0, -1, -1, 1, 1, 1, -1, -1, 1, -1, 1, 1, 0, 1, -1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 0, -1, 1, 1, -1, 1, -1, 1, 1, 0, 1, 1, -1, 0, -1, 0, 1, -1, 0, -1, -1, -1, 1, 1, -1, 1, 1, -1]
mean utility: 0.18
win probability: 0.52
loss probability: 0.34
draw probability: 0.14


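__Note:__ In this sample, x wins 52% of the playouts and loses 34%, which suggests that the player who goes first has a significant advantage in __pure random play.__ A playout policy that plays more sensibly would estimate outcomes against a competent opponent better.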
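How much a mean of 0.18 from 100 playouts can be trusted is worth a rough check. Assuming the playouts are independent, the standard error of the mean gives an idea of the sampling noise. The sketch below rebuilds a sample with the same outcome counts as the run above (52 wins, 34 losses, 14 draws); the variable names are illustrative.

```python
import numpy as np

# same outcome counts as the 100 playouts reported above
u = np.array([+1] * 52 + [-1] * 34 + [0] * 14)

mean = u.mean()
# standard error of the mean: sample standard deviation / sqrt(sample size)
sem = u.std(ddof=1) / np.sqrt(len(u))

print(f"mean utility: {mean:.2f} +/- {sem:.2f} (one standard error)")
```

The standard error comes out at roughly 0.09, so with only 100 playouts the estimate is about two standard errors away from zero. More playouts shrink the error with the square root of the sample size.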

## Pure Monte Carlo Search

In [11]:
DEBUG = 1

def pmcs(board, N = 100, player = 'x'):
    """Pure Monte Carlo Search. Returns the action that has the largest average utility.
    The N playouts are evenly divided between the possible actions."""
    global DEBUG
    
    actions = get_actions(board)
    n = math.floor(N/len(actions))
    if DEBUG >= 1: print(f"Actions: {actions} ({n} playouts per actions)")
    
    ps = { i:np.mean(playouts(board, i, player, N = n)) for i in actions }
    if DEBUG >= 1: display(ps)
        
    action = max(ps, key=ps.get)
    return action

board = empty_board()
display(board)
%time print(pmcs(board))

print()
print("10000 playouts give a better utility estimate.")
%time print(pmcs(board, N = 10000))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (11 playouts per actions)


{0: 0.7272727272727273,
 1: 0.6363636363636364,
 2: 0.36363636363636365,
 3: 0.18181818181818182,
 4: 0.9090909090909091,
 5: 0.36363636363636365,
 6: 0.6363636363636364,
 7: -0.45454545454545453,
 8: 0.18181818181818182}

4
CPU times: user 92.9 ms, sys: 6.59 ms, total: 99.5 ms
Wall time: 88.7 ms

10000 playouts give a better utility estimate.
Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (1111 playouts per actions)


{0: 0.33033303330333036,
 1: 0.23672367236723674,
 2: 0.3177317731773177,
 3: 0.21332133213321333,
 4: 0.4977497749774977,
 5: 0.1719171917191719,
 6: 0.35373537353735374,
 7: 0.23402340234023403,
 8: 0.3501350135013501}

4
CPU times: user 6.45 s, sys: 276 ms, total: 6.72 s
Wall time: 6.2 s


With 10000 playouts, the center (4) has the highest estimated utility, followed by the corners (0, 2, 6, 8). The edges (1, 3, 5, 7) score lowest.
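Because Tic-Tac-Toe is tiny, the quantity that the playouts estimate can also be computed exactly by averaging over every continuation. The sketch below does this for uniformly random play by both sides. It is an independent cross-check and not part of the notebook's method: it uses its own tuple-based board (so that states can be cached) and the hypothetical helpers `winner` and `expected_utility`.

```python
from functools import lru_cache

# all eight winning lines as triples of flat board indices
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'x' or 'o' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def expected_utility(board, to_move):
    """Exact expected utility for x when both players move uniformly at random."""
    w = winner(board)
    if w == 'x':
        return 1.0
    if w == 'o':
        return -1.0
    free = [i for i, v in enumerate(board) if v == ' ']
    if not free:
        return 0.0  # full board without a winner is a draw
    nxt = 'o' if to_move == 'x' else 'x'
    values = [expected_utility(board[:i] + (to_move,) + board[i + 1:], nxt)
              for i in free]
    return sum(values) / len(values)

empty = (' ',) * 9
# x opens with action a, then o is to move
exact = {a: expected_utility(empty[:a] + ('x',) + empty[a + 1:], 'o')
         for a in range(9)}

for a, v in exact.items():
    print(f"{a}: {v:.4f}")
```

By symmetry, all four corners share one exact value and all four edges another, which gives a quick sanity check on the sampled estimates.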

## Some Tests

### x is about to win (play 8)


In [12]:
board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'

print("Board:")
show_board(board)

print()
%time display(pmcs(board))

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]

Actions: [2, 5, 6, 7, 8] (20 playouts per actions)


{2: 0.85, 5: 0.7, 6: 0.7, 7: 0.8, 8: 1.0}

8

CPU times: user 44.2 ms, sys: 227 µs, total: 44.4 ms
Wall time: 38.2 ms


### o is about to win

In [13]:
board = empty_board() 
board[0] = 'o'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'
board[8] = 'x'

print("Board:")
show_board(board)

print()
%time display(pmcs(board))

print()
%time display(pmcs(board, N = 1000))

Board:
[['o' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' 'x']]

Actions: [2, 5, 6, 7] (25 playouts per actions)


{2: -0.2, 5: -0.68, 6: 0.12, 7: -0.76}

6

CPU times: user 40.4 ms, sys: 0 ns, total: 40.4 ms
Wall time: 36.5 ms

Actions: [2, 5, 6, 7] (250 playouts per actions)


{2: -0.064, 5: -0.632, 6: 0.048, 7: -0.632}

6

CPU times: user 227 ms, sys: 4.21 ms, total: 231 ms
Wall time: 217 ms


### x can draw if it chooses 7

In [14]:
board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[2] = 'x'
board[4] = 'o'

print("Board:")
show_board(board)

print()
%time display(pmcs(board))

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

Actions: [3, 5, 6, 7, 8] (20 playouts per actions)


{3: -0.45, 5: -0.1, 6: -0.25, 7: 0.3, 8: -0.1}

7

CPU times: user 50.8 ms, sys: 448 µs, total: 51.3 ms
Wall time: 44.8 ms


### Empty board: Only a draw can be guaranteed

In [15]:
board = empty_board() 

print("Board:")
show_board(board)

print()
%time display(pmcs(board ))

Board:
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (11 playouts per actions)


{0: 0.7272727272727273,
 1: 0.36363636363636365,
 2: 0.5454545454545454,
 3: 0.09090909090909091,
 4: 0.09090909090909091,
 5: 0.36363636363636365,
 6: 0.18181818181818182,
 7: -0.09090909090909091,
 8: 0.2727272727272727}

0

CPU times: user 106 ms, sys: 867 µs, total: 106 ms
Wall time: 90.6 ms


### A bad situation

In [16]:
board = empty_board() 
board[0] = 'o'
board[2] = 'x'
board[8] = 'o'

print("Board:")
show_board(board)

print()
%time display(pmcs(board))

Board:
[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

Actions: [1, 3, 4, 5, 6, 7] (16 playouts per actions)


{1: -0.75, 3: -0.5, 4: 0.25, 5: -0.75, 6: -0.125, 7: -0.5625}

4

CPU times: user 76.6 ms, sys: 7.7 ms, total: 84.3 ms
Wall time: 72.9 ms


__Note:__ The estimate for action 4 is positive only because a random player o is very unlikely to answer with the bottom-left corner (6), which would both block x and spring the trap!
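To see how bad this position really is, it can be evaluated with a small minimax search, the optimal-decision approach listed in the introduction. This is a sketch for comparison only and is not used by the Monte Carlo code; the helpers `winner` and `minimax` are hypothetical names and operate on the same list-based board.

```python
# all eight winning lines as triples of flat board indices
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'x' or 'o' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, to_move):
    """Game value for x under optimal play by both sides (+1, 0, or -1)."""
    w = winner(board)
    if w == 'x':
        return +1
    if w == 'o':
        return -1
    free = [i for i, v in enumerate(board) if v == ' ']
    if not free:
        return 0  # draw
    nxt = 'o' if to_move == 'x' else 'x'
    values = [minimax(board[:i] + [to_move] + board[i + 1:], nxt) for i in free]
    # x maximizes the value, o minimizes it
    return max(values) if to_move == 'x' else min(values)

board = ['o', ' ', 'x',
         ' ', ' ', ' ',
         ' ', ' ', 'o']
print(minimax(board, 'x'))  # -1: x loses against optimal play
```

The positive Monte Carlo estimate for action 4 therefore reflects the weakness of the random opponent in the playouts, not the true value of the position.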

## Experiments


### Baseline: Randomized Player

A player that picks its moves uniformly at random serves as a weak baseline.

In [17]:
def random_player(board, player = None):
    """Simple player that chooses a random empty square. player is unused"""
    return np.random.choice(get_actions(board))

show_board(board)
print()
%time random_player(board)

[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

CPU times: user 96 µs, sys: 9 µs, total: 105 µs
Wall time: 111 µs


4

### The Environment

Implement the environment that calls the agent. The percept is the board and the action is a move.

In [18]:
DEBUG = 1

def switch_player(player, x, o):
    if player == 'x':
        return 'o', o
    else:
        return 'x', x

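def play(x, o, N = 100):
    """Let two agents play N games. x and o are agent functions that take 
    (board, player) and return an action. Returns the counts of wins per player and draws."""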
    results = {'x': 0, 'o': 0, 'd': 0}
    for i in range(N):
        board = empty_board()
        player, fun = 'x', x
        
        while True:
            a = fun(board, player)
            board = result(board, player, a)
            if DEBUG >= 1: print(f"Player {player} uses action {a}")
            
            win = check_win(board)
            if win != 'n':
                if DEBUG >= 1: print(f"Result {board} winner: {win}")
                results[win] += 1
                break
            
            player, fun = switch_player(player, x, o)   
 
    return results

### Random vs. Random

In [19]:
DEBUG = 0

%time display(play(random_player, random_player))

{'x': 57, 'o': 29, 'd': 14}

CPU times: user 105 ms, sys: 8.8 ms, total: 114 ms
Wall time: 100 ms


### Pure Monte Carlo Search vs. Random

In [20]:
def pmcs10_player(board, player = 'x'):
    action = pmcs(board, N = 10, player = player)
    return action

def pmcs100_player(board, player = 'x'):
    action = pmcs(board, N = 100, player = player)
    return action

def pmcs1000_player(board, player = 'x'):
    action = pmcs(board, N = 1000, player = player)
    return action


DEBUG = 1
print("PMCS vs. random:")
display(play(pmcs10_player, random_player, N = 1))

PMCS vs. random:
Actions: [0, 1, 2, 3, 4, 5, 6, 7, 8] (1 playouts per actions)


{0: 1.0, 1: 1.0, 2: -1.0, 3: -1.0, 4: -1.0, 5: 1.0, 6: 1.0, 7: -1.0, 8: 1.0}

Player x uses action 0
Player o uses action 3
Actions: [1, 2, 4, 5, 6, 7, 8] (1 playouts per actions)


{1: 1.0, 2: -1.0, 4: 1.0, 5: 1.0, 6: -1.0, 7: 1.0, 8: -1.0}

Player x uses action 1
Player o uses action 7
Actions: [2, 4, 5, 6, 8] (2 playouts per actions)


{2: 1.0, 4: 1.0, 5: 0.5, 6: 0.5, 8: 1.0}

Player x uses action 2
Result ['x', 'x', 'x', 'o', ' ', ' ', ' ', 'o', ' '] winner: x


{'x': 1, 'o': 0, 'd': 0}

In [21]:
DEBUG = 0
print("PMCS vs. random:")
%time display(play(pmcs10_player, random_player))

print()
print("random vs. PMCS")
%time display(play(random_player, pmcs10_player))

PMCS vs. random:


{'x': 90, 'o': 6, 'd': 4}

CPU times: user 1.36 s, sys: 70.2 ms, total: 1.43 s
Wall time: 1.26 s

random vs. PMCS


{'x': 27, 'o': 61, 'd': 12}

CPU times: user 989 ms, sys: 65.7 ms, total: 1.05 s
Wall time: 939 ms


In [22]:
DEBUG = 0
print("PMCS vs. random:")
%time display(play(pmcs100_player, random_player))

print()
print("random vs. PMCS")
%time display(play(random_player, pmcs100_player))

PMCS vs. random:


{'x': 100, 'o': 0, 'd': 0}

CPU times: user 14.5 s, sys: 696 ms, total: 15.2 s
Wall time: 13.4 s

random vs. PMCS


{'x': 6, 'o': 91, 'd': 3}

CPU times: user 10.7 s, sys: 360 ms, total: 11 s
Wall time: 10.2 s


In [23]:
DEBUG = 0
print("PMCS (100) vs. PMCS (10):")
%time display(play(pmcs100_player, pmcs10_player))

print()
print("PMCS (10) vs. PMCS (100)")
%time display(play(pmcs10_player, pmcs100_player))

PMCS (100) vs. PMCS (10):


{'x': 90, 'o': 3, 'd': 7}

CPU times: user 14.6 s, sys: 364 ms, total: 15 s
Wall time: 13.8 s

PMCS (10) vs. PMCS (100)


{'x': 40, 'o': 43, 'd': 17}

CPU times: user 11.5 s, sys: 274 ms, total: 11.7 s
Wall time: 11.1 s


_Note 1:_ You can try `pmcs1000_player`, but it will take a few minutes to run 100 games.
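The fixed-budget players above differ only in the value of `N`, so they could also be generated by a small factory function. The sketch below shows one possible way; `make_player` and the stand-in search `first_free_square` are hypothetical names introduced here so that the example runs on its own. In the notebook, `pmcs` would be passed as the search function instead.

```python
def make_player(search, N):
    """Return a player function that calls `search` with a fixed playout budget N."""
    def fixed_budget_player(board, player='x'):
        return search(board, N=N, player=player)
    return fixed_budget_player

# stand-in search with the same signature as pmcs, for demonstration only
def first_free_square(board, N=100, player='x'):
    return board.index(' ')

demo_player = make_player(first_free_square, N=1000)
print(demo_player(['x', 'o', ' ', ' ', ' ', ' ', ' ', ' ', ' '], 'x'))  # 2
```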

_Note 2:_ An important advantage of Monte Carlo Search is that we do not need to define a heuristic evaluation function to decide which boards are good.