# Adversarial Search: Solving Tic-Tac-Toe with Heuristic Alpha-Beta Tree Search

## Introduction 

Multiplayer games can be modeled and solved in several ways:
1. Nondeterministic actions: The opponent is seen as part of an environment with nondeterministic actions. The nondeterminism results from not knowing the opponent's moves.
2. Optimal decisions: Minimax search (searching the complete game tree) and alpha-beta pruning.
3. __Heuristic Alpha-Beta Tree Search:__ Cut off the tree search and use a heuristic to estimate the state value.
4. Monte Carlo Tree Search: Simulate playouts to estimate the state value.

Here we will implement search for Tic-Tac-Toe (see [rules](https://en.wikipedia.org/wiki/Tic-tac-toe)). The game is a __zero-sum game__: a win by x has a value of +1, a win by o has a value of -1, and a tie has a value of 0. Max plays x and tries to maximize the outcome, while Min plays o and tries to minimize it.

We will implement
* Heuristic Alpha-Beta Tree Search

The algorithm searches the game tree. We could return a conditional plan (or a partial plan if cutoffs are used), but the implementation here only identifies and returns the best next move it finds.

## The board and helper functions

I represent the board as a vector of length 9. Each entry is one of `' '`, `'x'`, or `'o'`.

In [1]:
import numpy as np
import math

In [2]:
def empty_board():
    return [' '] * 9

board = empty_board()
display(board)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
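
Because the board is a flat list, it can help to see how an index maps to a position on the 3x3 grid. The following is a minimal sketch. The helper names `to_row_col` and `to_index` are hypothetical and are not used elsewhere in this notebook.

```python
def to_row_col(index):
    """Map a flat board index (0-8) to a (row, col) pair on the 3x3 grid."""
    return divmod(index, 3)

def to_index(row, col):
    """Map a (row, col) pair back to the flat board index."""
    return 3 * row + col

# The center square is index 4, i.e., row 1 and column 1.
print(to_row_col(4))
# The bottom-left square (row 2, column 0) is index 6.
print(to_index(2, 0))
```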

Some helper functions.

In [3]:
def show_board(board):
    """display the board"""
    b = np.array(board).reshape((3,3))
    print(b)

board = empty_board()
show_board(board)    

print()
print("Add some x's")
board[0] = 'x'; board[3] = 'x'; board[6] = 'x';  
show_board(board)

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Add some x's
[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


In [4]:
def check_win(board):
    """check the board and return one of x, o, d (draw), or n (for next move)"""
    
    board = np.array(board).reshape((3,3))
    
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if len(set(row)) == 1 and row[0] != ' ':
                return row[0]
    
    # check for draw
    if(np.sum(board == ' ') < 1):
        return 'd'
    
    return 'n'

show_board(board)
print('Win? ' + check_win(board))

print()
show_board(empty_board())
print('Win? ' + check_win(empty_board()))

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]
Win? x

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
Win? n


In [5]:
def get_actions(board):
    """return possible actions as a vector of indices"""
    return np.where(np.array(board) == ' ')[0].tolist()

    # randomize the action order
    #actions = np.where(np.array(board) == ' ')[0]
    #np.random.shuffle(actions)
    #return actions.tolist()


show_board(board)
get_actions(board)

[['x' ' ' ' ']
 ['x' ' ' ' ']
 ['x' ' ' ' ']]


[1, 2, 4, 5, 7, 8]

In [6]:
def other(player): 
    if player == 'x': return 'o'
    else: return 'x'

In [7]:
def result(state, player, action):
    """Add move to the board."""
    
    state = state.copy()
    state[action] = player
  
    return state

show_board(empty_board())

print()
print("State for placing an x at position 4:")
show_board(result(empty_board(), 'x', 4))

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

State for placing an x at position 4:
[[' ' ' ' ' ']
 [' ' 'x' ' ']
 [' ' ' ' ' ']]


Return the utility for a state. Terminal states return the utility for Max as `+1`, `-1`, or `0`. Non-terminal states have no utility and return `None`. Note that I use `utility(s) is not None` to identify terminal states. That is, this test takes the place of `is_terminal(s)`.

In [8]:
def utility(state, player = 'x'):
    """check if a state is terminal and return the utility if it is. None means not a terminal state."""
    goal = check_win(state)        
    if goal == player: return +1 
    if goal == 'd': return 0  
    if goal == other(player): return -1  # loss is failure
    return None # continue

print(utility(['x'] * 9))
print(utility(['o'] * 9))
print(utility(empty_board()))

1
-1
None
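
The explicit `is not None` test matters because a draw has utility `0`, which Python treats as false in a boolean context. The following minimal sketch shows the difference. The helper `is_terminal_value` is hypothetical and works on the value returned by a utility function rather than on a board.

```python
def is_terminal_value(u):
    """u is the value returned by a utility function: +1, -1, 0, or None."""
    return u is not None

# Compare the explicit None test with plain truthiness for each possible value.
for u in (+1, -1, 0, None):
    print(u, is_terminal_value(u), bool(u))
```

Only the draw (`0`) differs between the two tests: it is terminal, but `bool(0)` is `False`, so a check like `if utility(state):` would treat a drawn board as non-terminal.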


# Heuristic Alpha-Beta Tree Search

See AIMA page 156ff. 


## Heuristic Evaluation Function

In [9]:
def eval_fun(state, player = 'x'):
    """heuristic for utility of state. Returns score for a node:
    1. For terminal states it returns the utility. 
    2. For non-terminal states, it calculates a weighted linear function using features of the state. 
    The features we look at are 2 in a row/col/diagonal where the 3rd square is empty. We assume that
    the more of these positions we have, the higher the chance of winning.
    We need to be careful that the utility of the heuristic stays between [-1,1]. 
    Note that the largest possible number of these positions is 4. I weigh the count by 0.1, 
    guaranteeing that it stays in the needed range.
    
    Function Returns: heuristic value, terminal?"""
    
    # terminal state?
    u = utility(state, player)
    if u is not None: return u, True
    
    
    score = 0
    board = np.array(state).reshape((3,3))
    diagonals = np.array([[board[i][i] for i in range(len(board))], 
                          [board[i][len(board)-i-1] for i in range(len(board))]])
    
    for a_board in [board, np.transpose(board), diagonals]:
        for row in a_board:
            if sum(row == player) == 2 and any(row ==' '): score += .1
            if sum(row == other(player)) == 2 and any(row ==' '): score -= .1
    
    return score, False

In [10]:
board = empty_board() 
show_board(board)
print(f"eval for x: {eval_fun(board)}")
print(f"eval for o: {eval_fun(board, 'o')}")

board = empty_board() 
board[0] = 'x'
board[1] = 'x'
board[2] = 'x' 
show_board(board)
print(f"eval for x: {eval_fun(board)}")
print(f"eval for o: {eval_fun(board, 'o')}")

board = empty_board() 
board[0] = 'x'
board[1] = 'x'
board[3] = 'x' 
board[4] = 'o'
board[8] = 'o'
show_board(board)
print(f"eval for x: {eval_fun(board)}")
print(f"eval for o: {eval_fun(board, 'o')}")

board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'x' 
board[4] = 'o'
show_board(board)
print(f"eval for x: {eval_fun(board)}")
print(f"eval for o: {eval_fun(board, 'o')}")

[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
eval for x: (0, False)
eval for o: (0, False)
[['x' 'x' 'x']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]
eval for x: (1, True)
eval for o: (-1, True)
[['x' 'x' ' ']
 ['x' 'o' ' ']
 [' ' ' ' 'o']]
eval for x: (0.2, False)
eval for o: (-0.2, False)
[['x' 'o' ' ']
 ['x' 'o' ' ']
 [' ' ' ' ' ']]
eval for x: (0.0, False)
eval for o: (0.0, False)
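
The feature used by the heuristic can also be computed without NumPy by listing the eight winning lines as index triples. This is a minimal sketch for non-terminal boards only. The names `LINES`, `count_open_twos`, and `heuristic_sketch` are hypothetical, and the sketch multiplies the count difference by the weight instead of adding `.1` per line as `eval_fun` does.

```python
# All eight winning lines as index triples into the flat board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def count_open_twos(state, player):
    """Count lines where player holds exactly two squares and the third is empty."""
    count = 0
    for line in LINES:
        marks = [state[i] for i in line]
        if marks.count(player) == 2 and marks.count(' ') == 1:
            count += 1
    return count

def heuristic_sketch(state, player='x', weight=0.1):
    """Weighted difference of open two-in-a-rows (non-terminal states only)."""
    opponent = 'o' if player == 'x' else 'x'
    return weight * (count_open_twos(state, player) - count_open_twos(state, opponent))

# Third test board from above: x has two open lines, o has none.
board = ['x', 'x', ' ',
         'x', 'o', ' ',
         ' ', ' ', 'o']
print(count_open_twos(board, 'x'), count_open_twos(board, 'o'))
```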


## Search with Cutoff

We add a cutoff to the recursive DFS algorithm for minimax search with alpha-beta pruning (see AIMA page 156ff). At the cutoff depth, we use the heuristic evaluation function to estimate the state value and back up that value using minimax search with alpha-beta pruning to determine the next move.

In [11]:
# global variables
DEBUG = 1 # 1 ... count nodes, 2 ... debug each node
COUNT = 0

def alpha_beta_search(board, cutoff = None, player = 'x'):
    """start the search. cutoff = None is minimax search with alpha-beta pruning."""
    global DEBUG, COUNT
    COUNT = 0

    value, move = max_value_ab(board, player, -math.inf, +math.inf, 0, cutoff)
    
    if DEBUG >= 1: print(f"Number of nodes searched (cutoff = {cutoff}): {COUNT}") 
    
    return {"move": move, "value": value}

def max_value_ab(state, player, alpha, beta, depth, cutoff):
    """player's best move."""
    global DEBUG, COUNT
    COUNT += 1
    
    # cut off and terminal test
    v, terminal = eval_fun(state, player)
    if((cutoff is not None and depth >= cutoff) or terminal): 
        if(terminal): 
            alpha, beta = v, v
        if DEBUG >= 2: print(f"stopped at {depth}: {state} term: {terminal} eval: {v} [{alpha}, {beta}]" ) 
        return v, None
    
    v, move = -math.inf, None

    # check all possible actions in the state, update alpha and return move with the largest value
    for a in get_actions(state):
        v2, a2 = min_value_ab(result(state, player, a), player, alpha, beta, depth + 1, cutoff)
        if v2 > v:
            v, move = v2, a
            alpha = max(alpha, v)
        if v >= beta: return v, move
    
    return v, move

def min_value_ab(state, player, alpha, beta, depth, cutoff):
    """opponent's best response."""
    global DEBUG, COUNT
    COUNT += 1
    
    # cut off and terminal test
    v, terminal = eval_fun(state, player)
    if((cutoff is not None and depth >= cutoff) or terminal): 
        if(terminal): 
            alpha, beta = v, v
        if DEBUG >= 2: print(f"stopped at {depth}: {state} term: {terminal} eval: {v} [{alpha}, {beta}]" ) 
        return v, None
    
    v, move = +math.inf, None

    # check all possible actions in the state, update beta and return move with the smallest value
    for a in get_actions(state):
        v2, a2 = max_value_ab(result(state, other(player), a), player, alpha, beta, depth + 1, cutoff)
        if v2 < v:
            v, move = v2, a
            beta = min(beta, v)
        if v <= alpha: return v, move
    
    return v, move
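
To look at the pruning logic in isolation, here is a minimal sketch of alpha-beta search on a small explicit tree given as nested lists, where the leaves are numbers. It leaves out the cutoff and the Tic-Tac-Toe state. The function `alpha_beta` and the `visited` list are hypothetical and only serve this illustration.

```python
import math

def alpha_beta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the minimax value of a nested-list tree; leaves are numbers."""
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)    # record which leaves are actually evaluated
        return node
    if maximizing:
        v = -math.inf
        for child in node:
            v = max(v, alpha_beta(child, False, alpha, beta, visited))
            alpha = max(alpha, v)
            if v >= beta:           # Min above will never allow this branch
                break
        return v
    else:
        v = math.inf
        for child in node:
            v = min(v, alpha_beta(child, True, alpha, beta, visited))
            beta = min(beta, v)
            if v <= alpha:          # Max above already has a better option
                break
        return v

# Max at the root chooses among three Min nodes with three leaves each.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
visited = []
print(alpha_beta(tree, True, visited=visited))
print(visited)
```

In this tree the root value is 3, and the leaves 4 and 6 are never evaluated: after seeing the leaf 2, the second Min node can be worth at most 2, which is already worse for Max than the 3 from the first branch.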

## Some Tests

### x is about to win (play 8)

In [12]:
board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'

print("Board:")
show_board(board)

print()
%time display(alpha_beta_search(board, 2))

print()
%time display(alpha_beta_search(board, 4))

print()
%time display(alpha_beta_search(board))

Board:
[['x' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' ' ']]

Number of nodes searched (cutoff = 2): 13


{'move': 8, 'value': 1}

CPU times: user 11.1 ms, sys: 0 ns, total: 11.1 ms
Wall time: 9.86 ms

Number of nodes searched (cutoff = 4): 47


{'move': 2, 'value': 1}

CPU times: user 33.2 ms, sys: 380 µs, total: 33.5 ms
Wall time: 32.1 ms

Number of nodes searched (cutoff = None): 61


{'move': 2, 'value': 1}

CPU times: user 24.2 ms, sys: 2.46 ms, total: 26.7 ms
Wall time: 25.9 ms


### o is about to win

In [13]:
board = empty_board() 
board[0] = 'o'
board[1] = 'o'
board[3] = 'o'
board[4] = 'x'
board[8] = 'x'

print("Board:")
show_board(board)

print()
%time display(alpha_beta_search(board, 2))
print()
%time display(alpha_beta_search(board))

Board:
[['o' 'o' ' ']
 ['o' 'x' ' ']
 [' ' ' ' 'x']]

Number of nodes searched (cutoff = 2): 11


{'move': 2, 'value': -1}

CPU times: user 900 µs, sys: 9.08 ms, total: 9.98 ms
Wall time: 8.58 ms

Number of nodes searched (cutoff = None): 15


{'move': 2, 'value': -1}

CPU times: user 7.71 ms, sys: 0 ns, total: 7.71 ms
Wall time: 6.78 ms


### x can draw if it chooses 7

In [14]:
board = empty_board() 
board[0] = 'x'
board[1] = 'o'
board[2] = 'x'
board[4] = 'o'

print("Board:")
show_board(board)

print()
%time display(alpha_beta_search(board, 2))
print()
%time display(alpha_beta_search(board, 4))
print()
%time display(alpha_beta_search(board))

Board:
[['x' 'o' 'x']
 [' ' 'o' ' ']
 [' ' ' ' ' ']]

Number of nodes searched (cutoff = 2): 21


{'move': 7, 'value': -0.1}

CPU times: user 9.49 ms, sys: 3 ms, total: 12.5 ms
Wall time: 11.2 ms

Number of nodes searched (cutoff = 4): 81


{'move': 7, 'value': 0}

CPU times: user 44.7 ms, sys: 2.78 ms, total: 47.4 ms
Wall time: 46.3 ms

Number of nodes searched (cutoff = None): 101


{'move': 7, 'value': 0}

CPU times: user 35.8 ms, sys: 43 µs, total: 35.9 ms
Wall time: 35.1 ms


### Empty board: Only a draw can be guaranteed

In [16]:
board = empty_board() 

print("Board:")
show_board(board)


print()
%time display(alpha_beta_search(board, 2))
print()
%time display(alpha_beta_search(board, 4))
print()
%time display(alpha_beta_search(board))

Board:
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 [' ' ' ' ' ']]

Number of nodes searched (cutoff = 2): 26


{'move': 0, 'value': 0}

CPU times: user 9.71 ms, sys: 495 µs, total: 10.2 ms
Wall time: 9.24 ms

Number of nodes searched (cutoff = 4): 541


{'move': 4, 'value': 0.0}

CPU times: user 149 ms, sys: 385 µs, total: 149 ms
Wall time: 148 ms

Number of nodes searched (cutoff = None): 18297


{'move': 0, 'value': 0}

CPU times: user 3.51 s, sys: 1.58 ms, total: 3.52 s
Wall time: 3.51 s


### A bad situation

In [17]:
board = empty_board() 
board[0] = 'o'
board[2] = 'x'
board[8] = 'o'

print("Board:")
show_board(board)

print()
%time display(alpha_beta_search(board, 2))
print()
%time display(alpha_beta_search(board, 4))
print()
%time display(alpha_beta_search(board))

Board:
[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]

Number of nodes searched (cutoff = 2): 26


{'move': 4, 'value': -0.2}

CPU times: user 9.58 ms, sys: 4.15 ms, total: 13.7 ms
Wall time: 12.5 ms

Number of nodes searched (cutoff = 4): 148


{'move': 1, 'value': -1}

CPU times: user 48.1 ms, sys: 101 µs, total: 48.2 ms
Wall time: 47 ms

Number of nodes searched (cutoff = None): 238


{'move': 1, 'value': -1}

CPU times: user 53.8 ms, sys: 0 ns, total: 53.8 ms
Wall time: 53 ms


## Experiments


### Baseline: Randomized Player

An agent that plays completely at random serves as a weak baseline.

In [18]:
def random_player(board, player = None):
    """Simple player that chooses a random empty square. player is unused"""
    return np.random.choice(get_actions(board))

show_board(board)
random_player(board)

[['o' ' ' 'x']
 [' ' ' ' ' ']
 [' ' ' ' 'o']]


1

### The Environment

Implement the environment that calls the agent. The percept is the board and the action is a move.

In [19]:
DEBUG = 0

def switch_player(player, x, o):
    if player == 'x':
        return 'o', o
    else:
        return 'x', x

def play(x, o, N = 100):
    results = {'x': 0, 'o': 0, 'd': 0}
    for i in range(N):
        board = empty_board()
        player, fun = 'x', x
        
        while True:
            a = fun(board, player)
            board = result(board, player, a)
            
            win = check_win(board)
            if win != 'n':
                if DEBUG >= 1: print(f"{board} winner: {win}")
                results[win] += 1
                break
            
            player, fun = switch_player(player, x, o)   
 
    return results
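
Any function that takes the board and the player's mark and returns the index of an empty square can act as an agent in this environment. As a minimal sketch of that interface, here is a hypothetical deterministic agent, `first_free_player`, that is not part of the experiments below.

```python
def first_free_player(board, player=None):
    """Hypothetical agent: pick the lowest-index empty square. player is unused.

    Assumes the board has at least one empty square, which holds whenever
    the environment asks for a move.
    """
    return board.index(' ')

print(first_free_player(['x', 'o', ' ', ' ', 'x', ' ', ' ', ' ', ' ']))
```

Such an agent could then be passed to `play` in place of `random_player`.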

### Random vs. Random

In [20]:
%time display(play(random_player, random_player))

{'x': 57, 'o': 28, 'd': 15}

CPU times: user 105 ms, sys: 8.2 ms, total: 113 ms
Wall time: 98.2 ms


### Heuristic Alpha-Beta Tree Search vs. Random

In [21]:
def heuristic2_player(board, player = 'x'):
    return alpha_beta_search(board, cutoff = 2, player = player)["move"]

def heuristic4_player(board, player = 'x'):
    return alpha_beta_search(board, cutoff = 4, player = player)["move"]

def alpha_beta_player(board, player = 'x'):
    return alpha_beta_search(board, cutoff = None, player = player)["move"]

DEBUG = 1
print("heuristic2 vs. random:")
display(play(heuristic2_player, random_player, N = 3))

heuristic2 vs. random:
Number of nodes searched (cutoff = 2): 26
Number of nodes searched (cutoff = 2): 37
Number of nodes searched (cutoff = 2): 13
['x', ' ', 'o', 'x', ' ', ' ', 'x', ' ', 'o'] winner: x
Number of nodes searched (cutoff = 2): 26
Number of nodes searched (cutoff = 2): 27
Number of nodes searched (cutoff = 2): 10
['x', 'x', 'x', 'o', ' ', ' ', ' ', ' ', 'o'] winner: x
Number of nodes searched (cutoff = 2): 26
Number of nodes searched (cutoff = 2): 37
Number of nodes searched (cutoff = 2): 13
['x', ' ', 'o', 'x', 'o', ' ', 'x', ' ', ' '] winner: x


{'x': 3, 'o': 0, 'd': 0}

In [22]:
DEBUG = 0
print("heuristic2 vs. random:")
%time display(play(heuristic2_player, random_player))

print("heuristic4 vs. random:")
%time display(play(heuristic4_player, random_player))

print()
print("random vs. heuristic2")
%time display(play(random_player, heuristic2_player))

print("random vs. heuristic4")
%time display(play(random_player, heuristic4_player))

heuristic2 vs. random:


{'x': 93, 'o': 0, 'd': 7}

CPU times: user 1.9 s, sys: 515 µs, total: 1.9 s
Wall time: 1.89 s
heuristic4 vs. random:


{'x': 97, 'o': 0, 'd': 3}

CPU times: user 24.1 s, sys: 91 µs, total: 24.1 s
Wall time: 24.1 s

random vs. heuristic2


{'x': 5, 'o': 79, 'd': 16}

CPU times: user 1.76 s, sys: 0 ns, total: 1.76 s
Wall time: 1.76 s
random vs. heuristic4


{'x': 0, 'o': 80, 'd': 20}

CPU times: user 14.4 s, sys: 0 ns, total: 14.4 s
Wall time: 14.4 s


### Heuristic vs. Minimax with Alpha-Beta Pruning

In [25]:
DEBUG = 0

# Note: No randomness -> play only once

print("heuristic2 vs. alpha_beta")
%time display(play(heuristic2_player, alpha_beta_player, N = 1))

print()
print("alpha_beta vs. heuristic2")
%time display(play(alpha_beta_player, heuristic2_player, N = 1))

print()
print("heuristic4 vs alpha_beta")
%time display(play(heuristic4_player, alpha_beta_player, N = 1))

print()
print("alpha_beta vs. heuristic4")
%time display(play(alpha_beta_player, heuristic4_player, N = 1))

heuristic2 vs. alpha_beta


{'x': 0, 'o': 0, 'd': 1}

CPU times: user 520 ms, sys: 38 µs, total: 520 ms
Wall time: 519 ms

alpha_beta vs. heuristic2


{'x': 1, 'o': 0, 'd': 0}

CPU times: user 3.67 s, sys: 0 ns, total: 3.67 s
Wall time: 3.67 s

heuristic4 vs alpha_beta


{'x': 0, 'o': 0, 'd': 1}

CPU times: user 709 ms, sys: 0 ns, total: 709 ms
Wall time: 708 ms

alpha_beta vs. heuristic4


{'x': 0, 'o': 0, 'd': 1}

CPU times: user 3.69 s, sys: 0 ns, total: 3.69 s
Wall time: 3.69 s


### Heuristic vs. Heuristic

In [27]:
DEBUG = 0

# Note: No randomness -> play only once

print("heuristic2 vs. heuristic4")
%time display(play(heuristic2_player, heuristic4_player, N = 1))

print()
print("heuristic4 vs. heuristic2")
%time display(play(heuristic4_player, heuristic2_player, N = 1))

heuristic2 vs. heuristic4


{'x': 0, 'o': 0, 'd': 1}

CPU times: user 180 ms, sys: 38 µs, total: 180 ms
Wall time: 178 ms

heuristic4 vs. heuristic2


{'x': 0, 'o': 0, 'd': 1}

CPU times: user 270 ms, sys: 0 ns, total: 270 ms
Wall time: 269 ms


__Idea:__ Start experiments with different boards that already have a few x's and o's randomly placed on them.
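
One way to generate such starting boards is sketched below. The function `random_start_board` is hypothetical. It limits the number of pre-placed marks to four so that neither player can already hold three squares, which means no win check is needed. Using the boards would also require a version of `play` that accepts a starting board, which the notebook does not currently provide.

```python
import random

def random_start_board(n_moves=2, rng=random):
    """Return a board with n_moves marks placed at random, alternating x and o.

    With an even n_moves, both players have the same number of marks,
    so x is the next player to move.
    """
    if not 0 <= n_moves <= 4:
        raise ValueError("n_moves must be between 0 and 4")
    board = [' '] * 9
    # Draw distinct squares and fill them in turn order: x, o, x, o.
    for k, square in enumerate(rng.sample(range(9), n_moves)):
        board[square] = 'x' if k % 2 == 0 else 'o'
    return board

# A seeded generator makes the experiment repeatable.
print(random_start_board(4, random.Random(0)))
```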