# Chapter 8: Monte Carlo Tree Search



***
*“Our program AlphaGo combined these deep neural networks with a state-of-the-art tree search. In October 2015, AlphaGo became the first program to defeat a professional human player.”*

-- David Silver
***



What you'll learn in this chapter:

* Monte Carlo Tree Search (MCTS) and Upper Confidence Bounds for Trees (UCT)  
* Implementing a naive MCTS algorithm in the Coin game
* Breaking down the four steps in MCTS: select, expand, simulate, and backpropagate
* Designing a generic MCTS game strategy that can apply to the Coin game, Tic Tac Toe, and Connect Four



So far we have covered one type of tree search: MiniMax tree search. In
simple games such as Last Coin Standing or Tic Tac Toe, MiniMax agents
solve the game and produce strategies that are at least as good as any other
game strategy. In complicated games such as Connect Four or Chess, the number of
possibilities is too large for a MiniMax agent to solve the game
in a short amount of time. We therefore use depth pruning and alpha-beta pruning,
combined with powerful position evaluation functions, to help MiniMax agents come
up with intelligent moves in the allotted time. With such an approach, Deep Blue
beat the world Chess champion Garry Kasparov in 1997.

Position evaluation functions are relatively accurate in games such as Chess,
but in certain games such as Go, evaluating positions is more challenging. For example,
in Chess, if White has one more rook than Black, it’s difficult for Black to win or
tie the game. We are fairly certain that White will win without following a game tree
all the way to the end. In Go, on the other hand, guessing in the mid-game who will
eventually win based on board positions is not as simple. Counting the number of
stones for each side can provide a clue, but this can change in an instant if one side
captures a large number of the opponent’s stones.

Therefore, in the game of Go, researchers usually use another type of tree search:
Monte Carlo Tree Search (MCTS). In depth-pruned MiniMax tree search, the agent
searches all possible outcomes a fixed number of moves ahead. This is sometimes
called a breadth-first approach. In contrast, in MCTS, the agent simulates games all
the way to the terminal state to see the game outcome, without covering all
scenarios. MCTS, therefore, is a depth-first approach.
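To make the contrast concrete, here is a minimal sketch that compares the two search shapes on a coin pile. It assumes a simplified rule set (take one or two coins per turn until the pile is empty), and the helper names `legal_moves`, `count_positions`, and `random_playout_length` are hypothetical, not part of the book's modules.

```python
import random

def legal_moves(coins):
    """Assumed rule: a player may take 1 or 2 coins, never more than remain."""
    return [m for m in (1, 2) if m <= coins]

def count_positions(coins, depth):
    """Positions visited by an exhaustive search that looks `depth` moves ahead."""
    if depth == 0 or coins == 0:
        return 1
    return 1 + sum(count_positions(coins - m, depth - 1)
                   for m in legal_moves(coins))

def random_playout_length(coins, rng):
    """Moves made by one random playout that runs until the pile is empty."""
    steps = 0
    while coins > 0:
        coins -= rng.choice(legal_moves(coins))
        steps += 1
    return steps

rng = random.Random(0)
# Wide but shallow: every branch, only three moves deep
print(count_positions(21, depth=3))
# Narrow but deep: a single line of play, all the way to the end
print(random_playout_length(21, rng))
```

The depth-limited search touches every branch but stops early, while the playout follows just one branch to a terminal state.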

The idea behind MCTS is to roll out random games starting from the current game
state and see what the average game outcome is. If you roll out, say, 1,000 games from
this point and Player 1 wins 99 percent of them, you know that the current game
state must favor Player 1 over Player 2. To select the best next move, the MCTS
algorithm uses the Upper Confidence Bounds for Trees (UCT) method. We’ll break
down the process into four steps: select, expand, simulate, and backpropagate. You’ll
implement an MCTS algorithm that can be applied to the three games in this book:
the coin game, Tic Tac Toe, and Connect Four.
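The rollout idea can be sketched in a few lines. The example below is only an illustration: it assumes that players alternate taking one or two coins and that whoever takes the last coin wins, and the functions `rollout_winner` and `estimate_win_rate` are hypothetical helpers rather than code from the book's modules.

```python
import random

def rollout_winner(coins, player, rng):
    """Play uniformly random moves; return the player who takes the last coin.

    Assumed rule: taking the last coin wins. Players are numbered 1 and 2.
    """
    while True:
        move = rng.choice([m for m in (1, 2) if m <= coins])
        coins -= move
        if coins == 0:
            return player
        player = 3 - player  # switch between player 1 and player 2

def estimate_win_rate(coins, player_to_move, num_rollouts=1000, seed=0):
    """Fraction of random rollouts won by the player who moves first."""
    rng = random.Random(seed)
    wins = sum(rollout_winner(coins, player_to_move, rng) == player_to_move
               for _ in range(num_rollouts))
    return wins / num_rollouts

# Average outcome of 1,000 random games from a pile of 20 coins
print(estimate_win_rate(20, player_to_move=2))
```

A win rate far from 0.5 suggests that the position favors one side, at least when both sides play randomly.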

As the opening quote of this chapter states, DeepMind combined deep learning with
MCTS when designing game strategies for AlphaGo, which led to the algorithm’s
great success. The “state-of-the-art tree search” David Silver mentions is an improved
version of MCTS. After learning the basics of MCTS, you’ll be ready to integrate
MCTS with other algorithms to create superhuman agents in various games,
as you’ll learn to do in the rest of the book.

# 1. What Is Monte Carlo Tree Search?

## 1.1. A Thought Experiment

In [1]:
from utils.coin_simple_env import coin_game

# Initiate the game environment
env = coin_game()
state=env.reset()   
print(f"the current state is {state}")
# Player 1 takes one coin from the pile
player1_move=1
print(f"Player 1 has chosen action={player1_move}") 
state, reward, done, info = env.step(player1_move)
print(f"the current state is {state}")

the current state is 21
Player 1 has chosen action=1
the current state is 20


In [2]:
counts={}
wins={}
losses={}
for move in env.validinputs:
    counts[move]=0
    wins[move]=0
    losses[move]=0
print(f"the dictionary counts has values {counts}")   
print(f"the dictionary wins has values {wins}")    
print(f"the dictionary losses has values {losses}")    

the dictionary counts has values {1: 0, 2: 0}
the dictionary wins has values {1: 0, 2: 0}
the dictionary losses has values {1: 0, 2: 0}


In [3]:
from copy import deepcopy

# simulate 100,000 games
for _ in range(100000):
    # create a deep copy of the game environment
    env_copy=deepcopy(env)
    # record moves
    actions=[]
    # play a full game
    while True:
        # randomly select a next move
        move=env_copy.sample()
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            # see whether the next move is 1 or 2
            next_move=deepcopy(actions[0])
            # update total number of simulated games
            counts[actions[0]] += 1
            # update total number of wins
            if (reward==1 and env.turn==1) or \
                (reward==-1 and env.turn==2):
                wins[actions[0]] += 1
            # update total number of losses                
            if (reward==-1 and env.turn==1) or \
                (reward==1 and env.turn==2):
                losses[actions[0]] += 1                
            break

In [4]:
print(f"the dictionary counts has values {counts}")   
print(f"the dictionary wins has values {wins}")    
print(f"the dictionary losses has values {losses}") 

the dictionary counts has values {1: 49986, 2: 50014}
the dictionary wins has values {1: 25022, 2: 25191}
the dictionary losses has values {1: 24964, 2: 24823}


In [5]:
# See which action is most promising
scores={}
for k,v in counts.items():
    scores[k]=(wins.get(k,0)-losses.get(k,0))/v
print(scores)
best_move=max(scores,key=scores.get)  
print(f"the best move is {best_move}")

{1: 0.0011603248909694715, 2: 0.007357939776862479}
the best move is 2
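As a sanity check, the scores printed above can be reproduced by hand from the tallies reported by cell In [4]. The snippet below only redoes that arithmetic; the `margins` dictionary is a name introduced here for illustration.

```python
# Tallies copied from the output of cell In [4]
counts = {1: 49986, 2: 50014}
wins = {1: 25022, 2: 25191}
losses = {1: 24964, 2: 24823}

# Net number of wins for each candidate move
margins = {move: wins[move] - losses[move] for move in counts}
# Score = (wins - losses) / games, an average outcome between -1 and 1
scores = {move: margins[move] / counts[move] for move in counts}

print(margins)
print(scores)
print(max(scores, key=scores.get))
```

Both scores are close to zero, so under purely random play the gap between the two moves is small: about 58 net wins for taking one coin versus 368 for taking two, each out of roughly 50,000 games.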


## 1.2. A Naive MCTS Algorithm

```python
import random
from copy import deepcopy

def simulate_a_game(env,counts,wins,losses):
    env_copy=deepcopy(env)
    actions=[]
    # play a full game
    while True:
        #randomly select a next move
        move=random.choice(env_copy.validinputs)
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            counts[actions[0]] += 1
            if (reward==1 and env.turn==1) or \
                (reward==-1 and env.turn==2):
                wins[actions[0]] += 1
            if (reward==-1 and env.turn==1) or \
                (reward==1 and env.turn==2):
                losses[actions[0]] += 1                
            break
    return counts, wins, losses
```

In the file *ch08util.py*, we also define a *best_move()* function as follows:

```python
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    return max(scores,key=scores.get)  
```

Finally, we define the *naive_mcts()* function in *ch08util.py* as follows:

```python
def naive_mcts(env, num_rollouts=10000):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # roll out games
    for _ in range(num_rollouts):
        counts,wins,losses=simulate_a_game(env,counts,\
                                           wins, losses)
    return best_move(counts,wins,losses) 
```

# 2. A Naive MCTS Player in the Coin Game


In [6]:
from utils.ch08util import naive_mcts

results=[]
for i in range(100):
    env=coin_game()
    state=env.reset() 
    if i%2==0:
        action=env.sample()
        state, reward, done, info=env.step(action)
    while True:
        action=naive_mcts(env,num_rollouts=1000)  
        state, reward, done, info=env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            results.append(1) 
            break  
        action=env.sample()   
        state, reward, done, info=env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            results.append(-1) 
            break              

In [7]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")         

the MCTS agent has won 96 games
the MCTS agent has lost 4 games


# 3. Upper Confidence Bounds for Trees (UCT)

## 3.1. The UCT Formula
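
The score that UCT assigns to each candidate move can be read off from the *select()* function in Section 3.2. The symbols below are notation chosen here for illustration, not taken from the original text:

$$
\mathrm{UCT}_i = v_i + C\,\sqrt{\frac{\ln N}{n_i}}, \qquad v_i = \frac{w_i - l_i}{n_i}
$$

Here $n_i$ is the number of rollouts that started with move $i$ (`counts`), $w_i$ and $l_i$ are the wins and losses recorded for that move (`wins`, `losses`), $N = \sum_j n_j$ is the total number of rollouts so far, and $C$ is the `temperature` argument. The first term rewards moves that have done well on average; the second term grows for moves that have been tried rarely, which encourages exploration.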

## 3.2. An MCTS Agent

```python
from copy import deepcopy
from math import sqrt, log

def select(env,counts,wins,losses,temperature):
    # calculate the uct score for all next moves
    scores={}
    # the ones not visited get the priority
    for k in env.validinputs:
        if counts[k]==0:
            return k
    # total number of simulations conducted
    N=sum(counts.values())
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            # vi for each next move
            vi=(wins.get(k,0)-losses.get(k,0))/v
            exploration=temperature*sqrt(log(N)/counts[k])
            scores[k]=vi+exploration
    # Select the next move with the highest UCT score
    return max(scores,key=scores.get)   
```
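To see how the exploration term changes the choice, here is a small worked example with made-up tallies. The function `uct_scores` is a hypothetical standalone helper that mirrors the arithmetic in *select()*; it assumes every move has been visited at least once.

```python
from math import log, sqrt

def uct_scores(counts, wins, losses, temperature=1.4):
    """UCT score per move; assumes every count is greater than zero."""
    total = sum(counts.values())
    scores = {}
    for move, n in counts.items():
        value = (wins[move] - losses[move]) / n          # average outcome
        bonus = temperature * sqrt(log(total) / n)       # exploration term
        scores[move] = value + bonus
    return scores

# Illustrative numbers only: move 1 has been tried far more often than move 2
counts = {1: 90, 2: 10}
wins = {1: 50, 2: 5}
losses = {1: 40, 2: 5}

with_bonus = uct_scores(counts, wins, losses, temperature=1.4)
no_bonus = uct_scores(counts, wins, losses, temperature=0)
print(max(with_bonus, key=with_bonus.get))
print(max(no_bonus, key=no_bonus.get))
```

With the bonus switched off, move 1 has the better average and is chosen. With a temperature of 1.4, the rarely tried move 2 receives a larger bonus and is selected for the next rollout.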

```python
def expand(env,move):
    env_copy=deepcopy(env)
    state,reward,done,info=env_copy.step(move)
    return env_copy, done, reward
```

```python
def simulate(env_copy,done,reward):
    # if the game has already ended
    if done:
        return reward
    while True:
        move=env_copy.sample()
        state,reward,done,info=env_copy.step(move)
        if done:
            return reward
```

```python
def backpropagate(env,move,reward,counts,wins,losses):
    # add 1 to the total game counts
    counts[move]=counts.get(move,0)+1
    # if the current player wins
    if reward==1 and (env.turn==1 or \
        env.turn=="X" or env.turn=="red"):
        wins[move]=wins.get(move,0)+1
    if reward==-1 and (env.turn==2 or \
        env.turn=="O" or env.turn=="yellow"):
        wins[move]=wins.get(move,0)+1        
    if reward==-1 and (env.turn==1 or \
        env.turn=="X" or env.turn=="red"):
        losses[move]=losses.get(move,0)+1
    if reward==1 and (env.turn==2 or \
        env.turn=="O" or env.turn=="yellow"):
        losses[move]=losses.get(move,0)+1       
    return counts,wins,losses
```

```python
def next_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    return max(scores,key=scores.get)
```

Finally, we define the *mcts()* function in the file *ch08util.py*:

```python
def mcts(env,num_rollouts=100,temperature=1.4):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # create three dictionaries counts, wins, losses
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # roll out games
    for _ in range(num_rollouts):
        # selection
        move=select(env,counts,wins,losses,temperature)
        # expansion
        env_copy, done, reward=expand(env,move)
        # simulation
        reward=simulate(env_copy,done,reward)      
        # backpropagate
        counts,wins,losses=backpropagate(\
            env,move,reward,counts,wins,losses)
    # make the move
    return next_move(counts,wins,losses)
```
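The four steps can also be tried without the book's game environments. The sketch below is a compact, self-contained variant under stated assumptions: `ToyCoinGame` is a hypothetical stand-in (take one or two coins, taking the last coin wins), and `toy_mcts_move` keeps statistics only for the immediate next moves, as *mcts()* does. It is not the code in *ch08util.py*.

```python
import random
from math import log, sqrt

class ToyCoinGame:
    """Hypothetical stand-in game: take 1 or 2 coins; the last coin wins."""
    def __init__(self, coins=21, turn=1):
        self.coins = coins
        self.turn = turn

    @property
    def validinputs(self):
        return [m for m in (1, 2) if m <= self.coins]

    def step(self, move):
        """Apply a move and return (done, winner)."""
        self.coins -= move
        if self.coins == 0:
            return True, self.turn
        self.turn = 3 - self.turn
        return False, None

def toy_mcts_move(game, num_rollouts=2000, temperature=1.4, seed=0):
    rng = random.Random(seed)
    me = game.turn
    counts = {m: 0 for m in game.validinputs}
    net_wins = {m: 0 for m in game.validinputs}  # wins minus losses
    for _ in range(num_rollouts):
        # select: unvisited moves first, then the highest UCT score
        unvisited = [m for m in counts if counts[m] == 0]
        if unvisited:
            move = unvisited[0]
        else:
            total = sum(counts.values())
            move = max(counts, key=lambda m: net_wins[m] / counts[m]
                       + temperature * sqrt(log(total) / counts[m]))
        # expand: apply the selected move to a copy of the game
        sim = ToyCoinGame(game.coins, game.turn)
        done, winner = sim.step(move)
        # simulate: random moves until the game ends
        while not done:
            done, winner = sim.step(rng.choice(sim.validinputs))
        # backpropagate: record the outcome for the selected move
        counts[move] += 1
        net_wins[move] += 1 if winner == me else -1
    # final choice: best average outcome, without the exploration bonus
    return max(counts, key=lambda m: net_wins[m] / counts[m])

# With two coins left, taking both wins on the spot
print(toy_mcts_move(ToyCoinGame(coins=2)))
```

Note that the final move is chosen by average outcome alone; the exploration bonus only steers which move gets the next rollout.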

# 4. Test the MCTS Agent in Tic Tac Toe

## 4.1. Manually Play against the MCTS Agent in Tic Tac Toe

In [8]:
from utils.ttt_simple_env import ttt
from utils.ch08util import mcts

env=ttt()
state=env.reset() 
while True:
    action=mcts(env,num_rollouts=10000)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"Current state is \n{state.reshape(3,3)[::-1]}")
    if done:
        if reward==1:
            print(f"Player {env.turn} has won!") 
        else:
            print("Game over, it's a tie!")    
        break  
    action=int(input("What's your move?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"Current state is \n{state.reshape(3,3)[::-1]}")
    if done:
        print(f"Player {env.turn} has won!")  
        break

Player X has chosen 5
Current state is 
[[0 0 0]
 [0 1 0]
 [0 0 0]]
What's your move?1
Player O has chosen 1
Current state is 
[[ 0  0  0]
 [ 0  1  0]
 [-1  0  0]]
Player X has chosen 3
Current state is 
[[ 0  0  0]
 [ 0  1  0]
 [-1  0  1]]
What's your move?7
Player O has chosen 7
Current state is 
[[-1  0  0]
 [ 0  1  0]
 [-1  0  1]]
Player X has chosen 4
Current state is 
[[-1  0  0]
 [ 1  1  0]
 [-1  0  1]]
What's your move?6
Player O has chosen 6
Current state is 
[[-1  0  0]
 [ 1  1 -1]
 [-1  0  1]]
Player X has chosen 2
Current state is 
[[-1  0  0]
 [ 1  1 -1]
 [-1  1  1]]
What's your move?8
Player O has chosen 8
Current state is 
[[-1 -1  0]
 [ 1  1 -1]
 [-1  1  1]]
Player X has chosen 9
Current state is 
[[-1 -1  1]
 [ 1  1 -1]
 [-1  1  1]]
Game over, it's a tie!


## 4.2. Effectiveness of the Tic Tac Toe MCTS Agent


In [9]:
import random

results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=env.sample()
        state, reward, done, info=env.step(action)
    while True:
        action=mcts(env,num_rollouts=10000)  
        state, reward, done, info=env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=env.sample()   
        state, reward, done, info=env.step(action)
        if done:
            # result is -1 if the random-move agent wins
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  

In [10]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game is tied {ties} times") 

the MCTS agent has won 96 games
the MCTS agent has lost 1 games
the game is tied 3 times


# 5. An MCTS Agent in Connect Four
 

## 5.1. A Manual Game against the Connect Four MCTS Agent


In [11]:
from utils.conn_simple_env import conn

# Initiate the game environment
env=conn()
state=env.reset() 
while True:
    action=mcts(env,num_rollouts=5000)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info=env.step(action)
    print(f"Current state is \n{state.T[::-1]}")
    if done:
        print("Player red has won!") 
        break
    action=int(input("What's your move?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info=env.step(action)
    print(f"Current state is \n{state.T[::-1]}")
    if done:
        if reward!=0:
            print("Player yellow has won!") 
        else:
            print("Game over, it's a tie!")    
        break  

Player red has chosen 4
Current state is 
[[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0]]
What's your move?4
Player yellow has chosen 4
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0  1  0  0  0]]
Player red has chosen 3
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  1  1  0  0  0]]
What's your move?4
Player yellow has chosen 4
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  1  1  0  0  0]]
Player red has chosen 5
Current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  1  1  1  0  0]]
What's your move?5
Player yellow has chosen 5
Current state is 

## 5.2. Effectiveness of the Connect Four MCTS Agent


In [12]:
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=env.sample()
        state, reward, done, info=env.step(action)
    while True:
        action=mcts(env,num_rollouts=5000)  
        state, reward, done, info=env.step(action)
        if done:
            # result is 1 if the MCTS agent wins, 0 if the game is tied
            results.append(abs(reward)) 
            break  
        action=env.sample()  
        state, reward, done, info=env.step(action)
        if done:
            # result is -1 if the MCTS agent loses, 0 if the game is tied
            results.append(-abs(reward)) 
            break  

In [13]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game is tied {ties} times") 

the MCTS agent has won 100 games
the MCTS agent has lost 0 games
the game is tied 0 times
