# Chapter 8: Introduction to Monte Carlo Tree Search

So far we have covered one type of tree search: minimax tree search. In simple games such as Last Coin Standing or Tic Tac Toe, minimax agents solve the game and provide game strategies that are as good as any other intelligent game strategies. In complicated games such as Connect Four or Chess, the number of possibilities is too large and it's infeasible for the minimax agent to solve the game in a short amount of time. We therefore use depth pruning and alpha-beta pruning, combined with powerful position evaluation functions, so that the minimax agent can come up with intelligent moves in the allotted time. With such an approach, Deep Blue beat the world chess champion Garry Kasparov in 1997.

While position evaluation functions are relatively accurate in games such as Chess, evaluating positions is harder in certain games such as Go. For example, in Chess, if White has one more rook than Black, it's difficult for Black to win or tie the game. We are fairly certain that White will win without following a game tree all the way to the end. In Go, on the other hand, guessing who will eventually win in the mid-game based on board positions is more difficult. Counting the number of stones for each side can provide a clue, but this can change in an instant if one side captures a large number of the opponent's stones.

Therefore, in the game of Go, researchers usually use another type of tree search: Monte Carlo Tree Search (MCTS). In depth-pruned minimax tree search, the agent searches all possible outcomes up to a fixed number of moves ahead; this is sometimes called a breadth-first approach. In contrast, in MCTS, the agent simulates games all the way to a terminal state to see the outcome, without covering every scenario. In that sense, MCTS is a depth-first approach.

The idea behind MCTS is to roll out random games starting from the current game state and see what the average game outcome is. If you roll out, say, 100 games from this point and Player 1 wins 99 of them, you can be fairly confident that the current game state favors Player 1 over Player 2.

In this chapter, you'll learn to implement a simple version of MCTS, which we call naive MCTS. We'll implement it in the coin game and show that it provides effective game strategies.

***
$\mathbf{\text{Create a subfolder for files in Chapter 8}}$<br>
***
We'll put all files in Chapter 8 in a subfolder /files/ch08. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch08", exist_ok=True)

# 1. What Is Monte Carlo Tree Search?

Monte Carlo Tree Search (MCTS) is a simulation method in AI that evaluates a game state by rolling out a large number of random games and seeing what the average outcome is.
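As a rough illustration of rolling out random games, the sketch below estimates a win rate for a stripped-down coin game without using the book's environment. The rules assumed here (players alternate taking one or two coins, and whoever takes the last coin wins) and the names `random_rollout` and `estimate_win_rate` are assumptions made for this example, not the book's code.

```python
import random

def random_rollout(coins, player, rng):
    """Play random moves until the pile is empty; return the winner (1 or 2).

    Assumed rules: take 1 or 2 coins per turn; taking the last coin wins.
    """
    while True:
        # a player cannot take more coins than remain in the pile
        move = rng.choice([m for m in (1, 2) if m <= coins])
        coins -= move
        if coins == 0:
            return player
        # switch to the other player
        player = 3 - player

def estimate_win_rate(coins, player, num_rollouts=1000, seed=0):
    """Fraction of random rollouts won by Player 1 from the given position."""
    rng = random.Random(seed)
    wins = sum(random_rollout(coins, player, rng) == 1
               for _ in range(num_rollouts))
    return wins / num_rollouts

# 20 coins left in the pile, Player 2 to move
print(estimate_win_rate(20, player=2))
```

A value near 1 would suggest that the position favors Player 1 under random play, and a value near 0 would suggest that it favors Player 2.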

To help us understand how MCTS works, we'll develop a naive MCTS game strategy and apply it to the coin game that we developed in Chapter 1. 

## 1.1. A Thought Experiment

Imagine that you are playing the coin game against a naive MCTS agent. You move first and choose to take one coin from the pile. Let's code that in:

In [2]:
from utils.coin_simple_env import coin_game

# Initiate the game environment
env = coin_game()
state=env.reset()   
print(f"the current state is {state}")
# Player 1 takes one coin from the pile
player1_move=1
print(f"Player 1 has chosen action={player1_move}") 
state, reward, done, info = env.step(player1_move)
print(f"the current state is {state}")

the current state is 21
Player 1 has chosen action=1
the current state is 20


The coin_simple_env module is a simplified coin game environment. It's the same as the coin game environment we used in Chapter 1 except that we removed the graphical rendering functionality. We do this because in MCTS, the agent makes deep copies of the game environment many times. Using a simplified game environment can greatly speed up the tree search process. The file coin_simple_env.py is in the folder /utils/ in the book's GitHub repository. Download the file and put it in /Desktop/ai/utils/ on your computer.

You have chosen action=1. This leaves 20 coins in the pile. Now the naive MCTS agent needs to make a move. The agent thinks as follows: I'll simulate 10000 games from this point on. Sometimes I choose action=1 as my next move and other times I choose action=2 as my next move. I'll see which action leads to more wins for me and I'll pick that action as my next move. 

To do that, the naive MCTS agent first creates three dictionaries, counts, wins, and losses, to record the total number of games, number of wins, and number of losses associated with each next move. Let's code that in as follows:

In [3]:
counts={}
wins={}
losses={}
for move in env.validinputs:
    counts[move]=0
    wins[move]=0
    losses[move]=0
print(f"the dictionary counts has values {counts}")   
print(f"the dictionary wins has values {wins}")    
print(f"the dictionary losses has values {losses}")    

the dictionary counts has values {1: 0, 2: 0}
the dictionary wins has values {1: 0, 2: 0}
the dictionary losses has values {1: 0, 2: 0}


If you run the above code cell, you'll see that the three dictionaries all start with values {1: 0, 2: 0}. The naive MCTS agent will then simulate games and update these three dictionaries accordingly. Below, the agent simulates 10,000 games: 

In [4]:
import random
from copy import deepcopy

# simulate 10,000 games
for _ in range(10000):
    # create a deep copy of the game environment
    env_copy=deepcopy(env)
    # record moves
    actions=[]
    # play a full game
    while True:
        # randomly select a next move
        move=random.choice(env_copy.validinputs)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        if done:
            # see whether the next move was 1 or 2
            next_move=actions[0]
            # update total number of simulated games
            counts[next_move] += 1
            # env.turn (on the original env, not the copy) is the
            # player who is choosing the next move
            # update total number of wins
            if (reward==1 and env.turn==1) or (reward==-1 and env.turn==2):
                wins[next_move] += 1
            # update total number of losses
            if (reward==-1 and env.turn==1) or (reward==1 and env.turn==2):
                losses[next_move] += 1
            break

Next, the MCTS agent looks at the total number of games, the number of wins, and the number of losses associated with each next move: action=1 and action=2. Let's code that in as follows:

In [5]:
print(f"the dictionary counts has values {counts}")   
print(f"the dictionary wins has values {wins}")    
print(f"the dictionary losses has values {losses}") 

the dictionary counts has values {1: 5084, 2: 4916}
the dictionary wins has values {1: 2529, 2: 2458}
the dictionary losses has values {1: 2555, 2: 2458}


Out of the 10,000 games, the MCTS agent chose action=1 in 5,084 games and action=2 in the remaining 4,916. When action=1 was chosen, the agent won 2,529 times and lost 2,555 times, so it is slightly more likely to lose than to win after that move. When action=2 was chosen, the agent won 2,458 times and lost 2,458 times, an equal chance of winning and losing. Therefore, the agent prefers action=2. Let's code in that thought process:

In [6]:
# See which action is most promising
scores={}
for k,v in counts.items():
    scores[k]=(wins.get(k,0)-losses.get(k,0))/v
best_move=max(scores,key=scores.get)  
print(f"the best move is {best_move}")

the best move is 2


Here, the agent first creates a dictionary *scores*. For each possible next move, it assigns a score. The score is the number of wins minus the number of losses, divided by the total number of simulations for that move. This score is a value between -1 and 1: if a move leads to winning 100% of the time, the move has a score of 1; if a move leads to losing 100% of the time, the move has a score of -1; a score of 0 means the move leads to an equal chance of winning and losing.
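As a quick check of the arithmetic, the snippet below recomputes the scores from the counts printed in the earlier output. The dictionary names mirror the ones used above; the four-decimal rounding is only for display.

```python
# counts, wins, and losses as printed in the output above
counts = {1: 5084, 2: 4916}
wins = {1: 2529, 2: 2458}
losses = {1: 2555, 2: 2458}

# score = (wins - losses) / number of simulated games, per move
scores = {move: (wins[move] - losses[move]) / n for move, n in counts.items()}
for move, score in scores.items():
    print(f"action={move}: score={score:.4f}")
# action=1: score=-0.0051
# action=2: score=0.0000
```

Action 1 scores slightly below zero and action 2 scores exactly zero, which is why the agent picks action 2. The gap is small (26 games out of 5,084), so a different run of the simulation could rank the two moves differently.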

## 1.2. A Naive MCTS Algorithm
Based on the thought experiment in the last subsection, let's create a naive MCTS algorithm. Whenever it's the MCTS agent's turn, the agent simulates a number of games (10,000 in the thought experiment) and picks the next action that leads to the better outcome.

To do that, we first define a simulate_a_game() function in the local package. Download ch08util.py from the book's GitHub repository and place it in the folder /Desktop/ai/utils/ on your computer. The function simulate_a_game() is defined as follows:

In [7]:
# imports the function relies on
import random
from copy import deepcopy

def simulate_a_game(env,counts,wins,losses):
    env_copy=deepcopy(env)
    actions=[]
    # play a full game
    while True:
        #randomly select a next move
        move=random.choice(env_copy.validinputs)
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            counts[actions[0]] += 1
            if (reward==1 and env.turn==1) or \
                (reward==-1 and env.turn==2):
                wins[actions[0]] += 1
            if (reward==-1 and env.turn==1) or \
                (reward==1 and env.turn==2):
                losses[actions[0]] += 1                
            break
    return counts, wins, losses

This function takes four arguments: the game environment, env, and the three dictionaries counts, wins, and losses. It plays one hypothetical game all the way to the terminal state by choosing random moves for both players, and then updates the three dictionaries. If the next move is 1, the number of games associated with action 1 increases by 1 in the dictionary counts. Depending on whether the current player has won or lost, either wins or losses is updated for action 1. For example, if action 1 is chosen as the next move and the game is lost, the dictionary wins remains unchanged while the value for key 1 in the dictionary losses increases by 1.



In the file ch08util.py, we also define a best_move() function as follows:

In [8]:
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    best_move=max(scores,key=scores.get)  
    return best_move

This function selects the best move based on the three dictionaries: counts, wins, and losses. It calculates a score for each potential next move: the score is the difference between the number of wins and the number of losses, divided by the number of simulated games for that move. A move that was never simulated gets a score of 0, which also avoids a division by zero.
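Two edge cases are worth seeing in action: a move that was never simulated, and a tie between moves. The sketch below restates best_move() so that it runs on its own; the input dictionaries are made-up values for illustration only.

```python
def best_move(counts, wins, losses):
    # score each move; a move with no simulated games gets a score of 0
    scores = {}
    for k, v in counts.items():
        if v == 0:
            scores[k] = 0
        else:
            scores[k] = (wins.get(k, 0) - losses.get(k, 0)) / v
    return max(scores, key=scores.get)

# move 1 was never simulated (score 0); move 2 lost more often (score -0.2)
print(best_move({1: 0, 2: 10}, {1: 0, 2: 4}, {1: 0, 2: 6}))   # prints 1
# both moves score 0.5; max() keeps the first key it encounters
print(best_move({1: 4, 2: 8}, {1: 3, 2: 6}, {1: 1, 2: 2}))    # prints 1
```

Because an unvisited move gets a neutral score of 0, it can outrank a move that was simulated and did poorly. With enough rollouts, every legal move is very likely to be sampled, so this should rarely matter.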

Finally, we define the naive_mcts() function in ch08util.py as follows:

In [9]:
def naive_mcts(env, num_rollouts=100):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    # roll out games
    for _ in range(num_rollouts):
        counts,wins,losses=simulate_a_game(env,counts,\
                                           wins, losses)
    best_next_move=best_move(counts,wins,losses)  
    return best_next_move

We set the default number of rollouts to 100. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dictionaries, counts, wins, and losses, to record the outcomes from simulated games. Once the simulation is complete, we select the best next move based on the simulation results.
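To try the three functions together without downloading the book's files, the sketch below uses a stand-in environment. `TinyCoinEnv` is hypothetical: its attribute names (`state`, `turn`, `validinputs`) and its `step()` return values are modeled on how the environment is used above, and its rules (take one or two coins, taking the last coin wins, reward 1 when Player 1 wins and -1 when Player 2 wins) are assumptions inferred from the win and loss checks in simulate_a_game(). The functions are condensed restatements, not the contents of ch08util.py.

```python
import random
from copy import deepcopy

class TinyCoinEnv:
    """Hypothetical stand-in for the book's coin_game environment."""
    def __init__(self, coins=21):
        self.start = coins
        self.reset()

    def reset(self):
        self.state = self.start   # coins left in the pile
        self.turn = 1             # player who moves next
        return self.state

    @property
    def validinputs(self):
        # assumed: a player cannot take more coins than remain
        return [m for m in (1, 2) if m <= self.state]

    def step(self, move):
        self.state -= move
        done = self.state == 0
        reward = 0
        if done:
            # assumed convention: +1 if Player 1 wins, -1 if Player 2 wins
            reward = 1 if self.turn == 1 else -1
        else:
            self.turn = 3 - self.turn
        return self.state, reward, done, {}

def simulate_a_game(env, counts, wins, losses):
    env_copy = deepcopy(env)
    first_move = None
    while True:
        move = random.choice(env_copy.validinputs)
        if first_move is None:
            first_move = move
        state, reward, done, info = env_copy.step(move)
        if done:
            counts[first_move] += 1
            # env.turn is the player to move in the position being evaluated
            agent_won = (reward == 1 and env.turn == 1) or \
                        (reward == -1 and env.turn == 2)
            if agent_won:
                wins[first_move] += 1
            else:
                # this stand-in game has no draws
                losses[first_move] += 1
            return counts, wins, losses

def best_move(counts, wins, losses):
    scores = {k: 0 if v == 0 else (wins[k] - losses[k]) / v
              for k, v in counts.items()}
    return max(scores, key=scores.get)

def naive_mcts(env, num_rollouts=100):
    if len(env.validinputs) == 1:
        return env.validinputs[0]
    counts = {m: 0 for m in env.validinputs}
    wins = dict(counts)
    losses = dict(counts)
    for _ in range(num_rollouts):
        simulate_a_game(env, counts, wins, losses)
    return best_move(counts, wins, losses)

random.seed(0)
env = TinyCoinEnv(coins=2)
print(naive_mcts(env))   # prints 2
```

With two coins left, taking both wins on the spot while taking one hands the last coin to the opponent, so every rollout that starts with action 2 is a win and every rollout that starts with action 1 is a loss.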

# 2. A Naive MCTS Player in the Coin Game
In this section, we play games with the naive MCTS algorithm we just developed and see how effective it is.

## 2.1. Play A Game Against the MCTS Agent
We'll play a coin game against the naive MCTS agent we developed in the last section.

In [10]:
from utils.ch08util import naive_mcts

state=env.reset() 
print(f"the current state is state={env.state}")

while True:
    action=int(input("Player 1, what's your move?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is state={env.state}")
    if done:
        print("Player 1 has won!") 
        break  
    action=naive_mcts(env,num_rollouts=100)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is state={env.state}")
    if done:
        print("Player 2 has won!") 
        break

the current state is state=21
Player 1, what's your move?1
Player 1 has chosen 1
the current state is state=20
Player 2 has chosen 2
the current state is state=18
Player 1, what's your move?2
Player 1 has chosen 2
the current state is state=16
Player 2 has chosen 1
the current state is state=15
Player 1, what's your move?1
Player 1 has chosen 1
the current state is state=14
Player 2 has chosen 1
the current state is state=13
Player 1, what's your move?2
Player 1 has chosen 2
the current state is state=11
Player 2 has chosen 1
the current state is state=10
Player 1, what's your move?1
Player 1 has chosen 1
the current state is state=9
Player 2 has chosen 2
the current state is state=7
Player 1, what's your move?1
Player 1 has chosen 1
the current state is state=6
Player 2 has chosen 2
the current state is state=4
Player 1, what's your move?1
Player 1 has chosen 1
the current state is state=3
Player 2 has chosen 1
the current state is state=2
Player 1, what's your move?1
Player 1 has cho

With 100 rollouts in each step, the naive MCTS agent is not very strong. But let's see whether it's better than a random player.

## 2.2. Effectiveness of the Naive MCTS Algorithm
We will let the naive MCTS agent, this time with 1,000 rollouts per move, play 100 games against a random player and see how many times it wins.

In [11]:
results=[]
for i in range(100):
    env = coin_game()
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        action = naive_mcts(env,num_rollouts=1000)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            results.append(1) 
            break  
        action = random.choice(env.validinputs)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            results.append(-1) 
            break  

In even-numbered games the random player moves first, and in odd-numbered games the MCTS agent moves first, so neither player has a first-mover advantage overall. We record a result of 1 if the MCTS agent wins and a result of -1 if the MCTS agent loses.

We now count how many times the naive MCTS agent has won and lost.

In [12]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")         

the MCTS agent has won 96 games
the MCTS agent has lost 4 games


The above results show that the naive MCTS agent is much better than the random player: it won 96 of the 100 games.