# Chapter 9: Upper Confidence Bounds in MCTS

You learned the basic idea behind Monte Carlo Tree Search (MCTS) in the last chapter. Instead of using a breadth-first approach such as depth pruning or alpha-beta pruning in minimax, MCTS uses a depth-first approach. An MCTS agent simulates a certain number of games all the way to the terminal state to see the game outcome. It then selects the best next move based on the simulation outcomes. 

In the last chapter, we used a naive move-selection approach. That is, when simulating games, the MCTS agent randomly selects the next move. This leads to some inefficiencies. For example, the same path may be explored multiple times while other paths may never be visited. In this chapter, you'll use the Upper Confidence Bounds for Trees (UCT) method to correct these inefficiencies and make the MCTS algorithm more powerful. 

***
$\mathbf{\text{Create a subfolder for files in Chapter 9}}$<br>
***
We'll put all files in Chapter 9 in a subfolder /files/ch09. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch09", exist_ok=True)

# 1. Upper Confidence Bounds for Trees (UCT)

This section introduces the Upper Confidence Bounds for Trees (UCT) method. We'll discuss the intuition behind it and the formula we use for move selection. 

## 1.1. The UCT Formula

The idea behind UCT MCTS is to select the next move based on a formula instead of randomly selecting a move. The Upper Confidence Bounds (UCB) formula is as follows:
$$\mathrm{UCB}=v_i+C\times \sqrt{\frac{\log N}{n_i}}$$

In the above formula, $v_i$ is the estimated value of choosing move $i$ next. $C$ is a constant that adjusts how much exploration one wants in move selection. $N$ is the total number of times the parent node has been visited, whereas $n_i$ is the number of times that move $i$ has been selected. 
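To make the formula concrete, here is a minimal sketch that evaluates it for three moves. The helper name ucb_score() and the statistics in *stats* are made up for illustration, and the sketch assumes the natural logarithm. As in the uct_select() function later in this chapter, a move that has never been tried gets an infinite score so that it is explored first.

```python
from math import log, sqrt

def ucb_score(value, parent_visits, child_visits, c=1.4):
    """UCB = v_i + C * sqrt(log(N) / n_i); unvisited moves score infinity."""
    if child_visits == 0:
        return float("inf")
    return value + c * sqrt(log(parent_visits) / child_visits)

# Hypothetical statistics: move -> (average value v_i, visit count n_i)
stats = {1: (0.5, 10), 2: (0.2, 2), 3: (0.0, 0)}
N = sum(n for _, n in stats.values())   # parent visits: 10 + 2 + 0 = 12

scores = {m: ucb_score(v, N, n) for m, (v, n) in stats.items()}
for m, s in scores.items():
    print(m, round(s, 3))
```

Move 2 has a lower average value than move 1, but it has been tried only twice, so its exploration bonus is larger and its UCB score ends up higher than that of move 1.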

## 1.2. Exploitation versus Exploration
The temperature constant, C, controls the balance between exploitation and exploration. When the value of C is large, the formula favors less-explored nodes so that the MCTS agent can examine all nodes and consider all possibilities. When C is small, the agent focuses on the most promising node so far. A commonly used theoretical value of C is $\sqrt{2}$, so you can set C to 1.4 as a starting point. 
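The effect of C can be seen in a small sketch. The two moves and their statistics below are hypothetical: move "a" has the higher average value and many visits, while move "b" looks worse so far but has been tried only a few times.

```python
from math import log, sqrt

# Hypothetical statistics: move -> (average value v_i, visit count n_i)
stats = {"a": (0.6, 50), "b": (0.4, 5)}
N = sum(n for _, n in stats.values())   # parent visits

def pick(c):
    """Return the move with the highest UCB score for a given C."""
    return max(stats,
               key=lambda m: stats[m][0] + c * sqrt(log(N) / stats[m][1]))

for c in (0.0, 0.2, 1.4, 5.0):
    print(c, pick(c))
```

With C at 0 or 0.2 the agent exploits and keeps choosing "a". With C at 1.4 or 5.0 the exploration bonus dominates and the agent switches to the less-visited "b".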

# 2. Implement UCT MCTS in Tic Tac Toe
We'll implement the UCT MCTS agent in Tic Tac Toe in this section.

## 2.1. The uct_simulate_ttt() Function
We first define a uct_select() function to implement the UCB formula so that a move is selected when simulating a game. Download the file ch09util.py from the book's GitHub repository and place it in the folder /Desktop/ai/utils/ on your computer. The function is defined in the file as follows:

In [2]:
from math import log, sqrt

def uct_select(env_copy,path,paths,temperature):
    # use uct to select move: gather recorded results for each child path
    parent=[]
    pathvs=[]
    for v in env_copy.validinputs:
        pathv=path+str(v)
        pathvs.append(pathv)
        for p in paths:
            if p[0]==pathv:
                parent.append(p)
    # calculate uct score for each action
    uct={}
    for pathv in pathvs:
        history=[p for p in parent if p[0]==pathv]
        if len(history)==0:
            uct[pathv]=float("inf")
        else:
            uct[pathv]=sum([p[1] for p in history])/len(history)+\
                temperature*sqrt(log(len(parent))/len(history))    
    move=max(uct,key=uct.get)
    move=int(move[-1])
    return move

We use a string variable *path* to denote the game history. For example, "54" means that the current player occupied cell 5 and the opponent then placed a game piece in cell 4. 
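The way uct_select() looks up the history of each candidate move can be sketched on its own. The path-result pairs and the legal moves below are hypothetical; only the matching logic mirrors the function above.

```python
# Hypothetical path-result pairs, in the format used by the list `paths`
paths = [("5", 1), ("54", 1), ("541", 1), ("5", -1), ("52", -1)]

path = "5"                       # moves made so far in this rollout
child_results = {}
for v in [1, 2, 4]:              # a few hypothetical legal next moves
    child = path + str(v)        # the path after also playing move v
    child_results[child] = [r for p, r in paths if p == child]

print(child_results)   # {'51': [], '52': [-1], '54': [1]}
```

The path "51" has no recorded results, so uct_select() would give it an infinite score and try it next.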

Once we know how to select moves, we can simulate a full Tic Tac Toe game using the UCT formula. The uct_simulate_ttt() function is defined in the file ch09util.py that you just downloaded. The function is defined as follows:

In [3]:
from copy import deepcopy

def uct_simulate_ttt(env,paths,counts,wins,losses,temperature):
    env_copy=deepcopy(env)
    actions=[]
    path=""
    # play a full game
    while True:
        uct_move=uct_select(env_copy,path,paths,temperature)
        #move=random.choice(env_copy.validinputs)
        move=deepcopy(uct_move)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        path += str(move)
        if done:
            result=0
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="X") or \
                (reward==-1 and env.turn=="O"):
                result=1
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="X") or \
                (reward==1 and env.turn=="O"):
                result=-1
                losses[actions[0]] += 1                
            break
    return result,path,counts,wins,losses

The function rolls out a game until a terminal game state is reached. Each move is selected using the uct_select() function we just defined. Depending on the game outcome, it updates the dictionaries *counts*, *wins*, and *losses* for the first move of the rollout. The function also returns the game outcome and the game path (that is, the moves made by the two players). 
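The two if-branches in uct_simulate_ttt() translate the reward into the point of view of the player who is about to move in the real game. The sketch below isolates that mapping. The helper name result_for_root() is hypothetical, and the reward convention (1 when X wins, -1 when O wins, 0 for a tie) is an assumption inferred from those branches.

```python
def result_for_root(reward, root_turn):
    """Map the reward to the root player's view: 1 win, -1 loss, 0 tie.

    Assumes reward is 1 when X wins, -1 when O wins, and 0 for a tie.
    """
    if reward == 0:
        return 0
    x_won = reward == 1
    return 1 if x_won == (root_turn == "X") else -1

for reward in (1, -1, 0):
    for turn in ("X", "O"):
        print(reward, turn, result_for_root(reward, turn))
```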

## 2.2. A UCT MCTS Agent for Tic Tac Toe
Next, the agent can roll out many games and select the best next move based on the game outcomes. To do that, the UCT MCTS agent needs to backpropagate the game outcome after each complete game. The backpropagate() function below accomplishes that: 

In [4]:
# backpropagate
def backpropagate(path,result,paths):
    while path != "":
        paths.append((path,result))
        path=path[:-1]

The backpropagate() function takes three arguments. The first argument, *path*, is a string that records the moves made by the two players. The second argument, *result*, indicates the game outcome (1 means the current player wins, -1 means the current player loses, and 0 means a tie game). The third argument, *paths*, is a list of path-result pairs. After each rollout, the backpropagate() function appends one path-result pair to *paths* for the full path and for each of its shorter prefixes, so every position visited in the rollout is credited with the outcome. 
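A quick check of this behavior, using a copy of the function and a hypothetical three-move path:

```python
def backpropagate(path, result, paths):
    # same logic as the function above: record every prefix of the path
    while path != "":
        paths.append((path, result))
        path = path[:-1]

paths = []
backpropagate("541", 1, paths)
print(paths)   # [('541', 1), ('54', 1), ('5', 1)]
```

One rollout of three moves therefore adds three entries to *paths*, one for each position along the way.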

After the agent finishes the fixed number of rollouts, it selects the best next move based on the numbers recorded in the three dictionaries *counts*, *wins*, and *losses*, as follows:

In [5]:
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    best_move=max(scores,key=scores.get)  
    return best_move

This function selects the best move based on the numbers in the three dictionaries. It calculates a score for each potential next move: the number of wins minus the number of losses, divided by the number of rollouts that started with that move. A move that was never tried gets a score of 0. 
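Here is a worked example of the scoring rule. The helper move_scores() is a hypothetical variant of best_move() that returns all the scores instead of only the winner, and the numbers in the three dictionaries are made up.

```python
def move_scores(counts, wins, losses):
    """Score each move as (wins - losses) / rollouts; untried moves score 0."""
    scores = {}
    for k, v in counts.items():
        if v == 0:
            scores[k] = 0
        else:
            scores[k] = (wins.get(k, 0) - losses.get(k, 0)) / v
    return scores

# Hypothetical rollout statistics for three first moves
counts = {1: 40, 5: 50, 9: 10}
wins = {1: 20, 5: 35, 9: 4}
losses = {1: 15, 5: 10, 9: 6}

scores = move_scores(counts, wins, losses)
print(scores)                          # {1: 0.125, 5: 0.5, 9: -0.2}
print(max(scores, key=scores.get))     # 5
```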

Finally, we define the uct_mcts_ttt() function.

In [6]:
def uct_mcts_ttt(env,num_rollouts=100,temperature=1.4):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    paths=[]    
    # roll out games
    for _ in range(num_rollouts):
        result,path,counts,wins,losses=uct_simulate_ttt(\
         env,paths,counts,wins,losses,temperature)      
        # backpropagate
        backpropagate(path,result,paths)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses) 
    return best_next_move

We set the default number of rollouts to 100. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dictionaries, *counts*, *wins*, and *losses*, to record the outcomes from simulated games. Once the rollouts are complete, the agent selects the best next move based on the simulation results. 

With 100 simulations at each step, the MCTS agent is not very strong. But let's see whether it's better than a random player.

# 3. Test the UCT MCTS Agent in Tic Tac Toe
In this section, we'll first play a game manually against the UCT MCTS agent in Tic Tac Toe. We then test 100 games and see how effective it is. 

## 3.1. Manually Play against the UCT MCTS Agent
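We let the UCT MCTS agent move first (as Player X) in Tic Tac Toe. We use the input() function to key in the moves for Player O. Note that the input prompt in the code addresses the human as "player 1" even though the human moves second. 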

In [7]:
from utils.ttt_simple_env import ttt
from utils.ch09util import uct_mcts_ttt

env=ttt()
state=env.reset() 
while True:
    action=uct_mcts_ttt(env,num_rollouts=1000)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{env.state.reshape(3,3)}")
    if done:
        if reward==1:
            print("Player 1 has won!") 
        else:
            print("Game over, it's a tie!")    
        break  
    action=int(input("What's your move, player 1?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{env.state.reshape(3,3)}")
    if done:
        print("Player 2 has won!") 
        break

Player X has chosen 5
the current state is 
[[0 0 0]
 [0 1 0]
 [0 0 0]]
What's your move, player 1?4
Player O has chosen 4
the current state is 
[[ 0  0  0]
 [-1  1  0]
 [ 0  0  0]]
Player X has chosen 1
the current state is 
[[ 1  0  0]
 [-1  1  0]
 [ 0  0  0]]
What's your move, player 1?9
Player O has chosen 9
the current state is 
[[ 1  0  0]
 [-1  1  0]
 [ 0  0 -1]]
Player X has chosen 3
the current state is 
[[ 1  0  1]
 [-1  1  0]
 [ 0  0 -1]]
What's your move, player 1?2
Player O has chosen 2
the current state is 
[[ 1 -1  1]
 [-1  1  0]
 [ 0  0 -1]]
Player X has chosen 7
the current state is 
[[ 1 -1  1]
 [-1  1  0]
 [ 1  0 -1]]
Player 1 has won!


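We set the number of rollouts to 1000, so it takes a few seconds for the UCT MCTS agent to make each move. In the game above, the agent creates a double attack with its third move (cell 3 threatens both the 1-2-3 row and the 3-5-7 diagonal) and wins the game. This suggests that the UCT MCTS agent plays well with 1000 rollouts. 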

## 3.2. Effectiveness of the Tic Tac Toe MCTS Agent
We will let the UCT MCTS agent play 100 games against a random player and see how many times it wins.

In [8]:
import random

results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info=env.step(action)
    while True:
        action=uct_mcts_ttt(env,num_rollouts=1000)  
        state, reward, done, info=env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=random.choice(env.validinputs)   
        state, reward, done, info=env.step(action)
        if done:
            # result is -1 if the random-move agent wins
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  

Half the time, the MCTS agent moves first so that no player has an advantage. We record a result of 1 if the UCT MCTS agent wins and a result of -1 if the UCT MCTS agent loses. 

We now count how many times the UCT MCTS agent has won and lost.

In [9]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game has tied {ties} times") 

the MCTS agent has won 99 games
the MCTS agent has lost 0 games
the game has tied 1 times


The above results show that the MCTS agent is much better than the random agent. 
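As a small aside, the three counts can also be gathered in one pass with collections.Counter. In the sketch below, the list *results* simply reproduces the outcome reported above (99 wins and 1 tie) instead of rerunning the games.

```python
from collections import Counter

# Reproduce the reported outcome: 99 wins, 0 losses, 1 tie
results = [1] * 99 + [0]

tally = Counter(results)            # a missing key such as -1 counts as 0
win_rate = tally[1] / len(results)

print(f"wins={tally[1]}, losses={tally[-1]}, ties={tally[0]}")
print(f"win rate: {win_rate:.2f}")
```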

# 4. A UCT MCTS Agent in Connect Four
Next, we create a UCT MCTS agent in Connect Four and see how effective it is. 

## 4.1. Create a Connect Four UCT MCTS Agent
Similar to the UCT MCTS agent for Tic Tac Toe, we create a UCT MCTS agent for Connect Four. In the local module ch09util.py, we define a couple of functions:

In [10]:
def uct_simulate_conn(env,paths,counts,wins,losses,temperature):
    env_copy=deepcopy(env)
    actions=[]
    path=""
    while True:
        uct_move=uct_select(env_copy,path,paths,temperature)
        move=deepcopy(uct_move)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        path += str(move)
        if done:
            result=0
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="red") or \
                (reward==-1 and env.turn=="yellow"):
                result=1
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="red") or \
                (reward==1 and env.turn=="yellow"):
                result=-1
                losses[actions[0]] += 1                
            break
    return result,path,counts,wins,losses

The uct_simulate_conn() function above simulates a Connect Four game until the terminal state. Note that it uses the same uct_select() function that we used for Tic Tac Toe to select the next move, so it applies the same UCT formula when deciding which child node to visit. 
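Reusing uct_select() works here because every Connect Four move is a single character when converted to a string: the function appends str(move) to the path and later recovers the move with int(move[-1]). The sketch below, with hypothetical moves, shows the round trip and why the same encoding would not carry over unchanged to a game with moves numbered 10 or higher.

```python
# Build a path from hypothetical Connect Four columns
path = ""
for move in [4, 1, 3]:
    path += str(move)
print(path, int(path[-1]))      # 413 3

# A hypothetical move numbered 12 takes two characters, so the last
# character no longer identifies the move
wide = "4" + str(12)
print(wide, int(wide[-1]))      # 412 2
```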

The uct_simulate_conn() function then updates the number of games simulated, as well as the number of wins and losses for the current player. 

Further, we define a uct_mcts_conn() function in the local module as follows:

In [11]:
def uct_mcts_conn(env,num_rollouts=100,temperature=1.4):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    paths=[]    
    # roll out games
    for _ in range(num_rollouts):
        result,path,counts,wins,losses=uct_simulate_conn(\
         env,paths,counts,wins,losses,temperature)      
        # backpropagate
        backpropagate(path,result,paths)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses) 
    return best_next_move

The default number of rollouts is 100. After each rollout, the function calls backpropagate() to record the outcome along the path. Finally, it uses the best_move() function to select the best next move based on the difference between the winning and losing probabilities for each next move. 

## 4.2. Play against the Connect Four MCTS Agent

In [12]:
from utils.conn_simple_env import conn
from utils.ch09util import uct_mcts_conn

# Initiate the game environment
env=conn()
state=env.reset() 
while True:
    action=int(input("What's your move, player red?"))
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{env.state.T[::-1]}")
    if done:
        print("Player red has won!") 
        break
    action=uct_mcts_conn(env,num_rollouts=500)
    print(f"Player {env.turn} has chosen {action}")    
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{env.state.T[::-1]}")
    if done:
        if reward!=0:
            print("Player yellow has won!") 
        else:
            print("Game over, it's a tie!")    
        break  

What's your move, player red?4
Player red has chosen 4
the current state is 
[[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0]]
Player yellow has chosen 1
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [-1  0  0  1  0  0  0]]
What's your move, player red?3
Player red has chosen 3
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [-1  0  1  1  0  0  0]]
Player yellow has chosen 1
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [-1  0  0  0  0  0  0]
 [-1  0  1  1  0  0  0]]
What's your move, player red?5
Player red has chosen 5
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [-1  0  0  0  0  0  0]
 [-1  0  1  1

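With 500 rollouts per move, the MCTS agent is not very strong against a human player. In the game above, red holds columns 3, 4, and 5 on the bottom row after three moves, with columns 2 and 6 both open, so yellow can block only one side and red can complete four in a row on the next move.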

## 4.3. Effectiveness of the Connect Four MCTS Agent
We will let the UCT MCTS agent play 100 games against a random player and see how many times it wins.

In [13]:
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        action=uct_mcts_conn(env,num_rollouts=100)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=random.choice(env.validinputs)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  

Half the time, the MCTS agent moves first so that no player has an advantage. We record a result of 1 if the MCTS agent wins and a result of -1 if the MCTS agent loses. 

We now count how many times the UCT MCTS agent has won and lost.

In [14]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game has tied {ties} times") 

the MCTS agent has won 100 games
the MCTS agent has lost 0 games
the game has tied 0 times


The above results show that the MCTS agent has won all 100 games.