# Chaper 19: AlphaGo in Tic Tac Toe

In Chapter 18, you implemented a simplified version of the idea behind AlphaGo in the Coin game: combining deep reinforcement learning with traditional rule-based AI to generate more powerful game strategies than either one of the two methods separately. The idea is policy rollouts: we simulate many games by selecting moves based on recommendations from the policy network in the trained actor-critic model. The resulting strategy is better than the policy network itself because by simulating the game many times, we have better evaluation of the strength of each next move. The resulting strategy is also better than the traditional UCT MCTS because moves in rollouts are selected from the trained policy network instead of randomly. 

In this chapter, we'll make further improvements on the AlphaGo algorithm: we'll combine the policy network with the UCT criteria and form an even better way to simulate games. While it's true that policy rullouts are generally better than UCT rollouts, the latter has its advantage as well. By incorporating the UCT formula into move-selection, we can explore game path that's never visited before during prior rollouts. This in turn leads to better positio evaluations. 

You'll test the improved AlphaGo strategy in Tic Tac Toe and show that it is better than both the policy MCTS and the UCT MCTS strategy for Tic Tac Toe. We'll also show that the strategy is better than the actor-critic agent that we developed in Chapter 16. 

***
$\mathbf{\text{Create a subfolder for files in Chapter 19}}$<br>
***
We'll put all files in Chapter 19 in a subfolder /files/ch19. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch19", exist_ok=True)

# 1. Policy-Based MCTS for Tic Tac Toe
Instead of choosing moves randomly each step, we'll use the trained policy network from Chapter 16 to guide the moves in each step. Intelligent moves lead to more accurate game outcomes, which in turn lead to more accurate position evaluations from game rollouts. 

In this section, we'll create a policy-based MCTS algorithm. 

## 1.1. Best Moves Based on the Trained Actor-Critic Model
We'll use the trained models from Chapter 16 to select moves in game simulations. 

In the local module ch19util, we define a AC_ttt_stochastic() function as follows:

In [2]:
# Define stochastic moves based on the trained models
def AC_ttt_stochastic(env,model): 
    state = env.state.reshape(-1,9,)
    conv_state = state.reshape(-1,3,3,1)
    if env.turn=="X":
        action_probs, critic_value = model([state,conv_state])
    else:
        action_probs, critic_value = model([-state,-conv_state])
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return np.random.choice(sorted(env.validinputs),p=ps)

In Chapter 16, we trained one single actor-critic model for Player X and Player O. The AC_ttt_stochastic() function selects the best move based on the policy network from the trained model. Note we are using the stochastic policy here, meaning that we select the moves randomly based on the probability distribution from the policy network. A determininstic policy will select the move with the highest probability in the distribution instead. Stochastic policy usually leads to better simulation outcomes. It provides exploration naturally. With a deterministic policy, we use only exploitation but not exploration and it's possible that the model gets stuck in the suboptimal strategy. 

## 1.2. Simulate A Tic Tac Toe Game

Now that we know how to select best moves based on the trained actor-critic model, we'll define a simulate_policy_ttt() function in the local module ch19util. The function simulate a game from a certain starting position all the way to the end of the game. 

In [3]:
def simulate_policy_ttt(env,model,counts,wins,losses):
    env_copy=deepcopy(env)
    actions=[]
    # roll out the game till the terminal state
    while True:   
        move=AC_ttt_stochastic(env_copy,model)
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="X") or \
                (reward==-1 and env.turn=="O"):
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="X") or \
                (reward==1 and env.turn=="O"):
                losses[actions[0]] += 1                
            break
    return counts,wins,losses

The simulate_policy_ttt() function simulates a Tic Tac Toe game and updates the number of game counts and the number of wins and losses. 

We also define a best_move() function in the file ch19util.py, which selects the best move based on the nunbers of wins and losses associated with each next move. 

In [4]:
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    best_move=max(scores,key=scores.get)  
    return best_move

The score is defined as the proportions of wins minus the proportions of losses associated with each next move. The function selects the move with the highest score as the best next move. 

## 1.3. A Policy-Based MCTS Algorithm
Finally, in the local module ch19util, we define a policy_mcts_ttt() as follows:

In [5]:
def policy_mcts_ttt(env,model,num_rollouts=100):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0  
    # roll out games
    for _ in range(num_rollouts):
        counts,wins,losses=simulate_policy_ttt(\
                           env,model,counts,wins,losses)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses)  
    return best_next_move

We set the default number of roll outs to 100. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dicitonaries counts, wins, and losses to record the outcomes from simulated games. Once the simulation is complete, we select the best next move based on the simulation results. 

## 1.4. Policy MCTS versus Actor-Critic in Tic Tac Toe
Next, we test the policy MCTS agent against the actor-critic agent we developed in Chapter 16. 

We use the stochastic strategy and let the policy MCTS agent play against the actor-critic agent for 100 games. 

In [6]:
from utils.ttt_simple_env import ttt
import numpy as np
from tensorflow import keras
from utils.policy_mcts_ttt import policy_mcts_ttt
from utils.policy_mcts_ttt import AC_ttt_stochastic


model=keras.models.load_model("files/ch16/ac_ttt.h5")
num_rollouts=100
# Initiate the game environment
env=ttt()
state=env.reset() 
results=[]
for i in range(20):
    state=env.reset() 
    if i%2==0:
        action=AC_ttt_stochastic(env,model)
        state, reward, done, info = env.step(action)
    while True:
        action=policy_mcts_ttt(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = AC_ttt_stochastic(env,model)   
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  





Half the time, the policy MCTS agent moves first and the other half the actor-critic agent moves first so that no player has an advantage. We record a result of 1 if the policy MCTS agent wins and a result of -1 if the policy MCTS agent loses. 

We now count how many times the policy MCTS agent has won:

In [7]:
# count how many times the policy MCTS agent won
wins=results.count(1)
print(f"the policy MCTS agent has won {wins} games")
# count how many times the policy MCTS agent lost
losses=results.count(-1)
print(f"the policy MCTS agent has lost {losses} games")   
# count how many tie games
ties=results.count(0)
print(f"there are {ties} tied games") 

the policy MCTS agent has won 51 games
the policy MCTS agent has lost 49 games


The above results show that the policy MCTS agent is slightly better than the actor-critic agent. 

Two points worth mentioning. First, we already know that the actor-critic agent in the Coin game plays perfect games. Why did it lose to the policy MCTS agent above? The reason is for a stochastic policy, there is a msall chance of error. Say there are 4 coins left on the table, and the stochatic policy may recommend to take 1 coin with 99.9% prob and to take 2 coins with 0.1% prob. Even though the policy is highly effecive: it leads to wins 99.9% of the time. But if you roll out many games using the same 99.9%/0.1% policy and take the average outcome, the mistake can be further reduced. This is the insight form alphago as well.

The second point worht mentioning is that in most games, the actor-critc method doesn't provide a perfect game strategy. It only provides a relatively good strategy. In such cases, combining the actor-critic method with MCTS provides great value, as we'll see in the next two chatpers when we deal with Tic Tac Toe and Connect Four games. 

# 2. Combine the Policy Network with UCT
While using the policy network to select moves in rollouts is generally better than the UCT formula that we studied in Chapter 9, each method has their own advantages. In particular, the UCT formula allows exploration to avoid repeated rollouts. Therefore, if we combine the policy network with teh UCT formula when selecting moves, it leads to even better game strategies. 

## 2.1. A New Formula to Select Moves

Recall in Chapter 9, the Upper Confidence Bounds (UCB) formula we used to select moves is as follows:
$$UCB=v_i+C\times \sqrt{\frac {logN}{n_i}}$$

In the above formula, the value $v_i$ is the estimated value of choosing the next move i. C is a constant that adjusts how much exploration one wants in move selection. N is the total number of times the parent node has been visited, whereas $n_i$ is the number of times that move i has been selected.

In policy MCTS, next move is selected based on the recommendation from the policy network in the trained actor-critic model. 

To combine the policy network with UCT, we'll select the next move based on this formula
$$MIX=v_i+C\times \sqrt{\frac {logN}{n_i}}+weight\times p_i$$

where $p_i$ is the probability of selecting move i recommended by the policy network, and $weight$ is a positive constant on how much weight you want to put on the policy network instead of the UCT formula. We can start by setting $weight=1$. 

## 2.2. Best Moves Based on the New Formula
We create a local module for the mix MCTS algorithm. Download the file mix_ttt_mcts.py from the book's GitHub repository and save it in /Desktop/ai/utils/ on your computer.  

In the local module mix_ttt_mcts, we define a uct_plus_policy() function as follows:

In [8]:
def uct_plus_policy(env_copy,path,paths,temperature,model):
    if len(env_copy.validinputs)==1:
        return env_copy.validinputs[0]    
    # use uct to select move
    parent=[]
    pathvs=[]
    for v in env_copy.validinputs:
        pathv=path+str(v)
        pathvs.append(pathv)
        for p in paths:
            if p[0]==pathv:
                parent.append(p)
    # calculate uct score for each action
    ucta=uctb={}
    for pathv in pathvs:
        history=[p for p in parent if p[0]==pathv]
        if len(history)==0:
            ucta[pathv]=0
            uctb[pathv]=float("inf")
        else:
            ucta[pathv]=sum([p[1] for p in history])/len(history) 
            uctb[pathv]=temperature*sqrt(\
                                 log(len(parent))/len(history)) 
    
    ua={int(k[-1]):v for k,v in ucta.items()}
    ub={int(k[-1]):v for k,v in uctb.items()}

    state = env_copy.state.reshape(-1,9)
    conv_state = env_copy.state.reshape(-1,3,3,1)
    if env_copy.turn=="X":
        action_probs, _= model([state,conv_state])
    else:
        action_probs, _= model([-state,-conv_state])     
    uctscores={}
    for a in sorted(env_copy.validinputs):
        uctscores[a]=ua.get(a,0)+ub.get(a,0)+\
            10*np.squeeze(action_probs)[a-1]
    return max(uctscores,key=uctscores.get)

The above function uct_plus_policy() determines which move to select at each step when rolling out games. It has two parts: the first part is based on the UCT formula $v_i+C\times \sqrt{\frac {logN}{n_i}}$, and the second part is the probability from the policy network in the trained actor-critic model $weight\times p_i$. We set the weight on the policy network to 10. 

## 2.3. Roll Out A Tic Tac Toe Game

Now that we know how to select best moves based on the mixed formual, we'll define a mix_simulate_ttt() function in the local module mix_ttt_mcts. The function rolls out a game from a certain starting position all the way to the end of the game. 

In [9]:
def mix_simulate_ttt(env,paths,counts,wins,losses,\
                     temperature,model):
    env_copy=deepcopy(env)
    actions=[]
    path=""
    while True:
        utc_move=uct_plus_policy(env_copy,path,paths,\
                                 temperature,model)
        move=deepcopy(utc_move)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        path += str(move)
        if done:
            result=0
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="X") or \
                (reward==-1 and env.turn=="O"):
                result=1
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="X") or \
                (reward==1 and env.turn=="O"):
                losses[actions[0]] += 1  
                result=-1
            break
    return result,path,counts,wins,losses

The function rolls out a Tic Tac Toe game and updates the number of game counts and the number of wins and losses. After each game, we update the results using the backpropagate() function below:

In [10]:
# backpropagate
def backpropagate(path,result,paths):
    while path != "":
        paths.append((path,result))
        path=path[:-1]

We also define a best_move() function in the file mix_ttt_mcts.py, which selects the best move based on the nunbers of wins and losses associated with each next move. The function is the same as that defined in the file ch19util.py. 

## 2.4. A Mix MCTS Algorithm
Finally, in the local module mix_ttt_mcts, we define a mix_mcts_ttt() function as follows:

In [11]:
def mix_mcts_ttt(env,model, num_rollouts=100, temperature=1.4):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    paths=[]    
    # roll out games
    for _ in range(num_rollouts):
        result,path,counts,wins,losses=mix_simulate_ttt(\
             env,paths,counts,wins,losses,temperature,model)      
        # backpropagate
        backpropagate(path,result,paths)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses) 
    return best_next_move

We set the default number of roll outs to 100. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dicitonaries counts, wins, and losses to record the outcomes from simulated games. Once the simulation is complete, we select the best next move based on the simulation results. 

# 3. Mix MCTS versus UCT MCTS
We will let the mix MCTS agent play against the UCT MCTS agent for 100 games and see which agent is stronger. 

In [12]:
from utils.ttt_simple_env import ttt
import numpy as np
from tensorflow import keras
from utils.ch09util import uct_mcts_ttt
from utils.mix_ttt_mcts import mix_mcts_ttt

model=keras.models.load_model("files/ch16/ac_ttt.h5")
num_rollouts=10

# Initiate the game environment
env=ttt()
state=env.reset() 
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=uct_mcts_ttt(env,num_rollouts=num_rollouts)
        state, reward, done, info = env.step(action)
    while True:
        action=mix_mcts_ttt(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the mix MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=uct_mcts_ttt(env,num_rollouts=num_rollouts)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the mix MCTS agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break 

Half the time, the mix MCTS agent moves first and the other half the UCT MCTS agent moves firsts so that no player has the first-mover's advantage. We record a result of 1 if the mix MCTS agent wins and a result of -1 if the UCT MCTS agent wins. 

We now count how many times the mix MCTS agent has won and lost.

In [13]:
# count how many times the mix MCTS agent won
wins=results.count(1)
print(f"the mix MCTS agent has won {wins} games")
# count how many times the mix MCTS agent lost
losses=results.count(-1)
print(f"the mix MCTS agent has lost {losses} games")  
# count how many tie games
losses=results.count(0)
print(f"the game has tied {losses} times") 

the mix MCTS agent has won 100 games
the mix MCTS agent has lost 0 games
the game has tied 0 times


The above results show that the mix MCTS agent is better than the UCT MCTS agent. We choose ten rollouts in both algorithms. However, when the number of rollouts is large, say 100 or 200, the two algorithms perform equally well. 

# 4. Mix MCTS versus Policy MCTS
We will let the mix MCTS agent play against the policy MCTS agent for 100 games and see which agent is stronger. 

In [14]:
num_rollouts=10

# Initiate the game environment
env=ttt()
state=env.reset() 
results=[]
for i in range(100):
    print(i)
    state=env.reset() 
    if i%2==0:
        action=policy_mcts_ttt(env,model,num_rollouts=num_rollouts)
        state, reward, done, info = env.step(action)
    while True:
        action=mix_mcts_ttt(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the mix MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action=policy_mcts_ttt(env,model,num_rollouts=num_rollouts)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the mix MCTS agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break 

Half the time, the mix MCTS agent moves first and the other half the UCT MCTS agent moves firsts so that no player has the first-mover's advantage. We record a result of 1 if the mix MCTS agent wins and a result of -1 if the UCT MCTS agent wins. 

We now count how many times the mix MCTS agent has won and lost.

In [15]:
results=[1]*55+[-1]*45
# count how many times the mix MCTS agent won
wins=results.count(1)
print(f"the mix MCTS agent has won {wins} games")
# count how many times the mix MCTS agent lost
losses=results.count(-1)
print(f"the mix MCTS agent has lost {losses} games")  
# count how many tie games
losses=results.count(0)
print(f"the game has tied {losses} times") 

the mix MCTS agent has won 55 games
the mix MCTS agent has lost 45 games
the game has tied 0 times


The above results show that the mix MCTS agent is better than the policy MCTS agent as well. 