# Chapter 18: Implement AlphaGo in the Coin Game

AlphaGo combined deep reinforcement learning (namely, the actor-critic method) with traditional rule-based AI (namely, the Monte Carlo Tree Search) to generate intelligent game strategies in Go. 

Now that you have learned how both MCTS and the actor-critic method work, you'll learn how to combine these two methods to create powerful game strategies in simple games. We'll start with the Coin game in this chapter. In the next two chapters, we'll apply the AlphaGo approach to Tic Tac Toe and Connect Four. 

In traditional MCTS, you evaluate the position of each game board by simulating many games from that point on, all the way to the terminal state in each game. You then look at the average outcome and use that to evaluate positions. The games are rolled out by choosing random moves. 

To combine MCTS with the actor-critic method, we no longer roll out games by picking random moves. Instead, games are rolled out based on the trained policy network from the actor-critic method that we discussed in Chapters 15 to 17. The improved roll-out policy leads to better position evaluations. That's the main insight from AlphaGo. 

The policy-MCTS agent is better than the actor-critic agent as well. The reason is that a stochastic policy always carries a small chance of error. Say there are 4 coins left on the table, and the stochastic policy recommends taking 1 coin with 99.9% probability and taking 2 coins with 0.1% probability. The policy is highly effective: it picks the winning move 99.9% of the time. But if you roll out many games using the same 99.9%/0.1% policy and take the average outcome, the chance of a mistake can be reduced further. This is another insight from AlphaGo.

So the policy MCTS agent is better than both the traditional MCTS agent and the policy network on its own. 
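
The error-reduction argument can be checked with a small simulation. The sketch below is not the chapter's trained model: `noisy_policy()` is a hypothetical stand-in that plays the good move except with probability `eps`, and it assumes the player who takes the last coin wins, so the good move leaves a multiple of 3. The error rate is exaggerated to 10% (instead of 0.1%) so that the difference shows up in a few thousand trials.

```python
import random

def noisy_policy(coins, rng, eps):
    """Hypothetical stand-in for a trained stochastic policy.

    Assumes the player who takes the last coin wins, so the good move
    leaves a multiple of 3. With probability eps it plays the other move.
    """
    good = coins % 3 if coins % 3 else 1   # no winning move when coins % 3 == 0
    bad = 3 - good
    if bad > coins:                        # only one legal move is left
        return good
    return bad if rng.random() < eps else good

def rollout(coins, rng, eps):
    """Play one game with the noisy policy on both sides.

    Returns (first_move, result); result is +1 if the player who moved
    first takes the last coin and -1 otherwise.
    """
    first_move, root_to_move = None, True
    while True:
        move = noisy_policy(coins, rng, eps)
        if first_move is None:
            first_move = move
        coins -= move
        if coins == 0:
            return first_move, (1 if root_to_move else -1)
        root_to_move = not root_to_move

def rollout_choice(coins, rng, eps, num_rollouts=50):
    """Pick the first move with the best average roll-out result."""
    counts, total = {1: 0, 2: 0}, {1: 0, 2: 0}
    for _ in range(num_rollouts):
        move, result = rollout(coins, rng, eps)
        counts[move] += 1
        total[move] += result
    scores = {m: (total[m] / counts[m] if counts[m] else 0) for m in counts}
    return max(scores, key=scores.get)

rng = random.Random(0)
eps, trials = 0.1, 2000   # eps is exaggerated so the effect is easy to see
# With 4 coins left, taking 1 coin is the good move
policy_errors = sum(noisy_policy(4, rng, eps) != 1 for _ in range(trials))
rollout_errors = sum(rollout_choice(4, rng, eps) != 1 for _ in range(trials))
print(f"policy alone: {policy_errors} mistakes in {trials} decisions")
print(f"with roll outs: {rollout_errors} mistakes in {trials} decisions")
```

Under these assumptions, the decision based on averaged roll outs should make far fewer mistakes than a single draw from the same policy.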

***
$\mathbf{\text{Create a subfolder for files in Chapter 18}}$<br>
***
We'll put all files in Chapter 18 in a subfolder /files/ch18. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch18", exist_ok=True)

# 1. A Traditional UCT MCTS Agent for the Coin Game
For comparison, we'll create an MCTS agent based on the UCT rule for the Coin game. The UCT rule is the same as the one we discussed earlier in the book, in Chapter 9, for Tic Tac Toe and Connect Four.

## 1.1. Select Moves Based on UCT Rule 

To save space, we'll create a local module ch18util to store all functions related to the UCT MCTS agent as well as the policy MCTS agent. Download the file ch18util.py from the book's GitHub repository and place it in /Desktop/ai/utils/ on your computer. 
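
The code in this chapter relies on the coin_game environment from utils.coin_simple_env, which is not reproduced here. If you want to experiment without that file, the class below is a minimal stand-in. Everything about it is an assumption inferred from how the chapter's code uses the environment: the name `MiniCoinGame` is hypothetical, the game starts with 21 coins, each player takes 1 or 2 coins, the player who takes the last coin wins, and the reward is 1 when Player 1 wins and -1 when Player 2 wins.

```python
class MiniCoinGame:
    """Hypothetical stand-in for coin_game; the real class may differ."""

    def __init__(self, start_coins=21):
        self.start_coins = start_coins
        self.reset()

    def reset(self):
        self.state = self.start_coins   # number of coins left on the table
        self.turn = 1                   # Player 1 moves first
        self.validinputs = [1, 2]
        return self.state

    def step(self, move):
        if move not in self.validinputs:
            raise ValueError(f"invalid move: {move}")
        self.state -= move
        done = self.state == 0
        reward = 0
        if done:
            # assumed convention: whoever takes the last coin wins
            reward = 1 if self.turn == 1 else -1
        else:
            self.turn = 2 if self.turn == 1 else 1
        self.validinputs = [m for m in (1, 2) if m <= self.state]
        return self.state, reward, done, {}

env = MiniCoinGame()
env.reset()
done = False
while not done:
    # both players always take one coin
    state, reward, done, info = env.step(1)
print(reward)   # 1: Player 1 makes moves 1, 3, ..., 21, including the last one
```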

In the local module ch18util, we define a uct_select() function as follows:

In [2]:
from utils.coin_simple_env import coin_game
import random
from copy import deepcopy
from math import log, sqrt

def uct_select(env_copy,path,paths,temperature):
    # use uct to select move
    parent=[]
    pathvs=[]
    for v in env_copy.validinputs:
        pathv=path+str(v)
        pathvs.append(pathv)
        for p in paths:
            if p[0]==pathv:
                parent.append(p)
    # calculate uct score for each action
    uct={}
    for pathv in pathvs:
        history=[p for p in parent if p[0]==pathv]
        if len(history)==0:
            uct[pathv]=float("inf")
        else:
            uct[pathv]=sum([p[1] for p in history])/len(history)+\
                temperature*sqrt(log(len(parent))/len(history))    
    move=max(uct,key=uct.get)
    move=int(move[-1])
    return move

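The agent uses the UCT rule to select moves. A move that has not been tried yet gets an infinite score, so every valid move is explored at least once. Otherwise, the score is the average result of the move plus an exploration bonus scaled by temperature. The function returns the move with the highest score. We'll simulate games based on the above move selections next.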
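
To make the scoring rule concrete, here is a small sketch that isolates the formula used inside uct_select(). The helper name `uct_score()` and its arguments are hypothetical; `results` holds the recorded outcomes for one candidate move and `n_total` is the number of recorded outcomes across all candidate moves at that node.

```python
from math import log, sqrt

def uct_score(results, n_total, temperature=1.4):
    """Average outcome plus an exploration bonus, as in uct_select()."""
    if len(results) == 0:
        return float("inf")   # untried moves are always explored first
    exploit = sum(results) / len(results)
    explore = temperature * sqrt(log(n_total) / len(results))
    return exploit + explore

# Made-up histories: both moves have only wins so far,
# but one has been tried once and the other nine times
rarely_tried = uct_score([1], n_total=10)
often_tried = uct_score([1] * 9, n_total=10)
print(rarely_tried > often_tried)   # True: fewer visits means a larger bonus
```

With temperature set to 0 the bonus vanishes and the score reduces to the average outcome, which is why a larger temperature favors exploration.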

## 1.2. Simulate A Game Based on UCT Rule

Now that we know how to select moves based on the UCT rule, we'll define a uct_simulate_coin() function in the local module ch18util. The function simulates a game from a certain starting position all the way to the end of the game. 

In [3]:
def uct_simulate_coin(env,paths,counts,wins,losses,temperature):
    env_copy=deepcopy(env)
    actions=[]
    path=""
    # play a full game
    while True:
        uct_move=uct_select(env_copy,path,paths,temperature)
        move=deepcopy(uct_move)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        path += str(move)
        if done:
            result=0
            counts[actions[0]] += 1
            if (reward==1 and env.turn==1) or \
                (reward==-1 and env.turn==2):
                result=1
                wins[actions[0]] += 1
            if (reward==-1 and env.turn==1) or \
                (reward==1 and env.turn==2):
                result=-1
                losses[actions[0]] += 1                
            break
    return result,path,counts,wins,losses

This function simulates one Coin game and updates the game count and the numbers of wins and losses for the first move played in that game. 
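
The two if-statements at the end of the game translate the environment's reward into a result from the point of view of the player who is about to move in the real game. The sketch below restates that logic in isolation. The helper name `result_for_root()` is hypothetical, and it assumes, as the checks in uct_simulate_coin() suggest, that the reward is 1 when Player 1 wins and -1 when Player 2 wins.

```python
def result_for_root(reward, root_turn):
    """Return 1 if the player to move at the root wins, -1 if that player loses.

    Assumes reward is 1 when Player 1 wins and -1 when Player 2 wins;
    root_turn is 1 or 2, the player to move in the real game.
    """
    if reward == 0:
        return 0                       # game not decided
    player1_won = reward == 1
    root_is_player1 = root_turn == 1
    return 1 if player1_won == root_is_player1 else -1

print(result_for_root(reward=1, root_turn=1))    # 1: Player 1 is at the root and wins
print(result_for_root(reward=1, root_turn=2))    # -1: Player 2 is at the root and loses
```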

After each game, we'll update the statistics using the backpropagate() function defined in the local module ch18util, as follows:

In [4]:
# backpropagate
def backpropagate(path,result,paths):
    while path != "":
        paths.append((path,result))
        path=path[:-1]
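
As a quick illustration of what ends up in paths, the sketch below repeats the function and feeds it two made-up games. Each path is the string of moves played so far, and the result is recorded once for the full path and once for every shorter prefix.

```python
def backpropagate(path, result, paths):
    # Same logic as above: record the result for the path and all its prefixes
    while path != "":
        paths.append((path, result))
        path = path[:-1]

paths = []
backpropagate("121", 1, paths)    # made-up game: moves 1, 2, 1, recorded as a win
backpropagate("12", -1, paths)    # made-up game: moves 1, 2, recorded as a loss
print(paths)
# [('121', 1), ('12', 1), ('1', 1), ('12', -1), ('1', -1)]

# All results recorded for the first move "1"
visits_to_1 = [r for p, r in paths if p == "1"]
print(visits_to_1)   # [1, -1]
```

These per-path records are what uct_select() filters when it computes the average result and visit count of each candidate move.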

We also define a best_move() function in ch18util.py, to select the best move based on the number of games, the number of wins and losses, like so:

In [5]:
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    best_move=max(scores,key=scores.get)  
    return best_move

The function looks at all valid next moves and calculates a score for each move: the win rate minus the loss rate among the games that started with that move. A move with no simulated games gets a score of 0. The function returns the move with the highest score. 
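
Here is a short worked example of the scoring rule, using made-up tallies rather than real simulation output. The function body is a compact restatement of best_move() so that the snippet runs on its own.

```python
def best_move(counts, wins, losses):
    # Same scoring rule as the chapter's function: (wins - losses) / games
    scores = {}
    for k, v in counts.items():
        scores[k] = 0 if v == 0 else (wins.get(k, 0) - losses.get(k, 0)) / v
    return max(scores, key=scores.get)

# Made-up tallies for illustration only
counts = {1: 60, 2: 40}
wins = {1: 45, 2: 10}
losses = {1: 15, 2: 30}
print(best_move(counts, wins, losses))
# 1, since (45-15)/60 = 0.5 beats (10-30)/40 = -0.5

# An untried move scores 0, so it outranks a move with more losses than wins
print(best_move({1: 10, 2: 0}, {1: 4, 2: 0}, {1: 6, 2: 0}))   # 2
```

The second call shows a side effect of scoring untried moves as 0: with very few roll outs, a move that was never simulated can be preferred over one with a negative score.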

## 1.3. A UCT-Based MCTS Algorithm
Further, we define a uct_mcts_coin() function in the local module as follows:

In [6]:
def uct_mcts_coin(env, num_rollouts=100, temperature=1.4):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    paths=[]    
    # roll out games
    for _ in range(num_rollouts):
        result,path,counts,wins,losses=uct_simulate_coin(\
         env,paths,counts,wins,losses,temperature)      
        # backpropagate
        backpropagate(path,result,paths)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses) 
    return best_next_move

We set the default number of roll outs to 100. You can change it to a different number when calling the function. We create three dictionaries, counts, wins, and losses, to record the outcomes from simulated games. Once all roll outs are complete, we select the best next move based on the simulation results.

The default value of temperature, which governs exploration versus exploitation, is set to 1.4. You can also change that based on the situation.

# 2. Test UCT MCTS in the Coin Game
We'll test the UCT MCTS agent against random moves. 


We will play 100 games and see how many times it wins.

In [7]:
from utils.coin_simple_env import coin_game
from utils.ch18util import uct_mcts_coin
import random

env=coin_game()
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=random.choice(env.validinputs)
        state, reward, done, info = env.step(action)
    while True:
        action = uct_mcts_coin(env,num_rollouts=100) 
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            results.append(1)    
            break  
        action = random.choice(env.validinputs)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            results.append(-1)   
            break   





Half the time, the MCTS agent moves first so that no player has an advantage. We record a result of 1 if the MCTS agent wins and a result of -1 if the MCTS agent loses. 
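
The same test loop is reused several times in this chapter with different pairs of agents, so it can be wrapped in a helper. The sketch below is one possible way to do that, not code from ch18util: `play_match()`, `TinyCoinGame`, `random_agent()`, and `rule_agent()` are all hypothetical names, and the tiny environment assumes 21 coins, moves of 1 or 2 coins, and a win for whoever takes the last coin.

```python
import random

class TinyCoinGame:
    """Hypothetical, stripped-down stand-in for coin_game."""
    def __init__(self, coins=21):
        self.state = coins
        self.validinputs = [1, 2]

    def step(self, move):
        self.state -= move
        self.validinputs = [m for m in (1, 2) if m <= self.state]
        done = self.state == 0
        return self.state, 0, done, {}   # the reward is unused in this sketch

def random_agent(env):
    return random.choice(env.validinputs)

def rule_agent(env):
    # Leave a multiple of 3 when possible; otherwise move randomly
    return env.state % 3 or random.choice(env.validinputs)

def play_match(agent, opponent, n_games=100):
    """Results from the agent's side: 1 for a win, -1 for a loss.

    As in the loop above, the opponent moves first in even-numbered games
    and whoever makes the final move wins.
    """
    results = []
    for i in range(n_games):
        env = TinyCoinGame()
        agent_to_move = (i % 2 == 1)
        while True:
            mover = agent if agent_to_move else opponent
            _, _, done, _ = env.step(mover(env))
            if done:
                results.append(1 if agent_to_move else -1)
                break
            agent_to_move = not agent_to_move
    return results

random.seed(0)
results = play_match(rule_agent, random_agent)
print(results.count(1), "wins,", results.count(-1), "losses")
```

Any function that takes the environment and returns a move can be passed in as `agent` or `opponent`, which keeps the bookkeeping of who moves first in one place.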

We now count how many times the MCTS agent has won and lost.

In [8]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the MCTS agent has lost {losses} games") 

the MCTS agent has won 87 games
the MCTS agent has lost 13 games


The above results show that the MCTS agent beats the random player in 87 out of 100 games.

# 3. Policy-Based MCTS in the Coin Game
Instead of choosing moves randomly in each step, we'll use the trained policy network from Chapter 15 to guide the moves in each step. Intelligent moves lead to more accurate game outcomes, which in turn lead to more accurate position evaluations from game roll outs. 

In this section, we'll create a policy-based MCTS algorithm in the Coin game. 

## 3.1. Best Moves Based on the Policy Network
We'll use the trained actor-critic model from Chapter 15 to select moves in the Coin game simulations. 

In the local module ch18util, we define a DL_stochastic() function as follows:

In [9]:
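import numpy as np
# Keras as bundled with TensorFlow (adjust if you use standalone Keras)
from tensorflow import keras

def onehot_encoder(state):
    # one-hot encode the game state as a 1x22 vector
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot
# Load the trained models from Chapter 15
model=keras.models.load_model("files/ch15/ac_coin.h5")
# Define stochastic moves based on the trained models
def DL_stochastic(env): 
    state = env.state
    onehot_state = onehot_encoder(state)
    # the model returns the actor's move probabilities and the critic's value
    action_probs, _ = model(onehot_state)
    # draw move 1 or 2 according to the policy's probabilities
    return np.random.choice([1,2], p=np.squeeze(action_probs))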

In Chapter 15, we trained a deep reinforcement learning model with two networks: a value network (critic) and a policy network (actor). The DL_stochastic() function selects a move based on the policy network from the trained model. Note we are using the stochastic policy here, meaning we draw the move randomly according to the probability distribution produced by the policy network.

A deterministic policy instead selects the move with the highest probability in the distribution. A stochastic policy usually leads to better simulation outcomes. We define a deterministic strategy in the local module ch18util as well:

In [10]:
def DL_deterministic(env): 
    state = env.state
    onehot_state = onehot_encoder(state)
    # only the actor's move probabilities are needed here
    action_probs, _ = model(onehot_state)
    # argmax returns index 0 or 1, so add 1 to get the move (1 or 2 coins)
    return np.argmax(np.squeeze(action_probs))+1

Later we'll use both the stochastic and deterministic versions of the strategy.
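
The difference between the two strategies can be seen without loading the trained model. The sketch below uses a made-up probability vector in place of the policy network's output, and the standard library's random.choices() in place of np.random.choice() so that it runs on its own. The helper names `sample_move()` and `greedy_move()` are hypothetical.

```python
import random

def sample_move(action_probs, rng):
    """Stochastic policy: draw move 1 or 2 in proportion to its probability."""
    return rng.choices([1, 2], weights=action_probs, k=1)[0]

def greedy_move(action_probs):
    """Deterministic policy: always take the most probable move."""
    best_index = max(range(len(action_probs)), key=lambda i: action_probs[i])
    return best_index + 1   # index 0 or 1 maps to taking 1 or 2 coins

probs = [0.7, 0.3]          # made-up policy output, not from the trained model
rng = random.Random(0)
samples = [sample_move(probs, rng) for _ in range(10_000)]
share_of_move_1 = samples.count(1) / len(samples)

print(greedy_move(probs))            # 1, every time
print(round(share_of_move_1, 2))     # close to 0.7, with move 2 chosen the rest of the time
```

The deterministic version never plays the less likely move, while the stochastic version plays it roughly as often as the policy's probability says.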

## 3.2. Simulate A Game Based on the Policy Network

Now that we know how to select best moves based on the trained actor-critic model, we'll define a simulate_policy_coin() function in the local module ch18util. The function simulates a game from a certain starting position all the way to the end of the game. 

In [11]:
def simulate_policy_coin(env,counts,wins,losses):
    env_copy=deepcopy(env)
    actions=[]
    # roll out the game till the terminal state
    while True:   
        move=DL_stochastic(env_copy)
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            counts[actions[0]] += 1
            if (reward==1 and env.turn==1) or \
                (reward==-1 and env.turn==2):
                wins[actions[0]] += 1
            if (reward==-1 and env.turn==1) or \
                (reward==1 and env.turn==2):
                losses[actions[0]] += 1                
            break
    return counts,wins,losses

This function simulates one Coin game and updates the game count and the numbers of wins and losses for the first move played in that game. 

## 3.3. A Policy-Based MCTS Algorithm
Further, we define a policy_mcts_coin() function in the local module as follows:

In [12]:
def policy_mcts_coin(env, num_rollouts=100):
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0  
    # roll out games
    for _ in range(num_rollouts):
        counts,wins,losses=simulate_policy_coin(\
                           env,counts,wins,losses)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses)  
    return best_next_move

We set the default number of roll outs to 100. You can change it to a different number when calling the function. We create three dictionaries, counts, wins, and losses, to record the outcomes from simulated games. Once all game roll outs are complete, we select the best next move based on the simulation results. 

# 4. The Effectiveness of the Policy MCTS Agent
We'll compare the policy-based MCTS algorithm with the UCT MCTS algorithm in the Coin game. We'll show that the former is much stronger than the latter, with a fixed number of roll outs in both algorithms. We'll also show that the policy-based MCTS algorithm is better than the actor-critic agent as well. 

## 4.1. Policy MCTS versus UCT MCTS
We will let the two MCTS agents play 100 games against each other and see how many times each algorithm wins. 

In [13]:
from utils.ch18util import policy_mcts_coin

env=coin_game()
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action=uct_mcts_coin(env,num_rollouts=100)
        state, reward, done, info = env.step(action)
    while True:
        action = policy_mcts_coin(env,num_rollouts=100) 
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the policy MCTS agent wins
            results.append(1)    
            break  
        action = uct_mcts_coin(env,num_rollouts=100)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the policy MCTS agent loses
            results.append(-1)   
            break  

Half the time, the UCT MCTS agent moves first and the other half the policy MCTS agent moves first so that no player has an advantage. We record a result of 1 if the policy MCTS agent wins and a result of -1 if the UCT MCTS agent wins. 

We now count how many times the policy MCTS agent has won and how many times the UCT MCTS agent has won:

In [14]:
wins=results.count(1)
print(f"the policy MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the policy MCTS agent has lost {losses} games")   

the policy MCTS agent has won 100 games
the policy MCTS agent has lost 0 games


The above results show that the policy MCTS agent has won all 100 games. This indicates that the policy MCTS algorithm is much better than the UCT MCTS algorithm. 

## 4.2. Policy MCTS versus the Actor-Critic Agent
We'll compare the policy-based MCTS algorithm with the moves recommended by the actor-critic model. We use the stochastic strategy and let the policy MCTS agent play against the actor-critic agent for 100 games. 

In [15]:
from utils.ch18util import DL_stochastic

env=coin_game()
results=[]
for i in range(100):
    state=env.reset() 
    if i%2==0:
        action= DL_stochastic(env)
        state, reward, done, info = env.step(action)
    while True:
        action = policy_mcts_coin(env,num_rollouts=100) 
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the policy MCTS agent wins
            results.append(1)    
            break  
        action =  DL_stochastic(env)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the policy MCTS agent loses
            results.append(-1)   
            break  

Half the time, the policy MCTS agent moves first and the other half the actor-critic agent moves first so that no player has an advantage. We record a result of 1 if the policy MCTS agent wins and a result of -1 if the policy MCTS agent loses. 

We now count how many times the policy MCTS agent has won:

In [16]:
# count how many times the MCTS agent won
wins=results.count(1)
print(f"the policy MCTS agent has won {wins} games")
# count how many times the MCTS agent lost
losses=results.count(-1)
print(f"the policy MCTS agent has lost {losses} games")  

the policy MCTS agent has won 51 games
the policy MCTS agent has lost 49 games


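The above results show that the policy MCTS agent wins slightly more often than the actor-critic agent (51 games versus 49), although a margin this small over 100 games is well within the range of chance. 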
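
To get a rough sense of how much weight a win count out of 100 games can carry, you can compute how likely it would be if the two players were evenly matched. The sketch below treats each game as an independent fair coin flip, which is a simplifying assumption (the games alternate who moves first), and the helper name `prob_at_least()` is hypothetical.

```python
from math import comb

def prob_at_least(wins, games, p=0.5):
    """Probability of at least `wins` successes in `games` independent trials."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

p51 = prob_at_least(51, 100)   # the 51-49 result in this section
p87 = prob_at_least(87, 100)   # the 87-13 result against random moves
print(f"{p51:.3f}")            # 0.460
print(f"{p87:.1e}")
```

Under this assumption, 51 or more wins would happen almost half the time between equal players, whereas 87 or more wins would be extremely unlikely.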

Two points are worth mentioning. First, we already know that the actor-critic agent in the Coin game plays perfect games. Why did it lose to the policy MCTS agent above? The reason is that a stochastic policy always carries a small chance of error. Say there are 4 coins left on the table, and the stochastic policy recommends taking 1 coin with 99.9% probability and taking 2 coins with 0.1% probability. The policy is highly effective: it picks the winning move 99.9% of the time. But if you roll out many games using the same 99.9%/0.1% policy and take the average outcome, the chance of a mistake can be reduced further. This is another insight from AlphaGo.

The second point worth mentioning is that in most games, the actor-critic method doesn't provide a perfect game strategy. It only provides a relatively good strategy. In such cases, combining the actor-critic method with MCTS provides great value, as we'll see in the next two chapters when we deal with Tic Tac Toe and Connect Four. 