# Chapter 20: AlphaGo in Connect Four: Self Play

The AlphaGo algorithm combines deep reinforcement learning (namely, the actor-critic method) with traditional rule-based AI (namely, the Monte Carlo Tree Search) to generate intelligent game strategies in Go. In the last two chapters, we have applied this technique to the Coin game and Tic Tac Toe and created perfect or nearly perfect game strategies. 

In this chapter, you'll apply the same techniques to Connect Four. You'll use self play to make your actor-critic model much stronger. A stronger actor-critic model will lead to a better policy network, which in turn leads to better rollout policies in MCTS, hence better game strategies. 

By now, you have probably figured out that in reinforcement learning, the stronger the opponent, the stronger the trained agent. Since we have rule-based agents that play perfectly in both the Coin game and Tic Tac Toe, we were able to train perfect reinforcement learning agents (hence AlphaGo agents) in these two games. In Connect Four, in contrast, building a perfect agent using rule-based AI is too costly, so the trained reinforcement learning agent in Connect Four is not as strong. To make the RL agent stronger, we let it play against a slightly stronger version of itself through self play. After many rounds of training, the RL agent becomes much stronger than when it started, and that's the idea behind self play.

***
$\mathbf{\text{Create a subfolder for files in Chapter 20}}$<br>
***
We'll put all files in Chapter 20 in a subfolder /files/ch20. Run the code in the cell below to create the subfolder.

***

In [1]:
import os

os.makedirs("files/ch20", exist_ok=True)

# 1. Policy-Based MCTS for Connect Four
Instead of choosing moves randomly each step, we'll use the trained policy network from Chapter 17 to guide the moves in each step. Intelligent moves lead to more accurate game outcomes, which in turn lead to more accurate position evaluations from game rollouts. 

In this section, we'll create a policy-based MCTS algorithm in Connect Four. 

## 1.1. Best Moves from the Trained Actor-Critic Model
We'll use the trained actor-critic model from Chapter 17 to select moves in game rollouts in MCTS. 

In the local module policy_mcts_conn, we define an AC_conn_stochastic() function as follows:

In [2]:
# Define stochastic moves based on the trained models
def AC_conn_stochastic(env,model): 
    state = env.state.reshape(-1,42)
    if env.turn=="red":
        action_probs, critic_value = model(state)
    else:
        action_probs, critic_value = model(-state)
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return np.random.choice(sorted(env.validinputs),p=ps)

In Chapter 17, we trained one single actor-critic model for both the red and yellow players. The AC_conn_stochastic() function selects a move based on the policy network from the trained model. Note that we are using the stochastic policy here, meaning that we select the moves randomly based on the probability distribution from the policy network. A deterministic policy would select the move with the highest probability in the distribution instead. A stochastic policy usually leads to better simulation outcomes because it provides exploration naturally. With a deterministic policy, we use only exploitation but not exploration, and it's possible that the model gets stuck in a suboptimal strategy. 
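The difference between the two policies can be sketched with a toy probability vector. The snippet below is an illustration only: the probabilities and the list of valid columns are made-up values, and renormalize(), pick_stochastic(), and pick_deterministic() are hypothetical helpers, not functions from the local module.

```python
import random

# Hypothetical policy output for columns 1-7 (made-up numbers)
action_probs = [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05]
# Suppose column 4 is full, so it is no longer a valid move
valid = [1, 2, 3, 5, 6, 7]

def renormalize(probs, valid):
    # keep only the valid columns and rescale so the weights sum to 1
    kept = [probs[a - 1] for a in valid]
    total = sum(kept)
    return [p / total for p in kept]

def pick_stochastic(probs, valid, rng):
    # sample a move in proportion to its renormalized probability
    ps = renormalize(probs, valid)
    return rng.choices(valid, weights=ps, k=1)[0]

def pick_deterministic(probs, valid):
    # always pick the valid move with the highest probability
    return max(valid, key=lambda a: probs[a - 1])

rng = random.Random(0)
samples = [pick_stochastic(action_probs, valid, rng) for _ in range(1000)]
print(sorted(set(samples)))  # the stochastic policy visits several columns
print(pick_deterministic(action_probs, valid))  # always the same column
```

The stochastic picker spreads its choices over all valid columns, whereas the deterministic picker returns the same column every time it sees the same probabilities.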

## 1.2. Roll Out A Connect Four Game

Now that we know how to select best moves based on the trained actor-critic model, we'll define a simulate_policy_conn() function in the local module policy_mcts_conn. The function rolls out a game from a certain starting position all the way to the end of the game. 

In [3]:
def simulate_policy_conn(env,model,counts,wins,losses):
    env_copy=deepcopy(env)
    actions=[]
    # roll out the game till the terminal state
    while True:   
        move=AC_conn_stochastic(env_copy,model)
        actions.append(deepcopy(move))
        state,reward,done,info=env_copy.step(move)
        if done:
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="red") or \
                (reward==-1 and env.turn=="yellow"):
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="red") or \
                (reward==1 and env.turn=="yellow"):
                losses[actions[0]] += 1                
            break
    return counts,wins,losses

The simulate_policy_conn() function rolls out a Connect Four game and updates the number of game counts and the number of wins and losses. 

We also define a best_move() function in the file policy_mcts_conn.py, which selects the best move based on the numbers of wins and losses associated with each next move. 

In [4]:
def best_move(counts,wins,losses):
    # See which action is most promising
    scores={}
    for k,v in counts.items():
        if v==0:
            scores[k]=0
        else:
            scores[k]=(wins.get(k,0)-losses.get(k,0))/v
    best_move=max(scores,key=scores.get)  
    return best_move

The score is defined as the proportion of wins minus the proportion of losses associated with each next move. The function selects the move with the highest score as the best next move. 
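As a quick worked example of this score, the sketch below applies the same formula to made-up tallies for three candidate columns. The helper move_scores() and all the numbers are hypothetical and only serve to show the arithmetic.

```python
# Made-up tallies for three candidate moves (columns 3, 4, and 5)
counts = {3: 10, 4: 20, 5: 0}
wins = {3: 6, 4: 9, 5: 0}
losses = {3: 2, 4: 10, 5: 0}

def move_scores(counts, wins, losses):
    # score = win proportion minus loss proportion; 0 if never tried
    scores = {}
    for move, n in counts.items():
        if n == 0:
            scores[move] = 0.0
        else:
            scores[move] = (wins[move] - losses[move]) / n
    return scores

scores = move_scores(counts, wins, losses)
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Note that a move that was never tried gets a score of 0, so it ranks above any move with more losses than wins.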

## 1.3. Policy MCTS in Connect Four
Finally, in the local module policy_mcts_conn, we define a policy_mcts_conn() function as follows:

In [5]:
def policy_mcts_conn(env,model,num_rollouts=50):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0  
    # roll out games
    for _ in range(num_rollouts):
        counts,wins,losses=simulate_policy_conn(\
                           env,model,counts,wins,losses)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses)  
    return best_next_move

We set the default number of rollouts to 50. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dictionaries, counts, wins, and losses, to record the outcomes from simulated games. Once the simulation is complete, we select the best next move based on the simulation results. 
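The same three steps (roll out, tally, pick the best score) can be tried without the Connect Four environment or a trained model. The sketch below is a simplified stand-in under stated assumptions: it uses a toy take-away game (players alternately take 1 or 2 coins, and whoever takes the last coin wins) and a uniformly random rollout policy in place of the policy network. The names legal_moves(), rollout(), and flat_mc_search() are hypothetical.

```python
import random

def legal_moves(pile):
    # a player may take 1 or 2 coins, but not more than remain
    return [m for m in (1, 2) if m <= pile]

def rollout(pile, rng):
    """Play random moves until the pile is empty.

    Returns the first move and whether the player who made it won.
    """
    first_move = None
    mover_is_root = True  # True when the root player is about to move
    while True:
        move = rng.choice(legal_moves(pile))
        if first_move is None:
            first_move = move
        pile -= move
        if pile == 0:
            # whoever takes the last coin wins
            return first_move, mover_is_root
        mover_is_root = not mover_is_root

def flat_mc_search(pile, num_rollouts=50, seed=0):
    moves = legal_moves(pile)
    # skip the search when only one move is available
    if len(moves) == 1:
        return moves[0]
    rng = random.Random(seed)
    counts = {m: 0 for m in moves}
    wins = {m: 0 for m in moves}
    losses = {m: 0 for m in moves}
    for _ in range(num_rollouts):
        first_move, root_won = rollout(pile, rng)
        counts[first_move] += 1
        if root_won:
            wins[first_move] += 1
        else:
            losses[first_move] += 1
    # score = win proportion minus loss proportion
    scores = {m: (wins[m] - losses[m]) / counts[m] if counts[m] else 0
              for m in moves}
    return max(scores, key=scores.get)

print(flat_mc_search(4, num_rollouts=2000))
```

With four coins left, taking one coin wins more random rollouts than taking two, so the search prefers it. Replacing the random rollout policy with a trained policy network is what turns this kind of search into the policy MCTS described above.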

## 1.4. Test the Policy MCTS in Connect Four
Next, we test the policy MCTS agent against the actor-critic agent we developed in Chapter 17. 

We use the stochastic strategy and let the policy MCTS agent play against the actor-critic agent for 20 games. 

In [6]:
from utils.conn_simple_env import conn
import numpy as np
from tensorflow import keras
from utils.policy_mcts_conn import policy_mcts_conn
from utils.policy_mcts_conn import AC_conn_stochastic


model=keras.models.load_model("files/ch17/ac_conn.h5")
num_rollouts=200
# Initiate the game environment
env=conn()
state=env.reset() 
results=[]
for i in range(20):
    state=env.reset() 
    if i%2==0:
        action=AC_conn_stochastic(env,model)
        state, reward, done, info = env.step(action)
    while True:
        action=policy_mcts_conn(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = AC_conn_stochastic(env,model)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  

Half the time, the policy MCTS agent moves first and the other half the actor-critic agent moves first so that no player has an advantage. We record a result of 1 if the policy MCTS agent wins and a result of -1 if the policy MCTS agent loses. 

We now count how many times the policy MCTS agent has won:

In [7]:
# count how many times the policy MCTS agent won
wins=results.count(1)
print(f"the policy MCTS agent has won {wins} games")
# count how many times the policy MCTS agent lost
losses=results.count(-1)
print(f"the policy MCTS agent has lost {losses} games")   
# count how many tie games
ties=results.count(0)
print(f"there are {ties} tied games") 

the policy MCTS agent has won 11 games
the policy MCTS agent has lost 9 games
there are 0 tied games


The above results suggest that the policy MCTS agent is slightly better than the actor-critic agent, although 20 games is a small sample. 

# 2. A Mix MCTS Algorithm in Connect Four
While using the policy network to select moves in rollouts is generally better than the UCT formula that we studied in Chapter 9, each method has its own advantages. In particular, the UCT formula allows exploration to avoid repeated rollouts. Therefore, if we combine the policy network with the UCT formula when selecting moves in rollouts, it leads to even better game strategies in Connect Four. 

## 2.1. A Mixed Formula to Select Moves

Recall in Chapter 9, the Upper Confidence Bounds (UCB) formula we used to select moves is as follows:
$$UCB=v_i+C\times \sqrt{\frac{\log N}{n_i}}$$

In the above formula, the value $v_i$ is the estimated value of choosing the next move i. C is a constant that adjusts how much exploration one wants in move selection. N is the total number of times the parent node has been visited, whereas $n_i$ is the number of times that move i has been selected.

In policy MCTS, the next move is selected based on the recommendation from the policy network in the trained actor-critic model. 

To combine the policy network with UCT, we'll select the next move based on this formula:
$$MIX=v_i+C\times \sqrt{\frac{\log N}{n_i}}+weight\times p_i$$

where $p_i$ is the probability of selecting move i recommended by the policy network, and $weight$ is a positive constant on how much weight you want to put on the policy network instead of the UCT formula. We can start by setting $weight=1$. 
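To make the mixed formula concrete, the sketch below evaluates it for two moves with made-up statistics. The function mix_score() is a hypothetical helper written for this illustration; it assumes C=1.4 and weight=1, and it gives an unvisited move an infinite score so that every move is tried at least once.

```python
from math import log, sqrt

def mix_score(v_i, n_i, N, p_i, C=1.4, weight=1.0):
    # an unvisited move gets an infinite exploration bonus
    if n_i == 0:
        return float("inf")
    exploration = C * sqrt(log(N) / n_i)
    return v_i + exploration + weight * p_i

# Made-up statistics for two moves after N = 20 visits to the parent
N = 20
score_a = mix_score(v_i=0.30, n_i=15, N=N, p_i=0.10)
score_b = mix_score(v_i=0.20, n_i=5, N=N, p_i=0.40)
print(round(score_a, 3), round(score_b, 3))
```

Move b has a lower estimated value than move a, but it has been visited less often and the policy network favors it, so its mixed score comes out higher.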

## 2.2. Best Moves Based on the Mixed Formula
We create a local module for the mix MCTS algorithm. Download the file mix_conn_mcts.py from the book's GitHub repository and save it in /Desktop/ai/utils/ on your computer.  

In the local module mix_conn_mcts, we define a uct_plus_policy() function as follows:

In [8]:
def uct_plus_policy(env_copy,path,paths,temperature,model):
    if len(env_copy.validinputs)==1:
        return env_copy.validinputs[0]    
    # use uct to select move
    parent=[]
    pathvs=[]
    for v in env_copy.validinputs:
        pathv=path+str(v)
        pathvs.append(pathv)
        for p in paths:
            if p[0]==pathv:
                parent.append(p)
    # calculate uct score for each action
    # use two separate dictionaries; ucta=uctb={} would alias one dict
    ucta={}
    uctb={}
    for pathv in pathvs:
        history=[p for p in parent if p[0]==pathv]
        if len(history)==0:
            ucta[pathv]=0
            uctb[pathv]=float("inf")
        else:
            ucta[pathv]=sum([p[1] for p in history])/len(history) 
            uctb[pathv]=temperature*sqrt(\
                                 log(len(parent))/len(history)) 
    
    ua={int(k[-1]):v for k,v in ucta.items()}
    ub={int(k[-1]):v for k,v in uctb.items()}
    # probability from the policy network in actor-critic
    state = env_copy.state.reshape(-1,42)
    if env_copy.turn=="red":
        action_probs, _= model(state)
    else:
        action_probs, _= model(-state)     
    uctscores={}
    # combine UCT with policy network
    for a in sorted(env_copy.validinputs):
        uctscores[a]=ua.get(a,0)+ub.get(a,0)+\
            np.squeeze(action_probs)[a-1]
    return max(uctscores,key=uctscores.get)

The above function uct_plus_policy() determines which move to select at each step when rolling out games. It has two parts: the first part is based on the UCT formula $v_i+C\times \sqrt{\frac{\log N}{n_i}}$ (the temperature argument plays the role of $C$), and the second part is the policy term $weight\times p_i$, where $p_i$ is the probability from the policy network in the trained actor-critic model. We set the weight on the policy network to 1. 

## 2.3. Roll Out A Connect Four Game

Now that we know how to select the best moves based on the mixed formula, we'll define a mix_simulate_conn() function in the local module mix_conn_mcts. The function rolls out a game from a certain starting position all the way to the end of the game. 

In [9]:
def mix_simulate_conn(env,paths,counts,wins,losses,\
                     temperature,model):
    env_copy=deepcopy(env)
    actions=[]
    path=""
    while True:
        utc_move=uct_plus_policy(env_copy,path,paths,\
                                 temperature,model)
        move=deepcopy(utc_move)
        actions.append(move)
        state,reward,done,info=env_copy.step(move)
        path += str(move)
        if done:
            result=0
            counts[actions[0]] += 1
            if (reward==1 and env.turn=="red") or \
                (reward==-1 and env.turn=="yellow"):
                result=1
                wins[actions[0]] += 1
            if (reward==-1 and env.turn=="red") or \
                (reward==1 and env.turn=="yellow"):
                losses[actions[0]] += 1  
                result=-1
            break
    return result,path,counts,wins,losses

The function rolls out a Connect Four game and updates the number of game counts and the number of wins and losses. After each game, we update the results using the backpropagate() function below:

In [10]:
# backpropagate
def backpropagate(path,result,paths):
    while path != "":
        paths.append((path,result))
        path=path[:-1]

We also define a best_move() function in the file mix_conn_mcts.py, which selects the best move based on the numbers of wins and losses associated with each next move. The function is the same as that defined in the file policy_mcts_conn.py. 
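To see what backpropagate() records, the sketch below repeats the function and feeds it two made-up rollouts. The paths "434" and "435" and their results are invented for illustration.

```python
def backpropagate(path, result, paths):
    # record the result for the full path and every prefix of it
    while path != "":
        paths.append((path, result))
        path = path[:-1]

paths = []
# A made-up rollout: columns 4, 3, 4 were played and the root player won
backpropagate("434", 1, paths)
# A second made-up rollout that starts the same way but is lost
backpropagate("435", -1, paths)

# Average result recorded for the position reached after moves 4 and 3
history = [r for p, r in paths if p == "43"]
print(paths)
print(sum(history) / len(history))
```

Each rollout adds one entry per prefix of its move sequence, which is how a later call to uct_plus_policy() can look up the visit count and the average result of any position reached earlier.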

## 2.4. A Mix MCTS Algorithm in Connect Four
Finally, in the local module mix_conn_mcts, we define a mix_mcts_conn() function as follows:

In [11]:
def mix_mcts_conn(env,model,num_rollouts=50,temperature=1.4):
    if len(env.validinputs)==1:
        return env.validinputs[0]
    counts={}
    wins={}
    losses={}
    for move in env.validinputs:
        counts[move]=0
        wins[move]=0
        losses[move]=0
    paths=[]    
    # roll out games
    for _ in range(num_rollouts):
        result,path,counts,wins,losses=mix_simulate_conn(\
             env,paths,counts,wins,losses,temperature,model)      
        # backpropagate
        backpropagate(path,result,paths)
    # See which action is most promising
    best_next_move=best_move(counts,wins,losses) 
    return best_next_move

We set the default number of rollouts to 50. If there is only one legal move left, we skip searching and select the only move available. Otherwise, we create three dictionaries, counts, wins, and losses, to record the outcomes from simulated games. Once the simulation is complete, we select the best next move based on the simulation results. 

## 2.5. Test the Mix MCTS in Connect Four
We will let the mix MCTS agent play against the actor-critic agent for 20 games and see which agent is stronger. 

In [12]:
from utils.mix_conn_mcts import mix_mcts_conn

num_rollouts=200
# Initiate the game environment
env=conn()
state=env.reset() 
results=[]
for i in range(20):
    print(i)
    state=env.reset() 
    if i%2==0:
        action=AC_conn_stochastic(env,model)
        state, reward, done, info = env.step(action)
    while True:
        action=mix_mcts_conn(env,model,num_rollouts=num_rollouts)  
        state, reward, done, info = env.step(action)
        if done:
            # result is 1 if the MCTS agent wins
            if reward!=0:
                results.append(1) 
            else:
                results.append(0)    
            break  
        action = AC_conn_stochastic(env,model)   
        state, reward, done, info = env.step(action)
        if done:
            # result is -1 if the MCTS agent loses
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)    
            break  

Half the time, the mix MCTS agent moves first and the other half the actor-critic agent moves first so that no player has the first-mover's advantage. We record a result of 1 if the mix MCTS agent wins and a result of -1 if the actor-critic agent wins. 

We now count how many times the mix MCTS agent has won and lost.

In [13]:
# count how many times the mix MCTS agent won
wins=results.count(1)
print(f"the mix MCTS agent has won {wins} games")
# count how many times the mix MCTS agent lost
losses=results.count(-1)
print(f"the mix MCTS agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game has tied {ties} times") 

the mix MCTS agent has won 12 games
the mix MCTS agent has lost 7 games
the game has tied 1 times


The above results show that the mix MCTS agent is slightly better than the actor-critic agent. 

# 3. The Idea of Self Play
We implement self play in this section. Specifically, we build a stronger version of an RL agent by enhancing it through a mixed MCTS algorithm that we discussed in Chapter 19. 

We let an RL agent play against a slightly stronger version of itself through self play. The stronger version is the mix MCTS agent with 50 rollouts. 

To implement self play, we follow the steps we used in Chapter 17 when training the actor-critic agent. The difference here is that in Chapter 17, the opponent is the look-three-steps-ahead rule-based AI agent. In self play, we'll change the opponent to the mix MCTS agent, which selects moves by using the UCT formula and the probability recommended by its own policy network. In a way, the opponent is a moving target: as the actor-critic agent becomes stronger, the opponent also becomes stronger. 

Specifically, we'll create a local module selfplay in the local package to train the selfplay agent. Download the file selfplay.py from the book's GitHub page and save it in the folder /Desktop/ai/utils/ on your computer. Below, we'll highlight the changes we have made compared to what we have done in Chapter 17. 

## 3.1. Load the Trained Model
We'll not create a model from scratch. Instead, we'll use the model we trained in Chapter 17 and further train it using self play. Therefore, in the local module selfplay.py, we load the model that we saved in Chapter 17 as follows:

In [14]:
from utils.conn_simple_env import conn
import numpy as np
import tensorflow as tf
from tensorflow import keras
from utils.policy_mcts_conn import policy_mcts_conn
from utils.mix_conn_mcts import mix_mcts_conn

num_inputs = 42
num_actions = 7
model=keras.models.load_model("files/ch17/ac_conn.h5")   
env=conn()
num_rollouts=50

The saved model ac_conn.h5 is loaded up and is now the model we are training. We set the number of rollouts in the mix MCTS algorithm to 50 to save time: if this number is too low, we cannot further improve the model; if the number is too large, it takes too long to train the model. 

Below, we specify the optimizer and the loss function we use to train the model. 

In [15]:
optimizer = keras.optimizers.Adam(learning_rate=0.0005)
loss_func = keras.losses.MeanAbsoluteError()

The optimizer is Adam with a learning rate of 0.0005. The loss function is the mean absolute error, which penalizes outliers less than the mean squared error does because it grows linearly, rather than quadratically, with the size of the error. 
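To see how differently the mean absolute error and the mean squared error react to an outlier, the sketch below computes both in plain Python on made-up numbers. The two helper functions are written for this illustration and are not the Keras implementations.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions; the last prediction is far off
y_true = [1.0, -1.0, 1.0, 1.0]
y_pred = [0.9, -0.8, 0.7, -2.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
# Share of each loss that comes from the single outlier
mae_share = abs(y_true[-1] - y_pred[-1]) / (mae * len(y_true))
mse_share = (y_true[-1] - y_pred[-1]) ** 2 / (mse * len(y_true))
print(mae, mse)
print(mae_share, mse_share)
```

The single large error accounts for a bigger share of the mean squared error than of the mean absolute error, so it would dominate the critic's gradient more under the squared loss.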

## 3.2. Train Players in Connect Four
We'll let the actor-critic agent play against the mix MCTS algorithm we developed in the last section. In the local module selfplay.py, we'll define a playing_red() function. The playing_red() function simulates a full game, with the mix MCTS agent as the second player and the actor-critic (AC) agent as the first player, as follows:

In [16]:
max_steps=50
def playing_red():
    # create lists to record game history
    action_probs_history = []
    critic_value_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        action_probs, critic_value = model(state)
        critic_value_history.append(critic_value[0, 0])
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        action_probs_history.append(\
                    tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
            #episode_reward += -1 
        # otherwise, place the move on the game board
        else:              
        # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                episode_reward += reward 
                break
            else:
                state, reward, done, _ = env.step(\
          mix_mcts_conn(env,model,num_rollouts=num_rollouts))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                episode_reward += reward                 
                if done:
                    break                
    return action_probs_history,critic_value_history, \
            wrongmoves_history,rewards_history, episode_reward

We define another function, playing_yellow(), similarly. The playing_yellow() function simulates a full game, with the mix MCTS agent as the first player and the actor-critic (AC) agent as the second player. As we did in Chapter 17, we use *-state* instead of *state* when we feed the game board to the model. Instead of using 1 and -1 to denote a win by the red and yellow player, respectively, we'll use a reward of 1 to denote that the current player has won and a reward of -1 to denote that the current player has lost. We accomplish this by using *-reward* instead of *reward* in playing_yellow(). Therefore, after the yellow player makes a move, the reward is now 1 if the yellow player wins and -1 if the yellow player loses.
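The sign flips can be illustrated on a small example. The sketch below is a hypothetical illustration, not code from selfplay.py: it assumes red discs are stored as 1 and yellow discs as -1, and it uses a made-up 2x3 fragment rather than a full board.

```python
def to_current_player_view(board, turn):
    # red discs are stored as 1 and yellow discs as -1; flipping the
    # signs lets the yellow player see its own discs as 1
    if turn == "red":
        return [row[:] for row in board]
    return [[-cell for cell in row] for row in board]

def reward_for_player(reward, player):
    # the environment reports 1 for a red win and -1 for a yellow win
    return reward if player == "red" else -reward

# A made-up 2x3 fragment of a board, just to show the sign flip
board = [[0, 1, -1],
         [1, -1, 0]]
print(to_current_player_view(board, "yellow"))
print(reward_for_player(-1, "yellow"))
```

After the flip, the model always sees its own discs as 1 and a win as a reward of 1, which is why one network can be trained for both colors.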

## 3.3. Keep Track of Progress
Self play is extremely costly in terms of computation, so it takes a long time to train the model. To make things more complicated, we need to tune the hyperparameters (such as the learning rate, the loss function, and so on) to make the model work. Hence it's important to measure the progress of the model so that you know things are going in the right direction.

To do that, we'll test the model against the rule-based AI on the side to determine the progress. Note that the rule-based AI is not involved in training the model directly. It just checks how good the model is periodically without interfering with the model itself. 

In [17]:
from utils.conn_think3 import red_think3
from utils.conn_policy_mcts import DL_stochastic

def one_game():
    state=env.reset()   
    while True:  
        action = red_think3(env) 
        state, reward, done, info = env.step(action)
        if done:       
            return reward 
        action = DL_stochastic(env,model) 
        state, reward, done, info = env.step(action)
        if done:
            return reward

from collections import deque
testresults=deque(maxlen=10000)
def test():
    for i in range(100):
        testresults.append(-one_game())

## 3.4. Train the Self-Play Model

To alternate between training the red player and training the yellow player, we'll create a variable *batches*. It starts with a value of 0. After each batch of training, we add 1 to the value of the variable *batches*. We'll train the red player when the value of *batches* is even and train the yellow player otherwise. 

We train the model for 300 episodes of games. For that purpose, we define the selfplay_conn() function in the local module selfplay.py, as follows:

In [18]:
def selfplay_conn():
    # create_batch_red(), create_batch_yellow(), and batch_size are
    # assumed to be defined elsewhere in selfplay.py
    episode_count = 0
    batches=0    
    while episode_count<=300:
        with tf.GradientTape() as tape:
            # train the red player on even batches, yellow on odd ones
            if batches%2==0:
                action_probs_history,critic_value_history,\
    rewards_history,episode_rewards=create_batch_red(batch_size)
            else:
                action_probs_history,critic_value_history,\
    rewards_history,episode_rewards=create_batch_yellow(batch_size)
            # Calculating loss values to update our network        
            tfdif=tf.convert_to_tensor(rewards_history,\
                                       dtype=tf.float32)-\
    tf.convert_to_tensor(critic_value_history,dtype=tf.float32)
            alosses=-tf.multiply(tf.convert_to_tensor(\
                  action_probs_history, dtype=tf.float32),tfdif)
            closs=loss_func(tf.convert_to_tensor(rewards_history,\
                                             dtype=tf.float32),\
     tf.convert_to_tensor(critic_value_history,dtype=tf.float32))
            loss_value = tf.reduce_sum(alosses) + closs
        # Backpropagation
        grads = tape.gradient(loss_value,\
                              model.trainable_variables)
        optimizer.apply_gradients(zip(grads,\
                                      model.trainable_variables))    
        # Log details
        episode_count += batch_size
        batches += 1
        if episode_count % 20 == 0:
            model.save("files/ch20/selfplay_conn.h5") 
            # check progress
            test()
            avg=np.array(testresults).mean()
            print(f"score {avg:.6f} at episode {episode_count}")
Now we can call the selfplay_conn() function from the local module to train the model through self play, as follows:

In [19]:
from utils.selfplay import selfplay_conn

selfplay_conn()

It takes several hours to train the model. Once done, the model is saved as selfplay_conn.h5 on your computer. 
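The loss that selfplay_conn() minimizes can be traced by hand on a few numbers. The sketch below redoes the arithmetic in plain Python instead of TensorFlow. All values are made up, and the variable names are chosen for this illustration only.

```python
from math import log

# Made-up values for a batch of three moves
action_probs = [0.5, 0.25, 0.8]   # probability of each chosen action
rewards = [1.0, -1.0, 1.0]        # reward assigned to each move
critic_values = [0.6, -0.2, 0.9]  # critic's estimate for each state

log_probs = [log(p) for p in action_probs]
# advantage: how much better the outcome was than the critic expected
advantages = [r - v for r, v in zip(rewards, critic_values)]
# actor loss: negative log probability weighted by the advantage
actor_losses = [-lp * adv for lp, adv in zip(log_probs, advantages)]
# critic loss: mean absolute error between rewards and critic values
critic_loss = sum(abs(a) for a in advantages) / len(advantages)

loss_value = sum(actor_losses) + critic_loss
print(actor_losses)
print(critic_loss, loss_value)
```

A move with a positive advantage contributes a positive actor loss, so minimizing the loss raises the probability of that move. A move with a negative advantage has its probability pushed down.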

# 4. Test the Self-Play Trained Model
In this section, we test the self-play trained model and compare it to the model that we trained in Chapter 17. 

In [20]:
old_model=keras.models.load_model("files/ch17/ac_conn.h5")
new_model=keras.models.load_model("files/ch20/selfplay_conn.h5")

def DL(env,model): 
    state = env.state.reshape(-1,42)
    if env.turn=="red":
        action_probs, critic_value = model(state)
    else:
        action_probs, critic_value = model(-state)
    aps=[]
    for a in sorted(env.validinputs):
        aps.append(np.squeeze(action_probs)[a-1])
    ps=np.array(aps)/np.array(aps).sum()
    return np.random.choice(sorted(env.validinputs),p=ps)

results=[]
for i in range(100):
    env=conn()
    state=env.reset() 
    if i%2==0:
        action=DL(env,old_model)
        state, reward, done, info = env.step(action)        
    while True:    
        action=DL(env,new_model)
        state, reward, done, info = env.step(action)
        if done:
            # if the new model wins, record 1
            if reward!=0:
                results.append(1)
            else:
                results.append(0)
            break
        action=DL(env,old_model) 
        state, reward, done, info = env.step(action)
        if done:
            if reward!=0:
                results.append(-1) 
            else:
                results.append(0)
            break

We create an empty list results. We play 100 games and in 50 of them, the new model moves first; in the other 50 games, the old model moves first. If the new model wins, we add an outcome 1 to results. If the old model wins, we add an outcome of -1 to results. If the game is tied, we add a 0 to results.

We can now count how many games the new model has won:

In [21]:
# Print out the number of games the new model won
wins=results.count(1)
print(f"The new model has won {wins} games")
# Print out the number of games the new model lost
losses=results.count(-1)
print(f"The new model has lost {losses} games")              
# Print out the number of tie games
ties=results.count(0)
print(f"There are {ties} tied games")     

The new model has won 55 games
The new model has lost 42 games
There are 3 tied games


The results show that the new model has won 55 games, lost 42 games, and the remaining 3 games are tied. The new model wins noticeably more often than it loses, which suggests that self play has improved the model, although 100 games is a small sample. 

# 5. The AlphaGo Agent in Connect Four
To conclude this chapter, we'll create an AlphaGo agent in Connect Four. The agent will use the mix MCTS algorithm to come up with moves when playing a game. We'll have two versions of the AlphaGo agent: a fast version in which the agent rolls out 50 games before making a move, and a strong version in which the agent rolls out 200 games before making a move. The fast version makes a move quickly but is slightly weaker. The strong version takes longer to make a move but makes more intelligent moves. 

In [22]:
model=keras.models.load_model("files/ch20/selfplay_conn.h5")
from utils.mix_conn_mcts import mix_mcts_conn
def AlphaGo_conn_fast(env,model):
    return mix_mcts_conn(env,model,num_rollouts=50)  

def AlphaGo_conn_strong(env,model):
    return mix_mcts_conn(env,model,num_rollouts=200)  

Below, I play a full game against the fast version of the AlphaGo agent and time how long it takes the agent to make each move.

In [23]:
from time import time

env=conn()
state=env.reset() 
while True:
    start=time()
    action=AlphaGo_conn_fast(env,model)  
    print(f"it took AlphaGo {time()-start} seconds")
    print(f"AlphaGo drops a disc in column {action}")
    state, reward, done, info = env.step(action)
    print(f"the current state is \n{env.state.T[::-1]}")
    if done:
        # a nonzero reward means the player who just moved has won
        if reward!=0:
            print("the AlphaGo agent has won!")
        else:
            print("tie game")
        break  
    action = int(input("enter a move:"))   
    state, reward, done, info = env.step(action)  
    if done:
        # a nonzero reward means the player who just moved has won
        if reward!=0:
            print("the human player has won!") 
        else:
            print("tie game")    
        break  



it took AlphaGo 3.649463176727295 seconds
AlphaGo drops a disc in column 4
the current state is 
[[0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0]]
enter a move:4
it took AlphaGo 1.8945810794830322 seconds
AlphaGo drops a disc in column 3
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  1  1  0  0  0]]
enter a move:4
it took AlphaGo 1.1149749755859375 seconds
AlphaGo drops a disc in column 2
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  1  1  1  0  0  0]]
enter a move:5
it took AlphaGo 0.5531809329986572 seconds
AlphaGo drops a disc in column 1
the current state is 
[[ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 0  0  0 -1  0  0  0]
 [ 1  1  1  1 -1  0  0]]
the AlphaGo agent has won!

The agent is both fast and powerful. It takes at most a few seconds for the AlphaGo agent to make a move, and it wins by creating a double attack horizontally. 