# Chapter 18: Hyperparameter Tuning in AlphaGo



***
***“Currently, most of the job of a deep-learning engineer consists of munging data with Python scripts and then tuning the architecture and hyperparameters of a deep network at length to get a working model—or even to get a state-of-the-art model, if the engineer is that ambitious.”***

-- Francois Chollet, creator of the Keras deep-learning library 
***




In Chapters 16 and 17, we applied a simplified version of the AlphaGo algorithm to the Coin game and Tic Tac Toe. This integration of Monte Carlo Tree Search (MCTS) with deep reinforcement learning enabled the AlphaGo-based agent to effectively solve both games. Notably, in the Coin game, the AlphaGo agent consistently won when playing second, even against opponents employing flawless strategies. In Tic Tac Toe, it consistently achieved draws against the MiniMax agent, which made perfect moves in each game.

The focus of this chapter is on adapting the AlphaGo algorithm for Connect Four. Unlike in the previous games, creating expert-level moves in Connect Four is a more complex task. Consequently, the AlphaGo agent, trained within a constrained time frame and with limited computational resources, won’t achieve perfection. Nevertheless, it impressively outperforms the agent that plans three moves ahead. This achievement underscores a key takeaway: the adaptability of the AlphaGo algorithm to various games. Given sufficient training and computational power, the AlphaGo algorithm has the potential to reach superhuman performance levels, a feat demonstrated by the DeepMind team in 2016.

In the Coin game and Tic Tac Toe, fine-tuning hyperparameters is less critical: most hyperparameter configurations lead to optimal game strategies. In Connect Four, however, the AlphaGo agent does not fully solve the game, so it becomes crucial to identify the combination of hyperparameters that yields the most effective game strategy.

Grid search is a common approach to hyperparameter tuning. You choose a small set of candidate values for each hyperparameter, try every combination of those values, and determine empirically which combination performs best. Because the number of combinations is the product of the number of candidate values for each hyperparameter, this technique quickly becomes resource-intensive and time-consuming as the number of hyperparameters grows.
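To make the idea concrete, here is a minimal sketch of grid search in plain Python. The candidate values mirror the ones used later in this chapter, but `build_grid` and `toy_score` are hypothetical names introduced only for illustration, and the toy objective is a stand-in: in the real experiment, the score comes from playing games, not from a formula.

```python
from itertools import product

# Candidate values for each hyperparameter
search_space = {
    "weight": [0.25, 0.5, 0.75],
    "policy_rollout": [True, False],
    "depth": [25, 35, 45],
}

def build_grid(space):
    """Return one dictionary per combination of hyperparameter values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]

def toy_score(params):
    """Stand-in objective (an assumption); a real run would play games."""
    bonus = 0.01 if params["policy_rollout"] else 0.0
    return (-(params["weight"] - 0.5) ** 2
            - ((params["depth"] - 35) / 100) ** 2
            + bonus)

grid = build_grid(search_space)
best = max(grid, key=toy_score)
print(len(grid))   # 3 * 2 * 3 = 18 combinations
print(best)
```

Adding a fourth hyperparameter with three candidate values would triple the grid to 54 combinations, which is why we keep the candidate lists short.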

In this chapter, to streamline the process and save time, we’ll limit ourselves to testing only a few values per hyperparameter. The optimal parameters identified here should be regarded as a basic benchmark and a lower bound for the model’s capabilities. Specifically, you’ll explore 18 different hyperparameter combinations using grid search to determine which leads to the strongest game strategy.

We focus on four key hyperparameters for the AlphaGo agent: weight (between 0 and 1), which balances the priors (the move probabilities suggested by the policy gradient network) against rollout outcomes in node selection; policy_rollout (either True or False), which determines whether the policy network is used to roll out games; depth, the maximum number of steps to search ahead in a rollout; and num_rollouts, the number of rollouts performed in MCTS for each move in actual gameplay.
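The role of weight is easiest to see with a small numerical sketch. The function below, `blended_scores`, is a hypothetical illustration that assumes a simple linear blend of the prior and the average rollout outcome for each move; the actual `select` function in `utils.ch17util` may combine the two differently.

```python
def blended_scores(priors, results, weight):
    """Blend priors with mean rollout outcomes (assumed linear blend).

    priors:  dict mapping move -> prior probability
    results: dict mapping move -> list of rollout rewards seen so far
    weight:  value in [0, 1]; the share given to the priors
    """
    scores = {}
    for move, prior in priors.items():
        outcomes = results.get(move, [])
        # A move with no rollouts yet gets a neutral mean outcome of 0
        mean_outcome = sum(outcomes) / len(outcomes) if outcomes else 0.0
        scores[move] = weight * prior + (1 - weight) * mean_outcome
    return scores

# Made-up numbers for three candidate moves
priors = {1: 0.2, 2: 0.5, 3: 0.3}
results = {1: [1, 1], 2: [-1, 1], 3: []}
for w in (0.0, 0.5, 1.0):
    scores = blended_scores(priors, results, w)
    print(w, max(scores, key=scores.get))
```

With weight=0 the choice depends only on rollout outcomes, and with weight=1 it depends only on the priors; values in between trade one source of information against the other.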

We also impose a one-second time limit per move. This constraint effectively reduces the number of hyperparameters to three, because for each combination of weight, policy_rollout, and depth we simply measure the maximum num_rollouts that fits within the limit. The most effective combination identified is weight=0.75, policy_rollout=False, depth=45. With these settings, the maximum number of rollouts the AlphaGo agent can perform within a second is num_rollouts=584. In ten-game tests, this optimized AlphaGo agent wins more games than it loses against a rule-based AI that plans three steps ahead. If we relax the one-second-per-move time constraint and increase the number of rollouts to 5000, the AlphaGo agent becomes even stronger: it wins more games than it loses against a rule-based AI that looks four steps ahead.
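The idea of measuring how many rollouts fit into a time budget can be sketched independently of the game. In the snippet below, `count_iterations` and `fake_rollout` are hypothetical names: `fake_rollout` merely sleeps for a millisecond to stand in for one pass of select, expand, simulate, and backpropagate.

```python
import time

def count_iterations(step, budget_seconds=1.0):
    """Call step() repeatedly until the budget is spent; return the count."""
    start = time.perf_counter()
    n = 0
    while True:
        step()
        n += 1
        # Stop as soon as the elapsed time reaches the budget
        if time.perf_counter() - start >= budget_seconds:
            return n

def fake_rollout():
    """Hypothetical stand-in for one MCTS rollout."""
    time.sleep(0.001)

print(count_iterations(fake_rollout, budget_seconds=0.05))
```

The count depends on the hardware and on whatever else the machine is doing, so the numbers you obtain on your computer will likely differ from those reported in this chapter.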

# 1. Test the AlphaGo Agent in Connect Four

## 1.1. The Opponent in Connect Four Games

In [1]:
from utils.ch06util import MiniMax_conn

def rule_based_AI(env,depth=3):
    move = MiniMax_conn(env,depth=depth)
    return move 

In [2]:
from copy import deepcopy
import numpy as np
from tensorflow import keras

# Load the trained fast policy network from Chapter 11
fast_net=keras.models.load_model("files/policy_conn.h5")
# Load the policy gradient network from Chapter 15
PG_net=keras.models.load_model("files/PG_conn.h5")
# Load the trained value network from Chapter 15
value_net=keras.models.load_model("files/value_conn.h5")



## 1.2. AlphaGo vs Rule-Based AI in Connect Four

In [3]:
from utils.conn_simple_env import conn
from utils.ch17util import alphago

weight=0.75
depth=20
# initialize the game environment
env=conn()
# test ten games
for i in range(10):
    state=env.reset()  
    while True: 
        # AlphaGo moves first
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=True,num_rollouts=100)
        state,reward,done,_=env.step(action)     
        if done:
            print("AlphaGo wins!")
            break     
        # move recommended by rule-based AI
        action=rule_based_AI(env)  
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("Rule-based AI wins!")
            break                

AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!


In [4]:
# test ten games
for i in range(10):
    state=env.reset()  
    while True: 
        # Rule-based AI moves first 
        action=rule_based_AI(env)  
        state,reward,done,_=env.step(action)        
        if done:
            print("Rule-based AI wins!")
            break     
        # AlphaGo moves second
        action=alphago(env,weight,depth,PG_net,value_net,
                fast_net, policy_rollout=True,num_rollouts=100)
        state,reward,done,_=env.step(action)   
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaGo wins!")
            break                          

Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
The game is tied!
Rule-based AI wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
Rule-based AI wins!
Rule-based AI wins!


# 2. Hyperparameter Tuning

## 2.1 Real-World Constraints to Narrow Down Hyperparameters



In [5]:
grid=[]
weights=[0.25, 0.5, 0.75]
policy_rollouts=[True, False]
depths=[25, 35, 45]
for x in weights:
    for y in policy_rollouts:
        for z in depths:
            parameters={"weight":x,"policy_rollout":y,"depth":z}
            grid.append(parameters)
print(f"there are {len(grid)} combinations of hyperparameters")

there are 18 combinations of hyperparameters


## 2.2 The Maximum Number of Rollouts


In [6]:
# get the priors from the PG policy network
state=env.reset()
state=env.state.reshape(-1,42)
conv_state=state.reshape(-1,7,6,1)
priors=PG_net([state,conv_state])

In [7]:
# create a new list of hyperparameters
new_grid=[]

import time
from utils.ch17util import (select, expand,
            simulate, backpropagate)
# iterate through different combinations of hyperparameters
for parameters in grid:
    # reset the Connect Four game
    state=env.reset()  
    # create a dictionary results
    results={}
    for move in env.validinputs:
        results[move]=[]    
    # extract hyperparameters
    weight=parameters["weight"]
    policy_rollout=parameters["policy_rollout"]    
    depth=parameters["depth"]    
    # start counting
    start=time.time()
    n=0
    # roll out games
    while True:
        n += 1
        # select
        move=select(priors,env,results,weight)
        # expand
        env_copy, done, reward=expand(env,move)
        # simulate
        reward=simulate(env_copy,done,reward,depth,value_net,
                     fast_net, policy_rollout)
        # backpropagate
        results=backpropagate(env,move,reward,results)
        # stop rollouts if it lasts more than a second
        if time.time()-start>=1:
            parameters["num_rollouts"]=n
            new_grid.append(parameters)
            break

In [8]:
from pprint import pprint
pprint(new_grid)

[{'depth': 25, 'num_rollouts': 31, 'policy_rollout': True, 'weight': 0.25},
 {'depth': 35, 'num_rollouts': 32, 'policy_rollout': True, 'weight': 0.25},
 {'depth': 45, 'num_rollouts': 34, 'policy_rollout': True, 'weight': 0.25},
 {'depth': 25, 'num_rollouts': 107, 'policy_rollout': False, 'weight': 0.25},
 {'depth': 35, 'num_rollouts': 464, 'policy_rollout': False, 'weight': 0.25},
 {'depth': 45, 'num_rollouts': 575, 'policy_rollout': False, 'weight': 0.25},
 {'depth': 25, 'num_rollouts': 29, 'policy_rollout': True, 'weight': 0.5},
 {'depth': 35, 'num_rollouts': 32, 'policy_rollout': True, 'weight': 0.5},
 {'depth': 45, 'num_rollouts': 34, 'policy_rollout': True, 'weight': 0.5},
 {'depth': 25, 'num_rollouts': 128, 'policy_rollout': False, 'weight': 0.5},
 {'depth': 35, 'num_rollouts': 387, 'policy_rollout': False, 'weight': 0.5},
 {'depth': 45, 'num_rollouts': 577, 'policy_rollout': False, 'weight': 0.5},
 {'depth': 25, 'num_rollouts': 27, 'policy_rollout': True, 'weight': 0.75},
 {'dep

# 3. Search for the Best Hyperparameters


## 3.1 Strategies Compete with Each Other


In [9]:
def one_game(si,sj):
    # parameters in strategy i
    weight_i=si["weight"]
    policy_rollout_i=si["policy_rollout"]    
    depth_i=si["depth"]    
    num_rollouts_i=si["num_rollouts"] 
    # parameters in strategy j   
    weight_j=sj["weight"]
    policy_rollout_j=sj["policy_rollout"]    
    depth_j=sj["depth"]    
    num_rollouts_j=sj["num_rollouts"] 
    # reset the game
    state=env.reset() 
    # play a full game
    while True: 
        # move recommended by strategy si
        action=alphago(env,weight_i,depth_i,
                       PG_net,value_net,fast_net,
                       policy_rollout_i,
                       num_rollouts_i) 
        state,reward,done,_=env.step(action)     
        if done:
            result=abs(reward)
            break      
        # move recommended by strategy sj
        action=alphago(env,weight_j,depth_j,
                       PG_net,value_net,fast_net,
                       policy_rollout_j,
                       num_rollouts_j) 
        state,reward,done,_=env.step(action)
        if done:
            result=-abs(reward)
            break  
    return result     

In [10]:
# create a list to record game outcome
results=[]
# play ten games for each permutation
for _ in range(10):
    for i in new_grid:
        for j in new_grid:
            if i!=j:
                result=one_game(i,j)
                # record players and outcome
                results.append((i,j,result))
                print(i,j,result)
# save the record to the computer
import pickle
with open("files/outcome.p","wb") as fb:
    pickle.dump(results, fb)
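
Before launching the tournament, it helps to know how many games it involves. The short calculation below is a sketch that assumes all 18 strategies are distinct and that the loop runs to completion; the variable names are introduced here for illustration only.

```python
from itertools import permutations

n_strategies = 18   # size of the hyperparameter grid
rounds = 10         # how many times the full schedule is repeated

# Ordered pairs (i, j) with i != j: strategy i moves first, j moves second
pairings = list(permutations(range(n_strategies), 2))
games_total = rounds * len(pairings)
# Each strategy meets 17 opponents, once moving first and once moving second
games_per_strategy = rounds * 2 * (n_strategies - 1)

print(len(pairings))        # 306
print(games_total)          # 3060
print(games_per_strategy)   # 340
```

Since each move can take up to a second, a complete run is slow, which is why the outcomes are saved to a file as soon as the tournament finishes.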

## 3.2 Find Out The Best Game Strategy


In [11]:
# load the grid search results from the file
import pickle
with open("files/outcome.p","rb") as fb:
    outcome=pickle.load(fb) 

# Create a dictionary to count results
rewards={}
for i, x in enumerate(new_grid):
    rewards[i]=[]
# iterate through all grid search results
for x in outcome:
    si, sj, r = x
    # count the game outcome for the red player
    rewards[new_grid.index(si)].append(r)
    # count the game outcome for the yellow player
    rewards[new_grid.index(sj)].append(-r) 

In [12]:
# Create a dictionary to count average score
scores={}
# iterate through keys and values in rewards
for k, v in rewards.items():
    # calculate average score for each game strategy
    score=sum(v)/len(v)
    # record in the dictionary scores
    scores[k]=score
# print out the average scores
from pprint import pprint
pprint(scores)

{0: -0.1864406779661017,
 1: -0.2994350282485876,
 2: -0.2542372881355932,
 3: -0.11864406779661017,
 4: 0.096045197740113,
 5: 0.1694915254237288,
 6: -0.2138728323699422,
 7: -0.19254658385093168,
 8: -0.22784810126582278,
 9: 0.07586206896551724,
 10: 0.2357142857142857,
 11: 0.3023255813953488,
 12: -0.11627906976744186,
 13: -0.078125,
 14: -0.015748031496062992,
 15: 0.3333333333333333,
 16: 0.42857142857142855,
 17: 0.4523809523809524}


In [13]:
# the index of the best game strategy
best_idx=max(scores, key=scores.get)
print(best_idx)
best_parameters=new_grid[best_idx]
print(best_parameters)

17
{'depth': 45, 'num_rollouts': 584, 'policy_rollout': False, 'weight': 0.75}


# 4. Test the Best AlphaGo Agent in Connect Four
 

In [14]:
from utils.conn_simple_env import conn
from utils.ch17util import alphago

weight_i=best_parameters["weight"]
policy_rollout_i=best_parameters["policy_rollout"]    
depth_i=best_parameters["depth"]    
num_rollouts_i=best_parameters["num_rollouts"] 
# initialize the game environment
env=conn()
# test ten games
for i in range(10):
    state=env.reset()  
    while True: 
        # AlphaGo moves first
        action=alphago(env,weight_i,depth_i,PG_net,value_net,
                fast_net, policy_rollout_i,num_rollouts_i)
        state,reward,done,_=env.step(action)     
        if done:
            print("AlphaGo wins!")
            break     
        # move recommended by rule-based AI
        action=rule_based_AI(env)  
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("Rule-based AI wins!")
            break                

AlphaGo wins!
The game is tied!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!


In [15]:
# test ten games
for i in range(10):
    state=env.reset()  
    while True: 
        # Rule-based AI moves first 
        action=rule_based_AI(env)  
        state,reward,done,_=env.step(action)        
        if done:
            print("Rule-based AI wins!")
            break     
        # AlphaGo moves second
        action=alphago(env,weight_i,depth_i,PG_net,value_net,
                fast_net, policy_rollout_i,num_rollouts_i)
        state,reward,done,_=env.step(action)   
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaGo wins!")
            break                          

AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!
Rule-based AI wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!


In [16]:
# test ten games
for i in range(10):
    state=env.reset()  
    if i%2==0:
        # Rule-based AI moves first in even-numbered games
        action=rule_based_AI(env, depth=4)
        state,reward,done,_=env.step(action)          
    while True:     
        # AlphaGo moves, setting num_rollouts=5000 
        action=alphago(env,weight_i,depth_i,PG_net,value_net,
                fast_net, policy_rollout_i,num_rollouts=5000)
        state,reward,done,_=env.step(action)
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("AlphaGo wins!")
            break  
        # Rule-based AI moves
        action=rule_based_AI(env, depth=4)
        state,reward,done,_=env.step(action)     
        if done:
            if reward==0:
                print("The game is tied!")
            else:
                print("Rule-based AI wins!")
            break             

AlphaGo wins!
AlphaGo wins!
Rule-based AI wins!
Rule-based AI wins!
Rule-based AI wins!
AlphaGo wins!
Rule-based AI wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
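
Reading win counts off a list of printed lines is error-prone. As a minimal sketch, you can collect the outcomes in a list and tally them with `collections.Counter`; the list below simply copies the ten lines printed above, and the variable names are introduced here for illustration.

```python
from collections import Counter

# The ten outcomes printed by the cell above, in order
outcomes = [
    "AlphaGo wins!", "AlphaGo wins!", "Rule-based AI wins!",
    "Rule-based AI wins!", "Rule-based AI wins!", "AlphaGo wins!",
    "Rule-based AI wins!", "AlphaGo wins!", "AlphaGo wins!",
    "AlphaGo wins!",
]

tally = Counter(outcomes)
for outcome, count in tally.most_common():
    print(f"{outcome} {count}")
```

In your own runs, you could append each outcome to a list inside the game loop instead of printing it, then tally the list at the end. Ten games is a small sample, so treat the resulting win rate as a rough indication only.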
