# Chapter 16: Implement AlphaGo in the Coin Game



***
***“I thought AlphaGo was based on probability calculation and that it was merely a machine. But when I saw this move, I changed my mind. Surely, AlphaGo is creative.”***

-- Lee Sedol, winner of 18 world Go titles
***




The AlphaGo algorithm combines deep reinforcement learning (namely, the policy gradient method) with traditional rule-based AI (specifically, Monte Carlo Tree Search, or MCTS) to generate intelligent game strategies in Go. Now that you have learned both deep reinforcement learning and MCTS, you’ll learn to combine them as the DeepMind team did and apply the algorithm to the three games in this book: the coin game, Tic Tac Toe, and Connect Four.

In this chapter, you’ll implement AlphaGo in the coin game. First, we’ll go over the AlphaGo algorithm and see how it brings various pieces together to create a powerful AI player. After that, you’ll apply the same logic to the coin game to create your very own AlphaGo agent.

MCTS is the core of AlphaGo’s decision-making process. It finds the most promising moves by building a search tree in which each node represents a game position and each branch represents a possible move. The search through this tree is guided by statistics collected from simulated games. AlphaGo also uses three deep neural networks: a fast policy network, a policy network strengthened through self-play deep reinforcement learning, and a value network. The fast policy network is trained on expert games and learns to predict their moves; it narrows down the moves considered in game rollouts, which steers the search toward more promising paths. The strengthened policy network is used to select child nodes, so that games are rolled out from the most valuable child nodes. The value network evaluates board positions and predicts the winner of the game from a given position. It’s crucial for looking ahead and evaluating future positions without having to play out the entire game.

AlphaGo’s success lies in the effective combination of traditional tree search methods with powerful machine-learning techniques. This allowed it to tackle the immense complexity of Go, a game with more possible positions than atoms in the observable universe.

The three games we consider in this book are much simpler than the game of Go. Nonetheless, we’ll mimic the strategies used by the DeepMind team and implement the AlphaGo algorithm in these games. Along the way, you’ll pick up valuable skills in both rule-based AI and cutting-edge developments in deep learning.

To implement AlphaGo in the coin game, we’ll draw on the skills and the trained networks from earlier in this book. Specifically, we’ll build on the MCTS algorithm from Chapter 8, with three changes. First, instead of rolling out games with random moves, you’ll let both players choose moves based on the fast policy network we trained in Chapter 9; more intelligent moves in rollouts lead to more informative game outcomes. Second, instead of always playing out the entire game during rollouts, you’ll stop after a fixed number of moves (that is, at a certain depth) and use the value network we trained in Chapter 13 to evaluate the resulting game state. Third, to select a child node to roll out games in MCTS, instead of using the UCT formula from Chapter 8, you’ll use a weighted average of the rollout values and the prior probabilities recommended by the strengthened policy network from Chapter 13.

After creating the AlphaGo agent in the coin game, you’ll test it against both random moves and the perfect rule-based AI player from Chapter 1. You’ll see that when moving second, the AlphaGo agent beats the rule-based AI player in all ten games. When moving first, the AlphaGo agent wins all ten games against random moves. This shows that the AlphaGo algorithm is as strong as any possible game strategy we could have designed for the coin game.

# 1. The AlphaGo Architecture 


# 2. AlphaGo in the Coin Game


## 2.1. Select the Best Child Node and Expand the Game Tree


```python
from copy import deepcopy
import numpy as np
from tensorflow import keras

# Load the trained fast policy network from Chapter 9
fast_net=keras.models.load_model("files/fast_coin.h5")
# Load the strengthened policy network from Chapter 13
PG_net=keras.models.load_model("files/PG_coin.h5")
# Load the trained value network from Chapter 13
value_net=keras.models.load_model("files/value_coin.h5")
```

```python
def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot

def best_move_fast_net(env):
    state = env.state
    onehot_state = onehot_encoder(state)
    action_probs = fast_net(onehot_state)
    return np.random.choice([1,2],
        p=np.squeeze(action_probs))
```
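Note that `best_move_fast_net()` samples a move in proportion to the network's output probabilities rather than always taking the most likely one, so repeated rollouts from the same position can differ. The sketch below illustrates that behavior using only the standard library. The probabilities are made up for illustration and do not come from the trained network, and `sample_move` is a hypothetical helper.

```python
import random
from collections import Counter

# Hypothetical policy output for moves 1 and 2 (not from the trained network)
action_probs = [0.8, 0.2]

def sample_move(probs, rng):
    # Draw move 1 or 2 with the given probabilities,
    # analogous to np.random.choice([1,2], p=probs)
    return rng.choices([1, 2], weights=probs, k=1)[0]

rng = random.Random(0)
counts = Counter(sample_move(action_probs, rng) for _ in range(1000))
# Move 1 is drawn far more often, but move 2 still appears
print(counts)
```

Sampling keeps the rollouts diverse, which is what lets the average rollout outcome serve as an estimate of a move's value.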

```python
def select(priors,env,results,weight):    
    # weighted average of priors and rollout_value
    scores={}
    for k,v in results.items():
        # rollout_value for each next move
        if len(v)==0:
            vi=0
        else:
            vi=sum(v)/len(v)
        # scale the prior by (1+N(L))
        prior=priors[0][k-1]/(1+len(v))
        # calculate weighted average
        scores[k]=weight*prior+(1-weight)*vi
    # select child node based on the weighted average     
    return max(scores,key=scores.get)  
```
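To see how the score in `select()` behaves, the sketch below applies the same formula to made-up numbers. The helper `select_score` and the priors are hypothetical and used only for illustration; the formula itself mirrors the code above, where the prior is divided by one plus the number of rollouts already run for that move.

```python
def select_score(prior, rollout_values, weight):
    # Average rollout outcome; 0 for a move that has not been tried yet
    n = len(rollout_values)
    mean_value = sum(rollout_values) / n if n else 0.0
    # The prior's influence shrinks as the visit count n grows
    return weight * prior / (1 + n) + (1 - weight) * mean_value

# Hypothetical priors for moves 1 and 2 (not from the trained network)
priors = {1: 0.7, 2: 0.3}
weight = 0.9

# Before any rollout, the move with the larger prior scores higher
fresh = {m: select_score(p, [], weight) for m, p in priors.items()}
best_fresh = max(fresh, key=fresh.get)

# After three losing rollouts for move 1, move 2 takes the lead
results = {1: [-1, -1, -1], 2: []}
later = {m: select_score(priors[m], results[m], weight) for m in priors}
best_later = max(later, key=later.get)

print(best_fresh, best_later)  # prints: 1 2
```

Early on the prior dominates, so the search starts with the move the policy network prefers. As rollouts accumulate, the observed outcomes take over.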

```python
# expand the game tree by taking a hypothetical move
def expand(env,move):
    env_copy=deepcopy(env)
    state,reward,done,info=env_copy.step(move)
    return env_copy, done, reward
```

## 2.2. Roll Out a Game and Backpropagate

```python
# roll out a game till terminal state or depth reached
def simulate(env_copy,done,reward,depth):
    # if the game has already ended
    if done==True:
        return reward
    # select moves based on fast policy network
    for _ in range(depth):
        move=best_move_fast_net(env_copy)
        state,reward,done,info=env_copy.step(move)
        # if terminal state is reached, returns outcome
        if done==True:
            return reward
    # depth reached but game not over, evaluate the current state
    # (read from env_copy so this also works when depth is 0)
    onehot_state=onehot_encoder(env_copy.state)
    # use the trained value network to evaluate
    ps=value_net.predict(onehot_state,verbose=0)
    # output is prob(1 wins)-prob(2 wins)
    reward=ps[0][1]-ps[0][0]  
    return reward
```

```python
def backpropagate(env,move,reward,results):
    # if current player is Player 1, update results
    if env.turn==1:
        results[move].append(reward)
    # if current player is Player 2, multiply outcome with -1
    elif env.turn==2:
        results[move].append(-reward)                  
    return results
```
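The sign flip in `backpropagate()` keeps every stored outcome in the perspective of the player who is choosing the move. The sketch below assumes the reward convention implied by the comments in `simulate()`: positive when Player 1 wins and negative when Player 2 wins. The `SimpleNamespace` object is a stand-in for the game environment, since only its `turn` attribute matters here.

```python
from types import SimpleNamespace

def backpropagate(env, move, reward, results):
    # Same logic as above: store the outcome from the mover's point of view
    if env.turn == 1:
        results[move].append(reward)
    elif env.turn == 2:
        results[move].append(-reward)
    return results

# Stand-in environment in which Player 2 is about to move
env_p2 = SimpleNamespace(turn=2)
results = {1: [], 2: []}

# Assumed convention: a reward of -1 means Player 2 won the rollout
results = backpropagate(env_p2, 1, -1, results)
print(results)  # prints: {1: [1], 2: []}
```

A win for Player 2 is recorded as +1 for the move Player 2 tried, so `select()` can always maximize the score regardless of who is moving.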

## 2.3. Create an AlphaGo Agent in the Coin Game


```python
def alphago_coin(env,weight,depth,num_rollouts=100):
    # if there is only one valid move left, take it
    if len(env.validinputs)==1:
        return env.validinputs[0]
    # get the prior from the PG policy network
    priors = PG_net(onehot_encoder(env.state))    
    # create a dictionary results
    results={}
    for move in env.validinputs:
        results[move]=[]
    # roll out games
    for _ in range(num_rollouts):
        # select
        move=select(priors,env,results,weight)
        # expand
        env_copy, done, reward=expand(env,move)
        # simulate
        reward=simulate(env_copy,done,reward,depth)
        # backpropagate
        results=backpropagate(env,move,reward,results)
    # select the most visited child node
    visits={k:len(v) for k,v in results.items()}
    return max(visits,key=visits.get)
```
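If you want to trace the select, expand, simulate, and backpropagate loop without loading the trained networks, the following sketch may help. It is not the agent above: it swaps in uniform priors for the strengthened policy network, random rollouts for the fast policy network, and full-length rollouts in place of the value network. `ToyCoinGame` is a hypothetical stand-in for the book's `coin_game` environment, with rules and attributes inferred from how the environment is used in this chapter (take one or two coins; whoever takes the last coin wins).

```python
import random
from copy import deepcopy

class ToyCoinGame:
    """Hypothetical stand-in for coin_game (assumed rules and interface)."""
    def __init__(self, coins=21):
        self.state = coins          # coins left in the pile
        self.turn = 1               # player to move
        self.validinputs = [m for m in (1, 2) if m <= coins]

    def step(self, move):
        self.state -= move
        done = self.state == 0
        reward = 0
        if done:
            # assumed convention: +1 if Player 1 wins, -1 if Player 2 wins
            reward = 1 if self.turn == 1 else -1
        else:
            self.turn = 3 - self.turn
        self.validinputs = [m for m in (1, 2) if m <= self.state]
        return self.state, reward, done, {}

def score(move, priors, results, weight):
    v = results[move]
    mean = sum(v) / len(v) if v else 0
    return weight * priors[move] / (1 + len(v)) + (1 - weight) * mean

def toy_agent(env, weight=0.5, num_rollouts=50):
    moves = list(env.validinputs)
    if len(moves) == 1:
        return moves[0]
    # stub: uniform priors instead of the strengthened policy network
    priors = {m: 1 / len(moves) for m in moves}
    results = {m: [] for m in moves}
    for _ in range(num_rollouts):
        # select
        move = max(moves, key=lambda m: score(m, priors, results, weight))
        # expand on a copy so the real game is untouched
        env_copy = deepcopy(env)
        _, reward, done, _ = env_copy.step(move)
        # simulate with random moves until the game ends
        while not done:
            _, reward, done, _ = env_copy.step(
                random.choice(env_copy.validinputs))
        # backpropagate from the mover's point of view
        results[move].append(reward if env.turn == 1 else -reward)
    # choose the most visited move
    return max(moves, key=lambda m: len(results[m]))

# With two coins left, taking both wins immediately
print(toy_agent(ToyCoinGame(coins=2)))  # prints: 2
```

With two coins left the outcome of every rollout is forced, so the stub agent settles on taking both coins regardless of the random seed.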

# 3. Test the AlphaGo Algorithm in the Coin Game


## 3.1. When the AlphaGo Agent Moves Second

```python
import random

def rule_based_AI(env):
    if env.state%3 != 0:
        move = env.state%3
    else:
        move = random.choice([1,2])
    return move
```
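
The rule leaves the opponent a multiple of three coins whenever it can. Assuming the coin-game rules used in this book (players alternately take one or two coins, and whoever takes the last coin wins), the brute-force check below suggests why this player is called perfect: starting from any pile size that is not a multiple of three, the rule wins against every possible sequence of replies. The helpers `rule_move` and `rule_player_wins` are hypothetical and written only for this check.

```python
def rule_move(coins):
    # Same rule as rule_based_AI when the pile is not a multiple of three
    return coins % 3

def rule_player_wins(coins):
    """True if the rule-following player, moving now with `coins` left
    (not a multiple of 3), wins against every possible reply."""
    coins -= rule_move(coins)
    if coins == 0:
        return True  # the rule-following player took the last coin
    # The opponent now faces a multiple of three and may take 1 or 2
    return all(rule_player_wins(coins - reply)
               for reply in (1, 2) if reply <= coins)

starts = [n for n in range(1, 22) if n % 3 != 0]
print(all(rule_player_wins(n) for n in starts))  # prints: True
```

When the pile is a multiple of three, no move leaves the opponent a multiple of three, which is why `rule_based_AI()` falls back to a random move in that case.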

```python
from utils.coin_simple_env import coin_game
from utils.ch16util import alphago_coin

# initiate game environment
env=coin_game()
# test ten games
for i in range(10):
    state=env.reset()
    while True:
        # The rule-based AI player moves first
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            # print out the winner
            print("Rule-based AI wins!")
            break
        # The AlphaGo agent moves second
        action=alphago_coin(env,0.9,10,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            # print out the winner
            print("AlphaGo wins!")
            break
```

```
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
```


## 3.2. Against Random Moves


```python
# test ten games against random moves
for i in range(10):
    state=env.reset()
    while True:
        # The AlphaGo agent moves first
        action=alphago_coin(env,0.9,10,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            print("AlphaGo wins!")
            break
        # The random player moves second
        action=random.choice(env.validinputs)
        state,reward,done,_=env.step(action)
        if done:
            print("AlphaGo loses!")
            break
```

```
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
AlphaGo wins!
```


# 4. Redundancy in the AlphaGo Algorithm

A coin game lasts at most 21 moves, so setting the rollout depth to 22 guarantees that every rollout reaches a terminal state and the value network is never used. The tests in this section repeat the two experiments from the previous section under this setting, with the weight on the priors lowered from 0.9 to 0.8.


```python
from utils.coin_simple_env import coin_game
from utils.ch16util import alphago_coin

# initiate game environment
env=coin_game()
# test ten games
for i in range(10):
    state=env.reset()
    while True:
        # The rule-based AI player moves first
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)
        if done:
            # print out the winner
            print("Without value network, rule-based AI wins!")
            break
        # The AlphaGo agent moves second
        action=alphago_coin(env,0.8,22,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            # print out the winner
            print("Without value network, AlphaGo wins!")
            break
```

```
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
```


```python
# test ten games against random moves
for i in range(10):
    state=env.reset()
    while True:
        # The AlphaGo agent moves first
        action=alphago_coin(env,0.8,22,num_rollouts=100)
        state,reward,done,_=env.step(action)
        if done:
            print("Without value network, AlphaGo wins!")
            break
        # The random player moves second
        action=random.choice(env.validinputs)
        state,reward,done,_=env.step(action)
        if done:
            print("Without value network, AlphaGo loses!")
            break
```

```
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
Without value network, AlphaGo wins!
```
