# Chapter 15: A Value Network in Connect Four



***
***“One neural network — known as the “policy network” — selects the next move to play. The other neural network — the “value network” — predicts the winner of the game.”***

-- Demis Hassabis, CEO and Co-Founder, DeepMind 
***




In Chapter 14, we applied the policy gradient method to Tic Tac Toe. Specifically, you learned how to handle illegal moves and how to use vectorization to speed up training. After several hours of training, the policy gradient agent became as effective as the strong policy agent that we trained in Chapter 10: it ties every game against an opponent that plays perfect moves.

In this chapter, you’ll apply the policy gradient method to Connect Four. You’ll handle illegal moves the same way you did in Chapter 14, by assigning a reward of −1 every time the agent makes an illegal move, and you’ll again use vectorization to speed up training. A single model trains both the red player and the yellow player.

Training alternates between the two sides in batches of ten games, with the strong policy agent as the opponent throughout. In the first ten games, the policy gradient agent moves first (red). We collect the natural logarithms of the predicted probabilities of the chosen actions, together with the discounted rewards, and use them to update the model parameters (i.e., train the model). In the next ten games, the policy gradient agent moves second (yellow), again selecting actions from the policy network, and we use the gameplay data to update the model in the same way. We keep switching sides after every ten games and stop once the agent’s average game reward over the most recent 100 games reaches 0.9, where a win counts as 1, a tie as 0, and a loss as −1.

We also use the game experience data from this training process to train a value network, which predicts the game outcome based on the board position. You’ll then learn how to use the trained value network to design a game strategy in Connect Four. Because the strong policy agent we trained in Chapter 11 doesn’t make perfect moves, there is room to improve on it: in the 100 test games at the end of this chapter, the value network agent wins 67 and loses 33.

# 1. The Policy Gradient Method in Connect Four

## 1.1. Create the Policy Gradient Model in Connect Four


In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_inputs=42
num_actions=7
# The convolutional input layer
conv_inputs=layers.Input(shape=(7,6,1))
conv=layers.Conv2D(filters=128, kernel_size=(4,4),padding="same",
     activation="relu")(conv_inputs)
flat=layers.Flatten()(conv)
# The dense input layer
inputs = layers.Input(shape=(42,))
# Combine the two into a single input layer
two_inputs = layers.Concatenate(axis=1)([flat,inputs])
# hidden layers
common = layers.Dense(256, activation="relu")(two_inputs)
common = layers.Dense(64, activation="relu")(common)
# Policy output network
action = layers.Dense(num_actions, activation="softmax")(common)
# The final model
model = keras.Model(inputs=[inputs, conv_inputs],
                    outputs=action)

In [2]:
optimizer=keras.optimizers.Adam(learning_rate=0.00025)

## 1.2. Use the Strong Policy Agent as the Opponent


In [3]:
# use the trained strong policy model as opponent
trained=keras.models.load_model('files/policy_conn.h5')
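def opponent(env):
    state = env.state.reshape(-1,7,6,1)
    # flip the sign of the board when yellow is to move, so the model
    # always sees the current player's pieces with the same sign
    if env.turn=="red":
        action_probs=trained(state)
    else:
        action_probs=trained(-state)
    # keep only the probabilities of the legal moves (columns 1 to 7)
    probs=np.squeeze(action_probs)
    valid=sorted(env.validinputs)
    aps=np.array([probs[a-1] for a in valid])
    # renormalize so the probabilities sum to one, then sample
    ps=aps/aps.sum()
    return np.random.choice(valid,p=ps)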
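
The renormalization step in `opponent` can be illustrated in isolation. The following is a minimal sketch in plain Python, not the chapter's code: the function name `sample_valid_move` and the example probabilities are hypothetical, and it uses the standard `random` module in place of NumPy so that it runs on its own.

```python
import random

def sample_valid_move(action_probs, valid_moves, rng=random):
    """Sample a legal column from a 7-way probability vector."""
    moves = sorted(valid_moves)
    # Columns are numbered 1 to 7, so column m uses index m-1
    weights = [action_probs[m - 1] for m in moves]
    total = sum(weights)
    # Rescale so that the legal moves' probabilities sum to one
    probs = [w / total for w in weights]
    move = rng.choices(moves, weights=probs, k=1)[0]
    return move, probs

# Hypothetical network output; only columns 1, 3, and 4 are still open
action_probs = [0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1]
move, probs = sample_valid_move(action_probs, [1, 3, 4])
print(move, [round(p, 2) for p in probs])
```

In this example the three legal columns carry a total probability of 0.5, so their rescaled probabilities are 0.2, 0.6, and 0.2. The opponent therefore never selects a full column, which is why only the policy gradient agent needs the illegal-move penalty.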

In [4]:
from utils.conn_simple_env import conn

# allow a maximum of 50 steps per game
max_steps=50
env=conn()

def playing_red():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    states=[]
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        conv_state = state.reshape(-1,7,6,1)
        # Predict action probabilities
        action_probs = model([state,conv_state])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
            #episode_reward += -1 
        # otherwise, place the move on the game board
        else:              
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            states.append(state)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(reward)
                episode_reward += reward 
                break
            else:
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(reward)
                wrongmoves_history.append(0)
                episode_reward += reward                 
                if done:
                    break                
    return action_probs_history,\
            wrongmoves_history,rewards_history, \
                episode_reward,states,reward

In [5]:
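def discount_rs(r,wrong):
    # gamma is the discount factor defined in the training cell below
    discounted_rs = np.zeros(len(r))
    running_add = 0
    # work backward from the end of the game
    for i in reversed(range(0, len(r))):
        # illegal moves are skipped: they keep a discounted reward of 0
        if wrong[i]==0:  
            running_add = gamma*running_add + r[i]
            discounted_rs[i] = running_add  
    return discounted_rs.tolist()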
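
To see how discounting interacts with the illegal-move penalty, here is a small worked example. It is a sketch under stated assumptions rather than the chapter's code: the function name `discount_with_penalties` and the four-step episode are hypothetical, `gamma` is passed as an argument instead of read from a global, and plain lists replace NumPy arrays.

```python
def discount_with_penalties(rewards, wrong, gamma=0.95):
    """Discount rewards over legal moves, then add illegal-move penalties."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each legal move sees the discounted future reward
    for i in reversed(range(len(rewards))):
        if wrong[i] == 0:  # legal move: takes part in discounting
            running = gamma * running + rewards[i]
            discounted[i] = running
    # Illegal moves keep a discounted value of 0 and get the flat -1
    return [d + w for d, w in zip(discounted, wrong)]

# Hypothetical episode: legal, illegal, legal, then a winning legal move
rewards = [0, 0, 0, 1]
wrong = [0, -1, 0, 0]
combined = discount_with_penalties(rewards, wrong)
print([round(c, 4) for c in combined])
# → [0.9025, -1.0, 0.95, 1.0]
```

Note that the illegal move does not count as a time step: the first move is discounted twice (0.95 squared is 0.9025), not three times, and the illegal move itself receives exactly −1 regardless of how the game ends.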

In [6]:
def playing_yellow():
    # create lists to record game history
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    states=[]
    state, reward, done, _ = env.step(opponent(env))
    for step in range(max_steps):
        state = state.reshape(-1,42,)
        conv_state = state.reshape(-1,7,6,1)
        # Predict action probabilities
        action_probs = model([-state,-conv_state])
        # select action based on the policy network
        action=np.random.choice(num_actions,\
                                p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(\
                        tf.math.log(action_probs[0, action]))
        # punish the agent if there is an illegal move
        if action+1 not in env.validinputs:
            rewards_history.append(0)
            wrongmoves_history.append(-1)
            #episode_reward += -1 
        # otherwise, place the move on the game board
        else:              
            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action+1)
            states.append(-state)
            if done:
                wrongmoves_history.append(0)
                rewards_history.append(-reward)
                episode_reward += -reward 
                break
            else:
                state,reward,done,_=env.step(opponent(env))
                rewards_history.append(-reward)
                wrongmoves_history.append(0)
                episode_reward += -reward                 
                if done:
                    break                
    return action_probs_history,\
            wrongmoves_history,rewards_history, \
                episode_reward,states,-reward

## 1.3. Train the Red and Yellow Players Iteratively


In [7]:
batch_size=10     
allstates=[]
alloutcome=[]
def create_batch_red(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,wms,rs,er,ss,outcome = playing_red()
        # rewards are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combine discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er)  
        # record game history for the next section
        allstates.append(ss)
        alloutcome.append(outcome)
    return action_probs_history,\
        rewards_history,episode_rewards

In [8]:
def create_batch_yellow(batch_size):
    action_probs_history = []
    wrongmoves_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,wms,rs,er,ss,outcome = playing_yellow()
        # rewards for legal moves are discounted
        returns = discount_rs(rs,wms)
        action_probs_history += aps
        # punishments for wrong moves are not discounted
        wrongmoves_history += wms
        # combine discounted rewards with punishments
        combined=np.array(returns)+np.array(wms)
        # add combined rewards to rewards history
        rewards_history += combined.tolist()
        episode_rewards.append(er) 
        # record game history for the next section
        allstates.append(ss)
        alloutcome.append(outcome)
    return action_probs_history,\
        rewards_history,episode_rewards

In [9]:
from collections import deque
import numpy as np

running_rewards=deque(maxlen=100)
gamma = 0.95  
episode_count = 0
batches=0
# Train the model
while True:
    with tf.GradientTape() as tape:
        if batches%2==0:
            action_probs_history,rewards_history,\
                episode_rewards=create_batch_red(batch_size)
        else:
            action_probs_history,rewards_history,\
                episode_rewards=create_batch_yellow(batch_size)            
                     
        # Calculating loss values to update our network        
        rets=tf.convert_to_tensor(rewards_history,\
                                  dtype=tf.float32)
        alosses=-tf.multiply(tf.convert_to_tensor(\
          action_probs_history,dtype=tf.float32),rets)
        # Backpropagation
        loss_value = tf.reduce_sum(alosses) 
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,\
                                  model.trainable_variables))
    # Log details
    episode_count += batch_size
    batches += 1
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 100 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop if the game is solved
    if running_reward>=0.9 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break
model.save("files/PG_conn.h5")  
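
The loss minimized in the training loop is the negative sum of each chosen action's log probability multiplied by its combined reward. The arithmetic can be checked by hand with a minimal sketch; the function name `policy_gradient_loss` and the numbers are hypothetical, and the standard `math` module stands in for TensorFlow, so no gradients are computed here.

```python
import math

def policy_gradient_loss(chosen_probs, returns):
    """Negative sum of log(probability of chosen action) times return."""
    return -sum(math.log(p) * g for p, g in zip(chosen_probs, returns))

# Hypothetical three-step episode: the middle step was an illegal move
chosen_probs = [0.5, 0.25, 0.5]   # probability the policy gave each action
returns = [0.95, -1.0, 1.0]       # discounted rewards plus penalties
loss = policy_gradient_loss(chosen_probs, returns)
print(round(loss, 4))
```

Minimizing this quantity raises the probability of actions paired with positive returns and lowers the probability of actions paired with negative returns, such as the illegal move in the middle step.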

In [10]:
import pickle

with open("files/PG_games_conn.p","wb") as f:
    pickle.dump((allstates,alloutcome),f)

# 2. Train a Value Network in Connect Four


## 2.1. How to Train a Value Network in Connect Four


## 2.2. Processing the Game Experience Data in Connect Four


In [11]:
with open("files/PG_games_conn.p","rb") as f:
    history,results=pickle.load(f)              
        
Xs=[]
ys=[]
for states, result in zip(history,results):
    for state in states:
        Xs.append(state)
        # one-hot label: index 0 = tie, 1 = win, 2 = loss
        if result==1:
            ys.append(np.array([0,1,0]))
        elif result==-1:
            ys.append(np.array([0,0,1]))       
        elif result==0:
            ys.append(np.array([1,0,0]))  
                            
Xs=np.array(Xs).reshape(-1,7,6,1)
ys=np.array(ys).reshape(-1,3)       
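
The labeling rule is that every board position in a game inherits that game's final outcome. The sketch below shows the rule on toy data; the helper `label_positions`, the `ONE_HOT` table, and the placeholder strings standing in for board arrays are all hypothetical.

```python
# Index 0 = tie, 1 = win, 2 = loss, matching the labels built above
ONE_HOT = {0: [1, 0, 0], 1: [0, 1, 0], -1: [0, 0, 1]}

def label_positions(games, outcomes):
    """Pair every position in a game with that game's final outcome."""
    xs, ys = [], []
    for states, outcome in zip(games, outcomes):
        for state in states:
            xs.append(state)
            ys.append(ONE_HOT[outcome])
    return xs, ys

# Two hypothetical games: a two-position win and a one-position loss
games = [["s1", "s2"], ["s3"]]
outcomes = [1, -1]
xs, ys = label_positions(games, outcomes)
print(ys)
# → [[0, 1, 0], [0, 1, 0], [0, 0, 1]]
```

Because early positions receive the same label as late ones, the labels are noisy for positions far from the end of a game, and the network has to average over many games to learn useful estimates.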

## 2.3. Train a Value Network in Connect Four



In [12]:
from tensorflow.keras.layers import Dense,Conv2D,Flatten
from tensorflow.keras.models import Sequential

value_model = Sequential()
value_model.add(Conv2D(filters=128, 
  kernel_size=(4,4),padding="same",activation="relu",
                 input_shape=(7,6,1)))
value_model.add(Flatten())
value_model.add(Dense(units=64, activation="relu"))
value_model.add(Dense(units=64, activation="relu"))
value_model.add(Dense(3, activation='softmax'))
value_model.compile(loss='categorical_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=0.0005), 
    metrics=['accuracy'])  

In [13]:
# Train the model for 100 epochs
value_model.fit(Xs, ys, epochs=100, verbose=1)
value_model.save('files/value_conn.h5') 

# 3. Play Connect Four with the Value Network
 

## 3.1. Best Moves


In [14]:
# design game strategy
from copy import deepcopy
from tensorflow.keras.models import load_model

value_model=load_model('files/value_conn.h5') 

def best_move(env):
    # Start below the lowest possible score (prob(win)-prob(lose) >= -1)
    bestoutcome=-2
    bestmove=None
    # go through all possible moves hypothetically
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        state=state.reshape(-1,7,6,1)
        # env.turn is still the player choosing the move, because
        # only the copy was stepped; flip the board for yellow
        if env.turn=="red":
            ps=value_model.predict(state,verbose=0)
        else:
            ps=value_model.predict(-state,verbose=0)
        # output is prob(win) - prob(lose)
        win_lose_dif=ps[0][1]-ps[0][2]
        if win_lose_dif>bestoutcome:
            # Update the bestoutcome
            bestoutcome = win_lose_dif
            # Update the best move
            bestmove = move
    return bestmove 
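
The selection rule inside `best_move` can be tried without a trained model. In this minimal sketch, the function `pick_move` and the predicted probabilities are hypothetical stand-ins for the value network's outputs after each candidate move.

```python
def pick_move(predictions):
    """predictions maps a move to [p_tie, p_win, p_loss]."""
    best_move, best_score = None, -2.0  # any real score is at least -1
    for move, (p_tie, p_win, p_loss) in predictions.items():
        score = p_win - p_loss
        if score > best_score:
            best_move, best_score = move, score
    return best_move, best_score

# Hypothetical value-network outputs for three candidate columns
predictions = {
    3: [0.2, 0.5, 0.3],
    4: [0.1, 0.7, 0.2],
    5: [0.6, 0.3, 0.1],
}
move, score = pick_move(predictions)
print(move, round(score, 2))
# → 4 0.5
```

Scoring by the difference between win and loss probabilities treats a likely tie as neutral: columns 3 and 5 both score 0.2 here even though column 5 is far more likely to end in a tie.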

## 3.2. Play Against the Policy Agent 


In [15]:
# test 100 games, alternating which agent moves first
results=[]
for i in range(100):
    state=env.reset()  
    # half the time, the policy network agent moves first
    if i%2==0:
        action=opponent(env)
        state,reward,done,_=env.step(action)           
    while True: 
        # move recommended by the value network
        action=best_move(env)  
        state,reward,done,_=env.step(action)
        # if value network wins, record 1
        if done:
            if reward==0:
                results.append(0)
                print("It's a tie!")
            else:
                results.append(1)
                print("The value network wins!")
            break
        # The policy network agent moves
        action=opponent(env)
        state,reward,done,_=env.step(action)     
        # if value network loses, record -1
        if done:
            if reward==0:
                results.append(0)
                print("It's a tie!")
            else:  
                results.append(-1)
                print("The value network loses!")
            break

The value network wins!
The value network wins!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network wins!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network wins!
The value network wins!
The value network loses!
The value network wins!
The value network loses!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
...

In [16]:
# count how many times the value network agent won
wins=results.count(1)
print(f"the value network agent has won {wins} games")
# count how many times the value network agent lost
losses=results.count(-1)
print(f"the value network agent has lost {losses} games")  
# count how many tie games
ties=results.count(0)
print(f"the game is tied {ties} times") 

the value network agent has won 67 games
the value network agent has lost 33 games
the game is tied 0 times
