# Chapter 13: Deep Reinforcement Learning



AlphaGo employs self-play and deep reinforcement learning to enhance its policy network, making it even more robust. During self-play, AlphaGo accumulates game experience data, which is then used to train a powerful value network. This network predicts the outcome of the game, a critical component in AlphaGo's strategy when competing against the renowned Go player, Lee Sedol. This chapter will guide you through understanding and implementing deep reinforcement learning, particularly the policy gradient method, in a coin game context. You'll explore the concept of a policy and how deep neural networks can be utilized to train a policy for intelligent gameplay.

Previously, in Chapter 12, the focus was on value-based reinforcement learning for mastering the coin game. This involved learning the Q-value for each action in a state, denoted as $Q(a|s)$. After training the reinforcement model, decisions in the game were made based on choosing actions with the highest $Q(a|s)$ value in a given state $s$.

Now, this chapter introduces policy-based reinforcement learning, where decisions are driven by a policy $\pi(a|s)$ that tells the agent which action to take in each state $s$. Policies can be deterministic, prescribing an action with absolute certainty, or stochastic, offering a probability distribution over the possible actions.
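
The difference between the two kinds of policy can be illustrated with a minimal sketch. The probabilities below are made-up values for a single state with two actions labelled 1 and 2 (the action labels used later in this chapter); `deterministic_policy` and `stochastic_policy` are hypothetical helper names, not part of the chapter's code.

```python
import random

# Assumed (made-up) action probabilities for one state
probs = {1: 0.25, 2: 0.75}

def deterministic_policy(probs):
    """Always return the single most probable action."""
    return max(probs, key=probs.get)

def stochastic_policy(probs, rng):
    """Sample an action according to the probability distribution."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
greedy = deterministic_policy(probs)            # always action 2
samples = [stochastic_policy(probs, rng) for _ in range(1000)]
share_of_2 = samples.count(2) / len(samples)    # close to 0.75
```

The deterministic policy picks the same action every time, while the stochastic policy picks action 2 in roughly three quarters of the draws and still tries action 1 occasionally.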

In the policy gradients method, the agent engages in numerous game sessions to learn the optimal policy. The agent bases its actions on the model's predictions, observes the resulting rewards, and adjusts the model parameters to align predicted action probabilities with desired probabilities. If a predicted action probability is lower than desired, the model weights are tweaked to increase the prediction, and vice versa.
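
The direction of that adjustment can be sketched with a tiny softmax policy over two actions. This is an illustration under stated assumptions rather than the chapter's training code: the policy is just two logits instead of a neural network, and `softmax` and `reinforce_step` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, ret, lr=0.1):
    """One gradient-ascent step on ret * log pi(action).

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [z + lr * ret * ((1.0 if k == action else 0.0) - p)
            for k, (z, p) in enumerate(zip(logits, probs))]

logits = [0.0, 0.0]
p_before = softmax(logits)[1]                     # 0.5

# Action 1 was followed by a positive return: its probability rises
p_up = softmax(reinforce_step(logits, action=1, ret=1.0))[1]

# Action 1 was followed by a negative return: its probability falls
p_down = softmax(reinforce_step(logits, action=1, ret=-1.0))[1]
```

An action followed by a positive return becomes more likely, and an action followed by a negative return becomes less likely.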

Additionally, this chapter delves into self-play, a technique that strengthens the deep reinforcement learning agent by pitting it against an incrementally stronger version of itself. Self-play generates a wealth of game experience, which is then used to train a value network that predicts the game outcome from the current game state. An outcome of $-1$ indicates a win for the second player, and $1$ a win for the first player. You'll also learn how to use the trained value network for strategic gameplay.

# 1. The Policy Gradients Method

## 1.1. What Is a Policy?



## 1.2. What Is the Policy Gradient Method?



# 2. Use Policy Gradients to Play the Coin Game

## 2.1. Use a Network to Define the Policy
 

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

num_inputs = 22
num_actions = 2
# The input layer
inputs = layers.Input(shape=(num_inputs,))
# The common layer
common = layers.Dense(32, activation="relu")(inputs)
common = layers.Dense(32, activation="relu")(common)
# The policy layer (the output layer)
action = layers.Dense(num_actions, activation="softmax")(common)
# Put together the policy network
model = keras.Model(inputs=inputs, outputs=action)

In [2]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)

## 2.2. Calculate Gradients and Discounted Rewards


In [3]:
import numpy as np

def onehot_encoder(state):
    onehot=np.zeros((1,22))
    onehot[0,state]=1
    return onehot
# the trained strong policy model
trained=keras.models.load_model('files/strong_coin.h5')
def opponent(state):
    onehot_state=onehot_encoder(state)
    policy=trained(onehot_state)
    return np.random.choice([1,2],p=np.squeeze(policy))

In [4]:
from utils.coin_simple_env import coin_game
import tensorflow as tf

env=coin_game()
def playing():
    # create lists to record game history
    action_probs_history = []
    rewards_history = []
    state = env.reset()
    episode_reward = 0
    # the strong policy network agent moves first
    state, reward, done, _ = env.step(opponent(state))
    # record all game states 
    states = []    
    while True:
        # convert state to onehot to feed to model
        onehot_state = onehot_encoder(state)
        # estimate action probabilities 
        action_probs = model(onehot_state)
        # select action based on the policy distribution
        action=np.random.choice(num_actions, 
                            p=np.squeeze(action_probs))
        # record log probabilities
        action_probs_history.append(
            tf.math.log(action_probs[0, action]))
        # Apply the sampled action in our environment
        # Remember to add 1 to action (change 0 or 1 to 1 or 2)
        state, reward, done, _ = env.step(action+1)
        states.append(state)
        if done:
            # the PG agent is player 2; flip the sign so that
            # a reward of 1 means the PG agent wins
            reward = -reward
            rewards_history.append(reward)
            episode_reward += reward 
            break
        else:
            state, reward, done, _ = env.step(opponent(state))
            reward = -reward
            rewards_history.append(reward)
            episode_reward += reward                 
            if done:
                break
    return action_probs_history,\
            rewards_history, episode_reward, states, reward

In [5]:
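def discount_rs(r):
    # gamma is the global discount factor set in the training cell
    discounted_rs = np.zeros(len(r))
    running_add = 0
    # work backward so each step accumulates discounted future rewards
    for i in reversed(range(len(r))):
        running_add = gamma*running_add + r[i]
        discounted_rs[i] = running_add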
    return discounted_rs.tolist()
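
A small worked example may make the backward recursion in `discount_rs` easier to follow. The sketch below re-implements the same recursion in plain Python, with the discount factor passed in as an argument rather than read from a global; `discounted_returns` is a hypothetical name and the reward sequence is made up.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards at every time step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward: the return at step i is r[i] + gamma * (return at i+1)
    for i in reversed(range(len(rewards))):
        running = gamma * running + rewards[i]
        returns[i] = running
    return returns

# Three agent moves; only the last one earns a reward of 1
returns = discounted_returns([0, 0, 1], gamma=0.95)
# returns is approximately [0.9025, 0.95, 1.0]
```

Earlier moves receive a smaller share of the final reward: the move two steps before the win is credited with $0.95^2 = 0.9025$.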

## 2.3. Update Parameters


In [6]:
batch_size=10
allstates=[]
alloutcome=[]
def create_batch(batch_size):
    action_probs_history = []
    rewards_history = []
    episode_rewards = []
    for i in range(batch_size):
        aps,rs,er,ss,outcome = playing()
        returns = discount_rs(rs)
        action_probs_history += aps
        rewards_history += returns
        episode_rewards.append(er)
        # record game history for the next section
        allstates.append(ss)
        alloutcome.append(outcome)
    return action_probs_history,\
        rewards_history,episode_rewards

In [7]:
from collections import deque

running_rewards=deque(maxlen=100)
gamma = 0.95  
episode_count = 0
# Train the model
while True:
    with tf.GradientTape() as tape:
        action_probs_history,\
        rewards_history,episode_rewards=create_batch(batch_size)
        # Calculating loss values to update our network
        history = zip(action_probs_history, rewards_history)
        actor_losses = []
        for log_prob, ret in history:
            # Calculate actor loss
            actor_losses.append(-log_prob * ret)
        # Sum the per-step losses over the whole batch
        loss_value = sum(actor_losses)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads,
                                  model.trainable_variables))

    # Log details
    episode_count += batch_size
    for r in episode_rewards:
        running_rewards.append(r)
    running_reward=np.mean(np.array(running_rewards)) 
    # print out progress
    if episode_count % 100 == 0:
        template = "running reward: {:.6f} at episode {}"
        print(template.format(running_reward, episode_count))   
    # Stop if the game is solved
    if running_reward > 0.999 and episode_count>100:  
        print("Solved at episode {}!".format(episode_count))
        break
model.save("files/PG_coin.h5")

running reward: -0.980000 at episode 100
running reward: -0.940000 at episode 200
running reward: -0.960000 at episode 300
running reward: -1.000000 at episode 400
running reward: -0.980000 at episode 500
running reward: -0.920000 at episode 600
running reward: -0.840000 at episode 700
running reward: 0.100000 at episode 800
running reward: 0.920000 at episode 900
Solved at episode 960!
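
Read as a formula, the loss that the training loop above minimizes appears to be the following. The symbols $\theta$ (the network weights), $G_t$ (the discounted return from `discount_rs`), and $r_t$ (the reward recorded after the agent's move $t$) are notation introduced here for illustration.

$$
L(\theta) = -\sum_{t} G_t \, \log \pi_\theta(a_t | s_t),
\qquad
G_t = \sum_{k \ge 0} \gamma^{k} \, r_{t+k}
$$

The outer sum runs over every move the agent made in the batch of games, and each $G_t$ is computed within a single game. Minimizing $L(\theta)$ raises the log probability of moves followed by positive returns and lowers it for moves followed by negative returns.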


# 3. Train a Value Network


## 3.1. Plans to Train a Value Network


## 3.2. Process the Game Experience Data


In [8]:
Xs=[]
ys=[]
for states, result in zip(allstates,alloutcome):
    for state in states:
        onehot_state=onehot_encoder(state)
        Xs.append(onehot_state)
        if result==1:
            # player 2 (the PG agent) wins
            ys.append(np.array([1,0]))
        elif result==-1:
            # player 2 (the PG agent) loses
            ys.append(np.array([0,1]))

Xs=np.array(Xs).reshape(-1,22)
ys=np.array(ys).reshape(-1,2)      

## 3.3. Train a Value Network


In [9]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

value_network = Sequential()
value_network.add(Dense(units=64,activation="relu",
                 input_shape=(22,)))
value_network.add(Dense(32, activation="relu"))
value_network.add(Dense(16, activation="relu"))
value_network.add(Dense(2, activation='softmax'))
value_network.compile(loss='categorical_crossentropy',
   optimizer=keras.optimizers.Adam(learning_rate=0.0001),
   metrics=['accuracy'])

In [10]:
# Train the model for 100 epochs
value_network.fit(Xs, ys, epochs=100, verbose=1)
value_network.save('files/value_coin.h5')

Epoch 1/100
...

# 4. Play the Coin Game with the Value Network


## 4.1. Best Moves Based on the Value Network


In [11]:
# design game strategy
from copy import deepcopy

def best_move(env):
    # Set the initial value of bestoutcome
    bestoutcome=-2
    bestmove=None
    # go through all possible moves hypothetically
    for move in env.validinputs:
        env_copy=deepcopy(env)
        state,reward,done,info=env_copy.step(move)
        onehot_state=onehot_encoder(state)
        ps=value_network.predict(onehot_state,verbose=0)
        # ps[0][0] is the predicted probability that the player
        # who just moved wins (player 2 in the training data)
        win_prob=ps[0][0]
        if win_prob>bestoutcome:
            # Update the bestoutcome
            bestoutcome = win_prob
            # Update the best move
            bestmove = move
    return bestmove
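
The one-step lookahead in `best_move` can be tried without a trained network. The sketch below makes several assumptions: `ToyCoinEnv` is a hypothetical stand-in for the chapter's environment (a pile of coins, moves that remove 1 or 2 coins, and a win for whoever takes the last coin), and `toy_value` is a hand-written stub that plays the role of `value_network`.

```python
from copy import deepcopy

class ToyCoinEnv:
    """Hypothetical stand-in environment with assumed rules."""
    def __init__(self, coins=21):
        self.state = coins
        self.validinputs = [1, 2]

    def step(self, move):
        # Remove the coins; the game ends when the pile is empty
        self.state -= move
        done = self.state <= 0
        return self.state, 0, done, {}   # reward is a placeholder here

def toy_value(state):
    """Stub value function: win estimate for the player who just moved."""
    return 1.0 if state % 3 == 0 else 0.0

def best_move_sketch(env, value_fn):
    """Try every move on a copy of the game and keep the best-valued one."""
    best_value, best = float("-inf"), None
    for move in env.validinputs:
        trial = deepcopy(env)            # the real game is left untouched
        state, _, _, _ = trial.step(move)
        value = value_fn(state)
        if value > best_value:
            best_value, best = value, move
    return best

move_from_20 = best_move_sketch(ToyCoinEnv(20), toy_value)   # 2, leaving 18
move_from_19 = best_move_sketch(ToyCoinEnv(19), toy_value)   # 1, leaving 18
```

Working on a deep copy is the key design choice: each candidate move is evaluated hypothetically, so the live game state only changes when the chosen move is actually played.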

## 4.2. Against the Rule-Based AI


In [12]:
import random

def rule_based_AI(env):
    if env.state%3 != 0:
        move = env.state%3
    else:
        move = random.choice([1,2])
    return move 
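
To see why the rule of leaving a multiple of three is so strong, the strategy can be simulated in a few lines. The sketch assumes the following rules, which are not spelled out in this section: the pile starts with 21 coins, each move removes 1 or 2 coins, and whoever takes the last coin wins. `rule_based_move` and `play_one_game` are hypothetical names.

```python
import random

def rule_based_move(coins):
    """Leave a multiple of 3 when possible; otherwise move randomly."""
    return coins % 3 if coins % 3 != 0 else random.choice([1, 2])

def play_one_game(rng, start=21):
    """A random player moves first; return True if the rule-based player wins."""
    coins = start
    while True:
        # Random first player, never taking more coins than remain
        coins -= rng.choice([m for m in (1, 2) if m <= coins])
        if coins == 0:
            return False
        # Rule-based second player
        coins -= rule_based_move(coins)
        if coins == 0:
            return True

rng = random.Random(0)
wins = sum(play_one_game(rng) for _ in range(100))
# wins == 100 under the assumed rules
```

Because the starting pile is a multiple of three, the first player can never leave a multiple of three, so under these assumptions the rule-based second player always takes the last coin.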

In [13]:
# test ten games
for i in range(10):
    state=env.reset()  
    # The AI player moves first
    action=rule_based_AI(env)
    state,reward,done,_=env.step(action)
    while True: 
        # move recommended by the value network
        action=best_move(env)  
        state,reward,done,_=env.step(action)
        if done:
            print("The value network wins!")
            break
        # The AI player moves
        action=rule_based_AI(env)
        state,reward,done,_=env.step(action)     
        if done:
            print("The value network loses!")
            break

The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!


## 4.3. Against Random Moves


In [14]:
# does the strategy work if the agent moves first?
# test ten games against random moves
for i in range(10):
    state=env.reset()  
    while True: 
        # move recommended by the value network
        action=best_move(env)  
        state,reward,done,_=env.step(action)
        if done:
            print("The value network wins!")
            break
        # The random player moves
        action=random.choice(env.validinputs)
        state,reward,done,_=env.step(action)     
        if done:
            print("The value network loses!")
            break

The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
The value network wins!
