## Copyright 2019 Google LLC.

In [0]:
#@title
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 5. DQN Techniques: Experience Replay and Target Networks

In the previous Colab, you trained a neural network on the results of every state transition. This approach tends to produce unstable training. In this Colab, you'll learn why training becomes unstable. Then, you'll learn the following two techniques that stabilize Deep Q-Network (DQN) training:

* experience replay
* target networks

## Disadvantages of Online DQN

In the previous Colab, every state transition generated a tuple, and you trained your agent on that tuple. Training your agent only on tuples generated by live training is called **online DQN**. Let's see why online DQN training is unstable.

The problem with online DQN is that the agent trains on a trajectory of states, and successive states in a trajectory are probably similar. Therefore, the input data can be correlated. However, in general, input data to a model must be [independent and identically distributed (i.i.d.)](https://developers.google.com/machine-learning/glossary/#iid). In practice, correlated input data means that the agent might not generalize well to other states, resulting in unstable training.

In general, neural network training relies on the assumption that data is i.i.d. In this Colab, you'll apply a technique called experience replay to satisfy this assumption.
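
To make the correlation concrete, here is a minimal sketch that is not part of the original Colab. It uses a hypothetical 1-D random walk as a stand-in environment and compares how far apart successive states are with how far apart states are when drawn uniformly at random from the stored history. The names `walk`, `consecutive_gap`, and `sampled_gap` are illustrative assumptions.

```python
import random

random.seed(0)

# Hypothetical stand-in environment: a random walk on the integers.
walk = [0]
for _ in range(5000):
    walk.append(walk[-1] + random.choice([-1, 1]))

# Successive states always differ by exactly 1, so they are strongly correlated.
consecutive_gap = sum(abs(a - b) for a, b in zip(walk, walk[1:])) / (len(walk) - 1)

# States sampled uniformly from the stored history are further apart on average.
pairs = [(random.choice(walk), random.choice(walk)) for _ in range(5000)]
sampled_gap = sum(abs(a - b) for a, b in pairs) / len(pairs)

print("mean gap between successive states:", consecutive_gap)
print("mean gap between randomly sampled states:", sampled_gap)
```

Sampling from a stored history does not make the data perfectly i.i.d., but it breaks the step-to-step similarity that a live trajectory has.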

## Setup

Run the following cell to set up Google Analytics for the Colab. Data from Google Analytics helps improve the Colab.

In [0]:
#@title Set up Google Analytics for Colab
%reset -f
import uuid
client_id = uuid.uuid4()

import requests

# Bundle up reporting into a function.
def report_execution():
  requests.post('https://www.google-analytics.com/collect', 
                data=('v=1'
                      '&tid=UA-48865479-3'
                      '&cid={}'
                      '&t=event'
                      '&ec=cell'            # <-- event type
                      '&ea=execute'         # <-- event action
                      '&el=rl-experience-replay-target-networks'   # <-- event label
                      '&ev=1'               # <-- event value
                      '&an=bundled'.format(client_id)))

from IPython import get_ipython
get_ipython().events.register('post_execute', report_execution)

Run the following cell to import libraries and set up the environment:

In [0]:
import gym
import time
import numpy as np
import matplotlib.pyplot as plt
import random
from tensorflow import keras
from collections import deque

CHECK_SUCCESS_INTERVAL = 100
EPSILON_MIN = 0.01

env = gym.make('FrozenLake-v0')

num_states = env.observation_space.n
num_actions = env.action_space.n

Run the following cell to define functions that perform the following tasks:

* Define the neural network.
* Calculate the Bellman update.
* Select an action.
* Check the agent's training for success.

These functions are identical to functions in the previous Colab.

In [0]:
#@title Run cell to define model, Bellman update, select action, and check success (expand to view code)

def one_hot_encode_state(state):
  """Args:
     state: An integer representing the agent's state.
   Returns:
     A one-hot encoded vector of the input `state`.
  """
  return np.identity(num_states)[state:state+1]

def compute_bellman_target(discount_factor, reward, model, state_next):
  '''Returns the updated return calculation given the reward and next state.
  Args:
    discount_factor: factor by which to reduce return from next state when
    updating Q-values using Bellman update.
    reward: reward from state transition.
    model: model used to predict Q-values
    state_next: next state after state transition.
  Returns:
    updated Q-value using Bellman update
  '''
  return reward + discount_factor * \
           np.max(model.predict(one_hot_encode_state(state_next)))

def define_model(learning_rate):
  '''Returns a shallow neural net defined using tf.keras.
  Args:
    learning_rate: optimizer learning rate
  Returns:
    model: A shallow neural net defined using tf.keras input dimension equal to
    num_states and output dimension equal to num_actions.
  '''
  model = keras.Sequential()
  model.add(keras.layers.Dense(input_dim = num_states,
                               units = num_actions,
                               activation = 'relu',
                               use_bias = False,
                               kernel_initializer = keras.initializers.RandomUniform(minval=1e-5, maxval=0.05)
                              )
           )
  model.compile(optimizer = keras.optimizers.SGD(lr = learning_rate),
                loss = 'mse'
               )
  print("======= Neural Network Summary =======")
  model.summary()  # summary() prints the table itself and returns None
  return model

learning_rate = 0.2
model = define_model(learning_rate)

def select_action(epsilon, state):
  """Select an action using the epsilon-greedy algorithm.
  Uses the global `model` to predict Q-values for `state`.
  Args:
    epsilon: Current value of epsilon used to select action using epsilon-greedy
             algorithm.
    state: An integer representing the agent's current state.
  Returns:
    action: action to take from the state.
  """
  if(np.random.rand() < epsilon):
    return np.random.randint(num_actions)
  q_values = model.predict(one_hot_encode_state(state))
  return np.argmax(q_values)

def check_success(episode, epsilon, reward_history, length_history, time_history, success_percent_threshold):
  if((episode+1) % CHECK_SUCCESS_INTERVAL == 0):
    # Check the success % in the last 100 episodes.
    # Slice with [-100:] so that the most recent episode is included.
    success_percent = np.sum(reward_history[-100:])
    length_avg = int(np.sum(length_history[-100:])/100.0)
    time_avg = np.sum(time_history[-100:])/100.0
    print("Episode: " + f"{episode:0>4d}" + \
          ", Success: " + f"{success_percent:2.0f}" + "%" + \
          ", Avg length: " + f"{length_avg:0>2d}" + \
          ", Epsilon: " + f"{epsilon:.2f}" + \
          ", Avg time(s): " + f"{time_avg:.2f}"
         )
    if(success_percent > success_percent_threshold):
      print("Agent crossed success threshold of " + str(success_percent_threshold) + '%.')
      return(1)
  return(0)

## Improving DQN with Experience Replay

In online DQN, each tuple is discarded after the agent trains on it. Instead, you can collect tuples in a buffer. The agent can then replay the stored state transitions and train on them without experiencing them again. This technique is called **experience replay**. The buffer storing the tuples is called a **replay buffer**.

To implement experience replay, the agent follows these steps on every state transition:

1. Save the transition's tuple $s, a, r, s'$ in the replay buffer.
1. Create a batch of tuples by sampling the buffer.
1. Train the neural network on the batch of tuples.

The following schematic shows these steps:

![A schematic showing the algorithm for implementing Experience Replay. The interaction of the agent with the environment generates tuples s, a, r, s'. The replay buffer stores these tuples. The agent samples a minibatch of tuples from the replay buffer and trains on this minibatch to update its policy. Then the agent interacts with the environment to generate more tuples. This loop between the agent, environment, replay buffer, and the training shows the algorithm for experience replay.](https://developers.google.com/machine-learning/reinforcement-learning/images/experience-replay.png)

Implement the first step by creating a replay buffer using a Python [deque](https://docs.python.org/2/library/collections.html#collections.deque). Set the buffer size to 2000. Later in this Colab, you'll see why the buffer size is 2000.

In [0]:
replay_buffer_size = 2000
replay_buffer = deque(maxlen = replay_buffer_size)
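
As a quick illustration of why a bounded `deque` suits a replay buffer, the following sketch shows that appending to a full `deque` silently drops the oldest entry. The name `tiny_buffer` and the size of 3 are assumptions chosen only to keep the output short.

```python
from collections import deque

# A tiny buffer (maxlen=3 is for illustration only; the Colab uses 2000).
tiny_buffer = deque(maxlen=3)
for step in range(5):
    # Placeholder tuples in (state, action, reward, state_next) order.
    tiny_buffer.append((step, 0, 0.0, step + 1))

# Only the 3 most recent transitions remain; the oldest were dropped.
print(list(tiny_buffer))
```

Because old tuples fall out automatically, the buffer size directly controls how far back in training the agent's sampled experience reaches.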

Collect transitions by using a random policy for a few episodes:

In [0]:
for episode in range(3):
  state = env.reset()
  done = False
  while not done:
    action = env.action_space.sample()
    state_next, reward, done, _ = env.step(action)
    replay_buffer.append((state, action, reward, state_next))
    state = state_next

print(replay_buffer)

Implement experience replay by defining a function to sample a batch from `replay_buffer` and train the agent on every tuple in the batch. Vectorize the code to train the model on the entire batch because training the model on a single tuple at a time is slow.

In [0]:
def sample_from_replay_buffer_and_train_model(replay_buffer, batch_size, model, discount_factor):
  '''Samples a batch from the buffer and trains the agent on the batch.
  
  Unpacks feature data from tuples of (state, action, reward, state_next).
  Encodes states as one-hot vectors and stacks these vectors into a matrix.
  Creates matrix of target Q-values. Uses both matrices to train model in one
  call for faster training.
  
  Args:
    replay_buffer: deque containing recorded tuples.
    batch_size: integer specifying training batch size.
    model: neural network representing agent.
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
  Returns:
    model: neural network trained on sampled batch.
  '''
  if(len(replay_buffer) > batch_size):
    batch = random.sample(replay_buffer, batch_size)
    # extract s, a, r, s' from tuples into vectors
    states = [item[0] for item in batch]
    actions = [item[1] for item in batch]
    rewards = [item[2] for item in batch]
    states_next = [item[3] for item in batch]
    # encode states as a matrix of one-hot vectors
    one_hot_encoded_states = np.empty(shape=(0,num_states))
    for state in states:
      one_hot_encoded_states = np.vstack((one_hot_encoded_states, one_hot_encode_state(state)))
    # predict Q-values and update predictions using Bellman update
    target_q_values = model.predict(one_hot_encoded_states) # TODO. This TODO is
            # a placeholder. You'll fill in code later, in Part 2 of this Colab.
    for i in range(len(states)):
      target_q_values[i, actions[i]] = compute_bellman_target(discount_factor, rewards[i], model, states_next[i])
    # now, you can run the following training step without a loop
    model.fit(one_hot_encoded_states, target_q_values, epochs = 1, verbose = False)
  return model
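
The function above still calls `compute_bellman_target`, and therefore `model.predict`, once per tuple. The following is a minimal sketch of how the target computation could be batched into a single prediction call. It is not the Colab's implementation: `predict_q` is a hypothetical stand-in for `model.predict`, backed by a NumPy weight matrix so that the example runs without TensorFlow, and `batch_bellman_targets` is an illustrative name. Like the Colab's code, the sketch does not treat terminal transitions specially.

```python
import numpy as np

num_states, num_actions = 16, 4  # FrozenLake's 4x4 grid, as in the Colab
rng = np.random.default_rng(0)
weights = rng.uniform(1e-5, 0.05, size=(num_states, num_actions))

def predict_q(one_hot_states):
    """Hypothetical stand-in for model.predict: one row of Q-values per state."""
    return np.maximum(one_hot_states @ weights, 0.0)  # ReLU, no bias

def batch_bellman_targets(batch, discount_factor):
    """Builds the training inputs and targets for a whole batch at once."""
    states, actions, rewards, states_next = map(np.array, zip(*batch))
    eye = np.identity(num_states)
    inputs = eye[states]          # one-hot rows, shape (batch, num_states)
    targets = predict_q(inputs)   # start from the current predictions
    # One prediction call covers every next state in the batch.
    best_next = predict_q(eye[states_next]).max(axis=1)
    # Overwrite only the Q-value of the action that was actually taken.
    targets[np.arange(len(batch)), actions] = rewards + discount_factor * best_next
    return inputs, targets

batch = [(0, 1, 0.0, 4), (4, 2, 0.0, 5), (14, 2, 1.0, 15)]
inputs, targets = batch_bellman_targets(batch, discount_factor=0.95)
print(inputs.shape, targets.shape)
```

Indexing the identity matrix with an array of states also replaces the `np.vstack` loop, because each selected row is already the one-hot encoding of that state.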

Train the agent on `replay_buffer` by running the following cell. Compare the Q-values, and therefore the best action, for state 0 before and after training.

In [0]:
batch_size = 8
discount_factor = 0.95
print("Q-values for state 0 -")
print("Before training epoch:", model.predict(one_hot_encode_state(0)))

model = sample_from_replay_buffer_and_train_model(replay_buffer, batch_size, model, discount_factor)

print("After training epoch: ", model.predict(one_hot_encode_state(0)))

To summarize, on every state transition, the agent follows these steps:

* Saves the tuple from the state transition to the buffer.
* Samples a batch of tuples from `replay_buffer` and trains on the batch.

## Train and Evaluate DQN

Training with experience replay is slow. This slowness restricts how much you can explore the hyperparameter space. Follow these steps:

1. From the previous Colab, copy the values for `eps_decay`, `discount_factor`, `episodes`, and `learning_rate`.
1. Set `replay_buffer_size` to an initial value. How can you estimate such a value?
1. `batch_size` is typically 16, 32, or 64. These are standard values in DQN. However, because FrozenLake is a simple environment, set `batch_size = 8` for faster training.

Run the cell and experiment with hyperparameter values to train the agent. How does training with experience replay compare with training with online DQN? Expand the following section for a discussion.

In [0]:
# Hyperparameters
epsilon = 1.0
eps_decay = 0.99
discount_factor = 0.999
episodes = 5000
learning_rate = 0.5
replay_buffer_size = 2000
batch_size = 8
# TODO. This TODO is a placeholder. You'll fill in code later,
# in Part 2 of this Colab.

# Parameters & model
success_percent_threshold = 20 # in percent, so 20 = 20%
model = define_model(learning_rate)
# TODO. This TODO is a placeholder. You'll fill in code later,
# in Part 2 of this Colab.
replay_buffer = deque(maxlen = replay_buffer_size) # create new replay_buffer

# Training metrics
length_history = []
reward_history = []
time_history = []

# Test if parameter values are valid
assert eps_decay < 1.0 and eps_decay > 0.
assert success_percent_threshold > 9 # agent could reach 9% randomly

print("======= Begin Training =======")
for episode in range(episodes):
  state = env.reset()
  done = False
  episode_reward = 0
  episode_length = 0
  episode_time_start = time.time()
  while not done:
    episode_length += 1
    action = select_action(epsilon, state)
    state_next, reward, done, _ = env.step(action)
    replay_buffer.append((state, action, reward, state_next))
    model = sample_from_replay_buffer_and_train_model(
        replay_buffer, batch_size, model, discount_factor)
    # TODO. This TODO is a placeholder. You'll fill in code later,
    # in Part 2 of this Colab.
    episode_reward += reward
    state = state_next

  # Epsilon decays once per episode, here rather than inside
  # sample_from_replay_buffer_and_train_model. Possible edge case: epsilon
  # starts decreasing before the model starts training, because the buffer
  # doesn't yet hold enough tuples for a batch.
  if epsilon > EPSILON_MIN:
    epsilon *= eps_decay
  length_history.append(episode_length)
  reward_history.append(episode_reward)
  time_history.append(time.time() - episode_time_start)
  
  if check_success(episode, epsilon, reward_history, length_history, time_history, success_percent_threshold):
    break
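
The training loop above multiplies `epsilon` by `eps_decay` once per episode until `epsilon` reaches `EPSILON_MIN`. As a rough aid for choosing `eps_decay`, the following sketch counts how many episodes that takes. The helper name `episodes_until_min` is an assumption and is not part of the Colab.

```python
def episodes_until_min(eps_decay, epsilon=1.0, epsilon_min=0.01):
    """Counts per-episode decays until epsilon stops decreasing."""
    episodes = 0
    while epsilon > epsilon_min:
        epsilon *= eps_decay
        episodes += 1
    return episodes

for decay in (0.99, 0.999):
    print(decay, episodes_until_min(decay))
```

With these defaults, a decay of 0.99 reaches the floor after roughly 460 episodes and a decay of 0.999 after roughly 4,600, so `eps_decay` is best chosen together with `episodes`.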

### Discussion (expand to view)

Replay buffer size is a balance between weighing new trajectories vs. old trajectories. As your agent improves, new trajectories are probably more rewarding than old trajectories. However, using old trajectories makes your training more stable because your agent trains on more diverse data.

Here, each episode has a length of about 7. The agent's initial success rate is about 2%. To ensure you have at least a few successful episodes in your memory, estimate a replay buffer containing about 200 episodes. 200 episodes are equivalent to about $200\cdot7 = 1400$ state transitions. Any buffer size in that range is okay.
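
The estimate above can be checked with a few lines of arithmetic. The variable names are illustrative, and the figures are the approximate values quoted in the text:

```python
episode_length = 7           # approximate transitions per episode
initial_success_rate = 0.02  # approximate success rate of the untrained agent
episodes_in_buffer = 200     # number of episodes the buffer should span

expected_successes = initial_success_rate * episodes_in_buffer
transitions_needed = episodes_in_buffer * episode_length

print("expected successful episodes in buffer:", expected_successes)
print("transitions needed:", transitions_needed)
```

About 4 of the 200 episodes are expected to succeed, and the buffer size of 2000 used in this Colab sits comfortably above the estimate of 1400 transitions.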

Hyperparameter values that let the agent solve the environment are:
* `epsilon = 1.0`
* `eps_decay = 0.999`
* `discount_factor = 0.99`
* `episodes = 2000`
* `learning_rate = 0.2`
* `replay_buffer_size = 2000`
* `batch_size = 8`

Observations from training:
* Training using experience replay is slower because you're training on a batch of tuples instead of a single tuple.
* When compared to the previous Colab, your agent solves the environment in approximately the same number of episodes. Possible causes are:
  * FrozenLake is not a complex enough environment for experience replay to be advantageous.
  * The hyperparameters are not correctly optimized.


## Visualize Performance of Trained Model

Reading the training metrics is one thing, but watching your agent succeed at retrieving the frisbee is another. Run the following code to visualize your agent solving `FrozenLake`.

In [0]:
from IPython.display import clear_output # to clear output on every episode run

state = env.reset()
done = False

epsilon = 0. # greedy policy
while(not(done)):
  action = select_action(epsilon, state)
  state_new, reward, done, _ = env.step(action)
  state = state_new
  clear_output()
  env.render()
  time.sleep(1.0)

## Advantages of Experience Replay

The advantages of experience replay over online DQN are as follows:

* Makes training more stable by training on batches of tuples instead of single tuples.
* Allows the agent to generalize better by remembering past experience.

However, experience replay does not fully address the instability in DQN. The next section describes another technique to stabilize DQN training—target networks.

## Target Networks

When you train the neural network using the Bellman update, you're calculating the target Q-values for training using the neural network itself. Because the neural network trains using its own predictions, you create a feedback loop. Changes in the neural network's predictions can reinforce each other because the neural network tries to target its own fluctuating Q-values.

The effect of fluctuations in target Q-values is magnified because the Q-values for a state depend on Q-values of successive states. Hence, changes in a state's Q-value can lead to changes in previous states' Q-values.

To break the feedback loop, calculate target Q-values using a separate neural network, called a **target network**. To stabilize training, let the target network follow the main neural network only slowly. The simplest approach is to copy the main network's weights to the target network every $N$ steps. Alternatively, on every step, move the target network's weights a small amount toward the main network's weights.
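
This Colab implements the periodic copy. For comparison, here is a minimal sketch of the alternative, in which every step nudges the target weights toward the main weights. The function name `soft_update` and the mixing factor `tau` are assumptions introduced for illustration; they do not appear in the Colab.

```python
import numpy as np

def soft_update(target_weights, main_weights, tau):
    """Moves each target weight array a fraction tau toward the main weights."""
    return [(1.0 - tau) * t + tau * m for t, m in zip(target_weights, main_weights)]

# Lists of arrays, the same layout that Keras `get_weights()` returns.
main_weights = [np.ones((16, 4))]
target_weights = [np.zeros((16, 4))]

for _ in range(100):
    target_weights = soft_update(target_weights, main_weights, tau=0.05)

# The target weights have approached, but not jumped to, the main weights.
print(target_weights[0][0, 0])
```

With Keras models, the same idea could be applied as `target_network.set_weights(soft_update(target_network.get_weights(), model.get_weights(), tau))`. A smaller `tau` makes the target network change more slowly.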

The following schematic shows Q-learning with experience replay and target networks:

![The following schematic shows the steps in the Q-learning algorithm when Q-learning is enhanced with these two techniques: experience replay and target networks. This schematic builds on the previous schematic for experience replay by adding an additional component. The new component is the target network. The agent uses the target network instead of the main network to predict Q-values. However, the agent continues to train only the main network. As the main network is trained, the agent slowly updates the weights of the target network from the main network. ](https://developers.google.com/machine-learning/reinforcement-learning/images/experience-replay-with-target-networks.png)

Run the following cell to define a function that updates the target network to the main neural network at a fixed interval of episodes:

In [0]:
def update_target_network(
    episode, update_target_network_interval, main_network, target_network):
  '''Updates the target network every `update_target_network_interval`
  episodes by copying the main network's weights to the target network.
  
  Args:
    episode: integer representing episode number in agent's training.
    update_target_network_interval: integer representing interval of episodes
      on which `target_network` is updated to `main_network`.
    main_network: main neural network used to choose actions and train.
    target_network: neural network used to predict Q-values.
  Returns:
    the `target_network`, whether updated or not.
  '''
  if((episode+1) % update_target_network_interval == 0):
    target_network.set_weights(main_network.get_weights())
  return target_network
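
To see which episodes trigger a copy, the following sketch runs the same logic against a hypothetical `FakeNetwork` class that only mimics the Keras `get_weights` and `set_weights` methods. The class and the interval of 10 are assumptions made so that the example runs without TensorFlow.

```python
class FakeNetwork:
    """Hypothetical stand-in that mimics Keras get_weights/set_weights."""
    def __init__(self, weights):
        self.weights = list(weights)
    def get_weights(self):
        return list(self.weights)
    def set_weights(self, weights):
        self.weights = list(weights)

def update_target_network(episode, update_target_network_interval,
                          main_network, target_network):
    # Same logic as the function defined above.
    if (episode + 1) % update_target_network_interval == 0:
        target_network.set_weights(main_network.get_weights())
    return target_network

main_network = FakeNetwork([0.0])
target_network = FakeNetwork([-1.0])
sync_episodes = []
for episode in range(30):
    main_network.set_weights([float(episode)])  # pretend training changed the weights
    target_network = update_target_network(episode, 10, main_network, target_network)
    if target_network.get_weights() == main_network.get_weights():
        sync_episodes.append(episode)

print(sync_episodes)
```

Because episodes are numbered from 0, the copy happens on episodes 9, 19, and 29, that is, at the end of every tenth episode.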

The remaining steps consist of editing previously defined code to implement target networks.

1. Add a hyperparameter to control the interval for the target network update:
  
  a. Go to this [line](#scrollTo=dyqN5EQuhEqx&line=14&uniqifier=1) marked by `TODO`.
  
  b. Set this hyperparameter.

  > `update_target_network_interval = 10`

1. Define the target network on this [line](#scrollTo=dyqN5EQuhEqx&line=14&uniqifier=1) marked by `TODO`. Insert this code:

  > `target_network = define_model(learning_rate)`

1. Update the target network:

  a. Go to this [cell](#scrollTo=dyqN5EQuhEqx&line=1&uniqifier=1).
  
  b. Insert the call to `update_target_network` at the appropriate place.

1. Predict Q-values by using `target_network` instead of `model`:

  a. Go to this [line](#scrollTo=Evho_UrWhEqn&line=30) marked by `#TODO`. You are in the function definition for `sample_from_replay_buffer_and_train_model`.
  
  b. Edit the line to predict target Q-values using `target_network` instead of `model`.
  
  c. Similarly, edit the following call to `compute_bellman_target` to use `target_network` instead of `model`.
  
  d. In the function's argument list, append the argument `target_network`. Accordingly, update the call to `sample_from_replay_buffer_and_train_model`.
 

### Solution Steps (expand to view)

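1. The call to calculate target Q-values in `sample_from_replay_buffer_and_train_model` should read:
  > `target_q_values = target_network.predict(one_hot_encoded_states)`
2. The call to `compute_bellman_target` in the same function should pass `target_network` instead of `model`:
  > `target_q_values[i, actions[i]] = compute_bellman_target(discount_factor, rewards[i], target_network, states_next[i])`
3. To update the target network, place the following call right after the call to train the model:
  > `target_network = update_target_network(episode, update_target_network_interval, model, target_network)`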
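
Putting the edits together, the training function might look roughly like the following sketch. It is one possible reading of the steps, not the official solution. So that the example runs without TensorFlow, `TableNetwork` is a hypothetical stand-in for a Keras model whose `fit` simply overwrites table rows, the function is renamed `train_with_target`, and `num_states` is passed as an argument instead of being read from a global.

```python
import random
from collections import deque
import numpy as np

class TableNetwork:
    """Hypothetical stand-in for a Keras model, backed by a NumPy table."""
    def __init__(self, num_states, num_actions, value):
        self.table = np.full((num_states, num_actions), value, dtype=float)
    def predict(self, inputs):
        return inputs @ self.table
    def fit(self, inputs, targets, epochs=1, verbose=False):
        # Overwrite rows with their targets instead of running gradient descent.
        self.table[inputs.argmax(axis=1)] = targets

def train_with_target(replay_buffer, batch_size, model, discount_factor,
                      target_network, num_states):
    if len(replay_buffer) > batch_size:
        batch = random.sample(replay_buffer, batch_size)
        eye = np.identity(num_states)
        inputs = np.vstack([eye[s:s + 1] for s, _, _, _ in batch])
        # Baseline Q-values come from the target network.
        target_q_values = target_network.predict(inputs)
        for i, (_, action, reward, state_next) in enumerate(batch):
            # The Bellman target also uses the target network.
            next_q = target_network.predict(eye[state_next:state_next + 1])
            target_q_values[i, action] = reward + discount_factor * np.max(next_q)
        # Only the main network is trained.
        model.fit(inputs, target_q_values, epochs=1, verbose=False)
    return model

random.seed(0)
model = TableNetwork(16, 4, value=0.0)
target_network = TableNetwork(16, 4, value=0.5)
replay_buffer = deque(((s, 0, 0.0, s + 1) for s in range(10)), maxlen=2000)
model = train_with_target(replay_buffer, 4, model, 0.9, target_network, 16)
print(model.table[:10])
```

After one call, only the main network's table has changed. The target network keeps its old values until `update_target_network` copies the main network's weights into it.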


## Conclusion and Next Steps

You learned how to stabilize neural network training by using the following techniques:

* experience replay
* target networks

These two techniques are building blocks in the success of modern deep Q-learning programs.

Congratulations! You've completed the course Colabs. Return to the course [landing page](https://developers.google.com/machine-learning/reinforcement-learning/) to explore the TensorFlow library for Reinforcement Learning.