## Copyright 2019 Google LLC.

In [0]:
#@title
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 4. Deep Q-Learning

In this Colab, you will combine Q-learning with neural networks to create a powerful technique, called Deep Q-Learning (DQN).

## Motivation

In the last Colab, you learned tabular Q-learning. Your Q-table required an entry for every combination of state and action. However, for complex environments that have many states and actions, the Q-table's size becomes massive. Instead of a Q-table, you can predict Q-values using a neural network. This application of deep learning to Q-learning is called DQN.

## Setup

Run the following cell to set up Google Analytics for the Colab. Data from Google Analytics helps improve the Colab.

In [0]:
#@title Set up Google Analytics for Colab
%reset -f
import uuid
client_id = uuid.uuid4()

import requests

# Bundle up reporting into a function.
def report_execution():
  requests.post('https://www.google-analytics.com/collect', 
                data=('v=1'
                      '&tid=UA-48865479-3'
                      '&cid={}'
                      '&t=event'
                      '&ec=cell'            # <-- event type
                      '&ea=execute'         # <-- event action
                      '&el=rl-deep-q-learning'   # <-- event label
                      '&ev=1'               # <-- event value
                      '&an=bundled'.format(client_id)))

from IPython import get_ipython
get_ipython().events.register('post_execute', report_execution)

Run the following cell to import libraries and set up the environment:

In [0]:
import gym
import time
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

EPSILON_MIN = 0.01
CHECK_SUCCESS_INTERVAL = 100

env = gym.make('FrozenLake-v0')

num_states = env.observation_space.n
num_actions = env.action_space.n

## Define Neural Network

You will implement the same epsilon-greedy policy as in the previous Colab. However, instead of storing Q-values in a table, you will use a neural network to generate the Q-values.

The input to the neural net is the state. In FrozenLake, represent the state using [one-hot encoding](https://developers.google.com/machine-learning/crash-course/representation/feature-engineering). For example, encode the state `10` by running the following code:

In [0]:
np.identity(num_states)[10:10+1]

Define a function to encode states as one-hot vectors:

In [0]:
def one_hot_encode_state(state):
  """Args:
       state: An integer representing the agent's state.
     Returns:
       A one-hot encoded vector of the input `state`.
  """
  return(np.identity(num_states)[state:state+1])
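
The slice `state:state+1` keeps a leading batch dimension, so the result has shape `(1, num_states)`, the 2-D input shape that Keras models expect. A minimal sketch of the difference, assuming `num_states = 16` as in the 4x4 FrozenLake grid:

```python
import numpy as np

num_states = 16  # assumed: the 4x4 FrozenLake grid

row_slice = np.identity(num_states)[10:10+1]  # slicing keeps the batch axis
row_index = np.identity(num_states)[10]       # plain indexing drops it

print(row_slice.shape)  # (1, 16)
print(row_index.shape)  # (16,)
```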

The input to the neural net is a vector of length 16. The output is a vector of Q-values for each action. Since there are 4 actions, the output is a vector of length 4.

Define a nonlinear neural net with 16 inputs and 4 outputs by using the TF Keras API. The neural net has these characteristics:

* Uses `relu` activation function.
* Is initialized with small positive weights. Ideally, you should use known good initial values for the weights. The initialization with positive values is a workaround.
* Does not use biases. To understand why, suppose you used biases. Now, for an input $s_1$, the neural network predicts $Q(s_1,a_1)$ by transforming $s_1$ to $f(s_1)$. Then the output neuron for $a_1$ adds a bias, $b_{a_{1}}$,  as follows:
$$Q(s_1,a_1) = f(s_1) + b_{a_{1}}$$
Similarly, for state $s_2$, the prediction $Q(s_2,a_1)$ adds the same bias $b_{a_{1}}$ because the action (and thus the output neuron) remains the same:
$$Q(s_2,a_1) = f(s_2) + b_{a_{1}}$$
Therefore, for the same action, Q-value predictions depend on the same bias, even if the input state varies. Training to predict the Q-value for $(s_1,a_1)$ will change $b_{a_1}$. However, changing $b_{a_1}$ will change the predicted Q-values for $(s_2,a_1)$, resulting in wrong Q-values. Therefore, do not use biases.
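
The effect of a shared bias can be checked numerically. The following sketch is a simplification, not the Colab's model: it uses a linear layer in numpy (no `relu`), hypothetical states `s1 = 3` and `s2 = 7`, a made-up target of `1.0`, and a single gradient step on the squared error of one output.

```python
import numpy as np

num_states, num_actions = 16, 4
s1, s2, a1 = 3, 7, 0              # hypothetical states and action
target, learning_rate = 1.0, 0.1  # made-up target Q-value for (s1, a1)

rng = np.random.default_rng(0)
W = rng.uniform(1e-5, 0.05, size=(num_states, num_actions))
b = np.zeros(num_actions)

def predict(state, W, b):
    """Linear Q-value prediction for a one-hot encoded state."""
    return np.identity(num_states)[state] @ W + b

x1 = np.identity(num_states)[s1]
error = predict(s1, W, b)[a1] - target  # gradient of 0.5 * error**2
q_s2_before = predict(s2, W, b)[a1]

# Without a bias: only the weights are updated.
W_new = W.copy()
W_new[:, a1] -= learning_rate * error * x1
shift_without_bias = predict(s2, W_new, b)[a1] - q_s2_before

# With a bias: the shared bias for a1 is updated too.
b_new = b.copy()
b_new[a1] -= learning_rate * error
shift_with_bias = predict(s2, W_new, b_new)[a1] - q_s2_before

print(shift_without_bias, shift_with_bias)
```

Without a bias, the update for `s1` leaves the prediction for `s2` untouched because the one-hot input for `s1` is zero at every other state. With a bias, the same update shifts the prediction for `s2`.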

Complete the neural net definition in the following cell as described. Then run the cell. For the solution, view the next cell.

In [0]:
def define_model(learning_rate):
  '''Returns a shallow neural net defined using tf.keras.
  Args:
    learning_rate: optimizer learning rate
  Returns:
    model: A shallow neural net defined using tf.keras input dimension equal to
    num_states and output dimension equal to num_actions.
  '''
  model = keras.Sequential()
  # === Complete this section by replacing the "..." with appropriate values ===
  # model.add(keras.layers.Dense(units = ...,
  #                              input_dim = ...,
  #                              activation = ...,
  #                              use_bias = False,
  #                              # next line initializes weights with small positive values 
  #                              kernel_initializer = keras.initializers.RandomUniform(minval=1e-5, maxval=0.05)
  #                             ))
  # ============================================================================
  model.compile(optimizer = keras.optimizers.SGD(learning_rate = learning_rate),
                loss = 'mse')
  return(model)

learning_rate = 0.1
model = define_model(learning_rate)
model.summary()

In [0]:
#@title Solution (double-click to view code)
def define_model(learning_rate):
  '''Returns a shallow neural net defined using tf.keras.
  Args:
    learning_rate: optimizer learning rate
  Returns:
    model: A shallow neural net defined using tf.keras input dimension equal to
    num_states and output dimension equal to num_actions.
  '''
  model = keras.Sequential()
  model.add(keras.layers.Dense(units = num_actions,
                               input_dim = num_states,
                               activation = 'relu',
                               use_bias = False,
                               kernel_initializer = keras.initializers.RandomUniform(minval=1e-5, maxval=0.05)
                              ))
  model.compile(optimizer = keras.optimizers.SGD(learning_rate = learning_rate),
                loss = 'mse')
  return(model)

learning_rate = 0.1
model = define_model(learning_rate)
model.summary()

## Calculate Q-Values from Neural Network

You can use your neural network to predict Q-values for any state. For example, predict Q-values for state 5 by running the following cell. Since your neural network has not been trained, these predicted Q-values are inaccurate.

In [0]:
model.predict(one_hot_encode_state(5))
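
For a single-layer network without biases, the prediction for a one-hot input amounts to picking one row of the weight matrix and applying `relu`. The following numpy sketch illustrates this; the matrix `W` is a randomly drawn stand-in for the Keras layer's weights, not the weights of your model.

```python
import numpy as np

num_states, num_actions = 16, 4
rng = np.random.default_rng(0)
# Stand-in for the Keras layer's kernel, drawn from the same range as the
# RandomUniform initializer used in define_model.
W = rng.uniform(1e-5, 0.05, size=(num_states, num_actions))

one_hot = np.identity(num_states)[5:5+1]   # shape (1, 16)
q_values = np.maximum(one_hot @ W, 0.0)    # relu(x W), no bias
print(q_values.shape)                      # (1, 4)
```

Because the weights start out as small positive values, the initial Q-values for every state are small and positive too.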

Complete the following cell to implement a function (identical to the one in the previous Colab) that returns an action using an epsilon-greedy policy. Then run the cell. For the solution, view the next cell.

In [0]:
def policy_eps_greedy(env, q_values, epsilon):
  """Select action given Q-values using epsilon-greedy algorithm.
  Args:
    q_values: q_values for all possible actions from a state.
    epsilon: Current value of epsilon used to select action using epsilon-greedy
             algorithm.
  Returns:
    action: action to take from the state.
  """
  # === Complete this section by replacing the "..." with appropriate values ===
  # if(np.random.rand() < ...):
  #   action = ...
  # else:
  #   action = ...
  # ============================================================================
  return action

In [0]:
#@title Solution (to view code, from cell's menu, select Form -> Show Code)
def policy_eps_greedy(env, q_values, epsilon):
  """Select action given Q-values using epsilon-greedy algorithm.
  Args:
    q_values: q_values for all possible actions from a state.
    epsilon: Current value of epsilon used to select action using epsilon-greedy
             algorithm.
  Returns:
    action: action to take from the state.
  """
  if(np.random.rand() < epsilon):
    action = env.action_space.sample()
  else:
    action = np.argmax(q_values)
  return action
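
To check the behavior of the policy at the two extremes of epsilon, you can run a version that does not depend on Gym. In this sketch, the function name `eps_greedy`, the `num_actions` argument (used in place of `env.action_space.sample()`), and the Q-values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(q_values, epsilon, num_actions=4):
    """Returns a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(q_values))

q_values = np.array([[0.01, 0.04, 0.02, 0.03]])  # made-up predictions

greedy_actions = {eps_greedy(q_values, epsilon=0.0) for _ in range(100)}
random_actions = {eps_greedy(q_values, epsilon=1.0) for _ in range(100)}
print(greedy_actions)  # {1}
print(random_actions)
```

With `epsilon=0.0` the policy always exploits and returns the action with the highest Q-value. With `epsilon=1.0` it always explores and ignores the Q-values.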

In deep Q-learning, the neural network replaces the Q-table. To demonstrate how, run a full training step using the neural network.

First, reset the environment and calculate Q-values for the starting state:

In [0]:
state = env.reset()
q_values = model.predict(one_hot_encode_state(state))
print("Q-values for state " + str(state) + " :\n" + str(q_values))

Each Q-value represents the approximated return from taking the corresponding action and then following a greedy policy. Therefore, when Q-values are accurate, choosing the action with the highest Q-value will maximize return.

Using the Q-values, select an action using an epsilon-greedy policy. Take the action and record the next state and reward.

In [0]:
epsilon = 0.5 # assume some value of epsilon

action = policy_eps_greedy(env, q_values, epsilon)
state_new, reward, _, _ = env.step(action)

print("action:", action, ", next state:", state_new, ", reward:", reward)

Calculate the target Q-value by completing the following cell to define a function. The formula for the returned Q-value is:

$$
  r(s,a)
      + \gamma \displaystyle \max_{\substack{a_1}} Q(s_1,a_1)
$$

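Here, $s_1$ is the next state (`state_new` in the code) and the maximum is taken over the actions $a_1$ available in $s_1$. This function is similar to the Bellman update in the previous Colab, except for the use of a neural network.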

In [0]:
def bellman_update(reward, discount_factor, model, state_new):
  # =========== Complete this section by replacing the "..." ===================
  # return ...
  # ============================================================================

In [0]:
#@title Solution (to view code, from cell's menu, select Form -> Show Code)
def bellman_update(reward, discount_factor, model, state_new):
  return reward + discount_factor * \
                  np.max(model.predict(one_hot_encode_state(state_new)))
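
As a worked instance of the formula, the following sketch computes one target by hand. The reward and the predicted Q-values for the next state are made-up numbers, not output from your model.

```python
reward = 0.0
discount_factor = 0.99
next_q_values = [0.01, 0.04, 0.02, 0.03]  # made-up predictions for the next state

# Target = immediate reward + discounted best Q-value of the next state.
target = reward + discount_factor * max(next_q_values)
print(round(target, 4))  # 0.0396
```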

Calculate target Q-values by calling the function `bellman_update`:

In [0]:
discount_factor = 0.99

print("Q-values before update for state " + str(state) + " :\n" + str(q_values))
target_q_values = q_values.copy()  # copy so that q_values keeps the predictions
target_q_values[0, action] = bellman_update(reward, discount_factor, model,
                                           state_new)

print("Q-values after update for state " + str(state) + " :\n" + str(target_q_values))

Notice that only the Q-value corresponding to the action taken changes after the update. The updated Q-values become the "target" label that the neural network uses to train.
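
Because the target vector equals the prediction everywhere except at the action taken, the mean-squared-error loss penalizes only that one output. A small sketch with made-up numbers shows where the error is nonzero:

```python
import numpy as np

q_values = np.array([[0.01, 0.04, 0.02, 0.03]])  # made-up predictions
action = 2                                       # made-up action taken
target_q_values = q_values.copy()
target_q_values[0, action] = 0.0396              # made-up Bellman target

error = target_q_values - q_values
print(np.nonzero(error)[1])  # [2]
```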

Train the neural network to predict the target Q-values:

In [0]:
model.fit(one_hot_encode_state(state), target_q_values, verbose = True)

To summarize, in each state, train the neural network by following these steps:

1. Choose an action using an epsilon-greedy policy, using the neural network to predict Q-values.
1. Take the action and record the next state and reward.
1. Calculate a target Q-value for the $(s,a)$ pair using the Bellman update.
1. Train the neural network to predict the target Q-value.

Over many transitions, your neural network will learn to approximate the Q-values for every state-action pair. Using these Q-values, the epsilon-greedy policy can solve the `FrozenLake-v0` environment. This approach is called **online DQN** because the agent trains on the state transitions generated when it is running (online).
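
The four steps can be sketched end to end without Gym or Keras. The sketch below is an illustration under stated assumptions, not the Colab's implementation: it swaps FrozenLake for a hypothetical 5-state chain, and swaps the neural network for a numpy weight matrix `W` with one weight per state-action pair. For one-hot inputs, moving `W[state, action]` a fraction of the way toward the target plays the role of a gradient step on the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment (not FrozenLake): a chain of 5 states.
# Action 0 moves left, action 1 moves right; reaching state 4 pays reward 1.
NUM_STATES, NUM_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    state_new = max(state - 1, 0) if action == 0 else state + 1
    done = state_new == GOAL
    return state_new, float(done), done

# Stand-in for the neural net: no bias, small positive initial weights.
# With one-hot inputs, the prediction for a state is its row of W.
W = rng.uniform(1e-5, 0.05, size=(NUM_STATES, NUM_ACTIONS))

epsilon, eps_decay, eps_min = 1.0, 0.99, 0.01
learning_rate, discount_factor = 0.5, 0.9

for episode in range(500):
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        steps += 1
        q_values = W[state]
        # 1. Choose an action using an epsilon-greedy policy.
        if rng.random() < epsilon:
            action = int(rng.integers(NUM_ACTIONS))
        else:
            action = int(np.argmax(q_values))
        # 2. Take the action and record the next state and reward.
        state_new, reward, done = step(state, action)
        # 3. Calculate the target Q-value using the Bellman update.
        target = reward + discount_factor * np.max(W[state_new])
        # 4. Train: move the prediction toward the target.
        W[state, action] += learning_rate * (target - W[state, action])
        state = state_new
    epsilon = max(epsilon * eps_decay, eps_min)

# Greedy action in every non-terminal state after training.
greedy_policy = np.argmax(W[:GOAL], axis=1)
print(greedy_policy)
```

After training, the greedy action in every non-terminal state should be `1` (move right), which is the shortest route to the reward in this toy chain.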

## Implement Framework to Solve Frozen Lake

Define the functions you need to train your agent. Start by completing the following code cell to define a function that runs one training episode by repeating the steps described previously.

In [0]:
def collect_one_episode_and_train_model(env, model, epsilon, discount_factor):
  '''Runs one episode and trains the model on every state transition.

  Runs one episode. On every state transition in the episode, collects the
  tuple s, a, r, s'. Then performs Bellman update on Q-values using the tuple
  and trains the agent to predict the updated Q-values.

  Args:
    env: environment that the agent is learning.
    model: neural network used to predict Q-values of (state, action) pairs
    epsilon: current value of epsilon used by the epsilon-greedy policy
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
  Returns:
    episode_length: number of states visited during episode
    episode_reward: total reward earned by agent during episode
    model: updated model after training during episode
  '''
  state = env.reset()
  episode_reward = 0
  done = False
  episode_length = 0

  while not done:
    episode_length += 1
    # =========== Complete this section by replacing the "..." =================
    # q_values = ...
    # action = ...
    # state_new, reward, done, _ = ...
    # q_values[0, action] = ...
    # ==========================================================================
    model.fit(one_hot_encode_state(state), q_values, verbose=False)
    episode_reward += reward
    state = state_new

  return(episode_length, episode_reward, model)

In [0]:
#@title Solution (to view code, from cell's menu, select Form -> Show Code)
def collect_one_episode_and_train_model(env, model, epsilon, discount_factor):
  '''Runs one episode and trains the model on every state transition.

  Runs one episode. On every state transition in the episode, collects the
  tuple s, a, r, s'. Then performs Bellman update on Q-values using the tuple
  and trains the agent to predict the updated Q-values.

  Args:
    env: environment that the agent is learning.
    model: neural network used to predict Q-values of (state, action) pairs
    epsilon: current value of epsilon used by the epsilon-greedy policy
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
  Returns:
    episode_length: number of states visited during episode
    episode_reward: total reward earned by agent during episode
    model: updated model after training during episode
  '''
  state = env.reset()
  episode_reward = 0
  done = False
  episode_length = 0

  while not done:
    episode_length += 1
    q_values = model.predict(one_hot_encode_state(state))
    action = policy_eps_greedy(env, q_values, epsilon)
    state_new, reward, done, _ = env.step(action)
    q_values[0, action] = bellman_update(reward, discount_factor, model,
                                         state_new)
    model.fit(one_hot_encode_state(state), q_values, verbose=False)
    episode_reward += reward
    state = state_new

  return(episode_length, episode_reward, model)

Define a function to test the agent's performance for a given success threshold. You will use this function to detect whether the agent has solved the environment.

In [0]:
def check_success(episode, reward_history, length_history, epsilon,
                 success_percent_threshold):
  '''Returns 1 if agent has crossed success threshold.

  For a fixed number of episodes, calculates and prints metrics summarizing
  agent's training over those episodes. Then checks and returns 1 if agent
  has crossed the defined success threshold. Otherwise, returns 0.

  Args:
    episode: episode number of agent's training
    reward_history: list containing rewards for every episode
    length_history: list containing length of every episode, where length is
      the number of states visited during the episode
    epsilon: current value of epsilon
    success_percent_threshold: percent of episodes that the agent must solve
      to prove that it is successfully learning the environment
  Returns:
    1 if the agent crossed the success threshold, 0 otherwise.
  '''
  if((episode+1) % CHECK_SUCCESS_INTERVAL == 0):
    # Check the success % in the last 100 episodes
    success_percent = np.sum(reward_history[-100:])
    length_avg = int(np.sum(length_history[-100:])/100.0)
    print("Episode: " + f"{episode:0>4d}" + \
          ", Success: " + f"{success_percent:2.0f}" + "%" + \
          ", Avg length: " + f"{length_avg:0>2d}" + \
          ", Epsilon: " + f"{epsilon:.2f}")
    if(success_percent > success_percent_threshold):
      print("Agent crossed success threshold of " + \
            str(success_percent_threshold) + '%.')
      return(1)
  return(0)
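
FrozenLake pays a reward of 1 only when the agent reaches the goal, so summing the episode rewards over a window of 100 episodes gives the success rate in percent directly. The following sketch uses a made-up reward history to show the windowing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up history: 250 episodes, each succeeding with probability 0.3.
reward_history = list((rng.random(250) < 0.3).astype(float))

# [-100:] includes the most recent episode; [-100:-1] would stop one short.
window = reward_history[-100:]
success_percent = np.sum(window)  # successes out of 100 episodes = percent
print(len(window), success_percent)
```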

Using the functions `collect_one_episode_and_train_model` and `check_success`, define a function to train the agent until the agent crosses the success threshold:

In [0]:
#### Plotting functions ####
def visualize_training(reward_history):
  plt.plot(range(len(reward_history)), reward_history)
  plt.xlabel('Episodes')
  plt.ylabel('Reward')
  plt.title('Reward during Training')
  plt.show()

#### Training function ####
def train_agent(env, model, episodes, epsilon, discount_factor, eps_decay,
               success_percent_threshold):
  '''Trains the agent by running episodes while checking for successful
     learning.
  Args:
    env: environment to train the agent on
    model: neural network representing agent used to learn Q-values of
      environment
    epsilon: starting value of epsilon
    discount_factor: factor by which to reduce return from next state when
      updating Q-values using Bellman update.
    eps_decay: factor to reduce value of epsilon by, on every episode
    episodes: number of episodes to train agent for
    success_percent_threshold: percent of episodes that the agent must solve
      to prove that it is successfully learning the environment
  '''
  length_history = []     # Record agent's episode length
  reward_history = []     # Record agent's episode reward
  timeStart = time.time() # Track training time

  for episode in range(episodes):
    episode_length, episode_reward, model = \
      collect_one_episode_and_train_model(env, model, epsilon, discount_factor)
    length_history.append(episode_length)
    reward_history.append(episode_reward)
    if epsilon > EPSILON_MIN:
      epsilon *= eps_decay
    if(check_success(episode, reward_history, length_history, epsilon,
                 success_percent_threshold)):
      break

  timeEnd = time.time()
  print("Training time (min): " + f'{(timeEnd - timeStart)/60:.2f}')
  visualize_training(reward_history)
  env.close() # Close environment

## Train Agent to Solve Frozen Lake

Run the code below to solve `FrozenLake-v0` using DQN. To solve Frozen Lake, you must experiment with hyperparameter values. In doing so, your goal is to develop intuition for how hyperparameters interact to affect the training outcome.

Consider the following advice on adjusting hyperparameter values:

* Journey length (the average episode length printed during training) begins increasing before the success rate does. Hence, journey length is a leading indicator of improvement. Further, journey length is a more stable metric than success percent.
* Aim to prioritize quick experimentation. For example, stop training if journey length doesn't begin increasing within 2000 episodes and try again.
* The agent should solve the environment in <5000 episodes.
* The output plot should show the incidence of successful episodes increasing.
* Frozen Lake is slightly more complex than NChain. Adjust `learning_rate` accordingly. 
* The reward from the final state must propagate back to the initial state's Q-values. The higher the `discount_factor`, the greater the fraction of the reward that propagates back. Hence, keep `discount_factor` high.
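
To get a feel for the last point, recall that a reward received `n` steps in the future is weighted by `discount_factor ** n`. The sketch below compares three discount factors; the 20-step distance is a hypothetical value chosen for illustration.

```python
discount_factors = [0.9, 0.99, 0.999]
steps = 20  # hypothetical number of steps between a state and the reward

# Fraction of the final reward that survives discounting over `steps` steps.
fractions = {d: d ** steps for d in discount_factors}
for discount_factor, fraction in fractions.items():
    print(f"discount_factor={discount_factor}: fraction={fraction:.3f}")
```

At 20 steps, a discount factor of 0.9 retains about 12% of the reward, whereas 0.99 retains about 82% and 0.999 about 98%.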

For the solution, expand the following section.

In [0]:
##### SETUP #####
episodes = 5000
epsilon = 1.0
eps_decay = 0.99
learning_rate = 0.01
discount_factor = 0.999
success_percent_threshold = 20 # in percent, so 20 = 20%

model = define_model(learning_rate)

#### TRAINING #####
train_agent(env, model, episodes, epsilon, discount_factor, eps_decay,
               success_percent_threshold)

## Solution (expand to view code)

The following code typically crosses a success rate of 20% in <2000 episodes. In the next cell, you'll visualize the trained agent solving the environment.

In [0]:
##### SETUP #####
episodes = 5000
epsilon = 1.0
eps_decay = 0.999
learning_rate = 0.2
discount_factor = 0.99
success_percent_threshold = 60 # in percent, so 60 = 60%

model = define_model(learning_rate)

#### TRAINING #####
train_agent(env, model, episodes, epsilon, discount_factor, eps_decay,
               success_percent_threshold)

While Frozen Lake is a more complex environment than NChain, it is simple in comparison to environments such as Pong and Breakout. As you move to more complex environments, apply the intuition gained from solving simpler environments by using the following guidelines:

* The agent will take longer to find a successful path through random exploration. Therefore, epsilon must decay slower so that the agent explores for longer.
* The agent must use a deeper and wider neural network to approximate the increased complexity.
* The agent must train at a lower learning rate to adapt to the increased complexity.
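
To see how strongly the decay factor controls the length of exploration, you can count the episodes that epsilon needs to fall from 1.0 to the minimum of 0.01. The helper name `episodes_until_min` is hypothetical; its loop mirrors the per-episode decay in `train_agent`.

```python
def episodes_until_min(eps_decay, epsilon=1.0, epsilon_min=0.01):
    """Counts the decay steps needed for epsilon to fall to epsilon_min or below."""
    episodes = 0
    while epsilon > epsilon_min:
        epsilon *= eps_decay
        episodes += 1
    return episodes

print(episodes_until_min(0.99))   # 459
print(episodes_until_min(0.999))  # 4603
```

Moving `eps_decay` from 0.99 to 0.999 stretches exploration by roughly a factor of ten.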


## Visualize Performance of Trained Model

Seeing the metrics plots is one thing, but watching your agent succeed at retrieving the frisbee is another. Run the following code to watch your agent solve `FrozenLake`:

In [0]:
from IPython.display import clear_output # to clear output on every episode run

state = env.reset()
done = False
while not done:
  q_values = model.predict(one_hot_encode_state(state))
  action = np.argmax(q_values)
  state_new, reward, done, _ = env.step(action)
  state = state_new
  clear_output()
  env.render()
  time.sleep(0.5)

## Conclusion and Next Steps

You learned how to combine neural networks with traditional reinforcement learning approaches to solve a simple environment.

Move on to the next Colab: [Experience Replay and Target Networks](https://colab.research.google.com/drive/1DEv8FSjMvsgCDPlOGQrUFoJeAf67cFSo#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-experience-replay-and-target-networks).

For reference, the sequence of course Colabs is as follows:

1. [Problem Framing in Reinforcement Learning](https://colab.research.google.com/drive/1sUYro4ZyiHuuKfy6KXFSdWjNlb98ZROd#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-problem-framing)
1. [Q-learning Framework](https://colab.research.google.com/drive/1ZPsEEu30SH1BUqUSxNsz0xeXL2Aalqfa#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-q-learning)
1. [Tabular Q-Learning](https://colab.research.google.com/drive/1sX2kO_RA1DckhCwX25OqjUVBATmOLgs2#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-tabular-q-learning)
1. [Deep Q-Learning](https://colab.research.google.com/drive/1XnFxIE882ptpO83mcAz7Zg8PxijJOsUs#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-deep-q-learning)
1. [Experience Replay and Target Networks](https://colab.research.google.com/drive/1DEv8FSjMvsgCDPlOGQrUFoJeAf67cFSo#forceEdit=true&sandboxMode=true?utm_source=ss-reinforcement-learning&utm_campaign=colab-external&utm_medium=referral&utm_content=rl-experience-replay-and-target-networks)