# DQNs on GCP

Reinforcement Learning (RL) agents can be quite fickle. This is because the setting an agent learns in is different from that of supervised and unsupervised algorithms.

| Supervised / Unsupervised | Reinforcement Learning |
| ----------- | ----------- |
| Data is previously gathered | Data needs to be simulated |
| Big Data: Many examples covering many situations | Sparse Data: Agent trades off between exploring and exploiting |
| The environment is assumed static | The environment may change in response to the agent |

Because of this, hyperparameter tuning is even more crucial in RL as it not only impacts the training of the agent's neural network, but it also impacts how the data is gathered through simulation.

## Setup

Hypertuning takes some time, and in this case, it can take anywhere between **10 - 30 minutes**. If this hasn't been done already, run the cell below to kick off the training job now. We'll step through what the code is doing while our agents learn.

In [None]:
%%bash
BUCKET=<your-bucket-here> # Change to your bucket name
JOB_NAME=dqn_on_gcp_$(date -u +%y%m%d_%H%M%S)
REGION='us-central1' # Change to your bucket region
IMAGE_URI=gcr.io/qwiklabs-resources/rl-qwikstart/dqn_on_gcp@sha256:326427527d07f30a0486ee05377d120cac1b9be8850b05f138fc9b53ac1dd2dc

gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket=gs://$BUCKET \
    --region=$REGION \
    --master-image-uri=$IMAGE_URI \
    --scale-tier=BASIC_GPU \
    --job-dir=gs://$BUCKET/$JOB_NAME \
    --config=hyperparam.yaml

The above command sends a [hyperparameter tuning job](https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview) to the [Google Cloud AI Platform](https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training). It's a service that sets up scaling distributed training so data scientists and machine learning engineers do not have to worry about technical infrastructure. Usually, it automatically selects the [container environment](https://cloud.google.com/ml-engine/docs/runtime-version-list), but we're going to take advantage of a feature to specify our own environment with [Docker](https://www.docker.com/resources/what-container). Not only will this allow us to install our game environment to be deployed to the cloud, but it will also significantly speed up hyperparameter tuning time as each worker can skip the library installation steps.

The <a href="Dockerfile">Dockerfile</a> in this directory shows the steps taken to build this environment. First, we copy from a [Google Deep Learning Container](https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container) which already has Google Cloud Libraries installed. Then, we install our other desired modules and libraries. `ffmpeg`, `xvfb`, and `python-opengl` are needed in order to get video output from the server. Machines on the cloud don't typically have a display (why would they need one?), so we'll make a virtual display of our own.

After we copy our code, we configure the container as an executable so we can pass our hyperparameter tuning flags to it with the [ENTRYPOINT](https://stackoverflow.com/questions/21553353/what-is-the-difference-between-cmd-and-entrypoint-in-a-dockerfile) command. In order to set up our virtual display, we can use the [xvfb-run](http://manpages.ubuntu.com/manpages/trusty/man1/xvfb-run.1.html) command. Unfortunately, Docker strips quotes from specified commands in ENTRYPOINT, so we'll make a super simple shell script, <a href="train_model.sh">train_model.sh</a>, to specify our virtual display parameters. The `"$@"` parameter is used to pass the flags called against the container to our python module, `trainer.trainer`.

## CartPole-v0

So what is the game we'll be solving for? We'll be playing with [OpenAI Gym's CartPole Environment](https://gym.openai.com/envs/CartPole-v1/). As MNIST is the "Hello World" of image classification, CartPole is the "Hello World" of Deep Q Networks. Let's install [OpenAI Gym](https://gym.openai.com/) and play with the game ourselves!

In [None]:
# !python3 -m pip freeze | grep gym || python3 -m pip install --user gym==0.26.2
# !python3 -m pip freeze | grep 'tensorflow==2.5\|tensorflow-gpu==2.1' || \
# !python3 -m pip install -U tensorflow==2.3.0
# !python3 -m pip install pygame

##### Note: Restart the kernel if the above libraries needed to be installed. Please ignore incompatibility errors.

The `gym` library hosts a number of different gaming environments that our agents (and us humans) can play around in. To make an environment, we simply need to pass it what game we'd like to play with the `make` method.

This will create an environment object with a number of useful methods and properties.
* The `observation_space` property is the structure of observations about the environment.
  - Each "state" or snapshot of our environment will follow this structure
* The `action_space` property is the set of possible actions the agent can take

So for example, with CartPole, there are 4 observation dimensions which represent `[Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip]`. For the actions, there are 2 possible actions to take: 0 pushes the cart to the left, and 1 pushes the cart to the right. More detail is described in the game's code [here](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).

In [None]:
from collections import deque

import gymnasium as gym
import numpy as np
import random
import tensorflow as tf
from tensorflow.keras import layers, models

env = gym.make('CartPole-v0')
print("The observation space is", env.observation_space)
print("The observation dimensions are", env.observation_space.shape)
print("The action space is", env.action_space)
print("The number of possible actions is", env.action_space.n)

* The `reset` method will restart the environment and return a tuple of the starting state and an info dictionary.
* The `step` method takes an action, applies it to the environment, and returns five values: the new state, the transition reward, whether the game has terminated, whether the episode was truncated (for example, by a step limit), and game-specific information. For CartPole, there is no extra info, so it returns a blank dictionary.

In [None]:
def print_state(state, step, reward=None):
    format_string = 'Step {0} - Cart X: {1:.3f}, Cart V: {2:.3f}, Pole A: {3:.3f}, Pole V:{4:.3f}, Reward:{5}'
    print(state)
    print(format_string.format(step, *tuple(state), reward))


state = env.reset()[0]
step = 0
print_state(state, step)

In [None]:
state = env.reset()[0]  # reset() returns (state, info); keep only the state
step = 0
action = 0
step_result = env.step(action)
print(step_result)
observation, reward, terminated, truncated, info = step_result
step += 1
print_state(observation, step, reward)
print("The game is over." if terminated or truncated else "The game can continue.")
print("Info:", info)

Run the cell below repeatedly until the game is over, changing the action to push the cart left (0) or right (1). The game is considered "won" when the pole can stay up for an average of 195 steps over 100 games. How far can you get? An agent acting randomly can only survive about 10 steps.

In [None]:
action = 1  # Change me: 0 Left, 1 Right
state_prime, reward, done1, done2, info = env.step(action)
step += 1

print_state(state_prime, step, reward)
print("The game is over." if done1 or done2 else "The game can continue.")
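
The "won" criterion described above can be sketched as a small helper. This is only an illustration: the function name `is_solved` and its defaults are assumptions drawn from the numbers in the text, not part of `gym`.

```python
from collections import deque


def is_solved(scores, window=100, threshold=195.0):
    """Returns True once the mean of the last `window` scores reaches `threshold`."""
    recent = deque(scores, maxlen=window)  # keeps only the newest `window` scores
    if len(recent) < window:
        return False  # not enough games played yet
    return sum(recent) / window >= threshold


print(is_solved([200.0] * 100))
print(is_solved([200.0] * 99))
```

Appending each episode's score to a list and calling this after every game is one way to decide when to stop training.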

We can make our own policy and create a loop to play through an episode (one full simulation) of the game. Below, actions are generated to alternate between pushing the cart left and right. The code is very similar to how our agents will be interacting with the game environment.

In [None]:

# [0, 1, 0, 1, 0, 1, ...]
actions = [x % 2 for x in range(200)]
state = env.reset()[0]
step = 0
episode_reward = 0
done = False

while not done and step < len(actions):
    action = actions[step]  # In the future, our agents will define this.
    state_prime, reward, done1, done2, info = env.step(action)
    done = done1 or done2
    episode_reward += reward
    step += 1
    state = state_prime
    print_state(state, step, reward)

end_statement = "Game over!" if done else "Ran out of actions!"
print(end_statement, "Score =", episode_reward)
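
The alternating policy above ignores the state entirely. As a point of comparison, here is a minimal sketch of a hand-written heuristic that does look at the state: push the cart toward the side the pole is leaning. The name `lean_policy` is hypothetical, and the observation layout is the one described earlier (`[Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip]`).

```python
def lean_policy(state):
    """Pushes the cart toward the side the pole is leaning.

    `state` is [cart position, cart velocity, pole angle, pole tip velocity].
    """
    pole_angle = state[2]
    return 1 if pole_angle > 0 else 0  # 1 = push right, 0 = push left


print(lean_policy([0.0, 0.0, 0.05, 0.0]))
print(lean_policy([0.0, 0.0, -0.05, 0.0]))
```

To try it, replace `action = actions[step]` in the loop above with `action = lean_policy(state)` and drop the `step < len(actions)` condition.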

It's a challenge to get to 200! We could repeatedly experiment to find the best heuristics to beat the game, or we could leave all that work to the robot. Let's create an intelligence to figure this out for us.

## The Theory Behind Deep Q Networks

The fundamental principle behind RL is we have two entities: the **agent** and the **environment**. The agent takes state and reward information about the environment and chooses an action. The environment takes that action and will change to be in a new state.

<img src="images/agent_and_environment.jpg" width="476" height="260">

RL assumes that the environment follows a [Markov Decision Process (MDP)](https://en.wikipedia.org/wiki/Markov_decision_process). That means the state is dependent partially on the agent's actions, and partially on chance. MDPs can be represented by a graph, with states and actions as nodes, and rewards and path probabilities on the edges.

<img src="images/mdp.jpg" width="471" height="243">

So what would be the best path through the graph above? Or perhaps a more difficult question, what would be our expected winnings if we played optimally? The probability introduced in this problem has inspired multiple strategies over the years, but all of them boil down to the idea of discounted future rewards.

Would you rather have `$100` now or `$105` a year from now? With inflation, there's no definitive answer, but each of us has a threshold that we use to determine the value of something now versus the value of something later. In psychology, this is called [Delayed Gratification](https://en.wikipedia.org/wiki/Delayed_gratification). Richard E. Bellman expressed this theory in an equation widely used in RL called the [Bellman Equation](https://en.wikipedia.org/wiki/Bellman_equation). Let's introduce some vocab to better define it.

| Symbol | Name | Definition | Example |
| - | - | - | - |
| | agent | An entity that can act and transition between states | Us when we play CartPole |
| s | state | The environmental parameters describing where the agent is | The position of the cart and angle of the pole |
| a | action | What the agent can do within a state | Pushing the cart left or right |
| t | time / step | One transition between states | One push of the cart |
| | episode | One full simulation run | From the start of the game to game over |
| v, V(s) | value | How much a state is worth | V(last state dropping the pole) = 0 |
| r, R(s, a) | reward | Value gained or lost transitioning between states through an action | R(keeping the pole up) = 1 |
| γ | gamma | The discount factor: how much the value of a future state counts toward the value of the current state | Coming up soon |
| π, π(s) | policy | The recommended action to the agent based on the current state | π(in trouble) = honesty |
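
To make γ concrete, here is a small sketch of discounting. The function name `discounted_return` and the gamma values are illustrative assumptions, not taken from the trainer code.

```python
def discounted_return(rewards, gamma):
    """Sums rewards, discounting the reward at step t by gamma ** t."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))


# $105 one step in the future, valued today with gamma = 0.95.
print(discounted_return([0, 105], gamma=0.95))

# Three CartPole steps worth 1 point each, with gamma = 0.9.
print(discounted_return([1, 1, 1], gamma=0.9))
```

With γ = 0.95, the `$105` a year from now is worth roughly `$99.75` today, so this particular agent would take the `$100` now. A γ closer to 1 makes the agent more patient.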

Bellman realized this: The value of our current state should be the discounted value of the next state the agent will be in plus any rewards picked up along the way, given the agent takes the best action to maximize this.

Using all the symbols from above, we get:

<img src="images/bellman_equation.jpg" width="260" height="50">
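
In case the image does not render, the sentence above can be written out in a standard form. This assumes the next state $s'$ follows deterministically from taking action $a$ in state $s$; the symbol $s'$ is an assumption added here, while $V$, $R$, and $\gamma$ come from the table above.

$$V(s) = \max_{a} \Big( R(s, a) + \gamma \, V(s') \Big)$$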

However, this is assuming we know all the states, their corresponding actions, and their rewards. If we don't know this in advance, we can explore and simulate this equation with what is called the [Q equation](https://en.wikipedia.org/wiki/Q-learning):

<img style="background-color:white;" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/47fa1e5cf8cf75996a777c11c7b9445dc96d4637">

Here, the value function is replaced with the Q value, which is a function of a state and action. The learning rate is how much we want to change our old Q value with new information found during simulation. Visually, this results in a Q-table, where rows are the states, actions are the columns, and each cell is the value found through simulation.

|| Meal | Snack | Wait |
|-|-|-|-|
| Hangry | 1 | .5 | -1 |
| Hungry | .5 | 1 | 0 |
| Full | -1 | -.5 | 1.5 |
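
As a rough sketch of how one cell of such a table gets updated, the snippet below applies a single step of the Q equation to the table above. The transition (snacking while Hangry yields reward 1 and leads to Hungry) and the values of `alpha` and `gamma` are made up for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Applies one Q-learning update in place and returns the new Q value."""
    best_next = max(q_table[next_state].values())  # value of the best next action
    old_q = q_table[state][action]
    new_q = (1 - alpha) * old_q + alpha * (reward + gamma * best_next)
    q_table[state][action] = new_q
    return new_q


q_table = {
    "Hangry": {"Meal": 1.0, "Snack": 0.5, "Wait": -1.0},
    "Hungry": {"Meal": 0.5, "Snack": 1.0, "Wait": 0.0},
    "Full": {"Meal": -1.0, "Snack": -0.5, "Wait": 1.5},
}

# Hypothetical transition: eating a snack while Hangry gives reward 1
# and leaves the agent Hungry.
new_q = q_update(q_table, "Hangry", "Snack", reward=1.0,
                 next_state="Hungry", alpha=0.5, gamma=0.9)
print(new_q)
```

The learning rate `alpha` blends the old estimate (0.5) with the newly observed target (1 + 0.9 × 1), so the cell moves from 0.5 to about 1.2.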

So this is cool and all, but how exactly does this fit in with CartPole? A Q-table assumes a discrete set of states, but CartPole has multidimensional states on a continuous scale. This is where neural networks save the day! Rather than categorize each state, we can feed state properties into our network. By having the same number of output nodes as possible actions, our network can be used to predict the Q value of each action given the current state.

## Building the Agent

These networks can be configured with the same architectures and tools as other problems, such as CNNs. However, the one gotcha is that the training targets are specialized: instead of fixed labels, the network is fit against target Q-values computed with the Bellman Equation. Let's go ahead and define our model function as it is in `trainer/model.py`.

In [None]:
def deep_q_network(state_shape, action_size, learning_rate, hidden_neurons):
    """Creates a Deep Q Network to emulate Q-learning.

    Creates a two hidden-layer Deep Q Network. Similar to a typical neural
    network, the loss function is altered to reduce the difference between
    predicted Q-values and Target Q-values.

    Args:
        state_shape: a tuple of ints representing the observation space.
        action_size (int): the number of possible actions.
        learning_rate (float): the neural network's learning rate.
        hidden_neurons (int): the number of neurons to use per hidden
            layer.

    Returns:
        A compiled Keras model that takes [states, action mask] as inputs.
    """
    state_input = layers.Input(state_shape, name='frames')
    actions_input = layers.Input((action_size,), name='mask')

    hidden_1 = layers.Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = layers.Dense(hidden_neurons, activation='relu')(hidden_1)
    q_values = layers.Dense(action_size)(hidden_2)
    masked_q_values = layers.Multiply()([q_values, actions_input])

    model = models.Model(
        inputs=[state_input, actions_input], outputs=masked_q_values)
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model

Notice any other atypical aspects of this network?

Here, we take in both state and actions as inputs to our network. The states are fed in as normal, but the actions are used to "mask" the output. This is actually used for faster training, as we'd only want to update the nodes corresponding to the action that we simulated.
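
A tiny sketch in plain Python (no Keras) of why the mask works: once both the prediction and the target are zeroed for the actions not taken, those output nodes contribute nothing to the mean squared error. The Q values here are made-up numbers, and `masked_mse` is a hypothetical helper.

```python
def masked_mse(q_values, mask, target):
    """Mean squared error between masked predictions and a masked target."""
    masked = [q * m for q, m in zip(q_values, mask)]
    return sum((p - t) ** 2 for p, t in zip(masked, target)) / len(target)


q_values = [0.3, 0.8]  # hypothetical network outputs for [left, right]
mask = [0.0, 1.0]      # one-hot mask: the simulated action was "right"
target = [0.0, 1.5]    # target Q value only for the action taken

# Changing the masked-out "left" prediction leaves the loss untouched.
loss_a = masked_mse(q_values, mask, target)
loss_b = masked_mse([99.0, 0.8], mask, target)
print(loss_a == loss_b)
```

Because the loss does not depend on the "left" output here, no gradient flows to it for this sample.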

The Bellman Equation actually isn't in the network. That's because this is only the "brain" of our agent. As an intelligence, it has much more! Before we get to how exactly the agent learns, let's look at the other aspects of its body: "Memory" and "Exploration".

Just like other neural network algorithms, we need data to train on. However, this data is the result of our simulations, not something previously stored in a table. Thus, we're going to give our agent a memory where we can store state - action - new state transitions to learn on.

Each time the agent takes a step in gym, we'll save `(state, action, reward, state_prime, done)` to our buffer, which is defined like so.

In [67]:
class Memory():
    """Sets up a memory replay buffer for a Deep Q Network.

    A simple memory buffer for a DQN. This one randomly selects state
    transitions with uniform probability, but research has gone into
    other methods. For instance, a weight could be given to each memory
    depending on how big of a difference there is between predicted Q values
    and target Q values.

    Args:
        memory_size (int): How many elements to hold in the memory buffer.
        batch_size (int): The number of elements to include in a replay batch.
        gamma (float): The "discount rate" used to assess Q values.
    """

    def __init__(self, memory_size, batch_size, gamma):
        self.buffer = deque(maxlen=memory_size)
        self.batch_size = batch_size
        self.gamma = gamma

    def current_memory(self):
        return len(self.buffer)

    def add(self, experience):
        """Adds an experience into the memory buffer.

        Args:
            experience: a (state, action, reward, state_prime, done) tuple.
        """
        self.buffer.append(experience)

    def sample(self):
        """Uniformly selects from the replay memory buffer.

        Uniformly and randomly selects experiences to train the neural
        network on. Transposes the experiences to allow batch math on
        the experience components.

        Returns:
            (tuple): The batch split by component, in the order
                (states, actions, rewards, state_primes, dones).
        """
        buffer_size = self.current_memory()
        index = np.random.choice(
            np.arange(buffer_size), size=self.batch_size, replace=False)

        # Columns have different data types, so transpose via an object array.
        batch = np.asarray([self.buffer[i] for i in index], dtype=np.object_).T.tolist()
        states_mb = tf.convert_to_tensor(np.array(batch[0], dtype=np.float32))
        actions_mb = np.array(batch[1], dtype=np.int8)
        rewards_mb = np.array(batch[2], dtype=np.float32)
        states_prime_mb = np.array(batch[3], dtype=np.float32)
        dones_mb = batch[4]
        return states_mb, actions_mb, rewards_mb, states_prime_mb, dones_mb

Let's make a fake buffer and play around with it! We'll add the memory into our game play code to start collecting experiences.

In [None]:
test_memory_size = 20
test_batch_size = 4
test_gamma = .9  # Unused here. For learning.

test_memory = Memory(test_memory_size, test_batch_size, test_gamma)

In [None]:
actions = [x % 2 for x in range(200)]
state = env.reset()[0]  # keep only the state so memories hold plain arrays
step = 0
episode_reward = 0
done = False

while not done and step < len(actions):
    action = actions[step]  # In the future, our agents will define this.
    state_prime, reward, done1, done2, info = env.step(action)
    done = done1 or done2
    episode_reward += reward
    test_memory.add((state, action, reward, state_prime, done))  # New line here
    step += 1
    state = state_prime
    print_state(state, step, reward)

end_statement = "Game over!" if done else "Ran out of actions!"
print(end_statement, "Score =", episode_reward)

Now, let's sample the memory by running the cell below multiple times. It's different each call, and that's on purpose. Just like with other neural networks, it's important to randomly sample so that our agent can learn from many different situations.

The use of a memory buffer is called [Experience Replay](https://arxiv.org/pdf/1511.05952.pdf). The above technique of a uniform random sample is a quick and computationally efficient way to get the job done, but RL researchers often look into other sampling methods. For instance, maybe there's a way to weight memories based on their rarity or loss when the agent learns with it.

In [None]:
test_memory.sample()
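
As one illustration of the alternative sampling methods mentioned above, here is a minimal sketch of proportional prioritization, in the spirit of the linked paper: memories with a larger gap between predicted and target Q values are drawn more often. The function names and the `alpha` and `epsilon` values are assumptions for illustration. This is not what the `Memory` class above does, and it leaves out the importance-sampling correction a full implementation would need.

```python
import random


def priority_probabilities(td_errors, alpha=0.6, epsilon=0.01):
    """Turns per-memory errors into sampling probabilities.

    epsilon keeps zero-error memories reachable; alpha controls how strongly
    the errors skew the sampling (alpha=0 would be uniform).
    """
    priorities = [(abs(err) + epsilon) ** alpha for err in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def prioritized_sample(buffer, td_errors, batch_size, rng=random):
    """Samples with replacement, favoring memories with large errors."""
    probs = priority_probabilities(td_errors)
    return rng.choices(buffer, weights=probs, k=batch_size)


probs = priority_probabilities([0.1, 0.5, 2.0])
print(probs)
print(prioritized_sample(["a", "b", "c"], [0.1, 0.5, 2.0], batch_size=4))
```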

But before the agent has any memories and has learned anything, how is it supposed to act? That comes down to [Exploration vs Exploitation](https://en.wikipedia.org/wiki/Multi-armed_bandit). The trouble is that in order to learn, the agent has to take risks with the unknown. There's no right answer, but there is a popular one: we'll start by acting randomly, and over time, we will slowly decay our chance to act randomly.
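
To get a feel for how fast a multiplicative decay moves from exploring to exploiting, this sketch counts how many decay steps it takes for the random-action chance to fall below a threshold. The helper name is hypothetical, and the decay rate must be below 1 for the loop to finish.

```python
def steps_until_epsilon_below(threshold, epsilon_decay, epsilon=1.0):
    """Counts multiplicative decay steps until epsilon drops below threshold."""
    steps = 0
    while epsilon >= threshold:
        epsilon *= epsilon_decay
        steps += 1
    return steps


# With a decay of 0.95 per action, starting from a 100% random agent:
print(steps_until_epsilon_below(0.5, epsilon_decay=0.95))   # 14
print(steps_until_epsilon_below(0.01, epsilon_decay=0.95))  # 90
```

So with a decay of 0.95, the agent is acting on its own predictions more often than not after only 14 decayed actions, which is less than two typical untrained episodes.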

Below is a partial version of the agent. 

In [None]:
class Partial_Agent():
    """Sets up a reinforcement learning agent to play in a game environment."""

    def __init__(self, network, memory, epsilon_decay, action_size):
        """Initializes the agent with DQN and memory sub-classes.

        Args:
            network: A neural network created from deep_q_network().
            memory: A Memory class object.
            epsilon_decay (float): The rate at which to decay random actions.
            action_size (int): The number of possible actions to take.
        """
        self.network = network
        self.action_size = action_size
        self.memory = memory
        self.epsilon = 1  # The chance to take a random action.
        self.epsilon_decay = epsilon_decay

    def act(self, state, training=False):
        """Selects an action for the agent to take given a game state.

        Args:
            state (list of numbers): The state of the environment to act on.
            training (bool): True if the agent is training.

        Returns:
            (int) The index of the action to take.
        """
        if training:
            # Random actions until enough simulations to train the model.
            if len(self.memory.buffer) >= self.memory.batch_size:
                self.epsilon *= self.epsilon_decay

            if self.epsilon > np.random.rand():
                # print("Exploration!")
                return random.randint(0, self.action_size - 1)

        # If not acting randomly, take action with highest predicted value.
        # print("Exploitation!", state)
        state_batch = np.expand_dims(state, axis=0)
        predict_mask = np.ones((1, self.action_size,))
        action_qs = self.network.predict([state_batch, predict_mask], verbose=0)
        return np.argmax(action_qs[0])

Let's define the agent and get a starting state to see how it would act without any training.

In [None]:
state = env.reset()[0]

# Define "brain"
space_shape = env.observation_space.shape
action_size = env.action_space.n

# Feel free to play with these
test_learning_rate = .2
test_hidden_neurons = 10
test_epsilon_decay = .95

test_network = deep_q_network(
    space_shape, action_size, test_learning_rate, test_hidden_neurons)

test_agent = Partial_Agent(
    test_network, test_memory, test_epsilon_decay, action_size)

Run the cell below multiple times. Since we're decaying the random action rate after every action, it's only a matter of time before the agent exploits more than it explores.

In [None]:
action = test_agent.act(state, training=True)
print("Push Right" if action else "Push Left")

Memories, a brain, and a healthy dose of curiosity. We finally have all the ingredients for our agent to learn. After all, as the Scarecrow from the Wizard of Oz said:

"Everything in life is unusual until you get accustomed to it."  
~L. Frank Baum

Below is the code used by our agent to learn, where the Bellman Equation at last makes an appearance. We'll run through the following steps.

1. Pull a batch from memory
2. Get the Q value (the output of the neural network) based on the memory's ending state
    - Evaluate all actions and assume the agent takes the one with the highest Q value
3. Update these Q values with the Bellman Equation
    - `target_qs = (next_q_mb * self.memory.gamma) + reward_mb`
    - If the state is the end of the game, set the target_q to the reward for entering the final state.
4. Reshape the target_qs to match the network's output
    - Only learn on the memory's corresponding action by setting all action nodes to zero besides the action node taken.
5. Fit Target Qs as the label to our model against the memory's starting state and action as the inputs.

In [None]:
def learn(self):
    """Trains the Deep Q Network based on stored experiences."""
    batch_size = self.memory.batch_size
    if len(self.memory.buffer) < batch_size:
        return None

    # Obtain random mini-batch from memory.
    state_mb, action_mb, reward_mb, next_state_mb, done_mb = (
        self.memory.sample())

    # Get Q values for next_state.
    predict_mask = np.ones(action_mb.shape + (self.action_size,))
    next_q_mb = self.network.predict([next_state_mb, predict_mask], verbose=0)
    next_q_mb = tf.math.reduce_max(next_q_mb, axis=1)

    # Apply the Bellman Equation
    target_qs = (next_q_mb * self.memory.gamma) + reward_mb
    target_qs = tf.where(done_mb, reward_mb, target_qs)

    # Match training batch to network output:
    # target_q where action taken, 0 otherwise.
    action_mb = tf.convert_to_tensor(action_mb, dtype=tf.int32)
    action_hot = tf.one_hot(action_mb, self.action_size)
    target_mask = tf.multiply(tf.expand_dims(target_qs, -1), action_hot)

    return self.network.train_on_batch(
        [state_mb, action_hot], target_mask)


Partial_Agent.learn = learn
test_agent = Partial_Agent(
    test_network, test_memory, test_epsilon_decay, action_size)
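
The Bellman update and masking steps inside `learn` can be traced by hand on a tiny batch. The sketch below redoes them in plain Python with made-up numbers. The helper names are hypothetical, and lists stand in for the tensors used above.

```python
def bellman_targets(rewards, next_q_values, dones, gamma):
    """Computes target Q values: reward + gamma * best next Q, or just the
    reward when the transition ended the game."""
    targets = []
    for reward, next_qs, done in zip(rewards, next_q_values, dones):
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(next_qs))
    return targets


def mask_targets(targets, actions, action_size):
    """Places each target on the node of the action taken, 0 elsewhere."""
    return [[target if a == action else 0.0 for a in range(action_size)]
            for target, action in zip(targets, actions)]


targets = bellman_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],  # hypothetical network outputs
    dones=[False, True],
    gamma=0.9)
masked = mask_targets(targets, actions=[1, 0], action_size=2)
print(targets)
print(masked)
```

The first memory gets a target of about 2.8 (1 + 0.9 × 2.0) on the "right" node. The second ended the game, so its target is just the reward of 1.0 on the "left" node.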

Nice! We finally have an intelligence that can walk and talk and... well ok, this intelligence is too simple to be able to do those things, but maybe it can learn to push a cart with a pole on it. Let's update our training loop to use our new agent.

The cells below wrap the game loop in helper functions so the agent can be trained over many episodes and then evaluated.

In [None]:
def play_a_episode(test_agent, training):
    state = env.reset()[0]
    step = 0
    episode_reward = 0
    done = False
    while not done:
        action = test_agent.act(state, training=training)
        state_prime, reward, done1, done2, info = env.step(action)
        done = done1 or done2
        episode_reward += reward
        test_agent.memory.add((state, action, reward, state_prime, done))  # New line here
        step += 1
        state = state_prime
    return episode_reward

In [69]:
def train_agent(agent, train_episode):
    total_step = 0
    episode = 0
    while episode < train_episode:
        episode += 1
        episode_reward = play_a_episode(agent, True)
        total_step += episode_reward
        print("Episode:", episode, "Game over! Score =", episode_reward)
        print("Train episode:", episode, "memory step:", agent.memory.current_memory(), " loss:", agent.learn())
        if total_step % 1000 == 0:
            print("Train on steps:", total_step, " loss:", agent.learn())

In [70]:
learning_rate = 0.0001
hidden_neurons = 8
epsilon_decay = .95
memory_size = 102400
test_batch_size = 8
gamma = .9
train_steps = 10000
train_episode = 100
network = deep_q_network(
    space_shape, action_size, learning_rate, hidden_neurons)
memory = Memory(memory_size, test_batch_size, gamma)
agent = Partial_Agent(
        network, memory, epsilon_decay, action_size)

In [71]:
train_agent(agent, train_episode)

Episode: 1 Game over! Score = 26.0
Train episode: 1 memory step: 26  loss: 0.5395218
Episode: 2 Game over! Score = 17.0
Train episode: 2 memory step: 43  loss: 0.5436908
Episode: 3 Game over! Score = 8.0
Train episode: 3 memory step: 51  loss: 0.528027
Episode: 4 Game over! Score = 9.0
Train episode: 4 memory step: 60  loss: 0.5268538
Episode: 5 Game over! Score = 9.0
Train episode: 5 memory step: 69  loss: 0.54711705
Episode: 6 Game over! Score = 12.0
Train episode: 6 memory step: 81  loss: 0.5742945
Episode: 7 Game over! Score = 8.0
Train episode: 7 memory step: 89  loss: 0.56858736
Episode: 8 Game over! Score = 10.0
Train episode: 8 memory step: 99  loss: 0.5577821
Episode: 9 Game over! Score = 10.0
Train episode: 9 memory step: 109  loss: 0.5680715
Episode: 10 Game over! Score = 9.0
Train episode: 10 memory step: 118  loss: 0.5713593
Episode: 11 Game over! Score = 9.0
Train episode: 11 memory step: 127  loss: 0.584243
Episode: 12 Game over! Score = 9.0
Train episode: 12 memory step

In [72]:
def evaluate_agent(agent, episode):
    total_score = 0
    for i in range(episode):
        episode_reward = play_a_episode(agent, training=False)
        print("Run:", i, " Game over! Score =", episode_reward)
        total_score += episode_reward
    print("AVR :", total_score / episode)

In [73]:
evaluate_agent(agent, 10)

Run: 0  Game over! Score = 10.0
Run: 1  Game over! Score = 10.0
Run: 2  Game over! Score = 9.0
Run: 3  Game over! Score = 10.0
Run: 4  Game over! Score = 10.0
Run: 5  Game over! Score = 9.0
Run: 6  Game over! Score = 11.0
Run: 7  Game over! Score = 10.0
Run: 8  Game over! Score = 8.0
Run: 9  Game over! Score = 10.0
AVR : 9.7


## Hypertuning

Chances are, at this point, the agent is having a tough time learning. Why is that? Well, remember that hyperparameter tuning job we kicked off at the start of this notebook?

There are many parameters that need adjusting with our agent. Let's recap:
* The number of `episodes` or full runs of the game to train on
* The neural network's `learning_rate`
* The number of `hidden_neurons` to use in our network
* `gamma`, or how much we want to discount the future value of states
* How quickly we want to switch from explore to exploit with `explore_decay`
* The size of the memory buffer, `memory_size`
* The number of memories to pull from the buffer when training, `memory_batch_size`

These all have been added as flags to pass to the model in `trainer/trainer.py`'s `_parse_arguments` method. For the most part, `trainer/trainer.py` follows the structure of the training loop that we have above, but it does have a few extra bells and whistles, like a hook into TensorBoard and video output.

In [None]:
import argparse


def _parse_arguments(argv):
    """Parses command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--game',
        help='Which open ai gym game to play',
        type=str,
        default='CartPole-v0')
    parser.add_argument(
        '--episodes',
        help='The number of episodes to simulate',
        type=int,
        default=200)
    parser.add_argument(
        '--learning_rate',
        help='Learning rate for the neural network',
        type=float,
        default=0.2)
    parser.add_argument(
        '--hidden_neurons',
        help='The number of neurons to use per layer',
        type=int,
        default=30)
    parser.add_argument(
        '--gamma',
        help='The gamma or "discount" factor to discount future states',
        type=float,
        default=0.5)
    parser.add_argument(
        '--explore_decay',
        help='The rate at which to decay the probability of a random action',
        type=float,
        default=0.1)
    parser.add_argument(
        '--memory_size',
        help='Size of the memory buffer',
        type=int,
        default=100000)
    parser.add_argument(
        '--memory_batch_size',
        help='The amount of memories to sample from the buffer while training',
        type=int,
        default=8)
    parser.add_argument(
        '--job-dir',
        help='Directory where to save the given model',
        type=str,
        default='models/')
    parser.add_argument(
        '--print_rate',
        help='How often to print the score, 0 if never',
        type=int,
        default=0)
    parser.add_argument(
        '--eval_rate',
        help="""While training, perform an on-policy simulation and record
        metrics to tensorboard every <eval_rate> episodes, 0 if never. Use
        higher values to avoid hyperparameter tuning "too many metrics"
        error""",
        type=int,
        default=20)
    return parser.parse_known_args(argv)
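
Note that `parse_known_args` returns a pair rather than a single namespace, so flags the parser doesn't recognize are collected instead of raising an error. A cut-down sketch with only two of the flags above (the extra flag name is made up):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gamma', type=float, default=0.5)
parser.add_argument('--episodes', type=int, default=200)

# parse_known_args returns (namespace, leftover_args) instead of raising an
# error on flags it does not recognize.
args, unknown = parser.parse_known_args(
    ['--gamma', '0.9', '--some_other_flag', '1'])
print(args.gamma, args.episodes, unknown)
```

Here `--gamma` is overridden, `--episodes` falls back to its default, and the unrecognized flag ends up in `unknown`.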

Geez, that's a lot. And like with other machine learning methods, there's no hard and fast rule for setting them; the best values are problem dependent. Plus, there are many more parameters we could explore, like the number of layers, learning rate decay, and so on.

We can tell Google Cloud how to explore the hyperparameter tuning space with a [config file](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#HyperparameterSpec). The `hyperparam.yaml` file in this directory is exactly that. It specifies which metric to optimize (in this case, the `episode_reward`) and the range for the different flags we want to tune.

In our code, we'll add the following

`import hypertune  #From cloudml-hypertune library`

`hpt = hypertune.HyperTune()  # Initialized before looping through episodes`

`# Placed right before the end of the training loop
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='episode_reward',
    metric_value=reward,
    global_step=episode)`
  
This way, at the end of every episode, we can send information to the tuning service on how the agent is doing. The service can only handle so much information being thrown at it at once, so we'll add an `eval_rate` flag to throttle information to every `eval_rate` episodes.
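
The throttling itself can be as simple as a modulo check. The sketch below is an assumption about how such a check might look, not a copy of `trainer/trainer.py`, and `should_report` is a hypothetical name.

```python
def should_report(episode, eval_rate):
    """Returns True on every `eval_rate`-th episode, never when eval_rate is 0."""
    return eval_rate > 0 and episode % eval_rate == 0


# Episodes that would send a metric during a 100-episode run.
reported = [ep for ep in range(1, 101) if should_report(ep, eval_rate=20)]
print(reported)  # [20, 40, 60, 80, 100]
```

In a training loop, the call to `hpt.report_hyperparameter_tuning_metric(...)` would sit behind this check.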

It is definitely a worthwhile exercise to try and find the optimal set of parameters on one's own, but if life is too short, and there isn't time for that, the hyperparameter tuning job should now be complete. Head on over to [Google Cloud's AI Platform](https://console.cloud.google.com/ai-platform/jobs) to see the job labeled `dqn_on_gcp_<time_this_lab_was_started>`.

Click on the job name to see the results. Information comes in as each trial is complete, and the best performing trial will be listed on the top.

<img src="images/hypertune_trials.jpg" width="966" height="464">

Logs can be invaluable when debugging. Click the three dots to the right of one of the trials to filter logs by that particular trial.

At last, let's see the results of the best trial. Keep in mind the best trial number and navigate over to [your bucket](https://console.cloud.google.com/storage/browser). The results will be in a folder with the same Job Name as your hyperparameter tuning job. In that folder, there will be a number of subfolders equal to the number of hyperparameter tuning trials. Select the folder with your best performing `Trial Id`.

<img src="images/best_trial.jpg" width="956" height="456">

There should be a number of goodies in the folder, including TensorBoard information in `/train`, a saved model in `saved_model.pb`, and a recording of the model in `recording.mp4`.

Open the [Google Cloud Shell](https://console.cloud.google.com/home/dashboard?cloudshell=true&_ga=2.207467987.-157492093.1570741979) and run Tensorboard with

`tensorboard --logdir=gs://<your-bucket>/<job-name>/<path-best-trial>`

The episode rewards and training loss are displayed for the trial in intervals of 20 episodes.

<img src="images/tensorboard.jpg" width="910" height="708">

Click `recording.mp4` in your bucket to visually see how the model performed! How did it do? If you're not proud of your little robot, check out the recordings of the other trials to see how it decimates the competition.

Congratulations on making a Deep Q Agent! That's it for now, but this is just scratching the surface of Reinforcement Learning. OpenAI Gym has plenty of other [environments](https://gym.openai.com/envs/#classic_control). See if you can conquer them with your new skills!

Copyright 2021 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.