<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#From-Q-Learning-to-Deep-Q-Learning" data-toc-modified-id="From-Q-Learning-to-Deep-Q-Learning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>From Q-Learning to Deep Q-Learning</a></span></li><li><span><a href="#The-Deep-Q-Network-(DQN)" data-toc-modified-id="The-Deep-Q-Network-(DQN)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The Deep Q-Network (DQN)</a></span><ul class="toc-item"><li><span><a href="#Preprocessing-the-input-and-temporal-limitation" data-toc-modified-id="Preprocessing-the-input-and-temporal-limitation-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Preprocessing the input and temporal limitation</a></span></li></ul></li><li><span><a href="#The-Deep-Q-Algorithm" data-toc-modified-id="The-Deep-Q-Algorithm-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Deep Q Algorithm</a></span><ul class="toc-item"><li><span><a href="#Experience-Replay-to-make-more-efficient-use-of-experiences" data-toc-modified-id="Experience-Replay-to-make-more-efficient-use-of-experiences-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Experience Replay to make more efficient use of experiences</a></span></li><li><span><a href="#Fixed-Q-Target-to-stabilize-the-training" data-toc-modified-id="Fixed-Q-Target-to-stabilize-the-training-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Fixed Q-Target to stabilize the training</a></span></li><li><span><a href="#Double-DQN-to-handle-the-problem-of-overestimation-of-the-Q-values" data-toc-modified-id="Double-DQN-to-handle-the-problem-of-overestimation-of-the-Q-values-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Double DQN to handle the problem of overestimation of the Q-values</a></span></li></ul></li><li><span><a href="#Glossary" data-toc-modified-id="Glossary-5"><span 
class="toc-item-num">5&nbsp;&nbsp;</span>Glossary</a></span></li><li><span><a href="#Hands-on" data-toc-modified-id="Hands-on-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Hands-on</a></span><ul class="toc-item"><li><span><a href="#SpacesInvadersNoFrameskip-v4" data-toc-modified-id="SpacesInvadersNoFrameskip-v4-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>SpacesInvadersNoFrameskip-v4</a></span></li></ul></li><li><span><a href="#Quiz" data-toc-modified-id="Quiz-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Quiz</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Additional-Readings" data-toc-modified-id="Additional-Readings-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Additional Readings</a></span></li></ul></div>

# Introduction

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" style="width:600px;" title="Unit 3 thumbnail">

In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, <b>implemented it from scratch</b>, and trained it in two environments, `FrozenLake-v1` ☃️ and `Taxi-v3` 🚕.

We got excellent results with this simple algorithm, but these environments were relatively simple because the <b>state space was discrete and small </b>(16 different states for FrozenLake-v1 and 500 for Taxi-v3). For comparison, the state space in Atari games can <b>contain $10^{9}$ to $10^{11}$ states.</b>

But as we’ll see, <span style="color:red">producing and updating a <b>Q-table can become ineffective in large state space environments.</b></span>

So in this unit, <span style="color:blue"><b>we’ll study our first Deep Reinforcement Learning agent</b>: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a Neural Network that takes a state and approximates Q-values for each action based on that state.</span>

And <b>we’ll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)</b>, a training framework for RL using Stable-Baselines3 that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.

# From Q-Learning to Deep Q-Learning

We learned that <b>Q-Learning is an algorithm we use to train our Q-Function, an action-value function</b> that determines the value of being at a particular state and taking a specific action at that state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" style="width:700px;" title="Q-function">

The <b>Q comes from “the Quality” of that action at that state.</b>

Internally, our Q-function is encoded by a <b>Q-table, a table where each cell corresponds to a state-action pair value</b>. Think of this Q-table as the <b>memory or cheat sheet of our Q-function.</b>

<span style="color:red">The problem is that Q-Learning is a <i>tabular method</i>. This becomes a problem if the state and action spaces <b>are not small enough to be represented efficiently by arrays and tables</b>. In other words: it is <b>not scalable</b></span>. Q-Learning worked well with small state space environments like:

- FrozenLake, we had 16 states.
- Taxi-v3, we had 500 states.

But think of what we’re going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.

As [Nikita Melkozerov mentioned](https://twitter.com/meln1k), <b>Atari environments</b> have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us $256^{210 \times160 \times 3} =  256^{100800}$ possible observations (for comparison, we have approximately $10^{80}$ atoms in the observable universe).

\* A single frame in Atari is composed of an image of 210x160 pixels. Given that the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
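
To get a feel for how large $256^{100800}$ is, here is a small sketch that works with the base-10 logarithm instead of the number itself. The variable names are only illustrative.

```python
import math

# Shape of one raw Atari frame: height x width x RGB channels.
height, width, channels = 210, 160, 3
values_per_entry = 256  # each entry is an integer in [0, 255]

entries = height * width * channels  # number of values in one frame

# 256**entries is far too large to print, so we work with its base-10 logarithm.
log10_observations = entries * math.log10(values_per_entry)
num_digits = math.floor(log10_observations) + 1

print(f"values per frame: {entries}")
print(f"decimal digits in the number of possible observations: {num_digits}")
```

The count of possible observations has hundreds of thousands of decimal digits, while $10^{80}$ has only 81.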

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" style="width:600px;" title="Atari State Space">

Therefore, <span style="color:blue">the state space is gigantic;</span><span style="color:red"> due to this, creating and updating a Q-table for that environment would not be efficient.</span><span style="color:green"> In this case, the best idea is to approximate the Q-values using a parametrized Q-function $Q_{\theta}(s,a)$.</span>

This <b>neural network will approximate, given a state, the different Q-values for each possible action at that state. And that’s exactly what Deep Q-Learning does.</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" style="width:600px;" title="Deep Q Learning">

Now that we understand Deep Q-Learning, let’s dive deeper into the Deep Q-Network.

# The Deep Q-Network (DQN)

This is the architecture of our Deep Q-Learning network:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" style="width:600px;" title="Deep Q Network">

As input, we take a <b>stack of 4 frames</b> passed through the network as a state and output a <b>vector of Q-values for each possible action at that state</b>. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
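
As a minimal sketch of that last step, assuming the network output is available as a plain list of Q-values (the numbers below are made up), epsilon-greedy selection could look like this:

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 1.7, -0.3, 0.9]  # hypothetical network output for 4 actions
greedy_action = epsilon_greedy_action(q_values, epsilon=0.0)
random_action = epsilon_greedy_action(q_values, epsilon=1.0, rng=random.Random(0))
```

With `epsilon=0.0` the agent always exploits; with `epsilon=1.0` it always explores.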

When the Neural Network is initialized, <b>the Q-value estimation is terrible</b>. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and <b>learn to play the game well.</b>

## Preprocessing the input and temporal limitation

We need to <b>preprocess the input</b>. It’s an <span style="color:blue">essential step since we want to <b>reduce the complexity of our state to reduce the computation time needed for training.</b></span>

To achieve this, we <b>downscale each frame to 84x84 and convert it to grayscale</b>. We can do this since the colors in Atari environments don’t add important information. This is a big improvement since we <b>reduce our three color channels (RGB) to 1.</b>

> <span style="color:blue"><b>Roughly a 14x reduction</b>: from $210 \times 160 \times 3 = 100{,}800$ values in a frame to $84 \times 84 \times 1 = 7{,}056$ values.</span>

We can also <b>crop a part of the screen in some games</b> if it does not contain important information. Then we stack four frames together.
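
Here is a rough sketch of the grayscale and downscaling steps, assuming NumPy is available. The luminance weights and the nearest-neighbour resize are simplifications chosen for illustration; real preprocessing pipelines typically use an interpolated resize.

```python
import numpy as np


def preprocess(frame, size=84):
    """Turn an RGB frame (H, W, 3) of uint8 into a (size, size) grayscale uint8 image."""
    # Luminance-weighted grayscale: collapses the 3 color channels into 1.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsampling: keep evenly spaced rows and columns.
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    small = gray[np.ix_(rows, cols)]
    return small.astype(np.uint8)


# A random frame stands in for a real Atari observation.
raw_frame = np.random.default_rng(0).integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = preprocess(raw_frame)
print(raw_frame.size, "->", processed.size)
```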

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" style="width:600px;" title="Preprocessing">

<span style="color:blue"><b>Why do we stack four frames together</b>? We stack frames together because it helps us <b>handle the problem of temporal limitation</b></span>. Let’s take an example with the game of Pong. When you see this frame:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" style="width:600px;" title="Temporal Limitation">

Can you tell me where the ball is going? No, because one frame is not enough to have a sense of motion! But what if I add three more frames? <b>Here you can see that the ball is going to the right.</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" style="width:600px;" title="Temporal Limitation">

That’s why, to capture temporal information, we stack four frames together.

Then the stacked frames are processed by three convolutional layers. These layers <b>allow us to capture and exploit spatial relationships in images</b>. But also, because the frames are stacked together, <b>we can exploit some temporal properties across those frames.</b>

If you don’t know what convolutional layers are, don’t worry. You can check out [Lesson 4 of this free Deep Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188).

Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
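
The text above does not give the layer sizes, so the following sketch assumes the configuration reported in the DQN paper (32 filters of 8x8 with stride 4, 64 filters of 4x4 with stride 2, 64 filters of 3x3 with stride 1, no padding). It only traces how the shape of the stacked input shrinks through the three convolutional layers.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride): assumed values, not taken from these notes.
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 input made of 4 stacked grayscale frames
shapes = [(channels, size, size)]
for out_channels, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    channels = out_channels
    shapes.append((channels, size, size))

# Number of features fed to the first fully connected layer.
flattened = channels * size * size
for shape in shapes:
    print(shape)
print("flattened features:", flattened)
```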

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" style="width:600px;" title="Deep Q Network">

So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Now let’s study the Deep Q-Learning algorithm.

# The Deep Q Algorithm

We learned that Deep Q-Learning <b>uses a deep neural network to approximate the different Q-values for each possible action at a state </b>(value-function estimation).

The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we have done with Q-Learning:

$$Q^{new}(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha[R_{t+1} + \gamma \text{max}_{a}Q(S_{t+1}, a) - Q(S_{t}, A_{t})] $$

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" style="width:700px;" title="Q Loss">

in Deep Q-Learning, we create a <b>loss function that compares our Q-value prediction and the Q-target and uses gradient descent to update the weights of our Deep Q-Network to approximate our Q-values better.</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" style="width:700px;" title="Q-target">

- <b>Q-Target</b> or <b>TD Target</b>: 
    - $y_{j} = r_{j} + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-})$ (notation of the DQN pseudocode)
    - $y_{j} = R_{t+1} + \gamma \max_{a}Q(S_{t+1},a)$ (notation of the Q-Learning unit)
- <b>Q-value prediction</b>
    - $Q(\phi_{j}, a_{j}; \theta)$
- <b>Q-loss (a.k.a. TD error)</b>: Q-target - Q-prediction
    - $y_{j} - Q(\phi_{j}, a_{j}; \theta)$
    - $R_{t+1} + \gamma \max_{a}Q(S_{t+1},a) - Q(S_{t}, A_{t})$

where<br>
- <span style="font-size:large">$\phi$</span> is the (preprocessed) state<br>
    - <span style="font-size:large">$\phi_{j}$</span> is the state at step $j$ and $\phi_{j+1}$ is the state at step $j+1$<br>
- <span style="font-size:large">$a_{j}$</span> is the action at step $j$<br>
- <span style="font-size:large">$r_{j}$</span> is the reward at step $j$<br>
- <span style="font-size:large">$\theta$</span> are the weights of the Deep Q-Network being trained, and <span style="font-size:large">$\theta^{-}$</span> are the weights of the target network, a copy that is kept fixed between periodic updates (see [Fixed Q-target](#fixed-q-target))<br>
- <span style="font-size:large">$\hat{Q}$</span> is the target network; the hat distinguishes it from the network $Q$ being trained<br>
- <span style="font-size:large">$y_{j}$</span> is the <span style="color:blue">Q-Target</span><br>
- <span style="font-size:large">$Q(\phi_{j}, a_{j}; \theta)$</span> is the <span style="color:blue">current Q-value (estimation of Q)</span>, predicted by our Deep Q-Network, whose weights $\theta$ are learned using gradient descent
- <span style="font-size:large">$y_{j} - Q(\phi_{j}, a_{j}; \theta)$</span> is the Q-Loss a.k.a. TD Error.
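
To make the quantities above concrete, here is a small sketch that computes Q-targets and a loss for a hand-made batch. Two choices are assumptions made for the example: terminal transitions use $y_{j} = r_{j}$ (no bootstrapping after the episode ends), and the loss is the mean of the squared TD errors. All numbers are invented.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """y_j = r_j if the episode ended, else r_j + gamma * max_a' Q_hat(phi_{j+1}, a')."""
    return [
        r if done else r + gamma * max(q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def mean_squared_td_error(targets, predictions):
    """Average of (y_j - Q(phi_j, a_j; theta))^2 over the batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


# Hypothetical batch of two transitions with three possible actions each.
rewards = [1.0, 0.0]
next_q_values = [[0.5, 2.0, 1.0], [3.0, 1.0, 0.0]]  # target-network outputs for phi_{j+1}
dones = [False, True]
predictions = [2.0, 0.5]  # Q(phi_j, a_j; theta) for the actions actually taken

targets = td_targets(rewards, next_q_values, dones, gamma=0.9)
loss = mean_squared_td_error(targets, predictions)
print(targets, loss)
```

In a real agent, gradient descent on this loss would update only $\theta$; the targets are treated as constants.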

The Deep Q-Learning training algorithm has <i>two phases</i>:

- <b>Sampling</b>: We perform actions and <b>store the observed experience tuples in a replay memory.</b>
- <b>Training</b>: Select a <b>small batch of tuples randomly and learn from this batch using a gradient descent update step.</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" style="width:700px;" title="Sampling Training">

This is not the only difference compared with Q-Learning. <span style="color:red">Deep Q-Learning training <b>might suffer from instability</b>, mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return).</span>

To help us stabilize the training, we implement three different solutions:

1. <i>Experience Replay</i> to make more <b>efficient use of experiences.</b>
2. <i>Fixed Q-Target</i> <b>to stabilize the training.</b>
3. <i>Double Deep Q-Learning</i>, to <b>handle the problem of the overestimation of Q-values.</b>

Let’s go through them!

## Experience Replay to make more efficient use of experiences

Why do we create a replay memory?

Experience Replay in Deep Q-Learning has two functions:

1. <b>Make more efficient use of the experiences during the training</b>. <span style="color:red">Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.</span>

<span style="color:green">Experience replay helps by <b>using the experiences of the training more efficiently.</b> We use a replay buffer that saves experience samples <b>that we can reuse during the training.</b></span>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" style="width:700px;" title="Experience Replay">

⇒ This allows the agent to <b>learn from the same experiences multiple times.</b>

2. <b>Avoid forgetting previous experiences and reduce the correlation between experiences.</b>

- <span style="color:red">The problem we get if we give sequential samples of experiences to our neural network is that it tends to forget the <b>previous experiences as it gets new experiences</b>. For instance, if the agent is in the first level and then in the second, which is different, it can forget how to behave and play in the first level.</span>

<span style="color:green">The solution is to create a Replay Buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents <b>the network from only learning about what it has done immediately before.</b></span>

<span style="color:green">Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and avoid <b>action values from oscillating or diverging catastrophically.</b></span>

In the Deep Q-Learning pseudocode, we <b>initialize a replay memory buffer D with capacity N</b> (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" style="width:700px;" title="Experience Replay Pseudocode">
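
A replay memory with capacity N can be sketched with a bounded queue. The class and method names below are illustrative, not taken from any library.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity):
        # Once the buffer is full, the oldest experience is evicted first.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Integers stand in for states; real states would be stacked frames.
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Because sampling is independent of insertion order, the same experience can appear in many different batches before it is evicted.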

<a id="fixed-q-target"></a>
## Fixed Q-Target to stabilize the training

When we want to calculate the TD error (aka the loss), we calculate the <b>difference between the TD target (Q-Target) and the current Q-value (estimation of Q).</b>

<span style="color:red">But we <b>don’t have any idea of the real TD target.</b></span> We need to estimate it. <span style="color:green">Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.</span>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" style="width:700px;" title="Q-target">

<span style="color:red">However, the problem is that we are using the same parameters (weights) for estimating the TD target ($y_{j}$) and the Q-value ($Q(\phi_{j}, a_{j}; \theta)$). Consequently, there is a significant correlation between the TD target and the parameters we are changing.</span>

Therefore, at every step of training,<b> both our Q-values and the target values shift</b>. We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This can lead to significant oscillation in training.

It’s like if you were a cowboy (the Q estimation) and you wanted to catch a cow (the Q-target). Your goal is to get closer (reduce the error).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" style="width:400px;" title="Q-target">

At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-2.jpg" style="width:400px;" title="Q-target">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-3.jpg" style="width:400px;" title="Q-target">

This leads to a bizarre path of chasing (a significant oscillation in training).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-4.jpg" style="width:400px;" title="Q-target">

Instead, what we see in the pseudo-code is that we:

- Use a <b>separate network with fixed parameters</b> for estimating the TD Target
- <b>Copy the parameters from our Deep Q-Network every C steps</b> to update the target network.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" style="width:700px;" title="Fixed Q-target Pseudocode">
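
The two bullet points above can be sketched as follows. The "network" here is just a dictionary with one weight, and the gradient step is replaced by a stand-in function, so this only illustrates the copy-every-C-steps bookkeeping.

```python
import copy


class FixedTargetTrainer:
    def __init__(self, online_params, sync_every):
        self.online_params = online_params
        # theta^- starts out equal to theta.
        self.target_params = copy.deepcopy(online_params)
        self.sync_every = sync_every  # the C of the pseudocode
        self.steps = 0

    def train_step(self, update):
        # Stand-in for a gradient step: `update` changes the online parameters only.
        update(self.online_params)
        self.steps += 1
        if self.steps % self.sync_every == 0:
            # Every C steps, copy theta into theta^-.
            self.target_params = copy.deepcopy(self.online_params)


def add_one(params):
    params["w"] += 1.0


trainer = FixedTargetTrainer({"w": 0.0}, sync_every=3)
history = []
for _ in range(7):
    trainer.train_step(add_one)
    history.append((trainer.online_params["w"], trainer.target_params["w"]))
print(history)
```

The online weight changes at every step, while the target weight only jumps at steps 3 and 6: the cow stands still between updates.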

## Double DQN to handle the problem of overestimation of the Q-values

Double DQNs, or Double Deep Q-Learning neural networks, were introduced by [Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method <b>handles the problem of the overestimation of Q-values.</b>

To understand this problem, remember how we calculate the TD Target:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" style="width:600px;" title="TD target">

<span style="color:red">We face a simple problem by calculating the TD target: how are we sure that <b>the best action for the next state is the action with the highest Q-value?</b></span>

We know that the accuracy of Q-values depends on what action we tried <b>and</b> what neighboring states we explored.

<span style="color:red">Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly <b>given a higher Q value than the optimal best action, the learning will be complicated.</b></span>
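
A small simulation makes this bias visible. It assumes a toy setting in which every action is truly worth 0 and each Q-value estimate is corrupted by independent Gaussian noise; the seed and sample sizes are arbitrary.

```python
import random
import statistics

rng = random.Random(0)
n_actions, n_trials = 4, 10_000
true_q = [0.0] * n_actions  # every action is truly worth 0

max_estimates = []
for _ in range(n_trials):
    # Each estimate is the true value plus zero-mean noise.
    noisy_q = [q + rng.gauss(0.0, 1.0) for q in true_q]
    max_estimates.append(max(noisy_q))

mean_of_max = statistics.fmean(max_estimates)
print(f"true max Q: {max(true_q):.2f}, average estimated max Q: {mean_of_max:.2f}")
```

Even though each individual estimate is unbiased, the maximum over the noisy estimates is on average clearly above the true maximum of 0.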

<span style="color:green">The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation.</span> We:
- Use our <b>DQN network</b> to select the best action to take for the next state (the action with the highest Q-value).
- Use our <b>Target network</b> to calculate the target Q-value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.
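
The difference between the two targets fits in a few lines. The Q-values below are invented so that the two networks disagree about the best next action.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, target_q_next, gamma):
    # The target network both selects and evaluates the next action.
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, online_q_next, target_q_next, gamma):
    # The online (DQN) network selects the action; the target network evaluates it.
    best_action = argmax(online_q_next)
    return reward + gamma * target_q_next[best_action]


# Hypothetical Q-values for the next state (3 actions).
online_q_next = [1.0, 3.0, 2.0]
target_q_next = [5.0, 0.5, 4.0]

y_dqn = dqn_target(1.0, target_q_next, gamma=0.9)
y_double = double_dqn_target(1.0, online_q_next, target_q_next, gamma=0.9)
print(y_dqn, y_double)
```

Since the evaluated action is not necessarily the target network's own maximum, the Double DQN target can never exceed the standard one for the same inputs.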

Since these three improvements in Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They’re out of the scope of this course but if you’re interested, check the links we put in the reading list.

# Glossary

- <b>Tabular Method</b>: Type of method in which the state and action spaces are small enough for the value functions to be represented as arrays and tables. <b>Q-learning</b> is an example of a tabular method since a table is used to represent the values of the different state-action pairs.

- <b>Deep Q-Learning</b>: Method that trains a neural network to approximate, given a state, the different <b>Q-values</b> for each possible action at that state. It is used to solve problems where the observation space is too big for a tabular Q-Learning approach.

- <b>Temporal Limitation</b> is a difficulty presented when the environment state is represented by frames. A frame by itself does not provide temporal information. In order to obtain temporal information, we need to <b>stack</b> a number of frames together.

- <b>Phases of Deep Q-Learning:</b>
    - <b>Sampling</b>: Actions are performed, and observed experience tuples are stored in a <b>replay memory.</b>
    - <b>Training</b>: Batches of tuples are selected randomly and the neural network updates its weights using gradient descent.

- <b>Solutions to stabilize Deep Q-Learning:</b>
    - <b>Experience Replay</b>: A replay memory is created to save experience samples that can be reused during training. This allows the agent to learn from the same experiences multiple times. It also helps the agent avoid forgetting previous experiences as it gets new ones.
    - <b>Random sampling</b> from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging catastrophically.
    - <b>Fixed Q-Target</b>: To calculate the <b>Q-Target</b>, we need to estimate the discounted optimal <b>Q-value</b> of the next state using the Bellman equation. The problem is that the same network weights are used to calculate the <b>Q-Target</b> and the <b>Q-value</b>. This means that every time we modify the <b>Q-value</b>, the <b>Q-Target</b> moves with it. To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying the parameters from our Deep Q-Network every <b>C steps</b>.

    - <b>Double DQN</b>: Method to handle the <b>overestimation of Q-Values</b>. This solution uses two networks to decouple the action selection from the target <b>Q-Value generation:</b>
        - <b>DQN Network</b> to select the best action to take for the next state (the action with the highest <b>Q-Value</b>).
        - <b>Target Network</b> to calculate the target <b>Q-Value</b> of taking that action at the next state. This approach reduces the overestimation of Q-Values, which helps to train faster and makes learning more stable.

# Hands-on

## SpacesInvadersNoFrameskip-v4

- My [model card](https://huggingface.co/prasanthntu/q-Taxi-v3) on Hugging Face.
    - It can be found under the `Reinforcement learning` [model libraries](https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=trending&search=prasanthntu) section.
- My [source code](https://colab.research.google.com/drive/1SSmGmegUEqArlttm9nM1omkkBefxt-VU?usp=sharing) in Google Colab stored in Google Drive for building the model.

<b>Summary</b>
- <b>Goal</b>:  Train our agent to navigate to passengers in a grid world, picking them up and dropping them off at one of four designated locations.
- <b>Environment</b>
    - [Gymnasium](https://gymnasium.farama.org/environments/toy_text/taxi/) - To create RL environments  
        - Taxi-v3 
            - Observation space: 1D vector
                - `Discrete(500)` as there are 
                    - (5x5 grid) 25 taxi positions
                    - 5 passenger locations (technically, 4 passenger locations + case when passenger is in the taxi)
                    - 4 destination locations
            - Action space: Scalar
                - `Discrete(6)` $\Rightarrow$ No. of actions available: 6
                - Possible actions
                    - 0: Move south (down)
                    - 1: Move north (up)
                    - 2: Move east (right)
                    - 3: Move west (left)
                    - 4: Pickup passenger
                    - 5: Drop off passenger
            - Rewards
                - -1 per step unless other reward is triggered.
                - +20 delivering passenger.
                - -10 executing “pickup” and “drop-off” actions illegally.
                
              An action that results in a no-op, like moving into a wall, will incur the time step penalty. No-ops can be avoided by sampling from the action_mask returned in info.
              
            - Episode end
                - Termination
                    - The taxi drops off the passenger. 👍                 
                - Truncation
                    - When exceeding the maximum number of interactions in an episode 👎
                        - 200 timesteps.
- <b>Model</b>
    - Q learning - Built from scratch 
        - Q table shape: n_states x n_actions: 500 x 6
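
The 500 states and the 500 x 6 Q-table from the summary above can be checked with a short sketch. The mixed-radix ordering used by `encode` (taxi row, taxi column, passenger location, destination) is an assumption made for this example; it is one consistent way to map the four components to `Discrete(500)`.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Map the four state components to a single index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode(state):
    """Inverse of encode: recover the four components from the index."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


n_states = 25 * 5 * 4  # taxi positions x passenger locations x destinations
n_actions = 6
q_table = [[0.0] * n_actions for _ in range(n_states)]  # one row per state

print(n_states, encode(4, 4, 4, 3), decode(encode(2, 3, 1, 0)))
```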

# Quiz

# Conclusion

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.

Take time to really grasp the material before continuing.

Don’t hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The <b>best way to learn is to try things on your own!</b>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" style="width:600px;" title="Environments">

In the next unit, <b>we’re going to learn about Optuna</b>. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. Optuna is a library that helps you to automate the search.



# Additional Readings

These are <b>optional readings</b> if you want to go deeper.

- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
- [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)