<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#What-is-RL?-A-short-recap" data-toc-modified-id="What-is-RL?-A-short-recap-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What is RL? A short recap</a></span></li><li><span><a href="#The-two-types-of-value-based-methods" data-toc-modified-id="The-two-types-of-value-based-methods-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The two types of value-based methods</a></span></li><li><span><a href="#The-Bellman-Equation,-simplify-our-value-estimation" data-toc-modified-id="The-Bellman-Equation,-simplify-our-value-estimation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Bellman Equation, simplify our value estimation</a></span></li><li><span><a href="#Monte-Carlo-vs-Temporal-Difference-Learning" data-toc-modified-id="Monte-Carlo-vs-Temporal-Difference-Learning-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Monte Carlo vs Temporal Difference Learning</a></span></li><li><span><a href="#Mid-way-Recap" data-toc-modified-id="Mid-way-Recap-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Mid-way Recap</a></span></li><li><span><a href="#Mid-way-Quiz" data-toc-modified-id="Mid-way-Quiz-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Mid-way Quiz</a></span></li><li><span><a href="#Introducing-Q-Learning" data-toc-modified-id="Introducing-Q-Learning-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Introducing Q-Learning</a></span></li><li><span><a href="#A-Q-Learning-example" data-toc-modified-id="A-Q-Learning-example-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>A Q-Learning example</a></span></li><li><span><a href="#Q-Learning-Recap" data-toc-modified-id="Q-Learning-Recap-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Q-Learning Recap</a></span></li><li><span><a href="#Glossary" data-toc-modified-id="Glossary-11"><span 
class="toc-item-num">11&nbsp;&nbsp;</span>Glossary</a></span></li><li><span><a href="#Hands-on" data-toc-modified-id="Hands-on-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Hands-on</a></span></li><li><span><a href="#Q-Learning-Quiz" data-toc-modified-id="Q-Learning-Quiz-13"><span class="toc-item-num">13&nbsp;&nbsp;</span>Q-Learning Quiz</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-14"><span class="toc-item-num">14&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Additional-Readings" data-toc-modified-id="Additional-Readings-15"><span class="toc-item-num">15&nbsp;&nbsp;</span>Additional Readings</a></span></li></ul></div>

# Introduction

Source: https://huggingface.co/learn/deep-rl-course/unit2/introduction

In the first unit of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also <b>trained our first agents and uploaded them to the Hugging Face Hub</b>.

In this unit, we’re going to <b>dive deeper into one of the Reinforcement Learning methods: value-based methods</b> and study our <b>first RL algorithm: Q-Learning.</b>

We’ll also <b>implement our first RL agent from scratch</b>, a Q-Learning agent, and train it in two environments:
- FrozenLake-v1 (non-slippery version): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
- An autonomous taxi: where our agent will need to learn to navigate a city to transport its passengers from point A to point B.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" style="width:700px;" title="Two environments">

Concretely, we will:

- Learn about <b>value-based methods</b>.
- Learn about the <b>differences between Monte Carlo and Temporal Difference Learning</b>.
- Study and implement <b>our first RL algorithm</b>: Q-Learning.

This unit is <b>fundamental if you want to be able to work on Deep Q-Learning</b>: the first Deep RL algorithm that played Atari games and surpassed human-level performance on some of them (Breakout, Space Invaders, etc.).

# What is RL? A short recap

In RL, we build an agent that can <b>make smart decisions</b>. For instance, an agent that <b>learns to play a video game</b>, or a trading agent that <b>learns to maximize its profits</b> by deciding <b>which stocks to buy and when to sell</b>.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" style="width:700px;" title="RL Process">

To make intelligent decisions, our agent learns from the environment by <b>interacting with it through trial and error</b> and receiving rewards (positive or negative) as its <b>only feedback</b>.

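Its goal is to <b>maximize its expected cumulative reward</b>. This follows from the reward hypothesis, which states that every goal can be described as the maximization of the expected cumulative reward.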
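As a minimal sketch of what "cumulative reward" means for a single episode, the snippet below sums a list of rewards, optionally weighting later rewards less through a discount factor. The function name `cumulative_reward`, the parameter `gamma`, and the reward values are illustrative assumptions, not part of the course code.

```python
def cumulative_reward(rewards, gamma=1.0):
    """Return the (optionally discounted) sum of the rewards from one episode.

    gamma=1.0 gives the plain sum; gamma < 1 weights later rewards less.
    """
    total = 0.0
    # Iterate backwards so each step computes r_t + gamma * (return from t+1).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two neutral steps, then a reward of 1 at the goal.
episode_rewards = [0.0, 0.0, 1.0]
print(cumulative_reward(episode_rewards))       # 1.0 (undiscounted)
print(cumulative_reward(episode_rewards, 0.5))  # 0.25 (discounted)
```

The agent maximizes the expectation of this quantity, that is, its average over the episodes the policy can produce, rather than the value for any single episode.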

The <b>agent’s decision-making process is called the policy π</b>: given a state, a policy outputs either an action or a probability distribution over actions. That is, given an observation of the environment, the policy provides the action the agent should take (or a probability for each possible action).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/policy.jpg" style="width:700px;" title="Policy: The agent's brain">
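The two kinds of policy output described above can be sketched as plain Python functions. This is only an illustration: the state names, the action names, the probabilities, and the helper `sample_action` are all hypothetical.

```python
import random


def deterministic_policy(state):
    """Map each state to exactly one action."""
    lookup = {"start": "right", "near_goal": "right"}
    return lookup[state]


def stochastic_policy(state):
    """Map each state to a probability distribution over actions."""
    distributions = {
        "start": {"left": 0.3, "right": 0.7},
        "near_goal": {"left": 0.1, "right": 0.9},
    }
    return distributions[state]


def sample_action(distribution, rng=random):
    """Draw one action according to the probabilities in `distribution`."""
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(deterministic_policy("start"))
probs = stochastic_policy("start")
print(probs, sample_action(probs, random.Random(0)))
```

A deterministic policy always returns the same action for a given state, while a stochastic policy has to be sampled to obtain the action that is actually executed.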

Our <b>goal is to find an optimal policy π*</b>, that is, a policy that leads to the best expected cumulative reward.

To find this optimal policy (and thereby solve the RL problem), there are <b>two main types of RL methods</b>:

- <i>Policy-based methods</i>: <b>Train the policy directly</b> to learn which action to take given a state.
- <i>Value-based methods</i>: <b>Train a value function</b> to learn <b>which states are more valuable</b> and use this value function to <b>take the action that leads to the most valuable state</b>.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg" style="width:700px;" title="Two approaches: policy-based and value-based">
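To make the value-based idea concrete, here is a minimal sketch of choosing an action from state values. It assumes the values are already learned and that each action leads to one known next state. The tables `state_values` and `transitions` and the function `greedy_action` are invented for this example.

```python
# Hypothetical state values, as a trained value function might report them.
state_values = {"A": 0.2, "B": 0.9, "C": 0.5}

# Hypothetical deterministic transitions: from state "S", each action
# leads to one known next state.
transitions = {"S": {"left": "A", "up": "B", "right": "C"}}


def greedy_action(state):
    """Pick the action whose resulting state has the highest value."""
    next_states = transitions[state]
    return max(next_states, key=lambda action: state_values[next_states[action]])


print(greedy_action("S"))  # "up", because state "B" has the highest value
```

Nothing here is trained directly as a policy: the behavior falls out of the value table, whereas a policy-based method would learn the mapping from state to action itself.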

# The two types of value-based methods

In value-based methods, <b>we learn a value function that maps a state to the expected value of being at that state.</b>
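Written as a formula, this is the usual state-value function. The notation is an assumption here, since this section does not define its symbols: $S_t$ and $R_t$ are the state and reward at time step $t$, $\gamma \in [0, 1]$ is a discount factor, and $\mathbb{E}_\pi$ is the expectation when the agent follows policy $\pi$.

$$
V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
$$

In words: the value of a state is the discounted cumulative reward the agent can expect to collect if it starts in that state and then acts according to $\pi$.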

# The Bellman Equation, simplify our value estimation

# Monte Carlo vs Temporal Difference Learning

# Mid-way Recap

# Mid-way Quiz

# Introducing Q-Learning

# A Q-Learning example

# Q-Learning Recap

# Glossary

# Hands-on

# Q-Learning Quiz

# Conclusion

# Additional Readings