# Week 3: Reinforcement Learning

## Table of Contents

1. [What is reinforcement learning?](#what-is-reinforcement-learning)  
2. [Mars rover example](#mars-rover-example)  
3. [The return in reinforcement learning](#the-return-in-reinforcement-learning)  
4. [Making decisions: Policies in reinforcement learning](#making-decisions-policies-in-reinforcement-learning)  
5. [Review of key concepts](#review-of-key-concepts)  
6. [State-action value function definition](#state-action-value-function-definition)  
7. [Bellman Equation](#bellman-equation)  
8. [Random (stochastic) environment](#random-stochastic-environment)  
9. [Example of continuous state space application](#example-of-continuous-state-space-application)  
10. [Learning the state-action function](#learning-the-state-action-function)  
11. [Algorithm refinement: Improved neural network architecture](#algorithm-refinement-improved-neural-netwrok-architecture)  
12. [Algorithm refinement: ε-greedy policy](#algorithm-refienment-epsilon-greedy-policy)  
13. [Algorithm refinement: Mini-Batch and soft-updates](#algorithm-refinement-mini-batch-and-soft-updates)  
14. [The state of reinforcement learning](#the-state-of-reinforcement-learning)
15. [References](#references)

---

## What is reinforcement learning?

Reinforcement Learning (RL) is a fundamental pillar of machine learning used to train an **agent** (like a robot or an algorithm) to make a sequence of decisions in an environment by maximizing a cumulative **reward**.

Instead of relying on labeled data (like supervised learning), RL uses a system of trial and error guided by a reward function.

### Core Concepts and Mechanism

* **Goal:** To find a function (often called a **policy**) that maps an observed **State** ($\mathbf{s}$) of the environment to an optimal **Action** ($\mathbf{a}$).
* **State ($\mathbf{s}$):** The agent's current situation or observation (e.g., a helicopter's position, orientation, and speed).
* **Action ($\mathbf{a}$):** A decision the agent makes (e.g., how to move the control sticks).
* **Reward Function:** The key input to RL, which tells the agent *when* it is doing well and *when* it is doing poorly.
    * **Incentive:** The agent's task is to figure out the sequence of actions that maximize the total cumulative reward over time.
    * **Example:** For a helicopter, reward may be **+1** for every second flying well and a large **negative reward** (e.g., -1000) for crashing. 

<img src='images/rl.png' width='500px'>

### The Power of Reward

RL is powerful because the designer only needs to specify **what** the goal is (via the reward function), not **how** to achieve it (via specific optimal actions).
* **Analogy:** It is like training a dog; you reward "good dog" behavior and discourage "bad dog" behavior, allowing the dog to learn the complex path to the desired outcome itself.
* **Example:** An RL algorithm enabled a robot dog to learn complex leg placements to climb over obstacles solely by rewarding progress toward the goal, without explicit instructions on leg movement.

### Contrast with Supervised Learning (SL)

For many control tasks (like flying a robot), Supervised Learning (SL) is impractical because it requires a large dataset of states ($\mathbf{x}$) paired with their ideal actions ($\mathbf{y}$).

It is often ambiguous or impossible for a human expert to define the single, exact "right action" ($\mathbf{y}$) for every single complex state ($\mathbf{x}$), making SL impractical for these scenarios. RL overcomes this ambiguity by using rewards instead of perfect labels.

### Applications

* **Robotics:** Controlling autonomous systems (helicopters, drones, robot dogs) to perform complex maneuvers.
* **Optimization:** Factory optimization to maximize throughput and efficiency.
* **Finance:** Efficient stock execution and trading strategies (e.g., sequencing trades to minimize price impact).
* **Gaming:** Playing complex games like Chess, Go, Bridge, and various video games.

---

## Mars rover example

This section formalizes the core concepts of Reinforcement Learning (RL) using a simplified example inspired by the Mars rover, introducing the concepts of state, action, reward, and terminal states.

### The Environment Setup (States and Rewards)

* **States ($S$):** The environment is modeled as a sequence of six positions, $S_1$ through $S_6$, representing possible locations of the rover. The rover starts in $S_4$.
* **Rewards ($R$):** Rewards are associated with specific states based on their scientific value:
    * $R(S_1) = 100$ (Highest value, most interesting science).
    * $R(S_6) = 40$ (Second highest value).
    * $R(S_2) = R(S_3) = R(S_4) = R(S_5) = 0$.
* **Terminal States:** $S_1$ and $S_6$ are terminal states. Once the rover reaches these states, the day (or episode) ends, and no further rewards can be earned.

### Actions and Transitions

* **Actions ($A$):** At each step, the rover can choose one of two actions:
    * Go Left
    * Go Right
* **State Transition:** Taking an action leads the rover from the current state $S$ to a new state $S'$ (the next state). For example, from $S_4$, taking the action "Go Left" leads to the next state $S_3$.

<img src='images/rl_example.png' width=600px>

### The Core RL Loop Elements

The fundamental process that defines the reinforcement learning problem is the sequence of transitions:

At every time step, the robot is in a State ($\mathbf{S}$), chooses an Action ($\mathbf{A}$), receives the Reward ($\mathbf{R}(S)$) associated with that state, and transitions to a Next State ($\mathbf{S'}$).

### Evaluating Action Sequences

The goal of the RL algorithm is to figure out the optimal sequence of actions to maximize the total reward collected before reaching a terminal state.

* **Option 1 (Go Left):** $S_4 \to S_3 \to S_2 \to S_1$. Total Reward: $0 + 0 + 0 + 100 = 100$.
* **Option 2 (Go Right):** $S_4 \to S_5 \to S_6$. Total Reward: $0 + 0 + 40 = 40$.
* **Suboptimal Path:** $S_4 \to S_5 \to S_4 \to \dots$ (Wasting time by moving back and forth).

The algorithm must learn to choose the path (policy) that yields the highest cumulative return.
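
The two options above can be checked with a minimal sketch of the rover environment. The function names (`step`, `rollout`) and the string action labels are illustrative choices, not part of the original example; rewards are summed without discounting, as in the totals above.

```python
# Rewards per state as given in the text; states 1 and 6 are terminal.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}


def step(state, action):
    """Return the next state after moving 'left' or 'right' by one position."""
    return state - 1 if action == "left" else state + 1


def rollout(start_state, actions):
    """Follow a fixed list of actions; return visited states and the undiscounted total reward."""
    state = start_state
    visited = [state]
    total_reward = REWARDS[state]
    for action in actions:
        if state in TERMINAL_STATES:
            break  # the episode ends once a terminal state is reached
        state = step(state, action)
        visited.append(state)
        total_reward += REWARDS[state]
    return visited, total_reward


print(rollout(4, ["left", "left", "left"]))  # ([4, 3, 2, 1], 100)
print(rollout(4, ["right", "right"]))        # ([4, 5, 6], 40)
```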

---

## The return in reinforcement learning

This section explains the **Return**, the quantity Reinforcement Learning (RL) uses to score a sequence of rewards, and the **Discount Factor ($\gamma$)**, which gives later rewards less weight.

### Defining the Return ($G$)

The Return is the single number used to quantify the total cumulative value of a sequence of rewards ($R_1, R_2, R_3, \dots$) that an agent receives over an episode.
* **Analogy:** It captures the idea that a smaller, immediate reward (like a \$5 bill now) might be more attractive than a larger reward that takes significant time and effort to obtain (like a \$10 bill across town).
* **Goal:** The primary objective of an RL algorithm is to find a policy (set of actions) that maximizes the expected Return.

### The Discount Factor ($\gamma$)

The discount factor, $\gamma$ (gamma), is a number between 0 and 1 (often close to 1, like 0.9 or 0.99) used to weigh future rewards. It makes the RL algorithm **"impatient"** by reducing the value of rewards received later in time. Rewards received sooner contribute more to the total Return.

### The Return Formula

The return ($G$) is calculated as the sum of all future rewards, where each successive reward is discounted by an increasing power of $\gamma$:

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots$$

* The first reward ($R_1$) is given full credit ($1 \cdot R_1$).
* The second reward ($R_2$) is multiplied by $\gamma$.
* The third reward ($R_3$) is multiplied by $\gamma^2$, and so on.
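
As a small sketch, the formula can be written as a one-line sum. The helper name `discounted_return` is made up for illustration, and $\gamma = 0.5$ is the value used in the rover example later in these notes.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Reward sequences for the rover starting in S_4.
go_left = [0, 0, 0, 100]   # S_4 -> S_3 -> S_2 -> S_1
go_right = [0, 0, 40]      # S_4 -> S_5 -> S_6

print(discounted_return(go_left, 0.5))   # 12.5
print(discounted_return(go_right, 0.5))  # 10.0
```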

### Interpretation and Practical Effects

* **Financial Interpretation:** In applications like financial trading, $\gamma$ often represents the time value of money or the interest rate, meaning a dollar today is worth more than a dollar in the future.
* **Handling Negative Rewards:** If the system incurs negative rewards (costs or penalties), the discount factor incentivizes the algorithm to push these negative outcomes as far into the future as possible, minimizing their discounted impact on the total Return.
* **Policy Dependence:** The Return obtained from any given state depends on the actions (policy) the agent chooses. In the rover example with $\gamma = 0.5$, always going left from the starting state $S_4$ yields a Return of $0.5^3 \cdot 100 = 12.5$, while always going right yields $0.5^2 \cdot 40 = 10$.

<img src='images/rl_return.png' width='700px'>

---

## Making decisions: Policies in reinforcement learning

This section explains the final formalized concept of Reinforcement Learning (RL): the **Policy ($\pi$)**, which is the core output of any RL algorithm.

### Definition of the Policy ($\pi$)

The policy ($\pi$) is a function that serves as the "brain" or "controller" for the reinforcement learning agent. It takes the current State ($\mathbf{s}$) as input and reliably outputs the recommended Action ($\mathbf{a}$) that the agent should take in that state.

$$\pi(\mathbf{s}) \longrightarrow \mathbf{a}$$

### The Goal of Reinforcement Learning

The ultimate objective of any reinforcement learning algorithm is to find the optimal policy ($\pi^{*}$). This is the policy that, when followed, guarantees the maximum possible **Return** (the discounted sum of future rewards) for the agent from every starting state.

### Policy Examples (Rover)

A policy defines a decision for every possible state:

*Example Policy:*
* If in State 2, go **Left**.
* If in State 3, go **Left**.
* If in State 4, go **Left**.
* If in State 5, go **Right**.

The algorithm must explore and learn to define the best action for every state to maximize the cumulative rewards.
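
One way to picture a policy in code is as a lookup table from state to action. The sketch below stores the example policy in a dictionary and computes the discounted Return it earns from each state, assuming $\gamma = 0.5$ and the rover rewards from earlier; `return_under_policy` is a hypothetical helper name.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}

# The example policy from the text: one action per non-terminal state.
policy = {2: "left", 3: "left", 4: "left", 5: "right"}


def return_under_policy(state, policy, gamma):
    """Discounted return from following `policy` until a terminal state.

    Note: this loops forever for a policy that never reaches a terminal
    state; the example policy always terminates.
    """
    total, discount = 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL_STATES:
            return total
        state = state - 1 if policy[state] == "left" else state + 1
        discount *= gamma


for s in sorted(policy):
    print(s, policy[s], return_under_policy(s, policy, gamma=0.5))
# 2 left 50.0
# 3 left 25.0
# 4 left 12.5
# 5 right 20.0
```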

---


## Review of key concepts

This section provides a concise review of the fundamental components of Reinforcement Learning (RL) using the formalism known as a **Markov Decision Process (MDP)**.

### Core RL Components Reviewed

The formalism is built from six key elements:

* **States ($\mathbf{S}$):** All possible configurations or situations of the environment (e.g., the rover's position, the helicopter's position/orientation, or the configuration of pieces on a chessboard).
* **Actions ($\mathbf{A}$):** The set of all possible decisions or moves the agent can make from any given state (e.g., "go left/right" for the rover, or moving a specific control stick for the helicopter).
* **Rewards ($\mathbf{R}$):** A function that assigns a positive or negative numerical value to a state or an action, telling the agent when it is performing well or poorly (e.g., $+1$ for winning a game, $-1$ for losing).
* **Discount Factor ($\gamma$):** A value (usually $0 < \gamma < 1$) used to compute the Return, which makes the agent prioritize immediate rewards over future rewards.
* **Return ($\mathbf{G}$):** The cumulative discounted sum of all future rewards, which the policy attempts to maximize.
* **Policy ($\pi$):** The function that maps a State ($\mathbf{S}$) to the Action ($\mathbf{A}$) the agent takes. The goal is to find the policy that maximizes the Return.

### The Formalism: Markov Decision Process (MDP)

The entire framework comprising the states, actions, transitions, and rewards is known as a **Markov Decision Process (MDP)**. The Markov term means that the future only depends on the current state and action, and not on the history of states or actions that led to the current state.  

<img src='images/rl.png' width=500px>

The MDP describes the interaction between the **Agent** and the **Environment**:
1.  The Agent uses the **Policy ($\pi$)** to choose an **Action ($\mathbf{A}$)**.
2.  The Action changes the **Environment**.
3.  The Environment returns a new **State ($\mathbf{S'}$)** and a **Reward ($\mathbf{R}$)**.

### Next Step: The State-Action Value Function

In the next section, we will introduce the **State-Action Value Function** (or Q-function) as the next key concept necessary for developing algorithms to actually find the optimal policy.

---

## State-action value function definition

This section introduces the final foundational concept needed for reinforcement learning algorithms: the **State-Action Value Function**, also known as the **Q-Function**.

### Definition of the Q-Function

The Q-Function is denoted by $Q(s, a)$. It gives a single number representing the expected total Return (discounted sum of future rewards) under a specific condition. $Q(s, a)$ is the Return you get if you **start in state $s$**, **take action $a$** just once, and then **behave optimally thereafter** (i.e., follow the optimal policy, $\pi^*$). 

### The Relationship to Optimal Policy ($\pi^*$)

As stated, the definition is slightly circular: computing $Q(s, a)$ appears to require already knowing the optimal behavior. Reinforcement learning algorithms resolve this with techniques (like dynamic programming or temporal difference learning) that compute the $Q$-function *before* the optimal policy is known.

### Using the Q-Function to Find the Optimal Action

The Q-Function provides a direct way to find the optimal action to take in any state. The best possible Return an agent can get from state $s$ is the largest value of $Q(s, a)$ across all available actions $a$.
    
$$\max_a Q(s, a)$$

The optimal policy, $\pi^*(s)$, simply chooses the action $a$ that maximizes $Q(s, a)$.

$$\pi^*(s) = \underset{a}{\operatorname{argmax}} \ Q(s, a)$$

### Example (Mars Rover, $\gamma=0.5$)

For the Mars Rover example, if $Q(S_4, \text{Left}) = 12.5$ and $Q(S_4, \text{Right}) = 10$:
* The highest possible return from $S_4$ is $12.5$.
* The optimal action in $S_4$ is **Go Left**, because that action yields the higher Q-value.
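
In code, both steps reduce to a `max` over the Q-values of a state. This is a minimal sketch using only the two $S_4$ values quoted above; the dictionary layout is an illustrative choice.

```python
# Q-values for state S_4 quoted in the text (gamma = 0.5).
q_values_s4 = {"left": 12.5, "right": 10.0}

best_return = max(q_values_s4.values())              # max_a Q(s, a)
best_action = max(q_values_s4, key=q_values_s4.get)  # argmax_a Q(s, a)

print(best_return, best_action)  # 12.5 left
```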

<img src='images/Q-function.png' width='600px'>

**Conclusion:** If an RL algorithm can successfully compute the $Q$-function for every state and every action, it has effectively solved the problem, as the optimal policy can be derived immediately by simply choosing the action with the highest Q-value.

---

## Bellman Equation

This section introduces the **Bellman Equation**, the fundamental formula used in Reinforcement Learning (RL) to compute the **State-Action Value Function, $Q(s, a)$**.

### The Goal and Definition

The Bellman Equation is the key mathematical tool used to compute the $Q(s, a)$ values, which in turn are used to determine the optimal policy ($\pi^*$). $Q(s, a)$ is the return if you start in state $s$, take action $a$ once, and then behave optimally afterward.

### The Bellman Equation Formula

The equation expresses the $Q(s, a)$ value recursively, breaking the total return into two parts: the immediate reward and the discounted optimal future return.

$$\mathbf{Q(s, a)} = \mathbf{R(s)} + \gamma \cdot \max_{a'} Q(s', a')$$

Where:
* **$s$:** The current state.
* **$a$:** The current action taken.
* **$R(s)$:** The immediate reward received in state $s$.
* **$\gamma$ (Gamma):** The discount factor.
* **$s'$:** The next state reached after taking action $a$ from state $s$.
* **$\max_{a'} Q(s', a')$:** The maximum possible return obtainable from the *next* state, $s'$, by choosing the best possible subsequent action, $a'$.

### Intuition and Breakdown

The Bellman Equation formalizes the decomposition of the total Return:

1.  **Immediate Reward ($R(s)$):** This is the reward you get right away at the first step.
2.  **Discounted Future Return ($\gamma \cdot \max_{a'} Q(s', a')$):** This is the value of the best possible return you can expect from the *next* state, $s'$, discounted by $\gamma$.

The value of taking an action now is equal to the reward you get now, plus the discounted value of the optimal returns you can expect from the state you land in next.

### Edge Cases

If $s$ is a terminal state (where the process ends), the Bellman Equation simplifies to $Q(s, a) = R(s)$ because there is no next state ($s'$), and therefore no future return.
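
To make the recursion concrete, the sketch below applies the Bellman Equation repeatedly as an update rule on the six-state rover until the values settle. Sweeping over all states like this is a value-iteration style procedure chosen here for illustration; these notes do not prescribe a particular solver, and the names `solve_q` and `sweeps` are assumptions.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}
ACTIONS = ("left", "right")


def next_state(state, action):
    return state - 1 if action == "left" else state + 1


def solve_q(gamma, sweeps=50):
    """Apply Q(s,a) = R(s) + gamma * max_a' Q(s',a') over all states, many times."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL_STATES:
                    q[s][a] = float(REWARDS[s])  # edge case: Q(s, a) = R(s)
                else:
                    s_next = next_state(s, a)
                    q[s][a] = REWARDS[s] + gamma * max(q[s_next].values())
    return q


q = solve_q(gamma=0.5)
print(q[4])  # {'left': 12.5, 'right': 10.0}
print(q[2])  # {'left': 50.0, 'right': 12.5}
```

The values for $S_4$ match the 12.5 and 10 used in the Q-function example.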

### Next Steps

* Once this equation is defined, RL algorithms can be developed to iteratively solve or learn the $Q(s, a)$ values for all states and actions, despite the initial circular nature of the $Q$-function's definition.
* The next section covers **Stochastic Markov Decision Processes**, where actions have random effects, before the first RL algorithm is developed.

---

## Random (stochastic) environment

This section introduces the concept of a **Stochastic Markov Decision Process (MDP)**, which generalizes the standard RL framework to environments where actions are not perfectly reliable (i.e., the outcome is random).

### Stochastic Environment

A stochastic environment is one where, when the agent takes an action, the next state ($s'$) is not guaranteed but is instead random or probabilistic.

For example, if a Mars Rover is commanded to "Go Left," there might be:
* **90% Chance** of moving left to the expected state ($s'$).
* **10% Chance** of slipping and moving right to a different state ($s''$).

### The Goal: Expected Return

Because the sequence of rewards is now random, the goal is no longer to maximize a fixed return, but to maximize the **Expected Return**, which is the average value of the sum of discounted rewards if the agent were to try the same policy a very large number of times (e.g., a million times). The policy ($\pi$) must be chosen to maximize the average sum of discounted rewards:

$$\max_{\pi} E \left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \right]$$

### Modification to the Bellman Equation

The introduction of stochasticity modifies the Bellman Equation by adding an Expected Value ($E$) operator to the future return term:

$$Q(s, a) = R(s) + \gamma \cdot E \left[ \max_{a'} Q(s', a') \right]$$

The total return is the immediate reward $R(s)$ plus the discount factor $\gamma$ times the expected optimal return from the next state. This accounts for the uncertainty in where the agent will land after taking action $a$.
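
A sketch of the modified update on the rover, assuming a single misstep probability that sends the rover one step in the opposite direction (0.1 mirrors the 10% example above). The solver style and the names `solve_q_stochastic` and `misstep_prob` are illustrative assumptions.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}
ACTIONS = ("left", "right")


def solve_q_stochastic(gamma, misstep_prob, sweeps=200):
    """Bellman updates where the future term is an expectation over two outcomes."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL_STATES:
                    q[s][a] = float(REWARDS[s])
                    continue
                intended = s - 1 if a == "left" else s + 1
                slipped = s + 1 if a == "left" else s - 1
                # E[max_a' Q(s', a')] over the two possible next states.
                expected_future = (
                    (1 - misstep_prob) * max(q[intended].values())
                    + misstep_prob * max(q[slipped].values())
                )
                q[s][a] = REWARDS[s] + gamma * expected_future
    return q


deterministic = solve_q_stochastic(gamma=0.5, misstep_prob=0.0)
slippery = solve_q_stochastic(gamma=0.5, misstep_prob=0.1)
print(max(deterministic[4].values()))  # 12.5
print(max(slippery[4].values()))
```

With a misstep probability of 0 the result reduces to the deterministic case; with 0.1 the best achievable value from $S_4$ comes out lower than 12.5, in line with the practical impact described next.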

### Practical Impact

* **Lower Returns:** Stochastic environments are harder to control. When the probability of a misstep increases, the maximum possible $Q$-values (and thus the optimal expected return) for the agent decrease because control is less reliable.
* **Generalization:** This framework applies to almost all real-world RL problems where perfect control is impossible due to external factors (wind, terrain, complex game dynamics, etc.).

### Next Step

The core RL concepts (including the MDP, Return, Policy, Q-Function, and Bellman Equation) will now be generalized to handle much larger and **continuous state spaces**, which are common in practical applications.

---

## Example of continuous state space application

This section explains the concept of **Continuous State Spaces** in Reinforcement Learning (RL), which is necessary to model real-world problems where the environment's state is not limited to a small, discrete number of values.

### Defining Continuous State Spaces

In simplified examples (like the Mars Rover), the state space is discrete, meaning the agent can only be in one of a small, finite set of positions (e.g., $S_1, S_2, \dots, S_6$). In real-world applications (like robotics or vehicle control), the state is defined by real numbers that can take on any value within a range. The state is represented not by a single integer, but by a vector of continuous numbers.

### Example: Autonomous Car/Truck

For a self-driving vehicle, the state is a vector comprising at least six continuous numbers:

| State Variable | Description |
| :--- | :--- |
| **Position** | $x$ (horizontal), $y$ (vertical) coordinates. |
| **Orientation** | $\Theta$ (angle, or heading, ranging 0 to $360^\circ$). |
| **Velocity** | $\dot{x}$ (speed in the $x$ direction), $\dot{y}$ (speed in the $y$ direction). |
| **Angular Velocity** | $\dot{\Theta}$ (how fast the car is turning). |

The policy takes this 6-number vector as input.

### Example: Autonomous Helicopter

For a complex aerial robot, the state vector is much larger, comprising at least 12 continuous numbers:

| State Variable | Description |
| :--- | :--- |
| **Position** | $x, y, z$ (North/South, East/West, Height). |
| **Orientation** | $\Phi, \Theta, \Psi$ (Roll, Pitch, Yaw). |
| **Linear Velocity** | $\dot{x}, \dot{y}, \dot{z}$ (Rate of change for each position coordinate). |
| **Angular Velocity** | $\dot{\Phi}, \dot{\Theta}, \dot{\Psi}$ (Rate of change for Roll, Pitch, Yaw). |

The policy takes this 12-number vector as input.

### Generalization

The RL framework generalizes by replacing the simple, discrete state number (e.g., $S=4$) with a high-dimensional vector of continuous values. This continuous state space formalism is used for many complex control problems, including the Lunar Lander simulation that you will work on in the practice lab.
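
As a rough illustration of what "state as a vector" means in code, the sketch below packs the six car/truck quantities from the table into a flat list. The class name `CarState`, the field names, and the sample values are all made up for this example.

```python
from dataclasses import dataclass, astuple


@dataclass
class CarState:
    x: float          # horizontal position
    y: float          # vertical position
    theta: float      # heading angle in degrees
    x_dot: float      # speed in the x direction
    y_dot: float      # speed in the y direction
    theta_dot: float  # how fast the car is turning

    def as_vector(self):
        """Flatten the state into the list of numbers a policy would take as input."""
        return list(astuple(self))


state = CarState(x=12.0, y=-3.5, theta=90.0, x_dot=1.2, y_dot=0.0, theta_dot=0.05)
print(state.as_vector())  # [12.0, -3.5, 90.0, 1.2, 0.0, 0.05]
```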

---

## Learning the state-action function

This section explains the methodology for applying Reinforcement Learning (RL) to problems with continuous state spaces, like the lunar lander, using a technique called **Deep Q-Networks (DQN)**. The core idea is to train a neural network to approximate the $Q$-function using supervised learning principles derived from the Bellman Equation.

### The Q-Network Architecture

The goal here is to train a Neural Network (NN) to compute the State-Action Value Function, $Q(s, a)$. The input ($\mathbf{X}$) to the NN is a concatenated vector of the current State ($s$) and the Action ($a$).

* *Example (Lunar Lander):* The state is 8 numbers (position, velocity, angle, angular velocity, and two flags indicating whether each leg is touching the ground). The action (one of 4 discrete choices: nothing, left, main, right) is encoded as a 4-number one-hot vector, so the input has 12 features in total (8 State + 4 Action). The NN has a single output unit ($\mathbf{Y}$) that predicts the value of $Q(s, a)$.
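
The 12-feature input can be sketched as a simple concatenation. The function names `one_hot` and `encode_input`, and the ordering of the four actions, are assumptions made for this illustration.

```python
ACTIONS = ("nothing", "left", "main", "right")


def one_hot(action):
    """Encode an action name as a 4-number vector with a single 1.0."""
    return [1.0 if a == action else 0.0 for a in ACTIONS]


def encode_input(state, action):
    """Concatenate the 8-number state with the 4-number one-hot action."""
    if len(state) != 8:
        raise ValueError("expected an 8-number state")
    return list(state) + one_hot(action)


x = encode_input([0.0] * 8, "main")
print(len(x), x[8:])  # 12 [0.0, 0.0, 1.0, 0.0]
```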

### Creating the Training Data

The algorithm generates a dataset of $(\mathbf{X}, \mathbf{Y})$ pairs from the agent's experience, leveraging the Bellman Equation to define the target value $\mathbf{Y}$.

* **Experience Collection:** The agent takes various actions (randomly or using a developing policy) in the environment, generating **tuples of experience**:
    
    $$\mathbf{(s, a, R(s), s')} : \text{(Current State, Action Taken, Immediate Reward, Next State)}$$
    
* **Replay Buffer:** To manage memory and decorrelate training examples, the algorithm stores only the $N$ most recent experience tuples (e.g., 10,000 tuples) in a temporary memory called the **Replay Buffer**.
* **Defining the Target Value ($\mathbf{Y}$):** For each experience tuple, the target value $\mathbf{Y}$ is computed using the right-hand side of the Bellman Equation:

    $$\mathbf{Y} = R(s) + \gamma \cdot \max_{a'} Q(s', a')$$

    Where $Q(s', a')$ is obtained from the **current, evolving neural network's estimate**. (Initially, this is a random guess, but it improves over time).
* **Defining the Input ($\mathbf{X}$):** The input is the state-action pair from the tuple: $\mathbf{X} = (s, a)$.
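
The steps above can be sketched as follows. This is not a full implementation: `q_estimate` is a stand-in for the current Q-network, the `done` flag marking terminal transitions is an addition to the tuple described above (so the target can fall back to $R(s)$ at the end of an episode), and $\gamma = 0.995$ is a hypothetical value.

```python
from collections import deque

ACTIONS = ("nothing", "left", "main", "right")


def q_estimate(state, action):
    """Stand-in for the current Q-network; returns 0.0 like an uninformed guess."""
    return 0.0


def build_training_set(buffer, q_fn, gamma):
    """Turn experience tuples into (X, Y) pairs using the Bellman Equation."""
    xs, ys = [], []
    for state, action, reward, next_state, done in buffer:
        if done:
            target = reward  # terminal: no future return
        else:
            target = reward + gamma * max(q_fn(next_state, a) for a in ACTIONS)
        xs.append((state, action))  # X = (s, a)
        ys.append(target)           # Y = R(s) + gamma * max_a' Q(s', a')
    return xs, ys


# deque(maxlen=N) silently drops the oldest tuple once N are stored.
buffer = deque(maxlen=10_000)
buffer.append(((0.0,) * 8, "main", -0.3, (0.0,) * 8, False))
buffer.append(((0.0,) * 8, "nothing", 100.0, (0.0,) * 8, True))

xs, ys = build_training_set(buffer, q_estimate, gamma=0.995)
print(ys)  # [-0.3, 100.0] because the stand-in Q returns 0.0 everywhere
```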

### The Deep Q-Network (DQN) Algorithm

The full DQN process uses an iterative cycle of experience and supervised learning:

1.  **Initialize:** Randomly initialize the parameters of the Q-Network.
2.  **Act and Collect:** Repeatedly take actions in the environment, observing $(\mathbf{s}, \mathbf{a}, \mathbf{R(s)}, \mathbf{s'})$ tuples and storing them in the Replay Buffer.
3.  **Train:** Periodically generate a training set of $(\mathbf{X}, \mathbf{Y})$ pairs from the Replay Buffer using the Bellman Equation to calculate $\mathbf{Y}$.
4.  **Update:** Train the neural network using standard supervised learning techniques (e.g., Mean Squared Error loss) to approximate $\mathbf{X} \to \mathbf{Y}$. This produces a **new, improved** Q-Network ($Q_{new}$).
5.  **Iterate:** Replace the old Q-Network ($Q$) with the new one ($Q_{new}$). The improved Q-Network provides a better estimate for calculating the target $Y$ in the next training cycle.

### Policy Derivation

Once the Q-Network is trained, the agent determines the optimal action in any state $s$ by using the network to predict $Q(s, a)$ for all possible actions $a$, and then choosing the action with the **highest predicted Q-value**.
    
$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\ Q(s, a)$$


---

<a id="algorithm-refinement-improved-neural-netwrok-architecture"></a>

## Algorithm refinement: Improved neural network architecture

This section describes a key architectural change to the Deep Q-Network (DQN) that makes the reinforcement learning algorithm significantly more efficient: using a single neural network to output the Q-values for all possible actions simultaneously.

### Inefficient Architecture (Previous)

The previous Q-Network took both the State ($s$) and the Action ($a$) as input. To decide the best action in a given state $s$, the agent had to run the entire neural network inference separately for every single action to calculate all $Q(s, a)$ values.

### Efficient Architecture (Single Network Output)

The revised, more efficient neural network only takes the State ($s$) as input. The single network is trained to have **multiple output units**—one for every possible action.

*Example (Lunar Lander):* The network has four output units, which simultaneously predict:
    
$$[Q(s, \text{Nothing}), Q(s, \text{Left}), Q(s, \text{Main}), Q(s, \text{Right})]$$

### Benefits of the New Architecture

* **Efficiency and Speed:** Given a state $s$, the algorithm runs neural network inference only once to get all $Q(s, a)$ values, making the process of picking an action much faster.
* **Simplified calculation of Bellman equation:** This architecture also speeds up the calculation of the target value **$Y$** using the Bellman Equation (which requires finding $\max_{a'} Q(s', a')$), as all necessary Q-values for the next state $s'$ are generated simultaneously.
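
A toy forward pass shows the idea: the state goes in once and four Q-values come out. The network below is a sketch with random, untrained weights, a hidden size of 16 picked arbitrarily, and biases omitted for brevity; none of these details come from the course material.

```python
import random

ACTIONS = ("nothing", "left", "main", "right")
STATE_SIZE, HIDDEN_SIZE = 8, 16  # hidden size is an arbitrary choice

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


W1 = random_matrix(HIDDEN_SIZE, STATE_SIZE)
W2 = random_matrix(len(ACTIONS), HIDDEN_SIZE)


def dense(weights, inputs):
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def q_network(state):
    """One forward pass: state in, one Q-value per action out."""
    hidden = [max(0.0, h) for h in dense(W1, state)]  # ReLU activation
    return dense(W2, hidden)


state = [0.1, 0.9, 0.0, -0.3, 0.02, 0.0, 0.0, 0.0]
q_values = q_network(state)                            # single inference call
best_action = ACTIONS[q_values.index(max(q_values))]   # used to pick an action
max_q = max(q_values)                                  # used in the Bellman target
print(len(q_values), best_action)
```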

### Next Step

The next concept introduced is the **Epsilon-Greedy Policy**, which is a crucial refinement to the DQN algorithm that governs how the agent selects actions while learning.

---

<a id="algorithm-refienment-epsilon-greedy-policy"></a>

## Algorithm refinement: $\epsilon$-greedy policy

This section explains the **Epsilon-Greedy Policy**, a crucial technique used during the training phase of Deep Q-Networks (DQN) to balance the need to **exploit** current knowledge with the need to **explore** new possibilities.

### The Core Dilemma (Exploration vs. Exploitation)

During training, the Q-Function ($Q(s, a)$) is constantly being approximated and is initially inaccurate. The agent must decide whether to:
1.  **Exploit:** Use its current, imperfect knowledge to choose the best-known action.
2.  **Explore:** Take a random action to potentially discover better strategies or avoid being "stuck" in a suboptimal belief.

If the neural network is randomly initialized to believe a crucial action (e.g., firing the main thruster) is bad, the agent will never try it and thus will never learn its true high Q-value.

### The Epsilon-Greedy Policy Mechanism

The Epsilon-Greedy Policy introduces a simple mechanism to balance this trade-off using a small probability value, $\epsilon$ (epsilon):

| Scenario | Action Taken | Probability | Terminology |
| :--- | :--- | :--- | :--- |
| **Greedy Action** | Choose the action $a$ that maximizes the current $Q(s, a)$. | $1 - \epsilon$ (e.g., 95% of the time) | **Exploitation** |
| **Random Action** | Choose an action $a$ entirely at random. | $\epsilon$ (e.g., 5% of the time) | **Exploration** |
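
The table translates into a few lines of code. This is a minimal sketch; the function name `epsilon_greedy` and the choice to return an action index are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


q_values = [0.2, -1.0, 3.5, 0.7]
print(epsilon_greedy(q_values, epsilon=0.0))   # 2: always greedy when epsilon is 0
print(epsilon_greedy(q_values, epsilon=0.05))  # usually 2, occasionally a random index
```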

### Implementing $\epsilon$ Decay

* **Initial Stage (High $\epsilon$):** It is common practice to start with a high $\epsilon$ (e.g., $\epsilon = 1.0$), meaning the agent initially takes actions entirely at random. This quickly gathers diverse experience.
* **Later Stage (Low $\epsilon$):** $\epsilon$ is then gradually decreased over time (e.g., down to $0.01$). As the Q-Function estimate improves, the agent shifts from exploring to relying on (exploiting) its learned knowledge.
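
One simple way to implement this schedule is multiplicative decay with a floor. The decay rate of 0.995 per episode is an assumed value for illustration; the text only gives the start (1.0) and end (0.01) points.

```python
def decay_epsilon(epsilon, decay_rate=0.995, epsilon_min=0.01):
    """Shrink epsilon by a constant factor, never going below epsilon_min."""
    return max(epsilon_min, epsilon * decay_rate)


epsilon = 1.0
schedule = []
for episode in range(2000):
    schedule.append(epsilon)  # epsilon used during this episode
    epsilon = decay_epsilon(epsilon)

print(schedule[0], schedule[-1])  # 1.0 0.01
```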

### General Observations on RL Algorithms

* **Fickle Nature:** Reinforcement Learning algorithms are generally more **finicky** and sensitive to hyperparameter choices (like $\epsilon$) than supervised learning algorithms.
* **Parameter Sensitivity:** Setting parameters poorly can lead to learning taking 10 to 100 times longer, highlighting the critical role of careful tuning in RL.

---

## Algorithm refinement: Mini-Batch and soft-updates

This section outlines two final and crucial refinements to the Deep Q-Network (DQN) algorithm: **Mini-Batching** and **Soft Updates**, which significantly improve the algorithm's speed and reliability.

### Mini-Batching for Efficiency

In traditional supervised learning, if the dataset is huge (e.g., $m = 100$ million examples), computing the gradient requires summing the error over *all* $m$ examples. This is computationally expensive and slow for every single update step.

Instead of using the full dataset, the algorithm uses a small subset of the data, a **mini-batch** (e.g., $m' = 1{,}000$ examples), to compute the gradient and update the parameters in each step. Each update is much faster, leading to a much faster overall training process, even if the direction of the update is slightly noisier (less reliable) than using the full batch.

**Application in DQN:** Even if the Replay Buffer stores a large number of experience tuples (e.g., 10,000), the algorithm uses a much smaller mini-batch size (e.g., 1,000) of these tuples to generate the training set $\mathbf{(X, Y)}$ and train the Q-Network.
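
A sketch of drawing a mini-batch from the buffer. Sampling uniformly at random without replacement is an assumption here (the text only says a smaller subset is used), and the buffer is filled with placeholder tuples.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)
for i in range(10_000):
    replay_buffer.append((i, "nothing", 0.0, i + 1))  # placeholder (s, a, R(s), s') tuples


def sample_mini_batch(buffer, batch_size, rng=random):
    """Draw `batch_size` distinct tuples uniformly at random from the buffer."""
    return rng.sample(list(buffer), batch_size)


mini_batch = sample_mini_batch(replay_buffer, batch_size=1_000)
print(len(mini_batch))  # 1000
```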

### Soft Updates for Stability

* **Problem (Abrupt Updates):** In the basic DQN algorithm, the old Q-Network ($Q$) is completely replaced by the newly trained network ($Q_{new}$), often written as $Q \leftarrow Q_{new}$. If the new network is accidentally a poor or noisy estimate, this abrupt change can lead to instability, causing the algorithm to oscillate or diverge.
* **Solution (Soft Update):** Instead of completely overwriting the old parameters ($\mathbf{W}, \mathbf{B}$), the soft update makes a gradual, weighted change, mixing a small fraction of the new parameters with the old ones. The weights $\mathbf{W}$ and biases $\mathbf{B}$ are updated as:

$$\mathbf{W} \leftarrow \tau \mathbf{W}_{new} + (1 - \tau) \mathbf{W}_{old}$$

$$\mathbf{B} \leftarrow \tau \mathbf{B}_{new} + (1 - \tau) \mathbf{B}_{old}$$

Where $\tau$ (tau) is a small hyperparameter, e.g., $\tau = 0.01$.

The soft update prevents any single bad update step from drastically changing the Q-Function, causing the reinforcement learning algorithm to converge more reliably to a good solution.
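
The update rule is a per-parameter weighted average. The sketch below applies it to a flat list of numbers; real networks hold their parameters in arrays or tensors, and the function name `soft_update` is illustrative.

```python
def soft_update(old_params, new_params, tau=0.01):
    """Blend parameters: tau * new + (1 - tau) * old, element by element."""
    return [tau * new + (1 - tau) * old for old, new in zip(old_params, new_params)]


w_old = [0.0, 1.0, -2.0]
w_new = [1.0, 1.0, 2.0]
print(soft_update(w_old, w_new))  # each value moves only 1% of the way toward w_new
```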

### Conclusion on RL Algorithms

Reinforcement Learning algorithms are generally more finicky and difficult to tune than supervised learning algorithms. They require careful selection of parameters (like the discount factor, $\epsilon$, and mini-batch size) and these structural refinements to work well on challenging tasks like the Lunar Lander.


---

## The state of reinforcement learning

This section provides a realistic assessment of the current state and practical utility of Reinforcement Learning (RL) technology.

### Excitement vs. Hype

RL is an exciting, major pillar of machine learning, currently attracting significant research momentum and enthusiasm. Despite the excitement and media coverage, there is a degree of hype around RL's immediate practical utility compared to other machine learning paradigms.

### The Simulation-to-Reality Gap

Much of the published research on RL has focused on **simulated environments** (like video games or simplified physics models). Getting an RL algorithm to work successfully in a real-world application or on a physical robot is significantly more challenging than getting it to work in a simulation. Developers must be cautious and ensure their solutions successfully bridge this simulation-to-reality gap.

### Current Utility and Application Frequency

Today, there are far fewer successful applications of Reinforcement Learning in industry compared to Supervised Learning and Unsupervised Learning. For most practical applications, the odds are much higher that supervised or unsupervised learning will be the right tool for the job.

RL is most commonly applied to complex **robotic control applications** and specialized tasks where conventional methods struggle.

### Future Potential

Despite current limitations, the potential of reinforcement learning for future applications remains very large. RL provides a valuable framework for thinking about decision-making problems and is worth keeping in any machine learning developer's toolkit.

---

## References

If you would like to learn more about Deep Q-Learning, we recommend you check out the following papers.

* Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
* Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. Continuous Control with Deep Reinforcement Learning. ICLR (2016).
* Mnih, V., Kavukcuoglu, K., Silver, D. et al. Playing Atari with Deep Reinforcement Learning. arXiv e-prints. arXiv:1312.5602 (2013).