# Week 3: Reinforcement Learning

---

## What is reinforcement learning?

Reinforcement Learning (RL) is a fundamental pillar of machine learning used to train an **agent** (like a robot or an algorithm) to make a sequence of decisions in an environment by maximizing a cumulative **reward**.

Instead of relying on labeled data (like supervised learning), RL uses a system of trial and error guided by a reward function.

### Core Concepts and Mechanism

* **Goal:** To find a function (often called a **policy**) that maps an observed **State** ($\mathbf{s}$) of the environment to an optimal **Action** ($\mathbf{a}$).
* **State ($\mathbf{s}$):** The agent's current situation or observation (e.g., a helicopter's position, orientation, and speed).
* **Action ($\mathbf{a}$):** A decision the agent makes (e.g., how to move the control sticks).
* **Reward Function:** The key input to RL, which tells the agent *when* it is doing well and *when* it is doing poorly.
    * **Incentive:** The agent's task is to figure out the sequence of actions that maximize the total cumulative reward over time.
    * **Example:** For a helicopter, reward may be **+1** for every second flying well and a large **negative reward** (e.g., -1000) for crashing. 

<img src='images/rl.png' width='500px'>

### The Power of Reward

RL is powerful because the designer only needs to specify **what** the goal is (via the reward function), not **how** to achieve it (via specific optimal actions).
* **Analogy:** It is like training a dog; you reward "good dog" behavior and discourage "bad dog" behavior, allowing the dog to learn the complex path to the desired outcome itself.
* **Example:** An RL algorithm enabled a robot dog to learn complex leg placements to climb over obstacles solely by rewarding progress toward the goal, without explicit instructions on leg movement.

### Contrast with Supervised Learning (SL)

For many control tasks (like flying a robot), Supervised Learning (SL) is a poor fit because it requires a large dataset of states ($\mathbf{x}$) paired with their ideal actions ($\mathbf{y}$).

It is often ambiguous or impossible for a human expert to define the single, exact "right action" ($\mathbf{y}$) for every single complex state ($\mathbf{x}$), making SL impractical for these scenarios. RL overcomes this ambiguity by using rewards instead of perfect labels.

### Applications

* **Robotics:** Controlling autonomous systems (helicopters, drones, robot dogs) to perform complex maneuvers.
* **Optimization:** Factory optimization to maximize throughput and efficiency.
* **Finance:** Efficient stock execution and trading strategies (e.g., sequencing trades to minimize price impact).
* **Gaming:** Playing complex games like Chess, Go, Bridge, and various video games.

---

## Mars rover example

This section formalizes the core concepts of Reinforcement Learning (RL) using a simplified example inspired by the Mars rover, introducing the concepts of state, action, reward, and terminal states.

### The Environment Setup (States and Rewards)

* **States ($S$):** The environment is modeled as a sequence of six positions, $S_1$ through $S_6$, representing possible locations of the rover. The rover starts in $S_4$.
* **Rewards ($R$):** Rewards are associated with specific states based on their scientific value:
    * $R(S_1) = 100$ (Highest value, most interesting science).
    * $R(S_6) = 40$ (Second highest value).
    * $R(S_2) = R(S_3) = R(S_4) = R(S_5) = 0$.
* **Terminal States:** $S_1$ and $S_6$ are terminal states. Once the rover reaches these states, the day (or episode) ends, and no further rewards can be earned.

### Actions and Transitions

* **Actions ($A$):** At each step, the rover can choose one of two actions:
    * Go Left
    * Go Right
* **State Transition:** Taking an action leads the rover from the current state $S$ to a new state $S'$ (the next state). For example, from $S_4$, taking the action "Go Left" leads to the next state $S_3$.

<img src='images/rl_example.png' width=600px>

### The Core RL Loop Elements

The fundamental process that defines the reinforcement learning problem is the sequence of transitions:

At every time step, the robot is in a State ($\mathbf{S}$), chooses an Action ($\mathbf{A}$), receives the Reward ($\mathbf{R}(S)$) associated with that state, and transitions to a Next State ($\mathbf{S'}$).
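
A single time step of this loop can be sketched in Python. This is a minimal illustration, not part of the original example: the dictionary encoding, the names `REWARDS`, `TERMINAL_STATES`, and `step`, and the choice to report `done` when the *current* state is terminal are all assumptions made for the sketch.

```python
# Hypothetical encoding of the six-state rover world (names are illustrative).
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}
LEFT, RIGHT = "left", "right"


def step(state, action):
    """Return (reward, next_state, done) for one time step.

    The reward is R(state), the reward of the state the rover is in,
    matching the convention used in these notes.
    """
    reward = REWARDS[state]
    if state in TERMINAL_STATES:
        # The episode is over: no transition happens.
        return reward, state, True
    next_state = state - 1 if action == LEFT else state + 1
    return reward, next_state, False


# One transition from the start state S4.
print(step(4, LEFT))  # (0, 3, False)
```

Calling `step` repeatedly with the returned `next_state` until `done` is `True` traces out one episode.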

### Evaluating Action Sequences

The goal of the RL algorithm is to figure out the optimal sequence of actions to maximize the total reward collected before reaching a terminal state.

* **Option 1 (Go Left):** $S_4 \to S_3 \to S_2 \to S_1$. Total Reward: $0 + 0 + 0 + 100 = 100$.
* **Option 2 (Go Right):** $S_4 \to S_5 \to S_6$. Total Reward: $0 + 0 + 40 = 40$.
* **Suboptimal Path:** $S_4 \to S_5 \to S_4 \to \dots$ (Wasting time by moving back and forth).

The algorithm must learn to choose the path (policy) that yields the highest cumulative return.

---

## The return in reinforcement learning

This section explains the **Return**, the quantity Reinforcement Learning (RL) uses to score a sequence of rewards, and the **Discount Factor ($\gamma$)** that makes later rewards count for less.

### Defining the Return ($G$)

The Return is the single number used to quantify the total cumulative value of a sequence of rewards ($R_1, R_2, R_3, \dots$) that an agent receives over an episode.
* **Analogy:** It captures the idea that a smaller, immediate reward (like a \$5 bill now) might be more attractive than a larger reward that takes significant time and effort to obtain (like a \$10 bill across town).
* **Goal:** The primary objective of an RL algorithm is to find a policy (a rule for choosing an action in each state) that maximizes the expected Return.

### The Discount Factor ($\gamma$)

The discount factor, $\gamma$ (gamma), is a number between 0 and 1 (often close to 1, like 0.9 or 0.99) used to weigh future rewards. It makes the RL algorithm **"impatient"** by reducing the value of rewards received later in time. Rewards received sooner contribute more to the total Return.

### The Return Formula

The return ($G$) is calculated as the sum of all future rewards, where each successive reward is discounted by an increasing power of $\gamma$:

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots$$

* The first reward ($R_1$) is given full credit ($1 \cdot R_1$).
* The second reward ($R_2$) is multiplied by $\gamma$.
* The third reward ($R_3$) is multiplied by $\gamma^2$, and so on.
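
The formula translates almost directly into code. The sketch below assumes a finite list of rewards and uses $\gamma = 0.5$ with the two rover paths from $S_4$ as sample inputs; the function name `discounted_return` is chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite reward list."""
    # enumerate starts at 0, so the first reward is weighted by gamma**0 == 1.
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Rover paths from S4, assuming gamma = 0.5.
print(discounted_return([0, 0, 0, 100], gamma=0.5))  # 12.5  (S4 -> S3 -> S2 -> S1)
print(discounted_return([0, 0, 40], gamma=0.5))      # 10.0  (S4 -> S5 -> S6)
```

Setting `gamma=1` recovers the undiscounted totals (100 and 40) for the same two paths.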

### Interpretation and Practical Effects

* **Financial Interpretation:** In applications like financial trading, $\gamma$ plays the role of the time value of money (it is tied to the interest rate): a dollar today is worth more than a dollar in the future.
* **Handling Negative Rewards:** If the system incurs negative rewards (costs or penalties), the discount factor incentivizes the algorithm to push these negative outcomes as far into the future as possible, minimizing their discounted impact on the total Return.
* **Policy Dependence:** The Return obtained from any given state depends entirely on the actions (policy) the agent chooses. In the rover example with $\gamma = 0.5$, always going left from the starting state $S_4$ yields a Return of $0.5^3 \cdot 100 = 12.5$, while always going right yields $0.5^2 \cdot 40 = 10$.

<img src='images/rl_return.png' width='700px'>
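
As an illustrative calculation of the negative-reward effect described above (the penalty of $-10$ is a made-up value, and $\gamma = 0.5$ is assumed for the arithmetic), compare the same penalty incurred at the first step with one incurred two steps later:

$$\text{now: } 1 \cdot (-10) = -10 \qquad \text{two steps later: } \gamma^2 \cdot (-10) = 0.25 \cdot (-10) = -2.5$$

The delayed penalty lowers the Return by less, which is why a discounting agent prefers to postpone costs.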

---

## Making decisions: Policies in reinforcement learning

This section explains the next formal concept of Reinforcement Learning (RL): the **Policy ($\pi$)**, which is the core output of any RL algorithm.

### Definition of the Policy ($\pi$)

The policy ($\pi$) is a function that serves as the "brain" or "controller" of the reinforcement learning agent. It takes the current State ($\mathbf{s}$) as input and outputs the Action ($\mathbf{a}$) the agent should take in that state.

$$\pi(\mathbf{s}) \longrightarrow \mathbf{a}$$

### The Goal of Reinforcement Learning

The ultimate objective of any reinforcement learning algorithm is to find the optimal policy ($\pi^{*}$). This is the policy that, when followed, guarantees the maximum possible **Return** (the discounted sum of future rewards) for the agent from every starting state.

### Policy Examples (Rover)

A policy defines a decision for every possible state:

*Example Policy:*
* If in State 2, go **Left**.
* If in State 3, go **Left**.
* If in State 4, go **Left**.
* If in State 5, go **Right**.

The algorithm must explore and learn to define the best action for every state to maximize the cumulative rewards.
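
One way to picture a policy in code is as a lookup table from state to action. The sketch below encodes the example policy as a dictionary and follows it to a terminal state, accumulating the discounted Return. The encoding, the names `POLICY` and `rollout_return`, and the choice of $\gamma = 0.5$ are assumptions made for illustration.

```python
# Hypothetical encoding of the rover world; names are illustrative only.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}

# The example policy: one action per non-terminal state.
POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}


def rollout_return(start_state, policy, gamma):
    """Follow `policy` from `start_state` and return the discounted Return."""
    state, total, discount = start_state, 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL_STATES:
            return total
        # Deterministic transition: left decreases the index, right increases it.
        state = state - 1 if policy[state] == "left" else state + 1
        discount *= gamma


for s in sorted(POLICY):
    print(s, rollout_return(s, POLICY, gamma=0.5))
# 2 50.0
# 3 25.0
# 4 12.5
# 5 20.0
```

Note that this simple loop assumes the policy eventually reaches a terminal state; a policy that bounces between two states (for example, right in $S_4$ and left in $S_5$) would never terminate here.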

---


## Review of key concepts

This section provides a concise review of the fundamental components of Reinforcement Learning (RL) using the formalism known as a **Markov Decision Process (MDP)**.

### Core RL Components Reviewed

The formalism requires defining six key elements for any application:

* **States ($\mathbf{S}$):** All possible configurations or situations of the environment (e.g., the rover's position, the helicopter's position/orientation, or the configuration of pieces on a chessboard).
* **Actions ($\mathbf{A}$):** The set of all possible decisions or moves the agent can make from any given state (e.g., "go left/right" for the rover, or moving a specific control stick for the helicopter).
* **Rewards ($\mathbf{R}$):** A function that assigns a positive or negative numerical value to a state or an action, telling the agent when it is performing well or poorly (e.g., $+1$ for winning a game, $-1$ for losing).
* **Discount Factor ($\gamma$):** A value (usually $0 < \gamma < 1$) used to compute the Return, which makes the agent prioritize immediate rewards over future rewards.
* **Return ($\mathbf{G}$):** The cumulative discounted sum of all future rewards, which the policy attempts to maximize.
* **Policy ($\pi$):** The function that maps a State ($\mathbf{S}$) to an Action ($\mathbf{A}$); the optimal policy picks, in every state, the action that maximizes the Return.

### The Formalism: Markov Decision Process (MDP)

The entire framework comprising the states, actions, transitions, and rewards is known as a **Markov Decision Process (MDP)**. The term *Markov* means that the future depends only on the current state and action, not on the history of states or actions that led to the current state.

<img src='images/rl.png' width=500px>

The MDP describes the interaction between the **Agent** and the **Environment**:
1.  The Agent uses the **Policy ($\pi$)** to choose an **Action ($\mathbf{A}$)**.
2.  The Action changes the **Environment**.
3.  The Environment returns a new **State ($\mathbf{S'}$)** and a **Reward ($\mathbf{R}$)**.

### Next Step: The State-Action Value Function

In the next section, we will introduce the **State-Action Value Function** (or Q-function) as the next key concept necessary for developing algorithms to actually find the optimal policy.

---

## State-action value function definition

This section introduces the final foundational concept needed for reinforcement learning algorithms: the **State-Action Value Function**, also known as the **Q-Function**.

### Definition of the Q-Function

The Q-Function is denoted by $Q(s, a)$. It gives a single number representing the expected total Return (discounted sum of future rewards) under a specific condition. $Q(s, a)$ is the Return you get if you **start in state $s$**, **take action $a$** just once, and then **behave optimally thereafter** (i.e., follow the optimal policy, $\pi^*$). 

### The Relationship to Optimal Policy ($\pi^*$)

This definition is slightly circular: computing $Q(s, a)$ seems to require knowing the optimal behavior, yet $Q(s, a)$ is the very thing used to find that behavior. Reinforcement learning algorithms resolve this by using techniques (like dynamic programming or temporal difference learning) that compute the $Q$-function *before* the optimal policy is known.

### Using the Q-Function to Find the Optimal Action

The Q-Function provides a direct way to find the optimal action to take in any state. The best possible Return an agent can get from state $s$ is the largest value of $Q(s, a)$ across all available actions $a$.
    
$$\max_a Q(s, a)$$

The optimal policy, $\pi^*(s)$, simply chooses the action $a$ that maximizes $Q(s, a)$.

$$\pi^*(s) = \underset{a}{\operatorname{argmax}} \ Q(s, a)$$

### Example (Mars Rover, $\gamma=0.5$)

For the Mars Rover example, if $Q(S_4, \text{Left}) = 12.5$ and $Q(S_4, \text{Right}) = 10$:
* The highest possible return from $S_4$ is $12.5$.
* The optimal action in $S_4$ is **Go Left**, because that action yields the higher Q-value.

<img src='images/Q-function.png' width='600px'>
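
In code, picking the optimal action from known Q-values is a one-line argmax. The sketch below uses only the two $S_4$ values quoted in this example; the dictionary layout and the helper name `greedy_action` are illustrative assumptions.

```python
def greedy_action(q_values):
    """Return the action with the largest Q-value (the argmax over actions)."""
    return max(q_values, key=q_values.get)


# Q-values for state S4 from the example above (gamma = 0.5).
q_s4 = {"left": 12.5, "right": 10.0}

print(greedy_action(q_s4))   # left
print(max(q_s4.values()))    # 12.5
```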

**Conclusion:** If an RL algorithm can successfully compute the $Q$-function for every state and every action, it has effectively solved the problem, as the optimal policy can be derived immediately by simply choosing the action with the highest Q-value.

---

## Bellman Equation

This section introduces the **Bellman Equation**, the fundamental formula used in Reinforcement Learning (RL) to compute the **State-Action Value Function, $Q(s, a)$**.

### The Goal and Definition

The Bellman Equation is the key mathematical tool used to compute the $Q(s, a)$ values, which in turn are used to determine the optimal policy ($\pi^*$). $Q(s, a)$ is the return if you start in state $s$, take action $a$ once, and then behave optimally afterward.

### The Bellman Equation Formula

The equation expresses the $Q(s, a)$ value recursively, breaking the total return into two parts: the immediate reward and the discounted optimal future return.

$$Q(s, a) = R(s) + \gamma \cdot \max_{a'} Q(s', a')$$

Where:
* **$s$:** The current state.
* **$a$:** The current action taken.
* **$R(s)$:** The immediate reward received in state $s$.
* **$\gamma$ (Gamma):** The discount factor.
* **$s'$:** The next state reached after taking action $a$ from state $s$.
* **$\max_{a'} Q(s', a')$:** The maximum possible return obtainable from the *next* state, $s'$, by choosing the best possible subsequent action, $a'$.

### Intuition and Breakdown

The Bellman Equation formalizes the decomposition of the total Return:

1.  **Immediate Reward ($R(s)$):** This is the reward you get right away at the first step.
2.  **Discounted Future Return ($\gamma \cdot \max_{a'} Q(s', a')$):** This is the value of the best possible return you can expect from the *next* state, $s'$, discounted by $\gamma$.

The value of taking an action now is equal to the reward you get now, plus the discounted value of the optimal returns you can expect from the state you land in next.
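
As a worked check on the Mars rover with $\gamma = 0.5$, take $s = S_4$ and $a = \text{Left}$, so $s' = S_3$. The best return from $S_3$ comes from continuing left ($0 + 0.5 \cdot 0 + 0.5^2 \cdot 100 = 25$), which gives:

$$Q(S_4, \text{Left}) = R(S_4) + \gamma \cdot \max_{a'} Q(S_3, a') = 0 + 0.5 \cdot 25 = 12.5$$

This agrees with the value of $12.5$ obtained by discounting the rewards along the path $S_4 \to S_3 \to S_2 \to S_1$ directly.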

### Edge Cases

If $s$ is a terminal state (where the process ends), the Bellman Equation simplifies to $Q(s, a) = R(s)$ because there is no next state ($s'$), and therefore no future return.
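
Combining the general equation with the terminal-state case, the rover's Q-values can be obtained by repeatedly applying the Bellman Equation as an update rule. The sketch below is one simple way to do this (synchronous sweeps starting from zeros, in the style of value iteration); it is an assumption for illustration and not necessarily the algorithm developed later. The names `compute_q` and `num_sweeps` and the choice $\gamma = 0.5$ are likewise illustrative.

```python
# Hypothetical encoding of the rover world; names are illustrative only.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}
ACTIONS = ("left", "right")
GAMMA = 0.5


def next_state(state, action):
    """Deterministic transition: left decreases the index, right increases it."""
    return state - 1 if action == "left" else state + 1


def compute_q(num_sweeps=10):
    """Apply the Bellman Equation as an update for a fixed number of sweeps."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(num_sweeps):
        new_q = {}
        for (s, a) in q:
            if s in TERMINAL_STATES:
                # Edge case: Q(s, a) = R(s), no future return.
                new_q[(s, a)] = float(REWARDS[s])
            else:
                s_next = next_state(s, a)
                best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                new_q[(s, a)] = REWARDS[s] + GAMMA * best_next
        q = new_q
    return q


q = compute_q()
print(q[(4, "left")], q[(4, "right")])  # 12.5 10.0
```

For this small deterministic world the values stop changing after a handful of sweeps, so ten sweeps is more than enough; larger problems would need a convergence check instead of a fixed count.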

### Next Steps

* Once this equation is defined, RL algorithms can be developed to iteratively solve or learn the $Q(s, a)$ values for all states and actions, despite the initial circular nature of the $Q$-function's definition.
* The next section covers **Stochastic Markov Decision Processes**, where actions have random effects, before developing the first RL algorithm.