# Week 3: Reinforcement Learning

---

## What is reinforcement learning?

Reinforcement Learning (RL) is a branch of machine learning in which an **agent** (such as a robot or a software program) learns to make a sequence of decisions in an environment so as to maximize a cumulative **reward**.

Unlike supervised learning, which relies on labeled data, RL learns by trial and error guided by a reward function.

### Core Concepts and Mechanism

* **Goal:** To find a function (often called a **policy**) that maps an observed **State** ($\mathbf{s}$) of the environment to an optimal **Action** ($\mathbf{a}$).
* **State ($\mathbf{s}$):** The agent's current situation or observation (e.g., a helicopter's position, orientation, and speed).
* **Action ($\mathbf{a}$):** A decision the agent makes (e.g., how to move the control sticks).
* **Reward Function:** The key input to RL, which tells the agent *when* it is doing well and *when* it is doing poorly.
    * **Incentive:** The agent's task is to figure out the sequence of actions that maximizes the total cumulative reward over time.
    * **Example:** For a helicopter, the reward might be **+1** for every second of flying well and a large **negative reward** (e.g., -1000) for crashing.
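
As a rough illustration of the helicopter example, the reward signal can be written as a small function. This is only a sketch: the names `step_reward` and `total_reward` are hypothetical, and it assumes one reward per second and treats every second without a crash as "flying well".

```python
# Values taken from the helicopter example above.
FLYING_WELL_REWARD = 1.0
CRASH_PENALTY = -1000.0


def step_reward(crashed: bool) -> float:
    """Reward for one second of flight (assumption: one reward per second)."""
    return CRASH_PENALTY if crashed else FLYING_WELL_REWARD


def total_reward(crash_flags) -> float:
    """Cumulative reward over an episode, given one crash flag per second."""
    return sum(step_reward(crashed) for crashed in crash_flags)


# Ten seconds of good flight followed by a crash.
episode = [False] * 10 + [True]
print(total_reward(episode))  # 10 * 1 + (-1000) = -990.0
```

The function only says how good each outcome is. It says nothing about how to move the control sticks, which is what the agent has to work out.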

<img src='images/rl.png' width='500px'>

### The Power of Reward

RL is powerful because the designer only needs to specify **what** the goal is (via the reward function), not **how** to achieve it (via specific optimal actions).
* **Analogy:** It is like training a dog; you reward "good dog" behavior and discourage "bad dog" behavior, allowing the dog to learn the complex path to the desired outcome itself.
* **Example:** An RL algorithm enabled a robot dog to learn complex leg placements to climb over obstacles solely by rewarding progress toward the goal, without explicit instructions on leg movement.

### Contrast with Supervised Learning (SL)

For many control tasks (like flying a robot), Supervised Learning (SL) is hard to apply because it requires a large dataset of states ($\mathbf{x}$) paired with their ideal actions ($\mathbf{y}$).

It is often ambiguous or impossible for a human expert to define the single, exact "right action" ($\mathbf{y}$) for every single complex state ($\mathbf{x}$), making SL impractical for these scenarios. RL overcomes this ambiguity by using rewards instead of perfect labels.

### Applications

* **Robotics:** Controlling autonomous systems (helicopters, drones, robot dogs) to perform complex maneuvers.
* **Optimization:** Factory optimization to maximize throughput and efficiency.
* **Finance:** Efficient stock execution and trading strategies (e.g., sequencing trades to minimize price impact).
* **Gaming:** Playing complex games like Chess, Go, Bridge, and various video games.

---

## Mars rover example

This section formalizes Reinforcement Learning (RL) using a simplified example inspired by the Mars rover. It introduces states, actions, rewards, and terminal states.

### The Environment Setup (States and Rewards)

* **States ($S$):** The environment is modeled as a sequence of six positions, $S_1$ through $S_6$, representing possible locations of the rover. The rover starts in $S_4$.
* **Rewards ($R$):** Rewards are associated with specific states based on their scientific value:
    * $R(S_1) = 100$ (Highest value, most interesting science).
    * $R(S_6) = 40$ (Second highest value).
    * $R(S_2) = R(S_3) = R(S_4) = R(S_5) = 0$.
* **Terminal States:** $S_1$ and $S_6$ are terminal states. Once the rover reaches one of them, it collects that state's reward and the day (or episode) ends, so no further rewards can be earned.
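
The setup above is small enough to write down directly. The sketch below is one possible encoding, not a prescribed one: states are represented by the integers 1 to 6, and the names `REWARDS`, `TERMINAL_STATES`, `START_STATE`, `reward`, and `is_terminal` are illustrative choices.

```python
# States S_1..S_6 are represented by the integers 1..6 (an assumption of this sketch).
STATES = [1, 2, 3, 4, 5, 6]

# R(S_1) = 100, R(S_6) = 40, and 0 everywhere else.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

# Reaching either end of the line ends the episode.
TERMINAL_STATES = {1, 6}

# The rover starts in S_4.
START_STATE = 4


def reward(state: int) -> int:
    """Reward R(S) associated with a state."""
    return REWARDS[state]


def is_terminal(state: int) -> bool:
    """True if the episode ends once this state is reached."""
    return state in TERMINAL_STATES


print(reward(START_STATE), is_terminal(START_STATE))  # 0 False
print(reward(1), is_terminal(1))                      # 100 True
```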

### Actions and Transitions

* **Actions ($A$):** At each step, the rover can choose one of two actions:
    * Go Left
    * Go Right
* **State Transition:** Taking an action leads the rover from the current state $S$ to a new state $S'$ (the next state). For example, from $S_4$, taking the action "Go Left" leads to the next state $S_3$.
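
A state transition in this example can be sketched as a function from a state and an action to the next state. The function name `next_state` and the string labels for the actions are hypothetical, and keeping the rover in place at a terminal state is an assumption of the sketch rather than something the example specifies.

```python
LEFT = "left"
RIGHT = "right"
TERMINAL_STATES = {1, 6}


def next_state(state: int, action: str) -> int:
    """Return S' for a given S and action on the six-position line."""
    if state in TERMINAL_STATES:
        # Assumption: the episode is over, so the rover does not move.
        return state
    if action == LEFT:
        return state - 1
    if action == RIGHT:
        return state + 1
    raise ValueError(f"unknown action: {action!r}")


print(next_state(4, LEFT))   # 3
print(next_state(5, RIGHT))  # 6
```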

<img src='images/rl_example.png' width='600px'>

### The Core RL Loop Elements

The reinforcement learning problem is defined by a repeated sequence of transitions:

At every time step, the rover is in a state $S$, chooses an action $A$, receives the reward $R(S)$ associated with that state, and transitions to a next state $S'$.
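
One way to make this loop concrete is to record each step as a tuple $(S, A, R(S), S')$. The sketch below does this for the rover example. The helper `run_episode`, the `policy` callable, and the `max_steps` cap are hypothetical conveniences, and the terminal step is recorded with `None` for the action and next state because the episode ends there.

```python
from typing import Callable, List, Optional, Tuple

REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL_STATES = {1, 6}

Step = Tuple[int, Optional[str], int, Optional[int]]


def run_episode(start: int, policy: Callable[[int], str], max_steps: int = 20) -> List[Step]:
    """Follow `policy` from `start`, recording (S, A, R(S), S') at each step."""
    trajectory: List[Step] = []
    state = start
    for _ in range(max_steps):
        r = REWARDS[state]  # reward associated with the current state
        if state in TERMINAL_STATES:
            trajectory.append((state, None, r, None))  # episode ends here
            break
        action = policy(state)
        new_state = state - 1 if action == "left" else state + 1
        trajectory.append((state, action, r, new_state))
        state = new_state
    return trajectory


def always_left(state: int) -> str:
    """A fixed rule that ignores the state and always goes left."""
    return "left"


for step in run_episode(4, always_left):
    print(step)
# (4, 'left', 0, 3)
# (3, 'left', 0, 2)
# (2, 'left', 0, 1)
# (1, None, 100, None)
```

The `max_steps` cap is there only so that a rule that moves back and forth forever still terminates in this sketch.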

### Evaluating Action Sequences

The goal of the RL algorithm is to find the sequence of actions that maximizes the total reward collected by the time the rover reaches a terminal state.

* **Option 1 (Go Left):** $S_4 \to S_3 \to S_2 \to S_1$. Total Reward: $0 + 0 + 0 + 100 = 100$.
* **Option 2 (Go Right):** $S_4 \to S_5 \to S_6$. Total Reward: $0 + 0 + 40 = 40$.
* **Suboptimal Path:** $S_4 \to S_5 \to S_4 \to \dots$ (Wasting time by moving back and forth).

The algorithm must learn to choose the path (policy) that yields the highest cumulative reward.
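
The arithmetic for the two options can be checked with a few lines of code. This is a minimal sketch that sums $R(S)$ over the states visited. The names `total_reward` and `paths` are illustrative, and comparing a hand-written list of paths is not how an RL algorithm searches for a policy.

```python
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}


def total_reward(path) -> int:
    """Sum the reward R(S) of every state visited along a path."""
    return sum(REWARDS[state] for state in path)


paths = {
    "go left": [4, 3, 2, 1],  # S_4 -> S_3 -> S_2 -> S_1
    "go right": [4, 5, 6],    # S_4 -> S_5 -> S_6
}

for name, path in paths.items():
    print(name, total_reward(path))
# go left 100
# go right 40

best = max(paths, key=lambda name: total_reward(paths[name]))
print(best)  # go left
```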