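- Classical MDP planning methods (such as dynamic programming) assume the transition and reward functions are known. In reinforcement learning, we want to learn good policies without knowing these in advance.
- In **Model-based RL**, we aim to estimate transition and reward functions from experience. The methods in this section (Monte Carlo and Temporal Difference learning) are **model-free**: they learn values directly from sampled experience.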

---

### **Monte Carlo (MC) Methods**  
- **Goal**: Estimate the value of states under a policy $ \pi $.
- **First-Visit MC Method**:
  - **Initialize state values** $ v^\pi(s) $ arbitrarily (for example, to zero).
  - **Generate episodes**: Start at an initial state and take actions according to the **policy** until termination; this produces an **episode** (a sequence of states, actions, and rewards).
  - **For each state visited in the episode**, update its value using the **cumulative discounted rewards** (the **return**) from its first visit onwards.


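- The **update rule**: For each state visited, calculate the **average return** observed after its first occurrence in each episode.
  - **Formula**:  
    $$
    v^\pi(s) \leftarrow \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)
    $$
    where $ G_i(s) $ is the return following the first visit to $ s $ in the $ i $-th episode that visits $ s $, and $ N(s) $ is the number of such episodes.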
  
- **Goal**: Refine state value estimates after multiple episodes.

- **Monte Carlo for Action Values**:
  - Use similar MC methods but calculate averages for **state-action pairs**.
- **Exploration Issue**: Some state-action pairs may not be visited. **Exploring starts** (starting from a random state-action pair) can address this.

#### **First-Visit vs Every-Visit MC**
- **First-Visit MC**: Averages only the returns that follow the first occurrence of a state in each episode.
- **Every-Visit MC**: Averages the returns that follow every occurrence of a state in an episode.
- **Example**: 
  - **First-Visit**: If a state is visited multiple times in an episode, only the return following the first visit is used for the update.
  - **Every-Visit**: The returns following all visits to the state in the episode contribute to the update.
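
The difference between the two variants can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mc_prediction`, the `(state, reward)` episode format, and the toy episode are all assumptions made for the example.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v(s) by averaging the returns observed after visits to s.

    Each episode is a list of (state, reward) pairs, where the reward is
    the one received on leaving that state (an assumed format).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: ignore repeat occurrences
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Toy episode in which state "A" is visited twice.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
v_first = mc_prediction([episode], first_visit=True)
v_every = mc_prediction([episode], first_visit=False)
```

In this toy episode the two estimates for `"A"` differ, because every-visit MC also averages in the return that follows the second visit.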


<img src="../../Files/fourth-semester/rl/7.png" width="500px">
<img src="../../Files/fourth-semester/rl/8.png" width="500px">
<img src="../../Files/fourth-semester/rl/9.png" width="500px">

#### **Generalized Policy Iteration (GPI)**
- **Goal**: Converge to an optimal policy by interleaving policy evaluation and policy improvement.
- **Policy Evaluation**: Estimate the value of a policy.
- **Policy Improvement**: Update the policy based on the estimated values.
- **GPI Process**:
  1. **Policy Evaluation**: Use MC methods to estimate the value of the current policy.
  2. **Policy Improvement**: Update the policy to be greedy with respect to the estimated values.
  3. Repeat until convergence.


### **On-Policy vs. Off-Policy Methods**
- **On-policy**: The policy used to generate episodes is the same as the one being optimized.
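  - **Advantages**: Simple and easy to implement.
  - **Disadvantages**: Because the policy being learned must also keep exploring (e.g., $ \epsilon $-greedy), it converges to the best exploring policy rather than the optimal greedy one.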
  
- **Off-policy**: The policy used to generate episodes differs from the one being optimized.
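  - **Advantages**: More powerful and flexible: the agent can learn about a greedy target policy while following an exploratory one.
  - **Disadvantages**: More complicated, with higher-variance updates and slower convergence.
    - In off-policy learning, the behavior policy $ b $ (the policy used to generate episodes) differs from the target policy $ \pi $ (the policy being optimized).
    - **Problem**: Returns are sampled under $ b $, not $ \pi $. If $ b $ never takes an action that $ \pi $ might take, the value of that state-action pair under $ \pi $ cannot be estimated, so we require $ b(a \mid s) > 0 $ whenever $ \pi(a \mid s) > 0 $ (coverage).
    - **Solution**: **Importance sampling** reweights each return by the ratio of the trajectory's probability under $ \pi $ to its probability under $ b $.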
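
As a rough sketch of the importance-sampling idea, the snippet below computes the ratio for a trajectory and the ordinary (unweighted-average) importance-sampling estimate. The function names and the two toy policies `pi` and `b` are hypothetical and exist only for this illustration.

```python
def importance_ratio(trajectory, target_prob, behavior_prob):
    """Product over the trajectory of pi(a|s) / b(a|s)."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho

def ordinary_is_estimate(samples):
    """Average of rho * G over sampled (rho, G) pairs."""
    return sum(rho * g for rho, g in samples) / len(samples)

# Hypothetical policies: the target always picks "right",
# the behavior policy picks each of two actions with probability 0.5.
def pi(state, action):
    return 1.0 if action == "right" else 0.0

def b(state, action):
    return 0.5

rho_match = importance_ratio([("s0", "right"), ("s1", "right")], pi, b)
rho_miss = importance_ratio([("s0", "right"), ("s1", "left")], pi, b)
estimate = ordinary_is_estimate([(rho_match, 1.0), (rho_miss, 5.0)])
```

A trajectory that the target policy could never produce gets a ratio of zero, so its return does not contribute to the estimate.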


---

### **Temporal Difference (TD) Learning**
- **Goal**: Avoid waiting until the end of an episode to update values. TD learning allows for updating values incrementally during the episode.
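- **TD(0)**: A one-step update method where the value of a state is updated using the next reward and the current estimate of the next state's value (bootstrapping), with step size $ \alpha $ and discount factor $ \gamma $: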
  $$
  v^\pi(s_t) \leftarrow v^\pi(s_t) + \alpha \left[ R_{t+1} + \gamma v^\pi(s_{t+1}) - v^\pi(s_t) \right]
  $$
- **Advantages**: Updates are made during the episode, enabling faster learning.
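
The TD(0) rule above translates almost directly into code. The sketch below assumes a dictionary value table and treats unseen states as having value 0; the name `td0_update` and the `terminal` flag are illustrative choices, not part of the notes.

```python
def td0_update(v, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the value table `v` (a dict) in place."""
    # A terminal next state contributes no future value.
    bootstrap = 0.0 if terminal else v.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha * td_error
    return td_error

v = {"s0": 0.0, "s1": 2.0}
delta = td0_update(v, "s0", reward=-1.0, next_state="s1")
```

The returned `delta` is the bracketed TD error from the formula; the update can be applied after every single transition.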


#### **Q-Learning**
- **Off-policy TD Control**: Q-learning is an off-policy method where the agent learns the optimal policy using a greedy target:
  $$
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]
  $$
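- **Key Point**: **Maximization bias** arises because the $ \max $ over noisy estimates is used as an estimate of the maximum true value, so action values tend to be overestimated.

- **Solution**: **Double Q-learning** maintains two independently updated Q-tables to reduce maximization bias.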
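
A single Q-learning update might look like the following sketch. It assumes a dictionary keyed by `(state, action)` with missing entries treated as 0; the function name, the action list, and the numbers are illustrative only.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """One Q-learning update on `q`, a dict keyed by (state, action)."""
    # Greedy target: best estimated action value in the next state.
    if terminal:
        best_next = 0.0
    else:
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]

actions = ["up", "down", "left", "right"]
q = {("s1", "right"): 5.0, ("s1", "down"): 3.0}
new_value = q_learning_update(q, "s0", "right", -1.0, "s1", actions)
```

Note that the target depends only on the largest next-state estimate, not on which action the agent goes on to take.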

### **Sarsa**
- **On-policy TD Control**: Sarsa is an on-policy method where the agent learns the value of the current policy:
  $$
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
  $$
- **Key Point**: Sarsa updates the Q-value using the action $ a_{t+1} $ actually taken in the next state, so its estimates account for exploratory actions, which makes it more conservative than Q-learning.


**Sarsa vs Q-learning**: **Sarsa**'s target uses the next action chosen by the current policy, while **Q-learning**'s target uses the maximum Q-value in the next state, independent of the action actually taken.
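
To make the contrast concrete, here is a small sketch that applies a Sarsa update and then computes the Q-learning target for the same transition. The helper name `sarsa_update`, the toy Q-table, and the choice of `gamma=0.5` are assumptions for illustration.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """One Sarsa update on `q`, a dict keyed by (state, action)."""
    # The target uses the action the agent actually takes next.
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return target

q = {("s1", "right"): 5.0, ("s1", "down"): 3.0}
# The agent takes the exploratory action "down" in s1.
sarsa_target = sarsa_update(q, "s0", "right", -1.0, "s1", "down", gamma=0.5)
# Q-learning's target for the same transition ignores which action was taken.
q_learning_target = -1.0 + 0.5 * max(q[("s1", "right")], q[("s1", "down")])
```

Because the agent explored in `s1`, the Sarsa target is lower than the Q-learning target for the same transition.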

### **Maximization Bias and Double Q-learning**
- **Maximization Bias**: Q-learning can overestimate action values because its target takes the maximum over noisy Q-value estimates of the next state.
- **Double Q-learning**: Addresses this bias by maintaining two separate Q-tables and using one to select actions and the other to evaluate them, reducing overestimation.
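
One common way to write the Double Q-learning update is sketched below. It assumes two dictionaries keyed by `(state, action)` and a coin flip deciding which table is updated; the function name and the example values are hypothetical.

```python
import random

def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """One Double Q-learning update; returns True if q_a was the table updated."""
    # Randomly choose which table to update; the other one evaluates.
    if rng.random() < 0.5:
        select, evaluate = q_a, q_b
    else:
        select, evaluate = q_b, q_a
    # The table being updated selects the greedy next action...
    best = max(actions, key=lambda a: select.get((next_state, a), 0.0))
    # ...and the other table supplies that action's value.
    target = reward + gamma * evaluate.get((next_state, best), 0.0)
    old = select.get((state, action), 0.0)
    select[(state, action)] = old + alpha * (target - old)
    return select is q_a

q_a = {("s1", "right"): 5.0, ("s1", "down"): 3.0}
q_b = {("s1", "right"): 1.0, ("s1", "down"): 4.0}
updated_a = double_q_update(q_a, q_b, "s0", "right", -1.0, "s1",
                            ["right", "down"], rng=random.Random(0))
```

Since the selecting table's noise is independent of the evaluating table's noise, an action that looks best only by chance is less likely to also receive an inflated value.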

### **N-step TD and N-step SARSA**
- **N-step TD**: An extension of TD learning that updates a state's value using the next $ n $ rewards plus the discounted value estimate of the state reached after $ n $ steps, instead of bootstrapping after a single step. It interpolates between TD(0) ($ n = 1 $) and Monte Carlo ($ n $ spanning the rest of the episode).
- **N-step SARSA**: The same idea applied to action values: the target sums $ n $ rewards and then bootstraps from $ Q(s_{t+n}, a_{t+n}) $.

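
The n-step target can be computed with a short backward loop. In this sketch, `bootstrap_value` stands for the estimate used after the last reward ($ v(s_{t+n}) $ for n-step TD, or $ Q(s_{t+n}, a_{t+n}) $ for n-step SARSA); the function name and numbers are illustrative assumptions.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.9):
    """R_1 + gamma*R_2 + ... + gamma^(n-1)*R_n + gamma^n * bootstrap_value."""
    g = bootstrap_value
    # Fold the rewards in from last to first.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Three steps of reward -1, then bootstrap from an estimate of 10.
g3 = n_step_return([-1.0, -1.0, -1.0], bootstrap_value=10.0, gamma=0.5)
# With a single reward this reduces to the one-step TD target.
g1 = n_step_return([-1.0], bootstrap_value=10.0, gamma=0.5)
```
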
### **Expected SARSA**
- **Expected SARSA**: An extension of SARSA whose target uses the expected value of the next state's action values, weighted by the policy's probabilities: $ R_{t+1} + \gamma \sum_a \pi(a \mid s_{t+1}) Q(s_{t+1}, a) $. This can be used both on-policy and off-policy.
- **Key Point**: Expected SARSA can provide a more stable learning process because it averages over possible next actions rather than relying on a single sampled action, which removes the variance due to the random choice of $ a_{t+1} $.

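
A minimal sketch of the Expected SARSA target, assuming the policy's probabilities in the next state are given as a dictionary. The function name, the toy Q-table, and the probabilities are made up for the example.

```python
def expected_sarsa_target(q, reward, next_state, policy_probs, gamma=0.9):
    """Target using the policy-weighted average of next-state action values.

    `policy_probs` maps each action to pi(a | next_state).
    """
    expected_q = sum(p * q[(next_state, a)] for a, p in policy_probs.items())
    return reward + gamma * expected_q

q = {("s1", "right"): 5.0, ("s1", "down"): 3.0}
# Hypothetical exploring policy at s1.
probs = {"right": 0.75, "down": 0.25}
target = expected_sarsa_target(q, -1.0, "s1", probs, gamma=0.5)
# With a fully greedy target policy the result equals the Q-learning target.
greedy_target = expected_sarsa_target(q, -1.0, "s1", {"right": 1.0, "down": 0.0}, gamma=0.5)
```
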
### **Summary**
- **Monte Carlo (MC) Methods**: Learn state values by averaging returns over episodes.
- **Temporal-Difference (TD) Methods**: Learn values incrementally during episodes without waiting for the end.
- **On-Policy vs. Off-Policy**: On-policy uses the same policy for generating episodes and improving it, while off-policy uses different behavior and target policies.
- **Q-learning**: An off-policy TD method that uses a **greedy target** (the maximum next-state action value), but has **maximization bias**, which can be reduced with Double Q-learning.
- **Sarsa**: An on-policy TD method that updates values based on the current policy, making it more conservative.
- **N-step TD and N-step SARSA**: Extensions of TD and SARSA that consider multiple steps for better value estimation.
- **Expected SARSA**: An extension of SARSA that uses expected values for more stable learning.
- **Generalized Policy Iteration (GPI)**: A framework that combines policy evaluation and improvement to iteratively refine policies.


<hr style="border: none; border-top: 3px solid royalblue; text-align: center;">
<p style="text-align: center; color: royalblue; font-weight: bold;">Additional Example</p>



Let's use a simple **grid world** environment, a common teaching example, to illustrate the **Reinforcement Learning** concepts above.

### **Grid World Example**:
Imagine a 3x3 grid world where the agent starts at the top-left corner (state $ S_0 $) and aims to reach the bottom-right corner (state $ S_8 $). Moving into $ S_8 $ gives a reward of +10; every other move gives a reward of -1. The agent can move **Up**, **Down**, **Left**, or **Right**. There are no walls, and the agent can move freely in any direction.

Here’s the grid representation:

```
S_0  | S_1 | S_2
-----------------
S_3  | S_4 | S_5
-----------------
S_6  | S_7 | S_8 (Goal)
```

### **Step-by-Step Explanation**:

#### 1. **Policy**:
- The policy $ \pi $ tells the agent **which action to take** from each state.
- **Example**: In state $ S_0 $, the policy might tell the agent to move **Right** with a 100% chance (action selection based on the policy).

#### 2. **State Value ($ v^\pi(s) $)**:
- The **value** of a state $ v^\pi(s) $ is the expected return (the discounted sum of future rewards) the agent will get if it starts from state $ s $ and follows the policy $ \pi $.
- **Example**: If the agent is in state $ S_0 $, the value function $ v^\pi(S_0) $ will be the expected sum of rewards from $ S_0 $, considering all the possible actions it can take following policy $ \pi $.

#### 3. **First-Visit Monte Carlo (MC) Method**:
- To estimate the **value of states**, we use the **First-Visit MC Method**. This method updates the value of a state based on the **first time it is visited** during an episode.
  
**Episode Example**:
- Start at $ S_0 $, follow policy $ \pi $, and end up at $ S_8 $ (Goal).
- Suppose the episode takes eight moves and the rewards are: $ -1, -1, -1, -1, -1, -1, -1, 10 $, that is, $ -1 $ for each of the first seven moves and $ +10 $ for the final move into $ S_8 $.

**Updating state values**:
- After completing an episode, we compute the return (the sum of discounted rewards) that followed the first visit to each state, and **average these returns** across episodes for each state.
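
As a worked illustration for the reward sequence above, assume no discounting ($ \gamma = 1 $, an assumption made only for this example) and let $ G_t $ be the return from time step $ t $. Working backwards with $ G_t = R_{t+1} + \gamma G_{t+1} $:

$$
\begin{aligned}
G_7 &= 10 \\
G_6 &= -1 + G_7 = 9 \\
G_5 &= -1 + G_6 = 8 \\
&\;\;\vdots \\
G_0 &= -1 + G_1 = 3
\end{aligned}
$$

After this single episode, the first-visit estimate for the starting state would be $ 3 $, and the estimate for the state occupied just before the goal would be $ 10 $.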

---

#### 4. **Exploration Issue**:
- Not all state-action pairs are visited. For example, in this simple grid, the agent might not explore every direction equally (e.g., it could always move **Right** from $ S_0 $ and never move **Down**).
- **Exploring starts**: To solve this, we can start from different random state-action pairs to make sure all state-action pairs are explored.

---

#### 5. **Generalized Policy Iteration (GPI)**:
- **Goal**: Refine the policy by alternating between **policy evaluation** and **policy improvement**.
  
**GPI Steps**:
1. **Policy Evaluation**: Using MC methods, calculate the state values $ v^\pi(s) $ for the current policy.
2. **Policy Improvement**: After evaluating the current policy, **update the policy** to be **greedy** (in each state, choose the action with the highest estimated value; without a model of the environment this requires action values $ Q(s, a) $).

For example, if after evaluation moving **Down** from $ S_1 $ has a higher estimated value than moving **Right**, we’ll change the policy at $ S_1 $ to move **Down**.
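
The improvement step can be sketched as picking, in each state, the action with the largest estimated action value. The function name `greedy_policy` and the Q-values below are hypothetical; unseen state-action pairs are assumed to have value 0.

```python
def greedy_policy(q, states, actions):
    """Return a dict mapping each state to its highest-valued action."""
    return {s: max(actions, key=lambda a: q.get((s, a), 0.0)) for s in states}

# Hypothetical estimates at S1 after policy evaluation.
q = {("S1", "Right"): 2.0, ("S1", "Down"): 6.2}
policy = greedy_policy(q, ["S1"], ["Up", "Down", "Left", "Right"])
```
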

---

#### 6. **On-Policy vs. Off-Policy**:
- **On-policy**: The policy used to **generate episodes** is the same as the one being improved.  
    - **Example**: The agent always follows the policy $ \pi $ to explore and learn.
- **Off-policy**: The policy used to **generate episodes** is different from the policy being improved.
    - **Example**: The agent might follow an **exploratory** policy to gather data (e.g., random movements) but improve a **greedy** policy based on the collected experiences.

---

#### 7. **Temporal Difference (TD) Learning**:
- **Goal**: Update the value estimates **during the episode** without waiting for the episode to end.
  
**TD(0) Update Rule**:
- Update the value of a state using the next state’s value.
- **Formula**:
  $$
  v^\pi(s_t) \leftarrow v^\pi(s_t) + \alpha \left[ R_{t+1} + \gamma v^\pi(s_{t+1}) - v^\pi(s_t) \right]
  $$
  - **Example**: After moving from $ S_0 $ to $ S_1 $, the value of $ S_0 $ is updated using the reward and the estimated value of $ S_1 $.
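
As a worked instance of this update, assume (for illustration only) $ \alpha = 0.1 $, $ \gamma = 0.9 $, a reward of $ -1 $ for the move, and both value estimates initialized to $ 0 $:

$$
v^\pi(S_0) \leftarrow 0 + 0.1 \left[ -1 + 0.9 \cdot 0 - 0 \right] = -0.1
$$

The estimate for $ S_0 $ changes immediately after one step, without waiting for the episode to finish.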

---

#### 8. **Q-Learning (Off-policy TD Control)**:
- **Goal**: Learn the optimal policy by estimating **state-action values** $ Q(s, a) $.
  
**Q-Learning Update Rule**:
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$
- The agent updates $ Q(s, a) $ using the **maximum** value of the next state.
- **Example**: In state $ S_0 $, the agent chooses an action (e.g., **Right**). It updates $ Q(S_0, \text{Right}) $ based on the reward and the best future action.

---

#### 9. **Sarsa (On-policy TD Control)**:
- **Goal**: Update the action values based on the current policy.
  
**Sarsa Update Rule**:
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
$$
- **Example**: In $ S_0 $, if the agent moves **Right** to $ S_1 $, then chooses **Down** at $ S_1 $, the value update at $ S_0 $ will include the action taken at $ S_1 $ (making it more conservative than Q-learning).

---

#### 10. **Maximization Bias and Double Q-learning**:
- **Maximization Bias**: Q-learning tends to overestimate values because its target takes the maximum over noisy action-value estimates.

- **Double Q-learning**: Uses two separate Q-tables to reduce bias. One table selects the action, and the other evaluates it.

**Example**: If $ Q(S_1, \text{Right}) = 5 $ and $ Q(S_1, \text{Down}) = 3 $, Q-learning uses the maximum, $ 5 $, in its target even if that $ 5 $ is a noisy overestimate. With **Double Q-learning**, one table picks the best action (**Right**) and the other table supplies that action's value; each step updates one of the two tables, chosen at random, which reduces overestimation.

---

#### 11. **N-step TD and N-step SARSA**:
- **N-step TD**: Takes the next $ n $ rewards into account instead of just one, then bootstraps from the value estimate of the state reached after $ n $ steps.

- **N-step SARSA**: The same idea applied to SARSA: it uses $ n $ steps of rewards to update the action values.

**Example**: If the agent moves through several states before reaching the goal, **N-step** methods use the rewards from the next $ n $ of these steps, rather than just one, to update the value or action value.

---

#### 12. **Expected SARSA**:
- **Expected SARSA**: Uses the **expected** value over next actions rather than the value of the action actually taken, helping to **stabilize learning**.

**Example**: Instead of using the single action sampled at the next state, it averages the next state's Q-values over all possible actions according to the policy's probabilities, making the learning process smoother.

---

### **Summary Example** in a Grid World:

- The agent starts at $ S_0 $, moves right to $ S_1 $, down to $ S_4 $, and eventually reaches the goal at $ S_8 $ (for example, via $ S_5 $).
- Using **MC**, the agent will calculate the **return** from each state visited, and update the **value of states**.
- In **Q-learning**, the agent will learn **state-action values** for each state-action pair and improve its policy by choosing the action with the highest Q-value.
- With **SARSA**, the agent will update its Q-values more conservatively, using the action it actually takes next.
- Over time, the agent will refine its policy using **GPI** by evaluating and improving the policy through multiple iterations, balancing **exploration** and **exploitation**.
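
To tie the pieces together, here is a small end-to-end sketch of tabular Q-learning with $ \epsilon $-greedy exploration on the 3x3 grid. Several details are assumptions not fixed by the description above: states are numbered 0 to 8 row by row, a move that would leave the grid keeps the agent in place (and still costs -1), and the hyperparameters (`alpha`, `gamma`, `epsilon`, number of episodes, seed) are arbitrary choices.

```python
import random

N = 3          # grid is N x N, states numbered 0..8 row by row
GOAL = 8
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def step(state, action):
    """Deterministic move; leaving the grid keeps the state unchanged (assumption)."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        state = new_row * N + new_col
    reward = 10 if state == GOAL else -1
    return state, reward, state == GOAL

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N * N) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))                   # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])   # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

def greedy_path(q, start=0, max_steps=20):
    """Follow the greedy policy from `start` until the goal or the step limit."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, _ = step(state, action)
        path.append(state)
    return path

q = train()
path = greedy_path(q)
```

After training, `greedy_path(q)` follows the learned greedy policy from $ S_0 $. Swapping the `best_next` line for the value of the action actually chosen next would turn this sketch into Sarsa.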
