- **RL main components**:
  - **Policy**: The strategy for selecting actions.
  - **Reward Function $R$**: Determines immediate desirability of states/actions.
  - **Value Function $V$**: Measures long-term expected rewards.
  - **Model (optional)**: Predicts state transitions for planning.
- **Types of Reinforcement Learning**:
  - **MDP (Markov Decision Process)**: You have clear states, actions, and rewards. The system’s next state depends only on the current state and action (Markov property).
    - **Example**: A robot vacuum cleaner deciding where to go next based on its current location and the dirt it detects.
  - **POMDP (Partially Observable MDP)**: Like MDP, but you don’t fully see the state. You get some clues (observations) about the state but not the full picture.
    - **Example**: A self-driving car in foggy weather, using limited visibility to make driving decisions.
  - **Bandit Problems**: No states, only actions and rewards. Balance exploration (trying new things) and exploitation (using what works).
    - **Example**: A website testing different ads to see which ad gets more clicks. Each ad is an action, and clicks are rewards.
    - **Common Algorithms**:
      - **ε-Greedy** – Explores randomly with probability **ε**, otherwise exploits the best-known action.
      - **UCB (Upper Confidence Bound)** – Picks actions based on confidence intervals, favoring uncertain options optimistically.
      - **Softmax (Boltzmann Exploration)**
      - **Thompson Sampling** – Uses Bayesian sampling to pick actions based on estimated reward probabilities.

    - **Applications**: A/B testing, recommendation systems, and online advertising.

    - **Types**:
      - **Multi-Armed Bandit (MAB)**: A single agent chooses between multiple actions (like pulling levers on slot machines).
        - **Example**: A recommendation system suggesting different products to users.
      - **Contextual Bandit**: Reward depends on some extra information (context). When actions also change the state (context), it becomes a Markov Decision Process (MDP).
        - **Example**: Netflix recommends a movie (**action**) based on your watch history (**context**). You watch it (**reward = 1**) or skip it (**reward = 0**).
      - **Bayesian Bandit**: Uses Bayesian methods to update beliefs about the best action based on observed rewards.
        - **Example**: A medical trial adjusting treatment recommendations based on patient responses. 
  - **Multi-Agent RL**: Multiple agents learning together.
    - **Example**: Multiple robots in a warehouse.
  - **Model-Based RL**: Learns a model of the environment (how states change with actions) and uses it to plan the best actions.
    - **Example**: A chess-playing AI that simulates possible moves and their outcomes before making a decision.
    - **Algorithms**:
      - **Monte Carlo Tree Search (MCTS)**
      - **Dyna**
  - **Model-Free RL**: Learns directly from experience without a model.
    - **Example**: Teaching a robot to walk by trying different movements and learning from success or failure without knowing physics equations.
    - **Types**:
      - **Policy Gradient** (learn the policy directly):
        - **REINFORCE**
      - **Learn value functions**:
        - **Q-Learning**: Learns the action values of the optimal (greedy) policy, independent of the policy used to collect experience.
        - **SARSA**: Learns action values based on the action actually taken.
        - **Deep Q-Networks (DQN)**: Use neural networks to approximate Q-values for complex states.


---



- **Initialization of the bandit estimate $Q(a)$**, where $Q(a) = \mathbb{E}_{r \sim p(r|a)}[r]$:
    - **Realistic Initialization**: $Q(a) = 0$. Start by guessing zero expected reward until the algorithm updates it based on observed rewards.
    - **Optimistic Initialization**: $Q(a) = \psi$, $\psi > 0$. Set the initial estimates high to encourage exploration.
- **Objective function**: Maximize expected cumulative reward over time $T$ under a policy $\pi$.
    - $J_T(\pi) = \mathbb{E}_{a_t \sim \pi, r_t \sim p(r|a_t)}\left[\sum_{t=1}^{T} r_t\right]$
    - This formula is a general way to measure how good a policy is.
    - **Policies in Bandit Problems**: 
        - **Deterministic Policy**: Always chooses the same action for a given context (a plain bandit has no state, so this is a single fixed action).
        - **Greedy Policy**: Chooses the action with the highest estimated reward.
        - **Epsilon-Greedy Policy**: Mostly greedy, but sometimes explores random actions.
        - **Softmax Policy**: Chooses actions based on their estimated rewards, with a temperature parameter controlling exploration vs. exploitation.
        - **Thompson Sampling**: Samples actions based on their probability of being optimal, balancing exploration and exploitation. 
- **Update the mean reward**:
    - **Incremental Mean Update**: 
      - $Q_{n} = Q_{n-1} + \frac{1}{n} [r_{n} - Q_{n-1}]$
      - Where $n$ is the number of times the action has been selected and $r_n$ is the $n$-th reward observed for it.
    - **Learning update** (Exponential Moving Average): 
      - $Q_n \leftarrow Q_{n-1} + \alpha \left[ r_n - Q_{n-1} \right]$
      - Simply move the new mean a bit in the direction of the last observed reward, where $\alpha$ is the learning rate (0 < $\alpha$ < 1).
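
The two update rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function names and the reward list are made up for the example.

```python
def incremental_mean(q_prev, reward, n):
    """Sample-average update: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
    return q_prev + (reward - q_prev) / n


def ema_update(q_prev, reward, alpha):
    """Constant step-size update: Q_n = Q_{n-1} + alpha * (r_n - Q_{n-1})."""
    return q_prev + alpha * (reward - q_prev)


rewards = [1.0, 0.0, 1.0, 1.0]  # illustrative rewards for a single action

# The incremental mean reproduces the plain average of all rewards so far.
q_mean = 0.0
for n, r in enumerate(rewards, start=1):
    q_mean = incremental_mean(q_mean, r, n)

# The exponential moving average weights recent rewards more heavily.
q_ema = 0.0
for r in rewards:
    q_ema = ema_update(q_ema, r, alpha=0.5)

print(q_mean, q_ema)
```

The sample average weights every reward equally, while the constant step size forgets old rewards geometrically, which helps when reward distributions drift over time.
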
- **Epsilon-Greedy Policy**:
    - **How it works**:
      - With probability $\epsilon$, choose a random action.
      - With probability $1 - \epsilon$, choose the action with the highest estimated reward.
      - Epsilon-greedy exploration is simple, but when it explores it picks uniformly at random, so it tries clearly bad actions as often as promising ones.
        - **Small** $\epsilon$: the policy mostly exploits the best-known action.
        - **Large** $\epsilon$: the policy explores more.
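
A minimal sketch of the epsilon-greedy rule described above. The function name and the `rng` parameter are assumptions for illustration; ties in the greedy branch go to the lowest index here, which is one of several reasonable choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, good or bad.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])


estimates = [0.1, 0.9, 0.3]  # illustrative Q(a) values
print(epsilon_greedy(estimates, epsilon=0.1))
```
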
- **Softmax (Boltzmann) Policy**:
    - **How it works**:
      - Assigns probabilities to actions based on their estimated rewards.
      - Higher estimated rewards lead to higher probabilities of being chosen. The top-ranked action is not always picked: actions ranked second or third still have a substantial probability and are chosen some of the time.
      - The temperature parameter ($\tau$) controls exploration vs. exploitation:
        - **High $\tau$**: More exploration (actions are chosen more uniformly).
        - **Low $\tau$**: More exploitation (actions with higher rewards are chosen more frequently).
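
The usual Boltzmann form assigns $\pi(a) \propto e^{Q(a)/\tau}$. The sketch below assumes that form; the function names are illustrative, and subtracting the maximum is only a numerical-stability trick that leaves the probabilities unchanged.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Boltzmann probabilities: pi(a) proportional to exp(Q(a) / tau)."""
    m = max(q_values)  # subtract the max so exp() cannot overflow
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, tau, rng=random):
    """Sample one action index according to the Boltzmann probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs)[0]


estimates = [1.0, 2.0, 3.0]               # illustrative Q(a) values
hot = softmax_probs(estimates, tau=100.0)  # high temperature: near uniform
cold = softmax_probs(estimates, tau=0.1)   # low temperature: near greedy
print(hot, cold)
```
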
- **Upper Confidence Bound (UCB) Policy**:
    - Picks the action with the highest value of: $Q(a) + c \times \sqrt{\frac{\ln t}{n(a)}}$
    - Where:
      - $Q(a)$ is the estimated value of action $a$.
      - $c$ is a constant that controls exploration.
        - **Higher $c$**: More exploration (more weight on uncertainty).
        - **Lower $c$**: More exploitation (focus on known rewards).
      - $t$ is the total number of actions taken.
      - $n(a)$ is the number of times action $a$ has been selected.
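
A small sketch of the UCB selection rule above. The function name is illustrative, and playing every untried action once first is an assumption added to avoid dividing by $n(a) = 0$; the notes do not specify how that case is handled.

```python
import math


def ucb_action(q_values, counts, t, c):
    """Pick argmax of Q(a) + c * sqrt(ln t / n(a))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # assumption: try each action once before using the bonus
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])


# Action 0 looks slightly better but has been tried 100 times;
# action 1 has been tried once, so its uncertainty bonus dominates.
print(ucb_action([0.5, 0.4], counts=[100, 1], t=101, c=2.0))
```
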


---

- **Trace $\tau$**: A sequence of states, actions, and rewards generated by following a policy, such as $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$.
    - **Infinite Horizon**: Keep summing rewards indefinitely unless a terminal state is reached. $R(\tau) = \sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i+1}$
- **Return $R(\tau)$**: The cumulative sum of rewards from a trace.
    - **Discounted Return** gives less weight to rewards further in the future. $R(\tau) = r_{t+1} + \gamma \cdot r_{t+2} + \gamma^2 \cdot r_{t+3} + \dots$
- **Value**: The expected return when following policy $\pi$: the state value $v^\pi(s)$ and the action value $q^\pi(s,a)$.
- **Optimal Value/Policy**: 
    - $v^*(s)$: Optimal value function, the maximum expected return achievable from state $s$.
    - $q^*(s,a)$: Optimal action-value function, the maximum expected return achievable after taking action $a$ in state $s$.
    - $\pi^*$: Optimal policy, one that achieves the optimal value (maximizes expected return) in every state.
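
As a quick illustration of the discounted return, the sketch below sums a finite list of rewards in the order they were received. The function name and the example rewards are made up; accumulating back to front avoids computing powers of $\gamma$ explicitly.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # each earlier step discounts everything after it
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25
```
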


---
- **Policy Iteration**: Finds the optimal policy by alternately evaluating the current policy and improving it based on the resulting value function. It converges quickly on small state spaces and performs two steps per iteration.
  - **Steps**:
    - 1. Select random initial policy.
    - 2. **Policy Evaluation**: Calculate the value function for the current policy.
    - 3. **Policy Improvement**: Update the policy based on the value function.
    - 4. Repeat **steps 2-3** until the policy converges (no changes).

- **Value Iteration**: Combines **policy evaluation** and **policy improvement** into a single update and repeats it on the value function until convergence. Each sweep is a single cheap step, but more sweeps are usually needed before the values converge.
    - **Steps**:
        - 1. Initialize the value function $V(s)$ arbitrarily (e.g., zeros).
        - 2. For each state $s$ we calculate the value for **all actions** at state $s$, pick the maximum, and assign that as the new $V(s)$.
        - 3. Repeat step 2 for all states until the value function converges (changes become very small).
        - 4. After convergence, read off the optimal policy by choosing, in each state, the action with the highest value.
- **Policy Iteration** suits small to medium state spaces: it needs fewer iterations, but each one is slower. **Value Iteration** suits medium to large state spaces: it needs more iterations, but each one is faster.
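
The value iteration steps above can be sketched on a tiny, made-up MDP. Everything here is an assumption for illustration: the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]`, the three-state example, and the in-place sweep order.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Return (V, policy) for a model P[s][a] = [(prob, next_state, reward)]."""
    V = {s: 0.0 for s in P}  # step 1: arbitrary initial values

    def q_value(s, a):
        # Expected one-step return of taking a in s under the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: no actions, value stays 0
                continue
            best = max(q_value(s, a) for a in P[s])  # step 2: max over actions
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # step 3: stop when changes become very small
            break
    # Step 4: the greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(s, a)) for s in P if P[s]}
    return V, policy


# Hypothetical MDP: A -> B -> T, with reward 1 for entering terminal state T.
mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"back": [(1.0, "A", 0.0)], "go": [(1.0, "T", 1.0)]},
    "T": {},
}
V, policy = value_iteration(mdp)
print(V, policy)
```

Policy iteration would instead alternate a full evaluation of a fixed policy with a greedy improvement step; the `q_value` helper would be shared by both.
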


---

- Dynamic Programming finds the best policy (best actions) when you know **everything** about the environment. $\to$ Policy Iteration.
- When we **don't have full knowledge** of the environment, we use **Monte Carlo methods** to learn from **experience**. $\to$ Model-Free RL.
- **Monte Carlo Methods**: Learn from complete episodes (traces) to estimate value functions.
  - MC estimates $v^\pi(s)$ and $q^\pi(s,a)$ by averaging returns from multiple episodes we have seen before.
    - **First-Visit MC**: Averages returns from the first time each state-action pair is visited in an episode.
    - **Every-Visit MC**: Averages returns from every time each state-action pair is visited in an episode.
  - **Exploring starts**: Begin episodes from random state-action pairs so that every pair is visited eventually. If a state-action pair is never visited, MC cannot estimate its value.
  - $\epsilon$-greedy exploration is often used in MC to ensure all state-action pairs are explored.
  - Monte Carlo can be used inside **Generalized Policy Iteration (GPI)**:
    - **Two main types**:
      - **On-Policy MC**: Evaluates and improves the policy being used to generate episodes.
      - **Off-Policy MC**: Evaluates a different policy than the one being used to generate episodes.
        - **Example**: Using a behavior policy to explore while evaluating a target policy.
          - **Behavior Policy**: The policy used to generate episodes (exploration).
          - **Target Policy**: The policy we want to improve (evaluation).
        - **Problem**: Behavior and target policies can be different and make different decisions. We can't just average rewards because episodes came from a different policy. 
          - **Solution** = **Importance Sampling**: Give a weight to each episode based on how likely it was under the target policy compared to the behavior policy.
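
Two pieces of the Monte Carlo discussion above, sketched under stated assumptions: episodes are lists of `(state, action, reward)` tuples, where the reward is the one received after taking the action, and policies are nested dictionaries of action probabilities. The function names and the example data are illustrative only.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate v(s) by averaging returns that follow the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return from every time step, back to front.
        g, returns = 0.0, []
        for _, _, r in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        seen = set()
        for (s, _, _), g_t in zip(episode, returns):
            if s not in seen:  # only the first visit in an episode counts
                seen.add(s)
                returns_sum[s] += g_t
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


def importance_ratio(episode, target, behavior):
    """Product over steps of target(a|s) / behavior(a|s) for one episode."""
    rho = 1.0
    for s, a, _ in episode:
        rho *= target[s][a] / behavior[s][a]
    return rho


episodes = [
    [("A", "go", 0.0), ("B", "go", 1.0)],
    [("A", "go", 0.0), ("A", "go", 0.0), ("B", "go", 2.0)],
]
values = first_visit_mc(episodes)

target = {"A": {"go": 1.0, "stay": 0.0}, "B": {"go": 1.0}}
behavior = {"A": {"go": 0.5, "stay": 0.5}, "B": {"go": 1.0}}
rho = importance_ratio(episodes[0], target, behavior)
print(values, rho)
```

In off-policy evaluation, each episode's return would be multiplied by its ratio before averaging, so episodes the target policy would rarely produce contribute little.
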

- **Temporal Difference (TD) Learning**: another way to learn from **experience**, combines Monte Carlo and dynamic programming. It learns **step-by-step** during an episode instead of waiting until the episode ends.
  - **MC**: Waits until the end of an episode to update values.
  - **TD**: Updates values after each step using the immediate reward and estimated value of the next state.
  - **Update Rule**: $ v(s_t) \leftarrow v(s_t) + \alpha \left[ r_{t+1} + \gamma v(s_{t+1}) - v(s_t) \right]$
    - The part in $[\cdots]$ is called the **TD error**. It corrects the value estimate right after each step, unlike MC, which waits for the episode to end. This update is **policy evaluation**; **policy improvement** happens when actions are chosen (e.g., greedily or $\epsilon$-greedily) from the updated values.
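
A minimal sketch of one TD(0) update on a value table stored as a dictionary. The function name and the `terminal` flag are assumptions; the flag simply drops the bootstrap term when the next state ends the episode.

```python
def td0_update(v, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to v in place and return the TD error."""
    target = r if terminal else r + gamma * v[s_next]
    td_error = target - v[s]
    v[s] += alpha * td_error  # move v(s) a fraction alpha toward the target
    return td_error


v = {"A": 0.0, "B": 1.0}  # illustrative value estimates
err = td0_update(v, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(err, v["A"])
```
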

| Aspect              | Monte Carlo (MC)                    | Temporal Difference (TD)                     |
| ------------------- | ----------------------------------- | -------------------------------------------- |
| When update happens | After whole episode ends            | After every step                             |
| Use of next state   | No (waits for total return)         | Yes (bootstraps with estimated $v(s_{t+1})$) |
| Variance            | High (because it uses total return) | Lower (uses one-step estimate)               |
| Bias                | No bias (unbiased)                  | Slight bias due to bootstrapping             |
| Learning speed      | Slower                              | Faster and more online                       |

- **TD Control algorithms**
  - **SARSA (On-policy TD Control)**: Learns action values for the policy being followed. (Learning how good a move is based on the next move you actually make.)
    - Update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
  - **Q-learning (Off-policy TD Control)**: Learns the optimal action values regardless of the policy being followed. (Uses the **maximum** estimated value of the next state (best possible action), not necessarily the action actually taken.)
    - Update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
    - **Maximization Bias**: Q-learning can overestimate action values due to always selecting the maximum Q-value in future states.
    - **Double Q-learning**: Addresses this bias by maintaining two separate Q-tables and using one to select actions and the other to evaluate them, reducing overestimation.

  - **Expected SARSA**: A variant of SARSA that uses the expected value over possible next actions instead of the single next action.
    - Update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_a \pi(a|s_{t+1}) Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
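
The three update rules differ only in the target they bootstrap from. The sketch below puts them side by side, assuming Q-values stored as `Q[state][action]` and a policy given as a dictionary of action probabilities; all names and numbers are illustrative.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy: bootstrap from the action actually taken next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Off-policy: bootstrap from the best next action."""
    return r + gamma * max(Q[s_next].values())


def expected_sarsa_target(Q, r, s_next, pi_next, gamma):
    """Bootstrap from the policy-weighted average over next actions."""
    return r + gamma * sum(pi_next[a] * Q[s_next][a] for a in Q[s_next])


def td_control_update(Q, s, a, target, alpha):
    """Shared update: move Q(s, a) a fraction alpha toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])


Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 3.0}}
pi_s1 = {"left": 0.5, "right": 0.5}

t_sarsa = sarsa_target(Q, 1.0, "s1", "left", gamma=0.5)
t_q = q_learning_target(Q, 1.0, "s1", gamma=0.5)
t_exp = expected_sarsa_target(Q, 1.0, "s1", pi_s1, gamma=0.5)
td_control_update(Q, "s0", "right", t_q, alpha=0.1)
print(t_sarsa, t_q, t_exp, Q["s0"]["right"])
```
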


- **N-step TD and N-step SARSA**: Extensions of TD and SARSA that consider multiple steps for better value estimation.

### Summary:

| Algorithm      | On-policy / Off-policy | Next state update                | Main idea                                           |
| -------------- | ---------------------- | -------------------------------- | --------------------------------------------------- |
| SARSA          | On-policy              | Uses next action actually taken  | Learns action values following policy               |
| Q-learning     | Off-policy             | Uses max action value next state | Learns optimal action values regardless of behavior |
| Expected SARSA | On or Off-policy       | Uses expected value over actions | Averages over possible next actions                 |


#### Note:

* **Monte Carlo Tree Search (MCTS)** is a **planning** method. It uses a **model** to simulate many possible future paths from the current state (reversible access) and builds a search tree to decide the best next action. It focuses on a **local solution**.

* **Monte Carlo methods in RL** are **model-free learning** methods. They learn from **real experience** by averaging returns from full episodes. They don’t use a model and update values globally for all states visited.

So, MCTS = planning with a model, and Monte Carlo RL = learning from experience without a model.

---


- **Back-up**: An update to the estimate of a state value $V(s)$ or state-action value $Q(s,a)$ using information about future rewards.
    - **Expected Back-ups**: used in planning with a full model of the environment. (e.g., Dynamic Programming)
    - **Sample Back-ups**: used in reinforcement learning without a full model, updating values based on sampled experiences. (e.g., TD learning, Monte Carlo methods)
- **On-policy Back-ups**: updates values based on the policy being followed. The agent learns about the consequences of its current behavior. 
  * Example: **SARSA** — updates use the next action that the agent actually takes.
- **Off-policy Back-ups**: updates values based on a different policy than the one being followed. The agent learns about the consequences of actions that may not be taken in the current behavior.
  * Example: **Q-learning** — updates use the best possible next action, regardless of what action the agent actually took.
* **1-Step Back-ups (Shallow updates):**
  * Updates use information from only one step ahead (the immediate next state).
  * Faster but may be less accurate.
  * Examples:
    * **TD(0)** updates values using the reward and the value estimate of the next state only.
    * **SARSA** and **Q-learning** likewise bootstrap from a single next step. (**Monte Carlo** is not 1-step: it waits for the episode to end and uses the total return, so it belongs under multi-step back-ups.)
* **Multi-Step Back-ups (Deep updates):**
  * Updates consider multiple steps into the future, often the entire remaining episode or a sequence of states.
  * More accurate but computationally heavier.
  * Examples:
    * **Monte Carlo** updates after entire episodes (multi-step).
    * **TD(λ)** uses a weighted average of different multi-step returns.
    * **Dynamic Programming** is a different case: each back-up looks only one step ahead but takes the expectation over all possible next states, so it is wide rather than deep.

| Algorithm                    | Type of Back-up  | Depth      | Explanation                                                       |
| ---------------------------- | ---------------- | ---------- | ----------------------------------------------------------------- |
| **TD(0)**                    | Sample Back-up   | 1-step     | Uses one sampled transition, updates values after one step        |
| **Monte Carlo (MC)**         | Sample Back-up   | Multi-step | Uses full episode returns, updates after episode ends             |
| **Dynamic Programming (DP)** | Expected Back-up | 1-step     | Uses full model to calculate the expectation over all next states |



- **MDP Dynamics**
    - **Models** (Reversible access):
        - **Distribution model:** Gives the full probability distribution over next states and rewards for a given state-action pair.
        - **Sample model:** Does not expose the whole distribution; it returns one possible next state drawn at random.
    - **Environments** (Irreversible access):
        - **Sample Environment:** You only ever get samples, because you actually take an action and observe what happens next; you never see the full distribution.

- **MDP Approaches**:
    - **Planning**:  assumes reversible access to the environment dynamics, meaning you have a model that you can query at any state-action pair, anytime (you can simulate outcomes without actually moving).
    - **Learning**: (specifically model-free RL) assumes irreversible access, meaning you can only interact with the real environment step-by-step, moving forward and learning from experience.

| Access Type        | Solution Type   | Method Category                 | Example Algorithm(s)           |
| ------------------ | --------------- | ------------------------------- | ------------------------------ |
| Reversible (Model) | Local solution  | **Planning**                    | Monte Carlo Tree Search (MCTS) |
| Reversible (Model) | Global solution | **Model-based RL** (borderline) | Dynamic Programming (DP)       |
| Irreversible (Env) | Global solution | **Model-free RL**               | Q-learning, SARSA              |

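- **Tabular Model Learning**: A way to learn a model of the environment from collected experience when no model is given beforehand.
    - Keep arrays for the transition counts $n(s,a,s')$ and reward sums $R_{sum}(s,a,s')$, each of size $|S| \times |A| \times |S|$.
    - Estimate transition probabilities as $\hat{p}(s'|s,a) = n(s,a,s') / \sum_{s''} n(s,a,s'')$ and mean rewards as $\hat{r}(s,a,s') = R_{sum}(s,a,s') / n(s,a,s')$.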
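
A minimal sketch of a tabular model built from counts and reward sums. As an assumption, it uses dictionaries keyed by `(s, a, s_next)` instead of dense $|S| \times |A| \times |S|$ arrays, which stores the same information sparsely; the class and method names are hypothetical.

```python
from collections import defaultdict


class TabularModel:
    """Counts n(s, a, s') and reward sums R_sum(s, a, s') from experience."""

    def __init__(self):
        self.n = defaultdict(int)        # n[(s, a, s_next)]
        self.r_sum = defaultdict(float)  # r_sum[(s, a, s_next)]

    def update(self, s, a, r, s_next):
        """Record one observed transition."""
        self.n[(s, a, s_next)] += 1
        self.r_sum[(s, a, s_next)] += r

    def transition_probs(self, s, a):
        """Estimated p(s' | s, a) as normalized counts (empty if unseen)."""
        counts = {k[2]: c for k, c in self.n.items() if k[0] == s and k[1] == a}
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()}

    def mean_reward(self, s, a, s_next):
        """Estimated reward; assumes the transition was seen at least once."""
        return self.r_sum[(s, a, s_next)] / self.n[(s, a, s_next)]


model = TabularModel()
for reward, nxt in [(1.0, "s1"), (3.0, "s1"), (0.0, "s2"), (2.0, "s1")]:
    model.update("s0", "a", reward, nxt)
print(model.transition_probs("s0", "a"), model.mean_reward("s0", "a", "s1"))
```
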


- **Dyna**: Model-based RL algorithm that combines learning and planning.
    - **Steps**:
        1. Learn a model of the environment from experience.
            - Store in a table the reward for taking action $a$ in state $s$ and ending up in state $s'$.
        2. Update the value function or policy based on real experiences.
        3. Use the model to simulate experiences and update the value function or policy.
            - These planning updates change Q-values from simulated transitions, without taking any actions in the environment.
    - **Benefits**: Combines the strengths of model-based and model-free approaches, allowing for faster learning and better exploration.
    - **Parameters**:
        - Number of planning updates K: How many simulated updates you do using the model for every real experience.
        - Learning rate $\alpha$: How much you update the Q-values each time.
        - Discount factor $\gamma$: How much future rewards count compared to immediate rewards.
        - Exploration parameter $\epsilon$: Probability of choosing a random action (to explore).
    - **Algorithm**:
        1. Start with Q-values and model counts all zero.
        2. At each step, pick an action using epsilon-greedy policy.
        3. Take the action, get reward and next state from the real environment.
        4. Update the model (counts and rewards) based on this experience.
        5. Update Q-value using the real experience (Q-learning update).
        6. Do extra updates by simulating experiences from the model K times, updating Q-values each time.
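
The algorithm above can be sketched as tabular Dyna-Q. Several simplifications are assumptions made for brevity: the environment is deterministic, so the model stores the last observed outcome per state-action pair instead of counts; `step(s, a)` is a hypothetical environment function returning `(reward, next_state, done)`; and greedy ties are broken at random so the untrained agent does not get stuck.

```python
import random


def dyna_q(step, start, actions, episodes=50, K=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q for a deterministic environment (illustrative sketch)."""
    rng = random.Random(seed)
    Q = {}      # Q[(s, a)]; missing entries count as 0 (step 1)
    model = {}  # model[(s, a)] = (reward, next_state, done)

    def q(s, a):
        return Q.get((s, a), 0.0)

    def q_update(s, a, r, s2, done):
        best_next = 0.0 if done else max(q(s2, b) for b in actions)
        Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:          # step 2: epsilon-greedy
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: (q(s, b), rng.random()))
            r, s2, done = step(s, a)            # step 3: act in the real env
            model[(s, a)] = (r, s2, done)       # step 4: update the model
            q_update(s, a, r, s2, done)         # step 5: real-experience update
            for _ in range(K):                  # step 6: K planning updates
                ps, pa = rng.choice(list(model))
                q_update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q


def chain_step(s, a):
    """Hypothetical 5-state chain: reward 1 on reaching state 4."""
    s2 = min(max(s + a, 0), 4)
    done = s2 == 4
    return (1.0 if done else 0.0), s2, done


Q = dyna_q(chain_step, start=0, actions=[-1, 1])
greedy = {s: max([-1, 1], key=lambda b: Q.get((s, b), 0.0)) for s in range(4)}
print(greedy)
```

Setting `K=0` reduces this to plain Q-learning, which makes it easy to compare how much the planning updates speed up learning.
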

- **Prioritized Sweeping**: An extension of Dyna (model-based). Instead of updating states uniformly or at random, it focuses on the states whose values are expected to change the most. This speeds up learning by spending effort where it matters most.

    - **Steps**:
        1. Keep a **priority queue** of state-action pairs ordered by the magnitude of their TD error (larger TD error $\to$ higher priority $p$ $\to$ more important to update).
            - A priority queue is a data structure that keeps items ordered by priority, so popping removes the highest-priority item first.
        2. Update the most important states first, ensuring that the learning process focuses on the most impactful changes.
    - **Key Parameter**: 
        - **Threshold $\theta$:** Minimum priority for a state-action to be added to the queue. Controls how sensitive the algorithm is to changes.
    - **Algorithm**:
        - 1. **Initialize** Q-values, model counts, and an empty priority queue.
        - 2. **Interact** with the environment:
            * Take action in current state using exploration (like epsilon-greedy).
            * Observe reward and next state.
            * Update the model with this transition.
        - 3. **Calculate priority** for that state-action: $p = |r + \gamma \max_{a'} Q(s', a') - Q(s, a)|$. If $p > \theta$ (threshold), add $(s, a)$ to the priority queue.

        - 4. **Planning updates (repeat K times):**
            * Pop the highest priority $(s, a)$ from the queue.
            * Use the model to simulate next state and reward.
            * Update Q-value for $(s, a)$ with TD update.
            * For all state-actions leading to $s$, calculate their priority and add to queue if above threshold.
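
The planning loop in step 4 can be sketched with Python's `heapq`, which is a min-heap, so priorities are stored negated. The function name, the deterministic `model[(s, a)] = (reward, next_state, done)` layout, and the `predecessors` dictionary are assumptions for illustration; states and actions must be comparable so that tied priorities can be ordered.

```python
import heapq


def prioritized_planning(Q, model, predecessors, queue, actions,
                         K=20, alpha=1.0, gamma=0.9, theta=1e-4):
    """Run up to K planning updates, highest priority first (sketch)."""

    def q(s, a):
        return Q.get((s, a), 0.0)

    def td_error(s, a):
        r, s2, done = model[(s, a)]
        best_next = 0.0 if done else max(q(s2, b) for b in actions)
        return r + gamma * best_next - q(s, a)

    for _ in range(K):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)            # highest priority first
        Q[(s, a)] = q(s, a) + alpha * td_error(s, a)
        for sp, ap in predecessors.get(s, ()):      # pairs that lead into s
            p = abs(td_error(sp, ap))
            if p > theta:                           # only queue large changes
                heapq.heappush(queue, (-p, (sp, ap)))
    return Q


# Hypothetical chain 0 -> 1 -> 2 -> 3, reward 1 on the final transition.
model = {(0, 1): (0.0, 1, False), (1, 1): (0.0, 2, False), (2, 1): (1.0, 3, True)}
predecessors = {1: {(0, 1)}, 2: {(1, 1)}, 3: {(2, 1)}}
queue = [(-1.0, (2, 1))]  # the rewarding transition was just observed
Q = prioritized_planning({}, model, predecessors, queue, actions=[-1, 1])
print(Q)
```

In this example the reward propagates backward along the chain in three updates, whereas uniformly sampled planning updates would spend most of their effort on pairs whose values cannot change yet.
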

