- **RL main components**:
  - **Policy**: The strategy for selecting actions.
  - **Reward Function $R$**: Determines immediate desirability of states/actions.
  - **Value Function $V$**: Measures long-term expected rewards.
  - **Model (optional)**: Predicts state transitions for planning.
- **Types of Reinforcement Learning**:
  - **MDP (Markov Decision Process)**: You have clear states, actions, and rewards. The system’s next state depends only on the current state and action (Markov property).
    - **Example**: A robot vacuum cleaner deciding where to go next based on its current location and the dirt it detects.
  - **POMDP (Partially Observable MDP)**: Like MDP, but you don’t fully see the state. You get some clues (observations) about the state but not the full picture.
    - **Example**: A self-driving car in foggy weather, using limited visibility to make driving decisions.
  - **Bandit Problems**: No states, only actions and rewards. Balance exploration (trying new things) and exploitation (using what works).
    - **Example**: A website testing different ads to see which ad gets more clicks. Each ad is an action, and clicks are rewards.
    - **Common Algorithms**:
      - **ε-Greedy** – Explores randomly with probability **ε**, otherwise exploits the best-known action.
      - **UCB (Upper Confidence Bound)** – Picks actions based on confidence intervals, favoring uncertain options optimistically.
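      - **Softmax (Boltzmann Exploration)** – Samples actions with probabilities that grow with their estimated rewards, with a temperature parameter controlling how much it explores.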
      - **Thompson Sampling** – Uses Bayesian sampling to pick actions based on estimated reward probabilities.

    - **Applications**: A/B testing, recommendation systems, and online advertising.

    - **Types**:
      - **Multi-Armed Bandit (MAB)**: A single agent chooses between multiple actions (like pulling levers on slot machines).
        - **Example**: A recommendation system suggesting different products to users.
      - **Contextual Bandit**: Reward depends on some extra information (context). When actions also change the state (context), it becomes a Markov Decision Process (MDP).
        - **Example**: Netflix recommends a movie (**action**) based on your watch history (**context**). You watch it (**reward = 1**) or skip it (**reward = 0**).
      - **Bayesian Bandit**: Uses Bayesian methods to update beliefs about the best action based on observed rewards.
        - **Example**: A medical trial adjusting treatment recommendations based on patient responses. 
  - **Multi-Agent RL**: Multiple agents learning together.
    - **Example**: Multiple robots in a warehouse.
  - **Model-Based RL**: Learns a model of the environment (how states change with actions) and uses it to plan the best actions.
    - **Example**: A chess-playing AI that simulates possible moves and their outcomes before making a decision.
    - **Algorithms**:
      - **Monte Carlo Tree Search (MCTS)**
      - **Dyna**
  - **Model-Free RL**: Learns directly from experience without a model.
    - **Example**: Teaching a robot to walk by trying different movements and learning from success or failure without knowing physics equations.
    - **Types**:
      - **Policy Gradient** (learn the policy directly):
        - **REINFORCE**
      - **Learn value functions**:
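        - **Q-Learning**: Off-policy. Learns action values assuming the best action is taken next, independent of the policy actually being followed.
        - **SARSA**: On-policy. Learns action values based on the action the current policy actually takes next.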
        - **Deep Q-Networks (DQN)**: Use neural networks to approximate Q-values for complex states.
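
The difference between Q-Learning and SARSA shows up in the update target. Below is a minimal tabular sketch, assuming `Q` is stored as a dict of dicts and using the usual learning rate `alpha` and discount factor `gamma`. The function names and the table layout are illustrative assumptions, not part of these notes.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical two-state table with actions "l" and "r".
Q = {0: {"l": 0.0, "r": 0.0}, 1: {"l": 1.0, "r": 3.0}}
q_learning_update(Q, s=0, a="l", r=1.0, s_next=1, alpha=0.5, gamma=0.9)
sarsa_update(Q, s=0, a="r", r=1.0, s_next=1, a_next="l", alpha=0.5, gamma=0.9)
```

Q-Learning uses the value of the best next action (3.0) even if the agent ends up taking `"l"`, whereas SARSA uses the value of the action it really takes (1.0).
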


---



- **Initialization of the bandit estimate $Q(a)$**, which approximates the expected reward $Q(a) = \mathbb{E}_{r \sim p(r|a)}[r]$:
    - **Realistic Initialization**: $Q(a)=0$. Starts by guessing zero expected reward until the algorithm updates the estimate from observed rewards.
    - **Optimistic Initialization**: $Q(a)=\psi$, $\psi > 0$. Sets the initial estimates high so that every untried action looks promising, which encourages exploration.
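
To see why optimistic initialization encourages exploration, here is a small sketch of a purely greedy agent on a toy bandit. It assumes deterministic rewards per action to keep the effect easy to follow. The function name `greedy_run` and the reward values are made up for illustration.

```python
def greedy_run(q_init, true_rewards, steps):
    """Run a purely greedy agent; return final estimates and pull counts."""
    n_arms = len(true_rewards)
    q = [q_init] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        # Greedy choice; ties go to the lowest index.
        a = max(range(n_arms), key=lambda i: q[i])
        counts[a] += 1
        # Incremental mean update toward the observed (deterministic) reward.
        q[a] += (true_rewards[a] - q[a]) / counts[a]
    return q, counts


rewards = [0.2, 0.5, 0.8]  # hypothetical rewards; the last action is best
_, realistic_counts = greedy_run(0.0, rewards, steps=10)
_, optimistic_counts = greedy_run(5.0, rewards, steps=10)
```

With $Q(a)=0$ the agent locks onto the first action it tries, while with $Q(a)=5$ each action disappoints once, so all of them get tried before the agent settles on the best one.
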
- **Objective function**: Maximize expected cumulative reward over time $ T $ under a policy $ \pi $.
    - $J_T(\pi) = \mathbb{E}_{a_t \sim \pi, r_t \sim p(r|a_t)}\left[\sum_{t=1}^{T} r_t\right]$
    - This formula is a general way to measure how good a policy is.
    - **Policies in Bandit Problems**: 
        - **Deterministic Policy**: Always chooses the same action given the same estimates (a bandit has no state to condition on).
        - **Greedy Policy**: Always chooses the action with the highest estimated reward, $a = \arg\max_a Q(a)$.
        - **Epsilon-Greedy Policy**: Mostly greedy, but sometimes explores random actions.
        - **Softmax Policy**: Chooses actions based on their estimated rewards, with a temperature parameter controlling exploration vs. exploitation.
        - **Thompson Sampling**: Samples actions based on their probability of being optimal, balancing exploration and exploitation. 
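
Thompson Sampling is the least obvious of these policies, so here is a sketch for the special case of 0/1 rewards (e.g., click or no click), assuming a Beta(1, 1) prior for each action. The function names, the success probabilities, and the seed are hypothetical choices for illustration.

```python
import random


def thompson_action(successes, failures, rng):
    """Sample one plausible success rate per action and pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])


def run_bandit(true_probs, steps, seed=0):
    """Simulate a Bernoulli bandit played with Thompson Sampling."""
    rng = random.Random(seed)
    successes = [0] * len(true_probs)
    failures = [0] * len(true_probs)
    for _ in range(steps):
        a = thompson_action(successes, failures, rng)
        if rng.random() < true_probs[a]:
            successes[a] += 1  # reward = 1
        else:
            failures[a] += 1   # reward = 0
    return successes, failures


successes, failures = run_bandit([0.2, 0.8], steps=2000)
pulls = [s + f for s, f in zip(successes, failures)]
```

Actions with little data have wide posteriors, so they still win the sampling step occasionally. As evidence accumulates, the better action is chosen more and more often.
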
- **Update the mean reward**:
    - **Incremental Mean Update**: 
      - $Q_{n} = Q_{n-1} + \frac{1}{n} [r_{n} - Q_{n-1}]$
      - Where $n$ is the number of times action $a$ has been selected so far, and $r_n$ is the $n$-th reward observed for it.
    - **Learning update** (Exponential Moving Average): 
      - $Q_n \leftarrow Q_{n-1} + \alpha \left[ r_n - Q_{n-1} \right]$
      - Moves the estimate a small step toward the last observed reward, where $\alpha$ is the learning rate ($0 < \alpha < 1$). Recent rewards count more than old ones, which helps when reward distributions drift over time.
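
Both updates above fit in one line of code each. A minimal sketch, with function names chosen here for illustration:

```python
def incremental_mean(q_prev, reward, n):
    """Sample-average update: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
    return q_prev + (reward - q_prev) / n


def ema_update(q_prev, reward, alpha):
    """Constant step-size update: Q_n = Q_{n-1} + alpha * (r_n - Q_{n-1})."""
    return q_prev + alpha * (reward - q_prev)


# The incremental mean reproduces the plain average without storing rewards.
rewards = [1.0, 0.0, 1.0, 1.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = incremental_mean(q, r, n)
```

After the loop, `q` matches `sum(rewards) / len(rewards)`, but only the running estimate and the count had to be kept in memory.
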
- **Epsilon-Greedy Policy**:
    - **How it works**:
      - With probability $\epsilon$, choose a random action.
      - With probability $1 - \epsilon$, choose the action with the highest estimated reward.
      - Epsilon-greedy exploration is simple, but it can waste time: exploratory steps pick every action with equal probability, including clearly bad ones.
        - **Small** $\epsilon$: the policy mostly exploits the best-known action.
        - **Large** $\epsilon$: the policy explores more.
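
The two cases above map directly onto a short function. This is a sketch, assuming the estimates are kept in a plain list `q_values` indexed by action; the name `epsilon_greedy` is illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with prob. epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, good or bad.
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest estimate (ties go to lowest index).
    return max(range(len(q_values)), key=lambda a: q_values[a])


estimates = [0.1, 0.9, 0.3]  # hypothetical Q(a) values
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)
```

Setting `epsilon=0.0` recovers the pure greedy policy, and `epsilon=1.0` gives uniform random exploration.
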
- **Softmax (Boltzmann) Policy**:
    - **How it works**:
      - Assigns probabilities to actions based on their estimated rewards.
      - Higher estimated rewards lead to higher probabilities of being chosen. The top action is not always selected: actions ranked 2nd or 3rd still have a substantial chance, while clearly bad actions are rarely tried.
      - The temperature parameter ($\tau$) controls exploration vs. exploitation:
        - **High $\tau$**: More exploration (actions are chosen more uniformly).
        - **Low $\tau$**: More exploitation (actions with higher rewards are chosen more frequently).
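
A common form of the softmax policy is $\pi(a) = \frac{\exp(Q(a)/\tau)}{\sum_b \exp(Q(b)/\tau)}$. The sketch below implements it under that assumption; the function names are illustrative, and subtracting the maximum before exponentiating is only a numerical-stability trick that does not change the probabilities.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Turn value estimates into action probabilities with temperature tau."""
    m = max(q_values)  # shift for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


hot = softmax_probs([1.0, 2.0], tau=100.0)   # nearly uniform
cold = softmax_probs([1.0, 2.0], tau=0.01)   # nearly greedy
```

Comparing `hot` and `cold` shows the effect of the temperature: the same estimates yield almost equal probabilities at high $\tau$ and an almost deterministic choice at low $\tau$.
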
- **Upper Confidence Bound (UCB) Policy**:
    - Picks the action with the highest value of: $Q(a) + c \times \sqrt{\frac{\ln t}{n(a)}}$
    - Where:
      - $Q(a)$ is the estimated value of action $a$.
      - $c$ is a constant that controls exploration.
        - **Higher $c$**: More exploration (more weight on uncertainty).
        - **Lower $c$**: More exploitation (focus on known rewards).
      - $t$ is the total number of actions taken.
      - $n(a)$ is the number of times action $a$ has been selected.
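
The UCB rule above can be sketched as follows. The formula divides by $n(a)$, so this version assumes each action is tried once before the formula is used; that convention and the name `ucb_action` are assumptions of this sketch.

```python
import math


def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / n(a))."""
    # Try each action once first, so n(a) is never zero in the formula.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


# Action 0 looks better but is well explored; action 1 was tried only once.
explore = ucb_action([0.5, 0.4], counts=[100, 1], t=101, c=1.0)
exploit = ucb_action([0.5, 0.4], counts=[100, 1], t=101, c=0.0)
```

With `c=1.0` the uncertainty bonus of the rarely tried action outweighs its slightly lower estimate, while `c=0.0` reduces the rule to the greedy policy.
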


---

- **Trace $\tau$**: A sequence of states, actions, and rewards generated by following a policy, e.g. $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$. Here $\tau$ denotes a trace, not the softmax temperature.
    - **Infinite Horizon**: Rewards are summed indefinitely, or until a terminal state is reached: $R(\tau) = \sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i}$
- **Return $ R(\tau) $**: The cumulative sum of rewards from a trace
    - **Discounted Return** gives less weight to rewards that arrive further in the future: $R(\tau) = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \dots$, where $0 \le \gamma \le 1$ is the discount factor.
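
For a finite list of rewards, the discounted return can be computed in one backward pass using $R_t = r_t + \gamma R_{t+1}$. A minimal sketch, with the function name chosen here for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a list."""
    g = 0.0
    # Work backwards so each reward is discounted once per remaining step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

With `gamma=1.0` the function returns the plain sum of rewards, and with `gamma=0.0` only the immediate reward counts.
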
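- **Value**: The expected return under policy $\pi$: $v^\pi(s)$ when starting from state $s$, and $q^\pi(s,a)$ when starting from state $s$ and taking action $a$ first.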
- **Optimal Value/Policy**: 
    - $ v^*(s) $: Optimal (maximum expected) value function for state $s$.
    - $ q^*(s,a) $: Optimal (maximum expected) action-value function for state $s$ and action $a$.
    - $ \pi^* $: Optimal policy, i.e. a policy that achieves the optimal value in every state (maximizes expected return).


---
- **Policy Iteration**: Finds the optimal policy by alternately evaluating the current policy and improving it based on the resulting value function. Converges quickly on small state spaces. Each iteration performs two steps: evaluation and improvement.
  - **Steps**:
    - 1. Select a random initial policy.
    - 2. **Policy Evaluation**: Calculate the value function for the current policy.
    - 3. **Policy Improvement**: Update the policy based on the value function.
    - 4. Repeat **steps 2-3** until the policy converges (no changes).
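
The steps above can be sketched for a small tabular MDP. This sketch assumes the model is given as `P[s][a] = [(prob, next_state, reward), ...]`, evaluates the policy iteratively rather than by solving a linear system, and starts from the first listed action instead of a random one so the run is reproducible. The toy MDP and all names are hypothetical.

```python
def policy_iteration(P, gamma=0.9, tol=1e-8):
    """Return (policy, V) for a tabular MDP given as P[s][a] -> transitions."""
    states = list(P)
    policy = {s: next(iter(P[s])) for s in states}  # step 1: arbitrary start
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected return of taking a in s, then following the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Step 2: policy evaluation (sweep until V stops changing).
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Step 3: policy improvement (switch only on a strict improvement).
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:  # step 4: stop when the policy no longer changes
            return policy, V


# Toy MDP: staying in state 1 pays 2 per step; moving 0 -> 1 pays 1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy, V = policy_iteration(P)
```

On this toy problem the policy changes only once (state 0 switches from `"stay"` to `"go"`), which illustrates why policy iteration often needs few outer iterations even though each evaluation phase is expensive.
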

- **Value Iteration**: Combines **policy evaluation** and **policy improvement** into a single update and finds the optimal policy by repeatedly updating the value function until it converges. Each iteration is a single cheap step, but more iterations are usually needed before the values converge.
    - **Steps**:
        - 1. Initialize the value function $V(s)$ arbitrarily (e.g., zeros).
        - 2. For each state $s$ we calculate the value for **all actions** at state $s$, pick the maximum, and assign that as the new $V(s)$.
        - 3. Repeat step 2 for all states until the value function converges (changes become very small).
        - 4. After convergence, read off the optimal policy by choosing, in each state, the action with the highest value.
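
The steps above can be sketched for a small tabular MDP. This sketch assumes the model is given as `P[s][a] = [(prob, next_state, reward), ...]`. The toy MDP and all names are hypothetical.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Return (policy, V) for a tabular MDP given as P[s][a] -> transitions."""
    V = {s: 0.0 for s in P}  # step 1: initialize values to zero

    def q(s, a):
        # Expected return of taking a in s, then following the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            # Step 2: evaluate all actions and keep the maximum.
            v_new = max(q(s, a) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:  # step 3: stop when changes become very small
            break

    # Step 4: extract the greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
    return policy, V


# Toy MDP: staying in state 1 pays 2 per step; moving 0 -> 1 pays 1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy, V = value_iteration(P)
```

No explicit policy is stored during the loop; it is only extracted at the end, which is what makes each sweep cheap.
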
- **Comparison**: **Policy Iteration** suits small to medium state spaces and needs fewer but more expensive iterations, while **Value Iteration** suits medium to large state spaces and needs more but cheaper iterations.


---