- **RL main components**:
  - **Policy**: The strategy for selecting actions.
  - **Reward Function $R$**: Determines immediate desirability of states/actions.
  - **Value Function $V$**: Measures long-term expected rewards.
  - **Model (optional)**: Predicts state transitions for planning.
- **Types of Reinforcement Learning**:
  - **MDP (Markov Decision Process)**: You have clear states, actions, and rewards. The system’s next state depends only on the current state and action (Markov property).
    - **Example**: A robot vacuum cleaner deciding where to go next based on its current location and the dirt it detects.
  - **POMDP (Partially Observable MDP)**: Like MDP, but you don’t fully see the state. You get some clues (observations) about the state but not the full picture.
    - **Example**: A self-driving car in foggy weather, using limited visibility to make driving decisions.
  - **Bandit Problems**: No states, only actions and rewards. Balance exploration (trying new things) and exploitation (using what works).
    - **Example**: A website testing different ads to see which ad gets more clicks. Each ad is an action, and clicks are rewards.
    - **Common Algorithms**:
      - **ε-Greedy** – Explores randomly with probability **ε**, otherwise exploits the best-known action.
      - **UCB (Upper Confidence Bound)** – Picks actions based on confidence intervals, favoring uncertain options optimistically.
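      - **Softmax (Boltzmann Exploration)** – Samples actions with probability proportional to $\exp(Q(a)/\tau)$, so higher-valued actions are chosen more often.
      - **Thompson Sampling** – Samples a reward estimate for each action from its posterior distribution and picks the action with the highest sample.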

    - **Applications**: A/B testing, recommendation systems, and online advertising.

    - **Types**:
      - **Multi-Armed Bandit (MAB)**: A single agent chooses between multiple actions (like pulling levers on slot machines).
        - **Example**: A recommendation system suggesting different products to users.
      - **Contextual Bandit**: Reward depends on some extra information (context). When actions also change the state (context), it becomes a Markov Decision Process (MDP).
        - **Example**: Netflix recommends a movie (**action**) based on your watch history (**context**). You watch it (**reward = 1**) or skip it (**reward = 0**).
      - **Bayesian Bandit**: Uses Bayesian methods to update beliefs about the best action based on observed rewards.
        - **Example**: A medical trial adjusting treatment recommendations based on patient responses. 
  - **Multi-Agent RL**: Multiple agents learning together.
    - **Example**: Multiple robots in a warehouse.
  - **Model-Based RL**: Learns a model of the environment (how states change with actions) and uses it to plan the best actions.
    - **Example**: A chess-playing AI that simulates possible moves and their outcomes before making a decision.
    - **Algorithms**:
      - **Monte Carlo Tree Search (MCTS)**
      - **Dyna**
  - **Model-Free RL**: Learns directly from experience without a model.
    - **Example**: Teaching a robot to walk by trying different movements and learning from success or failure without knowing physics equations.
    - **Types**:
      - **Policy Gradient** (learn the policy directly):
        - **REINFORCE**
      - **Learn value functions**:
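        - **Q-Learning**: Off-policy. Learns action values from the best action available in the next state, regardless of which action the agent actually takes.
        - **SARSA**: On-policy. Learns action values from the action the current policy actually takes in the next state.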
        - **Deep Q-Networks (DQN)**: Use neural networks to approximate Q-values for complex states.
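
The difference between Q-Learning and SARSA is easiest to see in their update targets. The following is a minimal tabular sketch, not a reference implementation. The function names, the dictionary-based table `Q`, the learning rate `alpha`, and the discount factor `gamma` are assumptions introduced here for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical two-action example: the agent took "left" in s1,
# although "right" has the higher estimated value there.
Q_ql = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 2.0})
Q_sa = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 2.0})
q_learning_update(Q_ql, "s0", "left", 0.0, "s1", ["left", "right"], alpha=0.5, gamma=1.0)
sarsa_update(Q_sa, "s0", "left", 0.0, "s1", "left", alpha=0.5, gamma=1.0)
```

In this example Q-Learning moves `Q[("s0", "left")]` toward 2.0 (the best next value), while SARSA moves it toward 1.0 (the value of the action that was actually taken).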


---



- **Initialization of the bandit estimates $Q(a)$**, where $Q(a) = \mathbb{E}_{r \sim p(r|a)}[r]$ is the expected reward of action $a$:
    - **Realistic Initialization**: $Q(a) = 0$. Start by guessing zero expected reward until the algorithm updates the estimate from observed rewards.
    - **Optimistic Initialization**: $Q(a) = \psi$ with $\psi > 0$. Set the initial estimates high to encourage exploration.
- **Objective function**: Maximize the expected cumulative reward over a horizon of $T$ steps under a policy $\pi$.
    - $J_T(\pi) = \mathbb{E}_{a_t \sim \pi, r_t \sim p(r|a_t)}\left[\sum_{t=1}^{T} r_t\right]$
    - This formula is a general way to measure how good a policy is.
    - **Policies in Bandit Problems**: 
        - **Greedy Policy**: Always picks the action with the highest estimated reward, $a = \arg\max_{b} Q(b)$.
        - **Epsilon-Greedy Policy**: Mostly greedy, but sometimes explores random actions.
        - **Softmax Policy**: Chooses actions based on their estimated rewards, with a temperature parameter controlling exploration vs. exploitation.
        - **Thompson Sampling**: Samples actions based on their probability of being optimal, balancing exploration and exploitation. 
- **Update the mean reward**:
    - **Incremental Mean Update**: 
      - $Q_{n} = Q_{n-1} + \frac{1}{n} [r_{n} - Q_{n-1}]$
      - Where $n = n(a)$ is the number of times action $a$ has been selected, and $r_n$ is the reward from its $n$-th selection.
    - **Learning update** (Exponential Moving Average): 
      - $Q_n \leftarrow Q_{n-1} + \alpha \left[ r_n - Q_{n-1} \right]$
      - Move the estimate a fraction $\alpha$ of the way toward the last observed reward, where $\alpha$ is the learning rate ($0 < \alpha < 1$). Unlike the $\frac{1}{n}$ step, a constant $\alpha$ weights recent rewards more heavily.
- **Epsilon-Greedy Policy**:
    - **How it works**:
      - With probability $\epsilon$, choose a random action.
      - With probability $1 - \epsilon$, choose the action with the highest estimated reward.
      - Epsilon-greedy exploration is simple, but its exploratory steps pick uniformly at random, so clearly bad actions are tried as often as promising ones.
        - **Small** $\epsilon$: the policy mostly exploits the best-known action.
        - **Large** $\epsilon$: the policy explores more.
- **Softmax (Boltzmann) Policy**:
    - **How it works**:
      - Assigns probabilities to actions based on their estimated rewards.
      - Higher estimated rewards lead to higher probabilities of being chosen. The top action is therefore not always selected: actions ranked second or third still have a fairly high probability and are picked from time to time.
      - The temperature parameter ($\tau$) controls exploration vs. exploitation:
        - **High $\tau$**: More exploration (actions are chosen more uniformly).
        - **Low $\tau$**: More exploitation (actions with higher rewards are chosen more frequently).
- **Upper Confidence Bound (UCB) Policy**:
    - Picks the action with the highest value of: $Q(a) + c \times \sqrt{\frac{\ln t}{n(a)}}$
    - Where:
      - $Q(a)$ is the estimated value of action $a$.
      - $c$ is a constant that controls exploration.
        - **Higher $c$**: More exploration (more weight on uncertainty).
        - **Lower $c$**: More exploitation (focus on known rewards).
      - $t$ is the total number of actions taken.
      - $n(a)$ is the number of times action $a$ has been selected.
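
The epsilon-greedy policy and the incremental mean update above can be combined into a short simulation. This is a minimal sketch under stated assumptions: rewards are Bernoulli (1 with probability `true_means[a]`, else 0), and the function name, the `true_means` argument, and the default values are hypothetical choices made for illustration. Setting `q_init` above zero gives optimistic initialization.

```python
import random


def run_epsilon_greedy(true_means, steps=2000, epsilon=0.1, q_init=0.0, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli bandit (assumed reward model)."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [q_init] * k   # estimated mean reward per action
    n = [0] * k        # n(a): how often each action was selected
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore: uniform random action
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit: highest estimate
        r = 1.0 if rng.random() < true_means[a] else 0.0
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]                  # incremental mean update
        total += r
    return Q, n, total


Q, n, total = run_epsilon_greedy([0.2, 0.8])
print("estimates:", Q, "counts:", n, "total reward:", total)
```

With a small `epsilon`, the counts should concentrate on the better action once its estimate overtakes the others, while the remaining actions still receive an `epsilon / k` share of the exploratory steps.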


---

## **Bandit algorithms**

### Upper Confidence Bound (UCB) Policy

* Picks the action with the highest value of: $Q(a) + c \times \sqrt{\frac{\ln t}{n(a)}}$
* Here:

  * $Q(a)$: estimated reward of action $a$.
  * $n(a)$: how many times action $a$ was tried.
  * $t$: current time step.
  * $c$: controls how much to explore.
* It prefers actions with high rewards but also gives a bonus to actions tried less often, encouraging exploration of new or less-tried actions.
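
The selection rule can be sketched as a single function. This is an illustrative sketch, not a canonical implementation: the name `ucb_select` and the default `c=2.0` are assumptions. Trying every untried action once before applying the formula is a common convention used here to avoid dividing by zero when $n(a) = 0$.

```python
import math


def ucb_select(Q, n, t, c=2.0):
    """Return the index of the action with the highest UCB score.

    Q: estimated rewards, n: selection counts, t: current time step (t >= 1).
    """
    # Assumed convention: try every untried action once first.
    for a, count in enumerate(n):
        if count == 0:
            return a
    scores = [Q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: scores[a])


# Action 1 has the higher estimate, but action 0 was tried only once,
# so its exploration bonus dominates.
choice = ucb_select(Q=[0.5, 0.6], n=[1, 100], t=101, c=2.0)
```

Setting `c=0` removes the bonus and reduces the rule to the greedy policy.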

---

### Hyperparameters and their impact

* $\epsilon$ in epsilon-greedy: controls how much you explore randomly.
* Optimistic Initialization: starts with high $Q(a)$ values to encourage trying all actions early.
* $c$ in UCB: controls how strongly you explore actions you haven’t tried much.

---

### Contextual Bandit and MDP

* **Contextual bandit:** reward depends on some extra information (context).
* When actions also change the state (context), it becomes a **Markov Decision Process (MDP)**.
* MDPs are the base of reinforcement learning, which deals with balancing exploration and exploitation in changing situations.

---

### **Softmax (Boltzmann) Policy**:
  - Uses a probabilistic approach where actions with higher $ Q(a) $ values are more likely to be chosen.
  - **Mathematically**:
    $$
    \pi_{\text{softmax}}(a) = \frac{\exp(Q(a)/\tau)}{\sum_{b \in A} \exp(Q(b)/\tau)}
    $$
    Where $ \tau $ (temperature parameter) controls the level of exploration. A high $ \tau $ encourages exploration (more randomness), while a low $ \tau $ encourages exploitation of the best-known actions.
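
The softmax formula translates almost directly into code. The sketch below is one possible implementation with hypothetical function names. Subtracting the maximum estimate before exponentiating is a numerical-stability detail added here; it does not change the resulting probabilities.

```python
import math
import random


def softmax_probs(Q, tau=1.0):
    """Action probabilities proportional to exp(Q(a) / tau)."""
    m = max(Q)  # shift by the max to avoid overflow in exp
    exps = [math.exp((q - m) / tau) for q in Q]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_select(Q, tau=1.0, rng=random):
    """Sample one action index according to the softmax probabilities."""
    return rng.choices(range(len(Q)), weights=softmax_probs(Q, tau), k=1)[0]


Q = [1.0, 2.0, 3.0]
print(softmax_probs(Q, tau=1.0))     # graded preference for higher estimates
print(softmax_probs(Q, tau=1000.0))  # high temperature: close to uniform
print(softmax_probs(Q, tau=0.01))    # low temperature: close to greedy
```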

### **Upper Confidence Bound (UCB) Policy** (Exploration approach):
  - **Mathematically**:
    $$
    \pi_{\text{UCB}}(a) = \begin{cases} 
    1, & \text{if } a = \arg\max_{b} \left[ Q(b) + c \cdot \sqrt{\frac{\ln t}{n(b)}} \right] \\
    0, & \text{otherwise}
    \end{cases}
    $$
    Here, $ n(a) $ is the number of times action $ a $ has been taken, $ t $ is the current timestep, and $ c $ is a constant controlling exploration.
    
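    The exploration bonus $c \cdot \sqrt{\ln t / n(a)}$ is largest for actions with small $n(a)$, so the UCB formula **ensures rarely tried actions are explored more**. The bonus is undefined when $n(a) = 0$, so each action is typically tried once before the formula is applied.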
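
A small worked example helps make the formula concrete. The numbers below are made up for illustration (two actions, $t = 100$, $c = 1$) and do not come from the notes above:

$$
\begin{aligned}
\text{Action 1: } & Q = 0.6,\ n = 90 &&\Rightarrow\ 0.6 + \sqrt{\tfrac{\ln 100}{90}} \approx 0.6 + 0.226 = 0.826 \\
\text{Action 2: } & Q = 0.5,\ n = 10 &&\Rightarrow\ 0.5 + \sqrt{\tfrac{\ln 100}{10}} \approx 0.5 + 0.679 = 1.179
\end{aligned}
$$

UCB picks action 2 even though its estimate is lower, because it has been tried far less often and its estimate is therefore less certain.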

![](../../Files/fourth-semester/rl/6.png)

---
### Hyperparameters

- **$ \epsilon $-Greedy**: $ \epsilon $ controls exploration; higher $ \epsilon $ increases exploration.
- **Optimistic Initialization**: The **initial value** of $Q(a)$ encourages exploration: starting every estimate high makes each action look worth trying until real rewards pull its estimate down.
- **UCB**: The **$ c $** parameter scales exploration; higher $ c $ increases exploration.
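
The effect of the initial value can be seen with a purely greedy agent. The sketch below makes a simplifying assumption that is not in the notes: rewards are deterministic, so each action always returns the same value. The function name `run_greedy` and the reward values are hypothetical.

```python
def run_greedy(rewards, q_init, steps=10):
    """Greedy agent with incremental mean updates on deterministic rewards."""
    k = len(rewards)
    Q = [q_init] * k
    n = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])  # ties go to the lowest index
        n[a] += 1
        Q[a] += (rewards[a] - Q[a]) / n[a]
    return Q, n


# Realistic initialization: the agent locks onto the first action it tries.
_, counts_zero = run_greedy([0.2, 0.8], q_init=0.0)
print(counts_zero)        # [10, 0]

# Optimistic initialization: both actions get tried, then the better one wins.
_, counts_optimistic = run_greedy([0.2, 0.8], q_init=1.0)
print(counts_optimistic)  # [1, 9]
```

With $Q(a) = 0$ the first reward of 0.2 already beats the untouched estimate of the other action, so the better action is never discovered. With $Q(a) = 1$ the first action's estimate drops below the optimistic value of the second, which forces the agent to try it.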

---

#### **3. Contextual Bandit**

- Often the reward distribution of the bandit you face depends on context.

- When the state also changes based on our action, we call it a Markov Decision Process (MDP).

- MDPs are the foundation of reinforcement learning, where the goal is to balance exploration and exploitation in dynamic environments.
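
One simple way to handle context is to keep a separate set of estimates per context. The class below is an illustrative sketch that assumes a small number of discrete, hashable contexts; the class name, the method names, and the example contexts are all hypothetical. Real systems often use a model that generalizes across contexts instead of a lookup table.

```python
import random
from collections import defaultdict


class ContextualEpsilonGreedy:
    """Epsilon-greedy bandit with one table of estimates per context."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.Q = defaultdict(lambda: [0.0] * n_actions)  # Q[context][action]
        self.n = defaultdict(lambda: [0] * n_actions)    # n[context][action]

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)    # explore
        q = self.Q[context]
        return max(range(self.n_actions), key=lambda a: q[a])  # exploit

    def update(self, context, action, reward):
        self.n[context][action] += 1
        step = 1.0 / self.n[context][action]
        self.Q[context][action] += step * (reward - self.Q[context][action])


agent = ContextualEpsilonGreedy(n_actions=2, epsilon=0.0)
agent.update("sci-fi fan", action=1, reward=1.0)
agent.update("comedy fan", action=0, reward=1.0)
```

After these two updates the agent prefers action 1 for the first context and action 0 for the second. The chosen action never affects which context appears next, which is what separates this setting from an MDP.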