#### **Sequential Decision Making**
- Many AI problems are sequential (e.g., shortest-path problems).



### **Markov Decision Process (MDP)**
| **Component**              | **Symbol**   | **Description**                     |
|----------------------------|--------------|-------------------------------------|
| State Space                | S            | Possible observations.              |
| Action Space               | A            | Possible actions.                   |
| Transition Function        | $T(s'\|s,a)$ | Environment's response.             |
| Reward Function            | $R(s,a,s')$  | Transition's desirability.          |
| Discount Factor            | $\gamma$     | Weighting future rewards.           |
| Initial State Distribution | p₀(s)        | Starting probabilities.             |
| Policy                     | $π(a\|s)$     | Strategy for selecting actions.     |
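
For a small, finite MDP the components in the table can be written down as plain arrays. The following is a minimal sketch, assuming a made-up two-state, two-action problem; the array names (`T`, `R`, `p0`, `pi`) and all numbers are illustrative and not part of the original notes.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (0 = "stay", 1 = "move").
n_states, n_actions = 2, 2
gamma = 0.9                                   # discount factor

# T[s, a, s2] = probability of landing in s2 after taking action a in state s.
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 0] = 1.0   # state 0, "stay" -> state 0
T[0, 1, 1] = 1.0   # state 0, "move" -> state 1
T[1, 0, 1] = 1.0   # state 1, "stay" -> state 1
T[1, 1, 0] = 1.0   # state 1, "move" -> state 0

# R[s, a, s2] = reward for that transition; here 1 for landing in state 1.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0

p0 = np.array([1.0, 0.0])                     # always start in state 0
pi = np.full((n_states, n_actions), 0.5)      # uniform stochastic policy pi[s, a]

# Sanity checks: every probability distribution must sum to 1.
assert np.allclose(T.sum(axis=2), 1.0)
assert np.isclose(p0.sum(), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
```

Storing $T$ and $R$ as dense `(|S|, |A|, |S|)` arrays is only practical for small problems, but it keeps the later formulas close to their mathematical form.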

### **Dynamic Programming (DP)**
- **Goal**: Solve for optimal value function and policy.
- **Approaches**:
  - **Policy Iteration**:
    1. Evaluate policy until convergence.
    2. Improve policy.
  - **Value Iteration**:
    1. Partial policy evaluation (1 sweep).
    2. Policy improvement.
    3. Repeat until convergence.

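The policy tells the agent (for example, a robot moving on a grid) which action to take in each state.
A deterministic policy maps each state to exactly one action, so the robot always makes the same choice in the same state (e.g., right → down from the start).
A stochastic policy assigns a probability to each action, so the robot may act differently on repeated visits to the same state.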

### Formula:

$$
V^\pi(s) = \mathbb{E}_{\pi, T} \left[ \sum_{i=0}^\infty \gamma^i \cdot r_{t+i} \mid s_t = s \right]
$$

### Explanation 
- **$ V^\pi(s) $**: Expected cumulative reward starting from state $ s $ while following policy $ \pi $.
- **$ \mathbb{E}_{\pi, T} $**: Expectation over trajectories based on policy $ \pi $ and transition $ T $.
- **$ \sum_{i=0}^\infty \gamma^i \cdot r_{t+i} $**: Sum of discounted rewards:
  - $ \gamma $: Discount factor ($ 0 \leq \gamma \leq 1 $); with $ \gamma < 1 $ and bounded rewards the infinite sum is finite.
  - $ r_{t+i} $: Reward at time step $ t+i $.
- The calculation starts at $ s_t = s $.
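
For a single finite trajectory, the discounted sum inside the expectation can be computed directly. This is a small sketch with a made-up reward sequence; the helper name `discounted_return` is an assumption for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]                 # hypothetical r_t, r_{t+1}, r_{t+2}
g = discounted_return(rewards, gamma=0.5)
print(g)                                  # 1 + 0.5*0 + 0.25*2 = 1.5
```

$V^\pi(s)$ is the average of this quantity over all trajectories that start in $s$ and follow $\pi$.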

### State and State-Action Values in MDPs

1. **State Value ($ V(s) $)**:
   - **Definition**: The expected cumulative reward starting from state $ s $, following policy $ \pi $.
   - **Formula**:
     $$
     V^\pi(s) = \mathbb{E}_{\pi, T} \left[ \sum_{i=0}^\infty \gamma^i \cdot r_{t+i} \mid s_t = s \right]
     $$
   - **Representation**:
     - Stored as a **vector** of size $|S|$, where each entry corresponds to a state.
     - Example:
       $$
       V(s) = 
       \begin{bmatrix}
       V(1) = 3, \\
       V(2) = -4, \\
       V(3) = 9, \dots
       \end{bmatrix}
       $$

2. **State-Action Value ($ Q(s,a) $)**:
   - **Definition**: The expected cumulative reward starting from state $ s $, taking action $ a $, and then following policy $ \pi $.
   - **Formula**:
     $$
     Q^\pi(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s,a)} \left[ R(s,a,s') + \gamma \cdot V^\pi(s') \right]
     $$
   - **Representation**:
     - Stored as a **matrix** of size $|S| \times |A|$, where each entry corresponds to a state-action pair.
     - Example:
       $$
       Q(s,a) = 
       \begin{bmatrix}
       Q(1, \text{up}) = 5 & Q(1, \text{down}) = 3 & \dots \\
       Q(2, \text{up}) = 9 & Q(2, \text{down}) = 4 & \dots
       \end{bmatrix}
       $$

### **Optimal Value and Policy:**

1. **Optimal Value Function ($ V^*(s) $)**:
   - The maximum value achievable in state $ s $ under any policy.
   - $ V^*(s) = \max_{\pi} V^\pi(s) $.

2. **Optimal Policy ($ \pi^* $)**:
   - The policy that achieves $ V^*(s) $.
   - $ \pi^*(s) = \arg\max_a Q^*(s, a) $, where $ Q^*(s, a) $ is the optimal state-action value.

---

### **Bellman Equation**
- A **recursive formula** for the value function $ V(s) $ or state-action value $ Q(s, a) $.
- It relates the value of a state to the values of its possible next states.



### **State Value Function $ V(s) $**
- **Formula**:
  $$
  V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} T(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]
  $$
- **Explanation**:
  - $ \pi(a|s) $: Probability of taking action $ a $ in state $ s $.
  - $ T(s'|s,a) $: Probability of transitioning to $ s' $ from $ s $ by taking $ a $.
  - $ R(s,a,s') $: Reward for transitioning from $ s $ to $ s' $ via $ a $.
  - $ \gamma $: Discount factor.
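
Because the equation above is linear in $V^\pi$, it can be solved exactly for a small MDP as $V^\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $P_\pi$ and $r_\pi$ are the policy-averaged transition matrix and one-step reward. A minimal sketch, assuming a hypothetical two-state MDP and a uniform policy (all names and numbers are illustrative):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9
pi = np.full((2, 2), 0.5)                      # uniform policy pi[s, a]

# P_pi[s, s2] = sum_a pi(a|s) T(s2|s,a)
P_pi = np.einsum("sa,sat->st", pi, T)
# r_pi[s] = sum_a pi(a|s) sum_s2 T(s2|s,a) R(s,a,s2)
r_pi = np.einsum("sa,sat,sat->s", pi, T, R)

# Solve (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)                                       # approximately [5. 5.]
```

Here each state earns an expected reward of 0.5 per step, so $V^\pi(s) = 0.5 / (1 - 0.9) = 5$ for both states.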



### **State-Action Value Function $ Q(s, a) $**
- **Formula**:
  $$
  Q^\pi(s, a) = \sum_{s' \in S} T(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a' \in A} \pi(a'|s') Q^\pi(s',a') \right]
  $$
- **Explanation**:
  - Combines the immediate reward $ R(s,a,s') $ and the discounted future value $ Q(s',a') $.


### **Backup Diagrams**

![Graph](../../Files/third-semester/sai/3.png)

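- $ V(s) $ and $ Q(s, a) $ are interconvertible: $ V^\pi(s) = \sum_{a} \pi(a|s)\, Q^\pi(s,a) $ and $ Q^\pi(s,a) = \sum_{s'} T(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right] $.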
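
The two conversions between $V^\pi$ and $Q^\pi$ are one line each. The sketch below assumes a hypothetical two-state MDP with a uniform policy, for which $V^\pi = [5, 5]$; the function names `q_from_v` and `v_from_q` are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9
pi = np.full((2, 2), 0.5)                 # uniform policy pi[s, a]

def q_from_v(V, T, R, gamma):
    # Q[s, a] = sum_s2 T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
    return np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])

def v_from_q(Q, pi):
    # V[s] = sum_a pi[s, a] * Q[s, a]
    return (pi * Q).sum(axis=1)

V_pi = np.array([5.0, 5.0])               # value of the uniform policy in this MDP
Q_pi = q_from_v(V_pi, T, R, gamma)        # approximately [[4.5, 5.5], [5.5, 4.5]]
V_back = v_from_q(Q_pi, pi)               # recovers approximately [5, 5]
```

Note that going from $V$ to $Q$ needs the model ($T$ and $R$), while going from $Q$ to $V$ only needs the policy.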

---

# Dynamic Programming
#### Key idea:
- Break a large problem into smaller subproblems.
- Efficiently store and reuse intermediate results.
- Repeatedly solving the small subproblem solves the overall problem.

In the context of MDPs, dynamic programming is the central approach for computing the **optimal policy** when the transition and reward functions are known.

![DP](../../Files/third-semester/sai/4.png)
![DP](../../Files/third-semester/sai/5.png)

### **Policy Evaluation**
![ DP](../../Files/third-semester/sai/6.png)

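The **optimal policy** is greedy with respect to the optimal value function: in every state it selects the action that maximizes the expected immediate reward plus the discounted value of the next state (the right-hand side of the Bellman equation).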
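
Policy evaluation can be sketched as repeated sweeps of the Bellman equation for a fixed policy until the values stop changing. This is a minimal sketch under assumptions: the two-state MDP, the tolerance `tol`, and the function name `policy_evaluation` are all illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9
pi = np.full((2, 2), 0.5)                 # uniform policy pi[s, a]

def policy_evaluation(pi, T, R, gamma, tol=1e-10):
    V = np.zeros(T.shape[0])
    while True:
        # One sweep: Q[s, a] = sum_s2 T * (R + gamma * V[s2]), then average over pi.
        Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # largest change in this sweep
            return V_new
        V = V_new

V_pi = policy_evaluation(pi, T, R, gamma)
print(V_pi)                               # approximately [5. 5.]
```

Each sweep shrinks the error by a factor of at most $\gamma$, which is why the loop terminates for $\gamma < 1$.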

### **Dynamic Programming Notes**

---

#### **Two Approaches**
1. **Policy Iteration**:
   - Step 1: Policy Evaluation (until convergence).
   - Step 2: Policy Improvement.

2. **Value Iteration**:
   - Step 1: Policy Evaluation (1 cycle per iteration).
   - Step 2: Policy Improvement.
   - Both steps alternate until convergence.
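
The policy iteration loop can be sketched as follows. This is one possible implementation under assumptions: the two-state MDP is made up, the policy is deterministic (one action index per state), and the evaluation step is solved exactly as a linear system rather than by repeated sweeps.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9

def policy_iteration(T, R, gamma):
    n_states = T.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)        # start with "stay" everywhere
    while True:
        # Step 1: policy evaluation, solved exactly: (I - gamma * P_pi) V = r_pi.
        P_pi = T[idx, policy]                     # P_pi[s, s2]
        r_pi = (P_pi * R[idx, policy]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Step 2: policy improvement, greedy with respect to Q.
        Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # stable policy: done
            return policy, V
        policy = new_policy

policy, V = policy_iteration(T, R, gamma)
print(policy, V)                                  # [1 0], approximately [10. 10.]
```

In this toy problem the loop stops after two evaluations: the best behavior is to move to state 1 and then stay there.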

---

#### **Value Iteration Algorithm**
- Combines **policy evaluation** and **improvement** in one equation.
- **Loop until convergence**:
  1. For each state $ s $:
     $$
     V(s) = \max_a \sum_{s'} T(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]
     $$
- Returns the **optimal value function** $ V^*(s) $.
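
The update above translates almost directly into code. A minimal sketch, assuming a made-up two-state MDP and a stopping tolerance `tol` (both illustrative, not from the notes):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9

def value_iteration(T, R, gamma, tol=1e-10):
    V = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = sum_s2 T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                     # the max over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = value_iteration(T, R, gamma)
print(V_star)                                     # approximately [10. 10.]
```

This version updates all states from the previous sweep's values; updating `V` in place, state by state, is a common variant.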

---

#### **Q-Value Iteration**
- Similar to value iteration but uses **state-action values**.
- **Algorithm**:
  1. Loop until convergence:
     $$
     Q(s,a) = \sum_{s'} T(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \right]
     $$
  2. Returns $ Q^*(s,a) $: optimal state-action values.
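
The same loop works on a state-action table. A minimal sketch, again assuming a made-up two-state MDP and an illustrative tolerance `tol`:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9

def q_value_iteration(T, R, gamma, tol=1e-10):
    Q = np.zeros(T.shape[:2])                     # one entry per (state, action)
    while True:
        best_next = Q.max(axis=1)                 # max_a' Q(s', a') for every s'
        Q_new = np.einsum("sat,sat->sa", T, R + gamma * best_next[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

Q_star = q_value_iteration(T, R, gamma)
policy = Q_star.argmax(axis=1)                    # greedy policy, no model needed
print(Q_star, policy)                             # approx. [[9, 10], [10, 9]] and [1 0]
```

Once $Q^*$ is available, the greedy policy is a plain `argmax` over each row and no longer needs $T$ or $R$.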

---

#### **Implicit Policies**
- Policies derived from value tables instead of being explicitly stored.
- **Example**:
  - Value Table:
    $$
    s: \{1, 2, 3\}, \quad V(s): \{12, 24, 33\}
    $$
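  - The greedy policy picks, in each state, the action with the best one-step lookahead: $ \pi(s) = \arg\max_a \sum_{s'} T(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right] $.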
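
A value table alone does not name any actions, so reading a policy off it requires a one-step lookahead through the model. The sketch below assumes a made-up two-state MDP whose optimal values are $V^* = [10, 10]$; the helper name `greedy_policy` is illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: action 0 = "stay", action 1 = "move";
# landing in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
gamma = 0.9

def greedy_policy(V, T, R, gamma):
    # One-step lookahead: score each action by expected reward plus discounted next value.
    Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)

V_star = np.array([10.0, 10.0])           # optimal values of this toy MDP
policy = greedy_policy(V_star, T, R, gamma)
print(policy)                             # [1 0]: move in state 0, stay in state 1
```

Both states have the same value here, yet the greedy actions differ, which shows that the policy comes from the lookahead and not from comparing table entries directly.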

---

### **Comparison: Policy Iteration vs Value Iteration**
| **Aspect**              | **Policy Iteration**               | **Value Iteration**                |
|-------------------------|------------------------------------|------------------------------------|
| **Policy Evaluation**   | Full convergence per iteration.   | One cycle per iteration.          |
| **Efficiency**          | Few iterations, but each one is expensive. | Cheap iterations, but more of them are needed. |
| **Complexity**          | Higher computational cost per iteration.   | Lower computational cost per iteration.        |

---

#### **Key Challenges**
1. **Curse of Dimensionality**:
   - The number of states grows exponentially with the number of variables.
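   - Example: Tic-tac-toe (3x3 board), where each cell is empty, X, or O:
     - At most $ 3^9 = 19{,}683 $ board configurations (not all are reachable in play).
   - 4x4 board:
     - At most $ 3^{16} = 43{,}046{,}721 $ configurations.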
   - High memory and computational requirements.
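
The growth is easy to reproduce. A small sketch, where the helper name `board_configurations` is an assumption and the count is the simple upper bound of three symbols per cell:

```python
def board_configurations(n, symbols=3):
    """Upper bound on configurations of an n x n board with `symbols` values per cell."""
    return symbols ** (n * n)

for n in (3, 4, 5):
    print(n, board_configurations(n))
# 3 19683
# 4 43046721
# 5 847288609443
```

A tabular value function needs one entry per state, so each extra row and column multiplies the table size by several orders of magnitude.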

---

### **Summary**
1. **Markov Decision Process (MDP)**:
   - Framework for sequential tasks.
   - Handles stochastic dynamics and multiple goals.
2. **Bellman Equation**:
   - Recursive relation between state/state-action values.
   - Basis for MDP algorithms.
3. **Dynamic Programming**:
   - Policy Iteration: Alternates between evaluation and improvement.
   - Value Iteration: Combines evaluation and improvement in one step.
4. **Applications**:
   - Key principles in AI for search, planning, and reinforcement learning.


### **Differences Between Policy Iteration, Value Iteration, and Q-Value Iteration**


| **Aspect**               | **Policy Iteration**                                          | **Value Iteration**                                              | **Q-Value Iteration**                                          |
|--------------------------|-------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------|
| **Goal**                 | Find the **optimal policy** $ \pi^* $.                    | Find the **optimal value function** $ V^*(s) $.                | Find the **optimal state-action value** $ Q^*(s, a) $.      |
| **Representation**       | Explicitly maintains and updates a policy $ \pi(a\|s) $.   | Implicitly derives the policy from $ V(s) $.                  | Policy is derived implicitly from $ Q(s, a) $.              |
| **Evaluation Step**      | Performs **full policy evaluation** (until convergence).    | Performs **partial policy evaluation** (1 iteration per cycle). | Updates $ Q(s,a) $ for each state-action pair in one step.  |
| **Improvement Step**     | Improves the policy after full evaluation.                  | Combines policy evaluation and improvement in one equation.     | Directly updates $ Q(s,a) $ using Bellman optimality.       |
| **Formula Used**         | Bellman expectation equation for $ V^\pi(s) $.            | Bellman optimality equation for $ V(s) $.                     | Bellman optimality equation for $ Q(s,a) $.                 |
| **Algorithm Complexity** | Higher computational cost per iteration due to full evaluation. | Cheaper per iteration (single sweep), but typically needs more iterations. | Similar to Value Iteration, but stores one entry per state-action pair. |
| **Output**               | Optimal policy $ \pi^*(a\|s) $.                            | Optimal value function $ V^*(s) $.                            | Optimal state-action value function $ Q^*(s,a) $.           |
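| **Usage**                | Small to medium state spaces where full evaluation is affordable. | Larger state spaces where full evaluation per iteration is too costly. | When the greedy policy should be read off directly via $ \arg\max_a Q(s,a) $, without a lookahead through $ T $. |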

---

### **Key Points from Lecture Notes: Discrete Markov Decision Processes and Utility**

---

#### **1. Preliminaries**
- **Sets**: Discrete (e.g., {up, down}) or continuous (e.g., $[0, 1]$).
- **Functions**: Map from domain $X$ to co-domain $Y$ ($f: X \to Y$).
- **Probability**: 
  - Probability distribution $p(X)$ sums to 1.
  - Conditional probability: $p(X|Y)$.
- **Expectation**: Weighted average of outcomes:
  $$
  E[f(X)] = \sum_x p(x) \cdot f(x)
  $$
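
As a quick numeric check of the expectation formula, here is a sketch with a made-up three-outcome distribution and $f(x) = x^2$:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])     # hypothetical outcomes
p = np.array([0.2, 0.5, 0.3])     # their probabilities (must sum to 1)

def f(x):
    return x ** 2

expectation = np.sum(p * f(x))    # 0.2*0 + 0.5*1 + 0.3*4 = 1.7
print(expectation)
```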

---

#### **2. Markov Decision Processes (MDPs)**
- **Components**:
  1. **State space** ($S$): Set of possible observations.
  2. **Action space** ($A$): Set of possible actions.
  3. **Transition function** ($T(s'|s, a)$): Probability of reaching $s'$ from $s$ using $a$.
  4. **Reward function** ($R(s, a, s')$): Reward for transitioning $s \to s'$.
  5. **Discount factor** ($\gamma$): Weight for future rewards ($\gamma \in [0, 1]$).
  6. **Initial state distribution** ($p_0(s)$): Starting probabilities.
- **Goals**:
  - Maximize cumulative rewards (utility).
  - Handle stochastic environments and multiple objectives.

---

#### **3. Policies**
- **Policy** ($\pi(a|s)$): Specifies action probabilities in each state.
- **Deterministic Policy**: Always selects a single action for each state ($\pi(s) = a$).
- **Greedy Policy**: Selects the action with the highest estimated value ($\pi(s) = \arg\max_a Q(s, a)$).
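
The three kinds of policy can be represented with small arrays. This is a sketch with made-up numbers; the names `pi`, `pi_det`, and `pi_greedy` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded only to make the sketch reproducible

# Stochastic policy: pi[s, a] is the probability of action a in state s.
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# Deterministic policy: one action index per state.
pi_det = np.array([1, 0])

# Greedy policy: pick the highest-valued action from a (made-up) Q-table.
Q = np.array([[9.0, 10.0],
              [10.0, 9.0]])
pi_greedy = Q.argmax(axis=1)      # [1, 0]

state = 0
sampled_action = rng.choice(pi.shape[1], p=pi[state])   # 0 or 1, each with probability 0.5
fixed_action = pi_det[state]                            # always 1
```

A deterministic policy is the special case of a stochastic one in which every row of `pi` puts probability 1 on a single action.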

---

#### **4. Cumulative Reward and Value**
- **Cumulative Reward (Return)**:
  $$
  G_t = \sum_{i=0}^\infty \gamma^i \cdot r_{t+i}
  $$
- **Value Function**:
  - **State Value ($V^\pi(s)$)**: Expected return starting from $s$ under policy $\pi$:
    $$
    V^\pi(s) = E[G_t | s_t = s, \pi]
    $$
  - **State-Action Value ($Q^\pi(s, a)$)**: Expected return starting from $s$, taking $a$, and then following $\pi$:
    $$
    Q^\pi(s, a) = E[G_t | s_t = s, a_t = a, \pi]
    $$

---

#### **5. Optimal Policy and Value Function**
- **Optimal Value Function ($V^*(s)$)**: Maximum possible value:
  $$
  V^*(s) = \max_\pi V^\pi(s)
  $$
- **Optimal Policy ($\pi^*$)**: Always greedy with respect to $V^*(s)$ or $Q^*(s, a)$:
  $$
  \pi^*(s) = \arg\max_a Q^*(s, a)
  $$

---

#### **6. Bellman Optimality Equations**
- **State Value ($V^*(s)$)**:
  $$
  V^*(s) = \max_a \sum_{s'} T(s'|s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]
  $$
- **State-Action Value ($Q^*(s, a)$)**:
  $$
  Q^*(s, a) = \sum_{s'} T(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]
  $$
- Relation:
  $$
  V^*(s) = \max_a Q^*(s, a)
  $$

---

#### **7. Dynamic Programming (DP)**
- **Core Idea**: Iteratively improve policy and value functions.
- **Two Main Approaches**:
  1. **Policy Iteration**:
     - Alternate between:
       - Policy evaluation: Compute $V^\pi(s)$ for a given $\pi$.
       - Policy improvement: Update $\pi(s) = \arg\max_a Q^\pi(s, a)$.
     - Repeat until convergence.
  2. **Value Iteration**:
     - Update $V(s)$ directly using Bellman optimality equation:
       $$
       V(s) = \max_a \sum_{s'} T(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]
       $$
     - Derive policy from $V(s)$ after convergence.

---

#### **8. Algorithms**
1. **Policy Iteration**:
   - Evaluate $V(s)$ for $\pi$ until convergence.
   - Improve $\pi(s)$ based on $Q(s, a)$.
2. **Value Iteration**:
   - Directly update $V(s)$ for each state until convergence.
   - Derive policy after $V(s)$ converges.
3. **Q-Value Iteration**:
   - Directly update $Q(s, a)$ for state-action pairs:
     $$
     Q(s, a) = \sum_{s'} T(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right]
     $$

---

#### **9. Practical Considerations**
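- **Dynamic Programming Strengths**:
  - Guaranteed to find the optimal solution for a finite MDP with known $T$ and $R$.
  - No heuristic required.
- **Weaknesses**:
  - High memory requirements: $O(|S|)$ for a $V$ table or $O(|S| \times |A|)$ for a $Q$ table, plus the model itself.
  - Inefficient for large state spaces, since every sweep visits all states.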

---

#### **10. Key Insights for Exam**
- Understand MDP components (state space, action space, transition, reward).
- Derive and use Bellman equations for $V(s)$ and $Q(s, a)$.
- Differentiate between Policy Iteration, Value Iteration, and Q-Value Iteration.
- Understand dynamic programming workflow and greedy policies.
- Compute values using cumulative rewards and expectations.