# **⭐Model Based Learning**

## **✅1. Markov Decision Processes (MDPs)**

### 1.1 What is an MDP?

**Definition:** A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision-maker (the agent). It is foundational to reinforcement learning and dynamic programming.

**Purpose:** An MDP gives a formal description of such an environment (its states, actions, dynamics, and rewards), so that the agent's objective and the algorithms that solve it can be stated precisely.

**Key Components of an MDP:**
- **States (S):** All possible situations the agent can be in
- **Actions (A):** All possible moves the agent can make
- **Transition Probabilities (P):** Likelihood of moving from one state to another given an action
- **Rewards (R):** Immediate feedback received after taking an action
- **Discount Factor (γ):** Weight given to future rewards (0 ≤ γ ≤ 1)
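
To make the five components concrete, here is a minimal sketch of a made-up two-state MDP written as plain Python data. The state names, action names, probabilities, and rewards are illustrative assumptions, not taken from any library or from the environments used later.

```python
# A hypothetical two-state MDP written as plain Python data (illustrative values only).
states = ["cool", "hot"]
actions = ["slow", "fast"]
gamma = 0.9  # discount factor

# P[state][action] -> list of (probability, next_state, reward)
P = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
        "fast": [(1.0, "hot", -10.0)],
    },
}

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _, reward in P[state][action])

# Sanity check: outgoing probabilities of every (state, action) pair sum to 1
for s in states:
    for a in actions:
        assert abs(sum(prob for prob, _, _ in P[s][a]) - 1.0) < 1e-12

print(expected_reward("cool", "fast"))
```

Storing transitions and rewards together per `(state, action)` pair keeps every quantity the Bellman equations need in one lookup.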

### 1.2 The Markov Property

**Definition:** The Markov Property states that the future state depends only on the `current state` and `action`, **NOT** on the entire history of past states and actions.

- **Mathematical Expression:**
$$P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} = s' | S_t = s, A_t = a)$$

- **Intuitive Explanation:** The current state contains all the information needed to make optimal decisions about the future.

- **Goal of an Agent in MDP**
  - The agent's objective is to find a policy $\pi(a \mid s)$ that maximizes the expected cumulative reward over time. This is often formalized using:
      - **Value functions**: Estimate how good it is to be in a state or take an action.
      - **Policy iteration** and **value iteration**: Algorithms to compute optimal policies.

### **Frozen Lake**

**Environment Description:** An agent must navigate across a frozen lake to reach a goal while avoiding holes.

**Components:**
- **States:** 16 positions (4×4 grid) numbered 0-15
- **Actions:** 4 possible moves (0: left, 1: down, 2: right, 3: up)
- **Terminal States:** Goal state (state 15, reward +1) and hole states (states 5, 7, 11, and 12 on the default 4×4 map; the episode ends with reward 0)
- **Transition Probabilities:** Actions don't always lead to expected outcomes due to slippery ice

![frozen-lake-png](_img\frozen-lake.png)
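
The effect of slippery ice can be sketched without the library. The code below is an illustration under an assumption about the default behaviour: the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The helper names `move` and `slippery_outcomes` are hypothetical, and terminal states (holes and goal) are not treated specially here.

```python
# Sketch of FrozenLake-style slippery transitions on a 4x4 grid (not the library's code).
N = 4  # grid side length
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def move(state, direction):
    """Deterministic move; walking into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, N - 1)
    elif direction == RIGHT:
        col = min(col + 1, N - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return row * N + col

def slippery_outcomes(state, action):
    """Intended direction plus its two perpendicular neighbours, 1/3 each."""
    directions = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1 / 3, move(state, d)) for d in directions]

print(slippery_outcomes(6, LEFT))
```

For state 6 and action left, the possible next states are 2 (slip up), 5 (intended move, a hole), and 10 (slip down).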

In [None]:
import gymnasium as gym

# Create environment
env = gym.make('FrozenLake-v1', is_slippery=True)

# Check state and action spaces
print(env.action_space)          
print(env.observation_space)    

print("Number of actions:", env.action_space.n)      
print("Number of states:", env.observation_space.n)  

Discrete(4)
Discrete(16)
Number of actions: 4
Number of states: 16


### 1.4 Transition Probabilities and Rewards

**Accessing Transition Information:**
```python
# env.unwrapped.P[state][action] returns:
# [(probability_1, next_state_1, reward_1, is_terminal_1),
#  (probability_2, next_state_2, reward_2, is_terminal_2), ...]

state = 6
action = 0  # left
print(env.unwrapped.P[state][action])
# Output (probabilities rounded): [(0.333, 2, 0.0, False), (0.333, 5, 0.0, True), (0.333, 10, 0.0, False)]
```

### **CliffWalking**

- The Cliff Walking environment involves an agent crossing a grid world from start to goal while avoiding falling off a cliff.
- If the agent steps onto a cliff cell, it is sent back to the start location.
- The agent keeps moving until it reaches the goal, which ends the episode.
- The code below explores the state and action spaces of this environment.

![cliff-walking-gif](_img\cliff_walking.gif)

In [16]:
import gymnasium as gym   


# ==============================
# Environment Setup
# ==============================
# Create the CliffWalking environment.
# "CliffWalking-v1" is a classic control problem from reinforcement learning.
# "render_mode='rgb_array'" means the environment won't open a window;
# instead, it keeps the visual output as an image array (useful for debugging or rendering later).
env = gym.make('CliffWalking-v1', render_mode='rgb_array')


# ==============================
# Action and State Spaces
# ==============================
# Number of possible actions the agent can take (Up, Right, Down, Left = 4).
num_actions = env.action_space.n

# Number of possible states in the gridworld (4 rows × 12 columns = 48).
num_states = env.observation_space.n

print("Number of actions:", num_actions)
print("Number of states:", num_states)


# ==============================
# Exploring Transitions
# ==============================
# Each state has a set of transitions, depending on the chosen action.
# Let's pick a specific state (for example, 35) and explore what happens when we try different actions.
state = 35

# Loop through all possible actions from this state
for action in range(num_actions):
    # The environment has an internal dictionary "P" that stores transitions.
    # P[state][action] gives a list of possible outcomes when taking `action` in `state`.
    transitions = env.unwrapped.P[state][action]
    print(transitions)

    # Each transition has the format: (probability, next_state, reward, done)
    # -> probability: chance of this outcome (usually 1.0 for deterministic envs like CliffWalking)
    # -> next_state: the state you land in after the action
    # -> reward: the reward received for this action
    # -> done: whether the episode ends after this transition
    for transition in transitions:
        probability, next_state, reward, done = transition
        print(f"Action: {action} | Probability: {probability}, Next State: {next_state}, Reward: {reward}, Done: {done}")


Number of actions: 4
Number of states: 48
[(1.0, np.int64(23), -1, False)]
Action: 0 | Probability: 1.0, Next State: 23, Reward: -1, Done: False
[(1.0, np.int64(35), -1, False)]
Action: 1 | Probability: 1.0, Next State: 35, Reward: -1, Done: False
[(1.0, np.int64(47), -1, True)]
Action: 2 | Probability: 1.0, Next State: 47, Reward: -1, Done: True
[(1.0, np.int64(34), -1, False)]
Action: 3 | Probability: 1.0, Next State: 34, Reward: -1, Done: False
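
The state numbers printed above are easier to read as grid coordinates. This is a small sketch that assumes the 4×12 grid is flattened row by row, which is consistent with the transitions shown (moving down from state 35 reaches state 47). The helper names are hypothetical.

```python
# CliffWalking grid: 4 rows by 12 columns, assumed to be flattened row by row.
N_ROWS, N_COLS = 4, 12

def to_row_col(state):
    """Convert a flat state index to (row, col)."""
    return divmod(state, N_COLS)

def to_state(row, col):
    """Convert (row, col) back to a flat state index."""
    return row * N_COLS + col

print(to_row_col(35))  # (2, 11)
print(to_row_col(47))  # (3, 11)
```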


## **✅2. Policies and State-Value Functions**

### **✨2.1 Policies**

**Definition:** A policy $\pi$ is a strategy that specifies which action to take in each state. The goal is to find a policy that maximizes the expected cumulative reward (return).

**Types:**
- **Deterministic Policy:** Always chooses the same action for a given state
- **Stochastic Policy:** Chooses actions according to a probability distribution
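
The two policy types can be sketched as plain dictionaries. The states, actions, and probabilities below are made up for illustration, and `select_action` is a hypothetical helper rather than part of any library.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {0: 1, 1: 2, 2: 1}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    0: {0: 0.1, 1: 0.7, 2: 0.1, 3: 0.1},
    1: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
}

def select_action(policy, state, rng=random):
    """Return the fixed action, or sample one if the entry is a distribution."""
    choice = policy[state]
    if isinstance(choice, dict):
        actions = list(choice)
        weights = list(choice.values())
        return rng.choices(actions, weights=weights, k=1)[0]
    return choice

print(select_action(deterministic_policy, 0))                  # always 1
print(select_action(stochastic_policy, 0, random.Random(0)))   # sampled action
```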

### **✨2.2 State-Value Functions**

**Definition:**
The **state-value function** $V(s)$ estimates how good it is to be in a given state $s$.
It represents the **expected return (sum of discounted future rewards)** when starting in state $s$ and following a given policy $\pi$.

---

#### **1. Mathematical Expression (Expanded Form):**

$$
V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;\middle|\; S_t = s \right]
$$

✅ **Interpretation:**

* $R_{t+1}$: Immediate reward after leaving state $s$ at time $t$
* $\gamma R_{t+2}$: Next reward, discounted by factor $\gamma$
* $\gamma^2 R_{t+3}$: Reward two steps later, discounted further
* $\cdots$: Continues until the episode ends (or indefinitely in a continuing task)
* $\gamma \in [0,1]$: Discount factor that balances **present vs. future rewards**

📌 **When to use:**

* **Conceptual / Theoretical explanation** of value functions.
* When introducing RL to beginners → easy to show “why future rewards are discounted.”
* To **manually calculate returns** in very short episodes (e.g., toy problems like a 3-step grid world).
* Useful in **Monte Carlo methods**, where we sample entire episodes and directly compute the return.
* Not practical for real-world problems because we can’t compute infinite sums.
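
For a short, finished episode the expanded form can be evaluated directly. This is a minimal sketch with a made-up reward sequence; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over one finished episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical three-step episode: two ordinary steps, then a +10 goal reward
episode_rewards = [-1, -1, 10]

print(discounted_return(episode_rewards, 1.0))  # 8.0
print(discounted_return(episode_rewards, 0.9))  # smaller, because the +10 arrives late
```

Lowering $\gamma$ shrinks the contribution of the final +10, which is why a smaller discount factor makes the agent more short-sighted.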

---

#### **2. Bellman Equation (Recursive Form):**

$$
V^\pi(s) = r + \gamma V^\pi(s')
$$

✅ **Interpretation:**

* $r$: Reward received right after leaving state $s$
* $s'$: The next state reached from $s$ under the policy
* $\gamma V^\pi(s')$: The discounted *value of the next state*
* Turns the infinite sum into a **recursive relationship**

📌 **When to use:**

* **Dynamic Programming (DP):**
  * `Value Iteration`
  * `Policy Iteration`
  * `Policy Evaluation`
* Great for **deterministic environments**, where the next state is known with certainty.
* Useful in **algorithm derivations**, since recursion makes equations easier to solve.
* Helps in **bootstrapping methods** like Temporal-Difference (TD) learning, where we approximate returns by one-step lookahead instead of waiting for the whole episode.
* Key in proving convergence of RL algorithms.

💡 Example: In **Gridworld**, if moving right always gives +1 reward, then

$$
V(s) = 1 + \gamma V(s')
$$

is easier to compute recursively instead of expanding all rewards.
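
The recursion can be unrolled backwards along a deterministic path. The sketch below assumes a made-up reward sequence and a value of 0 after the final step; `values_along_path` is a hypothetical helper name.

```python
def values_along_path(rewards, gamma):
    """Apply values[t] = rewards[t] + gamma * values[t + 1], starting from the end.

    The value after the last step is taken to be 0 (terminal state).
    """
    values = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]

print(values_along_path([-1, -1, 10], gamma=1.0))  # [8.0, 9.0, 10.0]
```

Each value is computed once from its successor, so the whole path costs one pass instead of one full sum per state.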

---

#### **3. General Bellman Equation with Policy (Full Form):**

$$
V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \Big[ R(s,a,s') + \gamma V^\pi(s') \Big]
$$

✅ **Interpretation:**

* $\pi(a|s)$: Policy → probability of taking action $a$ in state $s$
* $P(s'|s,a)$: Transition probability → chance of landing in state $s'$
* $R(s,a,s')$: Expected reward for going from $s$ to $s'$ with action $a$
* $V^\pi(s')$: Value of the next state under policy $\pi$
* Captures **stochasticity in both actions and environment dynamics**

📌 **When to use:**

* In **real-world MDPs** where:

  * Multiple actions are possible
  * Transitions are probabilistic (not deterministic)
* Central to **Reinforcement Learning algorithms**:

  * **Policy Evaluation** (compute value of a given policy)
  * **Policy Iteration** (improve policies step by step)
  * **Actor-Critic methods** (where the critic estimates $V^\pi$)
* Needed when using **model-based RL**, since it explicitly uses transition probabilities $P(s'|s,a)$.
* Forms the foundation for **policy optimization methods** like Policy Gradient, A2C, PPO (where value functions are approximated).

💡 Example: In a **robot navigation task**, if the robot in state $s$ has a 70% chance to move forward and a 30% chance to slip sideways, this equation correctly handles those probabilities.
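
The full equation translates into an iterative backup over actions and next states. The following is a sketch, not a reference implementation: the tiny one-state task, its probabilities, and the function name `evaluate_policy` are assumptions made for illustration, and the transition table uses the `(prob, next_state, reward, done)` layout.

```python
def evaluate_policy(P, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterative policy evaluation with the full Bellman expectation backup.

    P[s][a]  -> list of (prob, next_state, reward, done)
    pi[s][a] -> probability of choosing action a in state s
    """
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            new_v = 0.0
            for a, action_prob in pi[s].items():
                for prob, s_next, reward, done in P[s][a]:
                    future = 0.0 if done else gamma * V[s_next]
                    new_v += action_prob * prob * (reward + future)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V

# Hypothetical task: "forward" reaches the goal (state 1) with probability 0.7
# and slips back to state 0 with probability 0.3; "wait" always stays in state 0.
P = {
    0: {
        "forward": [(0.7, 1, 1.0, True), (0.3, 0, 0.0, False)],
        "wait": [(1.0, 0, 0.0, False)],
    },
}
pi = {0: {"forward": 0.5, "wait": 0.5}}

V = evaluate_policy(P, pi, gamma=0.9)
print(round(V[0], 4))
```

For this example the equation can also be solved by hand: $V(0) = 0.35 + 0.585\,V(0)$, so $V(0) = 0.35 / 0.415 \approx 0.8434$, which the iteration approaches.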

---

✨ **Usage:**

* **Expanded formula** → Good for intuition, teaching, and small toy problems. Used in Monte Carlo methods.
* **Recursive Bellman formula** → Practical for computation, DP, and TD-learning. Efficient because it uses recursion instead of full sum.
* **General Bellman with policy** → Realistic MDPs with stochastic transitions and multiple actions. Core of almost all RL algorithms.

```python
import gymnasium as gym

# Create the environment.
# NOTE: 'MyGridWorld' is a placeholder ID for a custom 3x3 grid world. It must be
# registered with Gymnasium before gym.make() can find it, and it is assumed to
# expose a transition table `P` like the built-in toy-text environments.
env = gym.make('MyGridWorld', render_mode='rgb_array')
state, info = env.reset()

# Example Policy (Grid World)
# 0: left, 1: down, 2: right, 3: up
policy = {
    0: 1,  # In state 0, go down
    1: 2,  # In state 1, go right
    2: 1,  # In state 2, go down
    3: 1,  # In state 3, go down
    4: 3,  # In state 4, go up
    5: 1,  # In state 5, go down
    6: 2,  # In state 6, go right
    7: 3   # In state 7, go up
}

# Policy Execution: follow the policy until the episode ends
terminated = truncated = False
while not (terminated or truncated):
    action = policy[state]  # Select action based on policy
    state, reward, terminated, truncated, info = env.step(action)
    env.render()  # Returns an RGB frame in 'rgb_array' mode

# State-Value Functions (deterministic transitions, so only outcome [0] is used)
gamma = 1
terminal_state = 8
num_states = env.observation_space.n

def compute_state_value(state, policy):
    if state == terminal_state:
        return 0

    action = policy[state]
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)

# Compute all state values
state_values = {state: compute_state_value(state, policy) for state in range(num_states)}
print(state_values)
```

In [1]:
import gymnasium as gym
from gymnasium import spaces
import numpy as np

print("=== CUSTOM 3x3 GRIDWORLD WITH POLICY AND STATE-VALUES ===")
print()

# -------------------------------
# 1. Custom 3x3 GridWorld Env
# -------------------------------
class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["ansi"]}
    
    def __init__(self):
        super().__init__()
        self.shape = (3, 3)
        self.observation_space = spaces.Discrete(9)
        self.action_space = spaces.Discrete(4)  # 0:left,1:down,2:right,3:up
        
        # Define rewards
        self.terminal_state = 8
        self.rewards = {8: 10, 4: -2, 7: -2}
        
        # Precompute P like Gym
        self.P = {s: {a: [] for a in range(4)} for s in range(9)}
        for s in range(9):
            for a in range(4):
                ns = self._move(s, a)
                r = self.rewards.get(ns, -1)
                done = ns == self.terminal_state
                self.P[s][a] = [(1.0, ns, r, done)]  # deterministic
                
    def _move(self, state, action):
        if state == self.terminal_state:
            return state
        row, col = state // 3, state % 3
        if action == 0:    # left
            col = max(0, col - 1)
        elif action == 1:  # down
            row = min(2, row + 1)
        elif action == 2:  # right
            col = min(2, col + 1)
        elif action == 3:  # up
            row = max(0, row - 1)
        return row * 3 + col
    
    def render(self, mode="ansi"):
        grid = np.full(self.shape, " ")
        grid[0,0] = "S"  # start
        grid[2,2] = "D"  # diamond
        grid[1,1] = grid[2,1] = "M"  # mountains (states 4 and 7, matching self.rewards)
        return "\n".join([" ".join(row) for row in grid])

# Create environment
env = GridWorldEnv()
num_states = env.observation_space.n
gamma = 1

# -------------------------------
# 2. Define a deterministic policy
# -------------------------------
policy = {
    0: 1,  # down
    1: 2,  # right
    2: 1,  # down
    3: 1,  # down
    4: 3,  # up
    5: 1,  # down
    6: 2,  # right
    7: 3,  # up
    8: 0   # terminal
}

# -------------------------------
# 3. Compute state values
# -------------------------------
def compute_state_value(state):
    if state == env.terminal_state:
        return 0
    
    action = policy[state]
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

V = {s: compute_state_value(s) for s in range(num_states)}

# -------------------------------
# 4. Display results
# -------------------------------
print("Custom 3x3 GridWorld Layout:")
print(env.render())
print()
print("State-values:", V)

=== CUSTOM 3x3 GRIDWORLD WITH POLICY AND STATE-VALUES ===

Custom 3x3 GridWorld Layout:
S    
  M  
  M D

State-values: {0: 1, 1: 8, 2: 9, 3: 2, 4: 7, 5: 10, 6: 3, 7: 5, 8: 0}


In [2]:
import numpy as np

# ==========================
# 1. POLICIES
# ==========================

# Actions: 0: left, 1: down, 2: right, 3: up
policy1 = {
    0: 1,  # down
    1: 2,  # right
    2: 1,  # down
    3: 1,  # down
    4: 3,  # up
    5: 1,  # down
    6: 2,  # right
    7: 3   # up
}

policy2 = {
    0: 2,  # right
    1: 2,  # right
    2: 1,  # down
    3: 2,  # right
    4: 2,  # right
    5: 1,  # down
    6: 2,  # right
    7: 2   # right
}

action_names = {0: 'left', 1: 'down', 2: 'right', 3: 'up'}

print("Policy 1:", {s: action_names[a] for s, a in policy1.items()})
print("Policy 2:", {s: action_names[a] for s, a in policy2.items()})
print()

# ==========================
# 2. ENVIRONMENT MODEL (P)
# ==========================
gamma = 1.0          # Discount factor
num_states = 9       # 3x3 grid
terminal_state = 8   # Diamond

# Build transition table: P[state][action] = [(prob, next_state, reward, done)]
P = {s: {a: [] for a in range(4)} for s in range(num_states)}

def move(state, action):
    """Return next state after taking an action."""
    if state == terminal_state:
        return state
    
    row, col = state // 3, state % 3
    if action == 0:    # left
        col = max(0, col - 1)
    elif action == 1:  # down
        row = min(2, row + 1)
    elif action == 2:  # right
        col = min(2, col + 1)
    elif action == 3:  # up
        row = max(0, row - 1)
    return row * 3 + col

def reward(next_state):
    """Return reward for landing in next_state."""
    if next_state == 8:   # Diamond
        return 10
    elif next_state in [4, 7]:  # Mountains
        return -2
    else:  # All other states
        return -1

# Fill transition table
for s in range(num_states):
    for a in range(4):
        ns = move(s, a)
        r = reward(ns)
        done = (ns == terminal_state)
        P[s][a] = [(1.0, ns, r, done)]   # deterministic env

# ==========================
# 3. STATE-VALUE FUNCTIONS
# ==========================
print("2. STATE-VALUE FUNCTIONS")
print("========================")
print("V(s) = Expected return starting from state s following policy π")
print()

def compute_state_value(state, policy):
    """Bellman expectation with env.P"""
    if state == terminal_state:
        return 0
    
    action = policy[state]
    transitions = P[state][action]
    
    value = 0
    for prob, next_state, reward, _ in transitions:
        value += prob * (reward + gamma * compute_state_value(next_state, policy))
    return value

# Calculate state values for both policies
print("Computing state values...")
print()

V1 = {s: compute_state_value(s, policy1) for s in range(num_states)}
V2 = {s: compute_state_value(s, policy2) for s in range(num_states)}

# ==========================
# 4. RESULTS
# ==========================
print("RESULTS:")
print("========")
print("State-values for Policy 1:", V1)
print("State-values for Policy 2:", V2)
print()

# Example calculation walkthrough
print("EXAMPLE CALCULATION (Policy 1, State 2):")
print("========================================")
state = 2
action = policy1[state]
prob, next_state, reward, _ = P[state][action][0]
print(f"State 2 → Action {action_names[action]} → State {next_state}")
print(f"Reward: {reward}")
print(f"V(2) = {reward} + {gamma} × V({next_state}) = {reward} + {gamma} × {V1[next_state]} = {V1[2]}")
print()

# Compare policies
print("POLICY COMPARISON:")
print("==================")
total1 = sum(V1[s] for s in range(8))  # exclude terminal
total2 = sum(V2[s] for s in range(8))
print(f"Total value (Policy 1): {total1}")
print(f"Total value (Policy 2): {total2}")
print(f"Better policy: Policy {'2' if total2 > total1 else '1'}")


Policy 1: {0: 'down', 1: 'right', 2: 'down', 3: 'down', 4: 'up', 5: 'down', 6: 'right', 7: 'up'}
Policy 2: {0: 'right', 1: 'right', 2: 'down', 3: 'right', 4: 'right', 5: 'down', 6: 'right', 7: 'right'}

2. STATE-VALUE FUNCTIONS
V(s) = Expected return starting from state s following policy π

Computing state values...

RESULTS:
State-values for Policy 1: {0: 1.0, 1: 8.0, 2: 9.0, 3: 2.0, 4: 7.0, 5: 10.0, 6: 3.0, 7: 5.0, 8: 0}
State-values for Policy 2: {0: 7.0, 1: 8.0, 2: 9.0, 3: 7.0, 4: 9.0, 5: 10.0, 6: 8.0, 7: 10.0, 8: 0}

EXAMPLE CALCULATION (Policy 1, State 2):
State 2 → Action down → State 5
Reward: -1
V(2) = -1 + 1.0 × V(5) = -1 + 1.0 × 10.0 = 9.0

POLICY COMPARISON:
Total value (Policy 1): 45.0
Total value (Policy 2): 68.0
Better policy: Policy 2


## **✅3. Action-Value Functions (Q-Values)**

### **✨3.1 Action-Value (Q) Function**

**Definition:**
The **action-value function** $Q^\pi(s,a)$ is the **expected return** when you **start in state $s$**, **take action $a$**, and then **follow policy $\pi$** thereafter:

$$
Q^\pi(s,a) = \mathbb{E}_\pi \left[ G_t \;\middle|\; S_t = s,\; A_t = a \right]
$$

---

#### **Bellman forms (from simplest → most general)**

1. **Deterministic one-step:**
   If taking $a$ from $s$ deterministically yields next state $s'$ and reward $r_a$:

$$
Q^\pi(s,a) = r_a + \gamma\,V^\pi(s')
$$

➡️ **Interpretation:** "Immediate reward now + discounted value of the next state."

---

2. **Expectation over next states (using $V^\pi$):**

$$
Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \Big[ R(s,a,s') + \gamma\,V^\pi(s') \Big]
$$

---

3. **Bellman expectation in terms of $Q^\pi$ only (no $V$ needed):**

$$
Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \Big[ R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\,Q^\pi(s',a') \Big]
$$

---

#### **Relationship with state values**

$$
V^\pi(s) = \sum_a \pi(a \mid s)\,Q^\pi(s,a)
$$
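
The relationship is a weighted average of one row of Q-values. The numbers below are illustrative, and `state_value_from_q` is a hypothetical helper name.

```python
def state_value_from_q(q_row, action_probs):
    """V(s) = sum over a of pi(a|s) * Q(s, a)."""
    return sum(action_probs[a] * q_row[a] for a in action_probs)

# Illustrative Q(s, .) for a single state
q_row = {"left": 1.0, "down": 3.0, "right": 9.0, "up": 7.0}

uniform = {a: 0.25 for a in q_row}  # stochastic policy: all actions equally likely
greedy = {"right": 1.0}             # deterministic policy: always the best action

print(state_value_from_q(q_row, uniform))  # 5.0
print(state_value_from_q(q_row, greedy))   # 9.0
```

A deterministic policy puts all probability on one action, so $V^\pi(s)$ reduces to the Q-value of that action.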

---

#### **Optimality (how Q drives control)**

The **optimal action-value function** satisfies the Bellman optimality equation:

$$
Q^*(s,a) = \sum_{s'} P(s' \mid s,a) \Big[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \Big]
$$

And the **optimal policy** is:

$$
\pi^*(s) = \arg\max_a Q^*(s,a)
$$

---

📌 **When to use $Q$:**

* **Model-free control** (Q-Learning, SARSA, DQN) to pick actions directly
* **Action selection** without needing the model once $Q$ is learned
* **Greedy/improvement steps** in DP or policy iteration

---

### **✨3.2 Computing Q-Values**

**Core recipe (deterministic case):**

1. From $(s,a)$ get the **immediate reward** $r_a$ and **next state** $s'$.
2. Combine with the **discounted value of $s'$:**

$$
Q(s,a) = r_a + \gamma\,V(s')
$$

*(If $s'$ is terminal, use $V(s') = 0$.)*

---

**Stochastic transitions:**
If $(s,a)$ can lead to multiple $s'_i$ with probabilities $p_i$ and rewards $r_i$:

$$
Q(s,a) = \sum_i p_i \Big[ r_i + \gamma\,V(s'_i) \Big]
$$

---

**Directly with $Q$ (no $V$ table):**
Iterate the Bellman expectation:

$$
Q_{k+1}(s,a) \;\leftarrow\; \sum_{s'} P(s' \mid s,a)\Big[ R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\,Q_k(s',a') \Big]
$$

---

**Tiny code sketch (Gym-style envs):**

```python
def compute_q_value(env, s, a, V, gamma):
    # env.unwrapped.P[s][a] -> list of (p, next_state, r, done)
    outcomes = env.unwrapped.P[s][a]
    return sum(
        p * (r + (0 if done else gamma * V[next_state]))
        for p, next_state, r, done in outcomes
    )
```

*(For deterministic MDPs, this collapses to the single outcome.)*

---

**Example (Gridworld, γ=1, using the Policy 1 state values $V(7)=5$, $V(3)=2$, $V(1)=8$, $V(5)=10$):**

* $Q(4,\text{down}) = -2 + 1 \times 5 = 3$
* $Q(4,\text{left}) = -1 + 1 \times 2 = 1$
* $Q(4,\text{up}) = -1 + 1 \times 8 = 7$
* $Q(4,\text{right}) = -1 + 1 \times 10 = 9$
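
These four numbers can be reproduced with a few lines of code. The sketch below rebuilds the 3×3 grid dynamics and rewards described in this document and plugs in the Policy 1 state values; the function names are chosen here for illustration.

```python
GAMMA = 1.0
TERMINAL = 8
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}

# State values of Policy 1, copied from the results printed earlier in this document
V = {0: 1.0, 1: 8.0, 2: 9.0, 3: 2.0, 4: 7.0, 5: 10.0, 6: 3.0, 7: 5.0, 8: 0.0}

def move(state, action):
    """Next state on the 3x3 grid; moves into a wall leave the state unchanged."""
    if state == TERMINAL:
        return state
    row, col = divmod(state, 3)
    if action == 0:
        col = max(0, col - 1)
    elif action == 1:
        row = min(2, row + 1)
    elif action == 2:
        col = min(2, col + 1)
    elif action == 3:
        row = max(0, row - 1)
    return row * 3 + col

def reward(next_state):
    """+10 for the diamond (8), -2 for mountains (4, 7), -1 otherwise."""
    if next_state == 8:
        return 10
    return -2 if next_state in (4, 7) else -1

def q_value(state, action):
    """Q(s, a) = r + gamma * V(s') for a deterministic transition."""
    next_state = move(state, action)
    return reward(next_state) + GAMMA * V[next_state]

for action, name in ACTIONS.items():
    print(f"Q(4, {name}) = {q_value(4, action)}")
```

Moving right has the highest Q-value in state 4, so a greedy improvement step would replace Policy 1's "up" with "right" there.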

---

**Terminal-state note:**
Common conventions:

* Set $Q(s_{\text{terminal}},a) = 0$ for all $a$
* Or skip computing it (return `None`)

Both work if applied consistently.

---

### **Cheat-Sheet 📝**

* **Intuition:**
  “Immediate reward of action $a$ in state $s$ + discounted desirability of where that action leads.”
* **From $V$ to $Q$:**
  $Q(s,a) = r + \gamma V(\text{next})$
* **From $Q$ to $V$:**
  $V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a)]$
* **Control:**
  Act greedily with respect to $Q$; learn $Q$ via **Q-Learning, SARSA, DQN**




**Implementation:**
```python
def compute_q_value(state, action, V):
    if state == terminal_state:
        return None
    
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * V[next_state]

# Compute all Q-values
Q = {(state, action): compute_q_value(state, action, V)
     for state in range(num_states)
     for action in range(num_actions)}
```

### 3.3 Policy Improvement Using Q-Values

**Greedy Policy Improvement:** Select the action with the highest Q-value for each state.

```python
def improve_policy(Q, num_states, num_actions):
    improved_policy = {}
    
    for state in range(num_states - 1):  # Exclude terminal state
        # Find action with maximum Q-value
        max_action = max(range(num_actions), 
                        key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action
    
    return improved_policy
```

## **✅4. Policy Iteration**

### 4.1 Algorithm Overview

**Definition:** Policy Iteration is an algorithm that finds the optimal policy by alternating between policy evaluation and policy improvement until convergence.

**Steps:**
1. **Initialize:** Start with an arbitrary policy π₀
2. **Policy Evaluation:** Compute V^π for current policy
3. **Policy Improvement:** Create new policy π' by acting greedily with respect to V^π
4. **Check Convergence:** If π' = π, stop. Otherwise, set π = π' and go to step 2

### 4.2 Implementation

```python
def policy_evaluation(policy, threshold=0.001):
    """Evaluate a given policy until convergence."""
    V = {state: 0 for state in range(num_states)}
    
    while True:
        new_V = {state: 0 for state in range(num_states)}
        
        for state in range(num_states - 1):
            if state != terminal_state:
                action = policy[state]
                _, next_state, reward, _ = env.P[state][action][0]
                new_V[state] = reward + gamma * V[next_state]
        
        # Check convergence
        if all(abs(new_V[s] - V[s]) < threshold for s in V):
            break
        V = new_V
    
    return V

def policy_improvement(V):
    """Improve policy based on current value function."""
    improved_policy = {}
    
    for state in range(num_states - 1):
        Q_values = []
        for action in range(num_actions):
            _, next_state, reward, _ = env.P[state][action][0]
            q_val = reward + gamma * V[next_state]
            Q_values.append(q_val)
        
        # Select best action
        max_action = max(range(num_actions), key=lambda a: Q_values[a])
        improved_policy[state] = max_action
    
    return improved_policy

def policy_iteration():
    """Complete policy iteration algorithm."""
    # Initialize with arbitrary policy
    policy = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
    
    while True:
        # Policy Evaluation
        V = policy_evaluation(policy)
        
        # Policy Improvement
        improved_policy = policy_improvement(V)
        
        # Check for convergence
        if improved_policy == policy:
            break
            
        policy = improved_policy
    
    return policy, V
```

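**Time Complexity:** With a full transition model, one sweep of policy evaluation costs O(|S|²) for a fixed deterministic policy, and the improvement step costs O(|S|²|A|), where |S| is the number of states and |A| is the number of actions. Because evaluation repeats its sweep until convergence, it usually dominates the cost of each policy iteration step.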

## **✅5. Value Iteration**

### 5.1 Algorithm Overview

**Definition:** Value Iteration combines policy evaluation and improvement in a single step. It directly computes the optimal value function and derives the policy from it.

**Key Insight:** Instead of fully evaluating a policy, perform only one sweep of policy evaluation followed by policy improvement.

**Bellman Optimality Equation:**
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$$

### 5.2 Algorithm Steps

1. **Initialize:** V(s) = 0 for all states
2. **Value Update:** For each state, set V(s) to the maximum expected value over all actions
3. **Convergence Check:** Repeat step 2 until the largest value change is below the threshold
4. **Policy Extraction:** In each state, choose the action that achieves the maximum value

### 5.3 Implementation

```python
def value_iteration(threshold=0.001):
    """Value iteration algorithm."""
    # Initialize
    V = {state: 0 for state in range(num_states)}
    policy = {state: 0 for state in range(num_states - 1)}
    
    while True:
        new_V = {state: 0 for state in range(num_states)}
        
        for state in range(num_states - 1):
            if state != terminal_state:
                # Compute Q-values for all actions
                Q_values = []
                for action in range(num_actions):
                    _, next_state, reward, _ = env.P[state][action][0]
                    q_val = reward + gamma * V[next_state]
                    Q_values.append(q_val)
                
                # Take maximum
                max_q_value = max(Q_values)
                max_action = max(range(num_actions), key=lambda a: Q_values[a])
                
                new_V[state] = max_q_value
                policy[state] = max_action
        
        # Check convergence
        if all(abs(new_V[state] - V[state]) < threshold for state in V):
            break
            
        V = new_V
    
    return policy, V

def get_max_action_and_value(state, V):
    """Helper function to get optimal action and value for a state."""
    Q_values = []
    for action in range(num_actions):
        _, next_state, reward, _ = env.P[state][action][0]
        q_val = reward + gamma * V[next_state]
        Q_values.append(q_val)
    
    max_action = max(range(num_actions), key=lambda a: Q_values[a])
    max_q_value = Q_values[max_action]
    
    return max_action, max_q_value
```

**Time Complexity:** O(|S|²|A|) per iteration. Each iteration is cheaper than a full Policy Iteration step, but Value Iteration typically needs more iterations to converge.

## **✅6. Comparison: Policy Iteration vs Value Iteration**

| Aspect | Policy Iteration | Value Iteration |
|--------|------------------|-----------------|
| **Approach** | Separate evaluation and improvement | Combined evaluation and improvement |
| **Convergence** | Finite number of iterations | Asymptotic convergence |
| **Per Iteration Cost** | Higher (full policy evaluation) | Lower (single value update) |
| **Total Iterations** | Fewer iterations needed | More iterations needed |
| **Memory** | Stores explicit policy | Policy derived from values |
| **Practical Use** | Better for small state spaces | Better for large state spaces |

## **✅7. Key Formulas Summary**

### 7.1 Value Functions
- **State Value:** $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Action Value:** $Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
- **Return:** $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

### 7.2 Bellman Equations
- **State Value Bellman:** $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$
- **Action Value Bellman:** $Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s')Q^\pi(s',a')]$
- **Optimality Bellman:** $V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$

## **✅8. Practical Tips and Best Practices**

### 8.1 Implementation Considerations
- **Convergence Threshold:** Choose appropriate threshold (e.g., 0.001) based on precision needs
- **Discount Factor:** γ close to 1 emphasizes future rewards; closer to 0 emphasizes immediate rewards
- **Terminal States:** Always handle terminal states separately (return 0 or fixed value)

### 8.2 Common Pitfalls
- **Infinite Loops:** Ensure proper convergence checks in iterative algorithms
- **State Indexing:** Be careful with state numbering and terminal state handling
- **Action Spaces:** Verify action space size matches expected number of actions
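
One way to guard against the infinite-loop pitfall is to cap the number of sweeps. The sketch below is a generic wrapper under that assumption; `iterate_until_converged` and its parameters are hypothetical names, and the toy update stands in for a real Bellman backup.

```python
def iterate_until_converged(update, V, threshold=1e-3, max_iters=1_000):
    """Repeat V <- update(V) until the largest change is below `threshold`.

    Raises RuntimeError instead of looping forever if that never happens.
    """
    for iteration in range(1, max_iters + 1):
        new_V = update(V)
        delta = max(abs(new_V[s] - V[s]) for s in V)
        V = new_V
        if delta < threshold:
            return V, iteration
    raise RuntimeError(f"No convergence after {max_iters} iterations")

# Toy update with fixed point V(s) = 2: each call halves the distance to it.
def halve_distance(V):
    return {s: 1 + 0.5 * v for s, v in V.items()}

V, n_iters = iterate_until_converged(halve_distance, {0: 0.0})
print(V, n_iters)
```

With $\gamma = 1$ and a policy that never reaches a terminal state, the values keep changing forever, so a cap like `max_iters` turns a silent hang into a visible error.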

### 8.3 Extensions and Advanced Topics
- **Stochastic Policies:** Extend to probabilistic action selection
- **Function Approximation:** Use neural networks for large state spaces
- **Model-Free Methods:** Q-Learning and SARSA for unknown environments
- **Policy Gradient Methods:** Direct policy optimization without value functions

## **✅9. Code Examples Repository**

### 9.1 Complete Grid World Example
```python
import gymnasium as gym
import numpy as np

class GridWorldMDP:
    def __init__(self, env_name='FrozenLake-v1', gamma=1.0):
        self.env = gym.make(env_name)
        # P[state][action] -> list of (prob, next_state, reward, done)
        self.P = self.env.unwrapped.P
        self.num_states = self.env.observation_space.n
        self.num_actions = self.env.action_space.n
        self.gamma = gamma
        # Assumes the goal is the last state (true for FrozenLake's default 4x4 map)
        self.terminal_state = self.num_states - 1

    def q_values(self, state, V):
        """Expected return of each action in `state`, backing up from V."""
        return [
            sum(prob * (reward + self.gamma * V[next_state])
                for prob, next_state, reward, _ in self.P[state][action])
            for action in range(self.num_actions)
        ]

    def policy_evaluation(self, policy, threshold=0.001):
        """Iteratively evaluate a deterministic policy."""
        V = {s: 0.0 for s in range(self.num_states)}
        while True:
            new_V = V.copy()
            for state in range(self.num_states - 1):  # skip the goal state
                new_V[state] = self.q_values(state, V)[policy[state]]
            if all(abs(new_V[s] - V[s]) < threshold for s in V):
                return new_V
            V = new_V

    def policy_improvement(self, V):
        """Return the policy that is greedy with respect to V."""
        return {state: int(np.argmax(self.q_values(state, V)))
                for state in range(self.num_states - 1)}

    def policy_iteration(self, threshold=0.001):
        """Alternate evaluation and greedy improvement until the policy is stable."""
        policy = {s: 0 for s in range(self.num_states - 1)}
        while True:
            V = self.policy_evaluation(policy, threshold)
            new_policy = self.policy_improvement(V)
            if new_policy == policy:
                return policy, V
            policy = new_policy

    def value_iteration(self, threshold=0.001):
        """Apply the Bellman optimality backup until values stop changing."""
        V = {s: 0.0 for s in range(self.num_states)}
        while True:
            new_V = V.copy()
            for state in range(self.num_states - 1):  # skip the goal state
                new_V[state] = max(self.q_values(state, V))
            converged = all(abs(new_V[s] - V[s]) < threshold for s in V)
            V = new_V
            if converged:
                break
        # Extract the greedy policy from the converged values
        return self.policy_improvement(V), V

# Usage
mdp = GridWorldMDP()
optimal_policy, optimal_values = mdp.value_iteration()
print("Optimal Policy:", optimal_policy)
print("Optimal Values:", optimal_values)
```
