# **⭐Model Based Learning⭐**

# Table of Contents - Model Based Learning

- [1. Markov Decision Processes (MDPs)](#1-markov-decision-processes-mdps)
  - [1.1 What is an MDP?](#11-what-is-an-mdp)
  - [1.2 The Markov Property](#12-the-markov-property)
  - [1.3 Frozen Lake Environment](#13-frozen-lake-environment)
  - [1.4 CliffWalking Environment](#14-cliffwalking-environment)
  
- [2. State-value and Action-Value Functions](#2-state-value-and-action-value-functions)
  - [2.1 State-Value Functions (V^π)](#21-state-value-functions-$V^π$)
  - [2.2 Action-Value Functions (Q^π)](#22-action-value-functions-$Q^π$)
  
- [3. Policy & Value Iteration](#3-policy--value-iteration)
  - [3.1 Overview](#31-overview)
  - [3.2 Policy Iteration](#32-policy-iteration)
  - [3.3 Value Iteration](#33-value-iteration)
  - [3.4 Comparison: Policy vs Value Iteration](#34-comparison-policy-vs-value-iteration)
  
- [4. Summary](#4-summary)
  - [4.1 Value Functions](#41-value-functions)
  - [4.2 Bellman Equations](#42-bellman-equations)
  - [4.3 Practical Tips and Best Practices](#43-practical-tips-and-best-practices)
  - [4.4 Code Examples Repository](#44-code-examples-repository)

# **✅1. Markov Decision Processes (MDPs)**

## 1.1 What is an MDP?

>**Definition:** A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in environments where outcomes are partly random and partly under the control of a decision-maker (agent). It's foundational in reinforcement learning and dynamic programming.

**Purpose:** Writing a problem down as an MDP makes precise what the agent can observe, what it can do, and what it is rewarded for. That precision is what lets algorithms such as policy iteration and value iteration compute an optimal policy.

**Key Components of an MDP:**
- **States (S):** All possible situations the agent can be in
- **Actions (A):** All possible moves the agent can make
- **Transition Probabilities (P):** Likelihood of moving from one state to another given an action
- **Rewards (R):** Immediate feedback received after taking an action
- **Discount Factor (γ):** Weight given to future rewards (0 ≤ γ ≤ 1)
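
As a rough illustration, the five components can be bundled into one small container. The class name `MDP`, its `validate` helper, and the two-state example are made up for this sketch; the only thing borrowed from the rest of these notes is the `(probability, next_state, reward, done)` tuple layout that the Gymnasium transition table `P` uses later on. Rewards live inside those tuples, so there is no separate `R` field.

```python
from dataclasses import dataclass

# Transition table layout: P[state][action] -> list of
# (probability, next_state, reward, done) tuples.
Transition = tuple[float, int, float, bool]


@dataclass
class MDP:
    """Minimal container for the 5-tuple (S, A, P, R, gamma)."""
    states: list[int]
    actions: list[int]
    P: dict[int, dict[int, list[Transition]]]
    gamma: float

    def validate(self) -> None:
        """Check that outgoing probabilities sum to 1 for every (state, action)."""
        for s in self.states:
            for a in self.actions:
                total = sum(p for p, _, _, _ in self.P[s][a])
                assert abs(total - 1.0) < 1e-9, (s, a, total)


# Hypothetical two-state example: state 1 is terminal.
toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    P={
        0: {
            0: [(1.0, 0, 0.0, False)],                        # action 0: stay put
            1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],   # action 1: usually reach the goal
        },
        1: {
            0: [(1.0, 1, 0.0, True)],
            1: [(1.0, 1, 0.0, True)],
        },
    },
    gamma=0.9,
)
toy.validate()
print(len(toy.states), len(toy.actions), toy.gamma)
```

Keeping the reward inside each transition tuple is only one possible design; a separate reward table would work just as well.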

## 1.2 The Markov Property

**Definition:** The Markov Property states that the future state depends only on the `current state` and `action`, **NOT** on the entire history of past states and actions.

- **Mathematical Expression:**
$$P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} = s' | S_t = s, A_t = a)$$

- **Intuitive Explanation:** The current state contains all the information needed to make optimal decisions about the future.

- **Goal of an Agent in an MDP**
  - The agent's objective is to find a policy $\pi(a \mid s)$ that maximizes the expected cumulative reward over time. This is often formalized using:
      - **Value functions**: Estimate how good it is to be in a state or take an action.
      - **Policy iteration** and **value iteration**: Algorithms to compute optimal policies.

## 1.3 Frozen Lake Environment

Imagine a 4x4 grid that represents a frozen lake. There are four types of tiles:

1.  **`S`** : The starting point (safe).
2.  **`F`** : Frozen surface (safe, you can walk on it).
3.  **`H`** : A hole in the ice (dangerous, you fall in and the episode ends).
4.  **`G`** : The goal (where you receive a reward).

A simple layout might look like this:
```
S  F  F  F
F  H  F  H
F  F  F  H
H  F  F  G
```
You are an agent (a person trying to cross the lake). Your goal is to find a path from `S` to `G` without falling into a hole `H`.


### Mapping the Problem to a Markov Decision Process (MDP)

An MDP is defined by a 5-tuple `(S, A, P, R, γ)`:
*   **S**: Set of states
*   **A**: Set of actions
*   **P**: Transition probabilities `P(s' | s, a)`
*   **R**: Reward function `R(s, a, s')`
*   **γ**: Discount factor (between 0 and 1)

Let's break down the Frozen Lake problem into these components.

#### 1. `States (S)`
Each tile (cell) in the grid is a state. We can represent them by their coordinates.
*   **S**: `{(0,0), (0,1), (0,2), (0,3), (1,0), ..., (3,3)}`
*   The **terminal states** are the holes `H` and the goal `G`. Once you enter one, the episode is over. For example, `(1,1)` is a hole and `(3,3)` is the goal.

#### 2. `Actions (A)`
The actions are the moves the agent can attempt. All four are available in every state, even next to a wall.
*   **A**: `{UP, DOWN, LEFT, RIGHT}`

#### 3. `Transition Probabilities (P)`
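Transition probabilities describe how the environment responds to an action. On Frozen Lake the outcome of an action is **stochastic** (non-deterministic), which mimics the slippery nature of ice. The outcome depends only on the current tile and the chosen action, so the Markov property still holds.

*   **Intended direction**: 1/3 chance (≈33.3%)
*   **Slipping to one perpendicular direction**: 1/3 chance
*   **Slipping to the other perpendicular direction**: 1/3 chance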

**Example:** From state `(0,1)` (a frozen tile), if the agent intends to go `DOWN`:
*   With ~33% probability, it successfully moves `DOWN` to `(1,1)`.
*   With ~33% probability, it slips and moves `LEFT` to `(0,0)`.
*   With ~33% probability, it slips and moves `RIGHT` to `(0,2)`.

If a move would take the agent into a wall (e.g., moving `LEFT` from `(0,0)`), the agent simply stays in its current state. The probability mass for that invalid move is added to the probability of remaining in the current state.
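
The slip rule and the wall rule can be sketched in a few lines. This is an illustration written for these notes, not Gymnasium's implementation: `MOVES`, `PERPENDICULAR`, and `slippery_transitions` are hypothetical names, states are `(row, col)` pairs as in the text, and holes and the goal are not treated as terminal here.

```python
# Grid moves as (row, col) offsets; action names follow the text above.
MOVES = {"LEFT": (0, -1), "DOWN": (1, 0), "RIGHT": (0, 1), "UP": (-1, 0)}

# For each intended action, the two perpendicular directions the agent may slip to.
PERPENDICULAR = {
    "LEFT": ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
    "UP": ("LEFT", "RIGHT"),
    "DOWN": ("LEFT", "RIGHT"),
}


def slippery_transitions(state, action, size=4):
    """Return {next_state: probability} for one (state, action) pair."""
    outcomes = {}
    for direction in (action, *PERPENDICULAR[action]):
        d_row, d_col = MOVES[direction]
        # Clamp to the grid: bumping into a wall leaves the agent where it is.
        row = min(max(state[0] + d_row, 0), size - 1)
        col = min(max(state[1] + d_col, 0), size - 1)
        # Outcomes that end on the same tile pool their probability.
        outcomes[(row, col)] = outcomes.get((row, col), 0.0) + 1 / 3
    return outcomes


print(slippery_transitions((0, 1), "DOWN"))   # three distinct tiles, 1/3 each
print(slippery_transitions((0, 0), "LEFT"))   # two of the three moves hit a wall
```

In the second call both `LEFT` and `UP` run into a wall, so two thirds of the probability stays on `(0,0)` and one third goes to `(1,0)`.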

#### 4. `Reward Function (R)`
The reward defines the goal of the agent. We give a reward only when the agent reaches a meaningful state.

*   **Reaching the Goal (G):** `R = +1`
*   **Falling into a Hole (H):** `R = 0` (Some versions use a small negative reward like `-1` to penalize failure)
*   **Stepping on any other Frozen (F) tile:** `R = 0`

The agent gets this reward *upon entering* the new state `s'`.

**Example:**
*   `R((2,3), DOWN, (3,3)) = +1` (Moving into the goal from above)
*   `R((1,0), RIGHT, (1,1)) = 0` (Moving into a hole)
*   `R((0,0), RIGHT, (0,1)) = 0` (Moving onto a frozen tile)

#### 5. `Discount Factor (γ)`
This determines how much the agent cares about `future rewards` vs. `immediate rewards`.
*   Let's choose **γ = 0.9** for this example. This means the agent strongly prefers reaching the goal quickly but still values eventually reaching it over never reaching it at all.
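
To see what γ = 0.9 does in practice, the sketch below weights the single `+1` goal reward by how many steps it takes to earn it. The function name `discounted_goal_value` is made up for this illustration, and it assumes every earlier reward is 0, as in the reward function above. The shortest path from `S` to `G` in the layout shown takes 6 moves.

```python
GAMMA = 0.9


def discounted_goal_value(steps, gamma=GAMMA, goal_reward=1.0):
    """Return of an episode that earns goal_reward on step `steps` and 0 before."""
    # The reward collected on the k-th step is weighted by gamma**(k - 1).
    return gamma ** (steps - 1) * goal_reward


for steps in (6, 10, 20):
    print(steps, round(discounted_goal_value(steps), 4))
```

The later the goal is reached, the smaller the return, which is why a discounted agent prefers short routes when they are equally safe.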



### **How an Episode Unrolls**

Let's simulate a few steps of a possible episode:

1.  **Time t=0**: State `s₀ = (0,0)` (Start).
2.  **Action a₀**: The agent chooses `RIGHT` (intending to go to `(0,1)`).
3.  **Transition**: Due to the slipperiness of the ice, it actually slips and moves `DOWN` to `s₁ = (1,0)`. Reward `r₀ = 0`.
4.  **Time t=1**: State `s₁ = (1,0)`.
5.  **Action a₁**: The agent chooses `RIGHT` again (intending to go to `(1,1)` - a hole!).
6.  **Transition**: It successfully executes the action and moves into the hole at `s₂ = (1,1)`. This is a terminal state.
7.  **Reward r₁**: The agent receives `r₁ = 0` for entering a hole. The episode ends. The total return for this episode is `0`.

**A successful episode** would involve the agent navigating the slippery ice, potentially getting lucky with slips, and eventually landing on `(3,3)` to collect a reward of `+1`.

### The "Solution" to the MDP

The goal of solving an MDP is to find a **policy (π)**, which is a strategy that tells the agent what action to take in every state (`π(s) -> a`).

The optimal policy `π*` is the one that maximizes the expected cumulative discounted reward (the **return**). In this case, it's the policy that has the highest chance of getting the agent to the goal without falling in a hole.

For a small grid like this, we can compute this optimal policy using algorithms like **Value Iteration** or **Policy Iteration**. The result is a map that gives the best action for every tile.


### **Frozen Lake**

**Environment Description:** An agent must navigate across a frozen lake to reach a goal while avoiding holes.

**Components:**
- **States:** 16 positions (4×4 grid) numbered 0-15
- **Actions:** 4 possible moves (0: left, 1: down, 2: right, 3: up)
- **Terminal States:** 5 in the layout shown above: the goal (state 15, reward +1) and four holes (states 5, 7, 11, 12), each of which ends the episode
- **Transition Probabilities:** Actions don't always lead to expected outcomes due to slippery ice

![frozen-lake-png](_img/frozen-lake.png)

```python
import gymnasium as gym

# Create environment
env = gym.make('FrozenLake-v1', is_slippery=True)

# Check state and action spaces
print(env.action_space)
print(env.observation_space)

print("Number of actions:", env.action_space.n)
print("Number of states:", env.observation_space.n)
```

Output:
```
Discrete(4)
Discrete(16)
Number of actions: 4
Number of states: 16
```


**Transition Probabilities and Rewards**

**Accessing Transition Information:**
```python
# env.unwrapped.P[state][action] returns:
# [(probability_1, next_state_1, reward_1, is_terminal_1),
#  (probability_2, next_state_2, reward_2, is_terminal_2), ...]

state = 6
action = 0  # left
print(env.unwrapped.P[state][action])
# Output (probabilities rounded): [(0.333, 2, 0.0, False), (0.333, 5, 0.0, True), (0.333, 10, 0.0, False)]
```

## 1.4 CliffWalking Environment

- The Cliff Walking environment involves an agent crossing a grid world from start to goal while avoiding falling off a cliff.
- If the player moves to a cliff location it returns to the start location.
- The player makes moves until they reach the goal, which ends the episode.
- The code below explores the state and action spaces of this environment and inspects its transitions.

![cliff-walking-gif](_img/cliff_walking.gif)

```python
import gymnasium as gym


# ==============================
# Environment Setup
# ==============================
# Create the CliffWalking environment.
# "CliffWalking-v1" is a classic control problem from reinforcement learning.
# "render_mode='rgb_array'" means the environment won't open a window;
# instead, it keeps the visual output as an image array (useful for debugging or rendering later).
env = gym.make('CliffWalking-v1', render_mode='rgb_array')


# ==============================
# Action and State Spaces
# ==============================
# Number of possible actions the agent can take (Up, Right, Down, Left = 4).
num_actions = env.action_space.n

# Number of possible states in the gridworld (4 rows × 12 columns = 48).
num_states = env.observation_space.n

print("Number of actions:", num_actions)
print("Number of states:", num_states)


# ==============================
# Exploring Transitions
# ==============================
# Each state has a set of transitions, depending on the chosen action.
# Let's pick a specific state (for example, 35) and explore what happens when we try different actions.
state = 35

# Loop through all possible actions from this state
for action in range(num_actions):
    # The environment has an internal dictionary "P" that stores transitions.
    # P[state][action] gives a list of possible outcomes when taking `action` in `state`.
    transitions = env.unwrapped.P[state][action]
    print(transitions)

    # Each transition has the format: (probability, next_state, reward, done)
    # -> probability: chance of this outcome (usually 1.0 for deterministic envs like CliffWalking)
    # -> next_state: the state you land in after the action
    # -> reward: the reward received for this action
    # -> done: whether the episode ends after this transition
    for transition in transitions:
        probability, next_state, reward, done = transition
        print(f"Action: {action} | Probability: {probability}, Next State: {next_state}, Reward: {reward}, Done: {done}")
```

Output:
```
Number of actions: 4
Number of states: 48
[(1.0, np.int64(23), -1, False)]
Action: 0 | Probability: 1.0, Next State: 23, Reward: -1, Done: False
[(1.0, np.int64(35), -1, False)]
Action: 1 | Probability: 1.0, Next State: 35, Reward: -1, Done: False
[(1.0, np.int64(47), -1, True)]
Action: 2 | Probability: 1.0, Next State: 47, Reward: -1, Done: True
[(1.0, np.int64(34), -1, False)]
Action: 3 | Probability: 1.0, Next State: 34, Reward: -1, Done: False
```


# **✅2. State-value and Action-Value Functions**

## 2.1 State-Value Functions ($V^\pi$)

### 🔹 **State-Value Function: $V^\pi(s)$**

* **Definition:**
  $V^\pi(s)$ is the *expected total reward (return)* you’ll get if you **start in state $s$** and then **keep following policy $\pi$**.

* **Think of it as:**
  “How good is it to just **be in this state**, assuming I behave according to policy $\pi$ from now on?”

* **Use case:**

  * Helps evaluate *how desirable a state is overall*.
  * Important in **policy evaluation** (checking how good a given policy is).

* **Example:**
  Imagine a gridworld where:

  * From state $s$, following policy $\pi$ usually leads you safely to the goal with +10 reward.
  * But sometimes you bump into walls and lose -1.
  * The average of all these future outcomes = **state value** $V^\pi(s)$.

### **✨1. Policies**

**Definition:** A policy $\pi$ is a strategy that defines which action to take in each state to maximize the expected cumulative reward (return).

**Types:**
- **Deterministic Policy:** Always chooses the same action for a given state
- **Stochastic Policy:** Chooses actions according to a probability distribution
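
The difference between the two types is easy to see in code. The sketch below is illustrative only: the state numbers, probabilities, and helper names (`act_deterministic`, `act_stochastic`) are made up, and the action encoding `0: left, 1: down, 2: right, 3: up` is the one used for Frozen Lake above.

```python
import random

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

# Deterministic policy: one fixed action per state.
deterministic_policy = {0: 1, 1: 2}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    0: {0: 0.1, 1: 0.6, 2: 0.2, 3: 0.1},
    1: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
}


def act_deterministic(policy, state):
    """Look up the single action assigned to this state."""
    return policy[state]


def act_stochastic(policy, state, rng):
    """Sample one action in proportion to its probability."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is repeatable
print(ACTION_NAMES[act_deterministic(deterministic_policy, 0)])

samples = [act_stochastic(stochastic_policy, 0, rng) for _ in range(1000)]
print({ACTION_NAMES[a]: samples.count(a) for a in ACTION_NAMES})
```

Calling the deterministic policy twice in the same state always gives the same action, while the sampled counts for the stochastic policy should roughly follow the 10/60/20/10 split.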

### **✨2. State-Value Functions**

**Definition:**
The **state-value function** $V(s)$ estimates how good it is to be in a given state $s$.
It represents the **expected return (sum of discounted future rewards)** when starting in state $s$ and following a given policy $\pi$.

---

#### **a. Mathematical Expression (Expanded Form):**

$$
V(s) = r_{s+1} + \gamma r_{s+2} + \gamma^2 r_{s+3} + \cdots + \gamma^{n-1} r_{s+n}
$$

✅ **Interpretation:**

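* $r_{s+1}$: Immediate reward after leaving state $s$ (the subscript $s+k$ marks the reward collected $k$ steps after $s$)
* $\gamma r_{s+2}$: Next reward, discounted by factor $\gamma$
* $\gamma^2 r_{s+3}$: Reward two steps later, discounted further
* $\cdots$: The sum runs until the episode ends after $n$ steps; in a continuing task it goes on indefinitely
* $\gamma \in [0,1]$: Discount factor that balances **present vs. future rewards**
* The formula is the return along a single trajectory. When actions or transitions are random, $V(s)$ is the average of this return over all possible trajectories.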

📌 **When to use:**

* **Conceptual / Theoretical explanation** of value functions.
* When introducing RL to beginners → easy to show “why future rewards are discounted.”
* To **manually calculate returns** in very short episodes (e.g., toy problems like a 3-step grid world).
* Useful in **Monte Carlo methods**, where we sample entire episodes and directly compute the return.
* Not practical for large problems: it needs complete (possibly very long) episodes, and in stochastic environments many of them must be averaged to estimate the expectation.
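
A minimal sketch of the expanded form, assuming we already have the reward sequences of a few finished episodes that all start in the same state. The reward lists are invented for illustration, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for one finished episode."""
    g = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences from three sampled episodes.
episodes = [[-1, -1, 10], [-1, -2, -1, 10], [-1, 10]]
gamma = 0.9

returns = [discounted_return(ep, gamma) for ep in episodes]
print(returns)

# Monte Carlo estimate of V(s): the average of the sampled returns.
print(sum(returns) / len(returns))
```

With only three episodes the average is a very noisy estimate; Monte Carlo methods rely on many more samples.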

---

#### **b. Bellman Equation (Recursive Form):**

$$
V(s) = r_{s+1} + \gamma V(s+1)
$$

✅ **Interpretation:**

* $r_{s+1}$: Reward right after leaving state $s$
* $\gamma V(s+1)$: The discounted *value of the next state* ($s+1$ denotes the state that follows $s$, written $s'$ in the later formulas)
* Turns the infinite sum into a **recursive relationship**

📌 **When to use:**

* **Dynamic Programming (DP):**
  * `Value Iteration`
  * `Policy Iteration`
  * `Policy Evaluation`
* Great for **deterministic environments**, where the next state is known with certainty.
* Useful in **algorithm derivations**, since recursion makes equations easier to solve.
* Helps in **bootstrapping methods** like Temporal-Difference (TD) learning, where we approximate returns by one-step lookahead instead of waiting for the whole episode.
* Key in proving convergence of RL algorithms.

💡 Example: In **Gridworld**, if moving right always gives +1 reward, then

$$
V(s) = 1 + \gamma V(s')
$$

is easier to compute recursively instead of expanding all rewards.

---

#### **c. General Bellman Equation with Policy (Full Form):**

$$
V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \Big[ R(s,a,s') + \gamma V^\pi(s') \Big]
$$

✅ **Interpretation:**

* $\pi(a|s)$: Policy → probability of taking action $a$ in state $s$
* $P(s'|s,a)$: Transition probability → chance of landing in state $s'$
* $R(s,a,s')$: Expected reward for going from $s$ to $s'$ with action $a$
* $V^\pi(s')$: Value of the next state under policy $\pi$
* Captures **stochasticity in both actions and environment dynamics**

📌 **When to use:**

* In **real-world MDPs** where:

  * Multiple actions are possible
  * Transitions are probabilistic (not deterministic)
* Central to **Reinforcement Learning algorithms**:

  * **Policy Evaluation** (compute value of a given policy)
  * **Policy Iteration** (improve policies step by step)
  * **Actor-Critic methods** (where the critic estimates $V^\pi$)
* Needed when using **model-based RL**, since it explicitly uses transition probabilities $P(s'|s,a)$.
* Forms the foundation for **policy optimization methods** like Policy Gradient, A2C, PPO (where value functions are approximated).

💡 Example: In a **robot navigation task**, if the robot in state $s$ has a 70% chance to move forward and a 30% chance to slip sideways, this equation correctly handles those probabilities.
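
The full equation can be turned into code by sweeping over all states and repeating the backup until the values stop changing (iterative policy evaluation). The sketch below assumes a transition table in the `P[s][a] = [(prob, next_state, reward, done), ...]` layout used earlier and a policy stored as `policy[s][a] = probability`. The function name `evaluate_policy` and the tiny two-state robot model are hypothetical, and the robot is simplified so that a slip leaves it where it was.

```python
def evaluate_policy(P, policy, gamma, num_states, tol=1e-8, max_sweeps=10_000):
    """Iterative policy evaluation with in-place updates.

    P[s][a]      -> list of (prob, next_state, reward, done)
    policy[s][a] -> probability of taking action a in state s
    """
    V = [0.0] * num_states
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(num_states):
            new_v = 0.0
            for a, pi_a in policy[s].items():
                for prob, ns, r, done in P[s][a]:
                    # Terminal next states contribute no future value.
                    future = 0.0 if done else V[ns]
                    new_v += pi_a * prob * (r + gamma * future)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V


# Hypothetical robot: from state 0, "forward" reaches the goal (state 1) with
# probability 0.7 (reward +1) and slips, staying in state 0, with probability 0.3.
P = {
    0: {0: [(0.7, 1, 1.0, True), (0.3, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}
policy = {0: {0: 1.0}, 1: {0: 1.0}}

V = evaluate_policy(P, policy, gamma=0.9, num_states=2)
print(V)
```

For this model the equation can also be solved by hand: $V(0) = 0.7 \cdot 1 + 0.3 \cdot 0.9 \cdot V(0)$, so $V(0) = 0.7 / 0.73 \approx 0.959$, which the iteration should approach.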

---

✨ **Usage:**

* **Expanded formula** → Good for intuition, teaching, and small toy problems. Used in Monte Carlo methods.
* **Recursive Bellman formula** → Practical for computation, DP, and TD-learning. Efficient because it uses recursion instead of full sum.
* **General Bellman with policy** → Realistic MDPs with stochastic transitions and multiple actions. Core of almost all RL algorithms.

```python
import gymnasium as gym

# Create the environment.
# NOTE: 'MyGridWorld' is a placeholder ID. It must be registered with Gymnasium
# before this runs; the GridWorldEnv class further down is a self-contained alternative.
env = gym.make('MyGridWorld', render_mode='rgb_array')
state, info = env.reset()
num_states = env.observation_space.n

gamma = 1            # Discount factor
terminal_state = 8   # Index of the goal state

# Example Policy (Grid World)
# 0: left, 1: down, 2: right, 3: up

policy = {
    0: 1,  # In state 0, go down
    1: 2,  # In state 1, go right
    2: 1,  # In state 2, go down
    3: 1,  # In state 3, go down
    4: 3,  # In state 4, go up
    5: 1,  # In state 5, go down
    6: 2,  # In state 6, go right
    7: 3   # In state 7, go up
}

# Policy Execution
terminated = truncated = False
while not (terminated or truncated):
    # Select action based on policy
    action = policy[state]
    state, reward, terminated, truncated, info = env.step(action)
    # Render the environment
    env.render()

# State-Value Functions
def compute_state_value(state, policy):
    if state == terminal_state:
        return 0

    action = policy[state]
    # The environment is deterministic, so each action has exactly one outcome.
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state, policy)

# Compute all state values
state_values = {state: compute_state_value(state, policy) for state in range(num_states)}
print(state_values)
```

```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np

print("=== CUSTOM 3x3 GRIDWORLD WITH POLICY AND STATE-VALUES ===")
print()

# -------------------------------
# 1. Custom 3x3 GridWorld Env
# -------------------------------
class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["ansi"]}

    def __init__(self):
        super().__init__()
        self.shape = (3, 3)
        self.observation_space = spaces.Discrete(9)
        self.action_space = spaces.Discrete(4)  # 0:left,1:down,2:right,3:up

        # Define rewards: diamond at state 8, mountains at states 4 and 7
        self.terminal_state = 8
        self.rewards = {8: 10, 4: -2, 7: -2}

        # Precompute P like Gym
        self.P = {s: {a: [] for a in range(4)} for s in range(9)}
        for s in range(9):
            for a in range(4):
                ns = self._move(s, a)
                r = self.rewards.get(ns, -1)
                done = ns == self.terminal_state
                self.P[s][a] = [(1.0, ns, r, done)]  # deterministic

    def _move(self, state, action):
        if state == self.terminal_state:
            return state
        row, col = state // 3, state % 3
        if action == 0:    # left
            col = max(0, col - 1)
        elif action == 1:  # down
            row = min(2, row + 1)
        elif action == 2:  # right
            col = min(2, col + 1)
        elif action == 3:  # up
            row = max(0, row - 1)
        return row * 3 + col

    def render(self, mode="ansi"):
        grid = np.full(self.shape, " ")
        grid[0,0] = "S"  # start
        grid[2,2] = "D"  # diamond
        # Mountains sit on states 4 and 7, matching self.rewards
        grid[1,1] = grid[2,1] = "M"
        return "\n".join([" ".join(row) for row in grid])

# Create environment
env = GridWorldEnv()
num_states = env.observation_space.n
gamma = 1

# -------------------------------
# 2. Define a deterministic policy
# -------------------------------
policy = {
    0: 1,  # down
    1: 2,  # right
    2: 1,  # down
    3: 1,  # down
    4: 3,  # up
    5: 1,  # down
    6: 2,  # right
    7: 3,  # up
    8: 0   # terminal
}

# -------------------------------
# 3. Compute state values
# -------------------------------
def compute_state_value(state):
    if state == env.terminal_state:
        return 0

    action = policy[state]
    _, next_state, reward, _ = env.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

V = {s: compute_state_value(s) for s in range(num_states)}

# -------------------------------
# 4. Display results
# -------------------------------
print("Custom 3x3 GridWorld Layout:")
print(env.render())
print()
print("State-values:", V)
```

Output:
```
=== CUSTOM 3x3 GRIDWORLD WITH POLICY AND STATE-VALUES ===

Custom 3x3 GridWorld Layout:
S    
  M  
  M D

State-values: {0: 1, 1: 8, 2: 9, 3: 2, 4: 7, 5: 10, 6: 3, 7: 5, 8: 0}
```


```python
import numpy as np

# ==========================
# 1. POLICIES
# ==========================

# Actions: 0: left, 1: down, 2: right, 3: up
policy1 = {
    0: 1,  # down
    1: 2,  # right
    2: 1,  # down
    3: 1,  # down
    4: 3,  # up
    5: 1,  # down
    6: 2,  # right
    7: 3   # up
}

policy2 = {
    0: 2,  # right
    1: 2,  # right
    2: 1,  # down
    3: 2,  # right
    4: 2,  # right
    5: 1,  # down
    6: 2,  # right
    7: 2   # right
}

action_names = {0: 'left', 1: 'down', 2: 'right', 3: 'up'}

print("Policy 1:", {s: action_names[a] for s, a in policy1.items()})
print("Policy 2:", {s: action_names[a] for s, a in policy2.items()})
print()

# ==========================
# 2. ENVIRONMENT MODEL (P)
# ==========================
gamma = 1.0          # Discount factor
num_states = 9       # 3x3 grid
terminal_state = 8   # Diamond

# Build transition table: P[state][action] = [(prob, next_state, reward, done)]
P = {s: {a: [] for a in range(4)} for s in range(num_states)}

def move(state, action):
    """Return next state after taking an action."""
    if state == terminal_state:
        return state

    row, col = state // 3, state % 3
    if action == 0:    # left
        col = max(0, col - 1)
    elif action == 1:  # down
        row = min(2, row + 1)
    elif action == 2:  # right
        col = min(2, col + 1)
    elif action == 3:  # up
        row = max(0, row - 1)
    return row * 3 + col

def get_reward(next_state):
    """Return reward for landing in next_state."""
    if next_state == 8:   # Diamond
        return 10
    elif next_state in [4, 7]:  # Mountains
        return -2
    else:  # All other states
        return -1

# Fill transition table
for s in range(num_states):
    for a in range(4):
        ns = move(s, a)
        r = get_reward(ns)
        done = (ns == terminal_state)
        P[s][a] = [(1.0, ns, r, done)]   # deterministic env

# ==========================
# 3. STATE-VALUE FUNCTIONS
# ==========================
print("2. STATE-VALUE FUNCTIONS")
print("========================")
print("V(s) = Expected return starting from state s following policy π")
print()

def compute_state_value(state, policy):
    """Bellman expectation using the transition table P."""
    if state == terminal_state:
        return 0

    action = policy[state]
    transitions = P[state][action]

    value = 0
    for prob, next_state, reward, _ in transitions:
        value += prob * (reward + gamma * compute_state_value(next_state, policy))
    return value

# Calculate state values for both policies
print("Computing state values...")
print()

V1 = {s: compute_state_value(s, policy1) for s in range(num_states)}
V2 = {s: compute_state_value(s, policy2) for s in range(num_states)}

# ==========================
# 4. RESULTS
# ==========================
print("RESULTS:")
print("========")
print("State-values for Policy 1:", V1)
print("State-values for Policy 2:", V2)
print()

# Example calculation walkthrough
print("EXAMPLE CALCULATION (Policy 1, State 2):")
print("========================================")
state = 2
action = policy1[state]
prob, next_state, reward, _ = P[state][action][0]
print(f"State 2 → Action {action_names[action]} → State {next_state}")
print(f"Reward: {reward}")
print(f"V(2) = {reward} + {gamma} × V({next_state}) = {reward} + {gamma} × {V1[next_state]} = {V1[2]}")
print()

# Compare policies
print("POLICY COMPARISON:")
print("==================")
total1 = sum(V1[s] for s in range(8))  # exclude terminal
total2 = sum(V2[s] for s in range(8))
print(f"Total value (Policy 1): {total1}")
print(f"Total value (Policy 2): {total2}")
print(f"Better policy: Policy {'2' if total2 > total1 else '1'}")
```

Output:
```
Policy 1: {0: 'down', 1: 'right', 2: 'down', 3: 'down', 4: 'up', 5: 'down', 6: 'right', 7: 'up'}
Policy 2: {0: 'right', 1: 'right', 2: 'down', 3: 'right', 4: 'right', 5: 'down', 6: 'right', 7: 'right'}

2. STATE-VALUE FUNCTIONS
========================
V(s) = Expected return starting from state s following policy π

Computing state values...

RESULTS:
========
State-values for Policy 1: {0: 1.0, 1: 8.0, 2: 9.0, 3: 2.0, 4: 7.0, 5: 10.0, 6: 3.0, 7: 5.0, 8: 0}
State-values for Policy 2: {0: 7.0, 1: 8.0, 2: 9.0, 3: 7.0, 4: 9.0, 5: 10.0, 6: 8.0, 7: 10.0, 8: 0}

EXAMPLE CALCULATION (Policy 1, State 2):
========================================
State 2 → Action down → State 5
Reward: -1
V(2) = -1 + 1.0 × V(5) = -1 + 1.0 × 10.0 = 9.0

POLICY COMPARISON:
==================
Total value (Policy 1): 45.0
Total value (Policy 2): 68.0
Better policy: Policy 2
```


## 2.2 Action-Value Functions ($Q^\pi$)

### 🔹 **Action-Value Function: $Q^\pi(s,a)$**

* **Definition:**
  $Q^\pi(s,a)$ is the *expected total reward (return)* if you **start in state $s$**, **take action $a$ immediately**, and then **keep following policy $\pi$**.

* **Think of it as:**
  “How good is it to **choose this action right now** in this state, assuming I’ll behave according to $\pi$ afterward?”

* **Use case:**

  * More detailed than $V^\pi$, because it separates **which action** you take in a state.
  * Needed for **control** (improving or choosing better policies).

* **Example:**
  Same gridworld:

  * In state $s$, you could move **up** (+5 reward) or **down** (-1 penalty), and either move ends the episode.
  * $V^\pi(s)$ would average over what policy $\pi$ usually does.
  * Since no further reward follows, $Q^\pi(s,\text{up})=+5$ and $Q^\pi(s,\text{down})=-1$.
  * So $Q$ tells you *which action is better in this state*, while $V$ just says *the overall goodness of the state under your policy*.

### 1) **What $Q^\pi(s,a)$ means**

* Formula:

  $$
  Q^\pi(s,a) = \mathbb{E}_\pi\big[G_t \mid S_t=s, A_t=a\big]
  $$
* Symbols:

  * $G_t$ = total return from time $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
  * $\mathbb{E}_\pi[\cdot]$ = expectation when actions after time $t$ are chosen according to policy $\pi$.
  * Condition $\mid S_t=s, A_t=a$ means: **we start in state $s$ and take action $a$ now**; randomness afterward is from environment transitions and the policy.
* Intuition: average (over possible randomness) of **immediate reward + discounted future rewards** after taking $a$ in $s$.
* Tiny numeric example:

  * Suppose $\gamma=0.9$, immediate reward $r=2$, and expected future return $V^\pi(s')=5$.
  * Then $Q^\pi(s,a)=2 + 0.9\times 5 = 6.5$.

### 2) **Break the return into immediate + future (derivation)**

* Start from definition:

  $$
  Q^\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a\big]
  $$
* Because expectation is linear:

  $$
  Q^\pi(s,a) = \mathbb{E}[R_{t+1}\mid s,a] \;+\; \gamma \,\mathbb{E}[G_{t+1}\mid s,a]
  $$
* Interpretations:

  * $\mathbb{E}[R_{t+1}\mid s,a]$ = expected immediate reward after doing $a$ in $s$.
  * $\mathbb{E}[G_{t+1}\mid s,a]$ = expected future return from the next time step onward.

### 3) **Deterministic one-step Bellman form**

* Formula (deterministic next state $s'$ and deterministic immediate reward $r_a$):

  $$
  Q^\pi(s,a) = r_a + \gamma V^\pi(s')
  $$
* Why this follows:

  * If taking $a$ from $s$ always lands in $s'$ and gives reward $r_a$, then $\mathbb{E}[R_{t+1}\mid s,a]=r_a$ and $\mathbb{E}[G_{t+1}\mid s,a]=V^\pi(s')$.
* Intuition: **immediate reward** + **discounted value of the known next state**.

### 4) **Stochastic transitions — expectation over next states**

* Formula:

  $$
  Q^\pi(s,a) = \sum_{s'} P(s'\mid s,a)\,\big[ R(s,a,s') + \gamma V^\pi(s')\big]
  $$
* Explanation:

  * If taking $a$ can lead to several possible next states $s'$, each with probability $P(s'|s,a)$, you average the immediate reward + future value for each possible $s'$.
* Numeric example (showing every arithmetic step):

  * Let two next states $s_1,s_2$ with probabilities $0.7$ and $0.3$.
  * Rewards: $R(s,a,s_1)=1$, $R(s,a,s_2)=2$.
  * Values: $V^\pi(s_1)=3$, $V^\pi(s_2)=4$.
  * $\gamma=0.9$.
  * Compute for $s_1$: inner = $1 + 0.9\times 3 = 3.7$.
    * Multiply by prob: $0.7\times 3.7 = 2.59$.
  * Compute for $s_2$: inner = $2 + 0.9\times 4 = 5.6$.
    * Multiply by prob: $0.3\times 5.6 = 1.68$.
  * So $Q^\pi(s,a) = 2.59 + 1.68 = 4.27$.
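
The same arithmetic can be checked in a couple of lines. The variable names are chosen for this sketch; the numbers are the ones from the example.

```python
gamma = 0.9

# One entry per possible next state: (probability, reward, value of next state).
outcomes = [
    (0.7, 1.0, 3.0),  # s_1
    (0.3, 2.0, 4.0),  # s_2
]

# Q(s, a) = sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
q = sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
print(round(q, 2))  # 4.27
```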

### 5) **Bellman expectation in terms of $Q$ (no $V$ needed)**

* Start: $V^\pi(s') = \sum_{a'} \pi(a'\mid s')\,Q^\pi(s',a')$.
* Substitute into previous expression:

  $$
  Q^\pi(s,a) = \sum_{s'} P(s'\mid s,a)\Big[ R(s,a,s') + \gamma \sum_{a'} \pi(a'\mid s')\,Q^\pi(s',a')\Big].
  $$
* Why useful:

  * This is a **self-consistent equation** for $Q^\pi$ only. Solve it (analytically or iteratively) to find the Q-values of a policy.
* Iteration form (policy evaluation):

  $$
  Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'\mid s,a)\Big[R(s,a,s') + \gamma \sum_{a'} \pi(a'\mid s') Q_k(s',a')\Big].
  $$

  * Keep updating $Q$ until it converges.
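
A minimal sketch of that iteration, assuming the `P[s][a] = [(prob, next_state, reward, done), ...]` table layout from earlier and a stochastic policy stored as `policy[s][a]`. The function name `evaluate_q` and the two-state example are made up for illustration, and a fixed number of sweeps stands in for a proper convergence check.

```python
def evaluate_q(P, policy, gamma, sweeps=200):
    """Repeatedly apply the Bellman expectation backup to a Q-table."""
    Q = {(s, a): 0.0 for s in P for a in P[s]}
    for _ in range(sweeps):
        new_Q = {}
        for (s, a) in Q:
            total = 0.0
            for prob, ns, r, done in P[s][a]:
                # Expected Q-value of the next state under the policy;
                # terminal next states contribute nothing.
                next_v = 0.0 if done else sum(
                    policy[ns][a2] * Q[(ns, a2)] for a2 in P[ns]
                )
                total += prob * (r + gamma * next_v)
            new_Q[(s, a)] = total
        Q = new_Q
    return Q


# Hypothetical model: in state 0, action 0 stays put (reward 0) and
# action 1 moves to the terminal state 1 (reward +1).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

Q = evaluate_q(P, policy, gamma=0.9)
print(Q)
```

Here $Q(0,1)=1$ directly, and $Q(0,0)$ satisfies $Q(0,0) = 0.9\,(0.5\,Q(0,0) + 0.5 \cdot 1)$, so the iteration should settle near $0.45/0.55 \approx 0.818$.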

### 6) **Relationship $V^\pi$ ↔ $Q^\pi$**

* Formula:

  $$
  V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)
  $$
* Meaning: state-value = **expected Q-value when actions are sampled from $\pi$**.
* Quick numeric example:

  * Suppose two actions with probabilities $0.6$ and $0.4$.
  * $Q(s,a_1)=10,\; Q(s,a_2)=2$.
  * Then $V(s)=0.6\times 10 + 0.4\times 2 = 6.8$.

### 7) **Bellman *optimality* equation (why the $\max$ appears)**

* Formula:

  $$
  Q^*(s,a) = \sum_{s'} P(s'\mid s,a)\Big[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\Big]
  $$
* Explanation:

  * $Q^*$ assumes **after taking action $a$ now we will act optimally thereafter**.
  * So the future value from $s'$ is the **maximum** Q-value over all actions available in $s'$: $V^*(s')=\max_{a'}Q^*(s',a')$.
* Small numeric illustration:

  * Suppose deterministic next state $s'$, $R=1$, $\gamma=0.9$, and $\max_{a'}Q^*(s',a')=9$.
  * Then $Q^*(s,a)=1 + 0.9\times 9 = 9.1$.

* The optimal policy uses:

  $$
  \pi^*(s) = \arg\max_a Q^*(s,a).
  $$

  * Pick the action that yields the largest $Q^*$ in that state.
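
Replacing the policy average with a `max` gives a simple way to compute $Q^*$ when the model is known. The sketch below is one possible formulation rather than the only one; `q_value_iteration`, `greedy_policy`, and the three-state chain are hypothetical, and the table layout is the same `P[s][a]` format as before.

```python
def q_value_iteration(P, gamma, sweeps=100):
    """Apply the Bellman optimality backup to every (state, action) pair."""
    Q = {(s, a): 0.0 for s in P for a in P[s]}
    for _ in range(sweeps):
        new_Q = {}
        for (s, a) in Q:
            total = 0.0
            for prob, ns, r, done in P[s][a]:
                # Act optimally from the next state onward.
                best_next = 0.0 if done else max(Q[(ns, a2)] for a2 in P[ns])
                total += prob * (r + gamma * best_next)
            new_Q[(s, a)] = total
        Q = new_Q
    return Q


def greedy_policy(Q, P):
    """pi*(s) = argmax_a Q*(s, a); ties go to the first action."""
    return {s: max(P[s], key=lambda a: Q[(s, a)]) for s in P}


# Hypothetical chain 0 -> 1 -> 2 (terminal). Action 1 moves forward,
# action 0 stays in state 0 or steps back from state 1. Each step costs -1,
# and entering the terminal state pays +10.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, 10.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

Q_star = q_value_iteration(P, gamma=0.9)
pi_star = greedy_policy(Q_star, P)
print(Q_star)
print(pi_star)
```

In this chain the greedy policy moves forward in both non-terminal states, and $Q^*(0,1) = -1 + 0.9 \times 10 = 8$.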

### 8) **Sample-based updates (how algorithms use these formulas)**

* **Q-Learning (off-policy)** one-step sample update (common in practice):

  $$
  Q(s,a) \leftarrow Q(s,a) + \alpha\big[ r + \gamma\max_{a'}Q(s',a') - Q(s,a)\big]
  $$

  * Use when you sample a transition $(s,a,r,s')$.
  * Intuition: move $Q(s,a)$ toward the sample target $r + \gamma\max_{a'}Q(s',a')$.
* **SARSA (on-policy)**:

  $$
  Q(s,a) \leftarrow Q(s,a) + \alpha\big[ r + \gamma Q(s',a') - Q(s,a)\big]
  $$

  * Here $a'$ is the actual next action chosen by current policy; used when learning while following that policy.
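
Both updates fit in a few lines each. The sketch below applies a single update to a hand-made Q-table; the function names, the `done` flag, and all numbers are assumptions made for illustration. The `done` flag applies the convention that a terminal next state contributes no future value.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done):
    """Off-policy target: reward plus discounted best Q-value in s_next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
    """On-policy target: reward plus discounted Q-value of the action taken next."""
    next_q = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])


# Two identical starting tables: Q(1, 0) = 2, Q(1, 1) = 5, everything else 0.
Q_ql = defaultdict(float, {(1, 0): 2.0, (1, 1): 5.0})
Q_sarsa = defaultdict(float, {(1, 0): 2.0, (1, 1): 5.0})

# The same sampled transition (s=0, a=1, r=1, s'=1) is used for both updates.
q_learning_update(Q_ql, s=0, a=1, r=1.0, s_next=1, actions=[0, 1],
                  alpha=0.5, gamma=0.9, done=False)
# SARSA additionally needs the next action actually chosen; assume it was a'=0.
sarsa_update(Q_sarsa, s=0, a=1, r=1.0, s_next=1, a_next=0,
             alpha=0.5, gamma=0.9, done=False)

print(Q_ql[(0, 1)], Q_sarsa[(0, 1)])
```

Q-learning bootstraps from the best next action (target $1 + 0.9 \times 5$), while SARSA bootstraps from the action that was really taken (target $1 + 0.9 \times 2$), so the two tables end up with different values for the same transition.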

### 9) **Terminal-state conventions**

* Common options:

  * Set $Q(\text{terminal},a)=0$ for all $a$.
  * Or skip/ignore those states in updates.
* Both are fine if applied consistently.

### 10) **Quick intuitive checklist** 🧭

* If you *know* the next state exactly → use $Q = r + \gamma V(\text{next})$.
* If next state is random → average over next states with $P(s'|s,a)$.
* If you have a policy $\pi$ and want to compute its values → use the Bellman expectation (in terms of $Q$ or $V$).
* If you want the best possible behavior → replace expected future actions by $\max$ (Bellman optimality), then act greedily with respect to $Q^*$.


In [27]:
# ============================================
# 1. Import Required Libraries
# ============================================
import gymnasium as gym
import numpy as np


# ============================================
# 2. Initialize Environment & Parameters
# ============================================
env = gym.make('FrozenLake-v1', is_slippery=True)

num_states = env.observation_space.n     # Number of states in FrozenLake
num_actions = env.action_space.n         # Number of possible actions
terminal_state = 15                      # Goal state index in FrozenLake
gamma = 0.9                              # Discount factor


# ============================================
# 3. Initialize Value Function
# ============================================
# Value function stores the "goodness" of each state
V = np.zeros(num_states)     # Start with all states = 0
V[terminal_state] = 1.0      # Goal state is assigned value 1


# ============================================
# 4. Q-Value Computation Function
# ============================================
def compute_q_value(state, action, V):
    """
    Compute the Q-value for a given (state, action) pair.

    Parameters:
        state  (int): Current state
        action (int): Action taken in this state
        V      (array): Current value function

    Returns:
        float: Estimated Q-value
    """
    # Terminal state has no future rewards
    if state == terminal_state:
        return 0
    
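    # env.unwrapped.P[state][action] lists every possible outcome:
    # [(probability, next_state, reward, done), ...]
    # NOTE: only the first listed outcome is used here. With is_slippery=True
    # each action has three possible outcomes, so this is a one-outcome
    # shortcut, not the full expectation over P(s'|s,a).
    probability, next_state, reward, done = env.unwrapped.P[state][action][0]
    
    # One-step backup for that single outcome: r + gamma * V(s')
    return reward + gamma * V[next_state]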


# ============================================
# 5. Compute Q-values for All State-Action Pairs
# ============================================
Q = {
    (state, action): compute_q_value(state, action, V)
    for state in range(num_states)
    for action in range(num_actions)
}

print("Q-values:")
print(Q)


# ============================================
# 6. Greedy Policy Improvement
# ============================================
def improve_policy(Q, num_states, num_actions):
    """
    Improve policy using greedy selection:
    For each state, pick the action with the maximum Q-value.
    """
    improved_policy = {}
    
    for state in range(num_states - 1):  # Exclude terminal state
        max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action
    
    return improved_policy


# Generate improved policy
policy = improve_policy(Q, num_states, num_actions)

print("\nImproved Policy:")
print(policy)


# ============================================
# 7. Test the Improved Policy
# ============================================
state, _ = env.reset()   # Reset environment to initial state
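terminated = truncated = False

# Stop when the episode ends (goal or hole) or is truncated by the step limit
while not (terminated or truncated):
    action = policy[state]  # Select action according to policy
    state, reward, terminated, truncated, info = env.step(action)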
    print(f"State: {state}, Reward: {reward}")

env.close()

Q-values:
{(0, 0): np.float64(0.0), (0, 1): np.float64(0.0), (0, 2): np.float64(0.0), (0, 3): np.float64(0.0), (1, 0): np.float64(0.0), (1, 1): np.float64(0.0), (1, 2): np.float64(0.0), (1, 3): np.float64(0.0), (2, 0): np.float64(0.0), (2, 1): np.float64(0.0), (2, 2): np.float64(0.0), (2, 3): np.float64(0.0), (3, 0): np.float64(0.0), (3, 1): np.float64(0.0), (3, 2): np.float64(0.0), (3, 3): np.float64(0.0), (4, 0): np.float64(0.0), (4, 1): np.float64(0.0), (4, 2): np.float64(0.0), (4, 3): np.float64(0.0), (5, 0): np.float64(0.0), (5, 1): np.float64(0.0), (5, 2): np.float64(0.0), (5, 3): np.float64(0.0), (6, 0): np.float64(0.0), (6, 1): np.float64(0.0), (6, 2): np.float64(0.0), (6, 3): np.float64(0.0), (7, 0): np.float64(0.0), (7, 1): np.float64(0.0), (7, 2): np.float64(0.0), (7, 3): np.float64(0.0), (8, 0): np.float64(0.0), (8, 1): np.float64(0.0), (8, 2): np.float64(0.0), (8, 3): np.float64(0.0), (9, 0): np.float64(0.0), (9, 1): np.float64(0.0), (9, 2): np.float64(0.0), (9, 3): np.flo

### 🔎 Why are most Q-values `0.0`?

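* We **initialized the value function `V` with all zeros**, except for the goal state (`V[15] = 1.0`).

* Then we computed Q-values as:

  $$
  Q(s,a) = r + \gamma V(s')
  $$

* Since **most transitions in FrozenLake give reward 0 and lead to states with `V=0`**, their Q-values are `0.0`.
* The **only exception is `Q[(14, 3)] = 1.9`** (`reward = 1` plus `γ · V[15] = 0.9 × 1.0`). Note that action `3` is **Up**, not Right: FrozenLake numbers its actions 0 = Left, 1 = Down, 2 = Right, 3 = Up. The value arises because `compute_q_value` reads only the first entry of `P[14][3]`, which is the sideways slip to the right into the goal (`state 15`), and ignores its probability.
* Setting `V[15] = 1.0` on top of the goal reward also counts the goal twice. With the usual convention $V(\text{terminal}) = 0$, that entry would be `1.0`.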


### 🔎 Why does the policy look like this?

```
{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 3}
```

* Greedy policy improvement picks the **action with the highest Q-value**.
* Since all Q-values are `0.0` except for `Q[(14, 3)]`, every action ties in the other states and `max` returns the first one, **action 0** (Left).
* At state `14`, it chooses action `3` (Up). This looks best only because the first listed outcome of that action is the slip right into the goal.

So the policy is trivial: "always go left, except at state 14, go up."


### 🔎 Why does the agent loop between states (0 → 4 → 8 → 4 → 8 …)?

* FrozenLake is **slippery** (`is_slippery=True`), meaning the agent doesn’t always move in the intended direction.
* With a bad policy (always choosing "left"), the agent just bounces around between states (like 0 → 4 → 8 → 12).
* It **never learns a good path** because we only computed **one-step Q-values** from an untrained `V`.


> ### ⭐To actually learn a meaningful policy, we need **policy iteration** or **value iteration** (looping updates to `V` and `Q` until convergence).⭐
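The cell above reads only the first entry of `P[state][action]`. In a stochastic environment the Q-value should instead average over every listed outcome, weighted by its probability. The sketch below contrasts the two on a hand-written transition table in the same `(probability, next_state, reward, done)` format; the three-state table `P_toy` and both function names are hypothetical and are not taken from FrozenLake.

```python
def expected_q(P, state, action, V, gamma):
    """Q(s,a) = sum over s' of P(s'|s,a) * (r + gamma * V(s'))."""
    return sum(prob * (reward + gamma * V[next_state])
               for prob, next_state, reward, done in P[state][action])


def first_outcome_q(P, state, action, V, gamma):
    """Shortcut: use only the first listed outcome and ignore its probability."""
    prob, next_state, reward, done = P[state][action][0]
    return reward + gamma * V[next_state]


# Hypothetical toy dynamics: from state 0, action 0 reaches the goal (state 2)
# only one time in three; otherwise it stays put or drifts to state 1.
P_toy = {
    0: {
        0: [(1 / 3, 2, 1.0, True), (1 / 3, 0, 0.0, False), (1 / 3, 1, 0.0, False)],
        1: [(1.0, 1, 0.0, False)],
    }
}
V = [0.0, 0.0, 0.0]  # terminal state keeps value 0
gamma = 0.9

shortcut = first_outcome_q(P_toy, 0, 0, V, gamma)
full = expected_q(P_toy, 0, 0, V, gamma)
print(f"first outcome only: {shortcut:.3f}, full expectation: {full:.3f}")
```

The shortcut treats a one-in-three outcome as certain, so it overstates the action's value; the expectation is what the Bellman equations call for.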

## 2.3 Quick Comparison: $V^\pi$ Vs $Q^\pi$

| Function         | Input                 | Output         | Question it answers                                         | Intuition                                                                                              | Example                                                                                         |
| ---------------- | --------------------- | -------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------- |
| **$V^\pi(s)$**   | State $s$             | Number (value) | “How good is this state under policy $\pi$?”                | Expected long-term return starting **from state $s$** and following policy $\pi$.                      | In chess: value of a board position assuming you continue playing according to strategy $\pi$.  |
| **$Q^\pi(s,a)$** | State $s$, Action $a$ | Number (value) | “How good is this action in this state under policy $\pi$?” | Expected long-term return starting **from state $s$**, taking action $a$, then following policy $\pi$. | In chess: value of moving a knight to a specific square (action) given the current board state. |



### 🔑 Key Differences

1. **Scope**

   * $V^\pi(s)$ looks at **how desirable a state is overall**.
   * $Q^\pi(s,a)$ looks at **how desirable a specific action is in that state**.

2. **Decision-making granularity**

   * $V^\pi(s)$ is coarse-grained (state-level).
   * $Q^\pi(s,a)$ is fine-grained (state-action level, more detailed).

3. **Relation**

   * $V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a)$
     (state value is the expected action value under the policy).

4. **Use cases**

   * $V^\pi(s)$: Used in **policy evaluation** (how good is my strategy overall?).
   * $Q^\pi(s,a)$: Used in **policy improvement** and **control** (which action should I pick?).
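The relation in item 3 is easy to verify numerically. A minimal sketch, with made-up action probabilities and Q-values for a single state (`state_value` is a hypothetical helper):

```python
def state_value(pi_s, q_s):
    """V(s) = sum_a pi(a|s) * Q(s,a) for one state."""
    return sum(p * q for p, q in zip(pi_s, q_s))


# Hypothetical Q-values of two actions in some state s.
q_s = [2.0, 4.0]

# A stochastic policy averages the action values...
v_stochastic = state_value([0.25, 0.75], q_s)

# ...while a greedy (deterministic) policy recovers the maximum.
v_greedy = state_value([0.0, 1.0], q_s)

print(v_stochastic, v_greedy)
```

Under a greedy policy the weighted sum collapses to $\max_a Q(s,a)$, which is the link between $V^*$ and $Q^*$.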



### ⭐ Intuitive Summary

* **$V$** = *“Value of being in a place (state).”*
* **$Q$** = *“Quality of making a move (action) in that place.”*


# **✅3. Policy & Value Iteration**

> ### **Initialize policy** $\longrightarrow$ **Evaluate policy** $\longleftrightarrow$ **Improve policy** $\longrightarrow$ **Optimal policy**

## 3.1 Overview

* **Goal**: Find the **optimal policy** (π\*) that maximizes expected return in a Markov Decision Process (MDP).
* Two key algorithms:

  * **Policy Iteration (PI)** → Iterative evaluation + improvement of a policy.
  * **Value Iteration (VI)** → Combines evaluation & improvement in a single update, so each sweep is cheaper.

## 3.2 Policy Iteration ($PI$)

### 🔹 Steps

1. **Initialize**

   * Start with a random policy π.

2. **Policy Evaluation**

   * Compute the **state-value function V(s)** under π:

     $$
     V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V^{\pi}(s')]
     $$
   * Iterate until values converge.

3. **Policy Improvement**

   * For each state, pick the action that maximizes the Q-value:

     $$
     \pi'(s) = \arg\max_a Q^{\pi}(s,a)
     $$
   * Update policy.

4. **Repeat** evaluation → improvement until **policy stabilizes** (no further change).



### 🔹 **Implementation Outline**

* **Functions**:

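  * `policy_evaluation(policy)` → Computes V(s) by sweeping `compute_state_value()` over all states until the largest change falls below `theta`.
  * `policy_improvement(V, policy)` → Builds the new policy by computing Q(s,a) with `compute_q_value()` and picking the best action; also reports whether the policy is stable.
  * `policy_iteration()` → Alternates evaluation & improvement until convergence.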
* **Termination condition**:
  Policy does not change between iterations ⇒ π\* found.



### 🔹 **Example: Grid World**

* Apply PI to a grid environment.
* Result: Agent finds **shortest path** to the goal (fewer steps than initial random policy).

⭐**Algorithm Overview**

> **Definition:** Policy Iteration is an algorithm that finds the optimal policy by alternating between policy evaluation and policy improvement until convergence.

⭐**Steps:**
1. **Initialize:** Start with an arbitrary policy $\pi_0$
2. **Policy Evaluation:** Compute $V^\pi$ for the current policy
3. **Policy Improvement:** Create a new policy $\pi'$ by acting greedily with respect to $V^\pi$
4. **Check Convergence:** If $\pi' = \pi$, stop. Otherwise, set $\pi = \pi'$ and go to **Step 2**

In [35]:
import numpy as np
import gymnasium as gym

# -------------------------------
# Create FrozenLake Environment (4x4 grid world)
# -------------------------------
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode=None)
num_states = env.observation_space.n
num_actions = env.action_space.n

# Access transition probabilities
P = env.unwrapped.P

# Discount factor
gamma = 0.9
theta = 1e-6   # convergence threshold

# -------------------------------
# Compute State Value
# -------------------------------
def compute_state_value(state, policy, V):
    """Computes the value of a state under the given policy."""
    action = policy[state]
    value = 0.0
    for prob, next_state, reward, terminated in P[state][action]:
        value += prob * (reward + gamma * V[next_state])
    return value

# -------------------------------
# Policy Evaluation
# -------------------------------
def policy_evaluation(policy):
    V = np.zeros(num_states)
    while True:
        delta = 0
        for state in range(num_states):
            v = V[state]
            V[state] = compute_state_value(state, policy, V)
            delta = max(delta, abs(v - V[state]))
        if delta < theta:
            break
    return V

# -------------------------------
# Compute Q-value
# -------------------------------
def compute_q_value(state, action, V):
    """Q(s,a) = sum over next states [ P(s'|s,a) * (R + gamma*V(s')) ]"""
    q = 0.0
    for prob, next_state, reward, terminated in P[state][action]:
        q += prob * (reward + gamma * V[next_state])
    return q

# -------------------------------
# Policy Improvement
# -------------------------------
def policy_improvement(V, policy):
    stable = True
    for state in range(num_states):
        old_action = policy[state]
        # Choose best action
        action_values = [compute_q_value(state, a, V) for a in range(num_actions)]
        best_action = np.argmax(action_values)
        policy[state] = best_action
        if old_action != best_action:
            stable = False
    return policy, stable

# -------------------------------
# Policy Iteration
# -------------------------------
def policy_iteration():
    # Initialize random policy
    policy = np.random.choice(num_actions, size=num_states)
    while True:
        V = policy_evaluation(policy)
        policy, stable = policy_improvement(V, policy)
        if stable:
            break
    return policy, V

# -------------------------------
# Run Policy Iteration
# -------------------------------
optimal_policy, optimal_V = policy_iteration()

# Pretty print
print("✅ Optimal Policy (per state):")
print(optimal_policy.reshape((4, 4)))  # 4x4 Grid
print("\n✅ Optimal Value Function:")
print(optimal_V.reshape((4, 4)))

# Action mapping for better visualization
action_map = {0: '←', 1: '↓', 2: '→', 3: '↑'}
print("\n✅ Policy with arrows:")
policy_arrows = np.array([action_map[a] for a in optimal_policy]).reshape((4, 4))
print(policy_arrows)

✅ Optimal Policy (per state):
[[0 3 0 3]
 [0 0 0 0]
 [3 1 0 0]
 [0 2 1 0]]

✅ Optimal Value Function:
[[0.06888673 0.06141154 0.07440786 0.05580526]
 [0.09185135 0.         0.11220737 0.        ]
 [0.14543417 0.24749575 0.29961685 0.        ]
 [0.         0.37993513 0.63901979 0.        ]]

✅ Policy with arrows:
[['←' '↑' '←' '↑']
 ['←' '←' '←' '←']
 ['↑' '↓' '←' '←']
 ['←' '→' '↓' '←']]


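**Time Complexity:** Each policy-improvement step costs $O(|S|^2|A|)$, where $|S|$ is the number of states and $|A|$ is the number of actions. Policy evaluation adds $O(|S|^2)$ per sweep for a deterministic policy and may need many sweeps.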
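For a fixed policy, the Bellman expectation equation is a linear system, $V^\pi = R^\pi + \gamma P^\pi V^\pi$, so on a small state space it can be solved directly instead of by repeated sweeps. The sketch below compares both routes on a hypothetical three-state chain; the matrix `P_pi`, the vector `R_pi`, and the function names are assumptions made up for the example.

```python
import numpy as np


def evaluate_policy_exact(P_pi, R_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for V."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)


def evaluate_policy_sweeps(P_pi, R_pi, gamma, theta=1e-10):
    """Iterate V <- R_pi + gamma * P_pi V until the largest change is below theta."""
    V = np.zeros(len(R_pi))
    while True:
        new_V = R_pi + gamma * P_pi @ V
        if np.max(np.abs(new_V - V)) < theta:
            return new_V
        V = new_V


# Hypothetical chain under a fixed policy: state 2 is terminal (all-zero row).
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.0]])
R_pi = np.array([0.0, 0.5, 0.0])  # expected immediate reward per state
gamma = 0.9

V_exact = evaluate_policy_exact(P_pi, R_pi, gamma)
V_sweeps = evaluate_policy_sweeps(P_pi, R_pi, gamma)
print(V_exact)
print(V_sweeps)
```

The direct solve is exact but scales roughly cubically in the number of states, which is why sweeps are preferred once the state space grows.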

## 3.3 Value Iteration ($VI$)



### 🔹 **Idea**

* Speeds up policy iteration by **combining evaluation & improvement** in a single update.
* Instead of fully evaluating a policy each time, we directly update value estimates.

### 🔹 **Steps**

1. **Initialize**

   * Set $V(s) = 0$ for all states.

2. **Iterative Update**

   * For each state:

     $$
     V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')]
     $$
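   * Policy derived as follows, with $Q(s,a) = \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')]$ computed from the current $V$: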

     $$
     \pi(s) = \arg\max_a Q(s,a)
     $$

3. **Convergence**

   * Continue until value updates are below a **threshold (ε)**.



### 🔹 **Implementation Outline**

* **Functions**:

  * `get_max_action_and_value(state, V)` → Returns (max\_action, max\_q\_value).
  * `compute_action_value(state, action, V)` → Computes Q(s,a) using updated V.
* **Loop**:

  * Update new state-values (`new_V`).
  * Derive improved policy simultaneously.
  * Check if `||new_V - V|| < ε`, then stop.



### 🔹 **Key Notes**

* Each sweep uses the **previous iteration's V** directly (no inner `compute_state_value` loop over a fixed policy).
* Converges to the same optimal values as Policy Iteration.
* **Advantage**: Each sweep is cheap, since no full policy evaluation is needed.

### ⭐Algorithm Overview

> **Definition:** Value Iteration combines policy evaluation and improvement in a single step. It directly computes the optimal value function and derives the policy from it.

**Key Insight:** Instead of fully evaluating a policy, perform only one sweep of policy evaluation followed by policy improvement.

**Bellman Optimality Equation:**
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + γV^*(s')]$$

### ⭐Algorithm Steps

1. **Initialize:** $V(s)$ = 0 for all states
2. **Value Update:** For each state, compute the maximum expected value over all actions
3. **Policy Extraction:** Choose actions that achieve the maximum value
4. **Convergence Check:** Stop when value changes are below threshold


In [4]:
import gymnasium as gym
import numpy as np

# Environment setup
env = gym.make('FrozenLake-v1', is_slippery=True, render_mode=None)
mdp_env = env.unwrapped  # access the underlying MDP to get .P
num_states = mdp_env.observation_space.n
num_actions = mdp_env.action_space.n
terminal_state = 15  # Goal state in FrozenLake 4x4
gamma = 0.9  # Discount factor

def value_iteration(threshold=0.001):
    """Value iteration algorithm."""
    # Initialize
    V = {state: 0 for state in range(num_states)}
    policy = {state: 0 for state in range(num_states)}

    while True:
        new_V = {state: 0 for state in range(num_states)}

        for state in range(num_states):
            if state == terminal_state:
                new_V[state] = 0
                continue

            # Compute Q-values for all actions
            Q_values = []
            for action in range(num_actions):
                q_val = 0
                for prob, next_state, reward, done in mdp_env.P[state][action]:
                    q_val += prob * (reward + gamma * V[next_state])
                Q_values.append(q_val)

            # Take maximum
            max_q_value = max(Q_values)
            max_action = int(np.argmax(Q_values))

            new_V[state] = max_q_value
            policy[state] = max_action

        # Check convergence
        if all(abs(new_V[s] - V[s]) < threshold for s in range(num_states)):
            break

        V = new_V

    return policy, V

def get_max_action_and_value(state, V):
    """Helper function to get optimal action and value for a state."""
    Q_values = []
    for action in range(num_actions):
        q_val = 0
        for prob, next_state, reward, done in mdp_env.P[state][action]:
            q_val += prob * (reward + gamma * V[next_state])
        Q_values.append(q_val)

    max_action = int(np.argmax(Q_values))
    max_q_value = Q_values[max_action]

    return max_action, max_q_value

# Test the functions
if __name__ == "__main__":
    optimal_policy, optimal_V = value_iteration()
    print("Optimal Policy:", optimal_policy)
    print("Optimal Values:", optimal_V)

    # Test helper function
    test_state = 5
    action, value = get_max_action_and_value(test_state, optimal_V)
    print(f"State {test_state}: Best action = {action}, Value = {value}")

Optimal Policy: {0: 0, 1: 3, 2: 0, 3: 3, 4: 0, 5: 0, 6: 0, 7: 0, 8: 3, 9: 1, 10: 0, 11: 0, 12: 0, 13: 2, 14: 1, 15: 0}
Optimal Values: {0: 0.06162274283994246, 1: 0.05531399137342944, 2: 0.0699622244159502, 3: 0.05101702913301784, 4: 0.08519461431783229, 5: 0.0, 6: 0.10976851693787404, 7: 0.0, 8: 0.13996615432409334, 9: 0.24373109529624215, 10: 0.2969629949287245, 11: 0.0, 12: 0.0, 13: 0.3771539838433998, 14: 0.6375395830635082, 15: 0}
State 5: Best action = 0, Value = 0.0


**Time Complexity:** $O(|S|^2|A|)$ per sweep. Each sweep is cheaper than a full policy-iteration step, but value iteration usually needs more sweeps to converge.
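The recorded runs use different stopping thresholds (`theta = 1e-6` for policy iteration, `threshold=0.001` for value iteration), which is one reason the two printed value tables differ slightly. A standard bound says that if the last sweep changed the values by at most $\delta$, the result is within $\frac{\gamma}{1-\gamma}\,\delta$ of $V^*$. The sketch below checks this on a hypothetical one-state MDP with a self-loop and reward 1, where $V^* = 1/(1-\gamma)$ is known in closed form; `iterate_until` is a made-up helper.

```python
def iterate_until(threshold, gamma=0.9, reward=1.0):
    """Value iteration on a one-state, one-action MDP with a self-loop."""
    V = 0.0
    while True:
        new_V = reward + gamma * V
        change = abs(new_V - V)
        V = new_V
        if change < threshold:
            return V, change


gamma = 0.9
v_star = 1.0 / (1.0 - gamma)  # fixed point of V = 1 + gamma * V

results = []
for threshold in (1e-3, 1e-6):
    V, change = iterate_until(threshold, gamma)
    error = abs(v_star - V)
    bound = gamma / (1.0 - gamma) * change
    results.append((threshold, error, bound))
    print(f"threshold={threshold:g}  error={error:.2e}  bound={bound:.2e}")
```

With $\gamma = 0.9$ the factor $\gamma/(1-\gamma)$ is 9, so a threshold of `0.001` can still leave errors of several thousandths; tightening the threshold shrinks the error proportionally.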

## 3.4 Quick Comparison: $PI$ Vs $VI$

| Aspect                 | **Policy Iteration**                                                                                                                                                                                              | **Value Iteration**                                                                                                                                                                               |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Approach**           | Two clear steps: (1) **Policy Evaluation** – compute how good the current policy is (using $V^\pi$), (2) **Policy Improvement** – update the policy to be greedy w\.r.t. those values. These two steps alternate. | Blends evaluation and improvement into **one step** by directly updating the value function toward optimality ($V^*$) using the Bellman optimality equation.                                      |
| **Convergence**        | Converges in a **finite number of iterations** (guaranteed to find the optimal policy after some steps).                                                                                                          | Converges **asymptotically**: values get closer to $V^*$ with each update but only *truly* converge after infinite updates. In practice, we stop once changes are very small (below a threshold). |
| **Per Iteration Cost** | **High** – because policy evaluation usually requires solving or approximating a system of equations (can take many sweeps over all states).                                                                      | **Low** – each iteration just does one value update per state (using the Bellman backup). Much cheaper per iteration.                                                                             |
| **Total Iterations**   | **Fewer** – since each iteration makes a “big jump” by fully evaluating the policy.                                                                                                                               | **More** – since each step is small and incremental. Needs many sweeps to reach near-optimal values.                                                                                              |
| **Memory**             | Must store an **explicit policy** (mapping from states to actions) alongside the value function.                                                                                                                  | Policy is **implicitly derived** from the value function (choose the action that maximizes the next-state value).                                                                                 |
| **Practical Use**      | Works best for **small or medium-sized state spaces** where exact policy evaluation is feasible.                                                                                                                  | Preferred for **large or complex state spaces**, since it avoids heavy evaluation and still reaches a good approximation.                                                                         |



### **Intuition**

- **Policy Iteration = “Think hard, act big”** → Each step is expensive but fewer steps are needed.
- **Value Iteration = “Think fast, act small”** → Each step is cheap but you need more of them.

- **Both PI and VI converge to the same optimal policy.**
- Choice depends on efficiency needs:
  - PI → Conceptually simpler, good for learning.
  - VI → Faster, practical for large environments.

# **✅4. Summary**

## 4.1 Value Functions

- **State Value:** $V^π(s) = \mathbb{E}[G_t | S_t = s, π]$
- **Action Value:** $Q^π(s,a) = \mathbb{E}[G_t | S_t = s, A_t = a, π]$
- **Return:** $G_t = \sum_{k=0}^{\infty} γ^k R_{t+k+1}$
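The return formula can be evaluated for a finite episode by accumulating rewards from the last step backwards. A minimal sketch, assuming a short hypothetical reward sequence ($R_{t+1}, R_{t+2}, \dots$) and a made-up helper name `discounted_return`:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed backwards as G = r + gamma * G."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: no reward for two steps, then reward 1 at the goal.
rewards = [0.0, 0.0, 1.0]
g0 = discounted_return(rewards, gamma=0.9)
print(f"G_0 = {g0:.2f}")
```

The single reward arrives two steps late, so it is discounted by $\gamma^2 = 0.81$.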

## 4.2 Bellman Equations

- **State Value Bellman:** $V^π(s) = \sum_a π(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + γV^π(s')]$
- **Action Value Bellman:** $Q^π(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + γ\sum_{a'} π(a'|s')Q^π(s',a')]$
- **Optimality Bellman:** $V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + γV^*(s')]$

## 4.3 Practical Tips and Best Practices

### Implementation Considerations
- **Convergence Threshold:** Choose appropriate threshold (e.g., 0.001) based on precision needs
- **Discount Factor:** γ close to 1 emphasizes future rewards; closer to 0 emphasizes immediate rewards
- **Terminal States:** Always handle terminal states separately (return 0 or fixed value)

### Common Pitfalls
- **Infinite Loops:** Ensure proper convergence checks in iterative algorithms
- **State Indexing:** Be careful with state numbering and terminal state handling
- **Action Spaces:** Verify action space size matches expected number of actions

### Extensions and Advanced Topics
- **Stochastic Policies:** Extend to probabilistic action selection
- **Function Approximation:** Use neural networks for large state spaces
- **Model-Free Methods:** Q-Learning and SARSA for unknown environments
- **Policy Gradient Methods:** Direct policy optimization without value functions

## 4.4 Code Examples Repository

In [None]:
import gymnasium as gym
import numpy as np

class GridWorldMDP:
    def __init__(self, env_name='FrozenLake-v1'):
        self.env = gym.make(env_name)
        # P[s][a] -> [(prob, next_state, reward, done), ...]
        self.P = self.env.unwrapped.P
        self.num_states = self.env.observation_space.n
        self.num_actions = self.env.action_space.n
        self.gamma = 1.0
        self.terminal_state = self.num_states - 1

    def q_value(self, state, action, V):
        """Expected one-step return: sum over s' of P(s'|s,a) * (r + gamma * V(s'))."""
        return sum(prob * (reward + self.gamma * V[next_state])
                   for prob, next_state, reward, _ in self.P[state][action])

    def policy_evaluation(self, policy, threshold=0.001):
        """Sweep a deterministic policy until the largest value change is below threshold."""
        V = {s: 0.0 for s in range(self.num_states)}
        while True:
            delta = 0.0
            for state in policy:  # terminal state is not in the policy, so V stays 0
                new_value = self.q_value(state, policy[state], V)
                delta = max(delta, abs(new_value - V[state]))
                V[state] = new_value
            if delta < threshold:
                return V

    def policy_improvement(self, V):
        """Greedy policy with respect to V (terminal state excluded)."""
        return {state: int(np.argmax([self.q_value(state, a, V)
                                      for a in range(self.num_actions)]))
                for state in range(self.num_states - 1)}

    def policy_iteration(self, threshold=0.001, max_iterations=1000):
        """Policy iteration: alternate evaluation and greedy improvement."""
        policy = {i: 0 for i in range(self.num_states - 1)}

        # max_iterations guards against cycling between tied policies
        for _ in range(max_iterations):
            V = self.policy_evaluation(policy, threshold)
            new_policy = self.policy_improvement(V)
            if new_policy == policy:
                break
            policy = new_policy

        return policy, V

    def value_iteration(self, threshold=0.001):
        """Value iteration: repeated Bellman optimality backups, then greedy extraction."""
        V = {s: 0.0 for s in range(self.num_states)}

        while True:
            new_V = V.copy()
            for state in range(self.num_states - 1):  # skip the terminal state
                new_V[state] = max(self.q_value(state, a, V)
                                   for a in range(self.num_actions))
            converged = all(abs(new_V[s] - V[s]) < threshold for s in V)
            V = new_V
            if converged:
                break

        # Extract the greedy policy from the converged values
        policy = self.policy_improvement(V)
        return policy, V

# Usage
mdp = GridWorldMDP()
optimal_policy, optimal_values = mdp.value_iteration()
print("Optimal Policy:", optimal_policy)
print("Optimal Values:", optimal_values)