# 🧊 Lab 2 — Optimal Policy for Frozen Lake
In this lab we will continue our exploration of Markov Decision Processes (MDPs) 
using the simplified **FrozenLake** environment. Building on Lab 2, we will 
practice three key ideas:

1. **Iterative Policy Evaluation:** We compute the value of a given policy by repeatedly applying the Bellman expectation update until convergence, and compare the result with the exact closed-form solution from Lab 2.

2. **Monte Carlo Simulation:** We estimate state values empirically by running many episodes in the FrozenLake environment under the same policy, and compare these estimates with our analytical results.

3. **Finding the Optimal Policy (Value Iteration):** We apply value iteration to compute the optimal state values and extract the corresponding optimal policy, which maximizes long-term return.

In [14]:
import gymnasium as gym
import numpy as np
np.set_printoptions(precision=2, suppress=True)
import random
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

In [15]:
# Define a smaller 3x3 map
DESC_3x3 = [
    "SFF",
    "FHF",
    "FFG",
]
env = gym.make("FrozenLake-v1", desc=DESC_3x3, is_slippery=True, render_mode="ansi")
obs, info = env.reset(seed=42)
#print(env.render())

In [16]:
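# Transition counts out of 3 for each action: entry [s_next][s] is the number
# of the three equally likely slip outcomes that move the agent from state s
# to state s_next, so each column should sum to 3.
P_LEFT = [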
    [2, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 3, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 2, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
]

P_DOWN = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 3, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 2, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 2],
]

P_RIGHT = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 3, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 2],
]

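# UP slips to LEFT or RIGHT with probability 1/3 each. As in the other three
# matrices, the hole (state 4) is absorbing and the goal (state 8) is treated
# like an ordinary cell.
P_UP = [
    [2, 1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 3, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]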
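# Transpose to [action, s, s_next] and divide by 3 to turn counts into probabilities.
P_all = np.array([P_LEFT, P_DOWN, P_RIGHT, P_UP]).transpose(0, 2, 1) / 3

# Deterministic policy: one action per state (0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP).
POLICY = [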
    1,  # state 0 (top-left)
    2,  # state 1
    1,  # state 2
    1,  # state 3
    1,  # state 4 (center, hole)
    1,  # state 5
    2,  # state 6
    2,  # state 7
    2,  # state 8 (goal state)
]

n_states = len(POLICY)
P_pi = np.zeros((n_states, n_states))
for s in range(n_states):
    a = POLICY[s]             # action chosen at state s
    P_pi[s, :] = P_all[a, s]  # copy the probabilities for that action

# Reward collected in each state: 1 at the goal, 0 everywhere else.
Reward = np.array([
    0,  # state 0 (top-left)
    0,  # state 1
    0,  # state 2
    0,  # state 3
    0,  # state 4 (center, hole)
    0,  # state 5
    0,  # state 6
    0,  # state 7
    1,  # state 8 (goal state)
])
gamma = 0.9

## Part 1: Iterative Policy Evaluation

In Lab 2, we solved for the state-value function analytically using the 
closed-form expression:

$$
V = (I - \gamma P_\pi)^{-1} R
$$
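
For reference, the closed form follows from writing the Bellman expectation equation for all states at once. This short derivation assumes $\gamma < 1$ and that every row of $P_\pi$ sums to 1, which is enough for $I - \gamma P_\pi$ to be invertible:

$$
V = R + \gamma P_\pi V
\;\Longrightarrow\; (I - \gamma P_\pi)\, V = R
\;\Longrightarrow\; V = (I - \gamma P_\pi)^{-1} R
$$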

In [19]:
# Solve the linear system (I - gamma * P_pi) V = Reward for the policy defined above
I = np.eye(n_states)
A = I - gamma * P_pi
V = np.linalg.solve(A, Reward)
print("State values V:", V)

While exact, this approach requires solving an $n \times n$ linear system (or 
inverting the matrix), which costs $O(n^3)$ with a dense solver and becomes 
expensive for large state spaces. An alternative is **iterative policy evaluation**, 
where we repeatedly apply the Bellman expectation update:

$$
V_{k+1}(s) \;=\; R(s) \;+\; \gamma \sum_{s'} P_\pi(s,s') \, V_k(s')
$$

### Steps
1. Initialize the value function arbitrarily (often $V_0 = R$ or $V_0 = 0$).  
2. Update all state values simultaneously using the Bellman update.  
3. Repeat for a number of iterations, or until values converge.  

In this exercise, we will:
- Start with $V_0 = R$.  
- Run the update for about 50 iterations.  
- Observe how the values converge to the same result as the closed-form 
  solution from Lab 2.  
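
The update above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lab solution: the function name `iterative_policy_evaluation` and the two-state chain `P_toy`, `R_toy` are hypothetical, chosen only so the block runs on its own. The same function should accept `P_pi`, `Reward`, and `gamma` from the cells above.

```python
import numpy as np

def iterative_policy_evaluation(P_pi, R, gamma, n_iters=50):
    """Apply V <- R + gamma * P_pi @ V for n_iters sweeps, starting from V_0 = R."""
    R = np.asarray(R, dtype=float)
    V = R.copy()
    for _ in range(n_iters):
        V = R + gamma * P_pi @ V  # synchronous update of all states
    return V

# Hypothetical 2-state chain (not the FrozenLake model): state 0 stays or moves
# to state 1 with equal probability; state 1 is absorbing and pays reward 1.
P_toy = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
R_toy = np.array([0.0, 1.0])
gamma_toy = 0.9

V_iter = iterative_policy_evaluation(P_toy, R_toy, gamma_toy, n_iters=50)
V_exact = np.linalg.solve(np.eye(2) - gamma_toy * P_toy, R_toy)
print("iterative:", V_iter)
print("exact:    ", V_exact)
print("max abs difference:", np.max(np.abs(V_iter - V_exact)))
```

Each sweep shrinks the error by at most a factor of $\gamma$, so after 50 iterations with $\gamma = 0.9$ a gap of roughly $0.9^{50} \approx 0.005$ times the initial error can remain. Running more iterations, or stopping on a tolerance, closes it further.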

In [None]:
# Your turn to work on it

## Part 2: Monte Carlo Simulation

In Part 1, we evaluated a fixed policy analytically and iteratively.  
Now we take a different approach: **simulation**.

The idea is to estimate the value of each state by running many episodes 
of the FrozenLake environment under the same policy, and then averaging 
the observed discounted returns.

Formally, the value of a state is defined as:

$$
V^\pi(s) = \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} 
    \gamma^t R_{t+1} \;\Big|\; S_0 = s \Big]
$$

Monte Carlo methods approximate this expectation by repeated sampling:

1. Start from a given state $s$.  
2. Run an episode by following the policy $\pi$, recording the sequence 
   of rewards.  
3. Compute the discounted return $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$.  
4. Repeat many times and take the **average return** as an estimate of $V^\pi(s)$.  
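
Step 3 can be computed with a single backward pass over the rewards of one episode. The helper name `discounted_return` and the example reward list are hypothetical and only illustrate the arithmetic.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    G = 0.0
    for r in reversed(rewards):  # backward pass: G_t = r_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
    return G

# Hypothetical episode: two steps without reward, then a reward of 1.
episode_rewards = [0.0, 0.0, 1.0]
G = discounted_return(episode_rewards, gamma=0.9)
print(G)  # approximately 0.9**2
```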

### What to do
- Run many episodes (e.g., 5,000) under your chosen policy.  
- For each state, average the discounted returns observed from that state to obtain its empirical value.  
- Compare them with the results from **Part 1** (iterative evaluation) 
  and **Lab 2** (closed-form solution).  
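
As a rough sketch of the averaging step, the block below samples episodes from a tabular transition matrix instead of the gymnasium environment, so that it runs on its own and can be checked against the closed form. The function `mc_state_value`, the three-state chain, and the choice of state 2 as a zero-reward terminal state are all assumptions made for illustration.

```python
import numpy as np

def mc_state_value(P_pi, R, gamma, start, terminal, n_episodes=5000, seed=0):
    """Estimate V(start) as the average discounted return over sampled episodes.

    Rewards follow the convention V(s) = R(s) + gamma * E[V(s')]: R(s) is
    collected in state s before moving on. An episode ends when it reaches a
    state in `terminal`; those states are assumed to carry zero reward.
    """
    rng = np.random.default_rng(seed)
    n_states = len(R)
    returns = np.empty(n_episodes)
    for i in range(n_episodes):
        s, G, discount = start, 0.0, 1.0
        while s not in terminal:
            G += discount * R[s]                      # reward of the current state
            discount *= gamma
            s = int(rng.choice(n_states, p=P_pi[s]))  # sample the next state
        returns[i] = G
    return returns.mean()

# Hypothetical chain: 0 -> {0, 1} with equal probability, 1 -> 2, 2 is terminal.
P_toy = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
R_toy = np.array([0.0, 1.0, 0.0])

V_mc = mc_state_value(P_toy, R_toy, gamma=0.9, start=0, terminal={2})
V_exact = np.linalg.solve(np.eye(3) - 0.9 * P_toy, R_toy)
print("Monte Carlo estimate of V(0):", V_mc)
print("Closed-form V(0):            ", V_exact[0])
```

When the episodes come from the gymnasium environment instead, keep in mind that it pays the reward of 1 on the transition into the goal and then terminates, whereas the tabular model above attaches `Reward` to the state itself. The empirical and analytical values may therefore differ by this reward-timing convention, not only by sampling noise.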

In [None]:
# Your turn to work on it

## Part 3: Finding the Optimal Policy (Value Iteration)

So far, we have focused on **policy evaluation** — estimating the value of a 
given policy. In this part, we move to **control**, where the goal is to find 
the **optimal policy** that maximizes long-term returns.

The key idea is to apply **value iteration**, which combines evaluation and 
improvement into a single update rule:

$$
V_{k+1}(s) \;=\; \max_a \Big[ R(s) + \gamma \sum_{s'} P(s'|s,a) \, V_k(s') \Big].
$$

Once the value function converges, we extract the optimal policy by choosing, 
in each state, the action that achieves the maximum:

$$
\pi^*(s) = \arg\max_a \Big[ R(s) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \Big].
$$

### Steps
1. Initialize the value function $V_0(s)$ arbitrarily (e.g., all zeros).  
2. Repeatedly update each state’s value using the **max over actions** rule.  
3. Stop when values converge (the updates change very little).  
4. Derive the optimal policy $\pi^*$ by picking the greedy action at each state.  
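
The four steps can be sketched as follows. This is one possible vectorized formulation under stated assumptions, not the required solution: the function `value_iteration`, its `tol` and `max_iters` parameters, and the two-state, two-action toy MDP are hypothetical. The transition array is assumed to be indexed as `P[a, s, s_next]`, the same layout as `P_all`.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Return (V, policy) for P of shape (n_actions, n_states, n_states)."""
    R = np.asarray(R, dtype=float)
    V = np.zeros(P.shape[1])                  # step 1: V_0 = 0
    for _ in range(max_iters):
        Q = R[None, :] + gamma * (P @ V)      # Q[a, s], one row per action
        V_new = Q.max(axis=0)                 # step 2: max over actions
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:                       # step 3: stop when updates are tiny
            break
    policy = (R[None, :] + gamma * (P @ V)).argmax(axis=0)  # step 4: greedy action
    return V, policy

# Hypothetical MDP: state 1 is absorbing and pays 1. In state 0, action 0
# stays put, while action 1 reaches state 1 with probability 0.8.
P_toy = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0
    [[0.2, 0.8], [0.0, 1.0]],   # action 1
])
R_toy = np.array([0.0, 1.0])

V_star, pi_star = value_iteration(P_toy, R_toy, gamma=0.9)
print("optimal values:", V_star)
print("optimal policy:", pi_star)
```

Ties are broken by `argmax` in favor of the lowest action index, so states where several actions are equally good will report the first of them.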

### What to do
- Implement value iteration with your transition model `P_all` and the reward vector `Reward`.  
- Print the **optimal state values**.  
- Print the **optimal policy** as action indices (0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP).  
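
Action indices are easier to read when laid out on the map. The helper below is a small optional sketch: the name `policy_to_grid` and the ASCII symbols are hypothetical choices. It assumes the row-major state numbering used throughout this lab (state = row × number of columns + column).

```python
ACTION_SYMBOLS = {0: "<", 1: "v", 2: ">", 3: "^"}  # LEFT, DOWN, RIGHT, UP

def policy_to_grid(policy, desc):
    """Return one string per map row: H and G cells as-is, arrows elsewhere."""
    n_cols = len(desc[0])
    rows = []
    for r, row in enumerate(desc):
        cells = []
        for c, cell in enumerate(row):
            if cell in "HG":
                cells.append(cell)  # holes and the goal need no action
            else:
                cells.append(ACTION_SYMBOLS[int(policy[r * n_cols + c])])
        rows.append(" ".join(cells))
    return rows

# Example with the 3x3 map and the hand-designed policy from the setup cells.
desc = ["SFF", "FHF", "FFG"]
example_policy = [1, 2, 1, 1, 1, 1, 2, 2, 2]
grid = policy_to_grid(example_policy, desc)
print("\n".join(grid))
```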


In [None]:
# Your turn to work on it