# 🧊 Lab 7: Frozen Lake Revisited and Farewell

In this lab, we **return once more to the Frozen Lake** — now larger and more mysterious than before.  
This lab is designed to **tie everything together** through a focused review and practical reflection.


## 🎯 Learning Goals

- Revisit the **core ideas** of **Value Iteration (VI)** and **Policy Iteration (PI)**.  
- Understand how **Monte Carlo** and **Temporal-Difference (TD)** methods extend the same foundation to unknown environments.  
- Gain intuition for how **model-based** and **model-free** reinforcement learning relate.

In [1]:
import gymnasium as gym
import numpy as np
np.set_printoptions(precision=2, suppress=True)
import random
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
from util_frozen import *

In [2]:
# Define a 5x5 map
DESC_5x5 = [
    "SFFFF",
    "FFHFF",
    "FHHFF",
    "FHHFF",
    "FFFFG",
]
env = gym.make("FrozenLake-v1", desc=DESC_5x5, is_slippery=True, render_mode="ansi")
obs, info = env.reset(seed=42)
print(env.render())


SFFFF
FFHFF
FHHFF
FHHFF
FFFFG



In [3]:
# Get the state transition matrix (indexed as P[s, a, s'], judging by the slicing below)
P, R, absorbing, shape2d, flatmap = build_frozenlake_transitions(DESC_5x5, is_slippery=True)
# One (n_states, n_states) matrix per action, stacked to shape (n_actions, n_states, n_states)
T_per_action = [P[:, a, :] for a in range(4)]
P_all = np.array(T_per_action)

In [4]:
n_states = 25                               # 5 x 5 grid
Reward = np.zeros((n_states,), dtype=int)   # state reward R(s)
Reward[-1] = 1                              # reward of 1 in the goal state (last cell)
gamma = 0.9                                 # discount factor

## 🗺️ Part 1 – Revisiting the Lake

First we re-implement the **value iteration** approach, which combines evaluation and 
improvement into a single update rule:

$$
V_{k+1}(s) \;=\; \max_a \Big[ R(s) + \gamma \sum_{s'} P(s'|s,a) \, V_k(s') \Big].
$$

Once the value function converges, we extract the optimal policy by choosing, 
in each state, the action that achieves the maximum:

$$
\pi^*(s) = \arg\max_a \Big[ R(s) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \Big].
$$
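
The two formulas above can be sketched on a toy problem. This is a minimal sketch under stated assumptions rather than the lab solution: it assumes the same array layout as `P_all`, shape `(n_actions, n_states, n_states)`, and the three-state chain and the names `value_iteration`, `P_toy`, and `R_toy` are hypothetical, introduced only for illustration.

```python
import numpy as np

def value_iteration(P_all, reward, gamma, num_iter=500, tol=1e-10):
    """Value iteration with state rewards R(s).

    P_all[a, s, s2] is assumed to hold P(s2 | s, a).
    Returns the converged values and a greedy policy.
    """
    n_states = P_all.shape[1]
    V = np.zeros(n_states, dtype=float)
    for _ in range(num_iter):
        # Q[a, s] = R(s) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = reward[None, :] + gamma * (P_all @ V)
        V_next = Q.max(axis=0)              # maximize over actions
        converged = np.max(np.abs(V_next - V)) < tol
        V = V_next                          # V_next is a fresh array, so no aliasing
        if converged:
            break
    # Greedy policy extraction from the final values
    Q = reward[None, :] + gamma * (P_all @ V)
    return V, Q.argmax(axis=0)

# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# State 2 is an absorbing goal that pays reward 1 on every step.
P_toy = np.array([
    np.eye(3),                              # stay
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]],      # move right
], dtype=float)
R_toy = np.array([0.0, 0.0, 1.0])

V_toy, pi_toy = value_iteration(P_toy, R_toy, gamma=0.9)
print(V_toy, pi_toy)
```

On this chain the goal state is worth $1/(1-\gamma) = 10$, and each step further from the goal multiplies the value by $\gamma$, giving approximately 9 and 8.1 for the other two states.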

In [None]:
# Initialize the value function to zero
V_0 = np.zeros_like(Reward, dtype=float)
V_new = np.zeros_like(V_0, dtype=float)
NUM_ITER = 150
NUM_ACT = P_all.shape[0]
for i_step in np.arange(NUM_ITER):
    for i_state in np.arange(n_states):
        action_values = np.zeros((NUM_ACT,), dtype=float)

        #### Your time to work on it

        ######
        
    V_0 = V_new.copy()  # copy, so the next sweep reads V_k rather than the array being overwritten

    if i_step % 10 == 0:
        print(f"iter:{i_step}, values: {V_new}")

In [None]:
render_value_grid(V_new, DESC_5x5, title="5x5 FrozenLake – Value Iteration")

Next, we re-implement the **Truncated (Modified) Policy Iteration** algorithm, which alternates between **partial policy evaluation** ($m$ sweeps under the current policy $\pi_k$) and **policy improvement**:

$$
V^{(j+1)}(s) \;=\; R(s) \;+\; \gamma \sum_{s'} P(s' \mid s, \pi_k(s))\, V^{(j)}(s'),
\quad j = 0,1,\dots,m-1,
$$

followed by:

$$
\pi_{k+1}(s) \;=\; \arg\max_{a} \Big[ \, R(s) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{(m)}(s') \, \Big].
$$
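
The alternation above can be sketched as follows. This is a minimal sketch, not the lab solution: it assumes the `(n_actions, n_states, n_states)` layout of `P_all`, it warm-starts each evaluation phase from the previous values (the formulas leave $V^{(0)}$ unspecified), and `truncated_policy_iteration`, `P_chain`, and `R_chain` are hypothetical names for a three-state toy chain.

```python
import numpy as np

def truncated_policy_iteration(P_all, reward, gamma, num_outer=50, num_eval=10):
    """Modified policy iteration: num_eval evaluation sweeps, then one improvement."""
    n_actions, n_states, _ = P_all.shape
    V = np.zeros(n_states, dtype=float)
    policy = np.zeros(n_states, dtype=int)
    states = np.arange(n_states)
    for _ in range(num_outer):
        # Partial policy evaluation: row s of P_pi is P(. | s, policy[s])
        P_pi = P_all[policy, states, :]
        for _ in range(num_eval):
            V = reward + gamma * (P_pi @ V)   # builds a new array each sweep
        # Policy improvement: greedy with respect to the partially evaluated V
        Q = reward[None, :] + gamma * (P_all @ V)
        policy = Q.argmax(axis=0)
    return V, policy

# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# State 2 is an absorbing goal that pays reward 1 on every step.
P_chain = np.array([
    np.eye(3),                                # stay
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]],        # move right
], dtype=float)
R_chain = np.array([0.0, 0.0, 1.0])

V_chain, pi_chain = truncated_policy_iteration(P_chain, R_chain, gamma=0.9)
print(V_chain, pi_chain)
```

With `num_eval = 1` this reduces to value iteration, and as `num_eval` grows it approaches full policy iteration, so the parameter trades cost per outer step against the number of outer steps.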


In [None]:
# Initialize the value function to zero
V_0 = np.zeros_like(Reward, dtype=float)
V_new = np.zeros_like(V_0, dtype=float)
# Initialize the policy (action 0 in every state)
PI_0 = np.zeros_like(Reward)
PI_new = np.zeros_like(Reward)
NUM_ITER = 40
NUM_ACT = P_all.shape[0]
NUM_PE = 10

for i_step in np.arange(NUM_ITER):
    for i_VI in np.arange(NUM_PE): 
    #### Your time to work on it ######
        
        for i_state in np.arange(n_states):
            
            chosen_action =  # ......
            V_new[i_state] = # ......
           
            
    for i_state in np.arange(n_states):
        action_values = np.zeros((NUM_ACT,), dtype=float)
        for i_action in np.arange(NUM_ACT):
            action_values[i_action] = # .....
        PI_new[i_state] = # ......
        
     #####
        
    V_0 = V_new
    PI_0 = PI_new
    print(f"iter:{i_step}, values: {V_new}, policy: {PI_new}")

In [None]:
render_value_and_policy_grid(V_new, PI_new, DESC_5x5, title="Values + Policy")

# 🍀 Part 2 — Monte Carlo (MC) Control with $ε$-Greedy Policy


In this section we implement **Monte Carlo $ε$-Greedy Control**, a practical variant of **Monte Carlo Exploring Starts** that does **not** require special start states.

---

## 📘 Algorithm: MC ε-Greedy Control

**Goal:** Search for an optimal policy $π^*$ and action-value function $q^*$ through sampling and $ε$-greedy improvement.

### Initialization
- Initialize policy $\pi_0(a | s)$ and action-value function $q(s, a)$ for all $(s, a)$.  
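- Initialize the cumulative returns `R(s,a) = 0` and the visit counts `N(s,a) = 0` for all $(s, a)$. Here `R(s,a)` is a running sum of sampled returns, not the state reward $R(s)$ from Part 1.  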
- Choose an exploration parameter $ε \in (0, 1]$.


In [56]:
def epsilon_greedy_action(Q, s, epsilon):
    """Sample ε-greedy action using action-values Q."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)         # explore
    return np.argmax(Q[s])                          # exploit

In [None]:
import numpy as np

n_states = env.observation_space.n     # total number of states
n_actions = env.action_space.n         # total number of actions
# Action–value function q(s,a)
Q = np.zeros((n_states, n_actions), dtype=float)
Q[-1, :] = 10  # goal-state value; equals 1 / (1 - gamma) for a reward of 1 per step with gamma = 0.9
# Cumulative returns R(s,a)
Returns = np.zeros_like(Q, dtype=float)
# Visit counts N(s,a)
Num = np.zeros_like(Q, dtype=float)

# Exploration parameter
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.95

pi = np.zeros(n_states, dtype=int)

print("Initialization complete:")
print(f"States: {n_states}, Actions: {n_actions}, ε = {epsilon}")


### For each episode
1. **Episode Generation**  
   - Choose a starting state–action pair $(s_0, a_0)$, or simply start from the default start state.  
   - Follow the **current ε-greedy policy** $\pi$ to generate an episode of length $T$:  
     $(s_0, a_0, r_1, s_1, a_1, \dots, s_{T-1}, a_{T-1}, r_T)$.
2. **Initialization for this episode:** $g = 0$

3. **Backward Return Computation**  
   For each step of the episode, $t = T - 1, T - 2, \ldots, 0$, do the following:
   $$
   g \leftarrow \gamma g + r_{t+1}
   $$
   $$
   \text{R}(s_t, a_t) \leftarrow \text{R}(s_t, a_t) + g
   $$
   $$
   \text{N}(s_t, a_t) \leftarrow \text{N}(s_t, a_t) + 1
   $$
   $$
   q(s_t, a_t) \leftarrow \frac{\text{R}(s_t, a_t)}{\text{N}(s_t, a_t)}
   $$
4. **Policy Improvement**
   - For each state $s_t$ visited in the episode, let $a^* = \arg\max_a q(s_t, a)$.  
   - Update $\pi$ to be $\varepsilon$-greedy with respect to $q$:

     $$
     \pi(a|s_t) =
     \begin{cases}
     1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s_t)|}, & a = a^* \\
     \dfrac{\varepsilon}{|\mathcal{A}(s_t)|}, & a \neq a^*
     \end{cases}
     $$
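
Steps 3 and 4 can be sketched on a hand-written episode. This is a minimal sketch under stated assumptions: it uses every-visit averaging as in the update rules above, `rewards[t]` is assumed to store $r_{t+1}$, and `mc_update`, `epsilon_greedy_probs`, and the toy episode are hypothetical names and data introduced only for illustration.

```python
import numpy as np

def mc_update(Q, returns_sum, visit_count, states, actions, rewards, gamma):
    """Every-visit Monte Carlo update from one episode, processed backward."""
    g = 0.0
    for t in range(len(actions) - 1, -1, -1):
        s_t, a_t = states[t], actions[t]
        g = gamma * g + rewards[t]            # rewards[t] holds r_{t+1}
        returns_sum[s_t, a_t] += g            # R(s_t, a_t) <- R(s_t, a_t) + g
        visit_count[s_t, a_t] += 1            # N(s_t, a_t) <- N(s_t, a_t) + 1
        Q[s_t, a_t] = returns_sum[s_t, a_t] / visit_count[s_t, a_t]
    return Q

def epsilon_greedy_probs(Q, epsilon):
    """Return pi(a | s) for the epsilon-greedy policy derived from Q."""
    n_states, n_actions = Q.shape
    probs = np.full((n_states, n_actions), epsilon / n_actions)
    probs[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - epsilon
    return probs

# Toy episode with 3 states and 2 actions: s0=0, a0=1, r1=0, s1=1, a1=1, r2=1, s2=2
Q_demo = np.zeros((3, 2))
returns_demo = np.zeros_like(Q_demo)
counts_demo = np.zeros_like(Q_demo)
mc_update(Q_demo, returns_demo, counts_demo,
          states=[0, 1, 2], actions=[1, 1], rewards=[0.0, 1.0], gamma=0.9)

pi_demo = epsilon_greedy_probs(Q_demo, epsilon=0.2)
print(Q_demo)
print(pi_demo)
```

Working backward, the last step gets $g = 1$ and the first step gets $g = 0.9 \cdot 1 + 0 = 0.9$, so `Q_demo[1, 1]` is 1 and `Q_demo[0, 1]` is 0.9. With $\varepsilon = 0.2$ and two actions, the greedy action in each state receives probability $1 - 0.2 + 0.1 = 0.9$ and the other action 0.1.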

In [None]:
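num_episodes = 20000  # episodes sampled per outer iteration; a single episode gives too few samples
max_steps = 100       # assumed episode-length cap (not set elsewhere in this notebook; adjust if util_frozen defines it)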

for i_ter in np.arange(100):
    Q        = np.zeros((n_states, n_actions), dtype=float)
    Returns  = np.zeros_like(Q)
    Num      = np.zeros_like(Q)

    # Decay the epsilon coefficient to encourage exploration in the beginning
    epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    for ep in range(num_episodes):
        s, _ = env.reset()
        states, actions, rewards = [s], [], []
        
        done, steps = False, 0
        while steps < max_steps:
            a = epsilon_greedy_action(Q, s, epsilon)
            # After termination, keep the agent in place and pay Reward[s] on every
            # remaining step, so the final state behaves as an absorbing state
            if done:
                s_next = s 
                r = Reward[s]
                actions.append(a); rewards.append(r); states.append(s_next) 
            else: 
                s_next, r, term, trunc, _ = env.step(a)
                done = bool(term or trunc)
                actions.append(a); rewards.append(r); states.append(s_next)    
                s = s_next
            
            steps += 1
    
        G = 0.0
        T = len(actions)
        for t in range(T-1, -1, -1):
            s_t, a_t = states[t], actions[t]
            G = # .........
            Returns[s_t, a_t] = # .....
            Num[s_t, a_t] += 1.0
            Q[s_t, a_t] = # ........

    # Policy improvement
    pi = # ......
    
    print(f"iter: {i_ter}, epsilon = {epsilon} ")
    render_value_and_policy_grid(Q.max(axis=1), pi, DESC_5x5, title="Values + Policy")