# 🧊 Lab 2 — Stochastic FrozenLake & Transition Probabilities

In this lab, we will:
1. **Visualize the slippery surface** to build intuition about stochastic transitions.
2. **Model FrozenLake as an MDP**, explicitly defining $P(s' \mid s,a)$.
3. **Implement and verify** our own transition probability table and compare it to Gym's.

### 🧊 Demonstration: Slippery vs. Deterministic FrozenLake

Let's compare **deterministic** (`is_slippery=False`) and **stochastic** (`is_slippery=True`) FrozenLake.

1. First, create the environment with `is_slippery=False` and run a trajectory.
2. Then recreate the environment with `is_slippery=True` and run the **same action sequence** again.
3. Notice how the agent may slip into unintended directions when the surface is slippery.

In [1]:
import gymnasium as gym
import numpy as np
import random
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

In [2]:
# Deterministic environment (no slipping)
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="ansi")
print("=== Deterministic FrozenLake ===")
obs, info = env.reset(seed=42)
print(env.render())
# Take 4 fixed actions: DOWN, DOWN, RIGHT, RIGHT (1, 1, 2, 2)
for action in [1, 1, 2, 2]:
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Action: {action}, State: {obs}, Reward: {reward}")
    print(env.render())
    if terminated:
        break

=== Deterministic FrozenLake ===

[41mS[0mFFF
FHFH
FFFH
HFFG

Action: 1, State: 4, Reward: 0.0
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

Action: 1, State: 8, Reward: 0.0
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG

Action: 2, State: 9, Reward: 0.0
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG

Action: 2, State: 10, Reward: 0.0
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG



In [3]:
# Now recreate environment with slippery dynamics
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="ansi")
print("\n=== Slippery FrozenLake ===")
obs, info = env.reset(seed=42)
print(env.render())

# Take the same fixed action sequence again
for action in [1, 1, 2, 2]:
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Action: {action}, State: {obs}, Reward: {reward}")
    print(env.render())
    if terminated:
        break


=== Slippery FrozenLake ===

[41mS[0mFFF
FHFH
FFFH
HFFG

Action: 1, State: 4, Reward: 0.0
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

Action: 1, State: 5, Reward: 0.0
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG



### 📊 How Slippery Works in FrozenLake

When `is_slippery=True`, the environment introduces **stochastic transitions**:

- Each time you choose an action, the environment samples from:
  - **Left turn**: with probability **1/3**
  - **Forward (intended direction)**: with probability **1/3**
  - **Right turn**: with probability **1/3**

This means your chosen direction is followed only about **33% of the time**; the other two outcomes are perpendicular to it, so the agent never moves backward.  
If a move would take the agent outside the grid, the agent simply stays in place.
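The slip rule above can be sketched in a few lines. This is a minimal illustration, not Gym's own code: the helper name `slip_outcomes` and the `MOVES` table are assumptions made for this example, and holes and the goal are ignored so that only the movement rule is shown. It uses the action encoding from the cells above (0=LEFT, 1=DOWN, 2=RIGHT, 3=UP), where the two perpendicular directions of action `a` are `(a - 1) % 4` and `(a + 1) % 4`.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row, column) offsets for each action.
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def slip_outcomes(state, action, n_rows=4, n_cols=4):
    """Return the three equally likely next states (holes/goal ignored)."""
    row, col = divmod(state, n_cols)
    outcomes = []
    # One perpendicular direction, the intended one, the other perpendicular.
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        d_row, d_col = MOVES[actual]
        # Clamp to the grid: moving into a wall leaves the agent in place.
        new_row = min(max(row + d_row, 0), n_rows - 1)
        new_col = min(max(col + d_col, 0), n_cols - 1)
        outcomes.append(new_row * n_cols + new_col)
    return outcomes


print(slip_outcomes(0, DOWN))  # [0, 4, 1]
print(slip_outcomes(4, DOWN))  # [4, 8, 5]
```

The second call matches the slippery run above: choosing DOWN in state 4 can land the agent in state 5, which is a hole on the 4x4 map.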

## 🧮 Group Exercise: Modeling FrozenLake as Four Transition Matrices

In slippery mode, **each action has its own transition probability matrix**:

- **`P_LEFT`** — probabilities for taking action **LEFT**
- **`P_DOWN`** — probabilities for taking action **DOWN**
- **`P_RIGHT`** — probabilities for taking action **RIGHT**
- **`P_UP`** — probabilities for taking action **UP**

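Each matrix is **9×9**, with entry $(s, s')$ holding $P(s' \mid s, a)$: one row per current state, one column per next state.  
Each row must **sum to 1**. In the matrices below, the hole is modeled as absorbing (it stays in place with probability 1) rather than as an all-zero row.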
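A quick way to catch an unfilled or mistyped row is to test the row-sum property directly. The helper below is a small sketch; the name `is_row_stochastic` is made up for this example, and it assumes terminal states are modeled as absorbing self-loops, so that every row sums to 1.

```python
import numpy as np


def is_row_stochastic(P, atol=1e-9):
    """True if all entries are non-negative and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return bool(np.all(P >= 0) and np.allclose(P.sum(axis=-1), 1.0, atol=atol))


good = [[2 / 3, 1 / 3], [0.0, 1.0]]
bad = [[2 / 3, 1 / 3], [0.0, 0.0]]  # second row was left unfilled

print(is_row_stochastic(good))  # True
print(is_row_stochastic(bad))   # False
```

Because the sum is taken over the last axis, the same check also works on a stacked `(4, 9, 9)` array of per-action matrices.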

In [4]:
# Define a smaller 3x3 map
DESC_3x3 = [
    "SFF",
    "FHF",
    "FFG",
]
env = gym.make("FrozenLake-v1", desc=DESC_3x3, is_slippery=True, render_mode="ansi")
obs, info = env.reset(seed=42)
print(env.render())


[41mS[0mFF
FHF
FFG



In [5]:
# 9x9 matrices for students to fill manually, written as counts out of 3.
# Layout as entered: column = current state, row = next state (transposed later).

P_LEFT = [
    [2, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 3, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 2, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
]

P_DOWN = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 3, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 2, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 2],
]

P_RIGHT = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 3, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 2],
]

P_UP = [
    [2, 1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 3, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]


In [11]:
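# The matrices are entered as [next_state][current_state] counts out of 3, so swap
# the last two axes to get P_all[a, s, s'] and divide by 3 to get probabilities.
P_all = np.array([P_LEFT, P_DOWN, P_RIGHT, P_UP]).transpose(0, 2, 1) / 3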
print("Shape of combined matrix:", P_all.shape)  # (4, 9, 9)

Shape of combined matrix: (4, 9, 9)
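Hand-filled tables are easy to get wrong, so it can help to rebuild the same table programmatically and compare. The sketch below is one possible way to do that, not Gym's implementation: `build_transition_model`, `table_from_gym`, and the `absorbing` parameter are names invented for this example. By default only holes are treated as absorbing; passing `absorbing="HG"` also freezes the goal.

```python
import numpy as np

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# (row, column) offsets for each action.
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def build_transition_model(desc, absorbing="H"):
    """Return P with P[a, s, s2] = P(s2 | s, a) for a slippery grid.

    `absorbing` lists the tile letters modeled as self-looping states
    (an assumption of this sketch).
    """
    n_rows, n_cols = len(desc), len(desc[0])
    n_states = n_rows * n_cols
    P = np.zeros((4, n_states, n_states))
    for s in range(n_states):
        row, col = divmod(s, n_cols)
        for a in range(4):
            if desc[row][col] in absorbing:
                P[a, s, s] = 1.0
                continue
            # Intended direction plus the two perpendicular slips, 1/3 each.
            for b in ((a - 1) % 4, a, (a + 1) % 4):
                d_row, d_col = MOVES[b]
                # Clamp to the grid: bumping into a wall means staying put.
                new_row = min(max(row + d_row, 0), n_rows - 1)
                new_col = min(max(col + d_col, 0), n_cols - 1)
                P[a, s, new_row * n_cols + new_col] += 1 / 3
    return P


def table_from_gym(gym_P, n_states, n_actions=4):
    """Convert a dict {s: {a: [(prob, s2, reward, done), ...]}} to an array."""
    P = np.zeros((n_actions, n_states, n_states))
    for s, by_action in gym_P.items():
        for a, outcomes in by_action.items():
            for prob, s2, _reward, _done in outcomes:
                P[a, s, s2] += prob  # duplicates of s2 accumulate
    return P


P_auto = build_transition_model(["SFF", "FHF", "FFG"])
print(P_auto.shape)  # (4, 9, 9)
print(np.allclose(P_auto.sum(axis=2), 1.0))  # True
```

Running `np.allclose(P_all, P_auto)` in the notebook then flags any mismatch with the hand-filled table. To our knowledge, Gymnasium's FrozenLake exposes its own table as `env.unwrapped.P` in the dict format that `table_from_gym` expects. Gym treats the goal as terminal as well, so a like-for-like comparison would use `absorbing="HG"`.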


### 🎯 Overview: Sampling a Trajectory from `P_all`

We now have `P_all`, a combined transition model with shape `(4, n_states, n_states)`.  
This means we know the probability of reaching every possible next state `s'`  
from any current state `s` for each action `a`.

In this step, we will **simulate a trajectory** given:
- A starting state (e.g., the state of `S`).
- A fixed sequence of actions (e.g. `[0, 1, 3, 2]`).

Instead of calling `env.step()`, we will:
- Use `P_all[a, s, :]` to get the probability distribution of next states.
- Sample the next state according to these probabilities.
- Repeat until all actions are taken, building a list of visited states.

This shows that once we have the full transition model,  
we can generate experience **entirely from our MDP representation** —  
an essential idea behind **model-based reinforcement learning**.


In [12]:
def sample_trajectory(P_all, start_state, actions):
    s = start_state
    trajectory = [s]
    for a in actions:
        probs = P_all[a, s, :]  # distribution over next states from s under action a
        s_next = random.choices(range(len(probs)), weights=probs, k=1)[0]
        trajectory.append(s_next)
        s = s_next
    return trajectory

In [13]:
# Example usage:
obs = 0  # assume starting from state 0 (top-left S)
actions = [0, 1, 3, 2]  # LEFT, DOWN, UP, RIGHT
trajectory = sample_trajectory(P_all, obs, actions)
print("Sampled trajectory of states:", trajectory)

Sampled trajectory of states: [0, 0, 1, 1, 1]
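A single sampled trajectory says little about whether the sampler follows the model. One sanity check is to draw many next states from one row and compare the empirical frequencies with that row. The sketch below uses a small made-up 3-state matrix `P_toy` rather than the lab's `P_all`, and a seeded NumPy generator so that repeated runs give the same samples; `empirical_row` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, so the run is repeatable

# Toy 3-state model for a single action: rows are current states.
P_toy = np.array([
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 0.0, 1.0],
])


def empirical_row(P, s, n_samples=30_000):
    """Estimate P[s, :] by sampling next states and counting them."""
    samples = rng.choice(P.shape[1], size=n_samples, p=P[s])
    return np.bincount(samples, minlength=P.shape[1]) / n_samples


estimate = empirical_row(P_toy, s=1)
print(np.round(estimate, 2))
print(np.abs(estimate - P_toy[1]).max() < 0.02)
```

The same check applies to any row `P_all[a, s]` of the lab's model, as long as that row sums to 1.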


### 🧭 From Policy to State Transition Matrix

Once we have defined a **policy** — a list of actions, one for each state —

```python
# Example: choose an action for each of the 9 states
POLICY = [1, 1, 2, 0, 0, 2, 1, 1, 0]  # 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
```

we can use it to create a **policy-specific transition matrix** `P_pi`.

---

### 📝 What We Are Doing

- Start with `P_all`, which contains probabilities for **all actions**:
  - Shape: `(4, 9, 9)` → 4 actions × 9 states × 9 next states.
- For each state `s`:
  - Look up the action chosen by the policy: `a = POLICY[s]`.
  - Copy the probability row `P_all[a, s, :]` into the corresponding row of `P_pi`.

In [14]:
POLICY = [
    1,  # state 0 (top-left)
    2,  # state 1
    1,  # state 2
    1,  # state 3
    1,  # state 4 (center, hole)
    1,  # state 5
    2,  # state 6
    2,  # state 7
    2,  # state 8 (goal state)
]

In [15]:
n_states = len(POLICY)
P_pi = np.zeros((n_states, n_states))

for s in range(n_states):
    a = POLICY[s]             # action chosen at state s
    P_pi[s, :] = P_all[a, s]  # copy the probabilities for that action

# Print the resulting matrix
np.set_printoptions(precision=2, suppress=True)
print("Policy-specific transition matrix (P_pi):")
print(P_pi)

Policy-specific transition matrix (P_pi):
[[0.33 0.33 0.   0.33 0.   0.   0.   0.   0.  ]
 [0.   0.33 0.33 0.   0.33 0.   0.   0.   0.  ]
 [0.   0.33 0.33 0.   0.   0.33 0.   0.   0.  ]
 [0.   0.   0.   0.33 0.33 0.   0.33 0.   0.  ]
 [0.   0.   0.   0.   1.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.33 0.33 0.   0.   0.33]
 [0.   0.   0.   0.33 0.   0.   0.33 0.33 0.  ]
 [0.   0.   0.   0.   0.33 0.   0.   0.33 0.33]
 [0.   0.   0.   0.   0.   0.33 0.   0.   0.67]]
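The row-by-row copy can also be written as a single NumPy indexing expression. The sketch below checks that both forms agree on a randomly generated stand-in model `P_demo` (not the lab's `P_all`), so it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_states = 4, 9

# Stand-in model: random rows normalized to sum to 1.
P_demo = rng.random((n_actions, n_states, n_states))
P_demo /= P_demo.sum(axis=2, keepdims=True)

policy = np.array([1, 2, 1, 1, 1, 1, 2, 2, 2])

# Loop version: copy one row per state.
P_loop = np.zeros((n_states, n_states))
for s in range(n_states):
    P_loop[s, :] = P_demo[policy[s], s]

# Vectorized version: pair each state s with its action policy[s].
P_vec = P_demo[policy, np.arange(n_states)]

print(P_vec.shape)                 # (9, 9)
print(np.allclose(P_loop, P_vec))  # True
```

The two index arrays are matched element by element, so row `s` of `P_vec` is `P_demo[policy[s], s, :]`.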


### 📈 Policy Evaluation: Solving for State Values

Now that we have:

- **`P_pi`** → the policy-specific state transition matrix (9×9)
- **`R`** → the reward vector for each state (or reward per state-action-next_state)

we can compute the **value of each state under this policy**.

### 📝 What We Are Doing

The state value function under a fixed policy satisfies:

$$
V = R + \gamma P_\pi V
$$

where:

- **`V`** is a column vector of state values.
- **`R`** is a column vector of expected immediate rewards per state.
- **`γ`** is the discount factor (e.g. 0.9).
- **`P_pi`** is the transition matrix under the policy.

We can rearrange this to solve for `V` directly:

$$
(I - \gamma P_\pi) V = R
\quad\Rightarrow\quad
V = (I - \gamma P_\pi)^{-1} R
$$
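One way to see why the inverse exists, assuming $0 \le \gamma < 1$: substituting the equation into itself repeatedly gives a geometric (Neumann) series in $\gamma P_\pi$. Every row of $P_\pi$ sums to 1, so its eigenvalues have magnitude at most 1. The spectral radius of $\gamma P_\pi$ is therefore at most $\gamma < 1$, and the series converges:

$$
V = R + \gamma P_\pi R + \gamma^2 P_\pi^2 R + \cdots
  = \sum_{t=0}^{\infty} \gamma^t P_\pi^t R
  = (I - \gamma P_\pi)^{-1} R
$$

Read this way, $V(s)$ is the expected discounted sum of the per-state rewards collected along the chain started in $s$.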

---

### 🎯 Your Task

1. Build a reward vector `R` of length 9 (1 for goal states, 0 otherwise).
2. Choose a discount factor `gamma` (e.g. 0.9).
3. Solve for `V` using NumPy:

In [None]:
# Your turn to work on it

In [23]:
Reward = [
    0,  # state 0 (top-left)
    0,  # state 1
    0,  # state 2
    0,  # state 3
    0,  # state 4 (center, hole)
    0,  # state 5
    0,  # state 6
    0,  # state 7
    1,  # state 8 (goal state)
]
Reward = np.array(Reward)
gamma = 0.9

In [25]:
# Solve (I - gamma * P_pi) V = R for the value of the policy defined above
I = np.eye(9)
A = I - gamma * P_pi
V = np.linalg.solve(A, Reward)

In [28]:
print("State values V:", V)

State values V: [0.3  0.36 0.83 0.36 0.   1.58 0.83 1.58 3.68]
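A useful cross-check on the direct solve is iterative policy evaluation: start from zeros and apply $V \leftarrow R + \gamma P_\pi V$ until the values stop changing. The sketch below does this on a made-up 3-state chain (`P_pi_toy` and `R_toy` are illustrative and are not the lab's `P_pi` and `Reward`) and confirms that both routes agree.

```python
import numpy as np

# Toy 3-state chain under some fixed policy; the last state is absorbing.
P_pi_toy = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R_toy = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Direct solve of (I - gamma * P) V = R.
V_direct = np.linalg.solve(np.eye(3) - gamma * P_pi_toy, R_toy)

# Iterative policy evaluation: repeat V <- R + gamma * P V until convergence.
V_iter = np.zeros(3)
for _ in range(10_000):
    V_new = R_toy + gamma * P_pi_toy @ V_iter
    converged = np.max(np.abs(V_new - V_iter)) < 1e-12
    V_iter = V_new
    if converged:
        break

print(np.round(V_direct, 3))
print(np.allclose(V_direct, V_iter))  # True
```

In this toy chain the last state pays a reward of 1 on every step and never leaves, so its value is $1 / (1 - \gamma) = 10$. A per-state reward that is collected repeatedly is also why a state value can exceed 1 under this formulation.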
