# Lesson 02: Bellman Equations in Practice

## 🔁 Recap from Lesson 1

We defined value functions and Bellman equations:

- State-value function:
  $$
  V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s \right]
  $$

- Bellman expectation equation:

  $$
  V^\pi(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right]
  $$
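
To make the recap concrete, here is a minimal sketch of a single Bellman expectation backup for a stochastic policy. The function name `bellman_backup`, the dictionary layouts, and the one-state example are illustrative assumptions, not definitions from Lesson 1.

```python
def bellman_backup(s, V, policy, R, P, gamma):
    """Apply the Bellman expectation equation once to state s.

    Assumed (hypothetical) data layout:
      policy[s][a]   -> pi(a|s)
      R[(s, a)]      -> immediate reward R(s, a)
      P[(s, a)][s2]  -> transition probability P(s2|s, a)
    """
    total = 0.0
    for a, pi_a in policy[s].items():
        # Expected value of the next state under action a
        expected_next = sum(p * V[s2] for s2, p in P[(s, a)].items())
        total += pi_a * (R[(s, a)] + gamma * expected_next)
    return total


# Hypothetical one-state MDP: two actions picked with equal probability,
# both leading back to the same state.
policy = {"s": {"left": 0.5, "right": 0.5}}
R = {("s", "left"): 0.0, ("s", "right"): 2.0}
P = {("s", "left"): {"s": 1.0}, ("s", "right"): {"s": 1.0}}
V = {"s": 10.0}

backed_up = bellman_backup("s", V, policy, R, P, gamma=0.9)
print(backed_up)
```

In this example the average reward per step is 1, so the true value satisfies $V = 1 + 0.9\,V$, which gives $V = 10$. Applying the backup to $V = 10$ returns the same number, which is what it means for $V^\pi$ to satisfy the Bellman expectation equation.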


## 🧠 Value of a Policy

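Given a fixed policy \( \pi \), we can compute how good it is using **policy evaluation**.

For a deterministic policy, which always takes the action \( \pi(s) \) in state \( s \), the sum over actions in the Bellman expectation equation collapses to a single term. Policy evaluation then means solving, for every state \( s \):
$$
V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V^\pi(s')
$$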

We can solve this using **iterative updates**:

1. Initialize \( V(s) = 0 \) for all \( s \)
2. For each state, apply:
   $$
   V_{k+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V_k(s')
   $$
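3. Repeat step 2 until the values stop changing, for example until \( \max_s |V_{k+1}(s) - V_k(s)| \) falls below a small tolerance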
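
The three steps above can be sketched as a small function for a deterministic policy. This is one possible arrangement, not a reference implementation: the name `evaluate_policy`, the dictionary layouts, the stopping tolerance `tol`, the cap `max_iters`, and the one-state example are all assumptions made for illustration.

```python
def evaluate_policy(states, policy, R, P, gamma, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation for a deterministic policy.

    Assumed (hypothetical) data layout:
      policy[s]      -> action chosen in state s
      R[(s, a)]      -> immediate reward R(s, a)
      P[(s, a)][s2]  -> transition probability P(s2|s, a)
    """
    V = {s: 0.0 for s in states}  # Step 1: initialize V(s) = 0
    for _ in range(max_iters):
        new_V = {}
        for s in states:  # Step 2: one backup per state, using the old V
            a = policy[s]
            expected_next = sum(p * V[s2] for s2, p in P[(s, a)].items())
            new_V[s] = R[(s, a)] + gamma * expected_next
        # Largest change in any state's value during this sweep
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:  # Step 3: stop once the values have settled
            break
    return V


# Hypothetical one-state chain: reward 1 per step, gamma = 0.5.
result = evaluate_policy(
    states=["x"],
    policy={"x": "stay"},
    R={("x", "stay"): 1.0},
    P={("x", "stay"): {"x": 1.0}},
    gamma=0.5,
)
print(result)
```

For this chain the exact value is $1 / (1 - 0.5) = 2$, and the printed result should be very close to that. Building `new_V` separately, rather than overwriting `V` in place, keeps each sweep faithful to the update rule, in which $V_{k+1}$ depends only on $V_k$.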


## ✍️ Example MDP

Let’s consider this toy MDP:

- States: $s_1$, $s_2$
- Actions: $a_1$, $a_2$
- Policy: $\pi(s_1) = a_1$, $\pi(s_2) = a_2$
- Transitions (deterministic; only the state-action pairs the policy uses are listed):
  - $P(s_2 | s_1, a_1) = 1$
  - $P(s_1 | s_2, a_2) = 1$
- Rewards:
  - $R(s_1, a_1) = 2$
  - $R(s_2, a_2) = 1$
- $\gamma = 0.9$

Starting from $V_0(s_1) = V_0(s_2) = 0$, let's compute $V_1$, $V_2$, etc.
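
Worked by hand, the first two sweeps look like this. Each line plugs the rewards and transitions listed above into the update rule, always reading from the previous iterate:

$$
\begin{aligned}
V_1(s_1) &= 2 + 0.9 \cdot V_0(s_2) = 2 + 0.9 \cdot 0 = 2 \\
V_1(s_2) &= 1 + 0.9 \cdot V_0(s_1) = 1 + 0.9 \cdot 0 = 1 \\
V_2(s_1) &= 2 + 0.9 \cdot V_1(s_2) = 2 + 0.9 \cdot 1 = 2.9 \\
V_2(s_2) &= 1 + 0.9 \cdot V_1(s_1) = 1 + 0.9 \cdot 2 = 2.8
\end{aligned}
$$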


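```python
# Iterative policy evaluation for the toy MDP above
gamma = 0.9                              # discount factor
V = {'s1': 0, 's2': 0}                   # V_0(s) = 0 for all states
rewards = {'s1': 2, 's2': 1}             # R(s, pi(s)) for each state
transitions = {'s1': 's2', 's2': 's1'}   # deterministic next state under pi

for k in range(5):  # 5 sweeps over the state space
    new_V = {}
    for s in V:
        next_s = transitions[s]
        # Bellman backup: reads only the previous iterate V_k
        new_V[s] = rewards[s] + gamma * V[next_s]
    V = new_V
    print(f"Iteration {k+1}: V = {V}")
```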
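
Five sweeps only reach roughly $V(s_1) \approx 6.56$ and $V(s_2) \approx 5.72$, which is still far from the true values. As a sanity check, the sketch below solves the two Bellman equations of this toy MDP by substitution and compares the result with a much longer run of the same update. The variable names and the choice of 500 sweeps are assumptions for illustration.

```python
gamma = 0.9
r1, r2 = 2.0, 1.0  # R(s1, a1) and R(s2, a2) from the toy MDP

# Substituting V(s2) = r2 + gamma * V(s1) into V(s1) = r1 + gamma * V(s2)
# gives V(s1) = (r1 + gamma * r2) / (1 - gamma^2).
v1 = (r1 + gamma * r2) / (1 - gamma ** 2)
v2 = r2 + gamma * v1

# The same iterative update as in the lesson, run for many more sweeps.
# The right-hand side is built from the old V before V is reassigned.
V = {"s1": 0.0, "s2": 0.0}
for _ in range(500):
    V = {"s1": r1 + gamma * V["s2"], "s2": r2 + gamma * V["s1"]}

print(f"closed form: V(s1) = {v1:.4f}, V(s2) = {v2:.4f}")
print(f"500 sweeps:  V(s1) = {V['s1']:.4f}, V(s2) = {V['s2']:.4f}")
```

Both lines should agree at about $V(s_1) \approx 15.26$ and $V(s_2) \approx 14.74$. With $\gamma = 0.9$, each sweep shrinks the remaining error by a factor of 0.9, so a handful of iterations is not enough to get close to the fixed point.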


## ✅ Summary

- The Bellman expectation equation allows us to **evaluate** how good a given policy is.
- We use **iterative updates** to approximate the value function.
- Once we know the value of a policy, we can try improving it (covered in Lesson 3).

Up next: **Policy Iteration – Evaluating and Improving Policies**
