**Bellman Equations for Policy Evaluation**

In this exercise, we are going to implement the Bellman equations for policy evaluation for the following MDP:

![Two-state MDP](two_state_mdp.png "Two-state MDP")

Let us start with the imports. This exercise uses only NumPy.

In [16]:
import numpy as np

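Now, let us define the parameters of the problem. The MDP has two states and two actions, so there are four state-action pairs. The rewards, discount factor, and transition probabilities are given in the figure above: `R` is a $4 \times 1$ column vector with one reward per state-action pair, and `P` is a $4 \times 2$ matrix whose rows give the next-state distribution for each pair. We also define the four possible deterministic policies, $\pi_1$, $\pi_2$, $\pi_3$, and $\pi_4$, each as a $2 \times 4$ matrix with one row per state and a single 1 in the column of the chosen state-action pair: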

In [17]:
gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_1 = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])
pi_2 = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])
pi_3 = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])
pi_4 = np.array([[0, 1, 0, 0], [0, 0, 0, 1]])
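
To make the layout of these arrays concrete, the following sketch checks that each row of `P` is a probability distribution and shows how a policy matrix selects one state-action row per state. It assumes the row ordering $(s_1, a_1), (s_1, a_2), (s_2, a_1), (s_2, a_2)$, which is inferred from the shape of the policy matrices rather than stated explicitly. The names `P_pi` and `R_pi` are introduced here for illustration only.

```python
import numpy as np

# Parameters copied from the exercise.
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_1 = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])

# Each row of P is a distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# A deterministic policy has exactly one 1 per row (one chosen action per state).
assert np.array_equal(pi_1.sum(axis=1), np.ones(2))

# Left-multiplying by the policy matrix keeps only the chosen state-action rows.
P_pi = pi_1 @ P  # state-to-state transitions under pi_1, shape (2, 2)
R_pi = pi_1 @ R  # reward received in each state under pi_1, shape (2, 1)
print(P_pi)
print(R_pi.flatten())
```

For `pi_1`, the product keeps rows 0 and 2 of `P` and `R`, that is, the first action in both states under the assumed ordering.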

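Now, let us implement the Bellman equations in matrix form, using the closed-form solutions of the fixed-point equations seen in the slides. Here $\Pi$ is the policy matrix, $\mathcal{P}^{\pi} = \Pi \mathcal{P}$, and $\mathcal{R}^{\pi} = \Pi \mathcal{R}$: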
* $v^{\pi} = \left( I - \gamma \mathcal{P}^{\pi} \right)^{-1} \mathcal{R}^{\pi}$
* $q^{\pi} = \left( I - \gamma \mathcal{P} \Pi \right)^{-1} \mathcal{R}$
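
The two formulas above can be sketched in NumPy as follows, assuming $\mathcal{P}^{\pi} = \Pi \mathcal{P}$ and $\mathcal{R}^{\pi} = \Pi \mathcal{R}$ with $\Pi$ the policy matrix. The function name `evaluate_policy_closed_form` is hypothetical, and `np.linalg.solve` is swapped in for the explicit matrix inverse, which should give the same result for systems this small. This is one possible approach, not necessarily the intended solution of the exercise.

```python
import numpy as np

def evaluate_policy_closed_form(R, P, gamma, pi):
    """Hypothetical helper: solve the linear Bellman systems for v^pi and q^pi."""
    n_states, n_pairs = pi.shape
    P_pi = pi @ P  # shape (n_states, n_states)
    R_pi = pi @ R  # shape (n_states, 1)
    # Solve (I - gamma * P_pi) v = R_pi instead of forming the inverse.
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Solve (I - gamma * P @ pi) q = R over the state-action pairs.
    q = np.linalg.solve(np.eye(n_pairs) - gamma * P @ pi, R)
    return v, q

# Parameters copied from the exercise, evaluated for the first policy.
gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_1 = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])

v, q = evaluate_policy_closed_form(R, P, gamma, pi_1)
print(v.flatten(), q.flatten())
```

Solving the linear system rather than inverting the matrix is a common numerical choice; for a two-state MDP the difference is negligible, so either form matches the equations.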

In [18]:
def bellman_equations(R, P, gamma, pi):
    """Return (v_pi, q_pi) for policy matrix pi, as column vectors of shape (2, 1) and (4, 1)."""
    # Code to be filled by the student
    return v_pi, q_pi

for pi in [pi_1, pi_2, pi_3, pi_4]:
    v_pi, q_pi = bellman_equations(R, P, gamma, pi)
    with np.printoptions(precision=2, suppress=True):
        print(f"Policy = {pi.flatten()}")
        print(f"v^pi = {v_pi.flatten()}")
        print(f"q^pi = {q_pi.flatten()}")

Policy = [1 0 0 0 0 0 1 0]
v^pi = [-5.09 -2.36]
q^pi = [-5.09 -2.02 -2.36 -5.24]
Policy = [0 1 0 0 0 0 1 0]
v^pi = [5.34 5.25]
q^pi = [3.79 5.34 5.25 3.9 ]
Policy = [1 0 0 0 0 0 0 1]
v^pi = [-9.83 -9.74]
q^pi = [-9.83 -8.19 -8.29 -9.74]
Policy = [0 1 0 0 0 0 0 1]
v^pi = [-0.63 -1.55]
q^pi = [-1.73 -0.63 -0.64 -1.55]
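
One way to sanity-check these numbers is to verify the two relations that tie the value functions together: $v^{\pi} = \Pi q^{\pi}$ and $q^{\pi} = \mathcal{R} + \gamma \mathcal{P} v^{\pi}$. The sketch below does this for the second policy using the rounded values printed above, so the comparison uses a loose tolerance (`atol=0.02`, an assumption chosen to absorb the two-decimal rounding). The names `v_printed`, `q_printed`, `v_ok`, and `q_ok` are illustrative.

```python
import numpy as np

gamma = 0.9
# R is kept as a flat vector here so that all arithmetic stays one-dimensional.
R = np.array([-1, 0.6, 0.5, -0.9])
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_2 = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])

# Rounded values copied from the printed output for pi_2.
v_printed = np.array([5.34, 5.25])
q_printed = np.array([3.79, 5.34, 5.25, 3.9])

# v^pi picks out the q-value of the action the policy takes in each state.
v_ok = np.allclose(pi_2 @ q_printed, v_printed, atol=0.02)
# q^pi is the immediate reward plus the discounted value of the next state.
q_ok = np.allclose(R + gamma * P @ v_printed, q_printed, atol=0.02)
print(v_ok, q_ok)
```

The same check can be repeated for the other three policies by swapping in their policy matrices and printed values.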
