**Bellman Equations for Policy Evaluation**

In this exercise, we implement the Bellman equations for policy evaluation on the following MDP, a Random Walk:

![Random Walk MDP](rw.png "Random Walk MDP")

Let us start with the imports. NumPy is the only library we need for this exercise.

In [14]:
import numpy as np
%matplotlib inline

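Now, let us define the parameters of the problem. The MDP has seven states (two of which are terminal) and two actions, so there are 14 state-action pairs. The rewards, discount factor, and transition probabilities are given in the figure above. In the code below, `R` is a 14×1 column vector with one reward per state-action pair, `P` is a 14×7 matrix whose rows give the next-state distribution of each state-action pair, and `pi` is a 7×14 matrix whose row for each state holds the action probabilities in that state. We evaluate a single policy: the uniform random policy, which assigns equal probability to both actions in every state.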

In [15]:
gamma = 0.9
R = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]).T
P = np.array([[1, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 1]])
pi = np.array([[0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.5, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.5]])
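
The layout of these arrays can be read off from `pi`: row `s` has its two nonzero entries in columns `2*s` and `2*s + 1`, so state-action pairs appear to be ordered state by state. The following minimal sketch illustrates that convention under this assumption. The helper `pair_index` and the names `n_states` and `n_actions` are hypothetical and do not appear in the exercise.

```python
import numpy as np

n_states, n_actions = 7, 2  # sizes taken from the problem description

def pair_index(s, a):
    """Row of R and P (column of pi) for zero-based state s and action a.

    Hypothetical helper: assumes pairs are ordered state by state.
    """
    return n_actions * s + a

# Uniform random policy as a (7, 14) matrix: row s holds 0.5 in the two
# columns that belong to state s and zeros elsewhere.
pi = np.kron(np.eye(n_states), np.full((1, n_actions), 1.0 / n_actions))

# Each row of pi is a probability distribution over state-action pairs.
assert np.allclose(pi.sum(axis=1), 1.0)

# The only nonzero reward in R sits at index 11, i.e. state 5, action 1.
print(pair_index(5, 1))  # 11
```

Building `pi` with `np.kron` is only a compact alternative to typing the matrix by hand; both give the same 7×14 array.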

Now, let us implement the Bellman equations, using the closed-form fixed-point equations seen in the slides:
* $v^{\pi} = \left( I - \gamma \mathcal{P}^{\pi} \right)^{-1} \mathcal{R}^{\pi}$
* $q^{\pi} = \left( I - \gamma \mathcal{P} \Pi \right)^{-1} \mathcal{R}$
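
As a reminder of where these expressions come from, here is a short derivation. It assumes, consistently with the array shapes used above, that $\Pi$ is the 7×14 policy matrix `pi`, and that $\mathcal{P}^{\pi} = \Pi \mathcal{P}$ and $\mathcal{R}^{\pi} = \Pi \mathcal{R}$; the slides may use a different but equivalent notation.

$$
\begin{aligned}
v^{\pi} &= \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v^{\pi}
&&\Longrightarrow&
\left( I - \gamma \mathcal{P}^{\pi} \right) v^{\pi} &= \mathcal{R}^{\pi} \\
q^{\pi} &= \mathcal{R} + \gamma \mathcal{P} \Pi q^{\pi}
&&\Longrightarrow&
\left( I - \gamma \mathcal{P} \Pi \right) q^{\pi} &= \mathcal{R}
\end{aligned}
$$

Under the same assumptions the two solutions are linked by $v^{\pi} = \Pi q^{\pi}$. Since $\mathcal{P}^{\pi}$ and $\mathcal{P} \Pi$ are row-stochastic and $\gamma < 1$, both matrices on the left-hand side are invertible.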

In [16]:
def bellman_equations(R, P, gamma, pi):
    # Code to be filled by the student
    return v_pi, q_pi

v_pi, q_pi = bellman_equations(R, P, gamma, pi)
with np.printoptions(precision=2, suppress=True):
    print(f"v^pi = {v_pi.flatten()}")
    print(f"q^pi = {q_pi.flatten()}")

v^pi = [-0.    0.07  0.15  0.26  0.43  0.69  0.  ]
q^pi = [0.   0.   0.13 0.   0.23 0.06 0.38 0.13 0.62 0.23 0.   1.38 0.   0.  ]
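
For reference, one possible way to fill in `bellman_equations` is sketched below. It is not necessarily the intended solution: it assumes $\mathcal{P}^{\pi} = \Pi \mathcal{P}$ and $\mathcal{R}^{\pi} = \Pi \mathcal{R}$, and it calls `np.linalg.solve` rather than forming the matrix inverse explicitly. To keep the block runnable on its own, the arrays are rebuilt in compact form; the `next_state` table is a hypothetical helper copied from the rows of `P` above.

```python
import numpy as np

n_states, n_actions = 7, 2
gamma = 0.9

# next_state[s, a]: successor of each (state, action) pair, read off from P.
next_state = np.array([[0, 0], [2, 0], [3, 1], [4, 2], [5, 3], [6, 4], [6, 6]])
P = np.zeros((n_states * n_actions, n_states))
P[np.arange(n_states * n_actions), next_state.flatten()] = 1.0

# Single nonzero reward at state-action index 11, as in R above.
R = np.zeros((n_states * n_actions, 1))
R[11, 0] = 1.0

# Uniform random policy, same (7, 14) layout as pi above.
pi = np.kron(np.eye(n_states), np.full((1, n_actions), 0.5))

def bellman_equations(R, P, gamma, pi):
    P_pi = pi @ P  # (7, 7) state-to-state transitions under the policy
    R_pi = pi @ R  # (7, 1) expected immediate reward per state
    # Solve (I - gamma * P_pi) v = R_pi for the state values.
    v_pi = np.linalg.solve(np.eye(P_pi.shape[0]) - gamma * P_pi, R_pi)
    # Solve (I - gamma * P @ pi) q = R for the state-action values.
    q_pi = np.linalg.solve(np.eye(P.shape[0]) - gamma * P @ pi, R)
    return v_pi, q_pi

v_pi, q_pi = bellman_equations(R, P, gamma, pi)
with np.printoptions(precision=2, suppress=True):
    print(f"v^pi = {v_pi.flatten()}")
    print(f"q^pi = {q_pi.flatten()}")
```

A quick consistency check on any solution is that `pi @ q_pi` matches `v_pi`, since the state value is the policy-weighted average of the state-action values.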
