**Policy Iteration**

In this exercise, we implement Policy Iteration, an iterative method for obtaining the optimal policy, on the following MDP:

![Two-state MDP](two_state_mdp.png "Two-state MDP")

Let us start with the imports. We use only numpy and matplotlib in this exercise.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we define the parameters of the problem. The MDP has two states and two actions; the rewards, discount factor, and transition probabilities are given in the figure above. The optimal policy is already known from previous exercises:

In [2]:
gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_opt = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])
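
The arrays above appear to be indexed by state-action pair. The following is a minimal sketch of that layout, assuming the rows of `R` and `P` are ordered $(s_0, a_0), (s_0, a_1), (s_1, a_0), (s_1, a_1)$; this ordering is inferred from the structure of `pi_opt` and is not stated explicitly in the exercise. The helper `sa_index` is hypothetical and only serves as illustration.

```python
import numpy as np

# Same arrays as in the cell above
R = np.array([[-1, 0.6, 0.5, -0.9]]).T                           # shape (4, 1): reward per (s, a)
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])  # shape (4, 2): P[(s, a), s']
pi_opt = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])                  # shape (2, 4): pi[s, (s, a)]

n_states, n_actions = 2, 2


def sa_index(s, a):
    """Row of the pair (s, a) in R and P, assuming state-major ordering (hypothetical helper)."""
    return s * n_actions + a


# Each row of P is a probability distribution over next states
assert np.allclose(P.sum(axis=1), 1.0)

# Multiplying by the policy matrix collapses the action dimension
P_pi = pi_opt @ P  # shape (2, 2): state-to-state transitions under the policy
R_pi = pi_opt @ R  # shape (2, 1): expected reward per state under the policy

print(P_pi)
print(R_pi.flatten())
```

Under this reading, `pi_opt` selects action 1 in state 0 and action 0 in state 1.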

First, we obtain the optimal value function $v^{\pi^*}$ and the optimal action-value function $q^{\pi^*}$ in closed form from the fixed-point Bellman equations, so that we have a reference against which to check our implementation:
* $v^{\pi} = \left( I - \gamma \mathcal{P}^{\pi} \right)^{-1} \mathcal{R}^{\pi}$
* $q^{\pi} = \left( I - \gamma \mathcal{P} \Pi \right)^{-1} \mathcal{R}$

Here $\Pi$ is the policy matrix (`pi_opt` in the code), $\mathcal{P}^{\pi} = \Pi \mathcal{P}$, and $\mathcal{R}^{\pi} = \Pi \mathcal{R}$.

In [3]:
v_opt = (np.linalg.inv(np.eye(pi_opt.shape[0]) - gamma * pi_opt @ P) @ pi_opt @ R).flatten()
q_opt = (np.linalg.inv(np.eye(P.shape[0]) - gamma * P @ pi_opt) @ R).flatten()
with np.printoptions(precision=2, suppress=True):
    print(f"Optimal Policy = {pi_opt.flatten()}")
    print(f"v^* = {v_opt}")
    print(f"q^* = {q_opt}")

Optimal Policy = [0 1 0 0 0 0 1 0]
v^* = [5.34 5.25]
q^* = [3.79 5.34 5.25 3.9 ]
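
As a sanity check on these values, the sketch below recomputes $v^{\pi^*}$ with `np.linalg.solve` (an alternative to forming the inverse explicitly) and verifies two properties: that the result satisfies the Bellman expectation equation, and that in each state it equals the maximum of $q^{\pi^*}$ over actions. The reshape assumes state-major ordering of the state-action pairs, which is inferred from `pi_opt` rather than stated in the exercise.

```python
import numpy as np

gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
pi_opt = np.array([[0, 1, 0, 0], [0, 0, 1, 0]])

P_pi = pi_opt @ P  # state-to-state transitions under the policy
R_pi = pi_opt @ R  # expected reward per state under the policy

# Solve (I - gamma * P_pi) v = R_pi as a linear system
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
# Action values follow from a one-step lookahead on v
q = R + gamma * P @ v

# Bellman expectation equation: v = R_pi + gamma * P_pi v
assert np.allclose(v, R_pi + gamma * P_pi @ v)
# Consistency between v and q under the policy: v = Pi q
assert np.allclose(v, pi_opt @ q)
# Optimality: in each state, v equals the best action value
# (assumes rows of q are ordered state by state)
assert np.allclose(q.reshape(2, 2).max(axis=1), v.flatten())

with np.printoptions(precision=2, suppress=True):
    print(v.flatten(), q.flatten())
```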


Now we implement Policy Iteration on the state value function, following the algorithm seen in the slides:

In [4]:
n_states = 2
n_actions = 2
pi_pe = [np.zeros((n_states, n_states * n_actions))]  # Initial policy: make sure it is not the optimal one!
pi_pe[-1][0,0] = 1
pi_pe[-1][1,3] = 1
v_pi_pe = [np.zeros((n_states, 1))]  # Initialize the value function to 0

threshold = 1e-3  # Convergence threshold on the change in value during policy evaluation (PE)
theta = False  # Flag: True once Policy Iteration (PI) has converged
i = 0  # PI iterations

while not theta:
    # Step 1: policy evaluation (PE) loop

    # To be filled by the student

    # Step 2: policy improvement, then check whether the policy is already optimal

    # To be filled by the student

print('PI converged after ', i, ' iterations')


with np.printoptions(precision=2, suppress=True):  # Print the values obtained
    print(f"Policy optimal theory = {pi_opt.flatten()}")
    print(f"Policy optimal PI = {pi_pe[-1].flatten()}")
    print(f"v^* theory = {v_opt}")
    print(f"v^* PI = {v_pi_pe[-1].flatten()}")

PI converged after  2  iterations
Policy optimal theory = [0 1 0 0 0 0 1 0]
Policy optimal PI = [0. 1. 0. 0. 0. 0. 1. 0.]
v^* theory = [5.34 5.25]
v^* PI = [5.33 5.24]
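
For reference, here is one possible way to fill in the loop. It is a sketch and not necessarily the version from the slides. Policy evaluation iterates the Bellman expectation backup $v \leftarrow \Pi (\mathcal{R} + \gamma \mathcal{P} v)$ until the largest change falls below `threshold`; policy improvement acts greedily with respect to $q = \mathcal{R} + \gamma \mathcal{P} v$, and the algorithm stops when the policy no longer changes. The names `greedy_policy`, `converged`, `pi`, and `v` are illustrative (the scaffold above uses `theta`, `pi_pe`, and `v_pi_pe`), and the code assumes state-major ordering of the state-action pairs.

```python
import numpy as np

gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
n_states, n_actions = 2, 2
threshold = 1e-3


def greedy_policy(q):
    """One-hot policy matrix of shape (n_states, n_states * n_actions), greedy w.r.t. q."""
    pi = np.zeros((n_states, n_states * n_actions))
    best = q.reshape(n_states, n_actions).argmax(axis=1)
    for s in range(n_states):
        pi[s, s * n_actions + best[s]] = 1
    return pi


# Same non-optimal initial policy as in the scaffold
pi = np.zeros((n_states, n_states * n_actions))
pi[0, 0] = 1
pi[1, 3] = 1
v = np.zeros((n_states, 1))

converged = False
i = 0
while not converged:
    # Policy evaluation: repeat the Bellman expectation backup
    while True:
        v_new = pi @ (R + gamma * P @ v)
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < threshold:
            break
    # Policy improvement: act greedily on the one-step lookahead values
    pi_new = greedy_policy(R + gamma * P @ v)
    converged = np.array_equal(pi_new, pi)
    pi = pi_new
    i += 1

print(f"PI converged after {i} iterations")
with np.printoptions(precision=2, suppress=True):
    print(pi.flatten(), v.flatten())
```

Because policy evaluation stops at a finite threshold, the resulting values are only approximately equal to the closed-form ones.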


We now repeat the procedure, this time running Policy Iteration on the action-value function $q$, again following the algorithm seen in the slides:

In [5]:
n_states = 2
n_actions = 2
pi_pe = [np.zeros((n_states, n_states * n_actions))]  # Initial policy: make sure it is not the optimal one!
pi_pe[-1][0,0] = 1
pi_pe[-1][1,3] = 1
q_pi_pe = [np.zeros((n_states * n_actions, 1))]  # Initialize the value to 0

threshold = 1e-3  # Convergence threshold on the change in value during policy evaluation (PE)
theta = False  # Flag: True once Policy Iteration (PI) has converged
i = 0  # PI iterations

while not theta:
    # Step 1: policy evaluation (PE) loop

    # To be filled by the student

    # Step 2: policy improvement, then check whether the policy is already optimal

    # To be filled by the student

print('PI converged after ', i, ' iterations')

with np.printoptions(precision=2, suppress=True):  # Print the values obtained
    print(f"Policy optimal theory = {pi_opt.flatten()}")
    print(f"Policy optimal PI = {pi_pe[-1].flatten()}")
    print(f"q^* theory = {q_opt}")
    print(f"q^* PI = {q_pi_pe[-1].flatten()}")

PI converged after  2  iterations
Policy optimal theory = [0 1 0 0 0 0 1 0]
Policy optimal PI = [0. 1. 0. 0. 0. 0. 1. 0.]
q^* theory = [3.79 5.34 5.25 3.9 ]
q^* PI = [3.78 5.33 5.24 3.89]
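
For reference, the action-value variant can be sketched as follows. Again, this is one possible solution and not necessarily the version from the slides. Policy evaluation iterates $q \leftarrow \mathcal{R} + \gamma \mathcal{P} \Pi q$ until the largest change falls below `threshold`, and policy improvement takes the argmax of $q$ directly, without a lookahead step. The names `greedy_policy`, `converged`, `pi`, and `q` are illustrative (the scaffold above uses `theta`, `pi_pe`, and `q_pi_pe`), and the code assumes state-major ordering of the state-action pairs.

```python
import numpy as np

gamma = 0.9
R = np.array([[-1, 0.6, 0.5, -0.9]]).T
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
n_states, n_actions = 2, 2
threshold = 1e-3


def greedy_policy(q):
    """One-hot policy matrix of shape (n_states, n_states * n_actions), greedy w.r.t. q."""
    pi = np.zeros((n_states, n_states * n_actions))
    best = q.reshape(n_states, n_actions).argmax(axis=1)
    for s in range(n_states):
        pi[s, s * n_actions + best[s]] = 1
    return pi


# Same non-optimal initial policy as in the scaffold
pi = np.zeros((n_states, n_states * n_actions))
pi[0, 0] = 1
pi[1, 3] = 1
q = np.zeros((n_states * n_actions, 1))

converged = False
i = 0
while not converged:
    # Policy evaluation: repeat the Bellman expectation backup on q
    while True:
        q_new = R + gamma * P @ (pi @ q)
        delta = np.max(np.abs(q_new - q))
        q = q_new
        if delta < threshold:
            break
    # Policy improvement: pick the best action in each state directly from q
    pi_new = greedy_policy(q)
    converged = np.array_equal(pi_new, pi)
    pi = pi_new
    i += 1

print(f"PI converged after {i} iterations")
with np.printoptions(precision=2, suppress=True):
    print(pi.flatten(), q.flatten())
```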
