**Linear approximations: model-based prediction**

In this exercise, we use the following Random Walk to study how linear approximations work in RL. In this part, we focus on prediction when the model is known. The Random Walk is the following:

![Random Walk MDP](rw_2.png "Random Walk")

Let us start with the imports. We use NumPy.

In [7]:
import numpy as np

Next, we seed the random number generator to ensure that the results are reproducible. This is not strictly necessary at this point, but it is good practice (and when working with Deep Reinforcement Learning, it is absolutely necessary).

In [8]:
rng = np.random.default_rng(1234)
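
To illustrate what seeding buys us, here is a minimal sketch that creates two generators with the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b`, `draws_a`, `draws_b` and `same` are illustrative only and are not used elsewhere in the exercise.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Draw three uniform samples from each generator
draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

# With identical seeds, the two sequences match element by element
same = np.array_equal(draws_a, draws_b)
print(same)
```

Re-running the notebook from the top with the same seed therefore reproduces the same random numbers.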

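Next, we define the MDP using the data given in the image. The rows of `R` and `P` are indexed by state-action pair, ordered as (state 0, action 0), (state 0, action 1), (state 1, action 0), and so on. Each policy matrix has one row per state and one column per state-action pair: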

In [9]:
n_states = 4
n_actions = 2
gamma = 0.9  # Discount factor

R = np.array([.9, 0.1, .9, 0.1, 0.1, .9, 0.1, .9]).reshape([n_states * n_actions, 1])

P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])

pi_rp = np.array([[.5, .5, 0, 0, 0, 0, 0, 0],
                  [0, 0, .5, .5, 0, 0, 0, 0],
                  [0, 0, 0, 0, .5, .5, 0, 0],
                  [0, 0, 0, 0, 0, 0, .5, .5]])  # Random policy

pi_opt = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 0, 1]])  # Optimal policy
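
Before going further, it can help to check that the arrays above are valid probability tables. The following sketch rebuilds the two policies in a compact way and verifies that every row of `P` and of each policy matrix sums to one. The helper `is_row_stochastic`, the array `chosen_action` and the dictionary `checks` are hypothetical names introduced here; the `np.kron` and index-based constructions are just one alternative way of writing the same policy matrices.

```python
import numpy as np

n_states, n_actions = 4, 2

# Transition matrix copied from the cell above: one row per (state, action) pair
P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])

# Random policy: each state puts probability 0.5 on each of its two columns
pi_rp = np.kron(np.eye(n_states), np.full((1, n_actions), 0.5))

# Deterministic policy from the cell above: action 0 in states 0-1, action 1 in states 2-3
chosen_action = np.array([0, 0, 1, 1])
pi_opt = np.zeros((n_states, n_states * n_actions))
pi_opt[np.arange(n_states), np.arange(n_states) * n_actions + chosen_action] = 1.0


def is_row_stochastic(M):
    """Hypothetical helper: entries are non-negative and every row sums to one."""
    return bool(np.all(M >= 0) and np.allclose(M.sum(axis=1), 1.0))


checks = {name: is_row_stochastic(M)
          for name, M in [("P", P), ("pi_rp", pi_rp), ("pi_opt", pi_opt)]}
print(checks)
```

If any entry of `checks` were `False`, the corresponding matrix would have been mistyped.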

Next, we need the feature matrix, which you computed in a previous exercise. Copy and paste the code you used to compute it here.

In [10]:
n_features = 2
phi = np.zeros((n_states * n_actions, n_features * n_actions))

# To be filled by the student

with np.printoptions(precision=4, suppress=True):
    print("The feature matrix is:")
    print(phi)

The feature matrix is:
[[0.0967 0.088  0.     0.    ]
 [0.     0.     0.0967 0.088 ]
 [0.0997 0.0967 0.     0.    ]
 [0.     0.     0.0997 0.0967]
 [0.0967 0.0997 0.     0.    ]
 [0.     0.     0.0967 0.0997]
 [0.088  0.0967 0.     0.    ]
 [0.     0.     0.088  0.0967]]
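
The printed matrix has a block layout: row `s * n_actions + a` holds the features of state `s` in the columns reserved for action `a`, and zeros elsewhere. The sketch below reproduces only that layout. It does not compute the features: the per-state values in `phi_s` (a name assumed here) are copied from the rounded printout above, so they are approximate and should not replace the code from the previous exercise.

```python
import numpy as np

n_states, n_actions, n_features = 4, 2, 2

# ASSUMPTION: per-state features read off the rounded printout (4 decimals only)
phi_s = np.array([[0.0967, 0.0880],
                  [0.0997, 0.0967],
                  [0.0967, 0.0997],
                  [0.0880, 0.0967]])

# Place the features of state s in the column block that belongs to action a
phi = np.zeros((n_states * n_actions, n_features * n_actions))
for s in range(n_states):
    for a in range(n_actions):
        row = s * n_actions + a
        cols = slice(a * n_features, (a + 1) * n_features)
        phi[row, cols] = phi_s[s]

with np.printoptions(precision=4, suppress=True):
    print(phi)
```

With this layout, each action has its own pair of parameters, which is why the parameter vector has `n_features * n_actions` entries.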


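We now define the other matrices involved in the computation of the BPE, namely the visiting probability for each policy. The state visiting probability is the stationary distribution of the state-transition matrix induced by the policy, obtained as its left eigenvector with eigenvalue 1 and normalised to sum to one. It is then spread over the state-action pairs according to the policy.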

In [11]:
# Obtain the visiting probability for the random policy
P_rp = pi_rp @ P
w, v = np.linalg.eig(P_rp.T)
eig_unit = np.argmin(np.abs(w - 1))  # Index of the eigenvalue closest to 1
d_v_rp = v[:, eig_unit] / np.sum(v[:, eig_unit])
assert np.all(d_v_rp >= 0)
d_q_rp = np.zeros([n_states * n_actions])
d_q_rp[::2] = d_v_rp
d_q_rp[1::2] = d_v_rp
d_q_rp = d_q_rp * .5
D_q_rp = np.diag(d_q_rp)  # Diagonal matrix with the visiting probability of the random policy

# Obtain the visiting probability for the optimal policy
P_op = pi_opt @ P  # State-transition matrix under the optimal policy
w, v = np.linalg.eig(P_op.T)
eig_unit = np.argmin(np.abs(w - 1))  # Index of the eigenvalue closest to 1
d_v_op = v[:, eig_unit] / np.sum(v[:, eig_unit])
assert np.all(d_v_op >= 0)
d_q_op = np.zeros([n_states * n_actions])
d_q_op[0] = d_v_op[0]
d_q_op[2] = d_v_op[1]
d_q_op[5] = d_v_op[2]
d_q_op[7] = d_v_op[3]
D_q_op = np.diag(d_q_op)  # Diagonal matrix with the visiting probability of the optimal policy
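
As a cross-check on the eigenvector computation above, the stationary distribution can also be obtained by solving the linear system `d @ P_pi = d` together with `sum(d) = 1`. The sketch below does this by least squares. The helper `stationary_distribution` is a hypothetical name, and the two state-transition matrices are the products `pi_rp @ P` and `pi_opt @ P` written out by hand so that the block is self-contained.

```python
import numpy as np


def stationary_distribution(P_pi):
    """Hypothetical helper: solve d @ P_pi = d with sum(d) = 1 by least squares."""
    n = P_pi.shape[0]
    # Stack the balance equations (P_pi^T - I) d = 0 with the normalisation row
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


# State-to-state transitions under the random policy (pi_rp @ P)
P_rp = np.array([[.5, .5, 0, 0],
                 [.5, 0, .5, 0],
                 [0, .5, 0, .5],
                 [0, 0, .5, .5]])

# State-to-state transitions under the deterministic policy (pi_opt @ P)
P_op = np.array([[.1, .9, 0, 0],
                 [.1, 0, .9, 0],
                 [0, .9, 0, .1],
                 [0, 0, .9, .1]])

d_rp = stationary_distribution(P_rp)
d_op = stationary_distribution(P_op)

with np.printoptions(precision=4, suppress=True):
    print("Random policy:", d_rp)
    print("Optimal policy:", d_op)
```

The random policy should give a uniform distribution over the four states, while the deterministic policy concentrates most of the mass on the two middle states (approximately 0.05, 0.45, 0.45, 0.05).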

You are now ready to apply the BPE equation from the slides to obtain the approximate state-action value function and the parameters of the linear approximation.

In [12]:
# To be filled by the student: obtain q_rp_approx, omega_rp, q_op_approx, omega_opt (the Q-value function and the parameters for the random and optimal policy)

with np.printoptions(precision=2, suppress=True):
    print("Results for the random policy")
    print(f"Approximated Q-function: {q_rp_approx.flatten()}")
    print(f"Parameters: {omega_rp.flatten()}")
    print("Results for the optimal policy")
    print(f"Approximated Q-function: {q_op_approx.flatten()}")
    print(f"Parameters: {omega_opt.flatten()}")

Results for the random policy
Approximated Q-function: [5.41 4.19 5.32 4.89 4.89 5.32 4.19 5.41]
Parameters: [ 96.89 -44.9  -44.9   96.89]
Results for the optimal policy
Approximated Q-function: [9.   7.39 9.   8.43 8.43 9.   7.39 9.  ]
Parameters: [137.52 -48.78 -48.78 137.52]
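
One way to sanity-check a solver for this step is to run it with tabular (identity) features, for which a linear approximation can represent the value function exactly. The sketch below assumes that the BPE of the slides is the projected Bellman equation, whose fixed point satisfies `phi.T @ D @ (phi - gamma * P @ pi @ phi) @ omega = phi.T @ D @ R`; if the slides use a different formulation, adapt it accordingly. The function `solve_projected_bellman` and the names `phi_tab`, `D_uniform`, `q_tab` and `omega_tab` are hypothetical, and the uniform weighting is chosen only for this check.

```python
import numpy as np


def solve_projected_bellman(phi, P, pi, R, D, gamma):
    """Hypothetical helper, ASSUMING the projected Bellman fixed point:
    phi^T D (phi - gamma P pi phi) omega = phi^T D R."""
    A = phi.T @ D @ (phi - gamma * P @ pi @ phi)
    b = phi.T @ D @ R
    omega = np.linalg.solve(A, b)
    return phi @ omega, omega


n_states, n_actions, gamma = 4, 2, 0.9

# MDP data copied from the cells above
R = np.array([.9, .1, .9, .1, .1, .9, .1, .9]).reshape(-1, 1)
P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])
pi_rp = np.kron(np.eye(n_states), np.full((1, n_actions), 0.5))  # Random policy

# ASSUMPTION: identity features and uniform weights, used only for this check
phi_tab = np.eye(n_states * n_actions)
D_uniform = np.eye(n_states * n_actions) / (n_states * n_actions)

q_tab, omega_tab = solve_projected_bellman(phi_tab, P, pi_rp, R, D_uniform, gamma)

with np.printoptions(precision=2, suppress=True):
    print(q_tab.flatten())
```

Under the random policy every state has expected reward 0.5, so the exact state value is 0.5 / (1 - 0.9) = 5 and the exact Q-values are `R + 0.9 * 5`, that is 5.4 or 4.6. The approximated Q-function printed above for the random policy can be compared against these exact values to judge the quality of the two-feature approximation.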
