**Linear approximations: feature matrix**

In this exercise, we use the following Random Walk to study how linear approximations work in RL. Concretely, this part covers how to build the feature matrix on which the linear approximations rely. The Random Walk is the following:

![Random Walk](rw_2.png "Random Walk")

Let us start with the imports. We use numpy.

In [10]:
import numpy as np

Next, we seed the random number generator so that the results are reproducible. This is not strictly necessary at this point, but it is good practice (and it is essential when working with Deep Reinforcement Learning).

In [11]:
rng = np.random.default_rng(1234)
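
As a quick illustration of why seeding matters, the following sketch creates two generators with the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b`, `draws_a` and `draws_b` are illustrative and not part of the exercise.

```python
import numpy as np

# Two independent generators built from the same seed (illustrative names)
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)

# Same seed, same sequence of pseudo-random numbers
print(np.array_equal(draws_a, draws_b))
```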

The next thing to do is to define the MDP using the data given in the image as:

In [12]:
n_states = 4
n_actions = 2
gamma = 0.9  # Discount factor

R = np.array([.9, 0.1, .9, 0.1, 0.1, .9, 0.1, .9]).reshape([n_states * n_actions, 1])

P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])

pi_rp = np.array([[.5, .5, 0, 0, 0, 0, 0, 0],
                  [0, 0, .5, .5, 0, 0, 0, 0],
                  [0, 0, 0, 0, .5, .5, 0, 0],
                  [0, 0, 0, 0, 0, 0, .5, .5]])  # Random policy

pi_opt = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 0, 1]])  # Optimal policy
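
Before solving anything, it can help to sanity-check these arrays. The sketch below assumes the layout suggested by the shapes: row `s * n_actions + a` of `R` and `P` corresponds to the state-action pair `(s, a)`, and each row of a policy matrix is a distribution over the state-action pairs of one state. The helper `check_mdp` is hypothetical and not part of the exercise.

```python
import numpy as np

n_states, n_actions = 4, 2

P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])

# Block structure of the random policy: 0.5 on each action of the current state
pi_rp = np.kron(np.eye(n_states), np.full((1, n_actions), 0.5))


def check_mdp(P, pi):
    """Hypothetical helper: check shapes and that every row is a distribution."""
    assert P.shape == (n_states * n_actions, n_states)
    assert pi.shape == (n_states, n_states * n_actions)
    assert np.allclose(P.sum(axis=1), 1.0)
    assert np.allclose(pi.sum(axis=1), 1.0)
    # pi @ P is the state-to-state transition matrix induced by the policy
    P_pi = pi @ P
    assert np.allclose(P_pi.sum(axis=1), 1.0)
    return P_pi


P_pi_rp = check_mdp(P, pi_rp)
print(P_pi_rp)
```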

Now, we are ready to solve the first part of the exercise: finding the value functions (state and state-action) of both given policies, random and optimal. We will use the Bellman equations to do so.

In [13]:
def bellman_equations(pi):

    # To be filled by the student

    return v_pi, q_pi

v_rp, q_rp = bellman_equations(pi_rp)
v_opt, q_opt = bellman_equations(pi_opt)

with np.printoptions(precision=2, suppress=True):
    print(f"Random policy V: {v_rp.flatten()}")
    print(f"Random policy Q: {q_rp.flatten()}")
    print(f"Optimal policy V: {v_opt.flatten()}")
    print(f"Optimal policy Q: {q_opt.flatten()}")

Random policy V: [5. 5. 5. 5.]
Random policy Q: [5.4 4.6 5.4 4.6 4.6 5.4 4.6 5.4]
Optimal policy V: [9. 9. 9. 9.]
Optimal policy Q: [9.  8.2 9.  8.2 8.2 9.  8.2 9. ]
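
One possible way to fill in the function, given here only as a sketch, is to solve the Bellman equations in closed form. Writing $\Pi$ for the policy matrix (a notation assumed here), the linear system is $v^\pi = (I - \gamma \Pi P)^{-1} \Pi R$ and $q^\pi = R + \gamma P v^\pi$. The name `bellman_equations_sketch` is hypothetical, chosen so that it does not clash with the function you have to write.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

R = np.array([.9, .1, .9, .1, .1, .9, .1, .9]).reshape(-1, 1)
P = np.array([[.1, .9, 0, 0],
              [.9, .1, 0, 0],
              [.1, 0, .9, 0],
              [.9, 0, .1, 0],
              [0, .1, 0, .9],
              [0, .9, 0, .1],
              [0, 0, .1, .9],
              [0, 0, .9, .1]])
pi_rp = np.kron(np.eye(n_states), np.full((1, n_actions), 0.5))  # Random policy
pi_opt = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0, 0, 0, 1]])  # Optimal policy


def bellman_equations_sketch(pi):
    # Solve (I - gamma * pi P) v = pi R instead of forming the inverse explicitly
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * pi @ P, pi @ R)
    # One-step lookahead from every state-action pair
    q_pi = R + gamma * P @ v_pi
    return v_pi, q_pi


v_rp, q_rp = bellman_equations_sketch(pi_rp)
v_opt, q_opt = bellman_equations_sketch(pi_opt)
with np.printoptions(precision=2, suppress=True):
    print(v_rp.flatten(), q_rp.flatten())
    print(v_opt.flatten(), q_opt.flatten())
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a design choice for numerical stability; on a problem this small both give the same values.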


Now, let us obtain the feature representation for each of the states. We denote the states as $s \in \{1, 2, 3, 4\}$. You have to implement the `get_features` function, which returns the feature representation of a given state.

In [14]:
n_features = 2
def get_features(state, sigma=4):  # Function that returns features for each state

    # To be filled by the student

    return features

for state in range(1, n_states + 1):
    with np.printoptions(precision=4, suppress=True):
        print(f"Features for state {state}: {get_features(state)}")

Features for state 1: [0.0967 0.088 ]
Features for state 2: [0.0997 0.0967]
Features for state 3: [0.0967 0.0997]
Features for state 4: [0.088  0.0967]
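
The text does not state which feature function is expected. The printed values are, however, consistent with two Gaussian densities with standard deviation `sigma`, centered at states 2 and 3. The sketch below implements that assumption; the centers and the name `get_features_sketch` are inferred or hypothetical, not given by the exercise.

```python
import numpy as np

n_states = 4
centers = np.array([2.0, 3.0])  # Assumed centers, inferred from the printed values


def get_features_sketch(state, sigma=4):
    # Gaussian density of each center evaluated at the given state
    norm_const = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    return norm_const * np.exp(-(state - centers) ** 2 / (2 * sigma ** 2))


for state in range(1, n_states + 1):
    with np.printoptions(precision=4, suppress=True):
        print(f"Features for state {state}: {get_features_sketch(state)}")
```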


Continuing with the exercise, we now obtain the feature vector for each state-action pair.

In [15]:
phi = np.zeros((n_states * n_actions, n_features * n_actions))  # Phi matrix, whose components are going to be filled in the next loop
for si in range(n_states):
    for sa in range(n_actions):
        features = np.zeros(n_features * n_actions)  # The phi vector

        # To be filled by the student: remember to save the features in the phi matrix

        with np.printoptions(precision=4, suppress=True):
            print(f"Features for state {si + 1} and action {sa}: {features}")

Features for state 1 and action 0: [0.0967 0.088  0.     0.    ]
Features for state 1 and action 1: [0.     0.     0.0967 0.088 ]
Features for state 2 and action 0: [0.0997 0.0967 0.     0.    ]
Features for state 2 and action 1: [0.     0.     0.0997 0.0967]
Features for state 3 and action 0: [0.0967 0.0997 0.     0.    ]
Features for state 3 and action 1: [0.     0.     0.0967 0.0997]
Features for state 4 and action 0: [0.088  0.0967 0.     0.    ]
Features for state 4 and action 1: [0.     0.     0.088  0.0967]
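
The printed vectors suggest a block layout: the state features are copied into the slot of the chosen action and the remaining entries stay at zero. A minimal sketch of that construction follows, assuming Gaussian state features centered at states 2 and 3 (an inference from the printed values) and the row order `si * n_actions + a`. The names `get_features_sketch` and `centers` are hypothetical.

```python
import numpy as np

n_states, n_actions, n_features = 4, 2, 2
centers = np.array([2.0, 3.0])  # Assumed centers, inferred from the printed values


def get_features_sketch(state, sigma=4):
    # Assumed state features: Gaussian densities centered at `centers`
    norm_const = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    return norm_const * np.exp(-(state - centers) ** 2 / (2 * sigma ** 2))


phi = np.zeros((n_states * n_actions, n_features * n_actions))
for si in range(n_states):
    for a in range(n_actions):
        features = np.zeros(n_features * n_actions)
        # Place the state features in the block that belongs to action a
        features[a * n_features:(a + 1) * n_features] = get_features_sketch(si + 1)
        # Row index follows the same (state, action) ordering as R and P
        phi[si * n_actions + a] = features

with np.printoptions(precision=4, suppress=True):
    print(phi)
```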


The last part of the exercise is immediate: if everything has been done correctly, you have just obtained the required feature matrix!

In [16]:
with np.printoptions(precision=4, suppress=True):
    print("The feature matrix is:")
    print(phi)

The feature matrix is:
[[0.0967 0.088  0.     0.    ]
 [0.     0.     0.0967 0.088 ]
 [0.0997 0.0967 0.     0.    ]
 [0.     0.     0.0997 0.0967]
 [0.0967 0.0997 0.     0.    ]
 [0.     0.     0.0967 0.0997]
 [0.088  0.0967 0.     0.    ]
 [0.     0.     0.088  0.0967]]
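
To see how such a matrix is used, recall that a linear approximation represents the state-action values as $\hat{q} = \Phi w$ for some weight vector $w$. The sketch below fits $w$ by ordinary least squares against the random-policy Q values. This is only an illustration and not necessarily the method that the rest of the exercise uses. It hard-codes the rounded `phi` and Q values printed above, and the names `w` and `q_hat` are illustrative.

```python
import numpy as np

# Rounded feature matrix and random-policy Q values, copied from the printed outputs
phi = np.array([[0.0967, 0.088, 0., 0.],
                [0., 0., 0.0967, 0.088],
                [0.0997, 0.0967, 0., 0.],
                [0., 0., 0.0997, 0.0967],
                [0.0967, 0.0997, 0., 0.],
                [0., 0., 0.0967, 0.0997],
                [0.088, 0.0967, 0., 0.],
                [0., 0., 0.088, 0.0967]])
q_rp = np.array([5.4, 4.6, 5.4, 4.6, 4.6, 5.4, 4.6, 5.4]).reshape(-1, 1)

# Least-squares weights: minimize ||phi @ w - q_rp||^2
w, _, rank, _ = np.linalg.lstsq(phi, q_rp, rcond=None)
q_hat = phi @ w

with np.printoptions(precision=2, suppress=True):
    print(f"Rank of phi: {rank}")
    print(f"Approximated Q: {q_hat.flatten()}")
    print(f"Max abs error: {np.abs(q_hat - q_rp).max():.2f}")
```

With only four weights for eight state-action pairs, the fit is in general approximate rather than exact.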
