**Off-policy Monte-Carlo via Importance Sampling**

In this exercise, we implement the off-policy Monte-Carlo algorithm with importance sampling on the following MDP, a Random Walk:

![Random Walk MDP](rw.png "Random Walk MDP")

Let us start with the imports. We use only numpy in this exercise.

In [21]:
import numpy as np
%matplotlib inline

Next, we seed the random number generator so that the results are reproducible. This is not strictly necessary at this point, but it is good practice (and when working with Deep Reinforcement Learning, it is absolutely necessary).

In [22]:
rng = np.random.default_rng(1234)
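
As a small illustration of why seeding matters, the sketch below creates two generators with the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b`, `draws_a`, `draws_b`, and `same` are introduced here for illustration only and are not part of the exercise.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Draw ten actions (0 or 1) from each generator
draws_a = rng_a.integers(0, 2, size=10)
draws_b = rng_b.integers(0, 2, size=10)

# Identical seeds give identical sequences
same = np.array_equal(draws_a, draws_b)
print(same)  # True
```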

Now, let us define the parameters of the problem. We have an MDP with seven states (two of which are terminal states), and two actions. The rewards, discount factor, and transition probabilities are given in the figure above.

In [23]:
gamma = 0.9
R = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]).T
P = np.array([[1, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 1]])
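
Reading the matrices above, the rows appear to be indexed as `state * n_actions + action`, with action 0 moving right and action 1 moving left, terminal states looping onto themselves, and a reward of 1 for moving right from the sixth state. Under that assumption (inferred from `P` and `R`, not stated explicitly), the same arrays can be built by rule, which doubles as a sanity check. The names `P_check`, `R_check`, and `terminal` are hypothetical.

```python
import numpy as np

n_states, n_actions = 7, 2
terminal = (0, n_states - 1)  # assumed: first and last states are terminal

P_check = np.zeros((n_states * n_actions, n_states))
R_check = np.zeros((n_states * n_actions, 1))

for s in range(n_states):
    for a in range(n_actions):
        row = s * n_actions + a  # assumed row convention
        if s in terminal:
            s_next = s  # terminal states are absorbing
        else:
            s_next = s + 1 if a == 0 else s - 1  # assumed: 0 = right, 1 = left
        P_check[row, s_next] = 1.0
        # Reward 1 only when a non-terminal state enters the right terminal state
        if s not in terminal and s_next == n_states - 1:
            R_check[row, 0] = 1.0

# Every row of a transition matrix must sum to one
assert np.allclose(P_check.sum(axis=1), 1.0)
```

If the convention holds, `np.array_equal(P_check, P)` and `np.array_equal(R_check, R)` should both be true when run in the notebook.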

Our problem now is to estimate the value function of the target policy $\pi$, which assigns probability $2/3$ to the right action and $1/3$ to the left action, using samples generated by the uniform random policy $\mu$. Before anything else, we compute the theoretical value function $v^\pi$ with the Bellman equation.
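
For reference, the closed form used here can be written as follows, where $\Pi$ denotes the $7 \times 14$ policy matrix (the array `pi` in the code), $P$ the $14 \times 7$ transition matrix, and $R$ the $14 \times 1$ reward vector. The symbol $\Pi$ is introduced here only to distinguish the matrix from the policy $\pi$.

$$
v^\pi = \Pi R + \gamma \, \Pi P \, v^\pi
\quad\Longrightarrow\quad
v^\pi = (I - \gamma \, \Pi P)^{-1} \, \Pi R
$$

The inverse exists because $\gamma < 1$ and $\Pi P$ is a row-stochastic matrix.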

In [24]:
pi = np.array([[2/3, 1/3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 2/3, 1/3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 2/3, 1/3, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 2/3, 1/3, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 2/3, 1/3, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2/3, 1/3, 0, 0],
               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2/3, 1/3]])

v_pi = np.linalg.inv(np.eye(pi.shape[0]) - gamma * pi @ P) @ pi @ R

with np.printoptions(precision=4, suppress=True):
    print(f"v^pi = {v_pi.flatten()}")
    print(f"v^pi(4) = {v_pi[3, 0]}")

v^pi = [0.     0.2291 0.3818 0.5217 0.6787 0.8703 0.    ]
v^pi(4) = 0.5217391304347825


Now, let us implement the off-policy Monte-Carlo method with importance sampling. You should obtain a value function $v^\pi$ similar to the one provided.
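
As a reminder, one common form of the estimator is ordinary importance sampling, which seems to be what the final `np.mean(v_s4)` in the exercise assumes. For episode $i$ of length $T_i$ generated by $\mu$ from state $s$, with $r_{t+1}$ the reward received after the action at step $t$:

$$
\rho_i = \prod_{t=0}^{T_i - 1} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)},
\qquad
G_i = \sum_{t=0}^{T_i - 1} \gamma^{t} \, r_{t+1},
\qquad
\hat{v}^\pi(s) = \frac{1}{N} \sum_{i=1}^{N} \rho_i \, G_i
$$

Since $\mu$ is uniform over two actions, each factor of $\rho_i$ is either $\frac{2/3}{1/2} = 4/3$ (right) or $\frac{1/3}{1/2} = 2/3$ (left).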

In [25]:
n_states = 7
n_actions = 2

threshold = 1e-4  # Change threshold for the convergence check
delta = 1.0  # Initial difference value
v_s4 = []  # List of value functions estimated for state 4
iters = 2000  # Number of iterations

pi_probs = np.array([2/3, 1/3])  # Probabilities of the policy pi

for i in range(iters):

    # To be filled by the student: append to v_s4 the estimation of the value function for state 4 using Importance Sampling

with np.printoptions(precision=4, suppress=True):
    print(f"Value estimated for state 4 = {np.mean(v_s4)}")
    print(f"Theoretical value for state 4 = {v_pi[3, 0]}")


Value estimated for state 4 = 0.542847170693135
Theoretical value for state 4 = 0.5217391304347825
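
One possible way to fill in the loop is sketched below as a self-contained function. It is not the reference solution: it assumes ordinary importance sampling, deterministic moves as read off `P` (action 0 = right, action 1 = left), a reward of 1 on entering the right terminal state, and a uniform behavior policy. The function name `estimate_v_s4` and its parameters are hypothetical.

```python
import numpy as np

def estimate_v_s4(n_episodes=2000, gamma=0.9, seed=1234):
    """Ordinary importance-sampling estimate of v^pi at state 4 (index 3)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = 7, 2
    pi_probs = np.array([2/3, 1/3])  # target policy: P(right), P(left)
    mu_prob = 1.0 / n_actions        # behavior policy: uniform
    samples = []

    for _ in range(n_episodes):
        s = 3                # start every episode in state 4
        rho = 1.0            # importance-sampling ratio of the episode
        G = 0.0              # discounted return of the episode
        discount = 1.0
        while s not in (0, n_states - 1):
            a = rng.integers(n_actions)          # sample action from mu
            rho *= pi_probs[a] / mu_prob         # accumulate pi / mu
            s_next = s + 1 if a == 0 else s - 1  # assumed: 0 = right, 1 = left
            r = 1.0 if s_next == n_states - 1 else 0.0
            G += discount * r
            discount *= gamma
            s = s_next
        samples.append(rho * G)  # weighted return for this episode

    return float(np.mean(samples))

print(f"Value estimated for state 4 = {estimate_v_s4()}")
```

The printed number will not match the output above exactly, since it depends on how the random draws are consumed, but it should be close to the theoretical value of about 0.5217.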
