**Linear Approximation limits**

In this example, we illustrate the main weakness of linear approximation, namely its dependence on the chosen feature basis, using the following MDP:

![Two-state MDP](two_state_mdp.png)

Let us start with the imports. We use numpy and matplotlib in this example, as well as gym and tabulate.

In [8]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
try:
    import gym.spaces as gs
except ImportError:  # Install gym only if it is missing
    !pip install gym
    import gym.spaces as gs
from tabulate import tabulate

Next, we seed the random number generator so that the results are reproducible. This is not strictly necessary at this point, but it is good practice (and it becomes essential when working with Deep Reinforcement Learning).

In [9]:
rng = np.random.default_rng(1234)
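
As a small illustration of what seeding provides, the sketch below builds two generators from the same seed and checks that they produce identical draws. The names `rng_a`, `rng_b` and `same_draws` are introduced here only for the example. Note that a generator created with `default_rng` is independent of the global state used by the legacy `np.random.*` functions, so only draws made through the generator itself are controlled by its seed.

```python
import numpy as np

# Two independent generators created from the same seed
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Draw five samples from the same two-outcome distribution with each generator
draws_a = rng_a.choice(2, size=5, p=[0.8, 0.2])
draws_b = rng_b.choice(2, size=5, p=[0.8, 0.2])

# Same seed and same sequence of calls give the same samples
same_draws = bool(np.array_equal(draws_a, draws_b))
print(same_draws)
```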

We now define the MDP and use the Bellman equations to obtain the state-action value function of the random policy. This is an exact solution, which we use as the reference for the results obtained with linear approximation.

In [10]:
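# Define here the matrices given by the problem
n_states = 2
n_actions = 2

# Rows of P and R are indexed by state-action pair, in the order (x,u), (x,m), (y,u), (y,m)
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])  # P[pair, next_state]
R = np.array([[-1], [0.6], [0.5], [-0.9]])  # Reward of each state-action pair
gamma = 0.9  # Discount factor

# Now, define the random policy: pi[state, pair] is the probability of choosing that pair in that state
pi = np.zeros((n_states, n_states * n_actions))
pi[0, 0:2] = 0.5
pi[1, 2:] = 0.5

# Compute exact values using Bellman matrix equations: (I - gamma * P @ pi) @ q = R
q_pi_exact = np.linalg.solve(np.eye(P.shape[0]) - gamma * P @ pi, R)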
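
The exact solution can be sanity-checked by substituting it back into the Bellman equation `q = R + gamma * P @ pi @ q`. The sketch below repeats the MDP definition so that it runs on its own; `bellman_residual` and `v` are names introduced here for the check and are not used elsewhere in the notebook.

```python
import numpy as np

# MDP and random policy, as defined in the cell above
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
R = np.array([[-1], [0.6], [0.5], [-0.9]])
gamma = 0.9
pi = np.zeros((2, 4))
pi[0, 0:2] = 0.5
pi[1, 2:] = 0.5

# Exact state-action values from the Bellman matrix equation
q = np.linalg.solve(np.eye(4) - gamma * P @ pi, R)

# Largest violation of q = R + gamma * P @ pi @ q (should be numerically zero)
bellman_residual = np.max(np.abs(R + gamma * P @ pi @ q - q))

# State values under the policy: v(s) = sum_a pi(a|s) * q(s, a)
v = pi @ q

print(q.ravel(), v.ravel(), bellman_residual)
```

Both states have the same value under the random policy, so the differences between the four entries of `q` come only from the immediate rewards.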

To sample transitions from the MDP, we implement a Gym-like class that wraps it (this is not strictly necessary, but it is good practice, and it will be useful when we start working with Deep Reinforcement Learning).

In [11]:
class mdp(object):  # Create a gym-like class for the MDP
    def __init__(self, P, R):
        self.P = P  # Transition probabilities, one row per state-action pair
        self.R = R  # Rewards, one row per state-action pair
        self.n_states = self.P.shape[1]
        self.n_actions = int(self.P.shape[0] / self.n_states)
        self.state_space = gs.Discrete(self.n_states)  # Discrete states
        self.action_space = gs.Discrete(self.n_actions)  # Discrete actions
        self.state = None
        self.t = None  # Counter to have a max number of episode steps
        self.max_t = 100  # Max episode steps
    def reset(self):
        self.state = int(rng.integers(self.n_states))  # Set the initial state randomly, using the seeded generator
        self.t = 0
        return self.state
    def step(self, action):
        idx = self.state * self.n_actions + action  # Row of the current state-action pair in P and R
        next_state = rng.choice(np.arange(self.n_states), p=self.P[idx])  # Sample with the seeded generator
        reward = self.R[idx]
        self.t += 1
        done = self.t > self.max_t
        self.state = next_state
        return next_state, reward, done, {}  # Empty info dict, as in the gym API
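
The class stores one row of `P` and `R` per state-action pair, located at `state * n_actions + action`. The sketch below, which does not depend on gym, shows this indexing for the pair (y, u) and checks that sampled next states follow the corresponding row of `P`. The helper `row_index` is hypothetical and introduced only for this illustration.

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
n_states, n_actions = 2, 2
rng = np.random.default_rng(1234)

def row_index(state, action):
    # Hypothetical helper: row of the pair (state, action) in P and R
    return state * n_actions + action

state, action = 1, 0  # The pair (y, u)
row = row_index(state, action)

# Sample many next states from that row and compare frequencies with P[row]
samples = rng.choice(n_states, size=10_000, p=P[row])
empirical = np.bincount(samples, minlength=n_states) / samples.size
print(row, empirical)
```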

Now, we define the LSTD (Least-Squares Temporal Difference) algorithm, which we use to estimate the state-action value function of the random policy.

In [12]:
def LSTD(features, policy, env, n_episodes, feature_dim, gamma):
  gam = np.zeros((feature_dim, feature_dim))  # Accumulates feat @ feat.T
  lam = np.zeros((feature_dim, feature_dim))  # Accumulates feat @ next_feat.T
  z = np.zeros((feature_dim, 1))  # Accumulates feat * reward
  for e in range(n_episodes):
    state = env.reset()
    done = False
    action = policy(state)
    while not done:
      next_state, reward, done, _ = env.step(action)
      next_action = policy(next_state)
      feat = features(state, action)
      next_feat = features(next_state, next_action)
      gam += feat @ feat.T
      lam += feat @ next_feat.T
      z += feat * reward
      state = next_state
      action = next_action
  # Solve (gam - gamma * lam) @ omega = z for the linear approximation parameters
  return np.linalg.solve(gam - gamma * lam, z)
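
In equation form, the function above accumulates three sums over all sampled transitions and then solves a single linear system. Writing $\phi_t$ for `features(state, action)`, $\phi_{t+1}$ for `features(next_state, next_action)` and $r_t$ for the reward (this notation is introduced here to mirror the code; it is not used elsewhere in the notebook), the returned parameters are:

$$
\omega = \left( \sum_t \phi_t \phi_t^\top - \gamma \sum_t \phi_t \phi_{t+1}^\top \right)^{-1} \sum_t \phi_t \, r_t
$$

The first sum corresponds to `gam`, the second to `lam` and the third to `z`. The approximate value of a state-action pair is then $\hat{q}(s, a) = \phi(s, a)^\top \omega$.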

Finally, we define the random policy and two different sets of features to be tested. As the results show, the value function obtained with linear approximation depends strongly on the chosen feature basis. In this case, one-hot encoding is a much better choice than using the index of the state-action pair: the index feature has a single parameter, so every estimate is forced to be a multiple of the index, and the first pair (index 0) always gets the value 0.

In [13]:
def policy(state):
  return rng.choice(np.arange(n_actions), p=pi[state, state * n_actions : (state + 1) * n_actions])  # Follows given policy, sampling with the seeded generator

def features_one_hot(state, action):
  feature = np.zeros((n_states * n_actions, 1))
  feature[state * n_actions + action] = 1
  return feature

def features_index(state, action):
  feature = np.array([[state * n_actions + action]])
  return feature

# Now, use LSTD

env = mdp(P, R)
n_episodes = 500

q_lstd = []

for features in [features_one_hot, features_index]:

  feature_dim = features(0, 0).shape[0]

  omega = LSTD(features, policy, env, n_episodes, feature_dim, gamma)

  # Obtain the value function
  feat_mat = np.zeros((n_states * n_actions, feature_dim))
  for s in range(n_states):
    for a in range(n_actions):
      feat_mat[s * n_actions + a] = np.squeeze(features(s, a))

  q_lstd.append(feat_mat @ omega)  # The linear approximation is here!

values = []
values.append(['Exact'])
values[-1].extend(list(q_pi_exact))
values.append(['LSTD One-hot'])
values[-1].extend(list(q_lstd[0]))
values.append(['LSTD Index'])
values[-1].extend(list(q_lstd[1]))
print('State-action value function obtained')
print(tabulate(values, tablefmt="fancy_grid", headers=['Method', 'q(x,u)', 'q(x,m)', 'q(y,u)', 'q(y,m)']))

State-action value function obtained
╒══════════════╤══════════╤═══════════╤═══════════╤═══════════╕
│ Method       │   q(x,u) │    q(x,m) │    q(y,u) │    q(y,m) │
╞══════════════╪══════════╪═══════════╪═══════════╪═══════════╡
│ Exact        │ -2.8     │ -1.2      │ -1.3      │ -2.7      │
├──────────────┼──────────┼───────────┼───────────┼───────────┤
│ LSTD One-hot │ -2.81434 │ -1.20016  │ -1.30004  │ -2.70129  │
├──────────────┼──────────┼───────────┼───────────┼───────────┤
│ LSTD Index   │  0       │ -0.148302 │ -0.296605 │ -0.444907 │
╘══════════════╧══════════╧═══════════╧═══════════╧═══════════╛
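
The "LSTD Index" row can be anticipated without running any simulation. The sketch below replaces the sample sums of LSTD with their expectations under the long-run visitation frequencies of the random policy. This is an assumption: the simulation starts every episode from a random state and runs for a finite number of steps, so its empirical frequencies only approximate these. The function `expected_lstd` and the names `P_sa`, `d` and `D` are introduced here for illustration and are not part of the notebook.

```python
import numpy as np

# MDP and random policy from this notebook; rows are ordered (x,u), (x,m), (y,u), (y,m)
P = np.array([[0.8, 0.2], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])
R = np.array([[-1], [0.6], [0.5], [-0.9]])
gamma = 0.9
pi = np.zeros((2, 4))
pi[0, 0:2] = 0.5
pi[1, 2:] = 0.5

# Transition matrix between state-action pairs under the random policy
P_sa = P @ pi

# Long-run visitation frequencies of the pairs, obtained by power iteration
d = np.full(4, 0.25)
for _ in range(1000):
    d = d @ P_sa
D = np.diag(d)

def expected_lstd(Phi):
    """Solve the LSTD linear system with sample sums replaced by expectations.

    Phi has one row of features per state-action pair.
    """
    A = Phi.T @ D @ (Phi - gamma * P_sa @ Phi)
    b = Phi.T @ D @ R
    return Phi @ np.linalg.solve(A, b)

q_one_hot = expected_lstd(np.eye(4))                     # One-hot features
q_index = expected_lstd(np.arange(4.0).reshape(4, 1))    # Index feature
print(q_one_hot.ravel())
print(q_index.ravel())
```

With one-hot features there is one free parameter per state-action pair, and the solution of this system coincides with the exact values. With the index feature there is a single parameter, so the four estimates are 0, 1, 2 and 3 times that parameter, and they should land close to the "LSTD Index" row of the table rather than to the exact values.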
