**SARSA and Q-Learning**

In this exercise, we compare two model-free control algorithms: SARSA and Q-learning. We use the following MDP, which is a Random Walk:

![Random Walk MDP](rw.png "Random Walk MDP")

Let us start with the imports. We use only numpy and matplotlib in this exercise.

In [33]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we seed the random number generator to ensure that the results are reproducible. At this point, this is not strictly necessary, but it is good practice (and when working with Deep Reinforcement Learning, it is absolutely necessary).

In [34]:
rng = np.random.default_rng(1234)
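
As a quick illustration of what seeding buys us, the sketch below creates two generators with the same seed and checks that they produce identical draws, while a generator with a different seed does not. The names `rng_a`, `rng_b`, and `rng_c` and the seed `4321` are illustrative only and are not part of the exercise.

```python
import numpy as np

# Two generators created with the same seed yield the same sequence.
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)
draws_a = rng_a.random(5)
draws_b = rng_b.random(5)
same_seed_matches = np.array_equal(draws_a, draws_b)

# A generator with a different (arbitrary) seed yields a different sequence.
rng_c = np.random.default_rng(4321)
draws_c = rng_c.random(5)
different_seed_matches = np.array_equal(draws_a, draws_c)

print(same_seed_matches, different_seed_matches)
```

Reproducibility only holds if every random draw in the notebook goes through the seeded generator.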

Now, let us define the parameters of the problem. We have an MDP with seven states (two of which are terminal states), and two actions. The rewards, discount factor, and transition probabilities are given in the figure above. For this case, we only evaluate one policy: the uniform random policy, which assigns equal probability to all actions in all states.

In [35]:
gamma = 0.9
R = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]).T
P = np.array([[1, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 1]])
n_states = 7
n_actions = 2
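
To make the layout of `R` and `P` easier to read, the following sketch decodes them into one line per state-action pair. It assumes that row `s * n_actions + a` corresponds to state `s` and action `a`; this ordering is inferred from the array shapes (14 rows for 7 states and 2 actions) and is not stated explicitly above. The names `next_state` and `reward` are introduced here for illustration.

```python
import numpy as np

n_states, n_actions = 7, 2
R = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]).T
P = np.array([[1, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 1]])

# Every row of P must be a probability distribution over next states.
assert P.shape == (n_states * n_actions, n_states)
assert np.allclose(P.sum(axis=1), 1.0)

# Assumed layout: row index = state * n_actions + action.
# Transitions are deterministic here, so argmax recovers the next state.
next_state = P.argmax(axis=1).reshape(n_states, n_actions)
reward = R.reshape(n_states, n_actions)

for s in range(n_states):
    for a in range(n_actions):
        print(f"s={s}, a={a} -> s'={next_state[s, a]}, r={reward[s, a]}")
```

Under this reading, states 0 and 6 loop onto themselves (the terminal states), and the only nonzero reward is attached to state 5, action 1.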

Now, reuse the implementations from the previous exercises to define the two methods we are going to use for comparison: Policy Iteration and Value Iteration. Note that we only use the state-action value function.

In [36]:
def Policy_Iteration():

    # To be filled by the student

    return pi_pe  # Return the optimal policy


def Value_Iteration():  # No arguments for Value Iteration

    # To be filled by the student

    return pi_vi  # Return the optimal policy

pi_policy_iteration = Policy_Iteration()
pi_value_iteration = Value_Iteration()

# Obtain the optimal policies for each state: ensure that each policy is a vector of size n_states with the optimal action per state!
pi_policy_iteration = np.argmax(pi_policy_iteration[-1], axis=1) % n_actions
pi_value_iteration = np.argmax(pi_value_iteration, axis=1) % n_actions
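
For reference, here is a minimal sketch of value iteration on the state-action value function. It is one possible formulation and not necessarily the one from the previous exercises. The helper names `build_random_walk` and `q_value_iteration`, the tolerance `tol`, and the iteration cap `max_iter` are assumptions made for this example. The helper rebuilds `P` and `R` programmatically so the block is self-contained, assuming the row layout `state * n_actions + action` and reading action 0 as "move right" and action 1 as "move left" from the matrix above.

```python
import numpy as np

def build_random_walk(n_states=7, n_actions=2):
    """Rebuild P and R of this exercise (row = state * n_actions + action)."""
    P = np.zeros((n_states * n_actions, n_states))
    R = np.zeros((n_states * n_actions, 1))
    for s in range(n_states):
        terminal = s in (0, n_states - 1)
        # Action 0 moves right, action 1 moves left; terminal states self-loop.
        targets = (s, s) if terminal else (s + 1, s - 1)
        for a, s_next in enumerate(targets):
            P[s * n_actions + a, s_next] = 1.0
    R[5 * n_actions + 1, 0] = 1.0  # Only state 5, action 1 is rewarded
    return P, R

def q_value_iteration(P, R, gamma, n_states, n_actions, tol=1e-10, max_iter=10_000):
    """Iterate q <- R + gamma * P @ max_a q until the update is below tol."""
    q = np.zeros((n_states * n_actions, 1))
    for _ in range(max_iter):
        # Greedy state values: best action value in each state, shape (n_states, 1).
        v = q.reshape(n_states, n_actions).max(axis=1, keepdims=True)
        q_new = R + gamma * P @ v
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    return q.reshape(n_states, n_actions)

P_demo, R_demo = build_random_walk()
q_star = q_value_iteration(P_demo, R_demo, gamma=0.9, n_states=7, n_actions=2)
pi_star = np.argmax(q_star, axis=1)  # Greedy action per state
print(pi_star)
```

Because the returned array already has shape `(n_states, n_actions)`, taking `np.argmax` along `axis=1` directly yields one action per state.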

Now we implement the SARSA and Q-learning algorithms as presented in your slides. Note that you have to store the value function for each episode: the arrays `q_sarsa` and `q_ql` below reserve one slot per episode, plus one for the initial values.
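
As a reminder, the two update rules in their standard textbook form are given below; the notation in your slides may differ slightly. Here $\alpha$ is the learning rate and $\gamma$ the discount factor. SARSA bootstraps from the action $A_{t+1}$ actually selected by the behaviour policy, while Q-learning bootstraps from the greedy action in $S_{t+1}$.

```latex
% SARSA (on-policy)
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

% Q-learning (off-policy)
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```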

In [38]:
alpha = 0.02  # Learning rate (step size of each update)
n_episodes = 500  # Episodes used to learn in SARSA and Q-learning

def epsilon_greedy_policy(q, epsilon=0.1):  # Input: q for the given state
    q = q.flatten()
    if rng.random() < epsilon:  # Draw from the seeded generator so runs are reproducible
        return rng.integers(q.size)  # Explore: return an action uniformly at random
    else:
        return np.argmax(q)  # Exploit: action that maximizes q

# SARSA
print('Obtaining SARSA control...')
q_sarsa = np.zeros((n_episodes + 1, n_states, n_actions))
for e in range(n_episodes):
    # To be filled by the student: Obtain pi_sarsa, the optimal policy for the SARSA algorithm, as a vector of dimension n_states


# Q-Learning
print('Obtaining Q-Learning control...')
q_ql = np.zeros((n_episodes + 1, n_states, n_actions))
for e in range(n_episodes):
    # To be filled by the student: Obtain pi_ql, the optimal policy for the Q-learning algorithm, as a vector of dimension n_states


with np.printoptions(precision=2, suppress=True):
    print(f"Policy PI = {pi_policy_iteration.flatten()}")
    print(f"Policy VI = {pi_value_iteration.flatten()}")
    print(f"Policy SARSA = {pi_sarsa.astype(int).flatten()}")
    print(f"Policy QL = {pi_ql.astype(int).flatten()}")


Obtaining SARSA control...
Obtaining Q-Learning control...
Policy PI = [0 0 0 0 0 1 0]
Policy VI = [0 0 0 0 0 1 0]
Policy SARSA = [0 0 0 0 0 1 0]
Policy QL = [0 0 0 0 0 1 0]
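
To highlight where the two algorithms differ, the sketch below applies a single SARSA update and a single Q-learning update to the same transition. It is not a full solution to the cell above. The function names `sarsa_update` and `q_learning_update`, the toy table `q0`, and the chosen transition are all hypothetical and only serve the illustration.

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: bootstrap from the action actually taken next."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])

# Hypothetical table: all zeros except state 4, where action 0 looks better.
q0 = np.zeros((7, 2))
q0[4] = [2.0, 1.0]
q_sarsa_demo = q0.copy()
q_ql_demo = q0.copy()

# Same transition for both: s=5, a=1, r=1, s_next=4.
# We assume the behaviour policy explored and picked a_next=1 in state 4.
sarsa_update(q_sarsa_demo, 5, 1, 1.0, 4, 1, alpha=0.02, gamma=0.9)
q_learning_update(q_ql_demo, 5, 1, 1.0, 4, alpha=0.02, gamma=0.9)

print(q_sarsa_demo[5, 1], q_ql_demo[5, 1])
```

In this toy transition the Q-learning estimate moves further than the SARSA one, because Q-learning bootstraps from the best action in state 4 while SARSA uses the exploratory action that was taken. When the next action happens to be greedy, the two updates coincide.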
