**The Cliff problem**

In this exercise, we will reuse the implementations from the previous exercises to solve the Cliff problem:

![Cliff-walking grid world](cliff_problem.png "The cliff-walking problem")

Consider a particular grid-world problem, the cliff-walking problem, which is well suited to comparing SARSA and Q-learning. It is an episodic task with a single starting state and a single terminal state, denoted S and G. The action set consists of the usual moving directions for grid-world problems: $\mathcal{A}=\left\{ {\rm up},{\rm down},{\rm right},{\rm left}\right\}$. The reward is $-1$ for all state transitions, except those into the region marked as cliff, which are penalized with a highly negative reward of $-100$, and the transition into the goal state, which provides zero reward. When the agent reaches the goal state, the episode ends, and we can start a new episode. In the implementation below, falling off the cliff also ends the episode.

Let us start with the imports. We use numpy, as usual, but we also add new libraries that we will use in this exercise: gym, which is a library that contains many environments for RL, and tabulate, which is a library to print tables in a nice way.

In [12]:
import numpy as np
try:
    import gym.spaces as gs
except ImportError:  # Install gym only if it is missing
    !pip install gym
    import gym.spaces as gs
from tabulate import tabulate

Next, we seed the random number generator to ensure that the results are reproducible. At this point, this is not strictly necessary, but it is good practice (and when working with Deep Reinforcement Learning, it is absolutely necessary).

In [13]:
rng = np.random.default_rng(1234)
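
As a quick illustration of what seeding provides, the sketch below creates two generators from the same seed and checks that they produce the same draws. The names `rng_a` and `rng_b` are ours, used for illustration only.

```python
import numpy as np

# Two independent generators built from the same seed
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)

# Same seed, same sequence of draws
same_draws = np.array_equal(draws_a, draws_b)
print(same_draws)
```

Note that only draws made through the seeded generator are reproducible; the global `np.random` functions keep their own separate state.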

The next thing to do is to create the Cliff class, where we are using the Gym environment interface. There are two main methods needed:
* `step`: this method implements the environment transition, given an action. It returns the next state, the reward, whether the episode has ended, and some additional information (which we do not use here).
* `reset`: this method resets the environment to the initial state. It returns an initial state.

In [14]:
class cliff(object):  # Create the Cliff class

    def __init__(self, dims=(3, 3)):
        self.dims = dims  # Gridworld dims (rows x cols)
        self.observation_space = gs.Box(low=np.array([0, 0]), high=np.array([self.dims[0] - 1, self.dims[1] - 1]), dtype=int)  # The observations are the coordinates of each grid cell
        self.action_space = gs.Discrete(4)  # There are 4 possible actions: 0=down, 1=up, 2=left, 3=right
        self.init_state = (self.dims[0] - 1, 0)  # Initial state, always the same for simplicity: lower left corner
        self.target_state = (self.dims[0] - 1, self.dims[1] - 1)  # Target state: always the same for simplicity: lower right corner
        self.cliff_location = [(self.dims[0] - 1, i) for i in range(1, self.dims[1] - 1)]  # Lower row (except for target and initial state)
        self.state = None  # This is to store data later on

    def reset(self, randomize=False):  # Call this method to reset the environment
        if randomize:  # To use a random initial state
            self.state = self.observation_space.sample()
        else:
            self.state = self.init_state  # Use a fixed initial state
        return self.state

    def step(self, action):  # This method implements the environment transition

        action = np.squeeze(action).astype(int).item()  # Prepare action (make it an integer just in case)
        if action < 0 or action > 3:  # Check the action bounds
            raise RuntimeError('Action out of bounds')
        # Perform action to get next state (i.e., move the agent in the grid world)
        if action == 0:
            next_state = self.state + np.array([1, 0])  # Move down
        elif action == 1:
            next_state = self.state + np.array([-1, 0])  # Move up
        elif action == 2:
            next_state = self.state + np.array([0, -1])  # Move left
        elif action == 3:
            next_state = self.state + np.array([0, 1])  # Move right
        # Clip the next state to the grid bounds, as we may end up out of the grid
        next_state = np.clip(next_state, np.zeros(2), np.array(self.dims) - 1).astype(int)
        # Check reward and termination condition
        reward, done = self.reward_done(next_state)
        self.state = next_state  # Change the state in the class
        return next_state, reward, done, {}  # Return next state, reward and whether episode has ended

    def from_index(self, index):  # Ancillary method: convert an index state to state coordinates
        return np.unravel_index(index, self.dims)

    def to_index(self, state):  # Ancillary method: convert state coordinates to index
        return np.ravel_multi_index(state, self.dims)

    def reward_done(self, state):  # Ancillary method: reward and termination flag for a given state
        if tuple(state) in self.cliff_location:  # We fall off the cliff
            reward = -100  # Highly penalizing reward: you die
            done = True  # Episode is terminated
        elif np.sum(np.abs(state - np.array(self.target_state))) < 1e-4:  # Final target found
            reward = 0  # Final reward is 0
            done = True  # Episode is terminated
        else:  # We have neither reached the target nor fallen off the cliff
            reward = -1  # Standard reward (takes another step)
            done = False  # Episode not terminated yet
        return reward, done

    def action_to_str(self, action):  # Ancillary method: convert an action (or list of actions) to text
        def ac2str(action):
            if action == 0:
                return 'down'
            if action == 1:
                return 'up'
            if action == 2:
                return 'left'
            if action == 3:
                return 'right'
        if isinstance(action, list):
            return [ac2str(a) for a in action]
        else:
            return ac2str(action)

Now, we are ready to create the environment and get ready for the RL algorithms:

In [15]:
env = cliff(dims=(4, 12))  # Environment to work (the cliff)
n_states = np.prod(env.dims)  # Number of states, |S|
n_actions = 4  # Number of actions, |A|
gamma = 0.9
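
The tabular algorithms index states with a single integer, while the environment works with (row, column) coordinates. As a small sketch of the mapping that `to_index` and `from_index` perform for this 4 x 12 grid, the snippet below calls the same NumPy functions directly. The variable names are ours, for illustration only.

```python
import numpy as np

dims = (4, 12)  # Same grid size as the environment above

# Coordinates -> index (row-major order: index = row * n_cols + col)
start_index = int(np.ravel_multi_index((3, 0), dims))   # Lower left corner
goal_index = int(np.ravel_multi_index((3, 11), dims))   # Lower right corner
cliff_indices = [int(np.ravel_multi_index((3, c), dims)) for c in range(1, 11)]

# Index -> coordinates
start_coords = tuple(int(i) for i in np.unravel_index(start_index, dims))

print(start_index, goal_index)  # 36 47
print(cliff_indices)            # [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]
print(start_coords)             # (3, 0)
```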

We are now ready to implement SARSA and Q-learning. Note that you may reuse the code that you have already implemented in a previous exercise, although you may need to adapt to the new environment interface.
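
As a reminder, the two methods differ only in the value they bootstrap from: SARSA uses the action actually selected in the next state, while Q-learning uses the greedy action. The sketch below shows this on a toy Q-table. It is not the exercise solution, and the helper names `sarsa_target`, `q_learning_target` and `td_update` are hypothetical.

```python
import numpy as np

def sarsa_target(q, reward, next_state, next_action, gamma, done):
    # On-policy target: bootstrap from the action actually taken next
    bootstrap = 0.0 if done else q[next_state, next_action]
    return reward + gamma * bootstrap

def q_learning_target(q, reward, next_state, gamma, done):
    # Off-policy target: bootstrap from the best action in the next state
    bootstrap = 0.0 if done else np.max(q[next_state])
    return reward + gamma * bootstrap

def td_update(q, state, action, target, alpha):
    # Move q[state, action] a fraction alpha towards the target (in place)
    q[state, action] += alpha * (target - q[state, action])
    return q

# Toy Q-table with 2 states and 2 actions (values chosen for illustration)
q_toy = np.zeros((2, 2))
q_toy[1] = [-1.0, -5.0]

t_sarsa = sarsa_target(q_toy, reward=-1, next_state=1, next_action=1, gamma=0.9, done=False)
t_ql = q_learning_target(q_toy, reward=-1, next_state=1, gamma=0.9, done=False)
print(t_sarsa, t_ql)  # Approximately -5.5 and -1.9

q_after = td_update(q_toy.copy(), state=0, action=0, target=t_sarsa, alpha=0.5)
print(q_after[0, 0])  # Approximately -2.75
```

When the exploratory action is a bad one (here, the action with value -5), the SARSA target reflects it, which is why SARSA tends to learn to stay away from the cliff.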

In [16]:
# Implement SARSA and Q-Learning
alpha = 0.02  # Learning rate (step size of each update)
n_episodes = 5000  # Episodes used to update

def epsilon_greedy_policy(q, epsilon=0.1):  # Input: q for the given state
    if rng.random() < epsilon:  # Use the seeded generator so that runs are reproducible
        return rng.integers(q.size)  # Return an action uniformly at random
    else:
        return np.argmax(q)  # Action that maximizes q


# SARSA
print('Obtaining SARSA control...')
q_sarsa = np.zeros((n_episodes + 1, n_states, n_actions))
for e in range(n_episodes):
    pass  # To be filled by the student

# Q-Learning
print('Obtaining Q-Learning control...')
q_ql = np.zeros((n_episodes + 1, n_states, n_actions))
for e in range(n_episodes):
    pass  # To be filled by the student

# For the next part to work, you must provide as output pi_sarsa and pi_ql, which are the policies obtained by SARSA and Q-learning, respectively, as numpy arrays of dimension (n_states, ).

pi_sarsa_table = [['' for _ in range(env.dims[1])] for _ in range(env.dims[0])]
pi_ql_table = [['' for _ in range(env.dims[1])] for _ in range(env.dims[0])]
for s in range(n_states):
    state = env.from_index(s)
    pi_sarsa_table[state[0]][state[1]] = env.action_to_str(np.argmax(q_sarsa[-1, s, :]))
    pi_ql_table[state[0]][state[1]] = env.action_to_str(np.argmax(q_ql[-1, s, :]))

print('Policy for SARSA')
print(tabulate(pi_sarsa_table, tablefmt="fancy_grid"))
print('Policy for Q-learning')
print(tabulate(pi_ql_table, tablefmt="fancy_grid"))

Obtaining SARSA control...
Obtaining Q-Learning control...
Policy for SARSA
╒═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤═══════╤══════╕
│ left  │ right │ right │ right │ down  │ right │ right │ right │ right │ right │ right │ down │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼──────┤
│ right │ right │ right │ right │ right │ right │ right │ right │ right │ right │ right │ down │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼──────┤
│ up    │ up    │ up    │ up    │ right │ up    │ up    │ up    │ up    │ right │ right │ down │
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼──────┤
│ up    │ down  │ down  │ down  │ down  │ down  │ down  │ down  │ down  │ down  │ down  │ down │
╘═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧═══════╧══════╛
Policy for Q-learning
╒═══════╤═══════╤═══════╤════

Finally, let us compute the discounted return of a trajectory that follows each learned policy. Note that the SARSA return is lower (more negative) because its safer path, farther from the cliff, takes more steps.

In [17]:
total_rwd = [0, 0]
for i, pi in enumerate([pi_sarsa, pi_ql]):
    state = env.to_index(env.reset())
    done = False
    k = 0
    while not done:
        action = pi[state]
        next_state, reward, done, _ = env.step(action)
        total_rwd[i] += reward * gamma ** k
        k += 1
        state = env.to_index(next_state)

print('Trajectory values: SARSA = ', total_rwd[0], '; Q-learning = ', total_rwd[1])

Trajectory values: SARSA =  -7.7123207545039 ; Q-learning =  -7.175704635190001
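
These values can be checked by hand, since every transition except the last one yields a reward of $-1$. The sketch below assumes that the Q-learning policy follows the shortest path along the cliff edge (13 transitions) and that the SARSA policy follows the path one row higher shown in the table above (15 transitions). The function name `discounted_return` is ours.

```python
gamma = 0.9

def discounted_return(n_steps, gamma):
    # Path with n_steps transitions: reward -1 on each, except 0 on the last one
    return sum(-1.0 * gamma ** k for k in range(n_steps - 1))

shortest = discounted_return(13, gamma)  # up, 11 x right, down
safer = discounted_return(15, gamma)     # up, up, 11 x right, down, down
print(shortest, safer)  # Approximately -7.1757 and -7.7123
```

Equivalently, a path with $n$ transitions has return $-(1-\gamma^{n-1})/(1-\gamma)$, so each extra step makes the return more negative.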
