**Gym environment: the Cliff problem**

In this exercise, we are going to explain the Gym interface for RL problems, using the Cliff problem:

![alt text](cliff_problem.png "Title")

The purpose of this example is to show how you can implement a basic MDP (Markov Decision Process) environment using the Gym library, and to illustrate some key concepts that have been explained in class.

First, we import the libraries we are going to use. You should be familiar with all of them except Gymnasium (previously known as Gym), which is the *de facto* standard for RL environments in Python. You may check its [GitHub repository](https://github.com/Farama-Foundation/Gymnasium) for more information.

In [6]:
import numpy as np
try:
    import gymnasium.spaces as gs
except ImportError:
    !pip install gymnasium
    import gymnasium.spaces as gs
from tabulate import tabulate

Next, we are going to create the Cliff class. The key methods that we are going to need are:
* **reset**: it restarts the environment and returns the initial state (S, by default), together with an info placeholder.
* **step**: this method takes an action and performs everything needed to simulate one environment step. It returns the next state, the reward, the `terminated` and `truncated` flags that signal whether the episode has ended (i.e., we have either reached the target state G or fallen off the cliff), and an info dictionary.

While the two methods above are required in any Gym environment, the remaining methods are specific to this problem:
* **from_index** and **to_index** change between the state coordinates in the 2-D grid and a scalar index to identify the state.
* **reward_done** computes the reward and whether the episode has ended, depending on the state.
* **action_to_str** takes as input the integer identifying the action and returns a string for human readability.
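
To make the coordinate/index conversion concrete, here is a minimal sketch that uses the same NumPy functions as **from_index** and **to_index** (`np.unravel_index` and `np.ravel_multi_index`). The variable names and the 4 x 12 grid size are chosen for illustration only; they match the grid instantiated later in this exercise.

```python
import numpy as np

dims = (4, 12)  # Grid size (rows x cols), assumed here for illustration

start = (dims[0] - 1, 0)           # Lower left corner
goal = (dims[0] - 1, dims[1] - 1)  # Lower right corner

# Coordinates -> scalar index (row-major order: index = row * n_cols + col)
start_idx = np.ravel_multi_index(start, dims)
goal_idx = np.ravel_multi_index(goal, dims)
print(start_idx, goal_idx)  # 36 47

# Scalar index -> coordinates (inverse operation)
goal_coords = np.unravel_index(goal_idx, dims)
print(tuple(int(c) for c in goal_coords))  # (3, 11)
```

The scalar index is convenient for tabular methods, where each state needs a single integer to address a row of a table.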

In [7]:
class cliff(object):  # Create the Cliff class

    def __init__(self, dims=(3, 3), seed=1234):
        self.dims = dims  # Cliff dims (rows x cols)
        self.observation_space = gs.Box(low=np.array([0, 0]), high=np.array([self.dims[0] - 1, self.dims[1] - 1]), dtype=int, seed=seed)  # The observations are the coordinates of each grid cell
        self.action_space = gs.Discrete(4, seed=seed)  # There are 4 possible actions: 0=down, 1=up, 2=left, 3=right
        self.init_state = (self.dims[0] - 1, 0)  # Initial state, always the same for simplicity: lower left corner
        self.target_state = (self.dims[0] - 1, self.dims[1] - 1)  # Target state: always the same for simplicity: lower right corner
        self.cliff_location = [(self.dims[0] - 1, i) for i in range(1, self.dims[1] - 1)]  # Lower row (except for target and initial state)
        self.state = None  # This is to store data later on

    def reset(self, randomize=False):  # Call this method to reset the environment
        if randomize:  # To use a random initial state
            self.state = self.observation_space.sample()
        else:
            self.state = self.init_state  # Use a fixed initial state
        return self.state, None  # The None is provided to keep consistency with Gymnasium, which outputs the state and info

    def step(self, action):  # This method implements the environment transition

        action = np.squeeze(action).astype(int).item()  # Prepare action (make it an integer just in case)
        if action < 0 or action > 3:  # Check the action bounds
            raise RuntimeError('Action out of bounds')
        # Perform action to get next state (i.e., move the agent in the grid world)
        if action == 0:
            next_state = self.state + np.array([1, 0])  # Move down
        elif action == 1:
            next_state = self.state + np.array([-1, 0])  # Move up
        elif action == 2:
            next_state = self.state + np.array([0, -1])  # Move left
        elif action == 3:
            next_state = self.state + np.array([0, 1])  # Move right
        # Clip the next state to the grid bounds, as we may end up out of the grid
        next_state = np.clip(next_state, np.zeros(2), np.array(self.dims) - 1).astype(int)
        # Check reward and termination condition
        reward, done = self.reward_done(next_state)
        self.state = next_state  # Change the state in the class
        terminated, truncated = done, False  # Gymnasium convention: truncated is reserved for time limits, which this environment does not have
        return next_state, reward, terminated, truncated, {}  # Return next state, reward and whether episode has ended

    def from_index(self, index):  # Ancillary method: convert an index state to state coordinates
        return np.unravel_index(index, self.dims)

    def to_index(self, state):  # Ancillary method: convert state coordinates to index
        return np.ravel_multi_index(state, self.dims)

    def reward_done(self, state):
      if tuple(state) in self.cliff_location:  # We fall off the cliff
        reward = -100  # Highly penalizing reward: you die
        done = True  # Episode is terminated
      elif np.sum(np.abs(state - np.array(self.target_state))) < 1e-4:  # Final target found
        reward = 0  # Final reward is 0
        done = True  # Episode is terminated
      else:  # We have neither reached the target nor fallen off the cliff
        reward = -1  # Standard reward (takes another step)
        done = False  # Episode not terminated yet
      return reward, done

    def action_to_str(self, action):
      def ac2str(action):
        if action == 0:
          return 'down'
        if action == 1:
          return 'up'
        if action == 2:
          return 'left'
        if action == 3:
          return 'right'
      if isinstance(action, list):
        return [ac2str(a) for a in action]
      else:
        return ac2str(action)

The next thing we do is seed the random number generator. This is done to ensure that the results are reproducible. At this point it is not strictly necessary, but it is good practice (and when working with Deep Reinforcement Learning, it is absolutely necessary).

In [8]:
seed = 42
rng = np.random.default_rng(seed)
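
As a quick illustration of why seeding gives reproducible results, the following sketch creates two independent NumPy generators with the same seed and checks that they produce the same sequence of random actions. The names `rng_a`, `rng_b`, `draws_a` and `draws_b` are illustrative and are not used elsewhere in this exercise.

```python
import numpy as np

seed = 42
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

# Draw 10 random actions (integers in {0, 1, 2, 3}) from each generator
draws_a = rng_a.integers(0, 4, size=10)
draws_b = rng_b.integers(0, 4, size=10)

# Same seed, same sequence of draws
print(np.array_equal(draws_a, draws_b))  # True
```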

Now that we have created our Cliff class, we can instantiate it and show an example of the Grid parameters:
* The state index, where each state is represented by an integer number.
* The rewards, which are $-1$ everywhere except in the Cliff region, where we set a very negative reward ($-100$), and at the target state, whose reward is set to $0$ (this is optional; we could have kept the $-1$ reward at the target state too).
* The final states, which comprise the Cliff states (bad reward) and the target state (zero reward).

Take your time comparing the different values to make sure that you understand the problem. You can also play around with the dimensions of the Cliff and see how the representations change.

In [9]:
env = cliff(dims=(4, 12), seed=seed)

# Show reward map
n_states = np.prod(env.dims)  # Obtain the number of states in the grid
reward_grid = np.zeros(env.dims)  # To store the reward of each state
done_grid = np.zeros(env.dims).astype(bool)  # To store terminal states
state_grid = np.zeros(env.dims)  # To store the index of each state

for s in range(n_states):  # Check all states
  state = env.from_index(s)  # Convert the index to coordinate
  r, d = env.reward_done(state)
  reward_grid[state[0], state[1]] = r
  done_grid[state[0], state[1]] = d
  state_grid[state[0], state[1]] = s

print('State index in the cliff')
print(tabulate(state_grid, tablefmt="fancy_grid"))
print('Rewards in the cliff')
print(tabulate(reward_grid, tablefmt="fancy_grid"))
print('Terminal states in the cliff')
print(tabulate(done_grid, tablefmt="fancy_grid"))

State index in the cliff
╒════╤════╤════╤════╤════╤════╤════╤════╤════╤════╤════╤════╕
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 12 │ 13 │ 14 │ 15 │ 16 │ 17 │ 18 │ 19 │ 20 │ 21 │ 22 │ 23 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 24 │ 25 │ 26 │ 27 │ 28 │ 29 │ 30 │ 31 │ 32 │ 33 │ 34 │ 35 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 36 │ 37 │ 38 │ 39 │ 40 │ 41 │ 42 │ 43 │ 44 │ 45 │ 46 │ 47 │
╘════╧════╧════╧════╧════╧════╧════╧════╧════╧════╧════╧════╛
Rewards in the cliff
╒════╤══════╤══════╤══════╤══════╤══════╤══════╤══════╤══════╤══════╤══════╤════╕
│ -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │ -1 │
├────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼────┤
│ -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │   -1 │ -1 │
├────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────
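
As a worked example of the reward structure described above, the sketch below rebuilds the reward grid with plain NumPy and adds up the rewards collected along the shortest safe path (one step up, then right along the row above the cliff, then one step down onto the target). It is a standalone illustration that does not use the `cliff` class; the names `rewards`, `moves`, `path` and `total_return` are introduced here for this example only. As in the `step` method, the reward is the one associated with the state the agent moves into.

```python
import numpy as np

dims = (4, 12)  # Grid size (rows x cols)

# Reward grid: -1 everywhere, -100 on the cliff, 0 on the target
rewards = np.full(dims, -1)
rewards[-1, 1:-1] = -100
rewards[-1, -1] = 0

# Displacement associated with each action
moves = {'down': (1, 0), 'up': (-1, 0), 'left': (0, -1), 'right': (0, 1)}

# Shortest safe path: up, 11 times right, down
path = ['up'] + ['right'] * (dims[1] - 1) + ['down']

state = np.array([dims[0] - 1, 0])  # Start at the lower left corner
total_return = 0
for a in path:
    state = np.clip(state + np.array(moves[a]), 0, np.array(dims) - 1)
    total_return += int(rewards[state[0], state[1]])

print(len(path), total_return)  # 13 -12
```

The path takes 13 steps: the first 12 each give a reward of $-1$ and the last one, which enters the target, gives $0$, so the undiscounted sum of rewards is $-12$.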

Finally, we are going to see how to obtain a trajectory by using a random policy. We use a loop that you should become familiar with, as it will be used very frequently in the rest of the course:
* First, we obtain the initial state by calling the *reset* method.
* Then, we enter a loop which lasts until the *done* flag is active, which means that the episode is over. Note that the loop is needed, as we do not know a priori the episode length.
* In each loop iteration, we first select an action, then use that action in the environment by calling the *step* method, and then save all the information for representation.

Run the next cell and observe the output. Note that every time you run the cell, you will obtain a different trajectory, with a different length. You may check that only a tiny proportion of random trajectories do not end in the Cliff, so we need a more intelligent policy, which is what we will develop in the rest of the course.

In [10]:
# Example of trajectory for random policy
states = [env.reset()[0]]  # Obtain initial state with reset() method
actions = []
rewards = []
dones = []
done = False  # This controls when to end iterating
while not done:
  action = env.action_space.sample()  # Randomly sample one of the possible actions
  state, reward, terminated, truncated, _ = env.step(action)
  done = terminated or truncated
  states.append(state)
  actions.append(action)
  rewards.append(reward)
  dones.append(done)

# First, show the state indices
print('State index in the grid')
print(tabulate(state_grid, tablefmt="fancy_grid"))
# Now, show the trajectory values
states_index = [env.to_index(s) for s in states]
print('States visited in trajectory')
print(states_index)
print('\n')
print('State / Action / Reward / Done / Next state')
pairs = list(zip(states_index, env.action_to_str(actions), rewards, dones, states_index[1:]))
for p in pairs:
  print(p)

State index in the grid
╒════╤════╤════╤════╤════╤════╤════╤════╤════╤════╤════╤════╕
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 12 │ 13 │ 14 │ 15 │ 16 │ 17 │ 18 │ 19 │ 20 │ 21 │ 22 │ 23 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 24 │ 25 │ 26 │ 27 │ 28 │ 29 │ 30 │ 31 │ 32 │ 33 │ 34 │ 35 │
├────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
│ 36 │ 37 │ 38 │ 39 │ 40 │ 41 │ 42 │ 43 │ 44 │ 45 │ 46 │ 47 │
╘════╧════╧════╧════╧════╧════╧════╧════╧════╧════╧════╧════╛
States visited in trajectory
[36, 36, 37]


State / Action / Reward / Done / Next state
(36, 'down', -1, False, 36)
(36, 'right', -100, True, 37)
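
One way to check how rarely a random policy reaches the target is to simulate many episodes and count the successes. The following is a minimal standalone sketch under stated assumptions: it re-implements the grid dynamics with plain NumPy instead of using the `cliff` class, and the function name `random_episode_reaches_goal`, the `max_steps` safety cap and the number of episodes are all choices made for this illustration.

```python
import numpy as np

def random_episode_reaches_goal(rng, dims=(4, 12), max_steps=10_000):
    """Run one episode with a uniformly random policy.

    Returns True if the target is reached, False if the agent falls off
    the cliff (or if the max_steps safety cap is hit).
    """
    moves = np.array([[1, 0], [-1, 0], [0, -1], [0, 1]])  # down, up, left, right
    upper = np.array(dims) - 1
    state = np.array([dims[0] - 1, 0])  # Start at the lower left corner
    for _ in range(max_steps):
        # Sample a random action and clip the next state to the grid
        state = np.clip(state + moves[rng.integers(4)], 0, upper)
        row, col = int(state[0]), int(state[1])
        if row == dims[0] - 1 and col == dims[1] - 1:
            return True  # Target reached
        if row == dims[0] - 1 and 1 <= col <= dims[1] - 2:
            return False  # Fell off the cliff
    return False

rng = np.random.default_rng(42)
n_episodes = 2000
successes = sum(random_episode_reaches_goal(rng) for _ in range(n_episodes))
goal_fraction = successes / n_episodes
print(f"Fraction of random episodes reaching the target: {goal_fraction:.4f}")
```

Increasing `n_episodes` gives a more precise estimate, at the cost of a longer run time.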
