# RL-Lab Tutorial: Working Environment

Welcome to the Reinforcement Learning Lab! This is an introductory tutorial for you to familiarize yourself with OpenAI Gym and the first environment for the exercises.

## OpenAI Gym environments

The environment **Dangerous GridWorld** is shown in the following figure.

<img src="images/environment_1.png" width="400">

The agent starts in cell $0$ and has to reach cell $48$, while the cells with the *skull* are the dangerous cells that cause the agent to lose the game. 
The grey cells represent walls that the robot cannot cross. 
The robot can move in $4$ directions: *LEFT*, *RIGHT*, *UP*, and *DOWN*. 

However, the robot **doesn't work very well!** It follows the commands only *90%* of the time. In the other *10%*, it performs a random action selected from the available options, so be careful!
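
The noisy execution described above can be sketched in a few lines. This is only an illustration, not the environment's actual code: the function name `executed_action` is hypothetical, and the sketch assumes that the random action is drawn uniformly from all four actions (the tutorial does not say whether the commanded action is excluded from the draw).

```python
import random

N_ACTIONS = 4    # LEFT, RIGHT, UP, DOWN
P_FOLLOW = 0.9   # probability that the command is obeyed


def executed_action(commanded, rng):
    """Return the action id the robot actually performs.

    Assumption: in the remaining 10% of cases the action is drawn
    uniformly from all four actions (possibly the commanded one again).
    """
    if rng.random() < P_FOLLOW:
        return commanded
    return rng.randrange(N_ACTIONS)


rng = random.Random(0)  # fixed seed so the run is reproducible
runs = 10_000
followed = sum(executed_action(1, rng) == 1 for _ in range(runs))
followed_fraction = followed / runs
print(f"Command followed in {followed_fraction:.1%} of {runs} runs")
```

Under this assumption the commanded action is executed slightly more than 90% of the time, because the random draw can also land on it.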

To use the environment, we first need to import it. 
Notice that, due to the structure of this repository, we need to add the *tools* directory (located in the parent directory) to the path.

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('../tools'))
if module_path not in sys.path: sys.path.append(module_path)

from DangerousGridWorld import GridWorld

Then we can generate a new environment **Dangerous GridWorld** and render it:

In [2]:
env = GridWorld()
env.render()

[S] [ ] [ ] [ ] [ ] [ ] [X] 
[ ] [W] [W] [W] [X] [ ] [X] 
[ ] [ ] [W] [W] [X] [ ] [X] 
[W] [ ] [W] [W] [X] [ ] [X] 
[ ] [ ] [W] [W] [X] [ ] [X] 
[ ] [W] [W] [W] [X] [ ] [X] 
[ ] [ ] [ ] [ ] [ ] [ ] [G] 


The render is a matrix with cells of different types:
* *S* - Start Cell
* *G* - Goal Cell
* *W* - Wall Cells
* *X* - Death Cells
* (empty) - Free Cells

An environment has some useful variables:
* *action_space* - number of possible actions (i.e., $4$)
* *observation_space* - number of possible observations (states), identified by a range of integers (i.e., $49$)
* *actions* - mapping between action ids and their descriptions
* *start_state* - start state (unique)
* *goal_state* - goal state (unique)

In **Dangerous GridWorld** we have 4 different possible actions, numbered from 0 to 3:

In [3]:
print( env.action_space )

4


And they are *Left, Right, Up, Down*

In [4]:
print( env.actions )

{0: 'L', 1: 'R', 2: 'U', 3: 'D'}


States are numbered from 0 to 48, for a total of 49 states:

In [5]:
print( env.observation_space )

49


There are also some methods:
* *render()* - renders the environment
* *pos_to_state(x, y)* - returns the state id given its position in $x$ and $y$ coordinates
* *state_to_pos(state)* - returns the coordinates $(x, y)$ given a state id
* *is_terminal(state)* - returns True if the given *state* is terminal (goal or death)
* *evaluate_policy(policy)* - returns the average cumulative reward of 10 runs following the given policy
* *render_policy()* - renders the policy, showing the selected action for each cell 
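
The ids used in this tutorial are consistent with numbering the cells of the 7x7 grid row by row, starting from the top-left corner. The sketch below reproduces that mapping with standalone functions; the names `sketch_pos_to_state` and `sketch_state_to_pos` are hypothetical and the code is an assumption about the numbering scheme, not the environment's own implementation.

```python
GRID_WIDTH = 7   # the rendered grid has 7 columns
GRID_HEIGHT = 7  # and 7 rows, for 49 states in total


def sketch_pos_to_state(x, y):
    """Map (x, y) coordinates to a state id, counting row by row."""
    return y * GRID_WIDTH + x


def sketch_state_to_pos(state):
    """Map a state id back to its (x, y) coordinates."""
    y, x = divmod(state, GRID_WIDTH)
    return (x, y)


print(sketch_pos_to_state(3, 0))  # 3
print(sketch_state_to_pos(48))    # (6, 6)
print(sketch_state_to_pos(7))     # (0, 1)
```

Here $x$ is the column index and $y$ is the row index, so moving *DOWN* by one cell adds 7 to the state id.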

For example, we can retrieve the ids and positions of both the start and goal states:

In [6]:
start = env.start_state
goal = env.goal_state
print("Start id: {}\tGoal id: {}".format(start, goal))
print("Start position: {}\tGoal position: {}".format(env.state_to_pos(start), env.state_to_pos(goal)))
print("Id of state (3, 0): {}".format(env.pos_to_state(3, 0)))
print()
env.render()

Start id: 0	Goal id: 48
Start position: (0, 0)	Goal position: (6, 6)
Id of state (3, 0): 3

[S] [ ] [ ] [ ] [ ] [ ] [X] 
[ ] [W] [W] [W] [X] [ ] [X] 
[ ] [ ] [W] [W] [X] [ ] [X] 
[W] [ ] [W] [W] [X] [ ] [X] 
[ ] [ ] [W] [W] [X] [ ] [X] 
[ ] [W] [W] [W] [X] [ ] [X] 
[ ] [ ] [ ] [ ] [ ] [ ] [G] 


In some cases, it can be necessary to know if a state is **terminal**. In general, the **goal state** and the death states are terminal. Using the *is_terminal(state)* function is a fast method to obtain this information. For example:

In [7]:
is_1_terminal = env.is_terminal(1)
is_6_terminal = env.is_terminal(6)
is_48_terminal = env.is_terminal(48)

print( f"The state 1 ({env.state_to_pos(1)}), is terminal? {is_1_terminal}" )
print( f"The state 6 ({env.state_to_pos(6)}), is terminal? {is_6_terminal}" )
print( f"The state 48 ({env.state_to_pos(48)}), is terminal? {is_48_terminal}" )

The state 1 ((1, 0)), is terminal? False
The state 6 ((6, 0)), is terminal? True
The state 48 ((6, 6)), is terminal? True


## Key Methods: *sample()* vs *transition_prob()*

In **Dangerous GridWorld**, there are two key methods to navigate the environment:
* *sample(action, state)* - returns a new state sampled from the ones that can be reached from *state* by performing *action*, both given as ids (note that the action comes first)
* *transition_prob(state, action, next_state)* - returns the probability of reaching the state *next_state*, starting from *state* and selecting the action *action*

In some cases, we only want to analyze the transition table (e.g., policy/value iteration), so we can use the function **transition_prob** to obtain the probability of reaching a state. In other cases, we want to actually move the agent in the environment (e.g., Monte Carlo tree search or the testing phase), and we use the function **sample** to try to execute the action and see what happens *(remember, the robot will follow your instructions only 90% of the time!)*.
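
The difference between the two access styles can be illustrated on a toy example that does not use the environment at all. Everything here is hypothetical: `NEXT_STATE_DIST`, `TOY_VALUES`, `toy_transition_prob` and `toy_sample` are made-up names, and the distribution describes a single fixed state-action pair. The point is only that an exact expectation computed from probabilities and an average over sampled outcomes estimate the same quantity.

```python
import random

# Hypothetical next-state distribution for one fixed (state, action) pair.
NEXT_STATE_DIST = {1: 0.9, 0: 0.1}
TOY_VALUES = {0: 0.0, 1: 1.0}  # made-up value of each next state


def toy_transition_prob(next_state):
    """Model access: look up the probability of reaching next_state."""
    return NEXT_STATE_DIST.get(next_state, 0.0)


def toy_sample(rng):
    """Simulation access: draw one next state from the same distribution."""
    states = list(NEXT_STATE_DIST)
    weights = list(NEXT_STATE_DIST.values())
    return rng.choices(states, weights=weights)[0]


# Planning style (e.g., value iteration): exact expectation over next states.
expected_value = sum(toy_transition_prob(s) * TOY_VALUES[s] for s in TOY_VALUES)

# Simulation style (e.g., Monte Carlo methods): average over sampled next states.
rng = random.Random(0)
n_samples = 10_000
sampled_value = sum(TOY_VALUES[toy_sample(rng)] for _ in range(n_samples)) / n_samples

print(f"Exact expectation: {expected_value:.3f}")
print(f"Sampled estimate:  {sampled_value:.3f}")
```

The sampled estimate approaches the exact expectation as the number of samples grows, which is why sampling is enough when the transition table is not inspected directly.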

The following is an example of the method **transition_prob(state, action, next_state)**:

In [8]:
print( f"What's the probability of ending up in state  7 (0, 1) starting from state 0 (0, 0) and selecting D (DOWN)? {env.transition_prob(0, 3, 7)}" )
print( f"What's the probability of ending up in state  7 (0, 1) starting from state 0 (0, 0) and selecting R (RIGHT)? {env.transition_prob(0, 1, 7)}" )
print( f"What's the probability of ending up in state 48 (6, 6) starting from state 0 (0, 0) and selecting R (RIGHT)? {env.transition_prob(0, 1, 48)}\n" )

What's the probability of ending up in state  7 (0, 1) starting from state 0 (0, 0) and selecting D (DOWN)? 0.9
What's the probability of ending up in state  7 (0, 1) starting from state 0 (0, 0) and selecting R (RIGHT)? 0.03
What's the probability of ending up in state 48 (6, 6) starting from state 0 (0, 0) and selecting R (RIGHT)? 0



The following is an example of the method **sample**:

In [18]:
start_position = 0
action = 1

for _ in range(10):
    # Argument order used in this call: the action id first, then the state id
    new_state = env.sample(action, start_position) 
    print( f"Starting from {env.state_to_pos(start_position)} and performing action {env.actions[action]}, the robot ends up in state: {env.state_to_pos(new_state)}" )

Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (0, 1)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (1, 0)
Starting from (0, 0) and performing action R, the robot ends up in state: (0, 0)


## Key Methods: *sample_episode()*

In **Dangerous GridWorld**, there is a method to sample a full trajectory following a given stochastic policy:
* *sample_episode( policy )*: returns an array of N elements (the number of steps), where each element is an array of 3 values *<state, action, reward>*.

The policy should be an array with one element for each state, where each element is an array of A values, with A the number of possible actions. These values represent the probability distribution over the actions in that state. 

Suppose we use a uniform policy, where each action has the same probability of being selected:

In [10]:
policy = [[1 / env.action_space for _ in range(env.action_space)] for _ in range(env.observation_space)]

For example, from the state 0, we have the following (*uniform*) distribution over the actions:

In [11]:
print( "Distribution from state 0:", policy[0] ) # p (a | s ) => conditional probability of an action for a given state 

Distribution from state 0: [0.25, 0.25, 0.25, 0.25]
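
A policy in this format can be validated and used to draw actions without touching the environment. The sketch below is a standalone illustration: `is_valid_policy` and `choose_action` are hypothetical helper names, and the constants simply copy the values of `env.action_space` and `env.observation_space` printed earlier.

```python
import random

N_STATES = 49   # copied from env.observation_space in this tutorial
N_ACTIONS = 4   # copied from env.action_space in this tutorial

# Uniform stochastic policy: one probability row per state.
policy = [[1 / N_ACTIONS for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def is_valid_policy(policy, tol=1e-9):
    """Check that every row is a probability distribution over the actions."""
    return all(
        len(row) == N_ACTIONS
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in policy
    )


def choose_action(policy, state, rng):
    """Draw an action id according to the probabilities in policy[state]."""
    return rng.choices(range(N_ACTIONS), weights=policy[state])[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
print("Valid policy:", is_valid_policy(policy))
print("Action drawn in state 0:", choose_action(policy, 0, rng))
```

A deterministic policy fits the same format: each row then has a single 1 at the chosen action and 0 elsewhere.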


The following is a complete episode, which in this run starts from state **26**:

In [12]:
trajectory = env.sample_episode( policy )
print( trajectory )

[[26, 2, -0.1], [19, 2, -0.1], [12, 1, -0.1], [19, 2, -0.1], [12, 0, -1]]


Each array element represents a step of the episode as the tuple *<state, action, reward>*. For example, the first element means that starting from state **26** and performing the action **2 (UP)**, the robot obtains a reward of **-0.1**.
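
Given a trajectory in this format, the cumulative reward of the episode is the sum of the third entry of each step. The sketch below computes it for the trajectory printed above; the helper name `cumulative_reward` and the discount factor `gamma=0.9` are assumptions introduced only for illustration.

```python
# Trajectory copied from the output above: [state, action, reward] per step.
trajectory = [[26, 2, -0.1], [19, 2, -0.1], [12, 1, -0.1], [19, 2, -0.1], [12, 0, -1]]


def cumulative_reward(trajectory, gamma=1.0):
    """Sum the rewards of an episode, discounting step t by gamma**t."""
    return sum(reward * gamma**t for t, (_, _, reward) in enumerate(trajectory))


total = cumulative_reward(trajectory)
discounted = cumulative_reward(trajectory, gamma=0.9)  # gamma chosen for illustration
print(f"Undiscounted return: {total:.2f}")
print(f"Discounted return (gamma=0.9): {discounted:.2f}")
```

Averaging the undiscounted value over several episodes gives the kind of score that *evaluate_policy(policy)* is described as returning.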