**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/gym_plannable.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
%matplotlib inline
from gym_plannable.env import MazeEnv

## Gym Environments

The first notebook in our section on reinforcement learning (RL) does not cover any actual reinforcement learning method. Instead, it is about how we represent reinforcement learning problems using the OpenAI Gym interface (which has become something of a standard in recent years).

[OpenAI's Gym](https://gym.openai.com/) is a Python package with a number of different RL environments – all provided with the same unified interface. We are going to illustrate the main features of the interface using a simple gridworld environment.

### The Interface

#### Resetting and Rendering

To reset an environment's state and obtain an observation of it, we can call the `reset` method. The method returns two things:

* **obs** : the initial observation;
* **info** : a dictionary that may carry other environment-specific information.


In [None]:
env = MazeEnv()
env.reset()

To get visual renderings of the environment's state, we can pass `render_mode='human'` to it upon construction.

Different environments support different rendering modes – you can query the environment for modes that it supports using the `.metadata` attribute of the environment's class: e.g. `MazeEnv.metadata` here.



In [None]:
env = MazeEnv(render_mode='human')
env.reset();

What we can see in our case is a simple gridworld environment, where the agent is represented by the blue circle, the start and goal states are denoted by `S` and `G` respectively, and grey squares correspond to walls.

#### The Observation Space and the Action Space

Every environment has an observation space (the `observation_space` attribute) and an action space (the `action_space` attribute); these are the spaces from which observations and actions are drawn.

For instance, in our simple maze, the state is fully described by the agent's position in it (nothing else is subject to change). Given that the agent moves around a $10 \times 10$ grid, any observation returned by `reset` will have to be a pair of integers, each from the set $\{0, 1, \ldots, 9\}$. This corresponds to the following observation space under OpenAI Gym:



In [None]:
env.observation_space

Our action space consists of four actions:



In [None]:
env.action_space

In our case their meaning is:

* **0:**  up;
* **1:**  down;
* **2:**  left;
* **3:**  right.

That meaning is, of course, not part of the action space's definition.

To sample a (uniformly) random action from the action space, we can call `sample`:



In [None]:
env.action_space.sample()
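
Since the action space itself only knows about the integers 0 to 3, it can be handy to keep the labels in a small lookup table when logging. The following is a minimal sketch that does not use the environment at all: `ACTION_NAMES` is a hypothetical helper of our own, and `rng.randrange` merely stands in for `env.action_space.sample()`.

```python
import random

# Action encoding taken from the list above; the names are only labels
# that we attach ourselves, they are not stored in the action space.
ACTION_NAMES = {0: "up", 1: "down", 2: "left", 3: "right"}

rng = random.Random(0)

# Stand-in for env.action_space.sample(): a uniform draw from {0, 1, 2, 3}.
action = rng.randrange(len(ACTION_NAMES))

print(f"action {action} means '{ACTION_NAMES[action]}'")
```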

#### Taking Actions

Actions are taken using the `step` method; the input argument is the action. What `step` returns is a tuple with the following elements:

* **obs** : the new observation;
* **reward** : the immediate reward associated with the transition;
* **terminated** : a boolean flag indicating whether the environment has terminated (i.e. a terminal state has been reached and the environment needs to be reset);
* **truncated** : a boolean flag indicating that the episode has been cut short without reaching a terminal state, e.g. because of an error or because a limit on the number of steps has been reached. Again, the environment will need to be reset in order to move on;
* **info** : a dictionary that may carry other environment-specific information.


In [None]:
env = MazeEnv()
env.reset()

obs, reward, terminated, truncated, info = env.step(0)

env.render()
print(f"observation: {obs}\nreward: {reward}\nterminated: {terminated}\ntruncated: {truncated}")
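
The two flags are typically combined into a single stopping condition when running a whole episode. Below is a minimal sketch of such a loop. To keep it self-contained, it uses `TinyCorridorEnv`, a hypothetical stand-in of our own (a one-dimensional corridor, not part of `gym_plannable`) that follows the same `reset`/`step` conventions described above; `run_episode` is likewise just an illustrative helper.

```python
import random


class TinyCorridorEnv:
    """Hypothetical stand-in: a corridor with cells 0..4 and the goal at cell 4."""

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos, {}  # (obs, info)

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.pos = min(4, max(0, self.pos + move))
        self.steps += 1
        terminated = self.pos == 4
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode; return the sum of rewards and the last observation."""
    obs, info = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        # Either flag means the environment must be reset before continuing.
        done = terminated or truncated
    return total_reward, obs


rng = random.Random(0)
total, final_obs = run_episode(TinyCorridorEnv(), lambda obs: rng.randrange(2))
print(f"total reward: {total}; final observation: {final_obs}")
```

The same `run_episode` loop should work with any environment that returns the five-element tuple from `step`, provided the policy produces actions from that environment's action space.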

---
#### Task 1: Getting the Agent to the Goal State

**In the cell below, construct a maze environment and feed it a sequence of actions such that the agent traverses a path from the start state to the goal state. Call `render` after each step so that it is possible to observe the agent's behaviour.**

---


In [None]:
env = MazeEnv(render_mode='human')
env.reset()


# --- env.step(     )
# --- env.render()


### Plannable Environments

For the gridworld environments that we are going to be using in the next few notebooks, we provide an extension to the standard OpenAI Gym interface: plannable environments. In a plannable environment we are essentially given access to a distribution model of the environment. We can query for the actions that are legal in a certain state; we can query for all possible state transitions corresponding to an action, along with their probabilities; and we can simulate the behaviour of the environment for a number of steps without actually changing its state.

As we know, a distribution model of this kind is required by dynamic programming methods, but it can also be useful under a number of other approaches, so let us have a look at what the interface looks like.
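
To make the link to dynamic programming concrete, here is a minimal sketch of how a distribution model feeds an expected (Bellman) backup. It does not use `gym_plannable`: the function `model`, the four-cell corridor, the 0.8 success probability and the discount factor are all assumptions made up for this illustration.

```python
# Hypothetical distribution model of a corridor with cells 0..3 (goal at 3).
# model(s, a) lists every possible outcome as (probability, next_state, reward).
GOAL = 3
ACTIONS = (0, 1)  # 0 = left, 1 = right


def model(s, a):
    intended = min(GOAL, s + 1) if a == 1 else max(0, s - 1)
    # The move succeeds with probability 0.8; otherwise the agent stays put.
    outcomes = [(0.8, intended), (0.2, s)]
    return [(p, s2, 1.0 if s2 == GOAL else 0.0) for p, s2 in outcomes]


def backup(values, s, a, gamma=0.9):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in model(s, a))


def value_iteration(gamma=0.9, sweeps=100):
    values = [0.0] * (GOAL + 1)  # the goal is terminal, so its value stays 0
    for _ in range(sweeps):
        for s in range(GOAL):  # non-terminal states only
            values[s] = max(backup(values, s, a, gamma) for a in ACTIONS)
    return values


values = value_iteration()
print([round(v, 3) for v in values])
```

The important point is that `backup` needs every outcome and its probability, which is exactly what a sample-only interface such as `step` cannot provide.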

#### Retrieve a Plannable State

If an environment is plannable, it is going to provide the `single_plannable_state` method (or the `plannable_state` method, which, however, we would only need if we were working with a multiagent environment). This method returns the current state of the environment as a plannable state:



In [None]:
env = MazeEnv(render_mode='human')
env.reset()

state = env.single_plannable_state()

#### Legal Actions

To inquire about which actions are legal in the current state, one can use the `legal_actions` method:



In [None]:
state.legal_actions()

Here, strictly speaking, only two actions are legal because the agent is in a corner: it cannot go left or down.

#### Simulating Transitions

To iterate over all possible next states and their corresponding probabilities, given an action `a`, one can use the `all_next(a)` method:



In [None]:
a = state.legal_actions()[0]

for next_state, prob in state.all_next(a):
    print(f"observation {next_state.observation()}; " + 
          f"reward: {next_state.reward()}; probability: {prob}")

To sample just a single transition, one can use `next(a)`:



In [None]:
a = state.legal_actions()[0]
next_state = state.next(a)
# `next` samples a single transition, so there is no probability to report here
print(f"observation {next_state.observation()}; "
      f"reward: {next_state.reward()}")

#### The Done Flag

To check whether a state is terminal, one can use the `is_done` method:



In [None]:
state.is_done()
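
Put together, these methods are enough to simulate a random rollout from a state without touching the environment itself. The sketch below illustrates the idea on `ToyState`, a hypothetical class of our own that only mimics the method names used above (`legal_actions`, `next`, `reward`, `observation`, `is_done`); the real plannable states may behave differently in their details, so treat this as an illustration of the pattern rather than of the library.

```python
import random


class ToyState:
    """Hypothetical immutable state of a corridor with cells 0..goal."""

    def __init__(self, pos=0, reward=0.0, goal=3):
        self._pos, self._reward, self._goal = pos, reward, goal

    def observation(self):
        return self._pos

    def reward(self):
        return self._reward

    def is_done(self):
        return self._pos == self._goal

    def legal_actions(self):
        # 0 = left, 1 = right; moving left is not offered at the left wall.
        return [1] if self._pos == 0 else [0, 1]

    def next(self, action):
        # Returns a new state; the state it is called on is left unchanged.
        pos = self._pos + (1 if action == 1 else -1)
        return ToyState(pos, 1.0 if pos == self._goal else 0.0, self._goal)


def rollout(state, max_steps, rng):
    """Follow uniformly random legal actions; return total reward and last state."""
    total = 0.0
    for _ in range(max_steps):
        if state.is_done():
            break
        state = state.next(rng.choice(state.legal_actions()))
        total += state.reward()
    return total, state


start = ToyState()
total, final = rollout(start, max_steps=50, rng=random.Random(0))
print(f"total reward: {total}; final observation: {final.observation()}; "
      f"start observation is still {start.observation()}")
```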