## Introduction

This notebook provides a simple introduction to [OpenAI Gym](https://gym.openai.com/), a toolkit for developing and comparing Reinforcement Learning (RL) algorithms. For instance, you can use OpenAI Gym to train an agent to play [Atari games](https://gym.openai.com/envs/#atari)!

![](http://gym.openai.com/videos/2019-04-06--My9IiAbqha/SpaceInvaders-v0/poster.jpg)

This notebook **is not** an introduction to RL, and does not explain concepts like Markov Decision Processes, states, rewards, value functions, policies and so on. For a hands-on introduction to RL I recommend [Packt: Deep Reinforcement Learning Hands-On](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-Q-networks-ebook/dp/B076H9VQH6/ref=sr_1_1_sspa?keywords=pocket+reinforcement+learning&qid=1555782065&s=gateway&sr=8-1-spons&psc=1). For a solid theoretical treatment of the subject there is nothing better than [Sutton & Barto: Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), which is available for free online, and can also be [purchased online](https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262193981/ref=sr_1_4?crid=17M2H3J2R3L7Z&keywords=sutton+reinforcement+learning&qid=1555782219&s=gateway&sprefix=sutton+re%2Caps%2C200&sr=8-4).

## Setup

Before we can get our hands dirty there are a few things we need to install. The following cell takes care of all that:

In [1]:
import sys
!{sys.executable} -m pip install gym > /dev/null

import numpy as np
import gym



This is all the setup we need. In the following sections we will introduce the foundational Gym concepts and then run an actual simulation.

## OpenAI Gym

OpenAI Gym is very flexible and abstracts away many details to make experimentation fast. The key to this is its set of foundational concepts: environments, observations, actions and spaces.

**Environment**: An environment is a test problem. It models the "world" in which the agent exists, generates observations, defines the possible actions and determines the reward the agent gets at different points in time. OpenAI Gym is packed with environments. Environments are instantiated by name:

```python
import gym
env = gym.make('CartPole-v0')
env.reset()
```

**Observations**: Observations allow us to determine the state of the environment, the reward obtained after executing the last action, whether or not the simulation has finished, and some extra diagnostic information. The ```step``` method of the _environment_ gives us access to this information:

```python
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
```
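To make the ```reset```/```step``` contract concrete without depending on Gym itself, here is a minimal sketch of a toy environment that follows the same four-value return convention used in this notebook. The class ```ToyCorridorEnv``` and its parameters are hypothetical and are not part of Gym; it only mimics the shape of the interface.

```python
import random


class ToyCorridorEnv:
    """Hypothetical 1-D corridor used for illustration; not part of Gym.

    The agent starts at cell 0 and must reach the last cell.
    Actions: 0 = move left, 1 = move right.
    """

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        # The episode ends at the goal or when the step budget runs out.
        done = reached_goal or self.steps >= self.max_steps
        info = {"steps": self.steps}
        return self.position, reward, done, info


env = ToyCorridorEnv()
observation = env.reset()
done = False
while not done:
    action = random.choice([0, 1])  # random policy, as in the loop above
    observation, reward, done, info = env.step(action)
print("Finished at cell {} after {} steps".format(observation, info["steps"]))
```

The loop at the bottom has the same structure as the CartPole loop: reset once, then call ```step``` until ```done``` is true.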

**Actions**: The purpose of Reinforcement Learning is to learn the optimal action that an agent can take in a particular environment at any point in time. In OpenAI Gym the set of possible actions is defined by the environment, and can be accessed via its ```action_space```. Spaces are explained next.

```python
# Select a random action
action = env.action_space.sample()
```

**Spaces**: Every environment comes with an ```action_space``` and an ```observation_space```. These attributes are of type ```Space```, and they describe the format of valid actions and observations:

In [2]:
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)

Discrete(2)
Box(4,)


A ```Discrete``` space is a set of actions identified by integers: ```Discrete(n)``` contains the elements of ```range(0, n)```. A ```Box``` is an n-dimensional tensor of numbers. For instance, a chess board could be represented as a ```Box``` of shape ```(8, 8)```, and the ```Box(4,)``` printed above is a vector of four numbers.
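The two kinds of space can be sketched in plain Python as follows. ```SimpleDiscrete``` and ```SimpleBox``` are hypothetical stand-ins written for illustration; they are not Gym's classes and only cover sampling and a basic membership check, under the assumption that a box is bounded by a single ```low``` and ```high``` value.

```python
import random


class SimpleDiscrete:
    """Hypothetical stand-in for Discrete(n): the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n valid integers uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class SimpleBox:
    """Hypothetical stand-in for a Box: nested lists of floats with a fixed shape."""

    def __init__(self, low, high, shape):
        self.low = low
        self.high = high
        self.shape = shape

    def sample(self):
        def build(dims):
            # Recurse over the shape; the innermost level holds the floats.
            if not dims:
                return random.uniform(self.low, self.high)
            return [build(dims[1:]) for _ in range(dims[0])]

        return build(self.shape)


actions = SimpleDiscrete(2)
board = SimpleBox(low=0.0, high=1.0, shape=(8, 8))
print(actions.sample())          # either 0 or 1
grid = board.sample()
print(len(grid), len(grid[0]))   # 8 8
```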

## Running a simulation

In this section we are going to run a simulation without visualizing it. We have not been able to visualize simulations within Jupyter running on EC2.

We are going to use the [CartPole-v0 environment](https://gym.openai.com/envs/CartPole-v0/). CartPole-v0 defines "solving" as getting an average reward of 195.0 over 100 consecutive trials. The following code uses random actions, so we are not going to be able to solve the problem.
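The "solved" criterion can be expressed as a small helper. This is a sketch under two assumptions: the function name ```is_solved``` is hypothetical, and "getting an average reward of 195.0" is read as "at least 195.0".

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if any `window` consecutive episodes average at least `threshold`."""
    if len(episode_rewards) < window:
        return False
    # Sum of the first window, then slide it one episode at a time.
    window_sum = sum(episode_rewards[:window])
    if window_sum / window >= threshold:
        return True
    for i in range(window, len(episode_rewards)):
        window_sum += episode_rewards[i] - episode_rewards[i - window]
        if window_sum / window >= threshold:
            return True
    return False


print(is_solved([20.0] * 100))   # False
print(is_solved([200.0] * 100))  # True
```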

The observation vector contains the following information: ```[position of cart, velocity of cart, angle of pole, rotation rate of pole]```. The action space is ```Discrete(2)```, so the two actions are $0$ and $1$: they apply a force that pushes the cart to the left or to the right, respectively.
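To keep the four components of the observation apart, they can be given names. The following is a minimal sketch; ```CartPoleObservation``` and its field names are hypothetical labels chosen for this notebook, and the numbers are made up for illustration.

```python
from collections import namedtuple

# Hypothetical labels for the four entries of a CartPole observation.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_rotation_rate"],
)

# Made-up values; a real observation comes from env.reset() or env.step().
raw_observation = [0.02, -0.15, 0.03, 0.21]
obs = CartPoleObservation(*raw_observation)
print(obs.pole_angle)  # 0.03
```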

In [3]:
import gym
env = gym.make('CartPole-v0')

episodes = 100
total_reward = 0 # Used to calculate the average reward over all episodes

def select_action(observation, env):
    """Select a random action"""
    return env.action_space.sample()

for episode in range(episodes):
    acc_reward = 0 # Accumulated reward in the current episode
    observation = env.reset()
    t = 0
    while True:
        t += 1
        action = select_action(observation, env) # Select a random action
        observation, reward, done, info = env.step(action)
        acc_reward += reward
        if done:
            # t already counts the steps taken, so no +1 is needed here
            print("Episode finished after {} timesteps. Reward {}".format(t, acc_reward))
            break
    total_reward += acc_reward
print("\nAverage reward over {} episodes is {}".format(episodes, total_reward / episodes))
env.close()

Episode finished after 21 timesteps. Reward 21.0
Episode finished after 18 timesteps. Reward 18.0
Episode finished after 15 timesteps. Reward 15.0
Episode finished after 13 timesteps. Reward 13.0
Episode finished after 50 timesteps. Reward 50.0
Episode finished after 42 timesteps. Reward 42.0
Episode finished after 68 timesteps. Reward 68.0
Episode finished after 20 timesteps. Reward 20.0
Episode finished after 15 timesteps. Reward 15.0
Episode finished after 13 timesteps. Reward 13.0
Episode finished after 11 timesteps. Reward 11.0
Episode finished after 21 timesteps. Reward 21.0
Episode finished after 33 timesteps. Reward 33.0
Episode finished after 17 timesteps. Reward 17.0
Episode finished after 16 timesteps. Reward 16.0
Episode finished after 29 timesteps. Reward 29.0
Episode finished after 21 timesteps. Reward 21.0
Episode finished after 38 timesteps. Reward 38.0
Episode finished after 14 timesteps. Reward 14.0
Episode finished after 38 timesteps. Reward 38.0
Episode finished aft

As we can see, our performance is far from good, which is expected from an agent that selects actions at random, completely disregarding all the information available. We will work on the actual RL algorithm implementations in a different notebook.