# Reinforcement Learning with OpenAI Gym

## Requirements

This notebook requires the following packages to be installed:

1. tensorflow
2. tflearn
3. gym
4. numpy

These can be installed via Anaconda (**`conda install <package>`**) or via pip (**`pip install <package>`**).

## What we'll be doing:

Teaching a neural network to balance a pole on a cart (the CartPole environment), which accepts 2 inputs: push left or push right.

![cartpole](https://i.imgur.com/di0IANk.gif)

## History and Rationale

* Game programmers used to rely on heuristic if-then-else decisions to make educated guesses. This was the norm for a very long time. Its major flaw, however, is that game developers can only predict and account for a limited number of scenarios and edge cases.

* Game developers then tried to mimic how humans would play a game, and modeled human intelligence in a game bot.

* DeepMind generalized this approach, using deep neural networks to learn to play many Atari games without prior knowledge of the game such as its inputs or goals. Much of this work is not open source and remains proprietary to Google.

* To avoid concentrating the incredible power of AI in the hands of a few, Elon Musk and others co-founded OpenAI. It seeks to democratize AI by making it accessible to all.

* OpenAI Gym provides a simple interface for interacting with and managing any arbitrary dynamic environment. When integrated with other packages that add flash functionality (such as `universe`), one can access: Atari games, Minecraft, Grand Theft Auto, etc.

## What is Reinforcement Learning?

In reinforcement learning, an agent observes the game's previous state and reward (such as the pixels seen on the screen or the game score). It then comes up with an action to perform on the environment.

![reinforce](imgs/reinforcement.jpeg)

The agent (here, the game bot) chooses this action with the intention of making its next observation better, which in our case means maximizing the game score. The action is then applied to the environment, and the environment records the resulting state and a reward that reflects whether the action was beneficial (did it win the game?).

# Files available:

1. **Basic setup**: `01-cartpole-random-i.py`
2. **Introduction to using OpenAI gym**: `02-cartpole-random-ii.py`
3. **Using reinforcement to achieve a goal (in this case, balancing a pole)**: `03-cartpole-non-random.py`

Since this involves interaction, we will run it from the console.

# Activity: 01-cartpole-random-i.py

Creating and running an environment

# Activity: 02-cartpole-random-ii.py

Interacting with that environment. Note that every environment has two spaces: an `action_space`, which describes the actions the agent is allowed to take, and an `observation_space`, which describes what the agent can observe about the game "world".

For `CartPole-v0`:

```
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)
```

The `Discrete` space allows a fixed range of non-negative integers, so in this case the valid actions are 0 and 1. The `Box` space represents an n-dimensional box, so valid observations are arrays of 4 numbers. We can also check the `Box`'s bounds:

```
print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])
```
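
To make the two space types concrete, here is a minimal pure-Python sketch of the membership checks they imply. `SimpleDiscrete` and `SimpleBox` are hypothetical stand-ins written for this illustration, not gym's own classes, and the bounds are copied from the output above. The last line assumes the third bound is an angle in radians, in which case it works out to about 12 degrees.

```python
import math


class SimpleDiscrete:
    """Hypothetical stand-in for Discrete(n): valid values are 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class SimpleBox:
    """Hypothetical stand-in for Box: one (low, high) interval per dimension."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low = list(low)
        self.high = list(high)

    def contains(self, obs):
        # Every component must lie inside its own interval.
        return len(obs) == len(self.low) and all(
            lo <= value <= hi
            for lo, value, hi in zip(self.low, obs, self.high)
        )


action_space = SimpleDiscrete(2)
observation_space = SimpleBox(
    low=[-2.4, -math.inf, -0.20943951, -math.inf],
    high=[2.4, math.inf, 0.20943951, math.inf],
)

print(action_space.contains(1))                            # True
print(action_space.contains(2))                            # False
print(observation_space.contains([0.0, 1.5, 0.1, -3.0]))   # True
print(observation_space.contains([3.0, 0.0, 0.0, 0.0]))    # False

# Assumption: the third bound is an angle in radians.
print(round(math.degrees(0.20943951), 2))                  # 12.0
```

The infinite bounds mean that any finite value is accepted in the second and fourth dimensions; only the first and third are actually limited.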

## Environment Observations

If we ever want to do better than take random actions at each step, it'd probably be good to actually know what our actions are doing to the environment.

The environment's `step` function returns exactly what we need. In fact, `step` returns four values. These are:

1. **observation (object)**: an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

2. **reward (float)**: amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

3. **done (boolean)**: whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

4. **info (dict)**: diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic "agent-environment loop". Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

![loop](imgs/loop.svg)
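
The loop can be sketched without gym installed. The example below is a minimal illustration, not one of the activity scripts: `ToyBalanceEnv` is a hypothetical one-dimensional environment invented here, whose `reset` and `step` methods imitate the four-value interface described above. The reward of 1.0 per surviving step, the bounds, and the two policies are all assumptions of the sketch.

```python
import random


class ToyBalanceEnv:
    """Hypothetical 1-D balancing task with a gym-like reset/step interface."""

    def __init__(self, limit=3, max_steps=200, seed=None):
        self.limit = limit
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right; a random gust adds noise.
        push = -1 if action == 0 else 1
        gust = self.rng.choice([-1, 0, 1])
        self.position += push + gust
        self.steps += 1
        out_of_bounds = abs(self.position) > self.limit
        done = out_of_bounds or self.steps >= self.max_steps
        reward = 1.0  # assumption: one point per step taken
        info = {"out_of_bounds": out_of_bounds}
        return self.position, reward, done, info


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)                        # agent acts
        observation, reward, done, info = env.step(action)  # environment responds
        total_reward += reward
    return total_reward, env.steps


agent_rng = random.Random(0)


def random_policy(observation):
    # Ignores the observation entirely.
    return agent_rng.randrange(2)


def corrective_policy(observation):
    # Uses the observation: push back toward the center.
    return 0 if observation > 0 else 1


env = ToyBalanceEnv(seed=1)
print(run_episode(env, random_policy))      # usually ends early, out of bounds
print(run_episode(env, corrective_policy))  # (200.0, 200)
```

The corrective policy always lasts the full 200 steps in this toy setting because pushing toward the center keeps the position between -1 and 2, which is the kind of improvement over random actions that using the observation makes possible.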

# Activity: 03-cartpole-non-random.py

# Extra: Try a different game!
To examine which games can be used, run:

```
from gym import envs
print(envs.registry.all())
```

**Note that some games do not have 2 inputs, so you will have to change how your inputs are sampled!**
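
One way to avoid hard-coding two inputs is to sample from however many actions the game offers. The sketch below is a hedged illustration in plain Python: `sample_action` is a hypothetical helper, and `n_actions` stands for the number of discrete actions, which gym's `Discrete` spaces expose as `env.action_space.n` (they also provide `env.action_space.sample()` for the same purpose).

```python
import random


def sample_action(n_actions, rng=random):
    """Pick one of n_actions discrete actions (0 .. n_actions-1) uniformly."""
    if n_actions < 1:
        raise ValueError("need at least one action")
    return rng.randrange(n_actions)


rng = random.Random(42)

# CartPole-style game with 2 inputs.
two_inputs = [sample_action(2, rng) for _ in range(1000)]

# A hypothetical game with 6 inputs: only the argument changes.
six_inputs = [sample_action(6, rng) for _ in range(1000)]

print(sorted(set(two_inputs)))
print(sorted(set(six_inputs)))
```

Any code that builds training labels from the sampled action (for example, a one-hot vector of length 2) would need the same change, so that its length follows the number of actions.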

# References

1. https://medium.freecodecamp.org/how-to-build-an-ai-game-bot-using-openai-gym-and-universe-f2eb9bfbb40a
2. https://gym.openai.com/docs/

In [1]:
run "01-cartpole-random-i.py"

[2018-09-24 17:36:24,403] Making new env: CartPole-v0


out of bounds after 12 steps
out of bounds after 16 steps
out of bounds after 19 steps
out of bounds after 12 steps
out of bounds after 28 steps
out of bounds after 11 steps
out of bounds after 19 steps
out of bounds after 16 steps
out of bounds after 12 steps
out of bounds after 18 steps
out of bounds after 42 steps
out of bounds after 15 steps
out of bounds after 16 steps
out of bounds after 16 steps
out of bounds after 30 steps
out of bounds after 13 steps
out of bounds after 46 steps
out of bounds after 16 steps
out of bounds after 25 steps
out of bounds after 13 steps
out of bounds after 15 steps
out of bounds after 13 steps
out of bounds after 18 steps
out of bounds after 23 steps
out of bounds after 29 steps
out of bounds after 15 steps
out of bounds after 38 steps
out of bounds after 24 steps
out of bounds after 12 steps
out of bounds after 25 steps
out of bounds after 21 steps
out of bounds after 23 steps
out of bounds after 42 steps
out of bounds after 19 steps
out of bounds 

In [12]:
run "02-cartpole-random-ii.py"


[2018-09-24 16:50:20,678] Making new env: CartPole-v0


Trial 1 finished after 15 timesteps.
Trial 2 finished after 10 timesteps.
Trial 3 finished after 33 timesteps.
Trial 4 finished after 14 timesteps.
Trial 5 finished after 17 timesteps.
Trial 6 finished after 22 timesteps.
Trial 7 finished after 17 timesteps.
Trial 8 finished after 20 timesteps.
Trial 9 finished after 30 timesteps.
Trial 10 finished after 19 timesteps.
Trial 11 finished after 55 timesteps.
Trial 12 finished after 63 timesteps.
Trial 13 finished after 11 timesteps.
Trial 14 finished after 11 timesteps.
Trial 15 finished after 12 timesteps.
Trial 16 finished after 8 timesteps.
Trial 17 finished after 10 timesteps.
Trial 18 finished after 35 timesteps.
Trial 19 finished after 14 timesteps.
Trial 20 finished after 18 timesteps.
Trial 21 finished after 18 timesteps.
Trial 22 finished after 16 timesteps.
Trial 23 finished after 18 timesteps.
Trial 24 finished after 9 timesteps.
Trial 25 finished after 36 timesteps.
Trial 26 finished after 13 timesteps.
Trial 27 finished after

In [3]:
run "03-cartpole-non-random.py"

Training Step: 1784  | total loss: 0.66435 | time: 16.533s
| Adam | epoch: 005 | loss: 0.66435 - acc: 0.5938 -- iter: 22784/22827
Training Step: 1785  | total loss: 0.66884 | time: 16.581s
| Adam | epoch: 005 | loss: 0.66884 - acc: 0.5844 -- iter: 22827/22827
--
Trial 1 finished after 200 timesteps. Score: 200.0
Trial 2 finished after 200 timesteps. Score: 200.0
Trial 3 finished after 200 timesteps. Score: 200.0
Trial 4 finished after 200 timesteps. Score: 200.0
Trial 5 finished after 200 timesteps. Score: 200.0
Trial 6 finished after 200 timesteps. Score: 200.0
Trial 7 finished after 200 timesteps. Score: 200.0
Trial 8 finished after 107 timesteps. Score: 107.0
Trial 9 finished after 200 timesteps. Score: 200.0
Trial 10 finished after 200 timesteps. Score: 200.0
Trial 11 finished after 200 timesteps. Score: 200.0
Trial 12 finished after 200 timesteps. Score: 200.0
Trial 13 finished after 200 timesteps. Score: 200.0
Trial 14 finished after 200 timestep