<a href="https://colab.research.google.com/github/lisaong/diec/blob/rl_path_finding/day4/rl/path_finding_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning Path-Finding Demo

This demonstrates how to:
- Use OpenAI gym to create a custom environment
- Compare different Q-learning algorithms for Reinforcement Learning

Inspired by: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

## Problem Setup

Bender is lost in Fry's house! Help Bender find Fry (who is in Room 5 waiting with a can of beer).

![intro](https://github.com/lisaong/diec/raw/rl_path_finding/day4/rl/path_finding_intro.png)

## OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) is an open-source Python toolkit for developing RL algorithms.

We will use OpenAI Gym to re-create Fry's house, then use reinforcement learning to find a path to Room 5.

For details on creating custom environments, see:
https://github.com/openai/gym/blob/master/docs/creating-environments.md


In [14]:
# gym is already built into Colab
import gym
from gym import spaces
import numpy as np
import random

gym.__version__

'0.15.6'

In [0]:
class FrysHomeEnv(gym.Env):
  """Custom Environment describing Fry's home  
  
  For details on the gym.Env class:
  https://github.com/openai/gym/blob/master/gym/core.py
  """

  # render to the current display or terminal
  metadata = {'render.modes': ['human']}

  def __init__(self):
    super(FrysHomeEnv, self).__init__()

    # Initialise the rewards matrix according to the graph above
    # Where:
    #  state: current room, action: next room
    #  dimensions (row=state, col=actions)
    #  A value of -1 means there is no direct path from room_i to room_j
    #  (for example, room_0 to room_0, or room_0 to room_5)
    self.rewards = np.array([[-1, -1, -1, -1,  0, -1], # state 0 (currently in room 0)
                             [-1, -1, -1,  0, -1, 0],  # state 1
                             [-1, -1, -1,  0, -1, -1], # etc
                             [-1,  0,  0, -1,  0, -1],
                             [ 0, -1, -1,  0, -1,  0],
                             [-1, 100, -1, -1, 100, 100]])
    
    self.num_rooms = self.rewards.shape[0]

    # Action space describes all possible actions that can be taken
    # here, we can select 1 out of 6 rooms
    self.action_space = spaces.Discrete(self.num_rooms)

    # Observation space describes the valid observations
    # since we are moving between rooms, we can be in 1 of 6 rooms
    self.observation_space = spaces.Discrete(self.num_rooms)

    # Rewards range describes the min and max possible rewards
    self.reward_range = (self.rewards.min(), self.rewards.max())

    # Room 5 is our goal
    self.goal = 5

    # Initialise our state
    self.reset()

  def reset(self):
    """Reset the environment to an initial state"""

    # Randomly initialise the state
    # (random.randint includes both endpoints, so the upper bound is num_rooms - 1)
    self.state = random.randint(0, self.num_rooms - 1)

    # Return the observation (same as the state in our case)
    obs = self.state
    return obs

  def step(self, action):
    """Execute one step within the environment"""

    # take the selected action
    prev_state = self.state
    self.state = action

    # calculate the reward
    reward = self.rewards[prev_state][action]

    # check if we've reached our goal
    done = (self.state == self.goal)

    # get the next observation
    obs = self.state

    return obs, reward, done, {}

  def render(self, mode='human', close=True):
    print(f'Current room: {self.state}')
    print(f'Reached goal: {self.state == self.goal}')

In [0]:
# Unit testing
myenv = FrysHomeEnv()

for i in range(0, 6):
  print(myenv.step(i))
  myenv.render()

(0, -1, False, {})
Current room: 0
Reached goal: False
(1, -1, False, {})
Current room: 1
Reached goal: False
(2, -1, False, {})
Current room: 2
Reached goal: False
(3, 0, False, {})
Current room: 3
Reached goal: False
(4, 0, False, {})
Current room: 4
Reached goal: False
(5, 0, True, {})
Current room: 5
Reached goal: True


## Package custom environment as module

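To create an environment with `gym.make`, it must first be registered with OpenAI Gym. The usual way to do this is to package the environment as an installable Python module.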

The code above has been packaged here:
https://github.com/lisaong/diec/blob/master/day4/rl/gym-fryshome

The module follows this convention:
https://github.com/openai/gym/blob/master/docs/creating-environments.md
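
In that convention, the package's top-level `__init__.py` registers the environment with gym. A minimal sketch follows. The environment id `fryshome-v0` matches the `gym.make` call used later in this notebook, while the entry point string is an assumption about how the package exposes the `FrysHomeEnv` class:

```python
# gym_fryshome/__init__.py (sketch; the entry_point value is an assumption)
from gym.envs.registration import register

register(
    id='fryshome-v0',                             # name passed to gym.make
    entry_point='gym_fryshome.envs:FrysHomeEnv',  # assumed 'module:ClassName'
)
```

Once the package is installed, `gym.make('gym_fryshome:fryshome-v0')` imports `gym_fryshome`, which runs the `register` call, and then constructs the environment.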

In [7]:
!git clone -b rl_path_finding https://github.com/lisaong/diec.git
%cd diec/day4/rl/gym-fryshome
!git pull
!pip install --verbose -e  .

fatal: destination path 'diec' already exists and is not an empty directory.
/content/diec/day4/rl/gym-fryshome
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 8 (delta 4), reused 8 (delta 4), pack-reused 0
Unpacking objects: 100% (8/8), done.
From https://github.com/lisaong/diec
   25cf3b7..51a04be  rl_path_finding -> origin/rl_path_finding
Updating 25cf3b7..51a04be
Fast-forward
 day4/rl/gym-fryshome/gym_fryshome/envs/frys_home_env.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
Created temporary directory: /tmp/pip-ephem-wheel-cache-vm4bt16c
Created temporary directory: /tmp/pip-req-tracker-2eyibxib
Created requirements tracker '/tmp/pip-req-tracker-2eyibxib'
Created temporary directory: /tmp/pip-install-6tiunkpb
Obtaining file:///content/diec/day4/rl/gym-fryshome
  Added file:///content/diec/day4/rl/gym-fryshome to build tracker '/tmp/pip-req-track

## **Restart the Colab kernel after every pip install**

Runtime -> Restart Runtime

In [0]:
# IMPORTANT: RESTART THE COLAB KERNEL if you've just run !pip install

# Test the installation
import gym_fryshome

gym_fryshome

## Random Walk Agent
 
The simplest approach is to have Bender walk around the house at random until he happens to reach Room 5.
* We will run 20 episodes, where each episode is a maximum of 100 steps
* Refer to http://gym.openai.com/docs/

A random walk doesn't learn anything, but it is a useful baseline against which to compare RL agents.

In [2]:
import gym

env = gym.make('gym_fryshome:fryshome-v0')

for episode in range(20):
  observation = env.reset()

  for t in range(100):
    env.render()
    
    # take a random action
    action = env.action_space.sample() # sample a random action from the action space

    # step the environment using the selected action
    observation, reward, done, info = env.step(action)

    if done:
      print(f'Episode finished after {t+1} timesteps\n')
      break

env.close()

Current room: 0, Reached goal: False
Current room: 4, Reached goal: False
Current room: 2, Reached goal: False
Current room: 2, Reached goal: False
Current room: 3, Reached goal: False
Current room: 0, Reached goal: False
Current room: 4, Reached goal: False
Current room: 4, Reached goal: False
Current room: 3, Reached goal: False
Current room: 1, Reached goal: False
Current room: 2, Reached goal: False
Current room: 4, Reached goal: False
Current room: 1, Reached goal: False
Current room: 3, Reached goal: False
Current room: 0, Reached goal: False
Current room: 0, Reached goal: False
Current room: 4, Reached goal: False
Current room: 2, Reached goal: False
Current room: 4, Reached goal: False
Episode finished after 19 timesteps

Current room: 0, Reached goal: False
Current room: 2, Reached goal: False
Current room: 2, Reached goal: False
Current room: 2, Reached goal: False
Current room: 3, Reached goal: False
Episode finished after 5 timesteps

Current room: 0, Reached goal: False
Cu

## Q-Learning Agent

Let's implement the basic Q-Learning formula (without Temporal Differencing):

`Q(state, action) = R(state, action) + gamma * max[Q(next_state, all actions)]`
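
As a worked instance of the formula, assume (hypothetically) a discount factor `gamma = 0.8`, a move that earns `R(state, action) = 0`, and a next state whose best known Q-value is 100. The update then gives:

$$
Q(\text{state}, \text{action}) = 0 + 0.8 \times 100 = 80
$$

The value available one step later is thus propagated back, discounted by gamma, to the move that leads toward it.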

Our Q-Learning agent will store the Q-values as we go along. This table of values can be thought of as Bender's "brain."
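
The update above can be sketched in plain Python without gym. This is a minimal illustration rather than the notebook's own agent. The helper names (`train_q_table`, `greedy_path`), the discount factor of 0.8, the episode count, and the reward layout (taken from the linked tutorial, where any move into Room 5 earns 100) are all assumptions, and the layout may differ from the matrix used by the environment.

```python
import random

# Reward layout assumed from the linked tutorial (NOT read from the environment):
# rows are the current room, columns the next room, -1 marks "no door",
# and any move into Room 5 earns 100.
TUTORIAL_REWARDS = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]


def train_q_table(rewards, goal, gamma=0.8, episodes=1000, seed=0):
    """Fill a Q-table with Q(s, a) = R(s, a) + gamma * max(Q(next_state, :))."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    n = len(rewards)
    q = [[0.0] * n for _ in range(n)]
    # Only moves through an existing door (reward != -1) are allowed.
    valid = [[a for a in range(n) if rewards[s][a] != -1] for s in range(n)]

    for _ in range(episodes):
        state = rng.randrange(n)  # start each episode in a random room
        while True:
            action = rng.choice(valid[state])  # explore purely at random
            next_state = action  # choosing a room moves Bender into it
            best_next = max(q[next_state][a] for a in valid[next_state])
            q[state][action] = rewards[state][action] + gamma * best_next
            state = next_state
            if state == goal:
                break
    return q, valid


def greedy_path(q, valid, start, goal, max_steps=20):
    """Follow the highest Q-value from each room until the goal is reached."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        state = path[-1]
        path.append(max(valid[state], key=lambda a: q[state][a]))
    return path


q_table, valid_moves = train_q_table(TUTORIAL_REWARDS, goal=5)
print(greedy_path(q_table, valid_moves, start=2, goal=5))
```

Under these assumed rewards, the greedy route from Room 2 should pass through Room 3 and then either Room 1 or Room 4 before reaching Room 5. Exploration here is purely random; an epsilon-greedy policy would be a natural refinement.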

## More Advanced Agents

You can install OpenAI Baselines, which contains implementations of more sophisticated agents.

https://github.com/openai/baselines