<a href="https://colab.research.google.com/github/lisaong/diec/blob/rl_path_finding/path_finding_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning Path-Finding Demo

This notebook demonstrates how to:
- Use OpenAI Gym to create a custom environment
- Compare different Q-learning algorithms for Reinforcement Learning

Inspired by: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

## Problem Setup

Bender is lost in Fry's house! Help Bender find Fry (who is in Room 5 waiting with a can of beer).

![intro](https://github.com/lisaong/diec/raw/rl_path_finding/day4/rl/path_finding_intro.png)

## OpenAI Gym

[OpenAI Gym](https://gym.openai.com/) is an open-source Python toolkit for developing RL algorithms.

We will use OpenAI Gym to re-create Fry's house, then use reinforcement learning to find a path to Fry.

Fry's house will be a custom environment, similar to: https://towardsdatascience.com/creating-a-custom-openai-gym-environment-for-stock-trading-be532be3910e
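
The interaction pattern that a Gym-style environment supports can be sketched without gym itself. The following is a minimal illustration, assuming the classic interface in which `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. `CorridorEnv` and `run_episode` are hypothetical names made up for this sketch and are not part of gym.

```python
class CorridorEnv:
  """Hypothetical stand-in (not part of gym): a corridor of cells, goal at the right end."""

  def __init__(self, length=4):
    self.goal = length - 1
    self.state = 0

  def reset(self):
    # Start every episode at the left end and return the first observation
    self.state = 0
    return self.state

  def step(self, action):
    # action 1 moves right, any other action moves left; clamp at the walls
    move = 1 if action == 1 else -1
    self.state = min(max(self.state + move, 0), self.goal)
    done = (self.state == self.goal)
    reward = 100 if done else 0
    return self.state, reward, done, {}


def run_episode(env, policy, max_steps=100):
  """Roll out one episode and return (total_reward, number_of_steps)."""
  obs = env.reset()
  total_reward, steps = 0, 0
  for _ in range(max_steps):
    obs, reward, done, _info = env.step(policy(obs))
    total_reward += reward
    steps += 1
    if done:
      break
  return total_reward, steps


# A fixed "always move right" policy reaches the goal in 3 steps
total, steps = run_episode(CorridorEnv(), policy=lambda obs: 1)
print(total, steps)  # prints: 100 3
```

Any object exposing the same `reset`/`step` pair can be passed to `run_episode`, which is why a custom environment only needs to implement those methods (plus its action and observation spaces) to work with standard RL loops.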



In [11]:
# gym comes preinstalled on Colab
import gym
from gym import spaces
import numpy as np
import random

gym.__version__

'0.15.6'

In [0]:
class FrysHomeEnv(gym.Env):
  """Custom environment describing Fry's home.
  
  For details on the gym.Env class:
  https://github.com/openai/gym/blob/master/gym/core.py
  """

  # render to the current display or terminal
  metadata = {'render.modes': ['human']}

  def __init__(self):
    super(FrysHomeEnv, self).__init__()

    # Initialise the rewards matrix according to the graph above
    # Where:
    #  state: current room, action: next room
    #  dimensions (row=state, col=action)
    #  A value of -1 means there is no direct path from room_i to room_j
    #  (for example, room_0 to room_0, or room_0 to room_5)
    self.rewards = np.array([[-1, -1, -1, -1,  0,  -1],  # state 0
                             [-1, -1, -1,  0, -1, 100],  # state 1
                             [-1, -1, -1,  0, -1,  -1],  # state 2
                             [-1,  0,  0, -1,  0,  -1],  # state 3
                             [ 0, -1, -1,  0, -1, 100],  # state 4
                             [-1,  0, -1, -1,  0, 100]]) # state 5 (goal)
    
    self.num_rooms = self.rewards.shape[0]

    # Action space describes all possible actions that can be taken
    # here, we can select 1 out of 6 rooms
    self.action_space = spaces.Discrete(self.num_rooms)

    # Observation space describes the valid observations
    # since we are moving between rooms, we can be in 1 of 6 rooms
    self.observation_space = spaces.Discrete(self.num_rooms)

    # Rewards range describes the min and max possible rewards
    self.reward_range = (self.rewards.min(), self.rewards.max())

    # Room 5 is our goal
    self.goal = 5

  def reset(self):
    """Reset the environment to an initial state"""

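    # Randomly initialise the state to a valid room index (0 to num_rooms - 1).
    # random.randint includes both endpoints, so the upper bound is num_rooms - 1.
    self.state = random.randint(0, self.num_rooms - 1)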

    # Return the observation (same as the state in our case)
    obs = self.state
    return obs

  def step(self, action):
    """Execute one step within the environment"""

    # take the selected action
    prev_state = self.state
    self.state = action

    # calculate the reward
    reward = self.rewards[prev_state][action]

    # check if we've reached our goal
    done = (self.state == self.goal)

    # get the next observation
    obs = self.state

    return obs, reward, done, {}

  def render(self, mode='human', close=True):
    print(f'Current room: {self.state}')
    print(f'Reached goal: {self.state == self.goal}')
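
As a sanity check, a plain breadth-first search over the room graph gives the shortest routes that a trained agent would be expected to find. This sketch assumes the valid moves are exactly the non-negative entries of the rewards matrix. `VALID_MOVES` and `shortest_path` are hypothetical helpers written for this illustration, not part of the environment.

```python
from collections import deque

# Valid moves per room, transcribed by hand from the non-negative entries
# of the rewards matrix (row = current room, column = next room)
VALID_MOVES = {
  0: [4],
  1: [3, 5],
  2: [3],
  3: [1, 2, 4],
  4: [0, 3, 5],
  5: [1, 4, 5],
}
GOAL = 5


def shortest_path(start, goal=GOAL, moves=VALID_MOVES):
  """Breadth-first search; returns the list of rooms from start to goal."""
  queue = deque([[start]])
  visited = {start}
  while queue:
    path = queue.popleft()
    room = path[-1]
    if room == goal:
      return path
    for next_room in moves[room]:
      if next_room not in visited:
        visited.add(next_room)
        queue.append(path + [next_room])
  return None  # goal unreachable from start


for room in range(6):
  print(room, shortest_path(room))
```

Starting from Room 2, for example, the search returns `[2, 3, 1, 5]`. The route through Room 4 is equally short, so a learned policy may legitimately pick either one.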