<a href="https://colab.research.google.com/github/leonistor/ml-manning/blob/master/06-data-mining-machine-learning-techniques/ReinforcementGym.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q-Learning / Reinforcement Learning

In [0]:
# ! pip install gym

In [7]:
import gym
import random

random.seed(1234)

streets = gym.make("Taxi-v3").env
streets.render()

+---------+
|[34;1mR[0m: | : :G|
| : | :[43m [0m: |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+



### Q-Learning

- R, G, Y, B: the four pickup and drop-off locations
- Blue letter: where the passenger is waiting to be picked up
- Magenta letter: the passenger's destination

State of the world:
- taxi location (5 x 5 grid -> 25 positions)
- current destination (4 locations)
- where the passenger is (at one of the 4 locations or in the taxi -> 5 options)
- => 25 x 4 x 5 = 500 possible states
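
The 500 states can be packed into a single integer by treating the four components as digits of a mixed-radix number. The sketch below is one way such an encoding might look; the helper names `encode_state` and `decode_state` are hypothetical and are not part of gym. It does reproduce the id 268 for taxi row 2, column 3, passenger location 2, destination 0, which is the state id that shows up in the transition table further down for actions that leave the state unchanged.

```python
# Hypothetical helpers illustrating a mixed-radix state encoding.
# Component sizes follow the list above: 5 rows, 5 columns,
# 5 passenger locations (4 stops + "in taxi"), 4 destinations.
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row
    state = state * N_COLS + taxi_col
    state = state * N_PASSENGER_LOCS + passenger_loc
    state = state * N_DESTINATIONS + destination
    return state


def decode_state(state):
    """Inverse of encode_state: recover the four components."""
    state, destination = divmod(state, N_DESTINATIONS)
    state, passenger_loc = divmod(state, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)                   # 500
print(encode_state(2, 3, 2, 0))   # 268
print(decode_state(268))          # (2, 3, 2, 0)
```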

For each state, 6 possible actions:
- move south, north, east, or west (actions 0-3)
- pick up the passenger (action 4)
- drop off the passenger (action 5)

Rewards and penalties:


| Event                                 | Reward |
|---------------------------------------|-------:|
| successful drop-off                   |    +20 |
| each step taken                       |     -1 |
| pickup or drop-off at illegal location |    -10 |



In [9]:
# set a known initial state:
# taxi at row 2, column 3; passenger waiting at location 2 (Y); destination is location 0 (R)
initial_state = streets.encode(2, 3, 2, 0)

streets.s = initial_state
streets.render()

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



In [11]:
# transition (reward) table for the initial state
# {action: [(probability, next state id, reward, done)]}
streets.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}
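
Each entry above maps an action to a list of `(probability, next_state, reward, done)` tuples. As a rough illustration, the sketch below copies that dictionary by hand, so it runs without gym, and computes the expected immediate reward of each action. The `ACTION_NAMES` list and the `expected_reward` helper are labels introduced here for readability, assuming the usual Taxi-v3 action ordering.

```python
# Transition table for the initial state, copied from the output above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}

# Assumed action ordering (not printed anywhere in this notebook).
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def expected_reward(outcomes):
    """Probability-weighted immediate reward over all possible outcomes."""
    return sum(prob * reward for prob, _next_state, reward, _done in outcomes)


expected = {action: expected_reward(outcomes)
            for action, outcomes in transitions.items()}

for action, value in expected.items():
    print(f"{ACTION_NAMES[action]:>7}: {value:+.1f}")
```

Every outcome has probability 1.0 because the taxi world is deterministic. Pickup and drop-off both cost -10 here and leave the state at 268, since the taxi is at neither the passenger's location nor the destination.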

In [0]:
# Q-learning
# train for 10_000 taxi runs; on 10% of steps, explore with a random action
# instead of using the Q-values
import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1
discount_factor = 0.6
exploration = 0.1
epochs = 10_000

for taxi_run in range(epochs):
  state = streets.reset()
  done = False
  while not done:
    random_value = random.uniform(0, 1)
    if random_value < exploration:
      # explore a random action
      action = streets.action_space.sample()
    else:
      # use the action with the highest Q-value
      action = np.argmax(q_table[state])
    
    next_state, reward, done, info = streets.step(action)

    prev_q = q_table[state, action]
    next_max_q = np.max(q_table[next_state])
    new_q = (1 - learning_rate) * prev_q + \
        learning_rate * (reward + discount_factor * next_max_q)
    q_table[state, action] = new_q

    state = next_state
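
The update inside the loop blends the old estimate with a new target, `reward + discount_factor * max(Q[next_state])`, weighted by the learning rate. A minimal standalone sketch of that single step, using a hypothetical `q_update` helper and the same hyperparameter values as the training cell, works through the arithmetic for a first move, when all Q-values are still zero:

```python
def q_update(prev_q, reward, next_max_q, learning_rate=0.1, discount_factor=0.6):
    """One Q-learning update for a single (state, action) pair."""
    # Target: immediate reward plus discounted best value of the next state.
    target = reward + discount_factor * next_max_q
    # Move the old estimate a fraction (learning_rate) of the way to the target.
    return (1 - learning_rate) * prev_q + learning_rate * target


# First ordinary move: Q-values are all zero and the step costs -1.
first_move = q_update(prev_q=0.0, reward=-1, next_max_q=0.0)

# First successful drop-off: reward +20, next state has no learned value yet.
first_dropoff = q_update(prev_q=0.0, reward=20, next_max_q=0.0)

print(first_move, first_dropoff)
```

With a learning rate of 0.1, a single step moves the estimate only 10% of the way toward the target, which is why many runs are needed before the table settles.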

In [19]:
# learned Q-values for the six actions in the initial state
q_table[initial_state]

array([-2.42558047, -2.40696774, -2.41324747, -2.3639511 , -9.13287701,
       -5.80430814])
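
In the row above, the largest value sits at index 3, and the two actions that are illegal in this state (indices 4 and 5) score far lower. The sketch below picks the greedy action from those values in plain Python, which mirrors what `np.argmax(q_table[state])` does. The action-name list is a label added here for readability, assuming the usual Taxi-v3 ordering.

```python
# Q-values for the initial state, copied from the output above.
q_values = [-2.42558047, -2.40696774, -2.41324747,
            -2.3639511, -9.13287701, -5.80430814]

# Assumed action ordering (not printed anywhere in this notebook).
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Greedy choice: index of the highest Q-value.
best_action = max(range(len(q_values)), key=lambda a: q_values[a])

print(best_action, ACTION_NAMES[best_action])  # 3 west
```

Heading west is consistent with the setup: the taxi starts at column 3 and the passenger waits at Y in column 0.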

In [21]:
from IPython.display import clear_output
from time import sleep

for tripnum in range(1, 11):
    state = streets.reset()

    done = False

    while not done:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum))
        print(streets.render(mode='ansi'))
        sleep(.5)
        state = next_state
    sleep(1)

Trip number 10
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35m[34;1m[43mY[0m[0m[0m| : |B: |
+---------+
  (Dropoff)
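
Taking `.env` from `gym.make("Taxi-v3")` removes the outer wrapper, which in gym normally enforces the episode step limit, so a greedy policy that has not fully converged could loop forever in the cell above. One possible safeguard, sketched here with a hypothetical `run_greedy_episode` helper and a tiny stand-in environment (`LineWorld`, invented purely so the example runs without gym), is to cap the number of steps. The stand-in follows the older gym API used in this notebook, where `reset()` returns a state and `step()` returns a 4-tuple.

```python
def run_greedy_episode(env, q_table, max_steps=200):
    """Follow the highest-valued action until done or max_steps is reached.

    Returns (total_reward, steps, finished).
    """
    state = env.reset()
    total_reward, steps, done = 0, 0, False
    while not done and steps < max_steps:
        row = q_table[state]
        action = max(range(len(row)), key=lambda a: row[a])
        state, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps, done


class LineWorld:
    """Hypothetical 4-cell corridor: start at cell 0, goal at cell 3.

    Action 0 moves left, action 1 moves right. Each step costs -1,
    and reaching the goal pays +20.
    """

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        move = 1 if action == 1 else -1
        self.pos = max(0, min(3, self.pos + move))
        done = self.pos == 3
        return self.pos, (20 if done else -1), done, {}


good_q = [[0, 1]] * 4   # always prefers "right"
bad_q = [[1, 0]] * 4    # always prefers "left", so it never reaches the goal

good_result = run_greedy_episode(LineWorld(), good_q)
bad_result = run_greedy_episode(LineWorld(), bad_q, max_steps=50)
print(good_result)
print(bad_result)
```

The same helper should work with `streets` and the trained `q_table`, since it only relies on `reset()`, `step()`, and indexing the table by state.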

