## 1. Import libraries and load the sample environment

- Each episode starts with the taxi at a random square and the passenger at a random location, and ends when the passenger is dropped off at the specified destination.
- There are 4 possible destinations: R(ed), G(reen), Y(ellow), and B(lue).

There are `500` discrete states: `25` taxi positions × `5` possible passenger locations × `4` destinations.

| Location index | Description |
| -- | --- |
| `0` | R(ed) |
| `1` | G(reen) |
| `2` | Y(ellow) |
| `3` | B(lue) |
| `4` | In taxi |
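
As a rough illustration of how the `500` states can be packed into a single integer, here is a minimal sketch. The helper names `encode` and `decode` are hypothetical, and the component ordering (taxi row, taxi column, passenger location, destination) is an assumption about how the environment lays out its state index, so treat this as an illustration rather than the environment's own code.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one index in [0, 500).

    Assumed ordering: taxi row, taxi column, passenger location, destination.
    """
    index = taxi_row                        # 5 possible rows
    index = index * 5 + taxi_col            # 5 possible columns
    index = index * 5 + passenger_location  # 5 possible passenger locations
    index = index * 4 + destination         # 4 possible destinations
    return index


def decode(index):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_location, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


# Taxi at row 2, column 3; passenger waiting at Y (2); destination B (3).
state = encode(taxi_row=2, taxi_col=3, passenger_location=2, destination=3)
print(state, decode(state))  # 271 (2, 3, 2, 3)
```

Under this layout, every combination of components maps to a distinct index, which is what lets the Q-table be a plain `500 × 6` array.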

There are `6` discrete deterministic actions:

| Action index | Description |
| -- | -- |
| `0` | move south |
| `1` | move north |
| `2` | move east |
| `3` | move west |
| `4` | pickup passenger |
| `5` | drop off passenger |

The reward function works as follows:

| Reward value | Description |
| -- | -- |
| `-1` | Per step reward |
| `+20` | Delivering passenger |
| `-10` | Executing "pickup" or "drop-off" actions illegally |
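
To make the reward table concrete, the sketch below totals the rewards of a made-up episode. The function `episode_return` and the reward sequence are hypothetical. It also assumes that the delivery step yields `+20` and an illegal action yields `-10` in place of the usual `-1`, rather than in addition to it.

```python
STEP_REWARD = -1
DELIVERY_REWARD = 20
ILLEGAL_ACTION_REWARD = -10


def episode_return(rewards, discount_factor=1.0):
    """Return the (optionally discounted) sum of an episode's rewards."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + discount_factor * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total


# Hypothetical episode: 9 ordinary steps, 1 illegal drop-off, then the delivery.
rewards = [STEP_REWARD] * 9 + [ILLEGAL_ACTION_REWARD] + [DELIVERY_REWARD]
print(episode_return(rewards))  # 1.0
```

A single illegal action costs as much as ten ordinary steps, which is why a trained agent should learn to avoid it.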

### Rendering

The colors indicate the following:

| Color | Description |
| -- | -- |
| Blue | Passenger |
| Magenta | Destination |
| Yellow | Empty taxi |
| Green | Full taxi |

The letters indicate the following:

| Letter | Description |
| -- | -- |
| R | R(ed) destination |
| G | G(reen) destination |
| Y | Y(ellow) destination |
| B | B(lue) destination |

The block represents the taxi.

In [1]:
import gym
import numpy as np
import pandas as pd

streets = gym.make("Taxi-v3").env
streets.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



## 2. Find optimized policy and state-action function from Q-Learning

Credits to [angps95@kaggle](https://www.kaggle.com/angps95/intro-to-reinforcement-learning-with-openai-gym/).

This uses the ε-greedy policy for choosing the action.
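
For reference, the update applied inside the training loop below can be written as follows, where $\alpha$ stands for `learning_rate`, $\gamma$ for `discount_factor`, $r$ for the reward, and $s'$ for the next state (the symbols are notation introduced here, not taken from the original notebook):

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$

As a quick check with $\alpha = 0.2$ and $\gamma = 0.95$: starting from an all-zero table, an ordinary step with $r = -1$ gives $Q(s, a) = 0.8 \cdot 0 + 0.2 \cdot (-1 + 0.95 \cdot 0) = -0.2$.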

In [5]:
import random

learning_rate = 0.2
discount_factor = 0.95
epsilon = 0.1

no_of_states = streets.observation_space.n
no_of_actions = streets.action_space.n
no_of_episodes = 100000

Q = np.zeros([no_of_states, no_of_actions])
policy = np.ones([no_of_states, no_of_actions]) / no_of_actions


def next_action(state):
    # Exploration-vs-exploitation using ε-greedy algorithm.
    tau = random.uniform(0, 1)
    
    if tau < epsilon:
        return streets.action_space.sample()
    
    return np.argmax(Q[state])


# Run through all episodes to tabulate the state-action matrix.
for i in range(1, no_of_episodes + 1):
    state = streets.reset()
    is_terminal_state = False

    while not is_terminal_state:
        action = next_action(state)
        next_state, reward, is_terminal_state, _ = streets.step(action)

        Q[state, action] = (1 - learning_rate) * Q[state, action] + learning_rate * (reward + discount_factor * np.max(Q[next_state]))
        state = next_state


# Derive the greedy policy: one-hot row per state, selecting the action with the highest Q-value.
for state in range(no_of_states):
    policy[state] = np.eye(no_of_actions)[np.argmax(Q[state])]

print(f'Completed with {no_of_episodes} episodes')
print(policy)

Completed with 100000 episodes
[[1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0.]
 ...
 [0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]]


## 3. Find the number of steps taken when following the learned policy

Reset the environment and count the number of steps the learned policy needs to reach the goal.

Repeat this over many episodes, then display the minimum, average, and maximum step counts.

In [None]:
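def episode_steps():
    # Run one episode with the learned policy and return the number of steps taken.
    current_state = streets.reset()
    reward = 0
    no_of_steps = 0

    # A reward of +20 signals that the passenger has been delivered.
    while reward != 20:
        current_state, reward, _, _ = streets.step(np.argmax(policy[current_state]))
        no_of_steps += 1

    return no_of_steps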

episode_dist = np.array([episode_steps() for i in range(10000)])

print(f'Min steps={np.min(episode_dist)}, Avg steps={np.round(np.average(episode_dist), 1)}, Max steps={np.max(episode_dist)}')
streets.render()

## 4. Display episode steps distribution

In [None]:
from matplotlib import pyplot as plt

fig, ax = plt.subplots(1,1)
# One bin per integer step count; edges run from min to max + 1 so that both extremes are included.
bins = np.arange(np.min(episode_dist), np.max(episode_dist) + 2)

ax.hist(episode_dist, bins=bins)
ax.set_title("Episode Steps Distribution")
ax.set_xticks(bins)
ax.set_xlabel('Steps')
ax.set_ylabel('No. of episodes')
plt.show()