# Reinforcement Q-Learning with the OpenAI Gym Taxi Environment
This notebook solves a reinforcement learning problem: a self-driving cab in the simplified OpenAI Gym environment [Taxi-v3](https://gym.openai.com/envs/Taxi-v2/). It uses the Q-learning algorithm to build a Q-table of size (state_space_size, action_space_size) that drives action selection. The tutorial demonstrates Q-learning on a problem with a discrete state space and a discrete action space.

In [2]:
# Import packages
import gym
import numpy as np
import random 
import time

## The Environment
The objective in this environment is for the taxi to:
* drive to the customer,
* pick up the customer,
* drive to the destination,
* drop off the customer.

The environment has 500 discrete states, which encode the locations of the taxi, the customer, and the destination, and 6 discrete actions (move south, north, east, or west; pick up; drop off). The Q-table stores a learned Q-value for each action in each state, so that the agent can pick the action that leads to the highest expected reward.
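The count of 500 follows from 25 taxi positions (a 5x5 grid), 5 passenger locations (the four landmarks plus "inside the taxi"), and 4 destinations. The sketch below reproduces the state numbering in plain Python, assuming the same mixed-radix layout that Gym's Taxi environment uses for `env.encode`. The names `encode_state` and `decode_state` are illustrative and not part of Gym's API.

```python
# Illustrative re-implementation of the Taxi state numbering (assumed to
# mirror Gym's env.encode / env.decode); the function names are hypothetical.
GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_LOCATIONS = 5   # R, G, Y, B, or inside the taxi
DESTINATIONS = 4          # R, G, Y, B


def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four components into a single integer in [0, 500)."""
    state = taxi_row
    state = state * GRID_COLS + taxi_col
    state = state * PASSENGER_LOCATIONS + passenger_idx
    state = state * DESTINATIONS + destination_idx
    return state


def decode_state(state):
    """Invert encode_state and return (row, col, passenger, destination)."""
    state, destination_idx = divmod(state, DESTINATIONS)
    state, passenger_idx = divmod(state, PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(state, GRID_COLS)
    return taxi_row, taxi_col, passenger_idx, destination_idx


n_states = GRID_ROWS * GRID_COLS * PASSENGER_LOCATIONS * DESTINATIONS
print(n_states)                    # 500
print(encode_state(3, 1, 2, 0))    # 328
print(decode_state(169))           # (1, 3, 2, 1)
```

Because every combination of the four components maps to a distinct integer, a single row index is enough to address any state in the Q-table.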

In [3]:
# Create the taxi environment 
env = gym.make("Taxi-v3").env

In [13]:
# Display the number of actions and states in the environment
print(env.action_space)
print(env.observation_space)

# Take one fixed action (1 = north) from the initial state
env.reset()
env.render()
state, reward, done, info = env.step(1)
env.render()

# print information
print("The next state: ", state)
print("Reward: ", reward)
print("Terminal state", done)
print("Probability of happening: ",info)

Discrete(6)
Discrete(500)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+

+---------+
|R: | : :[35mG[0m|
| : | :[43m [0m: |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (North)
The next state:  169
Reward:  -1
Terminal state False
Probability of happening:  {'prob': 1.0}


* The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
* The pipe ("|") represents a wall which the taxi cannot cross.
* R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

In [16]:
# Return the state number for a particular environment setting
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

# set the environment to this state
env.s = state

# render to show the environment in this state
env.render()

State: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+
  (North)


In [17]:
env.P[state]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

The dictionary above is the environment's transition table for this state. Its keys are the 6 available actions (0 to 5), and each value is a list of possible outcomes. Every outcome is a tuple containing:
* the probability of that outcome (1.0 here, since each action has exactly one outcome),
* the next state reached by taking the action,
* the reward received for taking the action,
* whether the next state is a terminal state.
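To make the tuple layout concrete, the sketch below unpacks a copy of the transition table printed above, so it runs without Gym installed. The action names are an assumption based on the index order documented for Gym's Taxi environment, and `expected_reward` is a hypothetical helper.

```python
# Transition table for state 328, copied from the env.P[state] output above.
# Each outcome is a tuple: (probability, next_state, reward, done).
transitions = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}
current_state = 328

# Assumption: action names in the index order used by Gym's Taxi environment.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def expected_reward(outcomes):
    """Probability-weighted reward over all outcomes of one action."""
    return sum(prob * reward for prob, _next_state, reward, _done in outcomes)


for action, outcomes in transitions.items():
    for prob, next_state, reward, done in outcomes:
        moved = next_state != current_state
        print(f"{ACTION_NAMES[action]:>7}: p={prob} next={next_state} "
              f"reward={reward} done={done} moved={moved}")

# Actions that leave the taxi where it is (a wall, or an illegal pickup/drop-off).
no_op_actions = [a for a, outs in transitions.items()
                 if all(s == current_state for _p, s, _r, _d in outs)]
print(no_op_actions)   # [3, 4, 5]
```

Actions 4 and 5 keep the taxi in place and cost -10, which is how the environment penalises a pickup or drop-off attempted at the wrong location.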

## Q-Learning 

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state. The values stored in the Q-table are called Q-values, and each one corresponds to a (state, action) pair. The Q-table is initialised as zeros with shape (state_size, action_size) and is updated by the Q-learning rule after every step, using the state, the action taken, and the reward received.

<img src="img/q_matrix.png" style="width:400px;height:400px"/>

The Q-value update is derived from the Bellman equation and iteratively refines the Q-value of a (state, action) pair from a single step of experience: state, action, reward, next state. The learned value combines the reward for taking the current action in the current state with the discounted maximum Q-value of the next state reached by that action.

\begin{align}
Q(state, action) \leftarrow (1-\alpha)\,Q(state, action) + \alpha\left(reward + \gamma \max_{a} Q(next\_state, a)\right)
\end{align}
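A single application of this update can be worked through by hand. The sketch below uses the hyperparameter values from this notebook (alpha = 0.1, gamma = 0.6) and an all-zero Q-table, so it corresponds to the very first training step. The helper name `q_update` is hypothetical.

```python
def q_update(q_current, reward, max_q_next, alpha, gamma):
    """Return the updated Q-value for one (state, action) pair."""
    learned_value = reward + gamma * max_q_next
    return (1 - alpha) * q_current + alpha * learned_value


# First update of training: the table is all zeros and a move costs -1.
new_q = q_update(q_current=0.0, reward=-1, max_q_next=0.0, alpha=0.1, gamma=0.6)
print(round(new_q, 4))   # -0.1
```

With alpha = 0.1 the entry moves only a tenth of the way toward the learned value on each visit, which is why many episodes are needed before the table stabilises.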

In [19]:
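# Create a Q-table of zeros with one row per state and one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # probability of taking a random (exploratory) action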

In [27]:
# Loop for N number of episodes of training
N = 10000
for i in range(0, N):
    
    # reset environment 
    state = env.reset()
    done = False
    
    # loop until the agent exits the environment (terminal point/drop off passenger)
    while not done:
        
        # epsilon greedy explore and exploit
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        
        # take the action and get q-value of next state
        next_state, reward, done, info = env.step(action)
        max_q_next = np.max(q_table[next_state])
        
        # update the q-value for current state
        q_table[state, action] = (1-alpha)*q_table[state,action] + alpha*(reward + gamma*max_q_next)
        
        # set the current state as next state
        state = next_state
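
The exploration rule used in the training loop can be isolated into a small helper. The following is a minimal sketch that uses only the standard library rather than NumPy and Gym. The helper `epsilon_greedy`, the list `q_row`, and its sample Q-values are hypothetical and chosen purely for illustration.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_row))
    # Exploit: index of the largest Q-value (first one on ties).
    return max(range(len(q_row)), key=q_row.__getitem__)


rng = random.Random(0)                         # seeded for repeatability
q_row = [-2.0, -1.5, -0.4, -2.2, -9.0, -9.0]   # made-up Q-values; action 2 is best
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(2) / len(picks)
print(greedy_share)
```

With epsilon = 0.1 and 6 actions, the best action is expected to be chosen in roughly 0.9 + 0.1/6 of the steps, since a random draw can also land on it.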


In [32]:
# simulate a result
state = env.reset()
done = False 
reward_sum = 0
step = 0 
# until done 
while not done:
    
    # render the current environment 
    env.render()
    
    # take the greedy action (highest Q-value for the current state)
    action = np.argmax(q_table[state])
    next_state, reward, done, info = env.step(action)
    
    # set the next state
    state = next_state
    
    # Accumulate rewards and count steps
    reward_sum += reward
    step += 1
    
    # pause for next action
    time.sleep(1)

print("The final reward: ", reward_sum)
print("Number of steps taken: ", step)

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|Y| : |[34;1mB[0m: |
+---------+

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[34;1m[43mB[0m[0m: |
+---------+
  (South)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[42mB[0m: |
+---------+
  (Pickup)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : |[42m_[0m: |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : :[42m_[0m: |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : :[42m_[0m|
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :[35mG[0m|
| : | : :[42m_[0m|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :[35m[42mG[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
The final reward:  13
Number of steps taken:  8
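
The totals above can be cross-checked by hand. The sketch assumes Taxi-v3's reward scheme of -1 per step, as seen in the `env.step` output earlier, and +20 for a successful drop-off. The +20 value comes from the Gym documentation rather than from this notebook's output, so treat it as an assumption.

```python
STEP_REWARD = -1       # per-step cost, as printed by env.step earlier
DROPOFF_REWARD = 20    # assumption: value taken from Gym's Taxi documentation

steps_taken = 8
# Seven ordinary steps followed by one successful drop-off.
rewards = [STEP_REWARD] * (steps_taken - 1) + [DROPOFF_REWARD]
print(sum(rewards), len(rewards))   # 13 8
```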
