**Reinforcement Learning**
*Using Q-Learning.* Let's take the example of a self-driving taxi that picks up passengers at one of a set of fixed locations, drops them off at another, and tries to get there in the shortest time while avoiding obstacles. Let's create this environment using OpenAI Gym.

In [4]:
import gym
import random

random.seed(123)

streets = gym.make("Taxi-v3").env #contains the rules
streets.render()

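+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

In this render the taxi (yellow) is at R, the pick-up location Y is shown in blue, and the drop-off location B is shown in magenta.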



Taxi-v3 environment description:

R, G, B, and Y are the possible passenger pick-up and drop-off locations.
The letter colored blue indicates the pick-up location.
The letter colored magenta indicates the drop-off location.
Solid lines represent walls.
The filled-in yellow rectangle is the taxi.


This is a 5x5 grid. The state of the environment is defined by:

Where the taxi is (5x5 = 25 locations).
Where the current destination is (4 possibilities).
Where the passenger is (5 possibilities, including inside the taxi).

So there are a total of 25x5x4 = 500 possible states.
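
The 500 states can be numbered by treating the taxi row, taxi column, passenger location, and destination as digits of a mixed-radix number. The sketch below is an assumption about how that numbering works, chosen to agree with the `streets.encode(2,3,2,0)` call and the state numbers printed in this notebook; `encode_state` and `decode_state` are hypothetical helper names, not part of Gym.

```python
GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_SLOTS = 5   # four landmarks plus "inside the taxi"
DESTINATIONS = 4


def encode_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four state components into a single index in [0, 500)."""
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * PASSENGER_SLOTS + passenger
    index = index * DESTINATIONS + destination
    return index


def decode_state(index):
    """Invert encode_state: recover (row, col, passenger, destination)."""
    index, destination = divmod(index, DESTINATIONS)
    index, passenger = divmod(index, PASSENGER_SLOTS)
    taxi_row, taxi_col = divmod(index, GRID_COLS)
    return taxi_row, taxi_col, passenger, destination


total_states = GRID_ROWS * GRID_COLS * PASSENGER_SLOTS * DESTINATIONS
print(total_states)               # 500
print(encode_state(2, 3, 2, 0))   # 268
```

Under this assumed layout, every combination of taxi position, passenger location, and destination gets a unique row in a table with 500 entries.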

For each state there are six possible actions:

Move in one of four directions (north, east, west, or south).
Pick up the passenger.
Drop off the passenger.

Rewards and penalties:

A successful drop-off = +20 points.
Every time step taken = -1 point.
Picking up or dropping off at the wrong location = -10 points.

Moving through a wall isn't allowed.


In [5]:
initial_state = streets.encode(2,3,2,0) # taxi at row 2, column 3; passenger at location 2 (Y); destination 0 (R)
streets.s = initial_state
streets.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

In this render the taxi (yellow) is at row 2, column 3, the passenger waits at Y (blue), and the destination is R (magenta).



In [6]:
streets.P[initial_state] # reward table for the initial state

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}
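
Each entry in this table maps an action number to a list of `(probability, next_state, reward, done)` tuples. The sketch below copies the dictionary printed above and labels each row. The action names assume the usual Taxi-v3 numbering (0 south, 1 north, 2 east, 3 west, 4 pickup, 5 dropoff), and `describe` is a hypothetical helper written for this illustration.

```python
# Assumed Taxi-v3 action numbering.
ACTION_NAMES = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}

# Copied from the output of streets.P[initial_state] above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}


def describe(table):
    """Flatten the table into (action name, next state, reward, done) rows."""
    rows = []
    for action, outcomes in table.items():
        for probability, next_state, reward, done in outcomes:
            rows.append((ACTION_NAMES[action], next_state, reward, done))
    return rows


for name, next_state, reward, done in describe(transitions):
    print(f"{name:8s} -> state {next_state}, reward {reward}, done={done}")
```

Read this way, the four moves each cost 1 point and lead to a new state, while pickup and dropoff cost 10 points and leave the state at 268, because neither is legal from this position.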

Designing our Q-learning algorithm.

In [0]:
import numpy as np

# 2D table of Q-values, one row per state and one column per action, initialised to zero
q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1    # how strongly each update moves the old Q-value toward the new estimate
discount_factor = 0.6  # how much future rewards count relative to immediate ones
exploration = 0.1      # probability of taking a random action instead of the best known one
epochs = 10000         # number of taxi runs used for training

for taxi_run in range(epochs):
  state = streets.reset()  # start each run from a fresh random state
  done = False

  while not done:
    random_value = random.uniform(0, 1)  # random number between 0 and 1
    if random_value < exploration:
      action = streets.action_space.sample()  # explore: take a random action
    else:
      action = np.argmax(q_table[state])  # exploit: use the action with the highest Q-value

    next_state, reward, done, info = streets.step(action)  # apply the action and observe the result

    prev_q = q_table[state, action]  # current Q-value
    next_max_q = np.max(q_table[next_state])  # highest Q-value available in the next state
    # Q-learning update: blend the old value with the reward plus the discounted best future value
    new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
    q_table[state, action] = new_q  # store the updated Q-value in the table

    state = next_state  # continue until the passenger is dropped off
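
To make the two core steps of the loop above concrete, here is a minimal standalone sketch of the action choice and the Q-value update, using the same hyperparameter values (0.1 and 0.6). It runs without Gym or NumPy; `choose_action` and `q_update` are hypothetical helper names introduced for this illustration.

```python
import random


def choose_action(q_row, exploration, rng=random):
    """Epsilon-greedy choice over one row of the Q-table."""
    if rng.uniform(0, 1) < exploration:
        return rng.randrange(len(q_row))  # explore: random action index
    # exploit: index of the largest Q-value (first one wins on ties)
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(prev_q, reward, next_max_q, learning_rate=0.1, discount_factor=0.6):
    """One Q-learning update, written the same way as in the training loop."""
    target = reward + discount_factor * next_max_q
    return (1 - learning_rate) * prev_q + learning_rate * target


# First update on an all-zero table after an ordinary move (reward -1):
first = q_update(prev_q=0.0, reward=-1, next_max_q=0.0)
print(first)  # -0.1

# With exploration set to 0 the choice is always greedy.
print(choose_action([-2.0, -1.0, -3.0], exploration=0.0))  # 1
```

Starting from zero, a single move only shifts the Q-value to -0.1, which is why many runs are needed before the table reflects the true cost of each action.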


In [9]:
q_table[initial_state]

array([-2.42860034, -2.40766244, -2.40029153, -2.3639511 , -9.43185334,
       -8.08032874])
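
One way to read these numbers is to pick the action with the highest Q-value, which is what the trained taxi does. The sketch below copies the values printed above and assumes the usual Taxi-v3 action numbering (0 south, 1 north, 2 east, 3 west, 4 pickup, 5 dropoff). Plain Python is used so it runs without NumPy.

```python
# Assumed Taxi-v3 action numbering.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Copied from the output of q_table[initial_state] above.
q_values = [-2.42860034, -2.40766244, -2.40029153, -2.3639511,
            -9.43185334, -8.08032874]

# Greedy choice: index of the largest Q-value.
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(best_action, ACTION_NAMES[best_action])  # 3 west
```

Under that numbering the preferred move is west, which heads toward the passenger waiting at Y in the bottom-left corner. The pickup and dropoff values are far lower because both are penalized with -10 in this state.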

Now let's simulate the trained taxi on 10 trips, each starting from a random state, and display the output.

In [10]:
from IPython.display import clear_output
from time import sleep

for tripnum in range(1, 11):
    state = streets.reset()
   
    done = False
    trip_length = 0
    
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render(mode='ansi'))
        sleep(.5)
        state = next_state
        trip_length += 1
        
    sleep(2)

Trip number 10 Step 13
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

In this final frame the taxi is at B, the destination, where it drops off the passenger.

