# --------------------------- Smart Cab (Self driving cab) ---------------------------

**Problem Statement**: Smartcab, a self-driving cab company, wanted to design a simulation of a self-driving cab.

**Goal**: Demonstrate the use of RL techniques to develop an efficient and safe approach for tackling the issue.

**Project Description**:
<br>
The Smartcab's job is to pick up the passenger at one location and drop them off in another. Here is a list of things that they would love our Smartcab to take care of:
- Drop off the passenger at the right location.
- Save the passenger's time by taking the minimum possible time to drop them off.
- Take care of the passenger's safety and obey traffic rules.

There are different aspects that need to be considered here while modeling an RL solution to this problem: rewards, states, and actions.

The company also wanted to allow users to book a cab by sending a free-text SMS containing the source, destination, and time of travel. Because the SMS is free text, different users can phrase the same request in different ways, e.g.:
1. I want to book a cab from cyber city to sector 48 at 5 pm.
2. Please book my cab with pick up from cyber city, destination sector 48, and time of pick up 5 pm.
3. Book a cab for me from sector 48, my pick-up is from cyber city at 5 pm.

So, the challenge for the company is to parse this free text and fetch three entities:
- Pick up location
- Destination
- Time to pick up
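One way to approach this extraction is simple keyword matching against a known list of locations. The sketch below is only an illustration, not the method used in this notebook: the function name `extract_entities` and the `LOCATIONS` list are hypothetical (the location names are borrowed from the mapping used later in this notebook), and it assumes that the destination follows the word "to" and the pickup follows "from" or is the only other location mentioned.

```python
import re

# Assumed vocabulary: the four locations that appear later in this notebook.
LOCATIONS = ["dwarka sector 23", "dwarka sector 21", "hauz khaas", "airport"]


def extract_entities(sms, locations=LOCATIONS):
    """Return (pickup, destination, time) found in a free-text SMS.

    Hypothetical helper: any entity that cannot be found is returned as None.
    """
    text = sms.lower()
    loc_pattern = "|".join(re.escape(loc) for loc in locations)

    # "to <location>" marks the destination, "from <location>" the pickup.
    dest_match = re.search(rf"\bto\s+({loc_pattern})", text)
    from_match = re.search(rf"\bfrom\s+({loc_pattern})", text)
    dest = dest_match.group(1) if dest_match else None
    pickup = from_match.group(1) if from_match else None

    # Fallback: whichever mentioned location is left over fills the gap.
    mentioned = re.findall(loc_pattern, text)
    if pickup is None:
        others = [loc for loc in mentioned if loc != dest]
        pickup = others[0] if others else None
    if dest is None:
        others = [loc for loc in mentioned if loc != pickup]
        dest = others[0] if others else None

    # Times such as "6 PM" or "10pm".
    time_match = re.search(r"\b(\d{1,2})\s*([ap]m)\b", text)
    time = None
    if time_match:
        time = f"{time_match.group(1)} {time_match.group(2).upper()}"

    return pickup, dest, time


print(extract_entities("airport to hauz khaas at 6 PM"))
print(extract_entities("I want to go to dwarka sector 21 from airport at 10 PM"))
```

Keyword rules like these are brittle for phrasings they were not written for, so a real solution would need to be checked against every message in sms.txt.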

**Task to be solved:**

Train:
1. Train a model using the Q-learning algorithm on the Taxi environment (`Taxi-v3` in this notebook).

Evaluation:
1. Take each text from "sms.txt" and fetch the pickup and drop locations from it.
2. Generate a random state from the environment and replace its pickup and drop with the ones fetched from sms.txt.
3. Evaluate your model's performance on all the texts given in sms.txt. The texts were generated for the four locations given in city.csv.
4. Check whether the fetched pickup and drop match the original pickup and drop given in "orig_df.csv".
5. If the fetched pickup and/or drop does not match the original, add a penalty and a reward of -10.
6. Calculate the total reward, the penalties, the number of wrongly predicted pickups/drops, and the average time steps per episode.
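Evaluation steps 4 to 6 can be sketched as a small scoring function. This is an illustration under stated assumptions, not the notebook's implementation: the name `score_extractions` is hypothetical, each SMS is represented as a `(pickup, drop)` pair of location indices, and a mismatch in either field is counted as one penalty with a reward of -10 for that SMS.

```python
def score_extractions(fetched_pairs, original_pairs, mismatch_reward=-10):
    """Compare fetched (pickup, drop) pairs with the original ones.

    Assumption: one penalty and one reward of `mismatch_reward` per SMS whose
    pickup and/or drop differs from the original.
    """
    wrong = penalties = reward = 0
    for fetched, original in zip(fetched_pairs, original_pairs):
        if fetched != original:
            wrong += 1
            penalties += 1
            reward += mismatch_reward
    return {"wrong_predictions": wrong, "penalties": penalties, "reward": reward}


# Made-up example: the second SMS has a wrongly fetched drop location.
fetched = [(3, 2), (3, 1), (2, 0)]
original = [(3, 2), (3, 2), (2, 0)]
print(score_extractions(fetched, original))
```

The returned totals would then be added to the rewards and penalties collected while the agent drives each episode.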

### IMPORTING MODULES

In [1]:
import random
import gym
import numpy as np
import pandas as pd
from IPython.display import clear_output

### READING DATASETS

In [2]:
org_df = pd.read_csv('org_df.csv')
sms_df = pd.read_csv('sms.txt', header=None)

### CREATING NEW COLUMN AND MERGING THE ORG AND SMS DATASET

In [3]:
sms_df.columns = ['SMS']
df = pd.concat([sms_df, org_df], axis=1)

### MAPPING THE ORIGIN AND DESTINATION

In [4]:
df['origin'] = df['origin'].map(
    {'airport': 3, 'hauz khaas': 2, 'dwarka sector 21': 1, 'dwarka sector 23': 0})

df['dest'] = df['dest'].map(
    {'airport': 3, 'hauz khaas': 2, 'dwarka sector 21': 1, 'dwarka sector 23': 0})

In [5]:
df.head()

| | SMS | origin | dest | time |
|---|---|---|---|---|
| 0 | Please book a cab from airport to hauz khaas a... | 3 | 2 | 3 PM |
| 1 | airport to hauz khaas at 6 PM | 3 | 2 | 6 PM |
| 2 | Kindly book a cab for me at 1 PM from hauz kha... | 2 | 0 | 1 PM |
| 3 | airport to hauz khaas at 1 AM | 3 | 2 | 1 AM |
| 4 | I want to go to dwarka sector 21 from airport ... | 3 | 1 | 10 PM |


## ------------------------------ REINFORCEMENT LEARNING ------------------------------

### RENDERING THE ENVIRONMENT USING TAXI v3 FROM GYM PACKAGE

In [6]:
env = gym.make("Taxi-v3").env
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Cab at row 3, column 4; passenger waiting at Y; destination G.)



#### Note
- The filled square represents the cab, which is yellow without a passenger and green with a passenger.
- The pipe ("|") represents a wall which the cab cannot cross.
- R, G, Y, B are the possible pickup and destination locations.
- The blue letter represents the current passenger pick-up location.
- The purple letter is the current destination.

### RESETTING ENVIRONMENT TO NEW RANDOM STATE

In [7]:
env.reset()
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Cab at row 0, column 1; passenger waiting at R; destination B.)



In [8]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


**Note** - We have an Action Space of size 6 and a State Space of size 500

#### We have the cab at row 3, column 1, our passenger is at location 2, and our destination is location 0

In [9]:
# (taxi row, taxi column, passenger index, destination index)
state = env.encode(3, 1, 2, 0)
print("State:", state)

env.s = state
env.render()

State: 328
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Cab at row 3, column 1; passenger waiting at Y; destination R.)
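The state number 328 follows from packing the four components into a single integer: 5 rows x 5 columns x 5 passenger positions (the four locations plus "in the cab") x 4 destinations gives the 500 states reported above. The pure-Python sketch below reproduces that arithmetic without Gym; the standalone `encode` and `decode` functions are illustrative helpers that mirror what `env.encode` computes, not a replacement for it.

```python
def encode(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack (row, column, passenger index, destination index) into one state id."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx


def decode(state):
    """Unpack a state id back into (row, column, passenger index, destination index)."""
    state, dest_idx = divmod(state, 4)
    state, pass_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx


print(encode(3, 1, 2, 0))  # the state used in this notebook
print(decode(328))
print(encode(4, 4, 4, 3))  # largest state id
```

Replacing the pickup and drop of a random state with values fetched from an SMS amounts to decoding the state, swapping the last two components, and encoding it again.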



### THE REWARD TABLE
When the Taxi environment is created, an initial reward table called `P` is created along with it.

In [10]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

**Note** - This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.
1. The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in the current state shown in the illustration.
2. In this env, `probability` is always 1.0.
3. `nextstate` is the state we would be in if we took the action at this index of the dict.
4. All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we were in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action (5).
5. `done` tells us when we have successfully dropped off a passenger at the right location. Each successful dropoff ends an episode.

**If our agent chose action three (3) in this state, it would be driving West into a wall, which is why the next state for that action is still 328. It's impossible to actually move the cab across a wall, so every time the cab chooses that action it just accrues another -1 penalty, which hurts the long-term reward.**
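The reward table entry can also be inspected programmatically. The sketch below copies the `env.P[328]` output shown above into a plain dictionary and lists the movement actions that leave the cab in the same state, which is how a move into a wall shows up in the table. The names `P_328`, `ACTION_NAMES`, and `blocked_moves` are illustrative and not part of Gym.

```python
# Copied from the env.P[328] output above: {action: [(probability, nextstate, reward, done)]}
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def blocked_moves(state, transitions):
    """Return the movement actions (0-3) whose next state equals the current state."""
    return [
        action
        for action, outcomes in transitions.items()
        if action < 4 and all(next_state == state for _, next_state, _, _ in outcomes)
    ]


for action in blocked_moves(328, P_328):
    print(f"Action {action} ({ACTION_NAMES[action]}) runs into a wall in state 328")
```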

### Q TABLE

In [11]:
q_table = np.zeros([env.observation_space.n, env.action_space.n])

### TRAINING THE AGENT

#### Hyperparameters

In [12]:
# Learning Rate
alpha = 0.1

# Discount Factor
gamma = 0.6

# Trade-off Between Exploration and Exploitation
epsilon = 0.1

#### For Plotting Metrics

In [13]:
all_epochs = []
all_penalties = []

In [14]:
# Train for 100,000 episodes
for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            # Explore action space
            action = env.action_space.sample()
        else:
            # Exploit learned values
            action = np.argmax(q_table[state])

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * \
            (reward + gamma * next_max)
        
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training Finished.\n")

Episode: 100000
Training Finished.
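The update applied inside the training loop is the standard tabular Q-learning rule. Written with the same symbols as the hyperparameters above ($\alpha$ for the learning rate, $\gamma$ for the discount factor), and followed by a worked first update from an all-zero table after a single -1 movement reward:

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)

% Worked example with \alpha = 0.1, \gamma = 0.6, r = -1 and Q initialised to zero:
Q(s, a) \leftarrow (1 - 0.1) \cdot 0 + 0.1 \left( -1 + 0.6 \cdot 0 \right) = -0.1
```

Here $s'$ is `next_state`, $r$ is `reward`, and $\max_{a'} Q(s', a')$ is `next_max` in the code.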



#### Q-values at our State

In [15]:
q_table[328]

array([ -2.40413359,  -2.27325184,  -2.40658982,  -2.35992994,
       -10.81291785, -10.8658985 ])
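Reading these values the way the agent does, the greedy action is the index with the largest Q-value. The short sketch below uses the numbers printed above; the variable names are illustrative only.

```python
# Q-values for state 328, copied from the output above
q_values_328 = [-2.40413359, -2.27325184, -2.40658982, -2.35992994,
                -10.81291785, -10.8658985]
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Index of the largest Q-value, equivalent to np.argmax(q_table[328])
best_action = max(range(len(q_values_328)), key=q_values_328.__getitem__)
print(best_action, ACTION_NAMES[best_action])
```

Moving north is plausible for this state: the passenger at Y sits on the other side of a wall from the cab, so the cab has to go around it, while pickup and dropoff are clearly the worst choices because neither is legal here.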

In [16]:
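def create_loc_dict(city_df):
    # Same location-to-index mapping as used for the origin and dest columns.
    # city_df is currently unused; the mapping is hard-coded.
    loc_dict = {}
    loc_dict['dwarka sector 23'] = 0
    loc_dict['dwarka sector 21'] = 1
    loc_dict['hauz khaas'] = 2
    loc_dict['airport'] = 3
    return loc_dict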

#### Evaluate Agent's performance after Q-learning

In [17]:
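total_epochs, total_penalties = 0, 0
total_rewards = 0

# One evaluation episode per SMS, so the episode count is the number of rows in df
episodes = df.shape[0]
city = pd.read_csv('city.csv')
loc_dict = create_loc_dict(city)

with open('sms.txt', 'r') as f:
    for line in f:
        # NOTE: the SMS text in `line` is not parsed here; each episode starts
        # from a random state rather than from the pickup/drop in the message
        state = env.reset()
        epochs, penalties, reward = 0, 0, 0
        done = False

        while not done:
            # Always exploit the learned Q-values (no exploration)
            action = np.argmax(q_table[state])
            state, reward, done, info = env.step(action)

            if reward == -10:
                penalties += 1

            epochs += 1

        # Only the final step's reward (+20 on a successful dropoff) is added
        total_rewards += reward
        total_penalties += penalties
        total_epochs += epochs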

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")
print(f"Total Rewards: {total_rewards}")

Results after 1000 episodes:
Average timesteps per episode: 13.007
Average penalties per episode: 0.0
Total Rewards: 20000


### RENDERING THE ENVIRONMENT FOR THE DROPOFF

In [18]:
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
(Cab at Y after dropping off the passenger there.)
