# Predicting ETA of Taxi Cabs - Nayana Anil

#### Problem
Riders feel that the ETA we show them when they open the app is inaccurate. When does the ETA forecast tend to work well, and when does it break down? What is the impact of ETA inaccuracy on rider cancellations? Improve the accuracy of the existing ETA estimate so that the estimated_time_to_arrival (rider_trips) is a better prediction of the actual_time_to_arrival (driver_trips). With the accuracy improvement from your model, design an experiment to estimate the impact on the marketplace.
<br>

Note: For the purpose of this study, only trips with ETA over a 2 minute tolerance level have been considered. That is, trips where the actual time of arrival was -  later than the ETA by 2 or more minutes -  earlier than the ETA by 2 or more minutes.<br>
The value of delta ETA above 2 min ( _abs(Actual time of Arrival - Estimated Time of Arrival) > 2_ ) is commonly used. For readability, we will call this **DE2**.
<br>
## When does the ETA forecast tend to work well, and when does it break down?
We look at the relationship of ETA with other trip data to find statistical patterns.

### Spatial Stats
Observations: 
 -  Number of trips with DE2 is unevenly distributed, with Chelsea Court having most number of DE2 trips. However, this is in proportion to the total number of trips per geo.
 -  Mean DE2 is negative for all locations, 
 
Inference:
 -  Majority of misestimated trips arrived earlier than predicted.

![title](StatsETA.png)

### Temporal Stats
Observations:
 -  Average DE2 peaks on Thursdays, across all locations.
 -  Average DE2 is positive on Thursdays, and negative on all other days. 
 
Inference:
  -  On average, cabs arrive early on all days except Thursdays.
  
![title](Temporal Stats.png)

### Temporal and Spatial Stats
Observations:
 -  The number of trips above tolerance are highest on Saturdays, for all locations. But they have lowest standard deviation, and negative average DE2.
 
Inference:
 -  Even though a large number of trips were outside tolerance level on Saturdays, they majorly arrived early than late.

![title](T&SStats2.png)

![title](T&SStats1.png)

### Driver Lifetime Completed trips
Observations:
 -  The distribution is normal around the 0 delta ETA line. Also, the distribution is symmetric along the 0 delta ETA line. Interestingly, the same holds good for all the locations.
 -  As driver's lifetime fares increase, the DE2 decreases gradually, indicating a negative correlation. 
 
Inference:
 -  The more trips the driver completes, the more likely of adhering to ETA.
 
 
![title](LifetimeCompletedTrips.png)

### DriverLifetime Fares

Observations:
 -  The distribution is normal around the 0 delta ETA line. Also, the distribution is symmetric along the 0 delta ETA line. Interestingly, the same holds good for all the locations.
 -  As driver's lifetime fares increase, the DE2 decreases drastically, indicating a negative correlation. 
 
Inference:
 -  The number of trips that are late and number of trips early are almost the same given a driver's lifetime fare. Also, the more a driver earns, more likely of adhering to ETA.
![title](LifetimeFares.png)

### Driver Lifetime Rating

Observations:
 -  Majority of DE2 deviations are clustered between ratings of 4.4 and 5.0. 
 -  Below a rating of 4.4, deviation of DE2 is small. Between ratings 4.4 and 4.8, DE2 deviates drastically, more so in positive direction than negative. For ratings above 4.8, we see a drastic reduction in DE2.
 -  Deviations in DE2 are almost identical in positive and negative direction 
 
Inference:
 -  It seems as though, drivers with a low rating adhere to the ETA to get their ratings up. Drivers with ratings between 4.4 and 4.8 seem to not adhere to ETA as well. However, drivers with a high rating, have very few ETA deviations, which probably explains the high rating
![title](LifetimeRating.png)

### Surge Multiplier

Observations:
 -  Lower the surge multiplier, more the deviation in DE2. Higher surge multipliers have lesser deviation. 
 
![title](SurgeMultiplier.png)


Observations:
 -  Surge multiplier 1.0 sees the maximum number of trips with DE2. Following that is a drastic fall, with surge multipliers > 1.0 seeing a gradual decrese in total trips with DE2, ultimately stagnating to one or fewer trips at 3.4 surge.
Inference:
 -  Low surge multipliers probably attract larger number of trips and hence more chance of deviating from ETA.
![title](SurgeMultiCount.png)

### Trip price pre discount

Observations:
 -  As trip price increases, DE2 decreases. Trips costing less than $10 see the most deviation in ETA.
![title](TripPricePreDiscount.png)

## What is the impact of ETA inaccuracy on rider cancellations?

Below is a graph that shows trip status vs the number of trips of each status.
![title](Cancelled ETA2.png)

From the graph, we see that 
 -  Total trips = 57577,  Total cancelled trips = 5598
The percentage is calculated below.


In [17]:
#Total Trips
tot_trips = 57577
tot_cancelled = 5598
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips: %.2f %%"
      % tot_cancel_percent)

Percentage of total cancelled trips: 9.72 %


 Now,
 -  Total trips within tolerance of 2 min = 44834, Total cancelled trips of these =  1422
 -  Total trips with DE2 = 6509, Total cancelled trips of these = 389


In [22]:
#Trips within tolerance of 2 min
tot_trips = 44834
tot_cancelled = 1422
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips within tolerance of 2 min: %.2f %%"
      % tot_cancel_percent)

#Trips above tolerance of 2 min
tot_trips = 6509
tot_cancelled = 389
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips above tolerance of 2 min: %.2f %%"
      % tot_cancel_percent)

Percentage of total cancelled trips within tolerance of 2 min: 3.17 %
Percentage of total cancelled trips above tolerance of 2 min: 5.98 %


Wrongly estimated ETA increased the number of cancellations by more than 2% ! However, these numbers correspond to rides that are above a 2 minute tolerance. Below is a graph showing the trip status vs the number of trips of each status, now with tolerance increased to 5 min.

![title](Cancelled ETA5.png)

Now,
 -  Total trips within tolerance of 5 min = 50692, Total cancelled trips of these = 1735
 -  Total trips with DE5 = 651, Total cancelled trips of these = 76

In [21]:
#Trips within tolerance of 5 min
tot_trips = 50692
tot_cancelled = 1735
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips within tolerance of 5 min: %.2f %%"
      % tot_cancel_percent)

#Trips above tolerance of 5 min
tot_trips = 651
tot_cancelled = 76
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips above tolerance of 5 min: %.2f %%"
      % tot_cancel_percent)

Percentage of total cancelled trips within tolerance of 5 min: 3.42 %
Percentage of total cancelled trips above tolerance of 5 min: 11.67 %


Cancellations increase from 3.5% to 11.5%, that's an increase of 8% ! That's a pretty major impact that ETA has on cancellations.

Having explored the data, let's try to use the information to predict ETA accurately.
Data from different tables was joined using Tableau. As seen in graphs above, the 'DeltaETA' column has been generated for use. 
DeltaETA = Actual Time To Arrival - Estimated Time To Arrival

In [73]:
#Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor

#Get the dataset
df_temp = pd.read_excel('21Nov.xlsx')
print(df_temp.columns)

Index(['DeltaETA', 'Number of Records', 'Actual Time To Arrival',
       'Driver Payout', 'Driver Id (Driver!Data)', 'Driver Id',
       'End Geo (Rider!Trips)', 'End Geo', 'Estimated Time To Arrival',
       'First Completed Trip (Rider!Data)', 'First Completed Trip',
       'First Trip City Id', 'Lifetime Completed Trips', 'Lifetime Fares',
       'Lifetime Payments', 'Lifetime Rating', 'Lifetime Trips',
       'Request Time (Rider!Trips)', 'Request Time', 'Rider Id (Rider!Data)',
       'Rider Id', 'Rider Payment', 'Start Geo (Rider!Trips)', 'Start Geo',
       'Surge Multiplier (Rider!Trips)', 'Surge Multiplier',
       'Trip Id (Rider!Trips)', 'Trip Id', 'Trip Price Pre Discount',
       'Trip Status (Rider!Trips)', 'Trip Status'],
      dtype='object')


In [74]:
# To allow better prediction, we're removing any outliers that may skew prediction.
# Here outliers are DeltaETA above 15 min, and DeltaETA below -10 min.
# Also, the assumption is that trips with DeltaETA above 2 minutes of tolerance (in both +ve and -ve) are the real targets
# and only those have been considered.
df_temp_late = df_temp[(df_temp['DeltaETA'] > 2) & (df_temp['DeltaETA'] < 15)]
df_temp_early = df_temp[(df_temp['DeltaETA'] < (-2)) & (df_temp['DeltaETA'] > (-10))]
df = df_temp_late.append(df_temp_early)
print(df.head())

    DeltaETA  Number of Records  Actual Time To Arrival  Driver Payout  \
44  2.833333                  1                4.483333         6.4960   
55  2.983333                  1                9.083333         3.0208   
57  2.733333                  1                6.266667         5.2032   
65  2.166667                  1                6.716667         4.6144   
78  2.366667                  1                6.650000         2.6176   

   Driver Id (Driver!Data)  Driver Id End Geo (Rider!Trips)        End Geo  \
44               435c-49a9  435c-49a9         Chelsea Court  Chelsea Court   
55               41a9-7ef0  41a9-7ef0           Daisy Drive    Daisy Drive   
57               4e39-92e4  4e39-92e4            Allen Abby     Allen Abby   
65               439a-e8ea  439a-e8ea         Chelsea Court  Chelsea Court   
78               4ccf-2716  4ccf-2716         Chelsea Court  Chelsea Court   

    Estimated Time To Arrival First Completed Trip (Rider!Data)     ...      \
44     

In [75]:
# The model cannot be built on data that is completely unrelated to the target, or data that sees into the future. To 
# avoid this we only select feature columns which have relevance to our end goal.
# Discarded columns: DeltaETA(future-seeing), Driver Payout(future-seeing), DriverID(irrelevant), End Geo(irrelevant),
# Estimated Time to arrival(what needs to be predicted), First Completed Trip(irrelevant), Rider Payment(future-seeing),
# Trip Id(irrelevant), Trip status(future-seeing)

# Note: Columns 'Trip Status' and 'Estimated Time To Arrival' are needed for later use. I retain them here,
# to avoid redundant processing. However they will NOT be used to train the model.
df = df[['Lifetime Completed Trips', 'Lifetime Fares', 'Lifetime Rating', 'Request Time', 'Start Geo', 'Surge Multiplier', 'Trip Price Pre Discount','Trip Status', 'Estimated Time To Arrival','Actual Time To Arrival']]

#remove values with NaN
df.dropna(inplace = True)
df = df.reset_index(drop=True)

#shuffle data to ensure no over/underfitting
from sklearn.utils import shuffle
df = shuffle(df)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6449 entries, 1251 to 2431
Data columns (total 10 columns):
Lifetime Completed Trips     6449 non-null int64
Lifetime Fares               6449 non-null float64
Lifetime Rating              6449 non-null float64
Request Time                 6449 non-null datetime64[ns]
Start Geo                    6449 non-null object
Surge Multiplier             6449 non-null float64
Trip Price Pre Discount      6449 non-null float64
Trip Status                  6449 non-null object
Estimated Time To Arrival    6449 non-null float64
Actual Time To Arrival       6449 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(1), object(2)
memory usage: 554.2+ KB
None


In [76]:
# The data contains 'Request Time' feature which is a datetime object, and needs to be parsed into relevant features. We extract
# hour and day of the week, and drop the original column
df['hour'] = [df['Request Time'][i].hour for i in range(len(df['Request Time']))] 
df['weekday'] = df[['Request Time']].apply(lambda x: dt.datetime.strftime(x['Request Time'], '%A'), axis=1)
df.drop(['Request Time'], axis=1, inplace=True)
print(df.head(10))

      Lifetime Completed Trips  Lifetime Fares  Lifetime Rating  \
1251                      4415       45292.824         4.697238   
5656                      4818       53643.984         4.691003   
3606                      5544       64040.784         4.741105   
4846                      5544       64040.784         4.741105   
2278                      9215       99047.352         4.780362   
2619                      2794       95554.104         4.829098   
5692                      5478       52608.344         4.792280   
2797                        34         333.408         5.000000   
1466                      2653       27699.000         4.772064   
3434                       705        6253.632         4.918635   

          Start Geo  Surge Multiplier  Trip Price Pre Discount Trip Status  \
1251  Chelsea Court               1.0                    5.296   completed   
5656     Allen Abby               1.0                    8.608   completed   
3606  Chelsea Court         

In [77]:
# The data now contains two categorical variables, 'Start Geo' and 'weekday'. They need to be converted into equivalent numeric
# features to enable accurate predictions. We do this using one hot encoding.
one_hot = pd.get_dummies(df[['Start Geo']])
df = df.drop('Start Geo', axis=1)
df = df.join(one_hot)
df.drop('Start Geo_Daisy Drive', axis=1, inplace=True)

one_hot = pd.get_dummies(df[['weekday']])
df = df.drop('weekday', axis=1)
df = df.join(one_hot)
df.drop('weekday_Wednesday', axis=1, inplace=True)

print(df.head())

      Lifetime Completed Trips  Lifetime Fares  Lifetime Rating  \
1251                      4415       45292.824         4.697238   
5656                      4818       53643.984         4.691003   
3606                      5544       64040.784         4.741105   
4846                      5544       64040.784         4.741105   
2278                      9215       99047.352         4.780362   

      Surge Multiplier  Trip Price Pre Discount Trip Status  \
1251               1.0                    5.296   completed   
5656               1.0                    8.608   completed   
3606               1.0                   10.144   completed   
4846               2.3                   13.536   completed   
2278               1.0                    5.808   completed   

      Estimated Time To Arrival  Actual Time To Arrival  hour  \
1251                   1.350000                5.116667    16   
5656                   7.266667                5.016667    19   
3606                   

In [78]:
# In order to compare performance of model built here, with the given ETA predictions, we will keep aside the values 
# of the 'Estimated Time To Arrival' and 'Trip Status' columns. 

# The data is split into a training set and a testing set, with a 80/20 split ratio

uber_eta = df['Estimated Time To Arrival'].values
dwnld_df = df['Trip Status'].values
y = df['Actual Time To Arrival'].values

X = df.drop(['Estimated Time To Arrival', 'Trip Status', 'Actual Time To Arrival'], axis=1).values

trainset = round(0.8*len(X))
print(trainset)

X_train = X[:trainset, :]
y_train = y[:trainset]

X_test = X[trainset: , :]
y_test = y[trainset:]

uber_eta_test = uber_eta[trainset:]
dwnld_df_test = dwnld_df[trainset: ]

[  4.41500000e+03   4.52928240e+04   4.69723832e+00   1.00000000e+00
   5.29600000e+00   1.60000000e+01   0.00000000e+00   0.00000000e+00
   1.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00]
(6449, 15)
(6449,)
(6449,)
5159


In [96]:
# Applying RandomForest regression model
# Justification: The dataset has complex underlying relationships that need to be addressed. Hardly any linearity
# is present. Very robust on small datasets.
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

print("Mean squared error uber to actual: %.2f"
      % mean_squared_error(y_test, uber_eta_test))


Mean squared error: 12.27
Mean squared error uber to actual: 13.05


We see the model built here has a mean square error of 12.27 percent; performs better than the given ETA weighing in at a higher MSE of 13.05. Accuracy has been improved.

To find the impact of this increase in accuracy, we download the results to a CSV file

In [89]:
# In order to determine the impact of the new model, we use the Trip status values of predicted trips
dowfle = pd.DataFrame({'Trip Status':dwnld_df_test, 'Estimated Time of Arrival':y_pred, 'Actual Time to Arrival':y_test})
print(dowfle.head())
dowfle.to_csv('dowfle')


   Actual Time to Arrival  Estimated Time of Arrival     Trip Status
0                0.783333                   4.606667       completed
1                1.866667                   4.416667       completed
2                2.566667                   1.535000       completed
3                6.000000                   5.218333       completed
4                3.983333                   2.938333  rider_canceled


Plotting the Trip Status vs Number of trips with ETA above tolerance graph as we did in our statistical analysis, we find that there is an improvement in the number of cancelled trips, shown calculated below.

![title](NewCancelled.png)

In [94]:
#Trips above tolerance of 2 min, with new predicted ETA
tot_trips = 1290
tot_cancelled = 76
tot_cancel_percent = tot_cancelled/tot_trips * 100
print("Percentage of total cancelled trips above tolerance of 2 min, with new ETAs: %.2f %%"
      % tot_cancel_percent)


Percentage of total cancelled trips above tolerance of 2 min, with new ETAs: 5.88 %


Previously, the percentage was 5.98%. This graph seems to suggest an improvement of 0.1%.

This means,
With the previosuly predicted ETA, 598 people for every 10,000 people would cancel their rides if ETA was above tolerance level.
With the new ETAs, 588 people for every 10,000 people will cancel their rides, under same conditions.

The new ETA avoids 10 people for every 10,000 people from cancelling their rides.

To test this in the field, we conduct an experiment. Our hypothesis is
**The new ETA prediction will decrease rider cancellation by 0.1%**

1. The null hypothesis is the new ETA prediction will not decrease rider cancellation by 0.1%.
2. We use the Pearson's correlation coefficient to find correlation between ETA and rider cancellation, and a significance level of 0.05. 
3. Collect the data and calculate p-value. 
4. If p-value is less than 0.05, the test is significant and the null hypothesis can be rejected.
