
## Problem Statement:
Almost all of us have taken an Ola or Uber ride at some point.

Ride-hailing services use online platforms to connect passengers with local drivers who use their personal vehicles. In most cases they are a comfortable way to travel door to door, and they are usually cheaper than licensed taxicabs. Examples include Uber and Lyft.


To improve the efficiency of taxi dispatching systems for such services, it is important to be able to predict how long a driver will have their taxi occupied. If a dispatcher knew approximately when a driver would end the current ride, they could better decide which driver to assign to each pickup request.

In this competition, we are challenged to build a model that predicts the total ride duration of taxi trips in New York City.

## Data Dictionary

- id - a unique identifier for each trip
- vendor_id - a code indicating the provider associated with the trip record
- pickup_datetime - date and time when the meter was engaged
- dropoff_datetime - date and time when the meter was disengaged
- passenger_count - the number of passengers in the vehicle (driver entered value)
- pickup_longitude - the longitude where the meter was engaged
- pickup_latitude - the latitude where the meter was engaged
- dropoff_longitude - the longitude where the meter was disengaged
- dropoff_latitude - the latitude where the meter was disengaged
- store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y=store and forward; N=not a store and forward trip)
- trip_duration - duration of the trip in seconds
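
As a quick sanity check on the definitions above, `trip_duration` should equal the gap between `dropoff_datetime` and `pickup_datetime` in seconds. The sketch below assumes that relationship and tests it on two records copied from the first rows of the dataset; the names `sample` and `computed_seconds` are illustrative only.

```python
import pandas as pd

# Two records copied from the first rows of the dataset
sample = pd.DataFrame({
    "pickup_datetime": ["2016-02-29 16:40:21", "2016-03-11 23:35:37"],
    "dropoff_datetime": ["2016-02-29 16:47:01", "2016-03-11 23:53:57"],
    "trip_duration": [400.0, 1100.0],
})

pickup = pd.to_datetime(sample["pickup_datetime"])
dropoff = pd.to_datetime(sample["dropoff_datetime"])

# Elapsed time between meter on and meter off, in seconds
computed_seconds = (dropoff - pickup).dt.total_seconds()

# True for both rows if trip_duration matches the timestamp difference
print((computed_seconds == sample["trip_duration"]).tolist())
```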


In [0]:
# Import libraries

import pandas as pd
import numpy as np

In [0]:
# Load Data
df = pd.read_csv("/content/nyc_taxi_trip_duration.csv")

In [16]:
# check shape
df.shape

(462119, 11)

In [17]:
# explore few rows
df.head()

|   | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id1080784 | 2 | 2016-02-29 16:40:21 | 2016-02-29 16:47:01 | 1 | -73.953918 | 40.778873 | -73.963875 | 40.771164 | N | 400.0 |
| 1 | id0889885 | 1 | 2016-03-11 23:35:37 | 2016-03-11 23:53:57 | 2 | -73.988312 | 40.731743 | -73.994751 | 40.694931 | N | 1100.0 |
| 2 | id0857912 | 2 | 2016-02-21 17:59:33 | 2016-02-21 18:26:48 | 2 | -73.997314 | 40.721458 | -73.948029 | 40.774918 | N | 1635.0 |
| 3 | id3744273 | 2 | 2016-01-05 09:44:31 | 2016-01-05 10:03:32 | 6 | -73.96167 | 40.75972 | -73.956779 | 40.780628 | N | 1141.0 |
| 4 | id0232939 | 1 | 2016-02-17 06:42:23 | 2016-02-17 06:56:31 | 1 | -74.01712 | 40.708469 | -73.988182 | 40.740631 | N | 848.0 |


In [18]:
# Let's check whether any data is missing
df.isnull().sum()




id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    1
trip_duration         1
dtype: int64

Only one row has missing values (in store_and_fwd_flag and trip_duration), so let's drop it.

In [0]:
df = df.dropna()

In [20]:
df.isnull().sum()

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462118 entries, 0 to 462117
Data columns (total 11 columns):
id                    462118 non-null object
vendor_id             462118 non-null int64
pickup_datetime       462118 non-null object
dropoff_datetime      462118 non-null object
passenger_count       462118 non-null int64
pickup_longitude      462118 non-null float64
pickup_latitude       462118 non-null float64
dropoff_longitude     462118 non-null float64
dropoff_latitude      462118 non-null float64
store_and_fwd_flag    462118 non-null object
trip_duration         462118 non-null float64
dtypes: float64(5), int64(2), object(4)
memory usage: 42.3+ MB


In [0]:
# Converting to datetime

df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])



In [0]:
# Adding some date-time features
# (.dt is the pandas datetime accessor, so no separate datetime import is needed)
df['weekday'] = df['pickup_datetime'].dt.weekday
df['hour'] = df['pickup_datetime'].dt.hour
df['minute'] = df['pickup_datetime'].dt.minute
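
To make the encoding of these features concrete, here is a small sketch on a single timestamp taken from the first row of the data. It shows that pandas numbers weekdays from Monday = 0; the names `ts` and `features` are illustrative only.

```python
import pandas as pd

# One pickup timestamp from the first row of the dataset
ts = pd.Series(pd.to_datetime(["2016-02-29 16:40:21"]))

features = pd.DataFrame({
    "weekday": ts.dt.weekday,  # Monday=0 ... Sunday=6
    "hour": ts.dt.hour,        # 0-23
    "minute": ts.dt.minute,    # 0-59
})
print(features)
```

2016-02-29 was a Monday, so `weekday` is 0 for this row.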

In [0]:
# Let's calculate the straight-line distance between pickup and dropoff and add it as a feature.
# Note: this is a Euclidean distance in degrees, not in kilometres.

dlon = df['pickup_longitude'] - df['dropoff_longitude']
dlat = df['pickup_latitude'] - df['dropoff_latitude']

dist_sq = (dlon ** 2) + (dlat ** 2)

# distance
df['dist'] = np.sqrt(dist_sq)
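
The distance above is measured in degrees, and a degree of longitude is shorter than a degree of latitude at New York's latitude. As an alternative (not what this notebook uses), a great-circle distance can be computed with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names introduced for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or array-likes (e.g. pandas columns)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Pickup and dropoff coordinates from the first row of the dataset
d = haversine_km(40.778873, -73.953918, 40.771164, -73.963875)
print(d)
```

For the first trip this gives roughly 1.2 km. The function could be applied to the whole frame by passing the four coordinate columns in the same order.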


In [0]:
# error metric - root mean squared error

from sklearn.metrics import mean_squared_error
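
For reference, root mean squared error is the square root of the average squared difference between actual and predicted values, so it is expressed in the same unit as `trip_duration` (seconds). A minimal sketch with made-up predictions (`y_pred_demo` is hypothetical, not model output) shows that the manual formula and the `mean_squared_error` route agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([400.0, 1100.0, 1635.0])  # durations from the first rows of the data
y_pred_demo = np.array([500.0, 1000.0, 1635.0])  # hypothetical predictions

# Manual RMSE: sqrt(mean((actual - predicted)^2))
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same value via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```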


In [27]:
df.columns

Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration', 'weekday', 'hour', 'minute', 'dist'],
      dtype='object')

In [0]:

# During EDA, we observed that store_and_fwd_flag is not an important feature, so let's take
# vendor_id, passenger_count, weekday, hour, minute, dist as features.


X = df[['vendor_id','passenger_count',  'weekday', 'hour', 'minute', 'dist']]
y = df['trip_duration']



In [0]:
# Let's create a function to do k-fold cross-validation
from sklearn.model_selection import KFold
def cv_score(ml_model):
    i = 1
    cv_scores = []
    df1 = X.copy()

    # shuffle=True without random_state: folds (and scores) change from run to run
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(df1, y):
        print(' kfold number', i)
        # KFold yields positional indices, so select rows with .iloc rather than .loc
        xtr, xvl = df1.iloc[train_index], df1.iloc[test_index]
        ytr, yvl = y.iloc[train_index], y.iloc[test_index]

        model = ml_model
        model.fit(xtr, ytr)
        pred_val = model.predict(xvl)

        # RMSE on the held-out fold
        rmse_score = np.sqrt(mean_squared_error(yvl, pred_val))

        print('rmse_score', rmse_score)
        # Save scores
        cv_scores.append(rmse_score)
        i += 1
    return cv_scores
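
The same kind of evaluation can also be sketched with scikit-learn's built-in `cross_val_score`, using the `"neg_root_mean_squared_error"` scorer (available in scikit-learn 0.22 and later). This is an alternative to the hand-written loop, not the method used for the results below. The data here is synthetic (`X_demo`, `y_demo` are made up for illustration), and the fixed `random_state` is an added assumption to make the folds reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the taxi dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Seeded shuffle so the same folds are produced on every run
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns negative RMSE (higher is better), so flip the sign
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=kf, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse

print(rmse_per_fold)
print("mean:", rmse_per_fold.mean(), "std:", rmse_per_fold.std())
```

Unlike the loop above, `cross_val_score` fits a fresh clone of the estimator for each fold.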

### Linear Regression

In [0]:
# Linear Regression 
from sklearn.linear_model import LinearRegression 
lr = LinearRegression()

In [49]:
cv_score(lr)

 kfold number 1
rmse_score 3047.5301711275088
 kfold number 2
rmse_score 2831.407711358925
 kfold number 3
rmse_score 3146.425204830452
 kfold number 4
rmse_score 3097.9449117214635
 kfold number 5
rmse_score 7108.351553949935


[3047.5301711275088,
 2831.407711358925,
 3146.425204830452,
 3097.9449117214635,
 7108.351553949935]

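The best single-fold RMSE for linear regression was about 2831 seconds. The mean across the five folds is about 3846 seconds, pulled up by one fold at about 7108, which may reflect a few extreme trip durations landing in that fold.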

### Random Forest

In [0]:
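# Random Forest with 10 trees (set explicitly, since the default n_estimators depends on the scikit-learn version)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10)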


In [51]:
cv_score(rf)

 kfold number 1
rmse_score 3231.005097591883
 kfold number 2
rmse_score 3388.8780908880253
 kfold number 3
rmse_score 7201.734636954985
 kfold number 4
rmse_score 3442.122101817168
 kfold number 5
rmse_score 3931.0519919734406


[3231.005097591883,
 3388.8780908880253,
 7201.734636954985,
 3442.122101817168,
 3931.0519919734406]

In [52]:
# Let's run Random Forest with 100 trees
rf = RandomForestRegressor(n_estimators=100)
cv_score(rf)

 kfold number 1
rmse_score 3109.170548847724
 kfold number 2
rmse_score 4911.306634787569
 kfold number 3
rmse_score 3262.2667218611705
 kfold number 4
rmse_score 7159.770647867461
 kfold number 5
rmse_score 3742.018890311907


[3109.170548847724,
 4911.306634787569,
 3262.2667218611705,
 7159.770647867461,
 3742.018890311907]
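
To compare the three runs on more than their best fold, the fold scores printed above can be summarized by mean, median and standard deviation. The sketch below copies those scores (rounded to two decimals); the dictionary names are illustrative only.

```python
import numpy as np

# Fold-level RMSE values copied from the outputs above (rounded to 2 decimals)
fold_rmse = {
    "LinearRegression": [3047.53, 2831.41, 3146.43, 3097.94, 7108.35],
    "RandomForest (10 trees)": [3231.01, 3388.88, 7201.73, 3442.12, 3931.05],
    "RandomForest (100 trees)": [3109.17, 4911.31, 3262.27, 7159.77, 3742.02],
}

summary = {
    name: {
        "mean": float(np.mean(scores)),
        "median": float(np.median(scores)),
        "std": float(np.std(scores)),
    }
    for name, scores in fold_rmse.items()
}

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.1f}, "
          f"median={stats['median']:.1f}, std={stats['std']:.1f}")
```

On these particular runs, linear regression has the lowest mean and median RMSE. Every model has one fold above 7000, though, and the folds were shuffled without a fixed seed, so the differences may owe more to a few extreme trips than to the models themselves.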