
# <b><u>Project Title: Taxi Trip Time Prediction: Predicting the Total Ride Duration of Taxi Trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC) and was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)
* #### trip_duration - duration of the trip in seconds
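
As a quick illustration of the fields above, the sketch below loads two records into a DataFrame, parses both timestamp columns at read time, and checks that `trip_duration` equals the interval between pickup and dropoff. The in-memory CSV and the names `sample_csv`, `sample`, and `elapsed` are assumptions made for this example; the two rows are copied from the first rows of the dataset.

```python
import io
import pandas as pd

# Two records copied from the dataset; `sample_csv` is a hypothetical stand-in for the real file.
sample_csv = io.StringIO(
    "id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,"
    "pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,"
    "store_and_fwd_flag,trip_duration\n"
    "id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,"
    "-73.982155,40.767937,-73.96463,40.765602,N,455\n"
    "id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,"
    "-73.980415,40.738564,-73.999481,40.731152,N,663\n"
)

# Parse both timestamp columns while reading instead of converting them afterwards.
sample = pd.read_csv(sample_csv, parse_dates=["pickup_datetime", "dropoff_datetime"])

# trip_duration should equal the meter-on to meter-off interval in seconds.
elapsed = (sample["dropoff_datetime"] - sample["pickup_datetime"]).dt.total_seconds()
print((elapsed == sample["trip_duration"]).all())
```

A check like this is a cheap way to confirm that the target column and the two timestamps agree before any features are built on them.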

In [1]:
import numpy as np
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df=pd.read_csv('/content/drive/MyDrive/Almabetter/NYC Taxi Data.csv')

In [4]:
df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


In [5]:
df.shape

(1458644, 11)

In [6]:
df.isnull().sum(axis=0)

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

In [7]:
for col in df.columns:
  print(f'Number of unique values in {col} is {df[col].nunique()}')

Number of unique values in id is 1458644
Number of unique values in vendor_id is 2
Number of unique values in pickup_datetime is 1380222
Number of unique values in dropoff_datetime is 1380377
Number of unique values in passenger_count is 10
Number of unique values in pickup_longitude is 23047
Number of unique values in pickup_latitude is 45245
Number of unique values in dropoff_longitude is 33821
Number of unique values in dropoff_latitude is 62519
Number of unique values in store_and_fwd_flag is 2
Number of unique values in trip_duration is 7417


In [8]:
df['pickup_datetime']=pd.to_datetime(df['pickup_datetime'])

In [9]:
df['pickup_datetime'][0].date()

datetime.date(2016, 3, 14)

In [10]:
# Collect the distinct pickup years without looping over every timestamp
unique_years=set(df['pickup_datetime'].dt.year.unique().tolist())

In [11]:
unique_years

{2016}

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   id                  1458644 non-null  object        
 1   vendor_id           1458644 non-null  int64         
 2   pickup_datetime     1458644 non-null  datetime64[ns]
 3   dropoff_datetime    1458644 non-null  object        
 4   passenger_count     1458644 non-null  int64         
 5   pickup_longitude    1458644 non-null  float64       
 6   pickup_latitude     1458644 non-null  float64       
 7   dropoff_longitude   1458644 non-null  float64       
 8   dropoff_latitude    1458644 non-null  float64       
 9   store_and_fwd_flag  1458644 non-null  object        
 10  trip_duration       1458644 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(3), object(3)
memory usage: 122.4+ MB


In [13]:
!pip install holidays

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
import holidays
# Dates of the US holidays that fall in 2016
holiday=list(holidays.USA(years=2016).keys())

In [15]:
holiday

[datetime.date(2016, 1, 1),
 datetime.date(2016, 1, 18),
 datetime.date(2016, 2, 15),
 datetime.date(2016, 5, 30),
 datetime.date(2016, 7, 4),
 datetime.date(2016, 9, 5),
 datetime.date(2016, 10, 10),
 datetime.date(2016, 11, 11),
 datetime.date(2016, 11, 24),
 datetime.date(2016, 12, 25),
 datetime.date(2016, 12, 26)]

In [16]:
df['pickup_datetime'][0].date() in holiday

False

In [17]:
def get_holiday(X):
  '''returns 1 if the date X is in the holiday list, else 0'''
  return int(X in holiday)

In [18]:
df1=df

In [19]:
# Vectorized holiday flag: np.append inside a loop copies the whole array on every iteration
A=df['pickup_datetime'].dt.normalize().isin(pd.to_datetime(holiday)).to_numpy(dtype=float)

In [33]:
df1=pd.DataFrame(A,columns=['is_holiday']).join(df)
# Next, derive calendar features (day, month, week, hour) from pickup_datetime

In [59]:
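def get_date(X):
  '''returns the day of the month (1-31)'''
  return X.day
def get_month_num(X):
  '''returns the month number (1-12)'''
  return X.month
def get_week(X):
  '''returns the week number of the year'''
  return X.week
def get_hour(X):
  '''returns the hour of the day (0-23)'''
  return X.hour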

In [60]:
df1['date_num']=df['pickup_datetime'].apply(get_date)
df1['month_num']=df['pickup_datetime'].apply(get_month_num)
df1['week_num']=df['pickup_datetime'].apply(get_week)
df1['hour_of_day']=df['pickup_datetime'].apply(get_hour)
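
The same calendar features can also be derived without per-row Python calls through the pandas `.dt` accessor. The sketch below is an alternative, not the notebook's method; the two timestamps are taken from the first rows of the dataset, and the names `pickups` and `features` are introduced only for illustration.

```python
import pandas as pd

# Two pickup times copied from the first rows of the dataset; `pickups` is a hypothetical name.
pickups = pd.Series(pd.to_datetime(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]))

features = pd.DataFrame({
    "date_num": pickups.dt.day,                             # day of the month
    "month_num": pickups.dt.month,                          # 1 = January
    "week_num": pickups.dt.isocalendar().week.astype(int),  # ISO 8601 week number
    "hour_of_day": pickups.dt.hour,                         # 0-23
})
print(features)
```

On the full frame, the same expressions would be applied to `df1['pickup_datetime']` in place of `pickups`.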

In [61]:
df1.head()

Unnamed: 0,is_holiday,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,date_num,month_num,week_num,hour_of_day
0,0.0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,14,3,11,17
1,0.0,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,12,6,23,0
2,0.0,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,19,1,3,11
3,0.0,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,6,4,14,19
4,0.0,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,26,3,12,13


In [63]:
# Next, compute the pickup-to-dropoff distance
!pip install haversine


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting haversine
  Downloading haversine-2.5.1-py2.py3-none-any.whl (6.1 kB)
Installing collected packages: haversine
Successfully installed haversine-2.5.1


In [67]:
import haversine as hs
def get_distance(start_lat,start_long,end_lat,end_long):
  '''returns the great-circle distance in km between two (lat, long) points'''
  loc1=(start_lat,start_long)
  loc2=(end_lat,end_long)
  return hs.haversine(loc1,loc2)

In [69]:
df1['distance_in_km']=df1.apply(
    lambda X: get_distance(X['pickup_latitude'],X['pickup_longitude'],
                           X['dropoff_latitude'],X['dropoff_longitude']),
    axis=1)
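
Row-wise `apply` calls a Python function once per trip, which is slow on roughly 1.46 million rows. As a hedged alternative, the haversine formula can be written directly in NumPy and applied to whole columns at once. The function name `haversine_km` and the mean Earth radius of 6371.0088 km are assumptions of this sketch; the results should agree with the `haversine` package to within rounding.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the first two trips in the dataset.
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.96463, -73.999481])

distance_km = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distance_km.round(3))
```

With the notebook's frame, the four arguments would be the `pickup_latitude`, `pickup_longitude`, `dropoff_latitude`, and `dropoff_longitude` columns of `df1`.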

In [70]:
df1.head()

Unnamed: 0,is_holiday,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,date_num,month_num,week_num,hour_of_day,distance_in_km
0,0.0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,14,3,11,17,1.498523
1,0.0,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,12,6,23,0,1.80551
2,0.0,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,19,1,3,11,6.385107
3,0.0,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,6,4,14,19,1.4855
4,0.0,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,26,3,12,13,1.18859


In [75]:
# distance (km) divided by duration (s), scaled by 3600 s/h, gives km/h
df1['average_speed_kmph']=df1['distance_in_km']*3600/df1['trip_duration']
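
The speed is distance divided by time, with the factor 3600 converting seconds to hours. For the first trip, 1.498523 km over 455 s gives 1.498523 × 3600 / 455 ≈ 11.86 km/h. The sketch below repeats that arithmetic and, as an assumption not present in the notebook, treats a zero-second duration as missing rather than dividing by it. `trips` is a hypothetical toy frame whose third row is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: two rows from the dataset plus a made-up zero-duration row.
trips = pd.DataFrame({
    "distance_in_km": [1.498523, 1.80551, 0.5],
    "trip_duration": [455, 663, 0],
})

# Replace zero durations with NaN so the division yields NaN instead of inf.
duration_hours = trips["trip_duration"].replace(0, np.nan) / 3600
trips["average_speed_kmph"] = trips["distance_in_km"] / duration_hours
print(trips)
```

Whether such a guard is needed depends on the minimum value of `trip_duration` in the real data, which can be checked with `df['trip_duration'].min()`.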

In [76]:
df1.head()

Unnamed: 0,is_holiday,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,date_num,month_num,week_num,hour_of_day,distance_in_km,average_speed_kmph
0,0.0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455,14,3,11,17,1.498523,11.856445
1,0.0,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663,12,6,23,0,1.80551,9.803672
2,0.0,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,19,1,3,11,6.385107,10.822216
3,0.0,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429,6,4,14,19,1.4855,12.465738
4,0.0,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435,26,3,12,13,1.18859,9.836608


In [77]:
df['id'].nunique()

1458644

In [78]:
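# Finalize the dataset: keep the original and engineered feature columns
# Note: average_speed_kmph is computed from trip_duration, so it leaks the target if used as a model input
df1=df1[['vendor_id','passenger_count','is_holiday','store_and_fwd_flag','date_num',
         'month_num','week_num','hour_of_day','distance_in_km','average_speed_kmph']]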