<h1>Taxi Demand Prediction</h1>

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml 

<h3> DATA DICTIONARY </h3>
<table>
<th>Field Name <th>Description
<tr> <td>VendorID <td>A code indicating the TPEP provider that provided the record.
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
<tr>
<td>tpep_pickup_datetime <td> The date and time when the meter was engaged.
<tr>
<td>tpep_dropoff_datetime <td>The date and time when the meter was disengaged.
<tr>
<td>Passenger_count <td>The number of passengers in the vehicle.
This is a driver-entered value.
<tr>
<td>Trip_distance <td>The elapsed trip distance in miles reported by the taximeter.
<tr>
<td>PULocationID <td>TLC Taxi Zone in which the taximeter was engaged
<tr><td>DOLocationID <td>TLC Taxi Zone in which the taximeter was disengaged
<tr><td>RateCodeID <td>The final rate code in effect at the end of the trip.
1= Standard rate
2=JFK
3=Newark
4=Nassau or Westchester
5=Negotiated fare
6=Group ride
<tr>
<td>Store_and_fwd_flag <td>This flag indicates whether the trip record was held in vehicle
memory before sending to the vendor, aka “store and forward,”
because the vehicle did not have a connection to the server.
Y= store and forward trip
N= not a store and forward trip
<tr>
<td>Payment_type <td>A numeric code signifying how the passenger paid for the trip.
1= Credit card
2= Cash
3= No charge
4= Dispute
5= Unknown
6= Voided trip
<tr><td>
Fare_amount <td>The time-and-distance fare calculated by the meter.
<tr><td>
Extra <td>Miscellaneous extras and surcharges. Currently, this only includes
the $0.50 and $1 rush hour and overnight charges.
<tr><td>
MTA_tax <td>$0.50 MTA tax that is automatically triggered based on the metered
rate in use.
<tr><td>
Improvement_surcharge <td>$0.30 improvement surcharge assessed trips at the flag drop. The
improvement surcharge began being levied in 2015.
<tr><td>
Tip_amount <td>Tip amount – This field is automatically populated for credit card
tips. Cash tips are not included.
<tr><td>
Tolls_amount <td>Total amount of all tolls paid in trip.
<tr><td>
Total_amount <td>The total amount charged to passengers. Does not include cash tips.
<tr><td>
Congestion_Surcharge <td>Total amount collected in trip for NYS congestion surcharge.
<tr><td>
Airport_fee <td>$1.25 for pick up only at LaGuardia and John F. Kennedy Airports
</table>

In [1]:
import pandas as pd

In [2]:
%pip install pyarrow
%pip install fastparquet


Note: you may need to restart the kernel to use updated packages.


In [3]:
df = pd.read_parquet("yellow_tripdata_2022-01.parquet")

In [4]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

**Columns with Null Values** : 
    <ol>
    
        passenger_count

        RatecodeID

        store_and_fwd_flag

        congestion_surcharge

        airport_fee
    

In [5]:
df.loc[df['VendorID'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


<h3>Handling Null Values of Airport_fee and Congestion Surcharge </h3>

Airport Fee : All NaN values replaced by zero

Congestion Surcharge : All NaN values replaced by
$$
Total amount -(fare amount+extra+mta tax+tip amount+tolls amount+improvement surcharge)
$$

In [6]:
import numpy as np
from numpy.core.umath import ceil

In [7]:
def null_value_handled(df):
    df['airport_fee']= df['airport_fee'].fillna(0)
    df.loc[df['airport_fee'].isnull()]
    df['congestion_surcharge']= df['congestion_surcharge'].fillna(df['total_amount']-(df['fare_amount']+df['extra']+df['mta_tax']+df['tip_amount']+df['tolls_amount']+df['improvement_surcharge']))
    df=df.drop('store_and_fwd_flag',axis=1)
    average=df['total_amount']/df['trip_distance']
    mean=np.mean(average[np.isfinite(average)])
    df['passenger_count']= df['passenger_count'].fillna(0)
    df['RatecodeID']= df['RatecodeID'].fillna(0)
    return(df)


In [8]:
df = null_value_handled(df)

In [9]:
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.80,1.0,142,236,1,14.50,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.10,1.0,236,42,1,8.00,0.5,0.5,4.00,0.0,0.3,13.30,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,166,166,1,7.50,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,114,68,2,8.00,0.5,0.5,0.00,0.0,0.3,11.80,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.30,1.0,68,163,1,23.50,0.5,0.5,3.00,0.0,0.3,30.30,2.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2463926,2,2022-01-31 23:36:53,2022-01-31 23:42:51,0.0,1.32,0.0,90,170,0,8.00,0.0,0.5,2.39,0.0,0.3,13.69,2.5,0.0
2463927,2,2022-01-31 23:44:22,2022-01-31 23:55:01,0.0,4.19,0.0,107,75,0,16.80,0.0,0.5,4.35,0.0,0.3,24.45,2.5,0.0
2463928,2,2022-01-31 23:39:00,2022-01-31 23:50:00,0.0,2.10,0.0,113,246,0,11.22,0.0,0.5,2.00,0.0,0.3,16.52,2.5,0.0
2463929,2,2022-01-31 23:36:42,2022-01-31 23:48:45,0.0,2.92,0.0,148,164,0,12.40,0.0,0.5,0.00,0.0,0.3,15.70,2.5,0.0


In [10]:
df.dtypes


VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object

In [11]:
df.loc[df['RatecodeID'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


In [12]:
import datetime
import time

In [13]:
def convert_to_unix(s):
    # return time.mktime(datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())
    return (s- np.datetime64('1970-01-01T00:00:00Z'))/np.timedelta64(1, 's')


In [14]:
def calc_trip_times(df):

    duration = df[['tpep_pickup_datetime', 'tpep_dropoff_datetime']]
# pickups and dropoffs to unix time
    duration_pickup = [convert_to_unix(x) for x in duration['tpep_pickup_datetime'].values]
    duration_drop = [convert_to_unix(x) for x in duration['tpep_dropoff_datetime'].values]
# calculate duration of trips
    durations = (np.array(duration_drop) - np.array(duration_pickup))/float(60)
    # append durations of trips and speed in miles/hr to a new dataframe
    new_frame = df[['passenger_count', 'trip_distance', 'PULocationID','DOLocationID','total_amount']].copy()
    new_frame['trip_time'] = durations
    new_frame['pickup_times']= duration_pickup
    new_frame['Speed'] = 60 *(new_frame['trip_distance']/new_frame['trip_time'])
    
    return new_frame



In [15]:
df_with_durations = calc_trip_times(df)

  return (s- np.datetime64('1970-01-01T00:00:00Z'))/np.timedelta64(1, 's')


In [16]:
df_with_durations

Unnamed: 0,passenger_count,trip_distance,PULocationID,DOLocationID,total_amount,trip_time,pickup_times,Speed
0,2.0,3.80,142,236,21.95,17.816667,1.640997e+09,12.797007
1,1.0,2.10,236,42,13.30,8.400000,1.640997e+09,15.000000
2,1.0,0.97,166,166,10.56,8.966667,1.640998e+09,6.490706
3,1.0,1.09,114,68,11.80,10.033333,1.640997e+09,6.518272
4,1.0,4.30,68,163,30.30,37.533333,1.640997e+09,6.873890
...,...,...,...,...,...,...,...,...
2463926,0.0,1.32,90,170,13.69,5.966667,1.643672e+09,13.273743
2463927,0.0,4.19,107,75,24.45,10.650000,1.643673e+09,23.605634
2463928,0.0,2.10,113,246,16.52,11.000000,1.643672e+09,11.454545
2463929,0.0,2.92,148,164,15.70,12.050000,1.643672e+09,14.539419


In [17]:
from scipy.stats import zscore

In [18]:
def z_score(df_with_durations):
    df_with_durations['total_amount']= zscore(df_with_durations['total_amount'])
    return df_with_durations

df_with_durations = z_score(df_with_durations)

In [19]:
print(max(df_with_durations['PULocationID']))

265


In [20]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.cluster import KMeans
from numpy.ma import count
from numpy.core.umath import log

In [21]:
oldcol  = df['PULocationID'].values.reshape(-1,1)

In [22]:
def binning (df) : 
    bins  = ceil(1+3.322*log(count(df_with_durations['PULocationID'])))
    oldcol  = df['PULocationID'].values.reshape(-1,1)
    kmeans = KMeans(n_clusters=int(bins),  random_state=0).fit(oldcol)
    newcol = kmeans.predict(oldcol)
    df['PUCluster'] = newcol
    return df

In [23]:
bins  = ceil(1+3.322*log(count(df_with_durations['PULocationID'])))
kmeans = KMeans(n_clusters=int(bins),  random_state=0).fit(oldcol)
df_with_durations['PUCluster'] = kmeans.predict(oldcol)
cluster_centers = kmeans.cluster_centers_
cluster_len = len(cluster_centers)

In [24]:
print(cluster_centers)

[[193.89482584]
 [ 99.91232911]
 [245.84929029]
 [142.        ]
 [ 48.0027514 ]
 [164.42091529]
 [ 74.74728682]
 [228.99579152]
 [264.227877  ]
 [237.        ]
 [ 24.10200219]
 [131.98222566]
 [113.58723744]
 [210.74174618]
 [ 90.02265474]
 [ 67.84306951]
 [106.999583  ]
 [186.01118406]
 [  4.91630478]
 [169.99458794]
 [151.07820527]
 [160.9996035 ]
 [ 43.12135187]
 [137.61115718]
 [ 79.02517043]
 [233.67970595]
 [262.47235985]
 [239.00924351]
 [124.98948962]
 [224.23865127]
 [249.00153396]
 [ 12.94122131]
 [147.86499289]
 [139.99879322]
 [ 87.31348912]
 [ 34.13888889]
 [157.9826454 ]
 [143.43057456]
 [231.08923809]
 [162.        ]
 [255.52069858]
 [238.        ]
 [235.99904903]
 [141.        ]
 [179.87195903]
 [230.        ]
 [ 70.02316033]
 [ 50.14148655]
 [163.        ]
 [ 41.16703241]]


Time Binning


In [25]:
def add_pickup_bins(frame,month):
    unix_pickup_times=[i for i in frame['pickup_times'].values]
    unix_times = [1640975400,1643653800,1646073000,1648751400]
    
    start_pickup_unix=unix_times[month-1]

    frame['pickup_bins'] = np.array(start_pickup_unix)
    return frame
df_with_durations = add_pickup_bins(df_with_durations,1)

In [26]:
jan_2021_groupby = df_with_durations[['PUCluster','pickup_bins','trip_distance']].groupby(['PUCluster','pickup_bins']).count()

In [27]:
def dataprep(df,month):
    df = null_value_handled(df)
    df = calc_trip_times(df)
    df = z_score(df)
    df = binning(df)
    df = add_pickup_bins(df,month)
    df_groupby = df_with_durations[['PUCluster','pickup_bins','trip_distance']].groupby(['PUCluster','pickup_bins']).count()
    return df , df_groupby

In [28]:
df_feb_2022 = pd.read_parquet('yellow_tripdata_2022-02.parquet')
df_march_2022 = pd.read_parquet('yellow_tripdata_2022-03.parquet')
df_april_2022 = pd.read_parquet('yellow_tripdata_2022-04.parquet')


In [29]:
df_feb_2022.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-02-01 00:06:58,2022-02-01 00:19:24,1.0,5.4,1.0,N,138,252,1,17.0,1.75,0.5,3.9,0.0,0.3,23.45,0.0,1.25
1,1,2022-02-01 00:38:22,2022-02-01 00:55:55,1.0,6.4,1.0,N,138,41,2,21.0,1.75,0.5,0.0,6.55,0.3,30.1,0.0,1.25
2,1,2022-02-01 00:03:20,2022-02-01 00:26:59,1.0,12.5,1.0,N,138,200,2,35.5,1.75,0.5,0.0,6.55,0.3,44.6,0.0,1.25
3,2,2022-02-01 00:08:00,2022-02-01 00:28:05,1.0,9.88,1.0,N,239,200,2,28.0,0.5,0.5,0.0,3.0,0.3,34.8,2.5,0.0
4,2,2022-02-01 00:06:48,2022-02-01 00:33:07,1.0,12.16,1.0,N,138,125,1,35.5,0.5,0.5,8.11,0.0,0.3,48.66,2.5,1.25


In [30]:
df_feb_2022, groupby_feb_2022= dataprep(df_feb_2022,2)
df_march_2022, groupby_march_2022 = dataprep(df_march_2022,3)
df_april_2022 , groupby_april_2022 = dataprep(df_april_2022,4)


  return (s- np.datetime64('1970-01-01T00:00:00Z'))/np.timedelta64(1, 's')
  return (s- np.datetime64('1970-01-01T00:00:00Z'))/np.timedelta64(1, 's')
  return (s- np.datetime64('1970-01-01T00:00:00Z'))/np.timedelta64(1, 's')


In [31]:
#Preparing the Dataframe only with x(i) values as jan-2015 data and y(i) values as jan-2016
ratios = pd.DataFrame()
ratios['Given']= jan_2021_groupby
ratios['Prediction']= groupby_feb_2022
ratios['Ratios']=ratios['Prediction']*1.0/ratios['Given']*1.0

## SMOOTHING

In [32]:
# Gets the unique bins where pickup values are present for each each reigion

# for each cluster region we will collect all the indices of 10min intravels in which the pickups are happened
# we got an observation that there are some pickpbins that doesnt have any pickups
def return_unq_pickup_bins(frame):
    values = []
    for i in range(0, 40):
        new = frame[frame['PUCluster'] == i]
        list_unq = list(set(new['pickup_bins']))
        list_unq.sort()
        values.append(list_unq)
    return values

In [33]:
jan_2021_unique = return_unq_pickup_bins(df_with_durations)
feb_2022_unique = return_unq_pickup_bins(df_feb_2022)

# march
march_2022_unique = return_unq_pickup_bins(df_march_2022)

# april
april_2022_unique = return_unq_pickup_bins(df_april_2022)

In [34]:
# for each cluster number of 10min intravels with 0 pickups
for i in range(40):
    print("for the ", i, "th cluster number of 10min intavels with zero pickups: ",
          4464 - len(set(jan_2021_unique[i])))
    print('-'*60)

for the  0 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  1 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  2 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  3 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  4 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  5 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  6 th cluster number of 10min intavels with zero pickups:  4463
------------------------------------------------------------
for the  7 th cluster number of 10min intavels with zero pickups:  44

In [35]:
# Fills a value of zero for every bin where no pickup data is present
# the count_values: number pickps that are happened in each region for each 10min intravel
# there wont be any value if there are no picksups.
# values: number of unique bins

# for every 10min intravel(pickup_bin) we will check it is there in our unique bin,
# if it is there we will add the count_values[index] to smoothed data
# if not we add 0 to the smoothed data
# we finally return smoothed data
def fill_missing(count_values, values):
    smoothed_regions = []
    ind = 0
    for r in range(0, 40):
        smoothed_bins = []
        for i in range(4464):
            if i in values[r]:
                smoothed_bins.append(count_values[ind])
                ind += 1
            else:
                smoothed_bins.append(0)
        smoothed_regions.extend(smoothed_bins)
    return smoothed_regions

In [36]:
# Fills a value of zero for every bin where no pickup data is present
# the count_values: number pickps that are happened in each region for each 10min intravel
# there wont be any value if there are no picksups.
# values: number of unique bins

# for every 10min intravel(pickup_bin) we will check it is there in our unique bin,
# if it is there we will add the count_values[index] to smoothed data
# if not we add smoothed data (which is calculated based on the methods that are discussed in the above markdown cell)
# we finally return smoothed data
import math


def smoothing(count_values, values):
    smoothed_regions = []  # stores list of final smoothed values of each reigion
    ind = 0
    repeat = 0
    smoothed_value = 0
    for r in range(0, 40):
        smoothed_bins = []  # stores the final smoothed values
        repeat = 0
        for i in range(4464):
            if repeat != 0:  # prevents iteration for a value which is already visited/resolved
                repeat -= 1
                continue
            if i in values[r]:  # checks if the pickup-bin exists
                # appends the value of the pickup bin if it exists
                smoothed_bins.append(count_values[ind])
            else:
                if i != 0:
                    right_hand_limit = 0
                    for j in range(i, 4464):
                        # searches for the left-limit or the pickup-bin value which has a pickup value
                        if j not in values[r]:
                            continue
                        else:
                            right_hand_limit = j
                            break
                    if right_hand_limit == 0:
                        # Case 1: When we have the last/last few values are found to be missing,hence we have no right-limit here
                        smoothed_value = count_values[ind-1] * \
                            1.0/((4463-i)+2)*1.0
                        for j in range(i, 4464):
                            smoothed_bins.append(math.ceil(smoothed_value))
                        smoothed_bins[i-1] = math.ceil(smoothed_value)
                        repeat = (4463-i)
                        ind -= 1
                    else:
                        # Case 2: When we have the missing values between two known values
                        smoothed_value = (
                            count_values[ind-1]+count_values[ind])*1.0/((right_hand_limit-i)+2)*1.0
                        for j in range(i, right_hand_limit+1):
                            smoothed_bins.append(math.ceil(smoothed_value))
                        smoothed_bins[i-1] = math.ceil(smoothed_value)
                        repeat = (right_hand_limit-i)
                else:
                    # Case 3: When we have the first/first few values are found to be missing,hence we have no left-limit here
                    right_hand_limit = 0
                    for j in range(i, 4464):
                        if j not in values[r]:
                            continue
                        else:
                            right_hand_limit = j
                            break
                    smoothed_value = count_values[ind] * \
                        1.0/((right_hand_limit-i)+1)*1.0
                    for j in range(i, right_hand_limit+1):
                        smoothed_bins.append(math.ceil(smoothed_value))
                    repeat = (right_hand_limit-i)
            ind += 1
        smoothed_regions.extend(smoothed_bins)
    return smoothed_regions


In [37]:
# Filling Missing values of Jan-2015 with 0
# here in jan_2015_groupby dataframe the trip_distance represents the number of pickups that are happened
jan_2021_fill = fill_missing(
    jan_2021_groupby['trip_distance'].values, jan_2021_unique)

# Smoothing Missing values of Jan-2015
jan_2021_smooth = smoothing(
    jan_2021_groupby['trip_distance'].values, jan_2021_unique)


In [38]:
# Jan-2015 data is smoothed, Jan,Feb & March 2016 data missing values are filled with zero
jan_2021_smooth = smoothing(
    jan_2021_groupby['trip_distance'].values, jan_2021_unique)
feb_2022_smooth = fill_missing(
    groupby_feb_2022['trip_distance'].values, feb_2022_unique)
march_2022_smooth = fill_missing(
    groupby_march_2022['trip_distance'].values, march_2022_unique)
april_2022_smooth = fill_missing(
    groupby_april_2022['trip_distance'].values, april_2022_unique)

# Making list of all the values of pickup data in every bin for a period of 3 months and storing them region-wise
regions_cum = []

# a =[1,2,3]
# b = [2,3,4]
# a+b = [1, 2, 3, 2, 3, 4]

# number of 10min indices for jan 2015= 24*31*60/10 = 4464
# number of 10min indices for jan 2016 = 24*31*60/10 = 4464
# number of 10min indices for feb 2016 = 24*29*60/10 = 4176
# number of 10min indices for march 2016 = 24*31*60/10 = 4464
# regions_cum: it will contain 40 lists, each list will contain 4464+4176+4464 values which represents the number of pickups
# that are happened for three months in 2016 data

for i in range(0, 40):
    regions_cum.append(feb_2022_smooth[4464*i:4464*(
        i+1)]+march_2022_smooth[4176*i:4176*(i+1)]+april_2022_smooth[4464*i:4464*(i+1)])

# print(len(regions_cum))
# 40
# print(len(regions_cum[0]))
# 13104