<h1>Taxi Demand Prediction</h1>

Data source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

<h3> DATA DICTIONARY </h3>
<table>
<tr><th>Field Name <th>Description
<tr> <td>VendorID <td>A code indicating the TPEP provider that provided the record.
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
<tr>
<td>tpep_pickup_datetime <td> The date and time when the meter was engaged.
<tr>
<td>tpep_dropoff_datetime <td>The date and time when the meter was disengaged.
<tr>
<td>Passenger_count <td>The number of passengers in the vehicle.
This is a driver-entered value.
<tr>
<td>Trip_distance <td>The elapsed trip distance in miles reported by the taximeter.
<tr>
<td>PULocationID <td>TLC Taxi Zone in which the taximeter was engaged
<tr><td>DOLocationID <td>TLC Taxi Zone in which the taximeter was disengaged
<tr><td>RateCodeID <td>The final rate code in effect at the end of the trip.
1= Standard rate
2=JFK
3=Newark
4=Nassau or Westchester
5=Negotiated fare
6=Group ride
<tr>
<td>Store_and_fwd_flag <td>This flag indicates whether the trip record was held in vehicle
memory before sending to the vendor, aka “store and forward,”
because the vehicle did not have a connection to the server.
Y= store and forward trip
N= not a store and forward trip
<tr>
<td>Payment_type <td>A numeric code signifying how the passenger paid for the trip.
1= Credit card
2= Cash
3= No charge
4= Dispute
5= Unknown
6= Voided trip
<tr><td>
Fare_amount <td>The time-and-distance fare calculated by the meter.
<tr><td>
Extra <td>Miscellaneous extras and surcharges. Currently, this only includes
the $0.50 and $1 rush hour and overnight charges.
<tr><td>
MTA_tax <td>$0.50 MTA tax that is automatically triggered based on the metered
rate in use.
<tr><td>
Improvement_surcharge <td>$0.30 improvement surcharge assessed on trips at the flag drop. The
improvement surcharge has been levied since 2015.
<tr><td>
Tip_amount <td>Tip amount – This field is automatically populated for credit card
tips. Cash tips are not included.
<tr><td>
Tolls_amount <td>Total amount of all tolls paid in trip.
<tr><td>
Total_amount <td>The total amount charged to passengers. Does not include cash tips.
<tr><td>
Congestion_Surcharge <td>Total amount collected in trip for NYS congestion surcharge.
<tr><td>
Airport_fee <td>$1.25 for pick up only at LaGuardia and John F. Kennedy Airports
</table>
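
The coded fields in the dictionary above can be decoded into readable labels while exploring the data. The sketch below is a minimal illustration on a hand-made frame and is not part of the original notebook; the `PAYMENT_TYPE_LABELS` dictionary, the `sample` frame, and the `payment_label` column are names introduced here, with the labels copied from the table.

```python
import pandas as pd

# Labels copied from the data dictionary; the dict name is illustrative.
PAYMENT_TYPE_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# Toy frame standing in for the real trip records.
sample = pd.DataFrame({"payment_type": [1, 2, 0, 4]})

# Series.map leaves codes that are missing from the dict (here 0) as NaN.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_TYPE_LABELS)
print(sample)
```

Codes outside the dictionary, such as the 0 that appears in some rows of this dataset, map to NaN, which makes them easy to spot.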

In [107]:
import pandas as pd

In [108]:
%pip install pyarrow
%pip install fastparquet

Note: you may need to restart the kernel to use updated packages.


In [109]:
df = pd.read_parquet("C:/Users/GarimaJi/Downloads/yellow_tripdata_2022-01.parquet")

In [110]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

**Columns with Null Values**:
<ol>
<li>passenger_count</li>
<li>RatecodeID</li>
<li>store_and_fwd_flag</li>
<li>congestion_surcharge</li>
<li>airport_fee</li>
</ol>
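
A list like this can be derived by counting missing values per column. The snippet below is a minimal sketch on a toy frame with made-up values, not the notebook's actual cell; `toy`, `null_counts`, and `columns_with_nulls` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps; values are illustrative only.
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0],
    "trip_distance": [1.2, 3.4, 0.5],
    "airport_fee": [np.nan, np.nan, 0.0],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two lines on `df` would report the null count of each affected column.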
    

In [111]:
df.loc[df['VendorID'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


<h3>Handling Null Values of Airport_fee and Congestion Surcharge </h3>

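Airport fee: all NaN values are replaced by zero.

Congestion surcharge: all NaN values are replaced by the part of the total that the other charges do not account for:
$$
\text{congestion\_surcharge} = \text{total\_amount} - (\text{fare\_amount} + \text{extra} + \text{mta\_tax} + \text{tip\_amount} + \text{tolls\_amount} + \text{improvement\_surcharge})
$$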
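
One way to gauge how well this residual recovers the surcharge is to apply it to rows where `congestion_surcharge` is already recorded. The sketch below is not part of the original notebook: it hard-codes the first five rows of the January 2022 data as they appear in this notebook's output, and `rows`, `ITEMISED`, `residual`, and `matches` are names introduced here.

```python
import numpy as np
import pandas as pd

# First five rows of the notebook's DataFrame output, typed in by hand.
rows = pd.DataFrame({
    "fare_amount": [14.50, 8.00, 7.50, 8.00, 23.50],
    "extra": [3.0, 0.5, 0.5, 0.5, 0.5],
    "mta_tax": [0.5, 0.5, 0.5, 0.5, 0.5],
    "tip_amount": [3.65, 4.00, 1.76, 0.00, 3.00],
    "tolls_amount": [0.0, 0.0, 0.0, 0.0, 0.0],
    "improvement_surcharge": [0.3, 0.3, 0.3, 0.3, 0.3],
    "total_amount": [21.95, 13.30, 10.56, 11.80, 30.30],
    "congestion_surcharge": [2.5, 0.0, 0.0, 2.5, 2.5],
})

ITEMISED = ["fare_amount", "extra", "mta_tax", "tip_amount",
            "tolls_amount", "improvement_surcharge"]

# Residual of the total after subtracting every other itemised charge.
residual = rows["total_amount"] - rows[ITEMISED].sum(axis=1)

# Compare with the recorded surcharge, allowing for floating-point noise.
matches = np.isclose(residual.to_numpy(),
                     rows["congestion_surcharge"].to_numpy(), atol=1e-6)
print(matches)
```

On these five rows the residual agrees with the recorded surcharge in four cases. In the first row the other components already sum to the total, so the residual is 0 while 2.5 is recorded, which suggests treating the fill as an approximation.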

In [112]:
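# Treat a missing airport fee as no fee charged
df['airport_fee'] = df['airport_fee'].fillna(0)
# Check: no rows should have a null airport_fee now
df.loc[df['airport_fee'].isnull()]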

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee


In [113]:
# Sum of every itemised charge except the congestion surcharge
itemised = df['fare_amount'] + df['extra'] + df['mta_tax'] + df['tip_amount'] + df['tolls_amount'] + df['improvement_surcharge']
df['congestion_surcharge'] = df['congestion_surcharge'].fillna(df['total_amount'] - itemised)
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.80,1.0,N,142,236,1,14.50,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.10,1.0,N,236,42,1,8.00,0.5,0.5,4.00,0.0,0.3,13.30,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.50,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.00,0.5,0.5,0.00,0.0,0.3,11.80,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.30,1.0,N,68,163,1,23.50,0.5,0.5,3.00,0.0,0.3,30.30,2.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2463926,2,2022-01-31 23:36:53,2022-01-31 23:42:51,,1.32,,,90,170,0,8.00,0.0,0.5,2.39,0.0,0.3,13.69,2.5,0.0
2463927,2,2022-01-31 23:44:22,2022-01-31 23:55:01,,4.19,,,107,75,0,16.80,0.0,0.5,4.35,0.0,0.3,24.45,2.5,0.0
2463928,2,2022-01-31 23:39:00,2022-01-31 23:50:00,,2.10,,,113,246,0,11.22,0.0,0.5,2.00,0.0,0.3,16.52,2.5,0.0
2463929,2,2022-01-31 23:36:42,2022-01-31 23:48:45,,2.92,,,148,164,0,12.40,0.0,0.5,0.00,0.0,0.3,15.70,2.5,0.0


In [114]:
df = df.drop(columns='store_and_fwd_flag')

In [115]:
df.loc[df['passenger_count'].isnull()]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
2392428,2,2022-01-01 00:50:00,2022-01-01 00:54:00,,1.00,,68,246,0,13.20,0.0,0.5,1.75,0.0,0.3,18.25,2.5,0.0
2392429,2,2022-01-01 00:49:24,2022-01-01 01:27:36,,13.31,,257,223,0,44.87,0.0,0.5,10.05,0.0,0.3,55.72,0.0,0.0
2392430,2,2022-01-01 00:42:00,2022-01-01 00:56:00,,2.87,,143,236,0,13.23,0.0,0.5,3.51,0.0,0.3,20.04,2.5,0.0
2392431,2,2022-01-01 00:40:00,2022-01-01 00:55:00,,3.24,,143,262,0,14.19,0.0,0.5,3.72,0.0,0.3,21.21,2.5,0.0
2392432,2,2022-01-01 00:40:00,2022-01-01 00:52:00,,2.19,,239,166,0,13.20,0.0,0.5,5.25,0.0,0.3,21.75,2.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2463926,2,2022-01-31 23:36:53,2022-01-31 23:42:51,,1.32,,90,170,0,8.00,0.0,0.5,2.39,0.0,0.3,13.69,2.5,0.0
2463927,2,2022-01-31 23:44:22,2022-01-31 23:55:01,,4.19,,107,75,0,16.80,0.0,0.5,4.35,0.0,0.3,24.45,2.5,0.0
2463928,2,2022-01-31 23:39:00,2022-01-31 23:50:00,,2.10,,113,246,0,11.22,0.0,0.5,2.00,0.0,0.3,16.52,2.5,0.0
2463929,2,2022-01-31 23:36:42,2022-01-31 23:48:45,,2.92,,148,164,0,12.40,0.0,0.5,0.00,0.0,0.3,15.70,2.5,0.0


In [116]:
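import numpy as np

# Cost per mile; a 0/0 trip gives NaN, a nonzero total over zero distance gives inf
average = df['total_amount'] / df['trip_distance']
# Rows where the ratio is NaN
nulcol = df[average.isnull()]
print(nulcol)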


         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
10065           1  2022-01-01 02:46:39   2022-01-01 03:31:48              1.0   
10171           1  2022-01-01 02:34:04   2022-01-01 02:34:16              2.0   
30840           2  2022-01-01 13:29:52   2022-01-01 13:30:08              1.0   
34114           2  2022-01-01 13:43:34   2022-01-01 13:43:47              1.0   
53758           1  2022-01-01 20:13:17   2022-01-01 20:13:17              1.0   
...           ...                  ...                   ...              ...   
2349501         2  2022-01-31 14:37:27   2022-01-31 14:37:53              1.0   
2349502         2  2022-01-31 14:35:17   2022-01-31 14:35:58              1.0   
2379438         2  2022-01-31 19:13:50   2022-01-31 19:14:06              1.0   
2383489         2  2022-01-31 20:06:25   2022-01-31 20:06:50              1.0   
2402287         1  2022-01-07 07:38:02   2022-01-07 07:40:45              NaN   

         trip_distance  Rat

In [117]:
# Mean cost per mile over finite ratios only (NaN and inf are excluded)
mean = np.mean(average[np.isfinite(average)])
mean

11.198534962613284
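
The `isnull` and `np.isfinite` filters in the two cells above select different rows, because dividing by a zero distance can produce either NaN or inf. The toy example below, with made-up values and names (`total`, `distance`, `ratio`) introduced only for illustration, shows the difference.

```python
import numpy as np
import pandas as pd

# Toy values: a normal trip, a 0/0 trip, and a trip with a fare but zero distance.
total = pd.Series([10.0, 0.0, 5.0])
distance = pd.Series([2.0, 0.0, 0.0])

ratio = total / distance            # 5.0, NaN (0/0), inf (5/0)

nan_rows = ratio.isnull()           # flags only the 0/0 row
finite = ratio[np.isfinite(ratio)]  # drops both NaN and inf
finite_mean = finite.mean()
print(ratio.tolist(), finite_mean)
```

Averaging only the finite ratios keeps zero-distance trips from turning the mean into inf or NaN.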

In [118]:
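# Estimate missing passenger counts as total / (mean cost per mile * distance), rounded up
missing = df['passenger_count'].isnull()
df.loc[missing, 'passenger_count'] = np.ceil(df.loc[missing, 'total_amount'] / (mean * df.loc[missing, 'trip_distance']))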
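
As a worked check of this imputation, the sketch below applies the same formula to two rows copied from the notebook output (totals 13.69 and 24.45) plus one made-up zero-distance row. It is an illustration rather than the notebook's own code: `MEAN_RATE`, `trips`, `estimate`, and `guarded` are names introduced here, and replacing non-finite estimates with NaN is only one hypothetical way to handle zero distances.

```python
import numpy as np
import pandas as pd

MEAN_RATE = 11.198534962613284  # mean cost per mile printed earlier in the notebook

# Two rows copied from the notebook output, plus a made-up zero-distance row.
trips = pd.DataFrame({
    "total_amount": [13.69, 24.45, 9.80],
    "trip_distance": [1.32, 4.19, 0.0],
})

# Same formula as the imputation cell: total / (mean rate * distance), rounded up.
estimate = np.ceil(trips["total_amount"] / (MEAN_RATE * trips["trip_distance"]))

# Zero distance yields inf; keeping only finite estimates turns it into NaN.
guarded = estimate.where(np.isfinite(estimate))
print(guarded.tolist())
```

Both real rows come out as 1 passenger, which matches the filled values in the final output, while the zero-distance row shows that the unguarded formula can write inf into `passenger_count`.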


In [119]:
df.loc[df['passenger_count'].isnull()]
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.80,1.0,142,236,1,14.50,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.10,1.0,236,42,1,8.00,0.5,0.5,4.00,0.0,0.3,13.30,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,166,166,1,7.50,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,114,68,2,8.00,0.5,0.5,0.00,0.0,0.3,11.80,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.30,1.0,68,163,1,23.50,0.5,0.5,3.00,0.0,0.3,30.30,2.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2463926,2,2022-01-31 23:36:53,2022-01-31 23:42:51,1.0,1.32,,90,170,0,8.00,0.0,0.5,2.39,0.0,0.3,13.69,2.5,0.0
2463927,2,2022-01-31 23:44:22,2022-01-31 23:55:01,1.0,4.19,,107,75,0,16.80,0.0,0.5,4.35,0.0,0.3,24.45,2.5,0.0
2463928,2,2022-01-31 23:39:00,2022-01-31 23:50:00,1.0,2.10,,113,246,0,11.22,0.0,0.5,2.00,0.0,0.3,16.52,2.5,0.0
2463929,2,2022-01-31 23:36:42,2022-01-31 23:48:45,1.0,2.92,,148,164,0,12.40,0.0,0.5,0.00,0.0,0.3,15.70,2.5,0.0
