<a href="https://colab.research.google.com/github/hargurjeet/bt/blob/main/NY_taxi_fare_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initial Observations

- This is a supervised machine learning (regression) problem.
- The dataset contains 50,000 records in total.
- The dataset has the following features:
  - key - Unique string identifying each row in both the training and test sets. It is composed of pickup_datetime plus a unique integer, but its structure does not matter; it should only be used as a unique ID field.
  - fare_amount (target) - float dollar amount of the cost of the taxi ride.
  - pickup_datetime - timestamp value indicating when the taxi ride started.
  - pickup_longitude - float for longitude coordinate of where the taxi ride started.
  - pickup_latitude - float for latitude coordinate of where the taxi ride started.
  - dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
  - dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
  - passenger_count - integer indicating the number of passengers in the taxi ride.

In [19]:
## Performing the required imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
required_cols = ['fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count']

# pickup_datetime is left out here: it is a timestamp handled by parse_dates, not a float.
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}
file_path = 'https://raw.githubusercontent.com/hargurjeet/bt/main/ny_taxi_fare_data.csv'
df = pd.read_csv(file_path, 
                 usecols = required_cols, 
                 parse_dates=['pickup_datetime'],
                 dtype = dtypes)
df.head()

| | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|
| 0 | 4.5 | 2009-06-15 17:26:21+00:00 | -73.844315 | 40.721317 | -73.841614 | 40.712276 | 1 |
| 1 | 16.9 | 2010-01-05 16:52:16+00:00 | -74.016045 | 40.711304 | -73.979271 | 40.782005 | 1 |
| 2 | 5.7 | 2011-08-18 00:35:00+00:00 | -73.982735 | 40.761269 | -73.991241 | 40.750561 | 2 |
| 3 | 7.7 | 2012-04-21 04:30:42+00:00 | -73.987129 | 40.733143 | -73.99157 | 40.758091 | 1 |
| 4 | 5.3 | 2010-03-09 07:51:00+00:00 | -73.968094 | 40.768009 | -73.956657 | 40.783764 | 1 |


In [3]:
df.shape

(50000, 7)

# Data Exploration

- Basic information about the dataset
- EDA and visualizations
- Key insights

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   fare_amount        50000 non-null  float32            
 1   pickup_datetime    50000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   50000 non-null  float32            
 3   pickup_latitude    50000 non-null  float32            
 4   dropoff_longitude  50000 non-null  float32            
 5   dropoff_latitude   50000 non-null  float32            
 6   passenger_count    50000 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 1.4 MB


There appear to be no null values in the dataset: every column reports 50,000 non-null entries.
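The same check can be made explicit rather than read off the `df.info()` output. The snippet below is a minimal sketch on a small hypothetical frame (`sample` is made up for illustration and deliberately contains one missing value); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the taxi data.
sample = pd.DataFrame({
    "fare_amount": [4.5, 16.9, np.nan],
    "passenger_count": [1, 1, 2],
})

# Count missing values per column; all zeros would mean no nulls.
null_counts = sample.isna().sum()
print(null_counts)

# Collapse to a single yes/no answer for the whole frame.
has_nulls = bool(sample.isna().any().any())
print(has_nulls)
```

Running `df.isna().sum()` on the loaded dataset should show zeros for every column if the reading of `df.info()` above is right.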

In [5]:
df.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 11.364215 | -72.521416 | 39.931904 | -72.517723 | 39.924244 | 1.66784 |
| std | 9.685438 | 10.392804 | 6.224685 | 10.406597 | 6.014816 | 1.289195 |
| min | -5.0 | -75.423851 | -74.006889 | -84.654243 | -74.006378 | 0.0 |
| 25% | 6.0 | -73.992065 | 40.734879 | -73.99115 | 40.734371 | 1.0 |
| 50% | 8.5 | -73.981842 | 40.752678 | -73.98008 | 40.753372 | 1.0 |
| 75% | 12.5 | -73.967148 | 40.767361 | -73.963585 | 40.768166 | 2.0 |
| max | 200.0 | 40.78347 | 401.083344 | 40.851028 | 43.415192 | 6.0 |


In [6]:
def min_max_date(df, date_col):
  return df[date_col].max(), df[date_col].min()

min_max_date(df, 'pickup_datetime')

(Timestamp('2015-06-30 22:42:39+0000', tz='UTC'),
 Timestamp('2009-01-01 01:31:49+0000', tz='UTC'))
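The time span covered by these two timestamps can be computed directly instead of estimated by eye. This is a small sketch that hard-codes the two values printed above; the variable names are illustrative only.

```python
import pandas as pd

# The two timestamps returned by min_max_date above.
latest = pd.Timestamp("2015-06-30 22:42:39", tz="UTC")
earliest = pd.Timestamp("2009-01-01 01:31:49", tz="UTC")

# Subtracting timestamps gives a Timedelta; .days is the whole-day part.
span = latest - earliest
span_years = span.days / 365.25  # approximate conversion to years

print(f"{span.days} days, about {span_years:.1f} years")
```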

Observations:
- The minimum fare amount is -5 dollars and the maximum is 200 dollars.
- 50% of rides cost $8.50 or less, and 75% cost $12.50 or less. This gives a basic idea of how good our models need to be.
- When making predictions, I would expect to be within ±$3 of the actual fare; otherwise I am off by a lot.
- The maximum passenger count is 6, which is highly unlikely, and the coordinate columns contain impossible values (for example, a maximum pickup latitude of 401.08). Hence we might require some data cleanup.
- Judging by the minimum and maximum dates, the data covers roughly six and a half years (January 2009 to June 2015).
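The cleanup hinted at in these observations could look something like the sketch below. Every bound here is an assumption chosen for illustration (a rough bounding box around New York City, a positive fare, one to six passengers), not a value taken from this notebook, and `remove_outliers` is a hypothetical helper name.

```python
import pandas as pd

def remove_outliers(df):
    """Keep only rows inside assumed, illustrative bounds."""
    mask = (
        df["fare_amount"].between(1.0, 500.0)             # assumed fare range
        & df["pickup_longitude"].between(-75.0, -72.0)    # assumed NYC box
        & df["dropoff_longitude"].between(-75.0, -72.0)
        & df["pickup_latitude"].between(40.0, 42.0)
        & df["dropoff_latitude"].between(40.0, 42.0)
        & df["passenger_count"].between(1, 6)             # assumed passenger range
    )
    return df[mask]

# Hypothetical rows: one plausible ride plus three kinds of bad record.
sample = pd.DataFrame({
    "fare_amount": [4.5, -5.0, 8.5, 12.5],
    "pickup_longitude": [-73.844315, -73.98, -73.98, -73.97],
    "pickup_latitude": [40.721317, 40.75, 401.083344, 40.76],
    "dropoff_longitude": [-73.841614, -73.99, -73.99, -73.96],
    "dropoff_latitude": [40.712276, 40.76, 40.75, 40.77],
    "passenger_count": [1, 1, 2, 0],
})

cleaned = remove_outliers(sample)
print(len(sample), "->", len(cleaned))
```

If six-passenger rides really should be treated as suspicious, the upper passenger bound would be lowered accordingly; the thresholds are meant to be tuned after looking at the distributions.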

# 3. EDA and Visualization

# 4. Preparing the dataset

We will set aside 20% of the data as a validation set for evaluating our model.

In [10]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

In [9]:
len(train_df), len(val_df)

(40000, 10000)

In [11]:
## Extract inputs and targets

train_df.columns

Index(['fare_amount', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

In [12]:
input_cols = ['pickup_longitude', 'pickup_latitude','dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_cols = 'fare_amount'

In [17]:
train_inputs = train_df[input_cols]
train_targets = train_df[target_cols]
val_inputs = val_df[input_cols]
val_targets = val_df[target_cols]

# 5. Train baseline models

In [29]:
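class regressor():
  """Baseline that always predicts the mean fare seen during training."""

  def fit(self, inputs, targets):
    # Inputs are ignored; only the average of the targets is stored.
    self.mean = targets.mean()

  def predicts(self, inputs):
    # Return one copy of the stored mean per input row.
    return np.full(inputs.shape[0], self.mean)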

In [26]:
np.full([40000], 40000)

array([40000, 40000, 40000, ..., 40000, 40000, 40000])

In [31]:
mean_model = regressor()

In [34]:
mean_model.fit(train_inputs, train_targets)
print(f'Average fare for the taxi is {mean_model.mean}')

Average fare for the taxi is 11.376873970031738
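To judge a mean baseline against the ±$3 expectation stated in the observations, one common choice is root mean squared error (RMSE) on the validation set. The sketch below assumes RMSE as the metric (the notebook has not named one yet) and uses small made-up arrays in place of `train_targets` and `val_targets`; `rmse` is a hypothetical helper.

```python
import numpy as np

def rmse(targets, predictions):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(targets, dtype=float) - np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up fares standing in for the real training and validation targets.
toy_train_targets = np.array([4.5, 16.9, 5.7, 7.7])
toy_val_targets = np.array([5.3, 12.5])

# The baseline predicts the training mean for every validation row.
mean_fare = toy_train_targets.mean()
val_preds = np.full(toy_val_targets.shape[0], mean_fare)

baseline_rmse = rmse(toy_val_targets, val_preds)
print(f"mean fare: {mean_fare:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

With the real data, the same helper could be called on `val_targets` and the output of `mean_model.predicts(val_inputs)`; any later model should beat that number to be worth keeping.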
