### Jupyter Notebook Description: Taxi Trip Duration Prediction

---

#### Dataset Description

This Jupyter notebook analyzes a dataset containing information about taxi trips, aiming to predict the duration of each trip. The dataset includes the following features:

- **id**: A unique identifier for each trip.
- **vendor_id**: A code indicating the provider associated with the trip record.
- **pickup_datetime**: Date and time when the meter was engaged.
- **dropoff_datetime**: Date and time when the meter was disengaged.
- **passenger_count**: Number of passengers in the vehicle (a driver-entered value).
- **pickup_longitude**: Longitude where the meter was engaged.
- **pickup_latitude**: Latitude where the meter was engaged.
- **dropoff_longitude**: Longitude where the meter was disengaged.
- **dropoff_latitude**: Latitude where the meter was disengaged.
- **store_and_fwd_flag**: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server.
  - Y = store and forward
  - N = not a store and forward trip
- **trip_duration**: Duration of the trip in seconds.

#### Objective

The goal of this notebook is to build a predictive model for estimating the trip duration based on the provided features. The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE).
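
For reference, RMSLE over $n$ trips is commonly defined as below, where $y_i$ is the actual duration and $\hat{y}_i$ the predicted duration of trip $i$ (the symbols are introduced here for illustration):

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^{2}}
```

Because the errors are taken on a logarithmic scale, the metric reflects relative rather than absolute differences, and it requires non-negative predictions and targets.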

#### Contents

1. **Data Loading and Exploration**
   - Loading the dataset
   - Exploring the structure of the dataset
   - Checking for missing values and data types
   
2. **Data Preprocessing and Feature Engineering**
   - Handling datetime features (pickup_datetime, dropoff_datetime)
   - Calculating distance between pickup and dropoff points
   - Encoding categorical variables (vendor_id, store_and_fwd_flag)
   - Visualizing distributions and correlations
   
3. **Model Building**
   - Splitting data into training and validation sets
   - Selecting appropriate models for regression
   - Training models and evaluating performance using RMSLE
   
4. **Model Tuning and Optimization**
   - Fine-tuning model parameters using cross-validation
   - Addressing overfitting and underfitting
   
5. **Prediction and Submission**
   - Generating predictions on test dataset
   - Preparing submission file for competition
   
6. **Conclusion**
   - Summary of findings and potential improvements

#### Tools and Libraries

- Python
- Pandas, NumPy for data manipulation
- Matplotlib, Seaborn for data visualization
- Scikit-learn for model building and evaluation
- CatBoost (`CatBoostRegressor`) for gradient-boosted regression
- Optuna for tuning hyperparameters

This notebook serves as a comprehensive guide to understanding the process of predicting taxi trip durations using machine learning techniques, with a focus on achieving optimal performance according to the RMSLE metric.

In [27]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
import optuna
from catboost import CatBoostRegressor

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("../data/New York City Taxi Trip Duration/train.csv", parse_dates=["dropoff_datetime", "pickup_datetime"])

In [3]:
df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


### Preparing the data

In [4]:
df.isna().mean()

id                    0.0
vendor_id             0.0
pickup_datetime       0.0
dropoff_datetime      0.0
passenger_count       0.0
pickup_longitude      0.0
pickup_latitude       0.0
dropoff_longitude     0.0
dropoff_latitude      0.0
store_and_fwd_flag    0.0
trip_duration         0.0
dtype: float64

No column has missing values.

In [5]:
df.dtypes

id                            object
vendor_id                      int64
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
pickup_longitude             float64
pickup_latitude              float64
dropoff_longitude            float64
dropoff_latitude             float64
store_and_fwd_flag            object
trip_duration                  int64
dtype: object

In [6]:
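def fix_times(df):
    """Split both timestamps into day/hour/minute/second columns and drop the originals.

    The DataFrame is modified in place and also returned.
    """
    # Pickup-derived features (known when the trip starts)
    df["DayOfPickup"] = df["pickup_datetime"].dt.day
    df["HourOfPickup"] = df["pickup_datetime"].dt.hour
    df["MinuteOfPickup"] = df["pickup_datetime"].dt.minute
    df["SecondOfPickup"] = df["pickup_datetime"].dt.second

    # Dropoff-derived features. Caution: pickup and dropoff times together
    # determine trip_duration, so these columns leak the target.
    df["DayOfDrop"] = df["dropoff_datetime"].dt.day
    df["HourOfDrop"] = df["dropoff_datetime"].dt.hour
    df["MinuteOfDrop"] = df["dropoff_datetime"].dt.minute
    df["SecondOfDrop"] = df["dropoff_datetime"].dt.second

    df.drop(["pickup_datetime", "dropoff_datetime"], axis=1, inplace=True)

    return df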

In [7]:
df = fix_times(df)

In [8]:
df.head()

Unnamed: 0,id,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,DayOfPickup,HourOfPickup,MinuteOfPickup,SecondOfPickup,DayOfDrop,HourOfDrop,MinuteOfDrop,SecondOfDrop
0,id2875421,2,1,-73.982155,40.767937,-73.96463,40.765602,N,455,14,17,24,55,14,17,32,30
1,id2377394,1,1,-73.980415,40.738564,-73.999481,40.731152,N,663,12,0,43,35,12,0,54,38
2,id3858529,2,1,-73.979027,40.763939,-74.005333,40.710087,N,2124,19,11,35,24,19,12,10,48
3,id3504673,2,1,-74.01004,40.719971,-74.012268,40.706718,N,429,6,19,32,31,6,19,39,40
4,id2181028,2,1,-73.973053,40.793209,-73.972923,40.78252,N,435,26,13,30,55,26,13,38,10
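
One caution about the dropoff-derived columns: by the dataset description, `trip_duration` is the time between meter engagement and disengagement, so the pickup and dropoff timestamps together encode the target. The minimal check below uses only the standard library and the first three rows printed by the first `df.head()`. The helper name `elapsed_seconds` is made up for this sketch.

```python
from datetime import datetime

# (pickup_datetime, dropoff_datetime, trip_duration) copied from the first rows of the training data
rows = [
    ("2016-03-14 17:24:55", "2016-03-14 17:32:30", 455),
    ("2016-06-12 00:43:35", "2016-06-12 00:54:38", 663),
    ("2016-01-19 11:35:24", "2016-01-19 12:10:48", 2124),
]

def elapsed_seconds(pickup: str, dropoff: str) -> int:
    """Hypothetical helper: seconds between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return int(delta.total_seconds())

for pickup, dropoff, duration in rows:
    # Print the recomputed difference next to the recorded trip_duration
    print(elapsed_seconds(pickup, dropoff), duration)
```

For these rows the recomputed difference matches the recorded duration exactly. A model given `DayOfDrop`, `HourOfDrop`, `MinuteOfDrop`, and `SecondOfDrop` can therefore reconstruct most of the target, and its validation score will look better than it would on data where the dropoff time is unknown. If the dropoff time is not available at prediction time, restricting the features to pickup-derived ones is the safer choice.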


In [9]:
import geopy.distance

In [10]:
df["pickup_cords"] = list(zip(df["pickup_latitude"], df["pickup_longitude"]))
df["dropoff_cords"] = list(zip(df["dropoff_latitude"], df["dropoff_longitude"]))

df.drop(["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"], axis=1, inplace=True)

In [11]:
distance = geopy.distance.geodesic(df["pickup_cords"][0], df["dropoff_cords"][0]).km

In [12]:
# Geodesic distance in km for every trip (a Python-level loop over all rows, so this is slow)
distance = [
    geopy.distance.geodesic(pickup, dropoff).km
    for pickup, dropoff in zip(df["pickup_cords"], df["dropoff_cords"])
]
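
Computing the geodesic row by row is exact but slow on a dataset of this size. As a faster alternative, here is a minimal sketch of a vectorized haversine distance with NumPy. It is a swapped-in technique, not what the notebook uses: haversine treats the Earth as a sphere (the 6371 km mean radius is an assumption), so results differ slightly from `geopy`'s ellipsoidal geodesic. The function name `haversine_km` is illustrative.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or equally shaped arrays in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of the first two trips from the training data
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.964630, -73.999481])

# One call handles all rows at once, with no Python-level loop
print(haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon))
```

In this notebook the call would have to run before the latitude and longitude columns are dropped, passing the four coordinate columns directly.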

In [13]:
df.drop(["pickup_cords", "dropoff_cords"], axis=1, inplace=True)

In [14]:
df["distance"] = pd.Series(distance)

In [15]:
df.head()

Unnamed: 0,id,vendor_id,passenger_count,store_and_fwd_flag,trip_duration,DayOfPickup,HourOfPickup,MinuteOfPickup,SecondOfPickup,DayOfDrop,HourOfDrop,MinuteOfDrop,SecondOfDrop,distance
0,id2875421,2,1,N,455,14,17,24,55,14,17,32,30,1.502172
1,id2377394,1,1,N,663,12,0,43,35,12,0,54,38,1.80866
2,id3858529,2,1,N,2124,19,11,35,24,19,12,10,48,6.379687
3,id3504673,2,1,N,429,6,19,32,31,6,19,39,40,1.483632
4,id2181028,2,1,N,435,26,13,30,55,26,13,38,10,1.187038


In [16]:
df.drop("id", axis=1, inplace=True)

In [17]:
df.dtypes

vendor_id               int64
passenger_count         int64
store_and_fwd_flag     object
trip_duration           int64
DayOfPickup             int32
HourOfPickup            int32
MinuteOfPickup          int32
SecondOfPickup          int32
DayOfDrop               int32
HourOfDrop              int32
MinuteOfDrop            int32
SecondOfDrop            int32
distance              float64
dtype: object

In [18]:
df.store_and_fwd_flag.unique()

array(['N', 'Y'], dtype=object)

In [19]:
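# Encode the flag as 0/1. Assign the result back instead of calling
# replace(..., inplace=True) on a column selection, which newer pandas versions warn about.
df["store_and_fwd_flag"] = df["store_and_fwd_flag"].map({"Y": 1, "N": 0})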

In [20]:
df.dtypes

vendor_id               int64
passenger_count         int64
store_and_fwd_flag      int64
trip_duration           int64
DayOfPickup             int32
HourOfPickup            int32
MinuteOfPickup          int32
SecondOfPickup          int32
DayOfDrop               int32
HourOfDrop              int32
MinuteOfDrop            int32
SecondOfDrop            int32
distance              float64
dtype: object

In [21]:
X = df.drop("trip_duration", axis=1)
y = df["trip_duration"]

In [22]:
len(X), len(y)

(1458644, 1458644)

In [23]:
# Fix the seed so the 60/40 split is reproducible across runs
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=42)

In [24]:
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((875186, 12), (583458, 12), (875186,), (583458,))

In [25]:
regr = CatBoostRegressor()

In [26]:
regr.fit(X_train, y_train, verbose=False, plot=True);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [28]:
y_preds = regr.predict(X_valid)

In [30]:
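# squared=False returns the root of the MSLE, i.e. RMSLE, the metric named in the objective
# (scikit-learn 1.4+ also provides root_mean_squared_log_error for this)
mean_squared_log_error(y_valid, y_preds, squared=False)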

ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
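
The `ValueError` indicates negative values in the inputs to the metric. Since trip durations are positive, the likely culprit is `y_preds`: a regressor fitted with a squared-error objective on the raw target is free to predict below zero. Two common remedies are clipping predictions at zero, or fitting on `log1p(trip_duration)` and mapping predictions back with `expm1`, which also brings the training objective closer to RMSLE. The sketch below shows both using NumPy only. The `rmsle` helper and the toy arrays are illustrative and are not output from the notebook's model.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0 first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values (not model output): one prediction is negative, as in the failing cell
y_true = np.array([455.0, 663.0, 2124.0])
y_pred = np.array([500.0, -20.0, 2000.0])
print(rmsle(y_true, y_pred))  # computes without raising, because of the clip

# Log-target round trip: train on log1p(y), then invert predictions with expm1
y_log = np.log1p(y_true)
y_back = np.expm1(y_log)
print(np.allclose(y_back, y_true))

# With the notebook's objects this would look like (untested here):
#   regr.fit(X_train, np.log1p(y_train), verbose=False)
#   y_preds = np.expm1(regr.predict(X_valid))
```

Note that `expm1` can still return values slightly below zero when the model predicts a negative log-duration, so keeping the clip inside the metric is a cheap safeguard.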