### Jupyter Notebook Description: Taxi Trip Duration Prediction

---

#### Dataset Description

This Jupyter notebook analyzes a dataset containing information about taxi trips, aiming to predict the duration of each trip. The dataset includes the following features:

- **id**: A unique identifier for each trip.
- **vendor_id**: A code indicating the provider associated with the trip record.
- **pickup_datetime**: Date and time when the meter was engaged.
- **dropoff_datetime**: Date and time when the meter was disengaged.
- **passenger_count**: Number of passengers in the vehicle (driver-entered value).
- **pickup_longitude**: Longitude where the meter was engaged.
- **pickup_latitude**: Latitude where the meter was engaged.
- **dropoff_longitude**: Longitude where the meter was disengaged.
- **dropoff_latitude**: Latitude where the meter was disengaged.
- **store_and_fwd_flag**: Indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server.
  - Y = store and forward
  - N = not a store and forward trip
- **trip_duration**: Duration of the trip in seconds.

#### Objective

The goal of this notebook is to build a predictive model for estimating the trip duration based on the provided features. The evaluation metric for this competition is Root Mean Squared Logarithmic Error (RMSLE).
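As a reference for the metric, here is a minimal sketch of RMSLE written with NumPy. The function name `rmsle` and the sample predictions are illustrative assumptions, not part of the original notebook; the metric itself is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE expects non-negative values")
    # log1p(x) computes log(1 + x), which stays finite when x == 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Durations (in seconds) taken from the first rows of the training data
actual = [455, 663, 2124]
perfect = rmsle(actual, actual)            # identical predictions give zero error
doubled = rmsle(actual, [910, 1326, 4248])  # hypothetical predictions, twice too long
```

Because the error is measured on a log scale, a prediction that is off by a constant factor contributes roughly the same penalty for short and long trips. One consequence is that training a regressor on `np.log1p(trip_duration)` with ordinary RMSE optimizes this metric directly.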

#### Contents

1. **Data Loading and Exploration**
   - Loading the dataset
   - Exploring the structure of the dataset
   - Checking for missing values and data types
   
2. **Data Preprocessing and Feature Engineering**
   - Handling datetime features (pickup_datetime, dropoff_datetime)
   - Calculating distance between pickup and dropoff points
   - Encoding categorical variables (vendor_id, store_and_fwd_flag)
   - Visualizing distributions and correlations
   
3. **Model Building**
   - Splitting data into training and validation sets
   - Selecting appropriate models for regression
   - Training models and evaluating performance using RMSLE
   
4. **Model Tuning and Optimization**
   - Fine-tuning model parameters using cross-validation
   - Addressing overfitting and underfitting
   
5. **Prediction and Submission**
   - Generating predictions on test dataset
   - Preparing submission file for competition
   
6. **Conclusion**
   - Summary of findings and potential improvements

#### Tools and Libraries

- Python
- Pandas, NumPy for data manipulation
- Matplotlib, Seaborn for data visualization
- Scikit-learn for model building and evaluation
- CatBoost (`CatBoostRegressor`) for model building
- Optuna for tuning hyperparameters

This notebook walks through the process of predicting taxi trip durations with machine learning, with the aim of minimizing RMSLE.

In [21]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
df = pd.read_csv("../data/New York City Taxi Trip Duration/train.csv", parse_dates=["dropoff_datetime", "pickup_datetime"])

In [23]:
df.head()

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | N | 435 |


### Let me fix the data

In [24]:
df.isna().mean()

id                    0.0
vendor_id             0.0
pickup_datetime       0.0
dropoff_datetime      0.0
passenger_count       0.0
pickup_longitude      0.0
pickup_latitude       0.0
dropoff_longitude     0.0
dropoff_latitude      0.0
store_and_fwd_flag    0.0
trip_duration         0.0
dtype: float64

No column has missing values.

In [25]:
df.dtypes

id                            object
vendor_id                      int64
pickup_datetime       datetime64[ns]
dropoff_datetime      datetime64[ns]
passenger_count                int64
pickup_longitude             float64
pickup_latitude              float64
dropoff_longitude            float64
dropoff_latitude             float64
store_and_fwd_flag            object
trip_duration                  int64
dtype: object
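The dtypes above show `store_and_fwd_flag` stored as `object` and `vendor_id` as an integer code. A minimal sketch of one way to encode them follows, assuming the flag only takes the documented values `Y` and `N`. The toy frame `sample` is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical) with the two columns of interest
sample = pd.DataFrame(
    {
        "vendor_id": [2, 1, 2],
        "store_and_fwd_flag": ["N", "Y", "N"],
    }
)

encoded = sample.copy()

# Map the Y/N flag to 1/0. Any unexpected value would become NaN and make
# the int8 cast fail, which surfaces bad data early.
encoded["store_and_fwd_flag"] = (
    encoded["store_and_fwd_flag"].map({"N": 0, "Y": 1}).astype("int8")
)

# vendor_id is a provider code, not a quantity, so mark it as categorical
encoded["vendor_id"] = encoded["vendor_id"].astype("category")
```

Whether `vendor_id` should be one-hot encoded, left as an integer, or passed as a categorical feature depends on the model chosen later, so this is only one reasonable option.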

In [26]:
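def fix_times(df):
    """Split the pickup and drop-off timestamps into day, hour, minute, and second columns.

    Modifies `df` in place (the original datetime columns are dropped) and returns it.
    """
    # Pickup components
    df["DayOfPickup"] = df["pickup_datetime"].dt.day
    df["HourOfPickup"] = df["pickup_datetime"].dt.hour
    df["MinuteOfPickup"] = df["pickup_datetime"].dt.minute
    df["SecondOfPickup"] = df["pickup_datetime"].dt.second

    # Drop-off components. Caution: these are only known once the trip has ended,
    # so together with the pickup time they reveal the target (trip_duration).
    df["DayOfDrop"] = df["dropoff_datetime"].dt.day
    df["HourOfDrop"] = df["dropoff_datetime"].dt.hour
    df["MinuteOfDrop"] = df["dropoff_datetime"].dt.minute
    df["SecondOfDrop"] = df["dropoff_datetime"].dt.second

    # The raw timestamps are no longer needed once split into components
    df.drop(["pickup_datetime", "dropoff_datetime"], axis=1, inplace=True)

    return df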

In [27]:
df = fix_times(df)

In [28]:
df.head()

| | id | vendor_id | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration | DayOfPickup | HourOfPickup | MinuteOfPickup | SecondOfPickup | DayOfDrop | HourOfDrop | MinuteOfDrop | SecondOfDrop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | N | 455 | 14 | 17 | 24 | 55 | 14 | 17 | 32 | 30 |
| 1 | id2377394 | 1 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 | 12 | 0 | 43 | 35 | 12 | 0 | 54 | 38 |
| 2 | id3858529 | 2 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 | 19 | 11 | 35 | 24 | 19 | 12 | 10 | 48 |
| 3 | id3504673 | 2 | 1 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | N | 429 | 6 | 19 | 32 | 31 | 6 | 19 | 39 | 40 |
| 4 | id2181028 | 2 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | N | 435 | 26 | 13 | 30 | 55 | 26 | 13 | 38 | 10 |
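The time features above include drop-off components, which are not known when a trip starts. As an alternative, here is a hedged sketch of a variant that uses only the pickup timestamp and leaves its input untouched. The function name `add_pickup_time_features` and the extra month and weekday columns are assumptions for illustration, not the notebook's own method; the function would be applied to the frame before its datetime columns are dropped.

```python
import pandas as pd


def add_pickup_time_features(frame):
    """Return a copy of `frame` with pickup-time components as new columns."""
    out = frame.copy()  # avoid mutating the caller's DataFrame
    pickup = pd.to_datetime(out["pickup_datetime"])
    out["MonthOfPickup"] = pickup.dt.month
    out["DayOfPickup"] = pickup.dt.day
    out["WeekdayOfPickup"] = pickup.dt.dayofweek  # Monday == 0, Sunday == 6
    out["HourOfPickup"] = pickup.dt.hour
    return out.drop(columns=["pickup_datetime"])


# Two pickup timestamps taken from the first rows of the training data
sample = pd.DataFrame(
    {"pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"]}
)
features = add_pickup_time_features(sample)
```

Weekday and hour tend to capture traffic patterns more directly than minute or second, although which components actually help is something to confirm on a validation set.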


In [29]:
df.drop("id", axis=1, inplace=True)

Let me calculate the distance travelled
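One dependency-free way to approximate the distance is the haversine (great-circle) formula, sketched below with NumPy. This is an illustrative alternative to a library such as geopy, not necessarily what the notebook uses; the function name `haversine_km` and the mean Earth radius constant are assumptions. The result is the straight-line distance over the Earth's surface, which is a lower bound on the distance actually driven.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical Earth)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; accepts scalars, arrays, or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates of the first trip in the training data
first_trip_km = haversine_km(40.767937, -73.982155, 40.765602, -73.96463)

# The same call works column-wise on arrays of coordinates
batch_km = haversine_km(
    np.array([40.767937, 40.738564]),
    np.array([-73.982155, -73.980415]),
    np.array([40.765602, 40.731152]),
    np.array([-73.96463, -73.999481]),
)
```

Because the function is vectorized, it can be applied to whole DataFrame columns at once, for example `haversine_km(df["pickup_latitude"], df["pickup_longitude"], df["dropoff_latitude"], df["dropoff_longitude"])`.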

In [11]:
!pip install geopy

Collecting geopy
  Downloading geopy-2.4.1-py3-none-any.whl.metadata (6.8 kB)
Collecting geographiclib<3,>=1.52 (from geopy)
  Downloading geographiclib-2.0-py3-none-any.whl.metadata (1.4 kB)
Downloading geopy-2.4.1-py3-none-any.whl (125 kB)