# Model Comparison Across Stages

First Workflow (for the training data, because it is large)
1. **checking inaccuracy/error data**
2. removing duplicates
3. converting datatypes
4. creating new feature distance_osrm
5. creating new feature NYC_center

Second Workflow (for retraining or testing, where there is less data)
1. creating new feature distance_osrm
2. creating new feature NYC_center
3. **checking inaccuracy/error data**
4. removing duplicates
5. converting datatypes
6. date time breakdown
7. missing value imputation + speed derivation
8. outlier handling 1
9. outlier handling 2
10. pickup/dropoff clustering
11. skewness transformation
12. correlation matrix
13. VIF (variance inflation factor)
14. log_transform
15. feature encoding
16. feature scaling
17. feature selection
18. estimator

### Stage 1: Baseline Model

In [1]:
from playground import *

In [2]:
# Baseline model: load the raw training data
df = pd.read_csv("dataset/train.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


In [4]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [6]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.09 s
Wall time: 5.72 s
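The source of `RemoveDuplicates` is not shown in this notebook. As a rough sketch of what such a step might do, the hypothetical helper `remove_duplicates` below drops repeated feature rows and uses the same mask on the target so that `X` and `y` stay aligned. The column names in the demo frame are made up for illustration.

```python
import pandas as pd

def remove_duplicates(X: pd.DataFrame, y: pd.Series):
    """Drop duplicated feature rows and keep the target aligned by index."""
    keep = ~X.duplicated()  # True for the first occurrence of each row
    return X.loc[keep], y.loc[keep]

# Tiny illustrative frame; rows 10 and 11 are identical.
X_demo = pd.DataFrame({"a": [1, 1, 2], "b": [0.5, 0.5, 0.7]}, index=[10, 11, 12])
y_demo = pd.Series([100, 100, 250], index=[10, 11, 12])

X_clean, y_clean = remove_duplicates(X_demo, y_demo)
print(len(X_demo) - len(X_clean), "duplicate row(s) removed")
```

Filtering both objects with one boolean mask avoids the index mismatch that would appear if only `X` were deduplicated.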


In [7]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 6.69 s
Wall time: 7.43 s
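The log above reports three conversions: `vendor_id` to `object`, and both timestamp columns to `datetime64[ns]`. A minimal sketch of an equivalent conversion with plain pandas follows. The function name `to_data_types` is an assumption, and unlike the notebook's `ToDataTypes.transform` it takes only `X`.

```python
import pandas as pd

def to_data_types(X: pd.DataFrame) -> pd.DataFrame:
    """Cast vendor_id to object and parse the two timestamp columns."""
    X = X.copy()  # leave the caller's frame untouched
    X["vendor_id"] = X["vendor_id"].astype(object)
    for col in ("pickup_datetime", "dropoff_datetime"):
        X[col] = pd.to_datetime(X[col])
    return X

# Two illustrative rows in the same layout as the raw data.
raw = pd.DataFrame({
    "vendor_id": [1, 2],
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
})
converted = to_data_types(raw)
print(converted.dtypes)
```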


In [8]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 6.09 s
Wall time: 6.48 s
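The four extracted features named in the log can be derived with the pandas `.dt` accessor. The sketch below is an assumption about how `DateTimeBreak` might work, not its actual code. It expects the timestamp columns to already be of datetime type.

```python
import pandas as pd

def datetime_break(X: pd.DataFrame) -> pd.DataFrame:
    """Split pickup_datetime into calendar features and drop both timestamps."""
    X = X.copy()
    pickup = X["pickup_datetime"]
    X["pickup_month"] = pickup.dt.month
    X["pickup_day"] = pickup.dt.day
    X["pickup_day_of_week"] = pickup.dt.dayofweek  # Monday is 0
    X["pickup_hour"] = pickup.dt.hour
    return X.drop(columns=["pickup_datetime", "dropoff_datetime"])

demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55"]),
    "dropoff_datetime": pd.to_datetime(["2016-03-14 17:32:30"]),
})
features = datetime_break(demo)
print(features)
```

Dropping `dropoff_datetime` matters here because, together with the pickup time, it would reveal the trip duration directly.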


In [9]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co
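The log mentions two kinds of encoding: dummy encoding for the categorical columns and cyclical encoding for the calendar features. A minimal sketch of both follows. The helper `cyclical_encode` and the period of 24 for hours are assumptions, since the periods used by `FeatureEncoding` are not shown. `drop_first=True` is chosen because the resulting column names (`vendor_id_2`, `store_and_fwd_flag_Y`) match the ones in the `info()` output.

```python
import numpy as np
import pandas as pd

def cyclical_encode(values: pd.Series, period: int):
    """Map a periodic integer feature onto the unit circle."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

demo = pd.DataFrame({
    "vendor_id": pd.Series([1, 2, 2], dtype=object),
    "store_and_fwd_flag": ["N", "Y", "N"],
    "pickup_hour": [0, 6, 23],
})

# Cyclical encoding: hour 23 ends up next to hour 0 instead of far away.
demo["pickup_hour_sin"], demo["pickup_hour_cos"] = cyclical_encode(
    demo["pickup_hour"], period=24
)

# Dummy encoding: one indicator column per category, first level dropped.
encoded = pd.get_dummies(
    demo.drop(columns="pickup_hour"),
    columns=["vendor_id", "store_and_fwd_flag"],
    drop_first=True,
)
print(encoded.columns.tolist())
```

The sine and cosine pair is needed together: either one alone maps two different hours to the same value.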

In [10]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1166915 entries, 1053743 to 121958
Data columns (total 15 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1166915 non-null  int64  
 1   pickup_longitude        1166915 non-null  float64
 2   pickup_latitude         1166915 non-null  float64
 3   dropoff_longitude       1166915 non-null  float64
 4   dropoff_latitude        1166915 non-null  float64
 5   vendor_id_2             1166915 non-null  uint8  
 6   store_and_fwd_flag_Y    1166915 non-null  uint8  
 7   pickup_month_sin        1166915 non-null  float64
 8   pickup_month_cos        1166915 non-null  float64
 9   pickup_day_sin          1166915 non-null  float64
 10  pickup_day_cos          1166915 non-null  float64
 11  pickup_day_of_week_sin  1166915 non-null  float64
 12  pickup_day_of_week_cos  1166915 non-null  float64
 13  pickup_hour_sin         1166915 non-null  float64
 1

In [11]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291729 entries, 67250 to 589044
Data columns (total 15 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291729 non-null  int64  
 1   pickup_longitude        291729 non-null  float64
 2   pickup_latitude         291729 non-null  float64
 3   dropoff_longitude       291729 non-null  float64
 4   dropoff_latitude        291729 non-null  float64
 5   vendor_id_2             291729 non-null  uint8  
 6   store_and_fwd_flag_Y    291729 non-null  uint8  
 7   pickup_month_sin        291729 non-null  float64
 8   pickup_month_cos        291729 non-null  float64
 9   pickup_day_sin          291729 non-null  float64
 10  pickup_day_cos          291729 non-null  float64
 11  pickup_day_of_week_sin  291729 non-null  float64
 12  pickup_day_of_week_cos  291729 non-null  float64
 13  pickup_hour_sin         291729 non-null  float64
 14  pickup_hour_cos 

In [12]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [13]:
stage_1 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.8646, Fit time: 1.6810 seconds
DTR - RMSLE: 0.6242, Fit time: 52.5158 seconds
XGBR - RMSLE: 0.7075, Fit time: 10.8571 seconds


In [14]:
print(stage_1)

LR - RMSLE: 0.8646, Fit time: 1.6810 seconds
DTR - RMSLE: 0.6242, Fit time: 52.5158 seconds
XGBR - RMSLE: 0.7075, Fit time: 10.8571 seconds
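`evaluate_models` comes from the `playground` module and its code is not shown. The sketch below is one way such a helper could be written, assuming it fits each model, times the fit, scores the test predictions with RMSLE, and returns the report as a string. Clipping negative predictions to zero before taking the logarithm is an assumption made here so that the metric stays defined. `MeanRegressor` is a stand-in so the example runs without scikit-learn; any estimator with `fit` and `predict` would work in its place.

```python
import time
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    y_pred = np.clip(y_pred, 0, None)  # assumption: negative predictions become 0
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def evaluate_models(models, X_train, y_train, X_test, y_test):
    """Fit each (name, model) pair and report RMSLE and fit time."""
    lines = []
    for name, model in models:
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_time = time.perf_counter() - start
        score = rmsle(y_test, model.predict(X_test))
        lines.append(f"{name} - RMSLE: {score:.4f}, Fit time: {fit_time:.4f} seconds")
    return "\n".join(lines)

class MeanRegressor:
    """Toy estimator that always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

X_tr = np.arange(6).reshape(-1, 1)
y_tr = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0])
X_te = np.array([[6], [7]])
y_te = np.array([350.0, 350.0])

report = evaluate_models([("MEAN", MeanRegressor())], X_tr, y_tr, X_te, y_te)
print(report)
```

Because RMSLE compares logarithms, it penalises relative rather than absolute errors, which suits a target such as trip duration that spans several orders of magnitude.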


### Stage 2: Adding distance_osrm Feature

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
# Drop rows with a missing distance_osrm, because some models cannot handle missing values
df.dropna(subset=['distance_osrm'], inplace=True)

In [3]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [5]:
cols_to_drop = ['pickup_dist_NYC_center', 'dropoff_dist_NYC_center']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

In [6]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 4.94 s
Wall time: 5.13 s


In [7]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 6.06 s
Wall time: 6.59 s


In [8]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 6.39 s
Wall time: 8.03 s


In [9]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [10]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1166901 entries, 865804 to 121962
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1166901 non-null  int64  
 1   pickup_longitude        1166901 non-null  float64
 2   pickup_latitude         1166901 non-null  float64
 3   dropoff_longitude       1166901 non-null  float64
 4   dropoff_latitude        1166901 non-null  float64
 5   distance_osrm           1166901 non-null  float64
 6   vendor_id_2             1166901 non-null  uint8  
 7   store_and_fwd_flag_Y    1166901 non-null  uint8  
 8   pickup_month_sin        1166901 non-null  float64
 9   pickup_month_cos        1166901 non-null  float64
 10  pickup_day_sin          1166901 non-null  float64
 11  pickup_day_cos          1166901 non-null  float64
 12  pickup_day_of_week_sin  1166901 non-null  float64
 13  pickup_day_of_week_cos  1166901 non-null  float64
 14

In [11]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291726 entries, 64778 to 1119119
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291726 non-null  int64  
 1   pickup_longitude        291726 non-null  float64
 2   pickup_latitude         291726 non-null  float64
 3   dropoff_longitude       291726 non-null  float64
 4   dropoff_latitude        291726 non-null  float64
 5   distance_osrm           291726 non-null  float64
 6   vendor_id_2             291726 non-null  uint8  
 7   store_and_fwd_flag_Y    291726 non-null  uint8  
 8   pickup_month_sin        291726 non-null  float64
 9   pickup_month_cos        291726 non-null  float64
 10  pickup_day_sin          291726 non-null  float64
 11  pickup_day_cos          291726 non-null  float64
 12  pickup_day_of_week_sin  291726 non-null  float64
 13  pickup_day_of_week_cos  291726 non-null  float64
 14  pickup_hour_sin

In [12]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [13]:
stage_2 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6594, Fit time: 1.6992 seconds
DTR - RMSLE: 0.5957, Fit time: 51.1731 seconds
XGBR - RMSLE: 0.6944, Fit time: 11.4793 seconds


### Stage 3: Missing Value Imputation

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
cols_to_drop = ['pickup_dist_NYC_center', 'dropoff_dist_NYC_center']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

In [5]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 4.64 s
Wall time: 4.69 s


In [6]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 6.5 s
Wall time: 7.52 s


In [7]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.61 s
Wall time: 5.88 s


In [8]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728
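According to the log, this step drops the rows whose `distance_osrm` is missing rather than filling them in. A minimal sketch of that behaviour is shown below. The function name `drop_missing` and its message format are assumptions, not the `MissValInput` implementation.

```python
import numpy as np
import pandas as pd

def drop_missing(X: pd.DataFrame, y: pd.Series, column: str = "distance_osrm"):
    """Remove rows where `column` is NaN, keeping X and y aligned."""
    mask = X[column].notna()
    removed = int((~mask).sum())
    print(f"Dropped {removed} of {len(X)} rows with missing '{column}'.")
    return X.loc[mask], y.loc[mask]

X_demo = pd.DataFrame({"distance_osrm": [1.2, np.nan, 3.4]}, index=[5, 6, 7])
y_demo = pd.Series([300, 420, 900], index=[5, 6, 7])

X_kept, y_kept = drop_missing(X_demo, y_demo)
```

With only 17 of 1,458,644 rows affected, dropping is a reasonable choice here; a fitted imputer would matter more if the missing share were larger.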


In [9]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [10]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1166899 entries, 1053743 to 121958
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1166899 non-null  int64  
 1   pickup_longitude        1166899 non-null  float64
 2   pickup_latitude         1166899 non-null  float64
 3   dropoff_longitude       1166899 non-null  float64
 4   dropoff_latitude        1166899 non-null  float64
 5   distance_osrm           1166899 non-null  float64
 6   vendor_id_2             1166899 non-null  uint8  
 7   store_and_fwd_flag_Y    1166899 non-null  uint8  
 8   pickup_month_sin        1166899 non-null  float64
 9   pickup_month_cos        1166899 non-null  float64
 10  pickup_day_sin          1166899 non-null  float64
 11  pickup_day_cos          1166899 non-null  float64
 12  pickup_day_of_week_sin  1166899 non-null  float64
 13  pickup_day_of_week_cos  1166899 non-null  float64
 1

In [11]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   pickup_longitude        291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   dropoff_longitude       291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   distance_osrm           291728 non-null  float64
 6   vendor_id_2             291728 non-null  uint8  
 7   store_and_fwd_flag_Y    291728 non-null  uint8  
 8   pickup_month_sin        291728 non-null  float64
 9   pickup_month_cos        291728 non-null  float64
 10  pickup_day_sin          291728 non-null  float64
 11  pickup_day_cos          291728 non-null  float64
 12  pickup_day_of_week_sin  291728 non-null  float64
 13  pickup_day_of_week_cos  291728 non-null  float64
 14  pickup_hour_sin 

In [12]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [13]:
stage_3 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6560, Fit time: 1.3997 seconds
DTR - RMSLE: 0.5937, Fit time: 46.4586 seconds
XGBR - RMSLE: 0.7084, Fit time: 12.4147 seconds


### Stage 4: Adding speed_osrm Feature

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
cols_to_drop = ['pickup_dist_NYC_center', 'dropoff_dist_NYC_center']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

In [5]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 4.75 s
Wall time: 4.93 s


In [6]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.27 s
Wall time: 5.44 s


In [7]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 6.31 s
Wall time: 6.74 s


In [8]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [9]:
%%time
speed_deriv = SpeedDeriv()
X_train, y_train = speed_deriv.transform(X_train, y_train)
X_test, y_test = speed_deriv.transform(X_test, y_test)

>>>> Starting speed derivation...
Feature 'speed_osrm' has been created.
>>>> Starting speed derivation...
Feature 'speed_osrm' has been created.
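The formula behind `speed_osrm` is not shown. Since `SpeedDeriv.transform` receives both `X` and `y`, a plausible reading is distance divided by trip duration. The sketch below assumes `distance_osrm` is in kilometres and `trip_duration` in seconds; both units and the helper name `derive_speed` are assumptions. Under that reading the feature is computed from the target itself, so it would not be available for trips whose duration is still unknown, and that may account for the sharp drop in RMSLE reported for this stage.

```python
import pandas as pd

def derive_speed(X: pd.DataFrame, y: pd.Series):
    """Add speed_osrm = distance / duration (assumed km and seconds)."""
    X = X.copy()
    hours = y / 3600.0  # seconds to hours
    X["speed_osrm"] = X["distance_osrm"] / hours
    return X, y

X_demo = pd.DataFrame({"distance_osrm": [10.0, 3.0]})
y_demo = pd.Series([1800, 600])  # trip durations in seconds

X_speed, _ = derive_speed(X_demo, y_demo)
print(X_speed)
```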


In [10]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1166899 entries, 1053743 to 121958
Data columns (total 17 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1166899 non-null  int64  
 1   pickup_longitude        1166899 non-null  float64
 2   pickup_latitude         1166899 non-null  float64
 3   dropoff_longitude       1166899 non-null  float64
 4   dropoff_latitude        1166899 non-null  float64
 5   distance_osrm           1166899 non-null  float64
 6   speed_osrm              1166899 non-null  float64
 7   vendor_id_2             1166899 non-null  uint8  
 8   store_and_fwd_flag_Y    1166899 non-null  uint8  
 9   pickup_month_sin        1166899 non-null  float64
 10  pickup_month_cos        1166899 non-null  float64
 11  pickup_day_sin          1166899 non-null  float64
 12  pickup_day_cos          1166899 non-null  float64
 13  pickup_day_of_week_sin  1166899 non-null  float64
 1

In [12]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 17 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   pickup_longitude        291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   dropoff_longitude       291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   distance_osrm           291728 non-null  float64
 6   speed_osrm              291728 non-null  float64
 7   vendor_id_2             291728 non-null  uint8  
 8   store_and_fwd_flag_Y    291728 non-null  uint8  
 9   pickup_month_sin        291728 non-null  float64
 10  pickup_month_cos        291728 non-null  float64
 11  pickup_day_sin          291728 non-null  float64
 12  pickup_day_cos          291728 non-null  float64
 13  pickup_day_of_week_sin  291728 non-null  float64
 14  pickup_day_of_we

In [13]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [14]:
stage_4 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6235, Fit time: 1.8488 seconds
DTR - RMSLE: 0.1430, Fit time: 48.4794 seconds
XGBR - RMSLE: 0.2661, Fit time: 14.2972 seconds


### Stage 5: Applying Sensible Restrictions to the Features

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
cols_to_drop = ['pickup_dist_NYC_center', 'dropoff_dist_NYC_center']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

In [5]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.38 s
Wall time: 5.74 s


In [6]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.5 s
Wall time: 5.75 s


In [7]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 6.08 s
Wall time: 6.43 s


In [8]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [9]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 2.02 s
Wall time: 2.26 s
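
The restriction step filters rows by value ranges and is applied to the training split only. The sketch below illustrates the idea; the bounds are assumptions chosen only to be consistent with the logged `[min, max]` ranges, and `restrict` is a hypothetical helper rather than the `FeatureRestriction` source:

```python
import pandas as pd

# Assumed bounds, picked to agree with the ranges printed in the log above.
FEATURE_BOUNDS = {"distance_osrm": (0.1, 100.0), "passenger_count": (1, 8)}
TARGET_BOUNDS = (60, 86_400)  # seconds: one minute to one day


def restrict(X: pd.DataFrame, y: pd.Series):
    """Keep only rows whose target and features fall inside the bounds."""
    mask = y.between(*TARGET_BOUNDS)
    for col, (low, high) in FEATURE_BOUNDS.items():
        if col in X.columns:  # skip columns that are absent, as the log does
            mask &= X[col].between(low, high)
    return X.loc[mask], y.loc[mask]


X = pd.DataFrame({
    "distance_osrm": [0.0, 2.5, 5.0, 3.0],
    "passenger_count": [1, 0, 2, 1],
})
y = pd.Series([500, 500, 30, 900])
X_kept, y_kept = restrict(X, y)
```

Only the last row survives here: the first fails the distance bound, the second the passenger bound, and the third the duration bound.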


In [10]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co
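
Cyclical encoding maps a periodic value onto the unit circle so that, for example, hour 23 ends up close to hour 0. A minimal sketch, assuming the usual sine/cosine formulation (`cyclical_encode` and its `period` argument are hypothetical, not taken from `FeatureEncoding`):

```python
import numpy as np
import pandas as pd


def cyclical_encode(df: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Replace `column` with its sine and cosine components."""
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out.drop(columns=column)


hours = pd.DataFrame({"pickup_hour": [0, 6, 23]})
encoded = cyclical_encode(hours, "pickup_hour", period=24)
```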

In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155267 entries, 1053743 to 121958
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1155267 non-null  int64  
 1   pickup_longitude        1155267 non-null  float64
 2   pickup_latitude         1155267 non-null  float64
 3   dropoff_longitude       1155267 non-null  float64
 4   dropoff_latitude        1155267 non-null  float64
 5   distance_osrm           1155267 non-null  float64
 6   vendor_id_2             1155267 non-null  uint8  
 7   store_and_fwd_flag_Y    1155267 non-null  uint8  
 8   pickup_month_sin        1155267 non-null  float64
 9   pickup_month_cos        1155267 non-null  float64
 10  pickup_day_sin          1155267 non-null  float64
 11  pickup_day_cos          1155267 non-null  float64
 12  pickup_day_of_week_sin  1155267 non-null  float64
 13  pickup_day_of_week_cos  1155267 non-null  float64
 1

In [12]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   pickup_longitude        291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   dropoff_longitude       291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   distance_osrm           291728 non-null  float64
 6   vendor_id_2             291728 non-null  uint8  
 7   store_and_fwd_flag_Y    291728 non-null  uint8  
 8   pickup_month_sin        291728 non-null  float64
 9   pickup_month_cos        291728 non-null  float64
 10  pickup_day_sin          291728 non-null  float64
 11  pickup_day_cos          291728 non-null  float64
 12  pickup_day_of_week_sin  291728 non-null  float64
 13  pickup_day_of_week_cos  291728 non-null  float64
 14  pickup_hour_sin 

In [13]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [14]:
stage_5 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6394, Fit time: 1.5391 seconds
DTR - RMSLE: 0.6078, Fit time: 47.7388 seconds
XGBR - RMSLE: 0.5942, Fit time: 11.0469 seconds
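
`evaluate_models` is defined in `playground` and not shown. Assuming it reports the standard root mean squared logarithmic error, the metric can be sketched as follows; clipping negative predictions to zero is an assumption made here so that `log1p` stays defined for models such as linear regression:

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: negative predictions are clipped to 0 before taking the log.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


perfect = rmsle([10, 100], [10, 100])
one_log_unit = rmsle([0], [np.e - 1])
```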


### Stage 6: Restricting the boundary to the area around NYC

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.52 s
Wall time: 5.77 s
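
No duplicates were found in this split. For reference, a duplicate check that keeps `X` and `y` aligned could be sketched as below, assuming duplicates are judged on the feature rows alone (`remove_duplicates` is a hypothetical helper, not the `RemoveDuplicates` source):

```python
import pandas as pd


def remove_duplicates(X: pd.DataFrame, y: pd.Series):
    """Drop repeated feature rows, keeping the first occurrence and its target."""
    keep = ~X.duplicated()
    return X.loc[keep], y.loc[keep]


X = pd.DataFrame({"id": ["a", "b", "a"], "passenger_count": [1, 2, 1]})
y = pd.Series([300, 450, 300])
X_unique, y_unique = remove_duplicates(X, y)
```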


In [5]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.2 s
Wall time: 5.34 s


In [6]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.44 s
Wall time: 5.61 s


In [7]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [8]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 2.27 s
Wall time: 3.15 s


In [9]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.8 s
Wall time: 2.84 s
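
The `pickup_dist_NYC_center` and `dropoff_dist_NYC_center` columns suggest that the boundary check is based on distance from a reference point. A sketch of such a check using the haversine formula is shown below; the centre coordinates, the radius, and both function names are assumptions for illustration, not values taken from `OutlierMapper`:

```python
import numpy as np

NYC_CENTER = (40.7128, -74.0060)  # assumed reference point (latitude, longitude)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def inside_radius(lat, lon, max_km):
    """Boolean mask: True where the point lies within max_km of NYC_CENTER."""
    return haversine_km(lat, lon, *NYC_CENTER) <= max_km


lats = np.array([40.6413, 37.7749])    # JFK airport, San Francisco
lons = np.array([-73.7781, -122.4194])
mask = inside_radius(lats, lons, max_km=50)  # 50 km is a hypothetical radius
```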


In [10]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155196 entries, 1053743 to 121958
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1155196 non-null  int64  
 1   pickup_longitude        1155196 non-null  float64
 2   pickup_latitude         1155196 non-null  float64
 3   dropoff_longitude       1155196 non-null  float64
 4   dropoff_latitude        1155196 non-null  float64
 5   distance_osrm           1155196 non-null  float64
 6   vendor_id_2             1155196 non-null  uint8  
 7   store_and_fwd_flag_Y    1155196 non-null  uint8  
 8   pickup_month_sin        1155196 non-null  float64
 9   pickup_month_cos        1155196 non-null  float64
 10  pickup_day_sin          1155196 non-null  float64
 11  pickup_day_cos          1155196 non-null  float64
 12  pickup_day_of_week_sin  1155196 non-null  float64
 13  pickup_day_of_week_cos  1155196 non-null  float64
 1

In [12]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   pickup_longitude        291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   dropoff_longitude       291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   distance_osrm           291728 non-null  float64
 6   vendor_id_2             291728 non-null  uint8  
 7   store_and_fwd_flag_Y    291728 non-null  uint8  
 8   pickup_month_sin        291728 non-null  float64
 9   pickup_month_cos        291728 non-null  float64
 10  pickup_day_sin          291728 non-null  float64
 11  pickup_day_cos          291728 non-null  float64
 12  pickup_day_of_week_sin  291728 non-null  float64
 13  pickup_day_of_week_cos  291728 non-null  float64
 14  pickup_hour_sin 

In [13]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [14]:
stage_6 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6386, Fit time: 1.4434 seconds
DTR - RMSLE: 0.6115, Fit time: 50.3154 seconds
XGBR - RMSLE: 0.5948, Fit time: 14.8784 seconds


### Stage 7: Replacing pickup_longitude, pickup_latitude with pickup_cluster and dropoff_longitude, dropoff_latitude with dropoff_cluster

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.44 s
Wall time: 5.58 s


In [5]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.36 s
Wall time: 5.41 s


In [6]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.39 s
Wall time: 5.39 s


In [7]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [8]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 1.88 s
Wall time: 1.89 s


In [9]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.44 s
Wall time: 2.44 s


In [10]:
%%time
pickdrop_cluster = PickDropCluster(optimal_k_pickup=4, optimal_k_dropoff=4)
X_train, y_train = pickdrop_cluster.fit_transform(X_train, y_train)
X_test, y_test = pickdrop_cluster.transform(X_test, y_test)

>>>> Starting to create cluster features ...
New features 'pickup_cluster' and 'dropoff_cluster' have been created.
Dropped the columns 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', and 'dropoff_latitude'
>>>> Starting to create cluster features ...
New features 'pickup_cluster' and 'dropoff_cluster' have been created.
Dropped the columns 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', and 'dropoff_latitude'
CPU times: total: 50.9 s
Wall time: 23.1 s
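
The `fit_transform` / `transform` pair above implies that cluster centres are learned on the training split and reused for the test split. A minimal sketch of that pattern, assuming scikit-learn's `KMeans` is the underlying algorithm (this is not confirmed by the source, and `fit_clusters` is a hypothetical helper):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def fit_clusters(coords: pd.DataFrame, k: int) -> KMeans:
    """Fit k-means on coordinate pairs from the training split only."""
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit(coords.to_numpy())


rng = np.random.default_rng(0)
train = pd.DataFrame({
    "pickup_longitude": rng.normal(-73.98, 0.03, 200),
    "pickup_latitude": rng.normal(40.75, 0.03, 200),
})
test = pd.DataFrame({
    "pickup_longitude": rng.normal(-73.98, 0.03, 50),
    "pickup_latitude": rng.normal(40.75, 0.03, 50),
})

model = fit_clusters(train, k=4)
train_labels = model.predict(train.to_numpy())
test_labels = model.predict(test.to_numpy())  # reuse the training centres
```

With only four clusters per endpoint, much of the spatial detail in the raw coordinates is lost, which may be why the tree-based models score worse in this stage than in Stage 6.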


In [11]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing dummy encoding on 'pickup_cluster' column...
Performing dummy encoding on 'dropoff_cluster' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd
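
The resulting column names (`vendor_id_2`, `pickup_cluster_1` to `pickup_cluster_3`) are consistent with dummy encoding that drops the first level. Because the encoder is applied to the train and test frames separately, a level missing from one split could produce mismatched columns. One way to guard against that, sketched under the assumption that `pandas.get_dummies` is used, is to fix the category list up front (`dummy_encode` is a hypothetical helper):

```python
import pandas as pd


def dummy_encode(df: pd.DataFrame, column: str, categories: list) -> pd.DataFrame:
    """Dummy-encode `column` with a fixed category list, dropping the first level."""
    fixed = pd.Series(pd.Categorical(df[column], categories=categories), index=df.index)
    dummies = pd.get_dummies(fixed, prefix=column, drop_first=True)
    return pd.concat([df.drop(columns=column), dummies], axis=1)


clusters = [0, 1, 2, 3]
train = pd.DataFrame({"pickup_cluster": [0, 1, 2, 3], "distance_osrm": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"pickup_cluster": [1, 3, 1], "distance_osrm": [1.5, 2.5, 3.5]})

train_enc = dummy_encode(train, "pickup_cluster", clusters)
test_enc = dummy_encode(test, "pickup_cluster", clusters)  # clusters 0 and 2 are absent here
```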

In [12]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155196 entries, 1053743 to 121958
Data columns (total 18 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1155196 non-null  int64  
 1   distance_osrm           1155196 non-null  float64
 2   vendor_id_2             1155196 non-null  uint8  
 3   store_and_fwd_flag_Y    1155196 non-null  uint8  
 4   pickup_cluster_1        1155196 non-null  uint8  
 5   pickup_cluster_2        1155196 non-null  uint8  
 6   pickup_cluster_3        1155196 non-null  uint8  
 7   dropoff_cluster_1       1155196 non-null  uint8  
 8   dropoff_cluster_2       1155196 non-null  uint8  
 9   dropoff_cluster_3       1155196 non-null  uint8  
 10  pickup_month_sin        1155196 non-null  float64
 11  pickup_month_cos        1155196 non-null  float64
 12  pickup_day_sin          1155196 non-null  float64
 13  pickup_day_cos          1155196 non-null  float64
 1

In [13]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 18 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   distance_osrm           291728 non-null  float64
 2   vendor_id_2             291728 non-null  uint8  
 3   store_and_fwd_flag_Y    291728 non-null  uint8  
 4   pickup_cluster_1        291728 non-null  uint8  
 5   pickup_cluster_2        291728 non-null  uint8  
 6   pickup_cluster_3        291728 non-null  uint8  
 7   dropoff_cluster_1       291728 non-null  uint8  
 8   dropoff_cluster_2       291728 non-null  uint8  
 9   dropoff_cluster_3       291728 non-null  uint8  
 10  pickup_month_sin        291728 non-null  float64
 11  pickup_month_cos        291728 non-null  float64
 12  pickup_day_sin          291728 non-null  float64
 13  pickup_day_cos          291728 non-null  float64
 14  pickup_day_of_we

In [14]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [15]:
stage_7 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6351, Fit time: 1.5940 seconds
DTR - RMSLE: 0.6372, Fit time: 22.4325 seconds
XGBR - RMSLE: 0.6131, Fit time: 11.2697 seconds


### Stage 8: Skewness Transformation

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.69 s
Wall time: 5.79 s


In [5]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.59 s
Wall time: 5.78 s


In [6]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.59 s
Wall time: 5.61 s


In [7]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [8]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 1.86 s
Wall time: 1.95 s


In [9]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.45 s
Wall time: 2.5 s


In [10]:
%%time
skewness_transform = SkewnessTransform()
X_train, y_train = skewness_transform.transform(X_train, y_train)

>>>> Starting to transform the skew features ...

Skewness of X columns before transformation:
passenger_count      2.127703
pickup_longitude     3.299983
pickup_latitude     -1.107068
dropoff_longitude    2.452137
dropoff_latitude    -0.379705
distance_osrm        3.010464
dtype: float64
Skewness of y before transformation: 25.401763381908253

Column 'passenger_count' has skewness 2.127702902910666 which is outside the range (-2, 2)
Min value of column 'passenger_count': 1
Applied custom log transformation formula to 'passenger_count'.

Column 'pickup_longitude' has skewness 3.2999832729155765 which is outside the range (-2, 2)
Min value of column 'pickup_longitude': -74.3177490234375
Applied custom log transformation formula to 'pickup_longitude'.

Column 'dropoff_longitude' has skewness 2.4521373274651324 which is outside the range (-2, 2)
Min value of column 'dropoff_longitude': -74.58055877685547
Applied custom log transformation formula to 'dropoff_longitude'.

Column 'distance_o

In [11]:
# Apply a log transformation to the test set.
# NOTE: x + |x| equals 2x for positive values and 0 for negative ones, so the
# negative longitudes (every NYC coordinate) collapse to log1p(0) = 0 here.
# This is unlikely to match the training-set transformation, which the log
# above suggests is based on each column's minimum value.
skewed_cols = ['passenger_count', 'pickup_longitude', 'dropoff_longitude', 'distance_osrm']
for col in skewed_cols:
    X_test[col] = np.log1p(np.maximum(X_test[col] + np.abs(X_test[col]), 0))
y_test = np.log1p(np.maximum(y_test + np.abs(y_test), 0))

# Rename the transformed columns to match the training set.
X_test.rename(columns={col: f'log_{col}' for col in skewed_cols}, inplace=True)
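
In the test-set cell above, the expression `x + |x|` evaluates to zero for every negative value, so the longitude columns become constant after `log1p`. A safer pattern is to learn the shift on the training split and reuse it on the test split. The sketch below assumes the intended formula is `log1p(x + |min|)` for columns with a negative minimum; that is an interpretation of the "Min value of column" log lines, not the confirmed `SkewnessTransform` implementation, and `ShiftedLog` is a hypothetical class:

```python
import numpy as np
import pandas as pd


class ShiftedLog:
    """Hypothetical log1p transform whose shift is learned from training data."""

    def fit(self, X: pd.DataFrame, columns: list):
        # Shift by |min| only when the training minimum is negative.
        self.shifts_ = {c: -min(X[c].min(), 0) for c in columns}
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        for col, shift in self.shifts_.items():
            # Values below the training minimum are clamped to 0 before the log.
            out[f"log_{col}"] = np.log1p(np.maximum(out[col] + shift, 0))
            out = out.drop(columns=col)
        return out


train = pd.DataFrame({"pickup_longitude": [-74.3, -74.0, -73.7]})
test = pd.DataFrame({"pickup_longitude": [-74.1, -73.9]})

transformer = ShiftedLog().fit(train, ["pickup_longitude"])
test_out = transformer.transform(test)

# For comparison: the x + |x| expression flattens negative inputs to zero.
naive = np.log1p(np.maximum(test["pickup_longitude"] + np.abs(test["pickup_longitude"]), 0))
```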

In [12]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [13]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155196 entries, 1053743 to 121958
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   log_passenger_count     1155196 non-null  float64
 1   log_pickup_longitude    1155196 non-null  float64
 2   pickup_latitude         1155196 non-null  float64
 3   log_dropoff_longitude   1155196 non-null  float64
 4   dropoff_latitude        1155196 non-null  float64
 5   log_distance_osrm       1155196 non-null  float64
 6   vendor_id_2             1155196 non-null  uint8  
 7   store_and_fwd_flag_Y    1155196 non-null  uint8  
 8   pickup_month_sin        1155196 non-null  float64
 9   pickup_month_cos        1155196 non-null  float64
 10  pickup_day_sin          1155196 non-null  float64
 11  pickup_day_cos          1155196 non-null  float64
 12  pickup_day_of_week_sin  1155196 non-null  float64
 13  pickup_day_of_week_cos  1155196 non-null  float64
 1

In [14]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   log_passenger_count     291728 non-null  float64
 1   log_pickup_longitude    291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   log_dropoff_longitude   291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   log_distance_osrm       291728 non-null  float64
 6   vendor_id_2             291728 non-null  uint8  
 7   store_and_fwd_flag_Y    291728 non-null  uint8  
 8   pickup_month_sin        291728 non-null  float64
 9   pickup_month_cos        291728 non-null  float64
 10  pickup_day_sin          291728 non-null  float64
 11  pickup_day_cos          291728 non-null  float64
 12  pickup_day_of_week_sin  291728 non-null  float64
 13  pickup_day_of_week_cos  291728 non-null  float64
 14  pickup_hour_sin 

In [15]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [16]:
stage_8 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.0646, Fit time: 1.6233 seconds
DTR - RMSLE: 0.0807, Fit time: 33.0034 seconds
XGBR - RMSLE: 0.0591, Fit time: 13.9955 seconds


### Stage 9: Encoding Features

In [23]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [24]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [26]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.06 s
Wall time: 5.23 s


In [27]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.2 s
Wall time: 5.59 s


In [28]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.36 s
Wall time: 5.55 s


In [29]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

CPU times: total: 0 ns
Wall time: 0 ns
>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [30]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 1.92 s
Wall time: 1.99 s


In [31]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.48 s
Wall time: 2.51 s


In [32]:
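# Drop the row identifier.
X_train = X_train.drop(columns=['id'])
X_test = X_test.drop(columns=['id'])

# Categorical columns to dummy-encode.
col = ['vendor_id', 'store_and_fwd_flag', 'pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']

# Cast to str so get_dummies treats the numeric date parts as categories.
X_train[col] = X_train[col].astype(str)
X_test[col] = X_test[col].astype(str)

# Encoding each split separately assumes both contain every category;
# otherwise the resulting columns would not line up.
X_train = pd.get_dummies(X_train, columns=col, drop_first=True)
X_test = pd.get_dummies(X_test, columns=col, drop_first=True)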

# X_train = X_train.drop(columns=columns_to_drop)
# X_test = X_test.drop(columns=columns_to_drop)

In [12]:
# %%time
# feature_encoding = FeatureEncoding()
# X_train = feature_encoding.transform(X_train)
# X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [34]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155196 entries, 1053743 to 121958
Data columns (total 72 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   passenger_count       1155196 non-null  int64  
 1   pickup_longitude      1155196 non-null  float64
 2   pickup_latitude       1155196 non-null  float64
 3   dropoff_longitude     1155196 non-null  float64
 4   dropoff_latitude      1155196 non-null  float64
 5   distance_osrm         1155196 non-null  float64
 6   vendor_id_2           1155196 non-null  uint8  
 7   store_and_fwd_flag_Y  1155196 non-null  uint8  
 8   pickup_month_2        1155196 non-null  uint8  
 9   pickup_month_3        1155196 non-null  uint8  
 10  pickup_month_4        1155196 non-null  uint8  
 11  pickup_month_5        1155196 non-null  uint8  
 12  pickup_month_6        1155196 non-null  uint8  
 13  pickup_day_10         1155196 non-null  uint8  
 14  pickup_day_11         1155196

In [35]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 72 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   passenger_count       291728 non-null  int64  
 1   pickup_longitude      291728 non-null  float64
 2   pickup_latitude       291728 non-null  float64
 3   dropoff_longitude     291728 non-null  float64
 4   dropoff_latitude      291728 non-null  float64
 5   distance_osrm         291728 non-null  float64
 6   vendor_id_2           291728 non-null  uint8  
 7   store_and_fwd_flag_Y  291728 non-null  uint8  
 8   pickup_month_2        291728 non-null  uint8  
 9   pickup_month_3        291728 non-null  uint8  
 10  pickup_month_4        291728 non-null  uint8  
 11  pickup_month_5        291728 non-null  uint8  
 12  pickup_month_6        291728 non-null  uint8  
 13  pickup_day_10         291728 non-null  uint8  
 14  pickup_day_11         291728 non-null  uint8  
 

In [36]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [37]:
stage_9 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6432, Fit time: 12.7416 seconds
DTR - RMSLE: 0.6035, Fit time: 90.4740 seconds
XGBR - RMSLE: 0.6035, Fit time: 23.4781 seconds


### Stage 10: Feature Scaling - Standardization (Z-score normalization)

In [34]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [35]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [37]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.45 s
Wall time: 5.49 s


In [38]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.53 s
Wall time: 5.59 s


In [39]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.69 s
Wall time: 5.88 s


In [40]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

CPU times: total: 0 ns
Wall time: 0 ns
>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [41]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 1.95 s
Wall time: 2.02 s


In [42]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.39 s
Wall time: 2.52 s


In [43]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [44]:
# standard_scaler = StandardScaler()
# X_train = standard_scaler.fit_transform(X_train)
# X_test = standard_scaler.transform(X_test)

In [45]:
# minmax_scaler = MinMaxScaler()
# X_train = minmax_scaler.fit_transform(X_train)
# X_test = minmax_scaler.transform(X_test)

In [46]:
robust_scaler = RobustScaler()
X_train = robust_scaler.fit_transform(X_train)
X_test = robust_scaler.transform(X_test)

In [47]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [48]:
stage_10 = evaluate_models(models, X_train, y_train, X_test, y_test)

LR - RMSLE: 0.6386, Fit time: 1.3335 seconds
DTR - RMSLE: 0.6119, Fit time: 48.9959 seconds
XGBR - RMSLE: 0.5986, Fit time: 12.1261 seconds


### Stage 11: Feature Selection - Numerical-Numerical Feature Selection

In [1]:
from playground import *
df = pd.read_csv("csv_ml/eda_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   id                       1458644 non-null  object 
 1   vendor_id                1458644 non-null  int64  
 2   pickup_datetime          1458644 non-null  object 
 3   dropoff_datetime         1458644 non-null  object 
 4   passenger_count          1458644 non-null  int64  
 5   pickup_longitude         1458644 non-null  float64
 6   pickup_latitude          1458644 non-null  float64
 7   dropoff_longitude        1458644 non-null  float64
 8   dropoff_latitude         1458644 non-null  float64
 9   store_and_fwd_flag       1458644 non-null  object 
 10  trip_duration            1458644 non-null  int64  
 11  distance_osrm            1458627 non-null  float64
 12  pickup_dist_NYC_center   1458644 non-null  float64
 13  dropoff_dist_NYC_center  1458644 non-null 

In [2]:
X = df.drop(columns="trip_duration")
y = df["trip_duration"]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [4]:
%%time
rm_duplicates = RemoveDuplicates()
X_train, y_train = rm_duplicates.transform(X_train, y_train)

>>>> Starting the process of removing duplicates ...
No duplicates found.
CPU times: total: 5.48 s
Wall time: 6.18 s


In [5]:
%%time
to_dtypes = ToDataTypes()
X_train, y_train = to_dtypes.transform(X_train, y_train)
X_test, y_test = to_dtypes.transform(X_test, y_test)

>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
>>>> Starting data type conversion process...
Column 'vendor_id' changed from int64 to object
Column 'pickup_datetime' changed from object to datetime64[ns]
Column 'dropoff_datetime' changed from object to datetime64[ns]
CPU times: total: 5.33 s
Wall time: 5.68 s


In [6]:
%%time
datetime_break = DateTimeBreak()
X_train = datetime_break.transform(X_train)
X_test = datetime_break.transform(X_test)

>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
>>>> Starting datetime feature extraction...
Extracted features: ['pickup_month', 'pickup_day', 'pickup_day_of_week', 'pickup_hour']
Dropped columns: ['pickup_datetime', 'dropoff_datetime']
CPU times: total: 5.52 s
Wall time: 5.56 s


In [7]:
%%time
miss_val_input = MissValInput()
X_train, y_train = miss_val_input.fit_transform(X_train, y_train)
X_test, y_test = miss_val_input.transform(X_test, y_test)

CPU times: total: 0 ns
Wall time: 0 ns
>>>> Starting missing value imputation...
Dropped 16 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 1166915
Removed data: 16 (0.00%)
Final data length: 1166899
>>>> Starting missing value imputation...
Dropped 1 entries (0.00%) from column 'distance_osrm' due to missing values.
Initial data length: 291729
Removed data: 1 (0.00%)
Final data length: 291728


In [8]:
%%time
feature_restriction = FeatureRestriction()
X_train, y_train = feature_restriction.transform(X_train, y_train)

>>>> Starting features restriction ...
The dataset size: 1166899 rows
trip_duration (old) -> [min, max]: [1, 3526282]
trip_duration (new) -> [min, max]: [60, 86392]
distance_osrm (old) -> [min, max]: [0.0, 765.6445]
distance_osrm (new) -> [min, max]: [0.1001, 97.7243]
speed_osrm column not found, skipping restriction on 'speed_osrm'.
passenger_count (old) -> [min, max]: [0, 8]
passenger_count (new) -> [min, max]: [1, 8]
Total removed data: 11632 (1.00%)
CPU times: total: 1.97 s
Wall time: 2.06 s


In [9]:
%%time
outlier_mapper = OutlierMapper(map_title="outliers_map_ml_baseline", csv_dir="csv_ml_baseline", html_dir="html_ml_baseline")
X_train, y_train = outlier_mapper.transform(X_train, y_train)
X_test.drop(columns=['pickup_dist_NYC_center', 'dropoff_dist_NYC_center'], inplace=True)

>>>> Starting New York City map restriction ...
Outliers saved to 'csv_ml_baseline\outliers_map_ml_baseline.csv'
Map saved as 'html_ml_baseline/outliers_map_ml_baseline.html'
Removed 71 (0.01%) records outside NYC boundaries.
CPU times: total: 2.39 s
Wall time: 2.43 s


In [10]:
%%time
feature_encoding = FeatureEncoding()
X_train = feature_encoding.transform(X_train)
X_test = feature_encoding.transform(X_test)

>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' completed.
Performing cyclical encoding on 'pickup_day' column...
Cyclical encoding for 'pickup_day' completed.
Performing cyclical encoding on 'pickup_day_of_week' column...
Cyclical encoding for 'pickup_day_of_week' completed.
Performing cyclical encoding on 'pickup_hour' column...
Cyclical encoding for 'pickup_hour' completed.
Dropping original columns after encoding...
Transformation completed.
>>>> Starting to encode the features ...
Starting transformations...
Dropping 'id' column...
Performing dummy encoding on 'vendor_id' column...
Performing dummy encoding on 'store_and_fwd_flag' column...
Performing cyclical encoding on 'pickup_month' column...
Cyclical encoding for 'pickup_month' co

In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1155196 entries, 1053743 to 121958
Data columns (total 16 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   passenger_count         1155196 non-null  int64  
 1   pickup_longitude        1155196 non-null  float64
 2   pickup_latitude         1155196 non-null  float64
 3   dropoff_longitude       1155196 non-null  float64
 4   dropoff_latitude        1155196 non-null  float64
 5   distance_osrm           1155196 non-null  float64
 6   vendor_id_2             1155196 non-null  uint8  
 7   store_and_fwd_flag_Y    1155196 non-null  uint8  
 8   pickup_month_sin        1155196 non-null  float64
 9   pickup_month_cos        1155196 non-null  float64
 10  pickup_day_sin          1155196 non-null  float64
 11  pickup_day_cos          1155196 non-null  float64
 12  pickup_day_of_week_sin  1155196 non-null  float64
 13  pickup_day_of_week_cos  1155196 non-null  float64
 1

In [12]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291728 entries, 67250 to 589044
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   passenger_count         291728 non-null  int64  
 1   pickup_longitude        291728 non-null  float64
 2   pickup_latitude         291728 non-null  float64
 3   dropoff_longitude       291728 non-null  float64
 4   dropoff_latitude        291728 non-null  float64
 5   distance_osrm           291728 non-null  float64
 6   vendor_id_2             291728 non-null  uint8  
 7   store_and_fwd_flag_Y    291728 non-null  uint8  
 8   pickup_month_sin        291728 non-null  float64
 9   pickup_month_cos        291728 non-null  float64
 10  pickup_day_sin          291728 non-null  float64
 11  pickup_day_cos          291728 non-null  float64
 12  pickup_day_of_week_sin  291728 non-null  float64
 13  pickup_day_of_week_cos  291728 non-null  float64
 14  pickup_hour_sin 

In [13]:
# Define models
models = [
    ('LR', LinearRegression()),
    ('DTR', DecisionTreeRegressor()),
    ('XGBR', XGBRegressor())
]

In [14]:
k_range = range(5, X_train.shape[1] + 1)  # From 5 to the number of features (inclusive)
results = {}

In [15]:
for k in k_range:
    print(f"Evaluating for k={k}...")
    
    # Feature selection
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    # Evaluate models
    stage_results = evaluate_models(models, X_train_selected, y_train, X_test_selected, y_test)
    
    # Store results
    results[k] = stage_results
    

Evaluating for k=5...
LR - RMSLE: 0.6498, Fit time: 0.3493 seconds
DTR - RMSLE: 0.6378, Fit time: 34.6457 seconds
XGBR - RMSLE: 0.5815, Fit time: 7.5156 seconds
Evaluating for k=6...
LR - RMSLE: 0.6549, Fit time: 0.4063 seconds
DTR - RMSLE: 0.6362, Fit time: 30.9367 seconds
XGBR - RMSLE: 0.5906, Fit time: 7.2813 seconds
Evaluating for k=7...
LR - RMSLE: 0.6508, Fit time: 0.4375 seconds
DTR - RMSLE: 0.6186, Fit time: 32.2033 seconds
XGBR - RMSLE: 0.5889, Fit time: 7.4844 seconds
Evaluating for k=8...
LR - RMSLE: 0.6507, Fit time: 0.4844 seconds
DTR - RMSLE: 0.6256, Fit time: 35.3680 seconds
XGBR - RMSLE: 0.5908, Fit time: 7.6093 seconds
Evaluating for k=9...
LR - RMSLE: 0.6474, Fit time: 0.5440 seconds
DTR - RMSLE: 0.6254, Fit time: 34.2815 seconds
XGBR - RMSLE: 0.5898, Fit time: 8.2344 seconds
Evaluating for k=10...
LR - RMSLE: 0.6386, Fit time: 0.6719 seconds
DTR - RMSLE: 0.6092, Fit time: 34.5224 seconds
XGBR - RMSLE: 0.5814, Fit time: 8.6251 seconds
Evaluating for k=11...
LR - RMSLE

In [23]:
# Parsing the string results into dictionaries
parsed_results = {}

for k, metrics_str in results.items():
    print(f"Results for k={k}:")
    metrics = {}
    # Split the string by newline and then by " - "
    for line in metrics_str.split("\n"):
        model_name, metric_str = line.split(" - ")
        rmsle_value = float(metric_str.split(": ")[1].split(",")[0])  # Extract RMSLE value
        metrics[model_name] = rmsle_value
        print(f"{model_name}: RMSLE = {rmsle_value}")
    parsed_results[k] = metrics

Results for k=5:
LR: RMSLE = 0.6498
DTR: RMSLE = 0.6378
XGBR: RMSLE = 0.5815
Results for k=6:
LR: RMSLE = 0.6549
DTR: RMSLE = 0.6362
XGBR: RMSLE = 0.5906
Results for k=7:
LR: RMSLE = 0.6508
DTR: RMSLE = 0.6186
XGBR: RMSLE = 0.5889
Results for k=8:
LR: RMSLE = 0.6507
DTR: RMSLE = 0.6256
XGBR: RMSLE = 0.5908
Results for k=9:
LR: RMSLE = 0.6474
DTR: RMSLE = 0.6254
XGBR: RMSLE = 0.5898
Results for k=10:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6092
XGBR: RMSLE = 0.5814
Results for k=11:
LR: RMSLE = 0.6387
DTR: RMSLE = 0.6105
XGBR: RMSLE = 0.5838
Results for k=12:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6083
XGBR: RMSLE = 0.5894
Results for k=13:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.615
XGBR: RMSLE = 0.5935
Results for k=14:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6146
XGBR: RMSLE = 0.5952
Results for k=15:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6037
XGBR: RMSLE = 0.5973
Results for k=16:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6072
XGBR: RMSLE = 0.5948


## Result

#### Stage 1: Baseline Model

In [15]:
print(stage_1)

LR - RMSLE: 0.8646, Fit time: 1.6810 seconds
DTR - RMSLE: 0.6242, Fit time: 52.5158 seconds
XGBR - RMSLE: 0.7075, Fit time: 10.8571 seconds


#### Stage 2: Adding distance_osrm Feature

I dropped the missing values in distance_osrm before performing the train-test split. This result is ignored.

In [14]:
print(stage_2)

LR - RMSLE: 0.6594, Fit time: 1.6992 seconds
DTR - RMSLE: 0.5957, Fit time: 51.1731 seconds
XGBR - RMSLE: 0.6944, Fit time: 11.4793 seconds


#### Stage 3: Missing Value Imputation

In [14]:
print(stage_3)

LR - RMSLE: 0.6560, Fit time: 1.3997 seconds
DTR - RMSLE: 0.5937, Fit time: 46.4586 seconds
XGBR - RMSLE: 0.7084, Fit time: 12.4147 seconds


#### Stage 4: Adding speed_osrm Feature

In [15]:
print(stage_4)

LR - RMSLE: 0.6235, Fit time: 1.8488 seconds
DTR - RMSLE: 0.1430, Fit time: 48.4794 seconds
XGBR - RMSLE: 0.2661, Fit time: 14.2972 seconds
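
The drop in DTR and XGBR error at this stage deserves caution. The derivation of `speed_osrm` is not shown in this section. Assuming it is computed as `distance_osrm` divided by `trip_duration`, the feature is built from the target, and a model given both distance and speed can reconstruct the duration almost exactly. The sketch below illustrates that assumption with made-up values, assumed units (km and seconds), and hypothetical helper names.

```python
import math

def derive_speed_kmh(distance_km, duration_s):
    """Assumed derivation: average speed from distance and trip duration."""
    return distance_km / (duration_s / 3600.0)

def recover_duration_s(distance_km, speed_kmh):
    """Invert the derivation: duration follows from distance and speed alone."""
    return distance_km / speed_kmh * 3600.0

# Illustrative trips as (distance in km, duration in seconds); not taken from the dataset.
trips = [(3.2, 720.0), (10.5, 1500.0), (0.8, 240.0)]

for distance_km, duration_s in trips:
    speed_kmh = derive_speed_kmh(distance_km, duration_s)
    recovered = recover_duration_s(distance_km, speed_kmh)
    print(f"true={duration_s:.1f}s recovered={recovered:.1f}s")
```

If that assumption holds, the Stage 4 scores should not be compared directly with the other stages.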


#### Stage 5: Applying sensible restrictions to the Features

In [15]:
print(stage_5)

LR - RMSLE: 0.6394, Fit time: 1.5391 seconds
DTR - RMSLE: 0.6078, Fit time: 47.7388 seconds
XGBR - RMSLE: 0.5942, Fit time: 11.0469 seconds


#### Stage 6: Restricting the boundary to the area around NYC

In [15]:
print(stage_6)

LR - RMSLE: 0.6386, Fit time: 1.4434 seconds
DTR - RMSLE: 0.6115, Fit time: 50.3154 seconds
XGBR - RMSLE: 0.5948, Fit time: 14.8784 seconds


#### Stage 7: Replacing pickup_longitude, pickup_latitude with pickup_cluster and dropoff_longitude, dropoff_latitude with dropoff_cluster

In [16]:
# optimal_k_pickup=4, optimal_k_dropoff=3
print(stage_7)

LR - RMSLE: 0.6385, Fit time: 1.6137 seconds
DTR - RMSLE: 0.6494, Fit time: 26.2606 seconds
XGBR - RMSLE: 0.6287, Fit time: 11.3091 seconds


In [16]:
# optimal_k_pickup=3, optimal_k_dropoff=3
print(stage_7)

LR - RMSLE: 0.6398, Fit time: 1.9373 seconds
DTR - RMSLE: 0.6502, Fit time: 28.3572 seconds
XGBR - RMSLE: 0.6178, Fit time: 15.6123 seconds


In [16]:
# optimal_k_pickup=3, optimal_k_dropoff=4
print(stage_7)

LR - RMSLE: 0.6370, Fit time: 1.4430 seconds
DTR - RMSLE: 0.6333, Fit time: 23.0397 seconds
XGBR - RMSLE: 0.6199, Fit time: 11.0230 seconds


In [16]:
# optimal_k_pickup=4, optimal_k_dropoff=4
print(stage_7)

LR - RMSLE: 0.6351, Fit time: 1.5940 seconds
DTR - RMSLE: 0.6372, Fit time: 22.4325 seconds
XGBR - RMSLE: 0.6131, Fit time: 11.2697 seconds


#### Stage 8: Skewness Transformation

In [17]:
print(stage_8)

LR - RMSLE: 0.0646, Fit time: 1.6233 seconds
DTR - RMSLE: 0.0807, Fit time: 33.0034 seconds
XGBR - RMSLE: 0.0591, Fit time: 13.9955 seconds


#### Stage 9: Encoding Features

cyclical encoding

In [15]:
print(stage_6)

LR - RMSLE: 0.6386, Fit time: 1.4434 seconds
DTR - RMSLE: 0.6115, Fit time: 50.3154 seconds
XGBR - RMSLE: 0.5948, Fit time: 14.8784 seconds


dummy encoding

In [38]:
print(stage_9)

LR - RMSLE: 0.6432, Fit time: 12.7416 seconds
DTR - RMSLE: 0.6035, Fit time: 90.4740 seconds
XGBR - RMSLE: 0.6035, Fit time: 23.4781 seconds
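
The two encodings differ mainly in width: the dummy-encoded `X_train.info()` in Stage 9 lists 72 columns, while the cyclically encoded frame in Stage 11 lists 16. The `FeatureEncoding` class lives in `playground` and is not shown, so the following is a minimal sketch of the usual sine/cosine mapping rather than its actual implementation. `cyclical_encode`, `circle_distance`, and the chosen period are assumptions.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Assumed period: 24 for pickup_hour (values 0..23).
hour_0 = cyclical_encode(0, 24)
hour_12 = cyclical_encode(12, 24)
hour_23 = cyclical_encode(23, 24)

# 23:00 ends up next to 00:00, unlike with the raw integers 23 and 0.
print(circle_distance(hour_23, hour_0) < circle_distance(hour_12, hour_0))
```

Each periodic column becomes two numeric columns regardless of how many distinct values it has, which is why the cyclical frame stays narrow.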


#### Stage 10: Feature Scaling - Standardization (Z-score normalization)

standard scaler

In [17]:
print(stage_10)

LR - RMSLE: 0.6386, Fit time: 1.1339 seconds
DTR - RMSLE: 0.6070, Fit time: 47.6009 seconds
XGBR - RMSLE: 0.5979, Fit time: 11.1721 seconds


min max scaler

In [33]:
print(stage_10)

LR - RMSLE: 0.6386, Fit time: 1.1525 seconds
DTR - RMSLE: 0.6016, Fit time: 46.3015 seconds
XGBR - RMSLE: 0.5969, Fit time: 11.8969 seconds


robust scaler

In [49]:
print(stage_10)

LR - RMSLE: 0.6386, Fit time: 1.3335 seconds
DTR - RMSLE: 0.6119, Fit time: 48.9959 seconds
XGBR - RMSLE: 0.5986, Fit time: 12.1261 seconds
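
LR scores 0.6386 under all three scalers, which is what one would expect: each scaler applies a per-feature shift and rescale, and ordinary least squares with an intercept produces the same predictions under such a transformation. The sketch below spells out the three transformations in plain Python for a single feature. It follows the textbook definitions behind `StandardScaler`, `MinMaxScaler`, and `RobustScaler` with their default settings and is not meant to replace them. The helper names and sample values are assumptions.

```python
import statistics

def standard_scale(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def robust_scale(values):
    """Subtract the median, divide by the interquartile range (Q3 - Q1)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Illustrative feature with one extreme value; not taken from the dataset.
distances = [1.0, 2.0, 3.0, 4.0, 100.0]

for name, scale in [("standard", standard_scale),
                    ("min-max", minmax_scale),
                    ("robust", robust_scale)]:
    print(name, [round(v, 3) for v in scale(distances)])
```

The single large value compresses the remaining z-score and min-max outputs into a narrow band, while the robust-scaled bulk of the data stays on a unit-like scale because the median and interquartile range ignore it.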


#### Stage 11: Feature Selection - Numerical-Numerical Feature Selection

In [24]:
# Parsing the string results into dictionaries
parsed_results = {}

for k, metrics_str in results.items():
    print(f"Results for k={k}:")
    metrics = {}
    # Split the string by newline and then by " - "
    for line in metrics_str.split("\n"):
        model_name, metric_str = line.split(" - ")
        rmsle_value = float(metric_str.split(": ")[1].split(",")[0])  # Extract RMSLE value
        metrics[model_name] = rmsle_value
        print(f"{model_name}: RMSLE = {rmsle_value}")
    parsed_results[k] = metrics

Results for k=5:
LR: RMSLE = 0.6498
DTR: RMSLE = 0.6378
XGBR: RMSLE = 0.5815
Results for k=6:
LR: RMSLE = 0.6549
DTR: RMSLE = 0.6362
XGBR: RMSLE = 0.5906
Results for k=7:
LR: RMSLE = 0.6508
DTR: RMSLE = 0.6186
XGBR: RMSLE = 0.5889
Results for k=8:
LR: RMSLE = 0.6507
DTR: RMSLE = 0.6256
XGBR: RMSLE = 0.5908
Results for k=9:
LR: RMSLE = 0.6474
DTR: RMSLE = 0.6254
XGBR: RMSLE = 0.5898
Results for k=10:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6092
XGBR: RMSLE = 0.5814
Results for k=11:
LR: RMSLE = 0.6387
DTR: RMSLE = 0.6105
XGBR: RMSLE = 0.5838
Results for k=12:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6083
XGBR: RMSLE = 0.5894
Results for k=13:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.615
XGBR: RMSLE = 0.5935
Results for k=14:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6146
XGBR: RMSLE = 0.5952
Results for k=15:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6037
XGBR: RMSLE = 0.5973
Results for k=16:
LR: RMSLE = 0.6386
DTR: RMSLE = 0.6072
XGBR: RMSLE = 0.5948
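
To read the sweep at a glance, the per-k scores can be reduced to the best k for each model. The sketch below hard-codes the RMSLE values printed above in the same shape as `parsed_results`. The helper `best_k_per_model` is a hypothetical name introduced for illustration.

```python
# RMSLE per k, copied from the output above (same shape as parsed_results).
parsed_results = {
    5: {"LR": 0.6498, "DTR": 0.6378, "XGBR": 0.5815},
    6: {"LR": 0.6549, "DTR": 0.6362, "XGBR": 0.5906},
    7: {"LR": 0.6508, "DTR": 0.6186, "XGBR": 0.5889},
    8: {"LR": 0.6507, "DTR": 0.6256, "XGBR": 0.5908},
    9: {"LR": 0.6474, "DTR": 0.6254, "XGBR": 0.5898},
    10: {"LR": 0.6386, "DTR": 0.6092, "XGBR": 0.5814},
    11: {"LR": 0.6387, "DTR": 0.6105, "XGBR": 0.5838},
    12: {"LR": 0.6386, "DTR": 0.6083, "XGBR": 0.5894},
    13: {"LR": 0.6386, "DTR": 0.615, "XGBR": 0.5935},
    14: {"LR": 0.6386, "DTR": 0.6146, "XGBR": 0.5952},
    15: {"LR": 0.6386, "DTR": 0.6037, "XGBR": 0.5973},
    16: {"LR": 0.6386, "DTR": 0.6072, "XGBR": 0.5948},
}

def best_k_per_model(results):
    """Return {model: (k, rmsle)} with the lowest RMSLE; ties keep the smallest k."""
    best = {}
    for k in sorted(results):
        for model, score in results[k].items():
            if model not in best or score < best[model][1]:
                best[model] = (k, score)
    return best

for model, (k, score) in best_k_per_model(parsed_results).items():
    print(f"{model}: best k={k}, RMSLE={score}")
```

On these numbers XGBR is best at k=10 (0.5814) and DTR at k=15 (0.6037), while LR is flat at 0.6386 from k=10 onward apart from 0.6387 at k=11.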
