# Summary of the Feature Selection and Modeling Process

This work aims to predict the fare amount of yellow taxi trips in New York City using a machine learning approach based on systematic feature selection and model comparison.
After cleaning the dataset and removing variables that directly reconstruct the fare (e.g., total_amount, tip_amount), several feature-selection techniques were applied:

- ANOVA F-test (linear dependency)
- Mutual Information Regression (non-linear dependency)
- Random Forest Feature Importances
- Gradient Boosting Feature Importances

All four methods rank trip_distance as by far the most influential predictor. RateCodeID and the geographic coordinates (pickup/dropoff) fill the next positions in every ranking, although their relative order differs by method; payment_type ranks third under the ANOVA F-test and seventh under the other three methods.
Tree-based models provided the highest predictive accuracy (R² ≈ 0.93, against R² ≈ 0.85 for linear regression), and the Random Forest model achieved the lowest test error while remaining robust and interpretable through its feature importances.

Based on the consistency of results across methods and the strong performance of ensemble models, our group selected Random Forest as the final predictive model.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random

import datetime
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [11]:
# import dataset
df = pd.read_csv(r'df_sample.csv')
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-01-01 16:07:13,2015-01-01 16:35:55,1,18.26,-73.979401,40.76038,3,N,-74.176826,40.694511,2,66.0,0.0,0.0,0.0,16.35,0.3,82.65
1,2,2015-01-01 15:59:47,2015-01-01 16:02:06,1,0.66,-73.980637,40.730099,1,N,-73.983124,40.722569,2,4.0,0.0,0.5,0.0,0.0,0.3,4.8
2,2,2015-01-01 04:05:37,2015-01-01 04:08:25,5,0.33,-73.971985,40.749748,1,N,-73.971886,40.746304,2,4.0,0.5,0.5,0.0,0.0,0.3,5.3
3,1,2015-01-01 10:10:53,2015-01-01 10:27:01,3,7.3,-74.010773,40.71405,1,N,-73.950073,40.77972,1,22.5,0.0,0.5,5.0,0.0,0.0,28.3
4,1,2015-01-01 03:30:14,2015-01-01 03:46:46,1,2.6,-73.982277,40.742973,1,N,-73.990211,40.712654,2,12.5,0.5,0.5,0.0,0.0,0.0,13.8
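
The reason for dropping the fee columns can be checked on the preview rows themselves. The following is a minimal sketch, assuming that total_amount is meant to be the sum of fare_amount and the listed fee columns; the values are copied from the five rows above, and the names `preview`, `fee_columns`, and `residual` are illustrative only.

```python
import pandas as pd

# Values copied from the five preview rows shown above.
preview = pd.DataFrame({
    "fare_amount": [66.0, 4.0, 4.0, 22.5, 12.5],
    "extra": [0.0, 0.0, 0.5, 0.0, 0.5],
    "mta_tax": [0.0, 0.5, 0.5, 0.5, 0.5],
    "tip_amount": [0.0, 0.0, 0.0, 5.0, 0.0],
    "tolls_amount": [16.35, 0.0, 0.0, 0.0, 0.0],
    "improvement_surcharge": [0.3, 0.3, 0.3, 0.0, 0.0],
    "total_amount": [82.65, 4.8, 5.3, 28.3, 13.8],
})

fee_columns = ["extra", "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge"]

# Rebuild the fare from the total and the fee columns (assumed relationship).
reconstructed_fare = preview["total_amount"] - preview[fee_columns].sum(axis=1)

# Difference between the rebuilt fare and the recorded fare, per row.
residual = (reconstructed_fare - preview["fare_amount"]).round(2)
print(residual.tolist())
```

On these rows the rebuilt fare is never more than 0.30 away from the recorded fare, which is why keeping total_amount together with the fee columns would leak the target into the features.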


## Preprocessing Setup: Numerical and Categorical Pipeline

Before applying any feature-selection method, we define a preprocessing pipeline that imputes missing values, standardizes numerical features, and one-hot encodes categorical features.
This ensures that all downstream feature-selection techniques and machine-learning models operate on transformed, comparable data.
The ColumnTransformer applies the appropriate transformation to each group of columns, which we select by dtype.

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# === 1. Target column ===
target = "fare_amount"   # or whichever column you want to predict

# === 2. X / y split ===
cols_to_remove = [
    'total_amount',
    'tip_amount',
    'tolls_amount',
    'extra',
    'mta_tax',
    'improvement_surcharge'
]
X = df.drop(columns=[target] + cols_to_remove)
y = df[target]

# === 3. Train/test ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === 4. Automatic column selection by dtype ===
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# === 5. Preprocessor: imputation + scaling + encoding ===
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# === 6. Full pipeline ===
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# === 7. Training ===
pipeline.fit(X_train, y_train)

# === 8. Predictions ===
y_pred = pipeline.predict(X_test)

# === 9. Metrics ===
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("=== Performances ===")
print(f"MAE : {mae:.2f}")
print(f"RMSE : {rmse:.2f}")
print(f"R² : {r2:.4f}")

=== Performances ===
MAE : 1.77
RMSE : 4.03
R² : 0.8424


## Feature Selection using ANOVA F-test

In this section, we apply a linear feature-selection technique, the univariate F-test (f_regression), to evaluate how strongly each input variable is linearly associated with the target (fare_amount).
For each feature, the F-statistic is derived from its Pearson correlation with the target, so it only detects linear relationships.
The SelectKBest method ranks all features by F-score and keeps the K highest; we print the sorted ranking.
This provides an interpretable linear baseline for feature relevance.
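
The link between the F-score and the correlation can be checked by hand. The sketch below uses synthetic data (an assumption made for illustration; the variable names are not from the notebook) and recomputes the score returned by f_regression from the Pearson correlation r as F = r² / (1 − r²) · (n − 2).

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data (assumption): one informative feature and one pure-noise feature.
x_informative = rng.normal(size=n_samples)
x_noise = rng.normal(size=n_samples)
y = 3.0 * x_informative + rng.normal(scale=0.5, size=n_samples)
X = np.column_stack([x_informative, x_noise])

# Scores as computed by scikit-learn (default center=True).
f_scores, p_values = f_regression(X, y)

# Same statistic by hand, from the Pearson correlation of each column with y.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_manual = r**2 / (1 - r**2) * (n_samples - 2)

print("f_regression:", f_scores)
print("by hand     :", f_manual)
```

Because the score depends only on r, a feature whose effect on the fare is strong but non-linear can receive a low F-score, which motivates the Mutual Information analysis that follows.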

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
import numpy as np
import pandas as pd

# === Columns already prepared; no datetime columns remain ===
target = "fare_amount"

cols_to_remove = [
    'total_amount','tip_amount','tolls_amount',
    'extra','mta_tax','improvement_surcharge'
]

# errors='ignore' keeps this cell re-runnable once the columns are gone
df = df.drop(columns=cols_to_remove, errors='ignore')

# === Rebuild X, y ===
X = df.drop(columns=[target])
y = df[target]

# === Separate numeric / categorical columns ===
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# === Preprocessor ===
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# === ANOVA ===
K = 10

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(score_func=f_regression, k=K)),
    ("regressor", LinearRegression())
])

# === Split ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === Fit ===
pipeline.fit(X_train, y_train)

# === Extract scores ===
selector = pipeline.named_steps['selector']
scores = selector.scores_

encoded_features = pipeline.named_steps['preprocessor'].get_feature_names_out()

scores_df = pd.DataFrame({'Feature': encoded_features, 'F-score': scores})
scores_df = scores_df.sort_values('F-score', ascending=False)
print(scores_df)

# Top-K rows of the ranking. With tied scores (e.g. the two store_and_fwd_flag
# dummies), selector.get_support() gives the exact columns SelectKBest kept.
selected_features = scores_df.head(K)['Feature'].tolist()
print("\nSelected features:", selected_features)

# === Prediction ===
y_pred = pipeline.predict(X_test)

# === Metrics ===
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²:   {r2:.4f}")


                      Feature        F-score
2          num__trip_distance  353996.057984
5             num__RateCodeID    6988.961014
8           num__payment_type     593.938775
3       num__pickup_longitude      42.392855
6      num__dropoff_longitude      41.727704
7       num__dropoff_latitude      41.265916
4        num__pickup_latitude      41.257718
0               num__VendorID      13.568280
1        num__passenger_count       7.533220
10  cat__store_and_fwd_flag_Y       2.568141
9   cat__store_and_fwd_flag_N       2.568141

Selected features: ['num__trip_distance', 'num__RateCodeID', 'num__payment_type', 'num__pickup_longitude', 'num__dropoff_longitude', 'num__dropoff_latitude', 'num__pickup_latitude', 'num__VendorID', 'num__passenger_count', 'cat__store_and_fwd_flag_Y']

MAE: 1.7477
RMSE: 3.9604
R²:   0.8481


## Feature Selection using Mutual Information

Mutual Information captures non-linear dependencies between each feature and the target variable.
This method does not rely on linear assumptions and can identify complex relationships that ANOVA may miss.
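We compute MI scores, sort all features by importance, and extract the top contributors.
Note that after one-hot encoding with drop='first' only 10 features remain, so K = 10 keeps all of them and the linear model is fit on the full feature set, which is why its metrics match the ANOVA run.
Comparing MI with ANOVA helps validate the stability of feature relevance across methods.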
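
The difference between the two criteria shows up clearly on a purely non-linear relationship. The sketch below is an illustration on synthetic data (an assumption; nothing here comes from the taxi dataset): the target depends on the square of one feature, so its linear correlation with that feature is close to zero while the Mutual Information is large.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 2000

# Synthetic data (assumption): y depends on x_quadratic only through its square.
x_quadratic = rng.uniform(-1, 1, n_samples)
x_noise = rng.uniform(-1, 1, n_samples)
y = x_quadratic**2 + rng.normal(scale=0.05, size=n_samples)
X = np.column_stack([x_quadratic, x_noise])

# Linear criterion: sees almost nothing, because x and x**2 are uncorrelated.
f_scores, _ = f_regression(X, y)

# Non-linear criterion: k-nearest-neighbour estimate of the mutual information.
mi_scores = mutual_info_regression(X, y, random_state=0)

for name, f_val, mi_val in zip(["x_quadratic", "x_noise"], f_scores, mi_scores):
    print(f"{name:12s} F-score = {f_val:8.3f}   MI = {mi_val:.3f}")
```

In the taxi data the two criteria mostly agree because the fare is close to linear in trip_distance; the coordinates, whose effect is not linear, are where MI and the F-test diverge.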

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd


X = df.drop(columns=['fare_amount'])
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features)
    ])

X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

feature_names_num = numeric_features
feature_names_cat = []
if len(categorical_features) > 0:
    feature_names_cat = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
feature_names = feature_names_num + feature_names_cat

mi_scores = mutual_info_regression(X_train_transformed, y_train, random_state=42)

mi_df = pd.DataFrame({'Feature': feature_names, 'MI_Score': mi_scores})
mi_df = mi_df.sort_values('MI_Score', ascending=False)

print("Mutual Information scores, sorted in descending order:")
print(mi_df)

K = 10
top_k_features = mi_df.head(K)['Feature'].tolist()
top_k_indices = [feature_names.index(f) for f in top_k_features]

X_train_selected = X_train_transformed[:, top_k_indices]
X_test_selected = X_test_transformed[:, top_k_indices]

print(f"\nTop {K} selected features:")
print(top_k_features)

model = LinearRegression()
model.fit(X_train_selected, y_train)

y_pred = model.predict(X_test_selected)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

Mutual Information scores, sorted in descending order:
                Feature  MI_Score
2         trip_distance  1.396470
7      dropoff_latitude  0.118202
6     dropoff_longitude  0.114360
3      pickup_longitude  0.114099
5            RateCodeID  0.107425
4       pickup_latitude  0.092537
8          payment_type  0.013859
0              VendorID  0.003326
9  store_and_fwd_flag_Y  0.000417
1       passenger_count  0.000000

Top 10 selected features:
['trip_distance', 'dropoff_latitude', 'dropoff_longitude', 'pickup_longitude', 'RateCodeID', 'pickup_latitude', 'payment_type', 'VendorID', 'store_and_fwd_flag_Y', 'passenger_count']

MAE: 1.7477
RMSE: 3.9604
R²: 0.8481


## Random Forest Regressor and Feature Importances

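Here we train a Random Forest model using the preprocessing pipeline.
Tree-based models compute impurity-based feature importances: each feature is credited with the total reduction in the split criterion (here, squared error) achieved by the splits that use it, averaged over all trees and normalized to sum to 1.
These importances capture non-linear effects and interactions, but they are computed on the training data and can be inflated for features with many distinct values, so they are best read as a ranking rather than as exact effect sizes.
We evaluate model performance (MAE, RMSE, R²) and display the ranked feature importances.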
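
One way to cross-check impurity-based importances is permutation importance on held-out data. The sketch below is a hedged illustration on synthetic stand-ins for a dominant distance feature, a weaker rate-code feature, and an irrelevant feature (all data and names are assumptions, not the notebook's variables); it uses scikit-learn's `permutation_importance` from `sklearn.inspection`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic stand-ins (assumption): dominant, weaker, and irrelevant features.
distance = rng.uniform(0, 20, n_samples)
rate_code = rng.integers(1, 4, n_samples)      # values 1, 2, 3
irrelevant = rng.normal(size=n_samples)
X = np.column_stack([distance, rate_code, irrelevant])
y = 2.5 * distance + 5.0 * rate_code + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)

# Impurity-based importances: computed from the training-time splits.
impurity_importance = forest.feature_importances_

# Permutation importances: drop in test R² when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

for name, imp, pimp in zip(["distance", "rate_code", "irrelevant"],
                           impurity_importance, perm.importances_mean):
    print(f"{name:10s} impurity = {imp:.3f}   permutation = {pimp:.3f}")
```

When both measures produce the same ordering, as they should on this toy problem, the impurity-based ranking can be trusted with more confidence.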

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd


X = df.drop(columns=['fare_amount'])
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)

feature_names_num = numeric_features
feature_names_cat = []
if len(categorical_features) > 0:
    feature_names_cat = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
feature_names = feature_names_num + feature_names_cat

importances = pipeline.named_steps['regressor'].feature_importances_

importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importances_df = importances_df.sort_values('Importance', ascending=False)

print("Feature importances, sorted in descending order:")
print(importances_df)

top_10_features = importances_df.head(10)['Feature'].tolist()
print("\nTop 10 most important features:")
print(top_10_features)

y_pred = pipeline.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

Feature importances, sorted in descending order:
                Feature  Importance
2         trip_distance    0.866813
5            RateCodeID    0.069846
6     dropoff_longitude    0.014733
3      pickup_longitude    0.014373
4       pickup_latitude    0.013035
7      dropoff_latitude    0.012604
8          payment_type    0.004378
1       passenger_count    0.002479
0              VendorID    0.001465
9  store_and_fwd_flag_Y    0.000274

Top 10 most important features:
['trip_distance', 'RateCodeID', 'dropoff_longitude', 'pickup_longitude', 'pickup_latitude', 'dropoff_latitude', 'payment_type', 'passenger_count', 'VendorID', 'store_and_fwd_flag_Y']

MAE: 1.2348
RMSE: 2.5978
R²: 0.9346


## Gradient Boosting Regressor and Feature Importances

Gradient Boosting is another ensemble method; it builds decision trees sequentially, each one fit to the residual errors left by the trees before it.
In this run its importances concentrate even more strongly on trip_distance than those of the Random Forest (0.90 versus 0.87).
We train a Gradient Boosting model with the same preprocessing pipeline and display the feature importances and evaluation metrics.
This allows us to check the consistency of results across different ensemble models.
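
The sequential nature of boosting can be observed with `staged_predict`, which yields the model's predictions after each added tree. The sketch below runs on synthetic data (an assumption made for illustration; `booster` and `test_rmse` are hypothetical names) and tracks the test RMSE as trees accumulate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): one linear effect and one non-linear effect.
X = rng.uniform(0, 10, size=(1000, 2))
y = 2.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

booster = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, random_state=42
)
booster.fit(X_train, y_train)

# staged_predict yields one prediction array per boosting stage (1..100 trees).
test_rmse = [
    np.sqrt(mean_squared_error(y_test, stage_pred))
    for stage_pred in booster.staged_predict(X_test)
]

for n_trees in (1, 10, 100):
    print(f"{n_trees:>3} trees: test RMSE = {test_rmse[n_trees - 1]:.3f}")
```

The same curve computed on the taxi data would indicate whether 100 estimators are enough or whether the error is still decreasing at the last stage.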

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd



X = df.drop(columns=['fare_amount'])
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)

feature_names_num = numeric_features
feature_names_cat = []
if len(categorical_features) > 0:
    feature_names_cat = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
feature_names = feature_names_num + feature_names_cat

importances = pipeline.named_steps['regressor'].feature_importances_

importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importances_df = importances_df.sort_values('Importance', ascending=False)

print("Feature importances, sorted in descending order:")
print(importances_df)

top_10_features = importances_df.head(10)['Feature'].tolist()
print("\nTop 10 most important features:")
print(top_10_features)

y_pred = pipeline.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"\nMAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

Feature importances, sorted in descending order:
                Feature  Importance
2         trip_distance    0.903787
5            RateCodeID    0.074444
6     dropoff_longitude    0.008543
7      dropoff_latitude    0.005017
3      pickup_longitude    0.002760
4       pickup_latitude    0.002671
8          payment_type    0.002337
0              VendorID    0.000222
9  store_and_fwd_flag_Y    0.000124
1       passenger_count    0.000095

Top 10 most important features:
['trip_distance', 'RateCodeID', 'dropoff_longitude', 'dropoff_latitude', 'pickup_longitude', 'pickup_latitude', 'payment_type', 'VendorID', 'store_and_fwd_flag_Y', 'passenger_count']

MAE: 1.2956
RMSE: 2.6426
R²: 0.9324


## Cross-Method Comparison and Final Feature Assessment
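After evaluating ANOVA, Mutual Information, Random Forest, and Gradient Boosting, we compare the rankings obtained from each method.
All four rank trip_distance first by a wide margin. RateCodeID and the geographical coordinates fill the next positions in every ranking, although their order varies: Mutual Information places three of the four coordinates ahead of RateCodeID, while the other methods rank RateCodeID second. VendorID, passenger_count, and store_and_fwd_flag contribute almost nothing under any method.
This agreement across linear, information-theoretic, and tree-based criteria gives confidence in the selected variables.
The team ultimately chose Random Forest as the final model: it achieved the best test metrics (MAE 1.23, RMSE 2.60, R² 0.9346), slightly ahead of Gradient Boosting (R² 0.9324) and well ahead of linear regression (R² 0.8481), and it remains robust and interpretable through its feature importances.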
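
The agreement between rankings can be quantified with a rank correlation. The sketch below is one possible way to do it, not the procedure used by the team: it copies the scores printed in the four sections above into a table, converts them to ranks, and computes the Spearman correlation as the Pearson correlation of the ranks. For ANOVA, only the store_and_fwd_flag_Y dummy is kept so that all four methods cover the same 10 features.

```python
import pandas as pd

features = [
    "trip_distance", "RateCodeID", "payment_type", "pickup_longitude",
    "dropoff_longitude", "dropoff_latitude", "pickup_latitude",
    "VendorID", "passenger_count", "store_and_fwd_flag_Y",
]

# Scores copied from the outputs printed in the sections above.
scores = pd.DataFrame({
    "ANOVA": [353996.057984, 6988.961014, 593.938775, 42.392855, 41.727704,
              41.265916, 41.257718, 13.568280, 7.533220, 2.568141],
    "MI": [1.396470, 0.107425, 0.013859, 0.114099, 0.114360,
           0.118202, 0.092537, 0.003326, 0.000000, 0.000417],
    "RandomForest": [0.866813, 0.069846, 0.004378, 0.014373, 0.014733,
                     0.012604, 0.013035, 0.001465, 0.002479, 0.000274],
    "GradientBoosting": [0.903787, 0.074444, 0.002337, 0.002760, 0.008543,
                         0.005017, 0.002671, 0.000222, 0.000095, 0.000124],
}, index=features)

# Rank 1 = most important feature for that method.
ranks = scores.rank(ascending=False).astype(int)

# Spearman correlation = Pearson correlation computed on the ranks.
spearman = ranks.corr()

print(ranks)
print(spearman.round(2))
```

With these values the pairwise correlations range from about 0.71 (ANOVA versus Mutual Information) to about 0.93 (Random Forest versus Gradient Boosting), consistent with the qualitative comparison above.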