## Final Scoring Model Methods

In this notebook I finalize which features the scoring model uses, derive the final weights, and define what the model outputs. Based on group discussions and earlier model testing, we removed the hotspot score, since it did not rank as an important feature, and brought back the sin/cos hour encoding.
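To illustrate why the sin/cos hour encoding is worth bringing back, here is a minimal sketch using only the standard library. The helper names `encode_hour` and `circular_distance` are made up for this example and are not part of the notebook's pipeline. The sketch shows that the encoding places 23:00 and 00:00 as close together as any other pair of adjacent hours, whereas the raw hour values 23 and 0 differ by 23.

```python
import math

def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def circular_distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours (hypothetical helper)."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))

# Adjacent hours are equally far apart, even across midnight.
print(circular_distance(23, 0))
print(circular_distance(0, 1))

# Opposite hours (midnight vs. noon) are as far apart as possible.
print(circular_distance(0, 12))
```

Both columns are needed together: `sin_hour` alone cannot tell 03:00 from 09:00, and `cos_hour` alone cannot tell 06:00 from 18:00.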

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename
from tkinter.filedialog import asksaveasfilename

ModuleNotFoundError: No module named 'xgboost'

## XGBoost with Boroughs, sin/cos Hour, and No Hotspot Score

In [None]:
# Load the cleaned and merged file
Tk().withdraw()
file_path = askopenfilename(title="Select Cleaned Merged CSV with Features")
df = pd.read_csv(file_path)
print("Loaded:", file_path)
print("Shape:", df.shape)

# Log-transform target for regression stability
df['log_fare_per_minute'] = np.log1p(df['fare_per_minute'])

# Also log-transform fare_per_mile if not already done
#df['log_fare_per_mile'] = np.log1p(df['fare_per_mile'])

# Final features (boroughs kept; hotspot_score and time_of_day removed)
categorical_cols = ['is_airport_trip', 'pickup_borough', 'dropoff_borough']
numeric_cols = [
    #'fare_per_mile',
    'dropoff_zone_hotness',
    'trip_duration_variability',
    #'hotspot_score',  
    'sin_hour',
    'cos_hour'
]
feature_cols = categorical_cols + numeric_cols

# Preprocess datetime and filter rows
df['pickup_date'] = pd.to_datetime(df['pickup_date'])
df = df.dropna(subset=['fare_per_minute'])  # Drop rows with missing original target

# Encode sin/cos hour
df['sin_hour'] = np.sin(2 * np.pi * df['pickup_hour'] / 24)
df['cos_hour'] = np.cos(2 * np.pi * df['pickup_hour'] / 24)
df.drop(columns=['pickup_hour', 'time_of_day'], inplace=True, errors='ignore')

# Time-based split (Jan → train, Feb → test)
train_df = df[df['pickup_date'].dt.month == 1].copy()
test_df = df[df['pickup_date'].dt.month == 2].copy()

# Preprocess categorical features
X_train_cat = pd.get_dummies(train_df[categorical_cols], drop_first=True)
X_test_cat = pd.get_dummies(test_df[categorical_cols], drop_first=True)
X_test_cat = X_test_cat.reindex(columns=X_train_cat.columns, fill_value=0)

# Final feature matrices
X_train = pd.concat([train_df[numeric_cols].reset_index(drop=True), X_train_cat.reset_index(drop=True)], axis=1)
X_test = pd.concat([test_df[numeric_cols].reset_index(drop=True), X_test_cat.reset_index(drop=True)], axis=1)

# Targets
y_train = train_df['log_fare_per_minute']  # log-transformed for training
y_test = test_df['fare_per_minute']        # original scale for evaluation

# Train model
model = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Predict and invert log transformation
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log)  # convert back to original scale

# Evaluate
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
feature_importance = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Output results
print(f"\nTime-Based Evaluation XGBoost (Train: Jan, Test: Feb):")
print(f"R² Score: {r2:.4f}")
print(f"MAE: {mae:.4f}")
print("\nTop Features:")
print(feature_importance.head(10))

Loaded: C:/diksha/Summer Sem/DataAnalysis/Data/clean with features for scoring/Cleaned_Jan_Feb_with_Hotness_and_trip_duration.csv
Shape: (5609910, 30)

Time-Based Evaluation XGBoost (Train: Jan, Test: Feb):
R² Score: 0.4335
MAE: 0.1715

Top Features:
dropoff_borough_EWR          0.309659
fare_per_mile                0.222582
is_airport_trip              0.091446
trip_duration_variability    0.085614
dropoff_borough_Brooklyn     0.059010
cos_hour                     0.050330
pickup_borough_Queens        0.038808
dropoff_borough_Queens       0.037615
sin_hour                     0.027236
pickup_borough_Brooklyn      0.025678
dtype: float32


## LightGBM with Boroughs, sin/cos Hour, and No Hotspot Score

In [None]:
# Load the cleaned and merged file
Tk().withdraw()
file_path = askopenfilename(title="Select Cleaned Merged CSV with Features")
df = pd.read_csv(file_path)
print("Loaded:", file_path)
print("Shape:", df.shape)

# Log-transform target for regression stability
df['log_fare_per_minute'] = np.log1p(df['fare_per_minute'])

# Log-transform fare_per_mile for modeling
#df['log_fare_per_mile'] = np.log1p(df['fare_per_mile'])

# Feature lists
categorical_cols = ['is_airport_trip', 'pickup_borough', 'dropoff_borough']
numeric_cols = [
    #'fare_per_mile',
    'dropoff_zone_hotness',
    'trip_duration_variability',
    #'hotspot_score',
    'sin_hour',
    'cos_hour'
]
feature_cols = categorical_cols + numeric_cols

# Date/time handling
df['pickup_date'] = pd.to_datetime(df['pickup_date'])
df = df.dropna(subset=['fare_per_minute'])

# Add sin/cos hour encodings
df['sin_hour'] = np.sin(2 * np.pi * df['pickup_hour'] / 24)
df['cos_hour'] = np.cos(2 * np.pi * df['pickup_hour'] / 24)
df.drop(columns=['pickup_hour', 'time_of_day'], inplace=True, errors='ignore')

# Train/test split
train_df = df[df['pickup_date'].dt.month == 1].copy()
test_df = df[df['pickup_date'].dt.month == 2].copy()

# One-hot encoding for categorical features
X_train_cat = pd.get_dummies(train_df[categorical_cols], drop_first=True)
X_test_cat = pd.get_dummies(test_df[categorical_cols], drop_first=True)
X_test_cat = X_test_cat.reindex(columns=X_train_cat.columns, fill_value=0)

# Final input matrices
X_train = pd.concat([train_df[numeric_cols].reset_index(drop=True), X_train_cat.reset_index(drop=True)], axis=1)
X_test = pd.concat([test_df[numeric_cols].reset_index(drop=True), X_test_cat.reset_index(drop=True)], axis=1)

# Targets
y_train = train_df['log_fare_per_minute']
y_test = test_df['fare_per_minute']  # Original scale for evaluation

# Train LightGBM model
lgb_model = LGBMRegressor(n_estimators=100, random_state=42, n_jobs=-1)
lgb_model.fit(X_train, y_train)

# Predict and invert log transform
y_pred_log = lgb_model.predict(X_test)
y_pred = np.expm1(y_pred_log)

# Evaluate performance
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
feature_importance = pd.Series(lgb_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Output
print(f"\nTime-Based Evaluation LightGBM (Train: Jan, Test: Feb):")
print(f"R² Score: {r2:.4f}")
print(f"MAE: {mae:.4f}")
print("\nTop Features:")
print(feature_importance.head(10))

Loaded: C:/diksha/Summer Sem/DataAnalysis/Data/clean with features for scoring/Cleaned_Jan_Feb_with_Hotness_and_trip_duration.csv
Shape: (5609910, 30)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.076811 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 824
[LightGBM] [Info] Number of data points in the train set: 2871008, number of used features: 15
[LightGBM] [Info] Start training from score 0.806685

Time-Based Evaluation LightGBM (Train: Jan, Test: Feb):
R² Score: 0.4167
MAE: 0.1742

Top Features:
fare_per_mile                1211
trip_duration_variability     506
dropoff_zone_hotness          370
cos_hour                      241
sin_hour                      226
is_airport_trip               104
pickup_borough_Queens          74
dropoff_borough_EWR            61
dropoff_borough_Manhattan      61
dropoff_borough_Brooklyn       48
d

## Getting Feature Importance Values

Since we have both LightGBM and XGBoost models, the approach we use to derive the weights is to normalize each model's feature importances and average them.

We normalize each model's importances so that they sum to 1 and then average them across the models (this approach was suggested by ChatGPT). The combined values give the final weight (%) to apply to each feature in the scoring equation.
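The normalize-and-average step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's actual cell: the function names `normalize` and `average_importances` and the toy feature names are hypothetical, and a feature that one model does not report is assumed to count as 0 for that model.

```python
def normalize(importances):
    """Scale a {feature: importance} dict so its values sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}

def average_importances(*models):
    """Average normalized importances over the union of all features.

    Assumption: a feature absent from a model counts as 0 for that model.
    """
    normalized = [normalize(m) for m in models]
    features = set().union(*normalized)
    return {
        name: sum(m.get(name, 0.0) for m in normalized) / len(normalized)
        for name in features
    }

# Hypothetical toy importances on two different scales.
model_a = {"feature_a": 0.6, "feature_b": 0.4}                    # already sums to 1
model_b = {"feature_a": 30, "feature_b": 50, "feature_c": 20}     # raw split counts

weights = average_importances(model_a, model_b)
for name in sorted(weights):
    print(name, round(weights[name], 3))
```

Because every model is normalized before averaging, the combined weights also sum to 1. One caveat worth keeping in mind: the two libraries may measure importance differently (split counts for LightGBM, as noted in the code comments, versus what is, to my understanding, a gain-based default in XGBoost), so the average mixes two different notions of importance.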

In [None]:
# XGBoost importances copied from the model output (normalized over all features,
# so this subset with fare_per_mile commented out does not sum to 1)
xgb_importances = {
    'dropoff_borough_EWR': 0.309659,
    #'fare_per_mile': 0.222582,
    'is_airport_trip': 0.091446,
    'trip_duration_variability': 0.085614,
    'dropoff_borough_Brooklyn': 0.059010,
    'cos_hour': 0.050330,
    'pickup_borough_Queens': 0.038808,
    'dropoff_borough_Queens': 0.037615,
    'sin_hour': 0.027236,
    'pickup_borough_Brooklyn': 0.025678,
    'dropoff_zone_hotness': 0.0
}

# LightGBM raw split count importances from the model
lgb_importances = {
    #'fare_per_mile': 1211,
    'trip_duration_variability': 506,
    'dropoff_zone_hotness': 370,
    'cos_hour': 241,
    'sin_hour': 226,
    'is_airport_trip': 104,
    'pickup_borough_Queens': 74,
    'dropoff_borough_EWR': 61,
    'dropoff_borough_Manhattan': 61,
    'dropoff_borough_Brooklyn': 48
}

# Convert to Series
xgb_series = pd.Series(xgb_importances)
lgb_series = pd.Series(lgb_importances)

# Normalize LightGBM to sum to 1
lgb_normalized = lgb_series / lgb_series.sum()

# Combine and average; a feature missing from one model counts as 0 for that model
combined_df = pd.concat([xgb_series, lgb_normalized], axis=1, keys=['xgb', 'lgb_norm']).fillna(0)
combined_df['avg_importance'] = combined_df.mean(axis=1)

# Sort by average importance
final_weights = combined_df['avg_importance'].sort_values(ascending=False)

# Display result
print("\n Final Combined Feature Importances:\n")
print(final_weights)



 Final Combined Feature Importances:

fare_per_mile                0.319940
dropoff_borough_EWR          0.165339
trip_duration_variability    0.129988
cos_hour                     0.066688
dropoff_zone_hotness         0.063749
is_airport_trip              0.063642
sin_hour                     0.052557
dropoff_borough_Brooklyn     0.037775
pickup_borough_Queens        0.032154
dropoff_borough_Queens       0.018808
pickup_borough_Brooklyn      0.012839
dropoff_borough_Manhattan    0.010510
Name: avg_importance, dtype: float64


Columns such as `pickup_borough_Queens` or `dropoff_borough_Brooklyn` are how the model understands where a trip starts or ends.

They are created by one-hot encoding the `pickup_borough` and `dropoff_borough` values in the data.

We need to keep them in the scoring model because the model learned to associate certain boroughs with higher or lower profitability, and each of these one-hot columns has a weight based on how much it contributed during training.
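For the borough weights to have any effect at scoring time, the one-hot columns have to be rebuilt from the raw borough values before the weights are applied. The following is a minimal sketch of that step for a single trip; the helper `one_hot_features` is hypothetical, and the three weights are simply copied from the combined importances above for illustration.

```python
def one_hot_features(trip, weights,
                     categorical_cols=("pickup_borough", "dropoff_borough")):
    """Return {feature: 0 or 1} for every weighted one-hot column.

    A feature such as 'dropoff_borough_EWR' is 1 only when the trip's
    'dropoff_borough' value is 'EWR'. Non-categorical features are skipped.
    """
    encoded = {}
    for feature in weights:
        for col in categorical_cols:
            prefix = col + "_"
            if feature.startswith(prefix):
                category = feature[len(prefix):]
                encoded[feature] = int(trip.get(col) == category)
    return encoded

weights = {
    "dropoff_borough_EWR": 0.165339,
    "pickup_borough_Queens": 0.032154,
    "cos_hour": 0.066688,
}
trip = {"pickup_borough": "Queens", "dropoff_borough": "Manhattan"}

print(one_hot_features(trip, weights))
# {'dropoff_borough_EWR': 0, 'pickup_borough_Queens': 1}
```

On a full DataFrame the same effect could be obtained with `pd.get_dummies` followed by a `reindex` on the weighted column names, mirroring how `X_test_cat` is aligned to `X_train_cat` earlier in the notebook.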

## Scoring Function

This is the final step: a function that outputs the score.

In [None]:
# Final combined weights (replace with values from your final feature importance)
combined_weights = {
    #'fare_per_mile': 0.319940,
    'dropoff_borough_EWR': 0.165339,
    'trip_duration_variability': 0.129988,
    'cos_hour': 0.066688,
    'dropoff_zone_hotness': 0.063749,
    'is_airport_trip': 0.063642,
    'sin_hour': 0.052557,
    'dropoff_borough_Brooklyn': 0.037775,
    'pickup_borough_Queens': 0.032154,
    'dropoff_borough_Queens': 0.018808,
    'pickup_borough_Brooklyn': 0.012839,
    'dropoff_borough_Manhattan': 0.010510
}

#  Create a copy of the working DataFrame
df_scoring = df.copy()

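#  Add any weighted feature missing from df_scoring as a column of 0s.
#  Note: df holds raw pickup_borough/dropoff_borough values, not the one-hot
#  columns built for X_train, so unless those columns are created first the
#  borough weights contribute 0 to the score.
for feature in combined_weights:
    if feature not in df_scoring.columns:
        df_scoring[feature] = 0  # Missing column: this feature adds nothing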

#  Define scoring function
def calculate_score(row, weights):
    score = 0
    for feature, weight in weights.items():
        score += row.get(feature, 0) * weight
    return score

#  Apply scoring function
df_scoring['predicted_score'] = df_scoring.apply(lambda row: calculate_score(row, combined_weights), axis=1)

#  Normalize predicted score to 0–1
scaler = MinMaxScaler()
df_scoring['final_score'] = scaler.fit_transform(df_scoring[['predicted_score']])

#  Preview scores
print("\nPreview of predicted scores:")
print(df_scoring[['predicted_score', 'final_score']].head())

#  show best/worst trips
print("\nTop 5 Most Profitable Trips:")
print(df_scoring.sort_values('final_score', ascending=False).head())

print("\nBottom 5 Least Profitable Trips:")
print(df_scoring.sort_values('final_score').head())


Preview of predicted scores:
   predicted_score  final_score
0        38.117309     0.107897
1        17.270233     0.048792
2        33.225518     0.094028
3         1.928905     0.005297
4        38.053078     0.107715

Top 5 Most Profitable Trips:
              tpep_pickup_datetime      tpep_dropoff_datetime  trip_distance  \
1799836  2023-02-19 01:39:07-05:00  2023-02-19 01:40:37-05:00           0.04   
415841   2023-02-05 01:29:56-05:00  2023-02-05 01:31:41-05:00           0.10   
2748127  2023-01-01 01:47:04-05:00  2023-01-01 01:48:55-05:00           0.10   
5364675  2023-01-29 01:44:21-05:00  2023-01-29 01:45:53-05:00           0.10   
416377   2023-02-05 01:39:05-05:00  2023-02-05 01:41:33-05:00           0.10   

         fare_amount  trip_duration_min pickup_date  pickup_day_of_week  \
1799836          3.7           1.500000  2023-02-19                   6   
415841           3.7           1.750000  2023-02-05                   6   
2748127          3.7           1.850000  2

## Display Final Results (Top 5 and Bottom 5)

In [None]:
from IPython.display import display

# Columns to show for presentation
cols = [
    'tpep_pickup_datetime', 'tpep_dropoff_datetime',
    'trip_distance', 'fare_amount', 'trip_duration_min',
    'pickup_borough', 'pickup_zone',
    'dropoff_borough', 'dropoff_zone',
    'predicted_score', 'final_score'
]

# Sort and select top/bottom
top5 = df_scoring.sort_values('final_score', ascending=False).head(5)[cols]
bottom5 = df_scoring.sort_values('final_score').head(5)[cols]

# Display with styled headers
print("\n Top 5 Most Profitable Trips")
display(top5.style.set_caption("Top 5 Trips").format({
    'trip_distance': '{:.2f}',
    'fare_amount': '${:.2f}',
    'trip_duration_min': '{:.1f} min',
    'predicted_score': '{:.2f}',
    'final_score': '{:.2f}'
}))

print("\n Bottom 5 Least Profitable Trips")
display(bottom5.style.set_caption("Bottom 5 Trips").format({
    'trip_distance': '{:.2f}',
    'fare_amount': '${:.2f}',
    'trip_duration_min': '{:.1f} min',
    'predicted_score': '{:.2f}',
    'final_score': '{:.2f}'
}))



🟢 Top 5 Most Profitable Trips


|  | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | fare_amount | trip_duration_min | pickup_borough | pickup_zone | dropoff_borough | dropoff_zone | predicted_score | final_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1799836 | 2023-02-19 01:39:07-05:00 | 2023-02-19 01:40:37-05:00 | 0.04 | $3.70 | 1.5 min | Manhattan | East Village | Manhattan | East Village | 352.78 | 1.0 |
| 415841 | 2023-02-05 01:29:56-05:00 | 2023-02-05 01:31:41-05:00 | 0.1 | $3.70 | 1.8 min | Manhattan | East Village | Manhattan | East Village | 335.02 | 0.95 |
| 2748127 | 2023-01-01 01:47:04-05:00 | 2023-01-01 01:48:55-05:00 | 0.1 | $3.70 | 1.9 min | Manhattan | East Village | Manhattan | East Village | 335.02 | 0.95 |
| 5364675 | 2023-01-29 01:44:21-05:00 | 2023-01-29 01:45:53-05:00 | 0.1 | $3.70 | 1.5 min | Manhattan | East Village | Manhattan | East Village | 335.02 | 0.95 |
| 416377 | 2023-02-05 01:39:05-05:00 | 2023-02-05 01:41:33-05:00 | 0.1 | $3.70 | 2.5 min | Manhattan | East Village | Manhattan | East Village | 335.02 | 0.95 |



🔴 Bottom 5 Least Profitable Trips


|  | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance | fare_amount | trip_duration_min | pickup_borough | pickup_zone | dropoff_borough | dropoff_zone | predicted_score | final_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2439746 | 2023-02-25 18:36:51-05:00 | 2023-02-25 19:18:46-05:00 | 19.4 | $3.00 | 41.9 min | Manhattan | Upper East Side North | EWR | Newark Airport | 0.06 | 0.0 |
| 2189341 | 2023-02-23 14:02:31-05:00 | 2023-02-23 14:41:19-05:00 | 4.2 | $3.00 | 38.8 min | Manhattan | Greenwich Village South | Brooklyn | Stuyvesant Heights | 0.14 | 0.0 |
| 2104189 | 2023-02-22 16:42:25-05:00 | 2023-02-22 19:20:19-05:00 | 17.0 | $21.50 | 157.9 min | Brooklyn | East Flatbush/Farragut | Brooklyn | Ocean Parkway South | 0.33 | 0.0 |
| 3433171 | 2023-01-09 13:10:09-05:00 | 2023-01-09 14:57:41-05:00 | 14.7 | $15.20 | 107.5 min | Queens | Jamaica | Queens | College Point | 0.38 | 0.0 |
| 2844813 | 2023-01-02 17:01:51-05:00 | 2023-01-02 17:39:43-05:00 | 11.5 | $18.20 | 37.9 min | Queens | Forest Hills | Queens | Rosedale | 0.44 | 0.0 |


| Column            | Meaning                                                                |
| ----------------- | ---------------------------------------------------------------------- |
| `predicted_score` | The raw, weighted sum based on the chosen features and their weights. |
| `final_score`     | The **normalized version**, scaled between 0 and 1 for comparison.     |
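
The relationship between the two columns is plain min-max scaling: each raw score is shifted by the minimum and divided by the range. The sketch below reproduces that arithmetic without scikit-learn; the function name `min_max_scale` and the four raw scores are hypothetical values chosen for illustration, not rows from the dataset.

```python
def min_max_scale(scores):
    """Rescale scores to the 0-1 range: (s - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores identical. Returning 0.0 is a choice
        # made for this sketch.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

raw = [2.0, 17.0, 38.0, 352.0]  # hypothetical raw weighted sums
scaled = min_max_scale(raw)
print([round(s, 3) for s in scaled])
# [0.0, 0.043, 0.103, 1.0]
```

Note how a single very large raw score compresses every other trip toward 0. This is worth keeping in mind when reading `final_score`, since the scale is set entirely by the smallest and largest raw scores in the data.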


How to Use These Scores:

- `predicted_score` is useful when we want to explain how a ride's components added up.
- `final_score` is the actionable version that we can rank or display in the app (e.g., “Choose this ride: it scores 0.82!”).