# Ridefare Dynamics and Predictive Pricing

**Name(s)**: Ken Fukutomi

**Website Link**: (your website link)

In [1]:
%%capture
!pip install walkscore_api
!pip install folium
# `statistics` is part of the Python standard library, so it needs no install
!pip install python-dotenv
!pip install tabulate

In [2]:
import pandas as pd
import numpy as np
import re
import datetime as dt
import geopandas as gpd
from shapely.geometry import Point
import statistics

# Misc:
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
from lec_utils import *

In [3]:
# Anything that might be useful.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

## Step 1: Introduction

In [4]:
# my urban analytics related components
import os
from dotenv import load_dotenv, find_dotenv
from walkscore import WalkScoreAPI
import folium
from folium.plugins import FastMarkerCluster

## Read the WalkScore API key from the local environment (.env file)
load_dotenv(find_dotenv())
key = os.getenv('WALKSCORE_API')
assert key, "WALKSCORE_API is not set"

In [None]:
# Reading in my dataset, from Kaggle.
df = pd.read_csv("riders.csv")
col_list = df.columns.unique().tolist()
assert(len(col_list) == df.shape[1])

print(df.shape[0], ',', df.shape[1])

As I begin to explore this dataset, several questions come to mind that would be useful in a real‑world context:

- During peak hours, should I choose Uber or Lyft to minimize my fare? 
- How does surge pricing vary by time of day and by service?
- What is the relationship between trip distance and price for each cab type?
- Do weather conditions (e.g., temperature, UV index) affect average fares in Boston?
- Which neighborhoods tend to incur higher or lower ride costs?
- Where are the trip hotspots in Boston? Do they originate in areas with a younger population, or with more working commuters?

Specifically, for the predictive portion of this project, I will first establish a baseline model and then refine it into a final, improved model to better understand and predict trip fares based on various factors in the dataset.

**Dataset overview:**  
- **Number of rows:** 693,071 
- **Relevant columns:**  
  - `timestamp` / `hour` (ride timing)  
  - `cab_type` (Uber vs. Lyft)  
  - `price` (fare cost)  
  - `surge_multiplier` (demand pricing)  
  - `distance` (trip length)  
  - `source` / `destination` (pickup and drop-off areas)  
  - `product_id` (type of ride service: Shared, XL, Comfort, etc.)

## Step 2: Data Cleaning and Exploratory Data Analysis

Cleaning:
1. Probabilistic Imputation / Distance-Based Imputation for NaN
2. Standardize Uber/Lyft product types (regex-match UUID-style IDs and map them to readable names)
3. Combine Metric(s) into One, Simplify
4. Standardize Datetime Column
5. Drop duplicate and unnecessary columns
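
Step 4 is worth a quick illustration, because the `timestamp` column stores Unix epoch seconds. The sketch below uses a made-up epoch value (an assumption, not a row taken from the dataset) to show why `unit='s'` matters: without it, pandas reads the integer as nanoseconds and the date collapses to 1970.

```python
import pandas as pd

# Hypothetical epoch-seconds value, chosen for illustration only.
ts = pd.Series([1544952608])

as_seconds = pd.to_datetime(ts, unit='s')  # read as seconds since 1970-01-01
as_default = pd.to_datetime(ts)            # read as nanoseconds since 1970-01-01

print(as_seconds.dt.year.iloc[0], as_default.dt.year.iloc[0])

# Weekend flag from the correctly parsed timestamp (Monday=0 ... Sunday=6).
is_weekend = (as_seconds.dt.weekday >= 5).astype(int)
```

Any time-derived feature (hour, weekday, weekend flag) inherits this choice, so it is worth checking the parsed year once before building on it.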

In [None]:
# Let's see which variables have missing (NaN) values
## Cleaning 1

# 1. Summary of Missing Values.
missing_counts = df.isna().sum()
missing_counts = (
    missing_counts[missing_counts > 0]
    .sort_values(ascending=False)
)

nan_summary = pd.DataFrame({
    'Missing Count': missing_counts.astype(int),
    '% NaN': (missing_counts / len(df) * 100).round(2)
})
print(nan_summary, '\n')
nulls = df[df['price'].isna()]
target = ('distance') #let's save for now.

## Imputing for Prices.
# 2. Probabilistic imputation: fill each NaN with a value sampled
#    (uniformly, with replacement) from the observed values
def impute_prob(s: pd.Series) -> pd.Series:
    local = s.copy()
    num_missing = local.isna().sum()
    sample = np.random.choice(local.dropna(), num_missing)
    local.loc[local.isna()] = sample
    return local

prob_imputed = impute_prob(df['price'])
##print(f"Pre-Imputation Mean: {df['price'].mean()}")
print(f"Computed Probabilistic Mean: ${prob_imputed.mean():.2f}")

# 3. Let's try to impute based on distance:
df['dist_bin'] = pd.qcut(df['distance'], q=5, duplicates='drop')
#print(f"{df['dist_bin']}")
median_by_qbin = (
    df
    .groupby('dist_bin', observed=False)
    ['price']
    .transform('median')
)

distance_imputed = (df['price'].fillna(median_by_qbin))
mean_distance = distance_imputed.mean()
print(f'Distance-Based Imputed Mean: ${mean_distance:.2f}')
diff = abs(mean_distance - prob_imputed.mean())
print(f"Absolute difference: ${diff:.2f}")


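# Keep the pre-imputation prices for the comparison plot, then apply the imputation
price_original = df['price'].copy()
df['price'] = prob_imputed

In [None]:
df_plot = pd.DataFrame({
    'Original': price_original.dropna(),
    'Probabilistic Impute': prob_imputed,
    'Distance-Based Impute': distance_imputed
})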

df_melt = (
    df_plot
    .melt(var_name='Method', value_name='price')
)
fig = px.histogram(
    df_melt,
    x='price',
    color='Method',
    histnorm='density',
    nbins=40,
    barmode='overlay',
    opacity=0.5,
    title='Fare Price Distribution: Pre vs Post-Imputation'
)
fig.show()
fig.write_html("assets/fare-imputation-comparison.html", include_plotlyjs="cdn")

Given that the two imputation methods yield nearly identical distributions and the overall percentage of missing values is low, we can safely conclude that the choice of imputation strategy will not materially affect our analysis. Therefore, we adopt a probabilistic imputation approach, sampling each missing value from the empirical distribution of observed values:

$$
x_i^{(\mathrm{imputed})} = x_j,\quad
j \sim \mathrm{Uniform}\bigl(\{k \mid x_k \neq \mathrm{NaN}\}\bigr).
$$

Equivalently, each observed value has **equal probability**:

$$
P\bigl(x_i^{(\mathrm{imputed})} = x_j\bigr)
= \frac{1}{n_{\mathrm{observed}}}.
$$

Either way, the choice of method does not change the overall distribution. I could have left the missing values as NaN, but I preferred to impute them so that the `price` column is complete, since pricing becomes relevant later.
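
To make the sampling rule above concrete, here is a minimal sketch on a toy Series. The fare values and the fixed seed are my own assumptions for illustration; the notebook itself calls `np.random.choice` without a seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the draw is reproducible (assumption)

# Toy fares with two missing entries; values are made up for illustration.
toy = pd.Series([10.0, 12.5, np.nan, 7.0, np.nan, 22.0])

observed = toy.dropna().to_numpy()
missing = toy.isna()

# Draw one observed value, uniformly and with replacement, for each NaN.
filled = toy.copy()
filled[missing] = rng.choice(observed, size=int(missing.sum()))

print(filled.tolist())
```

Observed entries are left untouched and every filled entry is one of the observed values, which is why the imputed column keeps roughly the same distribution.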

In [None]:
## Cleaning 2, Expanding/Closing 'data'
# Cleaning Product ID Column

def normalize_product_ids(df, col='product_id'):

    # UUID regex pattern
    uuid_re = re.compile(
        r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$',
        re.IGNORECASE
    )

    # UUID --> lyft_* mapping
    uuid_to_name = {
        '6f72dfc5-27f1-42e8-84db-ccc7a75f6969': 'lyft_premier',
        '9a0e7b09-b92b-4c41-9779-2ad22b4d779d': 'lyft',
        '6d318bcc-22a3-4af6-bddd-b409bfce1546': 'lyft_luxsuv',
        '6c84fd89-3f11-4782-9b50-97c468b19529': 'lyft_plus',
        '8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a': 'lyft_lux',
        '55c66225-fbe7-4fd5-9072-eab1ece5e23e': 'lyft_line',
        '997acbb5-e102-41e1-b155-9df7de0a73f2': 'lyft_shared'
    }

    df['product_id_clean'] = df[col].astype(str).str.strip().str.lower()
    mask = (~df['product_id_clean'].str.match(uuid_re))
    problematic = (
        df
        .loc[mask, 'product_id_clean']
        .unique()
        .tolist()
    )
    res_init = (df[col].unique().tolist())
    print("Before replacement:\n", res_init, '\n')

    # Output diffs:
    df['product_id_clean'] = df['product_id_clean'].replace(uuid_to_name)

    res = df['product_id_clean'].unique().tolist()
    print("After cleaning:\n", res, '\n')
    return df

df = normalize_product_ids(df)
df.drop(columns='product_id', inplace=True)
df.rename(columns={'product_id_clean' : 'product_id'}, inplace=True)

In [None]:
## Clean DF
# Aggregate total price and ride count per product and name
agg2 = (
    df.groupby(['product_id', 'name'])['price']
    .agg(total_price='sum', ride_count='count')
    .reset_index()
)

display(agg2)

# name, product_id are the same.
df = df.drop(['product_id'],axis=1)
clean_r=df['name'].unique().tolist()
print(f'Final Cleaning of product_id: {clean_r}')

In [10]:
## Cleaning 2, Expanding/Closing 'data'
# Continued, Feature Engineering

# repeat column, convert to datetime format.
df.drop(columns=['visibility.1'], inplace=True)
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')
assert isinstance(df['datetime'].iloc[0], dt.datetime)

In [11]:
## Create a common departure time column based on source
df['hour'] = df['datetime'].dt.hour

most_common_hour = (
    df.groupby('source')['hour']
      .agg(lambda x: x.mode().iloc[0])
      .to_dict()
)

df['most_common'] = df['source'].map(most_common_hour)
#df['most_common'].unique().tolist()

## Sanity check before deriving anything from distance
assert df['distance'].shape[0] == df.shape[0]

# estimated_duration = (distance_miles / avg_speed) * 60
# skipped: a duration derived this way creates multicollinearity with distance

# next: remove unused columns

In [12]:
df = df.drop(columns=[
    'id', 'timezone', 'short_summary', 'long_summary',
    'windGustTime', 'temperatureHighTime', 'temperatureLowTime',
    'temperatureHigh', 'temperatureLow', 'humidity',
    'dewPoint', 'uvIndex', 'ozone', 'moonPhase',
    'temperatureMinTime', 'apparentTemperatureMax',
    'apparentTemperatureMaxTime', 'apparentTemperatureMin',
    'apparentTemperatureMinTime'
])

# above are all low-correlation figures.

In [None]:
relevant_columns = ['hour', 'distance', 'price', 'cab_type', 'name', 'destination']
print(df[relevant_columns].head().to_markdown(index=False))

Average Boston Driving Speed: https://www.cbsnews.com/boston/news/boston-traffic-study-2024/#:~:text=INRIX%20found%20that%20the%20average,second%2Dslowest%20in%20the%20country.


## Univariate Visualizations

In [None]:
## Univariate Mapping/Visualization
# VISUALIZATION 1: Trip Distance Distribution

fig = px.histogram(
    df,
    x='distance',
    nbins=50,
    title='Trip Distance Distribution',
    labels={'distance':'Distance (in miles)'},
    #log_y=True,
    color_discrete_sequence=['indianred'],
)

fig.update_layout(
    xaxis_title='Distance (in miles)',
    yaxis_title='Frequency',
    bargap=0.05
)
fig.show()
fig.write_html("assets/distance.html", include_plotlyjs="cdn")

In [None]:
def log_transform(df, col:str):
    # plot w/ logarithmic transformation
    key=(col+'_log')
    df[key] = np.log1p(df[col])
    fig = px.histogram(
        df,
        x=key,
        nbins=50,
        title='Log-Transformed Trip Distance Distribution',
        labels={key: f'Log({col} + 1)'}
    )
    fig.update_layout(
        xaxis_title=f'Log({col} + 1)',
        yaxis_title='Frequency',
        bargap=0.05
    )
    return fig

fig2 = log_transform(df, 'distance')
fig2.show()
fig2.write_html("assets/log.html", include_plotlyjs="cdn")

In [None]:
## Univariate Mapping/Visualization
# VISUALIZATION 2: Count of Trip per Hour

trip_cn = (
    df['hour']
    .value_counts()
    .sort_index()
    .reset_index()
) #trip_cn

trip_cn.columns = ['hour', 'trip_count']

fig = px.bar(
    trip_cn,
    x='hour',
    y='trip_count',
    title='Trip Counts per Hour of Day',
    labels={
        'hour': 'Hour of the Day',
        'trip_count': 'Frequency of Trip(s)'
    },
    color_discrete_sequence=['purple'],
    log_y=False
)

fig.update_layout(
    xaxis=dict(dtick=1),
    title_font_size=15,
)

fig.show()

Overall, trip counts are scattered across the day, but from just past midnight until about 8 a.m. the total number of routed trips across Boston declines. It would be more informative to look at the distribution of departures for each POI (place of interest), accounting for factors such as access to public transit or the walkability of the surrounding area. People may be less inclined to book a ride for a short trip and choose an alternative instead. This is also why I considered deriving a **duration** column from trip distance, though I left it out because it would be collinear with distance.
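
A duration estimate derived from distance alone is just a rescaled copy of distance, which the sketch below makes concrete. The constant speed and the toy distances are placeholder assumptions, not values from the dataset.

```python
import pandas as pd

AVG_SPEED_MPH = 20.0  # hypothetical constant speed, for illustration only

# Toy trip distances in miles.
trips = pd.DataFrame({'distance': [0.5, 1.2, 2.4, 3.1, 5.0]})

# estimated duration in minutes = (distance / speed) * 60
trips['est_duration_min'] = trips['distance'] / AVG_SPEED_MPH * 60

# Pearson correlation between the two columns.
corr = trips['distance'].corr(trips['est_duration_min'])
print(round(corr, 6))
```

Because the correlation is perfect, feeding both columns to a linear model adds no information and makes the coefficients unstable; a duration feature would only help if it came from an independent source such as observed traffic speeds.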

In [17]:
## Univariate Mapping/Visualization
# VISUALIZATION 3: Trip Frequency by Random Origin

x = df['source'].unique().tolist()
loc = np.random.choice(x)
subset = df[df['source'] == loc]

# build my base map
center_lat = subset['latitude'].mean()
center_lon = subset['longitude'].mean()
trip_map = folium.Map(
    location=[center_lat, center_lon], 
    zoom_start=14,
    tiles='CartoDB Dark_Matter',
)

# cluster markers
coors = (
    subset[[
        'latitude',
        'longitude'
    ]]
    .dropna()
    .values.tolist()
)
FastMarkerCluster(coors).add_to(trip_map)

trip_map.save("assets/trip-frequency-map.html")
trip_map  # last expression in the cell, so the map renders inline

## Bivariate Analysis / Interesting Aggregates

In [None]:
## Bivariate Mapping/Visualization
# VISUALIZATION 1: Fare Prices by Cab Type (Uber, Lyft)

fig = px.box(
    df,
    y='cab_type',
    x='price',
    color='cab_type',
    color_discrete_sequence=['blue', 'orange'],
    title='Fare Price by Cab Type (Uber | Lyft)',
    labels={'cab_type': 'Cab Type', 'price': 'Price/Fare ($ USD)'},
    log_x=False
)

fig.update_layout(showlegend=True)
fig.show()

In [None]:
## Bivariate Mapping/Visualization
# VISUALIZATION 2: Heatmap correlation of features

numeric_cols = [
    'hour', 'day', 'month', 'price', 'distance', 'surge_multiplier',
    'latitude', 'longitude', 'temperature', 'apparentTemperature',
    'precipIntensity', 'precipProbability', 'windSpeed', 'windGust',
    'visibility', 'apparentTemperatureHigh', 'apparentTemperatureLow',
    'pressure', 'windBearing', 'cloudCover', 'precipIntensityMax',
    'temperatureMin', 'temperatureMax', 'distance_log'
]

corr_matrix = df[numeric_cols].corr()

fig = px.imshow(
    corr_matrix,
    text_auto=True,
    aspect="auto",
    title="Feature Correlation Heatmap",
    color_continuous_scale="Viridis"
)
fig.show()
fig.write_html("assets/corr.html", include_plotlyjs="cdn")

In [None]:
## Interesting Aggregations:
agg_split = (
    df.groupby(['destination', 'cab_type'])[['price', 'distance']]
      .agg(['mean', 'median'])
      .round(2)
      .sort_values(('price', 'median'), ascending=False)
)

agg_split.columns = ['_'.join(col) for col in agg_split.columns]
agg_split = agg_split.reset_index()

# sort_values returns a new frame, so the result has to be assigned
uber = agg_split[agg_split['cab_type'] == 'Uber'].sort_values('price_median', ascending=False)
lyft = agg_split[agg_split['cab_type'] == 'Lyft'].sort_values('price_median', ascending=False)
display(lyft)
display(uber)

## Bivariate Mapping #3
fig = px.bar(
    agg_split,
    x='destination',
    y='price_median',
    color='cab_type',
    barmode='group',
    text='price_median',
    facet_col='cab_type',
    title='Median Trip Price by Destination and Cab Type',
    labels={'price_median': 'Median Price ($)','destination': 'Destination'},
)
fig.update_layout(showlegend=False, bargap=0)
fig.show()

In [None]:
def compute_walkscore(df: pd.DataFrame, df2: pd.DataFrame) -> dict:
    # df:  per-destination aggregates (needs 'destination' and 'price_mean')
    # df2: trip-level rows (needs 'destination', 'latitude', 'longitude')

    def top_bottom_destinations(agg):
        # three most expensive and three cheapest destinations by mean price
        top = (
            agg
            .sort_values('price_mean',ascending=False)
            .head(3)['destination']
            .tolist()
        )
        bottom = (
            agg
            .sort_values('price_mean',ascending=True)
            .head(3)['destination']
            .tolist()
        )
        return top, bottom
    
    tups = top_bottom_destinations(df)
    # tups[0], top
    # tups[1], bottom

    def compute_scores(tups, df2, key):

        walkscore_api = WalkScoreAPI(api_key=key)
        scores = {}

        for i in range(2): # 0 = top 3, 1 = bottom 3
            bike_scores = []
            walk_scores = []

            for location in tups[i]:
                match = df2[df2['destination'] == location][['latitude', 'longitude']].head(1)
                if not match.empty:
                    lat, lon = match.iloc[0]
                    result = walkscore_api.get_score(latitude=lat, longitude=lon)

                    bike_scores.append(int(result.bike_score))
                    walk_scores.append(int(result.walk_score))

            scores[i] = {
                'bikescore': bike_scores,
                'walkscore': walk_scores
            }
        return scores
    
    return(compute_scores(tups, df2, key))
        
computed_tuple = compute_walkscore(uber, df)
print('Connectivity Metric Mean for Top-3 Destinations')
print('Bike',statistics.mean(computed_tuple[0]['bikescore']))
print('Walk',statistics.mean(computed_tuple[0]['walkscore']), '\n')
print('Connectivity Metric Mean for Bottom-3 Destinations')
print('Bike',statistics.mean(computed_tuple[1]['bikescore']))
print('Walk',statistics.mean(computed_tuple[1]['walkscore']))

Transit access in an area may be a factor. It is also worth noting that many of the top destinations are near a younger crowd (universities, downtown, areas with many clubs and bars), which may push people toward shared-use mobility options such as Uber and Lyft. A higher income bracket likely correlates with this behavior as well.

## Step 3: Framing a Prediction Problem

I'm tackling a regression task: predicting the ride fare price (`price`) for each trip in Boston.  

- Response variable: `price`. This is the value riders see and platforms optimize for, and it must be known at booking time.  
- Features used: only data available when the ride is requested, namely trip distance, pickup hour/day/month, surge multiplier, cab type and service tier, origin and destination neighborhoods, and forecasted weather metrics. No in-ride or post-trip information is included, to avoid leakage.  
- Evaluation metric: RMSE as the primary score (to heavily penalize large dollar-value mistakes), with mean absolute error (MAE) for a straightforward average-error interpretation in dollars.  

By restricting inputs and using RMSE/MAE, I can ensure that my model mimics real-world fare estimation and prioritizes minimizing costly prediction outliers.
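
The difference between the two metrics is easiest to see on a small example. The fares and predictions below are made up for illustration; three predictions miss by $1 and one misses by $9.

```python
import numpy as np

# Made-up fares and predictions; the last prediction is off by $9.
y_true = np.array([10.0, 12.0, 15.0, 30.0])
y_pred = np.array([11.0, 11.0, 16.0, 21.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # average absolute miss, in dollars
rmse = np.sqrt(np.mean(errors ** 2))  # squares errors first, so the $9 miss dominates

print(f"MAE:  {mae:.2f}")   # MAE:  3.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 4.58
```

RMSE is never smaller than MAE, and the gap between them grows with the spread of the errors, so a large gap is a hint that a few trips are being priced badly.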

## Step 4: Baseline Model

In [None]:
# draw a 15% random sample for quicker iteration
df_sampled = df.sample(frac=0.15, random_state=42)
print(df_sampled.shape[0])

# Base Line Model;
## TECHNIQUE: Multiple Linear Regression with Scaling & One-Hot Encoding
class BaselineModel:
    def __init__(self, df, test_size=0.2, random_state=42):
        self.df = df
        self.FEATURES = ['distance', 'hour', 'cab_type']
        self.TARGET = 'price'
        X = df[self.FEATURES]
        y = df[self.TARGET]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        num_feats = ['distance', 'hour']
        cat_feats = ['cab_type']
        num_pipe = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler',   StandardScaler())
        ])
        cat_pipe = Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('ohe', OneHotEncoder(handle_unknown='ignore'))
        ])
        preprocessor = ColumnTransformer([
            ('num', num_pipe, num_feats),
            ('cat', cat_pipe, cat_feats)
        ], sparse_threshold=0)
        self.pipeline = Pipeline([
            ('preproc', preprocessor),
            ('model', LinearRegression())
        ])

    def fit(self):
        self.pipeline.fit(self.X_train, self.y_train)

    def evaluate(self):
        preds = self.pipeline.predict(self.X_test)
        # take the square root explicitly: `squared=False` is deprecated in newer scikit-learn
        print(f"Baseline RMSE:{np.sqrt(mean_squared_error(self.y_test, preds)):.2f}")
        print(f"Baseline MAE :{mean_absolute_error(self.y_test,preds):.2f}")
        print(f"Baseline R^2 :{r2_score(self.y_test,preds):.3f}")

# usage
# fit on the local sample (df_sampled) for speed;
# performance was comparable to fitting on the full-size df
model = BaselineModel(df_sampled)
model.fit()
model.evaluate()

## Step 5: Final Model

In [23]:
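# Feature Engineering 1
# `timestamp` holds Unix epoch seconds, so unit='s' is required;
# without it pandas reads the values as nanoseconds and every date becomes 1970-01-01
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df['is_weekend'] = (df['timestamp'].dt.weekday >= 5).astype(int)

df_sampled['timestamp'] = pd.to_datetime(df_sampled['timestamp'], unit='s')
df_sampled['is_weekend'] = (df_sampled['timestamp'].dt.weekday >= 5).astype(int)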

In [24]:
# Feature Engineering 2
# LOOK AT TRIP ORIGIN
#df['source'].unique().tolist()

df = pd.concat([df, pd.get_dummies(df['source'], prefix='source', dtype=int)], axis=1)
df_sampled = pd.concat([df_sampled, pd.get_dummies(df_sampled['source'], prefix='source', dtype=int)], axis=1)

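# Does this lead to multicollinearity? (question to self)
# For a linear model with an intercept, yes: a full set of dummies sums to 1
# (the dummy-variable trap); drop_first=True would avoid it.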

In [25]:
# MORE feature engineering!
# understanding peak hours:
peak_hours = [7, 8, 9, 16, 17, 18]
df['is_peak_hour'] = df['hour'].isin(peak_hours).astype(int)
df_sampled['is_peak_hour'] = df_sampled['hour'].isin(peak_hours).astype(int)

In [None]:
df.columns

In [30]:
# drop raw time columns and helper columns that are no longer needed
cols = ['timestamp','datetime','icon','sunriseTime','sunsetTime',
        'apparentTemperatureHigh','apparentTemperatureHighTime','apparentTemperatureLow',
        'apparentTemperatureLowTime','precipIntensityMax','uvIndexTime','temperatureMaxTime',
        'dist_bin','most_common'
]

df.drop(cols, axis=1, errors="ignore", inplace=True)
# columns after pruning are above in df

In [31]:
# set up w dask distributed
# new imports 
import dask.dataframe as dd 
from xgboost import dask as dxgb 
from dask.distributed import Client
from xgboost import XGBRegressor, callback

# start a local dask cluster 
#client = Client()

In [32]:
feats = [
    'distance', 'hour', 'day', 'month',
    'cab_type', 'source', 'destination',
    'is_weekend', 'is_peak_hour'
]

df2 = df[feats].copy()

In [None]:
print(df2.head())
print(df2.dtypes)
print(df2.columns)

# there are categorical object variables

Based on the observations so far, I omit columns of lesser importance from the rest of the analysis.
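
One simple way to decide which numeric columns to omit is to rank them by absolute correlation with `price`. The sketch below is only an illustration on toy data: the cutoff `THRESHOLD` and the `noise` column are hypothetical, and this is not necessarily the exact rule used to prune the columns in this notebook.

```python
import pandas as pd

THRESHOLD = 0.05  # hypothetical cutoff, chosen only for this example

# Toy frame: `distance` tracks price exactly, `noise` is uncorrelated with it.
toy = pd.DataFrame({
    'price':    [4.0, 8.0, 12.0, 16.0],
    'distance': [1.0, 2.0, 3.0, 4.0],
    'noise':    [1.0, -1.0, -1.0, 1.0],
})

# Absolute Pearson correlation of every feature with the target.
corr_with_price = toy.corr()['price'].drop('price').abs()
low_corr_cols = corr_with_price[corr_with_price < THRESHOLD].index.tolist()

print(low_corr_cols)
```

Pearson correlation only captures linear association, so a column flagged this way can still matter to a tree-based model; treat the list as a starting point rather than a final verdict.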

In [None]:
# Apply importance of geospatial information for temporal pattern
# cols of location : source, destination, one hot encoded source_*

# --> the information in regards to exact start/end address is limited
# factors that impact ride(s) -> precipIntensity, cab_type, distance
df.columns

In [None]:
df.to_csv('res.csv')

In [None]:
# external py --> run evals.py w/ provided data.