# Ride Price Estimation System

This notebook presents an end-to-end machine learning workflow for estimating ride prices
based on trip and contextual factors. The project covers problem framing, dataset design,
data exploration, preprocessing, modeling, evaluation, and ethical reflection.

## 1. ML Mindset & Problem Framing

Estimating ride prices is a supervised learning problem where the goal is to predict a
continuous target variable (`price`) based on trip and contextual features.

This problem is better suited for machine learning than fixed rule-based systems because:
- Ride pricing depends on multiple interacting factors
- Relationships are non-linear and change over time
- Rule-based systems are static and require manual updates

The model is expected to learn how features such as distance, duration, traffic, weather,
time of day, and demand influence ride price from historical data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Data Loading & Exploration

In [None]:
df = pd.read_csv("../data/ride_sharing_learning_data.csv")

df.head()

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.describe()

In [None]:
plt.hist(df['price'], bins=30)
plt.title("Ride Price Distribution")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()

The histogram shows a right-skewed price distribution with a long tail of extreme values,
which motivates the outlier handling in the next section.
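The visual impression of skew and outliers can also be checked numerically. The sketch below is only an illustration: it uses a synthetic log-normal series as a stand-in for `df['price']` (the values are invented, not taken from the dataset), and it applies Tukey's 1.5 × IQR rule, which is one common convention rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['price']; values are illustrative, not from the dataset.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000), name="price")

# Sample skewness: values well above 0 indicate a long right tail.
skewness = prices.skew()

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((prices > upper_fence).sum())

print(f"skewness={skewness:.2f}, upper fence={upper_fence:.2f}, "
      f"high outliers={n_high_outliers}")
```

On the real data, replacing `prices` with `df['price']` would give the same three summary numbers for the plotted column.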

## 3. Data Cleaning & Feature Engineering

In [None]:
df = df.drop_duplicates()

In [None]:
df['traffic'] = df['traffic'].str.lower().str.strip().replace({'low_': 'low'})
df['weather'] = df['weather'].str.lower().str.strip()

In [None]:
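# Keep only rows with a positive duration (drops zero and negative values).
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below.
df = df[df['duration'] > 0].copy()

# Cap extreme values instead of deleting the rows
df['distance'] = df['distance'].clip(upper=100)
df['price'] = df['price'].clip(upper=300)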

In [None]:
df_encoded = pd.get_dummies(
    df,
    columns=['traffic', 'weather', 'time_of_day', 'demand'],
    drop_first=True
)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_encoded[['distance', 'duration']] = scaler.fit_transform(
    df_encoded[['distance', 'duration']]
)
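The scaling cell above fits `StandardScaler` on every row, while the train/test split only happens in Section 4, so the test rows contribute to the mean and standard deviation used for scaling. A minimal sketch of one way to avoid that, assuming the split is done first; the toy frame and its values are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with the two numeric columns scaled above; values are made up.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "distance": rng.uniform(1, 50, size=200),
    "duration": rng.uniform(5, 90, size=200),
})

# Split first, so the held-out rows play no part in fitting the scaler.
train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
train_scaled = scaler.fit_transform(train[["distance", "duration"]])
# ... then reuse those statistics for the held-out rows.
test_scaled = scaler.transform(test[["distance", "duration"]])

print(train_scaled.mean(axis=0).round(6))  # close to 0 for each column
```

The test columns will generally not have exactly zero mean after this transform, which is expected: they are scaled with the training statistics.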

## 4. Regression Model – Price Prediction

In [None]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('price', axis=1)
y = df_encoded['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = lr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

mae, rmse

In [None]:
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Ride Prices")
plt.show()

## 5. Classification – High-Cost vs Low-Cost Ride

In [None]:
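# Label rides priced above the median as high-cost (1), all others as low-cost (0).
# The label is kept out of df_encoded so that re-running earlier cells cannot
# pick it up as a feature.
y_cls = (df_encoded['price'] > df_encoded['price'].median()).astype(int).rename('high_cost')

X_cls = df_encoded.drop('price', axis=1)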

X_train, X_test, y_train, y_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred_cls = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_cls)
cm = confusion_matrix(y_test, y_pred_cls)

accuracy, cm

Logistic regression predicts the probability that a ride belongs to the high-cost class.
A threshold of 0.5 is used to convert probabilities into class labels.
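The relationship between probabilities and labels can be made explicit with `predict_proba`. The sketch below uses invented one-feature toy data rather than the ride dataset, and the 0.7 threshold is a hypothetical value chosen only to show the effect of moving the cut-off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data: larger values are more likely to be class 1 (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_high = clf.predict_proba(X_toy)[:, 1]

# Thresholding at 0.5 matches what predict() returns for a binary problem.
labels_default = (proba_high > 0.5).astype(int)

# A stricter, hypothetical threshold labels fewer samples as class 1.
labels_strict = (proba_high >= 0.7).astype(int)

print(labels_default.sum(), labels_strict.sum())
```

Raising the threshold trades recall on the high-cost class for fewer false positives; which direction is preferable depends on how the label is used.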

## 6. Model Comparison

- Linear regression predicts a continuous price, but its least-squares fit is sensitive to outliers.
- Classification reduces the task to a high-cost/low-cost decision but discards the price magnitude.
- Data quality strongly influenced both models' performance.

Distance and duration were the most influential features across models.
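One way to check a claim like this for a linear model is to rank the fitted coefficients by absolute size. The following is a sketch on invented toy data, not the notebook's results: the feature names mirror the notebook's columns, but the data-generating weights are assumptions made for the example. Coefficient magnitudes are only comparable when features share a scale, so standardized numeric columns and 0/1 dummy columns should be compared with some caution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy standardized features; the data-generating weights below are invented.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["distance", "duration", "traffic_low"],
)
noise = rng.normal(scale=1.0, size=300)
y_toy = (
    8.0 * X_toy["distance"]
    + 3.0 * X_toy["duration"]
    + 0.5 * X_toy["traffic_low"]
    + noise
)

model = LinearRegression().fit(X_toy, y_toy)

# Rank features by absolute coefficient size (largest first).
coef_rank = (
    pd.Series(model.coef_, index=X_toy.columns)
    .abs()
    .sort_values(ascending=False)
)
print(coef_rank)
```

In the notebook itself, the same ranking could be built from `lr.coef_` and `X_train.columns`, and from `log_reg.coef_[0]` for the classifier.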

## 7. Ethical & Practical Reflection

One potentially unfair behavior is demand-based pricing that disproportionately affects
users in certain locations or time periods.
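A simple first check for this kind of disparity is to compare predictions and errors across groups. The sketch below assumes a hypothetical evaluation frame with `actual`, `predicted`, and `time_of_day` columns; the numbers are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical evaluation frame: actual price, predicted price, grouping column.
results = pd.DataFrame({
    "time_of_day": ["morning", "morning", "evening", "evening", "night", "night"],
    "actual":      [20.0, 25.0, 30.0, 40.0, 50.0, 60.0],
    "predicted":   [21.0, 24.0, 33.0, 36.0, 58.0, 50.0],
})

# Absolute prediction error per ride.
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

# Mean predicted price and mean absolute error per group.
summary = results.groupby("time_of_day").agg(
    mean_predicted=("predicted", "mean"),
    mae=("abs_error", "mean"),
)
print(summary)
```

Large gaps between groups do not prove unfairness on their own, but they indicate where a closer look at the data and the model is warranted.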

A real-world risk is over-reliance on imperfect data, which could result in unfair or
incorrect prices.

A limitation of this dataset is that it is synthetic and may not fully capture real-world
human behavior.