# Ride Price Estimation System

This notebook walks through dataset creation, cleaning, feature engineering, and modeling (linear regression + logistic regression) for estimating ride prices.

## 1. ML Mindset & Problem Framing
- **Learning problem:** Predict continuous ride prices using trip context features.
- **Why ML instead of fixed rules?** Real-world pricing depends on interacting factors (traffic, weather, demand) that change over time. ML can learn these patterns from data.
- **What the model should learn:** The relationship between trip/context features and observed ride prices so it can generalize to new rides.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Data Exploration & Understanding

In [None]:
df = pd.read_csv('../data/rides.csv')
df.head()

In [None]:
df.info()
df.isna().sum()

In [None]:
df['ride_price'].describe()

### Raw data visualization

In [None]:
plt.figure(figsize=(6, 4))
plt.scatter(df['distance_km'], df['ride_price'], alpha=0.7)
plt.title('Ride price vs distance')
plt.xlabel('Distance (km)')
plt.ylabel('Ride price')
plt.show()

## 3. Data Cleaning & Feature Engineering
- **Missing values:** Fill numeric features with the median and categorical features with the mode.
- **Outliers:** Cap distance and duration using the IQR rule to reduce the influence of extreme values.
- **Encoding & scaling:** One-hot encode categoricals and standardize numeric features.

Poor data quality (missing values, mislabeled categories, or extreme outliers) can bias the model and reduce generalization.
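
Mislabeled categories often appear as case or whitespace variants of the same label. The following is a minimal sketch of one way to consolidate them before filling missing values. The toy values and the helper name `normalize_labels` are assumptions for illustration and are not taken from `rides.csv`.

```python
import numpy as np
import pandas as pd

# Toy column with inconsistent labels (hypothetical values, not from rides.csv)
raw = pd.Series(['High', ' high', 'LOW', 'low ', '', None, 'Medium'])

def normalize_labels(s: pd.Series) -> pd.Series:
    """Strip whitespace, lowercase, and turn empty strings into NaN."""
    cleaned = s.str.strip().str.lower()
    return cleaned.replace('', np.nan)

normalized = normalize_labels(raw)
# Variants such as 'High' and ' high' now count as one category
print(normalized.value_counts(dropna=False))
```

Running a helper like this on each categorical column before the mode fill would keep variants of the same label from being one-hot encoded as separate columns.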

In [None]:
df_clean = df.copy()

# handle missing values
for col in ['distance_km', 'duration_min']:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

for col in ['time_of_day', 'traffic_level', 'weather', 'demand_level', 'day_of_week']:
    df_clean[col] = df_clean[col].replace('', np.nan)
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

# outlier treatment using IQR
for col in ['distance_km', 'duration_min']:
    q1 = df_clean[col].quantile(0.25)
    q3 = df_clean[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df_clean[col] = df_clean[col].clip(lower, upper)

df_clean.isna().sum()

## Dataset Justification
- **distance_km:** Longer rides cost more due to fuel/time.
- **duration_min:** Captures slow traffic or longer routes affecting price.
- **time_of_day:** Peak hours tend to be pricier.
- **traffic_level:** High congestion increases duration and cost.
- **weather:** Bad weather can increase demand and risk.
- **demand_level:** Surge pricing occurs when demand exceeds supply.
- **day_of_week:** Weekends often have different demand patterns.

**Feature not included:** Driver rating was considered, but pricing should not depend on individual driver behavior in a fair pricing model.

## 4. Regression Model: Price Prediction (Linear Regression)

In [None]:
# one-hot encode categoricals
X = pd.get_dummies(df_clean.drop(columns=['ride_price']), drop_first=True).astype(float)
y = df_clean['ride_price'].values.reshape(-1, 1)

# train-test split (done before scaling so test rows do not influence the statistics)
rng = np.random.default_rng(42)
indices = np.arange(len(X))
rng.shuffle(indices)
split = int(len(X) * 0.8)
train_idx, test_idx = indices[:split], indices[split:]

# standardize numeric columns using training-set statistics only
numeric_cols = ['distance_km', 'duration_min']
train_mean = X.iloc[train_idx][numeric_cols].mean()
train_std = X.iloc[train_idx][numeric_cols].std()
X[numeric_cols] = (X[numeric_cols] - train_mean) / train_std

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# add bias term
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train.values]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test.values]

# closed-form solution (normal equation)
theta = np.linalg.pinv(X_train_bias.T @ X_train_bias) @ X_train_bias.T @ y_train
y_pred = X_test_bias @ theta

mae = np.mean(np.abs(y_test - y_pred))
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2 = 1 - ss_res / ss_tot
mae, r2
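
As a sanity check on the closed-form fit, the normal equation can be compared against a general least-squares solver. The sketch below does this on synthetic toy data (the names `X_toy`, `y_toy`, and `true_theta` are assumptions for illustration, not part of the ride dataset), using `np.linalg.lstsq` for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix: bias column plus two features (synthetic, illustration only)
n = 200
features = rng.normal(size=(n, 2))
X_toy = np.c_[np.ones((n, 1)), features]
true_theta = np.array([[5.0], [2.0], [-1.0]])
y_toy = X_toy @ true_theta + rng.normal(scale=0.1, size=(n, 1))

# Normal equation with the pseudo-inverse, as in the regression cell
theta_normal = np.linalg.pinv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Least-squares solver applied to the same problem
theta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Both approaches should agree up to floating-point error
print(np.allclose(theta_normal, theta_lstsq))
```

The pseudo-inverse keeps the closed form usable when one-hot columns make `X.T @ X` singular or nearly so, while `lstsq` avoids forming `X.T @ X` explicitly and is usually the more numerically stable choice.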

In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs Actual Ride Prices')
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--')
plt.show()

## 5. Classification Model: High-Cost vs Low-Cost Ride (Logistic Regression)

In [None]:
threshold = df_clean['ride_price'].median()
y_cls = (df_clean['ride_price'] >= threshold).astype(int).values.reshape(-1, 1)

# reuse encoded features from regression
X_cls = X
X_train, X_test = X_cls.iloc[train_idx], X_cls.iloc[test_idx]
y_train, y_test = y_cls[train_idx], y_cls[test_idx]

X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train.values]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test.values]

# logistic regression via gradient descent
def sigmoid(z):
    # clip inputs to avoid overflow in np.exp for large-magnitude values
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

theta = np.zeros((X_train_bias.shape[1], 1))
lr = 0.1
epochs = 2000

for _ in range(epochs):
    preds = sigmoid(X_train_bias @ theta)
    gradient = (X_train_bias.T @ (preds - y_train)) / len(y_train)
    theta -= lr * gradient

probs = sigmoid(X_test_bias @ theta)
y_pred = (probs >= 0.5).astype(int)

accuracy = np.mean(y_pred == y_test)
confusion = pd.crosstab(y_test.flatten(), y_pred.flatten(), rownames=['Actual'], colnames=['Predicted'])
accuracy, confusion

**Probability explanation:** Logistic regression outputs a probability between 0 and 1 that a ride is high-cost. A threshold (0.5 here) converts each probability into a high-cost or low-cost label.
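
To illustrate how the threshold choice affects the labels, here is a small sketch on made-up probabilities. The helpers `classify` and `precision_recall` and the toy arrays are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np

def classify(probs, threshold=0.5):
    """Convert probabilities into 0/1 labels at the given threshold."""
    return (np.asarray(probs) >= threshold).astype(int)

def precision_recall(y_true, y_hat):
    """Precision and recall for the positive (high-cost) class."""
    y_true = np.asarray(y_true).ravel()
    y_hat = np.asarray(y_hat).ravel()
    tp = np.sum((y_true == 1) & (y_hat == 1))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    fn = np.sum((y_true == 1) & (y_hat == 0))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels
toy_probs = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])
toy_labels = np.array([0, 0, 1, 0, 1, 1])

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(toy_labels, classify(toy_probs, t))
    print(f'threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}')
```

Lowering the threshold labels more rides as high-cost, which tends to raise recall at the expense of precision; raising it does the opposite.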

## 6. Model Evaluation & Comparison
- Regression evaluates numeric price accuracy (MAE/R²).
- Classification evaluates correct high/low cost predictions (accuracy + confusion matrix).
- Data quality issues like missing values or extreme outliers can harm both models by skewing the learned relationships.
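- The most influential features can be estimated by inspecting the linear regression coefficients. Magnitudes are only comparable for features on the same scale, which is why the numeric features are standardized; comparisons against the 0/1 one-hot columns are rougher. Distance and duration are typically the strongest.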
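
The coefficient inspection mentioned in the last point could look like the following sketch. It uses synthetic standardized features with known effects; the helper `rank_coefficients` and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rank_coefficients(theta, feature_names):
    """Return coefficients (bias excluded) sorted by absolute magnitude."""
    coefs = pd.Series(np.asarray(theta).ravel()[1:], index=list(feature_names))
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Toy standardized features with known effects (synthetic, illustration only)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)),
                      columns=['distance_km', 'duration_min', 'noise'])
y_demo = (4.0 * X_demo['distance_km'] + 1.5 * X_demo['duration_min']
          + rng.normal(scale=0.2, size=300)).values.reshape(-1, 1)

# Fit with the same normal-equation approach used for the regression model
X_demo_bias = np.c_[np.ones((len(X_demo), 1)), X_demo.values]
theta_demo = np.linalg.pinv(X_demo_bias.T @ X_demo_bias) @ X_demo_bias.T @ y_demo

print(rank_coefficients(theta_demo, X_demo.columns))
```

In this notebook the same helper could be called as `rank_coefficients(theta, X.columns)` right after the regression cell, before `theta` is reassigned by the logistic regression cell.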

## 7. Ethical & Practical Reflection
- **Potential unfair pricing behavior:** If demand is consistently higher in certain neighborhoods, surge pricing could disproportionately impact residents there.
- **Real-world risk:** Over-reliance on model outputs could lead to poor pricing during unusual events (storms, outages).
- **Dataset limitation:** Synthetic data may not capture all real-world variability and bias.