# Ride Price Estimation System
This notebook presents a complete ML workflow for ride price estimation, including problem framing, data quality checks, preprocessing, regression, classification, model comparison, and practical reflection.

## Notebook Roadmap
1. Problem framing and assumptions  
2. Data loading and raw exploration  
3. Data cleaning and feature preparation  
4. Linear regression (price prediction)  
5. Logistic regression (high-cost classification)  
6. Model comparison and feature influence  
7. Ethical and practical reflection

## 1) Problem Framing

**Business goal:** Estimate ride prices before trip confirmation to support transparent rider quotes.

**ML framing:**
- Regression task: predict continuous `price`
- Classification task: predict `high_cost` (1/0) for decision support

**Why ML over fixed rules:**
- Price depends on interacting factors (distance, duration, demand, traffic, weather)
- Relationships shift with context and are hard to maintain with static rules
- Data-driven models adapt better as new examples are collected

**Model expectation:** Learn how contextual and trip variables combine to influence price.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, confusion_matrix

## 2) Data Loading and Exploration

In [None]:
df = pd.read_csv('../data/ride_sharing_learning_data.csv')
df.head()

In [None]:
print('Rows:', len(df))
print('Columns:', df.shape[1])
df.info()

In [None]:
quality_report = pd.DataFrame({
    'missing_values': df.isna().sum(),
    'n_unique': df.nunique()
})
quality_report

In [None]:
print('Duplicate rows:', df.duplicated().sum())
df.describe(include='all').T

In [None]:
plt.figure(figsize=(7, 4))
plt.hist(df['price'], bins=30)
plt.title('Ride Price Distribution (Raw)')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

## 3) Data Cleaning and Feature Engineering

Cleaning plan:
1. Remove duplicates
2. Standardize inconsistent category labels
3. Remove impossible non-positive trip durations (zero or negative)
4. Cap extreme outliers for `distance` and `price`
5. One-hot encode categorical variables
6. Scale numeric variables for stable optimization
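
Step 4 caps outliers, and the cleaning cell in this notebook uses fixed limits (100 for `distance`, 300 for `price`). As an alternative illustration only, a cap could instead be derived from the data with a quantile. The helper name `cap_upper`, the quantile level, and the toy values below are assumptions for this sketch, not part of the notebook's pipeline.

```python
import pandas as pd


def cap_upper(series: pd.Series, q: float = 0.99) -> pd.Series:
    """Clip values above the q-th quantile of the series (hypothetical helper)."""
    return series.clip(upper=series.quantile(q))


# Toy values for illustration only; not taken from the ride dataset.
toy_distance = pd.Series([5.0, 7.0, 9.0, 11.0, 500.0])

# With five values, the 0.75 quantile falls exactly on the fourth value (11.0),
# so the extreme 500.0 is pulled down to 11.0.
capped = cap_upper(toy_distance, q=0.75)
print(capped.tolist())
```

A quantile-based cap adapts to the data at hand, whereas a fixed cap is easier to explain and audit. Which one fits better depends on how stable the distribution is expected to be.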


In [None]:
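df_clean = df.copy()

# 1. Remove exact duplicate rows
df_clean = df_clean.drop_duplicates()

# 2. Normalize case/whitespace in category labels and fix the 'low_' variant
df_clean['traffic'] = df_clean['traffic'].str.lower().str.strip().replace({'low_': 'low'})
df_clean['weather'] = df_clean['weather'].str.lower().str.strip()

# 3. Keep only rows with a strictly positive duration
df_clean = df_clean[df_clean['duration'] > 0]

# 4. Cap extreme values at fixed upper limits
df_clean['distance'] = df_clean['distance'].clip(upper=100)
df_clean['price'] = df_clean['price'].clip(upper=300)

print('Rows after cleaning:', len(df_clean))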

In [None]:
df_encoded = pd.get_dummies(
    df_clean,
    columns=['traffic', 'weather', 'time_of_day', 'demand'],
    drop_first=True
)

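# Note: the scaler is fit on all rows before the train/test split, so test-set
# statistics influence the scaled features. For a stricter evaluation, fit the
# scaler on the training rows only and reuse it to transform the test rows.
scaler = StandardScaler()
df_encoded[['distance', 'duration']] = scaler.fit_transform(df_encoded[['distance', 'duration']])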

df_encoded.head()

### Feature justification
- **distance**: primary driver of operating cost/time
- **duration**: captures congestion and route inefficiency
- **traffic**: contextual congestion intensity
- **weather**: impacts road risk and travel speed
- **time_of_day**: captures rush hours and night behavior
- **demand**: reflects surge-like demand pressure

**Excluded feature example:** rider income level (excluded for fairness and privacy concerns).
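
One way to preprocess the features listed above without letting test rows influence the scaling is to wrap the steps in a scikit-learn `Pipeline`, so the scaler and encoder are fit on training rows only. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, the category labels (`'medium'`, `'high'`, `'clear'`, `'rain'`) and the price formula are invented for the example, and only a subset of the notebook's columns is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic toy rows for illustration only; values are not from the ride dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'distance': rng.uniform(1, 30, n),
    'duration': rng.uniform(5, 90, n),
    'traffic': rng.choice(['low', 'medium', 'high'], n),
    'weather': rng.choice(['clear', 'rain'], n),
})
# Invented price formula so the toy model has something to learn.
toy['price'] = 2.0 * toy['distance'] + 0.3 * toy['duration'] + rng.normal(0, 1, n)

numeric_cols = ['distance', 'duration']
categorical_cols = ['traffic', 'weather']

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols),
])
model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])

X_toy = toy.drop(columns='price')
y_toy = toy['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# fit() learns scaling statistics and categories from the training rows only.
model.fit(X_tr, y_tr)
r2_toy = model.score(X_te, y_te)
print('Toy R2:', round(r2_toy, 3))
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on new rows applies the same training-time scaling and encoding automatically.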

## 4) Regression: Linear Regression for Price

In [None]:
X = df_encoded.drop('price', axis=1)
y = df_encoded['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

reg_pred = lin_reg.predict(X_test)

mae = mean_absolute_error(y_test, reg_pred)
rmse = np.sqrt(mean_squared_error(y_test, reg_pred))
r2 = r2_score(y_test, reg_pred)

pd.DataFrame({'metric': ['MAE', 'RMSE', 'R2'], 'value': [mae, rmse, r2]})

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(y_test, reg_pred, alpha=0.7)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Linear Regression: Actual vs Predicted')

line_min = min(y_test.min(), reg_pred.min())
line_max = max(y_test.max(), reg_pred.max())
plt.plot([line_min, line_max], [line_min, line_max], 'r--')
plt.show()

## 5) Classification: Logistic Regression for High-Cost Ride

In [None]:
df_cls = df_encoded.copy()
df_cls['high_cost'] = (df_cls['price'] > df_cls['price'].median()).astype(int)

X_cls = df_cls.drop(['price', 'high_cost'], axis=1)
y_cls = df_cls['high_cost']

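# Use distinct names so the regression split from section 4 is not overwritten
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_cls, y_train_cls)

cls_pred = log_reg.predict(X_test_cls)
# Column 1 of predict_proba holds the probability of the positive class (high_cost = 1)
cls_prob = log_reg.predict_proba(X_test_cls)[:, 1]

acc = accuracy_score(y_test_cls, cls_pred)
cm = confusion_matrix(y_test_cls, cls_pred)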

print('Accuracy:', round(acc, 4))
print('Confusion Matrix:\n', cm)

In [None]:
plt.figure(figsize=(7, 4))
plt.hist(cls_prob, bins=20)
plt.title('Predicted Probability Distribution for High-Cost Class')
plt.xlabel('Predicted probability of high_cost = 1')
plt.ylabel('Frequency')
plt.show()

**How probabilities are used:** Logistic regression estimates a probability from 0 to 1. A decision threshold (default 0.5) converts this probability into class labels (high-cost vs low-cost).
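
The thresholding step described above can be sketched with plain NumPy. The probabilities and the helper name `to_labels` are hypothetical and serve only to illustrate how changing the threshold changes the predicted labels.

```python
import numpy as np

# Hypothetical predicted probabilities for illustration only.
probs = np.array([0.10, 0.45, 0.50, 0.62, 0.91])


def to_labels(p: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return 1 where the probability exceeds the threshold, else 0."""
    return (p > threshold).astype(int)


default_labels = to_labels(probs)        # threshold 0.5
lenient_labels = to_labels(probs, 0.4)   # lower threshold flags more rides

print(default_labels.tolist())  # [0, 0, 0, 1, 1]
print(lenient_labels.tolist())  # [0, 1, 1, 1, 1]
```

Lowering the threshold flags more rides as high-cost, which trades fewer missed high-cost rides for more false alarms.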

## 6) Model Evaluation and Comparison

- Regression provides numeric pricing accuracy (`MAE`, `RMSE`, `R2`)
- Classification provides decision quality (`Accuracy`, confusion matrix)
- Data cleaning (removing impossible durations, standardizing labels) makes both sets of metrics more trustworthy, although this notebook does not measure the before/after effect
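
To make the regression metrics above concrete, here is a small worked example that computes `MAE`, `RMSE`, and `R2` by hand. The four actual and predicted prices are invented for illustration and are unrelated to the notebook's results.

```python
import numpy as np

# Invented toy values for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_hat - y_true                      # [2, -2, 3, -3]

# MAE: average size of the error, in price units.
mae_toy = np.mean(np.abs(errors))            # (2 + 2 + 3 + 3) / 4 = 2.5

# RMSE: like MAE, but larger errors are penalized more heavily.
rmse_toy = np.sqrt(np.mean(errors ** 2))     # sqrt(26 / 4), about 2.55

# R2: share of the variance in y_true explained by the predictions.
ss_res = np.sum(errors ** 2)                 # 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2_toy = 1 - ss_res / ss_tot                 # 1 - 26/500 = 0.948

print(mae_toy, round(rmse_toy, 3), round(r2_toy, 3))
```

`MAE` and `RMSE` are in the same units as `price`, so they can be read directly as a typical quote error, while `R2` is unitless.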

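### Most influential features (linear model coefficients by absolute value)

Coefficient magnitudes are only roughly comparable: `distance` and `duration` are standardized, while the one-hot columns are 0/1 indicators, so the ranking is a guide rather than a strict measure of importance.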

In [None]:
coef_df = pd.DataFrame({
    'feature': X.columns,
    'coefficient': lin_reg.coef_
})
coef_df['abs_coefficient'] = coef_df['coefficient'].abs()
coef_df.sort_values('abs_coefficient', ascending=False).head(10)

## 7) Ethical and Practical Reflection

- **Potential unfair pricing behavior:** Demand-heavy zones could systematically get higher prices, disproportionately affecting specific communities.
- **Deployment risk:** Model drift during unusual events (festivals, severe storms, strikes) can produce unstable prices.
- **Dataset limitation:** Synthetic data may not capture full real-world behavior, route constraints, and socio-economic variation.
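
One simple way to probe the unfair-pricing concern above is to compare prediction errors across groups. The sketch below uses an invented four-row table; the column names `actual` and `predicted` and all values are hypothetical, and a real audit would need far more data and careful choice of groups.

```python
import pandas as pd

# Invented rows for illustration only.
audit = pd.DataFrame({
    'demand': ['low', 'low', 'high', 'high'],
    'actual': [10.0, 12.0, 30.0, 34.0],
    'predicted': [11.0, 11.0, 36.0, 30.0],
})

# Absolute error shows how far off the quotes are for each group;
# signed error shows whether a group is systematically over- or under-quoted.
audit['abs_error'] = (audit['predicted'] - audit['actual']).abs()
audit['signed_error'] = audit['predicted'] - audit['actual']

by_group = audit.groupby('demand')[['abs_error', 'signed_error']].mean()
print(by_group)
```

A large gap in mean error between groups would be a signal to investigate further, not proof of unfairness on its own.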

## Final Note
This notebook was built as an end-to-end practical workflow with emphasis on traceability, reproducibility, and clear modeling decisions.