In [1]:
import numpy as np
import pandas as pd

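# Seed NumPy's global generator so every run produces the same data.
np.random.seed(42)

# Number of rides to generate.
n = 250

# Trip distance in km.
distance = np.random.uniform(1, 25, n)

# Trip duration in minutes: a random pace of 2 to 4 minutes per km.
duration = distance * np.random.uniform(2, 4, n)

# Starting charge for each ride.
base_fare = np.random.uniform(2, 5, n)

# Demand level from 1 to 5 (randint's upper bound is exclusive).
demand = np.random.randint(1, 6, n)

# Categorical context features, each category equally likely.
traffic_levels = np.random.choice(['low', 'medium', 'high'], n)
weather_conditions = np.random.choice(['clear', 'rainy', 'stormy'], n)
time_of_day = np.random.choice(['morning', 'afternoon', 'evening', 'night'], n)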

# Deterministic part of the price: base fare plus distance, duration, and demand charges.
price = (
    base_fare +
    1.5 * distance +
    0.4 * duration +
    demand * 1.2
)

# Traffic surcharges.
price += np.where(traffic_levels == 'high', 3, 0)
price += np.where(traffic_levels == 'medium', 1.5, 0)

# Weather surcharges.
price += np.where(weather_conditions == 'stormy', 4, 0)
price += np.where(weather_conditions == 'rainy', 2, 0)

# Time-of-day surcharges.
price += np.where(time_of_day == 'night', 2, 0)
price += np.where(time_of_day == 'evening', 1, 0)

# Gaussian noise with standard deviation 3.
price += np.random.normal(0, 3, n)

df = pd.DataFrame({
    'distance_km': distance,
    'duration_min': duration,
    'base_fare': base_fare,
    'demand_level': demand,
    'traffic_level': traffic_levels,
    'weather_condition': weather_conditions,
    'time_of_day': time_of_day,
    'ride_price': price
})

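# Set 5% of the rows to missing in each of two categorical columns.
for col in ['traffic_level', 'weather_condition']:
    df.loc[df.sample(frac=0.05).index, col] = np.nan

# Introduce an inconsistent label ('High' instead of 'high') in 3% of rows.
df.loc[df.sample(frac=0.03).index, 'traffic_level'] = 'High'

# Double the price of 5 random rides to create outliers.
outlier_indices = df.sample(5).index
df.loc[outlier_indices, 'ride_price'] *= 2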

df.to_csv("rides.csv", index=False)

# Keep df.head() as the last expression so the notebook displays it.
df.head()





**Synthetic Data Generation and Realism**

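The dataset is synthetically generated to mimic real-world ride pricing patterns. The price is computed from a fixed additive formula plus random noise, and missing values, an inconsistent category label, and a few outliers are then injected so that it resembles messy real-life data.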

In practice, actual ride data may differ due to factors like driver behavior, dynamic traffic, promotions, or real-time demand.

The code can produce a dataset of any size by changing `n`, which makes it reusable for training and testing machine learning models.
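
One way to make that reuse concrete is to wrap the generation steps in a function. The sketch below is only an illustration: the name `make_rides` and its `seed` parameter are hypothetical, it uses `np.random.default_rng` instead of the global `np.random.seed` (so the numbers will not match the cell above), and it leaves out the injection of missing values, the `'High'` label, and the outliers.

```python
import numpy as np
import pandas as pd


def make_rides(n=250, seed=42):
    """Hypothetical helper: generate n clean synthetic rides."""
    # A local generator, so repeated calls do not share global random state.
    rng = np.random.default_rng(seed)

    distance = rng.uniform(1, 25, n)             # km
    duration = distance * rng.uniform(2, 4, n)   # minutes, 2 to 4 per km
    base_fare = rng.uniform(2, 5, n)
    demand = rng.integers(1, 6, n)               # 1..5, upper bound exclusive
    traffic = rng.choice(['low', 'medium', 'high'], n)
    weather = rng.choice(['clear', 'rainy', 'stormy'], n)
    time_of_day = rng.choice(['morning', 'afternoon', 'evening', 'night'], n)

    # Same surcharges as the notebook cell, written as lookup tables.
    traffic_fee = pd.Series(traffic).map(
        {'low': 0.0, 'medium': 1.5, 'high': 3.0}).to_numpy()
    weather_fee = pd.Series(weather).map(
        {'clear': 0.0, 'rainy': 2.0, 'stormy': 4.0}).to_numpy()
    time_fee = pd.Series(time_of_day).map(
        {'morning': 0.0, 'afternoon': 0.0, 'evening': 1.0, 'night': 2.0}).to_numpy()

    price = (
        base_fare + 1.5 * distance + 0.4 * duration + 1.2 * demand
        + traffic_fee + weather_fee + time_fee
        + rng.normal(0, 3, n)                    # noise, standard deviation 3
    )

    return pd.DataFrame({
        'distance_km': distance,
        'duration_min': duration,
        'base_fare': base_fare,
        'demand_level': demand,
        'traffic_level': traffic,
        'weather_condition': weather,
        'time_of_day': time_of_day,
        'ride_price': price,
    })


big = make_rides(n=10_000, seed=0)
```

Passing the same `seed` returns the same frame, and a different `seed` gives an independent sample, which is convenient for building separate training and test sets.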

**Dataset Features and Relevance**

**distance_km**: Longer trips cost more, directly increasing ride price.

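**duration_min**: Captures time spent on the trip, which also increases price. In this generator it is the distance multiplied by a random pace of 2 to 4 minutes per km, so it is not linked to traffic_level.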

**base_fare**: Minimum starting cost for every ride.

**demand_level**: An integer from 1 to 5; higher demand increases price (surge pricing).

**traffic_level**: Congestion raises cost. Here medium and high traffic add a fixed surcharge to the price rather than lengthening the trip.

**weather_condition**: Rain or storms increase demand and operational difficulty.

**time_of_day**: Evening and night rides often have higher prices.

**ride_price**: Target variable representing the final cost of the ride.
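
Read off the generation code, the target is an additive function of the features. The symbols below are shorthand introduced here, not names from the code: $p$ is ride_price, $b$ is base_fare, $d$ is distance_km, $t$ is duration_min, and $q$ is demand_level.

$$
p = b + 1.5\,d + 0.4\,t + 1.2\,q + s_{\text{traffic}} + s_{\text{weather}} + s_{\text{time}} + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0,\ 3^2)
$$

$$
s_{\text{traffic}} = \begin{cases} 3 & \text{high} \\ 1.5 & \text{medium} \\ 0 & \text{low} \end{cases}
\qquad
s_{\text{weather}} = \begin{cases} 4 & \text{stormy} \\ 2 & \text{rainy} \\ 0 & \text{clear} \end{cases}
\qquad
s_{\text{time}} = \begin{cases} 2 & \text{night} \\ 1 & \text{evening} \\ 0 & \text{otherwise} \end{cases}
$$

As a worked example with made-up inputs, a 10 km, 30 minute ride with a base fare of 3 and demand level 2, in high traffic on a rainy night, costs $3 + 15 + 12 + 2.4 + 3 + 2 + 2 = 39.4$ before noise. The five outlier rides have this value doubled afterwards.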

**Additional Notes**:

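About 5% of the traffic_level and weather_condition entries are set to missing to mimic real-world imperfect data.

About 3% of the traffic_level entries are written as 'High' instead of 'high', which adds an inconsistent label that has to be cleaned.

Five rides have their price doubled to reflect unusually expensive rides.

These choices mean that a model trained on the dataset has to go through the same cleaning steps that real data would need.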
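
A minimal sketch of how these imperfections might be inspected before modelling. The small `rides` frame is a made-up stand-in so that the example runs on its own (in the notebook it would be the generated `df` or `pd.read_csv("rides.csv")`), and the 1.5 × IQR fence is one common rule of thumb, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the generated data.
rides = pd.DataFrame({
    'traffic_level': ['low', 'High', np.nan, 'medium', 'high', 'low', 'medium', 'low'],
    'ride_price': [20.0, 35.0, 28.0, 31.0, 150.0, 22.0, 30.0, 25.0],
})

# 1. Count missing values per column.
missing = rides.isna().sum()

# 2. Fold the inconsistent 'High' label into 'high'; missing values stay missing.
rides['traffic_level'] = rides['traffic_level'].str.lower()

# 3. Flag prices above the upper fence Q3 + 1.5 * IQR.
q1, q3 = rides['ride_price'].quantile([0.25, 0.75])
rides['is_outlier'] = rides['ride_price'] > q3 + 1.5 * (q3 - q1)

print(missing)
print(rides)
```

On the real data a doubled price will not always cross this fence, for example when the original ride was short and cheap, so the flag is a starting point for inspection rather than a guaranteed way to recover the five injected outliers.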

**Feature considered but excluded**

I thought about including **driver rating**, but it does not directly affect ride price in a predictable way.

Including it could introduce bias, since ratings are subjective and vary by passenger perception.

It's not a clear numeric factor that the model can use to learn pricing patterns.