Real-World Business Scenario: Performance Benchmarking

Requirement: The operations team needs to identify "Extreme Weather Rides." We must:

1. Filter: Only analyze high-volume days (days with more than 50 rides) so that each day's average temperature rests on enough rides to serve as a reliable baseline.

2. Transform: For every individual ride, calculate the absolute deviation of its temperature from the average temperature of that specific day.

3. Aggregate (Collapse): Produce a daily summary report showing the total ride count, the single highest temperature deviation recorded, and the most common weather event.
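
The three steps correspond to three different groupby operations, each with its own output shape: filter keeps or drops whole groups, transform returns one value per input row, and agg returns one row per group. The sketch below shows this on a toy frame. The values and the `MIN_RIDES` threshold are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Toy data (hypothetical values): day A has 3 rides, day B has 1 ride.
rides = pd.DataFrame({
    "date": ["A", "A", "A", "B"],
    "temperature": [10.0, 20.0, 30.0, 50.0],
})

MIN_RIDES = 2  # hypothetical threshold for this toy example

# 1. Filter: keeps every row of each group that passes the test,
#    so whole days disappear and the row count can shrink.
kept = rides.groupby("date").filter(lambda g: len(g) > MIN_RIDES)

# 2. Transform: returns one value per input row, aligned with `kept`,
#    so the daily mean can be compared against each individual ride.
daily_avg = kept.groupby("date")["temperature"].transform("mean")
kept = kept.assign(temp_anomaly=(kept["temperature"] - daily_avg).abs())

# 3. Aggregate: collapses each group into a single summary row.
summary = kept.groupby("date").agg(
    total_rides=("temperature", "size"),
    max_temp_deviation=("temp_anomaly", "max"),
).reset_index()

print(summary)
```

Day B is removed by the filter, every ride on day A receives the same daily mean, and the final frame holds one row for day A.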

In [None]:
import pandas as pd

# Load dataset from the MDA Repository Knowledge Source
def generate_weather_anomaly_report(path: str) -> pd.DataFrame:
    return (
        pd.read_csv(path)
        # 0. INGESTION: parse timestamps and derive the calendar date used for grouping
        .assign(
            starttime=lambda x: pd.to_datetime(x['starttime']),
            date=lambda x: x['starttime'].dt.date
        )
        .loc[:, ['date', 'events', 'temperature', 'wind_speed']]
        
        # 1. FILTER: Data Quality Pruning (The "Bouncer")
        # Rule: Drop any day with 50 rides or fewer (keep days with more than 50).
        # This keeps low-volume days from producing unreliable daily averages.
        .groupby('date')
        .filter(lambda x: len(x) > 50)
        
        # 2. TRANSFORM: Feature Engineering (The "Broadcaster")
        # Rule: Calculate the daily average temp and map it back to every ride.
        # This allows row-level comparison without losing the ride details.
        .assign(
            daily_avg_temp=lambda x: x.groupby('date')['temperature'].transform('mean'),
            temp_anomaly=lambda x: (x['temperature'] - x['daily_avg_temp']).abs()
        )
        
        # 3. AGGREGATE: Final Reporting (The "Collapse")
        # Rule: Collapse the thousands of rides into one summary row per day.
        .groupby('date')
        .agg(
            total_rides=('temperature', 'size'),
            max_temp_deviation=('temp_anomaly', 'max'),
            primary_weather=('events', lambda x: x.value_counts().index[0])
        )
        .reset_index()
        # Sort by the most "anomalous" days
        .sort_values('max_temp_deviation', ascending=False)
    )

# Execution
report = generate_weather_anomaly_report('../data/bikes.csv')

print("--- Production Business Report: Weather Anomalies ---")
report

--- Production Business Report: Weather Anomalies ---


index,date,total_rides,max_temp_deviation,primary_weather
119,2016-06-30,60,9904.795000,mostlycloudy
95,2016-05-23,52,23.823077,partlycloudy
202,2017-05-15,66,22.693939,cloudy
91,2016-04-18,58,21.827586,partlycloudy
105,2016-06-10,87,21.601149,partlycloudy
...,...,...,...,...
139,2016-07-30,51,3.925490,mostlycloudy
75,2015-10-01,52,3.138462,mostlycloudy
179,2016-09-29,53,2.828302,cloudy
175,2016-09-23,69,2.724638,cloudy
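
The top row's deviation of 9904.795 is far outside any plausible temperature range. The source does not say what caused it. One possible explanation, stated here only as an assumption, is a sentinel or corrupt temperature reading in the raw data. Below is a minimal sketch of masking implausible readings before the anomaly is computed. The helper `mask_implausible_temperatures` and the bounds `TEMP_MIN` and `TEMP_MAX` are hypothetical and would need adjusting to the dataset's units.

```python
import pandas as pd

# Hypothetical plausibility bounds; adjust to the dataset's units and climate.
TEMP_MIN, TEMP_MAX = -60.0, 140.0


def mask_implausible_temperatures(df: pd.DataFrame) -> pd.DataFrame:
    """Replace out-of-range temperature readings with NaN (hypothetical helper)."""
    plausible = df["temperature"].between(TEMP_MIN, TEMP_MAX)
    # Series.where keeps values where the condition holds and sets the rest to NaN.
    return df.assign(temperature=df["temperature"].where(plausible))


# Toy data: -9999 stands in for a suspected sentinel value (an assumption).
toy = pd.DataFrame({
    "date": ["d1", "d1", "d1"],
    "temperature": [70.0, 74.0, -9999.0],
})

cleaned = mask_implausible_temperatures(toy)

# The mean skips NaN, so the masked reading no longer distorts the baseline.
daily_avg = cleaned.groupby("date")["temperature"].transform("mean")
max_deviation = (cleaned["temperature"] - daily_avg).abs().max()
print(max_deviation)
```

In the pipeline above, such a helper could be applied with `.pipe(mask_implausible_temperatures)` before the transform step. Masked rides would still count toward `total_rides`, because `size` counts rows regardless of missing values.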
