# 02 - Efficiency Index Development

**Author:** Hector Carbajal  
**Version:** 1.0  
**Last Updated:** 2026-02

---

## Overview
This notebook develops the efficiency scoring algorithm for NEMT rides. We'll:
- Engineer features from cleaned trip data
- Design scoring components (on-time, route, capacity, idle)
- Weight and combine into a composite efficiency index (35% on-time, 25% route, 20% capacity, 20% idle)
- Validate the scoring system

## Inputs
- `data/processed/trips_cleaned.csv` - Cleaned trip data from notebook 01
- `data/processed/drivers.csv` - Driver reference data

## Outputs
- `data/processed/trips_with_efficiency.csv` - Trips with efficiency scores

## Table of Contents
1. [Setup & Data Loading](#setup)
2. [Feature Engineering](#features)
3. [Scoring Components](#components)
4. [Efficiency Index](#index)
5. [Score Validation](#validation)
6. [Driver & Region Aggregation](#aggregation)
7. [Key Findings & Next Steps](#findings)

<a id="setup"></a>
## 1. Setup & Data Loading

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
sns.set_style('whitegrid')

# Project imports
import sys
sys.path.insert(0, str(Path.cwd().parent))
from src.config import RAW_DIR, INTERIM_DIR, PROCESSED_DIR, EFFICIENCY_WEIGHTS

print("✅ Setup complete")
print(f"Efficiency Weights: {EFFICIENCY_WEIGHTS}")

In [None]:
# Load data from notebook 01
from src.config import PROCESSED_DIR

trips_file = PROCESSED_DIR / "trips_cleaned.csv"
drivers_file = PROCESSED_DIR / "drivers.csv"

if not trips_file.exists():
    print("⚠️ Run Notebook 01 first to generate the data!")
    raise FileNotFoundError(f"Missing: {trips_file}")

# Load trips
trips_df = pd.read_csv(
    trips_file,
    parse_dates=["requested_pickup_time", "scheduled_pickup_time", 
                 "actual_pickup_time", "actual_dropoff_time"]
)
print(f"✅ Loaded {len(trips_df):,} trips")

# Load drivers
drivers_df = pd.read_csv(drivers_file)
print(f"✅ Loaded {len(drivers_df):,} drivers")

# Derive is_cancelled flag from cancellation_reason
trips_df['is_cancelled'] = trips_df['cancellation_reason'].notna()

# Filter to completed trips for scoring
completed = trips_df[~trips_df['is_cancelled']].copy()
print(f"📊 Completed trips for scoring: {len(completed):,}")

<a id="features"></a>
## 2. Feature Engineering

Creating derived metrics needed for efficiency scoring.

In [None]:
# Inspect pickup delay (already computed in the cleaned data from notebook 01)
print("Pickup Delay Statistics (minutes):")
print(completed['pickup_delay_minutes'].describe())

# Visualize delay distribution
fig = px.histogram(
    completed,
    x='pickup_delay_minutes',
    nbins=50,
    title='Pickup Delay Distribution',
    labels={'pickup_delay_minutes': 'Delay (minutes)'},
    color_discrete_sequence=['steelblue']
)
fig.add_vline(x=0, line_dash="dash", line_color="green", annotation_text="On Time")
fig.add_vline(x=10, line_dash="dash", line_color="red", annotation_text="Late Threshold")
fig.show()

In [None]:
# Expected vs Actual trip duration
# Expected: distance / 25 mph (urban average) * 60 minutes
completed['expected_duration'] = (completed['distance_miles'] / 25) * 60
completed['duration_deviation'] = completed['trip_duration_minutes'] - completed['expected_duration']
completed['duration_ratio'] = completed['trip_duration_minutes'] / completed['expected_duration'].replace(0, 1)

print("Duration Ratio Statistics:")
print(completed['duration_ratio'].describe())

# Visualize
fig = px.scatter(
    completed.sample(min(1000, len(completed))),
    x='expected_duration',
    y='trip_duration_minutes',
    color='duration_ratio',
    title='Expected vs Actual Trip Duration',
    labels={
        'expected_duration': 'Expected Duration (min)',
        'trip_duration_minutes': 'Actual Duration (min)'
    },
    color_continuous_scale='RdYlGn_r'
)
fig.add_trace(go.Scatter(x=[0, 60], y=[0, 60], mode='lines', name='Perfect', line=dict(dash='dash', color='gray')))
fig.show()
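
The expected-duration arithmetic can be checked by hand on a few rows. The sketch below uses made-up trips (not rows from `trips_cleaned.csv`) and a hypothetical constant name, `ASSUMED_SPEED_MPH`, for the 25 mph urban average assumed above.

```python
import pandas as pd

ASSUMED_SPEED_MPH = 25  # hypothetical name for the 25 mph urban-average assumption

# Illustrative trips only; these values are not from the dataset
sample = pd.DataFrame({
    "distance_miles": [5.0, 10.0, 0.0],
    "trip_duration_minutes": [18.0, 43.2, 4.0],
})

# Expected minutes = miles / mph * 60
sample["expected_duration"] = sample["distance_miles"] / ASSUMED_SPEED_MPH * 60

# A zero-distance trip has an expected duration of 0, so swap in 1 minute
# before dividing (the same guard the notebook uses)
sample["duration_ratio"] = (
    sample["trip_duration_minutes"] / sample["expected_duration"].replace(0, 1)
)

print(sample)
```

A 10-mile trip is expected to take 24 minutes, so an actual duration of 43.2 minutes gives a ratio of 1.8. Note that the zero-distance guard turns any such trip's ratio into its raw duration in minutes, which can look like an extreme outlier.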

In [None]:
# Capacity utilization (if not already calculated)
if 'capacity_utilization' not in completed.columns:
    completed['capacity_utilization'] = completed['num_passengers'] / completed['vehicle_capacity']

print("Capacity Utilization Statistics:")
print(completed['capacity_utilization'].describe())

fig = px.histogram(
    completed,
    x='capacity_utilization',
    nbins=20,
    title='Vehicle Capacity Utilization Distribution',
    labels={'capacity_utilization': 'Utilization (fraction of capacity)'},
    color_discrete_sequence=['teal']
)
fig.show()

In [None]:
# Driver daily productivity (for idle time calculation)
driver_daily = completed.groupby(['driver_id', 'scheduled_date']).agg({
    'trip_id': 'count',
    'trip_duration_minutes': 'sum',
    'distance_miles': 'sum'
}).reset_index()
driver_daily.columns = ['driver_id', 'scheduled_date', 'daily_trips', 'daily_minutes', 'daily_miles']

# Merge back to trips
completed = completed.merge(driver_daily, on=['driver_id', 'scheduled_date'], how='left')

print("Driver Daily Stats:")
print(driver_daily.describe())
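
The per-driver, per-day rollup above could also be written with pandas named aggregation, which sets the output column names in the same call instead of renaming them afterwards. This is a sketch on a tiny made-up frame, assuming the same column names as the cleaned trips; it is an alternative spelling, not a change to the notebook's logic.

```python
import pandas as pd

# Illustrative trips only; column names mirror the notebook's cleaned data
trips = pd.DataFrame({
    "driver_id": ["D1", "D1", "D2"],
    "scheduled_date": ["2026-01-05", "2026-01-05", "2026-01-05"],
    "trip_id": [1, 2, 3],
    "trip_duration_minutes": [30.0, 45.0, 20.0],
    "distance_miles": [8.0, 12.0, 5.0],
})

# Named aggregation: output_name=(input_column, aggregation)
driver_daily = trips.groupby(["driver_id", "scheduled_date"], as_index=False).agg(
    daily_trips=("trip_id", "count"),
    daily_minutes=("trip_duration_minutes", "sum"),
    daily_miles=("distance_miles", "sum"),
)

print(driver_daily)
```

Because the names are attached to each aggregation, reordering or adding a metric cannot silently mislabel a column, which is a risk when assigning `driver_daily.columns` by position.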

<a id="components"></a>
## 3. Scoring Components

Each component is scored 0-100, where higher is better.

In [None]:
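# Component 1: On-Time Score
# 10+ min early = 100, on time (0 delay) = 75, 30+ min late = 0
def calculate_on_time_score(delay_minutes):
    """Convert delay to a 0-100 score, linear between -10 and +30 minutes.

    Missing delays are filled with 0 (on time) and therefore score 75.
    """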
    delay = np.clip(delay_minutes.fillna(0), -10, 30)
    score = 100 - ((delay + 10) / 40 * 100)
    return np.clip(score, 0, 100)

completed['score_on_time'] = calculate_on_time_score(completed['pickup_delay_minutes'])

print("On-Time Score Distribution:")
print(completed['score_on_time'].describe())

fig = px.histogram(completed, x='score_on_time', nbins=30, 
                   title='On-Time Score Distribution (0-100)',
                   color_discrete_sequence=['green'])
fig.show()
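
The linear mapping in this cell can be spot-checked on a few hand-picked delays. The sketch below re-declares the same formula under a hypothetical name, `on_time_score`, so it runs on its own; the delay values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def on_time_score(delay_minutes: pd.Series) -> pd.Series:
    """Same formula as calculate_on_time_score, re-declared for a standalone check."""
    # Missing delays become 0, then delays are limited to the [-10, 30] window
    delay = np.clip(delay_minutes.fillna(0), -10, 30)
    # -10 min maps to 100, +30 min maps to 0, linear in between
    return np.clip(100 - (delay + 10) / 40 * 100, 0, 100)

# Illustrative delays in minutes (negative = early, NaN = missing)
delays = pd.Series([-15, -10, 0, 10, 30, 45, np.nan])
scores = on_time_score(delays)

print(pd.DataFrame({"delay_min": delays, "score": scores}))
```

Under this formula a pickup that is exactly on time scores 75, not 100; the full 100 is reserved for pickups at least 10 minutes early. A missing delay is filled with 0 and so also scores 75, which is worth keeping in mind when reading the histogram.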

In [None]:
# Component 2: Route Efficiency Score
# Compares actual vs expected duration with NEMT-realistic expectations

# NEMT trips include loading/unloading, wait times, and urban traffic
# Industry benchmark: 1.5x-2.0x expected duration is typical
# We recalibrate so median performance scores ~50 (not 0)

NEMT_DURATION_MULTIPLIER = 1.8  # Realistic NEMT baseline (actual / expected)
OPTIMAL_MULTIPLIER = 1.2  # Best-case scenario (minimal delays)

def calculate_route_score(actual, expected):
    """
    Score based on route efficiency relative to NEMT industry norms.
    
    Scoring logic:
    - Ratio at NEMT baseline (1.8x): score = 50 (average)
    - Ratio at or below optimal (1.2x): score = 100 (excellent)
    - Ratio at or above 2.4x: score = 0
    
    This acknowledges that NEMT trips naturally take longer than
    pure distance-based calculations due to operational realities.
    """
    ratio = actual / expected.replace(0, 1)
    
    # Linear interpolation: optimal -> 100, baseline -> 50
    score = 100 - (ratio - OPTIMAL_MULTIPLIER) / (NEMT_DURATION_MULTIPLIER - OPTIMAL_MULTIPLIER) * 50
    return np.clip(score, 0, 100)

completed['score_route'] = calculate_route_score(
    completed['trip_duration_minutes'],
    completed['expected_duration']
)

print(f"📊 Route Efficiency Baseline:")
print(f"   NEMT typical ratio: {NEMT_DURATION_MULTIPLIER}x (scores ~50)")
print(f"   Optimal ratio: {OPTIMAL_MULTIPLIER}x (scores 100)")
print(f"   Actual median ratio: {completed['duration_ratio'].median():.2f}x")

print(f"\nRoute Efficiency Score Distribution:")
print(completed['score_route'].describe())

fig = px.histogram(completed, x='score_route', nbins=30,
                   title='Route Efficiency Score Distribution (0-100, NEMT-Calibrated)',
                   color_discrete_sequence=['blue'])
fig.add_vline(x=50, line_dash="dash", line_color="gray", annotation_text="Industry Baseline")
fig.show()
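
To see where the route score saturates, the interpolation can be evaluated at a handful of ratios. This is a minimal sketch that restates the formula in terms of the ratio directly; `route_score_from_ratio` is a hypothetical helper name, and the ratios are illustrative.

```python
import numpy as np

OPTIMAL = 1.2   # ratio that scores 100 (OPTIMAL_MULTIPLIER in the notebook)
BASELINE = 1.8  # ratio that scores 50 (NEMT_DURATION_MULTIPLIER in the notebook)

def route_score_from_ratio(ratio):
    """Linear interpolation through (OPTIMAL, 100) and (BASELINE, 50), clipped to 0-100."""
    score = 100 - (ratio - OPTIMAL) / (BASELINE - OPTIMAL) * 50
    return np.clip(score, 0, 100)

# Illustrative actual/expected duration ratios
ratios = np.array([1.0, 1.2, 1.5, 1.8, 2.4, 3.0])
route_scores = route_score_from_ratio(ratios)

# The line reaches 0 one more baseline-to-optimal step past the baseline
zero_point = BASELINE + (BASELINE - OPTIMAL)

for r, s in zip(ratios, route_scores):
    print(f"{r:.1f}x -> {s:.1f}")
print(f"Score reaches 0 at about {zero_point:.1f}x")
```

Ratios of 1.2x, 1.8x, and 2.4x land on 100, 50, and 0 respectively (up to floating-point rounding). Anything faster than 1.2x or slower than 2.4x is clipped, so the score cannot distinguish between trips beyond those bounds.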

In [None]:
# Component 3: Capacity Score
# Direct utilization percentage
completed['score_capacity'] = (completed['capacity_utilization'] * 100).clip(0, 100)

print("Capacity Score Distribution:")
print(completed['score_capacity'].describe())

fig = px.histogram(completed, x='score_capacity', nbins=20,
                   title='Capacity Utilization Score Distribution (0-100)',
                   color_discrete_sequence=['purple'])
fig.show()

In [None]:
# Component 4: Idle Time Score (Productivity)
# Recalibrated to use data-driven baseline rather than arbitrary 8-hour shift

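# Calculate median daily productive time as the baseline for "average" performance
# This makes scoring relative to actual fleet performance, not theoretical capacity
# Note: `completed` has one row per trip, so these quantiles are trip-weighted
# (a busy driver-day counts once per trip rather than once per day)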
BASELINE_DAILY_MINUTES = completed['daily_minutes'].median()  # Data-driven baseline
TARGET_DAILY_MINUTES = completed['daily_minutes'].quantile(0.75)  # Top 25% = excellent

print(f"📊 Daily Productivity Baseline:")
print(f"   Median daily trip time: {BASELINE_DAILY_MINUTES:.1f} min")
print(f"   Target (75th pctl): {TARGET_DAILY_MINUTES:.1f} min")

def calculate_idle_score(daily_minutes, baseline=BASELINE_DAILY_MINUTES, target=TARGET_DAILY_MINUTES):
    """
    Score based on productive time relative to fleet performance.
    
    Scoring logic:
    - At median productivity: score = 50 (average performer)
    - At or above the 75th percentile: score = 100 (top performer)
    - Below median: linearly lower, reaching 0 at (2 * baseline - target)
    
    This data-driven approach avoids arbitrary shift assumptions
    and provides meaningful relative rankings within the fleet.
    """
    productive = daily_minutes.fillna(0)
    
    # Linear interpolation: median -> 50 pts, target -> 100 pts
    score = 50 + (productive - baseline) / (target - baseline + 1e-6) * 50
    return np.clip(score, 0, 100)

completed['score_idle'] = calculate_idle_score(completed['daily_minutes'])

print(f"\nIdle Score Distribution:")
print(completed['score_idle'].describe())

fig = px.histogram(completed, x='score_idle', nbins=30,
                   title='Productivity Score Distribution (0-100, Data-Driven)',
                   color_discrete_sequence=['orange'])
fig.add_vline(x=50, line_dash="dash", line_color="gray", annotation_text="Fleet Median")
fig.show()
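
The productivity interpolation is easier to reason about with concrete numbers. The sketch below assumes a hypothetical baseline of 240 minutes and target of 320 minutes (the real values come from the data at run time) and re-declares the formula as `idle_score` so it runs on its own.

```python
import numpy as np

def idle_score(daily_minutes, baseline, target):
    """Linear interpolation through (baseline, 50) and (target, 100), clipped to 0-100."""
    # The small epsilon mirrors the notebook's guard against a zero denominator
    score = 50 + (daily_minutes - baseline) / (target - baseline + 1e-6) * 50
    return np.clip(score, 0, 100)

# Hypothetical fleet statistics, for illustration only
baseline, target = 240.0, 320.0

# Illustrative daily productive minutes
minutes = np.array([100.0, 160.0, 200.0, 240.0, 320.0, 400.0])
idle_scores = idle_score(minutes, baseline, target)

# The score hits 0 as far below the baseline as the target is above it
zero_point = 2 * baseline - target

for m, s in zip(minutes, idle_scores):
    print(f"{m:.0f} min -> {s:.1f}")
print(f"Score reaches 0 at {zero_point:.0f} min")
```

With these assumed values, any driver-day at or below 160 productive minutes scores 0. If the median and the 75th percentile ever coincide, the denominator collapses to the epsilon and every score saturates at 0, 50, or 100, so it may be worth checking that `TARGET_DAILY_MINUTES` is meaningfully above `BASELINE_DAILY_MINUTES`.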

<a id="index"></a>
## 4. Efficiency Index

Combining components with configurable weights.

In [None]:
# Define scoring weights locally (compare with EFFICIENCY_WEIGHTS from src.config, printed during setup)
WEIGHTS = {
    'on_time': 0.35,
    'route_efficiency': 0.25,
    'capacity': 0.20,
    'idle_time': 0.20
}

print("Efficiency Index Weights:")
for component, weight in WEIGHTS.items():
    print(f"  {component}: {weight:.0%}")

In [None]:
# Calculate composite efficiency index
def calculate_efficiency_index(row, weights=WEIGHTS):
    """Calculate weighted efficiency score for a trip."""
    # Handle cancelled trips
    if row.get('is_cancelled', False):
        return 0.0
    
    score = (
        weights['on_time'] * row.get('score_on_time', 0) +
        weights['route_efficiency'] * row.get('score_route', 0) +
        weights['capacity'] * row.get('score_capacity', 0) +
        weights['idle_time'] * row.get('score_idle', 0)
    )
    return round(score, 4)

# Apply to dataset
completed['efficiency_index'] = completed.apply(calculate_efficiency_index, axis=1)

# Summary statistics
print("Efficiency Index Statistics:")
print(completed['efficiency_index'].describe())
print(f"\nMedian: {completed['efficiency_index'].median():.4f}")
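
Since the index is a plain weighted sum, it can also be computed column-wise instead of row by row with `apply`. The sketch below is one possible vectorized form on a made-up two-row frame; `SCORE_COLUMNS` is a hypothetical mapping introduced here to tie each weight key to its score column.

```python
import pandas as pd

WEIGHTS = {"on_time": 0.35, "route_efficiency": 0.25, "capacity": 0.20, "idle_time": 0.20}

# Hypothetical mapping from weight key to score column name
SCORE_COLUMNS = {
    "on_time": "score_on_time",
    "route_efficiency": "score_route",
    "capacity": "score_capacity",
    "idle_time": "score_idle",
}

# Guard: the index only stays on a 0-100 scale if the weights sum to 1
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Illustrative component scores, not rows from the dataset
demo = pd.DataFrame({
    "score_on_time": [75.0, 100.0],
    "score_route": [50.0, 100.0],
    "score_capacity": [25.0, 100.0],
    "score_idle": [50.0, 100.0],
})

# Weighted sum over whole columns; no per-row Python function call
demo["efficiency_index"] = sum(
    weight * demo[SCORE_COLUMNS[name]] for name, weight in WEIGHTS.items()
).round(4)

print(demo)
```

The first row works out to 0.35 * 75 + 0.25 * 50 + 0.20 * 25 + 0.20 * 50 = 53.75. Unlike the row-wise function, this form has no cancelled-trip branch, which is fine here only because `completed` is already filtered to non-cancelled trips.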

In [None]:
# Efficiency index distribution
fig = px.histogram(
    completed, 
    x='efficiency_index',
    nbins=50,
    title='Efficiency Index Distribution',
    labels={'efficiency_index': 'Efficiency Index'}
)
fig.add_vline(x=completed['efficiency_index'].median(), line_dash="dash", 
              annotation_text=f"Median: {completed['efficiency_index'].median():.2f}")
fig.update_layout(showlegend=False)
fig.show()

<a id="validation"></a>
## 5. Score Validation

Correlation analysis and component contribution assessment.

In [None]:
# Component correlation matrix
score_cols = ['score_on_time', 'score_route', 'score_capacity', 'score_idle', 'efficiency_index']
correlation_matrix = completed[score_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Scoring Component Correlations')
plt.tight_layout()
plt.show()

print("\nCorrelation with Efficiency Index:")
print(correlation_matrix['efficiency_index'].drop('efficiency_index').sort_values(ascending=False))

In [None]:
# Component contribution analysis
component_means = completed[['score_on_time', 'score_route', 
                             'score_capacity', 'score_idle']].mean()

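# Reuse WEIGHTS from Section 4 so labels and values stay in sync with the index
weighted_contributions = pd.Series({
    f"On-Time ({WEIGHTS['on_time']:.0%})": component_means['score_on_time'] * WEIGHTS['on_time'],
    f"Route Efficiency ({WEIGHTS['route_efficiency']:.0%})": component_means['score_route'] * WEIGHTS['route_efficiency'],
    f"Capacity ({WEIGHTS['capacity']:.0%})": component_means['score_capacity'] * WEIGHTS['capacity'],
    f"Idle Time ({WEIGHTS['idle_time']:.0%})": component_means['score_idle'] * WEIGHTS['idle_time']
})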

fig = px.bar(
    x=weighted_contributions.index,
    y=weighted_contributions.values,
    title='Average Weighted Contribution to Efficiency Index',
    labels={'x': 'Component', 'y': 'Weighted Score'}
)
fig.update_traces(marker_color=['#2E86AB', '#A23B72', '#F18F01', '#C73E1D'])
fig.show()

print(f"Sum of weighted contributions: {weighted_contributions.sum():.4f}")
print(f"Average efficiency index: {completed['efficiency_index'].mean():.4f}")
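
The two printed numbers should agree because averaging is linear. As a short derivation, with symbols introduced here for illustration: let $s_{ik}$ be the score of trip $i$ on component $k$, $w_k$ the component weight, $N$ the number of completed trips, and $\bar{s}_k$ the mean of component $k$.

$$
\bar{E} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{k} w_k\, s_{ik}
\;=\; \sum_{k} w_k \left(\frac{1}{N}\sum_{i=1}^{N} s_{ik}\right)
\;=\; \sum_{k} w_k\, \bar{s}_k
$$

The identity holds up to the 4-decimal rounding applied to each trip's index, and only if no component score is missing: pandas skips NaN values when taking each column mean, so a gap in one component would make the two sides diverge. A noticeable difference between the two printed values is therefore a hint to look for NaN scores.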

<a id="aggregation"></a>
## 6. Driver & Region Aggregation

Rolling up efficiency scores for operational insights.

In [None]:
# Driver-level efficiency aggregation
driver_efficiency = completed.groupby('driver_id').agg({
    'efficiency_index': ['mean', 'std', 'count'],
    'score_on_time': 'mean',
    'score_route': 'mean',
    'score_capacity': 'mean',
    'score_idle': 'mean'
}).round(4)

driver_efficiency.columns = ['avg_efficiency', 'std_efficiency', 'trip_count',
                              'avg_on_time', 'avg_route', 'avg_capacity', 'avg_idle']
driver_efficiency = driver_efficiency.reset_index()

# Top and bottom performers
print("Top 10 Drivers by Efficiency:")
print(driver_efficiency.nlargest(10, 'avg_efficiency')[['driver_id', 'avg_efficiency', 'trip_count']])

print("\nBottom 10 Drivers by Efficiency:")
print(driver_efficiency.nsmallest(10, 'avg_efficiency')[['driver_id', 'avg_efficiency', 'trip_count']])

In [None]:
# Driver efficiency distribution
fig = px.histogram(
    driver_efficiency,
    x='avg_efficiency',
    nbins=30,
    title='Driver Average Efficiency Distribution',
    labels={'avg_efficiency': 'Average Efficiency Index'}
)
fig.add_vline(x=driver_efficiency['avg_efficiency'].median(), line_dash="dash",
              annotation_text=f"Median: {driver_efficiency['avg_efficiency'].median():.2f}")
fig.show()

In [None]:
# Region-level efficiency
region_efficiency = completed.groupby('region').agg({
    'efficiency_index': ['mean', 'std', 'count'],
    'score_on_time': 'mean',
    'score_route': 'mean'
}).round(4)

region_efficiency.columns = ['avg_efficiency', 'std_efficiency', 'trip_count', 
                              'avg_on_time', 'avg_route']
region_efficiency = region_efficiency.reset_index().sort_values('avg_efficiency', ascending=False)

fig = px.bar(
    region_efficiency,
    x='region',
    y='avg_efficiency',
    color='trip_count',
    title='Regional Efficiency Comparison',
    labels={'avg_efficiency': 'Average Efficiency', 'trip_count': 'Trip Count'}
)
fig.show()

print(region_efficiency)

<a id="findings"></a>
## 7. Key Findings & Next Steps

### Summary

1. **Efficiency Index Range**: Trip-level scores span from 0 (cancelled) to 100 (optimal)
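2. **Component Balance**: On-time performance carries the largest weight (35%); the correlations in Section 5 show how strongly each component actually drives the index
3. **Driver Variability**: Average efficiency differs across drivers; the top-10 and bottom-10 tables in Section 6 show the spread
4. **Regional Patterns**: Average efficiency differs by region; the regional comparison in Section 6 ranks regions from highest to lowest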

### Recommendations

- **Target low-performing drivers** for training or route reassignment
- **Investigate high-efficiency drivers** to identify best practices
- **Monitor regional trends** for operational adjustments

### Next Steps

- **Notebook 03**: Test routing simulation strategies
- **Notebook 04**: Evaluate simulation results
- **Dashboard**: Interactive exploration in Streamlit app

In [None]:
# Export processed data for downstream notebooks
from src.config import PROCESSED_DIR

# Merge efficiency scores back to full dataframe
trips_with_scores = trips_df.merge(
    completed[['trip_id', 'efficiency_index', 'score_on_time', 'score_route', 
               'score_capacity', 'score_idle']],
    on='trip_id',
    how='left'
)

# Cancelled trips have no scores after the left merge, so fill them with 0
# (this also zeroes any completed trip whose score is NaN)
score_cols = ['efficiency_index', 'score_on_time', 'score_route', 
              'score_capacity', 'score_idle']
trips_with_scores[score_cols] = trips_with_scores[score_cols].fillna(0)

# Save to processed
output_path = PROCESSED_DIR / 'trips_with_efficiency.csv'
trips_with_scores.to_csv(output_path, index=False)
print(f"✅ Saved {len(trips_with_scores)} trips with efficiency scores to {output_path}")

# Re-save the driver table next to the scored trips (overwrites the input file with the same contents)
drivers_df.to_csv(PROCESSED_DIR / 'drivers.csv', index=False)
print(f"✅ Saved drivers to {PROCESSED_DIR / 'drivers.csv'}")

print(f"\n✅ Data ready for notebook 03!")
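
One optional safeguard for the export step is to have pandas verify that `trip_id` is unique on both sides of the merge, so a duplicate key cannot silently multiply rows. The sketch below uses made-up frames (`all_trips`, `scored`) and the `validate` argument of `DataFrame.merge`; it is a suggestion, not part of the notebook's current export.

```python
import pandas as pd

# Illustrative frames only; trip 2 is cancelled and therefore unscored
all_trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "cancellation_reason": [None, "no_show", None],
})
scored = pd.DataFrame({"trip_id": [1, 3], "efficiency_index": [53.75, 81.2]})

# validate="one_to_one" raises MergeError if trip_id repeats on either side
merged = all_trips.merge(scored, on="trip_id", how="left", validate="one_to_one")
merged["efficiency_index"] = merged["efficiency_index"].fillna(0)

# Demonstrate the failure mode with a deliberately duplicated score row
dupes = pd.concat([scored, scored.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    all_trips.merge(dupes, on="trip_id", how="left", validate="one_to_one")
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print(f"Duplicate keys caught: {duplicate_detected}")
```

With the check in place, the merged frame keeps exactly one row per trip, and the cancelled trip ends up with an index of 0 after the fill.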