# 01 - Exploratory Data Analysis: NEMT Rides

## Overview
This notebook explores the synthetic NEMT (Non-Emergency Medical Transport) ride data to understand:
- Data distributions and quality
- Temporal patterns in ride requests
- Geographic coverage
- Key operational metrics baseline

## Table of Contents
1. [Setup & Data Loading](#setup)
2. [Data Overview](#overview)
3. [Temporal Analysis](#temporal)
4. [Geographic Analysis](#geographic)
5. [Trip Type Analysis](#trip-types)
6. [Cancellation Analysis](#cancellations)
7. [Key Findings & Next Steps](#findings)

<a id="setup"></a>
## 1. Setup & Data Loading

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

# Configure display
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Project imports
import sys
sys.path.insert(0, str(Path.cwd().parent))
from src.config import RAW_DIR, PROCESSED_DIR

print("✅ Setup complete")

✅ Setup complete


In [2]:
# Load or generate data
from src.data_generation import generate_trips, generate_drivers, save_raw_data
from src.config import RAW_DIR, PROCESSED_DIR

# Ensure directories exist
RAW_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Generate trips if not exists
trips_file = RAW_DIR / "trips.csv"
if not trips_file.exists():
    print("Generating synthetic trip data...")
    trips_df = generate_trips()
    save_raw_data(trips_df, "trips.csv")
else:
    trips_df = pd.read_csv(trips_file, parse_dates=[
        "requested_pickup_time", "scheduled_pickup_time", 
        "actual_pickup_time", "actual_dropoff_time"
    ])
    print(f"✅ Loaded {len(trips_df):,} trips")

# Generate drivers if not exists
drivers_file = RAW_DIR / "drivers.csv"
if not drivers_file.exists():
    print("Generating synthetic driver data...")
    drivers_df = generate_drivers()
    save_raw_data(drivers_df, "drivers.csv")
else:
    drivers_df = pd.read_csv(drivers_file)
    print(f"✅ Loaded {len(drivers_df):,} drivers")

print("\nData Summary:")
print(f"  Trips: {len(trips_df):,}")
print(f"  Drivers: {len(drivers_df):,}")
print(f"  Regions: {trips_df['region'].nunique()}")

✅ Loaded 5,000 trips
✅ Loaded 150 drivers

Data Summary:
  Trips: 5,000
  Drivers: 150
  Regions: 5


<a id="overview"></a>
## 2. Data Overview

Let's examine the structure, data types, and basic statistics of our dataset.

In [3]:
# Dataset shape and structure
print(f"Dataset Shape: {trips_df.shape[0]:,} rows × {trips_df.shape[1]} columns\n")
print("Column Types:")
print(trips_df.dtypes)

Dataset Shape: 5,000 rows × 19 columns

Column Types:
trip_id                          object
member_id                        object
driver_id                        object
pickup_lat                      float64
pickup_lng                      float64
dropoff_lat                     float64
dropoff_lng                     float64
requested_pickup_time    datetime64[ns]
scheduled_pickup_time    datetime64[ns]
actual_pickup_time       datetime64[ns]
actual_dropoff_time      datetime64[ns]
distance_miles                  float64
trip_type                        object
vehicle_capacity                  int64
num_passengers                    int64
late_pickup_flag                 object
late_dropoff_flag                object
cancellation_reason              object
region                           object
dtype: object
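
Note that `late_pickup_flag` and `late_dropoff_flag` load as `object` rather than `bool`, most likely because some rows have no value for them. Whether `mean()` works on an object column depends on the actual values and on the pandas version, so it may be safer to convert the flags to numbers before computing late rates. The following is a minimal sketch on toy data. It assumes the flags hold `True`/`False` (or the strings `"True"`/`"False"`) with missing values for some trips. `flag_to_float` and the trip type labels are hypothetical and not part of `src`.

```python
import pandas as pd

def flag_to_float(series: pd.Series) -> pd.Series:
    """Map True/False (or their string forms) to 1.0/0.0; anything else becomes NaN."""
    mapping = {True: 1.0, False: 0.0, "True": 1.0, "False": 0.0}
    return series.map(mapping)

# Toy frame standing in for trips_df; the None row mimics a trip with no flag
demo = pd.DataFrame({
    "trip_type": ["dialysis", "dialysis", "therapy", "therapy"],
    "late_pickup_flag": pd.Series([True, False, "True", None], dtype=object),
})
demo["late_pickup_flag"] = flag_to_float(demo["late_pickup_flag"])

# mean() skips NaN, so rows without a flag do not dilute the late rate
late_rate = demo.groupby("trip_type")["late_pickup_flag"].mean()
print(late_rate)
```

Applied to the real data, the same call would be `trips_df['late_pickup_flag'] = flag_to_float(trips_df['late_pickup_flag'])`, run once before any late-rate calculation.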


In [None]:
# First few rows
trips_df.head()

In [None]:
# Missing values analysis
missing = trips_df.isnull().sum()
missing_pct = (missing / len(trips_df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).query('`Missing Count` > 0')

if len(missing_df) > 0:
    print("Columns with Missing Values:")
    display(missing_df)
else:
    print("✅ No missing values in the dataset")

In [None]:
# Descriptive statistics for numeric columns
trips_df.describe()

In [None]:
# Unique value counts for categorical columns
categorical_cols = ['trip_type', 'region', 'cancellation_reason']
for col in categorical_cols:
    print(f"\n{col.upper()} Distribution:")
    print(trips_df[col].value_counts())

<a id="temporal"></a>
## 3. Temporal Analysis

Understanding when rides occur helps identify peak demand periods and potential capacity constraints.

In [None]:
# Extract temporal features
trips_df['scheduled_date'] = trips_df['scheduled_pickup_time'].dt.date
trips_df['scheduled_hour'] = trips_df['scheduled_pickup_time'].dt.hour
trips_df['day_of_week'] = trips_df['scheduled_pickup_time'].dt.day_name()

# Trips by hour of day
hourly_counts = trips_df.groupby('scheduled_hour').size()

fig = px.bar(
    x=hourly_counts.index,
    y=hourly_counts.values,
    labels={'x': 'Hour of Day', 'y': 'Number of Trips'},
    title='Trip Volume by Hour of Day'
)
fig.update_layout(xaxis=dict(tickmode='linear'))
fig.show()

In [None]:
# Trips by day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_counts = trips_df['day_of_week'].value_counts().reindex(day_order)

fig = px.bar(
    x=daily_counts.index,
    y=daily_counts.values,
    labels={'x': 'Day of Week', 'y': 'Number of Trips'},
    title='Trip Volume by Day of Week',
    color=daily_counts.values,
    color_continuous_scale='Blues'
)
fig.show()

In [None]:
# Heatmap: Hour vs Day of Week
pivot = trips_df.pivot_table(
    index='day_of_week',
    columns='scheduled_hour',
    values='trip_id',
    aggfunc='count'
).reindex(day_order)

fig = px.imshow(
    pivot,
    labels=dict(x="Hour of Day", y="Day of Week", color="Trip Count"),
    title="Trip Volume Heatmap: Day of Week × Hour",
    color_continuous_scale='YlOrRd',
    aspect='auto'
)
fig.show()
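
To back the charts in this section with numbers, one option is to rank hours by their share of total trip volume. This is a minimal sketch on toy timestamps. The cutoff of three hours is an arbitrary assumption, not a definition of peak demand used elsewhere in this project.

```python
import pandas as pd

# Toy scheduled pickup times standing in for trips_df['scheduled_pickup_time']
times = pd.to_datetime([
    "2024-01-01 09:15", "2024-01-01 09:40", "2024-01-01 10:05",
    "2024-01-01 13:30", "2024-01-02 09:10", "2024-01-02 16:45",
])
hours = pd.Series(times).dt.hour

# Share of all trips scheduled in each hour, largest first
hourly_share = hours.value_counts(normalize=True).sort_values(ascending=False)

# Keep the three busiest hours (arbitrary cutoff for illustration)
peak_hours = hourly_share.head(3)
print(peak_hours)
```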

<a id="geographic"></a>
## 4. Geographic Analysis

Analyzing the spatial distribution of pickups and dropoffs.

In [None]:
# Pickup locations scatter plot
fig = px.scatter(
    trips_df.sample(min(1000, len(trips_df)), random_state=42),  # Sample for performance; fixed seed for reproducibility
    x='pickup_lng',
    y='pickup_lat',
    color='region',
    title='Pickup Location Distribution by Region',
    labels={'pickup_lng': 'Longitude', 'pickup_lat': 'Latitude'},
    opacity=0.6
)
fig.update_layout(height=500)
fig.show()

In [None]:
# Distance distribution
fig = px.histogram(
    trips_df,
    x='distance_miles',
    nbins=50,
    title='Trip Distance Distribution',
    labels={'distance_miles': 'Distance (miles)', 'count': 'Number of Trips'},
    color_discrete_sequence=['steelblue']
)
fig.add_vline(x=trips_df['distance_miles'].median(), line_dash="dash", line_color="red",
              annotation_text=f"Median: {trips_df['distance_miles'].median():.1f} mi")
fig.show()

print(f"Distance Statistics:")
print(f"  Mean: {trips_df['distance_miles'].mean():.2f} miles")
print(f"  Median: {trips_df['distance_miles'].median():.2f} miles")
print(f"  Max: {trips_df['distance_miles'].max():.2f} miles")
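
Because the data carries both coordinates and `distance_miles`, a quick consistency check is to compare the recorded distance with the straight-line (great-circle) distance between pickup and dropoff. The sketch below uses the haversine formula on toy rows. It assumes `distance_miles` is a road distance, which the data does not state, in which case it should rarely be shorter than the straight line. `haversine_miles` is a hypothetical helper and the 0.01 mile tolerance is arbitrary.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# Toy rows standing in for trips_df
demo = pd.DataFrame({
    "pickup_lat": [40.0, 40.0], "pickup_lng": [-75.0, -75.0],
    "dropoff_lat": [40.1, 40.0], "dropoff_lng": [-75.0, -75.0],
    "distance_miles": [8.5, 0.0],
})
demo["straight_line_miles"] = haversine_miles(
    demo["pickup_lat"], demo["pickup_lng"],
    demo["dropoff_lat"], demo["dropoff_lng"],
)

# Rows where the recorded distance is shorter than the straight line deserve a look
suspicious = demo[demo["distance_miles"] < demo["straight_line_miles"] - 0.01]
print(demo[["distance_miles", "straight_line_miles"]])
```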

In [None]:
# Trips by region
region_stats = trips_df.groupby('region').agg({
    'trip_id': 'count',
    'distance_miles': 'mean',
    'late_pickup_flag': 'mean'
}).round(3)
region_stats.columns = ['Trip Count', 'Avg Distance (mi)', 'Late Pickup Rate']
region_stats = region_stats.sort_values('Trip Count', ascending=False)

display(region_stats)

fig = px.pie(
    values=region_stats['Trip Count'],
    names=region_stats.index,
    title='Trip Distribution by Region'
)
fig.show()

<a id="trip-types"></a>
## 5. Trip Type Analysis

Medical transport trips vary by appointment type. Let's analyze patterns by trip type.

In [None]:
# Trip type distribution
trip_type_counts = trips_df['trip_type'].value_counts()

fig = px.bar(
    x=trip_type_counts.index,
    y=trip_type_counts.values,
    title='Trip Volume by Appointment Type',
    labels={'x': 'Trip Type', 'y': 'Number of Trips'},
    color=trip_type_counts.values,
    color_continuous_scale='Viridis'
)
fig.show()

In [None]:
# Late pickup rates by trip type
# Filter to completed trips only
completed = trips_df[trips_df['cancellation_reason'].isna()]

late_by_type = completed.groupby('trip_type')['late_pickup_flag'].mean().sort_values(ascending=False) * 100

fig = px.bar(
    x=late_by_type.index,
    y=late_by_type.values,
    title='Late Pickup Rate by Trip Type',
    labels={'x': 'Trip Type', 'y': 'Late Pickup Rate (%)'},
    color=late_by_type.values,
    color_continuous_scale='RdYlGn_r'
)
fig.add_hline(y=late_by_type.mean(), line_dash="dash", line_color="gray",
              annotation_text=f"Avg: {late_by_type.mean():.1f}%")
fig.show()

In [None]:
# Vehicle capacity utilization
fig = px.box(
    trips_df,
    x='trip_type',
    y='num_passengers',
    title='Passenger Count by Trip Type',
    labels={'trip_type': 'Trip Type', 'num_passengers': 'Number of Passengers'}
)
fig.show()

# Average capacity utilization
trips_df['capacity_utilization'] = trips_df['num_passengers'] / trips_df['vehicle_capacity']
print(f"\nAverage Capacity Utilization: {trips_df['capacity_utilization'].mean()*100:.1f}%")
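
The ratio above is only meaningful when `vehicle_capacity` is positive and `num_passengers` does not exceed it. A minimal sketch of two guards on toy data follows. Whether zero capacities or over-capacity trips occur in the real dataset is not known from this notebook, so treat both checks as assumptions to verify.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for trips_df
demo = pd.DataFrame({
    "num_passengers": [1, 2, 3, 1],
    "vehicle_capacity": [4, 4, 2, 0],
})

# Treat zero capacity as missing so the ratio is NaN rather than inf
capacity = demo["vehicle_capacity"].replace(0, np.nan)
demo["capacity_utilization"] = demo["num_passengers"] / capacity

# Utilization above 100% points to a data error rather than a full vehicle
over_capacity = demo[demo["capacity_utilization"] > 1]

mean_utilization = demo["capacity_utilization"].mean()  # NaN rows are skipped
print(f"Average Capacity Utilization: {mean_utilization * 100:.1f}%")
```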

<a id="cancellations"></a>
## 6. Cancellation Analysis

Understanding cancellation patterns to identify potential process improvements.

In [None]:
# Cancellation rate
cancelled = trips_df['cancellation_reason'].notna()
cancellation_rate = cancelled.mean() * 100

print(f"Overall Cancellation Rate: {cancellation_rate:.2f}%")
print(f"Completed Trips: {(~cancelled).sum():,}")
print(f"Cancelled Trips: {cancelled.sum():,}")

In [None]:
# Cancellation reasons breakdown
if cancelled.sum() > 0:
    reason_counts = trips_df[cancelled]['cancellation_reason'].value_counts()
    
    fig = px.pie(
        values=reason_counts.values,
        names=reason_counts.index,
        title='Cancellation Reasons',
        hole=0.4
    )
    fig.show()
else:
    print("No cancellations in the dataset")
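
An overall rate can hide differences between segments. The sketch below computes the cancellation rate per region on toy data, using the same rule as the cells above (a trip is cancelled when `cancellation_reason` is present). The region names and the `member_no_show` label are hypothetical placeholders.

```python
import pandas as pd

# Toy rows standing in for trips_df
demo = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "cancellation_reason": [None, "member_no_show", None, None, None],
})

# A trip counts as cancelled when a reason is recorded
demo["cancelled"] = demo["cancellation_reason"].notna()

# Mean of a boolean column is the fraction of True values
rate_by_region = demo.groupby("region")["cancelled"].mean().mul(100).round(1)
print(rate_by_region)
```

Swapping `"region"` for `"trip_type"` gives the same breakdown by appointment type.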

<a id="findings"></a>
## 7. Key Findings & Next Steps

### Summary Statistics

In [None]:
# Summary dashboard
completed_trips = trips_df[trips_df['cancellation_reason'].isna()]

summary = {
    'Total Trips': len(trips_df),
    'Completed Trips': len(completed_trips),
    'Cancellation Rate': f"{(1 - len(completed_trips)/len(trips_df))*100:.1f}%",
    'Unique Drivers': trips_df['driver_id'].nunique(),
    'Unique Members': trips_df['member_id'].nunique(),
    'Regions Covered': trips_df['region'].nunique(),
    'Date Range': f"{trips_df['scheduled_pickup_time'].min().date()} to {trips_df['scheduled_pickup_time'].max().date()}",
    'Avg Distance': f"{trips_df['distance_miles'].mean():.1f} miles",
    'On-Time Rate': f"{(1 - completed_trips['late_pickup_flag'].mean())*100:.1f}%",
    'Avg Capacity Utilization': f"{completed_trips['capacity_utilization'].mean()*100:.1f}%"
}

print("=" * 50)
print("📊 NEMT RIDES - EDA SUMMARY")
print("=" * 50)
for key, value in summary.items():
    print(f"{key:.<30} {value}")

### Key Observations

1. **Temporal Patterns**: Trips peak during mid-morning hours (dialysis appointments) and early afternoon
2. **Geographic Distribution**: Coverage is relatively uniform across regions
3. **Trip Types**: Dialysis is the dominant trip type (~35%), followed by physical therapy
4. **Cancellation**: Low cancellation rate, primarily member no-shows
5. **Capacity**: Average vehicle utilization suggests room for route optimization

### Next Steps

- **Notebook 02**: Feature engineering for efficiency metrics
- **Notebook 03**: Build and validate efficiency scoring algorithm
- **Notebook 04**: Routing simulation and strategy comparison
- **Notebook 05**: Dashboard development and final analysis

In [5]:
# Save cleaned data to processed directory for downstream notebooks
from src.config import PROCESSED_DIR
from src.data_cleaning import clean_trips

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Apply proper cleaning (adds computed columns like pickup_delay_minutes, trip_duration_minutes, etc.)
trips_cleaned = clean_trips(trips_df)
print(f"📊 Cleaned {len(trips_cleaned):,} trips (added computed columns)")

# Save trips (with full cleaning applied)
trips_cleaned.to_csv(PROCESSED_DIR / 'trips_cleaned.csv', index=False)
print(f"✅ Saved trips to {PROCESSED_DIR / 'trips_cleaned.csv'}")

# Save drivers
drivers_df.to_csv(PROCESSED_DIR / 'drivers.csv', index=False)
print(f"✅ Saved drivers to {PROCESSED_DIR / 'drivers.csv'}")

print("\n✅ Data ready for notebook 02!")

📊 Cleaned 5,000 trips (added computed columns)
✅ Saved trips to /Users/hc/Documents/projects/modivcare-rides-efficiency/data/processed/trips_cleaned.csv
✅ Saved drivers to /Users/hc/Documents/projects/modivcare-rides-efficiency/data/processed/drivers.csv

✅ Data ready for notebook 02!
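
CSV does not store dtypes, so the timestamp columns come back as plain strings unless the reader asks for them to be parsed. The sketch below shows the round trip on a toy frame written to a temporary directory. The column list in `DATE_COLS` is an assumption about which columns a downstream notebook would need, and the columns added by `clean_trips` are not covered here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed subset of timestamp columns to restore on load
DATE_COLS = ["scheduled_pickup_time", "actual_pickup_time"]

# Toy frame standing in for trips_cleaned
demo = pd.DataFrame({
    "trip_id": ["T1"],
    "scheduled_pickup_time": pd.to_datetime(["2024-01-01 09:00"]),
    "actual_pickup_time": pd.to_datetime(["2024-01-01 09:07"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "trips_cleaned.csv"
    demo.to_csv(path, index=False)
    plain = pd.read_csv(path)                         # timestamps load as strings
    parsed = pd.read_csv(path, parse_dates=DATE_COLS)  # timestamps restored

print(plain["scheduled_pickup_time"].dtype, parsed["scheduled_pickup_time"].dtype)
```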
