# 06 - Manhattan Subway Ridership Feature Engineering  
Data-Driven Feature Creation Based on Validated Analysis Insights

---

## Notebook Overview

**Objective:**  
Build a 24-feature dataset for ridership prediction from validated temporal and weather-driven patterns.

**Input Data:**  
- `weather_ridership_integrated_2024.parquet`  
- Analysis insights from:
  - `temporal_patterns.json`
  - `weather_correlation_results.json`

**Output:**  
A production-ready feature matrix with 24 feature columns (10 carried over from the integrated dataset, 14 engineered in this notebook), grouped as:

- Temporal features (12)
- Weather features (9)
- Location-based features (3)

**Goal:**  
Give downstream models features that directly encode the effects measured in the earlier analysis: rush hours, weekends, holidays, snow and temperature, and CBD location.

---


In [13]:
# =============================================
# 1. Setup and Configuration
# =============================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Reproducibility
np.random.seed(42)

# Plotting configuration
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.3f}'.format)

# Header
print("\n" + "=" * 60)
print("NOTEBOOK 06: MANHATTAN SUBWAY FEATURE ENGINEERING")
print("=" * 60)
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Objective: Create production-ready feature set based on validated insights")
print("=" * 60)



NOTEBOOK 06: MANHATTAN SUBWAY FEATURE ENGINEERING
Analysis Date: 2025-07-28 19:54:27
Objective: Create production-ready feature set based on validated insights


In [14]:
# =============================================
# 2. Directory Setup and Data Load
# =============================================

from pathlib import Path

# NOTE: Assumes notebook is run from the /notebooks folder
data_dir = Path("../data/processed")
integration_dir = data_dir / "integration"
analysis_dir = data_dir / "analysis"
modeling_dir = data_dir / "modeling"
modeling_dir.mkdir(parents=True, exist_ok=True)

# Log structure
print("\nDirectory structure:")
print(f"  Integration: {integration_dir}")
print(f"  Analysis:    {analysis_dir}")
print(f"  Output:      {modeling_dir}")

# Load integrated weather-ridership dataset
integrated_file = integration_dir / "weather_ridership_integrated_2024.parquet"
if not integrated_file.exists():
    raise FileNotFoundError(f"Integrated dataset not found: {integrated_file.resolve()}")

print("\nLoading integrated dataset...")
df = pd.read_parquet(integrated_file)
df['transit_timestamp'] = pd.to_datetime(df['transit_timestamp'])

# Summary
print("Dataset loaded successfully:")
print(f"  File:         {integrated_file.name}")
print(f"  Shape:        {df.shape}")
print(f"  Date range:   {df['transit_timestamp'].min()} to {df['transit_timestamp'].max()}")
print(f"  Stations:     {df['station_complex_id'].nunique()}")
print(f"  Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")



Directory structure:
  Integration: ..\data\processed\integration
  Analysis:    ..\data\processed\analysis
  Output:      ..\data\processed\modeling

Loading integrated dataset...
Dataset loaded successfully:
  File:         weather_ridership_integrated_2024.parquet
  Shape:        (1052709, 23)
  Date range:   2024-01-01 00:00:00 to 2024-12-31 23:00:00
  Stations:     121
  Memory usage: 389.0 MB


In [15]:
# =============================================
# 3. Load Analysis Insights (Temporal + Weather)
# =============================================

# --- Temporal Insights ---
temporal_file = analysis_dir / "temporal_patterns.json"
temporal_insights = {}

if temporal_file.exists():
    with open(temporal_file, 'r') as f:
        temporal_insights = json.load(f)
    print(f"\nTemporal insights loaded from: {temporal_file.name}")

    key_patterns = temporal_insights.get('key_patterns', {})
    RUSH_HOURS = key_patterns.get('rush_hours', [8, 17])
    WEEKEND_FACTOR = key_patterns.get('weekend_factor', 0.63)
    HOLIDAY_FACTOR = key_patterns.get('holiday_factor', 0.66)
    CBD_ADVANTAGE = key_patterns.get('cbd_advantage', 2.52)

    print(f"  Validated rush hours:     {RUSH_HOURS}")
    print(f"  Weekend factor:           {WEEKEND_FACTOR:.2f}x")
    print(f"  Holiday factor:           {HOLIDAY_FACTOR:.2f}x")
    print(f"  CBD advantage:            {CBD_ADVANTAGE:.2f}x")
else:
    print("\nWarning: Temporal insights not found — using default values")
    RUSH_HOURS = [8, 17]
    WEEKEND_FACTOR = 0.63
    HOLIDAY_FACTOR = 0.66
    CBD_ADVANTAGE = 2.52

# --- Weather Insights ---
weather_files = [
    analysis_dir / "weather_correlation_results.json",
    analysis_dir.parent / "results" / "weather_correlation_analysis" / "weather_correlation_results.json"
]

weather_insights = {}
weather_file_found = None

for weather_file in weather_files:
    if weather_file.exists():
        with open(weather_file, 'r') as f:
            weather_insights = json.load(f)
        weather_file_found = weather_file
        break

if weather_file_found:
    print(f"\nWeather insights loaded from: {weather_file_found.name}")

    precip_effects = weather_insights.get('precipitation_effects', {})
    SNOW_DETERRENT = precip_effects.get('snow', {}).get('deterrent_pct', 31.8)
    RAIN_DETERRENT = precip_effects.get('rain', {}).get('deterrent_pct', 1.2)

    temp_effects = weather_insights.get('temperature_effects', {})
    TEMP_COMFORT_FACTOR = temp_effects.get('comfort_factor', 2.41)

    print(f"  Snow deterrent:            {SNOW_DETERRENT:.1f}%")
    print(f"  Rain deterrent:            {RAIN_DETERRENT:.1f}%")
    print(f"  Temperature comfort:       {TEMP_COMFORT_FACTOR:.2f}x")
else:
    print("\nWeather insights not found — using default values")
    SNOW_DETERRENT = 31.8
    RAIN_DETERRENT = 1.2
    TEMP_COMFORT_FACTOR = 2.41

print("\nAnalysis insights loading complete.")



Temporal insights loaded from: temporal_patterns.json
  Validated rush hours:     [8, 17]
  Weekend factor:           0.63x
  Holiday factor:           0.66x
  CBD advantage:            2.52x

Weather insights not found — using default values

Analysis insights loading complete.
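
The loading cell above reads each constant through chained `.get()` calls and repeats the same defaults in the `else` branches. A minimal sketch of one way to keep every default in a single place is shown below. The helper names `get_nested` and `load_insights` and the `DEFAULTS` table are hypothetical and not part of this notebook; the default values are copied from the cell above.

```python
import json
from pathlib import Path

# Hypothetical defaults table; values copied from the fallbacks in the cell above.
DEFAULTS = {
    ("key_patterns", "rush_hours"): [8, 17],
    ("key_patterns", "weekend_factor"): 0.63,
    ("precipitation_effects", "snow", "deterrent_pct"): 31.8,
}

def get_nested(data, path, default):
    """Walk `path` through nested dicts; return `default` if any key is missing."""
    node = data
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def load_insights(file_path):
    """Return the parsed JSON, or an empty dict when the file does not exist."""
    file_path = Path(file_path)
    if not file_path.exists():
        return {}
    with open(file_path, "r") as f:
        return json.load(f)

# With a missing file, every lookup falls back to its default.
insights = load_insights("does_not_exist.json")
resolved = {path[-1]: get_nested(insights, path, default) for path, default in DEFAULTS.items()}
print(resolved)
```

Because a missing file yields an empty dict, the "file not found" and "key not found" cases go through the same lookup path.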


In [16]:
# =============================================
# 4. Target Feature Set Declaration
# =============================================

import json
from itertools import chain

# Feature set defined based on validated insights
TARGET_FEATURES = {
    'temporal_raw': [
        'hour', 'day_of_week', 'month'
    ],
    'temporal_cyclical': [
        'hour_sin', 'hour_cos',
        'dow_sin', 'dow_cos',
        'month_sin', 'month_cos'
    ],
    'temporal_derived': [
        'is_rush_hour', 'is_weekend', 'is_holiday'
    ],
    'weather': [
        'temp', 'has_snow', 'has_rain',
        'temp_category', 'is_freezing', 'is_hot',
        'humidity', 'wind_speed', 'feels_like'
    ],
    'location': [
        'is_cbd', 'latitude', 'longitude'
    ]
}

# Flatten the nested dictionary into a list
ALL_TARGET_FEATURES = list(chain.from_iterable(TARGET_FEATURES.values()))
TOTAL_FEATURES = len(ALL_TARGET_FEATURES)

# Assert all are strings
assert all(isinstance(f, str) for f in ALL_TARGET_FEATURES), "Non-string feature name found"

# Display
print("\nUpdated target feature set defined:")
print(f"  Total features: {TOTAL_FEATURES}\n")
for category, features in TARGET_FEATURES.items():
    print(f"  {category:<20}: {len(features)} features")

print(f"\nAll target features:\n{ALL_TARGET_FEATURES}")

# Save to disk for use in API / model deployment
required_path = modeling_dir / "required_features.json"
with open(required_path, 'w') as f:
    json.dump(ALL_TARGET_FEATURES, f, indent=2)

print(f"\nSaved required features to: {required_path}")



Updated target feature set defined:
  Total features: 24

  temporal_raw        : 3 features
  temporal_cyclical   : 6 features
  temporal_derived    : 3 features
  weather             : 9 features
  location            : 3 features

All target features:
['hour', 'day_of_week', 'month', 'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos', 'is_rush_hour', 'is_weekend', 'is_holiday', 'temp', 'has_snow', 'has_rain', 'temp_category', 'is_freezing', 'is_hot', 'humidity', 'wind_speed', 'feels_like', 'is_cbd', 'latitude', 'longitude']

Saved required features to: ..\data\processed\modeling\required_features.json
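
`required_features.json` is saved for use by the API and deployed model. As an illustration of how a consumer might use it, the sketch below checks one input record against the list. The function `check_payload` is hypothetical, and the three-name list stands in for the real 24-name file.

```python
import json

def check_payload(payload, required):
    """Return (missing, unexpected) feature names for one input record."""
    missing = [name for name in required if name not in payload]
    unexpected = sorted(set(payload) - set(required))
    return missing, unexpected

# Stand-in for the contents of required_features.json (shortened for the example).
required = json.loads('["hour", "is_weekend", "temp"]')

missing, unexpected = check_payload({"hour": 8, "temp": 12.5, "station": "A"}, required)
print(missing)     # ['is_weekend']
print(unexpected)  # ['station']
```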


In [17]:
# =============================================
# 5. Assess Current Feature Availability
# =============================================

print("\nFeature Availability Assessment")
print("=" * 40)

available_features = {}
missing_features = {}
object_type_issues = []

for category, features in TARGET_FEATURES.items():
    available = [f for f in features if f in df.columns]
    missing = [f for f in features if f not in df.columns]
    
    available_features[category] = available
    missing_features[category] = missing

    print(f"\n{category.upper():<22}")
    print(f"  Available: {len(available)}/{len(features)} → {available if available else 'None'}")
    if missing:
        print(f"  Missing:   {missing}")
    else:
        print(f"  All features available ✓")

# Check for object-type columns
for col in ALL_TARGET_FEATURES:
    if col in df.columns:
        if df[col].dtype == "object":
            object_type_issues.append(col)

# Overall summary
total_available = sum(len(v) for v in available_features.values())
total_missing = sum(len(v) for v in missing_features.values())
completion_pct = (total_available / TOTAL_FEATURES) * 100

print("\nOVERALL STATUS")
print("-" * 40)
print(f"  Available: {total_available}/{TOTAL_FEATURES}")
print(f"  Missing:   {total_missing}/{TOTAL_FEATURES}")
print(f"  Completion: {completion_pct:.1f}%")

if total_missing > 0:
    print(f"\nFeature engineering required for {total_missing} features.")

if object_type_issues:
    print("\nThe following features are present but have dtype 'object' — check encoding:")
    for col in object_type_issues:
        print(f"  - {col}")



Feature Availability Assessment

TEMPORAL_RAW          
  Available: 3/3 → ['hour', 'day_of_week', 'month']
  All features available ✓

TEMPORAL_CYCLICAL     
  Available: 0/6 → None
  Missing:   ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos']

TEMPORAL_DERIVED      
  Available: 0/3 → None
  Missing:   ['is_rush_hour', 'is_weekend', 'is_holiday']

WEATHER               
  Available: 4/9 → ['temp', 'humidity', 'wind_speed', 'feels_like']
  Missing:   ['has_snow', 'has_rain', 'temp_category', 'is_freezing', 'is_hot']

LOCATION              
  Available: 3/3 → ['is_cbd', 'latitude', 'longitude']
  All features available ✓

OVERALL STATUS
----------------------------------------
  Available: 10/24
  Missing:   14/24
  Completion: 41.7%

Feature engineering required for 14 features.


In [18]:
# =============================================
# 6. Temporal Feature Engineering
# =============================================

print("\nTemporal Feature Engineering")
print("=" * 40)

temporal_features_created = 0

# Raw time components
if 'hour' not in df.columns:
    df['hour'] = df['transit_timestamp'].dt.hour
    print("Created: hour")
    temporal_features_created += 1

if 'day_of_week' not in df.columns:
    df['day_of_week'] = df['transit_timestamp'].dt.dayofweek
    print("Created: day_of_week")
    temporal_features_created += 1

if 'month' not in df.columns:
    df['month'] = df['transit_timestamp'].dt.month
    print("Created: month")
    temporal_features_created += 1

# Cyclical encoding helper
def create_cyclical_features(series, max_val):
    sin = np.sin(2 * np.pi * series / max_val)
    cos = np.cos(2 * np.pi * series / max_val)
    return sin, cos

# Hour encoding
if 'hour_sin' not in df.columns or 'hour_cos' not in df.columns:
    df['hour_sin'], df['hour_cos'] = create_cyclical_features(df['hour'], 24)
    print("Created: hour_sin, hour_cos")
    temporal_features_created += 2

# Day-of-week encoding
if 'dow_sin' not in df.columns or 'dow_cos' not in df.columns:
    df['dow_sin'], df['dow_cos'] = create_cyclical_features(df['day_of_week'], 7)
    print("Created: dow_sin, dow_cos")
    temporal_features_created += 2

# Month encoding
if 'month_sin' not in df.columns or 'month_cos' not in df.columns:
    df['month_sin'], df['month_cos'] = create_cyclical_features(df['month'], 12)
    print("Created: month_sin, month_cos")
    temporal_features_created += 2

# Derived flags
if 'is_rush_hour' not in df.columns:
    df['is_rush_hour'] = df['hour'].isin(RUSH_HOURS).astype(int)
    print(f"Created: is_rush_hour (rush hours = {RUSH_HOURS})")
    temporal_features_created += 1

if 'is_weekend' not in df.columns:
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    print("Created: is_weekend")
    temporal_features_created += 1

# Holiday flag
if 'is_holiday' not in df.columns:
    try:
        import holidays
        us_holidays = holidays.US(years=2024)
        holiday_dates = {d.strftime('%Y-%m-%d') for d in us_holidays.keys()}
        print(f"Using holidays package: {len(holiday_dates)} US federal holidays")
    except ImportError:
        print("Using hardcoded holiday fallback (holidays package not found)")
        holiday_dates = {
            '2024-01-01', '2024-01-15', '2024-02-19', '2024-05-27', '2024-06-19',
            '2024-07-04', '2024-09-02', '2024-10-14', '2024-11-11', '2024-11-28', '2024-12-25'
        }

    df['date_str'] = df['transit_timestamp'].dt.strftime('%Y-%m-%d')
    df['is_holiday'] = df['date_str'].isin(set(holiday_dates)).astype(int)
    df.drop(columns='date_str', inplace=True)
    print(f"Created: is_holiday ({len(holiday_dates)} dates)")
    temporal_features_created += 1

print(f"\nTotal temporal features created: {temporal_features_created}")



Temporal Feature Engineering
Created: hour_sin, hour_cos
Created: dow_sin, dow_cos
Created: month_sin, month_cos
Created: is_rush_hour (rush hours = [8, 17])
Created: is_weekend
Using holidays package: 11 US federal holidays
Created: is_holiday (11 dates)

Total temporal features created: 9
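
The reason for adding sine/cosine pairs next to the raw `hour`, `day_of_week`, and `month` columns is easiest to see with numbers. As raw integers, hour 23 and hour 0 are 23 units apart; on the unit circle they are as close as any other pair of adjacent hours. A small standard-library sketch, using the same formula as `create_cyclical_features` (the function name `cyclical` is only for this example):

```python
import math

def cyclical(value, period):
    """Map a value on a cycle of length `period` to (sin, cos) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Raw encoding: midnight and 23:00 look maximally far apart.
raw_gap = abs(23 - 0)

# Cyclical encoding: Euclidean distance between the two (sin, cos) points.
encoded_gap_23_0 = math.dist(cyclical(23, 24), cyclical(0, 24))
encoded_gap_11_12 = math.dist(cyclical(11, 24), cyclical(12, 24))

print(raw_gap)                       # 23
print(round(encoded_gap_23_0, 3))    # 0.261
print(round(encoded_gap_11_12, 3))   # 0.261, the same as any other adjacent pair
```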


In [19]:
# =============================================
# 7. Weather Feature Engineering
# =============================================

print("\nWeather Feature Engineering")
print("=" * 40)

weather_features_created = 0

# Ensure raw temperature exists
if 'temp' in df.columns:
    print("Raw temperature feature available: temp")
else:
    print("Warning: Raw temperature feature missing — critical for real-time prediction")

# Precipitation indicators
if 'has_rain' not in df.columns and 'rain_1h' in df.columns:
    df['has_rain'] = (df['rain_1h'] > 0).astype(int)
    print("Created: has_rain")
    weather_features_created += 1

if 'has_snow' not in df.columns and 'snow_1h' in df.columns:
    df['has_snow'] = (df['snow_1h'] > 0).astype(int)
    print("Created: has_snow")
    weather_features_created += 1

# Temperature category (ordinal for ML)
if 'temp_category' in df.columns:
    if df['temp_category'].dtype == 'object':
        temp_mapping = {'freezing': 0, 'cold': 1, 'cool': 2, 'warm': 3, 'hot': 4}
        df['temp_category'] = df['temp_category'].map(temp_mapping).astype('int8')
        print("Re-encoded: temp_category (string to ordinal)")
        weather_features_created += 1
elif 'temp' in df.columns:
    temp_bins = [-np.inf, 0, 10, 20, 30, np.inf]
    temp_labels = ['freezing', 'cold', 'cool', 'warm', 'hot']
    temp_cat = pd.cut(df['temp'], bins=temp_bins, labels=temp_labels)
    df['temp_category'] = temp_cat.cat.codes.astype('int8')
    print("Created: temp_category (ordinal 0=freezing to 4=hot)")
    weather_features_created += 1

# Extreme temperature flags
if 'is_freezing' not in df.columns and 'temp' in df.columns:
    df['is_freezing'] = (df['temp'] < 0).astype(int)
    print("Created: is_freezing (temp < 0°C)")
    weather_features_created += 1

if 'is_hot' not in df.columns and 'temp' in df.columns:
    df['is_hot'] = (df['temp'] > 30).astype(int)
    print("Created: is_hot (temp > 30°C)")
    weather_features_created += 1

# Boolean columns type enforcement
for col in ['has_snow', 'has_rain', 'is_freezing', 'is_hot']:
    if col in df.columns:
        df[col] = df[col].astype('int8')

# Validate core weather variables
core_weather = ['humidity', 'wind_speed', 'feels_like']
available_weather = [col for col in core_weather if col in df.columns]
missing_weather = [col for col in core_weather if col not in df.columns]

print("\nCore weather features:")
print(f"  Available: {available_weather}")
if missing_weather:
    print(f"  Missing:   {missing_weather}")
    print("  Warning: Missing features may reduce model accuracy")

print(f"\nWeather features created or encoded: {weather_features_created}")



Weather Feature Engineering
Raw temperature feature available: temp
Created: has_rain
Created: has_snow
Created: temp_category (ordinal 0=freezing to 4=hot)
Created: is_freezing (temp < 0°C)
Created: is_hot (temp > 30°C)

Core weather features:
  Available: ['humidity', 'wind_speed', 'feels_like']

Weather features created or encoded: 5
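
One boundary detail is worth checking when reading these features together. `pd.cut` uses right-closed intervals by default, so a reading of exactly 0°C falls in the `freezing` bin, while `is_freezing` uses a strict `temp < 0` and stays 0. The sketch below reproduces the bins and flags from the cell above on a handful of made-up temperatures chosen to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Made-up readings placed on and around the bin edges.
temps = pd.Series([-5.0, 0.0, 0.1, 30.0, 30.1])

bins = [-np.inf, 0, 10, 20, 30, np.inf]
labels = ['freezing', 'cold', 'cool', 'warm', 'hot']

# right=True is the pandas default: intervals are (low, high].
category = pd.cut(temps, bins=bins, labels=labels)
codes = category.cat.codes

is_freezing = (temps < 0).astype('int8')
is_hot = (temps > 30).astype('int8')

print(pd.DataFrame({
    'temp': temps,
    'temp_category': codes,
    'is_freezing': is_freezing,
    'is_hot': is_hot,
}))
```

At 0.0 the category code is 0 (`freezing`) but `is_freezing` is 0; at 30.0 the category is `warm` and `is_hot` is 0, so the upper boundary is consistent.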


In [20]:
# =============================================
# 8. Location Feature Validation
# =============================================

print("\nLocation Feature Validation")
print("=" * 40)

location_features_available = 0

# CBD classification check
if 'is_cbd' in df.columns:
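    # 'int' resolves to a platform-dependent width, so test for any integer dtype instead
    if not pd.api.types.is_integer_dtype(df['is_cbd']):
        df['is_cbd'] = df['is_cbd'].astype('int8')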
    total_stations = df['station_complex_id'].nunique()
    cbd_stations = df[df['is_cbd'] == 1]['station_complex_id'].nunique()
    cbd_percentage = (cbd_stations / total_stations * 100) if total_stations > 0 else 0

    print("CBD classification available:")
    print(f"  {cbd_stations} CBD stations out of {total_stations} total ({cbd_percentage:.1f}%)")
    location_features_available += 1
else:
    print("Warning: CBD feature missing — known to improve model accuracy")
    print(f"  Expected advantage: {CBD_ADVANTAGE:.2f}x (from analysis)")

# Latitude and longitude check
has_lat = 'latitude' in df.columns
has_lon = 'longitude' in df.columns

if has_lat and has_lon:
    if df['latitude'].isnull().all() or df['longitude'].isnull().all():
        print("Warning: lat/lon columns exist but contain only nulls")
    else:
        lat_range = f"{df['latitude'].min():.3f} to {df['latitude'].max():.3f}"
        lon_range = f"{df['longitude'].min():.3f} to {df['longitude'].max():.3f}"
        print("\nGeographic coordinates available:")
        print(f"  Latitude range:  {lat_range}")
        print(f"  Longitude range: {lon_range}")
        location_features_available += 2
else:
    print("\nWarning: Geographic coordinates missing")
    print("  lat/lon required for spatial effects, zone mapping, or weather proximity interpolation")

# Final summary
print(f"\nLocation features available: {location_features_available}/3")



Location Feature Validation
CBD classification available:
  65 CBD stations out of 121 total (53.7%)

Geographic coordinates available:
  Latitude range:  40.703 to 40.875
  Longitude range: -74.014 to -73.910

Location features available: 3/3


In [21]:
# =============================================
# 9. Feature Validation and Modeling Dataset Assembly
# =============================================

print("\nFeature Quality Validation")
print("=" * 40)

# Assess final feature set presence
final_available = [f for f in ALL_TARGET_FEATURES if f in df.columns]
final_missing = [f for f in ALL_TARGET_FEATURES if f not in df.columns]

print("Final Feature Availability:")
print(f"  Available: {len(final_available)}/{TOTAL_FEATURES}")
if final_missing:
    print(f"  Missing:   {final_missing}")
    print(f"  Warning:   {len(final_missing)} features still missing")

# Identifier columns
base_identifiers = ['station_complex_id', 'station_complex', 'transit_timestamp']
geo_identifiers = ['latitude', 'longitude']
target_column = ['ridership']

# Deduplicate geo from features
final_available_no_geo = [f for f in final_available if f not in geo_identifiers]

# Final modeling column list
modeling_columns = base_identifiers + geo_identifiers + target_column + final_available_no_geo

# Handle missing identifier columns
missing_identifiers = [col for col in base_identifiers + geo_identifiers if col not in df.columns]
if missing_identifiers:
    print(f"\nWarning: Missing identifier columns: {missing_identifiers}")
    available_identifiers = [col for col in base_identifiers + geo_identifiers if col in df.columns]
    modeling_columns = available_identifiers + target_column + final_available_no_geo

# Create modeling-ready DataFrame
df_modeling = df[modeling_columns].copy()

# Summary
print(f"\nModeling Columns Summary:")
print(f"  Total columns:       {len(modeling_columns)}")
print(f"  Base identifiers:    {len(base_identifiers)}")
print(f"  Geo identifiers:     {len(geo_identifiers)}")
print(f"  Other features:      {len(final_available_no_geo)}")
print(f"  Target column:       1")
print(f"  Final shape:         {df_modeling.shape}")

# Validate cyclical ranges
print("\nCyclical Feature Validation:")
cyclical_features = [f for f in final_available if f.endswith('_sin') or f.endswith('_cos')]
for feature in cyclical_features:
    if feature in df_modeling.columns:
        min_val = df_modeling[feature].min()
        max_val = df_modeling[feature].max()
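        if 'dow' in feature:
            # A 7-value cycle never hits every extreme: dow_sin peaks near +/-0.975 and
            # dow_cos bottoms out at cos(6*pi/7), about -0.901, so use a 0.90 threshold.
            in_range = -1.01 <= min_val <= -0.90 and 0.90 <= max_val <= 1.01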
        else:
            in_range = -1.01 <= min_val <= -0.98 and 0.98 <= max_val <= 1.01
        status = "Valid" if in_range else "Warning"
        print(f"  {feature:<12}: [{min_val:.3f}, {max_val:.3f}] - {status}")

# Validate binary features
print("\nBinary Feature Validation:")
binary_features = [f for f in final_available if f.startswith('is_') or f.startswith('has_')]
for feature in binary_features:
    if feature in df_modeling.columns:
        unique_vals = sorted(df_modeling[feature].dropna().unique())
        status = "Valid" if unique_vals == [0, 1] else "Warning"
        print(f"  {feature:<15}: {unique_vals} - {status}")

# Check for missing values
print("\nMissing Value Check:")
missing_counts = df_modeling[final_available].isnull().sum()
total_missing = missing_counts.sum()

if total_missing == 0:
    print("  No missing values detected — Valid")
else:
    print(f"  Warning: {int(total_missing)} missing values found:")
    for feature, count in missing_counts[missing_counts > 0].items():
        print(f"    {feature:<15}: {count}")



Feature Quality Validation
Final Feature Availability:
  Available: 24/24

Modeling Columns Summary:
  Total columns:       28
  Base identifiers:    3
  Geo identifiers:     2
  Other features:      22
  Target column:       1
  Final shape:         (1052709, 28)

Cyclical Feature Validation:
  hour_sin    : [-1.000, 1.000] - Valid
  hour_cos    : [-1.000, 1.000] - Valid
  dow_sin     : [-0.975, 0.975] - Valid
  month_sin   : [-1.000, 1.000] - Valid
  month_cos   : [-1.000, 1.000] - Valid

Binary Feature Validation:
  is_rush_hour   : [np.int64(0), np.int64(1)] - Valid
  is_weekend     : [np.int64(0), np.int64(1)] - Valid
  is_holiday     : [np.int64(0), np.int64(1)] - Valid
  has_snow       : [np.int8(0), np.int8(1)] - Valid
  has_rain       : [np.int8(0), np.int8(1)] - Valid
  is_freezing    : [np.int8(0), np.int8(1)] - Valid
  is_hot         : [np.int8(0), np.int8(1)] - Valid
  is_cbd         : [np.int64(0), np.int64(1)] - Valid

Missing Value Check:
  No missing values detected — Valid
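
The cyclical validation above compares each encoded column's observed minimum and maximum against fixed thresholds. A discrete cycle only reaches the extremes that its sampled angles land on, so the attainable range depends on the period. A minimal sketch that computes the attainable range directly (the helper name `expected_range` is hypothetical, not part of this notebook):

```python
import math

def expected_range(values, period, func):
    """Min and max of func(2*pi*v/period) over the discrete values actually present."""
    encoded = [func(2 * math.pi * v / period) for v in values]
    return min(encoded), max(encoded)

dow_sin = expected_range(range(7), 7, math.sin)         # day_of_week is 0..6
dow_cos = expected_range(range(7), 7, math.cos)
hour_cos = expected_range(range(24), 24, math.cos)      # hour is 0..23
month_sin = expected_range(range(1, 13), 12, math.sin)  # month is 1..12

print([round(x, 3) for x in dow_sin])    # [-0.975, 0.975]
print([round(x, 3) for x in dow_cos])    # [-0.901, 1.0]
print([round(x, 3) for x in hour_cos])   # [-1.0, 1.0]
print([round(x, 3) for x in month_sin])  # [-1.0, 1.0]
```

Hour and month land exactly on ±1 because 24 and 12 are divisible by 4; a period of 7 does not, which is why the day-of-week columns need their own bounds.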

In [22]:
# =============================================
# 10. Pattern Validation Against Analysis Insights
# =============================================

print("\nPattern Validation Against Analysis Insights")
print("=" * 40)

validation_results = {}

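# Rush hour pattern
# Note: the expected factor (2.36) is hardcoded below; unlike the other constants,
# it is not among the values loaded in section 3.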
if 'is_rush_hour' in df_modeling.columns:
    rush_avg = df_modeling[df_modeling['is_rush_hour'] == 1]['ridership'].mean()
    non_rush_avg = df_modeling[df_modeling['is_rush_hour'] == 0]['ridership'].mean()
    observed_rush_factor = round(rush_avg / non_rush_avg, 2)
    rush_valid = 2.0 <= observed_rush_factor <= 3.0

    validation_results['rush_hour'] = {
        'expected': 2.36,
        'observed': observed_rush_factor,
        'valid': rush_valid
    }

    print("Rush Hour Validation:")
    print(f"  Expected: ~2.36x")
    print(f"  Observed: {observed_rush_factor:.2f}x")
    print(f"  Status:   {'Valid' if rush_valid else 'Warning'}")

# Weekend vs. weekday
if 'is_weekend' in df_modeling.columns:
    weekend_avg = df_modeling[df_modeling['is_weekend'] == 1]['ridership'].mean()
    weekday_avg = df_modeling[df_modeling['is_weekend'] == 0]['ridership'].mean()
    observed_weekend_factor = round(weekend_avg / weekday_avg, 2)
    weekend_valid = 0.55 <= observed_weekend_factor <= 0.75

    validation_results['weekend'] = {
        'expected': round(WEEKEND_FACTOR, 2),
        'observed': observed_weekend_factor,
        'valid': weekend_valid
    }

    print("\nWeekend Pattern Validation:")
    print(f"  Expected: ~{WEEKEND_FACTOR:.2f}x")
    print(f"  Observed: {observed_weekend_factor:.2f}x")
    print(f"  Status:   {'Valid' if weekend_valid else 'Warning'}")

# Snow deterrent
if 'has_snow' in df_modeling.columns:
    snow_records = df_modeling['has_snow'].sum()
    if snow_records > 100:
        snow_avg = df_modeling[df_modeling['has_snow'] == 1]['ridership'].mean()
        no_snow_avg = df_modeling[df_modeling['has_snow'] == 0]['ridership'].mean()
        observed_snow_factor = round(snow_avg / no_snow_avg, 2)
        expected_snow_factor = round(1 - (SNOW_DETERRENT / 100), 2)
        snow_valid = 0.6 <= observed_snow_factor <= 0.8

        validation_results['snow'] = {
            'expected': expected_snow_factor,
            'observed': observed_snow_factor,
            'valid': snow_valid
        }

        print("\nSnow Deterrent Validation:")
        print(f"  Expected: ~{expected_snow_factor:.2f}x")
        print(f"  Observed: {observed_snow_factor:.2f}x")
        print(f"  Status:   {'Valid' if snow_valid else 'Warning'}")
    else:
        print(f"\nSnow validation skipped: Insufficient snow records ({snow_records})")
        validation_results['snow'] = {'status': 'insufficient_data'}

# Temperature extremes
if 'is_freezing' in df_modeling.columns and 'is_hot' in df_modeling.columns:
    freezing_records = df_modeling['is_freezing'].sum()
    hot_records = df_modeling['is_hot'].sum()

    if freezing_records > 50 and hot_records > 50:
        freezing_avg = df_modeling[df_modeling['is_freezing'] == 1]['ridership'].mean()
        hot_avg = df_modeling[df_modeling['is_hot'] == 1]['ridership'].mean()
        temp_comfort_observed = round(hot_avg / freezing_avg, 2)
        temp_valid = 2.0 <= temp_comfort_observed <= 3.0

        validation_results['temperature'] = {
            'expected': round(TEMP_COMFORT_FACTOR, 2),
            'observed': temp_comfort_observed,
            'valid': temp_valid
        }

        print("\nTemperature Comfort Validation:")
        print(f"  Expected: ~{TEMP_COMFORT_FACTOR:.2f}x")
        print(f"  Observed: {temp_comfort_observed:.2f}x (hot vs. freezing)")
        print(f"  Status:   {'Valid' if temp_valid else 'Warning'}")
    else:
        print("\nTemperature validation skipped: Insufficient extreme weather samples")
        validation_results['temperature'] = {'status': 'insufficient_data'}

# CBD location premium
if 'is_cbd' in df_modeling.columns:
    cbd_avg = df_modeling[df_modeling['is_cbd'] == 1]['ridership'].mean()
    non_cbd_avg = df_modeling[df_modeling['is_cbd'] == 0]['ridership'].mean()
    observed_cbd_factor = round(cbd_avg / non_cbd_avg, 2)
    cbd_valid = 2.0 <= observed_cbd_factor <= 3.5

    validation_results['cbd'] = {
        'expected': round(CBD_ADVANTAGE, 2),
        'observed': observed_cbd_factor,
        'valid': cbd_valid
    }

    print("\nCBD Advantage Validation:")
    print(f"  Expected: ~{CBD_ADVANTAGE:.2f}x")
    print(f"  Observed: {observed_cbd_factor:.2f}x")
    print(f"  Status:   {'Valid' if cbd_valid else 'Warning'}")

# Summary
valid_patterns = sum(1 for r in validation_results.values() if r.get('valid'))
total_patterns = sum(1 for r in validation_results.values() if 'valid' in r)
skipped_patterns = sum(1 for r in validation_results.values() if r.get('status') == 'insufficient_data')

print(f"\nValidation Summary: {valid_patterns}/{total_patterns} patterns validated successfully")
if skipped_patterns:
    print(f"Skipped validations due to insufficient data: {skipped_patterns}")

# Safe serialization with helper
def convert_to_json_serializable(obj):
    if isinstance(obj, (np.integer, np.int64)): return int(obj)
    if isinstance(obj, (np.floating, np.float64)): return float(obj)
    if isinstance(obj, (np.bool_)): return bool(obj)
    if isinstance(obj, pd.Timestamp): return obj.isoformat()
    if isinstance(obj, np.ndarray): return obj.tolist()
    if isinstance(obj, dict): return {k: convert_to_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list): return [convert_to_json_serializable(i) for i in obj]
    return obj

# Save validation results
validation_path = modeling_dir / "feature_pattern_validation.json"
with open(validation_path, 'w') as f:
    json.dump(convert_to_json_serializable(validation_results), f, indent=2)

print(f"Saved pattern validation results to: {validation_path}")



Pattern Validation Against Analysis Insights
Rush Hour Validation:
  Expected: ~2.36x
  Observed: 2.35x
  Status:   Valid

Weekend Pattern Validation:
  Expected: ~0.63x
  Observed: 0.63x
  Status:   Valid

Snow Deterrent Validation:
  Expected: ~0.68x
  Observed: 0.68x
  Status:   Valid

Temperature Comfort Validation:
  Expected: ~2.41x
  Observed: 2.40x (hot vs. freezing)
  Status:   Valid

CBD Advantage Validation:
  Expected: ~2.52x
  Observed: 2.52x
  Status:   Valid

Validation Summary: 5/5 patterns validated successfully
Saved pattern validation results to: ..\data\processed\modeling\feature_pattern_validation.json
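
Each check above follows the same steps: average ridership where a flag is 1, divide by the average where it is 0, and test the ratio against a tolerance band. A minimal sketch of that pattern as two small helpers is shown below. The names `flag_ratio` and `check_pattern` are hypothetical, and the five-row frame is made-up toy data, not notebook output.

```python
import pandas as pd

def flag_ratio(frame, flag, target="ridership"):
    """Mean of `target` where flag == 1 divided by the mean where flag == 0."""
    means = frame.groupby(flag)[target].mean()
    return means.loc[1] / means.loc[0]

def check_pattern(observed, low, high):
    """Package an observed ratio with a pass/fail against the [low, high] band."""
    return {"observed": round(float(observed), 2), "valid": bool(low <= observed <= high)}

# Toy data: weekday mean is 100, weekend mean is 63.
toy = pd.DataFrame({
    "is_weekend": [0, 0, 0, 1, 1],
    "ridership": [100, 120, 80, 60, 66],
})

result = check_pattern(flag_ratio(toy, "is_weekend"), 0.55, 0.75)
print(result)  # {'observed': 0.63, 'valid': True}
```

Casting to `float` and `bool` inside `check_pattern` keeps NumPy scalar types out of the result, so it can be passed to `json.dump` without conversion.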


In [23]:
# =============================================
# 11. Save Final Modeling Dataset and Metadata
# =============================================

print("\nSaving Modeling Dataset")
print("=" * 40)

# Export modeling dataset
output_file = modeling_dir / "subway_ridership_modeling_features.parquet"
df_modeling.to_parquet(output_file, index=False)

file_size_mb = output_file.stat().st_size / (1024 * 1024)
print(f"Modeling dataset saved:")
print(f"  File:       {output_file.name}")
print(f"  Size:       {file_size_mb:.1f} MB")
print(f"  Shape:      {df_modeling.shape}")
print(f"  Features:   {len(final_available)}")

# ---------------------------------------------
# convert_to_json_serializable() is reused from section 10 (defined once there)

# ---------------------------------------------
# Metadata dictionary
feature_metadata = {
    'creation_date': datetime.now().isoformat(),
    'source_dataset': str(integrated_file),
    'total_features': int(len(final_available)),
    'features_by_category': {
        category: [f for f in features if f in final_available]
        for category, features in TARGET_FEATURES.items()
    },
    'dataset_info': {
        'shape': [int(x) for x in df_modeling.shape],
        'date_range': [
            df_modeling['transit_timestamp'].min().isoformat(),
            df_modeling['transit_timestamp'].max().isoformat()
        ],
        'stations': int(df_modeling['station_complex_id'].nunique())
    },
    'identifier_columns': base_identifiers + geo_identifiers,
    'validated_patterns': {
        'rush_hours': RUSH_HOURS,
        'weekend_factor': round(WEEKEND_FACTOR, 2),
        'holiday_factor': round(HOLIDAY_FACTOR, 2),
        'cbd_advantage': round(CBD_ADVANTAGE, 2),
        'snow_deterrent_pct': round(SNOW_DETERRENT, 1),
        'rain_deterrent_pct': round(RAIN_DETERRENT, 1),
        'temp_comfort_factor': round(TEMP_COMFORT_FACTOR, 2)
    },
    'validation_results': convert_to_json_serializable(validation_results),
    'feature_list': final_available,
    'new_features_added': [
        'is_freezing - Binary indicator for severe cold impact',
        'is_hot - Binary indicator for optimal temperature conditions',
        'temp - Raw temperature preserved for fine-grained modeling'
    ]
}

# Save metadata to JSON
metadata_file = modeling_dir / "feature_metadata.json"
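with open(metadata_file, 'w') as f:
    # Convert the whole dictionary so any remaining NumPy scalars are serialized safely
    json.dump(convert_to_json_serializable(feature_metadata), f, indent=2)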

print("Feature metadata saved:")
print(f"  File: {metadata_file.name}")



Saving Modeling Dataset
Modeling dataset saved:
  File:       subway_ridership_modeling_features.parquet
  Size:       8.5 MB
  Shape:      (1052709, 28)
  Features:   24
Feature metadata saved:
  File: feature_metadata.json
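
An alternative to converting values recursively before calling `json.dump` is to pass a `default` hook, which the `json` module calls only for objects it cannot encode natively. The sketch below illustrates that approach. The function name `to_builtin` and the sample dictionary are hypothetical, it is not the helper used in the cell above, and `pd.Timestamp` handling is omitted to keep the example NumPy-only.

```python
import json

import numpy as np


def to_builtin(obj):
    """Fallback for json.dumps: invoked only for objects json cannot encode."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.bool_):
        return bool(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")


# Hypothetical metadata fragment containing NumPy types
sample = {
    "total_features": np.int64(24),
    "weekend_factor": np.float32(0.5),
    "is_valid": np.bool_(True),
    "rush_hours": np.array([8, 17]),
}

encoded = json.dumps(sample, default=to_builtin)
decoded = json.loads(encoded)
```

Raising `TypeError` for unknown types makes unsupported values fail loudly at save time rather than passing through unconverted.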


In [24]:
# =============================================
# 12. Feature Engineering Summary
# =============================================

print("\nFeature Engineering Summary")
print("=" * 60)

total_created = temporal_features_created + weather_features_created

# Build summary report string
summary_report = f"""
SUBWAY RIDERSHIP FEATURE ENGINEERING SUMMARY
============================================

OBJECTIVE:
Create production-ready 24-feature dataset for ridership prediction

DATASET TRANSFORMATION:
• Input shape:              {df.shape}
• Output shape:             {df_modeling.shape}
• Features created:         {total_created}
• Features available:       {len(final_available)}/{TOTAL_FEATURES}

FEATURE CATEGORIES:
• Temporal (12):            Raw temporal, cyclical encodings, derived indicators
• Weather (9):              Temperature, precipitation, comfort, and conditions
• Location (3):             CBD classification, latitude, longitude

VALIDATED PATTERNS IMPLEMENTED:
• Rush hours:               {RUSH_HOURS} (Expected factor: 2.36x)
• Weekend effect:           {WEEKEND_FACTOR:.2f}x
• Holiday effect:           {HOLIDAY_FACTOR:.2f}x
• CBD advantage:            {CBD_ADVANTAGE:.2f}x
• Snow deterrent:           {SNOW_DETERRENT:.1f}% reduction
• Rain deterrent:           {RAIN_DETERRENT:.1f}% reduction
• Temperature comfort:      {TEMP_COMFORT_FACTOR:.2f}x

NEW FEATURES BASED ON ANALYSIS:
• is_freezing:              Binary flag for temp < 0°C
• is_hot:                   Binary flag for temp > 30°C
• temp:                     Preserved for real-time predictions

QUALITY ASSURANCE:
• Cyclical features:        Validated in [-1, 1] range
• Binary features:          Validated as 0/1 encoded
• Pattern validation:       Completed using prior analysis benchmarks
• Missing values:           Checked and resolved
• Temperature extremes:     Evaluated for comfort dynamics

PRODUCTION READINESS:
• Real-time weather inputs: Supported
• Geographic interpolation: Enabled via lat/lon
• Station metadata:         Preserved for interface and mapping
• Feature pipeline:         Fully validated and reproducible
• Metadata:                 Stored in JSON with structure and rationale

IDENTIFIER COLUMNS:
• station_complex_id, station_complex
• latitude, longitude
• transit_timestamp

OUTPUT FILES:
• Modeling dataset:         {output_file}
• Feature metadata:         {metadata_file}

STATUS: READY FOR MODEL DEVELOPMENT
Next step: Machine learning model training and evaluation

VALIDATION SUMMARY:
• {valid_patterns}/{total_patterns} analytical patterns validated successfully
"""

# Print summary
print(summary_report)

# Save to text file (explicit UTF-8, since the report contains "•" and "°")
report_file = modeling_dir / "feature_engineering_summary.txt"
with open(report_file, 'w', encoding='utf-8') as f:
    f.write(summary_report)

print(f"Summary report saved: {report_file}")

# Final footer
print("\n" + "=" * 60)
print("FEATURE ENGINEERING COMPLETED SUCCESSFULLY")
print("=" * 60)
print(f"Dataset ready: {len(final_available)}/{TOTAL_FEATURES} features")
print("Complete identifiers: station_complex_id, station_complex, lat/lon, timestamp")
print(f"Output location: {output_file}")
print("Next phase: Model development and validation")
print("=" * 60)



Feature Engineering Summary

SUBWAY RIDERSHIP FEATURE ENGINEERING SUMMARY

OBJECTIVE:
Create production-ready 24-feature dataset for ridership prediction

DATASET TRANSFORMATION:
• Input shape:              (1052709, 37)
• Output shape:             (1052709, 28)
• Features created:         14
• Features available:       24/24

FEATURE CATEGORIES:
• Temporal (12):            Raw temporal, cyclical encodings, derived indicators
• Weather (9):              Temperature, precipitation, comfort, and conditions
• Location (3):             CBD classification, latitude, longitude

VALIDATED PATTERNS IMPLEMENTED:
• Rush hours:               [8, 17] (Expected factor: 2.36x)
• Weekend effect:           0.63x
• Holiday effect:           0.66x
• CBD advantage:            2.52x
• Snow deterrent:           31.8% reduction
• Rain deterrent:           1.2% reduction
• Temperature comfort:      2.41x

NEW FEATURES BASED ON ANALYSIS:
• is_freezing:              Binary flag for temp < 0°C
• is_hot:       