# 05 - Manhattan Subway Weather–Ridership Correlation Analysis  
*Quantifying the Impact of Weather on Subway Demand in Manhattan*

---

## Project Overview

**Objective:**  
Assess how various weather conditions influence subway ridership in Manhattan and determine the most predictive meteorological features for modeling.

**Input Dataset:**  
- `weather_ridership_integrated_2024.parquet`  
- Includes hourly subway ridership, cleaned weather data, and temporal features (rush hours, holidays, etc.)

**Analysis Goals:**
- Measure correlation between weather variables and ridership
- Explore effects of temperature, precipitation, and atmospheric conditions
- Stratify impact by temporal and spatial context (rush hour, weekend, CBD vs non-CBD)
- Identify top weather features for predictive modeling

**Key Research Question:**  
How significantly do weather conditions influence ridership patterns in the Manhattan subway system?

---

In [1]:
# =============================================
# Setup and Configuration
# =============================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.stats import pearsonr, spearmanr
import warnings
warnings.filterwarnings('ignore')

# Plotting configuration
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.3f}'.format)

# Notebook header
print("\n" + "=" * 60)
print("NOTEBOOK 05: WEATHER–RIDERSHIP CORRELATION ANALYSIS")
print("=" * 60)
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Objective: Analyze weather impact on subway ridership patterns")
print("=" * 60)

# Set up directory paths
data_dir = Path("../data/processed")
integration_dir = data_dir / "integration"
analysis_data_dir = data_dir / "analysis"
results_dir = Path("../results")
weather_analysis_dir = results_dir / "weather_correlation_analysis"
weather_analysis_dir.mkdir(parents=True, exist_ok=True)

# Verify data directory
if not integration_dir.exists():
    raise FileNotFoundError(f"Integration directory not found: {integration_dir}")

print("\nDirectory setup complete:")
print(f"  Data directory:         {data_dir}")
print(f"  Integration directory:  {integration_dir}")
print(f"  Results directory:      {weather_analysis_dir}")

# Load temporal analysis results
temporal_results_file = analysis_data_dir / "temporal_patterns.json"
if temporal_results_file.exists():
    with open(temporal_results_file, 'r') as f:
        temporal_insights = json.load(f)
    print(f"  Temporal insights loaded: {temporal_results_file.name}")
else:
    temporal_insights = {}
    print("  Temporal insights not found — using default context")


NOTEBOOK 05: WEATHER–RIDERSHIP CORRELATION ANALYSIS
Analysis Date: 2025-07-28 19:50:41
Objective: Analyze weather impact on subway ridership patterns

Directory setup complete:
  Data directory:         ..\data\processed
  Integration directory:  ..\data\processed\integration
  Results directory:      ..\results\weather_correlation_analysis
  Temporal insights loaded: temporal_patterns.json


In [2]:
# =============================================
# 1. LOAD INTEGRATED WEATHER–RIDERSHIP DATA
# =============================================

print("\n" + "=" * 60)
print("1. LOADING INTEGRATED WEATHER–RIDERSHIP DATA")
print("=" * 60)

# Load the integrated dataset
integrated_file = integration_dir / "weather_ridership_integrated_2024.parquet"
print(f"Loading dataset from: {integrated_file}")

try:
    df = pd.read_parquet(integrated_file)
    df['transit_timestamp'] = pd.to_datetime(df['transit_timestamp'])  # Ensure datetime
    print("Dataset loaded successfully!")
    print(f"  Records:    {len(df):,}")
    print(f"  Columns:    {len(df.columns)}")
    print(f"  Date range: {df['transit_timestamp'].min()} to {df['transit_timestamp'].max()}")
except FileNotFoundError:
    print(f"ERROR: Dataset not found at {integrated_file}")
    raise

# Extract temporal context from prior analysis
if temporal_insights:
    print("\nUsing temporal analysis results:")
    key_patterns = temporal_insights.get('key_patterns', {})
    rush_hours = key_patterns.get('rush_hours', [8, 17])
    weekend_factor = key_patterns.get('weekend_factor', 0.63)
    cbd_advantage = key_patterns.get('cbd_advantage', 2.52)
else:
    print("\nUsing default temporal parameters:")
    rush_hours = [8, 17]
    weekend_factor = 0.63
    cbd_advantage = 2.52

print(f"  Rush hours:        {rush_hours}")
print(f"  Weekend factor:    {weekend_factor:.2f}x")
print(f"  CBD advantage:     {cbd_advantage:.2f}x")

# Add temporal flags
df['is_rush_hour'] = df['hour'].isin(rush_hours)
df['is_weekend'] = df['day_of_week'].isin([5, 6])  # assumes Monday=0, so 5 and 6 are Saturday and Sunday

# Identify available weather columns
core_weather_cols = [
    'temp', 'feels_like', 'dew_point', 'humidity', 'wind_speed', 
    'pressure', 'visibility', 'rain_1h', 'snow_1h', 'weather_main', 
    'weather_description', 'clouds_all'
]
weather_cols = [col for col in core_weather_cols if col in df.columns]
missing_weather = [col for col in core_weather_cols if col not in df.columns]

print("\nWeather features analysis:")
print(f"  Available features: {len(weather_cols)}")
print(f"  Columns:            {weather_cols}")
if missing_weather:
    print(f"  Missing:            {missing_weather}")

# Basic data integrity checks
print("\nDataset characteristics:")
print(f"  Unique stations:      {df['station_complex_id'].nunique()}")
print(f"  Missing ridership:    {df['ridership'].isna().sum()}")
print(f"  Missing weather cells: {df[weather_cols].isna().sum().sum()}")

# Store metadata
weather_analysis_results = {
    'dataset_info': {
        'total_records': len(df),
        'unique_stations': df['station_complex_id'].nunique(),
        'available_weather_features': weather_cols,
        'missing_weather_features': missing_weather,
        'temporal_context': {
            'rush_hours': rush_hours,
            'weekend_factor': weekend_factor,
            'cbd_advantage': cbd_advantage
        }
    }
}

print("\nData loading complete. Ready for weather correlation analysis.")



1. LOADING INTEGRATED WEATHER–RIDERSHIP DATA
Loading dataset from: ..\data\processed\integration\weather_ridership_integrated_2024.parquet
Dataset loaded successfully!
  Records:    1,052,709
  Columns:    23
  Date range: 2024-01-01 00:00:00 to 2024-12-31 23:00:00

Using temporal analysis results:
  Rush hours:        [8, 17]
  Weekend factor:    0.63x
  CBD advantage:     2.52x

Weather features analysis:
  Available features: 12
  Columns:            ['temp', 'feels_like', 'dew_point', 'humidity', 'wind_speed', 'pressure', 'visibility', 'rain_1h', 'snow_1h', 'weather_main', 'weather_description', 'clouds_all']

Dataset characteristics:
  Unique stations:      121
  Missing ridership:    0
  Missing weather cells: 1938290

Data loading complete. Ready for weather correlation analysis.
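
The 1,938,290 missing weather cells are concentrated in three columns rather than spread across the table. The `describe()` output in section 2 reports 162,739 non-null rows for `rain_1h`, 6,794 for `snow_1h`, and 1,050,304 for `visibility`, and those gaps add up to the total exactly. The sketch below reproduces that check on a toy frame. Treating a missing precipitation value as "no precipitation" is an assumption about the upstream weather feed, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the integrated dataset (values are illustrative only).
toy = pd.DataFrame({
    "temp": [3.1, 4.0, 5.2, 6.8],
    "rain_1h": [np.nan, 0.5, np.nan, 1.2],
    "snow_1h": [np.nan, np.nan, np.nan, 0.3],
})

# Per-column missing counts show which columns the missing cells come from.
missing_by_col = toy.isna().sum()
print(missing_by_col[missing_by_col > 0])

# ASSUMPTION: a NaN precipitation value means "none recorded", so 0 is a safe fill.
filled = toy.fillna({"rain_1h": 0.0, "snow_1h": 0.0})

# Arithmetic check against the counts printed by this notebook.
n_rows = 1_052_709
non_null = {"rain_1h": 162_739, "snow_1h": 6_794, "visibility": 1_050_304}
total_missing = sum(n_rows - n for n in non_null.values())
print(total_missing)  # 1938290
```

If the assumption holds, only the 2,405 missing `visibility` values are genuine gaps.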


In [3]:
# =============================================
# 2. WEATHER DATA OVERVIEW AND DISTRIBUTION
# =============================================

print("\n" + "=" * 60)
print("2. WEATHER DATA OVERVIEW AND DISTRIBUTION")
print("=" * 60)

# Summary statistics
if weather_cols:
    print("Weather variable statistics:")
    weather_stats = df[weather_cols].describe()
    print(weather_stats)

    # Check for missing values
    # Use a distinct name: `missing_weather` already holds the list of absent columns from section 1
    missing_counts = df[weather_cols].isnull().sum()
    if missing_counts.sum() > 0:
        print("\nMissing weather data:")
        for col, count in missing_counts.items():
            if count > 0:
                pct = (count / len(df)) * 100
                print(f"  {col:<15}: {count:,} missing ({pct:.1f}%)")
    else:
        print("\nNo missing weather data found.")

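# Precipitation checks
# Each row is one station-hour, so the counts below are station-hour records, not distinct clock hours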
print("\nPrecipitation occurrence:")
if 'rain_1h' in df.columns:
    rain_hours = (df['rain_1h'] > 0).sum()
    rain_pct = (rain_hours / len(df)) * 100
    print(f"  Rain hours: {rain_hours:,} ({rain_pct:.1f}%)")

if 'snow_1h' in df.columns:
    snow_hours = (df['snow_1h'] > 0).sum()
    snow_pct = (snow_hours / len(df)) * 100
    print(f"  Snow hours: {snow_hours:,} ({snow_pct:.1f}%)")

# Frequency of main weather conditions
if 'weather_main' in df.columns:
    print("\nTop weather conditions:")
    weather_counts = df['weather_main'].value_counts().head(6)
    for condition, count in weather_counts.items():
        pct = (count / len(df)) * 100
        print(f"  {condition:<15}: {count:,} hours ({pct:.1f}%)")

# Temperature binning
temp_col = 'temp'
if temp_col in df.columns:
    print(f"\nTemperature analysis ({temp_col} in °C):")
    temp_stats = df[temp_col].describe()
    print(f"  Range:    {temp_stats['min']:.1f}°C to {temp_stats['max']:.1f}°C")
    print(f"  Average:  {temp_stats['mean']:.1f}°C")

    # Bin temperatures into categories
    df['temp_category'] = pd.cut(
        df[temp_col],
        bins=[-np.inf, 0, 10, 20, 30, np.inf],
        labels=['freezing', 'cold', 'cool', 'warm', 'hot']
    )

    temp_dist = df['temp_category'].value_counts().sort_index()
    print("\nTemperature distribution by category:")
    for cat, count in temp_dist.items():
        pct = (count / len(df)) * 100
        print(f"  {cat:<10}: {count:,} hours ({pct:.1f}%)")

print("\nWeather overview complete.")



2. WEATHER DATA OVERVIEW AND DISTRIBUTION
Weather variable statistics:
             temp  feels_like   dew_point    humidity  wind_speed    pressure  \
count 1052709.000 1052709.000 1052709.000 1052709.000 1052709.000 1052709.000   
mean       14.185      12.656       7.011      63.565       5.064    1016.548   
std         9.270      11.322       9.381      17.036       2.237       8.133   
min       -10.580     -17.580     -16.260      18.000       0.000     982.000   
25%         6.670       3.400      -0.350      50.000       3.600    1012.000   
50%        14.420      13.570       7.210      63.000       4.630    1016.000   
75%        22.040      21.940      14.560      77.000       6.170    1021.000   
max        35.430      40.040      24.440      96.000      17.490    1049.000   

       visibility    rain_1h  snow_1h  clouds_all  
count 1050304.000 162739.000 6794.000 1052709.000  
mean     9570.141      1.231    0.438      38.905  
std      1542.004      2.884    0.305     
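
`pd.cut` uses right-closed intervals by default, so a reading of exactly 0 °C lands in `freezing` and exactly 10 °C in `cold`. A small sketch with made-up temperatures, reusing the bin edges and labels from the cell above, shows where the boundaries fall:

```python
import numpy as np
import pandas as pd

# Made-up temperatures chosen to sit on and around the bin edges.
temps = pd.Series([-5.0, 0.0, 0.1, 10.0, 25.0, 35.0])

cats = pd.cut(
    temps,
    bins=[-np.inf, 0, 10, 20, 30, np.inf],
    labels=['freezing', 'cold', 'cool', 'warm', 'hot'],
)
print(cats.astype(str).tolist())
# ['freezing', 'freezing', 'cold', 'cold', 'warm', 'hot']
```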

In [4]:
# =============================================
# 3. BASIC WEATHER–RIDERSHIP CORRELATIONS
# =============================================

print("\n" + "=" * 60)
print("3. BASIC WEATHER–RIDERSHIP CORRELATIONS")
print("=" * 60)

# Define weather variables to test
numerical_weather = [
    'temp', 'feels_like', 'dew_point', 'humidity',
    'wind_speed', 'pressure', 'visibility',
    'rain_1h', 'snow_1h'
]
available_numerical = [col for col in numerical_weather if col in df.columns]

print(f"Analyzing correlations for: {available_numerical}")

correlations = {}

# Compute correlation for each weather variable vs. ridership
for var in available_numerical:
    subset = df[[var, 'ridership']].dropna()

    if len(subset) > 1000:
        try:
            pearson_r, pearson_p = pearsonr(subset[var], subset['ridership'])
            spearman_r, spearman_p = spearmanr(subset[var], subset['ridership'])

            correlations[var] = {
                'pearson_r': pearson_r,
                'pearson_p': pearson_p,
                'spearman_r': spearman_r,
                'spearman_p': spearman_p,
                'sample_size': len(subset)
            }
        except Exception as e:
            print(f"Error computing correlation for {var}: {e}")

# Print correlation results
if correlations:
    print("\nWeather–Ridership Correlations:")
    print(f"{'Variable':<15} {'Pearson r':<10} {'P-value':<10} {'Significance':<12}")
    print("-" * 50)

    for var, c in correlations.items():
        if c['pearson_p'] < 0.001:
            sig = "***"
        elif c['pearson_p'] < 0.01:
            sig = "**"
        elif c['pearson_p'] < 0.05:
            sig = "*"
        else:
            sig = "n.s."
        print(f"{var:<15} {c['pearson_r']:<10.4f} {c['pearson_p']:<10.4f} {sig:<12}")

    # Summary insights
    significant_vars = [v for v, c in correlations.items() if c['pearson_p'] < 0.05]
    strongest_positive = max([c['pearson_r'] for c in correlations.values()])
    strongest_negative = min([c['pearson_r'] for c in correlations.values()])

    print("\nKey findings:")
    print(f"  Significant variables: {len(significant_vars)} / {len(correlations)}")
    print(f"  Strongest positive correlation: {strongest_positive:.3f}")
    print(f"  Strongest negative correlation: {strongest_negative:.3f}")

# Store results
weather_analysis_results['correlations'] = correlations
print("\nCorrelation analysis complete.")



3. BASIC WEATHER–RIDERSHIP CORRELATIONS
Analyzing correlations for: ['temp', 'feels_like', 'dew_point', 'humidity', 'wind_speed', 'pressure', 'visibility', 'rain_1h', 'snow_1h']

Weather–Ridership Correlations:
Variable        Pearson r  P-value    Significance
--------------------------------------------------
temp            0.0719     0.0000     ***         
feels_like      0.0655     0.0000     ***         
dew_point       0.0027     0.0051     **          
humidity        -0.1320    0.0000     ***         
wind_speed      0.0828     0.0000     ***         
pressure        -0.0085    0.0000     ***         
visibility      0.0155     0.0000     ***         
rain_1h         0.0214     0.0000     ***         
snow_1h         -0.1277    0.0000     ***         

Key findings:
  Significant variables: 9 / 9
  Strongest positive correlation: 0.083
  Strongest negative correlation: -0.132

Correlation analysis complete.
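
With roughly a million rows, statistical significance says little about practical importance: all nine variables pass p < 0.05, yet the largest |r| is 0.132. The sketch below converts r to variance explained and estimates the smallest |r| that would still be significant at this sample size. It uses a large-sample normal approximation (z = 1.96), which is an assumption made for illustration; the helper names are hypothetical. Note also that the cell drops NaN rows per variable, so the `rain_1h` and `snow_1h` correlations are computed only on the 162,739 and 6,794 rows with recorded precipitation.

```python
import math

def variance_explained(r: float) -> float:
    """Share of ridership variance linearly associated with the variable."""
    return r * r

def smallest_significant_r(n: int, z_crit: float = 1.96) -> float:
    """Approximate |r| threshold for two-sided p < 0.05 (normal approximation)."""
    return z_crit / math.sqrt(n - 2 + z_crit ** 2)

n = 1_052_709  # rows in the integrated dataset
for name, r in [("humidity", -0.1320), ("wind_speed", 0.0828), ("dew_point", 0.0027)]:
    print(f"{name:<12} r={r:+.4f}  r^2={variance_explained(r):.2%}")

print(f"Smallest significant |r| at n={n:,}: {smallest_significant_r(n):.4f}")
```

Even the strongest variable, humidity, accounts for under 2% of the variance in hourly ridership on its own.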


In [5]:
# =============================================
# 4. TEMPERATURE EFFECTS ANALYSIS
# =============================================

print("\n" + "=" * 60)
print("4. TEMPERATURE EFFECTS ANALYSIS")
print("=" * 60)

if 'temp_category' in df.columns:
    print("Ridership by temperature category:")

    # Group and summarize
    temp_analysis = df.groupby('temp_category')['ridership'].agg(['mean', 'count']).round(0)
    temp_analysis.columns = ['avg_ridership', 'hours']
    print(temp_analysis)

    # Identify best and worst categories
    optimal_temp = temp_analysis['avg_ridership'].idxmax()
    worst_temp = temp_analysis['avg_ridership'].idxmin()
    comfort_factor = temp_analysis.loc[optimal_temp, 'avg_ridership'] / temp_analysis.loc[worst_temp, 'avg_ridership']

    print("\nTemperature insights:")
    print(f"  Best for ridership:   {optimal_temp} ({temp_analysis.loc[optimal_temp, 'avg_ridership']:.0f} avg)")
    print(f"  Worst for ridership:  {worst_temp} ({temp_analysis.loc[worst_temp, 'avg_ridership']:.0f} avg)")
    print(f"  Temperature comfort factor: {comfort_factor:.2f}x")

    # Store in results
    weather_analysis_results['temperature_effects'] = {
        'optimal_category': str(optimal_temp),
        'worst_category': str(worst_temp),
        'comfort_factor': float(comfort_factor),
        'category_averages': {
            str(k): float(v) for k, v in temp_analysis['avg_ridership'].to_dict().items()
        }
    }

else:
    print("Temperature categories not available in dataset.")

print("\nTemperature analysis complete.")



4. TEMPERATURE EFFECTS ANALYSIS
Ridership by temperature category:
               avg_ridership   hours
temp_category                       
freezing             434.000   61701
cold                 597.000  325296
cool                 640.000  320166
warm                 704.000  322101
hot                 1045.000   23445

Temperature insights:
  Best for ridership:   hot (1045 avg)
  Worst for ridership:  freezing (434 avg)
  Temperature comfort factor: 2.41x

Temperature analysis complete.
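
These category means are unconditional, so they mix the effect of temperature with the effect of time of day. One possible explanation for the high `hot` average is that hot station-hours cluster in summer afternoons, when ridership is high regardless of weather. This is a hypothesis, not something the notebook tests. A minimal sketch of one way to check it, on toy data with a hypothetical `ridership_index` column, divides each record by the mean for its hour before comparing categories:

```python
import pandas as pd

# Toy station-hour records (illustrative values only).
toy = pd.DataFrame({
    "hour":          [3, 3, 15, 15, 15, 15],
    "temp_category": ["cool", "cool", "cool", "hot", "hot", "hot"],
    "ridership":     [100, 120, 900, 950, 1000, 1050],
})

# Unconditional means mix night-time and afternoon records.
raw_means = toy.groupby("temp_category")["ridership"].mean()
raw_ratio = float(raw_means["hot"] / raw_means["cool"])

# Express each record relative to the mean for its hour, then average by category.
hour_mean = toy.groupby("hour")["ridership"].transform("mean")
toy["ridership_index"] = toy["ridership"] / hour_mean
adjusted = toy.groupby("temp_category")["ridership_index"].mean()
adjusted_ratio = float(adjusted["hot"] / adjusted["cool"])

print(f"raw hot/cool ratio:       {raw_ratio:.2f}x")
print(f"hour-adjusted hot/cool:   {adjusted_ratio:.2f}x")
```

In the toy data the gap shrinks sharply once hour of day is held fixed; whether the same happens in the real dataset is an open question.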


In [6]:
# =============================================
# 5. PRECIPITATION IMPACT ANALYSIS
# =============================================

print("\n" + "=" * 60)
print("5. PRECIPITATION IMPACT ANALYSIS")
print("=" * 60)

precipitation_effects = {}

# Rain analysis
if 'rain_1h' in df.columns:
    df['has_rain'] = df['rain_1h'] > 0
    rain_stats = df.groupby('has_rain')['ridership'].agg(['mean', 'count']).round(0)
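    # Rename instead of assigning the index, so a single-group result cannot raise a length mismatch
    rain_stats = rain_stats.rename(index={False: 'No Rain', True: 'Rain'}).rename_axis(None)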

    print("Rain impact on ridership:")
    print(rain_stats)

    if len(rain_stats) == 2:
        rain_factor = rain_stats.loc['Rain', 'mean'] / rain_stats.loc['No Rain', 'mean']
        rain_deterrent = (1 - rain_factor) * 100

        print("\nRain insights:")
        print(f"  Rain effect factor:  {rain_factor:.3f}x")
        print(f"  Deterrent effect:    {rain_deterrent:.1f}%")

        precipitation_effects['rain'] = {
            'rain_factor': float(rain_factor),
            'deterrent_pct': float(rain_deterrent)
        }

# Snow analysis
if 'snow_1h' in df.columns:
    df['has_snow'] = df['snow_1h'] > 0
    snow_stats = df.groupby('has_snow')['ridership'].agg(['mean', 'count']).round(0)
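    # Rename instead of assigning the index, so a single-group result cannot raise a length mismatch
    snow_stats = snow_stats.rename(index={False: 'No Snow', True: 'Snow'}).rename_axis(None)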

    print("\nSnow impact on ridership:")
    print(snow_stats)

    if len(snow_stats) == 2:
        snow_factor = snow_stats.loc['Snow', 'mean'] / snow_stats.loc['No Snow', 'mean']
        snow_deterrent = (1 - snow_factor) * 100

        print("\nSnow insights:")
        print(f"  Snow effect factor:  {snow_factor:.3f}x")
        print(f"  Deterrent effect:    {snow_deterrent:.1f}%")

        precipitation_effects['snow'] = {
            'snow_factor': float(snow_factor),
            'deterrent_pct': float(snow_deterrent)
        }

# Weather condition group analysis
if 'weather_main' in df.columns:
    print("\nWeather condition impact on ridership:")
    condition_stats = (
        df.groupby('weather_main')['ridership']
        .agg(['mean', 'count'])
        .round(0)
        .query("count >= 100")
        .sort_values('mean', ascending=False)
    )

    print(condition_stats.head(8))

    if len(condition_stats) > 1:
        best_weather = condition_stats.index[0]
        worst_weather = condition_stats.index[-1]
        weather_range = condition_stats.iloc[0]['mean'] / condition_stats.iloc[-1]['mean']

        print("\nWeather condition insights:")
        print(f"  Best:    {best_weather} ({condition_stats.iloc[0]['mean']:.0f} avg)")
        print(f"  Worst:   {worst_weather} ({condition_stats.iloc[-1]['mean']:.0f} avg)")
        print(f"  Weather condition range: {weather_range:.2f}x")

# Store results
weather_analysis_results['precipitation_effects'] = precipitation_effects
print("\nPrecipitation analysis complete.")



5. PRECIPITATION IMPACT ANALYSIS
Rain impact on ridership:
           mean   count
No Rain 645.000  889970
Rain    637.000  162739

Rain insights:
  Rain effect factor:  0.988x
  Deterrent effect:    1.2%

Snow impact on ridership:
           mean    count
No Snow 645.000  1045915
Snow    440.000     6794

Snow insights:
  Snow effect factor:  0.682x
  Deterrent effect:    31.8%

Weather condition impact on ridership:
                mean   count
weather_main                
Smoke        873.000    6687
Haze         761.000   11413
Thunderstorm 699.000    1181
Clouds       692.000  286175
Clear        632.000  516625
Rain         631.000  153504
Drizzle      620.000    9326
Mist         539.000   51183

Weather condition insights:
  Best:    Smoke (873 avg)
  Worst:   Fog (412 avg)
  Weather condition range: 2.12x

Precipitation analysis complete.
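
The effect factor and deterrent figures follow directly from the group means printed above. The sketch below reproduces the arithmetic with a hypothetical helper, `effect_factor`. Two caveats apply: the cell rounds the means before dividing, so the third decimal can differ slightly from an unrounded calculation, and the snow group holds only 6,794 records. The "Worst: Fog" line refers to a condition ranked below the eight rows shown by `head(8)`.

```python
def effect_factor(mean_with: float, mean_without: float) -> tuple[float, float]:
    """Return (ratio of means, percentage reduction relative to the baseline)."""
    factor = mean_with / mean_without
    return factor, (1 - factor) * 100

# Rounded group means taken from the output above.
rain_factor, rain_drop = effect_factor(637, 645)
snow_factor, snow_drop = effect_factor(440, 645)

print(f"rain: {rain_factor:.3f}x, {rain_drop:.1f}% lower")  # rain: 0.988x, 1.2% lower
print(f"snow: {snow_factor:.3f}x, {snow_drop:.1f}% lower")  # snow: 0.682x, 31.8% lower
```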


In [7]:
# =============================================
# 6. WEATHER–TEMPORAL INTERACTION ANALYSIS
# =============================================

print("\n" + "=" * 60)
print("6. WEATHER–TEMPORAL INTERACTION ANALYSIS")
print("=" * 60)

interactions = {}

# Rain sensitivity by time context
if 'has_rain' in df.columns:
    print("Rain sensitivity analysis:")

    # Rush hour vs non-rush
    rush_rain = (
        df.groupby(['is_rush_hour', 'has_rain'])['ridership']
        .mean()
        .unstack()
        .rename(columns={False: 'No Rain', True: 'Rain'})
    )
    rush_rain['rain_factor'] = rush_rain['Rain'] / rush_rain['No Rain']
    rush_rain.index = ['Non-Rush', 'Rush Hour']

    print("\nRain impact by rush hour:")
    print(rush_rain[['rain_factor']].round(3))

    # Weekend vs weekday
    weekend_rain = (
        df.groupby(['is_weekend', 'has_rain'])['ridership']
        .mean()
        .unstack()
        .rename(columns={False: 'No Rain', True: 'Rain'})
    )
    weekend_rain['rain_factor'] = weekend_rain['Rain'] / weekend_rain['No Rain']
    weekend_rain.index = ['Weekday', 'Weekend']

    print("\nRain impact by day type:")
    print(weekend_rain[['rain_factor']].round(3))

    # Save interactions
    interactions['rain_sensitivity'] = {
        'rush_hour_factor': float(rush_rain.loc['Rush Hour', 'rain_factor']),
        'non_rush_factor': float(rush_rain.loc['Non-Rush', 'rain_factor']),
        'weekend_factor': float(weekend_rain.loc['Weekend', 'rain_factor']),
        'weekday_factor': float(weekend_rain.loc['Weekday', 'rain_factor'])
    }

# Store to master results
weather_analysis_results['temporal_interactions'] = interactions

print("\nWeather–temporal interaction analysis complete.")



6. WEATHER–TEMPORAL INTERACTION ANALYSIS
Rain sensitivity analysis:

Rain impact by rush hour:
has_rain   rain_factor
Non-Rush         0.987
Rush Hour        0.996

Rain impact by day type:
has_rain  rain_factor
Weekday         1.006
Weekend         0.990

Weather–temporal interaction analysis complete.
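
All four factors sit within about 1.3% of 1.0, and the weekday factor (1.006) is slightly above 1, meaning mean weekday ridership was marginally higher in rain. The sketch below walks through the same `groupby` and `unstack` steps on a toy frame to show how each factor is derived; the ridership values are invented.

```python
import pandas as pd

# Toy records: two observations per (day type, rain) combination.
toy = pd.DataFrame({
    "is_weekend": [False, False, False, False, True, True, True, True],
    "has_rain":   [False, False, True,  True,  False, False, True, True],
    "ridership":  [800,   820,   810,   830,   500,   520,   450,  470],
})

# Mean ridership per cell, pivoted so rain status becomes the columns.
table = (
    toy.groupby(["is_weekend", "has_rain"])["ridership"]
    .mean()
    .unstack()
    .rename(columns={False: "No Rain", True: "Rain"})
)

# A factor below 1 means lower ridership in rain; above 1 means higher.
table["rain_factor"] = table["Rain"] / table["No Rain"]
table.index = ["Weekday", "Weekend"]
print(table.round(3))
```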


In [8]:
# =============================================
# 7. WEATHER FEATURE ENGINEERING INSIGHTS AND RECOMMENDATIONS
# =============================================

print("\n" + "=" * 60)
print("7. WEATHER FEATURE ENGINEERING INSIGHTS AND RECOMMENDATIONS")
print("=" * 60)

print("Based on comprehensive weather–ridership correlation analysis:\n")

# High-priority weather features
print("HIGH-PRIORITY WEATHER FEATURES:")
effects = weather_analysis_results.get('precipitation_effects', {})
temp_effects = weather_analysis_results.get('temperature_effects', {})

if 'snow' in effects:
    print(f"• has_snow: {effects['snow']['deterrent_pct']:.1f}% ridership reduction")
if 'rain' in effects:
    print(f"• has_rain: {effects['rain']['deterrent_pct']:.1f}% ridership reduction")

if temp_effects:
    print(f"• temp_category: {temp_effects['comfort_factor']:.2f}x range ({temp_effects['worst_category']} to {temp_effects['optimal_category']})")
    print(f"• is_freezing: Severe impact ({temp_effects['category_averages'].get('freezing', 0):.0f} avg ridership)")
    print(f"• is_hot: Optimal conditions ({temp_effects['category_averages'].get('hot', 0):.0f} avg ridership)")

print("\nMODERATE-PRIORITY WEATHER FEATURES:")
corrs = weather_analysis_results.get('correlations', {})
if 'humidity' in corrs:
    print(f"• humidity: Strongest negative correlation (r={corrs['humidity']['pearson_r']:.3f})")
if 'wind_speed' in corrs:
    print(f"• wind_speed: Positive correlation (r={corrs['wind_speed']['pearson_r']:.3f})")
if 'feels_like' in corrs:
    print(f"• feels_like: Human comfort proxy (r={corrs['feels_like']['pearson_r']:.3f})")

print("\nWEATHER–TEMPORAL INTERACTION FEATURES:")
interactions = weather_analysis_results.get('temporal_interactions', {}).get('rain_sensitivity', {})
if interactions:
    print(f"• rain_weekend: Weekend rain sensitivity ({interactions['weekend_factor']:.3f}x)")
    print(f"• rain_rush_hour: Rush hour rain resilience ({interactions['rush_hour_factor']:.3f}x)")
print("• snow_weekend: Enhanced weekend snow impact (leisure travel more affected)")
print("• weather_condition_hour: Time-varying weather condition impacts")

print("\nADVANCED WEATHER FEATURES:")
print("• temperature_comfort_score: Continuous 0–1 scale")
print("• precipitation_intensity: Rain/snow severity levels")
print("• weather_severity_index: Combined discomfort score")
print("• seasonal_temperature_adjustment: Adjust same temp across seasons")

# Quantified impact summary
print("\nQUANTIFIED WEATHER IMPACTS:")
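if 'snow' in effects:
    # Signed change relative to the no-snow baseline (the old "-{...:+}" form printed "-+32%")
    print(f"• Snow deterrent: {effects['snow']['deterrent_pct']:.1f}% reduction ({(effects['snow']['snow_factor'] - 1) * 100:+.0f}% vs. no snow)")
if 'rain' in effects:
    # Signed change relative to the no-rain baseline
    print(f"• Rain deterrent: {effects['rain']['deterrent_pct']:.1f}% reduction ({(effects['rain']['rain_factor'] - 1) * 100:+.0f}% vs. no rain)")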
if temp_effects:
    print(f"• Temperature range: {temp_effects['comfort_factor']:.2f}x between optimal and worst")
if corrs:
    strongest_pos = max([c['pearson_r'] for c in corrs.values()])
    strongest_neg = min([c['pearson_r'] for c in corrs.values()])
    print(f"• Correlation range: {strongest_neg:.3f} to {strongest_pos:.3f}")

# Prioritized feature recommendation structure
weather_feature_recommendations = {
    'critical_weather_features': [
        'has_snow',
        'temp_category',
        'is_freezing',
    ],
    'high_importance_features': [
        'has_rain',
        'humidity',
        'is_hot',
        'feels_like',
    ],
    'interaction_features': [
        'rain_weekend',
        'snow_weekend',
        'rain_rush_hour',
        'seasonal_temperature_adjustment',
    ],
    'advanced_features': [
        'weather_severity_index',
        'precipitation_intensity',
        'temperature_comfort_score',
        'visibility_impact',
    ]
}

print("\nWEATHER FEATURE PRIORITY RANKING:")
print(f"• Critical:          {weather_feature_recommendations['critical_weather_features']}")
print(f"• High importance:   {weather_feature_recommendations['high_importance_features']}")
print(f"• Interaction:       {weather_feature_recommendations['interaction_features']}")
print(f"• Advanced:          {weather_feature_recommendations['advanced_features']}")

print("\nWEATHER MODELING STRATEGY:")
print("1. Start with critical features (e.g., snow, freezing)")
print("2. Add high-importance variables (rain, humidity, comfort)")
print("3. Integrate temporal-weather interactions (weekend rain, etc.)")
print("4. Apply advanced features to boost model generalization")

print("\nEXPECTED FEATURE IMPACT:")
print("• Weather captures seasonal and daily volatility")
print("• Snow and temperature have strongest direct signals")
print("• Interactions reveal behavioral variation across time/context")
print("• Combined with temporal features → holistic model design")

# Save to results
weather_analysis_results['weather_feature_recommendations'] = weather_feature_recommendations

print("\nWeather feature engineering insights complete.")



7. WEATHER FEATURE ENGINEERING INSIGHTS AND RECOMMENDATIONS
Based on comprehensive weather–ridership correlation analysis:

HIGH-PRIORITY WEATHER FEATURES:
• has_snow: 31.8% ridership reduction
• has_rain: 1.2% ridership reduction
• temp_category: 2.41x range (freezing to hot)
• is_freezing: Severe impact (434 avg ridership)
• is_hot: Optimal conditions (1045 avg ridership)

MODERATE-PRIORITY WEATHER FEATURES:
• humidity: Strongest negative correlation (r=-0.132)
• wind_speed: Positive correlation (r=0.083)
• feels_like: Human comfort proxy (r=0.066)

WEATHER–TEMPORAL INTERACTION FEATURES:
• rain_weekend: Weekend rain sensitivity (0.990x)
• rain_rush_hour: Rush hour rain resilience (0.996x)
• snow_weekend: Enhanced weekend snow impact (leisure travel more affected)
• weather_condition_hour: Time-varying weather condition impacts

ADVANCED WEATHER FEATURES:
• temperature_comfort_score: Continuous 0–1 scale
• precipitation_intensity: Rain/snow severity levels
• weather_severity_index: C
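
The binary and interaction features named in this section can be derived from columns the notebook already uses. The following is a minimal sketch, not the project's feature pipeline: the function name `add_weather_flags` is hypothetical, the temperature thresholds reuse the bin edges from section 2, and missing precipitation is assumed to mean none was recorded.

```python
import numpy as np
import pandas as pd

def add_weather_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with binary weather and weather-time interaction flags."""
    out = df.copy()
    # ASSUMPTION: NaN precipitation means "none recorded".
    out["has_rain"] = out["rain_1h"].fillna(0) > 0
    out["has_snow"] = out["snow_1h"].fillna(0) > 0
    # Thresholds match the 'freezing' (<= 0 °C) and 'hot' (> 30 °C) bins.
    out["is_freezing"] = out["temp"] <= 0
    out["is_hot"] = out["temp"] > 30
    # Interaction flags combine a weather flag with a temporal flag.
    out["rain_weekend"] = out["has_rain"] & out["is_weekend"]
    out["rain_rush_hour"] = out["has_rain"] & out["is_rush_hour"]
    return out

# Toy records (illustrative values only).
toy = pd.DataFrame({
    "temp": [-2.0, 12.0, 31.5],
    "rain_1h": [np.nan, 0.8, np.nan],
    "snow_1h": [0.4, np.nan, np.nan],
    "is_weekend": [True, True, False],
    "is_rush_hour": [False, False, True],
})
flags = add_weather_flags(toy)
print(flags[["has_rain", "has_snow", "is_freezing", "is_hot", "rain_weekend"]])
```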

In [9]:
# =============================================
# 8. SAVING ANALYSIS RESULTS
# =============================================

print("\n" + "=" * 60)
print("8. SAVING ANALYSIS RESULTS")
print("=" * 60)

# Add metadata
weather_analysis_results['analysis_metadata'] = {
    'analysis_date': datetime.now().isoformat(),
    'dataset_size': len(df),
    'weather_features_analyzed': len(weather_cols)
}

# Save results to JSON
results_file = weather_analysis_dir / "weather_correlation_results.json"
with open(results_file, 'w') as f:
    json.dump(weather_analysis_results, f, indent=2, default=str)

print(f"Results saved: {results_file.name}")

# Final summary
print("\nWeather correlation analysis complete:")

# Correlation summary
correlations = weather_analysis_results.get('correlations', {})
sig_corrs = len([c for c in correlations.values() if c['pearson_p'] < 0.05])
print(f"  Significant correlations: {sig_corrs}")

# Precipitation effect summary
effects = weather_analysis_results.get('precipitation_effects', {})
if 'rain' in effects:
    print(f"  Rain deterrent: {effects['rain']['deterrent_pct']:.1f}%")
if 'snow' in effects:
    print(f"  Snow deterrent: {effects['snow']['deterrent_pct']:.1f}%")

print("\nReady for feature engineering phase.")



8. SAVING ANALYSIS RESULTS
Results saved: weather_correlation_results.json

Weather correlation analysis complete:
  Significant correlations: 9
  Rain deterrent: 1.2%
  Snow deterrent: 31.8%

Ready for feature engineering phase.
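
The `default=str` fallback in `json.dump` keeps the save from failing, but any value the encoder does not recognise is written as a string. NumPy integer and boolean scalars fall into that group, so a downstream notebook could read `"121"` where it expects `121`. The sketch below contrasts that fallback with a hypothetical hook, `to_builtin`, that converts NumPy scalars to native Python values. Whether this notebook's results dictionary actually contains such scalars depends on the library versions in use.

```python
import json
import numpy as np

def to_builtin(obj):
    """json `default` hook: convert NumPy scalars to plain Python values."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

# Illustrative payload with NumPy scalars the stock encoder does not handle.
payload = {"n": np.int64(121), "flag": np.bool_(True)}

with_str = json.loads(json.dumps(payload, default=str))
with_hook = json.loads(json.dumps(payload, default=to_builtin))

print(with_str)   # {'n': '121', 'flag': 'True'}
print(with_hook)  # {'n': 121, 'flag': True}
```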
