# 02 - Manhattan Weather Data Cleaning and Preprocessing
*Weather Data Preparation for Subway Ridership Prediction*

---

### **Objective**
Clean and preprocess 2024 Manhattan hourly weather data to prepare for integration with subway ridership data.

---

### **Data Source**
- **Source:** [OpenWeather](https://openweathermap.org/)
- **Location:** Manhattan (approx. lat: 40.77, lon: -73.98)
- **File:** `weather_data.csv`
- **Path:** `../data/raw/weather_data.csv`
- **Format:** CSV
- **Units:** Metric

---

### **Data Range**
- **Start:** 2023-12-31 00:00 UTC
- **End:** 2025-01-01 23:00 UTC
- **Total Records:** 9,597
- **Total Columns:** 28

---

### **Weather Parameters Included**
- Temperature: `temp`, `temp_min`, `temp_max`, `feels_like`, `dew_point`
- Atmosphere: `pressure`, `humidity`, `visibility`, `clouds_all`
- Precipitation: `rain_1h`, `snow_1h`
- Conditions: `weather_main`, `weather_description`, `weather_icon`
- Wind: `wind_speed`, `wind_deg`, `wind_gust`

---

### **Processing Requirements**
- Convert timestamps from **UTC to Eastern Time**, accounting for **Daylight Saving Time (DST)**
- Ensure 2024 data is extracted, cleaned, and **hourly complete**
- Remove or aggregate duplicate timestamps (especially during DST transitions)
- Export cleaned data to: `../data/processed/weather_data_cleaned.parquet`



In [11]:
# =============================
# Setup and Directory Configuration
# =============================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path
from datetime import datetime
import pytz
import warnings
warnings.filterwarnings('ignore')

# Plot and display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('default')
sns.set_palette("husl")

# Log start
print("Manhattan Weather Data Cleaning and Preprocessing")
print("=" * 60)
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Objective: Clean weather data for ridership prediction")
print("=" * 60)

# =============================
# Directory and File Configuration
# =============================

# Assumes the kernel's working directory is `notebooks/`, one level below the project root
PROJECT_DIR = Path.cwd().resolve().parent
DATA_DIR = PROJECT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# File paths
WEATHER_FILE = RAW_DIR / "weather_data.csv"
CLEANED_PARQUET = PROCESSED_DIR / "weather_data_cleaned.parquet"
QUALITY_REPORT = PROCESSED_DIR / "weather_data_quality_assessment.json"

# Constants
EXPECTED_HOURS_2024 = 8784  # 366 days x 24 hours (2024 is a leap year)
NYC_TIMEZONE = pytz.timezone("America/New_York")

# Verify path setup
print("\nPath Verification")
print("-" * 60)
print(f"Raw weather file:      {WEATHER_FILE}")
print(f"Cleaned output file:   {CLEANED_PARQUET}")
print(f"Quality report file:   {QUALITY_REPORT}")
print(f"RAW_DIR exists:        {RAW_DIR.exists()}")
print(f"WEATHER_FILE exists:   {WEATHER_FILE.exists()}")

if RAW_DIR.exists():
    print("\nFiles in RAW_DIR matching '*weather*.csv':")
    for file in RAW_DIR.glob("*weather*.csv"):
        print(f"  - {file.name}")


Manhattan Weather Data Cleaning and Preprocessing
Analysis Date: 2025-07-28 19:45:12
Objective: Clean weather data for ridership prediction

Path Verification
------------------------------------------------------------
Raw weather file:      C:\Users\neasa\manhattan-subway\data\raw\weather_data.csv
Cleaned output file:   C:\Users\neasa\manhattan-subway\data\processed\weather_data_cleaned.parquet
Quality report file:   C:\Users\neasa\manhattan-subway\data\processed\weather_data_quality_assessment.json
RAW_DIR exists:        True
WEATHER_FILE exists:   True

Files in RAW_DIR matching '*weather*.csv':
  - weather_data.csv
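
The setup cell derives `PROJECT_DIR` from the current working directory, which only works when the kernel is started from `notebooks/`. As a minimal sketch of a less location-dependent alternative, the hypothetical helper `find_project_root` (not part of the notebook) walks upward until it finds a folder containing `data/`. The marker name `data` is an assumption based on the directory layout printed above.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the closest ancestor of `start` (or `start` itself) containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains a '{marker}' directory")
```

With this helper, `PROJECT_DIR = find_project_root(Path.cwd())` would resolve to the same folder whether the kernel starts in the project root or in `notebooks/`.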


In [12]:
print("\n" + "=" * 60)
print("1. LOADING AND EXAMINING WEATHER DATA")
print("=" * 60)

# Confirm weather file exists
if not WEATHER_FILE.exists():
    raise FileNotFoundError(f"Weather data file not found: {WEATHER_FILE}")

print(f"Loading weather data from: {WEATHER_FILE.name}")
weather_df = pd.read_csv(WEATHER_FILE)

print("Raw weather data loaded successfully")
print(f"Shape: {weather_df.shape}")
print(f"Columns: {len(weather_df.columns)}")
print(f"Memory usage: {weather_df.memory_usage(deep=True).sum() / (1024**2):.1f} MB")

# Display column names
print(f"\nOriginal Columns ({len(weather_df.columns)}):")
for i, col in enumerate(weather_df.columns, 1):
    print(f"  {i:2d}. {col}")

# Dtypes
print("\nData types:")
print(weather_df.dtypes)

# Sample rows
print("\nFirst few rows:")
display(weather_df.head())

# Basic data quality check
print("\nQuick quality overview:")
print(f"• Date range: {weather_df['dt_iso'].iloc[0]} to {weather_df['dt_iso'].iloc[-1]}")
print(f"• Duplicate rows: {weather_df.duplicated().sum()}")
print(f"• Missing values: {weather_df.isnull().sum().sum()} total")



1. LOADING AND EXAMINING WEATHER DATA
Loading weather data from: weather_data.csv
Raw weather data loaded successfully
Shape: (9597, 28)
Columns: 28
Memory usage: 4.8 MB

Original Columns (28):
   1. dt
   2. dt_iso
   3. timezone
   4. city_name
   5. lat
   6. lon
   7. temp
   8. visibility
   9. dew_point
  10. feels_like
  11. temp_min
  12. temp_max
  13. pressure
  14. sea_level
  15. grnd_level
  16. humidity
  17. wind_speed
  18. wind_deg
  19. wind_gust
  20. rain_1h
  21. rain_3h
  22. snow_1h
  23. snow_3h
  24. clouds_all
  25. weather_id
  26. weather_main
  27. weather_description
  28. weather_icon

Data types:
dt                       int64
dt_iso                  object
timezone                 int64
city_name               object
lat                    float64
lon                    float64
temp                   float64
visibility             float64
dew_point              float64
feels_like             float64
temp_min               float64
temp_max              

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,visibility,dew_point,feels_like,temp_min,temp_max,pressure,sea_level,grnd_level,humidity,wind_speed,wind_deg,wind_gust,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1703980800,2023-12-31 00:00:00 +0000 UTC,-18000,Manhattan,40.768517,-73.982194,6.53,10000.0,1.26,3.19,5.97,7.08,1007,,,69,5.14,300,,,,,,40,802,Clouds,scattered clouds,03n
1,1703984400,2023-12-31 01:00:00 +0000 UTC,-18000,Manhattan,40.768517,-73.982194,5.81,10000.0,-0.6,1.16,4.87,6.2,1008,,,63,8.23,300,11.32,,,,,75,803,Clouds,broken clouds,04n
2,1703988000,2023-12-31 02:00:00 +0000 UTC,-18000,Manhattan,40.768517,-73.982194,5.45,10000.0,-1.49,1.83,4.49,6.08,1008,,,60,5.14,320,,,,,,0,800,Clear,sky is clear,01n
3,1703991600,2023-12-31 03:00:00 +0000 UTC,-18000,Manhattan,40.768517,-73.982194,4.92,10000.0,-1.54,1.17,3.44,5.86,1009,,,62,5.14,310,,,,,,0,800,Clear,sky is clear,01n
4,1703995200,2023-12-31 04:00:00 +0000 UTC,-18000,Manhattan,40.768517,-73.982194,4.34,10000.0,-1.46,0.7,2.87,4.99,1010,,,65,4.63,310,,,,,,75,803,Clouds,broken clouds,04n



Quick quality overview:
• Date range: 2023-12-31 00:00:00 +0000 UTC to 2025-01-01 23:00:00 +0000 UTC
• Duplicate rows: 0
• Missing values: 60848 total
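
The sample rows show that `dt` holds Unix epoch seconds alongside the `dt_iso` string. A hedged sketch of parsing the epoch column directly, which avoids depending on the trailing ` UTC` label in `dt_iso`; the two-row frame below copies values from the preview above and is only for illustration.

```python
import pandas as pd

# Two rows copied from the preview above
sample = pd.DataFrame({
    "dt": [1703980800, 1703984400],
    "dt_iso": ["2023-12-31 00:00:00 +0000 UTC", "2023-12-31 01:00:00 +0000 UTC"],
})

# Epoch seconds are unambiguous, so no format string is needed
from_epoch = pd.to_datetime(sample["dt"], unit="s", utc=True)

# Cross-check against the string column after dropping the trailing " UTC" label
from_iso = pd.to_datetime(
    sample["dt_iso"].str.replace(" UTC", "", regex=False), utc=True
)

print((from_epoch == from_iso).all())
```

If the two columns ever disagree, that comparison is a cheap way to catch a malformed export before any timezone conversion happens.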


In [13]:
print("\n" + "=" * 60)
print("2. DATA QUALITY ASSESSMENT & TEMPORAL CHECK")
print("=" * 60)

# -----------------------------
# Missing data per column
# -----------------------------
print("\nMissing values per column:")
missing_counts = weather_df.isnull().sum()
missing_cols = missing_counts[missing_counts > 0]

if not missing_cols.empty:
    print(missing_cols)
    print(f"\nTotal missing values: {missing_counts.sum():,}")
else:
    print("No missing values found.")

# -----------------------------
# Duplicate row check
# -----------------------------
duplicate_count = weather_df.duplicated().sum()
print(f"\nDuplicate rows: {duplicate_count}")

# -----------------------------
# Parse timestamps from 'dt_iso'
# -----------------------------
print("\nParsing UTC timestamps from 'dt_iso'...")

try:
    weather_df['dt_iso_temp'] = pd.to_datetime(weather_df['dt_iso'], format='%Y-%m-%d %H:%M:%S %z %Z')
    print("Parsed using timezone-aware format.")
except Exception:
    dt_iso_clean = weather_df['dt_iso'].str.replace(' UTC', '', regex=False)
    weather_df['dt_iso_temp'] = pd.to_datetime(dt_iso_clean, utc=True)
    print("Fallback applied: removed ' UTC' and parsed as UTC.")

# -----------------------------
# Check date range and gaps
# -----------------------------
min_ts = weather_df['dt_iso_temp'].min()
max_ts = weather_df['dt_iso_temp'].max()

print(f"Parsed date range: {min_ts} to {max_ts}")
print(f"Total records: {len(weather_df):,}")

# -----------------------------
# Temporal gap check (should be hourly)
# A delta of 0 means a repeated timestamp; a delta above 1 hour means missing hours
# -----------------------------
print("\nChecking for irregular time gaps...")
time_deltas = weather_df['dt_iso_temp'].diff().dropna()
expected_gap = pd.Timedelta(hours=1)
irregular_gaps = time_deltas[time_deltas != expected_gap]

if not irregular_gaps.empty:
    print(f"Found {len(irregular_gaps)} irregular gaps (non-hourly spacing)")
else:
    print("All timestamps are spaced hourly (as expected)")

# -----------------------------
# Cleanup temporary column
# -----------------------------
weather_df.drop(columns='dt_iso_temp', inplace=True)



2. DATA QUALITY ASSESSMENT & TEMPORAL CHECK

Missing values per column:
visibility      20
sea_level     9597
grnd_level    9597
wind_gust     5668
rain_1h       7539
rain_3h       9332
snow_1h       9510
snow_3h       9585
dtype: int64

Total missing values: 60,848

Duplicate rows: 0

Parsing UTC timestamps from 'dt_iso'...
Fallback applied: removed ' UTC' and parsed as UTC.
Parsed date range: 2023-12-31 00:00:00+00:00 to 2025-01-01 23:00:00+00:00
Total records: 9,597

Checking for irregular time gaps...
Found 765 irregular gaps (non-hourly spacing)
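
The count above lumps every non-hourly step together. A small sketch that separates repeated timestamps from genuinely missing hours could look like the following; `classify_gaps` is a hypothetical helper, not part of the notebook, and it assumes the input is a timezone-consistent datetime Series.

```python
import pandas as pd


def classify_gaps(timestamps: pd.Series) -> dict:
    """Split non-hourly steps into repeated timestamps, missing hours, and other."""
    hour = pd.Timedelta(hours=1)
    zero = pd.Timedelta(0)
    deltas = timestamps.sort_values().diff().dropna()
    long_gaps = deltas[deltas > hour]
    return {
        # A zero delta means the same timestamp appears on consecutive rows
        "repeated": int((deltas == zero).sum()),
        # A 3-hour step means 2 hourly records are absent
        "missing_hours": int((long_gaps / hour - 1).sum()),
        # Anything strictly between 0 and 1 hour is off the hourly grid
        "other": int(((deltas > zero) & (deltas < hour)).sum()),
    }
```

Calling `classify_gaps(weather_df['dt_iso_temp'])` before the temporary column is dropped would show whether the irregular steps are repeats, holes, or both.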


In [14]:
print("\n" + "=" * 60)
print("2. TIMEZONE CONVERSION (UTC TO EASTERN TIME WITH DST)")
print("=" * 60)

# Parse datetime from 'dt_iso' as UTC, then convert to Eastern Time
print("Converting UTC timestamps to Eastern Time (handling DST)...")

try:
    # Attempt parsing full timezone-aware string
    weather_df['dt_utc'] = pd.to_datetime(weather_df['dt_iso'], format='%Y-%m-%d %H:%M:%S %z %Z')
    print("Parsed datetime with timezone format successfully.")
except Exception:
    print("Fallback: removing ' UTC' and parsing as UTC...")
    cleaned_dt = weather_df['dt_iso'].str.replace(' UTC', '', regex=False)
    weather_df['dt_utc'] = pd.to_datetime(cleaned_dt, utc=True)

# Show UTC range
print(f"\nUTC Timestamp Range:")
print(f"  Start: {weather_df['dt_utc'].min()}")
print(f"  End:   {weather_df['dt_utc'].max()}")

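# Convert to New York wall-clock time (DST-aware), then drop the timezone so the
# column holds naive local time. After this step the repeated 01:00 hour on
# 2024-11-03 (fall back) shares one label with its first occurrence.
weather_df['transit_timestamp'] = weather_df['dt_utc'].dt.tz_convert(NYC_TIMEZONE).dt.tz_localize(None)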

# Show ET range
print(f"\nEastern Timezone Timestamp Range:")
print(f"  Start: {weather_df['transit_timestamp'].min()}")
print(f"  End:   {weather_df['transit_timestamp'].max()}")

# Sample conversions
print("\nSample conversions:")
print(weather_df[['dt_iso', 'dt_utc', 'transit_timestamp']].head(5))

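# DST spot checks. Note: head(3) returns the first rows of each month (1 March and
# 1 November), which fall before the 2024 transitions on 10 March and 3 November,
# so these samples confirm only the offset in effect at the start of each month.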
print("\nDST transition verification:")

march_sample = weather_df[weather_df['transit_timestamp'].dt.month == 3].head(3)
nov_sample = weather_df[weather_df['transit_timestamp'].dt.month == 11].head(3)

if not march_sample.empty:
    print("March sample (DST begins - second Sunday):")
    print(march_sample[['dt_iso', 'transit_timestamp']])

if not nov_sample.empty:
    print("November sample (DST ends - first Sunday):")
    print(nov_sample[['dt_iso', 'transit_timestamp']])

print("\nTimezone conversion completed successfully.")



2. TIMEZONE CONVERSION (UTC TO EASTERN TIME WITH DST)
Converting UTC timestamps to Eastern Time (handling DST)...
Fallback: removing ' UTC' and parsing as UTC...

UTC Timestamp Range:
  Start: 2023-12-31 00:00:00+00:00
  End:   2025-01-01 23:00:00+00:00

Eastern Timezone Timestamp Range:
  Start: 2023-12-30 19:00:00
  End:   2025-01-01 18:00:00

Sample conversions:
                          dt_iso                    dt_utc   transit_timestamp
0  2023-12-31 00:00:00 +0000 UTC 2023-12-31 00:00:00+00:00 2023-12-30 19:00:00
1  2023-12-31 01:00:00 +0000 UTC 2023-12-31 01:00:00+00:00 2023-12-30 20:00:00
2  2023-12-31 02:00:00 +0000 UTC 2023-12-31 02:00:00+00:00 2023-12-30 21:00:00
3  2023-12-31 03:00:00 +0000 UTC 2023-12-31 03:00:00+00:00 2023-12-30 22:00:00
4  2023-12-31 04:00:00 +0000 UTC 2023-12-31 04:00:00+00:00 2023-12-30 23:00:00

DST transition verification:
March sample (DST begins - second Sunday):
                             dt_iso   transit_timestamp
1647  2024-03-01 05:00:00 +0
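
The March and November samples start on the first of each month, so they do not cover the transition hours themselves. As a sketch of what the conversion does at the two 2024 switch points, the snippet below converts four hand-picked UTC hours; it is self-contained and does not read the weather file.

```python
import pandas as pd

# UTC hours straddling the two 2024 transitions in America/New_York
utc_hours = pd.to_datetime(
    [
        "2024-03-10 06:00", "2024-03-10 07:00",  # spring forward
        "2024-11-03 05:00", "2024-11-03 06:00",  # fall back
    ],
    utc=True,
)
local = utc_hours.tz_convert("America/New_York")
naive = local.tz_localize(None)  # same step the notebook applies

for utc_ts, local_ts, naive_ts in zip(utc_hours, local, naive):
    print(utc_ts, "|", local_ts, "|", naive_ts)
```

In March the naive labels jump from 01:00 to 03:00, and in November both UTC hours map to 01:00. That repeated label is why the fall-back hour shows up as a duplicate `transit_timestamp` once the timezone is dropped.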

In [15]:
# -----------------------------------------------
# Verification for Summer Months (e.g. July)
# -----------------------------------------------
print("\nVerifying summer timestamps (should reflect UTC-4 conversion)...")

july_sample = weather_df[weather_df['transit_timestamp'].dt.month == 7].head(3)

if not july_sample.empty:
    print("July sample (during Daylight Saving Time - UTC-4):")
    print(july_sample[['dt_iso', 'transit_timestamp']])
else:
    print("No July records found in dataset.")



Verifying summer timestamps (should reflect UTC-4 conversion)...
July sample (during Daylight Saving Time - UTC-4):
                             dt_iso   transit_timestamp
4840  2024-07-01 04:00:00 +0000 UTC 2024-07-01 00:00:00
4841  2024-07-01 05:00:00 +0000 UTC 2024-07-01 01:00:00
4842  2024-07-01 06:00:00 +0000 UTC 2024-07-01 02:00:00


In [16]:
print("\n" + "=" * 60)
print("3. FILTER TO 2024 DATA ONLY")
print("=" * 60)

# Define 2024 datetime bounds
start_2024 = datetime(2024, 1, 1, 0, 0, 0)
end_2024 = datetime(2025, 1, 1, 0, 0, 0)

print(f"Filtering to full-year 2024 window: {start_2024} to {end_2024}")

# Filter to 2024 only
weather_2024 = weather_df[
    (weather_df['transit_timestamp'] >= start_2024) & 
    (weather_df['transit_timestamp'] < end_2024)
].copy()

# Summary
print(f"\nRecords before filtering: {len(weather_df):,}")
print(f"Records after filtering:  {len(weather_2024):,}")
print(f"Records removed:          {len(weather_df) - len(weather_2024):,}")
print(f"Filtered date range:      {weather_2024['transit_timestamp'].min()} → {weather_2024['transit_timestamp'].max()}")

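# Expected count of unique naive local hours: 8,784 elapsed hours minus the 02:00
# hour skipped on 2024-03-10. The repeated 01:00 hour on 2024-11-03 shares one
# naive label, so it adds a raw record but not a unique timestamp.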
expected_hours_dst = EXPECTED_HOURS_2024 - 1
actual_hours = len(weather_2024)
coverage_pct = (actual_hours / expected_hours_dst) * 100

print(f"\nExpected hours (DST adjusted): {expected_hours_dst}")
print(f"Actual hourly records:         {actual_hours}")
print(f"Coverage:                      {coverage_pct:.2f}%")

# Coverage assessment
if coverage_pct > 100:
    print("Over 100% coverage — likely duplicate timestamps (requires aggregation).")
elif coverage_pct >= 99:
    print("Excellent coverage — minimal to no missing data.")
elif coverage_pct >= 95:
    print("Good coverage — acceptable for modeling.")
elif coverage_pct >= 90:
    print("Fair coverage — review before model use.")
else:
    print("Low coverage — major data gaps, not recommended for modeling.")



3. FILTER TO 2024 DATA ONLY
Filtering to full-year 2024 window: 2024-01-01 00:00:00 to 2025-01-01 00:00:00

Records before filtering: 9,597
Records after filtering:  9,549
Records removed:          48
Filtered date range:      2024-01-01 00:00:00 → 2024-12-31 23:00:00

Expected hours (DST adjusted): 8783
Actual hourly records:         9549
Coverage:                      108.72%
Over 100% coverage — likely duplicate timestamps (requires aggregation).
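
The expected figure of 8,783 can be checked independently of the weather file. This sketch builds every UTC hour whose New York wall-clock time falls in 2024, applies the same convert-then-drop-timezone step, and counts both elapsed hours and distinct naive labels.

```python
import pandas as pd

# 2024-01-01 00:00 EST is 05:00 UTC; 2024-12-31 23:00 EST is 2025-01-01 04:00 UTC
utc_hours = pd.date_range("2024-01-01 05:00", "2025-01-01 04:00", freq="h", tz="UTC")
local_naive = utc_hours.tz_convert("America/New_York").tz_localize(None)

elapsed_hours = len(utc_hours)         # real hours in the leap year
unique_labels = local_naive.nunique()  # distinct naive wall-clock timestamps

print(elapsed_hours, unique_labels)
```

The year contains 8,784 elapsed hours, but only 8,783 distinct naive labels: 02:00 on 10 March never occurs, and the two 01:00 hours on 3 November collapse into one label.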


In [17]:
print("\n" + "=" * 60)
print("4. HANDLE DUPLICATE TIMESTAMPS")
print("=" * 60)

# Check for duplicates
duplicate_count = weather_2024['transit_timestamp'].duplicated().sum()
print(f"Duplicate timestamps found: {duplicate_count}")

if duplicate_count > 0:
    print("Resolving duplicates using aggregation...")

    # Show a few duplicated timestamps
    sample_dupes = weather_2024[weather_2024['transit_timestamp'].duplicated(keep=False)]
    print("Sample duplicated timestamps:")
    print(sample_dupes['transit_timestamp'].value_counts().head())

    # Aggregation strategy
    aggregation_rules = {
        # Core measurements (mean). Caveat: wind_deg is an angle, so an arithmetic
        # mean can mislead (350 and 10 average to 180); it is not exported below.
        'temp': 'mean', 'feels_like': 'mean', 'dew_point': 'mean',
        'humidity': 'mean', 'pressure': 'mean', 'visibility': 'mean',
        'wind_speed': 'mean', 'wind_deg': 'mean',
        'temp_min': 'mean', 'temp_max': 'mean',
        'clouds_all': 'mean',

        # Precipitation (max, so any precipitation reported for the hour is kept)
        'rain_1h': 'max', 'snow_1h': 'max',

        # Categorical (first non-null observation in each group)
        'weather_main': 'first', 'weather_description': 'first', 'weather_icon': 'first'
    }

    # Limit to available columns
    available_agg = {col: method for col, method in aggregation_rules.items() if col in weather_2024.columns}
    print(f"Columns being aggregated: {len(available_agg)}")

    # Apply aggregation
    weather_clean = (
        weather_2024.groupby('transit_timestamp')
        .agg(available_agg)
        .reset_index()
    )

    print(f"Cleaned record count: {len(weather_clean):,}")
    print(f"Duplicates resolved: {len(weather_2024) - len(weather_clean):,}")

else:
    print("No duplicates detected — using original 2024 data.")
    weather_clean = weather_2024.copy()

# Hourly timestamp verification
minute_check = weather_clean['transit_timestamp'].dt.minute.nunique()
second_check = weather_clean['transit_timestamp'].dt.second.nunique()

print("\nHourly granularity verification:")
print(f"  Unique minute values: {minute_check}")
print(f"  Unique second values: {second_check}")

if minute_check == 1 and second_check == 1:
    print("Timestamps aligned to hourly (HH:00:00) — granularity confirmed.")
else:
    print("Warning: Timestamps not perfectly aligned to hourly boundaries.")

print(f"\nFinal cleaned dataset: {len(weather_clean):,} hourly records")



4. HANDLE DUPLICATE TIMESTAMPS
Duplicate timestamps found: 766
Resolving duplicates using aggregation...
Sample duplicated timestamps:
transit_timestamp
2024-02-13 05:00:00    3
2024-02-13 06:00:00    3
2024-01-24 22:00:00    3
2024-01-07 11:00:00    3
2024-01-07 12:00:00    3
Name: count, dtype: int64
Columns being aggregated: 16
Cleaned record count: 8,783
Duplicates resolved: 766

Hourly granularity verification:
  Unique minute values: 1
  Unique second values: 1
Timestamps aligned to hourly (HH:00:00) — granularity confirmed.

Final cleaned dataset: 8,783 hourly records
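
The aggregation above averages `wind_deg` arithmetically. If wind direction were ever needed downstream, a circular mean would be a safer choice. The sketch below is one common way to compute it; `circular_mean_deg` is a hypothetical helper and is not used anywhere in the notebook.

```python
import numpy as np


def circular_mean_deg(degrees) -> float:
    """Mean direction in degrees on [0, 360), ignoring NaN values."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    # Average the unit vectors, then convert the resulting vector back to an angle
    mean_angle = np.arctan2(np.nanmean(np.sin(radians)), np.nanmean(np.cos(radians)))
    return float(np.rad2deg(mean_angle) % 360)


print(circular_mean_deg([80, 100]))
print(circular_mean_deg([350, 10]))
```

For 350 and 10 degrees the result lies at north (within floating-point error of 0, which can print as a value just under 360), whereas the arithmetic mean gives 180. Because `groupby(...).agg` accepts callables, the function could in principle be passed as the rule for `wind_deg`.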


In [18]:
print("\n" + "=" * 60)
print("5. DATA QUALITY VALIDATION")
print("=" * 60)

# Initialize result dictionary
quality_results = {}

# ----------------------------------------
# 5.1 Temporal Coverage Validation
# ----------------------------------------
print("5.1 TEMPORAL COVERAGE VALIDATION")
print("-" * 40)

expected_hours = EXPECTED_HOURS_2024 - 1  # Unique naive local hours: 02:00 on 2024-03-10 does not exist
actual_hours = len(weather_clean)
coverage_pct = (actual_hours / expected_hours) * 100

print(f"Expected hourly records (with DST): {expected_hours}")
print(f"Actual hourly records:              {actual_hours}")
print(f"Coverage percentage:                {coverage_pct:.2f}%")

if coverage_pct >= 99.5:
    coverage_status = "EXCELLENT"
elif coverage_pct >= 95:
    coverage_status = "GOOD"
else:
    coverage_status = "NEEDS REVIEW"

print(f"Coverage Status: {coverage_status}")

quality_results["temporal_coverage"] = {
    "expected_hours": expected_hours,
    "actual_hours": actual_hours,
    "coverage_pct": coverage_pct,
    "status": coverage_status
}

# ----------------------------------------
# 5.2 Field-Level Validity Checks
# ----------------------------------------
print("\n5.2 FIELD-LEVEL VALIDITY CHECKS")
print("-" * 40)

# Define all checks
quality_checks = {
    "no_duplicate_timestamps": {
        "passed": weather_clean['transit_timestamp'].duplicated().sum() == 0,
        "description": "No duplicate timestamps"
    },
    "reasonable_temperature_celsius": {
        "passed": (weather_clean['temp'].min() > -30) and (weather_clean['temp'].max() < 50),
        "description": "Temperature within reasonable NYC range (-30°C to 50°C)"
    },
    "non_negative_precipitation": {
        "passed": (weather_clean['rain_1h'].fillna(0).min() >= 0) and (weather_clean['snow_1h'].fillna(0).min() >= 0),
        "description": "No negative values in rain or snow"
    },
    "reasonable_humidity": {
        "passed": (weather_clean['humidity'].min() >= 0) and (weather_clean['humidity'].max() <= 100),
        "description": "Humidity within 0-100%"
    },
    "reasonable_wind_speed": {
        "passed": (weather_clean['wind_speed'].min() >= 0) and (weather_clean['wind_speed'].max() < 50),
        "description": "Wind speed within 0-50 m/s"
    },
    "reasonable_visibility": {
        "passed": (weather_clean['visibility'].min() >= 0) and (weather_clean['visibility'].max() <= 50000),
        "description": "Visibility within 0-50,000 meters"
    }
}

# Evaluate checks
passed_checks = 0
for key, check in quality_checks.items():
    result = "PASS" if check["passed"] else "FAIL"
    print(f"  {result:4s} - {check['description']}")
    if check["passed"]:
        passed_checks += 1

# Store in results
quality_results["field_checks"] = {
    "total_checks": len(quality_checks),
    "passed": passed_checks,
    "pass_rate": round((passed_checks / len(quality_checks)) * 100, 1)
}

print(f"\nChecks passed: {passed_checks}/{len(quality_checks)} "
      f"({quality_results['field_checks']['pass_rate']}%)")



5. DATA QUALITY VALIDATION
5.1 TEMPORAL COVERAGE VALIDATION
----------------------------------------
Expected hourly records (with DST): 8783
Actual hourly records:              8783
Coverage percentage:                100.00%
Coverage Status: EXCELLENT

5.2 FIELD-LEVEL VALIDITY CHECKS
----------------------------------------
  PASS - No duplicate timestamps
  PASS - Temperature within reasonable NYC range (-30°C to 50°C)
  PASS - No negative values in rain or snow
  PASS - Humidity within 0-100%
  PASS - Wind speed within 0-50 m/s
  PASS - Visibility within 0-50,000 meters

Checks passed: 6/6 (100.0%)
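
The setup cell defines `QUALITY_REPORT` and imports `json`, but none of the cells shown here write `quality_results` to disk. A minimal sketch of doing so follows; `write_quality_report` and `to_builtin` are hypothetical helpers. The `default` hook is included because pandas reductions often return NumPy scalars, which `json.dump` rejects.

```python
import json
from pathlib import Path

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars to plain Python types for json.dump."""
    if isinstance(value, np.generic):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def write_quality_report(results: dict, path: Path) -> None:
    """Write the results dictionary as indented JSON, creating parent folders."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, default=to_builtin)
```

In the notebook this would be called as `write_quality_report(quality_results, QUALITY_REPORT)` after the checks have run.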


In [19]:
print("\n" + "=" * 60)
print("6. FEATURE SELECTION FOR INTEGRATION")
print("=" * 60)

print("Selecting core weather features for ridership prediction...")

# Define essential features (safe for modeling and integration)
integration_features = [
    'transit_timestamp',      # 1
    'temp',                   # 2 
    'feels_like',             # 3 
    'dew_point',              # 4  
    'humidity',               # 5
    'wind_speed',             # 6
    'pressure',               # 7
    'visibility',             # 8
    'rain_1h',                # 9
    'snow_1h',                # 10
    'weather_main',           # 11
    'weather_description',    # 12
    'clouds_all',             # 13
]

# Filter to columns that exist in cleaned dataset
available_features = [col for col in integration_features if col in weather_clean.columns]
integration_df = weather_clean[available_features].copy()

# Output selection summary
print(f"Selected {len(available_features)} core features:")
for i, col in enumerate(available_features, 1):
    print(f"  {i:2d}. {col}")

# Preview dataset shape
print(f"\nIntegration-ready dataset shape: {integration_df.shape}")

# Feature category classification (for readability)
feature_categories = {
    "timestamp": ['transit_timestamp'],
    "temperature": ['temp', 'feels_like', 'dew_point'],
    "atmospheric": ['humidity', 'pressure', 'visibility', 'clouds_all'],
    "wind": ['wind_speed'],
    "precipitation": ['rain_1h', 'snow_1h'],
    "conditions": ['weather_main', 'weather_description']
}

print(f"\nFeature categories by type:")
for group, cols in feature_categories.items():
    present = [c for c in cols if c in available_features]
    if present:
        print(f"  • {group.title():<13}: {len(present)} feature(s)")



6. FEATURE SELECTION FOR INTEGRATION
Selecting core weather features for ridership prediction...
Selected 13 core features:
   1. transit_timestamp
   2. temp
   3. feels_like
   4. dew_point
   5. humidity
   6. wind_speed
   7. pressure
   8. visibility
   9. rain_1h
  10. snow_1h
  11. weather_main
  12. weather_description
  13. clouds_all

Integration-ready dataset shape: (8783, 13)

Feature categories by type:
  • Timestamp    : 1 feature(s)
  • Temperature  : 3 feature(s)
  • Atmospheric  : 4 feature(s)
  • Wind         : 1 feature(s)
  • Precipitation: 2 feature(s)
  • Conditions   : 2 feature(s)
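
The selected `rain_1h` and `snow_1h` columns are mostly null in the raw data. If a missing value is taken to mean that no precipitation was reported for that hour (an assumption the notebook does not state, although the validity check above applies the same `fillna(0)`), the gaps could be filled before modeling. The three-row frame below is made up purely to illustrate the step.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the weather file
demo = pd.DataFrame({
    "rain_1h": [np.nan, 0.25, np.nan],
    "snow_1h": [np.nan, np.nan, 1.0],
})

precip_cols = ["rain_1h", "snow_1h"]
filled = demo.copy()
# Assumption: NaN means "nothing reported", so treat it as 0.0
filled[precip_cols] = filled[precip_cols].fillna(0.0)

print(filled)
```

Other columns with nulls, such as `visibility`, are a different case: a missing reading there does not imply zero, so they would need their own treatment.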


In [20]:
print("\n" + "=" * 60)
print("7. DATA EXPORT AND SUMMARY")
print("=" * 60)

# Reuse the output path defined in the setup cell
output_file = CLEANED_PARQUET

# Export cleaned and filtered weather dataset
integration_df.to_parquet(output_file, index=False)

# Export confirmation
print(f"Cleaned weather data exported to:\n  {output_file}")
print(f"Records exported:   {len(integration_df):,}")
print(f"Features exported:  {len(integration_df.columns)}")

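# Summary stats for tracking record counts. The final "success rate" is the share of
# raw rows retained; dropped rows were outside 2024 or merged as repeated timestamps.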
print("\nSUMMARY STATISTICS")
print("-" * 60)
print(f"Original records:         {len(weather_df):,}")
print(f"After 2024 filtering:     {len(weather_2024):,}")
print(f"After duplicate removal:  {len(weather_clean):,}")
print(f"Final integration set:    {len(integration_df):,}")
print(f"Processing success rate:  {(len(integration_df) / len(weather_df)) * 100:.2f}%")

print("\n" + "=" * 60)
print("WEATHER DATA CLEANING COMPLETED SUCCESSFULLY")
print("=" * 60)
print("Next: Integrate with subway ridership data.")



7. DATA EXPORT AND SUMMARY
Cleaned weather data exported to:
  C:\Users\neasa\manhattan-subway\data\processed\weather_data_cleaned.parquet
Records exported:   8,783
Features exported:  13

SUMMARY STATISTICS
------------------------------------------------------------
Original records:         9,597
After 2024 filtering:     9,549
After duplicate removal:  8,783
Final integration set:    8,783
Processing success rate:  91.52%

WEATHER DATA CLEANING COMPLETED SUCCESSFULLY
Next: Integrate with subway ridership data.
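
As a sketch of the integration step named above, the weather table can be left-joined onto ridership rows by `transit_timestamp`. The ridership frame, its `station` and `ridership` columns, and all values below are hypothetical placeholders; only the join key comes from this notebook.

```python
import pandas as pd

# Stand-in for the cleaned weather table (one row per hour)
weather = pd.DataFrame({
    "transit_timestamp": pd.to_datetime(["2024-07-01 00:00", "2024-07-01 01:00"]),
    "temp": [22.5, 21.9],
})

# Hypothetical ridership table (several stations per hour)
ridership = pd.DataFrame({
    "transit_timestamp": pd.to_datetime(
        ["2024-07-01 00:00", "2024-07-01 01:00", "2024-07-01 01:00"]
    ),
    "station": ["A", "A", "B"],
    "ridership": [120, 80, 45],
})

merged = ridership.merge(
    weather,
    on="transit_timestamp",
    how="left",
    validate="many_to_one",  # raises if weather has repeated timestamps
    indicator=True,          # adds a _merge column to flag unmatched rows
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

The `validate` argument turns the "no duplicate timestamps" check from section 5 into a hard guarantee at join time, and `unmatched` counts ridership rows that found no weather record.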
