# B4. Preprocessing and Feature Engineering for Basel Rain Prediction

This notebook takes the raw Kaggle/ECA&D weather file and creates a clean, Basel-only table with features and the `RainTomorrow` label. The output is saved under `data/processed/` so model notebooks can load it directly.


In [12]:
import pandas as pd
import numpy as np
import os
import sys
sys.path.append("/Users/purvigarg/Downloads/CMSE492/cmse492_project")
from src.preprocessing.features import load_raw_weather, make_basel_features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

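# 1) Load the raw multi-city file and parse dates
RAW_PATH = "/Users/purvigarg/Downloads/CMSE492/cmse492_project/data/raw/weather_prediction_dataset.csv"
raw = load_raw_weather(RAW_PATH)
print("Loaded raw shape:", raw.shape)

# cast DATE to string before parsing; unparseable values become NaT
raw["DATE"] = pd.to_datetime(raw["DATE"].astype(str), errors="coerce")
# sort chronologically so that shift-based labels and lags line up with calendar order
raw = raw.sort_values("DATE").reset_index(drop=True)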

# 2) Build Basel-only feature table using the src helper
city = "BASEL"
pref = f"{city}_"  # column prefix, e.g. "BASEL_pressure"
df_proc = make_basel_features(raw, city=city)
print("Processed shape:", df_proc.shape)

df_proc.head()



Loaded raw shape: (3654, 165)
Processed shape: (3653, 12)


| | DATE | MONTH | RainToday | RainTomorrow | BASEL_pressure | BASEL_humidity | BASEL_temp_mean | BASEL_sunshine | BASEL_pressure_lag1 | BASEL_humidity_lag1 | BASEL_temp_mean_lag1 | BASEL_sunshine_lag1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-02 | 1 | 0 | 0 | 1.0318 | 0.87 | 3.6 | 0.0 | 1.0286 | 0.89 | 2.9 | 0.0 |
| 1 | 2000-01-03 | 1 | 0 | 1 | 1.0314 | 0.81 | 2.2 | 3.7 | 1.0318 | 0.87 | 3.6 | 0.0 |
| 2 | 2000-01-04 | 1 | 1 | 1 | 1.0262 | 0.79 | 3.9 | 6.9 | 1.0314 | 0.81 | 2.2 | 3.7 |
| 3 | 2000-01-05 | 1 | 1 | 0 | 1.0246 | 0.9 | 6.0 | 3.7 | 1.0262 | 0.79 | 3.9 | 6.9 |
| 4 | 2000-01-06 | 1 | 0 | 0 | 1.0244 | 0.85 | 4.2 | 5.7 | 1.0246 | 0.9 | 6.0 | 3.7 |


## 1. Construct `RainToday` and `RainTomorrow` Labels

Here I turn the Basel precipitation measurements into two binary labels:

- `RainToday` = 1 if it rained on that day in Basel, 0 otherwise.  
- `RainTomorrow` = 1 if it rains on the following day, 0 otherwise.

`RainTomorrow` is the actual supervised-learning target that later models will
try to predict from today’s features.
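
As a toy illustration of how the two labels line up, the sketch below uses made-up precipitation values and a placeholder column name (`precip`, not the dataset's real column). Shifting by -1 pulls tomorrow's value onto today's row. One detail worth keeping in mind is that `NaN > 0` evaluates to `False`, so the final day needs explicit masking if it should be treated as unlabeled.

```python
import pandas as pd

# Made-up precipitation values for five consecutive days (illustrative only).
toy = pd.DataFrame({"precip": [0.0, 1.2, 0.0, 0.0, 3.4]})

toy["RainToday"] = (toy["precip"] > 0).astype(int)

# shift(-1) moves each next-day value up one row; the last row has no "tomorrow".
next_precip = toy["precip"].shift(-1)

# Mask rows without a next-day value so they stay NaN instead of becoming 0.
toy["RainTomorrow"] = (next_precip > 0).astype(float).where(next_precip.notna())

# Drop the unlabeled final day, then store the label as an integer.
labeled = toy.dropna(subset=["RainTomorrow"]).copy()
labeled["RainTomorrow"] = labeled["RainTomorrow"].astype(int)
print(labeled)
```

In this toy frame the last day is removed because it has no following day to label it.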


In [13]:
# --- label construction: RainToday and RainTomorrow ---

df = raw.copy()

df["RainToday"] = (df[f"{pref}precipitation"] > 0).astype(int)
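# NaN > 0 evaluates to False, so the final row (which has no next day) gets 0.0 here, not NaN
df["RainTomorrow"] = (df[f"{pref}precipitation"].shift(-1) > 0).astype(float)

# intended to drop the last row; as written, no RainTomorrow value is NaN, so every row is kept
df = df.dropna(subset=["RainTomorrow"]).reset_index(drop=True)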
df["RainTomorrow"] = df["RainTomorrow"].astype(int)

print("RainTomorrow proportions:")
print(df["RainTomorrow"].value_counts(normalize=True).rename("proportion"))




RainTomorrow proportions:
RainTomorrow
0    0.533114
1    0.466886
Name: proportion, dtype: float64


 
The `RainTomorrow` label is now defined and reasonably balanced, with about
53.3% dry days (0) and 46.7% rainy days (1). Because of this balance, a model
that always predicts the majority class reaches only about 53% accuracy, so
accuracy stays informative and metrics like F1 on the rain class will be
meaningful.
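
To make the balance argument concrete, here is a rough sketch of what a trivial "always dry" predictor would score. It uses only the class proportions printed above. The label vector is synthetic (built to match those proportions) and is not the real data.

```python
import numpy as np

# Class proportions reported above for RainTomorrow (0 = dry, 1 = rain).
p_dry, p_rain = 0.533114, 0.466886

# A baseline that always predicts the majority class is right only on that class.
baseline_accuracy = max(p_dry, p_rain)

# Synthetic label vector with roughly the same proportions (illustrative only).
n = 10_000
y_true = np.zeros(n, dtype=int)
y_true[: int(round(p_rain * n))] = 1
y_pred = np.zeros(n, dtype=int)  # always predict "dry"

tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

# F1 = 2TP / (2TP + FP + FN); taken as 0 when there are no true positives.
f1_rain = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

print(f"always-dry accuracy: {baseline_accuracy:.3f}, rain-class F1: {f1_rain:.1f}")
```

Any model worth keeping should beat both numbers, which gives later notebooks a simple reference point.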



## 2. Add Calendar and Lag-Based Weather Features

In this step I engineer the actual predictors that future model notebooks will use.
I add:

- `MONTH` to capture basic seasonality,
- `RainToday` to encode whether it rained today, and
- 1-day lag versions of key Basel variables (pressure, humidity, mean temperature,
  sunshine) to let the model see what conditions looked like yesterday.

I then drop the first row, where the lag features are undefined.
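
The lag step can be sketched on a tiny frame. The column name `pressure` is a placeholder and the values are for illustration only. Since `shift(1)` works by row position rather than by date, the sketch also checks that the dates are consecutive, which is an assumption the lag features rely on.

```python
import pandas as pd

# Small illustrative frame; "pressure" is a placeholder column name.
toy = pd.DataFrame({
    "DATE": pd.date_range("2000-01-01", periods=4, freq="D"),
    "pressure": [1.0286, 1.0318, 1.0314, 1.0262],
})

# shift(1) pairs each row with the previous row, which equals "yesterday"
# only if consecutive rows are exactly one day apart.
gaps = toy["DATE"].diff().dropna()
assert (gaps == pd.Timedelta(days=1)).all(), "dates are not consecutive"

toy["pressure_lag1"] = toy["pressure"].shift(1)

# Only the first row lacks a lag value, so drop on that column alone.
toy = toy.dropna(subset=["pressure_lag1"]).reset_index(drop=True)
print(toy)
```

Restricting `dropna` to the lag column avoids accidentally discarding rows that have missing values in unrelated columns.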


In [14]:
# --- feature engineering ---

# month for seasonality
df["MONTH"] = df["DATE"].dt.month

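# 1-day lags for key Basel variables
lag_cols = []
for col in [f"{pref}pressure", f"{pref}humidity", f"{pref}temp_mean", f"{pref}sunshine"]:
    if col in df.columns:
        df[col + "_lag1"] = df[col].shift(1)
        lag_cols.append(col + "_lag1")

# drop the first row, where the lag features are NaN
# (restricting to the lag columns avoids dropping rows for NaNs in unrelated columns)
df = df.dropna(subset=lag_cols).reset_index(drop=True)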

feature_cols = [
    "DATE",
    "MONTH",
    "RainToday",
    "RainTomorrow",
    f"{pref}pressure",
    f"{pref}humidity",
    f"{pref}temp_mean",
    f"{pref}sunshine",
    f"{pref}pressure_lag1",
    f"{pref}humidity_lag1",
    f"{pref}temp_mean_lag1",
    f"{pref}sunshine_lag1",
]

# keep only existing columns
feature_cols = [c for c in feature_cols if c in df.columns]
df_proc = df[feature_cols].copy()

print("Processed shape:", df_proc.shape)
df_proc.head()


Processed shape: (3653, 12)


| | DATE | MONTH | RainToday | RainTomorrow | BASEL_pressure | BASEL_humidity | BASEL_temp_mean | BASEL_sunshine | BASEL_pressure_lag1 | BASEL_humidity_lag1 | BASEL_temp_mean_lag1 | BASEL_sunshine_lag1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-02 | 1 | 0 | 0 | 1.0318 | 0.87 | 3.6 | 0.0 | 1.0286 | 0.89 | 2.9 | 0.0 |
| 1 | 2000-01-03 | 1 | 0 | 1 | 1.0314 | 0.81 | 2.2 | 3.7 | 1.0318 | 0.87 | 3.6 | 0.0 |
| 2 | 2000-01-04 | 1 | 1 | 1 | 1.0262 | 0.79 | 3.9 | 6.9 | 1.0314 | 0.81 | 2.2 | 3.7 |
| 3 | 2000-01-05 | 1 | 1 | 0 | 1.0246 | 0.9 | 6.0 | 3.7 | 1.0262 | 0.79 | 3.9 | 6.9 |
| 4 | 2000-01-06 | 1 | 0 | 0 | 1.0244 | 0.85 | 4.2 | 5.7 | 1.0246 | 0.9 | 6.0 | 3.7 |


The engineered Basel feature table has 3,653 rows and 12 columns, including the
calendar field (`MONTH`), today’s rain flag, the `RainTomorrow` label, and both
current and 1-day-lagged weather variables. This compact set of features encodes
short-term history and seasonality while staying small and easy to interpret.


In [15]:
# --- save processed table ---
ROOT = "/Users/purvigarg/Downloads/CMSE492/cmse492_project"
PROC_DIR = os.path.join(ROOT, "data", "processed")
os.makedirs(PROC_DIR, exist_ok=True)
out_path = os.path.join(PROC_DIR, "basel_rain_features.csv")
df_proc.to_csv(out_path, index=False)
print("Saved processed Basel features to:", out_path)


Saved processed Basel features to: /Users/purvigarg/Downloads/CMSE492/cmse492_project/data/processed/basel_rain_features.csv
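
When a later notebook reads this file back, `DATE` will arrive as plain strings unless it is parsed again, because CSV does not store dtypes. The following is a minimal round-trip sketch with a tiny stand-in table and a hypothetical temporary file name. It is not the real processed file.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the processed table (illustrative values only).
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2000-01-02", "2000-01-03"]),
    "MONTH": [1, 1],
    "RainTomorrow": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.csv")  # hypothetical file name
    toy.to_csv(path, index=False)

    # Without parse_dates, DATE would come back as strings rather than datetimes.
    loaded = pd.read_csv(path, parse_dates=["DATE"])

print(loaded.dtypes)
```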


## Conclusion

This preprocessing notebook takes the raw, multi-city Kaggle/ECA&D weather file
and turns it into a clean, Basel-only dataset ready for modeling. I defined a
balanced day-ahead rain label (`RainTomorrow`), added simple but informative
features (month, current rain, and 1-day-lagged weather variables), and saved
the result to `data/processed/basel_rain_features.csv`. With this step complete,
the next notebooks can focus on comparing baseline and improved models without
having to worry about data wrangling or label construction.
