# B5. Improved Models for "Will it Rain Tomorrow in Basel?"

This notebook loads the processed Basel feature table from `data/processed/`,
uses a time-aware train/test split, and fits stronger models:

- Tuned logistic regression (with scaling)
- Random forest (nonlinear benchmark)

The goal is to see how much we can improve over the simple baseline logistic
regression from the B3 notebook.


In [2]:
# 1. Imports

import os
import numpy as np
import pandas as pd

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
)
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier



## 1. Load Processed Basel Feature Table

### 1.1 Read `basel_rain_features.csv` and enforce time order

Here I load the Basel-only feature table created in the preprocessing notebook.
I convert `DATE` back to a datetime type and sort the rows chronologically so
that earlier days come first and later days come last.


In [3]:


PROC_PATH = "/Users/purvigarg/Downloads/CMSE492/cmse492_project/data/processed/basel_rain_features.csv"

df = pd.read_csv(PROC_PATH)
print("Loaded processed:", df.shape)

df["DATE"] = pd.to_datetime(df["DATE"].astype(str), errors="coerce")
df = df.sort_values("DATE").reset_index(drop=True)

df.head()



Loaded processed: (3653, 12)


| | DATE | MONTH | RainToday | RainTomorrow | BASEL_pressure | BASEL_humidity | BASEL_temp_mean | BASEL_sunshine | BASEL_pressure_lag1 | BASEL_humidity_lag1 | BASEL_temp_mean_lag1 | BASEL_sunshine_lag1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-02 | 1 | 0 | 0 | 1.0318 | 0.87 | 3.6 | 0.0 | 1.0286 | 0.89 | 2.9 | 0.0 |
| 1 | 2000-01-03 | 1 | 0 | 1 | 1.0314 | 0.81 | 2.2 | 3.7 | 1.0318 | 0.87 | 3.6 | 0.0 |
| 2 | 2000-01-04 | 1 | 1 | 1 | 1.0262 | 0.79 | 3.9 | 6.9 | 1.0314 | 0.81 | 2.2 | 3.7 |
| 3 | 2000-01-05 | 1 | 1 | 0 | 1.0246 | 0.9 | 6.0 | 3.7 | 1.0262 | 0.79 | 3.9 | 6.9 |
| 4 | 2000-01-06 | 1 | 0 | 0 | 1.0244 | 0.85 | 4.2 | 5.7 | 1.0246 | 0.9 | 6.0 | 3.7 |
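
Because `errors="coerce"` silently turns unparseable dates into `NaT`, a quick check after loading can confirm that nothing was lost and that no day appears twice. The following is a minimal sketch on toy data; the helper name `check_dates` and the toy table are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_dates(frame: pd.DataFrame, date_col: str = "DATE") -> dict:
    """Summarize basic date-quality problems in a daily table."""
    # Same conversion as the notebook: unparseable values become NaT.
    dates = pd.to_datetime(frame[date_col].astype(str), errors="coerce")
    valid = dates.dropna()
    return {
        "n_unparseable": int(dates.isna().sum()),
        "n_duplicates": int(valid.duplicated().sum()),
        "is_sorted": bool(valid.is_monotonic_increasing),
    }


# Toy example with one bad date and one repeated day (hypothetical data).
toy = pd.DataFrame(
    {"DATE": ["2000-01-02", "2000-01-03", "not a date", "2000-01-03"]}
)
report = check_dates(toy)
print(report)
```

On the real table, one would hope to see zero unparseable dates and zero duplicates before relying on the chronological ordering.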


## 2. Define Feature Matrix `X` and Target `y`

### 2.1 Use all engineered predictors and the `RainTomorrow` label

I now separate the processed table into a feature matrix `X` and the binary
target `y = RainTomorrow`. All columns except `DATE` and `RainTomorrow` are used
as predictors, including month, current rain flag, and current and lagged Basel
weather variables.


In [4]:
y = df["RainTomorrow"].astype(int)
feature_cols = [c for c in df.columns if c not in ["DATE", "RainTomorrow"]]

X = df[feature_cols].copy()

print("Feature columns:", feature_cols)
print("X shape:", X.shape)
print("y distribution:")
print(y.value_counts(normalize=True).rename("proportion"))


Feature columns: ['MONTH', 'RainToday', 'BASEL_pressure', 'BASEL_humidity', 'BASEL_temp_mean', 'BASEL_sunshine', 'BASEL_pressure_lag1', 'BASEL_humidity_lag1', 'BASEL_temp_mean_lag1', 'BASEL_sunshine_lag1']
X shape: (3653, 10)
y distribution:
RainTomorrow
0    0.532987
1    0.467013
Name: proportion, dtype: float64



The final feature matrix has shape `(3653, 10)`: `MONTH`, `RainToday`, and eight
Basel weather variables (four current values plus their one-day lags).
The `RainTomorrow` label is close to balanced (about 53% no-rain, 47% rain),
so both classes should be well represented in the training and test sets.
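
The class balance also sets a floor for judging the models: a predictor that always says "no rain" would already be right about 53% of the time on the full table, while scoring an F1 of zero for the rain class. The sketch below shows this majority-class reference on toy labels; the function name and the toy labels are illustrative assumptions, not the notebook's data.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Toy labels with roughly the same 53/47 balance as the notebook
# (hypothetical values, not the real Basel labels).
toy_labels = [0] * 53 + [1] * 47
baseline_acc = majority_class_accuracy(toy_labels)
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Any model accuracy reported later is easiest to interpret relative to this kind of trivial reference.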


## 3. Time-Aware Train/Test Split

### 3.1 Use an 80/20 chronological split

To mimic real forecasting, I train on earlier years and test on later years.
Here I take the first 80% of days as the training set and the final 20% of days
as the test set, preserving the chronological order.


In [5]:
# 3.1 Chronological 80/20 split

n = len(df)
split_idx = int(0.8 * n)

X_train = X.iloc[:split_idx].copy()
y_train = y.iloc[:split_idx].copy()

X_test = X.iloc[split_idx:].copy()
y_test = y.iloc[split_idx:].copy()

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Train dates:", df["DATE"].iloc[0], "→", df["DATE"].iloc[split_idx - 1])
print("Test dates:", df["DATE"].iloc[split_idx], "→", df["DATE"].iloc[-1])


Train shape: (2922, 10) Test shape: (731, 10)
Train dates: 2000-01-02 00:00:00 → 2008-01-01 00:00:00
Test dates: 2008-01-02 00:00:00 → 2010-01-01 00:00:00



The training set contains 2,922 days and the test set contains 731 days, with
train dates from 2000-01-02 to 2008-01-01 and test dates from 2008-01-02 to
2010-01-01. This split ensures that the models are always trained on the past
and evaluated on a later, unseen period.
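
The same split logic can be wrapped in a small helper that also asserts there is no date overlap between the two sets. This is a sketch under the assumption of one row per day; `chronological_split` and the toy table are hypothetical names introduced only for illustration.

```python
import pandas as pd


def chronological_split(frame, date_col="DATE", train_frac=0.8):
    """Split a table into earlier (train) and later (test) rows by date."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    split_idx = int(train_frac * len(ordered))
    train, test = ordered.iloc[:split_idx], ordered.iloc[split_idx:]
    # Guard against leakage: every training date must precede every test date.
    if len(train) and len(test):
        assert train[date_col].max() < test[date_col].min(), "dates overlap"
    return train, test


# Ten consecutive toy days (hypothetical data).
toy = pd.DataFrame({
    "DATE": pd.date_range("2000-01-02", periods=10, freq="D"),
    "RainTomorrow": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
})
train, test = chronological_split(toy)
print(len(train), len(test))
```

The assertion is cheap and fails loudly if duplicate dates ever straddle the split point.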


## 4. Evaluation Helper

### 4.1 Shared metric function for all models

I define a small helper that reports accuracy, precision, recall, F1 score, and
the confusion matrix for any set of predictions. Using the same metrics for both
models makes their performance directly comparable.


In [6]:


def eval_model(y_true, y_pred, name="model"):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)

    print(f"\n=== {name} ===")
    print("Accuracy :", f"{acc:.3f}")
    print("Precision:", f"{prec:.3f}")
    print("Recall   :", f"{rec:.3f}")
    print("F1       :", f"{f1:.3f}")
    print("Confusion matrix:\n", cm)

    return {
        "model": name,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
    }
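
For reference, all four headline metrics follow directly from the confusion-matrix counts. scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). The sketch below recomputes the metrics by hand on illustrative counts; the function name and the numbers are hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 from binary counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when a denominator is empty, like zero_division=0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (not taken from the notebook's results).
m = metrics_from_confusion(tn=50, fp=10, fn=20, tp=20)
for key, value in m.items():
    print(f"{key:9s}: {value:.3f}")
```

Checking a printed confusion matrix against the reported metrics this way is a quick guard against mixing up the positive class.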


## 5. Improved Logistic Regression (with Scaling and Class Weights)

### 5.1 Pipeline with `StandardScaler` and hyperparameter tuning

Here I fit a logistic regression model in a pipeline with feature scaling and a
small grid search over the regularization strength `C` and the `class_weight`
setting. The search picks the configuration with the best mean F1 across five
cross-validation folds of the training set, emphasizing performance on rainy days.


In [7]:


logreg_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000)),
])

param_grid = {
    "logreg__C": [0.1, 1.0, 3.0],
    "logreg__class_weight": [None, "balanced"],
}

logreg_search = GridSearchCV(
    logreg_pipe,
    param_grid=param_grid,
    scoring="f1",   # focus on rainy days
    cv=5,
    n_jobs=-1,
)

logreg_search.fit(X_train, y_train)
print("Best params (logreg):", logreg_search.best_params_)

y_pred_logreg = logreg_search.predict(X_test)
logreg_results = eval_model(y_test, y_pred_logreg, name="Improved Logistic Regression")

print("\nClassification report (Improved Logistic Regression):")
print(classification_report(y_test, y_pred_logreg, digits=3))


Best params (logreg): {'logreg__C': 3.0, 'logreg__class_weight': 'balanced'}

=== Improved Logistic Regression ===
Accuracy : 0.668
Precision: 0.640
Recall   : 0.697
F1       : 0.668
Confusion matrix:
 [[244 137]
 [106 244]]

Classification report (Improved Logistic Regression):
              precision    recall  f1-score   support

           0      0.697     0.640     0.668       381
           1      0.640     0.697     0.668       350

    accuracy                          0.668       731
   macro avg      0.669     0.669     0.668       731
weighted avg      0.670     0.668     0.668       731




The best logistic regression model uses `C = 3.0` with `class_weight = "balanced"`.
On the 2008–2010 test period it achieves about **66.8% accuracy** and an
**F1 score of 0.668** for the rain class, with precision ≈ 0.64 and recall ≈ 0.70.
(Accuracy and F1 coincide here because the true-positive and true-negative counts
are both 244.) The confusion matrix shows that the model detects about 70% of
rainy days while still correctly identifying 64% of dry days. This is stronger
and more balanced than the simple logistic baseline from B3 (F1 ≈ 0.53 for rain).
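
One caveat on the tuning step: passing `cv=5` to `GridSearchCV` with a classifier uses stratified folds without regard to time order, so some folds validate on days that come before part of their training data. A time-aware alternative is scikit-learn's `TimeSeriesSplit`, sketched below on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, and this is not the configuration that produced the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for chronologically ordered training rows
# (hypothetical data, not the Basel table).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000)),
])

# Each fold trains on an initial block of rows and validates on the
# block that immediately follows it.
tscv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(
    pipe,
    param_grid={
        "logreg__C": [0.1, 1.0, 3.0],
        "logreg__class_weight": [None, "balanced"],
    },
    scoring="f1",
    cv=tscv,
)
search.fit(X_demo, y_demo)

# Confirm that validation rows always come after training rows.
folds = list(tscv.split(X_demo))
ordered_ok = all(tr.max() < va.min() for tr, va in folds)
print("Folds respect time order:", ordered_ok)
print("Best params:", search.best_params_)
```

Whether this changes the selected hyperparameters on the Basel data is an open question; the sketch only shows how the inner validation could be made consistent with the outer chronological split.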


## 6. Random Forest Model

### 6.1 Nonlinear tree ensemble on the same features

Next, I fit a random forest classifier using the same train/test split and
features. This model can capture nonlinear relationships between pressure,
humidity, temperature, sunshine, and their lags without any extra feature
engineering.


In [8]:


rf = RandomForestClassifier(
    n_estimators=300,
    random_state=0,
    n_jobs=-1,
)

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

rf_results = eval_model(y_test, y_pred_rf, name="Random Forest")

print("\nClassification report (Random Forest):")
print(classification_report(y_test, y_pred_rf, digits=3))

# feature importances
importances = pd.Series(rf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop feature importances (Random Forest):")
print(importances.head(10))



=== Random Forest ===
Accuracy : 0.670
Precision: 0.655
Recall   : 0.657
F1       : 0.656
Confusion matrix:
 [[260 121]
 [120 230]]

Classification report (Random Forest):
              precision    recall  f1-score   support

           0      0.684     0.682     0.683       381
           1      0.655     0.657     0.656       350

    accuracy                          0.670       731
   macro avg      0.670     0.670     0.670       731
weighted avg      0.670     0.670     0.670       731


Top feature importances (Random Forest):
BASEL_pressure          0.174164
BASEL_pressure_lag1     0.129110
BASEL_temp_mean         0.104143
BASEL_temp_mean_lag1    0.101512
BASEL_sunshine          0.094483
RainToday               0.091315
BASEL_humidity_lag1     0.086223
BASEL_sunshine_lag1     0.084502
BASEL_humidity          0.083258
MONTH                   0.051289
dtype: float64



The random forest reaches about **67.0% accuracy** and an **F1 score of 0.656**
for rainy days, which is very similar overall to the tuned logistic model.
Its impurity-based feature importances rank current and lagged **pressure**
highest, followed by `BASEL_temp_mean`, `BASEL_temp_mean_lag1`, and the
sunshine-related variables, although the importances are fairly evenly spread
(from about 0.05 to 0.17). This ranking matches the earlier physics-based
intuition that pressure and recent conditions are strong drivers of rain.
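
The `feature_importances_` attribute is impurity-based: it is computed on the training data and is known to favor features with many distinct values, which may understate low-cardinality columns such as `RainToday` or `MONTH`. A common cross-check is permutation importance on held-out data. The sketch below uses synthetic data in which only the first column carries signal; all names and data here are hypothetical, not the notebook's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the first column determines the label
# (hypothetical, not the Basel features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_names = ["signal", "noise_a", "noise_b"]

# Fit on the first 300 rows, evaluate on the last 100.
X_fit, X_held = X_demo[:300], X_demo[300:]
y_fit, y_held = y_demo[:300], y_demo[300:]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_fit, y_fit)

# Shuffle one column at a time on held-out rows and record the F1 drop.
result = permutation_importance(
    forest, X_held, y_held, scoring="f1", n_repeats=10, random_state=0
)
for name, mean, std in zip(
    demo_names, result.importances_mean, result.importances_std
):
    print(f"{name:8s} {mean:.3f} +/- {std:.3f}")
```

If the two rankings agree on the real features, that would strengthen the interpretation that pressure is the dominant predictor.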


## 7. Compare Improved Models

### 7.1 Summary table of logistic regression vs random forest

Finally, I gather the main metrics from both improved models into a single
summary table and save it for use in the final report.


In [9]:

results_df = pd.DataFrame([logreg_results, rf_results])
print("Summary metrics:")
print(results_df)

results_path = "/Users/purvigarg/Downloads/CMSE492/cmse492_project/data/processed/improved_model_results.csv"
os.makedirs(os.path.dirname(results_path), exist_ok=True)
results_df.to_csv(results_path, index=False)
print("\nSaved results table to:", results_path)


Summary metrics:
                          model  accuracy  precision    recall        f1
0  Improved Logistic Regression  0.667579   0.640420  0.697143  0.667579
1                 Random Forest  0.670315   0.655271  0.657143  0.656205

Saved results table to: /Users/purvigarg/Downloads/CMSE492/cmse492_project/data/processed/improved_model_results.csv
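
As a rough gauge of whether the gaps in this table are meaningful, the binomial standard error of an accuracy estimate on 731 test days can be compared with the accuracy difference between the two models. This is a back-of-the-envelope sketch that treats test days as independent, which daily weather does not strictly satisfy; the function name is hypothetical.

```python
import math


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy estimated from n cases."""
    return math.sqrt(acc * (1.0 - acc) / n)


# Values copied from the summary table above.
n_test = 731
acc_logreg, acc_rf = 0.667579, 0.670315

se = accuracy_standard_error(acc_rf, n_test)
gap = acc_rf - acc_logreg
print(f"standard error ~ {se:.4f}, accuracy gap = {gap:.4f}")
```

Under this approximation the accuracy gap is much smaller than one standard error, which supports reading the two models as performing about equally on accuracy.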


## 8. Conclusion

This notebook used the processed Basel feature table to train and compare two
stronger models for the question “Will it rain tomorrow in Basel?” Using a
chronological 80/20 split, both the tuned logistic regression and the random
forest reached about 67% accuracy on the 2008–2010 test period. The logistic
regression with balanced class weights slightly outperformed the random forest
in F1 (0.668 vs. 0.656) and recall (0.697 vs. 0.657) for rainy days, while
remaining easy to interpret and explain.

Taken together with the B3 baseline results, these experiments show that
careful preprocessing, lagged features, and modest tuning are enough to move
from a weak baseline (F1 ≈ 0.53 for Rain) to a stable, interpretable model with
F1 ≈ 0.67. In the final report, I treat the tuned logistic regression as the
main “final” model and use the random forest as a sanity check that a more
flexible method does not provide dramatically better performance on this
station-only dataset.
