# B3. Baseline Model

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# load the same raw file as in B1
DATA_PATH = "/Users/purvigarg/Downloads/CMSE492/cmse492_project/data/raw/weather_prediction_dataset.csv"
data = pd.read_csv(DATA_PATH)

print("Loaded:", data.shape)

Loaded: (3654, 165)


## Sort by Date 

In [8]:
data["DATE"] = pd.to_datetime(data["DATE"].astype(str), errors="coerce")
data = data.sort_values("DATE").reset_index(drop=True)

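Because this is weather data, time order really matters. I don’t want to accidentally train on future days and test on past days, because that would be leakage. So I converted the DATE column to a real datetime and sorted the whole dataframe by date. Now row 0 is the earliest day and the last row is the latest day, which is what makes a time-ordered train/test split possible. One thing to keep in mind: `errors="coerce"` turns any date that cannot be parsed into NaT instead of raising an error, so it is worth confirming that no dates went missing in this step.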
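The sorting step can be sanity-checked with a few one-liners. This is a minimal sketch on a made-up toy frame (the dates and the `toy` name are illustrative, not taken from the weather file); it counts unparsed dates, then confirms the result is in increasing order with no duplicate days.

```python
import pandas as pd

# Toy frame standing in for the weather data; the dates are made up for illustration.
toy = pd.DataFrame({"DATE": ["2000-01-03", "2000-01-01", "not-a-date", "2000-01-02"]})

toy["DATE"] = pd.to_datetime(toy["DATE"], errors="coerce")
n_bad = int(toy["DATE"].isna().sum())   # rows whose date failed to parse (NaT)

# Drop unparsed rows, then sort chronologically.
toy = toy.dropna(subset=["DATE"]).sort_values("DATE").reset_index(drop=True)

print("unparsed dates:", n_bad)
print("sorted:", toy["DATE"].is_monotonic_increasing)
print("duplicate days:", int(toy["DATE"].duplicated().sum()))
```

The same three checks can be run on `data["DATE"]` right after the sort.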

## Create the target (RainTomorrow)

In [None]:
city = "BASEL"
pref = f"{city}_"

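# label = tomorrow's rain (1 if tomorrow's precipitation > 0)
# Shift first and drop the last day (it has no tomorrow) BEFORE comparing:
# NaN > 0 evaluates to False, which would silently label the last row as 0.
data["RainTomorrow"] = data[f"{pref}precipitation"].shift(-1)
data = data.dropna(subset=["RainTomorrow"]).reset_index(drop=True)
data["RainTomorrow"] = (data["RainTomorrow"] > 0).astype(int)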

My actual question is “Will it rain in Basel tomorrow?” but each row of the dataset only records today’s precipitation. So I made the label myself: I took Basel’s precipitation column and shifted it up by 1 day. That way, today’s row is paired with tomorrow’s rain (1 if tomorrow’s precipitation is above 0, otherwise 0). I also dropped the very last row because the last day has no tomorrow. Now I have a proper supervised-learning target called RainTomorrow.
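A detail worth double-checking in this step is the order of the comparison and the `dropna`. The sketch below uses made-up precipitation values (the `toy` frame and `precip` column are illustrative only) to show that comparing a shifted series with `> 0` turns the trailing NaN into `False`, so the last day only disappears if the NaN is dropped before the comparison.

```python
import pandas as pd

# Made-up precipitation values for five consecutive days.
toy = pd.DataFrame({"precip": [0.0, 1.2, 0.0, 0.0, 3.4]})

tomorrow = toy["precip"].shift(-1)   # last entry is NaN: no "tomorrow" exists

# Comparing first hides the missing value: NaN > 0 is False, i.e. label 0.
label_compare_first = (tomorrow > 0).astype(int)

# Dropping the NaN first keeps only days that really have a next-day value.
label_drop_first = (tomorrow.dropna() > 0).astype(int)

print(label_compare_first.tolist())  # [1, 0, 0, 1, 0]
print(label_drop_first.tolist())     # [1, 0, 0, 1]
```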

## Pick a tiny feature set for the baseline

In [9]:
feature_cols = [
    f"{pref}pressure",
    f"{pref}humidity",
    f"{pref}temp_mean",
    f"{pref}sunshine",
]
# remove columns that don't exist
feature_cols = [c for c in feature_cols if c in data.columns]

X = data[feature_cols].copy()
y = data["RainTomorrow"].astype(int)

Since this is only the baseline step, I didn’t want to throw in all 165 columns. I picked four Basel features that my EDA already suggested are related to rain: pressure (low → rain), humidity (high → rain), temp_mean (context), and sunshine (low → rain). This gives me a small, clean, and weather-logical feature set. I also filtered the list against the dataframe’s columns, so a missing column shrinks the feature set instead of raising an error.
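Because the filter drops missing columns silently, it can help to print what was actually dropped. A minimal sketch, assuming a hypothetical frame in which `BASEL_sunshine` is deliberately absent (the `toy`, `wanted`, `present`, and `missing` names are illustrative):

```python
import pandas as pd

# Hypothetical frame: "BASEL_sunshine" is deliberately absent to trigger the report.
toy = pd.DataFrame(columns=["BASEL_pressure", "BASEL_humidity", "BASEL_temp_mean"])

wanted = ["BASEL_pressure", "BASEL_humidity", "BASEL_temp_mean", "BASEL_sunshine"]
present = [c for c in wanted if c in toy.columns]      # columns that will be used
missing = [c for c in wanted if c not in toy.columns]  # columns silently skipped

print("using:", present)
print("missing:", missing)
```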

## Time-ordered train/test split

In [None]:
# time–ordered train/test split (first 80% train, last 20% test)
split = int(0.8 * len(data))
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

I want to simulate real life: I train on earlier years and test on later years. So I took the first 80% of the rows (earlier dates) as my training set and the last 20% (later dates) as my test set. I did not shuffle because that would mix past and future. This way, the baseline performance I report is honest.
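The "no future data in training" property can be checked directly. This is a small sketch on ten made-up consecutive days (the `toy` frame is illustrative); it applies the same 80/20 positional split and verifies that the latest training date is earlier than the earliest test date.

```python
import pandas as pd

# Ten made-up consecutive days, already sorted by date.
toy = pd.DataFrame({"DATE": pd.date_range("2000-01-01", periods=10, freq="D")})

split = int(0.8 * len(toy))
train, test = toy.iloc[:split], toy.iloc[split:]

# Every training date should be strictly earlier than every test date.
assert train["DATE"].max() < test["DATE"].min()

print(len(train), len(test))  # 8 2
print(train["DATE"].max().date(), "<", test["DATE"].min().date())
```

On the real data, the same assertion would use `data["DATE"].iloc[:split]` and `data["DATE"].iloc[split:]`.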

## Baseline 1 — Majority class

In [None]:
# Baseline 1: Majority class (always predict the most common label in TRAIN) 
majority_label = y_train.mode()[0]
y_pred_majority = np.full_like(y_test, fill_value=majority_label)

acc_majority = accuracy_score(y_test, y_pred_majority)
f1_majority  = f1_score(y_test, y_pred_majority, pos_label=1)

print("=== Baseline 1: Majority (always", majority_label, ") ===")
print(f"Accuracy: {acc_majority:.3f}")
print(f"F1 (Rain): {f1_majority:.3f}")
print()

=== Baseline 1: Majority (always 0 ) ===
Accuracy: 0.521
F1 (Rain): 0.000



I wanted a “zero-effort” model to compare against. So I built the simplest classifier possible: always predict the class that happens the most in the training data. In my case that’s “No Rain” (label 0). This baseline tells me what accuracy I get without using any weather information. It also shows me a problem: the F1 score for the Rain class is 0, because this model never predicts rain and so has no true positives. That’s a useful result, because now I know what I must beat.
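scikit-learn also ships this baseline as `DummyClassifier(strategy="most_frequent")`, which should behave the same as the manual `mode()` approach. A minimal sketch on made-up labels (all `*_toy` names are illustrative, and the all-zero feature matrix is only a placeholder because this strategy ignores the features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 0 is the majority in the training part.
y_train_toy = np.array([0, 0, 0, 1, 1])
y_test_toy = np.array([0, 1, 1, 0])
X_train_toy = np.zeros((len(y_train_toy), 1))  # placeholder features (ignored)
X_test_toy = np.zeros((len(y_test_toy), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_toy, y_train_toy)
y_pred_toy = dummy.predict(X_test_toy)

acc_toy = accuracy_score(y_test_toy, y_pred_toy)
# zero_division=0 returns 0.0 quietly when class 1 is never predicted.
f1_toy = f1_score(y_test_toy, y_pred_toy, pos_label=1, zero_division=0)

print(y_pred_toy.tolist())  # [0, 0, 0, 0]
print(acc_toy)              # 0.5
print(f1_toy)               # 0.0
```

Passing `zero_division=0` avoids the undefined-metric warning that `f1_score` otherwise emits when the positive class is never predicted.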

## Baseline 2 — Simple Logistic Regression

In [None]:
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)
y_pred_logit = logit.predict(X_test)

acc_logit = accuracy_score(y_test, y_pred_logit)
f1_logit  = f1_score(y_test, y_pred_logit, pos_label=1)

print("=== Baseline 2: Logistic Regression (simple) ===")
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_logit))
print(classification_report(y_test, y_pred_logit, digits=3))
print(f"Accuracy: {acc_logit:.3f}")
print(f"F1 (Rain): {f1_logit:.3f}")

=== Baseline 2: Logistic Regression (simple) ===
Confusion matrix:
[[234 147]
 [169 181]]
              precision    recall  f1-score   support

           0      0.581     0.614     0.597       381
           1      0.552     0.517     0.534       350

    accuracy                          0.568       731
   macro avg      0.566     0.566     0.565       731
weighted avg      0.567     0.568     0.567       731

Accuracy: 0.568
F1 (Rain): 0.534


I wanted a real baseline, not just the trivial one. Logistic regression is a good first real model for binary classification: it’s simple, fast, and I can explain it. Here it uses only a few Basel weather features but already predicts both classes, and its accuracy (0.568) is above the majority baseline (0.521). This suggests there is some learnable signal in the features I chose. Later, when I try trees, random forest, or even a small neural net, I can ask “is this better than my logistic?” If it’s not, I know the fancy model isn’t worth it.
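To make the "I can explain it" part concrete, the fitted coefficients can be read off the model. The sketch below is not the notebook's model: it runs on synthetic data with an invented rule, and it adds a `StandardScaler` step (an assumption, not something the baseline above uses) so that coefficients of features on different scales are comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Basel dataset): two features on different scales.
rng = np.random.default_rng(0)
n = 500
pressure = rng.normal(1015.0, 8.0, n)   # hypothetical values, illustration only
humidity = rng.normal(0.75, 0.10, n)

# Invented rule: lower pressure and higher humidity make class 1 more likely.
score = -(pressure - 1015.0) / 8.0 + (humidity - 0.75) / 0.10
y_toy = (score + rng.normal(0.0, 1.0, n) > 0).astype(int)
X_toy = np.column_stack([pressure, humidity])

# Standardize, then fit; make_pipeline names each step after its lowercased class.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)

coefs = model.named_steps["logisticregression"].coef_[0]
for name, c in zip(["pressure", "humidity"], coefs):
    print(f"{name:>8}: {c:+.2f}")
```

A negative coefficient means higher values push the prediction toward class 0, and a positive one pushes it toward class 1, so on this synthetic data the pressure coefficient should come out negative and the humidity coefficient positive.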

## Conclusion

The baseline results tell me two very different stories. The first model, “always predict No Rain,” got 52.1% accuracy, but that number is misleading: it is that high only because a little over half of the test days (381 of 731) are actually dry. Since it never predicts rain, its F1 for the Rain class is 0.0, which means it completely fails at the part of the problem I actually care about (finding rainy days). So I can already say: accuracy by itself is not enough for this project.

When I used a real model, the simple logistic regression, the picture improved. The accuracy went up to 56.8%, about 4.7 percentage points better than the “always no rain” rule. More importantly, the F1 score for Rain is 0.534, which means the model now finds a good share of the rainy days and not just the dry ones. The confusion matrix [[234, 147], [169, 181]] tells me how: it correctly said “no rain” 234 times and “rain” 181 times, but it also missed 169 rainy days (false negatives) and raised 147 false alarms (false positives). That is a recall of 0.517 for Rain, so it still misses almost half of the rainy days. The model is useful, but still noisy.

Overall, what I understand from these numbers is: (1) a trivial baseline is not acceptable, (2) even a small, interpretable model can learn some signal from Basel weather features, and (3) to get to the ~0.67 accuracy and ~0.68–0.70 F1 I saw later, I need to add better features (lags, seasonality, neighbor pressure) and possibly tune the threshold.
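The reported metrics can be re-derived by hand from the confusion matrix printed in the logistic regression output above. This is plain arithmetic on those four counts; only the variable names are mine.

```python
# Confusion matrix from the logistic regression output:
# rows = true class (0 = no rain, 1 = rain), columns = predicted class.
tn, fp = 234, 147
fn, tp = 169, 181
total = tn + fp + fn + tp                 # 731 test days

accuracy = (tp + tn) / total
precision = tp / (tp + fp)                # of predicted rain days, how many were rainy
recall = tp / (tp + fn)                   # of rainy days, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# "Always predict no rain" is right exactly on the truly dry days.
majority_accuracy = (tn + fp) / total

print(f"accuracy  {accuracy:.3f}")           # 0.568
print(f"precision {precision:.3f}")          # 0.552
print(f"recall    {recall:.3f}")             # 0.517
print(f"F1 (rain) {f1:.3f}")                 # 0.534
print(f"majority  {majority_accuracy:.3f}")  # 0.521
```

All five values match the printed outputs of the two baselines.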