# 7b – Random Forest Model Training

This notebook trains a Random Forest classifier
as a non-linear comparison model to the Logistic Regression baseline.

Unlike a linear model, a Random Forest can capture feature interactions
and non-linear relationships in weather patterns.


## Import Required Libraries and Initialize AWS Session


In [2]:
import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.Session().region_name
role = get_execution_role()

project_prefix = "ghcn-extreme"
s3 = boto3.client("s3")

print("Bucket:", bucket)
print("Region:", region)


Bucket: sagemaker-us-east-1-083422367993
Region: us-east-1


## Load Partitioned Parquet Dataset from S3


In [3]:
project_prefix = "ghcn-extreme"
parquet_s3_location = f"s3://{bucket}/{project_prefix}/parquet/"

df = pd.read_parquet(parquet_s3_location)

df.head()


,station_id,date,tmax,tmin,prcp_lag_1,prcp_roll_7,extreme_precip_tomorrow,month,year
0,USW00012921,2006-02-18,3.9,-1.1,1.5,0.214286,0,2,2006
1,USW00012921,2006-02-19,5.6,-1.7,0.0,0.257143,0,2,2006
2,USW00012921,2006-02-20,8.9,1.7,0.3,0.4,0,2,2006
3,USW00012921,2006-02-21,13.9,6.1,1.0,0.442857,0,2,2006
4,USW00012921,2006-02-22,22.2,12.8,0.3,0.442857,0,2,2006


## Sort by Date and Perform Time-Based Split


In [4]:
df["date"] = pd.to_datetime(df["date"])

df = df.sort_values("date").reset_index(drop=True)

split_date = df["date"].quantile(0.8)

train_df = df[df["date"] <= split_date].copy()
val_df = df[df["date"] > split_date].copy()

print("Training size:", train_df.shape)
print("Validation size:", val_df.shape)


Training size: (29155, 9)
Validation size: (7289, 9)
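
One property worth checking after a time-based split is that no validation date precedes the latest training date. The sketch below repeats the same quantile split on a small synthetic frame and asserts that the two partitions do not overlap in time. The `toy` data and the `time_split` helper are hypothetical and not part of this notebook.

```python
import pandas as pd


def time_split(frame, date_col="date", train_frac=0.8):
    """Split rows chronologically at the given quantile of the date column."""
    frame = frame.sort_values(date_col).reset_index(drop=True)
    cutoff = frame[date_col].quantile(train_frac)
    train = frame[frame[date_col] <= cutoff].copy()
    val = frame[frame[date_col] > cutoff].copy()
    return train, val, cutoff


# Synthetic example: two made-up stations observed on the same ten days.
dates = pd.date_range("2020-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "station_id": ["A"] * 10 + ["B"] * 10,
    "date": list(dates) * 2,
})

toy_train, toy_val, cutoff = time_split(toy)

# Every training date must be earlier than every validation date.
assert toy_train["date"].max() < toy_val["date"].min()
print(len(toy_train), len(toy_val))
```

Because the cutoff is compared against the date itself, all rows that share a date land on the same side of the split, so a single day is never divided between training and validation.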


## Separate Features and Target


In [5]:
y_train = train_df["extreme_precip_tomorrow"].astype(int)
y_val = val_df["extreme_precip_tomorrow"].astype(int)

X_train = train_df.drop(columns=["extreme_precip_tomorrow", "date"])
X_val = val_df.drop(columns=["extreme_precip_tomorrow", "date"])


## Encode Categorical Feature (station_id)


In [6]:
X_train = pd.get_dummies(X_train, columns=["station_id"], dummy_na=False)
X_val = pd.get_dummies(X_val, columns=["station_id"], dummy_na=False)

X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
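
The `reindex` call matters when a station appears in only one partition. The sketch below uses made-up station IDs (all `toy_*` names are illustrative) to show the validation columns being aligned to the training layout.

```python
import pandas as pd

# Hypothetical partitions: station "B" appears only in training,
# station "C" only in validation.
toy_train = pd.DataFrame({"tmax": [1.0, 2.0, 3.0], "station_id": ["A", "B", "A"]})
toy_val = pd.DataFrame({"tmax": [4.0, 5.0], "station_id": ["A", "C"]})

toy_train_enc = pd.get_dummies(toy_train, columns=["station_id"])
toy_val_enc = pd.get_dummies(toy_val, columns=["station_id"])

# Before alignment the two frames disagree on their indicator columns.
print(list(toy_train_enc.columns))
print(list(toy_val_enc.columns))

# Align validation to the training layout: columns missing from validation
# are added and filled with 0, and columns unseen in training are dropped.
toy_val_enc = toy_val_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

assert list(toy_val_enc.columns) == list(toy_train_enc.columns)
```

A consequence of this alignment is that a station seen only in validation ends up with all-zero indicator columns, so the model treats it as belonging to none of the known stations.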


## Initialize Random Forest Model

Class imbalance is addressed with `class_weight="balanced"`, which weights each class inversely to its frequency in the training labels.
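
For reference, scikit-learn documents the `"balanced"` mode as assigning each class the weight `n_samples / (n_classes * count)`. The sketch below applies that formula by hand. The counts are the validation-set supports reported in the evaluation output of this notebook (6948 negatives, 341 positives), used here only as an illustration because the training-set class counts are not shown. `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Return {label: n_samples / (n_classes * count)} for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative counts only (validation supports, not training counts).
example_counts = {0: 6948, 1: 341}
example_weights = balanced_class_weights(example_counts)

for label, weight in example_weights.items():
    print(label, round(weight, 3))
```

With these weights each class contributes the same total weight, which is why the rare class receives a weight roughly twenty times larger than the common one in this illustration.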


In [7]:
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)


parameter,value
n_estimators,200
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


## Evaluate Model Performance


In [8]:
y_pred = rf_model.predict(X_val)
y_proba = rf_model.predict_proba(X_val)[:, 1]

print("ROC AUC:", roc_auc_score(y_val, y_proba))
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_pred))


ROC AUC: 0.6976604166350112

Confusion Matrix:
[[6943    5]
 [ 341    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.98      6948
           1       0.00      0.00      0.00       341

    accuracy                           0.95      7289
   macro avg       0.48      0.50      0.49      7289
weighted avg       0.91      0.95      0.93      7289
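
For a binary Random Forest, `predict` effectively applies a 0.5 cutoff to the positive-class probability, and the confusion matrix shows that almost no validation rows cross it. A ROC AUC near 0.70 suggests the scores still carry some ranking information, so one option is to examine lower cutoffs. The sketch below is a minimal, dependency-free threshold sweep. `sweep_thresholds` and the toy scores are hypothetical; in this notebook the function would presumably be fed `y_val` and `y_proba`.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Compute positive-class precision, recall and F1 at each threshold."""
    rows = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


# Toy data: positives score higher on average, but every score is below 0.5.
toy_y = [0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.02, 0.05, 0.10, 0.12, 0.30, 0.08, 0.25, 0.40]

results = sweep_thresholds(toy_y, toy_scores, [0.5, 0.3, 0.2, 0.1])
best = max(results, key=lambda r: r["f1"])

for r in results:
    print(r)
print("Best F1 at threshold:", best["threshold"])
```

A threshold chosen on the validation set makes the validation metrics optimistic, so a separate held-out period would be the safer place to report final numbers.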



## Save Trained Model to S3


In [9]:
import joblib
import tarfile

project_prefix = "ghcn-extreme"

# Save the model with pickle protocol 4 (readable by Python 3.4 and later).
# Note: the serving container must also run a compatible scikit-learn version
# to load the model.
joblib.dump(
    rf_model,
    "model.joblib",
    compress=3,
    protocol=4   # pinned for compatibility with the SageMaker container's Python
)

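# Build the archive ("w:gz" overwrites any existing model.tar.gz).
# inference.py must already exist in the working directory.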
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib")
    tar.add("inference.py")

# Upload to S3 (overwrite existing)
s3 = boto3.client("s3")
model_s3_key = f"{project_prefix}/models/model.tar.gz"

s3.upload_file("model.tar.gz", bucket, model_s3_key)

print("Model archive uploaded to:")
print(f"s3://{bucket}/{model_s3_key}")


Model archive uploaded to:
s3://sagemaker-us-east-1-083422367993/ghcn-extreme/models/model.tar.gz


In [10]:
!tar -tzf model.tar.gz


model.joblib
inference.py
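
The same check can be done from Python, which is convenient when the archive should be validated before upload. The sketch below builds a throwaway archive in a temporary directory from placeholder files (their contents are dummies, not the real model or script) and verifies that the expected members sit at the archive root. `missing_members` is a hypothetical helper.

```python
import os
import tarfile
import tempfile


def missing_members(archive_path, expected):
    """Return the expected member names that are absent from the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        present = set(tar.getnames())
    return [name for name in expected if name not in present]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the real artifacts.
    for name in ("model.joblib", "inference.py"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("placeholder")

    archive_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in ("model.joblib", "inference.py"):
            # arcname keeps each member at the archive root.
            tar.add(os.path.join(tmp, name), arcname=name)

    missing = missing_members(archive_path, ["model.joblib", "inference.py"])
    missing_extra = missing_members(
        archive_path, ["model.joblib", "requirements.txt"]
    )

print("Missing:", missing)
```

An empty list means every expected file is present. The second call shows how an absent file would be reported.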


In [11]:
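import tarfile

# Rebuild the local archive from scratch. Mode "w:gz" already overwrites an
# existing file, so the explicit removal is only a safeguard.
# Note: this cell does not re-upload; rerun the upload step to refresh S3.
!rm -f model.tar.gz

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib")
    tar.add("inference.py")

print("Archive rebuilt.")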


Archive rebuilt.


## Summary

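The Random Forest model provides a non-linear comparison
to the Logistic Regression baseline.

On the validation set it reaches a ROC AUC of about 0.70,
but at the default decision threshold it identifies none of the 341 positive cases
(recall 0.00 for class 1). The 0.95 accuracy therefore reflects the majority class.

These metrics can now be compared with the baseline's to determine
whether non-linear modeling improves extreme event detection.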
