# Rainfall Prediction Model

This notebook implements a rainfall prediction model with the following enhancements:
- **Class Imbalance Handling**: Uses SMOTE to oversample the minority class.
- **Hyperparameter Tuning**: Expands parameter grids for Random Forest, Gradient Boosting, CatBoost, and MLPClassifier.
- **Ensemble Diversification**: Includes a neural network (MLPClassifier) alongside other models.
- **Data Leakage Prevention**: Fits preprocessing steps (scaling, PCA) only on training data.
- **Distribution Shift Check**: Compares summary statistics between training and test sets.
- **Temporal Patterns**: Adds lag features for rainfall and temperature from the previous day.

Warnings are suppressed globally, and the final predictions are written as unrounded probabilities.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load train and test data
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

Check for missing data and fill any NaN values with a backward fill.
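
Backward fill copies the next valid value upward, so a gap in the last row of a column has nothing to copy from and stays missing. A minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset), with a forward fill as one possible fallback:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): one gap in the middle, one in the last row.
wind = pd.Series([80.0, np.nan, 40.0, np.nan])

backfilled = wind.bfill()          # fills index 1 from index 2; index 3 stays NaN
fallback = wind.bfill().ffill()    # a forward fill then covers the trailing gap

remaining_after_bfill = int(backfilled.isnull().sum())
remaining_after_fallback = int(fallback.isnull().sum())
print(remaining_after_bfill, remaining_after_fallback)
```

Whether a fallback is needed depends on where the gap sits, so it is worth re-checking `isnull().sum()` after filling.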

In [2]:
# Check for missing data
print("Missing data in train_data:\n", train_data.isnull().sum())
print("\nMissing data in test_data:\n", test_data.isnull().sum())

# Fill NaN values with backward fill
train_data = train_data.bfill()
test_data = test_data.bfill()

# Verify if missing data is handled
print("\nMissing data in train_data after bfill:\n", train_data.isnull().sum())
print("\nMissing data in test_data after bfill:\n", test_data.isnull().sum())

Missing data in train_data:
 id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    0
windspeed        0
rainfall         0
dtype: int64

Missing data in test_data:
 id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    1
windspeed        0
dtype: int64

Missing data in train_data after bfill:
 id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    0
windspeed        0
rainfall         0
dtype: int64

Missing data in test_data after bfill:
 id               0
day              0
pressure         0
maxtemp          0
temparature      0
min

## Feature Engineering

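We encode `day` (period 365) and `winddirection` (period 360) as sine/cosine pairs so that values at the ends of each cycle end up close together, and we add previous-day lag features for `rainfall` and `temparature` to capture temporal patterns.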
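
To see why a sine/cosine pair helps, the sketch below encodes three days of the year and compares distances in the encoded space. The helper name `encode_cyclical` and the sample days are chosen for illustration only.

```python
import numpy as np

def encode_cyclical(values, period):
    """Return (sin, cos) coordinates of values on a circle of the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Day 365 and day 1 are adjacent in the calendar but far apart as raw integers.
days = np.array([1, 183, 365])
day_sin, day_cos = encode_cyclical(days, period=365)
points = np.column_stack([day_sin, day_cos])

dist_wraparound = np.linalg.norm(points[2] - points[0])  # day 365 vs day 1
dist_half_year = np.linalg.norm(points[1] - points[0])   # day 183 vs day 1
print(dist_wraparound, dist_half_year)
```

The wrap-around pair ends up far closer than the half-year pair, which a raw integer `day` column cannot express.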

In [3]:
# Feature engineering for cyclical features
def create_cyclical_features(df, column, period):
    df[f'{column}_sin'] = np.sin(2 * np.pi * df[column] / period)
    df[f'{column}_cos'] = np.cos(2 * np.pi * df[column] / period)
    return df

# Apply cyclical transformations to 'day' and 'winddirection'
train_data = create_cyclical_features(train_data, 'day', 365)
test_data = create_cyclical_features(test_data, 'day', 365)
train_data = create_cyclical_features(train_data, 'winddirection', 360)
test_data = create_cyclical_features(test_data, 'winddirection', 360)

# Drop original cyclical columns
train_data = train_data.drop(['day', 'winddirection'], axis=1)
test_data = test_data.drop(['day', 'winddirection'], axis=1)

# Sort training data by 'id' for time-based consistency
train_data = train_data.sort_values('id')

# Add lag features for rainfall and temperature
train_data['rainfall_lag1'] = train_data['rainfall'].shift(1)
train_data['temparature_lag1'] = train_data['temparature'].shift(1)
test_data['rainfall_lag1'] = np.nan  # Placeholder
test_data['temparature_lag1'] = np.nan

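# Seed the first test row with the last training values, then forward-fill.
# Assign through .loc: chained indexing (df[col].iloc[0] = ...) may write to a copy.
first_test_idx = test_data.index[0]
test_data.loc[first_test_idx, 'rainfall_lag1'] = train_data['rainfall'].iloc[-1]
test_data.loc[first_test_idx, 'temparature_lag1'] = train_data['temparature'].iloc[-1]
test_data['rainfall_lag1'] = test_data['rainfall_lag1'].ffill()
test_data['temparature_lag1'] = test_data['temparature_lag1'].ffill()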

# Drop rows with NaN lag features in training data
train_data = train_data.dropna()
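
The lag step can be checked on a toy frame. The sketch below shows that `shift(1)` leaves the first training row without a predecessor, and, as an assumed alternative to forward-filling a single value, how a test-set lag for an observed column could be taken from the test column itself. The frames and values are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values); 'temparature' keeps the dataset's spelling.
train = pd.DataFrame({"id": [0, 1, 2], "temparature": [20.0, 21.5, 19.0]})
test = pd.DataFrame({"id": [3, 4, 5], "temparature": [18.0, 22.0, 23.5]})

# In training, shift(1) moves each value down one row; the first row has no predecessor.
train["temparature_lag1"] = train["temparature"].shift(1)

# Assumed alternative for the test set: temperature is observed there, so its lag
# can come from the test column, seeded with the last training value.
test["temparature_lag1"] = test["temparature"].shift(1)
test.loc[test.index[0], "temparature_lag1"] = train["temparature"].iloc[-1]

print(train)
print(test)
```

This only works for columns present in the test set. The target `rainfall` is not observed there, so its lag has no equivalent source.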

## Data Preparation

Separate features and target, apply scaling and PCA, and prevent data leakage by fitting transformations only on training data.

In [4]:
# Separate features and target from training data
X_train_full = train_data.drop(['id', 'rainfall'], axis=1)
y_train_full = train_data['rainfall']

# Prepare test features
X_test = test_data.drop(['id'], axis=1)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_full)
X_test_scaled = scaler.transform(X_test)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Retain 95% variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
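
As a sanity check on the fit-on-train, transform-on-test pattern, the sketch below runs the same two steps on synthetic stand-in data and reports how many components a 95% variance target keeps. All names ending in `_demo` and the data itself are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 6 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))
X_demo_train, X_demo_test = X_demo[:150], X_demo[150:]

# Scaling statistics and principal axes come from the training rows only.
demo_scaler = StandardScaler().fit(X_demo_train)
demo_pca = PCA(n_components=0.95).fit(demo_scaler.transform(X_demo_train))

# The test rows are only transformed, never used for fitting.
X_demo_test_pca = demo_pca.transform(demo_scaler.transform(X_demo_test))
retained = demo_pca.explained_variance_ratio_.sum()
print(demo_pca.n_components_, round(retained, 3), X_demo_test_pca.shape)
```

Inspecting `n_components_` and `explained_variance_ratio_` on the real data shows how much the feature space was actually reduced.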

## Handle Class Imbalance

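Use SMOTE to balance the classes: it synthesizes new minority-class samples by interpolating between existing ones. It is applied to the PCA-transformed training data only, never to the test set.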

In [5]:
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_pca, y_train_full)
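
The interpolation idea behind SMOTE can be sketched in a few lines. This is an illustrative re-implementation on toy data, not the `imblearn` code; the function name `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)                   # random minority points
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbours
    gap = rng.random((n_new, 1))                                     # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_majority = rng.normal(loc=0.0, size=(30, 2))   # toy majority class
X_minority = rng.normal(loc=3.0, size=(10, 2))   # toy minority class

X_synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_synthetic])
print(len(X_majority), len(X_minority_balanced))
```

Each synthetic point lies on a segment between two real minority points, so no sample is created outside the region the minority class already occupies.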

## Model Tuning and Training

Define the four base models, tune each with GridSearchCV (TimeSeriesSplit folds, ROC AUC scoring), and refit the final models with the best parameters.

In [6]:
# Define TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=5)

# Define base models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)
catboost = CatBoostClassifier(verbose=0, random_state=42)
mlp = MLPClassifier(random_state=42)

# Expanded hyperparameter grids
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
gb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [3, 5, 7]
}
catboost_param_grid = {
    'iterations': [100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'depth': [4, 6, 8, 10]
}
mlp_param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'alpha': [0.0001, 0.001, 0.01]
}

# Tune Random Forest
rf_grid_search = GridSearchCV(rf, rf_param_grid, cv=tscv, scoring='roc_auc', n_jobs=-1)
rf_grid_search.fit(X_train_resampled, y_train_resampled)
best_rf_params = rf_grid_search.best_params_
print('Best Random Forest parameters:', best_rf_params)

# Tune Gradient Boosting
gb_grid_search = GridSearchCV(gb, gb_param_grid, cv=tscv, scoring='roc_auc', n_jobs=-1)
gb_grid_search.fit(X_train_resampled, y_train_resampled)
best_gb_params = gb_grid_search.best_params_
print('Best Gradient Boosting parameters:', best_gb_params)

# Tune CatBoost
catboost_grid_search = GridSearchCV(catboost, catboost_param_grid, cv=tscv, scoring='roc_auc', n_jobs=-1)
catboost_grid_search.fit(X_train_resampled, y_train_resampled)
best_catboost_params = catboost_grid_search.best_params_
print('Best CatBoost parameters:', best_catboost_params)

# Tune MLPClassifier
mlp_grid_search = GridSearchCV(mlp, mlp_param_grid, cv=tscv, scoring='roc_auc', n_jobs=-1)
mlp_grid_search.fit(X_train_resampled, y_train_resampled)
best_mlp_params = mlp_grid_search.best_params_
print('Best MLP parameters:', best_mlp_params)

# Create and fit final models with best parameters
final_rf = RandomForestClassifier(**best_rf_params, random_state=42)
final_rf.fit(X_train_resampled, y_train_resampled)

final_gb = GradientBoostingClassifier(**best_gb_params, random_state=42)
final_gb.fit(X_train_resampled, y_train_resampled)

final_catboost = CatBoostClassifier(**best_catboost_params, verbose=0, random_state=42)
final_catboost.fit(X_train_resampled, y_train_resampled)

final_mlp = MLPClassifier(**best_mlp_params, random_state=42)
final_mlp.fit(X_train_resampled, y_train_resampled)

Best Random Forest parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Best Gradient Boosting parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Best CatBoost parameters: {'depth': 4, 'iterations': 100, 'learning_rate': 0.1}
Best MLP parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (50,)}
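
`TimeSeriesSplit` treats row order as time order: every validation fold comes strictly after the rows used for training. The sketch below prints the fold indices for twelve ordered rows (toy data, `n_splits=3` chosen only to keep the output short).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ordered = np.arange(12).reshape(-1, 1)   # 12 rows assumed to be in time order
splitter = TimeSeriesSplit(n_splits=3)

folds = [(train_idx, val_idx) for train_idx, val_idx in splitter.split(X_ordered)]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Because the folds depend on row position, rows added by resampling (which are typically appended after the original rows) would mostly fall into the later folds. One option worth considering is to resample inside each training fold instead, for example with a pipeline from `imblearn.pipeline`, so that validation folds contain only original observations.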


## Ensemble Prediction

Combine predictions from all models by averaging probabilities.

In [7]:
# Predict probabilities on test set for each model
rf_prob = final_rf.predict_proba(X_test_pca)[:, 1]
gb_prob = final_gb.predict_proba(X_test_pca)[:, 1]
catboost_prob = final_catboost.predict_proba(X_test_pca)[:, 1]
mlp_prob = final_mlp.predict_proba(X_test_pca)[:, 1]

# Ensemble prediction by averaging probabilities
y_pred_prob = (rf_prob + gb_prob + catboost_prob + mlp_prob) / 4

# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'rainfall': y_pred_prob
})

# Save to CSV without rounding (create the output directory if it is missing)
import os
os.makedirs('outputs', exist_ok=True)
submission.to_csv('outputs/submission5.csv', index=False)
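
The averaging step is an equal-weight soft vote. A minimal sketch with made-up probabilities shows the same computation in stacked form, plus a weighted variant as one possible extension (the model names and weights are hypothetical).

```python
import numpy as np

# Hypothetical positive-class probabilities from three models for four samples.
model_probs = {
    "model_a": np.array([0.90, 0.20, 0.60, 0.40]),
    "model_b": np.array([0.80, 0.10, 0.70, 0.50]),
    "model_c": np.array([0.70, 0.30, 0.50, 0.30]),
}

stacked = np.vstack(list(model_probs.values()))   # shape: (n_models, n_samples)
ensemble_prob = stacked.mean(axis=0)              # equal-weight soft vote

# Optional weights (made up here) give some models more influence.
weights = np.array([0.5, 0.3, 0.2])
weighted_prob = np.average(stacked, axis=0, weights=weights)
print(ensemble_prob, weighted_prob)
```

Stacking the arrays keeps the code unchanged when a model is added or removed, which is harder to get wrong than a hand-written sum divided by a fixed count.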

## Distribution Shift Check

Compare summary statistics between training and test sets to detect potential distribution shifts.
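
Reading two `describe()` tables side by side is error-prone, so a single number per feature can help. The sketch below, on synthetic stand-in frames, expresses the gap between test and training means in units of the training standard deviation. The 0.5 threshold is an arbitrary assumption, not a standard cut-off.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in frames: 'humidity' is shifted in the test set, 'pressure' is not.
train_demo = pd.DataFrame({
    "pressure": rng.normal(1013, 5, 500),
    "humidity": rng.normal(80, 8, 500),
})
test_demo = pd.DataFrame({
    "pressure": rng.normal(1013, 5, 500),
    "humidity": rng.normal(90, 8, 500),
})

# Shift of the test mean, measured in training standard deviations.
shift = (test_demo.mean() - train_demo.mean()) / train_demo.std()
flagged = shift[shift.abs() > 0.5].index.tolist()   # 0.5 is an arbitrary threshold
print(shift.round(2))
print("Flagged features:", flagged)
```

For features already scaled with training statistics, the training mean is 0 and the standard deviation is 1, so the test-set mean itself plays the role of this shift score.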

In [8]:
# Check for distribution shifts
print('\nTraining set summary statistics:\n', pd.DataFrame(X_train_scaled, columns=X_train_full.columns).describe())
print('\nTest set summary statistics:\n', pd.DataFrame(X_test_scaled, columns=X_train_full.columns).describe())


Training set summary statistics:
            pressure       maxtemp   temparature       mintemp      dewpoint  \
count  2.189000e+03  2.189000e+03  2.189000e+03  2.189000e+03  2.189000e+03   
mean   4.690426e-15  7.530649e-16 -3.765325e-16  2.499397e-16  1.817743e-16   
std    1.000228e+00  1.000228e+00  1.000228e+00  1.000228e+00  1.000228e+00   
min   -2.581958e+00 -2.824597e+00 -3.170213e+00 -3.591922e+00 -3.924668e+00   
25%   -8.842800e-01 -8.965033e-01 -8.913566e-01 -8.838165e-01 -6.911499e-01   
50%   -1.061776e-01  2.532774e-01  2.959466e-01  3.417494e-01  3.299610e-01   
75%    7.426614e-01  8.547012e-01  8.512981e-01  8.359292e-01  8.594259e-01   
max    3.713598e+00  1.703770e+00  1.444950e+00  1.508014e+00  1.180887e+00   

           humidity         cloud      sunshine     windspeed       day_sin  \
count  2.189000e+03  2.189000e+03  2.189000e+03  2.189000e+03  2.189000e+03   
mean   6.995064e-16 -3.083671e-16 -9.088715e-17 -3.635486e-16  3.895163e-17   
std    1.000228e