This notebook implements and tunes an AdaBoost classifier.
Hyperparameter tuning is performed using a grid search approach, systematically evaluating various configurations of model parameters. The performance metrics for each model are calculated and stored in a CSV file, allowing for later extraction and analysis of the best-performing models.

The primary metric for selecting the best-performing model is the F1-score. It was chosen because the dataset is imbalanced: the F1-score is the harmonic mean of precision and recall, so a model cannot score well by predicting only the majority class, and the score reflects how well the minority class is identified.
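To make the choice of metric concrete, here is a minimal sketch on made-up labels (the 90/10 split and the helper `binary_metrics` are illustrative assumptions, not part of this notebook). It compares accuracy and F1 for a hypothetical classifier that always predicts the majority class.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels, positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Illustrative data: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100  # hypothetical model that never predicts the minority class

acc, prec, rec, f1 = binary_metrics(y_true, always_negative)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Accuracy stays high because most samples are negative, while F1 drops to zero because no positive case is ever found.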

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

Given that the dataset contains multiple linearly dependent features, two datasets were prepared for analysis. The first is the original dataset $D$, which has been oversampled to address class imbalance. The second is a PCA-transformed version of the same dataset $D$.

PCA (Principal Component Analysis) reduces dimensionality by projecting the features onto a smaller set of uncorrelated principal components, rather than by dropping individual features. Keeping only the leading components removes the redundancy and multicollinearity among highly correlated features, which may help the model learn from the most informative directions in the data and can mitigate overfitting.
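The effect of PCA on correlated features can be sketched with NumPy alone. The data below is synthetic (not this notebook's dataset), and the SVD-based computation is a stand-in for whatever preprocessing produced the PCA files; it only illustrates that the projected components are uncorrelated and that a near-duplicate feature contributes almost no variance of its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# PCA via SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T          # data expressed in principal-component coordinates
explained = S**2 / np.sum(S**2)     # fraction of total variance per component

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

print("correlation between x1 and x2:", round(corr_before[0, 1], 3))
print("largest off-diagonal correlation after PCA:",
      np.abs(corr_after - np.eye(3)).max())
print("explained variance ratio:", np.round(explained, 4))
```

The last component carries a negligible share of the variance, so dropping it loses almost no information.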

In [3]:
# Load the non-PCA training and test data.
X_train = pd.read_csv('data/X_train_final.csv')
y_train = pd.read_csv('data/y_train_final.csv')
X_test = pd.read_csv('data/X_test_final.csv')
y_test = pd.read_csv('data/y_test_final.csv')

# Drop the first column of each feature file (presumably a saved row index).
X_train = X_train.drop(X_train.columns[0], axis=1)
X_test = X_test.drop(X_test.columns[0], axis=1)

# To run on the PCA-transformed dataset instead, load these files:
# X_train = pd.read_csv('data/X_train_pca_final.csv')
# y_train = pd.read_csv('data/y_train_final.csv')
# X_test = pd.read_csv('data/X_test_pca_final.csv')
# y_test = pd.read_csv('data/y_test_final.csv')

In [1]:
# Hyperparameter grid: tree depth, boosting rounds, split criterion, learning rate.
depths = [1, 2, 3, 4, 5, 6, 7]
estimators = list(range(200, 400, 25))
criteria = ['gini', 'entropy', 'log_loss']
learning_rates = [0.01, 0.1, 1.0, 1.5]
results = []

for depth in depths:
    for n_est in estimators:
        for crit in criteria:
            for lr in learning_rates:
                # Weak learner: a depth-limited decision tree.
                base_tree = DecisionTreeClassifier(max_depth=depth, criterion=crit)
                boosting_mdl = AdaBoostClassifier(base_tree, n_estimators=n_est, random_state=41, learning_rate=lr)
                boosting_mdl.fit(X_train, y_train)
                y_pred = boosting_mdl.predict(X_test)

                # Score this configuration on the test split.
                precision = precision_score(y_test, y_pred)
                recall = recall_score(y_test, y_pred)
                f1 = f1_score(y_test, y_pred)
                conf_matrix = confusion_matrix(y_test, y_pred)
                accuracy = accuracy_score(y_test, y_pred)
                results.append([depth, n_est, crit, lr, precision, recall, f1, accuracy, conf_matrix])

# Store one row per configuration for later analysis.
df_results = pd.DataFrame(results, columns=['Depth', 'Estimators', 'Criterion', 'Learning rate', 'Precision', 'Recall', 'F1-Score', 'Accuracy', 'Confusion Matrix'])
df_results.to_csv('data/with_lr_non_pca', index=False)
print(df_results)

NameError: name 'DecisionTreeClassifier' is not defined
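The nested loops in the grid search enumerate the Cartesian product of the four parameter lists. As a sketch, the same grid can be built with `itertools.product`, which also makes the cost of the search visible before any model is fitted. The list values are copied from the cell above; the names `criteria`, `learning_rates`, and `grid` are chosen here for readability.

```python
from itertools import product

depths = [1, 2, 3, 4, 5, 6, 7]
estimators = list(range(200, 400, 25))      # 200, 225, ..., 375
criteria = ['gini', 'entropy', 'log_loss']
learning_rates = [0.01, 0.1, 1.0, 1.5]

# Every (depth, n_estimators, criterion, learning_rate) combination, in loop order.
grid = list(product(depths, estimators, criteria, learning_rates))

print(len(grid))   # 7 * 8 * 3 * 4 = 672 configurations
print(grid[0])     # (1, 200, 'gini', 0.01)
```

With 672 configurations and up to 375 boosting rounds each, the full search is expensive, which is a reason to store the results in a CSV file rather than rerun it.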

In [4]:
df_results = pd.read_csv('data/model_comparison_results_non_pca_df.csv')
best_f1 = df_results.sort_values(by='F1-Score', ascending=False).iloc[0]
print(f"Model with best F1 (non-PCA dataset): \n{best_f1}")

Model with best F1 (non-PCA dataset): 
Depth                                     5
Estimators                              200
Criterion                              gini
Precision                          0.722222
Recall                             0.565217
F1-Score                           0.634146
Accuracy                            0.90625
Confusion Matrix    [[132   5]\n [ 10  13]]
Name: 84, dtype: object


In [5]:

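# Refit the configuration with the best F1-score from the stored grid-search results.
# The loaded results file has no learning-rate column, so the learning rate is fixed at 1 here.
base_tree = DecisionTreeClassifier(max_depth=best_f1.Depth, criterion=best_f1.Criterion)
boosting_mdl = AdaBoostClassifier(base_tree, n_estimators=best_f1.Estimators, random_state=41, learning_rate=1)
boosting_mdl.fit(X_train, y_train)
y_pred = boosting_mdl.predict(X_test)

# Evaluate the refitted model on the test split.
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# For 0/1 labels the MSE equals the misclassification rate (1 - accuracy).
mse = mean_squared_error(y_test, y_pred)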

print("prec ",precision)
print("rec ",recall)
print("f1 ",f1)
print("acc ",accuracy)
print("mse ",mse)
print(conf_matrix)

  y = column_or_1d(y, warn=True)


prec  0.7222222222222222
rec  0.5652173913043478
f1  0.6341463414634146
acc  0.90625
mse  0.09375
[[132   5]
 [ 10  13]]


The confusion matrix shows the class imbalance in the test set: 137 negative samples against 23 positive ones. Precision is relatively high (13/18 ≈ 0.722), so when the model predicts the positive class it is usually correct. Recall is lower (13/23 ≈ 0.565): the model misses 10 of the 23 true positive cases. The overall accuracy (0.906) and the mean squared error (0.094, which for 0/1 labels is simply 1 − accuracy) look strong, but they are misleading on imbalanced data. A classifier that always predicts the majority class would already reach 137/160 ≈ 0.856 accuracy, so these two metrics mostly capture performance on the majority class and say little about the minority class.
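The reported metrics can be rederived by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and 0/1 labels; the variable names are illustrative.

```python
# Confusion matrix from the cell output: [[TN, FP], [FN, TP]].
conf_matrix = [[132, 5],
               [10, 13]]
(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

precision = tp / (tp + fp)                           # 13 / 18
recall = tp / (tp + fn)                              # 13 / 23
f1 = 2 * precision * recall / (precision + recall)   # 26 / 41
accuracy = (tp + tn) / total                         # 145 / 160
error_rate = (fp + fn) / total                       # equals the MSE for 0/1 labels
majority_baseline = (tn + fp) / total                # accuracy of always predicting class 0

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.5f} error_rate={error_rate:.5f}")
print(f"majority-class baseline accuracy={majority_baseline:.5f}")
```

The gap between the model's accuracy and the majority-class baseline is only about five percentage points, which is why F1 is the more informative summary here.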

In [10]:
# Load the unlabeled evaluation data and drop the 'snow' column.
X = pd.read_csv('data/test_data_fall2024.csv')
X = X.drop(['snow'], axis=1)
display(X)
preds = boosting_mdl.predict(X)
output_path = 'predictions.csv'
# Write all predictions as a single comma-separated row, without index or header.
pd.DataFrame([preds]).to_csv(output_path, index=False, header=False)

print(f"Predictions exported to {output_path}")

Unnamed: 0,hour_of_day,day_of_week,month,holiday,weekday,summertime,temp,dew,humidity,precip,snowdepth,windspeed,cloudcover,visibility
0,14,0,1,0,1,0,-1.7,-1.9,98.86,2.434,2.96,33.0,100.0,3.3
1,14,5,3,0,0,0,14.3,2.2,43.93,0.000,0.00,16.4,44.6,16.0
2,18,3,1,0,1,0,11.1,7.8,80.07,0.000,0.00,7.7,99.2,16.0
3,2,2,1,1,1,0,1.3,-3.2,71.95,0.000,0.00,0.0,94.3,16.0
4,15,0,5,0,1,1,16.1,1.6,37.47,0.000,0.00,33.7,86.8,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,21,4,9,0,1,1,24.1,20.6,81.09,0.000,0.00,9.0,44.6,16.0
396,20,5,1,0,0,0,15.7,14.5,92.63,0.000,0.00,1.0,100.0,13.0
397,2,5,7,0,0,1,22.3,12.9,55.48,0.000,0.00,6.7,87.5,16.0
398,5,1,4,0,1,1,11.6,6.7,72.01,0.000,0.00,6.7,99.6,16.0


Predictions exported to predictions.csv
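A single-row CSV written without a header needs `header=None` when it is read back; otherwise pandas treats the only row as column names and returns an empty frame. The sketch below shows the difference on made-up predictions, using an in-memory buffer instead of `predictions.csv`.

```python
import io

import pandas as pd

preds = [0, 1, 0, 0, 1]  # made-up predictions for illustration

# Same export pattern as above: one row, no index, no header.
buffer = io.StringIO()
pd.DataFrame([preds]).to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# Default read: the first (and only) row is consumed as the header.
inferred = pd.read_csv(io.StringIO(csv_text))
# Explicit read: the row is kept as data.
explicit = pd.read_csv(io.StringIO(csv_text), header=None)

print(inferred.shape)               # (0, 5): no data rows left
print(explicit.shape)               # (1, 5)
print(explicit.iloc[0].tolist())    # [0, 1, 0, 0, 1]
```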


In [11]:
# The file has no header row, so header=None keeps the single row of predictions as data.
preds = pd.read_csv('predictions.csv', header=None)
display(preds)
