# <p style="background-color:#003366; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Customer Conversion Prediction with ML & DL</p>

This notebook walks through a binary classification workflow on a digital marketing dataset, where the goal is to predict whether a customer will convert. It covers classical machine learning models, ensemble and gradient boosting methods, and a Keras-based dense neural network. Every model is trained on the same stratified 80/20 split and scored with the same metrics (accuracy, precision, recall, F1, and ROC-AUC where available) so that the results can be compared directly.

## 1. Library Imports

In [1]:
# Core Libraries
import pandas as pd
import numpy as np
import warnings

# Preprocessing & Splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model Evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Classical ML Models
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC

# Ensemble Models
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier, StackingClassifier

# Advanced Boosting Models
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

from sklearn.neural_network import MLPClassifier

# Deep Learning with Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Hyperparameter Tuning
import optuna

# Configuration
warnings.filterwarnings('ignore')
print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Data Loading & Cleaning

In [2]:
# Load the dataset
df = pd.read_csv('digital_marketing_campaign_dataset.csv')

# Drop columns that do not add predictive value
df.drop(['CustomerID', 'AdvertisingPlatform', 'AdvertisingTool'], axis=1, inplace=True)

# Separate features and target variable
X = df.drop('Conversion', axis=1)
y = df['Conversion']

print("Data loaded and cleaned. Feature and target sets created.")
display(X.head())

Data loaded and cleaned. Feature and target sets created.


| | Age | Gender | Income | CampaignChannel | CampaignType | AdSpend | ClickThroughRate | ConversionRate | WebsiteVisits | PagesPerVisit | TimeOnSite | SocialShares | EmailOpens | EmailClicks | PreviousPurchases | LoyaltyPoints |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Female | 136912 | Social Media | Awareness | 6497.870068 | 0.043919 | 0.088031 | 0 | 2.399017 | 7.396803 | 19 | 6 | 9 | 4 | 688 |
| 1 | 69 | Male | 41760 | Email | Retention | 3898.668606 | 0.155725 | 0.182725 | 42 | 2.917138 | 5.352549 | 5 | 2 | 7 | 2 | 3459 |
| 2 | 46 | Female | 88456 | PPC | Awareness | 1546.429596 | 0.27749 | 0.076423 | 2 | 8.223619 | 13.794901 | 0 | 11 | 2 | 8 | 2337 |
| 3 | 32 | Female | 44085 | PPC | Conversion | 539.525936 | 0.137611 | 0.088004 | 47 | 4.540939 | 14.688363 | 89 | 2 | 2 | 0 | 2463 |
| 4 | 60 | Female | 83964 | PPC | Conversion | 1678.043573 | 0.252851 | 0.10994 | 0 | 2.046847 | 13.99337 | 6 | 6 | 6 | 8 | 4345 |


## 3. Train/Test Split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (6400, 16)
Test set shape: (1600, 16)
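The `stratify=y` argument is what keeps the conversion rate the same in both splits. The sketch below illustrates this on synthetic labels; the 88% positive rate, the three random feature columns, and all `*_demo` names are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 88% positives, 3 random features
rng = np.random.default_rng(42)
y_demo = (rng.random(8000) < 0.88).astype(int)
X_demo = rng.normal(size=(8000, 3))

# Same split settings as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the positive rate is nearly identical in every subset
print(f"overall: {y_demo.mean():.4f}")
print(f"train:   {y_tr.mean():.4f}")
print(f"test:    {y_te.mean():.4f}")
```

Without `stratify`, the two rates could drift apart by chance, which matters most when one class is rare.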


## 4. Preprocessing Pipeline

In [4]:
# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ])

# Separate preprocessor for MultinomialNB and BernoulliNB: MinMaxScaler keeps numeric features non-negative, which MultinomialNB requires
from sklearn.preprocessing import MinMaxScaler
preprocessor_nb = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ])

print("Preprocessing pipelines created.")

Preprocessing pipelines created.


## 5. Model Training & Evaluation

In [5]:
print("--- 1. Logistic Regression ---")
pipeline_lr = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression(random_state=42))])
pipeline_lr.fit(X_train, y_train)
y_pred = pipeline_lr.predict(X_test)
y_proba = pipeline_lr.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")

--- 1. Logistic Regression ---
Accuracy: 0.8912
Precision: 0.8946
Recall: 0.9929
F1 Score: 0.9412
ROC-AUC: 0.7850
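The same five metric lines are repeated in every model cell that follows. One way to cut the repetition would be a small helper such as the one sketched below. The name `report_metrics` and the synthetic data are assumptions for illustration; the notebook itself does not define this function.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def report_metrics(y_true, y_pred, y_score=None):
    """Print and return the notebook's five metrics (hypothetical helper)."""
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
        # ROC-AUC needs continuous scores, so it is skipped when none are given
        "ROC-AUC": roc_auc_score(y_true, y_score) if y_score is not None else None,
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}" if value is not None else f"{name}: N/A")
    return metrics


# Synthetic stand-in data (assumption): not the marketing dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1, class_sep=2.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = LogisticRegression(random_state=42).fit(X_tr, y_tr)
demo_metrics = report_metrics(y_te, demo_model.predict(X_te), demo_model.predict_proba(X_te)[:, 1])
```

Passing the scores separately keeps the helper usable for models that expose no `predict_proba`.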


In [6]:
print("\n--- 2. Ridge Classifier ---")
pipeline_ridge = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RidgeClassifier(random_state=42))])
pipeline_ridge.fit(X_train, y_train)
y_pred = pipeline_ridge.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
# RidgeClassifier has no predict_proba; ROC-AUC could be computed from decision_function scores instead, but is not reported here
print("ROC-AUC: N/A")


--- 2. Ridge Classifier ---
Accuracy: 0.8762
Precision: 0.8762
Recall: 1.0000
F1 Score: 0.9340
ROC-AUC: N/A
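ROC-AUC only needs a score that ranks positives above negatives, so it does not strictly require probabilities. The sketch below shows how `decision_function` can serve as that score for a `RidgeClassifier`. It runs on synthetic data (an assumption), so its number is unrelated to the marketing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the marketing dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1, class_sep=2.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

ridge_demo = RidgeClassifier(random_state=42).fit(X_tr, y_tr)

# decision_function returns signed distances to the decision boundary;
# larger values mean the model leans more strongly toward the positive class
ridge_scores = ridge_demo.decision_function(X_te)
ridge_auc = roc_auc_score(y_te, ridge_scores)
print(f"ROC-AUC from decision_function: {ridge_auc:.4f}")
```

In the notebook this would presumably correspond to calling `pipeline_ridge.decision_function(X_test)`, since a scikit-learn `Pipeline` forwards `decision_function` to its final step.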


In [7]:
print("\n--- 3. SGDClassifier ---")
pipeline_sgd = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', SGDClassifier(loss='log_loss', random_state=42))])
pipeline_sgd.fit(X_train, y_train)
y_pred = pipeline_sgd.predict(X_test)
y_proba = pipeline_sgd.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 3. SGDClassifier ---
Accuracy: 0.8862
Precision: 0.9050
Recall: 0.9722
F1 Score: 0.9374
ROC-AUC: 0.7703


In [8]:
print("\n--- 4. Linear Discriminant Analysis (LDA) ---")
pipeline_lda = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LinearDiscriminantAnalysis())])
pipeline_lda.fit(X_train, y_train)
y_pred = pipeline_lda.predict(X_test)
y_proba = pipeline_lda.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 4. Linear Discriminant Analysis (LDA) ---
Accuracy: 0.8888
Precision: 0.8913
Recall: 0.9943
F1 Score: 0.9400
ROC-AUC: 0.7833


In [9]:
print("\n--- 5. Quadratic Discriminant Analysis (QDA) ---")
pipeline_qda = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', QuadraticDiscriminantAnalysis())])
pipeline_qda.fit(X_train, y_train)
y_pred = pipeline_qda.predict(X_test)
y_proba = pipeline_qda.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 5. Quadratic Discriminant Analysis (QDA) ---
Accuracy: 0.7963
Precision: 0.8865
Recall: 0.8802
F1 Score: 0.8833
ROC-AUC: 0.5085


In [10]:
print("\n--- 6.1. Gaussian Naive Bayes ---")
pipeline_gnb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', GaussianNB())])
pipeline_gnb.fit(X_train, y_train)
y_pred = pipeline_gnb.predict(X_test)
y_proba = pipeline_gnb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 6.1. Gaussian Naive Bayes ---
Accuracy: 0.8919
Precision: 0.8927
Recall: 0.9964
F1 Score: 0.9417
ROC-AUC: 0.7923


In [11]:
print("\n--- 6.2. Multinomial Naive Bayes ---")
pipeline_mnb = Pipeline(steps=[('preprocessor', preprocessor_nb), ('classifier', MultinomialNB())])
pipeline_mnb.fit(X_train, y_train)
y_pred = pipeline_mnb.predict(X_test)
y_proba = pipeline_mnb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 6.2. Multinomial Naive Bayes ---
Accuracy: 0.8762
Precision: 0.8762
Recall: 1.0000
F1 Score: 0.9340
ROC-AUC: 0.7112
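A recall of 1.0000 with precision equal to accuracy is the signature of a model that predicts "convert" for every test row, which is also what the Ridge classifier did earlier. The sketch below reproduces that pattern with a majority-class baseline. The class counts (1,402 converters out of 1,600 test rows) are an assumption inferred from the reported precision, not read from the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Assumed test-set class counts, inferred from precision 0.8762 at recall 1.0000
y_demo = np.array([1] * 1402 + [0] * 198)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predict the most frequent class, i.e. "convert"
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, y_base)
baseline_prec = precision_score(y_demo, y_base)
baseline_rec = recall_score(y_demo, y_base)
baseline_f1 = f1_score(y_demo, y_base)
print(f"Accuracy: {baseline_acc:.4f}")
print(f"Precision: {baseline_prec:.4f}")
print(f"Recall: {baseline_rec:.4f}")
print(f"F1 Score: {baseline_f1:.4f}")
```

Under these assumed counts, any model has to beat this baseline on accuracy and F1 before its result says much, which is why ROC-AUC is the more informative column in this notebook.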


In [12]:
print("\n--- 6.3. Bernoulli Naive Bayes ---")
pipeline_bnb = Pipeline(steps=[('preprocessor', preprocessor_nb), ('classifier', BernoulliNB())])
pipeline_bnb.fit(X_train, y_train)
y_pred = pipeline_bnb.predict(X_test)
y_proba = pipeline_bnb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 6.3. Bernoulli Naive Bayes ---
Accuracy: 0.8775
Precision: 0.8773
Recall: 1.0000
F1 Score: 0.9347
ROC-AUC: 0.6347


In [13]:
print("\n--- 7. Decision Tree Classifier ---")
pipeline_dt = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', DecisionTreeClassifier(random_state=42))])
pipeline_dt.fit(X_train, y_train)
y_pred = pipeline_dt.predict(X_test)
y_proba = pipeline_dt.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 7. Decision Tree Classifier ---
Accuracy: 0.8337
Precision: 0.9086
Recall: 0.9009
F1 Score: 0.9047
ROC-AUC: 0.6297


In [14]:
print("\n--- 8. Random Forest Classifier ---")
pipeline_rf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))])
pipeline_rf.fit(X_train, y_train)
y_pred = pipeline_rf.predict(X_test)
y_proba = pipeline_rf.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 8. Random Forest Classifier ---
Accuracy: 0.8862
Precision: 0.8875
Recall: 0.9964
F1 Score: 0.9388
ROC-AUC: 0.7933


In [15]:
print("\n--- 9. Extra Trees Classifier ---")
pipeline_et = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', ExtraTreesClassifier(random_state=42, n_jobs=-1))])
pipeline_et.fit(X_train, y_train)
y_pred = pipeline_et.predict(X_test)
y_proba = pipeline_et.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 9. Extra Trees Classifier ---
Accuracy: 0.8788
Precision: 0.8784
Recall: 1.0000
F1 Score: 0.9353
ROC-AUC: 0.8067


In [16]:
print("\n--- 10. Bagging Classifier ---")
pipeline_bag = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', BaggingClassifier(random_state=42, n_jobs=-1))])
pipeline_bag.fit(X_train, y_train)
y_pred = pipeline_bag.predict(X_test)
y_proba = pipeline_bag.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 10. Bagging Classifier ---
Accuracy: 0.8925
Precision: 0.9095
Recall: 0.9743
F1 Score: 0.9408
ROC-AUC: 0.7694


In [17]:
print("\n--- 11. AdaBoost Classifier ---")
pipeline_ada = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', AdaBoostClassifier(random_state=42))])
pipeline_ada.fit(X_train, y_train)
y_pred = pipeline_ada.predict(X_test)
y_proba = pipeline_ada.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 11. AdaBoost Classifier ---
Accuracy: 0.8994
Precision: 0.9001
Recall: 0.9957
F1 Score: 0.9455
ROC-AUC: 0.8238


In [18]:
print("\n--- 12. Gradient Boosting Classifier ---")
pipeline_gb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', GradientBoostingClassifier(random_state=42))])
pipeline_gb.fit(X_train, y_train)
y_pred = pipeline_gb.predict(X_test)
y_proba = pipeline_gb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 12. Gradient Boosting Classifier ---
Accuracy: 0.9100
Precision: 0.9127
Recall: 0.9922
F1 Score: 0.9508
ROC-AUC: 0.8167


In [19]:
print("\n--- 13. K-Nearest Neighbors (KNN) ---")
pipeline_knn = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', KNeighborsClassifier(n_jobs=-1))])
pipeline_knn.fit(X_train, y_train)
y_pred = pipeline_knn.predict(X_test)
y_proba = pipeline_knn.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 13. K-Nearest Neighbors (KNN) ---
Accuracy: 0.8806
Precision: 0.8854
Recall: 0.9922
F1 Score: 0.9358
ROC-AUC: 0.6465


In [20]:
print("\n--- 14.1. LinearSVC ---")
pipeline_lsvc = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LinearSVC(random_state=42))])
pipeline_lsvc.fit(X_train, y_train)
y_pred = pipeline_lsvc.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
# LinearSVC has no predict_proba; ROC-AUC could be computed from decision_function scores instead, but is not reported here
print("ROC-AUC: N/A")


--- 14.1. LinearSVC ---
Accuracy: 0.8812
Precision: 0.8811
Recall: 0.9993
F1 Score: 0.9365
ROC-AUC: N/A
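If probabilities are wanted from `LinearSVC`, one option is to wrap it in scikit-learn's `CalibratedClassifierCV`, which fits a mapping from the SVM margins to probabilities. The sketch below is a minimal illustration on synthetic data (an assumption). It is not what the notebook ran, and its scores say nothing about the marketing dataset.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): not the marketing dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1, class_sep=2.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Sigmoid (Platt) calibration, fitted with 5-fold cross-validation on the training set
calibrated_svc = CalibratedClassifierCV(LinearSVC(random_state=42), method="sigmoid", cv=5)
calibrated_svc.fit(X_tr, y_tr)

svc_proba = calibrated_svc.predict_proba(X_te)[:, 1]
svc_auc = roc_auc_score(y_te, svc_proba)
print(f"ROC-AUC from calibrated LinearSVC: {svc_auc:.4f}")
```

Calibration costs extra fits (one per fold), so for ROC-AUC alone the raw `decision_function` scores are the cheaper route.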


In [21]:
print("\n--- 14.2. SVC with RBF Kernel ---")
pipeline_svc_rbf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', SVC(kernel='rbf', probability=True, random_state=42))])
pipeline_svc_rbf.fit(X_train, y_train)
y_pred = pipeline_svc_rbf.predict(X_test)
y_proba = pipeline_svc_rbf.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 14.2. SVC with RBF Kernel ---
Accuracy: 0.8906
Precision: 0.8925
Recall: 0.9950
F1 Score: 0.9410
ROC-AUC: 0.7878


In [22]:
print("\n--- 15. MLPClassifier (Neural Network) ---")
pipeline_mlp = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', MLPClassifier(random_state=42, max_iter=500))])
pipeline_mlp.fit(X_train, y_train)
y_pred = pipeline_mlp.predict(X_test)
y_proba = pipeline_mlp.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 15. MLPClassifier (Neural Network) ---
Accuracy: 0.8606
Precision: 0.9102
Recall: 0.9330
F1 Score: 0.9215
ROC-AUC: 0.7517


In [23]:
print("\n--- 16. XGBoost Classifier ---")
pipeline_xgb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'))])
pipeline_xgb.fit(X_train, y_train)
y_pred = pipeline_xgb.predict(X_test)
y_proba = pipeline_xgb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 16. XGBoost Classifier ---
Accuracy: 0.9106
Precision: 0.9177
Recall: 0.9864
F1 Score: 0.9508
ROC-AUC: 0.7991


In [24]:
print("\n--- 17. LightGBM Classifier ---")
pipeline_lgb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', lgb.LGBMClassifier(random_state=42, n_jobs=-1))])
pipeline_lgb.fit(X_train, y_train)
y_pred = pipeline_lgb.predict(X_test)
y_proba = pipeline_lgb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 17. LightGBM Classifier ---
[LightGBM] [Info] Number of positive: 5610, number of negative: 790
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001317 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2055
[LightGBM] [Info] Number of data points in the train set: 6400, number of used features: 24
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.876563 -> initscore=1.960273
[LightGBM] [Info] Start training from score 1.960273
Accuracy: 0.9144
Precision: 0.9192
Recall: 0.9893
F1 Score: 0.9529
ROC-AUC: 0.8158


In [25]:
print("\n--- 18. CatBoost Classifier ---")
pipeline_cat = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', cb.CatBoostClassifier(random_state=42, verbose=0))])
pipeline_cat.fit(X_train, y_train)
y_pred = pipeline_cat.predict(X_test)
y_proba = pipeline_cat.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 18. CatBoost Classifier ---
Accuracy: 0.9237
Precision: 0.9261
Recall: 0.9922
F1 Score: 0.9580
ROC-AUC: 0.8179


In [26]:
print("\n--- 19. Keras Deep Learning Dense Model ---")
# Preprocess data specifically for the Keras model
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Define the model architecture
keras_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_processed.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

# Compile the model
keras_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the model
keras_model.fit(X_train_processed, y_train, epochs=100, batch_size=32, 
                validation_split=0.2, callbacks=[early_stopping], verbose=0)

# Evaluate the model
y_proba = keras_model.predict(X_test_processed).ravel()
y_pred = (y_proba > 0.5).astype(int)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")


--- 19. Keras Deep Learning Dense Model ---
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Accuracy: 0.8944
Precision: 0.9080
Recall: 0.9786
F1 Score: 0.9420
ROC-AUC: 0.7679


## 6. Ensembling

In [27]:
print("\n--- 6.1. Voting Classifier ---")
# Select a few diverse, well-performing models for the ensemble
estimators = [
    ('lr', LogisticRegression(random_state=42)),
    ('rf', RandomForestClassifier(random_state=42, n_jobs=-1)),
    ('xgb', xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'))
]

# Hard voting: each model casts one class vote and the majority wins
pipeline_vote_hard = Pipeline(steps=[('preprocessor', preprocessor),
                                     ('classifier', VotingClassifier(estimators=estimators, voting='hard', n_jobs=-1))])
pipeline_vote_hard.fit(X_train, y_train)
y_pred_hard = pipeline_vote_hard.predict(X_test)
print("Hard Voting Results:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred_hard):.4f}")
print(f"  F1 Score: {f1_score(y_test, y_pred_hard):.4f}")

# Soft voting: predicted probabilities are averaged before the class is chosen
pipeline_vote_soft = Pipeline(steps=[('preprocessor', preprocessor),
                                     ('classifier', VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1))])
pipeline_vote_soft.fit(X_train, y_train)
y_pred_soft = pipeline_vote_soft.predict(X_test)
y_proba_soft = pipeline_vote_soft.predict_proba(X_test)[:, 1]
print("\nSoft Voting Results:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")
print(f"  F1 Score: {f1_score(y_test, y_pred_soft):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_soft):.4f}")


--- 6.1. Voting Classifier ---
Hard Voting Results:
  Accuracy: 0.8931
  F1 Score: 0.9422

Soft Voting Results:
  Accuracy: 0.9019
  F1 Score: 0.9467
  ROC-AUC: 0.8070
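Hard and soft voting can disagree on the same inputs, which is one reason their scores differ. The worked example below uses made-up probabilities from three hypothetical models for four customers. It mimics the two voting rules with plain NumPy instead of calling `VotingClassifier`.

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per customer, one column per model
proba = np.array([
    [0.45, 0.45, 0.90],
    [0.60, 0.55, 0.10],
    [0.80, 0.70, 0.90],
    [0.20, 0.30, 0.40],
])

# Hard voting: each model votes 0/1 at the 0.5 threshold, then the majority wins
votes = (proba > 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold once
soft_proba = proba.mean(axis=1)
soft_pred = (soft_proba > 0.5).astype(int)

print(hard_pred)  # [0 1 1 0]
print(soft_pred)  # [1 0 1 0]
```

The first customer is rejected by two lukewarm models under hard voting but accepted under soft voting because the third model is very confident. The second customer shows the opposite case.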


In [28]:
print("\n--- 6.2. Stacking Classifier ---")
# Base learners whose cross-validated predictions feed the logistic regression meta-learner
stacking_estimators = [
    ('rf', RandomForestClassifier(random_state=42, n_jobs=-1)),
    ('xgb', xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')),
    ('lgbm', lgb.LGBMClassifier(random_state=42, n_jobs=-1))
]

pipeline_stack = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', StackingClassifier(estimators=stacking_estimators,
                                                                   final_estimator=LogisticRegression(),
                                                                   cv=5, n_jobs=-1))])
pipeline_stack.fit(X_train, y_train)
y_pred_stack = pipeline_stack.predict(X_test)
y_proba_stack = pipeline_stack.predict_proba(X_test)[:, 1]
print("Stacking Results:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred_stack):.4f}")
print(f"  F1 Score: {f1_score(y_test, y_pred_stack):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_proba_stack):.4f}")


--- 6.2. Stacking Classifier ---
Stacking Results:
  Accuracy: 0.9225
  F1 Score: 0.9570
  ROC-AUC: 0.8099


## 7. Advanced Hyperparameter Tuning with Optuna

In [29]:
print("\n--- 7.1. Optuna for XGBoost ---")
from sklearn.model_selection import cross_val_score

def objective_xgb(trial):
    # Search space for the XGBoost hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'random_state': 42,
        'use_label_encoder': False,
        'eval_metric': 'logloss'
    }

    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', xgb.XGBClassifier(**params))])

    # Mean 3-fold cross-validated accuracy, computed on the training set only
    score = cross_val_score(pipeline, X_train, y_train, n_jobs=-1, cv=3, scoring='accuracy').mean()
    return score

study_xgb = optuna.create_study(direction='maximize')
study_xgb.optimize(objective_xgb, n_trials=50, timeout=600)  # Stop after 50 trials or 10 minutes, whichever comes first

print(f"Best XGBoost Trial Score: {study_xgb.best_value:.4f}")
print(f"Best XGBoost Params: {study_xgb.best_params}")

# Retrain on the full training set with the best params, then evaluate on the test set
best_params_xgb = study_xgb.best_params
pipeline_xgb_opt = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss', **best_params_xgb))])
pipeline_xgb_opt.fit(X_train, y_train)
y_pred = pipeline_xgb_opt.predict(X_test)
y_proba = pipeline_xgb_opt.predict_proba(X_test)[:, 1]
print("\nOptimized XGBoost Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")

[I 2025-09-25 21:24:43,171] A new study created in memory with name: no-name-a499ce64-7778-4e9c-a2e3-28e8f38bd442



--- 7.1. Optuna for XGBoost ---


[I 2025-09-25 21:24:45,013] Trial 0 finished with value: 0.9003126660049537 and parameters: {'n_estimators': 267, 'max_depth': 7, 'learning_rate': 0.011017191423363734, 'subsample': 0.9521524621035461, 'colsample_bytree': 0.7929892535880235, 'gamma': 1.7585997485431781}. Best is trial 0 with value: 0.9003126660049537.
[I 2025-09-25 21:24:46,166] Trial 1 finished with value: 0.9110934039160582 and parameters: {'n_estimators': 650, 'max_depth': 8, 'learning_rate': 0.059970478428753055, 'subsample': 0.8992087146559469, 'colsample_bytree': 0.8763953828261434, 'gamma': 1.4878821255578782}. Best is trial 1 with value: 0.9110934039160582.
[I 2025-09-25 21:24:48,850] Trial 2 finished with value: 0.908124849053119 and parameters: {'n_estimators': 591, 'max_depth': 7, 'learning_rate': 0.03414655306543619, 'subsample': 0.6245515163542303, 'colsample_bytree': 0.6295375263093302, 'gamma': 0.09051270115771282}. Best is trial 1 with value: 0.9110934039160582.
[I 2025-09-25 21:24:49,975] Trial 3 finis

Best XGBoost Trial Score: 0.9223
Best XGBoost Params: {'n_estimators': 361, 'max_depth': 3, 'learning_rate': 0.04473318977161385, 'subsample': 0.7261920634301087, 'colsample_bytree': 0.9936332715982877, 'gamma': 2.5628952896392256}

Optimized XGBoost Performance:
Accuracy: 0.9250
F1 Score: 0.9587
ROC-AUC: 0.8238
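The Optuna objective above maximizes mean 3-fold accuracy. Because most customers convert, accuracy has a high floor, so a ranking metric such as ROC-AUC is a possible alternative tuning target. The sketch below only shows how the `scoring` argument of `cross_val_score` switches between the two. The imbalanced synthetic data and the logistic regression stand-in are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (assumption): roughly 88% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.12, 0.88], random_state=42
)
demo_clf = LogisticRegression(max_iter=1000, random_state=42)

# With an integer cv and a classifier, cross_val_score uses stratified folds
acc_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=3, scoring="accuracy")
auc_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=3, scoring="roc_auc")

print(f"3-fold accuracy: {acc_scores.mean():.4f}")
print(f"3-fold ROC-AUC:  {auc_scores.mean():.4f}")
```

Swapping `scoring='accuracy'` for `scoring='roc_auc'` inside `objective_xgb` would be the equivalent change in the notebook. Whether it helps on this dataset would need to be checked by rerunning the study.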


In [30]:
print("\n--- 7.2. Optuna for Keras NN ---")

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

def create_keras_model(trial):
    # Build a dense network whose depth, width, dropout, and learning rate come from the trial
    n_layers = trial.suggest_int('n_layers', 1, 3)
    model = Sequential()
    model.add(Dense(trial.suggest_int('units_0', 32, 256), activation='relu', input_shape=(X_train_processed.shape[1],)))
    model.add(Dropout(trial.suggest_float('dropout_0', 0.2, 0.5)))

    for i in range(1, n_layers):
        model.add(Dense(trial.suggest_int(f'units_{i}', 16, 128), activation='relu'))
        model.add(Dropout(trial.suggest_float(f'dropout_{i}', 0.2, 0.5)))

    model.add(Dense(1, activation='sigmoid'))

    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

def objective_keras(trial):
    model = create_keras_model(trial)
    early_stopping = EarlyStopping(monitor='val_accuracy', patience=10)
    history = model.fit(X_train_processed, y_train, epochs=100, batch_size=trial.suggest_int('batch_size', 32, 128),
                        validation_split=0.2, callbacks=[early_stopping], verbose=0)
    # Note: this is the validation accuracy of the last epoch run, not of the best epoch
    return history.history['val_accuracy'][-1]

study_keras = optuna.create_study(direction='maximize')
study_keras.optimize(objective_keras, n_trials=30, timeout=600)  # Stop after 30 trials or 10 minutes, whichever comes first

print(f"Best Keras Trial Score: {study_keras.best_value:.4f}")
print(f"Best Keras Params: {study_keras.best_params}")

# Retrain and evaluate with best params
# The frozen best trial replays its stored values through the same suggest_* calls
best_keras_model = create_keras_model(study_keras.best_trial)
# Define the callback explicitly instead of relying on the `early_stopping` variable left over from the earlier Keras cell
final_early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
best_keras_model.fit(X_train_processed, y_train, epochs=100, batch_size=study_keras.best_params['batch_size'],
                     validation_split=0.2, callbacks=[final_early_stopping], verbose=0)
y_proba = best_keras_model.predict(X_test_processed).ravel()
y_pred = (y_proba > 0.5).astype(int)

print("\nOptimized Keras Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")

[I 2025-09-25 21:25:38,692] A new study created in memory with name: no-name-f36483b2-fe7b-482a-9047-d7ea88d55c35



--- 7.2. Optuna for Keras NN ---


[I 2025-09-25 21:25:57,366] Trial 0 finished with value: 0.9117187261581421 and parameters: {'n_layers': 2, 'units_0': 107, 'dropout_0': 0.2965391203351117, 'units_1': 71, 'dropout_1': 0.40652073619887014, 'learning_rate': 0.00043247438531843164, 'batch_size': 66}. Best is trial 0 with value: 0.9117187261581421.
[I 2025-09-25 21:26:09,445] Trial 1 finished with value: 0.91796875 and parameters: {'n_layers': 1, 'units_0': 129, 'dropout_0': 0.26482651511882055, 'learning_rate': 0.004605986592202893, 'batch_size': 124}. Best is trial 1 with value: 0.91796875.
[I 2025-09-25 21:26:32,237] Trial 2 finished with value: 0.9140625 and parameters: {'n_layers': 2, 'units_0': 248, 'dropout_0': 0.2747948899248066, 'units_1': 123, 'dropout_1': 0.46991611129305066, 'learning_rate': 0.00012233143258915207, 'batch_size': 68}. Best is trial 1 with value: 0.91796875.
[I 2025-09-25 21:26:39,828] Trial 3 finished with value: 0.910937488079071 and parameters: {'n_layers': 3, 'units_0': 161, 'dropout_0': 0.4

Best Keras Trial Score: 0.9180
Best Keras Params: {'n_layers': 1, 'units_0': 129, 'dropout_0': 0.26482651511882055, 'learning_rate': 0.004605986592202893, 'batch_size': 124}
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step

Optimized Keras Performance:
Accuracy: 0.8906
F1 Score: 0.9406
ROC-AUC: 0.7680
