**Objective:**
The goal of this midterm report is to demonstrate your ability to analyze a dataset by explaining its source, relevance, and key characteristics using exploratory data analysis (EDA) techniques, including statistical summaries and visualizations.

## **Dataset Selection and Explanation**
- Clearly specify the dataset you have chosen.
- Provide details on its source (e.g., Kaggle, UCI Machine Learning Repository, government open data, etc.).
- Describe what the dataset is about (e.g., what kind of information it contains).
- Explain why you selected this dataset and why it is interesting or relevant for analysis.

This dataset describes the direct marketing campaigns of a financial institution in Portugal. The campaigns were run through phone calls made directly to customers. The same customer often had to be contacted more than once to determine whether they would subscribe to the product (a bank term deposit).

1. **age** - customer's age  
2. **job** - customer's occupation  
3. **marital** - customer's marital status  
4. **education** - customer's education level  
5. **default** - whether the customer has credit in default  
6. **housing** - whether the customer has a housing loan  
7. **loan** - whether the customer has a personal loan  
8. **contact** - communication channel used to contact the customer  
9. **month** - month of the last contact  
10. **day_of_week** - day of the week of the last contact  
11. **duration** - duration of the last contact (seconds)  
12. **campaign** - number of contacts made to this customer during this campaign  
13. **pdays** - number of days since the customer was last contacted in a previous campaign  
14. **previous** - number of contacts made to this customer before this campaign  
15. **poutcome** - outcome of the previous marketing campaign  
16. **emp.var.rate** - employment variation rate  
17. **cons.price.idx** - consumer price index  
18. **cons.conf.idx** - consumer confidence index  
19. **euribor3m** - 3-month Euribor interest rate  
20. **nr.employed** - number of employees in the economy  
21. **y** - whether the customer subscribed to a term deposit (the campaign target)
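
The column list above can double as a quick schema check before any preprocessing. The sketch below is only an illustration: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, the header spellings are assumed to match the UCI `bank-additional-full.csv` file, and the three-column frame is made-up data standing in for the real file.

```python
import pandas as pd

# Header names assumed to match the CSV; adjust if your copy of the file differs.
EXPECTED_COLUMNS = [
    "age", "job", "marital", "education", "default", "housing", "loan",
    "contact", "month", "day_of_week", "duration", "campaign", "pdays",
    "previous", "poutcome", "emp.var.rate", "cons.price.idx",
    "cons.conf.idx", "euribor3m", "nr.employed", "y",
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in list order."""
    return [name for name in expected if name not in df.columns]

# Hypothetical stand-in for the loaded data: only three of the columns exist.
demo = pd.DataFrame({"age": [30], "job": ["admin."], "y": ["no"]})
print(missing_columns(demo))
```

Running the check right after `pd.read_csv` and before `rename` would surface a misspelled key early, since `DataFrame.rename` silently ignores keys that do not match any column.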

**Business requirements** <br>
- Improve the campaign to raise its success rate (customers subscribing to the product), e.g. a calling plan that makes customers more likely to agree.
- Reduce cost and increase campaign efficiency: spend less on calls while keeping the number of subscriptions high.
- Build a model that predicts whether a new customer will subscribe.
- Customer segmentation

In [None]:
import pandas as pd
import numpy as np

## **Import Bank Marketing Data**

In [None]:
bank = pd.read_csv("bank-additional-full.csv", sep=';')
bank

## **Data Preparation**

In [None]:
bank = bank.rename(columns={
    "emp.var.rate": "emp_variation_rate",
    "cons.price.idx": "cons_price",
    # "cons.conf.idx" keeps its original name; SVC_model refers to it as such
    "euribor3m": "euribor_3m",
    "nr.employed": "num_employees"
})


**Unknown Values**

In [None]:
# Count the literal "unknown" placeholder per column and express it as a share of all rows
unknown_values = (bank == "unknown").sum()
unknown_df = pd.DataFrame({"Count of Unknown": unknown_values,
                           "Percent": unknown_values / len(bank) * 100})

unknown_df = unknown_df[unknown_df["Count of Unknown"] > 0]
unknown_df

### **Replace Unknown Values with Mode**

In [None]:
def fill_unknowns_with_mode(df):
    df = df.copy()
    for col in ["job", "marital", "education", "default", "housing", "loan"]:
        mode_val = df[col].mode()[0]
        df[col] = df[col].replace("unknown", mode_val)
    return df

In [None]:
bank_filledMode = fill_unknowns_with_mode(bank)

### **Replace Unknown Values with ML**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def fill_unknowns_with_ml(df: pd.DataFrame, fill_order: list):
    df_filled = df.copy()
    label_encoders = {}

    for target in fill_order:
        print(f"Replace Unknown values in column: {target}")
        known = df_filled[df_filled[target] != 'unknown'].copy()
        unknown = df_filled[df_filled[target] == 'unknown'].copy()

        if unknown.empty:
            continue

        # One-hot encoding
        full_encoded = pd.get_dummies(df_filled.drop(columns=[target]), drop_first=True)

        # Split encoded known/unknown rows
        X = full_encoded.loc[known.index]
        X_pred = full_encoded.loc[unknown.index]

        # Encode target (y)
        le = LabelEncoder()
        y = le.fit_transform(known[target])
        label_encoders[target] = le

        # Train and predict
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        y_pred = model.predict(X_pred)
        y_pred_labels = le.inverse_transform(y_pred)

        df_filled.loc[unknown.index, target] = y_pred_labels
    return df_filled


In [None]:
fill_order = ["marital", "job", "housing", "loan", "education", "default"]
bank_filledML = fill_unknowns_with_ml(bank, fill_order)

### **Replace Unknown Values with MICE**

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [None]:
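def mice_impute_categorical(data):
    # Single-pass chained imputation: each categorical column with missing
    # values is predicted from the remaining columns by a random forest.
    df = data.copy()
    cat_cols = df.select_dtypes(include=['object', 'category']).columns

    # Treat the "unknown" placeholder as missing; without this step
    # isnull() never fires and the data is returned unchanged.
    df[cat_cols] = df[cat_cols].replace("unknown", np.nan)

    # Encode first (OrdinalEncoder passes NaN through)
    encoder = OrdinalEncoder()
    df[cat_cols] = encoder.fit_transform(df[cat_cols])

    for col in cat_cols:
        if df[col].isnull().sum() > 0:
            # Prepare training and test set
            not_null = df[~df[col].isnull()]
            is_null = df[df[col].isnull()]

            # -1 marks values in other columns that are still missing
            X_train = not_null.drop([col], axis=1).fillna(-1)
            y_train = not_null[col]
            X_pred = is_null.drop([col], axis=1).fillna(-1)

            clf = RandomForestClassifier(n_estimators=100, random_state=42)
            clf.fit(X_train, y_train)

            preds = clf.predict(X_pred)

            df.loc[is_null.index, col] = preds

    # Decode back to original
    df[cat_cols] = encoder.inverse_transform(df[cat_cols])

    return df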

In [None]:
bank_filledMICE = mice_impute_categorical(bank)

## **Train Model**

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split

In [None]:
def encode_data(df):
    num_cols = df.select_dtypes(include=['int64', 'float64']).columns
    cat_cols = df.select_dtypes(include='object').columns[:-1]  

    encoder = OneHotEncoder(handle_unknown='ignore')
    encoded_features = encoder.fit_transform(df[cat_cols]).toarray()

    encoded_df = pd.DataFrame(encoded_features, 
                             columns=encoder.get_feature_names_out(cat_cols),
                             index=df.index)

    encoded_df = pd.concat([encoded_df, df[num_cols]], axis=1)

    le = LabelEncoder()
    encoded_df['y'] = le.fit_transform(df['y'])

    return encoded_df


In [None]:
# 1. Unknown values left as-is
bank_encoded_Unknown = encode_data(bank)
y_uk = bank_encoded_Unknown["y"]
X_uk = bank_encoded_Unknown.drop(columns = ["y"])

# 2. Replace Unknown value with ML
bank_encoded_ML = encode_data(bank_filledML)
y_ml = bank_encoded_ML["y"]
X_ml = bank_encoded_ML.drop(columns = ["y"])

# 3. Replace Unknown value with Mode
bank_encoded_Mode = encode_data(bank_filledMode)
y_mode = bank_encoded_Mode["y"]
X_mode = bank_encoded_Mode.drop(columns = ["y"])

# 4. Replace Unknown value with MICE
bank_encoded_MICE = encode_data(bank_filledMICE)
y_mice = bank_encoded_MICE["y"]
X_mice = bank_encoded_MICE.drop(columns = ["y"])

In [None]:
# Install versions that are compatible with Colab
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 imbalanced-learn==0.13.0

In [None]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.base import clone
from sklearn.metrics import (
    f1_score, accuracy_score, precision_score, recall_score, roc_auc_score,
    classification_report, confusion_matrix
)
from sklearn.utils.class_weight import compute_class_weight
# imblearn's Pipeline is used throughout because it accepts resampling steps
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
import time 

## **Train Model with XGBoost**

In [None]:
!pip install xgboost==2.1.4

In [None]:
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline as ImbPipeline

In [None]:
def XGBoost_model(X, y, condition):
    start_time = time.time()
    n_splits = 5

    # Parameter grid
    param_grid = {
        'xgb__n_estimators': [100, 200, 300],
        'xgb__max_depth': [3, 6, 12, 20],
        'xgb__learning_rate': [0.01, 0.1]
    }

    # CV
    outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    accuracies, aucs, conf_matrixs = [], [], []
    macro_f1s, macro_precisions, macro_recalls = [], [], []
    weighted_f1s, weighted_precisions, weighted_recalls = [], [], []
    sensitivities, specificities, balanced_accuracies = [], [], []

    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
        X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

        # pipeline
        if condition == "IM":
            pipeline = ImbPipeline([
                ('xgb', XGBClassifier(
                    use_label_encoder=False,
                    eval_metric='logloss',
                    random_state=42
                    #tree_method='gpu_hist',
                    #predictor='gpu_predictor',
                    #gpu_id=0
                    ))
            ])

        elif condition == "SMOTE":
            class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
            # Weight of class 1 over class 0, which equals n_negative / n_positive
            scale_pos_weight = class_weights[1] / class_weights[0]

            pipeline = ImbPipeline([
                ('smote', SMOTE(random_state=42)),
                ('xgb', XGBClassifier(
                    use_label_encoder=False,
                    eval_metric='logloss',
                    random_state=42,
                    #tree_method='gpu_hist',
                    #predictor='gpu_predictor',
                    #gpu_id=0,
                    scale_pos_weight=scale_pos_weight))
            ])

        elif condition == "TomekLinks":
            class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
            # Weight of class 1 over class 0, which equals n_negative / n_positive
            scale_pos_weight = class_weights[1] / class_weights[0]

            pipeline = ImbPipeline([
                ('under', TomekLinks()),
                ('xgb', XGBClassifier(
                    use_label_encoder=False,
                    eval_metric='logloss',
                    random_state=42,
                    #tree_method='gpu_hist',
                    #predictor='gpu_predictor',
                    #gpu_id=0,
                    scale_pos_weight=scale_pos_weight))
            ])
        elif condition == "SMOTETomek":
            class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
            # Weight of class 1 over class 0, which equals n_negative / n_positive
            scale_pos_weight = class_weights[1] / class_weights[0]
            pipeline = ImbPipeline([
                ('resample', SMOTETomek(random_state=42)),
                ('xgb', XGBClassifier(
                    use_label_encoder=False,
                    eval_metric='logloss',
                    random_state=42,
                    #tree_method='gpu_hist',
                    #predictor='gpu_predictor',
                    #gpu_id=0,
                    scale_pos_weight=scale_pos_weight))
            ])

        # Inner loop of the nested CV: tune hyperparameters on the training fold only
        grid_search = GridSearchCV(
            estimator=pipeline,
            param_grid=param_grid,
            cv=inner_cv,
            scoring='f1',
            n_jobs=-1,
            error_score='raise'
        )


        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_

        y_pred = best_model.predict(X_val)
        y_proba = best_model.predict_proba(X_val)[:, 1]

        cm = confusion_matrix(y_val, y_pred)
        conf_matrixs.append({"fold": fold_idx, "conf_matrix": cm})

        # Print report
        print(f"\n=== Classification Report for Fold {fold_idx} ===")
        print(classification_report(y_val, y_pred))

        print(f"Confusion Matrix for Fold {fold_idx}:")
        print(conf_matrixs[fold_idx-1]['conf_matrix'])

        print(f"Best Parameters for Fold {fold_idx}: {grid_search.best_params_}")

        # metrics
        accuracies.append(accuracy_score(y_val, y_pred))
        aucs.append(roc_auc_score(y_val, y_proba))
        macro_f1s.append(f1_score(y_val, y_pred, average='macro'))
        macro_precisions.append(precision_score(y_val, y_pred, average='macro'))
        macro_recalls.append(recall_score(y_val, y_pred, average='macro'))
        weighted_f1s.append(f1_score(y_val, y_pred, average='weighted'))
        weighted_precisions.append(precision_score(y_val, y_pred, average='weighted'))
        weighted_recalls.append(recall_score(y_val, y_pred, average='weighted'))

        sensitivity = recall_score(y_val, y_pred, pos_label=1)
        specificity = recall_score(y_val, y_pred, pos_label=0)
        balanced_acc = (sensitivity + specificity) / 2

        sensitivities.append(sensitivity)
        specificities.append(specificity)
        balanced_accuracies.append(balanced_acc)

    # Summary
    results = {
        'mean_accuracy': sum(accuracies) / n_splits,
        'mean_auc': sum(aucs) / n_splits,
        'macro_f1': sum(macro_f1s) / n_splits,
        'macro_precision': sum(macro_precisions) / n_splits,
        'macro_recall': sum(macro_recalls) / n_splits,
        'weighted_f1': sum(weighted_f1s) / n_splits,
        'weighted_precision': sum(weighted_precisions) / n_splits,
        'weighted_recall': sum(weighted_recalls) / n_splits,
        'mean_sensitivity': sum(sensitivities) / n_splits,
        'mean_specificity': sum(specificities) / n_splits,
        'mean_balanced_accuracy': sum(balanced_accuracies) / n_splits
    }

    end_time = time.time()
    print(f"\nTotal training time: {end_time - start_time:.2f} seconds")

    return pd.DataFrame([results])
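
The loop above derives sensitivity, specificity, and balanced accuracy from per-class recall. The same quantities can be read directly off a 2x2 confusion matrix, which may help when checking the printed matrices by hand. The sketch below uses made-up counts, and `binary_rates` is a hypothetical helper name.

```python
def binary_rates(cm):
    """Compute rates from a 2x2 matrix laid out like sklearn's confusion_matrix:
    rows are true classes (0, 1), columns are predicted classes (0, 1)."""
    (tn, fp), (fn, tp) = cm
    sensitivity = tp / (tp + fn)      # recall of the positive class (y = 1)
    specificity = tn / (tn + fp)      # recall of the negative class (y = 0)
    balanced_accuracy = (sensitivity + specificity) / 2
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, balanced_accuracy, accuracy

# Hypothetical fold: 900 negatives (850 predicted correctly),
# 100 positives (60 predicted correctly).
cm = [[850, 50], [40, 60]]
sens, spec, bal_acc, acc = binary_rates(cm)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"balanced_accuracy={bal_acc:.3f} accuracy={acc:.3f}")
```

With imbalanced classes, plain accuracy is dominated by the majority class, which is why the summary also reports balanced accuracy.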


#### **Imbalanced data**

In [None]:
XGB_Unknown_IM = XGBoost_model(X_uk, y_uk, "IM")
XGB_Unknown_IM

In [None]:
XGB_ML_IM = XGBoost_model(X_ml, y_ml, "IM")
XGB_ML_IM

In [None]:
XGB_Mode_IM = XGBoost_model(X_mode, y_mode, "IM")
XGB_Mode_IM

In [None]:
XGB_MICE_IM = XGBoost_model(X_mice, y_mice, "IM")
XGB_MICE_IM

#### **TomekLinks**

In [None]:
XGB_Unknown_TML = XGBoost_model(X_uk, y_uk, "TomekLinks")
XGB_Unknown_TML

In [None]:
XGB_ML_TML = XGBoost_model(X_ml, y_ml, "TomekLinks")
XGB_ML_TML

In [None]:
XGB_Mode_TML = XGBoost_model(X_mode, y_mode, "TomekLinks")
XGB_Mode_TML

In [None]:
XGB_MICE_TML = XGBoost_model(X_mice, y_mice, "TomekLinks")
XGB_MICE_TML

#### **SMOTE**

In [None]:
XGB_Unknown_SM = XGBoost_model(X_uk, y_uk, "SMOTE")
XGB_Unknown_SM

In [None]:
XGB_ML_SM = XGBoost_model(X_ml, y_ml, "SMOTE")
XGB_ML_SM

In [None]:
XGB_Mode_SM = XGBoost_model(X_mode, y_mode, "SMOTE")
XGB_Mode_SM

In [None]:
XGB_MICE_SM = XGBoost_model(X_mice, y_mice, "SMOTE")
XGB_MICE_SM

#### **SMOTE + TomekLinks**

In [None]:
XGB_Unknown_SMTML = XGBoost_model(X_uk, y_uk, "SMOTETomek")
XGB_Unknown_SMTML

In [None]:
XGB_ML_SMTML = XGBoost_model(X_ml, y_ml, "SMOTETomek")
XGB_ML_SMTML

In [None]:
XGB_Mode_SMTML = XGBoost_model(X_mode, y_mode, "SMOTETomek")
XGB_Mode_SMTML

In [None]:
XGB_MICE_SMTML = XGBoost_model(X_mice, y_mice, "SMOTETomek")
XGB_MICE_SMTML

## **Train Model with Random Forest**

In [None]:
def RandomForest_model(X, y, condition):
    start_time = time.time()
    n_splits = 5

    # pipeline
    if condition == "IM":
        pipeline = Pipeline([('rf', RandomForestClassifier(random_state=42))])

    elif condition == "SMOTE":
        pipeline = Pipeline([
            ('smote', SMOTE(random_state=42)),
            ('rf', RandomForestClassifier(random_state=42, class_weight='balanced'))
        ])

    elif condition == "TomekLinks":
        pipeline = Pipeline([
            ('under', TomekLinks()),
            ('rf', RandomForestClassifier(random_state=42, class_weight='balanced'))
        ])

    elif condition == "SMOTETomek":
        pipeline = Pipeline([
            ('resample', SMOTETomek(random_state=42)),
            ('rf', RandomForestClassifier(random_state=42, class_weight='balanced'))
        ])

    # Parameter grid
    param_grid = {
        'rf__n_estimators': [200, 300, 500],
        'rf__max_depth': [None, 10, 15],
        'rf__min_samples_split': [2, 5]
    }

    # CV
    outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    accuracies, aucs, conf_matrixs = [], [], []
    macro_f1s, macro_precisions, macro_recalls = [], [], []
    weighted_f1s, weighted_precisions, weighted_recalls = [], [], []
    sensitivities, specificities, balanced_accuracies = [], [], []

    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
        X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

        grid_search = GridSearchCV(
            estimator=clone(pipeline),
            param_grid=param_grid,
            cv=inner_cv,
            scoring="f1",
            n_jobs=-1
        )
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_

        y_pred = best_model.predict(X_val)
        y_proba = best_model.predict_proba(X_val)[:, 1]
        cm = confusion_matrix(y_val, y_pred)
        conf_matrixs.append({"fold": fold_idx, "conf_matrix": cm})

        # Print report
        print(f"\n=== Classification Report for Fold {fold_idx} ===")
        print(classification_report(y_val, y_pred))

        print(f"Confusion Matrix for Fold {fold_idx}:")
        print(conf_matrixs[fold_idx-1]['conf_matrix'])
        
        print(f"Best Parameters for Fold {fold_idx}: {grid_search.best_params_}")

        # metrics
        accuracies.append(accuracy_score(y_val, y_pred))
        aucs.append(roc_auc_score(y_val, y_proba))
        macro_f1s.append(f1_score(y_val, y_pred, average='macro'))
        macro_precisions.append(precision_score(y_val, y_pred, average='macro'))
        macro_recalls.append(recall_score(y_val, y_pred, average='macro'))
        weighted_f1s.append(f1_score(y_val, y_pred, average='weighted'))
        weighted_precisions.append(precision_score(y_val, y_pred, average='weighted'))
        weighted_recalls.append(recall_score(y_val, y_pred, average='weighted'))

        sensitivity = recall_score(y_val, y_pred, pos_label=1)
        specificity = recall_score(y_val, y_pred, pos_label=0)
        balanced_acc = (sensitivity + specificity) / 2

        sensitivities.append(sensitivity)
        specificities.append(specificity)
        balanced_accuracies.append(balanced_acc)

    # Summary as DataFrame
    results = {
        'mean_accuracy': sum(accuracies) / n_splits,
        'mean_auc': sum(aucs) / n_splits,
        'macro_f1': sum(macro_f1s) / n_splits,
        'macro_precision': sum(macro_precisions) / n_splits,
        'macro_recall': sum(macro_recalls) / n_splits,
        'weighted_f1': sum(weighted_f1s) / n_splits,
        'weighted_precision': sum(weighted_precisions) / n_splits,
        'weighted_recall': sum(weighted_recalls) / n_splits,
        'mean_sensitivity': sum(sensitivities) / n_splits,
        'mean_specificity': sum(specificities) / n_splits,
        'mean_balanced_accuracy': sum(balanced_accuracies) / n_splits
    }
    end_time = time.time()
    print(f"\nTotal training time: {end_time - start_time:.2f} seconds")

    return pd.DataFrame([results])

#### **Imbalanced data**

In [None]:
RF_Unknown_IM = RandomForest_model(X_uk, y_uk, "IM")
RF_Unknown_IM

In [None]:
RF_ML_IM = RandomForest_model(X_ml, y_ml, "IM")
RF_ML_IM

In [None]:
RF_Mode_IM = RandomForest_model(X_mode, y_mode, "IM")
RF_Mode_IM

In [None]:
RF_MICE_IM = RandomForest_model(X_mice, y_mice, "IM")
RF_MICE_IM

#### **TomekLinks**

In [None]:
RF_Unknown_TML = RandomForest_model(X_uk, y_uk, "TomekLinks")
RF_Unknown_TML

In [None]:
RF_ML_TML = RandomForest_model(X_ml, y_ml, "TomekLinks")
RF_ML_TML

In [None]:
RF_Mode_TML = RandomForest_model(X_mode, y_mode, "TomekLinks")
RF_Mode_TML

In [None]:
RF_MICE_TML = RandomForest_model(X_mice, y_mice, "TomekLinks")
RF_MICE_TML

#### **SMOTE**

In [None]:
RF_Unknown_SM = RandomForest_model(X_uk, y_uk, "SMOTE")
RF_Unknown_SM

In [None]:
RF_ML_SM = RandomForest_model(X_ml, y_ml, "SMOTE")
RF_ML_SM

In [None]:
RF_Mode_SM = RandomForest_model(X_mode, y_mode, "SMOTE")
RF_Mode_SM

In [None]:
RF_MICE_SM = RandomForest_model(X_mice, y_mice, "SMOTE")
RF_MICE_SM

#### **SMOTE + TomekLinks**

In [None]:
RF_Unknown_SMTML = RandomForest_model(X_uk, y_uk, "SMOTETomek")
RF_Unknown_SMTML

In [None]:
RF_ML_SMTML = RandomForest_model(X_ml, y_ml, "SMOTETomek")
RF_ML_SMTML

In [None]:
RF_Mode_SMTML = RandomForest_model(X_mode, y_mode, "SMOTETomek")
RF_Mode_SMTML

In [None]:
RF_MICE_SMTML = RandomForest_model(X_mice, y_mice, "SMOTETomek")
RF_MICE_SMTML

## **Train Model with Support Vector Classifier**

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.svm import LinearSVC

In [None]:
def SVC_model(X, y, condition):
    start_time = time.time()
    n_splits = 5
    numeric_cols = ['age', 'duration', 'campaign', 'pdays', 'previous',
       'emp_variation_rate', 'cons_price', 'cons.conf.idx', 'euribor_3m',
       'num_employees']

    # numeric columns
    preprocessor = ColumnTransformer(
        transformers=[('num', StandardScaler(), numeric_cols)], remainder='passthrough'
    )

    # pipeline
    if condition == "IM":
        pipeline = Pipeline([
            ('preprocess', preprocessor),
            ('svc', LinearSVC(class_weight= None, random_state=42))
        ])

    elif condition == "SMOTE":
        pipeline = ImbPipeline([
            ('smote', SMOTE(random_state=42)),
            ('preprocess', preprocessor),
            ('svc', LinearSVC(class_weight= None, random_state=42))
        ])
    elif condition == "TomekLinks":
        pipeline = ImbPipeline([
            ('under', TomekLinks()),
            ('preprocess', preprocessor),
            ('svc', LinearSVC(class_weight= None, random_state=42))
        ])
    elif condition == "SMOTETomek":
        pipeline = ImbPipeline([
            ('resample', SMOTETomek(random_state=42)),
            ('preprocess', preprocessor),
            ('svc', LinearSVC(class_weight= None, random_state=42))
        ])

    param_grid = {
        'svc__C': [0.01, 0.1, 1, 10],
        'svc__penalty': ['l2']
    }

    outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    accuracies, aucs, conf_matrixs = [], [], []
    macro_f1s, macro_precisions, macro_recalls = [], [], []
    weighted_f1s, weighted_precisions, weighted_recalls = [], [], []
    sensitivities, specificities, balanced_accuracies = [], [], []

    for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
        X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

        grid_search = GridSearchCV(
            estimator=clone(pipeline),
            param_grid=param_grid,
            cv=inner_cv,
            scoring="f1",
            n_jobs=-1
        )
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_

        y_pred = best_model.predict(X_val)
        y_scores = best_model.decision_function(X_val)  # LinearSVC

        cm = confusion_matrix(y_val, y_pred)
        conf_matrixs.append({"fold": fold_idx, "conf_matrix": cm})

        # Print report
        print(f"\n=== Classification Report for Fold {fold_idx} ===")
        print(classification_report(y_val, y_pred))

        print(f"Confusion Matrix for Fold {fold_idx}:")
        print(conf_matrixs[fold_idx-1]['conf_matrix'])
        
        print(f"Best Parameters for Fold {fold_idx}: {grid_search.best_params_}")

        accuracies.append(accuracy_score(y_val, y_pred))
        aucs.append(roc_auc_score(y_val, y_scores))
        macro_f1s.append(f1_score(y_val, y_pred, average='macro'))
        macro_precisions.append(precision_score(y_val, y_pred, average='macro'))
        macro_recalls.append(recall_score(y_val, y_pred, average='macro'))
        weighted_f1s.append(f1_score(y_val, y_pred, average='weighted'))
        weighted_precisions.append(precision_score(y_val, y_pred, average='weighted'))
        weighted_recalls.append(recall_score(y_val, y_pred, average='weighted'))

        sensitivity = recall_score(y_val, y_pred, pos_label=1)
        specificity = recall_score(y_val, y_pred, pos_label=0)
        balanced_acc = (sensitivity + specificity) / 2

        sensitivities.append(sensitivity)
        specificities.append(specificity)
        balanced_accuracies.append(balanced_acc)

    results = {
        'mean_accuracy': np.mean(accuracies),
        'mean_auc': np.mean(aucs),
        'macro_f1': np.mean(macro_f1s),
        'macro_precision': np.mean(macro_precisions),
        'macro_recall': np.mean(macro_recalls),
        'weighted_f1': np.mean(weighted_f1s),
        'weighted_precision': np.mean(weighted_precisions),
        'weighted_recall': np.mean(weighted_recalls),
        'mean_sensitivity': np.mean(sensitivities),
        'mean_specificity': np.mean(specificities),
        'mean_balanced_accuracy': np.mean(balanced_accuracies)
    }

    end_time = time.time()
    print(f"\nTotal training time: {end_time - start_time:.2f} seconds")

    return pd.DataFrame([results])


#### **Imbalanced data**

In [None]:
SVC_ML_IM = SVC_model(X_ml, y_ml, "IM")
SVC_ML_IM

In [None]:
SVC_Mode_IM = SVC_model(X_mode, y_mode, "IM")
SVC_Mode_IM

In [None]:
SVC_MICE_IM = SVC_model(X_mice, y_mice, "IM")
SVC_MICE_IM

#### **TomekLinks**

In [None]:
SVC_ML_TML = SVC_model(X_ml, y_ml, "TomekLinks")
SVC_ML_TML

In [None]:
SVC_Mode_TML = SVC_model(X_mode, y_mode, "TomekLinks")
SVC_Mode_TML

In [None]:
SVC_MICE_TML = SVC_model(X_mice, y_mice, "TomekLinks")
SVC_MICE_TML

#### **SMOTE**

In [None]:
SVC_ML_SM = SVC_model(X_ml, y_ml, "SMOTE")
SVC_ML_SM

In [None]:
SVC_Mode_SM = SVC_model(X_mode, y_mode, "SMOTE")
SVC_Mode_SM

In [None]:
SVC_MICE_SM = SVC_model(X_mice, y_mice, "SMOTE")
SVC_MICE_SM

#### **SMOTE + TomekLinks**

In [None]:
SVC_ML_SMTM = SVC_model(X_ml, y_ml, "SMOTETomek")
SVC_ML_SMTM

In [None]:
SVC_Mode_SMTM = SVC_model(X_mode, y_mode, "SMOTETomek")
SVC_Mode_SMTM

In [None]:
SVC_MICE_SMTM = SVC_model(X_mice, y_mice, "SMOTETomek")
SVC_MICE_SMTM