#### Research Notebook: Evaluation of Privacy-Preserving Machine Learning with FHE

This research notebook evaluates model performance by comparing classical models from the scikit-learn library, which operate on plaintext data, with their Concrete ML equivalents, which leverage Fully Homomorphic Encryption (FHE). Specifically, the comparison considers the following aspects (plaintext vs. ciphertext):
- Overall accuracy of the plaintext model
- Latency of the plaintext model (average inference time over 100 iterations)
- Accuracy of the cipher model, trained from scratch with a varying bit-width (a form of hyperparameter tuning)
- Latency of the cipher model (average inference time over 100 iterations) with a varying bit-width
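
To make the role of the bit-width concrete, the sketch below applies a generic uniform quantizer to random data and reports the worst-case rounding error for several bit-widths. The helper `quantize_uniform` is hypothetical and only illustrates the precision trade-off; it is not Concrete ML's actual quantization scheme.

```python
import numpy as np


def quantize_uniform(values, n_bits):
    """Round values onto 2**n_bits evenly spaced levels spanning their range."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    n_steps = 2 ** n_bits - 1  # number of intervals between levels
    scale = (hi - lo) / n_steps if hi > lo else 1.0
    q = np.round((values - lo) / scale).astype(int)  # integer codes
    return q * scale + lo  # de-quantized approximation


rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

# Worst-case absolute error for each bit-width (illustrative values only).
errors = {
    bits: float(np.abs(sample - quantize_uniform(sample, bits)).max())
    for bits in (2, 4, 6, 8)
}
for bits, err in errors.items():
    print(f"n_bits={bits}: max abs error = {err:.4f}")
```

Fewer bits mean coarser levels and larger rounding error, which is why accuracy is tracked alongside latency as the bit-width varies.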


### Configuration of the Environment

In [13]:
%pip install concrete-ml
%pip install "kagglehub[pandas-datasets]"
%pip install seaborn
%pip install imbalanced-learn



In [None]:
# data preprocessing
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import time
from typing import Dict, Callable

# data encryption
from concrete.ml.pandas import ClientEngine
from io import StringIO
import pandas

# data manipulation
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# models
from sklearn.model_selection import StratifiedKFold
import pandas as pd  # needed to handle CSV files
# plain models
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score,accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier # to do
from xgboost import XGBClassifier

# cipher models
from concrete.ml.sklearn import DecisionTreeClassifier as ConcreteDecisionTreeClassifier
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression
from concrete.ml.sklearn import RandomForestClassifier as ConcreteRandomForestClassifier
from concrete.ml.sklearn import NeuralNetClassifier # to do 
from concrete.ml.sklearn import XGBClassifier as ConcreteXGBClassifier



### About the Dataset: Credit Fraud Detection

The dataset contains transactions made with credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data are not available. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, the dataset authors recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC). Accuracy derived from the confusion matrix is not meaningful for unbalanced classification.
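
The point about accuracy can be illustrated with a small sketch on synthetic labels. It assumes toy data with a fraud rate close to the real one and no tied scores; `average_precision` is a hand-written helper for illustration (a common estimate of AUPRC), and in practice `sklearn.metrics.average_precision_score` would normally be used instead.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Mean of the precision measured at the rank of each positive sample."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")  # highest score first
    y_sorted = y_true[order]
    true_positives = np.cumsum(y_sorted)
    precision_at_k = true_positives / np.arange(1, len(y_sorted) + 1)
    return float(precision_at_k[y_sorted == 1].mean())


# Toy labels mimicking the imbalance: 17 frauds in 10,000 transactions.
y_toy = np.zeros(10_000, dtype=int)
y_toy[:17] = 1

# A useless classifier that always predicts "not fraud" still looks accurate.
always_negative = np.zeros_like(y_toy)
accuracy = float((always_negative == y_toy).mean())

# Uninformative random scores, by contrast, yield a low average precision.
rng = np.random.default_rng(0)
ap_random = average_precision(y_toy, rng.random(y_toy.size))

print(f"accuracy of always-negative model:  {accuracy:.4f}")
print(f"average precision of random scores: {ap_random:.4f}")
```

The always-negative model never detects a fraud yet scores above 99% accuracy, while a ranking-based metric exposes how little an uninformative model contributes.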

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
It is available on Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download) and is used under the Database Contents License (DbCL) v1.0.

### Data Preprocessing: Exploration

In [15]:
df = pd.read_csv("creditcard.csv")

In [16]:
df.head()



From the output of `df.describe()`, the following observations about the dataset can be made:
1. **Time Feature**:
    - The `Time` feature represents the seconds elapsed between each transaction and the first transaction in the dataset.
    - The range of values is consistent with the two-day collection window described above.

2. **Amount Feature**:
    - The `Amount` feature represents the transaction amount.
    - The mean and standard deviation suggest a wide variation in transaction amounts, with some outliers likely present.

3. **Principal Components (V1 to V28)**:
    - These features are the result of PCA transformation and are scaled.
    - The mean values are close to zero, and the standard deviations indicate the spread of the data.

4. **Class Feature**:
    - The `Class` feature is the target variable, where `1` indicates fraud and `0` indicates non-fraud.
    - The dataset is highly imbalanced, as observed in the count of fraudulent vs. non-fraudulent transactions.


In [17]:
df.describe()



#### Missing value evaluation

In [18]:
if df.isnull().sum().sum() > 0:  # the sum is greater than zero if at least one value is missing
    print("There are missing values")
else:
    print("This dataset does not contain any missing values")



#### Class Distribution evaluation


As expected, the classes are highly imbalanced, so rebalancing is needed before the models can be trained effectively.

In [19]:
df['Class'].value_counts(normalize=True)



#### Correlation Analysis

In [20]:
correlation_matrix = df.corr()

plt.figure(figsize=(15, 10))  
heatmap = sn.heatmap(
    data=correlation_matrix, 
    annot=True, 
    fmt=".2f", 
    cmap="coolwarm", 
    cbar=True, 
    annot_kws={"size": 8}  
)
plt.title("Correlation Matrix", fontsize=16)  
plt.xticks(fontsize=10, rotation=45)  
plt.yticks(fontsize=10)
plt.show()



#### Data Distribution and Scaling

In [21]:
df.boxplot(column=['Amount'])





In [22]:
df.boxplot(column=['Time'])





##### Class Imbalance Handling

Given the highly imbalanced nature of the dataset (fraudulent transactions accounting for only 0.172%), undersampling was chosen to reduce computational costs and training time while maintaining focus on the minority class. The large dataset size (284,807 samples) ensures sufficient data remains for training, even after undersampling, without introducing synthetic data that could risk overfitting.
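
As an illustration of what random undersampling does, the sketch below keeps every minority sample and draws an equally sized random subset of the majority class. `random_undersample` is a hypothetical NumPy helper written for this example on toy data; it is not the implementation used by imbalanced-learn's `RandomUnderSampler`.

```python
import numpy as np


def random_undersample(X, y, seed=42):
    """Keep n_min randomly chosen rows per class, n_min being the rarest class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Toy data: 1,000 rows, 5 of which belong to the positive class.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.zeros(1000, dtype=int)
y_demo[:5] = 1

X_bal, y_bal = random_undersample(X_demo, y_demo)
print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

After resampling, both classes contain the same number of rows, so the training set shrinks to twice the minority count.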


In [23]:
undersampler = RandomUnderSampler(random_state=42)

In [24]:
x = df.drop(columns=['Class']).values  # Features
y = df['Class'].values  # Target variable

In [None]:
def train_cipher_model(model, bit_widths, x, y, undersampler, kf):
    """
    Train and evaluate a cipher model with different bit-widths.

    For each bit-width, this function:
    1. Sets the bit-width (`n_bits`) of the cipher model.
    2. Performs stratified k-fold cross-validation.
    3. Trains the model on the undersampled training fold and evaluates it on the test fold.
    4. Measures the inference time on each test fold (average over 100 iterations).
    5. Computes the average accuracy and latency across all folds.
    Finally, it prints the results for each bit-width.

    Args:
        model: Concrete ML classifier exposing `n_bits` through `set_params`.
        bit_widths: Iterable of bit-widths to evaluate.
        x: Feature matrix (NumPy array).
        y: Target vector (NumPy array).
        undersampler: Resampler exposing `fit_resample`, e.g. RandomUnderSampler.
        kf: Cross-validation splitter, e.g. StratifiedKFold.
    """

    cipher_accuracies = []
    cipher_latencies = []

    for bit_width in bit_widths:
        model.set_params(n_bits=bit_width)
        cv_scores = []
        latencies = []

        for train_index, test_index in kf.split(x, y):
            X_train, X_test = x[train_index], x[test_index]
            Y_train, Y_test = y[train_index], y[test_index]

            X_train_resampled, Y_train_resampled = undersampler.fit_resample(X_train, Y_train)
            model.fit(X_train_resampled, Y_train_resampled)

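            # NOTE: `predict` is called without an `fhe` argument and the model is never
            # compiled, so this likely times the quantized model in the clear rather than
            # encrypted inference (assumption about Concrete ML defaults; verify locally).
            inference_times = []
            for _ in range(100):
                start_time = time.perf_counter()
                model.predict(X_test)
                inference_times.append(time.perf_counter() - start_time)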

            avg_inference_time = np.mean(inference_times)
            latencies.append(avg_inference_time)

            score = model.score(X_test, Y_test)
            cv_scores.append(score)

        cipher_accuracies.append(np.mean(cv_scores))
        cipher_latencies.append(np.mean(latencies))

    for i, bit_width in enumerate(bit_widths):
        print(f"Bit-Width: {bit_width}")
        print(f"Average Accuracy: {cipher_accuracies[i]}")
        print(f"Average Inference Latency (100 iterations): {cipher_latencies[i]:.6f} seconds")


In [None]:
def train_plain_model(model, x, y, undersampler, kf=None):
    """
    Train and evaluate a model using cross-validation.

    Parameters:
    - model: The machine learning model to train and evaluate.
    - x: Feature matrix.
    - y: Target vector.
    - undersampler: An instance of an undersampling technique.
    - kf: An instance of StratifiedKFold for cross-validation. If None, a default one will be used.

    """
    if kf is None:
        kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    cv_scores = []
    latencies = []

    for train_index, test_index in kf.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        Y_train, Y_test = y[train_index], y[test_index]

        X_train_resampled, Y_train_resampled = undersampler.fit_resample(X_train, Y_train)

        start_time = time.time()
        model.fit(X_train_resampled, Y_train_resampled)
        training_time = time.time() - start_time

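        # perf_counter is a monotonic, high-resolution clock suited to timing short intervals
        inference_times = []
        for _ in range(100):
            start_time = time.perf_counter()
            model.predict(X_test)
            inference_times.append(time.perf_counter() - start_time)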

        avg_inference_time = np.mean(inference_times)
        latencies.append(avg_inference_time)

        # Evaluate accuracy
        Y_pred = model.predict(X_test)
        score = accuracy_score(Y_test, Y_pred)
        cv_scores.append(score)

    average_accuracy = np.mean(cv_scores)
    average_latency = np.mean(latencies)

    print(f"Cross-Validation Accuracy Scores: {cv_scores}")
    print(f"Average Cross-Validation Accuracy: {average_accuracy}")
    print(f"Average Inference Latency (100 iterations): {average_latency:.6f} seconds")



In [None]:

def grid_search(grid_param: Dict, clf: Callable, X_train, y_train) -> GridSearchCV:
    """
    Perform a grid search to find the best hyper-parameters for a given classifier.

    Args:
        grid_param (Dict): The hyper-parameters to be tuned
        clf (Callable): The given classifier
        X_train: Training feature matrix
        y_train: Training target vector

    Returns:
        GridSearchCV: The fitted GridSearchCV object.
    """
    grid_search = GridSearchCV(
        clf,
        grid_param,
        cv=5,
        scoring='f1',
        verbose=1,
        n_jobs=-1,
    ).fit(X_train, y_train)

    # The best model
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.2%}")

    return grid_search

##### Hyperparameter Tuning

In [None]:
param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 10, 20, 30, 40, 50, 60],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'max_features': [None, 'sqrt', 'log2', 0.5, 0.75],
    'splitter': ['best', 'random'],
    'class_weight': [None, 'balanced', {0: 1, 1: 2}, {0: 1, 1: 3}]
}
X_train_resampled, Y_train_resampled = undersampler.fit_resample(x, y)
dt_model = DecisionTreeClassifier(random_state=42)
grid_search_result = grid_search(param_grid, dt_model, X_train_resampled, Y_train_resampled)
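
Because `GridSearchCV` fits every combination in the grid once per fold, it can help to estimate the cost of the search before launching it. The sketch below counts the combinations in the decision tree grid above; `cv_folds = 5` is an assumption that mirrors `cv=5` in the `grid_search` helper.

```python
from math import prod

# Same grid as the decision tree search above.
param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 10, 20, 30, 40, 50, 60],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'max_features': [None, 'sqrt', 'log2', 0.5, 0.75],
    'splitter': ['best', 'random'],
    'class_weight': [None, 'balanced', {0: 1, 1: 2}, {0: 1, 1: 3}],
}

# An exhaustive search tries the Cartesian product of all value lists.
n_combinations = prod(len(values) for values in param_grid.values())
cv_folds = 5  # assumption: matches cv=5 in grid_search
n_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} fits")
```

If the number of fits is too large for the available compute, a smaller grid or scikit-learn's `RandomizedSearchCV` may be a more practical choice.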

#### Decision Tree Model

In [None]:
best_params = {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 2, 'min_samples_split': 20, 'splitter': 'random'}
final_cipher_model = ConcreteDecisionTreeClassifier(**best_params, random_state=42)
final_plain_model = DecisionTreeClassifier(**best_params, random_state=42)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
bit_widths = [2, 4, 6, 8, 10]
train_plain_model(final_plain_model, x, y, undersampler, kf)
train_cipher_model(final_cipher_model, bit_widths, x, y, undersampler, kf)

#### Random Forest Model

In [None]:
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'class_weight': [None, 'balanced', 'balanced_subsample']
}

rf_model = RandomForestClassifier(random_state=42)
grid_search_result = grid_search(param_grid_rf, rf_model, X_train_resampled, Y_train_resampled)

In [None]:
best_params_rf = {'class_weight': None, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 50}
final_cipher_model = ConcreteRandomForestClassifier(**best_params_rf, random_state=42)
bit_widths = [2, 4, 6, 8, 10]
final_plain_model = RandomForestClassifier(**best_params_rf, random_state=42)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
train_plain_model(final_plain_model, x, y, undersampler, kf)
train_cipher_model(final_cipher_model, bit_widths, x, y, undersampler, kf)



#### Logistic Regression

In [None]:
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'penalty': ['l2'],
    'class_weight': [None, 'balanced']
}
lr_model = LogisticRegression(random_state=42, max_iter=1000)
grid_search_result = grid_search(param_grid_lr, lr_model, X_train_resampled, Y_train_resampled)



In [None]:
best_params_lr = {'C': 100, 'class_weight': None, 'penalty': 'l2', 'solver': 'newton-cg'}
final_cipher_model = ConcreteLogisticRegression(**best_params_lr, random_state=42)
bit_widths = [2, 4, 6, 8, 10]
final_plain_model = LogisticRegression(**best_params_lr, random_state=42)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
train_plain_model(final_plain_model, x, y, undersampler, kf)
train_cipher_model(final_cipher_model, bit_widths, x, y, undersampler, kf)



#### Gradient Boosting (XGBoost)

In [None]:
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2]
}

xgb_model = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
grid_search_result = grid_search(param_grid_xgb, xgb_model, X_train_resampled, Y_train_resampled)



In [None]:
best_params_xgb = {
    'n_estimators': 100,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_weight': 1,
    'subsample': 0.9,
    'colsample_bytree': 0.8,
    'gamma': 0.1
}
final_cipher_model_xgb = ConcreteXGBClassifier(**best_params_xgb, random_state=42)
final_plain_model_xgb = XGBClassifier(**best_params_xgb, random_state=42)
train_plain_model(final_plain_model_xgb, x, y, undersampler, kf)
train_cipher_model(final_cipher_model_xgb, bit_widths, x, y, undersampler, kf)



#### Multilayer Perceptron (MLP)

In [None]:
param_grid_mlp = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (100, 100)],
    'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'max_iter': [200, 500, 1000]
}

mlp_model = MLPClassifier(random_state=42)
grid_search_result = grid_search(param_grid_mlp, mlp_model, X_train_resampled, Y_train_resampled)



