### STEP 1

Linear regression is a statistical model that estimates the linear relationship between a dependent variable and one or more features by fitting a line (or, with several features, a hyperplane). Simple linear regression has one independent variable, while multiple linear regression has more than one; in both cases there is a single dependent variable. The model is fitted to the data using the least squares method, which minimizes the sum of squared residuals. Each slope gives the expected change in the target per unit change in its feature, while the intercept gives the baseline prediction when all inputs are zero. Common evaluation metrics are mean squared error (MSE) and R-squared. Linear regression is primarily used when the target variable is continuous.
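
As a minimal sketch of the least squares fit described above, the following NumPy snippet fits a line to a small made-up dataset and computes MSE and R-squared. The data values and variable names are illustrative assumptions and are not taken from the study.

```python
import numpy as np

# Toy data (hypothetical): y is roughly 2*x + 1 with small deviations
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# The least squares solution minimizes the sum of squared residuals
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = slope * x + intercept
mse = np.mean((y - y_hat) ** 2)
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, MSE={mse:.4f}, R^2={r_squared:.4f}")
```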

Logistic regression is a statistical model used for classification problems. The goal is to predict the probability that an input belongs to a certain category. The input variables are called features and the output variable is called the target. Each feature has a weight, and the weighted linear combination of the features (plus an intercept) gives the logit, i.e. the log-odds of the positive class. The sigmoid function maps the logit to a probability between 0 and 1. The probabilities are converted into class predictions using a threshold (usually 0.5). For example, if the probability is greater than 0.5 the input is predicted as class 1, otherwise as class 0. A logistic regression model can be evaluated using a confusion matrix, a likelihood-ratio test, or pseudo-R-squared. Logistic regression is most commonly used for problems where the dependent variable is binary.
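
The chain from features to class prediction can be sketched in a few lines of NumPy. The weights, intercept, and inputs below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Map a logit (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept and inputs, chosen only for illustration
weights = np.array([0.8, -0.5])
intercept = 0.1
X = np.array([[2.0, 1.0],    # logit = 1.6 - 0.5 + 0.1 = 1.2
              [0.5, 3.0]])   # logit = 0.4 - 1.5 + 0.1 = -1.0

logits = X @ weights + intercept                  # weighted linear combination of the features
probabilities = sigmoid(logits)                   # probability of class 1
predictions = (probabilities > 0.5).astype(int)   # threshold at 0.5

print(probabilities, predictions)
```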

### STEP 2

The study addresses the challenge of training ML models on private or sensitive data. Homomorphic encryption allows computations to be carried out directly on encrypted data, so the data does not have to be decrypted during training. The study proposes a method in which the CKKS homomorphic encryption scheme is used to encrypt the data. The method is optimized through polynomial approximation, SIMD vectorization, and bootstrapping. It is applied to two datasets: a real-world financial dataset and the MNIST dataset. The method achieved 80% accuracy on the financial dataset and 96.4% accuracy on the MNIST dataset.

TODO LIST:\

Preprocessing and Setup: 
 
The MNIST dataset needs to be restructured and preprocessed. First, it must be reduced to two classes for binary classification. By default, the images are 28x28 pixels; they need to be downsampled to 14x14 pixels. 
 
For the homomorphic encryption scheme, CKKS (HEAAN) is used. The plaintext scaling factor, ciphertext modulus, and noise tolerance also need to be set. 
 
The sigmoid function needs to be approximated using a least squares fit.\
<br />
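
A least squares polynomial fit of the sigmoid can be sketched with `np.polyfit`. The interval [-8, 8], the degree 3, and the grid size are assumptions made for this illustration; the study's exact choices, and therefore its coefficients, may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed fitting interval and degree; the study's exact choices may differ
lo, hi, degree = -8.0, 8.0, 3
xs = np.linspace(lo, hi, 2001)

# np.polyfit solves the least squares problem; coefficients come highest degree first
coeffs = np.polyfit(xs, sigmoid(xs), degree)
c3, c2, c1, c0 = coeffs

approx = np.polyval(coeffs, xs)
max_error = np.max(np.abs(approx - sigmoid(xs)))
print(f"sigmoid(x) ~ {c0:.4f} + {c1:.4f}*x + {c2:.4f}*x^2 + {c3:.6f}*x^3 (max error {max_error:.3f})")
```

Because the sigmoid minus 0.5 is an odd function and the grid is symmetric, the constant term comes out at 0.5 and the quadratic term vanishes up to rounding, leaving a polynomial with only a linear and a cubic term besides the constant.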

Encryption: 
 
Encrypt the data using OpenFHE’s CKKS scheme. The multiplicative depth should be 5 and the scaling factor 30 bits.\
<br /> 
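
To illustrate what the scaling factor does, here is a toy fixed-point model in plain Python. It is only an analogy under simplifying assumptions: there is no encryption, noise, or modulus, and the helper names `encode`, `decode`, and `rescale` are hypothetical rather than OpenFHE functions. It shows why a multiplication squares the scale and why a rescaling step is needed afterwards.

```python
SCALE_BITS = 30            # scaling factor bits, as in the TODO item above
SCALE = 1 << SCALE_BITS    # scaling factor 2^30

def encode(value: float) -> int:
    """Represent a real number as an integer at scale 2^30."""
    return round(value * SCALE)

def decode(encoded: int, scale: int = SCALE) -> float:
    """Recover the real number from its scaled integer form."""
    return encoded / scale

def rescale(encoded: int) -> int:
    """Divide by the scaling factor (with rounding) to return to scale 2^30."""
    return (encoded + SCALE // 2) // SCALE

a, b = encode(1.5), encode(2.25)
product = a * b                      # the product lives at scale 2^60
product_rescaled = rescale(product)  # back at scale 2^30

print(decode(product, SCALE * SCALE), decode(product_rescaled))
```

In CKKS each such rescaling uses up one level of the ciphertext modulus, which is why the multiplicative depth has to be chosen up front.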

Logistic Regression Implementation: 
 
Compute the encrypted predictions homomorphically using matrix-vector multiplication. 
 
The sigmoid function needs to be replaced with a low-degree polynomial approximation. 
 
Apply gradient descent on the encrypted data while limiting the noise growth that comes with it.\
<br />
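
Before moving to ciphertexts, the three items above can be tried out in plaintext. The sketch below runs gradient descent with a polynomial stand-in for the sigmoid on synthetic data. The data, the fitting interval [-8, 8], the learning rate, and the iteration count are all assumptions made for illustration and do not reproduce the study's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Degree-3 least squares approximation of the sigmoid on an assumed interval [-8, 8]
grid = np.linspace(-8.0, 8.0, 2001)
poly_coeffs = np.polyfit(grid, sigmoid(grid), 3)

def poly_sigmoid(z):
    """Polynomial stand-in for the sigmoid: only additions and multiplications."""
    return np.polyval(poly_coeffs, z)

# Synthetic, well-separated two-class data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_per_class = 100
X = np.vstack([rng.normal(+1.0, 0.3, size=(n_per_class, 2)),
               rng.normal(-1.0, 0.3, size=(n_per_class, 2))])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

w = np.zeros(X.shape[1])
learning_rate, num_iterations = 0.1, 10

for _ in range(num_iterations):
    predictions = poly_sigmoid(X @ w)             # matrix-vector product, then polynomial
    gradient = X.T @ (predictions - y) / len(y)   # averaged (approximate) logistic gradient
    w = w - learning_rate * gradient              # gradient descent update

accuracy = np.mean((poly_sigmoid(X @ w) > 0.5) == (y == 1))
print("weights:", w, "training accuracy:", accuracy)
```

Every operation in the loop is an addition or a multiplication, which is what makes the same update expressible with homomorphic operations.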

Evaluation of the model: 
 
Decrypt the weights. After decryption, predict on the plaintext test data. Finally, evaluate the accuracy and AUROC.\
<br />
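
For reference, both metrics can be computed by hand in a few lines. This is a small sketch with made-up labels and scores; the notebook itself relies on scikit-learn's `accuracy_score` and `roc_auc_score`, and the function names `accuracy` and `auroc` here are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # all positive/negative pairs
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                    # hypothetical predicted probabilities
y_pred = [int(s > 0.5) for s in scores]

print(accuracy(y_true, y_pred), auroc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUROC only depends on how the scores rank positives against negatives.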

Comparison (to “vanilla” application): 
 
Compare the two models' accuracy.

### STEP 3

#### Vanilla application
The whole script for the vanilla application can be found in the zip file. It can be run with ```python vanilla.py``` or ```python3 vanilla.py```, depending on the system alias.

Import the necessary libraries:

In [11]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

Download the MNIST dataset using sklearn and assign the features (X) and labels (y).

In [12]:
mnist = fetch_openml("mnist_784", version=1)
X, y = mnist.data, mnist.target.astype(int)

Restructure the data for binary classification. As in the study, I use the digits 3 and 8.

In [13]:
# Filter for binary classification
binary_classes = [3, 8]
X, y = X[np.isin(y, binary_classes)], y[np.isin(y, binary_classes)]

# Convert labels to 0/1
y = np.where(y == binary_classes[0], 0, 1)
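
The effect of the filtering and relabeling is easiest to see on a miniature example. The arrays `X_demo` and `y_demo` below are hypothetical stand-ins for the MNIST data.

```python
import numpy as np

# Hypothetical miniature dataset: one feature per sample, labels are digits
X_demo = np.array([[10], [11], [12], [13], [14]])
y_demo = np.array([3, 5, 8, 3, 0])

binary_classes = [3, 8]
mask = np.isin(y_demo, binary_classes)        # True where the label is 3 or 8
X_bin, y_bin = X_demo[mask], y_demo[mask]

# The first class (3) becomes 0, the other class (8) becomes 1
y_bin = np.where(y_bin == binary_classes[0], 0, 1)

print(X_bin.ravel(), y_bin)
```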

We need to downsize the images from 28x28 to 14x14 as done in the study.

In [14]:
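# Resize images
def downsample(img, factor=2):
    """Downsample a flattened square image by averaging factor x factor pixel blocks."""
    size = int(np.sqrt(img.shape[0]))
    small_size = size // factor
    img_reshaped = img.reshape(size, size)
    # Split the image into blocks and average over each block's rows and columns
    small_img = img_reshaped.reshape(small_size, factor, small_size, factor).mean(axis=(1, 3))
    return small_img.flatten()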

X_resized = np.apply_along_axis(downsample, 1, X)
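
To see what the block averaging does, the same reshaping trick can be applied to a tiny 4x4 image. This is a standalone sketch that folds the two reshape steps into one; the input values are made up.

```python
import numpy as np

def downsample(img, factor=2):
    size = int(np.sqrt(img.shape[0]))             # side length of the square image
    small_size = size // factor
    blocks = img.reshape(small_size, factor, small_size, factor)
    return blocks.mean(axis=(1, 3)).flatten()     # average each factor x factor block

# Flattened 4x4 "image" with pixel values 0..15
tiny = np.arange(16, dtype=float)
small = downsample(tiny)

print(small.reshape(2, 2))
```

Each output pixel is the mean of one 2x2 block, so the top-left value is the mean of pixels 0, 1, 4, and 5.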

Using sklearn's "train_test_split" function, we hold out 20% of the data for testing and fix the random state so that the results can be reproduced.

In [15]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_resized, y, test_size=0.2, random_state=42)

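Train the logistic regression model with scikit-learn's default settings. As the warning in the output shows, the solver reaches its iteration limit before converging; increasing the number of iterations (max_iter) or scaling the data would address this.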

In [16]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Generate predictions on the test set and compute the accuracy and AUROC.

In [17]:
y_pred = model.predict(X_test)
print("Plaintext Accuracy:", accuracy_score(y_test, y_pred))
print("Plaintext AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Plaintext Accuracy: 0.9731567644953472
Plaintext AUROC: 0.9921533767240425


#### Logistic regression on HE

On my machine, the OpenFHE .so library needs to be in the same folder as the script for OpenFHE to work. The script can be run with ```python logistic_regr_HE.py``` or ```python3 logistic_regr_HE.py```, depending on the system alias.

We begin with the same imports as the vanilla application, plus the OpenFHE library and the time module, and apply the same preprocessing as in the vanilla application. 

In [18]:
from openfhe import *
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
import time

mnist = fetch_openml("mnist_784", version=1)
X, y = mnist.data, mnist.target.astype(int)

# Filter for binary classification
binary_classes = [3, 8]
X, y = X[np.isin(y, binary_classes)], y[np.isin(y, binary_classes)]

# Convert labels to 0/1
y = np.where(y == binary_classes[0], 0, 1)

# Resize images
def downsample(img, factor=2):
    size = int(np.sqrt(img.shape[0]))
    small_size = size // factor
    img_reshaped = img.reshape(size, size)
    small_img = img_reshaped.reshape(small_size, factor, small_size, factor).mean(axis=(1, 3))
    return small_img.flatten()

X_resized = np.apply_along_axis(downsample, 1, X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_resized, y, test_size=0.2, random_state=42)

ImportError: /home/harri/anaconda3/envs/openfhe_python/lib/python3.9/site-packages/zmq/backend/cython/../../../../.././libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/harri/Logistic-Regression-On-Homomorphic-Encryption/openfhe.cpython-39-x86_64-linux-gnu.so)

Set up the HE parameters according to the study.

In [None]:
params = CCParamsCKKSRNS()  # CKKS scheme for approximate HE
params.SetMultiplicativeDepth(15)  # Depth for gradient descent iterations
params.SetScalingModSize(40)       # Scaling modulus size for precision
context = GenCryptoContext(params)

Enable the OpenFHE features required for encryption, key switching, and homomorphic operations.

In [None]:
context.Enable(PKESchemeFeature.PKE)
context.Enable(PKESchemeFeature.KEYSWITCH)
context.Enable(PKESchemeFeature.LEVELEDSHE)
context.Enable(PKESchemeFeature.ADVANCEDSHE)

Generate the key pair and the evaluation key used for homomorphic multiplication.

In [None]:
keypair = context.KeyGen()
context.EvalMultKeyGen(keypair.secretKey)

Now we can encrypt the training data. The study used 11982 samples, but 180 is the most my computer can handle.

In [None]:
# Encrypt the training data and labels
subset_size = 180  # Smaller dataset for demonstration
X_train_small = X_train[:subset_size]
y_train_small = y_train[:subset_size]

# Encrypt training data
ptxt_list = [context.MakeCKKSPackedPlaintext(row.tolist()) for row in X_train_small]
X_train_encrypted = [context.Encrypt(keypair.publicKey, ptxt) for ptxt in ptxt_list]

# Encrypt labels (one ciphertext per label, cast to float for CKKS packing)
y_train_encrypted = [context.Encrypt(keypair.publicKey, context.MakeCKKSPackedPlaintext([float(label)]))
                     for label in y_train_small]

Initialize the weights to zero and set the learning rate to 1.0.

In [None]:
# Initialize weights
initial_weights = context.Encrypt(keypair.publicKey, context.MakeCKKSPackedPlaintext([0.0] * X_train.shape[1]))

# Learning rate
learning_rate = context.MakeCKKSPackedPlaintext([1.0])

Polynomial sigmoid mimicking the one used in the study.

In [None]:
# Polynomial approximation of sigmoid function
def polynomial_sigmoid(x, context):
    """Evaluate 0.5 - 0.0843*x + 0.0002*x^3 on the ciphertext x."""
    # Encode the polynomial coefficients as CKKS plaintexts
    const_0_5 = context.MakeCKKSPackedPlaintext([0.5])
    const_neg_0_0843 = context.MakeCKKSPackedPlaintext([-0.0843])
    const_0_0002 = context.MakeCKKSPackedPlaintext([0.0002])

    # Encrypt the coefficients (uses the global keypair)
    const_0_5_cipher = context.Encrypt(keypair.publicKey, const_0_5)
    const_neg_0_0843_cipher = context.Encrypt(keypair.publicKey, const_neg_0_0843)
    const_0_0002_cipher = context.Encrypt(keypair.publicKey, const_0_0002)

    # x^2 and x^3, reducing the modulus after each multiplication
    x_squared = context.EvalMult(x, x)
    x_squared = context.ModReduce(x_squared)

    x_cubed = context.EvalMult(x_squared, x)
    x_cubed = context.ModReduce(x_cubed)

    # Linear term: -0.0843 * x
    term1 = context.EvalMult(x, const_neg_0_0843_cipher)
    term1 = context.ModReduce(term1)

    # Cubic term: 0.0002 * x^3
    term2 = context.EvalMult(x_cubed, const_0_0002_cipher)
    term2 = context.ModReduce(term2)

    # Sum of the constant, linear, and cubic terms
    result = context.EvalAdd(term1, const_0_5_cipher)
    result = context.EvalAdd(result, term2)

    return result

Gradient descent function mimicking the one used in the study.

In [None]:
def gradient_descent_step(X_encrypted, y_encrypted, weights, learning_rate, context):
    # Multiply each encrypted row with the encrypted weights.
    # EvalMult works slot by slot, so each result holds the per-feature products.
    predictions = [context.EvalMult(X_row, weights) for X_row in X_encrypted]
    # Apply the polynomial sigmoid approximation to each result
    predictions = [polynomial_sigmoid(p, context) for p in predictions]

    # Compute error: predictions - y
    errors = [context.EvalSub(pred, label) for pred, label in zip(predictions, y_encrypted)]

    # Compute per-row gradient: slot-wise product of errors with features
    gradients = [context.EvalMult(err, X_row) for err, X_row in zip(errors, X_encrypted)]

    # Sum gradients across all rows to get the overall gradient
    total_gradient = gradients[0]
    for grad in gradients[1:]:
        total_gradient = context.EvalAdd(total_gradient, grad)

    # Update weights: w = w - alpha * gradient
    updated_weights = context.EvalSub(weights, context.EvalMult(learning_rate, total_gradient))

    return updated_weights
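
A point worth keeping in mind when reading this step: CKKS multiplication acts slot by slot on packed vectors, so multiplying a feature ciphertext by a weight ciphertext yields a vector of per-feature products rather than a single dot product. The plaintext NumPy sketch below mimics this with ordinary arrays. No encryption is involved, and the rotate-and-add loop is an illustrative assumption about how slots are commonly summed, not code from the study.

```python
import numpy as np

features = np.array([1.0, 2.0, 3.0, 4.0])    # one packed row (hypothetical values)
weights = np.array([0.5, -1.0, 0.25, 2.0])   # packed weights (hypothetical values)

# Analogue of a slot-wise ciphertext multiplication
slotwise = features * weights

# Sum the slots with log2(n) rotate-and-add steps (n must be a power of two)
total = slotwise.copy()
shift = 1
while shift < len(total):
    total = total + np.roll(total, -shift)    # np.roll stands in for a ciphertext rotation
    shift *= 2

print(slotwise, total)                        # every slot of `total` holds the dot product
```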

The study used 32 iterations. I can only run 2 because I was unable to get bootstrapping working. If the code tries to run 3 or more iterations, it stops working because of the noise that accumulates during the homomorphic operations.

In [None]:
# Perform gradient descent
num_iterations = 2
weights = initial_weights

for i in range(num_iterations):
    print(f"Starting iteration {i+1}...")
    start = time.perf_counter()
    weights = gradient_descent_step(X_train_encrypted, y_train_encrypted, weights, learning_rate, context)
    end = time.perf_counter()
    print(f"Iteration {i+1} completed. ({end - start:.2f}s)")
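
A rough way to see why the number of iterations is limited without bootstrapping is to count multiplication levels. The sketch below assumes that every multiplication on the longest chain of one `gradient_descent_step` call consumes one level, and it uses the multiplicative depth of 15 set earlier. The per-step counts are my own reading of the code above, not figures from the study or from OpenFHE.

```python
# Levels consumed on the longest multiplication chain of one iteration
# (assumption: each multiplication costs exactly one level)
levels_per_step = {
    "features * weights": 1,
    "x^2 and x^3 in the polynomial sigmoid": 2,
    "cubic coefficient * x^3": 1,
    "error * features": 1,
    "learning rate * gradient": 1,
}

multiplicative_depth = 15                      # value passed to SetMultiplicativeDepth
levels_per_iteration = sum(levels_per_step.values())
max_iterations = multiplicative_depth // levels_per_iteration

print(f"{levels_per_iteration} levels per iteration -> at most {max_iterations} iterations")
```

Under these assumptions the level budget fits two iterations, which is consistent with the limit reported above. Bootstrapping refreshes the available levels, which is how longer training runs become possible.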

Next, we decrypt the trained weights, encrypt a subset of the test data, compute predictions on the encrypted test data, and decrypt those predictions.

In [None]:
# Decrypt and evaluate the model
decrypted_weights = context.Decrypt(keypair.secretKey, weights).GetCKKSPackedValue()

# Encrypt test data
X_test_encrypted = [context.Encrypt(keypair.publicKey, context.MakeCKKSPackedPlaintext(row.tolist()))
                    for row in X_test[:subset_size]]

# Compute predictions on encrypted test data
test_predictions = [context.EvalMult(X_row, weights) for X_row in X_test_encrypted]
test_predictions = [polynomial_sigmoid(p, context) for p in test_predictions]

# Decrypt predictions
decrypted_predictions = [context.Decrypt(keypair.secretKey, pred).GetCKKSPackedValue()[0]
                         for pred in test_predictions]

Finally, we binarize the predictions and get the accuracy score.

In [None]:
# Binarize predictions
y_test_pred = np.array(decrypted_predictions) > 0.5

# Evaluate accuracy
print("Encrypted Test Accuracy:", accuracy_score(y_test[:subset_size], y_test_pred))