# Data Preparation for Logistic Regression HE Inference

This is sample code that generates datasets for the logistic regression (LR) homomorphic encryption (HE) inference example. It has been tested with ```python3 >= 3.5```.

### Notes
1. All data points must lie in the ```[-1, 1]``` range, because the sigmoid function is approximated by a degree-3 polynomial.
2. To use your own dataset, provide it as ```(name)_eval.csv``` and ```(name)_lrmodel.csv```. The csv templates are described below.
3. After running this script, make sure to copy the csv files to ```${HE_SAMPLES}/build/examples/logistic-regression/datasets```.
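
As a rough illustration of note 1, the sketch below rescales each feature of a custom dataset into ```[-1, 1]``` with a per-column min-max transform. The helper name `scale_to_unit_range` and the toy array are hypothetical and not part of this sample; this is only one way to meet the range requirement.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly map each feature (column) of X into [-1, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    span[span == 0] = 1.0  # constant columns: avoid division by zero (they map to -1)
    return 2.0 * (X - col_min) / span - 1.0

# Toy data with two features on very different scales (illustrative only).
raw = np.array([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
scaled = scale_to_unit_range(raw)
print(scaled.min(), scaled.max())
```

If the eval data is scaled separately from the training data, reusing the training set's per-column minimum and span keeps the two consistent.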

In [1]:
import numpy as np
import sklearn.datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import lr_base as lrb
import generate_data

## Parameter Configuration
The dataset name, number of samples, and number of features can be set here. The script generates ```n_samples``` data points and splits them into train, test, and eval sets at a ```2:1:1``` ratio.


In [2]:
dataname = "lrtest"
n_samples = 10000  # 2500 samples for eval set
n_features = 150

## Data Generation
The generated data is first scaled to the ```[-1, 1]``` range, then split into disjoint train, test, and eval sets.

In [3]:
X_train, y_train, X_test, y_test, X_eval, y_eval = generate_data.generateSynData(
    n_samples, n_features)
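
The internals of `generate_data.generateSynData` are not shown here. As a minimal sketch of the split described above, assuming a simple shuffle followed by a ```2:1:1``` cut, the hypothetical helper `split_2_1_1` below returns the six arrays in the same order. The uniform random features and random labels are placeholders, not the synthetic data the sample produces.

```python
import numpy as np

def split_2_1_1(X, y, seed=0):
    """Shuffle, then split into disjoint train/test/eval sets at a 2:1:1 ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = len(X) // 2
    n_test = len(X) // 4
    # np.split cuts the shuffled indices at the two boundaries,
    # so no sample can land in more than one set.
    train, test, ev = np.split(idx, [n_train, n_train + n_test])
    return X[train], y[train], X[test], y[test], X[ev], y[ev]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10000, 150))  # placeholder features in [-1, 1]
y = rng.integers(0, 2, size=10000)             # placeholder binary labels
X_train, y_train, X_test, y_test, X_eval, y_eval = split_2_1_1(X, y)
print(X_train.shape, X_test.shape, X_eval.shape)
# (5000, 150) (2500, 150) (2500, 150)
```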

## Logistic Regression Training
We provide a simple logistic regression training script based on [Homomorphic training of 30,000 logistic regression models](https://eprint.iacr.org/2019/425) by Bergamaschi et al.


In [4]:
bias, weights = generate_data.doTrain(X_train, y_train, X_test, y_test, verbose=True)
biasweights = np.concatenate(([bias], weights), axis=0)

# evaluate model
_, acc, _, f1 = lrb.test_poly3(X_eval, y_eval, biasweights)
print("Accuracy:", acc, "  f1 score:", f1)

== Logistic Regression Training ==
Epoch: 0, - loss: 0.7590270072516531 - acc: 0.802
Epoch: 1, - loss: 0.7590270072516531 - acc: 0.802
Epoch: 2, - loss: 0.7262199385987 - acc: 0.8332
Epoch: 3, - loss: 0.6971525441158094 - acc: 0.8544
Epoch: 4, - loss: 0.6703830849331164 - acc: 0.8628
Epoch: 5, - loss: 0.6454043833825989 - acc: 0.8712
Epoch: 6, - loss: 0.6219822062877712 - acc: 0.8728
Epoch: 7, - loss: 0.5999795015244065 - acc: 0.8744
Epoch: 8, - loss: 0.5792977563454041 - acc: 0.8772
Epoch: 9, - loss: 0.5598552240090058 - acc: 0.8784
Accuracy: 0.882   f1 score: 0.8893058161350844
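
For readers without access to `lr_base`, the following sketch shows what an evaluation with a degree-3 polynomial sigmoid could look like. It is not the implementation of `lrb.test_poly3`. The function names are hypothetical, and the coefficients are an assumption: a commonly cited least-squares fit of the sigmoid over `[-8, 8]`, which may differ from the values the sample uses.

```python
import numpy as np

def sigmoid_poly3(z):
    # Assumed coefficients (least-squares fit over [-8, 8]);
    # lr_base may use different values.
    return 0.5 + 0.15012 * z - 0.001593 * z**3

def predict_poly3(X, biasweights):
    """biasweights = [bias, w[0], ..., w[n_features - 1]]."""
    z = biasweights[0] + X @ biasweights[1:]
    return (sigmoid_poly3(z) >= 0.5).astype(int)

def accuracy_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = float(np.mean(y_pred == y_true))
    denom = 2 * tp + fp + fn
    f1 = float(2 * tp / denom) if denom else 0.0
    return acc, f1

# Toy inputs in [-1, 1] with made-up weights (illustrative only).
X_toy = np.array([[0.5, 0.1], [-0.8, 0.2], [0.9, 0.9], [-0.1, -0.9]])
y_toy = np.array([1, 0, 1, 1])
bw = np.array([0.0, 1.0, 1.0])
acc_toy, f1_toy = accuracy_f1(y_toy, predict_poly3(X_toy, bw))
print("Accuracy:", acc_toy, "  f1 score:", f1_toy)
```

A cubic like this tracks the sigmoid only for moderate values of `z`, which is one reason the inputs need to be bounded.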


## Save to csv
Save the eval set and the trained logistic regression model to csv files. The first row of the eval data file contains the column names. In this code they are ```feature_0, feature_1, ... target```, so the final column holds the binary target, ```0``` or ```1```.

To use your own weights, store them in a single-row csv file that follows the ```bias, w[0], w[1], ... , w[n_features - 1]``` template.

In [5]:
generate_data.saveModel(dataname, bias, weights)
generate_data.saveData(dataname, X_eval, y_eval)
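
When preparing your own files, the two csv layouts described above can be written with the standard `csv` module. This is a sketch under assumptions, not the code behind `generate_data.saveModel` or `generate_data.saveData`: the helper names and the `mydata` prefix are hypothetical, and the model template is read here as a single row of numeric values with no header.

```python
import csv
import os
import tempfile
import numpy as np

def save_eval_csv(path, X, y):
    """Header row feature_0, ..., target; then one sample per row."""
    header = ["feature_%d" % i for i in range(X.shape[1])] + ["target"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row, label in zip(X, y):
            writer.writerow(row.tolist() + [int(label)])

def save_model_csv(path, bias, weights):
    """Single row: bias, w[0], ..., w[n_features - 1]."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow([float(bias)] + [float(w) for w in weights])

out_dir = tempfile.mkdtemp()  # temporary directory, for illustration only
eval_path = os.path.join(out_dir, "mydata_eval.csv")
model_path = os.path.join(out_dir, "mydata_lrmodel.csv")

X_own = np.array([[0.25, -0.5], [-1.0, 1.0]])  # already scaled to [-1, 1]
y_own = np.array([1, 0])
save_eval_csv(eval_path, X_own, y_own)
save_model_csv(model_path, 0.1, np.array([0.3, -0.7]))
```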