# Training and saving a real-world dataset
This sample code trains a logistic regression (LR) model, then stores the dataset and the model for the LR homomorphic encryption (HE) inference example on a real-world dataset. It has been tested with ```python3 >= 3.5```.

### Requirements
```
pandas >= 1.1.0
scikit-learn >= 0.22.0
```

### Notes
1. Users need to download the datasets from [Kaggle](https://www.kaggle.com/datasets) or load them from ```sklearn.datasets```.
2. This example focuses on logistic regression training and omits advanced feature engineering, so the resulting LR model may not be of high quality.
3. After running this script, copy the generated csv files to ```${HE_SAMPLES}/build/examples/logistic-regression/datasets```.
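
Note 3 can be scripted. The following is a minimal sketch, not part of the sample: the helper name ```copy_csv_files``` is hypothetical, and it assumes the csv files were written to a single source directory and that the second argument is the root directory that ```${HE_SAMPLES}``` refers to.

```python
import shutil
from pathlib import Path


def copy_csv_files(src_dir, he_samples_root):
    """Copy every *.csv file in src_dir to the LR example's datasets directory.

    Hypothetical helper: the directory layout below mirrors the path in note 3.
    """
    dest = (Path(he_samples_root) / "build" / "examples"
            / "logistic-regression" / "datasets")
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for csv_path in sorted(Path(src_dir).glob("*.csv")):
        shutil.copy2(csv_path, dest / csv_path.name)
        copied.append(csv_path.name)
    return copied
```

For example, ```copy_csv_files(".", os.environ["HE_SAMPLES"])``` would copy the csv files from the current directory, assuming that is where the script wrote them.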


In [1]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import pandas as pd
import lr_base as lrb
import generate_data

## Data preparation and preprocessing
Download the [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk) dataset. This example only needs the ```application_train.csv``` file for training, testing and evaluation. 

In [2]:
# update path/to/dataset to the actual path
data = pd.read_csv("path/to/dataset/application_train.csv")

### One-hot encoding categorical features
Label-encode the binary categorical features, then convert the remaining (non-binary) categorical features with one-hot encoding.

In [3]:
le = LabelEncoder()
le_count = 0
le_list = []
for col in data:
  if data[col].dtype == 'object':
    if len(list(data[col].unique())) <= 2:
      le.fit(data[col])
      data[col] = le.transform(data[col])
      le_count += 1
      le_list.append(col)
data = pd.get_dummies(data)
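
To see what this cell does to the columns, here is a small sketch on a made-up three-row frame. The column names and values are invented for illustration, and ```astype("category").cat.codes``` stands in for ```LabelEncoder``` so that the sketch only needs pandas; both assign integer codes in sorted label order.

```python
import pandas as pd

# Toy data: FLAG is a binary categorical, TYPE has three categories.
toy = pd.DataFrame({
    "FLAG": ["Y", "N", "Y"],
    "TYPE": ["a", "b", "c"],
    "AMOUNT": [1.0, 2.0, 3.0],
})

for col in toy.columns:
    is_categorical = not pd.api.types.is_numeric_dtype(toy[col])
    if is_categorical and len(toy[col].unique()) <= 2:
        # Binary categorical: a single integer column is enough.
        toy[col] = toy[col].astype("category").cat.codes

# Remaining categorical columns become one indicator column per category.
toy = pd.get_dummies(toy)
print(toy.columns.tolist())
```

The binary column stays a single column, while the three-category column is expanded into three indicator columns.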

In [4]:
# Extract samples and target
samples = data.loc[:, data.columns != 'TARGET']
target = data.TARGET

feature_columns = data.columns.tolist()
feature_columns.remove('TARGET')

### Imputation transform
Fill the missing numeric data points with the column median, then fill any remaining categorical columns with ```sklearn.impute.SimpleImputer```.

In [5]:
samples = samples.fillna(samples.median())
list_categorical = []
for col in samples:
  if samples[col].dtype == 'object' and samples[col].isnull().values.any():
    list_categorical.append(col)

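# After pd.get_dummies no object columns normally remain, so this loop is a
# safeguard; most_frequent is the strategy that works on non-numeric data
imputer = SimpleImputer(strategy="most_frequent")
for col in list_categorical:
  samples[col] = imputer.fit_transform(samples[[col]]).ravel()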

X_train, X_test, y_train, y_test = train_test_split(
    samples, target, test_size=0.30, random_state=10)

### Scaling data points
The 3-degree polynomial representation of sigmoid function operates within a limited range, so the sample needs to be scaled to ```[-0.2, 0.2]```.
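
The effect of the limited range can be illustrated with a short sketch. The polynomial coefficients used by ```lr_base``` are not shown in this document, so the sketch assumes the degree-3 Taylor expansion of the sigmoid around 0 purely for illustration; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Exact logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_poly3(x):
    """Degree-3 Taylor expansion of sigmoid around 0 (illustrative assumption)."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0


# The approximation error grows quickly as |x| moves away from 0.
for x in (0.2, 1.0, 4.0):
    print(x, abs(sigmoid(x) - sigmoid_poly3(x)))
```

Keeping every feature in a small interval keeps the input of the polynomial small, which is where a low-degree approximation is most accurate.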


In [6]:
X_train, X_test = X_train.to_numpy(), X_test.to_numpy()
y_train, y_test = y_train.to_numpy(), y_test.to_numpy()

scaler = MinMaxScaler(feature_range=(-.2, .2))
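X_train = scaler.fit_transform(X_train)
# Reuse the scaling fitted on the training set; test values outside the
# training range can fall outside [-0.2, 0.2]
X_test = scaler.transform(X_test)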
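
The min-max formula behind this step can be written out directly. This is a sketch with made-up numbers and hypothetical helper names, not the code used by the sample: each column is mapped linearly so that its training minimum lands on -0.2 and its training maximum on 0.2.

```python
import numpy as np


def fit_min_max(X):
    """Return the per-column minimum and maximum of X."""
    return X.min(axis=0), X.max(axis=0)


def scale(X, x_min, x_max, lo=-0.2, hi=0.2):
    """Map each column linearly so that x_min -> lo and x_max -> hi."""
    # Use a span of 1 for constant columns to avoid dividing by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return lo + (X - x_min) / span * (hi - lo)


X_fit = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_new = np.array([[2.5, 40.0]])

x_min, x_max = fit_min_max(X_fit)
print(scale(X_fit, x_min, x_max))
print(scale(X_new, x_min, x_max))
```

Applying the same minimum and maximum to new rows keeps both sets on one scale, at the cost that a value beyond the fitted range (40.0 in the second column here) lands outside ```[-0.2, 0.2]```.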

### Balance the training set by target
The given dataset is highly imbalanced towards ```target = 0```, which can negatively affect the logistic regression training. Thus we move surplus samples of the majority class from the training set to the test set, so that the target distribution in the training set is close to 50:50.

In [7]:
target_imba_ratio = 0.99  # fraction of the class surplus to move out of the training set

n_y_train_1 = sum(y_train == 1)
n_y_train_0 = sum(y_train == 0)
n_remove = int((n_y_train_1 - n_y_train_0) * target_imba_ratio)

if n_remove > 0:
  to_remove = 1
else:
  to_remove = 0
  n_remove *= -1

print("Reallocate", n_remove, "of target", to_remove,
      "from training set to testing set for balanced training")

ind1 = np.where(y_train == to_remove)[0]
ind_remove = np.random.choice(ind1, size=n_remove, replace=False)
X_test = np.concatenate([X_test, X_train[ind_remove]], axis=0)
y_test = np.concatenate([y_test, y_train[ind_remove]], axis=0)

X_train = np.delete(X_train, ind_remove, axis=0)
y_train = np.delete(y_train, ind_remove)

n_y_train_1 = sum(y_train)
n_y_train_0 = y_train.size - n_y_train_1
print("After reallocation, %s 0s and %s 1s in training set" %
      (n_y_train_0, n_y_train_1))

Reallocate 178796 of target 0 from training set to testing set for balanced training
After reallocation, 19134 0s and 17327 1s in training set
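
The numbers in this output can be checked by hand. The class counts before reallocation are reconstructed from the two printed lines above (17327 ones, and 19134 + 178796 zeros); the arithmetic then follows the cell line by line.

```python
# Class counts before reallocation, reconstructed from the printed output.
n_y_train_1 = 17327
n_y_train_0 = 19134 + 178796  # 197930 zeros
target_imba_ratio = 0.99

# The surplus is negative when zeros dominate; int() truncates toward zero.
n_remove = int((n_y_train_1 - n_y_train_0) * target_imba_ratio)
to_remove = 1 if n_remove > 0 else 0
n_remove = abs(n_remove)

print(to_remove, n_remove)                  # 0 178796
print(n_y_train_0 - n_remove, n_y_train_1)  # 19134 17327
```

Because only 99% of the surplus is moved, the two classes end up close to, but not exactly, 50:50.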


In [8]:
# Split the test set into test and eval sets
X_test, X_eval, y_test, y_eval = train_test_split(
    X_test, y_test, test_size=0.4, random_state=10)

## Logistic Regression Training
We provide a simple logistic regression training script based on [Homomorphic training of 30,000 logistic regression models](https://eprint.iacr.org/2019/425) by Bergamaschi et al.


In [9]:
# Set verbose=True to view training progress
bias, weights = generate_data.doTrain(
    X_train, y_train, X_test, y_test, epochs=40, verbose=True)

== Logistic Regression Training ==
Epoch: 0, - loss: 0.7799580361800673 - acc: 0.972268339174814
Epoch: 4, - loss: 2.4352559230475754 - acc: 0.027731660825186005
Epoch: 8, - loss: 0.9771821446783528 - acc: 0.027731660825186005
Epoch: 12, - loss: 0.7355180931336108 - acc: 0.7264895775687142
Epoch: 16, - loss: 0.7278504937344341 - acc: 0.7074709463198672
Epoch: 20, - loss: 0.7217525284561388 - acc: 0.6959048146098505
Epoch: 24, - loss: 0.7167104285538693 - acc: 0.6899096107729201
Epoch: 28, - loss: 0.7123961826994866 - acc: 0.6869089343909488
Epoch: 32, - loss: 0.7085991120947206 - acc: 0.6854885322511222
Epoch: 36, - loss: 0.7051827498589683 - acc: 0.6852548730246572


In [10]:
# Evaluate model
_, acc, _, f1 = lrb.test_poly3(X_eval, y_eval,
                               np.concatenate(([bias], weights), axis=0))
print("Accuracy:", acc, "  f1 score:", f1)

Accuracy: 0.6838590665928795   f1 score: 0.09293955753149148


## Save to csv
Save the eval set and the trained logistic regression model to csv files. The first row of the (eval) data file contains the feature names.

To use your own weights, store them in a single-row csv file that follows the template ```bias, w[0], w[1], ... , w[n_features - 1]```.
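
A single-row weights file in this layout can be produced with the standard ```csv``` module. This is a sketch under assumptions: the helper name ```save_own_weights``` and the file name are hypothetical, and the exact number formatting and file name expected by the inference example are not specified here.

```python
import csv


def save_own_weights(path, bias, weights):
    """Write bias followed by w[0] .. w[n_features - 1] as one csv row."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow([bias, *weights])


# Hypothetical file name and made-up values, for illustration only.
save_own_weights("my_lr_model.csv", 0.1, [0.5, -0.25, 0.0])
```

The row has ```n_features + 1``` entries, with the bias first.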

In [11]:
generate_data.saveModel("kaggle_hcdr", bias, weights)
print("Eval set size:", X_eval.shape)
generate_data.saveData("kaggle_hcdr", X_eval, y_eval)

Eval set size: (108420, 242)
