First, let us load the training and testing data from the adult dataset.

In [1]:
import numpy as np

X_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=", ")

X_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=", ", skiprows=1)
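# Labels in adult.test end with a trailing period (e.g. ">50K."); strip it to match the training labels
y_test = np.array([a[:-1] for a in y_test])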

diffprivlib's LogisticRegression works best when the features are scaled, which keeps the norm of the data under control. To streamline this, we create a sklearn Pipeline that scales the features before fitting the classifier.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

lr = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', LogisticRegression(solver="lbfgs"))
])
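
To make the effect of scaling on the data norm concrete, here is a minimal pure-Python sketch of min-max scaling. It is not sklearn's implementation; the helper names (fit_min_max, apply_min_max, l2_norm) and the toy rows are assumptions made for illustration only.

```python
import math

def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, col_ranges):
    """Scale each column to [0, 1] using stored ranges; constant columns map to 0."""
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, (lo, hi) in zip(row, col_ranges)]
        for row in rows
    ]

def l2_norm(row):
    """Euclidean norm of one sample."""
    return math.sqrt(sum(v * v for v in row))

# Illustrative toy rows, not taken from the dataset:
# age, education-num, capital-gain, capital-loss, hours-per-week
train_rows = [
    [30, 10, 0, 0, 40],
    [45, 13, 5000, 0, 50],
    [60, 9, 0, 1500, 20],
]

col_ranges = fit_min_max(train_rows)
scaled_rows = apply_min_max(train_rows, col_ranges)

max_raw_norm = max(l2_norm(r) for r in train_rows)
max_scaled_norm = max(l2_norm(r) for r in scaled_rows)
print("largest raw norm: %.1f, largest scaled norm: %.3f" % (max_raw_norm, max_scaled_norm))
```

Before scaling, the norm is dominated by the large capital-gain column. After scaling, every feature lies in [0, 1], so no sample can have a norm above the square root of the number of features.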

## Logistic Regression with No Privacy
To begin, let's train a regular (non-private) logistic regression classifier and test its accuracy.

In [3]:
lr.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

print("Non-private test accuracy: %.2f%%" % (accuracy_score(y_test, lr.predict(X_test)) * 100))

Non-private test accuracy: 81.04%


## Differentially Private Logistic Regression
First, install the IBM Differential Privacy Library (diffprivlib).

In [4]:
!pip install diffprivlib

Collecting diffprivlib
  Downloading https://files.pythonhosted.org/packages/fe/b8/852409057d6acc060f06cac8d0a45b73dfa54ee4fbd1577c9a7d755e9fb6/diffprivlib-0.3.0.tar.gz (70kB)
Building wheels for collected packages: diffprivlib
  Building wheel for diffprivlib (setup.py) ... done
  Created wheel for diffprivlib: filename=diffprivlib-0.3.0-cp36-none-any.whl size=138999 sha256=e8c0a87fdd7df12b5632f3fe4e3828b99928f2171b5758074ccd43ae06c4d57d
  Stored in directory: /root/.cache/pip/wheels/64/68/62/617183f73d3fece

Using diffprivlib.models.LogisticRegression, we can train a logistic regression classifier that satisfies differential privacy. If we don't specify any parameters, the model defaults to epsilon = 1 and data_norm = None. When the norm of the data is not specified at initialization (as in this case), it is calculated from the data the first time .fit() is called, and a warning is issued because this leaks privacy. To avoid any additional privacy leakage, we should pass the data norm explicitly as an argument and choose the bound independently of the data (i.e. using domain knowledge).
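
As a rough illustration of choosing a data norm from domain knowledge, the sketch below scales each feature with fixed bounds and derives a worst-case norm from the number of features. The bounds in FEATURE_BOUNDS and both helper functions are hypothetical, chosen for this example and not part of diffprivlib or the original notebook.

```python
import math

# Hypothetical bounds picked from domain knowledge, not computed from the data
FEATURE_BOUNDS = [
    (0, 120),      # age in years
    (1, 16),       # education-num
    (0, 100000),   # capital-gain
    (0, 5000),     # capital-loss
    (0, 168),      # hours-per-week (there are 168 hours in a week)
]

def scale_with_fixed_bounds(row, bounds):
    """Clip each value to its bound, then map it linearly to [0, 1]."""
    scaled = []
    for value, (lo, hi) in zip(row, bounds):
        clipped = min(max(value, lo), hi)
        scaled.append((clipped - lo) / (hi - lo))
    return scaled

def worst_case_l2_norm(n_features):
    """Largest possible L2 norm of a vector whose entries all lie in [0, 1]."""
    return math.sqrt(n_features)

data_norm = worst_case_l2_norm(len(FEATURE_BOUNDS))

# An out-of-range age (200) is clipped to the upper bound before scaling
example = scale_with_fixed_bounds([200, 10, 2500, 0, 40], FEATURE_BOUNDS)
example_norm = math.sqrt(sum(v * v for v in example))
print("data_norm bound: %.3f, example norm: %.3f" % (data_norm, example_norm))
```

Under these assumptions, the bound could be passed as dp.LogisticRegression(data_norm=data_norm), with the fixed-bounds scaling standing in for MinMaxScaler, which learns its range from the data. The bound is loose but does not depend on any individual record.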

In [5]:
import diffprivlib.models as dp
dp_lr = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', dp.LogisticRegression())
])

dp_lr.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
     (dp_lr['clf'].epsilon, accuracy_score(y_test, dp_lr.predict(X_test)) * 100))

Differentially private test accuracy (epsilon=1.00): 80.55%




As the outputs above show, the regular (non-private) logistic regression classifier reaches an accuracy of 81.04%, while the differentially private logistic regression classifier with epsilon=1.00 reaches 80.55%. A smaller epsilon usually gives stronger privacy protection at the cost of accuracy. For instance, if we set epsilon=0.01:

In [6]:
import diffprivlib.models as dp
dp_lr = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', dp.LogisticRegression(epsilon=0.01))
])

dp_lr.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
     (dp_lr['clf'].epsilon, accuracy_score(y_test, dp_lr.predict(X_test)) * 100))

Differentially private test accuracy (epsilon=0.01): 71.25%
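
To compare more than two values of epsilon, the two cells above can be generalized into a small loop. The following is a minimal sketch: sweep_epsilons, accuracy, and the MajorityClassifier stand-in are hypothetical helpers written for this example, and the stand-in is not differentially private. It only makes the sketch runnable without downloading the data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def sweep_epsilons(make_model, epsilons, X_train, y_train, X_test, y_test):
    """Fit one fresh model per epsilon and return {epsilon: test accuracy}."""
    results = {}
    for eps in epsilons:
        model = make_model(eps)  # hypothetical factory: epsilon -> unfitted model
        model.fit(X_train, y_train)
        results[eps] = accuracy(y_test, model.predict(X_test))
    return results

class MajorityClassifier:
    """Stand-in model for the demo; it ignores epsilon and is NOT differentially private."""

    def __init__(self, epsilon):
        self.epsilon = epsilon

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

# Tiny toy data so the sketch runs on its own
X_tr, y_tr = [[0], [1], [2]], ["<=50K", "<=50K", ">50K"]
X_te, y_te = [[3], [4]], ["<=50K", ">50K"]

demo = sweep_epsilons(MajorityClassifier, [0.01, 0.1, 1.0], X_tr, y_tr, X_te, y_te)
for eps, acc in demo.items():
    print("epsilon=%.2f: accuracy %.2f%%" % (eps, acc * 100))
```

In this notebook, the factory could instead be lambda eps: Pipeline([('scaler', MinMaxScaler()), ('clf', dp.LogisticRegression(epsilon=eps))]), called with X_train, y_train, X_test and y_test. Because differentially private training is randomized, the accuracy for a given epsilon varies from run to run, so averaging over several fits gives a steadier picture.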


