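First, let us load the training and testing data from the UCI Adult dataset. We use five numeric columns as features (age, education-num, capital-gain, capital-loss and hours-per-week) and the income bracket in the last column as the label.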

In [1]:
import numpy as np

X_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=", ")

X_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=", ", skiprows=1)
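# Labels in adult.test carry a trailing period (e.g. ">50K."); strip it so they match the training labels
y_test = np.array([a[:-1] for a in y_test])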

## Naive Bayes with No Privacy
To begin, let us train a regular (non-private) naive Bayes classifier and test its accuracy.

In [2]:
from sklearn.naive_bayes import GaussianNB
nonprivate_clf = GaussianNB()
nonprivate_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

print("Non-private test accuracy: %.2f%%" % 
     (accuracy_score(y_test, nonprivate_clf.predict(X_test)) * 100))

Non-private test accuracy: 79.64%


## Differentially Private Naive Bayes Classification
First, install the IBM Differential Privacy Library (diffprivlib).

In [3]:
!pip install diffprivlib

Collecting diffprivlib
  Downloading https://files.pythonhosted.org/packages/fe/b8/852409057d6acc060f06cac8d0a45b73dfa54ee4fbd1577c9a7d755e9fb6/diffprivlib-0.3.0.tar.gz (70kB)
Building wheels for collected packages: diffprivlib
  Building wheel for diffprivlib (setup.py) ... done
  Created wheel for diffprivlib: filename=diffprivlib-0.3.0-cp36-none-any.whl size=138999 sha256=22185221619386dc839e3b2d7b63bf732ac77bfa51155d771a4c695e64ed4080
  Stored in directory: /root/.cache/pip/wheels/64/68/62/617183f73d3

Using the GaussianNB class from diffprivlib.models, we can train a naive Bayes classifier that satisfies differential privacy. If we don't specify any parameters, the model defaults to epsilon = 1.00.

In [4]:
import diffprivlib.models as dp
dp_clf = dp.GaussianNB()

dp_clf.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
      (dp_clf.epsilon, accuracy_score(y_test, dp_clf.predict(X_test)) * 100))

Differentially private test accuracy (epsilon=1.00): 79.98%
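
For intuition about what epsilon controls, here is a minimal sketch of the Laplace mechanism, a standard building block of differential privacy, applied to the mean of a bounded column. It is an illustration only and not diffprivlib's implementation: the names `laplace_noise` and `private_mean`, the bounds 17 and 90, and the sample ages are all made up for this example, and the sketch assumes the number of records is public.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    # Clip each value to the assumed bounds so one record has limited influence.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    # Noise scale grows as epsilon shrinks: smaller epsilon, more noise.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]  # made-up values for illustration

for epsilon in (1.0, 0.01):
    print("epsilon=%.2f -> noisy mean: %.2f" % (epsilon, private_mean(ages, 17, 90, epsilon, rng)))
```

The noise scale is the sensitivity divided by epsilon, so moving from epsilon=1.00 to epsilon=0.01 makes the expected noise one hundred times larger. This is the general reason a smaller epsilon tends to cost accuracy.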




As the outputs above show, the regular (non-private) naive Bayes classifier reaches an accuracy of 79.64%, while the differentially private naive Bayes classifier with epsilon=1.00 reaches 79.98%. Because the private model adds random noise during training, its accuracy varies from run to run and can land slightly above or below the non-private figure. A smaller epsilon generally gives stronger privacy protection at the cost of accuracy. For instance, if we set epsilon=0.01:

In [5]:
import diffprivlib.models as dp
dp_clf = dp.GaussianNB(epsilon=0.01)
dp_clf.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
      (dp_clf.epsilon, accuracy_score(y_test, dp_clf.predict(X_test)) * 100))

Differentially private test accuracy (epsilon=0.01): 76.91%
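
Since the private classifier is randomized, a single accuracy figure is only one sample. The sketch below averages test accuracy over several independent fits to give a steadier comparison between epsilon values. The helpers `accuracy` and `mean_accuracy` are hypothetical names introduced here, not part of diffprivlib or scikit-learn; they only assume a classifier with `fit` and `predict` methods.

```python
import statistics

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def mean_accuracy(make_clf, X_train, y_train, X_test, y_test, runs=10):
    # Fit a fresh classifier `runs` times and summarize its test accuracy.
    scores = []
    for _ in range(runs):
        clf = make_clf()  # new model each time, so each fit draws fresh noise
        clf.fit(X_train, y_train)
        scores.append(accuracy(y_test, clf.predict(X_test)))
    # Mean and (population) standard deviation across the runs.
    return statistics.mean(scores), statistics.pstdev(scores)
```

With the variables from the cells above, a call such as `mean_accuracy(lambda: dp.GaussianNB(epsilon=0.01), X_train, y_train, X_test, y_test)` would return the average accuracy and its spread over ten fits. Note that every additional fit on the same training data consumes additional privacy budget, so repeated runs like this suit experimentation rather than a private release.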


