Consider training a linear regression model on the diabetes dataset (more information: https://archive.ics.uci.edu/ml/datasets/diabetes).


We will follow the example given by scikit-learn (https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html), and use the diabetes dataset to train and test a linear regressor. We begin by loading the dataset and splitting it into training and testing samples (an 80/20 split).


In [1]:
from sklearn.model_selection import train_test_split
from sklearn import datasets

dataset = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
print("Train examples: %d, Test examples: %d" % (X_train.shape[0], X_test.shape[0]))

Train examples: 353, Test examples: 89


## Non-private Baseline
We now use scikit-learn's native LinearRegression class to establish a non-private baseline for our experiments. We will use the R-squared score to evaluate the goodness-of-fit of the model. The R-squared score measures the proportion of the variance in the dependent variable (the target) that is explained by the independent variables (the features) in a regression model. A score of 1 means the predictions match the targets exactly, so a higher R-squared score indicates a better fit.

In [2]:
from sklearn.linear_model import LinearRegression as sk_LinearRegression
from sklearn.metrics import r2_score

regr = sk_LinearRegression()
regr.fit(X_train, y_train)
baseline = r2_score(y_test, regr.predict(X_test))
print("Non-private baseline: %.2f" % baseline)

Non-private baseline: 0.41
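
To make the R-squared metric concrete, the following minimal sketch computes it by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The function name `r_squared` and the toy values are illustrative assumptions, not part of scikit-learn or of the notebook's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between targets and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the targets.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen for illustration only.
y_true = [10.0, 20.0, 20.0, 30.0]
close_pred = [10.0, 15.0, 25.0, 30.0]     # small errors on two points
mean_pred = [20.0, 20.0, 20.0, 20.0]      # always predicts the mean
reversed_pred = [30.0, 20.0, 20.0, 10.0]  # worse than predicting the mean

print(r_squared(y_true, close_pred))     # prints 0.75
print(r_squared(y_true, mean_pred))      # prints 0.0
print(r_squared(y_true, reversed_pred))  # prints -3.0
```

Note that the score is 0 for a model that always predicts the mean and can be negative for a model that does worse than that, which is worth keeping in mind when noise is added to the regression.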


## Differentially Private Linear Regression
First, install the IBM Differential Privacy Library (diffprivlib).

In [3]:
!pip install diffprivlib

Collecting diffprivlib
  Downloading https://files.pythonhosted.org/packages/fe/b8/852409057d6acc060f06cac8d0a45b73dfa54ee4fbd1577c9a7d755e9fb6/diffprivlib-0.3.0.tar.gz (70kB)
Building wheels for collected packages: diffprivlib
  Building wheel for diffprivlib (setup.py) ... done
  Created wheel for diffprivlib: filename=diffprivlib-0.3.0-cp36-none-any.whl size=138999 sha256=c7dc6a6f6ad1d2a3b1b4ad5fda02192ec21aa4b80a6f7867f0925b58a8b94663
  Stored in directory: /root/.cache/pip/wheels/64/68/62/617183f73d3fe

Let's now train a differentially private linear regressor with the default privacy budget (epsilon=1.00), so that the trained model is differentially private with respect to the training data.

In [4]:
from diffprivlib.models import LinearRegression

regr = LinearRegression()
regr.fit(X_train, y_train)

print("R2 score for epsilon=%.2f: %.2f" % (regr.epsilon, r2_score(y_test, regr.predict(X_test))))

R2 score for epsilon=1.00: 0.43


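Note that the model above was created without any bounds on the data, so the bounds have to be determined from the training data itself. This will result in additional privacy leakage. To ensure differential privacy with no additional privacy loss, specify `bounds_X` and `bounds_y`.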
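
As a minimal sketch of what specifying the bounds might look like, the helper below passes `bounds_X` and `bounds_y` to the diffprivlib model. It assumes a diffprivlib version that accepts each bound as a `(min, max)` tuple of scalars. The helper name `fit_private_regression` and the bound values are hypothetical: the values are placeholders that should come from domain knowledge about the data, not from inspecting the training set.

```python
# Hypothetical, data-independent bounds (assumptions for illustration only).
ASSUMED_BOUNDS_X = (-0.2, 0.2)   # assumed range of every feature
ASSUMED_BOUNDS_Y = (0.0, 400.0)  # assumed range of the target


def fit_private_regression(X_train, y_train, bounds_X, bounds_y, epsilon=1.0):
    """Fit a differentially private linear regressor with explicit bounds."""
    # Imported inside the function so this sketch can be loaded
    # even where diffprivlib is not installed.
    from diffprivlib.models import LinearRegression

    regr = LinearRegression(epsilon=epsilon, bounds_X=bounds_X, bounds_y=bounds_y)
    regr.fit(X_train, y_train)
    return regr
```

With the variables from the cells above, this could be called as `regr = fit_private_regression(X_train, y_train, ASSUMED_BOUNDS_X, ASSUMED_BOUNDS_Y)` and scored with `r2_score(y_test, regr.predict(X_test))`. Wider bounds are safer to assume but generally mean more noise for the same epsilon, so the resulting score may differ from the one shown above.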
