# Differentially Private SKLearn with DiffPrivLib

Welcome to the lab 👋

In this tutorial, we'll learn about differentially private (DP) machine learning using DiffPrivLib from IBM Research. A cool thing to know is that this library was created by a team just down the road (at the IBM Research Ireland campus), and it's a great example of how differential privacy can be used in the context of machine learning.

Before digging in, we'll introduce some concepts to get a sense of what their code is doing. For full details of the DiffPrivLib framework, check out the codebase [here](https://github.com/IBM/differential-privacy-library).

🧠🥅 : If you aren't interested in SKLearn-type models and want to look at neural nets or something "shiny" like that, why not take a look at [TF Privacy](https://github.com/tensorflow/privacy) and [Opacus](https://github.com/pytorch/opacus) to see how DP can be used to train (or, more realistically, fine-tune) the weights of DNNs.


## What does it mean for a machine learning model to be DP?

Good question, glad you asked!

A differentially private machine learning model is one in which the impact of the training data on the learned weights/parameters is differentially private. So intuitively you should ask yourself: "if I were to give this model, or its predictions, to someone, could they possibly learn about any individual training sample?"

If you were to think about previous examples from the lecture, the model itself is now the *output* of the differentially private "query" performed on the training data 🤯

## How does the machine learning model know how much noise to apply?

Your questions are on 🔥 today!

In general, it depends. If you were to go through the research, what you'll find is "catch-all" approaches and "handcrafted" approaches.

#### Catch-alls via Alternate Optimization Functions

What I mean by a catch-all is simply that a large chunk of ML algorithms use an optimization step to update the weights/parameters of a model based on training data. So if you can bottleneck this step to be differentially private (usually with some very small epsilon per step), then your model will be differentially private with a total epsilon that is some multiple of the per-step epsilon: in the simplest accounting, the per-step epsilon times the number of optimisation steps. This might seem complex to calculate, but if you know the model's DAG of operations, it can usually be calculated auto-magically.
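
To make that bookkeeping concrete, here is a minimal sketch of the simplest accounting rule (basic sequential composition), where the per-step epsilons just add up. The function names are made up for illustration and are not part of DiffPrivLib; real frameworks typically use tighter accounting than this worst-case bound.

```python
def total_epsilon_basic(per_step_epsilon, n_steps):
    """Worst-case total privacy loss under basic sequential composition."""
    if per_step_epsilon < 0 or n_steps < 0:
        raise ValueError("epsilon and the number of steps must be non-negative")
    return per_step_epsilon * n_steps


def per_step_budget(total_epsilon, n_steps):
    """Split a total budget evenly across the optimisation steps."""
    if n_steps <= 0:
        raise ValueError("need at least one step")
    return total_epsilon / n_steps


# 200 steps at epsilon = 0.01 each costs about epsilon = 2 overall
print(total_epsilon_basic(0.01, 200))

# Going the other way: a total budget of 1 leaves very little per step
print(per_step_budget(1.0, 200))
```

The second call shows why this gets painful quickly: the more steps you take, the less budget (and hence the more noise) each step gets.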

This is usually the approach taken with neural networks.
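
As a rough illustration of what a privatised optimisation step can look like, here is a sketch in the style of DP-SGD: clip each example's gradient, add Gaussian noise scaled to the clipping bound, then take an ordinary gradient step. The function name, its parameters and the toy gradients are all assumptions made for this example; it is not how TF Privacy, Opacus or DiffPrivLib implement it, and it does no privacy accounting.

```python
import numpy as np


def noisy_gradient_step(weights, per_example_grads, clip_norm,
                        noise_multiplier, lr, rng):
    """One gradient step with per-example clipping and Gaussian noise."""
    # 1. Clip each example's gradient so its L2 norm is at most clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add noise scaled to the clipping bound
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)

    # 3. Ordinary gradient descent update using the noisy average
    return weights - lr * noisy_mean


rng = np.random.default_rng(0)
weights = np.zeros(3)
# Toy per-example gradients (one row per training example)
grads = np.array([[3.0, 4.0, 0.0],
                  [0.1, 0.0, 0.0]])

new_weights = noisy_gradient_step(weights, grads, clip_norm=1.0,
                                  noise_multiplier=1.1, lr=0.1, rng=rng)
print(new_weights)
```

Notice that the first gradient (norm 5) gets scaled down to norm 1, while the second is left alone. Clipping is what bounds any one example's influence, so the noise has something to be calibrated against.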

#### Handcrafted approaches

There is also a lot of effort going into finding efficient uses of privacy loss when training models that have a specific structure. For some such models there may be a trick that gets you more bang for your buck (so to speak) by reformulating the problem so that the privacy loss during training is reduced. This is also the case for models that do not apply a common step (such as gradient descent) during training and hence require a bit of IQ to be thrown at finding the next best alternative.

#### Clipping and Normalizing

The above explains how DP is actually applied. But as we heard earlier, knowing the magnitude of noise to apply is also a challenge. For some problems it is very natural that the input data is bounded, like in image processing, where pixels are often a triple of values ranging between 0 and 255. This is the ideal scenario.
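
To see why bounded inputs are so convenient, here is a small sketch of a differentially private mean using Laplace noise. It assumes the number of values is public and that neighbouring datasets differ by replacing one value, so the mean can change by at most `(upper - lower) / n`. The `dp_mean` helper and the pixel values are invented for illustration and are not part of DiffPrivLib.

```python
import numpy as np


def dp_mean(values, lower, upper, epsilon, rng):
    """Mean of bounded values with Laplace noise calibrated to the bounds."""
    # Enforce the bounds so the sensitivity below actually holds
    values = np.clip(np.asarray(values, dtype=float), lower, upper)

    # Replacing one value moves the mean by at most (upper - lower) / n
    sensitivity = (upper - lower) / len(values)

    # Laplace mechanism: noise scale is sensitivity divided by epsilon
    scale = sensitivity / epsilon
    return values.mean() + rng.laplace(0.0, scale)


rng = np.random.default_rng(42)
pixel_values = [12, 200, 255, 0, 97, 180]  # made-up intensities in [0, 255]

print("True mean:   ", np.mean(pixel_values))
print("Private mean:", dp_mean(pixel_values, 0, 255, epsilon=1.0, rng=rng))
```

The only thing the noise scale depends on is the bounds, the number of values and epsilon. No peeking at the data required.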

If inputs were continuous in plus/minus infinity... we'd have a problem. Typically, to avoid this, data is "clipped". That is to say, we either truncate individual values to lie within a fixed domain (for example, anything over 10 becomes 10) or rescale the input so that its norm lies within a specific bound - that's a little bit like treating the inputs as "relative" inputs rather than hard values.

![](https://www.tutorialexample.com/wp-content/uploads/2019/11/vector-normalization.png)

Let's see what clipping looks like in practice:

#### Norm data exceeding limit

In [None]:
import numpy as np

clip_limit = 10

input_data = [
    [1,3,6,9],
    [1,2,1,2],
    [10,0,100,1000]
]

norms = np.linalg.norm(input_data, axis=1) / clip_limit
norms[norms < 1] = 1

output = input_data / norms[:, np.newaxis]

print("Rows renormalized to be under "+str(clip_limit)+":")
print(output)

#### Clip data exceeding limit

In [None]:
import numpy as np

# purely exemplary: per-feature upper and lower bounds
clip_limit_upper = np.array([4,5,6,7])
clip_limit_lower = np.array([-1,0,1,2])

input_data = np.array([
    [1,3,6,9],
    [1,2,1,2],
    [10,0,100,1000]
])

output = np.clip(input_data, clip_limit_lower, clip_limit_upper)

print("Rows clipped to be between "+str(clip_limit_upper)+" and " +str(clip_limit_lower)+ ":")
print(output)

I think you get the idea at this stage...

As we turn our heads to DiffPrivLib, you can actually see how this normalization is performed in practice by checking out the source code responsible for these operations [here](https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/validation.py).

## What is the typical trust model of DP Libraries?

This is probably the most important question TBH. Adding some noise to data and functions for the sake of it is a bit of a waste of time if it is not _actually_ protecting our data. The trust model of a crypto library is basically the scenario in which it is assumed to be applied.

In most cases of DP machine learning / statistics frameworks (OpenDP, DiffPrivLib, etc) the setting is assumed to be a *trusted curator model*.

#### Trusted Curator Models

A trusted curator model basically means that the person who is applying the differential privacy is allowed to know about the sensitive data. However, they are doing this for the purpose of output disclosure control. Essentially, you trust the person who is doing the model fitting but not those to whom the model is given downstream.

#### Malicious Querier

This is a much stronger claim. I believe PySyft by OpenMined is aiming for this, but they have some way to go to get it ready for production yet. Essentially, in this scenario you don't trust the querier, who will try to exploit anything and everything they can to learn more than they are meant to. In such a scenario, the querier's operations typically need to be tightly constrained and validated. You also have to be concerned with leaking side information like how long the query took, the resolution of the noise applied, etc. In such a setting you almost certainly want to have external security reviews and pentesting prior to moving to production.

# DiffPrivLib: Time to do some Noisy Learning!

OK, so from the lecture you now have some idea of how differential privacy works, and you get the gist of how it can be applied in the context of statistics and machine learning. But you want to see it actually work in practice!

There are of course many frameworks you can use, but what better way than to use a toolbox created by fellow Irish-based researchers and designed to provide a similar interface to SKLearn! Meet *DiffPrivLib*.

First off go ahead and install the library with pip install:

In [None]:
! pip install diffprivlib

Now that it's installed, let's go ahead and import it into the notebook along with numpy and the real sklearn (just for comparison).

In [None]:
import diffprivlib.models as dp
import numpy as np
from sklearn.linear_model import LogisticRegression

We will need a dataset. UCI has loads, or of course you could mix it up with something from Kaggle or another open dataset - totally your choice.

In [None]:
X_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=", ")

X_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=", ", skiprows=1)
# Must trim trailing period "." from label
y_test = np.array([a[:-1] for a in y_test])

Basically the task here is to predict if someone earns more or less than 50k per year (it's a really old data set that's well known to be quite easy to do ok with). The inputs are:

- age: continuous. <- in range [0,100]
- education-num: continuous. <- in range [0, 16] 
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous. <- easily capped at 12*7, thus in range [0,84]

and the output is a binary class '<=50K' or '>50K'.

#### Go ahead and explore the data to get a sense for it:

In [None]:
# write any data exploration code here





OK let's create a benchmark of the non-DP logistic regression model

In [None]:
clf = LogisticRegression(solver="lbfgs")
clf.fit(X_train, y_train)

So with no noise applied, the accuracy of a logistic regression model is:

In [None]:
baseline = clf.score(X_test, y_test)
print("Non-private test accuracy: %.2f%%" % (baseline * 100))

Time to differential privacy this thing! 💃

In [None]:
dp_clf = dp.LogisticRegression()
dp_clf.fit(X_train, y_train)

Whoops - bet you just got a privacy leak warning!

That's because the model doesn't know how much noise to apply without actually looking at the data itself. We're better off specifying a data norm, so that any data with a norm larger than that will be renormalised to fit inside the bounds (like we discussed earlier).

What do you think is a reasonable bound to apply?
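
If you'd like a starting point, one way to reason about it is to turn per-feature bounds into a bound on the row norm. This is a sketch under assumptions: the age, education-num and hours-per-week bounds come from the feature list above, while the capital-gain and capital-loss bounds are hypothetical placeholders that you should replace with values you can justify without peeking at the data. The `norm_bound` helper is made up for this example.

```python
import math

# Upper bounds per feature (all five features are non-negative)
feature_upper_bounds = {
    "age": 100,
    "education-num": 16,
    "capital-gain": 10_000,   # hypothetical placeholder
    "capital-loss": 5_000,    # hypothetical placeholder
    "hours-per-week": 84,
}


def norm_bound(upper_bounds):
    """Largest possible L2 norm of a row whose features lie in [0, bound]."""
    return math.sqrt(sum(b ** 2 for b in upper_bounds))


candidate_data_norm = norm_bound(feature_upper_bounds.values())
print("Candidate data_norm: %.1f" % candidate_data_norm)
```

Notice how the unscaled capital features dominate the norm. That's a hint that the choice of bound (and whether you rescale features first) has a big say in how much noise ends up in the model.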

In [None]:
# add your data norm you think is reasonable considering you understand the data features
data_norm = 1234 # just an example

# we'll now add it as an input 
dp_clf = dp.LogisticRegression(data_norm=data_norm)
dp_clf.fit(X_train, y_train)

In [None]:
print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
     (dp_clf.epsilon, dp_clf.score(X_test, y_test) * 100))

Next, what would happen if we were to set the epsilon to a really big number or rather to infinity!? 

In [None]:
dp_clf = dp.LogisticRegression(epsilon=float("inf"), data_norm=1e5)
dp_clf.fit(X_train, y_train)

💥 Nailed it! We are back to the same situation as if there was no differential privacy applied at all:

In [None]:
print("Agreement between non-private and differentially private (epsilon=inf) classifiers: %.2f%%" % 
     (dp_clf.score(X_test, clf.predict(X_test)) * 100))

OK, last step of the walkthrough: let's have a look at how decreasing epsilon (i.e. increasing privacy) affects the accuracy of the model:

In [None]:
accuracy = []
epsilons = np.logspace(-3, 1, 500)

for eps in epsilons:
    dp_clf = dp.LogisticRegression(epsilon=eps, data_norm=100)
    dp_clf.fit(X_train, y_train)
    accuracy.append(dp_clf.score(X_test, y_test))

In [None]:
import matplotlib.pyplot as plt

plt.semilogx(epsilons, accuracy, label="Differentially private")
plt.plot(epsilons, np.ones_like(epsilons) * baseline, dashes=[2,2], label="Non-private")
plt.title("Differentially private logistic regression accuracy")
plt.xlabel("epsilon")
plt.ylabel("Accuracy")
plt.ylim(0, 1)
plt.xlim(epsilons[0], epsilons[-1])
plt.legend(loc=3)
plt.show()

### Over to you!!

OK you're a high flying NUIG grad student - I think you got this next task!

The goal is to use the same inputs and outputs but mix up the models you are using. A list of the DiffPrivLib models is here:

In [None]:
for model_name in dir(dp):
    if model_name[0].isupper():
        print(model_name)

The source is here if you need it: https://github.com/IBM/differential-privacy-library/tree/main/diffprivlib/models

In [None]:
# try out alternative DP models below



