
# Project 8: PATE Analysis + DP model training using PATE

## Part A: PATE Analysis
This part asks: how much information would leak through these labels if we were to publish them? ***In other words, how much epsilon is actually spent by these labels?*** We care about this because of a key property of differential privacy: ***it is immune to post-processing***. If a released output carries a certain amount of private information, no amount of post-processing can divulge more information than that output already contains.
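
The post-processing property can be written out formally. This is the standard textbook statement rather than something taken from this notebook; the symbols $\mathcal{M}$ (the private mechanism), $f$ (any function that does not touch the private data), and $D, D'$ (neighboring datasets) are notation introduced here for illustration.

```latex
% If M is (epsilon, delta)-differentially private, i.e. for all neighboring D, D' and all output sets S
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta ,
% then for any function f that does not access D, and all output sets T,
\Pr[f(\mathcal{M}(D)) \in T] \;\le\; e^{\varepsilon}\,\Pr[f(\mathcal{M}(D')) \in T] + \delta .
```

So once the epsilon of the published labels is known, anything trained on those labels inherits the same bound.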



In [0]:
import numpy as np
import torch

In [19]:
labels = np.array([9, 9, 3, 6, 9, 9, 9, 9, 8, 2])
counts = np.bincount(labels, minlength=10)
query_result = np.argmax(counts)
query_result

9
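
The query above is the noise-free vote. In PATE the vote counts are perturbed before the winner is picked, and that added noise is what the privacy analysis accounts for. Below is a minimal sketch of such a noisy argmax, assuming Laplace noise with scale `1 / epsilon`; the function name `noisy_argmax` and the use of `np.random.default_rng` are choices made for this illustration, not part of the notebook.

```python
import numpy as np

def noisy_argmax(labels, num_labels, epsilon, rng):
    """Return the most-voted label after adding Laplace noise to each vote count."""
    counts = np.bincount(labels, minlength=num_labels)
    # Assumption: Laplace noise with scale 1/epsilon; a smaller epsilon means more noise.
    noisy_counts = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=num_labels)
    return int(np.argmax(noisy_counts))

rng = np.random.default_rng(0)
labels = np.array([9, 9, 3, 6, 9, 9, 9, 9, 8, 2])

# With a small epsilon the answer can differ from the true winner (9).
print(noisy_argmax(labels, 10, epsilon=0.1, rng=rng))
# With a very large epsilon the noise is negligible and the true winner is returned.
print(noisy_argmax(labels, 10, epsilon=1e6, rng=rng))
```

The trade-off is visible directly: a low epsilon protects the teachers' votes but makes individual labels less reliable.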

In [0]:
# installing pysyft
!pip install syft

In [0]:
# importing pate so that we can apply pate algorithm
from syft.frameworks.torch.differential_privacy import pate

In [0]:
teachers_num, examples_num, labels_num = (100, 100, 10)  # same setup as the previous project
# Fake teacher predictions, shape (teachers_num, examples_num)
prediction = (np.random.rand(teachers_num, examples_num) * labels_num).astype(int)
# Fake aggregated labels, one per example
indices = (np.random.rand(examples_num) * labels_num).astype(int)

In [33]:
# data_dependent_epsilon looks at the predictions and asks: "How much agreement is there among the teachers?"
# data_independent_epsilon is a looser, simpler bound.
# It does not look at the data at all, so it cannot benefit from agreement.

data_dependent_epsilon, data_independent_epsilon = pate.perform_analysis(teacher_preds=prediction, indices=indices, noise_eps=0.1, delta=1e-5)
print("Data Independent Epsilon:", data_independent_epsilon)
print("Data Dependent Epsilon:", data_dependent_epsilon)

Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 11.756462732485105
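
The data-independent value can be checked by hand. The sketch below assumes the moments-accountant bound from the PATE paper, where each query contributes at most `2 * noise_eps**2 * l * (l + 1)` to the `l`-th log moment and the final epsilon is the minimum over `l` of `(total_log_moment + ln(1/delta)) / l`. It is an independent re-derivation, not the library's code, and the helper name `data_independent_epsilon` is hypothetical.

```python
import math

def data_independent_epsilon(num_queries, noise_eps, delta=1e-5, moments=8):
    """Data-independent epsilon from the moments bound, minimized over l = 1..moments."""
    best = math.inf
    for l in range(1, moments + 1):
        # Worst-case l-th log moment summed over all queries (no data needed).
        log_mgf = num_queries * 2 * noise_eps ** 2 * l * (l + 1)
        best = min(best, (log_mgf - math.log(delta)) / l)
    return best

eps = data_independent_epsilon(num_queries=100, noise_eps=0.1)
print(eps)  # approximately 11.7565, in line with the value reported above
```

Because nothing in this formula depends on the teacher predictions, the data-independent epsilon stays the same no matter how much the teachers agree.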


In [0]:
# Agreement: for the first 5 examples, all 100 teachers (hospitals) agree the label is 0.
# We force the first 5 examples to have perfect consensus at zero.
prediction[:,0:5] *= 0

In [37]:
data_dependent_epsilon, data_independent_epsilon = pate.perform_analysis(teacher_preds=prediction, indices=indices, noise_eps=0.1, delta=1e-5)
print("Data Independent Epsilon:", data_independent_epsilon)
print("Data Dependent Epsilon:", data_dependent_epsilon)

Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 1.52655213289881


In [0]:
# Agreement: for the first 50 examples, all 100 teachers (hospitals) agree the label is 0.
# We force the first 50 examples to have perfect consensus at zero.
prediction[:,0:50] *= 0

In [38]:
data_dependent_epsilon, data_independent_epsilon = pate.perform_analysis(teacher_preds=prediction, indices=indices, noise_eps=0.1, delta=1e-5)
print("Data Independent Epsilon:", data_independent_epsilon)
print("Data Dependent Epsilon:", data_dependent_epsilon)

Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 1.52655213289881


The smaller the data-dependent epsilon, the less privacy leaks, i.e. the stronger the privacy guarantee.

***Warning: May not have used enough values of l. Increase 'moments' variable and run again.***
The default is `moments=8`. Following the warning, we should increase `moments` to about 20 and run the analysis again.

**In summary:** the more the teacher predictions agree with each other, the tighter the data-dependent epsilon we can get.
When using PATE, anything your algorithm does to encourage models at different locations to agree with each other, to find the true signal
and NOT overfit to the data, gives you less privacy leakage. That happens because each model learns generic information instead of memorizing its own data.
PATE rewards you for creating well-generalized models that don't memorize the data by giving you better epsilon levels at the end.
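
One way to make "agreement" concrete is to measure, for each example, the fraction of teachers that voted for the winning label. This is a sketch for intuition only; the helper `teacher_agreement` is hypothetical and is not the quantity `perform_analysis` computes internally.

```python
import numpy as np

def teacher_agreement(teacher_preds, num_labels):
    """Per-example fraction of teachers that voted for the plurality label."""
    num_teachers = teacher_preds.shape[0]
    top_votes = np.array(
        [np.bincount(col, minlength=num_labels).max() for col in teacher_preds.T]
    )
    return top_votes / num_teachers

rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=(100, 100))  # random teachers: little agreement
preds[:, 0:5] = 0                             # force consensus on the first 5 examples

agreement = teacher_agreement(preds, num_labels=10)
print(agreement[:5])         # forced-consensus examples: every teacher agrees
print(agreement[5:].mean())  # random examples: far lower agreement
```

Examples with agreement close to 1 are the ones that contribute little to the data-dependent epsilon.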





## Part B: DP model training using PATE
For the final project of this section, you need to train a DP model on the MNIST dataset using the PATE method. You are given:
1.   A labeled private dataset, which you must keep differentially private
2.   A public unlabeled dataset (MNIST), which doesn't need to be differentially private

Goal: automatically label the second (public) dataset. You should then be able to train a model on this public dataset and reach a reasonable level of accuracy on the task under a given epsilon-delta constraint.
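
The overall labeling flow can be sketched as follows. All sizes, the random stand-in for real teacher predictions, and the helper name `aggregate_labels` are assumptions made for illustration; in the actual project each teacher would be a model trained on its own shard of the private data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only for this sketch.
num_teachers, num_private, num_public, num_labels = 100, 60000, 1000, 10

# Step 1: split the private example indices into disjoint shards, one per teacher.
shards = np.array_split(rng.permutation(num_private), num_teachers)

# Step 2: each teacher predicts a label for every public example.
# Random integers stand in for real model predictions here.
teacher_preds = rng.integers(0, num_labels, size=(num_teachers, num_public))

# Step 3: aggregate the votes per public example with Laplace noise (scale 1/epsilon).
def aggregate_labels(teacher_preds, num_labels, epsilon, rng):
    votes = np.stack(
        [np.bincount(col, minlength=num_labels) for col in teacher_preds.T]
    )
    noisy_votes = votes + rng.laplace(0.0, 1.0 / epsilon, size=votes.shape)
    return noisy_votes.argmax(axis=1)

student_labels = aggregate_labels(teacher_preds, num_labels, epsilon=0.1, rng=rng)
print(student_labels.shape)
```

The resulting `teacher_preds` and `student_labels` have the same shapes as the `teacher_preds` and `indices` arguments used in Part A, so the same privacy analysis can be applied before the student model is trained.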



In [39]:
import torchvision.datasets as datasets
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!





In [40]:
# `data` and `targets` replace the deprecated `train_data` / `train_labels` attributes
train_data = mnist_trainset.data
train_targets = mnist_trainset.targets



In [41]:
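# Load the held-out split separately: `mnist_trainset` only holds the training images
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=None)
test_data = mnist_testset.data
test_targets = mnist_testset.targets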

