<a href="https://colab.research.google.com/github/ivyclare/PrivateAI/blob/master/MNIST_PATE_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### PATE Analysis on MNIST

http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html
Our PATE approach at providing differential privacy to machine learning is based on a simple intuition: if two different classifiers, trained on two different datasets with no training examples in common, agree on how to classify a new input example, then that decision does not reveal information about any single training example. The decision could have been made with or without any single training example, because both the model trained with that example and the model trained without that example reached the same conclusion.

====================

In order to train MNIST in a differentially private manner, we need 2 main components; private datasets (teachers) and public unlabelled dataset (student). MNIST is divided into train and test data. Hence, we'll have to create the teacher and student datasets ourselves. 

We will follow the steps below, to create a privacy preserving MNIST deep learning model:

- Create the teacher and student datasets
    - The training data is divided into non-overlapping subsets
- 

In [26]:
# install syft package to use Private Aggregation of Teacher Ensembles (PATE)
!pip install syft

Collecting syft
[?25l  Downloading https://files.pythonhosted.org/packages/1f/8b/dc9a253392908d480322466832d618d85cdb1b66a1781604cf1064b50c32/syft-0.2.4-py3-none-any.whl (341kB)
[K     |████████████████████████████████| 348kB 2.8MB/s 
Collecting Pillow~=6.2.2
[?25l  Downloading https://files.pythonhosted.org/packages/8a/fd/bbbc569f98f47813c50a116b539d97b3b17a86ac7a309f83b2022d26caf2/Pillow-6.2.2-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
[K     |████████████████████████████████| 2.1MB 38.8MB/s 
[?25hCollecting requests~=2.22.0
[?25l  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
[K     |████████████████████████████████| 61kB 7.7MB/s 
Collecting flask-socketio~=4.2.1
  Downloading https://files.pythonhosted.org/packages/66/44/edc4715af85671b943c18ac8345d0207972284a0cd630126ff5251faa08b/Flask_SocketIO-4.2.1-py2.py3-none-any.whl
Collecting syft-proto~=0.2.5.a1
[?25l  

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# import our libraries
import numpy as np
import pandas as pd
import torch
from torchvision import datasets, transforms
from torch.utils.data import Subset, DataLoader
from torch import nn, optim
import torch.nn.functional as F
import time, os
import math
from syft.frameworks.torch.dp import pate


### Step 1: Create Teacher and Student Datasets

In [5]:
# Load MNIST dataset

data_transforms = transforms.Compose([transforms.ToTensor(),
                                      transforms.Normalize((0.5,),(0.5,))
                                     ])
# train_data = datasets.MNIST(root=’data’, train=True, download=True, transform=transform)

trainset = datasets.MNIST(root='data', train=True, transform=data_transforms, download=True)

testset = datasets.MNIST(root='data', train=False, transform=data_transforms, download=True)



Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw
Processing...
Done!




In [0]:
len(trainset), len(testset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
# TEACHERS
#divide train set between teachers and create dataloaders for valid and trainsets
num_teachers = 10
valid_per = 0.2 #20% for validation
batch_size = 32


def teacher_dataloaders(transet=trainset, num_teachers=num_teachers, batch_size=batch_size, valid_per = 0.3):
  trainloaders = []
  validloaders = []
  teacher_data_len = len(trainset) // num_teachers

  for i in range(num_teachers):
    # get particular subset of data
    indice = list(range(i*teacher_data_len, (i+1)*teacher_data_len))
    data_subset = Subset(trainset, indice)
    # split into train and validation set
    valid_size = int(len(data_subset) * valid_per)
    train_size = len(data_subset) - valid_size
    train_subset, valid_subset = torch.utils.data.random_split(data_subset, [train_size,valid_size])
    # print(len(train_subset))
    # print(len(valid_subset))

    #create data loaders
    trainloader = DataLoader(train_subset, batch_size=batch_size, shuffle=True, num_workers=1)
    validloader = DataLoader(valid_subset, batch_size=batch_size, shuffle=False, num_workers=1)

    #add dataloaders to list
    trainloaders.append(trainloader)
    validloaders.append(validloader)
  
  return trainloaders, validloaders

trainloaders, validloaders = teacher_dataloaders()
len(trainloaders), len(validloaders)





(10, 10)

In [0]:
#  # STUDENT 
# split into train and validation set
valid_size = int(len(testset) * 0.3)
train_size = len(testset) - valid_size
student_train_subset, student_valid_subset = torch.utils.data.random_split(testset, [train_size,valid_size])
# print(len(train_subset))
# print(len(valid_subset))

#create data loaders
student_trainloader = DataLoader(student_train_subset, batch_size=batch_size, shuffle=False, num_workers=1)
student_validloader = DataLoader(student_valid_subset, batch_size=batch_size, shuffle=False, num_workers=1)

## Step 2: Train Teachers

In [0]:
# define model
class Net(nn.Module):
  def __init__(self):
    super().__init__()

    self.fc1 = nn.Linear(784, 256)
    self.fc2 = nn.Linear(256, 128)
    self.fc3 = nn.Linear(128, 64)
    self.fc4 = nn.Linear(64, 10)
    self.dropout = nn.Dropout(p=0.4)

  def forward(self, x):
    x = x.view(x.shape[0], -1)
    x = self.fc1(x)
    x = self.dropout(F.relu(self.fc2(x)))
    x = self.dropout(F.relu(self.fc3(x)))
    x = F.log_softmax(self.fc4(x), dim=1)

    return x

In [0]:
# training loop
def train(trainloader, validloader, model, optimizer, criterion, epochs, device):
  start = time.time()
  trainloader = trainloaders[0]
  validloader = validloaders[0]
  best_loss = math.inf
  train_results = []
  valid_results = []

  for epoch in range(epochs):   
    model.train()
    running_loss = 0.0
    running_corrects = 0
    valid_corrects = 0
    valid_loss = 0
    
    for images, labels in trainloader:
      images = images.to(device)
      labels = labels.to(device)
      optimizer.zero_grad()

      outputs = model(images)
      _, preds = torch.max(outputs, 1)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()

      #running_loss += loss.item()
      running_loss += loss.item() * images.size(0)
      running_corrects += torch.sum(preds == labels.data)

      with torch.no_grad():
        model.eval()
        for images, labels in validloader:
          images = images.to(device)
          labels = labels.to(device)

          outputs = model(images)
          v_loss = criterion(outputs, labels)

          valid_loss += loss.item() * images.size(0)
          ps = torch.exp(outputs)
          top_p, top_class = ps.topk(1, dim=1)
          equals = top_class == labels.view(*top_class.shape)
          valid_corrects += torch.mean(equals.type(torch.FloatTensor))

      #   # if(valid_loss < best_loss):
      #   #   best_loss = valid_loss
      
        train_loss = running_loss / len(trainloader)
        train_acc = running_corrects.double() / len(trainloader)
        train_results.append([train_loss,train_acc])

        valid_losss = valid_loss / len(validloader)
        valid_acc = valid_corrects / len(validloader)
        valid_results.append([valid_losss,valid_acc])

    print("Epoch: {}/{}".format(epoch, epochs))
    print('\tTrain Loss: {:.4f} Train Acc: {:.4f}'.format(train_loss, train_acc))
    print('\tValid Loss: {:.4f} Valid Acc: {:.4f}'.format(valid_losss, valid_acc))
  return model
  # return model, train_results, valid_results

In [0]:
model = Net()
#model.to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters() , lr=0.001)
epochs = 10

#train(trainloaders, validloaders, model, optimizer, criterion, epochs, device)

In [0]:

# method for training
def train(model, criterion, optimizer, trainloader, validloader, epochs=10):
  model = model
  criterion = criterion
  optimizer = optimizer
  epochs = epochs

  train_losses, valid_losses = [], []
  for e in range(epochs):
    running_loss = 0
    for inputs, labels in trainloader:

      optimizer.zero_grad()

      log_ps = model(inputs)
      loss = criterion(log_ps, labels)
      loss.backward()
      optimizer.step()

      running_loss += loss.item()
  
    else:
      valid_loss = 0
      acc = 0

      # turn off gradients for validation, saving memory and computations
      with torch.no_grad():
        model.eval()
        for inputs, labels in validloader:

          log_ps = model(inputs)
          valid_loss += criterion(log_ps, labels)

          ps = torch.exp(log_ps)
          top_p, top_class = ps.topk(1, dim=1)
          equals = top_class == labels.view(*top_class.shape)
          acc += torch.mean(equals.type(torch.FloatTensor))

      model.train()
      train_losses.append(running_loss / len(trainloader))
      valid_losses.append(valid_loss / len(validloader))

      print('Epoch: {}/{}.. '.format(e+1, epochs),
            'Training Loss: {:.3f}.. '.format(running_loss / len(trainloader)),
            'Valid Loss: {:.3f}.. '.format(valid_loss / len(validloader)),
            'Valid Accuracy: {:.3f} '.format(acc / len(validloader)),
            '')
  return model

In [13]:
teacher_models = []
i = 1
for trainloader, validloader in zip(trainloaders, validloaders):
  print(" Training Teacher {}".format(i))
  #teacher_model = train(trainloaders, validloaders, model, optimizer, criterion, epochs, device)
  teacher_model = train(model, criterion, optimizer, trainloader, validloader)
  teacher_models.append(teacher_model)
  i+=1
  print("======================================================")

 Training Teacher 1
Epoch: 1/10..  Training Loss: 1.449..  Valid Loss: 0.532..  Valid Accuracy: 0.843  
Epoch: 2/10..  Training Loss: 0.671..  Valid Loss: 0.366..  Valid Accuracy: 0.895  
Epoch: 3/10..  Training Loss: 0.556..  Valid Loss: 0.422..  Valid Accuracy: 0.869  
Epoch: 4/10..  Training Loss: 0.468..  Valid Loss: 0.308..  Valid Accuracy: 0.905  
Epoch: 5/10..  Training Loss: 0.411..  Valid Loss: 0.287..  Valid Accuracy: 0.918  
Epoch: 6/10..  Training Loss: 0.360..  Valid Loss: 0.321..  Valid Accuracy: 0.905  
Epoch: 7/10..  Training Loss: 0.388..  Valid Loss: 0.360..  Valid Accuracy: 0.899  
Epoch: 8/10..  Training Loss: 0.336..  Valid Loss: 0.259..  Valid Accuracy: 0.928  
Epoch: 9/10..  Training Loss: 0.306..  Valid Loss: 0.275..  Valid Accuracy: 0.922  
Epoch: 10/10..  Training Loss: 0.278..  Valid Loss: 0.250..  Valid Accuracy: 0.931  
 Training Teacher 2
Epoch: 1/10..  Training Loss: 0.469..  Valid Loss: 0.289..  Valid Accuracy: 0.901  
Epoch: 2/10..  Training Loss: 0.404

## Step 3: Get Private Student Labels 

In [14]:
# get private labels
def student_train_labels(teacher_models, dataloader):
  student_labels = []
  # get label for each teacher
  for model in teacher_models:
    student_label = []
    for images,_ in dataloader:
      with torch.no_grad():
        outputs = model(images)
        preds = torch.argmax(torch.exp(outputs), dim=1)
      student_label.append(preds.tolist())
    # add all teacher predictions to student_labels  
    student_label = sum(student_label, [])
    student_labels.append(student_label)
  return student_labels

predicted_labels = student_train_labels(teacher_models, student_trainloader)     
predicted_labels = np.array([np.array(p) for p in predicted_labels]).transpose(1, 0)
# We see here that we have 10 labels for each of image in our dataset
print(predicted_labels.shape)
predicted_labels[0]

(7000, 10)


array([8, 8, 8, 2, 2, 6, 2, 5, 7, 5])

In [0]:
# labels = predicted_labels[1]
# counts = np.bincount(labels, minlength=10)
# query_result = np.argmax(counts)
# query_result

## Step 4: Add Laplacian Noise

In [0]:
# Get private labels with the most votes count and add noise them
def add_noise(predicted_labels, epsilon=0.1):
  noisy_labels = []
  for preds in predicted_labels:
    # print(preds.shape[0])
    # get labels with max votes
    counts = np.bincount(preds, minlength=preds.shape[0])
    # add laplacian noise to label
    epsilon = epsilon
    beta = 1/epsilon
    for i in range(len(counts)):
      counts[i] += np.random.laplace(0, beta, 1)
    
    # after adding noise we get labels with max counts
    new_label = np.argmax(counts)
    noisy_labels.append(new_label)
  return np.array(noisy_labels)
  

In [17]:
labels_with_noise = add_noise(predicted_labels, epsilon=0.1)
print(labels_with_noise)

[8 0 6 ... 9 6 9]


## Step 5 Peform PATE Analysis

## Step 6: Train **Student**

In [22]:
# We have to create a new training dataloader for the student with the newly created 
# labels with noise. We have to replace the old labels with the new labels

def new_student_data_loader(dataloader, noisy_labels, batch_size=32):
  image_list = []
  for image,_ in dataloader:
    image_list.append(image)
    
  data = np.vstack(image_list)
  new_dataset = list(zip(data, noisy_labels))
  new_dataloader = DataLoader(new_dataset, batch_size, shuffle=False)

  return new_dataloader

labeled_student_trainloader = new_student_data_loader(student_trainloader, labels_with_noise,32)
len(labeled_student_trainloader)
# student_trainloader = DataLoader(student_train_subset, batch_size=batch_size, shuffle=True, num_workers=1)
# student_validloader = DataLoader(student_valid_subset, batch_size=batch_size, shuffle=False, num_workers=1)



219

In [25]:
# Now we train the model
# We use the newly labeled trainloader for training and use the validloader data to evaluate the performance of our model
student_model = train(model, criterion, optimizer, labeled_student_trainloader, student_validloader)



Epoch: 1/10..  Training Loss: 2.304..  Valid Loss: 2.303..  Valid Accuracy: 0.097  
Epoch: 2/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 3/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 4/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 5/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 6/10..  Training Loss: 2.301..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 7/10..  Training Loss: 2.302..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 8/10..  Training Loss: 2.302..  Valid Loss: 2.304..  Valid Accuracy: 0.097  
Epoch: 9/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.101  
Epoch: 10/10..  Training Loss: 2.303..  Valid Loss: 2.304..  Valid Accuracy: 0.101  


# TRAIN MNIST NORMALLY