<a href="https://colab.research.google.com/github/agatagruza/private-ai/blob/master/SPAIC_Project7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 7: Annotation for unknown labels of images
First, consider a scenario: you work for a hospital and you have a large collection of images of your patients. However, you don't know what's in them. You would like to use these images to develop a neural network that can classify them automatically, but since your images aren't labeled, they aren't sufficient to train a classifier. <br><br>
However, being a cunning strategist, you realize that you can reach out to 10 partner hospitals which DO have annotated data. You hope to train your new classifier on their datasets so that you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their patients. Thus, you will use the following technique to train a classifier which protects the privacy of patients in the other hospitals. <br>
1.   You'll ask each of the 10 hospitals to train a model on its own dataset (all of which have the same kinds of labels). <br>
2.   You'll then use each of the 10 partner models to predict on your local dataset, generating 10 labels for each of your data points. <br>
3.   Then, for each local data point (now with 10 labels), you will perform a DP query to generate the final label. This query is a "max" function, where "max" is the most frequent label across the 10 labels. You will need to add Laplacian noise to make this query differentially private within a chosen epsilon/delta constraint. <br>
4.   Finally, you will retrain a new model on your local dataset, which now has labels. This will be your final "DP" model.<br>
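
The DP query in step 3 can be sketched for a single data point as follows. This is only a minimal illustration, not a definitive implementation: the function name `noisy_max`, the example `votes` array, and the use of `np.random.default_rng` are assumptions made for this sketch, and the noise scale `1 / epsilon` simply mirrors the value used later in this notebook.

```python
import numpy as np

def noisy_max(votes, num_labels, epsilon, rng):
    """Return the most-voted label after adding Laplacian noise to each vote count."""
    # Count how many teachers voted for each label
    counts = np.bincount(votes, minlength=num_labels).astype(float)
    # Assumption: noise scale of 1 / epsilon, as used later in this notebook
    counts += rng.laplace(loc=0.0, scale=1.0 / epsilon, size=num_labels)
    return int(np.argmax(counts))

rng = np.random.default_rng(0)
# Hypothetical votes from 10 teachers for one image (label 3 wins 6 of 10 votes)
votes = np.array([3, 3, 3, 1, 2, 3, 0, 3, 3, 5])

print(noisy_max(votes, num_labels=10, epsilon=0.1, rng=rng))  # heavy noise: may differ from 3
print(noisy_max(votes, num_labels=10, epsilon=1e6, rng=rng))  # negligible noise: the plain majority
```

A smaller epsilon means more noise and therefore stronger privacy, at the cost of the returned label agreeing less often with the plain majority vote.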
 
So, let's say we have 10,000 training examples, and we've got 10 labels for each example (from our 10 "teacher models" which were trained directly on private data). Each label is chosen from a set of 10 possible labels (categories) for each image.

In [0]:
import numpy as np

In [0]:
teachers_num = 10 # total number of hospitals that we are working with
examples_num = 10000 # the size of OUR dataset
labels_num = 10 # number of labels for our classifier
# For now we are assuming that those parameters are mutually exclusive

In [0]:
prediction = (np.random.rand(teachers_num, examples_num) * labels_num).astype(int) # fake predictions

In [14]:
prediction[0].shape # predictions from one teacher

(10000,)

In [15]:
prediction[:,0] # All teachers' predictions for, say, the first image. The index corresponds to the teacher number

array([7, 4, 7, 6, 8, 6, 5, 9, 1, 8])

In [16]:
prediction

array([[7, 7, 5, ..., 9, 8, 7],
       [4, 4, 7, ..., 1, 5, 0],
       [7, 1, 7, ..., 7, 2, 5],
       ...,
       [9, 8, 0, ..., 5, 7, 1],
       [1, 4, 6, ..., 7, 6, 3],
       [8, 8, 6, ..., 6, 3, 2]])

In [0]:
# We are assuming the hospitals don't have overlapping patients
new_labels = list()
for an_image in prediction.T: # transpose so that we iterate over images, not teachers

    # Count the votes each label received from the teachers
    label_counts = np.bincount(an_image, minlength=labels_num).astype(float)

    epsilon = 0.1
    beta = 1 / epsilon

    # Adding Laplacian noise to each of the counts
    label_counts += np.random.laplace(0, beta, labels_num)

    new_label = np.argmax(label_counts)
    
    new_labels.append(new_label)

In [18]:
len(new_labels)

10000

***As a result we have generated a synthetic dataset of new labels based on the predictions from all of our partner hospitals.***
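
To get a feel for how much the noise changes the labels, one could compare the noisy labels against a plain majority vote. The sketch below is one possible way to do that and is self-contained: the names `fake_preds`, `vote_counts`, and `agreement` are hypothetical, the predictions are random placeholders rather than real teacher outputs, and the noise scale `1 / epsilon` is the same assumption used in the loop above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_teachers, num_examples, num_labels = 10, 10000, 10
epsilon = 0.1

# Fake teacher predictions, shape (num_teachers, num_examples)
fake_preds = rng.integers(0, num_labels, size=(num_teachers, num_examples))

# vote_counts[i, k] = number of teachers that predicted label k for image i
vote_counts = (fake_preds[..., None] == np.arange(num_labels)).sum(axis=0)

# Add Laplacian noise to every count at once, then pick the noisy winner per image
noisy_counts = vote_counts + rng.laplace(0.0, 1.0 / epsilon, size=vote_counts.shape)
noisy_labels = noisy_counts.argmax(axis=1)

# Plain (non-private) majority vote, for comparison only
plain_labels = vote_counts.argmax(axis=1)

agreement = (noisy_labels == plain_labels).mean()
print(noisy_labels.shape, agreement)
```

With random placeholder predictions the teachers rarely agree, so the agreement rate is expected to be low. Teachers that largely agree with one another would make the noisy result match the majority vote far more often.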
