<a href="https://colab.research.google.com/github/agatagruza/private-ai/blob/master/SPAIC_Project5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 5: Varying Amounts of Noise
Augment the randomized response query from the previous project so that a varying amount of randomness can be added. Vary the amount of noise by:
- Adding a new parameter to the query function. It now accepts the database and a noise parameter, which is a percentage expressed as a fraction between 0 and 1 (for example, 0.2 for 20%). The first coin flip then has a varying probability of being 1 or 0. Experiment with different values of noise.
- Properly rebalancing the result of the query given this adjustable parameter. <br></br>
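
The rebalancing step can be worked out in one line. This is a sketch under the convention the `query` function in this notebook uses, where `noise` is the probability that an entry is replaced by a fair coin flip. The symbols are introduced here for illustration only: $\mu$ is the true mean of the database, $p$ is the noise parameter, $\hat{s}$ is the mean of the randomized database, and $\hat{\mu}$ is the de-skewed estimate.

$$
\mathbb{E}[\hat{s}] = (1 - p)\,\mu + p \cdot 0.5
\quad\Longrightarrow\quad
\hat{\mu} = \frac{\hat{s} - 0.5\,p}{1 - p}, \qquad 0 \le p < 1
$$

For example, with $p = 0.2$ and a randomized mean of $\hat{s} = 0.58$, the estimate is $(0.58 - 0.1) / 0.8 = 0.6$.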

### GLOSSARY:
- The size of the dataset determines how much noise, and therefore how much privacy protection, you can add for the individuals inside the dataset. This is an interesting **trade-off**.
- The **counter-intuitive** thing here is that the more private data you have access to, the easier it is to protect the privacy of the people involved.
- The larger the dataset is, the more noise you can add while still getting an accurate result.
- Intuitively, one might expect more data to mean more privacy risk. With differential privacy (DP), the opposite is true, because DP wants to learn about an aggregate over a large corpus. DP looks for information that is **consistent** across **multiple different individuals**, without learning about any single person. It looks for repeating statistical information inside the dataset and **filters out any information that is unique to an individual**. The smaller the dataset, the more the data will look unique to each individual. <br></br>
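
The claim that a larger dataset tolerates more noise can be checked with a small simulation. The sketch below is not the notebook's `query` function: it uses only Python's standard library instead of PyTorch. The helper names `randomized_response_mean` and `average_abs_error`, the noise level of 0.4, and the 30 trials are all assumptions chosen for illustration.

```python
import random

def randomized_response_mean(db, noise, rng):
    """De-skewed mean of db after randomized response.

    noise is the probability that an entry is replaced by a fair coin flip
    (the same convention as the notebook's query function); it must be < 1.
    """
    noisy = []
    for value in db:
        if rng.random() > noise:
            noisy.append(value)                    # keep the true answer
        else:
            noisy.append(int(rng.random() > 0.5))  # answer randomly
    skewed_mean = sum(noisy) / len(noisy)
    # undo the pull toward 0.5 that the random answers introduce
    return (skewed_mean - 0.5 * noise) / (1 - noise)

def average_abs_error(num_entries, noise, trials, rng):
    """Mean absolute gap between the true mean and the private estimate."""
    total = 0.0
    for _ in range(trials):
        db = [int(rng.random() > 0.5) for _ in range(num_entries)]
        true_mean = sum(db) / num_entries
        total += abs(randomized_response_mean(db, noise, rng) - true_mean)
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
small_db_error = average_abs_error(100, noise=0.4, trials=30, rng=rng)
large_db_error = average_abs_error(10_000, noise=0.4, trials=30, rng=rng)
print("db size=100   average error:", round(small_db_error, 4))
print("db size=10000 average error:", round(large_db_error, 4))
```

With the same noise level, the average error for the larger database should come out noticeably smaller, which is the trade-off described in the list above.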

### TAKEAWAYS
- The larger the corpus of information you can work with, the easier it is to protect privacy, because it is easier for your algorithm to detect that some statistical information occurs in more than one person and is therefore not private, unique, or sensitive to that person. It is a general characteristic of people.

In [0]:
#import torch
import torch

In [34]:
# building random database of length 100 that is filled with 1's and 0's
db = torch.rand(100) > 0.5
db

tensor([1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1,
        1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
        0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
        0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0], dtype=torch.uint8)

In [0]:
def create_parallel_db(db, index):
    """Return a copy of db with the entry at position index removed."""

    return torch.cat((db[0:index],
                      db[index+1:]))

In [0]:
def create_parallel_dbs(db):
    """Return every parallel database of db, each one missing a different entry."""

    parallel_dbs = list()

    for i in range(len(db)):
        pdb = create_parallel_db(db, i)
        parallel_dbs.append(pdb)

    return parallel_dbs

In [0]:
pdbs = create_parallel_dbs(db)

In [0]:

def create_db_and_parallels(num_els):
    
    db = torch.rand(num_els) > 0.5
    pdbs = create_parallel_dbs(db)
    
    return db, pdbs

In [0]:
def query(db, noise=0.2):

  # noise is the probability that an entry is replaced by a random answer
  # (0.2 means roughly 20% of the entries are randomized); it must be < 1

  true_data = torch.mean(db.float())

  # flipping two coins once per database entry
  # first coin: 1 with probability (1 - noise), meaning "keep the true value"
  first_coin_flip = (torch.rand(len(db)) > noise).float()
  # second coin: fair 50/50 flip that supplies the random answer
  second_coin_flip = (torch.rand(len(db)) > 0.5).float()

  # db.float() * first_coin_flip is 1 only where the database originally had a 1
  # and the first coin says to keep it
  # (1 - first_coin_flip) marks all the places we want to answer randomly

  # augmented_database is the differentially private version of db
  augmented_database = db.float() * first_coin_flip + (1 - first_coin_flip) * second_coin_flip

  # the mean of augmented_database is skewed toward 0.5:
  # augmented_database_mean = true_dist_mean * (1 - noise) + noise_dist_mean * noise
  sk_result = augmented_database.float().mean()

  # de-skew the result; 0.5 is the mean of second_coin_flip = (torch.rand(len(db)) > 0.5).float()
  # written this way so that noise=0 does not divide by zero
  db_result = (sk_result - 0.5 * noise) / (1 - noise)

  # returns (true mean, de-skewed private estimate), in that order
  return true_data, db_result

In [40]:
# db size=100, noise=0.1 
db, pdbs = create_db_and_parallels(100)
true_data, db_result = query(db, noise=0.1)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))

With Noise:tensor(0.4444)
Without Noise:tensor(0.4200)


In [41]:
# db size=100, noise=0.2
db, pdbs = create_db_and_parallels(100)
true_data, db_result = query(db, noise=0.2)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))

With Noise:tensor(0.5500)
Without Noise:tensor(0.5600)


In [42]:
# db size=100, noise=0.4
db, pdbs = create_db_and_parallels(100)
true_data, db_result = query(db, noise=0.4)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))

With Noise:tensor(0.5167)
Without Noise:tensor(0.4800)


In [45]:
# db size=100, noise=0.8
db, pdbs = create_db_and_parallels(100)
true_data, db_result = query(db, noise=0.8)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))

With Noise:tensor(0.3000)
Without Noise:tensor(0.4900)


In [46]:
# LARGE DATA SET !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# db size=10000, noise=0.1
db, pdbs = create_db_and_parallels(10000)
true_data, db_result = query(db, noise=0.1)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))
# With a big dataset the results are close to each other.

With Noise:tensor(0.5042)
Without Noise:tensor(0.5007)


In [47]:
# LARGE DATA SET !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# db size=10000, noise=0.2
db, pdbs = create_db_and_parallels(10000)
true_data, db_result = query(db, noise=0.2)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))
# With a big dataset the results are close to each other.

With Noise:tensor(0.4888)
Without Noise:tensor(0.4908)


In [48]:
# LARGE DATA SET !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# db size=10000, noise=0.4
db, pdbs = create_db_and_parallels(10000)
true_data, db_result = query(db, noise=0.4)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))
# With a big dataset the results are close to each other.

With Noise:tensor(0.4905)
Without Noise:tensor(0.5087)


In [49]:
# LARGE DATA SET !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# db size=10000, noise=0.8
db, pdbs = create_db_and_parallels(10000)
true_data, db_result = query(db, noise=0.8)
print("With Noise:" + str(db_result))
print("Without Noise:" + str(true_data))
# With a big dataset the results are close to each other.

With Noise:tensor(0.4815)
Without Noise:tensor(0.5060)
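
The pattern in the runs above can be explained with a rough back-of-the-envelope bound. This is a sketch, not part of the original notebook: the helper name `worst_case_std` is hypothetical, and it assumes the same convention as `query`, where `noise` is the probability that an entry is answered by a fair coin flip. Each randomized entry is a 0/1 value with variance at most 0.25, and de-skewing divides the randomized mean by `1 - noise`, which amplifies the sampling error.

```python
import math

def worst_case_std(num_entries, noise):
    """Upper bound on the standard deviation of the de-skewed estimate.

    Each randomized entry is a 0/1 value, so its variance is at most 0.25.
    Averaging num_entries of them gives a standard deviation of at most
    0.5 / sqrt(num_entries), and de-skewing divides by (1 - noise).
    """
    if not 0 <= noise < 1:
        raise ValueError("noise must be in [0, 1)")
    return 0.5 / (math.sqrt(num_entries) * (1 - noise))

# the same database sizes and noise levels tried in the cells above
for num_entries in (100, 10_000):
    for noise in (0.1, 0.2, 0.4, 0.8):
        bound = worst_case_std(num_entries, noise)
        print(f"db size={num_entries:>5}, noise={noise}: error scale <= {bound:.4f}")
```

For db size=100 and noise=0.8 the bound is about 0.25, and the two values printed for that run differ by 0.19. For db size=10000 and noise=0.8 the bound is about 0.025, and the printed values differ by roughly 0.02. Both runs are consistent with the idea that a hundred times more data buys a tenfold smaller error at the same noise level.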
