
In [3]:
!pip install faker

Collecting faker
Installing collected packages: faker
Successfully installed faker-2.0.1


In [0]:
from faker import Faker
import random
from random import sample
import re
import copy

In [0]:
N  = 1000              # number of people in first dataset
N2 = 10000             # number of people in second dataset

### Introduction
**!! All information is generated and fake. !!**

**Dataset 1** contains public information about the residents of Doorville, a small town somewhere in the USA. The information comes from Facebook and LinkedIn profiles, online forums, phone books, etc.
Doorville has 1000 residents, and a relatively large share of them (roughly 5%) are of Hawaiian origin.

**Dataset 2** is a hospital database of patients and their health conditions. It contains patients from Doorville and from other towns in the area, where there are no Hawaiian people: 10000 people in total.

### Helper functions


In [0]:
def show_user_info(i, dataset):
  print('ID:',i)
  print('Name:',         dataset[i][0][0])
  print('Race:',         dataset[i][0][1])
  print('Address:',      dataset[i][0][2])
  print('Email:',        dataset[i][0][3])
  print('Company:',      dataset[i][0][4])
  print('Phone number:', dataset[i][0][5])
  print('Date of birth:',dataset[i][0][6])
  print('Health:',       dataset[i][0][7],'\n')

def get_race_town(rnd):
  if rnd > 95:
    race = 'Hawaiian and Pacific islander'
  elif rnd > 70:
    race = 'African American'
  else:
    race = 'White'
  return race

def get_race_city(rnd):
  if rnd > 98:
    race = 'American Indian'
  elif rnd > 50:
    race = 'African American'
  else:
    race = 'White'
  return race

def get_health_report(rnd):
  if rnd > 90:
    health = 'Sensitive health condition'
  else:
    health = ''
  return health

### Dataset #1
Public information.  
For example, it could be taken from Facebook profiles.

In [0]:
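fake = Faker()
fake.seed(1475)     # seed Faker so names, addresses, etc. are reproducible (newer Faker releases expect Faker.seed(1475) instead)
random.seed(1475)   # also seed the random module, which drives the race and health assignment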

dataset1 = [[] for x in range(N)]
for i in range(N):
  name = fake.name()
  rnd = random.randint(0,100)
  rnd2 = random.randint(0,100)
  dataset1[i].append([name,
                   get_race_town(rnd),
                   fake.address().replace("\n", ", "),
                   name.replace(" ", ".") + '@email.com',
                   fake.company(),
                   re.sub(r'(?:x).*', '', fake.phone_number()),  # drop the phone extension (everything from 'x' on); not needed here
                   str(fake.date_of_birth(None, 18, 65)),
                   get_health_report(rnd2)
                   ])

In [8]:
# show information about person 3 and about one random person
x = random.randint(0, N - 1)   # randint is inclusive on both ends, so N itself would be out of range
show_user_info(3, dataset1)
show_user_info(x, dataset1)

ID: 3
Name: Kelsey Garcia
Race: African American
Address: 963 Aguilar Common, North Heathershire, WV 60225
Email: Kelsey.Garcia@email.com
Company: Dillon-Larsen
Phone number: +1-771-310-0928
Date of birth: 1969-01-24
Health:  

ID: 508
Name: Steven Vargas
Race: White
Address: 6922 Ramos Junction Suite 629, West Robert, CT 87381
Email: Steven.Vargas@email.com
Company: Owens, Smith and Hernandez
Phone number: 362-790-5275
Date of birth: 1969-12-12
Health: Sensitive health condition 



### Dataset #2
Hospital records. Names and other direct identifiers are removed, so the data is expected to be safe.


In [0]:
# add people from first dataset
dataset_temp = sample(dataset1, int(N/2))
dataset2 = copy.deepcopy(dataset_temp)

# add people from other towns until the dataset holds N2 records
for i in range(int(N/2), N2):
  dataset2.append([])
  name = fake.name()
  rnd = random.randint(0,100)
  rnd2 = random.randint(0,100)
  dataset2[i].append([name,
                   get_race_city(rnd),
                   fake.address().replace("\n", ", "),
                   name.replace(" ", ".") + '@email.com',
                   fake.company(),
                   re.sub(r'(?:x).*', '', fake.phone_number()), # drop the phone extension (everything from 'x' on); not needed here
                   str(fake.date_of_birth(None, 18, 65)),
                   get_health_report(rnd2)
                   ])

# remove the most sensitive information; keep only the year of the date of birth
for i in range(0, N2):
  dataset2[i][0][0] = ''
  dataset2[i][0][2] = ''
  dataset2[i][0][3] = ''
  dataset2[i][0][4] = ''
  dataset2[i][0][5] = ''
  dataset2[i][0][6] = dataset2[i][0][6][0:4]

In [11]:
dataset1

[[['Larry Gray',
   'White',
   '497 Garcia Ridge, East Matthewshire, IL 89163',
   'Larry.Gray@email.com',
   'Burgess-Harrington',
   '+1-411-974-5765',
   '1956-09-02',
   '']],
 [['Charles Hart',
   'White',
   '97168 Ward Meadows Suite 602, East Robert, ME 47199',
   'Charles.Hart@email.com',
   'Fowler-Krueger',
   '(840)554-0523',
   '1963-08-16',
   '']],
 [['Dr. Kenneth Casey',
   'White',
   '43083 Ortega Plaza, Griffinport, IN 48300',
   'Dr..Kenneth.Casey@email.com',
   'Wood-Rangel',
   '(642)184-6933',
   '1996-11-06',
   '']],
 [['Kelsey Garcia',
   'African American',
   '963 Aguilar Common, North Heathershire, WV 60225',
   'Kelsey.Garcia@email.com',
   'Dillon-Larsen',
   '+1-771-310-0928',
   '1969-01-24',
   '']],
 [['Lisa Graham',
   'African American',
   '71901 Julie Court Suite 031, Johnfurt, MN 72615',
   'Lisa.Graham@email.com',
   'Bryan-Bush',
   '158.685.8340',
   '1954-11-21',
   '']],
 [['Shannon Walter',
   'African American',
   '410 Saunders Station, M

In [12]:
dataset2

[[['', 'White', '', '', '', '', '1976', '']],
 [['', 'African American', '', '', '', '', '1994', '']],
 [['',
   'African American',
   '',
   '',
   '',
   '',
   '1959',
   'Sensitive health condition']],
 [['', 'Hawaiian and Pacific islander', '', '', '', '', '1956', '']],
 [['', 'White', '', '', '', '', '1977', '']],
 [['', 'White', '', '', '', '', '1971', '']],
 [['', 'White', '', '', '', '', '1954', '']],
 [['', 'White', '', '', '', '', '1957', 'Sensitive health condition']],
 [['', 'White', '', '', '', '', '1965', '']],
 [['', 'White', '', '', '', '', '1969', '']],
 [['', 'Hawaiian and Pacific islander', '', '', '', '', '1956', '']],
 [['',
   'African American',
   '',
   '',
   '',
   '',
   '1995',
   'Sensitive health condition']],
 [['', 'White', '', '', '', '', '1990', '']],
 [['', 'African American', '', '', '', '', '1998', '']],
 [['', 'African American', '', '', '', '', '1969', '']],
 [['', 'African American', '', '', '', '', '2000', '']],
 [['', 'White', '', '', '', ''

In [10]:
# count Hawaiian people, and Hawaiian people with health issues
h_people = 0
h_people_with_health_issues = 0 

# show Hawaiian people with sensitive health issues
for i in range(0, N2):
  if dataset2[i][0][1] == 'Hawaiian and Pacific islander':
    h_people += 1
  if dataset2[i][0][1] == 'Hawaiian and Pacific islander' and dataset2[i][0][7] == 'Sensitive health condition':
    h_people_with_health_issues += 1
    show_user_info(i, dataset2)

h_people, h_people_with_health_issues

ID: 339
Name: 
Race: Hawaiian and Pacific islander
Address: 
Email: 
Company: 
Phone number: 
Date of birth: 1966
Health: Sensitive health condition 

ID: 348
Name: 
Race: Hawaiian and Pacific islander
Address: 
Email: 
Company: 
Phone number: 
Date of birth: 1963
Health: Sensitive health condition 



(22, 2)

In [0]:
# show information about some users
x = random.randint(0, 100)
#show_user_info(3, dataset2)
#show_user_info(x, dataset2)

### Possible data leak
Imagine that the hospital publishes health statistics broken down by age group and race.

There are so few **Hawaiian** people in the dataset that singling this group out by age group, gender, or health condition risks exposing who has a specific health issue.

In the example above, only 2 of the 22 Hawaiian patients have a sensitive health condition, and the two were born in different years.
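
To make the risk concrete, here is a minimal sketch of a check a publisher could run before releasing such a breakdown: count the records in each (race, birth decade) cell and flag the cells that fall below a minimum size. The rows in `toy_hospital`, the helper names `decade` and `small_cells`, and the threshold of 3 are all assumptions made for illustration; they are not part of the datasets generated above.

```python
from collections import Counter

# Made-up rows in the same layout as the anonymized hospital data:
# [name, race, address, email, company, phone, birth_year, health]
toy_hospital = [
    [['', 'White', '', '', '', '', '1976', '']],
    [['', 'White', '', '', '', '', '1971', 'Sensitive health condition']],
    [['', 'White', '', '', '', '', '1979', '']],
    [['', 'African American', '', '', '', '', '1994', '']],
    [['', 'African American', '', '', '', '', '1995', 'Sensitive health condition']],
    [['', 'Hawaiian and Pacific islander', '', '', '', '', '1966', 'Sensitive health condition']],
    [['', 'Hawaiian and Pacific islander', '', '', '', '', '1956', '']],
]

def decade(year):
    """Map a birth year string such as '1966' to its decade label, '1960s'."""
    return year[:3] + '0s'

def small_cells(dataset, min_size):
    """Return the (race, decade) cells that hold fewer than min_size records."""
    counts = Counter((row[0][1], decade(row[0][6])) for row in dataset)
    return {cell: n for cell, n in counts.items() if n < min_size}

# Hypothetical threshold: any cell with fewer than 3 people is treated as risky.
risky = small_cells(toy_hospital, min_size=3)
for cell, n in sorted(risky.items()):
    print(cell, n)
```

In this toy data both Hawaiian cells contain a single person, so a table that reports health conditions per cell would describe one identifiable individual.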

The dataset is relatively large (10 000 records), yet the privacy of a small group can still be violated easily: the smaller the group, the more a report that singles it out exposes about its individual members.
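
The sketch below illustrates one way such exposure could happen, assuming an attacker joins the public data with the hospital release on the two fields that survive anonymization: race and birth year. The records, the names in them, and the helper `link_records` are hypothetical and exist only for this example.

```python
# Hypothetical toy data in the notebook's record layout:
# [name, race, address, email, company, phone, date_of_birth, health]
public_rows = [
    [['Alice Example', 'Hawaiian and Pacific islander', '1 Main St', 'Alice.Example@email.com', 'Acme', '555-0100', '1966-03-14', '']],
    [['Bob Example', 'White', '2 Main St', 'Bob.Example@email.com', 'Acme', '555-0101', '1976-07-01', '']],
    [['Carol Example', 'White', '3 Main St', 'Carol.Example@email.com', 'Acme', '555-0102', '1976-11-23', '']],
    [['Dan Example', 'African American', '4 Main St', 'Dan.Example@email.com', 'Acme', '555-0103', '1994-05-09', '']],
]
hospital_rows = [
    [['', 'Hawaiian and Pacific islander', '', '', '', '', '1966', 'Sensitive health condition']],
    [['', 'White', '', '', '', '', '1976', 'Sensitive health condition']],
    [['', 'African American', '', '', '', '', '1994', '']],
]

def link_records(public, hospital):
    """Map hospital row index -> name when exactly one public record
    shares the same race and birth year."""
    matches = {}
    for h_id, h in enumerate(hospital):
        race, year = h[0][1], h[0][6]
        candidates = [p[0][0] for p in public
                      if p[0][1] == race and p[0][6][:4] == year]
        if len(candidates) == 1:          # unique match: re-identified
            matches[h_id] = candidates[0]
    return matches

matches = link_records(public_rows, hospital_rows)
for h_id, name in matches.items():
    print(h_id, name, '|', hospital_rows[h_id][0][7])
```

The White patient born in 1976 stays ambiguous because two public records match, while the only Hawaiian person born in 1966 is linked to a name together with the health field.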