# Anonymization and k-Anonymity

## Instructions

The first half of this notebook contains code to read in and preprocess the example dataset. The second half contains questions for you to answer by writing code and describing your solutions.

## Preamble: Read in Adult dataset & Preprocessing

The dataset is based on census data. I have added the columns `Name`, `DOB`, `SSN`, and `Zip` to represent personally identifiable information (PII). The values in these columns are made up.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
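# Matplotlib 3.6 renamed 'seaborn-whitegrid' to 'seaborn-v0_8-whitegrid'; use whichever is available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')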
import pandas as pd
import numpy as np

def your_code_here():
    return 1

adult_data = pd.read_csv("adult_with_pii.csv")
adult_data.head()

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,Karrie Trusslove,9/7/67,732-14-6110,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,Brandise Tripony,6/7/88,150-19-2766,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,Brenn McNeely,8/6/91,725-59-9860,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,Dorry Poter,4/6/09,659-57-4974,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,Dick Honnan,9/16/51,220-93-3811,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# Remove the direct identifiers (Name, SSN); the DOB and Zip columns are kept
adult_anon = adult_data.drop(columns=['Name', 'SSN'])
adult_anon.head()

Unnamed: 0,DOB,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,9/7/67,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,6/7/88,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,8/6/91,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,4/6/09,25503,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,9/16/51,75387,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
# PII only
pii = adult_data[['Name', 'DOB', 'SSN', 'Zip']]
pii.head()

Unnamed: 0,Name,DOB,SSN,Zip
0,Karrie Trusslove,9/7/67,732-14-6110,64152
1,Brandise Tripony,6/7/88,150-19-2766,61523
2,Brenn McNeely,8/6/91,725-59-9860,95668
3,Dorry Poter,4/6/09,659-57-4974,25503
4,Dick Honnan,9/16/51,220-93-3811,75387


## END PREAMBLE
-------------

## Collaboration Statement

In the cell below, write your collaboration statement. This statement should describe all collaborations, even high-level ones (e.g. "I discussed my general approach for answering question 3 with Josh"). High-level collaborations of this kind are allowed as long as they are described; copying of answers or code is not allowed.

In [4]:
# In this cell (in markdown or a comment), write your collaboration statement

## Question 1

Using the dataframes `pii` and `adult_anon`, perform a linking attack to recover the names of as many samples in `adult_anon` as possible.

How many names are you able to recover?
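As a rough illustration of the idea (not a solution for the real data), a linking attack joins two tables on the columns they share. Here `adult_anon` still carries `DOB` and `Zip`, which also appear in `pii`. The sketch below uses two small invented tables, `toy_pii` and `toy_anon`, and assumes that a record counts as recovered only when its `(DOB, Zip)` pair maps to exactly one name; all names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `pii` and `adult_anon`; every value here is invented.
toy_pii = pd.DataFrame({
    "Name": ["Ann Example", "Bob Sample", "Cy Placeholder", "Di Mockup"],
    "DOB": ["1/2/70", "3/4/80", "3/4/80", "5/6/90"],
    "Zip": [10001, 20002, 20002, 30003],
})
toy_anon = pd.DataFrame({
    "DOB": ["1/2/70", "3/4/80", "3/4/80", "5/6/90"],
    "Zip": [10001, 20002, 20002, 30003],
    "Occupation": ["Sales", "Tech-support", "Craft-repair", "Sales"],
})

link_cols = ["DOB", "Zip"]  # columns present in both tables

# Count how many distinct names share each (DOB, Zip) pair.
names_per_key = toy_pii.groupby(link_cols)["Name"].nunique()

# Keep only the pairs that point to a single name (unambiguous links).
unique_keys = names_per_key[names_per_key == 1].reset_index()[link_cols]

# Join the "anonymized" rows to the names through the unambiguous pairs.
recovered = (
    toy_anon.merge(unique_keys, on=link_cols)
            .merge(toy_pii, on=link_cols)
)
num_recovered = recovered["Name"].nunique()
print(num_recovered)
```

In this toy data two people share the same `DOB` and `Zip`, so a plain join would attach both names to both rows; filtering to unique pairs is one way to count only the confident matches.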

In [5]:
# In this cell, write code to perform the linking attack

In [6]:
# In this cell, write code to determine how many names could be recovered

## Question 2

Implement a function `is_k_anonymous` to check (for a given `k`) whether a given dataframe satisfies k-Anonymity.
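For orientation, a dataframe is k-anonymous with respect to a set of quasi-identifier columns when every combination of values that occurs in those columns is shared by at least `k` rows. The sketch below shows one possible way to test this with a group-by on toy data. The function name `sketch_is_k_anonymous`, the parameter name `quasi_identifiers`, and the toy table are hypothetical, and it assumes the second argument is a list of column names.

```python
import pandas as pd

def sketch_is_k_anonymous(k, quasi_identifiers, df):
    """Return True if every combination of quasi-identifier values
    that occurs in `df` is shared by at least `k` rows."""
    if df.empty:
        return True  # no rows means no row can be singled out
    # Size of each group of rows with identical quasi-identifier values;
    # dropna=False keeps rows with missing values as their own groups.
    group_sizes = df.groupby(quasi_identifiers, dropna=False).size()
    return bool(group_sizes.min() >= k)

# Invented example: two rows share (30, 10001), three share (40, 20002).
toy = pd.DataFrame({
    "Age": [30, 30, 40, 40, 40],
    "Zip": [10001, 10001, 20002, 20002, 20002],
    "Target": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

print(sketch_is_k_anonymous(2, ["Age", "Zip"], toy))
print(sketch_is_k_anonymous(3, ["Age", "Zip"], toy))
```

The smallest group here has two rows, so the check passes for `k = 2` and fails for `k = 3`.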

In [7]:
# In this cell, write code to implement 'is_k_anonymous'

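# k: minimum group size; df: the dataframe to check;
# iqs: presumably the list of quasi-identifier column names (an assumption, not stated above)
def is_k_anonymous(k, iqs, df):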
    return False

## Question 3

In one or two sentences, informally describe how well you expect your implementation of `is_k_anonymous` to scale with the size of the input data.

In [8]:
# In this cell, describe (in markdown or in a comment) the scaling behavior of your answer to Question 2.

## Question 4 

Write code to answer the query: "how many participants have never been married?"

*Hint*: filter the `adult_data` dataframe to contain only participants who were never married, then return the 0th element of the `shape` of the filtered dataframe.
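The filter-then-count pattern from the hint can be sketched on a small invented table. The dataframe `toy` and its values are hypothetical; the column name `Martial Status` is spelled as it appears in the dataset header.

```python
import pandas as pd

# Invented rows; only the column name matches the real dataset.
toy = pd.DataFrame({
    "Martial Status": [
        "Never-married", "Divorced", "Never-married", "Married-civ-spouse",
    ],
    "Age": [39, 50, 38, 53],
})

# Boolean mask: True for rows whose status equals the target value.
mask = toy["Martial Status"] == "Never-married"

# Indexing with the mask keeps only matching rows; shape[0] is the row count.
never_married = toy[mask]
count = never_married.shape[0]
print(count)
```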

In [9]:
query1 = your_code_here()
query1

## Question 5 

In 2-5 sentences, answer the following:
- What privacy concerns does `query1` raise?
- What simple measure could limit the concerns raised by the query in Question 4?

In [10]:
# Write your answer to Question 5 here