# Fairness, Accountability, Transparency and Ethics Course (FATE)


## Universitat Pompeu Fabra (UPF)
### 21/22

**Submission date: 09/02/2021 at 12PM on Aula Global.**

Please complete this notebook **individually**.

*Originally created by David Solans (david.solans@upf.edu)*



**Legend** <br>
In this notebook we use:    
<div class="alert alert-block col-md-7 alert-info">To recall information from the theory classes and other tips</div>
<div class="alert alert-block col-md-7 alert-warning">To point out important things that should not be forgotten</div>
<div class="alert alert-block col-md-7 alert-success">To indicate tasks to be done by students</div>

Supplementary material: ARX deidentifier. [Link to ARX deidentifier video](https://www.youtube.com/watch?v=N8I-sxmMfqQ)

# 1. Data anonymization with Python

In this practice session, you will explore some of the concepts related to data anonymization.

For that, we will use Python.

## Session 01
### 1.1. K-Anonymity

The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998[1] as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful."[2][3][4] 

**A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least *k - 1* other individuals whose information also appears in the release.**


* [1] Samarati, Pierangela; Sweeney, Latanya (1998). "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression"
* [2] P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering archive Volume 13 Issue 6, November 2001.
* [3] L. Sweeney. "Database Security: k-anonymity". Retrieved 19 January 2014.
* [4] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
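
To make the definition concrete, here is a minimal sketch on a made-up release, using only the standard library. The records and the name `toy_release` are hypothetical and are not part of the course dataset. The idea is to count how many records share each combination of quasi-identifier values: the smallest count is the largest k that the release satisfies.

```python
from collections import Counter

# Hypothetical toy release: each tuple holds the quasi-identifier values
# (age range, generalized zip code) of one person. The values are made up.
toy_release = [
    ("20-29", "080**"),
    ("20-29", "080**"),
    ("30-39", "080**"),
    ("30-39", "080**"),
    ("30-39", "080**"),
]

# Count how many records share each combination of quasi-identifier values.
group_sizes = Counter(toy_release)

# The release is k-anonymous for every k up to the size of its smallest group.
achieved_k = min(group_sizes.values())
print(group_sizes)
print("Largest k satisfied:", achieved_k)
```

In this toy release every person shares their quasi-identifier values with at least one other person, so it is 2-anonymous but not 3-anonymous.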


Let's begin by importing the required Python libraries:

###  Libraries used in this notebook
You will need to install: **numpy** and **pandas**

In [3]:
## Required imports
import pandas as pd
import numpy as np

import random


In this notebook we will use a synthetic dataset created for demonstration purposes. It aims to simulate the medical records of 100 users. If you are curious about how it was created, you can check the script named *synthetic_data_generator.py* inside the *Helper Scripts* folder.

## 1.2 Loading our dataset

In [5]:
df = pd.read_csv("Data/health_dataset.csv")

In [6]:
df.head()

| Unnamed: 0 | PatientName | SSN | PatientID | Phone | Gender | BirthDate | ZipCode | Health Condition | Weight | Height |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pablo de Cobo | 412-92-6881 | 1000 | +34 738 644 211 | Male | 20011009 | 8003 | Bird flu | 80 | 184 |
| 1 | Fabio Menendez Velasco | 265-27-3243 | 1001 | +34742032426 | Male | 20010110 | 8003 | Bird flu | 79 | 179 |
| 2 | Flavia Cabañas Cabanillas | 554-97-8002 | 1002 | +34712747631 | Female | 20010101 | 8014 | Bird flu | 67 | 172 |
| 3 | Eliana Vendrell-Vilalta | 990-49-6929 | 1003 | +34 712 466 137 | Female | 20001009 | 8002 | Common cold and flu | 93 | 184 |
| 4 | César Mas Silva | 833-83-9451 | 1004 | +34716 22 73 14 | Male | 20000101 | 8002 | Common cold and flu | 66 | 171 |


<div class="alert alert-block alert-info">
    <h2>Remember what you learned in the theory classes <h4>(T04C - slide5)</h4></h2>
    <br>
<ul>
    <li>Identifiers can unambiguously identify a person (e.g., passport number, DNI/NIE in Spain)</li>
    <li>Quasi-identifiers identify the person with some ambiguity (e.g., home address, telephone number)</li>
    <li>Confidential attributes contain sensitive information (e.g., salary, religion, ethnicity, gender orientation, diagnosis)</li>
    <li>Non-confidential attributes are what remains</li>
   
</ul>    
    
</div>


## Questions (1 point):

<div class="alert alert-block alert-success">Please, check the provided dataset and answer the following questions:</div>
<br>



1. Which Identifiers do you observe in the dataset?

**Your answer:**

2. What are the quasi-identifiers that you see in this dataset?


**Your answer:**

3. Is there any confidential attribute in it?

**Your answer:**

<div class="alert alert-block alert-warning">
    <h2>Remember:</h2>
<ul>
    <li>Identifiers should be suppressed</li>
    <li>Quasi-identifiers cannot simply be suppressed because they have high analytical value</li>
    <li>Quasi-identifiers can be combined with external databases to re-identify a person</li>
   
</ul>    
    
</div>
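
The last point can be illustrated with a small linkage sketch. Both tables below are hypothetical and made up for illustration (they are not the course dataset): a release with the identifiers removed, and an external register that still contains names. Joining them on the shared quasi-identifiers shows which people end up matching exactly one released record.

```python
import pandas as pd

# Hypothetical release: identifiers removed, quasi-identifiers kept.
released = pd.DataFrame({
    "BirthDate": [19900101, 19900101, 19851231],
    "ZipCode": [8001, 8001, 8014],
    "Diagnosis": ["Asthma", "Diabetes", "Bird flu"],
})

# Hypothetical external database (e.g. a public register) that includes names.
external = pd.DataFrame({
    "Name": ["Ana", "Berta", "Carles"],
    "BirthDate": [19900101, 19900101, 19851231],
    "ZipCode": [8001, 8001, 8014],
})

# Join both tables on the quasi-identifiers they have in common.
quasi_identifiers = ["BirthDate", "ZipCode"]
linked = external.merge(released, on=quasi_identifiers, how="inner")

# A name that matches exactly one released record is re-identified.
matches_per_name = linked.groupby("Name").size()
reidentified = linked[linked["Name"].map(matches_per_name) == 1]
print(reidentified[["Name", "Diagnosis"]])
```

Here Ana and Berta each match two released records, so their diagnosis stays ambiguous, while Carles is the only person with his combination of birth date and zip code and is therefore linked to a single diagnosis.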

## Coding (0.5 pts)
<div class="alert alert-block alert-success">Create a Python snippet that selects the subset of columns containing no personally identifiable information</div>
<br>

In [35]:
## Your code here
## filtered_df = ...

### 1.3 Anonymity check

## Coding (1.5 points):

<div class="alert alert-block alert-success">Create a Python function that checks whether a DataFrame satisfies k-anonymity for a given K.</div>

<div class="alert alert-block col-md-7 alert-info">Tip: Use only a subset of the columns (BirthDate, ZipCode) for this exercise</div>
<br>

In [40]:
## Use a helper function that calculates the groups by value
## If using pandas, you can consider using grouping and group sizes for that
def get_values_frequencies(_df, quasi_identifier_cols):
    ## Your code here
    pass

In [41]:
## Code a function that uses the function above and determines whether a given dataset satisfies k-anonymity
## for a given K
def is_k_anonymous(_df, K, quasi_identifier_cols=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

### Testing the k-anonymity of our dataset
<div class="alert alert-block alert-success">Use the function above to check if the dataframe is 2-anonymous with respect to BirthDate and ZipCode</div>

<br>

In [23]:
## Your code here (It should return False)

False

## Questions (1 point):
<div class="alert alert-block alert-success">Answer the following question:</div>
<br>

Does the dataset, as it is, satisfy k-anonymity for K=2? If not, what do you think should be done to achieve it?

**Your answer:**


<div class="alert alert-block alert-success">Provide an example of why this is happening</div>
<br>

**Your answer:** (You can use code for it)

### Testing the k-anonymity of another dataset

To test that the function works as expected, we will use a second dataset.

This dataset corresponds to the example given by Wikipedia in the [k-anonymity article](https://en.wikipedia.org/wiki/K-anonymity) and satisfies 2-anonymity for any combination of the attributes Gender, State and Age.


In [24]:
anonymous_df = pd.read_csv("Data/anonymized_dataset.csv")

In [25]:
anonymous_df.head()

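| Unnamed: 0 | Name | Age | Gender | State | Religion | HealthCondition |
|---|---|---|---|---|---|---|
| 0 | * | >20 | Female | Tamil Nadu | * | Cancer |
| 1 | * | >20 | Female | Kerala | * | Viral infection |
| 2 | * | >20 | Female | Tamil Nadu | * | No illness |
| 3 | * | >20 | Male | Karnataka | * | No illness |
| 4 | * | >20 | Female | Kerala | * | Heart-related |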


## Code (1 pts)
<div class="alert alert-block col-md-7 alert-success">Check 2-anonymity for any combination of the three protected attributes in the dataset: Gender, Age and State</div>
<div class="alert alert-block col-md-7 alert-info">Tip: Your function should return "True" for all the following tests</div>
<br>
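
One possible way to enumerate every combination of the three attributes is `itertools.combinations` from the standard library. The sketch below only lists the column subsets (the variable names are illustrative); the k-anonymity check for each subset is left to your own function.

```python
from itertools import combinations

attributes = ["Gender", "Age", "State"]

# Collect every non-empty subset of the attributes: singles, pairs, and the triple.
attribute_subsets = [
    list(subset)
    for size in range(1, len(attributes) + 1)
    for subset in combinations(attributes, size)
]

for subset in attribute_subsets:
    print(subset)
```

Three attributes give seven non-empty subsets, so a full test would call the check seven times.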

In [36]:
## Your code here (You can implement more than one check)

## Session 02
### 2.1 Anonymizing the dataset

Now we will work on data anonymization. For that, we will focus on two of the quasi-identifiers: BirthDate and ZipCode.

### Anonymization through GLOBAL generalization

## Code (2 points):
<div class="alert alert-block alert-success">Create a Python function that creates k-anonymity groups by greedy <b>GLOBAL</b> recoding -- generalization</div>

<div class="alert alert-block alert-info">Tip: to perform <b>GLOBAL</b> generalization, all the rows in the entire database are modified.</div>
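
As an illustration of the idea, the sketch below masks the trailing digits of every value in a small list, so the same transformation is applied to all records. The helper name `mask_trailing_digits` and the use of `*` as the masking character are assumptions made for this example, not a required design.

```python
def mask_trailing_digits(value, num_digits):
    """Replace the last `num_digits` characters of `value` with '*'."""
    text = str(value)
    if num_digits <= 0:
        return text
    # Never mask more characters than the value actually has.
    num_digits = min(num_digits, len(text))
    return text[: len(text) - num_digits] + "*" * num_digits


# Hypothetical zip codes, all generalized with the same number of masked digits.
toy_zip_codes = [8003, 8014, 8002]
generalized = [mask_trailing_digits(z, 2) for z in toy_zip_codes]
print(generalized)  # ['80**', '80**', '80**']
```

After masking two digits, the three distinct zip codes collapse into a single value, which is what makes larger groups possible at the cost of precision.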


In [39]:
def apply_global_generalization_to_single_column(c_values, num_digits_to_ofuscate):
    ## Your code here
    pass

def apply_global_generalization_to_df(_df, num_digits_to_ofuscate, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [42]:
## Check that the result of generalizing filtered_df for 2 digits now satisfies the k-anonymity check for K=2

## Your code here (it must return True)

## Code (1 point):

<div class="alert alert-block alert-success">Use the function above to create a table where:
    <ul>
        <li>rows = #of zip digits removed</li>
        <li>columns = #of date of birth digits removed</li>
        <li>cells = k-anonymity level achieved</li>
    </ul>
</div>


In [None]:
## Implement this function to create the k-anonymity table
def create_K_anonimity_table(_df, max_zipcode_digits, max_dob_digits, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [None]:
## Call the function above to compute results with 0-5 digits obfuscated for each column
## Your code here (Be sure it prints the table in the appropriate format)

### Anonymization through LOCAL generalization

## Code (2 points):
<div class="alert alert-block alert-success">Create a Python function that creates k-anonymity groups by greedy <b>LOCAL</b> recoding -- generalization</div>

<div class="alert alert-block alert-info">Tip: to perform <b>LOCAL</b> generalization, you first need to group by column values and then aggregate small groups.
    
<br>
Deciding how to aggregate the smaller groups in a smart way that still keeps part of the information is an arduous task that can be solved by treating the problem as a search task.
    
<br>
For this exercise, merging the smaller groups into groups of size k is enough. To do so, you can concatenate their values and use the result to replace the original ones.

</div>
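
The tip above can be sketched for a single column in plain Python. The function name `merge_small_groups`, the `|` separator, and the choice to fold a leftover block into the previous one are all assumptions made for this illustration. Note that if fewer than k records fall into small groups overall, this simple approach cannot reach k on its own.

```python
from collections import Counter


def merge_small_groups(values, k):
    """Toy local recoding for one column: records whose value occurs fewer
    than k times are pooled into blocks of at least k records, and each
    block's values are replaced by their concatenation."""
    counts = Counter(values)
    # Positions of records in groups smaller than k, sorted so similar values are adjacent.
    small_idx = sorted(
        (i for i, v in enumerate(values) if counts[v] < k),
        key=lambda i: str(values[i]),
    )
    blocks = [small_idx[i:i + k] for i in range(0, len(small_idx), k)]
    # Fold an undersized last block into the previous one.
    if len(blocks) > 1 and len(blocks[-1]) < k:
        leftover = blocks.pop()
        blocks[-1].extend(leftover)
    result = [str(v) for v in values]
    for block in blocks:
        label = "|".join(sorted({str(values[i]) for i in block}))
        for i in block:
            result[i] = label
    return result


# Hypothetical zip codes: "8001" already forms a group of 3, the rest are singletons.
toy_values = ["8001", "8001", "8001", "8002", "8003", "8014", "8015"]
recoded = merge_small_groups(toy_values, k=2)
print(recoded)
```

Only the records in small groups are changed; the three `8001` records keep their original value, which is the difference with respect to global generalization.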

In [None]:
## Apply local generalization to the values of a DataFrame
def apply_local_generalization_to_df(_df, num_digits_to_ofuscate, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [None]:
## Call the function above to compute results with 0-5 digits obfuscated for each column
## Your code here

## - Optional Coding - (2 points)
<div class="alert alert-block alert-info"> If you want to try this exercise, any effort will be welcomed</div>

<br>
<br>
In the local generalization implemented above, we used the trick of aggregating the smaller groups in groupings of size K.

When implementing LOCAL generalization in a real environment, a more advanced approach is generally required, so that we do not lose too much information in the aggregation.

For that:
<div class="alert alert-block alert-success"> 
    <ol>
        <li>Implement an alternative function for LOCAL generalization or use a ML classification/clustering technique to optimize the decision where to apply local generalization.</li>
       <li>Prove that the new groupings lose less information than the approach followed in the exercise proposed above</li>
    </ol>
</div>

Please include an explanation of the ideas behind your implementation.

In [None]:
## Your code and explanation here