# Fairness, Accountability, Transparency and Ethics Course (FATE)

## Universitat Pompeu Fabra (UPF)
### Year 22/23
### Author: Manuel Portela (manuel.portela@upf.edu)
*** Partially based on the original exercises made by David Solans (david.solans@upf.edu) ***
<br>
<br>
<br>
Submission date: 10/02/2022 at 23:59 on Aula Global

Please, implement this notebook **individually**.

<br>
<br>

**Legend** <br>
In this notebook we use:    
<div class="alert alert-block col-md-7 alert-info">To recall information from the theory classes and other tips</div>
<div class="alert alert-block col-md-7 alert-warning">To point out important things that should not be forgotten</div>
<div class="alert alert-block col-md-7 alert-success">To indicate tasks to be done by students</div>
<div class="alert alert-block col-md-7 alert-primary bg-primary">LAB TASK</div>

Supplementary material: ARX deidentifier. [Link to ARX deidentifier video](https://www.youtube.com/watch?v=N8I-sxmMfqQ)

# LAB 1: Data anonymization with Python

# Session 01
### 1.1. K-Anonymity

The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998[1] as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful."[2][3][4] 

**A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least *k - 1* other individuals whose information also appears in the release.**


* [1] Samarati, Pierangela; Sweeney, Latanya (1998). "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression"
* [2] P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering archive Volume 13 Issue 6, November 2001.
* [3] L. Sweeney. "Database Security: k-anonymity". Retrieved 19 January 2014.
* [4] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
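To make the definition above concrete, here is a minimal sketch on a tiny hand-made table. The column names and values are hypothetical and are not taken from the lab dataset. The idea is that the k-anonymity level of a release equals the size of its smallest group of rows sharing the same quasi-identifier values.

```python
import pandas as pd

# Toy release with two quasi-identifiers (hypothetical values, not the lab dataset)
toy = pd.DataFrame({
    "Age": ["20-30", "20-30", "30-40", "30-40", "30-40"],
    "ZipCode": ["080**", "080**", "081**", "081**", "081**"],
    "Condition": ["Flu", "Asthma", "Flu", "Flu", "Diabetes"],
})

# Size of each group of rows that share the same quasi-identifier values
group_sizes = toy.groupby(["Age", "ZipCode"]).size()

# The release is k-anonymous for every k up to the smallest group size
k_level = int(group_sizes.min())

print(group_sizes)
print("k-anonymity level:", k_level)
```

In this toy table the smallest group has two rows, so the release is 2-anonymous but not 3-anonymous.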


Let's begin by importing the required Python libraries:

###  Libraries used in this notebook
You will need to install: **numpy** and **pandas**

In [1]:
## Required imports
import pandas as pd
import numpy as np

import random


In this notebook we will use a synthetic dataset created for demonstration purposes. It aims to simulate the medical records of 100 users. If you are curious about how it was created, you can check the script named *synthetic_data_generator.py*.

### 1.2 Loading our dataset

In [3]:
df = pd.read_csv("../Data/health_dataset.csv")

In [4]:
df.head()

Unnamed: 0,PatientName,SSN,PatientID,Phone,Gender,BirthDate,ZipCode,Health Condition,Weight,Height
0,Pablo de Cobo,412-92-6881,1000,+34 738 644 211,Male,20011009,8003,Bird flu,80,184
1,Fabio Menendez Velasco,265-27-3243,1001,+34742032426,Male,20010110,8003,Bird flu,79,179
2,Flavia Cabañas Cabanillas,554-97-8002,1002,+34712747631,Female,20010101,8014,Bird flu,67,172
3,Eliana Vendrell-Vilalta,990-49-6929,1003,+34 712 466 137,Female,20001009,8002,Common cold and flu,93,184
4,César Mas Silva,833-83-9451,1004,+34716 22 73 14,Male,20000101,8002,Common cold and flu,66,171


<div class="alert alert-block alert-info">
    <h2>Remember what you learned in the theory classes <h4>(T04C - slide5)</h4></h2>
    <br>
<ul>
    <li>Identifiers can unambiguously identify a person (e.g., passport number, DNI/NIE in Spain)</li>
    <li>Quasi-identifiers identify the person with some ambiguity (e.g., home address, telephone number)</li>
    <li>Confidential attributes contain sensitive information (e.g., salary, religion, ethnicity, gender orientation, diagnosis)</li>
    <li>Non-confidential attributes are what remains</li>
   
</ul>    
    
</div>

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.1] Questions about identifiers (0.5pts)</h2>
    <hr>
     <p class="mb-0">Please, check the provided dataset and answer the following questions:</p>
</div>

1. Which Identifiers do you observe in the dataset?

**Your answer:**

2. What are the Quasi-identifiers that you see in this dataset?


**Your answer:**

3. Is there any confidential attribute in it?

**Your answer:**

<div class="alert alert-block alert-warning">
    <h2>Remember:</h2>
<ul>
    <li>Identifiers should be suppressed</li>
    <li>Quasi-identifiers cannot be suppressed, because they have high analytical value</li>
    <li>Quasi-identifiers can be combined with external databases to re-identify a person</li>
   
</ul>    
    
</div>
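As a small illustration of suppressing identifiers while keeping quasi-identifiers, the sketch below drops identifier columns from a toy table. The column names (`FullName`, `NationalID`, etc.) and their classification are assumptions made for this example only; deciding which columns of the lab dataset are identifiers is part of the exercise.

```python
import pandas as pd

# Hypothetical toy records; the column roles are assumed for illustration
toy_records = pd.DataFrame({
    "FullName": ["Ana Ruiz", "Joan Pons"],      # identifier
    "NationalID": ["00000000A", "11111111B"],   # identifier
    "BirthYear": [1990, 1985],                  # quasi-identifier
    "City": ["Barcelona", "Girona"],            # quasi-identifier
    "Diagnosis": ["Asthma", "Flu"],             # confidential attribute
})

# Suppress identifiers by dropping their columns; drop() returns a new DataFrame
identifier_cols = ["FullName", "NationalID"]
toy_suppressed = toy_records.drop(columns=identifier_cols)

print(toy_suppressed.columns.tolist())
```

Because `drop` returns a new DataFrame, the original table is left untouched, which also avoids chained-assignment warnings when new columns are added later.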

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.2] Coding a selection of identifiers (0.5 pts)</h2>
    <hr>
     <p class="mb-0">Create a Python snippet that selects the subset of the dataset that contains no personally identifiable information</p>
</div>

In [3]:
## Your code here
## filtered_df = ...


<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.3] Anonymity check (1pts)</h2>
    <hr>
     <p class="mb-0">Create a Python function that checks if a DataFrame satisfies k-anonymity for a given K.</p>
</div>


<div class="alert alert-block alert-info">Tip: Use only a subset of the columns (BirthDate, ZipCode) for this exercise</div>
<br>

In [6]:
## Use a helper function that calculates the groups by value
## If using pandas, you can consider using grouping and group sizes for that
def get_values_frequencies(_df, quasi_identifier_cols):
    ## Your code here
    pass

In [7]:
## Code a function that uses the function above and determines whether a given dataset satisfies k-anonymity
# for a given K
def is_k_anonymous(_df, K, quasi_identifier_cols=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

### Testing the k-anonymity of our dataset
<div class="alert alert-block alert-success">Use the function above to check if the dataframe is 2-anonymous with respect to BirthDate and ZipCode</div>

<br>

In [8]:
## Your code here (It should return False)

False

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.4] Questions about k-anonymity (0.5 point)</h2>
    <hr>
     <p class="mb-0">Answer the following question:</p>
</div>

Does the dataset, as it is, satisfy k-anonymity for K=2? If not, what do you think should be done to achieve it?

**Your answer:**


<div class="alert alert-block alert-success">Provide an example of why this is happening</div>
<br>

**Your answer:** (You can use code for it)

### Testing the k-anonymity of another dataset

To test that the function works as expected, we will use a second dataset.

This dataset corresponds to the example given by Wikipedia in the [k-anonymity article](https://en.wikipedia.org/wiki/K-anonymity) and satisfies 2-anonymity for any combination of the attributes Gender, State and Age.
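Checking "any combination" of attributes means testing every non-empty subset of them. A minimal sketch of that enumeration with `itertools.combinations` follows, using a hand-made toy table rather than the Wikipedia data; the helper `smallest_group` is a hypothetical name introduced for this example.

```python
from itertools import combinations

import pandas as pd

# Hand-made toy table (not the Wikipedia dataset)
toy_anon = pd.DataFrame({
    "Age": ["<=20", "<=20", ">20", ">20"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "State": ["Kerala", "Kerala", "Karnataka", "Karnataka"],
})

def smallest_group(frame, cols):
    """Return the size of the smallest group of rows sharing values in cols."""
    return int(frame.groupby(list(cols)).size().min())

attributes = ["Age", "Gender", "State"]
results = {}
# Enumerate every non-empty subset of the attributes: 3 singles, 3 pairs, 1 triple
for r in range(1, len(attributes) + 1):
    for cols in combinations(attributes, r):
        results[cols] = smallest_group(toy_anon, cols)

for cols, size in results.items():
    print(cols, "-> smallest group:", size)
```

With three attributes there are seven subsets to check; a table is 2-anonymous for a given subset when its smallest group has at least two rows.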


In [9]:
anonymous_df = pd.read_csv("../Data/anonymized_dataset.csv")

In [10]:
anonymous_df.head()

Unnamed: 0,Name,Age,Gender,State,Religion,HealthCondition
0,*,>20,Female,Tamil Nadu,*,Cancer
1,*,>20,Female,Kerala,*,Viral infection
2,*,>20,Female,Tamil Nadu,*,No illness
3,*,>20,Male,Karnataka,*,No illness
4,*,>20,Female,Kerala,*,Heart-related


<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.5] Evaluate K-anonymity (0.5 pts)</h2>
    <hr>
     <p class="mb-0">Check 2-anonymity for different combinations of the three attributes</p>
</div>

<div class="alert alert-block col-md-7 alert-info">Tip: Your function should return "True" for all the following tests</div>


In [1]:
## Your code here (You can implement more than one check)

Did any of the tests fail? Why?

In [16]:
# Your answer

# Session 02: Generalization and ℓ -diversity

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.6] Anonymization through GLOBAL generalization (1 point)</h2>
    <hr>
     <p class="mb-0">Now we will work on data anonymization. For that, we will focus on two of the quasi-identifiers: BirthDate and ZipCode.</p>
         <p>Create a Python function that creates k-anonymity groups by greedy <b>GLOBAL</b> recoding -- generalization.</p>
    <p>Your function should return "True" for all the following tests.</p>
</div>

<div class="alert alert-block alert-info">Tip: to perform <b>GLOBAL</b> generalization, all the rows in the entire database are modified.</div>
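One possible way to picture global generalization is to mask the same number of trailing digits in every value of a column. The sketch below does this on a toy Series; `mask_trailing_digits` is a hypothetical helper written for this illustration, and using `*` as the masking character is an assumption, not a requirement of the exercise.

```python
import pandas as pd

def mask_trailing_digits(values, n_digits):
    """Replace the last n_digits characters of every value with '*'."""
    as_text = values.astype(str)
    # Keep the leading characters and pad with one '*' per removed character
    return as_text.map(
        lambda v: v[:max(len(v) - n_digits, 0)] + "*" * min(n_digits, len(v))
    )

# Toy ZIP codes (hypothetical values)
toy_zip = pd.Series([8003, 8014, 8002])

masked = mask_trailing_digits(toy_zip, 2)
print(masked.tolist())
```

Every row is modified in the same way, which is what makes the recoding global: after masking two digits, the three distinct ZIP codes collapse into a single value.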


In [17]:
def apply_global_generalization_to_single_column(c_values, num_digits_to_ofuscate):
    ## Your code here
    pass

def apply_global_generalization_to_df(_df, num_digits_to_ofuscate, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [2]:
## Check that the result of generalizing filtered_df for 2 digits now satisfies the k-anonymity check for K=2

## Your code here (it must return True)

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.7] Implement Global Generalization (1.5 points)</h2>
    <hr>
     <p class="mb-0">Use the function above to create a table where:</p>
     <ul>
        <li>rows = number of ZIP code digits removed</li>
        <li>columns = number of date of birth digits removed</li>
        <li>cells = k-anonymity level achieved</li>
    </ul>
</div>
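The layout of such a table can be sketched on a four-row toy dataset. Everything here is an assumption for illustration: the helpers `mask_trailing` and `k_level` are hypothetical, the toy values are hand-picked, and the k-anonymity level is taken to be the size of the smallest group.

```python
import pandas as pd

def mask_trailing(values, n_digits):
    """Replace the last n_digits characters of every value with '*'."""
    return values.astype(str).map(
        lambda v: v[:max(len(v) - n_digits, 0)] + "*" * min(n_digits, len(v))
    )

def k_level(frame, cols):
    """k-anonymity level = size of the smallest group over cols."""
    return int(frame.groupby(cols).size().min())

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "BirthDate": [20011009, 20010110, 20010101, 20001009],
    "ZipCode": [8003, 8003, 8014, 8002],
})

rows = {}
for zip_digits in range(0, 3):          # rows: ZIP code digits removed
    rows[zip_digits] = {}
    for dob_digits in range(0, 6):      # columns: birth date digits removed
        generalized = pd.DataFrame({
            "BirthDate": mask_trailing(toy["BirthDate"], dob_digits),
            "ZipCode": mask_trailing(toy["ZipCode"], zip_digits),
        })
        rows[zip_digits][dob_digits] = k_level(generalized, ["BirthDate", "ZipCode"])

# Outer keys become the index, inner keys become the columns
k_table = pd.DataFrame.from_dict(rows, orient="index")
k_table.index.name = "zip digits removed"
k_table.columns.name = "birth date digits removed"
print(k_table)
```

Reading the table row by row shows how much generalization each column needs before the smallest group grows beyond a single record.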


In [20]:
## Implement this function to create the k-anonymity table
def create_K_anonimity_table(_df, max_zipcode_digits, max_dob_digits, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [4]:
## Call the function above to compute results with 0-5 digits obfuscated for each column
## Your code here (Be sure it prints the table in the appropriate format)

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.8] Anonymization through LOCAL generalization (1.5 points)</h2>
    <hr>
     <p class="mb-0">Create a Python function that creates k-anonymity groups by greedy <b>LOCAL</b> recoding -- generalization</p>
     <p>
        Deciding how to aggregate the smaller groups in a smart way that still keeps part of the information is an arduous task that can be solved by treating the problem as a search task.
<br>
For this exercise, aggregating the smaller groups in aggregations of size k is enough. For that, you can concatenate their values to replace their original ones.
    </p>
</div>


<div class="alert alert-block alert-info">Tip: to perform <b>LOCAL</b> generalization, you first need to group by column values and then aggregate small groups.
   
</div>
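To illustrate the difference from global recoding, the sketch below only touches values whose group is smaller than k and leaves large groups unchanged. It is a simplified assumption, not the expected solution: `merge_small_groups` is a hypothetical helper that merges all small groups of a single column into one concatenated label, and it does not guarantee that the merged bucket itself reaches size k.

```python
import pandas as pd

def merge_small_groups(values, k):
    """Relabel values whose group has fewer than k rows with a concatenated label."""
    as_text = values.astype(str)
    counts = as_text.value_counts()
    # Values that appear fewer than k times form the "small" groups
    small = sorted(counts[counts < k].index)
    if not small:
        return as_text
    merged_label = "|".join(small)
    # Keep values from large groups, replace values from small groups
    return as_text.where(~as_text.isin(small), merged_label)

# Toy ZIP codes (hypothetical values): 8003 appears three times, the others once
toy_zip = pd.Series([8003, 8003, 8014, 8002, 8003])

locally_generalized = merge_small_groups(toy_zip, k=2)
print(locally_generalized.tolist())
```

Only the two singleton ZIP codes are rewritten, so the rows that already formed a large enough group keep their full precision.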

In [93]:
## Apply local generalization to the values of a DataFrame
def apply_local_generalization_to_df(_df, num_digits_to_ofuscate, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [5]:
## Call the function above to compute results with 0-5 digits obfuscated for each column
## Your code here

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.9] Evaluating LOCAL generalization (1 point)</h2>
    <hr>
     <p class="mb-0">In the local generalization implemented above, we used the trick of aggregating the smaller groups in groupings of size K.
</p>
<p>When implementing LOCAL generalization in a real environment, a more advanced approach is generally required, so that we do not lose too much information in the aggregation.
</p>
<p>For that:</p>
     <ul>
             <li>Deciding how to aggregate the smaller groups in a smart way that still keeps part of the information is an arduous task that can be solved by treating the problem as a search task.</li>
             <li>For this exercise, aggregating the smaller groups in aggregations of size k is enough. For that, you can concatenate their values to replace their original ones.
             </li>
    </ul>
    <p>Implement an alternative function for LOCAL generalization and prove that the new groupings lose less information than the approach followed in the exercise proposed above.</p>
    <p>Please include an explanation of the ideas behind your implementation.</p>
</div>


<div class="alert alert-block alert-info"> 
  TIP: you can use an ML classification/clustering technique to optimize the decision of where to apply local generalization.
</div>



In [95]:
## Apply local generalization to the values of a DataFrame
def apply_advanced_local_generalization_to_df(_df, num_digits_to_ofuscate, target_columns=["BirthDate", "ZipCode"]):
    ## Your code here
    pass

In [6]:
## Call the function above to compute results with 0-5 digits obfuscated for each column
## Your code here

### 1.2.  ℓ -diversity

Now, k-anonymity demands that we group individual rows/persons of our dataset into groups of at least k rows/persons and replace the quasi-identifier attributes of these rows with aggregate quantities, such that it is no longer possible to read the individual values. This protects people by ensuring that an adversary who knows all values of a person's quasi-identifier attributes can only find out which group a person might belong to, but cannot know if the person is really in the dataset.

A major problem with this approach is that all people in a k-anonymous group might share the same value of the sensitive attribute. An adversary who knows that a person is in that k-anonymous group can then still learn the value of that person's sensitive attribute with absolute certainty. This problem can be fixed by using an extension of k-anonymity called "ℓ-diversity": ℓ-diversity ensures that each k-anonymous group contains at least ℓ different values of the sensitive attribute. Therefore, even if an adversary can identify the group of a person, they still would not be able to find out the value of that person's sensitive attribute with certainty.

A data set is said to satisfy ℓ-diversity if, for each group of records sharing a combination of key attributes, there are at least ℓ “well represented” values for each confidential attribute. A table is said to have ℓ-diversity if every equivalence class of the table has ℓ-diversity [1, 2].

[1]: Machanavajjhala, A., Gehrke, J., Kifer, D. and Venkitasubramaniam, M., 2006, April. l-diversity: Privacy beyond k-anonymity. In 22nd International Conference on Data Engineering (ICDE’06) (pp. 24-24). IEEE.

[2]: Domingo-Ferrer, J. and Torra, V., 2008, March. A critique of k-anonymity and some of its enhancements. In 2008 Third International Conference on Availability, Reliability and Security (pp. 990-993). IEEE.

A sensitive data record is made of the following microdata types: the ID; any Key Attributes; and the confidential outcome attribute(s). ℓ -diversity seeks to extend the equivalence classes that we created using K-anonymity by generalisation and masking of the quasi-identifiers (the QI groups) to the confidential attributes in the record as well. The ℓ -diversity principle demands that, in each QI-group, at most 1/ ℓ of its tuples can have an identical sensitive attribute value.

The new ℓ-diverse equivalence classes require that the sensitive values are ‘well represented’ within each class. As an example, if we have 4 types of disease diagnosis, then for 2-diversity every equivalence class should contain at least two of the four possible diagnoses, so that an adversary cannot narrow a person's diagnosis down to a single value.
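The two readings of "well represented" given above (at least ℓ distinct values per group, and at most 1/ℓ of a group sharing the same value) can give different answers. The sketch below compares them on a toy table; the `partition` and `Diagnosis` columns and their values are hypothetical.

```python
import pandas as pd

# Toy table: "partition" labels the QI-group, "Diagnosis" is the sensitive attribute
toy = pd.DataFrame({
    "partition": ["A", "A", "A", "B", "B", "B"],
    "Diagnosis": ["Flu", "Asthma", "Flu", "Flu", "Flu", "Flu"],
})

l = 2

# Reading 1: each group must contain at least l distinct sensitive values
distinct_values = toy.groupby("partition")["Diagnosis"].nunique()
distinct_ok = distinct_values >= l

# Reading 2: no sensitive value may cover more than 1/l of the rows in a group
top_share = toy.groupby("partition")["Diagnosis"].agg(
    lambda s: s.value_counts(normalize=True).max()
)
frequency_ok = top_share <= 1 / l

print(pd.DataFrame({
    "distinct": distinct_values,
    "distinct_ok": distinct_ok,
    "top_share": top_share,
    "frequency_ok": frequency_ok,
}))
```

Group A has two distinct diagnoses and so passes the distinct-values reading for ℓ = 2, but two thirds of its rows share one diagnosis, so it fails the frequency reading. Group B fails both, since every row has the same diagnosis.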

<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.10] Evaluate ℓ -diversity (1 point)</h2>
    <hr>
     <p class="mb-0">Create a function that checks if a dataset complies with the ℓ-diversity principle by analysing each QI-group and comparing it against different values of ℓ.
</p>

</div>


<div class="alert alert-block alert-info"> 
  TIP: You might use the same target columns from the k-anonymity test as Qi-Groups.
</div>


In [82]:
def is_l_diverse(df,partitions, sensitive_column, l=2):
    #your function here
    pass

In [28]:
#defining groups and sensitive column
target_columns=[]
sensitive_column=[]




<div class="alert alert-block alert-primary bg-primary head-2">
    <h2 class="alert-heading">[L1.12] Implement ℓ -diversity check on our previous dataset (1 point)</h2>
    <hr>
     <p class="mb-0">Does our generalized dataset comply with the principle? What changes do we need to make to satisfy ℓ-diversity?
</p>

</div>


<div class="alert alert-block alert-info"> 
  TIP: Test with filtered_df, globally_generalized_df and your locally generalized dataset, using values of ℓ from 2 to 5
</div>


In [7]:
## Call the function above to compute results for each of the sensitive columns
## Your code here

In [92]:
is_l_diverse(globally_generalized_df, globally_generalized_df_partitions, sensitive_column, l=5)

80**-200110** false
80**-200101** false
80**-200010** false
80**-200001** false
