--------------------------------------------------------------------------------

<div style="text-align: center;">
    <h1 style="font-weight: bold;">Pseudonymisation </h1>
</div>

-------------------------------------------------------------------

# **Introduction**

In today's data-driven world, safeguarding personal information is crucial for privacy and security. One common way to protect privacy while preserving the utility of data is pseudonymisation. This project demonstrates how pseudonymisation works through a worked example, and this notebook walks through what the **pseudonymise_your_data.py** script does.

Key Concepts:
* Definition
* Tokenisation: a reversible pseudonymisation method
* Quasi-Identifiers
* Limits of Pseudonymisation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import uuid  # For generating unique IDs

**1) Definition**

`Pseudonymisation` is a data protection technique that involves replacing personally identifiable information (PII) within a dataset with pseudonyms or artificial identifiers. This process helps to mitigate the risk of re-identifying individuals while maintaining the ability to perform data processing, analysis, and other operations. The main goal of pseudonymisation is to protect privacy and limit exposure to personal data, making it particularly useful in **compliance with data protection laws like the General Data Protection Regulation (GDPR).**

Pseudonymisation can be achieved through various methods, including data encryption, tokenization, or hashing. The process may be **reversible**, where the pseudonym can be linked back to the original data with the use of an additional key or mapping, or **irreversible**, where the original data cannot be reconstructed, providing a higher level of privacy protection.
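To make the distinction concrete, here is a minimal sketch that contrasts the two families on a toy list of names. It is illustrative only: the secret key, the helper names `keyed_hash` and `tokenise`, and the sample names are assumptions made for this example and are not taken from `pseudonymise_your_data.py`.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; in practice it would be generated once and stored apart from the data.
SECRET_KEY = secrets.token_bytes(32)

def keyed_hash(value, key=SECRET_KEY):
    """One-way pseudonym: HMAC-SHA256 of the value, truncated to 16 hex characters."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def tokenise(values):
    """Reversible pseudonyms: random tokens plus a lookup table that must be kept separately."""
    forward = {}
    for value in values:
        if value not in forward:
            forward[value] = secrets.token_hex(8)  # 16 random hex characters
    reverse = {token: value for value, token in forward.items()}
    return forward, reverse

names = ["Alice Example", "Bob Example", "Alice Example"]  # made-up names

# One-way: the same input always gives the same pseudonym, but there is no lookup table.
hashed = [keyed_hash(n) for n in names]

# Reversible: the lookup table maps each token back to the original value.
forward, reverse = tokenise(names)
tokens = [forward[n] for n in names]
recovered = [reverse[t] for t in tokens]
```

The keyed hash cannot be inverted directly, although anyone who holds the key can still test guesses against it; the token table, by contrast, restores the original values to whoever holds it.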

**2) Tokenization**

In this section, we will focus on **tokenization**, where sensitive data is replaced by randomly generated tokens. Unlike irreversible methods like hashing with salt, tokenization allows for re-identification when needed.

`dataset_with_pii.csv` is a modified version of the [UCI Adult Income](https://archive.ics.uci.edu/dataset/2/adult) dataset which contains demographic information from the 1994 US Census. The associated classification task is to predict whether a particular individual earns more than $50,000 per year. 

In [2]:
# Specify the path to your CSV file (relative to the project folder, like the other paths in this notebook)
file_path = "data_resources/dataset_with_pii.csv"

# Read the CSV file into a Pandas DataFrame
data = pd.read_csv(file_path)

In [3]:
# Print the first 3 rows
data.head(3)

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,Karrie Trusslove,9/7/1967,732-14-6110,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,Brandise Tripony,6/7/1988,150-19-2766,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,Brenn McNeely,8/6/1991,725-59-9860,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


Column definitions:

* DOB = date of birth

* SSN = Social Security number

* fnlwgt = final weight, weights on the Current Population Survey (CPS) files

* Education-Num = number assigned to the different ‘Education’ types

* Target = earns more or less than $50,000/year

`Personally Identifiable Information (PII)`, or direct identifiers, are attributes that explicitly identify a subject. In this dataset, these are **Name**, **DOB** and **SSN**. Attributes such as zip code, marital status and sex are not considered explicit identifiers on their own, because they must be combined with other data to identify an individual.

&nbsp;

To pseudonymize our dataset in a reversible way, we remove SSN and DOB and replace each name with a token. The name-to-token mapping is stored in a separate CSV file.

In [4]:
data_WDI = data.drop(columns=['SSN', 'DOB'])  # dataset without direct identifiers (Name is tokenised in the next cells)

In [5]:
name_map = {name: str(uuid.uuid4())[:8] for name in data["Name"].unique()}  # 8-char token

# Get the first 5 key-value pairs
first_5_pairs = list(name_map.items())[:5]
print(first_5_pairs)

[('Karrie Trusslove', '56f45652'), ('Brandise Tripony', 'c1dd5973'), ('Brenn McNeely', 'd7548254'), ('Dorry Poter', '9161b5e3'), ('Dick Honnan', '55836e75')]
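One caveat about the cell above: truncating a UUID to 8 hexadecimal characters leaves only 32 bits of randomness. With roughly 32,000 distinct names, the birthday bound puts the chance that two different names receive the same token on the order of 10%, and such a collision would make the reverse mapping ambiguous. The sketch below redraws a token whenever it has already been used. The helper name `make_token_map` and the sample list are hypothetical, introduced only for this illustration.

```python
import uuid

def make_token_map(names, length=8):
    """Map each distinct name to a distinct random token of `length` hex characters."""
    token_map = {}
    used = set()
    for name in dict.fromkeys(names):  # distinct names, original order preserved
        token = uuid.uuid4().hex[:length]
        while token in used:  # redraw on the rare collision
            token = uuid.uuid4().hex[:length]
        used.add(token)
        token_map[name] = token
    return token_map

sample_names = ["Karrie Trusslove", "Brandise Tripony", "Karrie Trusslove"]
token_map = make_token_map(sample_names)
```

In the notebook, `make_token_map(data["Name"])` could stand in for the dictionary comprehension; keeping more characters of the UUID is another way to make collisions negligible.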


In [6]:
data_WDI["Name"] = data["Name"].map(name_map)
data_WDI = data_WDI.rename(columns={'Name': 'ID'})
data_WDI.head(3)

Unnamed: 0,ID,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,56f45652,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,c1dd5973,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,d7548254,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


In [7]:
# Save pseudonymized dataset
data_WDI.to_csv("data_resources/pseudonymized_data_from_jupyter.csv", index=False)

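# Save the mapping so the tokens can be reversed later.
# Note: a plain CSV is not secure by itself; keep this file apart from the pseudonymized data and restrict access to it.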
mapping_data = pd.DataFrame(list(name_map.items()), columns=["Original Name", "Pseudonym"])
mapping_data.to_csv("data_resources/name_mapping.csv", index=False)

mapping_data.head(3)

Unnamed: 0,Original Name,Pseudonym
0,Karrie Trusslove,56f45652
1,Brandise Tripony,c1dd5973
2,Brenn McNeely,d7548254


If you need to **de-pseudonymize** the data, reload the mapping and invert it so that each token points back to its original name:

In [8]:
# Load mapping to reverse
mapping_df = pd.read_csv("data_resources/name_mapping.csv")
name_map = dict(zip(mapping_df["Pseudonym"], mapping_df["Original Name"]))

In [9]:
# Load pseudonymized data
df = pd.read_csv("data_resources/pseudonymized_data_from_jupyter.csv")

# Reverse the tokenization
df["ID"] = df["ID"].map(name_map)
df = df.rename(columns={'ID': 'Name'})
df.head(3)

Unnamed: 0,Name,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,Karrie Trusslove,64152,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,Brandise Tripony,61523,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,Brenn McNeely,95668,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
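Before relying on the reversal, it can be worth checking that the round trip is lossless. The sketch below does this on a toy DataFrame; the names, ages and tokens are made up for the example, and the variable names are assumptions rather than part of the notebook.

```python
import pandas as pd

# Toy data with hand-written tokens (illustrative values only).
original = pd.DataFrame({"Name": ["Ann A", "Ben B", "Ann A"], "Age": [30, 41, 30]})
name_to_token = {"Ann A": "a1b2c3d4", "Ben B": "e5f6a7b8"}

# Forward step: replace Name by its token in an ID column.
pseudonymised = original.assign(ID=original["Name"].map(name_to_token)).drop(columns="Name")

# Reverse step: invert the mapping and restore the names.
token_to_name = {token: name for name, token in name_to_token.items()}
if len(token_to_name) != len(name_to_token):
    # Two names sharing one token would silently lose a name in the inverted map.
    raise ValueError("token collision: the mapping cannot be reversed unambiguously")

restored = pseudonymised.assign(Name=pseudonymised["ID"].map(token_to_name)).drop(columns="ID")

# The round trip is lossless if every name comes back unchanged and none is missing.
round_trip_ok = (
    restored["Name"].tolist() == original["Name"].tolist()
    and not restored["Name"].isna().any()
)
```

A missing value after `map` would indicate a token that is absent from the mapping file, for example because the mapping and the pseudonymised data come from different runs.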


&nbsp;

**3) Re-identification attack**

So far we have removed DOB and SSN and replaced Name with a token, which eliminates the explicit identifiers from this dataset.

**However, could other attributes contribute to the re-identification of a person from our pseudonymised dataset?**

Let's take the example of the `Faith McCloughlin` record from the original dataset:

In [10]:
# Print his record from the original dataset
McCloughlin_row = data[data['Name']=='Faith McCloughlin']
McCloughlin_row.head(3)

Unnamed: 0,Name,DOB,SSN,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
11019,Faith McCloughlin,12/25/1984,718-79-8158,41277,30,Private,132565,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,0,40,United-States,<=50K


In [11]:
# Save his zip code (named zip_code so that it does not shadow the built-in zip() used in cell 8)
zip_code = McCloughlin_row['Zip'].values[0]
print("zip: " + str(zip_code))

zip: 41277


Knowing that `Faith McCloughlin` is present in this pseudonymised dataset, can we re-identify him and retrieve his information from his zip code alone?

In [12]:
data_WDI[data_WDI['Zip'] == zip_code]

Unnamed: 0,ID,Zip,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
11019,8e094313,41277,30,Private,132565,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,0,40,United-States,<=50K


Yes, it is possible. In the code above, we filtered `data_WDI` to keep only the individuals with the same ZIP code as McCloughlin. The result contains **only McCloughlin’s record**, which means that **his ZIP code alone is enough to uniquely identify him**. An attacker who knows this ZIP code learns his token and every other attribute in the row.


This raises an important question: How many individuals can be uniquely identified based on each attribute?

In [13]:
for col in data.columns:
    counts = data[col].value_counts()
    unique_count = int((counts == 1).sum())  # number of values that occur exactly once
    percentage = round(unique_count * 100 / data.shape[0], 2)
    print(f"Number of people identifiable from {col}: {unique_count}, i.e. {percentage} %")

Number of people identifiable from Name: 32561, i.e. 100.0 %
Number of people identifiable from DOB: 7845, i.e. 24.09 %
Number of people identifiable from SSN: 32559, i.e. 99.99 %
Number of people identifiable from Zip: 23513, i.e. 72.21 %
Number of people identifiable from Age: 2, i.e. 0.01 %
Number of people identifiable from Workclass: 0, i.e. 0.0 %
Number of people identifiable from fnlwgt: 15330, i.e. 47.08 %
Number of people identifiable from Education: 0, i.e. 0.0 %
Number of people identifiable from Education-Num: 0, i.e. 0.0 %
Number of people identifiable from Marital Status: 0, i.e. 0.0 %
Number of people identifiable from Occupation: 0, i.e. 0.0 %
Number of people identifiable from Relationship: 0, i.e. 0.0 %
Number of people identifiable from Race: 0, i.e. 0.0 %
Number of people identifiable from Sex: 0, i.e. 0.0 %
...

As expected, Name and SSN identify almost every individual directly, and an exact DOB alone singles out about a quarter of them. However, the **zip code also carries a high risk of re-identification**: 72% of the records have a zip code that nobody else shares. We therefore treat it as an identifier too and remove it from the pseudonymised dataset:

In [14]:
data_WDI = data_WDI.drop(columns=['Zip'])  # also drop Zip, which singles out most individuals
data_WDI.head(3)

Unnamed: 0,ID,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,56f45652,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,c1dd5973,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,d7548254,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


&nbsp;

**4) Quasi-identifiers (QIDs)**

What would happen if we kept the date of birth but generalized the data to include only the year of birth? **How could an attacker exploit the combination of multiple attributes?**

In [15]:
data_WDI['DOB'] = pd.to_datetime(data['DOB']).dt.year
data_WDI.head(3)

Unnamed: 0,ID,Age,Workclass,fnlwgt,Education,Education-Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target,DOB
0,56f45652,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,1967
1,c1dd5973,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,1988
2,d7548254,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,1991


In [16]:
# Group by 'Country' and 'DOB', then count the number of people in each group
grouped_data = data_WDI.groupby(['Country', 'DOB']).size().reset_index(name='count')

# Limit to the first 30 rows
examples_grouped_data = grouped_data.head(30)

# Print one line per (Country, year of birth) group
for index, row in examples_grouped_data.iterrows():
    print(f"{row['Country']} {row['DOB']} count: {row['count']}")

Cambodia 1953 count: 1
Cambodia 1958 count: 1
Cambodia 1959 count: 1
Cambodia 1960 count: 1
Cambodia 1964 count: 1
Cambodia 1966 count: 1
Cambodia 1972 count: 1
Cambodia 1976 count: 2
Cambodia 1982 count: 1
Cambodia 1985 count: 1
Cambodia 1986 count: 2
Cambodia 1991 count: 1
Cambodia 1994 count: 1
Cambodia 1995 count: 1
Cambodia 1996 count: 1
Cambodia 2002 count: 1
Cambodia 2003 count: 1
Canada 1950 count: 1
Canada 1951 count: 3
Canada 1952 count: 2
Canada 1953 count: 2
Canada 1955 count: 5
Canada 1956 count: 2
Canada 1957 count: 2
Canada 1958 count: 3
Canada 1960 count: 1
Canada 1961 count: 3
Canada 1962 count: 1
Canada 1964 count: 2
Canada 1965 count: 1


In [17]:
# Filter the rows where count equals 1
identified_people = grouped_data[grouped_data['count'] == 1]

# Get the total number of individuals that can be identified
identified_count = identified_people['count'].sum()

total = data_WDI.shape[0]

# Display the result
print(f"Total identifiable individuals out of {total} people : {identified_count}")

Total identifiable individuals out of 32561 people : 693


By combining the year of birth and the country of origin, an attacker can uniquely identify 693 individuals, even though neither attribute is a direct identifier. Attributes that identify a person only when combined with others are known as [`quasi-identifiers (QIDs)`](https://csrc.nist.gov/glossary/term/quasi_identifier).
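The same count can be generalised to any combination of columns. The helper below is a minimal sketch: the function name `count_singled_out` and the toy DataFrame are assumptions introduced for illustration, not part of the original script.

```python
import pandas as pd

def count_singled_out(df, quasi_identifiers):
    """Count the rows whose combination of quasi-identifier values is unique in df."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return int((group_sizes == 1).sum())

# Toy data (made up): four people described by three candidate quasi-identifiers.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Cambodia", "Canada"],
    "DOB": [1955, 1955, 1976, 1960],
    "Sex": ["Male", "Female", "Male", "Male"],
})

by_country_year = count_singled_out(toy, ["Country", "DOB"])        # 2 unique combinations
by_all_three = count_singled_out(toy, ["Country", "DOB", "Sex"])    # all 4 rows are unique
```

Applied to the notebook's data, `count_singled_out(data_WDI, ['Country', 'DOB'])` performs the same computation as the two cells above. Adding a column to the list can only keep or increase the number of singled-out rows, which is why every extra released attribute raises the re-identification risk.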

**5) Limits of pseudonymisation**

Pseudonymisation is a valuable technique for enhancing data privacy by replacing personally identifiable information with pseudonyms, reducing the risk of direct identification. However, its effectiveness has significant limitations. One major weakness is that pseudonymised **data can still be re-identified when combined with other datasets or auxiliary information**, particularly as computational techniques advance. Additionally, pseudonymisation **does not eliminate all privacy risks, as patterns within the data may still reveal sensitive insights** about individuals. This makes it an insufficient standalone measure for robust data protection, especially in contexts where data sharing and linkage are common.

---------------------------------------------------------------------------------

# **Conclusion**

In conclusion, pseudonymisation is a useful but imperfect tool for data privacy: it provides some protection but does not guarantee anonymity. To address these shortcomings, more advanced techniques like differential privacy are emerging. Differential privacy adds carefully calibrated statistical noise to the results computed from a dataset, so that the presence or absence of any single individual has only a limited, quantifiable effect on what is released, even when several releases are combined. This approach offers stronger privacy guarantees and is increasingly being adopted in areas like official statistics, machine learning, and large-scale data analytics. We will explore this approach further in a separate project.
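As a small preview of the idea, the sketch below applies the Laplace mechanism to a counting query. It is a toy illustration under stated assumptions: the helper name `laplace_count`, the count of 1000 and the privacy parameter `epsilon=0.5` are hypothetical values chosen for the example, not results from this dataset.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    scale = 1.0 / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(seed=0)
true_count = 1000  # hypothetical answer to a query such as "how many people match X?"

# Each release is the true count plus fresh random noise.
noisy_counts = [laplace_count(true_count, epsilon=0.5, rng=rng) for _ in range(5)]
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one gives more accurate but less protected answers.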

&nbsp;

<span style="font-size: 11px;">Copyright © 2024 Laure Bouzerand. All rights reserved.
This notebook and its source code is protected by copyright law. You may not reproduce, distribute, or use this work without the express written permission of the copyright holder.
</span>