<a href="https://colab.research.google.com/github/nalymugwe/data_anonymity/blob/main/Data_Anonymity_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Data Privacy**

Data privacy, also known as information privacy, is concerned with the proper handling of data, specifically the consent, notice, and regulatory obligations that apply to personal data. Personal data, often referred to as PII (Personally Identifiable Information), is information that could potentially identify a specific individual. Examples of PII include, but are not limited to, full name, ID number, address, email address, and telephone number.

Because different laws, rules, and standards apply, the precise definition of data privacy varies considerably by region and industry. The European Union's General Data Protection Regulation (GDPR) and Kenya's Data Protection Act, 2019 are examples of comprehensive data protection regulations. It is therefore critical to understand these rules and to have a working knowledge of data handling, especially for those who work in the data industry.

In addition to any rules set by an organization or country, there are programming techniques that can be used to anonymize the data at hand. Below, we'll examine each of these techniques separately.


## **1. Load data and import required libraries**

In [1]:
# Install the faker library
!pip -q install faker

In [16]:
# import the required libraries
import pandas as pd
import numpy as np
import re
import random

import hashlib
from faker import Faker
import secrets 
from uuid import uuid4

In [3]:
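# Create a datasets directory on Google Colab to host the files
# (a separate `!cd` is not needed: each `!` command runs in its own subshell,
# and wget below is pointed at ./datasets explicitly)
!rm -rf datasets
!mkdir -p datasets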

# Fetch the files from Github
!wget -q --show-progress "https://raw.githubusercontent.com/nalymugwe/data_anonymity/main/TopRichestInWorld.csv" -P ./datasets



In [4]:
# Load the dataset and preview the data
data = pd.read_csv('/content/datasets/TopRichestInWorld.csv')
data.head()

Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,Elon Musk,30593231,718961429,United States
1,Jeff Bezos,37826627,796223116,United States
2,Bernard Arnault & family,30510802,700939457,France
3,Bill Gates,35009505,704851206,United States
4,Warren Buffett,36910374,719522691,United States


In [5]:
# Get the shape of the data

data.shape

(101, 4)

### **2a. Faker Library**

The Faker library generates realistic fake data, such as names, addresses, phone numbers, and email addresses. It can be used to replace sensitive information with synthetic data while preserving the structure and format of the original data.

In [6]:
# Create an instance for faker
fake = Faker()

# Make a copy of the data and name it faker_data
faker_data = data.copy()

# Create fake values for the 'Name', 'ID Number' and 'Phone Number' columns
faker_data['Name'] = [fake.name() for _ in range(len(faker_data))]
faker_data['ID Number'] = [fake.unique.random_number(digits=5) for _ in range(len(faker_data))]
faker_data['Phone Number'] = [fake.phone_number() for _ in range(len(faker_data))]

#Review the data
faker_data.head()

Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,Michael Coleman,83674,(427)558-6047,United States
1,Michelle Davis,35148,0643509245,United States
2,Elizabeth Stanley,86707,001-777-048-9994,France
3,Samuel Miller,58683,001-384-207-5905x91169,United States
4,Jennifer Le,42797,4610316975,United States


In [7]:
#Reconfirm the shape

faker_data.shape

(101, 4)
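
Note that the Faker output above mixes several phone formats and produces 5-digit IDs, whereas the original columns hold 9-digit phone numbers and 8-digit IDs. If keeping the original format matters, one option is to generate the replacement values directly. The following is a minimal sketch using only the standard library; the helper names `fake_id_numbers` and `fake_phone_numbers`, the `prefix="7"` default, and the seed values are assumptions made for illustration, not part of the notebook.

```python
import random

def fake_id_numbers(n, digits=8, seed=None):
    """Return n distinct random integers with exactly `digits` digits."""
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits
    # Sampling without replacement guarantees that no ID repeats
    return rng.sample(range(low, high), n)

def fake_phone_numbers(n, digits=9, prefix="7", seed=None):
    """Return n random digit strings of length `digits` starting with `prefix`.

    Unlike the IDs, these are not guaranteed to be unique.
    """
    rng = random.Random(seed)
    tail = digits - len(prefix)
    return [
        prefix + "".join(rng.choice("0123456789") for _ in range(tail))
        for _ in range(n)
    ]

ids = fake_id_numbers(5, seed=0)
phones = fake_phone_numbers(5, seed=0)
print(ids)
print(phones)
```

With the DataFrame above, the lists could be assigned in the same way as the Faker values, for example `faker_data['ID Number'] = fake_id_numbers(len(faker_data))`. The `random` module is not cryptographically secure, which is acceptable here because the values are meant to be fake rather than secret.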

### **2b. Data Masking**

This technique involves replacing some or all of the data in a particular field with asterisks, X's, or other placeholders.

In [8]:
def mask_data(value):
    # The parameter is named `value` so it does not shadow the `data` DataFrame
    value = str(value)
    # return '*' * len(value)  # masks every character with an asterisk
    return value[:2] + '*' * (len(value) - 2)  # keeps the first two characters visible and masks the rest with asterisks

#Make a copy of the original data and name it masked_data
masked_data = data.copy()

#mask the required columns
masked_data['Name'] = masked_data['Name'].apply(mask_data)
masked_data['Phone Number'] = masked_data['Phone Number'].apply(mask_data)
masked_data['ID Number'] = masked_data['ID Number'].apply(mask_data)

#view the masked data
masked_data.head()


Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,El*******,30******,71*******,United States
1,Je********,37******,79*******,United States
2,Be**********************,30******,70*******,France
3,Bi********,35******,70*******,United States
4,Wa************,36******,71*******,United States


In [9]:
# Confirm the shape of the data

masked_data.shape

(101, 4)
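
The function above keeps the first two characters visible. A common variation keeps the trailing characters instead, as is often done for phone or card numbers. The sketch below is one way to write it; `mask_keep_last` and its `visible` and `placeholder` parameters are hypothetical names introduced here, not part of the notebook.

```python
def mask_keep_last(value, visible=4, placeholder="*"):
    """Mask every character except the last `visible` ones."""
    text = str(value)
    if visible <= 0 or len(text) <= visible:
        # Nothing to reveal, or the value is too short to mask partially:
        # hide it completely rather than expose the whole value
        return placeholder * len(text)
    return placeholder * (len(text) - visible) + text[-visible:]

print(mask_keep_last(718961429))        # *****1429
print(mask_keep_last("Elon Musk", 2))   # *******sk
print(mask_keep_last(123))              # ***
```

Hiding short values entirely is a design choice: a three-digit value with four visible characters would otherwise be shown unmasked.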

### **2c. Data Shuffling**

This method involves randomly rearranging the values within a column so that the original values still exist, but their connection to the rest of the data in each row is broken. It is useful when the individual values are important, but the association between them is not.


In [10]:
#Make a copy of the data and call it shuffle_data
shuffle_data = data.copy()

# Shuffle the values in the 'Name', 'ID Number' and 'Phone Number' columns (each column independently)
shuffle_data['Name'] = shuffle_data['Name'].sample(frac=1).reset_index(drop=True)
shuffle_data['ID Number'] = shuffle_data['ID Number'].sample(frac=1).reset_index(drop=True)
shuffle_data['Phone Number'] = shuffle_data['Phone Number'].sample(frac=1).reset_index(drop=True)

# View the data
shuffle_data.head()


Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,Francoise Bettencourt Meyers & family,33285767,728003875,United States
1,John Mars,25030796,711891279,United States
2,Steve Ballmer,36636397,793018716,France
3,Vladimir Lisin,35436577,111267739,United States
4,Elon Musk,37958285,713548735,United States


In [11]:
#Confirm the shape of the data
shuffle_data.shape

(101, 4)
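
To make the two properties of shuffling concrete (the values survive, the row order does not), here is a small standard-library sketch. `shuffle_column` and the seed value are assumptions for illustration; the names are taken from the preview of the dataset above.

```python
import random
from collections import Counter

def shuffle_column(values, seed=None):
    """Return a shuffled copy of `values`; the input list is left unchanged."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

names = ["Elon Musk", "Jeff Bezos", "Bill Gates", "Warren Buffett"]
shuffled_names = shuffle_column(names, seed=42)

# Same values with the same frequencies, so column-level statistics survive
assert Counter(shuffled_names) == Counter(names)
# The same seed gives the same order, which makes the result reproducible
assert shuffle_column(names, seed=42) == shuffled_names
print(shuffled_names)
```

In pandas, the comparable way to get a repeatable shuffle is to pass `random_state` to `sample`, for example `data['Name'].sample(frac=1, random_state=42)`. A random shuffle can leave some values in their original row, so it does not guarantee that every row changes.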

### **2d. Data Tokenization**

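This involves replacing sensitive data with tokens (usually random strings of characters). In the cell below, each distinct value in a column receives one random 16-character hexadecimal token, so a value that repeats within a column always maps to the same token.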

In [12]:
# Create a function to tokenize a series
def tokenize_series(series):
    # Create a dictionary to store the mapping between original values and tokens
    token_dict = {value: secrets.token_hex(8) for value in series.unique()}
    
    # Replace the original values with tokens
    return series.map(token_dict)

#Create a copy of the original data and name it token_data
token_data = data.copy()

# Tokenize the variables of interest: 'Name', 'ID Number' and 'Phone Number'
token_data['Name'] = tokenize_series(token_data['Name'])
token_data['ID Number'] = tokenize_series(token_data['ID Number'])
token_data['Phone Number'] = tokenize_series(token_data['Phone Number'])

#View the dataset
token_data.head()

Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,5d161f7fc21bf028,d88056e9aa276f18,7cc2a96e1d2e25fe,United States
1,6a861b90f844c931,989eb7379cc82f48,892a31cc9fbdf927,United States
2,afa4315cf4bf2047,99ca207c52902bf3,1ba50c850991ffcf,France
3,e5d332b2a513514e,3185123766edef55,8df388e24a79d25f,United States
4,8e36b543f6b548a3,59e8d1c5b00eba9a,a57d33211e5b0309,United States


In [13]:
#Confirm the shape of the data
token_data.shape

(101, 4)
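
The `tokenize_series` function above discards its mapping once it returns, so the tokens cannot be turned back into the original values. When authorized users need to reverse the process, the mapping can be kept in a separate lookup, often called a token vault. The following sketch shows the idea with plain lists and dictionaries; `tokenize_with_vault` is a hypothetical helper, not part of the notebook.

```python
import secrets

def tokenize_with_vault(values, nbytes=8):
    """Return (tokens, vault), where vault maps each token to its original value."""
    forward = {}  # original value -> token
    vault = {}    # token -> original value
    for value in values:
        if value not in forward:
            token = secrets.token_hex(nbytes)
            # Regenerate in the very unlikely event of a collision
            while token in vault:
                token = secrets.token_hex(nbytes)
            forward[value] = token
            vault[token] = value
    return [forward[v] for v in values], vault

names = ["Elon Musk", "Jeff Bezos", "Elon Musk"]
tokens, vault = tokenize_with_vault(names)
restored = [vault[t] for t in tokens]
print(tokens)
print(restored)
```

The vault is as sensitive as the original data, so it would need to be stored separately from the tokenized dataset and with restricted access.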

In [14]:
# Create a function to pseudonymize a series (another way of tokenizing)
def pseudonymize_series(series):
    # Create a dictionary to store the mapping between original values and pseudonyms
    pseudo_dict = {value: secrets.token_hex(8) for value in series.unique()}
    
    # Replace the original values with pseudonyms
    return series.map(pseudo_dict)

#Create a copy of the original data and rename it pseudo_data
pseudo_data = data.copy()

# Pseudonymize the columns of interest: 'Name', 'ID Number' and 'Phone Number'
pseudo_data['Name'] = pseudonymize_series(pseudo_data['Name'])
pseudo_data['ID Number'] = pseudonymize_series(pseudo_data['ID Number'])
pseudo_data['Phone Number'] = pseudonymize_series(pseudo_data['Phone Number'])

# Review the data
pseudo_data.head()

Unnamed: 0,Name,ID Number,Phone Number,Country/Territory
0,f52fe99a0f5d4b72,f7a340a12c62fb6a,3450bf69c12a3ecf,United States
1,659cf30bdf4177c9,c5427f73f08b36c1,5c8db1b52813a364,United States
2,523a248267d9b947,b8e3432f3a58d29b,7cfd12b89384d807,France
3,1174b3c8fbaf9d48,3dded26088c384a3,b89fad8746028814,United States
4,b36747da0e5b6b4b,bc16efae738c4e03,8793f6d3b33a61c6,United States


In [15]:
# Confirm the shape of the data
pseudo_data.shape

(101, 4)
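
Because the pseudonyms above are drawn at random, the same name receives a different pseudonym every time the cell is run or a new dataset is processed. If the same value must map to the same pseudonym across runs, one option is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library; `keyed_pseudonym`, the 16-character truncation, and the example keys are assumptions made for illustration.

```python
import hashlib
import hmac

def keyed_pseudonym(value, key, length=16):
    """Derive a deterministic pseudonym from `value` using HMAC-SHA256."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating shortens the pseudonym at the cost of a higher collision chance
    return digest[:length]

# Hypothetical key for illustration only; a real key would be kept secret
# and loaded from a secure location rather than written in the notebook
key = b"replace-with-a-secret-key"

a = keyed_pseudonym("Elon Musk", key)
b = keyed_pseudonym("Elon Musk", key)
c = keyed_pseudonym("Elon Musk", b"a-different-key")
print(a, a == b, a == c)
```

A plain unkeyed hash would also be deterministic, but anyone could then hash a list of candidate names and compare the results. The secret key is what prevents that kind of guessing.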

### **2e. Data Grouping**

This technique replaces a group of identifying columns with a single randomly generated unique ID per row. The original columns are moved to a separate reference table keyed by that ID, which can be stored apart from the data that is shared.

In [17]:
# Group the columns to create a single unique ID.
columns_to_group = ['Name', 'ID Number', 'Phone Number']

#Create a copy of the original dataset called unique_data
unique_data = data.copy()

# Create a new column 'Unique ID' that is a UUID for each row
unique_data['Unique ID'] = [str(uuid4()) for _ in range(len(unique_data))]

# Create a reference DataFrame that consists of the unique IDs and the original columns
reference_df = unique_data[columns_to_group + ['Unique ID']].copy()

# Drop the original columns from the main DataFrame
unique_data.drop(columns=columns_to_group, inplace=True)

unique_data.head()

Unnamed: 0,Country/Territory,Unique ID
0,United States,d560c863-c507-45c5-b397-ea097abf0230
1,United States,6aeccff4-e1a1-49ec-8f01-34ea9f3e719c
2,France,7e5c285f-02d4-4639-803b-a2e3e23cea73
3,United States,690ccb44-571b-41ee-bff7-d6aa8b86cf2d
4,United States,08fddeea-008c-4bd4-8ab3-15d25b3bdc14


In [18]:
# Confirm the shape of the dataset
unique_data.shape

(101, 2)
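
The reference table is what allows the identifying columns to be re-attached later. The sketch below walks through the split and the re-link with plain dictionaries so the mechanics are visible; the variable names `shared`, `reference`, and `relinked` are hypothetical, and the two records are taken from the preview of the dataset above.

```python
from uuid import uuid4

records = [
    {"Name": "Elon Musk", "ID Number": 30593231, "Country/Territory": "United States"},
    {"Name": "Bernard Arnault & family", "ID Number": 30510802, "Country/Territory": "France"},
]
identifying = ["Name", "ID Number"]

shared, reference = [], {}
for record in records:
    uid = str(uuid4())
    # Identifying fields go to the reference table, keyed by the unique ID
    reference[uid] = {k: record[k] for k in identifying}
    # Everything else stays in the shareable table
    rest = {k: v for k, v in record.items() if k not in identifying}
    shared.append({"Unique ID": uid, **rest})

# Re-linking is only possible for someone who holds `reference`
relinked = [{**row, **reference[row["Unique ID"]]} for row in shared]
print(shared[0])
print(relinked[0]["Name"])
```

With the DataFrames from the cell above, the comparable re-link would be `unique_data.merge(reference_df, on='Unique ID')`.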

This notebook does not cover every available tool, only some of the common ones that data professionals use as they analyze data. As noted, the number of rows stayed constant at 101 for every technique, since you wouldn't want a tool that silently adds or drops records; only data grouping changed the number of columns, because it replaced three identifying columns with a single 'Unique ID'. There are, however, several factors to consider when selecting a tool for anonymizing your data:

- Data Consistency: When anonymizing data, it's important to maintain the same type and format of data. For example, if a field originally contains a phone number, it should still contain a phone number after being anonymized. 
- Relevance: If you are replacing data in fields like 'Location', ensure the generated data is relevant. For instance, it might be misleading to have a country that doesn't match the phone number country code. Additionally, when shuffling data, you disrupt the relationships between different data fields. This can limit the usefulness of the data for many analytical tasks that rely on understanding correlations between different data fields.
- Information Loss: Always consider the nature of your data and the requirements of your use case when choosing an anonymization tool. Be aware that data masking could lead to loss of information. The masked part of the data cannot be used for analysis, so it's important to balance the need for privacy with the need for data utility.
- Risk of Re-identification: Even if some data is masked, individuals may still be re-identifiable if enough unmasked data can be linked back to them. This is a particular risk with quasi-identifiers (also called semi-identifiers), which do not identify a person on their own but can do so in combination. Shuffling preserves the overall distribution of data in each column, so it can be a good choice if your use case depends on maintaining the same overall patterns in the data. However, if another field, such as an email address, contains the individual's name, re-identification may still be possible.
- Uniqueness: If the same original value gets assigned different tokens in different instances, it might create inconsistencies in data analysis. So, you should consider whether to use consistent tokens (the same original value always gets the same token) or unique tokens (the same original value can get different tokens).
- Legal and Ethical Considerations: Even though data is anonymized, you should still handle it responsibly. Some jurisdictions have specific legal requirements around data anonymization, so ensure that your tool complies with any relevant laws and regulations.
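
One rough way to gauge the re-identification risk mentioned in the list above is to count how many rows share each combination of quasi-identifier values. This is the idea behind k-anonymity: a very small group is easy to single out. The sketch below is a simplified check, not a full privacy assessment; `smallest_group_size` is a hypothetical helper, and the rows and the "Age Band" column are made up for illustration.

```python
from collections import Counter

def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the rarest combination of quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Made-up rows for illustration; "Age Band" is a hypothetical column
rows = [
    {"Country/Territory": "United States", "Age Band": "50-59"},
    {"Country/Territory": "United States", "Age Band": "50-59"},
    {"Country/Territory": "France", "Age Band": "70-79"},
]

k = smallest_group_size(rows, ["Country/Territory", "Age Band"])
print(k)  # 1: the France row is unique on these columns
```

A result of 1 means at least one row is unique on the chosen columns, which suggests those columns may need to be generalized or suppressed before the data is shared.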


