# Deidentification Example Using Faker

Faker is a Python library that generates fake data such as names, addresses, and dates. One use for that data is de-identifying sensitive records by replacing real values with generated ones.

Here are some of the common attributes that Faker can generate:

*   name: Full name of a person
*   first_name: First name of a person
*   last_name: Last name of a person
*   address: Address of a person
*   street_address: Street address of a person
*   city: City of a person
*   state: State of a person
*   country: Country of a person
*   postcode: Postal code of a person
*   email: Email address of a person
*   phone_number: Phone number of a person
*   company: Name of a company
*   job: Job title of a person

Note: The example below also lists a VIN column for de-identification. `VIN` does not match any attribute name on the Faker object, so the code reports it as an invalid attribute and, as the output shows, the VIN values are left unchanged.

You can find a more comprehensive list of attributes that Faker can generate in the official documentation: https://faker.readthedocs.io/en/master/providers.html.


Here is a practical example of using Faker to de-identify a dataset containing customer information. The dataset used here is not real data; it was randomly generated.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [46]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [47]:
!pip install pyth

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [62]:
import pandas as pd
from faker import Faker

# Load the data from the input file (in CSV format)
input_file = '/content/gdrive/MyDrive/Notebooks/ExampleCustomerData.csv'
input_data = pd.read_csv(input_file)

# Define the columns to de-identify
columns_to_deidentify = ['name', 'email', 'address', 'VIN']

# Define the Faker object for de-identification
faker = Faker()

# Loop over the input data and de-identify the specified columns
deidentified_data = input_data.copy()
for col in columns_to_deidentify:
    print(f'De-identifying column: {col}')
    attr = getattr(faker, col, None)
    if attr is None:
        print(f'Invalid attribute for de-identification: {col}')
    else:
        deidentified_data[col] = deidentified_data[col].apply(lambda x: attr() if pd.notnull(x) else x)

# Write the de-identified data to a new file (in CSV format)
output_file = '/content/gdrive/MyDrive/Notebooks/DeidentifiedCustomerData_Faker.csv'
deidentified_data.to_csv(output_file, sep=',', index=False)

# Print the first 10 rows of the de-identified data
print(deidentified_data.head(10))


De-identifying column: name
De-identifying column: email
De-identifying column: address
De-identifying column: VIN
Invalid attribute for de-identification: VIN
                 name                       email  \
0    Geoffrey Bennett      carrcarlos@example.org   
1          Erin Glenn         tgarner@example.net   
2      Richard Larson      kiddcarlos@example.net   
3      Jessica Hodges          nmcgee@example.org   
4  Christopher Dorsey     littleamber@example.com   
5        Edward Garza      michelle34@example.org   
6     Tyrone Bartlett        uesparza@example.org   
7         Terry Silva      kellihicks@example.net   
8          Kim Tanner         nancy57@example.org   
9       Dylan Chapman  amandajohnston@example.org   

                                             address  age                VIN  
0         024 Mckay Ports\nNorth Jerryview, WA 63710   30  1HGBH41JXMN109186  
1           6872 Edgar Run\nNorth Kimberly, IA 96363   25  3VWML7AJ9CM670208  
2  96279 Scott Moun
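
The output above shows the VIN column unchanged, because `VIN` does not match a Faker attribute name. One possible workaround, sketched here with only the standard library, is to generate a VIN-shaped replacement directly. The names `VIN_CHARS` and `random_vin_like` are hypothetical helpers introduced for this illustration. The sketch does not compute the VIN check digit, so the results only look like VINs.

```python
import random
import string

# Characters allowed in a VIN: digits and capital letters except I, O and Q.
VIN_CHARS = string.digits + "".join(
    c for c in string.ascii_uppercase if c not in "IOQ"
)


def random_vin_like(rng=random):
    """Return a random 17-character VIN-shaped string (hypothetical helper).

    The check digit is not computed, so the result is not a valid VIN.
    """
    return "".join(rng.choice(VIN_CHARS) for _ in range(17))


# A dedicated Random instance keeps the example reproducible.
rng = random.Random(42)
print(random_vin_like(rng))
```

Assuming the `deidentified_data` frame from the cell above, the helper could be applied with `deidentified_data['VIN'].apply(lambda x: random_vin_like() if pd.notnull(x) else x)`.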

# Deidentification Example Using Mimesis

Mimesis is a Python library that generates realistic fake data for a variety of categories, including names, addresses, phone numbers, and emails. It is useful for generating test data, populating databases with mock data, and de-identifying sensitive data. It can generate data in multiple languages and is available under the MIT license.

Mimesis supports a variety of data providers, each designed to generate data for a specific category. Some examples:

*   mimesis.Person() provider can generate fake names, birthdays, genders, and more.
*   mimesis.Address() provider can generate street addresses, postal codes, cities, and more. 
*   Other providers include mimesis.Internet(), mimesis.Numeric(), and mimesis.Finance() (named Numbers and Business in older releases), among others.


To learn more about the Mimesis library, check out the official documentation at https://mimesis.name/. The documentation includes a comprehensive list of providers and the attributes each one can generate.

In [58]:
!pip install mimesis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [64]:
import pandas as pd
from faker import Faker

# Load the data from the input file (in CSV format)
input_file = '/content/gdrive/MyDrive/Notebooks/ExampleCustomerData.csv'
input_data = pd.read_csv(input_file)

# Define the columns to de-identify
columns_to_deidentify = ['name', 'email', 'address', 'VIN']

# Define the Faker object for de-identification
faker = Faker()

# Loop over the input data and de-identify the specified columns
deidentified_data = input_data.copy()
for col in columns_to_deidentify:
    print(f'De-identifying column: {col}')
    attr = getattr(faker, col, None)
    if attr is None:
        # 'VIN' is not a Faker attribute, so it is caught here and left unchanged
        print(f'Invalid attribute for de-identification: {col}')
    elif col == 'VIN':
        # Not reached as written: the check above already handles 'VIN'.
        # The pattern below yields 18 characters (17 letters and 1 digit), not 17.
        deidentified_data[col] = deidentified_data[col].str.replace(r'^[A-Z0-9]{17}$', lambda x: faker.bothify(text='?????????????????#'))
    else:
        # Replace every non-null value with a freshly generated fake value
        deidentified_data[col] = deidentified_data[col].apply(lambda x: attr() if pd.notnull(x) else x)

# Write the de-identified data to a new file (in CSV format)
output_file = '/content/gdrive/MyDrive/Notebooks/DeidentifiedCustomerData.csv'
deidentified_data.to_csv(output_file, sep=',', index=False)

# Show the first 10 samples of the de-identified data
print(deidentified_data.head(10))



De-identifying column: name
De-identifying column: email
De-identifying column: address
De-identifying column: VIN
Invalid attribute for de-identification: VIN
               name                         email  \
0   Elizabeth White          reedjudy@example.org   
1    Madison Martin  gregoryzimmerman@example.com   
2  Jennifer Jackson         matthew27@example.net   
3        John Reyes       branchterri@example.org   
4     Cheryl Barker            john90@example.org   
5  Larry Strickland            mark82@example.org   
6         John Rose        erikaperez@example.org   
7    Patrick Gibson   alexanderwilson@example.org   
8   Walter Melendez         barbara61@example.net   
9       Lauren Hill            tgreen@example.net   

                                             address  age                VIN  
0    955 Andrew Neck Apt. 855\nHaroldmouth, MH 50110   30  1HGBH41JXMN109186  
1    48213 Sarah Fort Apt. 246\nMeganville, AS 80732   25  3VWML7AJ9CM670208  
2  9068 Garrett Por