Tutorial session: How to replace a sensitive field with synthetic data


In this session we replace a sensitive key field with synthetic values that remain unique and still allow joins, but no longer expose the original data.
The routine also creates a companion translation table that maps each original key to its synthetic replacement.
The synthetic key can be either text or numeric, to match the type required of the key; you have to choose this upfront.
The synthetic key can be either sequential or randomly generated; the choice may affect performance, e.g. when sorting.
The assignment of keys to the data can likewise be sequential or random, which makes the result more realistic in scenarios where key order may affect performance.
The cells below use a sequential numeric key.
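
The options listed above (text or numeric key, sequential or random generation, sequential or random assignment) can be sketched as one small helper. This is only an illustration under assumptions, not the routine used in the cells of this session: the function name `make_synthetic_keys`, its parameters (`kind`, `generation`, `shuffle`, `start`, `seed`), the `"C"` prefix for text keys, and the 10x sampling range are all hypothetical choices.

```python
import numpy as np
import pandas as pd


def make_synthetic_keys(values, kind="numeric", generation="sequential",
                        shuffle=False, start=10000, seed=0):
    """Return a translation table: one row per unique original value."""
    rng = np.random.default_rng(seed)
    originals = pd.unique(pd.Series(values))  # unique values, first-seen order
    n = len(originals)

    if generation == "sequential":
        keys = np.arange(start, start + n)
    elif generation == "random":
        # Assumption: draw unique keys from a range 10x larger than needed,
        # then sort them so that assignment order is controlled by `shuffle`.
        pool = np.arange(start, start + 10 * n)
        keys = np.sort(rng.choice(pool, size=n, replace=False))
    else:
        raise ValueError("generation must be 'sequential' or 'random'")

    if shuffle:
        # Randomize which original value receives which key.
        keys = rng.permutation(keys)

    if kind == "text":
        # Assumption: a fixed letter prefix is enough to make the key textual.
        keys = np.array([f"C{k}" for k in keys])
    elif kind != "numeric":
        raise ValueError("kind must be 'numeric' or 'text'")

    return pd.DataFrame({"original": originals, "synthetic": keys})


# Example: text keys, generated sequentially, assigned in first-seen order.
example = make_synthetic_keys([88208, 82205, 89351, 88208], kind="text")
print(example)
```

Fixing `seed` keeps the table reproducible between runs, which matters if the translation table has to be regenerated later.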



In [2]:
#!Python3
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv("./test_data/input.csv")
df.shape

(1000, 4)

In [4]:
df.head()

Unnamed: 0,Amount,ID,Name,Transact
0,92673,88208,Nicole Smith,1
1,8367,82205,Connie Crane,10
2,-6421,89351,Michele Reynolds,5
3,40212,86814,Adam Allison,8
4,57817,80451,Aaron Evans,5


In [7]:
# Build the translation (mapping) table from the unique ID/Name pairs,
# then assign a sequential synthetic key starting at 10000
mapping_table = df[['ID',"Name"]].drop_duplicates()
mapping_table['New_Customer'] = range(10000, 10000+len(mapping_table))
print(mapping_table.shape)
mapping_table.head()


(100, 3)


Unnamed: 0,ID,Name,New_Customer
0,88208,Nicole Smith,10000
1,82205,Connie Crane,10001
2,89351,Michele Reynolds,10002
3,86814,Adam Allison,10003
4,80451,Aaron Evans,10004


In [8]:
# Attach the synthetic key to every row, then drop the sensitive columns
table_anon = df.merge(mapping_table[["ID", "New_Customer"]], on="ID", how="left")
table_anon = table_anon.drop(columns=["ID", "Name"])
table_anon.head()

Unnamed: 0,Amount,Transact,New_Customer
0,92673,1,10000
1,8367,10,10001
2,-6421,5,10002
3,40212,8,10003
4,57817,5,10004
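
Before dropping the original columns for good, it may be worth checking that the replacement really preserves joins. The sketch below is not part of the original session. It rebuilds the mapping and merge steps on a few stand-in rows taken from the `head()` output above, plus one made-up repeat transaction for customer 88208 so that the many-to-one case is exercised. The `validate` argument of `merge` raises an error if an ID appears more than once in the mapping, which would happen if one ID were recorded under two different names.

```python
import pandas as pd

# Stand-in data: first three rows from the head() output above,
# plus one hypothetical repeat transaction for ID 88208.
df = pd.DataFrame({
    "Amount": [92673, 8367, -6421, 100],
    "ID": [88208, 82205, 89351, 88208],
    "Name": ["Nicole Smith", "Connie Crane", "Michele Reynolds", "Nicole Smith"],
    "Transact": [1, 10, 5, 2],
})

mapping_table = df[["ID", "Name"]].drop_duplicates().copy()
mapping_table["New_Customer"] = range(10000, 10000 + len(mapping_table))

# many_to_one: many transactions per ID, exactly one mapping row per ID.
checked = df.merge(
    mapping_table[["ID", "New_Customer"]],
    on="ID", how="left", validate="many_to_one",
)

# The merge must not add or lose rows, and every row must get a key.
assert len(checked) == len(df)
assert checked["New_Customer"].notna().all()

# Each original ID maps to exactly one synthetic key, and vice versa.
assert mapping_table["ID"].is_unique
assert mapping_table["New_Customer"].is_unique

# Aggregates per customer are the same under either key.
by_id = sorted(checked.groupby("ID")["Amount"].sum())
by_new = sorted(checked.groupby("New_Customer")["Amount"].sum())
assert by_id == by_new
```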


In [60]:
# Export in two halves, sorted by the synthetic key (e.g. if the full table is too big for Excel)
sorted_table = table_anon.sort_values('New_Customer')
half = len(sorted_table) // 2
sorted_table.iloc[:half].to_csv('./output/table_top500klines.csv')
sorted_table.iloc[half:].to_csv('./output/table_bottom_half.csv')


In [61]:
# Export the full anonymized table
table_anon.to_csv ('./output/comb_table_anon.csv')

In [62]:
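# Export the translation table; it still contains the original ID and Name,
# so keep it separate from the anonymized files
mapping_table.to_csv('./output/mapping_table.csv')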
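
The translation table is what makes the replacement reversible for whoever is allowed to hold it. As a minimal sketch (not part of the original session), the original ID can be recovered by joining the anonymized table back to the mapping on `New_Customer`. The frames below are small in-memory stand-ins that reuse the first three rows shown in the outputs above, and the name `restored` is hypothetical.

```python
import pandas as pd

# Stand-ins for the exported files, using rows shown in the outputs above.
mapping_table = pd.DataFrame({
    "ID": [88208, 82205, 89351],
    "Name": ["Nicole Smith", "Connie Crane", "Michele Reynolds"],
    "New_Customer": [10000, 10001, 10002],
})
table_anon = pd.DataFrame({
    "Amount": [92673, 8367, -6421],
    "Transact": [1, 10, 5],
    "New_Customer": [10000, 10001, 10002],
})

# Join back on the synthetic key to recover the original identifiers.
restored = table_anon.merge(
    mapping_table, on="New_Customer", how="left", validate="many_to_one"
)
print(restored[["ID", "Name", "Amount", "Transact"]])
```

Anyone who has only the anonymized table sees just `New_Customer`; the join works only with access to the mapping table.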