# Data Anonymizer

`DataAnonymizer` anonymizes free-text columns using named entity recognition (NER). A pretrained model from the [transformers](https://huggingface.co/transformers/) package picks up entities such as locations and persons; each entity is replaced with its MD5 hash, and the hash-to-entity mapping is stored in a dictionary for de-anonymization. Categorical columns go through a similar process, without the NER step.
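
The hash-and-replace step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the package's actual implementation: the entity list is hard-coded as a stand-in for NER output, and `md5_hex` and `hash_entities` are hypothetical helper names.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_entities(text: str, entities: list) -> tuple:
    """Replace each entity in `text` with its MD5 hash.

    Returns the anonymized text and a hash-to-entity dictionary
    that can later be used for de-anonymization.
    """
    hash_to_entity = {}
    # Replace longer entities first so "New Zealand" wins over "New".
    for entity in sorted(set(entities), key=len, reverse=True):
        digest = md5_hex(entity)
        hash_to_entity[digest] = entity
        text = text.replace(entity, digest)
    return text, hash_to_entity


sample = "Ardern grew up in Morrinsville."
entities = ["Ardern", "Morrinsville"]  # assumed stand-in for NER output

anonymized, mapping = hash_entities(sample, entities)
print(anonymized)
print(mapping)
```

Keeping the dictionary keyed by hash makes the reverse lookup during de-anonymization a direct dictionary access.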

This notebook demonstrates how to use it on some Wikipedia data.

### Grab some data from Wikipedia

In [1]:
import wikipedia
import pandas as pd
import ner_anonymizer



In [2]:
person = []
page_content = []

for _person in ["Andrew Ng", "Jacinda Ardern"]:
    person.append(_person)
    page_content.append(wikipedia.page(wikipedia.search(_person)[0]).content)

In [3]:
df = pd.DataFrame({"person": person,
                   "page_content": page_content})
df.head()

Unnamed: 0,person,page_content
0,Andrew Ng,Andrew Yan-Tak Ng (Chinese: 吳恩達; born 1976) is...
1,Jacinda Ardern,"Jacinda Kate Laurell Ardern (, NZ pronunciatio..."


In [4]:
df.page_content[0][:1000]

'Andrew Yan-Tak Ng (Chinese: 吳恩達; born 1976) is a British-born American businessman, computer scientist, investor, and writer. He is focusing on machine learning and AI. As a businessman and investor, Ng co-founded and led Google Brain and was a former Vice President and Chief Scientist at Baidu, building the company\'s Artificial Intelligence Group into a team of several thousand people.Ng is an adjunct professor at Stanford University (formerly associate professor and Director of its AI Lab). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai. He has successfully spearheaded many efforts to "democratize deep learning" teaching over 2.5 million students through his online courses. He is one of the world\'s most famous and influential computer scientists being named one of Time magazine\'s 100 Most Influential People in 2012, and Fast Company\'s Most Creative People in 2014. Since 2018 he launched and currently heads AI Fund, initially a $175-million investm

In [5]:
df.page_content[1][:1000]

'Jacinda Kate Laurell Ardern (, NZ pronunciation ; born 26 July 1980) is a New Zealand politician who has served as the 40th prime minister of New Zealand and leader of the Labour Party since 2017. She has been the member of Parliament (MP) for Mount Albert since March 2017, having first been elected to the House of Representatives as a list MP in 2008.Born in Hamilton, Ardern grew up in Morrinsville and Murupara, where she attended a state school. After graduating from the University of Waikato in 2001, Ardern began her career working as a researcher in the office of Prime Minister Helen Clark. She later worked in London, within the Cabinet Office, and was elected president of the International Union of Socialist Youth. Ardern was first elected as an MP in the 2008 general election, when Labour lost power after nine years. She was later elected to represent the Mount Albert electorate in a by-election in February 2017.\nArdern was unanimously elected as deputy leader of the Labour Par

### Anonymize Data

In [6]:
# initialize anonymizer
anonymizer = ner_anonymizer.DataAnonymizer(pretrained_model_name="dslim/bert-base-NER")

In [7]:
# specify free text columns, categorical columns, as well as any other regex
# you might want to include to hash the free text columns.
# NOTE: this will take some time to download the model & iterate across the
#       whole dataset. Have a coffee or go for a run!

anonymized_df, hash_dictionary = anonymizer.anonymize(
    df=df,
    free_text_columns=["page_content"],
    free_text_additional_regex_to_hash={
        "page_content": ["[0-9]{7,9}"]  # to pick up potential phone numbers (7 to 9 digits); no space inside the braces
    },
    categorical_columns=["person"]
)
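
A note on writing the additional regex: Python's `re` module treats `{m,n}` as a repetition count only when there is no whitespace inside the braces. With a space, the braces are matched as literal characters and the pattern silently finds nothing in ordinary text. The sketch below uses only the standard library. The sample sentence and the `hash_matches` helper are made up for illustration and are not part of `ner_anonymizer`.

```python
import hashlib
import re

# Quantifier without whitespace: matches runs of 7 to 9 digits.
pattern_ok = re.compile(r"[0-9]{7,9}")
# Quantifier with a space: the braces are parsed as literal text.
pattern_spaced = re.compile(r"[0-9]{7, 9}")

text = "Call 91234567 before noon."  # made-up sample


def hash_matches(pattern, text: str) -> str:
    """Replace every regex match with its MD5 hash (hypothetical helper)."""
    return pattern.sub(
        lambda m: hashlib.md5(m.group(0).encode("utf-8")).hexdigest(),
        text,
    )


print(pattern_ok.findall(text))
print(pattern_spaced.findall(text))
print(hash_matches(pattern_ok, text))
```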

You may pass a different pretrained model to `pretrained_model_name`; candidates are listed at the links below:
* https://huggingface.co/transformers/pretrained_models.html
* https://huggingface.co/models

In [8]:
anonymized_df.head()

Unnamed: 0,person,page_content
0,d4e4d49054268e95b9f7952db8c0536b,8aae3a73a9a43ee6b04dfd986fe9d136 ff9af30819fb3...
1,e4c6d0199151ba16ffba9985213c86bf,12e124e5371137dadfeb0fa797958e92 67ac1f0779736...


Both the categorical column and the free text column were hashed.

In [9]:
# check anonymization results
anonymized_df.page_content[0][:1000]

'8aae3a73a9a43ee6b04dfd986fe9d136 ff9af30819fb3c2e35a54034824a183f-3091c457dce701f7c7cc90fe70586c07 (Chinese: 吳恩達; born 1976) is a British-born American businessman, computer scientist, investor, and writer. He is focusing on machine learning and AI. As a businessman and investor, 8582d13498fb14c51eba9bc3742b8c2f co-founded and led Google Brain and was a former Vice President and Chief Scientist at Baidu, building the company\'s Artificial Intelligence Group into a team of several thousand people.8582d13498fb14c51eba9bc3742b8c2f is an adjunct professor at Stanford University (formerly associate professor and Director of its AI Lab). Also a pioneer in online education, 8582d13498fb14c51eba9bc3742b8c2f co-founded Coursera and deeplearning.ai. He has successfully spearheaded many efforts to "democratize deep learning" teaching over 2.5 million students through his online courses. He is one of the world\'s most famous and influential computer scientists being named one of Time magazine\'s 

Comment: The Chinese name was not hashed. It could be picked up by specifying a multilingual NER model via `pretrained_model_name` when initializing `DataAnonymizer`.

In [10]:
# check anonymization results
anonymized_df.page_content[1][:1000]

'12e124e5371137dadfeb0fa797958e92 67ac1f0779736ebadf14da5e9b294b69 (, 8e3eb2c69a184ad1d448afe5985f50b3 pronunciation ; born 26 July 1980) is a 03c2e7e41ffc181a4e84080b4710e81e 4841ed0d728f95b3cb393f4a9c9efdbd politician who has served as the 40th prime minister of 03c2e7e41ffc181a4e84080b4710e81e 4841ed0d728f95b3cb393f4a9c9efdbd and leader of the Labour Party since 2017. She has been the member of Parliament (MP) for eace16d66cdd93ad876c620db7456077 91869f9f8d6f767b7b960a41d133fc67 since March 2017, having first been elected to the House of Representatives as a list MP in 2008.Born in adec714ae69bef54c5ee79cfcb41955d, Ardern grew up in c08df9bb5fb44242a6291b1eee5d09ad42e954e0635d0d6894a2d463e08c7a77 0cc175b9c0f1b6a831c399e269772661nd 893b7719713faaa97b1caa5603313723rup0cc175b9c0f1b6a831c399e269772661r0cc175b9c0f1b6a831c399e269772661, where she 0cc175b9c0f1b6a831c399e269772661ttended 0cc175b9c0f1b6a831c399e269772661 st0cc175b9c0f1b6a831c399e269772661te school. 7fc56270e7a70fa81a5935b72e

Comment: `Ardern` in the third sentence was not picked up by the BERT NER model.
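
The output above also shows the hash `0cc175b9c0f1b6a831c399e269772661`, which is the MD5 of the single letter `a`, embedded inside ordinary words such as `ttended` and `st...te`. One plausible explanation is that a one-letter entity was replaced by plain substring matching. This is an assumption and is not confirmed from the library's source. The sketch below reproduces the pattern with `str.replace` and shows how a word-boundary regex would avoid it; `replace_naive` and `replace_whole_word` are hypothetical helpers.

```python
import hashlib
import re


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def replace_naive(text: str, entity: str) -> str:
    """Replace every occurrence of `entity`, even inside other words."""
    return text.replace(entity, md5_hex(entity))


def replace_whole_word(text: str, entity: str) -> str:
    """Replace `entity` only where it is not surrounded by word characters."""
    pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)")
    return pattern.sub(md5_hex(entity), text)


sentence = "where she attended a state school"
naive = replace_naive(sentence, "a")
bounded = replace_whole_word(sentence, "a")

print(naive)
print(bounded)
```

Filtering out very short entities before hashing would be another way to limit this effect.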

### De-anonymize Data

In [11]:
de_anonymized_df = ner_anonymizer.de_anonymize_data(anonymized_df, hash_dictionary)

In [12]:
print(
    "Is the de-anonymized data exactly the same as the original",
    "data? {}".format(df.equals(de_anonymized_df))
)

Is the de-anonymized data exactly the same as the original data? True
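
De-anonymization reverses the mapping, replacing each hash with the entity stored for it. The following is a minimal round-trip sketch on a single string, assuming a hash-to-entity dictionary like the one described at the top of this notebook. `de_anonymize_text` is a hypothetical helper and is not the package's `de_anonymize_data`.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def de_anonymize_text(text: str, hash_to_entity: dict) -> str:
    """Replace every stored hash in `text` with its original entity."""
    for digest, entity in hash_to_entity.items():
        text = text.replace(digest, entity)
    return text


original = "Ng co-founded Coursera."
# Assumed entity list, standing in for what NER would detect.
hash_to_entity = {md5_hex(e): e for e in ["Ng", "Coursera"]}

# Forward pass: swap each entity for its hash.
anonymized = original
for digest, entity in hash_to_entity.items():
    anonymized = anonymized.replace(entity, digest)

# Reverse pass: restore the original text from the dictionary.
restored = de_anonymize_text(anonymized, hash_to_entity)
print(restored == original)
```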
