
# General data anomymization techniques

This notebook provides methods to anonymize data containing IDs, numbers, categoricals, strings, and dates. When anonymizing data, it is important to keep the purpose of the anonymization in mind, e.g.:
- Removing only personal information from the data, while keeping other information intact
- Obfuscating the data entirely, so that it stays reasonably representative but the original values are no longer retained
- Generating a dataset for broader reporting that must satisfy certain anonymity criteria
- Creating a sample dataset to illustrate how code / analysis is performed

While most of the steps for data anonymization are straightforward, it can be harder to keep track of the anonymization steps and ensure that all relevant columns are anonymized. We therefore use a custom class to hold the data and anonymize it.

This notebook walks through a comprehensive example of anonymization using the class, while also showing each type of anonymization step taken.

These steps are fairly basic and are applied to each column independently, so they do not preserve correlations between columns. More advanced techniques are needed for that.

**Here are some useful references**
* https://medium.com/codex/data-anonymization-with-python-8976db6ded36
* https://www.districtdatalabs.com/a-practical-guide-to-anonymizing-datasets-with-python-faker
* https://www.codeproject.com/Articles/5324569/anonympy-Data-Anonymization-with-Python
* https://www.postgresql.eu/events/fosdem2019/sessions/session/2287/slides/151/postgresql_anonymizer.reveal..pdf

**Load in key packages**

In [1]:
import sys

import numpy as np
import pandas as pd
from cryptography.fernet import Fernet
from IPython.display import display

In [2]:
# Developer tools
%load_ext autoreload
%autoreload 2

# Import custom module
import sys
sys.path.append('../../utilities')
from data_anonymization.anonymization import DataAnonymization

<h1>Table of Contents<span class="tocSkip"></span></h1>
<ul class="toc-item"><li><span><a href="#General-data-anomymization-techniques" data-toc-modified-id="General-data-anomymization-techniques-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>General data anomymization techniques</a></span></li><li><span><a href="#Load-in-dataset" data-toc-modified-id="Load-in-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load in dataset</a></span></li><li><span><a href="#Initializing-Data-Anonymization-Class-to-transform-the-data" data-toc-modified-id="Initializing-Data-Anonymization-Class-to-transform-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initializing Data Anonymization Class to transform the data</a></span></li><li><span><a href="#Data-type-agnostic-methods" data-toc-modified-id="Data-type-agnostic-methods-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data type agnostic methods</a></span><ul class="toc-item"><li><span><a href="#Suppression" data-toc-modified-id="Suppression-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Suppression</a></span></li><li><span><a href="#Shuffling" data-toc-modified-id="Shuffling-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Shuffling</a></span></li></ul></li><li><span><a href="#Numeric-data" data-toc-modified-id="Numeric-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Numeric data</a></span><ul class="toc-item"><li><span><a href="#Replacing-Numeric-Column-with-Random-Normal-Distribution¶" data-toc-modified-id="Replacing-Numeric-Column-with-Random-Normal-Distribution¶-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Replacing Numeric Column with Random Normal Distribution¶</a></span><ul class="toc-item"><li><span><a href="#Replacing-with-any-other-distribution" data-toc-modified-id="Replacing-with-any-other-distribution-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Replacing with any other distribution</a></span></li></ul></li><li><span><a href="#Adding-random-Gaussian-noise-to-a-numeric-variable" 
data-toc-modified-id="Adding-random-Gaussian-noise-to-a-numeric-variable-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Adding random Gaussian noise to a numeric variable</a></span></li><li><span><a href="#Bucketing-a-numerical-variable-into-ranges" data-toc-modified-id="Bucketing-a-numerical-variable-into-ranges-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Bucketing a numerical variable into ranges</a></span></li><li><span><a href="#Capping-outliers-through-winsorization" data-toc-modified-id="Capping-outliers-through-winsorization-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Capping outliers through winsorization</a></span></li></ul></li><li><span><a href="#Date-columns" data-toc-modified-id="Date-columns-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Date columns</a></span><ul class="toc-item"><li><span><a href="#Add-noise-to-date-(shift-dates-randomly)" data-toc-modified-id="Add-noise-to-date-(shift-dates-randomly)-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Add noise to date (shift dates randomly)</a></span></li></ul></li><li><span><a href="#Categorical-data" data-toc-modified-id="Categorical-data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Categorical data</a></span><ul class="toc-item"><li><span><a href="#Hash-the-values" data-toc-modified-id="Hash-the-values-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Hash the values</a></span></li><li><span><a href="#Encrypt-using-cryptography.fernet" data-toc-modified-id="Encrypt-using-cryptography.fernet-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Encrypt using cryptography.fernet</a></span></li><li><span><a href="#Map-to-alphabetic-encoding,-e.g.-A,-B,-C,-..." data-toc-modified-id="Map-to-alphabetic-encoding,-e.g.-A,-B,-C,-...-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Map to alphabetic encoding, e.g. 
A, B, C, ...</a></span></li><li><span><a href="#Provide-a-manual-mapping-(or-apply-a-previously-established-mapping)" data-toc-modified-id="Provide-a-manual-mapping-(or-apply-a-previously-established-mapping)-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Provide a manual mapping (or apply a previously established mapping)</a></span></li><li><span><a href="#Map-to-a-numerical-encoding-instead" data-toc-modified-id="Map-to-a-numerical-encoding-instead-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>Map to a numerical encoding instead</a></span></li><li><span><a href="#Mapping-underrepresented-values" data-toc-modified-id="Mapping-underrepresented-values-7.6"><span class="toc-item-num">7.6&nbsp;&nbsp;</span>Mapping underrepresented values</a></span><ul class="toc-item"><li><span><a href="#Map-Ticket_Copy-to-the-most-frequent" data-toc-modified-id="Map-Ticket_Copy-to-the-most-frequent-7.6.1"><span class="toc-item-num">7.6.1&nbsp;&nbsp;</span>Map Ticket_Copy to the most frequent</a></span></li><li><span><a href="#Map-Ticket_Copy2-to-a-random-existing-value" data-toc-modified-id="Map-Ticket_Copy2-to-a-random-existing-value-7.6.2"><span class="toc-item-num">7.6.2&nbsp;&nbsp;</span>Map Ticket_Copy2 to a random existing value</a></span></li></ul></li></ul></li><li><span><a href="#Dataset-sampling" data-toc-modified-id="Dataset-sampling-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Dataset sampling</a></span></li><li><span><a href="#Get-a-summary-of-all-of-the-data-anonymization-steps" data-toc-modified-id="Get-a-summary-of-all-of-the-data-anonymization-steps-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Get a summary of all of the data anonymization steps</a></span></li></ul>

# Load in dataset

We are using the Titanic dataset, augmented with a date column (`SomeDate`) and a column of random movie reviews (`RandomReviews`, a string column).

In [3]:
df_dataset = pd.read_csv('sample_input/titanic_augmented.csv')
df_dataset['SomeDate'] = pd.to_datetime(df_dataset['SomeDate'])
df_dataset.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1962-04-28,A wonderful little production. <br /><br />The...
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is..."


In [4]:
# For demonstration purposes, we create copies of a column (to show how to extract and reuse the mapping generated by some anonymization steps)
df_dataset['Ticket_Copy'] = df_dataset['Ticket'].copy()
df_dataset['Ticket_Copy2'] = df_dataset['Ticket'].copy()
df_dataset['Ticket_Copy3'] = df_dataset['Ticket'].copy()

# Initializing Data Anonymization Class to transform the data

Start by instantiating the data anonymization class with the dataset

In [5]:
anonymize = DataAnonymization(df_dataset)

In [6]:
# Access the anonymized data by
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


# Data type agnostic methods

## Suppression 

Replaces the information in a column with a particular value or a null value. This is used for fields that are too sensitive to keep even in anonymized form and must be removed (while keeping the column itself).
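
For reference, suppression in plain pandas is a simple column assignment. The following is a minimal sketch and not necessarily how `DataAnonymization.suppression` is implemented; the toy DataFrame `df_toy` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df_toy = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [31, 45, 27]})

# Replace every value with a fixed placeholder (the column itself is kept)
df_toy["Name"] = "Name Redacted"

# Alternatively, blank the column out with null values
df_nulls = df_toy.copy()
df_nulls["Name"] = np.nan
```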

In [7]:
anonymize.suppression("Name", fill_value="Name Redacted")
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,male,22.0,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,female,38.0,1,0,PC 17599,71.2833,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,female,35.0,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,35.0,0,0,373450,8.05,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


## Shuffling

This shuffles (randomly re-orders) the values in a column. This can be done if the values themselves do not contain any confidential data (e.g. names, SSN, etc.). It has the advantage of retaining the same set of values as the original data.

```
# np.random.shuffle works in place and returns None, so use permutation, which returns a shuffled copy
dataset['column'] = np.random.permutation(dataset['column'].to_numpy())
```

In [8]:
anonymize.shuffling("Sex")
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,22.0,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,38.0,1,0,PC 17599,71.2833,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,35.0,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,35.0,0,0,373450,8.05,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


# Numeric data

## Replacing Numeric Column with Random Normal Distribution¶

This is useful for generating a range of values with the same mean and standard deviation as the data. Since a normal distribution can produce outliers, an optional floor (min) and cap (max) can be set.

```python
dataset['column'] = np.random.normal(loc = dataset['column'].mean(), 
                                     scale = dataset['column'].std(),
                                     size = dataset.shape[0])

# If you need to set min and max
dataset['column'] = np.clip(dataset['column'], min_value, max_value)

```

In [9]:
anonymize.replace_gaussian("Age")
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,33.005002,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,10.730945,1,0,PC 17599,71.2833,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,23.472953,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,21.97743,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,47.115977,0,0,373450,8.05,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


### Replacing with any other distribution

Support for more distributions is planned. For example:
- Binomial distribution for 0 or 1 (or Yes and No)
- Multinomial distribution for multiple (but few) values
- Poisson distribution for count data
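
Until these are available in the class, the three cases can be sketched directly with NumPy's random generator. This is an illustration only: the toy DataFrame and its column names are hypothetical, and the distribution parameters are simply estimated from the observed data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed fixed only to make the sketch repeatable

# Toy data, invented for illustration only
df_toy = pd.DataFrame({
    "flag": [0, 1, 1, 0, 1, 1],              # binary column
    "port": ["S", "C", "S", "Q", "S", "C"],  # few distinct categories
    "siblings": [0, 1, 0, 3, 1, 0],          # count data
})
n = len(df_toy)

# Binary column: draw 0/1 with the observed share of ones
df_toy["flag"] = rng.binomial(n=1, p=df_toy["flag"].mean(), size=n)

# Few categories: draw each row from the observed category frequencies
# (a multinomial draw with one trial per row)
freq = df_toy["port"].value_counts(normalize=True)
df_toy["port"] = rng.choice(freq.index.to_numpy(), size=n, p=freq.to_numpy())

# Count data: draw from a Poisson distribution with the observed mean as rate
df_toy["siblings"] = rng.poisson(lam=df_toy["siblings"].mean(), size=n)
```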


## Adding random Gaussian noise to a numeric variable

This function adds random noise to an existing numeric column (perturbation). The size of the noise is proportional (scaled) to the standard deviation of the column. You can also choose to set a minimum and maximum value of the outcome (after factoring in noise).

```
dataset['column'] = np.clip(dataset['column'] + 
                            np.random.normal(loc = 0, 
                                             scale = scaling_factor * dataset['column'].std(),
                                             size = dataset.shape[0]),
                            min_value, max_value)
```

In [10]:
anonymize.noise_num("Fare", ratio_std=0.1, a_min=0, a_max=1000)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,33.005002,1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,10.730945,1,0,PC 17599,75.32205,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,23.472953,0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,21.97743,1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,47.115977,0,0,373450,11.104627,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


## Bucketing a numerical variable into ranges

This converts a numerical variable to a categorical variable of number ranges, which keeps the data close to the original. There are many ways to create the bins. Here, they are created by manually specifying the min, max, and step size.

```python
list_bins = [-np.inf] + list(np.arange(a_min, a_max, a_step)) + [np.inf]
dataset['column'] = pd.cut(dataset['column'], bins=list_bins, right=False).astype(str)
```

(Not covered here) 
* Min and max could also be set by the data itself (and rounded)
* A different approach to setting the bins is to use percentiles
* A potential next step is to replace the number range with the mid-value or sampling uniformly from that number range.
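
Two of the variations listed above, percentile-based bins and replacing each range with its mid-value, can be sketched as follows. This is an illustration with made-up data and is not part of the `DataAnonymization` class.

```python
import pandas as pd

# Toy data, invented for illustration only
ages = pd.Series([3, 17, 22, 25, 31, 38, 44, 52, 61, 79], name="Age")

# Percentile-based bins: four buckets with roughly equal numbers of rows
quartile_bins = pd.qcut(ages, q=4)

# Fixed-width bins of size 10, then replace each range by its mid-value
fixed_bins = pd.cut(ages, bins=list(range(0, 90, 10)), right=False)
midpoints = fixed_bins.astype(object).map(lambda interval: interval.mid).astype(float)
```

With fixed-width bins, both 22 and 25 fall into `[20, 30)` and are therefore replaced by the same mid-value, 25.0.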

In [11]:
anonymize.bucket_num("Age", a_min = 0, a_max = 120, a_step = 10)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,A/5 21171,7.25,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,PC 17599,75.32205,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,373450,11.104627,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


## Capping outliers through winsorization

Outlier values can be floored / capped at a chosen quantile (often 0.95 for the cap). This leaves values that are not outliers unchanged, so it doesn't completely anonymize a column. However, outliers are often particularly identifiable, so capping them provides an additional layer of anonymization.

The code for winsorization is fairly straightforward: Calculate the quantiles and apply them:
```
v_low = np.nanquantile(dataset['column'], q_min, axis=0)
v_high = np.nanquantile(dataset['column'], q_max, axis=0)
dataset['column'] = np.clip(dataset['column'], v_low, v_high)
```

In [12]:
anonymize.cap_outliers("Fare", q_min=0.05, q_max=0.95)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,A/5 21171,7.447777,,S,1978-01-28,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,PC 17599,75.32205,C85,C,1962-04-28,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,STON/O2. 3101282,7.925,,S,1974-09-06,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,113803,53.1,C123,S,1965-08-19,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,373450,11.104627,,S,1965-06-07,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


# Date columns

We added a date column to the dataset, called 'SomeDate'

## Add noise to date (shift dates randomly)

Adds a randomly generated number of days ('D'), months ('M'), or years ('Y') to each value in a date column. The shift is drawn between a minimum (low) and a maximum (high) value.

```python
def f(x, delta):
    return np.datetime64(x, delta_type) + np.timedelta64(delta, delta_type)

delta = np.random.randint(low=low, high=high, size=len(dataset['column']))
dataset['column'] = np.vectorize(f)(dataset['column'], delta)

```
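
As an alternative for day-level shifts, the same idea can be written with pandas timedeltas instead of `np.vectorize`. This is a sketch under the assumption that shifting by whole days is what is wanted; the toy dates and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed fixed only to make the sketch repeatable

# Toy data, invented for illustration only
dates = pd.Series(pd.to_datetime(["1978-01-28", "1962-04-28", "1974-09-06"]))
low, high = -10, 10  # shift range in days; `high` is exclusive

# One random integer shift per row, converted to a timedelta in days
shift_days = rng.integers(low=low, high=high, size=len(dates))
shifted = dates + pd.to_timedelta(shift_days, unit="D")
```

Note that converting to `np.datetime64` with unit 'M' or 'Y' truncates each date to the start of the month or year, whereas this day-based variant keeps the day of the month.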

In [13]:
anonymize.noise_date("SomeDate", low = -10, high = 10, delta_type="M")
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,A/5 21171,7.447777,,S,1977-05-01,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,PC 17599,75.32205,C85,C,1962-12-01,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,STON/O2. 3101282,7.925,,S,1974-12-01,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,113803,53.1,C123,S,1964-11-01,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,373450,11.104627,,S,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


# Categorical data

## Hash the values

Hashing encodes the values of a column as numbers. This uses the pandas function `pandas.util.hash_array`: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_array.html

A hash key is used for the encoding. To make it harder to reverse engineer the hashed values (which could be done through brute force if the key is short), we can divide the hashed values by a factor and round them.

The resulting number bears no visual resemblance to the original data.

```
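# hash_array expects a NumPy array (recent pandas versions reject a Series), hence .to_numpy()
dataset['column'] = (np.around(pd.util.hash_array(dataset['column'].to_numpy(), hash_key=hash_key) / factor, 0)).astype("uint64")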
```

The default value for the hash key is the pandas default '0123456789123456' and can be overridden
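
The following small, self-contained sketch applies the same hash-and-round idea to a toy column. It illustrates that identical input values receive identical codes, so the column can still be used for grouping or joining. The toy values and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (first and third value are identical)
tickets = pd.Series(["A/5 21171", "PC 17599", "A/5 21171", "113803"])

hash_key = "3212121872133454"  # 16 characters, like the pandas default key
factor = 7

# Hash the raw values, then divide by the factor and round
hashed = pd.util.hash_array(tickets.to_numpy(), hash_key=hash_key)
codes = np.around(hashed / factor, 0).astype("uint64")

# Identical inputs map to identical codes
same_code_for_same_value = codes[0] == codes[2]
```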

In [14]:
anonymize.hash_round("Ticket", hash_key = '3212121872133454', factor=7)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,2143699640136888064,7.447777,,S,1977-05-01,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,2216941847914850560,75.32205,C85,C,1962-12-01,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,1397738331497762560,7.925,,S,1974-12-01,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,1668570913841087744,53.1,C123,S,1964-11-01,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,1209947277582521344,11.104627,,S,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


## Encrypt using cryptography.fernet

Encrypts a column using a user-provided key. The values can be decrypted using the same key.

```
def f(x, key):
    return Fernet(key).encrypt(x.encode())

dataset['column'] = np.vectorize(f)(np.array(dataset['column']), key)
```

The key must be 32 bytes, URL-safe base64-encoded. You can use the following code to generate a key:
```
from cryptography.fernet import Fernet
key = Fernet.generate_key()
```

In [15]:
from cryptography.fernet import Fernet
key = Fernet.generate_key()
key

b'ecczn4DD4qDwCk2AqekmVCvXKrSjqAFJXlbwZTnuW6w='

In [16]:
anonymize.encrypt_fernet("Embarked", key = b'_e5D_ZUjFDpVN9Bx6EgI91ykMI0UDM8fpLYA2S1azbw=')
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,2143699640136888064,7.447777,,b'gAAAAABk4oZPhdrbNiXxqpYcjT8Msw22T81tl41peTJf...,1977-05-01,One of the other reviewers has mentioned that ...,A/5 21171,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,2216941847914850560,75.32205,C85,b'gAAAAABk4oZPhABtGOo1bKYbFRuDhXYQg0eAq37tPqUR...,1962-12-01,A wonderful little production. <br /><br />The...,PC 17599,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,1397738331497762560,7.925,,b'gAAAAABk4oZPRJfCsLB-DzYXa5c_7RHCGY3FJnJRfGZp...,1974-12-01,I thought this was a wonderful way to spend ti...,STON/O2. 3101282,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,1668570913841087744,53.1,C123,b'gAAAAABk4oZPvhTiLobHeUA3b4hNC5CuL5Il1NaiOmVy...,1964-11-01,Basically there's a family where a little boy ...,113803,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,1209947277582521344,11.104627,,b'gAAAAABk4oZPUS_4eWQpyWUOcXtxlAeCWLBwOArMUjbE...,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",373450,373450,373450


**If you need to decrypt, you would run the following**
```
anonymize.decrypt_fernet("Embarked", key = b'_e5D_ZUjFDpVN9Bx6EgI91ykMI0UDM8fpLYA2S1azbw=')
```

## Map to alphabetic encoding, e.g. A, B, C, ...

The 'map_cat' function automatically generates a mapping, unless you specify the argument 'map_cat_dict'. 

To use this function, see information in the following cell: 

In [32]:
help(anonymize.map_cat)

Help on method map_cat in module data_anonymization.anonymization:

map_cat(col, map_type=1, map_cat_dict=None) method of data_anonymization.anonymization.DataAnonymization instance
    Codify categorical columns with new identifiers that can be letters (1) or numbers (2).
    
    Args:
        col (string): Column target of the transformation.
        map_type (integer): 1 -> Map to alphabetic coding.
                            2 -> Map to numeric coding.



In [17]:
anonymize.map_cat("Ticket_Copy", map_type=1) # Use map_type=1 to generate mapping to alphabetical instead of numerical values (use 2 for numeric)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,2143699640136888064,7.447777,,b'gAAAAABk4oZPhdrbNiXxqpYcjT8Msw22T81tl41peTJf...,1977-05-01,One of the other reviewers has mentioned that ...,F,A/5 21171,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,2216941847914850560,75.32205,C85,b'gAAAAABk4oZPhABtGOo1bKYbFRuDhXYQg0eAq37tPqUR...,1962-12-01,A wonderful little production. <br /><br />The...,JL,PC 17599,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,1397738331497762560,7.925,,b'gAAAAABk4oZPRJfCsLB-DzYXa5c_7RHCGY3FJnJRfGZp...,1974-12-01,I thought this was a wonderful way to spend ti...,QL,STON/O2. 3101282,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,1668570913841087744,53.1,C123,b'gAAAAABk4oZPvhTiLobHeUA3b4hNC5CuL5Il1NaiOmVy...,1964-11-01,Basically there's a family where a little boy ...,SZ,113803,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,1209947277582521344,11.104627,,b'gAAAAABk4oZPUS_4eWQpyWUOcXtxlAeCWLBwOArMUjbE...,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",Y,373450,373450
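
To illustrate the idea behind an alphabetic encoding, here is a simplified sketch that assigns spreadsheet-style codes (A, B, ..., Z, AA, AB, ...) to the distinct values of a column. This is an assumption about one possible scheme, not the implementation of `map_cat`; the helper `alpha_code` and the toy data are hypothetical.

```python
import string

import pandas as pd


def alpha_code(i: int) -> str:
    """Convert 0, 1, 2, ... to A, B, ..., Z, AA, AB, ... (hypothetical helper)."""
    code = ""
    i += 1
    while i > 0:
        i, rem = divmod(i - 1, 26)
        code = string.ascii_uppercase[rem] + code
    return code


# Toy data, invented for illustration only
tickets = pd.Series(["PC 17599", "113803", "PC 17599", "373450"])

# Build one code per distinct value, then apply the mapping to the column
categories = sorted(tickets.unique())
mapping = {value: alpha_code(i) for i, value in enumerate(categories)}
encoded = tickets.map(mapping)
```

Assigning codes in sorted order, as done here for simplicity, leaks the ordering of the original values. Shuffling `categories` before building the mapping avoids that.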


## Provide a manual mapping (or apply a previously established mapping)

Imagine that we wanted to manually provide a mapping to a categorical column.

In this example, we create the manual mapping by taking the first 100 categories from the mapping generated in the prior section. We can access that mapping as shown in the next cell.

In [18]:
mapping = anonymize.map_cat_dict["Ticket_Copy"]
mapping

{'value': array(['110152', '110413', '110465', '110564', '110813', '111240',
        '111320', '111361', '111369', '111426', '111427', '111428',
        '112050', '112052', '112053', '112058', '112059', '112277',
        '112379', '113028', '113043', '113050', '113051', '113055',
        '113056', '113059', '113501', '113503', '113505', '113509',
        '113510', '113514', '113572', '113760', '113767', '113773',
        '113776', '113781', '113783', '113784', '113786', '113787',
        '113788', '113789', '113792', '113794', '113796', '113798',
        '113800', '113803', '113804', '113806', '113807', '11668', '11751',
        '11752', '11753', '11755', '11765', '11767', '11769', '11771',
        '11774', '11813', '11967', '12233', '12460', '12749', '13049',
        '13213', '13214', '13502', '13507', '13509', '13567', '13568',
        '14311', '14312', '14313', '14973', '1601', '16966', '16988',
        '17421', '17453', '17463', '17464', '17465', '17466', '17474',
        '17764', 

**We keep only the first 100 categories**

In [19]:
mapping["value"] = mapping["value"][0:100]
mapping["code"] = mapping["code"][0:100]

**Apply the new mapping to the column called "Ticket_Copy2"**

Values not in the mapping will get their own mapped values.

In [20]:
anonymize.map_cat("Ticket_Copy2", map_type=1, map_cat_dict=mapping)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,2143699640136888064,7.447777,,b'gAAAAABk4oZPhdrbNiXxqpYcjT8Msw22T81tl41peTJf...,1977-05-01,One of the other reviewers has mentioned that ...,F,CY,A/5 21171
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,2216941847914850560,75.32205,C85,b'gAAAAABk4oZPhABtGOo1bKYbFRuDhXYQg0eAq37tPqUR...,1962-12-01,A wonderful little production. <br /><br />The...,JL,VD,PC 17599
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,1397738331497762560,7.925,,b'gAAAAABk4oZPRJfCsLB-DzYXa5c_7RHCGY3FJnJRfGZp...,1974-12-01,I thought this was a wonderful way to spend ti...,QL,PY,STON/O2. 3101282
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,1668570913841087744,53.1,C123,b'gAAAAABk4oZPvhTiLobHeUA3b4hNC5CuL5Il1NaiOmVy...,1964-11-01,Basically there's a family where a little boy ...,SZ,SZ,113803
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,1209947277582521344,11.104627,,b'gAAAAABk4oZPUS_4eWQpyWUOcXtxlAeCWLBwOArMUjbE...,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",Y,TV,373450


## Map to a numerical encoding instead

In [21]:
anonymize.map_cat("Ticket_Copy3", map_type=2)
anonymize.data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SomeDate,RandomReviews,Ticket_Copy,Ticket_Copy2,Ticket_Copy3
0,1,0,3,Name Redacted,female,"[30.0, 40.0)",1,0,2143699640136888064,7.447777,,b'gAAAAABk4oZPhdrbNiXxqpYcjT8Msw22T81tl41peTJf...,1977-05-01,One of the other reviewers has mentioned that ...,F,CY,216
1,2,1,1,Name Redacted,male,"[10.0, 20.0)",1,0,2216941847914850560,75.32205,C85,b'gAAAAABk4oZPhABtGOo1bKYbFRuDhXYQg0eAq37tPqUR...,1962-12-01,A wonderful little production. <br /><br />The...,JL,VD,394
2,3,1,3,Name Redacted,female,"[20.0, 30.0)",0,0,1397738331497762560,7.925,,b'gAAAAABk4oZPRJfCsLB-DzYXa5c_7RHCGY3FJnJRfGZp...,1974-12-01,I thought this was a wonderful way to spend ti...,QL,PY,262
3,4,1,1,Name Redacted,male,"[20.0, 30.0)",1,0,1668570913841087744,53.1,C123,b'gAAAAABk4oZPvhTiLobHeUA3b4hNC5CuL5Il1NaiOmVy...,1964-11-01,Basically there's a family where a little boy ...,SZ,SZ,111
4,5,0,3,Name Redacted,male,"[40.0, 50.0)",0,0,1209947277582521344,11.104627,,b'gAAAAABk4oZPUS_4eWQpyWUOcXtxlAeCWLBwOArMUjbE...,1964-11-01,"Petter Mattei's ""Love in the Time of Money"" is...",Y,TV,408


## Mapping underrepresented values

Very underrepresented values are more likely to give away the identity of a data point, even after anonymization. One way to mask the fact that a value is unique / very infrequent is to replace it with a more frequent value.

Using the ```remove_small_cats``` method, we can replace values with a frequency below a certain threshold with the **most frequent value** or a **random value above the frequency threshold**.

Use the arguments:
- factor: Factor used to define what is considered a category with low representation:
    - a proportion less than 1, as the minimum share of rows that must contain a specific category
    - an integer equal to or greater than 1, as the minimum number of rows that must contain a specific category
- fill_type: Indicates how the infrequent values should be mapped 
    - 'most_frequent' to map to the most frequent value
    - 'rand' to map to a random other value that is above the threshold, with probability matching the actual distribution of values
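
The behaviour described by these arguments can be sketched in plain pandas. This is not the class's actual implementation: the function name ```replace_small_cats```, the ```seed``` argument, and the choice to keep categories whose count is equal to the minimum are assumptions made for illustration, and the class may draw the boundary differently.

```python
import numpy as np
import pandas as pd


def replace_small_cats(series, factor, fill_type="most_frequent", seed=None):
    """Hypothetical stand-in for remove_small_cats (illustration only)."""
    counts = series.value_counts()
    # Assumption: factor < 1 is a minimum share of rows, otherwise a minimum row count.
    min_count = factor * len(series) if factor < 1 else factor
    frequent = counts[counts >= min_count]
    # Missing values are left untouched; only real but rare categories are replaced.
    is_small = series.notna() & ~series.isin(frequent.index)

    result = series.copy()
    if fill_type == "most_frequent":
        result[is_small] = counts.idxmax()
    elif fill_type == "rand":
        # Draw replacements from the frequent categories, weighted by their counts.
        rng = np.random.default_rng(seed)
        probs = (frequent / frequent.sum()).to_numpy()
        result[is_small] = rng.choice(
            frequent.index.to_numpy(), size=int(is_small.sum()), p=probs
        )
    else:
        raise ValueError(f"Unknown fill_type: {fill_type!r}")
    return result
```

Raising an error on an unrecognised ```fill_type``` is a deliberate choice in this sketch, so that a misspelled option fails loudly instead of silently doing nothing.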

In [33]:
help(anonymize.remove_small_cats)

Help on method remove_small_cats in module data_anonymization.anonymization:

remove_small_cats(col, factor, fill_type='most_frequent') method of data_anonymization.anonymization.DataAnonymization instance
    Remove categories that are underrepresented.
    
    Args:
        col (string): Column target of the transformation.
        factor (float / integer): Factor used to define what is considered a category with low representation:
                                  < 1 -> Minimum percentage of rows that contain a specific category (float).
                                  >= 1-> Minimum number of samples that have a specific category (Integer).
        fill_type (string): Type of replacement value replacement (Default value: most_frequent):
                            most_frequent -> Fill with the most frequent category.
                            rand -> Fill with another random category with a probability according to its distribution in the dataset.



### Map Ticket_Copy to the most frequent

In [24]:
# First print the frequency before mapping
print(anonymize.data['Ticket_Copy'].value_counts().head())

Ticket_Copy
OB    7
SI    7
WK    7
WQ    6
ED    6
Name: count, dtype: int64


In [25]:
# Map any value with a frequency of 2 or fewer to the most frequent value
anonymize.remove_small_cats("Ticket_Copy", factor=2, fill_type="most_frequent")
print(anonymize.data['Ticket_Copy'].value_counts().head())

Ticket_Copy
OB    742
SI      7
WK      7
WQ      6
ED      6
Name: count, dtype: int64


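**This is less appropriate when there is no dominant value, or when there is an overwhelming number of infrequent values. In the output above, OB grows from 7 to 742 rows because most ticket values fell below the threshold.**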

### Map Ticket_Copy2 to a random existing value

In [29]:
# First print the frequency before mapping
print(anonymize.data['Ticket_Copy2'].value_counts().head())

Ticket_Copy2
AO    7
XQ    7
WK    7
KQ    6
EO    6
Name: count, dtype: int64


In [30]:
# Map anything with frequency of 2 or fewer to values with at least 3 occurrences
anonymize.remove_small_cats("Ticket_Copy2", factor=2, fill_type="rand")
print(anonymize.data['Ticket_Copy2'].value_counts().head())

Ticket_Copy2
WK    42
XQ    37
AO    36
EO    36
KQ    33
Name: count, dtype: int64


# Dataset sampling

The following downsamples the dataset. In the future, other methods may be provided. 
- If you provide an integer, it will sample that many rows of the data
- If you provide a proportion, it will sample that fraction of the rows

Sampling is done without replacement, and simply uses the pandas ```sample``` method:
- DATAFRAME.sample(n = ..., replace = False)
- DATAFRAME.sample(frac = ..., replace = False)
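
A minimal sketch of how these two calls could be wrapped in one helper is shown below. The function name ```sample_rows``` and the ```seed``` argument are hypothetical, and the rule that an ```int``` means a row count while any other number means a proportion is an assumption about how the class distinguishes the two cases.

```python
import pandas as pd


def sample_rows(df, sample, seed=None):
    """Hypothetical helper: downsample without replacement (illustration only)."""
    if isinstance(sample, int):
        # Integer: keep exactly that many rows.
        # pandas raises ValueError if this exceeds the number of rows.
        return df.sample(n=sample, replace=False, random_state=seed)
    # Otherwise treat the value as a proportion of the rows.
    return df.sample(frac=sample, replace=False, random_state=seed)


toy = pd.DataFrame({"x": range(20)})
print(len(sample_rows(toy, 5, seed=1)))     # row count given directly
print(len(sample_rows(toy, 0.25, seed=1)))  # a quarter of the 20 rows
```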

In [34]:
print(f"Rows before sampling: {anonymize.data.shape[0]}")
anonymize.sampling(100)
print(f"Rows after sampling: {anonymize.data.shape[0]}")

Rows before sampling: 891
Rows after sampling: 100


# Get a summary of all of the data anonymization steps

The Data Anonymization class tracks all the steps taken, as well as any mappings done during the anonymization process, which lets you replicate the anonymization on new data or revert it.

**Use the .report attribute to get the information about order, anonymization type, column, and other relevant information**

In [41]:
anonymize.report

**-- REPORT --**

**1 - Suppression:** col=Name fill_value=Name Redacted

**2 - Shuffling:** col=Sex

**3 - Replace with normal distribution:** col=Age a_min=-inf a_max=inf

**4 - Noise numeric:** col=Fare ratio_std=0.1 a_min=0 a_max=1000

**5 - Bucket numeric:** col=Age a_min=0 a_max=120 a_step=10

**6 - Cap Outliers:** col=Fare q_min=0.05 q_max=0.95

**7 - Noise date:** col=SomeDate low=-10 high=10 delta_type=M

**8 - Hash:** col=Ticket factor=7

**9 - Encrypt:** col=Embarked key=b'_e5D_ZUjFDpVN9Bx6EgI91ykMI0UDM8fpLYA2S1azbw='

**10 - Mapping categories:** col=Ticket_Copy map_type=1

**11 - Mapping categories:** col=Ticket_Copy2 map_type=1

**12 - Mapping categories:** col=Ticket_Copy3 map_type=2

**13 - Remove small categories:** col=Ticket_Copy factor=2 fill_type=most_frequent

**14 - Remove small categories:** col=Ticket_Copy2 factor=2 fill_type=randomm

**15 - Remove small categories:** col=Ticket_Copy2 factor=2 fill_type=random

**16 - Remove small categories:** col=Ticket_Copy2 factor=2 fill_type=rand

**17 - Sampling:** sample=100

**To get the mappings, use ```map_cat_dict``` and ```key_dict``` attributes**

In [39]:
anonymize.map_cat_dict.keys() # This is a dictionary with the mappings associated with each of the columns that were mapped

dict_keys(['Ticket_Copy', 'Ticket_Copy2', 'Ticket_Copy3'])

In [40]:
anonymize.key_dict.keys() # This is a dictionary with the encryption/encoding keys used

dict_keys(['Embarked'])

**NOTE: the two dictionaries above only retain the last mapping or key applied to each column. They will not capture multiple mappings applied to the same column.**
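
As an illustration of replicating or reverting a stored category mapping, the sketch below assumes each mapping is a plain dict of the form {original value: anonymized value}. The notebook does not show the actual structure of ```map_cat_dict```, so that layout, the helper names, and the toy values are all assumptions.

```python
import pandas as pd


def apply_mapping(series, mapping):
    """Replicate a mapping on new data; values missing from the mapping become NaN."""
    return series.map(mapping)


def revert_mapping(series, mapping):
    """Undo a mapping, which is only possible when it is one-to-one."""
    inverse = {new: old for old, new in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("Mapping is not one-to-one, so it cannot be reverted")
    return series.map(inverse)


# Toy mapping in the assumed {original: anonymized} form (not real notebook data)
toy_mapping = {"ticket_a": "F", "ticket_b": "JL", "ticket_c": "QL"}
original = pd.Series(["ticket_a", "ticket_c", "ticket_a"])

anonymized = apply_mapping(original, toy_mapping)
restored = revert_mapping(anonymized, toy_mapping)
print(restored.equals(original))
```

Because only the last mapping per column is kept, a revert along these lines would undo just the most recent mapping step on that column.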

# Note about advanced methods (code not covered here)

## GAN (Generative Adversarial Networks) to create fake data

* https://github.com/evajurado/fakeyourdata
* https://pypi.org/project/table-evaluator
* https://machinelearningmastery.com/how-to-code-a-wasserstein-generative-adversarial-network-wgan-from-scratch
* https://www.arxiv-vanity.com/papers/1706.08224
* https://github.com/soumith/ganhacks

The OW DNA team has a work-in-progress version using GANs: https://github.com/mmctech/data-anonymization

## Clustering methods

Individuals can be assigned to granular clusters, and the members of each cluster can then be replaced with the cluster average (or with values randomly sampled from within the body of the cluster, e.g. the interquartile range for each variable).
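
A simplified, single-variable version of this idea is sketched below. Instead of a real clustering algorithm, it sorts the values and groups them into runs of at least ```k``` neighbours (often called microaggregation), then replaces each value with its group mean. The function name and the grouping rule are assumptions for illustration, and the sketch expects a numeric column with a unique index and no missing values.

```python
import numpy as np
import pandas as pd


def microaggregate(series, k):
    """Replace each value with the mean of its rank-based group of at least k values."""
    order = series.sort_values().index
    group_ids = np.arange(len(series)) // k
    # Fold a trailing group with fewer than k members into the previous group.
    n_full_groups = len(series) // k
    group_ids = np.minimum(group_ids, max(n_full_groups - 1, 0))
    # Index the group labels by the original row labels so groupby aligns them.
    groups = pd.Series(group_ids, index=order)
    return series.groupby(groups).transform("mean")


fares = pd.Series([1.0, 10.0, 2.0, 11.0, 3.0, 12.0, 100.0])
print(microaggregate(fares, k=3))
```

Every output value is then shared by at least ```k``` rows, at the cost of pulling outliers (such as the 100.0 here) towards their neighbours.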

A formal method that is similar is the k-anonymity approach: https://github.com/kaylode/k-anonymity (though this method censors data completely instead of faking the values)
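
To check how close a dataset already is to k-anonymity, one can compute the size of the smallest group of rows that share the same combination of quasi-identifier values; the dataset is k-anonymous for any k up to that size. The sketch below is independent of the linked repository, and the function name, column names, and toy rows are hypothetical.

```python
import pandas as pd


def k_anonymity(df, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    group_sizes = df.groupby(quasi_identifiers, dropna=False, observed=True).size()
    return int(group_sizes.min())


# Toy data (not taken from the notebook's dataset)
people = pd.DataFrame(
    {
        "Sex": ["male", "male", "female", "female", "female"],
        "AgeBand": ["20-30", "20-30", "20-30", "20-30", "30-40"],
    }
)
print(k_anonymity(people, ["Sex"]))             # smallest group when grouping by Sex only
print(k_anonymity(people, ["Sex", "AgeBand"]))  # smaller once AgeBand is added
```

Adding more quasi-identifier columns can only keep this number the same or reduce it, which is why bucketing and removing small categories help.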

## More to be added