# Data Anonymisation
Personal data is the core concept of data protection: data protection law applies only when data relates to individuals. Singapore's Personal Data Protection Act (PDPA) provides a baseline standard of protection for personal data in Singapore. The PDPA recognises both the need to protect individuals’ personal data and the need of organisations to collect, use or disclose personal data for legitimate and reasonable purposes.

Organisations are required to comply with the various data protection obligations if they undertake activities relating to the collection, use or disclosure of personal data.

When working in Big Data, Data Science or related fields, it is essential to know about these laws and how anonymisation and pseudonymisation make it possible to keep using the data for your use cases.


## What is Personal Data?
Personal data is any information relating to an identified or identifiable person. An “identifiable” person is one who can be identified directly or indirectly, in particular by association with an identifier such as a name, an identification number, location data or other special characteristics. The mere possibility of identifying a person is sufficient.

Examples of personal data include:
* Name
* Address
* E-mail address
* Telephone number
* Birthday

Anonymising data offers one way to keep using such data: once data is anonymised, it is no longer personal data. In this practical, we will cover various techniques to anonymise data.


## Anonympy
**anonympy** is a general data anonymisation library for images, PDFs, and tabular data. You will need to install the following two packages:
> pip install anonympy

> pip install cape-privacy==0.3.0 --no-deps

If you encounter issues installing the anonympy package, you can download the source from <a href='https://pypi.org/project/anonympy/#files'>PyPI</a>, unzip it and run the following command in the Anaconda Prompt:
> python setup.py install

Let us start by importing the data for this practical.


### Import libraries and Read Data

Looking at the columns, we can see that all of them are **personal** and **sensitive**. Therefore, we will have to apply a relevant technique to every column. First, we need to initialise our dfAnonymizer object.

It’s important to know the data type of a column before applying any functions to it. Let’s check the data types.

Let’s see what methods are available to us.

From the list that available_methods returned, we can find functions for each data type.

## Attribute Suppression
Attribute suppression is the removal of an entire data attribute (i.e. a column of data), especially where the data is not needed in the dataset and may contain unique data values that cannot be anonymised further.

Let's apply attribute suppression by calling ```column_suppression``` on the address and web columns.
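To make the idea concrete independently of anonympy, here is a minimal plain-Python sketch of attribute suppression. The records and the helper `suppress_attributes` are hypothetical and only illustrate the technique; they are not part of the library.

```python
# Hypothetical records; the field names mirror the columns mentioned above.
records = [
    {"first_name": "Alice", "address": "1 Example Road", "web": "http://example.com/a", "age": 34},
    {"first_name": "Bob", "address": "2 Sample Street", "web": "http://example.com/b", "age": 51},
]


def suppress_attributes(rows, columns):
    """Return copies of the rows with the given columns removed entirely."""
    drop = set(columns)
    return [{key: value for key, value in row.items() if key not in drop} for row in rows]


suppressed = suppress_attributes(records, ["address", "web"])
print(suppressed)
```

The original `records` list is left untouched, so the suppressed copy can be shared while the source data stays under tighter control.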

## Character Masking

Character masking refers to changing the characters of a data value. This can be done by using a consistent symbol (e.g. “*” or “x”). Masking is typically applied only to some characters in the attribute.

Character masking is used when the data value is a string of characters and hiding part of it is sufficient to provide the extent of anonymity required.

Let's apply partial email masking by calling ```categorical_email_masking``` on the email column.
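As a rough illustration of partial masking, the sketch below keeps the first character of the local part and the domain, and masks everything in between. The helper `mask_email` is hypothetical, and the exact output format produced by anonympy may differ.

```python
def mask_email(email, mask_char="*"):
    """Keep the first character of the local part and the domain; mask the rest."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognisable address: mask everything rather than leak it.
        return mask_char * len(email)
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


print(mask_email("jane.doe@example.com"))  # j*******@example.com
```

How many characters to reveal is a judgement call: the fewer that remain visible, the harder it is to guess the original address.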

## Pseudonymisation
Pseudonymisation refers to the replacement of identifying data with made-up values. It is also referred to as coding. Pseudonyms can be irreversible when the original values are disposed of properly and the pseudonymisation is done in a nonrepeatable fashion. They can also be reversible (by the owner of the original data) when the original values are securely kept, but can be retrieved and linked back to the pseudonym should the need arise.

Pseudonymisation is used when data values need to be uniquely distinguished and no character or any other implied information about the direct identifiers of the original attribute is kept.

Let's apply pseudonymisation by calling ```categorical_tokenization``` on the id column.

Create the mapping table.
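The relationship between pseudonyms and a mapping table can be sketched in plain Python as follows. This is an illustration under stated assumptions, not how anonympy implements tokenisation: the helper `pseudonymise` and the sample identifiers are hypothetical, and random tokens from the standard `secrets` module stand in for the pseudonyms.

```python
import secrets


def pseudonymise(values, token_bytes=8):
    """Replace each distinct value with a random token.

    Returns the pseudonymised list and the mapping table (original -> token).
    """
    mapping = {}
    used = set()
    result = []
    for value in values:
        if value not in mapping:
            token = secrets.token_hex(token_bytes)
            while token in used:  # regenerate on the (unlikely) collision
                token = secrets.token_hex(token_bytes)
            used.add(token)
            mapping[value] = token
        result.append(mapping[value])
    return result, mapping


ids = ["S1234567A", "S7654321B", "S1234567A"]  # hypothetical identifiers
pseudonyms, mapping_table = pseudonymise(ids)

# Inverting the mapping table lets the data owner recover the originals.
reverse_table = {token: original for original, token in mapping_table.items()}
print(pseudonyms)
```

If the mapping table is stored securely, the pseudonymisation is reversible by the data owner; if it is destroyed, the random tokens cannot be linked back to the original values.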

## Data Perturbation
Data perturbation is the modification of values in the data by adding "noise" to the original data (e.g. +/- random values). The degree of perturbation should be proportionate to the range of values of the attribute. For example, an individual's salary of "256,654" could be modified to "260,000" by rounding it to the nearest 10,000. Alternatively, it could be modified to "250,554" by subtracting a random number of up to $10,000 from the original value.

Let's add some noise using ```numeric_noise``` on the age column.
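A minimal plain-Python sketch of noise addition is shown here. The helper `add_noise`, its `max_shift` parameter and the sample ages are assumptions for illustration; they do not reflect the parameters or noise distribution that anonympy uses.

```python
import random


def add_noise(values, max_shift, seed=None):
    """Shift each value by a random integer in [-max_shift, +max_shift]."""
    rng = random.Random(seed)
    return [value + rng.randint(-max_shift, max_shift) for value in values]


ages = [23, 35, 47, 61]  # hypothetical ages
noisy_ages = add_noise(ages, max_shift=3, seed=42)
print(noisy_ages)
```

Choosing `max_shift` relative to the spread of the column follows the point above that the degree of perturbation should be proportionate to the range of values.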

Let's try applying ```numeric_rounding``` to the salary column.
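The rounding example from the description above (256,654 rounded to the nearest 10,000) can be reproduced with a short sketch. The helper `round_to_nearest` is hypothetical and assumes non-negative integer values; it is not the library's implementation.

```python
def round_to_nearest(value, base):
    """Round a non-negative integer to the nearest multiple of base (halves round up)."""
    return ((value + base // 2) // base) * base


print(round_to_nearest(256654, 10000))  # 260000
```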

## Generalisation
Generalisation is a deliberate reduction in the precision of data. Examples include converting a person’s age into an age range or a precise location into a less precise location. This technique is also referred to as recoding.

Generalisation is used for values that can be generalised and still be useful for the intended purpose.

Let's apply generalisation using ```numeric_binning``` on the salary column.
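To illustrate generalisation, the sketch below converts exact salaries into fixed-width ranges. The helper `to_range`, the bin width and the sample salaries are hypothetical; anonympy may choose its bins and labels differently.

```python
def to_range(value, width):
    """Map a non-negative integer to a fixed-width range label such as '250000-299999'."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


salaries = [256654, 48200, 120000]  # hypothetical salaries
salary_ranges = [to_range(salary, 50000) for salary in salaries]
print(salary_ranges)  # ['250000-299999', '0-49999', '100000-149999']
```

Wider bins give stronger anonymity but less analytical detail, so the width should be chosen with the intended use of the data in mind.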

## Synthetic Data
Synthetic data is created by applying heavy anonymisation to the original data so that all data attributes (including target attributes) are modified significantly. The resulting dataset and the individual records created using this methodology will not bear any resemblance to any individual’s record and do not retain the characteristics of the original dataset.

Let's substitute the names in the first_name column with fake ones. To do so, we first have to check whether Faker has a corresponding method.

Faker has a method called ```first_name```, so let's use it to replace the values in the column.

Faker also has methods for address and city. The ```categorical_fake_auto``` method will change the city automatically because the column names correspond to the method names. For phone, the column name differs from the method name, so we need to map the phone column to the ```phone_number``` method explicitly.
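The column-to-method matching described above can be sketched without Faker. Everything in this example is a hypothetical stand-in: the `GENERATORS` table plays the role of Faker's methods, and `fake_columns` only imitates the idea of matching columns automatically by name and accepting explicit mappings for the rest.

```python
import random

rng = random.Random(0)

# Hypothetical stand-ins for Faker-style generators, keyed by method name.
GENERATORS = {
    "first_name": lambda: rng.choice(["Alex", "Sam", "Jordan", "Taylor"]),
    "city": lambda: rng.choice(["Springfield", "Riverton", "Lakeside"]),
    "phone_number": lambda: "".join(rng.choice("0123456789") for _ in range(8)),
}


def fake_columns(rows, column_to_method=None):
    """Replace values in every column that maps to a generator.

    Columns whose name equals a generator name are matched automatically;
    column_to_method supplies explicit mappings for the remaining columns.
    """
    mapping = {name: name for name in GENERATORS}
    mapping.update(column_to_method or {})
    return [
        {col: GENERATORS[mapping[col]]() if col in mapping else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [{"first_name": "Alice", "city": "Singapore", "phone": "61234567", "age": 34}]
faked = fake_columns(rows, {"phone": "phone_number"})
print(faked)
```

Without the explicit `{"phone": "phone_number"}` mapping, the phone column would be left unchanged because no generator shares its name.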

## Summary of Anonymisation
For a short summary of the anonymisation methods applied to the data, call the ```info()``` method.

To apply all the different anonymisation methods in a single line of code, you can use the ```anonymize``` method.