# ARX Data Anonymization Tool

ARX is a data anonymization tool. In general, it takes a dataset as input, applies one or more privacy models, and produces an anonymized version of that dataset, thereby protecting the privacy of the individuals it describes.

At its core, ARX uses a highly efficient globally-optimal search algorithm for transforming data with full-domain generalization and record suppression. The transformation of attribute values is implemented through domain generalization hierarchies, which represent valid transformations that can be applied to individual-level values.

## Classic Privacy Models

The ARX tool offers standard privacy models that are well founded in theory and widely used to ensure anonymity given a plain dataset.

### k-Anonymity

This well-known privacy model aims at protecting datasets from re-identification in the prosecutor model. A dataset is $k$-anonymous if each record cannot be distinguished from at least $k-1$ other records with regard to the quasi-identifiers. Each group of indistinguishable records forms a so-called equivalence class.
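
The definition can be checked directly on a small table. The following is a minimal sketch, not part of ARX: the function names, the toy records, and the column names `age`, `zip` and `disease` are all hypothetical.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(size >= k for size in classes.values())

# Hypothetical toy data in which age and zip are already generalized.
records = [
    {"age": "20-29", "zip": "481**", "disease": "flu"},
    {"age": "20-29", "zip": "481**", "disease": "asthma"},
    {"age": "30-39", "zip": "482**", "disease": "flu"},
    {"age": "30-39", "zip": "482**", "disease": "diabetes"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))
print(is_k_anonymous(records, ["age", "zip"], k=3))
```

Both equivalence classes here contain two records, so the table is 2-anonymous but not 3-anonymous.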

### Average risk


This privacy model can be used for protecting datasets from re-identification in the marketer model by enforcing a threshold on the average re-identification risk of the records. By combining the model with $k$-anonymity, a privacy model called strict-average risk can be constructed.
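
As a rough illustration, assume the re-identification risk of a record is one divided by the size of its equivalence class. The average over all records then equals the number of classes divided by the number of records. The sketch below uses hypothetical function names and data, and the combined check reflects one reading of strict-average risk, not the ARX implementation.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Size of each equivalence class, keyed by quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def average_risk(records, quasi_identifiers):
    """Average of 1/|class| over all records, i.e. #classes / #records."""
    sizes = class_sizes(records, quasi_identifiers)
    return len(sizes) / len(records)

def satisfies_strict_average_risk(records, quasi_identifiers, k, threshold):
    """Assumed reading: average risk below a threshold AND k-anonymity."""
    sizes = class_sizes(records, quasi_identifiers)
    k_anonymous = all(size >= k for size in sizes.values())
    return k_anonymous and average_risk(records, quasi_identifiers) <= threshold

# Hypothetical data: one class of three records and one class of two.
records = [{"age": "20-29"}] * 3 + [{"age": "30-39"}] * 2

print(average_risk(records, ["age"]))
print(satisfies_strict_average_risk(records, ["age"], k=2, threshold=0.5))
```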

### ℓ-Diversity

This privacy model can be used to protect data against attribute disclosure by ensuring that each sensitive attribute has at least $\ell$ "well represented" values in each equivalence class. Different variants, which implement different measures of diversity, have been proposed.
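
The simplest variant, often called distinct ℓ-diversity, reads "well represented" as "distinct". A minimal sketch of that variant follows; the function name and the toy records are hypothetical, and other variants would need a different measure.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class holds at least l distinct sensitive values."""
    values_per_class = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values_per_class[key].add(r[sensitive])
    return all(len(values) >= l for values in values_per_class.values())

# Hypothetical data: the second class contains only one distinct disease.
records = [
    {"age": "20-29", "disease": "flu"},
    {"age": "20-29", "disease": "asthma"},
    {"age": "30-39", "disease": "flu"},
    {"age": "30-39", "disease": "flu"},
]

print(is_distinct_l_diverse(records, ["age"], "disease", l=1))
print(is_distinct_l_diverse(records, ["age"], "disease", l=2))
```

The table is 2-anonymous with respect to `age`, yet anyone known to be in the `30-39` class is revealed to have the flu, which is the attribute disclosure this model guards against.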

## Differential Privacy

Under the strict definition of differential privacy (DP), the dataset is accessed only through queries, subject to a privacy budget that must not be exceeded. The ARX team proposes a rather different application of DP in their tool, where privacy protection is not considered a property of a dataset, but a property of a data processing method.

DP guarantees that the probability of any possible output of the anonymization process does not change "by much" if the data of an individual is added to or removed from the input data.
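
For reference, the standard formal statement of this guarantee is given below. It is the usual textbook definition, not something specific to ARX, and the symbols $\mathcal{A}$ (the randomized anonymization process), $D$, $D'$ (two input datasets that differ in the data of one individual) and $S$ (any set of possible outputs) are introduced here only for illustration. The process satisfies $(\epsilon, \delta)$-differential privacy if, for all such $D$, $D'$ and $S$,

$$\Pr[\mathcal{A}(D) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta$$

Smaller values of $\epsilon$ and $\delta$ mean that the two output distributions are harder to tell apart, which is the precise meaning of "does not change by much".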

To implement differential privacy, ARX uses the __SafePub algorithm__.

### Concepts used

__Random Sampling__: A probability sampling method selects units by some form of random selection, set up so that every unit in the population has the same probability of being chosen. In SafePub, each record is drawn independently with probability $\beta$.

__Attribute Generalization__: In SafePub, generalization is achieved through user-defined hierarchies, which describe rules for replacing values with more general but semantically consistent values on increasing levels of generalization. 

__Record Suppression__: Deletion of a specific row from the input dataset.
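
The three concepts can be chained on a single-attribute dataset, as in the sketch below. It only illustrates the individual steps and is not the SafePub algorithm itself; the function name, the dictionary-based `hierarchy`, and the fixed generalization `level` are assumptions made for this example.

```python
import random
from collections import Counter

def sample_generalize_suppress(values, hierarchy, level, k, beta, rng):
    """Apply sampling, generalization and suppression to a list of values."""
    # 1. Random sampling: keep each record independently with probability beta.
    sampled = [v for v in values if rng.random() < beta]
    # 2. Attribute generalization: replace each value by its entry at `level`.
    generalized = [hierarchy[v][level] for v in sampled]
    # 3. Record suppression: drop records whose value occurs fewer than k times.
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

# Hypothetical hierarchy for ages 0..19: original value, 5-year and 10-year
# intervals, and full suppression as the last level.
hierarchy = {
    age: [
        str(age),
        f"{age // 5 * 5}-{age // 5 * 5 + 4}",
        f"{age // 10 * 10}-{age // 10 * 10 + 9}",
        "*",
    ]
    for age in range(20)
}

ages = [1, 2, 3, 3, 7, 12, 13, 18]
rng = random.Random(42)  # seeded only to make the example repeatable
print(sample_generalize_suppress(ages, hierarchy, level=1, k=2, beta=0.8, rng=rng))
```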

### Theorem

__Random sampling with probability $\beta$ followed by attribute generalization and the suppression of every record which appears fewer than $k$ times__ satisfies $(\epsilon, \delta)$-differential privacy for every $\epsilon \geq -\ln(1-\beta)$ with
$$\delta = \max_{n:n \geq n_m} \sum_{j>\gamma n}^{n}f(j;n,\beta)$$

where $n_m = \left\lceil \frac{k}{\gamma} - 1 \right\rceil$, $\gamma = \frac{e^\epsilon-1+\beta}{e^\epsilon}$ and $f(j;n,\beta) = \binom{n}{j} \beta^j(1-\beta)^{n-j}$ is the probability mass function of the binomial distribution.
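
To get a feeling for the size of $\delta$, the formula can be evaluated numerically. The sketch below reads the lower summation bound as $j > \gamma \cdot n$ and rounds $n_m$ up to an integer. The theorem takes the maximum over all $n \geq n_m$, whereas this code scans only a finite window, which assumes that the tail sums keep shrinking as $n$ grows; the function names and the `window` parameter are hypothetical.

```python
import math

def binomial_pmf(j, n, beta):
    """f(j; n, beta), computed in log space to avoid overflow for large n."""
    log_p = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
             + j * math.log(beta) + (n - j) * math.log(1 - beta))
    return math.exp(log_p)

def delta(k, beta, epsilon, window=200):
    """Approximate delta by scanning n in [n_m, n_m + window] (assumption)."""
    if epsilon < -math.log(1 - beta):
        raise ValueError("epsilon must be at least -ln(1 - beta)")
    gamma = (math.exp(epsilon) - 1 + beta) / math.exp(epsilon)
    n_m = math.ceil(k / gamma - 1)
    best = 0.0
    for n in range(max(n_m, 1), n_m + window + 1):
        first_j = math.floor(gamma * n) + 1  # smallest integer j with j > gamma * n
        tail = sum(binomial_pmf(j, n, beta) for j in range(first_j, n + 1))
        best = max(best, tail)
    return best

for k in (5, 20, 50):
    print(k, delta(k, beta=0.5, epsilon=1.0))
```

For fixed $\beta$ and $\epsilon$, a larger $k$ pushes the summation further into the tail of the binomial distribution, so $\delta$ decreases as $k$ grows.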

## Techniques

To achieve attribute generalization, ARX uses so-called __hierarchies__. They are either imported from a CSV file or hard-coded through the API, and they are used to generalize the values of a field. An example is given below, where the attribute to generalize is the age of a person. Each row shows how one value changes as the generalization level increases.

1<sup>st</sup> level | 2<sup>nd</sup> level | 3<sup>rd</sup> level | 4<sup>th</sup> level | 5<sup>th</sup> level
--- | --- | --- | --- | ---
1 | 0-4 | 0-9 | 0-19 | *
2 | 0-4 | 0-9 | 0-19 | *
3 | 0-4 | 0-9 | 0-19 | *
4 | 0-4 | 0-9 | 0-19 | *
5 | 5-9 | 0-9 | 0-19 | *
6 | 5-9 | 0-9 | 0-19 | *
7 | 5-9 | 0-9 | 0-19 | *
8 | 5-9 | 0-9 | 0-19 | *
9 | 5-9 | 0-9 | 0-19 | *
10 | 10-14 | 10-19 | 0-19 | *
11 | 10-14 | 10-19 | 0-19 | *
12 | 10-14 | 10-19 | 0-19 | *
13 | 10-14 | 10-19 | 0-19 | *
14 | 10-14 | 10-19 | 0-19 | *
15 | 15-19 | 10-19 | 0-19 | *
16 | 15-19 | 10-19 | 0-19 | *
17 | 15-19 | 10-19 | 0-19 | *
18 | 15-19 | 10-19 | 0-19 | *
19 | 15-19 | 10-19 | 0-19 | *
20 | 20-24 | 20-29 | 20-39 | *
21 | 20-24 | 20-29 | 20-39 | *
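
Rows of this kind do not have to be typed by hand. The sketch below generates them for interval widths of 5, 10 and 20 years, assuming every value is placed in the interval that actually contains it. The function names are hypothetical, and the code uses none of the ARX API.

```python
def interval_label(value, width):
    """Label of the fixed-width interval containing value, e.g. 17 -> '15-19'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def age_hierarchy_row(age, widths=(5, 10, 20)):
    """One hierarchy row: original value, coarser intervals, then '*'."""
    return [str(age)] + [interval_label(age, w) for w in widths] + ["*"]

for age in (4, 5, 17, 21):
    print(" | ".join(age_hierarchy_row(age)))
```

The resulting lists could then be written to a file with Python's `csv` module for import into ARX, using whichever delimiter the import dialog is configured for.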

## Testing

To test the accuracy of the models used by ARX, we run simple NumPy queries in Python on the datasets produced by the anonymization process. To limit the effect of runs with extremely high noise, we run the anonymization tool multiple times and construct the output dataset from the mean values of the fields.
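
One way to organize such a test is sketched below. It averages the result of a query over several anonymized outputs and compares it with the result on the original data. This is a simplified reading of the procedure: it averages query results instead of individual fields, it uses the standard library in place of NumPy (`np.mean` would play the same role), and the data and function names are hypothetical.

```python
from statistics import mean

def averaged_query(runs, query):
    """Evaluate `query` on each anonymized output and average the results."""
    return mean(query(run) for run in runs)

def relative_error(estimate, truth):
    """Relative deviation of the anonymized estimate from the true value."""
    return abs(estimate - truth) / abs(truth)

# Hypothetical numeric age columns: the original and three anonymized runs.
original_ages = [23, 35, 41, 58]
runs = [
    [22, 37, 42, 57],
    [27, 32, 42, 62],
    [22, 32, 37, 57],
]

estimate = averaged_query(runs, mean)
print(estimate, relative_error(estimate, mean(original_ages)))
```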

### Problems we faced

As shown in the table above, ARX hierarchies tend to treat every type of value as a string in order to replace it with an interval. This is not desirable for the tests mentioned above. Thus, we had to come up with a better way of defining hierarchies. The ARX GUI provides a wizard with a variety of options, so the user can easily create a hierarchy for many data types.
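
Independently of how the hierarchy is defined, interval strings can also be turned back into numbers on the analysis side. The following sketch maps a label such as `10-14` to its midpoint so that numeric queries can run on generalized output. The function name is hypothetical, and the parsing assumes non-negative bounds separated by a single hyphen.

```python
def interval_midpoint(label):
    """Map '10-14' to 12.0, a plain number to itself, and '*' to None."""
    if label == "*":
        return None  # fully suppressed value carries no numeric information
    low, sep, high = label.partition("-")
    if not sep:
        return float(label)  # ungeneralized value such as '7'
    return (float(low) + float(high)) / 2

column = ["0-4", "10-14", "7", "*"]
numeric = [m for m in map(interval_midpoint, column) if m is not None]
print(numeric)
```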

With help from Fabian Prasser, we opted to treat the integer values as numbers, and in each level apply __TODO............__