# Velocity-Aware Geo-Indistinguishability

This notebook walks through how to replicate the results obtained with the Velocity-Aware Geo-Indistinguishability (VA-GI) mechanism using the Privkit framework.

## Abstract

Location Privacy-Preserving Mechanisms (LPPMs) have been proposed to mitigate the risks of privacy disclosure yielded from location sharing. However, due to the nature of this type of data, spatio-temporal correlations can be leveraged by an adversary to extenuate the protections. Moreover, the application of LPPMs at collection time has been limited due to the difficulty in configuring the parameters and in understanding their impact on the privacy level by the end-user. In this work we adopt the velocity of the user and the frequency of reports as a metric for the correlation between location reports. Based on such metric we propose a generalization of Geo-Indistinguishability denoted Velocity-Aware Geo-Indistinguishability (VA-GI). We define a VA-GI LPPM that provides an automatic and dynamic trade-off between privacy and utility according to the velocity of the user and the frequency of reports. This adaptability can be tuned for general use, by using city or country-wide data, or for specific user profiles, thus warranting fine-grained tuning for users or environments. Our results using vehicular trajectory data show that VA-GI achieves a dynamic trade-off between privacy and utility that outperforms previous works. Additionally, by using a Gaussian distribution as estimation for the distribution of the velocities, we provide a methodology for configuring our proposed LPPM without the need for mobility data. This approach provides the required privacy-utility adaptability while also simplifying its configuration and general application in different contexts.

## Citation

Please consider citing our publication in your scientific work:

```
@inproceedings{10.1145/3577923.3583644,
    author = {Mendes, Ricardo and Cunha, Mariana and Vilela, Jo\~{a}o P.},
    title = {Velocity-Aware Geo-Indistinguishability},
    year = {2023},
    isbn = {9798400700675},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3577923.3583644},
    doi = {10.1145/3577923.3583644},
    booktitle = {Proceedings of the Thirteenth ACM Conference on Data and Application Security and Privacy},
    pages = {141–152},
    numpages = {12},
    location = {Charlotte, NC, USA},
    series = {CODASPY '23}
}
```

## Getting started

This tutorial gives a quick rundown of:
- what scenarios are and what results are expected from each subset
- how to load each scenario
- how to apply each PPM to every scenario
- how to visualize the quality loss obtained from each application of a PPM to each scenario

### What are scenarios?

Scenarios are subsets of a dataset that satisfy certain properties. Because of those properties, we expect each subset to lead to different results, as shown below.

Four scenarios can be extracted from the *cabspotting* dataset, based on the user velocity and the report velocity ($v_u$ and $v_r$, respectively).

- ↑$v_u$, ↑$v_r$ - balance privacy and utility
- ↑$v_u$, ↓$v_r$ - favor utility
- ↓$v_u$, ↑$v_r$ - favor privacy
- ↓$v_u$, ↓$v_r$ - balance privacy and utility

The symbols ↑ and ↓ denote a high and a low value, respectively.

These scenarios are stored as `.pkl` files, one per scenario, in the `cabspotting` folder inside Privkit's data folder.
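
As a quick reference, the expected trade-offs listed above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary and the `describe` helper are hypothetical and not part of Privkit, and the keys assume the scenario identifiers used for the scenario files in this notebook.

```python
# Expected trade-off per scenario, transcribed from the list above.
# EXPECTED_TRADE_OFF and describe() are illustrative names, not Privkit API.
EXPECTED_TRADE_OFF = {
    'high_vu_high_vr': 'balance privacy and utility',
    'high_vu_low_vr': 'favor utility',
    'low_vu_high_vr': 'favor privacy',
    'low_vu_low_vr': 'balance privacy and utility',
}


def describe(scenario_id):
    """Return a one-line description of the expected behavior for a scenario id."""
    return '{}: {}'.format(scenario_id, EXPECTED_TRADE_OFF[scenario_id])


for scenario_id in EXPECTED_TRADE_OFF:
    print(describe(scenario_id))
```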


In [2]:
import privkit as pk
from privkit.utils import constants

# ID of the dataset
file = 'cabspotting'

# list of the IDs of each scenario
scenarios = ['high_vu_high_vr', 'high_vu_low_vr', 'low_vu_high_vr', 'low_vu_low_vr']

# list of the IDs of each PPM
ppms = ['va_gi', 'planar_laplace', 'adaptive', 'clustering']

### Loading scenarios, applying PPMs and saving results

We have already seen how to load datasets and apply Privacy-Preserving Mechanisms.

The next step is to apply every mechanism to every scenario and save the results for later visualization.


In [None]:
import os

import numpy as np

# privacy parameter; must match the value used inside apply_ppm_to_scenario
epsilon = 0.016

# iterate over every pair of scenario and ppm
# (apply_ppm_to_scenario is defined in a later cell; run that cell first)
for scenario in scenarios:
    for ppm in ppms:
        # load data
        location_data = pk.LocationData('{}_{}'.format(scenario, ppm))
        location_data.load_data('{}{}/{}.pkl'.format(constants.data_folder, file, scenario))

        # apply the mechanism
        location_data.data = apply_ppm_to_scenario(ppm, location_data)

        # extract quality loss values from the execution of the ppm
        quality_loss_values = location_data.data[constants.QUALITY_LOSS]

        # save quality loss values under <output_folder>/<ppm>/<dataset>/,
        # which is the layout the plotting script reads from
        filepath = '{}{}/{}/'.format(constants.output_folder, ppm, file)
        os.makedirs(filepath, exist_ok=True)
        filename = 'ql_{}_e{}'.format(scenario, int(epsilon * 1000))
        np.save(filepath + filename, quality_loss_values)

The function *apply_ppm_to_scenario(.)* adds new columns to the *location_data* dataset containing the obfuscated locations and the quality loss that results from applying a PPM. Recall that quality loss is a metric that evaluates the application of a PPM, given by the distance between the obfuscated locations produced by the mechanism and the original locations.
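
To make the quality loss metric concrete, here is a minimal sketch that computes it by hand. It assumes locations are (latitude, longitude) pairs in degrees and that the distance is the great-circle (haversine) distance in meters; Privkit may compute the distance differently, and the function names below are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def quality_loss(original, obfuscated):
    """Per-report quality loss: distance between each original and obfuscated point."""
    return [haversine_m(lat, lon, olat, olon)
            for (lat, lon), (olat, olon) in zip(original, obfuscated)]
```

A report whose obfuscated location coincides with the original has a quality loss of zero; the further the mechanism moves the report, the larger the loss.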

Each result is saved under the *privkit/output/ppm/* path, where *ppm* is replaced by the ID of each PPM.

We now define the function that applies the selected mechanism and returns the modified dataset, including the computed obfuscated locations and the resulting quality loss.

In [None]:
def apply_ppm_to_scenario(ppm, location_data):
    # privacy parameter used when applying the PPMs
    epsilon = 0.016
    
    if ppm == 'planar_laplace':
        planar_laplace = pk.PlanarLaplace(epsilon=epsilon)
        return planar_laplace.execute(location_data)

    elif ppm == 'clustering':
        # Clustering Geo-Ind additional parameters
        r = np.log(4)/epsilon
        
        clustering = pk.ClusteringGeoInd(r=r, epsilon=epsilon)
        return clustering.execute(location_data)

    elif ppm == 'adaptive':
        # Adaptive Geo-Ind additional parameters
        ws = 2
        delta1 = 124.29
        delta2 = 428.56
        
        adaptive = pk.AdaptiveGeoInd(epsilon=epsilon, ws=ws, delta1=delta1, delta2=delta2)
        return adaptive.execute(location_data)

    elif ppm == 'va_gi':
        # VA-GI additional parameter
        m = 10
    
        va_gi = pk.VAGI(epsilon=epsilon, m=m)
        return va_gi.execute(location_data)
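
To give some intuition for the privacy parameter, the sketch below samples planar Laplace noise with the standard polar method: a uniform angle and a radius drawn from a Gamma distribution with shape 2 and scale 1/epsilon. This is not Privkit's implementation, which may sample the radius differently, and the function name is hypothetical. With epsilon = 0.016 the expected displacement is 2/epsilon = 125 m.

```python
import math
import random


def sample_planar_laplace_offset(epsilon, rng=random):
    """Draw a 2D noise offset (dx, dy) in meters from a planar Laplace distribution.

    The angle is uniform and the radius follows a Gamma distribution with
    shape 2 and scale 1/epsilon, so the mean radius is 2/epsilon.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.gammavariate(2.0, 1.0 / epsilon)
    return radius * math.cos(theta), radius * math.sin(theta)


epsilon = 0.016
rng = random.Random(0)  # seeded generator for reproducibility
radii = [math.hypot(*sample_planar_laplace_offset(epsilon, rng)) for _ in range(10000)]
mean_radius = sum(radii) / len(radii)  # should be close to 2 / epsilon = 125 m
```

A smaller epsilon yields larger displacements, that is, more privacy at the cost of a higher quality loss.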

There are, of course, multiple ways of achieving the same results. Since scenarios and mechanisms are independent, every iteration of the double for-loop can run concurrently, which would lower the execution time. The tutorial keeps the sequential loop for the sake of simplicity.
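
For readers who want to experiment with concurrency, a minimal sketch using only the standard library is shown below. The `run_all` helper and its `worker` argument are hypothetical: `worker(scenario, ppm)` stands for whatever code loads a scenario, applies a PPM and saves the result. Threads are used by default to keep the example simple. For CPU-bound mechanisms, passing `ProcessPoolExecutor` as `executor_cls` is likely needed to see a speed-up, in which case `worker` must be a picklable top-level function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_all(worker, scenarios, ppms, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Run worker(scenario, ppm) for every pair concurrently and collect the results."""
    pairs = list(product(scenarios, ppms))
    with executor_cls(max_workers=max_workers) as executor:
        # submit every (scenario, ppm) pair as an independent task
        futures = {pair: executor.submit(worker, *pair) for pair in pairs}
        # result() blocks until the task finishes and re-raises any exception
        return {pair: future.result() for pair, future in futures.items()}
```

The returned dictionary maps each `(scenario, ppm)` pair to whatever `worker` returned for it.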

## Visualizing the Results

We now present a simple script that generates a *matplotlib* box plot showing the quality loss of each scenario under the different PPMs.

In [None]:
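import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

# privacy parameter used when the results were saved
epsilon = 0.016

# scenarios labels, in the same order as the scenarios list
scenarios_labels = [r'$\uparrow v_u \uparrow v_r$',
                    r'$\uparrow v_u \downarrow v_r$',
                    r'$\downarrow v_u \uparrow v_r$',
                    r'$\downarrow v_u \downarrow v_r$']

# ppms labels for the legend, in the same order as the ppms list
ppms_labels = ['VA-GI', 'Planar Laplace', 'Adaptive Geo-Ind', 'Clustering Geo-Ind']

# ppms colors
colors = ['blue', 'orange', 'green', 'red']

# initialize four subplots: one for each scenario
plt.subplots(1, 4, sharey=True)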

# iterate over every scenario
for index, (scenario, scenario_label) in enumerate(zip(scenarios, scenarios_labels)):
    plt.subplot(1, 4, index + 1)

    # load quality loss from every PPM
    ql_data = [None] * 4
    for ppm_index, ppm in enumerate(ppms):
        ql_data[ppm_index] = np.load('{}{}/cabspotting/ql_{}_e{}.npy'.format(constants.output_folder, ppm, scenario, int(epsilon*1000)))

    # generate boxplots
    boxplots = [None] * 4
    for ppm_index, ppm in enumerate(ppms):
        boxplots[ppm_index] = plt.boxplot(ql_data[ppm_index], positions=[-1 + ppm_index*2/3], widths=0.20, notch=True, patch_artist=True, showfliers=False)

    # set different colors to different PPMs
    for plot, color in zip(boxplots, colors):
        for patch in plot['boxes']:
            patch.set_facecolor(color)

    # set scenario label as well as additional graph settings
    plt.xticks([0], [scenario_label])
    plt.xlim([-1-2/3, 1+2/3])
    plt.ylim([0, 800])
    plt.grid(True)

# add y label
plt.subplot(1, 4, 1)
plt.ylabel('Quality Loss (m)')

# add legend
handles = []
for color, ppm in zip(colors, ppms_labels):
    handles.append(mpatches.Patch(color=color, label=ppm))
plt.legend(handles=handles, loc='upper right')

# stick the four subplots together
plt.subplots_adjust(wspace=.0)

plt.show()
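
The box plots can be complemented with the numbers they encode. The sketch below computes the quartiles of a sequence of quality loss values using only the standard library; the `summarize` helper is hypothetical, and the `inclusive` method is chosen on the assumption that it corresponds to the linear interpolation used by the box plot.

```python
import statistics


def summarize(values):
    """Return the quartiles of a sequence of quality loss values (in meters)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return {'q1': q1, 'median': median, 'q3': q3}
```

Calling `summarize` on the values loaded for each scenario and PPM pair gives a table that can be compared directly against the plot.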