<a href="https://colab.research.google.com/github/idhamari/ia-data-privacy-tutorials/blob/main/IA_Ano01_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guide to Data Privacy

**Models, Technologies, Solutions**

Vicenç Torra

Department of Computing Science, Umeå University
Umeå, Sweden

**About:**

I will try to create a practical tutorial on data privacy, using the above book as the main reference. I will add Python examples to illustrate each concept and cover background material where needed. These notebooks could serve as a course on data privacy or anonymization.

## 1 Introduction

**Shared data** plays a crucial role in driving collaboration and innovation across various fields. For instance, in the healthcare sector, the sharing of patient data among different hospitals and research institutions enables faster and more accurate diagnoses, leading to improved treatment outcomes. By pooling anonymized data from diverse sources, medical professionals can identify patterns and trends, develop new therapies, and make informed decisions for individual patients and public health initiatives.

**Public data**, on the other hand, plays a vital role in fostering transparency and empowering communities. For example, government agencies often release public data sets containing information about demographics, education, crime rates, and more. This data enables researchers, policymakers, and citizens to analyze societal challenges, identify areas for improvement, and make data-driven decisions. Public data is instrumental in promoting accountability, supporting evidence-based policies, and encouraging public participation in the democratic process.

**Risks:**

While sharing data brings numerous benefits, it also entails risks that must be carefully managed. One significant concern is the potential compromise of personal privacy and security. For instance, in recent years there have been several high-profile data breaches in which personal information, such as social security numbers and financial details, was exposed to unauthorized individuals. These breaches not only harm individuals but also erode public trust in data sharing practices.

Another risk is the misuse of data for unethical purposes. For example, social media platforms that collect user data for targeted advertising have faced criticism for manipulating user behavior and exploiting personal information without explicit consent. It is crucial to establish robust safeguards, such as strong data protection methods, laws, and ethical guidelines, to mitigate these risks and protect individuals' rights and interests.

**Data anonymization** algorithms are an essential tool for mitigating the risks associated with data sharing. These algorithms transform personal data into a form that cannot be directly linked to an individual, preserving privacy while still allowing analysis. With effective anonymization techniques, organizations can minimize the risk of re-identification and unauthorized access to sensitive information, so that shared data can be used for research, analysis, and innovation while individuals' privacy rights are protected. It is vital to strike a balance between data utility and privacy through robust anonymization methods.

Privacy Enhancing Technologies (PET), Privacy-Preserving Data Mining (PPDM), and Statistical Disclosure Control (SDC) are related fields with
a shared interest in ensuring data privacy. Their goal is to avoid disclosing sensitive or proprietary information to third parties.

**Privacy Enhancing Technologies (PET):** refer to a set of tools, methods, and techniques designed to protect individuals' privacy and enhance the security of their personal data. These technologies aim to minimize the risks associated with the collection, storage, processing, and transmission of sensitive information. PETs can be applied to various areas, such as communication networks, data storage, data analysis, and online browsing.

**Privacy-Preserving Data Mining (PPDM):** is a specific application of PETs that focuses on protecting privacy during the process of data mining. Data mining involves extracting valuable insights and patterns from large datasets. However, this process often requires access to sensitive or **personally identifiable information (PII)**, which can pose privacy risks.

PPDM techniques employ various privacy-preserving algorithms and methods to ensure that data mining can be performed without revealing sensitive information about individuals. These techniques include:

1. **Data anonymization:** It involves modifying or removing identifying information from the dataset before analysis. This can be done by replacing identifying attributes with generalizations or using techniques like k-anonymity, l-diversity, or **differential privacy**.

2. **Secure multiparty computation:** It enables multiple parties to jointly compute the results of data mining operations without revealing their individual data. Each party encrypts their data, and computations are performed on the encrypted data, ensuring privacy.

3. **Homomorphic encryption:** It allows computations to be performed directly on encrypted data without decrypting it, ensuring that sensitive information remains protected during the analysis.

4. **Secure data outsourcing:** This approach allows data owners to outsource their datasets to third-party service providers for mining while ensuring the privacy of the data. Encryption and other privacy-preserving techniques are employed to protect the data throughout the outsourcing process.
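
As a rough illustration of the secure multiparty computation idea in the list above, the sketch below lets several parties compute a joint total without any party revealing its own value. It uses additive secret sharing rather than encryption, which is one common way to realize the idea. The names `MODULUS`, `make_shares`, and `secure_sum`, and the input counts, are assumptions made up for this example.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; an arbitrary choice for this sketch


def make_shares(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate all parties: each one shares its value, only partial sums are combined."""
    n = len(private_values)
    # all_shares[i][j] is the share that party i sends to party j
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS


hospital_counts = [120, 75, 310]  # made-up private inputs, one per party
total = secure_sum(hospital_counts)
print("joint total:", total)
```

Each individual share is a uniformly random number, so a single share says nothing about the value it came from. Only the combination of all partial sums reveals the total.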

**Statistical Disclosure Control (SDC):** is another set of techniques used to protect privacy by limiting the disclosure of sensitive information when releasing statistical data or research results. Anonymization is considered part of SDC as well. SDC methods aim to find a balance between the utility of the data and the privacy of individuals. These techniques include:

1. **Masking:** Sensitive data points are modified by replacing them with similar values to prevent identification while maintaining statistical properties.

2. **Aggregation:** Instead of releasing individual-level data, aggregate statistics are provided, such as averages, totals, or proportions. This reduces the risk of re-identifying individuals while still offering useful information.

3. **Data swapping:** This technique swaps values between records to break the link between personal identifiers and sensitive information.

4. **Sampling and data reduction:** Instead of releasing the entire dataset, a representative sample is used to generate statistical estimates. This reduces the risk of exposing individual-level details.

5. **Data perturbation:** Sensitive data points are modified by adding noise to prevent identification while maintaining statistical properties. Unlike anonymization methods that remove or generalize identifying information, data perturbation introduces controlled noise so that the dataset stays useful for statistical analysis. If the added noise satisfies the requirements of **differential privacy**, the procedure can be considered a differential privacy mechanism, which is a specific type of anonymization technique. Differential privacy mechanisms ensure that the released data provides privacy guarantees even in the presence of background knowledge and potential attacks.


The primary goal of SDC techniques is to ensure that statistical information can be shared for research, policy-making, and decision support without compromising the privacy and confidentiality of individuals involved in the data.
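
Two of the SDC techniques listed above, data swapping and data perturbation, can be sketched in a few lines. This is a minimal illustration, not a production method: the helper names `swap_column` and `add_noise`, the noise scale, and the fixed seed are assumptions for this example, and the Gaussian noise used here does not by itself give a differential privacy guarantee.

```python
import random
import statistics


def swap_column(values, rng):
    """Data swapping: permute one attribute's values across records."""
    swapped = list(values)
    rng.shuffle(swapped)
    return swapped


def add_noise(values, scale, rng):
    """Data perturbation: add zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0, scale) for v in values]


incomes = [1000, 2000, 3000, 2000, 1000, 6000, 2000, 10000, 2000, 4000]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

swapped = swap_column(incomes, rng)
noisy = add_noise(incomes, scale=200, rng=rng)

# Swapping keeps the same multiset of values, so the mean is unchanged
print("original mean:", statistics.mean(incomes))
print("swapped mean :", statistics.mean(swapped))
# Noise alters every value, so the mean is only approximately preserved
print("noisy mean   :", round(statistics.mean(noisy), 2))
```

Swapping preserves the distribution of the attribute exactly but breaks its link to the other attributes of each record. Noise addition trades some accuracy for protection, and the scale controls that trade-off.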

## Data Privacy vs Data Security:

* **Data Privacy:** methods to prevent disclosure when access to the data is authorized, e.g. the data is public. The attacker has access to the data.

* **Data Security:** methods to prevent unauthorized access, e.g. the data is private. The attacker has no access to the data. An example is **cryptography**.


### Data released with privacy issues:




#### Examples:


##### **Public data:**

**Example 1:**

A hospital plans to study children's average hospital stay, considering different ages, diagnoses, and changes in length over time. They will analyze data from previous admissions (2010-2019) with limited attributes: year of birth, town, year of admission, illness, and length of stay. However, because some towns are small and sparsely populated, the table may lead to disclosure and compromise privacy. Simply restricting access, or using age at admission instead of birth year and admission year, still risks identification.

**Example 2:**

A university aims to study the impact of studies and commuting distance on student stress. They provide a researcher with records in the format (town, degree, stress leave?). Although seemingly non-personal, records from a town with only one person studying a specific degree can reveal the individual's stress leave status, breaching privacy.
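
The risk in this example can be checked mechanically by counting how many records share each (town, degree) combination. The sketch below is a minimal illustration on made-up records. The town names, degrees, and values are hypothetical.

```python
from collections import Counter

# Hypothetical records in the format (town, degree, stress leave?)
records = [
    ("Town A", "Computer Science", False),
    ("Town A", "Computer Science", True),
    ("Town A", "Physics", False),
    ("Town A", "Physics", False),
    ("Town B", "Computer Science", True),  # the only such student
]

# Count how many records share each (town, degree) combination
group_sizes = Counter((town, degree) for town, degree, _ in records)

# A group of size 1 exposes that student's stress-leave status
unique_groups = [group for group, size in group_sizes.items() if size == 1]
print("combinations held by a single student:", unique_groups)
```

Anyone who knows that a particular student from Town B studies Computer Science can read that student's stress-leave status directly from the released table.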

**Example 3:**

Location data from mobile phones can often identify individuals with just two locations, such as home and workplace. Most people have distinct patterns in their locations, making them easily identifiable.

**Conclusion:**

These examples illustrate two situations that can lead to disclosure in public data.

1. A few attributes with unique values can easily identify an individual, as seen in the first two examples.
2. Rich information from online services or other sources can still enable identification even when no single attribute is unique, as seen in the third example.

##### **Function of the data is public:**

Summary functions of data, such as computing the mean of an attribute, may seem to provide privacy. However, this is not always the case.

**Example 1:**

If an attacker is aware of the population mean, such as the average height of people in a certain country being 1.85m, and they discover that the mean height calculated from a dataset is higher, for instance, 1.88m, they can infer the presence of a tall person within the dataset.

In this example, we have a population mean height of 1.85m. The dataset contains heights of individuals, and we calculate the mean height from the dataset using np.mean(). If the dataset mean height is higher than the population mean, we conclude that there may be a tall person within the dataset.

In [26]:
import numpy as np
import random

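# Population mean height (metres)
population_mean = 1.85
# Height of a tall person (outlier) that could be added to the dataset
outlier_height = 1.99

# Dataset heights: ten random values between 1.84 and 1.86
dataset_heights = [round(random.randint(184, 186) / 100, 2) for _ in range(10)]

print("dataset_heights: ", dataset_heights)
# Uncomment to add the outlier and push the mean above the population mean
#dataset_heights.append(outlier_height)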

# Calculating the mean height from the dataset
dataset_mean = np.mean(dataset_heights)

print("population_mean : ", population_mean)
print("dataset_mean    : ", dataset_mean)

# Comparing the dataset mean with the population mean
if dataset_mean > population_mean:
    print("There may be a tall person in the dataset.")
else:
    print("No tall person is inferred from the dataset.")


dataset_heights:  [1.85, 1.84, 1.86, 1.86, 1.86, 1.84, 1.85, 1.84, 1.84, 1.84]
population_mean :  1.85
dataset_mean    :  1.848
No tall person is inferred from the dataset.
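
The run above leaves the outlier out, so no tall person is inferred. As a complementary sketch, the snippet below reuses the ten heights printed above together with the 1.99 m value defined in the cell, to show how a single tall person moves the mean above the population mean. The variable names are chosen for this example only.

```python
import statistics

population_mean = 1.85
# Heights printed in the run above, reused so the result is repeatable
dataset_heights = [1.85, 1.84, 1.86, 1.86, 1.86, 1.84, 1.85, 1.84, 1.84, 1.84]
tall_person = 1.99  # the outlier value defined in the cell above

mean_without = statistics.mean(dataset_heights)
mean_with = statistics.mean(dataset_heights + [tall_person])

print("mean without the outlier:", round(mean_without, 3))
print("mean with the outlier   :", round(mean_with, 3))
print("above population mean?  :", mean_with > population_mean)
```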


**Example 2:**

Consider computing the mean income of individuals admitted to a psychiatric unit in a city. The incomes are as follows: 1000, 2000, 3000, 2000, 1000, 6000, 2000, 10000, 2000, 4000. The mean income is 3300 Euros, which seems harmless.

However, if we add the income of a rich landlord, e.g. 100000, the mean income becomes 12090.91 Euros. A mean this high implies that the rich landlord's income is included, which allows an inference about this person's attendance at the psychiatric unit.

Even if the mean were 3300 Euros, an inference about the landlord could still be made: it would suggest that the landlord did not attend the unit.

This example highlights that computation, even in the form of a summary, can lead to disclosure. Complex structures built from data, such as machine learning models, can also expose traces of the underlying data, enabling disclosure through attacks like membership inference.

A linear regression model illustrates the same issue: adding a record with an outlier value shifts the regression line significantly.

In [15]:
data = [
    ["Age", "Income"],
     [ 24, 1000],
     [ 30, 2000],
     [ 40, 3000],
     [ 33, 2000],
     [ 26, 1000],
     [ 40, 6000],
     [ 50, 2000],
     [ 55, 10000],
     [ 37, 2000],
     [ 42, 4000],
    ]
dataX = [
    ["Age", "Income"],
     [ 24, 1000],
     [ 30, 2000],
     [ 40, 3000],
     [ 33, 2000],
     [ 26, 1000],
     [ 40, 6000],
     [ 50, 2000],
     [ 55, 10000],
     [ 37, 2000],
     [ 42, 4000],
     [ 65, 100000,]
    ]

# Income columns, skipping the header row
y  = [row[1] for row in data[1:]]
yX = [row[1] for row in dataX[1:]]

n  = len(y)
nX = len(yX)

# Mean income without the landlord's record
mean1 = round(sum(y)/n, 2)
print("meanIncome1 : ", mean1)

# Mean income with the landlord's record
meanX = round(sum(yX)/nX, 2)
print("meanIncome2 : ", meanX)

# To check whether a person is in the dataset, remove their known income
# from the published mean and compare with the mean computed without them
x = 100000

meanNoX = round((nX*meanX - x)/(nX - 1), 2)
# We should get the same value as meanIncome1
print("mean without ", x, "=", meanNoX)

# A match means x is present in dataX but not in data

meanIncome1 :  3300.0
meanIncome2 :  12090.91
mean without  100000 = 3300.0
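
The remark about linear regression can be made concrete with the same age and income records. The sketch below fits an ordinary least squares line with and without the landlord's record (65, 100000). The helper `fit_line` is written for this example and uses the closed-form solution rather than a library call.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


ages = [24, 30, 40, 33, 26, 40, 50, 55, 37, 42]
incomes = [1000, 2000, 3000, 2000, 1000, 6000, 2000, 10000, 2000, 4000]

slope, intercept = fit_line(ages, incomes)
# Same records plus the landlord's record (65, 100000)
slope_x, intercept_x = fit_line(ages + [65], incomes + [100000])

print("slope without the outlier:", round(slope, 1))
print("slope with the outlier   :", round(slope_x, 1))
```

The slope grows several times over once the outlier is included, so a published model can reveal that such a record was part of the training data.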


#### Real-world examples:

##### **AOL case:**

In August/September 2006, queries submitted to AOL by 650,000
users within a period of three months were released for research purposes. Only
queries from AOL registered users were released. Numerical identifiers were published
instead of real names and login information. Nevertheless, the terms in
the queries allowed some sets of queries to be [linked to people](https://www.nytimes.com/2006/08/09/technology/09aol.html).

##### **Netflix case:**

A database with the film ratings of about 500,000
Netflix subscribers was published in 2006 to foster research in recommender
systems. About 100 million film ratings were included in the database. [Narayanan
and Shmatikov](https://ieeexplore.ieee.org/document/4531148) used the Internet Movie Database as background knowledge
to link some of the records in the database to known users.
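
The general idea of linking released records to background knowledge can be shown on toy data. The sketch below is only an illustration of record linkage by overlap counting. It is not the scoring method used by Narayanan and Shmatikov, and all identifiers, film titles, ratings, and the name in it are made up.

```python
# Released "anonymized" ratings: pseudonymous id -> {film: rating}
released = {
    101: {"Film A": 5, "Film B": 2, "Film C": 4},
    102: {"Film A": 3, "Film D": 5},
    103: {"Film B": 2, "Film C": 4, "Film E": 1},
}

# Background knowledge: a few ratings a known person posted publicly
public_profile = {"name": "Alice", "ratings": {"Film A": 5, "Film C": 4}}


def overlap_score(released_ratings, known_ratings):
    """Count the known (film, rating) pairs that also appear in a released record."""
    return sum(1 for film, rating in known_ratings.items()
               if released_ratings.get(film) == rating)


scores = {uid: overlap_score(ratings, public_profile["ratings"])
          for uid, ratings in released.items()}
best_id = max(scores, key=scores.get)
print(public_profile["name"], "most likely corresponds to record", best_id)
```

Replacing names with numerical identifiers does not help here, because the ratings themselves act as a fingerprint once a few of them are known from another source.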