<a href="https://colab.research.google.com/github/idhamari/differential_privacy_tutorials/blob/main/ia_DP_Attacks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About:


In this tutorial, I will try to explain Differential Privacy (DP) in a practical way, with examples using simple Python scripts. I will try to add more details, examples, and scripts whenever I find the time, so please consider this a work in progress.

Resources:



## Introduction to Data Privacy:

**Data privacy** is the concept that refers to the protection and management of personal or sensitive information to ensure that it remains confidential, secure, and used appropriately. With the increasing digitization and collection of vast amounts of data, data privacy has become a critical concern for individuals, organizations, and governments. It encompasses the ethical and legal considerations surrounding the collection, storage, sharing, and usage of personal data.

In today's interconnected world, where data is collected from various sources such as social media platforms, online transactions, healthcare records, and IoT devices, maintaining data privacy is challenging. Protecting personal information is crucial as it includes details like names, addresses, financial records, health information, and other sensitive attributes that can be misused or exploited if not adequately safeguarded.

**Real-Life Examples: The Limitations of Removing Names and Fields**

Removing names and certain identifiable fields from datasets has been a common practice to protect privacy. However, this approach alone is often insufficient to guarantee privacy, because several kinds of disclosure remain possible:

**Membership disclosure** means that data linkage allows an attacker to determine whether data about an individual is contained in a dataset. This does not directly disclose any information from the dataset itself, but it may allow an attacker to infer meta-information. Membership disclosure therefore concerns implicit sensitive attributes (attributes of an individual that are not contained in the dataset), whereas the other disclosure models concern explicit sensitive attributes.

**Attribute disclosure** may be achieved even without linking an individual to a specific item in a dataset. It concerns sensitive attributes, that is, attributes in the dataset with which individuals are not willing to be linked. As such, they might be of interest to an attacker and, if disclosed, could cause harm to data subjects. For example, linking an individual to a set of data entries is enough to infer information if all entries in that set share the same sensitive attribute value.

**Identity disclosure (or re-identification)** means that an individual can be linked to a specific data entry. This is a serious type of attack, as it has legal consequences for data owners according to many laws and regulations worldwide. From the definition it also follows that an attacker can learn all sensitive information contained in the data entry about the individual.

**Statistical Disclosure:** Aggregated or anonymized datasets can still contain statistical patterns that enable attackers to deduce sensitive information about individuals. Researchers have shown that by applying advanced statistical analysis techniques, individuals' private information, such as medical conditions or financial status, can be inferred from seemingly anonymized data.

**Contextual Information:** Even when explicit identifiers are removed, contextual information present in the dataset can aid in re-identification. For example, a combination of location, occupation, and age may uniquely identify an individual within a specific region or profession, even if their name is not present.
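
The point about contextual information can be checked on any table by counting how many records share each combination of quasi-identifiers. The following is a minimal sketch in plain Python; the records, the column names, and the choice of `zipcode`, `occupation`, and `age` as quasi-identifiers are all made up for illustration.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zipcode": "10115", "occupation": "nurse",   "age": 34, "diagnosis": "asthma"},
    {"zipcode": "10115", "occupation": "nurse",   "age": 34, "diagnosis": "diabetes"},
    {"zipcode": "10115", "occupation": "teacher", "age": 51, "diagnosis": "flu"},
    {"zipcode": "10117", "occupation": "pilot",   "age": 45, "diagnosis": "cancer"},
    {"zipcode": "10117", "occupation": "teacher", "age": 29, "diagnosis": "flu"},
]

# Assumption: these are the attributes an attacker could know from other sources.
QUASI_IDENTIFIERS = ("zipcode", "occupation", "age")

def quasi_key(record):
    """Combination of quasi-identifier values for one record."""
    return tuple(record[name] for name in QUASI_IDENTIFIERS)

# How many records share each combination?
group_sizes = Counter(quasi_key(r) for r in records)

# A record whose combination occurs only once can be singled out without a name.
unique_records = [r for r in records if group_sizes[quasi_key(r)] == 1]
k = min(group_sizes.values())

print(f"{len(unique_records)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
print(f"smallest group size (k): {k}")
```

A smallest group size of 1 means at least one person can be singled out by these attributes alone, even though no name appears in the table.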

These examples illustrate that merely removing names and identifiable fields from datasets is not sufficient to ensure data privacy. It is essential to adopt more comprehensive approaches, such as data anonymization techniques, encryption, differential privacy, and rigorous access controls, to safeguard individuals' privacy and protect against privacy attacks.

Data privacy regulations, such as the **General Data Protection Regulation (GDPR)** in Europe and the **California Consumer Privacy Act (CCPA)** in the United States, emphasize the need for organizations to implement robust privacy measures and obtain informed consent from individuals when handling their personal data. These regulations aim to strike a balance between data utility and privacy protection, encouraging organizations to implement privacy-enhancing practices to mitigate the risks associated with data privacy breaches.




## Popular Types of Privacy Attacks:


* **Membership Inference Attack (MIA):** This attack aims to determine whether a particular individual's record is present in a dataset without revealing the specific information of the individual. This technique poses a significant threat to privacy, especially in scenarios where sensitive information is stored or shared.

* **Attribute Inference Attack:** This attack aims to infer sensitive attributes or characteristics of individuals based on available non-sensitive attributes. For example, inferring a person's salary based on their occupation, education level, and geographic location.

* **Reconstruction Attack:** Reconstruction attacks attempt to reconstruct the original data or sensitive information from aggregated or anonymized data. These attacks exploit vulnerabilities in data anonymization techniques and re-identify individuals based on unique combinations of attributes.

* **Linkage Attack:** Linkage attacks involve linking multiple datasets or combining publicly available information to identify individuals in a dataset that is presumed to be anonymous. By matching common identifiers or attributes, an attacker can re-identify individuals and uncover sensitive information.

* **Differential Privacy Attack:** Differential privacy attacks focus on bypassing or exploiting the privacy guarantees provided by differential privacy mechanisms. By analyzing multiple queries or responses, an attacker can gain insights into the sensitive data or individuals.

It's important to note that these attacks highlight the need for robust privacy protection mechanisms and practices. Data anonymization, differential privacy techniques, secure data handling, and encryption play a crucial role in safeguarding individuals' privacy and mitigating the risks associated with these attacks.

### Membership Inference Attack (MIA):

These attacks are a type of privacy attack that aims to determine whether a particular individual's record is present in a dataset without revealing the specific information of the individual. This technique poses a significant threat to privacy, especially in scenarios where sensitive information is stored or shared. In this tutorial, we will explore the concept of membership attacks and demonstrate a Python example to understand how they can be implemented.

**Understanding Membership Attacks:**

A membership attack, also known as a membership inference attack, attempts to determine if a specific record or individual is present in a dataset by exploiting the statistical properties of the dataset. These attacks can be launched against machine learning models that have been trained on sensitive data or contain personal information. The primary goal is to extract sensitive information without directly accessing it, thereby compromising the privacy of individuals.
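
The core intuition can be seen in isolation on a toy model. The sketch below is not the tutorial's dataset or model: it assumes random records with random labels and a 1-nearest-neighbour "model" written in plain Python, chosen only because it memorizes its training data. The attack rule (guess "member" when the model reproduces the record's label) is one simple heuristic among several.

```python
import random

random.seed(0)

def random_record(n_features=5):
    """A hypothetical record: random features and a random binary label."""
    features = [random.random() for _ in range(n_features)]
    label = random.choice([0, 1])
    return features, label

def predict(train_set, features):
    """1-nearest-neighbour 'model': it memorizes its training set, i.e. it overfits."""
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], features))
    return min(train_set, key=squared_distance)[1]

def guess_is_member(train_set, features, label):
    """Attack rule: guess 'member' when the model reproduces the record's label."""
    return predict(train_set, features) == label

members = [random_record() for _ in range(200)]      # used to "train" the model
non_members = [random_record() for _ in range(200)]  # never seen by the model

correct = 0
for features, label in members:
    correct += guess_is_member(members, features, label)      # truth: member
for features, label in non_members:
    correct += not guess_is_member(members, features, label)  # truth: non-member

attack_accuracy = correct / (len(members) + len(non_members))
print(f"attack accuracy: {attack_accuracy:.2f} (0.50 would be random guessing)")
```

Because the model is always right on records it has memorized and only right by chance on unseen ones, the gap between the two is what the attacker exploits. A model that generalizes well shows a smaller gap.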

**Example - Membership Attack on a Dataset:**

Assume we have a dataset.

First, we train a machine learning model on the dataset.   
Then we ask the model to predict the label of a candidate record. If the prediction matches the record's true label, we guess that the record is in the dataset; if not, we guess that it is not.     



In [19]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import urllib.request
import random


def getData():
    # Download the Adult dataset from UCI Machine Learning Repository
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    filename = "adult.csv"
    urllib.request.urlretrieve(url, filename)

    # Load the Adult dataset (fields are separated by ", ", so strip the leading space)
    data = pd.read_csv(filename, header=None, skipinitialspace=True)

    # Assign column names to the dataset
    column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                    'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'label']
    data.columns = column_names

    # Perform one-hot encoding for categorical features
    categorical_features = ['workclass', 'education', 'marital_status', 'occupation',
                            'relationship', 'race', 'sex', 'native_country']
    data_encoded = pd.get_dummies(data, columns=categorical_features)
    return data_encoded

def getTrainedModel(data_encoded, test_size=0.2,random_state=42, modelID="RandomForestClassifier"):

    # Split the dataset into training and testing sets
    train_data, test_data = train_test_split(data_encoded, test_size=test_size, random_state=random_state)

    # Separate features and labels
    train_features = train_data.drop('label', axis=1)
    train_labels = train_data['label']
    test_features = test_data.drop('label', axis=1)
    test_labels = test_data['label']

    # Train a machine learning model on the training set
    if modelID=="RandomForestClassifier":
       model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    else:
       raise ValueError(f"Unsupported modelID: {modelID}")

    model.fit(train_features, train_labels)

    # Evaluate the model on the testing set
    test_predictions = model.predict(test_features)
    test_accuracy = accuracy_score(test_labels, test_predictions)

    return model, test_features, test_labels, test_accuracy

def getSampleAttackData(target_data, target_labels, isMember=1):
    if isMember:
        # Draw one record together with the label that belongs to that same record
        attack_data = target_data.sample(n=1)
        attack_labels = target_labels.loc[attack_data.index]
    else:
        # Generate a random record with the same columns and a random label
        attack_data = pd.DataFrame([[random.random() for _ in target_data.columns]],
                                   columns=list(target_data.columns))
        # Use the label values that actually occur in the data
        attack_labels = pd.Series([random.choice(list(target_labels.unique()))])

    return attack_data, attack_labels

# Perform membership inference attack
def checkMembership(model, attack_data, attack_labels):

    # Predict the label of the attack record
    attack_prediction = model.predict(attack_data)[0]
    attack_label = list(attack_labels)[0]

    # Guess "member" when the model reproduces the record's label
    return attack_prediction == attack_label

print("=============  Membership Attack Example ================")
random_state = 42

# Get sample data, we are using USA adult dataset (32561 rows and 109 attributes)
data_encoded = getData()


# get trained model
model, test_features, test_labels, test_accuracy = getTrainedModel(data_encoded, test_size=0.01,random_state=42, modelID="RandomForestClassifier")
print(f'Test Accuracy: {test_accuracy}')

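# The model was trained on every row outside the held-out test split,
# so member records are drawn from those rows
member_features = data_encoded.drop(index=test_features.index).drop('label', axis=1)
member_labels   = data_encoded.loc[member_features.index, 'label']

# run multiple times
results = []
numIterations = 100
for i in range(numIterations):
    # create a record to attack (either a member record or a randomly generated non-member record)
    isMember     = random.randint(0,1)
    attack_data, attack_labels =  getSampleAttackData(member_features, member_labels, isMember=isMember)
    # Perform the membership inference attack on the trained model using the sample attack data
    results.append( 1 if checkMembership(model, attack_data, attack_labels)==isMember else 0)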

acc = results.count(1)/numIterations
print(f'Membership Inference Attack Result: {acc}')

Test Accuracy: 0.8895705521472392
Membership Inference Attack Result: 0.8


**Another example: using aggregate statistics**

In this example, we show that even publishing only aggregate data, e.g. a sum or a mean, can lead to a successful membership inference.

This is a simple demonstration and doesn't provide an accurate or reliable measure of whether a person is in a dataset based on their income. It is a heuristic that relies on the premise that if adding a person's income brings the average closer to their income, then they are likely in the dataset. This premise has several issues:

- The average income is only weakly related to whether any individual person's income is in the dataset. A single income is unlikely to change the average significantly unless the dataset is very small or the income is an extreme outlier.

- The method doesn't account for the distribution of incomes. For example, if incomes are widely dispersed or have many outliers, then the average is less meaningful and the method will be less accurate.

- The method doesn't account for the possibility of similar or identical incomes in the dataset. Two people can have the same income, but only one of them might be in the dataset.
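
A short calculation suggests why this heuristic carries so little signal. The symbols below are introduced here for illustration and are not part of the original code: $n$ is the published count, $\mu$ the published mean, and $t$ the target's value. Adding $t$ to the dataset gives the new mean

$$
\mu' = \frac{n\mu + t}{n + 1} = \mu + \frac{t - \mu}{n + 1},
\qquad\text{so}\qquad
t - \mu' = \frac{n}{n + 1}\,(t - \mu).
$$

Since $\frac{n}{n+1} < 1$, the distance $|t - \mu'|$ is smaller than $|t - \mu|$ for every $t \neq \mu$, whether or not the target is actually in the dataset. Under these assumptions, the check "the new mean is closer to the target" holds for almost any candidate, so a `True` result by itself says little about membership.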

In [48]:
import pandas as pd
import numpy as np

# Create a random sample dataset
np.random.seed(0)
numberOfRecords = 100
data = {'id': range(1, numberOfRecords+1),
        'age': np.random.randint(20, 70, size=numberOfRecords),
        'zipcode': np.random.randint(10000, 99999, size=numberOfRecords),
        'income': np.random.randint(500, 10000, size=numberOfRecords)}

# Converting dictionary to list of lists
data_list = [['id', 'age', 'zipcode', 'income']] + [list(item) for item in list(zip(data['id'], data['age'], data['zipcode'], data['income']))]

for x in data_list[:5]:
    print(x)

df = pd.DataFrame(data)

# Compute the aggregate data for income
count       = df['income'].count()
sum_income  = df['income'].sum()
mean_income = df['income'].mean()

# Compute the aggregate data for age
sum_age = df['age'].sum()
mean_age = df['age'].mean()

print("count       : ", count)
print("sum_income  : ", sum_income)
print("mean_income : ", mean_income)
print("sum_age     : ", sum_age)
print("mean_age    : ", mean_age )

# Function to infer membership
def infer_membership(target_income, target_age, count, sum_income, mean_income, sum_age, mean_age):
    new_count = count + 1

    # Income related
    new_sum_income = sum_income + target_income
    new_mean_income = new_sum_income / new_count

    # Age related
    new_sum_age = sum_age + target_age
    new_mean_age = new_sum_age / new_count
    # If adding the target income and age both make the new means closer to the targets,
    # infer that the target might be in the dataset
    if (abs(target_income - new_mean_income) < abs(target_income - mean_income)) and (abs(target_age - new_mean_age) < abs(target_age - mean_age)):
        return True
    else:
        return False


# Test the function

person_id = 10
personAge    = 35; personIncome = 50000
if person_id >= 0:
   personAge    = data_list[person_id][1]; personIncome = data_list[person_id][3]


print(" we are looking if a person with age= ",personAge," and income= ",personIncome," is in the dataset!")
print(infer_membership(personIncome, personAge, count, sum_income, mean_income, sum_age, mean_age))  # Example output: True/False

matching_rows = df[(df['income'] == personIncome) & (df['age'] == personAge)]
matching_rows_approx = 0
if matching_rows.empty:
    df['income_diff'] = abs(df['income'] - personIncome)
    df['age_diff'] = abs(df['age'] - personAge)
    # Calculate a score based on the differences (you could also weight these differences if you want)
    df['score'] = df['income_diff'] + df['age_diff']
    # Return the row with the minimum score
    matching_rows_approx =  list(df.loc[df['score'].idxmin()])[:-3]

if not matching_rows_approx:
   print("----- matching persons -----")
   print(matching_rows)
else:
   print("----- matching persons (approx) -----")
   print(matching_rows_approx)


['id', 'age', 'zipcode', 'income']
[1, 64, 90041, 6346]
[2, 67, 81331, 4281]
[3, 20, 60624, 3275]
[4, 23, 99183, 3103]
count       :  100
sum_income  :  533067
mean_income :  5330.67
sum_age     :  4250
mean_age    :  42.5
 we are looking if a person with age=  56  and income=  4551  is in the dataset!
True
----- matching persons -----
   id  age  zipcode  income
9  10   56    58682    4551
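
A more direct way for aggregates to leak information is differencing. The sketch below assumes a scenario that is not part of the example above: the same count and sum are published twice, once before and once after a single person joins the dataset. The first release reuses the count and sum printed above; the second release and the function name `difference_releases` are made up for illustration.

```python
# Hypothetical published aggregates for the same dataset at two points in time.
# Only counts and sums are released, never individual rows.
release_before = {"count": 100, "sum_income": 533067}
release_after = {"count": 101, "sum_income": 540567}  # assumed second release

def difference_releases(before, after):
    """Recover what the newly added records contributed to the aggregates."""
    added_count = after["count"] - before["count"]
    added_sum = after["sum_income"] - before["sum_income"]
    return added_count, added_sum

added_count, added_sum = difference_releases(release_before, release_after)

if added_count == 1:
    # Exactly one record joined, so the difference of the sums is that person's income.
    print(f"the new member's income is exactly {added_sum}")
else:
    print(f"{added_count} records were added; their combined income is {added_sum}")
```

Unlike the heuristic above, this subtraction is exact, which is one reason differential privacy adds noise to every released aggregate.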


    
**Conclusion:**
Membership attacks are a serious privacy concern, and understanding their concepts and implications is crucial for data privacy protection. By exploiting the statistical properties of a dataset, attackers can determine if specific records or individuals are part of the dataset. Organizations and data practitioners should be aware of the risks associated with membership attacks and employ appropriate privacy-preserving techniques to safeguard sensitive information.

Remember that respecting privacy and following ethical guidelines should be a priority when dealing with sensitive data.

### Attribute Inference Attack:

This attack aims to infer sensitive attributes or characteristics of individuals based on available non-sensitive attributes. For example, inferring a person's salary based on their occupation, education level, and geographic location.


Let's use an example to illustrate. Suppose we have a dataset that has been anonymized by removing any directly identifying information (like names, social security numbers, etc.). The remaining dataset contains columns like 'age', 'income', 'education level', and a sensitive attribute 'has_cancer'.

Now, let's say an attacker has some partial information about a person and wants to find out more about them from the dataset, e.g. whether the person has cancer. They might know the person's age range, income range, and education level. Using this information, they could match these known characteristics with the corresponding rows in the dataset, potentially allowing them to find the target's row and learn more about them.

Here is a basic Python script demonstrating this attribute inference attack:

In [57]:
import pandas as pd

# Anonymized dataset
data = {
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 60000, 80000, 70000],
    'education_level': ['Bachelor', 'PhD', 'Master', 'Bachelor', 'Master'],
    'number_of_children': [0, 2, 1, 3, 2],
    'has_cancer': [0, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Partial information about a person
#known_info:
ageRange         = [31,35]
incomeRange      = [40000,80000]
educationDegree  ='Master'

# Perform the attribute inference attack
attack = df[ (df['age'] >= ageRange[0]) & (df['age'] <= ageRange[1]) &
            (df['income'] >= incomeRange[0]) & (df['income'] <= incomeRange[1])  &
            (df['education_level'] == educationDegree)]

print("let's see if the person we know has cancer")
print(attack)


let's see if the person we know has cancer
   age  income education_level  number_of_children  has_cancer
2   35   60000          Master                   1           1
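
Attribute inference does not require a unique match. If several rows fit the attacker's partial knowledge but all of them share the same sensitive value, that value is still disclosed. The sketch below illustrates this with a made-up table in plain Python; the rows and the helper `infer_attribute` are hypothetical.

```python
# Hypothetical anonymized table: several people share similar quasi-identifiers.
rows = [
    {"age": 33, "education_level": "Master",   "has_cancer": 1},
    {"age": 34, "education_level": "Master",   "has_cancer": 1},
    {"age": 35, "education_level": "Master",   "has_cancer": 1},
    {"age": 33, "education_level": "Bachelor", "has_cancer": 0},
    {"age": 41, "education_level": "Master",   "has_cancer": 0},
]

def infer_attribute(rows, age_range, education_level, sensitive="has_cancer"):
    """Return (candidate rows, inferred value or None if the candidates disagree)."""
    low, high = age_range
    candidates = [
        r for r in rows
        if low <= r["age"] <= high and r["education_level"] == education_level
    ]
    values = {r[sensitive] for r in candidates}
    inferred = values.pop() if len(values) == 1 else None
    return candidates, inferred

candidates, inferred = infer_attribute(rows, age_range=(31, 35), education_level="Master")
print(f"{len(candidates)} candidate rows, inferred has_cancer = {inferred}")
```

Here the attacker cannot tell which of the three candidate rows belongs to the target, yet learns the sensitive value anyway because all candidates agree.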


### Reconstruction Attack:

The attack demonstrated above could also be classified as a reconstruction attack.

It's worth noting that the line between reconstruction attacks and attribute inference attacks can sometimes be blurred, as they both involve using known information to infer or reconstruct unknown information. The key distinction is that a reconstruction attack typically aims to reconstruct the entire original dataset, while an attribute inference attack focuses on inferring specific attributes for particular individuals.

Reconstruction attacks are a type of data privacy violation where an attacker uses aggregate data or statistical summaries to re-identify individual records. This could potentially violate the privacy of individuals in the dataset, even if the data was anonymized.

In the above example, one can reconstruct all dataset records by running multiple queries with known information.
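
Reconstruction from aggregates can be sketched in a few lines. The example below assumes a hypothetical query interface, `count_query`, that only answers "how many rows in this range have cancer?". The interface and the helper `reconstruct` are illustrative assumptions; the hidden column reuses the `has_cancer` values from the example above.

```python
# Hypothetical private column; the attacker never reads it directly.
_secret_has_cancer = [0, 0, 1, 0, 1]

def count_query(first, last):
    """Aggregate interface: how many of the rows first..last (inclusive) have cancer?"""
    return sum(_secret_has_cancer[first:last + 1])

def reconstruct(n_rows):
    """Rebuild every row's value from differences of overlapping counts."""
    rebuilt = []
    previous = 0
    for i in range(n_rows):
        running = count_query(0, i)         # count over rows 0..i
        rebuilt.append(running - previous)  # contribution of row i alone
        previous = running
    return rebuilt

reconstructed = reconstruct(n_rows=5)
print(reconstructed)  # prints [0, 0, 1, 0, 1]
```

Each answer is an aggregate, but two overlapping aggregates that differ by one row reveal that row exactly.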



### Linkage attack:

Linkage attacks try to reveal sensitive information by linking different datasets based on shared attributes.

Suppose we have two anonymized datasets. One dataset contains public information (like 'age', 'income', and 'education level'), and the other contains private information (like 'health status'). The two can be joined on a common key: an anonymized but shared user identifier or, as in the code below, a combination of attributes that appears in both datasets.

For demonstration purposes, let's say the public dataset was released by a marketing firm and the private dataset was leaked from a healthcare provider. Although the healthcare provider's dataset doesn't contain directly identifiable personal information, an attacker could still link its records to specific individuals by matching the shared key with the corresponding values in the public dataset.

Here's how an attacker could perform a linkage attack in Python:

In [58]:
import pandas as pd

# Anonymized dataset 1
data1 = {
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 60000, 80000, 70000],
    'education_level': ['Bachelor', 'PhD', 'Master', 'Bachelor', 'Master'],
    'number_of_children': [0, 2, 1, 3, 2]
}

df1 = pd.DataFrame(data1)

# Anonymized dataset 2
data2 = {
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 60000, 80000, 70000],
    'education_level': ['Bachelor', 'PhD', 'Master', 'Bachelor', 'Master'],
    'has_cancer': [0, 0, 1, 0, 1]
}

df2 = pd.DataFrame(data2)

# Partial information about a person
ageRange = [31, 35]
incomeRange = [40000, 80000]
educationDegree = 'Master'

# Perform the linkage attack
attack1 = df1[ (df1['age'] >= ageRange[0]) & (df1['age'] <= ageRange[1]) &
            (df1['income'] >= incomeRange[0]) & (df1['income'] <= incomeRange[1])  &
            (df1['education_level'] == educationDegree)]

print("Linking information from the two databases:")
for index, row in attack1.iterrows():
    attack2 = df2[ (df2['age'] == row['age']) &
                   (df2['income'] == row['income']) &
                   (df2['education_level'] == row['education_level']) ]
    print(attack2)


Linking information from the two databases:
   age  income education_level  has_cancer
2   35   60000          Master           1
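
The example above links the two datasets on a combination of attributes. When both releases carry the same pseudonymous identifier, the join is even simpler. The sketch below uses plain Python dictionaries; the column name `user_id`, the identifier values, and the helper `link` are hypothetical.

```python
# Hypothetical releases; `user_id` is a pseudonym shared by both datasets.
marketing_release = [
    {"user_id": "u17", "age": 35, "income": 60000, "education_level": "Master"},
    {"user_id": "u42", "age": 40, "income": 80000, "education_level": "Bachelor"},
]
health_leak = [
    {"user_id": "u42", "has_cancer": 0},
    {"user_id": "u17", "has_cancer": 1},
    {"user_id": "u99", "has_cancer": 0},
]

def link(public_rows, private_rows, key="user_id"):
    """Inner join of the two releases on the shared key."""
    private_by_key = {row[key]: row for row in private_rows}
    linked = []
    for row in public_rows:
        match = private_by_key.get(row[key])
        if match is not None:
            # Merge the public and the private view of the same person
            linked.append({**row, **match})
    return linked

linked = link(marketing_release, health_leak)
for row in linked:
    print(row)
```

A pseudonym protects nobody if the same pseudonym appears next to identifying attributes in another release.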


### Differential Privacy Attack:

Differential Privacy (DP) is a method that is designed to provide privacy guarantees for individuals whose data is being used for statistical analysis. However, as with any security or privacy mechanism, it's not invincible.

A potential weakness in differential privacy mechanisms comes in the form of repeated queries. If an attacker can make many identical or slightly different queries to the dataset, they can learn more than a single query would reveal: the privacy loss of the individual answers adds up, and averaging the answers cancels much of the noise. This is sometimes called a "composition attack."

Here's a simplified Python example that illustrates the issue:

In this example, the attacker knows that Alice is in the database and knows her exact age. By querying the database multiple times and averaging the results, the attacker can estimate the average age of all the people in the database. If the database is supposed to be private, this represents an attack because the attacker has gained information about the people in the database (their average age) that they should not have.

The attacker queries the database twice: once with Alice's data and once without. By comparing the average results of these two sets of queries, the attacker can infer the influence Alice's data has on the overall database. If there is a significant difference, the attacker could infer that Alice's age is above or below the average age of the people in the database. This can be seen as a privacy leak.

Assuming output:

       alice_age          :  24
       alice_age influence:  -0.2537783890259462

The negative influence value of -0.2537783890259462 indicates that the average age of the entire dataset is slightly lower when Alice is included. This implies that Alice's age is less than the average age of the overall population in the dataset.

While this might seem like a minor detail, it can be a significant privacy concern. The attacker was able to infer that Alice is younger than the average person in the dataset without directly accessing Alice's data. If the dataset were sufficiently large and the attack were more sophisticated, this kind of inference could potentially reveal much more about Alice or any other individual in the dataset.

In [74]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Set a seed for reproducibility
np.random.seed(0)

# Generate a sample dataset
N = 1000  # Number of people
labels = ['Alice', 'Bob', 'Charlie', 'David', 'Eve']

names = np.random.choice(labels, N)
ages = np.random.randint(20, 60, N)

df = pd.DataFrame({'Name': names, 'Age': ages})
print("----data ")
print(df)

# Encode the 'Name' column
encoder = LabelEncoder()
df['Name'] = encoder.fit_transform(df['Name'])
print("----anonymized data ")
print(df)

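def query(db, noise=0.2):
    """Return the mean of db plus Gaussian noise with standard deviation `noise`.

    This stands in for a differentially private query; a real mechanism would
    calibrate the noise to the query's sensitivity and to a privacy budget.
    """
    true_result = np.mean(db)
    privacy_result = true_result + noise * np.random.randn()
    return privacy_result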

# Attack begins
# Suppose the attacker knows Alice's age and that Alice is in the database
alice_age = df[df['Name'] == encoder.transform(['Alice'])[0]]['Age'].values[0]

# The attacker queries the database multiple times
n_queries = 1000
means_with_alice = [query(df['Age']) for _ in range(n_queries)]
print("means_with_alice age          : ", means_with_alice)

# The attacker then removes Alice from the database
df_without_alice = df[df['Name'] != encoder.transform(['Alice'])[0]]

# The attacker queries the database without Alice
means_without_alice = [query(df_without_alice['Age']) for _ in range(n_queries)]
print("means_without_alice age          : ", means_without_alice)

# The attacker compares the averages of the queries
average_with_alice = np.mean(means_with_alice)
average_without_alice = np.mean(means_without_alice)

# If there is a significant difference between the averages, the attacker can infer that Alice has an influence on the average age
influence = average_with_alice - average_without_alice

print("alice_age          : ", alice_age)
print("alice_age influence: ", influence)


----data 
      Name  Age
0      Eve   30
1    Alice   24
2    David   37
3    David   23
4    David   57
..     ...  ...
995    Eve   31
996  David   56
997    Bob   38
998    Bob   31
999  David   54

[1000 rows x 2 columns]
----anonymized data 
     Name  Age
0       4   30
1       0   24
2       3   37
3       3   23
4       3   57
..    ...  ...
995     4   31
996     3   56
997     1   38
998     1   31
999     3   54

[1000 rows x 2 columns]
means_with_alice age          :  [39.91284541341598, 39.42953947328916, 40.11187240135545, 39.575186914089606, 39.6628181087247, ... (output truncated)
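
The averaging effect behind this attack can be isolated with a noise mechanism that is calibrated to a privacy budget. The sketch below swaps the Gaussian noise of `query` for the Laplace mechanism and assumes ages are bounded to [20, 59], a per-query budget of ε = 0.1, and basic sequential composition (the budgets of repeated queries add up). The function names and all parameter values are illustrative assumptions, written in plain Python.

```python
import random
import statistics

random.seed(0)

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_mean(values, lower, upper, epsilon):
    """Laplace mechanism for a bounded mean; sensitivity is (upper - lower) / n."""
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    return statistics.fmean(clipped) + laplace_noise(sensitivity / epsilon)

ages = [random.randint(20, 59) for _ in range(1000)]  # hypothetical data
true_mean = statistics.fmean(ages)
epsilon_per_query = 0.1

# One query: the answer is protected by the full noise.
single_answer = private_mean(ages, 20, 59, epsilon_per_query)

# Many queries: averaging the answers shrinks the noise.
n_queries = 1000
answers = [private_mean(ages, 20, 59, epsilon_per_query) for _ in range(n_queries)]
averaged = statistics.fmean(answers)

print(f"error of one answer       : {abs(single_answer - true_mean):.4f}")
print(f"error of averaged answers : {abs(averaged - true_mean):.4f}")
print(f"epsilon spent in total    : {n_queries * epsilon_per_query:.1f}")
```

Each answer on its own satisfies the per-query budget, but the total budget spent grows with the number of queries. This is why practical deployments track a cumulative privacy budget and stop answering once it is exhausted.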

# Sources:

- [privacy-criteria](https://arx.deidentifier.org/overview/privacy-criteria/)