<a href="https://colab.research.google.com/github/idhamari/differential_privacy_tutorials/blob/main/DP_Attacks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About:


In this tutorial, I will try to explain Differential Privacy (DP) in a practical way, with examples using simple Python scripts. I will add more details, examples, and scripts whenever I find the time, so please consider this a work in progress.

Resources:



## Introduction to Data Privacy:

**Data privacy** refers to the protection and management of personal or sensitive information so that it remains confidential, secure, and appropriately used. With the increasing digitization and collection of vast amounts of data, data privacy has become a critical concern for individuals, organizations, and governments. It encompasses the ethical and legal considerations surrounding the collection, storage, sharing, and usage of personal data.

In today's interconnected world, where data is collected from various sources such as social media platforms, online transactions, healthcare records, and IoT devices, maintaining data privacy is challenging. Protecting personal information is crucial as it includes details like names, addresses, financial records, health information, and other sensitive attributes that can be misused or exploited if not adequately safeguarded.

**Real-Life Examples: The Limitations of Removing Names and Fields**

Removing names and certain identifiable fields from datasets has been a common practice to protect privacy. However, this approach alone is often insufficient to guarantee privacy due to the following reasons:

1. **Re-identification Attacks:** Even when names and directly identifiable information are removed, attackers can use various techniques to re-identify individuals. For instance, by cross-referencing multiple datasets or combining publicly available information, an attacker can match unique attributes to re-identify individuals. The Netflix Prize dataset, released in 2006, is a well-known example: researchers re-identified some subscribers by linking the anonymized movie ratings with public ratings on IMDb.

2. **Attribute Inference:** Even if explicit identifiers are removed, sensitive attributes can still be inferred by analyzing other non-identifiable fields. By leveraging patterns or correlations within the data, attackers can deduce sensitive information about individuals. For instance, a study published in 2009 demonstrated that individuals' sexual orientation could be inferred with high accuracy by analyzing their social network connections, even when explicit identifiers were not present.

3. **Statistical Disclosure:** Aggregated or anonymized datasets can still contain statistical patterns that enable attackers to deduce sensitive information about individuals. Researchers have shown that by applying advanced statistical analysis techniques, individuals' private information, such as medical conditions or financial status, can be inferred from seemingly anonymized data.

4. **Contextual Information:** Even when explicit identifiers are removed, contextual information present in the dataset can aid in re-identification. For example, a combination of location, occupation, and age may uniquely identify an individual within a specific region or profession, even if their name is not present.
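
The point about contextual information can be checked with a few lines of Python. The sketch below is only an illustration: the records, the field names, and the helper `unique_records` are all hypothetical. It counts how many records in a name-free table are the only ones with their combination of ZIP code, age, and occupation, because those records are the easiest to re-identify.

```python
from collections import Counter

# Hypothetical "anonymized" records: names are removed, contextual fields are kept.
records = [
    {"zip": "12345", "age": 34, "occupation": "nurse", "diagnosis": "asthma"},
    {"zip": "12345", "age": 34, "occupation": "nurse", "diagnosis": "diabetes"},
    {"zip": "12345", "age": 51, "occupation": "pilot", "diagnosis": "hypertension"},
    {"zip": "67890", "age": 29, "occupation": "teacher", "diagnosis": "migraine"},
    {"zip": "67890", "age": 29, "occupation": "teacher", "diagnosis": "asthma"},
    {"zip": "67890", "age": 62, "occupation": "judge", "diagnosis": "arthritis"},
]

# Fields that are not identifiers on their own but may identify someone in combination.
QUASI_IDENTIFIERS = ("zip", "age", "occupation")


def unique_records(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


exposed = unique_records(records, QUASI_IDENTIFIERS)
print(f"{len(exposed)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
for row in exposed:
    print(row)
```

In this toy table, the pilot and the judge are unique on the three fields, so anyone who knows those three facts about them also learns their diagnosis.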

These examples illustrate that merely removing names and identifiable fields from datasets is not sufficient to ensure data privacy. It is essential to adopt more comprehensive approaches, such as data anonymization techniques, encryption, differential privacy, and rigorous access controls, to safeguard individuals' privacy and protect against privacy attacks.

Data privacy regulations, such as the **General Data Protection Regulation (GDPR)** in the European Union and the **California Consumer Privacy Act (CCPA)** in the US state of California, emphasize the need for organizations to implement robust privacy measures and to be transparent with individuals about how their personal data is handled. These regulations aim to strike a balance between data utility and privacy protection, encouraging organizations to adopt privacy-enhancing practices that mitigate the risks associated with data privacy breaches.




## Popular Types of Privacy Attacks:


* **Membership Inference Attack (MIA):** This attack aims to determine whether a particular individual's record is present in a dataset, for example the training data of a machine learning model, without needing access to the record as stored. Membership alone can be sensitive: learning that someone appears in a dataset of patients with a specific disease reveals their diagnosis.

* **Attribute Inference Attack:** This attack aims to infer sensitive attributes or characteristics of individuals based on available non-sensitive attributes. For example, inferring a person's salary based on their occupation, education level, and geographic location.

* **Reconstruction Attack:** Reconstruction attacks attempt to reconstruct the original data or sensitive information from aggregated or anonymized data. These attacks exploit vulnerabilities in data anonymization techniques and re-identify individuals based on unique combinations of attributes.

* **Linkage Attack:** Linkage attacks involve linking multiple datasets or combining publicly available information to identify individuals in a dataset that is presumed to be anonymous. By matching common identifiers or attributes, an attacker can re-identify individuals and uncover sensitive information.

* **Timing Attack:** Timing attacks exploit variations in response times or resource consumption to infer sensitive information. For example, in a web application, an attacker can measure the time it takes to respond to different queries and deduce information about the underlying data or operations being performed.

* **Side-Channel Attack:** Side-channel attacks leverage information leaked through unintended channels such as power consumption, electromagnetic radiation, or network traffic. These attacks target the physical implementation of a system to extract sensitive information, rather than directly targeting the data itself.

* **Differential Privacy Attack:** Differential privacy attacks focus on bypassing or exploiting the privacy guarantees provided by differential privacy mechanisms. By analyzing multiple queries or responses, an attacker can gain insights into the sensitive data or individuals.

* **Homomorphic Encryption Attack:** Homomorphic encryption attacks involve exploiting weaknesses or vulnerabilities in homomorphic encryption schemes to reveal sensitive data. These attacks target the encryption and decryption processes and aim to bypass the privacy protection provided by homomorphic encryption.
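
To make the linkage attack from the list above more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the two tiny datasets, the names, the join keys, and the helper `link`. It joins a name-free medical table with a public table that contains names, using the attributes the two tables share.

```python
# Hypothetical "anonymized" medical records (no names).
medical = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02140", "birth_year": 1992, "sex": "F", "diagnosis": "diabetes"},
]

# Hypothetical public list that includes names (for example, a voter roll).
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Carol Example", "zip": "02141", "birth_year": 1992, "sex": "F"},
]

# Attributes that appear in both datasets.
KEYS = ("zip", "birth_year", "sex")


def link(anonymized, identified, keys):
    """Attach a name to each anonymized row that matches exactly one identified person."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # keep only unambiguous matches
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(medical, public, KEYS)
for name, diagnosis in reidentified:
    print(f"{name} -> {diagnosis}")
```

Two of the three medical records are linked to a name, even though the medical table itself contains no names. The third record stays anonymous only because the public list has no matching entry.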

It's important to note that these attacks highlight the need for robust privacy protection mechanisms and practices. Data anonymization, differential privacy techniques, secure data handling, and encryption play a crucial role in safeguarding individuals' privacy and mitigating the risks associated with these attacks.
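
One way that analyzing multiple responses can weaken a differentially private mechanism is an averaging attack. The sketch below assumes a hypothetical service that answers the same counting query repeatedly with fresh Laplace noise and does not track a privacy budget. The true count, the value of epsilon, and the function names are assumptions chosen for the example.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1, noise scale 1/epsilon)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_COUNT = 42         # hypothetical sensitive count
EPSILON = 0.1           # privacy parameter spent on each single answer

# One answer is heavily perturbed (the noise scale is 1 / 0.1 = 10).
single_answer = noisy_count(TRUE_COUNT, EPSILON, rng)

# The attacker repeats the same query many times and averages the answers.
answers = [noisy_count(TRUE_COUNT, EPSILON, rng) for _ in range(10_000)]
estimate = statistics.mean(answers)

print(f"single noisy answer   : {single_answer:.2f}")
print(f"average of {len(answers)} answers: {estimate:.2f}")
```

The average converges to the true count because the noise has zero mean. Under basic sequential composition, the total privacy loss of these 10,000 answers is 10,000 × 0.1 = 1000 rather than 0.1, which is why practical deployments limit the number of queries with a privacy budget.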

### Membership Inference Attack (MIA):

A membership inference attack tries to determine whether a particular individual's record is present in a dataset, without needing access to the record as stored. It poses a significant threat to privacy, especially where sensitive information is stored or shared, because membership itself can be sensitive. In this section, we explore the concept of membership attacks and walk through a Python example of how one can be implemented.

**Understanding Membership Attacks:**

A membership attack, also known as a membership inference attack, attempts to determine if a specific record or individual is present in a dataset by exploiting the statistical properties of the dataset. These attacks can be launched against machine learning models that have been trained on sensitive data or contain personal information. The primary goal is to extract sensitive information without directly accessing it, thereby compromising the privacy of individuals.
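
The statistical property that many membership attacks rely on is that a model tends to be more confident on records it was trained on than on records it has never seen. The sketch below illustrates this with a toy setup, not a realistic one. The synthetic two-feature records, the 1-nearest-neighbour "model" that memorizes its training set, the confidence formula, and the threshold are all assumptions made for the example.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_record():
    """Create a hypothetical record: two numeric features and a binary label."""
    x = (rng.gauss(0, 1), rng.gauss(0, 1))
    return x, int(x[0] + x[1] > 0)


members = [make_record() for _ in range(200)]      # used to "train" the model
non_members = [make_record() for _ in range(200)]  # never seen by the model


def predict_with_confidence(x):
    """Toy overfitted model: 1-nearest-neighbour lookup in the training set.

    The confidence is 1.0 for an exact match and decays with the distance
    to the nearest training record.
    """
    features, label = min(members, key=lambda rec: math.dist(rec[0], x))
    return label, math.exp(-math.dist(features, x))


def guess_is_member(x, threshold=0.99):
    """Attack: guess 'member' when the model is almost certain about x."""
    _, confidence = predict_with_confidence(x)
    return confidence >= threshold


hits = sum(guess_is_member(x) for x, _ in members)
false_alarms = sum(guess_is_member(x) for x, _ in non_members)
attack_accuracy = (hits + len(non_members) - false_alarms) / (len(members) + len(non_members))

print(f"members flagged as members    : {hits}/{len(members)}")
print(f"non-members flagged as members: {false_alarms}/{len(non_members)}")
print(f"attack accuracy               : {attack_accuracy:.3f}")
```

Because this toy model memorizes its training data perfectly, the attack is far more successful here than it would usually be against a well-regularized model. The less a model overfits, the smaller the confidence gap that the attacker can exploit.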

**Example - Membership Attack on a Dataset:**

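Assume we have a dataset.

First, we train a machine learning model on the dataset.  
Then we query the model with a candidate record and compare the predicted label with the record's true label. If the prediction is correct, we guess that the record was in the training data; if it is not, we guess that it was not. This is a deliberately simple heuristic: a model can also predict unseen records correctly, so a single guess is not reliable.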

    
**Conclusion:**
Membership attacks are a serious privacy concern, and understanding their concepts and implications is crucial for data privacy protection. By exploiting the statistical properties of a dataset, attackers can determine if specific records or individuals are part of the dataset. Organizations and data practitioners should be aware of the risks associated with membership attacks and employ appropriate privacy-preserving techniques to safeguard sensitive information.

Remember that respecting privacy and following ethical guidelines should be a priority when dealing with sensitive data.


In [64]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import urllib.request
import random


def getData():
    # Download the Adult dataset from UCI Machine Learning Repository
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    filename = "adult.csv"
    urllib.request.urlretrieve(url, filename)

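    # Load the Adult dataset; fields are separated by ", ", so strip the leading
    # space from each value (otherwise labels read as ' <=50K' instead of '<=50K')
    data = pd.read_csv(filename, header=None, skipinitialspace=True)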

    # Assign column names to the dataset
    column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                    'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'label']
    data.columns = column_names

    # Perform one-hot encoding for categorical features
    categorical_features = ['workclass', 'education', 'marital_status', 'occupation',
                            'relationship', 'race', 'sex', 'native_country']
    data_encoded = pd.get_dummies(data, columns=categorical_features)
    return data_encoded

def getTrainedModel(data_encoded, test_size=0.2,random_state=42, modelID="RandomForestClassifier"):

    # Split the dataset into training and testing sets
    train_data, test_data = train_test_split(data_encoded, test_size=test_size, random_state=random_state)

    # Separate features and labels
    train_features = train_data.drop('label', axis=1)
    train_labels = train_data['label']
    test_features = test_data.drop('label', axis=1)
    test_labels = test_data['label']

    # Train a machine learning model on the training set
    if modelID == "RandomForestClassifier":
        model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    else:
        # Fail early instead of calling fit() on None
        raise ValueError(f"Unsupported modelID: {modelID}")

    model.fit(train_features, train_labels)

    # Evaluate the model on the testing set
    test_predictions = model.predict(test_features)
    test_accuracy = accuracy_score(test_labels, test_predictions)

    return model, test_features, test_labels, test_accuracy

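def getSampleAttackData(target_data, target_labels, isMember=1):
    if isMember:
        # Sample one record and take the label with the same index,
        # so that the features and the label belong to the same person
        attack_data = target_data.sample(n=1)
        attack_labels = target_labels.loc[attack_data.index]
    else:
        # Generate a random record with the same columns and a random label
        # (values are uniform in [0, 1), so this is not a realistic person)
        attack_data = pd.DataFrame([[random.random() for _ in target_data.columns]],
                                   columns=list(target_data.columns))
        attack_labels = pd.Series([random.choice(['<=50K', '>50K'])])

    return attack_data, attack_labels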

# Perform membership inference attack
def checkMembership(model, attack_data, attack_labels):

    # Predict the label of the attacked record
    attack_prediction = model.predict(attack_data)[0]
    attack_label = list(attack_labels)[0]

    # Guess "member" when the model predicts the record's label correctly
    # (strip whitespace so that ' <=50K' and '<=50K' compare as equal)
    attack_result = (str(attack_prediction).strip() == str(attack_label).strip())
    return attack_result

print("=============  Membership Attack Example ================")
random_state = 42

# Get sample data; we are using the UCI Adult dataset (32561 rows and 109 attributes)
data_encoded = getData()


# get trained model
model, test_features, test_labels, test_accuracy = getTrainedModel(data_encoded, test_size=0.01,random_state=42, modelID="RandomForestClassifier")
print(f'Test Accuracy: {test_accuracy}')

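# Create a sample record to attack: either a member (a record from the training split)
# or a non-member (a randomly generated record)
isMember     = 0
# Rebuild the training split with the same test_size and random_state used for training,
# because true members must come from the data the model was trained on
train_data, _ = train_test_split(data_encoded, test_size=0.01, random_state=random_state)
attack_data, attack_labels = getSampleAttackData(train_data.drop('label', axis=1), train_data['label'], isMember=isMember)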

# Perform the membership inference attack on the trained model using the sample attack data.
# The result is a single True/False membership guess, not an accuracy value.
print("testing using isMember              : ", isMember)
print(f'Guessed as member                   : {checkMembership(model, attack_data, attack_labels)}')


Test Accuracy: 0.8895705521472392
testing using isMember              :  0
Guessed as member                   : False
