# Differential Privacy: Hands-On Exercises

## 📚 Overview

This notebook demonstrates why traditional data anonymization methods fail to protect privacy. Through practical exercises, you'll learn how to perform various privacy attacks on "de-identified" data.

**Based on**: [Programming Differential Privacy Chapter 1](https://programming-dp.com/ch1.html)

## 🎯 Learning Objectives

By completing these exercises, you will:
1. Understand the limitations of de-identification
2. Perform linkage attacks using auxiliary information
3. Discover how aggregation can leak individual data
4. Execute differencing attacks on aggregate statistics

---

## 📦 Setup and Imports

First, let's import the necessary packages and configure our environment.

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Show up to 20 columns so wide rows are not truncated when displayed
pd.set_option('display.max_columns', 20)

print("✅ Packages imported successfully!")

## 📊 Load and Explore the Dataset

We'll use a census dataset that includes synthetic personally identifiable information (PII) for educational purposes.

**Note**: Make sure you have downloaded `adult_with_pii.csv` from the course materials.

In [None]:
# Read the dataset
adult = pd.read_csv("adult_with_pii.csv")

print("📋 Dataset Overview:")
print(f"Shape: {adult.shape}")
print(f"Columns: {list(adult.columns)}")
print("\n🔍 First 5 records:")
adult.head()

## 🛡️ De-identification Process

Organizations often "de-identify" data by removing obvious identifiers like names and SSNs. Let's simulate this process:

In [None]:
# Create a "de-identified" dataset by dropping PII columns
# (drop returns a new DataFrame, so the original `adult` is left untouched)
adult_data = adult.drop(columns=['Name', 'SSN'])

# Save PII separately (we'll use this for our attacks)
adult_pii = adult[['Name', 'SSN', 'DOB', 'Zip']]

print("✅ De-identification complete!")
print("\n📊 'De-identified' dataset (first record):")
print(adult_data.head(1))
print("\n⚠️ Question: Is this data truly anonymous now?")

---

# 🔓 Part 1: Linkage Attacks

A **linkage attack** uses auxiliary information to re-identify individuals in supposedly anonymous data.
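
Before attacking the real dataset, the idea can be sketched on a tiny example. Everything in this block is hypothetical toy data (the names, dates, ZIP codes, and the `aux` and `released` frames are made up for illustration); only the pattern of joining on shared quasi-identifiers carries over.

```python
import pandas as pd

# Hypothetical auxiliary data an attacker might assemble from public sources.
aux = pd.DataFrame({
    "Name": ["Alice Example", "Bob Example"],
    "DOB": ["1/2/1980", "3/4/1975"],
    "Zip": [12345, 67890],
})

# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = pd.DataFrame({
    "DOB": ["1/2/1980", "3/4/1975", "5/6/1990"],
    "Zip": [12345, 67890, 12345],
    "Occupation": ["Sales", "Tech-support", "Craft-repair"],
})

# Joining on the shared quasi-identifiers re-attaches names to released rows.
linked = pd.merge(aux, released, on=["DOB", "Zip"])
print(linked[["Name", "Occupation"]])
```

Each name in `aux` lines up with exactly one released row, so the "anonymous" occupation column is no longer anonymous for those two people.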

## Exercise 1: Basic Linkage Attack

**Task**: Perform a linkage attack on Brenn McNeely using date of birth and ZIP code.

**Scenario**: You know Brenn McNeely's birthday and ZIP code from public sources (e.g., social media).

In [None]:
# Find Brenn's row in our auxiliary data
brenns_row = adult_pii[adult_pii['Name'] == 'Brenn McNeely']

print("🎯 Target: Brenn McNeely")
print(f"Known information: DOB={brenns_row['DOB'].values[0]}, ZIP={brenns_row['Zip'].values[0]}")
print("\n🔍 Performing linkage attack...")

# Perform the linkage attack using DOB and ZIP
# (both frames use the same column names, so `on=` is sufficient)
result = pd.merge(brenns_row, adult_data, on=["Zip", "DOB"])

print(f"\n✅ Attack successful! Found {len(result)} matching record(s)")
print("\n📊 Brenn's private information revealed:")
result

## Exercise 2: Linkage with Limited Information

**Task**: What if we only know Brenn's ZIP code? How effective is the attack?

In [None]:
print("🔍 Attempting linkage with only ZIP code...")

# Perform linkage attack with only ZIP
zip_only_result = pd.merge(brenns_row, adult_data, on='Zip')

print(f"\n📊 Found {len(zip_only_result)} potential matches:")
zip_only_result[['Zip', 'Age', 'Sex', 'Occupation', 'Target']]

## Exercise 3: Analyzing the Results

**Question**: You found 2 potential matches. What additional information could help identify the real Brenn?

**💡 Think about**:
- What attributes differ between the matches?
- What information might be publicly available?
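
One way to think about these questions is to watch the candidate set shrink as each extra fact is applied. The sketch below uses a hypothetical three-row candidate table and a hypothetical `known` dictionary of facts about the target; it is not the notebook's data.

```python
import pandas as pd

# Hypothetical candidate set left after matching on ZIP code alone.
candidates = pd.DataFrame({
    "Zip": [12345, 12345, 12345],
    "Sex": ["Female", "Male", "Female"],
    "Marital Status": ["Never-married", "Divorced", "Divorced"],
})

# Hypothetical facts the attacker learns about the target, one at a time.
known = {"Sex": "Female", "Marital Status": "Divorced"}

remaining = candidates
counts = [len(remaining)]  # candidate count before any extra fact is used
for column, value in known.items():
    remaining = remaining[remaining[column] == value]
    counts.append(len(remaining))

print(counts)  # candidate count after each additional fact
```

In this toy case two extra attributes are enough to go from three candidates to one.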

In [None]:
# Let's analyze the differences between potential matches
print("🔍 Analyzing differences between potential matches:\n")

# Display key differentiating attributes
comparison_cols = ['Sex', 'Marital Status', 'Occupation', 'Age', 'Race']
print("Differentiating attributes:")
print(zip_only_result[comparison_cols])

print("\n💡 Potential distinguishing information:")
print("- Sex (Male/Female)")
print("- Marital Status")
print("- Occupation")
print("- Age (if approximately known)")
print("- Any of these could be found on social media profiles!")

---

# 📊 Part 2: Aggregation Vulnerabilities

Organizations often release aggregate statistics thinking they're safe. Let's see why this assumption is dangerous.

## Exercise 4: Small Group Problem

**Task**: Determine how many people have their exact age exposed when we compute average age by ZIP code.
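
A minimal sketch of the problem on hypothetical toy data (the `toy` frame and its values are invented): computing the group size alongside the mean makes it easy to spot groups where the "average" is really one person's value.

```python
import pandas as pd

# Hypothetical toy data; the notebook's real table is `adult`.
toy = pd.DataFrame({
    "Zip": [11111, 11111, 22222, 33333, 33333, 33333],
    "Age": [30, 50, 41, 20, 25, 30],
})

# Compute the group size and the mean age in one pass.
stats = toy.groupby("Zip")["Age"].agg(n="count", mean_age="mean").reset_index()

# A group of size one publishes that person's exact age as the "mean".
exposed = stats[stats["n"] == 1]
print(exposed)
```

Here only ZIP 22222 has a single resident, so its published mean is that resident's exact age.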

In [None]:
# First, let's see the aggregation
print("📊 Average age by ZIP code (sample):")
zip_age_avg = adult[['Zip', 'Age']].groupby('Zip', as_index=False).mean()
print(zip_age_avg.head())
print("\n⚠️ Problem: What if a ZIP code has only one person?")

In [None]:
# Count how many people are in each ZIP code
adult["ones"] = 1  # Add a column for counting
counts = adult[['Zip', "ones"]].groupby('Zip', as_index=False).count()

# Find ZIP codes with only one person
single_person_zips = counts[counts["ones"] == 1]

print("🚨 PRIVACY BREACH ALERT:")
print(f"{len(single_person_zips)} ZIP codes contain only ONE person!")
print(f"\nFor these {len(single_person_zips)} people:")
print("- Their 'average' age is their EXACT age")
print("- Anyone who knows their ZIP code learns their age!")

# Show some examples
print("\n📋 Example vulnerable ZIP codes:")
vulnerable_zips = single_person_zips['Zip'].head(5).values
for zip_code in vulnerable_zips:
    person_age = adult[adult['Zip'] == zip_code]['Age'].values[0]
    print(f"  ZIP {zip_code}: 'Average' age = {person_age} (exact age!)")

# Clean up
adult.drop('ones', axis=1, inplace=True)

---

# 🔄 Part 3: Differencing Attacks

Even large aggregates can be attacked by comparing different query results.

## Exercise 5: Simple Differencing Attack

**Task**: Find Brenn McNeely's hours worked per week using two aggregate queries.

**Attack Formula**: 
```
Individual's value = (Sum with individual) - (Sum without individual)
```
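
As a quick sanity check of the formula, here is a sketch on hypothetical toy data (three invented people; `target` is an assumed name, not someone from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real table.
toy = pd.DataFrame({
    "Name": ["Alice Example", "Bob Example", "Carol Example"],
    "Hours per week": [40, 35, 20],
})

target = "Bob Example"

# Two aggregate queries that each look harmless on their own.
sum_with = toy["Hours per week"].sum()
sum_without = toy.loc[toy["Name"] != target, "Hours per week"].sum()

# Their difference is exactly the target's individual value.
recovered = sum_with - sum_without
print(recovered)
```

Neither sum reveals an individual by itself; the leak comes from being able to ask both.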

In [None]:
print("🎯 Target: Brenn McNeely's working hours")
print("\n📊 Executing differencing attack...")

# Query 1: Total hours for everyone
query1 = adult['Hours per week'].sum()
print(f"Query 1 - Total hours (everyone): {query1:,}")

# Query 2: Total hours excluding Brenn
query2 = adult[adult['Name'] != 'Brenn McNeely']['Hours per week'].sum()
print(f"Query 2 - Total hours (without Brenn): {query2:,}")

# Calculate the difference
brenns_hours = query1 - query2
print(f"\n🔓 Attack result: Brenn works {brenns_hours} hours per week")

# Verify our result
actual = adult[adult['Name'] == 'Brenn McNeely']['Hours per week'].values[0]
print(f"✅ Verification: Actual value = {actual} hours")

## Exercise 6: Indirect Differencing Attack

**Task**: Find Minni Mathevon's working hours using an indirect exclusion.

**Hint**: Minni is the only person from "Holand-Netherlands" in the dataset.

In [None]:
print("🎯 Target: Minni Mathevon's working hours")
print("🔍 Strategy: Use country information for indirect attack")

# First, verify the hint
dutch_people = adult[adult['Country'] == 'Holand-Netherlands']
print(f"\n✅ Confirmed: {len(dutch_people)} person(s) from Holand-Netherlands")
print(f"   Name: {dutch_people['Name'].values[0]}")

In [None]:
# Now perform the differencing attack
print("\n📊 Executing indirect differencing attack...")

# Query 1: Total hours for everyone
query1 = adult['Hours per week'].sum()
print(f"Query 1 - Total hours (all countries): {query1:,}")

# Query 2: Total hours excluding Holand-Netherlands
query2 = adult[adult['Country'] != 'Holand-Netherlands']['Hours per week'].sum()
print(f"Query 2 - Total hours (without Dutch): {query2:,}")

# Calculate the difference
minnis_hours = query1 - query2
print(f"\n🔓 Attack result: Minni works {minnis_hours} hours per week")

# Verify
actual = adult[adult['Name'] == 'Minni Mathevon']['Hours per week'].values[0]
print(f"✅ Verification: Actual value = {actual} hours")

---

# 🎓 Key Takeaways

## What We've Learned

1. **De-identification is Not Enough**
   - Removing names and SSNs doesn't guarantee anonymity
   - Quasi-identifiers (ZIP, DOB, etc.) can uniquely identify individuals

2. **Linkage Attacks are Easy**
   - Even partial information can narrow down possibilities
   - Public data sources make auxiliary information readily available

3. **Aggregation Has Limits**
   - Small groups completely expose individual data
   - "Average" of one person is their exact value

4. **Differencing Attacks are Powerful**
   - Multiple queries can be combined to extract individual data
   - Works even on large aggregates
   - Indirect attacks using unique characteristics are possible

## Why This Matters

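These vulnerabilities show why we need **differential privacy**, a mathematical framework whose privacy guarantees are provable and hold regardless of:
- What auxiliary information attackers have
- What other datasets exist

Answering more queries still costs privacy, but the loss accumulates in a bounded, quantifiable way instead of failing outright as in the differencing attacks above.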

## Ethical Note

⚠️ **Important**: These techniques are for educational purposes only. Using them on real data without authorization is unethical and potentially illegal.

---

# 🚀 Additional Challenges

Try these additional exercises to deepen your understanding:

## Challenge 1: Find the Most Vulnerable Person

Who in the dataset would be easiest to re-identify? Consider multiple quasi-identifiers.

In [None]:
# Your code here
# Hint: Look for people with unique combinations of attributes

## Challenge 2: Group Size Analysis

What's the minimum group size needed to provide reasonable privacy for age aggregation by occupation?

In [None]:
# Your code here
# Hint: Check the distribution of group sizes for different occupations

## Challenge 3: Multi-Attribute Linkage

How many people can be uniquely identified using Age + Sex + Education level?
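
The counting pattern can be tried first on hypothetical toy data (the `toy` frame below is invented and far smaller than the real dataset). A person is uniquely identifiable by a set of attributes when their combination of values occurs exactly once.

```python
import pandas as pd

# Hypothetical toy data with the three quasi-identifiers.
toy = pd.DataFrame({
    "Age": [30, 30, 41, 41, 52],
    "Sex": ["Female", "Female", "Male", "Female", "Male"],
    "Education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Masters"],
})

quasi_identifiers = ["Age", "Sex", "Education"]

# Size of each group of people sharing the same attribute combination.
group_sizes = toy.groupby(quasi_identifiers).size()

# Groups of size one correspond to uniquely identifiable people.
n_unique = int((group_sizes == 1).sum())
print(n_unique)
```

The column names here are assumed to match the real dataset; adjust them if your copy of `adult_with_pii.csv` uses different headers.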

In [None]:
# Your code here
# Hint: Group by these three attributes and count the groups that contain exactly one person