# 📚 Privacy Background: K-Anonymity, L-Diversity, T-Closeness

Built by **Stu** 🚀

_Context: E-commerce Travel Booking Platform_

## Introduction to Classical Privacy Methods

This notebook traces how anonymization methods evolved before differential privacy: first k-anonymity, then l-diversity, then t-closeness.

## Toy Travel Dataset (30 records)

In [1]:
import pandas as pd
toy_data = pd.DataFrame({
    'Age': [25, 34, 29, 45, 52, 23, 31, 40, 29, 60, 48, 27, 35, 42, 37, 53, 46, 30, 24, 38, 41, 50, 28, 47, 36, 39, 44, 32, 26, 33],
    'ZipCode': ['02138', '02139', '02138', '02139', '02140', '02141', '02140', '02141', '02138', '02139',
                '02140', '02141', '02138', '02139', '02140', '02141', '02140', '02141', '02138', '02139',
                '02140', '02141', '02138', '02139', '02140', '02141', '02140', '02141', '02138', '02139'],
    'Gender': ['Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female',
               'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female',
               'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'DestinationCity': ['Paris', 'London', 'Paris', 'Tokyo', 'Boston', 'Rome', 'Paris', 'Rome', 'Boston', 'Tokyo',
                         'London', 'Paris', 'Tokyo', 'Paris', 'Rome', 'Tokyo', 'London', 'Paris', 'Boston', 'Rome',
                         'Paris', 'Tokyo', 'London', 'Rome', 'Paris', 'Tokyo', 'Boston', 'Rome', 'London', 'Tokyo'],
    'TripPurpose': ['Leisure', 'Business', 'Leisure', 'Leisure', 'Business', 'Leisure', 'Business', 'Business',
                    'Medical', 'Leisure', 'Education', 'Medical', 'Business', 'Leisure', 'Leisure', 'Education',
                    'Business', 'Medical', 'Leisure', 'Business', 'Medical', 'Leisure', 'Education', 'Business',
                    'Medical', 'Business', 'Leisure', 'Medical', 'Business', 'Education'],
    'BookingPrice': [1200, 2500, 1800, 3400, 2200, 2900, 3100, 4000, 450, 3200,
                     2700, 1300, 2600, 1800, 3000, 3400, 2200, 2500, 500, 3800,
                     470, 3600, 2800, 4100, 490, 3000, 510, 4200, 2500, 3100],
    'LoyaltyStatus': ['Gold', 'Silver', 'Gold', 'None', 'Silver', 'Gold', 'None', 'Gold', 'Silver', 'None',
                      'Gold', 'None', 'Silver', 'Gold', 'Silver', 'None', 'Silver', 'None', 'Gold', 'Silver',
                      'Gold', 'None', 'Silver', 'Gold', 'Silver', 'None', 'Silver', 'Gold', 'None',
                      'Silver']  # ASSUMED placeholder: the list as given had 29 values, but every column needs 30
})
toy_data.head()

## Exercise 1: What is k-Anonymity?

Define k-anonymity in your own words.

In [2]:
k_anonymity_definition = ""

## Exercise 2: Identify Quasi-Identifiers

Which fields in the toy dataset are quasi-identifiers?

In [3]:
quasi_identifiers = []  # e.g., Age, ZipCode, Gender

## Exercise 3: k-Anonymity Violation

Does the toy dataset satisfy 3-anonymity for the fields Age + ZipCode + Gender?

In [4]:
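def check_k_anonymity(df, fields, k):
    """Return True if every combination of values in `fields` occurs at least k times."""
    groups = df.groupby(fields).size()  # size of each equivalence class
    return (groups >= k).all()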

check_k_anonymity(toy_data, ['Age', 'ZipCode', 'Gender'], 3)
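A True/False answer hides how far a table is from k-anonymity. The sketch below reports the smallest group size (the largest k the table satisfies) and lists the groups that fall short. It runs on a small hypothetical table rather than `toy_data`, and the helper names `achieved_k` and `violating_groups` are illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical mini-dataset, not the notebook's toy_data.
sample = pd.DataFrame({
    'Age': [25, 25, 25, 40, 40],
    'ZipCode': ['02138', '02138', '02138', '02139', '02139'],
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male'],
})

def achieved_k(df, fields):
    """Return the size of the smallest group sharing the same values in `fields`."""
    return int(df.groupby(fields).size().min())

def violating_groups(df, fields, k):
    """Return the sizes of the groups that have fewer than k members."""
    sizes = df.groupby(fields).size()
    return sizes[sizes < k]

qi = ['Age', 'ZipCode', 'Gender']
k_sample = achieved_k(sample, qi)         # smallest equivalence class
small = violating_groups(sample, qi, 3)   # groups that break 3-anonymity
print(k_sample)
print(small)
```

The same two helpers can be pointed at `toy_data` to see which Age + ZipCode + Gender combinations are responsible for a failed check.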

## Exercise 4: Generalize ZipCode

Generalize ZipCode to its first 3 digits. Every ZIP code in the toy dataset starts with 021, so the generalized column takes a single value.

In [5]:
toy_data['ZipCode_Generalized'] = toy_data['ZipCode'].str[:3]
toy_data.head()

## Exercise 5: Check 3-Anonymity After Generalization

Does the dataset now satisfy 3-anonymity using Age + generalized ZipCode + Gender?

In [6]:
check_k_anonymity(toy_data, ['Age', 'ZipCode_Generalized', 'Gender'], 3)

## Exercise 6: What is l-Diversity?

Define l-diversity and why it was introduced.

In [7]:
l_diversity_definition = ""

## Exercise 7: Check l-Diversity for TripPurpose

Does each group defined by Age + generalized ZipCode + Gender contain at least 2 distinct TripPurpose values?

In [8]:
def check_l_diversity(df, group_fields, sensitive_field, l):
    """Return True if every group has at least l distinct values of `sensitive_field`."""
    grouped = df.groupby(group_fields)[sensitive_field].nunique()
    return (grouped >= l).all()

check_l_diversity(toy_data, ['Age', 'ZipCode_Generalized', 'Gender'], 'TripPurpose', 2)
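To see why distinct-value counts matter, here is a minimal sketch on a hypothetical table (not `toy_data`). Both groups have three members, yet one group is homogeneous in the sensitive attribute, so anyone who can place a person in that group learns their trip purpose. The `AgeBand` column and the helper name `distinct_values_per_group` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset: two groups of size 3 on the quasi-identifier AgeBand.
sample = pd.DataFrame({
    'AgeBand': ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39'],
    'TripPurpose': ['Leisure', 'Business', 'Medical', 'Medical', 'Medical', 'Medical'],
})

def distinct_values_per_group(df, group_fields, sensitive_field):
    """Count distinct sensitive values inside each quasi-identifier group."""
    return df.groupby(group_fields)[sensitive_field].nunique()

diversity = distinct_values_per_group(sample, ['AgeBand'], 'TripPurpose')
homogeneous = diversity[diversity < 2]  # groups that fail 2-diversity
print(diversity)
print(homogeneous)
```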

## Exercise 8: What is t-Closeness?

Define t-closeness and its advantage over l-diversity.

In [9]:
t_closeness_definition = ""
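One way to make the definition concrete is to measure, for each group, how far its distribution of the sensitive attribute is from the distribution over the whole table, and require that distance to be at most t. The sketch below uses total variation distance, which for a categorical attribute where all values are treated as equally far apart matches the Earth Mover's Distance used in the t-closeness literature. The data is hypothetical, and `group_distances` and `check_t_closeness` are illustrative names, not notebook functions.

```python
import pandas as pd

# Hypothetical mini-dataset, not the notebook's toy_data.
sample = pd.DataFrame({
    'AgeBand': ['20-29'] * 4 + ['30-39'] * 4,
    'TripPurpose': ['Leisure', 'Leisure', 'Business', 'Medical',
                    'Medical', 'Medical', 'Medical', 'Business'],
})

def group_distances(df, group_fields, sensitive_field):
    """Total variation distance between each group's distribution and the overall one."""
    overall = df[sensitive_field].value_counts(normalize=True)
    # Rows: groups; columns: sensitive values; entries: within-group proportions.
    local = pd.crosstab([df[f] for f in group_fields], df[sensitive_field],
                        normalize='index')
    # Align the overall proportions with the columns, then take half the L1 distance.
    return 0.5 * local.sub(overall, axis='columns').abs().sum(axis=1)

def check_t_closeness(df, group_fields, sensitive_field, t):
    """Return True if every group is within distance t of the overall distribution."""
    return bool((group_distances(df, group_fields, sensitive_field) <= t).all())

distances = group_distances(sample, ['AgeBand'], 'TripPurpose')
loose = check_t_closeness(sample, ['AgeBand'], 'TripPurpose', 0.3)
strict = check_t_closeness(sample, ['AgeBand'], 'TripPurpose', 0.2)
print(distances, loose, strict)
```

Unlike a distinct-value count, this distance also reacts when a group contains several values but one of them dominates.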

## Exercise 9: Booking Price Distribution

Plot the histogram of Booking Prices for all users.

In [10]:
import matplotlib.pyplot as plt
toy_data['BookingPrice'].hist(bins=10)
plt.xlabel('Booking Price ($)')
plt.ylabel('Frequency')
plt.title('Overall Booking Price Distribution')
plt.show()

## Exercise 10: Simulate Linkage Attack

Assume an attacker knows a target's Age and ZipCode prefix. Check how many distinct DestinationCity values remain in each matching group.

In [11]:
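def simulate_linkage_attack(df, known_fields):
    """Count the distinct DestinationCity values among records matching each known-field combination."""
    results = df.groupby(known_fields)['DestinationCity'].nunique()
    return results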

simulate_linkage_attack(toy_data, ['Age', 'ZipCode_Generalized'])

## Exercise 11: Linkage Attack Risk

What percentage of groups contain only 1 distinct DestinationCity, so that the attacker learns the destination with certainty?

In [12]:
linkage_results = simulate_linkage_attack(toy_data, ['Age', 'ZipCode_Generalized'])
(linkage_results == 1).mean() * 100
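The percentage above counts groups, so a group of one record weighs as much as a group of ten. A record-level view can be a useful complement: what share of records sit in a group with a single destination? This is a minimal sketch on hypothetical data, with `records_at_risk` and the `ZipPrefix` column as illustrative names.

```python
import pandas as pd

# Hypothetical mini-dataset, not the notebook's toy_data.
sample = pd.DataFrame({
    'Age': [29, 29, 34, 34, 45],
    'ZipPrefix': ['021', '021', '021', '021', '021'],
    'DestinationCity': ['Paris', 'Paris', 'London', 'Tokyo', 'Rome'],
})

def records_at_risk(df, known_fields, target_field):
    """Percentage of records whose group has exactly one distinct target value."""
    # transform() broadcasts each group's distinct count back onto its rows.
    distinct = df.groupby(known_fields)[target_field].transform('nunique')
    return float((distinct == 1).mean() * 100)

group_pct = float(
    (sample.groupby(['Age', 'ZipPrefix'])['DestinationCity'].nunique() == 1).mean() * 100
)
record_pct = records_at_risk(sample, ['Age', 'ZipPrefix'], 'DestinationCity')
print(group_pct, record_pct)
```

Here two of three groups are fully exposed, but they cover three of five records, so the two figures differ.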

## Exercise 12: Suggest Defense Strategies

Suggest how to defend against linkage attacks using generalization or suppression.

In [13]:
defense_strategies = ""
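As one possible generalization step, exact ages can be replaced by ten-year bands, which merges many single-record groups. The sketch below applies `pd.cut` to the first six ages from the toy dataset. The bin edges and labels are assumptions chosen for illustration, and a real release would tune them against the utility loss.

```python
import pandas as pd

# First six ages from the toy dataset.
ages = pd.Series([25, 34, 29, 45, 52, 23], name='Age')

# Assumed ten-year bands; right=False makes each bin include its left edge,
# so 30 falls in '30-39' rather than '20-29'.
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-29', '30-39', '40-49', '50-59', '60-69']
age_band = pd.cut(ages, bins=bins, labels=labels, right=False)

band_sizes = age_band.value_counts()
print(age_band.tolist())
print(band_sizes)
```

On the full dataset, the banded column could stand in for Age when re-running `check_k_anonymity`. Groups that remain too small after generalization are candidates for suppression.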

## Exercise 13: Suppress Unique Records

Suppose we suppress every record whose quasi-identifier group has fewer than 3 members (unique records are the extreme case). Show how many records remain.

In [14]:
quasi_ids = ['Age', 'ZipCode_Generalized', 'Gender']
# Attach each record's group size to the record itself, then keep only rows in groups of at least 3.
group_sizes = toy_data.groupby(quasi_ids)['BookingPrice'].transform('size')
safe_data = toy_data[group_sizes >= 3]
len(safe_data)