# 📚 Realistic Travel Dataset — K-Anonymity, L-Diversity, T-Closeness

Built by **Stu** 🚀

_Context: Travel E-commerce Booking Platform — Privacy Risks_

## Introduction

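This notebook uses a simulated 400-customer travel booking dataset to practice k-anonymity and l-diversity checks, a linkage attack, and a simple privacy risk score.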

## Load Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate the dataset with a fixed seed so the results are reproducible
np.random.seed(42)
ages = np.random.randint(18, 75, size=400)
zipcodes = np.random.choice(['02138', '02139', '94016', '10001', '60614', '94102'], size=400)
genders = np.random.choice(['Male', 'Female', 'Nonbinary'], size=400)
cities = np.random.choice(['Paris', 'London', 'Tokyo', 'New York', 'Rome', 'Sydney', 'Dubai'], size=400)
purposes = np.random.choice(['Leisure', 'Business', 'Medical', 'Education'], size=400)
prices = np.random.choice([200, 500, 1000, 1500, 2500, 3500, 4500], size=400)
loyalty = np.random.choice(['Gold', 'Silver', 'Platinum', 'None'], size=400)

df = pd.DataFrame({
    'Age': ages,
    'ZipCode': zipcodes,
    'Gender': genders,
    'DestinationCity': cities,
    'TripPurpose': purposes,
    'BookingPrice': prices,
    'LoyaltyStatus': loyalty
})

# Risk score: 1 / size of the record's (Age, ZipPrefix, DestinationCity) class,
# then min-max scaled to [0, 1] (assumes the class sizes are not all equal)
df['ZipPrefix'] = df['ZipCode'].str[:3]
class_size = df.groupby(['Age', 'ZipPrefix', 'DestinationCity'])['Age'].transform('size')
raw_risk = 1 / class_size
df['PrivacyRiskScore'] = (raw_risk - raw_risk.min()) / (raw_risk.max() - raw_risk.min())

df.head()

## Exercise 1: Describe Dataset Fields

Summarize the meaning of each column.

In [2]:
field_descriptions = ""

## Exercise 2: Identify Quasi-Identifiers

Which fields are quasi-identifiers?

In [3]:
quasi_identifiers = []  # e.g., Age, ZipPrefix, Gender

## Exercise 3: Basic k-Anonymity Check

Does the dataset satisfy 5-anonymity on Age + ZipPrefix?

In [4]:
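def check_k_anonymity(df, fields, k):
    """Return True if every equivalence class on `fields` has at least k records."""
    # observed=True skips empty combinations of categorical fields (size 0)
    groups = df.groupby(fields, observed=True).size()
    return (groups >= k).all()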

check_k_anonymity(df, ['Age', 'ZipPrefix'], 5)
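
The boolean result does not show how far the data is from the target. A minimal sketch, assuming a hypothetical helper `achieved_k` and toy data rather than the notebook's dataset, reports the size of the smallest equivalence class, which is the largest k the table satisfies:

```python
import pandas as pd

def achieved_k(frame, fields):
    """Return the size of the smallest equivalence class (hypothetical helper)."""
    # observed=True skips empty combinations of categorical columns
    return int(frame.groupby(fields, observed=True).size().min())

# Toy data for illustration only
toy = pd.DataFrame({
    'Age': [25, 25, 25, 40, 40],
    'ZipPrefix': ['021', '021', '021', '940', '940'],
})
k = achieved_k(toy, ['Age', 'ZipPrefix'])
print(k)  # 2: the (40, '940') class has only two records
```

Calling `achieved_k(df, ['Age', 'ZipPrefix'])` on the notebook's data would show the actual k to compare against 5.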

## Exercise 4: Generalize Age into Buckets

Bucket Age into 18–29, 30–39, etc.

In [5]:
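bins = [18, 29, 39, 49, 59, 69, 79]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-79']
# include_lowest=True keeps Age == 18 in the first bucket; without it,
# right=True builds the interval (18, 29] and 18-year-olds become NaN
df['AgeBucket'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True, include_lowest=True)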
df.head()
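
With `right=True`, `pd.cut` builds intervals of the form (left, right], so a value equal to the lowest edge is only bucketed when `include_lowest=True` is passed. A short check on a few toy ages (not the notebook's data) makes the edge behavior visible:

```python
import pandas as pd

ages = pd.Series([18, 29, 30, 74])
bins = [18, 29, 39, 49, 59, 69, 79]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-79']

without_lowest = pd.cut(ages, bins=bins, labels=labels, right=True)
with_lowest = pd.cut(ages, bins=bins, labels=labels, right=True,
                     include_lowest=True)

print(without_lowest.isna().sum())       # 1: age 18 falls outside (18, 29]
print(with_lowest.astype(str).tolist())  # ['18-29', '18-29', '30-39', '70-79']
```

A NaN bucket matters here because `groupby` drops NaN keys by default, so those records would silently vanish from any anonymity check.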

## Exercise 5: Check 5-Anonymity with Age Buckets

Does AgeBucket + ZipPrefix achieve 5-anonymity?

In [6]:
check_k_anonymity(df, ['AgeBucket', 'ZipPrefix'], 5)
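
When the check fails, it helps to see which equivalence classes are too small. The sketch below assumes a hypothetical helper `classes_below_k` and uses toy data; it lists the classes with fewer than k records, smallest first:

```python
import pandas as pd

def classes_below_k(frame, fields, k):
    """Return the sizes of equivalence classes with fewer than k records."""
    sizes = frame.groupby(fields, observed=True).size()
    return sizes[sizes < k].sort_values()

# Toy data for illustration only
toy = pd.DataFrame({
    'AgeBucket': ['18-29', '18-29', '18-29', '30-39', '40-49', '40-49'],
    'ZipPrefix': ['021', '021', '021', '021', '940', '940'],
})
small = classes_below_k(toy, ['AgeBucket', 'ZipPrefix'], 3)
print(small)
```

The listed classes are the candidates for further generalization (wider age buckets, shorter ZIP prefixes) or suppression.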

## Exercise 6: Define l-Diversity

What is l-diversity?

In [7]:
l_diversity_definition = ""

## Exercise 7: Check 2-Diversity for TripPurpose

Does each equivalence class (AgeBucket + ZipPrefix) contain at least 2 distinct TripPurpose values?

In [8]:
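def check_l_diversity(df, group_fields, sensitive_field, l):
    """Return True if every equivalence class has at least l distinct sensitive values."""
    # observed=True skips empty combinations of categorical group fields
    diversity = df.groupby(group_fields, observed=True)[sensitive_field].nunique()
    return (diversity >= l).all()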

check_l_diversity(df, ['AgeBucket', 'ZipPrefix'], 'TripPurpose', 2)
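
Distinct l-diversity only counts values; it does not notice when one class is heavily skewed toward a single value. t-closeness, named in the title, compares each class's distribution of the sensitive attribute with the distribution over the whole table. The sketch below is one possible check, not a reference implementation: it assumes a hypothetical helper `max_t_distance`, toy data, and total variation distance as the distance measure for a categorical attribute.

```python
import pandas as pd

def max_t_distance(frame, group_fields, sensitive_field):
    """Largest total variation distance between any class and the whole table."""
    overall = frame[sensitive_field].value_counts(normalize=True)
    worst = 0.0
    for _, group in frame.groupby(group_fields, observed=True):
        local = group[sensitive_field].value_counts(normalize=True)
        # Align on the full set of sensitive values; absent values count as 0
        local = local.reindex(overall.index, fill_value=0.0)
        distance = 0.5 * (local - overall).abs().sum()
        worst = max(worst, float(distance))
    return worst

# Toy data for illustration only
toy = pd.DataFrame({
    'ZipPrefix': ['021', '021', '940', '940'],
    'TripPurpose': ['Leisure', 'Business', 'Leisure', 'Leisure'],
})
t = max_t_distance(toy, ['ZipPrefix'], 'TripPurpose')
print(t)  # 0.25
```

Under these assumptions, the table satisfies t-closeness for any threshold t at or above the returned value.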

## Exercise 8: Visualize Booking Price Distribution

Plot the histogram of Booking Prices.

In [9]:
df['BookingPrice'].hist(bins=20)
plt.xlabel('Booking Price ($)')
plt.ylabel('Frequency')
plt.title('Booking Price Distribution')
plt.show()

## Exercise 9: Linkage Attack Simulation

Use ZipPrefix + DestinationCity to attempt re-identification: measure the percentage of records that are unique on those two fields.

In [10]:
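def linkage_attack(df, known_fields):
    """Return the percentage of records that are unique on known_fields."""
    linkage = df.groupby(known_fields).size()
    # Each class of size 1 corresponds to exactly one re-identifiable record
    return (linkage == 1).sum() / len(df) * 100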

linkage_attack(df, ['ZipPrefix', 'DestinationCity'])
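
The function above only counts unique combinations. A minimal sketch of the join an attacker might run is shown below. It assumes a hypothetical auxiliary table that the attacker already holds. The names and rows are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy "released" table: quasi-identifiers plus a sensitive attribute
released = pd.DataFrame({
    'ZipPrefix': ['021', '021', '940', '100'],
    'DestinationCity': ['Paris', 'Paris', 'Tokyo', 'Rome'],
    'TripPurpose': ['Leisure', 'Business', 'Medical', 'Education'],
})

# Hypothetical auxiliary data the attacker already holds (invented names)
auxiliary = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'ZipPrefix': ['940', '021'],
    'DestinationCity': ['Tokyo', 'Paris'],
})

keys = ['ZipPrefix', 'DestinationCity']
# Keep only released rows whose quasi-identifier combination is unique
class_size = released.groupby(keys)[keys[0]].transform('size')
unique_rows = released[class_size == 1]

# An inner join on the quasi-identifiers attaches names to sensitive values
reidentified = auxiliary.merge(unique_rows, on=keys, how='inner')
print(reidentified[['Name', 'TripPurpose']])
```

In this toy case only Alice is linked to a trip purpose; Bob matches a class of two records, so his row stays ambiguous.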

## Exercise 10: Calculate Privacy Risk Score

PrivacyRiskScore was computed when the dataset was loaded; plot its distribution.

In [11]:
df['PrivacyRiskScore'].hist(bins=20)
plt.xlabel('Privacy Risk Score')
plt.ylabel('Frequency')
plt.title('Privacy Risk Score Distribution')
plt.show()
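
To make the score easier to interpret, the sketch below recomputes it on a four-row toy table: the raw risk is 1 divided by the size of the record's (Age, ZipPrefix, DestinationCity) class, followed by min-max scaling. The guard for a zero spread is an addition for illustration and is not in the notebook's load cell.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Age': [25, 25, 40, 41],
    'ZipPrefix': ['021', '021', '940', '940'],
    'DestinationCity': ['Paris', 'Paris', 'Tokyo', 'Rome'],
})
keys = ['Age', 'ZipPrefix', 'DestinationCity']

# Size of each record's equivalence class, aligned with the original rows
class_size = toy.groupby(keys)[keys[0]].transform('size')
raw = 1 / class_size
spread = raw.max() - raw.min()
# Guard against a zero spread, which would make min-max scaling divide by zero
score = (raw - raw.min()) / spread if spread > 0 else raw * 0.0
print(score.tolist())  # [0.0, 0.0, 1.0, 1.0]
```

Because of the min-max scaling, a score of 0 marks the largest class in the table, not the absence of risk, and every record in a singleton class scores 1.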

## Exercise 11: Top 10 Highest Risk Travelers

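Find the top 10 travelers by PrivacyRiskScore. Records in singleton classes all tie at the maximum score, and `nlargest` (default `keep='first'`) returns tied rows in their original order.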

In [12]:
df.nlargest(10, 'PrivacyRiskScore')