# Module 9: Privacy and Ethics — Classroom Demo Notebook

This notebook provides interactive demonstrations for each topic in Module 9. Every code cell is paired with explanatory markdown so that the content is ready for live teaching or self-paced exploration.

## 9.0 Module Overview

In this module we balance the benefits of data-driven innovation with the responsibilities that come with handling sensitive information. The demonstrations below walk through real-world privacy dilemmas, introduce the building blocks of differential privacy, and highlight professional codes of ethics.

## 9.2 Ethics Case Study: Barrow, Alaska

In the 1990s, researchers released anonymized health statistics for the small town of Barrow, Alaska. Because the community was so small, residents quickly realized that they could pinpoint which neighbors were represented in each row. The code below recreates the statistical vulnerability using a toy dataset.

In [1]:
import pandas as pd

barrow_data = pd.DataFrame({
    'community': ['Barrow'] * 5 + ['Anchorage'] * 5,
    'age_group': ['18-25', '26-35', '36-45', '46-55', '56+'] * 2,
    'condition_count': [1, 0, 2, 1, 1, 30, 26, 19, 18, 25]
})
barrow_data

Unnamed: 0,community,age_group,condition_count
0,Barrow,18-25,1
1,Barrow,26-35,0
2,Barrow,36-45,2
3,Barrow,46-55,1
4,Barrow,56+,1
5,Anchorage,18-25,30
6,Anchorage,26-35,26
7,Anchorage,36-45,19
8,Anchorage,46-55,18
9,Anchorage,56+,25


The anonymized table masks individual names, yet the unique combination of community size and medical condition counts can still reveal personal details. In Barrow, each age group might have only a handful of residents, making it trivial to guess who is represented.

In [2]:
small_group = barrow_data[barrow_data['community'] == 'Barrow'].copy()
small_group['share_of_population'] = (small_group['condition_count'] / small_group['condition_count'].sum()).round(2)
small_group

Unnamed: 0,community,age_group,condition_count,share_of_population
0,Barrow,18-25,1,0.2
1,Barrow,26-35,0,0.0
2,Barrow,36-45,2,0.4
3,Barrow,46-55,1,0.2
4,Barrow,56+,1,0.2


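Even when we reveal only aggregates, publishing them for **tiny groups** can leak who is sick. Note that the `share_of_population` column is really each age group's share of Barrow's recorded cases (its count divided by the community's total count), not a share of residents. With only five cases in total, each row accounts for a large fraction of them, leaving little room for anonymity.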
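One common response to the small-group problem described above is small-cell suppression: withhold any count that falls below a threshold. The sketch below is only an illustration. The `suppress_small_cells` helper and the threshold of 5 are assumptions made for this demo, not part of the original case study.

```python
from typing import Optional

def suppress_small_cells(counts: dict[str, int], threshold: int = 5) -> dict[str, Optional[int]]:
    """Replace counts below `threshold` with None (hypothetical helper)."""
    return {group: (n if n >= threshold else None) for group, n in counts.items()}

# Toy counts copied from the barrow_data table above.
barrow_counts = {'18-25': 1, '26-35': 0, '36-45': 2, '46-55': 1, '56+': 1}
anchorage_counts = {'18-25': 30, '26-35': 26, '36-45': 19, '46-55': 18, '56+': 25}

print(suppress_small_cells(barrow_counts))     # every Barrow cell is withheld
print(suppress_small_cells(anchorage_counts))  # Anchorage cells pass through unchanged
```

Suppression is a blunt tool: if row or column totals are published alongside the table, a withheld cell can sometimes be recovered by subtraction.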

## 9.3 Example: Netflix Prize Re-identification

The Netflix Prize dataset replaced usernames with random IDs. However, Narayanan and Shmatikov showed that public IMDb reviews could be used to re-identify users. The following toy example demonstrates how linking two anonymized datasets can recover names.

In [3]:
netflix_release = pd.DataFrame({
    'anon_id': ['A1', 'A2', 'A3', 'A4'],
    'movie': ['Movie X', 'Movie Y', 'Movie Z', 'Movie X'],
    'rating': [5, 3, 4, 2],
    'date': ['2005-01-12', '2005-01-13', '2005-01-14', '2005-02-10']
})

public_reviews = pd.DataFrame({
    'name': ['Riley', 'Avery', 'Jordan'],
    'movie': ['Movie X', 'Movie Z', 'Movie Y'],
    'rating': [5, 4, 3],
    'date': ['2005-01-12', '2005-01-14', '2005-01-13']
})

reidentified = pd.merge(netflix_release, public_reviews, on=['movie', 'rating', 'date'])
reidentified

Unnamed: 0,anon_id,movie,rating,date,name
0,A1,Movie X,5,2005-01-12,Riley
1,A2,Movie Y,3,2005-01-13,Jordan
2,A3,Movie Z,4,2005-01-14,Avery


By joining on movie, rating, and date we re-identify three supposedly anonymous Netflix users. Real attackers can combine many more signals (timestamps, rare ratings, or location data) to de-anonymize users with high confidence.
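A quick way to gauge linkage risk before a release is to count how many rows carry a combination of attributes that no other row shares. The sketch below applies that check to the toy release table. The `unique_fraction` helper is a hypothetical name introduced here, and the rows are copied by hand so the block runs without pandas.

```python
from collections import Counter

# Rows copied from the netflix_release table above: (anon_id, movie, rating, date).
release_rows = [
    ('A1', 'Movie X', 5, '2005-01-12'),
    ('A2', 'Movie Y', 3, '2005-01-13'),
    ('A3', 'Movie Z', 4, '2005-01-14'),
    ('A4', 'Movie X', 2, '2005-02-10'),
]

def unique_fraction(rows):
    """Fraction of rows whose (movie, rating, date) combination appears exactly once."""
    combos = Counter(row[1:] for row in rows)
    unique = sum(1 for row in rows if combos[row[1:]] == 1)
    return unique / len(rows)

print(f'Share of rows with a unique combination: {unique_fraction(release_rows):.0%}')
```

A unique row is not re-identified by itself; it becomes linkable only when an outside source, such as a public review, contains the same combination. That is why `A4` is unique here yet does not appear in the merged result.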

## 9.4 Can We Compute Statistics Without Violating Privacy?

One mitigation strategy is to only answer questions about sufficiently large groups and to bound each contribution. The helper below enforces both rules before computing statistics.

In [4]:
from typing import Iterable

def safe_mean(values: Iterable[int], min_group_size: int = 5, bound: int = 10):
    values = list(values)
    if len(values) < min_group_size:
        raise ValueError('Group is too small to release statistics safely.')
    clipped = [max(-bound, min(bound, v)) for v in values]
    return sum(clipped) / len(clipped)

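# Example: both communities have five rows, so the default threshold (5) passes.
# For Barrow we raise min_group_size to 6 to trigger the refusal.
anchorage_values = barrow_data.loc[barrow_data['community'] == 'Anchorage', 'condition_count']
barrow_values = barrow_data.loc[barrow_data['community'] == 'Barrow', 'condition_count']

# Every Anchorage count exceeds bound=10, so each is clipped to 10 and the mean is 10.0.
print('Anchorage safe mean:', safe_mean(anchorage_values))
try:
    safe_mean(barrow_values, min_group_size=6)
except ValueError as exc:
    print('Barrow warning:', exc)

Anchorage safe mean: 10.0
Barrow warning: Group is too small to release statistics safely.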


Clipping individual contributions and refusing to answer queries on small groups both reduce disclosure risk. However, once the safeguards pass, analysts still learn the exact mean of the clipped values, which motivates stronger tools such as differential privacy.
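The weakness of exact answers can be made concrete with a differencing attack: ask for the mean of a group with and without one person, then subtract. The sketch below repeats `safe_mean` so it runs on its own. The per-person values in `everyone` are made up for illustration and are not drawn from the Barrow table.

```python
from typing import Iterable

def safe_mean(values: Iterable[int], min_group_size: int = 5, bound: int = 10) -> float:
    """Copy of the helper above so this sketch runs on its own."""
    values = list(values)
    if len(values) < min_group_size:
        raise ValueError('Group is too small to release statistics safely.')
    clipped = [max(-bound, min(bound, v)) for v in values]
    return sum(clipped) / len(clipped)

# Hypothetical per-person values; the last entry belongs to the attacker's target.
everyone = [3, 7, 2, 9, 4, 6]
everyone_but_target = everyone[:-1]

# Both groups pass the size check, so both means are released exactly.
total_with = safe_mean(everyone) * len(everyone)
total_without = safe_mean(everyone_but_target) * len(everyone_but_target)

# Subtracting the two reconstructed totals isolates the target's value.
leaked = round(total_with - total_without, 6)
print('Recovered target value:', leaked)
```

Neither query looks dangerous in isolation, which is why a group-size threshold cannot stop this attack.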

## 9.5–9.7 Introduction to Differential Privacy

Differential privacy (DP) adds carefully calibrated noise to query results. The Laplace mechanism releases a noisy statistic whose expected value equals the true answer, while controlling the privacy loss by a parameter $\epsilon$.
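For reference, the standard definitions behind this paragraph can be written out as follows. The symbols $M$ (the randomized mechanism), $D$ and $D'$ (two datasets differing in one person's record), $S$ (any set of outputs), $f$ (the query), and $b$ (the noise scale) are notation introduced here rather than taken from the notebook.

$$
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S]
$$

$$
M(D) = f(D) + \mathrm{Lap}(b), \qquad b = \frac{\Delta f}{\epsilon}, \qquad \Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert
$$

$$
p(x) = \frac{1}{2b} \exp\!\left(-\frac{\lvert x \rvert}{b}\right), \qquad \mathbb{E}[x] = 0, \qquad \mathrm{Var}[x] = 2b^{2}
$$

For a counting query, adding or removing one person changes the result by at most 1, so $\Delta f = 1$ and the noise scale is simply $b = 1/\epsilon$.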

In [5]:
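import math
import random

def laplace_mechanism(value: float, sensitivity: float, epsilon: float) -> float:
    if epsilon <= 0:
        raise ValueError('Epsilon must be positive.')
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    # Inverse-CDF sampling: draw u uniformly from (-0.5, 0.5).
    u = random.random() - 0.5
    while u == -0.5:  # random.random() can return 0.0, which would give log(0)
        u = random.random() - 0.5
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return value + noise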

def private_count(data: Iterable[int], epsilon: float) -> float:
    true_count = sum(1 for x in data if x > 0)
    return laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon)

positive_cases = (barrow_data['condition_count'] > 0).astype(int)

for eps in [0.5, 1.0, 2.0]:
    noisy = private_count(positive_cases, epsilon=eps)
    print(f'Epsilon={eps:.1f} -> noisy count: {noisy:.2f}')

Epsilon=0.5 -> noisy count: 7.17
Epsilon=1.0 -> noisy count: 7.65
Epsilon=2.0 -> noisy count: 8.43


Smaller values of $\epsilon$ add more noise (better privacy, less accuracy). Larger $\epsilon$ values add less noise, leaking more information. The Laplace mechanism protects any single resident's contribution even if an attacker knows the rest of the dataset.
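To put numbers on the trade-off, the sketch below estimates the typical size of the noise for each $\epsilon$. It is a standalone illustration, not the notebook's mechanism. It samples Laplace noise as the difference of two exponential draws, and the `laplace_noise` helper, the fixed seed, and the sample size are all assumptions made for the demo.

```python
import random
import statistics

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

rng = random.Random(0)  # seeded only so the demo is repeatable
sensitivity = 1.0       # a counting query changes by at most 1 per person

mean_abs = {}
for eps in [0.5, 1.0, 2.0]:
    scale = sensitivity / eps
    samples = [abs(laplace_noise(scale, rng)) for _ in range(20_000)]
    mean_abs[eps] = statistics.fmean(samples)
    print(f'epsilon={eps:.1f}  scale={scale:.2f}  mean |noise|={mean_abs[eps]:.2f}')
```

The average absolute noise should land close to the scale `sensitivity / epsilon`, so halving $\epsilon$ roughly doubles the typical error on a count.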

## 9.8 Problems with Differential Privacy

Differential privacy has trade-offs: noisy answers can be inaccurate, and repeatedly querying the same dataset consumes the privacy budget. The simulation below shows how repeated queries narrow down the true count.

In [6]:
def repeated_queries(data: Iterable[int], epsilon: float, runs: int = 200):
    return [private_count(data, epsilon=epsilon) for _ in range(runs)]

releases = repeated_queries(positive_cases, epsilon=1.0, runs=200)
estimated = sum(releases) / len(releases)
print(f'Average of {len(releases)} noisy releases: {estimated:.2f}')
print(f'True count (not released to analysts): {positive_cases.sum()}')

Average of 200 noisy releases: 9.01
True count (not released to analysts): 9


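Even though each individual release is noisy, the noise has mean zero, so averaging many releases converges on the true answer. Under basic sequential composition, 200 releases at $\epsilon = 1.0$ cost a total of $\epsilon = 200$, which offers effectively no protection. Production systems therefore track a **privacy budget** and limit the number of queries that can be answered.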
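A budget tracker can be sketched in a few lines. The version below assumes basic sequential composition, where the $\epsilon$ values of successive queries simply add up. The `PrivacyBudget` class and the total budget of 3.0 are hypothetical choices for illustration, not an API from any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition (hypothetical helper)."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Record a query's cost, or refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError('Epsilon must be positive.')
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError('Privacy budget exhausted; query refused.')
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=3.0)
answered = 0
for _ in range(200):      # the analyst attempts 200 queries at epsilon = 1.0
    try:
        budget.charge(1.0)
        answered += 1     # a real system would release one noisy answer here
    except RuntimeError:
        break

print(f'Queries answered: {answered}, epsilon spent: {budget.spent}')
```

With the budget enforced, the averaging attack from the simulation above is cut off after a handful of releases instead of 200.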

## 9.9 Professional Codes of Ethics

Ethical decision-making goes beyond math. Professional and international organizations publish codes of ethics and declarations to guide practitioners. The table summarizes recurring themes across three well-known documents.

In [7]:
codes_of_ethics = pd.DataFrame({
    'Organization': ['ACM', 'IEEE', 'United Nations'],
    'Focus': ['Respect privacy, avoid harm, be transparent',
              'Prioritize public welfare, disclose conflicts',
              'Uphold human rights, fairness, accountability'],
    'Link': ['https://www.acm.org/code-of-ethics',
             'https://www.ieee.org/about/corporate/governance/p7-8.html',
             'https://www.un.org/en/about-us/universal-declaration-of-human-rights']
})
codes_of_ethics

Unnamed: 0,Organization,Focus,Link
0,ACM,"Respect privacy, avoid harm, be transparent",https://www.acm.org/code-of-ethics
1,IEEE,"Prioritize public welfare, disclose conflicts",https://www.ieee.org/about/corporate/governanc...
2,United Nations,"Uphold human rights, fairness, accountability",https://www.un.org/en/about-us/universal-decla...


Keeping ethical guidelines handy during project planning and review meetings encourages teams to think about consent, fairness, and long-term impacts before deploying data products.

## 9.10 Discussion: Data Ethics Debate

Use the helper function to pull a prompt for in-class or online discussions. Each scenario invites students to weigh benefits, risks, and mitigation strategies.

In [8]:
import itertools

debate_prompts = [
    'Should cities release detailed mobility data to improve traffic planning?',
    'Can hospitals share de-identified records with startups without patient consent?',
    'When, if ever, is it ethical to override user privacy for public safety?',
    'How should product teams respond when an algorithm amplifies stereotypes?'
]

def next_prompt(prompts=debate_prompts):
    for prompt in itertools.cycle(prompts):
        yield prompt

prompt_generator = next_prompt()
print('Discussion prompt:', next(prompt_generator))

Discussion prompt: Should cities release detailed mobility data to improve traffic planning?


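Re-running the cell re-creates the generator, so it always prints the first prompt. To cycle through the prompts, call `next(prompt_generator)` again in a separate cell. Encourage students to connect each debate back to the privacy techniques and ethical frameworks covered above.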