[__Join our Slack__](https://join.slack.com/t/opendp/shared_invite/zt-zw7o1k2s-dHg8NQE8WTfAGFnN_cwomA) if you have any questions or comments!

# OpenDP Library Exercise

You can work [here in Colab](https://shorturl.at/jlwV3), or download this notebook and edit it
locally.
Use `Shift+Enter` to execute a cell. 

In [None]:
%pip install opendp seaborn

In [None]:
import opendp.prelude as dp
TODO = "TODO" # placeholder variable to mark things to do

Constructors that have not yet completed the proof-writing and vetting process can still be accessed if you opt in to "contrib".

[[Documentation]](https://github.com/opendp/opendp/discussions/304)

In [None]:
TODO # enable "contrib"

# (this assertion should pass)
assert "contrib" in dp.GLOBAL_FEATURES

## Exercise: Sydney Pedestrian Traffic Data

The Sydney Pedestrian Traffic data set logs the first sightings of an individual 
on Park Street, Bridge Street and Market Street, on February 3, 2024.
The data is based on true counts of hourly per-street pedestrian traffic,
but has additional synthetically-generated attributes: name, age and gender.

In [None]:
import urllib.request
data_url = "https://raw.githubusercontent.com/opendp/opendp/sydney/Sydney_Synth_Pedestrian_Counts_2024_2_3.csv"
with urllib.request.urlopen(data_url) as data_req:
    data = data_req.read().decode('utf-8')

### 1. Privacy Unit

First, determine the privacy unit. Remember that:
* the data only spans one day
* only the first sighting of an individual is logged, per-street
* there are only three streets

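One way to reason about these three facts is to check, on toy records, how many rows a single person can occupy. The sketch below is only illustrative: the `sightings` records and the `max_rows_per_person` helper are hypothetical and are not part of the exercise data or the OpenDP API.

```python
from collections import Counter

# Hypothetical toy log: one row per (person, street) first sighting.
sightings = [
    ("Ada", "Park Street"),
    ("Ada", "Bridge Street"),
    ("Ada", "Market Street"),
    ("Ben", "Park Street"),
    ("Cy", "Market Street"),
    ("Cy", "Bridge Street"),
]


def max_rows_per_person(rows):
    """Return the largest number of rows contributed by any one person."""
    return max(Counter(person for person, _ in rows).values())


print(max_rows_per_person(sightings))
```

Because a person is logged at most once per street, the number of streets caps how many rows any one person can contribute in a day.
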
In [None]:
privacy_unit = TODO

### 2. Privacy Loss

Create a privacy loss budget of $\epsilon = 1.0$.

In [None]:
privacy_loss = TODO

### 3. Public Information

The column names and street names below are taken from the data description.

In [None]:
col_names = ["Street", "Hour", "First Name", "Last Name", "Age", "Gender"]
street_names = ["Park Street", "Bridge Street", "Market Street"]

### 4. Mediate Access
Initialize the context for our analysis: data, privacy unit and privacy loss. Split the budget evenly over 10 queries.

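To get a feel for what an even split means, the arithmetic can be sketched in plain Python. This is a rough illustration, not the library's accounting: `per_query_epsilon` and `laplace_scale` are hypothetical helpers, and the sensitivity of 3 assumes that one person contributes at most one row per street.

```python
def per_query_epsilon(total_epsilon, n_queries):
    """Split a pure-DP budget evenly under sequential composition."""
    return total_epsilon / n_queries


def laplace_scale(sensitivity, epsilon):
    """Noise scale at which a Laplace release satisfies epsilon-DP."""
    return sensitivity / epsilon


eps_each = per_query_epsilon(1.0, 10)
# Assumption: one person changes a count by at most 3 (one row per street).
scale = laplace_scale(sensitivity=3, epsilon=eps_each)
print(eps_each, scale)
```

The more queries share the budget, the smaller each query's epsilon and the larger the noise scale of each release.
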
In [None]:
context = TODO

### 5.a Per-Street Counts

Compute a DP count of the number of sightings on Bridge Street:

In [None]:
bridge_count_query = (
    context.query()
    .split_dataframe(",", col_names=col_names)
    .df_is_equal("Street", "Bridge Street")
    .subset_by("Street", keep_columns=["First Name"])
    .select_column("First Name", str)
    # TODO: count and noise
)
bridge_count_query.release()

Release DP counts of the other two streets.

In [None]:
TODO

The following release counts sightings on all three streets using only one query, 
by taking advantage of parallel composition:

In [None]:
count_query = (
    context.query()
    .split_dataframe(",", col_names=col_names)
    .select_column("Street", str)
    .count_by_categories(street_names, null_category=False)
    .laplace()
)
street_counts = count_query.release()
street_counts

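The parallel-composition argument behind the query above can be sketched with the standard library. The streets partition the rows, so removing one person's rows (at most three, by assumption) changes the whole vector of counts by at most 3 in L1 distance, the same sensitivity as a single count. The helpers and toy data below are hypothetical, and the floating-point sampler is for intuition only; it is not a safe substitute for the library's noise.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_category_counts(values, categories, scale, rng):
    """Count each category once, then add independent Laplace noise."""
    counts = Counter(values)
    return [counts[c] + laplace_noise(scale, rng) for c in categories]


streets = ["Park Street", "Bridge Street", "Market Street"]
toy_rows = ["Park Street"] * 5 + ["Bridge Street"] * 2 + ["Market Street"] * 4

# One noise scale covers all three counts, because each row lands in
# exactly one category.
rng = random.Random(0)
print(noisy_category_counts(toy_rows, streets, scale=30.0, rng=rng))
```
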
### 5.b Distribution of Pedestrians Per-Hour

Visualize how busy downtown Sydney is over the day.
Might collecting data by first sighting skew the data?

In [None]:
hours = list(range(24))
hourly_query = TODO
hourly_sightings = hourly_query.release()

In [None]:
import matplotlib.pyplot as plt


def fmt(hour):
    # 12-hour clock: hours 0 and 12 are shown as 12, not 0
    return f"{int(hour) % 12 or 12} {'P' if int(hour) > 11 else 'A'}M"


plt.bar(list(map(fmt, hours)), hourly_sightings)
plt.xticks(range(1, 25, 4))

plt.xlabel("time")
plt.ylabel("first sightings");

### 5.c Park Street Ages Analysis

In this exercise, you'll compute the mean and variance of ages on Park Street.

You are given part of a query that preprocesses the data into a vector of ages.
Adjust the query so that the data is also bounded and has a known size.

In [None]:
q_park_street_ages = (
    context.query()
    .split_dataframe(",", col_names=col_names)
    .df_is_equal("Street", "Park Street")
    .subset_by("Street", keep_columns=["Age"])
    .select_column("Age", str)
    .cast_default(float)
    # TODO: clamp and resize
)

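As a rough picture of what clamping and resizing do to the data, here is a plain-Python sketch. The `clamp` and `resize` helpers, the bounds, the target size and the fill constant are all hypothetical choices for illustration; the library's transformations may differ in detail (for example, in how rows are dropped).

```python
def clamp(values, lower, upper):
    """Force every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def resize(values, size, constant):
    """Truncate, or pad with a constant, so the list has exactly `size` entries."""
    return (values + [constant] * size)[:size]


ages = [34.0, 151.0, -2.0, 58.0]  # hypothetical raw ages
bounded = clamp(ages, 0.0, 100.0)
sized = resize(bounded, size=6, constant=50.0)
mean = sum(sized) / len(sized)
print(sized, mean)
```

Once every value lies in `[lower, upper]` and the length is fixed at `size`, changing one row moves the mean by at most `(upper - lower) / size`, which is what lets the noise be calibrated.
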
Now release queries for the mean and variance.

In [None]:
q_park_mean_age = TODO
q_park_mean_age.release()

In [None]:
q_park_var_age = TODO
q_park_var_age.release()

### 5.d Upper Quartile of Ages on Market Street

_This example is already worked._

Estimate the 75th percentile of the age of individuals on Market Street.

In [None]:
quantile_query = (
    context.query()
    .split_dataframe(",", col_names=col_names)
    .df_is_equal("Street", "Market Street")
    .subset_by("Street", keep_columns=["Age"])
    .select_column("Age", str)
    .cast_default(int)
    .quantile_score_candidates(list(range(100)), alpha=.75)
    .report_noisy_max_gumbel(optimize="min")
)
quantile_query.release()

### 5.e Sparse Histogram of Names

_This example is already worked._

When dealing with data that has an unknown domain, the keys themselves are private information.
One approach used to privately release histograms on unknown key sets is to only release keys with large counts:
that is, only release keys that many people have contributed to, and thus remain stable across all neighboring datasets.
Another approach, used below, is to privatize a low-dimensional projection.

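For contrast with the projection-based mechanism used in this section, the first approach (releasing only keys with large counts) can be sketched as follows. This is a minimal illustration with hypothetical helpers and arbitrary, uncalibrated parameters: a real thresholded release satisfies approximate DP only when the threshold is chosen from the noise scale and a target $\delta$, and it needs a secure noise sampler.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def thresholded_histogram(keys, scale, threshold, rng):
    """Add noise to each observed count, then keep only keys whose noisy count clears the threshold."""
    noisy = {k: c + laplace_noise(scale, rng) for k, c in Counter(keys).items()}
    return {k: v for k, v in noisy.items() if v >= threshold}


# Hypothetical names: two common keys and one rare key.
names = ["Michael"] * 40 + ["James"] * 35 + ["Lancelot"]
released = thresholded_histogram(names, scale=2.0, threshold=20.0, rng=random.Random(0))
print(sorted(released))
```

Rare keys are very unlikely to clear the threshold, so the set of released keys reveals little about any one person's presence.
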
In [None]:
name_count_query = (
    context.query()
    .split_dataframe(",", col_names=col_names)
    .df_is_equal("Street", "Park Street")
    .subset_by("Street", keep_columns=["First Name"])
    .select_column("First Name", str)
    .count_by(MO=dp.L1Distance[int])
    .alp_queryable(scale=.025, total_limit=sum(street_counts), value_limit=400)
)
qbl = name_count_query.release()

This mechanism releases a queryable containing a differentially private, hash-based representation of the counts of all possible names.

As you would expect, the most common names have a relatively high number of observations.
Names that you would expect to be less common have lower counts.

In [None]:
qbl("Michael"), qbl("James"), qbl("Sharon"), qbl("Lancelot")

Lancelot would likely not be observed in the data, so you would expect the true count to be near zero.