## Part 2: Quantiles

## Introduction

In this tutorial, we will explore the quantile methods supported by Polars and OpenDP. We will first use the quantile method directly, and then break it down into its primary steps.

This tutorial will use the Labor Force Survey data (see the Pre-Processing Notebook for more information) and some of the concepts introduced in Part 1 - Data Exploration. 

### Why Not Add Noise Directly to the Quantile?
Quantiles are robust statistics because they depend on the ranks of the data rather than on the magnitudes of individual values. Their sensitivity, however, can be as large as the full data range (max - min). Consider a dataset where half of the points sit at the lower bound and half sit at the upper bound: changing a single point can move the median from one bound to the other. Noise calibrated to that sensitivity would overwhelm the answer, so constructing a differentially private quantile requires more sophistication.
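To make the sensitivity argument concrete, here is a minimal sketch using only the Python standard library. The two toy datasets (`ages_a`, `ages_b`) are made up for illustration and are not part of the Labor Force Survey data; they differ in a single record.

```python
import statistics

# Two neighboring datasets: they differ in exactly one record.
ages_a = [0] * 50 + [100] * 51   # 101 records, one more at the upper bound
ages_b = [0] * 51 + [100] * 50   # one record changed from 100 to 0

median_a = statistics.median(ages_a)
median_b = statistics.median(ages_b)

# Changing one record moved the median across the whole data range.
print(median_a, median_b, abs(median_a - median_b))
```

Noise large enough to hide a jump of this size would make the released median close to useless, which is why the quantile method below scores a set of candidates instead.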

In [1]:
pip install "opendp[polars]"

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import polars as pl
import opendp.prelude as dp
dp.enable_features("contrib")
import matplotlib.pyplot as plt
import seaborn as sns

If the "sample_FR_LFS.csv" file exists, run the import cell as is. Otherwise, follow the instructions in Preprocessing.ipynb to compile the dataset.

In [3]:
# reading in the data
df = pl.scan_csv("FR_LFS_2008Q1.csv", infer_schema_length=1000, ignore_errors=True)

For more information on defining the compositor, see the Part 1 Notebook on Data Exploration methods!

In [4]:
context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=1),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=10,
    margins={
        ("AGE", ): dp.Margin(public_info="keys", max_partition_length=60_000_000),
        (): dp.Margin(public_info="keys", max_partition_length=60_000_000),
    },
)

## Using the Quantile Method Directly

### Compute the Median Age

To compute the median using the quantile method directly, follow these steps:

1. Select the variable you are interested in. In this case, we chose "AGE".
2. Fill in null values. We'll impute them with the mean age.
3. Specify the following parameters in the quantile method:
    a. Quantile: a value between 0 and 1. Since we're interested in the median, we input 0.5.
    b. Candidates: the possible values for your quantile. These depend on your specific domain. Since ages generally fall between 0 and 100, we use the integers from 0 to 99 as our candidates, so the result will always be one of these values.

In [5]:
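# fill_null expects a scalar, so extract the mean from the collected frame
# and round it to an integer to match the integer-typed AGE column.
# Note: this mean is computed directly on the data, outside the privacy budget.
mean_age = round(df.select(pl.col("AGE").mean()).collect().item())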

In [6]:
candidates = list(range(100))
quantile = 0.5

dp_median = context.query().select(
    pl.col("AGE").fill_null(mean_age).dp.quantile(quantile, candidates)
).release().collect()
dp_median

AGE
i64
43


### Multiple Quantiles for the Same Variable

To compute multiple quantiles for the same variable, we can use a list comprehension to specify several quantiles in a single query.

In [7]:
multiple_quantiles = context.query().select(
    [
        pl.col("AGE").fill_null(40).dp.quantile(q, list(range(120))).alias(f"Quantile_{q}")
        for q in [0.25, 0.5, 0.75]
    ]
).release().collect()
multiple_quantiles

Quantile_0.25,Quantile_0.5,Quantile_0.75
i64,i64,i64
20,34,56


Notice that we also specified a different alias for each column. Polars names a derived column after its source column, so without aliases all three columns would be called "AGE". The aliases keep them distinct.

## Breaking Down the Quantile Method

Now that you know how to use the quantile method, let's break it down into its three steps.

### 1. Compute the Discrete Quantile Score for Each Candidate

The `_discrete_quantile_score` function takes the same parameters as the quantile function. It computes a utility score for each candidate that represents how close that candidate is to the true quantile. Scores closer to 0 indicate more accurate candidates.

In [8]:
discrete_scores = pl.col("AGE").fill_null(mean_age).dp._discrete_quantile_score(quantile, candidates)
discrete_scores
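As a rough illustration of what such a score can look like, the sketch below counts how many values fall below and above each candidate and measures how far that split is from the requested quantile. The function `quantile_scores` and the toy data are hypothetical, and the scoring rule is a simplified assumption; the library's exact computation may differ.

```python
def quantile_scores(data, alpha, candidates):
    """Score each candidate; 0 means the candidate splits the data exactly at alpha."""
    scores = []
    for c in candidates:
        below = sum(1 for x in data if x < c)  # values strictly below the candidate
        above = sum(1 for x in data if x > c)  # values strictly above the candidate
        scores.append(abs((1 - alpha) * below - alpha * above))
    return scores

toy_ages = [1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_candidates = [0, 5, 10]
toy_scores = quantile_scores(toy_ages, 0.5, toy_candidates)
print(toy_scores)
```

For the median, the candidate 5 has four values on each side and therefore scores 0, while the candidates 0 and 10 have all nine values on one side and score much worse.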

### 2. Add Noise and Return Index of Candidate with Lowest Score

We now pass the scores to the `_report_noisy_max_gumbel` function. This adds Gumbel noise to each score and returns the index of the candidate with the lowest noisy score.

In [9]:
noisy_index = discrete_scores.dp._report_noisy_max_gumbel("min")
noisy_index
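The selection step can be sketched in plain Python as follows. This is an illustrative assumption, not OpenDP's implementation: `sample_gumbel`, `noisy_argmin`, and the `scale` parameter are hypothetical names, and naive floating-point sampling like this is not safe for real releases. Because lower scores are better, the sketch negates the scores before taking the arg max.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one standard Gumbel sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def noisy_argmin(scores, scale, rng):
    """Return the index of the lowest score after adding Gumbel noise."""
    # Negate so that low scores become high utilities; a larger scale means more noise.
    noisy = [-s / scale + sample_gumbel(rng) for s in scores]
    return max(range(len(noisy)), key=noisy.__getitem__)

rng = random.Random(0)
almost_exact = noisy_argmin([4.5, 0.0, 4.5], scale=1e-9, rng=rng)
noisy_pick = noisy_argmin([4.5, 0.0, 4.5], scale=10.0, rng=rng)
print(almost_exact, noisy_pick)
```

With a tiny scale the noise is negligible and the best candidate always wins; with a large scale any index can be returned, which is the price paid for privacy.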

### 3. Return the Corresponding Quantile Value

We pass the index obtained in the previous step to the `_index_candidates` function, which maps the index to its corresponding candidate value. This differentially private quantile estimate is our final result!

In [10]:
final_result = noisy_index.dp._index_candidates(candidates)
final_result

In [11]:
context.query().select(final_result).release().collect()

AGE
i64
41
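To see how the three steps fit together, here is a toy end-to-end sketch in plain Python. The function `toy_dp_quantile`, its `scale` parameter, and the toy data are all hypothetical; the scoring rule is a simplified assumption, and the code is meant to illustrate the flow rather than reproduce the library or provide a privacy guarantee.

```python
import math
import random

def toy_dp_quantile(data, alpha, candidates, scale, rng):
    """Toy version of the three steps; not a privacy-safe implementation."""
    # Step 1: score each candidate (0 means it splits the data exactly at alpha).
    scores = [
        abs((1 - alpha) * sum(x < c for x in data) - alpha * sum(x > c for x in data))
        for c in candidates
    ]

    # Step 2: negate and rescale the scores, add Gumbel noise, take the arg max.
    def gumbel():
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))

    noisy = [-s / scale + gumbel() for s in scores]
    index = max(range(len(noisy)), key=noisy.__getitem__)

    # Step 3: map the selected index back to its candidate value.
    return candidates[index]

toy_ages = list(range(1, 100))        # 1..99, so the true median is 50
toy_candidates = list(range(0, 101))  # 0..100
rng = random.Random(1)

nearly_exact = toy_dp_quantile(toy_ages, 0.5, toy_candidates, 1e-9, rng)
noisy_runs = [toy_dp_quantile(toy_ages, 0.5, toy_candidates, 5.0, rng) for _ in range(5)]
print(nearly_exact, noisy_runs)
```

Repeated runs with a larger scale return different candidates, which mirrors why the two median releases in this notebook (43 and 41) do not agree exactly.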


## Conclusion

In this notebook, we covered how to use the OpenDP Context API with Polars to compute differentially private quantiles and explained the methodology.

If you have any ideas on how to improve this notebook or specific content you'd like to see in future notebooks, let us know here!