# Computing Fundamental Statistics

Welcome to the introduction to computing fundamental statistics using the OpenDP library. In this section, you will learn to compute essential statistical measures such as: 
- Sum 
- Mean 
- Median
- Quantiles

For each method, we will compare the actual values to the computed differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) also provides more information about the methods. We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France. 

## Set Up

In [None]:
%pip install numpy matplotlib seaborn 
%pip install "opendp[polars]"

In [3]:
import polars as pl 
import opendp.prelude as dp
import seaborn as sns 

dp.enable_features("contrib")
sns.set_theme(style='darkgrid')

In [4]:
df = pl.scan_csv("sample_FR_LFS.csv")

The compositor here is nearly identical to the one explained in the introduction. One additional parameter is `max_num_partitions`, which is required when the metric isn't sensitive to ordering. In the following examples, each `select` operates on a single partition, so `max_num_partitions` is set to 1.

In [74]:
# Drop rows where HWUSUAL is 99, the code this dataset uses in place of a null value.
df = df.filter(pl.col("HWUSUAL") != 99)

estimated_max_partition_len = 60_000_000

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    margins={
        ("YEAR", ): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=4),
        ("QUARTER", ): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=13),
        ("YEAR", "QUARTER",): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=1),
        # The mean query needs the partition length to be public information; the sum query does not.
        (): dp.Margin(public_info="lengths", max_partition_length=estimated_max_partition_len, max_num_partitions=1),
    },
)
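
To make the budget settings above concrete, here is a minimal arithmetic sketch of what `privacy_loss=dp.loss_of(epsilon=1.0)` combined with `split_evenly_over=5` implies: each of the five releases gets an equal share of the total epsilon. The query labels are taken from the sections of this notebook. The bookkeeping loop is only an illustration, not how OpenDP tracks the budget internally.

```python
# Illustrative arithmetic only; OpenDP does this accounting for you.
total_epsilon = 1.0   # from dp.loss_of(epsilon=1.0)
num_queries = 5       # from split_evenly_over=5

per_query_epsilon = total_epsilon / num_queries

# Labels follow the sections of this notebook.
queries = ["sum", "mean", "median", "quantile", "multiple quantiles"]

remaining = total_epsilon
for name in queries:
    remaining -= per_query_epsilon
    # max() guards against tiny negative values from floating-point rounding
    print(f"{name:>20}: spends {per_query_epsilon:.2f}, remaining {max(remaining, 0.0):.2f}")
```

A smaller per-query epsilon means more noise in each release, so splitting the budget over more queries trades accuracy per query for the ability to ask more questions.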

Computing both the actual and the differentially private statistics also illustrates how similar regular Polars queries and differentially private queries are.

Some key differences are: 

1. Specifying the Data

In a regular query, use `df` directly. In a differentially private query, use `context.query()`. 

2. Applying the Method

In a regular query, you can usually apply the function directly to an expression, such as `.sum()`. In a differentially private query, you call the method through the `.dp` namespace and pass the required parameters. For example, to compute the differentially private sum, pass the bounds as a tuple, as in `.dp.sum((1, 10))`. The bounds describe the input data: the known minimum and maximum values.

3. Collecting the Results 

In a regular query, use `collect()`. In a differentially private query, use `release().collect()`.

## Sum

To demonstrate the `sum` method, let's calculate the total number of hours worked for all years in the entire dataset. 

This mirrors [aggregate hours](https://www.investopedia.com/terms/a/aggregate_hours.asp), a statistic gathered by the U.S. Department of Labor that represents the total hours worked by all people during the course of a year. In a later example, we will be able to filter our data to also compute this statistic. 

In [64]:
total_hours_actual = df.select(pl.col("HWUSUAL").sum()).collect().item()
print('Actual Total Hours: ', total_hours_actual)

Actual Total Hours:  2962104.0


The query for the sum is essentially the same. We add one extra step, `fill_null`, because the `sum` method requires a non-nullable input. 

Do not derive the imputation value or the bounds from the data itself, since doing so weakens the privacy guarantee. These values should be based on domain knowledge. 

In [65]:
total_hours_dp = context.query().select(pl.col("HWUSUAL").fill_null(0.).dp.sum((1,80))).release().collect().item()
print('Differentially Private Total Hours: ', total_hours_dp)

Differentially Private Total Hours:  2961539.9897759007
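
To build intuition for why the bounds matter, the following is a simplified sketch of a bounded, noisy sum: clamp each value to the bounds, add the values up, then add Laplace noise scaled to how much one individual could change the total. Everything here is an assumption for illustration. `toy_dp_sum`, `clamp`, and `laplace_noise` are hypothetical helpers, the data is made up, and the sampler is not suitable for real privacy use. OpenDP's actual accounting and noise sampling differ in their details.

```python
import random

def clamp(x, lower, upper):
    """Force x into [lower, upper]."""
    return max(lower, min(upper, x))

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def toy_dp_sum(values, bounds, max_contributions, epsilon, rng):
    """Hypothetical helper: clamp, sum, then add Laplace noise."""
    lower, upper = bounds
    clamped = [clamp(v, lower, upper) for v in values]
    # One individual adds or removes at most `max_contributions` rows,
    # each of magnitude at most max(|lower|, |upper|).
    sensitivity = max_contributions * max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
hours = [35.0, 40.0, 120.0, 0.0, 38.5]  # made-up values, not from the survey
clamped_sum = sum(clamp(h, 1, 80) for h in hours)
noisy_sum = toy_dp_sum(hours, bounds=(1, 80), max_contributions=1, epsilon=1.0, rng=rng)
print("Clamped sum:", clamped_sum)
print("Noisy sum:  ", noisy_sum)
```

Wider bounds clamp fewer values but increase the noise scale, which is why the bounds should be as tight as domain knowledge allows.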


## Mean

To demonstrate the `mean` method, let's calculate the mean number of hours worked for all years in the entire dataset. The `mean` method also requires all null values to be filled. The bounds parameter is the same as the bounds parameter used in `sum`. 

In [66]:
mean_hours_actual = df.select(pl.col("HWUSUAL").mean()).collect().item()
print('Actual Hours Mean: ', mean_hours_actual)

Actual Hours Mean:  37.63409056258576


In [67]:
mean_hours_dp = context.query().select(pl.col("HWUSUAL").fill_null(40.).dp.mean((1,80))).release().collect().item()
print('Differentially Private Mean for Hours: ', mean_hours_dp)

Differentially Private Mean for Hours:  37.42618019779592
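
One simple way to quantify the utility of the sum and mean releases is the relative error between the actual and differentially private values. The sketch below uses the numbers printed in the cells above. `relative_error` is a hypothetical helper introduced for illustration, and a fresh release would give different private values because the noise is random.

```python
def relative_error(actual, private):
    """Absolute difference as a fraction of the actual value."""
    return abs(private - actual) / abs(actual)

# Values copied from the printed outputs above.
results = {
    "sum": (2962104.0, 2961539.9897759007),
    "mean": (37.63409056258576, 37.42618019779592),
}

for name, (actual, private) in results.items():
    print(f"{name}: relative error = {relative_error(actual, private):.4%}")
```

For this particular run, the relative error is roughly 0.02% for the sum and roughly 0.55% for the mean.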


## Median


To demonstrate the `median` method, let's calculate the median number of hours worked for all years in the entire dataset. The `median` method requires a parameter, `candidates`, which is a list of potential values for the median. 

In [68]:
median_hours_actual = df.select(pl.col("HWUSUAL").median()).collect().item()
print('Actual Hours Median: ', median_hours_actual)

Actual Hours Median:  37.0


In [69]:
median_candidates = list(range(20,60))
median_hours_dp = context.query().select(pl.col("HWUSUAL").fill_null(40.).dp.median(median_candidates)).release().collect().item()
print('Differentially Private Median for Hours: ', median_hours_dp)

Differentially Private Median for Hours:  37
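
To see why a candidate list is needed, consider one common way to select a quantile privately: score every candidate by how close it sits to the target rank, then pick a candidate at random with probability that favors good scores (the exponential mechanism). The sketch below is a simplified illustration under that assumption. The helper names and the data are hypothetical, and it is not a description of OpenDP's internal scoring or noise.

```python
import math
import random

def quantile_scores(values, candidates, alpha):
    """Score each candidate; 0 is best, more negative is worse."""
    n = len(values)
    scores = []
    for c in candidates:
        below = sum(1 for v in values if v < c)
        # Distance between the candidate's rank and the target rank alpha * n.
        scores.append(-abs(below - alpha * n))
    return scores

def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Sample a candidate with weight exp(epsilon * score / (2 * sensitivity))."""
    best = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(epsilon * (s - best) / (2 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Made-up data for illustration only.
hours = [30] * 20 + [35] * 30 + [37] * 10 + [40] * 30 + [45] * 10
candidates = list(range(20, 60))

scores = quantile_scores(hours, candidates, alpha=0.5)
rng = random.Random(0)
# Adding or removing one row changes each score by at most 1.
picked = exponential_mechanism(candidates, scores, epsilon=1.0, sensitivity=1.0, rng=rng)
print("Privately selected median candidate:", picked)
```

Because the output can only be one of the candidates, the list should cover the plausible range of the statistic at a resolution that is useful for the analysis.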


## Quantiles

To demonstrate the `quantile` method, let's calculate the number of hours worked at the 25th percentile. 

The `quantile` method requires two parameters:
- Quantile: This is between 0 and 1. We provide 0.25. 
- Candidates: A range of possible values for your quantiles. We provide the integers from 20 through 59, since `range(20, 60)` excludes its upper end.  

In [70]:
quantile = 0.25
quantile_25_actual = df.select(pl.col("HWUSUAL").quantile(quantile)).collect().item()
print('Actual 25th Percentile for Hours: ', quantile_25_actual)

Actual 25th Percentile for Hours:  35.0


In [71]:
quantile_candidates = list(range(20, 60))
quantile_25_dp = context.query().select(pl.col("HWUSUAL").fill_null(40.).dp.quantile(quantile, quantile_candidates)).release().collect().item()
print('Differentially Private 25th Percentile for Hours: ', quantile_25_dp)

Differentially Private 25th Percentile for Hours:  35


### Computing Multiple Quantiles

In [72]:
multiple_quantiles_actual = df.select(
    [pl.col("HWUSUAL").fill_null(40.).quantile(q).alias(f"Quantile_{q}") for q in [0.2, 0.4, 0.6, 0.8]]
).collect()
multiple_quantiles_actual

Quantile_0.2,Quantile_0.4,Quantile_0.6,Quantile_0.8
f64,f64,f64,f64
34.0,35.0,39.0,44.0


In [73]:
multiple_quantiles_dp = context.query().select(
    [pl.col("HWUSUAL").fill_null(40.).dp.quantile(q, quantile_candidates).alias(f"Quantile_{q}") for q in [0.2, 0.4, 0.6, 0.8]]
).release().collect()
multiple_quantiles_dp

Quantile_0.2,Quantile_0.4,Quantile_0.6,Quantile_0.8
i64,i64,i64,i64
30,36,39,43
