# Computing Basic Statistics

In this section, you will learn to compute essential statistical measures such as: 
- Sum 
- Mean 
- Median
- Quantiles

For each method, we will compare the actual values to the differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) also provides more information about the methods. We will use the [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France. 

## Set Up

In [2]:
import requests
import zipfile
import io

import polars as pl 
import opendp.prelude as dp

dp.enable_features("contrib")

In [3]:
url = "https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true"

response = requests.get(url)
response.raise_for_status()

with zipfile.ZipFile(io.BytesIO(response.content)) as the_zip:
    # scan_csv reads lazily from a path, so extract the CSV to the working directory first.
    csv_path = the_zip.extract("sample_FR_LFS.csv")
    # Set ignore_errors to True to avoid conversion issues in the len function.
    # Many columns contain mixtures of strings and numbers and cannot be parsed as floats.
    df = pl.scan_csv(csv_path, ignore_errors=True)

The compositor here is nearly identical to the one explained in the introduction. One additional parameter is `max_num_partitions`, which is required when the metric isn't sensitive to ordering. In the following examples, each `select` query has only one partition, so `max_num_partitions` is set to 1.


In [4]:
# Drop rows where HWUSUAL is 99, the value that encodes nulls in this dataset.
df = df.filter(pl.col("HWUSUAL") != 99)

estimated_max_partition_len = 60_000_000

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    margins={
        (): dp.Margin(
            public_info="lengths",
            max_partition_length=estimated_max_partition_len,
            max_num_partitions=1,
        ),
    },
)
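As a rough illustration of what `split_evenly_over=5` implies, the sketch below divides the total budget across the five releases made in this section (sum, mean, median, a single quantile, and the multi-quantile query). It assumes basic composition of pure-epsilon budgets, where the per-query epsilons add up to the total. The helper `per_query_epsilon` is a made-up name for this example and is not part of OpenDP.

```python
def per_query_epsilon(total_epsilon: float, num_queries: int) -> float:
    """Evenly divide a pure-DP budget, assuming per-query epsilons add up to the total."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return total_epsilon / num_queries


# Values taken from the context above: epsilon=1.0, split_evenly_over=5
eps_each = per_query_epsilon(total_epsilon=1.0, num_queries=5)
print(eps_each)
```

Under this assumption, splitting the same budget over more queries leaves less for each one, so each release becomes noisier.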

In the examples that follow, we see some differences between how regular Polars and Polars with DP are used:

### 1. Specifying the Data

In a regular query, use `df` directly. In a differentially private query, use `context.query()`. 

### 2. Applying the Method

In a regular query, you can often apply the function directly, such as `.sum()`. In a differentially private query, you call the method through the `dp` namespace on the expression and pass the required parameters. For example, to compute a differentially private sum, pass the bounds as a tuple: `.dp.sum((1, 10))`. The bounds are the known minimum and maximum of the input data.
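To make the role of the bounds concrete, here is a minimal, textbook-style sketch of a bounded noisy sum in plain Python. It is not OpenDP's implementation: `clamp`, `laplace_noise`, and `noisy_bounded_sum` are hypothetical names for this illustration, the sensitivity formula assumes neighboring datasets differ by adding or removing up to `contributions` rows, and floating-point noise sampling like this is not safe for real releases.

```python
import random


def clamp(value, lower, upper):
    """Force a value into the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_bounded_sum(values, bounds, epsilon, contributions=1, rng=None):
    """Clamp each value to the bounds, sum, and add Laplace noise (illustrative only)."""
    rng = rng or random.Random()
    lower, upper = bounds
    clamped = [clamp(v, lower, upper) for v in values]
    # Assumed sensitivity: one individual adds or removes up to `contributions` rows,
    # each of magnitude at most max(|lower|, |upper|).
    sensitivity = contributions * max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


# Made-up hours: 120 is clamped down to the upper bound of 80 before summing
print(noisy_bounded_sum([35, 40, 120], bounds=(0, 80), epsilon=1.0))
```

Wider bounds clamp fewer values but raise the sensitivity, so the noise grows. That trade-off is why bounds should be as tight as domain knowledge allows.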

### 3. Collecting the Results 

In a regular query, use `collect()`. In a differentially private query, use `release().collect()`.

## Sum

To demonstrate the `sum` method, calculate the total number of hours worked for all years in the dataset. 

In [5]:
total_hours_actual = df.select(pl.col("HWUSUAL").sum()).collect().item()
print('Actual Total Hours: ', total_hours_actual)

Actual Total Hours:  2962104.0


The query for the sum is essentially the same, but we do need to call `fill_null` because the `sum` method requires a non-nullable input. 

The value you use to impute can be non-zero and depends on the context of your data. 

Do not use private data to calculate imputed values or bounds. Doing so could leak private information and render the differential privacy guarantees meaningless. Instead, choose bounds and imputed values based on prior domain knowledge.

In [6]:
total_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(0.).dp.sum((0, 80))
).release().collect().item()

print('Differentially Private Total Hours: ', total_hours_dp)

Differentially Private Total Hours:  2963161.838941833


## Mean

To demonstrate the `mean` method, calculate the mean number of hours worked for all years in the entire dataset. The `mean` method also requires all null values to be filled. The bounds parameter has the same meaning as the bounds parameter used in `sum`.

In [7]:
mean_hours_actual = df.select(pl.col("HWUSUAL").mean()).collect().item()
print('Actual Mean Hours: ', mean_hours_actual)

Actual Mean Hours:  37.63409056258576


In [8]:
mean_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.mean((1, 80))
).release().collect().item()
print('Differentially Private Mean for Hours: ', mean_hours_dp)

Differentially Private Mean for Hours:  37.6687051102132
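One simple way to quantify utility is the relative error between the actual value and the differentially private release. The sketch below uses the numbers printed in this section for the sum and the mean. `relative_error` is a helper name introduced here for illustration. Because each release is random, rerunning the queries would give slightly different errors.

```python
def relative_error(actual: float, estimate: float) -> float:
    """Absolute difference between estimate and actual, as a fraction of actual."""
    return abs(estimate - actual) / abs(actual)


# Values copied from the outputs shown in this section
sum_actual, sum_dp = 2962104.0, 2963161.838941833
mean_actual, mean_dp = 37.63409056258576, 37.6687051102132

rel_sum = relative_error(sum_actual, sum_dp)
rel_mean = relative_error(mean_actual, mean_dp)

print(f"Relative error of DP sum:  {rel_sum:.4%}")
print(f"Relative error of DP mean: {rel_mean:.4%}")
```

Both releases land within a tenth of a percent of the actual values in this run.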


## Median


To demonstrate the `median` method, calculate the median number of hours worked for all years in the entire dataset. The `median` method requires a `candidates` parameter, which lists potential values for the median.

Having more candidates may allow for a more accurate median, but at a higher computational cost. The 1-hour increments in this case are arbitrary.
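One way to picture how a candidate-based median can work is the exponential mechanism: score every candidate by how close it sits to the target rank, then sample a candidate with probability that grows with its score. The sketch below is a textbook illustration in plain Python under that assumption. It is not OpenDP's algorithm, `rank_scores` and `exponential_mechanism` are hypothetical helper names, and the data is synthetic.

```python
import math
import random


def rank_scores(values, candidates, alpha):
    """Score each candidate by closeness to the alpha-quantile rank (higher is better)."""
    n = len(values)
    scores = []
    for c in candidates:
        below = sum(1 for v in values if v < c)  # number of values under the candidate
        scores.append(-abs(below - alpha * n))
    return scores


def exponential_mechanism(candidates, scores, epsilon, sensitivity=1.0, rng=None):
    """Sample a candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    rng = rng or random.Random()
    top = max(scores)  # shifting by the max keeps exp() stable without changing probabilities
    weights = [math.exp(epsilon * (s - top) / (2 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Synthetic data, made up for this illustration: the integers 1..100
values = list(range(1, 101))
candidates = [20, 40, 50, 60, 80]

scores = rank_scores(values, candidates, alpha=0.5)
dp_median = exponential_mechanism(candidates, scores, epsilon=1.0, rng=random.Random(0))
print(scores, dp_median)
```

Every candidate has to be scored against the data, so the cost grows with the number of candidates. Passing `alpha=0.25` instead of `0.5` would target the 25th percentile.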

In [9]:
median_hours_actual = df.select(pl.col("HWUSUAL").median()).collect().item()
print('Actual Hours Median: ', median_hours_actual)

Actual Hours Median:  37.0


In [10]:
median_candidates = list(range(20, 60))

median_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.median(median_candidates)
).release().collect().item()
print('Differentially Private Median for Hours: ', median_hours_dp)

Differentially Private Median for Hours:  37


## Quantiles

To demonstrate the `quantile` method, calculate the number of hours worked at the 25th percentile. 

The `quantile` method requires two parameters:
- Quantile: This is between 0 and 1. We provide 0.25.
- Candidates: A range of possible values for your quantiles. We provide the integers from 20 through 59.
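Since Python's `range` excludes its stop value, it is worth checking which candidates a grid actually contains. A quick sketch in plain Python, where the variable names are chosen for this example:

```python
# range(20, 60) stops before 60, so the largest candidate is 59
candidates = list(range(20, 60))
print(len(candidates), candidates[0], candidates[-1])

# To include 60 as a candidate, extend the stop value by one
candidates_inclusive = list(range(20, 61))
print(len(candidates_inclusive), candidates_inclusive[-1])
```

A release can only ever be one of the candidates, so a quantile outside the grid would be reported as the nearest edge at best.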

In [11]:
quantile = 0.25
quantile_25_actual = df.select(pl.col("HWUSUAL").quantile(quantile)).collect().item()
print('Actual 25th Quantile for Hours: ', quantile_25_actual)

Actual 25th Quantile for Hours:  35.0


In [12]:
quantile_candidates = list(range(20, 60))
quantile_25_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.quantile(quantile, quantile_candidates)
).release().collect().item()
print('Differentially Private 25th Quantile for Hours: ', quantile_25_dp)

Differentially Private 25th Quantile for Hours:  35


### Computing Multiple Quantiles

In [13]:
multiple_quantiles_actual = df.select(
    [pl.col("HWUSUAL").fill_null(40.).quantile(q).alias(f"Quantile_{q}") for q in [0.2, 0.4, 0.6, 0.8]]
).collect()
multiple_quantiles_actual

| Quantile_0.2 | Quantile_0.4 | Quantile_0.6 | Quantile_0.8 |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| 34.0 | 35.0 | 39.0 | 44.0 |


In [14]:
multiple_quantiles_dp = context.query().select(
    [pl.col("HWUSUAL").fill_null(40.).dp.quantile(q, quantile_candidates).alias(f"Quantile_{q}") for q in [0.2, 0.4, 0.6, 0.8]]
).release().collect()
multiple_quantiles_dp

| Quantile_0.2 | Quantile_0.4 | Quantile_0.6 | Quantile_0.8 |
| --- | --- | --- | --- |
| i64 | i64 | i64 | i64 |
| 32 | 36 | 38 | 48 |
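To compare the two sets of quantiles side by side, the sketch below computes the absolute error at each level using the values shown in the outputs above. The variable names are chosen for this example, and a rerun of the differentially private query would produce different errors.

```python
# Values copied from the two outputs above
quantile_levels = [0.2, 0.4, 0.6, 0.8]
actual_values = [34.0, 35.0, 39.0, 44.0]
dp_values = [32, 36, 38, 48]

# Absolute error in hours at each quantile level
abs_errors = {
    q: abs(dp - actual)
    for q, actual, dp in zip(quantile_levels, actual_values, dp_values)
}

for q, err in abs_errors.items():
    print(f"Quantile {q}: off by {err} hours")
```

In this run the releases at the 0.4 and 0.6 levels are within one hour of the actual values, while the 0.8 level is off by four hours.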
