# Computing Basic Statistics

In this section, you will learn to compute essential statistical measures such as: 

* Sum 
* Mean 
* Median
* Quantiles

For each method, we compare the non-private value with the differentially private (DP) value to demonstrate utility. The [documentation](../../api/python/opendp.polars.html) provides more information about these methods. The examples use [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the French Labour Force Survey.

## Set Up

In [1]:
import polars as pl 
import opendp.prelude as dp

dp.enable_features("contrib")

In [2]:
# Fetch and load the data. 
!curl "https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip?raw=true" -L -o data.zip > /dev/null 2>&1

!unzip -q data.zip > /dev/null 2>&1

The compositor here is nearly identical to the one explained in the introduction. One additional parameter is `max_num_partitions`, which is required when the metric isn't sensitive to ordering. In the following examples, there is only one partition in the `select`, so `max_num_partitions` is set to 1.


In [3]:
df = pl.scan_csv("sample_FR_LFS.csv")

# Drop rows where HWUSUAL is 99, the value this dataset uses for nulls.
df = df.filter(pl.col("HWUSUAL") != 99)

estimated_max_partition_len = 60_000_000

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    margins={
        (): dp.Margin(
            public_info="lengths",
            max_partition_length=estimated_max_partition_len,
            # max_num_partitions=1 specifies that the data is not partitioned
            # into smaller subsets.
            max_num_partitions=1,
        ),
    },
)
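As a rough illustration of what `split_evenly_over=5` means for the budget, the arithmetic below divides the total `epsilon` into equal shares. It is plain Python, not a call into the library, and the variable names are hypothetical.

```python
# Illustrative arithmetic only: an even split of the privacy budget.
total_epsilon = 1.0   # from dp.loss_of(epsilon=1.0)
num_releases = 5      # from split_evenly_over=5

per_release_epsilon = total_epsilon / num_releases
print(per_release_epsilon)  # 0.2
```

The rest of this section calls `release()` five times, which matches the five shares.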

The examples below show three differences between regular Polars and Polars with DP:

### 1. Specifying the Data

In a regular query, use `df` directly. In a differentially private query, use `context.query()`, which coordinates the queries, allocates the privacy budget, and specifies the grouping margins. For now, you can also view it as a substitute for `df`.

### 2. Applying the Method

In a regular query, you can often apply the function directly, such as `.sum()`. In a differentially private query, you may need to pass additional parameters. For example, to compute the differentially private sum, pass in the bounds as a tuple: `.dp.sum((1, 10))`.

### 3. Collecting the Results 

In a regular query, use `collect()`. In a differentially private query, use `release().collect()`.

## Sum

To demonstrate the `sum` method, first calculate the actual (non-private) total number of hours worked across all years in the dataset.

In [4]:
total_hours_actual = df.select(pl.col("HWUSUAL").sum()).collect().item()

The output is the non-private actual total hours, which is 2962104.0. 

The query for the sum is essentially the same, but we do need to call `fill_null` because the `sum` method requires a non-nullable input. 

The imputed value can be non-zero and depends on the context of your data. 

Do not use private data to calculate imputed values or bounds: This could leak private information, rendering the differential privacy guarantees meaningless. Instead, choose bounds and imputed values based on prior domain knowledge.

*`fill_null` imputes the null values with the provided value.*
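To see why bounds matter, here is a minimal sketch of a textbook Laplace-mechanism sum: clamp every value to the bounds, add them up, then add noise scaled to the largest change one individual could cause. This is an assumption-laden illustration, not the library's implementation. The helper names are hypothetical, the sensitivity formula assumes each individual adds or removes at most `contributions` rows, and the floating-point sampler is not suitable for real releases.

```python
import random

def clamp(value, lower, upper):
    """Force a value into the interval [lower, upper]."""
    return max(lower, min(upper, value))

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws.
    Illustration only: naive floating-point samplers are not safe for real DP."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_clamped_sum(values, bounds, epsilon, contributions, rng):
    lower, upper = bounds
    clamped_total = sum(clamp(v, lower, upper) for v in values)
    # Assumed sensitivity: one individual changes at most `contributions` rows,
    # each with magnitude at most max(|lower|, |upper|).
    sensitivity = contributions * max(abs(lower), abs(upper))
    return clamped_total + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
hours = [35.0, 40.0, 120.0, -5.0]  # made-up values, two of them out of bounds
clamped_total = sum(clamp(h, 0, 80) for h in hours)
print(clamped_total)  # 155.0
noisy_total = noisy_clamped_sum(hours, (0, 80), epsilon=0.2, contributions=36, rng=rng)
print(noisy_total)  # the clamped total plus random noise
```

Wider bounds clamp fewer values but require more noise, which is why the bounds should come from domain knowledge rather than from the private data.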

In [5]:
total_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(0.).dp.sum((0,80))
).release().collect().item()

The output is the DP total hours, which is 2968795.762634444. 
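One simple way to judge utility is the relative error between the two totals. The sketch below hard-codes the two values reported above; the DP value changes on every run, so the numbers here describe this particular release only.

```python
total_hours_actual = 2962104.0       # non-private total reported above
total_hours_dp = 2968795.762634444   # DP total reported above

absolute_error = abs(total_hours_dp - total_hours_actual)
relative_error = absolute_error / total_hours_actual

print(f"absolute error: {absolute_error:.1f}")  # absolute error: 6691.8
print(f"relative error: {relative_error:.2%}")  # relative error: 0.23%
```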

## Mean

To demonstrate the `mean` method, calculate the mean number of hours worked across all years in the dataset. The `mean` method also requires all null values to be filled. Its bounds parameter works the same way as the bounds parameter of `sum`.
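The imputed value matters more for a mean than it might seem. The plain-Python sketch below uses a made-up column with two nulls; `fill_null` here is a hypothetical stand-in for the Polars method, not the real one.

```python
hours = [35.0, None, 40.0, None, 45.0]  # made-up column with two nulls

def fill_null(values, fill_value):
    """Replace None with fill_value (plain-Python stand-in for Polars' fill_null)."""
    return [fill_value if v is None else v for v in values]

def mean(values):
    return sum(values) / len(values)

mean_filled_with_zero = mean(fill_null(hours, 0.0))
mean_filled_with_forty = mean(fill_null(hours, 40.0))
print(mean_filled_with_zero)   # 24.0
print(mean_filled_with_forty)  # 40.0
```

Filling with 0 drags the mean far below every observed value, while a fill value near the expected centre of the data distorts it much less.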

In [6]:
mean_hours_actual = df.select(pl.col("HWUSUAL").mean()).collect().item()

The output is the non-private actual mean hours, which is 37.63409056258576. 

In [7]:
mean_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.mean((1, 80))
).release().collect().item()

The output is the DP mean hours, which is 37.72935249358343. 

## Median


To demonstrate the `median` method, calculate the median number of hours worked for all years in the entire dataset. The `median` method requires a parameter `candidates`, which are potential values for the median. 

Having more candidates may allow for a more accurate median, but will consume more of the privacy budget. The 1 hour increments in this case are arbitrary. 
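To give a feel for how a candidate set can drive a private selection, here is a minimal sketch of the textbook exponential mechanism applied to quantiles. Whether the library's `median` works exactly this way is not stated here; the scoring rule, the helper names, the synthetic data, and the sensitivity of 1 (one row per individual) are all assumptions made for illustration.

```python
import math
import random

def quantile_scores(data, candidates, alpha):
    """Score each candidate by how close its rank is to the target rank alpha * n."""
    n = len(data)
    return [-abs(sum(1 for x in data if x < c) - alpha * n) for c in candidates]

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick an index with probability proportional to exp(epsilon * score / (2 * sensitivity))."""
    top = max(scores)  # shift by the best score for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

rng = random.Random(0)
data = [rng.randint(30, 45) for _ in range(10_000)]  # synthetic hours, not the survey data
candidates = list(range(20, 60))

scores = quantile_scores(data, candidates, alpha=0.5)
chosen = exponential_mechanism(scores, epsilon=1.0, sensitivity=1.0, rng=rng)
dp_median = candidates[chosen]
print(dp_median)  # a candidate near the middle of the synthetic data
```

The result is always one of the supplied candidates, so the candidate grid limits how precise the answer can be.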

In [8]:
median_hours_actual = df.select(pl.col("HWUSUAL").median()).collect().item()

The output is the non-private actual median hour, which is 37.0. 

In [9]:
median_candidates = list(range(20,60))

median_hours_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.median(median_candidates)
).release().collect().item()

The output is the DP median hour, which is 37.0. 

## Quantiles

To demonstrate the `quantile` method, calculate the number of hours worked at the 25th percentile. 

The `quantile` method requires two parameters:

* `quantile`: This is between 0 and 1. We provide 0.25.
* `candidates`: A range of possible values for your quantiles. We provide the integers from 20 through 59.
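A quick check of what the candidate list contains can prevent off-by-one surprises, since Python's `range` excludes its upper bound. The half-hour grid in the second half is a hypothetical alternative, not something the examples in this section use.

```python
quantile_candidates = list(range(20, 60))
print(quantile_candidates[0], quantile_candidates[-1], len(quantile_candidates))
# 20 59 40

# Hypothetical finer grid: half-hour steps over the same span.
half_hour_candidates = [20 + 0.5 * i for i in range(79)]
print(half_hour_candidates[0], half_hour_candidates[-1], len(half_hour_candidates))
# 20.0 59.0 79
```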

In [10]:
quantile = 0.25
quantile_25_actual = df.select(pl.col("HWUSUAL").quantile(quantile)).collect().item()

The output is the non-private actual 25th percentile of hours, which is 35.0.

In [11]:
quantile_candidates = list(range(20, 60))
quantile_25_dp = context.query().select(
    pl.col("HWUSUAL").fill_null(40.).dp.quantile(quantile, quantile_candidates)
).release().collect().item()

The output is the DP 25th percentile of hours, which is 35.0.

### Computing Multiple Quantiles

Typically, you'll want a set of quantiles. A list comprehension is a good tool for this.

*Shape refers to the dimensions of the dataframe, specifically the number of rows and columns it has.* 

In [12]:
multiple_quantiles_actual = df.select(
    [
        pl.col("HWUSUAL").fill_null(40.).quantile(q).alias(f"Quantile_{q}")
        for q in [0.2, 0.4, 0.6, 0.8]
    ]
).collect()
multiple_quantiles_actual

| Quantile_0.2 | Quantile_0.4 | Quantile_0.6 | Quantile_0.8 |
| --- | --- | --- | --- |
| f64 | f64 | f64 | f64 |
| 34.0 | 35.0 | 39.0 | 44.0 |


In [13]:
multiple_quantiles_dp = context.query().select(
    [
        pl.col("HWUSUAL").fill_null(40.).dp.quantile(q, quantile_candidates).alias(f"Quantile_{q}")
        for q in [0.2, 0.4, 0.6, 0.8]
    ]
).release().collect()
multiple_quantiles_dp

| Quantile_0.2 | Quantile_0.4 | Quantile_0.6 | Quantile_0.8 |
| --- | --- | --- | --- |
| i64 | i64 | i64 | i64 |
| 34 | 36 | 39 | 42 |
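To compare the two outputs above, the sketch below hard-codes both rows and prints the absolute difference per quantile. The DP row comes from one random release, so a rerun would likely give different values.

```python
quantiles = [0.2, 0.4, 0.6, 0.8]
actual = [34.0, 35.0, 39.0, 44.0]  # non-private values from the first output
private = [34, 36, 39, 42]         # DP values from the second output

for q, a, p in zip(quantiles, actual, private):
    print(f"q={q}: actual={a}, dp={p}, abs diff={abs(a - p)}")

max_abs_diff = max(abs(a - p) for a, p in zip(actual, private))
print(max_abs_diff)  # 2.0
```

In this release every DP quantile lands within two hours of the non-private value.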
