# Aggregations and Filtering

In this section, you will learn how to compute differentially private statistics while applying key data manipulation techniques such as: 

- Singular Variable Groupby
- Multiple Variable Groupby
- Filtering

For each method, we compare the actual (non-private) values with the computed differentially private values to demonstrate utility. The [documentation](https://docs.opendp.org/en/nightly/api/python/opendp.polars.html#module-opendp.polars) provides more information about these methods. We use [sample data](https://github.com/opendp/dp-test-datasets/blob/master/data/sample_FR_LFS.csv.zip) from the Labour Force Survey in France.

## Set Up

In [None]:
%pip install numpy matplotlib seaborn 
%pip install "opendp[polars]"

In [2]:
import polars as pl 
import opendp.prelude as dp
import seaborn as sns 

dp.enable_features("contrib")
sns.set_theme(style='darkgrid')

In [13]:
df = pl.scan_csv("sample_FR_LFS.csv")

In [18]:
# Optional: drop rows where HWUSUAL is 99, the code used for null values.
# df = df.filter(pl.col("HWUSUAL") != 99)

estimated_max_partition_len = 60_000_000

context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=5,
    margins={
        ("ILOSTAT", ): dp.Margin(max_partition_length=estimated_max_partition_len),
        ("YEAR", ): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=4),
        ("QUARTER", ): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=13),
        ("YEAR", "QUARTER",): dp.Margin(max_partition_length=estimated_max_partition_len, max_partition_contributions=1),
        #TODO: ask Mike why the length of this needs to be public info for mean but not sum
        (): dp.Margin(public_info="lengths", max_partition_length=estimated_max_partition_len, max_num_partitions=1),
    },
)
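
The context above splits a total budget of `epsilon=1.0` evenly over five queries. The sketch below works out the per-query share and a rough noise scale for a count. The noise scale rests on an assumption that is not stated in this notebook: that counts are released with Laplace noise whose L1 sensitivity equals the contribution bound of 36. OpenDP calibrates the noise itself, so treat the number as a back-of-the-envelope estimate rather than the library's exact value.

```python
# Values copied from the Context definition above.
total_epsilon = 1.0      # privacy_loss=dp.loss_of(epsilon=1.0)
num_queries = 5          # split_evenly_over=5
max_contributions = 36   # privacy_unit=dp.unit_of(contributions=36)

# Each query receives an equal share of the total privacy budget.
epsilon_per_query = total_epsilon / num_queries

# ASSUMPTION: Laplace noise with L1 sensitivity equal to the contribution bound.
# scale = sensitivity / epsilon_per_query, rearranged to avoid rounding error.
assumed_laplace_scale = max_contributions * num_queries / total_epsilon

print(f"epsilon per query: {epsilon_per_query}")
print(f"assumed Laplace scale for a count: {assumed_laplace_scale}")
```

Under these assumptions, a smaller contribution bound or fewer queries would reduce the noise added to each count.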

## Singular Variable Groupby 

To demonstrate the `group_by` method, let's count the number of people in each labor status category (`ILOSTAT`).

labor_status_codes = {
    9: 'NA',
    2: 'Not Working But Employed',
    1: 'Working for Pay', 
    3: 'Laid Off'
}

count_ilostat_actual = (df.group_by("ILOSTAT").agg(pl.len()).sort("ILOSTAT")).collect()

# Use `with_columns` to replace the numeric codes with readable labels.
count_ilostat_actual = count_ilostat_actual.with_columns(
    pl.col("ILOSTAT").map_elements(
        lambda x: labor_status_codes.get(x, str(x)), return_dtype=pl.String
    )
)
count_ilostat_actual

As a second example, let's count the number of people in each year.

In [24]:
count_year_actual = (df.group_by("YEAR").agg(pl.len()).sort("YEAR")).collect()
count_year_actual

YEAR,len
i64,u32
2004,16491
2005,16460
2006,16291
2007,16838
2008,16774
2009,19998
2010,24081
2011,24776
2012,24952
2013,23339


To get the differentially private statistics, call `.dp.noise()` on the aggregate expression and call `.release()` on the entire query before `.collect()`.

Calling `.release()` is always the final step in building a differentially private query: it ensures the result is compliant with the differential privacy guarantees and makes it available in a usable form.

In [25]:
count_year_dp = (context.query().group_by("YEAR").agg(pl.len().dp.noise()).sort("YEAR")).release().collect()
count_year_dp

YEAR,len
i64,u32
2004,16503
2005,16538
2006,16304
2007,16428
2008,16770
2009,19857
2010,24188
2011,24929
2012,24954
2013,23299
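
To quantify utility, the two outputs above can be compared year by year. The sketch below copies the counts from the tables in this section into plain Python lists (the variable names are illustrative) and computes the absolute and relative error of each differentially private count. Because the noise is random, rerunning the release would give different values.

```python
# Counts copied from the two output tables above.
years = list(range(2004, 2014))
actual_counts = [16491, 16460, 16291, 16838, 16774, 19998, 24081, 24776, 24952, 23339]
dp_counts = [16503, 16538, 16304, 16428, 16770, 19857, 24188, 24929, 24954, 23299]

# Absolute and relative error of each differentially private count.
abs_errors = [abs(d - a) for a, d in zip(actual_counts, dp_counts)]
rel_errors = [err / a for err, a in zip(abs_errors, actual_counts)]

for year, err, rel in zip(years, abs_errors, rel_errors):
    print(f"{year}: absolute error = {err:4d}, relative error = {rel:.2%}")

mean_abs_error = sum(abs_errors) / len(abs_errors)
max_rel_error = max(rel_errors)
print(f"mean absolute error: {mean_abs_error}")
print(f"largest relative error: {max_rel_error:.2%}")
```

For this particular release, every differentially private count is within 3% of the actual count.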


Let's visualize
#TODO 

## Multiple Variable Groupby 

## Filtering

## Chaining Groupby and Filtering