# Sequential and Parallel Composition

We rarely want just one differentially private statistic; instead, we need to compute multiple statistics on the same dataset. 
If we run many different differentially private algorithms on the same dataset, the resulting **composed** algorithm is also differentially private.

The simplest composition is **non-interactive** and **sequential**, but there are other approaches which can result in higher utility, or a smaller privacy budget, or both.

With non-interactive composition, each mechanism is run independently.
The alternative, **interactive** composition, allows the result of one mechanism to influence what is run next.
Interaction is not limited to a human user: your own code can also "interact" with previous results.
For example, you could imagine a program which first collects high-level statistics, identifies variables with the most interesting distributions, and then uses the remaining privacy budget just to explore those variables.
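
As a rough illustration of the adaptive program described above, the following sketch releases a noisy count of non-missing values for every column, picks a column based only on those released values, and then spends the rest of the budget on that column alone. The helper names (`laplace_noise`, `noisy_count`, `explore`), the half-and-half budget split, and the toy columns are all assumptions made for this example. It uses plain Python rather than OpenDP, and its floating-point noise sampling is not suitable for real releases.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, epsilon, rng):
    # Counting non-missing values has sensitivity 1 when each person adds one row.
    present = [v for v in values if v is not None]
    return len(present) + laplace_noise(1 / epsilon, rng)


def explore(columns, total_epsilon, rng):
    # Phase 1: spend half of the budget on a rough look at every column.
    eps_each = (total_epsilon / 2) / len(columns)
    rough = {name: noisy_count(vals, eps_each, rng) for name, vals in columns.items()}
    # The choice depends only on released values, so it costs nothing extra.
    chosen = max(rough, key=rough.get)
    # Phase 2: spend the remaining half on the chosen column alone.
    refined = noisy_count(columns[chosen], total_epsilon / 2, rng)
    spent = eps_each * len(columns) + total_epsilon / 2
    return chosen, refined, spent


# Toy data, made up for illustration.
columns = {
    "hours": [40, 35, None, 38] * 50,
    "age": [30, None, None, 45] * 50,
    "income": [None, None, None, 1] * 50,
}
chosen, refined, spent = explore(columns, total_epsilon=1.0, rng=random.Random(0))
print(chosen, round(refined, 2), spent)
```

The second phase is what makes this interactive: which query runs depends on what the first phase released, yet the total spent never exceeds the budget fixed in advance.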

In OpenDP, [make_basic_composition](../../api/python/opendp.combinators.rst#opendp.combinators.make_basic_composition) supports non-interactive composition,
and [make_sequential_composition](../../api/python/opendp.combinators.rst#opendp.combinators.make_sequential_composition) supports interactive composition.

We can also make a distinction between sequential and **parallel** composition:
In sequential composition, each mechanism operates on the entire dataset, while in parallel composition each mechanism operates on a partition.
Because data can be partitioned in many ways, there isn't a dedicated combinator for parallel composition, but the OpenDP Polars integration employs parallel composition to reason about the privacy loss of grouping queries.


This is not to be taken as an exhaustive catalog of composition!
You can also compose compositors: For example, a non-interactive composition of interactive mechanisms.
Composition is also an active area of research.

We'll examine these combinations:

- Sequential Composition
  - Noninteractive (`make_basic_composition`; Polars `select` with multiple queries)
  - Interactive (`make_sequential_composition`)
- Parallel Composition
  - Noninteractive (Polars `group_by`)
  - Interactive

## Set Up

In [2]:
import opendp.prelude as dp
import numpy as np
import polars as pl 

dp.enable_features("contrib")

In [3]:
# Fetch and download data. 
![ -e sample_FR_LFS.csv ] || ( curl 'https://github.com/opendp/dp-test-datasets/blob/main/data/sample_FR_LFS.csv.zip?raw=true' --location --output sample_FR_LFS.csv.zip; unzip sample_FR_LFS.csv.zip )

df = pl.scan_csv("sample_FR_LFS.csv", ignore_errors=True)


In [4]:
context = dp.Context.compositor(
    data=df,
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=10,
    margins={
        ("ILOSTAT", ): dp.polars.Margin(public_info="lengths", max_partition_length=60_000_000, max_num_partitions=3),
        ("YEAR", "QUARTER", "ILOSTAT",): dp.polars.Margin(public_info="lengths", max_partition_length=60_000_000, max_partition_contributions=1, max_num_partitions=1),
        (): dp.polars.Margin(public_info="lengths", max_partition_length=60_000_000, max_num_partitions=1),
    },
)

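# Filter out rows where HWUSUAL is 99, which is used as a null value.
# Note: `filter` returns a new LazyFrame, so the context above still holds the unfiltered data.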
df = df.filter(pl.col("HWUSUAL")!=99)

## Sequential Composition

Sequential composition is a major property of differential privacy because it bounds the total privacy costs of computing multiple differentially private queries on the same data. It's important because it allows the design of algorithms that refer to the data multiple times. 

It's an extremely useful way to get an **upper** bound on the total ε of many queries, but keep in mind that the actual impact on privacy may be lower. 

Assume that mechanism $F_1$ satisfies ε-DP and mechanism $F_2$ also satisfies ε-DP.

Then the composed mechanism $g(x) = (F_1(x), F_2(x))$, which releases both results, satisfies 2ε-DP.
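
The additive bound above can be sketched as simple bookkeeping. In this minimal sketch, `sequential_epsilon` and `split_evenly` are hypothetical helpers, not OpenDP functions. The second one mirrors the idea behind `split_evenly_over=10` in the context above, under the assumption that the budget is divided equally across queries.

```python
import math


def sequential_epsilon(epsilons):
    # Basic composition: losses of mechanisms run on the same data add up.
    return sum(epsilons)


def split_evenly(total_epsilon, num_queries):
    # Give every query the same share of the total budget.
    return [total_epsilon / num_queries] * num_queries


# Two mechanisms that each satisfy 0.5-DP compose to at most 1.0-DP.
two_queries = sequential_epsilon([0.5, 0.5])

# A total budget of 1.0 split over 10 queries allows 0.1 per query,
# and the shares add back up to the total (up to floating-point rounding).
shares = split_evenly(1.0, 10)
assert math.isclose(sequential_epsilon(shares), 1.0)
print(two_queries, shares[0])
```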

Sequential composition can be noninteractive or interactive.
 
### Noninteractive Sequential Composition

We can execute multiple queries at once using the `select` or `agg` methods. To start off, let's calculate the actual and differentially private mean and median for the variable `HWUSUAL`, which is the number of hours per week usually worked in the main job by respondents that were working for pay at the time of the survey.

Notes: 

- See the [EU Labour Force Survey User Guide](https://ec.europa.eu/eurostat/documents/1978984/6037342/EULFS-Database-UserGuide.pdf) for more information about variables. 
- Do not calculate the actual mean to impute null values. We shouldn't reference our data after creating the compositor. 

In [5]:
candidates = list(range(25,56))

actual_hours = df.select(
    pl.col("HWUSUAL").fill_null(40.0).mean().alias("Actual Mean Hours Worked"), 
    pl.col("HWUSUAL").fill_null(40.0).median().alias("Actual Median Hours Worked"),
).collect()

actual_hours

Actual Mean Hours Worked,Actual Median Hours Worked
f64,f64
37.634091,37.0


In [6]:
dp_hours = context.query().select(
    pl.col("HWUSUAL").fill_null(40.0).dp.mean((1,45)).alias("DP Mean Hours Worked"), 
    pl.col("HWUSUAL").fill_null(40.0).dp.median(candidates).alias("DP Median Hours Worked"),
).release().collect()

dp_hours

DP Mean Hours Worked,DP Median Hours Worked
f64,i64
41.413199,51


### Interactive Sequential Composition

To apply interactive sequential composition, use the `dp.c.make_sequential_composition` combinator. There are two ways to use it in OpenDP: with transformations and measurements directly, or with Polars queries.

Note that the following queries are interactive, but they should not be assumed to be adaptive.

### Interactive Sequential Composition with Transformations and Measurements

This example doesn't use the compositor we defined at the start of this notebook. Instead, we build the composition from transformations and measurements, the lower-level building blocks of the OpenDP library.

We will also define a `queryable` in the following code. It can essentially be thought of as a class that executes the queries and tracks the privacy expenditure. 

In [7]:
# Define the compositor by specifying parameters. 
sc_meas = dp.c.make_sequential_composition(
    input_domain=dp.vector_domain(dp.atom_domain(T=float)),
    input_metric=dp.symmetric_distance(),
    output_measure=dp.max_divergence(T=float),
    # d_in is the upper bound on distances.
    d_in=1,
    # d_mids is the privacy consumption for each query. 
    d_mids=[1., 1., 2.]
)

# 40 is chosen since it represents the average working hours.
# A lower value such as 0 may skew the results.
collected = df.select(pl.col("HWUSUAL").fill_null(40.0).fill_nan(40.0)).collect()
# reshape(-1) flattens the array, which is required for the queryable input.
hours_list = list(np.array(collected).reshape(-1))

# This is a queryable that takes in the data similar to the context. 
sc_qbl = sc_meas(hours_list)

Calling `sc_meas.map(1)` returns 4, the sum of `d_mids`, which is the privacy consumption of the entire composition.
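
To make the bookkeeping concrete, here is a toy, pure-Python stand-in for such a queryable. `ToyQueryable` is a hypothetical class written for this example, not part of OpenDP, and it assumes that each submitted query already declares a valid privacy cost. It only shows how a list like `d_mids` can be consumed one query at a time.

```python
class ToyQueryable:
    """Hypothetical sketch: holds data privately and meters out a budget."""

    def __init__(self, data, d_mids):
        self._data = data
        self._remaining = list(d_mids)  # one allowance per future query, in order

    def total_budget(self):
        return sum(self._remaining)

    def __call__(self, query, cost):
        if not self._remaining:
            raise ValueError("privacy budget exhausted")
        if cost > self._remaining[0]:
            raise ValueError("query costs more than the next allowance")
        self._remaining.pop(0)
        return query(self._data)


# `len` and `sum` are stand-in queries; real queries would be DP measurements.
qbl = ToyQueryable([38.0, 40.0, 42.0], d_mids=[1.0, 1.0, 2.0])
initial_budget = qbl.total_budget()  # 1 + 1 + 2
first = qbl(len, cost=1.0)           # accepted: fits the first allowance

try:
    qbl(sum, cost=5.0)               # rejected: exceeds the next allowance
    rejected = False
except ValueError:
    rejected = True

print(initial_budget, first, rejected, qbl.total_budget())
```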

In [8]:
# Construct measurements for the count and the mean.
input_space = dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance()
count_meas = input_space >> dp.t.then_count() >> dp.m.then_laplace(scale=1.0)
count_release = sc_qbl(count_meas)
print("DP count: ",count_release)

mean_meas = (
    input_space >> 
    dp.t.then_clamp((0.,79.)) >> 
    dp.t.then_resize(size=count_release, constant=40.) >> 
    dp.t.then_mean() >> 
    dp.m.then_laplace(1.))
print("DP Mean:", sc_qbl(mean_meas))

DP count:  78708
DP Mean: 35.51057042927775


In [9]:
# Construct a measurement for median. 
discrete_scores = dp.t.make_quantile_score_candidates(
    dp.vector_domain(dp.atom_domain(T=float)),
    dp.symmetric_distance(),
    np.array(candidates).astype(float),
    0.5,
)
input_space = dp.vector_domain(dp.atom_domain(T=dp.usize), 31), dp.linf_distance(T=dp.usize)
select_index_measurement = dp.m.make_report_noisy_max_gumbel(*input_space, scale=1.0, optimize='min')

median_meas = discrete_scores >> select_index_measurement >> (lambda index: candidates[index])
print("DP Median:", sc_qbl(median_meas))

DP Median: 37
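
The idea behind this median release can be sketched without OpenDP: score every candidate by how unevenly it splits the data, then pick the candidate with the lowest noisy score. The sketch below is a simplified illustration under assumptions: `median_scores`, `gumbel_noise`, and `noisy_min_index` are hypothetical helpers, the scoring rule is a plain count difference rather than OpenDP's exact scoring, the toy data is made up, and the floating-point sampling is not suitable for real releases.

```python
import math
import random


def median_scores(data, candidates):
    # |#below - #above| is 0 at a true median and grows as the split gets uneven.
    return [
        abs(sum(x < c for x in data) - sum(x > c for x in data))
        for c in candidates
    ]


def gumbel_noise(scale, rng):
    # Inverse-CDF sampling of a Gumbel(0, scale) variate.
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -scale * math.log(-math.log(u))


def noisy_min_index(scores, scale, rng):
    # Minimizing the score is the same as maximizing its negation,
    # so add Gumbel noise to the negated scores and take the argmax.
    noisy = [-score + gumbel_noise(scale, rng) for score in scores]
    return max(range(len(noisy)), key=noisy.__getitem__)


candidates = list(range(25, 56))
data = list(range(20, 61))  # toy data whose true median is 40
scores = median_scores(data, candidates)
index = noisy_min_index(scores, scale=1.0, rng=random.Random(0))
print("Sketch of a DP median:", candidates[index])
```

A larger `scale` makes the choice noisier and more private; as `scale` approaches zero, the selection approaches the true best-scoring candidate.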


That was a lot of code! But as you saw, we could change our queries depending on the outputs of other queries, illustrating interactive sequential composition.

### Interactive Sequential Composition with Polars Queries

You can also implement interactive sequential composition with Polars queries. One of the first steps is to specify the domain of our data frame and add a margin describing the data. Here the margin uses `by=[]`, so it describes the ungrouped data. This is **in addition** to the margins specified in the compositor.

In [10]:
# Get the domain of the dataframe and specify parameters.
df_domain = dp.domain_of(df, infer=True)

df_domain = dp.with_margin(df_domain, by=[], 
                           public_info="lengths", 
                           max_partition_length=1, 
                           max_num_partitions=1)

proposed_plan = context.query().select(
    pl.col("HWUSUAL").fill_null(40.0).dp.mean((1,98)).alias("DP Mean Hours Worked"), 
    pl.col("HWUSUAL").fill_null(40.0).dp.median(candidates).alias("DP Median Hours Worked"),
)


Next, we build a measurement from `df_domain` and `proposed_plan`, and set up a sequential compositor with the same input domain, metric, and output measure. We can then pass our data to the compositor, submit the measurement to the resulting queryable, and collect our results.

In [11]:
#TODO: Fix Parameters

# Polars Lazyframe
m_polars = dp.m.make_private_lazyframe(
    input_domain=df_domain, 
    input_metric=dp.symmetric_distance(), 
    output_measure=dp.max_divergence(T=float), 
    lazyframe=proposed_plan, 
    global_scale=45.
)

# Measurement for sequential composition. 
m_comp = dp.c.make_sequential_composition(
    m_polars.input_domain,
    m_polars.input_metric,
    m_polars.output_measure,
    2, 
    [3., 3.]
)

qbl_comp = m_comp(df)
qbl_comp(m_polars).collect()

DP Mean Hours Worked,DP Median Hours Worked
f64,i64
37.637408,49


## Parallel Composition

Parallel composition can be viewed as an alternative to sequential composition since it's another way to calculate a bound on the total privacy cost of multiple queries. 

Parallel composition involves partitioning the input dataset into disjoint chunks and applying a differentially private mechanism to each chunk independently of all the others. Assuming each individual is represented once in the dataset, their data will appear in exactly one chunk. If there are $k$ chunks, there will be $k$ runs of the differentially private mechanism.
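
A minimal sketch of this partition-then-release pattern follows. It assumes that each individual contributes exactly one record, that the partition keys are public, and that a Laplace-noised count is the mechanism applied to every chunk. `partition` and `parallel_noisy_counts` are hypothetical helpers, not OpenDP functions, and the noise sampling is for illustration only.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def partition(records, key):
    # Each record lands in exactly one chunk, so the chunks are disjoint.
    chunks = defaultdict(list)
    for record in records:
        chunks[key(record)].append(record)
    return dict(chunks)


def parallel_noisy_counts(records, key, epsilon, rng):
    chunks = partition(records, key)
    # Run the same epsilon-DP mechanism once per chunk.
    releases = {
        k: len(chunk) + laplace_noise(1 / epsilon, rng)
        for k, chunk in chunks.items()
    }
    # One person affects one chunk, so the overall loss is the maximum
    # per-chunk loss, not the sum.
    total_epsilon = max(epsilon for _ in chunks)
    return releases, total_epsilon


# Toy records, made up for illustration: 25 people in each of 4 quarters.
records = [{"quarter": q, "hours": 30 + q} for q in (1, 2, 3, 4) for _ in range(25)]
releases, total_epsilon = parallel_noisy_counts(
    records, lambda r: r["quarter"], epsilon=1.0, rng=random.Random(0)
)
print(sorted(releases), total_epsilon)
```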

To apply parallel composition, we need to set up our margin or define our query to allow each individual to influence at most 1 partition, or appear at most in 1 chunk. This is different from earlier in the notebook where we assumed 36 contributions per user since each user was represented once per year-quarter across 13 years. 

### Noninteractive Parallel Composition

In the following example, each individual is represented just once in each combination of year, quarter, and labor status. This example also satisfies the properties of noninteractive sequential composition, since all the queries are executed simultaneously.

In [12]:
context.query().group_by(["YEAR","QUARTER","ILOSTAT"]).agg(
    pl.col("HWUSUAL").fill_null(40.).dp.mean((1,98)).alias("DP Mean Hours Worked"),
    pl.col("HWUSUAL").fill_null(40.).dp.median(candidates).alias("DP Median Hours Worked")
).release().collect().sort(["YEAR", "QUARTER"])

YEAR,QUARTER,ILOSTAT,DP Mean Hours Worked,DP Median Hours Worked
i64,i64,i64,f64,i64
2004,1,1,39.467307,45
2004,1,9,98.12098,37
2004,1,2,107.907955,27
2004,1,3,98.999954,37
2004,2,9,95.86399,49
…,…,…,…,…
2013,3,1,36.534645,36
2013,4,2,92.4327,34
2013,4,1,36.891254,38
2013,4,9,98.131202,55


### Interactive Parallel Composition

We'll illustrate interactive parallel composition with the same Polars-based approach used for interactive sequential composition. Much of the code is similar; the primary changes are the domain and the proposed plan.

In [13]:
# Get the domain of the dataframe.
lf_domain = dp.domain_of(df, infer=True)
lf_domain = dp.with_margin(lf_domain, by=["YEAR","QUARTER","ILOSTAT"], 
                           public_info="lengths", 
                           max_partition_length=50, 
                           max_num_partitions=3)

proposed_plan = context.query().group_by(["YEAR","QUARTER","ILOSTAT"]).agg(
    pl.col("HWUSUAL").fill_null(40.).dp.mean((1,98), 8).alias("DP Mean Hours Worked"),
    pl.col("HWUSUAL").fill_null(40.).dp.median(candidates, 8).alias("DP Median Hours Worked")
)
# IN SERVER
m_polars = dp.m.make_private_lazyframe(
    input_domain=lf_domain, 
    input_metric=dp.symmetric_distance(), 
    output_measure=dp.max_divergence(T=float), 
    lazyframe=proposed_plan, 
    global_scale=1.
)

m_comp = dp.c.make_sequential_composition(
    m_polars.input_domain,
    m_polars.input_metric,
    m_polars.output_measure,
    2, 
    [18.0, 18.0]
)


qbl_comp = m_comp(df)
qbl_comp(m_polars).collect()

YEAR,QUARTER,ILOSTAT,DP Mean Hours Worked,DP Median Hours Worked
i64,i64,i64,f64,i64
2013,4,1,37.032269,33
2006,3,1,37.36276,46
2007,2,1,37.543822,34
2006,4,1,37.553167,51
2007,4,1,37.903047,43
…,…,…,…,…
2010,4,1,37.719556,52
2008,3,1,37.951766,27
2009,1,1,37.086426,54
2013,1,1,37.549942,26


## Comparison of Sequential and Parallel

Overall, parallel composition provides a much better bound than sequential composition. With parallel composition, the total privacy loss is just ε: the $k$ partitions are disjoint, each individual appears in only one of them, and the mechanism run on each partition satisfies ε-DP. With sequential composition, the total privacy loss is $k$ * ε, since each of the $k$ runs operates on the entire dataset.

However, one disadvantage of parallel composition is that as the dataset is split into more parts, each part carries a weaker signal, so each result is less accurate.
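
This trade-off can be made concrete with a little arithmetic. The sketch below assumes a Laplace-noised count with sensitivity 1, whose expected absolute error equals its noise scale 1/ε, and assumes the records are spread evenly over the partitions. `relative_count_error` is a hypothetical helper, and the record count is made up for illustration.

```python
def relative_count_error(n_records, k_partitions, epsilon):
    # True count in each partition, assuming an even split.
    per_partition = n_records / k_partitions
    # Laplace scale for a sensitivity-1 count; also its expected absolute error.
    noise_scale = 1.0 / epsilon
    return noise_scale / per_partition


errors = {
    k: relative_count_error(100_000, k, epsilon=1.0)
    for k in (1, 10, 100, 1000)
}
for k, err in errors.items():
    print(f"{k:>5} partitions: expected relative error {err:.4%}")
```

The noise scale stays fixed while the per-partition counts shrink, so the relative error grows in proportion to the number of partitions.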

Additionally, in cases where we want statistics that reference the whole dataset, sequential composition may be a better fit. 

Therefore, the recommendation is to use parallel composition unless the partitions are too small to provide adequate accuracy or the entire data needs to be referenced for the calculated statistics. 

## Conclusion

This notebook illustrated: 

- The difference between interactive and noninteractive queries. 
- The definition and properties of sequential composition. 
- How to implement non-interactive sequential composition with Polars, interactive sequential composition using transformations and measurements directly, and interactive sequential composition using Polars queries. 
- The definition and properties of parallel composition. 
- How to implement noninteractive parallel composition and interactive parallel composition. 
- The differences between parallel and sequential composition and recommendations for when to choose which. 


Further reading:

- [Programming Differential Privacy: Properties of Differential Privacy](https://programming-dp.com/ch4.html)
- [Concurrent Composition of Differential Privacy](https://eprint.iacr.org/2021/1196.pdf)