# Privacy Unit

The privacy unit quantifies how much influence an individual may have on your dataset.
OpenDP Polars can reason about the privacy unit in terms of _example-level_ privacy, or _user-level_ privacy.

In [1]:
import polars as pl
import opendp.prelude as dp
dp.enable_features("contrib")

## Example-Level Privacy
Attain an example-level privacy guarantee by specifying a bound on how many records an individual may contribute to the data.
The following example obtains this guarantee by setting the privacy unit based on the number of potential **contributions**:

In [None]:
context = dp.Context.compositor(
    data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
    # an individual may contribute up to 36 records
    privacy_unit=dp.unit_of(contributions=36),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=4,
    margins=[dp.polars.Margin(max_partition_length=60_000_000 * 36)]
)
query = context.query().select(pl.col.HWUSUAL.cast(int).fill_null(0).dp.sum((20, 60)))
query.summarize()

| column | aggregate | distribution | scale |
| --- | --- | --- | --- |
| str | str | str | f64 |
| "HWUSUAL" | "Sum" | "Integer Laplace" | 8640.0 |

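The reported scale can be checked by hand. The sketch below assumes the usual Laplace calibration (scale = sensitivity / epsilon), that `split_evenly_over=4` gives each query a quarter of the budget, and that adding or removing one clamped record shifts the sum by at most the largest absolute bound. The variable names are illustrative and not part of the OpenDP API.

```python
# Back-of-the-envelope check of the scale reported above.
contributions = 36        # records one individual may add or remove
lower, upper = 20, 60     # clamping bounds passed to dp.sum
epsilon_total = 1.0
num_queries = 4           # split_evenly_over=4

# Adding or removing one clamped record shifts the sum by at most max(|lower|, |upper|).
per_record = max(abs(lower), abs(upper))
sensitivity = contributions * per_record

epsilon_per_query = epsilon_total / num_queries
scale = sensitivity / epsilon_per_query
print(sensitivity, epsilon_per_query, scale)  # 2160 0.25 8640.0
```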

Alternatively, you can calibrate to an example-level bounded-DP guarantee
by setting the privacy unit based on the number of potential **changes**:

In [3]:
context = dp.Context.compositor(
    data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
    # an individual may change up to 36 records
    privacy_unit=dp.unit_of(changes=36),
    privacy_loss=dp.loss_of(epsilon=1.0, delta=1e-7),
    split_evenly_over=4,
    margins=[dp.polars.Margin(max_partition_length=60_000_000 * 36)]
)
query = context.query().select(pl.col.HWUSUAL.cast(int).fill_null(0).dp.sum((20, 60)))
query.summarize()

| column | aggregate | distribution | scale |
| --- | --- | --- | --- |
| str | str | str | f64 |
| "HWUSUAL" | "Sum" | "Integer Laplace" | 5760.0 |

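The smaller scale here is consistent with a simple calculation, sketched below under the assumptions that the Laplace scale is sensitivity / epsilon and that each query receives a quarter of the budget. When a record is changed rather than added or removed, a clamped sum can move by at most the width of the clamping interval. The variable names are illustrative only.

```python
# Rough check of the bounded-DP scale reported above.
changes = 36              # records one individual may change
lower, upper = 20, 60     # clamping bounds passed to dp.sum
epsilon_per_query = 1.0 / 4

# Replacing one clamped record by another shifts the sum by at most (upper - lower).
per_change = upper - lower
sensitivity = changes * per_change

scale = sensitivity / epsilon_per_query
print(sensitivity, scale)  # 1440 5760.0
```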

Bounded-DP is a weaker privacy definition than unbounded-DP because it does not protect the total number of records in the data.
In fact, you can consider bounded-DP to be a special case of unbounded-DP in which the dataset size is treated as a data invariant.
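
To make the relationship concrete, the following sketch compares two kinds of neighboring datasets on toy data. Changing a record is equivalent to removing one record and adding another, so the dataset length stays fixed, whereas a pure removal alters the length. The helper `symmetric_distance` is a hypothetical illustration written for this example, not an OpenDP function.

```python
from collections import Counter

def symmetric_distance(a, b):
    """Count the records that must be added or removed to turn a into b."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

data = [35, 40, 40, 20]
changed = [35, 40, 60, 20]   # one record changed: 40 -> 60
removed = [35, 40, 20]       # one record removed

# A change costs one removal plus one addition and preserves the length.
print(symmetric_distance(data, changed), len(changed) == len(data))  # 2 True
# A removal costs one and reveals a different length.
print(symmetric_distance(data, removed), len(removed) == len(data))  # 1 False
```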

## User-Level Privacy
Obtain a user-level privacy guarantee by specifying user identifiers in the data.
In this setting, an individual may contribute an unbounded number of records,
where each row is tied back to an individual via a unique identifier.
The privacy guarantee is then based on how many identifiers an individual may influence.

To make DP releases on this data,
an additional truncation preprocessing step is necessary,
in which only a limited number of records are retained for each identifier.
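
As a rough illustration of what truncation does, the sketch below keeps at most `k` rows per identifier in plain Python. It is a minimal sketch that simply keeps the first `k` rows in input order; the library may select rows differently, and the function name, column names, and toy data are all hypothetical.

```python
from collections import defaultdict

def truncate_per_id(rows, id_key, k):
    """Keep at most k rows per identifier, preserving input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        if seen[row[id_key]] < k:
            kept.append(row)
            seen[row[id_key]] += 1
    return kept

# Toy data: user "a" has three records, user "b" has one.
rows = [
    {"user": "a", "hours": 40},
    {"user": "a", "hours": 38},
    {"user": "a", "hours": 42},
    {"user": "b", "hours": 25},
]
truncated = truncate_per_id(rows, "user", k=2)
print([r["hours"] for r in truncated])  # [40, 38, 25]
```

After this step, each identifier influences at most `k` rows, which is what allows the sensitivity of later aggregates to be bounded.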

In [None]:
context = dp.Context.compositor(
    data=pl.scan_csv(dp.examples.get_france_lfs_path(), ignore_errors=True),
    # an individual may contribute up to 1 identifier
    privacy_unit=dp.unit_of(contributions=1, identifier="COEFF"),
    privacy_loss=dp.loss_of(epsilon=1.0),
    split_evenly_over=4,
    margins=[dp.polars.Margin(max_partition_length=60_000_000 * 36)]
)
query = context.query().truncate(k=10).select(pl.col.HWUSUAL.cast(int).fill_null(0).dp.sum((20, 60)))
query.summarize()

| column | aggregate | distribution | scale |
| --- | --- | --- | --- |
| str | str | str | f64 |
| "HWUSUAL" | "Sum" | "Integer Laplace" | 2400.0 |


Rather than calibrating to the worst case of 36 records per individual,
this query assumes that few individuals have more than ten records and keeps at most ten per identifier.
Dropping the remaining records introduces some bias, but the sensitivity, and therefore the noise scale, is much lower.
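
The trade-off can be sketched numerically. The example below assumes a Laplace scale of `k * 60 / 0.25` (a truncation limit of `k` records, an upper clamping bound of 60, and a per-query epsilon of 0.25) and uses invented per-user record counts in which every record reports 40 hours. None of these toy values come from the dataset.

```python
def laplace_scale(k, bound=60, epsilon=0.25):
    """Noise scale when each individual influences at most k clamped records."""
    return k * bound / epsilon

# Hypothetical number of records held by each of four users.
records_per_user = [3, 8, 12, 36]
hours = 40  # every toy record reports 40 hours

results = {}
for k in (36, 10):
    true_sum = sum(n * hours for n in records_per_user)
    truncated_sum = sum(min(n, k) * hours for n in records_per_user)
    # Bias is the part of the sum lost to truncation.
    results[k] = (laplace_scale(k), true_sum - truncated_sum)

print(results)  # {36: (8640.0, 0), 10: (2400.0, 1120)}
```

With `k=36` nothing is dropped but the noise scale is large; with `k=10` the noise scale shrinks while the truncated sum undercounts the users who hold more than ten records.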