# Working with Unknown Dataset Sizes

This notebook will demonstrate the features built into OpenDP to handle unknown or private dataset sizes.

### Set up libraries and load exemplar dataset

In [1]:
# load libraries
import os

from opendp.trans import *
from opendp.meas import *
from opendp.core import *
from opendp.typing import *

enable_features("floating-point")


# establish data information
data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married"]

# TODO: Remove column headers
with open(data_path) as input_data:
    data = input_data.read()

We see above that this dataset has 1000 observations (rows).  Oftentimes the number of observations is public information.  For example, a researcher might run a random poll of 1000 respondents and publicly announce the sample size.

However, there are cases where the number of observations alone can leak private information.  For example, if a dataset contained all the individuals with a rare disease in a community, then knowing the size of the dataset would reveal how many people in the community had that condition.  In general, a dataset may be composed of some defined subset of a population.  The dataset size is then equivalent to a count query on that subset, so we should protect it like any other query for which we want to provide privacy guarantees.

OpenDP assumes the sample size is private information.  If you happen to know the dataset size, that information is valuable when you add it into your analysis graph.  However, OpenDP will not assume you truthfully or correctly know the size of the dataset.  (Moreover, it cannot directly send you an error message if you get this value wrong, as this would permit an attack whereby an analyst keeps guessing different dataset sizes until the error message goes away, thereby leaking the exact dataset size.)
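The guessing attack described in the parenthetical can be illustrated with a minimal sketch. The `SizeCheckingOracle` class and `recover_size` function are hypothetical names invented for this illustration; they are not part of OpenDP, and the oracle is deliberately unsafe to show why such an error message must not exist.

```python
class SizeCheckingOracle:
    """Hypothetical, deliberately unsafe API: raises when the supplied n is wrong."""

    def __init__(self, data):
        self._data = list(data)

    def mean(self, n):
        if n != len(self._data):
            raise ValueError("incorrect dataset size")
        return sum(self._data) / n


def recover_size(oracle, max_n):
    """Guess sizes until the error disappears. No noise is ever added."""
    for guess in range(1, max_n + 1):
        try:
            oracle.mean(guess)
        except ValueError:
            continue
        return guess
    return None


# Stand-in for a sensitive dataset with 137 rows
private_data = [34.0] * 137
leaked_n = recover_size(SizeCheckingOracle(private_data), max_n=10_000)
print(leaked_n)
```

The loop recovers the exact row count purely from the presence or absence of an error, which is why a library that treats the size as private has to return an answer even when `n` is wrong.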

If we know the dataset size, we can incorporate it into the analysis as below, where we provide `n` as an argument to the release of a mean on age:

In [2]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    # TODO: how to remove column headers?
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column to floats (Vec<float>)
    make_cast(TI=str, TO=float) >>
    # Impute missing values to 0
    make_impute_constant(0.) >>
    # Clamp age values
    make_clamp(20., 50.)
    # make_bounded_mean(lower=1000., upper=1_000_000., n=100, T=float)
    # make_base_laplace(scale=1.0)

)

# TODO: chain these into one process. Currently getting domain mismatch error
# Mike: Once you do the count, you could pull the count before the mean, and then use the count as an input to resize.
res = preprocessor(data)
mean_process = make_bounded_mean(20., 50., n=1000, T=float)
res = mean_process(res)
print(res)


39.65


### Providing incorrect dataset size values

However, if we provide an incorrect value of `n`, we still receive an answer, as we see below:

In [3]:
preprocessor = (
    # Convert data into Vec<Vec<String>>
    # TODO: how to remove column headers?
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column to floats (Vec<float>)
    make_cast(TI=str, TO=float) >>
    # Impute missing values to 0
    make_impute_constant(0.) >>
    # Clamp age values
    make_clamp(20., 50.)
    # make_bounded_mean(lower=1000., upper=1_000_000., n=100, T=float)
    # make_base_laplace(scale=1.0)

)

res = preprocessor(data)
mean_process_high_n = make_bounded_mean(20., 50., n=2000, T=float) >> make_base_laplace(scale=1.0)
mean_process_low_n = make_bounded_mean(20., 50., n=100, T=float) >> make_base_laplace(scale=1.0)

res_high_n = mean_process_high_n(res)
print(res_high_n)

res_low_n = mean_process_low_n(res)


print("DP mean age (n=2000): {0}".format(res_high_n))
print("DP mean age (n=100): {0}".format(res_low_n))

20.548937287969615
DP mean age (n=2000): 20.548937287969615
DP mean age (n=100): 395.40017593788036


Let's examine what is actually happening when these values are provided.
When we provide all of the metadata arguments (`lower`, `upper`, `n`) to the function `make_bounded_mean`,
it works as a convenience method that knits together a number of library components to provide a mean.  A clamping,
imputation and resize step are run on the dataset, in order for the validator to certify the analysis is privacy
preserving (for more detail see the notebook "data_analysis_tutorial").
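One way to read the two results above is as plain arithmetic. The non-private mean with `n=1000` was 39.65, so the clamped ages sum to about 39650. Dividing that sum by 2000 gives 19.825 and dividing by 100 gives 396.5, which is close to the noisy releases of roughly 20.5 and 395.4. The sketch below assumes the mean is computed as the sum divided by the supplied `n` with no resize applied; that assumption is consistent with the outputs shown but is not confirmed by the text. The helper `mean_with_asserted_n` and the stand-in data are hypothetical.

```python
def mean_with_asserted_n(clamped_values, n):
    """Divide the sum of the clamped values by the analyst-supplied n (no resize)."""
    return sum(clamped_values) / n


# Stand-in for the 1000 clamped ages: any values whose mean is 39.65 would do
clamped_ages = [39.65] * 1000

for asserted_n in (1000, 2000, 100):
    estimate = mean_with_asserted_n(clamped_ages, asserted_n)
    print("n={0}: {1:.3f}".format(asserted_n, estimate))
```

Under this reading, an overstated `n` shrinks the estimate toward zero and an understated `n` inflates it, even past the clamping bounds.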

In [4]:
#TODO: need resize for this

### Analysis with no provided dataset size
If we do not believe we have an accurate estimate for `n` we can instead pay for a query on the dataset to release
a differentially private value of the dataset size.  Then we can use that estimate in the rest of the analysis.
Here is an example:

In [5]:
count_preprocessor = (
    # Convert data into Vec<Vec<String>>
    make_split_dataframe(separator=",", col_names=var_names) >>
    # Selects a column of df, Vec<str>
    make_select_column(key="age", T=str) >>
    # Cast the column as Vec<Int>
    make_cast(TI=str, TO=int) >>
    # Impute missing values to 0
    make_impute_constant(0) >>
    # Count the number of records
    make_count(TIA=int, TO=int) >>
    # Add geometric noise to the count to make the release private
    make_base_geometric(scale=1., D=AllDomain[int])
)

count = count_preprocessor(data)
print("DP number of records: {0}".format(count))

res = preprocessor(data)
mean_preprocessor = make_bounded_mean(20., 50., n=1000, T=float) >> make_base_laplace(scale=1.0)
mean = mean_preprocessor(res)
print("DP mean of age: {0}".format(mean))

DP number of records: 1000
DP mean of age: 40.509293244798535


Note that our privacy usage has increased because we spent some epsilon on each of two releases: the count of the
dataset and the mean of the dataset.
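A rough accounting of that combined cost might look like the sketch below. It assumes basic sequential composition (epsilons add) and the usual relation epsilon = sensitivity / scale for Laplace or geometric noise. The sensitivities used here are also assumptions: 1 for a count when one row is added or removed, and (upper - lower) / n for a mean over a known `n` when one row is replaced. The helper `epsilon_for_scale` is a hypothetical name, not an OpenDP function.

```python
def epsilon_for_scale(sensitivity, scale):
    """Pure-DP epsilon implied by additive noise with the given scale."""
    return sensitivity / scale


# Count release: geometric noise with scale=1, assumed sensitivity of 1
count_epsilon = epsilon_for_scale(sensitivity=1.0, scale=1.0)

# Mean release: ages clamped to [20, 50], n=1000, Laplace noise with scale=1
mean_sensitivity = (50.0 - 20.0) / 1000
mean_epsilon = epsilon_for_scale(mean_sensitivity, scale=1.0)

# Assumed basic composition: the total cost is the sum of the parts
total_epsilon = count_epsilon + mean_epsilon
print("count: {0}, mean: {1}, total: {2}".format(count_epsilon, mean_epsilon, total_epsilon))
```

With these scales almost all of the budget goes to the count, so in practice the scales would be chosen to reflect how the analyst wants to split epsilon between the two releases.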

### OpenDP `resize` vs. other approaches
The standard formula for the mean of a variable is:
$\bar{x} = \frac{\sum{x}}{n}$

The conventional, and simpler, approach in the differential privacy literature is to:

1. compute a DP sum of the variable for the numerator
2. compute a DP count of the dataset rows for the denominator
3. take their ratio
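The three steps above can be sketched in plain Python. This is an illustration only, not OpenDP code: `laplace_noise` and `plug_in_mean` are hypothetical helpers, the noise sampler ignores floating-point vulnerabilities, and the sensitivities assume that neighboring datasets differ by adding or removing one row.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def plug_in_mean(values, lower, upper, eps_sum, eps_count, rng):
    clamped = [min(max(v, lower), upper) for v in values]
    # 1. DP sum: one added or removed row shifts the sum by at most max(|lower|, |upper|)
    sum_sensitivity = max(abs(lower), abs(upper))
    dp_sum = sum(clamped) + laplace_noise(sum_sensitivity / eps_sum, rng)
    # 2. DP count: one added or removed row shifts the count by at most 1
    dp_count = len(clamped) + laplace_noise(1.0 / eps_count, rng)
    # 3. Ratio of the two noisy releases, with no further correction
    return dp_sum / dp_count


rng = random.Random(0)
ages = [20.0 + (i % 31) for i in range(1000)]  # synthetic stand-in data
estimate = plug_in_mean(ages, 20.0, 50.0, eps_sum=0.5, eps_count=0.5, rng=rng)
print(estimate)
```

Because both numerator and denominator are noisy, the ratio is a quotient of two random variables, and a noisy count close to zero can produce a wildly large estimate on small datasets.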

This is sometimes called a 'plug-in' approach, as we are plugging in differentially private answers for each of the
terms in the original formula, without any additional modifications, and using the resulting answer as our
estimate while ignoring the noise processes of differential privacy. While this 'plug-in' approach does result in a
differentially private value, the utility here is generally lower than the solution in OpenDP.  Because the number of
terms summed in the numerator does not agree with the value in the denominator, the variance is increased and the
resulting distribution becomes both biased and asymmetrical, which is visually noticeable in smaller samples.

We see that for the same privacy loss, the distribution of answers from OpenDP's resizing approach to the mean is tighter around the true dataset value (thus lower in error) than the conventional plug-in approach.
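For comparison with the plug-in recipe, the resize idea might be sketched as follows: release a DP count first, force the data to have exactly that many rows, then release a mean whose denominator is the same public number. This is a simplified sketch under stated assumptions and not OpenDP's implementation. The `resize` helper, the choice to pad with the midpoint of the bounds, and the rounding of the noisy count are all hypothetical choices made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def resize(values, n, fill):
    """Truncate or pad the data so that exactly n values remain."""
    return values[:n] + [fill] * max(0, n - len(values))


def resized_dp_mean(values, lower, upper, eps_count, eps_mean, rng):
    clamped = [min(max(v, lower), upper) for v in values]
    # Release a noisy count, post-processed to a positive integer
    noisy_count = len(clamped) + laplace_noise(1.0 / eps_count, rng)
    dp_count = max(1, round(noisy_count))
    # Resize to the released count (assumption: pad with the midpoint of the bounds)
    sized = resize(clamped, dp_count, fill=(lower + upper) / 2)
    # With exactly dp_count rows, replacing one row moves the mean by at most
    # (upper - lower) / dp_count
    mean_scale = (upper - lower) / dp_count / eps_mean
    return sum(sized) / dp_count + laplace_noise(mean_scale, rng)


rng = random.Random(0)
ages = [20.0 + (i % 31) for i in range(1000)]  # synthetic stand-in data
estimate = resized_dp_mean(ages, 20.0, 50.0, eps_count=0.5, eps_mean=0.5, rng=rng)
print(estimate)
```

The design point is that the number of terms in the numerator always matches the denominator, so the only noise in the final ratio is the single Laplace draw added to the mean.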

*Note: in these simulations, we've shown an equal division of the epsilon across all constituent releases, but higher utility (lower error) can generally be gained by moving more of the epsilon into the sum and using less in the count of the dataset rows, as in earlier examples.*