## Model For Choosing $\epsilon$

This notebook implements the model presented in {cite}`DBLP:journals/corr/HsuGHKNPR14` for setting $\epsilon$ and the other key parameters of a differentially-private study.

It first tries to replicate the results mentioned in the various cost scenarios in Section 5.2 to ensure that we have implemented the model correctly.

Then, it applies the model to the specific use-case under consideration in the report.

Each section is based heavily on Sections 4, 5.1, 5.2 and ... of ibid.

Derivation of the various key equations can be found in the paper; this notebook will simply identify and highlight the equations that are relevant for determining what values key parameters must take in a given study. More detailed explanations are provided in the report.

### Defining Model Parameters


First, we will define the model parameters for a simple analysis case where we, as analysts, want to estimate the population mean $\mu$ i.e., the proportion of some population that has some property P.

We want to conduct the study for this estimate in a differentially-private way, aiming to satisfy the following equation with the mechanism $M$ that we use:

$$
\operatorname{Pr}[M(D) \in S] \leq e^{\varepsilon} \cdot \operatorname{Pr}\left[M\left(D^{\prime}\right) \in S\right] 
$$

Here $D$ and $D^{\prime}$ are adjacent datasets (in the authors' definition, datasets of the same size that differ in the contents of one record), and $S$ is a set of possible outputs of the mechanism $M$. The inequality must hold for every $S$ and every pair of adjacent datasets $D$ and $D^{\prime}$.
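As a concrete check of this inequality, the sketch below assumes that $M$ is the Laplace mechanism applied to the sample mean of 0/1 records. This is an assumption made for illustration only; the section itself does not fix a mechanism, and the dataset size, counts and $\varepsilon$ are made-up values. Because the ratio of output densities bounds the ratio of probabilities for every $S$, it is enough to compare the densities pointwise.

```python
from math import exp

def laplace_pdf(x: float, centre: float, scale: float) -> float:
    """Density of a Laplace distribution with the given centre and scale."""
    return exp(-abs(x - centre) / scale) / (2 * scale)

# Hypothetical setup: N records, each 0 or 1; M(D) = mean(D) + Laplace noise.
N = 100
epsilon = 0.5
scale = 1 / (N * epsilon)  # sensitivity of the mean (1/N) divided by epsilon

mean_D = 40 / N        # D: 40 records have property P
mean_D_prime = 41 / N  # D': identical to D except for one record

# Largest density ratio over a grid of possible outputs in [0, 1].
outputs = [i / 1000 for i in range(1001)]
worst_ratio = max(
    laplace_pdf(x, mean_D, scale) / laplace_pdf(x, mean_D_prime, scale)
    for x in outputs
)

# Small relative tolerance to absorb floating-point rounding.
print(worst_ratio <= exp(epsilon) * (1 + 1e-9))
```

The ratio reaches $e^{\varepsilon}$ for outputs on the far side of $D$'s mean, which is why the noise scale has to grow as $\varepsilon$ shrinks.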

The Key Parameters:
* $\epsilon$ - the *privacy budget* of our study
* $B$ - the budget the analyst has for compensating participants
* $N$ - number of participants in study
* $D_N$ - *sample*-- private database formed by contributions of $N$ participants
* $g(D_N)$ - *calculated sample mean*-- proportion of participants with property P
* $T$ - *desired error* for our study
* $A(\epsilon, N)$ - *failure probability*-- probability that the error of the mechanism's output exceeds $T$
* $\alpha$ - *target accuracy*-- the largest failure probability we are willing to accept


### Key Equations

#### Budget Constraint

Participants need to be compensated to incentivise them to take part in a study. Each individual needs to be paid $(e^{\epsilon} - 1)E$ (the worst-case increase in their expected cost from participating), where $E$ is their base expected cost. With $N$ participants, the analyst's budget must therefore satisfy:

$$
(e^{\epsilon} - 1)EN \le B
$$

Below, we implement this in Python code:

In [2]:
from math import exp

def within_budget(epsilon: float, N: int, expected_cost: float, budget: float) -> bool:
    # Total compensation: per-participant payment times the number of participants.
    return ((exp(epsilon) - 1) * expected_cost * N) <= budget

#### Accuracy Constraint

At the same time, analysts have to ensure that their study affords them a sufficiently accurate estimate of their target metric (in this case, the population mean). That is represented by:

$$
A(\varepsilon, N):=2 \exp \left(-\frac{N T^{2}}{12}\right)+\exp \left(-\frac{T N \varepsilon}{2}\right) \leq \alpha
$$


Below is the implementation in Python code:

In [1]:
def within_accuracy_constraint(epsilon: float, N: int, desired_error: float, accuracy_constraint: float) -> bool:
    first_term = 2 * exp(-1 * (N * ((desired_error**2)) / 12))
    second_term = exp(-1 * ((desired_error * N * epsilon) / 2))
    return (first_term + second_term) <= accuracy_constraint


The goal is to find $\epsilon$ and $N$ values that satisfy these two constraints.
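For a fixed $N$, both constraints can be rearranged into bounds on $\varepsilon$, which gives the range of admissible values directly. The sketch below is a minimal illustration, taking the budget constraint in its total-cost form $(e^{\varepsilon} - 1)EN \le B$; the function name `epsilon_interval` and the example values in the usage note are hypothetical, not taken from the paper.

```python
from math import exp, log
from typing import Optional, Tuple

def epsilon_interval(
    N: int, T: float, alpha: float, E: float, B: float
) -> Optional[Tuple[float, float]]:
    """Return (lowest, highest) epsilon meeting both constraints for this N, or None."""
    # Accuracy: exp(-T*N*eps/2) <= alpha - 2*exp(-N*T^2/12)
    slack = alpha - 2 * exp(-N * T**2 / 12)
    if slack <= 0:
        return None  # the first term alone already exceeds alpha
    lowest = max(0.0, -2 * log(slack) / (T * N))
    # Budget: (exp(eps) - 1) * E * N <= B
    highest = log(1 + B / (E * N))
    return (lowest, highest) if lowest <= highest else None
```

For example, `epsilon_interval(25000, 0.05, 0.05, 100.0, 30000.0)` returns a non-empty interval, while `epsilon_interval(100, 0.05, 0.05, 100.0, 30000.0)` returns `None` because 100 participants are too few for the accuracy target regardless of $\varepsilon$.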

#### Sufficient Conditions For Feasible $N$ and $\epsilon$ Values

The authors introduce a sufficient condition for feasible $\epsilon$ and $N$ values:

$$
\begin{aligned} 3 \exp \left(\frac{-N T^{2}}{12}\right) & \leq \alpha \\\left(e^{\varepsilon}-1\right) E N & \leq B \end{aligned}
$$


That is, any $\epsilon$ and $N$ satisfying these inequalities are feasible for a study within the aforementioned accuracy and budget constraints. The conditions are sufficient but not necessary: failing to find $\epsilon$ and $N$ that satisfy them does not mean that no feasible values exist. To settle that question, one would need to check the [accuracy](#accuracy-constraint) and [budget](#budget-constraint) constraints directly.

These can be solved for bounds on $N$ and $\epsilon$:

$$
N \geq \frac{12}{T^{2}} \ln \frac{3}{\alpha}
$$

and 

$$
\frac{T}{6} \leq \varepsilon \leq \ln \left(1+\frac{B T^{2}}{12 E \ln \frac{3}{\alpha}}\right)
$$
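One way to see where these bounds come from is sketched below. It is a short piece of algebra on the constraints stated in this notebook, not a substitute for the derivation in the paper.

$$
\begin{aligned}
\varepsilon \geq \frac{T}{6} &\implies \exp\left(-\frac{T N \varepsilon}{2}\right) \leq \exp\left(-\frac{N T^{2}}{12}\right) \implies A(\varepsilon, N) \leq 3 \exp\left(-\frac{N T^{2}}{12}\right) \\
3 \exp\left(-\frac{N T^{2}}{12}\right) \leq \alpha &\iff N \geq \frac{12}{T^{2}} \ln \frac{3}{\alpha} \\
\left(e^{\varepsilon}-1\right) E N \leq B &\iff \varepsilon \leq \ln\left(1+\frac{B}{E N}\right)
\end{aligned}
$$

Substituting the smallest admissible $N$ into the last line gives the upper bound on $\varepsilon$ shown above.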

Below is the implementation in Python code:

In [5]:
from math import log # natural log by default

def parameters_feasible_for_accuracy(N: int, desired_error: float, accuracy_constraint: float) -> bool:
    return (3 * exp(-1 * (N * (desired_error**2)) / 12)) <= accuracy_constraint

def parameters_feasible_for_budget(N: int, expected_cost: float, epsilon: float, budget: float) -> bool:
    return (((exp(epsilon) - 1) * expected_cost) * N) <= budget

def lower_bound_for_N(desired_error: float, accuracy_constraint: float) -> float:
    return (12 / (desired_error**2)) * log(3 / accuracy_constraint)

def lower_bound_for_epsilon(desired_error: float) -> float:
    return desired_error / 6

def max_value_for_epsilon(budget: float, desired_error: float, expected_cost: float, accuracy_constraint: float) -> float:
    return log(1 + (budget * (desired_error**2)) / (12 * expected_cost * log(3 / accuracy_constraint)))


#### Bound On Base Cost E

From the equation:

$$
\varepsilon \leq \ln \left(1+\frac{B T^{2}}{12 E \ln \frac{3}{\alpha}}\right)
$$

Solving this inequality for $E$ gives:

$$
E \leq \frac{B T^2}{12 \ln \frac{3}{\alpha} (e^{\varepsilon} - 1)}
$$

This is the largest base expected cost for which the sufficient conditions can be met at a given $\varepsilon$; the bound is loosest at the smallest admissible value, $\varepsilon = T/6$. If participants have a base expected cost $E$ above this value, the sufficient conditions cannot be satisfied, and feasibility has to be checked against the full constraints.

Below is the code implementation:

In [7]:
def bound_on_base_cost_E(budget: float, desired_error: float, accuracy_constraint: float, epsilon: float): 
    return (budget * (desired_error**2)) / (12 * log(3 / accuracy_constraint) * (exp(epsilon) - 1))

Let's do a sanity check for our implementations.

The authors offer the following illustration at the end of ibid. Section 5.1. We plug in the same values and check that we get the same result:

In [11]:
T = 0.05
a = 0.05
epsilon = T / 6
B = 3.0 * (10**4)

print(f'Given a desired error of {T}, accuracy_constraint of {a}, and budget of {B}:\n')
print(f'Lower bound for N: {lower_bound_for_N(T, a)}') # should be ~20000
print(f'Max value for Base Cost E: {bound_on_base_cost_E(B, T, a, epsilon)}') # should be ~182

Given a desired error of 0.05, accuracy_constraint of 0.05, and budget of 30000.0

Lower bound for N: 19652.85389866608
Max value for Base Cost E: 182.41731464520467
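As a further check, the sketch below plugs the same $T$, $\alpha$ and $B$ back into the constraints themselves, rounding $N$ up to a whole number of participants. The helper names `accuracy_ok` and `budget_ok` are hypothetical, and the budget constraint is taken in its total-cost form $(e^{\varepsilon} - 1)EN \le B$.

```python
from math import ceil, exp, log

T, alpha, B = 0.05, 0.05, 3.0e4
epsilon = T / 6
N = ceil((12 / T**2) * log(3 / alpha))  # smallest whole N meeting the bound

def accuracy_ok(eps: float, n: int) -> bool:
    """Accuracy constraint: A(eps, n) <= alpha."""
    return 2 * exp(-n * T**2 / 12) + exp(-T * n * eps / 2) <= alpha

def budget_ok(eps: float, n: int, base_cost: float) -> bool:
    """Budget constraint: total compensation does not exceed B."""
    return (exp(eps) - 1) * base_cost * n <= B

print(N)
print(accuracy_ok(epsilon, N))
print(budget_ok(epsilon, N, 182.0))  # just under the bound on E
print(budget_ok(epsilon, N, 183.0))  # just over the bound on E
```

With these values the accuracy constraint holds, and the budget constraint holds for a base cost of 182 but not for 183, which agrees with the bound of roughly 182 computed above.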


### Considering Cost Scenarios

Now that we have implementations of the key equations in the paper's model, let us consider how to evaluate the feasibility of a study in our use-case given a particular cost scenario.

In Section 5.2, the method for considering each cost scenario is this: given our aforementioned $T$ (desired error), $B$ (budget), $\alpha$ (accuracy_constraint) and base $\varepsilon$ values (from [here](#sufficient-conditions-for-feasible-n-and-epsilon-values)):
* what is the expected base cost of a prospective participant in the given scenario?
* does that fit within the [bound](#bound-on-base-cost-e) on E that our model describes? If so, the study is feasible.
* if not, to determine definitively whether a study is feasible, we plug the parameter values into the [budget](#budget-constraint) and [accuracy](#accuracy-constraint) constraints and check (via a numerical solver) whether any solutions exist. If none do, the study is definitely infeasible.
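The last step above can be sketched without a general-purpose solver. The failure probability decreases as $\varepsilon$ grows, so for each $N$ it is enough to test the largest $\varepsilon$ the budget allows. This is a brute-force sketch rather than the paper's method: the function names and the `max_N` cut-off are assumptions, the budget constraint is taken as $(e^{\varepsilon} - 1)EN \le B$, and a `False` result only means that no solution was found for $N$ up to `max_N`.

```python
from math import exp, log

def failure_probability(eps: float, N: int, T: float) -> float:
    """A(eps, N) from the accuracy constraint."""
    return 2 * exp(-N * T**2 / 12) + exp(-T * N * eps / 2)

def study_is_feasible(
    E: float, B: float, T: float, alpha: float, max_N: int = 200_000
) -> bool:
    """Search N = 1..max_N for a pair (eps, N) meeting both constraints."""
    for N in range(1, max_N + 1):
        # Largest epsilon affordable with N participants.
        eps_max = log(1 + B / (E * N))
        # A is decreasing in eps, so eps_max is the best case for this N.
        if failure_probability(eps_max, N, T) <= alpha:
            return True
    return False

print(study_is_feasible(E=182.0, B=3.0e4, T=0.05, alpha=0.05))
print(study_is_feasible(E=1000.0, B=3.0e4, T=0.05, alpha=0.05))
```

With the parameter values from the sanity check, a base cost of 182 is reported as feasible and a base cost of 1000 is not.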



#### Cost Scenario In The Use-Case

One way the model could be applied to the use case is in considering an individual who is deciding whether or not to use the mood app. For our use case, the compensation consists in benefitting from the services the app provides; the budget has already been spent and the app providers are going to run their differentially-private studies on the data of users.

So, the decision is whether or not to participate in their study (or studies) by using the app yourself.

We follow a similar procedure as in the paper to first consider the cost scenario. 

Let's first consider what the expected base cost of a prospective participant in the scenario might be.

The authors give an example in Section 5.2 which we could use to illustrate this.

[go connect it to the 'Social Networks' example]:# (social-network eg)