Anonymous Functions PostgreSQL Extension
This subdirectory contains a PostgreSQL extension providing several epsilon-DP aggregate functions. We will refer to them as the anonymous functions.
Install Postgres 11.
$PG_DIRto the directory of your postgres installation. E.g.
From the base directory of the differential library repo, run
Start the Postgres server.
In PostgreSQL, load the extension by calling
CREATE EXTENSION anon_func;
We offer a suite of anonymous functions. Each function wraps around the corresponding algorithm in the differential privacy library. Notice that these functions, like their algorithms, assume that each entry is owned by a distinct user.
The first argument for each function is the column over which the aggregation is performed. Each function may also take a couple of additional parameters. They must be passed as literal values.
epsilon: The differential privacy parameter for the function.
lower: A lower bound for the input data set.
upper: An upper bound for the input data set.
The lower and upper bounds are used to determine the sensitivity for the anonymous function. In the event that manual bounds are not entered, some functions can automatically infer the bounds, provided that there is enough input data. If there is not enough data, an error message to that effect will be displayed. For more information on automatic bounding, see the ApproxBounds documentation
ANON_COUNT(column) ANON_COUNT(column, epsilon)
The count function can be called on a
column of any type. The epsilon can be
optionally provided; otherwise, the default epsilon is used. An integer count is
Sum, Average, Variance, Standard Deviation
ANON_SUM(column) ANON_SUM(column, epsilon) ANON_SUM_WITH_BOUNDS(column, lower, upper) ANON_SUM_WITH_BOUNDS(column, lower, upper, epsilon)
ANON_SUM can be called with any numeric type. For integer types, the sum
returned is an integer. For floating point types, it is a double. If
is called without bounds, then the data is automatically bounded. If
ANON_SUM_WITH_BOUNDS is called, then the data is bounded using the provided
explicit bounds. Like count, epsilon is optionally configurable.
ANON_AVG(column) ANON_AVG(column, epsilon) ANON_AVG_WITH_BOUNDS(column, lower, upper) ANON_AVG_WITH_BOUNDS(column, lower, upper, epsilon) ANON_VAR(column) ANON_VAR(column, epsilon) ANON_VAR_WITH_BOUNDS(column, lower, upper) ANON_VAR_WITH_BOUNDS(column, lower, upper, epsilon) ANON_STDDEV(column) ANON_STDDEV(column, epsilon) ANON_STDDEV_WITH_BOUNDS(column, lower, upper) ANON_STDDEV_WITH_BOUNDS(column, lower, upper, epsilon)
ANON_STDDEV functions are like
the return type is always double.
ANON_NTILE(column, percentile, lower, upper) ANON_NTILE(column, percentile, lower, upper, epsilon)
ANON_NTILE accepts any numeric type. For integer types, an integer will be
returned. Otherwise, a double is returned. Unlike the other bounded functions,
ANON_NTILE requires bounds. Automatic bounding is not supported.
User-Level Differentially Private Queries
To write user-level differentially private queries, a two-stage aggregation may be used, as described in "Differentially Private SQL with Bounded User Contribution" (Wilson et al.). In this section, we provide several examples for manually rewriting regular PostgreSQL queries into queries with user-level differential privacy. For an in-depth description of the transformations made, refer to the Wilson et al. paper.
Imagine a table that records each instance of a person eating a fruit, call it
|uid||integer||Uniquely identifies a person.|
|fruit||varchar(20)||The name of the fruit the person ate.|
In this table, each row represents one fruit eaten. So if person
1 eats two
apples, then there will be two rows in the table with column values
(1, apple). Consider a simple query counting how many of each fruit have been
SELECT fruit, COUNT(fruit) FROM FruitEaten GROUP BY fruit;
Suppose that instead of getting the regular count, we want the differentially private count with the privacy parameter ε=ln(3). The final product of the query rewrite would be
SELECT result.fruit, result.number_eaten FROM ( SELECT per_person.fruit, ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten, ANON_COUNT(uid, LN(3)/2) as number_eaters FROM( SELECT * , ROW_NUMBER() OVER ( PARTITION BY uid ORDER BY random() ) as row_num FROM ( SELECT fruit, uid, COUNT(fruit) as fruit_count FROM FruitEaten GROUP BY fruit, uid ) as per_person_raw ) as per_person WHERE per_person.row_num <= 5 GROUP BY per_person.fruit ) as result WHERE result.number_eaters > 50;
As we can see, there are four
SELECTS in the query. We will explain them from
inner-most to outer-most. The following steps were taken to rewrite the query:
Construct the 1st and inner-most
SELECT, aliased as
SELECT fruit, uid, COUNT(fruit) as fruit_count FROM FruitEaten GROUP BY fruit, uid;
For each person, count how many of each fruit that person ate.
Construct the 2nd
SELECT, aliased as
SELECT *, ROW_NUMBER() OVER ( PARTITION BY uid ORDER BY random() ) as row_num FROM per_person_raw;
For each person,
per_person_rawcontains rows corresponding to the fruits they have eaten. We shuffle these rows and assign them a row number. This will allow us to effectively reservior sample rows for each user by filtering by row number in the next step. This is similar to C_u thresholding in Wilson et al.
Construct the 3rd
SELECT, aliased as
SELECT per_person.fruit, ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten, ANON_COUNT(uid, LN(3)/2) as number_eaters FROM per_person WHERE per_person.row_num <= 5 GROUP BY per_person.fruit;
First, for each person, we ensure that the person only contributed to 5 fruit groups by filtering on the randomized row number generated in the previous step. Then, for each fruit, we sum the number of the fruit that each person ate. We also count the number of people who ate that fruit. This will allow us to ensure we only release the sums which enough people contributed to.
In order to anonymously grab the sum and count, we replace the aggregates
ANON_COUNT, respectively. Note that since we are performing two anonymous aggregations, we split our privacy parameter between them, using ε=ln(3) for both.
Construct the 4th and outer-most
SELECT result.fruit, result.number_eaten FROM result WHERE result.number_eaters > 50;
For any fruit where the number of eaters was less than 50, discard the output result. This is similar to τ-thresholding in Wilson et al. In addition, we drop the
number_eaterscount, so that it does not display in the output. Dropping the count of unique users is not neccesary for differential privacy.
In our simple count example, we used a query containing a single anonymous function. For a query with N anonymous function calls, and with a desired total privacy parameter of ε, we need to use ε/(N+1) as the privacy parameter for each aggregation. This is for the N requested calls plus the additional anonymous unique-user count. For instance, consider the following dummy query:
SELECT COUNT(col1), SUM(col2) FROM Table;
Suppose we want to use the privacy paramter ε=M. In the rewritten query, we have to make the following replacements
This is because we have two requested anonymous functions, and an additional anonymous unique user count required when we perform the rewrite.
Bounding User Contribution
Consider again our fruit-eating example. Suppose we want to restrict the
contribution of each person to the fruit-eaten counts by
5. So if a person has
eaten more than
5 fruit, we want to count it as that they have eaten
fruit. To do this, add lower and upper bounds on the anonymous functions:
Query With Joins
In this section we will add a join to our query. In addition to the
table, consider the following table, which we will call
|uid||integer||Uniquely dentifies a person.|
|color||varchar(20)||The name of the person's shirt color.|
Let's say we want to find out, for each shirt color, how many fruit all the people wearing that shirt color ate, altogether.
SELECT color, COUNT(fruit) FROM FruitEaten f INNER JOIN Shirts s ON (f.uid = s.uid) GROUP BY color;
We rewrite the query into its differentially private version:
SELECT result.color, result.number_eaten FROM ( SELECT per_person.color, ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten, ANON_COUNT(uid, LN(3)/2) as number_eaters FROM( SELECT * , ROW_NUMBER() OVER ( PARTITION BY uid ORDER BY random() ) as row_num FROM ( SELECT color, f.uid as uid, COUNT(fruit) as fruit_count FROM FruitEaten f INNER JOIN Shirts s ON (f.uid = s.uid) GROUP BY color, f.uid ) as per_person_raw ) as per_person WHERE per_person.row_num <= 5 GROUP BY per_person.color ) as result WHERE result.number_eaters > 50;
The restriction with this method of rewriting is that joins must not create any rows of shared ownership. In the example above, this is satisfied.