Skip to content
Branch: master
Find file History
Differential Privacy Team and dasmdasm Project import
Latest commit 67f1bc5 Sep 5, 2019
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
BUILD Project import Sep 5, 2019
README.md Project import Sep 5, 2019
anon_func--1.0.0.sql Project import Sep 5, 2019
anon_func.cc Project import Sep 5, 2019
anon_func.control Project import Sep 5, 2019
dp_func.cc Project import Sep 5, 2019
dp_func.h Project import Sep 5, 2019
dp_func_test.cc Project import Sep 5, 2019
install_extension.sh Project import Sep 5, 2019
postgres.BUILD Project import Sep 5, 2019

README.md

Anonymous Functions PostgreSQL Extension

This subdirectory contains a PostgreSQL extension providing several epsilon-DP aggregate functions. We will refer to them as the anonymous functions.

Setup

  • Install Postgres 11.

  • Set $PG_DIR to the directory of your postgres installation. E.g.

    export PG_DIR=/usr/local/pgsql
    
  • From the base directory of the differential library repo, run

    differential_privacy/postgres/install_extension.sh
    
  • Start the Postgres server.

  • In PostgreSQL, load the extension by calling

    CREATE EXTENSION anon_func;
    

Anonymous Functions

We offer a suite of anonymous functions. Each function wraps around the corresponding algorithm in the differential privacy library. Notice that these functions, like their algorithms, assume that each entry is owned by a distinct user.

The first argument for each function is the column over which the aggregation is performed. Each function may also take a couple of additional parameters. They must be passed as literal values.

  • epsilon: The differential privacy parameter for the function.
  • lower: A lower bound for the input data set.
  • upper: An upper bound for the input data set.

The lower and upper bounds are used to determine the sensitivity for the anonymous function. In the event that manual bounds are not entered, some functions can automatically infer the bounds, provided that there is enough input data. If there is not enough data, an error message to that effect will be displayed. For more information on automatic bounding, see the ApproxBounds documentation

Count

ANON_COUNT(column)
ANON_COUNT(column, epsilon)

The count function can be called on a column of any type. The epsilon can be optionally provided; otherwise, the default epsilon is used. An integer count is returned.

Sum, Average, Variance, Standard Deviation

ANON_SUM(column)
ANON_SUM(column, epsilon)
ANON_SUM_WITH_BOUNDS(column, lower, upper)
ANON_SUM_WITH_BOUNDS(column, lower, upper, epsilon)

ANON_SUM can be called with any numeric type. For integer types, the sum returned is an integer. For floating point types, it is a double. If ANON_SUM is called without bounds, then the data is automatically bounded. If ANON_SUM_WITH_BOUNDS is called, then the data is bounded using the provided explicit bounds. Like count, epsilon is optionally configurable.

ANON_AVG(column)
ANON_AVG(column, epsilon)
ANON_AVG_WITH_BOUNDS(column, lower, upper)
ANON_AVG_WITH_BOUNDS(column, lower, upper, epsilon)

ANON_VAR(column)
ANON_VAR(column, epsilon)
ANON_VAR_WITH_BOUNDS(column, lower, upper)
ANON_VAR_WITH_BOUNDS(column, lower, upper, epsilon)

ANON_STDDEV(column)
ANON_STDDEV(column, epsilon)
ANON_STDDEV_WITH_BOUNDS(column, lower, upper)
ANON_STDDEV_WITH_BOUNDS(column, lower, upper, epsilon)

The ANON_AVG, ANON_VAR, and ANON_STDDEV functions are like ANON_SUM, but the return type is always double.

Ntile

ANON_NTILE(column, percentile, lower, upper)
ANON_NTILE(column, percentile, lower, upper, epsilon)

ANON_NTILE accepts any numeric type. For integer types, an integer will be returned. Otherwise, a double is returned. Unlike the other bounded functions, ANON_NTILE requires bounds. Automatic bounding is not supported.

User-Level Differentially Private Queries

To write user-level differentially private queries, a two-stage aggregation may be used, as described in "Differentially Private SQL with Bounded User Contribution" (Wilson et al.). In this section, we provide several examples for manually rewriting regular PostgreSQL queries into queries with user-level differential privacy. For an in-depth description of the transformations made, refer to the Wilson et al. paper.

Simple Count

Imagine a table that records each instance of a person eating a fruit, call it FruitEaten.

Column Type Description
uid integer Uniquely identifies a person.
fruit varchar(20) The name of the fruit the person ate.

In this table, each row represents one fruit eaten. So if person 1 eats two apples, then there will be two rows in the table with column values (1, apple). Consider a simple query counting how many of each fruit have been eaten.

SELECT fruit, COUNT(fruit)
FROM FruitEaten
GROUP BY fruit;

Suppose that instead of getting the regular count, we want the differentially private count with the privacy parameter ε=ln(3). The final product of the query rewrite would be

SELECT result.fruit, result.number_eaten
FROM (
  SELECT per_person.fruit,
    ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten,
    ANON_COUNT(uid, LN(3)/2) as number_eaters
    FROM(
      SELECT * , ROW_NUMBER() OVER (
        PARTITION BY uid
        ORDER BY random()
      ) as row_num
      FROM (
        SELECT fruit, uid, COUNT(fruit) as fruit_count
        FROM FruitEaten
        GROUP BY fruit, uid
      ) as per_person_raw
    ) as per_person
  WHERE per_person.row_num <= 5
  GROUP BY per_person.fruit
) as result
WHERE result.number_eaters > 50;

As we can see, there are four SELECTS in the query. We will explain them from inner-most to outer-most. The following steps were taken to rewrite the query:

  • Construct the 1st and inner-most SELECT, aliased as per_person_raw.

    SELECT fruit, uid, COUNT(fruit) as fruit_count
    FROM FruitEaten
    GROUP BY fruit, uid;
    

    For each person, count how many of each fruit that person ate.

  • Construct the 2nd SELECT, aliased as per_person.

    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY uid
      ORDER BY random()
    ) as row_num
    FROM per_person_raw;
    

    For each person, per_person_raw contains rows corresponding to the fruits they have eaten. We shuffle these rows and assign them a row number. This will allow us to effectively reservior sample rows for each user by filtering by row number in the next step. This is similar to C_u thresholding in Wilson et al.

  • Construct the 3rd SELECT, aliased as result.

    SELECT per_person.fruit,
           ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten,
           ANON_COUNT(uid, LN(3)/2) as number_eaters
    FROM per_person
    WHERE per_person.row_num <= 5
    GROUP BY per_person.fruit;
    

    First, for each person, we ensure that the person only contributed to 5 fruit groups by filtering on the randomized row number generated in the previous step. Then, for each fruit, we sum the number of the fruit that each person ate. We also count the number of people who ate that fruit. This will allow us to ensure we only release the sums which enough people contributed to.

  • In order to anonymously grab the sum and count, we replace the aggregates SUM and COUNT with ANON_SUM and ANON_COUNT, respectively. Note that since we are performing two anonymous aggregations, we split our privacy parameter between them, using ε=ln(3) for both.

  • Construct the 4th and outer-most SELECT.

    SELECT result.fruit, result.number_eaten
    FROM result
    WHERE result.number_eaters > 50;
    
    

    For any fruit where the number of eaters was less than 50, discard the output result. This is similar to τ-thresholding in Wilson et al. In addition, we drop the number_eaters count, so that it does not display in the output. Dropping the count of unique users is not neccesary for differential privacy.

Multiple Aggregations

In our simple count example, we used a query containing a single anonymous function. For a query with N anonymous function calls, and with a desired total privacy parameter of ε, we need to use ε/(N+1) as the privacy parameter for each aggregation. This is for the N requested calls plus the additional anonymous unique-user count. For instance, consider the following dummy query:

SELECT COUNT(col1), SUM(col2)
FROM Table;

Suppose we want to use the privacy paramter ε=M. In the rewritten query, we have to make the following replacements

Original Replacement
COUNT(col1) ANON_COUNT(col1, M/3)
SUM(col2) ANON_SUM(col2, M/3)

This is because we have two requested anonymous functions, and an additional anonymous unique user count required when we perform the rewrite.

Bounding User Contribution

Consider again our fruit-eating example. Suppose we want to restrict the contribution of each person to the fruit-eaten counts by 5. So if a person has eaten more than 5 fruit, we want to count it as that they have eaten 5 fruit. To do this, add lower and upper bounds on the anonymous functions:

Original Replacement
ANON_SUM(per_person.fruit_count, LN(3)/2) ANON_SUM(per_person.fruit_count, 0, 5, LN(3)/2)

Query With Joins

In this section we will add a join to our query. In addition to the FruitEaten table, consider the following table, which we will call Shirts.

Column Type Description
uid integer Uniquely dentifies a person.
color varchar(20) The name of the person's shirt color.

Let's say we want to find out, for each shirt color, how many fruit all the people wearing that shirt color ate, altogether.

SELECT color, COUNT(fruit)
FROM FruitEaten f INNER JOIN Shirts s ON (f.uid = s.uid)
GROUP BY color;

We rewrite the query into its differentially private version:

SELECT result.color, result.number_eaten
FROM (
  SELECT per_person.color,
    ANON_SUM(per_person.fruit_count, LN(3)/2) as number_eaten,
    ANON_COUNT(uid, LN(3)/2) as number_eaters
    FROM(
      SELECT * , ROW_NUMBER() OVER (
        PARTITION BY uid
        ORDER BY random()
      ) as row_num
      FROM (
        SELECT color, f.uid as uid, COUNT(fruit) as fruit_count
        FROM FruitEaten f INNER JOIN Shirts s ON (f.uid = s.uid)
        GROUP BY color, f.uid
      ) as per_person_raw
    ) as per_person
  WHERE per_person.row_num <= 5
  GROUP BY per_person.color
) as result
WHERE result.number_eaters > 50;

The restriction with this method of rewriting is that joins must not create any rows of shared ownership. In the example above, this is satisfied.

You can’t perform that action at this time.