spark-privacy

Distributed differential privacy on top of Apache Spark and Google's differential-privacy Java API.

This code is an experimental work-in-progress. The API is subject to change and is expected to break compatibility.

Build

The library can be built with:

./gradlew build

Private aggregations

The goal of this repo is to explore scalable approaches for DP primitives.

spark-privacy wraps the Java API of differential-privacy. Basic Laplace mechanism support is currently provided for the BoundedMean, BoundedQuantiles, BoundedSum, and Count aggregations.
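The Bounded* aggregations limit the influence of any single record by clamping each contribution to a caller-supplied range before aggregating, which bounds the sensitivity of the result. A minimal self-contained sketch of that idea (illustration only, not the library's code):

```scala
object ContributionBounding {
  // Clamp a value into [lower, upper]. After clamping, a single record can
  // shift the sum by at most max(|lower|, |upper|), bounding the sensitivity.
  def clamp(value: Double, lower: Double, upper: Double): Double =
    math.max(lower, math.min(upper, value))

  // A bounded sum: every contribution is clamped before being added.
  def boundedSum(values: Seq[Double], lower: Double, upper: Double): Double =
    values.map(clamp(_, lower, upper)).sum
}
```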

The upstream implementation is well suited to divide-and-conquer computations: the aggregation objects behave like monoids, so partial results can be reduced per partition and merged in any order.
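To illustrate what "behaves like a monoid" means here, consider a self-contained sketch (not the library's API): an accumulator with an identity element and an associative merge, where Laplace noise is drawn only once, when the final result is computed.

```scala
import scala.util.Random

// Illustrative noisy-count monoid (NOT the differential-privacy library):
// raw counts merge associatively; noise is added once at result time.
final case class NoisyCount(raw: Long) {
  def increment: NoisyCount = NoisyCount(raw + 1)

  // Associative, commutative merge: partial counts combine in any order.
  def merge(other: NoisyCount): NoisyCount = NoisyCount(raw + other.raw)

  // Laplace mechanism with scale = sensitivity / epsilon.
  def result(epsilon: Double, sensitivity: Double = 1.0,
             rng: Random = new Random()): Long = {
    val u = rng.nextDouble() - 0.5
    val noise =
      -(sensitivity / epsilon) * math.signum(u) * math.log(1 - 2 * math.abs(u))
    math.round(raw + noise)
  }
}

object NoisyCount {
  // Identity element of the monoid.
  val zero: NoisyCount = NoisyCount(0L)
}
```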

spark-privacy exposes primitives as Spark UDAFs, using the typed Dataset Aggregator API.

A PrivateCount monoid can be implemented with:

import com.google.privacy.differentialprivacy.Count
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class PrivateCount[T](epsilon: Double, contribution: Int) extends Aggregator[T, Count, Long] {
  // A fresh, zero-valued Count accumulator parameterised with the privacy budget.
  override def zero: Count = Count.builder()
    .epsilon(epsilon)
    .maxPartitionsContributed(contribution)
    .build()

  // Add one record to the running count.
  override def reduce(b: Count, a: T): Count = {
    b.increment()
    b
  }

  // Combine two partial counts; this is what makes the aggregation monoidal.
  override def merge(b1: Count, b2: Count): Count = {
    b1.mergeWith(b2.getSerializableSummary)
    b1
  }

  // Noise is added once, when the final result is computed.
  override def finish(reduction: Count): Long = reduction.computeResult().toLong

  // Count is not a Spark-native type, so serialize the buffer with Kryo.
  override def bufferEncoder: Encoder[Count] = Encoders.kryo[Count]

  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

Note that this implementation violates the expected Aggregator input for typed Datasets (a set of Rows). This is currently intended behaviour: UDAFs are expected to be applied to DataFrames by registering them with functions.udaf. This boundary should be made implicit in our implementation.

A PrivateCount aggregation can be performed on a group of rows like any built-in aggregate function:

val privateCount = new PrivateCount[Long](epsilon, maxContributions)
val privateCountUdf = functions.udaf(privateCount)

dataFrame.groupBy("Day").agg(privateCountUdf($"VisitorId") as "private_count").show

Or with the equivalent SparkSQL API:

spark.udf.register("private_count", privateCountUdf)
dataFrame.createOrReplaceTempView("visitors")
spark.sql("select Day, private_count(VisitorId) from visitors group by Day")

Example

An example of private aggregation is available under examples.
