spark-privacy

Distributed differential privacy on top of Apache Spark and Google's differential-privacy Java API.

This code is an experimental work-in-progress. The API is subject to change and is expected to break compatibility.

Build

The library can be built with:

./gradlew build

Private aggregations

The goal of this repo is to explore scalable approaches for DP primitives.

spark-privacy wraps the Java API of differential-privacy. Basic Laplace mechanism support is currently provided for the BoundedMean, BoundedQuantiles, BoundedSum, and Count aggregations.
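The Bounded* aggregations limit the influence of any single record by clamping each contribution to a caller-supplied range before aggregating, which bounds the sensitivity of the result. A minimal self-contained sketch of that idea (illustration only, not the library's code):

```scala
object ContributionBounding {
  // Clamp a value into [lower, upper]. After clamping, a single record can
  // shift the sum by at most max(|lower|, |upper|), bounding the sensitivity.
  def clamp(value: Double, lower: Double, upper: Double): Double =
    math.max(lower, math.min(upper, value))

  // A bounded sum: every contribution is clamped before being added.
  def boundedSum(values: Seq[Double], lower: Double, upper: Double): Double =
    values.map(clamp(_, lower, upper)).sum
}
```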

The upstream implementation is well suited to divide-and-conquer computations: the aggregation objects behave like monoids, so partial results can be reduced per partition and merged in any order.
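To illustrate what "behaves like a monoid" means here, consider a self-contained sketch (not the library's API): an accumulator with an identity element and an associative merge, where Laplace noise is drawn only once, when the final result is computed.

```scala
import scala.util.Random

// Illustrative noisy-count monoid (NOT the differential-privacy library):
// raw counts merge associatively; noise is added once at result time.
final case class NoisyCount(raw: Long) {
  def increment: NoisyCount = NoisyCount(raw + 1)

  // Associative, commutative merge: partial counts combine in any order.
  def merge(other: NoisyCount): NoisyCount = NoisyCount(raw + other.raw)

  // Laplace mechanism with scale = sensitivity / epsilon.
  def result(epsilon: Double, sensitivity: Double = 1.0,
             rng: Random = new Random()): Long = {
    val u = rng.nextDouble() - 0.5
    val noise =
      -(sensitivity / epsilon) * math.signum(u) * math.log(1 - 2 * math.abs(u))
    math.round(raw + noise)
  }
}

object NoisyCount {
  // Identity element of the monoid.
  val zero: NoisyCount = NoisyCount(0L)
}
```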

spark-privacy exposes primitives as Spark UDAFs, using the typed Dataset Aggregator API.

A PrivateCount monoid can be implemented with:

import com.google.privacy.differentialprivacy.Count
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class PrivateCount[T](epsilon: Double, contribution: Int) extends Aggregator[T, Count, Long] {
  // A fresh, zero-valued Count accumulator parameterised with the privacy budget.
  override def zero: Count = Count.builder()
    .epsilon(epsilon)
    .maxPartitionsContributed(contribution)
    .build()

  // Add one record to the running count.
  override def reduce(b: Count, a: T): Count = {
    b.increment()
    b
  }

  // Combine two partial counts; this is what makes the aggregation monoidal.
  override def merge(b1: Count, b2: Count): Count = {
    b1.mergeWith(b2.getSerializableSummary)
    b1
  }

  // Noise is added once, when the final result is computed.
  override def finish(reduction: Count): Long = reduction.computeResult().toLong

  // Count is not a Spark-native type, so serialize the buffer with Kryo.
  override def bufferEncoder: Encoder[Count] = Encoders.kryo[Count]

  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

Note that this implementation violates the expected Aggregator input for typed Datasets (a set of Rows). This is currently intended behaviour: UDAFs are expected to be applied to DataFrames by registering them with functions.udaf. This boundary should be made implicit in our implementation.

A PrivateCount aggregation can be performed on a group of rows like any built-in aggregate function:

val privateCount = new PrivateCount[Long](epsilon, maxContributions)
val privateCountUdf = functions.udaf(privateCount)

dataFrame.groupBy("Day").agg(privateCountUdf($"VisitorId") as "private_count").show

Or with the equivalent SparkSQL API:

spark.udf.register("private_count", privateCountUdf)
dataFrame.createOrReplaceTempView("visitors")
spark.sql("select Day, private_count(VisitorId) from visitors group by Day")

Example

An example of private aggregation is available under examples.
