Algebird's HyperLogLog support for Apache Spark.


Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.


This project is published as mozilla/spark-hyperloglog on spark-packages.org, so it is available via:

spark-shell --packages mozilla:spark-hyperloglog:2.2.0

Example usage

import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import org.apache.spark.sql.functions.expr

val hllMerge = new HyperLogLogMerge
spark.udf.register("hll_merge", hllMerge)
spark.udf.register("hll_create", hllCreate _)
spark.udf.register("hll_cardinality", hllCardinality _)

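// Build one HLL per row (12 bits of precision), merge them, then estimate the distinct count.
val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
  .select(expr("hll_create(id, 12) as hll"))
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))

// Print the single-row result.
frame.show()

yields:

+-----+
|count|
+-----+
|    3|
+-----+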


To publish a new version of the package, create a new release on GitHub with a tag that starts with v, such as v2.2.0. The tag triggers a CircleCI build that publishes to Mozilla's Maven repository in S3.

The CircleCI build will also attempt to publish the new tag to spark-packages.org, but due to an outstanding bug in the sbt-spark-package plugin, that publish will likely fail. You can retry locally until it succeeds by creating a GitHub personal access token, exporting the environment variables GITHUB_USERNAME and GITHUB_PERSONAL_ACCESS_TOKEN, and then repeatedly running sbt spPublish until you get a non-404 response.
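
The manual retry described above could be scripted along these lines. This is only a sketch: the `retry` helper, its attempt limit, and its delay are hypothetical and not part of this repository. It also assumes that sbt spPublish exits with a non-zero status when the publish fails, which has not been verified here.

```shell
# retry MAX DELAY CMD [ARGS...]
# Hypothetical helper: runs CMD until it exits with status 0,
# giving up after MAX attempts and sleeping DELAY seconds between attempts.
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_attempt=1
  until "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "giving up after $retry_attempt attempts" >&2
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "$retry_delay"
  done
}

# Example invocation (left commented out; fill in your own credentials first):
#   export GITHUB_USERNAME=<your GitHub username>
#   export GITHUB_PERSONAL_ACCESS_TOKEN=<your token>
#   retry 10 30 sbt spPublish
```

If sbt turns out to exit with status 0 even when the response is a 404, the loop condition would need to inspect the command's output instead of its exit status.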
