spark-hyperloglog

Algebird's HyperLogLog support for Apache Spark. This package can be used in concert with presto-hyperloglog to share HyperLogLog sets between Spark and Presto.

Installing

This project is published as mozilla/spark-hyperloglog on spark-packages.org, so it is available via the --packages flag:

spark-shell --packages mozilla:spark-hyperloglog:2.2.0

Example usage

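import com.mozilla.spark.sql.hyperloglog.aggregates._
import com.mozilla.spark.sql.hyperloglog.functions._
import org.apache.spark.sql.functions.expr
import spark.implicits._ // provides toDF; already in scope inside spark-shell

// Register the merge aggregate and the scalar helpers as SQL functions.
val hllMerge = new HyperLogLogMerge
spark.udf.register("hll_merge", hllMerge)
spark.udf.register("hll_create", hllCreate _)
spark.udf.register("hll_cardinality", hllCardinality _)

// Create one HLL per id, merge them all, and estimate the number of distinct ids.
val frame = sc.parallelize(List("a", "b", "c", "c")).toDF("id")
(frame
  .select(expr("hll_create(id, 12) as hll"))
  .groupBy()
  .agg(expr("hll_cardinality(hll_merge(hll)) as count"))
  .show())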

yields:

+-----+
|count|
+-----+
|    3|
+-----+

Deployment

To publish a new version of the package, create a new release on GitHub with a tag that starts with v, such as v2.2.0. The tag triggers a CircleCI build that publishes to Mozilla's Maven repo in S3.

The CircleCI build will also attempt to publish the new tag to spark-packages.org, but due to an outstanding bug in the sbt-spark-package plugin, that publish will likely fail. You can retry locally until it succeeds by creating a GitHub personal access token, exporting the environment variables GITHUB_USERNAME and GITHUB_PERSONAL_ACCESS_TOKEN, and then repeatedly running sbt spPublish until you get a non-404 response.
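
The local retry described above could be scripted roughly as follows. This is only a sketch: the `retry` helper and the `RETRY_DELAY` variable are hypothetical names introduced here, and it assumes that `sbt spPublish` exits with a non-zero status when the publish fails, which this README does not confirm. If the failure is only visible in the log output, check the response by hand instead.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command until it succeeds or the attempt
# limit is reached. Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    # Pause between attempts; RETRY_DELAY is an illustrative knob (seconds).
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example invocation, left commented out so the snippet is safe to source.
# The values are placeholders; substitute your own credentials.
# export GITHUB_USERNAME="your-github-username"
# export GITHUB_PERSONAL_ACCESS_TOKEN="your-token"
# retry 10 sbt spPublish
```

Capping the number of attempts keeps the loop from running forever if the failure turns out to be something other than the intermittent 404.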