Conversation
@acmiyaguchi nice work on this, just a few questions.
Python bindings for the spark-hyperloglog package.

## Usage
This needs to include the use of the register function.
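For illustration, a Usage section that covers registration might look like the sketch below. The `pyspark_hyperloglog` module name and the `register()` helper are assumptions, not the package's confirmed API; only the `hll_merge` UDF name appears elsewhere in this PR.

```python
# Hypothetical README example -- module and register() names are assumptions.
from pyspark.sql import SparkSession
import pyspark_hyperloglog  # assumed import name

spark = SparkSession.builder.getOrCreate()

# Register the HyperLogLog UDFs (e.g. hll_merge) with the active session,
# mirroring the Scala registerUdf() added in this PR.
pyspark_hyperloglog.register()

df = spark.read.parquet("clients.parquet")
df.createOrReplaceTempView("clients")
spark.sql("SELECT hll_merge(hll) FROM clients").show()
```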
python/setup.py (outdated)
import shutil

VERSION = '2.1.0'
It doesn't look like this matches the version in build.sbt. Do they need to be managed together? Has this deploy run successfully?
The versions technically don't need to be managed together, but any update to the Scala library should probably trigger a new Python package. I've uploaded this successfully to PyPI testing, and one annoying thing is that you can't re-upload a package version once it's been uploaded.
Yeah, I've run into that before as well. My question is, how does JAR_FILE = "*-assembly-{}-*.jar".format(VERSION) match the correct jar if the versions don't match?
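The mismatch is easy to reproduce with a throwaway directory; the jar file name below is made up, but follows the usual sbt-assembly naming scheme:

```python
import glob
import os
import tempfile

# Stand-in for the sbt target directory, holding a jar built from build.sbt
# at version 2.2.0 (the file name is illustrative).
target = tempfile.mkdtemp()
open(os.path.join(target, "spark-hyperloglog-assembly-2.2.0-SNAPSHOT.jar"), "w").close()

VERSION = "2.1.0"  # setup.py's version, out of sync with build.sbt
JAR_FILE = "*-assembly-{}-*.jar".format(VERSION)

# The glob only matches when the two versions agree.
stale = glob.glob(os.path.join(target, JAR_FILE))
fresh = glob.glob(os.path.join(target, "*-assembly-{}-*.jar".format("2.2.0")))
```

With the versions out of sync, `stale` comes back empty, so the packaging step has no jar to bundle.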
Actually yes, you're right, they do have to be managed together, otherwise the packaging fails. I ran into it when I bumped the version number because of PyPI's weird limitations. I might try reading the version from a file in the root directory.
That's the way I'd do it as well, with a file called VERSIONS or similar.
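A sketch of that single-source-of-truth approach, assuming a VERSION file at the repository root holding one line (the exact file name and layout are up for grabs):

```python
import os
import tempfile

def read_version(root):
    """Return the version string stored in <root>/VERSION."""
    with open(os.path.join(root, "VERSION")) as f:
        return f.read().strip()

# Demo against a throwaway directory standing in for the repo root;
# setup.py would instead call read_version on the directory above it,
# and build.sbt would read the same file.
root = tempfile.mkdtemp()
with open(os.path.join(root, "VERSION"), "w") as f:
    f.write("2.2.0\n")

VERSION = read_version(root)
JAR_FILE = "*-assembly-{}-*.jar".format(VERSION)
```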
def registerUdf() {
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().getOrCreate()
  spark.udf.register("hll_merge", new HyperLogLogMerge)
Please move these string names to constants.
Also adds newlines to the end of files.
Force-pushed 52e90c0 to 316cfb0.
This adds a VERSION file and addresses the review comments.
@fbertsch bumping for review
@acmiyaguchi I just ran into this :) Has this already been deployed?
@fbertsch I think I've only deployed to the dev PyPI. It's been a while since I've looked at this, so I'll put some final touches on Python 3 and CI support to wrap this up.
This is live on PyPI. I'm going to merge this in so updates can be built on top.
cc @wlach for visibility
This adds python bindings for spark-hyperloglog. Example usage can be found in the tests.
This package is pip installable and plays nicely with the pyspark pip package in standalone mode. The spark-package workflow includes Python files in the jar via `python -m compileall`, but doesn't handle proper packaging. Instead, I've followed the convention in the apache/spark pyspark repository to create this package. One downside is the size of this package (around 15 MB). This should simplify the 1-day retention job in mozetl and open up access to HyperLogLog in ATMO.
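One way a pip package like this can expose its bundled assembly jar to a standalone pyspark session is sketched below; the import name and the jar's location inside the installed package are assumptions based on the description above, not confirmed details of this PR.

```python
import glob
import os

from pyspark.sql import SparkSession
import pyspark_hyperloglog  # assumed import name

# Locate the ~15 MB assembly jar shipped inside the installed package
# (the path layout is an assumption).
pkg_dir = os.path.dirname(pyspark_hyperloglog.__file__)
jar = glob.glob(os.path.join(pkg_dir, "*-assembly-*.jar"))[0]

# Put the jar on the driver/executor classpath before the session starts,
# so the Scala UDFs are available to register.
spark = (
    SparkSession.builder
    .config("spark.jars", jar)
    .getOrCreate()
)
```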