Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Bug 1305087 - Add python bindings #5

Merged
merged 6 commits into from
May 25, 2018

Conversation

acmiyaguchi
Copy link

This adds python bindings for spark-hyperloglog. Example usage can be found in the tests.

This package is pip installable and plays nicely with the pyspark pip package in standalone mode. The spark-package workflow includes python files in the jar via python -m compileall but doesn't handle proper packaging. Instead, I've followed the convention in the apache/spark/pyspark repository to create this package. One downside is the size of this package (around 15mb).

$ sbt assembly
$ cd python
$ python setup.py sdist
$ pip install dist/*.tar.gz

# can also be uploaded to pypi
$ python setup.py upload

This should simplify the 1-day retention job in mozetl and open up access to hyperloglog in ATMO.

Copy link

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acmiyaguchi nice work on this, just a few questions.


Python bindings for the spark-hyperloglog package.

## Usage
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to include the use of the register function

python/setup.py Outdated
import shutil


VERSION = '2.1.0'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this matches the version in build.sbt. Do they need to be managed together? Has this deploy run successfully?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The versions technically don't need to be managed together, but any update to the scala library should probably retrigger a new python package. I've uploaded this successfully to pypi testing and one annoying thing is that you can't reupload a package once it's been uploaded.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've run into that before as well. My question is, how does JAR_FILE = "*-assembly-{}-*.jar".format(VERSION) match the correct jar if the versions don't match?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually yes, you're right, they do have to be managed together otherwise the packaging fails. I ran into it when I bumped the version number because of pypi's weird limitations. I might try reading the version from a file in the root directory.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the way I'd do it as well, with a file called VERSIONS or similar

def registerUdf() {
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
spark.udf.register("hll_merge", new HyperLogLogMerge)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move these string names to constants.

@mreid-moz mreid-moz removed their request for review February 5, 2018 14:49
@acmiyaguchi
Copy link
Author

This adds a VERSION file and addresses the review comments.

@acmiyaguchi
Copy link
Author

@fbertsch bumping for review

@acmiyaguchi acmiyaguchi removed the request for review from haroldwoo February 23, 2018 01:01
Copy link

@fbertsch fbertsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@acmiyaguchi I just ran into this :) Has this already been deployed?

python/setup.py Outdated
import shutil


VERSION = '2.1.0'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the way I'd do it as well, with a file called VERSIONS or similar

@acmiyaguchi
Copy link
Author

@fbertsch It think I've only deployed to the dev pypi. It's been a while since I've looked at this, so I'll put some final touches for Python 3 and CI support to wrap this up.

@acmiyaguchi
Copy link
Author

acmiyaguchi commented May 25, 2018

This is live on PyPi. I'm going to merge this in so updates can be built on top.

@acmiyaguchi acmiyaguchi reopened this May 25, 2018
@acmiyaguchi acmiyaguchi merged commit ea3f427 into mozilla:master May 25, 2018
@mreid-moz
Copy link

cc @wlach for visibility

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants