# spark-github-pr

Spark SQL data source for the GitHub PR API.


## Overview

The package queries GitHub API v3 to fetch pull request information. It launches the first requests on the driver to list available pull requests, then creates tasks that fetch the details of each pull request. PRs are cached in the directory given by the `cacheDir` option to conserve rate limit. It is recommended to provide a `token` to remove the 60 requests/hour constraint. The package also supports loading pull requests with Structured Streaming (experimental, Spark 2.x only); see the usage example below.
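For example, a token can be passed as a read option; the following is a minimal sketch, where reading the token from a `GITHUB_TOKEN` environment variable is only an illustrative convention, not something the package requires:

```scala
// Sketch: pass an authentication token to lift the 60 requests/hour limit.
// GITHUB_TOKEN is an assumed environment variable, used here for illustration.
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("token", sys.env("GITHUB_TOKEN")).
  load()
```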

Most JSON keys are supported; see the schema in the project sources. Here is example output for a subset of columns:

```
scala> df.select("number", "title", "state", "base.repo.full_name", "user.login",
  "commits", "additions", "deletions").show()

+------+--------------------+-----+------------+------------+-------+---------+---------+
|number|               title|state|   full_name|       login|commits|additions|deletions|
+------+--------------------+-----+------------+------------+-------+---------+---------+
| 15599|[SPARK-18022][SQL...| open|apache/spark|      srowen|      1|        1|        1|
| 15598|[SPARK-18027][YAR...| open|apache/spark|      srowen|      1|        2|        0|
| 15597|[SPARK-18063][SQL...| open|apache/spark| jiangxb1987|      2|       16|        6|
| 15596|[SQL] Remove shuf...| open|apache/spark|      viirya|      1|       13|       12|
+------+--------------------+-----+------------+------------+-------+---------+---------+
```

## Requirements

| Spark version | spark-github-pr latest version |
| ------------- | ------------------------------ |
| 1.6.x | 1.2.0 |
| 2.x.x | 1.3.0 |

## Linking

The spark-github-pr package can be added to Spark using the `--packages` command line option. For example, run this to include it when starting the Spark shell:

```
$SPARK_HOME/bin/spark-shell --packages lightcopy:spark-github-pr:1.3.0-s_2.10
```

Change to `lightcopy:spark-github-pr:1.3.0-s_2.11` for Scala 2.11.x.
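The same `--packages` flag works when starting `pyspark` or submitting an application with `spark-submit`, for example:

```
$SPARK_HOME/bin/pyspark --packages lightcopy:spark-github-pr:1.3.0-s_2.11
```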

## Options

Currently supported options:

| Name | Since | Example | Description |
| ---- | ----- | ------- | ----------- |
| `user` | 1.0.0 | apache | GitHub username or organization, default is `apache` |
| `repo` | 1.0.0 | spark | GitHub repository name for the provided user, default is `spark` |
| `batch` | 1.0.0 | 100 | number of pull requests to fetch, default is 25, must be >= 1 and <= 1000 |
| `token` | 1.0.0 | auth_token | authentication token to increase the rate limit from 60 to 5000 requests/hour, see GitHub Auth for more info |
| `cacheDir` | 1.0.0 | file:/tmp/.spark-github-pr | directory to store cached pull request information, currently required to be a shared folder on the local file system or a directory on HDFS |
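As a sketch, several options can be combined in a single read; the values below are illustrative:

```scala
// Illustrative sketch: fetch 100 PRs and cache them in a shared local directory
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").
  option("batch", "100").
  option("cacheDir", "file:/tmp/.spark-github-pr").
  load()
```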

## Example

### Scala API

```scala
// Load the default number of pull requests from apache/spark
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").load().
  select("number", "title", "user.login")

// Specify user and repository explicitly
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").load().
  select("number", "title", "state", "base.repo.full_name", "user.login", "commits")

// You can also specify batch size, i.e. the number of pull requests to fetch
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").option("batch", "52").load()
```

### Python API

```python
df = sqlContext.read.format("com.github.lightcopy.spark.pr") \
  .option("user", "apache").option("repo", "spark").load()

res = df.where("commits > 10")
```
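The loaded DataFrame can also be registered as a temporary table and queried with SQL; a minimal sketch using the Spark 1.6-era `registerTempTable` (on Spark 2.x, use `createOrReplaceTempView` instead):

```python
df.registerTempTable("prs")
sqlContext.sql("SELECT number, title FROM prs WHERE commits > 10").show()
```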

### SQL API

```sql
CREATE TEMPORARY TABLE prs
USING com.github.lightcopy.spark.pr
OPTIONS (user "apache", repo "spark");

SELECT number, title FROM prs LIMIT 10;
```
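Any regular SQL works against the registered table, for example counting pull requests by state (illustrative):

```sql
SELECT state, COUNT(*) AS cnt FROM prs GROUP BY state;
```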

### Structured Streaming API

```scala
val df = spark.readStream.format("com.github.lightcopy.spark.pr").load()
val query = df.select("number", "title", "user.login").
  writeStream.format("console").option("checkpointLocation", "./checkpoint").start()
```
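In a standalone application you would normally block until the streaming query terminates:

```scala
// Keep the application alive while the stream runs
query.awaitTermination()
```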

## Building From Source

This library is built with sbt. To build a JAR file, run `sbt package` from the project root.

## Testing

Run `sbt test` from the project root. CI runs tests against Spark 2.0 only.