Add WordCountExample and a walk-through to the README
Evan Chan committed Dec 8, 2013
1 parent c39e403 commit 37f0297
Showing 2 changed files with 106 additions and 9 deletions.
81 changes: 72 additions & 9 deletions jobserver/README.md
@@ -1,5 +1,16 @@
spark-job-server provides a RESTful interface for submitting and managing Spark jobs, jars, and job contexts.

## Features

- *"Spark as a Service"*: Simple REST interface for all aspects of job and context management (see the endpoint sketch after this list)
- Supports sub-second low-latency jobs via long-running job contexts
- Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
- Kill running jobs via stop context
- Separate jar uploading step for faster job startup
- Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
- Works with Standalone Spark as well as Mesos
- Job and jar info is persisted via a pluggable DAO interface
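
As an orientation, the REST surface exercised by the walk-through below looks roughly like this sketch; the angle-bracketed names are placeholders, and the actual set of routes and parameters may be larger than what the walk-through shows:

    # upload a jar under an app name
    curl --data-binary @path/to/your-job.jar localhost:8090/jars/<appName>

    # create a long-running context, or list existing ones
    curl -d "" 'localhost:8090/contexts/<contextName>?numCores=2&memPerNode=512m'
    curl localhost:8090/contexts

    # submit a job (job config goes in the POST body) and poll its result by job ID
    curl -d "<job config>" 'localhost:8090/jobs?appName=<appName>&classPath=<jobClass>'
    curl localhost:8090/jobs/<jobId>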

## Quick start / development mode

From SBT shell, simply type "re-start". This uses a default configuration file. An optional argument is a
@@ -12,16 +23,68 @@
Note that re-start (SBT Revolver) forks the job server in a separate process. If you
type re-start again at the SBT shell prompt, it will compile your changes and restart the jobserver. It enables
very fast turnaround cycles.
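
For example, a development session looks roughly like the sketch below; whether you first need to select the jobserver project (e.g. with `project jobserver`) depends on your SBT setup:

    $ sbt/sbt                # launch the SBT shell from the repo root
    > re-start               # compile and fork the job server with the default config
      ... edit some code ...
    > re-start               # recompile and restart the forked job server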

### WordCountExample walk-through

First, package the test jar containing the WordCountExample: `sbt/sbt jobserver/test:package`.
Then start the job server using the instructions above.

Let's upload the jar:

    curl --data-binary @jobserver/target/scala-2.9.3/spark-job-server_2.9.3-0.9.0-incubating-SNAPSHOT-test.jar localhost:8090/jars/test
    OK⏎

The above jar is uploaded as app `test`. Next, let's start an ad-hoc word count job, meaning that the job
server will create its own SparkContext and return a job ID for subsequent querying:

    curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
    {
      "status": "STARTED",
      "result": {
        "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
        "context": "b7ea0eb5-spark.jobserver.WordCountExample"
      }
    }⏎

From this point, you could asynchronously query the status and results:

    curl localhost:8090/jobs/5453779a-f004-45fc-a11d-a39dae0f9bf4
    {
      "status": "OK",
      "result": {
        "a": 2,
        "b": 2,
        "c": 1,
        "see": 1
      }
    }⏎

Note that you could append `&sync=true` when you POST to /jobs to get the results back in one request, but for
real clusters and most jobs this may be too slow.
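
For example, the ad-hoc submission above becomes synchronous simply by appending the flag (same endpoint and parameters as before, shown here as a sketch):

    curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&sync=true'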

Another way of running this job is in a pre-created context. Start a new context:

    curl -d "" 'localhost:8090/contexts/test-context?numCores=2&memPerNode=512m'
    OK⏎

You can verify that the context has been created:

    curl localhost:8090/contexts
    ["test-context"]⏎

Now let's run the job in the context and get the results back right away:

    curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true'
    {
      "status": "OK",
      "result": {
        "a": 2,
        "b": 2,
        "c": 1,
        "see": 1
      }
    }⏎

Note the addition of `context=` and `sync=true`.

## Architecture

34 changes: 34 additions & 0 deletions jobserver/src/test/scala/spark.jobserver/WordCountExample.scala
@@ -0,0 +1,34 @@
package spark.jobserver

import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark._
import org.apache.spark.SparkContext._
import scala.util.Try

/**
 * A super-simple Spark job example that implements the SparkJob trait and can be submitted to the job server.
 *
 * Set the config with the sentence to split or count:
 * input.string = "adsfasdf asdkf safksf a sdfa"
 *
 * validate() returns SparkJobInvalid if there is no input.string
 */
object WordCountExample extends SparkJob {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "WordCountExample")
    // Parse a sample sentence so runJob has the input.string it expects;
    // an empty config would make config.getString("input.string") throw.
    val config = ConfigFactory.parseString("input.string = a b c a b see")
    val results = runJob(sc, config)
    println("Result is " + results)
  }

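  // Reject the job up front if the required input.string config setting is missing.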
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    Try(config.getString("input.string"))
      .map(x => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string config param"))
  }

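  // Split the input string on spaces and count the occurrences of each word.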
  override def runJob(sc: SparkContext, config: Config): Any = {
    val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
  }
}
