Sparkling Water Examples

Available Demos And Applications

Example	Description
CraigslistJobTitlesStreamingApp	Stream application - it predicts job category based on incoming job description.
CraigslistJobTitlesApp	Predict job category based on posted job description.
ChicagoCrimeAppSmall	Builds a model predicting a probability of arrest for given crime in Chicago using data in chicago datasets.
ChicagoCrimeApp	Implementation of Chicago Crime demo with setup for data stored on HDFS.
CitiBikeSharingDemo	Predicts occupancy of Citi bike stations in NYC.
HamOrSpamDemo	Shows Spam detector with Spark and H2O's DeepLearning.
ProstateDemo	Running K-means on prostate dataset.
DeepLearningDemo	Running DeepLearning on a subset of airlines dataset.
AirlinesWithWeatherDemo	Joining flights data with weather data and running Deep Learning.
AirlinesWithWeatherDemo2	New iteration of `AirlinesWithWeatherDemo`.

Run examples by typing bin/run-example.sh <name of demo> or follow text below.

Available Demos for Sparkling Shell

Example	Description
chicagoCrimeSmallShell.script.scala	Demo showing full source code of predicting arrest probability for a given crime. It covers whole machine learning process from loading and transforming data, building models, scoring incoming events.
chicagoCrimeSmall.script.scala	Example of using ChicagoCrimeApp. Creating application and using it for scoring individual crime events.
hamOrSpam.script.scala	HamOrSpam application which detects Spam messages. Presented at MLConf 2015 NYC.
strata2015.script.scala	NYC CitiBike demo presented at Strata 2015 in San Jose.
StrataAirlines.script.scala	Example of using flights and weather data to predict delay of a flight.

Run examples by typing bin/sparkling-shell -i <path to file with demo script>

Building and Running Examples

Please see Running Sparkling Water Examples for more information how to build and run examples.

Configuring Sparkling Water Variables

Please see Available Sparkling Water Configuration Properties for more information about possible Sparkling Water configurations.

Step-by-Step Weather Data Example

Run Sparkling shell with an embedded cluster:

export SPARK_HOME="/path/to/spark/installation"   export MASTER="local[*]"   bin/sparkling-shell

To see the Sparkling shell (i.e., Spark driver) status, go to http://localhost:4040/.
Initialize H2O services on top of Spark cluster:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._
import h2oContext.implicits._

Load weather data for Chicago international airport (ORD), with help from the RDD API:

import org.apache.spark.examples.h2o._
val weatherDataFile = "examples/smalldata/chicago/Chicago_Ohare_International_Airport.csv"
val wrawdata = spark.sparkContext.textFile(weatherDataFile,3).cache()
val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

Load airlines data using the H2O parser:

import java.io.File
val dataFile = "examples/smalldata/airlines/allyears2k_headers.zip"
val airlinesData = new H2OFrame(new File(dataFile))

Select flights destined for Chicago (ORD):

val airlinesTable : RDD[Airlines] = asRDD[Airlines](airlinesData)
val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))

Compute the number of these flights:

flightsToORD.count

Use Spark SQL to join the flight data with the weather data:

implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._
flightsToORD.toDF.createOrReplaceTempView("FlightsToORD")
weatherTable.toDF.createOrReplaceTempView("WeatherORD")

Perform SQL JOIN on both tables:

val bigTable = sqlContext.sql(
        """SELECT
            |f.Year,f.Month,f.DayofMonth,
            |f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
            |f.UniqueCarrier,f.FlightNum,f.TailNum,
            |f.Origin,f.Distance,
            |w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
            |f.ArrDelay
            |FROM FlightsToORD f
            |JOIN WeatherORD w
            |ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)

Transform the first 3 columns containing date information into enum columns:

val bigDataFrame: H2OFrame = h2oContext.asH2OFrame(bigTable)
for( i <- 0 to 2) bigDataFrame.replace(i, bigDataFrame.vec(i).toCategoricalVec)
bigDataFrame.update()

Run deep learning to produce a model estimating arrival delay:

import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters.Activation
val dlParams = new DeepLearningParameters()
dlParams._train = bigDataFrame
dlParams._response_column = "ArrDelay"
dlParams._epochs = 5
dlParams._activation = Activation.RectifierWithDropout
dlParams._hidden = Array[Int](100, 100)

// Create a job
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get

Use the model to estimate the delay on the training data:

val predictionH2OFrame = dlModel.score(bigTable)("predict")
val predictionsFromModel = asDataFrame(predictionH2OFrame)(sqlContext).collect.map{
    row => if (row.isNullAt(0)) Double.NaN else row(0)
}

Generate an R-code producing residual plot:

import org.apache.spark.examples.h2o.AirlinesWithWeatherDemo2.residualPlotRCode
residualPlotRCode(predictionH2OFrame, "predict", bigTable, "ArrDelay", h2oContext)

Execute generated R-code in RStudio:

#
# R script for residual plot
#
# Import H2O library
library(h2o)
# Initialize H2O R-client
h2o.init()
# Fetch prediction and actual data, use remembered keys
pred = h2o.getFrame("dframe_b5f449d0c04ee75fda1b9bc865b14a69")
act = h2o.getFrame ("frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columns
predDelay = pred$predict
actDelay = act$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind (as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))
nrow(compare)
plot( compare[,1:2] )

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.rst

README.rst

Sparkling Water Examples

Available Demos And Applications

Available Demos for Sparkling Shell

Building and Running Examples

Configuring Sparkling Water Variables

Step-by-Step Weather Data Example

Files

README.rst

Latest commit

History

README.rst

File metadata and controls

Sparkling Water Examples

Available Demos And Applications

Available Demos for Sparkling Shell

Building and Running Examples

Configuring Sparkling Water Variables

Step-by-Step Weather Data Example