Sparkling Water Examples

Available Demos And Applications

Example Description
CraigslistJobTitlesStreamingApp Streaming application that predicts the job category from an incoming job description.
CraigslistJobTitlesApp Predicts the job category from a posted job description.
ChicagoCrimeApp Builds a model that predicts the probability of arrest for a given crime in Chicago, using the Chicago datasets.
CityBikeSharingDemo Predicts the occupancy of City bike stations in NYC.
HamOrSpamDemo Shows a spam detector built with Spark and H2O's algorithms.
ProstateDemo Runs H2O's K-means on the prostate dataset.
DeepLearningDemo Runs Deep Learning on a subset of the airlines dataset.
AirlinesWithWeatherDemo Joins flight data with weather data and runs Deep Learning and GBM.

To run an example, type ./bin/run-example.sh <name of demo> or follow the instructions below.

Building and Running Examples

Please see Running Sparkling Water Examples for more information on how to build and run the examples.

Configuring Sparkling Water Variables

Please see Available Sparkling Water Configuration Properties for more information about possible Sparkling Water configurations.

Step-by-Step Weather Data Example

  1. Run the Sparkling shell with an embedded cluster:
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local[*]"
bin/sparkling-shell
  2. To see the status of the Sparkling shell (i.e., the Spark driver), go to http://localhost:4040/.
  3. Initialize H2O services on top of the Spark cluster:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
import spark.implicits._
  4. Load weather data for Chicago O'Hare International Airport (ORD):
val weatherDataFile = "examples/smalldata/chicago/Chicago_Ohare_International_Airport.csv"
val weatherTable = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv(weatherDataFile)
  .withColumn("Date", to_date(regexp_replace('Date, "(\\d+)/(\\d+)/(\\d+)", "$3-$2-$1")))
  .withColumn("Year", year('Date))
  .withColumn("Month", month('Date))
  .withColumn("DayofMonth", dayofmonth('Date))
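
The regexp_replace call in the step above reorders the three slash-separated numeric groups of the Date column before to_date parses the result. A minimal plain-Scala sketch of that rewrite, using the same pattern and replacement string; the sample value is made up for illustration and is not taken from the weather file:

```scala
// Plain Scala, no Spark session needed.
// Assumption: "31/12/2005" is a hypothetical input chosen only to show the reordering.
val raw = "31/12/2005"

// Same pattern and replacement as in the weather-loading step:
// the third group comes first, then the second, then the first.
val rewritten = raw.replaceAll("(\\d+)/(\\d+)/(\\d+)", "$3-$2-$1")

println(rewritten)
```

The Year, Month, and DayofMonth columns are derived from the parsed date, so it is worth checking a few rows of weatherTable to confirm that the values match the raw file.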
  5. Load airlines data:
val airlinesDataFile = "examples/smalldata/airlines/allyears2k_headers.csv"
val airlinesTable = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv(airlinesDataFile)
  6. Select flights destined for Chicago (ORD):
val flightsToORD = airlinesTable.filter('Dest === "ORD")
  7. Compute the number of these flights:
flightsToORD.count
  8. Join the flights data frame with the weather data frame:
val joined = flightsToORD.join(weatherTable, Seq("Year", "Month", "DayofMonth"))
  9. Run deep learning to produce a model estimating arrival delay:
import ai.h2o.sparkling.ml.algos.H2ODeepLearning
val dl = new H2ODeepLearning()
    .setLabelCol("ArrDelay")
    .setColumnsToCategorical(Array("Year", "Month", "DayofMonth"))
    .setEpochs(5)
    .setActivation("RectifierWithDropout")
    .setHidden(Array(100, 100))

val model = dl.fit(joined)
  10. Use the model to estimate the delay on the training data:
val predictions = model.transform(joined)
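
As an optional follow-up, the predictions can be compared with the actual delays in the same shell session. This is a sketch rather than part of the original walkthrough: it assumes the model writes its output to a column named "prediction" (check predictions.printSchema() if your Sparkling Water version names it differently) and it reuses the predictions value from the last step.

```scala
import org.apache.spark.sql.functions.{avg, col, pow, sqrt}

// Assumption: the model output column is named "prediction".
// Rows without a recorded arrival delay cannot be compared, so drop them first.
val scored = predictions.filter(col("ArrDelay").isNotNull)

// Show a few actual delays next to the estimated ones.
scored.select("ArrDelay", "prediction").show(10)

// Root mean squared error between estimated and actual delay.
scored
  .select(sqrt(avg(pow(col("prediction") - col("ArrDelay"), 2.0))).as("rmse"))
  .show()
```

Because the model is scored on the rows it was trained on, this value measures fit to the training data and is likely to understate the error on unseen flights.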