## Quick Start

1. Take a moment to confirm the configuration details. You can run it with the default settings to get a 3-node cluster with 21 GB of RAM per node.
2. Run the cell below to configure the Spark cluster.

### Note 

You can change the maximum driver and executor memory and the number of executors by changing the following settings:

``"driverMemory": "21G"
"executorMemory": "21G"
"numExecutors": 3
``

For more information, see the documentation [here][1].

[1]: http://h2o-release.s3.amazonaws.com/h2o/latest_azure_doc.html
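
As a minimal sketch of how these three values fit into the `%%configure` payload, the snippet below builds the JSON body in plain Python. The helper name `build_configure_payload` and the example values are hypothetical and only illustrate the structure; the `conf` entries used by this notebook are left out for brevity.

```python
import json

def build_configure_payload(driver_memory="21G", executor_memory="21G", num_executors=3):
    """Return the JSON text for a %%configure cell (hypothetical helper)."""
    payload = {
        "driverMemory": driver_memory,
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    }
    return json.dumps(payload, indent=4)

# Example: a smaller cluster with 2 executors and 8 GB per process.
print(build_configure_payload("8G", "8G", 2))
```

The printed JSON can be pasted below a `%%configure -f` line, merged with the `conf` section if you need the H2O settings as well.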



In [None]:
%%configure -f
{
    "conf":{
        "spark.ext.h2o.announce.rest.url": "http://@@IPADDRESS@@:5000/flows",
        "spark.jars":"/H2O-Sparkling-Water-files/sparkling-water-assembly-all.jar",
        "spark.submit.pyFiles":"/H2O-Sparkling-Water-files/pySparkling.zip",
        "spark.locality.wait":"3000",
        "spark.scheduler.minRegisteredResourcesRatio":"1",
        "spark.task.maxFailures":"1",
        "spark.yarn.am.extraJavaOptions":"-XX:MaxPermSize=384m",
        "spark.yarn.max.executor.failures":"1",
        "maximizeResourceAllocation": "true"
    },
    "driverMemory":"21G",
    "executorMemory":"21G",
    "numExecutors":3
}

# Sentiment Analysis with PySparkling
The Amazon Fine Food Reviews dataset consists of 568,454 food reviews Amazon users left up to October 2012.

> This data was originally published on SNAP as part of the paper: J. McAuley and J. Leskovec. _From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews_. WWW, 2013.

https://www.kaggle.com/snap/amazon-fine-food-reviews

## Prepare environment

In [None]:
import pyspark
import pysparkling, h2o
import os
os.environ["PYTHON_EGG_CACHE"] = os.path.expanduser("~/")  # "~" is not expanded automatically in environment variables
sc.addPyFile("wasb:///H2O-Sparkling-Water-files/pySparkling.zip") # For Azure DataLake replace wasb with adl

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

## H2O FLOW

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots, and rich media into a single document, much like Jupyter Notebooks.

With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work - all within Flow’s browser-based environment. 

An H2O Flow instance is always running when H2O is started, even from R or Python. Users can use Flow in conjunction with their coding environment to evaluate model performance and scoring history easily during a training run. They can also monitor cluster and CPU usage and perform data exploration using the built-in visualizations.

### Note
Please wait for the previous cell to finish executing (and start H2O) before opening the H2O Flow page.

###### H2O FLOW can be found at @@FLOWURL@@


## Load data into H2OFrame

In [None]:
# Helper function returning the URL of the public data file Reviews.csv (about 300 MB)
def _locate(example_name): 
    return "https://h2ostore.blob.core.windows.net/examples/" + example_name 

DATASET = 'CReviews.csv'


In [None]:
# Import the data into H2O
from pyspark import SparkFiles

reviews_hf = h2o.import_file(_locate(DATASET))


In [None]:
reviews_hf.show()

## Data munging with H2O API

### Remove columns

In [None]:
selected_columns = [ "Score", "Time", "Summary", "HelpfulnessNumerator", "HelpfulnessDenominator" ]
reviews_hf = reviews_hf[selected_columns]

In [None]:
reviews_hf.show()

### Refine `Time` Column into Year/Month/Day/DayOfWeek/Hour columns
In this case the `Time` column contains the number of seconds since the Unix epoch. We translate it into several new columns to help the algorithms pick up the right patterns.

In [None]:
# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"

In [None]:
def refine_time_column(data_hf, column_name):
    data_hf[column_name] = data_hf[column_name] * 1000  # Convert seconds to milliseconds, as required by the H2O API
    data_hf["Day"] = data_hf[column_name].day()
    data_hf["Month"] = data_hf[column_name].month()
    data_hf["Year"] = data_hf[column_name].year()
    data_hf["DayOfWeek"] = data_hf[column_name].dayOfWeek()
    data_hf["Hour"] = data_hf[column_name].hour()
    
refine_time_column(reviews_hf, "Time")
reviews_hf.show()
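
To make the decomposition concrete, here is a small sketch in plain Python (standard library only, not the H2O API) that splits one epoch timestamp into the same parts. The timestamp value is a hypothetical example, and the day-of-week encoding (Monday is 0) is Python's convention, which may differ from how H2O represents that column.

```python
from datetime import datetime, timezone

def split_epoch_seconds(seconds):
    """Decompose seconds since the Unix epoch into calendar parts (UTC)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "Day": moment.day,
        "Month": moment.month,
        "Year": moment.year,
        "DayOfWeek": moment.weekday(),  # Python convention: Monday is 0, Sunday is 6
        "Hour": moment.hour,
    }

# Hypothetical timestamp: 2011-04-27 00:00:00 UTC, a Wednesday.
parts = split_epoch_seconds(1303862400)
print(parts)
```

Fixing the time zone to UTC, as the notebook does for the H2O cluster, keeps the derived day and hour independent of the machine's local settings.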

## Data Munge with Spark API
We can combine H2O's data munging capabilities with the Spark API.

### Publish H2O Frame as Spark DataFrame

The created H2OContext exposes the method `as_spark_frame`, which publishes an H2OFrame as a Spark DataFrame.

In [None]:
reviews_df = h2o_context.as_spark_frame(reviews_hf)
reviews_df.show()

In [None]:
# Register the DataFrame as a temporary view so it can be queried with SQL
reviews_df.createOrReplaceTempView("reviewstabletemp")

### Spark DataFrame API

From this point we can run any Spark data munging operations including SQL.
We can still publish the result as an H2OFrame.

In [None]:
avgScorePerYear = reviews_df.groupBy("Year").agg({"Score" : "avg", "*": "count"}).where("Year is not null").orderBy("Year")
avgScorePerYear.show()

In [None]:
avgScorePerYear.createOrReplaceTempView("avgscoretabletemp")

In [None]:
%%sql
show tables

Now we can query the temporary view and output the results to a pandas DataFrame (using the `-o` option).

In [None]:
%%sql -q -n 500 -o query1
select * from avgscoretabletemp

### Visualize the results directly in the Python notebook

In [None]:
%%local
%matplotlib inline

query1.plot.bar(x="Year", y = "count(1)")

### Prepare data for modeling
The idea is to model sentiment based on the review's `Score`, its `Summary`, and the time when the review was written. We reduce the score to a positive/negative label: reviews scored below 3 count as negative and all others as positive.

Steps:

  1. Select columns Score, Month, Day, DayOfWeek, Summary
  2. Define a UDF to transform the score (1..5) into a binary positive/negative label
  3. Use TF-IDF to vectorize summary column

#### Transform the `Score` column into binary feature

The score takes values from 1 to 5, but we are only interested in a binary value: a positive or a negative review. The function below labels scores lower than 3 as negative and everything else as positive.

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import UserDefinedFunction

def to_binary_score(col):
    if col < 3:
        return "negative"
    else:
        return "positive"
udf_to_binary_score = UserDefinedFunction(to_binary_score, StringType())

In [None]:
reviews_df = reviews_df.withColumn("Score", udf_to_binary_score("Score"))
reviews_df.show()

### Transforming textual data into numeric representation

#### Tokenization

In [None]:
from pyspark.ml.feature import *

tokenizer = Tokenizer(inputCol="Summary", outputCol="tokens")

#### Transform tokens into numeric representation

We use Spark `HashingTF` to represent tokens as numeric features.

In [None]:
hashingTF = HashingTF()
hashingTF.setInputCol("tokens").setOutputCol("tf-features").setNumFeatures(1024)
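
The idea behind `HashingTF` (the hashing trick) can be sketched in plain Python: each token is hashed to an index in a fixed-size vector and the count at that index is incremented. This is an illustration only. It uses `zlib.crc32` as a stand-in hash, which is an assumption of the sketch and not the hash function Spark uses, so the indices will not match Spark's output.

```python
import zlib

def hashing_tf(tokens, num_features=1024):
    """Map a list of tokens to a sparse {index: count} term-frequency vector."""
    vector = {}
    for token in tokens:
        # crc32 is a stand-in hash chosen for this sketch, not Spark's hash function.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector

tokens = "good quality dog food good".split()
tf = hashing_tf(tokens)
print(tf)
```

A smaller `num_features` saves memory but makes it more likely that two different tokens collide on the same index, which is the trade-off behind the value 1024 chosen above.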

#### Build IDF (Inverse Document Frequency) model
The model scales a token's frequency based on its occurrence in a document and in the full set of documents.

In [None]:
idf = IDF()
idf.setInputCol("tf-features")
idf.setOutputCol("idf-features")
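
As a rough illustration of what the IDF stage computes, the sketch below derives inverse document frequencies in plain Python using the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. It works on tokens directly for readability, whereas the Spark pipeline applies the weights to hashed feature indices; the function name and sample documents are hypothetical.

```python
import math

def fit_idf(documents):
    """Compute smoothed IDF weights from a list of token lists."""
    num_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    # Smoothed formula: log((m + 1) / (df + 1)), with m the number of documents.
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

docs = [["great", "coffee"], ["bad", "coffee"], ["great", "tea", "coffee"]]
idf_weights = fit_idf(docs)
print(idf_weights)
```

A token that appears in every document, such as `coffee` here, gets a weight of zero, while rarer tokens get larger weights.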

#### Compose individual transformations into a Spark pipeline

In [None]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [tokenizer, hashingTF, idf])
pipelineModel = pipeline.fit(reviews_df)

#### And transform input data

In [None]:
final_reviews_df = pipelineModel.transform(reviews_df)
#final_reviews_df.show()

## Back to H2O Frame (materialization)

In [None]:
final_columns = ["Score", "HelpfulnessNumerator", "HelpfulnessDenominator", "Day", "Month", "Year", "DayOfWeek", "idf-features"]
final_reviews_hf = h2o_context.as_h2o_frame(final_reviews_df.select(final_columns), "final_reviews_hf")
#final_reviews_hf.show()

### The Score and DayOfWeek columns need to be factors

In [None]:
final_reviews_hf["Score"] = final_reviews_hf["Score"].asfactor()
final_reviews_hf["DayOfWeek"] = final_reviews_hf["DayOfWeek"].asfactor()

### Prepare training and validation dataset for modeling

In [None]:
splits = final_reviews_hf.split_frame(ratios=[0.75], destination_frames=["train", "valid"], seed=42)

In [None]:
train_hf = splits[0]
valid_hf = splits[1]
#train_hf.show()

### Memory Cleanup

In [None]:
final_reviews_hf = None
reviews_hf = None

#### List available data

In [None]:
h2o.ls()

## Model training

### Random grid search with explicit stopping criteria


#### Define a hyper space to explore

> Please feel free to play with parameters, see documentation in [H2O Python Documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#module-h2o.grid.grid_search).

In [None]:
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

hyper_params = {'activation' : ["Rectifier", "TanhWithDropout"], 
                'hidden' : [ [2,2], [10,10]],
                'epochs' : [ 1, 2, 5]
               }

#### Define stopping criteria

> Modify these based on your requirements (time-bound vs. accuracy-bound search).

In [None]:
search_criteria = {'strategy' : 'RandomDiscrete',
                   'max_runtime_secs': 120,
                   'stopping_rounds' : 3,
                   'stopping_metric' : 'AUC', # alternatives: AUTO, mse, logloss
                   'stopping_tolerance': 1e-2
                   }
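
To show what a random discrete search over this hyper space amounts to, here is a simplified sketch in plain Python. It enumerates all combinations, visits them in random order, and stops when a time budget is exhausted. The function `random_discrete_search` and the scoring function `fake_auc` are hypothetical stand-ins (no model is trained), and the metric-based early stopping controlled by `stopping_rounds` and `stopping_tolerance` is deliberately left out.

```python
import itertools
import random
import time

def random_discrete_search(hyper_params, evaluate, max_runtime_secs, seed=42):
    """Try hyper-parameter combinations in random order until the time budget runs out."""
    names = sorted(hyper_params)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(hyper_params[n] for n in names))]
    random.Random(seed).shuffle(combos)

    deadline = time.monotonic() + max_runtime_secs
    results = []
    for params in combos:
        if time.monotonic() >= deadline:
            break  # time budget exhausted
        results.append((evaluate(params), params))
    # Best score first, similar to a leaderboard sorted by AUC.
    return sorted(results, key=lambda item: item[0], reverse=True)

hyper_params = {'activation': ["Rectifier", "TanhWithDropout"],
                'hidden': [[2, 2], [10, 10]],
                'epochs': [1, 2, 5]}

def fake_auc(params):
    """Hypothetical stand-in for training a model and returning its validation AUC."""
    return 0.5 + 0.01 * sum(params["hidden"]) / 20 + 0.01 * params["epochs"]

leaderboard = random_discrete_search(hyper_params, fake_auc, max_runtime_secs=120)
print(leaderboard[0])
```

With 2 activations, 2 layer layouts, and 3 epoch settings there are 12 combinations in total, so a time budget mainly matters once the hyper space grows.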

#### Launch Random Hyper Search

> For more details look into [H2O Deep Learning documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2odeeplearningestimator)

In [None]:
models_grid = H2OGridSearch(H2ODeepLearningEstimator, hyper_params=hyper_params, search_criteria=search_criteria)
models_grid.train(x = train_hf.col_names, y = "Score", \
                  training_frame = train_hf, \
                  validation_frame = valid_hf, \
                  variable_importances=True)

### The best model is ...

In [None]:
models_grid.sort_by('auc', False)

### The best model details

In [None]:
best_model = h2o.get_model(models_grid.sort_by('auc', False)[0][0])
best_model.model_performance(valid_hf)

### What are the most important features?

In [None]:
best_model.varimp(use_pandas=True)

# Congratulations, you built your first model using Azure, PySparkling, and H2O!