### Let's first configure the Spark cluster
Add the packages for H2O Sparkling Water and spark-csv. Also set the driver memory to match the size of the cluster's VMs.
<br><b>Make sure that you are adding the right Sparkling Water version and setting the driver memory to match the RAM of the VM sizes selected on cluster creation </b>
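
Because the package versions and the driver memory have to change together with the cluster, it can help to generate the cell body instead of editing the string by hand. The sketch below is only an illustration: the helper name `build_configure_payload` and its parameters are hypothetical, and the defaults simply repeat the values used in the cell that follows.

```python
import json

def build_configure_payload(sw_version="1.6.8", scala_version="2.10",
                            csv_version="1.5.0", driver_memory="8G"):
    """Build the JSON body for the %%configure cell (hypothetical helper)."""
    packages = [
        # Maven coordinates are group:artifact_scalaVersion:version
        "ai.h2o:sparkling-water-core_%s:%s" % (scala_version, sw_version),
        "com.databricks:spark-csv_%s:%s" % (scala_version, csv_version),
    ]
    return {
        "conf": {"spark.jars.packages": ",".join(packages)},
        "driverMemory": driver_memory,
    }

payload = build_configure_payload()
print(json.dumps(payload, indent=4))
```

The printed JSON can be pasted under `%%configure -f` after adjusting the arguments to your Sparkling Water release and VM size.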

In [None]:
%%configure -f
{
    "conf":{
        "spark.jars.packages":"ai.h2o:sparkling-water-core_2.10:1.6.8,com.databricks:spark-csv_2.10:1.5.0"
    },
    "driverMemory":"8G"
}

Here we add the PySparkling egg file from the downloaded H2O Sparkling Water distribution.
<br><b>Make sure that the name of the egg file below matches the downloaded distribution of Sparkling Water.</b>

In [None]:
sc.addPyFile('wasb:///HdiNotebooks/H2O-Sparkling-Water/py/dist/h2o_pysparkling_1.6-1.6.8-py2.7.egg')
import os
os.environ["PYTHON_EGG_CACHE"] = "~/"

# Sentiment Analysis with PySparkling
The Amazon Fine Food Reviews dataset consists of 568,454 food reviews Amazon users left up to October 2012.

> This data was originally published on SNAP as part of the paper: J. McAuley and J. Leskovec. _From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews_. WWW, 2013.

## Prepare environment

In [None]:
import pyspark
import pysparkling, h2o

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

## Load data into H2OFrame

**Please change the variable `AMAZON_DATASET` to your environment-specific location.**

In [None]:
# Helper function returning the path to data files in the Sparkling Water examples directory
def _locate(example_name): 
    return "wasb:///HdiNotebooks/H2O-Sparkling-Water/examples/" + example_name 

AMAZON_DATASET = 'Reviews.csv'

# Add files to Spark Cluster

sc.addFile(_locate(AMAZON_DATASET))

In [None]:
# And import them into H2O
from pyspark import SparkFiles

# The file has already been added to Spark, so we use h2o.upload_file rather than
# h2o.import_file. h2o.import_file expects a path that every machine in the cluster
# can read, whereas SparkFiles.get(..) returns the path of the local copy on the
# current node. h2o.upload_file reads that local copy and distributes it to the
# H2O cluster.

reviews_hf = h2o.upload_file(SparkFiles.get(AMAZON_DATASET))


In [None]:
reviews_hf.show()

## Data munging with H2O API

### Remove columns

In [None]:
selected_columns = [ "Score", "Time", "Summary", "HelpfulnessNumerator", "HelpfulnessDenominator" ]
reviews_hf = reviews_hf[selected_columns]

In [None]:
reviews_hf.show()

### Refine `Time` Column into Year/Month/Day/DayOfWeek/Hour columns
In this case the `Time` column contains the number of seconds since the Unix epoch. We split it into several new columns to help the algorithms pick up the right patterns.
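
To make the decomposition concrete, here is a minimal plain-Python sketch of the same idea, run outside the cluster. It is not the H2O implementation: the function name `split_epoch_seconds`, the `Millis` key, and the day-name abbreviations are assumptions made for illustration only.

```python
import time

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def split_epoch_seconds(seconds):
    """Break a Unix timestamp (seconds, UTC) into calendar components."""
    t = time.gmtime(seconds)  # always interprets the value as UTC
    return {
        "Millis": seconds * 1000,  # the same scaling the notebook applies for H2O
        "Year": t.tm_year,
        "Month": t.tm_mon,
        "Day": t.tm_mday,
        "DayOfWeek": DAY_NAMES[t.tm_wday],  # tm_wday counts from Monday = 0
        "Hour": t.tm_hour,
    }

parts = split_epoch_seconds(1303862400)  # an arbitrary example timestamp
print(parts)
```

For this example value the components are 27 April 2011, 00:00 UTC, a Wednesday.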

In [None]:
# Set time zone to UTC for date manipulation
h2o.set_timezone("Etc/UTC")

In [None]:
def refine_time_column(data_hf, column_name):
    data_hf[column_name] = data_hf[column_name] * 1000 # Convert seconds to milliseconds, as required by the H2O API
    data_hf["Day"] = data_hf[column_name].day()
    data_hf["Month"] = data_hf[column_name].month()
    data_hf["Year"] = data_hf[column_name].year()
    data_hf["DayOfWeek"] = data_hf[column_name].dayOfWeek()
    data_hf["Hour"] = data_hf[column_name].hour()
    
refine_time_column(reviews_hf, "Time")
reviews_hf.show()

## Data Munge with Spark API
We can combine H2O data munging capabilities with the Spark API.

### Publish H2O Frame as Spark DataFrame

The created H2OContext exposes the method `as_spark_frame` which publishes an H2OFrame as Spark DataFrame.

In [None]:
reviews_df = h2o_context.as_spark_frame(reviews_hf)
reviews_df.show()


In [None]:
#reviews_df.saveAsTable("reviewstable")
sqlContext.registerDataFrameAsTable(reviews_df, "reviewstable")

In [None]:
%%sql 
show tables

### Spark DataFrame API

From this point we can run any Spark data munging operations including SQL.
We can still publish the result as H2OFrame.

In [None]:
avgScorePerYear = reviews_df.groupBy("Year").agg({"Score" : "avg", "*": "count"}).orderBy("Year")
avgScorePerYear.show()

In [None]:
sqlContext.registerDataFrameAsTable(avgScorePerYear, "avgscoretable")

Now we can query the registered temporary table and output the results to a pandas DataFrame (using the `-o` option).

In [None]:
%%sql -q -n 500 -o query1
select * from avgscoretable

### Visualize the results directly in the Python notebook

In [None]:
%%local
%matplotlib inline

query1.plot.bar(x="Year", y = "count(1)")

### Prepare data for modeling
The idea is to model sentiment based on the review's `Score`, its `Summary`, and the time when the review was written. In this case we skip all neutral reviews and focus on positive/negative scores.

Steps:

  1. Select columns Score, Month, Day, DayOfWeek, Summary
  2. Define a UDF to transform the score (1..5) into a binary positive/negative label
  3. Use TF-IDF to vectorize summary column

#### Transform the `Score` column into binary feature

The score takes values from 1 to 5, but we are only interested in a binary value: a positive or a negative review. We ignore neutral reviews (score 3).
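
As a plain-Python sketch of the intended labelling rule, assuming that 3 is the neutral score: the name `label_review` is hypothetical and is not the UDF used in this notebook.

```python
def label_review(score):
    """Map a 1..5 star score to a sentiment label; 3 stars counts as neutral."""
    if score < 3:
        return "negative"
    if score > 3:
        return "positive"
    return None  # neutral, to be dropped

scores = [1, 2, 3, 4, 5]
# Keep only the reviews that received a label
labeled = [(s, label_review(s)) for s in scores if label_review(s) is not None]
print(labeled)
```

Note that a two-way rule of the form `score < 3` / otherwise would label a 3-star review as positive, so the neutral rows have to be removed in a separate step.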

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import UserDefinedFunction

def to_binary_score(col):
    if col < 3:
        return "negative"
    else:
        return "positive"
udf_to_binary_score = UserDefinedFunction(to_binary_score, StringType())

In [None]:
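# Drop neutral reviews (score 3) before mapping the remaining scores to labels
reviews_df = reviews_df.filter(reviews_df["Score"] != 3)
reviews_df = reviews_df.withColumn("Score", udf_to_binary_score("Score"))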
reviews_df.show()

### Transforming textual data into numeric representation

#### Tokenization

In [None]:
from pyspark.ml.feature import *

tokenizer = Tokenizer(inputCol="Summary", outputCol="tokens")

#### Transform tokens into numeric representation

We use Spark `HashingTF` to represent tokens as numeric features.
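
The hashing trick behind this step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark's code: `hashing_tf` is a hypothetical name and `zlib.crc32` stands in for whatever hash function Spark uses internally, so the bucket indices will not match Spark's.

```python
import zlib

def hashing_tf(tokens, num_features=1024):
    """Count tokens into a fixed number of buckets chosen by a hash of the token."""
    counts = {}
    for token in tokens:
        # crc32 is only a stand-in hash; the modulo maps it into [0, num_features)
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts

vec = hashing_tf("good good dog food".split())
print(vec)  # sparse representation: {bucket index: term count}
```

Because the vector length is fixed up front, no vocabulary has to be built, at the cost that distinct tokens can collide in the same bucket. A larger `num_features` reduces collisions.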

In [None]:
hashingTF = HashingTF()
hashingTF.setInputCol("tokens").setOutputCol("tf-features").setNumFeatures(1024)

#### Build IDF (Inverse Document Frequency) model
The model rescales each token frequency based on the token's occurrence in a document and in the full set of documents.
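
A minimal sketch of the weighting, using the smoothed formula given in the Spark MLlib documentation, $\log\frac{m + 1}{df + 1}$, where $m$ is the number of documents and $df$ the number of documents containing the term. The function name `idf_weights` and the toy documents are made up for illustration.

```python
import math

def idf_weights(docs):
    """Return log((m + 1) / (df + 1)) for every term seen in the documents."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1.0) / (n + 1.0)) for term, n in df.items()}

docs = [["great", "coffee"], ["great", "tea"], ["stale", "tea"]]
weights = idf_weights(docs)
print(weights)
```

Terms that occur in fewer documents, such as `stale` here, receive a larger weight than terms that occur in many; a term present in every document gets weight 0.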

In [None]:
idf = IDF()
idf.setInputCol("tf-features")
idf.setOutputCol("idf-features")

#### Compose individual transformations into a Spark pipeline

In [None]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [tokenizer, hashingTF, idf])
pipelineModel = pipeline.fit(reviews_df)

#### And transform input data

In [None]:
final_reviews_df = pipelineModel.transform(reviews_df)
#final_reviews_df.show()

## Back to H2O Frame (materialization)

In [None]:
final_columns = ["Score", "HelpfulnessNumerator", "HelpfulnessDenominator", "Day", "Month", "Year", "DayOfWeek", "idf-features"]
final_reviews_hf = h2o_context.as_h2o_frame(final_reviews_df.select(final_columns), "final_reviews_hf")
final_reviews_hf.show()

### Score and DayOfWeek columns need to be factors

In [None]:
final_reviews_hf["Score"] = final_reviews_hf["Score"].asfactor()
final_reviews_hf["DayOfWeek"] = final_reviews_hf["DayOfWeek"].asfactor()

### Prepare training and validation dataset for modeling
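
`split_frame` assigns rows to the splits at random, so a ratio of 0.75 yields approximately, not exactly, three quarters of the rows in the first split. The plain-Python sketch below illustrates that behaviour; it is not H2O's implementation, and `random_split` is a hypothetical name.

```python
import random

def random_split(rows, ratio=0.75, seed=42):
    """Send each row to the first split with probability `ratio`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second

train_rows, valid_rows = random_split(list(range(1000)))
print(len(train_rows), len(valid_rows))
```

Every row lands in exactly one split, and rerunning with the same seed reproduces the same assignment.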

In [None]:
splits = final_reviews_hf.split_frame(ratios=[0.75], destination_frames=["train", "valid"], seed=42)

In [None]:
train_hf = splits[0]
valid_hf = splits[1]
#train_hf.show()

### Memory Cleanup

In [None]:
final_reviews_hf = None
reviews_hf = None

#### List available data

In [None]:
h2o.ls()

## Model training

### Random grid search with explicit stopping criteria


#### Define a hyper space to explore

> Please feel free to play with parameters, see documentation in [H2O Python Documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#module-h2o.grid.grid_search).
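
To see what a random search over a discrete hyper space amounts to, here is a plain-Python sketch that enumerates every combination and draws a random subset. It is an illustration only, not H2O's grid search: `random_discrete` and `max_models` are hypothetical names, and the model-count cap stands in for a time budget.

```python
import itertools
import random

hyper_space = {
    "activation": ["Rectifier", "TanhWithDropout"],
    "hidden": [[2, 2], [10, 10]],
    "epochs": [1, 2, 5],
}

def random_discrete(space, max_models, seed=42):
    """Shuffle the Cartesian product of all values and keep the first max_models."""
    names = sorted(space)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(space[n] for n in names))]
    random.Random(seed).shuffle(combos)
    return combos[:max_models]

candidates = random_discrete(hyper_space, max_models=5)
for params in candidates:
    print(params)
```

With two activations, two layer layouts, and three epoch settings there are 2 × 2 × 3 = 12 combinations in total, so a budget-limited random search may train only some of them.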

In [None]:
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

hyper_params = {'activation' : ["Rectifier", "TanhWithDropout"], 
                'hidden' : [ [2,2], [10,10]],
                'epochs' : [ 1, 2, 5]
               }

#### Define stopping criteria

> Modify based on your demands and requirements (time-bound vs. accuracy-bound search)

In [None]:
search_criteria = {'strategy' : 'RandomDiscrete',
                   'max_runtime_secs': 120,
                   'stopping_rounds' : 3,
                   'stopping_metric' : 'AUC', # AUTO, mse, logloss
                   'stopping_tolerance': 1e-2
                   }

#### Launch Random Hyper Search

> For more details look into [H2O Deep Learning documentation](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2odeeplearningestimator)

In [None]:
models_grid = H2OGridSearch(H2ODeepLearningEstimator, hyper_params=hyper_params, search_criteria=search_criteria)
# Use every column except the response as a predictor
predictors = [c for c in train_hf.col_names if c != "Score"]
models_grid.train(x=predictors, y="Score",
                  training_frame=train_hf,
                  validation_frame=valid_hf,
                  variable_importances=True)

### The best model is ...
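
The grid is ranked by AUC, which can be read as the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by enumerating all such pairs; the function name `auc` and the toy scores are made up for illustration, and H2O derives its value from the ROC curve instead, so its numbers may differ slightly.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

example_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(example_auc)
```

In this toy example three of the four positive/negative pairs are ordered correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.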

In [None]:
models_grid.sort_by('auc', False)

### The best model details

In [None]:
best_model = h2o.get_model(models_grid.sort_by('auc', False)[0][0])
best_model.model_performance(valid_hf)

### What are most important features?

In [None]:
best_model.varimp(use_pandas=True)

# Congratulations, you built your first model using Azure + PySparkling and H2O!!!