# Introduction to H2O with Python #

This is an introductory demo. Its purpose is to demonstrate basic usage of the H2O Python client.

## Outline ##

1. Start H2O Cluster
2. Import Data
3. Train Model
4. Predict
5. Export MOJO and predict using MOJO

## Starting H2O Cluster ##

The default way to start an H2O cluster is the `h2o.init()` method.

In [None]:
import h2o
h2o.init()

Please check the output of the previous command. It contains important information such as the IP address of the H2O server, its version, and its age.

Now let us import data into the cluster. Data can be loaded from various sources (S3, HDFS, local file, etc.), but for the purposes of this demo we will use a _local file_.

You can get the dataset at `https://s3.amazonaws.com/benchm-ml--main/train-1m.csv`.

In [None]:
data = h2o.import_file('/Users/michalraska/Development/h2o/datasets/train-1m.csv')
data.head(6)
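
The path in the cell above is specific to one machine. A minimal sketch of one way to obtain a local copy first, using only the Python standard library; `dataset_filename` and `fetch_dataset` are hypothetical helper names introduced here for illustration and are not part of H2O:

```python
import os
import urllib.request
from urllib.parse import urlparse


def dataset_filename(url):
    """Return the last path component of the URL, e.g. 'train-1m.csv'."""
    return os.path.basename(urlparse(url).path)


def fetch_dataset(url, directory="."):
    """Download url into directory unless the file is already there; return its path."""
    target = os.path.join(directory, dataset_filename(url))
    if not os.path.exists(target):
        # Only reached when no local copy exists yet
        urllib.request.urlretrieve(url, target)
    return target
```

The returned path can then be passed to `h2o.import_file()` in place of the hard-coded one.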

To distinguish between _regression_ and _classification_ problems, H2O checks the type of the response column. The `describe()` method is useful for checking column types and for getting additional information about the dataset.

In [None]:
data.describe()

The previous output shows that the response column is of type enum, which is what a classification problem requires. To convert a column to enum, use `asfactor()`, for example `data['dep_delayed_15min'] = data['dep_delayed_15min'].asfactor()`. Similar methods exist for other types, for example `asnumeric()`, `ascharacter()` or `as_date()`.
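
The link between the response type and the kind of problem can be sketched in plain Python. This is an illustration only, not H2O's internal code: `problem_type` is a hypothetical helper, and the type strings (`'enum'`, `'int'`, `'real'`) are assumed to match the ones reported for the frame's columns.

```python
def problem_type(column_type):
    """Map a response column type to the kind of problem it implies."""
    if column_type == "enum":
        return "classification"
    if column_type in ("int", "real"):
        return "regression"
    raise ValueError("unsupported response type: %s" % column_type)


print(problem_type("enum"))  # classification
print(problem_type("int"))   # regression
```

This is why a 0/1 response stored as integers should be converted with `asfactor()` before training a classifier.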

A common practice is to split the original data into two datasets:

- training dataset
- validation dataset

In [None]:
training, validation = data.split_frame(ratios=[0.8], seed=1)
print('Training dataset:')
training.describe()
print('Validation dataset:')
validation.describe()
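
`split_frame()` assigns rows to the splits at random, so the resulting sizes are close to the requested ratio but generally not exact. A minimal plain-Python sketch of that idea follows; `probabilistic_split` is a hypothetical helper that works on a list, and it is not how H2O implements the split on a distributed frame.

```python
import random


def probabilistic_split(rows, ratio, seed):
    """Send each row to the first split with probability `ratio`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(10000))
train_rows, valid_rows = probabilistic_split(rows, ratio=0.8, seed=1)
# The two sizes are near 8000 and 2000 and always add up to 10000
print(len(train_rows), len(valid_rows))
```

Fixing the seed, as the cell above does with `seed=1`, makes the split reproducible between runs.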

To train a model in H2O, we first import the estimator class for the chosen algorithm, which is `H2OGradientBoostingEstimator` for GBM. For other algorithms we would import a different class, for example `H2ORandomForestEstimator` for Distributed Random Forest.

In [None]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

First we create an instance of `H2OGradientBoostingEstimator`. Various parameters can be specified during instantiation; any parameter that is not supplied keeps its default value.

In [None]:
gbm_model = H2OGradientBoostingEstimator(ntrees=120, model_id='gbm_airlines_python', seed=1)

To train the model, we specify the features `x` (all columns except `dep_delayed_15min`) and the response column `y` (`dep_delayed_15min`). We must also pass the training dataset as `training_frame`; the validation dataset, `validation_frame`, is optional.

In [None]:
# All columns except the last one; this assumes the response is the last column
features = data[:, 0:-1].col_names
response = 'dep_delayed_15min'
gbm_model.train(x=features, y=response, training_frame=training, validation_frame=validation)
print("AUC Train: %f" % gbm_model.auc(train=True))
print("AUC Validation: %f" % gbm_model.auc(valid=True))
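
AUC can be read as the probability that a randomly chosen delayed flight (`Y`) receives a higher score than a randomly chosen on-time flight (`N`). The sketch below computes it by brute force over all pairs on a made-up toy input; `pairwise_auc` is a hypothetical helper for illustration only, and H2O uses its own implementation to compute the metric.

```python
def pairwise_auc(labels, scores, positive="Y"):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [score for label, score in zip(labels, scores) if label == positive]
    neg = [score for label, score in zip(labels, scores) if label != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = ["N", "N", "Y", "Y"]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up probabilities of 'Y'
print(pairwise_auc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which helps when comparing the training and validation figures printed above.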

While the model trains, we can monitor its progress in H2O Flow, the web interface of the cluster.

Once the model is trained, we can use it to make predictions. First we import the test dataset, then we call the `predict()` method. The dataset can be downloaded from `https://s3.amazonaws.com/benchm-ml--main/test.csv`.

In [None]:
test = h2o.import_file('/Users/michalraska/Development/h2o/datasets/test.csv')
test.describe()
prediction = gbm_model.predict(test)
prediction.head(20)

If we are satisfied with the results, we can export the model as a MOJO (Model Object, Optimized).

In [None]:
mojo_path = gbm_model.download_mojo(path="./", get_genmodel_jar=True)
print("Path to zip: %s" % mojo_path)
!ls -alh gbm_airlines_python.zip
!ls -alh h2o-genmodel.jar
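
The `!ls` lines work only inside IPython or Jupyter. A minimal sketch of an equivalent check in plain Python, assuming the two files were written to the current directory as in the cell above; `file_size_mb` is a hypothetical helper name:

```python
import os


def file_size_mb(path):
    """Return the size of the file in megabytes, or None if it does not exist."""
    if not os.path.isfile(path):
        return None
    return os.path.getsize(path) / (1024.0 * 1024.0)


for name in ("gbm_airlines_python.zip", "h2o-genmodel.jar"):
    # Prints None for a file that is missing
    print(name, file_size_mb(name))
```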

Scoring with a MOJO does not require a running H2O cluster, so we can shut it down.

In [None]:
h2o.cluster().shutdown()

Now let's use the MOJO for predictions. For this, we will use the `mojo_predict_csv` method from `h2o.utils.shared_utils` and score the same CSV file as before.

In [None]:
import h2o.utils.shared_utils as h2o_utils
prediction = h2o_utils.mojo_predict_csv(
    input_csv_path='/Users/michalraska/Development/h2o/datasets/test.csv', 
    mojo_zip_path='./gbm_airlines_python.zip', 
    genmodel_jar_path='./h2o-genmodel.jar',
    verbose=False
)
for i, p in enumerate(prediction[0:11]):
    print("%s.\tPredict: %s\tN: %s\tY: %s" % ((i + 1), p['predict'], p['N'], p['Y']))
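
The loop above treats each prediction as a dictionary keyed by `predict`, `N` and `Y`. Assuming the values arrive as strings read back from a CSV file (an assumption worth checking with `type()` on your H2O version), a small hypothetical helper such as `summarize` can convert them and count the predicted classes. The sample rows are made up for illustration.

```python
from collections import Counter


def summarize(rows):
    """Count predicted labels and average the probability of 'Y' over the rows."""
    counts = Counter(row["predict"] for row in rows)
    mean_y = sum(float(row["Y"]) for row in rows) / len(rows)
    return counts, mean_y


sample = [
    {"predict": "N", "N": "0.9", "Y": "0.1"},
    {"predict": "Y", "N": "0.3", "Y": "0.7"},
    {"predict": "N", "N": "0.8", "Y": "0.2"},
]
counts, mean_y = summarize(sample)
print(counts["N"], counts["Y"])  # 2 1
print(round(mean_y, 3))
```

Calling `summarize(prediction)` on the real output would give the same kind of summary for the whole test file.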