This tutorial describes how to create a Distributed Random Forest (DRF) model using H2O Flow.
Those who have never used H2O before should refer to Getting Started for additional instructions on how to run H2O Flow.
This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/
The data are composed of 3279 observations, 1557 attributes, and an a priori grouping assignment. The objective is to build a prediction tool that predicts whether an object is an internet ad or not.
If you don't have any data of your own to work with, you can find some example datasets at http://data.h2o.ai.
Before creating a model, import data into H2O:
- Click the Assist Me! button (the last button in the row of buttons below the menus).
- Click the importFiles link and enter the file path to the dataset in the Search entry field.
- Click the Add all link to add the file to the import queue, then click the Import button.
Now, parse the imported data:
- Click the Parse these files... button.
Note: The default options typically do not need to be changed unless the data does not parse correctly.
- From the drop-down Parser list, select the file type of the data set (Auto, XLS, CSV, or SVMLight).
- If the data uses a separator, select it from the drop-down Separator list.
- If the data uses a column header as the first row, select the First row contains column names radio button. If the first row contains data, select the First row contains data radio button. To have H2O automatically determine if the first row of the dataset contains column names or data, select the Auto radio button.
- If the data uses apostrophes (' - also known as single quotes), check the Enable single quotes as a field quotation character checkbox.
- To delete the imported dataset after parsing, check the Delete on done checkbox.
NOTE: In general, we recommend enabling this option. Retaining data requires memory resources, but does not aid in modeling because unparsed data cannot be used by H2O.
- Review the data in the Edit Column Names and Types section.
- Click the Next page button until you reach the last page.
- For column 1559, select Enum from the drop-down column type menu.
- Click the Parse button.
NOTE: Make sure the parse is complete by confirming progress is 100% before continuing to the next step, model building. For small datasets, this should only take a few seconds, but larger datasets take longer to parse.
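For readers who prefer scripting, the import and parse steps above can be approximated with the H2O Python client. This is a minimal sketch rather than part of the Flow workflow. It assumes the `h2o` package is installed, that the data file has been saved locally as `ad.data` (a hypothetical path), and that columns receive the automatic names C1 through C1559. The helper names `build_parse_options` and `import_and_parse` are illustrative only.

```python
def build_parse_options(n_columns=1559, label_column=1559):
    """Return keyword arguments that mirror the Flow parse settings.

    Assumption: only the label column needs an explicit type (enum),
    as in the Flow steps above; all other settings are left for H2O
    to detect automatically.
    """
    if not 1 <= label_column <= n_columns:
        raise ValueError("label_column must lie within 1..n_columns")
    return {
        "destination_frame": "ad.hex",  # frame name used in this tutorial
        "header": 0,                    # 0 lets H2O guess whether row 1 is a header
        "col_types": {f"C{label_column}": "enum"},
    }


def import_and_parse(path="ad.data"):
    """Import and parse the file in one call (needs a reachable H2O instance)."""
    import h2o  # imported lazily so build_parse_options works without H2O

    h2o.init()
    return h2o.import_file(path, **build_parse_options())
```

Calling `import_and_parse()` should return a frame comparable to the parsed ad.hex frame in Flow, provided the path points at the downloaded data file.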
- Once data are parsed, click the View button, then click the Build Model button.
- Select Distributed RF from the drop-down Select an algorithm menu, then click the Build model button.
- If the parsed ad.hex file is not already listed in the Training_frame drop-down list, select it. Otherwise, continue to the next step.
- From the Response column drop-down list, select C1.
- In the Ntrees field, specify the number of trees for the model to build. For this example, enter 150.
- In the Max_depth field, specify the maximum distance from the root to a terminal node. For this example, use the default value of 20.
- In the Mtries field, specify the number of features randomly sampled as split candidates at each node. For this example, enter 1000.
- Click the Build Model button.
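The same model settings can be sketched with the H2O Python client. This is a hedged example, not the Flow procedure itself. It assumes the `h2o` package is installed and that `frame` is an already parsed H2OFrame. The helper names are hypothetical, and the check on `mtries` reflects only the general rule that a split cannot sample more features than the frame provides.

```python
def build_drf_params(n_predictors, ntrees=150, max_depth=20, mtries=1000):
    """Collect the DRF settings from this example and sanity-check them."""
    if ntrees < 1 or max_depth < 1:
        raise ValueError("ntrees and max_depth must be positive")
    # -1 is H2O's default (automatic choice); otherwise mtries must fit the data.
    if mtries != -1 and not 1 <= mtries <= n_predictors:
        raise ValueError("mtries must be -1 or between 1 and n_predictors")
    return {"ntrees": ntrees, "max_depth": max_depth, "mtries": mtries}


def train_drf(frame, response="C1"):
    """Train a DRF model on `frame` (needs a reachable H2O instance)."""
    from h2o.estimators import H2ORandomForestEstimator

    # Every column except the response is used as a predictor.
    predictors = [name for name in frame.columns if name != response]
    model = H2ORandomForestEstimator(**build_drf_params(len(predictors)))
    model.train(x=predictors, y=response, training_frame=frame)
    return model
```

Keeping the parameter check separate from the training call makes it possible to catch an out-of-range `mtries` before any trees are built.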
The DRF model output includes the following:
- Model parameters (hidden)
- Scoring history graph (number for each tree and MSE)
- ROC curve, training metrics, AUC (with drop-down menus to select thresholds and criterion)
- Variable importances (variable name, relative importance, scaled importance, percentage)
- Output (model category, validation metrics, initf)
- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
- Scoring history (in tabular format)
- Training metrics (model name, model checksum, frame name, frame checksum, description if applicable, model category, duration in ms, scoring time, predictions, MSE, R2, Logloss, AUC, Gini)
- Domain
- Training metrics (thresholds, F1, F2, F0Points, Accuracy, Precision, Recall, Specificity, Absolute MCC, min. per-class accuracy, TNS, FNS, FPS, TPS, IDX)
- Maximum metrics (metric, threshold, value, IDX)
- Variable importances
- Preview POJO
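To make the per-threshold training metrics easier to read, the sketch below computes them from the four confusion counts (TPS, FPS, TNS, FNS) using the standard textbook definitions. It is an illustration only: the function name is hypothetical, and H2O's own implementation may treat edge cases such as empty classes differently.

```python
import math


def threshold_metrics(tps, fps, tns, fns):
    """Compute common binary-classification metrics from confusion counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class or prediction is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tps, tps + fps)
    recall = ratio(tps, tps + fns)        # true positive rate
    specificity = ratio(tns, tns + fps)   # true negative rate

    def f_beta(beta):
        # F-beta weights recall beta times as heavily as precision.
        b2 = beta * beta
        return ratio((1 + b2) * precision * recall, b2 * precision + recall)

    mcc_denominator = math.sqrt(
        (tps + fps) * (tps + fns) * (tns + fps) * (tns + fns)
    )
    mcc = ratio(tps * tns - fps * fns, mcc_denominator)

    return {
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
        "f0point5": f_beta(0.5),
        "accuracy": ratio(tps + tns, tps + fps + tns + fns),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "absolute_mcc": abs(mcc),
        "min_per_class_accuracy": min(recall, specificity),
    }
```

Each row of the thresholds table corresponds to one set of counts, so calling this function on a row's TPS, FPS, TNS, and FNS values should give figures close to the ones Flow reports for that threshold.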
To generate a prediction, click the Predict button in the model results and select the ad.hex file from the drop-down Frame list, then click the Predict button.
You can also click the Inspect button to access more information (for example, columns or data).
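The prediction step has a rough scripted equivalent in the H2O Python client. The sketch below assumes `model` is a trained H2O model and `frame` is a parsed H2OFrame. The function name `score_frame` is hypothetical.

```python
def score_frame(model, frame):
    """Score `frame` with `model` and report the prediction columns.

    For a classification model, H2O returns a frame whose first column,
    `predict`, holds the predicted class, followed by one probability
    column per class.
    """
    predictions = model.predict(frame)
    return predictions, list(predictions.columns)
```

Inspecting the returned column names is a quick way to confirm that the model was treated as a classifier, which depends on the response column having been parsed as Enum.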