# Data Science Algorithms

>**Note**: This topic is no longer being maintained. Refer to the topics in the [Data Science](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/data-science) folder for the most up-to-date documentation.


This document describes how to define each model, how to interpret the model output, and how each algorithm works, and provides an FAQ.

## Commonalities

### Quantiles


**Note**: The quantile results in Flow are computed lazily on demand and cached. Flow uses a fast approximation (a fixed bin width of (max - min)/1024) that is very accurate for most use cases.
If the distribution is skewed, the quantile results may not be as accurate as the results obtained using `h2o.quantile` in R or `H2OFrame.quantile` in Python.

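The binning approximation can be illustrated with a short sketch. This is an illustrative reimplementation, not H2O's code: the range is divided into 1024 equal-width bins, and the quantile is read off the resulting histogram.

```python
def approx_quantile(values, prob, bins=1024):
    """Approximate the prob-quantile using equal-width bins of (max - min) / bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[i] += 1
    target = prob * len(values)
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        if cumulative >= target:
            return lo + (i + 1) * width  # upper edge of the bin holding the quantile
    return hi
```

The error is bounded by one bin width, which is why the approximation degrades on heavily skewed data: most observations crowd into a few bins, so one bin width can span many quantile positions.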
<a name="Kmeans"></a>
## K-Means

### Introduction

K-Means falls in the general category of clustering algorithms.

### Defining a K-Means Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: (Optional) Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **k***: Specify the number of clusters.

- **user_points**: Specify a vector of initial cluster centers. The user-specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters.

- **max_iterations**: Specify the maximum number of training iterations. The range is 0 to 1e6.

- **init**: Select the initialization mode. The options are Random, Furthest, PlusPlus, or User. **Note**: If PlusPlus is selected, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.

- **fold_assignment**: (Applicable only if a value for **nfolds** is specified and **fold_column** is not selected) Select the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, or [Modulo](https://en.wikipedia.org/wiki/Modulo_operation).

- **fold_column**: Select the column that contains the cross-validation fold index assignment per observation.

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **standardize**: To standardize the numeric columns to have a mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.

>**Note**: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers on both the standardized scale (`centers_std`) and the de-standardized scale (`centers`) are displayed.
>To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using `h2o.scale` in R with `center = TRUE` and `scale = TRUE` on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.

- **keep\_cross\_validation\_predictions**: To keep the cross-validation predictions, check this checkbox.

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
58
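The de-standardization of cluster centers described in the **standardize** note above can be sketched with numpy (illustrative only, with made-up data and made-up centers):

```python
import numpy as np

# Made-up training data: two numeric columns on very different scales.
X = np.array([[1.0, 100.0],
              [3.0, 300.0],
              [5.0, 500.0]])
mean = X.mean(axis=0)  # original column means
std = X.std(axis=0)    # original column standard deviations

# Suppose K-Means, run on the standardized data, returned these centers:
centers_std = np.array([[-1.0, -1.0],
                        [ 1.0,  1.0]])

# De-standardize: multiply by the original std and add the original mean.
centers = centers_std * std + mean
```

Each de-standardized center lands back on the original scale of its column, which is what the `centers` output reports.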
### Interpreting a K-Means Model

By default, the following output displays:

- A graph of the scoring history (number of iterations vs. average within-cluster sum of squares)
- Output (model category, validation metrics if applicable, and `centers_std`)
- Model Summary (number of clusters, number of categorical columns, number of iterations, avg. within sum of squares, avg. sum of squares, avg. between sum of squares)
- Scoring history (number of iterations, avg. change of standardized centroids, avg. within cluster sum of squares)
- Training metrics (model name, checksum name, frame name, frame checksum name, description if applicable, model category, duration in ms, scoring time, predictions, MSE, avg. within sum of squares, avg. between sum of squares)
- Centroid statistics (centroid number, size, within sum of squares)
- Cluster means (centroid number, column)

K-Means randomly chooses starting points and converges to a local minimum of centroids. The number of clusters is arbitrary and should be thought of as a tuning parameter.
The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run because this problem is NP-hard (non-deterministic polynomial-time hard).

### FAQ

- **How does the algorithm handle missing values during training?**

  Missing values are automatically imputed by the column mean. K-Means also handles missing values by assuming that missing feature distance contributions are equal to the average of all other distance term contributions.

- **How does the algorithm handle missing values during testing?**

  Missing values are automatically imputed by the column mean of the training data.

- **Does it matter if the data is sorted?**

  No.

- **Should data be shuffled before training?**

  No.

- **What if there are a large number of columns?**

  K-Means suffers from the curse of dimensionality: all points are roughly at the same distance from each other in high dimensions, making the algorithm less and less useful.

- **What if there are a large number of categorical factor levels?**

  This can be problematic, as categoricals are one-hot encoded on the fly, which can lead to the same problem as datasets with a large number of columns.

### K-Means Algorithm

The number of clusters \(K\) is user-defined and is determined a priori.

1. Choose \(K\) initial cluster centers \(m_{k}\) according to one of the following:

    - **Randomization**: Choose \(K\) centers from the set of \(N\) observations at random so that each observation has an equal chance of being chosen.

    - **Plus Plus**

        a. Choose one center \(m_{1}\) at random.

        b. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\).
        \(d(x_{i}, m_{1}) = ||(x_{i}-m_{1})||^2\)

        c. Let \(P(i)\) be the probability of choosing \(x_{i}\) as \(m_{2}\). Weight \(P(i)\) by \(d(x_{i}, m_{1})\) so that those \(x_{i}\) furthest from \(m_{1}\) have a higher probability of being selected than those \(x_{i}\) close to \(m_{1}\).

        d. Choose the next center \(m_{2}\) by drawing at random according to the weighted probability distribution.

        e. Repeat until \(K\) centers have been chosen.

    - **Furthest**

        a. Choose one center \(m_{1}\) at random.

        b. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\).
        \(d(x_{i}, m_{1}) = ||(x_{i}-m_{1})||^2\)

        c. Choose \(m_{2}\) to be the \(x_{i}\) that maximizes \(d(x_{i}, m_{1})\).

        d. Repeat until \(K\) centers have been chosen.

2. Once \(K\) initial centers have been chosen, calculate the difference between each observation \(x_{i}\) and each of the centers \(m_{1},...,m_{K}\), where difference is the squared Euclidean distance taken over \(p\) parameters.

    \(d(x_{i}, m_{k})=\sum_{j=1}^{p}(x_{ij}-m_{kj})^2=\lVert x_{i}-m_{k}\rVert^2\)

3. Assign \(x_{i}\) to the cluster \(k\) defined by the \(m_{k}\) that minimizes \(d(x_{i}, m_{k})\).

4. When all observations \(x_{i}\) are assigned to a cluster, calculate the mean of the points in each cluster.

    \(\bar{x}(k)=\lbrace\bar{x}_{k1},…,\bar{x}_{kp}\rbrace\)

5. Set the \(\bar{x}(k)\) as the new cluster centers \(m_{k}\). Repeat steps 2 through 5 until the specified maximum number of iterations is reached or the cluster assignments of the \(x_{i}\) are stable.

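The iteration in steps 2 through 5 is Lloyd's algorithm. A minimal numpy sketch using the Randomization initialization (illustrative only; H2O's distributed implementation differs, and a production version must also handle clusters that become empty):

```python
import numpy as np

def kmeans(X, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (Randomization): choose k distinct observations as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iterations):
        # Step 2: squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Step 3: assign each point to its nearest center.
        labels = d.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centers (hence the assignments) are stable.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs of points, any random initialization converges to one center per blob, which is why the result is only a local minimum in general.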
### References

[Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001.](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf)

Xiong, Hui, Junjie Wu, and Jian Chen. "K-means Clustering Versus Validation Measures: A Data-distribution Perspective." IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39.2 (2009): 318-331.

---

<a name="GLM"></a>
## GLM

### Introduction

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential family distributions. In addition to the Gaussian (i.e., normal) distribution, these include the Poisson, binomial, and gamma distributions. Each serves a different purpose and, depending on the distribution and link function choice, can be used either for prediction or classification.

The GLM suite includes:

- Gaussian regression
- Poisson regression
- Binomial regression (classification)
- Multinomial classification
- Gamma regression


### Defining a GLM Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **nfolds**: Specify the number of folds for cross-validation.

- **response_column**: (Required) Select the column to use as the dependent variable.

    - For a regression model, this column must be numeric (**Real** or **Int**).
    - For a classification model, this column must be categorical (**Enum** or **String**). If the family is **Binomial**, the dataset cannot contain more than two levels.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **family**: Select the model type.
> - If the family is **gaussian**, the data must be numeric (**Real** or **Int**).
> - If the family is **binomial**, the data must be categorical with 2 levels/classes or binary (**Enum** or **Int**).
> - If the family is **multinomial**, the data can be categorical with more than two levels/classes (**Enum**).
> - If the family is **poisson**, the data must be numeric and non-negative (**Int**).
> - If the family is **gamma**, the data must be numeric, continuous, and positive (**Real** or **Int**).
> - If the family is **tweedie**, the data must be numeric and continuous (**Real**) and non-negative.

- **tweedie_variance_power**: (Only applicable if *Tweedie* is selected for **Family**) Specify the Tweedie variance power.

- **tweedie_link_power**: (Only applicable if *Tweedie* is selected for **Family**) Specify the Tweedie link power.

- **solver**: Select the solver to use (AUTO, IRLSM, L\_BFGS, COORDINATE\_DESCENT\_NAIVE, or COORDINATE\_DESCENT). IRLSM is fast on problems with a small number of predictors and for lambda search with the L1 penalty, while [L_BFGS](http://cran.r-project.org/web/packages/lbfgs/vignettes/Vignette.pdf) scales better for datasets with many columns. COORDINATE\_DESCENT is IRLSM with the covariance updates version of cyclical coordinate descent in the innermost loop. COORDINATE\_DESCENT\_NAIVE is IRLSM with the naive updates version of cyclical coordinate descent in the innermost loop. COORDINATE\_DESCENT\_NAIVE and COORDINATE\_DESCENT are currently experimental.

- **alpha**: Specify the regularization distribution between L1 and L2.

- **lambda**: Specify the regularization strength.

- **lambda_search**: Check this checkbox to enable lambda search, starting with lambda max. The given lambda is then interpreted as lambda min.

- **nlambdas**: (Applicable only if **lambda\_search** is enabled) Specify the number of lambdas to use in the search. The default is 100.

- **standardize**: To standardize the numeric columns to have a mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.

- **remove_collinear_columns**: Automatically remove collinear columns during model building. Collinear columns are dropped from the model and have a 0 coefficient in the returned model. This can only be set if there is no regularization (lambda = 0).

- **compute_p_values**: Request computation of p-values. Only applicable with no penalty (lambda = 0 and no beta constraints). Setting **remove_collinear_columns** is recommended. H2O returns an error if p-values are requested, there are collinear columns, and the **remove_collinear_columns** flag is not set.

- **non-negative**: To force coefficients to have non-negative values, check this checkbox.

- **beta_constraints**: To use beta constraints, select a dataset from the drop-down menu. The selected frame is used to constrain the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

- **fold_assignment**: (Applicable only if a value for **nfolds** is specified and **fold_column** is not selected) Select the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, or [Modulo](https://en.wikipedia.org/wiki/Modulo_operation).

- **fold_column**: Select the column that contains the cross-validation fold index assignment per observation.

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **offset_column**: Select a column to use as the offset; the value cannot be the same as the value for the `weights_column`.
>*Note*: Offsets are per-row "bias values" that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. For more information, refer to the following [link](http://www.idg.pl/mirrors/CRAN/web/packages/gbm/vignettes/gbm.pdf).

- **weights_column**: Select a column to use for the observation weights, which are used for bias correction. The specified `weights_column` must be included in the specified `training_frame`. *Python only*: To use a weights column when passing an H2OFrame to `x` instead of a list of column names, the specified `training_frame` must contain the specified `weights_column`.
>*Note*: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

- **max_iterations**: Specify the number of training iterations.

- **link**: Select a link function (Identity, Family_Default, Logit, Log, Inverse, or Tweedie).

> - If the family is **Gaussian**, **Identity**, **Log**, and **Inverse** are supported.
> - If the family is **Binomial**, **Logit** is supported.
> - If the family is **Poisson**, **Log** and **Identity** are supported.
> - If the family is **Gamma**, **Inverse**, **Log**, and **Identity** are supported.
> - If the family is **Tweedie**, only **Tweedie** is supported.

- **max\_confusion\_matrix\_size**: Specify the maximum size (number of classes) for the confusion matrices printed in the logs.

- **max\_hit\_ratio\_k**: (Applicable to multi-class classification only) Specify the maximum number (top K) of predictions to use for hit ratio computation. To disable, enter `0`.

- **keep\_cross\_validation\_predictions**: To keep the cross-validation predictions, check this checkbox.

- **intercept**: To include a constant term in the model, check this checkbox. This option is selected by default.

- **objective_epsilon**: Specify a threshold for convergence. If the objective value is less than this threshold, the model is converged.

- **beta_epsilon**: Specify the beta epsilon value. If the L1 norm of the current beta change is below this threshold, the model is considered to have converged.

- **gradient_epsilon**: (For L-BFGS only) Specify a threshold for convergence. If the objective value (using the L-infinity norm) is less than this threshold, the model is converged.

- **prior**: Specify the prior probability for p(y==1). Use this parameter for logistic regression if the data has been sampled and the mean of the response does not reflect reality. **Note**: Because this is a simple method that affects only the intercept, you may want to use weights and offsets for a better fit.

- **lambda\_min\_ratio**: Specify the minimum lambda to use for lambda search (specified as a ratio of **lambda\_max**).

- **max\_active\_predictors**: Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors.

- **missing\_values\_handling**: Specify how to handle missing values (Skip or MeanImputation). This defaults to MeanImputation.

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

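The IRLSM solver listed above is based on iteratively reweighted least squares (IRLS). A minimal, unregularized sketch for the binomial family with the logit link, using the **beta_epsilon**-style L1-change stopping rule in spirit (illustrative only, not H2O's implementation):

```python
import numpy as np

def irls_logistic(X, y, max_iterations=25, beta_epsilon=1e-8):
    """Fit unregularized logistic regression by IRLS (Newton's method)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(max_iterations):
        eta = X @ beta                          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))         # inverse logit link
        w = mu * (1.0 - mu)                     # IRLS working weights
        z = eta + (y - mu) / w                  # working response
        # Solve the weighted least-squares problem X'WX beta = X'Wz.
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < beta_epsilon:
            return beta_new
        beta = beta_new
    return beta
```

Adding the elastic-net penalty (alpha, lambda) changes the inner weighted least-squares problem, which is where the coordinate-descent variants of the solver differ.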
### Interpreting a GLM Model

By default, the following output displays:

- A graph of the normalized coefficient magnitudes
- Output (model category, model summary, scoring history, training metrics, validation metrics, best lambda, threshold, residual deviance, null deviance, residual degrees of freedom, null degrees of freedom, AIC, AUC, binomial, rank)
- Coefficients
- Coefficient magnitudes

### Handling of Categorical Variables
GLM auto-expands categorical variables into one-hot encoded binary variables (i.e., if a variable has the levels "cat", "dog", and "mouse", then cat is encoded as 1,0,0, dog as 0,1,0, and mouse as 0,0,1).
It is generally more efficient to let GLM perform this auto-expansion instead of expanding the data manually; it also adds the benefit of correctly handling different categorical mappings between datasets, as well as handling unseen categorical levels.
Unlike binary numeric columns, auto-expanded variables are not standardized.

It is common to skip one of the levels during the one-hot encoding to prevent linear dependency between the variable and the intercept.
H2O follows the convention of skipping the first level.
This behavior can be controlled by setting the `use_all_factor_levels` flag (no level is skipped if the flag is true).
The default depends on the regularization parameter: it is set to false if there is no regularization and to true otherwise.
The skipped reference level is always the first level; you can change which level is the reference level by calling the `h2o.relevel` function prior to building the model.


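As an illustration of the expansion described above, here is a minimal, self-contained Python sketch of one-hot encoding with the first level skipped. This mimics the convention only; it is not H2O's internal implementation, and `one_hot_expand` is a hypothetical helper name.

```python
def one_hot_expand(values, use_all_factor_levels=False):
    """One-hot encode a categorical column: levels are taken in
    lexicographic order, and unless use_all_factor_levels is true,
    the first level is dropped as the reference level (its rows
    encode as all zeros)."""
    levels = sorted(set(values))
    kept = levels if use_all_factor_levels else levels[1:]
    return [[1 if v == lvl else 0 for lvl in kept] for v in values]

# "cat" is the first (reference) level, so it maps to all zeros.
print(one_hot_expand(["cat", "dog", "mouse"]))
# [[0, 0], [1, 0], [0, 1]]

# With all levels kept, each level maps to a distinct unit vector.
print(one_hot_expand(["cat", "dog", "mouse"], use_all_factor_levels=True))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```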
### Lambda Search and Full Regularization Path
If the `lambda_search` option is set, GLM computes models for the full regularization path, similar to glmnet (see the glmnet paper).
The regularization path starts at lambda max (the highest lambda value that makes sense, i.e., the lowest value driving all coefficients to zero) and goes down to lambda min on the log scale, decreasing the regularization strength at each step.
The returned model has the coefficients corresponding to the "optimal" lambda value as determined during training.

It can sometimes be useful to see the coefficients for all lambda values, or to override the default lambda selection.
The full regularization path can be extracted from both the R and Python clients (currently not from Flow). It returns the coefficients (and standardized coefficients)
for all computed lambda values, as well as the explained deviance on both the training and validation data.
Subsequently, the `makeGLMModel` call can be used to create an H2O GLM model with the selected coefficients.

To extract the regularization path from R or Python:
- R: call `h2o.getGLMFullRegularizationPath`, which takes the model as an argument
- Python: `H2OGeneralizedLinearEstimator.getGLMRegularizationPath` (static method), which takes the model as an argument

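The log-scale descent from lambda max to lambda min described above can be sketched in a few lines of numpy. This mirrors the glmnet convention of values equally spaced in log space; the function name and the `lambda_min_ratio` default are illustrative assumptions, not H2O's exact internals.

```python
import numpy as np

def lambda_grid(lambda_max, lambda_min_ratio=1e-4, nlambdas=100):
    """Build a regularization path: nlambdas values equally spaced in
    log space, from lambda_max down to lambda_max * lambda_min_ratio."""
    lambda_min = lambda_max * lambda_min_ratio
    return np.logspace(np.log10(lambda_max), np.log10(lambda_min), nlambdas)

grid = lambda_grid(lambda_max=10.0, lambda_min_ratio=1e-4, nlambdas=5)
print(grid)  # [1.e+01 1.e+00 1.e-01 1.e-02 1.e-03], strictly decreasing
```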
### Modifying or Creating a Custom GLM Model
In R and Python, the `makeGLMModel` call can be used to create an H2O model from given coefficients.
It needs a source GLM model trained on the same dataset to extract the dataset information.
To make a custom GLM model from R or Python:
- R: call `h2o.makeGLMModel`, which takes a model, a vector of coefficients, and an (optional) decision threshold as parameters.
- Python: `H2OGeneralizedLinearEstimator.makeGLMModel` (static method), which takes a model, a dictionary containing coefficients, and an (optional) decision threshold as parameters.


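To see what scoring with user-supplied coefficients means mechanically, here is a small numpy sketch for the binomial case: linear predictor, inverse logit link, then a decision threshold. This illustrates the math only (`predict_binomial` is a hypothetical name); `makeGLMModel` itself packages the coefficients into a full H2O model object.

```python
import numpy as np

def predict_binomial(X, coefs, intercept, threshold=0.5):
    """Score rows of X with hand-specified GLM coefficients:
    linear predictor -> inverse logit link -> thresholded label."""
    eta = X @ np.asarray(coefs) + intercept          # linear predictor
    p1 = 1.0 / (1.0 + np.exp(-eta))                  # P(class 1)
    labels = (p1 >= threshold).astype(int)
    return p1, labels

X = np.array([[0.0, 1.0],
              [2.0, -1.0]])
p1, labels = predict_binomial(X, coefs=[1.0, -2.0], intercept=0.5)
print(p1, labels)  # labels -> [0 1]
```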
### FAQ

- **How does the algorithm handle missing values during training?**

  Depending on the selected missing value handling policy, missing values are either mean-imputed or the whole row is skipped.
  The default behavior is mean imputation. Note that categorical variables are imputed by adding an extra "missing" level.
  Optionally, GLM can skip all rows with any missing values.

- **How does the algorithm handle missing values during testing?**

  Same as during training. If the missing value handling is set to Skip and we are generating predictions, skipped rows will have an NA (missing) prediction.

- **What happens if the response has missing values?**

  The rows with a missing response are ignored during model training and validation.

- **What happens during prediction if the new sample has categorical levels not seen in training?**

  The value will be filled with either the special missing level (if the model was trained with missing values and `missing_values_handling` was set to MeanImputation) or 0.

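The two numeric-column policies described above can be illustrated with a toy, dependency-free Python sketch (`impute_or_skip` is a hypothetical helper, not H2O code; H2O additionally adds a "missing" level for categoricals):

```python
def impute_or_skip(values, policy="MeanImputation"):
    """values: list of numbers, with None marking a missing value.
    MeanImputation replaces None with the mean of the observed values;
    Skip drops entries that are missing."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    if policy == "Skip":
        return observed
    return [mean if v is None else v for v in values]

print(impute_or_skip([1.0, None, 3.0]))          # [1.0, 2.0, 3.0]
print(impute_or_skip([1.0, None, 3.0], "Skip"))  # [1.0, 3.0]
```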
- **Does it matter if the data is sorted?**

  No.

- **Should data be shuffled before training?**

  No.

- **How does the algorithm handle highly imbalanced data in a response column?**

  GLM does not require special handling for imbalanced data.

- **What if there are a large number of columns?**

  IRLS will get quadratically slower with the number of columns. Try L-BFGS for datasets with more than 5-10 thousand columns.

- **What if there are a large number of categorical factor levels?**

  GLM internally one-hot encodes the categorical factor levels; the same limitations as with a high column count will apply.

- **When building the model, does GLM use all features or a selection of the best features?**

  Typically, GLM picks the best predictors, especially if lasso is used (`alpha = 1`). By default, the GLM model includes an L1 penalty and will pick only the most predictive predictors.

- **When running GLM, is it better to create a cluster that uses many smaller nodes or fewer larger nodes?**

  A rough heuristic would be:

  nodes ~= M*N^2 / (p*1e8)

  where M is the number of observations, N is the number of columns (categorical columns count as a single column in this case), and p is the number of CPU cores per node.

  For example, a dataset with 250 columns and 1M rows would optimally use about 20 nodes with 32 cores each (following the formula 250^2 * 1000000/(32*1e8) = 19.5 ~= 20).

- **How is variable importance calculated for GLM?**

  For GLM, the variable importance represents the coefficient magnitudes.


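The sizing heuristic above is easy to wrap in a helper (hypothetical function name; the constants come straight from the formula in the text):

```python
import math

def suggested_nodes(n_rows, n_cols, cores_per_node):
    """Rough GLM cluster-sizing heuristic: nodes ~= M*N^2 / (p*1e8)."""
    return max(1, math.ceil(n_rows * n_cols ** 2 / (cores_per_node * 1e8)))

# The worked example from the text: 1M rows, 250 columns, 32 cores/node.
print(suggested_nodes(1_000_000, 250, 32))  # -> 20
```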
### GLM Algorithm

Following the definitive text by P. McCullagh and J.A. Nelder (1989) on the generalization of linear models to non-linear distributions of the response variable Y, H2O fits GLM models based on maximum likelihood estimation via iteratively reweighted least squares.

Let \(y_{1},…,y_{n}\) be n observations of the independent, random response variable \(Y_{i}\).

Assume that the observations are distributed according to a function from the exponential family and have a probability density function of the form:

\(f(y_{i})=exp[\frac{y_{i}\theta_{i} - b(\theta_{i})}{a_{i}(\phi)} + c(y_{i}; \phi)]\)
where \(\theta\) and \(\phi\) are location and scale parameters,
and \(\: a_{i}(\phi), \:b(\theta_{i}),\: c(y_{i}; \phi)\) are known functions.

\(a_{i}\) is of the form \(a_{i}=\frac{\phi}{p_{i}}\), where \(p_{i}\) is a known prior weight.

When \(Y\) has a pdf from the exponential family:

\(E(Y_{i})=\mu_{i}=b^{\prime}(\theta_{i})\)
\(var(Y_{i})=\sigma_{i}^2=b^{\prime\prime}(\theta_{i})a_{i}(\phi)\)

Let \(g(\mu_{i})=\eta_{i}\) be a monotonic, differentiable transformation of the expected value of \(y_{i}\). The function \(\eta_{i}\) is the link function and follows a linear model.

\(g(\mu_{i})=\eta_{i}=\mathbf{x_{i}^{\prime}}\beta\)

When inverted:
\(\mu=g^{-1}(\mathbf{x_{i}^{\prime}}\beta)\)

**Maximum Likelihood Estimation**

For an initial rough estimate of the parameters \(\hat{\beta}\), use the estimate to generate fitted values:
\(\mu_{i}=g^{-1}(\hat{\eta_{i}})\)

Let \(z\) be a working dependent variable such that
\(z_{i}=\hat{\eta_{i}}+(y_{i}-\hat{\mu_{i}})\frac{d\eta_{i}}{d\mu_{i}}\),

where \(\frac{d\eta_{i}}{d\mu_{i}}\) is the derivative of the link function evaluated at the trial estimate.

Calculate the iterative weights:
\(w_{i}=\frac{p_{i}}{b^{\prime\prime}(\theta_{i})\left(\frac{d\eta_{i}}{d\mu_{i}}\right)^{2}}\)

where \(b^{\prime\prime}\) is the second derivative of \(b(\theta_{i})\) evaluated at the trial estimate.

Assume \(a_{i}(\phi)\) is of the form \(\frac{\phi}{p_{i}}\). The weight \(w_{i}\) is inversely proportional to the variance of the working dependent variable \(z_{i}\) for the current parameter estimates and proportionality factor \(\phi\).

Regress \(z_{i}\) on the predictors \(x_{i}\) using the weights \(w_{i}\) to obtain new estimates of \(\beta\):
\(\hat{\beta}=(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}\mathbf{z}\)

where \(\mathbf{X}\) is the model matrix, \(\mathbf{W}\) is a diagonal matrix of \(w_{i}\), and \(\mathbf{z}\) is a vector of the working response variable \(z_{i}\).

This process is repeated until the estimates \(\hat{\beta}\) change by less than the specified amount.

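The IRLS loop above can be sketched for the binomial (logit link) case in a few lines of numpy. This is a didactic sketch of the update equations in the text, not H2O's distributed implementation: no regularization, prior weights \(p_{i}=1\), and for the logit link \(w_{i}=\mu_{i}(1-\mu_{i})\) and \(\frac{d\eta_{i}}{d\mu_{i}}=\frac{1}{\mu_{i}(1-\mu_{i})}\).

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.
    Logit link: eta = X beta, mu = 1/(1+exp(-eta)),
    weights w = mu*(1-mu), working response z = eta + (y-mu)/w."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)                 # iterative weights
        z = eta + (y - mu) / w              # working dependent variable
        W = np.diag(w)
        # weighted least-squares step: (X'WX)^-1 X'Wz
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Tiny usage example: one feature plus intercept, non-separable labels.
X = np.column_stack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
beta = irls_logistic(X, y)
print(beta)  # intercept ~ 0 by symmetry of the data; positive slope
```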
**Cost of computation**

H2O can process large data sets because it relies on parallel processing. Large data sets are divided into smaller data sets and processed simultaneously, and the results are communicated between computers as needed throughout the process.

In GLM, data are split by rows but not by columns, because the predicted Y values depend on information in each of the predictor variable vectors. If O is a complexity function, N is the number of observations (or rows), and p is the number of predictors (or columns), then

&nbsp;&nbsp;&nbsp;&nbsp;\(Runtime\propto p^3+\frac{(N*p^2)}{CPUs}\)

Distribution reduces the time it takes the algorithm to run because it decreases the number of rows (N) processed by each node.

Relative to p, the larger (N/CPUs) becomes, the more trivial p becomes to the overall computational cost. However, when p is greater than (N/CPUs), O is dominated by p:

&nbsp;&nbsp;&nbsp;&nbsp;\(Complexity = O(p^3 + N*p^2)\)

For more information about how GLM works, refer to the [Generalized Linear Modeling booklet](http://h2o.ai/resources).

### References

Breslow, N. E. “Generalized Linear Models: Checking Assumptions and Strengthening Conclusions.” Statistica Applicata 8 (1996): 23-41.

[Frome, E. L. “The Analysis of Rates Using Poisson Regression Models.” Biometrics (1983): 665-674.](http://www.csm.ornl.gov/~frome/BE/FP/FromeBiometrics83.pdf)

[Goldberger, Arthur S. “Best Linear Unbiased Prediction in the Generalized Linear Regression Model.” Journal of the American Statistical Association 57.298 (1962): 369-375.](http://people.umass.edu/~bioep740/yr2009/topics/goldberger-jasa1962-369.pdf)

[Guisan, Antoine, Thomas C. Edwards Jr, and Trevor Hastie. “Generalized Linear and Generalized Additive Models in Studies of Species Distributions: Setting the Scene.” Ecological Modelling 157.2 (2002): 89-100.](http://www.stanford.edu/~hastie/Papers/GuisanEtAl_EcolModel-2003.pdf)

[Nelder, John A., and Robert W. M. Wedderburn. “Generalized Linear Models.” Journal of the Royal Statistical Society. Series A (General) (1972): 370-384.](http://biecek.pl/MIMUW/uploads/Nelder_GLM.pdf)

[Niu, Feng, et al. “Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent.” Advances in Neural Information Processing Systems 24 (2011): 693-701. (Implemented algorithm on p. 5.)](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)

[Pearce, Jennie, and Simon Ferrier. “Evaluating the Predictive Performance of Habitat Models Developed Using Logistic Regression.” Ecological Modelling 133.3 (2000): 225-245.](http://www.whoi.edu/cms/files/Ecological_Modelling_2000_Pearce_53557.pdf)

[Press, S. James, and Sandra Wilson. “Choosing Between Logistic Regression and Discriminant Analysis.” Journal of the American Statistical Association 73.364 (1978): 699-705.](http://www.statpt.com/logistic/press_1978.pdf)

Snee, Ronald D. “Validation of Regression Models: Methods and Examples.” Technometrics 19.4 (1977): 415-428.

---

<a name="DRF"></a>
## DRF

### Introduction

Distributed Random Forest (DRF) is a powerful classification and regression tool. When given a set of data, DRF generates a forest of classification (or regression) trees, rather than a single classification (or regression) tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. Both classification and regression take the average prediction over all of their trees to make a final prediction, whether predicting for a class or numeric value. (Note: for a categorical response column, DRF maps factors (e.g. 'dog', 'cat', 'mouse') in lexicographic order to a name lookup array with integer indices (e.g. 'cat' -> 0, 'dog' -> 1, 'mouse' -> 2).)

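The lexicographic factor mapping mentioned in the note can be reproduced in one line of Python (illustrative only; `factor_map` is a hypothetical helper, not H2O code):

```python
def factor_map(levels):
    """Map factor names to integer indices in lexicographic order."""
    return {name: idx for idx, name in enumerate(sorted(levels))}

print(factor_map(["dog", "cat", "mouse"]))
# {'cat': 0, 'dog': 1, 'mouse': 2}
```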
The current version of DRF is fundamentally the same as in previous versions of H2O (same algorithmic steps, same histogramming techniques), with the exception of the following changes:

- Improved ability to train on categorical variables (using the `nbins_cats` parameter)
- Minor changes in histogramming logic for some corner cases
- By default, DRF builds half as many trees for binomial problems, similar to GBM: it uses a single tree to estimate class 0 (probability "p0"), and then computes the probability of class 1 as ``1.0 - p0``. For multiclass problems, a tree is used to estimate the probability of each class separately.

There was some code cleanup and refactoring to support the following features:

- Per-row observation weights
- Per-row offsets
- N-fold cross-validation

DRF no longer has a special-cased histogram for classification or regression (class DBinomHistogram has been superseded by DRealHistogram), since it was not applicable to cases with observation weights or for cross-validation.


### Defining a DRF Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
  **NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **nfolds**: Specify the number of folds for cross-validation.

- **response_column**: (Required) Select the column to use as the dependent variable. The data can be numeric or categorical.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **ntrees**: Specify the number of trees.

- **max\_depth**: Specify the maximum tree depth.

- **min\_rows**: Specify the minimum number of observations for a leaf (`nodesize` in R).

- **nbins**: (Numerical/real/int only) Specify the number of bins for the histogram to build, then split at the best point.

- **nbins_cats**: (Categorical/enums only) Specify the maximum number of bins for the histogram to build, then split at the best point. Higher values can lead to more overfitting. The levels are ordered alphabetically; if there are more levels than bins, adjacent levels share bins. This value has a more significant impact on model fitness than **nbins**. Larger values may increase runtime, especially for deep trees and large clusters, so tuning may be required to find the optimal value for your configuration.

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

- **mtries**: Specify the number of columns to randomly select at each level. If the default value of `-1` is used, the number of variables is the square root of the number of columns for classification and p/3 for regression (where p is the number of predictors). Valid values are `-1` and integers >= 1.

- **sample_rate**: Specify the row sampling rate (x-axis). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to "Stochastic Gradient Boosting" ([Friedman, 1999](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf)). If this option is specified along with **sample\_rate_per\_class**, then only the first option that DRF encounters will be used.

- **col\_sample_rate**: Specify the column sampling rate (y-axis). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to "Stochastic Gradient Boosting" ([Friedman, 1999](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf)).

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **score\_tree\_interval**: Score the model after every so many trees. Disabled if set to 0.

- **fold_assignment**: (Applicable only if a value for **nfolds** is specified and **fold_column** is not selected) Select the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, or [Modulo](https://en.wikipedia.org/wiki/Modulo_operation).

- **fold_column**: Select the column that contains the cross-validation fold index assignment per observation.

- **offset_column**: Select a column to use as the offset.
  >*Note*: Offsets are per-row "bias values" that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. For more information, refer to the following [link](http://www.idg.pl/mirrors/CRAN/web/packages/gbm/vignettes/gbm.pdf).

- **weights_column**: Select a column to use for the observation weights, which are used for bias correction. The specified `weights_column` must be included in the specified `training_frame`. *Python only*: To use a weights column when passing an H2OFrame to `x` instead of a list of column names, the specified `training_frame` must contain the specified `weights_column`.
  >*Note*: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

- **balance_classes**: Oversample the minority classes to balance the class distribution. This option is not selected by default and can increase the data frame size. This option is only applicable for classification.

- **max\_confusion\_matrix\_size**: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.

- **max\_hit\_ratio\_k**: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.

- **r2_stopping**: Specify a threshold for the coefficient of determination (\(r^2\)) metric value. When this threshold is met or exceeded, H2O stops making trees.

- **stopping\_rounds**: Stops training when the option selected for **stopping\_metric** doesn't improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify `0`. The metric is computed on the validation data (if provided); otherwise, training data is used. When used with **overwrite\_with\_best\_model**, the final model is the best model generated for the given **stopping\_metric** option.
  >**Note**: If cross-validation is enabled:
  1. All cross-validation models stop training when the validation metric doesn't improve.
  2. The main model runs for the mean number of epochs.
  3. N+1 models do *not* use **overwrite\_with\_best\_model**.
  4. N+1 models may be off by the number specified for **stopping\_rounds** from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).

- **stopping\_metric**: Select the metric to use for early stopping. The available options are:

  - **AUTO**: Logloss for classification; deviance for regression
  - **deviance**
  - **logloss**
  - **MSE**
  - **AUC**
  - **r2**
  - **misclassification**
  - **mean\_per\_class\_error**

- **stopping\_tolerance**: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value.

- **max\_runtime\_secs**: Maximum allowed runtime in seconds for model training. Use 0 to disable.

- **build\_tree\_one\_node**: To run on a single node, check this checkbox. This is suitable for small datasets as there is no network overhead, but fewer CPUs are used.

- **sample\_rate\_per\_class**: When building models from imbalanced datasets, this option specifies that each tree in the ensemble should sample from the full training dataset using a per-class-specific sampling rate rather than a global sample factor (as with `sample_rate`). The range for this option is 0.0 to 1.0. If this option is specified along with **sample_rate**, then only the first option that DRF encounters will be used.

- **binomial\_double\_trees**: (Binary classification only) Build twice as many trees (one per class). Enabling this option can lead to higher accuracy, while disabling can result in faster model building. This option is disabled by default.

- **checkpoint**: Enter a model key associated with a previously-trained model. Use this option to build a new model as a continuation of a previously-generated model.

- **col\_sample_rate\_change\_per\_level**: This option specifies to change the column sampling rate as a function of the depth in the tree. For example:
  >level 1: **col\_sample_rate**

  >level 2: **col\_sample_rate** * **factor**

  >level 3: **col\_sample_rate** * **factor^2**

  >level 4: **col\_sample_rate** * **factor^3**

  >etc.

- **col\_sample\_rate\_per\_tree**: Specifies the column sample rate per tree. This can be a value from 0.0 to 1.0.

- **min\_split_improvement**: The value of this option specifies the minimum relative improvement in squared error reduction in order for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values would be in the 1e-10...1e-3 range.

May 19, 2016 @arnocandel PUBDEV-2915: Add 'RoundRobin' histogram_type, to R/Py/Flow/Docs/JUnits.
595 - **histogram_type**: By default (AUTO) DRF bins from min...max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well. RoundRobin can be specified to cycle through all histogram types (one per tree). Use this option to specify the type of histogram to use for finding optimal split points:
May 19, 2016 Changes to DRF/GBM
596
May 19, 2016 Change "Auto" to "AUTO"
597 - AUTO
May 19, 2016 Changes to DRF/GBM
598 - UniformAdaptive
599 - Random
600 - QuantilesGlobal
May 19, 2016 @arnocandel PUBDEV-2915: Add 'RoundRobin' histogram_type, to R/Py/Flow/Docs/JUnits.
601 - RoundRobin
Apr 27, 2016 GBM/DRF documentation updates
602
>**Note**: H2O supports extremely randomized trees via ``histogram_type="Random"``. In extremely randomized trees (Extra-Trees), randomness goes one step further in the way splits are computed. As in Random Forests, a random subset of candidate features is used, but instead of looking for the best split, thresholds (for the split) are drawn at random for each candidate feature, and the best of these randomly-generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.

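The random-threshold split rule described in the note above can be sketched in plain Python. This is a simplified, single-feature illustration under a squared-error criterion, not H2O's implementation; the function name and the `n_thresholds` parameter are assumptions for the sketch:

```python
import random

def sse(ys):
    """Sum of squared errors around the mean (the split criterion here)."""
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys)

def best_random_split(values, targets, n_thresholds=5):
    """Extra-Trees-style split on one feature: draw thresholds at random
    between min and max instead of searching every split point, then keep
    the random threshold with the best squared-error reduction."""
    lo, hi = min(values), max(values)
    best = None
    for _ in range(n_thresholds):
        t = random.uniform(lo, hi)
        left = [y for x, y in zip(values, targets) if x <= t]
        right = [y for x, y in zip(values, targets) if x > t]
        if not left or not right:
            continue  # degenerate split: all rows on one side
        gain = sse(targets) - sse(left) - sse(right)
        if best is None or gain > best[0]:
            best = (gain, t)
    return best  # (squared-error reduction, chosen threshold)

random.seed(42)
print(best_random_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

Because the threshold is drawn at random rather than optimized, each tree sees a different, cruder split, which is where the extra variance reduction comes from.
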
- **keep\_cross\_validation\_predictions**: To keep the cross-validation predictions, check this checkbox.

- **class\_sampling\_factors**: Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance.

- **max\_after\_balance\_size**: Specify the maximum relative size of the training data after balancing class counts (**balance\_classes** must be enabled). The value can be less than 1.0.

- **nbins\_top\_level**: (For numerical/real/int columns only) Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level.


### Interpreting a DRF Model

By default, the following output displays:

- Model parameters (hidden)
- A graph of the scoring history (number of trees vs. training MSE)
- A graph of the ROC curve (TPR vs. FPR)
- A graph of the variable importances
- Output (model category, validation metrics, initf)
- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
- Scoring history in tabular format
- Training metrics (model name, checksum name, frame name, frame checksum name, description, model category, duration in ms, scoring time, predictions, MSE, R2, logloss, AUC, GINI)
- Training metrics for thresholds (thresholds, F1, F2, F0point5, Accuracy, Precision, Recall, Specificity, Absolute MCC, min. per-class accuracy, TNS, FNS, FPS, TPS, IDX)
- Maximum metrics (metric, threshold, value, IDX)
- Variable importances in tabular format


### Leaf Node Assignment

Trees cluster observations into leaf nodes, and this information can be useful for feature engineering or model interpretability. Use **h2o.predict\_leaf\_node\_assignment\(model, frame\)** to get an H2OFrame with the leaf node assignments, or click the checkbox when making predictions from Flow. Those leaf nodes represent decision rules that can be fed to other models (i.e., GLM with lambda search and strong rules) to obtain a limited set of the most important rules.

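As a sketch of how leaf assignments can be turned into features for a downstream model: per-tree leaf paths come back as strings such as `"LL"` or `"LR"`, which can be one-hot encoded before feeding them to a GLM. The helper below is a hypothetical illustration, not part of H2O:

```python
def leaf_paths_to_features(assignments):
    """One-hot encode leaf-path strings from a single tree so they can
    be fed to a linear model (e.g., GLM with lambda search)."""
    vocab = sorted(set(assignments))
    index = {leaf: i for i, leaf in enumerate(vocab)}
    return [[1 if index[a] == i else 0 for i in range(len(vocab))]
            for a in assignments]

# Three observations falling into two distinct leaves of one tree:
print(leaf_paths_to_features(["LL", "LR", "LL"]))  # [[1, 0], [0, 1], [1, 0]]
```

Each resulting indicator column corresponds to one decision rule (one leaf), which is what makes the encoding useful for rule selection with strong regularization.
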
### FAQ

- **How does the algorithm handle missing values during training?**

  Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.

- **How does the algorithm handle missing values during testing?**

  During scoring, missing values follow the optimal path that was determined for them during training (minimized loss function).

- **What happens if the response has missing values?**

  No errors will occur, but nothing will be learned from rows with a missing response.

- **Does it matter if the data is sorted?**

  No.

- **Should data be shuffled before training?**

  No.

- **How does the algorithm handle highly imbalanced data in a response column?**

  Specify `balance_classes`, `class_sampling_factors` and `max_after_balance_size` to control over/under-sampling.

- **What if there are a large number of columns?**

  DRFs are best for datasets with fewer than a few thousand columns.

- **What if there are a large number of categorical factor levels?**

  Large numbers of categoricals are handled very efficiently - there is never any one-hot encoding.

- **How is variable importance calculated for DRF?**

  Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result.

- **How is column sampling implemented for DRF?**

  For an example model using:

  - 100 columns
  - `col_sample_rate_per_tree` is 0.602
  - `mtries` is -1 or 7 (refers to the number of active predictor columns for the dataset)

  For each tree, the floor is used to determine the number of columns that are randomly picked: in this example, floor(0.602*100) = 60 of the 100 columns. For classification cases where `mtries=-1`, the square root of the total number of columns - here, sqrt(100) = 10 columns - is then randomly chosen for each split decision (out of the 60).

  For regression, the floor of one third of the total columns - in this example, floor(100/3) = 33 columns - is used for each split by default. If `mtries=7`, then 7 columns are picked for each split decision (out of the 60).

  `mtries` is configured independently of `col_sample_rate_per_tree`, but it can be limited by it. For example, if `col_sample_rate_per_tree=0.01`, then there's only one column left for each split, regardless of how large the value for `mtries` is.

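The arithmetic in this example can be sketched as follows. The helper name and signature are hypothetical - this simply mirrors the floor/sqrt rules described above, not H2O's code:

```python
import math

def cols_per_split(n_cols, col_sample_rate_per_tree, mtries, classification=True):
    """Number of columns considered at each split, per the rules above."""
    # Columns available to this tree: floor of the per-tree sample rate.
    per_tree = math.floor(col_sample_rate_per_tree * n_cols)
    if mtries == -1:
        # Defaults: sqrt(p) for classification, floor(p/3) for regression.
        per_split = math.floor(math.sqrt(n_cols)) if classification else n_cols // 3
    else:
        per_split = mtries
    # mtries is limited by whatever per-tree sampling left over.
    return min(per_split, per_tree)

print(cols_per_split(100, 0.602, -1))                        # 10
print(cols_per_split(100, 0.602, 7))                         # 7
print(cols_per_split(100, 0.602, -1, classification=False))  # 33
print(cols_per_split(100, 0.01, 7))                          # 1
```

The last call reproduces the limiting case from the text: with `col_sample_rate_per_tree=0.01`, only one column survives per tree, so `mtries` is capped at 1.
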

### DRF Algorithm

<iframe src="//www.slideshare.net/slideshow/embed_code/key/tASzUyJ19dtJsQ" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/0xdata/rf-brighttalk" title="Building Random Forest at Scale" target="_blank">Building Random Forest at Scale</a> </strong> from <strong><a href="//www.slideshare.net/0xdata" target="_blank">Sri Ambati</a></strong> </div>

### References

<a href="http://link.springer.com/article/10.1007%2Fs10994-006-6226-1" target="_blank">P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.</a>

---

<a name="NB"></a>
## Naïve Bayes

### Introduction

Naïve Bayes (NB) is a classification algorithm that relies on strong assumptions of the independence of covariates in applying Bayes Theorem. NB models are commonly used as an alternative to decision trees for classification problems.

### Defining a Naïve Bayes Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **response_column**: (Required) Select the column to use as the dependent variable. The data must be categorical and must contain at least two unique categorical levels.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **laplace**: Specify the Laplace smoothing parameter. The value must be an integer >= 0.

- **min\_sdev**: Specify the minimum standard deviation to use for observations without enough data. The value must be at least 1e-10.

- **eps\_sdev**: Specify the threshold for standard deviation. The value must be positive. If this threshold is not met, the **min\_sdev** value is used.

- **min\_prob**: Specify the minimum probability to use for observations without enough data.

- **eps\_prob**: Specify the threshold for probability. If this threshold is not met, the **min\_prob** value is used.

- **compute_metrics**: To compute metrics on training data, check this checkbox. The Naïve Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a Naïve Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **max\_confusion\_matrix\_size**: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.

- **max\_hit\_ratio\_k**: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.

- **max\_runtime\_secs**: Maximum allowed runtime in seconds for model training. Use 0 to disable.


### Interpreting a Naïve Bayes Model

The output from Naïve Bayes is a list of tables containing the a-priori and conditional probabilities of each class of the response. The a-priori probability is the estimated probability of a particular class before observing any of the predictors. Each conditional probability table corresponds to a predictor column. The row headers are the classes of the response and the column headers are the classes of the predictor. Thus, in the table below, the conditional probability that a person is male (x) given that they did not survive (y) is 0.91543624.

```
              Sex
Survived          Male      Female
No          0.91543624  0.08456376
Yes         0.51617440  0.48382560
```


When the predictor is numeric, Naïve Bayes assumes it is sampled from a Gaussian distribution given the class of the response. The first column contains the mean and the second column contains the standard deviation of the distribution.

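A minimal sketch of that Gaussian likelihood, including a floor on the standard deviation in the spirit of the `min_sdev`/`eps_sdev` options above (the function and its defaults are assumptions for illustration, not H2O's code):

```python
import math

def gaussian_likelihood(x, mean, sd, min_sdev=1e-10, eps_sdev=1e-10):
    """Density of a numeric predictor given a response class. If the
    class-conditional standard deviation falls below eps_sdev, min_sdev
    is substituted, as described for min_sdev/eps_sdev above."""
    if sd < eps_sdev:
        sd = min_sdev
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

print(round(gaussian_likelihood(0.0, 0.0, 1.0), 5))  # 0.39894
```

The floor matters for classes with too few observations: without it, a near-zero standard deviation would make the density blow up or divide by zero.
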
By default, the following output displays:

- Output (model category, model summary, scoring history, training metrics, validation metrics)
- Y-Levels (levels of the response column)
- P-conditionals

### FAQ

- **How does the algorithm handle missing values during training?**

  All rows with one or more missing values (either in the predictors or the response) will be skipped during model building.

- **How does the algorithm handle missing values during testing?**

  If a predictor is missing, it will be skipped when taking the product of conditional probabilities in calculating the joint probability conditional on the response.

- **What happens if the response domain is different in the training and test datasets?**

  The response column in the test dataset is not used during scoring, so any response categories absent in the training data will not be predicted.

- **What happens during prediction if the new sample has categorical levels not seen in training?**

  The conditional probability of that predictor level will be set according to the Laplace smoothing factor. If Laplace smoothing is disabled (set to zero), the joint probability will be zero. See pgs. 13-14 of Andrew Ng’s "Generative learning algorithms" in the References section for mathematical details.

- **Does it matter if the data is sorted?**

  No.

- **Should data be shuffled before training?**

  This does not affect model building.

- **How does the algorithm handle highly imbalanced data in a response column?**

  Unbalanced data will not affect the model. However, if one response category has very few observations compared to the total, the conditional probability may be very low. A cutoff (`eps_prob`) and minimum value (`min_prob`) are available for the user to set a floor on the calculated probability.

- **What if there are a large number of columns?**

  More memory will be allocated on each node to store the joint frequency counts and sums.

- **What if there are a large number of categorical factor levels?**

  More memory will be allocated on each node to store the joint frequency count of each categorical predictor level with the response’s level.

- **When running Naïve Bayes, is it better to create a cluster that uses many smaller nodes or fewer larger nodes?**

  For Naïve Bayes, we recommend using many smaller nodes because the distributed task doesn't require intensive computation.


### Naïve Bayes Algorithm

The algorithm is presented for the simplified binomial case without loss of generality.

Under the Naive Bayes assumption of independence, given a training set
for a set of discrete valued features X
\({(X^{(i)},\ y^{(i)};\ i=1,...m)}\)

The joint likelihood of the data can be expressed as:

\(\mathcal{L} \: (\phi(y),\: \phi_{i|y=1},\:\phi_{i|y=0})=\Pi_{i=1}^{m} p(X^{(i)},\: y^{(i)})\)

The model can be parameterized by:

\(\phi_{i|y=0}=\ p(x_{i}=1|\ y=0);\: \phi_{i|y=1}=\ p(x_{i}=1|y=1);\: \phi(y)\)

Where \(\phi_{i|y=0}=\ p(x_{i}=1|\ y=0)\) can be thought of as the fraction of the observed instances where feature \(x_{i}\) is observed and the outcome is \(y=0\), \(\phi_{i|y=1}=p(x_{i}=1|\ y=1)\) is the fraction of the observed instances where feature \(x_{i}\) is observed and the outcome is \(y=1\), and so on.

The objective of the algorithm is to maximize with respect to
\(\phi_{i|y=0}\), \(\phi_{i|y=1}\), and \(\phi(y)\)

Where the maximum likelihood estimates are:

\(\phi_{j|y=1}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap y^{(i)} = 1)}{\Sigma_{i=1}^{m} 1(y^{(i)}=1)}\)

\(\phi_{j|y=0}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap y^{(i)} = 0)}{\Sigma_{i=1}^{m} 1(y^{(i)}=0)}\)

\(\phi(y)= \frac{\Sigma_{i=1}^{m} 1(y^{(i)} = 1)}{m}\)


Once all parameters \(\phi_{j|y}\) are fitted, the model can be used to predict new examples with features \(X_{(i^*)}\).

This is carried out by calculating:

\(p(y=1|x)=\frac{\Pi p(x_i|y=1) p(y=1)}{\Pi p(x_i|y=1)p(y=1) \: +\: \Pi p(x_i|y=0)p(y=0)}\)

\(p(y=0|x)=\frac{\Pi p(x_i|y=0) p(y=0)}{\Pi p(x_i|y=1)p(y=1) \: +\: \Pi p(x_i|y=0)p(y=0)}\)

and predicting the class with the highest probability.

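The estimates and the prediction formula above can be sketched in plain Python for the binomial case with binary features. This is a toy illustration, not H2O's implementation; `laplace=0` yields the plain maximum likelihood estimates, and `laplace=1` yields the smoothed estimates shown in the next section:

```python
def fit_bernoulli_nb(X, y, laplace=1):
    """Estimate phi(y) and phi_{j|y} from binary features,
    with optional Laplace smoothing (laplace=0 disables it)."""
    m, n = len(X), len(X[0])
    phi_y = sum(y) / m
    phi = {}
    for c in (0, 1):
        rows = [x for x, yy in zip(X, y) if yy == c]
        denom = len(rows) + 2 * laplace
        phi[c] = [(sum(r[j] for r in rows) + laplace) / denom for j in range(n)]
    return phi_y, phi

def predict_proba(x, phi_y, phi):
    """p(y=1|x) via the product of conditional probabilities above."""
    def joint(c):
        p = phi_y if c == 1 else 1 - phi_y
        for j, xj in enumerate(x):
            p *= phi[c][j] if xj == 1 else 1 - phi[c][j]
        return p
    p1, p0 = joint(1), joint(0)
    return p1 / (p1 + p0)

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
phi_y, phi = fit_bernoulli_nb(X, y)
print(predict_proba([1, 0], phi_y, phi))  # 0.75
```

Predicting the class with the highest probability here amounts to thresholding `predict_proba` at 0.5.
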

It is possible that prediction data contains feature levels that were not included in the training set. If this occurs, the maximum likelihood estimates for these levels predict a probability of 0 for all classes of y.

Laplace smoothing allows a model to make predictions on feature levels not seen in training by adjusting the maximum likelihood estimates to be:



\(\phi_{j|y=1}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap y^{(i)} = 1) \: + \: 1}{\Sigma_{i=1}^{m} 1(y^{(i)}=1) \: + \: 2}\)

\(\phi_{j|y=0}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap y^{(i)} = 0) \: + \: 1}{\Sigma_{i=1}^{m} 1(y^{(i)}=0) \: + \: 2}\)

Note that in the general case where y takes on k values, 1 is added to the numerator and k is added to the denominator (rather than 2, as shown in the two-level classifier here).

Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, train new models when possible to account for a broader set of possible X values.


### References

[Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. Springer New York, 2001.](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf)

[Ng, Andrew. "Generative Learning algorithms." (2008).](http://cs229.stanford.edu/notes/cs229-notes2.pdf)

---

<a name="PCA"></a>
## PCA

### Introduction

Principal Components Analysis (PCA) is closely related to Principal Components Regression. The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features.

PCA is commonly used for dimensionality reduction or for modeling without regularization. It can also be useful as a preprocessing step before distance-based algorithms such as K-Means, since PCA guarantees that all dimensions of a manifold are orthogonal.

### Defining a PCA Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **transform**: Select the transformation method for the training data: None, Standardize, Normalize, Demean, or Descale. The default is None.

- **pca_method**: Select the algorithm to use for computing the principal components:
  - *GramSVD*: Uses a distributed computation of the Gram matrix, followed by a local SVD using the JAMA package
  - *Power*: Computes the SVD using the power iteration method (experimental)
  - *Randomized*: Uses the randomized subspace iteration method
  - *GLRM*: Fits a generalized low-rank model with L2 loss function and no regularization and solves for the SVD using local matrix algebra (experimental)

- **k**: Specify the rank of matrix approximation. The default is 1.

- **max_iterations**: Specify the number of training iterations. The value must be between 1 and 1e6 and the default is 1000.

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

- **use\_all\_factor\_levels**: Check this checkbox to use all factor levels in the possible set of predictors; if you enable this option, sufficient regularization is required. By default, the first factor level is skipped. For PCA models, this option ignores the first factor level of each categorical column when expanding into indicator columns.

- **compute\_metrics**: Enable metrics computations on the training data.

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **max\_runtime\_secs**: Maximum allowed runtime in seconds for model training. Use 0 to disable.


### Interpreting a PCA Model

PCA output returns a table displaying the number of components specified by the value for `k`.

Scree and cumulative variance plots for the components are returned as well. Users can access them by clicking on the black button labeled "Scree and Variance Plots" at the top left of the results page. A scree plot shows the variance of each component, while the cumulative variance plot shows the total variance accounted for by the set of components.

The output for PCA includes the following:

- Model parameters (hidden)
- Output (model category, model summary, scoring history, training metrics, validation metrics, iterations)
- Archetypes
- Standard deviation
- Rotation
- Importance of components (standard deviation, proportion of variance, cumulative proportion)

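The "Importance of components" table can be reproduced from the per-component standard deviations alone; a small sketch (the helper is hypothetical, not H2O code):

```python
def importance_of_components(sdev):
    """Proportion and cumulative proportion of variance explained,
    computed from per-component standard deviations."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    proportion = [v / total for v in variances]
    cumulative, running = [], 0.0
    for p in proportion:
        running += p
        cumulative.append(running)
    return proportion, cumulative

prop, cum = importance_of_components([2.0, 1.0, 1.0])
print([round(p, 3) for p in prop])  # [0.667, 0.167, 0.167]
print(round(cum[-1], 3))            # 1.0
```

This is the same computation behind the scree and cumulative variance plots described above: the scree plot shows `variances`, and the cumulative plot shows `cumulative`.
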


### FAQ

- **How does the algorithm handle missing values during training?**

  For the GramSVD and Power methods, all rows containing missing values are ignored during training. For the GLRM method, missing values are excluded from the sum over the loss function in the objective. For more information, refer to section 4 Generalized Loss Functions, equation (13), in ["Generalized Low Rank Models"](https://web.stanford.edu/~boyd/papers/pdf/glrm.pdf) by Boyd et al.

- **How does the algorithm handle missing values during testing?**

  During scoring, the test data is right-multiplied by the eigenvector matrix produced by PCA. Missing categorical values are skipped in the row product-sum. Missing numeric values propagate an entire row of NAs in the resulting projection matrix.

- **What happens during prediction if the new sample has categorical levels not seen in training?**

  Categorical levels in the test data not present in the training data are skipped in the row product-sum.

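The product-sum with skipped missing entries can be sketched as follows (a simplified illustration of one row times the eigenvector matrix; in H2O a missing *numeric* value instead yields a whole row of NAs):

```python
def project_row(row, eigvecs):
    """Right-multiply one (expanded) row by the eigenvector matrix,
    skipping missing entries (None) in the product-sum."""
    k = len(eigvecs[0])
    out = [0.0] * k
    for j, x in enumerate(row):
        if x is None:  # missing categorical indicator: skipped
            continue
        for c in range(k):
            out[c] += x * eigvecs[j][c]
    return out

V = [[0.6, 0.8],
     [0.8, -0.6]]                   # columns are eigenvectors
print(project_row([1.0, None], V))  # [0.6, 0.8]
```

Skipping the missing entry is equivalent to treating its contribution to each component score as zero.
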

- **Does it matter if the data is sorted?**

  No, sorting data does not affect the model.

- **Should data be shuffled before training?**

  No, shuffling data does not affect the model.

- **What if there are a large number of columns?**

  Calculating the SVD will be slower, since computations on the Gram matrix are handled locally.

- **What if there are a large number of categorical factor levels?**

  Each factor level (with the exception of the first, depending on whether **use\_all\_factor\_levels** is enabled) is assigned an indicator column. The indicator column is 1 if the observation corresponds to a particular factor; otherwise, it is 0. As a result, many factor levels result in a large Gram matrix and slower computation of the SVD.

Jul 17, 2015 @jessica0xdata Updates for nightly build & other doc fixes
968 - **How are categorical columns handled during model building?**
969
Nov 3, 2015 @jessica0xdata Updated Data Sci doc with cluster size recommendations for PCA & Naiv…
970 If the GramSVD or Power methods are used, the categorical columns are expanded into 0/1 indicator columns for each factor level. The algorithm is then performed on this expanded training frame. For GLRM, the multidimensional loss function for categorical columns is discussed in Section 6.1 of ["Generalized Low Rank Models"](https://web.stanford.edu/~boyd/papers/pdf/glrm.pdf) by Boyd et al.
971
972 - **When running PCA, is it better to create a cluster that uses many smaller nodes or fewer larger nodes?**
973
974 For PCA, this is dependent on the selected `pca_method` parameter:
Jul 17, 2015 @jessica0xdata Updates for nightly build & other doc fixes
975
Nov 3, 2015 @jessica0xdata Updated Data Sci doc with cluster size recommendations for PCA & Naiv…
976 - For **GramSVD**, use fewer larger nodes for better performance. Forming the Gram matrix requires few intensive calculations and the main bottleneck is the JAMA library's SVD function, which is not parallelized and runs on a single machine. We do not recommend selecting GramSVD for datasets with many columns and/or categorical levels in one or more columns.
977 - For **Randomized**, use many smaller nodes for better performance, since H2O calls a few different distributed tasks in a loop, where each task does fairly simple matrix algebra computations.
978 - For **GLRM**, the number of nodes depends on whether the dataset contains many categorical columns with many levels. If this is the case, we recommend using fewer larger nodes, since computing the loss function for categoricals is an intensive task. If the majority of the data is numeric and the categorical columns have only a small number of levels (~10-20), we recommend using many small nodes in the cluster.
979 - For **Power**, we recommend using fewer larger nodes because the intensive calculations are single-threaded. However, this method is only recommended for obtaining principal component values (such as `k << ncol(train))` because the other methods are far more efficient.
Apr 23, 2015 @jessica0xdata Add Data Science doc for review
980
Nov 5, 2015 @jessica0xdata Updates for variable importances, PCA
981 - **I ran PCA on my dataset - how do I input the new parameters into a model?**
982
983 After the PCA model has been built using `h2o.prcomp`, use `h2o.predict` on the original data frame and the PCA model to produce the dimensionality-reduced representation. Use `cbind` to add the predictor column from the original data frame to the data frame produced by the output of `h2o.predict`. At this point, you can build supervised learning models on the new data frame.
984
985
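The cost of wide or high-cardinality data discussed above can be estimated with simple arithmetic. A sketch (the helper function and column counts here are hypothetical, for illustration only):

```python
# Each categorical column expands into one indicator column per factor
# level (all levels if use_all_factor_levels is enabled, otherwise all
# but the first), so the Gram matrix is p-by-p in the expanded count p.
def expanded_columns(num_numeric, cat_levels, use_all_factor_levels=False):
    drop = 0 if use_all_factor_levels else 1
    return num_numeric + sum(max(levels - drop, 0) for levels in cat_levels)

# 10 numeric columns plus categoricals with 1000 and 50 levels:
p = expanded_columns(10, [1000, 50])
gram_entries = p * p   # Gram matrix work/storage grows quadratically in p
```

This is why a single categorical column with thousands of levels can dominate the runtime of GramSVD.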
### PCA Algorithm

Let \(X\) be an \(m\times n\) matrix where

- Each row corresponds to the set of all measurements on a particular attribute, and

- Each column corresponds to a set of measurements from a given observation or trial

The covariance matrix \(C_{x}\) is

\(C_{x}=\frac{1}{n}XX^{T}\)

where \(n\) is the number of observations.

\(C_{x}\) is a square, symmetric \(m\times m\) matrix, the diagonal entries of which are the variances of attributes, and the off-diagonal entries are covariances between attributes.

PCA convergence is based on the method described by Gockenbach: "The rate of convergence of the power method depends on the ratio \(|\lambda_2|/|\lambda_1|\). If this is small...then the power method converges rapidly. If the ratio is close to 1, then convergence is quite slow. The power method will fail if \(|\lambda_2| = |\lambda_1|\)." (567)

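The convergence behavior can be seen in a toy power iteration (a pure-Python sketch, not H2O's implementation; the matrix and iteration count are illustrative):

```python
# Power iteration: repeatedly multiply a vector by the matrix and
# normalize; the vector converges to the dominant eigenvector at a rate
# governed by the ratio |lambda_2| / |lambda_1|.
def power_method(A, iters=200):
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        # Rayleigh quotient estimates the dominant eigenvalue
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Av[i] for i in range(n))
    return lam, v

# Eigenvalues of this matrix are 3 and 1, so |lambda_2|/|lambda_1| = 1/3
# and convergence is rapid.
A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_method(A)
```

With eigenvalues close in magnitude, many more iterations would be needed, matching Gockenbach's observation above.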
The objective of PCA is to maximize variance while minimizing covariance.

To accomplish this, for a new matrix \(C_{y}\) with off-diagonal entries of 0, and each successive dimension of \(Y\) ranked according to variance, PCA finds an orthonormal matrix \(P\) such that \(Y=PX\), constrained by the requirement that \(C_{y}=\frac{1}{n}YY^{T}\) be a diagonal matrix.

The rows of \(P\) are the principal components of \(X\).

\(C_{y}=\frac{1}{n}YY^{T}\)
\(=\frac{1}{n}(PX)(PX)^{T}\)
\(C_{y}=PC_{x}P^{T}\)

Because any symmetric matrix is diagonalized by an orthogonal matrix of its eigenvectors, solve for the matrix \(P\) whose rows are the eigenvectors of
\(\frac{1}{n}XX^{T}=C_{x}\)

Then the principal components of \(X\) are the eigenvectors of \(C_{x}\), and the \(i^{th}\) diagonal value of \(C_{y}\) is the variance of \(X\) along \(p_{i}\).

Eigenvectors of \(C_{x}\) are found by first finding the eigenvalues \(\lambda\) of \(C_{x}\).

For each eigenvalue \(\lambda\), \((C_{x}-\lambda I)x = 0\), where \(x\) is the eigenvector associated with \(\lambda\).

Solve for \(x\) by Gaussian elimination.

#### Recovering SVD from GLRM

GLRM gives \(x\) and \(y\), where \(x \in \mathbb{R}^{n \times k}\) and \(y \in \mathbb{R}^{k \times m}\):

&nbsp;&nbsp;&nbsp;- \(n\) = number of rows of \(A\)

&nbsp;&nbsp;&nbsp;- \(m\) = number of columns of \(A\)

&nbsp;&nbsp;&nbsp;- \(k\) = user-specified rank

&nbsp;&nbsp;&nbsp;- \(A\) = training matrix

It is assumed that the \(x\) and \(y\) columns are independent.

First, perform QR decomposition of \(x\) and \(y^T\):

&nbsp;&nbsp;&nbsp;\(x = QR\)

&nbsp;&nbsp;&nbsp;\(y^T = ZS\), where \(Q^TQ = I = Z^TZ\)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Call the JAMA QR decomposition directly on \(y^T\) to get \(Z \in \mathbb{R}^{m \times k}\) and \(S \in \mathbb{R}^{k \times k}\)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\(R\) from the QR decomposition of \(x\) is the upper triangular factor of the Cholesky decomposition of the Gram matrix \(X^TX\)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\(X^TX = LL^T,\ X = QR\)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\(X^TX = (R^TQ^T)QR = R^TR\), since \(Q^TQ = I\), so \(R = L^T\) (transpose of the lower triangular factor)

**Note**: In code, \(\frac{X^TX}{n} = LL^T\), so

&nbsp;&nbsp;&nbsp;\(X^TX = (L\sqrt{n})(L\sqrt{n})^T = R^TR\)

&nbsp;&nbsp;&nbsp;\(R = L^T\sqrt{n} \in \mathbb{R}^{k \times k}\) (reduced QR decomposition).

For more information, refer to the [Rectangular matrix](https://en.wikipedia.org/wiki/QR_decomposition#Rectangular_matrix) section of "QR Decomposition" on Wikipedia.

\(XY = QR(ZS)^T = Q(RS^T)Z^T\)

**Note**: \(RS^T \in \mathbb{R}^{k \times k}\)

Find the SVD (locally) of \(RS^T\):

\(RS^T = U \Sigma V^T\), with \(U^TU = I = V^TV\) orthogonal

\(XY = Q(RS^T)Z^T = (QU)\Sigma(ZV)^T\), which is the SVD, because

&nbsp;&nbsp;&nbsp;\((QU)^T(QU) = U^TQ^TQU = U^TU = I\)

&nbsp;&nbsp;&nbsp;\((ZV)^T(ZV) = V^TZ^TZV = V^TV = I\)

Right singular vectors: \(ZV \in \mathbb{R}^{m \times k}\)

Singular values: \(\Sigma \in \mathbb{R}^{k \times k}\) diagonal

Left singular vectors: \(QU \in \mathbb{R}^{n \times k}\)
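The recovery above can be checked numerically. A sketch using NumPy stand-ins for the GLRM factors (random `x` and `y` of the stated shapes, not actual H2O output):

```python
# Recover the SVD of the product x @ y from the low-rank factors via
# two QR decompositions and one small k-by-k SVD.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 8, 3
x = rng.standard_normal((n, k))          # GLRM row factor
y = rng.standard_normal((k, m))          # GLRM column factor

Q, R = np.linalg.qr(x)                   # x = QR
Z, S = np.linalg.qr(y.T)                 # y^T = ZS
U, sig, Vt = np.linalg.svd(R @ S.T)      # small k-by-k SVD, done "locally"

left = Q @ U                             # left singular vectors, n x k
right = Z @ Vt.T                         # right singular vectors, m x k

# (QU) Sigma (ZV)^T reproduces the product XY
assert np.allclose(left @ np.diag(sig) @ right.T, x @ y)
```

Only the \(k \times k\) SVD is expensive enough to matter, which is the point of the reduction.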

### References

Gockenbach, Mark S. "Finite-Dimensional Linear Algebra (Discrete Mathematics and Its Applications)." (2010): 566-567.

---

<a name="GBM"></a>
## GBM

### Introduction

Gradient Boosted Regression and Gradient Boosted Classification are forward learning ensemble methods. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O's GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

The current version of GBM is fundamentally the same as in previous versions of H2O (same algorithmic steps, same histogramming techniques), with the exception of the following changes:

- Improved ability to train on categorical variables (using the `nbins_cats` parameter)
- Minor changes in histogramming logic for some corner cases

There was some code cleanup and refactoring to support the following features:

- Per-row observation weights
- Per-row offsets
- N-fold cross-validation
- Support for more distribution functions (such as Gamma, Poisson, and Tweedie)

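The "increasingly refined approximations" idea can be made concrete with a toy boosting loop on a single feature: each new stump is fit to the residuals of the current ensemble. This is a pure-Python sketch under squared-error loss, not H2O's distributed implementation; the data and parameters are made up:

```python
# Minimal gradient boosting for regression: fit depth-1 trees (stumps)
# to residuals, shrink each contribution by the learning rate.
def fit_stump(x, residuals):
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def toy_gbm(x, y, ntrees=20, learn_rate=0.3):
    pred = [sum(y) / len(y)] * len(y)        # start from the mean
    for _ in range(ntrees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + learn_rate * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
pred = toy_gbm(x, y)
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

Each round reduces the training error a little, which is exactly the behavior the scoring-history plot (training MSE vs. number of trees) visualizes.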
### Defining a GBM Model

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **nfolds**: Specify the number of folds for cross-validation.

- **response_column**: (Required) Select the column to use as the dependent variable. The data can be numeric or categorical.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **ntrees**: Specify the number of trees.

- **max\_depth**: Specify the maximum tree depth.

- **min\_rows**: Specify the minimum number of observations for a leaf (`nodesize` in R).

- **nbins**: (Numerical/real/int only) Specify the number of bins for the histogram to build, then split at the best point.

- **max\_abs\_leafnode\_pred**: When building a GBM classification model, this option reduces overfitting by limiting the maximum absolute value of a leaf node prediction. This option defaults to Double.MAX_VALUE.

- **nbins_cats**: (Categorical/enums only) Specify the maximum number of bins for the histogram to build, then split at the best point. Higher values can lead to more overfitting. The levels are ordered alphabetically; if there are more levels than bins, adjacent levels share bins. This value has a more significant impact on model fitness than **nbins**. Larger values may increase runtime, especially for deep trees and large clusters, so tuning may be required to find the optimal value for your configuration.

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

- **learn_rate**: Specify the learning rate. The range is 0.0 to 1.0.

- **learn\_rate\_annealing**: Specifies to reduce the **learn_rate** by this factor after every tree. So for *N* trees, GBM starts with **learn_rate** and ends with **learn_rate** * **learn\_rate\_annealing**^*N*. For example, instead of using **learn_rate=0.01**, you can try **learn_rate=0.05** with **learn\_rate\_annealing=0.99**, which converges much faster with almost the same accuracy. Use caution not to overfit.

- **distribution**: Select the loss function. The options are auto, bernoulli, multinomial, gaussian, poisson, gamma, or tweedie.

> - If the distribution is **multinomial**, the response column must be categorical.
> - If the distribution is **poisson**, the response column must be numeric.
> - If the distribution is **gamma**, the response column must be numeric.
> - If the distribution is **tweedie**, the response column must be numeric.
> - If the distribution is **gaussian**, the response column must be numeric.
> - If the distribution is **laplace**, the data must be numeric and continuous (**Int**).
> - If the distribution is **quantile**, the data must be numeric and continuous (**Int**).

- **sample_rate**: Specify the row sampling rate (x-axis). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to "Stochastic Gradient Boosting" ([Friedman, 1999](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf)). If this option is specified along with **sample\_rate_per\_class**, then only the first option that GBM encounters will be used.

- **sample\_rate_per\_class**: When building models from imbalanced datasets, this option specifies that each tree in the ensemble should sample from the full training dataset using a per-class-specific sampling rate rather than a global sample factor (as with `sample_rate`). The range for this option is 0.0 to 1.0. If this option is specified along with **sample_rate**, then only the first option that GBM encounters will be used.

- **col\_sample_rate**: Specify the column sampling rate (y-axis). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to "Stochastic Gradient Boosting" ([Friedman, 1999](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf)).

- **col\_sample_rate\_change\_per\_level**: This option specifies to change the column sampling rate as a function of the depth in the tree. For example:

>level 1: **col\_sample_rate**
>
>level 2: **col\_sample_rate** * **factor**
>
>level 3: **col\_sample_rate** * **factor^2**
>
>level 4: **col\_sample_rate** * **factor^3**
>
>etc.

- **min\_split_improvement**: Specifies the minimum relative improvement in squared error reduction required for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values are typically in the 1e-10 to 1e-3 range.

- **histogram_type**: By default (AUTO), GBM bins from min to max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well. RoundRobin can be specified to cycle through all histogram types (one per tree). Use this option to specify the type of histogram to use for finding optimal split points:

 - AUTO
 - UniformAdaptive
 - Random
 - QuantilesGlobal
 - RoundRobin

- **score\_each\_iteration**: (Optional) Check this checkbox to score during each iteration of the model training.

- **score\_tree\_interval**: Score the model after every so many trees. Disabled if set to 0.

- **fold_assignment**: (Applicable only if a value for **nfolds** is specified and **fold_column** is not selected) Select the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, or [Modulo](https://en.wikipedia.org/wiki/Modulo_operation).

- **fold_column**: Select the column that contains the cross-validation fold index assignment per observation.

- **offset_column**: (Not applicable if the **distribution** is **multinomial**) Select a column to use as the offset.
>*Note*: Offsets are per-row "bias values" that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. For more information, refer to the following [link](http://www.idg.pl/mirrors/CRAN/web/packages/gbm/vignettes/gbm.pdf). If the **distribution** is **Bernoulli**, the value must be less than one.

- **weights_column**: Select a column to use for the observation weights, which are used for bias correction. The specified `weights_column` must be included in the specified `training_frame`. *Python only*: To use a weights column when passing an H2OFrame to `x` instead of a list of column names, the specified `training_frame` must contain the specified `weights_column`.
>*Note*: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

- **balance_classes**: Oversample the minority classes to balance the class distribution. This option is not selected by default and can increase the data frame size. This option is only applicable for classification. Majority classes can be undersampled to satisfy the **max\_after\_balance\_size** parameter.

- **max\_confusion\_matrix\_size**: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.

- **max\_hit\_ratio\_k**: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.

- **r2_stopping**: Specify a threshold for the coefficient of determination (\(r^2\)) metric value. When this threshold is met or exceeded, H2O stops making trees.

- **stopping\_rounds**: Stops training when the option selected for **stopping\_metric** doesn't improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify `0`. The metric is computed on the validation data (if provided); otherwise, training data is used. When used with **overwrite\_with\_best\_model**, the final model is the best model generated for the given **stopping\_metric** option.
>**Note**: If cross-validation is enabled:
1. All cross-validation models stop training when the validation metric doesn't improve.
2. The main model runs for the mean number of epochs.
3. N+1 models do *not* use **overwrite\_with\_best\_model**.
4. N+1 models may be off by the number specified for **stopping\_rounds** from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).

- **stopping\_metric**: Select the metric to use for early stopping. The available options are:

 - **AUTO**: Logloss for classification; deviance for regression
 - **deviance**
 - **logloss**
 - **MSE**
 - **AUC**
 - **r2**
 - **misclassification**
 - **mean\_per\_class\_error**

- **stopping\_tolerance**: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value.

- **max\_runtime\_secs**: Maximum allowed runtime in seconds for model training. Use 0 to disable.

- **build\_tree\_one\_node**: To run on a single node, check this checkbox. This is suitable for small datasets, as there is no network overhead but fewer CPUs are used.

- **quantile_alpha**: (Only applicable if *Quantile* is selected for **distribution**) Specify the quantile to be used for Quantile Regression.

- **tweedie_power**: (Only applicable if *Tweedie* is selected for **distribution**) Specify the Tweedie power. The range is from 1 to 2. For a normal distribution, enter `0`. For a Poisson distribution, enter `1`. For a gamma distribution, enter `2`. For a compound Poisson-gamma distribution, enter a value greater than 1 but less than 2. For more information, refer to [Tweedie distribution](https://en.wikipedia.org/wiki/Tweedie_distribution).

- **checkpoint**: Enter a model key associated with a previously-trained model. Use this option to build a new model as a continuation of a previously-generated model.

- **keep\_cross\_validation\_predictions**: To keep the cross-validation predictions, check this checkbox.

- **class\_sampling\_factors**: Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance. There is no default value.

- **max\_after\_balance\_size**: Specify the maximum relative size of the training data after balancing class counts (**balance\_classes** must be enabled). The value can be less than 1.0.

- **nbins\_top\_level**: (For numerical/real/int columns only) Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level.

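Two of the per-tree and per-level schedules above can be written out directly. A sketch (the function names and parameter values are illustrative, not H2O defaults):

```python
# learn_rate_annealing: the learning rate shrinks by a constant factor
# after every tree, so tree t (0-based) is fit with learn_rate * factor**t.
def annealed_learn_rate(learn_rate, annealing, tree_index):
    return learn_rate * annealing ** tree_index

# col_sample_rate_change_per_level: the column sampling rate changes by
# a constant factor per level of tree depth, as in the listing above.
def col_sample_rate_at_level(col_sample_rate, factor, level):
    return col_sample_rate * factor ** (level - 1)   # level is 1-based
```

For example, with `learn_rate=0.05` and `learn_rate_annealing=0.99`, the 100th tree is fit with roughly `0.05 * 0.99**99 ≈ 0.0185`.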
### Interpreting a GBM Model

The output for GBM includes the following:

- Model parameters (hidden)
- A graph of the scoring history (training MSE vs. number of trees)
- A graph of the variable importances
- Output (model category, validation metrics, initf)
- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
- Scoring history in tabular format
- Training metrics (model name, model checksum name, frame name, description, model category, duration in ms, scoring time, predictions, MSE, R2)
- Variable importances in tabular format

### Leaf Node Assignment
Trees cluster observations into leaf nodes, and this information can be useful for feature engineering or model interpretability. Use `h2o.predict_leaf_node_assignment(model, frame)` to get an H2OFrame with the leaf node assignments, or click the checkbox when making predictions from Flow. Those leaf nodes represent decision rules that can be fed to other models (i.e., GLM with lambda search and strong rules) to obtain a limited set of the most important rules.

Jun 23, 2017 @angela0xdata PUBDEV-4213: Updates to markdown syntax (#1307)
1262 ### FAQ
Apr 23, 2015 @jessica0xdata Add Data Science doc for review
1263
Apr 28, 2015 @jessica0xdata Update DataScienceH2O-Dev.md
- **How does the algorithm handle missing values during training?**

Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.

- **How does the algorithm handle missing values during testing?**

During scoring, missing values follow the optimal path that was determined for them during training (minimized loss function).

- **What happens if the response has missing values?**

No errors will occur, but nothing will be learned from rows with a missing response.

- **What happens when you try to predict on a categorical level not seen during training?**

GBM converts a new categorical level to an "undefined" value in the test set, and then splits either left or right during scoring.

- **Does it matter if the data is sorted?**

No.

- **Should data be shuffled before training?**

No.

- **How does the algorithm handle highly imbalanced data in a response column?**

You can specify `balance_classes`, `class_sampling_factors`, and `max_after_balance_size` to control over/under-sampling.

- **What if there are a large number of columns?**

GBM models are best for datasets with fewer than a few thousand columns.

- **What if there are a large number of categorical factor levels?**

Large numbers of categoricals are handled very efficiently - there is never any one-hot encoding.

- **Given the same training set and the same GBM parameters, will GBM produce a different model with two different validation data sets, or the same model?**

The same model will be generated.

- **How deterministic is GBM?**

The `nfolds` and `balance_classes` parameters use the seed directly. Otherwise, GBM is deterministic up to floating-point rounding errors (out-of-order atomic addition across multiple threads during histogram building). Any observed variations in the AUC curve should agree to at least three or four significant digits.

- **When fitting a random number between 0 and 1 as a single feature, the training ROC curve is consistent with `random` for low tree numbers and overfits as the number of trees is increased, as expected. However, when a random number is included as part of a set of hundreds of features, as the number of trees increases, the random number increases in feature importance. Why is this?**

This is a known behavior of GBM that is similar to its behavior in R. If, for example, it takes 50 trees to learn all there is to learn from a frame without the random features, then when you add a random predictor and train 1000 trees, the first 50 trees will be approximately the same. The remaining 950 trees are used to make sense of the random number, which takes a long time since there is no structure to learn. The variable importance will reflect the fact that all the splits in those final 950 trees are devoted to the random feature.

- **How is column sampling implemented for GBM?**

Consider an example model using:

- 100 columns
- `col_sample_rate_per_tree=0.754`
- `col_sample_rate=0.8` (refers to available columns after per-tree sampling)

For each tree, the floor is used to determine the number of columns that are randomly picked - in this example, floor(0.754*100)=75 out of the 100 - and the floor is used again to determine the number of columns that are then randomly chosen for each split decision - in this case, floor(0.754*0.8*100)=60, drawn from the 75.

- **I want to score multiple models on a huge dataset. Is it possible to score these models in parallel?**

The best way to score models in parallel is to use the in-H2O binary models. To do this, import the binary (non-POJO, previously exported) model into an H2O cluster; import the datasets into H2O as well; call the predict endpoint from R, Python, Flow, or the REST API directly; then export the predictions to file or download them from the server.

- **Are there any tutorials for GBM?**

You can find tutorials for using GBM with R, Python, and Flow at the following location: <a href="https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/tutorials/gbm" target="_blank">https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/tutorials/gbm</a>

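The two-level column sampling arithmetic described above can be sketched in a few lines of Python (a toy illustration of the counts, not H2O's actual implementation):

```python
import math
import random

def sample_columns(n_cols, col_sample_rate_per_tree, col_sample_rate, rng):
    """Toy illustration of GBM's per-tree and per-split column sampling counts."""
    # Per tree: floor(rate_per_tree * n_cols) columns are drawn once.
    n_per_tree = math.floor(col_sample_rate_per_tree * n_cols)
    tree_cols = rng.sample(range(n_cols), n_per_tree)
    # Per split: floor(rate * rate_per_tree * n_cols) columns, drawn from the tree's subset.
    n_per_split = math.floor(col_sample_rate * col_sample_rate_per_tree * n_cols)
    split_cols = rng.sample(tree_cols, n_per_split)
    return n_per_tree, n_per_split, split_cols

# The example from the FAQ: 100 columns, rates 0.754 and 0.8.
n_tree, n_split, cols = sample_columns(100, 0.754, 0.8, random.Random(42))
```

With these rates, each tree sees 75 columns and each split decision considers 60 of those 75.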
### GBM Algorithm

H2O's Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):


Initialize \(f_{k0} = 0,\: k=1,2,…,K\)

For \(m=1\) to \(M:\)

&nbsp;&nbsp;(a) Set \(p_{k}(x)=\frac{e^{f_{k}(x)}}{\sum_{l=1}^{K}e^{f_{l}(x)}},\:k=1,2,…,K\)

&nbsp;&nbsp;(b) For \(k=1\) to \(K\):

&nbsp;&nbsp;&nbsp;&nbsp;i. Compute \(r_{ikm}=y_{ik}-p_{k}(x_{i}),\:i=1,2,…,N.\)
&nbsp;&nbsp;&nbsp;&nbsp;ii. Fit a regression tree to the targets \(r_{ikm},\:i=1,2,…,N\), giving terminal regions \(R_{jkm},\:j=1,2,…,J_{m}.\)
&nbsp;&nbsp;&nbsp;&nbsp;iii. Compute \(\gamma_{jkm}=\frac{K-1}{K}\:\frac{\sum_{x_{i}\in R_{jkm}}(r_{ikm})}{\sum_{x_{i}\in R_{jkm}}|r_{ikm}|(1-|r_{ikm}|)},\:j=1,2,…,J_{m}.\)
&nbsp;&nbsp;&nbsp;&nbsp;iv. Update \(f_{km}(x)=f_{k,m-1}(x)+\sum_{j=1}^{J_{m}}\gamma_{jkm}I(x\in\:R_{jkm}).\)

Output \(\hat{f_{k}}(x)=f_{kM}(x),\:k=1,2,…,K.\)

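Step (a) and the residual computation in (b)(i) can be checked numerically. The sketch below works through one observation at initialization (illustrative bookkeeping only, not H2O's implementation):

```python
import math

def softmax(scores):
    """Step (a): p_k(x) = exp(f_k(x)) / sum_l exp(f_l(x))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One observation, K=3 classes, current ensemble scores f_k(x).
f = [0.0, 0.0, 0.0]                        # initialization: f_k0 = 0 for all k
y = [0, 1, 0]                              # one-hot encoding of the true class
p = softmax(f)                             # uniform at initialization: each p_k = 1/3
r = [y_k - p_k for y_k, p_k in zip(y, p)]  # step (b)(i): residuals r_k = y_k - p_k(x)
```

The residuals sum to zero across classes; each of the K regression trees in step (b)(ii) is then fit to one class's residuals.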
Be aware that the column type affects how the histogram is created, and the column type depends on whether rows are excluded or assigned a weight of 0. For example:

    val weight
    1   1
    0.5 0
    5   1
    3.5 0

The above vec has a real-valued type if passed as a whole, but if the zero-weighted rows are sliced away first, the integer type is used. The resulting histogram is either kept at full `nbins` resolution or potentially shrunk to the discrete integer range, which affects the split points.

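The effect can be mimicked in plain Python (illustrative only — H2O's actual type inference happens at the Vec level):

```python
def looks_integer(values):
    """A column is treated as integer-typed only if every value is a whole number."""
    return all(float(v).is_integer() for v in values)

vals    = [1, 0.5, 5, 3.5]
weights = [1, 0,   1, 0]

whole_column = looks_integer(vals)                     # False: 0.5 and 3.5 are real-valued
sliced = [v for v, w in zip(vals, weights) if w != 0]  # drop the zero-weight rows first
sliced_column = looks_integer(sliced)                  # True: only 1 and 5 remain
```

Passing the whole column yields a real-valued histogram, while slicing first yields an integer histogram with potentially different split points.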
For more information about the GBM algorithm, refer to the [Gradient Boosted Machines booklet](http://h2o.ai/resources).


1363 ### Binning In GBM
Jan 15, 2016 @spennihana update docs, flush case after inserting missing values
1364
1365 **Is the binning range-based or percentile-based?**
1366
1367 It's range based, and re-binned at each tree split.
1368 NAs always "go to the left" (smallest) bin.
1369 There's a minimum observations required value (default 10).
1370 There has to be at least 1 FP ULP improvement in error to split (all-constant predictors won't split).
1371 nbins is at least 1024 at the top-level, and divides by 2 down each level until you hit the nbins parameter (default: 20).
1372 Categoricals use a separate, more aggressive, binning range.
1373
1374 Re-binning means, eg, suppose your column C1 data is: {1,1,2,4,8,16,100,1000}.
1375 Then a 20-way binning will use the range from 1 to 1000, bin by units of 50.
1376 The first binning will be a lumpy: {1,1,2,4,8,16},{100},{47_empty_bins},{1000}. Suppose the split peels out the {1000} bin from the rest.
1377
1378 Next layer in the tree for the left-split has value from 1 to 100 (not 1000!) and so re-bins in units of 5: {1,1,2,4},{8},{},{16},{lots of empty bins}{100}
1379 (the RH split has the single value 1000).
1380
1381 And so on: important dense ranges with split essentially logrithmeticaly at each layer.
1382
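The example above can be reproduced with a small range-based binning sketch (illustrative, not H2O's histogram code):

```python
def range_bin(values, nbins):
    """Assign each value to one of `nbins` equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), nbins - 1) for v in values]

c1 = [1, 1, 2, 4, 8, 16, 100, 1000]
bins = range_bin(c1, 20)        # width ~50: {1..16} land in bin 0, 100 in bin 1, 1000 in bin 19

# After peeling off the {1000} bin, the left branch re-bins over [1, 100] (width ~5).
left = [v for v in c1 if v < 1000]
rebins = range_bin(left, 20)    # now 8 and 16 separate from {1,1,2,4}
```

Note how the re-binned pass resolves structure ({8} and {16} get their own bins) that the first coarse pass lumped together.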
**What should I do if my variables are long-tailed and might have large outliers?**

You can try adding a new predictor column which is either pre-binned (e.g., as a categorical with "small", "medium", and "giant" values) or log-transformed - and keep the old column as well.

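Both workarounds can be sketched in a few lines (illustrative; the cut points and values below are arbitrary choices, not recommendations):

```python
import math

skewed = [0.5, 1.2, 3.0, 9.5, 250.0, 12000.0]

# Option 1: add a log-transformed copy of the column (keep the original too).
log_col = [math.log1p(v) for v in skewed]

# Option 2: add a pre-binned categorical version with hand-picked cut points.
def pre_bin(v, cuts=(10.0, 1000.0)):
    if v < cuts[0]:
        return "small"
    return "medium" if v < cuts[1] else "giant"

binned = [pre_bin(v) for v in skewed]
```

Either derived column gives the range-based binning a chance to resolve the dense low end instead of spending nearly all bins on the outlier tail.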
### References

Dietterich, Thomas G, and Eun Bae Kong. "Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms." ML-95 255 (1995).

Elith, Jane, John R Leathwick, and Trevor Hastie. "A Working Guide to Boosted Regression Trees." Journal of Animal Ecology 77.4 (2008): 802-813.

Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics (2001): 1189-1232.

Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. "Discussion of Boosting Papers." Ann. Statist 32 (2004): 102-107.

[Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "Additive Logistic Regression: A Statistical View of Boosting (With Discussion and a Rejoinder by the Authors)." The Annals of Statistics 28.2 (2000): 337-407.](http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1016218223)

[Hastie, Trevor, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning. Vol. 1, page 339. New York: Springer, 2001.](http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf)

---

<a name="DL"></a>
## Deep Learning

### Introduction

H2O’s Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.

### Defining a Deep Learning Model

H2O Deep Learning models have many input parameters, many of which are only accessible via the expert mode. For most cases, use the default values. Please read the following instructions before building extensive Deep Learning models. The application of grid search and successive continuation of winning models via checkpoint restart is highly recommended, as model performance can vary greatly.

- **model_id**: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

- **training_frame**: (Required) Select the dataset used to build the model.
**NOTE**: If you click the **Build a model** button from the `Parse` cell, the training frame is entered automatically.

- **validation_frame**: (Optional) Select the dataset used to evaluate the accuracy of the model.

- **nfolds**: Specify the number of folds for cross-validation.
>**Note**: Cross-validation is not supported when autoencoder is enabled.

- **response_column**: Select the column to use as the dependent variable. The data can be numeric or categorical.

- **ignored_columns**: (Optional) Click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the **All** button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the **None** button. To search for a specific column, type the column name in the **Search** field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the **Only show columns with more than 0% missing values** field. To change the selections for the hidden columns, use the **Select Visible** or **Deselect Visible** buttons.

- **ignore\_const\_cols**: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.

- **activation**: Select the activation function (Tanh, Tanh with dropout, Rectifier, Rectifier with dropout, Maxout, Maxout with dropout).
> - **Maxout** is not supported when **autoencoder** is enabled.

- **hidden**: Specify the hidden layer sizes (e.g., 100,100). The value must be positive.

- **epochs**: Specify the number of times to iterate (stream) the dataset. The value can be a fraction.

- **variable_importances**: Check this checkbox to compute variable importance. This option is not selected by default.

- **fold_assignment**: (Applicable only if a value for **nfolds** is specified and **fold_column** is not selected) Select the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, or [Modulo](https://en.wikipedia.org/wiki/Modulo_operation).

- **fold_column**: Select the column that contains the cross-validation fold index assignment per observation.

- **weights_column**: Select a column to use for the observation weights, which are used for bias correction. The specified `weights_column` must be included in the specified `training_frame`. *Python only*: To use a weights column when passing an H2OFrame to `x` instead of a list of column names, the specified `training_frame` must contain the specified `weights_column`.
>*Note*: Weights are per-row observation weights. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

- **offset_column**: (Applicable for regression only) Select a column to use as the offset.
>*Note*: Offsets are per-row "bias values" that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. For more information, refer to the following [link](http://www.idg.pl/mirrors/CRAN/web/packages/gbm/vignettes/gbm.pdf).

- **balance_classes**: (Applicable for classification only) Oversample the minority classes to balance the class distribution. This option is not selected by default and can increase the data frame size. Majority classes can be undersampled to satisfy the **max\_after\_balance\_size** parameter.

- **standardize**: If enabled, automatically standardize the data (mean 0, variance 1). If disabled, the user must provide properly scaled input data.

- **max\_confusion\_matrix\_size**: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.

- **max\_hit\_ratio\_k**: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.

- **checkpoint**: Enter a model key associated with a previously-trained Deep Learning model. Use this option to build a new model as a continuation of a previously-generated model.
>**Note**: Cross-validation is not supported during checkpoint restarts.

- **use\_all\_factor\_levels**: Check this checkbox to use all factor levels in the possible set of predictors; if you enable this option, sufficient regularization is required. By default, the first factor level is skipped. For Deep Learning models, this option is useful for determining variable importances and is automatically enabled if the autoencoder is selected.

- **train\_samples\_per\_iteration**: Specify the number of global training samples per MapReduce iteration. To specify one epoch, enter 0. To specify all available data (e.g., replicated training data), enter -1. To use the automatic values, enter -2.

- **adaptive_rate**: Check this checkbox to enable the adaptive learning rate (ADADELTA). This option is selected by default.

- **input\_dropout\_ratio**: Specify the input layer dropout ratio to improve generalization. Suggested values are 0.1 or 0.2.

- **hidden\_dropout\_ratios**: (Applicable only if the activation type is **TanhWithDropout**, **RectifierWithDropout**, or **MaxoutWithDropout**) Specify the hidden layer dropout ratio to improve generalization. Specify one value per hidden layer. The range is >= 0 to <1 and the default is 0.5.

- **l1**: Specify the L1 regularization to add stability and improve generalization; sets the value of many weights to 0.

- **l2**: Specify the L2 regularization to add stability and improve generalization; sets the value of many weights to smaller values.

- **loss**: Select the loss function. The options are Automatic, CrossEntropy, Quadratic, Huber, or Absolute and the default value is Automatic.
> - Use **Absolute**, **Quadratic**, or **Huber** for regression
> - Use **Absolute**, **Quadratic**, **Huber**, or **CrossEntropy** for classification

- **distribution**: Select the distribution type from the drop-down list. The options are auto, bernoulli, multinomial, gaussian, poisson, gamma, laplace, quantile, or tweedie.

- **quantile_alpha**: (Only applicable if *Quantile* is selected for **distribution**) Specify the quantile to be used for Quantile Regression.

- **tweedie_power**: (Only applicable if *Tweedie* is selected for **distribution**) Specify the Tweedie power. The range is from 1 to 2. For a normal distribution, enter `0`. For a Poisson distribution, enter `1`. For a gamma distribution, enter `2`. For a compound Poisson-gamma distribution, enter a value greater than 1 but less than 2. For more information, refer to [Tweedie distribution](https://en.wikipedia.org/wiki/Tweedie_distribution).

- **score_interval**: Specify the shortest time interval (in seconds) to wait between model scoring.

- **score\_training\_samples**: Specify the number of training set samples for scoring. The value must be >= 0. To use all training samples, enter 0.

- **score\_validation\_samples**: (Applicable only if **validation\_frame** is specified) Specify the number of validation set samples for scoring. The value must be >= 0. To use all validation samples, enter 0.

- **score\_duty\_cycle**: Specify the maximum duty cycle fraction for scoring. A lower value results in more training and a higher value results in more scoring.

- **stopping\_rounds**: Stops training when the option selected for **stopping\_metric** doesn't improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify `0`. The metric is computed on the validation data (if provided); otherwise, training data is used. When used with **overwrite\_with\_best\_model**, the final model is the best model generated for the given **stopping\_metric** option.
>**Note**: If cross-validation is enabled:
1. All cross-validation models stop training when the validation metric doesn't improve.
2. The main model runs for the mean number of epochs.
3. N+1 models do *not* use **overwrite\_with\_best\_model**
4. N+1 models may be off by the number specified for **stopping\_rounds** from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).

- **stopping\_metric**: Select the metric to use for early stopping. The available options are:

 - **AUTO**: Logloss for classification; deviance for regression
 - **deviance**
 - **logloss**
 - **MSE**
 - **AUC**
 - **r2**
 - **misclassification**
 - **mean\_per\_class\_error**

- **stopping\_tolerance**: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value.

- **max\_runtime\_secs**: Maximum allowed runtime in seconds for model training. Use 0 to disable.

- **autoencoder**: Check this checkbox to enable the Deep Learning autoencoder. This option is not selected by default.
>**Note**: Cross-validation is not supported when autoencoder is enabled.

- **keep\_cross\_validation\_predictions**: To keep the cross-validation predictions, check this checkbox.

- **class\_sampling\_factors**: (Applicable only for classification and when **balance\_classes** is enabled) Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance.

- **max\_after\_balance\_size**: Specify the maximum relative size of the training data after balancing class counts (**balance\_classes** must be enabled). The value can be less than 1.0.

- **overwrite\_with\_best\_model**: Check this checkbox to overwrite the final model with the best model found during training, based on the option selected for **stopping\_metric**. This option is selected by default.

- **target\_ratio\_comm\_to\_comp**: Specify the target ratio of communication overhead to computation. This option is only enabled for multi-node operation and if **train\_samples\_per\_iteration** equals -2 (auto-tuning).

- **seed**: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

- **rho**: (Applicable only if **adaptive\_rate** is enabled) Specify the adaptive learning rate time decay factor.

- **epsilon**: (Applicable only if **adaptive\_rate** is enabled) Specify the adaptive learning rate smoothing factor to avoid dividing by zero.

- **max_w2**: Specify the constraint for the squared sum of the incoming weights per unit (e.g., for Rectifier).

- **initial\_weight\_distribution**: Select the initial weight distribution (Uniform Adaptive, Uniform, or Normal).

- **regression_stop**: (Regression models only) Specify the stopping criterion for regression error (MSE) on the training data. To disable this option, enter -1.

- **diagnostics**: Check this checkbox to compute the variable importances for input features (using the Gedeon method). For large networks, selecting this option can reduce speed. This option is selected by default.

- **fast_mode**: Check this checkbox to enable fast mode, a minor approximation in back-propagation. This option is selected by default.

- **force\_load\_balance**: Check this checkbox to force extra load balancing to increase training speed for small datasets and use all cores. This option is selected by default.

- **single\_node\_mode**: Check this checkbox to force H2O to run on a single node for fine-tuning of model parameters. This option is not selected by default.

- **shuffle\_training\_data**: Check this checkbox to shuffle the training data. This option is recommended if the training data is replicated and the value of **train\_samples\_per\_iteration** is close to the number of nodes times the number of rows. This option is not selected by default.

- **missing\_values\_handling**: Specify how to handle missing values (Skip or MeanImputation). This defaults to MeanImputation.

- **quiet_mode**: Check this checkbox to display less output in the standard output. This option is not selected by default.

- **sparse**: Check this checkbox to enable sparse data handling, which is more efficient for data with many zero values.

- **col_major**: Check this checkbox to use a column major weight matrix for the input layer. This option can speed up forward propagation but may reduce the speed of backpropagation. This option is not selected by default.

- **average_activation**: Specify the average activation for the sparse autoencoder.
> - If **Rectifier** is used, the **average\_activation** value must be positive.

- **sparsity_beta**: (Applicable only if **autoencoder** is enabled) Specify the sparsity-based regularization optimization. For more information, refer to the following [link](http://www.mit.edu/~9.520/spring09/Classes/class11_sparsity.pdf).

- **max\_categorical\_features**: Specify the maximum number of categorical features enforced via hashing. The value must be at least one.

- **reproducible**: To force reproducibility on small data, check this checkbox. If this option is enabled, the model takes more time to generate, since it uses only one thread.

- **export\_weights\_and\_biases**: To export the neural network weights and biases as H2O frames, check this checkbox.

- **elastic\_averaging**: To enable elastic averaging between computing nodes, which can improve distributed model convergence, check this checkbox (experimental).

1581
1582 - **rate**: (Applicable only if **adaptive\_rate** is disabled) Specify the learning rate. Higher values result in a less stable model, while lower values lead to slower convergence.
1583
1584 - **rate\_annealing**: (Applicable only if **adaptive\_rate** is disabled) Specify the rate annealing value. The rate annealing is calculated as **rate**\(1 + **rate\_annealing** * samples).
1585
1586 - **rate\_decay**: (Applicable only if **adaptive\_rate** is disabled) Specify the rate decay factor between layers. The rate decay is calculated as (N-th layer: **rate** * alpha^(N-1)).
1587
1588 - **momentum\_start**: (Applicable only if **adaptive\_rate** is disabled) Specify the initial momentum at the beginning of training; we suggest 0.5.
1589
1590 - **momentum\_ramp**: (Applicable only if **adaptive\_rate** is disabled) Specify the number of training samples for which the momentum increases.
1591
1592 - **momentum\_stable**: (Applicable only if **adaptive\_rate** is disabled) Specify the final momentum after the ramp is over; we suggest 0.99.
1593
1594 - **nesterov\_accelerated\_gradient**: (Applicable only if **adaptive\_rate** is disabled) Enables the [Nesterov Accelerated Gradient](http://premolab.ru/pub_files/pub88/qhkDNEyp8.pdf).
1595
1596
1597 - **initial\_weight\_scale**: (Applicable only if **initial\_weight\_distribution** is **Uniform** or **Normal**) Specify the scale of the distribution function. For **Uniform**, the values are drawn uniformly. For **Normal**, the values are drawn from a Normal distribution with a standard deviation.
1598
1599
1600
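
To make the two learning-rate formulas above concrete, here is a small illustrative sketch (plain Python, not H2O code; the function name `effective_rate` is hypothetical) of the effective learning rate for a given layer after a given number of training samples:

```python
def effective_rate(rate, rate_annealing, rate_decay, samples_seen, layer_index):
    """Sketch of the annealed, per-layer learning rate.

    rate / (1 + rate_annealing * samples_seen) is the annealed base rate;
    it is then scaled by rate_decay^(layer_index - 1) for the N-th layer.
    """
    annealed = rate / (1.0 + rate_annealing * samples_seen)
    return annealed * rate_decay ** (layer_index - 1)

# Example: base rate 0.005, annealing 1e-6, no per-layer decay (factor 1.0).
# After 1,000,000 samples the rate is halved: 0.005 / (1 + 1) = 0.0025
print(effective_rate(0.005, 1e-6, 1.0, samples_seen=1_000_000, layer_index=1))
```
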
### Interpreting a Deep Learning Model

To view the results, click the View button. The output for the Deep Learning model includes the following information for both the training and testing sets:

- Model parameters (hidden)
- A chart of the variable importances
- A graph of the scoring history (training MSE and validation MSE vs epochs)
- Output (model category, weights, biases)
- Status of neuron layers (layer number, units, type, dropout, L1, L2, mean rate, rate RMS, momentum, mean weight, weight RMS, mean bias, bias RMS)
- Scoring history in tabular format
- Training metrics (model name, model checksum name, frame name, frame checksum name, description, model category, duration in ms, scoring time, predictions, MSE, R2, logloss)
- Top-K Hit Ratios (for multi-class classification)
- Confusion matrix (for classification)

### FAQ

- **How does the algorithm handle missing values during training?**

Depending on the selected missing value handling policy, missing values are either mean-imputed or the whole row is skipped. The default behavior is mean imputation. Note that categorical variables are imputed by adding an extra "missing" level. Optionally, Deep Learning can skip all rows with any missing values.

- **How does the algorithm handle missing values during testing?**

Missing values in the test set will be mean-imputed during scoring.

- **What happens if the response has missing values?**

No errors will occur, but nothing will be learned from rows with a missing response.

- **Does it matter if the data is sorted?**

Yes, since the training set is processed in order. Depending on the value of `train_samples_per_iteration`, some rows will be skipped. If `shuffle_training_data` is enabled, then each thread that is processing a small subset of rows will process its rows randomly, but it is not a global shuffle.

- **Should data be shuffled before training?**

Yes, the data should be shuffled before training, especially if the dataset is sorted.

- **How does the algorithm handle highly imbalanced data in a response column?**

Specify `balance_classes`, `class_sampling_factors` and `max_after_balance_size` to control over/under-sampling.

- **What if there are a large number of columns?**

The input neuron layer's size is scaled to the number of input features, so as the number of columns increases, the model complexity increases as well.

- **What if there are a large number of categorical factor levels?**

This is something to look out for. Say you have three columns: zip code (70k levels), height, and income. The resulting number of internally one-hot encoded features will be 70,002 and only 3 of them will be activated (non-zero). If the first hidden layer has 200 neurons, then the resulting weight matrix will be of size 70,002 x 200, which can take a long time to train and converge. In this case, we recommend either reducing the number of categorical factor levels upfront (e.g., using `h2o.interaction()` from R), or specifying `max_categorical_features` to use feature hashing to reduce the dimensionality.

- **How does your Deep Learning Autoencoder work? Is it deep or shallow?**

H2O’s DL autoencoder is based on the standard deep (multi-layer) neural net architecture, where the entire network is learned together, instead of being stacked layer-by-layer. The only difference is that no response is required in the input and that the output layer has as many neurons as the input layer. If you don’t achieve convergence, then try using the *Tanh* activation and fewer layers. We have some example test scripts [here](https://github.com/h2oai/h2o-3/blob/master/h2o-r/tests/testdir_algos/deeplearning/), and even some that show [how stacked auto-encoders can be implemented in R](https://github.com/h2oai/h2o-3/blob/master/h2o-r/tests/testdir_algos/deeplearning/runit_deeplearning_stacked_autoencoder_large.R).

- **When building the model, does Deep Learning use all features or a selection of the best features?**

For Deep Learning, all features are used, unless you manually specify that columns should be ignored. Adding an L1 penalty can make the model sparse, but it is still the full size.

- **What is the relationship between iterations, epochs, and the `train_samples_per_iteration` parameter?**

Epochs measure the amount of training. An iteration is one MapReduce (MR) step - essentially, one pass over the data. The `train_samples_per_iteration` parameter is the amount of data to use for training for each MR step, which can be more or less than the number of rows.
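
As a back-of-the-envelope illustration of this relationship (plain Python, not H2O internals; the function name is hypothetical), the number of MR iterations implied by a target number of epochs can be sketched as:

```python
import math

def iterations_for_epochs(epochs, n_rows, train_samples_per_iteration):
    # Total training samples requested is epochs * n_rows; each MR iteration
    # consumes train_samples_per_iteration samples, which may be more or
    # less than one pass over the data.
    total_samples = epochs * n_rows
    return math.ceil(total_samples / train_samples_per_iteration)

# 10 epochs over 100,000 rows with 250,000 samples per iteration
# -> 1,000,000 samples total -> 4 iterations
print(iterations_for_epochs(10, 100_000, 250_000))
```
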

- **When do `reduce()` calls occur, after each iteration or each epoch?**

Neither; `reduce()` calls occur after every two `map()` calls, between threads and ultimately between nodes. There are many `reduce()` calls, many more than one per MapReduce step (also known as an "iteration"). Epochs are not related to MR iterations, unless you specify `train_samples_per_iteration` as `0` or `-1` (or as the number of rows/nodes). Otherwise, one MR iteration can train with an arbitrary number of training samples (as specified by `train_samples_per_iteration`).

- **Does each Mapper task work on a separate neural-net model that is combined during reduction, or is each Mapper manipulating a shared object that's persistent across nodes?**

Neither; there's one model per compute node, so multiple Mappers/threads share one model, which is why H2O is not reproducible unless a small dataset is used and `force_load_balance=F` or `reproducible=T`, which effectively rebalances to a single chunk and leads to only one thread launching a `map()`. The current behavior is simple model averaging; between-node model averaging via "Elastic Averaging" is currently [in progress](https://0xdata.atlassian.net/browse/HEXDEV-206).

- **Is the loss function and backpropagation performed after each individual training sample, each iteration, or at the epoch level?**

Loss function and backpropagation are performed after each training sample (mini-batch size 1 == online stochastic gradient descent).

- **When using Hinton's dropout and specifying an input dropout ratio of ~20% and `train_samples_per_iteration` is set to 50, will each of the 50 samples have a different set of the 20% input neurons suppressed?**

Yes - suppression is not done at the iteration level across all samples in that iteration. The dropout mask is different for each training sample.

- **When using dropout parameters such as `input_dropout_ratio`, what happens if you use only `Rectifier` instead of `RectifierWithDropout` in the activation parameter?**

The amount of dropout on the input layer can be specified for all activation functions, but hidden layer dropout is only supported when the activation type ends in `WithDropout`. The default hidden dropout is 50%, so you don't need to specify anything but the activation type to get good results, but you can set the hidden dropout values for each layer separately.

- **When using the `score_validation_sampling` and `score_training_samples` parameters, is scoring done at the end of the Deep Learning run?**

The majority of scoring takes place after each MR iteration. After the iteration is complete, it may or may not be scored, depending on two criteria: the time since the last scoring and the time needed for scoring.

The maximum time between scoring events (`score_interval`, default = 5 seconds) and the maximum fraction of time spent scoring (`score_duty_cycle`) can be specified independently of the loss function, backpropagation, etc.

Of course, using more training or validation samples will increase the time for scoring, as will scoring more frequently. For more information about how this affects runtime, refer to the [Deep Learning Performance Guide](http://h2o.ai/blog/2015/02/deep-learning-performance/).
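
The two criteria can be pictured with a small decision-rule sketch (illustrative Python only, not H2O's actual scheduler; the helper name is hypothetical, while the parameter names are borrowed from above):

```python
def should_score(seconds_since_last_score, est_scoring_seconds,
                 training_seconds, score_interval=5.0, score_duty_cycle=0.1):
    # Criterion 1: enough wall-clock time has passed since the last scoring.
    overdue = seconds_since_last_score >= score_interval
    # Criterion 2: scoring would keep the fraction of total time spent
    # scoring at or below the allowed duty cycle.
    cheap_enough = est_scoring_seconds <= score_duty_cycle * (
        training_seconds + est_scoring_seconds)
    return overdue and cheap_enough

# Overdue and cheap -> score; too soon or too expensive -> skip.
print(should_score(6, 0.5, 10))   # True
print(should_score(1, 0.5, 10))   # False (score_interval not yet reached)
print(should_score(10, 5, 10))    # False (would exceed the duty cycle)
```
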

- **How does the validation frame affect the built neuron network?**

The validation frame is only used for scoring and does not directly affect the model. However, the validation frame can be used for early stopping if `overwrite_with_best_model = T`, which is the default. If this parameter is enabled, the model with the lowest validation error is displayed at the end of the training.

By default, the validation frame is used to tune the model parameters (such as the number of epochs) and will return the best model as measured by the validation metrics, depending on how often the validation metrics are computed (`score_duty_cycle`) and whether the validation frame itself was sampled.

Model-internal sampling of the validation frame (`score_validation_samples` and `score_validation_sampling` for optional stratification) will affect early stopping quality. If you specify a validation frame but set `score_validation_samples` to more than the number of rows in the validation frame (instead of 0, which represents the entire frame), the validation metrics received at the end of training will not be reproducible, since the model does internal sampling.

- **Are there any best practices for building a model using checkpointing?**

In general, to get the best possible model, we recommend building a model with `train_samples_per_iteration = -2` (which is the default value for auto-tuning) and saving it.

To improve the initial model, start from the previous model and add iterations by building another model, setting the checkpoint to the previous model, and changing `train_samples_per_iteration`, `target_ratio_comm_to_comp`, or other parameters.

If you don't know your model ID because it was generated by R, look it up using `h2o.ls()`. By default, Deep Learning model names start with `deeplearning_`. To view the model, use `m <- h2o.getModel("my_model_id")` or `summary(m)`.

There are a few ways to manage checkpoint restarts:

*Option 1*: (Multi-node only) Leave `train_samples_per_iteration = -2` and increase `target_ratio_comm_to_comp` from 0.05 to 0.25 or 0.5, which provides more communication. This should result in a better model when using multiple nodes. **Note:** This does not affect single-node performance.

*Option 2*: (Single or multi-node) Set `train_samples_per_iteration` to \(N\), where \(N\) is the number of training samples used for training by the entire cluster for one iteration. Each of the nodes then trains on \(N\) randomly-chosen rows for every iteration. The number defined as \(N\) depends on the dataset size and the model complexity.

*Option 3*: (Single or multi-node) Change regularization parameters such as `l1`, `l2`, `max_w2`, `input_dropout_ratio`, or `hidden_dropout_ratios`. We recommend building the first model using `RectifierWithDropout`, `input_dropout_ratio = 0` (or 0.1 or 0.2 if there is suspected noise in the input), and `hidden_dropout_ratios = c(0,0,0)` (for the ability to enable dropout regularization later).

- **How does class balancing work?**

The `max_after_balance_size` parameter defines the maximum size of the over-sampled dataset. For example, if `max_after_balance_size = 3`, the over-sampled dataset will not be greater than three times the size of the original dataset.

For example, if you have five classes with priors of 90%, 2.5%, 2.5%, 2.5%, and 2.5% (out of a total of one million rows) and you oversample to obtain a class balance using `balance_classes = T`, the result is that all four minority classes are oversampled by 36 times and the total dataset will be 4.5 times as large as the original dataset (900,000 rows of each class). If `max_after_balance_size = 3`, all five balanced classes are reduced by a factor of 2/3, resulting in 600,000 rows each (three million total).

To specify the per-class over- or under-sampling factors, use `class_sampling_factors`. In the previous example, the default behavior with `balance_classes` is equivalent to `c(1,36,36,36,36)`, while with `max_after_balance_size = 3`, the result would be `c(2/3,24,24,24,24)`.

In all cases, the probabilities are adjusted to the pre-sampled space, so the minority classes will have lower average final probabilities than the majority class, even if they were sampled to reach class balance.
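
The arithmetic in the example above can be reproduced with a short sketch (plain Python, illustrative only; H2O computes these factors internally, and the function name is hypothetical):

```python
def balancing_factors(class_counts, max_after_balance_size=None):
    """Per-class sampling factors that balance every class up to the
    majority class, optionally capped so the balanced dataset is at most
    max_after_balance_size times the original size."""
    total = sum(class_counts)
    majority = max(class_counts)
    factors = [majority / c for c in class_counts]
    if max_after_balance_size is not None:
        balanced_total = majority * len(class_counts)
        cap = max_after_balance_size * total
        if balanced_total > cap:
            # Shrink every factor uniformly to respect the cap.
            shrink = cap / balanced_total
            factors = [f * shrink for f in factors]
    return factors

# Five classes out of 1M rows: 900k majority, four minorities of 25k each.
counts = [900_000, 25_000, 25_000, 25_000, 25_000]
print(balancing_factors(counts))      # [1.0, 36.0, 36.0, 36.0, 36.0]
print(balancing_factors(counts, 3))   # capped: approx. [2/3, 24, 24, 24, 24]
```
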

- **How is variable importance calculated for Deep Learning?**

For Deep Learning, variable importance is calculated using the Gedeon method.

- **Why do my results include a negative R^2 value?**

H2O computes the R^2 as `1 - MSE/variance`, where `MSE` is the mean squared error of the prediction, and `variance` is the (weighted) variance of the response: `sum(w*Y*Y)/sum(w) - sum(w*Y)^2/sum(w)^2`, where `w` is the row weight (1 by default), and `Y` is the centered response.

If the MSE is greater than the variance of the response, you will see a negative R^2 value. This indicates that the model fit is very poor, and the results should not be trusted.
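
The formula above can be written out directly (a plain-Python sketch, not H2O code; the helper name is hypothetical):

```python
def weighted_r2(y, pred, w=None):
    """R^2 = 1 - MSE/variance, with the (weighted) variance computed as
    sum(w*y*y)/sum(w) - (sum(w*y)/sum(w))^2."""
    n = len(y)
    if w is None:
        w = [1.0] * n  # default row weight of 1
    sw = sum(w)
    mse = sum(wi * (yi - pi) ** 2 for wi, yi, pi in zip(w, y, pred)) / sw
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    var = sum(wi * yi * yi for wi, yi in zip(w, y)) / sw - mean ** 2
    return 1.0 - mse / var

# A constant prediction far from the data gives MSE > variance,
# hence a negative R^2:
print(weighted_r2([1.0, 2.0, 3.0], [10.0, 10.0, 10.0]))  # negative
```
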

---

### Deep Learning Algorithm

To compute deviance for a Deep Learning regression model, the following rules apply:

- For Quadratic loss: MSE == Deviance
- For Absolute/Laplace or Huber loss: MSE != Deviance

For more information about how the Deep Learning algorithm works, refer to the [Deep Learning booklet](http://h2o.ai/resources).

### References

["Deep Learning." *Wikipedia: The free encyclopedia*. Wikimedia Foundation, Inc. 1 May 2015. Web. 4 May 2015.](http://en.wikipedia.org/wiki/Deep_learning)

["Artificial Neural Network." *Wikipedia: The free encyclopedia*. Wikimedia Foundation, Inc. 22 April 2015. Web. 4 May 2015.](http://en.wikipedia.org/wiki/Artificial_neural_network)

[Zeiler, Matthew D. "ADADELTA: An Adaptive Learning Rate Method." Arxiv.org. N.p., 2012. Web. 4 May 2015.](http://arxiv.org/abs/1212.5701)

[Sutskever, Ilya et al. "On the importance of initialization and momentum in deep learning." JMLR: W&CP vol. 28. (2013).](http://www.cs.toronto.edu/~fritz/absps/momentum.pdf)

[Hinton, G.E. et al. "Improving neural networks by preventing co-adaptation of feature detectors." University of Toronto. (2012).](http://arxiv.org/pdf/1207.0580.pdf)

[Wager, Stefan et al. "Dropout Training as Adaptive Regularization." Advances in Neural Information Processing Systems. (2013).](http://arxiv.org/abs/1307.1493)

[Gedeon, TD. "Data mining of inputs: analysing magnitude and functional measures." University of New South Wales. (1997).](http://www.ncbi.nlm.nih.gov/pubmed/9327276)

[Candel, Arno and Parmar, Viraj. "Deep Learning with H2O." H2O.ai, Inc. (2015).](https://leanpub.com/deeplearning)

[Deep Learning Training](http://learn.h2o.ai/content/hands-on_training/deep_learning.html)

[Slideshare slide decks](http://www.slideshare.net/0xdata/presentations?order=latest)

[Youtube channel](https://www.youtube.com/user/0xdata)

[Candel, Arno. "The Definitive Performance Tuning Guide for H2O Deep Learning." H2O.ai, Inc. (2015).](http://h2o.ai/blog/2015/02/deep-learning-performance/)

[Niu, Feng, et al. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems 24 (2011): 693-701. (The algorithm implemented is on p. 5.)](https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf)

[Hawkins, Simon et al. "Outlier Detection Using Replicator Neural Networks." CSIRO Mathematical and Information Sciences.](http://neuro.bstu.by/ai/To-dom/My_research/Paper-0-again/For-research/D-mining/Anomaly-D/KDD-cup-99/NN/dawak02.pdf)

## Cross-Validation

N-fold cross-validation is used to validate a model internally, i.e., to estimate the model performance without having to sacrifice a validation split. It also avoids statistical issues with your validation split (it might be a “lucky” split, especially for imbalanced data). Good values for N are around 5 to 10. Comparing the N validation metrics is always a good idea, to check the stability of the estimates, before “trusting” the main model.

You have to make sure, however, that the holdout sets for each of the N models are good. For i.i.d. data, the random splitting of the data into N pieces (default behavior) or modulo-based splitting is fine. For temporal or otherwise structured data with distinct “events”, you have to make sure to split the folds based on the events. For example, if you have observations (e.g., user transactions) from N cities and you want to build models on users from only N-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter `fold_column` to be the city column. Otherwise, you will have rows (users) from all N cities randomly blended into the N folds, and all N cv models will see all N cities, making the validation less useful (or totally wrong, depending on the distribution of the data). This is known as [“data leakage”](https://youtu.be/NHw_aKO5KUM?t=889).

### How Cross-Validation is Calculated

In general, for all algos that support the nfolds parameter, H2O’s cross-validation works as follows:

For example, for nfolds=5, 6 models are built. The first 5 models (cross-validation models) are built on 80% of the training data, and a different 20% is held out for each of the 5 models. Then the main model is built on 100% of the training data. This main model is the model you get back from H2O in R, Python, and Flow.

This main model contains training metrics and cross-validation metrics (and optionally, validation metrics if a validation frame was provided). The main model also contains pointers to the 5 cross-validation models for further inspection.

All 5 cross-validation models contain training metrics (from the 80% training data) and validation metrics (from their 20% holdout/validation data). To compute their individual validation metrics, each of the 5 cross-validation models had to make predictions on their 20% of rows of the original training frame, and score against the true labels of the 20% holdout.

For the main model, this is how the cross-validation metrics are computed: The 5 holdout predictions are combined into one prediction for the full training dataset (i.e., predictions for every row of the training data, but the model making the prediction for a particular row has not seen that row during training). This “holdout prediction" is then scored against the true labels, and the overall cross-validation metrics are computed.

This approach has some implications. Scoring the holdout predictions freshly can result in different metrics than taking the average of the 5 validation metrics of the cross-validation models. For example, if the sizes of the holdout folds differ a lot (e.g., when a user-given `fold_column` is used), then the average should probably be replaced with a weighted average. Also, if the cross-validation models map to slightly different probability spaces, which can happen for small DL models that converge to different local minima, then the confused rank ordering of the combined predictions would lead to a significantly different AUC than the average.
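
The weighted-average point can be illustrated with a short sketch (plain Python; the fold metrics and sizes are made-up numbers, and the function name is hypothetical):

```python
def average_metric(fold_metrics, fold_sizes, weighted=True):
    # Unweighted: plain mean of the per-fold metrics.
    # Weighted: each fold's metric contributes in proportion to its size,
    # which matters when a user-given fold_column yields unequal folds.
    if not weighted:
        return sum(fold_metrics) / len(fold_metrics)
    total = sum(fold_sizes)
    return sum(m * n for m, n in zip(fold_metrics, fold_sizes)) / total

aucs = [0.70, 0.80, 0.90]
sizes = [100, 100, 800]  # very uneven folds
print(average_metric(aucs, sizes, weighted=False))  # approx. 0.80
print(average_metric(aucs, sizes))                  # approx. 0.87
```
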

### Example in R

To gain more insight into the variance of the holdout metrics (e.g., AUCs), you can look up the cross-validation models and inspect their validation metrics. Here’s an R code example showing the two approaches:

```
library(h2o)
h2o.init()
df <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
df$CAPSULE <- as.factor(df$CAPSULE)
model_fit <- h2o.gbm(3:8, 2, df, nfolds = 5, seed = 1234)

# Default: AUC of holdout predictions
h2o.auc(model_fit, xval = TRUE)

# Optional: Average the holdout AUCs
cvAUCs <- sapply(sapply(model_fit@model$cross_validation_models, `[[`, "name"), function(x) { h2o.auc(h2o.getModel(x), valid = TRUE) })
print(cvAUCs)
mean(cvAUCs)
```

## Using Cross-Validated Predictions

With cross-validated model building, H2O builds N+1 models: N cross-validated models and 1 overarching model over all of the training data.

Each cv-model produces a prediction frame pertaining to its fold. It can be saved and probed from the various clients if the `keep_cross_validation_predictions` parameter is set in the model constructor.

These holdout predictions have some interesting properties. First, they have names like:

```
prediction_GBM_model_1452035702801_1_cv_1
```

and they contain, unsurprisingly, predictions for the data held out in the fold. They also have the same number of rows as the entire input training frame, with `0`s filled in for all rows that are not in the holdout.

Let's look at an example.

Here is a snippet of a three-class classification dataset (the last column is the response column), with a 3-fold identification column appended to the end:

| sepal_len | sepal_wid | petal_len | petal_wid | class | foldId |
|-----------|-----------|-----------|-----------|---------|--------|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0 |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0 |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2 |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa | 2 |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1 |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa | 1 |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa | 0 |
| 4.4 | 2.9 | 1.4 | 0.4 | setosa | 1 |

Each cross-validated model produces a prediction frame:

```
prediction_GBM_model_1452035702801_1_cv_1
prediction_GBM_model_1452035702801_1_cv_2
prediction_GBM_model_1452035702801_1_cv_3
```

and each one has the following shape (for example, the first one):

```
prediction_GBM_model_1452035702801_1_cv_1
```

| prediction | setosa | versicolor | virginica |
|------------|--------|------------|-----------|
| 1 | 0.0232 | 0.7321 | 0.2447 |
| 2 | 0.0543 | 0.2343 | 0.7114 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0.8921 | 0.0321 | 0.0758 |
| 0 | 0 | 0 | 0 |

The rows that were used for training this cv-model receive a prediction of `0` (more on this below) as well as `0` for all class probabilities. Each of these holdout predictions has the same number of rows as the input frame.

Jun 23, 2017 @angela0xdata PUBDEV-4213: Updates to markdown syntax (#1307)
1878 ## Combining holdout predictions
Jan 8, 2016 @spennihana merge nfold docs together
1879
1880 The frame of cross-validated predictions is simply the superposition of the individual predictions. [Here's an example from R](https://0xdata.atlassian.net/browse/PUBDEV-2236):
1881
```
library(h2o)
h2o.init()

# H2O cross-validated K-means example
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)
fit <- h2o.kmeans(training_frame = prostate.hex,
                  k = 10,
                  x = c("AGE", "RACE", "VOL", "GLEASON"),
                  nfolds = 5,  # To specify folds directly, use the "fold_column" arg instead
                  keep_cross_validation_predictions = TRUE)

# This is where the CV preds are stored:
fit@model$cross_validation_predictions$name

# Compress the CV preds into a single H2OFrame:
# Each fold's preds are stored in an N x 1 column, where the rows belonging
# to the non-active folds are set to zero, so we collapse them into a single
# one-column H2OFrame (easier to digest).
nfolds <- fit@parameters$nfolds
predlist <- sapply(1:nfolds, function(v) h2o.getFrame(fit@model$cross_validation_predictions[[v]]$name)$predict, simplify = FALSE)
cvpred_sparse <- h2o.cbind(predlist)  # N x V frame of zeros, except in column v on the rows belonging to fold v
pred <- apply(cvpred_sparse, 1, sum)  # The cross-validated predicted cluster IDs for each of the 1:N observations
```
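The cbind-and-sum trick works because each per-fold prediction column is zero on the rows its model was trained on, so the columns are disjoint and a row-wise sum superimposes them into one full-length vector of cross-validated predictions. A minimal numpy sketch with made-up values (not H2O output):

```python
import numpy as np

# Per-fold holdout prediction columns for six rows (made-up values).
# Each column is zero on the rows used to train that fold's model.
cv_1 = np.array([0.0, 0.0, 0.7, 0.0, 0.2, 0.0])
cv_2 = np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.4])
cv_3 = np.array([0.0, 0.3, 0.0, 0.8, 0.0, 0.0])

# Row-wise sum: every row picks up exactly its own holdout prediction.
cv_preds = cv_1 + cv_2 + cv_3
print(cv_preds.tolist())  # [0.9, 0.3, 0.7, 0.8, 0.2, 0.4]
```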

This can be extended to other family types as well (binomial, multinomial, regression):

```
# Helper function: compress the per-fold CV prediction frames of a model
# into a single one-column H2OFrame of cross-validated predictions
.compress_to_cvpreds <- function(h2omodel, family) {
  V <- h2omodel@allparameters$nfolds
  if (family %in% c("bernoulli", "binomial")) {
    # For binomial models, take the third column (the p1 probability)
    predlist <- sapply(1:V, function(v) h2o.getFrame(h2omodel@model$cross_validation_predictions[[v]]$name)[,3], simplify = FALSE)
  } else {
    predlist <- sapply(1:V, function(v) h2o.getFrame(h2omodel@model$cross_validation_predictions[[v]]$name)$predict, simplify = FALSE)
  }
  cvpred_sparse <- h2o.cbind(predlist)  # N x V frame of zeros, except in column v on the rows belonging to fold v
  cvpred_col <- apply(cvpred_sparse, 1, sum)
  return(cvpred_col)
}


# Extract cross-validated predicted values (in the order of the original rows)
h2o.cvpreds <- function(object) {

  # Infer the family from the model object's class
  if (class(object) == "H2OBinomialModel") family <- "binomial"
  if (class(object) == "H2OMultinomialModel") family <- "multinomial"
  if (class(object) == "H2ORegressionModel") family <- "gaussian"

  cvpreds <- .compress_to_cvpreds(h2omodel = object, family = family)
  return(cvpreds)
}
```