[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor (scikit-learn#6667)

* feature: add initial node_value method

* testing code for node_impurity and node_value

This code runs into 'Bus Error: 10' at node_value final assignment.

* fix: node_value now correctly calculates weighted median for sorted data.

Still need to change the code to work with unsorted data.

* fix: node_value now correctly calculates median regardless of initial order

* fix: correct bug in calculating median when taking midpoint is necessary

* feature: add initial version of children_impurity

* feature: refactor median calculation into one function

* fix: fix use of DOUBLE_t vs double

* feature: move helper functions to _utils.pyx, fix mismatched pointer type

* fix: fix some bugs in children_impurity method

* push a debug version to try to solve segfault

* push latest changes, segfault probably happening bc of something in _utils.pyx

* fix: fix segfault in median calculation and remove excessive logging

* chore: revert some misc spacing changes I accidentally made

* chore: one last spacing fix in _splitter.pyx

* feature: don't calculate weighted median if no weights are passed in

* remove extraneous logging statement

* fix: fix children impurity calculation

* fix: fix bug with children impurity not being initially set to 0

* fix: hacky fix for a float accuracy error

* fix: incorrect type cast in median array generation for node_impurity

* slightly tweak node_impurity function

* fix: be more explicit with casts

* feature: revert cosmetic changes and free temporary arrays

* fix: only free weight array in median calculation if it was created

* style: remove extraneous newline / trigger CI build

* style: remove extraneous 0 from range

* feature: save sorts within a node to speed it up

* fix: move parts of dealloc to regression criterion

* chore: add comment to splitter to try to force recythonizing

* chore: add comment to _tree.pyx to try to force recythonizing

* chore: add empty comment to gradient boosting to force recythonizing

* fix: fix bug in weighted median

* try moving sorted values to a class variable

* feature: refactor criterion to sort once initially, then draw all samples from this sorted data

* style: remove extraneous parens from if condition

* implement median-heap method for calculating impurity (see the sketch after the change summary below)

* style: remove extra line

* style: fix inadvertent cosmetic changes; i'll address some of these in a separate PR

* feature: change minmaxheap to internally use sorted arrays

* refactored MAE and push to share work

* fix errors wrt median insertion case

* spurious comment to force recythonization

* general code cleanup

* fix typo in _tree.pyx

* removed some extraneous comments

* [ci skip] remove earlier microchanges

* [ci skip] remove change to priorityheap

* [ci skip] fix indentation

* [ci skip] fix class-specific issues with heaps

* [ci skip] restore a newline

* [ci skip] remove microchange to refactor later

* reword a comment

* remove heapify methods from queue class

* doc: update docstrings for dt, rf, and et regressors

* doc: revert incorrect spacing to shorten diff

* convert get_median to return value directly

* [ci skip] remove accidental whitespace

* remove extraneous unpacking of values

* style: misc changes to identifiers

* add docstrings and more informative variable identifiers

* [ci skip] add trivial comments to recythonize

* remove trivial comments for recythonizing

* force recythonization for real this time

* remove trivial comments for recythonization

* rfc: harmonize arg. names and remove unnecessary checks

* convert allocations to safe_realloc

* fix bug in weighted case and add tests for MAE

* change all medians to DOUBLE_t

* add logic to allocate mediancalculators once, and reset otherwise

* misc style fixes

* modify cinit of regressioncriterion to take n_samples

* add MAE formula and force rebuild bc. travis was down

* add criterion parameter to gradient boosting and add forest tests

* add entries to what's new
nelson-liu authored and olologin committed Aug 24, 2016
1 parent 635be02 commit da560cc
Showing 9 changed files with 794 additions and 20 deletions.
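A note for readers of the log above: the new criterion scores a node by the mean absolute error around the node's weighted median, roughly

    MAE(node) = sum_i w_i * |y_i - median_w(y)| / sum_i w_i

(see the docstrings in the diffs below for the exact wording). The "median-heap method" mentioned in the commits is the classic two-heap running-median technique; the sketch below is a minimal pure-Python illustration of that idea for the unweighted case. It is a sketch only: the production Cython code keeps sorted arrays inside the helpers in _utils.pyx, and none of these names match the real internals.

import heapq

class RunningMedian:
    """Running median of a stream via two heaps (illustrative, unweighted)."""

    def __init__(self):
        self.low = []   # max-heap of the smaller half (values stored negated)
        self.high = []  # min-heap of the larger half

    def push(self, x):
        # Route the new value to the half it belongs to.
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so the halves differ in size by at most one.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low) + 1:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) == len(self.high):
            return 0.5 * (-self.low[0] + self.high[0])
        return -self.low[0] if len(self.low) > len(self.high) else self.high[0]

values = [3.0, 1.0, 4.0, 1.0, 5.0]
rm = RunningMedian()
for v in values:
    rm.push(v)
med = rm.median()                                      # 3.0
mae = sum(abs(v - med) for v in values) / len(values)  # node impurity: 1.4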
15 changes: 15 additions & 0 deletions doc/whats_new.rst
@@ -117,6 +117,14 @@ New features
and Harabaz score to evaluate the resulting clustering of a set of points.
By `Arnaud Fouchet`_ and `Thierry Guillemot`_.

+   - Added a new splitting criterion for :class:`tree.DecisionTreeRegressor`,
+     the mean absolute error. This criterion can also be used in
+     :class:`ensemble.ExtraTreesRegressor`,
+     :class:`ensemble.RandomForestRegressor`, and the gradient boosting
+     estimators. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.

Enhancements
............

@@ -142,6 +150,11 @@ Enhancements
provided as a percentage of the training samples. By
`yelite`_ and `Arnaud Joly`_.

+   - Gradient boosting estimators accept the parameter ``criterion`` to specify
+     the splitting criterion used in the built decision trees. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.

- Codebase does not contain C/C++ cython generated files: they are
generated during build. Distribution packages will still contain generated
C/C++ files. By `Arthur Mensch`_.
@@ -4286,3 +4299,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Sebastian Säger: https://github.com/ssaeger

.. _YenChen Lin: https://github.com/yenchenlin

+ .. _Nelson Liu: https://github.com/nelson-liu
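For orientation before the code diffs: the new criterion is selected by name when constructing the estimator. A minimal usage sketch with made-up toy data (note that later scikit-learn releases renamed this spelling to "absolute_error"; "mae" is the name introduced here):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.5, 2.0, 3.5])

# "mae" chooses splits by minimizing L1 loss; each leaf predicts the
# median of its training samples rather than the mean.
reg = DecisionTreeRegressor(criterion="mae", random_state=0).fit(X, y)
print(reg.predict([[1.5]]))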
12 changes: 8 additions & 4 deletions sklearn/ensemble/forest.py
@@ -948,8 +948,10 @@ class RandomForestRegressor(ForestRegressor):
The number of trees in the forest.
criterion : string, optional (default="mse")
- The function to measure the quality of a split. The only supported
- criterion is "mse" for the mean squared error.
+ The function to measure the quality of a split. Supported criteria
+ are "mse" for the mean squared error, which is equal to variance
+ reduction as feature selection criterion, and "mae" for the mean
+ absolute error.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -1300,8 +1302,10 @@ class ExtraTreesRegressor(ForestRegressor):
The number of trees in the forest.
criterion : string, optional (default="mse")
- The function to measure the quality of a split. The only supported
- criterion is "mse" for the mean squared error.
+ The function to measure the quality of a split. Supported criteria
+ are "mse" for the mean squared error, which is equal to variance
+ reduction as feature selection criterion, and "mae" for the mean
+ absolute error.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
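The two forest regressors documented above accept the same string and simply forward it to their underlying trees. A small sketch on synthetic data (parameters chosen only for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

rf = RandomForestRegressor(n_estimators=10, criterion="mae",
                           random_state=0).fit(X, y)
et = ExtraTreesRegressor(n_estimators=10, criterion="mae",
                         random_state=0).fit(X, y)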
31 changes: 24 additions & 7 deletions sklearn/ensemble/gradient_boosting.py
@@ -720,15 +720,16 @@ class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
"""Abstract base class for Gradient Boosting. """

@abstractmethod
- def __init__(self, loss, learning_rate, n_estimators, min_samples_split,
-              min_samples_leaf, min_weight_fraction_leaf,
+ def __init__(self, loss, learning_rate, n_estimators, criterion,
+              min_samples_split, min_samples_leaf, min_weight_fraction_leaf,
max_depth, init, subsample, max_features,
random_state, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.loss = loss
+ self.criterion = criterion
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
@@ -762,7 +763,7 @@ def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,

# induce regression tree on residuals
tree = DecisionTreeRegressor(
- criterion='friedman_mse',
+ criterion=self.criterion,
splitter='best',
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.
+ criterion : string, optional (default="friedman_mse")
+     The function to measure the quality of a split. Supported criteria
+     are "friedman_mse" for the mean squared error with improvement
+     score by Friedman, "mse" for mean squared error, and "mae" for
+     the mean absolute error. The default value of "friedman_mse" is
+     generally the best as it can provide a better approximation in
+     some cases.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
@@ -1426,7 +1435,7 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
_SUPPORTED_LOSS = ('deviance', 'exponential')

def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
- subsample=1.0, min_samples_split=2,
+ subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, verbose=0,
@@ -1435,7 +1444,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,

super(GradientBoostingClassifier, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
- min_samples_split=min_samples_split,
+ criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
@@ -1643,6 +1652,14 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.
+ criterion : string, optional (default="friedman_mse")
+     The function to measure the quality of a split. Supported criteria
+     are "friedman_mse" for the mean squared error with improvement
+     score by Friedman, "mse" for mean squared error, and "mae" for
+     the mean absolute error. The default value of "friedman_mse" is
+     generally the best as it can provide a better approximation in
+     some cases.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
@@ -1772,15 +1789,15 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
_SUPPORTED_LOSS = ('ls', 'lad', 'huber', 'quantile')

def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
- subsample=1.0, min_samples_split=2,
+ subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

super(GradientBoostingRegressor, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
- min_samples_split=min_samples_split,
+ criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
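Because the gradient boosting estimators now forward ``criterion`` to the base DecisionTreeRegressor (see _fit_stage above), the split criterion can be varied independently of the boosting loss. A minimal sketch on synthetic data ('ls' was the squared-error loss name at the time of this commit):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

# `loss` defines the gradients being fit at each stage; `criterion` only
# scores candidate splits inside each base tree.
gbr = GradientBoostingRegressor(loss='ls', criterion='mae',
                                random_state=0).fit(X, y)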
4 changes: 2 additions & 2 deletions sklearn/ensemble/tests/test_forest.py
@@ -159,7 +159,7 @@ def check_boston_criterion(name, criterion):


def test_boston():
for name, criterion in product(FOREST_REGRESSORS, ("mse", )):
for name, criterion in product(FOREST_REGRESSORS, ("mse", "mae", "friedman_mse")):
yield check_boston_criterion, name, criterion


@@ -244,7 +244,7 @@ def test_importances():
for name, criterion in product(FOREST_CLASSIFIERS, ["gini", "entropy"]):
yield check_importances, name, criterion, X, y

for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse"]):
for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse", "mae"]):
yield check_importances, name, criterion, X, y

