[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor (scikit-learn#6667)

* feature: add initial node_value method

* testing code for node_impurity and node_value

This code runs into 'Bus Error: 10' at node_value final assignment.

* fix: node_value now correctly calculates weighted median for sorted data.

Still need to change the code to work with unsorted data.

* fix: node_value now correctly calculates median regardless of initial order

* fix: correct bug in calculating median when taking midpoint is necessary

* feature: add initial version of children_impurity

* feature: refactor median calculation into one function

* fix: fix use of DOUBLE_t vs double

* feature: move helper functions to _utils.pyx, fix mismatched pointer type

* fix: fix some bugs in children_impurity method

* push a debug version to try to solve segfault

* push latest changes, segfault probably happening bc of something in _utils.pyx

* fix: fix segfault in median calculation and remove excessive logging

* chore: revert some misc spacing changes I accidentally made

* chore: one last spacing fix in _splitter.pyx

* feature: don't calculate weighted median if no weights are passed in

* remove extraneous logging statement

* fix: fix children impurity calculation

* fix: fix bug with children impurity not being initially set to 0

* fix: hacky fix for a float accuracy error

* fix: incorrect type cast in median array generation for node_impurity

* slightly tweak node_impurity function

* fix: be more explicit with casts

* feature: revert cosmetic changes and free temporary arrays

* fix: only free weight array in median calculation if it was created

* style: remove extraneous newline / trigger CI build

* style: remove extraneous 0 from range

* feature: save sorts within a node to speed it up

* fix: move parts of dealloc to regression criterion

* chore: add comment to splitter to try to force recythonizing

* chore: add comment to _tree.pyx to try to force recythonizing

* chore: add empty comment to gradient boosting to force recythonizing

* fix: fix bug in weighted median

* try moving sorted values to a class variable

* feature: refactor criterion to sort once initially, then draw all samples from this sorted data

* style: remove extraneous parens from if condition

* implement median-heap method for calculating impurity (see the sketch after the change summary below)

* style: remove extra line

* style: fix inadvertent cosmetic changes; i'll address some of these in a separate PR

* feature: change minmaxheap to internally use sorted arrays

* refactored MAE and push to share work

* fix errors wrt median insertion case

* spurious comment to force recythonization

* general code cleanup

* fix typo in _tree.pyx

* removed some extraneous comments

* [ci skip] remove earlier microchanges

* [ci skip] remove change to priorityheap

* [ci skip] fix indentation

* [ci skip] fix class-specific issues with heaps

* [ci skip] restore a newline

* [ci skip] remove microchange to refactor later

* reword a comment

* remove heapify methods from queue class

* doc: update docstrings for dt, rf, and et regressors

* doc: revert incorrect spacing to shorten diff

* convert get_median to return value directly

* [ci skip] remove accidental whitespace

* remove extraneous unpacking of values

* style: misc changes to identifiers

* add docstrings and more informative variable identifiers

* [ci skip] add trivial comments to recythonize

* remove trivial comments for recythonizing

* force recythonization for real this time

* remove trivial comments for recythonization

* rfc: harmonize arg. names and remove unnecessary checks

* convert allocations to safe_realloc

* fix bug in weighted case and add tests for MAE

* change all medians to DOUBLE_t

* add logic to allocate mediancalculators once, and reset otherwise

* misc style fixes

* modify cinit of regressioncriterion to take n_samples

* add MAE formula and force rebuild bc. travis was down

* add criterion parameter to gradient boosting and add forest tests

* add entries to what's new
nelson-liu authored and olologin committed Aug 24, 2016
1 parent 635be02 commit da560cc
Showing 9 changed files with 794 additions and 20 deletions.
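A note for readers of the log above: the new criterion scores a node by the mean absolute error around the node's weighted median, roughly

    MAE(node) = sum_i w_i * |y_i - median_w(y)| / sum_i w_i

(see the docstrings in the diffs below for the exact wording). The "median-heap method" mentioned in the commits is the classic two-heap running-median technique; the sketch below is a minimal pure-Python illustration of that idea for the unweighted case. It is a sketch only: the production Cython code keeps sorted arrays inside the helpers in _utils.pyx, and none of these names match the real internals.

import heapq

class RunningMedian:
    """Running median of a stream via two heaps (illustrative, unweighted)."""

    def __init__(self):
        self.low = []   # max-heap of the smaller half (values stored negated)
        self.high = []  # min-heap of the larger half

    def push(self, x):
        # Route the new value to the half it belongs to.
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so the halves differ in size by at most one.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low) + 1:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) == len(self.high):
            return 0.5 * (-self.low[0] + self.high[0])
        return -self.low[0] if len(self.low) > len(self.high) else self.high[0]

values = [3.0, 1.0, 4.0, 1.0, 5.0]
rm = RunningMedian()
for v in values:
    rm.push(v)
med = rm.median()                                      # 3.0
mae = sum(abs(v - med) for v in values) / len(values)  # node impurity: 1.4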
15 changes: 15 additions & 0 deletions doc/whats_new.rst
@@ -117,6 +117,14 @@ New features
and Harabaz score to evaluate the resulting clustering of a set of points.
By `Arnaud Fouchet`_ and `Thierry Guillemot`_.

+   - Added a new splitting criterion for :class:`tree.DecisionTreeRegressor`,
+     the mean absolute error. This criterion can also be used in
+     :class:`ensemble.ExtraTreesRegressor`,
+     :class:`ensemble.RandomForestRegressor`, and the gradient boosting
+     estimators. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.

Enhancements
............

@@ -142,6 +150,11 @@ Enhancements
provided as a percentage of the training samples. By
`yelite`_ and `Arnaud Joly`_.

+   - Gradient boosting estimators accept the parameter ``criterion`` to specify
+     the splitting criterion used in the built decision trees. (`#6667
+     <https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
+     Liu`_.

- Codebase does not contain C/C++ cython generated files: they are
generated during build. Distribution packages will still contain generated
C/C++ files. By `Arthur Mensch`_.
@@ -4286,3 +4299,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Sebastian Säger: https://github.com/ssaeger

.. _YenChen Lin: https://github.com/yenchenlin

+ .. _Nelson Liu: https://github.com/nelson-liu
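For orientation before the code diffs: the new criterion is selected by name when constructing the estimator. A minimal usage sketch with made-up toy data (note that later scikit-learn releases renamed this spelling to "absolute_error"; "mae" is the name introduced here):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.5, 2.0, 3.5])

# "mae" chooses splits by minimizing L1 loss; each leaf predicts the
# median of its training samples rather than the mean.
reg = DecisionTreeRegressor(criterion="mae", random_state=0).fit(X, y)
print(reg.predict([[1.5]]))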
12 changes: 8 additions & 4 deletions sklearn/ensemble/forest.py
@@ -948,8 +948,10 @@ class RandomForestRegressor(ForestRegressor):
The number of trees in the forest.
criterion : string, optional (default="mse")
- The function to measure the quality of a split. The only supported
- criterion is "mse" for the mean squared error.
+ The function to measure the quality of a split. Supported criteria
+ are "mse" for the mean squared error, which is equal to variance
+ reduction as feature selection criterion, and "mae" for the mean
+ absolute error.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -1300,8 +1302,10 @@ class ExtraTreesRegressor(ForestRegressor):
The number of trees in the forest.
criterion : string, optional (default="mse")
- The function to measure the quality of a split. The only supported
- criterion is "mse" for the mean squared error.
+ The function to measure the quality of a split. Supported criteria
+ are "mse" for the mean squared error, which is equal to variance
+ reduction as feature selection criterion, and "mae" for the mean
+ absolute error.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
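The two forest regressors documented above accept the same string and simply forward it to their underlying trees. A small sketch on synthetic data (parameters chosen only for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

rf = RandomForestRegressor(n_estimators=10, criterion="mae",
                           random_state=0).fit(X, y)
et = ExtraTreesRegressor(n_estimators=10, criterion="mae",
                         random_state=0).fit(X, y)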
31 changes: 24 additions & 7 deletions sklearn/ensemble/gradient_boosting.py
@@ -720,15 +720,16 @@ class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
"""Abstract base class for Gradient Boosting. """

@abstractmethod
- def __init__(self, loss, learning_rate, n_estimators, min_samples_split,
-              min_samples_leaf, min_weight_fraction_leaf,
+ def __init__(self, loss, learning_rate, n_estimators, criterion,
+              min_samples_split, min_samples_leaf, min_weight_fraction_leaf,
max_depth, init, subsample, max_features,
random_state, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.loss = loss
+ self.criterion = criterion
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
@@ -762,7 +763,7 @@ def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,

# induce regression tree on residuals
tree = DecisionTreeRegressor(
- criterion='friedman_mse',
+ criterion=self.criterion,
splitter='best',
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.
+ criterion : string, optional (default="friedman_mse")
+     The function to measure the quality of a split. Supported criteria
+     are "friedman_mse" for the mean squared error with improvement
+     score by Friedman, "mse" for mean squared error, and "mae" for
+     the mean absolute error. The default value of "friedman_mse" is
+     generally the best as it can provide a better approximation in
+     some cases.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
@@ -1426,7 +1435,7 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
_SUPPORTED_LOSS = ('deviance', 'exponential')

def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
- subsample=1.0, min_samples_split=2,
+ subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, verbose=0,
@@ -1435,7 +1444,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,

super(GradientBoostingClassifier, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
- min_samples_split=min_samples_split,
+ criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
@@ -1643,6 +1652,14 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.
+ criterion : string, optional (default="friedman_mse")
+     The function to measure the quality of a split. Supported criteria
+     are "friedman_mse" for the mean squared error with improvement
+     score by Friedman, "mse" for mean squared error, and "mae" for
+     the mean absolute error. The default value of "friedman_mse" is
+     generally the best as it can provide a better approximation in
+     some cases.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
@@ -1772,15 +1789,15 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
_SUPPORTED_LOSS = ('ls', 'lad', 'huber', 'quantile')

def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
- subsample=1.0, min_samples_split=2,
+ subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

super(GradientBoostingRegressor, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
- min_samples_split=min_samples_split,
+ criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
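Because the gradient boosting estimators now forward ``criterion`` to the base DecisionTreeRegressor (see _fit_stage above), the split criterion can be varied independently of the boosting loss. A minimal sketch on synthetic data ('ls' was the squared-error loss name at the time of this commit):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

# `loss` defines the gradients being fit at each stage; `criterion` only
# scores candidate splits inside each base tree.
gbr = GradientBoostingRegressor(loss='ls', criterion='mae',
                                random_state=0).fit(X, y)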
4 changes: 2 additions & 2 deletions sklearn/ensemble/tests/test_forest.py
@@ -159,7 +159,7 @@ def check_boston_criterion(name, criterion):


def test_boston():
for name, criterion in product(FOREST_REGRESSORS, ("mse", )):
for name, criterion in product(FOREST_REGRESSORS, ("mse", "mae", "friedman_mse")):
yield check_boston_criterion, name, criterion


@@ -244,7 +244,7 @@ def test_importances():
for name, criterion in product(FOREST_CLASSIFIERS, ["gini", "entropy"]):
yield check_importances, name, criterion, X, y

for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse"]):
for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse", "mae"]):
yield check_importances, name, criterion, X, y

