[MRG] Add support for multiclass classification (#49)
Adds CategoricalCrossEntropy loss to support multiclass classification.

K trees per iteration are built instead of just one.
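For illustration, a hypothetical usage sketch (the estimator name GradientBoostingClassifier, its import path, and all parameter names other than loss are assumptions, not shown in this commit): fitting on a 3-class dataset makes the model grow 3 trees per boosting iteration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from pygbm import GradientBoostingClassifier  # assumed import path

X, y = load_iris(return_X_y=True)  # 3 classes -> 3 trees per iteration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss='categorical_crossentropy' maps to the CategoricalCrossEntropy loss
# added by this commit; the other argument names are assumptions.
clf = GradientBoostingClassifier(loss='categorical_crossentropy', max_iter=20)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test).shape)  # (n_test_samples, 3)
print(clf.predict(X_test)[:5])          # predictions in the original label space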

* First draft for pluggable losses

* Added basic test for losses

* Addressed comments

* Removed use of LeastSquares loss in tests/test_predictor.py

* WIP, doesn't work

* fixed gradient and hessian of ls

* fixed y_pred shape

* L2 gradient and hessian again

* removed unnecessary conversions

* Partially fixed logloss

* Added prediction for classifiers, quick hack

* Strict comparison to min_gain_to_split

* lots of debug stuff (will be cleaned up)

* updated grower

* set lower decimal in tests

* Added test

* Changed condition to <= 0 and updated note

* print stuff, to be removed

* debugging, ignore

* n_bins is now feature dependent

* addressed comments, changed default max_bins to 256, tests aren't robust now :(

* Fixed n_bins = n_thresholds + 1... Also corrected some tests

* Added test for n_bins_per_feature

* removed optional for min_samples_leaf

* set min_samples_leaf to 1 if None

* Probably fixed overflow issue... Need to update stats BEFORE any 'continue' of course

* higgs boson benchmark now uses classifiers

* Parametrized test_boston_dataset after bugfix #43

* None is now invalid for min_samples_leaf

* removed trailing whitespace

* set default to 20 in SplittingContext

* Used sklearn gradient computation to avoid overflow

* Consolidated compare_lightgbm test

* Some cleaning

* pep8

* Added subclasses Regressor and Classifier

* added check for losses

* predict() is now custom in child classes

* Higgs Boson benchmark now uses classifiers and accuracy

* changed log_loss to logistic

* Added test for losses

* should fix test

* cosmetics

* Use get_gradients_and_hessians

* Comment

* Put back ROC AUC

* comment

* Parallelized gradients and hessians updates

* Comment

* First draft, looks like it's working OK.

Needs lots of cleaning and testing

* Removed get_gradients methods and added helper in tests

* some pep8

* Added sanity check for multinomial loss

* Put predict_proba() into loss functions

* Fixed predict_binned

* Updated gradient and hessians before loop over k

* Minor changes

* Fixed verbose printing

* switch to multinomial if problem is multiclass and loss=logistic

* parallelized multinomial loss

* Temporary fix for numba update issue

* cosmetics and comments

* Fixed __call__ and added gradient test

* minor modif

* Added auto loss for classification

* fix for auto loss

* some comments

* Fixed plotting

* Updated notes

* Added custom logsumexp jitted function

* Apply shrinkage at all iterations including first one

* renamed all_gradients and all_hessians

* Addressed comments

* Classes are now label-encoded to support str targets and other inputs

* Changed name _validate_y to _encode_y

* Addressed comments

* Renaming

* y_pred now 2D array

* raw_predict now returns 2D array

* renamed y_pred into raw_predictions

* pep8 and comments
NicolasHug committed Dec 6, 2018
1 parent c5ebdfc commit 8284e12
Showing 12 changed files with 531 additions and 147 deletions.
8 changes: 4 additions & 4 deletions pygbm/binning.py
@@ -133,11 +133,11 @@ class BinMapper(BaseEstimator, TransformerMixin):
----------
max_bins : int, optional (default=256)
The maximum number of bins to use. If for a given feature the number of
unique values is less than ``max_bins``, then those unique values will be
used instead of the quantiles.
unique values is less than ``max_bins``, then those unique values
will be used instead of the quantiles.
subsample : int, optional (default=1e5)
If ``n_samples > subsample``, then ``sub_samples`` samples will be randomly
choosen to compute the quantiles.
If ``n_samples > subsample``, then ``sub_samples`` samples will be
randomly choosen to compute the quantiles.
TODO: accept None?
random_state: int or numpy.random.RandomState or None, \
optional (default=None)
249 changes: 176 additions & 73 deletions pygbm/gradient_boosting.py

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions pygbm/grower.py
@@ -158,7 +158,7 @@ class TreeGrower:
The shrinkage parameter to apply to the leaves values, also known as
learning rate.
"""
def __init__(self, features_data, all_gradients, all_hessians,
def __init__(self, features_data, gradients, hessians,
max_leaf_nodes=None, max_depth=None, min_samples_leaf=20,
min_gain_to_split=0., max_bins=256, n_bins_per_feature=None,
l2_regularization=0., min_hessian_to_split=1e-3,
@@ -178,7 +178,7 @@ def __init__(self, features_data, all_gradients, all_hessians,

self.splitting_context = SplittingContext(
features_data.shape[1], features_data, max_bins,
n_bins_per_feature, all_gradients, all_hessians,
n_bins_per_feature, gradients, hessians,
l2_regularization, min_hessian_to_split, min_samples_leaf,
min_gain_to_split)
self.max_leaf_nodes = max_leaf_nodes
@@ -238,13 +238,13 @@ def _intilialize_root(self):
n_samples = self.features_data.shape[0]
depth = 0
if self.splitting_context.constant_hessian:
hessian = self.splitting_context.all_hessians[0] * n_samples
hessian = self.splitting_context.hessians[0] * n_samples
else:
hessian = self.splitting_context.all_hessians.sum()
hessian = self.splitting_context.hessians.sum()
self.root = TreeNode(
depth=depth,
sample_indices=self.splitting_context.partition.view(),
sum_gradients=self.splitting_context.all_gradients.sum(),
sum_gradients=self.splitting_context.gradients.sum(),
sum_hessians=hessian
)
if (self.max_leaf_nodes is not None and self.max_leaf_nodes == 1):
145 changes: 121 additions & 24 deletions pygbm/loss.py
@@ -1,11 +1,27 @@
from abc import ABC
from abc import ABC, abstractmethod

from scipy.special import expit
from scipy.special import expit, logsumexp
import numpy as np
from numba import njit, prange
import numba


# TODO: Write proper docstrings

@njit
def _logsumexp(a):
# custom logsumexp function with numerical stability, based on scipy's
# logsumexp which is unfortunately not supported (neither is
# np.logaddexp.reduce). Only supports 1d arrays.

a_max = np.amax(a)
if not np.isfinite(a_max):
a_max = 0

s = np.sum(np.exp(a - a_max))
return np.log(s) + a_max


@njit
def _get_threads_chunks(total_size):
# Divide [0, total_size - 1] into n_threads contiguous regions, and
@@ -29,80 +45,161 @@ def _expit(x):

class Loss(ABC):

def init_gradients_and_hessians(self, n_samples):
gradients = np.empty(shape=n_samples, dtype=np.float32)
def init_gradients_and_hessians(self, n_samples, n_trees_per_iteration):
shape = n_samples * n_trees_per_iteration
gradients = np.empty(shape=shape, dtype=np.float32)
if self.hessian_is_constant:
hessians = np.ones(shape=1, dtype=np.float32)
else:
hessians = np.empty(shape=n_samples, dtype=np.float32)
hessians = np.empty(shape=shape, dtype=np.float32)

return gradients, hessians

@abstractmethod
def update_gradients_and_hessians(self, gradients, hessians, y_true,
raw_predictions):
pass


class LeastSquares(Loss):

hessian_is_constant = True

def __call__(self, y_true, y_pred, average=True):
loss = np.power(y_true - y_pred, 2)
def __call__(self, y_true, raw_predictions, average=True):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
loss = np.power(y_true - raw_predictions, 2)
return loss.mean() if average else loss

def inverse_link_function(self, raw_predictions):
return raw_predictions

def update_gradients_and_hessians(self, gradients, hessians, y_true,
y_pred):
return _update_gradients_least_squares(gradients, y_true, y_pred)
raw_predictions):
return _update_gradients_least_squares(gradients, y_true,
raw_predictions)


@njit(parallel=True, fastmath=True)
def _update_gradients_least_squares(gradients, y_true, y_pred):
n_samples = gradients.shape[0]
def _update_gradients_least_squares(gradients, y_true, raw_predictions):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
n_samples = raw_predictions.shape[0]
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
# Note: a more correct exp is 2 * (y_pred - y_true) but since we
# use 1 for the constant hessian value (and not 2) this is
# strictly equivalent for the leaves values.
gradients[i] = y_pred[i] - y_true[i]
# Note: a more correct exp is 2 * (raw_predictions - y_true) but
# since we use 1 for the constant hessian value (and not 2) this
# is strictly equivalent for the leaves values.
gradients[i] = raw_predictions[i] - y_true[i]


class BinaryCrossEntropy(Loss):

hessian_is_constant = False
inverse_link_function = staticmethod(expit)

def __call__(self, y_true, y_pred, average=True):
def __call__(self, y_true, raw_predictions, average=True):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# logaddexp(0, x) = log(1 + exp(x))
loss = np.logaddexp(0, y_pred) - y_true * y_pred
loss = np.logaddexp(0, raw_predictions) - y_true * raw_predictions
return loss.mean() if average else loss

def update_gradients_and_hessians(self, gradients, hessians, y_true,
y_pred):
return _update_gradients_hessians_logistic(gradients, hessians,
y_true, y_pred)
raw_predictions):
return _update_gradients_hessians_binary_crossentropy(
gradients, hessians, y_true, raw_predictions)

def predict_proba(self, raw_predictions):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
proba = np.empty((raw_predictions.shape[0], 2), dtype=np.float32)
proba[:, 1] = expit(raw_predictions)
proba[:, 0] = 1 - proba[:, 1]
return proba


@njit(parallel=True, fastmath=True)
def _update_gradients_hessians_logistic(gradients, hessians, y_true, y_pred):
def _update_gradients_hessians_binary_crossentropy(gradients, hessians,
y_true, raw_predictions):
# Note: using LightGBM version (first mapping {0, 1} into {-1, 1})
# will cause overflow issues in the exponential as we're using float32
# precision.

n_samples = gradients.shape[0]
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
n_samples = raw_predictions.shape[0]
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
gradients[i] = _expit(y_pred[i]) - y_true[i]
gradients[i] = _expit(raw_predictions[i]) - y_true[i]
gradient_abs = np.abs(gradients[i])
hessians[i] = gradient_abs * (1. - gradient_abs)


class CategoricalCrossEntropy(Loss):

hessian_is_constant = False

def __call__(self, y_true, raw_predictions, average=True):
one_hot_true = np.zeros_like(raw_predictions)
n_trees_per_iteration = raw_predictions.shape[1]
for k in range(n_trees_per_iteration):
one_hot_true[:, k] = (y_true == k)

return (logsumexp(raw_predictions, axis=1) -
(one_hot_true * raw_predictions).sum(axis=1))

def update_gradients_and_hessians(self, gradients, hessians, y_true,
raw_predictions):
return _update_gradients_hessians_categorical_crossentropy(
gradients, hessians, y_true, raw_predictions)

def predict_proba(self, raw_predictions):
# TODO: This could be done in parallel
# compute softmax (using exp(log(softmax)))
return np.exp(raw_predictions -
logsumexp(raw_predictions, axis=1)[:, np.newaxis])


@njit(parallel=True)
def _update_gradients_hessians_categorical_crossentropy(
gradients, hessians, y_true, raw_predictions):
# Here gradients and hessians are of shape
# (n_samples * n_trees_per_iteration,).
# y_true is of shape (n_samples,).
# raw_predictions is of shape (n_samples, n_trees_per_iteration)
#
# Instead of passing the whole gradients and hessians arrays and slicing
# them here, we could instead do the update in the 'for k in ...' loop of
# fit(), by passing gradients_at_k and hessians_at_k which are of size
# (n_samples,).
# That would however require to pass a copy of raw_predictions, so it does
# not get partially overwritten at the end of the loop when
# _update_y_pred() is called (see sklearn PR 12715)
n_samples, n_trees_per_iteration = raw_predictions.shape
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for k in range(n_trees_per_iteration):
gradients_at_k = gradients[n_samples * k:n_samples * (k + 1)]
hessians_at_k = hessians[n_samples * k:n_samples * (k + 1)]
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
# p_k is the probability that class(ith sample) == k.
# This is a regular softmax.
p_k = np.exp(raw_predictions[i, k] -
_logsumexp(raw_predictions[i, :]))
gradients_at_k[i] = p_k - (y_true[i] == k)
hessians_at_k[i] = p_k * (1. - p_k)
# LightGBM uses 2 * p_k * (1 - p_k) which is not strictly
# correct but equivalent to using half the learning rate.


_LOSSES = {'least_squares': LeastSquares,
'binary_crossentropy': BinaryCrossEntropy}
'binary_crossentropy': BinaryCrossEntropy,
'categorical_crossentropy': CategoricalCrossEntropy}
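As a side note, the custom _logsumexp above is needed because numba-compiled code cannot call scipy.special.logsumexp (see the comment in the diff). Below is a minimal plain-numpy sketch of the same max-shift trick, handy for checking the jitted helper against the scipy reference; all names are illustrative and not part of pygbm.

import numpy as np
from scipy.special import logsumexp

def stable_softmax(scores):
    # Subtract the max before exponentiating so np.exp never overflows,
    # which is exactly the trick used by the jitted _logsumexp helper.
    a_max = np.amax(scores)
    if not np.isfinite(a_max):
        a_max = 0.0
    log_norm = np.log(np.sum(np.exp(scores - a_max))) + a_max
    return np.exp(scores - log_norm)

row = np.array([1000.0, 1001.0, 1002.0])   # a naive exp() overflows here
print(stable_softmax(row))                 # ~[0.090, 0.245, 0.665]
print(np.exp(row - logsumexp(row)))        # matches the scipy reference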
25 changes: 25 additions & 0 deletions pygbm/multiclass_notes
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Notes about multiclass classification
=====================================

Adding support for multiclass classification involves a few changes. The
main one is that instead of building 1 tree per iteration (like in binary
classification and regression), we build K trees per iteration, where K is
the number of classes.

Each tree is a kind of OVR tree, but trees are not completely independent
because they influence each other when the gradients and hessians are
updated in CategoricalCrossEntropy.update_gradients_and_hessians().
Concretely, the K trees of the ith iteration do not depend on each other,
but each tree at iteration i depends on *all* the K trees of iteration i -
1.

For a given sample, the probability that it belongs to class k is computed
as a regular softmax between scores = [scores_0, scores_1, ... scores_K-1]
where scores_k = sum(<leaf value of kth tree of iteration i>
for i in range(n_iterations)).
The predicted class is then the argmax of the K probabilities.

Regarding implementation details, the arrays gradients and hessians (for
non-constant hessians) are now 1D arrays of size (n_samples *
n_trees_per_iteration), instead of just (n_samples). raw_predictions is now
a 2D array of shape (n_samples, n_trees_per_iteration)
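A small pure-numpy sketch of the prediction scheme described above (the array names are made up for illustration and independent of pygbm's internals): per-class scores are summed over iterations, turned into probabilities with a softmax, and the predicted class is the argmax.

import numpy as np
from scipy.special import logsumexp

n_iterations, n_classes, n_samples = 10, 3, 4
rng = np.random.RandomState(0)

# leaf_values[i, k, s]: leaf value reached by sample s in the k-th tree of
# iteration i (random numbers standing in for real predictor trees).
leaf_values = rng.normal(size=(n_iterations, n_classes, n_samples))

raw_predictions = leaf_values.sum(axis=0).T          # (n_samples, n_classes)
proba = np.exp(raw_predictions -
               logsumexp(raw_predictions, axis=1)[:, np.newaxis])
predicted_class = proba.argmax(axis=1)

print(proba.sum(axis=1))    # each row sums to 1
print(predicted_class)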
8 changes: 7 additions & 1 deletion pygbm/plotting.py
@@ -14,6 +14,10 @@ def plot_tree(est_or_grower, est_lightgbm=None, tree_index=0, view=True,
profiling information that are not kept in the predictor trees that
result from fitting a GradientBoostingMachine.
tree_index corresponds to the ith built tree. In a multiclass setting,
e.g. with 3 classes, tree_index=5 will print the third tree of the
second iteration.
Can also plot a LightGBM estimator (on the left) for comparison.
Requires matplotlib and graphviz (both python package and binary program).
@@ -25,7 +29,9 @@ def plot_tree(est_or_grower, est_lightgbm=None, tree_index=0, view=True,
"""
def make_pygbm_tree():
def add_predictor_node(node_idx, parent=None, decision=None):
predictor_tree = est_or_grower.predictors_[tree_index]
iteration = tree_index // est_or_grower.n_trees_per_iteration_
k = tree_index % est_or_grower.n_trees_per_iteration_
predictor_tree = est_or_grower.predictors_[iteration][k]
node = predictor_tree.nodes[node_idx]
name = 'split__{}'.format(node_idx)
label = 'split_feature_index: {}'.format(
3 changes: 3 additions & 0 deletions pygbm/predictor.py
@@ -25,6 +25,9 @@ def __init__(self, nodes):
def get_n_leaf_nodes(self):
return int(self.nodes['is_leaf'].sum())

def get_max_depth(self):
return int(self.nodes['depth'].max())

def predict_binned(self, binned_data, out=None):
if out is None:
out = np.empty(binned_data.shape[0], dtype=np.float32)
