[MRG] Add support for multiclass classification (#49)
Adds CategoricalCrossEntropy loss to support multiclass classification.

K trees per iteration are built instead of just one.
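For illustration, a hypothetical usage sketch (the estimator name GradientBoostingClassifier, its import path, and all parameter names other than loss are assumptions, not shown in this commit): fitting on a 3-class dataset makes the model grow 3 trees per boosting iteration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from pygbm import GradientBoostingClassifier  # assumed import path

X, y = load_iris(return_X_y=True)  # 3 classes -> 3 trees per iteration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss='categorical_crossentropy' maps to the CategoricalCrossEntropy loss
# added by this commit; the other argument names are assumptions.
clf = GradientBoostingClassifier(loss='categorical_crossentropy', max_iter=20)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test).shape)  # (n_test_samples, 3)
print(clf.predict(X_test)[:5])          # predictions in the original label space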

* First draft for pluggable losses

* Added basic test for losses

* Addressed comments

* Removed use of LeastSquares loss in tests/test_predictor.py

* WIP, doesn't work

* fixed gradient and hessian of ls

* fixed y_pred shape

* L2 gradient and hessian again

* removed unnecessary conversions

* Partially fixed logloss

* Added prediction for classifiers, quick hack

* Strict comparison to min_gain_to_split

* lots of debug stuff (will be cleaned up)

* updated grower

* set lower decimal in tests

* Added test

* Changed condition to <= 0 and updated note

* print stuff, to be removed

* debugging, ignore

* n_bins is now feature dependent

* addressed comments, changed default max_bins to 256, tests aren't robust now :(

* Fixed n_bins = n_thresholds + 1... Also corrected some tests

* Added test for n_bins_per_feature

* removed optional for min_samples_leaf

* set min_samples_leaf to 1 if None

* Probably fixed overflow issue... Need to update stats BEFORE any 'continue' of course

* higgs boson benchmark now uses classifiers

* Parametrized test_boston_dataset after bugfix #43

* None is now invalid for min_samples_leaf

* removed trailing whitespace

* set default to 20 in SplittingContext

* Used sklearn gradient computation to avoid overflow

* Consolidated compare_lightgbm test

* Some cleaning

* pep8

* Added subclasses Regressor and Classifier

* added check for losses

* predict() is now custom in child classes

* Higgs Boson benchmark now uses classifiers and accuracy

* changed log_loss to logistic

* Added test for losses

* should fix test

* cosmetics

* Use get_gradients_and_hessians

* Comment

* Put back ROC AUC

* comment

* Parallelized gradients and hessians updates

* Comment

* First draft, looks like it's working OK.

Needs lots of cleaning and testing

* Removed get_gradients methods and added helper in tests

* some pep8

* Added sanity check for multinomial loss

* Put predict_proba() into loss functions

* Fixed predict_binned

* Updated gradient and hessians before loop over k

* Minor changes

* Fixed verbose printing

* switch to multinomial if problem is multiclass and loss=logistic

* parallelized multinomial loss

* Temporary fix for numba update issue

* cosmetics and comments

* Fixed __call__ and added gradient test

* minor modif

* Added auto loss for classification

* fix for auto loss

* some comments

* Fixed plotting

* Updated notes

* Added custom logsumexp jitted function

* Apply shrinkage at all iterations including first one

* renamed all_gradients and all_hessians

* Addressed comments

* Classes are now label-encoded to support str targets and other inputs

* Changed name _validate_y to _encode_y

* Addressed comments

* Renaming

* y_pred now 2D array

* raw_predict now returns 2D array

* renamed y_pred into raw_predictions

* pep8 and comments
NicolasHug committed Dec 6, 2018
1 parent c5ebdfc commit 8284e12
Showing 12 changed files with 531 additions and 147 deletions.
8 changes: 4 additions & 4 deletions pygbm/binning.py
@@ -133,11 +133,11 @@ class BinMapper(BaseEstimator, TransformerMixin):
----------
max_bins : int, optional (default=256)
The maximum number of bins to use. If for a given feature the number of
unique values is less than ``max_bins``, then those unique values will be
used instead of the quantiles.
unique values is less than ``max_bins``, then those unique values
will be used instead of the quantiles.
subsample : int, optional (default=1e5)
If ``n_samples > subsample``, then ``sub_samples`` samples will be randomly
choosen to compute the quantiles.
If ``n_samples > subsample``, then ``sub_samples`` samples will be
randomly choosen to compute the quantiles.
TODO: accept None?
random_state: int or numpy.random.RandomState or None, \
optional (default=None)
249 changes: 176 additions & 73 deletions pygbm/gradient_boosting.py

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions pygbm/grower.py
@@ -158,7 +158,7 @@ class TreeGrower:
The shrinkage parameter to apply to the leaves values, also known as
learning rate.
"""
def __init__(self, features_data, all_gradients, all_hessians,
def __init__(self, features_data, gradients, hessians,
max_leaf_nodes=None, max_depth=None, min_samples_leaf=20,
min_gain_to_split=0., max_bins=256, n_bins_per_feature=None,
l2_regularization=0., min_hessian_to_split=1e-3,
@@ -178,7 +178,7 @@ def __init__(self, features_data, all_gradients, all_hessians,

self.splitting_context = SplittingContext(
features_data.shape[1], features_data, max_bins,
n_bins_per_feature, all_gradients, all_hessians,
n_bins_per_feature, gradients, hessians,
l2_regularization, min_hessian_to_split, min_samples_leaf,
min_gain_to_split)
self.max_leaf_nodes = max_leaf_nodes
@@ -238,13 +238,13 @@ def _intilialize_root(self):
n_samples = self.features_data.shape[0]
depth = 0
if self.splitting_context.constant_hessian:
hessian = self.splitting_context.all_hessians[0] * n_samples
hessian = self.splitting_context.hessians[0] * n_samples
else:
hessian = self.splitting_context.all_hessians.sum()
hessian = self.splitting_context.hessians.sum()
self.root = TreeNode(
depth=depth,
sample_indices=self.splitting_context.partition.view(),
sum_gradients=self.splitting_context.all_gradients.sum(),
sum_gradients=self.splitting_context.gradients.sum(),
sum_hessians=hessian
)
if (self.max_leaf_nodes is not None and self.max_leaf_nodes == 1):
145 changes: 121 additions & 24 deletions pygbm/loss.py
@@ -1,11 +1,27 @@
from abc import ABC
from abc import ABC, abstractmethod

from scipy.special import expit
from scipy.special import expit, logsumexp
import numpy as np
from numba import njit, prange
import numba


# TODO: Write proper docstrings

@njit
def _logsumexp(a):
# custom logsumexp function with numerical stability, based on scipy's
# logsumexp which is unfortunately not supported (neither is
# np.logaddexp.reduce). Only supports 1d arrays.

a_max = np.amax(a)
if not np.isfinite(a_max):
a_max = 0

s = np.sum(np.exp(a - a_max))
return np.log(s) + a_max


@njit
def _get_threads_chunks(total_size):
# Divide [0, total_size - 1] into n_threads contiguous regions, and
@@ -29,80 +45,161 @@ def _expit(x):

class Loss(ABC):

def init_gradients_and_hessians(self, n_samples):
gradients = np.empty(shape=n_samples, dtype=np.float32)
def init_gradients_and_hessians(self, n_samples, n_trees_per_iteration):
shape = n_samples * n_trees_per_iteration
gradients = np.empty(shape=shape, dtype=np.float32)
if self.hessian_is_constant:
hessians = np.ones(shape=1, dtype=np.float32)
else:
hessians = np.empty(shape=n_samples, dtype=np.float32)
hessians = np.empty(shape=shape, dtype=np.float32)

return gradients, hessians

@abstractmethod
def update_gradients_and_hessians(self, gradients, hessians, y_true,
raw_predictions):
pass


class LeastSquares(Loss):

hessian_is_constant = True

def __call__(self, y_true, y_pred, average=True):
loss = np.power(y_true - y_pred, 2)
def __call__(self, y_true, raw_predictions, average=True):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
loss = np.power(y_true - raw_predictions, 2)
return loss.mean() if average else loss

def inverse_link_function(self, raw_predictions):
return raw_predictions

def update_gradients_and_hessians(self, gradients, hessians, y_true,
y_pred):
return _update_gradients_least_squares(gradients, y_true, y_pred)
raw_predictions):
return _update_gradients_least_squares(gradients, y_true,
raw_predictions)


@njit(parallel=True, fastmath=True)
def _update_gradients_least_squares(gradients, y_true, y_pred):
n_samples = gradients.shape[0]
def _update_gradients_least_squares(gradients, y_true, raw_predictions):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
n_samples = raw_predictions.shape[0]
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
# Note: a more correct exp is 2 * (y_pred - y_true) but since we
# use 1 for the constant hessian value (and not 2) this is
# strictly equivalent for the leaves values.
gradients[i] = y_pred[i] - y_true[i]
# Note: a more correct exp is 2 * (raw_predictions - y_true) but
# since we use 1 for the constant hessian value (and not 2) this
# is strictly equivalent for the leaves values.
gradients[i] = raw_predictions[i] - y_true[i]


class BinaryCrossEntropy(Loss):

hessian_is_constant = False
inverse_link_function = staticmethod(expit)

def __call__(self, y_true, y_pred, average=True):
def __call__(self, y_true, raw_predictions, average=True):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
# logaddexp(0, x) = log(1 + exp(x))
loss = np.logaddexp(0, y_pred) - y_true * y_pred
loss = np.logaddexp(0, raw_predictions) - y_true * raw_predictions
return loss.mean() if average else loss

def update_gradients_and_hessians(self, gradients, hessians, y_true,
y_pred):
return _update_gradients_hessians_logistic(gradients, hessians,
y_true, y_pred)
raw_predictions):
return _update_gradients_hessians_binary_crossentropy(
gradients, hessians, y_true, raw_predictions)

def predict_proba(self, raw_predictions):
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
proba = np.empty((raw_predictions.shape[0], 2), dtype=np.float32)
proba[:, 1] = expit(raw_predictions)
proba[:, 0] = 1 - proba[:, 1]
return proba


@njit(parallel=True, fastmath=True)
def _update_gradients_hessians_logistic(gradients, hessians, y_true, y_pred):
def _update_gradients_hessians_binary_crossentropy(gradients, hessians,
y_true, raw_predictions):
# Note: using LightGBM version (first mapping {0, 1} into {-1, 1})
# will cause overflow issues in the exponential as we're using float32
# precision.

n_samples = gradients.shape[0]
# shape (n_samples, 1) --> (n_samples,). reshape(-1) is more likely to
# return a view.
raw_predictions = raw_predictions.reshape(-1)
n_samples = raw_predictions.shape[0]
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
gradients[i] = _expit(y_pred[i]) - y_true[i]
gradients[i] = _expit(raw_predictions[i]) - y_true[i]
gradient_abs = np.abs(gradients[i])
hessians[i] = gradient_abs * (1. - gradient_abs)


class CategoricalCrossEntropy(Loss):

hessian_is_constant = False

def __call__(self, y_true, raw_predictions, average=True):
one_hot_true = np.zeros_like(raw_predictions)
n_trees_per_iteration = raw_predictions.shape[1]
for k in range(n_trees_per_iteration):
one_hot_true[:, k] = (y_true == k)

return (logsumexp(raw_predictions, axis=1) -
(one_hot_true * raw_predictions).sum(axis=1))

def update_gradients_and_hessians(self, gradients, hessians, y_true,
raw_predictions):
return _update_gradients_hessians_categorical_crossentropy(
gradients, hessians, y_true, raw_predictions)

def predict_proba(self, raw_predictions):
# TODO: This could be done in parallel
# compute softmax (using exp(log(softmax)))
return np.exp(raw_predictions -
logsumexp(raw_predictions, axis=1)[:, np.newaxis])


@njit(parallel=True)
def _update_gradients_hessians_categorical_crossentropy(
gradients, hessians, y_true, raw_predictions):
# Here gradients and hessians are of shape
# (n_samples * n_trees_per_iteration,).
# y_true is of shape (n_samples,).
# raw_predictions is of shape (n_samples, n_trees_per_iteration)
#
# Instead of passing the whole gradients and hessians arrays and slicing
# them here, we could instead do the update in the 'for k in ...' loop of
# fit(), by passing gradients_at_k and hessians_at_k which are of size
# (n_samples,).
# That would however require to pass a copy of raw_predictions, so it does
# not get partially overwritten at the end of the loop when
# _update_y_pred() is called (see sklearn PR 12715)
n_samples, n_trees_per_iteration = raw_predictions.shape
starts, ends, n_threads = _get_threads_chunks(total_size=n_samples)
for k in range(n_trees_per_iteration):
gradients_at_k = gradients[n_samples * k:n_samples * (k + 1)]
hessians_at_k = hessians[n_samples * k:n_samples * (k + 1)]
for thread_idx in prange(n_threads):
for i in range(starts[thread_idx], ends[thread_idx]):
# p_k is the probability that class(ith sample) == k.
# This is a regular softmax.
p_k = np.exp(raw_predictions[i, k] -
_logsumexp(raw_predictions[i, :]))
gradients_at_k[i] = p_k - (y_true[i] == k)
hessians_at_k[i] = p_k * (1. - p_k)
# LightGBM uses 2 * p_k * (1 - p_k) which is not strictly
# correct but equivalent to using half the learning rate.


_LOSSES = {'least_squares': LeastSquares,
'binary_crossentropy': BinaryCrossEntropy}
'binary_crossentropy': BinaryCrossEntropy,
'categorical_crossentropy': CategoricalCrossEntropy}
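As a side note, the custom _logsumexp above is needed because numba-compiled code cannot call scipy.special.logsumexp (see the comment in the diff). Below is a minimal plain-numpy sketch of the same max-shift trick, handy for checking the jitted helper against the scipy reference; all names are illustrative and not part of pygbm.

import numpy as np
from scipy.special import logsumexp

def stable_softmax(scores):
    # Subtract the max before exponentiating so np.exp never overflows,
    # which is exactly the trick used by the jitted _logsumexp helper.
    a_max = np.amax(scores)
    if not np.isfinite(a_max):
        a_max = 0.0
    log_norm = np.log(np.sum(np.exp(scores - a_max))) + a_max
    return np.exp(scores - log_norm)

row = np.array([1000.0, 1001.0, 1002.0])   # a naive exp() overflows here
print(stable_softmax(row))                 # ~[0.090, 0.245, 0.665]
print(np.exp(row - logsumexp(row)))        # matches the scipy reference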
25 changes: 25 additions & 0 deletions pygbm/multiclass_notes
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Notes about multiclass classification
=====================================

Adding support for multiclass classification involves a few changes. The
main one is that instead of building 1 tree per iteration (like in binary
classification and regression), we build K trees per iteration, where K is
the number of classes.

Each tree is a kind of OVR tree, but trees are not completely independent
because they influence each other when the gradients and hessians are
updated in CategoricalCrossEntropy.update_gradients_and_hessians().
Concretely, the K trees of the ith iteration do not depend on each other,
but each tree at iteration i depends on *all* the K trees of iteration i -
1.

For a given sample, the probability that it belongs to class k is computed
as a regular softmax between scores = [scores_0, scores_1, ... scores_K-1]
where scores_k = sum(<leaf value of kth tree of iteration i>
for i in range(n_iterations)).
The predicted class is then the argmax of the K probabilities.

Regarding implementation details, the arrays gradients and hessians (for
non-constant hessians) are now 1D arrays of size (n_samples *
n_trees_per_iteration), instead of just (n_samples). raw_predictions is now
a 2D array of shape (n_samples, n_trees_per_iteration)
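A small pure-numpy sketch of the prediction scheme described above (the array names are made up for illustration and independent of pygbm's internals): per-class scores are summed over iterations, turned into probabilities with a softmax, and the predicted class is the argmax.

import numpy as np
from scipy.special import logsumexp

n_iterations, n_classes, n_samples = 10, 3, 4
rng = np.random.RandomState(0)

# leaf_values[i, k, s]: leaf value reached by sample s in the k-th tree of
# iteration i (random numbers standing in for real predictor trees).
leaf_values = rng.normal(size=(n_iterations, n_classes, n_samples))

raw_predictions = leaf_values.sum(axis=0).T          # (n_samples, n_classes)
proba = np.exp(raw_predictions -
               logsumexp(raw_predictions, axis=1)[:, np.newaxis])
predicted_class = proba.argmax(axis=1)

print(proba.sum(axis=1))    # each row sums to 1
print(predicted_class)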
8 changes: 7 additions & 1 deletion pygbm/plotting.py
@@ -14,6 +14,10 @@ def plot_tree(est_or_grower, est_lightgbm=None, tree_index=0, view=True,
profiling information that are not kept in the predictor trees that
result from fitting a GradientBoostingMachine.
tree_index corresponds to the ith built tree. In a multiclass setting,
e.g. with 3 classes, tree_index=5 will print the third tree of the
second iteration.
Can also plot a LightGBM estimator (on the left) for comparison.
Requires matplotlib and graphviz (both python package and binary program).
@@ -25,7 +29,9 @@ def plot_tree(est_or_grower, est_lightgbm=None, tree_index=0, view=True,
"""
def make_pygbm_tree():
def add_predictor_node(node_idx, parent=None, decision=None):
predictor_tree = est_or_grower.predictors_[tree_index]
iteration = tree_index // est_or_grower.n_trees_per_iteration_
k = tree_index % est_or_grower.n_trees_per_iteration_
predictor_tree = est_or_grower.predictors_[iteration][k]
node = predictor_tree.nodes[node_idx]
name = 'split__{}'.format(node_idx)
label = 'split_feature_index: {}'.format(
3 changes: 3 additions & 0 deletions pygbm/predictor.py
@@ -25,6 +25,9 @@ def __init__(self, nodes):
def get_n_leaf_nodes(self):
return int(self.nodes['is_leaf'].sum())

def get_max_depth(self):
return int(self.nodes['depth'].max())

def predict_binned(self, binned_data, out=None):
if out is None:
out = np.empty(binned_data.shape[0], dtype=np.float32)
