[MRG] Histogram computation optimization #14
Conversation
WIP... Works OK but no performance gain yet
Codecov Report
```
@@ Coverage Diff @@
## master #14 +/- ##
===========================================
+ Coverage 73.91% 93.13% +19.22%
===========================================
Files 8 8
Lines 598 714 +116
===========================================
+ Hits 442 665 +223
+ Misses 156 49 -107
```
Continue to review full report at Codecov.
This can be fixed easily in the grower itself by choosing to compute the split-ability of the smallest node first.
Too bad. You might need to add some
You should replot the above analysis on a larger dataset such as the Higgs boson dataset with 1e7 samples. Otherwise, the splits are too quick and the timing probes might induce some non-trivial overhead. To make it easier to debug, use a smaller tree with a limit of 5 leaf nodes, for instance. In the above graph, only the first-level splits with large sample counts benefit from the optimization.
Actually, your PR is already a good improvement on a larger dataset. When running the
on your branch (omitting the LightGBM output to avoid redundancy):

When running single threaded (

So it's already a 25% time reduction w.r.t. master, halfway to LightGBM.
Based on your tree, it seems that when count < 3e5 samples, it's always faster to compute the histogram explicitly (<1ms), while doing the histogram subtraction is slower (>1ms). So I think you should only enable the histogram subtraction if the node count is larger than 5e5, as a rule of thumb.

Actually my analysis is wrong because it depends on the imbalance between the sibling nodes. Doing the histogram optimization probably never hurts, but it's not expected to yield huge improvements when the counts are below 5e5. In your plot, I think it would be more interesting to report "count / time" split speeds and the ratio of
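The rule of thumb above could be sketched as a small helper. This is only an illustration: the function name is hypothetical and the threshold is the value proposed in the discussion, not anything that exists in pygbm.

```python
# Hypothetical helper sketching the rule of thumb discussed above:
# only use the parent - sibling histogram subtraction for large nodes,
# since explicit computation is already sub-millisecond below ~3e5 samples.
def should_use_histogram_subtraction(node_sample_count, threshold=5e5):
    return node_sample_count > threshold
```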
I think it would be interesting to build LightGBM with its TIMETAG feature enabled and collect similar statistics in pygbm to be able to compare. https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L34 It would also be interesting to count the number of times the histogram computation functions are called for each implementation to check that we are consistent. |
To enable TIMETAG, just edit the CMakeLists.txt at the root of the LightGBM source folder to add:

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 057302b..b735793 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -37,6 +37,9 @@ elseif(MSVC)
     cmake_minimum_required(VERSION 3.8)
 endif()
+
+add_definitions(-DTIMETAG)
+
 if(USE_SWIG)
     find_package(SWIG REQUIRED)
     find_package(Java REQUIRED)
```

then rebuild the python package. Passing verbose >= 2 should output the following (e.g. in the Higgs Boson benchmark):
Skipping the `ordered_gradients` computation when `parent_histogram` is not `None` might indeed bring some speed-up. This computation is a sequential section of the code, so skipping it might also increase parallelism.
pygbm/histogram.py
Outdated

```python
histogram = np.zeros(n_bins, dtype=HISTOGRAM_DTYPE)
unrolled_upper = (n_bins // 4) * 4

for i in range(0, unrolled_upper, 4):
```
Are you sure that unrolling brings any speed-up for such a short loop? `n_bins` is expected to be 255 at most.
No idea... I just thought it couldn't hurt, but indeed I don't think it's really useful.
pygbm/histogram.py
Outdated

```diff
@@ -20,6 +20,40 @@ def _build_ghc_histogram_naive(n_bins, sample_indices, binned_feature,
     return histogram


+@njit
+def _build_ghc_histogram_unrolled_fast(n_bins, parent_histogram,
+                                       sibling_histogram):
```
I would rename this to `_substract_histograms`, rename `parent_histogram` to `hist_a`, and `sibling_histogram` to `hist_b`.
Also, I think that you should create a new method named `find_node_split_by_substraction` that would call a new function named `_parallel_substract_histograms`, and keep `find_node_split` for the case where we actually compute histograms with `_parallel_find_splits`.
Yes, this definitely needs some refactoring. Thanks for all the comments, I'll go back to it soon.

Also, any idea on how to report execution time from inside numba? Importing the `time` module does not work when in no-python mode. We're currently reporting time(hist + splittability) and what we're interested in is only time(hist).
No idea: I found this question on stackoverflow but it's still unanswered. Maybe change the decorator from `numba.njit` to `numba.jit`, just use `time()`, and see if that causes some significant overhead?
Soooo we're getting closer :) I simply duplicated the whole pipeline, the only difference between calling And just like that, I get to run boson in 23 seconds (17 for lightgbm). That's pretty weird to me because the whole Now, I tried to get rid of all the It's super worth it though, because once we get rid of the Last comments:
Nice! It's possible that the numba type inference had trouble with the previous code layout and was missing additional typing hints. It's always a good idea to check the types inferred for all the local variables of the performance-critical sections of the code, so as to ensure that there are no unnecessary type casts that might prevent the compiler from optimizing the generated code: https://github.com/ogrisel/pygbm/#debugging-numba-type-inference
Yes I believe so. Maybe you could alternatively try to store the gradient and hessian sums on the grower nodes themselves, along with the histograms, as ancillary arrays.
Sounds promising. Once we get good performance on a few cores, it would be interesting to check the scalability on hardware with many more cores (e.g. 32 on a cloud machine for instance) and contrast that with the performance profile of lightgbm. Also note that the following test times out when disabling the numba jit: https://travis-ci.org/ogrisel/pygbm/jobs/439885373 Maybe it could be made faster so that it does not fail when numba is disabled. Running without numba is useful to collect coverage data (and also possibly to check that numba itself is not introducing unexpected bugs :)
tests/test_sanity.py
Outdated

```python
predicted_test = pygbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)

assert roc_auc == 0.9809811751028725
```
There will likely be some rounding errors in the summing of floating point values (total gradient sums, gradient histograms, subtractions), leading to the choice of slightly different splits. Therefore this kind of test should use some tolerance, for instance using `pytest.approx`.
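For instance, the exact-equality assertion above could be rewritten with a tolerance like this (the hardcoded score and the tolerance are purely illustrative; in the real test the value would come from `roc_auc_score`):

```python
import pytest

def test_roc_auc_is_close():
    # Would come from roc_auc_score(target_test, predicted_test);
    # hardcoded here for illustration.
    roc_auc = 0.9809811751028725
    # Allow small floating point drift instead of exact equality.
    assert roc_auc == pytest.approx(0.98098, rel=1e-4)
```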
All right I just found what was wrong, it was the hessian computation:
this was run regardless of the
Indeed, there are rounding errors. The relative error between the
I'll clean this all up!! :)
I've added some tests and everything looks fine. I didn't have to fix anything, so the fishiness was probably just me not wrapping my head around it. I've moved back the computation of gradients and hessians into
Is there anything I should do w.r.t. the failing tests with
Yes: remove it :) More seriously, it's just way too slow: use a much smaller dataset. You cannot expect the pure python version to do large compute-heavy tasks in a time-limited test.
The new version of the code seems a bit slower than previously: I get 21+s for pygbm on the higgs boson benchmark (vs 18s for lightgbm).
Is it consistently slower than before? Because I get some variation between tries. Which makes me think: is there any way to have consistent benchmarks, so that we're sure that a change actually causes a slowdown and that it's not just noise? I guess CI isn't an option given the size of the higgs boson dataset.
Looking at the logs from numba annotation, it seems that Is there a way to set
It's not that big a deal. The difference is close to the noise level of repeated runs. We can optimize that later. I opened an issue in numba/numba#3417 to track this. In the meantime it's ok to use more functions if we really need to. Let's remove the test_sanity test and merge this PR.
Here is a batch of comments on the timing info collected in this PR.
I also think there should be an example script (in a new "examples/" folder) that builds a single grower tree with fewer than 10 nodes and plots it. The dataset should be large enough in terms of samples (e.g. at least one million samples) so that the timing information can be useful to analyze. A bit like you did in the sanity test, but as an example instead.
pygbm/grower.py
Outdated

```python
# Computation time of the histograms, or more precisely time to compute
# splitability, which may involve some useless computations
time = 0
ratio = 1  # sibling.time / node.time if node.hist_subtraction, else 1
```
I think this attribute is hard to interpret correctly. I would rather not compute and report it, but instead report `construction_speed` as `sample_indices.shape[0] / time`.
do you mean `time / sample_indices.shape[0]`?
Processing speed is `sample_indices.shape[0] / time`, in samples per second.
pygbm/grower.py
Outdated

```python
parent = None  # Link to parent node, None for root
# Computation time of the histograms, or more precisely time to compute
# splitability, which may involve some useless computations
time = 0
```
Maybe rename this to `find_split_time` and also collect `apply_split_time`, which times `grower.splitter.split_indices()`.
pygbm/grower.py
Outdated

```python
split_info, histograms = self.splitter.find_node_split(
    node.sample_indices)
toc = time()
node.time = toc - tic
```
I think we should also accumulate the total time spent finding the best splits in a grower-level attribute, and the same for the time spent applying the splits.

In the gradient boosting class we should also record the time spent in binning, and finally in computing the predictions to get the new gradients and hessians for the next boosting iteration.

At the end of the fit of the gradient boosting, if verbose > 0, we should print a profiling report with the accumulated time for each of those types of operations, in a similar way as LightGBM does.
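A minimal sketch of that accumulation, with hypothetical class and attribute names (not pygbm's actual API):

```python
from time import time

class GrowerTimings:
    """Accumulate per-operation wall-clock time while growing trees."""

    def __init__(self):
        self.total_find_split_time = 0.0
        self.total_apply_split_time = 0.0

    def record(self, attr, func, *args, **kwargs):
        # Time one call and add the elapsed time to the named accumulator.
        tic = time()
        result = func(*args, **kwargs)
        setattr(self, attr, getattr(self, attr) + time() - tic)
        return result

    def report(self):
        # Similar in spirit to LightGBM's TIMETAG summary.
        return ("find split: %.3fs, apply split: %.3fs"
                % (self.total_find_split_time, self.total_apply_split_time))
```

The grower would wrap its `find_node_split` and `split_indices` calls with `record`, and the gradient boosting class would print `report()` at the end of fit when verbose > 0.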
Got it
There's a lot of

```python
tic = time()
blablah()
toc = time()
time_spent = toc - tic
```

Do we want a context manager that would allow us to do something like

```python
with Timer() as time_spent:
    blahblah()
```

?
As you wish but I think it's fine like it is now. Using a context manager will add an indentation level.
Ok let's keep it this way then
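For reference, the proposed `Timer` context manager could look like this minimal sketch (a hypothetical helper, not part of pygbm):

```python
from time import time

class Timer:
    """Record the wall-clock time spent inside a with-block."""

    def __enter__(self):
        self.tic = time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time() - self.tic
        return False  # do not swallow exceptions
```

Note that the `as` target is the `Timer` object itself, so the caller would read `timer.elapsed` after the block rather than binding the elapsed float directly, as in the `with Timer() as time_spent` snippet above.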
pygbm/predictor.py
Outdated

```python
('time', np.float32),
('ratio', np.float32),
('sum_g', np.float32),
('sum_h', np.float32),
```
Do we really need to store the sum of gradients and hessians in the predictor? Maybe we should find a way to just plot the grower tree instead of the predictor tree, if the goal is to present a tree-structured profile report of the cost of growing a tree.
The predictor tree is part of the public API of the model and I would rather not pollute it with attributes that are only useful for pygbm developers who try to optimize the implementation.
No, we don't need those. Same goes for `time` and `ratio`, etc. Those were only useful for debugging.

I agree that we should instead plot the grower tree. That's quite easy to do, I think: we would just need to keep a `growers_` list attribute in the `GradientBoostingMachine` class and slightly change the plotting procedure. Should I do this? Here or in another PR?
I don't want to keep the grower nodes on the GradientBoostingMachine. Instead I think we should only plot it when calling the TreeGrower code directly, as we do in bench_grower.py. This is just for pygbm developers; it's not meant to be part of the public estimator API.
Plotting the predictor trees is still useful for the users to learn about the structure of the trees. So the plotting utility should still work in that case.
I just want to extend it to make it work with additional info on the grower tree and use it in an example folder stating explicitly that this is only to analyze the performance profile of a single small tree. E.g.:
`example/plot_performance_profile_single_small_tree.py`
and computing predictions. Also added construction speed
Also added example
Ok merged!

I rebased / merged instead of squash merging... Too bad for the history of this project...
BTW I also pushed 3f7c36f to fix the plotting example timings (pre-compiling). The results are now consistent: the speed is always better when using histogram subtraction.
This is an attempt to use the relation `hist(parent) = hist(parent.left) + hist(parent.right)` to only compute explicitly either `hist(parent.left)` or `hist(parent.right)`, and get the sibling's histogram by a simple subtraction. This is supposed to be faster because explicitly computing a histogram is `O(n_samples_at_node)` while the subtraction is only `O(n_bins)`, and `n_samples_at_node >> n_bins`.

TODO:

The good news is, the predictions are the same as before.
The bad news is, no clear performance gain is achieved... I'll try to understand what's going on.
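The trick can be sketched as follows. This is a simplified illustration using plain NumPy arrays of per-bin gradient sums rather than pygbm's structured `HISTOGRAM_DTYPE`, and the function names are illustrative:

```python
import numpy as np

def build_histogram(n_bins, sample_indices, binned_feature, gradients):
    # Explicit construction: O(n_samples_at_node).
    histogram = np.zeros(n_bins)
    for i in sample_indices:
        histogram[binned_feature[i]] += gradients[i]
    return histogram

def subtract_histograms(parent_histogram, sibling_histogram):
    # hist(child) = hist(parent) - hist(sibling): O(n_bins) only.
    return parent_histogram - sibling_histogram
```

After splitting a node, only the smaller child's histogram needs the explicit O(n_samples) pass; the larger child's histogram is recovered by the subtraction.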