Experimental Gradient Boosting Machines in Python.
The goal of this project is to evaluate whether it's possible to implement an efficient histogram-binning version of Gradient Boosting Trees (possibly with all the LightGBM optimizations) while staying in pure Python 3.6+, using the numba JIT compiler.
pygbm provides a set of scikit-learn compatible estimator classes that
should play well with the scikit-learn
Pipeline and model selection tools
(grid search and randomized hyperparameter search).
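For instance, a grid search over pygbm hyperparameters could be written as in the following sketch. The estimator name GradientBoostingClassifier and the parameters learning_rate and max_leaf_nodes are assumptions here; check the API documentation for the exact spelling.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from pygbm import GradientBoostingClassifier  # assumed estimator name

# Toy classification task; histogram binning shines on larger data.
X, y = make_classification(n_samples=10_000, random_state=0)

# Hypothetical hyperparameter names, to be checked against the real API.
param_grid = {'learning_rate': [0.05, 0.1], 'max_leaf_nodes': [15, 31]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)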
Longer term plans include integration with dask and dask-ml for out-of-core and distributed fitting on a cluster.
The project is available on PyPI and can be installed with
pip install pygbm
You'll need at least Python 3.6.
The API documentation is available online. You might also want to have a look at the examples/ folder of this repo.
The project is experimental. The API is subject to change without deprecation notice. Use at your own risk.
We welcome any feedback in the GitHub issue tracker: https://github.com/ogrisel/pygbm/issues
Running the development version
Use pip to install in "editable" mode:
git clone https://github.com/ogrisel/pygbm.git
cd pygbm
pip install -r requirements.txt
pip install --editable .
Run the tests with pytest:
pip install -r requirements.txt
pytest
Benchmarking

The benchmarks folder contains some scripts to evaluate the computation performance of various parts of pygbm. Keep in mind that numba's JIT compilation takes time: the first call to a jitted function pays the compilation cost, so either warm up the functions or run on data large enough for that overhead to be negligible.
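As an illustration, here is a self-contained toy snippet (not pygbm code) showing the warm-up effect on a jitted histogram kernel:

import time
import numpy as np
from numba import njit

@njit
def histogram_sum(values, bins, n_bins):
    # Per-bin accumulation, the core operation of histogram-based GBMs.
    out = np.zeros(n_bins, dtype=np.float32)
    for i in range(values.shape[0]):
        out[bins[i]] += values[i]
    return out

values = np.random.rand(10**6).astype(np.float32)
bins = np.random.randint(0, 255, size=10**6).astype(np.uint8)

tic = time.perf_counter()
histogram_sum(values, bins, 255)  # first call: includes JIT compilation
print('first call:  %0.3fs' % (time.perf_counter() - tic))

tic = time.perf_counter()
histogram_sum(values, bins, 255)  # second call: compiled, steady-state
print('second call: %0.3fs' % (time.perf_counter() - tic))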
To profile the benchmarks, you can use snakeviz to get an interactive HTML report:
pip install snakeviz
python -m cProfile -o bench_higgs_boson.prof benchmarks/bench_higgs_boson.py
snakeviz bench_higgs_boson.prof
Debugging numba type inference
To introspect the results of type inference steps in the numba sections called by a given benchmarking script:
numba --annotate-html bench_higgs_boson.html benchmarks/bench_higgs_boson.py
In particular it is interesting to check that the numerical variables in the hot loops highlighted by the snakeviz profiling report have the expected precision level (e.g. float32 for loss computations, uint8 for binned feature values, ...).
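If you prefer to stay in Python, jitted functions also expose an inspect_types() method. A small toy example (not pygbm code):

import numpy as np
from numba import njit

@njit
def float32_sum(x):
    # Keep the accumulator in float32 to match the input precision.
    total = np.float32(0.0)
    for i in range(x.shape[0]):
        total += x[i]
    return total

float32_sum(np.ones(10, dtype=np.float32))  # trigger compilation
float32_sum.inspect_types()  # prints the source annotated with inferred types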
Impact of thread-based parallelism
Some benchmarks can call numba functions that leverage the built-in
thread-based parallelism (e.g. @njit(parallel=True) with prange loops).
On a multicore machine you can evaluate how the thread-based parallelism
scales by explicitly setting the NUMBA_NUM_THREADS environment
variable. For instance try:
NUMBA_NUM_THREADS=1 python benchmarks/bench_binning.py
NUMBA_NUM_THREADS=4 python benchmarks/bench_binning.py
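For reference, the kernels affected by this variable have the following general shape (a hypothetical toy example, not pygbm code); numba distributes the prange iterations over NUMBA_NUM_THREADS threads:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_square_sum(x):
    # Scalar reduction over a prange loop: numba splits the iterations
    # across threads and combines the partial sums automatically.
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i] * x[i]
    return total

print(parallel_square_sum(np.random.rand(10**7)))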
Acknowledgements

The work from Nicolas Hug is supported by the National Science Foundation under Grant No. 1740305 and by DARPA under Grant No. DARPA-BAA-16-51.

The work from Olivier Grisel is supported by the scikit-learn initiative and its partners at Inria Fondation.