GPU optimizations, profiling tooling, and test infrastructure#15
…rk restructuring

- Wire fit_tree_gpu_native into training loop for ~7x GPU speedup
- Add histogram subtraction in LevelWise growth (halves histogram work)
- Cache GPU arrays in TreeStructure to avoid repeated to_device() calls
- Skip loss computation when no callbacks registered
- Add ProfilingCallback with structured JSON reports for improvement loops
- Add pytest-xdist for parallel test execution (1.4x faster test suite)
- Restructure benchmarks: replace scattered scripts with focused modules
- Add shared test fixtures in conftest.py and new test files
- Update CI workflow with fast/full/performance test stages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
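The histogram-subtraction item above uses the standard identity that a parent node's gradient histogram is the sum of its two children's, so the second child can be derived by subtraction instead of a second pass over the data. A minimal NumPy sketch (function and variable names are illustrative, not OpenBoost's API):

```python
import numpy as np

def sibling_histogram(parent_hist: np.ndarray, child_hist: np.ndarray) -> np.ndarray:
    """Derive the sibling's histogram by subtraction instead of a second
    full scan of the rows: parent = left_child + right_child."""
    return parent_hist - child_hist

# Toy example: per-bin gradient sums for one feature with 4 bins.
parent = np.array([10.0, 6.0, 4.0, 2.0])
left = np.array([7.0, 2.0, 1.0, 0.5])
right = sibling_histogram(parent, left)  # array([3. , 4. , 3. , 1.5])
```

Only one child's histogram is built from the rows (typically the smaller one); the other comes essentially for free, which is why this roughly halves histogram work at each level.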
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37f8edeb31
```python
if use_gpu_native:
    from .._core._tree import fit_tree_gpu_native
    tree = fit_tree_gpu_native(
        self.X_binned_,
        grad_gpu,
```
Keep NaN/categorical data off the GPU-native builder
When CUDA is enabled and the model uses the default reg_alpha/subsample/colsample settings, this branch now sends every BinnedArray through fit_tree_gpu_native(). Unlike fit_tree(), that builder never reads BinnedArray.has_missing / is_categorical (src/openboost/_core/_tree.py:354-369) and returns the legacy Tree type with no missing_go_left or categorical metadata (src/openboost/_core/_tree.py:200-234). Any GPU run on data containing NaNs or categorical features will therefore silently learn plain ordinal splits instead of the documented missing/category-aware behavior.
Fixed in bda4781. Added has_missing and has_categorical checks to the use_gpu_native gate — falls back to fit_tree() which handles missing values and categorical features correctly.
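A sketch of what such a gate might look like. The `has_missing` / `is_categorical` flags come from the review comment above; the stub class and helper name are hypothetical stand-ins, not OpenBoost's actual code:

```python
from dataclasses import dataclass

@dataclass
class BinnedArrayStub:
    """Hypothetical stand-in for BinnedArray's metadata flags."""
    has_missing: bool = False
    is_categorical: bool = False

def can_use_gpu_native(binned: BinnedArrayStub) -> bool:
    # The GPU-native builder ignores these flags and would silently learn
    # plain ordinal splits, so only route clean numeric data through it;
    # everything else falls back to the full fit_tree() path.
    return not (binned.has_missing or binned.is_categorical)

print(can_use_gpu_native(BinnedArrayStub()))                  # True
print(can_use_gpu_native(BinnedArrayStub(has_missing=True)))  # False
```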
src/openboost/_models/_boosting.py (outdated)

```python
pred_cpu += self.learning_rate * tree_pred
cuda.to_device(pred_cpu, to=pred_gpu)

self.trees_.append(tree)
```
Preserve TreeStructure entries in GPU-trained boosters
On the default CUDA fast path, tree is a legacy _core._tree.Tree, and this line stores it in self.trees_. The sklearn wrappers call compute_feature_importances() immediately after fitting (src/openboost/_models/_sklearn.py:287-289), but _importance._accumulate_importance() expects every tree to expose n_nodes, features, and left_children (src/openboost/_importance.py:156-158). As a result, OpenBoostRegressor.fit() / OpenBoostClassifier.fit() now fail right after a successful GPU training run with the default hyperparameters.
Fixed in bda4781. The GPU-native path now converts the legacy Tree to TreeStructure (with features, left_children, n_nodes, etc.) before appending to trees_, so compute_feature_importances() and sklearn wrappers work correctly.
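One way such a conversion can be sketched. The field names (`n_nodes`, `features`, `left_children`, ...) are those the review comments say consumers expect; the flat-array layout and the dict-based TreeStructure stand-in are assumptions, not OpenBoost's actual types:

```python
import numpy as np

def legacy_tree_to_structure(features, thresholds, left_children, right_children):
    """Bundle a legacy tree's flat per-node arrays into a record exposing
    the fields that importance accumulation and persistence read."""
    features = np.asarray(features, dtype=np.int32)
    return {
        "n_nodes": len(features),
        "features": features,
        "thresholds": np.asarray(thresholds, dtype=np.float64),
        "left_children": np.asarray(left_children, dtype=np.int32),
        "right_children": np.asarray(right_children, dtype=np.int32),
    }

# A 3-node stump: root splits on feature 0, children are leaves (-1).
tree = legacy_tree_to_structure([0, -1, -1], [0.5, 0.0, 0.0],
                                [1, -1, -1], [2, -1, -1])
```

Doing the fused prediction on the legacy tree first and converting afterwards, as the reply describes, keeps the GPU fast path while giving downstream consumers a uniform type.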
src/openboost/_models/_boosting.py (outdated)

```python
pred_cpu += self.learning_rate * tree_pred
cuda.to_device(pred_cpu, to=pred_gpu)

self.trees_.append(tree)
```
Keep GPU-native trees compatible with PersistenceMixin
This append also breaks persistence for CUDA-trained models, because PersistenceMixin.save() serializes each entry in trees_ through _tree_to_dict() (src/openboost/_persistence.py:45-64), which assumes a TreeStructure with features/thresholds/left_children/.... A legacy Tree from fit_tree_gpu_native() does not expose those fields, so saving a GPU-trained GradientBoosting—or using ModelCheckpoint, which calls model.save() during training—will now raise instead of writing a checkpoint.
Fixed in bda4781 (same commit as above). Since the legacy Tree is converted to TreeStructure before storing, PersistenceMixin.save() / _tree_to_dict() now gets the fields it expects. ModelCheckpoint callback works correctly as well.
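A hedged sketch of the kind of serialization `_tree_to_dict()` performs, per the comment above: array fields become plain lists so the model is JSON-serializable. The exact field set and helper name here are assumptions:

```python
import json
import numpy as np

def tree_to_dict_sketch(tree: dict) -> dict:
    """Convert a tree's NumPy node arrays to plain Python lists so the
    whole model can be written as JSON by save()/ModelCheckpoint."""
    return {key: np.asarray(value).tolist() for key, value in tree.items()}

tree = {
    "features": np.array([0, -1, -1], dtype=np.int32),
    "thresholds": np.array([0.5, 0.0, 0.0]),
    "left_children": np.array([1, -1, -1], dtype=np.int32),
}
payload = json.dumps(tree_to_dict_sketch(tree))
restored = json.loads(payload)  # round-trips to plain lists
```

This is why the conversion to TreeStructure matters: a legacy Tree without these array attributes gives the serializer nothing to iterate over.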
- Add E402, F821 to ruff ignore list (intentional patterns)
- Auto-fix: unused imports, f-string placeholders, nested ifs, ternaries
- Add `from err` to raise-in-except clauses (B904)
- Suppress intentional ABC no-op methods (B024/B027)
- Rename unused loop variables (B007)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
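The B904 item above refers to explicit exception chaining: re-raising inside an `except` block with `from err` so the causal chain is unambiguous. A minimal sketch (the function is illustrative, not from this codebase):

```python
def get_setting(config: dict, key: str) -> str:
    """Look up a required setting, chaining the original KeyError so
    tracebacks show 'The above exception was the direct cause of...'."""
    try:
        return config[key]
    except KeyError as err:
        # ruff B904: without `from err`, the KeyError context is implicit
        # and can be mistaken for an unrelated secondary failure.
        raise ValueError(f"missing required setting: {key}") from err
```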
… Tree to TreeStructure

Addresses PR review comments:

- Skip fit_tree_gpu_native when BinnedArray has missing values or categorical features, since the GPU-native builder doesn't support them
- Convert legacy Tree to TreeStructure after GPU-native building so that feature importance, persistence, and sklearn wrappers work correctly
- Do fused prediction on the legacy Tree before converting (keeps perf benefit)
- Restructure prediction update to avoid double-append to trees_

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…raints

- Add profiling commands and OPENBOOST_PROFILE env var
- Document pytest-xdist parallel execution and conftest.py
- Update ruff ignore list (E402, F821)
- Document GPU-native builder limitations (no missing/categorical)
- Add _profiler.py to architecture section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Wire `fit_tree_gpu_native` into the training loop; add fused prediction updates, histogram subtraction, and GPU array caching
- `ProfilingCallback` produces structured JSON reports identifying bottlenecks for self-recursive improvement loops
- `pytest-xdist` for a 1.4x faster test suite (312s → 226s)
- Restructured benchmarks into focused modules (`compare_cpu`, `compare_gpu`, `profile_kernels`, `profile_loop`, `check_performance`, `check_scaling`)
- `conftest.py` with shared fixtures, 7 new test files

GPU Performance Results (A100)
Profiling Usage
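The body of this section did not survive extraction. As a generic illustration only of the pattern a per-iteration profiling callback with structured JSON reports typically follows (this sketch is not OpenBoost's actual ProfilingCallback API; class and method names are invented):

```python
import json
import time

class ProfilingCallbackSketch:
    """Generic sketch: time each boosting iteration, emit a JSON report."""

    def __init__(self):
        self.timings = []
        self._t0 = None

    def on_iteration_start(self, i: int) -> None:
        self._t0 = time.perf_counter()

    def on_iteration_end(self, i: int) -> None:
        elapsed = time.perf_counter() - self._t0
        self.timings.append({"iteration": i, "seconds": elapsed})

    def report(self) -> str:
        # Structured output so an automated improvement loop can parse it.
        total = sum(t["seconds"] for t in self.timings)
        return json.dumps({"total_seconds": total, "iterations": self.timings})
```

The value of the structured report is that a script (or model) can consume it programmatically to find the slowest stage, rather than eyeballing log lines.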
Test plan
- … (`-n auto`)

🤖 Generated with Claude Code