GPU optimizations, profiling tooling, and test infrastructure#15
…rk restructuring

- Wire fit_tree_gpu_native into training loop for ~7x GPU speedup
- Add histogram subtraction in LevelWise growth (halves histogram work)
- Cache GPU arrays in TreeStructure to avoid repeated to_device() calls
- Skip loss computation when no callbacks registered
- Add ProfilingCallback with structured JSON reports for improvement loops
- Add pytest-xdist for parallel test execution (1.4x faster test suite)
- Restructure benchmarks: replace scattered scripts with focused modules
- Add shared test fixtures in conftest.py and new test files
- Update CI workflow with fast/full/performance test stages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
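The histogram-subtraction item above uses the standard identity that a parent node's gradient histogram is the sum of its two children's, so the second child can be derived by subtraction instead of a second pass over the data. A minimal NumPy sketch (function and variable names are illustrative, not OpenBoost's API):

```python
import numpy as np

def sibling_histogram(parent_hist: np.ndarray, child_hist: np.ndarray) -> np.ndarray:
    """Derive the sibling's histogram by subtraction instead of a second
    full scan of the rows: parent = left_child + right_child."""
    return parent_hist - child_hist

# Toy example: per-bin gradient sums for one feature with 4 bins.
parent = np.array([10.0, 6.0, 4.0, 2.0])
left = np.array([7.0, 2.0, 1.0, 0.5])
right = sibling_histogram(parent, left)  # array([3. , 4. , 3. , 1.5])
```

Only one child's histogram is built from the rows (typically the smaller one); the other comes essentially for free, which is why this roughly halves histogram work at each level.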
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37f8edeb31
```python
if use_gpu_native:
    from .._core._tree import fit_tree_gpu_native
    tree = fit_tree_gpu_native(
        self.X_binned_,
        grad_gpu,
```
Keep NaN/categorical data off the GPU-native builder
When CUDA is enabled and the model uses the default reg_alpha/subsample/colsample settings, this branch now sends every BinnedArray through fit_tree_gpu_native(). Unlike fit_tree(), that builder never reads BinnedArray.has_missing / is_categorical (src/openboost/_core/_tree.py:354-369) and returns the legacy Tree type with no missing_go_left or categorical metadata (src/openboost/_core/_tree.py:200-234). Any GPU run on data containing NaNs or categorical features will therefore silently learn plain ordinal splits instead of the documented missing/category-aware behavior.
Fixed in bda4781. Added has_missing and has_categorical checks to the use_gpu_native gate — falls back to fit_tree() which handles missing values and categorical features correctly.
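A sketch of what such a gate might look like. The `has_missing` / `is_categorical` flags come from the review comment above; the stub class and helper name are hypothetical stand-ins, not OpenBoost's actual code:

```python
from dataclasses import dataclass

@dataclass
class BinnedArrayStub:
    """Hypothetical stand-in for BinnedArray's metadata flags."""
    has_missing: bool = False
    is_categorical: bool = False

def can_use_gpu_native(binned: BinnedArrayStub) -> bool:
    # The GPU-native builder ignores these flags and would silently learn
    # plain ordinal splits, so only route clean numeric data through it;
    # everything else falls back to the full fit_tree() path.
    return not (binned.has_missing or binned.is_categorical)

print(can_use_gpu_native(BinnedArrayStub()))                  # True
print(can_use_gpu_native(BinnedArrayStub(has_missing=True)))  # False
```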
src/openboost/_models/_boosting.py (outdated)

```python
pred_cpu += self.learning_rate * tree_pred
cuda.to_device(pred_cpu, to=pred_gpu)

self.trees_.append(tree)
```
Preserve TreeStructure entries in GPU-trained boosters
On the default CUDA fast path, tree is a legacy _core._tree.Tree, and this line stores it in self.trees_. The sklearn wrappers call compute_feature_importances() immediately after fitting (src/openboost/_models/_sklearn.py:287-289), but _importance._accumulate_importance() expects every tree to expose n_nodes, features, and left_children (src/openboost/_importance.py:156-158). As a result, OpenBoostRegressor.fit() / OpenBoostClassifier.fit() now fail right after a successful GPU training run with the default hyperparameters.
Fixed in bda4781. The GPU-native path now converts the legacy Tree to TreeStructure (with features, left_children, n_nodes, etc.) before appending to trees_, so compute_feature_importances() and sklearn wrappers work correctly.
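One way such a conversion can be sketched. The field names (`n_nodes`, `features`, `left_children`, ...) are those the review comments say consumers expect; the flat-array layout and the dict-based TreeStructure stand-in are assumptions, not OpenBoost's actual types:

```python
import numpy as np

def legacy_tree_to_structure(features, thresholds, left_children, right_children):
    """Bundle a legacy tree's flat per-node arrays into a record exposing
    the fields that importance accumulation and persistence read."""
    features = np.asarray(features, dtype=np.int32)
    return {
        "n_nodes": len(features),
        "features": features,
        "thresholds": np.asarray(thresholds, dtype=np.float64),
        "left_children": np.asarray(left_children, dtype=np.int32),
        "right_children": np.asarray(right_children, dtype=np.int32),
    }

# A 3-node stump: root splits on feature 0, children are leaves (-1).
tree = legacy_tree_to_structure([0, -1, -1], [0.5, 0.0, 0.0],
                                [1, -1, -1], [2, -1, -1])
```

Doing the fused prediction on the legacy tree first and converting afterwards, as the reply describes, keeps the GPU fast path while giving downstream consumers a uniform type.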
src/openboost/_models/_boosting.py (outdated)

```python
pred_cpu += self.learning_rate * tree_pred
cuda.to_device(pred_cpu, to=pred_gpu)

self.trees_.append(tree)
```
Keep GPU-native trees compatible with PersistenceMixin
This append also breaks persistence for CUDA-trained models, because PersistenceMixin.save() serializes each entry in trees_ through _tree_to_dict() (src/openboost/_persistence.py:45-64), which assumes a TreeStructure with features/thresholds/left_children/.... A legacy Tree from fit_tree_gpu_native() does not expose those fields, so saving a GPU-trained GradientBoosting—or using ModelCheckpoint, which calls model.save() during training—will now raise instead of writing a checkpoint.
Fixed in bda4781 (same commit as above). Since the legacy Tree is converted to TreeStructure before storing, PersistenceMixin.save() / _tree_to_dict() now gets the fields it expects. ModelCheckpoint callback works correctly as well.
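A hedged sketch of the kind of serialization `_tree_to_dict()` performs, per the comment above: array fields become plain lists so the model is JSON-serializable. The exact field set and helper name here are assumptions:

```python
import json
import numpy as np

def tree_to_dict_sketch(tree: dict) -> dict:
    """Convert a tree's NumPy node arrays to plain Python lists so the
    whole model can be written as JSON by save()/ModelCheckpoint."""
    return {key: np.asarray(value).tolist() for key, value in tree.items()}

tree = {
    "features": np.array([0, -1, -1], dtype=np.int32),
    "thresholds": np.array([0.5, 0.0, 0.0]),
    "left_children": np.array([1, -1, -1], dtype=np.int32),
}
payload = json.dumps(tree_to_dict_sketch(tree))
restored = json.loads(payload)  # round-trips to plain lists
```

This is why the conversion to TreeStructure matters: a legacy Tree without these array attributes gives the serializer nothing to iterate over.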
- Add E402, F821 to ruff ignore list (intentional patterns)
- Auto-fix: unused imports, f-string placeholders, nested ifs, ternaries
- Add `from err` to raise-in-except clauses (B904)
- Suppress intentional ABC no-op methods (B024/B027)
- Rename unused loop variables (B007)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
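The B904 item above refers to explicit exception chaining: re-raising inside an `except` block with `from err` so the causal chain is unambiguous. A minimal sketch (the function is illustrative, not from this codebase):

```python
def get_setting(config: dict, key: str) -> str:
    """Look up a required setting, chaining the original KeyError so
    tracebacks show 'The above exception was the direct cause of...'."""
    try:
        return config[key]
    except KeyError as err:
        # ruff B904: without `from err`, the KeyError context is implicit
        # and can be mistaken for an unrelated secondary failure.
        raise ValueError(f"missing required setting: {key}") from err
```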
… Tree to TreeStructure

Addresses PR review comments:

- Skip fit_tree_gpu_native when BinnedArray has missing values or categorical features, since the GPU-native builder doesn't support them
- Convert legacy Tree to TreeStructure after GPU-native building so that feature importance, persistence, and sklearn wrappers work correctly
- Do fused prediction on the legacy Tree before converting (keeps perf benefit)
- Restructure prediction update to avoid double-append to trees_

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…raints

- Add profiling commands and OPENBOOST_PROFILE env var
- Document pytest-xdist parallel execution and conftest.py
- Update ruff ignore list (E402, F821)
- Document GPU-native builder limitations (no missing/categorical)
- Add _profiler.py to architecture section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Wire `fit_tree_gpu_native` into the training loop; add fused prediction updates, histogram subtraction, and GPU array caching
- `ProfilingCallback` produces structured JSON reports identifying bottlenecks for self-recursive improvement loops
- `pytest-xdist` for a 1.4x faster test suite (312s → 226s)
- Restructured benchmarks into focused modules (`compare_cpu`, `compare_gpu`, `profile_kernels`, `profile_loop`, `check_performance`, `check_scaling`)
- `conftest.py` with shared fixtures, 7 new test files

GPU Performance Results (A100)
Profiling Usage
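The body of this section did not survive extraction. As a generic illustration only of the pattern a per-iteration profiling callback with structured JSON reports typically follows (this sketch is not OpenBoost's actual ProfilingCallback API; class and method names are invented):

```python
import json
import time

class ProfilingCallbackSketch:
    """Generic sketch: time each boosting iteration, emit a JSON report."""

    def __init__(self):
        self.timings = []
        self._t0 = None

    def on_iteration_start(self, i: int) -> None:
        self._t0 = time.perf_counter()

    def on_iteration_end(self, i: int) -> None:
        elapsed = time.perf_counter() - self._t0
        self.timings.append({"iteration": i, "seconds": elapsed})

    def report(self) -> str:
        # Structured output so an automated improvement loop can parse it.
        total = sum(t["seconds"] for t in self.timings)
        return json.dumps({"total_seconds": total, "iterations": self.timings})
```

The value of the structured report is that a script (or model) can consume it programmatically to find the slowest stage, rather than eyeballing log lines.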
Test plan
- … (`-n auto`)

🤖 Generated with Claude Code