GPU performance optimizations, profiling, and benchmarks #17

Merged
jxucoder merged 5 commits into main from gpu-perf-profiling-tests
Mar 23, 2026

Conversation

@jxucoder (Owner)

Summary

  • GPU kernel optimizations: workspace caching, histogram subtraction trick, smaller-child histograms, async D2D copies, float64 precision for histograms/node sums, in-place gradient computation
  • Profiling infrastructure: ProfilingCallback with per-phase timing, bottleneck identification, and JSON reports; benchmarks/profile_loop.py CLI runner
  • Benchmark suite: fair GPU benchmarks (OpenBoost vs XGBoost) with Modal A100 support, CPU comparisons, realistic data generators, and automated performance checks
  • Test suite expansion: shared fixtures via conftest.py, new test files for binning, callbacks, GAM, kernel correctness, linear leaf, loss functions, and numerical agreement with XGBoost
  • Bug fixes: float32 precision in GPU split finding, MSE gradient convention (0.5*MSE), D2D optimization disabling for all callbacks, GPU-native builder guards for missing/categorical data
  • README update: benchmarks section with A100 results, active development notice
  • Autoresearch scripts: automated evaluation, profiling, and progress tracking on Modal
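The histogram subtraction trick mentioned above can be sketched in a few lines of NumPy (illustrative arrays, not the library's internal layout):

```python
import numpy as np

# Per-node histogram: one row per feature bin, columns are
# (sum of gradients, sum of hessians) for samples falling in that bin.
parent = np.array([[5.0, 3.0], [2.0, 4.0], [1.0, 2.0]])
left   = np.array([[2.0, 1.0], [0.5, 3.0], [1.0, 0.5]])

# Build a histogram only for the smaller child; the sibling's histogram
# is recovered by subtraction instead of a second accumulation pass.
right = parent - left

print(right)
```

Building only the smaller child's histogram roughly halves accumulation work per level, at the cost of the precision hazard discussed in the commit messages below.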

Test plan

  • uv run pytest tests/ -v --tb=short passes on CPU
  • OPENBOOST_BACKEND=cuda uv run pytest tests/ passes on GPU
  • uv run python benchmarks/bench_gpu.py --task all runs without errors
  • Numerical agreement tests validate parity with XGBoost

🤖 Generated with Claude Code

jxucoder and others added 5 commits March 23, 2026 01:24
The GPU-native _find_level_splits_kernel accumulated prefix sums and
computed split gains in float32, while the CPU path uses float64.
Over 300 trees x depth 8, this caused 3+ percentage points of R²
degradation (0.907 vs 0.939 on CPU).

Promote prefix sums, total sums, and gain computation to float64 in
the split kernel. Shared memory totals also use float64. Final results
downcast to float32 for storage.
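A minimal NumPy illustration of the underlying hazard (not the kernel itself): a prefix scan over many small float32 terms accumulates rounding error that a float64 scan avoids.

```python
import numpy as np

# One million small gradient-like terms, as a split-finding prefix scan sees.
rng = np.random.default_rng(0)
vals = rng.normal(scale=1e-3, size=1_000_000).astype(np.float32)

prefix32 = np.cumsum(vals, dtype=np.float32)   # accumulate in float32
prefix64 = np.cumsum(vals.astype(np.float64))  # accumulate in float64

# The float32 scan drifts away from the float64 reference.
print(abs(prefix32[-1] - prefix64[-1]))
```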

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change MSE gradient from 2*(pred-y) to (pred-y) and hessian from 2.0
to 1.0, matching XGBoost's convention of optimizing 0.5*(pred-y)^2.
This makes reg_lambda have equivalent regularization strength across
libraries, closing the accuracy gap on benchmarks.

compute_loss_value still reports standard MSE (matching XGBoost's eval
metric convention).
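The convention change can be sketched as follows (a sketch of the math, not the package's code): with the objective 0.5*(pred - y)^2, the gradient is pred - y and the Hessian is 1, so the Newton leaf value -G/(H + reg_lambda) weights reg_lambda the same way XGBoost does.

```python
import numpy as np

def mse_grad_hess(pred, y):
    # d/dpred 0.5*(pred - y)^2 = pred - y ; second derivative = 1
    grad = (pred - y).astype(np.float32)
    hess = np.ones_like(grad)
    return grad, hess

pred = np.array([0.5, 2.0], dtype=np.float32)
y = np.array([1.0, 3.0], dtype=np.float32)
g, h = mse_grad_hess(pred, y)

reg_lambda = 1.0
# Newton leaf value; under the old 2*(pred - y) convention both G and H
# double, so reg_lambda is effectively halved relative to XGBoost.
leaf = -g.sum() / (h.sum() + reg_lambda)
print(leaf)  # 0.5
```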

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Float32 histogram accumulation via atomic adds, combined with
histogram subtraction (right = parent - left), caused catastrophic
cancellation and precision loss that degraded R² by ~3% vs the CPU
path. This change:

- Histogram storage: float32 → float64 (shared memory stays float32,
  values promoted on write to global)
- node_sum_grad/node_sum_hess: float32 → float64
- Leaf value computation: float64 precision
- Split gain comparison: avoid float64→float32 downcast

Shared memory histograms remain float32 (48KB limit). Per-chunk
accumulation (~16K samples) is fine in float32; the precision
issue was in global accumulation across many chunks.
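The cancellation hazard described above can be reproduced in two lines (illustrative numbers, not the kernel): when one child holds nearly all of a bin's mass, float32 rounding error in the parent total dominates the subtracted sibling value.

```python
import numpy as np

parent = 1_000_000.123456  # accumulated grad sum for the parent node
left = 1_000_000.0         # left child holds almost all the mass

# float32: parent rounds to the nearest representable value (ulp = 0.0625
# near 1e6), so the subtraction keeps almost no true significant digits.
right32 = np.float32(parent) - np.float32(left)
# float64: ulp near 1e6 is ~1.2e-10, so the small difference survives.
right64 = np.float64(parent) - np.float64(left)

print(right32, right64)  # 0.125 vs ~0.123456
```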

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GPU benchmark results (OpenBoost vs XGBoost on A100), reproducibility
instructions, variant model comparisons, and a note that OpenBoost is in
active development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Automated benchmarking and profiling pipeline: Modal-based evaluation,
timing experiments, score tracking, and progress visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jxucoder jxucoder merged commit 45a3125 into main Mar 23, 2026
4 checks passed
@jxucoder jxucoder deleted the gpu-perf-profiling-tests branch March 23, 2026 16:00

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e18429ba20


Comment on lines +167 to +168
grad = (pred - y).astype(np.float32)
hess = np.full_like(pred, 1.0, dtype=np.float32)

P1: Keep fit_trees_batch consistent with the new MSE scaling

This switches the built-in MSE objective to grad = pred - y / hess = 1, but the CPU batch-training fallback still recomputes later rounds as current_grad = grad + 2.0 * pred and assumes a constant Hessian of 2 in src/openboost/_core/_tree.py:889-893. Any multi-round fit_trees_batch(...) run for MSE on CPU will therefore build trees from gradients that are 2× too large from round 2 onward, so batch hyperparameter searches stop matching GradientBoosting(loss="mse") after this commit.
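A small numeric check of the mismatch the comment describes (variable names are illustrative, not the actual `_tree.py` code; this is one consistent reading of the quoted `grad + 2.0 * pred` update): an old-convention recomputation yields gradients exactly twice the new convention's.

```python
import numpy as np

y = np.array([1.0, 3.0])
pred = np.array([0.5, 2.0])

# New convention (this PR): gradient of 0.5*(pred - y)^2.
new_grad = pred - y

# Old-convention recomputation, reconstructed from the review's description
# of the CPU batch fallback: base gradient -2*y plus 2*pred, i.e. 2*(pred - y).
stale_grad = -2.0 * y + 2.0 * pred

print(stale_grad / new_grad)  # 2.0 everywhere
```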


Comment on lines +67 to +68
4. Run: uv run python development/autoresearch/evaluate.py
Parse the output for RESULT and SCORE.

P2: Propagate evaluation flags into each autoresearch iteration

The usage string advertises ./development/autoresearch/run.sh 10 --quick, but every iteration still hard-codes uv run python development/autoresearch/evaluate.py here. Because EVAL_FLAGS is never appended to this command (or to the baseline run), callers cannot actually request --quick or any other evaluation mode, so the documented smoke-test workflow still runs the full benchmark/test path each iteration.


Comment on lines +190 to +194
_find_level_splits_kernel[n_nodes_at_level, 256](
histograms, level_start, level_end,
np.float32(1.0), np.float32(1.0), np.float32(0.0),
node_features, node_thresholds, node_gains,
node_sum_grad, node_sum_hess

P2: Pass the new node_left_hess buffer to split profiling

The kernel-level breakdown still launches _find_level_splits_kernel with the pre-change argument list. In this commit src/openboost/_backends/_cuda.py adds a required node_left_hess output to that kernel, so uv run modal run development/autoresearch/time_breakdown_modal.py now fails at runtime before it can print the per-depth timing table.


