GPU performance optimizations, profiling, and benchmarks #17

Merged
jxucoder merged 5 commits into main from gpu-perf-profiling-tests
Mar 23, 2026

Conversation

@jxucoder (Owner)

Summary

  • GPU kernel optimizations: workspace caching, histogram subtraction trick, smaller-child histograms, async D2D copies, float64 precision for histograms/node sums, in-place gradient computation
  • Profiling infrastructure: ProfilingCallback with per-phase timing, bottleneck identification, and JSON reports; benchmarks/profile_loop.py CLI runner
  • Benchmark suite: fair GPU benchmarks (OpenBoost vs XGBoost) with Modal A100 support, CPU comparisons, realistic data generators, and automated performance checks
  • Test suite expansion: shared fixtures via conftest.py, new test files for binning, callbacks, GAM, kernel correctness, linear leaf, loss functions, and numerical agreement with XGBoost
  • Bug fixes: float32 precision in GPU split finding, MSE gradient convention (0.5*MSE), D2D optimization disabling for all callbacks, GPU-native builder guards for missing/categorical data
  • README update: benchmarks section with A100 results, active development notice
  • Autoresearch scripts: automated evaluation, profiling, and progress tracking on Modal
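The histogram subtraction trick mentioned above can be sketched in a few lines of NumPy (illustrative arrays, not the library's internal layout):

```python
import numpy as np

# Per-node histogram: one row per feature bin, columns are
# (sum of gradients, sum of hessians) for samples falling in that bin.
parent = np.array([[5.0, 3.0], [2.0, 4.0], [1.0, 2.0]])
left   = np.array([[2.0, 1.0], [0.5, 3.0], [1.0, 0.5]])

# Build a histogram only for the smaller child; the sibling's histogram
# is recovered by subtraction instead of a second accumulation pass.
right = parent - left

print(right)
```

Building only the smaller child's histogram roughly halves accumulation work per level, at the cost of the precision hazard discussed in the commit messages below.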

Test plan

  • uv run pytest tests/ -v --tb=short passes on CPU
  • OPENBOOST_BACKEND=cuda uv run pytest tests/ passes on GPU
  • uv run python benchmarks/bench_gpu.py --task all runs without errors
  • Numerical agreement tests validate parity with XGBoost

🤖 Generated with Claude Code

jxucoder and others added 5 commits March 23, 2026 01:24
The GPU-native _find_level_splits_kernel accumulated prefix sums and
computed split gains in float32, while the CPU path uses float64.
Over 300 trees x depth 8, this caused 3+ percentage points of R²
degradation (0.907 vs 0.939 on CPU).

Promote prefix sums, total sums, and gain computation to float64 in
the split kernel. Shared memory totals also use float64. Final results
downcast to float32 for storage.
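A minimal NumPy illustration of the underlying hazard (not the kernel itself): a prefix scan over many small float32 terms accumulates rounding error that a float64 scan avoids.

```python
import numpy as np

# One million small gradient-like terms, as a split-finding prefix scan sees.
rng = np.random.default_rng(0)
vals = rng.normal(scale=1e-3, size=1_000_000).astype(np.float32)

prefix32 = np.cumsum(vals, dtype=np.float32)   # accumulate in float32
prefix64 = np.cumsum(vals.astype(np.float64))  # accumulate in float64

# The float32 scan drifts away from the float64 reference.
print(abs(prefix32[-1] - prefix64[-1]))
```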

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change MSE gradient from 2*(pred-y) to (pred-y) and hessian from 2.0
to 1.0, matching XGBoost's convention of optimizing 0.5*(pred-y)^2.
This makes reg_lambda have equivalent regularization strength across
libraries, closing the accuracy gap on benchmarks.

compute_loss_value still reports standard MSE (matching XGBoost's eval
metric convention).
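The convention change can be sketched as follows (a sketch of the math, not the package's code): with the objective 0.5*(pred - y)^2, the gradient is pred - y and the Hessian is 1, so the Newton leaf value -G/(H + reg_lambda) weights reg_lambda the same way XGBoost does.

```python
import numpy as np

def mse_grad_hess(pred, y):
    # d/dpred 0.5*(pred - y)^2 = pred - y ; second derivative = 1
    grad = (pred - y).astype(np.float32)
    hess = np.ones_like(grad)
    return grad, hess

pred = np.array([0.5, 2.0], dtype=np.float32)
y = np.array([1.0, 3.0], dtype=np.float32)
g, h = mse_grad_hess(pred, y)

reg_lambda = 1.0
# Newton leaf value; under the old 2*(pred - y) convention both G and H
# double, so reg_lambda is effectively halved relative to XGBoost.
leaf = -g.sum() / (h.sum() + reg_lambda)
print(leaf)  # 0.5
```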

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Float32 histogram accumulation via atomic adds, combined with
histogram subtraction (right = parent - left), caused catastrophic
cancellation and precision loss that degraded R² by ~3% vs the CPU
path. This change:

- Histogram storage: float32 → float64 (shared memory stays float32,
  values promoted on write to global)
- node_sum_grad/node_sum_hess: float32 → float64
- Leaf value computation: float64 precision
- Split gain comparison: avoid float64→float32 downcast

Shared memory histograms remain float32 (48KB limit). Per-chunk
accumulation (~16K samples) is fine in float32; the precision
issue was in global accumulation across many chunks.
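The cancellation hazard described above can be reproduced in two lines (illustrative numbers, not the kernel): when one child holds nearly all of a bin's mass, float32 rounding error in the parent total dominates the subtracted sibling value.

```python
import numpy as np

parent = 1_000_000.123456  # accumulated grad sum for the parent node
left = 1_000_000.0         # left child holds almost all the mass

# float32: parent rounds to the nearest representable value (ulp = 0.0625
# near 1e6), so the subtraction keeps almost no true significant digits.
right32 = np.float32(parent) - np.float32(left)
# float64: ulp near 1e6 is ~1.2e-10, so the small difference survives.
right64 = np.float64(parent) - np.float64(left)

print(right32, right64)  # 0.125 vs ~0.123456
```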

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GPU benchmark results (OpenBoost vs XGBoost on A100), reproducibility
instructions, variant model comparisons, and a note that OpenBoost is in
active development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Automated benchmarking and profiling pipeline: Modal-based evaluation,
timing experiments, score tracking, and progress visualization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jxucoder jxucoder merged commit 45a3125 into main Mar 23, 2026
4 checks passed
@jxucoder jxucoder deleted the gpu-perf-profiling-tests branch March 23, 2026 16:00

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e18429ba20


Comment on lines +167 to +168
grad = (pred - y).astype(np.float32)
hess = np.full_like(pred, 1.0, dtype=np.float32)

P1: Keep fit_trees_batch consistent with the new MSE scaling

This switches the built-in MSE objective to grad = pred - y / hess = 1, but the CPU batch-training fallback still recomputes later rounds as current_grad = grad + 2.0 * pred and assumes a constant Hessian of 2 in src/openboost/_core/_tree.py:889-893. Any multi-round fit_trees_batch(...) run for MSE on CPU will therefore build trees from gradients that are 2× too large from round 2 onward, so batch hyperparameter searches stop matching GradientBoosting(loss="mse") after this commit.
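A small numeric check of the mismatch the comment describes (variable names are illustrative, not the actual `_tree.py` code; this is one consistent reading of the quoted `grad + 2.0 * pred` update): an old-convention recomputation yields gradients exactly twice the new convention's.

```python
import numpy as np

y = np.array([1.0, 3.0])
pred = np.array([0.5, 2.0])

# New convention (this PR): gradient of 0.5*(pred - y)^2.
new_grad = pred - y

# Old-convention recomputation, reconstructed from the review's description
# of the CPU batch fallback: base gradient -2*y plus 2*pred, i.e. 2*(pred - y).
stale_grad = -2.0 * y + 2.0 * pred

print(stale_grad / new_grad)  # 2.0 everywhere
```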


Comment on lines +67 to +68
4. Run: uv run python development/autoresearch/evaluate.py
Parse the output for RESULT and SCORE.

P2: Propagate evaluation flags into each autoresearch iteration

The usage string advertises ./development/autoresearch/run.sh 10 --quick, but every iteration still hard-codes uv run python development/autoresearch/evaluate.py here. Because EVAL_FLAGS is never appended to this command (or to the baseline run), callers cannot actually request --quick or any other evaluation mode, so the documented smoke-test workflow still runs the full benchmark/test path each iteration.


Comment on lines +190 to +194
_find_level_splits_kernel[n_nodes_at_level, 256](
histograms, level_start, level_end,
np.float32(1.0), np.float32(1.0), np.float32(0.0),
node_features, node_thresholds, node_gains,
node_sum_grad, node_sum_hess

P2: Pass the new node_left_hess buffer to split profiling

The kernel-level breakdown still launches _find_level_splits_kernel with the pre-change argument list. In this commit src/openboost/_backends/_cuda.py adds a required node_left_hess output to that kernel, so uv run modal run development/autoresearch/time_breakdown_modal.py now fails at runtime before it can print the per-depth timing table.


