This repository was archived by the owner on Apr 29, 2026. It is now read-only.

new benchmarks #185

Merged
navicore merged 3 commits into main from i-179 on Jan 5, 2026

Conversation

@navicore navicore (Owner) commented Jan 5, 2026

No description provided.

claude Bot commented Jan 5, 2026

PR Review: New Benchmarks

Summary

This PR adds a comprehensive compute benchmark suite alongside the existing concurrency benchmarks, comparing Seq performance against Rust and Go. Overall, this is a high-quality addition that significantly expands the project's benchmarking capabilities.

Code Quality & Best Practices ✅

Strengths:

  • Excellent organization: Clear separation between concurrency and compute benchmarks
  • Consistent implementation: All three languages (Seq, Rust, Go) implement identical algorithms
  • Well-documented: README updates are thorough and explain benchmark purposes clearly
  • Good shell scripting: run.sh properly uses set -e, handles colors, checks dependencies
  • Proper optimization flags: Rust uses -O, Go uses default optimizations
  • Validation built-in: All benchmarks verify expected results and exit with proper codes
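
The dependency-check pattern praised above can be sketched as follows. This is illustrative, not the actual run.sh code: the `need` helper name and the tools checked are assumptions.

```shell
#!/usr/bin/env bash
set -e

# Fail fast with a clear message when a required tool is missing.
need() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "error: required tool '$1' not found" >&2
        return 1
    }
}

need bash
need rm
echo "all dependencies present"
```

Checking every dependency up front, before any compilation starts, keeps partial-build debris out of the benchmark directory.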

Seq Code Quality:

  • fib.seq: Clean recursive implementation with proper base cases
  • sum_squares.seq: Good use of quotations with while combinator
  • primes.seq: Well-structured nested loops with clear stack effect comments

Potential Issues 🔍

1. Integer Overflow Risk (Low Severity)

In the sum_squares benchmarks, the result is 333,333,833,333,500,000, which is safely within i64 range; however, if the limit is raised beyond ~3 million, the accumulator will overflow silently. Consider adding a comment documenting the safe upper limit.
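
The expected value and the safe limit can both be sanity-checked from the closed form; a quick bash sketch (bash arithmetic is 64-bit two's-complement, so it would wrap silently just like the benchmark):

```shell
#!/usr/bin/env bash
# Sum of squares 1..n has the closed form n(n+1)(2n+1)/6.
# For n = 1,000,000 every intermediate product fits in a signed
# 64-bit integer (the largest is ~2.0e18, well under ~9.22e18),
# so the documented benchmark result can be reproduced exactly.
# Around n ~ 3,000,000 the intermediate product and the running
# total approach i64::MAX and wrap without warning.
n=1000000
expected=$(( n * (n + 1) * (2 * n + 1) / 6 ))
echo "$expected"   # 333333833333500000
```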

2. Shell Script Cleanup Issue

benchmarks/compute/run.sh:78-80 - Cleanup only removes specific binary names. If a benchmark is added/renamed, old binaries won't be cleaned. Consider using wildcards instead.

3. LATEST_RUN.txt Update

Ensure CI actually ran all benchmarks including the new compute ones.

Performance Considerations 🚀

Good:

  • Benchmark selection is appropriate (fib tests recursion, sum_squares tests loops, primes tests nested loops)
  • Rust and Go use release/optimized builds
  • Hyperfine integration with proper warmup and min-runs

Considerations:

  • The naive recursive Fibonacci is exponential, O(2^n). This is intentional for testing recursion overhead, but a comment clarifying the choice over a memoized or iterative version would be helpful.

Security Concerns 🔒

No significant security issues. Good practices observed:

  • All shell variables are properly quoted
  • mktemp is used correctly, and files are cleaned up
  • Fixed benchmark parameters (no user input)
  • Proper exit code validation
  • No use of eval or unquoted variables

Test Coverage 📊

Strengths:

  • All benchmarks have built-in validation (exit code 0 on success, 1 on failure)
  • Expected values are documented and checked
  • Three implementations per benchmark ensure correctness

Gaps:

  • No explicit test suite for the benchmarks themselves
  • Consider adding a test that verifies all three implementations produce identical results
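
A cross-check of the kind suggested above could look like this sketch. The shell functions here stand in for the real benchmark binaries (the real script would invoke ./fib_seq, ./fib_rust, ./fib_go); 102334155 is fib(40).

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-ins for the three benchmark binaries.
fib_seq()  { echo 102334155; }
fib_rust() { echo 102334155; }
fib_go()   { echo 102334155; }

# Treat one implementation as the reference and require the
# others to print byte-identical output.
expected=$(fib_rust)
for impl in fib_seq fib_go; do
    actual=$("$impl")
    if [ "$actual" != "$expected" ]; then
        echo "✗ $impl: got $actual, expected $expected" >&2
        exit 1
    fi
    echo "✓ $impl matches ($actual)"
done
```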

Documentation 📚

Excellent:

  • README updates are thorough and well-structured
  • New benchmarks/compute/README.md clearly explains each benchmark
  • Sample results table format is clear
  • Usage examples are helpful

Minor suggestions:

  • The TBD placeholders in compute/README.md should be filled in before merge

Recommendations 📝

Must Fix:

  • Fill in sample results in benchmarks/compute/README.md (currently TBD)
  • Verify CI runs compute benchmarks and updates LATEST_RUN.txt accordingly

Should Fix:

  • Add comment about overflow risk in sum_squares implementations
  • Change cleanup in compute/run.sh to use wildcards
  • Consider adding a correctness test script

Nice to Have:

  • Add comment explaining why naive recursive fib is chosen over iterative
  • Document the theoretical performance expectations (interpreter overhead range)
  • Consider adding a benchmark for string manipulation or data structure operations

Verdict ✨

Recommendation: APPROVE with minor suggestions

This is a well-crafted PR that significantly enhances the benchmark suite. The code quality is high, the implementations are correct, and the documentation is excellent. The identified issues are minor and don't block merging.

Impact:

  • Provides valuable insight into Seq's interpreter overhead
  • Demonstrates that ~15-30x slowdown vs native code is reasonable for an interpreter
  • Complements existing concurrency benchmarks well
  • Sets good foundation for future benchmark additions

Great work! The project now has both concurrency and compute benchmarks for comprehensive performance evaluation.

  Changes Made

  Must Fix (Done)

  1. Filled in sample results in benchmarks/compute/README.md - Added actual benchmark results from MacBook Pro M-series

  Should Fix (Done)

  2. Added overflow risk comments to all three sum_squares implementations:
    - sum_squares.seq
    - sum_squares.rs
    - sum_squares.go
  3. Changed cleanup to use wildcards in benchmarks/compute/run.sh:
  # Before
  rm -f fib_seq fib_rust fib_go
  rm -f sum_squares_seq sum_squares_rust sum_squares_go
  rm -f primes_seq primes_rust primes_go

  # After
  rm -f *_seq *_rust *_go

  Nice to Have (Done)

  4. Added explanation for naive recursive fib in benchmarks/compute/README.md:
  Why naive recursion? The exponential O(2^n) algorithm is intentional - it stress-tests function call overhead and stack operations, which are key performance indicators for interpreted languages.
  5. Added performance expectations section explaining typical interpreter overhead ranges (10-20x good, 20-50x expected, >50x investigate)
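
  The ranges in item 5 could even be applied mechanically in a report script; a sketch (the thresholds mirror the README guidance, the function and variable names are illustrative):

```shell
#!/usr/bin/env bash
# Classify interpreter overhead given two timings in milliseconds.
# Thresholds follow the README guidance: <=20x good, <=50x expected,
# anything above that worth investigating.
classify() {
    local seq_ms=$1 native_ms=$2
    local ratio=$(( seq_ms / native_ms ))
    if   [ "$ratio" -le 20 ]; then echo "${ratio}x: good"
    elif [ "$ratio" -le 50 ]; then echo "${ratio}x: expected"
    else                           echo "${ratio}x: investigate"
    fi
}

classify 300 20    # 15x: good
classify 900 20    # 45x: expected
classify 1200 20   # 60x: investigate
```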

  All CI checks pass.
claude Bot commented Jan 5, 2026

Code Review - PR #185: New Benchmarks

This PR adds comprehensive compute benchmarks and Rust comparison benchmarks. Overall, this is excellent work with high-quality implementations. Here's my detailed review:


Strengths

1. Well-Structured Additions

  • Clear separation between concurrency benchmarks and compute benchmarks
  • Excellent documentation in both main and compute README files
  • Consistent implementation patterns across Seq, Rust, and Go versions

2. Code Quality

  • Seq implementations: Clean, idiomatic stack-based code with correct tail recursion

    • fib.seq: Proper recursive structure
    • sum_squares.seq: Good use of while loop pattern
    • primes.seq: Correctly implements trial division with tail-recursive helpers
  • Rust implementations: Well-written with proper optimization flags (-O)

  • Go implementations: Clean and equivalent to other implementations

  • All implementations use equivalent algorithms (no unfair optimizations)

3. Testing & Validation

  • Each benchmark validates its output against expected values
  • Exit codes properly indicate success/failure
  • Expected values are documented in comments

4. Documentation Excellence

  • benchmarks/README.md: Comprehensive guide with clear categories, usage instructions, and interpretation guidelines
  • benchmarks/compute/README.md: Excellent explanations of why each benchmark matters
  • Good rationale for naive Fibonacci (testing function call overhead vs. iterative)

🔍 Issues & Recommendations

Critical: Missing Justfile Integration

The new compute benchmarks are not integrated into the justfile. Users won't discover them easily.

Recommendation: Add these recipes to justfile:

# Run compute benchmarks (Seq vs Rust vs Go)
bench-compute: build
    @echo "Running compute benchmarks..."
    cd benchmarks/compute && ./run.sh

# Run individual compute benchmarks
bench-fib: build
    @echo "Running Fibonacci benchmark..."
    cd benchmarks && ./run.sh fib

bench-sum-squares: build
    @echo "Running sum of squares benchmark..."
    cd benchmarks && ./run.sh sum_squares

bench-primes: build
    @echo "Running prime counting benchmark..."
    cd benchmarks && ./run.sh primes

Also update the main bench recipe description to mention both categories:

# Run all benchmarks (concurrency + compute)
bench: build
    @echo "Running all benchmarks..."
    cd benchmarks && ./run.sh

Minor: Shell Script Robustness

File: benchmarks/compute/run.sh

The script is good but could benefit from:

  1. Variable quoting: $SEQC is set and used correctly, but the quoting pattern is applied inconsistently elsewhere in the script
  2. Error handling: The script uses set -e which is good, but individual command failures in run_benchmark could be more explicit

Not blocking, but consider for future iterations.

Minor: Documentation Clarity

File: benchmarks/run.sh (lines 8-9)

The usage comment mentions ./run.sh compute and ./run.sh concurrency; since the implementation is truncated in the diff, verify that these commands actually work. If not, update the usage docs.

Minor: Benchmark Freshness Check

The check-bench-freshness in the justfile checks benchmarks/LATEST_RUN.txt. The PR updates this file, which is correct, but consider:

  • Does the CI run benchmarks automatically on PRs?
  • Should compute benchmarks also update this timestamp?

Not blocking, but worth clarifying in a comment.


🔒 Security

✅ No security concerns. The code:

  • Uses safe arithmetic (proper bounds checking)
  • No file I/O or network operations
  • No unsafe code blocks in Rust implementations
  • Proper memory management in all implementations

Performance Considerations

Good Design Choices

  1. Overflow awareness: Sum of squares correctly notes the 3M limit for i64
  2. Algorithm equivalence: All three languages use the same algorithms (fair comparison)
  3. Compiler flags: Rust uses -O, Go uses default optimizations

Observation: Integer Overflow Risk

File: benchmarks/compute/sum_squares.seq:24

The current limit of 1M is safe, but the comment warns about 3M+. Consider adding runtime overflow detection in future iterations, though this is not blocking for this PR since the benchmarks use safe values.


🧪 Test Coverage

Excellent

  • All benchmarks have expected output validation
  • Exit codes properly indicate pass/fail
  • Consistent test approach across all implementations

Suggestion: Add CI Integration Test

Currently, the benchmarks aren't run in CI (only freshness is checked). Consider adding a quick smoke test:

# In .github/workflows/ci-linux.yml
- name: Verify benchmarks compile and run
  run: |
    just bench-fib
    just bench-sum-squares  
    just bench-primes

This ensures benchmarks don't break without requiring full benchmark runs on every PR.


📊 Code Quality Summary

| Category      | Rating | Notes                                      |
| ------------- | ------ | ------------------------------------------ |
| Code Quality  | ⭐⭐⭐⭐⭐  | Clean, idiomatic implementations           |
| Documentation | ⭐⭐⭐⭐⭐  | Excellent READMEs with clear guidance      |
| Testing       | ⭐⭐⭐⭐   | Good validation, could add CI smoke tests  |
| Performance   | ⭐⭐⭐⭐⭐  | Fair comparisons, appropriate algorithms   |
| Integration   | ⭐⭐⭐    | Missing justfile recipes                   |

🎯 Recommendations Before Merge

  1. Required: Add compute benchmark recipes to justfile
  2. Recommended: Verify ./run.sh compute and ./run.sh concurrency work as documented
  3. Optional: Consider adding CI smoke test for benchmark compilation

🎉 Conclusion

This is high-quality work that significantly expands the benchmark suite. The compute benchmarks provide valuable insight into interpreter overhead, and the Rust comparisons offer an interesting perspective on concurrency models.

The implementations are correct, well-documented, and fair. The only blocking issue is the missing justfile integration, which is a quick fix.

Recommendation: Approve after adding justfile recipes.

Great work! 🚀

claude Bot commented Jan 5, 2026

PR Review - New Benchmarks Suite

Summary

This PR adds a comprehensive compute benchmark suite comparing Seq, Rust, and Go performance on pure computation tasks (fibonacci, sum of squares, prime counting), plus Rust implementations for existing concurrency benchmarks. The PR also addresses feedback from a previous review.

Code Quality ✅

Strengths:

  • Excellent documentation: README files are thorough and well-structured with clear explanations of what each benchmark tests
  • Consistent implementation patterns: All three language implementations (Seq/Rust/Go) follow equivalent algorithms, making comparisons fair
  • Good code organization: Compute benchmarks properly separated into their own directory with dedicated runner script
  • Clear commenting: Seq implementations include helpful stack effect comments that aid understanding
  • Professional shell scripting: The run.sh scripts properly handle errors (set -e), have color-coded output, and gracefully degrade when optional tools (hyperfine, jq) are unavailable

Areas for improvement:

  • The run.sh script at benchmarks/run.sh has grown quite large (255 lines). Consider extracting the summary table generation into a separate function or helper script for better maintainability
  • Minor: The .gitignore patterns could be consolidated (e.g., *_rust and *_seq cover compute benchmarks already)

Potential Bugs and Issues ✅

No critical bugs found. However, a few observations:

  1. Integer overflow noted but not handled: All three sum_squares implementations correctly document the overflow risk for n > 3M. This is acceptable for a benchmark, and the comment is accurate and helpful.

  2. Division by zero protection: In benchmarks/run.sh lines 73-76 and similar, the code properly checks elapsed_ms > 0 before division. Good defensive programming.

  3. Shell script portability: Uses bash-specific features (${BASH_SOURCE[0]}, arrays). This is fine since the shebang explicitly declares bash, but worth noting for portability.

  4. Temporary file cleanup: The json_file=$(mktemp) followed by rm -f "$json_file" pattern is correct. Consider using a trap for cleanup to handle interrupt signals:

    trap 'rm -f "$json_file"' EXIT
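
    Expanded into a minimal self-contained sketch (the file name and contents are hypothetical, not the actual run.sh code):

```shell
#!/usr/bin/env bash
set -e

# Create the temp file, then register cleanup immediately so the file
# is removed on normal exit, on error exit (via set -e), or on Ctrl-C.
json_file=$(mktemp)
trap 'rm -f "$json_file"' EXIT INT TERM

echo '{"results": []}' > "$json_file"
echo "wrote $(wc -c < "$json_file") bytes to temp file"
# No explicit rm needed: the EXIT trap removes "$json_file" on every path.
```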

Performance Considerations ✅

Benchmark design is sound:

  1. Algorithm equivalence: All implementations use the same algorithms (naive recursive fib, iterative sum_squares, trial division for primes). This ensures fair comparison.

  2. Appropriate test sizes:

    • fib(40) is a good stress test for recursion without being excessive
    • 1M iterations for sum_squares is safe for i64 and provides meaningful timing
    • 100K limit for prime counting provides good timing separation
  3. Compilation flags: Rust uses -O and Go uses default optimizations. Consistent with seqc's release build.

  4. Benchmark runner: Uses hyperfine when available (5 runs, 2 warmups for concurrency; 3 runs, 1 warmup for compute). These are reasonable defaults.

Minor suggestions:

  • Consider documenting in the README whether these benchmarks are representative of real-world workloads or purely synthetic micro-benchmarks (they're clearly the latter, which is fine)
  • The Rust concurrency benchmarks use std::thread (OS threads) rather than green threads. The README correctly explains this, but consider noting that this comparison is apples-to-oranges for pingpong

Security Concerns ✅

No security issues found.

  • No unsafe code blocks in Rust implementations
  • Shell scripts properly quote paths and variables
  • No credential or secret handling
  • Build commands use explicit flags without shell injection risks
  • The rm -f *_seq *_rust *_go cleanup uses safe wildcards within the controlled directory

Test Coverage ⚠️

Current state:

  • Each benchmark includes a correctness check (exit code 0 if result matches expected value)
  • The run.sh scripts build and run benchmarks but don't verify correctness automatically
  • No automated CI test that verifies benchmark correctness

Recommendations:

  1. Add a verification mode to the run scripts that checks exit codes:

    # Run in verify mode
    if "./${name}_seq"; then
        echo "✓ Seq result correct"
    else
        echo "✗ Seq result INCORRECT"
        exit 1
    fi
  2. CI integration: The PR updates LATEST_RUN.txt but doesn't show CI configuration. Consider adding a CI job that:

    • Builds all benchmark variants
    • Runs each once to verify correctness (exit code 0)
    • Doesn't require timing thresholds (too environment-dependent)
  3. Add edge case tests: Consider testing fib(0), fib(1), sum_squares(0), primes(2) as quick correctness checks

Additional Observations

Positive:

  • Addresses previous review feedback comprehensively (overflow comments, wildcards in cleanup, performance expectations)
  • The commit message for 26557d9 clearly documents what was changed and why
  • Sample results in README provide useful context for interpreting benchmark output
  • The distinction between concurrency and compute benchmarks is clear

Questions/Suggestions:

  1. Compiler change: The diff shows +1/-0 in crates/compiler/src/lib.rs. What was this change? It's not visible in the provided diff context.

  2. Benchmark stability: Consider documenting expected variance. Micro-benchmarks can be noisy - what's an acceptable stddev?

  3. Future work: The README mentions "adding new benchmarks". Consider adding a template file or example to make contributions easier.

Verdict

Approved with minor suggestions. This is high-quality work that significantly expands the benchmark coverage. The code is well-documented, properly structured, and implements fair comparisons. The few suggestions above are minor improvements, not blockers.

Recommended next steps:

  1. Add verification mode to run scripts (optional but recommended)
  2. Document the compiler change if it's related to this PR
  3. Consider CI integration for correctness verification

Great work on the comprehensive documentation and consistent implementation across languages!

@navicore navicore merged commit 58bb5bd into main Jan 5, 2026
2 checks passed
@navicore navicore deleted the i-179 branch January 5, 2026 03:08