
Conversation


@nikhilsinhaparseable nikhilsinhaparseable commented Oct 27, 2025

Take CPU and memory utilisation over a 2-minute rolling window before deciding to reject a request.

Summary by CodeRabbit

Release Notes

  • New Features

    • Automatic hourly memory release scheduler for optimized memory usage.
    • Rolling 2-minute resource history tracking for CPU and memory metrics.
  • Improvements

    • Resource monitoring now enabled by default for enhanced system stability.
    • Optimized memory management in query processing with batched operations.
    • Integrated jemalloc allocator for improved memory efficiency on non-Windows systems.


coderabbitai bot commented Oct 27, 2025

Walkthrough

This PR integrates jemalloc as the global memory allocator and introduces a memory release scheduler that periodically purges jemalloc arenas. Query handlers and response serialization are optimized to reduce memory retention, and resource monitoring is enhanced with rolling averages to improve decision-making.

Changes

  • Memory allocator integration (Cargo.toml, src/main.rs): Added tikv-jemalloc dependencies (ctl, jemallocator, jemalloc-sys) and configured jemalloc as the global allocator for non-MSVC targets.
  • Memory scheduler module (src/memory.rs, src/lib.rs): New memory module providing force_memory_release() and init_memory_release_scheduler() for hourly jemalloc arena purging via scheduled tasks.
  • Server initialization (src/handlers/http/modal/server.rs, src/handlers/http/modal/query_server.rs, src/handlers/http/modal/ingest_server.rs): Integrated memory release scheduler initialization into the startup sequences of all three server types.
  • Query response optimization (src/handlers/http/query.rs, src/response.rs, src/utils/arrow/mod.rs): Optimized JSON serialization and batch processing with explicit memory drops, pre-allocation, and chunked iteration to reduce memory retention.
  • Resource monitoring enhancement (src/handlers/http/resource_check.rs): Migrated from instantaneous checks to rolling 2-minute averages using a VecDeque-based history; updated thresholds and logging to reflect the rolling-average context; added unit tests.
  • Minor refactoring (src/metastore/metastores/object_store_metastore.rs): Inlined the await in the delete_overview method.

Sequence Diagram

sequenceDiagram
    participant Server as Server Init
    participant Scheduler as Memory Scheduler
    participant Jemalloc as Jemalloc

    Server->>Scheduler: init_memory_release_scheduler()
    activate Scheduler
    Scheduler->>Scheduler: Create AsyncScheduler
    Scheduler->>Scheduler: Schedule hourly task
    Scheduler->>Scheduler: Spawn Tokio poller (60s interval)
    Scheduler-->>Server: Ok(())
    deactivate Scheduler

    loop Every 60 seconds
        Scheduler->>Scheduler: Poll scheduled tasks
        alt Task ready
            Scheduler->>Jemalloc: force_memory_release()
            Jemalloc->>Jemalloc: Advance epoch
            Jemalloc->>Jemalloc: Purge arenas
            Jemalloc-->>Scheduler: Success
        end
    end
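For orientation, a minimal sketch of how this initialization flow could be wired up in Rust, assuming the clokwerk AsyncScheduler named in the diagram and the force_memory_release() helper from the walkthrough; the anyhow::Result return type and the logging are illustrative, not the exact code in src/memory.rs:

use std::time::Duration;
use clokwerk::{AsyncScheduler, TimeUnits};

// Sketch only: schedule an hourly purge task, then spawn a fire-and-forget
// poller that checks for due tasks once a minute.
pub fn init_memory_release_scheduler() -> anyhow::Result<()> {
    let mut scheduler = AsyncScheduler::new();

    scheduler.every(1.hour()).run(|| async {
        // force_memory_release() is the arena-purging helper described above.
        if let Err(e) = force_memory_release() {
            tracing::warn!("hourly memory release failed: {e}");
        }
    });

    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(60)).await;
        }
    });

    Ok(())
}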

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Memory scheduler logic in src/memory.rs: Verify jemalloc epoch advancement and arena purging correctness, error handling consistency
  • Server initialization ordering: Confirm all three server types (server.rs, query_server.rs, ingest_server.rs) initialize memory scheduler at appropriate points before spawning servers
  • Query optimization pathways: Review explicit drops and chunked processing in src/handlers/http/query.rs, src/response.rs, and src/utils/arrow/mod.rs for correctness and memory safety
  • Rolling average logic: Validate ResourceHistory window cleanup, sample accumulation, and average computation in src/handlers/http/resource_check.rs

Possibly related PRs

Suggested labels

  • for next release

Suggested reviewers

  • parmesant

Poem

🐰 Hops through memory gardens with glee,
Jemalloc springs free, so shiny and clean,
Arenas purged hourly, no waste in between,
Query responses chirp with delight—
Memory optimized, running just right!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Description Check: ⚠️ Warning. The PR description is largely incomplete compared to the repository's template. It consists of a single sentence ("take cpu and memory utilisation for 2 min rolling window before decide to reject the request") and lacks the structured sections the template requires: a detailed description of the goal, the rationale for the chosen solution, the key changes made, and the testing and documentation checklists. The sentence conveys the general intent of the PR, but the minimal scope falls well short of the template's expectations. Resolution: expand the PR description to follow the template structure. Explain the goal (why 2-minute rolling averages are needed), the rationale for this approach over alternatives, and the key changes (jemalloc integration, resource history tracking, memory optimizations in handlers, etc.). Complete the checklist items by confirming testing of log ingestion and querying, verifying that code comments explain the "why", and documenting the new behavior for resource-based request rejection.
  • Title Check: ❓ Inconclusive. The title "Resource check" is vague and generic. It relates to a real component modified in this PR (the resource-checking mechanism in src/handlers/http/resource_check.rs), but it fails to capture the core improvement that distinguishes this change: a 2-minute rolling window for CPU and memory utilization decisions. A developer scanning commit history would not understand from the title alone that the PR introduces rolling averages rather than a general refactor or bug fix. A more specific title, such as "Implement 2-minute rolling window for resource checks", would clearly convey the main objective.
✅ Passed checks (1 passed)
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
src/memory.rs (2)

67-72: Consider tracking the spawned task or reducing poll frequency.

The scheduler polls every 60 seconds but tasks only run hourly. This is somewhat inefficient—consider reducing the poll frequency to every 5-10 minutes. Additionally, the spawned task is fire-and-forget with no shutdown mechanism. While this may be acceptable for a cleanup task, consider returning a JoinHandle to allow graceful shutdown if needed.

Apply this diff to reduce poll frequency:

     tokio::spawn(async move {
         loop {
             scheduler.run_pending().await;
-            tokio::time::sleep(Duration::from_secs(60)).await; // Check every minute
+            tokio::time::sleep(Duration::from_secs(300)).await; // Check every 5 minutes
         }
     });

34-48: Minor robustness: CString::new could theoretically fail.

The format!("arena.{i}.purge") is unlikely to produce null bytes, but CString::new returns a Result that could fail. The current code silently skips the arena on failure. Consider logging a warning if this occurs, though in practice this should never happen.

src/main.rs (1)

34-37: LGTM! Global allocator configuration is correct.

The jemalloc global allocator is properly configured with an appropriate cfg guard to exclude MSVC targets. This is a standard pattern and the placement before main() is correct.

Note that this is a significant runtime change affecting all memory allocations throughout the application. Consider monitoring memory metrics after deployment to validate the expected improvements.
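For readers unfamiliar with the pattern, the standard configuration looks roughly like this (a sketch; the static's name is arbitrary):

// Use jemalloc as the global allocator everywhere except MSVC targets,
// where jemalloc is not supported.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;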

src/handlers/http/query.rs (3)

243-251: Incomplete memory management implementation - commented-out code should be addressed.

The commented-out force_memory_release() call suggests this feature is incomplete or under development.

Please either:

  1. Enable the feature: Uncomment and import force_memory_release() from the memory module if memory release is intended here
  2. Remove the commented code: If memory release isn't needed or is handled elsewhere (e.g., by the memory scheduler), remove the comment to avoid confusion
  3. Add a TODO with context: If this is intentionally deferred, add a TODO comment explaining why and when it should be enabled

The current state creates uncertainty about whether this is intentional or forgotten work.


332-349: Explicit drop may not achieve intended memory reduction.

The explicit drop(response) at line 347 is meant to reduce memory retention, but its effectiveness is questionable since:

  1. Already consumed: The response is converted to bytes at line 346 before being dropped, so the memory is still held by bytes_result
  2. Automatic drop: Even without the explicit drop, response would be dropped at the end of the scope (after line 349)
  3. Missing the real issue: The primary memory holder is the RecordBatch inside query_response, not the JSON Value

If the goal is to reduce memory retention, consider:

-    // Create response and immediately process to reduce memory retention
-    let query_response = QueryResponse {
-        records: vec![batch],
-        fields: Vec::new(),
-        fill_null: send_null,
-        with_fields: false,
-    };
-
-    let response = query_response.to_json().map_err(|e| {
-        error!("Failed to parse record batch into JSON: {}", e);
-        actix_web::error::ErrorInternalServerError(e)
-    })?;
-
-    // Convert to bytes and explicitly drop the response object
-    let bytes_result = Bytes::from(format!("{response}\n"));
-    drop(response); // Explicit cleanup
-
-    Ok(bytes_result)
+    // Convert batch directly to JSON and bytes to minimize intermediate allocations
+    let bytes_result = {
+        let query_response = QueryResponse {
+            records: vec![batch],
+            fields: Vec::new(),
+            fill_null: send_null,
+            with_fields: false,
+        };
+        
+        let response = query_response.to_json().map_err(|e| {
+            error!("Failed to parse record batch into JSON: {}", e);
+            actix_web::error::ErrorInternalServerError(e)
+        })?;
+        
+        Bytes::from(format!("{response}\n"))
+    }; // query_response and response dropped here
+    
+    Ok(bytes_result)

This ensures all intermediate data is dropped before returning.


394-407: Explicit drop is appropriate here but consider the trade-offs.

The explicit drop(records) at line 397 is more appropriate than the similar pattern in create_batch_processor because:

  1. After conversion: It drops the original RecordBatch data after record_batches_to_json has borrowed and converted it
  2. Large data: RecordBatch objects can be memory-intensive, so dropping them early could help

However, consider:

Benefits vs complexity: The explicit drop adds code complexity for a memory optimization that may be marginal in practice. The records would be automatically dropped at the end of the block (line 399) anyway, just a few lines later.

Consistency: If you adopt this pattern here, consider whether it should be applied consistently throughout the codebase for similar scenarios, or document when/why explicit drops are warranted.

Measurement: Has this been profiled to confirm it reduces peak memory usage in production workloads? Without measurements, it's unclear if the added complexity is justified.

The current code is correct, but the value of the explicit drop depends on your memory constraints and typical data sizes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f2b324 and 0afc991.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • Cargo.toml (1 hunks)
  • src/handlers/http/modal/ingest_server.rs (1 hunks)
  • src/handlers/http/modal/query_server.rs (1 hunks)
  • src/handlers/http/modal/server.rs (1 hunks)
  • src/handlers/http/query.rs (3 hunks)
  • src/handlers/http/resource_check.rs (7 hunks)
  • src/lib.rs (1 hunks)
  • src/main.rs (1 hunks)
  • src/memory.rs (1 hunks)
  • src/metastore/metastores/object_store_metastore.rs (1 hunks)
  • src/response.rs (1 hunks)
  • src/utils/arrow/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-24T11:54:20.259Z
Learnt from: parmesant
PR: parseablehq/parseable#1449
File: src/metastore/metastores/object_store_metastore.rs:83-98
Timestamp: 2025-10-24T11:54:20.259Z
Learning: In the `get_overviews` method in `src/metastore/metastores/object_store_metastore.rs`, using `.ok()` to convert all storage errors to `None` when fetching overview objects is the intended behavior. This intentionally treats missing files and other errors (network, permissions, etc.) the same way.

Applied to files:

  • src/metastore/metastores/object_store_metastore.rs
🧬 Code graph analysis (6)
src/handlers/http/modal/query_server.rs (1)
src/memory.rs (1)
  • init_memory_release_scheduler (56-75)
src/response.rs (1)
src/utils/arrow/mod.rs (1)
  • record_batches_to_json (50-72)
src/handlers/http/modal/ingest_server.rs (1)
src/memory.rs (1)
  • init_memory_release_scheduler (56-75)
src/metastore/metastores/object_store_metastore.rs (1)
src/correlation.rs (1)
  • path (227-234)
src/handlers/http/modal/server.rs (1)
src/memory.rs (1)
  • init_memory_release_scheduler (56-75)
src/handlers/http/query.rs (1)
src/utils/arrow/mod.rs (1)
  • record_batches_to_json (50-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: coverage
🔇 Additional comments (11)
src/metastore/metastores/object_store_metastore.rs (1)

111-114: LGTM!

The inlining of the await and Ok wrapping is a clean refactor with no functional change.

src/utils/arrow/mod.rs (1)

50-72: LGTM! Memory-conscious optimizations.

The early return, pre-allocation, cursor usage, and explicit error propagation are all solid improvements that align with the PR's memory management focus.
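As a rough illustration of the shape such a helper can take, a sketch assuming arrow's json::ArrayWriter; the actual signature and buffer sizing in src/utils/arrow/mod.rs may differ:

use arrow_array::RecordBatch;
use arrow_json::ArrayWriter;
use serde_json::{Map, Value};
use std::io::Cursor;

// Sketch only: early return for empty input, a pre-sized buffer, and a
// Cursor-backed JSON writer, with errors propagated to the caller.
pub fn record_batches_to_json(
    records: &[RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    if records.is_empty() {
        return Ok(Vec::new());
    }

    // Rough pre-allocation; the real sizing heuristic may differ.
    let buf = Vec::with_capacity(records.len() * 1024);
    let mut writer = ArrayWriter::new(Cursor::new(buf));
    for batch in records {
        writer.write(batch)?;
    }
    writer.finish()?;

    let json_bytes = writer.into_inner().into_inner();
    Ok(serde_json::from_slice(&json_bytes)?)
}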

src/handlers/http/resource_check.rs (4)

51-100: LGTM! Solid rolling window implementation.

The ResourceHistory implementation correctly maintains a 2-minute window with automatic cleanup of stale samples. The average calculation properly handles edge cases (empty samples).
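A condensed sketch of the rolling-window structure being reviewed, assuming a single history holding paired CPU and memory samples; field and method names here are illustrative:

use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Sketch only: keep (timestamp, cpu %, memory %) samples and average them
// over a sliding 2-minute window.
struct ResourceHistory {
    samples: VecDeque<(Instant, f32, f32)>,
    window: Duration,
}

impl ResourceHistory {
    fn new(window: Duration) -> Self {
        Self { samples: VecDeque::new(), window }
    }

    fn add_sample(&mut self, cpu: f32, memory: f32) {
        let now = Instant::now();
        self.samples.push_back((now, cpu, memory));
        // Drop samples that have fallen outside the window.
        while let Some(&(ts, _, _)) = self.samples.front() {
            if now.duration_since(ts) > self.window {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Returns (avg_cpu, avg_memory), or None when no samples are present.
    fn averages(&self) -> Option<(f32, f32)> {
        if self.samples.is_empty() {
            return None;
        }
        let n = self.samples.len() as f32;
        let (cpu_sum, mem_sum) = self
            .samples
            .iter()
            .fold((0.0_f32, 0.0_f32), |(c, m), &(_, cpu, mem)| (c + cpu, m + mem));
        Some((cpu_sum / n, mem_sum / n))
    }
}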


169-177: Good warm-up period design.

The warm-up logic (requiring min_samples_for_decision before using rolling averages) prevents premature rejections during startup. The fallback to current instantaneous values with logging is a conservative and sensible approach.
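Building on the ResourceHistory sketch above (and assuming it lives in the same module), the decision described here might look roughly like the following; the minimum sample count and the 80% thresholds are taken from the surrounding discussion and are illustrative:

// Sketch only: fall back to instantaneous readings during warm-up, otherwise
// compare the rolling averages against the configured thresholds.
const MIN_SAMPLES_FOR_DECISION: usize = 4; // illustrative warm-up length
const CPU_THRESHOLD: f32 = 80.0;
const MEMORY_THRESHOLD: f32 = 80.0;

fn should_reject(history: &ResourceHistory, current_cpu: f32, current_memory: f32) -> bool {
    let (cpu, memory) = match history.averages() {
        Some(avgs) if history.samples.len() >= MIN_SAMPLES_FOR_DECISION => avgs,
        // Warm-up: not enough history yet, use the current instantaneous values.
        _ => (current_cpu, current_memory),
    };
    cpu > CPU_THRESHOLD || memory > MEMORY_THRESHOLD
}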


240-301: LGTM! Comprehensive test coverage.

The unit tests cover basic averaging, window cleanup, empty state, and single-sample scenarios. Good coverage for the ResourceHistory implementation.


102-106: Behavior change: Resource checks now enabled by default.

The initialization of RESOURCE_CHECK_ENABLED has changed from false to true (commit 0afc991). With thresholds defaulting to 80% CPU and 80% memory, the server will immediately begin rejecting requests if current resource usage exceeds these levels, even during the warm-up period before the 2-minute rolling average is available. While the conservative approach of using instantaneous values during warm-up is reasonable, ensure thresholds are appropriately tuned for your deployment to avoid rejecting legitimate traffic on startup or during brief resource fluctuations.

src/handlers/http/modal/ingest_server.rs (1)

132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed before spawning the airplane server, with correct error propagation.

src/lib.rs (1)

33-33: LGTM! Module exposure.

The new memory module is properly exposed as public, following the existing pattern.

src/handlers/http/modal/server.rs (1)

157-159: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before launching livetail/airplane servers, with correct error propagation.

src/handlers/http/modal/query_server.rs (1)

132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before cluster metrics scheduling, with correct error propagation.

src/response.rs (1)

35-55: Verify the batch size and memory improvement.

The batched processing approach processes 100 RecordBatch items at a time. However, the choice of BATCH_SIZE = 100 appears arbitrary and may not be optimal.

Consider:

  1. Batch size tuning: The optimal batch size depends on the typical size of individual RecordBatch items. If record batches are already large, processing 100 at a time may still consume significant memory.

  2. Memory improvement validation: Has this change been benchmarked to confirm it actually reduces peak memory usage compared to the previous implementation? The memory benefit depends on whether the intermediate json_records vectors are released between chunks.

  3. Trade-offs: Smaller batch sizes reduce peak memory but increase processing overhead. Larger batches do the opposite.

Would you like me to generate a script to help profile memory usage with different batch sizes, or provide guidance on determining an optimal value based on your typical workload?
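For concreteness, a sketch of the chunked conversion under discussion, reusing the record_batches_to_json helper sketched earlier; the BATCH_SIZE value comes from the review context and has not been benchmarked here:

use arrow_array::RecordBatch;
use serde_json::{Map, Value};

const BATCH_SIZE: usize = 100; // chunk size under discussion, not a tuned value

// Sketch only: serialize record batches 100 at a time so that the per-chunk
// serialization buffers inside record_batches_to_json are freed before the
// next chunk is converted, bounding transient memory to one chunk at a time.
fn batches_to_json_rows(
    records: &[RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    let mut rows = Vec::new();
    for chunk in records.chunks(BATCH_SIZE) {
        let json_records = record_batches_to_json(chunk)?;
        // The chunk's temporary Vec is consumed into rows here.
        rows.extend(json_records);
    }
    Ok(rows)
}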

Comment on lines +31 to +33
tikv-jemalloc-ctl = "0.6.0"
tikv-jemallocator = "0.6.0"
tikv-jemalloc-sys = "0.6.1"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that jemalloc dependency versions are current and secure.

The tikv-jemalloc dependencies have been added with specific versions. Please confirm these are the latest stable versions and check for any known security vulnerabilities.

Run the following script to check for the latest versions and security advisories:


🏁 Script executed:

#!/bin/bash
# Check latest versions on crates.io
echo "=== Checking latest versions ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  echo "Checking $crate..."
  curl -s "https://crates.io/api/v1/crates/$crate" | jq -r ".crate.max_stable_version // .crate.max_version"
done

echo ""
echo "=== Checking for security advisories ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  gh api graphql -f query="
  {
    securityVulnerabilities(first: 5, ecosystem: RUST, package: \"$crate\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }" 2>/dev/null || echo "No advisories found for $crate"
done

Length of output: 380


Update tikv-jemalloc-ctl and tikv-jemallocator to latest stable versions.

tikv-jemalloc-ctl and tikv-jemallocator are pinned to 0.6.0, but version 0.6.1 is available on crates.io. Update both to 0.6.1 for consistency with tikv-jemalloc-sys and to use the latest stable patch. No security vulnerabilities were found for any of these crates.

🤖 Prompt for AI Agents
In Cargo.toml around lines 31 to 33, tikv-jemalloc-ctl and tikv-jemallocator are
pinned at 0.6.0 while tikv-jemalloc-sys is 0.6.1; update both tikv-jemalloc-ctl
and tikv-jemallocator to version "0.6.1" to keep versions consistent, then run
cargo update -p tikv-jemalloc-ctl -p tikv-jemallocator (or regenerate
Cargo.lock) to ensure the lockfile reflects the new patch versions.
