Resource check #1452
Conversation
Take CPU and memory utilisation over a 2-minute rolling window before deciding to reject the request.
Walkthrough

This PR integrates jemalloc as the global memory allocator and introduces a memory release scheduler that periodically purges jemalloc arenas. Query handlers and response serialization are optimized to reduce memory retention, and resource monitoring is enhanced with rolling averages to improve decision-making.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Server as Server Init
    participant Scheduler as Memory Scheduler
    participant Jemalloc as Jemalloc

    Server->>Scheduler: init_memory_release_scheduler()
    activate Scheduler
    Scheduler->>Scheduler: Create AsyncScheduler
    Scheduler->>Scheduler: Schedule hourly task
    Scheduler->>Scheduler: Spawn Tokio poller (60s interval)
    Scheduler-->>Server: Ok(())
    deactivate Scheduler

    loop Every 60 seconds
        Scheduler->>Scheduler: Poll scheduled tasks
        alt Task ready
            Scheduler->>Jemalloc: force_memory_release()
            Jemalloc->>Jemalloc: Advance epoch
            Jemalloc->>Jemalloc: Purge arenas
            Jemalloc-->>Scheduler: Success
        end
    end
```
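For orientation, the flow above could be realized roughly as follows. This is a sketch only: it assumes the clokwerk `AsyncScheduler` and the tikv-jemalloc crates added by this PR, and the function bodies, the hourly interval, and the all-arenas purge via `arena.4096.purge` (MALLCTL_ARENAS_ALL) are illustrative rather than the repository's exact code.

```rust
use std::time::Duration;

use clokwerk::{AsyncScheduler, TimeUnits};
use tikv_jemalloc_ctl::epoch;

/// Illustrative stand-in for force_memory_release(): refresh jemalloc's
/// cached statistics, then ask jemalloc to purge dirty pages in all arenas.
fn force_memory_release() -> Result<(), Box<dyn std::error::Error>> {
    epoch::advance().map_err(|e| format!("epoch advance failed: {e}"))?;
    let name = std::ffi::CString::new("arena.4096.purge")?; // 4096 = all arenas
    let ret = unsafe {
        tikv_jemalloc_sys::mallctl(
            name.as_ptr(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            0,
        )
    };
    if ret != 0 {
        return Err(format!("mallctl(arena.purge) returned {ret}").into());
    }
    Ok(())
}

/// Schedule the purge hourly and poll pending tasks once a minute.
pub fn init_memory_release_scheduler() -> Result<(), Box<dyn std::error::Error>> {
    let mut scheduler = AsyncScheduler::new();
    scheduler.every(1.hours()).run(|| async {
        if let Err(e) = force_memory_release() {
            tracing::error!("memory release failed: {e}");
        }
    });
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(60)).await; // poll every minute
        }
    });
    Ok(())
}
```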
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (6)
src/memory.rs (2)
67-72: Consider tracking the spawned task or reducing poll frequency.

The scheduler polls every 60 seconds, but tasks only run hourly, which is somewhat inefficient; consider reducing the poll frequency to every 5-10 minutes. Additionally, the spawned task is fire-and-forget with no shutdown mechanism. While this may be acceptable for a cleanup task, consider returning a `JoinHandle` to allow graceful shutdown if needed.

Apply this diff to reduce poll frequency:

```diff
 tokio::spawn(async move {
     loop {
         scheduler.run_pending().await;
-        tokio::time::sleep(Duration::from_secs(60)).await; // Check every minute
+        tokio::time::sleep(Duration::from_secs(300)).await; // Check every 5 minutes
     }
 });
```
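For the `JoinHandle` suggestion, a minimal sketch (assuming the same spawned loop; the function name is illustrative):

```rust
use std::time::Duration;
use tokio::task::JoinHandle;

/// Return the poller's handle so the caller can abort or await it on shutdown.
pub fn spawn_scheduler_poller(mut scheduler: clokwerk::AsyncScheduler) -> JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(300)).await; // poll every 5 minutes
        }
    })
}

// On graceful shutdown, the caller can then stop the loop:
// let poller = spawn_scheduler_poller(scheduler);
// ...
// poller.abort();
```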
34-48: Minor robustness: CString::new could theoretically fail.

The `format!("arena.{i}.purge")` call is unlikely to produce null bytes, but `CString::new` returns a `Result` that could fail. The current code silently skips the arena on failure. Consider logging a warning if this occurs, though in practice this should never happen.

src/main.rs (1)
34-37: LGTM! Global allocator configuration is correct.

The jemalloc global allocator is properly configured with an appropriate `cfg` guard to exclude MSVC targets. This is a standard pattern, and the placement before `main()` is correct.

Note that this is a significant runtime change affecting all memory allocations throughout the application. Consider monitoring memory metrics after deployment to validate the expected improvements.
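For reference, the standard tikv-jemallocator pattern the comment refers to looks roughly like this (a generic sketch, not a verbatim copy of the PR's `src/main.rs`):

```rust
// jemalloc as the global allocator on non-MSVC targets; MSVC is excluded
// because jemalloc does not build for that toolchain.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Every heap allocation in the program now goes through jemalloc.
    let v = vec![0u8; 1024];
    println!("allocated {} bytes via jemalloc", v.len());
}
```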
src/handlers/http/query.rs (3)
243-251: Incomplete memory management implementation - commented-out code should be addressed.

The commented-out `force_memory_release()` call suggests this feature is incomplete or under development.

Please either:

- Enable the feature: Uncomment and import `force_memory_release()` from the memory module if memory release is intended here
- Remove the commented code: If memory release isn't needed or is handled elsewhere (e.g., by the memory scheduler), remove the comment to avoid confusion
- Add a TODO with context: If this is intentionally deferred, add a TODO comment explaining why and when it should be enabled

The current state creates uncertainty about whether this is intentional or forgotten work.
332-349: Explicit drop may not achieve intended memory reduction.

The explicit `drop(response)` at line 347 is meant to reduce memory retention, but its effectiveness is questionable since:

- Already consumed: The `response` is converted to bytes at line 346 before being dropped, so the memory is still held by `bytes_result`
- Automatic drop: Even without the explicit drop, `response` would be dropped at the end of the scope (after line 349)
- Missing the real issue: The primary memory holder is the `RecordBatch` inside `query_response`, not the JSON `Value`

If the goal is to reduce memory retention, consider:

```diff
-    // Create response and immediately process to reduce memory retention
-    let query_response = QueryResponse {
-        records: vec![batch],
-        fields: Vec::new(),
-        fill_null: send_null,
-        with_fields: false,
-    };
-
-    let response = query_response.to_json().map_err(|e| {
-        error!("Failed to parse record batch into JSON: {}", e);
-        actix_web::error::ErrorInternalServerError(e)
-    })?;
-
-    // Convert to bytes and explicitly drop the response object
-    let bytes_result = Bytes::from(format!("{response}\n"));
-    drop(response); // Explicit cleanup
-
-    Ok(bytes_result)
+    // Convert batch directly to JSON and bytes to minimize intermediate allocations
+    let bytes_result = {
+        let query_response = QueryResponse {
+            records: vec![batch],
+            fields: Vec::new(),
+            fill_null: send_null,
+            with_fields: false,
+        };
+
+        let response = query_response.to_json().map_err(|e| {
+            error!("Failed to parse record batch into JSON: {}", e);
+            actix_web::error::ErrorInternalServerError(e)
+        })?;
+
+        Bytes::from(format!("{response}\n"))
+    }; // query_response and response dropped here
+
+    Ok(bytes_result)
```

This ensures all intermediate data is dropped before returning.
394-407: Explicit drop is appropriate here but consider the trade-offs.

The explicit `drop(records)` at line 397 is more appropriate than the similar pattern in `create_batch_processor` because:

- After conversion: It drops the original `RecordBatch` data after `record_batches_to_json` has borrowed and converted it
- Large data: `RecordBatch` objects can be memory-intensive, so dropping them early could help

However, consider:

- Benefits vs complexity: The explicit drop adds code complexity for a memory optimization that may be marginal in practice. The records would be automatically dropped at the end of the block (line 399) anyway, just a few lines later.
- Consistency: If you adopt this pattern here, consider whether it should be applied consistently throughout the codebase for similar scenarios, or document when/why explicit drops are warranted.
- Measurement: Has this been profiled to confirm it reduces peak memory usage in production workloads? Without measurements, it's unclear if the added complexity is justified.

The current code is correct, but the value of the explicit drop depends on your memory constraints and typical data sizes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- `Cargo.toml` (1 hunks)
- `src/handlers/http/modal/ingest_server.rs` (1 hunks)
- `src/handlers/http/modal/query_server.rs` (1 hunks)
- `src/handlers/http/modal/server.rs` (1 hunks)
- `src/handlers/http/query.rs` (3 hunks)
- `src/handlers/http/resource_check.rs` (7 hunks)
- `src/lib.rs` (1 hunks)
- `src/main.rs` (1 hunks)
- `src/memory.rs` (1 hunks)
- `src/metastore/metastores/object_store_metastore.rs` (1 hunks)
- `src/response.rs` (1 hunks)
- `src/utils/arrow/mod.rs` (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-24T11:54:20.259Z
Learnt from: parmesant
PR: parseablehq/parseable#1449
File: src/metastore/metastores/object_store_metastore.rs:83-98
Timestamp: 2025-10-24T11:54:20.259Z
Learning: In the `get_overviews` method in `src/metastore/metastores/object_store_metastore.rs`, using `.ok()` to convert all storage errors to `None` when fetching overview objects is the intended behavior. This intentionally treats missing files and other errors (network, permissions, etc.) the same way.
Applied to files:
src/metastore/metastores/object_store_metastore.rs
🧬 Code graph analysis (6)
src/handlers/http/modal/query_server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/response.rs (1)
  - src/utils/arrow/mod.rs (1): `record_batches_to_json` (50-72)
src/handlers/http/modal/ingest_server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/metastore/metastores/object_store_metastore.rs (1)
  - src/correlation.rs (1): `path` (227-234)
src/handlers/http/modal/server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/handlers/http/query.rs (1)
  - src/utils/arrow/mod.rs (1): `record_batches_to_json` (50-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
🔇 Additional comments (11)
src/metastore/metastores/object_store_metastore.rs (1)
111-114: LGTM!

The inlining of the await and Ok wrapping is a clean refactor with no functional change.
src/utils/arrow/mod.rs (1)
50-72: LGTM! Memory-conscious optimizations.

The early return, pre-allocation, cursor usage, and explicit error propagation are all solid improvements that align with the PR's memory management focus.
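As a rough illustration of the pattern being praised, assuming the `arrow` crate's `ArrayWriter`; the signature and buffer size are illustrative, not the repository's exact code:

```rust
use std::io::Cursor;

use arrow::json::ArrayWriter;
use arrow::record_batch::RecordBatch;
use serde_json::{Map, Value};

/// Memory-conscious conversion: early return for empty input, one pre-sized
/// buffer written through a cursor, and explicit error propagation.
pub fn record_batches_to_json(
    records: &[RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    if records.is_empty() {
        return Ok(Vec::new()); // early return: no writer, no buffer
    }

    let mut writer = ArrayWriter::new(Cursor::new(Vec::with_capacity(8 * 1024)));
    for batch in records {
        writer.write(batch)?;
    }
    writer.finish()?;

    let buf = writer.into_inner().into_inner();
    Ok(serde_json::from_slice(&buf)?)
}
```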
src/handlers/http/resource_check.rs (4)
51-100: LGTM! Solid rolling window implementation.

The `ResourceHistory` implementation correctly maintains a 2-minute window with automatic cleanup of stale samples. The average calculation properly handles edge cases (empty samples).
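For readers without the diff at hand, the described structure is roughly the following; field names and types are a sketch of the idea, not the PR's exact code. The warm-up behaviour discussed next simply falls back to instantaneous readings until enough samples have accumulated.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Rolling window of (cpu %, memory %) samples kept for the last 2 minutes.
struct ResourceHistory {
    samples: VecDeque<(Instant, f32, f32)>,
    window: Duration,
}

impl ResourceHistory {
    fn new() -> Self {
        Self {
            samples: VecDeque::new(),
            window: Duration::from_secs(120),
        }
    }

    /// Record a sample and evict anything older than the window.
    fn add_sample(&mut self, cpu: f32, memory: f32) {
        let now = Instant::now();
        self.samples.push_back((now, cpu, memory));
        while let Some(&(t, _, _)) = self.samples.front() {
            if now.duration_since(t) > self.window {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Average (cpu, memory) over the window; None when there are no samples.
    fn averages(&self) -> Option<(f32, f32)> {
        if self.samples.is_empty() {
            return None;
        }
        let n = self.samples.len() as f32;
        let (cpu_sum, mem_sum) = self
            .samples
            .iter()
            .fold((0.0f32, 0.0f32), |(c, m), &(_, cpu, mem)| (c + cpu, m + mem));
        Some((cpu_sum / n, mem_sum / n))
    }
}
```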
169-177: Good warm-up period design.

The warm-up logic (requiring `min_samples_for_decision` before using rolling averages) prevents premature rejections during startup. The fallback to current instantaneous values with logging is a conservative and sensible approach.
240-301: LGTM! Comprehensive test coverage.

The unit tests cover basic averaging, window cleanup, empty state, and single-sample scenarios. Good coverage for the `ResourceHistory` implementation.
102-106: Behavior change: Resource checks now enabled by default.

The initialization of `RESOURCE_CHECK_ENABLED` has changed from `false` to `true` (commit 0afc991). With thresholds defaulting to 80% CPU and 80% memory, the server will immediately begin rejecting requests if current resource usage exceeds these levels, even during the warm-up period before the 2-minute rolling average is available. While the conservative approach of using instantaneous values during warm-up is reasonable, ensure thresholds are appropriately tuned for your deployment to avoid rejecting legitimate traffic on startup or during brief resource fluctuations.

src/handlers/http/modal/ingest_server.rs (1)
132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed before spawning the airplane server, with correct error propagation.
src/lib.rs (1)
33-33: LGTM! Module exposure.

The new `memory` module is properly exposed as public, following the existing pattern.

src/handlers/http/modal/server.rs (1)
157-159: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before launching the livetail/airplane servers, with correct error propagation.
src/handlers/http/modal/query_server.rs (1)
132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before cluster metrics scheduling, with correct error propagation.
src/response.rs (1)
35-55: Verify the batch size and memory improvement.

The batched processing approach processes 100 `RecordBatch` items at a time. However, the choice of `BATCH_SIZE = 100` appears arbitrary and may not be optimal.

Consider:

- Batch size tuning: The optimal batch size depends on the typical size of individual `RecordBatch` items. If record batches are already large, processing 100 at a time may still consume significant memory.
- Memory improvement validation: Has this change been benchmarked to confirm it actually reduces peak memory usage compared to the previous implementation? The memory benefit depends on whether the intermediate `json_records` vectors are released between chunks.
- Trade-offs: Smaller batch sizes reduce peak memory but increase processing overhead. Larger batches do the opposite.
Would you like me to generate a script to help profile memory usage with different batch sizes, or provide guidance on determining an optimal value based on your typical workload?
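To make the trade-off concrete, here is a generic sketch of the chunked approach under discussion; `BATCH_SIZE` and the converter closure are stand-ins, not the crate's actual API:

```rust
/// Convert items in fixed-size chunks so each chunk's intermediate JSON rows
/// are freed before the next chunk is produced, capping peak memory at
/// roughly one chunk's worth of intermediate data.
const BATCH_SIZE: usize = 100;

fn serialize_in_chunks<B, F>(batches: &[B], mut to_json_rows: F) -> Vec<String>
where
    F: FnMut(&[B]) -> Vec<String>,
{
    let mut out = Vec::new();
    for chunk in batches.chunks(BATCH_SIZE) {
        let rows = to_json_rows(chunk); // intermediate allocation per chunk
        out.extend(rows); // the chunk's intermediate Vec is consumed and freed here
    }
    out
}
```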
Cargo.toml:

```toml
tikv-jemalloc-ctl = "0.6.0"
tikv-jemallocator = "0.6.0"
tikv-jemalloc-sys = "0.6.1"
```
🧩 Analysis chain
Verify that jemalloc dependency versions are current and secure.
The tikv-jemalloc dependencies have been added with specific versions. Please confirm these are the latest stable versions and check for any known security vulnerabilities.
Run the following script to check for the latest versions and security advisories:
🏁 Script executed:

```bash
#!/bin/bash
# Check latest versions on crates.io
echo "=== Checking latest versions ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  echo "Checking $crate..."
  curl -s "https://crates.io/api/v1/crates/$crate" | jq -r ".crate.max_stable_version // .crate.max_version"
done

echo ""
echo "=== Checking for security advisories ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  gh api graphql -f query="
  {
    securityVulnerabilities(first: 5, ecosystem: RUST, package: \"$crate\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }" 2>/dev/null || echo "No advisories found for $crate"
done
```

Length of output: 380
Update tikv-jemalloc-ctl and tikv-jemallocator to latest stable versions.
tikv-jemalloc-ctl and tikv-jemallocator are pinned to 0.6.0, but version 0.6.1 is available on crates.io. Update both to 0.6.1 for consistency with tikv-jemalloc-sys and to use the latest stable patch. No security vulnerabilities were found for any of these crates.
🤖 Prompt for AI Agents
In Cargo.toml around lines 31 to 33, tikv-jemalloc-ctl and tikv-jemallocator are
pinned at 0.6.0 while tikv-jemalloc-sys is 0.6.1; update both tikv-jemalloc-ctl
and tikv-jemallocator to version "0.6.1" to keep versions consistent, then run
cargo update -p tikv-jemalloc-ctl -p tikv-jemallocator (or regenerate
Cargo.lock) to ensure the lockfile reflects the new patch versions.
Summary by CodeRabbit
Release Notes
New Features
Improvements