Resource check #1452
Conversation
Take CPU and memory utilisation over a 2-minute rolling window before deciding to reject the request.
Walkthrough

This PR integrates jemalloc as the global memory allocator and introduces a memory release scheduler that periodically purges jemalloc arenas. Query handlers and response serialization are optimized to reduce memory retention, and resource monitoring is enhanced with rolling averages to improve decision-making.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Server as Server Init
    participant Scheduler as Memory Scheduler
    participant Jemalloc as Jemalloc

    Server->>Scheduler: init_memory_release_scheduler()
    activate Scheduler
    Scheduler->>Scheduler: Create AsyncScheduler
    Scheduler->>Scheduler: Schedule hourly task
    Scheduler->>Scheduler: Spawn Tokio poller (60s interval)
    Scheduler-->>Server: Ok(())
    deactivate Scheduler

    loop Every 60 seconds
        Scheduler->>Scheduler: Poll scheduled tasks
        alt Task ready
            Scheduler->>Jemalloc: force_memory_release()
            Jemalloc->>Jemalloc: Advance epoch
            Jemalloc->>Jemalloc: Purge arenas
            Jemalloc-->>Scheduler: Success
        end
    end
```
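For orientation, the flow above could be realized roughly as follows. This is a sketch only: it assumes the clokwerk `AsyncScheduler` and the tikv-jemalloc crates added by this PR, and the function bodies, the hourly interval, and the all-arenas purge via `arena.4096.purge` (MALLCTL_ARENAS_ALL) are illustrative rather than the repository's exact code.

```rust
use std::time::Duration;

use clokwerk::{AsyncScheduler, TimeUnits};
use tikv_jemalloc_ctl::epoch;

/// Illustrative stand-in for force_memory_release(): refresh jemalloc's
/// cached statistics, then ask jemalloc to purge dirty pages in all arenas.
fn force_memory_release() -> Result<(), Box<dyn std::error::Error>> {
    epoch::advance().map_err(|e| format!("epoch advance failed: {e}"))?;
    let name = std::ffi::CString::new("arena.4096.purge")?; // 4096 = all arenas
    let ret = unsafe {
        tikv_jemalloc_sys::mallctl(
            name.as_ptr(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            std::ptr::null_mut(),
            0,
        )
    };
    if ret != 0 {
        return Err(format!("mallctl(arena.purge) returned {ret}").into());
    }
    Ok(())
}

/// Schedule the purge hourly and poll pending tasks once a minute.
pub fn init_memory_release_scheduler() -> Result<(), Box<dyn std::error::Error>> {
    let mut scheduler = AsyncScheduler::new();
    scheduler.every(1.hours()).run(|| async {
        if let Err(e) = force_memory_release() {
            tracing::error!("memory release failed: {e}");
        }
    });
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(60)).await; // poll every minute
        }
    });
    Ok(())
}
```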
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (6)
src/memory.rs (2)
67-72: Consider tracking the spawned task or reducing poll frequency.

The scheduler polls every 60 seconds, but tasks only run hourly, which is somewhat inefficient; consider reducing the poll frequency to every 5-10 minutes. Additionally, the spawned task is fire-and-forget with no shutdown mechanism. While this may be acceptable for a cleanup task, consider returning a `JoinHandle` to allow graceful shutdown if needed.

Apply this diff to reduce poll frequency:

```diff
 tokio::spawn(async move {
     loop {
         scheduler.run_pending().await;
-        tokio::time::sleep(Duration::from_secs(60)).await; // Check every minute
+        tokio::time::sleep(Duration::from_secs(300)).await; // Check every 5 minutes
     }
 });
```
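For the `JoinHandle` suggestion, a minimal sketch (assuming the same spawned loop; the function name is illustrative):

```rust
use std::time::Duration;
use tokio::task::JoinHandle;

/// Return the poller's handle so the caller can abort or await it on shutdown.
pub fn spawn_scheduler_poller(mut scheduler: clokwerk::AsyncScheduler) -> JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(300)).await; // poll every 5 minutes
        }
    })
}

// On graceful shutdown, the caller can then stop the loop:
// let poller = spawn_scheduler_poller(scheduler);
// ...
// poller.abort();
```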
34-48: Minor robustness: CString::new could theoretically fail.

The `format!("arena.{i}.purge")` call is unlikely to produce null bytes, but `CString::new` returns a `Result` that could fail. The current code silently skips the arena on failure. Consider logging a warning if this occurs, though in practice this should never happen.

src/main.rs (1)
34-37: LGTM! Global allocator configuration is correct.

The jemalloc global allocator is properly configured with an appropriate `cfg` guard to exclude MSVC targets. This is a standard pattern, and the placement before `main()` is correct.

Note that this is a significant runtime change affecting all memory allocations throughout the application. Consider monitoring memory metrics after deployment to validate the expected improvements.
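For reference, the standard tikv-jemallocator pattern the comment refers to looks roughly like this (a generic sketch, not a verbatim copy of the PR's `src/main.rs`):

```rust
// jemalloc as the global allocator on non-MSVC targets; MSVC is excluded
// because jemalloc does not build for that toolchain.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Every heap allocation in the program now goes through jemalloc.
    let v = vec![0u8; 1024];
    println!("allocated {} bytes via jemalloc", v.len());
}
```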
src/handlers/http/query.rs (3)
243-251: Incomplete memory management implementation - commented-out code should be addressed.

The commented-out `force_memory_release()` call suggests this feature is incomplete or under development.

Please either:

- Enable the feature: Uncomment and import `force_memory_release()` from the memory module if memory release is intended here
- Remove the commented code: If memory release isn't needed or is handled elsewhere (e.g., by the memory scheduler), remove the comment to avoid confusion
- Add a TODO with context: If this is intentionally deferred, add a TODO comment explaining why and when it should be enabled

The current state creates uncertainty about whether this is intentional or forgotten work.
332-349: Explicit drop may not achieve intended memory reduction.

The explicit `drop(response)` at line 347 is meant to reduce memory retention, but its effectiveness is questionable since:

- Already consumed: The `response` is converted to bytes at line 346 before being dropped, so the memory is still held by `bytes_result`
- Automatic drop: Even without the explicit drop, `response` would be dropped at the end of the scope (after line 349)
- Missing the real issue: The primary memory holder is the `RecordBatch` inside `query_response`, not the JSON `Value`

If the goal is to reduce memory retention, consider:

```diff
-    // Create response and immediately process to reduce memory retention
-    let query_response = QueryResponse {
-        records: vec![batch],
-        fields: Vec::new(),
-        fill_null: send_null,
-        with_fields: false,
-    };
-
-    let response = query_response.to_json().map_err(|e| {
-        error!("Failed to parse record batch into JSON: {}", e);
-        actix_web::error::ErrorInternalServerError(e)
-    })?;
-
-    // Convert to bytes and explicitly drop the response object
-    let bytes_result = Bytes::from(format!("{response}\n"));
-    drop(response); // Explicit cleanup
-
-    Ok(bytes_result)
+    // Convert batch directly to JSON and bytes to minimize intermediate allocations
+    let bytes_result = {
+        let query_response = QueryResponse {
+            records: vec![batch],
+            fields: Vec::new(),
+            fill_null: send_null,
+            with_fields: false,
+        };
+
+        let response = query_response.to_json().map_err(|e| {
+            error!("Failed to parse record batch into JSON: {}", e);
+            actix_web::error::ErrorInternalServerError(e)
+        })?;
+
+        Bytes::from(format!("{response}\n"))
+    }; // query_response and response dropped here
+
+    Ok(bytes_result)
```

This ensures all intermediate data is dropped before returning.
394-407: Explicit drop is appropriate here but consider the trade-offs.

The explicit `drop(records)` at line 397 is more appropriate than the similar pattern in `create_batch_processor` because:

- After conversion: It drops the original `RecordBatch` data after `record_batches_to_json` has borrowed and converted it
- Large data: `RecordBatch` objects can be memory-intensive, so dropping them early could help

However, consider:

- Benefits vs complexity: The explicit drop adds code complexity for a memory optimization that may be marginal in practice. The records would be automatically dropped at the end of the block (line 399) anyway, just a few lines later.
- Consistency: If you adopt this pattern here, consider whether it should be applied consistently throughout the codebase for similar scenarios, or document when/why explicit drops are warranted.
- Measurement: Has this been profiled to confirm it reduces peak memory usage in production workloads? Without measurements, it's unclear if the added complexity is justified.

The current code is correct, but the value of the explicit drop depends on your memory constraints and typical data sizes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- `Cargo.toml` (1 hunks)
- `src/handlers/http/modal/ingest_server.rs` (1 hunks)
- `src/handlers/http/modal/query_server.rs` (1 hunks)
- `src/handlers/http/modal/server.rs` (1 hunks)
- `src/handlers/http/query.rs` (3 hunks)
- `src/handlers/http/resource_check.rs` (7 hunks)
- `src/lib.rs` (1 hunks)
- `src/main.rs` (1 hunks)
- `src/memory.rs` (1 hunks)
- `src/metastore/metastores/object_store_metastore.rs` (1 hunks)
- `src/response.rs` (1 hunks)
- `src/utils/arrow/mod.rs` (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-24T11:54:20.259Z
Learnt from: parmesant
PR: parseablehq/parseable#1449
File: src/metastore/metastores/object_store_metastore.rs:83-98
Timestamp: 2025-10-24T11:54:20.259Z
Learning: In the `get_overviews` method in `src/metastore/metastores/object_store_metastore.rs`, using `.ok()` to convert all storage errors to `None` when fetching overview objects is the intended behavior. This intentionally treats missing files and other errors (network, permissions, etc.) the same way.
Applied to files:
src/metastore/metastores/object_store_metastore.rs
🧬 Code graph analysis (6)
src/handlers/http/modal/query_server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/response.rs (1)
  - src/utils/arrow/mod.rs (1): `record_batches_to_json` (50-72)
src/handlers/http/modal/ingest_server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/metastore/metastores/object_store_metastore.rs (1)
  - src/correlation.rs (1): `path` (227-234)
src/handlers/http/modal/server.rs (1)
  - src/memory.rs (1): `init_memory_release_scheduler` (56-75)
src/handlers/http/query.rs (1)
  - src/utils/arrow/mod.rs (1): `record_batches_to_json` (50-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
🔇 Additional comments (11)
src/metastore/metastores/object_store_metastore.rs (1)
111-114: LGTM!

The inlining of the await and Ok wrapping is a clean refactor with no functional change.
src/utils/arrow/mod.rs (1)
50-72: LGTM! Memory-conscious optimizations.

The early return, pre-allocation, cursor usage, and explicit error propagation are all solid improvements that align with the PR's memory management focus.
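As a rough illustration of the pattern being praised, assuming the `arrow` crate's `ArrayWriter`; the signature and buffer size are illustrative, not the repository's exact code:

```rust
use std::io::Cursor;

use arrow::json::ArrayWriter;
use arrow::record_batch::RecordBatch;
use serde_json::{Map, Value};

/// Memory-conscious conversion: early return for empty input, one pre-sized
/// buffer written through a cursor, and explicit error propagation.
pub fn record_batches_to_json(
    records: &[RecordBatch],
) -> Result<Vec<Map<String, Value>>, Box<dyn std::error::Error>> {
    if records.is_empty() {
        return Ok(Vec::new()); // early return: no writer, no buffer
    }

    let mut writer = ArrayWriter::new(Cursor::new(Vec::with_capacity(8 * 1024)));
    for batch in records {
        writer.write(batch)?;
    }
    writer.finish()?;

    let buf = writer.into_inner().into_inner();
    Ok(serde_json::from_slice(&buf)?)
}
```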
src/handlers/http/resource_check.rs (4)
51-100: LGTM! Solid rolling window implementation.

The `ResourceHistory` implementation correctly maintains a 2-minute window with automatic cleanup of stale samples. The average calculation properly handles edge cases (empty samples).
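For readers without the diff at hand, the described structure is roughly the following; field names and types are a sketch of the idea, not the PR's exact code. The warm-up behaviour discussed next simply falls back to instantaneous readings until enough samples have accumulated.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Rolling window of (cpu %, memory %) samples kept for the last 2 minutes.
struct ResourceHistory {
    samples: VecDeque<(Instant, f32, f32)>,
    window: Duration,
}

impl ResourceHistory {
    fn new() -> Self {
        Self {
            samples: VecDeque::new(),
            window: Duration::from_secs(120),
        }
    }

    /// Record a sample and evict anything older than the window.
    fn add_sample(&mut self, cpu: f32, memory: f32) {
        let now = Instant::now();
        self.samples.push_back((now, cpu, memory));
        while let Some(&(t, _, _)) = self.samples.front() {
            if now.duration_since(t) > self.window {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Average (cpu, memory) over the window; None when there are no samples.
    fn averages(&self) -> Option<(f32, f32)> {
        if self.samples.is_empty() {
            return None;
        }
        let n = self.samples.len() as f32;
        let (cpu_sum, mem_sum) = self
            .samples
            .iter()
            .fold((0.0f32, 0.0f32), |(c, m), &(_, cpu, mem)| (c + cpu, m + mem));
        Some((cpu_sum / n, mem_sum / n))
    }
}
```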
169-177: Good warm-up period design.

The warm-up logic (requiring `min_samples_for_decision` before using rolling averages) prevents premature rejections during startup. The fallback to current instantaneous values with logging is a conservative and sensible approach.
240-301: LGTM! Comprehensive test coverage.

The unit tests cover basic averaging, window cleanup, empty state, and single-sample scenarios. Good coverage for the `ResourceHistory` implementation.
102-106: Behavior change: Resource checks now enabled by default.

The initialization of `RESOURCE_CHECK_ENABLED` has changed from `false` to `true` (commit 0afc991). With thresholds defaulting to 80% CPU and 80% memory, the server will immediately begin rejecting requests if current resource usage exceeds these levels, even during the warm-up period before the 2-minute rolling average is available. While the conservative approach of using instantaneous values during warm-up is reasonable, ensure thresholds are appropriately tuned for your deployment to avoid rejecting legitimate traffic on startup or during brief resource fluctuations.

src/handlers/http/modal/ingest_server.rs (1)
132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed before spawning the airplane server, with correct error propagation.
src/lib.rs (1)
33-33: LGTM! Module exposure.

The new `memory` module is properly exposed as public, following the existing pattern.

src/handlers/http/modal/server.rs (1)
157-159: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before launching the livetail/airplane servers, with correct error propagation.
src/handlers/http/modal/query_server.rs (1)
132-134: LGTM! Clean integration.

The memory release scheduler initialization is properly placed in the startup sequence after analytics and before cluster metrics scheduling, with correct error propagation.
src/response.rs (1)
35-55: Verify the batch size and memory improvement.

The batched processing approach processes 100 `RecordBatch` items at a time. However, the choice of `BATCH_SIZE = 100` appears arbitrary and may not be optimal.

Consider:

- Batch size tuning: The optimal batch size depends on the typical size of individual `RecordBatch` items. If record batches are already large, processing 100 at a time may still consume significant memory.
- Memory improvement validation: Has this change been benchmarked to confirm it actually reduces peak memory usage compared to the previous implementation? The memory benefit depends on whether the intermediate `json_records` vectors are released between chunks.
- Trade-offs: Smaller batch sizes reduce peak memory but increase processing overhead. Larger batches do the opposite.
Would you like me to generate a script to help profile memory usage with different batch sizes, or provide guidance on determining an optimal value based on your typical workload?
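To make the trade-off concrete, here is a generic sketch of the chunked approach under discussion; `BATCH_SIZE` and the converter closure are stand-ins, not the crate's actual API:

```rust
/// Convert items in fixed-size chunks so each chunk's intermediate JSON rows
/// are freed before the next chunk is produced, capping peak memory at
/// roughly one chunk's worth of intermediate data.
const BATCH_SIZE: usize = 100;

fn serialize_in_chunks<B, F>(batches: &[B], mut to_json_rows: F) -> Vec<String>
where
    F: FnMut(&[B]) -> Vec<String>,
{
    let mut out = Vec::new();
    for chunk in batches.chunks(BATCH_SIZE) {
        let rows = to_json_rows(chunk); // intermediate allocation per chunk
        out.extend(rows); // the chunk's intermediate Vec is consumed and freed here
    }
    out
}
```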
Cargo.toml:

```toml
tikv-jemalloc-ctl = "0.6.0"
tikv-jemallocator = "0.6.0"
tikv-jemalloc-sys = "0.6.1"
```
🧩 Analysis chain
Verify that jemalloc dependency versions are current and secure.
The tikv-jemalloc dependencies have been added with specific versions. Please confirm these are the latest stable versions and check for any known security vulnerabilities.
Run the following script to check for the latest versions and security advisories:
🏁 Script executed:

```bash
#!/bin/bash
# Check latest versions on crates.io
echo "=== Checking latest versions ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  echo "Checking $crate..."
  curl -s "https://crates.io/api/v1/crates/$crate" | jq -r ".crate.max_stable_version // .crate.max_version"
done

echo ""
echo "=== Checking for security advisories ==="
for crate in tikv-jemalloc-ctl tikv-jemallocator tikv-jemalloc-sys; do
  gh api graphql -f query="
  {
    securityVulnerabilities(first: 5, ecosystem: RUST, package: \"$crate\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }" 2>/dev/null || echo "No advisories found for $crate"
done
```

Length of output: 380
Update tikv-jemalloc-ctl and tikv-jemallocator to latest stable versions.
tikv-jemalloc-ctl and tikv-jemallocator are pinned to 0.6.0, but version 0.6.1 is available on crates.io. Update both to 0.6.1 for consistency with tikv-jemalloc-sys and to use the latest stable patch. No security vulnerabilities were found for any of these crates.
🤖 Prompt for AI Agents
In Cargo.toml around lines 31 to 33, tikv-jemalloc-ctl and tikv-jemallocator are
pinned at 0.6.0 while tikv-jemalloc-sys is 0.6.1; update both tikv-jemalloc-ctl
and tikv-jemallocator to version "0.6.1" to keep versions consistent, then run
cargo update -p tikv-jemalloc-ctl -p tikv-jemallocator (or regenerate
Cargo.lock) to ensure the lockfile reflects the new patch versions.
Summary by CodeRabbit
Release Notes
New Features
Improvements