
Cap error heap usage #987

Merged
Jake-Shadle merged 7 commits into main from cap-error-heap on Aug 5, 2024
Conversation

Jake-Shadle
Collaborator

This is a rework of #986

This resolves an unbounded memory growth issue that can occur in a particularly hostile context, i.e. every packet is invalid in some way and generates an error, and CPU usage is high enough that tasks, notably the reporting task below,

tokio::spawn(async move {
    let mut log_task = tokio::time::interval(std::time::Duration::from_secs(5));
    let mut pipeline_errors = std::collections::HashMap::<String, u64>::new();
    loop {
        tokio::select! {
            _ = log_task.tick() => {
                for (error, instances) in &pipeline_errors {
                    tracing::warn!(%error, %instances, "pipeline report");
                }
                pipeline_errors.clear();
            }
            received = error_receiver.recv() => {
                let Some(error) = received else {
                    tracing::info!("pipeline reporting task closed");
                    return;
                };
                let entry = pipeline_errors.entry(error.to_string()).or_default();
                *entry += 1;
            }
        }
    }
});

do not consistently receive CPU time. In this context, where every received packet generates an error that is sent to an unbounded channel before being dequeued, the dequeue task might not run quickly enough to keep up, causing the memory allocated by the errors, as well as by the queue itself, to grow fast enough that the process can be OOM killed.

This PR changes it so that instead of each error being sent to an unbounded channel, the errors are accumulated by the receive tasks themselves, and each accumulation is sent to the print task over a bounded channel once it reaches a cap, or once a certain amount of time has elapsed since the first error was pushed (see the sketch below). If the bounded channel is full and too many errors have accumulated on a particular receive task, we drop the errors and note that it happened, rather than risk consuming too much heap memory.
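A minimal sketch of that approach for illustration (the names, constants, and the ErrorAccumulator type below are hypothetical, not the exact code in this PR): each receive task counts errors locally, flushes the batch over a bounded channel with try_send once a cap or an age limit is hit, and drops the batch, recording that it did so, if the channel is full.

use std::collections::HashMap;
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

// Illustrative limits, not the values chosen in the PR.
const MAX_ACCUMULATED_ERRORS: usize = 64;
const MAX_ACCUMULATION_AGE: Duration = Duration::from_secs(1);

/// Hypothetical per-receive-task accumulator.
struct ErrorAccumulator {
    errors: HashMap<String, u64>,
    total: usize,
    oldest: Option<Instant>,
    /// Bounded channel to the reporting/print task.
    sender: mpsc::Sender<HashMap<String, u64>>,
    dropped_batches: u64,
}

impl ErrorAccumulator {
    fn push(&mut self, error: &str) {
        *self.errors.entry(error.to_owned()).or_default() += 1;
        self.total += 1;
        let oldest = *self.oldest.get_or_insert_with(Instant::now);

        // Flush once we hit the cap or the first error in the batch is old enough.
        if self.total >= MAX_ACCUMULATED_ERRORS || oldest.elapsed() >= MAX_ACCUMULATION_AGE {
            let batch = std::mem::take(&mut self.errors);
            self.total = 0;
            self.oldest = None;

            // try_send never blocks the receive path; if the bounded channel is
            // full we drop the batch and only note that it happened.
            if self.sender.try_send(batch).is_err() {
                self.dropped_batches += 1;
            }
        }
    }
}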

Previously: [image]

This PR: [image]

While investigating I ran into a bug with the xDS filterchain where the server would send empty delta updates, so now the server doesn't send updates if the resource(s) haven't changed.

Also added 2 ways to generate unique tokens: one where n tokens of length l are generated and written to ./quilkin-test-config.yaml, which can then be used with -c; the other is a new --to-tokens arg on proxy that does the same thing, assigning unique tokens of a specific length to each address in --to.
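The token generation itself amounts to something like the following sketch (a hypothetical helper, not the code added in crates/gen.rs): produce n unique alphanumeric tokens of length l.

use rand::{distributions::Alphanumeric, Rng};
use std::collections::HashSet;

/// Hypothetical helper: generate `n` unique alphanumeric tokens of length `l`.
fn generate_unique_tokens(n: usize, l: usize) -> Vec<String> {
    let mut rng = rand::thread_rng();
    let mut tokens = HashSet::with_capacity(n);
    // Loop until we have `n` distinct tokens; a duplicate is simply discarded
    // by the HashSet and regenerated on the next iteration.
    while tokens.len() < n {
        let token: String = (&mut rng)
            .sample_iter(&Alphanumeric)
            .take(l)
            .map(char::from)
            .collect();
        tokens.insert(token);
    }
    tokens.into_iter().collect()
}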

Resolves: #983

Resolved review threads on:

crates/gen.rs
src/components/proxy/error.rs
src/components/proxy/packet_router.rs
src/cli/proxy.rs
Don't bother sending resource update if there aren't actually any changes
This allows easier testing of proxies by assigning a certain number of unique tokens of a specified length to each address in --to
This can be used to do the same thing as --to-tokens but to a config file instead of directly via cli
Jake-Shadle merged commit ee0b70f into main on Aug 5, 2024
12 checks passed
@quilkin-bot
Collaborator

Build Succeeded 🥳

Build Id: d2794817-c17c-442a-ab47-3ba5ba9930af

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch git@github.com:googleforgames/quilkin.git pull/987/head:pr_987 && git checkout pr_987
cargo build

Jake-Shadle deleted the cap-error-heap branch on August 5, 2024 at 13:48
markmandel added the kind/bug and area/performance labels and removed kind/other on Aug 14, 2024
Labels
area/performance, kind/bug, size/l
Development

Successfully merging this pull request may close these issues.

Memory leak in proxies
4 participants