
Cap error heap usage #987

Merged
Jake-Shadle merged 7 commits into main from cap-error-heap on Aug 5, 2024
Conversation

Jake-Shadle
Collaborator

This is a rework of #986

This resolves an unbounded memory growth issue that can occur in a particularly hostile context, i.e. every packet is invalid in some way and generates an error, and CPU usage is high enough that tasks, notably the reporting task below,

tokio::spawn(async move {
    let mut log_task = tokio::time::interval(std::time::Duration::from_secs(5));
    let mut pipeline_errors = std::collections::HashMap::<String, u64>::new();
    loop {
        tokio::select! {
            _ = log_task.tick() => {
                for (error, instances) in &pipeline_errors {
                    tracing::warn!(%error, %instances, "pipeline report");
                }
                pipeline_errors.clear();
            }
            received = error_receiver.recv() => {
                let Some(error) = received else {
                    tracing::info!("pipeline reporting task closed");
                    return;
                };
                let entry = pipeline_errors.entry(error.to_string()).or_default();
                *entry += 1;
            }
        }
    }
});

do not consistently receive CPU time. In this context, where every received packet generates an error that is sent to an unbounded channel before being dequeued, the dequeue task might not run quickly enough to keep up, causing the memory allocated by the errors, as well as by the queue itself, to grow fast enough that the process can be OOM killed.

This PR changes it so that instead of each error being sent to an unbounded channel, the errors are accumulated by the receive tasks themselves, and each accumulation is sent to the print task over a bounded channel once it reaches a cap, or once a certain amount of time has elapsed since the first error was pushed (see the sketch below). If the bounded channel is full and too many errors have accumulated on a particular receive task, we drop the errors and note that it happened, rather than risk consuming too much heap memory.
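A minimal sketch of that approach for illustration (the names, constants, and the ErrorAccumulator type below are hypothetical, not the exact code in this PR): each receive task counts errors locally, flushes the batch over a bounded channel with try_send once a cap or an age limit is hit, and drops the batch, recording that it did so, if the channel is full.

use std::collections::HashMap;
use std::time::{Duration, Instant};
use tokio::sync::mpsc;

// Illustrative limits, not the values chosen in the PR.
const MAX_ACCUMULATED_ERRORS: usize = 64;
const MAX_ACCUMULATION_AGE: Duration = Duration::from_secs(1);

/// Hypothetical per-receive-task accumulator.
struct ErrorAccumulator {
    errors: HashMap<String, u64>,
    total: usize,
    oldest: Option<Instant>,
    /// Bounded channel to the reporting/print task.
    sender: mpsc::Sender<HashMap<String, u64>>,
    dropped_batches: u64,
}

impl ErrorAccumulator {
    fn push(&mut self, error: &str) {
        *self.errors.entry(error.to_owned()).or_default() += 1;
        self.total += 1;
        let oldest = *self.oldest.get_or_insert_with(Instant::now);

        // Flush once we hit the cap or the first error in the batch is old enough.
        if self.total >= MAX_ACCUMULATED_ERRORS || oldest.elapsed() >= MAX_ACCUMULATION_AGE {
            let batch = std::mem::take(&mut self.errors);
            self.total = 0;
            self.oldest = None;

            // try_send never blocks the receive path; if the bounded channel is
            // full we drop the batch and only note that it happened.
            if self.sender.try_send(batch).is_err() {
                self.dropped_batches += 1;
            }
        }
    }
}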

Previously: [image]

This PR: [image]

While investigating I ran into a bug with the xDS filterchain where the server would send empty delta updates, so now the server doesn't send updates if the resource(s) haven't changed.

Also added 2 ways to generate unique tokens: one where n tokens of length l are generated and written to ./quilkin-test-config.yaml, which can then be used with -c; the other is a new --to-tokens arg on proxy that does the same thing, assigning unique tokens of a specific length to each address in --to.
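The token generation itself amounts to something like the following sketch (a hypothetical helper, not the code added in crates/gen.rs): produce n unique alphanumeric tokens of length l.

use rand::{distributions::Alphanumeric, Rng};
use std::collections::HashSet;

/// Hypothetical helper: generate `n` unique alphanumeric tokens of length `l`.
fn generate_unique_tokens(n: usize, l: usize) -> Vec<String> {
    let mut rng = rand::thread_rng();
    let mut tokens = HashSet::with_capacity(n);
    // Loop until we have `n` distinct tokens; a duplicate is simply discarded
    // by the HashSet and regenerated on the next iteration.
    while tokens.len() < n {
        let token: String = (&mut rng)
            .sample_iter(&Alphanumeric)
            .take(l)
            .map(char::from)
            .collect();
        tokens.insert(token);
    }
    tokens.into_iter().collect()
}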

Resolves: #983

Resolved review threads on:

crates/gen.rs
src/components/proxy/error.rs
src/components/proxy/packet_router.rs
src/cli/proxy.rs
Don't bother sending resource update if there aren't actually any changes
This allows easier testing of proxies by assigning a certain number of unique tokens of a specified length to each address in --to
This can be used to do the same thing as --to-tokens but to a config file instead of directly via cli
Jake-Shadle merged commit ee0b70f into main on Aug 5, 2024
12 checks passed
@quilkin-bot
Collaborator

Build Succeeded 🥳

Build Id: d2794817-c17c-442a-ab47-3ba5ba9930af

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch git@github.com:googleforgames/quilkin.git pull/987/head:pr_987 && git checkout pr_987
cargo build

Jake-Shadle deleted the cap-error-heap branch on August 5, 2024 at 13:48
markmandel added the kind/bug and area/performance labels and removed kind/other on Aug 14, 2024
Labels
area/performance, kind/bug, size/l
Development

Successfully merging this pull request may close these issues.

Memory leak in proxies
4 participants