
High CPU usage for cdrs_tokio::transport::AsyncTransport::start_processing #161

Open · coder3101 opened this issue Apr 28, 2023 · 19 comments
Labels: help wanted (Extra attention is needed)

Comments

coder3101 commented Apr 28, 2023

I have set up a Cassandra cluster of 6 pods using the Helm charts and I am connecting to that cluster through one known node. The application works fine, but occasionally my application's CPU usage goes to 100% and the application becomes unresponsive. I sampled the CPU usage for 1 second at one of the peaks and found that 99% of the CPU time is spent in cdrs_tokio::transport::AsyncTransport::start_processing; I have attached the complete flamegraph SVG for your reference here. I also observe that after a few hours of full CPU usage, the usage comes down to 50% (but is still high).

[attached flamegraph image: out]

The flamegraph was captured without debug symbols since it happens only occasionally. Below is the code that demonstrates how I make the connection; here CASSANDRA_HOST is the name of the Kubernetes service.

pub async fn create_cassandra_session() -> anyhow::Result<CassandraSession> {
    let url = format!(
        "{}:{}",
        std::env::var("CASSANDRA_HOST").unwrap_or("127.0.0.1".to_string()),
        std::env::var("CASSANDRA_PORT").unwrap_or("9042".to_string())
    );
    tracing::info!("Connecting to cassandra at {url}");
    let cluster_config = if let (Some(username), Some(password)) = (
        std::env::var("CASSANDRA_USERNAME").ok(),
        std::env::var("CASSANDRA_PASSWORD").ok(),
    ) {
        tracing::info!("Cassandra static credentials read from environment");
        let authenticator = StaticPasswordAuthenticatorProvider::new(username, password);
        NodeTcpConfigBuilder::new()
            .with_contact_point(url.into())
            .with_authenticator_provider(Arc::new(authenticator))
            .build()
            .await?
    } else {
        tracing::info!("Cassandra is connecting without credentials");
        NodeTcpConfigBuilder::new()
            .with_contact_point(url.into())
            .build()
            .await?
    };

    Ok(TcpSessionBuilder::new(RoundRobinLoadBalancingStrategy::new(), cluster_config).build()?)
}
krojew (Owner) commented Apr 28, 2023

start_processing is a bit misleading - it starts read/write loops per connection, so it will naturally show up on the flamegraph as the main culprit, especially when connections live as long as the whole application. Can you share debug logs from the moment it happens?
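
For intuition, here is a minimal sketch of a per-connection read loop of the kind described above. It is not cdrs-tokio's actual implementation; it only illustrates why a task that lives as long as the connection shows up in flamegraph samples even though it is parked in await and burns no CPU while idle:

use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// Hypothetical per-connection read loop, spawned once when the connection
// opens and running until it closes.
async fn read_loop(mut stream: TcpStream) {
    let mut buf = vec![0u8; 16 * 1024];
    loop {
        // The task is suspended here until bytes arrive.
        match stream.read(&mut buf).await {
            Ok(0) => break,   // peer closed the connection
            Ok(_n) => {
                // hand the received bytes off to the frame parser
            }
            Err(_) => break,  // I/O error: drop the connection
        }
    }
}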

coder3101 (Author) commented:

I will redeploy my app to run with debug symbols and at debug log level, and I will post the logs here once I have them.

krojew (Owner) commented Apr 28, 2023

You don't have to run a debug build - just debug logs will be fine. The problem might be in any part of the application, maybe even not in cdrs, so it's important to get the logs from the exact moment it happens.

coder3101 (Author) commented:

I will run a release build, but with debug symbols.

coder3101 (Author) commented:

I ran again and captured the flamegraph, with the same results as above. I finally decided to use another library, scylla, which does not have this problem. Is there any possibility that start_processing::{closure}::{..} can somehow be stuck in an infinite loop?

krojew (Owner) commented Apr 29, 2023

No, it can't, since it's simply a read/write loop which waits for data. That's why you're seeing it on the flamegraph: it runs for the entire duration of the session, but doesn't actually do anything until data comes in or goes out. That's also why you can see AsyncRead and AsyncWrite there, which don't actually take any CPU cycles; they wait and do work only when data is available. It would be great if you provided the debug logs, so we could see what's actually going on.

coder3101 (Author) commented:

I ran with RUST_LOG=debug but cdrs_tokio did not log anything. Did I miss something? I use Rust nightly and tokio-tracing.

krojew (Owner) commented Apr 29, 2023

Are you using the latest version? Do you have a tracing subscriber that outputs the logs, e.g. https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/index.html?
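
For reference, a minimal subscriber setup that honors RUST_LOG looks roughly like the following. This is an illustrative sketch, not code from this issue, and assumes the tracing-subscriber crate with its env-filter feature enabled:

use tracing_subscriber::EnvFilter;

fn main() {
    // Read the filter from RUST_LOG (e.g. RUST_LOG=debug, or
    // RUST_LOG=info,cdrs_tokio=debug for a per-target filter)
    // and print matching events to stdout.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    // ... application setup continues here ...
}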

coder3101 (Author) commented:

Are you using the latest version?

Yes (I usually run cargo update once a week).

Do you have a tracing subscriber that outputs the logs, e.g. https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/index.html?

Yes, I use that tracing_subscriber itself. I am able to get logs from hyper, h2 and other crates.

I have been running the service for a long time and this has been happening the whole time. I am able to get some logs from the cdrs_tokio target at the ERROR level.

[two screenshots of cdrs_tokio ERROR log output, captured 2023-04-29]

For now, I have moved to scylla, which does not have this problem. This library somehow caused so much I/O that our istio sidecar would also consume more CPU, resulting in high overall pod CPU usage. There must be a tight loop somewhere in the transport.

krojew (Owner) commented Apr 29, 2023

"Connection reset by peer" means the cluster shut the connection down. Can you verify two things:

  1. You are using version 8.1, not 8.0, which was yanked.
  2. RUST_LOG is set to DEBUG, since it seems you have some other log level set, as if the env variable was not set for your application. cdrs uses tracing for logging, so it follows all the usual rules (see the sketch below).
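
As a sanity check, the filter can also be hard-coded as a fallback, so that a missing RUST_LOG in the pod environment cannot be the reason for the silence. This is an editorial sketch, not code from this thread, and again assumes tracing-subscriber with the env-filter feature:

use tracing_subscriber::EnvFilter;

fn main() {
    // Prefer RUST_LOG when it is set; otherwise fall back to an explicit
    // directive that enables debug output for the cdrs_tokio target.
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,cdrs_tokio=debug")),
        )
        .init();

    // ... rest of the application ...
}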

krojew (Owner) commented Apr 29, 2023

Just had a thought - if you are using 8.0 and the nodes drop connections, the I/O spikes might be related to connection pools being re-established. Lowering the pool size might be the solution.

coder3101 (Author) commented:

You are using version 8.1, not 8.0 which was yanked.

Now that I check, I was using cdrs-tokio = { version = "7.0.4", features = ["derive"] }. Let me re-check these things with the latest version.

krojew (Owner) commented Apr 29, 2023

It might help out of the box, but also remember you can fine-tune many settings. If you still get those spikes, try lowering the heartbeat interval and/or the pool size.

vaikzs commented Jun 3, 2023

I'm facing a similar issue where CPU usage is ~50%. I tried both lowering the connection pool (local: 1, remote: 0) and upgrading to 8.1.0; neither has helped so far.

krojew (Owner) commented Jun 3, 2023

Do you have a flamegraph and/or debug logs?

vaikzs commented Aug 4, 2023

@krojew unfortunately I'm still working on the flamegraph report. However, I can assure you that once I take cdrs-tokio out, I no longer see the CPU usage rise after a while. Below is the CPU usage when cdrs-tokio is present and the application is idle:
[screenshot: CPU usage graph with cdrs-tokio present and the application idle]

stale bot commented Oct 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Oct 15, 2023
vaikzs commented Mar 6, 2024

@krojew I can confirm we are still facing this issue; requesting you to investigate or acknowledge that it is a work in progress.

stale bot removed the wontfix (This will not be worked on) label on Mar 6, 2024
krojew (Owner) commented Mar 7, 2024

Can you provide logs and a flamegraph, if possible?

krojew added the help wanted (Extra attention is needed) label on May 8, 2024