
High CPU usage for cdrs_tokio::transport::AsyncTransport::start_processing #161

Open · coder3101 opened this issue Apr 28, 2023 · 19 comments
Labels: help wanted (Extra attention is needed)

Comments

coder3101 commented Apr 28, 2023

I have set up a Cassandra cluster of 6 pods using the Helm charts and I am connecting to that cluster through one known node. The application works fine, but occasionally my application's CPU usage goes to 100% and the application becomes unresponsive. I sampled the CPU usage for 1 second at one of the peaks and found that 99% of the CPU time is spent in cdrs_tokio::transport::AsyncTransport::start_processing; I have attached the complete flamegraph SVG for your reference here. I also observe that after a few hours of full CPU usage, the usage comes down to 50% (but is still high).

[attached flamegraph image: out]

The flamegraph was captured without debug symbols since it happens only occasionally. Below is the code that demonstrates how I make the connection; here CASSANDRA_HOST is the name of the Kubernetes service.

pub async fn create_cassandra_session() -> anyhow::Result<CassandraSession> {
    let url = format!(
        "{}:{}",
        std::env::var("CASSANDRA_HOST").unwrap_or("127.0.0.1".to_string()),
        std::env::var("CASSANDRA_PORT").unwrap_or("9042".to_string())
    );
    tracing::info!("Connecting to cassandra at {url}");
    let cluster_config = if let (Some(username), Some(password)) = (
        std::env::var("CASSANDRA_USERNAME").ok(),
        std::env::var("CASSANDRA_PASSWORD").ok(),
    ) {
        tracing::info!("Cassandra static credentials read from environment");
        let authenticator = StaticPasswordAuthenticatorProvider::new(username, password);
        NodeTcpConfigBuilder::new()
            .with_contact_point(url.into())
            .with_authenticator_provider(Arc::new(authenticator))
            .build()
            .await?
    } else {
        tracing::info!("Cassandra is connecting without credentials");
        NodeTcpConfigBuilder::new()
            .with_contact_point(url.into())
            .build()
            .await?
    };

    Ok(TcpSessionBuilder::new(RoundRobinLoadBalancingStrategy::new(), cluster_config).build()?)
}
krojew (Owner) commented Apr 28, 2023

start_processing is a bit misleading - it starts read/write loops per connection, so it will naturally show up on the flamegraph as the main culprit, especially when connections live as long as the whole application. Can you share debug logs from the moment it happens?
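
For intuition, here is a minimal sketch of a per-connection read loop of the kind described above. It is not cdrs-tokio's actual implementation; it only illustrates why a task that lives as long as the connection shows up in flamegraph samples even though it is parked in await and burns no CPU while idle:

use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// Hypothetical per-connection read loop, spawned once when the connection
// opens and running until it closes.
async fn read_loop(mut stream: TcpStream) {
    let mut buf = vec![0u8; 16 * 1024];
    loop {
        // The task is suspended here until bytes arrive.
        match stream.read(&mut buf).await {
            Ok(0) => break,   // peer closed the connection
            Ok(_n) => {
                // hand the received bytes off to the frame parser
            }
            Err(_) => break,  // I/O error: drop the connection
        }
    }
}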

coder3101 (Author) commented:

I will redeploy my app to run with debug symbols and at debug log level, and I will post the logs here once I have them.

krojew (Owner) commented Apr 28, 2023

You don't have to run a debug build - just debug logs will be fine. The problem might be in any part of the application, maybe even not in cdrs, so it's important to get the logs from the exact moment it happens.

coder3101 (Author) commented:

I will run a release build, but with debug symbols.

coder3101 (Author) commented:

I ran again and captured the flamegraph, with the same results as above. I finally decided to use another library, scylla, which does not have this problem. Is there any possibility that start_processing::{closure}::{..} can somehow be stuck in an infinite loop?

krojew (Owner) commented Apr 29, 2023

No, it can't, since it's simply a read/write loop which waits for data. That's why you're seeing it on the flamegraph: it runs for the entire duration of the session, but doesn't actually do anything until data comes in or goes out. That's also why you can see AsyncRead and AsyncWrite there, which don't actually take any CPU cycles; they wait and do work only when data is available. It would be great if you provided the debug logs, so we could see what's actually going on.

coder3101 (Author) commented:

I ran with RUST_LOG=debug but cdrs_tokio did not log anything. Did I miss something? I use Rust nightly and tokio-tracing.

krojew (Owner) commented Apr 29, 2023

Are you using the latest version? Do you have a tracing subscriber that outputs the logs, e.g. https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/index.html?
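
For reference, a minimal subscriber setup that honors RUST_LOG looks roughly like the following. This is an illustrative sketch, not code from this issue, and assumes the tracing-subscriber crate with its env-filter feature enabled:

use tracing_subscriber::EnvFilter;

fn main() {
    // Read the filter from RUST_LOG (e.g. RUST_LOG=debug, or
    // RUST_LOG=info,cdrs_tokio=debug for a per-target filter)
    // and print matching events to stdout.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    // ... application setup continues here ...
}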

coder3101 (Author) commented:

Are you using the latest version?

Yes (I usually run cargo update once a week).

Do you have a tracing subscriber that outputs the logs, e.g. https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/index.html?

Yes, I use that tracing_subscriber itself. I am able to get logs from hyper, h2 and other crates.

I have been running the service for a long time and this has been happening the whole time. I am able to get some logs from the cdrs_tokio target at the ERROR level.

[two screenshots of cdrs_tokio ERROR log output, captured 2023-04-29]

For now, I have moved to scylla, which does not have this problem. This library somehow caused so much I/O that our istio sidecar would also consume more CPU, resulting in high overall pod CPU usage. There must be a tight loop somewhere in the transport.

krojew (Owner) commented Apr 29, 2023

"Connection reset by peer" means the cluster shut the connection down. Can you verify two things:

  1. You are using version 8.1, not 8.0, which was yanked.
  2. RUST_LOG is set to DEBUG, since it seems you have some other log level set, as if the env variable was not set for your application. cdrs uses tracing for logging, so it follows all the usual rules (see the sketch below).
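
As a sanity check, the filter can also be hard-coded as a fallback, so that a missing RUST_LOG in the pod environment cannot be the reason for the silence. This is an editorial sketch, not code from this thread, and again assumes tracing-subscriber with the env-filter feature:

use tracing_subscriber::EnvFilter;

fn main() {
    // Prefer RUST_LOG when it is set; otherwise fall back to an explicit
    // directive that enables debug output for the cdrs_tokio target.
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,cdrs_tokio=debug")),
        )
        .init();

    // ... rest of the application ...
}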

krojew (Owner) commented Apr 29, 2023

Just had a thought - if you are using 8.0 and the nodes drop connections, the I/O spikes might be related to connection pools being re-established. Lowering the pool size might be the solution.

coder3101 (Author) commented:

You are using version 8.1, not 8.0 which was yanked.

Now that I check, I was using cdrs-tokio = { version = "7.0.4", features = ["derive"] }. Let me re-check these things with the latest version.

krojew (Owner) commented Apr 29, 2023

It might help out of the box, but also remember you can fine-tune many settings. If you still get those spikes, try lowering the heartbeat interval and/or the pool size.

vaikzs commented Jun 3, 2023

I'm facing a similar issue where CPU usage is ~50%. I tried both lowering the connection pool (local: 1, remote: 0) and upgrading to 8.1.0; neither has helped so far.

krojew (Owner) commented Jun 3, 2023

Do you have a flamegraph and/or debug logs?

vaikzs commented Aug 4, 2023

@krojew unfortunately I'm still working on the flamegraph report. However, I can assure you that once I take cdrs-tokio out, I no longer see the CPU usage rise after a while. Below is the CPU usage when cdrs-tokio is present and the application is idle:
[screenshot: CPU usage graph with cdrs-tokio present and the application idle]

stale bot commented Oct 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Oct 15, 2023
vaikzs commented Mar 6, 2024

@krojew I can confirm we are still facing this issue; requesting you to investigate or acknowledge that it is a work in progress.

stale bot removed the wontfix (This will not be worked on) label on Mar 6, 2024
krojew (Owner) commented Mar 7, 2024

Can you provide logs and a flamegraph, if possible?

krojew added the help wanted (Extra attention is needed) label on May 8, 2024