
Issue with AWS MSK IAM using Apache Kafka scaler #5531

Closed
sameerjoshinice opened this issue Feb 25, 2024 · 29 comments
Assignees
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@sameerjoshinice

Report

AWS MSK gets into high CPU usage and metadata retrieval does not work with the experimental Apache Kafka scaler

Expected Behavior

With everything correctly configured, KEDA should be able to retrieve the metadata for the topics, use it for scaling, and not affect MSK itself.

Actual Behavior

Metadata retrieval does not work and produces errors, causing high CPU usage on MSK and an MSK outage. This means the scaler is not working as expected.

Steps to Reproduce the Problem

  1. Add AWS MSK IAM with roleArn-based authentication in the Apache Kafka scaler. The Kafka version on MSK is 3.5.1.
  2. sasl is set to aws_msk_iam and tls is set to enable.
  3. The following is the ScaledObject and TriggerAuthentication config:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abcd-selector-scaler
  namespace: apps-abcd
spec:
  scaleTargetRef:
    name: apps-abcd-selector
  pollingInterval: 5 # Optional. Default: 30 seconds
  cooldownPeriod: 30 # Optional. Default: 300 seconds
  maxReplicaCount: 8 # Optional. Default: 100
  minReplicaCount: 2
  triggers:
    - type: apache-kafka
      metadata:
        bootstrapServers: abcd-3-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198,abcd-1-public.msk01uswest2.casdas.c6.kafka.us-west-2.amazonaws.com:9198
        consumerGroup: abcd-selector
        topic: Abcd.Potential.V1
        awsRegion: us-west-2
        lagThreshold: '5'
      authenticationRef:
        name: abcd-selector-trigger

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: abcd-selector-trigger
  namespace: apps-abcd
spec:
  secretTargetRef:
    - parameter: sasl
      name: abcd-selector-secret
      key: sasl
    - parameter: awsRoleArn
      name: abcd-selector-secret
      key: awsRoleArn
    - parameter: tls
      name: abcd-selector-secret
      key: tls

Logs from KEDA operator

error getting metadata: kafka.(*Client).Metadata: read tcp xxx.xxx.xxx.xxx:42116->xx.xxx.xxx.xxx:9198: i/o timeout
error getting metadata: kafka.(*Client).Metadata: context deadline exceeded

KEDA Version

2.13.0

Kubernetes Version

1.26

Platform

Amazon Web Services

Scaler Details

Apache Kafka scaler (experimental)

Anything else?

This caused a major outage for us since we use a shared MSK cluster. This is a big problem for the other services that were affected because of this scaler. Even after a restart of the brokers, the issue remains because Kafka keeps the information about these connections and is taking a lot of time to stabilize.

@sameerjoshinice added the bug (Something isn't working) label Feb 25, 2024
@dttung2905
Contributor

Hi @sameerjoshinice,

Thanks for reporting this to us. An i/o timeout and context deadline exceeded often indicate a network connection error. I have a few questions:

  • Had this setup been working well for you before you encountered this problem, or is this the first time this scaler has been run, causing the outage?
  • Did you try to debug by setting up a testing pod and making the same SASL + TLS connection using the Kafka CLI instead? If this test does not pass, it means there are errors with the TLS cert + SASL.
  • How did you manage to find out that the KEDA operator is causing the CPU spike on the AWS MSK brokers? How many brokers out of the AWS MSK fleet were affected?
  • If you could get more logs for troubleshooting, that would be great.

@sameerjoshinice
Author

Hi @dttung2905,
Please see my answers inline.
Had this setup been working well for you before you encountered this problem, or is this the first time this scaler has been run, causing the outage?
[SJ]: This is the first time this scaler has been run, and it caused the outage.
Did you try to debug by setting up a testing pod and making the same SASL + TLS connection using the Kafka CLI instead? If this test does not pass, it means there are errors with the TLS cert + SASL.
[SJ]: There are other clients contacting MSK with the same role and they are working fine. Those clients are mostly Java based.
How did you manage to find out that the KEDA operator is causing the CPU spike on the AWS MSK brokers? How many brokers out of the AWS MSK fleet were affected?
[SJ]: There are 3 brokers in the shared MSK cluster and all of them were affected. This happened twice, and both times it was the KEDA scaler whose permissions were enabled for access to MSK when the issue started happening.
If you could get more logs for troubleshooting, that would be great.
[SJ]: I will try to get more logs as and when I find something of importance.

@jared-schmidt-niceincontact

We also saw this error from the KEDA operator before the timeouts and context deadline errors started happening:

ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "mynamespace", "scaledObject.Name": "myscaler", "trigger": "apacheKafkaScaler", "error": "error listing consumer group offset: %!w()"}

@jared-schmidt-niceincontact

Our suspicion is that the scaler caused a flood of broken connections that didn't close properly and eventually caused all of the groups to rebalance, which pegged the CPU. The rebalances can be seen within a few minutes of starting the ScaledObject.

I also have this email which highlights some things AWS was finding at the same time:

I've been talking to our AWS TAM and the AWS folks about this issue. They still believe, based on the logs they have access to (which we don't), that the problems are related to a new IAM permission that is required when running under the newest Kafka version. They are seeing many authentication issues related to the routing pods. My coworker and I have been adjusting permissions to give the application the rights requested by AWS. The CPU on the cluster dropped slightly when we did that; however, we are still getting the following error even after applying the update on the routing pods:

Connection to node -2 () terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue.

AWS believes that the authentication sessions on the backend have been marked as expired, but they have not been removed and are treated as invalid. They have been attempting to manually remove them, but have run into problems doing so. They are going to restart all the MSK brokers to clear out the session cache.

@jared-schmidt-niceincontact

jared-schmidt-niceincontact commented Feb 26, 2024

Unfortunately, restarting the brokers didn't fix the CPU problems.

@JorTurFer
Member

Did you try restarting the KEDA operator?
I'm checking, and apparently we are closing the connection correctly in case of failure:

// Close closes the kafka client
func (s *apacheKafkaScaler) Close(context.Context) error {
    if s.client == nil {
        return nil
    }
    transport := s.client.Transport.(*kafka.Transport)
    if transport != nil {
        transport.CloseIdleConnections()
    }
    return nil
}

But maybe there is some other way to close the client that we've missed :/

@sameerjoshinice
Author

The following is the analysis from the AWS MSK team for this issue. They see this as a problem in the KEDA scaler: the new Apache Kafka scaler keeps retrying constantly with non-renewed credentials, even after session expiry.

Based on the authorizer logs, we see that KEDA is denied access to certain resources. This leads to the same scaler retrying. The retry happens constantly until the session expires. When the session expires, the credential is not renewed by KEDA, and thus it attempts to call the cluster with an outdated credential. This leads to a race condition where the requests are constantly in an AUTHENTICATION failed state, which causes the request queue, and then the network queue, to fill up, leading to high CPU.

The behaviour on the MSK side is that the failure happens in the authorization flow, not the authentication flow. This was hard to reproduce. However, when we add latency between the server and the authorization flow, which might happen when the network queue is overloaded, we can reproduce this behaviour. The rejection of these requests is correct. Under normal circumstances, they would be rejected in the authentication flow, but under heavy load they are being rejected in the authorization flow.

Mitigation
In order to fix this, the KEDA configurations need to be fixed to allow it access to all topics and groups. This will stop the retries, and allow the clients to be closed before the session expires.

An issue should be raised with KEDA about this. The scaler will always eventually crash if authentication or authorization fails. This can happen with any KEDA scaler if permissions are not sufficient: it will keep retrying until the session expires, and then face this issue.

@sansmoraxz
Contributor

It's been a while since I worked with Kafka; I made the initial commit for the scaler.

If it's an issue with rotating credentials, I think it would be better to raise the issue over at segmentio/kafka-go. They are the ones maintaining the underlying library. Could it be confirmed whether it's the newer versions of Kafka having the issue? Or, for that matter, whether earlier builds of KEDA have the issue? I don't see many changes in this repo that would cause this issue.

@sansmoraxz
Contributor

The behaviour on the MSK side is that the failure happens in the authorization flow, not the authentication flow. This was hard to reproduce. However, when we add latency between the server and the authorization flow, which might happen when the network queue is overloaded, we can reproduce this behaviour. The rejection of these requests is correct. Under normal circumstances, they would be rejected in the authentication flow, but under heavy load they are being rejected in the authorization flow.

Is this specific to RBAC, or can the same problem be seen when an IAM user is used? I don't have access to infra to test this on, tbh.

@sameerjoshinice
Author

It's been a while since I worked with Kafka; I made the initial commit for the scaler.

If it's an issue with rotating credentials, I think it would be better to raise the issue over at segmentio/kafka-go. They are the ones maintaining the underlying library. Could it be confirmed whether it's the newer versions of Kafka having the issue? Or, for that matter, whether earlier builds of KEDA have the issue? I don't see many changes in this repo that would cause this issue.

The version of Kafka being used is 3.5.1. We could not confirm whether it was an issue with an earlier version or not.

@sameerjoshinice
Author

The behaviour on the MSK side is that the failure happens in the authorization flow, not the authentication flow. This was hard to reproduce. However, when we add latency between the server and the authorization flow, which might happen when the network queue is overloaded, we can reproduce this behaviour. The rejection of these requests is correct. Under normal circumstances, they would be rejected in the authentication flow, but under heavy load they are being rejected in the authorization flow.

Is this specific to RBAC, or can the same problem be seen when an IAM user is used? I don't have access to infra to test this on, tbh.

This problem was seen with awsRoleArn and using aws_msk_iam as the sasl method.

@JorTurFer
Member

@dttung2905 @zroubalik, do you know what the state of AWS MSK IAM integration within sarama is? I mean, if we are close to unifying both scalers again, we can just work in that direction.

@sameerjoshinice
Author

I see AWS has released the signer: https://github.com/aws/aws-msk-iam-sasl-signer-go
This signer can be integrated with the IBM sarama library. The signer implements the interface necessary for SASL OAUTHBEARER authentication. This means IBM sarama does not need any change to support IAM for MSK; the same can be achieved by using SASL OAUTHBEARER authentication with different token provider implementations, depending on whether a role, a profile, or credentials are specified. This means the new experimental scaler won't be needed, since IAM support is already achievable with sarama using the AWS-provided SASL signer.
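For illustration, a minimal sketch of what such a token provider could look like, assuming sarama's OAUTHBEARER TokenProvider hook and the GenerateAuthTokenFromRole helper from aws-msk-iam-sasl-signer-go (the type, function, and session names here are hypothetical, not KEDA code):

package mskauth

import (
    "context"

    "github.com/IBM/sarama"
    "github.com/aws/aws-msk-iam-sasl-signer-go/signer"
)

// mskTokenProvider is a hypothetical sarama.AccessTokenProvider that signs an
// OAUTHBEARER token with AWS MSK IAM credentials obtained via an assumed role.
type mskTokenProvider struct {
    region  string
    roleArn string
}

// Token is called by sarama whenever it needs a token, so the credential is
// re-signed on each authentication instead of being reused after expiry.
func (p *mskTokenProvider) Token() (*sarama.AccessToken, error) {
    // The STS session name is illustrative.
    token, _, err := signer.GenerateAuthTokenFromRole(
        context.Background(), p.region, p.roleArn, "keda-kafka-scaler")
    if err != nil {
        return nil, err
    }
    return &sarama.AccessToken{Token: token}, nil
}

// newMSKConfig wires the provider into a sarama config over TLS + OAUTHBEARER.
func newMSKConfig(region, roleArn string) *sarama.Config {
    cfg := sarama.NewConfig()
    cfg.Net.TLS.Enable = true
    cfg.Net.SASL.Enable = true
    cfg.Net.SASL.Mechanism = sarama.SASLTypeOAuth
    cfg.Net.SASL.TokenProvider = &mskTokenProvider{region: region, roleArn: roleArn}
    return cfg
}

Analogous providers built on signer.GenerateAuthToken (default credential chain) or a profile-based variant would presumably cover the profile and static-credential cases mentioned above.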

@JorTurFer
Member

Yeah, we knew about it because another folk told us to use it some weeks ago. IIRC @dttung2905 is checking how to integrate it.

@dttung2905
Contributor

I see AWS has released the signer: https://github.com/aws/aws-msk-iam-sasl-signer-go

@sameerjoshinice yes, I think that's the repo to use. It has been mentioned in this Sarama issue: IBM/sarama#1985

@JorTurFer I did not recall I was looking into that, sorry. But I'm happy to test it out if I can get my hands on a testing MSK environment :D

@JorTurFer
Member

@JorTurFer I did not recall I was looking into that, sorry

Lol, maybe I'm wrong, but that's what I remember 😰 xD

No worries, IDK why I remembered it, but surely I was wrong

@adrien-f
Contributor

adrien-f commented Apr 5, 2024

👋 Greetings!

Would you be open to it if I were to look at this issue and see how we could solve it by integrating the signer from AWS?

@JorTurFer
Member

Would you be open to it if I were to look at this issue and see how we could solve it by integrating the signer from AWS?

Yeah, if you are willing to give it a try, it'd be nice!

@zroubalik
Member

@adrien-f thanks, let me assign this issue to you.

@adrien-f
Contributor

Greetings 👋

I was able to get our MSK cluster (1000+ topics) to enable IAM authentication. With the current codebase it connected fine, so that's good news on the current state of the authentication system.

Immediately, I noticed that the scaler runs the following:

func (s *apacheKafkaScaler) getTopicPartitions(ctx context.Context) (map[string][]int, error) {
    metadata, err := s.client.Metadata(ctx, &kafka.MetadataRequest{
        Addr: s.client.Addr,
    })
    if err != nil {
        return nil, fmt.Errorf("error getting metadata: %w", err)
    }
    s.logger.V(1).Info(fmt.Sprintf("Listed topics %v", metadata.Topics))

That MetadataRequest is not scoped to the topic the scaler is looking at, which means it retrieves information for all topics and partitions on the cluster, and that can get big. I think it would be fair to scope that request to the topic configured for the scaler.
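As a rough sketch of that scoping idea, using segmentio/kafka-go's MetadataRequest.Topics field (the function and parameter names are illustrative, not the actual KEDA change):

package kafkautil

import (
    "context"
    "fmt"

    "github.com/segmentio/kafka-go"
)

// getTopicPartitionsScoped limits the metadata request to the topics the
// scaler is configured to watch instead of describing the whole cluster.
// client is the same *kafka.Client the scaler already holds.
func getTopicPartitionsScoped(ctx context.Context, client *kafka.Client, topics []string) (map[string][]int, error) {
    metadata, err := client.Metadata(ctx, &kafka.MetadataRequest{
        Addr:   client.Addr,
        Topics: topics, // only the configured topics
    })
    if err != nil {
        return nil, fmt.Errorf("error getting metadata: %w", err)
    }

    result := make(map[string][]int, len(metadata.Topics))
    for _, t := range metadata.Topics {
        if t.Error != nil {
            return nil, fmt.Errorf("error describing topic %s: %w", t.Name, t.Error)
        }
        partitions := make([]int, 0, len(t.Partitions))
        for _, p := range t.Partitions {
            partitions = append(partitions, p.ID)
        }
        result[t.Name] = partitions
    }
    return result, nil
}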

Moreover, the following is from the Kafka protocol documentation:

The client does not need to keep polling to see if the cluster has changed; it can fetch metadata once when it is instantiated and cache that metadata until it receives an error indicating that the metadata is out of date. This error can come in two forms: (1) a socket error indicating the client cannot communicate with a particular broker, (2) an error code in the response to a request indicating that this broker no longer hosts the partition for which data was requested.

  1. Cycle through a list of "bootstrap" Kafka URLs until we find one we can connect to.
  2. Fetch cluster metadata.
  3. Process fetch or produce requests, directing them to the appropriate broker based on the topic/partitions they send to or fetch from. If we get an appropriate error, refresh the metadata and try again.

Caching topic partitions might also be another option to investigate.
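A minimal sketch of what such a cache might look like, using only the standard library (hypothetical names, not KEDA's implementation):

package kafkautil

import (
    "context"
    "sync"
    "time"
)

// partitionCache memoizes the topic -> partition-ID mapping so metadata is
// only refreshed after a TTL expires, rather than on every polling interval.
type partitionCache struct {
    mu        sync.Mutex
    ttl       time.Duration
    fetchedAt time.Time
    data      map[string][]int
}

// get returns the cached mapping while it is still fresh; otherwise it calls
// fetch (for example, the scoped metadata request above) and stores the result.
func (c *partitionCache) get(ctx context.Context, fetch func(context.Context) (map[string][]int, error)) (map[string][]int, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if c.data != nil && time.Since(c.fetchedAt) < c.ttl {
        return c.data, nil
    }
    data, err := fetch(ctx)
    if err != nil {
        return nil, err
    }
    c.data = data
    c.fetchedAt = time.Now()
    return data, nil
}

An error from the brokers would then be the natural signal to drop the cached entry and refetch, in line with the protocol guidance quoted above.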

I will continue investigating and open a PR with these suggestions 👍

@sameerjoshinice
Author

@adrien-f I thought we were discussing the idea of modifying the existing Kafka scaler based on the sarama library rather than fixing the issue in the experimental scaler based on the segmentio library.

@adrien-f
Contributor

Hey @sameerjoshinice ! Isn't apache-kafka the experimental scaler?

@adrien-f
Contributor

See https://keda.sh/docs/2.13/scalers/apache-kafka/; the first one, kafka, unless I'm mistaken, does not support MSK IAM auth.

@sameerjoshinice
Author

@adrien-f
The following is the original scaler, based on the sarama library:
https://github.com/kedacore/keda/blob/80806a73218e7d128bd25945f573c2a91316d1d3/pkg/scalers/kafka_scaler.go
and this is the experimental scaler:
https://github.com/kedacore/keda/blob/80806a73218e7d128bd25945f573c2a91316d1d3/pkg/scalers/apache_kafka_scaler.go
The comments above suggest that the original scaler, based on the sarama library, can itself be modified to be compatible with AWS MSK IAM. Please see the following comment:
#5531 (comment)

@adrien-f
Contributor

Got it 👍 I'll look at adding that!

@adrien-f
Contributor

Hey there!

I've implemented the MSK signer! Let me know what you think :)

@gjacquet
Contributor

I think part of the issue might be related to #5806.
I have also faced issues with connections remaining active but using outdated credentials.


stale bot commented Jul 14, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale (All issues that are marked as stale due to inactivity) label Jul 14, 2024

stale bot commented Jul 21, 2024

This issue has been automatically closed due to inactivity.

stale bot closed this as completed Jul 21, 2024