Kafka scaler reports -2 and scale to maxReplicas #2612

Closed
loicmathieu opened this issue Feb 9, 2022 · 16 comments · Fixed by #2621
Labels
bug Something isn't working

Comments

@loicmathieu

loicmathieu commented Feb 9, 2022

Report

We have a situation where the Kafka scaler reports -2 and the HPA scales to maxReplicas.
A CPU scaler is also defined, but it reports 1% usage, so the desired replica count should be 1, not 3.

The following information comes from a kubectl describe on the HPA:

Name:                                                                        keda-hpa-vcstream-out-listener-kafka-geco
Namespace:                                                                   vcstream-dev
Labels:                                                                      app.kubernetes.io/managed-by=Helm
                                                                             app.kubernetes.io/name=keda-hpa-vcstream-out-listener-kafka-geco
                                                                             app.kubernetes.io/part-of=vcstream-out-listener-kafka-geco
                                                                             app.kubernetes.io/version=2.5.0
                                                                             scaledobject.keda.sh/name=vcstream-out-listener-kafka-geco
Annotations:                                                                 <none>
CreationTimestamp:                                                           Tue, 08 Feb 2022 17:09:07 +0100
Reference:                                                                   Deployment/vcstream-out-listener-kafka-geco
Metrics:                                                                     ( current / target )
  "s1-kafka-private_dkt_out_listener_kafka_geco_v1" (target average value):  -2 / 500
  resource cpu on pods  (as a percentage of request):                        1% (3m) / 75%
Min replicas:                                                                1
Max replicas:                                                                3
Deployment pods:                                                             3 current / 3 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range

Other scalers defined the same way and connected to the same Kafka broker work as expected.

Expected Behavior

As there is no lag and no CPU usage, the HPA should scale to minReplicas, which is 1.

Actual Behavior

The HPA scales to maxReplicas, which is 3.

Steps to Reproduce the Problem

Here is the ScaledObject definition we use:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vcstream-out-listener-kafka-geco
spec:
  scaleTargetRef:
    name: vcstream-out-listener-kafka-geco
  minReplicaCount: 1
  maxReplicaCount:  3
  cooldownPeriod: 60
  fallback:
    failureThreshold: 3
    replicas: 1
  advanced:
    restoreToOriginalReplicaCount: true
  triggers:
  - type: cpu
    metadata:
      type: Utilization
      value: "75"
  - type: kafka      
    metadata:
      bootstrapServers: "redacted"
      consumerGroup: vcstream-out-listener-kafka-geco
      topic:  private_dkt_out_listener_kafka_geco_v1
      lagThreshold: "500"
      offsetResetPolicy: latest
    authenticationRef:
      name: keda-trigger-auth-kafka-credential

Logs from KEDA operator

example

KEDA Version

No response

Kubernetes Version

No response

Platform

Google Cloud

Scaler Details

cpu, kafka

Anything else?

Maybe the issue is that the lag is not reported and KEDA returns -2 instead of 0 when there is no lag?

loicmathieu added the bug label Feb 9, 2022
@loicmathieu
Author

loicmathieu commented Feb 9, 2022

So I checked the KEDA log and it is indeed what I thought; I saw the following line in the log file:

I0209 15:06:56.834200       1 logr.go:252] keda_metrics_adapter/kafka_scaler "msg"="invalid offset found for topic private_dkt_out_listener_kafka_geco_v1 in group vcstream-out-listener-kafka-geco and partition 4, probably no offset is committed yet"

I think that in this case the scaler should report 0 instead of -2.
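
For illustration, here is a minimal Go sketch of the behaviour proposed here (this is not the actual KEDA scaler code; the names and the -1 sentinel are assumptions for the example): when a partition has no committed offset, report a lag of 0 for it instead of letting a negative sentinel flow into the metric the HPA consumes.

package main

import "fmt"

// invalidOffset stands in for the sentinel returned when a consumer group has
// not committed any offset for a partition yet (illustrative, not KEDA code).
const invalidOffset int64 = -1

// partitionLag computes the lag for a single partition. The proposed fix:
// if no offset has been committed yet, report 0 instead of a negative value.
func partitionLag(committedOffset, latestOffset int64) int64 {
    if committedOffset == invalidOffset {
        // No committed offset yet: report zero lag so the aggregated metric
        // cannot go negative and confuse the HPA.
        return 0
    }
    return latestOffset - committedOffset
}

func main() {
    fmt.Println(partitionLag(invalidOffset, 500)) // 0: nothing committed yet
    fmt.Println(partitionLag(100, 600))           // 500: normal lag
}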

@tomkerkhove
Member

Are you open to contributing a fix?

RamCohen added a commit to RamCohen/keda that referenced this issue Feb 10, 2022
When getting the consumer offset of a Kafka topic for which no offset
has been committed yet, the scaler returns -1 instead of 0, which
causes scaling to the maximum number of replicas.

Also fixed some typos and used interim variables for error strings.

Fixes kedacore#2612

Signed-off-by: Ram Cohen <ram.cohen@gmail.com>
@RamCohen
Contributor

Already working on it

@loicmathieu
Author

@tomkerkhove next time I'll try to contribute, this time @RamCohen beat me to it ;)

@zroubalik
Member

@RamCohen @loicmathieu could you please check this related PR #2033

@loicmathieu
Author

@zroubalik as far as I understand it, this seems to be the same issue so #2621 will also fix #2033

tomkerkhove added this to the v2.6.1 milestone Feb 10, 2022
@zroubalik
Member

@loicmathieu what do you think about this specific comment? #2033 (comment)

Specifically about this:

When there are messages and no committed offset, the lag depends on offsetResetPolicy, in any case, it should scale to at least one max(minReplicaCount, 1)
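
To make the quoted suggestion concrete, here is an illustrative Go sketch (assumed names, not the KEDA implementation) of how the lag of a partition with no committed offset could depend on offsetResetPolicy, together with the max(minReplicaCount, 1) floor mentioned above:

package main

import "fmt"

// lagWithoutCommittedOffset sketches the lag a brand-new consumer would face,
// depending on where offsetResetPolicy makes it start reading.
func lagWithoutCommittedOffset(policy string, earliestOffset, latestOffset int64) int64 {
    if policy == "earliest" {
        // The consumer would start from the oldest retained message,
        // so every retained message counts as lag.
        return latestOffset - earliestOffset
    }
    // With "latest" the consumer would start at the end of the partition,
    // so there is effectively no lag yet.
    return 0
}

// replicaFloor sketches max(minReplicaCount, 1): keep at least one consumer
// running so an offset can eventually be committed.
func replicaFloor(minReplicaCount int64) int64 {
    if minReplicaCount > 1 {
        return minReplicaCount
    }
    return 1
}

func main() {
    fmt.Println(lagWithoutCommittedOffset("latest", 0, 500))   // 0
    fmt.Println(lagWithoutCommittedOffset("earliest", 0, 500)) // 500
    fmt.Println(replicaFloor(0))                               // 1
}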

@loicmathieu
Author

There is a risk of a chicken-and-egg issue here: when there is no offset, if we return 0 and a minReplicas of 0 is configured, we will never scale to 1, so there will never be any consumption, so there will never be any offset, ...

When there is no offset, maybe the best option is to return 1 so that we at least scale to 1 and wait for the first message.

Returning 0 as in the current PR seems more logical, but in that case a big WARNING needs to be added to the documentation stating that when no offset exists the scaler will return 0, so minReplicas must not be configured to 0 as long as no offset has been committed for the topic. This could be error prone though.

@loicmathieu
Author

Anyway, the behaviour when there is no offset must be described in the documentation, as returning 1 could also surprise people.

@zroubalik
Member

zroubalik commented Feb 10, 2022

+1 That's why I am more inclined towards the solution proposed in that comment. I am not sure we should scale to 0; we should try to make it as easy as possible for users, and scaling to at least 1 replica could help mitigate the problem.

Solid documentation is a must. That's why I don't think we should release this in 2.6.1; we should think about this properly and prepare the feature for the next release. (CC @tomkerkhove, that's why I am removing the milestone.)

zroubalik removed this from the v2.6.1 milestone Feb 10, 2022
@RamCohen
Contributor

Perhaps add a parameter that specifies what to return in case of a missing offset (0/1), with a default of '1', as this more closely resembles what is happening today?
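
As a rough sketch of that idea, the trigger metadata could carry a value deciding what lag to report when no offset has been committed. The parameter name below is hypothetical and does not exist in KEDA; it only illustrates the suggestion.

package main

import (
    "fmt"
    "strconv"
)

// defaultLagOnMissingOffset: per the suggestion, default to 1 because that is
// closer to today's behaviour of scaling up when no offset is committed.
const defaultLagOnMissingOffset int64 = 1

// parseLagOnMissingOffset reads a hypothetical "lagOnMissingOffset" key from
// the trigger metadata and accepts only 0 or 1.
func parseLagOnMissingOffset(metadata map[string]string) (int64, error) {
    raw, ok := metadata["lagOnMissingOffset"]
    if !ok {
        return defaultLagOnMissingOffset, nil
    }
    v, err := strconv.ParseInt(raw, 10, 64)
    if err != nil || (v != 0 && v != 1) {
        return 0, fmt.Errorf("lagOnMissingOffset must be 0 or 1, got %q", raw)
    }
    return v, nil
}

func main() {
    v, _ := parseLagOnMissingOffset(map[string]string{"lagOnMissingOffset": "0"})
    fmt.Println(v) // 0
    v, _ = parseLagOnMissingOffset(map[string]string{})
    fmt.Println(v) // 1 (default)
}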

@loicmathieu
Author

Yes, I was thinking about that too; the best option is to let the user choose how many replicas should be available to wait for a message when no offsets exist.
I don't know how it would be done, as today the HPA chooses the number of replicas, but this is what the cron scaler does, so it should be feasible ;)

What can be done is to fix the current issue, as returning -1 is a bug (the HPA doesn't understand it), so the current PR is an improvement and could be shipped as a workaround while waiting for a better solution: a fixed number of replicas when there is no offset.

@zroubalik
Member

Having an optional parameter for 0/1 with a sane default seems like a great idea.

I prefer not to create an intermediate step in the functionality, so I will keep #2621 open; it would be better if we can find a proper fix for the next release @RamCohen @loicmathieu

@loicmathieu
Author

@zroubalik I still think that it's a bug in the current implementation and 0 should be returned instead, as the HPA has incorrect behaviour when metrics are negative (it goes to the max). This bug is easily fixable, so why not fix it asap?

Having the ability to choose how many replicas to maintain when there is no offset is a new feature that may be more complex to solve, but I agree it's a better way to handle the no-offset case.

@zroubalik
Member

@loicmathieu that might be true, but the current behavior has been present from the beginning, so some users might expect it. I don't want to break existing behaviour twice in a short period of time. Let's do it properly in one attempt.

@PaulLiang1
Contributor

Reading the description of #996, where the -1 behaviour was added, it looks like it is meant to trigger scaling when there is no offset and offsetResetPolicy is latest.

Agree with @loicmathieu, it does seem more reasonable to let the user know to set minReplicas >= 1 in such a case, with a warning, instead.

RamCohen added a commit to RamCohen/keda that referenced this issue Mar 30, 2022
RamCohen added a commit to RamCohen/keda that referenced this issue Mar 30, 2022
RamCohen added a commit to RamCohen/keda that referenced this issue Apr 5, 2022