Kafka scaler reports -2 and scale to maxReplicas #2612

Closed
loicmathieu opened this issue Feb 9, 2022 · 16 comments · Fixed by #2621
Labels
bug Something isn't working

Comments

@loicmathieu

loicmathieu commented Feb 9, 2022

Report

We have a situation where the Kafka scaler reports -2 and the HPA scales to maxReplicas.
A CPU scaler is also defined, but it reports 1% usage, so the desired replica count should be 1, not 3.

The following information comes from a kubectl describe on the HPA:

Name:                                                                        keda-hpa-vcstream-out-listener-kafka-geco
Namespace:                                                                   vcstream-dev
Labels:                                                                      app.kubernetes.io/managed-by=Helm
                                                                             app.kubernetes.io/name=keda-hpa-vcstream-out-listener-kafka-geco
                                                                             app.kubernetes.io/part-of=vcstream-out-listener-kafka-geco
                                                                             app.kubernetes.io/version=2.5.0
                                                                             scaledobject.keda.sh/name=vcstream-out-listener-kafka-geco
Annotations:                                                                 <none>
CreationTimestamp:                                                           Tue, 08 Feb 2022 17:09:07 +0100
Reference:                                                                   Deployment/vcstream-out-listener-kafka-geco
Metrics:                                                                     ( current / target )
  "s1-kafka-private_dkt_out_listener_kafka_geco_v1" (target average value):  -2 / 500
  resource cpu on pods  (as a percentage of request):                        1% (3m) / 75%
Min replicas:                                                                1
Max replicas:                                                                3
Deployment pods:                                                             3 current / 3 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range

Other scalers defined the same way and connected to the same Kafka broker work as expected.

Expected Behavior

As there is no lag and no CPU usage, the HPA should scale to minReplicas, which is 1.

Actual Behavior

The HPA scales to maxReplicas, which is 3.

Steps to Reproduce the Problem

Here is the ScaledObject definition we use:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vcstream-out-listener-kafka-geco
spec:
  scaleTargetRef:
    name: vcstream-out-listener-kafka-geco
  minReplicaCount: 1
  maxReplicaCount:  3
  cooldownPeriod: 60
  fallback:
    failureThreshold: 3
    replicas: 1
  advanced:
    restoreToOriginalReplicaCount: true
  triggers:
  - type: cpu
    metadata:
      type: Utilization
      value: "75"
  - type: kafka      
    metadata:
      bootstrapServers: "redacted"
      consumerGroup: vcstream-out-listener-kafka-geco
      topic:  private_dkt_out_listener_kafka_geco_v1
      lagThreshold: "500"
      offsetResetPolicy: latest
    authenticationRef:
      name: keda-trigger-auth-kafka-credential

Logs from KEDA operator

example

KEDA Version

No response

Kubernetes Version

No response

Platform

Google Cloud

Scaler Details

cpu, kafka

Anything else?

Maybe the issue is that the lag is not reported and KEDA returns -2 instead of 0 when there is no lag?

loicmathieu added the bug label Feb 9, 2022
@loicmathieu
Author

loicmathieu commented Feb 9, 2022

So I checked the KEDA log and it is indeed what I thought; I saw the following line in the log file:

I0209 15:06:56.834200       1 logr.go:252] keda_metrics_adapter/kafka_scaler "msg"="invalid offset found for topic private_dkt_out_listener_kafka_geco_v1 in group vcstream-out-listener-kafka-geco and partition 4, probably no offset is committed yet"

I think that in this case the scaler should report 0 instead of -2.
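
For illustration, here is a minimal Go sketch of the behaviour proposed here (this is not the actual KEDA scaler code; the names and the -1 sentinel are assumptions for the example): when a partition has no committed offset, report a lag of 0 for it instead of letting a negative sentinel flow into the metric the HPA consumes.

package main

import "fmt"

// invalidOffset stands in for the sentinel returned when a consumer group has
// not committed any offset for a partition yet (illustrative, not KEDA code).
const invalidOffset int64 = -1

// partitionLag computes the lag for a single partition. The proposed fix:
// if no offset has been committed yet, report 0 instead of a negative value.
func partitionLag(committedOffset, latestOffset int64) int64 {
    if committedOffset == invalidOffset {
        // No committed offset yet: report zero lag so the aggregated metric
        // cannot go negative and confuse the HPA.
        return 0
    }
    return latestOffset - committedOffset
}

func main() {
    fmt.Println(partitionLag(invalidOffset, 500)) // 0: nothing committed yet
    fmt.Println(partitionLag(100, 600))           // 500: normal lag
}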

@tomkerkhove
Member

Are you open to contributing a fix?

RamCohen added a commit to RamCohen/keda that referenced this issue Feb 10, 2022
When getting the consumer offset of a Kafka topic for which no offset
has been committed yet, the scaler returns -1 instead of 0, which
causes scaling to the maximum number of replicas.

Also fixed some typos and used interim variables for error strings.

Fixes kedacore#2612

Signed-off-by: Ram Cohen <ram.cohen@gmail.com>
@RamCohen
Contributor

Already working on it

@loicmathieu
Author

@tomkerkhove next time I'll try to contribute, this time @RamCohen beat me to it ;)

@zroubalik
Member

@RamCohen @loicmathieu could you please check this related PR #2033

@loicmathieu
Author

@zroubalik as far as I understand it, this seems to be the same issue so #2621 will also fix #2033

tomkerkhove added this to the v2.6.1 milestone Feb 10, 2022
@zroubalik
Member

@loicmathieu what do you think about this specific comment? #2033 (comment)

Specifically about this:

When there are messages and no committed offset, the lag depends on offsetResetPolicy, in any case, it should scale to at least one max(minReplicaCount, 1)
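
To make the quoted suggestion concrete, here is an illustrative Go sketch (assumed names, not the KEDA implementation) of how the lag of a partition with no committed offset could depend on offsetResetPolicy, together with the max(minReplicaCount, 1) floor mentioned above:

package main

import "fmt"

// lagWithoutCommittedOffset sketches the lag a brand-new consumer would face,
// depending on where offsetResetPolicy makes it start reading.
func lagWithoutCommittedOffset(policy string, earliestOffset, latestOffset int64) int64 {
    if policy == "earliest" {
        // The consumer would start from the oldest retained message,
        // so every retained message counts as lag.
        return latestOffset - earliestOffset
    }
    // With "latest" the consumer would start at the end of the partition,
    // so there is effectively no lag yet.
    return 0
}

// replicaFloor sketches max(minReplicaCount, 1): keep at least one consumer
// running so an offset can eventually be committed.
func replicaFloor(minReplicaCount int64) int64 {
    if minReplicaCount > 1 {
        return minReplicaCount
    }
    return 1
}

func main() {
    fmt.Println(lagWithoutCommittedOffset("latest", 0, 500))   // 0
    fmt.Println(lagWithoutCommittedOffset("earliest", 0, 500)) // 500
    fmt.Println(replicaFloor(0))                               // 1
}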

@loicmathieu
Author

There is a risk of a chicken-and-egg issue here: when there is no offset, if we return 0 and a minReplicas of 0 is configured, we will never scale to 1, so there will never be any consumption, so there will never be any offset, ...

When there is no offset, maybe the best option is to return 1 so that we at least scale to 1 and wait for the first message.

Returning 0 as in the current PR seems more logical, but in that case a big WARNING needs to be added to the documentation stating that when no offset exists the scaler will return 0, so minReplicas must not be configured to 0 as long as no offset has been committed for the topic. This could be error prone though.

@loicmathieu
Author

Anyway, the behaviour when there is no offset must be described in the documentation, as returning 1 could also surprise people.

@zroubalik
Member

zroubalik commented Feb 10, 2022

+1 That's why I am more inclined towards the solution proposed in that comment. I am not sure we should scale to 0; we should try to make it as easy as possible for users, and scaling to at least 1 replica could help mitigate the problem.

Solid documentation is a must. That's why I don't think we should release this in 2.6.1; we should think about this properly and prepare the feature for the next release. (CC @tomkerkhove, that's why I am removing the milestone.)

zroubalik removed this from the v2.6.1 milestone Feb 10, 2022
@RamCohen
Contributor

Perhaps add a parameter that specifies what to return in case of a missing offset (0/1), with a default of '1', as this more closely resembles what is happening today?
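
As a rough sketch of that idea, the trigger metadata could carry a value deciding what lag to report when no offset has been committed. The parameter name below is hypothetical and does not exist in KEDA; it only illustrates the suggestion.

package main

import (
    "fmt"
    "strconv"
)

// defaultLagOnMissingOffset: per the suggestion, default to 1 because that is
// closer to today's behaviour of scaling up when no offset is committed.
const defaultLagOnMissingOffset int64 = 1

// parseLagOnMissingOffset reads a hypothetical "lagOnMissingOffset" key from
// the trigger metadata and accepts only 0 or 1.
func parseLagOnMissingOffset(metadata map[string]string) (int64, error) {
    raw, ok := metadata["lagOnMissingOffset"]
    if !ok {
        return defaultLagOnMissingOffset, nil
    }
    v, err := strconv.ParseInt(raw, 10, 64)
    if err != nil || (v != 0 && v != 1) {
        return 0, fmt.Errorf("lagOnMissingOffset must be 0 or 1, got %q", raw)
    }
    return v, nil
}

func main() {
    v, _ := parseLagOnMissingOffset(map[string]string{"lagOnMissingOffset": "0"})
    fmt.Println(v) // 0
    v, _ = parseLagOnMissingOffset(map[string]string{})
    fmt.Println(v) // 1 (default)
}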

@loicmathieu
Author

Yes, I was thinking about that too; the best option is to let the user choose how many replicas should be available to wait for a message when no offsets exist.
I don't know how it would be done, as today the HPA chooses the number of replicas, but this is what the cron scaler does, so it should be feasible ;)

What can be done is to fix the current issue, as returning -1 is a bug (the HPA doesn't understand it), so the current PR is an improvement and could be shipped as a workaround while waiting for a better solution: a fixed number of replicas when there is no offset.

@zroubalik
Member

Having an optional parameter for 0/1 with a sane default seems like a great idea.

I prefer not to create an intermediate step in the functionality, so I will keep #2621 open; it would be better if we can find a proper fix for the next release @RamCohen @loicmathieu

@loicmathieu
Author

@zroubalik I still think that it's a bug in the current implementation and 0 should be returned instead, as the HPA has incorrect behaviour when metrics are negative (it goes to the max). This bug is easily fixable, so why not fix it asap?

Having the ability to choose how many replicas to maintain when there is no offset is a new feature that may be more complex to solve, but I agree it's a better way to handle the no-offset case.

@zroubalik
Member

@loicmathieu that might be true, but the current behavior has been present from the beginning, so some users might expect it. I don't want to break existing behaviour twice in a short period of time. Let's do it properly in one attempt.

@PaulLiang1
Contributor

Reading the description of #996, where the -1 behaviour was added, it looks like it is meant to trigger scaling when there is no offset and offsetResetPolicy is latest.

Agree with @loicmathieu, it does seem more reasonable to let the user know to set minReplicas >= 1 in such a case, with a warning, instead.

RamCohen added a commit to RamCohen/keda that referenced this issue Mar 30, 2022
RamCohen added a commit to RamCohen/keda that referenced this issue Mar 30, 2022
RamCohen added a commit to RamCohen/keda that referenced this issue Apr 5, 2022