Cache metrics (values) in Metric Server and honor pollingInterval #2282

Closed

zroubalik opened this issue Nov 12, 2021 · 14 comments
@zroubalik (Member) commented Nov 12, 2021

Proposal

Currently the KEDA Metrics Server scrapes metric values from external services (triggers) on each request from the k8s API server. This can overload the external service.

    func (p *KedaProvider) GetExternalMetric(ctx context.Context, namespace string, metricSelector labels.Selector, info provider.ExternalMetricInfo) (*external_metrics.ExternalMetricValueList, error) {

We might want to cache the metric values in the Metrics Server to reduce the load on the external service and respond to the k8s API server much faster.
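For illustration only, a minimal sketch of what such a read-through cache could look like around the metric lookup; the names (metricCache, cacheEntry, maxAge) are hypothetical and not KEDA's actual implementation:

    package cache

    import (
        "sync"
        "time"

        "k8s.io/metrics/pkg/apis/external_metrics"
    )

    // cacheEntry stores one fetched metric value plus the time it was fetched.
    type cacheEntry struct {
        value     external_metrics.ExternalMetricValue
        fetchedAt time.Time
    }

    // metricCache is keyed by "<namespace>/<metricName>".
    var (
        mu          sync.RWMutex
        metricCache = map[string]cacheEntry{}
    )

    // getCached returns a previously stored value if it is younger than maxAge
    // (e.g. the ScaledObject's pollingInterval); ok is false on miss or expiry.
    func getCached(key string, maxAge time.Duration) (external_metrics.ExternalMetricValue, bool) {
        mu.RLock()
        defer mu.RUnlock()
        e, found := metricCache[key]
        if !found || time.Since(e.fetchedAt) > maxAge {
            return external_metrics.ExternalMetricValue{}, false
        }
        return e.value, true
    }

    // putCached stores a freshly fetched value.
    func putCached(key string, v external_metrics.ExternalMetricValue) {
        mu.Lock()
        defer mu.Unlock()
        metricCache[key] = cacheEntry{value: v, fetchedAt: time.Now()}
    }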

We have a pollingInterval property in ScaledObject which is currently honored only by the KEDA Operator; we could extend this behavior to the KEDA Metrics Server as well. There could be a new property in ScaledObject.spec.advanced to configure this, in case users would like to keep the original (uncached) behavior if we make the new behavior the default.

The current design

When the useCachedMetrics option is set to true, metrics are queried only on the polling interval and stored in the Operator. The KEDA Metrics Server asks the KEDA Operator for metrics through gRPC:

  • if useCachedMetrics: true -> we hit the cache
  • if useCachedMetrics: false or not specified -> the Operator makes the request to the scaler and replies back to the KEDA Metrics Server

Example trigger configuration:

    - type: kafka
      useCachedMetrics: true     # <-- NEW OPTION, default value: false
      metadata:
        bootstrapServers: kafka.svc:9092
        consumerGroup: my-group
        topic: test-topic
        lagThreshold: '5'
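A rough sketch of the per-trigger dispatch the Operator could do for this flag; TriggerSpec, getCachedMetric, and queryScaler are illustrative stand-ins, not KEDA's real types:

    package operator

    import (
        "context"
        "errors"
    )

    // TriggerSpec is a hypothetical slimmed-down trigger definition carrying
    // the new flag from the ScaledObject example above.
    type TriggerSpec struct {
        Type             string
        UseCachedMetrics bool
    }

    // getMetricValue decides between the cached and the direct path.
    func getMetricValue(ctx context.Context, t TriggerSpec, metricName string) (float64, error) {
        if t.UseCachedMetrics {
            // Serve the value refreshed at most pollingInterval ago.
            return getCachedMetric(metricName)
        }
        // Original behavior: query the external service on every request.
        return queryScaler(ctx, t, metricName)
    }

    // Stubs standing in for the real cache lookup and scaler call.
    func getCachedMetric(name string) (float64, error) {
        return 0, errors.New("not implemented")
    }

    func queryScaler(ctx context.Context, t TriggerSpec, name string) (float64, error) {
        return 0, errors.New("not implemented")
    }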

Tasks to do:

@zroubalik added the needs-discussion and feature-request labels on Nov 12, 2021
@zroubalik changed the title from "Cache metrics (values) in Metric Server" to "Cache metrics (values) in Metric Server and honor pollingInterval" on Nov 12, 2021
@tomkerkhove added the enhancement label and removed the feature-request label on Nov 23, 2021
stale bot commented Jan 22, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jan 22, 2022
@zroubalik added the stale-bot-ignore label on Jan 24, 2022
stale bot removed the stale label on Jan 24, 2022
@kefik commented May 12, 2022

In my use case, not being able to fully control pollingInterval for the Prometheus metrics results in starting more replicas than required. My metrics logic requires ~60 s to settle on a new correct value after a new replica starts.

@zroubalik (Member, Author) commented

> In my use case, not being able to fully control pollingInterval for the Prometheus metrics results in starting more replicas than required. My metrics logic requires ~60 s to settle on a new correct value after a new replica starts.

What if you specify this in the Prometheus query?

@bjethwan commented

This looks like a related issue: #3291

@zroubalik (Member, Author) commented Nov 8, 2022

So the idea is the following:

  1. Introduce a new option for specifying pollingInterval for 1<->N scaling:
     a. this could be either on the ScaledObject level
     b. or on the Scaler (trigger) level
  2. This would be an optional value; if it is not set, the current behavior is preserved.
  3. If the option is specified, the metric value is cached for the specified interval. If a request comes from the HPA (k8s) within the interval, we return the value from the cache; otherwise we poll the external service for a new value (see the sketch after this list).
  4. Since there can be multiple replicas of the same KEDA Metrics Server, we would need to share the cache somehow between the replicas. This is hard to implement, so we can do ->
  5. Query the metric value from the KEDA Operator and cache the value there (this has the additional benefit of using only one connection to the external service; right now we have 2 connections: Operator & Metrics Server). The KEDA Operator will act as the source of metrics for the Metrics Server.
  6. There will be a gRPC server hosted in the KEDA Operator, and Metrics Servers will connect to this server to get the metric values.
  7. This way we have all the metric values in a single place, the KEDA Operator. This enables possible extensions in the future (we can modify the metric values before they are exposed to the HPA).
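Grounded in points 3-5, a hedged sketch of an Operator-side refresh loop that keeps the cached value at most one pollingInterval old; pollScaler and storeMetric are hypothetical helpers, not KEDA's actual loop:

    package operator

    import (
        "context"
        "log"
        "time"
    )

    // refreshMetric polls the external service once per pollingInterval and
    // stores the result, so HPA-driven requests arriving within the interval
    // can be answered from the cache.
    func refreshMetric(ctx context.Context, metricName string, pollingInterval time.Duration) {
        ticker := time.NewTicker(pollingInterval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                value, err := pollScaler(ctx, metricName) // hypothetical scaler call
                if err != nil {
                    log.Printf("polling %s failed: %v", metricName, err)
                    continue
                }
                storeMetric(metricName, value) // hypothetical cache write
            }
        }
    }

    func pollScaler(ctx context.Context, name string) (float64, error) { return 0, nil }
    func storeMetric(name string, v float64)                           {}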

💩 I am not sure how to name this new option. Honestly, the best option would be to name it pollingInterval and rename the current property of that name to activationPollingInterval; this would make our spec consistent and in sync with the threshold & activationThreshold properties. But that is a huge breaking change, so I am not sure it is doable.

@JorTurFer (Member) commented
Sorry for the late response.
As I said during the standup, I'd use the metrics server as the server, because that opens up the scenario where different KEDA Operator instances from different namespaces push metrics to the shared metrics server. With this change, the operator instances must know where the metrics server is (a single service in one namespace), which is easier than the metrics server knowing where the operators are (a single service across multiple namespaces).

We should think about how to push/pull the metrics between different metrics server instances. As a starting point I guess we could assume the topology is 1-1, but from a reliability POV it is a good idea to have 2 or more instances. Maybe we could create a mesh between metrics servers to share the information, but maybe the best idea is just creating a slice of gRPC connections and pushing the metrics using all of them. In the background, we could watch the (k8s) Endpoints for the service and use the changes to establish and add new connections to the slice (see the sketch after this comment).

WDYT?
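A sketch of that last idea, assuming a typed client-go watch on the metrics server's Endpoints object and a plain slice of gRPC connections; the service name, port, and dedup logic are all assumptions:

    package operator

    import (
        "context"
        "fmt"
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // watchMetricsServers watches the Endpoints of the (hypothetical)
    // keda-metrics-apiserver service and dials every new pod address,
    // appending the connection to the shared slice.
    func watchMetricsServers(ctx context.Context, client kubernetes.Interface, namespace string, conns *[]*grpc.ClientConn) error {
        w, err := client.CoreV1().Endpoints(namespace).Watch(ctx, metav1.ListOptions{
            FieldSelector: "metadata.name=keda-metrics-apiserver", // assumed service name
        })
        if err != nil {
            return err
        }
        seen := map[string]bool{}
        for ev := range w.ResultChan() {
            ep, ok := ev.Object.(*corev1.Endpoints)
            if !ok {
                continue
            }
            for _, subset := range ep.Subsets {
                for _, addr := range subset.Addresses {
                    target := fmt.Sprintf("%s:9666", addr.IP) // assumed gRPC port
                    if seen[target] {
                        continue
                    }
                    conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
                    if err != nil {
                        log.Printf("dial %s: %v", target, err)
                        continue
                    }
                    seen[target] = true
                    *conns = append(*conns, conn) // metrics get pushed over every conn
                }
            }
        }
        return nil
    }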

@zroubalik (Member, Author) commented Nov 15, 2022

I understand your point and I agree. Though in the first iteration I'd try to make this feature as simple as possible. Creating the unary RPC with client (metrics server) -> server (operator) is much easier than the opposite direction, server (metrics server) -> client (operator), because the flow of requests originates on the metrics server side: this side makes the call for metrics.
Once this proves to be functional, we can think about more complex stuff.

An alternative approach to support multi-namespace installations would be to open a client connection from the Metrics Server for each installation and send the request to the respective Operator. The filtering would be easy, as the namespace is part of the request from the k8s API when asking for metrics, so we would know (in the Metrics Server) which Operator to ask for the metric. This topology also supports multiple Metrics Server replicas, as they can be independent because all state is stored in the Operator (see the registry sketch after this comment).
The registration of a new namespaced KEDA installation could be done 1) manually, by the user editing the Metrics Server config (a ConfigMap?), or 2) via a separate gRPC server hosted on the Metrics Server just for registration purposes.
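A minimal sketch of such a per-namespace registry in the Metrics Server, assuming the namespace from the incoming external-metrics request is used as the lookup key; all names here are hypothetical:

    package metricsserver

    import (
        "fmt"
        "sync"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    // operatorRegistry maps a namespace to the gRPC connection of the KEDA
    // Operator installed in that namespace.
    type operatorRegistry struct {
        mu    sync.RWMutex
        conns map[string]*grpc.ClientConn
    }

    // register dials the Operator for one namespaced installation; addr would
    // come from a ConfigMap or a registration RPC, as discussed above.
    func (r *operatorRegistry) register(namespace, addr string) error {
        conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            return err
        }
        r.mu.Lock()
        defer r.mu.Unlock()
        r.conns[namespace] = conn
        return nil
    }

    // connFor picks the Operator connection for the namespace found in the
    // incoming external-metrics request.
    func (r *operatorRegistry) connFor(namespace string) (*grpc.ClientConn, error) {
        r.mu.RLock()
        defer r.mu.RUnlock()
        conn, ok := r.conns[namespace]
        if !ok {
            return nil, fmt.Errorf("no KEDA installation registered for namespace %q", namespace)
        }
        return conn, nil
    }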

@JorTurFer (Member) commented Nov 15, 2022

Okay, let's go with your design, and once we verify it works, we can modify how we establish the connections. For the moment we could go 1-1 and then think about how to solve the connection management. I think that watching KEDA's namespaced Endpoints to detect new instances and establishing new connections based on that should be a small change.

@JorTurFer (Member) commented

I'm thinking about the approach and I have one question: will we enforce a single operator instance? The docs explain that multiple instances are allowed for HA purposes, even though a single instance will work. So we could break the behavior for those setups if we establish the connection against the wrong instance.

@zroubalik (Member, Author) commented

> I'm thinking about the approach and I have one question: will we enforce a single operator instance? The docs explain that multiple instances are allowed for HA purposes, even though a single instance will work. So we could break the behavior for those setups if we establish the connection against the wrong instance.

The gRPC server start will be hooked into the leader election -> so it will run only on the active instance.
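For reference, a hedged sketch of how such hooking can be done with controller-runtime: a Runnable whose NeedLeaderElection() returns true is only started on the elected leader. The port and the election ID here are assumptions, not KEDA's actual wiring:

    package main

    import (
        "context"
        "net"

        "google.golang.org/grpc"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/manager"
    )

    // grpcServerRunnable starts the metrics gRPC server; because it reports
    // NeedLeaderElection() == true, the manager runs it only on the active
    // (leader) Operator instance.
    type grpcServerRunnable struct {
        addr string
    }

    func (r *grpcServerRunnable) Start(ctx context.Context) error {
        lis, err := net.Listen("tcp", r.addr)
        if err != nil {
            return err
        }
        srv := grpc.NewServer()
        // ...register the metrics service on srv here...
        go func() {
            <-ctx.Done()
            srv.GracefulStop()
        }()
        return srv.Serve(lis)
    }

    // NeedLeaderElection tells the manager to run this only on the leader.
    func (r *grpcServerRunnable) NeedLeaderElection() bool { return true }

    var _ manager.LeaderElectionRunnable = &grpcServerRunnable{}

    func main() {
        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
            LeaderElection:   true,
            LeaderElectionID: "keda-operator", // assumed election ID
        })
        if err != nil {
            panic(err)
        }
        if err := mgr.Add(&grpcServerRunnable{addr: ":9666"}); err != nil { // assumed port
            panic(err)
        }
        if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
            panic(err)
        }
    }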

@JorTurFer (Member) commented

> The gRPC server start will be hooked into the leader election -> so it will run only on the active instance.

Nice! I have to think about this slowly to discover potential gaps/issues, but this is probably a better solution than mine. Congrats! 👏

@zroubalik (Member, Author) commented

> > The gRPC server start will be hooked into the leader election -> so it will run only on the active instance.
>
> Nice! I have to think about this slowly to discover potential gaps/issues, but this is probably a better solution than mine. Congrats! 👏

See 19cd301.

@tomkerkhove (Member) commented

@zroubalik I think we can close this?

@zroubalik (Member, Author) commented

@tomkerkhove there is one outstanding sub-issue TBD; I'd keep this open till we fix it.
