
Make prevention rule set configurable #4067

Open
tomkerkhove opened this issue Jan 3, 2023 · 20 comments
Labels
feature (All issues for new features that have been committed to), prevention, stale-bot-ignore (All issues that should not be automatically closed by our stale bot)

Comments

@tomkerkhove
Member

Proposal

Make the prevention rule set configurable by giving end-users knobs at a per-rule level, so that they can manage which validations should be applied and which should not.

We should do this by mounting a YAML configuration file with the various rules and providing a reference page that documents the available rules.
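
For illustration, a minimal sketch of what such a mounted file could look like; the file path, rule names and keys below are hypothetical and not an existing KEDA format:

# /config/prevention-rules.yaml (hypothetical mounted file)
rules:
  scaleTargetAlreadyAutoscaled: true   # reject ScaledObjects whose target is already scaled by another ScaledObject/HPA
  cpuMemoryRequestsRequired: true      # reject CPU/memory triggers when the target has no resource requests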

Use-Case

Give control to the end-user.

Anything else?

No response

@tomkerkhove added the needs-discussion, feature-request and prevention labels Jan 3, 2023
@tomkerkhove
Member Author

Let's make sure not to use environment variables as this will not scale over time, imo

@JorTurFer
Member

I don't get what you want to achieve; is this something for the admission controller?

@tomkerkhove
Member Author

Yes, if we add all these rules, end-users might want to control which ones should be on and which should not.

@JorTurFer
Member

Are they configurable in real life? I mean, CPU/memory scaling without requests won't work, and multiple HPAs scaling the same workload won't work either.
I don't see what the user should be able to configure, because those rules are mandatory; otherwise there is a misconfiguration.

But I get the idea; maybe there will be other rules in the future that can be configured.

@tomkerkhove
Member Author

But I get the idea; maybe there will be other rules in the future that can be configured.

This indeed!

But also, based on your argument above, why are we making it opt-in then? To give people the flexibility to choose how aggressive they want to be. Maybe a ScaledObject is scaling something that also has an HPA on top of it because of a transition period; who knows.

It's more about giving control over enforcement rather than enabling everything or nothing.

@JorTurFer
Member

Maybe a ScaledObject is scaling something that also has an HPA on top of it because of a transition period; who knows.

This doesn't work. The HPA controllers will conflict with each other; this scenario is worse than not having autoscaling at all, because every HPA controller cycle will change the replicas twice if the two set different values.

@tomkerkhove
Member Author

I know it does not work and it's just an example, but the point is that we should allow end-users to mix and match rules. Today, the only option is to enable all validation or none, while there can be scenarios where it makes sense to turn off certain validations.

@stale

stale bot commented Mar 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 4, 2023
@tomkerkhove added the stale-bot-ignore label and removed the stale label Mar 6, 2023
@stale stale bot removed the stale label Mar 6, 2023
@ppanyukov

This admission hook gives me a lot of headaches where we want to transition from HPA to KEDA for existing production workloads, specifically because it prevents deployment when there is already another HPA. From the docs:

The scaled workload (scaledobject.spec.scaleTargetRef) is already autoscaled by another other sources (other ScaledObject or HPA).

It makes it very difficult to switch from vanilla HPA to KEDA-managed HPA. Consider this scenario for production:

We have this configured in our Helm chart:

deployment:
  replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20

And the HPA has currently set desiredReplicaCount = 15.

Now imagine we want to transition to KEDA.

We'd have Helm configured like this:

useKeda: true

The rendered templates will remove the HPA and add a KEDA ScaledObject. But we can't deploy this, because Helm will want to add the ScaledObject while the HPA is still there, and so the deployment will fail.

With this hard validation, migration is complex. We'd need to delete our HPA first, then add KEDA. But what happens in that case? Since the HPA is gone, k8s is going to set replicaCount = 3 instead of the desired 15, pretty much killing our app under load.

The workaround is way too complex: set replicaCount equal to the current desiredReplicaCount as reported by the HPA, then deploy. Then delete the HPA. Then deploy KEDA. Then wait a bit. Then set replicaCount back to 3.
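
For illustration, roughly what the interim values look like mid-migration, reusing the keys from the chart snippet above (15 being the desiredReplicaCount currently reported by the HPA):

deployment:
  replicaCount: 15     # temporarily pinned to the HPA's current desiredReplicaCount
autoscaling:
  enabled: false       # the vanilla HPA gets removed at this step
useKeda: true          # the KEDA ScaledObject takes over; replicaCount goes back to 3 once it does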

This can be easily avoided if we could temporarily disable this validation and temporarily have both the HPA and KEDA (with exactly the same CPU profile) for a brief period of time.

@zroubalik
Member

This admission hook gives me a lot of headaches where we want to transition from HPA to KEDA for existing production workloads […] This can be easily avoided if we could temporarily disable this validation and temporarily have both the HPA and KEDA (with exactly the same CPU profile) for a brief period of time.

@ppanyukov as a workaround, you can disable it (scale the webhook controller to 0); after the migration you can enable it again.
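
For illustration, a minimal sketch of that workaround through the kedacore/keda Helm chart, assuming the chart version in use exposes a webhooks.enabled toggle (check the chart's values.yaml; the exact key may differ). Scaling the keda-admission-webhooks Deployment down to 0 replicas and back up afterwards achieves the same thing without a Helm upgrade.

# temporary override applied only for the migration window (assumed chart key)
webhooks:
  enabled: false   # skip the admission webhooks while the old HPA and the new ScaledObject coexist

Re-enable the webhooks as soon as the migration is done so the validations apply again.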

@ppanyukov

And speaking of more trouble with the admission webhook and Helm: I've hit another issue where we cannot uninstall any Helm releases which use KEDA for autoscaling, because the admission webhook errors out, the finalizers don't complete, and we are left with "hung" and broken deployments.

Here is what I get from the logs:

keda-operator-77d5b46d4-fbjzf keda-operator 2023-03-21T15:45:25Z

ERROR	Failed to update ScaledObject after removing a finalizer	{
    "controller": "scaledobject",
    "controllerGroup": "keda.sh",
    "controllerKind": "ScaledObject",
    "ScaledObject": {
        "name": "my-app",
        "namespace": "ops"
    },
    "namespace": "ops",
    "name": "my-app",
    "reconcileID": "e996beff-a348-4b99-b479-ccd7eead01dc",
    "finalizer": "finalizer.keda.sh",
    "error": "admission webhook \"vscaledobject.kb.io\" denied the request: Deployment.apps \"my-app\" not found"
}

github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).finalizeScaledObject
	/workspace/controllers/keda/scaledobject_finalizer.go:78
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile
	/workspace/controllers/keda/scaledobject_controller.go:157
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235

See? And the reason this happens is that Helm removes the Deployment first. Here is the uninstall order:

client.go:477: [debug] Starting delete for "my-app" Ingress
client.go:477: [debug] Starting delete for "my-app" Service
client.go:477: [debug] Starting delete for "my-app" Deployment
client.go:477: [debug] Starting delete for "my-app" ConfigMap
client.go:477: [debug] Starting delete for "my-app-1" Mapping
client.go:477: [debug] Starting delete for "my-app-2" Mapping
client.go:477: [debug] Starting delete for "my-app" ScaledObject
uninstall.go:148: [debug] purge requested for my-app
release "my-app" uninstalled

@ppanyukov

ppanyukov commented Mar 21, 2023

@ppanyukov as a workaround, you can disable it (scale the webhook controller to 0); after the migration you can enable it again.

Well, maybe. It's a good suggestion, to be fair. But it's yet another "mysterious dependency" in a million-piece puzzle that we'd be grateful not to have to think about.

Any suggestions for the "Helm uninstall" workaround? Scaling the admission hook to zero in this case is very easy to forget to do, and once a Helm release gets into a botched state, it's not so easy to fix things. Although, actually, I wonder whether the finalizers will complete if I disable the admission webhook post-uninstall? Need to test this.

EDIT: it does look like the reconciler eventually runs a cycle and cleans things up when the admission webhook is disabled after the Helm uninstall. It's still not great. Should I report this "uninstall" problem as a separate issue? I would say this validation that the deployment should exist is wrong, as it depends on ordering. We should be able to deploy things in any order: first because we cannot always control it, and second because sometimes we may want to configure ScaledObjects separately before there are any deployments.

@JorTurFer
Member

This can be easily avoided if we could temporarily disable this validation and temporarily have both the HPA and KEDA (with exactly the same CPU profile) for a brief period of time.

Be careful, because this may not happen as you expect: 2 HPAs (the old one and the new one generated by KEDA) can have unexpected behaviors. The HPA controller doesn't scale the value instantly in the first cycle AFAIK, so in the first cycle it will potentially scale in your workload, reducing your active instances. You can try this by just adding another HPA and checking whether this happens or not, but I'd verify it first.

sometimes we may want to configure ScaledObjects separately before there are any deployments

This is unsupported at the moment; the only option for this is totally disabling the webhooks.

About the deletion itself, the webhook definitely shouldn't block the deletion, but maybe there is an error or some unexpected behavior. I need to check it in depth because I have an idea about what could be happening.

@JorTurFer
Member

sometimes we may want to configure ScaledObjects separately before there are any deployments

I have reviewed the code, and you can deploy ScaledObjects before any deployment if you aren't using the CPU/memory scalers; if you are using them, you can't, because the resources are required to validate them.
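
As an illustration, a minimal sketch of a ScaledObject that uses only a non-resource trigger (a cron trigger here; the my-app/ops names simply reuse the example from the logs above) and can therefore be applied before its target Deployment exists:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app
  namespace: ops
spec:
  scaleTargetRef:
    name: my-app               # Deployment that may not exist yet
  minReplicaCount: 3
  maxReplicaCount: 20
  triggers:
    - type: cron               # no cpu/memory trigger, so no resource requests are needed for validation
      metadata:
        timezone: Etc/UTC
        start: "0 8 * * *"
        end: "0 20 * * *"
        desiredReplicas: "10"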

@tomkerkhove
Member Author

I think it's good that we have fixed the issue, but the main point remains: end-users should have an easy way to disable one/multiple/all rules.

While scaling in the webhooks component or opting out in Helm is a mitigation, I believe we should make the rules configurable.

@ppanyukov

Yes @JorTurFer, I saw your PR, thanks for that, it will make our lives much easier :)

Hopefully some consensus will be reached on other validation issues as well.

@tomkerkhove
Member Author

Any thoughts on this @kedacore/keda-maintainers? I think having them configurable through a YAML file is still valuable and trivial to do.

@tomkerkhove added the needs-discussion label and removed the feature-request label May 23, 2023
@tomkerkhove
Member Author

tomkerkhove commented Jul 10, 2023

@JorTurFer @zroubalik After giving this another thought and discussing it with @howang-ms: why not introduce a CRD, to get proper resource validation and so on, rather than mounting a YAML file?

I think it would be a nice addition to have typed rules in a ClusterPreventionRules resource that lists every entry from https://keda.sh/docs/2.11/concepts/admission-webhooks/.

It also allows us to offer end-users proper tooling: a simple kubectl get -o wide would show what is applied.
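
For illustration, a minimal sketch of what such a resource could look like; the group/version, kind and field names below are hypothetical and only mirror the proposal above and the rules discussed in this thread:

apiVersion: keda.sh/v1alpha1                 # hypothetical group/version
kind: ClusterPreventionRules                 # hypothetical kind proposed above
metadata:
  name: default
spec:
  rules:
    - name: scale-target-already-autoscaled  # target already scaled by another ScaledObject/HPA
      enabled: false                         # e.g. disabled temporarily during an HPA-to-KEDA migration
    - name: cpu-memory-requests-required     # CPU/memory triggers require resource requests on the target
      enabled: true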

@JorTurFer
Member

JorTurFer commented Jul 10, 2023

I think it'd be a nice addition; maybe it's a bit overkill at the moment because we only have 3-5 checks, but if we want to extend it, it'd be nice. I'm curious whether there is any tool to generate scripts that we could inject there. I'm thinking of passing the check code via the CRD instead of having to code it inside the webhooks; that'd bring total flexibility for extending the checks to platform providers.

@tomkerkhove
Member Author

I think it'd be a nice addition; maybe it's a bit overkill at the moment because we only have 3-5 checks, but if we want to extend it, it'd be nice.

This is indeed about easing usage for the long term, but we've also recently had people who were blocked from migrating to KEDA because an HPA was already around. Although we have the feature now, they could have migrated faster if they had been able to disable that one rule.

So scaling the number of rules is not the only driver; being able to customize what is there today is another one.

I'm curious whether there is any tool to generate scripts that we could inject there. I'm thinking of passing the check code via the CRD instead of having to code it inside the webhooks; that'd bring total flexibility for extending the checks to platform providers.

Maybe, but that is out of scope here

@tomkerkhove added the feature label and removed the needs-discussion label Jul 10, 2023