Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate hpa target values #120373

Closed

Conversation

Lukasz-AWS
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes a bug where if the corresponding value (value, averageValue) of the the type of target (Value, AverageValue) is not set but the opposite type is, HPA object would pass validation but then cause kube-controller-manager to crash loop due to nil pointer dereference looking for that corresponding value. This PR adds an additional check to make sure the corresponding value gets set.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Observed in 1.27 and 1.28, possibly might be an issue in earlier versions as well.
Can be replicated by deploying an hpa with a valid scaleTargetRef and adding this object to the metrics

...
- apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  spec:
    maxReplicas: 10
    metrics:
    - type: Object
      object:
        metric:
          name: requests-per-second
        describedObject:
          apiVersion: networking.k8s.io/v1beta1
          kind: Ingress
          name: main-route
        target:
          type: AverageValue
          value: 1
...

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 1, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @Lukasz-AWS!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 1, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @Lukasz-AWS. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 1, 2023
@krmayankk
Copy link

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2023
allErrs = append(allErrs, field.Invalid(fldPath.Child("value"), mt.Value, "must be set for metric target type Value"))
}

if mt.Type == autoscaling.AverageValueMetricType && mt.AverageValue == nil && mt.Value != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only care that when Type is AverageValue, averageValue is specified and when type is Value, value is specified. I think we can ignore if Value is set or not in this specific check

Copy link

@krmayankk krmayankk Sep 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only value of additionally checking if Value is set or not for type AverageValue is to show a more helpful message saying , you set value but instead you need to set averageValue, which i dont see in the code

Copy link
Contributor Author

@Lukasz-AWS Lukasz-AWS Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look! I've made the requested change to only care about the mapping we're looking for, and removed some now redundant tests as they're calling validateMetricTarget beforehand actually ended up just fixing these just in case.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 5, 2023
@krmayankk
Copy link

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 9, 2023
@krmayankk
Copy link

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 9, 2023
@krmayankk
Copy link

@kubernetes/sig-autoscaling-leads for review

@krmayankk
Copy link

/retest

Copy link
Member

@gjtempleton gjtempleton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits to remove deprecated methods and make the linter happy.

pkg/apis/autoscaling/validation/validation_test.go Outdated Show resolved Hide resolved
pkg/apis/autoscaling/validation/validation_test.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 29, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2023
@Lukasz-AWS
Copy link
Contributor Author

/retest

@Lukasz-AWS
Copy link
Contributor Author

Hey @liggitt could you take another look please.
Alternatively since this PR has been open for a good while now, I could revert the validation changes and just add the nil checks to nip the bug.

@liggitt
Copy link
Member

liggitt commented Oct 5, 2023

Sorry, ~all PRs got preempted by design reviews through design freeze today.

A separate PR to fix the nil panic we can backport would be good regardless. This one is in my queue to review now that design freeze is upon us.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Lukasz-AWS
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Lukasz-AWS
Copy link
Contributor Author

No worries! And understood thanks! I moved the nil check change to its own PR here #121015

@Lukasz-AWS
Copy link
Contributor Author

hey @liggitt bumping this PR again in case it got lost in the shuffle.

@liggitt
Copy link
Member

liggitt commented Nov 21, 2023

hey @liggitt bumping this PR again in case it got lost in the shuffle.

Paging this back in.

We have five types of metrics: object, pod, resource, containerResource, and external

For each type, we have existing controller behavior and existing validation behavior.

When validation allows in things that make the controller fail at runtime, that's a bug (🐞) and we can tighten validation in a ratcheting way.

When validation allows in things that are weird / underspecified but the controller also doesn't check for or tolerates, people could currently be successful setting those weird / underspecified things, so we should (at most) issue a warning (⚠️).

I swept and annotated the current behavior:

  • objectMetric
  • podsMetric
    • controller
      • assumes AverageValue is non-nil (panics if it isn't ... this could be checked for separately and backported, though validation is protecting us right now, so this technically isn't a bug, just a scary assumption that happens to be safe at the moment)
      • doesn't check Type matches (⚠️)
      • replicaCountProposal, usageProposal, timestampProposal, err := a.replicaCalc.GetMetricReplicas(currentReplicas, metricSpec.Pods.Target.AverageValue.MilliValue(), metricSpec.Pods.Metric.Name, hpa.Namespace, selector, metricSelector)
    • validation
  • resourceMetric / containerResourceMetric
    • controller
    • validation
      • ensures exactly one of AverageUtilization and AverageValue are non-nil (✅)
      • doesn't check Type matches (⚠️)
      • doesn't prevent Value from being set (⚠️)
      • func validateContainerResourceSource(src *autoscaling.ContainerResourceMetricSource, fldPath *field.Path) field.ErrorList {
        allErrs := field.ErrorList{}
        if len(src.Name) == 0 {
        allErrs = append(allErrs, field.Required(fldPath.Child("name"), "must specify a resource name"))
        } else {
        allErrs = append(allErrs, corevalidation.ValidateContainerResourceName(src.Name, fldPath.Child("name"))...)
        }
        if len(src.Container) == 0 {
        allErrs = append(allErrs, field.Required(fldPath.Child("container"), "must specify a container"))
        } else {
        allErrs = append(allErrs, apivalidation.ValidateDNS1123Label(src.Container, fldPath.Child("container"))...)
        }
        allErrs = append(allErrs, validateMetricTarget(src.Target, fldPath.Child("target"))...)
        if src.Target.AverageUtilization == nil && src.Target.AverageValue == nil {
        allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("averageUtilization"), "must set either a target raw value or a target utilization"))
        }
        if src.Target.AverageUtilization != nil && src.Target.AverageValue != nil {
        allErrs = append(allErrs, field.Forbidden(fldPath.Child("target").Child("averageValue"), "may not set both a target raw value and a target utilization"))
        }
        return allErrs
        }
        func validateResourceSource(src *autoscaling.ResourceMetricSource, fldPath *field.Path) field.ErrorList {
        allErrs := field.ErrorList{}
        if len(src.Name) == 0 {
        allErrs = append(allErrs, field.Required(fldPath.Child("name"), "must specify a resource name"))
        }
        allErrs = append(allErrs, validateMetricTarget(src.Target, fldPath.Child("target"))...)
        if src.Target.AverageUtilization == nil && src.Target.AverageValue == nil {
        allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("averageUtilization"), "must set either a target raw value or a target utilization"))
        }
        if src.Target.AverageUtilization != nil && src.Target.AverageValue != nil {
        allErrs = append(allErrs, field.Forbidden(fldPath.Child("target").Child("averageValue"), "may not set both a target raw value and a target utilization"))
        }
  • externalMetric:

@@ -36,6 +37,33 @@ const (
MaxStabilizationWindowSeconds int32 = 3600
)

// HPAValidationOptions contains the different settings for HPA validation
type HPAValidationOptions struct {
AllowMismatchedTargetTypeAndValue bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the only thing we're ratcheting here is objectMetrics, make the name explicitly about that metric type:

Suggested change
AllowMismatchedTargetTypeAndValue bool
AllowMismatchedObjectMetricTypeAndValue bool

opts := HPAValidationOptions{
AllowMismatchedTargetTypeAndValue: false,
}
// Don't allow mismatched target type/value fields for new HPAs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Don't allow mismatched target type/value fields for new HPAs
// Don't allow mismatched objectMetric target type/value fields for new HPAs

Comment on lines +54 to +63
// Check for existing bad mismatched target type/value fields for ObjectMetricSource
// If UtilizationMetricType was set previously then either one of the values could have been used to pass validation before
for _, ms := range oldAutoscaler.Spec.Metrics {
if ms.Object != nil && ((ms.Object.Target.Type == autoscaling.ValueMetricType && ms.Object.Target.Value == nil) ||
(ms.Object.Target.Type == autoscaling.AverageValueMetricType && ms.Object.Target.AverageValue == nil) ||
(ms.Object.Target.Type == autoscaling.UtilizationMetricType)) {
opts.AllowMismatchedTargetTypeAndValue = true
return opts
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit would be easier to read in a standalone method that could return early:

func hasMismatchedObjectMetricType(hpa *autoscaling.HorizontalPodAutoscaler) bool {
  if hpa == nil {
    return false
  }
  for _, ms := range oldAutoscaler.Spec.Metrics {
    switch {
    case ms.Object == nil:
      continue
    case ms.Object.Target.Type == autoscaling.ValueMetricType && ms.Object.Target.Value == nil:
      return true
    case ms.Object.Target.Type == autoscaling.AverageValueMetricType && ms.Object.Target.AverageValue == nil:
      return true
    case ms.Object.Target.Type == autoscaling.UtilizationMetricType:
      // If object metrics accept UtilizationMetricType in the future, expand this case to also check ms.Object.Target.AverageUtilizationValue == nil
      return true
    }
  }
  return false
}

Comment on lines +383 to +388
if src.Target.AverageValue != nil {
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("averageValue"), src.Target.AverageValue, "must not set target averageValue for value metric type"))
}
if src.Target.AverageUtilization != nil {
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("averageUtilization"), src.Target.AverageUtilization, "must not set target averageUtilization for value metric type"))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the AverageValue / AverageUtilization checks here... these go further than the controller does and can fail an existing object that is currently working:

object:
  ...
  target:
    type: Value
    value: "..."
    averageValue: "..."
    averageUtilization: "..."

That would pass validation today, work successfully with the controller today (averageValue and averageUtilization would be ignored), result in AllowMismatchedTargetTypeAndValue=false, and then get rejected on update by validation, which is not what we want.

allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("type"), src.Target.Type, "must not set Utilization type for Object source type"))
case autoscaling.ValueMetricType:
if src.Target.Value == nil {
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("value"), "must set target value for value metric type"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("value"), "must set target value for value metric type"))
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("value"), "required when type=Value"))

if src.Target.AverageUtilization != nil {
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("averageUtilization"), src.Target.AverageUtilization, "must not set target averageUtilization for value metric type"))
}
case autoscaling.AverageValueMetricType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop the Value / AverageUtilization checks in this case for the same reason as mentioned above

allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("value"), src.Target.Value, "must not set target value for average value metric type"))
}
if src.Target.AverageValue == nil {
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("averageValue"), "must set target averageValue for average value metric type"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("averageValue"), "must set target averageValue for average value metric type"))
allErrs = append(allErrs, field.Required(fldPath.Child("target").Child("averageValue"), "required when type=AverageValue"))

if !opts.AllowMismatchedTargetTypeAndValue {
switch src.Target.Type {
case autoscaling.UtilizationMetricType:
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("type"), src.Target.Type, "must not set Utilization type for Object source type"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("type"), src.Target.Type, "must not set Utilization type for Object source type"))
allErrs = append(allErrs, field.Invalid(fldPath.Child("target").Child("type"), src.Target.Type, "must be either Value, or AverageValue"))

@liggitt
Copy link
Member

liggitt commented Nov 21, 2023

for the gaps flagged as being warning-worthy (⚠️), adding warnings in autoscalerStrategy#WarningsOnCreate / autoscalerStrategy#WarningsOnUpdate would be fine, but let's do that separately from this validation change

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

7 participants