Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count pod overhead against an entity's ResourceQuota #99600

Merged

Conversation

gjkim42
Copy link
Member

@gjkim42 gjkim42 commented Mar 1, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

According to https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/688-pod-overhead, the ResourceQuota controller should account for pod.Spec.Overhead

Which issue(s) this PR fixes:

Fixes #99577

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Count pod overhead against an entity's ResourceQuota

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note cncf-cla: yes labels Mar 1, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

/cc @egernst
Could you review this?

@k8s-ci-robot k8s-ci-robot requested a review from egernst Mar 1, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node label Mar 1, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug label Mar 1, 2021
@troy0820
Copy link
Member

@troy0820 troy0820 commented Mar 1, 2021

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon triage/accepted labels Mar 1, 2021
@troy0820 troy0820 added this to Triage in SIG Node PR Triage Mar 1, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

Need ok-to-test!

@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

/auto-cc

@k8s-ci-robot k8s-ci-robot requested review from deads2k and vishh Mar 1, 2021
@troy0820
Copy link
Member

@troy0820 troy0820 commented Mar 1, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test label Mar 1, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 1, 2021

/retest

@troy0820
Copy link
Member

@troy0820 troy0820 commented Mar 2, 2021

/assign

@troy0820 troy0820 removed their assignment Mar 2, 2021
@troy0820
Copy link
Member

@troy0820 troy0820 commented Mar 2, 2021

/assign @dims

@chestack
Copy link
Contributor

@chestack chestack commented Mar 2, 2021

@gjkim42 thanks for your quick fix.

About PodOverhead, I have something else to discuss.

Currently, PodOverhead is defined as a fixed resource. this value looks like the average usage value and being added into requests resource during scheduling.

type Overhead struct {
  PodFixed *ResourceList
}

My use case is kata_container, the VMM related processes consume much overhead resource when pod running with high workload. A small overhead(small cgroup limits) will trigger oom but a big value result that big overhead requests take into account during scheduling.

So how about define PodOverhead including limits and requests like that in Pod resources definition. It could be helpful

    resources:
      limits:
        memory: 1Gi
      requests:
        memory: 200M

@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 2, 2021

So how about define PodOverhead including limits and requests like that in Pod resources definition. It could be helpful

    resources:
      limits:
        memory: 1Gi
      requests:
        memory: 200M

@chestack
In my opinion, there seem to be some places to be used. Maybe, you can make an issue to request api-change. However, I am not familiar with this domain.

@egernst need your opinion!

@gjkim42 gjkim42 force-pushed the count-pod-overhead-against-rsquota branch from c783d35 to 53ab29b Compare Mar 2, 2021
@k8s-ci-robot k8s-ci-robot added size/M sig/api-machinery labels Mar 2, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 2, 2021

/retest

@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 4, 2021

Rebased. PTAL
/assign @vishh (for approval)
ping @derekwaynecarr

@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 4, 2021

/assign @vishh

usage corev1.ResourceList
pod *api.Pod
usage corev1.ResourceList
podOverheadEnabled bool
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so default is false? And all tests that do not set this field will execute with the pod overhead disabled? What am I missing?

Copy link
Member Author

@gjkim42 gjkim42 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would say that there is no need to explicitly enable the pod overhead for the other cases. The other cases are irrelevant.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. So there is no intention to explicitly disable it for other cases. It is just a side effect of the need to enable it for one case. Since it's in beta and about to get GA, does this logic even needed?

If you think it is needed, maybe just do nothing when it's false. See comment in file below

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev Mar 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see the new test case. it's ok than. Hopefully will GA PodOverhead in 1.22

@SergeyKanzhelev
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev commented Mar 5, 2021

So this KEP paragraph wasn't implemented? This comment still lists it as a "(placeholder)".

KEP xref: #kubernetes/enhancements#688

},
},
},
usage: corev1.ResourceList{
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test case tests for usage, not for quota, correct? Sorry I might be missing something

Copy link
Member Author

@gjkim42 gjkim42 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the test is for usage, and resourcequota controller ensures the usage to not exceed the quota

According to the KEP (The controller will be updated to add the pod Overhead to the container resource request summation.), I would say that the container resource request summation means the usage. And #99577 requested the same thing.

@gjkim42 gjkim42 force-pushed the count-pod-overhead-against-rsquota branch from 0c65219 to df48ee4 Compare Mar 5, 2021
Copy link
Member Author

@gjkim42 gjkim42 left a comment

usage corev1.ResourceList
pod *api.Pod
usage corev1.ResourceList
podOverheadEnabled bool
Copy link
Member Author

@gjkim42 gjkim42 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would say that there is no need to explicitly enable the pod overhead for the other cases. The other cases are irrelevant.

},
},
},
usage: corev1.ResourceList{
Copy link
Member Author

@gjkim42 gjkim42 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the test is for usage, and resourcequota controller ensures the usage to not exceed the quota

According to the KEP (The controller will be updated to add the pod Overhead to the container resource request summation.), I would say that the container resource request summation means the usage. And #99577 requested the same thing.

"do not count pod overhead as usage with pod overhead disabled": {
pod: &api.Pod{
Spec: api.PodSpec{
Overhead: api.ResourceList{
api.ResourceCPU: resource.MustParse("1"),
},
Containers: []api.Container{
{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceCPU: resource.MustParse("1"),
},
Limits: api.ResourceList{
api.ResourceCPU: resource.MustParse("2"),
},
},
},
},
},
},
usage: corev1.ResourceList{
corev1.ResourceRequestsCPU: resource.MustParse("1"),
corev1.ResourceLimitsCPU: resource.MustParse("2"),
corev1.ResourcePods: resource.MustParse("1"),
corev1.ResourceCPU: resource.MustParse("1"),
generic.ObjectCountQuotaResourceNameFor(schema.GroupResource{Resource: "pods"}): resource.MustParse("1"),
},
podOverheadEnabled: false,
},
Copy link
Member Author

@gjkim42 gjkim42 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the case containing pod overhead with the pod overhead disabled

@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 5, 2021

So this KEP paragraph wasn't implemented?

I don't know the full history, but I think it wasn't.

@SergeyKanzhelev
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev commented Mar 6, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 6, 2021
@gjkim42
Copy link
Member Author

@gjkim42 gjkim42 commented Mar 8, 2021

@derekwaynecarr
Need your approval! Thanks.

@@ -354,6 +354,10 @@ func PodUsageFunc(obj runtime.Object, clock clock.Clock) (corev1.ResourceList, e
limits = quota.Max(limits, pod.Spec.InitContainers[i].Resources.Limits)
}

if feature.DefaultFeatureGate.Enabled(features.PodOverhead) {
Copy link
Member

@derekwaynecarr derekwaynecarr Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wish we had this in beta as intended; happy we caught it prior to moving to GA.

@derekwaynecarr
Copy link
Member

@derekwaynecarr derekwaynecarr commented Mar 9, 2021

/approve
/lgtm

this should probably warrant cherry-picks to past releases.

@k8s-ci-robot
Copy link
Contributor

@k8s-ci-robot k8s-ci-robot commented Mar 9, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, dims, gjkim42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Mar 9, 2021
@k8s-ci-robot k8s-ci-robot merged commit 03ae13a into kubernetes:master Mar 10, 2021
14 checks passed
SIG Node PR Triage automation moved this from Needs Approver to Done Mar 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 10, 2021
@gjkim42 gjkim42 deleted the count-pod-overhead-against-rsquota branch Mar 10, 2021
k8s-ci-robot added a commit that referenced this issue Mar 15, 2021
…00-upstream-release-1.19

Automated cherry pick of #99600: Count pod overhead as an entity's resource usage
k8s-ci-robot added a commit that referenced this issue Mar 15, 2021
…00-upstream-release-1.18

Automated cherry pick of #99600: Count pod overhead as an entity's resource usage
k8s-ci-robot added a commit that referenced this issue Mar 15, 2021
…00-upstream-release-1.20

Automated cherry pick of #99600: Count pod overhead as an entity's resource usage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved cncf-cla: yes kind/bug lgtm ok-to-test priority/important-soon release-note sig/api-machinery sig/node size/M triage/accepted
Projects
Development

Successfully merging this pull request may close these issues.

10 participants