Monitor the /kubepods cgroup for allocatable metrics #57802

Merged
merged 1 commit into kubernetes:master on Feb 19, 2018

Conversation

@dashpole
Contributor

dashpole commented Jan 3, 2018

What this PR does / why we need it:
The current implementation of allocatable memory evictions sums the usage of pods in order to compute the total usage by user processes.
This PR changes this to instead monitor the /kubepods cgroup, which contains all pods, and use this value directly. This is more accurate than summing pod usage, as it is measured at a single point in time.
This also collects metrics from this cgroup on-demand.
This PR is a precursor to memcg notifications on the /kubepods cgroup.
This removes the eviction manager's dependency on the container manager, and adds a dependency from the summary collector on the container manager (to get the cgroup root).
This also changes the way the allocatable memory eviction signal and threshold are added, bringing them in line with the memory eviction signal, to address #53902.
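
For illustration only: a minimal sketch (not the PR's code) of reading total pod memory usage in a single read from the /kubepods cgroup, which is what makes the measurement a single point in time. It assumes cgroup v1 with the cgroupfs driver at the default mount point; the helper name and path are assumptions.

```go
// Sketch: one read of the /kubepods memory cgroup yields the total
// usage of all pods at a single instant, instead of summing per-pod
// values sampled at slightly different times. The PR itself goes
// through cAdvisor; this only illustrates the idea.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kubepodsMemoryUsage is a hypothetical helper; the path assumes
// cgroup v1 with the cgroupfs driver mounted at /sys/fs/cgroup.
func kubepodsMemoryUsage() (uint64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	usage, err := kubepodsMemoryUsage()
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading /kubepods usage:", err)
		os.Exit(1)
	}
	fmt.Printf("total pod memory usage: %d bytes\n", usage)
}
```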

Which issue(s) this PR fixes:
Fixes #55638
Fixes #53902

Special notes for your reviewer:
I have tested this, and can confirm that it works when CgroupsPerQos is set to false. In this case, it returns node metrics, as it is monitoring the / cgroup, rather than the /kubepods cgroup (which doesn't exist).

Release note:

Expose total usage of pods through the "pods" SystemContainer in the Kubelet Summary API
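
Once merged, the new "pods" system container can be read from the kubelet Summary API. A minimal sketch, assuming the kubelet's read-only port (10255) is enabled; the struct mirrors only the response fields needed here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// summary mirrors just the fields of the kubelet Summary API response
// that this example needs.
type summary struct {
	Node struct {
		SystemContainers []struct {
			Name   string `json:"name"`
			Memory struct {
				WorkingSetBytes *uint64 `json:"workingSetBytes"`
			} `json:"memory"`
		} `json:"systemContainers"`
	} `json:"node"`
}

func main() {
	resp, err := http.Get("http://localhost:10255/stats/summary")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var s summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	// The "pods" system container reports total usage across all pods.
	for _, c := range s.Node.SystemContainers {
		if c.Name == "pods" && c.Memory.WorkingSetBytes != nil {
			fmt.Printf("pods working set: %d bytes\n", *c.Memory.WorkingSetBytes)
		}
	}
}
```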

cc @sjenning @derekwaynecarr @vishh @kubernetes/sig-node-pr-reviews

@dashpole

Contributor

dashpole commented Jan 3, 2018

/retest

@dashpole

Contributor

dashpole commented Jan 3, 2018

/retest

@dashpole

Contributor

dashpole commented Jan 3, 2018

looks like cross is already failing.

@dashpole

Contributor

dashpole commented Jan 12, 2018

/retest

@dashpole

Contributor

dashpole commented Feb 7, 2018

/assign @sjenning
any chance you have bandwidth to take a look?

@dashpole

Contributor

dashpole commented Feb 8, 2018

/assign @yguo0905

@dashpole

Contributor

dashpole commented Feb 9, 2018

/retest

// build the function to work against for pod stats
statsFunc := cachedStatsFunc(summary.Pods)
// build an evaluation context for current eviction signals
result := signalObservations{}
if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
capacity := resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI)

@yguo0905

yguo0905 Feb 12, 2018

Contributor

This change is unnecessary?

@dashpole

dashpole Feb 12, 2018

Contributor

fixed

@@ -500,6 +500,11 @@ func (cm *containerManagerImpl) GetNodeConfig() NodeConfig {
return cm.NodeConfig
}
// GetCgroupRoot returns the cgroup containing all pods
func (cm *containerManagerImpl) GetCgroupRoot() string {

@yguo0905

yguo0905 Feb 12, 2018

Contributor

CgroupRoot doesn't seem to be very descriptive - can't tell from its name that this is the cgroup for pods, but if this is a well-known term, then nvm.

@dashpole

dashpole Feb 12, 2018

Contributor

changed to GetPodCgroupRoot

@yguo0905

Contributor

yguo0905 commented Feb 12, 2018

/lgtm

@dashpole

Contributor

dashpole commented Feb 15, 2018

/kind feature
/priority important-soon

func addAllocatableThresholds(thresholds []evictionapi.Threshold) []evictionapi.Threshold {
additionalThresholds := []evictionapi.Threshold{}
for _, threshold := range thresholds {
isHardEvictionThreshold := threshold.GracePeriod == 0*time.Second

@sjenning

sjenning Feb 16, 2018

Contributor

nit: use isHardEvictionThreshold()

@dashpole

dashpole Feb 16, 2018

Contributor

done
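
For reference, a self-contained sketch of the helper the nit points at; the type is stubbed, and the real signature in the eviction package may differ:

```go
package main

import (
	"fmt"
	"time"
)

// Threshold stands in for evictionapi.Threshold; only the field used
// here is stubbed out.
type Threshold struct {
	GracePeriod time.Duration
}

// isHardEvictionThreshold mirrors the helper named in the nit: hard
// eviction thresholds are exactly those with no grace period.
func isHardEvictionThreshold(t Threshold) bool {
	return t.GracePeriod == 0
}

func main() {
	fmt.Println(isHardEvictionThreshold(Threshold{}))                         // true
	fmt.Println(isHardEvictionThreshold(Threshold{GracePeriod: time.Minute})) // false
}
```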

}
}
if enforceAllocatable {
results = addAllocatableThresholds(results)

@sjenning

sjenning Feb 16, 2018

Contributor

nit: could move this inside the above for loop and break afterward, removing the need for enforceAllocatable

@dashpole

dashpole Feb 16, 2018

Contributor

done
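
A sketch of the restructuring the nit suggests, with stand-in types; the loop, key name, and helper are assumptions, not the merged code:

```go
package main

import "fmt"

// addAllocatableThresholds stands in for the real helper; it appends
// allocatable variants of the configured thresholds.
func addAllocatableThresholds(thresholds []string) []string {
	return append(thresholds, "allocatableMemory.available")
}

func main() {
	allocatableConfig := []string{"pods"} // assumed enforcement keys
	results := []string{"memory.available"}
	// Do the work inside the loop and break, instead of setting an
	// enforceAllocatable flag and checking it after the loop.
	for _, key := range allocatableConfig {
		if key == "pods" {
			results = addAllocatableThresholds(results)
			break
		}
	}
	fmt.Println(results) // [memory.available allocatableMemory.available]
}
```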

@sjenning

Contributor

sjenning commented Feb 16, 2018

two nits, but lgtm

@k8s-merge-robot k8s-merge-robot removed the lgtm label Feb 16, 2018

@sjenning

Contributor

sjenning commented Feb 16, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 16, 2018

@@ -500,6 +500,11 @@ func (cm *containerManagerImpl) GetNodeConfig() NodeConfig {
return cm.NodeConfig
}
// GetPodCgroupRoot returns the cgroup containing all pods
func (cm *containerManagerImpl) GetPodCgroupRoot() string {
return cm.cgroupRoot

@derekwaynecarr

derekwaynecarr Feb 16, 2018

Member

this will return /kubepods w/ systemd cgroup driver which is wrong.

@derekwaynecarr

derekwaynecarr Feb 16, 2018

Member

if we are returning the literal cgroupfs value, you need to take into account the driver.

as a result, need to return

return cm.cgroupManager.Name(CgroupName(cm.cgroupRoot))

@dashpole

dashpole Feb 16, 2018

Contributor

done
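
Assembled from the suggestion above, the corrected method looks roughly like this (a sketch mirroring the snippet quoted in the review, not the final merged diff):

```go
// GetPodCgroupRoot returns the literal cgroupfs path of the cgroup
// containing all pods, translated through the configured cgroup
// driver so that the systemd driver yields the correct name.
func (cm *containerManagerImpl) GetPodCgroupRoot() string {
	return cm.cgroupManager.Name(CgroupName(cm.cgroupRoot))
}
```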

@@ -78,6 +78,8 @@ type StatsProvider interface {
ListVolumesForPod(podUID types.UID) (map[string]volume.Volume, bool)
// GetPods returns the specs of all the pods running on this node.
GetPods() []*v1.Pod
// GetPodCgroupRoot returns the cgroup containing all pods
GetPodCgroupRoot() string

@derekwaynecarr

derekwaynecarr Feb 16, 2018

Member

can you note that this is the literal cgroupfs value?

@dashpole

dashpole Feb 16, 2018

Contributor

done

@@ -211,6 +211,11 @@ func (kl *Kubelet) GetNodeConfig() cm.NodeConfig {
return kl.containerManager.GetNodeConfig()
}
// GetPodCgroupRoot returns the cgroup containing all pods

@derekwaynecarr

derekwaynecarr Feb 16, 2018

Member

note that this is the cgroupfs value

@dashpole

dashpole Feb 16, 2018

Contributor

done

@k8s-merge-robot k8s-merge-robot removed the lgtm label Feb 16, 2018

@derekwaynecarr

Member

derekwaynecarr commented Feb 16, 2018

This still has a problem on systemd, but it's due to a logic error in opencontainers/runc:
https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L399

It converts kubepods.slice into kubepods.slice/ rather than /kubepods.slice, and the lack of a leading slash causes a problem when calling into cAdvisor later to get container info.

As a result, you get logged messages every 10s about failing to find container info for kubepods.slice/.

We can fix that in a follow-on issue.
@sjenning can you make sure a member of the team takes that?
/approve

@yguo0905

Contributor

yguo0905 commented Feb 16, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 16, 2018

k8s-merge-robot added a commit that referenced this pull request Feb 17, 2018

Merge pull request #59841 from dashpole/metrics_after_reclaim
Automatic merge from submit-queue (batch tested with PRs 59683, 59964, 59841, 59936, 59686). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Reevaluate eviction thresholds after reclaim functions

**What this PR does / why we need it**:
When the node comes under `DiskPressure` due to inodes or disk space, the eviction manager runs garbage collection functions to clean up dead containers and unused images.
Currently, we use the strategy of trying to measure the disk space and inodes freed by garbage collection. However, as #46789 and #56573 point out, there are gaps in the implementation that can cause extra evictions even when they are not required. Furthermore, for nodes which frequently cycle through images, this strategy results in a large number of evictions, as running out of inodes always causes an eviction.

This PR changes this strategy to call the garbage collection functions and ignore the results.  Then, it triggers another collection of node-level metrics, and sees if the node is still under DiskPressure.
This way, we can simply observe the decrease in disk or inode usage, rather than trying to measure how much is freed.
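
As a control-flow sketch of that strategy (all names here are stand-ins, not the PR's code):

```go
// Run each reclaim function and ignore how much it claims to have
// freed, then re-collect node-level metrics and re-check thresholds.
for _, reclaim := range reclaimFuncs {
	reclaim() // e.g. container GC, image GC; returned amounts ignored
}
observations := makeSignalObservations(collectNodeMetrics())
if len(thresholdsMet(thresholds, observations)) == 0 {
	return nil // pressure relieved without evicting any pods
}
```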

**Which issue(s) this PR fixes**:
Fixes #46789
Fixes #56573
Related PR #56575

**Special notes for your reviewer**:
This will look cleaner after #57802  removes arguments from [makeSignalObservations](https://github.com/kubernetes/kubernetes/pull/57802/files#diff-9e5246d8c78d50ce4ba440f98663f3e9R719).

**Release note**:
```release-note
NONE
```

/sig node
/kind bug
/priority important-soon
cc @kubernetes/sig-node-pr-reviews

@dashpole

Contributor

dashpole commented Feb 17, 2018

rebased. Needs lgtm reapplied

@k8s-merge-robot k8s-merge-robot removed the lgtm label Feb 17, 2018

@dashpole

Contributor

dashpole commented Feb 19, 2018

/retest

@sjenning

Contributor

sjenning commented Feb 19, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 19, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 19, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, derekwaynecarr, sjenning, yguo0905

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 19, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 19, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 236fa89 into kubernetes:master Feb 19, 2018

12 of 13 checks passed

Submit Queue: Required GitHub CI test is not green: pull-kubernetes-e2e-gce
cla/linuxfoundation: dashpole authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-unit: Job succeeded.
pull-kubernetes-verify: Job succeeded.

k8s-merge-robot added a commit that referenced this pull request Feb 21, 2018

Merge pull request #60098 from dashpole/fix_localstorage_eviction
Automatic merge from submit-queue (batch tested with PRs 59934, 60098, 60103, 60104, 60109). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fix running with no eviction thresholds

**What this PR does / why we need it**:
After #57802, [LocalStorageCapacityIsolationEviction tests](https://k8s-testgrid.appspot.com/sig-node-kubelet#kubelet-serial-gce-e2e&include-filter-by-regex=LocalStorageCapacityIsolationEviction) started failing. They failed because the eviction manager was not running its synchronization loop when no thresholds were configured. We should still run the eviction manager synchronization loop even with no thresholds if the LocalStorageCapacityIsolation feature gate is enabled. The reason we didn't see this before is that a node-allocatable threshold used to be added even when there was no corresponding eviction threshold; #57802 changed this to only add a memory allocatable threshold when a memory eviction threshold is specified.
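
A sketch of the guard this describes; the identifiers follow the kubelet's feature-gate conventions but are assumptions, not the fix's literal code:

```go
// Skip the synchronize loop only when no thresholds are configured
// AND local storage capacity isolation is disabled; otherwise the
// loop must still run so local storage limits are enforced.
if len(m.config.Thresholds) == 0 &&
	!utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
	return nil
}
```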

**Release note**:
```release-note
NONE
```

/kind bug
/priority critical-urgent
/sig node
/assign @Random-Liu 
cc @kubernetes/sig-node-test-failures