[Bug Fix]: Avoid evicting more pods than necessary by adding Timestamps for fsstats and ignoring stale stats #42435
Conversation
@k8s-bot bazel test this

(branch force-pushed from 1fe645a to 29f5e55)

/release-note-none

(branch force-pushed from 29f5e55 to a90c795)
```diff
@@ -120,7 +120,7 @@ func (s *volumeStatCalculator) parsePodVolumeStats(podName string, metric *volum
 	inodesUsed := uint64(metric.InodesUsed.Value())
 	return stats.VolumeStats{
 		Name: podName,
-		FsStats: stats.FsStats{AvailableBytes: &available, CapacityBytes: &capacity, UsedBytes: &used,
-			Inodes: &inodes, InodesFree: &inodesFree, InodesUsed: &inodesUsed},
+		FsStats: stats.FsStats{Time: metric.Time, AvailableBytes: &available, CapacityBytes: &capacity,
+			UsedBytes: &used, Inodes: &inodes, InodesFree: &inodesFree, InodesUsed: &inodesUsed},
```
nit: One field per line.
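For illustration, here is a minimal, self-contained sketch of the literal from the diff written one field per line, as the nit suggests. The types below are simplified stand-ins for the real `stats.FsStats` and `stats.VolumeStats` (the real API types differ in detail, e.g. the timestamp type), and the values are placeholders:

```go
// Sketch only: simplified stand-ins for the kubelet stats API types
// referenced in the diff above, shown to illustrate the review nit.
package main

import (
	"fmt"
	"time"
)

type FsStats struct {
	Time           time.Time // the field this PR adds, populated from metric.Time
	AvailableBytes *uint64
	CapacityBytes  *uint64
	UsedBytes      *uint64
	Inodes         *uint64
	InodesFree     *uint64
	InodesUsed     *uint64
}

type VolumeStats struct {
	Name    string
	FsStats FsStats
}

func main() {
	var available, capacity, used, inodes, inodesFree, inodesUsed uint64

	// The literal from the diff, reformatted with one field per line.
	vs := VolumeStats{
		Name: "example-volume",
		FsStats: FsStats{
			Time:           time.Now(), // in the PR this comes from metric.Time
			AvailableBytes: &available,
			CapacityBytes:  &capacity,
			UsedBytes:      &used,
			Inodes:         &inodes,
			InodesFree:     &inodesFree,
			InodesUsed:     &inodesUsed,
		},
	}
	fmt.Println(vs.Name, vs.FsStats.Time)
}
```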
Only one nit. /lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. The following people have approved this PR: dashpole, vishh. Needs approval from an approver in each of these OWNERS files.
We suggest the following people:
@saad-ali @matchstick I understand the metrics code under
@k8s-bot kops aws e2e test this

@k8s-bot cvm gce e2e test this

Automatic merge from submit-queue (batch tested with PRs 42369, 42375, 42397, 42435, 42455)

@dashpole did this PR help with the eviction test flakiness?
Yes, I believe it did. Look at the flaky testgrid (https://k8s-testgrid.appspot.com/google-node#kubelet-flaky-gce-e2e&width=20). Ignore the section that is almost entirely red from March 4th to 7th; during that period tests were being run in parallel, which broke everything. Compare March 2-3 (before this change) with March 8-9 (after): the failure rate went from 56% to 38%. I took a quick look at a failure yesterday, and it came down to using stats collected 7 seconds before use, which still showed pressure and caused an extra pod to be evicted. I think this change helped, though.
Awesome! Like we discussed offline, now we need to bring down the cost of collecting disk usage metrics and then collect more often than every minute.
Continuation of #33121. Credit for most of this goes to @sjenning. I added volume fs timestamps.
Why is this a bug?
This PR attempts to fix part of #31362, in which multiple pods get evicted unnecessarily whenever the node runs into resource pressure. It reduces the chance of such disruptions by not reacting to old/stale metrics.
Without this PR, kubernetes nodes under resource pressure cause unnecessary disruptions to user workloads.
This PR will also help deflake a node e2e test suite.
The eviction manager already avoids evicting pods when metrics are old. However, filesystem stats carried no timestamps, so their staleness could not be detected, and this caused many extra evictions.
See the inode eviction test flakes for examples.
This should probably be treated as a bugfix, as it should help mitigate extra evictions.
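As a rough illustration of the mechanism described above (not the actual eviction manager code; the threshold, names, and types here are hypothetical stand-ins), the idea is to compare a stat's timestamp against the current time and ignore the stat if it is too old:

```go
// A minimal sketch of a staleness check, assuming a hypothetical
// maxStatsAge threshold; the real eviction manager logic is more involved.
package main

import (
	"fmt"
	"time"
)

const maxStatsAge = 10 * time.Second // hypothetical freshness threshold

// FsStats is a simplified stand-in carrying only the new timestamp.
type FsStats struct {
	Time           time.Time
	AvailableBytes uint64
}

// isFresh reports whether stats are recent enough to drive an eviction
// decision; stale stats are ignored rather than acted on.
func isFresh(s FsStats, now time.Time) bool {
	return now.Sub(s.Time) <= maxStatsAge
}

func main() {
	stale := FsStats{Time: time.Now().Add(-30 * time.Second)}
	if !isFresh(stale, time.Now()) {
		fmt.Println("stats are stale; skipping eviction decision")
	}
}
```

Before this PR, filesystem stats had no `Time` field at all, so no check of this kind was possible for them.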
cc: @kubernetes/sig-storage-pr-reviews @kubernetes/sig-node-pr-reviews @vishh @derekwaynecarr @sjenning