Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Fix]: Avoid evicting more pods than necessary by adding Timestamps for fsstats and ignoring stale stats #42435

Merged
merged 2 commits into from
Mar 4, 2017

Conversation

dashpole
Copy link
Contributor

@dashpole dashpole commented Mar 2, 2017

Continuation of #33121. Credit for most of this goes to @sjenning. I added volume fs timestamps.

why is this a bug
This PR attempts to fix part of #31362 which results in multiple pods getting evicted unnecessarily whenever the node runs into resource pressure. This PR reduces the chances of such disruptions by avoiding reacting to old/stale metrics.
Without this PR, kubernetes nodes under resource pressure will cause unnecessary disruptions to user workloads.
This PR will also help deflake a node e2e test suite.

The eviction manager currently avoids evicting pods if metrics are old. However, timestamp data is not available for filesystem data, and this causes lots of extra evictions.
See the inode eviction test flakes for examples.
This should probably be treated as a bugfix, as it should help mitigate extra evictions.

cc: @kubernetes/sig-storage-pr-reviews @kubernetes/sig-node-pr-reviews @vishh @derekwaynecarr @sjenning

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 2, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Mar 2, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@dashpole
Copy link
Contributor Author

dashpole commented Mar 2, 2017

@k8s-bot bazel test this

@dashpole
Copy link
Contributor Author

dashpole commented Mar 2, 2017

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Mar 2, 2017
@vishh vishh assigned vishh and unassigned pmorie Mar 2, 2017
@@ -120,7 +120,7 @@ func (s *volumeStatCalculator) parsePodVolumeStats(podName string, metric *volum
inodesUsed := uint64(metric.InodesUsed.Value())
return stats.VolumeStats{
Name: podName,
FsStats: stats.FsStats{AvailableBytes: &available, CapacityBytes: &capacity, UsedBytes: &used,
Inodes: &inodes, InodesFree: &inodesFree, InodesUsed: &inodesUsed},
FsStats: stats.FsStats{Time: metric.Time, AvailableBytes: &available, CapacityBytes: &capacity,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: One field per line.

@vishh
Copy link
Contributor

vishh commented Mar 3, 2017

Only one nit.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: dashpole, vishh

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @matchstick @timstclair
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@vishh vishh added the kind/bug Categorizes issue or PR as related to a bug. label Mar 3, 2017
@vishh vishh added this to the v1.6 milestone Mar 3, 2017
@vishh vishh changed the title Continuation: Eviction: Timestamps for fsstats [Bug Fix]: Avoid evicting more pods than necessary by adding Timestamps for fsstats and ignoring stale stats Mar 3, 2017
@vishh
Copy link
Contributor

vishh commented Mar 3, 2017

@saad-ali @matchstick I understand the metrics code under /pkg/volume well since I orginally reviewed it. Hence I'm approving this PR to get it through the merge queue soon and deflake node e2es. I hope I don't cause much inconvenience my manually approving this PR

@vishh vishh added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2017
@dashpole
Copy link
Contributor Author

dashpole commented Mar 3, 2017

@k8s-bot kops aws e2e test this
@k8s-bot cvm gce e2e test this

@smarterclayton
Copy link
Contributor

@k8s-bot kops aws e2e test this

@smarterclayton
Copy link
Contributor

@k8s-bot cvm gce e2e test this

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 42369, 42375, 42397, 42435, 42455)

@k8s-github-robot k8s-github-robot merged commit f9ccee7 into kubernetes:master Mar 4, 2017
@vishh
Copy link
Contributor

vishh commented Mar 9, 2017

@dashpole did this PR help with the eviction test flakiness?

@dashpole dashpole deleted the timestamps_for_fsstats branch March 9, 2017 21:15
@dashpole
Copy link
Contributor Author

dashpole commented Mar 9, 2017

Yes, I believe it did. Look at the flaky testgrid. Ignore the section that is almost entirely red March 4th-7th, since during that period tests were being run in parallel, which broke everything. Compare March 2-3 (before this change) with March 8-9 (after). Failure rate over those periods went from 56% to 38%. I took a quick look at a failure yesterday, and it was simply using stats that were collected 7 seconds before use, which still showed pressure, and caused the extra pod to be evicted. I think this change helped though

@vishh
Copy link
Contributor

vishh commented Mar 9, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants