Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of the new v2 garbage collector #8621

Merged
merged 44 commits into from
Jul 21, 2020

Conversation

whaught
Copy link
Contributor

@whaught whaught commented Jul 14, 2020

Issue #8208

Proposed Changes

  • New GC algorithm
    • Takes into account a maximum limit of revisions
    • Allows time-based GC to be disabled

Release Note

Implemented new garbage collector that allows for either time-based
or min/max count bounds for automatic deletion of old revisions.

alpha note added

update checksum

fix comments

checksum and validation

Update config/core/configmaps/gc.yaml

Co-authored-by: Victor Agababov <vagababov@gmail.com>

fix comments

fix unit test

fix both check

include unit test

make new settings for v2

v2 gc settings

fix defaults

default values

fix unit test. expand comment

fix validation error

Widen collector test timeouts and some cleansing. (knative#8607)

Make sorting consistent in GC test. (knative#8606)

One of the tests ("keep recent lastPinned") failed fairly often because it used two revisions with the same LastPinned timestamp. Sorting a slice is unstable so in a lot of occasions, the given order was flipped and revision 5555 was tried to be deleted, even though it shouldn't even have been considered to be deleted.

Note: sort.SliceStable doesn't help because the incoming list's order is already non-consistent.

[master] Auto-update dependencies (knative#8611)

Produced via:
  `./hack/update-deps.sh --upgrade && ./hack/update-codegen.sh`
/assign dprotaso vagababov
/cc dprotaso vagababov

Add route consistency retry to new logging E2E test. (knative#8324)

Handle InitialScale != 1 when creating multiScaler and some tidy-ups (knative#8612)

Now InitialScale has landed, take it into account when calculating initial EBC in multiscaler.

Also:

 - Tidy up some comments.
 - Fix up some test error messages.
 - Move mutexes above the things they're guarding.
 - Use %d and %0.3f for printing numbers, consistently with existing Debug
   statement at start of same method.

Enable HA by default (knative#8602)

* Enable HA by default

* Centralize the bucket/replica information for running the HA testing.

Add a test to ensure the count of the controller's reconciler stays in sync with the actual controller binary.

Remove endpoints from KPA as well. (knative#8616)

* Remove endpoints from KPA as well.

Today I noticed that I missed the endpoints code in the KPA itself, when removing that informer.
So now really get rid of those.

The tests are of course nightmarish :)

* remove old cruft

Assorted fixes to enable chaos duck. (knative#8619)

* Assorted fixes to enable chaos duck.

This lands a handful of fixes that I uncovered preparing to run controlplane chaos testing during our e2e tests.

* Drop sleep to 10s
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 14, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 14, 2020
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 14, 2020
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whaught: 0 warnings.

In response to this:

Fixes #

Proposed Changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Jul 14, 2020
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2020
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 15, 2020
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2020
pkg/reconciler/gc/v2/gc.go Outdated Show resolved Hide resolved
pkg/reconciler/gc/v2/gc.go Outdated Show resolved Hide resolved
continue
}

if err := client.ServingV1().Revisions(rev.Namespace).Delete(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could use deletecollection 😢

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement

That would be neat for batch delete (in practice this isn't likely to actually delete more than one since it runs on every deploy), but I think you'd need to do a set selector on metadata.name.

Are set selectors supported for field selectors (rather than labels)? There's no mention of it here: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we can set a label for things we want to collect and collect them, but probably not worth it.

pkg/reconciler/gc/v2/gc.go Outdated Show resolved Hide resolved
pkg/reconciler/gc/v2/gc.go Outdated Show resolved Hide resolved
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2020
Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2020
}

if stale > min {
logger.Info("Deleting stale revision: ", rev.ObjectMeta.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should tweak the name of MinNonActiveRevisions and MaxNonActiveRevisions config options for more clarity

serving/pkg/gc/config.go

Lines 58 to 63 in d86ab4a

// Minimum number of stale revisions to keep before considering for GC.
MinNonActiveRevisions int64
// Maximum number of non-active revisions to keep before considering for GC.
// regardless of creation or staleness time-bounds.
// Set Disabled (-1) to disable/ignore max.
MaxNonActiveRevisions int64

Since depending on how the revisions are processes you'd get different gc results

For example (with min 20 & max 1000):
A list composed of 20 stale revisions and then 100 non-active would cause all non-active revisions to be deleted
compared to
A list composed of 100 non-active revisions and then 20 stale revisions would delete nothing

Unsure what we want

Copy link
Contributor Author

@whaught whaught Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've documented that better here:

# ---------------------------------------

and need to carry that across to the golang comment. Note that all stale revisions are also non-active. In your example I think we'd expect 20 to be deleted and the remaining to stick around until they also go stale. However if you disable the time bounds for staleness, we keep growing the revision list to the max. That allows you to either have a hard-count of how many we keep around or a less deterministic time-based approach that can still have a ceiling. Or disable everything!

Copy link
Contributor Author

@whaught whaught Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood what you were saying. The former scenario would be unexpected because the revisions are first sorted by their last active time, but we do want the latter scenario to delete the 20 stale revisions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the code to consider the latter scenario regardless of the sort assumption. I think it's a little more readable too.

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2020
@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/gc/v2/gc.go 97.6% 98.2% 0.7

@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-integration-tests 2020-07-21 07:43:00.364 +0000 UTC 1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests

@dprotaso
Copy link
Member

/lgtm
/approve

I think there's an improvement to be made to reduce the memory allocations but doesn't need to block this PR

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, whaught

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@knative-prow-robot knative-prow-robot merged commit c1a8eab into knative:master Jul 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants