
Fix configsync for static -> dynamic -> static #54

Merged
2 commits merged into kubernetes-sigs:master on Dec 8, 2020

Conversation

@alvaroaleman (Member) commented Nov 23, 2020

First commit: Make config sync a controller

  • We need to retry on errors
  • We need to react to changes in resources, specifically Add/Delete, or Updates where a resource gets set to the Tombstone state

Second commit: Fix config sync when a config is changed from static -> dynamic -> static
This currently causes an IsAlreadyExists error, because the calculation for static resource creation ignores all resources that have a corresponding dynamic config. This blocks any further progress.
-> Fixed by including all resources that have a static config in the static resource add/delete calculation
-> This made the dynamic resource reconciliation instantly delete them. Fixed by making the dynamic resource reconciliation ignore all resources that have a static config

Note: While I added a unit test case for this specific problem, I also verified the change by deploying Boskos and changing its config (for the same resource) from static -> dynamic -> static -> dynamic -> static, in order to make sure that reconciliation and triggering work correctly and that there are no errors.
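For illustration, a minimal sketch of the partition the fix describes; the helper name and the map-based configs are hypothetical, not the actual diff:

// Hypothetical sketch: resources with a static config are always part of the
// static add/delete calculation, and the dynamic reconciliation only considers
// resources that have no static config.
func partitionResources(names []string, hasStaticConfig, hasDynamicConfig map[string]bool) (static, dynamic []string) {
	for _, name := range names {
		switch {
		case hasStaticConfig[name]:
			// Counted for static add/delete even if a dynamic config still exists.
			static = append(static, name)
		case hasDynamicConfig[name]:
			// Only resources without a static config are left to the dynamic
			// reconciliation, so it no longer deletes resources that just became
			// static again.
			dynamic = append(dynamic, name)
		}
	}
	return static, dynamic
}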

/assign @ixdy

@k8s-ci-robot added the cncf-cla: yes and sig/testing labels Nov 23, 2020
@k8s-ci-robot added the approved and size/L labels Nov 23, 2020
@@ -91,12 +91,6 @@ func NewRanch(config string, s *Storage, ttl time.Duration) (*Ranch, error) {
requestMgr: NewRequestManager(ttl),
now: time.Now,
}
if config != "" {
if err := newRanch.SyncConfig(config); err != nil {
Member Author:
This races with the cache startup and will fail if the cache hasn't been started in time

panic(fmt.Sprintf("BUG: expected *crds.ResourceObject, got %T", e.ObjectNew))
}

return resource.Status.State == common.Tombstone
Member Author:
This is needed because boskos is the entity that deletes Tombstoned resources:

if r.Status.State == common.Tombstone {

The process is:

  • Boskos sets to ToBeDeleted
  • Cleaner cleans and sets to Tombstone
  • Boskos deletes

To me it seems the cleaner should just delete them. The only additional thing we do here is check that they have no owner, but they will never have an owner, because the cleaner always cleans first and then releases:

c.recycleOne(res)
if err := c.client.ReleaseOne(res.Name, common.Tombstone); err != nil {
logrus.WithError(err).Errorf("failed to release %s as %s", res.Name, common.Tombstone)

Are there more reasons or can we (in a follow-up) make the cleaner just delete the resources?

Contributor:

cc @sebastienvas

This seems reasonable to me. There may have been some reason this was separated that got removed in subsequent refactorings.
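For context, a sketch of an update predicate along the lines of the diff above, using controller-runtime's predicate.Funcs; the variable name, package placement, and exact wiring are assumptions rather than this PR's literal code:

// Sketch only; assumes the boskos common/crds packages and controller-runtime.
package main

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	"sigs.k8s.io/boskos/common"
	"sigs.k8s.io/boskos/crds"
)

// Only trigger a config sync when a resource ends up in Tombstone state,
// because boskos is the entity that deletes Tombstoned resources.
var tombstoneUpdate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		resource, ok := e.ObjectNew.(*crds.ResourceObject)
		if !ok {
			return false
		}
		return resource.Status.State == common.Tombstone
	},
}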

@alvaroaleman alvaroaleman changed the title Fix configsync for static -> dyanmic -> static Fix configsync for static -> dynanic -> static Nov 23, 2020
@alvaroaleman alvaroaleman changed the title Fix configsync for static -> dynanic -> static Fix configsync for static -> dynamic -> static Nov 23, 2020
@ixdy (Contributor) left a comment:
Thanks for this! Generally LG, just a few questions/comments.

cmd/boskos/boskos.go (resolved)

ranch/ranch.go (resolved)
cmd/boskos/boskos.go (resolved)
if err := s.client.Delete(s.ctx, o); err != nil {
return err
}
if err := wait.Poll(100*time.Millisecond, 5*time.Second, func() (bool, error) {
Contributor:
Why do we need to do this? And where did the value of 5s come from?

Member Author:
The client is cache-backed, which means it takes a moment for our own changes to become visible to us. This is important here, because otherwise we would, a moment later, fetch the DRLC, not see this change, exclude the now-static resources from the count of resources created for this DRLC, conclude that we have too few, create new ones, and then a moment later attempt to delete all of them.

Member Author:
The 100ms and 5s are relatively arbitrary; under normal circumstances the cache lags between 200ms and 2s behind.
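A sketch of the delete-then-wait pattern being discussed, assuming a cache-backed controller-runtime client and that the deleted object is a crds.ResourceObject; the helper name and the exact poll condition are assumptions, not the PR's literal code:

// Hypothetical helper illustrating the pattern from the diff above.
package ranch

import (
	"context"
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/boskos/crds"
)

func deleteAndWaitForCache(ctx context.Context, c ctrlruntimeclient.Client, o *crds.ResourceObject) error {
	if err := c.Delete(ctx, o); err != nil {
		return err
	}
	// The client reads from a cache that lags the API server, typically by a few
	// hundred milliseconds. Poll until our own delete is visible so the next
	// reconciliation doesn't act on stale data and re-create the resource.
	return wait.Poll(100*time.Millisecond, 5*time.Second, func() (bool, error) {
		err := c.Get(ctx, ctrlruntimeclient.ObjectKey{Namespace: o.Namespace, Name: o.Name}, &crds.ResourceObject{})
		if kerrors.IsNotFound(err) {
			return true, nil
		}
		return false, err
	})
}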

ranch/storage.go (outdated, resolved)
ranch/storage.go (outdated):
// Mark for deletion of all associated dynamic resources.
existingDRLC.Spec.MinCount = 0
existingDRLC.Spec.MaxCount = 0
dRLCToUpdate = append(dRLCToUpdate, existingDRLC)
if err := s.client.Update(s.ctx, existingDRLC.DeepCopyObject()); err != nil {
Contributor:
Why is it necessary to do this update in the loop?

(Partially I'm wondering why we treat this as a special update, rather than just passing the DRLC to deleteDynamicResourceLifecycles, though I guess we don't have the list of static resources here. Hm.)

I guess it feels like this code is a bit messy and needs some cleaning up, though that does predate this PR. Thoughts?

Member Author:
It needs to be in the loop because only for this particular change do we have to make sure it is back in the cache before proceeding.

Member Author:
I guess it feels like this code is a bit messy and needs some cleaning up, though that does predate this PR. Thoughts?

Certainly. For example, the approach here of constructing a slice and then passing it to another func that serially updates all of its elements is needlessly complex and adds a couple of allocations; updating directly here would be simpler. But this change already has a roughly 400-line diff, so I didn't want to make that worse.

ranch/storage.go (outdated, resolved)
Currently we reconcile the config when the file is updated, and reconcile the dynamic part periodically. This doesn't catch all cases: for example, if non-dynamic resources need to be deleted but are currently in use, that will only happen when the config file gets changed. Additionally, it's slow, as it happens only periodically.

This change moves that into a simplistic controller that reacts to config file, resource object, and dynamic resource object changes.
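For illustration, a sketch of how such a controller could be wired with controller-runtime as of late 2020; the DRLC type name, the configChanges channel, and the reconciler are assumptions, not necessarily what this PR does:

// Hypothetical wiring sketch, not the PR's literal code.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	"sigs.k8s.io/boskos/crds"
)

func setupConfigSync(mgr ctrl.Manager, configChanges <-chan event.GenericEvent, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		// React to resource object changes (e.g. a resource reaching Tombstone).
		For(&crds.ResourceObject{}).
		// React to dynamic resource lifecycle changes.
		Watches(&source.Kind{Type: &crds.DRLCObject{}}, &handler.EnqueueRequestForObject{}).
		// React to config file changes: a file watcher elsewhere pushes a
		// synthetic event (carrying a dummy object) into configChanges.
		Watches(&source.Channel{Source: configChanges}, &handler.EnqueueRequestForObject{}).
		Complete(r)
}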
cmd/boskos/boskos.go (resolved)
cmd/boskos/boskos.go (outdated, resolved)
ranch/storage.go (outdated):
errs = append(errs, err)
continue
}
// Make sure we have this change in our cache
Contributor:
Should we just put this logic in UpdateDynamicResourceLifeCycle and reduce some of the code duplication? (I guess figuring out the right condition to check is tricky...)

Couldn't we run into similar issues where an update to a DRLC (e.g. changing MinCount/MaxCount to something nonzero) results in us creating and then immediately deleting a dynamic resource?

I guess it might increase the latency for syncs, but hopefully not by much?

Member Author:
Done. (We now have this logic twice, once in syncDynamicResourceLifeCycles and once in UpdateAllDynamicResources. We should probably unexport the latter and remove those bits there, as they are redundant now.)

Contributor:
Can you replace the duplicated logic in this function now with a call to UpdateDynamicResourceLifeCycle? I.e. replace the s.client.Update call (after setting MinCount and MaxCount to 0) with UpdateDynamicResourceLifeCycle(&dRLCToDelelete[idx])?

(Also, I just noticed the typo on Delete in dRLCToDelelete.)

Member Author:
Done and fixed the typo

@alvaroaleman (Member, Author):

@ixdy any chance you can have another look?

@ixdy (Contributor) left a comment:
Apologies for the long review latency on this one. LGTM!

@k8s-ci-robot added the lgtm label Dec 8, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, ixdy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit a8d2ffc into kubernetes-sigs:master Dec 8, 2020
wking added a commit to wking/openshift-release that referenced this pull request Dec 10, 2020
…e, and GCP by region""

This reverts commit 8a22fc4, openshift#12842.

Boskos' config reloading on dynamic -> static pivots has been fixed by
kubernetes-sigs/boskos@3834f37d8a (Config sync: Avoid deadlock when
static -> dynamic -> static, 2020-12-03, kubernetes-sigs/boskos#54),
so we can take another run at static leases for these platforms.  Not
a clean re-revert, because 4705f26 (core-services/prow/02_config:
Drop GCP Boskos leases to 80, 2020-12-02, openshift#14032) landed in the
meantime, but it was easy to update from 120 to 80 here.