Skip to content

Conversation

@shawn-hurley
Copy link
Member

  • no longer hang ansible execution because we are unable to create a
    watch
  • exposing an error message with hint on potential cause of error

Description of the change:

Motivation for the change:

@openshift-ci-robot
Copy link

@shawn-hurley: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1701041: Adding timeout on dependent watch establishment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 3, 2019
@shawn-hurley shawn-hurley requested a review from jmrodri July 3, 2019 17:57
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 3, 2019
@shawn-hurley
Copy link
Member Author

@djzager can you take a look as well

Copy link
Contributor

@djzager djzager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I wish I could wrap my head around why the watch would hang.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 5, 2019
Shawn Hurley added 2 commits July 8, 2019 10:48
* no longer hang ansible execution because we are unable to create a
watch
* exposing an error message with hint on potential cause of error
@openshift-ci-robot
Copy link

@shawn-hurley: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-ansible 3a85c37 link /test e2e-aws-ansible
ci/prow/images 3a85c37 link /test images
ci/prow/unit 3a85c37 link /test unit
ci/prow/sanity 3a85c37 link /test sanity
ci/prow/e2e-aws-helm 3a85c37 link /test e2e-aws-helm
ci/prow/e2e-aws-go 3a85c37 link /test e2e-aws-go
ci/prow/e2e-aws-subcommand 3a85c37 link /test e2e-aws-subcommand

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fabianvf
Copy link
Member

@shawn-hurley can you rebase this or do you need someone else to pick it up?

@camilamacedo86 camilamacedo86 added language/ansible Issue is related to an Ansible operator project kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 7, 2019
Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shawn-hurley,

Could you please rebase it with the master?
Also, could you please add the CHANGELOG or we are able to move forward with?

The CI shows the error: (which shows related to)

pkg/ansible/proxy/cache_response.go:231:20: undefined: cacheEscacheEstablishmentTimeout
pkg/ansible/proxy/cache_response.go:255:20: undefined: cacheEstacacheEstablishmentTimeout

)

// This is the default timeout to wait for the cache to respond
// TODO: Eventually this should be configurable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: Eventually this should be configurable
// todo(shawn-hurley): Eventually this should be configurable

Comment on lines +241 to +258
errChan := make(chan error, 1)
go func() {
err := c.informerCache.Get(context.Background(), obj, un)
errChan <- err
}()

select {
case watchErr := <-errChan:
if watchErr != nil {
// break here in case resource doesn't exist in cache but exists on APIserver
// This is very unlikely but provides user with expected 404
log.Info(fmt.Sprintf("cache miss: %v err-%v", k, watchErr))
return nil, watchErr
}
case <-time.After(cacheEstacacheEstablishmentTimeout):
return nil, fmt.Errorf("timeout establishing watch, commonly permissions of the controller are not sufficent")
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using a channel, a goroutine, and a timer, a more idiomatic approach would be to use context.WithTimeout(context.Background(), cacheEstablishmentTimeout)

And then check the error from the Get() call to see if it is a context.DeadlineExceeded error.

Ditto for the informerCache.List() call above.

func addWatch(c controller.Controller, s source.Source, eh handler.EventHandler, predicates ...predicate.Predicate) error {
errChan := make(chan error, 1)
go func() {
err := c.Watch(s, eh, predicates...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should work upstream to get a controller.WatchWithContext() function so that we don't need to resort to this.

The problem here is that it is not possible to cancel the c.Watch() call, so if we reach the deadline, the c.Watch call will just stay running in the background forever. This is effectively a resource leak.

@camilamacedo86
Copy link
Contributor

Closing this one in favour of #2264
See that the patch was carrying on in order to give credits to the author.

camilamacedo86 added a commit that referenced this pull request Apr 22, 2020
**Description**
Add timeout in the watch feature for Ansible based-operators proxy to avoid appears that the reconcile is stuck and hang when the operator has not the correct permissions to List and Watch the resources. 

**Motivation for the change:**

- #1638
- https://bugzilla.redhat.com/show_bug.cgi?id=1701041

**Note**
Also, solved by kubernetes-sigs/controller-runtime#663.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. kind/bug Categorizes issue or PR as related to a bug. language/ansible Issue is related to an Ansible operator project needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants