Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync cache with timeout, fix apbroute start #3763

Merged
merged 4 commits into from
Jul 13, 2023

Conversation

npinaeva
Copy link
Member

@npinaeva npinaeva commented Jul 12, 2023

WaitForCacheSync function takes stopChannel as an argument, and it will
retry syncing cache until it is closed. If we just use ovn-kube stop
channel, it will never return an error (it will never return until
either the cache is synced, or stopChannel is closed).
To fix it, GetChildStopChanWithTimeout and some wrappers functions
passing this channel with timeout were added. The default cache sync
timeout is 20 seconds (with reflector exponential backoff it means 4
list() retries).
If any informer failed to list its resource withing a given time period,
return error to make the problem visible.

To test this change, comment out list permission for any object in
dist/templates/ovn-setup.yaml.j2, run kind cluster and check the
corresponding container fails with message like
"failed to start ovnkube controller: error in syncing cache for
*v1.EndpointSlice informer"

Don't add event handlers for existing informers in constructor.
Do so only if Run() is called, otherwise workqueues will be filled,
but not handled.

I joined these 2 fixes together, because they both add an error to the Run method of the apbroute controllers

routePolicyFactory

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
retry syncing cache until it is closed. If we just use ovn-kube stop
channel, it will never return an error (it will never return until
either the cache is synced, or stopChannel is closed).
To fix it, GetChildStopChanWithTimeout and some wrappers functions
passing this channel with timeout were added. The default cache sync
timeout is 20 seconds (with reflector exponential backoff it means 4
list() retries).
If any informer failed to list its resource withing a given time period,
return error to make the problem visible.

To test this change, comment out list permission for any object in
dist/templates/ovn-setup.yaml.j2, run kind cluster and check the
corresponding container fails with message like
"failed to start ovnkube controller: error in syncing cache for
*v1.EndpointSlice informer"

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
Otherwise, sync logic will do nothing if the object is already deleted,
but add event wasn't handled yet.

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
Do so only if Run() is called, otherwise workqueues will be filled,
but not handled.

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
@npinaeva npinaeva marked this pull request as ready for review July 12, 2023 17:30
@npinaeva npinaeva mentioned this pull request Jul 13, 2023
Copy link
Contributor

@trozet trozet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @npinaeva

@trozet trozet merged commit a1d9da1 into ovn-org:master Jul 13, 2023
25 of 26 checks passed
@npinaeva npinaeva deleted the sync-cache-with-timeout branch July 14, 2023 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants