fix(kuma-cp): make store changes processing more reliable #6728

lukidzi · 2023-05-10T17:13:30Z

Problem

When running a deployment with Postgres, we use EventBus, which is responsible for sending database events to the mesh-insight-resyncer component. If there is a problem with the connection to the database, the mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart we subscribed to the event bus again, but we never removed the old subscription. As a result, we kept sending events to a subscriber that no longer existed. Because nothing was listening on that subscription, the send blocked, the component hung, and events were not propagated to the other subscribers.

Solution

When subscribing to the EventBus, we now generate a UUID that serves as the key for the subscription. Each component uses defer to unsubscribe from the EventBus before it shuts down.

Changes:

Introduce a resilient component for the mux server in Global
Introduce a map of subscribers in EventBus so that subscribers can be removed
EventBus now requires an id to subscribe
Add a new Unsubscribe method that removes a subscription from EventBus
Propagate the stop channel to pq_listener so that its goroutine is stopped
Change the Postgres plugin's Error() method to a channel of errors and add a reaction for Events
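
The subscription bookkeeping listed above could look roughly like the sketch below. It is not the code from this PR: `Event`, `NewEventBus`, `Subscribe`, `Unsubscribe`, `Send`, `Len`, and `runComponent` are hypothetical names and signatures chosen for illustration, and the subscription key is a plain string where the PR generates a UUID.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical payload type used only in this sketch.
type Event string

// EventBus keeps subscribers in a map keyed by ID so that each one
// can be removed again when its component shuts down.
type EventBus struct {
	mu          sync.RWMutex
	subscribers map[string]chan Event
}

func NewEventBus() *EventBus {
	return &EventBus{subscribers: map[string]chan Event{}}
}

// Subscribe registers a buffered channel under id and returns it.
func (b *EventBus) Subscribe(id string) <-chan Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Event, 10)
	b.subscribers[id] = ch
	return ch
}

// Unsubscribe removes the subscription so Send no longer targets it.
func (b *EventBus) Unsubscribe(id string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.subscribers, id)
}

// Send delivers the event to every current subscriber with a blocking send.
func (b *EventBus) Send(e Event) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subscribers {
		ch <- e
	}
}

// Len reports the number of active subscriptions.
func (b *EventBus) Len() int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return len(b.subscribers)
}

// runComponent mimics a restartable component: it subscribes on start
// and relies on defer to unsubscribe on every exit path.
func runComponent(bus *EventBus, id string, stop <-chan struct{}) {
	events := bus.Subscribe(id)
	defer bus.Unsubscribe(id)
	for {
		select {
		case <-events:
			// Handle the event here.
		case <-stop:
			return
		}
	}
}

func main() {
	bus := NewEventBus()
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		runComponent(bus, "resyncer-1", stop)
		close(done)
	}()
	close(stop)
	<-done
	fmt.Println("subscriptions after shutdown:", bus.Len())
}
```

Because the unsubscribe is deferred, a restarted component leaves no stale entry behind, so a later `Send` only targets channels that still have a reader.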

Link to relevant issue as well as docs and UI issues --
This will not break child repos: it doesn't hardcode values (e.g. "kumahq" as an image registry) and it will work on Windows, system specific functions like syscall.Mkfifo have equivalent implementation on the other OS --
Tests (Unit test, E2E tests, manual test on universal and k8s) --
Do you need to update UPGRADE.md? --
Does it need to be backported according to the backporting policy? -- probably needs to be backported
Do you need to explicitly set a > Changelog: entry here or add a ci/ label to run fewer/more tests?

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

michaelbeaumont · 2023-05-10T18:18:40Z

Isn't the biggest change here that we no longer block on event sending? I think we need to make sure that all components have a timeout for polling the state of whatever they're listening for since event delivery isn't guaranteed.

pkg/events/eventbus.go

lukidzi · 2023-05-11T08:00:34Z

> Isn't the biggest change here that we no longer block on event sending? I think we need to make sure that all components have a timeout for polling the state of whatever they're listening for since event delivery isn't guaranteed.

Yes, we are no longer blocking on sending. You are right, this might be the biggest change: not blocking would fix the hang, but we would still try to send events to non-existent subscribers. I will change the name of the PR. As for the timeout, I feel we could add it, but I'm not sure it is critical now. I can create a task to fix it. WDYT?

lobkovilya · 2023-05-11T08:57:08Z

> Yes, we are no longer blocking on sending. You are right, this might be the biggest change: not blocking would fix the hang, but we would still try to send events to non-existent subscribers. I will change the name of the PR. As for the timeout, I feel we could add it, but I'm not sure it is critical now. I can create a task to fix it. WDYT?

Hmm, I think @michaelbeaumont is right. Now, if the client is busy and its channel is full, eventBus is going to drop the event. Maybe we should go with only close in this PR and keep event sending blocking (or add some reasonable timeout on sending).
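
The dropping behaviour described here can be seen in a small sketch. A non-blocking send built on `select` with a `default` branch never hangs, but it discards the event when the subscriber's buffer is full. `trySend` is a hypothetical helper written for this illustration, not a function from the PR.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether the event
// was delivered. When the buffer is full, the event is dropped.
func trySend(ch chan<- string, event string) bool {
	select {
	case ch <- event:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan string, 1) // subscriber with room for one event

	fmt.Println(trySend(ch, "first"))  // true: the buffer has room
	fmt.Println(trySend(ch, "second")) // false: the buffer is full, so the event is dropped
}
```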

pkg/events/eventbus.go

pkg/insights/resyncer.go

pkg/plugins/common/postgres/listener.go

pkg/plugins/common/postgres/pq_listener.go

lukidzi · 2023-05-11T11:10:13Z

> > Yes, we are no longer blocking on sending. You are right, this might be the biggest change: not blocking would fix the hang, but we would still try to send events to non-existent subscribers. I will change the name of the PR. As for the timeout, I feel we could add it, but I'm not sure it is critical now. I can create a task to fix it. WDYT?
>
> Hmm, I think @michaelbeaumont is right. Now, if the client is busy and its channel is full, eventBus is going to drop the event. Maybe we should go with only close in this PR and keep event sending blocking (or add some reasonable timeout on sending).

I've added a timeout and the configuration for it. I am not sure how long it should take to process an event. I've set it to 2 seconds, but we can change it.
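
For readers following along, a blocking send with a timeout can be sketched as below. `sendWithTimeout` is a hypothetical helper, and the 50 ms value is only there to keep the example fast; the comment above mentions 2 seconds as the configured default at this point in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithTimeout blocks until the event is accepted or the timeout
// elapses, and reports whether the event was delivered.
func sendWithTimeout(ch chan<- string, event string, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case ch <- event:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	ch := make(chan string, 1) // nobody reads from this channel

	fmt.Println(sendWithTimeout(ch, "first", 50*time.Millisecond))  // delivered into the buffer
	fmt.Println(sendWithTimeout(ch, "second", 50*time.Millisecond)) // gives up after the timeout
}
```

Unlike the non-blocking variant, a slow subscriber gets some time to catch up, but a stuck one still delays every other subscriber by up to the timeout per event.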

michaelbeaumont · 2023-05-12T14:30:46Z

IMO the timeout on send blocking isn't a great solution. There should instead be, in each listening component, a timeout after which everything is reconciled. Is that possible?

Otherwise, can we maybe just properly unsubscribe as a solution? The blocking is still a problem, but maybe it should be solved properly separately, if just unsubscribing is enough.
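
The per-component reconcile timeout suggested above could be sketched as a loop that selects on both the event channel and a ticker. This is an assumption about the shape of such a loop, not code from Kuma: `listen` and its `reconcile` callback are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// listen reacts to events as they arrive and, independently of them,
// triggers a reconcile every interval so that a lost event is
// eventually compensated for.
func listen(events <-chan string, stop <-chan struct{}, interval time.Duration, reconcile func(reason string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case e := <-events:
			reconcile("event: " + e)
		case <-ticker.C:
			reconcile("periodic")
		case <-stop:
			return
		}
	}
}

func main() {
	events := make(chan string) // no event is ever sent in this example
	stop := make(chan struct{})
	reasons := make(chan string, 1)

	go listen(events, stop, 10*time.Millisecond, func(reason string) {
		select {
		case reasons <- reason:
		default: // keep only the first reason for the demo
		}
	})

	// The only possible trigger here is the ticker.
	fmt.Println(<-reasons)
	close(stop)
}
```

With this shape, event delivery becomes an optimization for latency rather than a correctness requirement.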

lukidzi · 2023-05-12T14:42:55Z

> IMO the timeout on send blocking isn't a great solution. There should instead be, in each listening component, a timeout after which everything is reconciled. Is that possible?
>
> Otherwise, can we maybe just properly unsubscribe as a solution? The blocking is still a problem, but maybe it should be solved properly separately, if just unsubscribing is enough.

I think for now we should be fine with unsubscribing. Also, each channel has a queue of 10 elements, so maybe the timeout isn't required and we can queue the requests.

pkg/config/core/resources/store/config.go

pkg/insights/test/test_event_reader.go

michaelbeaumont

Not sure what to call the PR but it's out of date now.

Maybe title it with whatever the user visible change is?

"fix(kuma-cp): make zone syncing more reliable" or something?

michaelbeaumont

LGTM!

github-actions · 2023-05-15T15:30:09Z

backporting to release-2.1 with action

backporting to release-1.8 with action
backporting to release-1.7 with action

When running a deployment with postgres/etcd, we use EventBus, which is responsible for sending database events to the mesh-insight-resyncer component. If there is a problem with the connection to the database, the mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart we subscribed to the event bus again, but we never removed the old subscription. As a result, we kept sending events to a subscriber that no longer existed. Because nothing was listening on that subscription, the send blocked, the component hung, and events were not propagated to the other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

lukidzi added 3 commits May 10, 2023 18:30

fix(kuma-cp): remove subscription when component shutdown

3c328bc

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

fix(kuma-cp): add resilent component for mux global server sync in kds

ab46bd8

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

fix(kuma-cp): fix test method

85650b6

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

lukidzi requested review from a team, michaelbeaumont, jakubdyszkiewicz and lobkovilya and removed request for a team May 10, 2023 17:13

michaelbeaumont reviewed May 10, 2023

pkg/events/eventbus.go (outdated, resolved)

fix(kuma-cp): change the unsubcribe to close and refactor

185949c

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

lukidzi changed the title ~~fix(kuma-cp): unsubscribe from events when components restart~~ fix(kuma-cp): don't block and event sending in EventBus May 11, 2023

lobkovilya reviewed May 11, 2023

michaelbeaumont changed the title ~~fix(kuma-cp): don't block and event sending in EventBus~~ fix(kuma-cp): don't block on event sending in EventBus May 11, 2023

lukidzi added 2 commits May 11, 2023 12:17

fix(kuma-cp): changed to blocking sending and added timeout on the operation

24f2e5d

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

fix(kuma-cp): core review changes

d55b531

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

fix(kuma-cp): remove timeout, unsubscribing should be enough for now

60af553

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

michaelbeaumont approved these changes May 15, 2023

pkg/config/core/resources/store/config.go (outdated, resolved)

pkg/insights/test/test_event_reader.go (resolved)

michaelbeaumont approved these changes May 15, 2023

michaelbeaumont requested changes May 15, 2023

fix(kuma-cp): code review changes

e829b7b

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

lukidzi changed the title ~~fix(kuma-cp): don't block on event sending in EventBus~~ fix(kuma-cp): make store changes processing more reliable May 15, 2023

michaelbeaumont approved these changes May 15, 2023

lukidzi merged commit c0953ae into kumahq:master May 15, 2023
5 checks passed

lukidzi added the backport label May 15, 2023

This was referenced May 15, 2023

fix(kuma-cp): make store changes processing more reliable (backport of #6728) #6763

Merged

fix(kuma-cp): make store changes processing more reliable (backport of #6728) #6764

Merged

kumahq bot mentioned this pull request May 15, 2023

fix(kuma-cp): make store changes processing more reliable (backport of #6728) #6765

Merged

This was referenced May 15, 2023

fix(kuma-cp): make store changes processing more reliable (backport of #6728) #6766

Merged

fix(kuma-cp): make store changes processing more reliable (backport of #6728) #6767

Merged
