Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kuma-cp): make store changes processing more reliable #6728

Merged
merged 8 commits into from May 15, 2023

Conversation

lukidzi
Copy link
Contributor

@lukidzi lukidzi commented May 10, 2023

Problem

When running deployment with Postgres we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That cause the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing on it, the component was hanging and events were not propagated to other subscribers.

Solution

When subscribing to the EventBus we are generating an UUID that is a key for a subscription. Before components shut down, we are calling defer to unsubscribe from EventBus.

Changes:

  • introduce resilient component for mux server in Global
  • introduce a map of subscribers to EventBus to be able to unsubscribe
  • EventBus requires id to subscribe
  • Added a new method Unsubscribe that removed a subscription from EventBus
  • Added propagation of stop channel to pq_listener, to stop the goroutine
  • Changed Postgres plugin Error() method to a channel of errors and added reaction for Events
  • Link to relevant issue as well as docs and UI issues --
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as a image registry) and it will work on Windows, system specific functions like syscall.Mkfifo have equivalent implementation on the other OS --
  • Tests (Unit test, E2E tests, manual test on universal and k8s) --
  • Do you need to update UPGRADE.md? --
  • Does it need to be backported according to the backporting policy? -- probably needs to be backported
  • Do you need to explicitly set a > Changelog: entry here or add a ci/ label to run fewer/more tests?

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi lukidzi requested review from a team, michaelbeaumont, jakubdyszkiewicz and lobkovilya and removed request for a team May 10, 2023 17:13
@michaelbeaumont
Copy link
Contributor

Isn't the biggest change here that we no longer block on event sending? I think we need to make sure that all components have a timeout for polling the state of whatever they're listening for since event delivery isn't guaranteed.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi
Copy link
Contributor Author

lukidzi commented May 11, 2023

Isn't the biggest change here that we no longer block on event sending? I think we need to make sure that all components have a timeout for polling the state of whatever they're listening for since event delivery isn't guaranteed.

Yes, we are not blocking anymore on sending. You are right this might be the biggest change because thanks to part of not blocking we would fix but we would try to send it to not existing subscribers. I will change the name of the PR. If it's going about timeout I feel like we could add but not sure if that is critical now. I can create a task to fix it. WDYT?

@lukidzi lukidzi changed the title fix(kuma-cp): unsubscribe from events when components restart fix(kuma-cp): don't block and event sending in EventBus May 11, 2023
@lobkovilya
Copy link
Contributor

Yes, we are not blocking anymore on sending. You are right this might be the biggest change because thanks to part of not blocking we would fix but we would try to send it to not existing subscribers. I will change the name of the PR. If it's going about timeout I feel like we could add but not sure if that is critical now. I can create a task to fix it. WDYT?

Hmm, I think @michaelbeaumont is right. Now if the client is busy and its channel is full then eventBus is going to drop the event. Maybe we should go only with close in this PR and keep event-sending blocking (or add some reasonable timeout on sending).

pkg/events/eventbus.go Outdated Show resolved Hide resolved
pkg/insights/resyncer.go Outdated Show resolved Hide resolved
pkg/plugins/common/postgres/listener.go Outdated Show resolved Hide resolved
pkg/plugins/common/postgres/pq_listener.go Outdated Show resolved Hide resolved
pkg/plugins/common/postgres/pq_listener.go Outdated Show resolved Hide resolved
pkg/plugins/common/postgres/pq_listener.go Outdated Show resolved Hide resolved
@michaelbeaumont michaelbeaumont changed the title fix(kuma-cp): don't block and event sending in EventBus fix(kuma-cp): don't block on event sending in EventBus May 11, 2023
…eration

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi
Copy link
Contributor Author

lukidzi commented May 11, 2023

Yes, we are not blocking anymore on sending. You are right this might be the biggest change because thanks to part of not blocking we would fix but we would try to send it to not existing subscribers. I will change the name of the PR. If it's going about timeout I feel like we could add but not sure if that is critical now. I can create a task to fix it. WDYT?

Hmm, I think @michaelbeaumont is right. Now if the client is busy and its channel is full then eventBus is going to drop the event. Maybe we should go only with close in this PR and keep event-sending blocking (or add some reasonable timeout on sending).

I've added timeout and the configuration. I am not sure how long it should the to process an event. I've set it to 2 seconds but we can change it.

@michaelbeaumont
Copy link
Contributor

IMO the timeout on send blocking isn't a great solution. There should instead be, in each listening component, a timeout after which everything is reconciled. Is that possible?

Otherwise, can we maybe just properly unsubscribe as a solution? The blocking is still a problem, but maybe it should be solved properly separately, if just unsubscribing is enough.

@lukidzi
Copy link
Contributor Author

lukidzi commented May 12, 2023

IMO the timeout on send blocking isn't a great solution. There should instead be, in each listening component, a timeout after which everything is reconciled. Is that possible?

Otherwise, can we maybe just properly unsubscribe as a solution? The blocking is still a problem, but maybe it should be solved properly separately, if just unsubscribing is enough.

I think for now we should be fine with unsubscribing. Also, each channel has a queue of 10 elements so maybe it's not required and we can queue the requests.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
pkg/config/core/resources/store/config.go Outdated Show resolved Hide resolved
pkg/insights/test/test_event_reader.go Show resolved Hide resolved
michaelbeaumont

This comment was marked as duplicate.

Copy link
Contributor

@michaelbeaumont michaelbeaumont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what to call the PR but it's out of date now.

Maybe title it with whatever the user visible change is?

"fix(kuma-cp): make zone syncing more reliable" or something?

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi lukidzi changed the title fix(kuma-cp): don't block on event sending in EventBus fix(kuma-cp): make store changes processing more reliable May 15, 2023
Copy link
Contributor

@michaelbeaumont michaelbeaumont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@lukidzi lukidzi merged commit c0953ae into kumahq:master May 15, 2023
5 checks passed
@github-actions
Copy link
Contributor

github-actions bot commented May 15, 2023

backporting to release-2.1 with action

backporting to release-1.8 with action
backporting to release-1.7 with action

kumahq bot pushed a commit that referenced this pull request May 15, 2023
When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 15, 2023
When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 15, 2023
When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 15, 2023
When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 15, 2023
When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 16, 2023
…#6728) (#6765)

fix(kuma-cp): make store changes processing more reliable (#6728)

When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Łukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 16, 2023
…#6728) (#6767)

* fix(kuma-cp): make store changes processing more reliable (#6728)

When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Łukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 16, 2023
…#6728) (#6763)

* fix(kuma-cp): make store changes processing more reliable (#6728)

When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Łukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 16, 2023
…#6728) (#6764)

* fix(kuma-cp): make store changes processing more reliable (#6728)

When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Łukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 16, 2023
…#6728) (#6766)

* fix(kuma-cp): make store changes processing more reliable (#6728)

When running deployment with postgres/etcd we are using EventBus which is responsible for sending database events to mesh-insight-resyncer component. In case of a problem with the connection to the database, mesh-insight-resyncer component is closed and restarted by ResilientComponent. On each restart, we are subscribing to the event bus, but we are not removing the old subscription. That caused the issue in which we were sending events to the subscriber that didn't exist at that time. Because we are sending it and no one is listing it, the component was hanging and events were not propagated to other subscribers.

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Łukasz Dziedziak <lukidzi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants