JAMES-3142 eventsourcing for group unregistration #3280

chibenwa · 2020-04-09T08:35:11Z

No description provided.

mbaechler · 2020-04-09T08:59:35Z

mailbox/event/event-rabbitmq/pom.xml

+        </dependency>
+        <dependency>
+            <groupId>${james.groupId}</groupId>
+            <artifactId>event-sourcing-event-store-memory</artifactId>


ordering issue

mbaechler · 2020-04-09T09:01:21Z

...mq/src/main/java/org/apache/james/mailbox/events/eventsourcing/GroupUnregistringManager.java

+
+import reactor.core.publisher.Mono;
+
+public class GroupUnregistringManager {


what about GroupRegistry?

...n/java/org/apache/james/mailbox/events/eventsourcing/RegisteredGroupListenerChangeEvent.java

...q/src/main/java/org/apache/james/mailbox/events/eventsourcing/RegisteredGroupsAggregate.java

...event-rabbitmq/src/main/java/org/apache/james/mailbox/events/eventsourcing/StartCommand.java

mbaechler · 2020-04-09T09:08:21Z

...mq/src/main/java/org/apache/james/mailbox/events/eventsourcing/GroupUnregistringManager.java

+
+    public GroupUnregistringManager(EventStore eventStore, UnregisterRemovedGroupsSubscriber.Unregisterer unregisterer) {
+        this.eventSourcingSystem = EventSourcingSystem.fromJava(ImmutableSet.of(new StartCommandHandler(eventStore)),
+            ImmutableSet.of(new UnregisterRemovedGroupsSubscriber(unregisterer)),


there's no delivery guarantee for event handlers, what will happen if we miss one event?

there's no delivery guarantee for event handlers, what will happen if we miss one event?

The error scenario here is a server stop, as our eventsourcing-eventbus is in-memory.

I guess that in this very unlikely case a manual admin intervention to click the "unbind" button would be acceptable. (I don't see how we could do better.)

The error scenario here is a server stop, as our eventsourcing-eventbus is in-memory.

You should not try to describe all error cases because:

* you'll always miss some

* the system this code depends on will change over time

Either we can enforce transactionality or we can't.

In this case, as we are using handlers, we are not in a transaction and we will miss an event sooner or later.

I can't think of a better solution than a handler but it means we have to figure out what happens in this failure case.

I guess that in this very unlikely case a manual admin intervention to click the "unbind" button would be acceptable. (I don't see how we could do better.)

How would he know he needs to do that? (Sorry, I tried to find a solution but couldn't in a decent timeframe.)

We could have an aggregate handling three states for a group:

* used
* used but still bound
* not used and unbound

Then if the unRegisterSubscriber succeeds, it can fire an UnbindSucceededCommand on the aggregate.

That way we might attempt an unbind operation several times, without a manual admin operation.
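
To make the proposal above concrete, here is a minimal sketch of such a state machine. It is not the James implementation: `GroupState`, `GroupLifecycle` and all method names are hypothetical, groups are plain strings, and the intermediate state is read as "no longer used but still bound", which is an assumption about the intent of the comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical states for one group (names are illustrative, not taken from James).
enum GroupState { USED, UNUSED_BUT_BOUND, UNUSED_AND_UNBOUND }

class GroupLifecycle {
    private final Map<String, GroupState> states = new HashMap<>();

    // Declares the groups required right now and returns every group
    // that still needs an unbind attempt (including earlier failed attempts).
    Set<String> requireGroups(Set<String> required) {
        for (String group : required) {
            states.put(group, GroupState.USED);
        }
        for (Map.Entry<String, GroupState> entry : states.entrySet()) {
            if (!required.contains(entry.getKey()) && entry.getValue() == GroupState.USED) {
                entry.setValue(GroupState.UNUSED_BUT_BOUND);
            }
        }
        return pendingUnbinds();
    }

    // Groups that are no longer used but whose unbind has not been confirmed.
    Set<String> pendingUnbinds() {
        Set<String> pending = new TreeSet<>();
        for (Map.Entry<String, GroupState> entry : states.entrySet()) {
            if (entry.getValue() == GroupState.UNUSED_BUT_BOUND) {
                pending.add(entry.getKey());
            }
        }
        return pending;
    }

    // Plays the role of the "unbind succeeded" command sent by the subscriber.
    void markUnbindSucceeded(String group) {
        if (states.get(group) != GroupState.UNUSED_BUT_BOUND) {
            throw new IllegalArgumentException("group is not awaiting an unbind: " + group);
        }
        states.put(group, GroupState.UNUSED_AND_UNBOUND);
    }
}
```

In this sketch a group whose unbind was never confirmed stays pending, so the next call to `requireGroups` returns it again and the unbind can be retried without admin action.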

chibenwa · 2020-04-09T15:52:45Z

[90d9a956f0644df5bd85a18fd7613f333b30b1de] [ERROR] Failed to execute goal on project james-server-task-api: Could not resolve dependencies for project org.apache.james:james-server-task-api:jar:3.6.0-SNAPSHOT: Failed to collect dependencies at com.google.guava:guava:jar:25.1-jre -> org.codehaus.mojo:animal-sniffer-annotations:jar:1.14: Failed to read artifact descriptor for org.codehaus.mojo:animal-sniffer-annotations:jar:1.14: Could not transfer artifact org.codehaus.mojo:animal-sniffer-parent:pom:1.14 from/to central (https://repo.maven.apache.org/maven2): /root/.m2/repository/org/codehaus/mojo/animal-sniffer-parent/1.14/animal-sniffer-parent-1.14.pom.part (No such file or directory) -> [Help 1]

test this please

chibenwa · 2020-04-10T02:21:04Z

test this please

Arsnael · 2020-04-14T04:26:41Z

...rc/test/java/org/apache/james/mailbox/events/eventsourcing/GroupUnregistringManagerTest.java

+        assertThat(unregisterer.unregisteredGroups())
+            .containsExactly(GROUP_A, GROUP_B);
+    }
+}


What about this test:

testee.start(ImmutableSet.of(GROUP_A, GROUP_B)).block();
testee.start(ImmutableSet.of(GROUP_A)).block();
testee.start(ImmutableSet.of(GROUP_A, GROUP_B)).block();

???

GroupB would be unregistered once, and re-added in the aggregate.

I don't see the value of this test.
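
For readers following the scenario above, this is a rough, self-contained model of it. It is not the code under review: `GroupsChanged` and `RegisteredGroupsModel` are hypothetical names, groups are plain strings, and the aggregate state is simply rebuilt by replaying an in-memory list of change events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical change event: the groups added and removed by one start() call.
final class GroupsChanged {
    final Set<String> added;
    final Set<String> removed;

    GroupsChanged(Set<String> added, Set<String> removed) {
        this.added = added;
        this.removed = removed;
    }
}

class RegisteredGroupsModel {
    private final List<GroupsChanged> history = new ArrayList<>();

    // Rebuilds the currently registered groups by replaying the whole history.
    Set<String> currentGroups() {
        Set<String> groups = new TreeSet<>();
        for (GroupsChanged event : history) {
            groups.addAll(event.added);
            groups.removeAll(event.removed);
        }
        return groups;
    }

    // Records the difference with the previous state and returns the groups to unregister.
    Set<String> start(Set<String> required) {
        Set<String> current = currentGroups();
        Set<String> added = new TreeSet<>(required);
        added.removeAll(current);
        Set<String> removed = new TreeSet<>(current);
        removed.removeAll(required);
        if (!added.isEmpty() || !removed.isEmpty()) {
            history.add(new GroupsChanged(added, removed));
        }
        return removed;
    }
}
```

Under this model, the second `start` reports the dropped group for unregistration, and the third one records it as added again without unregistering anything.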

mbaechler

How are we going to detect a missing unbind?
What if we had a repeated task (I guess we don't have that feature in Task Manager yet) that lists queues and sends them to our aggregate to check whether any action is required?

...q/src/main/java/org/apache/james/mailbox/events/eventsourcing/RegisteredGroupsAggregate.java

chibenwa · 2020-04-14T09:55:17Z

How are we going to detect a missing unbind?

Logging the error.

Also that will eventually be corrected so what is the point of doing better?

What if we had a repeated task (I guess we don't have that feature in Task Manager yet) that lists queues and sends them to our aggregate to check whether any action is required?

Looks like a design involving a distributed scheduler. Not sure that is more reasonable than the proposed solution.

Hence, for the time being, the two realistic ways of triggering the aggregate are:

* Upon start
* Using a webadmin endpoint triggered by an external cron.

In any case we would have to prevent triggering the aggregate while groups are partially registered.

...mq/src/main/java/org/apache/james/mailbox/events/eventsourcing/GroupUnregistringManager.java

Arsnael · 2020-04-15T02:21:29Z

...q/src/main/java/org/apache/james/mailbox/events/eventsourcing/RegisteredGroupsAggregate.java

@@ -129,6 +125,9 @@ private RegisteredGroupsAggregate(History history) {
    }

    public List<UnbindSucceededEvent> handle(MarkUnbindAsSucceededCommand command) {
+        Preconditions.checkArgument(state.bindedGroups().contains(command.getSucceededGroup()),
+            "unbing a non binded group, or a used group");


s/unbing/unbind

mbaechler · 2020-04-15T09:05:56Z

How are we going to detect a missing unbind?

Logging the error.

It's not "detection", it's "sending a message in a bottle" (:

Also that will eventually be corrected so what is the point of doing better?

That's the question: how?

What if we had a repeated task (I guess we don't have that feature in Task Manager yet) that lists queues and sends them to our aggregate to check whether any action is required?

Looks like a design involving a distributed scheduler. Not sure that is more reasonable than the proposed solution.

Hence, for the time being, the two realistic ways of triggering the aggregate are:

* Upon start
* Using a webadmin endpoint triggered by an external cron.

Are we going to require James admins to set up a cron for every job we want to be scheduled?

chibenwa · 2020-04-15T09:11:34Z

How are we going to detect a missing unbind?

Logging the error.

It's not "detection", it's "sending a message in a bottle" (:

That's a productive remark.

I don't have better ideas. I bet this can always come as future enhancements.

We could for example think about a healthcheck loading the aggregate to ensure all the groups are unbound, and have a webadmin endpoint for triggering the aggregate without requiring a restart.

These are very valuable overall design remarks (thanks); however, this has nothing to do with how the event sourcing stuff works.

In short: let's discuss system design that doesn't have code impact in other channels.

chibenwa · 2020-04-15T09:15:47Z

Are we going to require James admins to set up a cron for every job we want to be scheduled?

Again, the way you call the aggregate is orthogonal to its inner workings.

You are discussing things unrelated to my work here. I would have loved to have such a discussion in a grooming meeting.

So yes, let's have design discussions, let's improve our system design in a progressive fashion.

Also, a distributed scheduler is, as far as I understand, a hundred-man-day feature. Linking a bugfix to such a feature is the best way to never get it fixed.

chibenwa · 2020-04-15T09:16:14Z

test this please

mbaechler · 2020-04-15T10:40:48Z

I understand your frustration very well: you wrote some nice code handling an immediate issue and you are not happy "losing time" before having it merged and useful.

I'm not against merging the first try at the feature as it will fix part of the problem and make people happier.

However, I can't be blamed for not having anticipated all the design decisions during the grooming. And I think you can agree that the way we solve the problem here is a partial solution because in some cases, a queue won't be unbound (and the goal of this feature is to make sure it doesn't happen).

So we can move the discussion elsewhere if you think it's better suited, but let's not fool ourselves: this issue won't be "done" thanks to this PR.

chibenwa · 2020-04-15T10:50:34Z

So we can move the discussion elsewhere if you think it's better suited

The JIRA ticket is the right place.

but let's not fool ourselves: this issue won't be "done" thanks to this PR.

That's far from over: we need to initialize groups "at once" anyway

mbaechler · 2020-04-16T08:21:44Z

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

The problem that's left is: the projection (in this case the group listener queues) can diverge from the source of truth.

We can easily compute the expected list of group listeners that should exist, so we can recompute the projection from the aggregate history.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

Do we agree?
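
As an illustration of the divergence check discussed above, here is a minimal sketch. It assumes the expected groups have already been computed from the aggregate history and that the caller has some way to list the queues actually bound on the broker; `ProjectionCheck` and its method names are hypothetical and not part of James.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper comparing the source of truth with the actual projection.
class ProjectionCheck {
    // Queues that exist on the broker but are not expected: candidates for unbinding.
    static Set<String> staleQueues(Set<String> expectedGroups, Set<String> actualQueues) {
        Set<String> stale = new TreeSet<>(actualQueues);
        stale.removeAll(expectedGroups);
        return stale;
    }

    // The projection has diverged as soon as one unexpected queue is still bound.
    static boolean hasDiverged(Set<String> expectedGroups, Set<String> actualQueues) {
        return !staleQueues(expectedGroups, actualQueues).isEmpty();
    }
}
```

Such a check only answers the "how do we detect it" half of the question; when it runs (at start, from a healthcheck, or from an endpoint) is a separate decision.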

chibenwa · 2020-04-16T08:57:08Z

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

Without that you do not get eventual consistency, as the aggregate can't know whether the unbinding is done or not.

That is the clever way I found for resolving your concern expressed here: #3280 (comment)

We can easily compute the expected list of group listeners that should exist, so we can recompute the projection from the aggregate history.

Adapting the aggregate should be easy then. I would agree with this even if we end up with a pattern-matching strategy to retrieve such a list.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

A/ Upon start.

B/ Or via a healthcheck.

C/ Or via a webadmin endpoint

D/ Or via a cron task within the TaskManager

D/ -> not realistic.

A/ is the moment the list of groups changes. It handles the happy scenario, which is far better than today's scenario where nothing is done. Since a reboot is needed, it's not really admin-friendly upon error, yet it still allows an admin to fix it.

B/ would be a good diagnostic tool.

C/ alone is not nice: can we really trust the admin to remember that he should call it after unconfiguring a group?

So I believe A + B (diagnostic only) + C is the realistic middle-term solution.

However, the definition of done of our task, the one that is acceptable from an admin standpoint, is A.

chibenwa · 2020-04-16T10:06:31Z

See #3303

mbaechler · 2020-04-16T10:19:02Z

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

Without that you do not get eventual consistency, as the aggregate can't know whether the unbinding is done or not.

With it you don't have it either (the command can fail). Thus I challenge the cost of adding the command with respect to the result we get.

That is the clever way I found for resolving your concern expressed here: #3280 (comment)

We can easily compute the expected list of group listeners that should exist, so we can recompute the projection from the aggregate history.

Adapting the aggregate should be easy then. I would agree with this even if we end up with a pattern-matching strategy to retrieve such a list.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

A/ Upon start.

B/ Or via a healthcheck.

C/ Or via a webadmin endpoint

D/ Or via a cron task within the TaskManager

D/ -> not realistic.

A/ is the moment the list of groups changes. It handles the happy scenario, which is far better than today's scenario where nothing is done. Since a reboot is needed, it's not really admin-friendly upon error, yet it still allows an admin to fix it.

It doesn't handle the detection part but could fix it, yes.

B/ would be a good diagnostic tool.

Why not. BTW it happens that we implemented a scheduler in Healthcheck to log statuses. It looks ok as a detection mechanism.

C/ alone is not nice: can we really trust the admin to remember that he should call it after unconfiguring a group?

Not ok for detection but could be a good way to fix problems.

So I believe A + B (diagnostic only) + C is the realistic middle-term solution.

We agree

However, the definition of done of our task, the one that is acceptable from an admin standpoint, is A.

Definitely not my opinion.

chibenwa · 2020-04-16T10:24:52Z

However, the definition of done of our task, the one that is acceptable from an admin standpoint, is A.

Definitely not my opinion.

Then please express your opinion.

I'm tired of guessing it.

Again, that doesn't mean it can't be done in the near future, that it can't be handled within this sprint's content, or that it can't fit in a future sprint.

So we implement A/ now. And plan work on B and C as we think it is needed.

Would you agree with this?

mbaechler · 2020-04-17T07:29:14Z

Then please express your opinion.

My opinion:

* the issue doesn't have a DoD
* if we add a DoD, I don't think it can be "unbind the listener queue most of the time" because it can't be verified by QA properly
* we must merge the work done for A, it fixes 99% of cases
* we can't consider the issue as resolved without B and C

mbaechler · 2020-04-17T07:29:48Z

So we implement A/ now. And plan work on B and C as we think it is needed.

What about now?

chibenwa · 2020-04-17T07:46:04Z

What about now?

We do A.

We create tickets for B & C, and will add them to the following sprints.

Expressing such concerns at grooming time would have been welcome.

chibenwa · 2020-04-17T07:46:33Z

The linagora issue has a DOD, not the apache one.

JAMES-3142 eventsourcing for group unregistration

90d9a95

chibenwa added this to the Sprint 17 - Robusta Beans milestone Apr 9, 2020

chibenwa self-assigned this Apr 9, 2020

mbaechler reviewed Apr 9, 2020

chibenwa added 5 commits April 10, 2020 16:47

JAMES-3142 Add timestamp & hostname to the event for diagnostic purposes

4358c31

fixup! JAMES-3142 eventsourcing for group unregistration

eb3988b

JAMES-3142 s/StartCommand/RequireGroupsCommand/

1166120

JAMES-3142 Eventual consistence for group unbinding

1aabf53

fixup! JAMES-3142 Eventual consistence for group unbinding

9ad44cd

Arsnael reviewed Apr 14, 2020

chibenwa added the cross-review needed label Apr 14, 2020

Arsnael approved these changes Apr 14, 2020

mbaechler reviewed Apr 14, 2020

fixup! fixup! JAMES-3142 Eventual consistence for group unbinding

cbc47c1

Arsnael approved these changes Apr 15, 2020

fixup! fixup! JAMES-3142 Eventual consistence for group unbinding

7837e68

mbaechler removed the cross-review needed label Apr 16, 2020

chibenwa mentioned this pull request Apr 16, 2020

JAMES-3142 Rely on event sourcing to correct current group registration #3303

Closed

chibenwa closed this Apr 16, 2020

chibenwa modified the milestones: Sprint 17 - Robusta Beans, Sprint 18 Apr 17, 2020
