Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JAMES-3142 eventsourcing for group unregistration #3280

Closed
wants to merge 8 commits into from

Conversation

chibenwa
Copy link
Member

@chibenwa chibenwa commented Apr 9, 2020

No description provided.

@chibenwa chibenwa added this to the Sprint 17 - Robusta Beans milestone Apr 9, 2020
@chibenwa chibenwa self-assigned this Apr 9, 2020
</dependency>
<dependency>
<groupId>${james.groupId}</groupId>
<artifactId>event-sourcing-event-store-memory</artifactId>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordering issue


import reactor.core.publisher.Mono;

public class GroupUnregistringManager {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about GroupRegistry?


public GroupUnregistringManager(EventStore eventStore, UnregisterRemovedGroupsSubscriber.Unregisterer unregisterer) {
this.eventSourcingSystem = EventSourcingSystem.fromJava(ImmutableSet.of(new StartCommandHandler(eventStore)),
ImmutableSet.of(new UnregisterRemovedGroupsSubscriber(unregisterer)),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no delivery guarantee for event handlers, what will happen if we miss one event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no delivery guarantee for event handlers, what will happen if we miss one event?

The error scenario here is a server stop as our eventsourcing-eventbus is in memory based.

I guess in this very unlikely case a manual admin intervention to cliclk the "unbind" button would be acceptable. (I don't see how we could do better)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error scenario here is a server stop as our eventsourcing-eventbus is in memory based.

You should not try to describe all errors cases because:

  1. you'll always miss some
  2. the system this code depends on will change over time

Either we can enforce transactionality or we can't.

In this case, as we are using handlers, we are not in a transaction and we will miss an event sooner or later.

I can't think of a better solution than a handler but it means we have to figure out what happens in this failure case.

I guess in this very unlikely case a manual admin intervention to cliclk the "unbind" button would be acceptable. (I don't see how we could do better)

How he would know he need to do that? (sorry, I tried to find a solution but can't in a decent timeframe)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an aggregate handling 3 state for a group:

  • used
  • used but still binded
  • not used and unbinded

Then if the unRegisterSubscriber succeeds, it can fire a UnbindSucceededCommand on the aggregate.

That way we might attempt several time an unbind operation, without manual admin operation.

@chibenwa
Copy link
Member Author

chibenwa commented Apr 9, 2020

[90d9a956f0644df5bd85a18fd7613f333b30b1de] [ERROR] Failed to execute goal on project james-server-task-api: Could not resolve dependencies for project org.apache.james:james-server-task-api:jar:3.6.0-SNAPSHOT: Failed to collect dependencies at com.google.guava:guava:jar:25.1-jre -> org.codehaus.mojo:animal-sniffer-annotations:jar:1.14: Failed to read artifact descriptor for org.codehaus.mojo:animal-sniffer-annotations:jar:1.14: Could not transfer artifact org.codehaus.mojo:animal-sniffer-parent:pom:1.14 from/to central (https://repo.maven.apache.org/maven2): /root/.m2/repository/org/codehaus/mojo/animal-sniffer-parent/1.14/animal-sniffer-parent-1.14.pom.part (No such file or directory) -> [Help 1]

test this please

@chibenwa
Copy link
Member Author

test this please

assertThat(unregisterer.unregisteredGroups())
.containsExactly(GROUP_A, GROUP_B);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the test :

testee.start(ImmutableSet.of(GROUP_A, GROUP_B)).block();
testee.start(ImmutableSet.of(GROUP_A)).block();
testee.start(ImmutableSet.of(GROUP_A, GROUP_B)).block();

???

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GroupB would be unregistered once, and re-added in the aggregate.

I don't see the value of this test.

Copy link

@mbaechler mbaechler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to detect a missing unbind?
What if we have a repeated task (I guess we don't have that feature in Task Manager yet) that list queues and send them to our aggregate to check if any action is required?

@chibenwa
Copy link
Member Author

How are we going to detect a missing unbind?

Logging the error.

Also that will eventually be corrected so what is the point of doing better?

What if we have a repeated task (I guess we don't have that feature in Task Manager yet) that list queues and send them to our aggregate to check if any action is required?

Looks like a design involving a distributed scheduler. Not sure that is more reasonable than the proposed solution.

Hence, for the time being, the two realistic ways of triggering the aggregate is either:

  • Upon start
  • Using a webadmin endpoint triggered by an external cron.

In any case we would have to prevent triggering the aggregate while groups are partially registered.

@@ -129,6 +125,9 @@ private RegisteredGroupsAggregate(History history) {
}

public List<UnbindSucceededEvent> handle(MarkUnbindAsSucceededCommand command) {
Preconditions.checkArgument(state.bindedGroups().contains(command.getSucceededGroup()),
"unbing a non binded group, or a used group");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/unbing/unbind

@mbaechler
Copy link

How are we going to detect a missing unbind?

Logging the error.

it's not "detection" is "sending a bottle in the sea" (:

Also that will eventually be corrected so what is the point of doing better?

That's the question: how?

What if we have a repeated task (I guess we don't have that feature in Task Manager yet) that list queues and send them to our aggregate to check if any action is required?

Looks like a design involving a distributed scheduler. Not sure that is more reasonable than the proposed solution.

Hence, for the time being, the two realistic ways of triggering the aggregate is either:

* Upon start

* Using a webadmin endpoint triggered by an external cron.

Are we going to require James admin to setup a cron for every jobs we want to be scheduled?

@chibenwa
Copy link
Member Author

How are we going to detect a missing unbind?

Logging the error.

it's not "detection" is "sending a bottle in the sea" (:

That's a productive remark.

I don't have better ideas. I bet this can always come as future enhencements.

We could for example think about a healthcheck loading the aggregate to ensure all the groups are unbinded. And have a webadmin endpoint for triggering the aggregate without requiring a restart.

These are very valuable overall design remarks (thanks) however this have nothing to do with how the event sourcing stuff works.

In short: let's discuss system design that don't have code impact in other channels.

@chibenwa
Copy link
Member Author

Are we going to require James admin to setup a cron for every jobs we want to be scheduled?

Again the way you call the aggregate is orthogonal to it's inner working.

You are discussing things unrelated to my work here. I would have loved to have such discussion in a grooming meeting.

So yes, let's have design discussion, let's improve our system design in a progressive fashion.

Also, a distributed scheduler is as far as I understand a hundred man day work feature. Linking a bugfix to such a feature is the best way to nether get it fixed.

@chibenwa
Copy link
Member Author

test this please

@mbaechler
Copy link

I understand very well your frustration: you wrote some nice code handling an immediate issue and you are not happy "loosing time" before having it merged and useful.

I'm not against merging the first try at the feature as it will fix part of the problem and make people happier.

However, I can't be blame for not having anticipated all the design decision during the grooming. And I think you can agree that the way we solve the problem here is a partial solution because in some cases, a queue won't be unbound (and the goal of this feature is to make sure it doesn't happen).

So we can move the discussion elsewhere if you think it's better suited but let's not fool ourselves: this issue won't be "done" thanks to this PR.

@chibenwa
Copy link
Member Author

So we can move the discussion elsewhere if you think it's better suited

The JIRA ticket is the right place.

but let's not fool ourselves: this issue won't be "done" thanks to this PR.

That's far from over: we need to initialize groups "at once" anyway

@mbaechler
Copy link

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

The problem that's left is: the projection (in this case the group listener queues) can diverge from the source of truth.

We can compute easily the expected list of group listeners that should exist so we can recompute the projection from the aggregate history.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

Do we agree?

@chibenwa
Copy link
Member Author

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

Without that you do not get eventual consistence as the aggregate can't know if the unbinding is done or not.

That is the clever way I found for resolving your concern expressed here: #3280 (comment)

We can compute easily the expected list of group listeners that should exist so we can recompute the projection from the aggregate history.

Adapting the aggregate should be easy then. I would agree with this even if we would end up with a pattern matching strategy to retrieve such a list.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

A/ Upon start.

B/ Or via a healthcheck.

C/ Or via a webadmin end point

D/ Or via a cron task within the TaskManager

D/ -> not realistic.

A/ is the moment the list of group change. It handles the happy scenario which is far better than the nothing is done today scenario. A reboot being needed, it's not really admin friendly upon error yet it allows an admin to still fix it.

B/ would be a good diagnostic tool.

C/ alone is not nice: can we really trust the admin to really remember he should call it after unconfiguring a group?

So I believe A + B (diagnostic only) + C is the realistic midlle term solution.

However the definition of done of our task, that is acceptable from an admin standpoint is A.

@chibenwa
Copy link
Member Author

See #3303

@chibenwa chibenwa closed this Apr 16, 2020
@mbaechler
Copy link

Looking at the discussion, I think that the EventHandler doing the queue unbinding should not emit a command. It makes everything more complex and doesn't seem to solve any issue.

Without that you do not get eventual consistence as the aggregate can't know if the unbinding is done or not.

With it you don't have it neither (the command can fail). Thus I challenge the cost of adding the command wrt to result we get.

That is the clever way I found for resolving your concern expressed here: #3280 (comment)

We can compute easily the expected list of group listeners that should exist so we can recompute the projection from the aggregate history.

Adapting the aggregate should be easy then. I would agree with this even if we would end up with a pattern matching strategy to retrieve such a list.

The only question left, in my opinion, is: how do we detect it diverged and when/how do we fix it?

A/ Upon start.

B/ Or via a healthcheck.

C/ Or via a webadmin end point

D/ Or via a cron task within the TaskManager

D/ -> not realistic.

A/ is the moment the list of group change. It handles the happy scenario which is far better than the nothing is done today scenario. A reboot being needed, it's not really admin friendly upon error yet it allows an admin to still fix it.

It doesn't handle the detection part but could fix it, yes

B/ would be a good diagnostic tool.

Why not. BTW it happens that we implemented a scheduler in Healthcheck to log statuses. It looks ok as a detection mechanism.

C/ alone is not nice: can we really trust the admin to really remember he should call it after unconfiguring a group?

Not ok for detection but could be a good way to fix problems.

So I believe A + B (diagnostic only) + C is the realistic midlle term solution.

We agree

However the definition of done of our task, that is acceptable from an admin standpoint is A.

Definitely not my opinion.

@chibenwa
Copy link
Member Author

chibenwa commented Apr 16, 2020

However the definition of done of our task, that is acceptable from an admin standpoint is A.

Definitely not my opinion.

Then please express your opinion.

I'm tired of gessing it.

Again that don't mean it can't be done in a close future, that sprint content can't be handled, or that it can't fit in a future Sprint.

So we implement A/ now. And plan work on B and C as we think it is needed.

Would you agree with this?

@mbaechler
Copy link

Then please express your opinion.

My opinion:

  • the issue doesn't have a DoD
  • if we add a DoD, I don't think it can be "unbind the listener queue most of the time" because it can't be verified by QA properly
  • we must merge the work done for A, it fixes 99% of cases
  • we can't consider the issue as resolved without B and C

@mbaechler
Copy link

So we implement A/ now. And plan work on B and C as we think it is needed.

What about now?

@chibenwa
Copy link
Member Author

chibenwa commented Apr 17, 2020

What about now?

We do A.

We create tickets for B & C. And will add it in following sprints.

Expressing such concerns at grooming time would have been welcome.

@chibenwa
Copy link
Member Author

The linagora issue has a DOD, not the apache one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants