Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that event IDs are globally unique #2779

Closed
turt2live opened this issue Sep 17, 2020 · 21 comments
Closed

Clarify that event IDs are globally unique #2779

turt2live opened this issue Sep 17, 2020 · 21 comments
Labels
clarification An area where the spec could do with being more explicit client-server Client-Server API s2s Server-to-Server API (federation)

Comments

@turt2live
Copy link
Member

No description provided.

@turt2live turt2live added clarification An area where the spec could do with being more explicit s2s Server-to-Server API (federation) client-server Client-Server API labels Sep 17, 2020
@Sorunome
Copy link
Contributor

The entire C-S API seems to imply they are not globally unique, though, so it would make more sense to not make it harder to fix the issue with the S-S API

@turt2live
Copy link
Member Author

The server-server API already makes them globally unique, as does the supporting documentation. This is a clarification problem in the client-server API.

@Sorunome
Copy link
Contributor

Just wanted to point out that the general consensus was that event IDs are room-unique, when asking over and over in the spec room about this for years.

Seeing this sudden change is, soru has to admit, rather infuriating, especially when you ask this very thing before and you are keep being told "They are room-unique".

Room-uniqueness also has the advantage that we will be able to introduce future room versions where that may be utilized / needed, we have greater flexibility going forward.

@KitsuneRal
Copy link
Member

KitsuneRal commented Sep 19, 2020

Well, as much as this is disappointing, we have to admit the room-uniqueness assumption was inaccurate, as the very shape of the Federation API proves to the contrary. Too bad.
On the other hand, tbh, I don't see how room-uniqueness of event ids could be put to any use by now. The change in room version 3 to make event ids hashes was made to resolve federation issues from event ids and event hashes being different things, and that restricted possible creative ways of making event ids to a great extent. Can you conceive a use case where event ids of different room versions would collide, assuming they still remain reference hashes?

@Sorunome
Copy link
Contributor

assuming they still remain reference hashes?

This is just it - we can't know if they will still be in the future, we can't know yet what we'll do in 10+ years

This is also about not having a local state arbitrarily depend on a global state.

Even e.g. discord, which uses snowflakes, which are by design globally unique, identifies messages with GUILDID/CHANID/MESSAGEID, likely because it provides you way greater flexibility down the road.

Now, if you aren't convinced by local state arbitrarily depending on global state and by future flexibility, having events per-room unique on API level could allow for greater homeserver performance, as a homeserver would just know which event a room is in, as it is right there in the request.

@KitsuneRal
Copy link
Member

assuming they still remain reference hashes?

This is just it - we can't know if they will still be in the future, we can't know yet what we'll do in 10+ years

That's a pretty usual argument for premature optimisation, as well as premature pessimisation. As it stands now, using reference hashes to refer to events seems to stay around for the foreseeable future. Making event ids scoped to rooms imposes a limitation that event ids can only identify events within rooms - which equally may or may not be the case in 10+ years.

This is also about not having a local state arbitrarily depend on a global state.

Global uniqueness is a direct corollary of the way reference hashes are constructed. It's not arbitrary.

Even e.g. discord, which uses snowflakes, which are by design globally unique, identifies messages with GUILDID/CHANID/MESSAGEID, likely because it provides you way greater flexibility down the road.

Does discord make message ids as a function of the message metadata?

Now, if you aren't convinced by local state arbitrarily depending on global state and by future flexibility, having events per-room unique on API level could allow for greater homeserver performance, as a homeserver would just know which event a room is in, as it is right there in the request.

Do you seriously believe in better server performance from adding another lookup (for a room) on top of a lookup in a hash table?

@Sorunome
Copy link
Contributor

Does discord make message ids as a function of the message metadata?

Discords snowflakes are based on the current timestamp, worker id and counter. https://discord.com/developers/docs/reference#snowflakes

Do you seriously believe in better server performance from adding another lookup (for a room) on top of a lookup in a hash table?

instead of one overly large hashtable which outgrows its size you can break things down

@neilalexander
Copy link
Contributor

For what it’s worth, I would agree with @Sorunome that it would be far better to fix the federation API to scope event ID uniqueness to rooms rather than just to assert that they are global. We can’t guarantee that they will be globally unique. (Hell, we can’t even guarantee they’ll be unique within a room, but we have to draw the line somewhere and anything that can affect other rooms is bad.)

Given that we can’t actually make any guarantee that collisions won’t happen in the reference hashes, there are various conditions that might arise:

  • /event might return an event from a different room than the intended one and the requestor can’t do anything about that
  • Different servers might return different events via /event for the same event ID
  • /state could return different events with the same event ID for different rooms
  • /state_ids could reference an event which the requestor HS believes it already knows (but in a different room)

A server should just drop an event if it did one of those things and found the result was for the wrong room. However, if a joining server, or a server with missing history, can’t find a server that will return the correct event for the correct room, it could break at least:

  • Fetching missing events, by referring to event IDs we can’t find
  • Pagination, by hitting backward extremities we can’t find
  • Event auth, if an event comes in referring to an auth event that we can’t find

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 2, 2020

Do you seriously believe in better server performance from adding another lookup (for a room) on top of a lookup in a hash table?

A server would have to lookup the room to do anything worthwhile with the event anyways (the receiver), and yes, you might even find that it is actually faster to lookup an event when they are partitioned by room-id. This would also generally allow for better architectural decisions to be made when designing a homeserver and pave the way for more state to be distributed between nodes within one homeserver (global event ids get in the way of this) and this in turn would allow for much more modular and performant implementations.

@KitsuneRal
Copy link
Member

Everyone, sorry for somewhat long text.

@neilalexander apologies if I miss something obvious, but don't we rely on reference hashes to be unique by construction? I mean, if we end up having a hash collision then drawing any kind of line becomes irrelevant; the issues you listed arise even within one room. Room versions 1 and 2 are a separate story - they are prone to problems more fundamental than reference hash collisions, and I would discourage linking to events in these rooms in general.

@Gnuxie (and @Sorunome), here's my line of thinking - correct me if I'm missing something. Given the vast diversity in room sizes (from just a few events to millions) you will have to aggregate a variable number of room ids into one partition; in other words, partitioning by room-id would require you to have yet another lookup table that would direct you to the right partition. That lookup table will have hotter and colder buckets (because links to popular rooms will occur much more often) and you cannot balance the temperature without more information about the event than just its room.
I'm not a Matrix server-side person so my experience (with other high-scale systems of record but not Matrix) might as well mean nothing; but from my past server-side practice using business object identities (such as room ids) for partitioning has been very specific to use case, and hardwiring partitioning logic into the API design was a sure way to leave yourself with very sticky (because it's in the API) extra technical debt when this optimisation stops working. I can share more details of what I've seen, if interested. On the other hand, I'd be happy to learn more on the actual ("I/we had"), rather than speculative ("I/we could"), issues with partitioning of Matrix events.

So far I see the only practical reason to dissect the event ids space by room: to contain possible/eventual issues with the reference hash algorithm, we may have to devise a new room version with such a new way of calculating the reference hash that it might produce collisions in the current hashes space. But that basically rules out back-compatibility in a few other places including the federation; and even then this particular referencing issue can be easily circumvented by shifting the format and/or the alphabet of the hash.

So, in summary: we should change the design when we know better what to anticipate with respect to uniqueness requirements.

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

Hi @KitsuneRal

partitioning by room-id would require you to have yet another lookup table that would direct you to the right partition.

I feel like the lookup time is a weak argument because it is completely insignificant compared to state authorization and everything else the receiving homeserver has to do upon receiving a pdu (not to mention state resolution and back-filling where the s2s endpoint for an event by id is likely to be used), but I might be misguided.

table will have hotter and colder buckets

This is true but I don't see it as a problem, it could even be advantageous as a mix of hot/cold rooms would help with allocation and general workload, even if partitioning was local to one node (and even if there's still issues they could be migrated dynamically).

hardwiring partitioning logic into the API design

I'm specifically countering your points about it being a potential performance problem, I do not think potential performance optimizations are the only reason to have events local to rooms. It would be hardwiring only if we consider that the endpoints where an event can be referenced only be id client-server api were not already deprecated. I personally think that because rooms are already global and an event has no meaning outside the context of a room, it is actually extra work to consider an event-id to be global rather than the other way around (and more restricting).

@KitsuneRal
Copy link
Member

KitsuneRal commented Nov 3, 2020

partitioning by room-id would require you to have yet another lookup table that would direct you to the right partition.

I feel like the lookup time is a weak argument because it is completely insignificant compared to state authorization and everything else the receiving homeserver has to do upon receiving a pdu (not to mention state resolution and back-filling where the s2s endpoint for an event by id is likely to be used), but I might be misguided.

The discussion was around resolving an existing event, not about adding a new one; and it was about ability to partition the whole body of events by room in the hope to get some performance benefits. And it wasn't me who originally brought the argument about performance.

table will have hotter and colder buckets

This is true but I don't see it as a problem, it could even be advantageous as a mix of hot/cold rooms would help with allocation and general workload, even if partitioning was local to one node (and even if there's still issues they could be migrated dynamically).

It could, or it couldn't. As I said above, by now I'm really interested in practical experience. My practical experience tells me to avoid hot and cold storage partitions where possible because having them led to additional maintenance burden.

hardwiring partitioning logic into the API design

I'm specifically countering your points about it being a potential performance problem, I do not think potential performance optimizations are the only reason to have events local to rooms.

Again, the point about performance wasn't originally mine. Again, practical observations would be invaluable at this point. We've been handwaving here for enough long by now.

It would be hardwiring only if we consider that the endpoints where an event can be referenced only by id in client-server api were not already deprecated. I personally think that because rooms are already global and an event has no meaning outside the context of a room, it is actually extra work to consider an event-id to be global rather than the other way around (and more restricting).

There's extra work in any case, if we want to align the APIs. Either we have to change the Federation API to follow the convention of the Client-Server API, or we have to undeprecate the relevant endpoint in the Client-Server API. Guess which extra work is smaller. Also, the statement that an event has no meaning outside the context of a room is quite incorrect, since not all events need a room. That said, I think we don't have events referenced by event id outside of rooms as yet. Whether or not we're to have them is a matter of speculation, so see my previous comment - I don't see how we can reasonably anticipate something we don't know.

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

the statement that an event has no meaning outside the context of a room is quite incorrect
Guess which is more work

Sure, let me explain.

Perhaps I should be clear that this is the case for RoomEvents and PDUs, EDUs are an exception but they also cannot be requested in any federation api, they can only be received from send. I'm aware that at one point there was a schema in the specification that said an 'Event' only needed a type and id but these cannot be room events or PDUs.

It is already the case that a PDU cannot exist outside the context of a room, and there is only one API that breaks this afaik which is /event/{eventId} which again only returns PDUs (for RoomEvents).

Unless I am missing something major (which I'm sure I could be), even the s2s api already operates on the assumption that event-ids for PDUs are not global apart from the one inconsistent endpoint /event/{eventId}.

Whether or not we're to have them is a matter of speculation, so see my previous comment - I don't see how we can reasonably anticipate something we don't know.

The RoomEvent is already a concrete thing, if there was an extension later on that sent events outside of the context of rooms, they would not be persisted in the same way or use the same federation apis and I think it's very unwise to start confusing the two (and again there's only the one place where this is possible to confuse at the moment (afaik))

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

Ok, I just looked myself and EDUs don't even have an ID and the base event schema also does not need an ID https://github.com/matrix-org/matrix-doc/blob/master/event-schemas/schema/core-event-schema/event.yaml https://matrix.org/docs/spec/server_server/r0.1.4#edus

So we need to be very specific by what we mean by 'Event' and in this case, the only specified event that has an ID is a RoomEvent, which cannot exist outside the context of a room. If another Event was specified later on, that did not use Rooms but had an ID, this event and ID would be incompatible with the current federation api endpoints dealing with PDUs.

I.e. events referenced by event id outside of rooms are already impossible to express using the current s2s API, even if event IDs were considered global so I do not see that as a valid argument

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

If you want to suggest that RoomEvent ids are globally unique, then you also need to consider whether there is an 'identifiable event' that is specified one level above RoomEvent, so that all events that could use an id in future share the same identifier space, otherwise nothing stops someone creating a new event type with an event-id key that clashes with RoomEvents or any other type.

Of course, that is ignoring that this actually makes no sense to do with regards to the current RoomEvent model

So to make it completely clear, it is already the case that RoomEvents are bound to the context of a Room and the concept of an event id only exists in the context of a RoomEvent, therefor to specify event ids as global, for the purpose of indexing other types of events that are not bound to rooms, you need to do extra work (and a lot more work) than deprecating the one s2s API /event/{eventId} where the room event can be considered as 'global'. Since you would also have to create a whole set of endpoints to manage 'indexed' events or change the existing ones which are specifically for the purpose of managing room events in PDUs.

@KitsuneRal
Copy link
Member

Unless I am missing something major (which I'm sure I could be), even the s2s api already operates on the assumption that event-ids for PDUs are not global apart from the one inconsistent endpoint /event/{eventId}.

That's not quite accurate. Aside from this one endpoint, event IDs for room version 3 and above are globally-unique by construction. And event IDs for lower room versions are a poor addressing mechanism because they don't reliably identify the event (the discussion of this is as old as #418).

Further on, I'm afraid your analysis above still doesn't answer my question. Let me rephrase. As of now the only thing we really try to do is to resolve the confusion between the two APIs and event ID calculation algorithm:

  • Event ID algorithm is such that it produces globally-unique identifiers.
  • Federation API uses that fact to be able to pull individual events using just event ID (see MSC 1659 Proposal: Change Event IDs to Hashes #1659 that specifically mentions this rationale).
  • CS API happened to deprecate event retrieval outside of room context and introduced a room-scoped endpoint instead.

So what is the practical benefit of changing the direction chosen in #1659? Scoping endpoints to room just because event ids don't happen outside of room context as yet is not a practical benefit.

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

are globally-unique by construction.

It does not say in #1659 that an event id should be used in the way you are describing, it was a motion to remove the blatant subjectivity from event ids (which is a positive thing no matter what camp you're in), and that doesn't suggest anywhere that this is so that they can be used in absence of a room id. So no I find it infuriating actually that you suggest that 'the direction has been chosen' in #1659 when it is clearly not explicit and as far as I can see not even implied.

You are trying to suggest that using event ids in the absence of a room id is the current status quo, but that's not true, you also need to provide reasons why you think it is practical because #1659 hasn't and this has not been discussed. It is not the status quo and you need to stop pretending it is to evade discussion.

CS API happened to deprecate event retrieval outside of room context and introduced a room-scoped endpoint instead.

Did it really just happen or can I pull out a reference to the msc used to deprecate those endpoints to say that the specification is implicitly moving in the direction I want to see through? I find that extremely insulting.

Scoping endpoints to room just because event ids don't happen outside of room context as yet is not a practical benefit.

The truth is there are 7 endpoints scoped to the room in the s2s api, vs one that isn't, which is an inconsistency. So if anything the status quo lies in the event ids being local to rooms, not the other way around, but we should pretend that neither is the current state of affairs for completion.

@KitsuneRal
Copy link
Member

are globally-unique by construction.

It does not say in #1659 that an event id should be used in the way you are describing, it was a motion to remove the blatant subjectivity from event ids (which is a positive thing no matter what camp you're in), and that doesn't suggest anywhere that this is so that they can be used in absence of a room id.

There's the Motivation section in the very beginning; it discusses two possible ways to avoid conflicting event ids in federation traffic - changing the API and changing the way ids are generated. The new way to generate IDs obviated the need to change the API (along with other benefits, of course). Does it make sense?

You are trying to suggest that using event ids in the absence of a room id is the current status quo, but that's not true, you also need to provide reasons why you think it is practical because #1659 hasn't and this has not been discussed. It is not the status quo and you need to stop pretending it is to evade discussion.

I'm not evading the discussion. If anything, I'm trying to keep it. The status quo is that event ids are global by construction. That's the effect of the algorithm used - I'm not "pretending", it's just a corollary of using a hash function, isn't it?

And I'd appreciate if our tone remains civil. Otherwise I'll have to fold the discussion.

CS API happened to deprecate event retrieval outside of room context and introduced a room-scoped endpoint instead.

Did it really just happen or can I pull out a reference to the msc used to deprecate those endpoints to say that the specification is implicitly moving in the direction I want to see through? I find that extremely insulting.

I'm afraid you can't because the decision to deprecate those endpoints has been made before the institute of MSCs have been established. I'm not sure what is insulting in studying where the current event IDs come from, in the first place. This MSC is one of the very first things you find on the topic of event ids as hashes. Again, I'd appreciate you stay away from ad hominem attacks.

Scoping endpoints to room just because event ids don't happen outside of room context as yet is not a practical benefit.

The truth is there are 7 endpoints scoped to the room in the s2s api, vs one that isn't, which is an inconsistency. So if anything the status quo lies in the event ids being local to rooms, not the other way around, but we should pretend that neither is the current state of affairs for completion.

Ok, so your argument is API consistency. Unfortunately, as I mentioned before, there's no practical benefit in it. Also, sorry but if the scope is leaking, it's leaking. You can't have event ids local to rooms "except that one endpoint where they are actually global". That means they are global, however upset this makes you. I totally recognise you want to fix that inconsistency. The problem is - I'm yet to see any good reason to go for this change, aside from being able to proudly say that now our API is consistent.

Thanks for understanding, and I'm still willing to read some good argument from the practical side. Forget about my words about "the direction" if they upset you. They are not essential to the discussion.

@Gnuxie
Copy link
Contributor

Gnuxie commented Nov 3, 2020

There's the Motivation section in the very beginning; it discusses two possible ways to avoid conflicting event ids in federation traffic - changing the API and changing the way ids are generated. The new way to generate IDs obviated the need to change the API (along with other benefits, of course). Does it make sense?

Yes but this does not mean that an event id can be used in the absence of a room id and thus is 'global'. It does make sense to me thank you very much. It does not in any way contradict what I have already said, so can you please stop.

I'm not evading the discussion. If anything, I'm trying to keep it. The status quo is that event ids are global by construction. That's the effect of the algorithm used - I'm not "pretending", it's just a corollary of using a hash function, isn't it?

Just because an event id has been constructed so it can be calculated objectively between servers it by no means implies that it should be used in the absence of a room id. I do not understand why you insist that it does and i don't appreciate it.

Yes I have not prepared any material arguing for a practical use case for dropping the one endpoint in the s2s api, but you have not presented any practical material for why an event id should be used in absence of a room id either (since no where in the documentation does it say that it can be, it is only implied by the 1 endpoint discussed before) and you are arguing from the side of a perceived status quo and a position of power within the institution, and that is very poor frankly and I am extremely unimpressed. You should take a look at https://en.wikipedia.org/wiki/Conway's_law

It is clear to me that this is an ambiguity in the specification and it has been interpreted both ways by many parties, and so it needs to be discussed properly

@turt2live
Copy link
Member Author

ftr this conversation spilled into #matrix-spec around here: https://matrix.to/#/!NasysSDfxKxZBzJJoE:matrix.org/$6YDHvimPbwFcyH7xA9NSdSD2Qi_jW-hba_mJhLS25aU?via=matrix.org&via=amorgan.xyz&via=pixie.town

The idea as it currently stands is to write an MSC to form discussion around what a globally unique identifier looks like for events and how that relates to room IDs. There's quite a few points raised in the linked conversation that aren't covered here, so I suggest readers visit that before forming opinions, at least until the MSC makes an appearance. As such, I'm locking this issue as the MSC is expected fairly quickly (I can say this because I'm the one responsible for writing it at the moment).

@matrix-org matrix-org locked and limited conversation to collaborators Nov 3, 2020
@turt2live
Copy link
Member Author

As promised, there's now an MSC for this: #2848

Please take comments there so we can work out the problem space. Ultimately if the MSC ends up turning back into a simple spec clarification, then we'll at least have the conversation history to back up the rationale for why.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
clarification An area where the spec could do with being more explicit client-server Client-Server API s2s Server-to-Server API (federation)
Projects
None yet
Development

No branches or pull requests

5 participants