Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
MSC1884: Proposal to replace slashes in event IDs #1884
@mscbot fcp merge
calling fcp to see if anyone wants this other than me. my reasons for wanting it is that it is an unnecessary pain that almost every client developer is going to trip over and curse us for. ideally we would spec it for room ids and user ids at the same time and avoid mandating escaping in urls in general, but at least this doesn’t needlessly break clients in a way that client developers will consider avoidable.
Team member @ara4n has proposed to merge this. The next step is review by the rest of the tagged people:
Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
I really really would prefer it if we were consistent with the encoding we used for things. If we're going to go down this route I would really rather we did this across the board (or at least agreed that that is where we should aim for). Just because a given encoded value isn't currently used in a URL doesn't mean it won't be; certainly I wouldn't be surprised if some of the newer E2E APIs ended up with keys and things in the URL.
I think the worst of all worlds we could end up in is where we have a mix of encodings and still require people to URL encode their parameters. Whether that is due to existing fields getting used in new APIs, or other server implementations coming along and allowing slashes in identifiers, or whatever.
Wouldn't this effectively mean that we could never use unicode for user IDs? I know that we currently don't allow them but I thought we were wanting to support them at some stage in the future?
There is a perverse effect here that having slashes in identifiers quickly forces clients to correctly URL encode parameters, which while annoying for existing clients may actually be easier for newer client implementers. URL encoding parameters is such a standard thing to do that I don't think people will get annoyed at us if they realise almost immediately that they need to remember to do it, but it will be annoying if it works 99% but then breaks the odd time they do actually need to do so.
Effectively, I imagine this is short term pain for long term gain, in the sense that new clients will be written correctly from the start.
I just spoke this through IRL with @erikjohnston and @neilisfragile and came to realise that a lot of the debate here (e.g. whether should pander to client developers who don't encode URIs) is specious.
The whole question here boils down to selecting between two trade-offs:
Now, I think that many many more people are going to be bitten by the inconveniences of the first tradeoff than will be by the second. Particularly, the only people who may be bitten by the second are expert server and encryption developers, who are relatively uncommon, and also sophisticated enough to be able to check whether a given string is base64 or url-safe base64.
Having expressed it like this, I strongly think that we should optimise for supporting the CS API developers over the SS/E2EE developers, and adopt the following rule of thumb: all base64 presented in the CS API (ignoring E2EE) should be URL safe encoded. This starts with event IDs, and then in the #1228 future will include room ids, user ids, etc too.
Meanwhile, the E2EE API is stuck with pure base64, but typical client developers shouldn't need to go anywhere near it (especially with https://github.com/matrix-org/pantalaimon), and the SS API can use whatever mix makes most sense, and yes - it makes it a bit harder for server & encryption developers to remember which encoding to use.... but this is a small price to pay for making the CS API easier to play with.
Hopefully this accurately represents the discussion, and that @erikjohnston agreed in the end on the conclusion.
If this sounds good to other people, I suggest we indicate it by voting on entering FCP, and then fix up the MSC to reflect the rationale (and the intention in general to standardise on URL-safe base64 in the CS API), and then move on.
I think you're still missing the point of my argument, so your dichotomy is false.
My concern is not really with having two different base64 encodings. That is ugly and annoying, but I wouldn't see it as a hard blocker if I thought it bought us anything.
What does concern me is that, given you have to url-encode room aliases, room ids, user ids, state keys, device ids, transaction ids, etc, excluding event ids from that list doesn't buy you much except (1) timebombs for developers who fail to grok url-encoding and (2) the aforementioned ugliness.
Further, I think some of your arguments against regular base64 are flawed:
The logs are already full of %3As. I don't buy this.
Which columns? And loglines are already variable length?
Is it? You keep saying things like this, but I don't see any evidence of clients which correctly url-encode mxids and room aliases but not event ids. Nor do I agree that having to url-encode your path parameters constitutes a perversely developer unfriendly protocol, however if it does then, as I keep saying: changing the event ID format doesn't fix this.
I guess ultimately this isn't a mountain that I'm prepared to die on, and I accept that there are some advantages to changing the format of event IDs. I just think that you are overstating them, and I really don't think that one of those advantages is saving client developers from having to URL-encode things.
I totally agree that client developers should be url-encoding the strings they put in their URLs, and i'm not suggesting they should cut the corner by skipping it for event-ids. This was my first sentence at #1884 (comment).
According to RFC3986 you don't have to escape colons in the path part of URLs.
However, if you copy-paste
Anywhere you are dealing with a stack of event IDs. or msc1228-user IDs or room IDs, where you are building them into a SQL query or a commandline arg or otherwise manipulating them in a text editor. Rather than being a nice neat stack of 32-character IDs, they're arbitrarily meandering around all over the place.
But that was my point!! It's quite possible that all the clients might and could and should correctly url-encode their event IDs. The reason to have URL-safe IDs in URLs is because it makes life way easier for the humans whilst working on the clients, rather than anything to do with letting client developers cut corners in their escaping logic.
I never said that having to url-encode your path parameters constitutes a perversely developer unfriendly protocol. But a protocol that wilfully picks the non-URL-safe format of an encoding to use in URLs is surely unquestionably perverse, from a CS API developer perspective. And changing the event ID format to be URL safe does fix that.
I did not say it was.
Please can we FCP this?
@erikjohnston points out that this isn't what the checkboxes are for, so ignore that.
Your first sentence was "I just spoke this through IRL with @erikjohnston and @neilisfragile and came to realise that a lot of the debate here (e.g. whether should pander to client developers who don't encode URIs) is specious."
You've certainly previously argued that one reason for favouring url-safe base64 is to avoid breaking clients which do not correctly url-encode their parameters (eg here). But ok, if we're now agreed that this is not a valid argument, \o/.
s/some/most/, I'd suggest. Certainly enough that it means you need to handle %-decoding when you are grokking logs.
Given, as you point out, it's optional, I don't see how it's constant-width; and in any case, why is its width relevant in logs?
I'm failing to see the relevance to the logs containing %-encoded characters here.
How often do you actually build a URL including an event id by hand? I often cut-and-paste event ids into postgres queries and the like, but I think I could count the times I've manually put one in a URL on the fingers of one hand.
why would you have a stack of %-encoded event ids? I mean, I suppose you might fish them out of an access log, but I'd have thought by the time you've done that you'd probably want to stick them through
If that was your point, then I think you've changed your stance since previously, and #1884 (comment) really didn't make it clear, but okm fine.
Well, you certainly did previously.
I think it's inappropriate to repeatedly demand FCP while there is still a very active conversation ongoing.
I think it boils down to a matter of aesthetics and legibility. As a human scanning a log file, the less variation in the fields in the logfile the better, from my perspective.
I was trying to point out that the fact that colons are optionally URL encoded doesn't hinder developers in general, so the implication of "colons already get encoded, so why does it matter if slashes do" is invalid.
So whenever you cut-and-paste an event ID from an HTTP log into a postgres query you're going to have to manually unescape it. For no reason.
I spend way too much time currently going through DBs, finding IDs, and putting them into the admin API - basically every time anyone reports abuse on matrix.org, or any time i'm doing bridge maintenance or otherwise using the CS API based on data from the DB (or conversely using IDs from HTTP logs in the DB). I'm guessing about 50 times so far this year? Someone like half-shot probably even more so.
Because you've just grepped them out of an access log, or your client's HTTP logs, or your URLs in Chrome Dev Tools, etc.
Optimising for tech architects who have perl oneliners hardwired to their muscle memory is precisely what I mean by perverse protocol design. We have the option to avoid that pain, let's use it.
As per #1884 (comment) i've given up asking people to vote to express their position; will wait to see if there is any hope of getting aligned first.
I'm afraid I'm totally unconvinced that a few %-encoded characters add meaningfully to the variation in logfile lines.
But it is valid when we are talking about grokking logfile lines.
This is a slightly different argument to your previous one, which was about C&P in the other direction. Anyway, you have to unescape v1 event IDs currently, but ok, I can see that it would be nice to fix this.
Which admin APIs take event IDs, ooi?
Well, I think that if you attempt to grok logs without making an attempt to unescape %-encoded identifiers, you're going to have a bad time.
To be clear, my position remains:
Now that we've dismissed the "clients shouldn't be forced to url-encode" argument, I think the stated advantages are:
Ultimately it's a judgement call as to whether those advantages outweigh the disadvantages of making the change. I'm not convinced they do, but as I've said previously: it's not a mountain I'm prepared to die on, so, with my reservations recorded:
@mscbot resolve this doesn't fix anything
Rather than going through the rigmarole of cancelling FCP proposal and starting over again, please can we just update the checkboxes to match current thoughts on the updated proposal?
As I said, I think it's important to make what is written match reality. Given the amount of time we spend going back wondering why we made a given decision, I think it is an important use of our time.
Having said that, I'm now happy enough with what we have.
(and explaining/justifying to others. Remember these MSCs serve as a historical record to all members of the community. I'd rather take the time to clarify it now then repeatedly have to clarify "yes I know that's what the MSC says but it's not what it means" in future.)