MSC2326: Label based filtering #2326

ara4n · 2019-10-22T11:29:29Z

Rendered

proposals/2326-label-based-filtering.md

richvdh · 2019-10-22T12:45:40Z

Seems generally sane to me.

proposals/2326-label-based-filtering.md

Half-Shot

I love this idea and it will really help with things like Zulip bridging. My concerns are mostly around privacy/metadata leaks but I think we have a sane solution in the comments.

Concerns should have been addressed.

dkasak · 2019-10-29T09:21:19Z

Some more potential problems:

No way to list all threads in a room without crawling the entire room
history.

This could be mitigated by having the client crawl the room and
remembering/indexing the labels. Since this is also necessary for E2EE
search, it seems like a feasible solution.
The described scheme makes it difficult to build rainbow tables, but since
# labels will commonly be natural language (not random data), it will still
be quite simple to leak the labels from label hashes in most cases. For
instance, an attacker could just take a list of the 10000 most common English
words and hash them on the fly, probably uncovering most labels. Even
producing simple variations (word_word, wordword, WordWord,
Word_Word, etc) and checking them all is feasible on modern hardware.

What if the label hash was peppered and the pepper stored in the encrypted
part of the labelled event? The hash would then be calculated as Hash(label
|| roomid || pepper) instead of Hash(label || roomid), e.g.
#fun!someroom:example.com!PEPPER.

babolivier · 2019-10-29T10:44:25Z

Hey @dkasak, thanks for your feedback!

Some more potential problems:

No way to list all threads in a room without crawling the entire room
history.

This could be mitigated by having the client crawl the room and
remembering/indexing the labels. Since this is also necessary for E2EE
search, it seems like a feasible solution.

FWIW crawling the entire room history isn't always a realistic expectation to have from clients, especially in big and old rooms with thousands of messages and hundreds of servers.
Another solution for this would have to spec a federation endpoint to ask each server in the room for the list of labels (or hashes) it knows about for this room, but this is also bound to create issues in big and old rooms.
I'd rather have this feature of listing every label in a room as a non-spec'd one which existence and behaviour are left as an implementation detail, in order to keep this MSC simple. If there is a need for it to be in the spec, it can also be done in a separate MSC.

The described scheme makes it difficult to build rainbow tables, but since
# labels will commonly be natural language (not random data), it will still
be quite simple to leak the labels from label hashes in most cases. For
instance, an attacker could just take a list of the 10000 most common English
words and hash them on the fly, probably uncovering most labels. Even
producing simple variations (word_word, wordword, WordWord,
Word_Word, etc) and checking them all is feasible on modern hardware.

What if the label hash was peppered and the pepper stored in the encrypted
part of the labelled event? The hash would then be calculated as Hash(label
|| roomid || pepper) instead of Hash(label || roomid), e.g.
#fun!someroom:example.com!PEPPER.

Although I agree with your point, and like your solution, the whole point of using hashes instead of e.g. random opaque strings is to ensure that two clients can only use the same identifier for the same label text. Bearing in mind that we need to be able to use the label identifiers to filter the history of the room server-side (because we're not expecting clients to know about the whole history of the room, see my first point above), opaque strings had the following downsides, all originating from the fact that nothing would prevent 1000 clients from using each a different identifier:

filtering would have serious performances issues in E2EE rooms, as the server would need to return all events it knows about which label identifier is any of the 1000 identifiers provided by the client, which is quite expensive to do (even when using PostgreSQL with the right indexing strategy).
it would be impossible for a filtered /message (or /sync) request to include every event matching the desired label because we can't expect a client to know about every identifier that has been used in the whole history of the room, or about the fact that another client might suddenly decide to use another identifier for the same label text, and include those identifiers in its filtered request.

Although what your proposing isn't using a random opaque string, it would still end up having different identifiers for a same label, which would end up causing the issues with filtering I've already mentioned. This hash(label + room_id) solution is the best compromise we've come up with between enabling filtering and hiding the actual label as much as possible from servers.

I realise that this explanation belongs to an "Alternative solutions" section of the MSC which I've actually forgotten to add, I'll update the proposal with it.

Half-Shot

Notes about the hash

proposals/2326-label-based-filtering.md

dkasak · 2019-10-29T13:01:28Z

@Half-Shot:

It's important to note here that the ID + text label is used to mainain
consistency of the key without having access to history. E.g. a random new
joiner without history could compute the same hash for the same label as
another client.

This was indeed the part I was missing, but I'm less clear on why this ability
is useful.

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.
However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.
They are effectively public information, but this is made unobvious by the fact
that they are hashed.

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

dkasak · 2019-10-29T13:32:46Z

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

On second thought, there's probably no problem with any character here since there's no parsing involved.

babolivier · 2019-10-29T17:50:45Z

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

All of this is a result of the fact that we can't realistically expect a client to know about the entire history of every room it's in (which is why we're doing the filtering on the server side btw), rather than the solution described in this MSC.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.

The MSC aims at allowing both scenarios rather than one to the detriment of the other. We want both to allow users to continue a thread à la Slack if their client can access one of the messages in the thread, and to be able to show conversation topics à la Zulip alongside messages in the timeline. To our (that would be mine and @Half-Shot's) knowledge, the solution that's currently described is the only one that allows both to coexist, because we can't rely on the client to know about every event, thus every possible label in a room.

If anyone has an idea for a better solution that allows for both scenarios, I'd love to hear about it.

However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.

My take on this is that the current solution makes it as hard as possible for rogue servers to figure out the labels on an encrypted event while allowing the features described to work. I agree on the fact that it's not a perfectly secure solution, and again, if anyone knows about a better one I'd be more than happy to hear about it.

They are effectively public information, but this is made unobvious by the fact
that they are hashed.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

turt2live · 2019-10-29T17:57:00Z

(please move to threads so the conversation can actually be followed)

dkasak · 2019-10-29T18:53:04Z

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

I realized I haven't said so earlier, so just to be sure: I'm not advocating for this MSC to gain the ability to list all labels.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

Yes, sorry, my wording was not the best. What I meant is that the server will be able to reverse the hashes (and hence read the labels) easily in a very large number of cases, if it wants to. I wouldn't had considered this to be the case for E2EE rooms intuitively, had I not seen this MSC.

That is, unless people start naming their threads with peppers embedded in them. ;)

(also, sorry, I didn't understand Travis' comment; is there a Github threading feature I'm not aware of or was that tongue-in-cheek?)

turt2live · 2019-10-29T19:38:57Z

Line comments on the diff rather than using the comment box here are threads. Trying to track down what people are talking about here is way too difficult, which is why we require people to use threads.

.

polwel · 2020-05-27T12:48:34Z

proposals/2326-label-based-filtering.md

+
+Therefore, this is a simpler proposal to allow messages in a room to be filtered
+based on a given label in order to give basic one-layer-deep threading
+functionality.


Still, the 'reply to an existing message' is an important use case.

A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required?

Isn't that like replying to the message based on message ID?

#1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though.

@ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads.

AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up.

I personally prefer hierarchical threads, but either way, regardless of the how the relationships are recorded, they can be shown flat, just like apple mail and gmail do.

Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity.

KB1RD · 2020-05-27T13:31:07Z

proposals/2326-label-based-filtering.md

+This is done by new `labels` and `not_labels` fields to the `EventFilter`
+object, which specifies a list of labels to include or exclude in the given
+filter.
+


This may be something for another MSC down the road, but would it make sense to have power level restrictions for labels? For example, I may want to have an announcements label in my room that only bots and I can post in. For example, a GitHub status label could be useful to filter out announcement spam from the main room conversation.

This is an interesting thought, but is probably out of the scope of this MSC imho

KB1RD · 2020-08-08T13:28:27Z

proposals/2326-label-based-filtering.md

+
+When a user wants to filter a room to given label(s), it defines a filter for
+use with `/sync`, `/context`, `/search` or `/messages` to limit appropriately.
+This is done by new `labels` and `not_labels` fields to the `EventFilter`


Currently, this doesn't seem to make it possible to expose to the user which labels are present in a room (since the client would have to scan all of the events in a room for labels). IMO, it would be poor UX to not have available labels visible. I suppose there could be two ways of doing this:

Have the server scan each event for labels and keep track of them

Add a state event (ex, m.labels) where the state key is set to a label and the content could contain attributes like description or label color (if desired for UX). IMO this would be the cleanest option.

I made a proposal about an endpoint to get the list of labels available in a given room here: #2326 (comment), but haven't gotten around to add it to the proposal yet.

ara4n · 2020-09-27T23:07:19Z

proposals/2326-label-based-filtering.md

+maintaining one rainbow table for each encrypted room it's in, which would
+probably make it not worth the trouble.
+
+## Alternative solutions


I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels give a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled.

Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them).

This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings.

ara4n · 2020-09-27T23:09:12Z

@mscbot concern hashing the labels in encrypted rooms provides a very false sense of security. i think we should add a pepper to the hash, which can be sent to new users in an encrypted msg when they're invited to the room (if the new users need to know about the existing labels).

richvdh · 2021-04-06T16:14:34Z

This seems to have lots of outstanding concerns, and is generally looking a bit unloved, so I don't think it's ready for FCP generally.

@mscbot fcp cancel

MSC2326: Label based filtering

3214e90

ara4n added the proposal A matrix spec change proposal label Oct 22, 2019

richvdh reviewed Oct 22, 2019

View reviewed changes

Half-Shot reviewed Oct 22, 2019

View reviewed changes

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved

Half-Shot previously requested changes Oct 23, 2019

View reviewed changes

Add links to other MSCs

8c84d7b

turt2live self-requested a review October 28, 2019 19:36

Incorporate reviews

cb0c68f

babolivier requested review from Half-Shot and richvdh October 28, 2019 22:03

babolivier added 2 commits October 29, 2019 11:13

Describe alternative solutions

6627b00

Include an unstable prefix for m.labels_hashes

32597a7

Half-Shot reviewed Oct 29, 2019

View reviewed changes

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved

proposals/2326-label-based-filtering.md Show resolved Hide resolved

babolivier added 2 commits October 29, 2019 11:41

Incorporate review

6f36f56

Fix link

46d412e

dkasak reviewed Oct 29, 2019

View reviewed changes

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved

Fix copy-paste

78c4e16

babolivier mentioned this pull request Oct 30, 2019

Implement MSC2326 (Label based filtering) matrix-org/synapse#6288

Open

5 tasks

babolivier added 2 commits October 30, 2019 14:20

Standardise labels format

05217cd

Mandate case insensitivity in labels

da7776f

turt2live removed the proposal-in-review label Mar 5, 2020

jevolk added a commit to jevolk/charybdis that referenced this pull request Mar 30, 2020

ircd::m: Add filter properties for matrix-org/matrix-spec-proposals#2326

92ad654

.

turt2live added the kind:core MSC which is critical to the protocol's success label Apr 20, 2020

ara4n mentioned this pull request May 7, 2020

[Feature request] Implement message threads in Matrix like Slack Threaded messaging matrix-org/synapse#2329

Closed

polwel reviewed May 27, 2020

View reviewed changes

KB1RD reviewed May 27, 2020

View reviewed changes

This was referenced May 30, 2020

MSC2589: Improve replies #2589

Closed

MSC1849: Proposal for m.relates_to aggregations #1849

Closed

uhoreg mentioned this pull request Jul 7, 2020

MSC2674: Event Relationships #2674

Merged

KB1RD reviewed Aug 8, 2020

View reviewed changes

notramo mentioned this pull request Aug 21, 2020

Make changing channels in room easier element-hq/element-android#1937

Open

richvdh added this to Awaiting FCP ticks in Spec Core Team Backlog Sep 10, 2020

anoadragon453 added this to Ready for FCP ticks in Spec Core Team Backlog Sep 11, 2020

anoadragon453 moved this from Ready for FCP ticks to Awaiting SCT input in Spec Core Team Backlog Sep 23, 2020

ara4n commented Sep 27, 2020

View reviewed changes

mscbot added the unresolved-concerns This proposal has at least one outstanding concern label Mar 9, 2021

mscbot added proposal-in-review and removed disposition-merge proposed-final-comment-period Currently awaiting signoff of a majority of team members in order to enter the final comment period. labels Apr 6, 2021

richvdh assigned ara4n May 11, 2021

richvdh removed this from Awaiting SCT input in Spec Core Team Backlog May 11, 2021

turt2live added the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jun 8, 2021

turt2live force-pushed the old_master branch from e895827 to dca99ee Compare August 30, 2021 22:34

erlend-sh mentioned this pull request Aug 9, 2020

Threading API #3799

Closed

KitsuneRal mentioned this pull request Apr 29, 2021

Navigate rooms with hashtag search function matrix-org/matrix-spec#821

Open

turt2live removed the proposal-in-review label May 5, 2022

erlend-sh mentioned this pull request May 10, 2022

Threads element-hq/roadmap#8

Open

matrixbot mentioned this pull request Dec 21, 2023

Implement MSC2326 (Label based filtering) element-hq/synapse#6288

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MSC2326: Label based filtering #2326

MSC2326: Label based filtering #2326

ara4n commented Oct 22, 2019 •

edited

richvdh commented Oct 22, 2019

Half-Shot left a comment

dkasak commented Oct 29, 2019

babolivier commented Oct 29, 2019

Half-Shot left a comment

dkasak commented Oct 29, 2019

dkasak commented Oct 29, 2019

babolivier commented Oct 29, 2019

turt2live commented Oct 29, 2019

dkasak commented Oct 29, 2019

turt2live commented Oct 29, 2019

polwel May 27, 2020

ptman May 27, 2020

KB1RD May 27, 2020

polwel May 27, 2020

ptman May 27, 2020 •

edited

polwel May 27, 2020

KB1RD May 27, 2020

babolivier Aug 13, 2020

KB1RD Aug 8, 2020

babolivier Aug 13, 2020

ara4n Sep 27, 2020 •

edited

ara4n commented Sep 27, 2020

richvdh commented Apr 6, 2021

MSC2326: Label based filtering #2326

Are you sure you want to change the base?

MSC2326: Label based filtering #2326

Conversation

ara4n commented Oct 22, 2019 • edited

richvdh commented Oct 22, 2019

Half-Shot left a comment

Choose a reason for hiding this comment

dkasak commented Oct 29, 2019

babolivier commented Oct 29, 2019

Half-Shot left a comment

Choose a reason for hiding this comment

dkasak commented Oct 29, 2019

dkasak commented Oct 29, 2019

babolivier commented Oct 29, 2019

turt2live commented Oct 29, 2019

dkasak commented Oct 29, 2019

turt2live commented Oct 29, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ptman May 27, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ara4n Sep 27, 2020 • edited

Choose a reason for hiding this comment

ara4n commented Sep 27, 2020

richvdh commented Apr 6, 2021

ara4n commented Oct 22, 2019 •

edited

ptman May 27, 2020 •

edited

ara4n Sep 27, 2020 •

edited