Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2326: Label based filtering #2326

Open
wants to merge 17 commits into
base: master
from

Conversation

@ara4n
Copy link
Member

ara4n commented Oct 22, 2019

Rendered

@ara4n ara4n added the proposal label Oct 22, 2019
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
@richvdh

This comment has been minimized.

Copy link
Member

richvdh commented Oct 22, 2019

Seems generally sane to me.

Copy link
Contributor

Half-Shot left a comment

I love this idea and it will really help with things like Zulip bridging. My concerns are mostly around privacy/metadata leaks but I think we have a sane solution in the comments.

@turt2live turt2live self-requested a review Oct 28, 2019
@babolivier babolivier dismissed Half-Shot’s stale review Oct 28, 2019

Concerns should have been addressed.

@babolivier babolivier requested review from Half-Shot and richvdh Oct 28, 2019
@dkasak

This comment has been minimized.

Copy link

dkasak commented Oct 29, 2019

Some more potential problems:

  • No way to list all threads in a room without crawling the entire room
    history.

    This could be mitigated by having the client crawl the room and
    remembering/indexing the labels. Since this is also necessary for E2EE
    search, it seems like a feasible solution.

  • The described scheme makes it difficult to build rainbow tables, but since
    # labels will commonly be natural language (not random data), it will still
    be quite simple to leak the labels from label hashes in most cases. For
    instance, an attacker could just take a list of the 10000 most common English
    words and hash them on the fly, probably uncovering most labels. Even
    producing simple variations (word_word, wordword, WordWord,
    Word_Word, etc) and checking them all is feasible on modern hardware.

    What if the label hash was peppered and the pepper stored in the encrypted
    part of the labelled event? The hash would then be calculated as Hash(label
    || roomid || pepper) instead of Hash(label || roomid), e.g.
    #fun!someroom:example.com!PEPPER.

@babolivier

This comment has been minimized.

Copy link
Member

babolivier commented Oct 29, 2019

Hey @dkasak, thanks for your feedback!

Some more potential problems:

  • No way to list all threads in a room without crawling the entire room
    history.

    This could be mitigated by having the client crawl the room and
    remembering/indexing the labels. Since this is also necessary for E2EE
    search, it seems like a feasible solution.

FWIW crawling the entire room history isn't always a realistic expectation to have from clients, especially in big and old rooms with thousands of messages and hundreds of servers.
Another solution for this would have to spec a federation endpoint to ask each server in the room for the list of labels (or hashes) it knows about for this room, but this is also bound to create issues in big and old rooms.
I'd rather have this feature of listing every label in a room as a non-spec'd one which existence and behaviour are left as an implementation detail, in order to keep this MSC simple. If there is a need for it to be in the spec, it can also be done in a separate MSC.

  • The described scheme makes it difficult to build rainbow tables, but since
    # labels will commonly be natural language (not random data), it will still
    be quite simple to leak the labels from label hashes in most cases. For
    instance, an attacker could just take a list of the 10000 most common English
    words and hash them on the fly, probably uncovering most labels. Even
    producing simple variations (word_word, wordword, WordWord,
    Word_Word, etc) and checking them all is feasible on modern hardware.

    What if the label hash was peppered and the pepper stored in the encrypted
    part of the labelled event? The hash would then be calculated as Hash(label
    || roomid || pepper) instead of Hash(label || roomid), e.g.
    #fun!someroom:example.com!PEPPER.

Although I agree with your point, and like your solution, the whole point of using hashes instead of e.g. random opaque strings is to ensure that two clients can only use the same identifier for the same label text. Bearing in mind that we need to be able to use the label identifiers to filter the history of the room server-side (because we're not expecting clients to know about the whole history of the room, see my first point above), opaque strings had the following downsides, all originating from the fact that nothing would prevent 1000 clients from using each a different identifier:

  • filtering would have serious performances issues in E2EE rooms, as the server would need to return all events it knows about which label identifier is any of the 1000 identifiers provided by the client, which is quite expensive to do (even when using PostgreSQL with the right indexing strategy).
  • it would be impossible for a filtered /message (or /sync) request to include every event matching the desired label because we can't expect a client to know about every identifier that has been used in the whole history of the room, or about the fact that another client might suddenly decide to use another identifier for the same label text, and include those identifiers in its filtered request.

Although what your proposing isn't using a random opaque string, it would still end up having different identifiers for a same label, which would end up causing the issues with filtering I've already mentioned. This hash(label + room_id) solution is the best compromise we've come up with between enabling filtering and hiding the actual label as much as possible from servers.

I realise that this explanation belongs to an "Alternative solutions" section of the MSC which I've actually forgotten to add, I'll update the proposal with it.

babolivier added 2 commits Oct 29, 2019
Copy link
Contributor

Half-Shot left a comment

Notes about the hash

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Show resolved Hide resolved
babolivier added 2 commits Oct 29, 2019
@dkasak

This comment has been minimized.

Copy link

dkasak commented Oct 29, 2019

@Half-Shot:

It's important to note here that the ID + text label is used to mainain
consistency of the key without having access to history. E.g. a random new
joiner without history could compute the same hash for the same label as
another client.

This was indeed the part I was missing, but I'm less clear on why this ability
is useful.

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.
However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.
They are effectively public information, but this is made unobvious by the fact
that they are hashed.

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

@dkasak

This comment has been minimized.

Copy link

dkasak commented Oct 29, 2019

[*]: Which makes me think of another problem: is ! disallowed in labels due
to it also being used as a separator? More generally, which characters are
allowed?

On second thought, there's probably no problem with any character here since there's no parsing involved.

@babolivier

This comment has been minimized.

Copy link
Member

babolivier commented Oct 29, 2019

I suppose the intent is to enable users wanting to talk about (e.g.) cats to
simply start sending messages labelled with #cats. However, the room might
have an existing thread for talking about cats -- yet it is called #Cats or
#cats![*] but the user is unaware of them. Furthermore, the new #cats
thread might also be entirely contextually unrelated to the old one (except
sharing a name coincidentally), but a user has no way of knowing this without
the ability to inspect the old thread's content.

All of this is a result of the fact that we can't realistically expect a client to know about the entire history of every room it's in (which is why we're doing the filtering on the server side btw), rather than the solution described in this MSC.

In essence, continuing an old thread seems primarily useful when you have the
ability to access the old content. But if you can access the old content, then
you also have access to the corresponding pepper to continue that particular
thread.

I suppose the desired scenario might be to make labelling a message with
a particular label (or, to put it differently, putting the message in
a particular "bin") cheap, while making only the retrieval potentially costly
(due to having to walk the room's history). Is this the design goal you had in
mind? In that case, I can see how the unpeppered system provides an advantage.

The MSC aims at allowing both scenarios rather than one to the detriment of the other. We want both to allow users to continue a thread à la Slack if their client can access one of the messages in the thread, and to be able to show conversation topics à la Zulip alongside messages in the timeline. To our (that would be mine and @Half-Shot's) knowledge, the solution that's currently described is the only one that allows both to coexist, because we can't rely on the client to know about every event, thus every possible label in a room.

If anyone has an idea for a better solution that allows for both scenarios, I'd love to hear about it.

However, I still consider this ability to be of limited value if one is unable
to inspect either the set of all labels or the content behind them.

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

My main concern is that just hashing the labels has a rather large downside of
simply obscuring them from casual glances but doing nothing to truly hide them.

My take on this is that the current solution makes it as hard as possible for rogue servers to figure out the labels on an encrypted event while allowing the features described to work. I agree on the fact that it's not a perfectly secure solution, and again, if anyone knows about a better one I'd be more than happy to hear about it.

They are effectively public information, but this is made unobvious by the fact
that they are hashed.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

@turt2live

This comment has been minimized.

Copy link
Member

turt2live commented Oct 29, 2019

(please move to threads so the conversation can actually be followed)

@dkasak

This comment has been minimized.

Copy link

dkasak commented Oct 29, 2019

Again, that is not an issue with this MSC, and this MSC doesn't aim to solve it.

I realized I haven't said so earlier, so just to be sure: I'm not advocating for this MSC to gain the ability to list all labels.

The labels are not public information in E2EE rooms, which is the only scenario in which hashing happens (and is also the reason hashing happens).

Yes, sorry, my wording was not the best. What I meant is that the server will be able to reverse the hashes (and hence read the labels) easily in a very large number of cases, if it wants to. I wouldn't had considered this to be the case for E2EE rooms intuitively, had I not seen this MSC.

That is, unless people start naming their threads with peppers embedded in them. ;)

(also, sorry, I didn't understand Travis' comment; is there a Github threading feature I'm not aware of or was that tongue-in-cheek?)

@turt2live

This comment has been minimized.

Copy link
Member

turt2live commented Oct 29, 2019

Line comments on the diff rather than using the comment box here are threads. Trying to track down what people are talking about here is way too difficult, which is why we require people to use threads.

@babolivier babolivier referenced this pull request Oct 30, 2019
3 of 5 tasks complete
babolivier added 2 commits Oct 30, 2019
@richvdh
richvdh approved these changes Nov 7, 2019
Copy link
Member

richvdh left a comment

this lgtm. I wish the e2e labels discussion would be unwedged...

proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Outdated Show resolved Hide resolved
proposals/2326-label-based-filtering.md Show resolved Hide resolved
babolivier added 3 commits Nov 7, 2019
@KB1RD

This comment has been minimized.

Copy link

KB1RD commented Nov 13, 2019

Is there a length limit for labels?

babolivier added 2 commits Nov 14, 2019
@babolivier

This comment has been minimized.

Copy link
Member

babolivier commented Nov 14, 2019

Is there a length limit for labels?

Good catch, I've added one in 45225af to a maximum of 100 chars, which should be a decent enough length.

@KB1RD KB1RD referenced this pull request Nov 14, 2019
@KB1RD

This comment has been minimized.

Copy link

KB1RD commented Nov 14, 2019

Would this be applicable for threaded messages as well or is the general consensus to leave this up to MSC1849 and MSC1198? Ex, it could be possible to set the label to an event ID. The issue is that syncing for threaded messages for every message in the displayed history could be costly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.