Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2810: Consistent globs specification #2810

Closed
wants to merge 1 commit into from

Conversation

turt2live
Copy link
Member

@turt2live turt2live added proposal A matrix spec change proposal proposal-in-review kind:maintenance MSC which clarifies/updates existing spec labels Oct 8, 2020
described by the article linked above. The characters are:

* `*` - an asterisk to denote zero or more character matches.
* `?` - a question mark to denote one or more character matches.
Copy link
Contributor

@jplatte jplatte Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never heard of ? meaning .+ (regex). It should be exactly one character (just .).

C.f. https://en.wikipedia.org/wiki/Glob_%28programming%29#Syntax

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exciting, that's yet another inconsistency in our usage then :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @jplatte on this - where do we use ? to mean "one or more characters" ? I'm pretty sure sure synapse doesn't do that anywhere...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory ban lists, though that might be a spec mismatch with implementation thing.

Copy link
Contributor

@Gnuxie Gnuxie Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding so this isn't lost after a discussion in #matrix-spec starting here

#2313 Policy lists says:

Glob characters * and ? can be used to match zero or more and one or more characters respectively.

#1550 server ACLs says:

? matches exactly one character.

Synapse implements ACLs so that ? matches only one character, which is consistent with the spec for ACLs.

Mjolnir's Synapse antispam module uses the same code from synapse for ACLs to implement the globs from policy lists #2313 which is inconsistent with the spec for policy lists.

So the definition of ? for #2313 isn't actually being observed by the known implementation and we think that is intentional. Thus it is recommended to update the spec to match implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the definition of ? for #2313 isn't actually being observed by the known implementation and we think that is intentional. Thus it is recommended to update the spec to match implementation.

opened this as #3691. I encourage people finding bugs in the spec to open gh issues to track them!

Comment on lines +26 to +27
Further rules, such as square bracket matching, are excluded for the time being. A future MSC
can introduce them as needed when an appropriate use case arises.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth noting that [...] is implemented for push rules, currently.

for the most part and have an easier set of rules to process.

Matrix already uses globs in server ACLs, push rules, and policy events. This proposal aims to
formalize the glob specification between these three areas (which is already consistent) and for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is already consistent

roflololol no.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed.

Much sadness is to be had, and this proposal probably needs a bunch of updates.

Comment on lines +18 to +19
* `\` - a backslash escapes the character that follows to be treated as a literal. If the character
that follows has no specific meaning, the escape character is treated as a literal itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, I don't think \ works in Synapse, in either server ACLs or push rules, so we probably have a clean slate here. Given that, IMHO, \a should mean a, if for no other reason than consistency with common usage (such as bash's filename expansion), though it also feels more consistent for me for reasons I'm struggling to characterise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably fair to follow the rest of the industry and spit out "invalid escape sequence" errors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to define if each and every character is a metacharacter and we should require escaping all literal metacharacters and disallow escaping everything else. Otherwise we will have no wiggle room for future changes.

I see two main options metacharacters as [*?[\]\], or [^a-zA-Z0-9] (these groups can be tweaked, but I think we need to define them). I agree that we should raise errors on escaping things that aren't allowed to be escaped.

Copy link
Member

@KitsuneRal KitsuneRal Mar 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a practical example - [...] glob, not specced here but likely to be specced later. Correct me if I'm wrong somewhere:

  1. Following the current MSC text:
    • Before introducing the new glob ([ matches [): \[ matches \[
    • After ([ becomes a magic character): both [ and \[ change meaning; \[ now matches [
  2. If '\[' == '[' today:
    • Before: \[ also matches [; \\[ and \\\[ match \[
    • After: \[ still matches [, providing forward-compatibility; \\[ is now a combination of a literal \ match and a magic character but \\\[ still matches \[
  3. If \ in front of "normal" characters is considered invalid:
    • Before: \[ and \\\[ are invalid and therefore unused in the wild; \\[ matches \[
    • After: \[ matches [ but should be explicitly introduced to patterns where [ changed meaning; \\[ is now a combination of a literal \ match and a magic character

Seems like option 2 provides more wiggle room, followed by option 3 and 1, with 3 still better since only [ is used "in the wild" at the moment of change. On the API designer side, I don't really see many options to have a "safe haven" to introduce new glob characters (except reserving some characters for future special use, which effectively makes those characters follow option 2).

Copy link

@kevincox kevincox Mar 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get what you mean in 2. But let me try to reword what I am proposing.

  1. We define a set of meta characters. (Ex [^a-zA-Z0-9 _-])
  2. Everything else is considered a normal character.
  3. Putting a backslash in front of a meta character results in matching that character literally.
  4. Some meta characters have special meaning. Putting an unescaped meta-character which doesn't have a special meaning is an error.
  5. Escaping a normal character is an error. (We may give escaped normal characters special meanings in the future)
  6. Unescaped normal characters match themselves.

So:

  • [abc] is an error as [ and ] are undefined metacharacters.
  • \[abc\] matches a literal [abc].
  • \abc is an error because \a is undefined.
  • \n may match a literal newline in the future or something. (Currently an error because it is undefined).

I think this is pretty future proof and easy to implement. We just have to pick the definition of normal vs meta. I think we should probably define meta as "All characters with codepoints < 128 except for a-z, A-Z, 0-9, space, dash, underscore, newline"

Copy link
Member

@KitsuneRal KitsuneRal Mar 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\n is not really interesting because n is not magic, only \n (eventually) is. In that case \\n will match a literal \n, and a linefeed will (eventually) be matched by \n.

By Option 2, I meant that escaped non-special characters match themselves - same as Rich proposed above, \a matches a; \\a matches a literal \a. This way, even if a becomes a magic character, users can proactively switch to \a instead of a even before the change is actually rolled out (and I tried to illustrate it with [ above).

Reserving a space of special characters means that those characters become problematic for common users. Consider that this syntax is used in push rules: to begin with, your suggested set throws out all non-Latin script users; but even for Latin scripts, any punctuation now needs a \. Writing a pattern becomes a real chore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that escaped non-special characters match themselves

I see. That is a problem for changes in the future because then once released we can never make \a do anything special because we have already defined it to match a literal a.

your suggested set throws out all non-Latin script users

How so? I just defined all non-Latin script as "normal".

Writing a pattern becomes a real chore.

Well then add common punctuation to the "normal" list. My point is that once this is published we can't move anything from "normal" to "meta" in the future. I maybe cast the net wider than necessary but if we are worried about UX we can move characters that are expected to be common (@ and /?) into the "normal" group.

I don't really care what the "meta" group is but note that we can never add to it later, so it is better to make to too big rather than too small.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I misinterpreted [^...] as a set complement. Also yes, good point that alphabetic characters may (eventually) have special meaning with \.

What I was meaning to show in my options above is that actually we can add to the list of metacharacters, provided that there's a migration path. If we're careful (i.e. if it's completely unambiguous), the migration may even involve actual rewriting. But I see value in outright reserving some (very limited) set of characters, to minimise disruption.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I misinterpreted [^...] as a set complement.

My mistake. That was a "simple" example but I put my actual proposal for meta-characters below my explanation of the rules. That was probably more confusing than it needed to be.

the migration may even involve actual rewriting

It may be possible but I would try to avoid relying on this at all costs. Especially when globs are going to be shared across the federation like the proposed roles and permissions that use globs. It seems that trying to migrate a character to require escaping is likely more effort than it will ever be worth and there will be bugs and accidents along the way.

@turt2live turt2live added the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jun 8, 2021
@turt2live
Copy link
Member Author

I don't think this MSC fits anymore - the specification isn't consistent enough to migrate, and honestly it's not really hurting us too badly to have 3 globs that are well-defined (though some do require looking at the issue tracker for the full story, and we should fix that).

Abandoning in favour of a future MSC picking up the work when it is needed.

@turt2live turt2live closed this Feb 2, 2022
@turt2live turt2live deleted the travis/msc/globs branch February 2, 2022 01:55
@turt2live turt2live added abandoned A proposal where the author/shepherd is not responsive and removed proposal-in-review labels Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
abandoned A proposal where the author/shepherd is not responsive kind:maintenance MSC which clarifies/updates existing spec needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants