Skip to content

lexer: lowercase-only delimiters for quoted strings#13628

Merged
Octachron merged 4 commits intoocaml:trunkfrom
Octachron:quoted_delim_fix
Nov 21, 2024
Merged

lexer: lowercase-only delimiters for quoted strings#13628
Octachron merged 4 commits intoocaml:trunkfrom
Octachron:quoted_delim_fix

Conversation

@Octachron
Copy link
Copy Markdown
Member

@Octachron Octachron commented Nov 20, 2024

This PR fixes #13626 by limiting quoted string delimiters to alphabetical lowercase identifiers (straße but not x1 nor x').

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 20, 2024

Naive questions:

  • The bug report appeared to also be about single-quote ' being supported that previously was not, right? What is the impact of your PR on single quotes? Where are they hidden in the code?
  • Is this conservative with respect to the lexing rules of 5.2? (How do I review this property?)

@Octachron
Copy link
Copy Markdown
Member Author

Octachron commented Nov 20, 2024

  1. The single quotes (and digits) are hidden in the definition of ident_ext:
let identchar = ['A'-'Z' 'a'-'z' '_' '\'' '0'-'9']
let identstart_ext = identstart | utf8
let identchar_ext = identchar | utf8
let identstart = lowercase | uppercase
let ident_ext = identstart_ext  identchar_ext*

compared to the new delim_ext:

let delim_ext = ( lowercase | uppercase | utf_8 ) *

(where uppercase is only added to benefits from the improved error messages).
2. You can compare with 5.2 with git diff 5.2.. -- parsing/lexer.mll . The difference with 5.2 is essentially that lowercase* is replaced by delim_ext. The absence of uppercase letters is then checked in the delimiter validation function, to either emit a syntax error or to continue lexing inside a comment.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 20, 2024

Note to self: in the diff from 5.2, the _ext prefix generally means "extended" and that means "also with a restricted choice of extra utf8 characters", except for validate_ext where ext means "extension (point)".

Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra comments. I think that I now understand the change and agree that it is safe -- the set of quoted string delimiters of the PR is closer to the one of 5.2, so there are less risks of regressions similar to #13626.

Comment thread parsing/lexer.mll
let utf8 = ['\192'-'\255'] ['\128'-'\191']*
let identstart_ext = identstart | utf8
let identchar_ext = identchar | utf8
let delim_ext = (lowercase | uppercase | utf8)*
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could have a comment on the fact that uppercase is in fact rejected by the validate_delim check, we intentionally accept more to provide clearer / more specific error messages?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added an error message to point out that we want the same error message for {A| and {À|.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 21, 2024

Looks good to me. By default I assume that you will merge and cherry-pick in 5.3.

@Octachron Octachron added this to the 5.3.0 milestone Nov 21, 2024
@Octachron
Copy link
Copy Markdown
Member Author

Indeed.

@Octachron Octachron merged commit 75a518d into ocaml:trunk Nov 21, 2024
Octachron added a commit that referenced this pull request Nov 21, 2024
* lexer: lowercase-only delimiters for quoted strings
* lexer: allow {%ext  | |}

(cherry picked from commit 75a518d)
@Octachron
Copy link
Copy Markdown
Member Author

Cherry-picked on 5.3 as 66b424d .

Julow added a commit to Julow/ocamlformat that referenced this pull request Nov 25, 2024
Julow added a commit to ocaml-ppx/ocamlformat that referenced this pull request Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New syntax error in string literals in comments in 5.3

2 participants