Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: string gaps for continuing string literals across multiple lines #2838

Closed
kmill opened this issue Nov 7, 2023 · 1 comment
Closed
Labels
RFC Request for comments

Comments

@kmill
Copy link
Collaborator

kmill commented Nov 7, 2023

Summary

Modify the lexical syntax of string literals to have string gaps, which are escape sequences of the form "\" newline whitespace* that have the interpretation of an empty string. For example,

  "this is \
     a string"

is equivalent to "this is a string".

Motivation

In projects that have a line length limit (such as std and mathlib), long strings need to be wrapped to respect this limit. Long strings appear in error messages in meta code.

A workaround one sees is using string interpolation with empty strings, since the expression between curly braces may itself contain whitespace that is ignored:

  s!"this is {
  ""}a string"

However, (1) this feels like a hack to many who use Lean and (2) this term produces "this is " ++ "" ++ "a string", which is not an atomic string literal. Note that users do not write

  "this is
   a string"

because this includes extraneous whitespace in the string -- in particular, this example is equivalent to "this is\n a string".

String literals should have syntax to be able to continue them across multiple lines without incorporating the intervening whitespace. Using Haskell terminology, we want "gaps" in the lexical syntax of string literals.

Manual-level explanation

String literals can be broken across multiple lines without introducing additional whitespace by ending lines with the \ character. All the whitespace up until the first non-whitespace character on the next line is ignored. These \-escaped line breaks are known as string gaps.

For example, the string literal

  "this is \
     a string"

has a string gap starting at the end of the first line, and the string literal is equivalent to "this is a string"

You must have a line break immediately after the \, otherwise it is an error. For example, you are not allowed to have comments within strings.

  "this is \ -- Having a comment here is not allowed.
     a string"

If you need to start a line with whitespace, one trick is to use an escape code such as \x20 (space) to start the next line, since otherwise whitespace is consumed as part of the string gap. For example, in the following the string still has a space between is and a.

  "this is\
     \x20a string"

Reference-level explanation

In the lexical syntax of string literals, string gaps correspond to string_gap in the following grammar:

   string       : '"' string_item '"'
   string_item  : string_char | char_escape | string_gap
   string_char  : [^"\\]
   char_escape  : "\" ("\" | '"' | "'" | "n" | "t" | "x" hex_char{2} | "u" hex_char{4})
   hex_char     : [0-9a-fA-F]
   string_gap   : "\" newline whitespace*

To account for systems using different line endings, newline matches both \n and \r\n.

The lexical syntax allows a string gap to consume one or more newlines. However, to prevent confusion, the parser should throw an error if the string gap contains more than one. When there is more than one newline, it becomes unclear what the intended interpretation is meant to be, and also it invites users to try to put comments in intervening blank lines. As such, this is disallowed:

  "this is \

     a string"

The technical reason for allowing multiple newlines in the lexical syntax but disallowing them during parsing is that otherwise the lexical syntax would indicate that a second newline terminates a string gap.

During elaboration time, the Lean.Syntax.isStrLit? function is responsible for decoding parsed string literals, including processing all escape codes. This is where string gaps are skipped. Note that this function shares code with character literal decoding, and so the shared code needs to take care that character literals do not process string gaps.

Drawbacks

String gaps add to the complexity of the lexical syntax. This implementation of string gaps makes it somewhat difficult to include whitespace at the beginning of the continuing line. This is the one place in Lean that needs to be aware of \r\n line endings.

Rationale and alternatives

String gaps already have precedent in other languages. This RFC is essentially the Rust implementation of the concept. However, in Rust they choose to make multiple newlines a warning issued later rather than catching it at parse time. For simplicity, we have this condition raise a parser error (rather than waiting for elaboration time to emit a warning). Note that the lexical syntax for gaps is still "\" newline whitespace*, but we choose to have the parser throw an error if whitespace* contains a newline. One way this makes the code simpler is that we can easily throw an error at the position of a second newline in the gap, if it exists.

Alternatives:

  • Haskell-style string gaps have the lexical syntax "\" whitespace+ "\". These neatly solve the problem of where the next line continues, even if it starts with whitespace.

    "this is\
        \ a string"
    

    However, beginning lines with \ will play havoc on syntax highlighting, especially if the character after the second \ is the closing " for the string literal. Everything you can do with Haskell-style gaps can be represented in some way with Rust-style gaps, with small variations in line lengths.

  • Trying to adapt C's automatic string concatenation, where the juxtaposition "this is" " a string" is equivalent to the concatenated version. However, this creates an ambiguity when passing strings as arguments to functions.

  • Allowing interpolated strings to include expression-free {} notation

    s!"this is {
      }a string"
    

    While this would be a smaller change to the lexical syntax, this does not produce a string literal. It also forces users to switch to using an interpolated string if they want gaps in strings.

  • Using unrestricted "\" whitespace+ gaps. These are easy to parse, but here is an example confusion this can lead to.

    "this is \ -- that's the first half
      a string" -- and that's the second
    

    The first "comment" is actually part of the string. Requiring that the first whitespace character be a newline prevents this.

  • Using "\" whitespace+ gaps where the whitespace must contain at least one newline. This circumvents the need for special CRLF handling, but allowing spaces immediately after \ and before the first newline might be confusing.

Other considerations

  • Note that the whitespace in a string gap must be whitespace. Comments are not permitted. The parser does not interpret string literals (that is, it does not do escape sequence replacements), and they are instead interpreted during elaboration time. If the gaps were to admit comments, this interpretation process would need to be able to handle comment syntax as well. This limitation is also found in the Haskell 2010 language specification.

  • The implementation needs to be careful not to allow string gaps inside character literals. String literals and character literals share code for handling escapes, but the string gap feature only makes sense for strings.

Future possibilities

  • Rather than having special handling for \r\n, we can consider having a global preprocessing step that does \r\n to \n conversion for the entire input file. This approach was taken by Rust.

  • This enables features such as multiline strings where each line begins with some delimiter, which would be another way to control whitespace at the beginnings of lines. This would be useful for embedding blocks of code in string literals.

     d!"this is \
              line 1
          |  line 2, indented
          |line 3"
    
  • We could include an escape such as \> that also stands for the empty string. This could serve as an explicit end to a string gap. For example,

    "this is\
       \> a string"
    

Community discussion

Zulip thread

Impact

Add 👍 to issues you consider important. If others benefit from the changes in this proposal being added, please ask them to add 👍 to it.

@kmill kmill added the RFC Request for comments label Nov 7, 2023
github-merge-queue bot pushed a commit that referenced this issue Dec 7, 2023
#2821)

Implements "gaps" in string literals. These are escape sequences of the
form `"\" newline whitespace+` that have the interpretation of an empty
string. For example,
```
  "this is \
     a string"
```
is equivalent to `"this is a string"`. These are modeled after string
continuations in
[Rust](https://doc.rust-lang.org/beta/reference/tokens.html#string-literals).

Implements RFC #2838
@leodemoura
Copy link
Member

Implemented by commit above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RFC Request for comments
Projects
None yet
Development

No branches or pull requests

2 participants