[BUG] Missing escape of `char` literal in interpolation #861

JohelEGP · 2023-12-01T22:47:23Z

Title: Missing escape of char literal in UFCS in interpolation.

Minimal reproducer (https://cpp2.godbolt.org/z/r1f73EP85):

#include <ranges>
main: () = {
  using namespace std::views;
  s := "".std::string();
  _ = "((_ = s.split('\n'), 0))$";
}

Commands:

cppfront main.cpp2
clang++18 -std=c++23 -stdlib=libc++ -lc++abi -pedantic-errors -Wall -Wextra -Wconversion -Werror=unused-result -Werror=unused-value -Werror=unused-parameter -I . main.cpp

Expected result: static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), '\n')), 0)));.

Actual result and error: static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), 'n')), 0)));.

Cpp2 lowered to Cpp1:

//=== Cpp2 type declarations ====================================================


#include "cpp2util.h"

#line 1 "/app/example.cpp2"


//=== Cpp2 type definitions and function declarations ===========================

#line 1 "/app/example.cpp2"
#include <ranges>
auto main() -> int;

//=== Cpp2 function definitions =================================================

#line 1 "/app/example.cpp2"

#line 2 "/app/example.cpp2"
auto main() -> int{
  using namespace std::views;
  auto s {CPP2_UFCS_0(std::string, "")}; 
  static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), 'n')), 0)));
}

Output:

Program returned: 0

The text was updated successfully, but these errors were encountered:

JohelEGP · 2023-12-01T22:49:13Z

Hmm... Maybe I should be using '\\\n' or a interpolated raw string literal.

gregmarr · 2023-12-01T22:53:57Z

Looks like it's not disabling string quoting inside the capture. You get the same result with just this:

main: () = {
  _ = "('\n')$";
}

JohelEGP · 2023-12-01T22:54:02Z

Hmm... Maybe I should be using '\\\n' or a interpolated raw string literal.

Documentation might help to build an intuition on how this should work.
Looking at the grammar, I should even be able to use any expression in an interpolation.
That's because the string doesn't continue until the interpolation ends.
So I think just '\n' should work, without extra \s and without using a raw string literal.

gregmarr · 2023-12-01T23:28:32Z

Character literals such as \x10 and \u0010 don't parse properly either.

hsutter · 2023-12-15T03:40:20Z

It looks like this behavior was added in #464 and #481. The block of code that does it is here:

https://github.com/hsutter/cppfront/blame/main/source/lex.h#L430-L436

I'm not sure why we're removing those \ characters. @filipsajdak do you recall?

When I comment out that block, all regression tests pass unchanged, and the code reported in this issue is fixed.

gregmarr · 2023-12-15T19:20:06Z

It was to fix #461
Looks like something more involved is needed to fix that while not breaking character literals.

hsutter · 2023-12-18T02:56:55Z

Here's the code from #461, showing the state before #461 was fixed:

  std::cout << "\"\"" << '\n'; // Prints `""`.
  std::cout << $R"("")" << '\n'; // Prints `""`.
  std::cout << "\"(\"\")$\"" << '\n'; // Prints `""`.
  std::cout << "(\"\\\"\\\"\")$" << '\n'; // cppfront asserts

My question is, should those last two have to escape inside the captures? Because what's in the captures should be an expression, and \"\" and \"\\\"\\\"\" aren't valid expressions (you can't just write those in code).

In other words, the capture source expression is not part of the string itself (and therefore shouldn't be escaped) -- its evaluation is part of the string (after being passed through to_string).

So I would instead expect this behavior:

  std::cout << "\"\"" << '\n';            // prints "" (as above)
  std::cout << $R"("")" << '\n';          // prints "" (as above)
  std::cout << "\"(\"\")$\"" << '\n';     // error, \"\" isn't a valid expression
  std::cout << "\"("")$\"" << '\n';       // this should print ""
  std::cout << "(\"\\\"\\\"\")$" << '\n'; // error, \"\\\"\\\"\" isn't a valid expression
  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \\\"\\\"
  std::cout << "("\"\"")$" << '\n';       // this should print \"\"

Before I make that change, does that make sense to this group?

JohelEGP · 2023-12-18T03:17:56Z

  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \\\"\\\"
  std::cout << "("\"\"")$" << '\n';       // this should print \"\"

The comments look wrong to me.
I think they should be

  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \"\"
  std::cout << "("\"\"")$" << '\n';       // this should print ""

If this is correct, it's the behavior I expect.

hsutter · 2023-12-18T03:37:19Z

Ah, I think you're right about those last two, thanks.

In particular, these two should be the same and print \"\" as you say:

std::cout << "("\\\"\\\"")$" << '\n';
std::cout << "" << "\\\"\\\"" << "" << '\n';

And these two should be the same and print "" as you say:

std::cout << "("\"\"")$" << '\n';
std::cout << "" << "\"\"" << "" << '\n';

Right?

Let me try a few other cases...

a := 40;
b := 2;
s: std::string = "xyzzy";

std::cout << "(a+b)$"; // should print 42
std::cout << "(s+"plugh")$"; // should print xyzzyplugh
std::cout << "("plugh")$"; // should print plugh
std::cout << "(s+"\"\n")$"; // should print xyzzy" plus a newline
std::cout << "("\"\n")$"; // should print " plus a newline
std::cout << "("")$"; // should print nothing

Do all these look correct now?

JohelEGP · 2023-12-18T03:40:39Z

Yes.

gregmarr · 2023-12-18T16:12:40Z

I'm a bit unsure about how unquoted double quotes should interact with interpolation captures.

std::cout << "(" + s + ")$";

Should this be interpreted as the string ( plus a string variable plus the string )$ and be printed as (xyzzy)$ or as a string containing a capture of a string iteral and be printed as + s +?

hsutter · 2023-12-18T16:21:59Z

OK, I'm just going to fix this immediate Issue.

However, I would still like to get to where " doesn't need to be escaped inside capture-expressions inside string literals. It currently does because in an example like "("")$" that currently lexes as a string literal "(" followed by a string literal of ")$" which gets an error because it doesn't have the matching ( (but even if it wasn't an error, it would be flagged as ungrammatical later since two string literals can't be adjacent).

Of course, you can see above that my wish is to lex "("")$" as a single string literal, but I now see that requires knowing that the second " is not intended to be the closing quote to terminate the string literal... which would require not only lookahead but (I fear possibly) speculative lexing which is a terrible idea. So for now I'm going to leave " inside string interpolations alone, they must still be escaped (but \ and such do not need to be escaped). But this may well be what tips me over the edge to making $ prefix, which would make lexing this simple, align with other languages, and address the steady trickle of requests to make prefix 😉 ... at the expense of making captures in functions require parentheses more often, but maybe it's fine to make captures stand out more (notably, capture-by-pointer such as var&$*++ would become $(var&)*++ which may be actually better and address that current slightly warty syntax).

hsutter · 2023-12-18T16:23:17Z

I'm a bit unsure about how unquoted double quotes should interact with interpolation captures.

Right you are. I think you just identified the same issue in a racing comment while I was writing the above. 👍

gregmarr · 2023-12-18T16:27:54Z

(but even if it wasn't an error, it would be flagged as ungrammatical later since two string literals can't be adjacent)

Is that an intentional change from Cpp1?

https://en.cppreference.com/w/cpp/language/string_literal

String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).

https://www.godbolt.org/z/j1v6qrK81

    std::cout << "("")$";

Program returned: 0
()$

hsutter · 2023-12-18T16:32:38Z

Is that an intentional change from Cpp1?

Yes. I've considered allowing that, but haven't gone there yet. If Cpp2 did allow merging consecutive string literal tokens, though, it would be in the grammar and not in the preprocessor.

Updated to add: I see in the quote above it says "phase 6 (after the preprocessor)" but I don't think that parenthetical is quite right, because it's not until phase 7 that "each preprocessing token is converted into a token." So up to the end of phase 6 we still have only preprocessing tokens.

gregmarr · 2023-12-18T17:16:34Z

If cppfront doesn't support that, then should there be a "how to fix it" message?

gregmarr · 2023-12-18T17:39:20Z

Updated to add: I see in the quote above it says "phase 6 (after the preprocessor)" but I don't think that parenthetical is quite right, because it's not until phase 7 that "each preprocessing token is converted into a token." So up to the end of phase 6 we still have only preprocessing tokens.

I read "preprocessing tokens" as the output of the preprocessor. Phase 4 is the preprocessor, phase 5 is determining the common encoding for adjacent string literals, phase 6 is concatenating those adjacent string literals, and phase 7 is lexing, so I think "after the preprocessor" is accurate. The handling is covered in [lex.string] as well so it may even generally be handled inside the lexer.

JohelEGP · 2023-12-18T18:18:53Z

Is the problem with "("")$" that
"the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token"?

But this may well be what tips me over the edge to making $ prefix, which would make lexing this simple, align with other languages, and address the steady trickle of requests to make prefix 😉 ... at the expense of making captures in functions require parentheses more often, but maybe it's fine to make captures stand out more (notably, capture-by-pointer such as var&$*++ would become $(var&)*++ which may be actually better and address that current slightly warty syntax).

Please, resist.
I very much prefer not having redundant parentheses.
I think postfix operators is justice.

I really liked where this way going,
until I noticed that you intended to make $ prefix everywhere,
and not just in string literals.
That's unfortunate, but it makes sense in the greater scheme of things.

How about alternatives?
I think @msadeqhe once presented a syntax similar to "("x$")".
It's an extension to adjacent string literals, with captures adjoined: x$"y", "x"y$, "x"y$"z".

gregmarr · 2023-12-18T18:59:25Z

Is the problem with "("")$" that

No, I don't think so. The problem is that the first non-escaped " ends the string literal. To have that not be the case, you need a prefix to change the string parsing behavior, as in raw string literals, or you need some other indicator that you are entering a special parsing mode. If the ( is that special indicator, then you need to escape that ( in all cases where you don't want it to indicate that special behavior. Trading "you don't need to escape " in a capture" for "you need to escape ( when not capturing" doesn't sound like a fair trade, especially since we already have a syntax for "you don't need to escape " in a string".

JohelEGP · 2023-12-18T19:27:52Z

Can we get that behavior by changing the grammar of string-literal and extending [lex.phases]?
This is the current grammar (missing the format-specifier #387 (comment)):

//G string-literal:
//G     encoding-prefix? '"' s-char-seq? '"'
//G     encoding-prefix? 'R"' d-char-seq? '(' s-char-seq? ')' d-char-seq? '"'
//G
//G s-char-seq:
//G     interpolation? s-char
//G     interpolation? s-char-seq s-char
//G
//G d-char-seq:
//G     d-char
//G
//G interpolation:
//G     '(' expression ')' '$'

The idea is for "(, with anything in-between, to be a potential-interpolation-prefix.
If that is followed by an expression, optional format-specifier, and )$ (call it interpolation-suffix),
it can be considered part of a string-literal prefix (that still needs the closing ").
If what follows "( isn't that, then it can be a string-literal-part.
Then, at some point in the lex phase, you concatenate potential-interpolation-prefix
with interpolation-suffix and string-literal-part to form a string-literal.
Or maybe that can just be part of the grammar.

gregmarr · 2023-12-18T19:44:41Z

Can we get that behavior by changing the grammar of string-literal

The idea is for "(, with anything in-between, to be a potential-interpolation-prefix. If that is followed by an expression, optional format-specifier, and )$ (call it interpolation-suffix), it can be considered part of a string-literal prefix (that still needs the closing "). If what follows "( isn't that, then it can be a string-literal-part. Then, at some point in the lex phase, you concatenate potential-interpolation-prefix with interpolation-suffix and string-literal-part to form a string-literal. Or maybe that can just be part of the grammar.

That would still change the meaning of these well-formed lines of Cpp1 code.

std::cout << "("")$";
std::cout << "(" + s + ")$";

"( potential-interpolation-prefix
"" expression (an empty string)
)$" interpolation-suffix

"( potential-interpolation-prefix
" + s + " expression (the string + s +)
)$" interpolation-suffix

JohelEGP · 2023-12-18T19:48:36Z

I'm fine with that, for now.
My goal is to keep $ suffix.

hsutter · 2023-12-18T20:41:57Z

OK, let's not quickly switch from suffix $.

I'm not too concerned about having a string literal with interpolation change meaning from Cpp1... they already do that for every case (e.g., "value is (val)$\n" has a different meaning) and so far nobody has raised that part as a concern IIRC.

I still want to keep lexing simple (modulo the backtrack we need for interpolation right now).

I'll think about it some more...

gregmarr · 2023-12-18T21:05:08Z

I'm not too concerned about a single string either, but eliminating the need to escape the quotes can change from multiple strings to a single fixed string, and thus are much more concerning.

std::cout << "(" + someFunctionThatStoresStuffInTheDatabaseAndReturnsString() + ")$";

hsutter · 2023-12-21T17:38:42Z

For now the status quo is:

suffix $ which is working well
the only character that needs to be escaped inside a string interpolation is ", which isn't bad especially considering Greg's counterexamples, and that it is inside a string literal after all

I think that part is fine then until we get more new information.

For consecutive string literals: I see their absence as a good simplification in general. But it's true that sometimes it's useful to break a string literal across source lines, and interpolation does make for longer (if fewer) string literals. My thought there was to recognize it grammatically but require + between: "a string literal" + "another one" + "still another". My experience in Cpp1 today has been that we (at least I) still accidently write what looks like string literal addition pretty regularly when concatenating a long string made up of various parts, and it works just fine when a string object precedes it:

// ok in Cpp1
f( my_string + "a string literal" + "another one" + "still another" );

// error in Cpp1 - can't add char[K] types
f ( "a string literal" + "another one" + "still another" );

and the usual fix is to write std::string{"a string literal"} + "another one" + "still another". So lowering of binary_expression could just inject std::string{ and } around the first literal when it sees a sequence of string-literal + string_literal + string_literal.

That's been my thought anyway. How does that strike you?

gregmarr · 2023-12-21T18:23:11Z

For consecutive string literals: I see their absence as a good simplification in general. But it's true that sometimes it's useful to break a string literal across source lines, and interpolation does make for longer (if fewer) string literals. My thought there was to recognize it grammatically but require + between: "a string literal" + "another one" + "still another".

So lowering of binary_expression could just inject std::string{ and } around the first literal when it sees a sequence of string-literal + string_literal + string_literal.

std::string{"a string literal"} + "another one" + "still another".

This is a runtime concat.

To preserve the current behavior, you would need to convert it to this compile-time concat.

std::string{"a string literal" "another one" "still another"};

Something to watch out for however is that cpp1 supports putting a literal operator on the end of the final string literal, since the strings are combined before tokenization:

#include <string>
#include <iostream>

using namespace std::literals::string_literals;

int main()
{
    auto str = "foo" "bar" "baz"s;
    std::cout << str;
}

JohelEGP · 2023-12-25T15:42:24Z

I'm not too concerned about a single string either, but eliminating the need to escape the quotes can change from multiple strings to a single fixed string, and thus are much more concerning.
std::cout << "(" + someFunctionThatStoresStuffInTheDatabaseAndReturnsString() + ")$";

An interpolation with " could be required to be in a raw string literal.

JohelEGP added the bug Something isn't working label Dec 1, 2023

JohelEGP changed the title ~~[BUG] Missing escape of char literal in UFCS in interpolation~~ [BUG] Missing escape of char literal in interpolation Dec 6, 2023

hsutter closed this as completed in dc7e10c Dec 18, 2023

JohelEGP mentioned this issue Jan 4, 2024

[BUG] Can't interpolate captured function call #838

Open

gregmarr mentioned this issue Mar 18, 2024

[SUGGESTION] Simplify how lists separators are used to simplify toolability (code generation and parsing) #115

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Missing escape of `char` literal in interpolation #861

[BUG] Missing escape of `char` literal in interpolation #861

JohelEGP commented Dec 1, 2023

JohelEGP commented Dec 1, 2023

gregmarr commented Dec 1, 2023

JohelEGP commented Dec 1, 2023

gregmarr commented Dec 1, 2023 •

edited

Loading

hsutter commented Dec 15, 2023

gregmarr commented Dec 15, 2023 •

edited

Loading

hsutter commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

hsutter commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023 •

edited

Loading

hsutter commented Dec 18, 2023

hsutter commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023 •

edited

Loading

hsutter commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023

gregmarr commented Dec 18, 2023 •

edited

Loading

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023 •

edited

Loading

JohelEGP commented Dec 18, 2023

hsutter commented Dec 18, 2023

gregmarr commented Dec 18, 2023

hsutter commented Dec 21, 2023

gregmarr commented Dec 21, 2023 •

edited

Loading

JohelEGP commented Dec 25, 2023

[BUG] Missing escape of char literal in interpolation #861

[BUG] Missing escape of char literal in interpolation #861

Comments

JohelEGP commented Dec 1, 2023

JohelEGP commented Dec 1, 2023

gregmarr commented Dec 1, 2023

JohelEGP commented Dec 1, 2023

gregmarr commented Dec 1, 2023 • edited Loading

hsutter commented Dec 15, 2023

gregmarr commented Dec 15, 2023 • edited Loading

hsutter commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

hsutter commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023 • edited Loading

hsutter commented Dec 18, 2023

hsutter commented Dec 18, 2023 • edited Loading

gregmarr commented Dec 18, 2023 • edited Loading

hsutter commented Dec 18, 2023 • edited Loading

gregmarr commented Dec 18, 2023

gregmarr commented Dec 18, 2023 • edited Loading

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023

JohelEGP commented Dec 18, 2023

gregmarr commented Dec 18, 2023 • edited Loading

JohelEGP commented Dec 18, 2023

hsutter commented Dec 18, 2023

gregmarr commented Dec 18, 2023

hsutter commented Dec 21, 2023

gregmarr commented Dec 21, 2023 • edited Loading

JohelEGP commented Dec 25, 2023

[BUG] Missing escape of `char` literal in interpolation #861

[BUG] Missing escape of `char` literal in interpolation #861

gregmarr commented Dec 1, 2023 •

edited

Loading

gregmarr commented Dec 15, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023 •

edited

Loading

hsutter commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023 •

edited

Loading

hsutter commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 18, 2023 •

edited

Loading

gregmarr commented Dec 21, 2023 •

edited

Loading