String Interpolation #528

TimWhiting opened this issue May 21, 2024 · 34 comments

@TimWhiting
Collaborator

TimWhiting commented May 21, 2024

Just trying to get some ideas out of my head and put them down somewhere, so this is just a general proposal for string interpolation and tagged strings, which exist in several languages, notably Python's f-strings: f"{something}".

Koka has the advantage of it potentially being more user configurable and customizable due to the name based overloading we have.

Along the lines of #527, I think it would be good to allow prefixed string literals (commonly called tagged string literals) that let users customize string interpolation.

For example a debug interpolator:

val x = Something()
debug"{x} is cool!"

fun debug/string-interpolate-start(s: string) : string-builder
  s.builder

fun debug/string-interpolate-value(sb: string-builder, a: a, ?debug-show: (a) -> e string): e string-builder
  sb ++ a.debug-show()

fun debug/string-interpolate-string(sb: string-builder, str: string): string-builder
  sb ++ str

fun debug/string-interpolate-finish(sb: string-builder): string
  sb.build

Which could be different than another interpolator.

I think we also need a string-builder / buffer type that allocates more memory than needed, and can add or concatenate efficiently, including optimizations for the C backend that don't create an intermediate koka string for concatenating (or at least makes them constant or static, including a static reference count).

The desugaring of string interpolation would then start by calling tag/string-interpolate-start() on the leading string (or on a simple "" if there is no string prior to the first interpolation), and then continue to call tag/string-interpolate-value(intermediate, value) or tag/string-interpolate-string(intermediate, str) on the subsequent pieces (relying on overloading based on types for different values), with a tag/string-interpolate-finish(intermediate) call at the end to convert the intermediate value back to a string. I'm not sure if the intermediate type should always be a string-builder, or if we just desugar it and expect the designer of each interpolator to make sure that it type checks correctly. I'm inclined towards the more flexible option, so that the intermediate could be a rope or string-buffer or any other data structure. This way the only change to Koka is the desugaring in the parser.
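The desugaring steps above can be modeled outside Koka. The following is a minimal sketch in Python (not Koka, since the proposed syntax does not exist yet). The `DebugInterpolator` class, its method names, the `Lit` marker, and the `desugar` helper are all hypothetical stand-ins for the `debug/string-interpolate-*` functions, and a plain list of chunks stands in for the string-builder.

```python
from typing import Any, List


class Lit:
    """Marks a literal string piece, as opposed to an interpolated value."""

    def __init__(self, text: str) -> None:
        self.text = text


class DebugInterpolator:
    """Hypothetical model of the debug/string-interpolate-* functions."""

    def start(self, s: str) -> List[str]:
        # The list of chunks stands in for a string-builder.
        return [s]

    def value(self, sb: List[str], a: Any) -> List[str]:
        # repr() plays the role of the implicit ?debug-show parameter.
        sb.append(repr(a))
        return sb

    def string(self, sb: List[str], s: str) -> List[str]:
        sb.append(s)
        return sb

    def finish(self, sb: List[str]) -> str:
        return "".join(sb)


def desugar(tag: DebugInterpolator, pieces: List[Any]) -> str:
    """Apply start, then one call per remaining piece, then finish."""
    if pieces and isinstance(pieces[0], Lit):
        sb = tag.start(pieces[0].text)
        rest = pieces[1:]
    else:
        # No literal before the first interpolation: start from "".
        sb = tag.start("")
        rest = pieces
    for piece in rest:
        if isinstance(piece, Lit):
            sb = tag.string(sb, piece.text)
        else:
            sb = tag.value(sb, piece)
    return tag.finish(sb)


# Models debug"{x} is cool!" with x = 42.
result = desugar(DebugInterpolator(), [42, Lit(" is cool!")])
print(result)
```

Because the intermediate value is only touched through the four methods, swapping the list for a rope or buffer would not change `desugar`, which is the flexibility argued for above.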

There also might be a use for not automatically finishing the literal (such as when you then want to build more pieces onto it, but not in the same expression), or if you want to reuse the buffers directly for file or network I/O instead of creating an intermediate string. I don't have any specific ideas on what the default would be or how this could be configured.

I think we should start with disallowing mixing raw strings with interpolation.

As far as the syntax itself, there are all sorts of common syntaxes.
Typically there is a start character (e.g. #, $, `, @, %, or \), followed by delimiters of some sort such as {}, with any expression allowed in the middle. Dart also allows omitting the delimiters for simple identifiers (the interpolation ends at the first non-identifier character). Of course it is an error if the identifier is not in scope. Other languages opt for no start character and only require delimiters. I prefer having the option of delimiters for longer expressions and a start character for identifiers.

It is preferable to use a start character that doesn't occur often in strings, since every literal occurrence has to be escaped. It is hard to know what choice that should be, and different situations might have different needs. However, ` seems to me like a less commonly used character in strings. I can see #, $, and % often being used when interpolating number values, and @ is common in emails and mentions. Swift uses \. It might be worth allowing different interpolators to define their own escape character, similar to the infix operator notation Koka supports, but it also might be worth restricting it to a specific one or a fixed set. In particular, making this configurable would be terrible for syntax highlighting and grammars for IDEs (though many IDEs also support semantic highlighting via a language server, which can add highlighting that cannot be determined by limited syntax rules).

UPDATE:

A consensus has sort of evolved among the participants of the discussion that no particular set of delimiters works really well, so a more drastic if not simpler proposal evolved that we just allow adjacent expressions starting with a tagged string.

With this change the above example changes as follows:

debug"" x " is cool!"

We also discussed that & looks nice when referring to a single identifier as it reminds of taking a reference to a variable (i.e. referring to it). So an alternate look at the example is this:

debug"&x is cool!"

You could add a literal '&' by escaping it with a backslash.

Formatting (such as padding or precision specifiers), I argue, should be implemented as part of the overloading, but this requires that we take just the second-to-last part of the local qualifier as the "tag" so we don't have clashing names. Or we can just require explicit function calls / transformation into strings, which I personally think is fine.

@TimWhiting
Collaborator Author

One advantage of this approach is it doesn't do any dynamic dispatch or anything special, it is just an extension of the current implicits and static overloading. We could even allow multiple parameters and have format specifiers for number precision etc.

@chtenb
Contributor

chtenb commented May 22, 2024

In response to the starting character vs delimiter, when choosing { as delimiter you have a character that doesn't occur naturally in strings much. I agree that most of the starting character candidates are more common.
I personally have more experience with C# and Python languages, which both use { as delimiters and no starting character. Both these languages require you to double them like {{ when you want a literal brace. I have good experiences with this design, except when generating C# code, where it is a somewhat annoying choice, since C# syntax involves many braces.
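For reference, this is how the brace-doubling rule plays out in Python's f-strings. The small sketch below uses made-up variable names, and the last line shows why generating brace-heavy code such as C# gets noisy.

```python
name = "world"

# Single braces interpolate an expression.
greeting = f"hello {name}"

# Doubled braces produce literal braces.
literal = f"{{name}}"

# Generating brace-heavy code needs doubling everywhere except the hole.
csharp = f"class {name.title()} {{ void Run() {{ }} }}"

print(greeting)
print(literal)
print(csharp)
```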

@TimWhiting
Collaborator Author

I like the idea of no starting character, and I do think braces are a good choice, but I wonder if we could still have a shorter ` for times where you are just using an identifier. \ doesn't make sense in those situations, since \n could mean either a newline or an interpolation of the variable named n.

Here is what that might look like:

val err = Error("problem")
"Result: {match err { Ok() -> ""; Error(err) -> "Error! `err"}}"

With proper syntax highlighting or maybe indenting the whole match it might look good, but it seems a bit strange to me especially since blocks in Koka use } so many interpolated expressions might end in }} which almost seems like an escape rather than an end of block and then escape from interpolation. Simple expressions like abc + 1 might look better.

@chtenb
Contributor

chtenb commented May 22, 2024

I see what you're getting at. In the context of string interpolation you are more likely to want to put everything on a single line, increasing the chance that braces are needed to delimit blocks. That indeed may conflict visually with the interpolation syntax. A small mitigation might be to escape { using a backslash \{ instead of {{, such that the end of your example }} does not look like an escaped closing brace anymore.

@kuchta

kuchta commented May 22, 2024

Using escaped braces like \{ and \} or prefix characters solves just the syntactic problem, but not the visual one. What about using < and > for that? People already somehow associate them with markup, and they are the only readily accessible symbols (on the keyboard), apart from the heavily used ones ({}, (), []), that have some "pairable" characteristics.

@TimWhiting
Collaborator Author

TimWhiting commented May 22, 2024

val err = Error("problem")
"Result: <match err { Ok() -> ""; Error(err) -> "Error! `err"}>"
val err = Error("problem")
"Result: |match err { Ok() -> ""; Error(err) -> "Error! `err"}|"

The problem with <> is that it looks like markup, but feels kind of reversed.

| maybe works better and is also not used much in strings, but it might cause parsing issues since it could appear in the middle of an expression as an or, and it is not pairable.

For simple identifiers an & might look good. Reminds me of taking a reference to something, which is kind of similar.

val err = Error("problem")
"Result: <match err { Ok() -> ""; Error(err) -> "Error! &err"}>"

And using it as a start character might not be too bad.

val err = Error("problem")
"Result: &(match err { Ok() -> ""; Error(err) -> "Error! &err"})"

@kuchta

kuchta commented May 22, 2024

Well, I think the fewer characters used, the better. () are also quite heavily used in such contexts. <> feels like a substitution, like the <body> and <expr> used in the documentation, so if only a value is used, it resembles the language used there. For expressions it could look outlandish at first, but I think that's just because we are not used to it...

@chtenb
Contributor

chtenb commented May 22, 2024

<> is not a bad idea, since their usage in type signatures is not likely to come up in string interpolation contexts, solving the visual problem that {} has with code blocks.
However, if you were to do html or xml generation (which is a pretty common thing to want to generate), the choice of both <> and & suddenly becomes very cumbersome.

What about making it configurable per interpolator? This would allow the user to optimize for whatever kind of strings they are generating, and make this feature very helpful for embedded template languages.

@TimWhiting
Collaborator Author

TimWhiting commented May 22, 2024

@chtenb I like the idea, maybe we have a definition like:

// delimit-start, delimit-end, simple-identifier 
pub interpolator debug [<,>,$] 
// you can omit allowing simple identifiers, and you can have multiple characters for starting delimiter
pub interpolator html [${,}] 

Of course this means that these definitions need to be at the top of files like infix declarations since we might want to resolve this prior to parsing. Though for infix operators we transform into an intermediate representation and resolve after parsing.

@kuchta @chtenb
Curious what you think about inverting the angle brackets. I kind of think it might be pretty nice (it's like saying insert >here<), and it stands out.

val err = Error("problem")
"Result: >match err { Ok() -> ""; Error(err) -> "Error! &err"}<"

Of course then what would a DSL for HTML look like in Koka?

html">div(inner=html">div(text="Hi")<")<"

or maybe a bit better formatted.

html">div(
  html">div(
    text="Hi"
   )<"
)<"

At that point we almost need to auto-infer which prefix tag to use for plain strings "" when a string is passed into a function needing a particular type:
e.g.

fun div(inner: maybe<html-builder> = Nothing, text:string = "")

html">div(
  ">div(
    text="Hi"
   )<"
)<"

// Non inverted
html"<div(
  "<div(
    text="Hi"
   )>"
)>"

Obviously this looks really nice the non-inverted way with html, but I personally think it looks really nice the other way for more general expressions.

Additionally, you can argue that you really don't want to be doing this with strings anyway for HTML: there is no static string in those examples, just nested div calls. So you can omit the string interpolation and have the following API, which would work for generating strings or ASTs.

div(
  div(
    text="Hi"
  ))

The one difficulty about this API is that you really don't want to build up the subpieces of the tree and then have a bunch of string appends. You'd rather generate from the outside in and append directly to a string-builder. You could create an intermediate AST, but that wastes time.

Or the vector api in #527 sort of supports the above already via the spread api:
html["div", ...html["div", "Hi"]]
Where the add-item adds a tag and add-items could add children to the last tag.

@kuchta

kuchta commented May 22, 2024

I quite often think of Koka as one of the best languages for the web, because it has a compatible syntax that allows dashes in identifiers. I imagine a future where I can write JSX-like expressions in it. Writing HTML in a template language (even tagged strings) never felt very pleasant to me, and it would miss a lot of opportunities the React world has already realized...

@TimWhiting
Collaborator Author

TimWhiting commented May 22, 2024

With configurable delimiters you could even do lisp/scheme style quoting. :)

scheme"(eval ,(list.map(do-something)))"

@kuchta

kuchta commented May 22, 2024

Yes, I think using configurable delimiters is probably the best way to go about it. I was also thinking about inverted parentheses 🙌🏻, but they could probably distract visually even more, since we are used to interpreting them in a certain way (I mean in contexts where there would also be non-inverted ones).

@chtenb
Contributor

chtenb commented May 22, 2024

With configurable delimiters it would probably be wise to have the escape character \ be fixed for all interpolators? Intuitively I think that would keep things more sane than using repeated delimiters as a means of escaping.

@kuchta

kuchta commented May 22, 2024

Regarding current syntax, it should be even possible to write something like this, right?

fun some-component(attr1="default", attr2="default", attr3="default", children={})
   div(attr1=attr1, attr2=attr2)
      span(attr3=attr3)
          children()
      other-component()

With all nested blocks treated as trailing lambdas.

IMHO this is vastly superior syntax to something like JSX. It maps quite nicely to HTML, but doesn't suffer from having to close the tags...

I'm just not sure whether named parameters have to come last, and what about the trailing lambda? The documentation doesn't show how to write a function consuming one...

@kuchta

kuchta commented May 22, 2024

@TimWhiting: "You could create an intermediate AST, but that wastes time." - Not if that's what you'll need to update the DOM on the client...

@TimWhiting
Collaborator Author

TimWhiting commented May 22, 2024

@kuchta Of course, I was thinking about string interpolation specific to this issue, not saying that AST is bad, but one heavily used use case for advanced string interpolation would be a server side renderer, and especially if you do not use the ast in any way and just plan to convert it to a string, it seems a bit wasteful.

Yes, your syntax with trailing lambdas would be a great way to build an AST; unfortunately we need #491. Currently named parameters have to come last, including after trailing lambdas (since they just get desugared, I think, to the last parameter). With the change in the PR you could make the trailing lambda a positional argument. Alternatively we could adjust the desugarer to put trailing lambdas after all positional arguments, but before named ones.

@kuchta

kuchta commented May 22, 2024

@TimWhiting Exactly, and those advanced server-side renderers might want to have features like React Server Components (RSC), for which some form of (build-time) code transformation (compilation) would probably be needed anyway.

Aren't trailing lambdas always a positional argument if, as you say, they must come before named ones? But then they can't have a default value. That's unfortunate...

@TimWhiting
Collaborator Author

TimWhiting commented May 22, 2024

Back to the issue though: I realized a major flaw with allowing user configurable interpolation delimiters.

Due to nested strings / interpolation, you have to resolve this at lexing time, otherwise you cannot find the end of the string! This means we would not be able to lex & parse in parallel, and would need to lex just the imports, later lexing the rest of the body after we know the delimiters to use for any prefixed string. This is not only complex, but a lot more work than the original proposal which would be able to be desugared directly in the parser.
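To make the lexing problem concrete, here is a minimal sketch in Python of a scanner that looks for the end of a string literal. The functions `string_end` and `skip_hole` are hypothetical, and the sketch ignores escapes and nested blocks inside a hole. The point is only that the same source text ends at a different position depending on which delimiters the lexer assumes.

```python
def string_end(src: str, open_d: str, close_d: str) -> int:
    """Return the index of the quote that closes the string starting at src[0].

    Quotes inside an interpolation hole belong to nested strings,
    so the scanner has to know the delimiters in order to skip them.
    """
    assert src[0] == '"'
    i = 1
    while i < len(src):
        if src.startswith(open_d, i):
            i = skip_hole(src, i + len(open_d), open_d, close_d)
        elif src[i] == '"':
            return i
        else:
            i += 1
    raise ValueError("unterminated string")


def skip_hole(src: str, i: int, open_d: str, close_d: str) -> int:
    """Skip an interpolated expression; return the index just past close_d."""
    while i < len(src):
        if src.startswith(close_d, i):
            return i + len(close_d)
        if src[i] == '"':
            # Nested string literal inside the hole: jump past its closing quote.
            i = i + string_end(src[i:], open_d, close_d) + 1
        else:
            i += 1
    raise ValueError("unterminated interpolation")


src = '"a {"b"} c" + x'
brace_end = string_end(src, "{", "}")
angle_end = string_end(src, "<", ">")
print(src[: brace_end + 1])
print(src[: angle_end + 1])
```

With `{}` as delimiters the literal is `"a {"b"} c"`; with `<>` the lexer stops at `"a {"` and the rest is tokenized as something else entirely, which is why the delimiters must be known before lexing the body.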

As much as I would like to see user configurable delimiters, it seems like at least for now we would need to settle on what to use, though ultimately the decision rests with Daan. It seems like most of us are interested in trying out <> and maybe a for identifiers? Though with good colored highlighting we might be open to {}. And ` for escaping delimiters or the identifier interpolator.

Daan has more important issues he would like to work on first I think (specifically a robust async library, http/s tcp and other I/O).

@chtenb
Contributor

chtenb commented May 23, 2024

Due to nested strings / interpolation, you have to resolve this at lexing time, otherwise you cannot find the end of the string!

Yeah, not surprising :) The grammar essentially becomes configurable using a language construct.

Perhaps instead of ` the & could also be considered. To me & makes the interpolation visually easier to read, and the reminiscence of taking a reference is indeed a nice coincidence. I don't think & comes up more often as a literal in interpolated strings than backticks (outside html).

@kuchta

kuchta commented May 23, 2024

I thought it wouldn't be so easy, since practically nobody is using it. I will leave here some prior art that led me to angle brackets. Unix man page syntax was probably the first place I encountered them, and to this day most (not just) Unix commands use them as placeholders for required arguments.

@TimWhiting
Collaborator Author

TimWhiting commented May 23, 2024

@kuchta By the way #533 was just merged, and Daan is planning on releasing a new version of Koka soon, so you can add named arguments when using trailing lambdas.

@kuchta

kuchta commented May 23, 2024

@TimWhiting Wow, I'm really looking forward to it. 🤗 Yesterday I found out that my koka installation is quite outdated, since homebrew channel is probably no longer maintained...

@TimWhiting
Collaborator Author

TimWhiting commented May 23, 2024

So here is a radical idea: just don't use delimiters (if we need a delimiter, why not use the normal delimiter "?). Then an interpolation is just a sequence of expressions beginning with a tagged string, with or without spaces in between.

f"Result: " match err { Ok() -> ""; Error(err) -> "Error! &err" } ""

html"<" div(
  html"<" div(
    text="Hi"
   ) ">"
) ">"

scheme"(eval " list.map(do-something) ")"

Maybe an auto-formatter with some basic rules could make this look nice.

I think it would still be good to have a rule for desugaring that nested strings in interpolation inherit the same tag as their parent, unless specified otherwise.

html"<" div(
  "<" div(
    text="Hi"
   ) ">"
) ">"

I realize the html example isn't necessarily the best example, especially since I just realized I messed up the syntax anyways, but it illustrates the point still, and gives something to consider when talking about indentation / formatting.

This is not a totally crazy idea. Dart has 'adjacent string literals', which let you split string literals across multiple lines for better readability and to avoid super long lines; they are basically implicit concatenation. Dart differs in that it also has 'normal' interpolation.

I'll clarify that I'd still like to see "an identifier &ident is cool" for simple concatenation.
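Python has the same kind of implicit concatenation of adjacent string literals, and it also lets plain and interpolated literals sit next to each other. The short sketch below, with made-up variable names, shows the existing behavior that the proposal would generalize from literals to arbitrary expressions.

```python
# Adjacent literals are joined into one string, which helps with long lines.
message = (
    "first part, "
    "second part, "
    "third part"
)

# Plain literals and f-strings can be mixed in one adjacent sequence.
x = 3
mixed = "value: " f"{x}" "!"

print(message)
print(mixed)
```

Unlike the proposal, only literals can be adjacent here; a bare expression between two strings is a syntax error in Python.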

@chtenb
Contributor

chtenb commented May 23, 2024

Interesting. This makes me think of function application in Haskell and PureScript, as you pass a bunch of primitive values into a function without using parentheses and commas. In fact, PureScript does not have special string interpolation syntax, and the function i from https://pursuit.purescript.org/packages/purescript-interpolate/5.0.2 is commonly used instead.

In your example, html is a function that takes n parameters, except using whitespace to delimit arguments instead of commas like normal Koka functions. I wonder if this idea generalizes to an alternative function call syntax + variable length parameter lists.

@kuchta

kuchta commented May 23, 2024

Why not use join right away? 🙂

[ "Result: " match err { Ok() -> ""; Error(err) -> "Error! &err" } ">" ].join/concat/...

But I like it. It's general and minimal, koka style 🕺🏼
BTW, making commas at least optional would also be great. IMHO they are superfluous most of the time, and if not, there are always parentheses to the rescue 🛟 Or maybe I'm missing something, but they are often a source of problems, probably due to their low visibility. One less reason to argue about whether there should be trailing comma variants or not 🙂

@TimWhiting
Collaborator Author

TimWhiting commented May 23, 2024

The main difference is that it is not an n-parameter function, so you don't have to worry about unification with different sized lambdas; instead it desugars to n function calls to interpolate-begin, interpolate-value, interpolate-string and interpolate-end, or whatever the names end up being. Ideally most of them get inlined due to being simple. This feels almost akin to C's varargs interface, but less manual and desugared at the application site into n function calls (which can use overloading of functions for type determination) instead of reflecting inside the function in a type-unsafe way. (For a varargs example see: https://stackoverflow.com/questions/15784729/an-example-of-use-of-varargs-in-c)

Koka already has parameters separated by whitespace (trailing lambda arguments). But they are clearly delimited by indentation and the fn keyword or braces for an anonymous function with no arguments {}.
For interpolation I don't mind omitting commas, but in general I find they make things more readable.

@kuchta
Part of my design was to introduce intermediate builders to make this more efficient than just building a list and then joining.

By the way:

[ "Result: " match err { Ok() -> ""; Error(err) -> "Error! &err" } ">" ].join/concat/...

Would not be possible for the general ?debug example at the top where you mix types, it would be weird for arrays to allow mixed types like this in just special situations.

This is how it would look with using " as our 'non-delimiter'.

debug"Result: " err "!"

fun general/debug/string-interp-value(sb: string-builder, value: a, ?debug: (a) -> string): string-builder
  sb ++ ?debug(value)
  
fun error/debug/string-interp-value(sb: string-builder, value: error<a>): string-builder
  match value
    Ok -> sb
    Error(err) -> sb ++ "Error! &err"

Of course if two overloads could match we might want some way of distinguishing which one to use. Either we need #531 or we could allow something strange.

debug"Result: " err>general "!"
debug"Result: " general>err "!"
debug"Result: " general/(err) "!"

@kuchta

kuchta commented May 23, 2024

@TimWhiting I don't know if I would call it argument separation by whitespace if there can be just one trailing argument, but making them optional would leave that decision to the author...

I see where you are heading, Tim. Yes, it definitely has its uses...

@TimWhiting
Collaborator Author

TimWhiting commented May 23, 2024

You can actually have multiple trailing arguments:

while { n > 1 } 
  ...

gets desugared to

while(fn() n > 1, fn() ...)

Or slightly elongated, and with explicit

fun main()
  var n := 0
  while { 
    n > 1
    } fn()
      print("Enter a number: ")
      n := n + 1

Or

fun main()
  var n := 0
  while { 
    n > 1
    } {
      print("Enter a number: ")
      n := n + 1
    }
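The desugaring of `while` into a call with two anonymous functions can be mimicked in any language with closures. Below is a minimal Python sketch; `while_` and `step` are hypothetical names, and the loop condition is changed to `n < 3` so that the body actually runs.

```python
from typing import Callable, List


def while_(predicate: Callable[[], bool], action: Callable[[], None]) -> None:
    """Hypothetical stand-in for while(fn() ..., fn() ...)."""
    # Both arguments are thunks, so the condition is re-evaluated on every pass.
    while predicate():
        action()


n = 0
log: List[int] = []


def step() -> None:
    """Loop body: record the current counter, then increment it."""
    global n
    log.append(n)
    n += 1


# Corresponds to two trailing arguments: the condition block and the body block.
while_(lambda: n < 3, step)
print(log)
```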

@kuchta

kuchta commented May 23, 2024

Oh, true... You are right. But wouldn't it then be more consistent to allow even non-trailing arguments to be also separated by whitespace?

All separated by whitespace, with trailing arguments delimited by indentation and non-trailing ones by parentheses...

@TimWhiting
Collaborator Author

TimWhiting commented May 23, 2024

Whitespace separation is hard to do for general arguments, especially prior to type checking and with operators:

For example

dosomething a > b

Does this mean dosomething(a,>,b) or dosomething(a > b)?
Of course you could put parentheses around the terms but does this look better:

dosomething (a > b) (c < d)
// or 
dosomething(a > b, c < d)

I guess we could allow both, but the error messages would have to be really good to help people find where to put the parentheses they thought they didn't need. And maybe a more extended discussion on this topic should go somewhere besides the string interpolation issue. Trailing lambdas don't have this problem because they must start with fn or {, and then have a clearly delimited block or indentation scope.

For string interpolation we can get around this issue by requiring a "" between non-string adjacent parts, or because there is a clearer expectation to delimit individual interpolated parts with (), or eagerly parsing as much as can be determined to be an expression - which if we allow whitespace separated function arguments might get really confusing quickly.

@chtenb
Contributor

chtenb commented May 24, 2024

This conversation leads me to another possible way to think about it. We'd like interpolation to be flexible, both syntactically and semantically, but we also want to keep the language grammar simple and be able to desugar it early on in the compilation process.
This makes me think of a macro system. I'm not a fan of arbitrary textual rewrite macros, like the C preprocessor, but perhaps there is some middle ground where a macro would act on expression tokens instead of raw text or something.
I haven't fully thought this out, but it might be an interesting angle to investigate. Maybe this would unify a broader set of desugaring features under a single umbrella.

@chtenb
Contributor

chtenb commented May 24, 2024

Something we haven't discussed in the context of string interpolation is formatting, where you specify the formatting of arguments via a format specifier. For an example, see https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/interpolated
For a description of the format-string minilanguage in C#, see https://learn.microsoft.com/en-us/dotnet/standard/base-types/formatting-types

@TimWhiting
Collaborator Author

TimWhiting commented May 24, 2024

I think formatting is the easy bit.

fmt"Result: " err "!"

value struct fmt-prec<a>
    v: a
    precision: int

fun fmt/string-interp-value(sb: string-builder, value: a, ?show: (a) -> string): string-builder
  sb ++ ?show(value)

fun prec/fmt/string-interp-value(sb: string-builder, value: fmt-prec<float64>): string-builder
  sb ++ value.v.show(precision=value.precision)

fun prec(v: float64, precision: int): fmt-prec<float64>
  Fmt-prec(v,precision)

"Here is a precise " d.prec(10) " floating point value"

Since you can overload the string-interp-value function and Koka picks the one that requires the fewest implicits, it will use the formatter for precision. I know, not as short as other solutions, but arguably more developer friendly and discoverable due to autocompletion and hovering / documentation.
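The wrapper-plus-overload idea can be approximated in Python with `functools.singledispatch`. This is only a rough model: Python picks the implementation dynamically at run time, whereas the Koka proposal resolves the overload statically. `FmtPrec` and `interp_value` are hypothetical names mirroring `fmt-prec` and `string-interp-value`, and a list of chunks stands in for the string-builder.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


@dataclass
class FmtPrec:
    """Hypothetical counterpart of the fmt-prec wrapper struct."""
    v: float
    precision: int


@singledispatch
def interp_value(value: object, sb: List[str]) -> List[str]:
    # General case: fall back to str(), like the ?show implicit.
    sb.append(str(value))
    return sb


@interp_value.register(FmtPrec)
def _(value: FmtPrec, sb: List[str]) -> List[str]:
    # More specific case: format with the requested number of decimals.
    sb.append(f"{value.v:.{value.precision}f}")
    return sb


d = 3.14159
sb: List[str] = ["Here is a precise "]
interp_value(FmtPrec(d, 2), sb)
sb.append(" floating point value, and the plain one: ")
interp_value(d, sb)
result = "".join(sb)
print(result)
```

The call site stays explicit about formatting (wrap the value, then interpolate), which matches the discoverability argument made above.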

As far as metaprogramming, let's discuss that in a new issue #536

@kuchta

kuchta commented May 26, 2024

dosomething (a > b) (c < d)
// or 
dosomething(a > b, c < d)

@TimWhiting I haven't commented on this, because everything is already said in the next paragraph, maybe except one thing. If you put it this way, the second example is definitely more natural, but being able to use both syntaxes would be great for DSL like shell, which I'm quite interested in....
