Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation update on lexical lookup via quasis/injectiles, etc #410

Closed
masak opened this issue Oct 22, 2018 · 4 comments
Closed

Implementation update on lexical lookup via quasis/injectiles, etc #410

masak opened this issue Oct 22, 2018 · 4 comments

Comments

@masak
Copy link
Owner

masak commented Oct 22, 2018

This is probably something I should publish as a blog post on strangelyconsistent. For now, there's time only to write it up here as a self-closing issue.

Who knows, maybe there's some interested reader out there for whom this is useful. The main use is for me to info-dump what I feel is a first complete, workable solution to hygiene in 007 and Perl 6. (As a consequence, I won't pull any punches. If the below sounds like complete gobbledygook, assume that's me, not you. But maybe see the examples at the end.)

I should also link to this gist because it contains the germs of the thinking that's outlined here. Think of the gist as the exploration part and this issue as its final distillation.

Unique variables

In order to explain the solution, let's introduce a non-established term: unique variables.

In general, each variable in the source code is declared somewhere, and is then used a number of times:

my foo = ...;     # declaration
# ...
...foo...;        # usage (read)
# ...
foo = ...;        # usage (write)

Without any further information, we don't actually know whether the variable foo corresponds to one single allocated location in memory, or several. Here are a few reasons it might be several:

  • The surrounding block might be a loop. Each iteration of the loop, in the general case, needs fresh locations for all its variable declarations.
  • The surrounding block might be a routine. Each time the function is called, it needs fresh locations for all its variable and parameter declarations.
  • As a special case of the previous point, pointy blocks also have parameters.

But there are also a few well-known cases where a variable declaration corresponds exactly to one single location — let's call such a variable (both its declaration and its usages) a unique variable. Here are some examples of variables that turn out to be unique:

  • The built-ins (kind of implicit, but it fits the definition otherwise).
  • The mainline code of the main program.
  • The mainline code of any loaded module (because modules are singletons, this holds true even if a module is imported multiple times in the code base; its mainline is guaranteed to only run once).
  • Any code nested within such a mainline but without a pointy-block parameter — kind of "honorary mainline variables" — but here we're sort of hitting diminishing returns...
  • (Most important for this issue and the implementation it proposes) A macro as seen from inside its quasi blocks.

In all of these cases, we can start to think about optimizations where we do the lookup at compile time, and replace the runtime lookup with just the unique location. (If we can infer that all we're ever doing is reading form that location, we can further optimize reading from the location down to just its resulting constant value.)

Why are macro variables unique?

The last point in the preceding list is the odd one out, so it bears pausing for a bit and motivating that one. Macros are routines just like functions, so on the face of it, they should sort into the top list, not the bottom one.

While it is true that a macro can be called multiple times, and that each such call will generate a fresh location for each of its variables, code generated in the macro will only ever see a fixed variable. Whenever the macro gets called again, it's another quasi being created; one does not simply declare the same variable in the same macro call twice. It might be related to the fact that the macro runs at compile time, so what's compile time for the injectile is actually runtime for the macro, even though that's... the same time.

macro moo() {
    my x = ...;
    quasi {
        ...x...;        # for *this* quasi, `x` refers to a *unique* location
    }
}

I can't stress enough how this "happy coincidence" feels both necessary and sufficient in some way. In the sense that it's been really tricky to see how to make macro hygiene realistic... but the fact that quasis in macros only ever see unique variables in their surrounding macros means that, by a cosmic coincidence, all the runtime lookups that would have been "detached" (in the terminology of the gist) can instead be optimized away.

I dunno, YMMV, but to me it seems like quite a wonderful generalization. Who knew global variables and macro "closure" variables had this trait in common?

Implementation details

It gets cuter. Let's assume that it's possible both to uniquify a variable (to turn all its runtime lookups into global location accesses), and to de-uniquify it (to turn all its location accesses back into runtime lookups). Enough information needs to be stored in the location object itself to be able to restore it that way.

Two points in time are of interest during macro expansion:

  1. When we reach the line with the quasi (during the macro call) the quasi is interpolated, that is, for the first time it gets rendered as concrete code (a Qtree).
    • All the variables that are both defined and used inside the quasi block are left alone. (These are subject to normal lexical variable semantics inside the eventual injectile.)
    • All the variables that are defined outside the quasi block (maybe in the macro, maybe further out) are uniquified.
    • All the variables that are spliced into the quasi through unquotes are also left as they are. (If they come from macro arguments, they will find the corresponding declaration in the mainline. If they come from another Qtree, they will have already been uniqified.)
  2. At some point, the macro exits, we inject the quasi block into the mainline code (possibly after cloning it just to be safe), and
    • All the unique variables that are visible from that point in the mainline code are de-uniquified.

I think a good metaphor here is the freeze-drying of foodstuffs. We freeze-dry the food in order to transport it long distances or store it for a long time. When the food arrives at its destination, we can rehydrate it, returning it to its original fresh state.

In the case of the quasi, there are some variables that we know won't be lexically available from the mainline code. The two "lookup sites" involved form an inverted Y shape:

         builtins
         globals
            |
            |
            ^
           / \
quasi (1) /   \ (2) mainline

All the variables in the left branch of the Y shape will be unavailable from the point of view of (2), because they are no longer part of the sequence of OUTER blocks from (2). Preemptively, the quasi interpolation uniquifies all the variables used in the quasi but declared outside of it. (This includes the "stem" of the Y shape.) Later, the quasi injection de-uniquifies those variables declared in the stem of the Y.

Why? Because while uniquification is a "necessary evil" and something we need to do to get hygiene at all, non-unique variables are still preferable and more in line with the end user's intuition/expectations.

Some examples

my a = "mainline";
macro moo() {
    my a = "macro";
    return quasi { a };
}
say(moo());     # "macro" (because the macro's `a` has been uniquified)

my a = "mainline";
macro moo(x) {
    my a = "macro";
    return quasi { {{{x}}} };
}
a = a ~ " and mutable!";
say(moo(a));    # "mainline and mutable!" (because `a` from the mainline is being sent in
                # as a macro argument and spliced into the unquote; not uniquified)

BEGIN my a = "static mainline";
a = "mainline";
macro moo() {
    return quasi { a };
}
a = a ~ " and mutable!";
say(moo(a));    # "mainline and mutable!" (`a` is first uniqified and then de-uniquified;
                # if it weren't de-uniquified, the output would be "static mainline")

my a = "mainline";
macro moo(x) {
    return quasi { say(a, " and ", {{{x}}}) };
}
{
    my a = "block";
    moo(a);     # "mainline and block" (both `a` end up uniquified and then de-uniquified
                # but can correctly "see"/reach their definitions even though one shadows
                # the other)
}

my a = "mainline";
macro moo(x) {
    return quasi {
        my a = "quasi";
        say(a, " and ", {{{x}}});
    };
}
moo(a);         # "quasi and mainline" (the `a` being passed in has a lookup which
                # "leapfrogs" the lexical lookup of the `a` declared in the quasi)

my a = "mainline";
my moo;
{
    BEGIN my a = "static block";
    a = "block";
    macro car() {
        return quasi { a };
    }
    BEGIN moo = car;    # terribly realistic, this code
}
say(moo());     # "static block" (the uniqified variable is outside the macro, but still
                # not visible from the point of macro expansion; therefore not
                # de-uniquified)

# foo.007
BEGIN my a = "static module";
a = "module";
export macro moo() {
    return quasi { a };
}

# main.007
my a = "main program";
import { moo } from foo;
say(moo());     # "static module" (no lexical environment in common except for the
                # built-ins; the `a` is uniquified and not de-uniquified)

Hesitant addendum: with those last two examples, we might be able to get our cake and have it, too, getting "block" and "module" as outputs, respectively. A strong-enough "fixup mechanism" ought to be able to make those variables refer to their runtime locations.

@masak masak closed this as completed Oct 22, 2018
masak pushed a commit that referenced this issue Feb 17, 2019
Adding these from #410; if nothing else, this one is an interesting
regression test.
masak pushed a commit that referenced this issue Feb 19, 2019
Adding these from #410; if nothing else, this one is an interesting
regression test.
masak pushed a commit that referenced this issue Feb 19, 2019
Adding these from #410; if nothing else, this one is an interesting
regression test.
masak pushed a commit that referenced this issue Apr 4, 2019
This exploration, and the questions it raised, have been superseded by the
answers found in #410
@masak
Copy link
Owner Author

masak commented Sep 24, 2021

Just a quick note about terminology: "unique variable" was one of those terms I went through looking for the right term. It has its heart in the right place, but it didn't stick. What seems to have stuck (in the Alma source code at least) is the term "direct lookup" — as contrasted with "lexical lookup". In retrospect, I do like focusing on the lookup instead of the variable; it's the lookup that's different, after all.

@masak
Copy link
Owner Author

masak commented Oct 15, 2021

Holy carp! @vendethiel, you'd better take a look at this.

I'm reading Shutt's thesis. I confess I'm reading it very slowly, and the reason I'm reading it very slowly is that it can't be read quickly. At least I can't. The problem is not that it's opaque (rather the opposite, in fact); the problem is that it packs a lot of very precise information in each sentence.

And then I come across this, on page 111. For context, this is talking about Scheme's macro expansion hygiene, as laid down by R5RS.

The subtle heart of the hygienic algorithm is in what happens when a
symbol in the template is found to be unbound in the substitution macro-environment.
The unbound symbol is renamed to a gensym; the renaming is registered in the
substitution macro-environment, so that other occurrences of the same symbol in the
template will be renamed to the same gensym; and a binding is registered in the local
macro-environment from the gensym back to whatever value the original symbol was
mapped to in the static macro-environment of the macro.

This is a description of direct lookup. Scheme and Alma/Raku differ in the way that we've discussed in #567, but for this "subtle heart" part, those distinctions seem moot. There's a 1-to-1 correspondence here between "renamed to a gensym" and "turned into a direct lookup".

I mean, I think I knew that, at some level. But it still feels both shocking and validating to know that Scheme and Alma macro hygiene actually converge like this.

I still need to grok Scheme macro expansion. Probably I won't be able to rest until I've written at least one working implementation of it.

@vendethiel
Copy link
Collaborator

There's a 1-to-1 correspondence here between "renamed to a gensym" and "turned into a direct lookup".

That difference is declarative vs imperative. Scheme templates don't take arguments like Alma macros do, they just "match". The end result is the same, the way it's done is just different.

@masak
Copy link
Owner Author

masak commented Jan 30, 2023

This is a description of direct lookup. Scheme and Alma/Raku differ in the way that we've discussed in #567, but for this "subtle heart" part, those distinctions seem moot. There's a 1-to-1 correspondence here between "renamed to a gensym" and "turned into a direct lookup".

So, I came in here just now to make this exact same point again. The "cross-stage persistence" of #567 maybe adds an extra overtone to the macro hygiene chord, but the fundamental frequency is the same: that macro expansion can introduce new bindings, which are (need to be) subject to uniqueness/hygiene requirements. With cross-stage persistence (XSP), a new binding can come not just from within the quasi itself, but also from the quasi's surrounding lexical context.

Quoting @vendethiel:

That difference is declarative vs imperative. Scheme templates don't take arguments like Alma macros do, they just "match". The end result is the same, the way it's done is just different.

Turning this around in my mind, I'm not 100% sure whether Scheme's "macro by example" have cross-stage persistence or not. I think they might, actually. Now I need to find out.

(The litmus test would be something like this: I declare a Scheme macro that uses display, but the macro is declared in an environment where display has been lexically overridden to print everything in uppercase. Would a use of this macro in a different lexical environment (where display is not overridden) use the regular display, or the overridden one? I don't know if you can actually do it like that — maybe exported macros need to be top-level? — but surely the example can be adapted to do something similar enough. This I give myself as homework.)

I used to be more worried about the damage the "imperative" part of Raku/Alma macros could do to hygiene. Nowadays I'm more relaxed, because the algorithm described in the OP mostly acts "around" the macro, so the macro is free to do pretty much whatever it wants (including explicitly unhygienic things), and it will mesh pretty well with the hygiene algorithm.

I still need to grok Scheme macro expansion. Probably I won't be able to rest until I've written at least one working implementation of it.

This still holds. I should get a round tuit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants