Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GenIdent for anonymous instances #4096

Conversation

rhendric
Copy link
Member

@rhendric rhendric commented May 29, 2021

This fixes the limitation of the CoreFn renamer which prevented it from
renaming top-level GenIdents. As a consequence, we can now give unnamed
instances more idiomatic names and still guarantee that they will be
unique in their module.

Description of the change

With compliments to @JordanMartinez, follow-up from one of my suggestions on #4085.


Checklist:

  • Added the change to the changelog's "Unreleased" section with a reference to this PR (e.g. "- Made a change (#0000)")
  • Added myself to CONTRIBUTORS.md (if this is my first contribution)
  • Linked any existing issues or proposals that this pull request should close
  • Updated or added relevant documentation
  • Added a test for the contribution (if applicable)

@rhendric rhendric force-pushed the rhendric/use-genident-for-anonymous-instances branch 3 times, most recently from d078f0e to 2b9009c Compare May 29, 2021 04:15
Copy link
Contributor

@JordanMartinez JordanMartinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just writing out my understanding of what this PR does, so that you can confirm/correct my understanding (and anyone following along can be brought up to speed sooner).

First, I'll cover some context / background info on the Ident type, the type class instance compiler-generated names and the Renamer code. Then, I'll describe the goal you're trying to achieve. Finally, I'll explain what you changed to achieve that goal.

The Ident type has three values:

  • Ident - an identifier provided by the developer
  • GenIdent - an identifier generated by the compiler
  • UnusedIdent - this is irrelevant for this PR. It's used in typechecking. Since it doesn't store any text value, we can't "rename" it anyway.

Identifiers can be thought of as 'names' of things (e.g. a function, a value, a let binding). Below, every variation of fooX (where X is a number) is an identifier:

module Module where

-- type signature omitted
-- but think `foo1 :: String -> Value`
foo1 foo2 = do
  let foo3 = --
  foo4 foo2
  where
  foo4 _ = "x"

Via #4085, type class instance whose names are generated currently use the Ident constructor when they should use the GenIdent constructor as it better reflects the source of their name. Such instances are currently named $_ClassNameTypeName_4 for the following reasons:

  • the leading $ ensures it does not clash with other purescript identifiers because no such identifiers can use $ in their name.
  • the first _ makes it easier to read the name because the $ character distracts the eye; however, it is not necessary for the name.
  • the final _ is also not necessary, but does make things a bit nicer to read due to the unique identifier at the end.
  • the 4 is a unique identifier that ensures similar instance names do not clash due to generating a name and then truncating it to 25 characters.

The Renamer is responsible for ensuring two or more identifiers don't clash within a given scope. For example, given the below code:

module Main where

import ImportedModule as ImportedModule

-- top-level member
foo a {- a1 -} =
  let
    -- let binding
    a {- a2 -} = "hello"
  in a <> " world"

-- or perhaps this?
bar a {- a1 -} = a <> ImportedModule.a {- a2 -}

the "a near {- a1 -}" will have a different identifier in the outputted JavaScript (e.g. a1) than the "a near {- a2 -}" despite both having the same name in the source code.

The Renamer removes clashes by adding an integer to the "base name" (e.g. a1, a2, a3, etc. for identifiers named a) every time it finds another clash.

The Renamer code BEFORE this PR works by doing the following:

  1. Extract a module's declarations via renameInModules
  2. Create the top-level scope by getting all identifiers in the module via findDeclIdents
  3. Use a top-down approach to recurse through each layer in the AST and rename identifiers based on what's in scope at that point via renameInDecl, renameInValue, renameInLiteral, and renameInBinder. The "scope" of the current scope is determined by the binding group (I think).
    1. Renames both the name and the value in a single pass (this is changed in this PR)
      1. top-level identifiers are not renamed (this is important)
      2. all other identifiers are renamed
  4. Returns a new version of the module with a new version of the list of declarations with updated names

Type class instances will always be top-level identifiers in a module. However, the Renamer doesn't currently rename top-level members. If top-level members were renamed, then the module's exports also need to be renamed (so that exported identifiers still work when imported into other modules). To rebuild a module's exports, one also needs to incorporate the module's FFI (if it exists) in case such members are also exported.

Thus, here's the goal of this PR

  • Make generated type class instances use the GenIdent constructor rather than the Ident constructor to accurately reflect the source of their name.
  • Remove the leading $ character from the instance name by updating the Renamer to ensure names don't clash by renaming top-level GenIdent constructors (if needed).

This is how the PR achieves the goal. This PR...

  1. Changes the final name generated by the compiler for a type class instance from $_ClassNameTypeName_4 to classNameTypeName4
  2. Updates generated type class instance names to use GenIdent rather than Ident
  3. Changes the Renamer to include top-level GenIdents when renaming and returns an updated version of the module's exports (new) in addition to its decls (current).
  4. Changes the Rename to first rename names in one pass and then rename values in a second pass (whereas current approach renames both names and values in the same pass).

The Renamer code AFTER this PR works by doing the following:

  1. Extract a module's exports, FFI, and declarations via renameInModules
  2. Create the top-level scope by combining 1) only the Ident identifiers via findDeclIdents and 2) the FFI identifiers in the module
  3. Uses State's Applicative (which executes sequentially and not in parallel) to do the following:
    1. Use a top-down approach to recurse through each layer in the AST and rename identifiers based on what's in scope at that point via renameInDecl, renameInValue, renameInLiteral, and renameInBinder. The "scope" of the current scope is determined by the binding group (I think).
      1. Renames the names and then the values
        1. top-level GenIdents are renamed while top-level Idents are not renamed
        2. all other identifiers are renamed
    2. Lookup what the new name of an identifier is (if not renamed, the original name will be returned) and create a new version of the export with the new identifier.
    3. Return the updated declaration and its corresponding export
  4. Returns a new version of the module with
    1. a new version of the declarations with updated names
    2. a new version of the exports with updated names

It seems the various other changes (e.g. renameDecl, etc.) are to get the types to line up and other changes (e.g. updateName) are minor refactoring.

genName :: Text.Text
genName = "$_" <> Text.take 25 (className <> typeArgs) <> "_"
genName = Text.take 25 (className <> typeArgs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the final _ character being removed? Is this just preference?

I'm not for or against this as I think the larger readability problems are addressed in this PR. To me, this seems more like a 'stylistic' change than anything else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is cosmetic, yes. But note that the renamer only appends numbers to the end of a GenIdent if necessary to disambiguate. So with the _ character, instances were getting named classNameArg_, which is why I removed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes sense.

-- whereas in a Let declarations are renamed if their name shadows another in
-- the current scope. In order to prevent inner bindings earlier in the list
-- from shadowing bindings later in the list, first all of the declarations
-- are considered for renaming, and then all of the values are recursed into.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to prevent inner bindings earlier in the list from shadowing bindings later in the list, first all of the declarations are considered for renaming, and then all of the values are recursed into.

Could you clarify this point further because I'm not familiar with the code outside of this file?

Is this in case something like this occurs?

foo :: String -> String
foo a =
  -- does it use top-level `a` 
  -- or foo's `a`? 
  a <> "something"

a :: String
a = "text value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly what I'm talking about there. The a in foo is an inner binding that shadows the top-level a. The renamer should rename the a in foo to a1. I think the CoreImp optimizer expects the code it receives to have no shadowing, so it's not just a style thing (even though in JavaScript shadowing works just fine).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Thanks for clarifying!

@rhendric rhendric force-pushed the rhendric/use-genident-for-anonymous-instances branch from a974eb4 to 2b9009c Compare May 30, 2021 20:37
@rhendric
Copy link
Member Author

Your analysis is quite correct!

One thing that might be clarifying to add (at least, it briefly confused me just now when I read your analysis and thought I had made things too complicated) is why renameInDecls does two passes at every level in addition to the findDeclIdents call at the top level, which acts as sort of a zeroth mini-pass.

At the top level, we effectively do three passes because we want regular Idents to have first dibs on their preferred name, then GenIdents, then bindings in inner scopes. If we didn't do the findDeclIdents pass first, then an earlier GenIdent could possibly steal a name from the top-level scope that a later Ident would want (which would be very bad if we export that name). If we didn't do the decl-rename pass (which at the top level just means renaming GenIdents), then a reference to that GenIdent appearing deep in a value somewhere might get visited before the GenIdent has its new name in scope, which could maybe cause it to get a different final name than the thing it references.

We could do three passes at every level—I'm still considering that, actually—but the importance of letting Idents have first dibs over GenIdents is much less when nothing is being exported.

(Please ignore the fixup commit I momentarily pushed and then retracted; as I said, I was briefly confused, and thought that two passes everywhere would work. I forgot that Idents need to have priority over GenIdents at least at the top level.)

@JordanMartinez
Copy link
Contributor

We could do three passes at every level—I'm still considering that, actually—but the importance of letting Idents have first dibs over GenIdents is much less when nothing is being exported.

Sounds like the 3-pass approach is necessary for top-level members and safer (but possibly not necessary?) for non-top-level members. Is that correct? If we do the 3-pass approach, does that slightly slow down the purs compile command?

@rhendric
Copy link
Member Author

Using three passes below the top level doesn't get us any more safety. What it would get us is:

  • Code simplicity: we would treat all declarations the same, which might make this code easier to read.

  • Somewhat more predictable generated code in the presence of a pseudo-collision between a GenIdent and an Ident. I suspect this isn't possible to trigger right now; most if not all of the GenIdents I've seen prior to this PR are function parameters and not let-bindings, and instances have to be top-level. But suppose some future language feature resulted in some CoreFn that looked like this (in pseudo-PS):

    let <GenIdent "x" 42> = foo ...
        x = bar ...
    in <GenIdent "x" 42> + x

    where x is a user-supplied proper Ident, and <GenIdent "x" 42> is an identifier that came from some desugaring somehow.

    With two passes, this code could end up being generated:

    var x = foo(...);
    var x1 = bar(...);
    return x + x1;

    With three passes, this is what you'd get always:

    var x1 = foo(...);
    var x = bar(...);
    return x1 + x;

    The idea is that a name chosen by the programmer has priority over a variable produced by the compiler. A programmer using a debugger might not appreciate the variable x meaning something other than what they used it for. It's still a safe transformation either way, just potentially a surprising one to anyone reading the generated code or otherwise looking under the hood.

    But as I said, I don't know of any way to trigger this sort of conflict below the top level right now, so this is a pretty theoretical benefit.

@JordanMartinez
Copy link
Contributor

Simple code is always better than complex code if no other factors are at play.

The idea is that a name chosen by the programmer has priority over a variable produced by the compiler. A programmer using a debugger might not appreciate the variable x meaning something other than what they used it for. It's still a safe transformation either way, just potentially a surprising one to anyone reading the generated code or otherwise looking under the hood.

I think this is actually worth doing even if nothing triggers it now. If a programmer has said that some name is x, then it should be x in the compiled output. AFAIK, a GenIdent means the compiler generated it whereas Ident means the developer defined it. I think the decision to make GenIdents work around Idents rather than making later bindings work around earlier bindings better abides with the principle of readable debuggable code.

@rhendric
Copy link
Member Author

rhendric commented Jun 1, 2021

New commit has a few changes:

  • I've implemented three passes everywhere. I think the code simplicity gains ended up being minimal, but positive.
  • I overlooked a few large mistakes that actually would have broken unnamed instances rather severely. 🤦 First, the renamer needs to rename variables that are qualified GenIdents now, because anything at the top level of a module gets qualified. Second, externs files weren't downstream of the renaming phase, so GenIdents there weren't getting renamed. Now there is some extra logic in externs generation that reuses the renaming map generated by the renamer.
  • I've added a bunch of tests for various cases that broke while I was experimenting. These are all basically copies of existing tests, but with some instance names removed.

Comment on lines +12 to +13
OverlapAcrossModules.X.cX
OverlapAcrossModules.$cXY0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error messages are not great. Neither of these instances has a programmer-provided name, and while at least the $ in the second instance suggests this, the first instance has nothing to indicate that the programmer shouldn't go literally searching for an instance named cX. But even in the second case, $cXY0 is not even the final name that will appear in the generated code!

I think the ideal thing here would be to describe these instances with something like this:

Suggested change
OverlapAcrossModules.X.cX
OverlapAcrossModules.$cXY0
instance in module OverlapAcrossModules.X with type forall y. C X y
instance in module OverlapAcrossModules with type C X Y

I'd like to address this in a later PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Contributor

@JordanMartinez JordanMartinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, this PR looks good. Could another core member look this over and provide their feedback?

@rhendric rhendric force-pushed the rhendric/use-genident-for-anonymous-instances branch from d818ca6 to 6957dab Compare June 5, 2021 19:10
@JordanMartinez
Copy link
Contributor

I've fixed the merge conflict. Can we get another review here?

Copy link
Contributor

@hdgarrood hdgarrood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!


className :: Text.Text
className = N.runProperName $ qualName cls
className = foldMap (uncurry Text.cons . first toLower)
. Text.uncons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a nit and I know that this style isn't universal in the code base already, but for new code I think it's good practice to avoid having indentation depend on the length of an identifier. In this case, if we wanted to rename className, we'd also have to touch every line in its implementation to realign the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm completely on board with that.

Maybe we should start a wiki page with our current/evolving notions of good style? I frequently find myself waffling between a style that is more idiomatic/traditional Haskell versus one which has practical benefits like this, and not having a dominant style in the code itself doesn't help. (Another example: equational declaration style versus lambda case—I think the latter is usually superior where applicable, again because renaming the function touches fewer lines.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be good 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a wiki page, should it be included in the main repo?

Another thing we could add to that is to use record types whenever something has 3+ args.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking there isn't much advantage to versioning the standard along with the code—particularly while we're hammering out what the standard is, that could generate a lot of noise in the commit history. Maybe if the wiki page stabilizes, it makes sense to add it then, and then gatekeep it with the same PR workflow we use for code. But for starting out, I think we want something that we can feel more fluid about rewriting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a wiki page initially and then merging it into the repo when it stabilizes sounds like a good tradeoff.

moduleToExternsFile :: Module -> Environment -> ExternsFile
moduleToExternsFile (Module _ _ _ _ Nothing) _ = internalError "moduleToExternsFile: module exports were not elaborated"
moduleToExternsFile (Module ss _ mn ds (Just exps)) env = ExternsFile{..}
moduleToExternsFile :: Module -> Environment -> M.Map Ident Ident -> ExternsFile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be good to add a comment here which clarifies what the new Map argument is for? Even if it's just a pointer to the rename step of desugaring.

 
The following instances were found:

Main.$test1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't happen in this PR, but now that instance names are optional we should probably include source spans in this list so that people can find the problematic instances a little more easily.

This fixes the limitation of the CoreFn renamer which prevented it from
renaming top-level GenIdents. As a consequence, we can now give unnamed
instances more idiomatic names and still guarantee that they will be
unique in their module.
@rhendric rhendric force-pushed the rhendric/use-genident-for-anonymous-instances branch from 92ddb51 to c801acd Compare June 20, 2021 21:28
@rhendric rhendric merged commit 936d8d2 into purescript:master Jun 20, 2021
@rhendric rhendric deleted the rhendric/use-genident-for-anonymous-instances branch June 20, 2021 22:07
JordanMartinez added a commit that referenced this pull request Jun 21, 2021
…ass (#4099)

This refactors the work done in #4085. It was not merged immediately due to the work done in #4096 that would produce a merge conflict.
@hdgarrood
Copy link
Contributor

hdgarrood commented Jun 24, 2021

After a bug report on FP Slack in which someone ran into a case where an instance name had a $13 suffix in the module where it was defined but some code was trying to access the same instance with a $12 suffix, I am wondering if the renamer was deliberately leaving top level declarations out before, and whether we might want to backtrack here because this limits our ability to perform cutoff in incremental rebuilds, since the answer to question of “has the public interface of the module changed” is much more likely to become “in a way no, but we have to treat it as if it has, because of renaming.”

@hdgarrood
Copy link
Contributor

#3724

@rhendric
Copy link
Member Author

If I understand the issue correctly, isn't it substantially worse without GenIdent than with?

Prior to this PR, generated instance names always had a numeric suffix, and it depended on the state of the supply monad, which means any change that pulls one more or fewer number from the supply at any earlier point during compilation would affect the naming.

Post this PR, generated instance names only change if there is a top-level name conflict, which is rare to begin with, and much less likely to change mid-development. (It can happen, but any name generation strategy runs some risk of deoptimizing incremental builds.)

@hdgarrood
Copy link
Contributor

Right, ok, that makes sense, and I agree that this is a substantial improvement. I wasn’t objecting to this PR specifically, rather the approach of using statefulness in top level identifiers at all - I’m wondering whether instance names ought to be more deterministic than they are even after this PR, i.e. whether we should work out an algorithm which guarantees uniqueness without having to depend on statefulness / what else is inside the module. Of course readability will suffer that way, but I think reliability of incremental builds ought to win over readability when they are in conflict. But maybe this is enough of an edge case to not really be necessary; perhaps we should just get this change shipped instead.

@rhendric
Copy link
Member Author

If we gave anonymous instances unreadable names—I assume you're thinking something like a robust hash of the fully qualified class name and argument types—perhaps we could restore some readability by also generating friendly-named local vars inside the module and at any import sites? Then downstream modules, if not recompiled, would still function correctly, even though they'd potentially have the ‘wrong’ internal friendly name sometimes.

I'm not crazy about that approach; seems like a fair whack of complexity to take on for an edge case in a compile-time performance feature. But we could pivot into it in the future if this ends up presenting a practical issue for cutoff?

@rhendric
Copy link
Member Author

(Oh, and isn't it just a bug that incremental builds weren't triggered for downstream modules when $12 became $13? Is that tracked somewhere?)

@rhendric
Copy link
Member Author

proceeds to seriously consider implementing the idea he just described as not being crazy about

@JordanMartinez
Copy link
Contributor

Of course readability will suffer that way, but I think reliability of incremental builds ought to win over readability when they are in conflict.

But if one really wanted readable instance names, they could always provide it themselves, right?

@hdgarrood
Copy link
Contributor

It might be a bug in incremental compilation, but I think it’s more likely that it’s an instance of #3323 - “i recompiled a downstream module on its own using the ide server and didn’t think the upstream module would need recompiling because the module interface ought not to have changed,” which is arguably not a bug, but somewhat problematic if people can’t reliably guess when a module interface has changed.

@f-f
Copy link
Member

f-f commented Jun 24, 2021

Why not generate names that contain both a readable name and a hash? In this way they won't become too unreadable.

rhendric added a commit to rhendric/purescript that referenced this pull request Jul 3, 2021
rhendric added a commit to rhendric/purescript that referenced this pull request Jul 3, 2021
rhendric added a commit to rhendric/purescript that referenced this pull request Jul 3, 2021
rhendric added a commit to rhendric/purescript that referenced this pull request Jul 6, 2021
rhendric added a commit to rhendric/purescript that referenced this pull request Jul 11, 2021
rhendric added a commit to rhendric/purescript that referenced this pull request Jul 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants