Use GenIdent for anonymous instances #4096
d078f0e to 2b9009c
I'm just writing out my understanding of what this PR does, so that you can confirm/correct my understanding (and anyone following along can be brought up to speed sooner).
First, I'll cover some context / background info on the `Ident` type, the compiler-generated names of type class instances, and the Renamer code. Then, I'll describe the goal you're trying to achieve. Finally, I'll explain what you changed to achieve that goal.
The `Ident` type has three constructors:
- `Ident` - an identifier provided by the developer
- `GenIdent` - an identifier generated by the compiler
- `UnusedIdent` - this is irrelevant for this PR. It's used in typechecking. Since it doesn't store any text value, we can't "rename" it anyway.
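For anyone following along, the three constructors can be pictured roughly as below. This is a simplified sketch rather than the compiler's exact definition: `String` stands in for the `Text` the compiler uses, the field layout of `GenIdent` is given as I remember it, and `isGenerated` is a hypothetical helper added only for illustration.

```haskell
-- Simplified sketch of the identifier type discussed above.
-- Assumption: String stands in for the compiler's Text.
data Ident
  = Ident String                    -- name written by the developer
  | GenIdent (Maybe String) Integer -- compiler-generated: optional base name plus a number
  | UnusedIdent                     -- carries no text, so there is nothing to rename
  deriving (Show, Eq, Ord)

-- Hypothetical helper: was this identifier produced by the compiler?
isGenerated :: Ident -> Bool
isGenerated (GenIdent _ _) = True
isGenerated _              = False
```

The distinction matters later on: the Renamer can treat `Ident` and `GenIdent` differently only because the constructor records where the name came from.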
Identifiers can be thought of as 'names' of things (e.g. a function, a value, a `let` binding). Below, every variation of `fooX` (where `X` is a number) is an identifier:
module Module where

-- type signature omitted
-- but think `foo1 :: String -> Value`
foo1 foo2 = do
  let foo3 = --
  foo4 foo2
  where
  foo4 _ = "x"
Via #4085, type class instances whose names are generated currently use the `Ident` constructor when they should use the `GenIdent` constructor, as it better reflects the source of their name. Such instances are currently named `$_ClassNameTypeName_4` for the following reasons:
- the leading `$` ensures it does not clash with other PureScript identifiers, because no such identifier can use `$` in its name.
- the first `_` makes the name easier to read because the `$` character distracts the eye; however, it is not necessary for the name.
- the final `_` is also not necessary, but does make things a bit nicer to read due to the unique identifier at the end.
- the `4` is a unique identifier that ensures similar instance names do not clash due to generating a name and then truncating it to 25 characters.
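To make the two naming schemes concrete, here is a minimal sketch of how I read them. The function names `oldInstanceName` and `newInstanceBase` are hypothetical, `String` replaces `Text`, and the way the trailing number is attached in the old scheme is my assumption based on the example name above.

```haskell
import Data.Char (toLower)

-- Old scheme as described above: "$_" prefix, base truncated to 25
-- characters, then "_" and a unique number (assumed to be appended directly).
oldInstanceName :: String -> String -> Integer -> String
oldInstanceName className typeArgs n =
  "$_" <> take 25 (className <> typeArgs) <> "_" <> show n

-- New scheme: lower-case the first character of the class name, then
-- truncate. No number is attached here; the renamer adds one only on a clash.
newInstanceBase :: String -> String -> String
newInstanceBase className typeArgs =
  take 25 (lowerFirst className <> typeArgs)
  where
    lowerFirst (c : cs) = toLower c : cs
    lowerFirst []       = []
```

For a class `Show` and a type `Int`, this yields `$_ShowInt_4` under the old scheme (with `4` as the unique number) and `showInt` as the new base name.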
The `Renamer` is responsible for ensuring two or more identifiers don't clash within a given scope. For example, given the below code:
module Main where

import ImportedModule as ImportedModule

-- top-level member
foo a {- a1 -} =
  let
    -- let binding
    a {- a2 -} = "hello"
  in a <> " world"

-- or perhaps this?
bar a {- a1 -} = a <> ImportedModule.a {- a2 -}
the "`a` near `{- a1 -}`" will have a different identifier in the outputted JavaScript (e.g. `a1`) than the "`a` near `{- a2 -}`" despite both having the same name in the source code.
The Renamer removes clashes by adding an integer to the "base name" (e.g. `a1`, `a2`, `a3`, etc. for identifiers named `a`) every time it finds another clash.
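The suffixing idea can be sketched in a few lines. This is only an illustration under my own assumptions: `freshName` is a hypothetical name, the scope is a plain list of names already taken, and the real Renamer tracks scope differently.

```haskell
-- Pick a name that is not yet in scope: keep the base name if it is free,
-- otherwise try base1, base2, base3, ... until one is free.
-- Assumption: the scope is a plain list of names already taken.
freshName :: [String] -> String -> String
freshName inScope base
  | base `notElem` inScope = base
  | otherwise              = go 1
  where
    go :: Integer -> String
    go n
      | candidate `notElem` inScope = candidate
      | otherwise                   = go (n + 1)
      where
        candidate = base <> show n
```

With `a` already in scope, a second `a` becomes `a1`; with `a` and `a1` in scope, the next one becomes `a2`.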
The Renamer code BEFORE this PR works by doing the following:
- Extract a module's declarations via `renameInModules`
- Create the top-level scope by getting all identifiers in the module via `findDeclIdents`
- Use a top-down approach to recurse through each layer in the AST and rename identifiers based on what's in scope at that point via `renameInDecl`, `renameInValue`, `renameInLiteral`, and `renameInBinder`. The current scope is determined by the binding group (I think).
  - Renames both the name and the value in a single pass (this is changed in this PR)
    - top-level identifiers are not renamed (this is important)
    - all other identifiers are renamed
- Returns a new version of the module with a new version of the list of declarations with updated names
Type class instances will always be top-level identifiers in a module. However, the Renamer doesn't currently rename top-level members. If top-level members were renamed, then the module's `exports` also need to be renamed (so that exported identifiers still work when imported into other modules). To rebuild a module's `exports`, one also needs to incorporate the module's FFI (if it exists) in case such members are also exported.
Thus, here's the goal of this PR:
- Make generated type class instances use the `GenIdent` constructor rather than the `Ident` constructor to accurately reflect the source of their name.
- Remove the leading `$` character from the instance name by updating the `Renamer` to ensure names don't clash by renaming top-level `GenIdent` constructors (if needed).
This is how the PR achieves the goal. This PR...
- Changes the final name generated by the compiler for a type class instance from `$_ClassNameTypeName_4` to `classNameTypeName4`
- Updates generated type class instance names to use `GenIdent` rather than `Ident`
- Changes the Renamer to include top-level `GenIdent`s when renaming and returns an updated version of the module's `exports` (new) in addition to its `decls` (current).
- Changes the Renamer to first rename names in one pass and then rename values in a second pass (whereas the current approach renames both names and values in the same pass).
The Renamer code AFTER this PR works by doing the following:
- Extract a module's exports, FFI, and declarations via `renameInModules`
- Create the top-level scope by combining 1) only the `Ident` identifiers via `findDeclIdents` and 2) the FFI identifiers in the module
- Uses `State`'s Applicative (which executes sequentially and not in parallel) to do the following:
  - Use a top-down approach to recurse through each layer in the AST and rename identifiers based on what's in scope at that point via `renameInDecl`, `renameInValue`, `renameInLiteral`, and `renameInBinder`. The current scope is determined by the binding group (I think).
    - Renames the names and then the values
      - top-level `GenIdent`s are renamed while top-level `Ident`s are not renamed
      - all other identifiers are renamed
  - Look up what the new name of an identifier is (if not renamed, the original name will be returned) and create a new version of the export with the new identifier.
  - Return the updated declaration and its corresponding export
- Returns a new version of the module with
  - a new version of the declarations with updated names
  - a new version of the exports with updated names
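The top-level part of the steps above could be sketched as follows. This is my own simplified model, not the PR's code: `renameTopLevel`, `renameExports`, `firstFree`, and `baseName` are hypothetical names, identifiers carry `String`s, and `mapAccumL` stands in for the sequential `State` traversal so the example needs nothing beyond `base` and `containers`.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map as M
import qualified Data.Set as S

-- Simplified identifiers (assumption: String instead of Text, no UnusedIdent).
data Ident
  = Ident String            -- developer-written name
  | GenIdent String Integer -- generated base name plus a distinguishing number
  deriving (Show, Eq, Ord)

-- The text an identifier would have if it were left alone.
baseName :: Ident -> String
baseName (Ident n)      = n
baseName (GenIdent n _) = n

-- First candidate among base, base1, base2, ... that is not already used.
firstFree :: S.Set String -> String -> String
firstFree used base = go (base : [base ++ show i | i <- [1 :: Integer ..]])
  where
    go (c : cs)
      | c `S.member` used = go cs
      | otherwise         = c
    go [] = base -- unreachable: the candidate list is infinite

-- Developer-written and FFI names are reserved up front and never change.
-- Generated names are visited in order and only changed on a clash.
renameTopLevel :: [String] -> [Ident] -> M.Map Ident String
renameTopLevel ffiNames decls = M.fromList (snd (mapAccumL step reserved decls))
  where
    reserved = S.fromList (ffiNames ++ [n | Ident n <- decls])
    step used ident@(Ident n) = (used, (ident, n))
    step used ident@(GenIdent base _) =
      let name = firstFree used base
       in (S.insert name used, (ident, name))

-- Rebuild the exports; identifiers that were not renamed keep their original text.
renameExports :: M.Map Ident String -> [Ident] -> [String]
renameExports renamed = map (\i -> M.findWithDefault (baseName i) i renamed)
```

Reserving the `Ident` and FFI names before visiting any `GenIdent` is what guarantees that a developer-written name never changes, whatever order the declarations appear in.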
It seems the various other changes (e.g. `renameDecl`, etc.) are to get the types to line up and other changes (e.g. `updateName`) are minor refactoring.
  genName :: Text.Text
- genName = "$_" <> Text.take 25 (className <> typeArgs) <> "_"
+ genName = Text.take 25 (className <> typeArgs)
Why is the final `_` character being removed? Is this just preference?
I'm not for or against this as I think the larger readability problems are addressed in this PR. To me, this seems more like a 'stylistic' change than anything else.
It is cosmetic, yes. But note that the renamer only appends numbers to the end of a `GenIdent` if necessary to disambiguate. So with the `_` character, instances were getting named `classNameArg_`, which is why I removed it.
Ah, makes sense.
src/Language/PureScript/Renamer.hs
Outdated
-- whereas in a Let declarations are renamed if their name shadows another in
-- the current scope. In order to prevent inner bindings earlier in the list
-- from shadowing bindings later in the list, first all of the declarations
-- are considered for renaming, and then all of the values are recursed into.
In order to prevent inner bindings earlier in the list from shadowing bindings later in the list, first all of the declarations are considered for renaming, and then all of the values are recursed into.
Could you clarify this point further because I'm not familiar with the code outside of this file?
Is this in case something like this occurs?
foo :: String -> String
foo a =
  -- does it use top-level `a`
  -- or foo's `a`?
  a <> "something"

a :: String
a = "text value"
Yes, that's exactly what I'm talking about there. The `a` in `foo` is an inner binding that shadows the top-level `a`. The renamer should rename the `a` in `foo` to `a1`. I think the CoreImp optimizer expects the code it receives to have no shadowing, so it's not just a style thing (even though in JavaScript shadowing works just fine).
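A toy version of this shadowing removal may help anyone else reading. It is a sketch under my own assumptions, not the Renamer's actual traversal: `Expr` is a hypothetical three-constructor language, `unshadow` is a made-up name, and lists stand in for the real scope structures.

```haskell
-- A tiny expression language, just enough to show shadowing.
data Expr
  = Var String
  | Lam String Expr
  | App Expr Expr
  deriving (Show, Eq)

-- Rename binders so that no inner binder reuses a name that is already taken.
-- `used` lists every output name taken on the current path (including
-- top-level names); `scope` maps a source name to its current output name.
unshadow :: [String] -> [(String, String)] -> Expr -> Expr
unshadow _ scope (Var x) = Var (maybe x id (lookup x scope))
unshadow used scope (App f a) = App (unshadow used scope f) (unshadow used scope a)
unshadow used scope (Lam x body) = Lam x' (unshadow (x' : used) ((x, x') : scope) body)
  where
    x' = pick (x : [x ++ show n | n <- [1 :: Integer ..]])
    pick (c : cs)
      | c `elem` used = pick cs
      | otherwise     = c
    pick [] = x -- unreachable: the candidate list is infinite
```

Calling `unshadow ["a"] []` on `Lam "a" (Var "a")` models the `foo a = ...` case from the question: the top-level `a` is taken, so the parameter and its use both become `a1`.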
Gotcha. Thanks for clarifying!
a974eb4 to 2b9009c
Your analysis is quite correct! One thing that might be clarifying to add (at least, it briefly confused me just now when I read your analysis and thought I had made things too complicated) is why […]
At the top level, we effectively do three passes because we want regular […]
We could do three passes at every level (I'm still considering that, actually) but the importance of letting […]
(Please ignore the fixup commit I momentarily pushed and then retracted; as I said, I was briefly confused, and thought that two passes everywhere would work. I forgot that […]
Sounds like the 3-pass approach is necessary for top-level members and safer (but possibly not necessary?) for non-top-level members. Is that correct? If we do the 3-pass approach, does that slightly slow down the […]
Using three passes below the top level doesn't get us any more safety. What it would get us is: […]
Simple code is always better than complex code if no other factors are at play.
I think this is actually worth doing even if nothing triggers it now. If a programmer has said that some name is […]
New commit has a few changes: […]
OverlapAcrossModules.X.cX
OverlapAcrossModules.$cXY0
These error messages are not great. Neither of these instances has a programmer-provided name, and while at least the `$` in the second instance suggests this, the first instance has nothing to indicate that the programmer shouldn't go literally searching for an instance named `cX`. But even in the second case, `$cXY0` is not even the final name that will appear in the generated code!
I think the ideal thing here would be to describe these instances with something like this:
- OverlapAcrossModules.X.cX
- OverlapAcrossModules.$cXY0
+ instance in module OverlapAcrossModules.X with type forall y. C X y
+ instance in module OverlapAcrossModules with type C X Y
I'd like to address this in a later PR.
Sounds good!
AFAICT, this PR looks good. Could another core member look this over and provide their feedback?
d818ca6 to 6957dab
I've fixed the merge conflict. Can we get another review here?
Looks good!
  className :: Text.Text
- className = N.runProperName $ qualName cls
+ className = foldMap (uncurry Text.cons . first toLower)
+           . Text.uncons
Bit of a nit and I know that this style isn't universal in the code base already, but for new code I think it's good practice to avoid having indentation depend on the length of an identifier. In this case, if we wanted to rename `className`, we'd also have to touch every line in its implementation to realign the code.
Oh, I'm completely on board with that.
Maybe we should start a wiki page with our current/evolving notions of good style? I frequently find myself waffling between a style that is more idiomatic/traditional Haskell versus one which has practical benefits like this, and not having a dominant style in the code itself doesn't help. (Another example: equational declaration style versus lambda case—I think the latter is usually superior where applicable, again because renaming the function touches fewer lines.)
I think that would be good 👍
Rather than a wiki page, should it be included in the main repo?
Another thing we could add to that is to use record types whenever something has 3+ args.
I'm thinking there isn't much advantage to versioning the standard along with the code—particularly while we're hammering out what the standard is, that could generate a lot of noise in the commit history. Maybe if the wiki page stabilizes, it makes sense to add it then, and then gatekeep it with the same PR workflow we use for code. But for starting out, I think we want something that we can feel more fluid about rewriting.
Using a wiki page initially and then merging it into the repo when it stabilizes sounds like a good tradeoff.
- moduleToExternsFile :: Module -> Environment -> ExternsFile
- moduleToExternsFile (Module _ _ _ _ Nothing) _ = internalError "moduleToExternsFile: module exports were not elaborated"
- moduleToExternsFile (Module ss _ mn ds (Just exps)) env = ExternsFile{..}
+ moduleToExternsFile :: Module -> Environment -> M.Map Ident Ident -> ExternsFile
Do you think it would be good to add a comment here which clarifies what the new `Map` argument is for? Even if it's just a pointer to the rename step of desugaring.
The following instances were found:
Main.$test1
This probably shouldn't happen in this PR, but now that instance names are optional we should probably include source spans in this list so that people can find the problematic instances a little more easily.
This fixes the limitation of the CoreFn renamer which prevented it from renaming top-level GenIdents. As a consequence, we can now give unnamed instances more idiomatic names and still guarantee that they will be unique in their module.
92ddb51 to c801acd
There was a bug report on FP Slack in which someone ran into a case where an instance name had a $13 suffix in the module where it was defined, but some code was trying to access the same instance with a $12 suffix. That makes me wonder whether the renamer was deliberately leaving top-level declarations out before, and whether we might want to backtrack here. Renaming top-level declarations limits our ability to perform cutoff in incremental rebuilds, since the answer to the question “has the public interface of the module changed?” is much more likely to become “in a way no, but we have to treat it as if it has, because of renaming.”
If I understand the issue correctly, isn't it substantially worse without […] Prior to this PR, generated instance names always had a numeric suffix, and it depended on the state of the supply monad, which means any change that pulls one more or fewer number from the supply at any earlier point during compilation would affect the naming. Post this PR, generated instance names only change if there is a top-level name conflict, which is rare to begin with, and much less likely to change mid-development. (It can happen, but any name generation strategy runs some risk of deoptimizing incremental builds.)
Right, ok, that makes sense, and I agree that this is a substantial improvement. I wasn’t objecting to this PR specifically, but rather to the approach of using statefulness in top-level identifiers at all. I’m wondering whether instance names ought to be more deterministic than they are even after this PR, i.e. whether we should work out an algorithm which guarantees uniqueness without having to depend on statefulness / what else is inside the module. Of course readability will suffer that way, but I think reliability of incremental builds ought to win over readability when they are in conflict. But maybe this is enough of an edge case to not really be necessary; perhaps we should just get this change shipped instead.
If we gave anonymous instances unreadable names (I assume you're thinking something like a robust hash of the fully qualified class name and argument types), perhaps we could restore some readability by also generating friendly-named local vars inside the module and at any import sites? Then downstream modules, if not recompiled, would still function correctly, even though they'd potentially have the ‘wrong’ internal friendly name sometimes. I'm not crazy about that approach; seems like a fair whack of complexity to take on for an edge case in a compile-time performance feature. But we could pivot into it in the future if this ends up presenting a practical issue for cutoff?
(Oh, and isn't it just a bug that incremental builds weren't triggered for downstream modules when […]
proceeds to seriously consider implementing the idea he just described as not being crazy about
But if one really wanted readable instance names, they could always provide it themselves, right?
It might be a bug in incremental compilation, but I think it’s more likely that it’s an instance of #3323 - “i recompiled a downstream module on its own using the ide server and didn’t think the upstream module would need recompiling because the module interface ought not to have changed,” which is arguably not a bug, but somewhat problematic if people can’t reliably guess when a module interface has changed.
Why not generate names that contain both a readable name and a hash? In this way they won't become too unreadable.
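To make the "readable name plus hash" idea concrete, here is a rough sketch. Nothing in it comes from the PR or the compiler: `stableInstanceName` and `fnv1a` are hypothetical names, the FNV-1a-style hash is only one possible choice of stable hash, and the `_` separator and the key format are arbitrary assumptions.

```haskell
import Data.Bits (xor)
import Data.Char (ord, toLower)
import Data.Word (Word32)
import Numeric (showHex)

-- FNV-1a-style 32-bit hash over the code points of the input.
-- Assumption: any hash that is stable across compiler runs would do.
fnv1a :: String -> Word32
fnv1a = foldl step 2166136261
  where
    step h c = (h `xor` fromIntegral (ord c)) * 16777619

-- Readable prefix (as in this PR) followed by a hash of the fully qualified
-- class name and argument types, so the name depends only on the instance
-- itself and not on what else is defined in the module.
stableInstanceName :: String -> String -> [String] -> String
stableInstanceName moduleName className typeArgs =
  take 25 (lowerFirst className <> concat typeArgs) <> "_" <> showHex (fnv1a key) ""
  where
    key = moduleName <> "." <> className <> " " <> unwords typeArgs
    lowerFirst (c : cs) = toLower c : cs
    lowerFirst []       = []
```

The trade-off is the one discussed above: the name no longer changes when unrelated declarations are added to the module, at the cost of a less pleasant suffix in the generated code.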
Description of the change
With compliments to @JordanMartinez, follow-up from one of my suggestions on #4085.