
Extensible records and variants #201

Merged
22 commits merged from records-variants into google-research:main on Aug 19, 2020

Conversation

@danieldjohnson (Collaborator) commented Aug 15, 2020

Implements extensible records and variants, in the style of Daan Leijen's "Extensible records with scoped labels."

At a high level, we can now do stuff like:

-- Records

:p
  x = {a=5.0, b=2}
  y : {a:Int & b:Int & ..._} = {a=3, a=4, ...x}
  y
> {a = 3, a = 4, a = 5.0, b = 2}

def getTwoFoosAndABar (rest : Types)?->
                      (x : {foo:a & foo:b & bar:c & ...rest}) : (a&b&c) =
  ({foo=f1, foo=f2, bar=b, ...}) = x
  (f1, f2, b)

:p getTwoFoosAndABar {foo=1, bar=2, foo=0.0, foo=4, baz=3.0, bar=7}
> (1, (0.0, 2))

:p
  foo = 1
  bar = 2.0
  {foo, bar}
> {bar = 2.0, foo = 1}

:p
  ({foo, ...}) = {foo=1, bar=2.0}
  foo
> 1

-- Variants

def splitTwoFoosOrABar (rest:Types)?-> 
                       (x : {foo:a | foo:b | bar:c | ...rest})
                       : ({x:a | y:b | z:c} | {|...rest}) =
  case x of
    {| foo=x |}                 -> Left {| x=x |}
    {| foo | foo=y |}           -> Left {| y=y |}
    {| bar=z |}                 -> Left {| z=z |}
    {|foo|foo|bar| ...other |}  -> Right other

myStuff = [ {| foo=1 |},
            {| bar=2 |},
            {| foo | foo=3 |},
            {| baz=4 |},
            {| foo | foo | foo=5 |}
          ]:(Fin 5 => {foo:_ | foo:_ | foo:_ | bar:_ | baz:_})

:p for i. splitTwoFoosOrABar myStuff.i
> [ (Left {| x = 1 |})
> , (Left {| z = 2 |})
> , (Left {| y = 3 |})
> , (Right {| baz = 4 |})
> , (Right {| foo = 5 |}) ]

Individual fields can be extracted by pattern-matching with `...`. It should also be possible to implement field accessor functions, but I'll leave that for another PR (it will probably look something like `get #foo record : {foo:a, ...r} -> a`).

Some more details

I've modified the syntax for variants with repeated labels, to make it more consistent with Leijen's work and also to make it possible to write down variant lifting/destructuring operations in a straightforward way. Variant literals can now have a bunch of labels with no values on the left side, which indicate the alternatives that were skipped over. So for instance, setting the second a field is done with {| a | a = value |}, and adding (empty) cases for a and b to an existing variant is done with {|a|b| ...existing |}. I'm not sure there's much use for repeated labels in variants anyway, at least at the user level, but we can add back something like the old syntax if we need to at some point.

In the core IR, extensibility is represented using a new kind LabeledRowKind (aliased to Types in the prelude for ease-of-use) and a new ADT ExtLabeledItems a b which represents some labeled items of type a along with possibly an extension of type b. Record and variant types are then written in terms of this. Concrete records, since they have to contain actual values, don't have an extensible tail. However, there are four new Expr types that allow constructing and breaking down records and variants, which all have special syntax in the parser.
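To make the shape of this concrete, here is a minimal, self-contained Haskell sketch of the two types. It is close in spirit to the definitions in src/lib/Syntax.hs but simplified; the `labeledItems` helper is invented for illustration and is not the PR's code:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.List.NonEmpty as NE

type Label = String

-- Each label maps to a NonEmpty list of items, so repeated labels keep
-- their scoped left-to-right order and a label with zero items cannot
-- even be represented.
newtype LabeledItems a = LabeledItems (M.Map Label (NE.NonEmpty a))
  deriving (Eq, Show)

-- Labeled items plus an optional extension: the "...rest" tail.
data ExtLabeledItems a b = Ext (LabeledItems a) (Maybe b)
  deriving (Eq, Show)

-- Invented smart constructor: builds the canonical form from
-- (label, item) pairs, preserving left-to-right order per label.
labeledItems :: [(Label, a)] -> LabeledItems a
labeledItems pairs =
  LabeledItems (M.fromListWith (flip (<>)) [(l, NE.fromList [x]) | (l, x) <- pairs])
```

Because the map is keyed by label, the order of distinct labels is forgotten (canonicalized alphabetically by `Data.Map`), while the values under one label stay ordered.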

Notably, those four Expr types exist only between parsing and simplification. As soon as we know the monomorphic type, they get simplified away into normal record unpacking and variant pattern-matching. This means that Imp doesn't have to change at all.

I've gone ahead and implemented an Unlabeled helper pattern, also, which can be used to construct records and variants that don't have a label (well, technically, they have one label, but it is a special singleton that can't be written in userspace). This can be used as an internal anonymous product or sum type (and I use it to split up records and variants when matching patterns that have a ...rest tail).

Side change: I've made it so that inference doesn't put solver variables into the Embed scope, since there's no particular need for them to be there, and it's a bit nicer to keep the two scopes separate. This also helps because we need to be able to create variables at unification time but we don't have a scope when we are unifying things.

danieldjohnson added 14 commits Aug 6, 2020
Introduces the core changes to the IR to support extensible records and
variants:

- Adds a new `ExtLabeledItems` object that extends a LabeledItems with
  an optional extension, parameterized by type. The record and variant
  type constructors are extended with a name, whereas the UExpr versions
  are extended with another expression, and patterns are extended with
  another pattern. Records themselves are not allowed to be extended,
  since by the time you have an atomic record it should have concrete
  values for all of the fields.

- Introduces a LabeledRow kind, which is the kind of type variables that
  appear on the RHS of an extensible record or variant type expression.

- Adds some definitions and patterns to allow using anonymous records
  and variants to desugar things. Right now, it just reserves a
  particular string label (written "%INTERNAL%") for internal use, and
  then allows Haskell code to pretend that records and variants indexed
  by that label are actually just anonymous tuples / positional
  variants.

- Adds four new types of Expr, corresponding to building and splitting
  records and variants. These allow you to convert between
  ```
  { A:a & B:b & ...rest }
  ```
  and
  ```
  { %INTERNAL%#0: {A:a & B:b}  &  %INTERNAL%#1: {&...rest} }
  ```
  (and similarly for variant types.)
  These wrappers should allow us to completely desugar away the
  extensible records and variants when dealing with patterns, by first
  splitting them into non-extensible pieces and then unpacking those
  pieces separately.

In this change, most of the uses of records and variants have been
restricted to operate on the non-extended versions only, to ensure that
everything builds. The next step is to start adding logic to handle
extensible types in inference and simplification.
While working on inference, I realized that lifting variants really only
makes sense when you know exactly which alternatives you want to skip
over. This corresponds to the "embedding" operator in Leijen's "scoped
labels" paper, and makes it so that VariantLift matches nicely with
the other record and variant expressions. To make the syntax consistent
and enable writing out lifting in this way, I've changed variant syntax
to match Leijen's syntax for repeated labels: instead of `{| a#2 = x |}`
it is now `{|a|a| a = x|}`. You are free to also write things like
`{|a|b|c=x|}` but this doesn't really change anything because variant
literals are always open on the right; only duplicate labels need to be
on the left. This matters more for lift expressions: `{|a|b| ...tail |}`
is unambiguous regardless of whether `tail` has `a` or `b` labels in it.

The rest of this commit implements inference for most of the UExpr
literals for records and variants, along with labeled row unification.
The SolverT monad already ensures that inference names are unique, and
unification needs to be able to create new variables for row types.
So instead of putting inference names in the scope, we can just make
sure that Embed.hs doesn't try to look up inference names.
To make sure there's a unique representation for every labeled row type,
we don't want to allow labels with empty item lists in the LabeledItems
object. Switching to Data.List.NonEmpty ensures that we can't construct
a map with a label containing an empty list.
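A quick illustration of why NonEmpty enforces the invariant (this is just standard Data.List.NonEmpty behavior, nothing PR-specific):

```haskell
import qualified Data.List.NonEmpty as NE

-- There is no NonEmpty value for an empty list, so a label with an
-- empty item list is unrepresentable by construction.
emptyCase :: Maybe (NE.NonEmpty Int)
emptyCase = NE.nonEmpty []

-- A non-empty list converts fine.
someCase :: Maybe (NE.NonEmpty Int)
someCase = NE.nonEmpty [1, 2]
```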
The row unification checked equality BEFORE zonking rather than after,
which could lead to infinite loops in which we repeatedly try to unify
two things that we already know are identical.
The basic extensible record and variant operations are necessary to
properly express operations on extensible records. However, as soon as
we know what the concrete, monomorphic type is, we can immediately
simplify away those operations: RecordCons and RecordSplit become
unpacks followed by rebuilt Record objects; VariantLift and VariantSplit
become case expressions, ordered based on the known canonical order of
the alternatives.
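Ignoring repeated labels, the record half of this simplification can be pictured as a plain partition by label; a hypothetical sketch (`splitFields` is an invented name, not the PR's code, and the real RecordSplit also tracks how many items of each label go left):

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical sketch: once the record type is concrete, splitting a
-- record into the requested fields and the leftover tail is just a
-- partition of the field map by label.
splitFields :: [String] -> M.Map String v -> (M.Map String v, M.Map String v)
splitFields wanted = M.partitionWithKey (\k _ -> k `elem` wanted)
```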
At long last, it is possible to access individual fields of records
without naming every single field! Record patterns with tails, such
as `{foo=foo, ...rest}`, desugar to an application of `RecordSplit`
followed by unpacking of each of the split names.
Matching on extensible variants was a bit complex, and I took a few
shortcuts. The main complication is that, given an extensible variant,
it's not immediately clear how many alternatives there should be, since
it could be extended later (even during inference!). Thus, we want to
emit a VariantSplit to reduce it to the specific types we care about
and extract the polymorphic tail if needed. But the types we care about
can depend on what patterns we have!

The solution I implemented was to allow only one variant tail pattern
in a case expression, and require that it not overlap with any non-tail
variant patterns. This tail pattern is used to determine how to split
the scrutinee before we pattern match, and after splitting we then have
two disjoint non-extensible variants that we can pattern match on in
two steps.

Since this makes the canonical ordering of the alternatives a bit harder
to infer in advance, I've adapted the case pattern-match machinery to
use a helper ADT that tracks what kind of pattern we are matching, with
different cases for Dex ADT constructors, variant patterns, and variant
tail patterns.
Slight quality-of-life improvement: allow record fields to pun against
variable or pattern binders. This means you can just do `{foo, bar}` or
`({foo, ...}) = stuff` instead of having to write `{foo=foo, bar=bar}`
or `({foo=foo, ...rest}) = stuff`. There's also sugar for omitting the
`_` binder in a record pattern.

No punning is allowed for variants, because that doesn't really seem
useful, and could be confusing given that the variant syntax already
has labels with no values (to lift variant types with new options on
the left).
@danieldjohnson danieldjohnson force-pushed the records-variants branch 4 times, most recently from 099cd16 to e772cb3 Compare Aug 15, 2020
In the previous implementation, it was possible for inference to emit
RecordSplit on values that weren't yet known to be records (well,
inference knew it, but hadn't constrained it to be so). This was causing
type errors for RecordSplit when `getType` was called before inference
solved for the missing type variables.

The fix is to constrain the type to be a record (with unknown fields)
and then zonk the expression before emitting a RecordSplit. This has
the (surprising!) side effect of actually greatly simplifying the
pattern match logic; it turns out that most of that was just duplicating
work that unification had to do anyway.

This fix exposed a second problem with row inference, which was that
I forgot to include the base case for unifying two things that can't
possibly be equal! Now, if we try to unify anything other than a type
variable with an empty row, we throw a type error immediately.
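The added base case can be sketched like this, with an invented toy `Row` type standing in for the real labeled-row representation (the real code solves the variable via the inference monad; here we just signal success):

```haskell
-- Toy row representation (invented for illustration): a flexible row
-- variable, a labeled field cons (label, type name, tail), or the empty row.
data Row = RowVar String | RowCons String String Row | RowEmpty
  deriving (Eq, Show)

-- Base case: the empty row unifies only with another empty row or with a
-- flexible row variable we can solve; anything else is an immediate
-- type error instead of a silent failure to make progress.
unifyWithEmpty :: Row -> Either String ()
unifyWithEmpty (RowVar _) = Right ()  -- solve the variable := {}
unifyWithEmpty RowEmpty   = Right ()
unifyWithEmpty r          = Left ("cannot unify nonempty row with {}: " ++ show r)
```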
I didn't realize this at first, but simplifyAtom/simplifyExpr are not
idempotent! In particular, simplification uses a different space of
variables (although they share the same namespace), so we might have
a subst env that binds X to Y and Y to Z. Simplifying twice can mean
that you bind X to Z by mistake!
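A tiny Haskell model of the pitfall (`applySubst` and the environment are illustrative, not the real substitution machinery):

```haskell
import qualified Data.Map.Strict as M

-- A non-idempotent substitution: each application follows at most one
-- binding, so applying it twice can chain X -> Y -> Z by mistake.
applySubst :: M.Map String String -> String -> String
applySubst env v = M.findWithDefault v v env

env :: M.Map String String
env = M.fromList [("X", "Y"), ("Y", "Z")]
```

One application sends X to Y; a second, accidental application sends the result on to Z, which is exactly the bug described above.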
@dougalm (Collaborator) commented on a6425c6, Aug 16, 2020

Yes, good catch. You have to be careful to only apply the substitution once. That's why we use dropSub when we do beta reduction etc. This is different from the substitution we use in type inference, which is idempotent, and we freely zonk multiple times.

@apaszke (Collaborator) left a comment

This looks great! I haven't looked at the parser and the inference part, but the rest seems fine. I'm a little split about having VariantSplit (pun not intended), because it seems like a partial operation and we've been trying to avoid those. Is it emitted for fallible patterns? If yes, can't we just use case expressions for that?

prelude.dx — resolved review thread (outdated)
@@ -77,10 +82,11 @@ data Atom = Var Var
 | Pi PiType
 | DataCon DataDef [Atom] Int [Atom]
 | TypeCon DataDef [Atom]
+| LabeledRow (ExtLabeledItems Type Name)
 | Record (LabeledItems Atom)
@apaszke (Collaborator) Aug 17, 2020

Hmm, why don't regular records use ExtLabeledItems? Is it because you can't have a run-time representation of a record unless you know all of its fields, unlike for a variant?

@danieldjohnson (Collaborator, Author) Aug 17, 2020

Yeah, that's exactly right.

@apaszke (Collaborator) Aug 18, 2020

Great. I wonder if we could do the same for variants. I understand that inference is no longer quite as straightforward, since you don't know the type exactly when you see the constructor, but I suspect that it should be fully known after type inference right? This way we could make Atoms be fully specified as all ambiguity would be resolved at the point where we translate the surface language (UExpr) into core IR, right?

@danieldjohnson (Collaborator, Author) Aug 18, 2020

That's an interesting thought. We would still need something to represent a variant with a not-fully-known type in the IR, though, so I guess that would have to be an expr or something? So we would have

data Atom = ...
  | Variant (LabeledItems Type) Label Int Atom

...

data Expr = ...
  | AmbiguousVariant (ExtLabeledItems Type Name) Label Int Atom

or similar?

I'm not sure this really adds anything, though? (Related note: it might still be unknown until function inlining, in the case where the tail is an implicit function argument, so it could last at least until simplification for evaluated things and forever inside function bodies.)

@danieldjohnson (Collaborator, Author) Aug 18, 2020

Also, RecordTy and VariantTy are still ambiguous. The first argument to the Variant constructor is basically just its type, since you can't infer the type of a variant from its contents. (I guess a different design decision would be to make it just Variant Type Label Int Atom where the first argument is always a VariantTy atom? But that seemed unnecessarily indirect.)

@danieldjohnson (Collaborator, Author) Aug 18, 2020

Another related point here is that Variant atoms with a tail Ext _ (Just tail) transform into unambiguous fully-specified Variant atoms during substitution, which is convenient. Records can't do that because a record atom has to actually have values for all of its fields. In a sense the only things that it makes sense to put in an ExtLabeledItems are types, not values. Maybe that's the more fundamental reason why Record just has a LabeledItems (and the equivalent of an extended record value is just RecordCons).

src/lib/Syntax.hs — resolved review thread (outdated)
@@ -153,23 +164,30 @@ scalarTableBaseType t = case t of

 type Label = String
-newtype LabeledItems a = LabeledItems (M.Map Label [a])
+newtype LabeledItems a = LabeledItems (M.Map Label (NE.NonEmpty a))
@apaszke (Collaborator) Aug 17, 2020

Can you please add a little comment that describes what this ADT represents? Do I understand correctly, that the idea here is that we keep all the fields of a given label in order, to make sure we keep the scopes right? By the way, would it make sense to put in some extra care to restore the original field order when printing (doesn't have to happen in this PR)?

@danieldjohnson (Collaborator, Author) Aug 17, 2020

Sure, I can add a comment.

Regarding "original field order", maybe that's something worth having a bigger discussion about. Right now field order is thrown away during parsing; fields are always ordered in alphabetical order after that. This is mainly so that {a:A, b:B} and {b:B, a:A} unify with each other, and so that you can extract either a or b in either order. This ADT represents the "canonical form" of the fields, such that we have an unordered collection of labels (treated as an alphabetical order during iteration) and an ordered collection of values for each label (for repeated fields, which DO have a definite ordering).

An alternative approach, I guess, would be to actually treat the order of the fields as part of the type? But then two records with different orders would be different types, and we'd have to cast between them, either explicitly in the concrete syntax or implicitly based on some set of rules. I don't know how that interacts with extensibility; I think most extensible record designs don't treat order as important.

I guess there's a third, purely cosmetic approach, where we associate the labeled items with an optional field order that is ignored by the typechecker? When unifying two things we would just arbitrarily pick one of the orders if they conflict. This would then be used to print things, but not affect the runtime representation? Although that might be a bit confusing because the runtime representation is what determines the iteration order of the record when used as an index set...

Either way probably any change like this should go in a different PR, I think.
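The "canonical alphabetical order" point above can be seen directly in how Data.Map behaves, which is what the `M.Map Label ...` representation relies on:

```haskell
import qualified Data.Map.Strict as M

-- Data.Map is keyed in ascending key order, so insertion order is
-- forgotten: both of these describe the same canonical record.
recA, recB :: M.Map String Int
recA = M.fromList [("a", 1), ("b", 2)]
recB = M.fromList [("b", 2), ("a", 1)]
```

Both maps compare equal and both iterate in alphabetical label order, which is why `{a:A, b:B}` and `{b:B, a:A}` unify.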

@apaszke (Collaborator) Aug 18, 2020

Yeah I definitely didn't mean to suggest that order of fields should matter, only that we might do a best effort job to preserve it for printing. When it comes to PRs, many of the things that I pointed out could be separated so don't feel super pressured to fix it right away, but this is a convenient place to discuss them.

src/lib/Syntax.hs — resolved review thread (outdated)
src/lib/Type.hs — resolved review thread (outdated)
src/lib/Syntax.hs — resolved review thread (outdated)
src/lib/Type.hs (outdated):
@@ -322,6 +366,12 @@ instance CoreVariant Expr where
Case e alts _ -> do
checkVariant e
forM_ alts $ \(Abs _ body) -> checkVariant body
-- TODO: is there a way to express "this will go away once types are
-- monomorphic"?
@apaszke (Collaborator) Aug 17, 2020

Well... what's a polymorphic type in our case? You mean that there are no functions with Type arguments left?

@apaszke (Collaborator) Aug 17, 2020

Also, wouldn't goneBy Simp be a more accurate description?

@danieldjohnson (Collaborator, Author) Aug 17, 2020

Polymorphic in the sense of "has arguments whose types depend on a function argument of type Fields (i.e. LabeledRowKind)". I don't think goneBy Simp is correct, because if the user defines a function with an argument of type Fields, these operations still last past Simp inside the Lam atom.

@apaszke (Collaborator) Aug 18, 2020

Ugh I see... This is annoying 😕

src/lib/Simplify.hs — review thread (resolved):
_ -> emit =<< RecordSplit <$> mapM substEmbedR leftTys <*> simplifyAtom full
VariantLift leftTys@(LabeledItems litems) right -> case getType right of
VariantTy (NoExt rightTys) -> do
-- Emit a case statement (ordered by the arg type) that lifts the type.
@apaszke (Collaborator) Aug 17, 2020

Would it make sense to do some optimizations here? For example, if the left cases and right cases have no overlap, then there is nothing complicated that you have to do. Or do you expect those case statements to somehow get simplified away anyway.

@danieldjohnson (Collaborator, Author) Aug 17, 2020

This is actually a bit subtle, I think. Even if the left and right cases have no overlap in terms of fields, they can still "interleave" with each other in terms of the canonical order of alternatives, because we use alphabetical ordering in the core IR. In the particular special case where the argument is a variant literal, it's true that we can make additional simplifications here, by just changing the first argument to Variant. But when the argument isn't a literal (i.e. if it's a Var), we can't reach in and change the type of the variant (and, I think, if it remains a Var even into the Imp lowering, it could actually become a SumAsProd, which is much more complicated to lift).

I "cheat" a bit here by first desugaring and then simplifying the resulting case expression, which should optimize away the literal case.
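To illustrate the interleaving point: under a canonical alphabetical ordering, the tag index of an alternative depends on every label in the type, not just the ones being lifted in. A toy sketch (`altIndex` is an invented helper and it ignores repeated labels):

```haskell
import Data.List (sort)

-- The canonical index of an alternative is its position among ALL
-- labels in alphabetical order, so lifting new labels in on the left
-- can shift the tags of existing alternatives.
altIndex :: [String] -> String -> Maybe Int
altIndex labels l = lookup l (zip (sort labels) [0 ..])
```

For example, `b` is alternative 0 in `{b:_}` but alternative 1 in `{a:_ | b:_ | c:_}`, which is why a lift is a re-tagging case expression rather than a simple cast.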

@apaszke (Collaborator) Aug 18, 2020

Right, the case simplification might be just enough to do the job. But your comment left me wondering a bit -- why would the interleaving of the fields matter here?

@danieldjohnson (Collaborator, Author) Aug 18, 2020

It doesn't matter for Variant atoms themselves, I guess, but it does matter for SumAsProd lowering and also for ordering of alternatives for any case statement that matches on that variant, so it's not a simple "cast", if that makes sense.

@danieldjohnson (Collaborator, Author) commented Aug 17, 2020

@apaszke See this comment regarding VariantSplit (tl;dr: it's actually total, and really only exists as a side effect of how Case works in the IR).

I'll try making the suggested changes and see if anything breaks.

Originally I thought LabeledRow would go away after inference, but if
there are functions with implicit arguments it can last until the end
of Simp. Once everything has been simplified, though, LabeledRow atoms
should have been substituted away.
src/lib/Syntax.hs — resolved review thread (outdated)
@dougalm dougalm merged commit 01dac77 into google-research:main Aug 19, 2020
2 checks passed
@danieldjohnson danieldjohnson deleted the records-variants branch Aug 24, 2020