Skip to content
This repository has been archived by the owner on Nov 21, 2022. It is now read-only.

Alternative proposal by Ivan, in PEP form #19

Open
gvanrossum opened this issue May 14, 2020 · 24 comments
Open

Alternative proposal by Ivan, in PEP form #19

gvanrossum opened this issue May 14, 2020 · 24 comments
Labels
fully pepped Issues that have been fully documented in the PEP

Comments

@gvanrossum
Copy link
Owner

Ivan Levkivskyi has independently developed a PEP for matching. This has not been publicly shared yet, but I figured it would be okay to share in this private repo.

https://github.com/ilevkivskyi/peps/blob/pattern-matching/pep-9999.rst

@ilevkivskyi
Copy link
Collaborator

Important note: ideas marked as rejected there are not really rejected, I just wanted to summarize all alternatives, on many of them I was 50-50.

@ilevkivskyi
Copy link
Collaborator

Btw, it is interesting how close are the proposals, it is a good sign!

@ilevkivskyi
Copy link
Collaborator

After quickly scrolling through issues, it seems to me all ideas where we diverge are in the 50-50 zone I mentioned.

Couple exceptions:

@gvanrossum
Copy link
Owner Author

gvanrossum commented May 14, 2020

  • I like the idea of adding sealed (and in general thinking things through from the POV of a static type checker).
  • Our proposed __match__ API (and class pattern semantics) is so different from yours that dataclasses are very simple to support for us (see below).

Let me summarize what I see as the main differences.

  • Differences around as vs. case, indentation, multiple as clauses per block (Top-level syntax #2).
  • Our proposal has no else (How to spell anonymous default case? #6) and is a pass if nothing matches (What should happen if no case matches? #5). We use case _: instead of else:, and _ is the universal "don't care" pattern.
  • We have | for OR patterns. (We decided against AND patterns, AND patterns #14.)
  • We have somewhat different rules for variables (Distinguishing Loads and Stores #1). This is controversial, but we ended up with a compromise where "single" names starting with a lowercase letter are variables but if a name starts with an upper case letter it's an expression; and dotted names are always expressions (solving some of your issues).
  • We are adopting semantics closer to sequence unpacking, so [a, b] and (a, b) are patterns with exactly the same meaning, and they match any sequence. (I guess if you really wanted to check for a tuple you could use tuple([a, b]).
  • Our API for __match__ is entirely different (Defining a custom match protocol (__match__) #8). See below for a summary.

UPDATE: Here are some more differences:

  • You write unstructured matches as C(...); we just write C(). (see __match__ discussion below.)
  • You bind names only after the guard succeeds. Our proposal leaves the name bound even if the guard fails (in fact it may leave some names in a pattern bound even when a later sub-pattern fails). (Our proposal is inspired by the behavior of := here.)
  • You propose special semantics for coinciding names, so e.g. (x, x) would match a pair of equal values. In our proposal this is just a user error, and to match a pair you'd write case (x, y) if x == y:.
  • You have a special syntax for one-off matches. We should discuss this separately.

About the __match__ method

A pattern of the form C(<whatever>) calls C.__match__(target) (where target is the value from match target:), which should return an object whose attributes are used to match <whatever> or None if there's no match, or None if there's no match. The __match__() method is responsible for checking isinstance(), so that it can be used for duck typing.

In the general case __match__() can return some proxy that does attribute mapping and filtering. In many cases this can just return target unchanged. There's one magic attribute that the returned proxy may have, __match_args__, used for positional argument matching (see below). Usually __match_args__ is a class attribute.

The built-in pattern matching machinery is responsible of processing the return of `match():

  • If the pattern has the form C(), a non-None return from __match__() means success.

  • If the pattern uses keyword args, those are looked up as attributes on the proxy returned by __match__(). For example, Point(x=x, y=0) matches points whose y coordinate is zero, and binds x to the pattern's x coordinate.

  • If the pattern uses positional arguments, they are mapped to names using the __match_args__ attribute of the proxy.

  • For dataclasses we can have __match__() return self (after isinstance() check) and have a class attribute __match_args__ equal to list(C.__annotations__).

  • Not all positional arguments need to be given in the pattern, so Point(x) means the same as Point(x, _).

  • Positional and keyword args may be mixed. Examples: Point(x, y), Point(x, y=y), Point(y=y, x=x) all mean the same thing.

@ilevkivskyi
Copy link
Collaborator

Our API for __match__ is entirely different

It is actually not that different. The general logic is like this: interpreter asks an object: "Can you represent/unpack yourself as something like this?" and the object answers: "yes/no" and in "yes" how exactly. The details of how the answer is represented, as a custom object, or tuple + dictionary is less important. What is more important here is how much context you give to the object. There are two extremes here:

  • One is IIUC your proposal: give no context at all, so essentially in your case interpreter just asks "How can you represent/unpack yourself?"
  • The other extreme is in rejected ideas in my draft: give full nested pattern object, and ask the object "Can you represent/unpack yourself as this nested structure?"

I think both extremes are sub-optimal. For the second there are some arguments in the draft, for the first extreme I think the issue is that it is not flexible enough. I think the best option is somewhere in the middle, so that you ask an object "can you represent/unpack yourself like this: two things and z=third thing?"

In my proposal I pass the literal patterns as part of the context, but maybe it is still to much context. I think we should at least pass this data to __match__():

  • How many positional matches were requested
  • Which keys were requested
  • Are there * and/or **

One of the motivation for this is that I believe __match__() should raise an exception for impossible or ambiguous matches, and interpreter can't detect this because it doesn't know the mapping between named and positional representation of an object. In particular I think Point3D(1, 2) should raise an exception because it is ambiguous between Point(1, 2, _) and Point(_, 1, 2). Also Point2D(1, 2, y=1) should raise an exception, it can hide a hard to debug error where a different class was written (like it should probably be Point3D in the last example).

@ilevkivskyi
Copy link
Collaborator

ilevkivskyi commented May 14, 2020

Forgot about one thing: moving isinstance() to the __match__ is definitely a very good idea, +1 on this.

About other things:

  • I don't have strong attachment to coinciding names, exception is also a good option (you see I like exceptions :-))
  • Uppercase names as constants are probably fine. I was also thinking of this, but was stopped by the fact that there are no precedents in Python where name case influences semantics.
  • as vs case, I really like as more, but it may be personal.
  • +1 on name binding vs guards, let's be consistent with existing precedents.
  • (a, b) vs [a, b] probably also OK, it is good to be consistent with existing logic.
  • I was actually on the fence about allowing |, so I am fine with this. Note this may complicate the spec quite a bit, e.g. do we allow name | other_name, can one apply | to named sub-matches etc.
  • No else:, I am also +1

@ilevkivskyi
Copy link
Collaborator

One more deviation you forgot, what to do in case if all patterns fail? I propose to raise an exception. The main motivation is that most cases I have seen are rather fall where if match would work, or where I would want an exception. So probably whether to raise an exception or not may depend on whether we include if match syntax.

@gvanrossum
Copy link
Owner Author

About __match__:

We considered various APIs for __match__ but they were all pretty complex -- like yours, which has positional args, keyword args, *args, and **kwds. And most classes will only care about the first two.

Regarding your specific example, I don't see the ambiguity at all. Clearly Point3D(1, 2) intends to leave additional arguments of the end, not off the beginning. This is just like for regular calls -- you can leave arguments off the end (assuming they have default values) but not off the beginning (range(10) notwithstanding :-).

The matching of Point2D(1, 2, y=1) is first translated (at runtime, given Point2D.__match_args__ == ['x', 'y']) to Point2D(x=1, y=2, y=1) and the duplicate for y can be detected.

Do we at least agree that Point3D(1) is valid, and matches all 3D points whose x coordinate equals 1?

Apart from the complex signature, there's also the fact that in your proposal, every class must implement quite a bit of __match__ machinery (unless it needs nothing beyond the default object.__match__). In our proposal all the machinery of matching (including matching sub-patterns, or e.g. 0|1|2) is handled by the framework, and the __match__ implementation can focus on telling the framework which attributes exist and in which order. A proxy doesn't have to do any work unless a specific attribute is requested.

@ilevkivskyi
Copy link
Collaborator

This is just like for regular calls -- you can leave arguments off the end (assuming they have default values)

Yes, but what if all three arguments are required? I mean if one literally defines:

@dataclass
class Point3D:
    x: int
    y: int
    z: int

To me Point3D(1) looks a bit weird in such context. Of course the object can just return a bit more context to the interpreter, something like __required_match_args__ that would be an integer. And then interpreter can decide.

In general I agree that we should put as much burden on the interpreter as we can here. But give it the chance to detect all possible ambiguous/impossible matches.

@gvanrossum
Copy link
Owner Author

gvanrossum commented May 14, 2020

  • what to do in case if all patterns fail?

    See What should happen if no case matches? #5. We discussed this quite a bit; in the end @Tobias-Kohn convinced us that it should follow the lead of if ... elif ... elif .... But I can see that from a static checker's POV it's nicer if we fail hard here. It'll probably remain a point of discussion in the final PEP...

  • as vs. case: I can live with as. I agree that the extra level of indentation is tedious, but it's how every other compound statement in Python is done. Allowing multiple as clauses without intervening blocks feels like a syntax error (compare an if with an empty block -- we require pass to force the user to think about it), but if we decide to forego the extra indentation I could live with it.

  • x|y -- we can specify that the arguments to | must all be constants, or we can specify that each of the arms must bind the same set of variables. The latter does not feel hard to check.

  • Point3D with three required attributes: Even if you require all in the constructor, you might still have uses for omitting some. Certainly when using keywords it must be possible to leave some out. Consider a class User with mandatory id and name attributes -- I should be able to match User(id=42) or User(name='drew'). So I think it should be okay for positionals too.

@gvanrossum
Copy link
Owner Author

I don't have strong attachment to coinciding names, exception is also a good option (you see I like exceptions :-))

Compromise: we could specify that a pattern is only allowed to bind a name once (or only once in each branch of |, if we go that route). The "match two equal values" use case for (x, x) feels pretty theoretical, and it looks like it would require a somewhat complex implementation (essentially translating to (x, _x) if x == _x).

@ilevkivskyi
Copy link
Collaborator

Allowing multiple as clauses without intervening blocks feels like a syntax error

OK, let's drop it. It is not really needed if we have |.

we can specify that the arguments to | must all be constants [...]

Potentially we can also allow str() | int() as a shorthand for isinstance(..., (int, str)). So i would rather vote for the second option "each of the arms must bind the same set of variables".

So I think it should be okay for positionals too.

I totally agree about the keyword args, but not so much about positional. I think we should some opportunity to the implementer to add some strictness here, like __required_match_args__, it is fine not set it, then Point3D(1) will be accepted. Anyway, you at least convinced me that your approach is extensible in the direction I want (i.e. __required_match_args__ can be added later without breaking backwards compatibility).

@gvanrossum
Copy link
Owner Author

  • str() | int()

    Seems nice, happy to go this way.

  • __required_match_args__

    I can live with this. (Perhaps rename to __match_args_required__ so they all sort together? There's a bunch of bikeshedding to do here about the precise API.)

@gvanrossum
Copy link
Owner Author

Uppercase names as constants are probably fine. I was also thinking of this, but was stopped by the fact that there are precedents in Python where name case influences semantics.

Probably you meant "there are no precedents"?

@ilevkivskyi
Copy link
Collaborator

Perhaps rename to __match_args_required__ so they all sort together?

👍

Probably you meant "there are no precedents"?

Oops, sorry, yes that was a bad typo :-)

@gvanrossum
Copy link
Owner Author

Okay, fixed the typo in the original.

@viridia
Copy link
Collaborator

viridia commented May 15, 2020

I would not expect Point(1) to match Point(1, 2, 3) - however, I would expect Point(1, ...) or Point(1, _, _) to match.

@viridia
Copy link
Collaborator

viridia commented May 15, 2020

Also, as far as name matching goes: EIBTI. I still favor something like ?name or name? for variables, even though it's an extra character, I think clarity beats brevity in this case.

(It's interesting that in LISP, there's this concept of 'quoting' a name, so that you can refer to the name without dereferencing what it holds - much like the C++ reference operator, &. I thought about proposing a generalization of this for use with name matching, but also usable standalone; but on second thought given how many beginning programmers struggle with the concept of pointers in C, perhaps adding pointers to Python is not such a good idea. :) )

BTW one of the things I appreciated reading Ivan's PEP was the analysis of existing code patterns, specifically the statistics around usage of isinstance(). Any far-reaching change to a mature and popular language needs to be well-grounded in fact, and not just airy theorizing :)

I think the main points of divergence surround the semantics of __match__, and the dividing line between efficiency and flexibility; nor do I think we have exhausted the possible space of designs here. The VM is always going to be able to crawl through data structures much faster than end-user code, and will likely create fewer temporary objects along the way.

@gvanrossum
Copy link
Owner Author

I would not expect Point(1) to match Point(1, 2, 3)

Hm, then we definitely need to set __match_args_required__ = 3 in the Point class.

Point(1, ...) or Point(1, _, _).

I'm not a fan of using ellipses as part of pattern syntax. It's too ambiguous in docs etc. So let's use the second form.

Also, as far as name matching goes: EIBTI.

Then let me propose that the default is that a plain name is a variable binding for value extraction, while all dotted names are named constants, and if you really need to use a plain name as a named constants (e.g. example 5 in EXAMPLES.md uses none_rprimitive), you can mark it somehow.

Why mark named constants instead of bound variables? Because in 95% of the cases you'll want the latter. Match statements will be littered with things like Point(x, y) or [a, b, c] -- this is the reason why we have patterns at all -- whereas named constants without dots are relatively rare (and hopefully people will mostly use enums, which naturally have dots in their name, e.g. Color.red, or import modules, e.g. _locale.CHAR_MAX).

Also, let me propose that the marking would be a leading . -- this is unobtrusive, easy to type (no shift key needed), and extends the rule "names with a dot are constants".

As to data, example 5 is the only of the six motivating examples chosen by Brandt from real code that uses a dot-free name.

(But I'm still in favor of the lowercase/uppercase rule myself.)

@ilevkivskyi
Copy link
Collaborator

Also, let me propose that the marking would be a leading .

I really like this idea. I was lately thinking that requiring pattern to be dotted to be considered a reference rather than target will give some push towards using enums instead of plain integer constants, which is probably a good thing. But supporting only enums would be definitely too much, so using leading . (instead of something more cryptic like $ or ^) would be a good compromise. Some pros are:

  • It is explicit, rather than implicit like using upper-case
  • It is consistent, so is easier to remember

@Tobias-Kohn
Copy link
Collaborator

Also, let me propose that the marking would be a leading .

I also agree that the single dot might be a good compromise here. It also feels not too far off as similar syntax is used for imports.

In principle, however, we could also consider to have something like $, but then not as a load-marker, but rather as a proper evaluation operator similar to string interpolation (as I just described in issue #1).

@Tobias-Kohn
Copy link
Collaborator

First, please let me also welcome Ivan to the team.

It was very interesting to read Ivan's draft PEP. I like the different perspective and found the analysis of use cases brilliant. However, there are a few elements with which I find myself rather disagreeing (apart from many points in which we probably all agree).

Overall, I have the impression that Ivan's proposal is primarily coming from types, ADTs, and the perspective of performance (please correct me if I am wrong, though). Many of the proposal's elements are about strictness in one way or another. For instance, performing an isinstance()-check before calling __match__, the @sealed decorator, raising an UnmatchedValue exception, or making sure that the number of positional arguments fit. For Python, however, I feel we should rather follow a much more lenient "duck typing" approach, more in line with that type annotations are never enforced by the compiler, but only give hints to type checkers.


Let me pick up a few specific and minor points (I will add more comments to the corresponding issues).

  • Indentation level: there has been quite a bit of discussion about the identation of pattern matching with many proposals preferring a mode where indentation is "saved". So, this is certainly a point we will have to discuss. To be honest, I do not see the problem with indentation here in the first place or why it is something that should be avoided. One the one hand, we have "doubly" indented code in case of classes already as well. One the other hand, I do not think that indentation here is such a big concern that it is worth giving up the existing consistency.
  • Ellipsis: one of the reasons why I really like the underscore as the "don't care" wildcard (apart from it being used in almost all other pattern matching languages) is that it is a legal name and therefore (in principle) does not require any special syntax or treatment as such. The ellipsis, on the other hand, is a legal value in Python that might be used in production code. Using the ellipsis as a marker for not caring about the value seems thus similarly problematic to me like using None to indicate the absence of a value in other cases (although that value could be None itself). Finally, the ellipsis always looks to me like there are several elements missing, whereas the underscore feels more precise.
  • as vs. case: I am not a fan of the "fewer keystrokes" argument and not convinced that in a case like this, it really is valid. That case makes it more similar to switch-statements might even be an advantage as I believe there are more Python programmers with a background in C than with a background in functional languages. However, that as is already an existing keyword and would therefore work better is a very good and valid point. Since the case (or as) would only occur as keyword in well defined positions, this is probably not that much of an issue, though. I am more worried about match there. But it is a good point to consider, nonetheless. (I personally lean more towards case, but I am open to as as well).
  • the initial literal values passed in should not be included in the return: I think that a constraint like this makes the interface rather difficult and hard to use.
  • c(): I originally had (almost) the same syntax in my proposal, allowing int() to mean that the value must be expressible as an int—in addition to _: int. I think for the sake of consistency, we might want to support this.

This was referenced May 17, 2020
@ilevkivskyi
Copy link
Collaborator

@Tobias-Kohn some of your comments are in a sense outdated, because Guido already convinced me. The updated version (that I propose to take as a starting point) is in PR #25

@Tobias-Kohn
Copy link
Collaborator

Yes, I know that some of it is already outdated or really just a minor issue. But I felt your PEP deserves a proper reaction, even if it ends up being basically just a "I agree with Guido as well" :-).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
fully pepped Issues that have been fully documented in the PEP
Projects
None yet
Development

No branches or pull requests

4 participants