Dispatching to list subtypes #72

Open
shoyer opened this issue Feb 26, 2018 · 35 comments

@shoyer

shoyer commented Feb 26, 2018

As a (temporary?) alternative to full support for Python typing (#69), I'd like to propose adding multipledispatch.TypedList. The use case here is separate functions for different list subtypes, e.g., lists of strings vs. lists of integers. See pydata/xarray#1938 for discussion.

I have an example implementation here and am happy to work on putting together a PR if desired:
https://colab.research.google.com/drive/18zdyUpWLNFzFaz08GUOC5vs1GxE_jHg-#scrollTo=XDL0cBeS-lub

Example usage:

@dispatch(TypedList[int])
def f(args):
  print('integers:', args)

@dispatch(TypedList[str])
def f(args):
  print('strings:', args)

@dispatch(TypedList[str, int])
def f(args):
  print('mixed str-int:', args)

f([1, 2])  # integers: [1, 2]
f([1, 2, 'foo'])  # mixed str-int: [1, 2, 'foo']
f(['foo', 'bar'])  # strings: ['foo', 'bar']
f([[1, 2]])  # NotImplementedError: Could not find signature for f: <TypedList[list]>

The exact public-facing API is up for discussion. I'm tentatively calling this TypedList for clarity and to distinguish it from typing.List (it is actually equivalent to typing.List[typing.Union[...]]).
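
For reference, here is a minimal sketch of one way such a parameterized list type could be built (illustrative only; this is not the implementation in the Colab notebook above):

class _TypedListMeta(type):
    """Metaclass whose __getitem__ creates and caches one list subclass per
    frozenset of element types, so that TypedList[int] is always the same class."""
    _cache = {}

    def __getitem__(cls, types):
        if not isinstance(types, tuple):
            types = (types,)
        key = frozenset(types)
        if key not in _TypedListMeta._cache:
            name = 'TypedList[%s]' % ', '.join(sorted(t.__name__ for t in key))
            _TypedListMeta._cache[key] = _TypedListMeta(name, (cls,), {'types': key})
        return _TypedListMeta._cache[key]


class TypedList(list, metaclass=_TypedListMeta):
    @classmethod
    def of(cls, values):
        # Classify a plain sequence by the unique types of its elements.
        return cls[tuple({type(v) for v in values})](values)


assert type(TypedList.of([1, 2])) is TypedList[int]
assert type(TypedList.of([1, 'foo'])) is TypedList[str, int]

For plain lists to dispatch as in the example above, something would still need to box the argument first, e.g. a handler registered for list (the auto-boxing discussed below).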

@mrocklin
Owner

cc @llllllllll @mariusvniekerk

@hameerabbasi

hameerabbasi commented Feb 26, 2018

I think that for tuples, it might be useful to have Tuple (exact match) and VarTuple (this approach). Tuples are often fixed length.

Edit: Or, more generally, Iterable and VarIterable.

@llllllllll
Collaborator

I don't like that this merges the idea of typed lists and regular lists. One of the advantages of the implementation in blaze is that it makes the two kinds of dispatches distinct. This is important when working on a large code base with thousands of dispatch definitions, many of which already dispatch on list. The proposed implementation adds ambiguity when you register a dispatch for both list and TypedList. For example, what should the result of the following code be?

@dispatch(TypedList[int])
def f(xs):
    print('typed')


@dispatch(list)
def f(xs):
    print('regular')

f([1, 2, 3])

This currently prints regular; however, if you change the definition order of the two functions you will get typed. I find this confusing given that:

In [11]: issubclass(TypedList[int], list)
Out[11]: False

In [12]: issubclass(list, TypedList[int])
Out[12]: False

Given that we can get the same auto-boxing behavior with two lines of Python, I am not sure this is a big feature. If anything, it forces us to explicitly register the list instance so we know that this has been done and we understand how dispatching will happen in the future.

Regarding the use of frozenset vs unique: you are correct that the frozenset version will create fewer classes and probably be a bit faster, but due to how the subclassing machinery works it still produces the correct result.

Finally, the VarArgs object is duck-typed like an immutable list, so it should be usable in almost all cases without worrying about the actual type; this is Python, after all.

I would also like to make my stance on typing clear: I don't think that supporting typing-style annotations in multipledispatch is a good idea. typing annotations are designed for static analysis tools like mypy; they should not affect the runtime behavior of your program.

@hameerabbasi

I, on the other hand, support typing. That said, Joe is right about the overlap between list and TypedList, and how difficult it could be in a large codebase.

I suggest writing a simple decorator that converts VarArgs/TypedList back into a plain list for any function that wants one. It should be trivial.
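
A rough sketch of such a decorator, assuming a VarArgs-style box that wraps a sequence (the VarArgs stub and the unwrap_boxes name below are purely illustrative, not an existing multipledispatch API):

from functools import wraps


class VarArgs(tuple):
    """Stand-in for the proposed box type; only used for the isinstance check below."""


def unwrap_boxes(func):
    """Convert any VarArgs-style box back into a plain list before calling the
    decorated implementation, so the function body never sees the wrapper."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        unboxed = [list(a) if isinstance(a, VarArgs) else a for a in args]
        return func(*unboxed, **kwargs)
    return wrapper

You would stack this under @dispatch, so the registered signatures still see the box types while the function body works with plain lists.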

@shoyer
Author

shoyer commented Feb 28, 2018

I'm also happy with a solution based more directly on @llllllllll's VarArgs. It is indeed simple to write a decorator to automatically unwrap dispatched types and to overload list directly, so I'm all for leaving that out of multipledispatch.

The harder part is writing the TypedList/VarArgs type in the first place, which involves some subtlety and metaclasses. Is there support for adding some form of this into multipledispatch?

As for typing, I'll note that there are at least two ways to support it:

  1. Types from typing in the @dispatch decorator. This form is very explicit:

@dispatch(List[Union[str, int]])
def f(x):
    ...

  2. Automatically extracting types from decorated functions for dispatching (from WIP: Experiment to use pytypes to add support for python type hints #69). This form is much more magical:

@dispatch()
def f(x: List[Union[str, int]]):
    ...

I would support version (1) directly in multipledispatch.dispatch if it can be done in a performant way. I'm not opposed to version (2), but it should probably use a separate name, e.g., dispatch_by_type_hints or something similar.

@mrocklin
Owner

Note that multipledispatch does support PEP 484 type hints: #65

@shoyer
Author

shoyer commented Feb 28, 2018

Note that multipledispatch does support PEP 484 type hints: #65

Oh, interesting.

I'm not entirely sure it's a good idea to include that in the default dispatch decorator, to be honest :). Types for dispatching with multipledispatch and types for type checking (e.g., with mypy) have some small but important differences:

  • multipledispatch only accepts exact type matches (not even subclasses) for runtime performance (edit: not true)
  • type checkers will even introspect values to determine compound types (e.g., List[str]) and have various other small differences from runtime type checking (e.g., integers are valid for float arguments).

I can see cases where I might use different types, e.g., list for multipledispatch but List[str] for type checking.

@llllllllll
Collaborator

multipledispatch only accepts exact type matches (not even subclasses) for runtime performance

Multiple dispatch supports subclasses correctly. It uses the same linearization algorithm as the Python MRO to partially order the types to search.
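
For example (a quick self-contained illustration, not taken from the multipledispatch docs):

from multipledispatch import dispatch


class Base(object):
    pass


class Child(Base):
    pass


@dispatch(Base)
def g(x):
    return 'base'


print(g(Child()))  # prints 'base': Child falls back to the Base implementation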

@shoyer
Author

shoyer commented Feb 28, 2018

Multiple dispatch supports subclasses correctly. It uses the same linearization algorithm as the Python MRO to partially order the types to search.

Oops, my mistake.

@hameerabbasi

Subtypes can come in useful. For example, I wouldn't feel okay asking XArray to change its code every time I add a new matrix format. And it will be useful in sparse for checking that input arguments are a SparseArray, instead of listing every combination.

In light of @llllllllll's comments, I'm now in favour of @llllllllll's implementation, maybe with a simple wrapper that extracts the list from VarArgs so that the function body doesn't have to unwrap the arguments itself, and possibly another wrapper that converts particular arguments to VarArgs.

@francesco-ballarin
Contributor

Hi everybody,
in a library that I am currently developing I have done something similar (typed lists); see e.g. the following tests for the notation:

# Test class with list_of inputs
def test_enabled_list_of():
    class E(object):
        @dispatch(list_of(int))
        def __init__(self, arg):
            self.arg = arg[0]
    assert E([1, 2]).arg == 1
    
# Test class with list_of inputs, with multiple types for each list
def test_enabled_list_of_multiple():
    class E(object):
        @dispatch(list_of(float))
        def __init__(self, arg):
            self.arg = arg[0]
            
        @dispatch(list_of((float, int)))
        def __init__(self, arg):
            self.arg = 2*arg[0]
            
        @dispatch(list_of((float, str)))
        def __init__(self, arg):
            self.arg = 3*arg[0]
            
    assert E([1.]).arg == 1. and isinstance(E([1., 2.]).arg, float)
    assert E(["a"]).arg == "aaa"
    assert E([1]).arg == 2 and isinstance(E([1]).arg, int)
    assert E([1., 2.]).arg == 1. and isinstance(E([1., 2.]).arg, float)
    assert E([4., 3]).arg == 8. and isinstance(E([4., 3]).arg, float)
    assert E([4, 3.]).arg == 8 and isinstance(E([4, 3.]).arg, int)
    assert E([2., "a"]).arg == 6. and isinstance(E([2., "a"]).arg, float)
    assert raises(UnavailableSignatureError, lambda: E([object()]))

# Test class with list_of inputs, with multiple types for each list, with ambiguous types
def test_enabled_list_of_multiple_ambiguous():
    class E(object):
        @dispatch(list_of(float))
        def __init__(self, arg):
            self.arg = arg[0]
            
        @dispatch(list_of((float, int)))
        def __init__(self, arg):
            self.arg = 2*arg[0]
            
        @dispatch(list_of((float, int, str)))
        def __init__(self, arg):
            self.arg = 3*arg[0]
    
    try:
        # Ambiguity checking is delayed to the first time the method is evaluated
        E = E(None)
        # The ambiguity for list_of(float) is resolved by the first method; however,
        # the ambiguity for list_of(int) cannot be resolved, as both the second and
        # the third method could be used for it.
    except AmbiguousSignatureError:
        pass  # this has failed as expected
    else:
        raise RuntimeError("This test should have failed due to list_of(int) ambiguity")
        
    # Ambiguity can be fixed by explicitly providing an overloaded method for list_of(int)
    class F(object):
        @dispatch(list_of(float))
        def __init__(self, arg):
            self.arg = arg[0]
            
        @dispatch(list_of(int))
        def __init__(self, arg):
            self.arg = 4*arg[0]
            
        @dispatch(list_of((float, int)))
        def __init__(self, arg):
            self.arg = 2*arg[0]
            
        @dispatch(list_of((float, int, str)))
        def __init__(self, arg):
            self.arg = 3*arg[0]
            
    assert F([1.]).arg == 1. and isinstance(F([1., 2.]).arg, float)
    assert F(["a"]).arg == "aaa"
    assert F([1]).arg == 4 and isinstance(F([1]).arg, int)
    assert F([1., 2.]).arg == 1. and isinstance(F([1., 2.]).arg, float)
    assert F([4., 3]).arg == 8. and isinstance(F([4., 3]).arg, float)
    assert F([4, 3.]).arg == 8 and isinstance(F([4, 3.]).arg, int)
    assert F([2., "a"]).arg == 6. and isinstance(F([2., "a"]).arg, float)

# Test inheritance in combination with list_of
def test_inheritance_for_list_of():
    class A(object):
        pass
    class B(object):
        pass
    class C(A):
        pass
    
    @dispatch(list_of(A))
    def f(x):
        return 'a'

    @dispatch(list_of(B))
    def f(x):
        return 'b'

    assert f([A(), A()]) == 'a'
    assert f([B(), B()]) == 'b'
    assert f([C(), C()]) == 'a'
    assert f([C(), A()]) == 'a'
    assert raises(UnavailableSignatureError, lambda: f(B(), C()))

Right now the code is a bit of a hack on top of multipledispatch.
There is definitely some overlap with PR #69 [*], but I can share the code in a gist if you want to have a look [**]. Let me know.

[*] I created a wrapper called list_of instead of using the typing module, which I thought was not available in Python 2, and I felt that list_of((type1, type2)) was closer to the existing multipledispatch notation than List[Union[type1, type2]]. Even though the typing module is probably superior to my approach, maybe some of my test cases could still be useful.

[**] I am proposing a gist and not a PR because the code would need a bit of cleanup in any case.

@shoyer
Author

shoyer commented Mar 4, 2018

Finally, the VarArgs object is duck-typed like an immutable list, so it should be usable in almost all cases without worrying about the actual type; this is Python, after all.

@llllllllll This is one part that I don't love about your solution, for two reasons:

  1. Reliability: relying on duck typing means VarArgs objects will be used with libraries that weren't built for it. As you note, this usually works, but there are many exceptions where base types (especially tuple) are treated differently. Offhand, I can think of examples in NumPy, Dask and Xarray. Even if duck types currently work, relying on that behavior introduces the risk of being broken by future releases, since upstream libraries rarely test for full duck-type compatibility.
  2. Performance: casting a list or tuple to VarArgs and then back to a base sequence type (if desired) entails additional overhead, because the full sequence is copied for each conversion. For long sequences, this overhead could be non-negligible. I would rather only pay the price of iterating over a sequence once, to determine its unique types.

So again, I really don't care exactly how we spell this, but clearly there's a strong need. To keep things moving forward, let me make another proposal of what the public API could look like here:

from multipledispatch import dispatch, VarArgs

@dispatch(VarArgs[int])
def f(varargs):
    print('integers:', varargs.value)

@dispatch(VarArgs[str])
def f(varargs):
    print('strings:', varargs.value)

@dispatch(list)
def f(value):
    return f(VarArgs(value))

f([1, 2])  # integers: [1, 2]
f(['foo', 'bar'])  # strings: ['foo', 'bar']
f([[1, 2]])  # NotImplementedError: Could not find signature for f: <VarArgs[list]>

My main goal with this version is to keep things explicit and maintainable. We'll leave syntactic sugar up to downstream libraries:

  • Explicit wrapping with VarArgs() is required to wrap a sequence (and infer the combined type)
  • Explicit unwrapping with varargs.value is required to pull out the wrapped sequence.
  • We only support a single type VarArgs for sequences of values with an inferred type. Downstream libraries can make their own specialized functions for different sequence types if desired.

We do need a metaclass to implement __getitem__ on the VarArgs type, but the usage is pretty tame as far as metaclasses go. This magic also has clear upsides: we need dynamically created classes to make this work at all, and this at least provides a clear base class and constructor for the specialized VarArgs types.

@shoyer
Author

shoyer commented Mar 4, 2018

We could easily make VarArgs itself implement the immutable sequence API. But it shouldn't inherit from tuple.

@hameerabbasi

f([1, 2]) # integers: [1, 2]

Don't you mean f(VarArgs([1, 2]))?

@shoyer
Author

shoyer commented Mar 4, 2018

Don't you mean f(VarArgs([1, 2]))?

In this example, my function is explicitly unwrapping the original value with varargs.value.

@hameerabbasi

Ah, missed that. My bad.

@hameerabbasi

Is there progress on this? I'd be willing to lend a hand if required. I'd really like XArray/sparse integration to be okay-ish before late May. I'm submitting a conference paper to SciPy 2018 and it'd be really nice to show this in action.

@llllllllll A middle ground would be to wrap the sequence instead of converting it. That would get rid of the overhead, as @shoyer suggested.

@shoyer
Author

shoyer commented Apr 16, 2018

I was mostly waiting to see what @llllllllll thought of my last proposal #72 (comment). But I agree, it would be nice to get something in here so we can use it downstream.

@llllllllll
Collaborator

Allowing mutation of the type can lead to issues. What happens if you create an instance of VarArgs[int, float] and then append a string?

@shoyer
Author

shoyer commented Apr 16, 2018

Allowing mutation of the type can lead to issues. What happens if you create an instance of VarArgs[int, float] and then append a string?

Yes, this would be bad, but I think it would be enough to clearly document that you shouldn't do this? VarArgs is magical enough that I expect most developers will need to refer to the documentation to understand exactly what it does anyway.

I suppose we could also restrict VarArgs to wrapping immutable types like tuple/frozenset.

@llllllllll
Collaborator

I think it should at least always be the same interface; you use tuples and frozensets very differently. Also, is the performance cost of copying the argument list into a tuple really that high? What's the longest list you plan on passing to concat, 100 elements?

@shoyer
Author

shoyer commented Apr 16, 2018

In xarray, ~10,000 elements is probably a reasonable upper bound on concat. Still, I agree that the overhead of converting to tuple will be negligible for all the use cases I can currently foresee.

So I'm OK with VarArgs always converting its arguments to a tuple. I would still rather it use composition to hold a tuple in varargs.value and implement the sequence API via forwarding methods, rather than subclassing tuple directly. Unnecessary subclassing makes me nervous.
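
To make that concrete, here is a rough sketch of the composition I have in mind (illustrative only; the parameterized VarArgs[...] subtypes used for dispatch would still come from a metaclass __getitem__ and are omitted here):

from collections.abc import Sequence


class VarArgs(Sequence):
    """Holds a tuple in .value and forwards the sequence API to it,
    instead of subclassing tuple directly."""

    def __init__(self, iterable=()):
        self.value = tuple(iterable)  # single O(n) copy at construction time

    # With __getitem__ and __len__, collections.abc.Sequence fills in
    # __contains__, __iter__, __reversed__, index() and count() for us.
    def __getitem__(self, index):
        return self.value[index]

    def __len__(self):
        return len(self.value)

    def __repr__(self):
        return 'VarArgs(%r)' % (self.value,)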

@hameerabbasi

+1 for composition
-1 for copying (it needs O(n) RAM as well as the performance hit)
+1 for allowing mutable types and clearly documenting that mutations should preserve the types.

@mrocklin
Owner

-1 for copying (it needs O(n) RAM as well as the performance hit)

Personally I'm fine with copying. I don't think that this will have an impact on performance. @hameerabbasi you may be concerned about copying all of the arrays when passing a list of arrays. I agree that this would be bad. However, I don't think that is actually what is being discussed here. I believe that this would just cause a copy of the container holding all of the references. I don't think that asymptotic arguments should necessarily hold sway here.

In [1]: import numpy as np

In [2]: x = np.arange(100000000)

In [3]: L = [x] * 10000

In [4]: %time _ = tuple(L)
CPU times: user 159 µs, sys: 31 µs, total: 190 µs
Wall time: 199 µs

I may not be fully understanding the conversation here though.

@hameerabbasi

hameerabbasi commented Apr 19, 2018

I understand that fully (the part where you only copy references). 😅 I might be biased here, but I'm pretty sure that:

  1. There are really no use-cases for mutating a collection with any types other than the initial ones.
  2. It actually allows mutations on the original object (keeping the types), if needed, rather than blocking them outright.

I understand the lists themselves are likely to be very small (and the memory/performance overhead insignificant), but my OCD screams at me: "O(n) memory when it could be O(1)? Madness!" Feel free to ignore it. 😅

@mrocklin
Owner

I understand the lists themselves are likely to be very small (and the memory/performance overhead insignificant), but my OCD screams at me: "O(n) memory when it could be O(1)? Madness!" Feel free to ignore it.

OK, good to know. Does that mean that you're removing your -1 above? I generally interpret -1 as "I am sufficiently against this idea that I block it from moving forward. I require that we find an alternative."

There are really no use-cases for mutating a collection with any types other than the initial ones.
It actually allows mutations on the original object, if needed, rather than blocking them outright.

OK, these might be more serious reasons. I'm not up-to-date on this topic enough to judge.

@hameerabbasi

Yes, I'll remove the -1. Another reason I thought of: if you were to do it again on a mutated collection, it would take another O(n) pass instead of using the cached value.

@llllllllll
Collaborator

llllllllll commented Apr 19, 2018

I am fine with having a tuple attribute instead of subclassing and implementing a sequence API that forwards to this tuple. This has the added benefit of not making the type partially orderable with tuple itself.

Allowing mutation will just cause hard-to-find bugs, unless we wrap the mutable collection in some magic object that changes type based on the inserted elements; but then we are back to O(n) to check all the inserted elements, so we might as well copy. I also know that some people dislike dynamically changing your __class__. The code I foresee is:

@dispatch(TypedSequence[int])
def foldl(f, sequence, starting_value=None):
    if starting_value is not None:
        sequence.insert(0, starting_value)
        return foldl(f, sequence)
    # ...

foldl(f, [1, 2, 3], starting_value=Class())

In the recursive call, the sequence will still be typed as [int], but now we have just inserted a non-int value. The actual insert site doesn't obviously look incorrect, nor does the call site, assuming Class() can be added to integers, but this code would do the wrong thing. If we copied into a new typed sequence box, then we would get something of type [int, Class], which would search for a new dispatch.

I would also like to make a side note about the name VarArgs: in blaze this was built to handle expressions with variadic arguments, and the actual APIs that produce them are built from *args. If we generalized their use to arbitrary sequences, I think we could use another name like TypedSequence or TypedList or something, because out of context VarArgs seems a bit weird.

@hameerabbasi

Wouldn't this sort of thing be fixed by making the decorator @dispatch(TypedSequence[int], Optional[int])? I get that someone could write code that does this, but is there a use-case for this?

@llllllllll
Collaborator

multipledispatch doesn't kick in for keyword arguments. I think the function I wrote is a very real example: you could definitely have an optimized fast path for foldl with all ints vs. foldl with arbitrary types.

@hameerabbasi

Ah, fair enough. I'm on board with copying, but I'd rather we built a keyword argument feature into this library. Something like @dispatch(TypedSequence[int], starting_value=Optional[int]).

@llllllllll
Collaborator

I agree that it would be useful to support keyword arguments, but that should be discussed in a new issue.

@shoyer
Author

shoyer commented Apr 19, 2018 via email

@hameerabbasi

+1 for TypedSequence and getting this in as quickly as possible. 😅

@shoyer
Author

shoyer commented Apr 24, 2018 via email
