type-check specific datatype fields concisely and remove the class name argument #5723

Merged
merged 32 commits into pantsbuild:master from cosmicexplorer:typed-datatypes on Apr 27, 2018

Conversation

@cosmicexplorer
Contributor

cosmicexplorer commented Apr 19, 2018

Problem

See #5716. We had to introduce a ton of boilerplate to check argument types for some very simple usages of datatype subclasses in the tests in #5703. This is one way to make that easier.

Solution

  • Move TypeConstraint and subclasses from src/python/pants/engine/addressable.py to src/python/pants/util/objects.py, where they probably should have been anyway, to avoid an import cycle.
  • Remove the class name argument from datatype() (and update all call sites).
  • Allow specifying a tuple ('field_name', FieldType) in the list of fields to datatype() to have the field type-checked by an Exactly(FieldType) type constraint at construction.
  • Override __str__() and __repr__() for datatype objects, and add testing for the str and repr.
  • Add testing for informative error messages in datatype construction.

Result

We can now concisely declare datatype objects without having to repeat the class name, and can opt-in to a type check for specific fields.
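
A minimal sketch of the resulting call-site syntax, based on the examples discussed later in this thread (the class and field names are illustrative, and the exact exception raised on a type mismatch is not shown):

from pants.util.objects import datatype

# 'bin_path' is declared with a type, so it is checked against an Exactly(str)
# constraint when the object is constructed.
class JavacVersionExecutionRequest(datatype([('bin_path', str)])):
  pass

req = JavacVersionExecutionRequest(bin_path='/usr/bin/javac')
print(req.bin_path)

# JavacVersionExecutionRequest(bin_path=3) would fail at construction time.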

@illicitonion
Contributor

illicitonion left a comment

I'm pretty excited about this :) (And would possibly even suggest extracting this into its own library for re-use in the future). Will leave some time for others to chime in

for name, cls in field_decls.items():
  if isinstance(cls, TypeDecl):
    processed_type_decls[name] = cls
    continue

@illicitonion

illicitonion Apr 19, 2018

Contributor

Rather than continue inside an if, this looks like it would sit pretty naturally as a chain of elifs?
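
A sketch of the elif form being suggested, reusing the names from the quoted snippet; the list branch is taken from a later part of the same diff, and the helper it calls is hypothetical:

for name, cls in field_decls.items():
  if isinstance(cls, TypeDecl):
    processed_type_decls[name] = cls
  elif isinstance(cls, list):
    # Hypothetical helper standing in for the elided body of the list branch.
    processed_type_decls[name] = TypeDecl.from_type_list(cls)
  else:
    raise ValueError("unrecognized field declaration for '{}': {!r}".format(name, cls))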

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

I'm pretty sure you're right, I was just having trouble reading the elif version in my head when I wrote this bit, so used continue. I will make this change.

@cosmicexplorer

cosmicexplorer Apr 21, 2018

Contributor

I converted this into Exactly.from_type_or_collection() when I moved TypeConstraint into objects.py. This actually has the same control flow as the original, but return instead of continue is a little bit more sane (imo) -- let me know if I'm wrong about that.

@stuhood

Member

stuhood commented Apr 19, 2018

I'm on the fence about this in general, and we should talk about that before landing this.

But one concrete thing that I think is probably not the right tradeoff, if we do decide to do this, is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (i.e., by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

@baroquebobcat
Contributor

baroquebobcat left a comment

Did a first pass. I have a few style / assertion comments on tests and some suggestions for type constraints.

  SomeTypedDatatype()
  self.fail("should have errored: not providing all constructor fields")
except TypedDatatypeInstanceConstructionError as e:
  self.assertIn('my_val', str(e))

@baroquebobcat

baroquebobcat Apr 19, 2018

Contributor

I'd like to see more tests that check parts of the error messages. I think you could elide some of the comments like # not providing all the fields if there were an assertion that checked for that information.

One pattern that might be helpful is to use the context manager provided by assertRaises. For example

with self.assertRaises(MyError) as cm:
  call()
self.assertIn('my error is the best error', str(cm.exception))

Then you can rewrite the above try, fail, except, assert as

with self.assertRaises(TypedDatatypeInstanceConstructionError) as cm:
  SomeTypedDatatype()

self.assertIn('my_val', str(cm.exception))
self.assertIn('missing field', str(cm.exception)) # or something like it

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

I really appreciate this feedback. I didn't make it clear in the OP, but I absolutely want to make sure error messages (and test cases) are deeply considered and rock solid for a feature like this. I didn't look deeply enough into assertRaises to realize that it yielded a context object and this seems really actionable. With respect to the code example you gave, I would probably do that and extract it into a helper method of the test class. The code samples you've provided are clear enough for me to act on.
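
A rough sketch of what extracting that pattern into a helper method of the test class might look like (the method name and signature here are hypothetical):

def assert_raises_with_message(self, exc_type, expected_substrings, func, *args, **kwargs):
  """Assert that calling func raises exc_type and that its message mentions each substring."""
  with self.assertRaises(exc_type) as cm:
    func(*args, **kwargs)
  for substring in expected_substrings:
    self.assertIn(substring, str(cm.exception))
  return cm.exception

# The earlier try/fail/except/assert block then collapses to something like:
self.assert_raises_with_message(
  TypedDatatypeInstanceConstructionError, ['my_val'], SomeTypedDatatype)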

@cosmicexplorer

cosmicexplorer Apr 23, 2018

Contributor

I canonicalized the __str__() and __repr__() of TypedDatatype, as well as the messages in the exceptions it raises, and added tests for all the failure cases in 133ba06. Let me know if this is what you were looking for.

other_union_val = UnionFieldTypedDatatype(an_arg=3)
self.assertIn('UnionFieldTypedDatatype', repr(other_union_val))
self.assertIn('an_arg', repr(other_union_val))
self.assertIn('3', repr(other_union_val))

@baroquebobcat

baroquebobcat Apr 19, 2018

Contributor

I like that you've included tests of the repr. I'd like the assertions to be a bit more concrete though. I think testing the full repr would make it clearer what the repr is supposed to look like. How would you feel about that?

Particularly since the ideal for repr is to have a round-trippable string representation that if it were evaluated as python would recreate the object.
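
For concreteness, the kind of full-repr assertion being asked for might look like this; the exact repr format shown is an assumption about how the class renders itself:

other_union_val = UnionFieldTypedDatatype(an_arg=3)
# Assert the complete repr rather than substrings.
self.assertEqual("UnionFieldTypedDatatype(an_arg=3)", repr(other_union_val))
# Round-trippable: evaluating the repr should recreate an equal object.
self.assertEqual(other_union_val, eval(repr(other_union_val)))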

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

I 100% agree. I have thoughts on stu's comment above (e.g. if there are no named args, then that would change the repr), but validating the repr in a structured way seems really appropriate for this use case.

@cosmicexplorer

cosmicexplorer Apr 23, 2018

Contributor

I mentioned this in a response to your earlier comment: check 133ba06 to see if the testing added there addresses your concern here.

"""A wrapper over namedtuple which accepts a dict of field names and types.
This can be used to very concisely define classes which have fields that are
type-checked at construction.

@baroquebobcat

baroquebobcat Apr 19, 2018

Contributor

Do we want exact type matching, or allow subtypes? Currently, this allows subtypes. That might be tricky for products in v2. I think v2's type system is primarily based on exact matches of types, either as a singular type or as a type within a union.

If this is for datatypes outside v2, that is probably fine. Either way, I think the entry point here should call out how types are matched.

If you want, you could use the existing TypeConstraint classes that are used by the engine rules. https://github.com/pantsbuild/pants/blob/master/src/python/pants/engine/addressable.py#L31

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

I think the right move is probably to do exact type matching for the reasons you describe. I had seen the TypeConstraint classes but hadn't used them -- that seems like the right way to tackle this.

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

Also, for union types, exact type matching means you can use a set instead of a list to record the types that could match (and determine whether the type matches instantly), which in itself sounds like a fantastic argument for exact type checking.
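
A minimal sketch of the exact-match idea, separate from the engine's actual TypeConstraint classes, just to illustrate why exact matching allows a set-based membership check:

class ExactTypeMatcher(object):
  """Matches a value only if its type is exactly one of the given types (no subtypes)."""

  def __init__(self, *types):
    self._types = frozenset(types)

  def satisfied_by(self, obj):
    # type(obj) is looked up in a frozenset, so the union check is O(1)
    # no matter how many types it contains.
    return type(obj) in self._types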

@cosmicexplorer

cosmicexplorer Apr 21, 2018

Contributor

Looked it over, and found TypeConstraint to be a drop-in replacement for the TypeMatcher I had, so used that instead. Had to move TypeConstraint into objects.py to avoid a cyclic dependency, but it was easy enough to move and passes tests locally as of ae70e02. The only difference is that it doesn't convert its self._types into a frozenset -- this seems like a tiny change so I will probably add that.

except TypeDecl.ConstructionError as e:
  invalid_type_decls.append("in field '{}': {}".format(name, e))
  continue
if isinstance(cls, list):

@baroquebobcat

baroquebobcat Apr 19, 2018

Contributor

maybe check for (set, list, tuple)? Or do we want to enforce that it's a list specifically?

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

I'm not sure of the right answer to that. I do think this functionality should be extracted into a factory method of the TypeDecl class, which is easy and I can do.

class YetAnotherNamedTypedDatatype(typed_datatype(
    'YetAnotherNamedTypedDatatype', {
      'nothing_special': str,
      'just_another_arg': StrOrInt,

@baroquebobcat

baroquebobcat Apr 19, 2018

Contributor

You could also create an explicit union here.

@cosmicexplorer

cosmicexplorer Apr 19, 2018

Contributor

That's true, and if I wanted to use a field like StrOrInt, I think that would be much more appropriate as its own typed_datatype (which could then have its own set of @rules converting it into some other type, if necessary).

@cosmicexplorer

cosmicexplorer Apr 21, 2018

Contributor

Using Exactly.from_type_or_collection(), I converted this into [int, str], which is then converted into an Exactly type constraint.
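
A sketch of what a conversion like Exactly.from_type_or_collection() could look like; the method is named in the comments above, but this body is an illustration rather than the actual implementation:

@classmethod
def from_type_or_collection(cls, decl):
  """Build an Exactly constraint from a single type or a collection of types."""
  if isinstance(decl, type):
    return cls(decl)
  if isinstance(decl, (list, tuple, set)):
    return cls(*decl)
  raise TypeError("expected a type or a collection of types, got: {!r}".format(decl))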

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 19, 2018

But one concrete thing that I think is probably not the right tradeoff, if we do decide to do this, is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (i.e., by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

I totally expected to have an extended discussion about the interface and/or impl before getting merged! I can see typed datatypes being very cool and I want to help ensure that code that lots of stuff depends on is still code that we should feel free to reach in and improve.

I hadn't remembered until you said this, but I have definitely seen the utility of not having field names, if each field is then required to have its own distinct type wrt the other fields of the typed_datatype. (This reminds me again of how someone mentioned @rules could just be types in themselves in #5580.)

This would encourage/require differentiating arguments not by name, but by type, which would then mean that usage context is canonicalized in the type itself, and not just in the name of the field with that type. This is a good thing, because then we can build up a hierarchy of types that add context to primitives like PathGlobs and Snapshot, which is good for the same reasons that types are good. People don't seem to like wrapper classes for some reason, but I find it much easier to follow code with thin wrapper types that provide their own usage context compared to e.g. two different fields of the snapshot type, differentiated only by name. A quick example:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': Snapshot,
  'source_snapshot': Snapshot,
})):
  pass

What assumptions are we making about the two fields? How do we know the 'binary_snapshot' contains a binary? If we changed the way we generate the value passed to 'binary_snapshot', and the path to the binary changed (for example), we would then have to find everywhere we interpret a snapshot as containing a binary, and modify those sites to manipulate the result correctly (basically: field names are more likely to change over time than type names). If instead, we did:

class BinarySnapshot(typed_datatype('BinarySnapshot', {
  'snapshot': Snapshot,
  'binary_path_globs': PathGlobs,
})):

  def get_binaries_path_stats(self):
    # ...


class SourceSnapshot(typed_datatype('SourceSnapshot', {
  'snapshot': Snapshot,
})):
  # maybe we don't need to do anything here,
  # but it's clear what this is supposed to represent
  pass

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': BinarySnapshot,
  'source_snapshot': SourceSnapshot,
})):
  pass

We could even make the get_binaries_path_stats() into an @rule which could return another wrapper type. The reason I showed the above sample is that if we allow ourselves to be liberal with wrapper types, then we could remove the field names:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', (BinarySnapshot, SourceSnapshot))):
  pass

req = SomeTypeOfRequest(BinarySnapshot(...), SourceSnapshot(...))
# this would work as a result of some (probably simple) metaprogramming
# we could apply a similar technique as Task.stable_name() to generate the accessor name
# (that would then mean that all fields of some type would be accessed
# using the same field name, which is extremely interesting)
print(req.binary_snapshot)

But if we are allowing multiple fields of the same type, I wouldn't want to be able to invoke the constructor with any positional arguments whatsoever; that seems to me to defeat the purpose of type checking entirely -- I absolutely don't trust myself to always put positional arguments in the right order, and if multiple fields can have the same type we're back where we started (speaking personally). Thinking about it more, I find the idea of requiring distinct types really interesting, especially if we had magic property accessors generated from type names (totally not required -- but I really don't want to use .0, .1, etc. like Rust tuple structs, I find it absurdly opaque).

Thinking about it even more, I might actually say that requiring distinct types for fields of a typed_datatype and auto-deriving field names is the approach I would want/prefer. There may be some bikeshedding about how to generate the field accessor property names from each field's type, but I really like the idea that if you do req.binary_snapshot as above, you know that the value will have been type-checked as a BinarySnapshot.
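
One possible way to derive accessor names from field types, consistent with the primitive__int and some_class accessors that appear later in this thread (the exact naming scheme is an assumption):

import re

def accessor_name_for_type(field_type):
  """Derive a snake_case accessor name from a field's type name."""
  name = field_type.__name__
  # Builtin primitives get a distinguishing prefix, e.g. int -> primitive__int.
  if name in ('int', 'str', 'float', 'bool'):
    return 'primitive__{}'.format(name)
  # CamelCase class names become snake_case: BinarySnapshot -> binary_snapshot.
  return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()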

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 19, 2018

(And would possibly even suggest extracting this into its own library for re-use in the future)

I think this is a fantastic idea, not at all in the scope of this PR but if we iterate on it it seems extremely plausible to me that we could make typed_datatype satisfy our needs for use in pants (perf- and expressivity-wise) and expose it as a standalone library without making compromises. This kind of thing is something I have wanted to write and/or maintain for a while and we don't need to go down that route too soon, but we also don't need to drag our feet about it once we feel confident that the model will work for the needs of pants (which won't be clear until we stabilize the v2 engine API, probably). And as always, we should make absolutely sure that this needs to exist as a new library instead of contributing to an existing one (I believe that it does, but we should continue to ponder that before extracting it).

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 20, 2018

@stuhood I also missed your comment about "on the fence about this in general" -- sorry about that. I know you mentioned perf concerns as well as (it seemed like) a question about how this explicit type checking would interact with the implicit type checking done when the v2 engine does a select. I'm not sure about that either, I just think it would be complementary for now, by decoupling type checking from the other failure modes in @rule implementations, if that makes sense? I would lean more towards using typed_datatype() to make it easier to construct sanitary wrapper objects for use in @rules for now, and leave open the possibility of incremental deeper integration with the v2 engine as the opportunities for that come into view (if they do).

I'm not sure that's addressing your concern. I would also like to address the perf concern, but it's not clear to me how to construct a benchmark other than generating a million types and doing a million checks -- since typed_datatype() and datatype() instances are the exact same after construction (where type-checking occurs) I don't see introducing typed_datatype() as creating compatibility issues, so if type-checking is a perf issue in some code path we could just not use them there.
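
If a rough micro-benchmark of that construction-time cost were wanted, something along these lines would do; the field declarations follow the syntax this PR ends up with, and the numbers would of course depend on the machine:

import timeit

from pants.util.objects import datatype

class Plain(datatype(['field'])):
  pass

class Checked(datatype([('field', int)])):
  pass

# Type checking only happens at construction, so compare constructor cost directly.
print(timeit.timeit(lambda: Plain(3), number=1000000))
print(timeit.timeit(lambda: Checked(3), number=1000000))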

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 21, 2018

Moved TypeConstraint into src/python/pants/util/objects.py in ae70e02 and everything except lint is passing. I'm going to take a stab at modifying the field_decls of typed_datatype() to be a tuple of distinct types, the names of which are used to generate field members for the eventual datatype() call (as described a few comments up). I think this would be a fantastic interface.

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 23, 2018

I implemented the magical accessors and unique, positional fields as discussed above in 46c0f2d, then in e6e97d1 I added a decorator @typed_data(<types..>) which can be applied to class declarations as follows:

class SomeClass(object):
  def __init__(self, arg):
    self._arg = arg

  def f(self):
    return self._arg

@typed_data(int, SomeClass)
class MyTypedData(SomeMixin):
  # source code...

# the above is equivalent to the below:

class MyTypedData(typed_datatype('MyTypedData', (int, SomeClass)), SomeMixin):
  # source code...

# you can then do:

data = MyTypedData(3, SomeClass('asdf'))
print(data.primitive__int)
print(data.some_class.f())

This seems pretty neat. I deleted all the TypeMatcher tests when I switched to using the existing TypeConstraint, so I'm going to add more tests for the current state of the functionality and look into testing the __repr__() and exception messages from class instances.
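
For reference, a rough sketch of how a decorator like @typed_data could be wired on top of typed_datatype(); this is an illustration, not the implementation from the PR (which is later simplified away, as discussed further down):

def typed_data(*field_types):
  """Rebuild the decorated class on top of typed_datatype(), keeping its other bases."""
  def decorator(cls):
    typed_base = typed_datatype(cls.__name__, field_types)
    # Drop descriptors that can't be re-declared when creating the new type.
    attrs = {k: v for k, v in cls.__dict__.items() if k not in ('__dict__', '__weakref__')}
    return type(cls.__name__, (typed_base,) + cls.__bases__, attrs)
  return decorator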

./pants lint is complaining about the usage of these decorators, I'll look into why that is and how to avoid the warning:

T606:ERROR   tests/python/pants_test/util/test_objects.py:129 Classes must be new-style classes.
     |@typed_data(int)

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with typed_datatype() concisely declare classes that type-check their constructor arguments with @typed_data() Apr 23, 2018

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 23, 2018

I added the __str__() and __repr__() testing I mentioned above, fixed the lint error I mentioned above, and switched test_isolated_process.py to use the new @typed_data(<types...>). This and the previous changes are described in the updated OP. CI is green and this is ready for review.

#
# class MyTypedData(typed_datatype('MyTypedData', (int, str)), SomeMixin):
#   # source code...
def typed_data(*fields):

@stuhood

stuhood Apr 24, 2018

Member

This is too magical, IMO... particularly the fact that it requires modifying the linters.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

This comment makes sense. I have reverted the linter modification in 238ce3b.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

After rebasing, the sha is aa9ba67.



def typed_datatype(type_name, field_decls):
  """A wrapper over namedtuple which accepts a dict of field names and types.

@stuhood

stuhood Apr 24, 2018

Member

This no longer takes a dict.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

I moved the contents of this docstring to the docstring of the datatype() function and added a FIXME: to edit it later (and "later" should be right now) in 2d1489e.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

After rebasing, the sha is 6420f5b. I removed the FIXME in 378a75e.


@property
def bin_path(self):
  return self.binary_location.primitive__str

@stuhood

stuhood Apr 24, 2018

Member

This is not great syntax, unfortunately. I think this might have gone too far when it removed field names.

I would propose that the right way to land this would be to simplify it back to:

class JavacVersionExecutionRequest(datatype('JavacVersionExecutionRequest', [('bin_path', str)])):

IE: It should be straightforward to use the datatype name and accept tuples in the field list. It should also be possible to remove the classname parameter from datatype and just make an anonymous class, on the assumption that it will be extended. Then you'd have:

class JavacVersionExecutionRequest(datatype([('bin_path', str)])):

@cosmicexplorer

cosmicexplorer Apr 25, 2018

Contributor

Ok, this is pretty clear reasoning. I think you made this syntax clear in your comments above and I just wasn't sure what the end result would look like, and was a little iffy on how to munge classes in python. The only utility of the decorator was to avoid having to spell out the same class name again, which is clear to me how to address from your description above. I really like the result you've proposed (especially the pretty natural integration with existing datatype() uses), so thanks for taking the time to make this clear.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

I implemented your proposal (which was great, btw) in 9e8055f (and was able to mostly keep the testing I already wrote intact), and changed every datatype() call site to not specify a class name in 2d1489e. Let me know what you think.

@cosmicexplorer

cosmicexplorer Apr 26, 2018

Contributor

After rebasing, the shas are bd3a2d6 and 6420f5b.

cosmicexplorer added some commits Apr 19, 2018

move TypeConstraint into src/python/pants/util/objects.py
yet again somehow everything Just Works

cosmicexplorer added some commits Apr 23, 2018

@cosmicexplorer cosmicexplorer force-pushed the cosmicexplorer:typed-datatypes branch from 238ce3b to 7f319fe Apr 26, 2018

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 26, 2018

Ok, just rebased and I think CI will pass, will check back tomorrow.

@stuhood
Member

stuhood left a comment

Thanks for iterating. Nice cleanup.

cosmicexplorer added some commits Apr 26, 2018

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 26, 2018

CI was failing because I had left a FIXME comment in src/python/pants/engine/README.md, and there is a pretty obvious bug in the markdown library unless tuples had .keys() in some other version of python. I have removed the comment and will break out separate PRs for the FIXME that was removed and to work around the error in the markdown library.

@stuhood

Member

stuhood commented Apr 27, 2018

Thanks, looks good. Please fix up the description / commit message and I'll merge this tomorrow.

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with @typed_data() type-check specific datatype fields concisely and remove the class name argument Apr 27, 2018

@cosmicexplorer

Contributor

cosmicexplorer commented Apr 27, 2018

Updated the post and title.

@stuhood stuhood merged commit f0e1106 into pantsbuild:master Apr 27, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

stuhood added a commit that referenced this pull request May 13, 2018

Expand engine readme (#5759)
### Problem

It's not clear how to make rule definitions available to the scheduler, and there's no discussion of the type checking added to `datatype()` fields in #5723.

### Solution

- Add a section on datatype fields, typing, and how they interact with the engine.
- Add a section on making rules available to the scheduler in and out of unit tests.
- Add a section describing types of rules after discussion on this PR.