
type-check specific datatype fields concisely and remove the class name argument #5723

Merged
merged 32 commits on Apr 27, 2018

Conversation

@cosmicexplorer commented Apr 19, 2018

Problem

See #5716. We had to introduce a ton of boilerplate to check argument types in some very simple usage of datatype subclasses in tests in #5703. This is one way to make that easier.

Solution

  • Move TypeConstraint and subclasses from src/python/pants/engine/addressable.py to src/python/pants/util/objects.py, where they probably should have been anyway, to avoid an import cycle.
  • Remove the class name argument from datatype() (and update all call sites).
  • Allow specifying a tuple ('field_name', FieldType) in the list of fields to datatype() to have the field type-checked by an Exactly(FieldType) type constraint at construction.
  • Override __str__() and __repr__() for datatype objects, and add testing for the str and repr.
  • Add testing for informative error messages in datatype construction.

Result

We can now concisely declare datatype objects without having to repeat the class name, and can opt-in to a type check for specific fields.
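The resulting calling convention can be illustrated with a minimal, self-contained stand-in -- a sketch of the behavior described above, not the code that landed in `src/python/pants/util/objects.py`:

```python
import collections

def datatype(field_decls):
  """Minimal sketch of the new API: fields are plain names or
  ('name', FieldType) tuples, and typed fields are checked exactly
  (no subtypes) at construction. No class name argument is needed
  because the result is always subclassed."""
  names, constraints = [], {}
  for decl in field_decls:
    if isinstance(decl, tuple):
      name, field_type = decl
      constraints[name] = field_type
    else:
      name = decl
    names.append(name)
  base = collections.namedtuple('datatype', names)

  class DataType(base):
    def __new__(cls, *args, **kwargs):
      this = super(DataType, cls).__new__(cls, *args, **kwargs)
      for name, field_type in constraints.items():
        value = getattr(this, name)
        if type(value) is not field_type:  # Exactly(...): exact type match
          raise TypeError("field '{}' must be {}, got {!r}".format(
              name, field_type.__name__, value))
      return this

  return DataType

# No class name argument: the subclass supplies the real name.
class JavacVersionExecutionRequest(datatype([('bin_path', str)])):
  pass

req = JavacVersionExecutionRequest(bin_path='/usr/bin/javac')
```

Passing a non-`str` for `bin_path` raises `TypeError` at construction time, which is the opt-in check the PR adds.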

@illicitonion left a comment:

I'm pretty excited about this :) (And would possibly even suggest extracting this into its own library for re-use in the future). Will leave some time for others to chime in

for name, cls in field_decls.items():
  if isinstance(cls, TypeDecl):
    processed_type_decls[name] = cls
    continue
illicitonion:

Rather than `continue` inside the `if`, this looks like it would sit pretty naturally as a chain of `elif`s?

cosmicexplorer (author):

I'm pretty sure you're right, I was just having trouble reading the elif version in my head when I wrote this bit, so used continue. I will make this change.

cosmicexplorer (author):

I converted this into Exactly.from_type_or_collection() when I moved TypeConstraint into objects.py. This actually has the same control flow as the original, but return instead of continue is a little bit more sane (imo) -- let me know if I'm wrong about that.

@stuhood commented Apr 19, 2018

I'm on the fence about this in general, and we should talk about that before landing this.

But one concrete thing that I think is probably not the right tradeoff if we do decide to do this is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (ie, by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

@baroquebobcat left a comment:

Did a first pass. I have a few style / assertion comments on tests and some suggestions for type constraints.

  SomeTypedDatatype()
  self.fail("should have errored: not providing all constructor fields")
except TypedDatatypeInstanceConstructionError as e:
  self.assertIn('my_val', str(e))
baroquebobcat:

I'd like to see more tests that check parts of the error messages. I think you could elide some of the comments like # not providing all the fields if there were an assertion that checked for that information.

One pattern that might be helpful is to use the context manager provided by assertRaises. For example

with self.assertRaises(MyError) as cm:
  call()
self.assertIn('my error is the best error', str(cm.exception))

Then you can rewrite the above try, fail, except, assert as

with self.assertRaises(TypedDatatypeInstanceConstructionError) as cm:
  SomeTypedDatatype()

self.assertIn('my_val', str(cm.exception))
self.assertIn('missing field', str(cm.exception)) # or something like it

cosmicexplorer (author):

I really appreciate this feedback. I didn't make it clear in the OP, but I absolutely want to make sure error messages (and test cases) are deeply considered and rock solid for a feature like this. I didn't look deeply enough into assertRaises to realize that it yielded a context object and this seems really actionable. With respect to the code example you gave, I would probably do that and extract it into a helper method of the test class. The code samples you've provided are clear enough for me to act on.
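A sketch of such a helper method on a test base class (names here are illustrative, not necessarily what landed):

```python
import unittest

class DatatypeTestBase(unittest.TestCase):
  def assert_raises_with_message(self, exc_type, substr, func, *args, **kwargs):
    # Wraps the assertRaises context-manager pattern suggested above:
    # run func, expect exc_type, and check the message substring.
    with self.assertRaises(exc_type) as cm:
      func(*args, **kwargs)
    self.assertIn(substr, str(cm.exception))
```

A test body then collapses to a single line, e.g. `self.assert_raises_with_message(TypedDatatypeInstanceConstructionError, 'my_val', SomeTypedDatatype)`.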

cosmicexplorer (author):

I canonicalized the __str__() and __repr__() of TypedDatatype, as well as the messages in the exceptions it raises, and added tests for all the failure cases in 133ba06. Let me know if this is what you were looking for.

other_union_val = UnionFieldTypedDatatype(an_arg=3)
self.assertIn('UnionFieldTypedDatatype', repr(other_union_val))
self.assertIn('an_arg', repr(other_union_val))
self.assertIn('3', repr(other_union_val))
baroquebobcat:

I like that you've included tests of the repr. I'd like the assertions to be a bit more concrete though. I think testing the full repr would make it clearer what the repr is supposed to look like. How would you feel about that?

Particularly since the ideal for repr is to have a round-trippable string representation that if it were evaluated as python would recreate the object.
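For a namedtuple-backed datatype, a round-trippable repr is straightforward; a sketch (not the PR's exact code):

```python
import collections

class UnionFieldTypedDatatype(
    collections.namedtuple('UnionFieldTypedDatatype', ['an_arg'])):
  def __repr__(self):
    # Round-trippable: eval(repr(x)) recreates the object, provided the
    # class name is in scope.
    field_strs = ('{}={!r}'.format(f, getattr(self, f)) for f in self._fields)
    return '{}({})'.format(type(self).__name__, ', '.join(field_strs))
```

With this shape, the test can pin the whole string: `self.assertEqual('UnionFieldTypedDatatype(an_arg=3)', repr(other_union_val))`.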

cosmicexplorer (author):

I 100% agree. I have thoughts on stu's comment above (e.g. if there are no named args, then that would change the repr), but validating the repr in a structured way seems really appropriate for this use case.

cosmicexplorer (author):

I mentioned this in a response to your earlier comment: check 133ba06 to see if the testing added there addresses your concern here.

"""A wrapper over namedtuple which accepts a dict of field names and types.

This can be used to very concisely define classes which have fields that are
type-checked at construction.
baroquebobcat:

Do we want exact type matching, or allow subtypes? Currently, this allows subtypes. That might be tricky for products in v2. I think v2's type system is primarily based on exact matches of types, either as a singular type or as a type within a union.

If this is for datatypes outside v2, that is probably fine. Either way, I think the entry point here should call out how types are matched.

If you want, you could use the existing TypeConstraint classes that are used by the engine rules. https://github.com/pantsbuild/pants/blob/master/src/python/pants/engine/addressable.py#L31

cosmicexplorer (author):

I think the right move is probably to do exact type matching for the reasons you describe. I had seen the TypeConstraint classes but hadn't used them -- that seems like the right way to tackle this.

cosmicexplorer (author):

Also, for union types, exact type matching means you can use a set instead of a list to record the types that could match (and determine whether the type matches instantly), which in itself sounds like a fantastic argument for exact type checking.
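The set-based idea can be sketched as follows (a hypothetical stand-in for the `Exactly` constraint discussed here, not the engine's actual class):

```python
class Exactly(object):
  """Sketch of an exact-match type constraint over a union of types."""

  def __init__(self, *types):
    # Exact matching means a frozenset suffices: O(1) membership,
    # and hashability comes for free.
    self._types = frozenset(types)

  def satisfied_by(self, obj):
    # type(obj), not isinstance(): a bool does NOT satisfy Exactly(int)
    # even though bool subclasses int.
    return type(obj) in self._types
```

The `type(obj) in self._types` test is the "determine whether the type matches instantly" property: no linear scan over candidate types, and no MRO walk.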

cosmicexplorer (author):

Looked it over, and found TypeConstraint to be a drop-in replacement for the TypeMatcher I had, so used that instead. Had to move TypeConstraint into objects.py to avoid a cyclic dependency, but it was easy enough to move and passes tests locally as of ae70e02. The only difference is that it doesn't convert its self._types into a frozenset -- this seems like a tiny change so I will probably add that.

except TypeDecl.ConstructionError as e:
  invalid_type_decls.append("in field '{}': {}".format(name, e))
  continue
if isinstance(cls, list):
baroquebobcat:

maybe check for (set, list, tuple)? Or do we want to enforce that it's a list specifically?

cosmicexplorer (author):

I'm not sure of the right answer to that. I do think this functionality should be extracted into a factory method of the TypeDecl class, which is easy and I can do.


class YetAnotherNamedTypedDatatype(typed_datatype(
  'YetAnotherNamedTypedDatatype', {
    'nothing_special': str,
    'just_another_arg': StrOrInt,
baroquebobcat:

You could also create an explicit union here.

cosmicexplorer (author), Apr 19, 2018:

That's true, and if I wanted to use a field like StrOrInt, I think that would be much more appropriate as its own typed_datatype (which could then have its own set of @rules converting it into some other type, if necessary).

cosmicexplorer (author):

Using Exactly.from_type_or_collection(), I converted this into [int, str], which is then converted into an Exactly type constraint.

@cosmicexplorer commented Apr 19, 2018

But one concrete thing that I think is probably not the right tradeoff if we do decide to do this is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (ie, by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

I totally expected to have an extended discussion about the interface and/or impl before getting merged! I can see typed datatypes being very cool and I want to help ensure that code that lots of stuff depends on is still code that we should feel free to reach in and improve.

I hadn't remembered until you said this, but I have definitely seen the utility of not having field names, if each field is then required to have its own distinct type with respect to the other fields of the typed_datatype. (This reminds me again of how someone mentioned @rules could just be types in themselves in #5580.) This would encourage/require differentiating arguments not by name, but by type, which would then mean that usage context is canonicalized in the type itself, and not just in the name of the field with that type.

This is a good thing, because then we can build up a hierarchy of types that add context to primitives like PathGlobs and Snapshot, which is good for the same reasons that types are good. People don't seem to like wrapper classes for some reason, but I find it much easier to follow code with thin wrapper types that provide their own usage context compared to e.g. two different fields of the snapshot type, differentiated only by name. A quick example:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': Snapshot,
  'source_snapshot': Snapshot,
})):
  pass

What assumptions are we making about the two fields? How do we know the 'binary_snapshot' contains a binary? If we changed the way we generate the value passed to 'binary_snapshot', and the path to the binary changed (for example), we would then have to find everywhere we interpret a snapshot as containing a binary, and modify those sites to manipulate the result correctly (basically: field names are more likely to change over time than type names). If instead, we did:

class BinarySnapshot(typed_datatype('BinarySnapshot', {
  'snapshot': Snapshot,
  'binary_path_globs': PathGlobs,
})):

  def get_binaries_path_stats(self):
    # ...


class SourceSnapshot(typed_datatype('SourceSnapshot', {
  'snapshot': Snapshot,
})):
  # maybe we don't need to do anything here,
  # but it's clear what this is supposed to represent
  pass

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': BinarySnapshot,
  'source_snapshot': SourceSnapshot,
})):
  pass

We could even make the get_binaries_path_stats() into an @rule which could return another wrapper type. The reason I showed the above sample is that if we allow ourselves to be liberal with wrapper types, then we could remove the field names:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', (BinarySnapshot, SourceSnapshot))):
  pass

req = SomeTypeOfRequest(BinarySnapshot(...), SourceSnapshot(...))
# this would work as a result of some (probably simple) metaprogramming
# we could apply a similar technique as Task.stable_name() to generate the accessor name
# (that would then mean that all fields of some type would be accessed
# using the same field name, which is extremely interesting)
print(req.binary_snapshot)

But if we are allowing multiple fields of the same type, I wouldn't want to be able to invoke the constructor with any positional arguments whatsoever; that seems to me to defeat the purpose of type checking entirely. I absolutely don't trust myself to always put positional arguments in the right order, and if multiple fields can have the same type, we're back where we started (speaking personally). Thinking about it more, I find the idea of requiring distinct types really interesting, especially if we had magic property accessors generated from type names (totally not required -- but I really don't want to use .0, .1, etc. like Rust tuple structs, which I find absurdly opaque).

Thinking about it even more, I might actually say that requiring distinct types for fields of a typed_datatype and auto-deriving field names is the approach I would want/prefer. There may be some bikeshedding about how to generate the field accessor property names from each field's type, but I really like the idea that if you do req.binary_snapshot as above, you know that the value will have been type-checked as a BinarySnapshot.
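One possible way to derive accessors from type names (purely hypothetical -- this is the bikeshed mentioned above, with builtins getting a `primitive__` prefix as in the accessors shown later in the thread):

```python
import re

def accessor_name(field_type):
  """Hypothetical: derive a snake_case accessor from a field's type name,
  in the spirit of Task.stable_name()."""
  if field_type.__module__ in ('builtins', '__builtin__'):
    # Builtins like int/str get an explicit prefix, e.g. 'primitive__int'.
    return 'primitive__{}'.format(field_type.__name__)
  # CamelCase -> snake_case: insert '_' at lower/upper boundaries.
  return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', field_type.__name__).lower()
```

Under this scheme, a `BinarySnapshot` field is always accessed as `.binary_snapshot`, regardless of which typed_datatype it appears in.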

@cosmicexplorer commented Apr 19, 2018

(And would possibly even suggest extracting this into its own library for re-use in the future)

I think this is a fantastic idea, not at all in the scope of this PR but if we iterate on it it seems extremely plausible to me that we could make typed_datatype satisfy our needs for use in pants (perf- and expressivity-wise) and expose it as a standalone library without making compromises. This kind of thing is something I have wanted to write and/or maintain for a while and we don't need to go down that route too soon, but we also don't need to drag our feet about it once we feel confident that the model will work for the needs of pants (which won't be clear until we stabilize the v2 engine API, probably). And as always, we should make absolutely sure that this needs to exist as a new library instead of contributing to an existing one (I believe that it does, but we should continue to ponder that before extracting it).

@cosmicexplorer (author):

@stuhood I also missed your comment about "on the fence about this in general" -- sorry about that. I know you mentioned perf concerns as well as (it seemed like) a question about how this explicit type checking would interact with the implicit type checking done when the v2 engine does a select. I'm not sure about that either, I just think it would be complementary for now, by decoupling type checking from the other failure modes in @rule implementations, if that makes sense? I would lean more towards using typed_datatype() to make it easier to construct sanitary wrapper objects for use in @rules for now, and leave open the possibility of incremental deeper integration with the v2 engine as the opportunities for that come into view (if they do).

I'm not sure that's addressing your concern. I would also like to address the perf concern, but it's not clear to me how to construct a benchmark other than generating a million types and doing a million checks -- since typed_datatype() and datatype() instances are the exact same after construction (where type-checking occurs) I don't see introducing typed_datatype() as creating compatibility issues, so if type-checking is a perf issue in some code path we could just not use them there.

@cosmicexplorer (author):

Moved TypeConstraint into src/python/pants/util/objects.py in ae70e02 and everything except lint is passing. I'm going to take a stab at modifying the field_decls of typed_datatype() to be a tuple of distinct types, the names of which are used to generate field members for the eventual datatype() call (as described a few comments up). I think this would be a fantastic interface.

@cosmicexplorer commented Apr 23, 2018

I implemented the magical accessors and unique, positional fields as discussed above in 46c0f2d, then in e6e97d1 I added a decorator @typed_data(<types..>) which can be applied to class declarations as follows:

class SomeClass(object):
  def __init__(self, arg):
    self._arg = arg

  def f(self):
    return self._arg

@typed_data(int, SomeClass)
class MyTypedData(SomeMixin):
  # source code...

# the above is equivalent to the below:

class MyTypedData(typed_datatype('MyTypedData', (int, SomeClass)), SomeMixin):
  # source code...

# you can then do:

data = MyTypedData(3, SomeClass('asdf'))
print(data.primitive__int)
print(data.some_class.f())

This seems pretty neat. I deleted all the TypeMatcher tests when I switched to using the existing TypeConstraint, so I'm going to add more tests for the current state of the functionality and look into testing the __repr__() and exception messages from class instances.
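The decorator can be sketched with a self-contained stand-in (hypothetical names throughout; the real typed_datatype also type-checks each field at construction):

```python
import collections
import re

def _accessor(t):
  # Derive a field accessor from the field's type name: 'primitive__int'
  # for builtins, snake_case otherwise. Illustrative, not the PR's code.
  if t.__module__ in ('builtins', '__builtin__'):
    return 'primitive__{}'.format(t.__name__)
  return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', t.__name__).lower()

def typed_datatype(type_name, field_types):
  # Simplified stand-in: one positional field per distinct type.
  return collections.namedtuple(type_name, [_accessor(t) for t in field_types])

def typed_data(*field_types):
  # Rebuild the decorated class with the generated typed_datatype as an
  # extra base, making the two spellings in the comment above equivalent.
  def wrapper(cls):
    bases = (typed_datatype(cls.__name__, field_types),) + tuple(
        b for b in cls.__bases__ if b is not object)
    ns = {k: v for k, v in cls.__dict__.items()
          if k not in ('__dict__', '__weakref__')}
    return type(cls.__name__, bases, ns)
  return wrapper

class SomeClass(object):
  def __init__(self, arg):
    self._arg = arg

  def f(self):
    return self._arg

@typed_data(int, SomeClass)
class MyTypedData(object):
  pass

data = MyTypedData(3, SomeClass('asdf'))
```

Here `data.primitive__int` is `3` and `data.some_class.f()` returns `'asdf'`, matching the shape of the example above.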

./pants lint is complaining about the usage of these decorators, I'll look into why that is and how to avoid the warning:

T606:ERROR   tests/python/pants_test/util/test_objects.py:129 Classes must be new-style classes.
     |@typed_data(int)

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with typed_datatype() concisely declare classes that type-check their constructor arguments with @typed_data() Apr 23, 2018
@cosmicexplorer commented Apr 23, 2018

I added the __str__() and __repr__() testing I mentioned above, fixed the lint error I mentioned above, and switched test_isolated_process.py to use the new @typed_data(<types...>). This and the previous changes are described in the updated OP. CI is green and this is ready for review.

#
# class MyTypedData(typed_datatype('MyTypedData', (int, str)), SomeMixin):
#   # source code...
def typed_data(*fields):
stuhood:

This is too magical, IMO... particularly the fact that it requires modifying the linters.

cosmicexplorer (author):

This comment makes sense. I have reverted the linter modification in 238ce3b.

cosmicexplorer (author):

After rebasing, the sha is aa9ba67.



def typed_datatype(type_name, field_decls):
  """A wrapper over namedtuple which accepts a dict of field names and types.
stuhood:

This no longer takes a dict.

cosmicexplorer (author):

I moved the contents of this docstring to the docstring of the datatype() function and added a FIXME: to edit it later (and "later" should be right now) in 2d1489e.

cosmicexplorer (author):

After rebasing, the sha is 6420f5b. I removed the FIXME in 378a75e.


@property
def bin_path(self):
  return self.binary_location.primitive__str
stuhood:

This is not great syntax, unfortunately. I think this might have gone too far when it removed field names.

I would propose that the right way to land this would be to simplify it back to:

class JavacVersionExecutionRequest(datatype('JavacVersionExecutionRequest', [('bin_path', str)])):

IE: It should be straightforward to use the datatype name and accept tuples in the field list. It should also be possible to remove the classname parameter from datatype and just make an anonymous class, on the assumption that it will be extended. Then you'd have:

class JavacVersionExecutionRequest(datatype([('bin_path', str)])):

cosmicexplorer (author), Apr 25, 2018:

Ok, this is pretty clear reasoning. I think you made this syntax clear in your comments above and I just wasn't sure what the end result would look like, and was a little iffy on how to munge classes in python. The only utility of the decorator was to avoid having to spell out the same class name again, which is clear to me how to address from your description above. I really like the result you've proposed (especially the pretty natural integration with existing datatype() uses), so thanks for taking the time to make this clear.

cosmicexplorer (author):

I implemented your proposal (which was great, btw) in 9e8055f (and was able to mostly keep the testing I already wrote intact), and changed every datatype() call site to not specify a class name in 2d1489e. Let me know what you think.

cosmicexplorer (author):

After rebasing, the shas are bd3a2d6 and 6420f5b.

@cosmicexplorer (author):

Ok, just rebased and I think CI will pass, will check back tomorrow.

@stuhood left a comment:

Thanks for iterating. Nice cleanup.

@cosmicexplorer (author):

CI was failing because I had left a FIXME comment in src/python/pants/engine/README.md, and there appears to be a bug in the markdown library (unless tuples had .keys() in some other version of Python). I have removed the comment and will break out separate PRs for the FIXME that was removed and to work around the error in the markdown library.

@stuhood commented Apr 27, 2018

Thanks, looks good. Please fix up the description / commit message and I'll merge this tomorrow.

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with @typed_data() type-check specific datatype fields concisely and remove the class name argument Apr 27, 2018
@cosmicexplorer (author):

Updated the post and title.

@stuhood stuhood merged commit f0e1106 into pantsbuild:master Apr 27, 2018
stuhood pushed a commit that referenced this pull request May 13, 2018
### Problem

It's not clear how to make rule definitions available to the scheduler, and there's no discussion of the type checking added to `datatype()` fields in #5723.

### Solution

- Add a section on datatype fields, typing, and how they interact with the engine.
- Add a section on making rules available to the scheduler in and out of unit tests.
- Add a section describing types of rules after discussion on this PR.