
type-check specific datatype fields concisely and remove the class name argument #5723

Merged
merged 32 commits on Apr 27, 2018

Conversation

@cosmicexplorer commented Apr 19, 2018

Problem

See #5716. We had to introduce a ton of boilerplate to check argument types in some very simple usage of datatype subclasses in tests in #5703. This is one way to make that easier.

Solution

  • Move TypeConstraint and subclasses from src/python/pants/engine/addressable.py to src/python/pants/util/objects.py, where they probably should have been anyway, to avoid an import cycle.
  • Remove the class name argument from datatype() (and update all call sites).
  • Allow specifying a tuple ('field_name', FieldType) in the list of fields to datatype() to have the field type-checked by an Exactly(FieldType) type constraint at construction.
  • Override __str__() and __repr__() for datatype objects, and add testing for the str and repr.
  • Add testing for informative error messages in datatype construction.

Result

We can now concisely declare datatype objects without having to repeat the class name, and can opt-in to a type check for specific fields.
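The resulting calling convention can be illustrated with a minimal, self-contained stand-in -- a sketch of the behavior described above, not the code that landed in `src/python/pants/util/objects.py`:

```python
import collections

def datatype(field_decls):
  """Minimal sketch of the new API: fields are plain names or
  ('name', FieldType) tuples, and typed fields are checked exactly
  (no subtypes) at construction. No class name argument is needed
  because the result is always subclassed."""
  names, constraints = [], {}
  for decl in field_decls:
    if isinstance(decl, tuple):
      name, field_type = decl
      constraints[name] = field_type
    else:
      name = decl
    names.append(name)
  base = collections.namedtuple('datatype', names)

  class DataType(base):
    def __new__(cls, *args, **kwargs):
      this = super(DataType, cls).__new__(cls, *args, **kwargs)
      for name, field_type in constraints.items():
        value = getattr(this, name)
        if type(value) is not field_type:  # Exactly(...): exact type match
          raise TypeError("field '{}' must be {}, got {!r}".format(
              name, field_type.__name__, value))
      return this

  return DataType

# No class name argument: the subclass supplies the real name.
class JavacVersionExecutionRequest(datatype([('bin_path', str)])):
  pass

req = JavacVersionExecutionRequest(bin_path='/usr/bin/javac')
```

Passing a non-`str` for `bin_path` raises `TypeError` at construction time, which is the opt-in check the PR adds.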

@illicitonion left a comment:

I'm pretty excited about this :) (And would possibly even suggest extracting this into its own library for re-use in the future). Will leave some time for others to chime in

for name, cls in field_decls.items():
  if isinstance(cls, TypeDecl):
    processed_type_decls[name] = cls
    continue
illicitonion:

Rather than `continue` inside the `if`, this looks like it would sit pretty naturally as a chain of `elif`s?

cosmicexplorer (author):

I'm pretty sure you're right, I was just having trouble reading the elif version in my head when I wrote this bit, so used continue. I will make this change.

cosmicexplorer (author):

I converted this into Exactly.from_type_or_collection() when I moved TypeConstraint into objects.py. This actually has the same control flow as the original, but return instead of continue is a little bit more sane (imo) -- let me know if I'm wrong about that.

@stuhood commented Apr 19, 2018

I'm on the fence about this in general, and we should talk about that before landing this.

But one concrete thing that I think is probably not the right tradeoff if we do decide to do this is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (ie, by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

@baroquebobcat left a comment:

Did a first pass. I have a few style / assertion comments on tests and some suggestions for type constraints.

  SomeTypedDatatype()
  self.fail("should have errored: not providing all constructor fields")
except TypedDatatypeInstanceConstructionError as e:
  self.assertIn('my_val', str(e))
baroquebobcat:

I'd like to see more tests that check parts of the error messages. I think you could elide some of the comments like # not providing all the fields if there were an assertion that checked for that information.

One pattern that might be helpful is to use the context manager provided by assertRaises. For example

with self.assertRaises(MyError) as cm:
  call()
self.assertIn('my error is the best error', str(cm.exception))

Then you can rewrite the above try, fail, except, assert as

with self.assertRaises(TypedDatatypeInstanceConstructionError) as cm:
  SomeTypedDatatype()

self.assertIn('my_val', str(cm.exception))
self.assertIn('missing field', str(cm.exception)) # or something like it

cosmicexplorer (author):

I really appreciate this feedback. I didn't make it clear in the OP, but I absolutely want to make sure error messages (and test cases) are deeply considered and rock solid for a feature like this. I didn't look deeply enough into assertRaises to realize that it yielded a context object and this seems really actionable. With respect to the code example you gave, I would probably do that and extract it into a helper method of the test class. The code samples you've provided are clear enough for me to act on.
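A sketch of such a helper method on a test base class (names here are illustrative, not necessarily what landed):

```python
import unittest

class DatatypeTestBase(unittest.TestCase):
  def assert_raises_with_message(self, exc_type, substr, func, *args, **kwargs):
    # Wraps the assertRaises context-manager pattern suggested above:
    # run func, expect exc_type, and check the message substring.
    with self.assertRaises(exc_type) as cm:
      func(*args, **kwargs)
    self.assertIn(substr, str(cm.exception))
```

A test body then collapses to a single line, e.g. `self.assert_raises_with_message(TypedDatatypeInstanceConstructionError, 'my_val', SomeTypedDatatype)`.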

cosmicexplorer (author):

I canonicalized the __str__() and __repr__() of TypedDatatype, as well as the messages in the exceptions it raises, and added tests for all the failure cases in 133ba06. Let me know if this is what you were looking for.

other_union_val = UnionFieldTypedDatatype(an_arg=3)
self.assertIn('UnionFieldTypedDatatype', repr(other_union_val))
self.assertIn('an_arg', repr(other_union_val))
self.assertIn('3', repr(other_union_val))
baroquebobcat:

I like that you've included tests of the repr. I'd like the assertions to be a bit more concrete though. I think testing the full repr would make it clearer what the repr is supposed to look like. How would you feel about that?

Particularly since the ideal for repr is to have a round-trippable string representation that if it were evaluated as python would recreate the object.
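For a namedtuple-backed datatype, a round-trippable repr is straightforward; a sketch (not the PR's exact code):

```python
import collections

class UnionFieldTypedDatatype(
    collections.namedtuple('UnionFieldTypedDatatype', ['an_arg'])):
  def __repr__(self):
    # Round-trippable: eval(repr(x)) recreates the object, provided the
    # class name is in scope.
    field_strs = ('{}={!r}'.format(f, getattr(self, f)) for f in self._fields)
    return '{}({})'.format(type(self).__name__, ', '.join(field_strs))
```

With this shape, the test can pin the whole string: `self.assertEqual('UnionFieldTypedDatatype(an_arg=3)', repr(other_union_val))`.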

cosmicexplorer (author):

I 100% agree. I have thoughts on stu's comment above (e.g. if there are no named args, then that would change the repr), but validating the repr in a structured way seems really appropriate for this use case.

cosmicexplorer (author):

I mentioned this in a response to your earlier comment: check 133ba06 to see if the testing added there addresses your concern here.

"""A wrapper over namedtuple which accepts a dict of field names and types.

This can be used to very concisely define classes which have fields that are
type-checked at construction.
baroquebobcat:

Do we want exact type matching, or allow subtypes? Currently, this allows subtypes. That might be tricky for products in v2. I think v2's type system is primarily based on exact matches of types, either as a singular type or as a type within a union.

If this is for datatypes outside v2, that is probably fine. Either way, I think the entry point here should call out how types are matched.

If you want, you could use the existing TypeConstraint classes that are used by the engine rules. https://github.com/pantsbuild/pants/blob/master/src/python/pants/engine/addressable.py#L31

cosmicexplorer (author):

I think the right move is probably to do exact type matching for the reasons you describe. I had seen the TypeConstraint classes but hadn't used them -- that seems like the right way to tackle this.

cosmicexplorer (author):

Also, for union types, exact type matching means you can use a set instead of a list to record the types that could match (and determine whether the type matches instantly), which in itself sounds like a fantastic argument for exact type checking.
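The set-based idea can be sketched as follows (a hypothetical stand-in for the `Exactly` constraint discussed here, not the engine's actual class):

```python
class Exactly(object):
  """Sketch of an exact-match type constraint over a union of types."""

  def __init__(self, *types):
    # Exact matching means a frozenset suffices: O(1) membership,
    # and hashability comes for free.
    self._types = frozenset(types)

  def satisfied_by(self, obj):
    # type(obj), not isinstance(): a bool does NOT satisfy Exactly(int)
    # even though bool subclasses int.
    return type(obj) in self._types
```

The `type(obj) in self._types` test is the "determine whether the type matches instantly" property: no linear scan over candidate types, and no MRO walk.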

cosmicexplorer (author):

Looked it over, and found TypeConstraint to be a drop-in replacement for the TypeMatcher I had, so used that instead. Had to move TypeConstraint into objects.py to avoid a cyclic dependency, but it was easy enough to move and passes tests locally as of ae70e02. The only difference is that it doesn't convert its self._types into a frozenset -- this seems like a tiny change so I will probably add that.

except TypeDecl.ConstructionError as e:
  invalid_type_decls.append("in field '{}': {}".format(name, e))
  continue
if isinstance(cls, list):
baroquebobcat:

maybe check for (set, list, tuple)? Or do we want to enforce that it's a list specifically?

cosmicexplorer (author):

I'm not sure of the right answer to that. I do think this functionality should be extracted into a factory method of the TypeDecl class, which is easy and I can do.


class YetAnotherNamedTypedDatatype(typed_datatype(
  'YetAnotherNamedTypedDatatype', {
    'nothing_special': str,
    'just_another_arg': StrOrInt,
baroquebobcat:

You could also create an explicit union here.

cosmicexplorer (author), Apr 19, 2018:

That's true, and if I wanted to use a field like StrOrInt, I think that would be much more appropriate as its own typed_datatype (which could then have its own set of @rules converting it into some other type, if necessary).

cosmicexplorer (author):

Using Exactly.from_type_or_collection(), I converted this into [int, str], which is then converted into an Exactly type constraint.

@cosmicexplorer commented Apr 19, 2018

But one concrete thing that I think is probably not the right tradeoff if we do decide to do this is using a dict for arguments. Particularly when you have typed arguments, requiring that they also be passed by name in all cases is overkill... so continuing to allow for use of positional arguments (ie, by making the list of types into [('my_field', MyFieldType)]) or similar would be preferable.

I totally expected to have an extended discussion about the interface and/or impl before getting merged! I can see typed datatypes being very cool and I want to help ensure that code that lots of stuff depends on is still code that we should feel free to reach in and improve.

I hadn't remembered until you said this, but I have definitely seen the utility of not having field names, if each field is then required to have its own distinct type with respect to the other fields of the typed_datatype. (This reminds me again of how someone mentioned @rules could just be types in themselves in #5580.) This would encourage/require differentiating arguments not by name, but by type, which would then mean that usage context is canonicalized in the type itself, and not just in the name of the field with that type.

This is a good thing, because then we can build up a hierarchy of types that add context to primitives like PathGlobs and Snapshot, which is good for the same reasons that types are good. People don't seem to like wrapper classes for some reason, but I find it much easier to follow code with thin wrapper types that provide their own usage context compared to e.g. two different fields of the snapshot type, differentiated only by name. A quick example:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': Snapshot,
  'source_snapshot': Snapshot,
})):
  pass

What assumptions are we making about the two fields? How do we know the 'binary_snapshot' contains a binary? If we changed the way we generate the value passed to 'binary_snapshot', and the path to the binary changed (for example), we would then have to find everywhere we interpret a snapshot as containing a binary, and modify those sites to manipulate the result correctly (basically: field names are more likely to change over time than type names). If instead, we did:

class BinarySnapshot(typed_datatype('BinarySnapshot', {
  'snapshot': Snapshot,
  'binary_path_globs': PathGlobs,
})):

  def get_binaries_path_stats(self):
    # ...


class SourceSnapshot(typed_datatype('SourceSnapshot', {
  'snapshot': Snapshot,
})):
  # maybe we don't need to do anything here,
  # but it's clear what this is supposed to represent
  pass

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', {
  'binary_snapshot': BinarySnapshot,
  'source_snapshot': SourceSnapshot,
})):
  pass

We could even make the get_binaries_path_stats() into an @rule which could return another wrapper type. The reason I showed the above sample is that if we allow ourselves to be liberal with wrapper types, then we could remove the field names:

class SomeTypeOfRequest(typed_datatype('SomeTypeOfRequest', (BinarySnapshot, SourceSnapshot))):
  pass

req = SomeTypeOfRequest(BinarySnapshot(...), SourceSnapshot(...))
# this would work as a result of some (probably simple) metaprogramming
# we could apply a similar technique as Task.stable_name() to generate the accessor name
# (that would then mean that all fields of some type would be accessed
# using the same field name, which is extremely interesting)
print(req.binary_snapshot)

But if we are allowing multiple fields of the same type, I wouldn't want to be able to invoke the constructor with any positional arguments whatsoever; that seems to me to defeat the purpose of type checking entirely. I absolutely don't trust myself to always put positional arguments in the right order, and if multiple fields can have the same type, we're back where we started (speaking personally). Thinking about it more, I find the idea of requiring distinct types really interesting, especially if we had magic property accessors generated from type names (totally not required -- but I really don't want to use .0, .1, etc. like Rust tuple structs, which I find absurdly opaque).

Thinking about it even more, I might actually say that requiring distinct types for fields of a typed_datatype and auto-deriving field names is the approach I would want/prefer. There may be some bikeshedding about how to generate the field accessor property names from each field's type, but I really like the idea that if you do req.binary_snapshot as above, you know that the value will have been type-checked as a BinarySnapshot.
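One possible way to derive accessors from type names (purely hypothetical -- this is the bikeshed mentioned above, with builtins getting a `primitive__` prefix as in the accessors shown later in the thread):

```python
import re

def accessor_name(field_type):
  """Hypothetical: derive a snake_case accessor from a field's type name,
  in the spirit of Task.stable_name()."""
  if field_type.__module__ in ('builtins', '__builtin__'):
    # Builtins like int/str get an explicit prefix, e.g. 'primitive__int'.
    return 'primitive__{}'.format(field_type.__name__)
  # CamelCase -> snake_case: insert '_' at lower/upper boundaries.
  return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', field_type.__name__).lower()
```

Under this scheme, a `BinarySnapshot` field is always accessed as `.binary_snapshot`, regardless of which typed_datatype it appears in.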

@cosmicexplorer commented Apr 19, 2018

(And would possibly even suggest extracting this into its own library for re-use in the future)

I think this is a fantastic idea, not at all in the scope of this PR but if we iterate on it it seems extremely plausible to me that we could make typed_datatype satisfy our needs for use in pants (perf- and expressivity-wise) and expose it as a standalone library without making compromises. This kind of thing is something I have wanted to write and/or maintain for a while and we don't need to go down that route too soon, but we also don't need to drag our feet about it once we feel confident that the model will work for the needs of pants (which won't be clear until we stabilize the v2 engine API, probably). And as always, we should make absolutely sure that this needs to exist as a new library instead of contributing to an existing one (I believe that it does, but we should continue to ponder that before extracting it).

@cosmicexplorer (author):

@stuhood I also missed your comment about "on the fence about this in general" -- sorry about that. I know you mentioned perf concerns as well as (it seemed like) a question about how this explicit type checking would interact with the implicit type checking done when the v2 engine does a select. I'm not sure about that either, I just think it would be complementary for now, by decoupling type checking from the other failure modes in @rule implementations, if that makes sense? I would lean more towards using typed_datatype() to make it easier to construct sanitary wrapper objects for use in @rules for now, and leave open the possibility of incremental deeper integration with the v2 engine as the opportunities for that come into view (if they do).

I'm not sure that's addressing your concern. I would also like to address the perf concern, but it's not clear to me how to construct a benchmark other than generating a million types and doing a million checks -- since typed_datatype() and datatype() instances are the exact same after construction (where type-checking occurs) I don't see introducing typed_datatype() as creating compatibility issues, so if type-checking is a perf issue in some code path we could just not use them there.

@cosmicexplorer (author):

Moved TypeConstraint into src/python/pants/util/objects.py in ae70e02 and everything except lint is passing. I'm going to take a stab at modifying the field_decls of typed_datatype() to be a tuple of distinct types, the names of which are used to generate field members for the eventual datatype() call (as described a few comments up). I think this would be a fantastic interface.

@cosmicexplorer commented Apr 23, 2018

I implemented the magical accessors and unique, positional fields as discussed above in 46c0f2d, then in e6e97d1 I added a decorator @typed_data(<types..>) which can be applied to class declarations as follows:

class SomeClass(object):
  def __init__(self, arg):
    self._arg = arg

  def f(self):
    return self._arg

@typed_data(int, SomeClass)
class MyTypedData(SomeMixin):
  # source code...

# the above is equivalent to the below:

class MyTypedData(typed_datatype('MyTypedData', (int, SomeClass)), SomeMixin):
  # source code...

# you can then do:

data = MyTypedData(3, SomeClass('asdf'))
print(data.primitive__int)
print(data.some_class.f())

This seems pretty neat. I deleted all the TypeMatcher tests when I switched to using the existing TypeConstraint, so I'm going to add more tests for the current state of the functionality and look into testing the __repr__() and exception messages from class instances.
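The decorator can be sketched with a self-contained stand-in (hypothetical names throughout; the real typed_datatype also type-checks each field at construction):

```python
import collections
import re

def _accessor(t):
  # Derive a field accessor from the field's type name: 'primitive__int'
  # for builtins, snake_case otherwise. Illustrative, not the PR's code.
  if t.__module__ in ('builtins', '__builtin__'):
    return 'primitive__{}'.format(t.__name__)
  return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', t.__name__).lower()

def typed_datatype(type_name, field_types):
  # Simplified stand-in: one positional field per distinct type.
  return collections.namedtuple(type_name, [_accessor(t) for t in field_types])

def typed_data(*field_types):
  # Rebuild the decorated class with the generated typed_datatype as an
  # extra base, making the two spellings in the comment above equivalent.
  def wrapper(cls):
    bases = (typed_datatype(cls.__name__, field_types),) + tuple(
        b for b in cls.__bases__ if b is not object)
    ns = {k: v for k, v in cls.__dict__.items()
          if k not in ('__dict__', '__weakref__')}
    return type(cls.__name__, bases, ns)
  return wrapper

class SomeClass(object):
  def __init__(self, arg):
    self._arg = arg

  def f(self):
    return self._arg

@typed_data(int, SomeClass)
class MyTypedData(object):
  pass

data = MyTypedData(3, SomeClass('asdf'))
```

Here `data.primitive__int` is `3` and `data.some_class.f()` returns `'asdf'`, matching the shape of the example above.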

./pants lint is complaining about the usage of these decorators, I'll look into why that is and how to avoid the warning:

T606:ERROR   tests/python/pants_test/util/test_objects.py:129 Classes must be new-style classes.
     |@typed_data(int)

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with typed_datatype() concisely declare classes that type-check their constructor arguments with @typed_data() Apr 23, 2018
@cosmicexplorer commented Apr 23, 2018

I added the __str__() and __repr__() testing I mentioned above, fixed the lint error I mentioned above, and switched test_isolated_process.py to use the new @typed_data(<types...>). This and the previous changes are described in the updated OP. CI is green and this is ready for review.

#
# class MyTypedData(typed_datatype('MyTypedData', (int, str)), SomeMixin):
#   # source code...
def typed_data(*fields):
stuhood:

This is too magical, IMO... particularly the fact that it requires modifying the linters.

cosmicexplorer (author):

This comment makes sense. I have reverted the linter modification in 238ce3b.

cosmicexplorer (author):

After rebasing, the sha is aa9ba67.



def typed_datatype(type_name, field_decls):
  """A wrapper over namedtuple which accepts a dict of field names and types.
stuhood:

This no longer takes a dict.

cosmicexplorer (author):

I moved the contents of this docstring to the docstring of the datatype() function and added a FIXME: to edit it later (and "later" should be right now) in 2d1489e.

cosmicexplorer (author):

After rebasing, the sha is 6420f5b. I removed the FIXME in 378a75e.


@property
def bin_path(self):
  return self.binary_location.primitive__str
stuhood:

This is not great syntax, unfortunately. I think this might have gone too far when it removed field names.

I would propose that the right way to land this would be to simplify it back to:

class JavacVersionExecutionRequest(datatype('JavacVersionExecutionRequest', [('bin_path', str)])):

IE: It should be straightforward to use the datatype name and accept tuples in the field list. It should also be possible to remove the classname parameter from datatype and just make an anonymous class, on the assumption that it will be extended. Then you'd have:

class JavacVersionExecutionRequest(datatype([('bin_path', str)])):

cosmicexplorer (author), Apr 25, 2018:

Ok, this is pretty clear reasoning. I think you made this syntax clear in your comments above and I just wasn't sure what the end result would look like, and was a little iffy on how to munge classes in python. The only utility of the decorator was to avoid having to spell out the same class name again, which is clear to me how to address from your description above. I really like the result you've proposed (especially the pretty natural integration with existing datatype() uses), so thanks for taking the time to make this clear.

cosmicexplorer (author):

I implemented your proposal (which was great, btw) in 9e8055f (and was able to mostly keep the testing I already wrote intact), and changed every datatype() call site to not specify a class name in 2d1489e. Let me know what you think.

cosmicexplorer (author):

After rebasing, the shas are bd3a2d6 and 6420f5b.

@cosmicexplorer (author):

Ok, just rebased and I think CI will pass, will check back tomorrow.

@stuhood left a comment:

Thanks for iterating. Nice cleanup.

@cosmicexplorer (author):

CI was failing because I had left a FIXME comment in src/python/pants/engine/README.md, and there appears to be a bug in the markdown library (unless tuples had .keys() in some other version of Python). I have removed the comment and will break out separate PRs for the FIXME that was removed and to work around the error in the markdown library.

@stuhood commented Apr 27, 2018

Thanks, looks good. Please fix up the description / commit message and I'll merge this tomorrow.

@cosmicexplorer cosmicexplorer changed the title concisely declare classes that type-check their constructor arguments with @typed_data() type-check specific datatype fields concisely and remove the class name argument Apr 27, 2018
@cosmicexplorer (author):

Updated the post and title.

@stuhood stuhood merged commit f0e1106 into pantsbuild:master Apr 27, 2018
stuhood pushed a commit that referenced this pull request May 13, 2018
### Problem

It's not clear how to make rule definitions available to the scheduler, and there's no discussion of the type checking added to `datatype()` fields in #5723.

### Solution

- Add a section on datatype fields, typing, and how they interact with the engine.
- Add a section on making rules available to the scheduler in and out of unit tests.
- Add a section describing types of rules after discussion on this PR.