New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix --binaries-path-by-id fingerprinting error #6413

Merged
merged 6 commits into from Aug 30, 2018

Conversation

Projects
None yet
5 participants
@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Aug 29, 2018

Problem

Fixes #6394. Any attempt to override the global bootstrap option --binaries-path-by-id currently fails with a TypeError, as described in that ticket.

Solution

  • Override the default method on our JSONEncoder subclass to handle non-list iterables.
  • Encode each key of the fingerprint() method's option_val argument with stable_json_sha1(), if option_val is a dict (this produces string keys).

Result

The --binaries-path-by-id option can now be overridden or extended from the environment or pants.ini. Passing this argument on the command line still fails because the option is parsed as "the --path-by-id option in the binaries scope".

cosmicexplorer added some commits Aug 29, 2018

@cosmicexplorer cosmicexplorer added this to the 1.8.x milestone Aug 29, 2018

@cosmicexplorer cosmicexplorer requested review from stuhood and jsirois Aug 29, 2018

@stuhood
Copy link
Member

stuhood left a comment

Thanks, looks good.

o = list(o)
elif isinstance(o, dict):
o = self._encode_dict_wrapping_keys(o)
elif len(o) == 1:

This comment has been minimized.

@stuhood

stuhood Aug 29, 2018

Member

The first two cases look ok, but this last one seems potentially error prone.

Are there guarantees that the things coming in to this method are always collections? Because otherwise they might not have a len method.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

In the fingerprint method in options_fingerprinter.py, every input is wrapped in a list if it's not already. This implementation assumes that -- John left some comments below on how to make this more general, which I think addresses your concern specifically, but I think the correct solution is to not wrap everything in a list in fingerprint.

return {self.encode(k):v for k, v in d.items()}

def encode(self, o):
if isinstance(o, (tuple, set)):

This comment has been minimized.

@jsirois

jsirois Aug 29, 2018

Member

You could be more general.

from future.moves import collections

if isinstance(o, collections.Mapping):
  ...
elif isinstance(o, collections.Iterable):
  ...

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

I will make this change.

o = list(o)
elif isinstance(o, dict):
o = self._encode_dict_wrapping_keys(o)
elif len(o) == 1:

This comment has been minimized.

@jsirois

jsirois Aug 29, 2018

Member

This seems like premature optimization.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

Everything is wrapped in a list (check the fingerprint method in options_fingerprinter.py), and we need to be able to modify the dict (encoding the keys) before it goes to the super method, or there will be an error. This was the first way I could think of to do that -- it's clunky and could be improved, but my idea was not that it was an optimization, specifically.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

I recognize that I can just modify the calling code to not wrap everything in a list, but wasn't sure whether there was a specific reason for that.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

I'm pretty sure the correct solution is to not wrap everything in a list, and in combination with your other comment, the body of this method could potentially be made much smaller.

o = [self._encode_dict_wrapping_keys(only_element)]

return super(Encoder, self).encode(o)

def default(self, o):
if o is UnsetBool:

This comment has been minimized.

@jsirois

jsirois Aug 29, 2018

Member

Why aren't you using default? It seems to me your conditionals should move down here since this is the documented way to hook things.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

Modifying default resulted in the same error for me until I tried overriding encode -- that's not saying it's the right way whatsoever. I will look into why I was seeing that behavior and whether this change can be done entirely in default.

This comment has been minimized.

@jsirois

jsirois Aug 29, 2018

Member

This is green against your tests mod sha changes for the {'a': 3} and [{'a': 3}] cases:

diff --git a/src/python/pants/option/options_fingerprinter.py b/src/python/pants/option/options_fingerprinter.py
index 224921056..f8ce100db 100644
--- a/src/python/pants/option/options_fingerprinter.py
+++ b/src/python/pants/option/options_fingerprinter.py
@@ -10,6 +10,8 @@ import os
 from builtins import object, open, str
 from hashlib import sha1
 
+from future.moves import collections
+
 from pants.base.build_environment import get_buildroot
 from pants.base.hash_utils import stable_json_hash
 from pants.option.custom_types import (UnsetBool, dict_with_files_option, dir_option, file_option,
@@ -20,6 +22,8 @@ class Encoder(json.JSONEncoder):
   def default(self, o):
     if o is UnsetBool:
       return '_UNSET_BOOL_ENCODING'
+    elif isinstance(o, collections.Iterable) and not isinstance(o, collections.Mapping):
+      return list(o)
     return super(Encoder, self).default(o)

This comment has been minimized.

@jsirois

jsirois Aug 29, 2018

Member

Simplify!

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 29, 2018

Contributor

I think the reason it wasn't working was because I hadn't figured out that fingerprint wraps everything into a list yet, so I kept getting the same dict error and didn't at that point put any print statements to show default was getting called. Also, other code online used encode() (not as an excuse -- what you describe is the right way). Thanks for figuring this out.

self.assertEqual(stable_json_sha1([('a', 3)]), 'd6abed2e53c1595fb3075ecbe020365a47af1f6f')
self.assertEqual(stable_json_sha1({'a': 3}), '9e0e6d8a99c72daf40337183358cbef91bba7311')
self.assertEqual(stable_json_sha1([{'a': 3}]), '8f4e36849a0b8fbe9c4a822c80fbee047c65458a')
self.assertEqual(stable_json_sha1(set([1])), 'f629ae44b7b3dcfed444d363e626edf411ec69a8')

This comment has been minimized.

@wisechengyi

wisechengyi Aug 29, 2018

Contributor

Wonder if it's an option to assert on bytestring instead. Imagine someone breaks one of the tests in the future, how should they start debugging?

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

I'm not sure -- it would have to be a change to either stable_json_sha1, stable_json_hash, or the Encoder class that changes the result of stable_json_sha1, so I would look to see if I had made any changes there, or any changes in any of hash_utils.py. What do you mean by assert on bytestring?

This comment has been minimized.

@wisechengyi

wisechengyi Aug 30, 2018

Contributor

stable_json_hash is essentially hashing json_str which I am guessing is a byte string (e.g. b'123' instead of a str like '123').

json_str = json.dumps(obj, ensure_ascii=True, allow_nan=False, sort_keys=True, cls=encoder)
return hash_all(json_str, digest=digest)

so the assert would look something like

self.assertEqual(b'{"a":3}', get_json_str({'a': 3}))

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

That makes perfect sense, I can see easily how that’s much more debuggable. I can totally implement that. I don’t know if separating the hashing from the json-ification will be obvious or take like 5 minutes but I’ll implement this. Thanks!

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

I broke out #6421 because I wasn't exactly sure how to implement this in this PR (splitting out the json encoding from the hashing), but for the reasons you describe I think this should be done.

@jsirois
Copy link
Member

jsirois left a comment

LGTM once the test case that actually tests the issue case is added.

def default(self, o):
if o is UnsetBool:
return '_UNSET_BOOL_ENCODING'
elif isinstance(o, collections.Iterable):
if isinstance(o, collections.Mapping):
return self._encode_dict_wrapping_keys(o)

This comment has been minimized.

@jsirois

jsirois Aug 30, 2018

Member

Probably makes sense to just inline _encode_dict_wrapping_keys.

This comment has been minimized.

@jsirois

jsirois Aug 30, 2018

Member

It may also be worth a note that this encoding is 1-way, but that's OK since we never need to decode, just hash the encoding.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

I ended up doing the string conversion that python docs say is already done in json.dumps in the fingerprint method in cb7dfba, which seems to be the path of least resistance (I also added testing). If it's preferable to have stable_json_sha1() be able to handle this case for other uses, I can implement that easily.

@cosmicexplorer

This comment has been minimized.

Copy link
Contributor

cosmicexplorer commented Aug 30, 2018

Quoting a comment I made so everyone can see:

I ended up doing the string conversion that python docs say is already done in json.dumps in the fingerprint method in cb7dfba, which seems to be the path of least resistance (I also added testing). If it's preferable to have stable_json_sha1() be able to handle this case for other uses, I can implement that easily.

@@ -84,6 +79,11 @@ def fingerprint(self, option_type, option_val):
if option_val is None:
return None

# The python documentation (https://docs.python.org/2/library/json.html#json.dumps) states that
# dict keys are coerced to strings in json.dumps, but this appears to be incorrect.
if isinstance(option_val, dict):

This comment has been minimized.

@jsirois

jsirois Aug 30, 2018

Member

Seems wise to not use collections.{Iterable,Mapping} above or else also use those here and below.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

Sorry about that. I will use the collections types. Not trying to forget your comments.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

Using collections.Iterable here causes the pre-commit hook to fail (with Exception message: Spec has absolute path /; expected a path relative to the build root.) (even after I remembered that a Mapping is Iterable too), but it should almost definitely be used and if tests pass with the most recent commit I will figure out how to do that.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

This is the code I used to get that error (in the fingerprint() method body, everything else is the same):

    # For simplicity, we always fingerprint a list.  For non-list-valued options,
    # this will be a singleton list.
    if not isinstance(option_val, collections.Iterable):
      # The python documentation (https://docs.python.org/2/library/json.html#json.dumps) states
      # that dict keys are coerced to strings in json.dumps, but this appears to be incorrect, so we
      # need to encode them ourselves.
      if isinstance(option_val, collections.Mapping):
        option_val = {stable_json_sha1(k):v for k, v in option_val.items()}

      option_val = [option_val]
# The python documentation (https://docs.python.org/2/library/json.html#json.dumps) states that
# dict keys are coerced to strings in json.dumps, but this appears to be incorrect.
if isinstance(option_val, dict):
option_val = {str(k):v for k, v in option_val.items()}

This comment has been minimized.

@jsirois

jsirois Aug 30, 2018

Member

If the key is not primitive, hashes are now subject to the whims of a potentially changing __str__ impl. Seems better to leave this back up in the Encoder and if encode on a key fails - that's good, it means its not a primitive and we don't know how to stably encode it. IE: we get a fail fast on a systemic coding error. With this code, we silently let non-primitves through and __str__ may be the object id which will never match run over run.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

Ok, that makes sense, I'll go back to having the encoder handle it.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

I wasn't able to figure out how to do the automatic encoding of dict keys, because if they're contained inside a list, the default json encoding will take over (and raise a TypeError), unless you either (1) special-case a list with a single dict in it (2) override the private method used to encode dicts (3) re-implement the whole of the json module. So instead I just made sure to hash (with stable_json_sha1()) the keys of all dicts we receive as option_val in the fingerprint() method. Does that sound reasonable, or is there a route I'm missing?

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Aug 30, 2018

Contributor

Broke this out into #6422.

@cosmicexplorer cosmicexplorer force-pushed the cosmicexplorer:fix-binaries-path-by-id-munging branch from 3d33a26 to e3b0dab Aug 30, 2018

@cosmicexplorer cosmicexplorer merged commit 094bfce into pantsbuild:master Aug 30, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

cosmicexplorer added a commit that referenced this pull request Aug 30, 2018

Fix --binaries-path-by-id fingerprinting error (#6413)
Fixes #6394. Any attempt to override the global bootstrap option `--binaries-path-by-id` currently fails with a `TypeError`, as described in that ticket.

- Override the `default` method on our JSONEncoder subclass to handle non-list iterables.
- Encode each key of the `fingerprint()` method's `option_val` argument with `stable_json_sha1()`, if `option_val` is a dict (this produces string keys).

The `--binaries-path-by-id` option can now be overridden or extended from the environment or `pants.ini`. Passing this argument on the command line still fails because the option is parsed as "the `--path-by-id` option in the `binaries` scope".

@ity ity removed the needs-cherrypick label Sep 17, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment