Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mrg] Per spec configuration #888

Merged
merged 4 commits into from
Jul 17, 2019
Merged

Conversation

choldgraf
Copy link
Member

@choldgraf choldgraf commented Jul 3, 2019

In #887 we talked about how it'd be cool to have per-repo configuration. This is a first attempt at implementing it.

The basic idea is that we operate similar to how "banned" and "quota increase" functionality works, with one difference:

Instead of a list of specification regexes to match to "quota increase" or "banned", we have a dictionary where keys are specification regexes, and values are dictionaries of "key:value" pairs that can over-ride configuration on a repo-specific basis.

For example, you could do something like:

config:
    GitHubRepoProvider:
      # Add banned repositories to the list below
      # They should be strings that will match "^<org-name>/<repo-name>.*"
      spec_configuration_override:
        ^ines/spacy-binder.*:
		    quota: 1337
        ^binder-examples/requirements.*:
   			banned: true     

Questions to answer

  1. I agree w/ @betatim that "spec_configuration_override" isn't great, any suggestions for other names?
  2. What's a clean way to document what key:value configuration pairs are meaningful in this case? We'll need to add checks for them one-by-one in the code, and for now I guess we'll just start with a quota: <int> key/value. In the future, should we just expose this with good documentation for what the accepted key/values are? (we could also imagine some key/values being repoprovider-specific)

Need to add

  • Docs
  • Tests
  • Better naming

@@ -424,9 +424,14 @@ def escape(s):

async def launch(self, kube, provider):
"""Ask JupyterHub to launch the image."""
# Load the spec-specific configuration if it has been overridden
spec_config = provider.spec_configuration_override

# check quota first
if provider.has_higher_quota():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we retire this at the same time as we introduce the new functionality?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to have a deprecation transition, in which case include a warning if the old way is used.

Copy link
Member Author

@choldgraf choldgraf Jul 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine w/ a transition (though we've still only ever had one release of BinderHub so the "higher_quota" never actually made it into a release lol)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True on paper and in practice we release several times a week and people actually use BinderHub so I think we should be nice to people. The "higher quota" thing has only existed for a week or two so :-/

warnings.warn(
    "XXX is deprecated, use YYY instead",
    DeprecationWarning
)

is the generic Python warnings thing. It isn't ideal (most people won't see deprecation warnings except in their tests). Does traitlets have a tool for deprecations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the invisible-by-default deprecation warnings, I think that application-level deprecation warnings should actually not use DeprecationWarning, so they are visible by default. It's a tricky choice since deprecation warnings are always safely ignored "for now" but if folks never see them, they may as well not exist. As for a transition, I was mostly thinking of ourselves at mybinder.org, since it would allow us to decouple "adopt the new way" from "bump binderhub" PRs without config changes.

FWIW, I don't think the deprecation needs to be a blocker here, as the old way still works. Cleanly deprecating the old way can be a follow-up PR.

Does traitlets have a tool for deprecations?

Traitlets doesn't have any helpers for deprecations, but the way traitlets works can make writing deprecations handy since deprecations can go in observers, and completely removed from the bodies of methods, etc. for example:

@observe('old_way')
def _old_way_changed(self, change):
    warnings.warn("old way is deprecated as of...new way is...")
    self.new_way = modified_form(change.new)
    # never need to look directly at self.old_way after this

...

def method(self):
    only_use(self.new_way)

The simplest version of this can lead to undefined behavior (which wins?) if both the old and new way are specified in the same config, but I think that's not a big deal if the old-way warning is visible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the traitlets observe together with printing a warning (via warnings.warn()) is a good way to do this.

# check quota first
if provider.has_higher_quota():
quota = self.settings.get('per_repo_quota_higher')
elif "per_repo_quota" in spec_config:
quota = spec_config.get('per_repo_quota')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
quota = spec_config.get('per_repo_quota')
quota = spec_config.get('quota')

(also on the line above)

Though you could write this as:

quota = spec_config.get('quota', self.settings.get('per_repo_quota'))

which is a well used pattern in Python for saying "get this from the dictionary and if it doesn't exist use this thing as default"

binderhub/repoproviders.py Outdated Show resolved Hide resolved
@betatim
Copy link
Member

betatim commented Jul 4, 2019

How about spec_config as name? With a comment in the doc string of the traitlet to say that you only have to use this to override settings per spec/if no entry is found defaults are used.

For documentation of which keys/options are valid I'd put a pointer in the traitlets' doc that says "repo providers are free to define their own options, see each repo provider for a list of valid keys and their meaning". Then add appropriate docs to the repo providers/base repo provider.

@@ -111,6 +121,21 @@ def has_higher_quota(self):
return True
return False

def spec_configuration_override(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of naming this spec_config() and taking care of finding and populating all the default values in this function? Then in the builder we wouldn't need to have any checks for "do we override this or not, what is the default". We have one place that is already ready to use.

The reason to call it spec_config is that it would be for one particular spec and not a whole bunch any more. Maybe we could even call it repo_config? Depends a bit on what words the builder uses (does it refer to building a spec or a repo). But this might be getting a bit OCD on naming :)

TL;DR: should we compute the configuration (combining over rides and defaults) for the spec here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it - that seems reasonable to me, lemme give it a stab and see how it looks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the latest push for an attempt at what this would look like, is this the kinda thing you had in mind?

1399d58#diff-c5688934f1e6dc3e932b6c84c1bbbd5dR133

In this case, spec_config would be defined at the helm chart config level, and repo_config is a method of RepoProvider that returns a dictionary of configuration values for that repository (which might have been updated from the spec_config setting)

@betatim betatim changed the title per repo configuration [WIP] Per spec configuration Jul 4, 2019
@@ -424,9 +424,14 @@ def escape(s):

async def launch(self, kube, provider):
"""Ask JupyterHub to launch the image."""
# Load the spec-specific configuration if it has been overridden
spec_config = provider.spec_configuration_override

# check quota first
if provider.has_higher_quota():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to have a deprecation transition, in which case include a warning if the old way is used.

for spec, config in self.spec_configuration_override:
# Ignore case, because most git providers do not
# count DS-100/textbook as different from ds-100/textbook
if re.match(spec, self.spec, re.IGNORECASE):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, using re.match compiles the regex on each call, which means we are compiling every regex here on every repo provider. Compiling them on application load would make this a bit more efficient.

Related to this: dicts are in a random order, but overrides might want to have a priority if more than one pattern matches the repo. With a dict, the behavior is undefined, with a list it would be consistent. So perhaps a list of tuples is more appropriate than a dict?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmm, I am not sure how to do this :-) feel free to suggest the code that would still allow for config patterns like the one in #888 (comment) and I'll give it a shot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about the random order! I had pondered that one might have multiple patterns and if they should be additive or "last one wins" or what, then decided that for a first pass we should ignore that. However the random order thing is something we need to address (or inform the user that they have more than one pattern that matches and that this is an error (for now)).

I think what we want is a list like:

specs = [{"pattern": "some-pattern", "quota": 10}, {"pattern": "some-other-pattern", "quota": 12}]

in YAML it would look like:

# not super sure about the first one but the second two should work
- {pattern: "yet another pattern", quota: 33}
- pattern: some-pattern
  quota: 10
- pattern: some-other-pattern
  quota: 12

Copy link
Member Author

@choldgraf choldgraf Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I like this pattern because it’s more explicit! @minrk does it look good to you as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that looks great to me! The only question is that 'pattern' is in the same namespace as the config overrides. This is nice since it's more concise, but could allow collisions or confusion. The more rigorous, but slightly more tedious would be to explicitly separate the match from the override config:

- pattern: some-pattern
  config:
    quota: 10
    other_config: "x"

This would allow adding other siblings to pattern if we e.g. had some other options that should influence the selection rather than the overridden config. I don't have strong feelings, but I've occasionally regretted not separating things like this in JupyterHub in the past (see spawner options in the REST API).

You can see the difference in code where 'pattern' needs to be handled specially:

# copy because we need to modify the dict
update_config = config.copy()
# remove pattern before updating to config because pattern is not part of the config:
pattern = config.pop('pattern')
if pattern.match(...):
    config.update(update_config)

vs:

pattern = item['pattern']
update_config = item['config']
if pattern.match(spec):
    config.update(update_config)

@betatim
Copy link
Member

betatim commented Jul 9, 2019

FWIW, I don't think the deprecation needs to be a blocker here, as the old way still works. Cleanly deprecating the old way can be a follow-up PR.

That is a good point. Let's follow that suggestion. Being able to bump binderhub on mybinder.org without having to also change the config makes life easier (and lets us use henchbot!).

@choldgraf
Copy link
Member Author

See the latest push for the following updates:

  • The API should now follow what @minrk suggested, how does that look?
  • A buncha tests added (let's see if they pass!)
  • Some minor docs added

Copy link
Member

@minrk minrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments in-line, but this is looking good to me. I think the only decision left is what do we want to happen when two patterns match:

  1. both config overrides are applied, later in the list takes priority for any multi-matches (current behavior)
  2. only first match is applied (break after match)
  3. only last match is applied (break after match, reverse iteration order)
  4. both applied, but earlier entries have higher priority (reverse iteration order)

I think any choice is reasonable (current behavior is most powerful, but I suspect folks who don't think carefully about the implementation could expect items higher in their yaml config to have higher priority), but we should mention which one we chose in the docs.

For example, the current behavior:

If multiple patterns match a given repository, all matching overrides will be applied in the order they appear in the list, meaning that overrides in the last item in the list will have highest priority.

binderhub/repoproviders.py Show resolved Hide resolved
binderhub/repoproviders.py Outdated Show resolved Hide resolved

provider = GitHubRepoProvider(
spec='jupyterhub/zero-to-jupyterhub-k8s/v0.4',
config=[base_config]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec_config?

@betatim
Copy link
Member

betatim commented Jul 16, 2019

I'd got with option (1) for the moment in terms of priority/when to stop.

@choldgraf
Copy link
Member Author

thanks for the feedback on tests etc

The latest commit:

  • Fixes the tests so they pass
  • Adds a test for default behavior w/o a matching spec
  • Adds a note to documentation about configuration override

@choldgraf choldgraf changed the title [WIP] Per spec configuration [mrg] Per spec configuration Jul 17, 2019
@betatim betatim merged commit 137ec19 into jupyterhub:master Jul 17, 2019
yuvipanda pushed a commit to jupyterhub/helm-chart that referenced this pull request Jul 17, 2019
@betatim
Copy link
Member

betatim commented Jul 17, 2019

Nice work! Tests, docs and a new feature!


What is the etiquette for marking conversations as "resolved"? Can I do it in a PR that isn't mine? Should the author do it? When I am the author I tend to mark things are solved (if they don't auto collapse) when I have implemented the feedback or the discussion has somehow ended. In this PR there were several people involved so it felt weird to do that but I also found myself scrolling through each discussion several times trying to not miss new comments.

@choldgraf
Copy link
Member Author

Re: "resolved" that's a good question, I'm generally happy for people to resolve things even if I'm the author on the PR, as it often means that my to-do list has shrunk :-)

as with the other team compass guidelines, I trust the judgment of other folks on the project so I'm happy for them to take the initiative there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants