Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use git ls-remote to resolve refs for git provider #895

Merged
merged 9 commits into from
Aug 8, 2019

Conversation

hugokerstens
Copy link
Contributor

Implements the approach specified in #843.

@akhmerov
Copy link

Shouldn't unresolved ref default to master?

@hugokerstens
Copy link
Contributor Author

hugokerstens commented Jul 11, 2019

Shouldn't unresolved ref default to master?

It already defaults to master in the front-end javascript if no ref is specified.

@akhmerov
Copy link

Test failures seem relevant; also we need to test the new cases.

@minrk
Copy link
Member

minrk commented Jul 11, 2019

If it works as advertised, this is awesome! In the future, it could also mean that we could greatly reduce our needed git provider API support, since all we'd need to do is turn provider-specific-url into git url, which should generally be doable purely in our code and without making any API requests.

@hugokerstens
Copy link
Contributor Author

The test for 073dba6 failed because Jupyterhub is not available for some reason.

Copy link
Member

@consideRatio consideRatio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks wow this is a serious upgrade to the UX! Thanks @manics and @hugokerstens for your work on this!

1 - unresolved_ref renamed for clarity?

I was thinking from the naming of self.unresolved_ref that we assume this to be an unresolved reference (like a branch name: master, or a tag name: 0.1.0). I were about to suggest we should be able to handle both. But then I noted that you are already doing a check later within a try catch where it is also allowed for this to be a resolved reference:

        try:
            self.sha1_validate(unresolved_ref)
            self.resolved_ref = unresolved_ref

So, perhaps we can name this variable to indicate it not as an unresolved reference, but simply a reference or unparsed reference, provided reference or input reference etc?

2 - a delayed parsing of the provided git reference?

NOTE: These considerations are not important if an additional webrequest isn't needed, and these considerations are also not essential to merge/not merge of this PR.

The __init__ call to the GitRepoProvider does not trigger the parsing of the reference that could be for example "master" or "0.1.0" to a resolved reference, being a 40 char string. Instead, it will be made when get_resolved_ref() is called. Perhaps this is good, perhaps it isn't, I can't tell because I don't understand the usage pattern of RepoProvider objects yet - will this cause a lag that could be prevented, or does it actually prevent a lag? Could we return __init__ but start the fetch of a resolved reference ahead of time? Hmm...

I also figure that if we call get_resolved_ref multiple times before its return, it may cause multiple invokations of the git ls-remote command while we only want to do one, I think. Can we make it only call once? Is it important? I don't know fully, but that is stuff I considered when I looked through this code!

To summarize:

  • should we async- or sync- invoke the resolution of the reference logic on init, or not do it at all but wait until get_resolved_ref explicitly is called later?
  • should we ensure that only one lookup is done with git ls-remote?

self.sha1_validate(unresolved_ref)
self.resolved_ref = unresolved_ref
except ValueError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few things here which aren't obvious from reading it a few times so maybe we can add some comments.

Why are we now swallowing exceptions here?

On L196 calling self.sha1_validate(unresolved_ref) should fail a lot because users can pass in a value like master so why call it now and not after we resolve the reference?

It seems weird to set self.resolved_ref to the value of unresolved_ref. I think we should delay that to when we have a resolved ref (and it it to None in the meantime).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We pass on an error since we postpone resolving the ref to the getter of resolved_ref if the ref is not a commit SHA. It makes sense to move this sha1 check to the getter for the resolved_ref and not set the resolved_ref at all in the constructor. This is in line with the other repo providers.

Copy link
Member

@consideRatio consideRatio Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also appreciate managing all logic within the getter (get_resolved_ref) as there is one. It was confusing to me the first time I read this through that some of it was found there but not all of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the current implementation satisfactory?

The only thing I'm not 100% sure about is setting the resolved_ref in the else block. It is a good practice to only have code in the try block that produces the error you want to catch. However, it could be more readable if the resolved_ref is set directly after validating it in the try block. I tried to make it more readable by adding some comments, but still assign it in the else block.

binderhub/repoproviders.py Outdated Show resolved Hide resolved
binderhub/repoproviders.py Outdated Show resolved Hide resolved
Co-Authored-By: Tim Head <betatim@gmail.com>
@hugokerstens
Copy link
Contributor Author

1 - unresolved_ref renamed for clarity?

So, perhaps we can name this variable to indicate it not as an unresolved reference, but simply a reference or unparsed reference, provided reference or input reference etc?

That's indeed a lot more readable. If we move the sha1 check to the getter (as discussed here), renaming it is not needed anymore. We can just speak of an unresolved reference in the constructor, and only bother with validating in the getter for the resolved ref.

@betatim
Copy link
Member

betatim commented Jul 12, 2019

I think switching to unresolved_ref as a variable name is a good idea and consistent with how other providers name it. Maybe the most precise name would be potentially_unresolved_ref but that is quite long :-/

Right now when you enter a git repo URL in the UI you end up with the following link https://mybinder.org/v2/git/http%3A%2F%2Fgit.example.com%2Frepo/123124124
Screen Shot 2019-07-12 at 16 44 47

so we should make sure that format continues to work. It makes it a bit tricky to deal with branches that contain / though. From a quick look at the gitlab.com provider it seems to do what you discussed in the comment and encode the / in the URL of the repo and leave the ones in the branch/tag name alone. So that would be a good example to follow.

We should extend the tests to handle the new possibilities (as well as some invalid ones?). For this to work we need to mock the calls to git ls-remote as we don't want to actually run commands that need networking. To test the resolving via git ls-remote we can setup a local "throw away" repo, clone it, and then use the original as the "remote". I think that would work.

@consideRatio
Copy link
Member

consideRatio commented Jul 12, 2019

It is this change that led me to consider the variable name in the first place I guess. Previously, a resolved reference was required, but now it cannot be a resolved reference? Well the code is currently meant to support that, but I got confused due to this documentation that led me to think it was no longer supported. Hence, my thinking of changing this documentation snippet and than the variable name felt natural to change along with it.

    Users must provide a spec of the following form.

-    <url-escaped-namespace>/<resolved_ref>
+    <url-escaped-namespace>/<unresolved_ref>

raise ValueError("The specified branch, tag or commit SHA ('{}') was not found on the remote repository."
.format(self.unresolved_ref))
resolved_ref = result.stdout.split(None, 1)[0]
self.sha1_validate(resolved_ref)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is validating the SHA that is returned by git ls-remote too much? I think this safe-guard is nice to have for unexpected output of git ls-remote.

@hugokerstens
Copy link
Contributor Author

hugokerstens commented Jul 13, 2019

I added some extra documentation for the GitRepoProvider that explicitly states that both resolved and unresolved references are possible to address some of the concerns of @consideRatio. However, the variable is still called unresolved_ref in the code, which is confusing as well. The problem here is the ambiguity of 'unresolved'. It could either mean unresolved as in yet to be resolved to a commit hash, OR unresolved as in not validated at all.

@chicocvenancio
Copy link
Contributor

This fixes #675 if I'm not misunderstanding something. A great improvement.

@betatim
Copy link
Member

betatim commented Aug 8, 2019

Thanks for bumping this @chicocvenancio. Deployed this locally and tested it with some GitHub and GitLab repositories (pretending they were just git repos). Seems to work 🎉.

Merging! Thanks for this cool feature!

@betatim betatim merged commit 1a76d5f into jupyterhub:master Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Under the hood improvements and fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants