Add remote upstream repository #9

Open · edumco opened this issue May 27, 2020 · 7 comments

edumco commented May 27, 2020

Add remote upstream repository to the forked projects after cloning

Would it be possible to add the original upstream repository as a remote for forked repositories?

Every time I fork a project on GitHub I have to manually run the following command:

git remote add upstream GIT_UPSTREAM_URL

This way I can keep the fork synchronized with the original repo.

I think it could be possible to add an option:

ghcloneall --include-forks --set-upstream

What do you think about it?
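For illustration, here is a minimal sketch of how such a flag could be declared, assuming an argparse-based CLI (the option names are just the proposal above, not anything that exists yet):

import argparse

# hypothetical sketch of the proposed flags, not the actual ghcloneall CLI
parser = argparse.ArgumentParser(prog='ghcloneall')
parser.add_argument('--include-forks', action='store_true',
                    help='also clone forked repositories')
parser.add_argument('--set-upstream', action='store_true',
                    help='add the original repository as an "upstream" remote'
                         ' after cloning a fork')

args = parser.parse_args(['--include-forks', '--set-upstream'])
assert args.include_forks and args.set_upstream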

mgedmin (Owner) commented May 28, 2020

This is an interesting idea! I need to see if the upstream URL is available in the JSON returned by GitHub's API. I don't see it in the GET /users/{username}/repos response, but it's there in the GET /repos/{owner}/{repo} response.
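For example, a quick standalone check against that endpoint (a minimal sketch using only the standard library; the repository below is just an example of a fork) shows the parent's SSH URL:

import json
from urllib.request import urlopen

# fetch the single-repository endpoint for a known fork (example repo)
with urlopen('https://api.github.com/repos/edumco/ghcloneall') as response:
    data = json.load(response)

# 'parent' (and 'source') are only present when the repository is a fork
parent = data.get('parent')
print(parent['ssh_url'] if parent else 'not a fork')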

So this is possible (but might require extra API requests for each repo). Would you be interested in preparing a pull request?

edumco (Author) commented May 28, 2020

So this is possible (but might require extra API requests for each repo). Would you be interested in preparing a pull request?

Yessss

I'm a Python newbie, but I think it would not be too complicated.

If I get stuck I'll post some questions here!

edumco (Author) commented May 30, 2020

@mgedmin

I did not find any endpoint on GitHub that shows me the upstream, nor a git command that does.

But it can be found on the GitHub repository page in two places:

  • A metadata tag (with a lot of other information)

  • A link called "forked from ..."

I filtered out this link using curl and grep:

curl 'https://github.com/edumco/ghcloneall' -so - |  \
grep 'forked from <' | \
grep -iPo '(?<=href=")(.*)(?=">)'

Result
/mgedmin/ghcloneall

So I dug into the code and found that RepoTask is where most of the logic lives.

So I added the following code:

    def get_github_url(self, remote_url):
        """
        Turn a remote URL into the normal GitHub URL:
            Receives:   git@github.com:edumco/ghcloneall.git
            Returns:    https://github.com/edumco/ghcloneall
        """
        return (
            remote_url
            .replace(".git", "")
            .replace(":", "/")
            .replace("git@", "https://")
        )

    def get_upstream(self, dir):
        """
        Get the remote URL of the original "upstream" repository.
            Receives:   repo dir (".../repo")
            Returns:    the upstream URL (git@github.com:user/repo.git)
        """
        github_url = self.get_github_url(self.get_remote_url(dir))

        # curl github_url -so - | \
        # grep 'forked from <' | \
        # grep -iPo '(?<=href=")(.*)(?=">)'
        curl_upstream = (
            "curl '%s' -so - | grep 'forked from <' | "
            "grep -iPo '(?<=href=\")(.*)(?=\">)'" % github_url
        )

        # shell=True so the pipeline actually runs through a shell
        upstream = self.check_output(curl_upstream, cwd=dir, shell=True).strip()

        # strip the leading slash before returning
        return "git@github.com:" + upstream.lstrip("/") + ".git"

I have 2 questions:

  • Do you think I should go on with this approach?

  • Will passing so many parameters work, or should I break it into smaller pieces?

mgedmin (Owner) commented May 30, 2020

Ah, no. The GitHub API I linked to does provide this information, but calls it parent or source:

The parent and source objects are present when the repository is a fork. parent is the repository this repository was forked from, source is the ultimate source for the network.
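For illustration, the relevant fields of that single-repository response for a fork look roughly like this (heavily abbreviated, written as a Python literal; the values are just an example):

# abbreviated sketch of the GET /repos/{owner}/{repo} JSON for a fork
{
    "name": "ghcloneall",
    "fork": True,
    "parent": {        # the repository this one was forked from
        "full_name": "mgedmin/ghcloneall",
        "ssh_url": "git@github.com:mgedmin/ghcloneall.git",
    },
    "source": {        # the ultimate root of the fork network
        "full_name": "mgedmin/ghcloneall",
        "ssh_url": "git@github.com:mgedmin/ghcloneall.git",
    },
}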

I think it would be easiest to modify Repo.from_repo:

ghcloneall/ghcloneall.py, lines 371 to 376 in 0db5163:

    @classmethod
    def from_repo(cls, repo):
        # use repo['git_url'] for anonymous checkouts, but they're slower
        # (at least as long as you use SSH connection multiplexing)
        clone_url = repo['ssh_url']
        return cls(repo['name'], clone_url, (repo['clone_url'],))

You need repo['owner'] to issue the additional API request and get the parent's ssh_url. You can use parent = get_json_and_links('https://api.github.com/repos/{owner[login]}/{name}'.format_map(repo))[0].get('parent') to get the JSON.

Add another Repo attribute and __init__ argument, either parent_url or upstream_url, and set it to None by default so from_gist() doesn't break.

Now when from_repo() constructs the class instance on the last line, pass it parent['ssh_url'] if parent else None.
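Putting those pieces together, the whole classmethod could end up looking roughly like this (a sketch only, assuming Repo.__init__ grows an upstream_url=None keyword argument and that get_json_and_links is the existing helper returning a (json, links) pair):

    @classmethod
    def from_repo(cls, repo):
        # use repo['git_url'] for anonymous checkouts, but they're slower
        # (at least as long as you use SSH connection multiplexing)
        clone_url = repo['ssh_url']
        upstream_url = None
        if repo.get('fork'):
            # one extra API request per forked repository
            parent = get_json_and_links(
                'https://api.github.com/repos/{owner[login]}/{name}'.format_map(
                    repo))[0].get('parent')
            if parent:
                upstream_url = parent['ssh_url']
        return cls(repo['name'], clone_url, (repo['clone_url'],),
                   upstream_url=upstream_url)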

Then, in RepoTask.clone(), you can add another call inside the if:

ghcloneall/ghcloneall.py, lines 587 to 592 in 0db5163:

    def clone(self, repo, dir):
        self.progress_item.update(' (new)')
        if not self.options.dry_run:
            url = self.repo_url(repo)
            self.check_call(['git', 'clone', '-q', url])
        self.new = True

something like

            self.check_call(['git', 'remote', 'add', 'upstream', repo.upstream_url])

maybe guarded by if self.options.set_upstream? Or maybe do it unconditionally, I don't think anyone will mind that.
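Spelled out, the guarded version might look like this (a sketch, assuming the new Repo.upstream_url attribute and a set_upstream option exist, and that check_call forwards a cwd keyword to subprocess so the command runs inside the fresh clone):

    def clone(self, repo, dir):
        self.progress_item.update(' (new)')
        if not self.options.dry_run:
            url = self.repo_url(repo)
            self.check_call(['git', 'clone', '-q', url])
            if self.options.set_upstream and repo.upstream_url:
                # hypothetical: add the parent repository as "upstream"
                self.check_call(
                    ['git', 'remote', 'add', 'upstream', repo.upstream_url],
                    cwd=dir)
        self.new = True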

(RepoTask.options is the RepoWrangler instance, which seems daft! If I did this today, I'd introduce a new Options class.)

You know what worries me? If a separate API request is needed for each repository, then running ghcloneall on an organization with 380+ repositories (like zopefoundation, my primary use case for ghcloneall) is going to issue 380+ API requests in short order. This may bump into the anonymous API rate limit (60 requests per hour for unauthenticated clients).

mgedmin (Owner) commented May 30, 2020

Idea for reducing GitHub API pressure:

  • Instead of always doing the extra request in Repo.from_repo, store repo['owner'] in an attribute and add a get_upstream_url() method to Repo. Call it from RepoTask.clone(), so that repositories that are already checked out never perform the additional requests (see the sketch after this list).

This would be easy but not helpful at all for the initial 380+ repo clone.

  • Investigate alternative GitHub APIs: maybe there's a way to fetch the list of all of an organization's (or user's) repositories, including source URLs, in one go? Maybe using GraphQL?

But I think that always requires an authentication token, and I like that ghcloneall can run with zero setup or auth steps needed.

  • Add that --set-upstream option and keep it off by default

This would need to be combined with option 1, so additional requests happen only if you explicitly request them.
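A sketch of the first option (hypothetical names; it assumes from_repo() stashes the repository's full name ("owner/name") and its fork flag on the Repo instance, and it caches the answer so repeated calls are free):

    def get_upstream_url(self):
        """Look up the parent repository's SSH URL lazily.

        Costs at most one extra API request, made the first time this is
        called, and only for forks.
        """
        if not self.fork:
            return None
        if not hasattr(self, '_upstream_url'):
            parent = get_json_and_links(
                'https://api.github.com/repos/' + self.full_name)[0].get('parent')
            self._upstream_url = parent['ssh_url'] if parent else None
        return self._upstream_url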

edumco (Author) commented May 30, 2020

Thanks for the helpful answer!

The parent and source objects

This is exactly what I was looking for! If I understood correctly, I should use only the parent, and only when it is a fork.

being disabled by default.

I completely agree with you about that! The first impression is the most important!

I like that ghcloneall can run with zero setup

The --set-upstream option could warn that extra configuration is needed, so the main commands would not change at all, and extra configuration is only required for extra options.

I will try it this way, and when I have some progress I'll show it to you again 👷

mgedmin (Owner) commented May 30, 2020

The --set-upstream option could warn that extra configuration is needed, so the main commands would not change at all, and extra configuration is only required for extra options.

That sounds like a good idea, and adding support for authenticating with a token would also help with #8.

(I would suggest not tackling everything at once: it may be better to add token authentication support in a separate PR.)
