# Getting data

Since we plan to analyze a few repositories in this workshop, let's download them.

We'll first get metadata about a user or organization using the GitHub API, and then download the repositories that interest us the most.

## Retrieving metadata about a user/organization

We keep requesting pages as long as the API response points to a next page. We filter out forks to focus on original repositories.

In [None]:
from json import dump as json_dump
from re import compile as re_compile
from typing import Optional

import requests


# Generate a personal access token here: https://github.com/settings/tokens
TOKEN = ""  # Paste your generated token here (see comment above)


# Raw string so that \? and \d reach the regex engine unchanged.
next_pattern = re_compile(r'<(https://api\.github\.com/user/[^/]+/repos\?[^>]*page=\d+[^>]*)>; rel="next"')


def parse_next(link_header: str) -> Optional[str]:
    match = next_pattern.search(link_header)
    return match.group(1) if match is not None else None


def list_repositories(user: str):
    headers = dict(Authorization="token {token}".format(token=TOKEN))
    url = "https://api.github.com/users/{user}/repos".format(user=user)
    while url is not None:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        for repo in response.json():
            if not repo["fork"]:
                yield repo["name"], repo["clone_url"], repo["size"], repo["stargazers_count"]
        # The Link header is absent when the results fit on a single page.
        url = parse_next(response.headers.get("Link", ""))


with open('output/repos.json', 'w') as fh:
    json_dump(list(list_repositories("apache")), fh)
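
To make the pagination step concrete, here is a minimal sketch of how a "next" pointer can be extracted from a `Link` header. The sample header value and the user ID `1` in it are made up for illustration, and the pattern is deliberately looser than `next_pattern` above: it accepts any URL tagged `rel="next"`.

```python
import re
from typing import Optional

# Looser, illustrative pattern: any URL in angle brackets tagged rel="next".
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header: str) -> Optional[str]:
    """Return the URL of the next page, or None when there is no next page."""
    match = NEXT_LINK.search(link_header)
    return match.group(1) if match is not None else None


# Hypothetical header values, shaped like those returned by a paginated API.
middle_page = ('<https://api.github.com/user/1/repos?page=1>; rel="prev", '
               '<https://api.github.com/user/1/repos?page=3>; rel="next", '
               '<https://api.github.com/user/1/repos?page=5>; rel="last"')
last_page = ('<https://api.github.com/user/1/repos?page=4>; rel="prev", '
             '<https://api.github.com/user/1/repos?page=1>; rel="first"')

print(next_page_url(middle_page))
print(next_page_url(last_page))
```

The loop stops when the function returns `None`, which happens on the last page and when the header is empty. As an alternative to a hand-written pattern, `requests` exposes parsed link headers through `response.links`.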

## Filtering for repos we want to analyze

We'll sort the repos by star count, most popular first, and keep only those at or under a given size threshold.

In [None]:
from json import load as json_load
from operator import itemgetter
from pprint import pprint


MAX_SIZE = 50 * 1024  # The GitHub API reports repository size in KB, so this is 50 MB


filtered = []
with open('output/repos.json', 'r') as fh:
    repos = json_load(fh)
    filtered = [(name, clone_url)
                for name, clone_url, size, _ in sorted(repos, key=itemgetter(3), reverse=True)
                if size <= MAX_SIZE]


pprint(filtered)
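
The same selection can be tried on a small in-memory list, without any file. This is only a sketch: the repository names, URLs, sizes and star counts below are invented, and `select_repos` is a hypothetical helper that wraps the comprehension used above.

```python
from operator import itemgetter
from typing import List, Sequence, Tuple

MAX_SIZE_KB = 50 * 1024  # same threshold as above

# Invented rows in the same layout as repos.json: name, clone_url, size, stars.
sample_repos = [
    ["small-obscure", "https://example.com/small-obscure.git", 100, 10],
    ["big-popular", "https://example.com/big-popular.git", 900_000, 5000],
    ["small-popular", "https://example.com/small-popular.git", 2_000, 3000],
]


def select_repos(repos: Sequence[Sequence], max_size: int) -> List[Tuple[str, str]]:
    """Rank by stars (index 3), descending, then drop repos above max_size."""
    ranked = sorted(repos, key=itemgetter(3), reverse=True)
    return [(name, clone_url)
            for name, clone_url, size, _ in ranked
            if size <= max_size]


selected = select_repos(sample_repos, MAX_SIZE_KB)
print(selected)
```

The most starred repository is dropped because it exceeds the threshold, and the remaining two come out ordered by stars.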

In [None]:
from multiprocessing.pool import ThreadPool


PARALLEL_DOWNLOADS = 10
REPOS_NUMBER = 50


def clone_repo(clone_url: str):
    !cd /devfest/repos && git clone -q {clone_url}


with ThreadPool(PARALLEL_DOWNLOADS) as pool:
    pool.map(clone_repo, [clone_url for _, clone_url in filtered[:REPOS_NUMBER]])
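
The `!` shell escape only works inside IPython or Jupyter. If the same step had to run as a plain Python script, one possible sketch uses `subprocess` instead. The helper names `clone_command` and `clone_all` are hypothetical, and the target directory is assumed to exist already.

```python
import subprocess
from multiprocessing.pool import ThreadPool
from typing import Iterable, List


def clone_command(clone_url: str) -> List[str]:
    """Build the git invocation as an argument list (no shell involved)."""
    return ["git", "clone", "-q", clone_url]


def clone_repo(clone_url: str, target_dir: str = "/devfest/repos") -> None:
    # cwd replaces the `cd`; check=True raises CalledProcessError on failure.
    subprocess.run(clone_command(clone_url), cwd=target_dir, check=True)


def clone_all(clone_urls: Iterable[str], parallel_downloads: int = 10) -> None:
    """Clone several repositories concurrently using a thread pool."""
    with ThreadPool(parallel_downloads) as pool:
        pool.map(clone_repo, list(clone_urls))
```

Calling `clone_all(clone_url for _, clone_url in filtered[:REPOS_NUMBER])` would then mirror the cell above. Threads are sufficient here because the work happens in the external `git` processes.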