# Bioinformatics help wanted: find open source repos seeking contributions

## Overview

This notebook queries the GitHub API for repositories that match a set of topics and languages, then lists the open issues in those repositories that carry a given label. For example, use this code to find repositories with the topic "bioinformatics", written in languages of your choice, that have open issues labeled "help wanted".

## Usage

#### Installing requirements
`pip install -r requirements.txt`

#### GitHub credentials
A GitHub account and associated OAuth token are required to run this notebook. See these [instructions](https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line) to create a token.
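
As a rough illustration of how the token file is consumed, the sketch below reads the first line of a file and rejects an empty one. The helper name `read_token` and the empty-file check are assumptions made for illustration; the notebook itself simply reads and strips the first line of `gh_oauth_file`.

```python
from pathlib import Path


def read_token(path):
    """Return the first line of the token file, stripped of whitespace.

    Hypothetical helper, not part of the notebook.
    """
    lines = Path(path).read_text().splitlines()
    token = lines[0].strip() if lines else ''
    if not token:
        # Fail early rather than sending an empty credential to the API
        raise ValueError('No token found on the first line of {}'.format(path))
    return token
```

Keeping the token in a separate file, rather than in the notebook, makes it easier to avoid committing it by accident.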

#### GitHub search terms
Edit the "Parameters" section to set your GitHub credentials and search terms.

#### Results
Running the notebook prints a summary of the repos and open issues that match your search terms.

## Parameters

In [None]:
# GitHub credentials
gh_username = 'pamelarussell'
gh_oauth_file = 'gh_oauth_token.txt'

# GitHub search terms
topics = ['bioinformatics']
languages = ['scala', 'java']
issue_label = 'help wanted'

## Imports

In [None]:
import json
import pycurl

from io import BytesIO
from json.decoder import JSONDecodeError
from time import sleep

## Setup

In [None]:
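# Read the OAuth token from the first line of the token file
with open(gh_oauth_file) as fh:
    gh_oauth_key = fh.readline().strip()
# Hourly request limit assumed for authenticated GitHub API requests
api_rate_limit_per_hour = 5000
# Minimum spacing between requests, in seconds, to stay under the limit
sec_between_requests = 60 * 60 / api_rate_limit_per_hour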
url_repos = 'https://api.github.com/repos'
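
To make the rate-limit arithmetic above concrete, here is a minimal sketch. The function name `seconds_between_requests` is an assumption for illustration; the 5000 requests per hour figure is the one used in the setup cell.

```python
def seconds_between_requests(requests_per_hour):
    """Return the spacing in seconds that keeps a steady stream of requests
    at or below the given hourly limit (hypothetical helper)."""
    return 60 * 60 / requests_per_hour


delay = seconds_between_requests(5000)
print('{:.2f} seconds between requests'.format(delay))  # prints "0.72 seconds between requests"
```

Because every request waits at least this long, a run that issues one request per repository takes at least 0.72 seconds per repository, plus the time for the requests themselves.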

## Utility functions for GitHub API

In [None]:
def gh_userpwd(gh_username, gh_oauth_key):
    """Returns string version of GitHub credentials to be passed to GitHub API"""
    return '{}:{}'.format(gh_username, gh_oauth_key)

def sleep_gh_rate_limit():
    """Sleep for the required amount of time per API request to ensure rate limit is not exceeded"""
    sleep(sec_between_requests + 0.01)
    
def add_page_num(url, page_num):
    """Add page number to GitHub API request and return new URL"""
    if '?' in url:
        return '{}&page={}'.format(url, page_num)
    else:
        return '{}?page={}'.format(url, page_num)
    
def validate_response_found(parsed, message=''):
    """Check that the GitHub API did not return a "Not Found" response

    Args:
        parsed: dict or list
            Parsed JSON response
        message: str
            Extra info to include in the error message

    Raises:
        ValueError: if the response is a dict whose message is "Not Found"
    """
    if isinstance(parsed, dict) and parsed.get('message') == 'Not Found':
        raise ValueError('Parsed response has message: Not Found. Further information:\n{}'.format(message))

def gh_curl_response(url, gh_username, gh_oauth_key):
    """Returns the parsed curl response from the GitHub API; combines pages for list responses

    Returns:
        Parsed API response consisting of a list of dicts, one for each record, or just one
        dict if the response was a single dict. A dict response (e.g. from the search
        endpoint) is returned as soon as it is received, so only its first page is included.
        An empty list is returned if a response body cannot be parsed as JSON.
    """
    page_num = 1
    results = []
    prev_response = None
    while True:
        page_url = add_page_num(url, page_num)
        buffer = BytesIO()
        c = pycurl.Curl()
        c.setopt(c.URL, page_url)
        c.setopt(c.USERPWD, gh_userpwd(gh_username, gh_oauth_key))
        c.setopt(c.WRITEDATA, buffer)
        sleep_gh_rate_limit()
        try:
            c.perform()
        except pycurl.error:
            print(url)
            raise
        finally:
            # Release the curl handle whether or not the request succeeded
            c.close()
        try:
            parsed = json.loads(buffer.getvalue().decode())
        except JSONDecodeError:
            print('Caught JSONDecodeError. Returning empty list for URL {}'.format(url))
            return []
        if isinstance(parsed, dict):
            if 'API rate limit exceeded' in parsed.get('message', ''):
                raise PermissionError(parsed['message'])
            validate_response_found(parsed, page_url)
            return parsed
        if len(parsed) == 0 or parsed == prev_response:
            # Stop on an empty page, or when GitHub returns the same response
            # for any provided page number
            break
        prev_response = parsed
        results.extend(parsed)
        page_num += 1
    return results
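
The paging logic in `gh_curl_response` can be tried without network access. The sketch below isolates the stopping rule (stop on an empty page or on a page identical to the previous one) behind a hypothetical `fetch_page` callable; `collect_pages` and `fake_pages` are illustrative names, not part of the notebook.

```python
def collect_pages(fetch_page):
    """Collect list results page by page until an empty or repeated page is seen.

    fetch_page is a hypothetical callable that takes a 1-based page number
    and returns that page's parsed JSON as a list.
    """
    results = []
    prev_page = None
    page_num = 1
    while True:
        page = fetch_page(page_num)
        if not page or page == prev_page:
            break
        results.extend(page)
        prev_page = page
        page_num += 1
    return results


# Stand-in for the API: two pages of records, then empty pages
fake_pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}]}
records = collect_pages(lambda n: fake_pages.get(n, []))
print([r['id'] for r in records])  # prints [1, 2, 3]
```

The repeated-page check guards against an endpoint that ignores the page number: without it, a source that always returns the same non-empty page would loop forever.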

## Find issues

In [None]:
# Build the repository search query: one topic: qualifier per topic and one
# language: qualifier per language, joined by '+', sorted by stars
url = 'https://api.github.com/search/repositories?q={}+{}&sort=stars&order=desc'.format(
    '+'.join('topic:{}'.format(topic) for topic in topics),
    '+'.join('language:{}'.format(language) for language in languages))
repo_data = gh_curl_response(url, gh_username, gh_oauth_key)

# For each matching repo, fetch its open issues that carry the requested label
for repo in repo_data['items']:
    repo_url = repo['url']
    issues_url = '{}/issues?state=open&labels={}'.format(repo_url, issue_label.replace(' ', '%20'))
    issue_data = gh_curl_response(issues_url, gh_username, gh_oauth_key)
    if issue_data:
        print('\n')
        print('Repo: {}'.format(repo['full_name']))
        print('Description: {}'.format(repo['description']))
        print('Language: {}'.format(repo['language']))
        print('URL: {}'.format(repo['html_url']))
        print('Open issues with label "{}":'.format(issue_label))
        for issue in issue_data:
            print('\t- {} ({})'.format(issue['title'], issue['html_url']))
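
The URLs above are assembled with string formatting and a manual replacement of spaces. A possible alternative, sketched here under the assumption that labels or language names may contain other reserved characters (for example `c++`, where a literal `+` would otherwise be read as a separator), is to percent-encode each piece with the standard library's `urllib.parse.quote`. The helper names `build_search_url` and `build_issues_url`, and the `owner/name` repository in the usage lines, are hypothetical.

```python
from urllib.parse import quote


def build_search_url(topics, languages):
    """Build a repository search URL from topic and language qualifiers."""
    qualifiers = ['topic:{}'.format(t) for t in topics]
    qualifiers += ['language:{}'.format(lang) for lang in languages]
    # Percent-encode each qualifier (keeping ':' readable), then join with '+'
    query = '+'.join(quote(q, safe=':') for q in qualifiers)
    return 'https://api.github.com/search/repositories?q={}&sort=stars&order=desc'.format(query)


def build_issues_url(repo_url, label):
    """Build the URL listing a repo's open issues that carry the given label."""
    return '{}/issues?state=open&labels={}'.format(repo_url, quote(label, safe=''))


search_url = build_search_url(['bioinformatics'], ['scala', 'java'])
issues_url_example = build_issues_url('https://api.github.com/repos/owner/name', 'help wanted')
print(search_url)
print(issues_url_example)
```

For the default parameters this should produce the same URLs as the cell above; the difference only shows up for inputs containing characters other than spaces that need encoding.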