## Import Libraries

The statements below import all the libraries used in this notebook.

In [None]:
import requests, time, datetime
import pandas as pd

from bs4 import BeautifulSoup
from github import Github
from getpass import getpass
from tqdm import trange

## Get Feedstocks!

conda-forge uses the name "feedstocks" for the GitHub repositories that connect the original Python packages to their conda packages. If you'd like to see what's in them, check out this page: https://conda-forge.org/feedstocks/
We will later iterate over these repositories so we can check out the open pull requests (PRs) in each one.

A status code of 200 means the request was successful.

In [None]:
# get webpage with all packages
r = requests.get('https://conda-forge.org/feedstocks/')

# check the status code returned
r.status_code
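
To make the meaning of a status code a little more concrete, here is a minimal sketch that uses the standard library's `http.HTTPStatus` to turn a numeric code into a readable label and a success flag. The helper name `describe_status` is made up for this example; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (label, ok) for an HTTP status code; ok is True for any 2xx code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # codes the standard library does not know about
        phrase = "Unknown"
    return "{} {}".format(code, phrase), 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
```

In the notebook, the same check could be applied to `r.status_code` before parsing the page.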

## Parse out what we need from the feedstocks

BeautifulSoup is a widely used web scraping package. Here we use it to parse the response from `requests` into an object we can search from Python.

As you can see, it grabs the links, and the cell output shows the first three feedstocks.

In [None]:
# parse the response to get package names
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("section", {"id": "feedstocks"})
links = [a['href'] for a in table.find_all('a')]

# inspect the links
links[:3]
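
The loop later in this notebook turns each link into an `owner/repo` name by slicing off a fixed number of characters. As an alternative sketch, the same name can be pulled out with `urllib.parse`, which does not depend on the exact length of the prefix. The function name `repo_full_name` and the example URL are illustrative assumptions, not taken from the scraped page.

```python
from urllib.parse import urlparse


def repo_full_name(url):
    """Return 'owner/repo' from a GitHub repository URL.

    Raises ValueError if the URL path has fewer than two parts.
    """
    path = urlparse(url).path.strip("/")
    owner, repo = path.split("/")[:2]
    return "{}/{}".format(owner, repo)


# illustrative URL in the shape the slicing in this notebook assumes
example = "https://github.com/conda-forge/numpy-feedstock"
print(repo_full_name(example))
```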

## writetoresults

This function will come up in a minute. It's how we write the collected pull requests to a CSV file.

In [None]:
def writetoresults(pullrecs,writefile):
    #this is where we use pandas (note the 'pd' below). It's really only because I'm lazy
    #and it's a simple way to quickly write to a CSV. 
    pullrecs_df=pd.DataFrame(pullrecs,columns=['package name','pull request title'])
    pullrecs_df.to_csv(writefile,mode='a')
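
One thing to be aware of: with its default settings, `to_csv(mode='a')` writes the header row again on every call, so a file built up in batches ends up with repeated header lines. The sketch below shows one way to write the header only once. It uses the standard-library `csv` module rather than pandas, and the name `append_rows` is hypothetical.

```python
import csv
import os


def append_rows(rows, path, header=("package name", "pull request title")):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Called repeatedly with the same path, this produces a single header followed by all the appended rows.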

## Authenticate!

To read the pull requests in each repository, we now use the PyGithub package (imported as `github`) to request the data from the GitHub API.

In [None]:
# input() prompts the user to type in their GitHub username
username=input('enter your github username')

# getpass keeps the secret out of the code and hides it while it is typed.
# note: GitHub's API no longer accepts account passwords, so enter a
# personal access token at this prompt instead.
password=getpass('input password')

g=Github(username,password)
writefile=input('file output path')

## getpullrecs

The meat of the script. It iterates over the `links` variable, checks whether each repository has open PRs, and if so appends them to the `pullrecs` variable.

Every 100 packages (counted by the `j` variable), it calls the `writetoresults()` function, so results are written to the CSV incrementally and are not lost if the process is interrupted.
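
The incremental-write pattern described above can be sketched on its own, separate from the GitHub calls. The names `process_in_batches`, `handle`, and `flush` are hypothetical: `handle` stands in for "collect the PRs for one package" and `flush` stands in for `writetoresults`.

```python
def process_in_batches(items, batch_size, handle, flush):
    """Call handle(item) for each item and flush the collected results every batch_size items."""
    buffer = []
    for count, item in enumerate(items, start=1):
        buffer.extend(handle(item))
        if count % batch_size == 0:
            flush(buffer)
            # start a fresh list so the flushed batch is left untouched
            buffer = []
    # the total is rarely a multiple of batch_size, so write the remainder
    if buffer:
        flush(buffer)


written = []
process_in_batches(range(5), 2, lambda x: [x], written.append)
print(written)
```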

In [None]:
def getpullrecs():
    j=0
    pullrecs=[]
    # trange is part of the tqdm package; it draws the progress bar while this runs
    with trange(len(links)) as t:
        # iterating over the trange is what advances the progress bar
        for i in t:
            tries=0
            # a failed request should not abort the whole run, so each package is
            # attempted up to 10 times before we give up on it and move on
            while tries < 10:
                try:
                    # drop the leading 'https://github.com/' (19 characters), which
                    # assumes every link starts that way, leaving 'conda-forge/<repo>'
                    link=links[i][19:]
                    # t is the trange created by the with statement above. this updates
                    # the progress bar description, minus 'conda-forge/' (12 characters)
                    t.set_description('the current package is: {}'.format(link[12:]))
                    r=g.get_repo(link)
                    pulls = r.get_pulls(state='open', sort='created', base='master')
                    if pulls.totalCount>0:
                        for pull in pulls:
                            pullrecs.append((r.name,pull.title))
                    break
                except Exception:
                    # TODO: catch the specific rate-limit exception instead of every error.
                    # GitHub allows an authenticated user 5000 API requests per hour, and
                    # this loop currently makes roughly 12000 calls. once the limit is hit,
                    # we wait just over an hour before retrying, since continuing to send
                    # requests may get the account or IP blocked.
                    tries+=1
                    print('rate limited @ {}'.format(str(datetime.datetime.now())))
                    time.sleep(3630)
            j+=1
            if j==100:
                # see? when j==100, write to results!
                writetoresults(pullrecs,writefile)
                pullrecs=[]
                # and reset the counter
                j=0

    # since we probably don't end on a number divisible by 100, once we fall out of the 
    # loop we need to write the rest of those results. 
    writetoresults(pullrecs,writefile)

In [None]:
getpullrecs()
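
The retry logic inside `getpullrecs` can also be pulled out into a small helper, which makes it easier to test without waiting an hour. This is only a sketch under stated assumptions: the name `call_with_retries` and its parameters are hypothetical, `max_tries` is assumed to be at least 1, and the `sleep` argument exists purely so a test can pass in a stand-in for `time.sleep`.

```python
import time


def call_with_retries(func, max_tries=10, wait_seconds=3630, sleep=time.sleep):
    """Call func() until it succeeds or max_tries attempts have failed.

    Waits wait_seconds between attempts and re-raises the last error
    if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except Exception as err:
            last_error = err
            # no point in waiting after the final attempt
            if attempt < max_tries:
                sleep(wait_seconds)
    raise last_error
```

With a helper like this, the body of the loop could wrap the GitHub calls for one package in a function and hand it to `call_with_retries`.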