# find_github_repo

## Goals

- Extract the names of GitHub repositories from PubMed abstracts that mention GitHub.
- Use the GitHub API to get author and contribution data about those repositories.
- Create a dataset linking the two.

The code that reads the XML files and extracts article information into a data frame comes from Kevin's xml_parsing script.

First, set the working directory.

In [71]:
import os
os.chdir("../data/")
os.listdir("./")

['github_pubs.xml', 'README.md']

Next, import the libraries needed for XML parsing.

In [72]:
import xml.etree.ElementTree as ET
import datetime

Create Article class definition

In [73]:
class Article(object):
    """Container for publication info"""
    def __init__(self, pmid, pubdate, journal, title, abstract, authors):
        self.pmid = pmid
        self.pubdate = pubdate
        self.journal = journal
        self.title = title
        self.abstract = abstract
        self.authors = authors
    def __repr__(self):
        return "<Article PMID: {}>".format(self.pmid)

    def get_authors(self):
        for author in self.authors:
            yield author["Last"], author["First"]
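
As a quick illustration of how the class above is meant to be used, here is a minimal sketch that builds one Article by hand. The PMID, date, journal, and author names are taken from a record that appears later in this notebook; the title and abstract strings are placeholders, not the real text. The class is repeated only so the snippet runs on its own.

```python
import datetime


class Article(object):
    """Container for publication info (mirrors the class defined above)."""
    def __init__(self, pmid, pubdate, journal, title, abstract, authors):
        self.pmid = pmid
        self.pubdate = pubdate
        self.journal = journal
        self.title = title
        self.abstract = abstract
        self.authors = authors

    def __repr__(self):
        return "<Article PMID: {}>".format(self.pmid)

    def get_authors(self):
        for author in self.authors:
            yield author["Last"], author["First"]


# Title and abstract are placeholders, not the real text of this record.
example = Article(
    pmid="25601296",
    pubdate=datetime.date(2015, 1, 20),
    journal="JMIR Med Inform",
    title="(placeholder title)",
    abstract="(placeholder abstract)",
    authors=[
        {"Last": "Dixit", "First": "Abhishek"},
        {"Last": "Dobson", "First": "Richard J B"},
    ],
)

print(example)
print(list(example.get_authors()))
```

Note that `authors` is expected to be a list of dicts with "Last" and "First" keys, and that `get_authors` is a generator, so it has to be wrapped in `list()` to see all pairs at once.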

Article generator function

In [74]:
def parse_pubmed_xml(xml_file):
    xml_handle = ET.parse(xml_file)
    root = xml_handle.getroot()

    for Citation in root.iter("MedlineCitation"):
        pmid = Citation[0].text
        pubdate = datetime.date(
            int(Citation[1][0].text),  # year
            int(Citation[1][1].text),  # month
            int(Citation[1][2].text)  # day
            )
        
        Journal = next(Citation.iter("Journal"))

        journal_title = Journal.find("ISOAbbreviation").text
        article_title = next(Citation.iter("ArticleTitle")).text
        
        abstract = next(Citation.iter("AbstractText")).text
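        try:
            authors = [{
                "Last": Author.find("LastName").text,
                "First": Author.find("ForeName").text
                   } for Author in Citation.iter("Author")]
        except AttributeError:
            # An author entry lacks LastName or ForeName (find() returned None); skip this article
            continue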
        
        yield Article(pmid, pubdate, journal_title, article_title, abstract, authors)
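
The generator above reads the PMID and date by child position and skips an article as soon as one author lacks a name field. A possible alternative, sketched here, looks fields up by tag name and drops only the incomplete author. The XML snippet is a hypothetical, heavily trimmed record, and the `DateCreated` tag name is an assumption about the input file; `parse_citation` is an illustrative helper, not part of Kevin's script.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record; real PubMed XML has many more fields.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <DateCreated><Year>2015</Year><Month>01</Month><Day>20</Day></DateCreated>
      <Article>
        <Journal><ISOAbbreviation>Example J</ISOAbbreviation></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>Code is at https://github.com/example/repo.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><CollectiveName>Example Consortium</CollectiveName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_citation(citation):
    """Look fields up by tag name; missing text fields come back as None."""
    date = citation.find("DateCreated")  # assumed tag name for the date element
    pubdate = None
    if date is not None:
        pubdate = datetime.date(
            int(date.findtext("Year")),
            int(date.findtext("Month")),
            int(date.findtext("Day")),
        )
    # Keep only authors that have a LastName (collective names are dropped).
    authors = [
        {"Last": a.findtext("LastName"), "First": a.findtext("ForeName")}
        for a in citation.iter("Author")
        if a.find("LastName") is not None
    ]
    return {
        "pmid": citation.findtext("PMID"),
        "pubdate": pubdate,
        "journal": citation.findtext(".//ISOAbbreviation"),
        "title": citation.findtext(".//ArticleTitle"),
        "abstract": citation.findtext(".//AbstractText"),
        "authors": authors,
    }


root = ET.fromstring(SAMPLE_XML)
records = [parse_citation(c) for c in root.iter("MedlineCitation")]
print(records[0]["pmid"], records[0]["authors"])
```

Whether dropping a single author or the whole article is the better choice depends on how the author list is used later; this sketch only shows that the two cases can be separated.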

Build the data frame. This differs slightly from Kevin's code by including the abstract and URL fields.

In [75]:
import pandas as pd
df = pd.DataFrame()
col_names = ["Date", "Journal", "Authors","Abstract","Url","Github"]

for article in parse_pubmed_xml('github_pubs.xml'):
    row = pd.Series([article.pubdate, article.journal, [(author[0], author[1]) for author in article.get_authors()],
                     article.abstract, '', ''],name=article.pmid, index=col_names)
    df = df.append(row)

df.head()

| PMID | Abstract | Authors | Date | Github | Journal | Url |
| --- | --- | --- | --- | --- | --- | --- |
| 26357045 | Stability and sensitivity analyses of biologic... | [(Shiraishi, Fumihide), (Yoshida, Erika), (Voi... | 2015-09-11 |  | IEEE/ACM Trans Comput Biol Bioinform |  |
| 25601296 | Most electronic data capture (EDC) and electro... | [(Dixit, Abhishek), (Dobson, Richard J B)] | 2015-01-20 |  | JMIR Med Inform |  |
| 25558360 | Remotely sensed data - available at medium to ... | [(Tuck, Sean L), (Phillips, Helen Rp), (Hintze... | 2015-01-05 |  | Ecol Evol |  |
| 25553811 | Whole-genome bisulfite sequencing (WGBS) is an... | [(Chen, Junfang), (Lutsik, Pavlo), (Akulenko, ... | 2015-01-02 |  | J Bioinform Comput Biol |  |
| 25549775 | A number of computational approaches have been... | [(Manini, Simone), (Antiga, Luca), (Botti, Lor... | 2015-06-09 |  | Ann Biomed Eng |  |


Check the content of the first abstract

In [76]:
df.iat[0,0]

'Stability and sensitivity analyses of biological systems require the ad hocwriting of computer code, which is highly dependent on the particular model and burdensome for large systems. We propose a very accurate strategy to overcome this challenge. Its core concept is the conversion of the model into the format of biochemical systems theory (BST), which greatly facilitates the computation of sensitivities. First, the steady state of interest is determined by integrating the model equations toward the steady state and then using a Newton-Raphson method to fine-tune the result. The second step of conversion into the BST format requires several instances of numerical differentiation. The accuracy of this task is ensured by the use of a complex-variable Taylor scheme for all differentiation steps. The proposed strategy is implemented in a new software program, COSMOS, which automates the stability and sensitivity analysis of essentially arbitrary ODE models in a quick, yet highly accurate

Use a regular expression to extract the GitHub URL from each abstract (adapted from http://stackoverflow.com/questions/839994/extracting-a-url-in-python). The pattern is overkill, but it does the job. The re module provides regular expression support.

In [77]:
import re
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
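
Since only GitHub links matter here, a much narrower pattern might be enough. The sketch below is an alternative to the general-purpose URL regex, not a replacement the notebook actually uses. It assumes owner and repository names consist of letters, digits, "-", "_" and ".", and the names `GITHUB_REPO_RE` and `find_github_repos` are made up for this example. The sample sentence is hypothetical; the repository URL in it is one that appears in this notebook.

```python
import re

# Assumption: owner and repository names use only letters, digits, "-", "_" and ".".
GITHUB_REPO_RE = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def find_github_repos(text):
    """Return "owner/repo" strings for every github.com link found in text."""
    repos = []
    for owner, repo in GITHUB_REPO_RE.findall(text):
        repo = repo.rstrip(".")      # drop sentence-ending punctuation
        if repo.endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        repos.append("{}/{}".format(owner, repo))
    return repos


sample = "Source code is available at https://github.com/BioprocessdesignLab/COSMOS."
print(find_github_repos(sample))
```

The trade-off is that this pattern ignores hosts such as `archtk.github.com`, which the general regex does pick up.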

Import urlparse to allow parsing of urls

In [78]:
from urlparse import urlparse

Loop through all rows and extract the GitHub URL from each abstract.

For each abstract, the first URL containing the string "github" is stored in the Url column, and the project name is derived from it:

- If the URL starts with http(s)://github.com, urlparse returns the project as the URL path.
- If it starts with github.com (no scheme), urlparse doesn't work, but truncating the string yields the project.

Counters keep track of each outcome for later quality control.


In [79]:
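except_no = 0            # abstracts without a usable github URL
no_project_from_url = 0  # github URL found, but no project could be derived from it
project_from_url = 0     # project successfully derived from the URL

# Columns are sorted alphabetically, so position 0 is Abstract, 3 is Github and 5 is Url
for i in range(len(df.index)):
    abstract = df.iat[i,0]
    github_project = ''
    try:
        project_url_list = re.findall(URL_REGEX, abstract)

        # First URL that mentions github; next() raises StopIteration if there is none
        index = next(j for j, string in enumerate(project_url_list) if "github" in string)
        project_url = project_url_list[index]

        df.iat[i,5] = project_url

        if project_url.startswith('https://github.com') or project_url.startswith('http://github.com'):
            github_project = urlparse(project_url).path[1:]
            df.iat[i,3] = github_project
            project_from_url = project_from_url + 1

        elif project_url.startswith('github.com'):
            # Strip the leading 'github.com/' (11 characters)
            github_project = project_url[11:]
            df.iat[i,3] = github_project
            project_from_url = project_from_url + 1

        else:
            no_project_from_url = no_project_from_url + 1

    except Exception:
        # Typically: no github URL in the abstract, or the abstract is not a string
        except_no = except_no + 1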
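
The two branches in the loop above (URLs with a scheme and bare github.com URLs) could be folded into one helper by adding a scheme before calling urlparse. This is a minimal sketch, not the code used to produce the results in this notebook; `repo_from_url` is a hypothetical name, and unlike the loop it keeps only the first two path segments.

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2, as used in this notebook


def repo_from_url(url):
    """Return "owner/repo" for a github.com URL, or None if there is none."""
    if "://" not in url:
        url = "https://" + url          # urlparse needs a scheme to fill in netloc
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None                     # e.g. archtk.github.com
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None                     # owner only, no repository
    return "/".join(parts[:2])


print(repo_from_url("https://github.com/BioprocessdesignLab/COSMOS"))
print(repo_from_url("github.com/seantuck12/MODISTools"))
print(repo_from_url("http://archtk.github.com"))
```

Returning None instead of an empty string makes the "no project" case explicit, at the cost of needing a check before writing the value into the data frame.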

Check how many abstracts did not have a GitHub URL associated with them

In [80]:
print except_no

270


In [81]:
df.head(200)

| PMID | Abstract | Authors | Date | Github | Journal | Url |
| --- | --- | --- | --- | --- | --- | --- |
| 26357045 | Stability and sensitivity analyses of biologic... | [(Shiraishi, Fumihide), (Yoshida, Erika), (Voi... | 2015-09-11 | BioprocessdesignLab/COSMOS | IEEE/ACM Trans Comput Biol Bioinform | https://github.com/BioprocessdesignLab/COSMOS |
| 25601296 | Most electronic data capture (EDC) and electro... | [(Dixit, Abhishek), (Dobson, Richard J B)] | 2015-01-20 |  | JMIR Med Inform |  |
| 25558360 | Remotely sensed data - available at medium to ... | [(Tuck, Sean L), (Phillips, Helen Rp), (Hintze... | 2015-01-05 | seantuck12/MODISTools | Ecol Evol | https://github.com/seantuck12/MODISTools |
| 25553811 | Whole-genome bisulfite sequencing (WGBS) is an... | [(Chen, Junfang), (Lutsik, Pavlo), (Akulenko, ... | 2015-01-02 | Junfang/AKSmooth | J Bioinform Comput Biol | https://github.com/Junfang/AKSmooth |
| 25549775 | A number of computational approaches have been... | [(Manini, Simone), (Antiga, Luca), (Botti, Lor... | 2015-06-09 |  | Ann Biomed Eng | http://archtk.github.com |
| 25543048 | Sampling the conformational space of biologica... | [(Bouvier, Guillaume), (Desdouits, Nathan), (F... | 2015-04-28 |  | Bioinformatics |  |
| 25540185 | With rapidly increasing volumes of biological ... | [(Meinicke, Peter)] | 2015-04-28 |  | Bioinformatics |  |
| 25527832 | Current strategies for SNP and INDEL discovery... | [(Lindberg, Michael R), (Hall, Ira M), (Quinla... | 2015-04-12 |  | Bioinformatics |  |
| 25526884 | An absent word of a word y of length n is a wo... | [(Barton, Carl), (Heliou, Alice), (Mouchard, L... | 2015-04-27 |  | BMC Bioinformatics |  |
| 25524895 | VarSim is a framework for assessing alignment ... | [(Mu, John C), (Mohiyuddin, Marghoob), (Li, Ji... | 2015-04-28 |  | Bioinformatics |  |


Quality control: check for how many URLs the extraction of a GitHub project worked

In [82]:
print no_project_from_url
print project_from_url

15
116
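
The three hand-maintained counters could also be replaced by a single tally. The sketch below classifies URLs the same way the extraction loop does and counts the outcomes with collections.Counter. The input list is a hypothetical sample (three URLs from the table above plus one missing value), and `classify` is an illustrative helper name.

```python
from collections import Counter


def classify(url):
    """Label one extracted URL the way the extraction loop counts it."""
    if url is None:
        return "no_github_url"
    if url.startswith(("https://github.com", "http://github.com", "github.com")):
        return "project_from_url"
    return "no_project_from_url"


# Hypothetical sample: three URLs from the table above and one missing value.
urls = [
    "https://github.com/BioprocessdesignLab/COSMOS",
    "http://archtk.github.com",
    None,
    "https://github.com/seantuck12/MODISTools",
]

outcomes = Counter(classify(u) for u in urls)
share = 100.0 * outcomes["project_from_url"] / len(urls)

print(outcomes)
print(share)
```

With a Counter, the categories always sum to the number of inputs, which gives a simple sanity check on the bookkeeping.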


In [96]:
total_rows = len(df.index)+1

print project_from_url*100.0/total_rows

28.7841191067


In [84]:
test = str(df.iat[0,5])
print test

https://github.com/BioprocessdesignLab/COSMOS


In [85]:
test2 = str(df.iat[4,5])
print test2

http://archtk.github.com


In [86]:
urlparse(test)


ParseResult(scheme='https', netloc='github.com', path='/BioprocessdesignLab/COSMOS', params='', query='', fragment='')

In [87]:
urlparse(test2)

ParseResult(scheme='http', netloc='archtk.github.com', path='', params='', query='', fragment='')
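
For URLs like the second one, the path is empty and the only usable piece is the host name. Assuming such hosts follow the `<owner>.github.com` or `<owner>.github.io` pattern used for project pages, the account name can at least be recovered from the subdomain; the repository itself cannot. The helper name `owner_from_pages_url` and the list of excluded subdomains are assumptions made for this sketch, and the list is certainly incomplete.

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2, as used in this notebook

# Assumed, incomplete list of subdomains that are services rather than accounts.
RESERVED = {"www", "gist", "api", "raw"}


def owner_from_pages_url(url):
    """Return the subdomain of a "<owner>.github.com" / "<owner>.github.io" host, else None."""
    host = urlparse(url).netloc.lower()
    for suffix in (".github.com", ".github.io"):
        if host.endswith(suffix):
            owner = host[:-len(suffix)]
            if owner and owner not in RESERVED:
                return owner
    return None


print(owner_from_pages_url("http://archtk.github.com"))
print(owner_from_pages_url("https://github.com/BioprocessdesignLab/COSMOS"))
```

An owner found this way would still need a second step, for example listing that account's repositories through the API, before it can be linked to a specific project.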

In [88]:
gurl = urlparse("https://github.com/BioprosessdesignLab/COSMOS")

In [89]:
print(gurl.path)

/BioprosessdesignLab/COSMOS


In [90]:
test.find("https://github.com/")



0

Access the GitHub API using PyGithub


In [91]:
from github import Github

g = Github()

In [92]:
whisper = g.get_repo("graphite-project/whisper")
print whisper.description

Whisper is a file-based time-series database format for Graphite.


In [93]:
cosmos = g.get_repo("BioprocessdesignLab/COSMOS")
print cosmos.description

Computation of Sensitivities in Model ODE Systems
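
Toward the second goal (author and contribution data), a next step might look like the sketch below. It relies on PyGithub's `get_contributors()` on a repository object and on the `login` and `contributions` attributes of the returned users; the function names `summarize_contributors` and `fetch_summary` are made up for this example. The live call needs network access and is subject to the API rate limit for unauthenticated clients, so it is left commented out.

```python
def summarize_contributors(repo, limit=5):
    """Return (login, contributions) pairs for the first `limit` contributors.

    `repo` is expected to behave like a PyGithub Repository object.
    """
    summary = []
    for user in repo.get_contributors():
        if len(summary) >= limit:
            break
        summary.append((user.login, user.contributions))
    return summary


def fetch_summary(full_name):
    """Query the live API; needs network access and the PyGithub package."""
    from github import Github  # imported lazily so the helper above works offline
    client = Github()          # unauthenticated, as in the cells above
    return summarize_contributors(client.get_repo(full_name))


# Example call (not executed here):
# print(fetch_summary("BioprocessdesignLab/COSMOS"))
```

Keeping the summarizing logic separate from the API call makes it possible to check it against stand-in objects without touching the network.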
