# find_github_repo

## Goals

- Extract the names of GitHub repositories from PubMed abstracts that mention GitHub.
- Use the GitHub API to get author and contribution data for those repositories.
- Create a dataset linking the two.
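
The first two goals can be sketched as follows. This is a minimal sketch, not the notebook's own extraction code: the regular expression, the helper names `extract_repos` and `contributors_url`, and the sample abstract are all assumptions made for illustration. The URL it builds follows the GitHub REST API's `GET /repos/{owner}/{repo}/contributors` endpoint; no request is actually sent here.

```python
import re

# Assumed pattern: "github.com/<owner>/<repo>", with or without a scheme.
# Owner and repo are limited to characters that can appear in GitHub names.
REPO_PATTERN = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)

def extract_repos(abstract):
    """Return unique (owner, repo) pairs mentioned in an abstract (hypothetical helper)."""
    repos = []
    for owner, repo in REPO_PATTERN.findall(abstract or ""):
        # Trailing sentence punctuation and a ".git" suffix are not part of the name.
        repo = repo.rstrip(".")
        if repo.endswith(".git"):
            repo = repo[:-4]
        if repo and (owner, repo) not in repos:
            repos.append((owner, repo))
    return repos

def contributors_url(owner, repo):
    """Build the REST API URL that lists a repository's contributors."""
    return "https://api.github.com/repos/{}/{}/contributors".format(owner, repo)

# Made-up abstract text, for illustration only.
sample = "Source code is available at https://github.com/example-lab/example-tool."
for owner, repo in extract_repos(sample):
    print(owner, repo, contributors_url(owner, repo))
```

Abstracts often end a sentence right after the URL, which is why the sketch strips trailing periods before recording the repository name.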

The code that reads the XML files and extracts article information into a data frame comes from Kevin's xml_parsing script.

First, set the working directory.

In [1]:
import os
os.chdir("../data/")
os.listdir("./")

['github_pubs.xml', 'README.md']

Next, import the libraries needed for XML parsing.

In [2]:
import xml.etree.ElementTree as ET
import datetime

Define the Article class.

In [3]:
class Article(object):
    """Container for publication info"""
    def __init__(self, pmid, pubdate, journal, title, abstract, authors):
        self.pmid = pmid
        self.pubdate = pubdate
        self.journal = journal
        self.title = title
        self.abstract = abstract
        self.authors = authors
    def __repr__(self):
        return "<Article PMID: {}>".format(self.pmid)

    def get_authors(self):
        for author in self.authors:
            yield author["Last"], author["First"]

Define the article generator function.

In [4]:
def parse_pubmed_xml(xml_file):
    xml_handle = ET.parse(xml_file)
    root = xml_handle.getroot()

    for Citation in root.iter("MedlineCitation"):
        pmid = Citation[0].text
        pubdate = datetime.date(
            int(Citation[1][0].text),  # year
            int(Citation[1][1].text),  # month
            int(Citation[1][2].text)  # day
            )
        
        Journal = next(Citation.iter("Journal"))

        journal_title = Journal.find("ISOAbbreviation").text
        article_title = next(Citation.iter("ArticleTitle")).text
        
        abstract = next(Citation.iter("AbstractText")).text
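        try:
            # Author entries without LastName/ForeName (e.g. collective names)
            # make .find() return None, so .text raises AttributeError.
            authors = [{
                "Last": Author.find("LastName").text,
                "First": Author.find("ForeName").text
            } for Author in Citation.iter("Author")]
        except AttributeError:
            # Skip articles with incomplete author names.
            continue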
        
        yield Article(pmid, pubdate, journal_title, article_title, abstract, authors)
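
The positional lookups in `parse_pubmed_xml` (`Citation[0]`, `Citation[1][0]`, and so on) depend on the child order of each `MedlineCitation` element. The sketch below uses a made-up record, not real PubMed data, to show what those indices pick out, alongside a name-based alternative with `findtext` that returns `None` when an element is missing. The `DateCreated` element name, the layout, and all values are assumptions for illustration and should be checked against the actual file.

```python
import datetime
import xml.etree.ElementTree as ET

# Made-up record with an assumed layout: PMID first, then a date element.
SAMPLE = """
<PubmedArticleSet>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <DateCreated><Year>2015</Year><Month>01</Month><Day>09</Day></DateCreated>
    <Article>
      <Journal><ISOAbbreviation>Example J.</ISOAbbreviation></Journal>
      <ArticleTitle>An example title</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticleSet>
"""

root = ET.fromstring(SAMPLE)
citation = next(root.iter("MedlineCitation"))

# Positional access, as in parse_pubmed_xml: relies on child order.
pmid = citation[0].text
pubdate = datetime.date(*(int(part.text) for part in citation[1]))

# Name-based access: independent of child order, None if the element is absent.
journal = citation.findtext("Article/Journal/ISOAbbreviation")
abstract = citation.findtext("Article/Abstract/AbstractText")

print(pmid, pubdate, journal, abstract)
```

The sample record has no abstract, so the name-based lookup yields `None` rather than raising, which makes it easy to decide explicitly whether to skip such records.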

Build the data frame (this differs slightly from Kevin's code by also including the abstract).

In [6]:
import pandas as pd
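col_names = ["Date", "Journal", "Authors", "Abstract"]

# Collect one Series per article, named by PMID, then build the frame once;
# DataFrame.append was removed in pandas 2.0.
rows = []
for article in parse_pubmed_xml('github_pubs.xml'):
    rows.append(pd.Series(
        [article.pubdate, article.journal, list(article.get_authors()), article.abstract],
        name=article.pmid, index=col_names))

# Sort columns alphabetically so Abstract is listed first.
df = pd.DataFrame(rows).sort_index(axis=1)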

print(df)

                                                   Abstract  \
26357045  Stability and sensitivity analyses of biologic...   
25601296  Most electronic data capture (EDC) and electro...   
25558360  Remotely sensed data - available at medium to ...   
25553811  Whole-genome bisulfite sequencing (WGBS) is an...   
25549775  A number of computational approaches have been...   
25543048  Sampling the conformational space of biologica...   
25540185  With rapidly increasing volumes of biological ...   
25527832  Current strategies for SNP and INDEL discovery...   
25526884  An absent word of a word y of length n is a wo...   
25524895  VarSim is a framework for assessing alignment ...   
25521965  Thyroid cancer is the most common endocrine tu...   
25520192  RNA performs a diverse array of important func...   
25514851  We describe an open-source kPAL package that f...   
25505094  Circular permutation is an important type of p...   
25505087  In bioinformatic applications, computationall