# Parsing PubMed XML files

Data sets can be found [here](https://github.com/kescobo/gender-comp-bio/tree/master/data).

## Goals

Raw data from PubMed is stored in XML files, and we'd like to extract author and date information into a spreadsheet for easier analysis.

First, set the working directory to the folder that holds the data.

In [35]:
import os

print(os.getcwd())

/Users/KBLaptop/computation/gender-comp-bio/data


In [36]:
os.chdir("../data/")
os.listdir()

['.DS_Store',
 'biology-1997-2014.xml',
 'comp-bio-1997-2014.xml',
 'github_pubs.xml',
 'README.md']

Next, we'll need to parse the XML files. We can do this with the [`xml.etree.ElementTree` module](https://docs.python.org/3.5/library/xml.etree.elementtree.html) from the Python standard library.

In [39]:
import xml.etree.ElementTree as ET
import datetime

xml_handle = ET.parse('github_pubs.xml')
root = xml_handle.getroot()

for Citation in root.iter("MedlineCitation"):
    pmid = Citation[0].text
    pubdate = datetime.date(
        int(Citation[1][0].text),  # year
        int(Citation[1][1].text),  # month
        int(Citation[1][2].text)  # day
        )
    
    Journal = next(Citation.iter("Journal"))

    journal_title = Journal.find("ISOAbbreviation").text
    
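    # not every record has an abstract, so fall back to None instead of raising StopIteration
    abstract_node = next(Citation.iter("AbstractText"), None)
    abstract = abstract_node.text if abstract_node is not None else None

    # skip articles where an author lacks a LastName or ForeName element
    # (find() returns None there, so .text raises AttributeError)
    try:
        authors = [{
                "Last": Author.find("LastName").text,
                "First": Author.find("ForeName").text
                   } for Author in Citation.iter("Author")]
    except AttributeError:
        continue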
    
    print("PMID: {}\nJournal: {}\nAuthors: {}\n".format(
        pmid, journal_title, [(x['Last'], x['First']) for x in authors]))
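
Positional lookups such as `Citation[1][0]` depend on the exact order of child elements. Below is a minimal sketch of the same extraction by tag name. The record in `SAMPLE` is made up for illustration, and the `DateCreated` tag and the `parse_date` helper are assumptions, so check them against the real files before relying on them.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical record for illustration only; real PubMed records have many more fields.
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <DateCreated><Year>2015</Year><Month>01</Month><Day>20</Day></DateCreated>
      <Article>
        <Journal><ISOAbbreviation>J Example</ISOAbbreviation></Journal>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><CollectiveName>Example Consortium</CollectiveName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_date(citation, tag="DateCreated"):
    """Build a date from the Year/Month/Day children of `tag`, or None if `tag` is absent."""
    node = citation.find(tag)
    if node is None:
        return None
    return datetime.date(*(int(node.findtext(part)) for part in ("Year", "Month", "Day")))


root = ET.fromstring(SAMPLE)
citation = next(root.iter("MedlineCitation"))
pubdate = parse_date(citation)

# findtext() returns None instead of raising when a tag is missing,
# so the group author below yields (None, None) rather than an exception
names = [(a.findtext("LastName"), a.findtext("ForeName")) for a in citation.iter("Author")]
print(pubdate, names)
```

Looking elements up by name keeps the code working if a record carries extra or reordered children, at the cost of having to decide what to do with the `None` values.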

### Class Definition

Just because I need the practice, I'm going to set up an `Article` class to hold the data and make working with it easier.

In [40]:
class Article(object):
    """Container for publication info"""
    def __init__(self, pmid, pubdate, journal, title, abstract, authors):
        self.pmid = pmid
        self.pubdate = pubdate
        self.journal = journal
        self.title = title
        self.abstract = abstract
        self.authors = authors

    def __repr__(self):
        return "<Article PMID: {}>".format(self.pmid)

    def get_authors(self):
        for author in self.authors:
            yield author["Last"], author["First"]

### Generator Function

And... we can turn the code above into a generator function that yields an `Article` for each document.

In [41]:
def parse_pubmed_xml(xml_file):
    xml_handle = ET.parse(xml_file)
    root = xml_handle.getroot()

    for Citation in root.iter("MedlineCitation"):
        pmid = Citation[0].text
        pubdate = datetime.date(
            int(Citation[1][0].text),  # year
            int(Citation[1][1].text),  # month
            int(Citation[1][2].text)  # day
            )
        
        Journal = next(Citation.iter("Journal"))

        journal_title = Journal.find("ISOAbbreviation").text
        article_title = next(Citation.iter("ArticleTitle")).text
        
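        # not every record has an abstract; a bare next() would raise
        # StopIteration here and break the generator
        abstract_node = next(Citation.iter("AbstractText"), None)
        abstract = abstract_node.text if abstract_node is not None else None

        # skip articles where an author lacks a LastName or ForeName element
        try:
            authors = [{
                "Last": Author.find("LastName").text,
                "First": Author.find("ForeName").text
                   } for Author in Citation.iter("Author")]
        except AttributeError:
            continue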
        
        yield Article(pmid, pubdate, journal_title, article_title, abstract, authors)

Usage:

In [46]:
for article in parse_pubmed_xml('github_pubs.xml'):
    print(article)
    print(article.pubdate)
    for last, first in article.get_authors():
        print("{}, {}".format(last, first))
    print()
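
`ET.parse` reads the whole file into memory, which may become slow for the larger data sets. Below is a hedged sketch of a streaming alternative built on `ET.iterparse`. The helper `iter_citations` is hypothetical (it is not part of this notebook), and the two-record document is made up to stand in for a real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_citations(source):
    """Yield (pmid, title) for each MedlineCitation as soon as it has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "MedlineCitation":
            yield elem.findtext("PMID"), elem.findtext(".//ArticleTitle")
            elem.clear()  # drop this citation's children once it has been used


# Made-up two-record document; a filename can be passed instead of a file object
demo = io.BytesIO(
    b"<Set>"
    b"<MedlineCitation><PMID>1</PMID><Article><ArticleTitle>A</ArticleTitle></Article></MedlineCitation>"
    b"<MedlineCitation><PMID>2</PMID><Article><ArticleTitle>B</ArticleTitle></Article></MedlineCitation>"
    b"</Set>"
)
results = list(iter_citations(demo))
print(results)
```

The `"end"` event fires once an element's closing tag has been parsed, so all of its children are available at that point.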

### DataFrame generation

To get the data into a spreadsheet-like form for later analysis, I'm going to use `DataFrame`s from the [pandas](http://pandas.pydata.org/) package. This might be overkill, but I know how to use it (sort of).

In [43]:
import pandas as pd

col_names = ["Date", "Journal", "Authors"]

# DataFrame.append was removed in pandas 2.0, so collect the rows first
# and build the DataFrame in one step
rows = []
for article in parse_pubmed_xml('github_pubs.xml'):
    row = pd.Series([article.pubdate, article.journal, list(article.get_authors())],
                    name=article.pmid, index=col_names)
    rows.append(row)

df = pd.DataFrame(rows)

print(df)

                                                    Authors        Date  \
26357045  [(Shiraishi, Fumihide), (Yoshida, Erika), (Voi...  2015-09-11   
25601296         [(Dixit, Abhishek), (Dobson, Richard J B)]  2015-01-20   
25558360  [(Tuck, Sean L), (Phillips, Helen Rp), (Hintze...  2015-01-05   
25553811  [(Chen, Junfang), (Lutsik, Pavlo), (Akulenko, ...  2015-01-02   
25549775  [(Manini, Simone), (Antiga, Luca), (Botti, Lor...  2015-06-09   
25543048  [(Bouvier, Guillaume), (Desdouits, Nathan), (F...  2015-04-28   
25540185                                [(Meinicke, Peter)]  2015-04-28   
25527832  [(Lindberg, Michael R), (Hall, Ira M), (Quinla...  2015-04-12   
25526884  [(Barton, Carl), (Heliou, Alice), (Mouchard, L...  2015-04-27   
25524895  [(Mu, John C), (Mohiyuddin, Marghoob), (Li, Ji...  2015-04-28   
25521965  [(Wu, Chengkun), (Schwartz, Jean-Marc), (Braba...  2014-12-19   
25520192               [(Gardner, Paul P), (Eldai, Hisham)]  2015-01-24   
25514851  [(Anvar, Seyed 
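
Because the end goal is a spreadsheet, one possible next step is to flatten the table to one row per author and write it out as CSV. This is a sketch under assumptions: the two records in `demo` are made up, and the column names `PMID`, `Position`, `Last` and `First` are hypothetical choices rather than anything the notebook defines.

```python
import io

import pandas as pd

# Made-up stand-in for a DataFrame with Date, Journal and Authors columns
demo = pd.DataFrame(
    {
        "Date": ["2015-01-20", "2015-04-28"],
        "Journal": ["J Example", "Example Lett"],
        "Authors": [[("Doe", "Jane"), ("Roe", "Richard")], [("Poe", "Alex")]],
    },
    index=["1", "2"],
)
demo.index.name = "PMID"

# explode() gives each (last, first) tuple its own row, repeating Date and Journal
long_df = demo.explode("Authors").reset_index()
long_df["Position"] = long_df.groupby("PMID").cumcount() + 1  # 1-based author order
long_df["Last"] = [author[0] for author in long_df["Authors"]]
long_df["First"] = [author[1] for author in long_df["Authors"]]
long_df = long_df.drop(columns="Authors")

buffer = io.StringIO()  # pass a filename instead to write to disk
long_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

A long table like this opens cleanly in spreadsheet software, whereas a column of Python lists would be written out as one opaque string per article.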

### Getting Author Order

Author position matters, but in a slightly odd way: first and last author are the most important, and importance decreases as you work your way in toward the middle of the list. In practice, there's not much distinction between 3rd and 4th author (or 3rd and 4th from last), so we'll generate scores for first, second, last, penultimate, and everyone else. The trick is to avoid index errors when the list has fewer than 5 authors, so we need to write some special cases.

In [46]:
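def score_short(author_list):
    """Assign positions for lists of one to four authors."""
    positions = {"first": author_list[0], "second": None,
                 "penultimate": None, "last": None, "others": []}

    if len(author_list) == 4:
        positions["second"] = author_list[1]
        positions["penultimate"] = author_list[2]
        positions["last"] = author_list[3]
    elif len(author_list) == 3:
        positions["penultimate"] = author_list[1]
        positions["last"] = author_list[2]
    elif len(author_list) == 2:
        positions["last"] = author_list[1]
    return positions


def score_authors(author_list):
    """Return a dict mapping each position to the author(s) holding it."""
    if len(author_list) < 5:
        return score_short(author_list)
    return {
        "first": author_list[0],
        "second": author_list[1],
        "penultimate": author_list[-2],
        "last": author_list[-1],
        "others": author_list[2:-2],
    }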

In [47]:
for authors in df["Authors"]:
    score_authors(authors)
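
One way to sidestep the special cases is to assign labels in priority order and let short lists simply run out of free slots. This is a sketch: `label_positions` and the label strings are hypothetical names, and the priority order (first, last, penultimate, second) is chosen to reproduce the branches in `score_short`.

```python
def label_positions(author_list):
    """Return (author, label) pairs, assigning labels in priority order."""
    n = len(author_list)
    labels = ["other"] * n
    # Highest priority first; an index that already has a label is left alone
    for index, label in ((0, "first"), (n - 1, "last"), (n - 2, "penultimate"), (1, "second")):
        if 0 <= index < n and labels[index] == "other":
            labels[index] = label
    return list(zip(author_list, labels))


# Made-up author lists of different lengths
for size in range(6):
    print(size, [label for _author, label in label_positions(list("ABCDE")[:size])])
```

Unlike the indexing in `score_short`, this version also accepts an empty author list and returns an empty result for it.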