# Parsing Pubmed XML file

Data sets can be found [here](https://github.com/kescobo/gender-comp-bio/tree/master/data).

## Goals

Raw data from pubmed is contained in xml files, and we'd like to extract author and date information into a spreadsheet for easier analysis. 

The first thing to do is to make sure to set the working directory to where the data is.

In [1]:
import os

print(os.getcwd())

/Users/KBLaptop/computation/gender-comp-bio/src


In [2]:
os.chdir("../data/")
os.listdir()

['.DS_Store',
 'all_bio_pubs.csv',
 'author_list.csv',
 'biology-1997-2014.xml',
 'comp-bio-1997-2014.xml',
 'compbio_pubs.csv',
 'compbio_pubs2.csv',
 'github_pubs.csv',
 'github_pubs.xml',
 'README.md',
 'test_1.xml',
 'test_2.xml',
 'test_3.xml']

Next, we'll need to parse the xml files. We can do this using the built-in [python xml module](https://docs.python.org/3.5/library/xml.etree.elementtree.html). 

In [3]:
import xml.etree.ElementTree as ET
import datetime

xml_handle = ET.parse('github_pubs.xml')
root = xml_handle.getroot()

for citation in root.iter("MedlineCitation"):
    pmid = citation[0].text
    pubdate = datetime.date(
        int(citation[1][0].text),  # year
        int(citation[1][1].text),  # month
        int(citation[1][2].text)  # day
        )
    
    Journal = next(citation.iter("Journal"))

    journal_title = Journal.find("ISOAbbreviation").text
    
    abstract = next(citation.iter("AbstractText")).text
    
    # some articles don't have author fields - ignoring those
    try:
        authors = [{
                "Last": author.find("LastName").text,
                "First": author.find("ForeName").text
                   } for author in Citation.iter("Author")]
    except:
        continue
    
    print("PMID: {}\nJournal: {}\nAuthors: {}\n".format(
        pmid, journal_title, [(x['Last'], x['First']) for x in authors]))

### Class Definition

Just because I need the practice, I'm going to set up an `Article` class to hold the data and make working with it easier, and an `Author` class that we can use to deal with author names

In [4]:
class Article(object):
    """Container for publication info"""
    def __init__(self, pmid, pubdate, journal, title, abstract, authors):
        self.pmid = pmid
        self.pubdate = pubdate
        self.journal = journal
        self.title = title
        self.abstract = abstract
        self.authors = authors
    def __repr__(self):
        return "<Article PMID: {}>".format(self.pmid)

    def get_authors(self):
        for author in self.authors:
            yield author

class Author(object):
    def __init__(self, last_name, first_name):
        assert type(last_name) == str
        assert type(first_name) == str
        
        self.last_name = last_name
        self.first_name = first_name.split()[0]
        try:
            self.initials = " ".join(first_name.split()[1:])
        except IndexError:
            self.initials = None

### Generator Function

And... we can turn the code above into a generator function that yields an `Article` for each document

In [47]:
from lxml.etree import iterparse

def iter_parse_pubmed(xml_file):
    # get an iterable
    context = iterparse(xml_file, events=('end',), tag='PubmedArticle')
    counter = 0
    for event, elem in context:
        counter += 1
        print(elem.tag)
        if counter > 5:
            return

In [48]:
iter_parse_pubmed('github_pubs.xml')

PubmedArticle
PubmedArticle
PubmedArticle
PubmedArticle
PubmedArticle
PubmedArticle


In [5]:
def parse_pubmed_xml(xml_file):
    xml_handle = ET.parse(xml_file)
    root = xml_handle.getroot()

    for citation in root.iter("MedlineCitation"):
        pmid = citation[0].text
        pubdate = datetime.date(
            int(citation[1][0].text),  # year
            int(citation[1][1].text),  # month
            int(citation[1][2].text)  # day
            )
        
        journal = next(citation.iter("Journal"))

        j_title = journal.find("ISOAbbreviation")
        if j_title:
            journal_title = j_title.text
        else:
            journal_title = None
            
        try:
            article_title = next(citation.iter("ArticleTitle")).text
        except:
            article_title = None
        
        try:
            abstract = next(citation.iter("AbstractText")).text
        except:
            abstract = None
        try:
            authors = [Author(author.find("LastName").text, author.find("ForeName").text)
                for author in citation.iter("Author")]
        except:
            continue
        
        yield Article(pmid, pubdate, journal_title, article_title, abstract, authors)

Usage:

In [6]:
for article in parse_pubmed_xml('github_pubs.xml'):
    print(article)
    print(article.pubdate)
    for author in article.get_authors():
        print("{}, {} {}".format(author.last_name, author.first_name, author.initials))
    print()

<Article PMID: 26357045>
2015-09-11
Shiraishi, Fumihide 
Yoshida, Erika 
Voit, Eberhard O

<Article PMID: 25601296>
2015-01-20
Dixit, Abhishek 
Dobson, Richard J B

<Article PMID: 25558360>
2015-01-05
Tuck, Sean L
Phillips, Helen Rp
Hintzen, Rogier E
Scharlemann, Jörn Pw
Purvis, Andy 
Hudson, Lawrence N

<Article PMID: 25553811>
2015-01-02
Chen, Junfang 
Lutsik, Pavlo 
Akulenko, Ruslan 
Walter, Jörn 
Helms, Volkhard 

<Article PMID: 25549775>
2015-06-09
Manini, Simone 
Antiga, Luca 
Botti, Lorenzo 
Remuzzi, Andrea 

<Article PMID: 25543048>
2015-04-28
Bouvier, Guillaume 
Desdouits, Nathan 
Ferber, Mathias 
Blondel, Arnaud 
Nilges, Michael 

<Article PMID: 25540185>
2015-04-28
Meinicke, Peter 

<Article PMID: 25527832>
2015-04-12
Lindberg, Michael R
Hall, Ira M
Quinlan, Aaron R

<Article PMID: 25526884>
2015-04-27
Barton, Carl 
Heliou, Alice 
Mouchard, Laurent 
Pissis, Solon P

<Article PMID: 25524895>
2015-04-28
Mu, John C
Mohiyuddin, Marghoob 
Li, Jian 
Bani Asadi, Narges 
Gerstein, M

### Getting Author Order

Author position matters, but it matters in sort of a weird way - first author and last author are most important, then decreasing as you work your way in to the middle of the list. But practically, there's not much distinction between 3rd and 4th author (or 3rd from last and 4th from last), so we'll generate scores for first, second, last, penultimate and everyone else. The trick is to avoid index errors if the author list is smaller than 5, so we need to write up some special cases. 

In [7]:
def score_authors(author_list):
    if not author_list:
        first = None
    else:
        first = author_list[0]
    others, penultimate, second, last = None, None, None, None
    
    list_length = len(author_list)
    if list_length > 4:
        others = [author for author in author_list[2:-2]]
    if list_length > 3:
        penultimate = author_list[-2]
    if list_length > 2:
        second = author_list[1]
    if list_length > 1:
        last = author_list[-1]
        

    return first, last, second, penultimate, others

### DataFrame generation

In order to get the data into a usable spreadsheet-like form, and for later analysis, I'm going to use the `DataFrame`s from the [pandas](http://pandas.pydata.org/) package. This might be overkill, but I know how to use it (sort of). 

In [8]:
import pandas as pd

col_names = ["Date", "Journal", "First Author", "Last Author", "Second Author", "Penultimate Author", "Other Authors"]

df = pd.DataFrame()

for article in parse_pubmed_xml('github_pubs.xml'):
    first, last, second, penultimate, others = score_authors(article.authors)
    first = first.first_name
    try:
        last = last.first_name
    except:
        pass
    try:
        second = second.first_name
    except:
        pass
    try:
        penultimate = penultimate.first_name
    except:
        pass
    try:
        others = [x.first_name for x in others]
    except:
        pass
    
    row = pd.Series([article.pubdate, article.journal, first, last, second, penultimate, others],
                    name=article.pmid, index=col_names)
    df = df.append(row)

print(df)

                Date First Author Journal Last Author  \
26357045  2015-09-11     Fumihide    None    Eberhard   
25601296  2015-01-20     Abhishek    None     Richard   
25558360  2015-01-05         Sean    None    Lawrence   
25553811  2015-01-02      Junfang    None    Volkhard   
25549775  2015-06-09       Simone    None      Andrea   
25543048  2015-04-28    Guillaume    None     Michael   
25540185  2015-04-28        Peter    None        None   
25527832  2015-04-12      Michael    None       Aaron   
25526884  2015-04-27         Carl    None       Solon   
25524895  2015-04-28         John    None        Hugo   
25521965  2014-12-19     Chengkun    None       Goran   
25520192  2015-01-24         Paul    None      Hisham   
25514851  2015-02-06        Seyed    None      Jeroen   
25505094  2015-04-12      Spencer    None     Andreas   
25505087  2015-04-12       Gunnar    None        Hans   
25505086  2015-04-12       Daniel    None      Xiaowu   
25504847  2015-04-12     Bernha

## Getting Genders - Prep
Now the tough part - getting genders. 

I played around trying to get `sexmachine` and `GenderComputer` to work, but ran into some issues, and those projects don't seem like they're being maintained, so I thought i'd try [genderize.io](http://genderize.io) and [gender-api.com](gender-api.com). The trouble is these are a web apis, which takes more time than something run locally, and they have a limit to the number of requests you can make. Since there are probably a lot of duplicate names, I thought it might be worth collapsing the names into a set.

In [9]:
unique_names = set([])

def get_unique_names(xml_file, name_set=None):
    new_names = []
    if not name_set:
        name_set = set([])
    for article in parse_pubmed_xml(xml_file):
        first, last, second, penultimate, others = score_authors(article.authors)
        
        if first:
            fa = first.first_name
        else:
            fa = None
        if last:
            la = last.first_name
        else:
            la = None
        if second:
            sa = second.first_name
        else:
            sa = None
        if penultimate:
            pa = penultimate.first_name
        else:
            pa = None
        if others:
            oa = [o.first_name for o in others]
        else:
            oa = []

        new_names.extend([fa, la, sa, pa, *oa])
    
    return name_set.union(set(new_names))

In [10]:
print(len(unique_names))
unique_names = get_unique_names('github_pubs.xml', unique_names)
print(len(unique_names))

0
1079


So now... let's check the other, larger datasets:

In [11]:
unique_names = get_unique_names('comp-bio-1997-2014.xml', unique_names)
unique_names = get_unique_names('biology-1997-2014.xml', unique_names)

It took a while to parse all of the biology publications, so I'm just going to save the output from the last cell... Also, a few names came out as single characters, which I'm going to remove, and I'm going to make everything all lower case for consistancy

In [14]:
print(len(unique_names))
with open('all_names.txt', 'w+') as names_file:
    for name in unique_names:
        names_file.write('{},'.format(name))

74154


In [16]:
fixed_names = []
with open('all_names.txt', 'r') as names_file:
    names = names_file.read().split(",")
    
    for name in names:
        if name:
            if len(name) > 1:
                fixed_names.append(name.lower())
fixed_names.sort()
print(len(fixed_names))

74118


So now that we have a list of names, we need to get genders. Originally I used genderize.io and gender-api.com, but that was before when I thought I only had ~1000 names. But the module I finally got to work (gender-detector) only works in python2... so on to a [new notebook](../src/gender_detection.ipynb)...

## Getting Genders - Stats

Now it's time to start working on stats. Best to get gender assignments into columns along with names. Going to generate csv files with `pandas` in order to do analysis elsewhere. Using similar code from [Data Frame Generation](xml_parsing.ipynb#DataFrame-generation)

In [21]:
gender_dict = {}

with open('gender_dict.txt', 'r') as in_file:
    keys_and_values = in_file.read().split(',')
    for pair in keys_and_values:
        if pair:
            key, value = pair.split(':')
            gender_dict[key] = value


In [22]:
  
def get_gender(name):
    """
    Requires `gender_dict`
    """
    name = name.lower()
    if name in gender_dict:
        return gender_dict[name]
    else:
        return None
    

def make_csv_from_xml(xml_file, name):
    col_names = ["Date", "Journal", 
                 "First Author", "Last Author", "Second Author", "Penultimate Author", "Other Authors"]
    df = pd.DataFrame()

    counter = 0
    for article in parse_pubmed_xml(xml_file):
        counter +=1
        print(article)
        first, last, second, penultimate, others = score_authors(article.authors)
        fa = get_gender(first.first_name)
        
        if last:
            la = get_gender(last.first_name)
        else:
            la = None
        if second:
            sa = get_gender(second.first_name)
        else:
            sa = None
        if penultimate:
            pa = get_gender(penultimate.first_name)
        else:
            pa = None
        if others:
            oa = [get_gender(o.first_name) for o in others]
        else:
            oa = []
        
        row = pd.Series([article.pubdate, article.journal, fa, la, sa, pa, oa],
                        name=article.pmid, index=col_names)
        df = df.append(row)
        print(row)
        if counter >4:
            break
#     df.to_csv(name)
    


In [23]:
make_csv_from_xml('github_pubs.xml', 'github_pub.csv')

<Article PMID: 26357045>
Date                  2015-09-11
Journal                     None
First Author             unknown
Last Author                 male
Second Author             female
Penultimate Author          None
Other Authors                 []
Name: 26357045, dtype: object
<Article PMID: 25601296>
Date                  2015-01-20
Journal                     None
First Author                male
Last Author                 male
Second Author               None
Penultimate Author          None
Other Authors                 []
Name: 25601296, dtype: object
<Article PMID: 25558360>
Date                          2015-01-05
Journal                             None
First Author                        male
Last Author                         male
Second Author                     female
Penultimate Author               unknown
Other Authors         [unknown, unknown]
Name: 25558360, dtype: object
<Article PMID: 25553811>
Date                  2015-01-02
Journal                     

In [24]:
make_csv_from_xml('comp-bio-1997-2014.xml', 'compbio_pubs2.csv')

<Article PMID: 26255944>
Date                                             2015-08-10
Journal                                                None
First Author                                         female
Last Author                                         unknown
Second Author                                          male
Penultimate Author                                     male
Other Authors         [female, male, male, female, unknown]
Name: 26255944, dtype: object
<Article PMID: 26251854>
Date                  2015-08-06
Journal                     None
First Author             unknown
Last Author                 male
Second Author            unknown
Penultimate Author       unknown
Other Authors           [female]
Name: 26251854, dtype: object
<Article PMID: 26021057>


AttributeError: 'NoneType' object has no attribute 'first_name'

In [None]:
# make_csv_from_xml('biology-1997-2014.xml', 'all_bio_pubs.csv')