## Mapping Grants to Publications
Use the `Entrez` module of the Biopython package to fetch publication information from PubMed by grant ID.


### Define functions to fetch publication IDs (PMID) from grant IDs (GRID)
First, extract the list of grant IDs from the NIH grant CSV files, then search PubMed with each GRID to get the matching PMIDs. Finally, write a CSV file containing both the GRID and PMID columns.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from Bio import Entrez

def GRID_to_PMID(GRID):
    '''Search PubMed for one GRID; return dict(PMID: GRID).'''
    Entrez.email = 'yier.jin@gmail.com'  # NCBI asks for a contact email
    GRID = str(GRID)
    # Note: esearch returns at most 20 IDs unless retmax is raised.
    handle = Entrez.esearch(db="pubmed", term=GRID)
    record = Entrez.read(handle)
    handle.close()
    # Map every returned PMID to the grant that was searched.
    return {pmid: GRID for pmid in record["IdList"]}

def GRID_list_to_PMID(GRID_list):
    """Search PubMed for every GRID in a list; return dict(PMID: GRID)."""
    dic = {}
    for grid in GRID_list:
        # A PMID found under several grants keeps only the last GRID seen.
        dic.update(GRID_to_PMID(grid))
    return dic

def Extract_GRID_list(NIH_file):
    """Extract the GRID list from an NIH summary CSV file."""
    df = pd.read_csv(NIH_file, encoding="ISO-8859-1")
    # The grant ID is expected in the first column; skip missing values.
    return df.iloc[:, 0].dropna().tolist()

def NIH_GRID_to_PMID(GRID_list):
    """Look up PMIDs for the extracted GRID list with GRID_list_to_PMID
       and write the PMID/GRID pairs to a CSV file."""
    GR_Dic = GRID_list_to_PMID(GRID_list)
    # Build both columns at once; list() avoids assigning dict views.
    df = DataFrame({"PMID": list(GR_Dic.keys()),
                    "GRID": list(GR_Dic.values())})
    print(df.head())
    df.to_csv("GrantID.to_PMID.txt")
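
Because the functions above key the dictionary on PMID, a publication that acknowledges more than one grant ends up with a single GRID. The following is a minimal sketch, not part of the original notebook, of how every (PMID, GRID) pair could be kept instead. The `collect_pairs` helper and the `search` callable are hypothetical; `search` stands in for the PubMed lookup so that the example runs offline, and the IDs are made up.

```python
from typing import Callable, Iterable, List, Tuple


def collect_pairs(
    grid_list: Iterable[str],
    search: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return every (PMID, GRID) pair instead of one GRID per PMID."""
    pairs = []
    for grid in grid_list:
        grid = str(grid)
        for pmid in search(grid):
            pairs.append((pmid, grid))
    return pairs


# Hypothetical stand-in for the PubMed search; the IDs are invented.
fake_index = {"GRANT_A": ["111", "222"], "GRANT_B": ["222"]}
pairs = collect_pairs(["GRANT_A", "GRANT_B"], lambda g: fake_index.get(g, []))
print(pairs)

# Collapsing the pairs into a PMID-keyed dict drops one of the two grants.
collapsed = dict(pairs)
print(collapsed["222"])
```

A list of pairs can be written to CSV in the same two-column layout, so the downstream merge would see one row per grant and publication combination.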

Call the functions above to generate the complete GRID-to-PMID CSV table. To speed up the search,
the NIH grant CSV input files were split and searched in parallel (code not included here).
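
The parallel code is not part of this notebook, so the following is only one possible sketch of the idea, under stated assumptions: the grant list is cut into chunks and each chunk is handed to a worker thread. The names `chunk`, `search_in_parallel`, and `fake_lookup` are hypothetical. In practice `lookup` could be a function such as `GRID_list_to_PMID`, and the number of workers should stay small because NCBI limits request rates.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split a sequence into consecutive lists of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def search_in_parallel(
    grid_list: Sequence[str],
    lookup: Callable[[List[str]], Dict[str, str]],
    chunk_size: int = 100,
    workers: int = 3,
) -> Dict[str, str]:
    """Run `lookup` on each chunk in a thread pool and merge the results."""
    merged: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        for partial in pool.map(lookup, chunk(grid_list, chunk_size)):
            merged.update(partial)
    return merged


def fake_lookup(grids: List[str]) -> Dict[str, str]:
    """Offline stand-in for a PubMed lookup; returns invented PMIDs."""
    return {f"pmid-{g}": g for g in grids}


result = search_in_parallel(["G1", "G2", "G3", "G4", "G5"], fake_lookup,
                            chunk_size=2)
print(len(result))
```

Threads suit this task because the work is dominated by waiting on network responses rather than by computation.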

In [None]:
NIH_file = "FY_2010_split1.txt"
GRID_list = Extract_GRID_list(NIH_file)
NIH_GRID_to_PMID(GRID_list)

### Get more publication information
Define functions that look up publication metadata by PMID and return PubDate, PubTypeList and FullJournalName. Abstracts are not part of this summary record and are handled separately.

In [None]:
def PMID_to_summary(PMID):
    """Fetch the esummary record for one PMID;
       return {PMID: [PubDate, PubTypeList, FullJournalName]}."""
    handle = Entrez.esummary(db="pubmed", id=PMID)
    record = Entrez.read(handle)
    handle.close()
    # esummary returns a list with one entry per requested ID.
    summary = record[0]
    return {summary["Id"]: [summary["PubDate"],
                            summary["PubTypeList"],
                            summary["FullJournalName"]]}

def PMID_list_to_summary(PMID_list):
    """Fetch summaries for a list of PMIDs;
       return {PMID: [PubDate, PubTypeList, FullJournalName]}."""
    dic = {}
    for pmid in PMID_list:
        dic.update(PMID_to_summary(pmid))
    return dic
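
The dictionary returned above maps each PMID to a three-item list, which still has to be flattened before it can be saved as a table. The sketch below shows one way to do that with the standard library. The `summary_to_csv` helper is hypothetical, and joining the publication types with `"; "` is an assumption of this sketch rather than the format used in the original files. The example values are taken from the merged table shown in this notebook.

```python
import csv
import io
from typing import Dict, List


def summary_to_csv(summary: Dict[str, List]) -> str:
    """Flatten {PMID: [PubDate, PubTypeList, FullJournalName]} into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PMID", "PubDate", "PubTypeList", "FullJournalName"])
    for pmid, (pub_date, pub_types, journal) in summary.items():
        # PubTypeList is a list; join it so the cell holds plain text.
        writer.writerow([pmid, pub_date, "; ".join(pub_types), journal])
    return buffer.getvalue()


example = {
    "21690215": [
        "2011 Jul 1",
        ["Journal Article", "Review"],
        "Cold Spring Harbor perspectives in biology",
    ]
}
csv_text = summary_to_csv(example)
print(csv_text)
```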

### Merged table containing grant and publication information
Use PMID and GRID to generate a single table that combines grant and publication information.
Only selected columns are shown.

In [22]:
import pandas as pd
file = "FY2010_merge_all.csv"
FY2010 = pd.read_csv(file)
FY2010 = FY2010.drop("Unnamed: 0", axis=1)
FY2010.head(6)

| | ADMINISTERING_IC | APPLICATION_TYPE | BUDGET_START | BUDGET_END | CORE_PROJECT_NUM | FY | SUPPORT_YEAR | TOTAL_COST | PMID | PubDate | PubTypeList | FullJournalName | GRID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DK | 1 | 5/1/10 | 4/30/11 | F32DK085835 | 2010 | 1 | 52106.0 | 22335236 | 2012 Mar 2 | ['Journal Article'] | Journal of proteome research | F32DK085835 |
| 1 | DK | 1 | 5/1/10 | 4/30/11 | F32DK085835 | 2010 | 1 | 52106.0 | 21838295 | 2011 Oct 7 | ['Journal Article'] | Journal of proteome research | F32DK085835 |
| 2 | DK | 1 | 6/15/10 | 6/14/11 | F32DK085905 | 2010 | 1 | 50474.0 | 21690215 | 2011 Jul 1 | ['Journal Article', 'Review'] | Cold Spring Harbor perspectives in biology | F32DK085905 |
| 3 | DK | 1 | 3/2/10 | 3/1/11 | F32DK085935 | 2010 | 1 | 47606.0 | 23334396 | 2013 Feb | ['Journal Article'] | Journal of the American Society of Nephrology ... | F32DK085935 |
| 4 | DK | 1 | 3/2/10 | 3/1/11 | F32DK085935 | 2010 | 1 | 47606.0 | 22682975 | 2013 Aug | ['Journal Article'] | Nutrition, metabolism, and cardiovascular dise... | F32DK085935 |
| 5 | DK | 1 | 3/2/10 | 3/1/11 | F32DK085935 | 2010 | 1 | 47606.0 | 21693678 | 2011 Aug | ['Journal Article'] | Endocrinology | F32DK085935 |
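
The `PubDate` values in this table come at different granularities ("2012 Mar 2" versus "2013 Feb"), which makes them awkward to sort or compare with the budget dates. The sketch below shows one way to split them into year, month and day. The `parse_pub_date` helper is hypothetical and handles only the "year, abbreviated English month, optional day" pattern visible above; any other form yields `None` for the parts it cannot read.

```python
from typing import Optional, Tuple

# Month abbreviations as they appear in the PubDate column above.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_pub_date(text: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Split a PubDate string into (year, month, day); missing parts are None."""
    parts = text.split()
    year = int(parts[0]) if parts and parts[0].isdigit() else None
    month = MONTHS.get(parts[1][:3]) if len(parts) > 1 else None
    day = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else None
    return year, month, day


print(parse_pub_date("2012 Mar 2"))
print(parse_pub_date("2013 Feb"))
```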


### Generate a table containing the grant abstract and the publication abstract
To compare the abstracts of grants with those of their related publications, publication abstracts are fetched from PubMed
and tables containing both are generated. NIH provides a table with the application ID and the grant abstract, so
we need the application ID (AppID) to match each publication abstract to its grant abstract.
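
The matching step described above can be sketched as a two-stage lookup: GRID to AppID, then AppID to grant abstract. This is an illustration under assumptions, not the code used to build `Abstract.csv`. The `match_abstracts` function and the two lookup dictionaries are hypothetical, and the sample values are copied from the table shown in this section, with the abstracts left truncated as they appear there.

```python
from typing import Dict, List, Tuple


def match_abstracts(
    publications: List[Tuple[str, str, str]],
    grid_to_appids: Dict[str, List[int]],
    appid_to_abstract: Dict[int, str],
) -> List[dict]:
    """Pair each (PMID, GRID, PMAbs) with every grant abstract of that GRID."""
    rows = []
    for pmid, grid, pm_abs in publications:
        # One grant (GRID) can have several applications (AppID).
        for app_id in grid_to_appids.get(grid, []):
            gr_abs = appid_to_abstract.get(app_id)
            if gr_abs is None:
                continue  # no abstract on file for this application
            rows.append({"PMAbs": pm_abs, "PMID": pmid, "GRID": grid,
                         "AppID": app_id, "GRAbs": gr_abs})
    return rows


publications = [("24415937", "R01AI076121",
                 "The host protein CPSF6 possesses a domain that...")]
grid_to_appids = {"R01AI076121": [9017905, 9252791]}
appid_to_abstract = {
    9017905: "DESCRIPTION (provided by applicant): HIV-1 inf...",
    9252791: "DESCRIPTION (provided by applicant): HIV-1 inf...",
}
rows = match_abstracts(publications, grid_to_appids, appid_to_abstract)
print([row["AppID"] for row in rows])
```

Because one GRID can map to several AppIDs, the same publication appears once per application. The same join could also be expressed with `pandas.merge`.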

In [23]:
file = "Abstract.csv"
FY2010 = pd.read_csv(file)
FY2010 = FY2010.drop("Unnamed: 0", axis=1)
FY2010.head(6)

| | PMAbs | PMID | GRID | AppID | GRAbs |
|---|---|---|---|---|---|
| 0 | Exercise training is the cornerstone in the pr... | 26341655 | R01HL113738 | 8997115 | DESCRIPTION (provided by applicant): Hypertens... |
| 1 | To determine usefulness and versatility of hyd... | 26049382 | R01HL113738 | 8997115 | DESCRIPTION (provided by applicant): Hypertens... |
| 2 | The host protein CPSF6 possesses a domain that... | 24415937 | R01AI076121 | 9017905 | DESCRIPTION (provided by applicant): HIV-1 inf... |
| 3 | The host protein CPSF6 possesses a domain that... | 24415937 | R01AI076121 | 9252791 | DESCRIPTION (provided by applicant): HIV-1 inf... |
| 4 | HIV-1 replication can be inhibited by type I i... | 24121441 | R01AI078788 | 9087138 | DESCRIPTION (provided by applicant): HIV-1, th... |
| 5 | HIV-1 replication can be inhibited by type I i... | 24121441 | R01AI078788 | 9383261 | DESCRIPTION (provided by applicant): HIV-1, th... |


Here is an example of the full abstract of grant R01HL113738 and its publication 26341655:

In [25]:
FY2010["PMAbs"][0]

'Exercise training is the cornerstone in the prevention and management of hypertension and atherosclerotic cardiovascular disease. However, blood pressure (BP) response to exercise is exaggerated in hypertension often to the range that raises the safety concern, which may prohibit patients from regular exercise. This augmented pressor response is shown to be related to excessive sympathetic stimulation caused by overactive muscle reflex. Exaggerated sympathetic-mediated vasoconstriction further contributes to the rise in BP during exercise in hypertension. Exercise training has been shown to reduce both exercise pressor reflex and attenuate the abnormal vasoconstriction. Hypertension also contributes to cognitive impairment, and exercise training has been shown to improve cognitive function through both BP-dependent and BP-independent pathways. Additional studies are still needed to determine if newer modes of exercise training such as high-intensity interval training may offer advanta

In [26]:
FY2010["GRAbs"][0]

'DESCRIPTION (provided by applicant): Hypertensive patients are known to display exaggerated rise in blood pressure (BP) during exercise but the underlying mechanisms are poorly understood. Normally, exercise is accompanied by decreased parasympathetic activity and increased sympathetic activity caused by central command and activation of thin fiber muscle afferents that reflexively increase sympathetic outflow and BP. Traditionally, muscle afferents were dichotomized as metaboreceptors, which are activated slowly and only during intense or ischemic muscle contraction, or mechanoreceptors, which respond quickly to even mild deformation of their receptive fields. The increase in SNA and BP caused by activation of these receptors, known as exercise pressor reflex, is normally buffered by activation arterial baroreceptors, which are reset to operate at higher BP range but at the same level of sensitivity. Our recent work in spontaneously hypertensive rats (SHR) and patients with essential