# Interview Test
Notebook by Jim Arnold

## Introduction

Data Scientist: Python Test

A common task in data science is to pull data via a RESTful API and parse the output accordingly.

For this test, we ask you to write a Python script to access and process JSON content from MyGene.info for a given list of genes. Following this, you will be asked to pull in associated publication information via the Entrez E-utilities API.

Before starting the test, familiarise yourself with the MyGene.info REST API here: 
http://mygene.info/v3/api

You may also wish to refer to the documentation here: http://docs.mygene.info/en/latest/doc/query_service.html#query-syntax

and here:

https://dataguide.nlm.nih.gov/eutilities/utilities.html

Please send your solution as a zip file, including a readme with details of any dependencies.

# Task 1: 
1.1) From the MyGeneInfo API, use the "Gene query service" GET method to return details on the following gene symbols, filtered for species "human": CDK2, FGFR1, SLC6A4

1.2) From the returned JSON, parse out the "name", "symbol" and "entrezgene" values and print to screen
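
As a rough illustration of step 1.2, the sketch below pulls the three requested fields out of a response shaped like the query service output (a top-level `hits` list of dicts). `parse_hits` is a hypothetical helper name, and `sample_response` is a hand-written stand-in built from values shown later in this notebook, not live API output.

```python
def parse_hits(response, fields=("symbol", "name", "entrezgene")):
    """Return one dict per hit, keeping only the requested fields.

    A hit that lacks a field gets None for it instead of raising KeyError.
    """
    return [{f: hit.get(f) for f in fields} for hit in response.get("hits", [])]


# Hand-written stand-in for a query-service response (illustrative only).
sample_response = {
    "hits": [
        {"_id": "1017", "symbol": "CDK2", "name": "cyclin dependent kinase 2",
         "entrezgene": "1017", "taxid": 9606},
    ]
}

for row in parse_hits(sample_response):
    print(row["symbol"], row["name"], row["entrezgene"], sep=" | ")
```

Using `dict.get` keeps the parser tolerant of hits that lack an `entrezgene` value, which would otherwise stop the loop with a `KeyError`.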

## Prepare Data

In [1]:
import numpy as np
import pandas as pd
import requests
import json
from Bio import Entrez

# This notebook uses requests for the REST calls. The optional mygene
# Python client from mygene.info is a good alternative (see the last section):
# import mygene

# DataFrame.to_excel needs openpyxl to write .xlsx files - list it as a dependency

In [178]:
def queryGenes(geneList):
    """Takes list of gene symbols and returns list of dicts of 1st query result for each gene from mygene.info.
    Parses the returned JSON for the first returned hit.
    Prints symbol, name, and entrezgene values for first hit. Appends to results list.
    Genes with no hits are reported and skipped.
    Returns list of dicts with keys as default mygene fields (symbol,name,taxid,entrezgene,ensemblgene)"""
    ls = []
    print('Querying...')
    for g in geneList:
        # let requests build and URL-encode the query string
        resp = requests.get('http://mygene.info/v3/query',
                            params={'q': g, 'species': 'human'})
        resp.raise_for_status()  # fail loudly on HTTP errors
        hits = resp.json().get('hits', [])
        if not hits:
            print("")
            print("No hits for: ", g)
            continue
        res = hits[0]  # keep only the first hit
        print("")
        print("Symbol: ", res['symbol'])
        print("Gene Name: ", res['name'])
        print("Entrez GeneID: ", res['entrezgene'])
        ls.append(res)
    print('Done.')
    return ls

# Task 2
2.1) Using the appropriate identifier from the above result, send a query to the MyGeneInfo "Gene annotation services" method for each gene

2.2) From the resulting JSON, collate up to 5 generif descriptions per gene

2.3) Write the results to an Excel spreadsheet with columns: gene_symbol, gene_name, entrez_id, generifs
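
One possible reading of "collate" in step 2.2 is one output row per generif, capped at five per gene. The sketch below shows that reshaping in plain Python. `collate_generifs` is a hypothetical helper, the sample annotation uses placeholder PMIDs and texts, and storing only the generif text (rather than the whole dict) in the `generifs` column is an assumption.

```python
def collate_generifs(annotations, limit=5):
    """Flatten gene annotations into one row per generif, at most `limit` per gene."""
    rows = []
    for gene in annotations:
        for rif in gene.get("generif", [])[:limit]:
            rows.append({
                "gene_symbol": gene["symbol"],
                "gene_name": gene["name"],
                "entrez_id": gene["entrezgene"],
                "generifs": rif["text"],
                "pmid": rif["pubmed"],
            })
    return rows


# Placeholder annotation with seven generifs (PMIDs and texts are made up).
sample = [{
    "symbol": "CDK2",
    "name": "cyclin dependent kinase 2",
    "entrezgene": "1017",
    "generif": [{"pubmed": n, "text": "placeholder text %d" % n} for n in range(1, 8)],
}]

rows = collate_generifs(sample)
print(len(rows), [r["pmid"] for r in rows])
```

A list of flat dicts like this can be passed straight to `pd.DataFrame(rows)` before writing to Excel.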

In [179]:
def getAnno(queryResult):
    """Takes output of queryGenes().
    Iterates over the queryGenes list, parses the Entrez ID and passes it to the mygene.info annotation service.
    Parses the returned JSON for generif, taking up to the first 5 entries.
    Results are stored in pubdic, key=symbol, value=results as pd.Series.
    Returns concatenated results."""
    pubdic = {}
    print('')
    print('Fetching annotations...')
    for r in queryResult:
        q = 'http://mygene.info/v3/gene/%s' % r['entrezgene']
        resp = requests.get(q)
        resp.raise_for_status()  # fail loudly on HTTP errors
        # a gene without generif entries yields an empty Series instead of a KeyError
        pubdic[r['symbol']] = pd.Series(
            resp.json().get('generif', [])[:5], dtype=object)
    print('Done.')
    # collate generifs by symbol with pd, rename cols, drop extra cols
    return pd.concat(pubdic).reset_index().rename(columns={'level_0': 'symbol', 0: 'generifs'}).drop('level_1', axis=1)

In [180]:
def mergeWrite(queryResult, pubc):
    """Takes output of queryGenes() and getAnno(), returns collated DataFrame and writes to Excel.
    The Excel file is named after the gene symbols in queryResult, joined by '_' (e.g. CDK2_FGFR1_SLC6A4.xlsx).
    Returns clean DataFrame."""
    print('')
    print('Collating results...')
    # merge with resList, drop extra cols
    pbmds = pd.merge(pd.DataFrame(queryResult).drop(
        ['_id', '_score', 'taxid'], axis=1), pubc, on='symbol', how='outer')

    # relabel and order to spec
    pbmds.rename({'symbol': 'gene_symbol', 'name': 'gene_name',
                  'entrezgene': 'entrez_id'}, axis=1, inplace=True)
    cols = ['gene_symbol', 'gene_name', 'entrez_id', 'generifs']
    pbmds = pbmds[cols]

    # write to Excel
    print('Writing to Excel...')
    pbmds.to_excel("%s.xlsx" % '_'.join(
        r['symbol'] for r in queryResult), index=False)
    print('Done.')
    return pbmds

# Task 3:
Use the Pubmed IDs associated with the above generif content to extract additional bibliographic information.

e.g.
https://dataguide.nlm.nih.gov/eutilities/utilities.html#esummary

Hint:     from Bio import Entrez
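
ESummary accepts several UIDs in one request as a comma-separated `id` string, so the PMIDs can be grouped rather than fetched one at a time. The sketch below is one way to do that. `batch_ids` and `fetch_summaries` are hypothetical helper names, and the batch size of 200 is an assumption, not a documented requirement of this test. Only `batch_ids` runs here; `fetch_summaries` needs Biopython and network access.

```python
def batch_ids(pmids, size=200):
    """Yield comma-separated ID strings holding at most `size` PMIDs each."""
    unique = list(dict.fromkeys(str(p) for p in pmids))  # de-duplicate, keep order
    for start in range(0, len(unique), size):
        yield ",".join(unique[start:start + size])


def fetch_summaries(pmids, email, size=200):
    """Fetch esummary records in batches (requires Biopython and network access)."""
    from Bio import Entrez  # imported lazily so batch_ids works without Biopython
    Entrez.email = email
    records = []
    for id_string in batch_ids(pmids, size):
        handle = Entrez.esummary(db="pubmed", id=id_string)
        records.extend(Entrez.read(handle))  # one summary record per PMID
        handle.close()
    return records


# Offline demonstration of the batching step only.
print(list(batch_ids([11907280, 12049628, 11907280, 12081504], size=2)))
```

Fewer, larger requests also make it easier to stay within NCBI's request-rate limits.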

In [None]:
def addBibs(df, email='jimmyjamesarnold@gmail.com'):
    """Takes output from mergeWrite and adds cols for corresponding pubmed features.
    Parses Entrez esummary pubmed results for desired bibliographic features.
    Iterates for each pmid in input's generifs col.
    Casts results to new df and merges with input df.
    Returns df with bib features for each pmid."""

    # NCBI asks for a contact address; pass your own via the email argument
    Entrez.email = email
    bib_feats = ['Id', 'PubDate', 'Source', 'Title', 'LastAuthor',
                 'DOI', 'PmcRefCount']  # should be made arg later
    df['pmid'] = [i['pubmed']
                  for i in df.generifs]  # extracts pmid data to new col
    print('')
    print('Extracting PubMed data for...')

    ls = []  # constructs list of biblio data for each generif
    for pb in df['pmid']:
        record = Entrez.read(Entrez.esummary(db="pubmed", id=str(pb)))
        # use dict compr to extract bib_feats per record, convert to series and append to ls
        ls.append(pd.Series({i: record[0][i]
                             for i in bib_feats if i in record[0]}))

    # cast dtypes for merging and time series; PubDate strings such as
    # '2002 Mar' are parsed one by one, and unparseable values become NaT
    bibs = pd.DataFrame(ls).astype({'Id': 'int64'})
    bibs['PubDate'] = bibs['PubDate'].apply(
        lambda d: pd.to_datetime(d, errors='coerce'))
    print('Done.')
    return pd.merge(df, bibs, left_on='pmid', right_on='Id').drop('Id', axis=1)
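
The esummary `PubDate` field is free text ('2002 Mar' and similar), so the date cast deserves some care. The sketch below is a standard-library alternative that tries a few explicit formats in turn. `parse_pubdate` is a hypothetical helper, the list of formats is an assumption about what PubMed returns rather than an exhaustive specification, and `%b` relies on an English locale.

```python
from datetime import datetime


def parse_pubdate(text):
    """Parse a PubDate string; return a date, or None if no format matches."""
    for fmt in ("%Y %b %d", "%Y %b", "%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next, less specific format
    return None


for raw in ("2002 Mar", "2002 Jun 15", "2002", "2002 Mar-Apr"):
    print(raw, "->", parse_pubdate(raw))
```

Missing day or month components default to 1, and strings that match none of the formats (such as a month range) come back as `None` so they can be reviewed by hand.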

In [None]:
def InterviewTest(geneList):
    """Given list of gene symbols, performs all tasks specified."""
    qR = queryGenes(geneList)  # Task 1
    df = mergeWrite(qR, getAnno(qR))  # Task 2
    dfb = addBibs(df)  # Task 3
    print('Tasks Completed.')
    return dfb

In [159]:
'_'.join([resList[i]['symbol'] for i in range(len(resList))])

'CDK2_FGFR1_SLC6A4'

In [161]:
pbmds = mergeWrite(resList, getAnno(resList))

Fetching annotations...
Done.
Collating results...
Writing to Excel...
Done.


In [176]:
InterviewTest(gene_list)

Querying...

Symbol:  CDK2
Gene Name:  cyclin dependent kinase 2
Entrez GeneID:  1017


Symbol:  FGFR1
Gene Name:  fibroblast growth factor receptor 1
Entrez GeneID:  2260


Symbol:  SLC6A4
Gene Name:  solute carrier family 6 member 4
Entrez GeneID:  6532

Done.
Fetching annotations...
Done.

Collating results...
Writing to Excel...
Done.

Extracting PubMed data for...
Done.


| | gene_symbol | gene_name | entrez_id | generifs | pmid | PubDate | Source | Title | LastAuthor | DOI | PmcRefCount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 11907280, 'text': 'Cyclin A/Cdk2 an... | 11907280 | 2002-03-01 | Mol Biol Cell | Cyclin A- and cyclin E-Cdk complexes shuttle b... | Pines J | 10.1091/mbc.01-07-0361 | 52 |
| 1 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12049628, 'text': 'results argue th... | 12049628 | 2002-06-15 | Biochem J | HIV-1 Tat-associated RNA polymerase C-terminal... | Kumar A | 10.1042/BJ20011191 | 24 |
| 2 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12081504, 'text': 'Activation mecha... | 12081504 | 2002-07-02 | Biochemistry | Activation mechanism of CDK2: role of cyclin b... | Lew J | | 12 |
| 3 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12114499, 'text': 'CDK2/cyclin E is... | 12114499 | 2002-09-13 | J Biol Chem | HIV-1 Tat interaction with RNA polymerase II C... | Nekhai S | 10.1074/jbc.M111349200 | 28 |
| 4 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12149264, 'text': 'CDK2 binding to ... | 12149264 | 2002-10-18 | J Biol Chem | The oncogenic activity of cyclin E is not conf... | Moroy T | 10.1074/jbc.M205919200 | 21 |
| 5 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11693202, 'text': 'vitronectin incr... | 11693202 | 2001-08-01 | Mol Cell Biochem | Integrin activation is required for VEGF and F... | Isik FF | | 15 |
| 6 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11746971, 'text': 'In the fusion of... | 11746971 | 2001-12-01 | Genes Chromosomes Cancer | Fusion of the BCR and the fibroblast growth fa... | Johansson B | | 12 |
| 7 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11759058, 'text': 'distribution in ... | 11759058 | 2001-12-01 | Appl Immunohistochem Mol Morphol | Immunohistochemical detection of fibroblast gr... | Sessa F | | 1 |
| 8 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11919391, 'text': 'REVIEW; The 8p11... | 11919391 | 2002-01-01 | Acta Haematol | The 8p11 myeloproliferative syndrome: a distin... | Cross NC | 10.1159/000046639 | 16 |
| 9 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 12031912, 'text': 'overexpressed in... | 12031912 | 2002-06-01 | Haematologica | Overexpression of translocation-associated fus... | Knuutila S | | 19 |


In [112]:
pubdic = {}
print('Fetching annotations...')
for r in resList:
    q = 'http://mygene.info/v3/gene/%s' % r['entrezgene']
    pubdic[r['symbol']] = pd.Series(requests.get(q).json()['generif'][:5])
    #print("")
    #print('Found %d publications for %s.' % (len(res), r['symbol'])) 
    #print("")

# collate generifs by symbol
print('Collating results...')
pubc = pd.concat(pubdic).reset_index().rename(columns={'level_0':'symbol',0:'generifs'}).drop('level_1',axis=1)
# merge with resList, drop cols
pbmds = pd.merge(pd.DataFrame(resList).drop(['_id','_score','taxid'],axis=1), pubc, on='symbol',how='outer')
# relabel and order to spec
pbmds.rename({'symbol':'gene_symbol','name':'gene_name','entrezgene':'entrez_id'},axis=1,inplace=True)
cols = ['gene_symbol', 'gene_name', 'entrez_id', 'generifs']
pbmds = pbmds[cols]

# write to Excel
print('Writing to Excel...')
pbmds.to_excel("output.xlsx",index=False)
print('Finished.')

Fetching annotations...
Collating results...
Writing to Excel...
Finished.


## Using mygene python client

In [2]:
# Task 1 - given list of genes, return info
import mygene  # optional Python client from mygene.info

raw = input("Enter gene(s), separating with commas: ")
gene_list = [x.strip() for x in raw.split(',')]
mg = mygene.MyGeneInfo()
res = mg.querymany(gene_list, scopes='symbol', fields='entrezgene,name,generif', species='human')
print("")
print("Genes Found:")
for hit in res:
    print("")
    print("Symbol: ", hit['query'])
    print("Gene Name: ", hit['name'])
    print("Entrez GeneID: ", hit['entrezgene'])
    print("")

Enter gene(s), separating with commas: CDK2, FGFR1, SLC6A4
querying 1-3...done.
Finished.

Genes Found:

Symbol:  CDK2
Gene Name:  cyclin dependent kinase 2
Entrez GeneID:  1017


Symbol:  FGFR1
Gene Name:  fibroblast growth factor receptor 1
Entrez GeneID:  2260


Symbol:  SLC6A4
Gene Name:  solute carrier family 6 member 4
Entrez GeneID:  6532



CDK2, FGFR1, SLC6A4

Task 2 done
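
When a symbol cannot be matched, `querymany` returns an entry carrying `'notfound': True` instead of the requested fields, so indexing `['name']` on it would fail. The sketch below separates matched from unmatched entries before printing. `split_found` is a hypothetical helper, and `sample_results` is a hand-written stand-in for `querymany` output (`NOTAGENE` is a made-up symbol).

```python
def split_found(results):
    """Split querymany-style results into matched hits and unmatched query terms."""
    found = [r for r in results if not r.get("notfound")]
    missing = [r["query"] for r in results if r.get("notfound")]
    return found, missing


# Hand-written stand-in for querymany output (illustrative only).
sample_results = [
    {"query": "CDK2", "name": "cyclin dependent kinase 2", "entrezgene": "1017"},
    {"query": "NOTAGENE", "notfound": True},
]

found, missing = split_found(sample_results)
for hit in found:
    print("Symbol: ", hit["query"])
    print("Gene Name: ", hit["name"])
    print("Entrez GeneID: ", hit["entrezgene"])
print("Not found: ", ", ".join(missing))
```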

In [173]:
addBibs(pbmds)


Extracting PubMed data for...
Done.




| | gene_symbol | gene_name | entrez_id | generifs | pmid |
|---|---|---|---|---|---|
| 0 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 11907280, 'text': 'Cyclin A/Cdk2 an... | 11907280 |
| 1 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12049628, 'text': 'results argue th... | 12049628 |
| 2 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12081504, 'text': 'Activation mecha... | 12081504 |
| 3 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12114499, 'text': 'CDK2/cyclin E is... | 12114499 |
| 4 | CDK2 | cyclin dependent kinase 2 | 1017 | {'pubmed': 12149264, 'text': 'CDK2 binding to ... | 12149264 |
| 5 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11693202, 'text': 'vitronectin incr... | 11693202 |
| 6 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11746971, 'text': 'In the fusion of... | 11746971 |
| 7 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11759058, 'text': 'distribution in ... | 11759058 |
| 8 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 11919391, 'text': 'REVIEW; The 8p11... | 11919391 |
| 9 | FGFR1 | fibroblast growth factor receptor 1 | 2260 | {'pubmed': 12031912, 'text': 'overexpressed in... | 12031912 |


In [77]:
# constructs list of biblio data for each generif
ls = []
print('Extracting PubMed data...')
for pb in [i['pubmed'] for i in pbmds.generifs]:
    record = Entrez.read(Entrez.esummary(db="pubmed", id=pb))
    ls.append(pd.Series({i:record[0][i] for i in bib_feats if i in record[0]}))

# merge with existing data for future analysis
# fix some dtype issues for merging and timeseries work
df = pd.merge(pbmds, pd.DataFrame(ls).astype({'Id':'int64', 'PubDate':'datetime64'}), left_on='pmid', right_on='Id').drop('Id', axis=1)

In [166]:
ls = [] # constructs list of biblio data for each generif
for pb in [i['pubmed'] for i in df.generifs]:
    print(pb)
    record = Entrez.read(Entrez.esummary(db="pubmed", id=pb)) # should be made arg later
    # use dict compr to extract bib_feats per record, convert to series and append to ls
    ls.append(pd.Series({i:record[0][i] for i in bib_feats if i in record[0]}))
pd.DataFrame(ls)

NameError: name 'df' is not defined

In [152]:
addBibs(pbmds)

Extracting PubMed data for...
11907280
12049628
12081504
12114499
12149264
11693202
11746971
11759058
11919391
12031912
10666888
11027924
11044587
11113619
11121166
Done.




In [115]:
email = 'jimmyjamesarnold@gmail.com'
def e_check(mail=email):
    """Returns mail if set, otherwise prompts the user for an address."""
    if mail is not None:
        print('yes')
        return mail
    return input('Please provide email: ')

In [116]:
email = e_check(email)

yes


In [46]:
bib_feats = ['PubDate', 'Source', 'Title', 'LastAuthor', 'DOI', 'PmcRefCount']
Entrez.email = 'jimmyjamesarnold@gmail.com'
record = Entrez.read(Entrez.esummary(db="pubmed", id="11907280"))


['2002 Mar',
 'Mol Biol Cell',
 ['Jackman M', 'Kubota Y', 'den Elzen N', 'Hagting A', 'Pines J'],
 'Cyclin A- and cyclin E-Cdk complexes shuttle between the nucleus and the cytoplasm.',
 'Pines J',
 '10.1091/mbc.01-07-0361',
 IntegerElement(52, attributes={})]

In [67]:
bib_feats = ['Id', 'PubDate', 'Source', 'Title', 'LastAuthor', 'DOI', 'PmcRefCount']
bib = {}
for i in bib_feats:
    bib[i] = record[0][i]
    pd.Series(bib)

TypeError: Index(...) must be called with a collection of some kind, 'Id' was passed

In [65]:
record

[DictElement({'Item': [], 'Id': '11907280', 'PubDate': '2002 Mar', 'EPubDate': '', 'Source': 'Mol Biol Cell', 'AuthorList': ['Jackman M', 'Kubota Y', 'den Elzen N', 'Hagting A', 'Pines J'], 'LastAuthor': 'Pines J', 'Title': 'Cyclin A- and cyclin E-Cdk complexes shuttle between the nucleus and the cytoplasm.', 'Volume': '13', 'Issue': '3', 'Pages': '1030-45', 'LangList': ['English'], 'NlmUniqueID': '9201390', 'ISSN': '1059-1524', 'ESSN': '1939-4586', 'PubTypeList': ['Journal Article'], 'RecordStatus': 'PubMed - indexed for MEDLINE', 'PubStatus': 'ppublish', 'ArticleIds': DictElement({'pubmed': ['11907280'], 'medline': [], 'doi': '10.1091/mbc.01-07-0361', 'pmc': 'PMC99617', 'rid': '11907280', 'eid': '11907280', 'pmcid': 'pmc-id: PMC99617;'}, attributes={}), 'DOI': '10.1091/mbc.01-07-0361', 'History': DictElement({'pubmed': ['2002/03/22 10:00'], 'medline': ['2002/12/31 04:00'], 'entrez': '2002/03/22 10:00'}, attributes={}), 'References': [], 'HasAbstract': IntegerElement(1, attributes={})

In [None]:
# This extracts data from generifs, currently for 5
rec = {}
for i in range(len(gene_list)):
    rec[gene_list[i]] = pd.DataFrame(res[i]['generif'][:5])
    
# this concatenates results to df
t = pd.concat(rec).reset_index().rename(columns={'level_0':'query'}).drop('level_1',axis=1)
t

In [None]:
pd.merge(df_res.drop(['_id','_score','generif'],axis=1), t, on='query',how='outer')

In [27]:
pd.Series(pubrec)

CDK2      [11907280, 12049628, 12081504, 12114499, 12149...
FGFR1     [11693202, 11746971, 11759058, 11919391, 12031...
SLC6A4    [10666888, 11027924, 11044587, 11113619, 11121...
dtype: object

In [None]:
# need to check what they mean by 'collate'

# found this on SO, modified to take 5
# https://stackoverflow.com/questions/42012152/unstack-a-pandas-column-containing-lists-into-multiple-rows

df_res = pd.DataFrame(res, index=gene_list)

# join pubrec keys on index
df_res['pmid'] = pd.Series(pubrec)

lst_col = 'generif'
# collate

pd.DataFrame({col:np.repeat(df_res[col].values, 5) for col in df_res.columns.difference([lst_col])
    }).assign(**{lst_col:np.concatenate(df_res[lst_col].values)})[df_res.columns.tolist()]
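
The `np.repeat(..., 5)` approach assumes every gene has exactly five generifs. As a hedged alternative, pandas 0.25 and later provide `DataFrame.explode`, which handles lists of unequal length. The sketch below uses a small made-up frame with placeholder PMIDs and texts in place of `df_res`.

```python
import pandas as pd

# Made-up stand-in for df_res: one row per gene, generif holds a list of dicts.
df_demo = pd.DataFrame({
    "symbol": ["CDK2", "FGFR1"],
    "generif": [
        [{"pubmed": 1, "text": "a"}, {"pubmed": 2, "text": "b"}],
        [{"pubmed": 3, "text": "c"}],
    ],
})

# Cap each list at five entries, then emit one row per list element.
capped = df_demo.assign(generif=df_demo["generif"].apply(lambda rifs: rifs[:5]))
long_df = capped.explode("generif").reset_index(drop=True)

# Pull the PMID out of each generif dict into its own column.
long_df["pmid"] = long_df["generif"].apply(lambda rif: rif["pubmed"])
print(long_df[["symbol", "pmid"]])
```

The other columns are repeated automatically for each exploded row, which removes the need for `np.repeat` and `np.concatenate`.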

In [None]:
def pubrec(n=5):
    """Returns {gene: first n generif entries} taken from df_res."""
    rec = {}
    for i in range(len(gene_list)):
        # keep the first n generif entries for this gene
        rec[gene_list[i]] = df_res.iloc[i][lst_col][:n]
    return rec

In [None]:
r = pubrec()

In [None]:
r

In [None]:
res[0]

In [None]:
df_res = pd.DataFrame(res, index=gene_list)
df_res

In [None]:
for i in df_res.index:
    print(str(i))

In [None]:
df_res

In [None]:
def get_pmids(n=5):
    pubrec = {}
    for i in range(len(gene_list)):
        # index res by i so each gene gets its own PMIDs
        pubrec[gene_list[i]] = [g['pubmed'] for g in res[i]['generif'][:n]]
    return pubrec

In [None]:
def get_pmids(x,n=5):
    # x is one gene's generif list; take the PMID of each of its first n entries
    return [g['pubmed'] for g in x[:n]]

In [None]:
# pulls from res directly
pubrec = {}
for i in range(len(gene_list)):
    # PMIDs of the first 5 generifs for gene i
    pubrec[gene_list[i]] = [g['pubmed'] for g in df_res.iloc[i]['generif'][:5]]

In [None]:
pubrec

In [None]:
def get_pmids(series,n=5):
    # series holds one generif list per gene; returns a Series of PMID lists
    return series.apply(lambda x: [g['pubmed'] for g in x[:n]])

In [None]:
# join pubrec keys on index
df_res['pmid'] = pd.Series(pubrec)
df_res

In [None]:
# try lambda func to iterate over list of JSON records,
# maybe:
# lambda x: [x[j]['pubmed'] for j in range(n)]

In [None]:
def get_pmids(series, n):
    # one list of up to n PMIDs per generif list in series
    return [[g['pubmed'] for g in generifs[:n]] for generifs in series]

In [None]:
pmids = [res[0]['generif'][i]['pubmed'] for i in range(5)]

In [None]:
pmids

In [None]:
pubrec = {}
for i in range(len(gene_list)):
    # use res[i] for gene i and a separate index j for its generifs
    pubrec[gene_list[i]] = [res[i]['generif'][j]['pubmed'] for j in range(5)]

In [None]:
pd.DataFrame(pubrec)

In [None]:
df