## Acquiring a Dataset of Fight for Sight Publications

Example of a research publication's page on the Europe PMC (EPMC) website:
* http://europepmc.org/abstract/MED/24439297

Example response from the EPMC API for the same publication:
* https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=24439297&resultType=core

### Import Libraries

In [1]:
# for file directory operations
from glob import glob
import os

# for sleep command
import time

# to print some things nicely
import textwrap

# to deal with api requests
import requests
import json

# to save data
import pandas as pd

### Building the API Query - Search Criteria

Search for "fight for sight" or its previous names in the metadata fields for Grant Agency and Acknowledgements.

In [2]:
# API metadata fields to search in
fields = ['GRANT_AGENCY','ACK_FUND']

# terms to search for in those fields
terms = ['fight for sight',
         'iris fund for prevention of blindness',
         'british eye research foundation',
         'prevention of blindness research fund']

# combine the fields and terms into string with correct
# format for API
query_str = ''

for term in terms:
    for field in fields:
        query_str = query_str + field + ':"' + term + '" OR '

query_str = query_str[:-4]

print('Search Criteria:')
print('-'*60)
print(query_str.replace('OR ','OR\n'))

Search Criteria:
------------------------------------------------------------
GRANT_AGENCY:"fight for sight" OR
ACK_FUND:"fight for sight" OR
GRANT_AGENCY:"iris fund for prevention of blindness" OR
ACK_FUND:"iris fund for prevention of blindness" OR
GRANT_AGENCY:"british eye research foundation" OR
ACK_FUND:"british eye research foundation" OR
GRANT_AGENCY:"prevention of blindness research fund" OR
ACK_FUND:"prevention of blindness research fund"
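
The nested loop above can also be written with `str.join`, which avoids having to trim the trailing ` OR ` afterwards. This is only a sketch of an alternative, using the same `fields` and `terms` as the cell above; `clauses` is a name introduced here for illustration.

```python
from itertools import product

# same fields and terms as in the cell above
fields = ['GRANT_AGENCY', 'ACK_FUND']
terms = ['fight for sight',
         'iris fund for prevention of blindness',
         'british eye research foundation',
         'prevention of blindness research fund']

# one FIELD:"term" clause per (term, field) pair, terms in the outer loop
clauses = [f'{field}:"{term}"' for term, field in product(terms, fields)]

# join inserts the separator only between clauses, so nothing needs trimming
query_str = ' OR '.join(clauses)

print(query_str.replace(' OR ', ' OR\n'))
```

Keeping the clauses in a list also makes it easy to check how many field/term combinations are being searched.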


### Building the API Query - Response Options

resultType=core: Return the full metadata for each paper, including the abstract.

pageSize=1000: Return 1000 results per query (the maximum allowed by the API).

format=json: Return the response in JSON format.

In [3]:
options = {'resultType':'core','pageSize':'1000',
           'format':'json'}

options_str = ''

for key,value in options.items():
    options_str = options_str + '&'+key+'='+value

print('Search Options:')
print('-'*60)
print(options_str.replace('&','\n&'))

Search Options:
------------------------------------------------------------

&resultType=core
&pageSize=1000
&format=json


### Building the API Query - Full URL

In [4]:
base_url = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query='
base_url = base_url + query_str + options_str

print('Query URL:')
print('-'*60)
print(textwrap.fill(base_url,width=100))

Query URL:
------------------------------------------------------------
https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=GRANT_AGENCY:"fight for sight" OR
ACK_FUND:"fight for sight" OR GRANT_AGENCY:"iris fund for prevention of blindness" OR ACK_FUND:"iris
fund for prevention of blindness" OR GRANT_AGENCY:"british eye research foundation" OR
ACK_FUND:"british eye research foundation" OR GRANT_AGENCY:"prevention of blindness research fund"
OR ACK_FUND:"prevention of blindness research fund"&resultType=core&pageSize=1000&format=json
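
The URL printed above contains raw spaces and double quotes. `requests` normally percent-encodes these when the request is sent, but the encoding can also be made explicit with the standard library. The sketch below is an assumption about how that might look, not part of the original notebook: `build_search_url` and `SEARCH_URL` are hypothetical names, and the shortened query is for illustration only.

```python
from urllib.parse import quote, urlencode

# endpoint taken from base_url above, without the '?query=' suffix
SEARCH_URL = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search'


def build_search_url(query, options, cursor_mark='*'):
    """Return a percent-encoded search URL (hypothetical helper)."""
    # dicts keep insertion order, so the parameters appear as listed here
    params = {'query': query, **options, 'cursorMark': cursor_mark}
    # quote_via=quote encodes spaces as %20 rather than '+'
    return SEARCH_URL + '?' + urlencode(params, quote_via=quote)


url = build_search_url('GRANT_AGENCY:"fight for sight"',
                       {'resultType': 'core', 'pageSize': 1000,
                        'format': 'json'})
print(url)
```

Because `urlencode` converts values with `str`, `pageSize` can be given as an integer here.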


### Query the API

As there are more than 1000 Fight for Sight related publications, the API needs to be queried multiple times to extract all the results. On each request the "cursorMark" option is set to the "nextCursorMark" value returned by the previous response (starting from "*") to fetch the next page of results.

In [5]:
# define and create save directory
save_name = 'ffs_papers'
save_dir='data/EPMC/json/ffs_papers'

if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

# how many times to try a query before giving up
n_attempts=10
# how long to wait between queries (to limit rate)
time_between_requests=1

# initial parameters to enter loop
nextCursorMark = '*'
page_number = 0
n_results = 0
failed_count = 0
n_response = 999999
hit_count = 999999

print('Querying EPMC...')

# while new results left to return
while (n_results<hit_count) and (n_response>0):
    if failed_count==0:
        print('Querying page '+str(page_number+1)+': ',end='')

    try:
        # add page number (cursorMark) to query str
        query_full_url = base_url + '&cursorMark=' + str(nextCursorMark)  
        
        # query the api
        response = requests.get(query_full_url)
        response = response.json()
        
        # get the cursorMark (page) to query next time
        nextCursorMark = response['nextCursorMark']

        # check how many results were returned
        n_response = len(response['resultList']['result'])
        n_results = n_results + n_response
        hit_count = response['hitCount']

        print(hit_count,'hits total,',n_response,'in this page,',n_results,'collected so far.')

    except Exception as err:
        # something went wrong with the query
        failed_count = failed_count+1

        if failed_count<n_attempts:
            # wait a while, then try the same page again
            time.sleep(time_between_requests)
            continue

        # failed too many times; without a new cursorMark there is
        # no next page to move on to, so give up
        print('FAILED:',err)
        break

    page_number = page_number+1

    # if there are new results, save them to file
    if (failed_count<n_attempts) and (n_response>0):
        file_name = save_dir+'/'+save_name+'_page'+str(page_number).zfill(6)+'.json'

        with open(file_name,'w') as f:
            json.dump(response,f)

    failed_count = 0

    # wait a while before making the next request
    time.sleep(time_between_requests)

print('Done.')

Querying EPMC...
Querying page 1: 1602 hits total, 1000 in this page, 1000 collected so far.
Querying page 2: 1602 hits total, 602 in this page, 1602 collected so far.
Done.
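
The cursor handling in the loop above can be separated from the network and file code. The sketch below is a simplified illustration under stated assumptions, not the notebook's method: `iter_pages` is a hypothetical helper, and `fake_fetch` with its made-up pages stands in for the real `requests.get(...).json()` call so the logic can be run offline. It assumes the response keys used above (`nextCursorMark`, `resultList`, `result`).

```python
def iter_pages(fetch_page, first_cursor='*'):
    """Yield the list of results from each page until none are left."""
    cursor = first_cursor
    while True:
        page = fetch_page(cursor)
        results = page.get('resultList', {}).get('result', [])
        if not results:
            break  # an empty page means everything has been collected
        yield results
        next_cursor = page.get('nextCursorMark')
        if next_cursor is None or next_cursor == cursor:
            break  # no new cursor to continue from
        cursor = next_cursor


# made-up responses keyed by cursorMark, for illustration only
FAKE_PAGES = {
    '*': {'nextCursorMark': 'abc',
          'resultList': {'result': [{'id': '1'}, {'id': '2'}]}},
    'abc': {'nextCursorMark': 'def',
            'resultList': {'result': [{'id': '3'}]}},
    'def': {'nextCursorMark': 'def',
            'resultList': {'result': []}},
}


def fake_fetch(cursor):
    # stand-in for querying the API with this cursorMark
    return FAKE_PAGES[cursor]


pages = list(iter_pages(fake_fetch))
collected = [record['id'] for page in pages for record in page]
print(len(pages), 'pages,', len(collected), 'results')
```

In real use, `fetch_page` would wrap the request, the retry logic and the pause between requests.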


### Convert the JSON Files to a Pandas DataFrame

In [6]:
def json_to_df(response, 
               columns=['abstractText','authorList','chemicalList',
                        'citedByCount','doi','firstPublicationDate',
                        'grantsList','id','journalInfo',
                        'keywordList','meshHeadingList','pmcid',
                        'pmid','pubYear','title']):
                            
    '''convert a single json response from the EPMC search API into a
    pandas data frame. Any length 1 dictionaries or lists found in the
    resulting columns are flattened.'''
    
    # extract the list of results from the json
    df = pd.DataFrame(response['resultList']['result'])
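    # select subset of columns (only those present in this response);
    # Index.contains() was removed in pandas 1.0, so use "in" instead,
    # and copy so the column assignments below act on a new frame
    df = df[[col for col in columns if col in df.columns]].copy()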
    
    for col in df.columns:
        
        # flatten length 1 dictionaries
        is_dict = [type(x) is dict for x in df[col]]
        
        if sum(is_dict)>0:
            # if all the entries in this column are a dict with only
            # one key, the dicts can be replaced by their one value.
            len1_dict = [len(x)==1 for x in df.loc[is_dict,col]]
            
            if all(len1_dict):
                def extract_key_from_dict(the_dict):
                    if (type(the_dict) is dict) and (len(the_dict)==1):
                        key = list(the_dict.keys())
                        key = key[0]
                        return the_dict[key]
                        
                    else:
                        return the_dict
            
                df[col] = df[col].apply(extract_key_from_dict)

        # flatten length 1 lists
        is_list = [type(x) is list for x in df[col]]
        
        if sum(is_list)>0:
            # if all the entries in this column are a list with only
            # one value, the lists can be replaced by their one value.
            len1_list = [len(x)==1 for x in df.loc[is_list,col]]
            
            if all(len1_list):
                def flatten_list(the_list):
                    if (type(the_list) is list) and (len(the_list)==1):
                        return the_list[0]
                        
                    else:
                        return the_list
            
                df[col] = df[col].apply(flatten_list)
       
    return df
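
The flattening rule in `json_to_df` can be tried on plain Python lists without pandas. The sketch below is a simplified illustration of that rule, not a replacement for the function: `unwrap_singletons` and `flatten_column` are hypothetical helpers, and the column values are made up.

```python
def unwrap_singletons(values, kind):
    """Unwrap every dict (or list) in `values`, but only if all of them
    hold exactly one item; otherwise return the values unchanged."""
    containers = [v for v in values if type(v) is kind]
    if not containers or any(len(c) != 1 for c in containers):
        return list(values)

    def unwrap(v):
        if type(v) is not kind:
            return v  # leave missing or scalar entries alone
        return next(iter(v.values())) if kind is dict else v[0]

    return [unwrap(v) for v in values]


def flatten_column(values):
    # same order as json_to_df: dicts first, then lists
    return unwrap_singletons(unwrap_singletons(values, dict), list)


# made-up column values, for illustration only
keywords = [{'keyword': ['Microglia', 'Retina']}, None]
journal = [{'issue': '1', 'volume': '14'}, {'issue': '11', 'volume': '22'}]
nested = [{'a': [5]}, {'a': [6]}]

flat_keywords = flatten_column(keywords)  # dict wrapper removed, list kept
flat_journal = flatten_column(journal)    # two-key dicts are left as they are
flat_nested = flatten_column(nested)      # both wrappers removed
```

The all-or-nothing check keeps a column consistent: it is either unwrapped for every row or left untouched.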

In [7]:
print('Processing Data...')

# get the list of json files to combine
json_paths = glob('data/EPMC/json/ffs_papers/*.json')

# list to store dataframes for each json file
df_list = []

# loop over json files
for idx,path in enumerate(sorted(json_paths)):
    print(idx+1,'out of',len(json_paths),':',path)
    
    # load json
    with open(path,'r') as f:
        response = json.loads(f.read())

    # convert json to data frame
    new_df = json_to_df(response)

    # add dataframe to the list
    df_list.append(new_df)

# combine all the dataframes into one
df = pd.concat(df_list, ignore_index=True, sort=False)

# check for and remove any duplicates
# pandas duplicates can't deal with dicts so convert everything to str first
df = df[~df.astype(str).duplicated()].reset_index(drop=True)

# save the data frame in excel and pickle format
df.to_excel('data/EPMC/ffs_papers.xlsx', index=False)
df.to_pickle('data/EPMC/ffs_papers.pkl')

print('Done.')

display(df.head())

Processing Data...
1 out of 2 : data/EPMC/json/ffs_papers\ffs_papers_page000001.json
2 out of 2 : data/EPMC/json/ffs_papers\ffs_papers_page000002.json
Done.
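
The duplicate check above converts every cell to a string because lists and dicts are not hashable. The same idea can be shown without pandas by serialising each record to a canonical string and keeping the first occurrence. This is a sketch of the principle rather than the notebook's implementation: `drop_duplicate_records` is a hypothetical helper and the records are made up.

```python
import json


def drop_duplicate_records(records):
    """Keep the first occurrence of each record, comparing by content."""
    seen = set()
    unique = []
    for record in records:
        # nested lists/dicts cannot be hashed, so compare a string form;
        # sort_keys makes the string independent of key order
        key = json.dumps(record, sort_keys=True, default=str)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# made-up records, for illustration only
records = [
    {'id': '1', 'grantsList': [{'agency': 'A'}]},
    {'id': '2', 'grantsList': None},
    {'id': '1', 'grantsList': [{'agency': 'A'}]},
]

unique = drop_duplicate_records(records)
print(len(records), 'records,', len(unique), 'unique')
```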


| | abstractText | authorList | chemicalList | citedByCount | doi | firstPublicationDate | grantsList | id | journalInfo | keywordList | meshHeadingList | pmcid | pmid | pubYear | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The transepithelial potential difference (TEP)... | [{'fullName': 'Cao L', 'firstName': 'Lin', 'la... | | 1 | 10.1111/jcmm.13829 | 2018-08-30 | [{'grantId': '1361/1362', 'agency': 'Fight for... | 30160348 | {'issue': '11', 'volume': '22', 'journalIssueI... | [Atp1b1, Cell-cell Connection, Extracellular E... | | PMC6201363 | 30160348 | 2018 | Polarized retinal pigment epithelium generates... |
| 1 | | [{'fullName': 'Chen M', 'firstName': 'Mei', 'l... | | 0 | 10.21037/atm.2018.10.31 | 2018-11-01 | [{'grantId': '1425/26', 'agency': 'Fight for S... | 30613630 | {'issue': 'Suppl 1', 'volume': '6', 'journalIs... | | | PMC6291610 | 30613630 | 2018 | Cholesterol homeostasis, macrophage malfunctio... |
| 2 | PURPOSE:Quantitative analysis of hyperautofluo... | [{'fullName': 'Tee JJL', 'firstName': 'James J... | | 0 | 10.1097/IAE.0000000000001871 | 2018-12-01 | [{'grantId': '1578/79', 'agency': 'Fight for S... | 29016458 | {'issue': '12', 'volume': '38', 'journalIssueI... | | | PMC5797695 | 29016458 | 2018 | QUANTITATIVE ANALYSIS OF HYPERAUTOFLUORESCENT ... |
| 3 | BACKGROUND:Uncontrolled microglial activation ... | [{'fullName': 'Wang L', 'firstName': 'Luxi', '... | | 0 | 10.1186/s13024-019-0305-9 | 2019-01-11 | [{'grantId': '1361/62', 'agency': 'Fight for S... | 30634998 | {'issue': '1', 'volume': '14', 'journalIssueId... | [Microglia, Retinal degeneration, neuroinflamm... | | PMC6329071 | 30634998 | 2019 | Glucose transporter 1 critically controls micr... |
| 4 | Cell therapy using endothelial progenitors hol... | [{'fullName': 'Reid E', 'firstName': 'Emma', '... | | 4 | 10.1002/sctm.17-0187 | 2017-11-22 | [{'grantId': '10JTA', 'agency': 'The Sir Jules... | 29164803 | {'issue': '1', 'volume': '7', 'journalIssueId'... | [Cell therapy, Stem Cells, Endothelial Progeni... | | PMC5746158 | 29164803 | 2018 | Preclinical Evaluation and Optimization of a C... |
