# 3b. Downloading JSON Files for Non-Retracted Papers




## Introduction



This notebook makes a series of API calls to retrieve all the **bibliographic information for our non-retracted papers** that is available on OpenAlex. It then saves this information in a number of **JSON files** for future use.

The notebook uses the list of DOIs of non-retracted papers generated by **Notebook 3a**. The JSON files downloaded here will, in turn, be used in **Notebook 4** to extract the abstracts of non-retracted papers.

The workflow has thus been set up as follows:

- Input parameters: **one .csv file** with a list of DOIs of non-retracted papers.
- Output parameters: **one JSON file per DOI** of a non-retracted paper in our input file.



## Input / Output Parameters


Input parameters:

In [1]:
# File path for file with DOIs of non-retracted papers

input_path = "../data/dois_non_retracted/cell_biology/non_retracted_dois_cell_bio.csv"

Output parameters:

In [2]:
# Directory for the .json files of non-retracted papers

output_path = "../data/json_files/cell_biology/non_retracted"

# File path for log concerning download process

output_path_log = "../data/logs/cell_biology/non_retracted_json_download"

## Preliminaries

Let us start by importing the required libraries:

In [3]:
# Import required libraries

import pandas as pd
import numpy as np

import requests
import csv
import os

from json.decoder import JSONDecodeError
import json

import warnings
warnings.filterwarnings("ignore")

And by loading the data in our input file:

In [4]:

# Load input .csv data into data frame  

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize data frame

df

Unnamed: 0,0
0,https://doi.org/10.1038/s41419-019-2011-5
1,https://doi.org/10.21037/atm.2019.09.128
2,https://doi.org/10.1083/jcb.200501162
3,https://doi.org/10.1038/ncb2252
4,https://doi.org/10.1083/jcb.200704166
...,...
7393,https://doi.org/10.1515/jbcpp-2017-0221
7394,https://doi.org/10.1111/age.12816
7395,https://doi.org/10.1074/jbc.m109.022897
7396,https://doi.org/10.1038/srep45523


In [7]:
# Note: apply this step to non-retracted papers only

# Remove the 'https://doi.org/' prefix from the first column
df.iloc[:, 0] = df.iloc[:, 0].str.replace('https://doi.org/', '', regex=False)

# Rename the first column to 'original_paper_doi' after the operation
df.columns = ['original_paper_doi']

# Show the updated DataFrame
df

Unnamed: 0,original_paper_doi
0,10.1038/s41419-019-2011-5
1,10.21037/atm.2019.09.128
2,10.1083/jcb.200501162
3,10.1038/ncb2252
4,10.1083/jcb.200704166
...,...
7393,10.1515/jbcpp-2017-0221
7394,10.1111/age.12816
7395,10.1074/jbc.m109.022897
7396,10.1038/srep45523


In [8]:
# Show data frame info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7398 entries, 0 to 7397
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original_paper_doi  7398 non-null   object
dtypes: object(1)
memory usage: 57.9+ KB


In [9]:
# Check whether there are any duplicate DOIs in the input csv

duplicate_dois = df['original_paper_doi'].value_counts()
duplicate_dois = duplicate_dois[duplicate_dois > 1]

duplicate_dois

Series([], Name: count, dtype: int64)

## Function Definitions

- We will use a few functions to fetch the required information for our papers quickly and efficiently.
- The following function takes the DOI of a paper and builds the URL required to fetch its record from OpenAlex.

In [12]:
# Define address_builder function

def address_builder(doi):
    """Takes a DOI identifier and builds full URL address for an OpenAlex API call"""
    
    # Store url addresses in string   
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    polite_address = base_address + "?mailto=" + "jennyhuelsmeier@posteo.de" # Use polite address for faster API call performance
    
    # Return polite address
    
    return polite_address
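
To illustrate what such a URL looks like, here is a minimal sketch of a variant that also percent-encodes the DOI, which may help for DOIs containing characters such as spaces, `#` or `?`. The function name `build_openalex_url` and the e-mail address are hypothetical, and the sketch assumes that OpenAlex decodes percent-encoded URLs in the usual way.

```python
from urllib.parse import quote, urlencode

def build_openalex_url(doi, email):
    """Hypothetical variant of address_builder that percent-encodes the DOI."""
    # Keep '/' readable; encode characters such as spaces, '#' or '?'
    encoded_doi = quote(doi, safe="/")
    # urlencode also escapes the '@' in the e-mail address
    query = urlencode({"mailto": email})
    return f"https://api.openalex.org/works/https://doi.org/{encoded_doi}?{query}"

# Placeholder e-mail address, for illustration only
url = build_openalex_url("10.1038/ncb2252", "you@example.org")
print(url)
```

For a DOI made up of letters, digits, dots and slashes only, the path part of the URL is the same as the one built by `address_builder`.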

- The function that fetches the metadata for a paper, using its DOI as the identifier in the API call, is defined in the next section.

## Fetching the Data


- In some cases it is necessary to download the JSON files in batches, e.g. after an interruption. To avoid downloading files twice, the following process determines what has already been downloaded by reading the DOIs from the filenames in the output folder.

In [None]:
# Create a list of DOIs from the filenames of the existing files in a specified folder

def create_downloaded_doi_dataframe(json_directory):
    # List all JSON files and extract DOIs
    # (assumes the DOIs contain no underscores, since '/' was saved as '_')
    json_files = [f for f in os.listdir(json_directory) if f.endswith('.json')]
    dois = [f[:-5].replace('_', '/') for f in json_files]
    doi_df = pd.DataFrame(dois, columns=['DOI'])
    return doi_df

# Use the output folder defined in the output parameters
json_directory = output_path

existing_doi_df = create_downloaded_doi_dataframe(json_directory)
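
Recovering DOIs from filenames by swapping `_` back to `/` only works for DOIs that contain no underscore of their own. The sketch below shows the problem on a made-up DOI and one possible alternative, percent-encoding the DOI so that the filename can be converted back exactly. The helper names `doi_to_filename` and `filename_to_doi` are hypothetical and not part of this notebook; switching to them would also change the filenames that Notebook 4 receives.

```python
from urllib.parse import quote, unquote

def doi_to_filename(doi):
    """Hypothetical helper: reversible, filesystem-safe filename for a DOI."""
    return quote(doi, safe="") + ".json"

def filename_to_doi(filename):
    """Inverse of doi_to_filename."""
    return unquote(filename[: -len(".json")])

doi = "10.1000/abc_def/ghi"  # made-up DOI containing '_' and a second '/'

# Round trip as done in this notebook: '/' -> '_' when saving, '_' -> '/' when reading
lossy = doi.replace("/", "_").replace("_", "/")

# Round trip with percent-encoding
reversible = filename_to_doi(doi_to_filename(doi))

print(lossy == doi)       # False
print(reversible == doi)  # True
```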

In [29]:
# Create a log file for existing DOIs
def log_existing_dois(existing_doi_df, log_directory):
    log_file_path = os.path.join(log_directory, 'existing_doi_log.csv')
    existing_doi_df.to_csv(log_file_path, index=False)

# Use the log folder defined in the output parameters
log_directory = output_path_log

log_existing_dois(existing_doi_df, log_directory)

In [30]:
# Filter the input DataFrame to remove the existing DOIs found in the previous step

def filter_new_dois(input_df, existing_doi_df):
    filtered_df = input_df[~input_df['original_paper_doi'].isin(existing_doi_df['DOI'])]
    return filtered_df


filtered_df = filter_new_dois(df, existing_doi_df)
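
The filtering step can be tried out on a toy example. Because DOIs are case-insensitive, this sketch lower-cases both sides before comparing, which is an addition to the logic used in this notebook. All DOIs below are made up for illustration.

```python
import pandas as pd

# Made-up DOIs, for illustration only
input_df = pd.DataFrame({"original_paper_doi": ["10.1/a", "10.1/b", "10.1/c"]})
existing_doi_df = pd.DataFrame({"DOI": ["10.1/B"]})

# Flag every input DOI that already has a JSON file, ignoring case
already_downloaded = input_df["original_paper_doi"].str.lower().isin(
    existing_doi_df["DOI"].str.lower()
)

# Keep only the DOIs that still need to be fetched
remaining = input_df[~already_downloaded]
print(remaining["original_paper_doi"].tolist())  # ['10.1/a', '10.1/c']
```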


**Check that the logic worked: the results in existing_doi_log.csv, the output folder and filtered_df should all be as expected**


**Functions to create the API call loop for the download**

In [32]:
# Define API call function for a data frame of DOIs

def fetch_doi_fulljson(disciplines_doi_df, json_directory):
    """Downloads the OpenAlex record of each DOI as a JSON file and returns a log data frame"""
    log = []
    for doi in disciplines_doi_df['original_paper_doi']:
        url = address_builder(doi)
        response = requests.get(url)
        if response.status_code == 200:
            data = response.json()
            # Save JSON file, replacing '/' in the DOI so that it is a valid file name
            file_path = os.path.join(json_directory, doi.replace('/', '_') + '.json')
            with open(file_path, 'w') as file:
                json.dump(data, file)
            log.append({'DOI': doi, 'Status': 'Success'})
        else:
            # Record the HTTP status code of the failed call
            log.append({'DOI': doi, 'Status': f"Failed - {response.status_code}"})
    return pd.DataFrame(log)
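
With several thousand calls, a single timeout or connection error would raise an exception and stop the loop above. The following is a minimal sketch of one way to turn such errors into log entries instead. The helper `fetch_one` and the 30-second timeout are assumptions made for illustration, not part of this notebook. The `get` argument is any callable with the signature of `requests.get`, which lets us try the helper with a stub and without network access.

```python
def fetch_one(url, get, timeout=30):
    """Hypothetical helper: return (data, status) for one URL without raising.

    `get` is any callable with the signature of requests.get.
    """
    try:
        response = get(url, timeout=timeout)
    except OSError as exc:
        # requests.RequestException derives from OSError, so timeouts and
        # connection errors end up here instead of stopping the loop
        return None, f"Failed - {type(exc).__name__}"
    if response.status_code != 200:
        return None, f"Failed - {response.status_code}"
    try:
        return response.json(), "Success"
    except ValueError:
        # Raised when the response body is not valid JSON
        return None, "Failed - invalid JSON"


# Demonstration with a stub in place of requests.get (no network access needed)
def timeout_stub(url, timeout):
    raise TimeoutError("simulated timeout")

print(fetch_one("https://example.org", get=timeout_stub))  # (None, 'Failed - TimeoutError')
```

Inside the download loop, the call would then read `data, status = fetch_one(address_builder(doi), get=requests.get)`, with the JSON file written only when `data` is not `None`.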


In [33]:
# Define function to write log for API calls

def write_api_call_log(api_log_df, log_directory):
    log_file_path = os.path.join(log_directory, 'doi_calling_log.csv')
    api_log_df.to_csv(log_file_path, index=False)


In [34]:
# Define function to run API calls and log results
def fetch_and_log_data(filtered_doi_df, json_directory, log_directory):
    # Fetch data for the new DOIs
    api_log_df = fetch_doi_fulljson(filtered_doi_df, json_directory)
    
    # Log the results of the API calls
    write_api_call_log(api_log_df, log_directory)


**Check if the code works on a sample subset**

In [50]:
# Define sample size

sample_size = 20

# Cap the sample size at the number of rows so that sample_df is always defined

sample_size = min(sample_size, len(df))

# Create a random sample of the DataFrame with the defined sample size
sample_df = df.sample(n=sample_size, random_state=1)  # random_state for reproducibility

# Obtain info for the new DataFrame
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original_paper_doi  20 non-null     object
dtypes: object(1)
memory usage: 292.0+ bytes



- Inspect the results and, if all works well, proceed to download the rest of the files

In [None]:
# run function to fetch data for sample and log results

fetch_and_log_data(sample_df, json_directory, log_directory) 

In [None]:

# Rerun code to add the already downloaded sample files to the existing DOI list and exclude them from the main call

existing_doi_df = create_downloaded_doi_dataframe(json_directory)
log_existing_dois(existing_doi_df, log_directory)
filtered_df = filter_new_dois(df, existing_doi_df)


- Master function to fetch JSON files for all non-retracted papers

In [35]:
# Fetch all json files for the main corpus of the discipline 

fetch_and_log_data(filtered_df, json_directory, log_directory) 



- Check the logs for error messages and unavailable files, and compare the number of successful downloads with the count of JSON files in the output folder
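
This final check could be sketched as follows. The helper `summarise_download` is hypothetical. It assumes a log file with the columns `DOI` and `Status`, as written by `write_api_call_log`, and is demonstrated here on a temporary folder with made-up entries.

```python
import csv
import os
import tempfile

def summarise_download(log_file_path, json_directory):
    """Hypothetical helper: compare the API call log with the JSON files on disk."""
    with open(log_file_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # DOIs whose call did not end with status 'Success'
    failed = [row["DOI"] for row in rows if row["Status"] != "Success"]
    # Number of .json files currently in the output folder
    n_files = sum(1 for name in os.listdir(json_directory) if name.endswith(".json"))
    return {"logged": len(rows), "failed": failed, "json_files": n_files}

# Demonstration on a temporary folder with made-up entries
with tempfile.TemporaryDirectory() as tmp:
    log_file_path = os.path.join(tmp, "doi_calling_log.csv")
    with open(log_file_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerows([["DOI", "Status"], ["10.1/a", "Success"], ["10.1/b", "Failed - 404"]])
    open(os.path.join(tmp, "10.1_a.json"), "w").close()
    summary = summarise_download(log_file_path, tmp)

print(summary)  # {'logged': 2, 'failed': ['10.1/b'], 'json_files': 1}
```

Note that `doi_calling_log.csv` is overwritten on every run, so it only covers the most recent call. The number of JSON files in the output folder can therefore be larger than the number of successes in the log, for example after the sample run.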