# 2a. Downloading JSON Files for Retracted Papers



## Introduction


This notebook **retrieves all the information available for our sample of retracted papers from OpenAlex, then stores it by creating one JSON file per paper**. It does so by accessing this database via a series of API calls.

The Notebook takes the .csv file generated by **Notebook 1c**, which contains our cleaned dataset, filtered by severity score and limited to a single discipline. It then uses the DOI of each paper in that .csv file to perform an API call to OpenAlex and stores all the information available in the database in a JSON file. These JSON files will be used in **Notebooks 2b and 2c** to obtain abstracts for all of the retracted papers under investigation, along with the exact distribution of these papers by country and year.

The **workflow** for this Notebook is therefore as follows:

- Input: **one .csv file** with the clean data for all retracted papers within a specific field.
- Output: **one .json file for each paper in our input .csv file**, plus **two .csv files** with logs concerning the download process and the points at which it may have failed.


## Input / Output Parameters


Input parameters:

In [1]:

# Path for input .csv file

input_path = "../data/retraction_watch_data_set/4_cell_bio_data_set.csv"



Output parameters:

In [2]:

# Path to directory to store .json files

#json_directory = "/Volumes/TOSHIBA_EXT/cellbiology_retracted_fulljsonfiles"

json_directory = "../data/json_files/cell_biology/retracted"

# Path for log with the outcome (success or failure) of each API call

log_directory_api_outcome = "../data/logs/cell_biology/retracted_json_download.csv"

# Path for log listing the DOIs of the papers whose .json file has already been downloaded

log_directory_doi_list = "../data/logs/cell_biology/retracted_doi_list.csv"


## Importing Libraries

Let us start by importing all the libraries that we will use in the Notebook:

In [3]:

# Import required libraries

import pandas as pd
import numpy as np

import requests
import csv
import os

from json.decoder import JSONDecodeError
import json

import warnings
warnings.filterwarnings("ignore")

## Loading Input Data

Next we will load our input .csv file into a data frame:

In [4]:

# Load input .csv data into data frame  

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize data frame

df.head(1)


Unnamed: 0,record_id,title,institution,journal,publisher,country,author,urls,article_type,retraction_date,...,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes,reason_list,severity_score,subject
0,52739,Anti-breast Cancer Activity of Co(II) Complex ...,"Luohe Medical College, Luohe, Henan, China; Lu...",Journal of Cluster Science,Springer - Nature Publishing Group,China,Ting Yin;Ruirui Wang;Shaozhe Yang,,Research Article;,1/24/2024 0:00,...,10/27/2021 0:00,10.1007/s10876-021-02192-4,0.0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,See also: https://pubpeer.com/publications/739...,"['Concerns/Issues About Image', 'Concerns/Issu...",4,(BLS) Biology - Cellular


## Downloading JSON Files: Function Definitions


Our goal in this notebook is to download all the information that OpenAlex holds on the retracted papers that we are investigating. One can do that by making an API call to the OpenAlex database, using a URL that is specific to each paper. Luckily, this URL can easily be constructed from the paper's DOI.

Since we will need it in what follows, let us go ahead and define a function that constructs an OpenAlex-appropriate URL given a paper's DOI:


In [5]:

# Define address_builder function

def address_builder(doi):
    """Takes a DOI identifier and builds the full URL address to perform an API call
    on OpenAlex from it"""
    
    # Build url address and store it in string   
    
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    polite_address = base_address + "?mailto=" + "pabloruizdeolano@gmail.com" # Use polite address for faster API call performance
    
    # Return url address
    
    return polite_address
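
As a minimal sketch of how such an address could be built a little more defensively, the variant below strips surrounding whitespace and an optional `https://doi.org/` prefix from the DOI before assembling the URL, and takes the contact email as a parameter. The name `build_openalex_url` and the address `you@example.com` are placeholders introduced for illustration; they are not part of the original notebook.

```python
OPENALEX_WORKS = "https://api.openalex.org/works/https://doi.org/"


def build_openalex_url(doi, email="you@example.com"):
    """Hypothetical variant of address_builder (illustration only)."""
    # Normalise the DOI: drop whitespace and a leading resolver prefix, if any
    clean = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/"):
        if clean.lower().startswith(prefix):
            clean = clean[len(prefix):]
            break
    # Same URL layout as address_builder: base address plus "mailto" parameter
    return OPENALEX_WORKS + clean + "?mailto=" + email


print(build_openalex_url(" https://doi.org/10.1007/s10876-021-02192-4 "))
# https://api.openalex.org/works/https://doi.org/10.1007/s10876-021-02192-4?mailto=you@example.com
```

Whether the input data ever contains DOIs with a resolver prefix is an assumption; if it does not, the normalisation step simply leaves the DOI unchanged.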




Next, we will define a function that systematically accesses whatever information OpenAlex contains for all our papers, and stores it in a .json file:


In [6]:

# Define function to obtain .json files for all papers in our data frame

def fetch_json_files(df, json_directory, log_directory):
    """
    Function takes a data frame, the path to a directory and the path to a log file.
    It extracts a list of DOIs from the "original_paper_doi" column of the input data
    frame, performs one API call per DOI and writes each outcome as a .json file in the
    specified directory. It also keeps a log of successful and failed API calls and
    writes it as a .csv file at the specified log path.
    """
    
    # Create empty list to store log with success or failure of each API call
    
    log = []
    
    # For loop to perform one API call per DOI in input data frame

    for doi in df['original_paper_doi']:
    
        # Skip empty or invalid DOIs
    
        if not isinstance(doi, str) or not doi.strip(): 
            log.append({'DOI': doi, 'Status': 'Skipped - Empty or Invalid DOI'}) 
            continue  
    
        try:
            # Build url address by calling address_builder function
            
            url = address_builder(doi)
        
            # Perform API call using URL address and store result in variable
            
            response = requests.get(url, timeout=10)  # Added timeout to prevent hanging requests
        
            # If clause to control for case in which API call fails
            
            if response.status_code == 200:
            
                # Convert result of API call to json format
            
                data = response.json()
            
                # Create file path to save .json file with result of API call
            
                full_path = os.path.join(json_directory, doi.replace('/', '_') + '.json')
            
                # Save result of API call to .json file
                
                with open(full_path, 'w') as file:
                    json.dump(data, file)
            
                # Update log list with dictionary specifying success for current DOI
            
                log.append({'DOI': doi, 'Status': 'Success'})
        
            else:
                # Update log list with dictionary specifying failure for current DOI
                
                log.append({'DOI': doi, 'Status': f"Failed - {response.status_code}"})
    
        except requests.RequestException as e:
            
            # Handle exceptions during the API call (e.g., connection errors, timeouts)
            
            log.append({'DOI': doi, 'Status': f"Failed - {str(e)}"})

    # Convert log list to data frame 
    
    df_log = pd.DataFrame(log)
    
    # Write content of log data frame into resulting path
    
    df_log.to_csv(log_directory, index=False)
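
One thing worth noting about the saving step: it assumes that `json_directory` already exists. If it does not, `open(full_path, 'w')` raises a `FileNotFoundError`, which is not a `requests.RequestException` and so is not caught by the `except` clause above. A minimal sketch of a saving helper that creates the directory first is shown below; the name `save_record` is hypothetical, and the record and DOI in the usage example are made up.

```python
import json
import os
import tempfile


def save_record(data, doi, json_directory):
    """Hypothetical helper: write one API result to <json_directory>/<doi>.json."""
    # Create the target directory if it is missing (no error if it exists)
    os.makedirs(json_directory, exist_ok=True)
    # Same naming convention as fetch_json_files: "/" in the DOI becomes "_"
    full_path = os.path.join(json_directory, doi.replace("/", "_") + ".json")
    with open(full_path, "w", encoding="utf-8") as file:
        json.dump(data, file)
    return full_path


# Usage with a throwaway directory and a made-up record
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "json_files", "retracted")
    path = save_record({"id": "example"}, "10.1000/example", target)
    print(os.path.basename(path))  # 10.1000_example.json
```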
    


## Downloading JSON Files: Test Trial


Having defined those functions, we can go ahead and start downloading information for our retracted papers from OpenAlex. Since data sets of interest will typically be quite large, we will first do that on a smaller sample, just to make sure that everything works properly. 

Let us first generate the required sample data frame, with some desired sample size:

In [7]:

# Define sample size

sample_size = 20

# Check if sample_size is less than the number of rows in the data frame

if sample_size <= len(df):
    
    # Create a random sample of the data frame with the defined sample size
    
    df_sample = df.sample(n=sample_size, random_state=1)  
    
else:
    
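    # Fall back to the full data frame so that df_sample is always defined
    
    print("Sample size is larger than the DataFrame; using the full data frame instead.")
    df_sample = df.copy()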



We can now call our master function to store all the available information about these papers in .json files:

In [11]:

# Call fetch_json_files function to download data for sample data frame

fetch_json_files(df_sample, json_directory, log_directory_api_outcome)


Because we will be retrieving information for many retracted papers, and because API calls are often slow, downloading all the information that we are interested in will take a considerable amount of time. It will therefore be convenient to devise a system that keeps track of which items have already been downloaded and resumes the process from there in case any interruptions take place. We will devise one such system now, and use it to see how our first test run went.

To do that, let us start by defining a function that checks which papers already have a downloaded .json file, then returns a data frame with the DOIs of those papers:

In [12]:

# Define function to get list of already downloaded DOIs

def downloaded_paper_list_getter(directory, log_directory):
    """
    Function takes the path to a directory and the path to a log file as input, lists
    the .json files in that directory, then reconstructs the DOI associated with each
    file from its name by removing the file extension and turning "_" back into "/".
    It returns a data frame with the resulting DOIs and writes the content of this
    data frame into a .csv file at the log path.
    """
    
    # Create list with names of all files in input directory
    
    file_names = [file for file in os.listdir(directory) if file.endswith('.json')]
    
    # Create list with names of all files minus ".json" extension  
    # Given the name structure of our files, this will give us a list of all DOIs in folder
    
    paper_dois = [file[:-5].replace('_', '/') for file in file_names]
    
    # Create data frame with names of all files in folder
    
    df_dois = pd.DataFrame(paper_dois, columns=['doi'])
    
    # Write content of log data frame into resulting path
    
    df_dois.to_csv(log_directory, index=False)
    
    # Return data frame
    
    return df_dois



Let us call this function to generate a data frame with the DOIs of all the papers for which we already have a .json file:

In [17]:

# Call function to generate data frame with DOIs of downloaded papers

existing_doi_df = downloaded_paper_list_getter(json_directory, log_directory_doi_list)

# Check size of resulting data frame

existing_doi_df.shape

(0, 1)


We can now inspect the content of the .json files that were created for our sample data frame, along with the log files that were generated in the process. If it all looks good, we can go ahead and download information for the rest of our retracted papers by using the functions that we defined earlier.
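
One way to carry out this inspection is sketched below, using only the standard library. It assumes that the log has the `DOI` and `Status` columns written by `fetch_json_files`; the helper names `summarize_log` and `top_level_keys` are hypothetical, and the log rows in the usage example are made up.

```python
import csv
import json
import os
import tempfile
from collections import Counter


def summarize_log(log_path):
    """Count how many DOIs ended in each Status value of the download log."""
    with open(log_path, newline="", encoding="utf-8") as file:
        return Counter(row["Status"] for row in csv.DictReader(file))


def top_level_keys(json_path):
    """Return the sorted top-level field names of one saved .json record."""
    with open(json_path, encoding="utf-8") as file:
        return sorted(json.load(file))


# Usage with a throwaway log file containing made-up rows
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log.csv")
    with open(log_path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file).writerows([["DOI", "Status"],
                                    ["10.1000/a", "Success"],
                                    ["10.1000/b", "Failed - 404"]])
    print(summarize_log(log_path))
```

A large share of `Failed` or `Skipped` entries in the summary would be a reason to look at individual DOIs before starting the full download.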


## Output: Downloading JSON Files for Entire Data Set



To download .json files for our entire data set, it will be useful to define a new function that removes the papers for which we already have .json files available:


In [18]:

# Define function to remove papers for which we already have a .json file from original data frame

def non_downloaded_papers_selector(df, existing_doi_df):
    """
    Function takes an input data frame and a data frame with a list of DOIs, returns 
    the input data frame without those papers whose DOIs were included in the list.
    """

    # Keep only the rows whose DOI is not in the list of already downloaded papers
    
    df_not_downloaded = df[~df['original_paper_doi'].isin(existing_doi_df['doi'])]
    
    # Return data frame
    
    return df_not_downloaded
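
One caveat, given the naming convention used in `fetch_json_files`: because `/` is replaced by `_` when a file is saved, turning `_` back into `/` does not recover DOIs that themselves contain an underscore, so such papers would never be recognised as downloaded. A minimal sketch of one way around this is to compare expected file names instead of reconstructed DOIs. The helper name `pending_dois` is hypothetical, and the DOIs in the usage example are made up.

```python
import os
import tempfile


def pending_dois(dois, json_directory):
    """Hypothetical helper: return the DOIs that have no .json file yet."""
    existing = set(os.listdir(json_directory))
    pending = []
    for doi in dois:
        # Skip missing values (e.g. NaN) and blank strings, as fetch_json_files does
        if not isinstance(doi, str) or not doi.strip():
            continue
        # Forward mapping only: DOI -> expected file name
        if doi.replace("/", "_") + ".json" not in existing:
            pending.append(doi)
    return pending


# Usage: a DOI containing an underscore is still matched to its file
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "10.1000_a_b.json"), "w").close()
    print(pending_dois(["10.1000/a_b", "10.1000/c"], tmp))  # ['10.1000/c']
```

With the notebook's data frame, the resulting list could then be used as a filter, for example `df[df['original_paper_doi'].isin(pending)]`.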

We can call this function to obtain a data frame which only contains papers for which a .json file still has to be downloaded:

In [26]:

# Create data frame with DOIs of papers for which no data has been downloaded

df_not_downloaded = non_downloaded_papers_selector(df, existing_doi_df)



Having done that, we can call our master function to download .json files for all the remaining papers in our data frame:

In [20]:

fetch_json_files(df_not_downloaded, json_directory, log_directory_api_outcome)


KeyboardInterrupt: 

Note that, should the process be interrupted, we can always repeat the steps above to restart it right where it left off. This is, in fact, the system for coping with possible interruptions that we mentioned earlier.
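
A related caveat: `fetch_json_files` as defined above writes its log only after the loop has finished, so a run that is interrupted part-way leaves no record of the API calls made so far, and each completed run overwrites the previous log. A minimal sketch of an alternative, which appends one row to the log as soon as each outcome is known, is given below. The helper name `append_log_row` is hypothetical, and the rows in the usage example are made up.

```python
import csv
import os
import tempfile


def append_log_row(log_path, doi, status):
    """Hypothetical helper: append one outcome to the .csv log immediately."""
    # Write the header only when the file is new or empty
    needs_header = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["DOI", "Status"])
        if needs_header:
            writer.writeheader()
        writer.writerow({"DOI": doi, "Status": status})


# Usage: rows accumulate across calls, and the header appears once
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "retracted_json_download.csv")
    append_log_row(log_path, "10.1000/a", "Success")
    append_log_row(log_path, "10.1000/b", "Failed - 404")
    with open(log_path, newline="", encoding="utf-8") as file:
        print(list(csv.DictReader(file)))
```

Calling such a helper inside the loop, in place of `log.append(...)`, would keep the log on disk up to date even if the run ends with a `KeyboardInterrupt`.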

To resume the process, we can simply proceed as we did above. First we obtain a data frame with the DOIs of those papers for which a .json file was downloaded:

In [24]:

# Call function to obtain data frame with DOIs of papers for which we have a .json file

existing_doi_df = downloaded_paper_list_getter(json_directory, log_directory_doi_list)

# Check number of papers for which a .json file was downloaded

existing_doi_df.shape


(0, 1)

Then we remove those papers for which data was already downloaded from our data frame:

In [25]:

# Call function to remove downloaded papers from data frame

df_not_downloaded = non_downloaded_papers_selector(df, existing_doi_df)

# Check number of papers for which a .json file has not been downloaded

df_not_downloaded.shape


(10241, 22)

And finally we call our master function to resume the process once again:

In [23]:

# Call master function to download remaining .json files

fetch_json_files(df_not_downloaded, json_directory, log_directory_api_outcome)


KeyboardInterrupt: 