# 2. Fetching .json Files for Retracted Papers



## Introduction


This notebook **retrieves all the information available on OpenAlex for our retracted papers**. It does so by making API calls to OpenAlex.

The Notebook takes the .csv file generated by **Notebook 1c**, which contains the cleaned dataset, includes a severity score for each paper, and is limited to a single discipline. It then uses the DOI of each paper in that .csv file to perform an API call and stores all the information available in .json files. These .json files will be used in **Notebooks 2b and 2c** to obtain abstracts for all of the retracted papers under investigation, along with the exact distribution of these papers by country and year.

The **workflow** for this Notebook is therefore as follows:

- Input: **one .csv file** with our clean data for all retracted papers within a specific field.
- Output: **one .json file for each paper in our input .csv file**, with all the information available on OpenAlex for the paper in question.


## Input / Output Parameters


Input parameters:

In [1]:

# Path for input .csv file

input_path = "../data/subject_cell_bio"



Output parameters:

In [33]:

# Path for output .json files

#json_directory = "/Volumes/TOSHIBA_EXT/cellbiology_retracted_fulljsonfiles"

json_directory = "../data/json_files/cellbiology_retracted_fulljsonfiles"

# Path for log with information about the progress of the .json file downloads

log_directory = "../data/logs/retracted_papers_json_APIcall_logs"


## Importing Libraries

Let us start by importing all the libraries that we will use in the Notebook:

In [3]:

# Import required libraries

import pandas as pd
import numpy as np

import requests
import csv
import os

from json.decoder import JSONDecodeError
import json

import warnings
warnings.filterwarnings("ignore")

## Loading Input Data

Next we will load our input .csv file into a data frame:

In [4]:

# Load input .csv data into data frame
# (named discipline_df because later cells refer to it by that name)

discipline_df = pd.read_csv(input_path, encoding='latin-1')

# Visualize data frame

discipline_df.head(1)


Unnamed: 0,record_id,title,institution,journal,publisher,country,author,urls,article_type,retraction_date,...,original_paper_date,original_paper_doi,original_paper_pubmed_id,reason,paywalled,notes,year,reason_list,severity_score,subject
0,52739,Anti-breast Cancer Activity of Co(II) Complex ...,"Luohe Medical College, Luohe, Henan, China; Lu...",Journal of Cluster Science,Springer - Nature Publishing Group,China,Ting Yin;Ruirui Wang;Shaozhe Yang,,Research Article;,1/24/2024 0:00,...,2021-10-27,10.1007/s10876-021-02192-4,0.0,+Concerns/Issues About Image;+Concerns/Issues ...,No,See also: https://pubpeer.com/publications/739...,2021,"['Concerns/Issues About Image', 'Concerns/Issu...",4,(BLS) Biology - Cellular


## Downloading Information for Retracted Papers


Our goal in this notebook is to download all the information that OpenAlex has on the retracted papers that we are investigating. One can do that by making an API call to the OpenAlex database, using a URL that is specific to each paper. Luckily, this URL can easily be constructed from each paper's DOI.

Since we will need this in what follows, let us go ahead and define a function that constructs an OpenAlex-appropriate URL given a paper's DOI:


In [5]:

# Define address_builder function

def address_builder(doi):
    """Takes a DOI identifier and builds the full URL address to perform an API call
    on OpenAlex from it"""
    
    # Build url address and store it in string   
    
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    polite_address = base_address + "?mailto=" + "pabloruizdeolano@gmail.com" # Use polite address for faster API call performance
    
    # Return url address
    
    return polite_address
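
To make the resulting address concrete, here is a minimal sketch of a variant that takes the contact e-mail as an argument instead of hard-coding it. The name `build_openalex_url` and the address `user@example.com` are hypothetical and not part of the original notebook; the sketch also assumes that percent-encoding the e-mail in the query string is acceptable to the API.

```python
from urllib.parse import urlencode


def build_openalex_url(doi, email):
    """Hypothetical variant of address_builder with the e-mail passed as an argument."""

    # Same base address as in address_builder above
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi

    # urlencode escapes characters such as "@" so that the query string is well formed
    return base_address + "?" + urlencode({"mailto": email})


# Example with the DOI shown in the data frame preview above
example_url = build_openalex_url("10.1007/s10876-021-02192-4", "user@example.com")
print(example_url)
```

Passing the e-mail as a parameter keeps personal details out of the function body, which may be convenient if the notebook is shared.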


Because we will be retrieving information for many retracted papers, and because API calls are often slow, downloading all the information that we are interested in will take a considerable amount of time. It will therefore be convenient to devise a system with which we can keep track of which items have already been downloaded and resume the process from there in case any interruptions take place.

We can do that by first defining the following function, which reads the names of the files in our output directory, reconstructs the DOIs of the papers for which we already have information, and returns a data frame with these DOIs:

In [6]:

# Define function to get list of already downloaded DOIs

def downloaded_doi_list_getter(directory):
    """
    Function takes a file path as input, checks how many .json files there are in there,
    then reconstructs the DOIs associated to each file from their names by removing the file
    extension, returns a data frame with resulting DOIs.
    """
    
    # Create list with names of all files in input directory
    
    file_names = [file for file in os.listdir(directory) if file.endswith('.json')]
    
    # Create list with names of all files minus ".json" extension  
    # Given the name structure of our files, this will give us a list of all DOIs in folder
    
    paper_dois = [file[:-5].replace('_', '/') for file in file_names]
    
    # Create data frame with names of all files in folder
    
    df_dois = pd.DataFrame(paper_dois, columns=['doi'])
    
    # Return data frame
    
    return df_dois
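
One caveat of reconstructing DOIs from file names is worth illustrating. The files are named by replacing `/` with `_` in the DOI and appending `.json`, and the function above reverses this by replacing every `_` with `/`. The sketch below, which uses two hypothetical helper names and a made-up DOI containing an underscore, suggests that such a DOI would not be recovered exactly.

```python
def doi_to_file_name(doi):
    """Hypothetical helper mirroring the naming scheme used for the .json files."""
    return doi.replace('/', '_') + '.json'


def file_name_to_doi(file_name):
    """Hypothetical helper mirroring the reconstruction in downloaded_doi_list_getter."""
    return file_name[:-5].replace('_', '/')


plain_doi = "10.1007/s10876-021-02192-4"  # DOI from the data frame preview above
underscore_doi = "10.1000/abc_def"        # made-up DOI containing an underscore

round_trip_plain = file_name_to_doi(doi_to_file_name(plain_doi))
round_trip_underscore = file_name_to_doi(doi_to_file_name(underscore_doi))

print(round_trip_plain == plain_doi)  # True
print(round_trip_underscore)          # 10.1000/abc/def
```

If the dataset contains DOIs with underscores, those papers would look as if they had never been downloaded, so they may be requested again on every run.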



For compactness, it will also be useful to define a function that writes the list of DOIs of papers for which we already have a .json file into a .csv file:


In [7]:

# Define function to print list of downloaded DOIs into .csv file

def data_frame_printer(df, directory, file_name):
    """
    Function takes a data frame, a directory path and a file name as input, writes the
    content of the data frame as a .csv file with that name in the given directory.
    """
    
    # Create file path for log
    
    full_path = os.path.join(directory, file_name)
    
    #full_path = os.path.join(directory, 'existing_doi_list.csv')

    # Save data frame with DOIs to .csv
    
    df.to_csv(full_path, index=False)



And finally, let us also define a function that we can use to remove the papers for which a .json file has already been downloaded from our original data frame:

In [9]:

# Define function to remove papers for which we already have a .json file from original data frame

def filter_new_dois(df, existing_doi_df):
    """
    Function takes an input data frame and a data frame with a list of DOIs, returns 
    the input data frame without those papers whose DOIs were included in the list.
    """

    # Create data frame with DOIs of papers that have not been downloaded only
    # (downloaded_doi_list_getter stores the DOIs in a column named 'doi')
    
    df_only_not_downloaded = df[~df['original_paper_doi'].isin(existing_doi_df['doi'])]
    
    # Return data frame
    
    return df_only_not_downloaded
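
The filtering step can be illustrated on toy data. This is a minimal sketch with made-up DOIs; it assumes that the data frame of existing DOIs has a column named `doi`, as produced by `downloaded_doi_list_getter`.

```python
import pandas as pd

# Toy version of the input data frame (made-up DOIs)
papers = pd.DataFrame({
    "original_paper_doi": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "year": [2019, 2020, 2021],
})

# Toy version of the data frame with already downloaded DOIs
already_downloaded = pd.DataFrame({"doi": ["10.1000/b"]})

# isin marks rows whose DOI appears in the downloaded list; ~ negates the mask
mask = papers["original_paper_doi"].isin(already_downloaded["doi"])
remaining = papers[~mask]

print(remaining)
```

Only the rows for `10.1000/a` and `10.1000/c` remain, since `10.1000/b` is already in the downloaded list.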


We can now proceed to define the functions that we will use to download information for our retracted papers. We will start with the main function, which performs the API calls and saves the results:

In [8]:

# Define function to obtain .json files for all papers in our data frame

def fetch_json_files(df, directory):
    """
    Function takes a data frame and a file path to a directory, extracts a list of DOIs from
    the "original_paper_doi" column of input data frame and performs one API call per DOI, 
    writes outcome as .json file in input directory. It also keeps a log of successful and
    failed API calls, returns log as data frame with list of DOIs with outcome of call.
    """
    
    # Create empty list to store log with success or failure of each API call
    
    log = []
    
    # For loop to perform one API call per DOI in input data frame
    
    for doi in df['original_paper_doi']:
        
        # Build url address by calling address_builder function
        
        url = address_builder(doi)
        
        # Perform API call using URL address and store result in variable
        
        response = requests.get(url)
        
        # If clause to control for case in which API call fails
        
        if response.status_code == 200:
            
            # Convert result of API call to json format
            
            data = response.json()
            
            # Create file path to save .json file with result of API call
            
            full_path = os.path.join(directory, doi.replace('/', '_') + '.json')
            
            # Save result of API call to .json file 

            with open(full_path, 'w') as file:
                json.dump(data, file)
            
            # Update log list with dictionary specifying success for current DOI 
            
            log.append({'DOI': doi, 'Status': 'Success'})
            
        else:
            
            # Update log list with dictionary specifying failure for current DOI 
            
            log.append({'DOI': doi, 'Status': f"Failed - {response.status_code}"})
    
    # Convert log list to data frame and return 
    
    return pd.DataFrame(log)
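
The log built by this function is a list of dictionaries with a `DOI` and a `Status` key. As a rough sketch of how such a log could be summarized, the example below counts outcomes and collects the DOIs that failed. The entries are made up for illustration and the variable names are not part of the original notebook.

```python
from collections import Counter

# Made-up log entries in the same shape that fetch_json_files appends
example_log = [
    {'DOI': '10.1000/a', 'Status': 'Success'},
    {'DOI': '10.1000/b', 'Status': 'Failed - 404'},
    {'DOI': '10.1000/c', 'Status': 'Success'},
]

# Count how many calls ended with each status
status_counts = Counter(entry['Status'] for entry in example_log)

# Collect the DOIs whose call did not succeed, e.g. to retry them later
failed_dois = [entry['DOI'] for entry in example_log if entry['Status'] != 'Success']

print(status_counts)
print(failed_dois)
```

Keeping the status code in the log string makes it possible to tell apart DOIs that OpenAlex does not know from other kinds of failure.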



Next, we define a small helper that writes the API call log to a .csv file. It is kept as a separate cell mainly to keep a record of the file names used:

In [11]:

# Define write_api_call_log

def write_api_call_log(api_log_df, log_directory):
    
    # Create path to create .csv file from directory passed as input
    
    log_file_path = os.path.join(log_directory, 'doi_calling_log.csv')
    
    # Write content of data frame into resulting path
    
    api_log_df.to_csv(log_file_path, index=False)
    

Finally, we will create a function that calls the last two functions and thus attempts to download information for all relevant DOIs and logs the outcome of each attempt, all in one go:


In [12]:

# Define function to run API calls and log results

def fetch_and_log_data(filtered_doi_df, json_directory, log_directory):
    
    # Fetch data for DOIs in data frame passed as input
    
    api_log_df = fetch_json_files(filtered_doi_df, json_directory)
    
    # Write log with result of API calls
    
    write_api_call_log(api_log_df, log_directory)
    

## First Trial


Having defined those functions, we can go ahead and start downloading information for our retracted papers from OpenAlex. Since data sets of interest will typically be quite large, we will first do that on a smaller sample, just to make sure that everything works properly. Let us first draw a random sample of the desired size from our data frame:

In [13]:
# Define sample size

sample_size = 20

# Check that sample_size does not exceed the number of rows in the data frame

if sample_size <= len(discipline_df):
    
    # Create a random sample of the data frame with the defined sample size
    
    sample_df = discipline_df.sample(n=sample_size, random_state=1)  
    
else:
    
    print("Sample size is larger than the DataFrame.")



We can now call our main function to download data and write the appropriate logs for our sample data frame:

In [13]:

# Call fetch_and_log_data function to download data for sample data frame

fetch_and_log_data(sample_df, json_directory, log_directory) 



Next we call one of the functions that we defined earlier to generate a data frame with the DOIs of all the papers that have already been downloaded:

In [58]:

# Call function to generate data frame with DOIs of downloaded papers

existing_doi_df = downloaded_doi_list_getter(json_directory)


And we save the information in this data frame into a .csv file:

In [59]:

# Create .csv file with DOIs of papers for which information has already been downloaded

data_frame_printer(existing_doi_df, log_directory, "existing_doi_list.csv")
#log_existing_dois(existing_doi_df, log_directory)


We can now inspect the content of the .json files that were created for our sample data frame, and the log files that were generated in the process. If it all looks good, we can go ahead and download information for the rest of our retracted papers by using the functions that we defined earlier.

We will start by creating a new data frame with the DOIs of all the papers of interest for which no data has yet been downloaded:

In [60]:

# Create data frame with DOIs of papers for which no data has been downloaded

filtered_df = filter_new_dois(discipline_df, existing_doi_df)



## Output


With this information, we can now go on to perform the appropriate API calls for the rest of our retracted papers. Once we are done with that, we will have obtained all the information that we wanted:

In [57]:

# Fetch all json files for the main corpus of the discipline 

fetch_and_log_data(filtered_df, json_directory, log_directory) 


In [61]:
existing_doi_df.shape

(7071, 1)

In [62]:
filtered_df.shape

(228, 22)