# 3. Fetching .json Files for Retracted Papers



## Introduction


This notebook **retrieves all the information available for our retracted papers from OpenAlex**. It does so by performing an API call. 

The notebook takes the .csv file generated by **Notebook 2b**, which contains the cleaned dataset from the Retraction Watch Database, limited to a single discipline. It uses the DOI of each paper in that .csv file to perform the API call, then stores all the information returned in a .json file. These .json files will be used in **Notebook 4** to obtain abstracts for all of the retracted papers under investigation. These abstracts will in turn be used as input to train our model in **Notebook 6**.

The **input and output parameters** for this Notebook are therefore as follows:

- Input: **one .csv file** with the data for all retracted papers within a specific field.
- Output: **one .json file** for each paper in our input file, with all the information available on OpenAlex for the paper in question.

Please note that the output of this notebook takes up a lot of space. For this reason, its output path is set to an **external hard drive**. The path will therefore have to be adjusted when the notebook is run on a different machine.



## Input / Output Parameters


- Input parameters:

In [11]:

# Path for input .csv file

input_path = "../data/disciplines/BLS_Biology__Cellular.csv"   


- Output parameters:

In [12]:

# Path for output .json files

json_directory = "/Volumes/Hurricane/cellbiology_retracted_fulljsonfiles"

# Path for log with information about downloaded items via API

log_directory = "../data/retracted_papers_json_APIcall_logs"


## Importing Libraries

- Let us start by importing all the libraries that we will use in the Notebook:

In [5]:

# Import required libraries

import pandas as pd
import numpy as np

import requests
import csv
import os

from json.decoder import JSONDecodeError
import json

import warnings
warnings.filterwarnings("ignore")

## Loading Input Data

- Next we will load our input .csv file into a data frame:

In [13]:

# Load input .csv data into data frame  

discipline_df = pd.read_csv(input_path, encoding='latin-1')

# Visualize data frame

discipline_df.head(1)


Unnamed: 0,record_id,title,institution,journal,publisher,country,author,urls,article_type,retraction_date,...,original_paper_date,original_paper_doi,original_paper_pubmed_id,reason,paywalled,notes,year,reason_list,severity_score,subject
0,52739,Anti-breast Cancer Activity of Co(II) Complex ...,"Luohe Medical College, Luohe, Henan, China; Lu...",Journal of Cluster Science,Springer - Nature Publishing Group,China,Ting Yin;Ruirui Wang;Shaozhe Yang,,Research Article;,1/24/2024 0:00,...,2021-10-27,10.1007/s10876-021-02192-4,0.0,+Concerns/Issues About Image;+Concerns/Issues ...,No,See also: https://pubpeer.com/publications/739...,2021,"['Concerns/Issues About Image', 'Concerns/Issu...",4,(BLS) Biology - Cellular


## Function Definitions

- We will use a few functions to fetch the required information for our papers quickly and efficiently. First, note that our dataset uses DOIs as the main identifiers for each retracted paper. In order to perform an API call and retrieve all the information that OpenAlex holds for a given paper, however, we need a URL that follows the format expected by the OpenAlex API. Luckily, this URL can easily be generated for each paper from its DOI. We will do that by using the following function:

In [12]:

# Define address_builder function

def address_builder(doi):
    """Takes a DOI identifier and builds the full URL address to perform an API call
    on OpenAlex from it"""
    
    # Build url address and store it in string   
    
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    polite_address = base_address + "?mailto=" + "pabloruizdeolano@gmail.com" # Use polite address for faster API call performance
    
    # Return url address
    
    return polite_address
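
- As a quick illustration of the URL format, the sketch below builds the address for the DOI shown in the data frame preview. It is a hedged variant rather than the notebook's own function: `build_openalex_url` and the placeholder address `you@example.com` are hypothetical, and percent-encoding the DOI with the standard library's `urllib.parse.quote` is an added assumption about how DOIs with unusual characters (spaces, `#`, `?`) should be handled.

```python
from urllib.parse import quote, urlencode

OPENALEX_WORKS = "https://api.openalex.org/works/https://doi.org/"

def build_openalex_url(doi, email="you@example.com"):
    """Hypothetical variant of address_builder that percent-encodes the DOI."""
    # Keep '/' readable, but escape characters such as spaces, '#' or '?'
    encoded_doi = quote(doi.strip(), safe="/")
    # The mailto parameter asks OpenAlex to route the call through its polite pool
    query = urlencode({"mailto": email})
    return f"{OPENALEX_WORKS}{encoded_doi}?{query}"

print(build_openalex_url("10.1007/s10876-021-02192-4"))
```

For ordinary DOIs the result differs from `address_builder` only in that the `@` of the e-mail address is percent-encoded.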


- Because we will be retrieving information for many retracted papers, and because API calls are often slow, downloading all the information that we are interested in will take a considerable amount of time. It is therefore convenient to keep track of which items have already been downloaded, so that the process can be resumed from there if any interruption takes place. We can do that by first defining the following function, which collects the DOIs of the papers whose .json file has already been downloaded:

In [14]:

def create_downloaded_doi_dataframe(json_directory):
    
    # Create list with all downloaded files in directory
    
    json_files = [file for file in os.listdir(json_directory) if file.endswith('.json')]
    
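    # Create list with DOIs of all downloaded files
    # (strip the '.json' extension and turn '_' back into '/'; this assumes
    # that the original DOI contains no underscores of its own)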
    
    dois = [file[:-5].replace('_', '/') for file in json_files]
    
    # Create data frame with DOIs of all downloaded files
    
    doi_df = pd.DataFrame(dois, columns=['DOI'])
    
    # Return data frame
    
    return doi_df
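
- Mapping `_` back to `/` recovers the original DOI only when the DOI itself contains no underscore. One possible alternative, sketched here as an assumption and not as the notebook's method, is to percent-encode the DOI so that the file name can always be decoded back. `doi_to_filename` and `filename_to_doi` are hypothetical helpers; adopting them would also mean changing how `fetch_doi_fulljson` names its files.

```python
from urllib.parse import quote, unquote

def doi_to_filename(doi):
    """Encode a DOI as a file name; '/' becomes '%2F', '_' is left untouched."""
    return quote(doi, safe="") + ".json"

def filename_to_doi(filename):
    """Recover the original DOI from a file name produced by doi_to_filename."""
    return unquote(filename[:-len(".json")])

doi = "10.1000/example_with_underscore"  # made-up DOI for illustration
name = doi_to_filename(doi)
print(name)
print(filename_to_doi(name) == doi)
```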


- We will also need a function that writes the data frame with the DOIs of the papers for which information has been downloaded into a .csv file:

In [17]:

# Define log_existing_dois function

def log_existing_dois(existing_doi_df, log_directory):
    
    # Create file path for log
    
    log_file_path = os.path.join(log_directory, 'existing_doi_log.csv')
    
    # Save data frame with DOIs of downloaded papers to .csv
    
    existing_doi_df.to_csv(log_file_path, index=False)


- And finally, we will define a function that we can use to generate a data frame with the DOIs of those papers for which no information has been downloaded:

In [18]:

# Define filter_new_dois function

def filter_new_dois(input_df, existing_doi_df):
    
    # Create data frame with DOIs of papers that have not been downloaded only
    
    filtered_df = input_df[~input_df['original_paper_doi'].isin(existing_doi_df['DOI'])]
    
    # Return data frame
    
    return filtered_df
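
- To see what this filter does, here is a toy run on made-up DOIs. The sketch also drops rows with a missing DOI first, because such rows would otherwise pass the filter (a missing value is never found in the log) and then fail when the URL is built. That extra `dropna` step is a suggestion, not something the notebook does.

```python
import pandas as pd

# Made-up DOIs standing in for the real input and the already-downloaded set
input_df = pd.DataFrame({"original_paper_doi": ["10.1/a", "10.1/b", None, "10.1/c"]})
existing_doi_df = pd.DataFrame({"DOI": ["10.1/b"]})

# Suggested extra step: discard rows that have no DOI at all
candidates = input_df.dropna(subset=["original_paper_doi"])

# Same boolean-mask logic as filter_new_dois: keep rows whose DOI is NOT in the log
mask = ~candidates["original_paper_doi"].isin(existing_doi_df["DOI"])
remaining = candidates[mask]

print(remaining["original_paper_doi"].tolist())
```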

- We can now define the functions that download information for our retracted papers. We will start with the main one, which performs one API call per DOI, saves each result to a .json file, and records the outcome of each call:

In [None]:

# Define API Call Function for each DOI

def fetch_doi_fulljson(disciplines_doi_df, json_directory):
    
    # Create empty list to store the outcome of each API call
    
    log = []
    
    # For loop to perform one API call per DOI in input data frame
    
    for doi in disciplines_doi_df['original_paper_doi']:
        
        # Build url address by calling address_builder function
        
        url = address_builder(doi)
        
        # Perform API call using URL address and store result in variable
        
        response = requests.get(url)
        
        # If clause to control for case in which API call fails
        
        if response.status_code == 200:
            
            # Convert result of API call to json format
            
            data = response.json()
            
            # Create file path to create .json file with result of API call
            
            file_path = os.path.join(json_directory, doi.replace('/', '_') + '.json')
            
            # Save result of API call to .json file 

            with open(file_path, 'w') as file:
                json.dump(data, file)
            
            # Update log list with dictionary specifying success for current DOI 
            
            log.append({'DOI': doi, 'Status': 'Success'})
            
        else:
            
            # Update log list with dictionary specifying failure for current DOI 
            
            log.append({'DOI': doi, 'Status': f"Failed - {response.status_code}"})
    
    # Convert log list to data frame and return 
    
    return pd.DataFrame(log)
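
- Two things can interrupt a long run of the function above: a call without a timeout may hang indefinitely, and a network exception would end the loop before the log is returned. The sketch below shows one way to guard a single call with a timeout and a few retries. It is an illustration under stated assumptions: `fetch_json_with_retries` is a hypothetical helper, treating 429 and 5xx responses as retryable is my assumption, and it uses the standard library's `urllib` instead of `requests` only to stay dependency-free. The same pattern applies to `requests.get(url, timeout=...)`.

```python
import json
import time
import urllib.error
import urllib.request

RETRYABLE_CODES = {429, 500, 502, 503, 504}  # assumed to be transient failures

def fetch_json_with_retries(url, max_attempts=3, pause_seconds=1.0, timeout=30,
                            opener=urllib.request.urlopen):
    """Return (data, status); data is None if every attempt fails."""
    status = "Failed - no attempt made"
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return json.load(response), "Success"
        except urllib.error.HTTPError as error:
            status = f"Failed - {error.code}"
            # A 404, for example, will not improve on retry, so stop here
            if error.code not in RETRYABLE_CODES:
                break
        except (OSError, json.JSONDecodeError) as error:
            # Covers connection problems, timeouts and malformed responses
            status = f"Failed - {type(error).__name__}"
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt
            time.sleep(pause_seconds * attempt)
    return None, status
```

Inside the download loop, the returned status string could be appended to the log exactly as the `'Success'` and `'Failed - ...'` entries are today.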


- We will also use a function that writes the data frame logging the outcome of each API call into a .csv file:

In [19]:

# Define write_api_call_log

def write_api_call_log(api_log_df, log_directory):
    
    # Create path to create .csv file from directory passed as input
    
    log_file_path = os.path.join(log_directory, 'doi_calling_log.csv')
    
    # Write content of data frame into resulting path
    
    api_log_df.to_csv(log_file_path, index=False)
    

- Finally, we will create a function that calls the last two functions and thus attempts to download information for all relevant DOIs and logs the outcome of each attempt, all in one go:


In [20]:

# Define function to Run Api calls and log results

def fetch_and_log_data(filtered_doi_df, json_directory, log_directory):
    
    # Fetch data for DOIs in data frame passed as input
    
    api_log_df = fetch_doi_fulljson(filtered_doi_df, json_directory)
    
    # Write log with result of API calls
    
    write_api_call_log(api_log_df, log_directory)
    

## Output


- Having defined these functions, we can go ahead and start downloading information for our retracted papers from OpenAlex. Let us start by calling one of the functions that we defined earlier to generate a data frame with the DOIs of all the papers that have already been downloaded:

In [21]:

# Call function to generate data frame with DOIs of downloaded papers

existing_doi_df = create_downloaded_doi_dataframe(json_directory)


FileNotFoundError: [Errno 2] No such file or directory: '/Volumes/Hurricane/cellbiology_retracted_fulljsonfiles'
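
- The error above is what `os.listdir` raises when the output directory does not exist, for instance because the external drive is not mounted. One way to fail earlier and with a clearer message is sketched below. `require_directory` is a hypothetical helper, and whether a missing directory should be created or treated as an error is a judgment call exposed through the `create` parameter.

```python
import os

def require_directory(path, create=False):
    """Return path if it is an existing directory; optionally create it first."""
    if create:
        # Reasonable for local folders such as the log directory
        os.makedirs(path, exist_ok=True)
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Directory not found: {path!r}. Is the external drive mounted?"
        )
    return path
```

A possible use would be `require_directory(json_directory)` for the external drive and `require_directory(log_directory, create=True)` for the local log folder.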

- Next we will save the information in this data frame into a .csv file:

In [None]:

# Create .csv file with DOIs of papers for which information has already been downloaded

log_existing_dois(existing_doi_df, log_directory)


- Having done this, we can now create a data frame that only contains the DOIs of those retracted papers for which no information has been yet downloaded:

In [15]:

# Create data frame with DOIs of papers for which no data has been downloaded

filtered_df = filter_new_dois(discipline_df, existing_doi_df)

NameError: name 'existing_doi_df' is not defined


- We can now use the rest of the functions that we defined to start performing API calls to download information for our retracted papers from OpenAlex. Since the data sets of interest will typically be quite large, we will first do that on a smaller sample, just to make sure that everything works properly. Let us first draw a random sample of the desired size from our data frame:

In [50]:
# Define sample size

sample_size = 20

# Check if sample_size is less than the number of rows in the data frame

if sample_size <= len(discipline_df):
    
    # Create a random sample of the data frame with the defined sample size
    
    sample_df = discipline_df.sample(n=sample_size, random_state=1)  
    
else:
    
    print("Sample size is larger than the DataFrame.")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original_paper_doi  20 non-null     object
dtypes: object(1)
memory usage: 292.0+ bytes



- Let us now call our main function to download data and write the appropriate logs for our sample data frame:

In [None]:

# Call fetch_and_log_data function to download data for sample data frame

fetch_and_log_data(sample_df, json_directory, log_directory) 
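
- After the sample run, a quick sanity check of the downloaded files could look like the sketch below. `summarize_json_files` is a hypothetical helper, and the field names `title` and `abstract_inverted_index` are assumptions about the structure of an OpenAlex work record; if a field is absent the sketch simply reports `None` or `False`.

```python
import json
import os

def summarize_json_files(directory, limit=5):
    """Return (file name, title, has_abstract) for up to `limit` .json files."""
    summary = []
    names = sorted(name for name in os.listdir(directory) if name.endswith(".json"))
    for name in names[:limit]:
        with open(os.path.join(directory, name)) as file:
            record = json.load(file)
        # Assumed field names; .get() returns None when a field is missing
        has_abstract = record.get("abstract_inverted_index") is not None
        summary.append((name, record.get("title"), has_abstract))
    return summary
```

Calling `summarize_json_files(json_directory)` would then list a handful of files with their titles and whether an abstract is present, which is the piece of information Notebook 4 depends on.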


- We can now inspect the content of the .json files that were created for our sample data frame, as well as the log files that were generated in the process. If it all looks good, we can go ahead and download information for the rest of our retracted papers. To do so, we use the functions that we defined earlier, which determine which papers have already been downloaded and generate a new data frame containing only the DOIs of those papers for which an API call still remains to be performed:

In [None]:

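# Call function to refresh data frame with DOIs of papers that have already been downloaded

existing_doi_df = create_downloaded_doi_dataframe(json_directory)

# Call function to write .csv file with DOIs of papers for which we already have information

log_existing_dois(existing_doi_df, log_directory)

# Call function to generate data frame with DOIs of papers that still have to be downloaded

filtered_df = filter_new_dois(discipline_df, existing_doi_df)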


- Having determined which papers still have to be downloaded, we can go on to perform the appropriate API calls for the rest of our retracted papers. With this, we will have obtained all the information that we wanted:

In [35]:

# Fetch all json files for the main corpus of the discipline 

fetch_and_log_data(filtered_df, json_directory, log_directory) 
