# 3b. Downloading JSON Files for Non-Retracted Papers




## Introduction



This notebook makes a series of API calls to retrieve all the **bibliographic information** that OpenAlex holds for our **non-retracted papers**. It then saves this information as **JSON files**, just as we did with the bibliographic information that we downloaded for our retracted papers.

This notebook uses the list of DOIs of non-retracted papers generated by **Notebook 3a**. The JSON files downloaded here will, in turn, be used in **Notebook 3c** to extract the abstracts of non-retracted papers.

The workflow has thus been set up as follows:

- Input: **one .csv file** with a list of DOIs of non-retracted papers.
- Output: **one .json file** for each non-retracted paper in our input file, plus a .csv log of the download process.



## Input / Output Parameters


Input parameters:

In [1]:
# File path for file with DOIs of non-retracted papers

input_path = "../data/dois_non_retracted/cell_biology/non_retracted_dois_cell_bio.csv"

Output parameters:

In [22]:
# Directory path for .json files of non-retracted papers

output_path = "../data/json_files/cell_biology/non_retracted"

# File path for log concerning download process

output_path_log = "../data/logs/cell_biology/non_retracted_json_download.csv"

## Importing Libraries

Let us start by importing the required libraries:

In [16]:
# Import required libraries

import pandas as pd
import numpy as np

import requests
import csv
import os

from json.decoder import JSONDecodeError
import json

import warnings
warnings.filterwarnings("ignore")

import function_definitions

## Loading Input Data

Next, we load the data from our input file:

In [4]:

# Load input .csv data into data frame  

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize data frame

df

Unnamed: 0,0
0,https://doi.org/10.1155/2022/7099589
1,https://doi.org/10.1093/jn/nxz155
2,https://doi.org/10.1016/j.canlet.2014.09.047
3,https://doi.org/10.5114/ada.2020.93382
4,https://doi.org/10.1038/s41419-021-03393-5
...,...
6576,https://doi.org/10.1093/mmy/myv008
6577,https://doi.org/10.1094/pdis-11-17-1704-pdn
6578,https://doi.org/10.26355/eurrev_201908_18743
6579,https://doi.org/10.1038/s41419-018-0302-x


## Data Cleaning

Before we can download the .json files for our non-retracted papers, we need to clean up the input data slightly:

In [5]:

# Remove the 'https://doi.org/' prefix from the input DOI list

df.iloc[:, 0] = df.iloc[:, 0].str.replace('https://doi.org/', '', regex=False)

# Rename the first column to 'original_paper_doi' 

df.columns = ['original_paper_doi']

# Visualize clean data frame

df

Unnamed: 0,original_paper_doi
0,10.1155/2022/7099589
1,10.1093/jn/nxz155
2,10.1016/j.canlet.2014.09.047
3,10.5114/ada.2020.93382
4,10.1038/s41419-021-03393-5
...,...
6576,10.1093/mmy/myv008
6577,10.1094/pdis-11-17-1704-pdn
6578,10.26355/eurrev_201908_18743
6579,10.1038/s41419-018-0302-x
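
The cleaning step above removes one exact prefix. If the input list ever contained other spellings of the resolver URL (for instance `http://` or `dx.doi.org`), a slightly more tolerant normaliser could be used instead. The following is only a sketch: `normalise_doi` is a hypothetical helper, not part of `function_definitions`, and lowercasing relies on DOIs being case-insensitive.

```python
import re

# Hypothetical helper (not from function_definitions): matches an optional
# scheme, an optional "dx." subdomain, and the doi.org resolver host.
_DOI_PREFIX = re.compile(r"^(?:https?://)?(?:dx\.)?doi\.org/", re.IGNORECASE)


def normalise_doi(raw: str) -> str:
    """Strip whitespace and any doi.org resolver prefix, then lowercase."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()
```

Applied to the data frame, this would read `df['original_paper_doi'].map(normalise_doi)`.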


It will also be useful to make sure that there are no null entries:

In [6]:

# Display data frame info in case there are NaNs

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6581 entries, 0 to 6580
Data columns (total 1 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original_paper_doi  6581 non-null   object
dtypes: object(1)
memory usage: 51.5+ KB


And that there are no duplicates:

In [7]:

# Obtain series with the number of occurrences of each DOI

duplicate_dois = df['original_paper_doi'].value_counts()

# Keep only the DOIs that occur more than once

duplicate_dois = duplicate_dois[duplicate_dois > 1]

# Visualize resulting series

duplicate_dois

Series([], Name: count, dtype: int64)
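
The two checks above can also be expressed as counts, which are easier to assert on in a script. This is a minimal sketch on toy data; the DOIs below are made up for illustration and do not come from our input file.

```python
import pandas as pd

# Toy data standing in for the real DOI list (illustrative values only)
toy = pd.DataFrame(
    {"original_paper_doi": ["10.1000/a", "10.1000/b", "10.1000/a", None]}
)

# Number of missing DOIs
n_missing = int(toy["original_paper_doi"].isna().sum())

# Number of rows that repeat an earlier DOI
n_duplicated = int(toy["original_paper_doi"].duplicated().sum())

# Drop missing and repeated DOIs, keeping the first occurrence of each
clean = (
    toy.dropna(subset=["original_paper_doi"])
    .drop_duplicates(subset=["original_paper_doi"])
    .reset_index(drop=True)
)
```

For our actual data frame both counts would be expected to be zero, in line with the outputs shown above.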

## Downloading JSON Files: Test Trial


We can now go ahead and start downloading all the bibliographic information that OpenAlex has for our non-retracted papers. 

Like we did earlier with our retracted papers, it will be useful to start with a test trial for a small sample from our non-retracted papers:

In [8]:

# Define sample size

sample_size = 20

# Check if sample_size is less than the number of rows in the data frame

if sample_size <= len(df):
    
    # Create a random sample of the data frame with the defined sample size
    
    df_sample = df.sample(n=sample_size, random_state=1)  
    
else:
    
    print("Sample size is larger than the DataFrame.")



We can now use the functions that we defined earlier to start obtaining the .json files for our sample set of DOIs for non-retracted papers:

In [9]:

# Call fetch_json_files function to download data for sample data frame

function_definitions.fetch_json_files(df_sample, output_path, output_path_log)
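
`fetch_json_files` lives in `function_definitions` and is not reproduced in this notebook. To give a rough idea of what a single download step might involve, here is a minimal sketch. It is not the actual implementation. It assumes that OpenAlex is queried through its `works` endpoint with the full DOI URL, uses the standard library's `urllib` rather than `requests`, and every function name, the file-naming scheme and the returned status strings are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_url(doi):
    """Build the OpenAlex lookup URL for a bare DOI such as '10.1093/jn/nxz155'."""
    return OPENALEX_WORKS + "https://doi.org/" + urllib.parse.quote(doi, safe="/")


def doi_to_filename(doi):
    """Hypothetical naming scheme: percent-encode the DOI so it is a valid file name."""
    return urllib.parse.quote(doi, safe="") + ".json"


def http_get_json(url, timeout=30):
    """Default fetcher: GET the URL and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_one(doi, out_dir, fetch=http_get_json):
    """Download one record; return 'saved', 'skipped' or 'failed' for logging."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, doi_to_filename(doi))
    if os.path.exists(target):
        # Already downloaded in an earlier run
        return "skipped"
    try:
        record = fetch(openalex_url(doi))
    except (OSError, json.JSONDecodeError):
        # Network errors (URLError is an OSError) or an unparsable response
        return "failed"
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    return "saved"
```

Passing the fetcher in as an argument keeps the function testable without network access, and skipping files that already exist makes an interrupted run safe to restart.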



We can also re-use the functions that we defined earlier to find out for which of the non-retracted papers in our list we have saved .json files with bibliographic information:

In [10]:

# Call function to generate data frame with DOIs of downloaded papers

existing_doi_df = function_definitions.downloaded_paper_list_getter(output_path, output_path_log)

# Check size of resulting data frame

existing_doi_df.shape


(20, 1)

## Downloading JSON Files: Entire Data Set


Having completed our test trial, we can create a data frame that contains only the DOIs of those non-retracted papers for which we have not yet downloaded bibliographic information, just like we did earlier for retracted papers:

In [11]:

# Create data frame with DOIs of papers for which no data has been downloaded

df_not_downloaded = function_definitions.non_downloaded_papers_selector(df, existing_doi_df)

# Obtain shape of new data frame

df_not_downloaded.shape


(6561, 1)
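
The shape above is consistent with the test trial: 6581 DOIs in total minus the 20 already downloaded leaves 6561. Conceptually, the selection is a set difference between the two DOI lists. The sketch below shows one way this could be written in pandas. It is not the code of `non_downloaded_papers_selector`; the function name is hypothetical, and it assumes both data frames store DOIs in a column called `original_paper_doi`.

```python
import pandas as pd


def select_not_downloaded(all_dois, downloaded, column="original_paper_doi"):
    """Return the rows of all_dois whose DOI does not appear in downloaded."""
    mask = ~all_dois[column].isin(downloaded[column])
    return all_dois.loc[mask].reset_index(drop=True)


# Toy data (illustrative DOIs only)
everything = pd.DataFrame(
    {"original_paper_doi": ["10.1000/a", "10.1000/b", "10.1000/c"]}
)
done = pd.DataFrame({"original_paper_doi": ["10.1000/b"]})

remaining = select_not_downloaded(everything, done)
```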


We can now repeat the process with this new data frame to download .json files for all the non-retracted papers in our data set, using once again the function that we defined in Notebook 2a:

In [23]:

function_definitions.fetch_json_files(df_not_downloaded, output_path, output_path_log)


KeyboardInterrupt: 


Finally, we can repeat the process to check that information for all non-retracted papers has been downloaded:

In [24]:

# Call function to generate data frame with DOIs of downloaded papers

existing_doi_df = function_definitions.downloaded_paper_list_getter(output_path, output_path_log)

# Check size of resulting data frame

existing_doi_df.shape


(710, 1)

We can also check whether any papers remain for which information still needs to be downloaded:

In [25]:

# Create data frame with DOIs of papers for which no data has been downloaded

df_not_downloaded = function_definitions.non_downloaded_papers_selector(df, existing_doi_df)

# Obtain shape of new data frame

df_not_downloaded.shape

(5875, 1)

If the download is still incomplete, for example because of interruptions, we can repeat the process until we have .json files for all the remaining papers:

In [19]:

# Call master function to download the remaining .json files

function_definitions.fetch_json_files(df_not_downloaded,  output_path, output_path_log)


KeyboardInterrupt: 
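
The check-then-download cycle above can be repeated by hand, as done in this notebook, or wrapped in a small loop. The sketch below is one possible way to do so and is not part of `function_definitions`: `download_until_complete` and its two callable arguments are hypothetical, standing in for the "what is still missing" check and the download call respectively.

```python
def download_until_complete(list_remaining, download_batch, max_rounds=5):
    """Repeat download rounds until nothing remains or progress stalls.

    list_remaining: callable returning the DOIs that are still missing.
    download_batch: callable that attempts to download a list of DOIs.
    Returns the DOIs that are still missing when the loop stops.
    """
    remaining = list_remaining()
    for _ in range(max_rounds):
        if not remaining:
            break
        download_batch(remaining)
        previous, remaining = remaining, list_remaining()
        if len(remaining) >= len(previous):
            # No progress in this round, so stop rather than loop pointlessly
            break
    return remaining
```

Capping the number of rounds and stopping when a round makes no progress avoids retrying indefinitely when some DOIs simply cannot be retrieved.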