# 3a. Getting DOIs of Non-Retracted Papers with Specified Country and Year Distribution


## Introduction

This Notebook **obtains DOIs of non-retracted papers**. These papers are **randomly** selected from OpenAlex and have the **same country and year distribution** as the retracted papers in our original data set. They also come from the same discipline.

The Notebook takes as input the .csv file generated by **Notebook 2c**, which contains the country and year distribution of our retracted papers.

The **workflow** of the notebook is therefore as follows:

- Input: **one .csv file** with the country and year distribution of our retracted papers.
- Output: **one .csv file** with the DOIs of non-retracted papers that we got from OpenAlex.

## Input / Output Variables

Input parameters:

In [20]:

# File path for .csv input file with bucketing specifications
input_path = "../data/Country_Year_Buckets_Cellbio.csv"

# Id of subfield to filter our paper search
# Id value for cell_bio is 1307
subfield_filter_value = "1307"

# File path for ISO country code equivalences
input_path_dictionary = "../data/country_codes_dictionary.csv"

# Upsize factor for each bucket
upsize_factor = 1.3



Output parameters:

In [21]:
#Paths for previous draft

# File name for .csv with DOIs of non-retracted papers
#output_file_name = "dois_jenny_corrected_3.csv"

# File path for .csv with unique DOIs of non-retracted papers
#output_path_unique = "../data/results/dois_jenny_unique.csv"

# File path for .jsonl file with text data for abstracts
#output_path = "../data/results/" + output_file_name

# Current path

# File path for .csv with DOIs of non-retracted papers

output_path = "../data/dois_non_retracted/non_retracted_dois_cell_bio.csv" 

# File path for log file

output_path_log = "../data/logs/non_retracted_dois_getting_log.txt" 

## Importing Libraries


As always, let's start by importing all required libraries:

In [22]:

# Import required libraries

import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import requests
from json.decoder import JSONDecodeError
import json
import matplotlib.pyplot as plt
import random
import time


## Loading Data

And by loading the relevant data from our .csv files:

In [23]:

# Load .csv with bucketing specifications into data frame
df = pd.read_csv(input_path, encoding='latin-1', sep = ";")

# Load .csv file with ISO country code equivalences
df_country_codes = pd.read_csv(input_path_dictionary, encoding='latin-1')


The data specifying our country code equivalences requires some cleaning, so we will go ahead and make the necessary adjustments:

In [24]:

# Clean spurious spaces in "Country" column
df_country_codes['Country'] = df_country_codes['Country'].str.strip()

# Clean spurious spaces in "TIS" column
df_country_codes['TIS'] = df_country_codes['TIS'].str.strip()

# Create country code dictionary from country code data frame
country_codes_dictionary = df_country_codes.set_index('Country')['TIS'].str.strip().to_dict()

# Rename country column (reading a UTF-8 file with a byte-order mark as latin-1 leaves "ï»¿" in front of the first column name)
df.rename(columns={"ï»¿country": "country"}, inplace=True)


## Function Definitions


Our goal in this Notebook will consist of obtaining a list of DOIs for non-retracted papers. So as to avoid biases in our model, we will make sure that these papers belong to the same discipline, and have the same country and year distribution as our retracted papers (recall, of course, that we obtained the country and year "buckets" for these papers in a previous Notebook). 

To do this, we will query OpenAlex for papers with these characteristics. This, in turn, is done by making an API call to a URL that incorporates the appropriate filters. The first step in our task, therefore, will be to define a function that builds an OpenAlex-compatible URL from the paper characteristics that we want to filter our search by:


In [25]:

# Define url_builder function

def url_builder(country, year, field_id, page = "1"):
    """
    Builds a URL that can be used to search papers with appropriate filters on OpenAlex
    
    Parameters: 
        country (str): country of papers to be queried
        year (str): year of papers to be queried
        field_id (str): id code of field of papers
        page (str): number of page from which papers will be queried (in case we need more papers than can fit in a single "page")
    
    Returns:
        url (str): full url to be used for OpenAlex API query
    """
    
    # Add subfield filter to our URL
    subfield_filter = "primary_topic.subfield.id:" + field_id    
        
    # Add publication year filter to our URL
    year_filter = "publication_year:" + year

    # Add retraction filter to make sure queried papers are NOT retracted
    retraction_filter = "is_retracted:false"
    
    # Add country filter to our URL
    country_code = country
    country_filter = "institutions.country_code:" + country_code
    
    # Add type of work filter
    type_filter = "type:article"
    
    # Add page number
    page_number = page
    
    # Add filters to base url
    url = "https://api.openalex.org/works?page=" + page_number + "&per-page=200&filter=" + subfield_filter + "," + year_filter + "," + country_filter + "," + type_filter + "," + retraction_filter

    # Return full URL
    return url
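
As a quick illustration of the string that the function above assembles, the sketch below builds the same kind of URL from a dictionary of filters. The helper name `build_openalex_url` and the example values (`US`, `2015`) are assumptions made for this illustration only; the base URL, the paging parameters, and the filter keys are the ones used in `url_builder`.

```python
# Illustrative sketch only: `build_openalex_url` is a hypothetical helper,
# not part of the notebook. It mirrors the string that url_builder assembles.

BASE_URL = "https://api.openalex.org/works"

def build_openalex_url(filters, page=1, per_page=200):
    """Join key:value filter pairs with commas and attach paging parameters."""
    filter_string = ",".join(f"{key}:{value}" for key, value in filters.items())
    return f"{BASE_URL}?page={page}&per-page={per_page}&filter={filter_string}"

# Example values (country and year are made up for the illustration)
example_filters = {
    "primary_topic.subfield.id": "1307",
    "publication_year": "2015",
    "institutions.country_code": "US",
    "type": "article",
    "is_retracted": "false",
}

example_url = build_openalex_url(example_filters)
print(example_url)
```

Keeping the filters in a dictionary makes it easy to see at a glance which conditions a query applies, and the filters appear in the URL in the order in which they were added.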



Our next step will consist of obtaining a specified number of DOI addresses for the appropriately filtered papers that our URLs can query from OpenAlex. We will define a new function in order to do that:


In [26]:

# Master function
# Define doi_getter function

def doi_getter(country, year, field, paper_num):
    
    """
    Extracts DOIs of papers based on specified criteria.

    Parameters:
        country (str): The country associated with the papers.
        year (str): The year of publication of the papers.
        field (str): The field or discipline of the papers.
        paper_num (int): The number of DOIs to be extracted.

    Returns:
        list of str: A list containing the extracted DOIs of the papers.
    """
    
    # Initialize list to store DOIs
    doi_lst = []  
    
    # Calculate the number of pages required to fetch the specified number of papers
    page_num = int(paper_num / 200) + 1
    
    # Perform API calls until the desired number of DOIs is collected
    for page in range(1, page_num + 1):

        # Update search URL with page number
        page_code = str(page)
        url = url_builder(country, year, field, page_code)
        
        # Perform API call and decode JSON result to obtain meta-data
        response = requests.get(url)
        meta_data = response.json()

        # Stop paging if OpenAlex returns no further results
        results = meta_data.get("results", [])
        if not results:
            break
                
        # Extract DOIs from API response and add to doi_lst
        for element in results:
            if element["doi"] is not None:
                doi_lst.append(element["doi"])
            if len(doi_lst) >= paper_num:
                return list(set(doi_lst))

    # Fewer DOIs than requested were available: return what was collected
    return list(set(doi_lst))
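
Since an individual API call can fail for transient reasons, one possible safeguard is to retry it a few times before giving up. The following is a minimal sketch of that idea and not part of the notebook: `call_with_retries` and `flaky_fetch` are hypothetical names, and the failure is simulated so that the example runs without network access.

```python
import time

def call_with_retries(fetch, max_attempts=3, wait_seconds=1.0):
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # a real caller would catch narrower exceptions
            last_error = error
            if attempt < max_attempts:
                # Wait a little longer after each failed attempt
                time.sleep(wait_seconds * attempt)
    raise last_error

# Simulated API call that fails twice and then succeeds
attempts = {"n": 0}

def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated failure")
    return {"results": []}

result = call_with_retries(flaky_fetch, max_attempts=3, wait_seconds=0.0)
print(result, attempts["n"])
```

In the context of this notebook, `fetch` could be something like `lambda: requests.get(url).json()`, although that wiring is an assumption and has not been tested here.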




Note that our API calls to OpenAlex may fail, and that other complications may arise in our attempts to get DOIs for non-retracted papers. To get around this problem, we will request more DOIs than we need from each country and year "bucket": three times as many as there are retracted papers in that bucket. This means that we will typically get more DOIs than necessary for each bucket. Whenever this happens, we will randomly select as many DOIs as we actually need from the ones that we obtained.

We will therefore need a function to pick a specified number of elements of a given list at random, which we can go on to define:

In [27]:

# Define pick_random_entries function

def pick_random_entries(input_list, number):
    """
    Picks a specified number of random elements from a list
    
    Parameters:
        input_list (list): The list with elements to be picked at random
        number (int): Number of elements to be picked at random
    
    Returns:
        (list): List with specified number of elements picked at random
    """
    
    # Generate the requested number of distinct random indices within the range of the list length
    random_indices = random.sample(range(len(input_list)), number)
    
    # Select the elements at the random indices
    random_elements = [input_list[i] for i in random_indices]
    
    return random_elements
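
For reference, `random.sample` can also draw directly from the list, and a seeded generator makes the draw repeatable. The sketch below shows that variant; `pick_random_sample`, the `seed` parameter, and the placeholder DOIs are assumptions introduced for this illustration and do not appear elsewhere in the notebook.

```python
import random

def pick_random_sample(input_list, number, seed=None):
    """Pick `number` distinct positions from input_list, optionally with a fixed seed."""
    rng = random.Random(seed)  # independent generator, so the global state is untouched
    return rng.sample(input_list, number)

# Placeholder DOIs, made up for the example
example_dois = [f"https://doi.org/10.0000/example.{i}" for i in range(10)]

picked = pick_random_sample(example_dois, 4, seed=42)
picked_again = pick_random_sample(example_dois, 4, seed=42)

print(picked)
print(picked == picked_again)
```

Both this variant and the function above raise a `ValueError` when more elements are requested than the list contains, so the caller has to check the list length first.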



Finally, since we will be obtaining DOIs for a considerable number of papers, it will be useful to have a sense of how long the download takes. To present that information in a more readable format, we will define a function that converts an amount of time expressed in seconds into the same amount expressed in hours, minutes, and seconds:

In [28]:

# Define seconds_to_hms function

def seconds_to_hms(seconds):
    """
    Convert seconds to hours, minutes, and seconds.

    Parameters:
        seconds (int): Number of seconds.

    Returns:
        tuple: A tuple containing the equivalent time in hours, minutes, and seconds.
    """
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    seconds = seconds % 60

    return hours, minutes, seconds
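
Note that when the function above receives a float (for example, a difference between two `time.time()` values), floor division also returns floats, so hours and minutes are reported as `0.0` and `17.0` rather than `0` and `17`. If whole numbers are preferred, a variant along the following lines could be used; `seconds_to_hms_int` is a hypothetical name, and rounding to the nearest second is a choice made for this sketch.

```python
def seconds_to_hms_int(seconds):
    """Convert seconds to whole hours, minutes, and seconds (rounded to the nearest second)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs

print(seconds_to_hms_int(1062.7))
print(seconds_to_hms_int(3661))
```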


## Getting DOIs of Non-Retracted Papers

We can now make use of these functions to obtain the desired number of DOIs for non-retracted papers for each year and country. We will also write a log documenting how the download process advances and how many DOIs are obtained for each year:

In [29]:

# Open log file
with open(output_path_log, "w") as f:

    # Write introductory message into log
    f.write("Log file opened.\n")
    
    # Store start time of loop execution
    start_time = time.time()
    
    # Initialize list to store dois
    final_doi_list = []

    # For loop to gather dois
    for index, row in df.iterrows():
    
        # Initialize variables with values from input bucketing specifications
        year = str(row['year'])
        country = str(row['country'])
        count = row['count']
        triple_count = 3*count

        # Call doi_getter to get three times as many DOIs as retracted papers in that bucket
        doi_list = doi_getter(country, year, subfield_filter_value, triple_count)
    
        # Write log entry with number of DOIs grabbed for this iteration of the loop
        f.write("+++++++++++++++++++++++++++++++++++++++++++\n")
        f.write(f"LOOP NUMBER {index + 1}: Year={year}, COUNTRY={country}, COUNT={count}. \n")
        f.write(f"Picked {len(doi_list)} DOIs, triple count was {triple_count}. \n")
        f.write(f"Of those, {len(set(doi_list))} were unique DOIs \n\n")

        # Obtain size of upsized sample from upsize factor defined at the beginning
        upsized_sample = int(count * upsize_factor)
    
        # Randomly get a number of dois equal to a slightly upsized sample size
        if len(doi_list) > 0 and len(doi_list) > upsized_sample:
            doi_list = pick_random_entries(doi_list, upsized_sample)
        else:
            doi_list = []
        
        # Writing log entry with number of DOIs randomly grabbed at this point
        f.write(f"I then picked {len(doi_list)} DOIs randomly. \n")
        f.write(f"Of these, {len(set(doi_list))} were unique DOIs. \n")
        f.write(f"Target count was {count}.\n")
        
        if len(doi_list) != len(set(doi_list)):
            f.write(f"ERROR: WE HAVE {len(doi_list) - len(set(doi_list))} REPEATED DOIS. \n")
        
        # Add result of current iteration to final list of dois
        final_doi_list.extend(doi_list) 

    # Calculate elapsed time for loop execution
    elapsed_time = time.time() - start_time
    elapsed_hours, elapsed_minutes, elapsed_seconds = seconds_to_hms(elapsed_time)

    print(f"Time taken for the loop: {elapsed_hours}h, {elapsed_minutes}m, {round(elapsed_seconds,1)}s. \n")

    # Write final message in log file
    f.write(f"Time taken for the loop: {elapsed_hours}h, {elapsed_minutes}m, {elapsed_seconds}s. \n")
    f.write("End of log file.\n")
                        

Time taken for the loop: 0.0h, 17.0m, 42.7s. 
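
To make the arithmetic inside the loop concrete, the sketch below works through the numbers for a single bucket. The helper `planned_sample_size` and the bucket size of 10 are made up for the illustration; the factor of three, the upsize factor of 1.3, and the keep-or-discard condition are the ones used in the loop above.

```python
upsize_factor = 1.3  # same value as in the input parameters

def planned_sample_size(count, fetched):
    """Return how many DOIs the loop keeps for a bucket, given how many were fetched."""
    upsized_sample = int(count * upsize_factor)
    # The loop only samples when strictly more DOIs were fetched than needed;
    # otherwise the bucket contributes no DOIs at all.
    if fetched > 0 and fetched > upsized_sample:
        return upsized_sample
    return 0

example_count = 10                # hypothetical number of retracted papers in a bucket
request_size = 3 * example_count  # number of DOIs requested from OpenAlex

print(request_size)
print(planned_sample_size(example_count, fetched=30))
print(planned_sample_size(example_count, fetched=13))
```

As the last call shows, a bucket for which OpenAlex returns no more DOIs than the upsized sample size is dropped entirely, which is worth keeping in mind when checking the log for small buckets.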



## Output

To conclude, let us write our list of DOIs for non-retracted papers into a .csv file:

In [33]:
# Create set with DOIs to get rid of repeated entries

final_doi_set = set(final_doi_list)

# Convert set to data frame (sorted so that the row order is reproducible)

df_unique_dois = pd.DataFrame(sorted(final_doi_set))

# Write data frame to .csv

df_unique_dois.to_csv(output_path, index=False)
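
If the order in which the DOIs were collected should be preserved, duplicates can also be removed without going through a set. The following is a minimal sketch of that alternative using only the standard library; `unique_in_order`, the placeholder DOIs, and the `doi` column header are assumptions made for this example, and the CSV is written to an in-memory buffer rather than to `output_path`.

```python
import csv
import io

def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

# Placeholder DOIs, made up for the example
example_dois = [
    "https://doi.org/10.0000/a",
    "https://doi.org/10.0000/b",
    "https://doi.org/10.0000/a",
]

unique_dois = unique_in_order(example_dois)

# Write the unique DOIs as a one-column CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["doi"])
writer.writerows([doi] for doi in unique_dois)
csv_text = buffer.getvalue()

print(csv_text)
```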

In [None]:
#!python "04c_extract_nonretract_abstract_as_text.py"