# 4a. Fetching DOIs for Non-Fraudulent Papers Using a .csv File with Bucketing Specifications


## Introduction

This Notebook **obtains the DOIs of a number of randomly selected non-retracted papers, whose year and country specifications match those of the retracted papers under investigation**. To do that, we will pick the DOIs for our non-retracted papers from a number of "buckets" that we will define for each year and country. These buckets will be built so as to contain as many non-retracted papers as there were retracted papers in our original database, for each year and country. 

The Notebook takes as input the .csv file that was generated by **script 3b**, which contains the size of the bucket associated with each year and country. 

The **workflow** of the notebook is therefore as follows:

- Input: **two .csv files**. The first one contains the bucketing specifications obtained from script 3b, whereas the second one gives us the appropriate code for each country, which we will use to add the appropriate filters to our search.

- Output: **one .csv file** with the DOIs of non-retracted papers gathered from our buckets.

## Input / Output Variables

- Input parameters:

In [1]:

# File path for .csv input file with bucketing specifications

input_path = "../data/Country_Year_Buckets_Cellbio.csv"

# Id of subfield to filter our paper search
# Id value for cell_bio is 1307

subfield_filter_value = "1307"

# File path for ISO country code equivalences

input_path_dictionary = "../data/country_code_dictionary.csv"

# Upsize factor for each bucket

upsize_factor = 1.3



- Output parameters:

In [2]:

# File name for .csv with DOIs of non-retracted papers

output_file_name = "dois_jenny_corrected_3.csv"

# File path for .csv with unique DOIs of non-retracted papers

output_path_unique = "../data/results/dois_jenny_unique.csv"

# File path for .csv with DOIs of non-retracted papers (built from output_file_name)

output_path = "../data/results/" + output_file_name


## Importing Libraries


- As always, let's start by importing all required libraries:

In [3]:

# Import required libraries

import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import requests
from json.decoder import JSONDecodeError
import json
import matplotlib.pyplot as plt
import random
import time


## Loading Data

- And by loading the relevant data from our .csv files:

In [4]:

# Load .csv with bucketing specifications into data frame

df = pd.read_csv(input_path, encoding='latin-1', sep = ";")

# Load .csv file with ISO country code equivalences

df_country_codes = pd.read_csv("../data/country_codes_dictionary.csv", encoding='latin-1')


- The data specifying our country code equivalences requires some cleaning, so we will go ahead and make the necessary adjustments:

In [5]:

# Clean spurious spaces in "Country" column

df_country_codes['Country'] = df_country_codes['Country'].str.strip()

# Clean spurious spaces in "TIS" column

df_country_codes['TIS'] = df_country_codes['TIS'].str.strip()

# Create country code dictionary from country code data frame

country_codes_dictionary = df_country_codes.set_index('Country')['TIS'].str.strip().to_dict()

# Rename country column (the UTF-8 byte order mark shows up as "ï»¿" when the file is read as latin-1)

df.rename(columns={"ï»¿country": "country"}, inplace=True)
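
The odd column name in the rename above is what a UTF-8 byte order mark looks like when the bytes are decoded as latin-1. The following minimal sketch reproduces the effect on an in-memory header line. The header `country;year;count` is an assumption based on the columns and separator used in this notebook, not a copy of the real file.

```python
# Hypothetical header line, encoded as UTF-8 with a byte order mark (BOM)
raw_header = "country;year;count\n".encode("utf-8-sig")

# Decoding as latin-1 turns the three BOM bytes into visible characters
header_latin1 = raw_header.decode("latin-1").strip().split(";")

# Decoding as utf-8-sig drops the BOM instead
header_utf8_sig = raw_header.decode("utf-8-sig").strip().split(";")

print(header_latin1[0])    # first column name, with the BOM residue in front
print(header_utf8_sig[0])  # first column name, without it
```

If the input file is indeed UTF-8 with a BOM, reading it with `encoding='utf-8-sig'` would probably make the rename unnecessary. That has not been checked against the real data here.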


## Function Definitions

In [6]:
def seconds_to_hms(seconds):
    """
    Convert seconds to hours, minutes, and seconds.

    Parameters:
    seconds (int or float): Number of seconds.

    Returns:
    tuple: A tuple containing the equivalent time in hours, minutes, and seconds.
    """
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    seconds = seconds % 60

    return hours, minutes, seconds
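
Since elapsed times computed with `time.time()` are floats, the function above returns hours and minutes as floats too (for example `0.0` hours). As a minimal sketch of an alternative, the hypothetical helper `seconds_to_hms_int` below uses `divmod` and converts hours and minutes to integers. It is not part of the original notebook.

```python
def seconds_to_hms_int(seconds):
    """Split a duration into whole hours, whole minutes, and leftover seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    # Hours and minutes are whole numbers, so converting them to int loses nothing
    return int(hours), int(minutes), secs


hours, minutes, secs = seconds_to_hms_int(1050.5)
print(f"{hours}h, {minutes}m, {round(secs, 1)}s")
```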


In [7]:

def url_builder(country, year, field_id, page = "1"):
    """This function builds the URL that OpenAlex requires to find papers with the characteristics
    specified in the filters below."""
    
    # Add subfield filter to our URL
    
    subfield_filter = "primary_topic.subfield.id:" + field_id
    
    # Use the country code passed in by the caller
    
    country_code = country
        
    # Add publication year filter to our URL
    
    year_filter = "publication_year:" + year

    # Add retraction filter

    retraction_filter = "is_retracted:false"
    
    # Add country filter to our URL
    
    country_filter = "institutions.country_code:" + country_code
    
    # Add type of work filter
    
    type_filter = "type:article"
    
    # Add page number
    
    page_number = page
    
    # Add filters to base url
    
    url = "https://api.openalex.org/works?page=" + page_number + "&per-page=200&filter=" + subfield_filter + "," + year_filter + "," + country_filter + "," + type_filter + "," + retraction_filter

    # Return full URL
    
    return url
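
To make the shape of the resulting URL concrete, here is a minimal sketch that assembles the same comma-separated filter string from a dictionary. The helper name `build_filter_url`, the year `2015` and the country code `ES` are illustrative assumptions. The filter keys and the base URL are the ones used in `url_builder` above.

```python
def build_filter_url(filters, page=1, per_page=200):
    """Join key:value filters with commas, in the order given, as url_builder does."""
    filter_string = ",".join(f"{key}:{value}" for key, value in filters.items())
    return (
        "https://api.openalex.org/works?page=" + str(page)
        + "&per-page=" + str(per_page)
        + "&filter=" + filter_string
    )


# Hypothetical bucket: cell biology articles from Spain published in 2015
example_url = build_filter_url({
    "primary_topic.subfield.id": "1307",
    "publication_year": "2015",
    "institutions.country_code": "ES",
    "type": "article",
    "is_retracted": "false",
})
print(example_url)
```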



In [16]:

def doi_getter_pagination(country, year, field, paper_num):
    
    """Function builds the URL to perform an API search in OpenAlex with a number of 
    search filters. It then extracts the DOIs of the specified number of papers, then 
    returns a list with the unique DOIs in question. If fewer DOIs than requested are
    found, it returns the ones that were collected."""

    doi_lst = []  # Initialize list to store DOIs
    
    # Calculate required page number
    
    page_num = int(paper_num / 200) + 1
    
    # Perform API calls until the desired number of DOIs is collected
    
    for page in range(1, page_num + 1):

        # Update search URL with page number
        
        page_code = str(page)
        url = url_builder(country, year, field, page_code)
        
        # Perform API call and decode JSON result to obtain meta-data
        
        response = requests.get(url)
        meta_data = response.json()
        
        # Extract DOIs from API response and add to doi_lst
        
        for element in meta_data.get("results", []):
            if element["doi"] is not None:
                doi_lst.append(element["doi"])
            if len(doi_lst) >= paper_num:
                return list(set(doi_lst))

    # Return whatever was collected if the target was not reached

    return list(set(doi_lst))
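
The page count `int(paper_num / 200) + 1` allows one page more than strictly needed whenever `paper_num` is an exact multiple of 200. That extra page can act as a buffer for records without a DOI, so it is not necessarily a problem. The sketch below only compares it with a ceiling-based count. The helper `pages_needed` is hypothetical and not part of the notebook.

```python
import math


def pages_needed(paper_num, per_page=200):
    """Smallest number of pages that can hold paper_num results."""
    return math.ceil(paper_num / per_page)


# Compare the notebook's formula with the ceiling-based count
for paper_num in (150, 200, 201, 400):
    notebook_pages = int(paper_num / 200) + 1
    print(paper_num, notebook_pages, pages_needed(paper_num))
```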



In [8]:

def doi_from_address(address_lst):
    """Function takes the whole URL address associated to the DOI of a given paper, 
    then subtracts the first part of the address to obtain its DOI code only. It
    takes as input a list of DOI URLs, returns a list of DOI codes"""

    doi_lst = []
    
    for element in address_lst:
        doi_lst.append(element.removeprefix("https://doi.org/"))
        
    return doi_lst
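
Note that `str.removeprefix` is only available from Python 3.9 onwards. For older interpreters, a minimal equivalent could look like the sketch below. The helper name `strip_doi_prefix` and the DOI used in the example are made up for illustration.

```python
DOI_PREFIX = "https://doi.org/"


def strip_doi_prefix(address):
    """Return the bare DOI, leaving strings without the prefix unchanged."""
    if address.startswith(DOI_PREFIX):
        return address[len(DOI_PREFIX):]
    return address


# Hypothetical DOI address, used only to show the behaviour
print(strip_doi_prefix("https://doi.org/10.1000/example.123"))
```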


In [9]:

def address_builder(doi):
    """Takes a DOI identifier and builds full URL address from it, with format
    required for a normal OpenAlex API call"""
    
    # Store url addresses in string
    
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    polite_address = base_address + "?mailto=" + "pabloruizdeolano@gmail.com" # Use polite address for faster API call performance
    
    # Return polite address
    
    return polite_address
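
As a variation on the function above, the sketch below takes the contact address as a parameter and lets `urllib.parse.urlencode` build the query string, which percent-encodes the `@` sign. The helper name `build_polite_address`, the DOI and the e-mail address are placeholders. Whether the encoded form behaves identically on the OpenAlex side is an assumption that has not been verified here.

```python
from urllib.parse import urlencode


def build_polite_address(doi, email):
    """Build the works URL for a DOI and append a mailto query parameter."""
    base_address = "https://api.openalex.org/works/https://doi.org/" + doi
    return base_address + "?" + urlencode({"mailto": email})


# Placeholder DOI and e-mail address
print(build_polite_address("10.1000/example.123", "user@example.org"))
```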


In [10]:

def meta_data_extractor(doi, n_grams="False"):
    """Function takes a DOI and calls address_builder function to build full URL. 
    It then performs an API call, and it returns the metadata as JSON dictionary.
    If n_grams is set to "True" it grabs the n_grams for the paper, if those 
    are available. Otherwise it grabs its abstract word index."""
    
    # Build correct address depending on value of "n_grams" parameter
    # (address_builder_ngrams is not defined in this notebook)
    
    if n_grams == "False":
        url = address_builder(doi)
    else:
        url = address_builder_ngrams(doi)
    
    # Perform API call and store result in response variable
    
    response = requests.get(url)
    
    # Convert meta-data to json format if possible and store result in variable
    
    try:
        meta_data = response.json()
    except JSONDecodeError: # This is in case result of API call was not in JSON format
        print("Error: Unable to decode JSON response")
        meta_data = None
    
    return meta_data
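
The decoding fallback above can be tried without touching the network. In the sketch below, `FakeResponse` is a hypothetical stand-in that only mimics the `.json()` method of a response object, and `decode_or_none` is an illustrative helper rather than a function from the notebook.

```python
import json


class FakeResponse:
    """Minimal stand-in for an HTTP response, used only for this illustration."""

    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


def decode_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(decode_or_none(FakeResponse('{"doi": "https://doi.org/10.1000/example.123"}')))
print(decode_or_none(FakeResponse("<html>Not found</html>")))
```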


In [11]:

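def pick_random_entries(entries, n):
    """Function takes a list (or any other sequence) and returns a list with n of
    its entries, picked at random without repetition of positions."""
    
    # Generate n random indices within the range of the sequence length
    
    random_indices = random.sample(range(len(entries)), n)
    
    # Select the entries at the random indices
    
    random_entries = [entries[i] for i in random_indices]
    
    return random_entries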
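
`random.sample` raises a `ValueError` when asked for more items than the sequence holds, and its result changes from run to run. The sketch below shows one possible variant that caps the sample size and accepts an optional seed for reproducibility. The name `sample_entries` and the `seed` parameter are assumptions added for illustration.

```python
import random


def sample_entries(entries, n, seed=None):
    """Sample up to n entries without replacement, optionally with a fixed seed."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    entries = list(entries)
    return rng.sample(entries, min(n, len(entries)))


picked = sample_entries(["a", "b", "c", "d", "e"], 3, seed=42)
print(picked)
```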

## Getting DOIs

In [21]:

# Open log file

with open("../data/logs/" + output_file_name + "logfile.txt", "w") as f:

    f.write("Log file opened.\n")
    
    # Store start time of loop execution

    start_time = time.time()
    
    # Initialize list to store dois

    final_doi_list = []

    # For loop to gather dois

    for index, row in df.iterrows():
    
        # Initialize variables
    
        year = str(row['year'])
    
        country = str(row['country'])
    
        count = row['count']
    
        triple_count = 3*count

        # Call doi_getter_pagination to get up to three times as many filtered DOI URLs as our sample size
            
        doi_list = doi_getter_pagination(country, year, subfield_filter_value, triple_count) or []
    
        # Write log entry with number of DOIs grabbed for this iteration of the loop
    
        f.write("+++++++++++++++++++++++++++++++++++++++++++\n")
        f.write(f"LOOP NUMBER {index + 1}: Year={year}, COUNTRY={country}, COUNT={count}. \n")
        f.write(f"Picked {len(doi_list)} DOIs, triple count was {triple_count}. \n")
        f.write(f"Of those, {len(set(doi_list))} were unique DOIs \n\n")

        # Obtain size of upsized sample from upsize factor defined at the beginning
    
        upsized_sample = int(count * upsize_factor)
    
        # Randomly get a number of dois equal to a slightly upsized sample size
    
        if len(doi_list) > 0 and len(doi_list) > upsized_sample:
            doi_list = pick_random_entries(doi_list, upsized_sample)
        else:
            doi_list = []
        
        # Writing log entry with number of DOIs randomly grabbed at this point
    
        f.write(f"I then picked {len(doi_list)} DOIs randomly. \n")
        f.write(f"Of these, {len(set(doi_list))} were unique DOIs. \n")
        f.write(f"Target count was {count}.\n")
        
        if len(doi_list) != len(set(doi_list)):
            f.write(f"ERROR: WE HAVE {len(doi_list) - len(set(doi_list))} REPEATED DOIS. \n")
        
        # Add result of current iteration to final list of dois
    
        final_doi_list.extend(doi_list) 

    # Calculate elapsed time for loop execution

    elapsed_time = time.time() - start_time
    elapsed_hours, elapsed_minutes, elapsed_seconds = seconds_to_hms(elapsed_time)

    print(f"Time taken for the loop: {elapsed_hours}h, {elapsed_minutes}m, {round(elapsed_seconds,1)}s. \n")

    # Write final message in log file
    
    f.write(f"Time taken for the loop: {elapsed_hours}h, {elapsed_minutes}m, {elapsed_seconds}s. \n")
    f.write("End of log file.\n")
                        

Time taken for the loop: 0.0h, 17.0, 30.622215032577515s. 
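
The per-bucket rule used in the loop (fetch a pool of candidates, then keep a random sample of `int(count * upsize_factor)` of them, or nothing if the pool is not larger than that) can be sketched on an in-memory list. The function `sample_bucket`, its `seed` parameter and the made-up DOIs below are assumptions for illustration. No API call is involved.

```python
import random


def sample_bucket(candidates, count, upsize_factor=1.3, seed=None):
    """Mirror the per-bucket sampling rule of the loop above on a plain list."""
    upsized_sample = int(count * upsize_factor)
    # Drop repeated candidates while keeping their original order
    unique_candidates = list(dict.fromkeys(candidates))
    if len(unique_candidates) > upsized_sample:
        return random.Random(seed).sample(unique_candidates, upsized_sample)
    # As in the loop, the bucket is skipped when the pool is not large enough
    return []


# Hypothetical pool of 30 candidate DOI addresses
pool = [f"https://doi.org/10.1000/example.{i}" for i in range(30)]

print(len(sample_bucket(pool, count=10)))      # large pool: upsized sample is drawn
print(len(sample_bucket(pool[:5], count=10)))  # small pool: bucket is skipped
```

With this rule, a bucket whose pool is too small contributes no DOIs at all rather than a partial sample.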



## Output

In [24]:

# Convert list to data frame

df_dois = pd.DataFrame(final_doi_list)

# Write data frame to .csv

df_dois.to_csv(output_path, index=False)
    

In [25]:
# Create set with DOIs to get rid of repeated entries

final_doi_set = set(final_doi_list)

# Convert set to data frame

df_unique_dois = pd.DataFrame(final_doi_set)

# Write data frame to .csv

df_unique_dois.to_csv(output_path_unique, index=False)
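
A `set` does not keep the order in which DOIs were collected, so the row order of the unique file can differ from that of the full list. If the original order matters, one option is to deduplicate with `dict.fromkeys`, as in the minimal sketch below. The helper name `unique_in_order` and the example DOIs are made up for illustration.

```python
def unique_in_order(dois):
    """Drop repeated DOIs while keeping the first occurrence of each."""
    return list(dict.fromkeys(dois))


# Hypothetical list with one repeated entry
example_dois = ["10.1000/a", "10.1000/b", "10.1000/a", "10.1000/c"]
print(unique_in_order(example_dois))
```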

In [14]:
!python "04c_extract_nonretract_abstract_as_text.py"

Processed 200 files
Processed 400 files
Processed 600 files
Processed 800 files
Processed 1000 files
Processed 1400 files
Processed 1600 files
Processed 2200 files
Processed 2400 files
Processed 2800 files
Processed 3200 files
Processed 3400 files
Processed 3600 files
Processed 3800 files
Processed 4000 files
Processed 4200 files
Processed 4400 files
Processed 4600 files
Processed 4800 files
Processed 5000 files
Processed 5200 files
Number of rows to be written to CSV: 5063
Processing complete.
