# 3b. Extracting Country and Year Buckets from Retracted Papers

## Introduction

This Notebook analyzes how our retracted papers are distributed across countries and publication years. In other words, **it extracts the number of retracted papers in each year and country "bucket"**.

The Notebook takes as input the .json files for retracted papers that were downloaded in **Notebook 3**. The bucket distribution that it produces is in turn used by **Notebook 4a** to download a set of non-retracted papers with the same year and country distribution as our original retracted paper data set.

The **workflow** of the Notebook is therefore set up as follows:

- Input: **a set of .json files** with all the bibliographic information that we downloaded from OpenAlex for our retracted papers.

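- Output: **one .csv file** with one row per retracted paper, recording its country and publication year (plus abstract, n-gram, and error information), from which the number of retracted papers in each year and country bucket can be counted.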
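
To make the input concrete, the sketch below uses a made-up, heavily trimmed record that contains only fields this Notebook reads (`authorships`, `institutions`, `country_code`, `publication_year`). The helper `first_country_code` is hypothetical and not part of the Notebook; it assumes that the first non-empty institution country code found across the author list is taken as the paper's country.

```python
# Made-up, heavily trimmed record: real OpenAlex files contain many more fields.
sample_record = {
    "publication_year": 2015,
    "authorships": [
        {"institutions": []},  # author without any listed institution
        {"institutions": [{"country_code": None}, {"country_code": "CN"}]},
        {"institutions": [{"country_code": "US"}]},
    ],
}


def first_country_code(record, default="N/A"):
    """Return the first non-empty institution country code, or `default`."""
    for authorship in record.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            code = institution.get("country_code")
            if code:
                return code
    return default


print(first_country_code(sample_record))  # CN
print(first_country_code({}))             # N/A
```

Under this assumption a paper with authors in several countries is assigned to a single bucket, namely the one belonging to the first institution that reports a country code.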

## Importing Libraries

- Let us start by importing the various libraries that we will use in the Notebook:

In [1]:

import os
import json
import csv


## Input / Output Parameters

- Input parameters:

In [2]:

# File path of the directory with .json files

#input_path = '/Volumes/Hurricane/CellBiology_AllData'

input_path = '../data/json_files/cellbiology_retracted_fulljsonfiles'


- Output parameters:

In [3]:

# File path for .csv file with results

output_path = '../data/results.csv'
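
Since both paths are relative, it can help to verify them before the main loop runs. The following is a minimal sketch, assuming it is acceptable to create the output folder when it is missing; `check_paths` is a hypothetical helper, not part of the Notebook.

```python
from pathlib import Path


def check_paths(input_dir, output_file):
    """Fail early on a missing input directory; create the output folder."""
    input_dir, output_file = Path(input_dir), Path(output_file)
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    # Create the parent folder of the .csv file if it does not exist yet
    output_file.parent.mkdir(parents=True, exist_ok=True)
    return input_dir, output_file
```

It could be called as `check_paths(input_path, output_path)` right after the two parameter cells.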


## Extracting Year and Country Buckets


- We can now go through the .json files for our retracted papers and record the year and country of each one, which is the information needed to count the number of papers per year and country bucket:

In [5]:

# Create list with names of headers to be written in our output .csv file

headers = ['file_name', 'author_country', 'publication_year', 'abstract_info', 'ngram_info', 'error']

# Open .csv file to write values for each bucket

with open(output_path, mode='w', newline='') as file:
    
    # Initialize writer
    
    writer = csv.DictWriter(file, fieldnames=headers)
    
    # Write header names
    
    writer.writeheader() 

    # Create loop to iterate over all .json files in input directory
    
    for filename in os.listdir(input_path):
        
        # If clause to make sure we only loop through .json files
        
        if filename.endswith('.json'):
            
            # Construct full file path for current .json file in loop iteration
            
            file_path = os.path.join(input_path, filename)
                        
            # Initialize error message and other variables

            error_message = ""

            author_country = "N/A"

            ngram_info = False

            # Start from an empty record so that a failed read cannot reuse data from a previous file

            data = {}

            # Try to open and read .json file in current loop iteration

            try:
                with open(file_path, 'r', encoding='utf-8') as json_file:

                    # Read raw content of .json file into a string

                    content = json_file.read()

                    # If clause to account for situation in which .json file is empty

                    if not content:

                        raise ValueError("File is empty")

                    # Load content of current .json file into data variable

                    data = json.loads(content)

            # Clause to update error message if reading of .json file fails

            except Exception as e:

                error_message = str(e)

            # Extract the country code from data variable (first non-empty code found)

            if 'authorships' in data:

                for authorship in data['authorships']:

                    if 'institutions' in authorship and any(inst.get('country_code') for inst in authorship['institutions']):

                        author_country = next((inst['country_code'] for inst in authorship['institutions'] if inst.get('country_code')), "N/A")

                        break

            # Extract ngrams information from data variable (a missing or null URL counts as no ngrams)

            ngrams_url = data.get('ngrams_url') or ''

            if ngrams_url.startswith('https://api.'):

                ngram_info = True

            # Extract publication year from data variable

            publication_year = data.get('publication_year', 'N/A') if not error_message else 'N/A'

            # Extract abstract info from data variable

            abstract_info = bool(data.get('abstract_inverted_index')) if not error_message else False

            # Write data or error to .csv file (one row per .json file)

            writer.writerow({
                'file_name': filename,
                'author_country': author_country,
                'publication_year': publication_year,
                'abstract_info': abstract_info,
                'ngram_info': ngram_info,
                'error': error_message
            })

# Write confirmation message

print(f"Data processing complete and saved to .csv file in {output_path}.")


Data processing complete and saved to .csv file in ../data/results.csv.
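
Because the .csv file written above holds one row per paper, the per-bucket counts still have to be tallied from it. The following is a minimal sketch of that step, using the column names `author_country` and `publication_year` from the headers above. The helper name `count_buckets`, the made-up sample rows, and the choice to skip rows whose year or country is `N/A` are illustrative assumptions, not part of the Notebook.

```python
import csv
import io
from collections import Counter


def count_buckets(rows):
    """Count papers per (publication_year, author_country) bucket."""
    buckets = Counter()
    for row in rows:
        year = row['publication_year']
        country = row['author_country']
        # Assumption: rows lacking a year or a country belong to no bucket
        if year in ('', 'N/A') or country in ('', 'N/A'):
            continue
        buckets[(year, country)] += 1
    return buckets


# Made-up rows in the same layout as the output file (extra columns omitted)
sample_csv = io.StringIO(
    "file_name,author_country,publication_year\n"
    "a.json,US,2015\n"
    "b.json,US,2015\n"
    "c.json,CN,2016\n"
    "d.json,N/A,2016\n"
)

buckets = count_buckets(csv.DictReader(sample_csv))

for (year, country), count in sorted(buckets.items()):
    print(year, country, count)
# 2015 US 2
# 2016 CN 1
```

To run this on the real output, open the file at `output_path` and pass `csv.DictReader(file)` in place of the sample reader. Note that years read back from the .csv file are strings, so they need converting with `int` before any numeric comparison.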
