# 2b. Extracting Country and Year Distribution of Retracted Papers

## Introduction

This Notebook **extracts the year and country distribution** of the set of retracted papers under investigation.

The Notebook takes as input the JSON files with all the bibliographic information that OpenAlex had for our retracted papers, which we downloaded in **Notebook 2a**. The bucket distribution that it produces is in turn used by **Notebook 3a** to download a set of non-retracted papers with the same year and country distribution as our original retracted paper data set. (Notebook 2c extracts abstract information for our retracted papers. The "2" in the label of these Notebooks indicates that they perform data extraction and pre-processing for retracted articles.)

The **workflow** of the Notebook is therefore set up as follows:

- Input: **a set of JSON files** with all the bibliographic information that we downloaded from OpenAlex for our retracted papers.
- Output: **two .csv files**, one with the country and year extracted for each retracted paper, and one with the number of retracted papers for each year and country bucket.
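
To make the workflow concrete, the following minimal sketch applies the same idea to a single toy record. The record only mimics the OpenAlex fields used later in this Notebook, its values are invented, and `extract_bucket` is a hypothetical helper, not a function defined elsewhere in this project:

```python
# Toy record with the subset of fields this Notebook reads.
# The values are invented for illustration only.
toy_record = {
    "publication_year": 2015,
    "authorships": [
        {"institutions": []},                        # author without affiliation
        {"institutions": [{"country_code": "CN"}]},  # first usable country code
        {"institutions": [{"country_code": "US"}]},
    ],
}


def extract_bucket(record):
    """Return (country, year) for one record (hypothetical helper)."""
    country = "N/A"
    for authorship in record.get("authorships") or []:
        for inst in authorship.get("institutions") or []:
            # Keep the first non-empty country code we encounter
            if inst.get("country_code"):
                country = inst["country_code"]
                break
        if country != "N/A":
            break
    return country, record.get("publication_year", "N/A")


print(extract_bucket(toy_record))
```

Under these assumptions, each paper is assigned to the bucket of the first author for whom a country code is available, and to `"N/A"` when no code is found.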

## Importing Libraries

Let us start by importing the various libraries that we will use in the Notebook:

In [4]:

import pandas as pd
import os
import json
import csv


## Input / Output Parameters

Input parameters:

In [2]:

# File path of the directory with .json files

input_path = '../data/json_files/cell_biology/retracted'


Output parameters:

In [3]:

# File path for .csv file with results

output_path = '../data/buckets/cell_biology/cell_bio_buckets.csv'

# File path for .csv file with value counts

output_path_value_counts = '../data/buckets/cell_biology/cell_bio_buckets_value.csv'
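
Writing to these paths fails if the output directory does not exist yet. A minimal sketch of one way to guard against this is shown below; `ensure_parent_dir` is a hypothetical helper, and the demonstration uses a temporary directory rather than the project paths above:

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing (hypothetical helper)."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


# Demonstration in a temporary directory so that nothing in the project is touched
demo_root = tempfile.mkdtemp()
demo_output_path = os.path.join(demo_root, "buckets", "cell_biology", "cell_bio_buckets.csv")
created_dir = ensure_parent_dir(demo_output_path)
print(os.path.isdir(created_dir))
```

In this Notebook, the same helper could be called on `output_path` and `output_path_value_counts` before any file is written.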


## Output: Extracting Year and Country Buckets


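We can now go ahead and analyze the .json files for our retracted papers and find the number of papers per year and country bucket. For each file, we record the country code of the first author for whom one is available, the publication year, whether abstract and n-gram information exist, and any error raised while reading the file: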

In [5]:

# Create list with name of headers to be written in output .csv file
headers = ['file_name', 'author_country', 'publication_year', 'abstract_info', 'ngram_info', 'error']

# Initialize paper count and data list for DataFrame
paper_count = 0
data_list = []

# Open .csv file to write values for each bucket
with open(output_path, mode='w', newline='') as file:
    # Initialize writer
    writer = csv.DictWriter(file, fieldnames=headers)
    
    # Write header names
    writer.writeheader()

    # Create loop to iterate over all .json files in input directory
    for filename in os.listdir(input_path):
        # If clause to make sure we only loop through .json files
        if filename.endswith('.json'):
            # Update paper count (only .json files are counted)
            paper_count += 1

            # Construct full file path for current .json file in loop iteration
            file_path = os.path.join(input_path, filename)

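            # Initialize error message and other variables; data is reset for every
            # file so that a failed read cannot reuse the record of the previous file
            error_message = ""
            author_country = "N/A"
            ngram_info = False
            data = {}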

            # Try to open and read .json file in current loop iteration
            try:
                with open(file_path, 'r', encoding='utf-8') as json_file:
                    # Read .json file
                    content = json_file.read()

                    # If clause to account for situation in which .json file is empty
                    if not content:
                        raise ValueError("File is empty")

                    # Load content of current .json file into data variable
                    data = json.loads(content)

            # Clause to update error message if reading of .json file fails
            except Exception as e:
                error_message = str(e)

            # Extract the country code from data variable (first author with a non-empty code)
            if 'authorships' in data:
                for authorship in data['authorships']:
                    if 'institutions' in authorship and any(inst.get('country_code') for inst in authorship['institutions']):
                        # Skip institutions whose country_code is missing or null
                        author_country = next((inst['country_code'] for inst in authorship['institutions'] if inst.get('country_code')), "N/A")
                        break

            # Extract ngrams information from data variable ('or' guards against a null value)
            ngrams_url = data.get('ngrams_url') or ''
            if ngrams_url.startswith('https://api.'):
                ngram_info = True

            # Extract publication year from data variable
            publication_year = data.get('publication_year', 'N/A') if not error_message else 'N/A'

            # Extract abstract info from data variable
            abstract_info = bool(data.get('abstract_inverted_index')) if not error_message else False

            # Create row of data
            row = {
                'file_name': filename,
                'author_country': author_country,
                'publication_year': publication_year,
                'abstract_info': abstract_info,
                'ngram_info': ngram_info,
                'error': error_message
            }

            # Write data or error to .csv file
            writer.writerow(row)

            # Append data to list for DataFrame
            data_list.append(row)

# Convert data list to DataFrame
df = pd.DataFrame(data_list)

# Save DataFrame to the same CSV file (this overwrites the file written
# row by row above with the same rows)
df.to_csv(output_path, index=False)

# Write confirmation message
print(f"Country and year distribution of {paper_count} papers complete, result saved to {output_path}.")


Country and year distribution of 9089 papers complete, result saved to ../data/buckets/cell_biology/cell_bio_buckets.csv.
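
Before using the result, it can be worth checking how many rows carry an error or lack a country code, since such rows end up in `N/A` buckets. The sketch below does this on a small invented stand-in for the bucket file, using only the column names written above; it is an illustration, not part of the original pipeline:

```python
import csv
import io

# Toy stand-in for the bucket file; the rows are invented for illustration
toy_csv = io.StringIO(
    "file_name,author_country,publication_year,abstract_info,ngram_info,error\n"
    "a.json,US,2012,True,False,\n"
    "b.json,N/A,2015,False,False,\n"
    "c.json,N/A,N/A,False,False,File is empty\n"
)

rows = list(csv.DictReader(toy_csv))

# Rows whose .json file could not be read at all
failed = [row["file_name"] for row in rows if row["error"]]

# Rows that were read but carry no usable country code
no_country = [
    row["file_name"]
    for row in rows
    if not row["error"] and row["author_country"] == "N/A"
]

print(f"{len(failed)} unreadable file(s): {failed}")
print(f"{len(no_country)} record(s) without country: {no_country}")
```

To run the same check on the real data, `toy_csv` could be replaced by the file opened at `output_path`.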


Let us also create another data frame with the value counts per country and year, and save the results into a .csv file for later use:

In [6]:

# Create a DataFrame with value counts per country and year

value_counts_df = df.groupby(['author_country', 'publication_year']).size().reset_index(name='count')

# Save the value counts DataFrame to a CSV file

value_counts_df.to_csv(output_path_value_counts, index=False)

# Write confirmation message

print(f"Value counts per country and year saved to {output_path_value_counts}.")


Value counts per country and year saved to ../data/buckets/cell_biology/cell_bio_buckets_value.csv.
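
The value counts are the buckets that a later step can match against. How Notebook 3a consumes them is not shown here, so the following is only a sketch under the assumption that buckets with a missing country or year are of no use for matching. It reproduces the counting in plain Python on invented rows:

```python
from collections import Counter

# Invented (author_country, publication_year) pairs standing in for the real rows
toy_rows = [("US", 2012), ("US", 2012), ("CN", 2015), ("N/A", 2015), ("CN", "N/A")]

# Count papers per (country, year) bucket, analogous to groupby(...).size()
bucket_counts = Counter(toy_rows)

# Assumption: buckets with a missing country or year cannot be matched and are dropped
usable_buckets = {
    bucket: count
    for bucket, count in bucket_counts.items()
    if "N/A" not in bucket
}

for (country, year), count in sorted(usable_buckets.items()):
    print(country, year, count)
```

Whether `N/A` buckets should be dropped or handled separately is a design choice that depends on how the non-retracted comparison set is sampled.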
