# 2c. Extract Abstracts for Retracted Papers

## Introduction

This Notebook **extracts abstracts** from the JSON files that we downloaded from OpenAlex **for our retracted papers**. It does so by first extracting the inverted word index of each abstract and then reconstructing readable abstract text from it.
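
As a minimal sketch of the reconstruction step, assume the inverted index maps each word to the list of positions at which it occurs, which is the layout the extraction code in this Notebook expects. The function name reconstruct_abstract and the sample index are hypothetical and serve only as illustration.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a word -> positions mapping (illustrative helper)."""
    # Flatten the mapping into (position, word) pairs
    pairs = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting the pairs restores the original word order
    pairs.sort()
    return ' '.join(word for _, word in pairs)


# Hypothetical toy index: "papers" occurs at positions 1 and 4
example_index = {
    'Retracted': [0],
    'papers': [1, 4],
    'cite': [2],
    'other': [3],
}

example_text = reconstruct_abstract(example_index)
print(example_text)
```

A word that occurs several times appears once as a key with several positions, so it is emitted once per position after sorting.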

The Notebook thus takes as input the JSON files that were downloaded in **Notebook 2a**. The reconstructed abstracts generated here will in turn be used as input in **Notebook 5** to build the corpus of text from both retracted and non-retracted papers that we will use to train our model (Notebooks in the "3" series extract and pre-process text for non-retracted papers, hence the numbering convention).

The **workflow** of the Notebook is as follows:

- Input: one **JSON file** per retracted paper under investigation.
- Output: **one .csv file** with the reconstructed abstracts, along with the target label, DOI, country, and year of each paper and a flag marking abstracts that mention a retraction or withdrawal, and **one .csv file** with log information about the extraction and reconstruction process.

## Input / Output Parameters

Input parameters:

In [1]:

# Input path

input_path = '../data/json_files/cell_biology/retracted'


Output parameters:

In [2]:

# Output path for abstracts

output_path_abstracts = '../data/abstracts/cell_biology/retracted/retracted_cell_biology_abstracts.csv'

# Output path for logs

output_path_logs = '../data/logs/retracted_abstract_extraction_log.csv'


## Importing Libraries

In [3]:
import os
import json
import pandas as pd
import csv
import string

## Extracting Abstracts from JSON Files

We will use one master function to extract and store the abstracts from the .json files that we downloaded from OpenAlex for retracted papers. Let us define this function now; we will reuse it in a later notebook to extract abstracts from non-retracted papers:

In [4]:

# Define abstract_getter function

def abstract_getter(input_path, target_val):
    
    # Initialize variables
    data = []
    log_entries = []
    
    # Obtain file names of .json files in input directory
    files = os.listdir(input_path)
    
    # For loop to iterate over all .json files in input directory
    for i, filename in enumerate(files):
    
        # If statement to make sure we only process .json files
        if filename.endswith('.json'):
        
            # Construct full path for current .json file in loop
            file_path = os.path.join(input_path, filename)
        
            # Try statement to account for errors reading current .json file
            try:
            
                # Open and read current .json file
                with open(file_path, 'r', encoding='utf-8') as file:
                
                    # Write content of current .json file into content variable
                    content = json.load(file)
                
                    # Extract DOI of article associated to current .json file
                    doi = filename.replace('.json', '')

                    # Get inverted index from content variable 
                    abstract_inverted_index = content.get('abstract_inverted_index', {})
                
                    # If statement to make sure we only process non-empty word indices
                    if abstract_inverted_index:
                    
                        # Initialize list of tuples with each word and the position that it occupies in the abstract text
                        index_word_pairs = [(index, word) for word, indices in abstract_inverted_index.items() for index in indices]
                    
                        # Sort list according to the position of each word in the abstract text
                        index_word_pairs.sort()
                    
                        # Reconstruct abstract by adding words in list of tuples in the order of their occurrence in the text
                        # Words are joined with single spaces; strip() removes any leading or trailing whitespace
                        abstract_text = ' '.join(word for _, word in index_word_pairs).strip()
                   
                        # Create string with delimiter characters to be removed from reconstructed text
                        delimiters = ",;|{}\n\r\t[]<>"
                    
                        # For loop to iterate over all delimiters in delimiter string
                        for delimiter in delimiters:
                        
                            # Replace current delimiter in loop with blank space
                            abstract_text = abstract_text.replace(delimiter, ' ')
                        
                        # Call function to remove non printable characters from reconstructed text 
                        abstract_text = remove_non_printable(abstract_text)
                    
                    # Else branch for the case in which the inverted index is missing or empty
                    else:
                        log_entries.append({'filename': filename, 'success': False, 'message': 'No abstract_inverted_index provided'})
                        continue
                    
                    # Initialize author_country string
                    author_country = 'Unknown'  # Default value
                
                    # If clause to check if authorship information is present in content variable
                    if 'authorships' in content:
                    
                        # For loop to iterate over authorships list
                        for authorship in content['authorships']:
                        
                            # Get first non-empty country code among the institutions of the current author
                            # (skips institutions whose country_code is missing or None)
                            country_code = next((inst.get('country_code') for inst in (authorship.get('institutions') or []) if inst.get('country_code')), None)
                            
                            # Stop at the first author for which a country code is available
                            if country_code:
                                author_country = country_code
                                break

                    # Extract publication year from content variable
                    year = content.get('publication_year', 'Unknown')
                
                    # Flag abstracts that mention a retraction or withdrawal
                    # Substring match: "retract" also covers "retracted", "retraction", etc., and "withdraw" covers "withdrawal", "withdrawn"
                    retracted_flag = any(term in abstract_text.lower() for term in ["retract", "withdraw", "withdrew"])

                    # Update data variable with reconstructed text and additional information
                    data.append({
                        'abstract_text': f'"{abstract_text}"',  # Ensure the text is surrounded by double quotes
                        'target': target_val,
                        'doi': doi,
                        'country': author_country,
                        'year': year,
                        'ret_flag': retracted_flag
                    })

                    # Update log variable with success message
                    log_entries.append({'filename': filename, 'success': True, 'message': 'Processed successfully'})
        
        
            # Clause to account for errors in reading the current .json file
            except Exception as e:
            
                # Update log variable with current error message
                log_entries.append({'filename': filename, 'success': False, 'message': f'Error processing file - {str(e)}'})

    return data, log_entries
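
The country lookup inside the function can be easier to follow in isolation. The sketch below is a hypothetical standalone helper (first_country_code is not part of the Notebook) that assumes each record carries an authorships list whose entries hold an institutions list with a country_code field, as the function above expects. The sample record is made up for illustration.

```python
def first_country_code(record, default='Unknown'):
    """Return the first non-empty institution country code found in a record."""
    # Walk through authors in order, then through each author's institutions
    for authorship in record.get('authorships') or []:
        for institution in authorship.get('institutions') or []:
            code = institution.get('country_code')
            # Skip institutions whose country code is missing or None
            if code:
                return code
    # No author had an institution with a usable country code
    return default


# Hypothetical record: the first author has no institutions and the
# second author's first institution lacks a country code
sample_record = {
    'authorships': [
        {'institutions': []},
        {'institutions': [{'country_code': None}, {'country_code': 'DE'}]},
        {'institutions': [{'country_code': 'US'}]},
    ]
}

sample_country = first_country_code(sample_record)
print(sample_country)
```

Records without any usable country code fall back to the default value, which matches the 'Unknown' placeholder used in the function above.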


The function relies on a small helper function, remove_non_printable, to remove non-printable characters from the abstract text. We define it as follows (the cell must be run before the master function is called):

In [5]:

# Define function to remove non-printable characters

def remove_non_printable(text):
    printable = set(string.printable)
    return ''.join(filter(lambda x: x in printable, text))
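
One consequence worth keeping in mind is that string.printable contains ASCII characters only, so this filter also drops accented letters and Greek symbols, not just control characters. The short sketch below repeats the helper so that it runs on its own; the sample string is made up for illustration.

```python
import string


def remove_non_printable(text):
    # Same logic as the helper above, repeated so this snippet is self-contained
    printable = set(string.printable)
    return ''.join(character for character in text if character in printable)


# Hypothetical abstract fragment with a Greek letter, an accented letter,
# and a trailing NUL control character
sample_fragment = 'TGF-β signalling in naïve T cells\x00'

cleaned_fragment = remove_non_printable(sample_fragment)
print(cleaned_fragment)
```

If non-ASCII symbols matter for later modelling steps, a different filter would be needed; the Notebook as written discards them.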


We can now go on to call our master function to extract the abstract information that we are interested in:

In [6]:

# Extract abstracts and log entries for retracted papers (target value 1)

data, log_entries = abstract_getter(input_path, 1)


## Output

To conclude, we save the abstract text that we extracted from our .json files into a .csv file:

In [7]:

# Create data frame with content of data list

df = pd.DataFrame(data)

# Save content of data frame to .csv, using the pipe symbol as delimiter

df.to_csv(output_path_abstracts, sep='|', index=False, quoting=csv.QUOTE_MINIMAL)
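
Because each abstract is wrapped in literal double quotes before it is saved, those quotes come back as part of the value when the file is read. The sketch below illustrates this with the standard-library csv module instead of pandas, using the same delimiter and quoting setting; treating the two as equivalent here is an assumption, and the sample row is made up.

```python
import csv
import io

# Write one hypothetical row with a pipe delimiter and minimal quoting
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['"Cells were treated"', 1])

# Read the row back with the same delimiter
reader = csv.reader(io.StringIO(buffer.getvalue()), delimiter='|')
rows = list(reader)

# The surrounding double quotes are still part of the stored text
stored_text = rows[0][0]

# Strip them if downstream code expects the bare abstract
bare_text = stored_text.strip('"')
print(stored_text)
print(bare_text)
```

Whether to strip the quotes is a decision for the notebook that consumes the file.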



We also save the log entries generated during the process to a separate .csv file:

In [8]:

# Create data frame with content of log_entries list 

log_df = pd.DataFrame(log_entries)

# Save content of log entry data frame to .csv

log_df.to_csv(output_path_logs, index=False, quoting=csv.QUOTE_MINIMAL)
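
Before moving on, it can help to check how many files were processed and why the others were skipped. The sketch below is one possible way to summarize log entries of the shape produced above; the helper name summarize_log and the sample entries are hypothetical.

```python
from collections import Counter


def summarize_log(log_entries):
    """Count log entries by (success, message) pair (illustrative helper)."""
    return Counter((entry['success'], entry['message']) for entry in log_entries)


# Hypothetical log entries with the same keys as those written by the Notebook
sample_log = [
    {'filename': 'a.json', 'success': True, 'message': 'Processed successfully'},
    {'filename': 'b.json', 'success': False, 'message': 'No abstract_inverted_index provided'},
    {'filename': 'c.json', 'success': True, 'message': 'Processed successfully'},
]

summary = summarize_log(sample_log)

# Print the most frequent outcomes first
for (success, message), count in summary.most_common():
    print(success, message, count)
```

Calling the helper on the log_entries list from the Notebook would give the same kind of breakdown for the real run.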
