# 2b. Grouping the Retracted Papers by Discipline / Field


## Introduction

This Notebook generates **one data frame with all the retracted papers associated with each area of study** present in the "subject" column. Note that, since the "subject" column often lists more than one subject, the same paper will often appear in more than one data frame. In other words, the subjects used to generate our subject-specific data frames are the unique entries of the "subject" column once each cell has been split on semicolons.

The Notebook takes as input the clean data set generated by **Notebooks 1 and 2a**, which is already filtered by severity score. The output that it generates will in turn be used in **Notebook 3** to fetch text data from retracted papers.

Its **input and output** parameters are thus as follows:
    
- Input parameters: one .csv file containing the clean data set of retracted papers, filtered by severity score.
- Output parameters: as many .csv files as there are unique entries in the "subject" column of the input .csv file.

## Input / Output Parameters

- Input parameters:

In [1]:

# File path for input .csv file

input_path = "../data/data_sets/severity_3_4.csv"


- Output parameters:

In [2]:

# Directory path for output .csv files

output_path = '../data/disciplines'


## Importing Libraries

In [3]:
# Import required libraries

import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import re
import os

## Analyzing the "Subject" Column

- Let us start by loading the data from our .csv file into a data frame:

In [4]:

# Read data from .csv file 

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize first row of data frame

df.head(1)


Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,...,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,reason,paywalled,notes,year,reason_list,severity_score
0,52762,Convolutional neural network and Kalman filter...,(B/T) Data Science;(B/T) Technology;(PHY) Engi...,"Engineering Campus, School of Electrical and E...",Applied Nanoscience,Springer,Malaysia,Bushra N Alsunbuli;Widad Ismail;Nor M Mahyuddin,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,0.0,2021-09-17,10.1007/s13204-021-02043-8,0.0,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/83D...,2021,['Concerns/Issues about Referencing/Attributio...,3


- Let us now take a look at the "subject" column:

In [5]:
df.subject

0        (B/T) Data Science;(B/T) Technology;(PHY) Engi...
1                                        (B/T) Technology;
2        (B/T) Technology;(HSC) Medicine - General;(HSC...
3                        (B/T) Technology;(SOC) Education;
4                     (B/T) Data Science;(B/T) Technology;
                               ...                        
31331    (BLS) Biochemistry;(BLS) Biology - Cellular;(B...
31332    (BLS) Biochemistry;(BLS) Biology - Molecular;(...
31333    (BLS) Biology - Molecular;(HSC) Medicine - Pul...
31334    (BLS) Biology - Cancer;(BLS) Biology - Cellula...
31335    (BLS) Biology - Cellular;(BLS) Biology - Molec...
Name: subject, Length: 31336, dtype: object

- As we can see, the rows in this column often contain more than one subject, each terminated by a semicolon. Let us create a set with all the unique subjects in our data frame so that we can gain a better sense of what is going on:

In [6]:

# Create empty set to store unique subjects in data frame

unique_subjects = set()

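# Store unique entries of "subject" column in new set, skipping missing values
# and the empty strings left behind by trailing semicolons

for entry in df['subject'].dropna():
    unique_subjects.update(s.strip() for s in entry.split(';') if s.strip())

# Convert set with unique subjects to sorted list

unique_subjects_list = sorted(unique_subjects)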

# Visualize result

#unique_subjects_list
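
To make the splitting step concrete, here is a minimal sketch on toy data. The three entries below are hypothetical and only mimic the semicolon-terminated format seen in the "subject" column. The sketch compares a naive split with one that drops empty fragments, since a trailing semicolon leaves an empty string behind.

```python
# Toy "subject" entries in the same semicolon-terminated format as the data set
# (hypothetical values, used for illustration only)
toy_subjects = [
    "(B/T) Data Science;(B/T) Technology;",
    "(B/T) Technology;",
    "(B/T) Technology;(SOC) Education;",
]

# Naive split: the trailing semicolon produces an empty string
naive = set()
for entry in toy_subjects:
    naive.update(entry.split(";"))

# Filtered split: strip whitespace and drop empty fragments
cleaned = set()
for entry in toy_subjects:
    cleaned.update(part.strip() for part in entry.split(";") if part.strip())

print(sorted(naive))
print(sorted(cleaned))
```

In this toy case the naive set contains an extra empty-string "subject", while the filtered set holds only the three real subject labels.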

## Output


- Later on, we will be interested in studying papers that belong to a single area of study. To make this possible, we will create one data frame with all the retracted papers associated with each unique subject, and write its content into a .csv file:


In [7]:

# Define function to create one .csv file per unique subject

def subject_csv_creator(df):
    """Take a data frame as input and split its "subject" column on semicolons
    into individual subjects. For each unique subject, build a data frame with
    all the retracted papers tagged with that subject and save it as a .csv
    file in the directory given by the global variable output_path."""

    # Create series with one entry per (paper, subject) pair from the "subject" column
    
    subject_series = df['subject'].str.split(';', expand=True).stack().str.strip()
    
    # Drop empty strings left behind by trailing semicolons
    
    subject_series = subject_series[subject_series != '']
    
    # Adjust series index so that it matches indexing of original data frame
    
    subject_series.index = subject_series.index.droplevel(-1)
    
    # Assign "subject" as name of series
    
    subject_series.name = 'subject'
    
    # Drop "subject" column from original data frame and join series so that each paper gets one row per subject
    
    df = df.drop('subject', axis=1).join(subject_series)

    # Create list of unique subjects ordered by number of papers in descending order
        
    sorted_subjects = df['subject'].value_counts().index.tolist()

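    # Make sure the output directory exists before writing any files

    os.makedirs(output_path, exist_ok=True)

    # For loop to iterate over the subjects in our sorted subjects list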
    
    for subject in sorted_subjects:
        
        # Create data frame for current subject in the loop
        
        subject_df = df[df['subject'] == subject]
        
        # Create valid file name to store .csv without blank spaces, dashes, slashes, or parentheses
        
        file_name = subject.replace('/', '').replace(' ', '_').replace('(', '').replace(')', '').replace('-', '') + '.csv'
       
        # Create full file path using file name for current subject and initial output path
    
        file_path = os.path.join(output_path, file_name)  # Construct full file path
        
        # Write data frame for current subject to .csv
        
        try:
            subject_df.to_csv(file_path, index=False)
                
        except Exception as e:
            print(f"Failed to save {file_path}: {e}")


# Call function to create one .csv file with all retracted papers per subject

subject_csv_creator(df)
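
As a cross-check of the grouping logic, the same per-subject split can be sketched with `Series.str.split` followed by `DataFrame.explode`. This is an alternative technique, not the method used in the function above. The toy data frame, its `record_id` values, and the `frames` dictionary are hypothetical and serve only to show that a paper tagged with several subjects lands in several groups.

```python
import pandas as pd

# Toy data frame (hypothetical records, not taken from the real data set)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "subject": [
        "(B/T) Data Science;(B/T) Technology;",
        "(B/T) Technology;",
        "(B/T) Technology;(SOC) Education;",
    ],
})

# Split each entry into a list of subjects, then give each subject its own row
exploded = toy_df.assign(subject=toy_df["subject"].str.split(";")).explode("subject")

# Strip whitespace and drop the empty strings left by trailing semicolons
exploded["subject"] = exploded["subject"].str.strip()
exploded = exploded[exploded["subject"] != ""]

# One data frame per subject, keyed by subject name
frames = {subject: group for subject, group in exploded.groupby("subject")}

for subject, group in frames.items():
    print(subject, group["record_id"].tolist())
```

In this sketch, record 1 appears under both "(B/T) Data Science" and "(B/T) Technology", which mirrors how the same retracted paper can end up in more than one subject-specific .csv file.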
