# 1c. Data Cleaning: Dividing our Data Set Based on the "Subject" Column


## Introduction

This Notebook generates **one data frame with all the retracted papers associated with a given area of study** present in the "subject" column. Note that, since the "subject" column often lists more than one subject, the same paper can appear in the data frames of several subjects.

The Notebook takes as input the clean data set generated by **Notebook 1b**, which is already filtered by severity score. The output that it generates will in turn be used by **Notebook 2** in order to fetch text data from retracted papers.

The workflow is as follows:
    
- Input parameters: **one .csv** file with the clean data set with retracted papers filtered by severity score. 
- Output parameters: as many **.csv files** as unique entries there are in the "subject" column of the input .csv file.

## Input / Output Parameters

Input parameters:

In [2]:

# File path for input .csv file

input_path = "../data/data_sets/severity_3_4.csv"


Output parameters:

In [26]:

# File path for output .csv

output_path = '../data/subject_cell_bio'  


## Importing Libraries

In [25]:
# Import required libraries

import pandas as pd
import numpy as np
import function_definitions

## Selecting a Subject

Let us start by loading our data from our .csv file into a dataframe:

In [5]:

# Read data from .csv file 

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize first row of data frame

df.head(1)


Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,...,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,reason,paywalled,notes,year,reason_list,severity_score
0,52762,Convolutional neural network and Kalman filter...,(B/T) Data Science;(B/T) Technology;(PHY) Engi...,"Engineering Campus, School of Electrical and E...",Applied Nanoscience,Springer,Malaysia,Bushra N Alsunbuli;Widad Ismail;Nor M Mahyuddin,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,0.0,2021-09-17,10.1007/s13204-021-02043-8,0.0,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/83D...,2021,['Concerns/Issues about Referencing/Attributio...,3


Let us now take a look at the "subject" column:

In [6]:
df.subject

0        (B/T) Data Science;(B/T) Technology;(PHY) Engi...
1                                        (B/T) Technology;
2        (B/T) Technology;(HSC) Medicine - General;(HSC...
3                        (B/T) Technology;(SOC) Education;
4                     (B/T) Data Science;(B/T) Technology;
                               ...                        
31331    (BLS) Biochemistry;(BLS) Biology - Cellular;(B...
31332    (BLS) Biochemistry;(BLS) Biology - Molecular;(...
31333    (BLS) Biology - Molecular;(HSC) Medicine - Pul...
31334    (BLS) Biology - Cancer;(BLS) Biology - Cellula...
31335    (BLS) Biology - Cellular;(BLS) Biology - Molec...
Name: subject, Length: 31336, dtype: object


As we can see, the different rows in this column often contain more than one subject separated by a semicolon. It will be useful to find out how many unique subjects there are in our data frame:

In [34]:

# Create series with all subjects by splitting and exploding entries in subject column

subjects_series = df['subject'].str.split(';').explode()

# Obtain list with unique subjects from series

unique_subjects_list = sorted(subjects_series.unique())

# Visualize unique subjects

unique_subjects_list


['',
 '(B/T) Business - Accounting',
 '(B/T) Business - Economics',
 '(B/T) Business - General',
 '(B/T) Business - Management',
 '(B/T) Business - Manufacturing',
 '(B/T) Business - Marketing',
 '(B/T) Business - Public Relations',
 '(B/T) Computer Science',
 '(B/T) Data Science',
 '(B/T) Foreign Aid',
 '(B/T) Government',
 '(B/T) International Relations',
 '(B/T) Technology',
 '(B/T) Transportation',
 '(B/T) Urban Planning',
 '(BLS) Agriculture',
 '(BLS) Anatomy/Physiology',
 '(BLS) Anthropology',
 '(BLS) Archeology',
 '(BLS) Biochemistry',
 '(BLS) Biology - Cancer',
 '(BLS) Biology - Cellular',
 '(BLS) Biology - General',
 '(BLS) Biology - Molecular',
 '(BLS) Forensic Sciences',
 '(BLS) Genetics',
 '(BLS) Microbiology',
 '(BLS) Neuroscience',
 '(BLS) Nutrition',
 '(BLS) Paleontology',
 '(BLS) Parasitology',
 '(BLS) Plant Biology/Botany',
 '(BLS) Toxicology',
 '(BLS) Zoology',
 '(ENV) Climate Change',
 '(ENV) Climatology',
 '(ENV) Ecology',
 '(ENV) Environmental Sciences',
 '(ENV) Fo
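
The empty string at the top of the list above comes from the trailing semicolon that closes every entry in the "subject" column: splitting on `;` leaves an empty piece after it. A minimal sketch of how those pieces could be dropped before listing the unique subjects, using a small made-up data frame (`toy` and its contents are illustrative assumptions, not part of the real data set):

```python
import pandas as pd

# Toy data (hypothetical), mimicking the trailing semicolon of the real "subject" column
toy = pd.DataFrame({
    "subject": ["(B/T) Technology;", "(B/T) Data Science;(B/T) Technology;"]
})

# Splitting on ';' leaves one empty string per row, after the trailing semicolon
exploded = toy["subject"].str.split(";").explode()

# Strip surrounding whitespace and drop the empty strings
cleaned = exploded.str.strip()
cleaned = cleaned[cleaned != ""]

# Sorted list of the subjects that are actually present
unique_subjects = sorted(cleaned.unique())
print(unique_subjects)
```

Whether to drop the empty entries is a design choice; they are harmless for the filtering done later in this Notebook, but they inflate the number of unique subjects by one.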


It will also be useful to get a sense of how many papers we have per subject:

In [36]:

# Get series with value counts of each unique subject

subject_counts = subjects_series.value_counts()

# Visualize value counts

subject_counts


subject
                                 31336
(BLS) Biology - Cellular          7090
(B/T) Technology                  5219
(BLS) Genetics                    4813
(B/T) Computer Science            4442
                                 ...  
(HUM) History - United States        5
(HUM) History - Africa               3
(PHY) Forensic Sciences              3
(HUM) Arts - Biography               1
(HUM) History - Australia            1
Name: count, Length: 131, dtype: int64
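
The first row of the counts above has an empty label and a count of 31336, which is exactly the number of rows in the data frame: it is the empty piece left behind by each trailing semicolon, not a real subject. As a hedged sketch on made-up data (the `toy` frame is an assumption for illustration), the counts can be computed without that entry and expressed as a percentage of papers:

```python
import pandas as pd

# Toy data (hypothetical): four papers, some with more than one subject
toy = pd.DataFrame({
    "subject": [
        "(B/T) Technology;",
        "(B/T) Data Science;(B/T) Technology;",
        "(BLS) Genetics;",
        "(B/T) Technology;(BLS) Genetics;",
    ]
})

# One entry per (paper, subject) pair, without the empty trailing pieces
exploded = toy["subject"].str.split(";").explode().str.strip()
exploded = exploded[exploded != ""]

# Number of papers per subject
counts = exploded.value_counts()

# Percentage of papers that list each subject
share = (counts * 100 / len(toy)).round(2)
print(share)
```

Because a paper can list several subjects, these percentages can add up to more than 100.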


Later on, we will be interested in studying papers that belong to a single area of study. To make this possible, let us start by defining a function that restricts our data set to one of the subjects present in the "subject" column:


In [37]:


def subject_selector(df, subject):
    """
    Function takes a data frame and a string with a subject name as input.
    It processes the 'subject' column by splitting its values separated by semicolons, 
    filters the data frame to include only rows with the specified subject, then
    returns the filtered data frame.
    """

    # Create series with exploded subjects by splitting the 'subject' column
    
    exploded_subjects = df['subject'].str.split(';').explode()
    
    # Drop the original 'subject' column and join the exploded subjects
    
    df = df.drop('subject', axis=1).join(exploded_subjects.rename('subject'))
    
    # Filter the data frame for the specified subject
    
    subject_df = df[df['subject'] == subject]
    
    # Return filtered data frame
    
    return subject_df
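
Note that the function above replaces the content of the "subject" column with the single selected subject. If the full list of subjects of each paper is needed later, a boolean mask is one possible alternative. The sketch below is an illustration only: `subject_mask_selector` and the `toy` data frame are hypothetical names that do not appear elsewhere in this project.

```python
import pandas as pd

def subject_mask_selector(df, subject):
    """
    Hypothetical variant of subject_selector: keeps the rows whose 'subject'
    entry lists the given subject, leaving the original entry untouched.
    """
    # Split each entry into a list of subjects; missing values become ['']
    subject_lists = df["subject"].fillna("").str.split(";")

    # True for rows where the requested subject is one of the listed subjects
    mask = subject_lists.apply(lambda subjects: subject in subjects)

    return df[mask]

# Toy data (hypothetical) with the same semicolon-separated format
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "subject": [
        "(BLS) Biology - Cellular;(BLS) Genetics;",
        "(B/T) Technology;",
        "(BLS) Biology - Cellular;",
    ],
})

selected = subject_mask_selector(toy, "(BLS) Biology - Cellular")
print(selected["title"].tolist())
```

Both approaches select the same papers; they differ only in what ends up in the "subject" column of the result.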

    


Since it is the subject with the most entries, cellular biology will be a good test case for our study. Let us thus call our function to restrict our data frame to papers in this area only:

In [40]:

# Call function to restrict the data frame to retracted papers in a single subject

df_subject = subject_selector(df,"(BLS) Biology - Cellular")

# Check that it worked

#df_subject


As always, it will be useful to check how many papers our filtered data set contains, and what percentage of the data set loaded at the beginning of this Notebook they represent:

In [43]:

# Print total number of papers in data set

print(f"Total number of papers in data set: {df_subject.shape[0]}")

# Print percentage of papers retained compared with initial data set

print(f"Percentage of original data set that represents: {round(df_subject.shape[0] *100 / df.shape[0],2)} %")


Total number of papers in data set: 7090
Percentage of original data set that represents: 22.63 %


## Output


Having done all this, we can now go ahead and save our new data frame, which has been restricted to a single subject, into a .csv file:

In [42]:

# Save data frame as .csv

df_subject.to_csv(output_path, index=False)
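
The cell above saves a single subject. To obtain one .csv file per subject, as described in the Introduction, the same steps could be repeated in a loop. The following is only a sketch under stated assumptions: `save_all_subjects`, `subject_to_filename`, the file naming scheme, and the `toy` data frame are hypothetical and not part of this project.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

def subject_to_filename(subject):
    """Hypothetical helper: '(BLS) Genetics' -> 'subject_bls_genetics.csv'."""
    # Replace every run of non-alphanumeric characters with an underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", subject).strip("_").lower()
    return f"subject_{slug}.csv"

def save_all_subjects(df, output_dir):
    """Write one .csv per non-empty subject and return the paths written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # One row per (paper, subject) pair, as in subject_selector
    exploded = df["subject"].str.split(";").explode().str.strip()
    long_df = df.drop(columns="subject").join(exploded.rename("subject"))

    # Discard missing subjects and the empty pieces left by trailing semicolons
    long_df = long_df[long_df["subject"].notna() & (long_df["subject"] != "")]

    paths = []
    for subject, group in long_df.groupby("subject"):
        path = output_dir / subject_to_filename(subject)
        group.to_csv(path, index=False)
        paths.append(path)
    return paths

# Toy data (hypothetical) written to a temporary directory
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "subject": ["(BLS) Biology - Cellular;(BLS) Genetics;", "(BLS) Genetics;"],
})

with tempfile.TemporaryDirectory() as tmp:
    written = save_all_subjects(toy, tmp)
    file_names = sorted(p.name for p in written)
    genetics_rows = len(pd.read_csv(Path(tmp) / "subject_bls_genetics.csv"))

print(file_names)
```

With this naming scheme, two subjects that differ only in punctuation would map to the same file name, so it is worth checking that the generated names are unique before running it on the full data set.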
