# 1c. Data Cleaning: Dividing our Data Set Based on the "Subject" Column


## Introduction

This Notebook generates **a data frame with all the retracted papers in our data set that belong to a specific field listed in the "subject" column**; the field is selected by hand. Note that, since the "subject" column often lists more than one subject, a single paper may be associated with more than one scientific field. 

The Notebook takes as input the clean data set generated by **Notebook 1b**, which is already filtered by severity score. The output that it generates will in turn be used by **Notebook 2a** in order to fetch text data from retracted papers.

The **workflow** is as follows:
    
- Input parameters: **one .csv file** with the clean data set of retracted papers filtered by severity score, plus the name of the subject to select. 
- Output parameters: **one .csv file** with the retracted papers that belong to the selected subject.

## Input / Output Parameters

Input parameters:

In [12]:

# File path for input .csv file

input_path = "../data/retraction_watch_data_set/3_severe_only_data_set.csv"

# Code for subject to restrict analysis to

subject = "(BLS) Biology - Cellular"


Output parameters:

In [13]:

# File path for output .csv

output_path = '../data/retraction_watch_data_set/4_cell_bio_data_set.csv'  


## Importing Libraries

In [14]:
# Import required libraries

import pandas as pd
import numpy as np
import function_definitions

## Selecting a Subject

Let us start by loading the data from our .csv file into a data frame:

In [15]:

# Read data from .csv file 

df = pd.read_csv(input_path, encoding='latin-1')

# Visualize first row of data frame

df.head(1)


Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,...,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes,reason_list,severity_score
0,52765,An integrated 3D model based face recognition ...,(B/T) Data Science;(B/T) Technology;,"Management Information System Department, Cypr...",Applied Nanoscience,Springer,Turkey,Ali Milad;Kamil Yurtkan,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,0.0,2/3/2022 0:00,10.1007/s13204-021-02123-9,0.0,Correction,+Error in Text;+Updated to Retraction;,No,See also: https://pubpeer.com/publications/BCC...,"['Error in Text', 'Updated to Retraction']",3


Let us now take a look at the "subject" column:

In [16]:
df.subject

0                     (B/T) Data Science;(B/T) Technology;
1        (B/T) Data Science;(B/T) Technology;(PHY) Engi...
2                                        (B/T) Technology;
3        (B/T) Technology;(HSC) Medicine - General;(HSC...
4                        (B/T) Technology;(SOC) Education;
                               ...                        
52285    (BLS) Biochemistry;(HSC) Medicine - Cardiology...
52286    (BLS) Biology - Cellular;(BLS) Genetics;(BLS) ...
52287    (BLS) Biology - Cellular;(BLS) Genetics;(BLS) ...
52288    (BLS) Biology - Cancer;(BLS) Biology - Cellula...
52289    (BLS) Biology - Cellular;(BLS) Biology - Molec...
Name: subject, Length: 52290, dtype: object


As we can see, rows in this column often contain more than one subject, each followed by a semicolon. It will be useful to find out how many unique subjects there are in our data frame:

In [17]:

# Create series with all subjects by splitting and exploding entries in subject column

subjects_series = df['subject'].str.split(';').explode()

# Obtain list with unique subjects from series

unique_subjects_list = sorted(subjects_series.unique())

# Visualize unique subjects

unique_subjects_list


['',
 '(B/T) Business - Accounting',
 '(B/T) Business - Economics',
 '(B/T) Business - General',
 '(B/T) Business - Management',
 '(B/T) Business - Manufacturing',
 '(B/T) Business - Marketing',
 '(B/T) Business - Public Relations',
 '(B/T) Computer Science',
 '(B/T) Data Science',
 '(B/T) Foreign Aid',
 '(B/T) Government',
 '(B/T) International Relations',
 '(B/T) Technology',
 '(B/T) Transportation',
 '(B/T) Urban Planning',
 '(BLS) Agriculture',
 '(BLS) Anatomy/Physiology',
 '(BLS) Anthropology',
 '(BLS) Archeology',
 '(BLS) Biochemistry',
 '(BLS) Biology - Cancer',
 '(BLS) Biology - Cellular',
 '(BLS) Biology - General',
 '(BLS) Biology - Molecular',
 '(BLS) Forensic Sciences',
 '(BLS) Genetics',
 '(BLS) Microbiology',
 '(BLS) Neuroscience',
 '(BLS) Nutrition',
 '(BLS) Paleontology',
 '(BLS) Parasitology',
 '(BLS) Plant Biology/Botany',
 '(BLS) Toxicology',
 '(BLS) Zoology',
 '(ENV) Climate Change',
 '(ENV) Climatology',
 '(ENV) Ecology',
 '(ENV) Environmental Sciences',
 '(ENV) Fo


It will also be useful to get a sense of how many papers we have per subject:

In [18]:

# Get series with value counts of each unique subject

subject_counts = subjects_series.value_counts()

# Visualize value counts

subject_counts


subject
                                 52290
(BLS) Biology - Cellular         10241
(B/T) Technology                  7442
(BLS) Genetics                    6595
(BLS) Biochemistry                6067
                                 ...  
(B/T) Foreign Aid                    9
(HUM) History - South America        7
(HUM) History - United States        7
(PHY) Forensic Sciences              5
(HUM) History - Australia            1
Name: count, Length: 131, dtype: int64
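Note that the empty string at the top of the counts appears exactly once per row (52290 times): every entry in the "subject" column ends with a semicolon, so splitting leaves one trailing empty piece. A minimal sketch of how these empty entries could be dropped before counting is shown here, using a small made-up data frame (the `toy` frame and its values are illustrative assumptions, not taken from our data set):

```python
import pandas as pd

# Toy data frame (hypothetical values) mimicking the trailing-semicolon format
toy = pd.DataFrame({
    "subject": [
        "(B/T) Technology;(SOC) Education;",
        "(BLS) Genetics;",
        "(B/T) Technology;",
    ]
})

# Split on semicolons and explode into one subject per row
toy_subjects = toy["subject"].str.split(";").explode()

# Strip surrounding whitespace and drop the empty strings
# left behind by the trailing semicolon
toy_subjects = toy_subjects.str.strip()
toy_subjects = toy_subjects[toy_subjects != ""]

# Count papers per subject, now without the empty entry
toy_counts = toy_subjects.value_counts()
print(toy_counts)
```

Applied to `subjects_series`, the same two filtering lines would remove the empty entry from both the list of unique subjects and the value counts.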


Later on, we will want to study papers that belong to a single area of study. To do this, let us define a function that restricts our data set to the papers tagged with one given subject:


In [19]:


def subject_selector(df, subject):
    """
    Function takes a data frame and a string with a subject name as input.
    It processes the 'subject' column by splitting its values separated by semicolons, 
    filters the data frame to include only rows with the specified subject, then
    returns the filtered data frame.
    """

    # Create series with exploded subjects by splitting the 'subject' column
    
    exploded_subjects = df['subject'].str.split(';').explode()
    
    # Drop the original 'subject' column and join the exploded subjects
    
    df = df.drop('subject', axis=1).join(exploded_subjects.rename('subject'))
    
    # Filter the data frame for the specified subject
    
    subject_df = df[df['subject'] == subject]
    
    # Return filtered data frame
    
    return subject_df
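Note that `subject_selector` replaces the multi-subject "subject" column with the single selected subject. If the full list of subjects has to be kept for later analysis, one possible alternative is to filter with a boolean mask instead. The sketch below is only an illustration under that assumption: the function name `subject_mask_selector` and the `toy` data frame are hypothetical and not part of this notebook's pipeline.

```python
import pandas as pd

def subject_mask_selector(df, subject):
    """
    Hypothetical alternative to subject_selector: keeps the rows whose
    'subject' column lists the given subject, leaving the column untouched.
    """
    # Split each entry into a list of subjects and test for exact membership;
    # missing values are not lists and are therefore excluded
    mask = df["subject"].str.split(";").apply(
        lambda subjects: isinstance(subjects, list) and subject in subjects
    )
    return df[mask]

# Toy data frame with made-up titles, for illustration only
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "subject": [
        "(BLS) Biology - Cellular;(BLS) Genetics;",
        "(B/T) Technology;",
        "(BLS) Biochemistry;(BLS) Biology - Cellular;",
    ],
})

toy_cell = subject_mask_selector(toy, "(BLS) Biology - Cellular")
print(toy_cell)
```

Testing membership in the split list, rather than searching for a substring, avoids accidentally matching a subject whose name merely contains the selected one.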

    


Since it is the subject with the most entries, cellular biology will be a good test case for our study. Let us thus call our function to restrict our data frame to papers in this area only (recall that the variable "subject" was already set to this value when defining our input parameters above):

In [20]:

# Call function to restrict data frame to papers in the selected subject

df_subject = subject_selector(df, subject)

# Check that it worked

df_subject

Unnamed: 0,record_id,title,institution,journal,publisher,country,author,urls,article_type,retraction_date,...,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes,reason_list,severity_score,subject
20,52739,Anti-breast Cancer Activity of Co(II) Complex ...,"Luohe Medical College, Luohe, Henan, China; Lu...",Journal of Cluster Science,Springer - Nature Publishing Group,China,Ting Yin;Ruirui Wang;Shaozhe Yang,,Research Article;,1/24/2024 0:00,...,10/27/2021 0:00,10.1007/s10876-021-02192-4,0.0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,See also: https://pubpeer.com/publications/739...,"['Concerns/Issues About Image', 'Concerns/Issu...",4,(BLS) Biology - Cellular
21,52738,Anticancer Activity of Zn(II) Coordination Pol...,"Department of Medical Genetics, School of Basi...",Journal of Cluster Science,Springer - Nature Publishing Group,China,Hao Liu;Liying Wu;Jiji Cui;Dan Wang,,Research Article;,1/30/2024 0:00,...,11/23/2021 0:00,10.1007/s10876-021-02201-6,0.0,Retraction,+Concerns/Issues About Image;+Concerns/Issues ...,No,See also: https://pubpeer.com/publications/297...,"['Concerns/Issues About Image', 'Concerns/Issu...",4,(BLS) Biology - Cellular
32,52727,Predigested high-fat meats based on Lactobacil...,"Department of Microbial Biotechnology, Genetic...",Applied Nanoscience,Springer,Egypt,A B Abeer Mohammed;A E Hegazy;Ahmed Salah,https://retractionwatch.com/2024/01/26/springe...,Research Article;,1/11/2024 0:00,...,5/17/2021 0:00,10.1007/s13204-021-01879-4,0.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/430...,['Concerns/Issues about Referencing/Attributio...,3,(BLS) Biology - Cellular
34,52725,Mechanical model of the physiological microenv...,"School of Information Engineering, East China ...",Applied Nanoscience,Springer,China,Yuejin Zhang;Juan Wang;Qi Liu;Xiaohui Guan;Mei...,https://retractionwatch.com/2024/01/26/springe...,Research Article;,1/9/2024 0:00,...,7/19/2021 0:00,10.1007/s13204-021-01951-z,0.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/73D...,['Concerns/Issues about Referencing/Attributio...,3,(BLS) Biology - Cellular
106,52621,Design and evaluation of ciprofloxacin loaded ...,"School of Biochemical Engineering, Indian Inst...",Biomedical Materials,IOP Publishing,India,Satyavrat Tripathi;Bhisham Narayan Singh;Singh...,,Research Article;,3/5/2024 0:00,...,2/23/2021 0:00,10.1088/1748-605X/abd1b8,33291087.0,Retraction,+Concerns/Issues About Image;+Duplication of I...,No,,"['Concerns/Issues About Image', 'Duplication o...",3,(BLS) Biology - Cellular
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52281,9,Sphingosine kinase 1 regulates pro-inflammator...,"Department of Physiology, Yong Loo Lin School ...",Journal of Cellular Physiology,Wiley,Singapore,Liang Zhi;Bernard P Leung;Alirio J Melendez,http://retractionwatch.com/2012/04/04/third-re...,Research Article;,3/20/2012 0:00,...,3/30/2006 0:00,10.1002/jcp.20646,16575915.0,Retraction,+Duplication of Image;+Investigation by Compan...,No,,"['Duplication of Image', 'Investigation by Com...",4,(BLS) Biology - Cellular
52286,4,MtvR is a global small noncoding regulatory RN...,Institute for Biotechnology and Bioengineering...,Journal of Bacteriology,American Society for Microbiology,Portugal,Christian G Ramos;AndrÃÂÃÂ© M Grilo;Paulo J...,http://retractionwatch.com/2014/11/03/post-doc...,Research Article;,11/1/2014 0:00,...,5/31/2013 0:00,10.1128/JB.00242-13,2372964.0,Retraction,+Duplication of Image;+Manipulation of Images;,No,exact date of retraction unknown,"['Duplication of Image', 'Manipulation of Imag...",2,(BLS) Biology - Cellular
52287,3,"The second RNA chaperone, Hfq2, is also requir...",IBBÃÂ¢ÃÂÃÂInstitute for Biotechnology and...,Journal of Bacteriology,American Society for Microbiology,Portugal,Christian G Ramos;SÃÂÃÂ­lvia A Sousa;AndrÃ...,http://retractionwatch.com/2014/10/17/this-sit...,Research Article;,11/1/2014 0:00,...,1/28/2011 0:00,10.1128/JB.01375-10,21278292.0,Retraction,+Duplication of Image;+Error in Image;,No,,"['Duplication of Image', 'Error in Image']",2,(BLS) Biology - Cellular
52288,2,Regulation of Wnt/beta-catenin pathway by cPLA...,"Department of Pathology, University of Pittsbu...",Journal of Cellular Biochemistry,Wiley,United States,Chang Han;Kyu Lim;Lihong Xu;Guiying Li;Tong Wu,http://retractionwatch.com/2015/02/09/figure-d...,Research Article;,1/29/2015 0:00,...,7/17/2008 0:00,10.1002/jcb.21852,18636547.0,Retraction,+Duplication of Image;+Falsification/Fabricati...,No,,"['Duplication of Image', 'Falsification/Fabric...",3,(BLS) Biology - Cellular



As always, it is useful to check how many papers our filtered data set contains, and what percentage of the data set loaded at the beginning of this notebook they represent:

In [21]:

# Print total number of papers in data set

print(f"Total number of papers in data set: {df_subject.shape[0]}")

# Print percentage of papers retained compared with initial data set

print(f"Percentage of original data set that represents: {round(df_subject.shape[0] *100 / df.shape[0],2)} %")


Total number of papers in data set: 10241
Percentage of original data set that represents: 19.59 %
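The count of 10241 matches the value reported for "(BLS) Biology - Cellular" in the subject counts above, and the percentage follows as 10241 × 100 / 52290. A small check of this arithmetic, with both numbers copied by hand from the outputs above:

```python
# Numbers copied from the notebook outputs above
n_subject = 10241   # papers tagged "(BLS) Biology - Cellular"
n_total = 52290     # papers in the data set loaded at the start

# Same formula as in the cell above
percentage = round(n_subject * 100 / n_total, 2)
print(f"{percentage} %")  # prints "19.59 %"
```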


## Output


Having done all this, we can now go ahead and save our new data frame, which has been restricted to a single subject, into a .csv file:

In [22]:

# Save data frame as .csv

df_subject.to_csv(output_path, index=False)
