# 2a. Feature Engineering: Severity Score


## Introduction


This notebook adds a column to the clean data set produced in **Notebook 1**: a severity score indicating how concerning the kind of retraction associated with each paper in our data set is.

This is accomplished by manually mapping each of the reasons for retraction in the "Retraction Reason" column of our data set to a **severity score** that ranges from 0 to 4. Since a paper can list several reasons, it is assigned the highest score among them. The notebook then filters out all papers with a severity score below a given threshold so as to keep only those items that represent a genuine case of malpractice.

The **input and output parameters** are therefore as follows:

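- Input parameters: **one .csv file** with the clean data set produced in Notebook 1, **another .csv file** with the severity score correspondences that we created manually, and the **severity threshold** below which papers are dropped.
- Output parameters: **one .csv file** with the clean data set plus a severity score column, restricted to papers at or above the severity threshold.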


## Input / Output Parameters

Input parameters:

In [17]:

# Input path for severity score correspondence

input_path_severity_score = "../data/reasons_severityscore.csv"

# Input path for clean data set

input_path_data_set = "../data/data_sets/clean_data_set.csv"

# Severity threshold
# Only papers with this severity score or higher will be kept in our data frame

severity_threshold = 3

Output parameters:

In [18]:

# Output path for data set with severity score column

output_path = "../data/data_sets/clean_data_severity.csv"

## Importing Libraries

In [19]:

# Import required libraries

import pandas as pd
import numpy as np
import function_definitions

## Adding a "Severity Score" to our Data Set

Let us start by loading the mapping between reasons for retraction and their associated severity scores. We will read it from the .csv file that we created manually:

In [20]:

# Load severity score data from .csv file into data frame

df_severity = pd.read_csv(input_path_severity_score, sep = ";")

# Make sure column names are in snake case

df_severity.columns = [col.lower().replace(" ", "_") for col in df_severity.columns]

# Visualize result

df_severity

Unnamed: 0,reason,severity_score
0,Author Unresponsive,0
1,Bias Issues or Lack of Balance,3
2,Breach of Policy by Author,3
3,Cites Retracted Work,3
4,Civil Proceedings,4
...,...,...
104,Upgrade/Update of Prior Notice,0
105,Withdrawal,2
106,Withdrawn (out of date),2
107,Withdrawn to Publish in Different Journal,0
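
Because this mapping was built by hand, a quick sanity check can be worthwhile before relying on it. The following is a minimal sketch, assuming a toy stand-in called `df_severity_example` (a hypothetical name, filled with a few rows from the preview above) rather than the full file. It checks that each reason appears only once and that every score lies in the 0 to 4 range:

```python
import pandas as pd

# Toy stand-in for the manually created mapping (rows copied from the preview above)
df_severity_example = pd.DataFrame({
    "reason": ["Author Unresponsive", "Cites Retracted Work", "Civil Proceedings", "Withdrawal"],
    "severity_score": [0, 3, 4, 2],
})

# Each reason should appear exactly once, otherwise its score would be ambiguous
reasons_are_unique = bool(df_severity_example["reason"].is_unique)

# Every score should lie in the documented range from 0 to 4
scores_in_range = bool(df_severity_example["severity_score"].between(0, 4).all())

print(reasons_are_unique, scores_in_range)
```

The same two checks could be run on `df_severity` itself once it has been loaded.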


Next we will load the clean data set that we generated in Notebook 1:

In [21]:

# Read clean data set from input path

df = pd.read_csv(input_path_data_set)

# Visualize result

df.head(1)

Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,retraction_date,retraction_doi,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes
0,52765,An integrated 3D model based face recognition ...,(B/T) Data Science;(B/T) Technology;,"Management Information System Department, Cypr...",Applied Nanoscience,Springer,Turkey,Ali Milad;Kamil Yurtkan,https://retractionwatch.com/2024/01/26/springe...,Research Article;,6/30/2023 0:00,10.1007/s13204-024-03010-9,0.0,2/3/2022 0:00,10.1007/s13204-021-02123-9,0.0,Correction,+Error in Text;+Updated to Retraction;,No,See also: https://pubpeer.com/publications/BCC...


Having done that, we can start assigning a severity score to each paper. Our first step is to add a new column to our data frame containing a list of all the reasons for retraction mentioned for that paper. To do so, we first define a function that takes a string of reasons separated by a given delimiter (such as the ones in the rows of our data frame, which are separated by a semicolon) and returns a list of the individual reasons:


In [22]:
def separate_and_clean_reasons(reasons_string, delimiter, extra_symbol):
    """Function takes a single string containing multiple reasons separated by a given delimiter 
    and returns a list with all individual reasons as its elements, removing white spaces,
    the delimiter symbol, and any other extraneous symbol of choice."""
    
    # Return empty list if reasons string is NaN
    
    if pd.isna(reasons_string):
        return []

    # Store individual reasons as separated by delimiter in list
    
    individual_reasons_list = reasons_string.split(delimiter)  
    
    # Initialize another list to store cleaned individual reasons
    
    clean_reasons_list = [] 
    
    # For loop to clean individual reasons by dropping delimiter and extra symbols
    
    for individual_reason in individual_reasons_list:
        
        # Clean individual reason and store it in new variable
        
        clean_individual_reason = individual_reason.strip(f"{delimiter}{extra_symbol} ")
        
        # If not empty, append clean individual reason to clean individual reasons list
        
        if clean_individual_reason:  
            clean_reasons_list.append(clean_individual_reason)
            
    # Return list with individual clean reasons
    
    return clean_reasons_list
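
To illustrate what the function returns, here is a small sketch applied to the reason string from the first row of our data frame. The function body below is a condensed restatement of the definition above, included only so that the example runs on its own:

```python
import pandas as pd

def separate_and_clean_reasons(reasons_string, delimiter, extra_symbol):
    """Condensed restatement of the function defined above."""
    if pd.isna(reasons_string):
        return []
    # Strip the delimiter, the extra symbol and spaces from both ends of each piece
    cleaned = (
        piece.strip(f"{delimiter}{extra_symbol} ")
        for piece in reasons_string.split(delimiter)
    )
    # Keep only non-empty pieces (the trailing delimiter produces an empty one)
    return [reason for reason in cleaned if reason]

example_string = "+Error in Text;+Updated to Retraction;"
example_result = separate_and_clean_reasons(example_string, delimiter=";", extra_symbol="+")
missing_result = separate_and_clean_reasons(float("nan"), delimiter=";", extra_symbol="+")

print(example_result)  # ['Error in Text', 'Updated to Retraction']
print(missing_result)  # []
```

Note that `strip` only removes characters at the two ends of each piece, so a "+" or a space inside a reason is left untouched.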



We can now call our function to add that extra column to our data frame. We will wrap the call in a lambda function so that we can use the "apply" method while passing the required arguments to our function:

In [23]:

# Call "separate_and_clean_reasons" function to add a column to data frame with lists of individual reasons as rows

df['reason_list'] = df['reason'].apply(lambda x: separate_and_clean_reasons(x, delimiter=";", extra_symbol="+"))
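
As a side note, `Series.apply` also forwards extra keyword arguments to the function it calls, so the lambda is a stylistic choice rather than a requirement. The sketch below compares both forms on a toy Series; `split_reasons` is a hypothetical compact stand-in for `separate_and_clean_reasons`, used here only to keep the example self-contained:

```python
import pandas as pd

def split_reasons(reasons_string, delimiter, extra_symbol):
    # Hypothetical compact stand-in for separate_and_clean_reasons
    if pd.isna(reasons_string):
        return []
    pieces = (p.strip(f"{delimiter}{extra_symbol} ") for p in reasons_string.split(delimiter))
    return [p for p in pieces if p]

# Toy column with one populated entry and one missing entry
reasons = pd.Series(["+Error in Text;+Updated to Retraction;", None])

# Form used in this notebook: wrap the call in a lambda
with_lambda = reasons.apply(lambda x: split_reasons(x, delimiter=";", extra_symbol="+"))

# Equivalent form: let apply forward the keyword arguments
with_kwargs = reasons.apply(split_reasons, delimiter=";", extra_symbol="+")

print(with_lambda.tolist() == with_kwargs.tolist())
```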


Next we will create an additional function that looks up the severity score of each reason in a list of reasons, then returns the highest of those scores:

In [24]:

# Define function to calculate the maximum severity score of those in the reason_list column

def get_max_severity(reason_list):
    
    """Function takes a list with reasons for retraction, checks what severity scores are
    associated with each individual reason, then returns the maximum score in that collection."""
    
    # Select the scores associated with the reasons in the input list
    
    scores = df_severity[df_severity['reason'].isin(reason_list)]['severity_score']
    
    # Return maximum value among the selected scores (None if no reason matched)
    
    return max(scores) if not scores.empty else None
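
The function above filters `df_severity` once per paper. A possible alternative, shown here only as a sketch on toy data and not as the approach used in this notebook, is to build the reason-to-score lookup once and let pandas compute the maximum per paper with `explode`, `map` and `groupby`. The names `df_severity_example` and `df_example` are hypothetical stand-ins:

```python
import pandas as pd

# Toy stand-ins for df_severity and df (illustrative values only)
df_severity_example = pd.DataFrame({
    "reason": ["Author Unresponsive", "Cites Retracted Work", "Civil Proceedings"],
    "severity_score": [0, 3, 4],
})
df_example = pd.DataFrame({
    "reason_list": [
        ["Author Unresponsive", "Civil Proceedings"],
        ["Cites Retracted Work"],
        [],  # a paper with no listed reasons
    ]
})

# Build the reason -> score lookup once
score_lookup = df_severity_example.set_index("reason")["severity_score"]

# One row per (paper, reason), keeping the paper's original index
exploded_reasons = df_example["reason_list"].explode()

# Map each reason to its score, then take the maximum per paper
max_scores = exploded_reasons.map(score_lookup).groupby(level=0).max()

print(max_scores)
```

In this sketch a paper with no matching reason ends up with a missing value instead of `None`, so it would still need to be handled before casting to an integer type.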


We can finally call our new function to add yet another column to our data frame, which will contain the highest severity score among the reasons why each paper was retracted:

In [25]:
# Call function to create new "severity_score" column with maximum severity score for each paper

df['severity_score'] = df['reason_list'].apply(get_max_severity)

# Convert elements of "severity_score" column to int type

df['severity_score'] = df['severity_score'].astype(int)
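
The cast to `int` works here because every paper received a score. If some paper had no matching reason, `get_max_severity` would return `None` and the cast would fail. The sketch below, which uses a toy Series rather than our data, shows that failure and one possible workaround: the nullable `"Int64"` dtype that pandas provides.

```python
import pandas as pd

# Toy scores with one missing value, as would happen for a paper with no matching reason
scores = pd.Series([3.0, None, 4.0])

# A plain int cast cannot represent the missing value and raises an error
try:
    scores.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable integer dtype keeps the gap as <NA> instead of failing
nullable_scores = scores.astype("Int64")

print(cast_failed)
print(nullable_scores)
```

Whether to keep such rows or drop them is a modelling decision; the filtering step later in this notebook would discard them, since a comparison with a missing value does not evaluate to true.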


It is important to get a sense of how the papers in our original data set are distributed across severity scores:

In [26]:

# Obtain value count of each severity score in our data frame

score_counts = df['severity_score'].value_counts()

# Print value counts

score_counts


severity_score
3    23374
4    18805
0     4955
2     4412
1      744
Name: count, dtype: int64

Let us also find out what percentage of the total entries in our data set those numbers represent. For this purpose we will use the percentage_printer function that we defined in Notebook 1, calling it from the function_definitions Python file that we imported above:

In [27]:
function_definitions.percentage_printer(df, "severity_score")

3: 44.7%
4: 35.96%
0: 9.48%
2: 8.44%
1: 1.42%
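
Since `percentage_printer` lives in a separate file, its implementation is not shown in this notebook. The sketch below is therefore not that function, only an equivalent calculation starting from the value counts printed above. It also computes the combined share of papers with a score of 3 or more:

```python
import pandas as pd

# Value counts copied from the output of the previous cell
score_counts = pd.Series({3: 23374, 4: 18805, 0: 4955, 2: 4412, 1: 744})

# Share of each score as a percentage of all papers, rounded to two decimals
percentages = (score_counts / score_counts.sum() * 100).round(2)

for score, percentage in percentages.items():
    print(f"{score}: {percentage}%")

# Combined share of papers with a severity score of 3 or more
share_at_or_above_3 = round(score_counts[score_counts.index >= 3].sum() / score_counts.sum() * 100, 2)
print(share_at_or_above_3)
```

On the full data frame, `df['severity_score'].value_counts(normalize=True)` would give the same fractions directly.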


We see that most papers (44.7% + 35.96%, or roughly 80.7%) have a severity score of 3 or more, which is good because we are only interested in investigating the most severe cases of misconduct. Given these percentages, we can safely drop all papers with a severity score below our threshold (set to 3 at the beginning of the notebook, but adjustable if necessary) while still retaining most of the data set:


In [28]:

# Filter the data frame to include only those papers at or above our severity threshold

df = df[df['severity_score'] >= severity_threshold]



Finally, we can also drop the column "reason_list" from our data set, which we added as an intermediate step and won't need again:

In [29]:

# Drop "reason_list" column (reassigning instead of modifying the filtered data frame in place)

df = df.drop(columns=['reason_list'])

# Verify result

df


Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,...,retraction_doi,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes,severity_score
0,52765,An integrated 3D model based face recognition ...,(B/T) Data Science;(B/T) Technology;,"Management Information System Department, Cypr...",Applied Nanoscience,Springer,Turkey,Ali Milad;Kamil Yurtkan,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,10.1007/s13204-024-03010-9,0.0,2/3/2022 0:00,10.1007/s13204-021-02123-9,0.0,Correction,+Error in Text;+Updated to Retraction;,No,See also: https://pubpeer.com/publications/BCC...,3
1,52762,Convolutional neural network and Kalman filter...,(B/T) Data Science;(B/T) Technology;(PHY) Engi...,"Engineering Campus, School of Electrical and E...",Applied Nanoscience,Springer,Malaysia,Bushra N Alsunbuli;Widad Ismail;Nor M Mahyuddin,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,10.1007/s13204-024-03006-5,0.0,9/17/2021 0:00,10.1007/s13204-021-02043-8,0.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/83D...,3
2,52761,Provide a new framework for blockchain-based i...,(B/T) Technology;,"Electrical and Computer Engineering, Altinbas ...",Applied Nanoscience,Springer,Turkey,Firas Hammoodi Neanah Al-mutar;Abdullahi Abdu ...,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,10.1007/s13204-024-03023-4,0.0,2/3/2022 0:00,10.1007/s13204-021-02175-x,0.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/B48...,3
3,52760,Integration of Healthcare 4.0 and blockchain i...,(B/T) Technology;(HSC) Medicine - General;(HSC...,"Godwit Technologies, Pune, India; Business Inf...",Applied Nanoscience,Springer,India;Iraq;Turkey,Hemant B Mahajan;Ameer Sardar Rashid;Aparna A ...,,Research Article;,...,10.1007/s13204-024-03007-4,0.0,2/4/2022 0:00,10.1007/s13204-021-02164-0,35136707.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/347...,3
4,52759,A framework for adopting gamified learning sys...,(B/T) Technology;(SOC) Education;,Department of Electrical and Computer Engineer...,Applied Nanoscience,Springer,Turkey,Farazdaq Nahedh Alsamawi;Sefer Kurnaz,https://retractionwatch.com/2024/01/26/springe...,Research Article;,...,10.1007/s13204-024-03028-z,0.0,6/16/2021 0:00,10.1007/s13204-021-01909-1,34155468.0,Retraction,+Concerns/Issues about Referencing/Attribution...,No,See also: https://pubpeer.com/publications/F2E...,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52283,7,Bradykinin-induced increase in pulmonary vascu...,(BLS) Biochemistry;(BLS) Biology - Molecular;(...,Departments of Pediatrics (Pulmonary Division)...,Journal of Applied Physiology,APS: American Physiological Society,United States,S Alex Stalcup;Hugh M O'Brodovich;Leila Mei Pa...,,Research Article;,...,10.1152/jappl.1985.59.4.1333,3902780.0,2/1/1982 0:00,10.1152/jappl.1982.52.2.370,7037714.0,Retraction,+Concerns/Issues About Data;+Results Not Repro...,Yes,,3
52284,6,Inhibition of converting enzyme activity by ac...,(BLS) Biology - Molecular;(HSC) Medicine - Pul...,Departments of Pediatrics (Pulmonary Division)...,Journal of Applied Physiology,APS: American Physiological Society,United States,S Alex Stalcup;Joel S Lipset;Paul M Legant;Phi...,,Research Article;,...,10.1152/jappl.1985.59.4.1333,3902780.0,2/1/1979 0:00,10.1152/jappl.1979.46.2.227,217854.0,Retraction,+Results Not Reproducible;,Yes,,3
52285,5,Effect of Perindopril on Large Artery Stiffnes...,(BLS) Biochemistry;(HSC) Medicine - Cardiology...,"Alfred and Baker Medical Unit, Baker Heart Res...",JAMA: Journal of the American Medical Association,American Medical Association,Australia,Anna A Ahimastos;Anuradha Aggarwal;Kellie M D'...,http://retractionwatch.com/2015/11/23/jama-ret...,Clinical Study;Research Article;,...,10.1001/jama.2015.16678,26594834.0,10/3/2007 0:00,10.1001/jama.298.13.1539,1791149.0,Retraction,+Falsification/Fabrication of Data;+Investigat...,No,,4
52288,2,Regulation of Wnt/beta-catenin pathway by cPLA...,(BLS) Biology - Cancer;(BLS) Biology - Cellula...,"Department of Pathology, University of Pittsbu...",Journal of Cellular Biochemistry,Wiley,United States,Chang Han;Kyu Lim;Lihong Xu;Guiying Li;Tong Wu,http://retractionwatch.com/2015/02/09/figure-d...,Research Article;,...,10.1002/jcb.25020,25767853.0,7/17/2008 0:00,10.1002/jcb.21852,18636547.0,Retraction,+Duplication of Image;+Falsification/Fabricati...,No,,3


## Output

To conclude, we write the resulting data frame into a .csv file:

In [30]:

# Save data frame as .csv

df.to_csv(output_path, index=False)
