# 2a. Feature Engineering: Severity Score


## Introduction


This notebook adds a column to the clean data set produced in **Notebook 1**. The new column is a severity score indicating how concerning the retraction associated with each paper in our data set is.

This is accomplished by manually mapping each of the reasons for retraction in the "Retraction Reason" column of our data set to a **severity score** that ranges from 0 to 4. Later on, this will allow us to run our analysis only on papers that clearly point towards some sort of genuine malpractice.

The **input and output parameters** are therefore as follows:

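- Input parameters: **one .csv file** with the clean data set produced in Notebook 1 and **another .csv file** with the severity score correspondences that we created manually.
- Output parameters: **one .csv file** with the data set extended by the severity score column and restricted to papers at or above the chosen severity threshold.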


## Input / Output Parameters

- Input parameters:

In [2]:

# Input path for severity score correspondence

input_path_severity_score = "../data/reasons_severityscore.csv"

# Input path for clean data set

input_path_data_set = "../data/data_sets/clean_data_set.csv"

# Severity threshold
# Only papers with this severity score or higher will be kept in our data frame

severity_threshold = 3

- Output parameters:

In [3]:

# Output path for data set with severity score column

output_path = "../data/data_sets/clean_data_severity.csv"

## Importing Libraries

In [4]:

# Import required libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno
import warnings
warnings.filterwarnings("ignore")

## Adding a "Severity Score" to our Data Set

- Let us start by loading the correspondence between reasons for retraction and their associated severity scores. We read it from the .csv file that we created manually:

In [5]:

# Load severity score data from .csv file into data frame

df_severity = pd.read_csv(input_path_severity_score, sep = ";")

# Visualize result

df_severity

Unnamed: 0,reason,severity_score
0,Author Unresponsive,0
1,Bias Issues or Lack of Balance,3
2,Breach of Policy by Author,3
3,Cites Retracted Work,3
4,Civil Proceedings,4
...,...,...
104,Upgrade/Update of Prior Notice,0
105,Withdrawal,2
106,Withdrawn (out of date),2
107,Withdrawn to Publish in Different Journal,0
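
- Because the correspondence table is maintained by hand, a quick sanity check before using it may be worthwhile. The sketch below is not part of the original notebook: `check_severity_table` and `toy_table` are hypothetical names, and the toy rows are made up to trigger both checks. It assumes only the `reason` and `severity_score` columns shown above and the 0 to 4 score range stated in the introduction.

```python
import pandas as pd


def check_severity_table(table):
    """Return a list of human-readable problems found in a reason/score table.

    Hypothetical helper: checks that every reason appears only once and that
    every score lies in the 0-4 range described in the introduction.
    """
    problems = []

    # A reason listed twice would make its score ambiguous
    duplicated = table.loc[table["reason"].duplicated(), "reason"].unique()
    if len(duplicated) > 0:
        problems.append(f"duplicate reasons: {sorted(duplicated)}")

    # between() is inclusive on both ends by default
    out_of_range = table.loc[~table["severity_score"].between(0, 4), "reason"]
    if not out_of_range.empty:
        problems.append(f"scores outside 0-4 for: {sorted(out_of_range)}")

    return problems


# Toy table with one duplicated reason and one out-of-range score (illustrative only)
toy_table = pd.DataFrame({
    "reason": ["Author Unresponsive", "Withdrawal", "Withdrawal"],
    "severity_score": [0, 2, 7],
})

problems = check_severity_table(toy_table)
for problem in problems:
    print(problem)
```

In the notebook itself the same function could be called on `df_severity`; an empty list would mean neither problem was found.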


- Next we will load the clean data set that we generated in Notebook 1:

In [6]:

# Read clean data set from input path

df = pd.read_csv(input_path_data_set)

# Visualize result

df.head(1)

Unnamed: 0,record_id,title,subject,institution,journal,publisher,country,author,urls,article_type,retraction_date,retraction_doi,retraction_pubmed_id,original_paper_date,original_paper_doi,original_paper_pubmed_id,retraction_nature,reason,paywalled,notes
0,52765,An integrated 3D model based face recognition ...,(B/T) Data Science;(B/T) Technology;,"Management Information System Department, Cypr...",Applied Nanoscience,Springer,Turkey,Ali Milad;Kamil Yurtkan,https://retractionwatch.com/2024/01/26/springe...,Research Article;,6/30/2023 0:00,10.1007/s13204-024-03010-9,0.0,2/3/2022 0:00,10.1007/s13204-021-02123-9,0.0,Correction,+Error in Text;+Updated to Retraction;,No,See also: https://pubpeer.com/publications/BCC...


- Having done that, we can add a new column to our data frame with the severity score associated with the reasons why each paper was retracted. In many cases, a single paper has more than one reason for retraction, and those reasons carry different severity scores. In these cases, we assign the paper the highest severity score among its reasons:

In [15]:

# Create new column with list of reasons in "reason" column 

df['reason_list'] = df['reason'].apply(lambda x: [i.strip('+; ') for i in x.split(';') if i.strip()])

# Define function to calculate the maximum severity score of those in the reasons_list column

def get_max_severity(reason_list):
    """Function takes a list with reasons for retraction, checks what are the severity scores
    associated to each individual reason, then returns the maximum score in that collection."""
    
    # Create list with scores associated to reasons in input list
    
    scores = df_severity[df_severity['reason'].isin(reason_list)]['severity_score']
    
    # Return maximum value in the scores list, or None if no reason matched
    
    return max(scores) if not scores.empty else None

# Call function to create new "severity_score" column with maximum severity score for each paper

df['severity_score'] = df['reason_list'].apply(get_max_severity)

# Convert elements of "severity_score" column to int type

df['severity_score'] = df['severity_score'].astype(int)
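
- The same "highest score wins" rule can also be sketched without a per-row function call, by exploding the reason lists and mapping them through a lookup. This is an alternative illustration, not the method used above. `df_toy`, `df_severity_toy` and `score_lookup` are hypothetical names; the three scores are taken from the table shown earlier, and "Some Unlisted Reason" is an invented entry to show what happens when nothing matches. In that case `get_max_severity` returns `None`, and the `astype(int)` call above would be expected to raise, so the sketch uses the nullable `Int64` dtype instead.

```python
import pandas as pd

# Small stand-in for df_severity (scores copied from the table shown earlier)
df_severity_toy = pd.DataFrame({
    "reason": ["Author Unresponsive", "Cites Retracted Work", "Civil Proceedings"],
    "severity_score": [0, 3, 4],
})

# Small stand-in for df; the last reason is invented and has no score
df_toy = pd.DataFrame({
    "reason": [
        "+Author Unresponsive;+Cites Retracted Work;",
        "+Civil Proceedings;",
        "+Some Unlisted Reason;",
    ]
})

# Split the raw string into a clean list of reasons, as in the cell above
df_toy["reason_list"] = df_toy["reason"].apply(
    lambda x: [i.strip("+; ") for i in x.split(";") if i.strip()]
)

# Lookup from reason to score (assumes each reason appears once in the table)
score_lookup = df_severity_toy.set_index("reason")["severity_score"]

# One row per (paper, reason); unmatched reasons map to NaN
exploded = df_toy["reason_list"].explode()
max_scores = exploded.map(score_lookup).groupby(level=0).max()

# Nullable integer dtype keeps unmatched papers as <NA> instead of raising
df_toy["severity_score"] = max_scores.astype("Int64")

print(df_toy[["reason", "severity_score"]])
```

Whether papers without any matching reason should be dropped, scored 0, or inspected by hand is a decision the original notebook does not state, so the sketch leaves them as missing.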


- We can now look at how the papers in our data set are distributed across severity scores:

In [22]:

# Obtain value count of each severity score in our data frame

score_counts = df['severity_score'].value_counts()

# Print value counts

score_counts


severity_score
3    23374
4    18805
0     4955
2     4412
1      744
Name: count, dtype: int64

- Finally, we can go ahead and remove all papers with a severity score below the severity threshold that we chose at the beginning of this notebook:

In [25]:

# Filter the data frame to include only papers at or above our severity threshold

df = df[df['severity_score'] >= severity_threshold]
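
- To get a feel for how the choice of threshold affects the size of the filtered data set, the counts printed above can be summed for each candidate threshold. The sketch below is an illustration only: `rows_kept` is a hypothetical helper, and the numbers are copied from the `value_counts()` output shown earlier, taken at face value.

```python
import pandas as pd

# Counts copied from the value_counts() output above (index = severity score)
score_counts = pd.Series({3: 23374, 4: 18805, 0: 4955, 2: 4412, 1: 744})


def rows_kept(counts, threshold):
    """Number of papers whose severity score is at or above the threshold."""
    return int(counts[counts.index >= threshold].sum())


# Show how many papers would survive each possible threshold
for threshold in range(5):
    print(f"threshold {threshold}: {rows_kept(score_counts, threshold)} papers kept")
```

With `severity_threshold = 3`, this gives 42179 of the 52290 papers counted above.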


## Output

- To conclude, we write the resulting data frame into a .csv file:

In [26]:

# Save data frame as .csv

df.to_csv(output_path, index=False)
