# **Association between Abstract Characteristics and Ranking Position in Technology-Assisted Reviewing (TAR): data preprocessing**

### **by Isa Spiero**

#### **Part I: Loading TRIPOD characteristics**
In the first part, the datasets containing the titles and abstracts of all records of the reviews are loaded, including the corresponding labels for the title-abstract level inclusions and the full-text level inclusions. These datasets are merged with the datasets containing the TRIPOD characteristics of the full-text level inclusions of the reviews.

#### **Part II: Adding structure characteristics**
In the second part, the structural characteristics of the abstracts are derived and added to the datasets: the number of words in the abstract, the average sentence length, and the abstract structuring (structured vs unstructured). 

#### **Part III: Adding terminology characteristics**
In the third part, the terminology usage of each abstract is quantified: the mean of the abstract's TF-IDF vector is compared with the average of these means across the full-text level inclusions of the dataset. The absolute difference serves as a measure of aberrant terminology usage per abstract, with larger values indicating more aberrant abstracts.


In [1]:
import os
import pandas as pd
import numpy as np
import nltk
import re
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings

warnings.filterwarnings('ignore')

In [2]:
path_data = '../data/'

#### **Part I: Loading TRIPOD characteristics**

**Part I.a Dataset based on the systematic review by Andaur Navarro *et al.* (2022): 'Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review'**

Load the data containing the TRIPOD scores:

In [3]:
df1_scores = pd.read_csv(path_data + 'raw/AndaurNavarro_et_al_2022/20201127_DATA_TRIPOD.csv')

# Select only the relevant columns (copy, so that later edits do not act on a view of df1_scores):
df1_scores_sel = df1_scores[['article_id', 
                             't_title_1', 't_title_2', 't_title_3', 't_title_4',
                             't_abstract_1', 't_abstract_2', 't_abstract_3', 't_abstract_4',
                             't_abstract_5', 't_abstract_6', 't_abstract_7', 't_abstract_8',
                             't_abstract_9', 't_abstract_10', 't_abstract_11', 't_abstract_12']].copy()

# Rename the columns:
df1_scores_sel.rename(columns={'t_title_1': '1i',
                               't_title_2': '1ii',
                               't_title_3': '1iii',
                               't_title_4': '1iv',
                               't_abstract_1': '2i',
                               't_abstract_2': '2ii',
                               't_abstract_3': '2iii',
                               't_abstract_4': '2iv',
                               't_abstract_5': '2v',
                               't_abstract_6': '2vi',
                               't_abstract_7': '2vii',
                               't_abstract_8': '2viii',
                               't_abstract_9': '2ix',
                               't_abstract_10': '2x',
                               't_abstract_11': '2xi',
                               't_abstract_12': '2xii',}, inplace=True)

# Check that there are 152 inclusions in the review that were scored with TRIPOD
print(len(df1_scores_sel))

#df1_scores_sel.head()

152


In [4]:
# # To check the unique values per column:

# selected_columns = df1_scores_sel.filter(regex='^(1|2)')
# # Get unique values for each selected column
# unique_values_dict = {col:df1_scores_sel[col].unique() for col in selected_columns}

# # Print unique values
# for col, unique_values in unique_values_dict.items():
#     print(f"{col}: {unique_values}")

In [5]:
# Replace values in column '2vi' for consistency:
# Reason: based on the file '20201021_CODEBOOK_TRIPOD.r' of the review (derived from Dataverse),
# criterion 2vi was the only one with three levels, coded 1='YES', 2='NO', and 3='NA'.
# All other criteria were coded with 1='YES' and 0='NO'.
df1_scores_sel['2vi'] = df1_scores_sel['2vi'].replace({1: 1, 2: 0, 3: np.nan})
df1_scores_sel['2vi'] = df1_scores_sel['2vi'].astype('Int64')

#df1_scores_sel.head()
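
The recoding above can be illustrated on a toy series. This is a minimal sketch with made-up values (the name `raw` is hypothetical), assuming the codes 1, 2, and 3 stand for 'YES', 'NO', and 'NA' as in the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical raw codes for criterion 2vi: 1 = YES, 2 = NO, 3 = NA
raw = pd.Series([1, 2, 3, 1], name='2vi')

# Map to the coding used by the other criteria (1 = YES, 0 = NO, missing = NA)
recoded = raw.replace({1: 1, 2: 0, 3: np.nan})

# The nullable integer dtype keeps 0/1 as integers while still allowing missing values
recoded = recoded.astype('Int64')

print(recoded)
```

The nullable `Int64` dtype avoids the float representation that a plain integer column would otherwise fall back to once a missing value is present.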

In [6]:
# Compute the total number of applicable items (= non-NaN values) in the TRIPOD scoring, the total number of reported (= 1) TRIPOD items,
# and the percentage of applicable items that were reported:
selected_cols = df1_scores_sel.filter(regex='^(1|2)')
df1_scores_sel['total_applicable'] = selected_cols.notna().sum(axis=1)
df1_scores_sel['total_reported'] = selected_cols.eq(1).sum(axis=1)
df1_scores_sel['percentage_reported'] = df1_scores_sel['total_reported'] / df1_scores_sel['total_applicable'] * 100

#df1_scores_sel.head()
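
To make the three summary columns concrete, the following sketch repeats the computation on a small made-up dataframe. The `toy` frame and its values are assumptions for illustration only and do not come from the review data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores: 1 = reported, 0 = not reported, NaN = not applicable
toy = pd.DataFrame({
    'article_id': [101, 102],
    '1i': [1, 0],
    '2i': [1, np.nan],
    '2ii': [0, 1],
    '2iii': [1, np.nan],
})

# Select the TRIPOD item columns, whose names start with '1' or '2'
items = toy.filter(regex='^(1|2)')

toy['total_applicable'] = items.notna().sum(axis=1)   # non-NaN items per record
toy['total_reported'] = items.eq(1).sum(axis=1)       # items scored 1 per record
toy['percentage_reported'] = toy['total_reported'] / toy['total_applicable'] * 100

print(toy[['article_id', 'total_applicable', 'total_reported', 'percentage_reported']])
```

In this toy example the first record has 3 of 4 applicable items reported (75%) and the second record 1 of 2 (50%). Note that the regular expression selects every column whose name starts with '1' or '2', so it relies on no other columns following that naming pattern.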

Load the data containing the inclusion labels:

In [7]:
# This file contains the abstracts, titles, and title-abstract level inclusions
# The article_ids column was manually added based on JCMachineLearningSys-Datalinked_DATA_2024-10-02_1410.csv
df1_labels = pd.read_excel(path_data + 'raw/AndaurNavarro_et_al_2022/Prog_reporting_labeled_ids.xlsx')

# Only the 152 inclusions have an article_id (added manually to the file), fill the others with 0
df1_labels['article_id'] = df1_labels['article_id'].fillna(0)
df1_labels['article_id'] = df1_labels['article_id'].astype(int)
print(df1_labels['article_id'].ne(0).sum())

# Add the label '1' for each of 152 full-text level inclusions and leave it as '0' for the exclusions
df1_labels['label_ft_included'] = np.where(df1_labels['article_id'] == 0, 0, 1)
print(df1_labels['label_ft_included'].sum())

# Check that the label '1' for each of the title-abstract level inclusions corresponds to the correct number of 312
df1_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df1_labels['label_ta_included'].sum())

#df1_labels.head()

152
152
312


Merge the data containing the TRIPOD scores with the data containing the inclusion labels:

In [8]:
df1 = pd.merge(df1_labels, df1_scores_sel, on='article_id', how='outer')

# Check that the dataframe consists of 2482 records in total:
print(len(df1))
# Check that the number of title-abstract inclusions equals 312:
print(df1['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 152:
print(df1['label_ft_included'].sum())
# Check that the number of NaN for the TRIPOD items corresponds to the number of full-text exclusions of 2330:
print(df1['1i'].isna().sum())

#df1.head()

2482
312
152
2330
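
The counts printed above verify the merge indirectly. As a complementary check, `pd.merge` can report where each row came from and whether the key is unique on one side. The sketch below uses made-up frames (`labels`, `scores`) and is only one possible way to do this, not part of the original pipeline:

```python
import pandas as pd

# Hypothetical toy data: article_id 0 marks records without TRIPOD scores
labels = pd.DataFrame({'article_id': [0, 0, 7, 9],
                       'label_ta_included': [0, 1, 1, 1]})
scores = pd.DataFrame({'article_id': [7, 9], '1i': [1, 0]})

merged = pd.merge(
    labels, scores,
    on='article_id', how='outer',
    indicator=True,          # adds a '_merge' column: left_only / right_only / both
    validate='many_to_one',  # raises if article_id is duplicated in `scores`
)

print(merged['_merge'].value_counts())
```

Any `right_only` rows would indicate scored articles that could not be linked to a record in the label file.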


**Part I.b Dataset based on the systematic review by Heus *et al.* (2018): 'Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement'**

Load the data containing the TRIPOD scores:

In [9]:
df2_scores = pd.read_excel(path_data + 'raw/Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                           sheet_name='Overview (n=147)') 

# Select only the relevant columns:
df2_scores_sel = df2_scores[
    ['ID', 'Endnote ID', 
     '1i', '1ii', '1iii', '1iv',
     '2i', '2ii', '2iii', '2iv', '2v',
     '2vi', '2vii', '2viii', '2ix', '2x',
     '2xi', '2xii', '2xiii', '2xiv', '2xv',
     '1i.1', '1ii.1', '1iii.1', '1iv.1',
     '2i.1', '2ii.1', '2iii.1', '2iv.1', '2v.1',
     '2vi.1', '2vii.1', '2viii.1', '2ix.1', '2x.1',
     '2xi.1', '2xii.1', '2xiii.1', '2xiv.1', '2xv.1',
     '1i.2', '1ii.2', '1iii.2', '1iv.2',
     '2i.2', '2ii.2', '2iii.2', '2iv.2', '2v.2',
     '2vi.2', '2vii.2', '2viii.2', '2ix.2', '2x.2',
     '2xi.2', '2xii.2', '2xiii.2', '2xiv.2', '2xv.2'
    ]
]

# List of sets of columns to merge, since columns are spread according to prediction model type:
columns_to_merge = [
    ['1i', '1i.1', '1i.2'],
    ['1ii', '1ii.1', '1ii.2'],
    ['1iii', '1iii.1', '1iii.2'],
    ['1iv', '1iv.1', '1iv.2'],
    ['2i', '2i.1', '2i.2'],
    ['2ii', '2ii.1', '2ii.2'],
    ['2iii', '2iii.1', '2iii.2'],
    ['2iv', '2iv.1', '2iv.2'],
    ['2v', '2v.1', '2v.2'],
    ['2vi', '2vi.1', '2vi.2'],
    ['2vii', '2vii.1', '2vii.2'],
    ['2viii', '2viii.1', '2viii.2'],
    ['2ix', '2ix.1', '2ix.2'],
    ['2x', '2x.1', '2x.2'],
    ['2xi', '2xi.1', '2xi.2'],
    ['2xii', '2xii.1', '2xii.2'],
    ['2xiii', '2xiii.1', '2xiii.2'],
    ['2xiv', '2xiv.1', '2xiv.2'],
    ['2xv', '2xv.1', '2xv.2']
]

df2_scores_merg = df2_scores_sel.copy()  

# Loop through each set of columns and merge them
for cols in columns_to_merge:
    df2_scores_merg[cols[0]] = df2_scores_merg[cols].apply(
        # If all scores for item x are 1, set final score to 1
        lambda row: 1 if set(row.dropna().unique()) == {1} 
        # If all scores for item x are 0, set final score to 0
        else 0 if set(row.dropna().unique()) == {0} 
        # If scores for item x are a mix of 0 and 1, set final score to 1
        # regardless of how many 0s or 1s
        else 1 if set(row.dropna().unique()) == {0, 1} 
        # If score for item x is only 3 (which codes for NA) or missing, set final score to NA
        else float('nan') if set(row.dropna().unique()).issubset({3}) 
        # Any other combination of scores is also set to NA
        else float('nan'), 
        axis=1
    )
    # Drop the merged columns except the first one
    df2_scores_merg = df2_scores_merg.drop(columns=cols[1:])

#df2_scores_merg.head()
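
The merging rule in the loop above can also be written as a small named function, which may be easier to test in isolation. This is a sketch under the assumption that it should reproduce the lambda exactly; the function name `merge_item_scores` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd

def merge_item_scores(row):
    """Collapse the scores of one TRIPOD item across the model-type columns."""
    values = set(row.dropna().unique())
    # Only 0s and/or 1s present: the item counts as reported if any score is 1
    if values and values <= {0, 1}:
        return 1 if 1 in values else 0
    # Only 3 (= NA), only missing values, or any other combination: not applicable
    return np.nan

# Hypothetical scores for item 1i across three model-type columns
toy = pd.DataFrame({
    '1i':   [1, 0, 0, 3, np.nan],
    '1i.1': [1, 0, 1, np.nan, np.nan],
    '1i.2': [np.nan, 0, 0, 3, np.nan],
})

merged = toy.apply(merge_item_scores, axis=1)
print(merged)
```

The first three toy rows collapse to 1, 0, and 1, and the last two rows become NaN.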

In [21]:
# # To check the unique values per column:

# selected_columns = df2_scores_merg.filter(regex='^(1|2)')
# # Get unique values for each selected column
# unique_values_dict = {col:df2_scores_merg[col].unique() for col in selected_columns}

# # Print unique values
# for col, unique_values in unique_values_dict.items():
#     print(f"{col}: {unique_values}")

Load the data containing the inclusion labels:

In [11]:
df2_labels = pd.read_excel(path_data + 'raw/Heus_et_al_2018/Prog_tripod_labeled.xlsx')

# Check the total number of records equals 4871:
print(len(df2_labels))

# Check that the number of title-abstract level inclusions equals 347:
df2_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df2_labels['label_ta_included'].sum())

df2_labels.head()

4871
347


| Unnamed: 0.1 | Unnamed: 0 | type | authors | year | title | journal | pmid | keywords | abstract | language | label_ta_included |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Journal Article | | 2014 | Factors associated with short-term changes in ... | AIDS | 24959963 | | OBJECTIVES: Among antiretroviral therapy (ART)... | eng | 0 |
| 1 | 1 | Journal Article | | 2014 | Reducing Injury Risk From Body Checking in Boy... | Pediatrics | 24864185 | | Ice hockey is an increasingly popular sport th... | Eng | 0 |
| 2 | 2 | Journal Article | | 2014 | Hypothermia and Neonatal Encephalopathy | Pediatrics | 24864176 | | Data from large randomized clinical trials ind... | Eng | 0 |
| 3 | 3 | Journal Article | | 2014 | Incontinence: Leak point pressure predicts suc... | Nat Rev Urol | 24861330 | | | eng | 0 |
| 4 | 4 | Journal Article | | 2014 | Identifying Gene-Environment Interactions in S... | Schizophr Bull | 24860087 | | Recent years have seen considerable progress i... | eng | 0 |


Load the data with the PubMed IDs (pmid) to link the TRIPOD scores with the labels:

In [12]:
# A function to parse the ris file
def parse_ris_file(file_path):
    references = []
    entry = {}

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip() == "":  
                continue

            if line.startswith("TY  -"):  
                if entry:  
                    references.append(entry)
                entry = {}  
            try:
                tag, value = line.split('  - ', 1)  
                entry[tag] = value.strip()  
            except ValueError:
                continue  

        if entry:
            references.append(entry)

    return references

# Convert parsed data into a pandas DataFrame
parsed_data = parse_ris_file(path_data + 'raw/Heus_et_al_2018/TRIPOD adherence included_final-Converted.txt')
df2_link = pd.DataFrame(parsed_data)

# The column 'AN' contains the PubMed IDs:
#df2_link.head()
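
To show what the parser produces, the sketch below applies the same tag-splitting logic to a few made-up RIS lines instead of a file. The helper `parse_ris_lines` and the record contents are hypothetical; only the `ID` and `AN` tags are taken from the cells in this notebook:

```python
import pandas as pd

def parse_ris_lines(lines):
    """Parse RIS-style lines ('TAG  - value') into a list of dicts, one per record."""
    references, entry = [], {}
    for line in lines:
        if not line.strip():
            continue
        # A 'TY' tag starts a new record: store the previous one first
        if line.startswith('TY  -') and entry:
            references.append(entry)
            entry = {}
        tag, sep, value = line.partition('  - ')
        if sep:  # skip lines that do not follow the 'TAG  - value' layout
            entry[tag] = value.strip()
    if entry:
        references.append(entry)
    return references

# Hypothetical RIS content with two records
toy_ris = [
    'TY  - JOUR', 'ID  - 12', 'AN  - 24959963', 'TI  - First toy record', 'ER  - ',
    '',
    'TY  - JOUR', 'ID  - 15', 'AN  - 24864185', 'ER  - ',
]

refs = parse_ris_lines(toy_ris)
df_link = pd.DataFrame(refs)
print(df_link[['ID', 'AN']])
```

As in the function above, a tag that occurs several times within one record (for example multiple author lines) keeps only its last value, which is sufficient here because only `ID` and `AN` are used.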

Merge the PubMed IDs with the dataframe containing the TRIPOD scores:

In [13]:
# Convert both column types to string:
df2_scores_merg['ID'] = df2_scores_merg['ID'].astype(str)
df2_link['ID'] = df2_link['ID'].astype(str)

# Merge the PubMed IDs with the dataframe with the TRIPOD scores:
df2_scores_pmid = df2_scores_merg.merge(df2_link[['ID', 'AN']], on='ID', how='left')

# Rename the column:
df2_scores_pmid.rename(columns={'AN': 'pmid'}, inplace=True)

# Add a column for full-text level inclusions:
df2_scores_pmid['label_ft_included'] = int(1)

# Compute the total number of applicable items (= non-NaN values) in the TRIPOD scoring, the total number of reported (= 1) TRIPOD items, 
# and the percentage of applicable items that were reported:
selected_cols = df2_scores_pmid.filter(regex='^(1|2)')
df2_scores_pmid['total_applicable'] = selected_cols.notna().sum(axis=1)
df2_scores_pmid['total_reported'] = selected_cols.eq(1).sum(axis=1) + selected_cols.eq(0.5).sum(axis=1) * 0.5
df2_scores_pmid['percentage_reported'] = df2_scores_pmid['total_reported'] / df2_scores_pmid['total_applicable'] * 100

#df2_scores_pmid.head()

In [14]:
df2_scores_pmid['pmid'] = df2_scores_pmid['pmid'].astype(str)
df2_labels['pmid'] = df2_labels['pmid'].astype(str)


df2 = pd.merge(df2_labels, df2_scores_pmid, on='pmid', how='outer', sort=False)
df2 = df2.set_index('pmid').reindex(df2_labels['pmid']).reset_index()


# Change all NaN to 0 for full text level inclusions (inclusions are indicated with 1 already)
df2['label_ft_included'] = df2['label_ft_included'].fillna(int(0))

# Check that the dataframe consists of 4871 records in total:
print(len(df2))
# Check that the number of title-abstract inclusions equals 347:
print(df2['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 147:
print(df2['label_ft_included'].sum())
# Check that the number of NaN for the TRIPOD items corresponds to the number of full-text exclusions of 4724:
print(df2['2vii'].isna().sum())

#df2.head()

4871
347
147.0
4724


#### **Part II: Adding structure characteristics**

Add the number of words per abstract:

In [15]:
df1["word_count"] = df1["abstract"].apply(lambda x: len(str(x).split()))
df2["word_count"] = df2["abstract"].apply(lambda x: len(str(x).split()))

Add the average sentence length per abstract:

In [16]:
def avg_sentence_length(text):
    sentences = sent_tokenize(str(text)) 
    if len(sentences) == 0:
        return 0  
    word_counts = [len(sentence.split()) for sentence in sentences]  
    return sum(word_counts) / len(sentences)  
    
df1["avg_sentence_length"] = df1["abstract"].apply(avg_sentence_length)
df2["avg_sentence_length"] = df2["abstract"].apply(avg_sentence_length)
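
One detail of the string-based counts is worth illustrating: `str()` turns a missing abstract into the text `"nan"`, so a record without an abstract is counted as one word. The sketch below contrasts this with a variant that keeps missing abstracts as NaN. It uses made-up abstracts, and whether missing abstracts should be NaN rather than counted is an assumption, not something the notebook prescribes:

```python
import numpy as np
import pandas as pd

# Hypothetical abstracts: a regular one, a missing one, and an empty string
abstracts = pd.Series(["Two sentences here. And a second one follows.", np.nan, ""])

# Approach used above: str(NaN) is "nan", so the missing abstract counts as 1 word
naive = abstracts.apply(lambda x: len(str(x).split()))

# Alternative sketch: the .str accessor propagates NaN instead of counting it
nan_aware = abstracts.str.split().str.len()

print(pd.DataFrame({'naive': naive, 'nan_aware': nan_aware}))
```

In the toy data the naive counts are 8, 1, and 0, whereas the NaN-aware variant yields 8, NaN, and 0.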

Add structured (1) vs unstructured (0):

In [17]:
# Make the distinction based on the most common words for structured abstracts to start with:
keywords = ["background", "objective", "objectives", "purpose", "introduction", "aim", "aims"] 

df1["structured"] = df1["abstract"].apply(
    lambda x: 1 if str(x).split(":")[0].lower() in [keyword.lower() for keyword in keywords] else 0)

df2["structured"] = df2["abstract"].apply(
    lambda x: 1 if str(x).split(":")[0].lower() in [keyword.lower() for keyword in keywords] else 0)
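
The rule above labels an abstract as structured only when the text before the first colon is exactly one of the keywords. The following sketch applies the same rule to a few made-up abstracts to show which cases it does and does not catch; the function name `is_structured` and the example texts are hypothetical:

```python
import pandas as pd

keywords = ["background", "objective", "objectives", "purpose", "introduction", "aim", "aims"]

def is_structured(abstract):
    """Return 1 if the text before the first colon is exactly one of the keywords."""
    first_segment = str(abstract).split(":")[0].lower()
    return 1 if first_segment in keywords else 0

# Hypothetical abstracts
toy = pd.Series([
    "OBJECTIVES: Among antiretroviral therapy users ...",    # keyword heading
    "Background: Prediction models are widely used ...",     # keyword heading
    "Ice hockey is an increasingly popular sport.",          # no heading at all
    "Background and aims: Prediction models are ...",        # heading is not an exact keyword
])

print(toy.apply(is_structured).tolist())
```

The first two toy abstracts are labelled 1 and the last two 0, which shows that composite headings such as "Background and aims" are not recognised by an exact match.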

#### **Part III: Adding terminology characteristics**

Compute the deviation in terminology usage using TF-IDF vectors:

In [18]:
vectorizer = TfidfVectorizer()

# Make NaN abstracts empty for vectorization
df1["abstract"] = df1["abstract"].fillna("")
df2["abstract"] = df2["abstract"].fillna("")

# Filter only rows where 'label_ft_included' == 1
df1_filtered = df1[df1["label_ft_included"] == 1].copy()
df2_filtered = df2[df2["label_ft_included"] == 1].copy()

# Create the vectors only for filtered rows
tfidf_matrix1 = vectorizer.fit_transform(df1_filtered["abstract"])
tfidf_matrix2 = vectorizer.fit_transform(df2_filtered["abstract"])

# Compute the mean of each vector per abstract
df1_filtered["tfidf_mean"] = [np.nan if text == "" else score for text, score in zip(df1_filtered["abstract"], tfidf_matrix1.mean(axis=1).A1)]
df2_filtered["tfidf_mean"] = [np.nan if text == "" else score for text, score in zip(df2_filtered["abstract"], tfidf_matrix2.mean(axis=1).A1)]

# Compute the average of the means of all vectors (only for included rows) across abstracts
average_tfidf1 = df1_filtered["tfidf_mean"].mean()
average_tfidf2 = df2_filtered["tfidf_mean"].mean()

# Compute the deviation of the mean of an abstract from the overall average across all abstracts
df1_filtered["tfidf_deviation"] = df1_filtered["tfidf_mean"].apply(lambda x: np.nan if pd.isna(x) else x - average_tfidf1)
df2_filtered["tfidf_deviation"] = df2_filtered["tfidf_mean"].apply(lambda x: np.nan if pd.isna(x) else x - average_tfidf2)

df1_filtered["tfidf_deviation"] = df1_filtered["tfidf_deviation"].abs()
df2_filtered["tfidf_deviation"] = df2_filtered["tfidf_deviation"].abs()

# Merge the results back into the original DataFrames
df1 = df1.merge(df1_filtered[["tfidf_mean", "tfidf_deviation"]], how="left", left_index=True, right_index=True)
df2 = df2.merge(df2_filtered[["tfidf_mean", "tfidf_deviation"]], how="left", left_index=True, right_index=True)
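
The steps of the cell above can be followed on a toy corpus. This is a minimal sketch with made-up abstracts, assuming the default settings of `TfidfVectorizer`; it computes the same two columns (`tfidf_mean` and `tfidf_deviation`) but is not a substitute for the computation on the review data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abstracts; the empty string stands for a missing abstract
toy = pd.DataFrame({"abstract": [
    "prediction model for mortality risk",
    "prediction model for stroke risk",
    "zebrafish genome sequencing pipeline",
    "",
]})

# One TF-IDF row per abstract, one column per term in the corpus vocabulary
tfidf = TfidfVectorizer().fit_transform(toy["abstract"])

# Mean TF-IDF weight per abstract (over all vocabulary terms); NaN for empty abstracts
row_means = np.asarray(tfidf.mean(axis=1)).ravel()
toy["tfidf_mean"] = np.where(toy["abstract"] == "", np.nan, row_means)

# Absolute deviation from the average of the per-abstract means (NaN is ignored)
toy["tfidf_deviation"] = (toy["tfidf_mean"] - toy["tfidf_mean"].mean()).abs()

print(toy[["tfidf_mean", "tfidf_deviation"]])
```

Each abstract ends up with a single non-negative number, and abstracts without text remain NaN in both columns.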

In [19]:
# Export the processed files
# Note: keep original names of 'Prog1' and 'Prog3' to be able to merge with rankings of previous simulations that had this naming
df1.to_csv(path_data + 'processed/Prog1_reporting.csv', index=False)
df2.to_csv(path_data + 'processed/Prog3_tripod.csv', index=False)

df1.to_excel(path_data + 'processed/Prog1_reporting.xlsx', index=False)
df2.to_excel(path_data + 'processed/Prog3_tripod.xlsx', index=False)

print("Data preprocessing completed")

Data preprocessing completed
