# **Association between Abstract Characteristics and Ranking Position in Technology-Assisted Reviewing (TAR): data preprocessing**

### **by Isa Spiero**

#### **Part I: Loading TRIPOD characteristics**
In the first part, the datasets containing the TRIPOD characteristics of the studies included in each review are loaded and merged with the title-abstract level and full-text level inclusion labels of the respective reviews.

#### **Part II: Adding structure characteristics**
In the second part, the structural characteristics of the abstracts are derived and added to the datasets: the number of words in the abstract, the average sentence length, and the abstract structuring (structured vs unstructured). 

#### **Part III: Adding terminology characteristics**
In the third part, the terminology usage of each abstract is quantified by comparing the mean of its TF-IDF vector with the average of these means across the entire dataset. The difference serves as a measure of aberrant terminology usage per abstract, with larger values indicating more aberrant abstracts than smaller values.


In [2]:
import os
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
path_data = '../data/'

#### **Part I: Loading TRIPOD characteristics**

**Part I.a Dataset based on Andaur Navarro et al. 2022: 'Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review'**

Load the data containing the TRIPOD scores:

In [7]:
df1_scores = pd.read_csv(path_data + 'raw/AndaurNavarro_et_al_2022/20201127_DATA_TRIPOD.csv')

# Select only the relevant columns:
df1_scores_sel = df1_scores[['article_id', 
                             't_title_1', 't_title_2', 't_title_3', 't_title_4',
                             't_abstract_1', 't_abstract_2', 't_abstract_3', 't_abstract_4',
                             't_abstract_5', 't_abstract_6', 't_abstract_7', 't_abstract_8',
                             't_abstract_9', 't_abstract_10', 't_abstract_11', 't_abstract_12']].copy()

# Rename the columns (the explicit .copy() above avoids pandas' SettingWithCopyWarning):
df1_scores_sel.rename(columns={'t_title_1': '1i',
                               't_title_2': '1ii',
                               't_title_3': '1iii',
                               't_title_4': '1iv',
                               't_abstract_1': '2i',
                               't_abstract_2': '2ii',
                               't_abstract_3': '2iii',
                               't_abstract_4': '2iv',
                               't_abstract_5': '2v',
                               't_abstract_6': '2vi',
                               't_abstract_7': '2vii',
                               't_abstract_8': '2viii',
                               't_abstract_9': '2ix',
                               't_abstract_10': '2x',
                               't_abstract_11': '2xi',
                               't_abstract_12': '2xii',}, inplace=True)

# Check that there are 152 inclusions in the review that were scored with TRIPOD
print(len(df1_scores_sel))

df1_scores_sel.head()

152


Unnamed: 0,article_id,1i,1ii,1iii,1iv,2i,2ii,2iii,2iv,2v,2vi,2vii,2viii,2ix,2x,2xi,2xii
0,172,0,0,1,1,1,0,0,1,0,2,0,1,1,0,0,1
1,174,0,1,0,1,0,1,1,1,1,2,0,1,1,1,0,1
2,175,0,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1
3,177,0,1,1,1,1,1,0,1,1,2,1,1,1,0,0,1
4,178,0,1,0,1,1,0,0,0,0,2,1,1,1,0,0,1
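The renaming above maps the raw column names onto TRIPOD item labels (item 1 for the title, item 2 for the abstract, each followed by a lower-case Roman numeral). As a minimal sketch, the same mapping could be generated instead of typed out. The helper names `to_roman` and `tripod_rename_map` are hypothetical and not part of the original notebook.

```python
def to_roman(n: int) -> str:
    """Lower-case Roman numeral for a small positive integer (valid up to 39)."""
    symbols = [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in symbols:
        while n >= value:
            out += symbol
            n -= value
    return out


def tripod_rename_map(n_title: int = 4, n_abstract: int = 12) -> dict:
    """Build {'t_title_1': '1i', ..., 't_abstract_12': '2xii'}."""
    mapping = {f"t_title_{i}": f"1{to_roman(i)}" for i in range(1, n_title + 1)}
    mapping.update(
        {f"t_abstract_{i}": f"2{to_roman(i)}" for i in range(1, n_abstract + 1)}
    )
    return mapping


rename_map = tripod_rename_map()
print(rename_map["t_title_4"], rename_map["t_abstract_12"])
```

The resulting dictionary could be passed to `DataFrame.rename(columns=rename_map)` in place of the literal dictionary.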


Load the data containing the inclusion labels:

In [9]:
# This file contains the abstracts, titles, and title-abstract level inclusions
# The article_ids column was manually added based on JCMachineLearningSys-Datalinked_DATA_2024-10-02_1410.csv
df1_labels = pd.read_excel(path_data + 'raw/AndaurNavarro_et_al_2022/Prog_reporting_labeled_ids.xlsx')

# Only the 152 inclusions have an article_id, fill the others with 0
df1_labels['article_id'] = df1_labels['article_id'].fillna(0)
df1_labels['article_id'] = df1_labels['article_id'].astype(int)
print(df1_labels['article_id'].ne(0).sum())

# Add the 152 full-text level inclusions
df1_labels['label_ft_included'] = np.where(df1_labels['article_id'] == 0, 0, 1)
print(df1_labels['label_ft_included'].sum())

# Check the 312 title-abstract level inclusions
df1_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df1_labels['label_ta_included'].sum())

df1_labels.head()

152
152
312


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,label_ta_included,article_id,label_ft_included
0,2278,JOUR,"['Csato V', 'Kadir SZSA', 'Khavandi K', 'Benne...",2019.0,"""A Step and a Ceiling"": mechanical properties ...",,,"['eppi-reviewer4', 'Ca2+ spark', 'oxidant sign...",We investigated the biomechanical relationship...,0,0,0
1,1242,JOUR,,2019.0,"""Implications of emotion regulation strategies...",,,['eppi-reviewer4'],"Reports an error in ""Implications of emotion r...",0,0,0
2,1632,JOUR,"['Moyano J', 'Mases L', 'Izeta T', 'Flores T',...",2019.0,"""In Vitro"" Study About Variables that Influenc...",,,"['eppi-reviewer4', 'conventional brackets', 'f...",Many advantages have been described surroundin...,0,0,0
3,187,JOUR,"['Song J', 'Han K', 'Lee D', 'Kim SW']",2018.0,"""Is a picture really worth a thousand words?"":...",,,"['eppi-reviewer4', 'Adolescent', 'Age Factors'...",Because using social media has become a major ...,0,0,0
4,2406,JOUR,"['Rodrigues MAV', 'Olmos RD', 'Kira CM', 'Lotu...",2019.0,"""Shadow"" OSCE examiner. A cross-sectional stud...",,,['eppi-reviewer4'],OBJECTIVES: Feedback is a powerful learning to...,0,0,0
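The cell above uses `0` as a sentinel for records without an `article_id`, which works as long as no scored article has the ID 0. As an alternative sketch on toy data (the values below are synthetic), the full-text label can be derived directly from whether an ID is present, and the IDs kept in pandas' nullable integer type:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the labelled records; only one record has an article_id
labels = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "label_included": [1, 0, 1, 0],
    "article_id": [172.0, np.nan, np.nan, np.nan],
})

# Full-text inclusion is 1 exactly where an article_id exists
labels["label_ft_included"] = labels["article_id"].notna().astype(int)

# Nullable integer dtype keeps missing IDs as <NA> instead of a sentinel value
labels["article_id"] = labels["article_id"].astype("Int64")

print(labels[["article_id", "label_ft_included"]])
```

Whether this is preferable depends on how the downstream merge should treat missing keys, so it is offered only as an option.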


Merge the data containing the TRIPOD scores with the data containing the inclusion labels:

In [11]:
df1 = pd.merge(df1_labels, df1_scores_sel, on='article_id', how='outer')

# Check that the dataframe consists of 2482 records in total:
print(len(df1))
# Check that the number of title-abstract inclusions equals 312:
print(df1['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 152:
print(df1['label_ft_included'].sum())
# Check that the number of NaN for the TRIPOD items corresponds to the number of full-text exclusions (2330):
print(df1['1i'].isna().sum())

df1.head()

2482
312
152
2330


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,label_ta_included,...,2iii,2iv,2v,2vi,2vii,2viii,2ix,2x,2xi,2xii
0,2278,JOUR,"['Csato V', 'Kadir SZSA', 'Khavandi K', 'Benne...",2019.0,"""A Step and a Ceiling"": mechanical properties ...",,,"['eppi-reviewer4', 'Ca2+ spark', 'oxidant sign...",We investigated the biomechanical relationship...,0,...,,,,,,,,,,
1,1242,JOUR,,2019.0,"""Implications of emotion regulation strategies...",,,['eppi-reviewer4'],"Reports an error in ""Implications of emotion r...",0,...,,,,,,,,,,
2,1632,JOUR,"['Moyano J', 'Mases L', 'Izeta T', 'Flores T',...",2019.0,"""In Vitro"" Study About Variables that Influenc...",,,"['eppi-reviewer4', 'conventional brackets', 'f...",Many advantages have been described surroundin...,0,...,,,,,,,,,,
3,187,JOUR,"['Song J', 'Han K', 'Lee D', 'Kim SW']",2018.0,"""Is a picture really worth a thousand words?"":...",,,"['eppi-reviewer4', 'Adolescent', 'Age Factors'...",Because using social media has become a major ...,0,...,,,,,,,,,,
4,2406,JOUR,"['Rodrigues MAV', 'Olmos RD', 'Kira CM', 'Lotu...",2019.0,"""Shadow"" OSCE examiner. A cross-sectional stud...",,,['eppi-reviewer4'],OBJECTIVES: Feedback is a powerful learning to...,0,...,,,,,,,,,,
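The printed counts above confirm the merge by eye. A minimal sketch of how the same checks could be made explicit uses the `validate` and `indicator` parameters of `pd.merge`. The toy frames below are synthetic and only mimic the structure (many label rows sharing the placeholder ID 0, unique IDs in the scores).

```python
import pandas as pd

labels = pd.DataFrame({
    "article_id": [0, 0, 172, 174],
    "label_ta_included": [0, 1, 1, 1],
})
scores = pd.DataFrame({"article_id": [172, 174], "1i": [0, 0]})

merged = pd.merge(
    labels, scores,
    on="article_id", how="outer",
    validate="many_to_one",  # raises if article_id is duplicated in scores
    indicator=True,          # adds a '_merge' column: left_only / right_only / both
)

n_both = int((merged["_merge"] == "both").sum())
n_right_only = int((merged["_merge"] == "right_only").sum())

# Every scored article should have found a matching label record
assert n_right_only == 0
print(len(merged), n_both)
```

A non-zero `right_only` count would point to scored articles that are missing from the label file, which an outer merge would otherwise add silently as extra rows.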


**Part I.b Dataset based on Heus et al. 2018: 'Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement'**

Load the data containing the TRIPOD scores:

In [14]:
df2_scores = pd.read_excel(path_data + 'raw/Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                           sheet_name='Overview (n=147)') 

# Select only the relevant columns:
df2_scores_sel = df2_scores[
    ['ID', 'Endnote ID', 
     '1i', '1ii', '1iii', '1iv',
     '2i', '2ii', '2iii', '2iv', '2v',
     '2vi', '2vii', '2viii', '2ix', '2x',
     '2xi', '2xii', '2xiii', '2xiv', '2xv',
     '1i.1', '1ii.1', '1iii.1', '1iv.1',
     '2i.1', '2ii.1', '2iii.1', '2iv.1', '2v.1',
     '2vi.1', '2vii.1', '2viii.1', '2ix.1', '2x.1',
     '2xi.1', '2xii.1', '2xiii.1', '2xiv.1', '2xv.1',
     '1i.2', '1ii.2', '1iii.2', '1iv.2',
     '2i.2', '2ii.2', '2iii.2', '2iv.2', '2v.2',
     '2vi.2', '2vii.2', '2viii.2', '2ix.2', '2x.2',
     '2xi.2', '2xii.2', '2xiii.2', '2xiv.2', '2xv.2'
    ]
]

# List of sets of columns to merge, since columns are spread according to prediction model type:
columns_to_merge = [
    ['1i', '1i.1', '1i.2'],
    ['1ii', '1ii.1', '1ii.2'],
    ['1iii', '1iii.1', '1iii.2'],
    ['1iv', '1iv.1', '1iv.2'],
    ['2i', '2i.1', '2i.2'],
    ['2ii', '2ii.1', '2ii.2'],
    ['2iii', '2iii.1', '2iii.2'],
    ['2iv', '2iv.1', '2iv.2'],
    ['2v', '2v.1', '2v.2'],
    ['2vi', '2vi.1', '2vi.2'],
    ['2vii', '2vii.1', '2vii.2'],
    ['2viii', '2viii.1', '2viii.2'],
    ['2ix', '2ix.1', '2ix.2'],
    ['2x', '2x.1', '2x.2'],
    ['2xi', '2xi.1', '2xi.2'],
    ['2xii', '2xii.1', '2xii.2'],
    ['2xiii', '2xiii.1', '2xiii.2'],
    ['2xiv', '2xiv.1', '2xiv.2'],
    ['2xv', '2xv.1', '2xv.2']
]

# Loop through each set of columns and merge them
df2_scores_merg = df2_scores_sel.copy()  
for cols in columns_to_merge:
    df2_scores_merg[cols[0]] = df2_scores_merg[cols].apply(
        lambda row: row.dropna().iloc[0] if not row.dropna().empty else None, 
        axis=1
    )
    df2_scores_merg = df2_scores_merg.drop(columns=cols[1:])

#df2_scores_merg.head()
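The loop above keeps, for each TRIPOD item, the first non-missing value among the three model-type columns. A possible vectorised sketch of the same idea back-fills along the row axis and takes the first column. The toy data are synthetic, and it is an assumption that the scores are numeric with `NaN` marking the unused model types.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "1i":   [1.0, np.nan, np.nan, np.nan],
    "1i.1": [np.nan, 0.0, np.nan, np.nan],
    "1i.2": [np.nan, np.nan, 1.0, np.nan],
})

cols = ["1i", "1i.1", "1i.2"]
# bfill(axis=1) pulls the first non-missing value leftwards into the first column
toy["1i"] = toy[cols].bfill(axis=1).iloc[:, 0]
toy = toy.drop(columns=cols[1:])

print(toy)
```

Rows in which all three columns are missing stay `NaN`, which corresponds to the `None` branch of the row-wise lambda.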


Load the data containing the inclusion labels:

In [16]:
df2_labels = pd.read_excel(path_data + 'raw/Heus_et_al_2018/Prog_tripod_labeled.xlsx')

# Check the total number of records equals 4871:
print(len(df2_labels))

# Check that the number of title-abstract level inclusions equals 347:
df2_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df2_labels['label_ta_included'].sum())

#df2_labels.head()

4871
347


Load the data with the PubMed IDs (pmid) to link the TRIPOD scores with the labels:

In [18]:
# A function to parse the ris file
def parse_ris_file(file_path):
    references = []
    entry = {}

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip() == "":  
                continue

            if line.startswith("TY  -"):  
                if entry:  
                    references.append(entry)
                entry = {}  
            try:
                tag, value = line.split('  - ', 1)  
                entry[tag] = value.strip()  
            except ValueError:
                continue  

        if entry:
            references.append(entry)

    return references

# Convert parsed data into a pandas DataFrame
parsed_data = parse_ris_file(path_data + 'raw/Heus_et_al_2018/TRIPOD adherence included_final-Converted.txt')
df2_link = pd.DataFrame(parsed_data)

# The column 'AN' contains the PubMed IDs:
#df2_link.head()
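The parser above stores one value per tag, so a tag that occurs several times within a record (for example `AU` for authors) keeps only its last value. That is sufficient here because only `ID` and `AN` are used. As an illustrative sketch, a variant that collects repeated tags into lists could look as follows. The function name `parse_ris_text` and the sample records are hypothetical and synthetic.

```python
import re

# A RIS line: two-character tag, two spaces, a hyphen, then an optional value
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

SAMPLE_RIS = """TY  - JOUR
ID  - 1
AU  - Author A
AU  - Author B
AN  - 11111111
ER  -

TY  - JOUR
ID  - 2
AU  - Author C
AN  - 22222222
ER  -
"""


def parse_ris_text(text):
    """Parse RIS-formatted text into a list of {tag: [values]} dictionaries."""
    records = []
    entry = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match is None:
            continue  # blank lines and lines without a tag are skipped
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":  # start of a new record
            entry = {}
            records.append(entry)
        if entry is None:
            continue  # ignore tags outside a record
        if tag == "ER":  # end of the current record
            entry = None
            continue
        entry.setdefault(tag, []).append(value)
    return records


records = parse_ris_text(SAMPLE_RIS)
print(records[0]["AU"], records[1]["AN"])
```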

Merge the PubMed IDs with the dataframe containing the TRIPOD scores:

In [20]:
# Convert both column types to string:
df2_scores_merg['ID'] = df2_scores_merg['ID'].astype(str)
df2_link['ID'] = df2_link['ID'].astype(str)

# Merge the PubMed IDs with the dataframe with the TRIPOD scores:
df2_scores_pmid = df2_scores_merg.merge(df2_link[['ID', 'AN']], on='ID', how='left')

# Rename the column:
df2_scores_pmid.rename(columns={'AN': 'pmid'}, inplace=True)

# Add a column for full-text level inclusions:
df2_scores_pmid['label_ft_included'] = int(1)

df2_scores_pmid

Unnamed: 0,ID,Endnote ID,1i,1ii,1iii,1iv,2i,2ii,2iii,2iv,...,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv,pmid,label_ft_included
0,4,10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,3.0,0.0,3.0,1.0,24854341,1
1,6,15,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,0.0,0.0,3.0,1.0,24690476,1
2,12,25,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,0.0,3.0,1.0,24958751,1
3,15,28,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,3.0,3.0,0.0,3.0,1.0,24515568,1
4,16,32,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,3.0,3.0,3.0,1.0,3.0,1.0,24419662,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,166,306,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0,24815676,1
143,167,307,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,1.0,1.0,3.0,3.0,3.0,0.0,3.0,1.0,24123609,1
144,168,310,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,1.0,0.0,3.0,1.0,3.0,0.0,3.0,1.0,24520119,1
145,172,328,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,3.0,1.0,24879844,1
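Merging on string keys only works when both sides format the identifier identically. With `astype(str)`, a float `4.0` becomes `"4.0"` and a missing value becomes `"nan"`, so mismatches or spurious matches are possible if one column was read as float. The helper below is a hypothetical sketch of a normalisation step that could be applied to both key columns before merging. It is not part of the original notebook.

```python
import math


def normalize_id(value):
    """Return a canonical string ID, or None when the value is missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            return str(int(value))
    text = str(value).strip()
    # Strings such as "24854341.0" can appear after a float round-trip
    if text.endswith(".0") and text[:-2].isdigit():
        text = text[:-2]
    return text or None


print(normalize_id(24854341.0), normalize_id(" 24854341 "), normalize_id(float("nan")))
```

With pandas, such a function could be applied via `Series.map(normalize_id)` on both `ID` or `pmid` columns.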


In [21]:
df2_scores_pmid['pmid'] = df2_scores_pmid['pmid'].astype(str)
df2_labels['pmid'] = df2_labels['pmid'].astype(str)

df2 = pd.merge(df2_labels, df2_scores_pmid, on='pmid', how='outer')

# Change all NaN to 0 for full text level inclusions (inclusions are indicated with 1 already)
df2['label_ft_included'] = df2['label_ft_included'].fillna(int(0))


# Check that the dataframe consists of 4871 records in total:
print(len(df2))
# Check that the number of title-abstract inclusions equals 347:
print(df2['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 147:
print(df2['label_ft_included'].sum())
# Check that the number of NaN for the TRIPOD items corresponds to the number of full-text exclusions (4724):
print(df2['2vii'].isna().sum())

df2.head()

4871
347
147.0
4724


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,language,...,2vii,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv,label_ft_included
0,1645,Journal Article,R. Haring; N. Friedrich; H. Volzke; R. S. Vasa...,2014,Positive association of serum prolactin concen...,Eur Heart J,22843444,,AIMS: Increased serum prolactin (PRL) concentr...,eng,...,,,,,,,,,,0.0
1,2831,Journal Article,C. Mertens; D. Wiens; H. G. Steveling; A. Sand...,2014,Maxillary sinus-floor elevation with nanoporou...,Clin Implant Dent Relat Res,22897709,,BACKGROUND: Insufficient bone height in the po...,eng,...,,,,,,,,,,0.0
2,3259,Journal Article,G. Peeters; Y. R. van Gellecum; J. G. van Uffe...,2014,Contribution of house and garden work to the a...,Br J Sports Med,22936410,,OBJECTIVE: Although physical activity occurs i...,eng,...,,,,,,,,,,0.0
3,3578,Journal Article,G. E. Romanos; S. May; D. May,2014,Implant-supporting telescopic maxillary prosth...,Clin Implant Dent Relat Res,22998571,,Immediate loading (IL) in the maxilla is a suc...,eng,...,,,,,,,,,,0.0
4,1893,Journal Article,C. Jacobsen; A. Kruse; H. T. Lubbers; R. Zwahl...,2014,Is mandibular reconstruction using vascularize...,Clin Implant Dent Relat Res,22998581,,PURPOSE: this study retrospectively analyzed t...,eng,...,,,,,,,,,,0.0


#### **Part II: Adding structure characteristics**

Add the number of words per abstract:

In [24]:
df1["word_count"] = df1["abstract"].apply(lambda x: len(str(x).split()))
df2["word_count"] = df2["abstract"].apply(lambda x: len(str(x).split()))
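One detail of the word count above: missing abstracts are only replaced by empty strings later, in Part III, so at this point `str(x)` turns a missing value into the text `"nan"`, which counts as one word. A minimal sketch of a variant that returns 0 for missing abstracts follows. The function name `word_count` is an assumption for illustration.

```python
import math


def word_count(text):
    """Number of whitespace-separated tokens; 0 for a missing abstract."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return 0
    return len(str(text).split())


missing = float("nan")
naive_count = len(str(missing).split())  # the string "nan" counts as one word

print(naive_count, word_count(missing), word_count("AIMS: Increased serum prolactin"))
```

Whether 0 or `NaN` is the better value for a missing abstract depends on how the later analysis treats it.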

Add the average sentence length per abstract:

In [26]:
def avg_sentence_length(text):
    sentences = sent_tokenize(str(text)) 
    if len(sentences) == 0:
        return 0  
    word_counts = [len(sentence.split()) for sentence in sentences]  
    return sum(word_counts) / len(sentences)  
    
df1["avg_sentence_length"] = df1["abstract"].apply(avg_sentence_length)
df2["avg_sentence_length"] = df2["abstract"].apply(avg_sentence_length)
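To illustrate what the average sentence length measures, here is a self-contained sketch on a synthetic example. To stay free of external data downloads it swaps NLTK's `sent_tokenize` for a simple regular-expression splitter. That splitter is an assumption and will handle abbreviations and decimals worse than NLTK does.

```python
import re


def split_sentences(text):
    # Stand-in splitter (assumption): break after '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def avg_sentence_length(text):
    """Mean number of whitespace-separated words per sentence; 0 for empty text."""
    sentences = split_sentences(str(text))
    if not sentences:
        return 0
    return sum(len(s.split()) for s in sentences) / len(sentences)


example = "We studied model reporting. Adherence was low overall! Why is that?"
avg = avg_sentence_length(example)  # sentences of 4, 4 and 3 words
print(round(avg, 2))
```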

Add structured (1) vs unstructured (0):

In [28]:
# Classify an abstract as structured if the text before the first colon is one of the most common opening section labels:
keywords = ["background", "objective", "objectives", "purpose", "introduction", "aim", "aims"] 

df1["structured"] = df1["abstract"].apply(
    lambda x: 1 if str(x).split(":")[0].lower() in [keyword.lower() for keyword in keywords] else 0)

df2["structured"] = df2["abstract"].apply(
    lambda x: 1 if str(x).split(":")[0].lower() in [keyword.lower() for keyword in keywords] else 0)
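The rule above can be tried on a few synthetic abstracts to see what it does and does not catch. The sketch below reproduces the same logic in a named function (`is_structured` is a hypothetical name). It shows that compound labels and leading whitespace are classified as unstructured.

```python
KEYWORDS = {"background", "objective", "objectives", "purpose",
            "introduction", "aim", "aims"}


def is_structured(abstract):
    """1 if the text before the first colon is a known section label, else 0."""
    first_segment = str(abstract).split(":")[0].lower()
    return 1 if first_segment in KEYWORDS else 0


examples = [
    "OBJECTIVES: To assess reporting quality.",   # known label
    "We assessed reporting quality in 100 studies.",  # no label
    "Background and aims: Reporting is often poor.",  # compound label is not matched
    " Background: Reporting is often poor.",      # leading space is not matched
]
results = [is_structured(text) for text in examples]
print(results)
```

If such cases occur in the data, stripping whitespace or matching the first word only would be possible refinements, at the cost of changing the classification.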


#### **Part III: Adding terminology characteristics**

Compute the deviation in terminology usage using TF-IDF vectors:

In [31]:
vectorizer = TfidfVectorizer()

# Make NaN abstracts empty for vectorization
df1["abstract"] = df1["abstract"].fillna("")
df2["abstract"] = df2["abstract"].fillna("")

# Create the vectors
tfidf_matrix1 = vectorizer.fit_transform(df1["abstract"])
tfidf_matrix2 = vectorizer.fit_transform(df2["abstract"])

# Compute the mean of each vector
df1["tfidf_mean"] = [np.nan if text == "" else score for text, score in zip(df1["abstract"], tfidf_matrix1.mean(axis=1).A1)]
df2["tfidf_mean"] = [np.nan if text == "" else score for text, score in zip(df2["abstract"], tfidf_matrix2.mean(axis=1).A1)]

# Compute the average of the means of all vectors
average_tfidf1 = df1["tfidf_mean"].mean()
average_tfidf2 = df2["tfidf_mean"].mean()

# Compute the deviation of the mean from the overall average
df1["tfidf_deviation"] = df1["tfidf_mean"].apply(lambda x: np.nan if pd.isna(x) else x - average_tfidf1)
df2["tfidf_deviation"] = df2["tfidf_mean"].apply(lambda x: np.nan if pd.isna(x) else x - average_tfidf2)
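The computation above can be followed on a small synthetic corpus. The sketch below uses the same `TfidfVectorizer` defaults and the same steps (row mean, dataset average, difference). The toy abstracts and variable names are assumptions, and `np.asarray(...).ravel()` is used in place of `.A1` only to stay independent of the matrix type returned by the sparse mean.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "prediction model development and validation",
    "external validation of a prediction model",
    "",  # missing abstract, already replaced by an empty string
    "zebrafish genome assembly pipeline",
]

# One TF-IDF row per abstract, one column per term in the corpus vocabulary
matrix = TfidfVectorizer().fit_transform(abstracts)

# Mean TF-IDF weight per abstract (averaged over the whole vocabulary)
row_means = np.asarray(matrix.mean(axis=1)).ravel()

# Empty abstracts get NaN so that they do not pull the dataset average down
is_empty = np.array([text == "" for text in abstracts])
row_means[is_empty] = np.nan

# Deviation of each abstract's mean from the average over non-empty abstracts
deviation = row_means - np.nanmean(row_means)
print(deviation)
```

By construction the deviations of the non-empty abstracts sum to zero, so the sign indicates whether an abstract lies above or below the dataset average.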

In [32]:
# Export the processed files
df1.to_csv(path_data + 'processed/Prog_reporting.csv', index=False)
df2.to_csv(path_data + 'processed/Prog_tripod.csv', index=False)

print("Data preprocessing completed")

Data preprocessing completed
