# **Association between Abstract Characteristics and Ranking Position in Technology-Assisted Reviewing (TAR): data preprocessing**

### **by Isa Spiero**

#### **Part I: Loading TRIPOD characteristics**
In the first part, the datasets containing the TRIPOD characteristics of the included studies of each review are loaded and merged with the title-abstract level and full-text level inclusion labels of the respective review.

#### **Part II: Adding structure characteristics**

#### **Part III: Adding terminology characteristics**

In [7]:
import os
import pandas as pd
import numpy as np

In [9]:
path_data = '../data/raw/'

#### **Part I: Loading TRIPOD characteristics**

**Part I.a Dataset based on Andaur Navarro et al. 2022: 'Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review'**

Load the data containing the TRIPOD scores:

In [14]:
df1_scores = pd.read_csv(path_data + 'AndaurNavarro_et_al_2022/20201127_DATA_TRIPOD.csv')

# Select only the relevant columns:
df1_scores_sel = df1_scores[['article_id', 
                             't_title_1', 't_title_2', 't_title_3', 't_title_4',
                             't_abstract_1', 't_abstract_2', 't_abstract_3', 't_abstract_4',
                             't_abstract_5', 't_abstract_6', 't_abstract_7', 't_abstract_8',
                             't_abstract_9', 't_abstract_10', 't_abstract_11', 't_abstract_12']]
# Check that there are 152 inclusions in the review that were scored with TRIPOD
print(len(df1_scores_sel))

df1_scores_sel.head()

152


Unnamed: 0,article_id,t_title_1,t_title_2,t_title_3,t_title_4,t_abstract_1,t_abstract_2,t_abstract_3,t_abstract_4,t_abstract_5,t_abstract_6,t_abstract_7,t_abstract_8,t_abstract_9,t_abstract_10,t_abstract_11,t_abstract_12
0,172,0,0,1,1,1,0,0,1,0,2,0,1,1,0,0,1
1,174,0,1,0,1,0,1,1,1,1,2,0,1,1,1,0,1
2,175,0,0,0,1,1,0,0,1,1,1,1,1,1,0,0,1
3,177,0,1,1,1,1,1,0,1,1,2,1,1,1,0,0,1
4,178,0,1,0,1,1,0,0,0,0,2,1,1,1,0,0,1


Load the data containing the inclusion labels:

In [17]:
# This file contains the abstracts, titles, and title-abstract level inclusions
# The article_ids column was manually added based on JCMachineLearningSys-Datalinked_DATA_2024-10-02_1410.csv
df1_labels = pd.read_excel(path_data + 'AndaurNavarro_et_al_2022/Prog_reporting_labeled_ids.xlsx')

# Only the 152 inclusions have an article_id; fill the others with 0
df1_labels['article_id'] = df1_labels['article_id'].fillna(0)
df1_labels['article_id'] = df1_labels['article_id'].astype(int)
print(df1_labels['article_id'].ne(0).sum())

# Add the 152 full-text level inclusions
df1_labels['label_ft_included'] = np.where(df1_labels['article_id'] == 0, 0, 1)
print(df1_labels['label_ft_included'].sum())

# Check the 312 title-abstract level inclusions
df1_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df1_labels['label_ta_included'].sum())

df1_labels.head()

152
152
312


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,label_ta_included,article_id,label_ft_included
0,2278,JOUR,"['Csato V', 'Kadir SZSA', 'Khavandi K', 'Benne...",2019.0,"""A Step and a Ceiling"": mechanical properties ...",,,"['eppi-reviewer4', 'Ca2+ spark', 'oxidant sign...",We investigated the biomechanical relationship...,0,0,0
1,1242,JOUR,,2019.0,"""Implications of emotion regulation strategies...",,,['eppi-reviewer4'],"Reports an error in ""Implications of emotion r...",0,0,0
2,1632,JOUR,"['Moyano J', 'Mases L', 'Izeta T', 'Flores T',...",2019.0,"""In Vitro"" Study About Variables that Influenc...",,,"['eppi-reviewer4', 'conventional brackets', 'f...",Many advantages have been described surroundin...,0,0,0
3,187,JOUR,"['Song J', 'Han K', 'Lee D', 'Kim SW']",2018.0,"""Is a picture really worth a thousand words?"":...",,,"['eppi-reviewer4', 'Adolescent', 'Age Factors'...",Because using social media has become a major ...,0,0,0
4,2406,JOUR,"['Rodrigues MAV', 'Olmos RD', 'Kira CM', 'Lotu...",2019.0,"""Shadow"" OSCE examiner. A cross-sectional stud...",,,['eppi-reviewer4'],OBJECTIVES: Feedback is a powerful learning to...,0,0,0


Merge the data containing the TRIPOD scores with the data containing the inclusion labels:

In [20]:
df1 = pd.merge(df1_labels, df1_scores_sel, on='article_id', how='outer')

# Check that the dataframe consists of 2482 records in total:
print(len(df1))
# Check that the number of title-abstract inclusions equals 312:
print(df1['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 152:
print(df1['label_ft_included'].sum())
# Check that the number of NaN values for the TRIPOD items corresponds to the number of full-text exclusions (2330):
print(df1['t_abstract_1'].isna().sum())

df1.head()

2482
312
152
2330


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,label_ta_included,...,t_abstract_3,t_abstract_4,t_abstract_5,t_abstract_6,t_abstract_7,t_abstract_8,t_abstract_9,t_abstract_10,t_abstract_11,t_abstract_12
0,2278,JOUR,"['Csato V', 'Kadir SZSA', 'Khavandi K', 'Benne...",2019.0,"""A Step and a Ceiling"": mechanical properties ...",,,"['eppi-reviewer4', 'Ca2+ spark', 'oxidant sign...",We investigated the biomechanical relationship...,0,...,,,,,,,,,,
1,1242,JOUR,,2019.0,"""Implications of emotion regulation strategies...",,,['eppi-reviewer4'],"Reports an error in ""Implications of emotion r...",0,...,,,,,,,,,,
2,1632,JOUR,"['Moyano J', 'Mases L', 'Izeta T', 'Flores T',...",2019.0,"""In Vitro"" Study About Variables that Influenc...",,,"['eppi-reviewer4', 'conventional brackets', 'f...",Many advantages have been described surroundin...,0,...,,,,,,,,,,
3,187,JOUR,"['Song J', 'Han K', 'Lee D', 'Kim SW']",2018.0,"""Is a picture really worth a thousand words?"":...",,,"['eppi-reviewer4', 'Adolescent', 'Age Factors'...",Because using social media has become a major ...,0,...,,,,,,,,,,
4,2406,JOUR,"['Rodrigues MAV', 'Olmos RD', 'Kira CM', 'Lotu...",2019.0,"""Shadow"" OSCE examiner. A cross-sectional stud...",,,['eppi-reviewer4'],OBJECTIVES: Feedback is a powerful learning to...,0,...,,,,,,,,,,
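The row counts printed above can be complemented by letting the merge itself report how keys matched. The following is a minimal sketch on made-up toy data (the frames `labels` and `scores` are hypothetical stand-ins for `df1_labels` and `df1_scores_sel`), using the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for df1_labels and df1_scores_sel (values are made up)
labels = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'article_id': [0, 172, 0, 174],  # 0 marks records without TRIPOD scores
})
scores = pd.DataFrame({
    'article_id': [172, 174],
    't_abstract_1': [1, 0],
})

# validate='many_to_one' raises a MergeError if article_id is duplicated in `scores`;
# indicator=True adds a '_merge' column showing the origin of each row
merged = pd.merge(labels, scores, on='article_id', how='outer',
                  validate='many_to_one', indicator=True)

merge_counts = merged['_merge'].value_counts()
print(merge_counts)
```

Rows marked `right_only` would indicate scored records whose `article_id` is absent from the labels file, which a plain row count can hide.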


**Part I.b Dataset based on Heus et al. 2018: 'Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement'**

Load the data containing the TRIPOD scores:

In [126]:
df2_scores = pd.read_excel(path_data + 'Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                           sheet_name='Overview (n=147)') 

# Select only the relevant columns:
df2_scores_sel = df2_scores[
    ['ID', 'Endnote ID', 
     '1i', '1ii', '1iii', '1iv',
     '2i', '2ii', '2iii', '2iv', '2v',
     '2vi', '2vii', '2viii', '2ix', '2x',
     '2xi', '2xii', '2xiii', '2xiv', '2xv',
     '1i.1', '1ii.1', '1iii.1', '1iv.1',
     '2i.1', '2ii.1', '2iii.1', '2iv.1', '2v.1',
     '2vi.1', '2vii.1', '2viii.1', '2ix.1', '2x.1',
     '2xi.1', '2xii.1', '2xiii.1', '2xiv.1', '2xv.1',
     '1i.2', '1ii.2', '1iii.2', '1iv.2',
     '2i.2', '2ii.2', '2iii.2', '2iv.2', '2v.2',
     '2vi.2', '2vii.2', '2viii.2', '2ix.2', '2x.2',
     '2xi.2', '2xii.2', '2xiii.2', '2xiv.2', '2xv.2'
    ]
]

# List of sets of columns to merge, since columns are spread according to prediction model type:
columns_to_merge = [
    ['1i', '1i.1', '1i.2'],
    ['1ii', '1ii.1', '1ii.2'],
    ['1iii', '1iii.1', '1iii.2'],
    ['1iv', '1iv.1', '1iv.2'],
    ['2i', '2i.1', '2i.2'],
    ['2ii', '2ii.1', '2ii.2'],
    ['2iii', '2iii.1', '2iii.2'],
    ['2iv', '2iv.1', '2iv.2'],
    ['2v', '2v.1', '2v.2'],
    ['2vi', '2vi.1', '2vi.2'],
    ['2vii', '2vii.1', '2vii.2'],
    ['2viii', '2viii.1', '2viii.2'],
    ['2ix', '2ix.1', '2ix.2'],
    ['2x', '2x.1', '2x.2'],
    ['2xi', '2xi.1', '2xi.2'],
    ['2xii', '2xii.1', '2xii.2'],
    ['2xiii', '2xiii.1', '2xiii.2'],
    ['2xiv', '2xiv.1', '2xiv.2'],
    ['2xv', '2xv.1', '2xv.2']
]

# Loop through each set of columns and merge them
df2_scores_merg = df2_scores_sel.copy()  
for cols in columns_to_merge:
    df2_scores_merg[cols[0]] = df2_scores_merg[cols].apply(
        lambda row: row.dropna().iloc[0] if not row.dropna().empty else None, 
        axis=1
    )
    df2_scores_merg = df2_scores_merg.drop(columns=cols[1:])

df2_scores_merg


Unnamed: 0,ID,Endnote ID,1i,1ii,1iii,1iv,2i,2ii,2iii,2iv,...,2vi,2vii,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv
0,4,10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,3.0,0.0,3.0,1.0
1,6,15,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,3.0,1.0
2,12,25,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,1.0
3,15,28,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,3.0,3.0,3.0,0.0,3.0,1.0
4,16,32,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,3.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,166,306,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0
143,167,307,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,1.0,3.0,3.0,3.0,0.0,3.0,1.0
144,168,310,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,3.0,1.0,3.0,0.0,3.0,1.0
145,172,328,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,0.0,3.0,0.0,3.0,0.0,3.0,1.0
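The loop above takes, for each item, the first non-missing value across the three model-type columns. As an illustration only, the same coalescing can be sketched without a row-wise `apply` by back-filling along the columns and keeping the first column. The toy frame below is made up and assumes, as in the cell above, that the first non-missing value per row is the one wanted:

```python
import numpy as np
import pandas as pd

# Toy data: each row has a score in at most one of the three columns
toy = pd.DataFrame({
    '1i':   [0.0, np.nan, np.nan, np.nan],
    '1i.1': [np.nan, 1.0, np.nan, np.nan],
    '1i.2': [np.nan, np.nan, 3.0, np.nan],
})
cols = ['1i', '1i.1', '1i.2']

# Row-wise approach, as used in the cell above
row_wise = toy[cols].apply(
    lambda row: row.dropna().iloc[0] if not row.dropna().empty else None,
    axis=1
)

# Alternative: back-fill across columns so the first column holds
# the first non-missing value of each row
vectorized = toy[cols].bfill(axis=1).iloc[:, 0]

# Both approaches give the same values (NaN where all three are missing)
same = np.allclose(row_wise.astype(float).to_numpy(),
                   vectorized.to_numpy(), equal_nan=True)
print(same)
```

On a dataset of 147 rows the speed difference is negligible, so this is mainly a matter of readability.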


Load the data containing the inclusion labels:

In [129]:
df2_labels = pd.read_excel(path_data + 'Heus_et_al_2018/Prog_tripod_labeled.xlsx')

# Check the total number of records equals 4871:
print(len(df2_labels))

# Check that the number of title-abstract level inclusions equals 347:
df2_labels.rename(columns={'label_included': 'label_ta_included'}, inplace=True)
print(df2_labels['label_ta_included'].sum())

df2_labels.head()

4871
347


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,language,label_ta_included
0,0,Journal Article,,2014,Factors associated with short-term changes in ...,AIDS,24959963,,OBJECTIVES: Among antiretroviral therapy (ART)...,eng,0
1,1,Journal Article,,2014,Reducing Injury Risk From Body Checking in Boy...,Pediatrics,24864185,,Ice hockey is an increasingly popular sport th...,Eng,0
2,2,Journal Article,,2014,Hypothermia and Neonatal Encephalopathy,Pediatrics,24864176,,Data from large randomized clinical trials ind...,Eng,0
3,3,Journal Article,,2014,Incontinence: Leak point pressure predicts suc...,Nat Rev Urol,24861330,,,eng,0
4,4,Journal Article,,2014,Identifying Gene-Environment Interactions in S...,Schizophr Bull,24860087,,Recent years have seen considerable progress i...,eng,0


Load the data with the PubMed IDs (pmid) to link the TRIPOD scores with the labels:

In [132]:
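# A function to parse a RIS-formatted file into a list of dictionaries (one per reference)
def parse_ris_file(file_path):
    references = []
    entry = {}

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Skip empty lines
            if line.strip() == "":
                continue

            # A 'TY' tag marks the start of a new reference
            if line.startswith("TY  -"):
                if entry:
                    references.append(entry)
                entry = {}
            try:
                # Split 'TAG  - value' into tag and value; a repeated tag
                # (e.g. multiple 'AU' lines) keeps only its last value
                tag, value = line.split('  - ', 1)
                entry[tag] = value.strip()
            except ValueError:
                # Ignore lines that do not match the 'TAG  - value' pattern
                continue

        # Append the last reference
        if entry:
            references.append(entry)

    return references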

# Convert parsed data into a pandas DataFrame
parsed_data = parse_ris_file(path_data + 'Heus_et_al_2018/TRIPOD adherence included_final-Converted.txt')
df2_link = pd.DataFrame(parsed_data)

# The column 'AN' contains the PubMed IDs:
df2_link

Unnamed: 0,﻿TY,AB,AD,AN,AU,DA,DO,DP,ET,J2,...,TI,UR,ID,ER,TY,IS,SP,VL,KW,C2
0,JOUR,BACKGROUND: -Vascular adhesion protein-1 (VAP-...,"MediCity Research Laboratory, University of Tu...",24850810,"Salmi, M.",May 21,10.1161/circgenetics.113.000543,NLM,2014/05/23,Circulation. Cardiovascular genetics,...,Soluble Vascular Adhesion Protein-1 Predicts I...,http://circgenetics.ahajournals.org/content/7/...,1,,,,,,,
1,,Previous studies have shown that hippocampal v...,"Biomedical Imaging Group Rotterdam, Department...",24039001,"de Bruijne, M.",May,10.1002/hbm.22333,NLM,2013/09/17,Human brain mapping,...,Hippocampal shape is predictive for the develo...,http://onlinelibrary.wiley.com/store/10.1002/h...,2,,JOUR,5,2359-71,35,,
2,,BACKGROUND: Many of the common equations for w...,"From the Department of Anesthesiology, Univers...",24681659,"Nafiu, O. O.",May,10.1213/ane.0000000000000163,NLM,2014/04/01,Anesthesia and analgesia,...,Assessing the accuracy of common pediatric age...,,3,,JOUR,5,1027-33,118,Aging/ physiology,
3,,The Model for End-Stage Liver Disease (MELD) s...,"Department of Surgery, Dumont-UCLA Transplant ...",24854341,"Busuttil, R. W.",Jul,10.1111/ajt.12759,NLM,2014/05/24,American journal of transplantation : official...,...,Liver transplantation in recipients receiving ...,http://onlinelibrary.wiley.com/store/10.1111/a...,4,,JOUR,7,1638-47,14,,
4,,OBJECTIVES: Psoriasis is a chronic inflammator...,"Department of Cardiology, Copenhagen Universit...",24860914,"Hansen, P. R.",May 26,10.1111/joim.12272,NLM,2014/05/28,Journal of internal medicine,...,Risk of thromboembolism and fatal stroke in pa...,http://onlinelibrary.wiley.com/store/10.1111/j...,5,,JOUR,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,,Abstract Traumatic brain injury (TBI) is commo...,"1 Department of Biostatistics, University of W...",24552494,"Newgard, C. D.",Jun 1,10.1089/neu.2013.3122,NLM,2014/02/21,Journal of neurotrauma,...,Addressing the challenges of obtaining functio...,http://online.liebertpub.com/doi/abs/10.1089/n...,176,,JOUR,11,1029-38,31,,4043258
176,,BACKGROUND: Acute kidney injury (AKI) is a fre...,,24293449,"Zhang, Y.",May,10.1515/cclm-2013-0823,NLM,2013/12/03,Clinical chemistry and laboratory medicine : C...,...,Performance of urinary NGAL and L-FABP in pred...,http://www.degruyter.com/view/j/cclm.2014.52.i...,177,,JOUR,5,671-8,52,,
177,,BACKGROUND & AIMS: Anemia is a common adverse ...,Johann Wolfgang Goethe University Medical Cent...,24486089,"Witek, J.",Jun,10.1016/j.jhep.2014.01.013,NLM,2014/02/04,Journal of hepatology,...,Risk factors predictive of anemia development ...,http://www.journal-of-hepatology.eu/article/S0...,178,,JOUR,6,1112-7,60,,
178,,BACKGROUND: Early warning scores (EWS) are des...,"Division of Biomedical Informatics, Cincinnati...",24813568,"Solti, I.",May 9,10.1016/j.resuscitation.2014.04.009,NLM,2014/05/13,Resuscitation,...,Developing and evaluating a machine learning b...,http://www.resuscitationjournal.com/article/S0...,179,,JOUR,,,,,
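The preview above shows two `TY` columns, the first of which appears to carry an invisible prefix character. This pattern is typical of a file that starts with a UTF-8 byte order mark (BOM) and is read with `encoding='utf-8'`. The sketch below illustrates the effect on a small made-up RIS snippet written to a temporary file. It assumes that a BOM is the cause here, which has not been verified against the original file:

```python
import os
import tempfile

# Made-up RIS snippet; 'utf-8-sig' writes a BOM at the start of the file
ris_text = "TY  - JOUR\nAN  - 24850810\nID  - 1\nER  - \n"

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'toy_ris.txt')
    with open(tmp_path, 'w', encoding='utf-8-sig') as f:
        f.write(ris_text)

    # Reading as plain UTF-8 keeps the BOM as part of the first tag
    with open(tmp_path, 'r', encoding='utf-8') as f:
        first_tag_utf8 = f.readline().split('  - ', 1)[0]

    # Reading as 'utf-8-sig' strips the BOM
    with open(tmp_path, 'r', encoding='utf-8-sig') as f:
        first_tag_sig = f.readline().split('  - ', 1)[0]

print(repr(first_tag_utf8))
print(repr(first_tag_sig))
```

Since only the `ID` and `AN` columns are used for linking, the extra column does not affect the merge that follows.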


Merge the PubMed IDs with the dataframe containing the TRIPOD scores:

In [166]:
# Convert both column types to string:
df2_scores_merg['ID'] = df2_scores_merg['ID'].astype(str)
df2_link['ID'] = df2_link['ID'].astype(str)

# Merge the PubMed IDs with the dataframe containing the TRIPOD scores:
df2_scores_pmid = df2_scores_merg.merge(df2_link[['ID', 'AN']], on='ID', how='left')

# Rename the column:
df2_scores_pmid.rename(columns={'AN': 'pmid'}, inplace=True)

# Add a column for full-text level inclusions (all scored records were included):
df2_scores_pmid['label_ft_included'] = 1

df2_scores_pmid

Unnamed: 0,ID,Endnote ID,1i,1ii,1iii,1iv,2i,2ii,2iii,2iv,...,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv,pmid,label_ft_included
0,4,10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,3.0,0.0,3.0,1.0,24854341,1
1,6,15,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,0.0,0.0,3.0,1.0,24690476,1
2,12,25,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,0.0,3.0,1.0,24958751,1
3,15,28,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,3.0,3.0,0.0,3.0,1.0,24515568,1
4,16,32,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,3.0,3.0,3.0,1.0,3.0,1.0,24419662,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,166,306,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0,24815676,1
143,167,307,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,1.0,1.0,3.0,3.0,3.0,0.0,3.0,1.0,24123609,1
144,168,310,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,...,1.0,0.0,3.0,1.0,3.0,0.0,3.0,1.0,24520119,1
145,172,328,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,3.0,1.0,24879844,1


Merge the data containing the TRIPOD scores with the data containing the inclusion labels:

In [174]:
# Convert both PubMed ID columns to string so that the merge keys have the same type:
df2_scores_pmid['pmid'] = df2_scores_pmid['pmid'].astype(str)
df2_labels['pmid'] = df2_labels['pmid'].astype(str)

df2 = pd.merge(df2_labels, df2_scores_pmid, on='pmid', how='outer')

# Change all NaN to 0 for full-text level inclusions (inclusions are already indicated with 1):
df2['label_ft_included'] = df2['label_ft_included'].fillna(0)

# Check that the dataframe consists of 4871 records in total:
print(len(df2))
# Check that the number of title-abstract inclusions equals 347:
print(df2['label_ta_included'].sum())
# Check that the number of full-text inclusions equals 147:
print(df2['label_ft_included'].sum())
# Check that the number of NaN values for the TRIPOD items corresponds to the number of full-text exclusions (4724):
print(df2['2vii'].isna().sum())

df2.head()

4871
347
147.0
4724


Unnamed: 0.1,Unnamed: 0,type,authors,year,title,journal,pmid,keywords,abstract,language,...,2vii,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv,label_ft_included
0,1645,Journal Article,R. Haring; N. Friedrich; H. Volzke; R. S. Vasa...,2014,Positive association of serum prolactin concen...,Eur Heart J,22843444,,AIMS: Increased serum prolactin (PRL) concentr...,eng,...,,,,,,,,,,0.0
1,2831,Journal Article,C. Mertens; D. Wiens; H. G. Steveling; A. Sand...,2014,Maxillary sinus-floor elevation with nanoporou...,Clin Implant Dent Relat Res,22897709,,BACKGROUND: Insufficient bone height in the po...,eng,...,,,,,,,,,,0.0
2,3259,Journal Article,G. Peeters; Y. R. van Gellecum; J. G. van Uffe...,2014,Contribution of house and garden work to the a...,Br J Sports Med,22936410,,OBJECTIVE: Although physical activity occurs i...,eng,...,,,,,,,,,,0.0
3,3578,Journal Article,G. E. Romanos; S. May; D. May,2014,Implant-supporting telescopic maxillary prosth...,Clin Implant Dent Relat Res,22998571,,Immediate loading (IL) in the maxilla is a suc...,eng,...,,,,,,,,,,0.0
4,1893,Journal Article,C. Jacobsen; A. Kruse; H. T. Lubbers; R. Zwahl...,2014,Is mandibular reconstruction using vascularize...,Clin Implant Dent Relat Res,22998581,,PURPOSE: this study retrospectively analyzed t...,eng,...,,,,,,,,,,0.0
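The full-text inclusion count above is printed as `147.0` because the outer merge introduces NaN for unmatched records, which turns the integer label column into floats. A minimal sketch on made-up toy data (the frames `labels` and `scores` are hypothetical) shows how the column could be cast back to integers after filling the missing values:

```python
import pandas as pd

# Toy stand-ins for the labels and the scored full-text inclusions
labels = pd.DataFrame({'pmid': ['1', '2', '3']})
scores = pd.DataFrame({'pmid': ['2'], 'label_ft_included': [1]})

merged = pd.merge(labels, scores, on='pmid', how='outer')

# Unmatched rows hold NaN, so the column is float after the merge;
# fill with 0 and cast back to integer
merged['label_ft_included'] = merged['label_ft_included'].fillna(0).astype(int)

total = merged['label_ft_included'].sum()
print(total)
```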


In [68]:
# n=74
df2a_scores = pd.read_excel(path_data + 'Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                            sheet_name='Developm for SPSS',
                            header=1) 
#df2a_scores['origin'] = 'Development'

# n=43
df2b_scores = pd.read_excel(path_data + 'Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                            sheet_name='Validation for SPSS',
                            header=1)
#df2b_scores['origin'] = 'Validation'

# n=33
df2c_scores = pd.read_excel(path_data + 'Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                            sheet_name='IV for SPSS_2',
                            header=1)
#df2c_scores['origin'] = 'IV'

# n=22
df2d_scores = pd.read_excel(path_data + 'Heus_et_al_2018/170509_Data_set_for_SPSS.xlsx', 
                            sheet_name='D&V for SPSS_2',
                            header=1)
#df2d_scores['origin'] = 'D&V'


In [86]:
# Concatenate the scores from the four model-type sheets.
# A separate name is used so that the merged df2 from the cell above is not overwritten.
df2_check = pd.concat([df2a_scores,
                       df2b_scores,
                       df2c_scores,
                       df2d_scores], axis=0)

# Select only the relevant columns:
df2_check = df2_check[['Author name',
                       '1i', '1ii', '1iii', '1iv',
                       '2i', '2ii', '2iii', '2iv', '2v',
                       '2vi', '2vii', '2viii', '2ix', '2x',
                       '2xi', '2xii', '2xiii', '2xiv', '2xv']]

# Write the concatenated scores to file for manual inspection:
df2_check.to_excel(path_data + 'check.xlsx', index=False)
df2_check

Unnamed: 0,Author name,1i,1ii,1iii,1iv,2i,2ii,2iii,2iv,2v,2vi,2vii,2viii,2ix,2x,2xi,2xii,2xiii,2xiv,2xv
0,Agopian,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,3.0,0.0,3.0,1.0
1,AlHilli,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,3.0,1.0
2,Arnold,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,1.0
3,Aurello,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,3.0,3.0,0.0,3.0,1.0
4,Barthelemy,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,3.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17,Schmidt,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,,1.0
18,Schmit,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,,1.0
19,Tadiparthi,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,3.0,3.0,3.0,0.0,3.0,1.0
20,van der Meer,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,3.0,1.0


In [None]:
# Export the parsed RIS references to Excel for manual inspection,
# reusing parse_ris_file() defined above
file_path = path_data + 'Heus_et_al_2018/TRIPOD adherence included_final-Converted.txt'
parsed_data = parse_ris_file(file_path)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(parsed_data)

# Write the DataFrame to file and display it
df.to_excel('TRIPOD adherence included.xlsx', index=False)
df