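- Prepare the AbstRCT datasets for LLaMA-Factory.

- Joint pipeline task: argument classification (acc) plus argument relation identification and classification (aric).

- We create four data files: `abstRCT_pipeline_neo_train.json` (train and dev splits combined), `abstRCT_pipeline_neo_test.json`, `abstRCT_pipeline_gla_test.json`, and `abstRCT_pipeline_mix_test.json`.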

In [1]:
import os
import re
import ast
import json
import random
import pickle
import pandas as pd

from pathlib import Path

### Read CSV into dataframes

In [2]:
DATAFILES_DIR = Path("/home/umushtaq/am_reasoning/abstRCT/data_files")

In [3]:
neo_train_df = pd.read_csv(DATAFILES_DIR / "neo_train.csv", index_col=0)
neo_dev_df = pd.read_csv(DATAFILES_DIR / "neo_dev.csv", index_col=0)
gla_test_df = pd.read_csv(DATAFILES_DIR / "gla_test.csv", index_col=0)
mix_test_df = pd.read_csv(DATAFILES_DIR / "mix_test.csv", index_col=0)
neo_test_df = pd.read_csv(DATAFILES_DIR / "neo_test.csv", index_col=0)

In [4]:
neo_train_df.columns

Index(['filename', 'type_indicator_x', 'ac_id', 'ac_type', 'ac_start_bound',
       'ac_end_bound', 'ac', 'type_indicator_y', 'source', 'target',
       'relation_type', 'text_file', 'abstract_text', 'source_acs',
       'target_acs', 'source_ids', 'target_ids', 'relations', 'egalite'],
      dtype='object')

In [5]:
gla_test_df.columns

Index(['filename', 'type_indicator_x', 'ac_id', 'ac_type', 'ac_start_bound',
       'ac_end_bound', 'ac', 'type_indicator_y', 'source', 'target',
       'relation_type', 'text_file', 'abstract_text', 'source_acs',
       'target_acs', 'source_ids', 'target_ids', 'relations', 'egalite'],
      dtype='object')

In [6]:
gla_test_df.iloc[57]

filename                                                 19590523.ann
type_indicator_x                                 ['T', 'T', 'T', 'T']
ac_id                                        ['T1', 'T2', 'T3', 'T4']
ac_type                    ['Premise', 'Claim', 'Premise', 'Premise']
ac_start_bound                       ['1030', '1422', '1153', '1258']
ac_end_bound                         ['1152', '1528', '1257', '1421']
ac                  ["At 24 months, complete success rates were 53...
type_indicator_y                                      ['R', 'R', 'R']
source                                             ['T1', 'T4', 'T3']
target                                             ['T2', 'T2', 'T2']
relation_type                       ['Support', 'Support', 'Support']
text_file                                                19590523.txt
abstract_text       \n\nTo compare the effectiveness and safety of...
source_acs          ["At 24 months, complete success rates were 53...
target_acs          

### Prepare Prompts

In [7]:
def formatting_fct(instruction, input_text, output):
    """Pack one example into an instruction/input/output record."""
    # `input_text` avoids shadowing the built-in `input`.
    return {
        "instruction": str(instruction),
        "input": str(input_text),
        "output": str(output),
    }

In [8]:
def build_instruction():
    
    instruction = """### You are an expert in Argument Mining tasked with analyzing biomedical abstracts. 

INPUT:
You will receive:
- A biomedical abstract text.
- An enumerated list of identified arguments in the biomedical abstract.

TASK 1: Argument Classification
- Classify each argument in the biomedical abstract into one of the following categories: "MajorClaim"(M), "Claim"(C) or "Premise"(P).
- MajorClaim (M): Major claims are broad conclusions supported by specific claims or general statements about treatments or diseases.
- Claim (C): A claim is the author's concluding statement, summarizing how the new treatment compares to existing treatments based on the study's results.
- Premise (P): A premise is an observed or measured fact—such as side effects or outcomes—that supports or challenges a claim. 
- You must return a list of argument types in following JSON format: {"argument_types": [argument_types (str), argument_types (str), ..., argument_types (str)]}

TASK 2: Argument Relations Identification and Classification
- Identify relationships between arguments by determining which arguments support or attack other arguments.
- For each related argument pair, classify the relationship as either: "Support"(S) or "Attack"(A).
- You must return a list of triplets in the following JSON format: {"argument_relations": [(target_index (int), source_index (int), relation_type (str)), (target_index (int), source_index (int), relation_type (str)), ...]}
- Note: Indices are 1-based, referring to the position in the provided arguments list.

Example:

### Output:
{"argument_types": ['M', 'P', 'P', 'P', 'P', 'P', 'C']}
{"argument_relations": [(7, 1, 'S'), (2, 7, 'S'), (3, 7, 'S'), (5, 7, 'S'), (4, 7, 'S'), (6, 1, 'S')]}
"""

    return instruction

In [9]:
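def build_input(abstract_text, acs, ac_types):
    # `acs` and `ac_types` are stringified Python lists read from the CSV;
    # ast.literal_eval parses them without executing arbitrary code.
    question = f"### Abstract text:\n{abstract_text}\n"
    ac_l = [ac for ac, ac_type in zip(ast.literal_eval(acs), ast.literal_eval(ac_types)) if ac_type != "none"]

    # Number the retained components starting at 1.
    numbered_string = '\n'.join(f"{i}. {item}" for i, item in enumerate(ac_l, start=1))
    args = f"### List of arguments in the abstract:\n{numbered_string}"

    return question + args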
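The numbering step can be tried in isolation. The sketch below uses made-up component strings (not taken from the dataset) and a hypothetical helper name, `number_arguments`. It shows that components labelled `"none"` are dropped before numbering, so the 1-based positions refer to the filtered list. Whether the `relations` column uses the same filtered positions is not shown in this notebook and is worth checking.

```python
import ast


def number_arguments(acs: str, ac_types: str) -> str:
    """Number the components whose type is not "none", starting at 1."""
    pairs = zip(ast.literal_eval(acs), ast.literal_eval(ac_types))
    kept = [ac for ac, ac_type in pairs if ac_type != "none"]
    return "\n".join(f"{i}. {ac}" for i, ac in enumerate(kept, start=1))


# Toy values in the same stringified-list form as the CSV columns.
toy_acs = "['Drug A reduced pain.', 'Background sentence.', 'Drug A is preferable.']"
toy_types = "['Premise', 'none', 'Claim']"

numbered = number_arguments(toy_acs, toy_types)
print(numbered)
```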

In [10]:
print(build_input(neo_train_df.iloc[0].abstract_text, neo_train_df.iloc[0].ac, neo_train_df.iloc[0].ac_type))

### Abstract text:
 A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer. The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement duration between groups of

In [11]:
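def build_answer(relations, relation_types, ac_types):
    # All three arguments are stringified Python lists read from the CSV.
    ac_type_d = {"Premise": "P", "Claim": "C", "MajorClaim": "M"}
    rel_type_d = {"Support": "S", "Attack": "A"}

    output_indic = "### Output:\n"

    # Abbreviate component types, skipping components labelled 'none'.
    ac_labels = [ac_type_d[label] for label in ast.literal_eval(ac_types) if label != 'none']
    ac_types_line = f'{{"argument_types": {ac_labels}}}\n'

    # Attach the abbreviated relation label to each index pair.
    # NOTE: the key written here is "relation_types", whereas the
    # instruction prompt asks the model for "argument_relations".
    pairs = zip(ast.literal_eval(relations), ast.literal_eval(relation_types))
    rel_triplets = [(pair[0], pair[1], rel_type_d[r_type]) for pair, r_type in pairs]
    rels_line = f'{{"relation_types": {rel_triplets}}}\n'

    return output_indic + ac_types_line + rels_line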

In [12]:
print(build_answer(neo_train_df.iloc[0].relations, neo_train_df.iloc[0].relation_type, neo_train_df.iloc[0].ac_type))

### Output:
{"argument_types": ['M', 'P', 'P', 'P', 'P', 'P', 'C']}
{"relation_types": [(7, 1, 'S'), (2, 7, 'S'), (3, 7, 'S'), (5, 7, 'S'), (4, 7, 'S'), (6, 1, 'S')]}
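The two output lines are Python literals (single-quoted strings and tuples) rather than strict JSON, so `json.loads` would reject them. A minimal sketch of reading such a block back, assuming the two-line layout shown above; `parse_output` is a hypothetical helper, not part of the notebook:

```python
import ast


def parse_output(text: str) -> dict:
    """Merge every line that starts with '{' into one dict of Python objects."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            # literal_eval accepts single quotes and tuples, unlike json.loads.
            parsed.update(ast.literal_eval(line))
    return parsed


sample = (
    "### Output:\n"
    "{\"argument_types\": ['M', 'P', 'P', 'P', 'P', 'P', 'C']}\n"
    "{\"relation_types\": [(7, 1, 'S'), (2, 7, 'S'), (3, 7, 'S'), "
    "(5, 7, 'S'), (4, 7, 'S'), (6, 1, 'S')]}\n"
)

result = parse_output(sample)
print(result["argument_types"])
print(result["relation_types"])
```

The same approach could be applied to model generations at evaluation time, provided the model reproduces this layout.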



### Build Data Files 

In [13]:
data_file_train = []

for _, row in neo_train_df.iterrows():

    instruction = build_instruction()
    question = build_input(row.abstract_text, row.ac, row.ac_type)
    answer = build_answer(row.relations, row.relation_type, row.ac_type)
    
    data_file_train.append( formatting_fct(instruction, question, answer) )

In [14]:
len(data_file_train)

350

In [15]:
print(data_file_train[0]['instruction'])

### You are an expert in Argument Mining tasked with analyzing biomedical abstracts. 

INPUT:
You will receive:
- A biomedical abstract text.
- An enumerated list of identified arguments in the biomedical abstract.

TASK 1: Argument Classification
- Classify each argument in the biomedical abstract into one of the following categories: "MajorClaim"(M), "Claim"(C) or "Premise"(P).
- MajorClaim (M): Major claims are broad conclusions supported by specific claims or general statements about treatments or diseases.
- Claim (C): A claim is the author's concluding statement, summarizing how the new treatment compares to existing treatments based on the study's results.
- Premise (P): A premise is an observed or measured fact—such as side effects or outcomes—that supports or challenges a claim. 
- You must return a list of argument types in following JSON format: {"argument_types": [argument_types (str), argument_types (str), ..., argument_types (str)]}

TASK 2: Argument Relations Identificat

In [16]:
print(data_file_train[0]['input'])

### Abstract text:
 A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer. The purpose of this study was to assess the effects of these treatments on health-related quality of life (HQL). Men with metastatic prostate cancer (n = 161) were randomized to receive either daily prednisone alone or mitoxantrone (every 3 weeks) plus prednisone. Those who received prednisone alone could have mitoxantrone added after 6 weeks if there was no improvement in pain. HQL was assessed before treatment initiation and then every 3 weeks using the European Organization for Research and Treatment of Cancer Quality-of-Life Questionnaire C30 (EORTC QLQ-C30) and the Quality of Life Module-Prostate 14 (QOLM-P14), a trial-specific module developed for this study. An intent-to-treat analysis was used to determine the mean duration of HQL improvement and differences in improvement duration between groups of

In [17]:
print(data_file_train[0]['output'])

### Output:
{"argument_types": ['M', 'P', 'P', 'P', 'P', 'P', 'C']}
{"relation_types": [(7, 1, 'S'), (2, 7, 'S'), (3, 7, 'S'), (5, 7, 'S'), (4, 7, 'S'), (6, 1, 'S')]}



In [18]:
# The dev split is appended to the training records, so the saved training file contains both.
for _, row in neo_dev_df.iterrows():

    instruction = build_instruction()
    question = build_input(row.abstract_text, row.ac, row.ac_type)
    answer = build_answer(row.relations, row.relation_type, row.ac_type)
    
    data_file_train.append( formatting_fct(instruction, question, answer) )

In [19]:
#### Test data files

In [20]:
data_file_test_neo = []

for _, row in neo_test_df.iterrows():

    instruction = build_instruction()
    question = build_input(row.abstract_text, row.ac, row.ac_type)
    answer = build_answer(row.relations, row.relation_type, row.ac_type)
    
    data_file_test_neo.append(formatting_fct(instruction, question, answer))

In [21]:
data_file_test_mix = []

for _, row in mix_test_df.iterrows():

    instruction = build_instruction()
    question = build_input(row.abstract_text, row.ac, row.ac_type)
    answer = build_answer(row.relations, row.relation_type, row.ac_type)
    
    data_file_test_mix.append(formatting_fct(instruction, question, answer))

In [22]:
data_file_test_gla = []

for _, row in gla_test_df.iterrows():

    instruction = build_instruction()
    question = build_input(row.abstract_text, row.ac, row.ac_type)
    answer = build_answer(row.relations, row.relation_type, row.ac_type)
    
    data_file_test_gla.append(formatting_fct(instruction, question, answer))
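The four loops above share the same body, so they could be folded into one helper. The sketch below is one possible shape, not the notebook's own code: `build_records` is a hypothetical name, and the toy row and lambda builders stand in for the real dataframe rows and the `build_instruction`, `build_input`, and `build_answer` functions.

```python
from collections import namedtuple


def build_records(rows, make_instruction, make_input, make_answer):
    """Build one instruction/input/output dict per row."""
    return [
        {
            "instruction": make_instruction(),
            "input": make_input(row.abstract_text, row.ac, row.ac_type),
            "output": make_answer(row.relations, row.relation_type, row.ac_type),
        }
        for row in rows
    ]


# Stand-in row with the attribute names used in the notebook's dataframes.
Row = namedtuple("Row", "abstract_text ac ac_type relations relation_type")
toy_rows = [Row("text", "['a']", "['Claim']", "[]", "[]")]

records = build_records(
    toy_rows,
    make_instruction=lambda: "instr",
    make_input=lambda text, acs, types: f"{text}|{acs}",
    make_answer=lambda rels, rel_types, types: f"{rels}|{types}",
)
print(records[0])
```

In the notebook this would presumably be called as `build_records(neo_test_df.itertuples(index=False), build_instruction, build_input, build_answer)`, since `itertuples` yields rows with the column names as attributes.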

In [23]:
len(data_file_test_neo), len(data_file_test_gla), len(data_file_test_mix)

(100, 100, 100)

In [24]:
print(data_file_test_gla[57]['instruction'])

### You are an expert in Argument Mining tasked with analyzing biomedical abstracts. 

INPUT:
You will receive:
- A biomedical abstract text.
- An enumerated list of identified arguments in the biomedical abstract.

TASK 1: Argument Classification
- Classify each argument in the biomedical abstract into one of the following categories: "MajorClaim"(M), "Claim"(C) or "Premise"(P).
- MajorClaim (M): Major claims are broad conclusions supported by specific claims or general statements about treatments or diseases.
- Claim (C): A claim is the author's concluding statement, summarizing how the new treatment compares to existing treatments based on the study's results.
- Premise (P): A premise is an observed or measured fact—such as side effects or outcomes—that supports or challenges a claim. 
- You must return a list of argument types in following JSON format: {"argument_types": [argument_types (str), argument_types (str), ..., argument_types (str)]}

TASK 2: Argument Relations Identificat

In [25]:
print(data_file_test_gla[57]['input'])

### Abstract text:


To compare the effectiveness and safety of excimer laser trabeculotomy (ELT) ab interno vs selective laser trabeculoplasty (SLT) over 24 months of follow-up in patients with primary open-angle glaucoma (POAG) refractory to medical therapy.
This prospective, randomized study included 30 consecutive eyes assigned randomly to either ELT or SLT group. ELT was carried out using a XeCl Excimer Laser with an emission wavelength of 308 nm. Eight spots were equally distributed at a distance of 500 microm from one another over the anterior trabeculum. The SLT patients were treated with a frequency-doubled q-switched neodymium:yytrium-aluminum-garnet laser (wavelength 532 nm). Approximately 50 adjacent, but not overlapping, laser spots were distributed over 180 degrees of the trabecular meshwork, using an energy level ranging from 0.7 to 1.0 mJ per pulse. The main outcome measure was intraocular pressure (IOP) lowering after ELT and SLT. Success was defined as >or=20% reducti

In [26]:
print(data_file_test_gla[57]['output'])

### Output:
{"argument_types": ['P', 'C', 'P', 'P']}
{"relation_types": [(1, 2, 'S'), (4, 2, 'S'), (3, 2, 'S')]}
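Because the relation triplets use 1-based positions into the argument list, a quick sanity check before saving can catch indices that fall outside that list. A minimal sketch, assuming the types and triplets have already been parsed into Python objects; `check_record` is a hypothetical helper:

```python
def check_record(argument_types, relations):
    """Return a list of problems found in one record (empty if none)."""
    n = len(argument_types)
    problems = []
    for first, second, label in relations:
        triplet = (first, second, label)
        # Both indices must be valid 1-based positions in the argument list.
        if not (1 <= first <= n and 1 <= second <= n):
            problems.append(f"index out of range: {triplet}")
        if first == second:
            problems.append(f"self-relation: {triplet}")
        if label not in {"S", "A"}:
            problems.append(f"unknown relation label: {triplet}")
    return problems


# The record printed above passes; the second, made-up record does not.
ok = check_record(['P', 'C', 'P', 'P'], [(1, 2, 'S'), (4, 2, 'S'), (3, 2, 'S')])
bad = check_record(['P', 'C'], [(1, 3, 'S'), (2, 2, 'X')])
print(ok)
print(bad)
```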



In [27]:
### Save JSON files

In [28]:
DATASETS_DIR = Path("/home/umushtaq/am_reasoning/abstRCT/datasets")

In [29]:
file_path = DATASETS_DIR / "abstRCT_pipeline_neo_train.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_train, file)

In [30]:
file_path = DATASETS_DIR / "abstRCT_pipeline_neo_test.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_test_neo, file)

In [31]:
file_path = DATASETS_DIR / "abstRCT_pipeline_mix_test.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_test_mix, file)

In [32]:
file_path =  DATASETS_DIR / "abstRCT_pipeline_gla_test.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_test_gla, file)
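After saving, it can be useful to reload a file to confirm the round trip, and to prepare the entry that registers it with LLaMA-Factory. The sketch below writes a toy record to a temporary directory instead of `DATASETS_DIR`. The `dataset_info` layout (`file_name` plus a `columns` mapping of `prompt`, `query`, `response`) is an assumption based on LLaMA-Factory's alpaca-style dataset registration and should be checked against the installed version.

```python
import json
import tempfile
from pathlib import Path

# Toy record with the same three keys the notebook writes.
records = [{"instruction": "instr", "input": "inp", "output": "out"}]

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "abstRCT_pipeline_neo_train.json"

    # Explicit UTF-8 keeps non-ASCII characters in abstracts readable.
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # Reload to confirm the file round-trips unchanged.
    with open(data_path, encoding="utf-8") as f:
        reloaded = json.load(f)

    # Assumed registration entry for LLaMA-Factory's dataset_info.json.
    dataset_info = {
        "abstRCT_pipeline_neo_train": {
            "file_name": data_path.name,
            "columns": {
                "prompt": "instruction",
                "query": "input",
                "response": "output",
            },
        }
    }
    info_text = json.dumps(dataset_info, indent=2)

print(reloaded == records)
print(info_text)
```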