# Prepare dataset (JSON files)

- Prepare the CDCP dataset for LLaMA-Factory.

- Task: Argument Relation Identification (ARI).

- We create the data files `CDCP_ari_train.json` and `CDCP_ari_test.json`.

## Libraries

In [1]:
import os
import json
import datasets
from datasets import load_dataset

## Load Data

In [2]:
cdcp_dataset = load_dataset("DFKI-SLT/cdcp", trust_remote_code=True)

In [3]:
cdcp_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'propositions', 'relations'],
        num_rows: 580
    })
    test: Dataset({
        features: ['id', 'text', 'propositions', 'relations'],
        num_rows: 150
    })
})

In [4]:
cdcp_dataset["train"][24]

{'id': '00239',
 'text': 'First, I have learned there is a significant legal difference between 1) identify theft, 2) mistaken identity (see contract law Furthermore, that these three words need to be separated, clarified and their legal handling completely examined. There is almost zero legal remedy for wrong person found. More information would NOT have helped. The lawyer had plenty information  but he did not validate any of it after he found my address. The clerk told me these lawyers and their agents are "just going through the phone book/internet" hauling people with the same first and last name into court thinking they will eventually find the debtor. A complete waste of court resources, an enormous expense to the innocent citizen, and the unlawful detainment and threatening of an innocent citizen and their assets. The debtor, it has since been discovered has left the county. The lawyer could have easily validated his information using online public property records from both th

In [5]:
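def formatting_fct(instruction="", input="", output="", mode="train"):
    # Build one instruction/input/output record.
    # The output is kept only when mode == "train"; otherwise it is left empty.
    prompt_d = {
        "instruction": str(instruction),
        "input": str(input),
        "output": str(output) if mode == "train" else "",
    }

    return prompt_d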

In [10]:
instruction = """### You are an expert in Argument Mining. You are given a text which contains numbered argument components enclosed by <AC></AC> tags. Your task is to identify argument relations between argument components in the text. You must return a list of pairs in the following JSON format: {"list_argument_relations": [[source AC (int), target AC (int)], ..., [source AC (int), target AC (int)]]}
"""

In [11]:
instruction

'### You are an expert in Argument Mining. You are given a text which contains numbered argument components enclosed by <AC></AC> tags. Your task is to identify argument relations between argument components in the text. You must return a list of pairs in the following JSON format: {"list_argument_relations": [[source AC (int), target AC (int)], ..., [source AC (int), target AC (int)]]}\n'

In [12]:
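def insert_tags(text, start_indices, end_indices):
    # Wrap each proposition span in numbered <ACi></ACi> tags.
    # Spans are assumed to be sorted by position and non-overlapping.
    offset = 0  # total length of the tags inserted so far

    for i, (start_i, end_i) in enumerate(zip(start_indices, end_indices)):

        start_tag = "<AC" + str(i) + ">"
        end_tag = "</AC" + str(i) + ">"

        # Shift the original indices by the tags already inserted
        # (tag length grows once i reaches 10, so accumulate it).
        start_idx = start_i + offset
        end_idx = end_i + offset

        # Splice by index; str.replace would also rewrite any repeated span.
        text = text[:start_idx] + start_tag + text[start_idx:end_idx] + end_tag + text[end_idx:]
        offset += len(start_tag) + len(end_tag)

    # Built after the loop so it is defined even when there are no propositions.
    question = f"""### Here is the text: {text}"""

    return question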
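
A quick way to sanity-check the tagging is to strip the tags again and compare the result with the original text. The helper below is a minimal sketch; `strip_ac_tags` is a hypothetical name and the toy sentence is made up for illustration, not taken from CDCP.

```python
import re

def strip_ac_tags(tagged_text):
    # Remove every <ACi> and </ACi> marker, leaving the untagged text.
    return re.sub(r"</?AC\d+>", "", tagged_text)

# Toy example (made up for illustration, not taken from CDCP).
tagged = "<AC0>Fees are too high.</AC0><AC1> They should be capped.</AC1>"
print(strip_ac_tags(tagged))
```

In the notebook, `strip_ac_tags(insert_tags(sample_text, start_l, end_l))` would be expected to equal `"### Here is the text: " + sample_text`, assuming the raw text contains no `<AC…>` markers of its own.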

In [17]:
def get_ars(raw_relations):
    # Collect (source, target) pairs; the relation labels in
    # raw_relations["label"] ("evidence", "reason") are not used here.
    sources = raw_relations["head"]
    targets = raw_relations["tail"]

    ars_l = [(source, target) for source, target in zip(sources, targets)]

    # json.dumps serializes each tuple as a two-element JSON list.
    return json.dumps({"list_argument_relations": ars_l})
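
Because the answer is stored as a JSON string, it can be decoded back into pairs when comparing predictions with gold relations. The following is a sketch under that assumption; `parse_relations` is a hypothetical helper, and the predicted string is invented for the example.

```python
import json

def parse_relations(output_str):
    # Decode the JSON answer and return the relations
    # as a set of (source, target) tuples.
    payload = json.loads(output_str)
    return {(int(s), int(t)) for s, t in payload["list_argument_relations"]}

gold = parse_relations('{"list_argument_relations": [[0, 1], [2, 0]]}')
pred = parse_relations('{"list_argument_relations": [[2, 0], [1, 2]]}')

# Relations that appear in both sets.
print(gold & pred)
```

Using sets makes the comparison independent of the order in which the pairs are listed.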

## Create Data Files

In [18]:
data_file_train = []

for sample in cdcp_dataset["train"]:

    sample_text = sample["text"]
    start_l = sample["propositions"]["start"]
    end_l = sample["propositions"]["end"]
    raw_relations = sample["relations"]
    
    question = insert_tags(sample_text, start_l, end_l)
    answer = get_ars(raw_relations)
    
    data_file_train.append( formatting_fct(instruction, question, answer, mode="train") )

In [19]:
len(data_file_train)

580

In [20]:
for i in range(5):
    
    print(data_file_train[i])
    print()

{'instruction': '### You are an expert in Argument Mining. You are given a text which contains numbered argument components enclosed by <AC></AC> tags. Your task is to identify argument relations between argument components in the text. You must return a list of pairs in the following JSON format: {"list_argument_relations": [[source AC (int), target AC (int)], ..., [source AC (int), target AC (int)]]}\n', 'input': '### Here is the text: <AC0>State and local court rules sometimes make default judgments much more likely.</AC0><AC1> For example, when a person who allegedly owes a debt is told to come to court on a work day, they may be forced to choose between a default judgment and their job.</AC1><AC2> I urge the CFPB to find practices that involve scheduling hearings at inconvenient times unfair, deceptive, and abusive, or inconsistent with 1692i.</AC2>', 'output': '{"list_argument_relations": [[0, 1], [2, 0]]}'}

{'instruction': '### You are an expert in Argument Mining. You are give

In [21]:
data_file_test = []

for sample in cdcp_dataset["test"]:

    sample_text = sample["text"]
    start_l = sample["propositions"]["start"]
    end_l = sample["propositions"]["end"]
    raw_relations = sample["relations"]
    
    question = insert_tags(sample_text, start_l, end_l)
    answer = get_ars(raw_relations)
    
    data_file_test.append( formatting_fct(instruction, question, answer, mode="train") )

In [22]:
len(data_file_test)

150

In [23]:
for i in range(5):
    
    print(data_file_test[i])
    print()

{'instruction': '### You are an expert in Argument Mining. You are given a text which contains numbered argument components enclosed by <AC></AC> tags. Your task is to identify argument relations between argument components in the text. You must return a list of pairs in the following JSON format: {"list_argument_relations": [[source AC (int), target AC (int)], ..., [source AC (int), target AC (int)]]}\n', 'input': "### Here is the text: <AC0>Recently, courts have held that debt collectors can escape 1692i's venue provisions entirely by pursuing debt collection through arbitration instead.</AC0><AC1> As the NAF studies reflect, arbitration has not proven a satisfactory alternative.</AC1><AC2> I urge the CFPB to include in a rule language interpreting 1692i as requiring debt collectors to proceed in court, not through largely-unregulated arbitral forums.</AC2>", 'output': '{"list_argument_relations": [[2, 1]]}'}

{'instruction': '### You are an expert in Argument Mining. You are given a

## Save `json` files

In [24]:
file_path = os.path.join(os.getcwd(), "../datasets/CDCP_ari_train.json")

with open(file_path, 'w') as file:
    
    json.dump(data_file_train, file)

In [25]:
file_path = os.path.join(os.getcwd(), "../datasets/CDCP_ari_test.json")

with open(file_path, 'w') as file:
    
    json.dump(data_file_test, file)
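
After saving, it can be worth reloading a file and checking that every record has the expected shape. The sketch below assumes the files are JSON arrays of `instruction`/`input`/`output` records, as written above; `check_records` is a hypothetical helper, and the demo runs on a temporary file with one made-up record rather than on the real data.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"instruction", "input", "output"}

def check_records(path):
    # Load a JSON array of records and verify that each one has the
    # expected keys and an output that decodes to the relation-list format.
    with open(path, "r") as f:
        records = json.load(f)
    for record in records:
        assert set(record) == REQUIRED_KEYS
        assert "list_argument_relations" in json.loads(record["output"])
    return len(records)

# Self-contained demo on a temporary file with one made-up record.
demo = [{
    "instruction": "demo",
    "input": "<AC0>a</AC0>",
    "output": '{"list_argument_relations": []}',
}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    n_records = check_records(demo_path)

print(n_records)
```

In the notebook, calling `check_records(file_path)` on the saved train and test files should return 580 and 150, matching the lengths printed earlier.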