# Prepare dataset (JSON file)

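- Prepare the CDCP dataset for LLaMA-Factory.

- Joint task: argument type classification plus argument relation identification and classification.

- We create the data files `cdcp_pipeline_train.json` and `cdcp_pipeline_test.json`.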

## Libraries

In [1]:
import os
import json
import datasets

import pandas as pd
from pathlib import Path
from datasets import load_dataset

## Load Data

In [2]:
cdcp_dataset = load_dataset("DFKI-SLT/cdcp", trust_remote_code=True)

In [3]:
cdcp_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'propositions', 'relations'],
        num_rows: 580
    })
    test: Dataset({
        features: ['id', 'text', 'propositions', 'relations'],
        num_rows: 150
    })
})

In [4]:
cdcp_dataset['test'][43] # type: ignore

{'id': '00525',
 'text': "AT & T is notorious for setting up new accounts that don't have an official phone number attached. I was put in the situation of having an 800 number that AT & T customer service reps could not find in their system. I had no way to cancel it! Eventually the false 100 dollar debt went to a debt collector. I respectfully explained that the debt was not valid, in a letter. A few months later I got a letter from a different debt collector. Again I wrote a letter. A few months later another letter over the same faux debt from a third debt collector.",
 'propositions': {'start': [0, 98, 215, 242, 305, 372, 439, 463],
  'end': [98, 215, 242, 305, 372, 439, 463, 550],
  'label': [0, 3, 3, 3, 3, 3, 3, 3],
  'url': ['', '', '', '', '', '', '', '']},
 'relations': {'head': [0, 0, 0, 0, 0, 0, 0],
  'tail': [1, 2, 3, 4, 5, 6, 7],
  'label': [0, 0, 0, 0, 0, 0, 0]}}
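
The `start`/`end` lists appear to be half-open character offsets into `text`: in the sample above, each `end` equals the next `start`. Under that reading, `text[start:end]` recovers one proposition. A minimal sketch on a made-up comment; `toy_sample` is hypothetical and only mirrors the field layout shown above:

```python
# Hypothetical sample mirroring the 'text' / 'propositions' layout above.
toy_sample = {
    "text": "I disagree. Keep a paper trail.",
    "propositions": {"start": [0, 11], "end": [11, 31]},
}

text = toy_sample["text"]
spans = zip(toy_sample["propositions"]["start"], toy_sample["propositions"]["end"])

# Half-open slicing: the end offset is exclusive.
propositions = [text[start:end].strip() for start, end in spans]
print(propositions)  # ['I disagree.', 'Keep a paper trail.']
```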

In [5]:
def build_data(dataset):
    """Convert CDCP samples into rows of [id, text, acs, ac_labels, rel_sources, rel_targets, rel_labels, rels]."""

    # Label names, indexed by the integer ids used in the dataset.
    class_labels = ["fact", "policy", "reference", "testimony", "value"]
    relation_labels = ["evidence", "reason"]

    data = []

    for sample in dataset:

        sample_id = sample["id"]
        text = sample["text"]
        rel_sources = sample["relations"]["head"]
        rel_targets = sample["relations"]["tail"]
        rel_labels = sample["relations"]["label"]

        start_idxs = sample["propositions"]["start"]
        end_idxs = sample["propositions"]["end"]

        # Offsets are half-open (each end equals the next start), so slice with [i:j].
        acs = [text[i:j].strip() for i, j in zip(start_idxs, end_idxs)]
        ac_labels = [class_labels[i] for i in sample["propositions"]["label"]]

        # Relation triplets use 1-based indices to match the numbered argument list in the prompt.
        rels = [
            (src + 1, tgt + 1, relation_labels[label])
            for src, tgt, label in zip(rel_sources, rel_targets, rel_labels)
        ]

        data.append([sample_id, text, acs, ac_labels, rel_sources, rel_targets, rel_labels, rels])

    return data
        
    

In [6]:
train_data = build_data(cdcp_dataset['train'])

In [7]:
test_data = build_data(cdcp_dataset['test'])

In [8]:
len(train_data), len(test_data)

(580, 150)
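
`build_data` shifts the 0-based `head`/`tail` indices by one so that the triplets line up with the 1-based argument numbering used in the prompt. A small sketch of that conversion, using the relation fields of record `00538` (`train_data[159]`) as printed in this notebook:

```python
relation_labels = ["evidence", "reason"]

# Relation fields of record 00538, copied from the notebook output.
heads = [0, 0, 3]
tails = [1, 2, 0]
labels = [0, 0, 1]

# Shift indices to 1-based and map label ids to names.
triplets = [
    (head + 1, tail + 1, relation_labels[label])
    for head, tail, label in zip(heads, tails, labels)
]
print(triplets)  # [(1, 2, 'evidence'), (1, 3, 'evidence'), (4, 1, 'reason')]
```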

In [9]:
train_df = pd.DataFrame(train_data, columns=["id", "text", "acs_list", "ac_types", "rel_sources", "rel_targets", "rel_types", "relations"])
test_df = pd.DataFrame(test_data, columns=["id", "text", "acs_list", "ac_types", "rel_sources", "rel_targets", "rel_types", "relations"])

In [10]:
train_df

| | id | text | acs_list | ac_types | rel_sources | rel_targets | rel_types | relations |
|---|---|---|---|---|---|---|---|---|
| 0 | 00195 | State and local court rules sometimes make def... | [State and local court rules sometimes make de... | [value, value, policy] | [0, 2] | [1, 0] | [1, 1] | [(1, 2, reason), (3, 1, reason)] |
| 1 | 00197 | There is currently a split between the Ninth a... | [There is currently a split between the Ninth ... | [fact, fact, value, policy] | [3] | [0] | [1] | [(4, 1, reason)] |
| 2 | 00198 | In many districts where bad check diversion pr... | [In many districts where bad check diversion p... | [fact, fact, policy, policy] | [2] | [0] | [1] | [(3, 1, reason)] |
| 3 | 00200 | Consumer attorneys occasionally advise their c... | [Consumer attorneys occasionally advise their ... | [fact, policy] | [1] | [0] | [1] | [(2, 1, reason)] |
| 4 | 00201 | When an attorney is a 'Debt Collector' (as per... | [When an attorney is a 'Debt Collector' (as pe... | [fact, policy, fact, fact, fact, fact, fact, f... | [2, 5, 8] | [3, 6, 9] | [1, 1, 1] | [(3, 4, reason), (6, 7, reason), (9, 10, reason)] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 575 | 01409 | it still bothers me that "financial literacy" ... | [it still bothers me that "financial literacy"... | [value, value, fact, fact, fact, value, testim... | [6] | [7] | [0] | [(7, 8, evidence)] |
| 576 | 01410 | Even when the right person is hauled into cour... | [Even when the right person is hauled into cou... | [value, policy, fact, policy, policy] | [3, 4] | [2, 2] | [1, 1] | [(4, 3, reason), (5, 3, reason)] |
| 577 | 01412 | This Draft Summary of Discussion fails to addr... | [This Draft Summary of Discussion fails to add... | [value, value, value, fact, fact, value, value... | [3, 4, 2, 2, 2, 8, 10] | [12, 14, 3, 4, 6, 9, 11] | [0, 0, 1, 1, 1, 1, 1] | [(4, 13, evidence), (5, 15, evidence), (3, 4, ... |
| 578 | 01414 | I would disagree that there is confusion among... | [I would disagree that there is confusion amon... | [value, testimony, policy, value, value, polic... | [2, 11] | [3, 8] | [1, 1] | [(3, 4, reason), (12, 9, reason)] |


In [11]:
test_df

| | id | text | acs_list | ac_types | rel_sources | rel_targets | rel_types | relations |
|---|---|---|---|---|---|---|---|---|
| 0 | 00194 | Recently, courts have held that debt collector... | [Recently, courts have held that debt collecto... | [fact, value, policy] | [2] | [1] | [1] | [(3, 2, reason)] |
| 1 | 00196 | When alleged debtors are served with state cou... | [When alleged debtors are served with state co... | [value, policy, policy, policy, policy] | [] | [] | [] | [] |
| 2 | 00199 | To avoid consumers getting sued repeatedly on ... | [To avoid consumers getting sued repeatedly on... | [policy, policy] | [] | [] | [] | [] |
| 3 | 00226 | The solution to this problem is a mandate that... | [The solution to this problem is a mandate tha... | [value, value] | [] | [] | [] | [] |
| 4 | 00236 | If the present mortgage repayment debacle is a... | [If the present mortgage repayment debacle is ... | [value, value, value, testimony, value] | [4, 4] | [2, 3] | [1, 1] | [(5, 3, reason), (5, 4, reason)] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 01388 | I appreciate Mr. Bartmann's comments but they ... | [I appreciate Mr. Bartmann's comments, but the... | [value, value, fact, fact, value, value, testi... | [16, 19] | [17, 21] | [1, 1] | [(17, 18, reason), (20, 22, reason)] |
| 146 | 01398 | This is the PROBLEM over and over Why should l... | [This is the PROBLEM over and over, Why should... | [value, policy, value, value, value, value, va... | [4, 7] | [3, 8] | [1, 1] | [(5, 4, reason), (8, 9, reason)] |
| 147 | 01405 | I disagree. Keeping a paper trail protects eve... | [I disagree., Keeping a paper trail protects e... | [value, value] | [0] | [1] | [1] | [(1, 2, reason)] |
| 148 | 01411 | The Permissible hours of 9am - 8 pm is good. I... | [The Permissible hours of 9am - 8 pm is good.,... | [value, value, value, value, value, value, val... | [0, 3, 5, 6, 7, 9, 11, 13] | [3, 4, 4, 3, 8, 10, 12, 12] | [1, 1, 1, 1, 1, 1, 1, 1] | [(1, 4, reason), (4, 5, reason), (6, 5, reason... |


In [12]:
def sanity_check(row):
    
    acs_check = 1 if (len(row.acs_list) == len(row.ac_types)) else 0
    rels_check = 1 if (len(row.rel_sources) == len(row.rel_targets) == len(row.rel_types) == len(row.relations)) else 0
    
    return 1 if acs_check and rels_check else 0

In [13]:
train_df['sanity'] = train_df.apply(sanity_check, axis=1)
test_df['sanity'] = test_df.apply(sanity_check, axis=1)

In [14]:
train_df['sanity'].value_counts(), test_df['sanity'].value_counts()

(sanity
 1    580
 Name: count, dtype: int64,
 sanity
 1    150
 Name: count, dtype: int64)
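
The length check above does not verify that every relation index points at an existing argument. A possible extra check, sketched here on toy rows (the helper name `relations_in_range` is hypothetical and not part of the notebook):

```python
def relations_in_range(acs_list, relations):
    """Return True if every 1-based index in `relations` refers to an item of `acs_list`."""
    n = len(acs_list)
    return all(1 <= first <= n and 1 <= second <= n for first, second, _ in relations)


# Toy rows: three arguments with valid relations, then two arguments with a dangling index.
print(relations_in_range(["a", "b", "c"], [(1, 2, "reason"), (3, 1, "reason")]))  # True
print(relations_in_range(["a", "b"], [(3, 1, "reason")]))  # False
```

On the DataFrames above it could be applied row-wise, for example `train_df.apply(lambda row: relations_in_range(row.acs_list, row.relations), axis=1)`.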

In [15]:
train_data[159]

['00538',
 'People get fired for personal calls at work. I worked at a place where employee phone call was recorded and randomly monitored due to the sensitivity of our work. And every now and then an employee would get caught making/taking a personal phone call and would be fired with no questions asked. Work is not the place for personal business.',
 ['People get fired for personal calls at work.',
  'I worked at a place where employee phone call was recorded and randomly monitored due to the sensitivity of our work.',
  'And every now and then an employee would get caught making/taking a personal phone call and would be fired with no questions asked.',
  'Work is not the place for personal business.'],
 ['fact', 'testimony', 'testimony', 'value'],
 [0, 0, 3],
 [1, 2, 0],
 [0, 0, 1],
 [(1, 2, 'evidence'), (1, 3, 'evidence'), (4, 1, 'reason')]]

## Build Prompts

In [16]:
def formatting_fct(instruction="", input="", output=""):
    
    prompt_d ={
            
        "instruction": f"""{instruction}""",
        "input": f"""{input}""",
        "output": f"""{output}"""
            
        }
    
    return prompt_d

In [17]:
def build_instruction():
    
    instruction = """### You are an expert in Argument Mining tasked with analyzing user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB). 

INPUT:
You will receive:
- The complete user comment.
- An enumerated list of identified arguments extracted from the user comment.

TASK 1: Argument Classification
- Classify each argument in the user comment into one of the following categories: "fact"(F), "policy"(P), "reference"(R), "testimony"(T) or "value"(V).
- "fact"(F): expressing or dealing with facts or conditions as perceived without distortion by personal feelings, prejudices, or interpretations
- "policy"(P): a specific course of action to be taken, typically contains modal verbs like "should" and "ought to"
- "reference"(R): a source of objective evidence
- "testimony"(T): an objective proposition about the author's personal state or experience
- "value"(V): judgments without making specific claims about what should be done
- You must return a list of argument types in the following JSON format: {"argument_types": [argument_types (str), argument_types (str), ..., argument_types (str)]}

TASK 2: Argument Relations Identification and Classification
- Identify relationships between arguments by determining which arguments serve as evidence or reasons for other arguments.
- For each related argument pair, classify the relationship as either: "evidence"(E) or "reason"(R).
- "evidence"(E): X (testimony, fact, reference) is evidence for a proposition Y if X proves whether proposition Y is true or not
- "reason"(R): X (source) is reason for a proposition Y (target; policy, value, fact, testimony) if X provides rationale for Y
- You must return a list of triplets in the following JSON format: {"relation_types": [(target_index (int), source_index (int), relation_type (str)), (target_index (int), source_index (int), relation_type (str)), ...]}
- Note: Indices are 1-based, referring to the position in the provided arguments list.

Example:

### Output:
{"argument_types": ['F', 'T', 'T', 'V']}
{"relation_types": [(1, 2, 'E'), (1, 3, 'E'), (4, 1, 'R')]}
"""

    return instruction

In [24]:
def build_input(text, acs_l):

    text = f"""### User comment:\n{text}\n\n"""
    acs = "### List of arguments in the user comment: \n" + '\n'.join(f"{i + 1}. {ac}" for i, ac in enumerate(acs_l)) + "\n\n"

    return text + acs

In [25]:
def build_output(ac_types, relations):
    
    class_labels_d = {"fact": "F", "policy": "P", "reference": "R", "testimony": "T", "value": "V"}
    relation_labels_d = {"evidence": "E", "reason": "R"}
    
    ac_types = [class_labels_d[elem] for elem in ac_types]
    relations = [(i, j, relation_labels_d[elem]) for i, j, elem in relations]

    output_indic = "### Output:\n"
    ac_types = f'{{"argument_types": {ac_types}}}\n'
    rels_types = f'{{"relation_types": {relations}}}\n'
    
    return output_indic + ac_types + rels_types
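
Interpolating Python lists into the f-strings yields Python literals (single quotes, tuples) rather than strict JSON, so `json.loads` would reject these lines. If strict JSON is wanted, one option is to serialize with `json.dumps`. The sketch below is an alternative under that assumption, not the format used in this notebook, and `build_output_json` is a hypothetical name:

```python
import json


def build_output_json(ac_types, relations):
    """Variant of build_output that emits strict JSON (tuples become lists)."""
    class_labels_d = {"fact": "F", "policy": "P", "reference": "R", "testimony": "T", "value": "V"}
    relation_labels_d = {"evidence": "E", "reason": "R"}

    types = [class_labels_d[t] for t in ac_types]
    rels = [[i, j, relation_labels_d[r]] for i, j, r in relations]

    lines = [
        "### Output:",
        json.dumps({"argument_types": types}),
        json.dumps({"relation_types": rels}),
    ]
    return "\n".join(lines) + "\n"


print(build_output_json(["value", "value", "policy"], [(1, 2, "reason"), (3, 1, "reason")]))
```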

## Create Data Files

In [26]:
data_file_train = []

for _, row in train_df.iterrows():


    instruction = build_instruction()
    question = build_input(row.text, row.acs_list)
    answer = build_output(row.ac_types, row.relations)
    
    data_file_train.append(formatting_fct(instruction, question, answer))

In [27]:
len(data_file_train)

580

In [28]:
print(data_file_train[0]['instruction'])

### You are an expert in Argument Mining tasked with analyzing user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB). 

INPUT:
You will receive:
- The complete user comment.
- An enumerated list of identified arguments extracted from the user comment.

TASK 1: Argument Classification
- Classify each argument in the user comment into one of the following categories: "fact"(F), "policy"(P), "reference"(R), "testimony"(T) or "value"(V).
- "fact"(F): expressing or dealing with facts or conditions as perceived without distortion by personal feelings, prejudices, or interpretations
- "policy"(P): a specific course of action to be taken, typically contains modal verbs like "should" and "ought to"
- "reference"(R): a source of objective evidence
- "testimony"(T): an objective proposition about the author's personal state or experience
- "value"(V): judgments without making specific claims about what should be done
- You must return a

In [29]:
print(data_file_train[0]['input'])

### User comment:
State and local court rules sometimes make default judgments much more likely. For example, when a person who allegedly owes a debt is told to come to court on a work day, they may be forced to choose between a default judgment and their job. I urge the CFPB to find practices that involve scheduling hearings at inconvenient times unfair, deceptive, and abusive, or inconsistent with 1692i.

### List of arguments in the user comment: 
1. State and local court rules sometimes make default judgments much more likely.
2. For example, when a person who allegedly owes a debt is told to come to court on a work day, they may be forced to choose between a default judgment and their job.
3. I urge the CFPB to find practices that involve scheduling hearings at inconvenient times unfair, deceptive, and abusive, or inconsistent with 1692i.




In [30]:
print(data_file_train[0]['output'])

### Output:
{"argument_types": ['V', 'V', 'P']}
{"relation_types": [(1, 2, 'R'), (3, 1, 'R')]}
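
Because the two output lines are Python literals, they can be read back with `ast.literal_eval`, for instance when scoring model predictions. A minimal sketch, assuming the text follows exactly the layout printed above; `parse_output` is a hypothetical helper:

```python
import ast


def parse_output(output_text):
    """Parse the '### Output:' block into (argument_types, relations)."""
    # Keep only the two payload lines, dropping the header and blank lines.
    lines = [ln for ln in output_text.splitlines() if ln.strip() and not ln.startswith("###")]
    types_d = ast.literal_eval(lines[0])
    rels_d = ast.literal_eval(lines[1])
    return types_d["argument_types"], rels_d["relation_types"]


sample = '''### Output:
{"argument_types": ['V', 'V', 'P']}
{"relation_types": [(1, 2, 'R'), (3, 1, 'R')]}
'''

print(parse_output(sample))  # (['V', 'V', 'P'], [(1, 2, 'R'), (3, 1, 'R')])
```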



In [31]:
data_file_test = []

for _, row in test_df.iterrows():


    instruction = build_instruction()
    question = build_input(row.text, row.acs_list)
    answer = build_output(row.ac_types, row.relations)
    
    data_file_test.append(formatting_fct(instruction, question, answer))

In [32]:
len(data_file_test)

150

In [33]:
print(data_file_test[0]['instruction'])

### You are an expert in Argument Mining tasked with analyzing user comments on Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB). 

INPUT:
You will receive:
- The complete user comment.
- An enumerated list of identified arguments extracted from the user comment.

TASK 1: Argument Classification
- Classify each argument in the user comment into one of the following categories: "fact"(F), "policy"(P), "reference"(R), "testimony"(T) or "value"(V).
- "fact"(F): expressing or dealing with facts or conditions as perceived without distortion by personal feelings, prejudices, or interpretations
- "policy"(P): a specific course of action to be taken, typically contains modal verbs like "should" and "ought to"
- "reference"(R): a source of objective evidence
- "testimony"(T): an objective proposition about the author's personal state or experience
- "value"(V): judgments without making specific claims about what should be done
- You must return a

In [34]:
print(data_file_test[0]['input'])

### User comment:
Recently, courts have held that debt collectors can escape 1692i's venue provisions entirely by pursuing debt collection through arbitration instead. As the NAF studies reflect, arbitration has not proven a satisfactory alternative. I urge the CFPB to include in a rule language interpreting 1692i as requiring debt collectors to proceed in court, not through largely-unregulated arbitral forums.

### List of arguments in the user comment: 
1. Recently, courts have held that debt collectors can escape 1692i's venue provisions entirely by pursuing debt collection through arbitration instead.
2. As the NAF studies reflect, arbitration has not proven a satisfactory alternative.
3. I urge the CFPB to include in a rule language interpreting 1692i as requiring debt collectors to proceed in court, not through largely-unregulated arbitral forums.




In [35]:
print(data_file_test[0]['output'])

### Output:
{"argument_types": ['F', 'V', 'P']}
{"relation_types": [(3, 2, 'R')]}



## Save `json` files

In [36]:
DATASETS_DIR = Path("/home/umushtaq/am_reasoning/cdcp/datasets")

In [37]:
file_path = DATASETS_DIR / "cdcp_pipeline_train.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_train, file)

In [38]:
file_path = DATASETS_DIR / "cdcp_pipeline_test.json"

with open(file_path, 'w') as file:
    
    json.dump(data_file_test, file)
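
`json.dump` on a list writes a single JSON array (not one object per line), so the file can be read back with `json.load`. A quick round-trip check, sketched with a temporary directory and a toy record instead of the real `DATASETS_DIR`; the file name `toy_train.json` is made up for the example:

```python
import json
import tempfile
from pathlib import Path

# Toy record with the three keys produced by formatting_fct.
records = [{"instruction": "classify", "input": "### User comment:\n...", "output": "### Output:\n..."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "toy_train.json"

    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(records, file)

    with open(file_path, encoding="utf-8") as file:
        loaded = json.load(file)

# Every record should carry exactly the expected keys after the round trip.
assert all(set(r) == {"instruction", "input", "output"} for r in loaded)
print(len(loaded))  # 1
```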