---

#### $Load$ $Libraries$

---

In [1]:
from datasets import load_dataset  # Hugging Face library, formerly published as `nlp`
import os
import json
import re

---

#### $Break$ $Dataset$ 




---

##### *$Break$ is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). It consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases.*

##### Datasets: 

1. $QDMR$: Contains questions over text, images and databases annotated with their Question Decomposition Meaning Representation. In addition to the train, dev and (hidden) test sets, `lexicon_tokens` files are provided. For each question, the lexicon file contains the set of valid tokens that could appear in its decomposition.

2. $QDMR high-level$: Contains questions annotated with the high-level variant of QDMR. These decompositions are exclusive to Reading Comprehension tasks. `lexicon_tokens` files are also provided.

3. $Logical-forms$: Contains questions and QDMRs annotated with full logical-forms of QDMR operators + arguments. Full logical-forms were inferred by the annotation-consistency algorithm.

*$Note$: The GitHub repository also provides the `app_store_generation.py` file, whose `valid_annotation_tokens` method generates the valid lexicon tokens for a new example. For new examples, the resulting tokens still need to be formatted according to the lexicon file format: {"source": "NL question", "allowed_tokens": [valid lexicon tokens]}.*
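As a rough sketch of the formatting step mentioned in the note, the snippet below wraps a question and a token list into one JSON line in the stated lexicon format. The helper name `format_lexicon_entry` and the sample tokens are hypothetical; in practice the tokens would come from `valid_annotation_tokens`, which is not called here.

```python
import json


def format_lexicon_entry(question: str, tokens: list) -> str:
    """Serialize one lexicon entry as a single JSON line (hypothetical helper)."""
    entry = {"source": question, "allowed_tokens": tokens}
    return json.dumps(entry, ensure_ascii=False)


# Illustrative tokens only; real ones would come from valid_annotation_tokens.
line = format_lexicon_entry(
    "return me the homepage of PVLDB .",
    ["homepage", "of", "PVLDB"],
)
print(line)
```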

---

#### $Load$ $Break$ - $QDMR$

---

To download the dataset, visit https://github.com/allenai/Break/tree/master/break_dataset/QDMR and either:
* Option 1: Clone the repo.
* Option 2: Call `load_dataset('break_data', 'QDMR')` and, if needed, download the lexicon token files manually from the GitHub repo.




The QDMR version contains `train`, `validation` and `test` sets. 

* Train/validation sets: provide both the questions and their decompositions.
* Test set: provides the questions only; the decompositions are hidden.

In [2]:
qdmr_dataset = load_dataset('break_data', 'QDMR')

In [3]:
qdmr_dataset.keys()

dict_keys(['train', 'validation', 'test'])

Each example contains: 

1. `question_id`: The Break question id, of the format [ORIGINAL DATASET]_[original split]_[original id]. 
   * e.g., NLVR2_dev_dev-1049-1-1 comes from the NLVR2 dev split, and its NLVR2 id is dev-1049-1-1.

2. `question_text`: Original question text.

3. `decomposition`: The annotated QDMR of the question, its steps delimited by ;. 
   * e.g., return flights ;return #1 from  washington ;return #2 to boston ;return #3 in the afternoon.

4. `operators`: List of tagged QDMR operators, one per step. The operators are fully described in Section 2 of the paper. The 14 possible operators are: select, project, filter, aggregate, group, superlative, comparative, union, intersection, discard, sort, boolean, arithmetic, comparison. Unidentified operators are tagged with None.

5. `split`: The Break dataset split of the example (train, dev, test).
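To make the `decomposition` format concrete, here is a minimal sketch that splits the example string on `;` and lists which earlier steps each step refers to through `#N`. The helper names are illustrative and not part of the Break tooling.

```python
import re


def split_decomposition(decomposition: str) -> list:
    """Split a QDMR string on ';' and collapse repeated whitespace in each step."""
    steps = decomposition.split(";")
    return [re.sub(r"\s+", " ", step).strip() for step in steps if step.strip()]


def step_references(step: str) -> list:
    """Return the 1-based step numbers that a step refers to via '#N'."""
    return [int(n) for n in re.findall(r"#(\d+)", step)]


example = (
    "return flights ;return #1 from  washington ;"
    "return #2 to boston ;return #3 in the afternoon"
)
steps = split_decomposition(example)
for number, step in enumerate(steps, start=1):
    print(number, step, step_references(step))
```

Each step after the first refers only to the step directly before it, so this particular decomposition forms a simple chain.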

In [4]:
qdmr_dataset['train'].features

{'question_id': Value(dtype='string', id=None),
 'question_text': Value(dtype='string', id=None),
 'decomposition': Value(dtype='string', id=None),
 'operators': Value(dtype='string', id=None),
 'split': Value(dtype='string', id=None)}

In [5]:
qdmr_example = qdmr_dataset["train"][0]

print(f"ID: {qdmr_example['question_id']}")
print(f"Split: {qdmr_example['split']}\t | Operator: {qdmr_example['operators']}")
print("="*50)
print(f"\nQuestion: {qdmr_example['question_text']}")
print(f"\nDecomposition: {qdmr_example['decomposition']}")



ID: ACADEMIC_train_0
Split: train	 | Operator: ['select', 'filter']

Question: return me the homepage of PVLDB . 

Decomposition: return homepages ;return #1 of  PVLDB
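The features listing reports `operators` as a string, not a list. Assuming that string is a Python-style list literal, as the printed value suggests, the sketch below parses it and pairs each operator with its step. It also assumes the two lists have the same length.

```python
import ast

decomposition = "return homepages ;return #1 of  PVLDB"
operators_raw = "['select', 'filter']"  # stored as a string, per the features listing

# Split the decomposition into steps and parse the operator list literal.
steps = [s.strip() for s in decomposition.split(";") if s.strip()]
operators = ast.literal_eval(operators_raw)

# Pair each step with its tagged operator; equal lengths are an assumption.
pairs = list(zip(operators, steps))
for op, step in pairs:
    print(f"{op:>8}: {step}")
```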


---

#### $Load$ $Break$ - $QDMR$ $high$ $level$

---

To download the dataset, visit https://github.com/allenai/Break/tree/master/break_dataset/QDMR-high-level and either:
* Option 1: Clone the repo.
* Option 2: Call `load_dataset('break_data', 'QDMR-high-level')` and, if needed, download the lexicon token files manually from the GitHub repo.





In [6]:
qdmr_high_level_dataset = load_dataset('break_data', 'QDMR-high-level')

In [7]:
qdmr_high_level_dataset.keys()

dict_keys(['train', 'validation', 'test'])

Each example contains (same as QDMR): 

1. `question_id`: The Break question id, of the format [ORIGINAL DATASET]_[original split]_[original id]. 
   * e.g., NLVR2_dev_dev-1049-1-1 comes from the NLVR2 dev split, and its NLVR2 id is dev-1049-1-1.

2. `question_text`: Original question text.

3. `decomposition`: The annotated QDMR of the question, its steps delimited by ;. 
   * e.g., return flights ;return #1 from  washington ;return #2 to boston ;return #3 in the afternoon.

4. `operators`: List of tagged QDMR operators, one per step. The operators are fully described in Section 2 of the paper. The 14 possible operators are: select, project, filter, aggregate, group, superlative, comparative, union, intersection, discard, sort, boolean, arithmetic, comparison. Unidentified operators are tagged with None.

5. `split`: The Break dataset split of the example (train, dev, test).

In [8]:
qdmr_high_level_example = qdmr_high_level_dataset["train"][0]

print(f"ID: {qdmr_high_level_example['question_id']}")
print(f"Split: {qdmr_high_level_example['split']}\t | Operator: {qdmr_high_level_example['operators']}")
print("="*50)
print(f"\nQuestion: {qdmr_high_level_example['question_text']}")
print(f"\nDecomposition: {qdmr_high_level_example['decomposition']}")



ID: CWQ_train_WebQTest-1_ffaa5abbcbb61976117b13d82a5011ed
Split: train	 | Operator: ['select', 'filter']

Question: What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?

Decomposition: return office that was  held by James K. Polk before he was president ;return #1 that was  held by a member of the  Maine House of  Representatives


---

#### $QDMR$ $vs.$ $QDMR$ $high$ $level$

---

* In the **QDMR** version, the decomposition is broken down into low-level, programmatic steps. Each step corresponds to a basic logical operation (like *SELECT*, *FILTER*, or *AGGREGATE*), making it ideal for testing a model's ability to generate formal, machine-readable queries.

* The **QDMR-high-level** version provides a more abstract and natural decomposition. The steps are sub-questions that are much closer to how a human would reason through the problem, making it a better fit for evaluating the step-by-step thinking of conversational LLMs.

For our project, which focuses on evaluating an LLM's ability to perform natural prompt decomposition, the **QDMR-high-level** dataset is the more appropriate choice. Its human-like reasoning style is a better match for the "step-by-step thinking" we want to test.

---

#### $Dataset$ $Creation$

---

##### $Lexicon$ $Tokens$

1. Visit https://github.com/allenai/Break/tree/master/break_dataset/QDMR-high-level
2. Download `train_lexicon_tokens.json`.
3. Download `dev_lexicon_tokens.json`.


*First, we create a few-shot file (`qdmr_few_shot.json`) from the first five examples of the training set. Next, we build a separate evaluation file (`qdmr_dataset.json`) from the validation set, following the same process. Both .json files also include the lexicon tokens provided for each question.*

In [15]:
# This function processes a single example from the QDMR-high-level set.
# It extracts the ID, question, decomposition and lexicon tokens.
# In the decomposition, each `;` separator is replaced by a newline (\n) so the steps read more clearly for the model.
# It returns these elements in a structured dictionary.

def decomposition_prompt(qdmr_high_level_example, lexicon_dict) -> dict:
    question_id = qdmr_high_level_example['question_id']
    question = qdmr_high_level_example['question_text']

    # Get the corresponding lexicon tokens for this question
    lexicon_tokens = lexicon_dict.get(question, [])

    # Clean decomposition
    steps = qdmr_high_level_example['decomposition'].split(';')
    cleaned_steps = [re.sub(r'\s+', ' ', step).strip() for step in steps if step.strip()]
    decomposition = '\n'.join(cleaned_steps)

    # Final dictionary
    dict_prompt = {
        "id": question_id,
        "question": question,
        "lexicon tokens": lexicon_tokens,
        "decomposition": decomposition
    }

    return dict_prompt
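Note that `decomposition_prompt` looks tokens up by exact question text and silently falls back to an empty list when there is no match. A small sanity check, sketched below with toy data and a hypothetical helper name, counts how many examples would end up without tokens.

```python
def count_missing_lexicon(examples, lexicon_dict) -> int:
    """Count examples whose question text has no entry in the lexicon mapping."""
    return sum(1 for ex in examples if ex["question_text"] not in lexicon_dict)


# Toy data standing in for dataset rows and the lexicon mapping.
toy_examples = [
    {"question_text": "return me the homepage of PVLDB . "},
    {"question_text": "a question that is not in the lexicon"},
]
toy_lexicon = {"return me the homepage of PVLDB . ": ["homepage", "of", "PVLDB"]}

print(count_missing_lexicon(toy_examples, toy_lexicon))  # 1
```

A non-zero count on the real data would point to a mismatch between the dataset's `question_text` and the lexicon's `source` strings, for example in whitespace.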


##### $Few$ $Shot$ $Examples$

In [16]:
few_shot_examples = []


In [17]:
lexicon_tokens_filename = '../notebooks/train_lexicon_tokens.json'

lexicon_dict = dict()
with open(lexicon_tokens_filename, 'r') as f:
    for line in f:
        entry = json.loads(line.strip())
        # Map question text to allowed tokens
        lexicon_dict[entry['source']] = entry['allowed_tokens']
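The lexicon file is read one JSON object per line. Since the same loading logic is needed for both the train and dev lexicons, it could be factored into a helper along these lines. `load_lexicon` is a hypothetical name, and the demo file contains made-up entries.

```python
import json
import os
import tempfile


def load_lexicon(path: str) -> dict:
    """Read a JSON Lines lexicon file into a {question: allowed_tokens} mapping."""
    lexicon = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            entry = json.loads(line)
            lexicon[entry["source"]] = entry["allowed_tokens"]
    return lexicon


# Demo on a temporary two-line file with made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_lexicon_tokens.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"source": "q1", "allowed_tokens": ["a", "b"]}) + "\n")
        f.write(json.dumps({"source": "q2", "allowed_tokens": ["c"]}) + "\n")
    demo_lexicon = load_lexicon(demo_path)

print(demo_lexicon)
```

If two lines share the same `source` text, the later one overwrites the earlier one, which matches the behaviour of the loop above.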


In [19]:
for example in qdmr_high_level_dataset["train"]:
    if len(few_shot_examples) < 5:
        
        few_shot_examples.append(decomposition_prompt(example, lexicon_dict))
    else: 
        break
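The loop above stops once five examples have been collected. An equivalent and slightly more compact way to take the first `n` items of any iterable is `itertools.islice`; the helper name `take_first` is only illustrative.

```python
from itertools import islice


def take_first(iterable, n: int) -> list:
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(iterable, n))


print(take_first(range(100), 5))  # [0, 1, 2, 3, 4]
```

Applied to this notebook, the same idea would read `[decomposition_prompt(ex, lexicon_dict) for ex in take_first(qdmr_high_level_dataset["train"], 5)]`.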

In [22]:
filename = "qdmr_few_shot.json"
folder = "../QDMR_dataset/"

# Construct the full path for the file
full_path = os.path.join(folder, filename)

# Save the file, creating the output folder first if it does not exist
os.makedirs(folder, exist_ok=True)
with open(full_path, 'w', encoding='utf-8') as f:
    json.dump(few_shot_examples, f, ensure_ascii=False, indent=4)
print("Results have been saved!")


Results have been saved!
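As one possible use of the saved few-shot records, the sketch below joins them into a plain-text prompt that ends with a new question. The prompt layout and the name `build_few_shot_prompt` are assumptions made for illustration, not a format prescribed by Break, and the demo record reuses the QDMR example printed earlier, with its tokens omitted.

```python
def build_few_shot_prompt(examples: list, new_question: str) -> str:
    """Concatenate question/decomposition pairs, then append the new question."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Question: {ex['question']}\nDecomposition:\n{ex['decomposition']}"
        )
    blocks.append(f"Question: {new_question}\nDecomposition:")
    return "\n\n".join(blocks)


# One record in the shape produced by decomposition_prompt (tokens omitted).
demo_examples = [{
    "id": "ACADEMIC_train_0",
    "question": "return me the homepage of PVLDB .",
    "lexicon tokens": [],
    "decomposition": "return homepages\nreturn #1 of PVLDB",
}]

prompt = build_few_shot_prompt(demo_examples, "return flights from washington to boston")
print(prompt)
```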


##### $Evaluation$ $Examples$

In [29]:
evaluation = []

In [30]:
lexicon_tokens_filename = '../notebooks/dev_lexicon_tokens.json'

lexicon_dict = dict()
with open(lexicon_tokens_filename, 'r') as f:
    for line in f:
        entry = json.loads(line.strip())
        # Map question text to allowed tokens
        lexicon_dict[entry['source']] = entry['allowed_tokens']


In [31]:
for example in qdmr_high_level_dataset["validation"]:
    if len(evaluation) < 5:
        
        evaluation.append(decomposition_prompt(example, lexicon_dict))
    else: 
        break

In [32]:
evaluation[0]

{'id': 'CWQ_dev_WebQTest-1011_c0be4f76a5397ba6d0d06f53905e504b',
 'question': 'What Tibetan speaking countries have a population of less than 993885000?',
 'lexicon tokens': "['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', 'countries', 'equal', 'hundred', 'those', 'sorted by', 'elevation', 'which ', '@@6@@', '993885000', 'was ', 'did ', 'population', 'height', 'one', 'that ', 'on', 'did', 'who', 'true', '@@2@@', '100', 'false', 'and', 'was', 'speaking', 'populations', 'who ', 'a ', 'the', 'number of ', '@@16@@', 'if ', 'where', '@@18@@', 'how', 'larger than', 'is ', 'from ', 'a', 'less', 'for each', 'are ', '@@19@@', '@@4@@', '@@11@@', 'distinct ', 'to', 'not ', 'objects', 'with ', ', ', 'lowest', 'in', 'has ', 'zero', 'in ', 'there ', 'lower than', 'highest', '@@9@@', 'than', 'size', 'multiplication', 'with', 'besides ', ',', '@@1@@', 'what', 'have', 'those ', 'of', '@@3@@', 'that', 'there', '@@10@@', '@@5@
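The output above shows `lexicon tokens` as one long string that looks like a Python list literal, not as an actual list. If a real list is needed downstream, something like the following could convert it, assuming the full string is a valid literal. The short string here is a made-up stand-in for the real value, and `ensure_token_list` is a hypothetical helper.

```python
import ast


def ensure_token_list(value) -> list:
    """Return tokens as a list, parsing a stringified list literal if needed."""
    if isinstance(value, str):
        value = ast.literal_eval(value)
    return list(value)


raw_tokens = "['higher than', 'same as', 'what ', 'and ', 'what']"  # shortened stand-in
tokens = ensure_token_list(raw_tokens)
print(len(tokens), tokens[0])
```

Tokens such as `'what '` and `'what'` differ only by a trailing space, so they are kept as distinct entries here.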

In [33]:
filename = "qdmr_dataset.json"
folder = "../QDMR_dataset/"

# Construct the full path for the file
full_path = os.path.join(folder, filename)

# Save the file, creating the output folder first if it does not exist
os.makedirs(folder, exist_ok=True)
with open(full_path, 'w', encoding='utf-8') as f:
    json.dump(evaluation, f, ensure_ascii=False, indent=4)
print("Results have been saved!")


Results have been saved!
