# Summary

The Beetle dataset was released in XML format. This notebook outlines the steps taken to convert it into a Hugging Face dataset.

While converting the dataset, we made three design decisions:

1. We included only the 5-way labels and provided Python code to convert them into 3-way or 2-way labels.
2. We reformatted the identifiers.
3. We kept a single reference answer per question. The dataset contains multiple reference answers of varying standards for each question. Including a column with lists as values would complicate data processing and model training, as most functions and language models expect scalar values, and most NLP tasks use only one reference answer per question. Therefore, we selected the reference answer rated as the best by the authors for each question. Some questions have multiple reference answers rated as the best; in such cases, we chose the first one listed. We also created a separate set containing all the reference answers along with their rated standards for researchers who are interested in them.
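
As a rough illustration of the label conversion mentioned above, the sketch below collapses the 5-way labels into 3-way and 2-way labels. The mapping is an assumption based on the usual SemEval-2013 Task 7 convention (everything other than `correct` and `contradictory` becomes `incorrect` in the 3-way scheme), and the helper names `to_3way` and `to_2way` are hypothetical, not part of the dataset.

```python
# Assumed mapping (SemEval-2013 Task 7 convention); verify against the dataset card before use.
TO_3WAY = {
    'correct': 'correct',
    'contradictory': 'contradictory',
    'partially_correct_incomplete': 'incorrect',
    'irrelevant': 'incorrect',
    'non_domain': 'incorrect',
}


def to_3way(label: str) -> str:
    """Collapse a 5-way label into correct / contradictory / incorrect."""
    return TO_3WAY[label]


def to_2way(label: str) -> str:
    """Collapse a 5-way label into correct / incorrect."""
    return 'correct' if label == 'correct' else 'incorrect'


five_way = ['correct', 'contradictory', 'partially_correct_incomplete', 'irrelevant', 'non_domain']
print([to_3way(label) for label in five_way])
print([to_2way(label) for label in five_way])
```

Keeping only the 5-way labels in the dataset loses nothing, since the coarser schemes can be derived from them on demand.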

# Install Packages

In [None]:
# For parsing XML files
%pip install beautifulsoup4 lxml

# For creating Hugging Face dataset
%pip install datasets

# Load Data

The dataset is provided in XML format, distributed across 103 files, with each file corresponding to a single question in the set. Each file contains a question, reference answers, and student answers along with their associated labels. Our goal is to parse the information from these XML files and organize it into a unified structure.
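
To make the file layout concrete, here is a minimal sketch that parses a made-up, heavily trimmed file. The element and attribute names match the ones this notebook reads, but the wrapper elements `referenceAnswers` and `studentAnswers` and all text values are assumptions for illustration only. The sketch uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files contain more attributes and many more answers.
SAMPLE = """
<question id="VOLTAGE_DEFINE_Q" module="FaultFinding">
  <questionText>Placeholder question text</questionText>
  <referenceAnswers>
    <referenceAnswer category="GOOD">Placeholder good answer</referenceAnswer>
    <referenceAnswer category="BEST">Placeholder best answer 1</referenceAnswer>
    <referenceAnswer category="BEST">Placeholder best answer 2</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer id="FaultFinding-VOLTAGE_DEFINE_Q.sbj1-l1.qa1" accuracy="correct">Placeholder student answer</studentAnswer>
  </studentAnswers>
</question>
"""

root = ET.fromstring(SAMPLE.strip())
question = root.findtext('questionText')

# Pick the first reference answer rated "BEST", in document order
best_answer = next(
    el.text for el in root.iter('referenceAnswer') if el.get('category') == 'BEST'
)

# One flat record per student answer
samples = [
    {
        'id': el.get('id'),
        'question': question,
        'reference_answer': best_answer,
        'student_answer': el.text,
        'label': el.get('accuracy'),
    }
    for el in root.iter('studentAnswer')
]
print(samples)
```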

To accomplish this, we will use the Beautiful Soup library to parse the XML files. In addition to extracting the primary information, we need to collect and store supplementary metadata (e.g., module names and IDs) to trace each piece of information back to its source if needed in the future. Each question is identified by a unique ID generated from keywords in the question, such as `BULB_C_VOLTAGE_EXPLAIN_WHY1`. Answer IDs (e.g., `FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.sbj3-l1.qa193`) follow a specific structure, incorporating the module name, question ID, subject ID, subject level, and answer number in sequence. We will abbreviate module names, index question IDs, and reformat both question and answer IDs to create shorter and uniform identifiers.
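
The structure of the answer IDs can be sketched with a regular expression that uses named groups. This is an illustrative, self-contained variant rather than the notebook's own function: the names `ANSWER_ID`, `split_answer_id`, `MODULES`, and `QUESTIONS` are hypothetical, and the lookup tables are trimmed to a single entry.

```python
import re

# <module>-<question id>.<subject id>-<subject level>.qa<answer number>
ANSWER_ID = re.compile(
    r'(?P<module>[A-Za-z]+)-(?P<question>[^.]+)\.(?P<subject>[^-]+)-(?P<level>[^.]+)\.qa(?P<number>\d+)'
)

# Trimmed lookup tables, for illustration only
MODULES = {'FaultFinding': 'FF'}
QUESTIONS = {'FF': ['BULB_C_VOLTAGE_EXPLAIN_WHY1']}


def split_answer_id(answer_id: str) -> dict:
    """Split an original answer id into its named segments."""
    match = ANSWER_ID.fullmatch(answer_id)
    if match is None:
        raise ValueError(f'Unexpected answer id: {answer_id!r}')
    return match.groupdict()


parts = split_answer_id('FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.sbj3-l1.qa193')
module = MODULES[parts['module']]
question_index = QUESTIONS[module].index(parts['question']) + 1  # 1-based

short_id = '.'.join(
    [module, f'Q{question_index}', parts['subject'], parts['level'], 'SA' + parts['number']]
).upper()
print(short_id)  # FF.Q1.SBJ3.L1.SA193
```

Raising a `ValueError` on a non-matching ID makes malformed input easier to diagnose than an attribute error on `None`.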

In [1]:
import os
import re
from bs4 import BeautifulSoup

In [2]:
# Abbreviate the module names
# md stands for metadata
md_modules = {
    'FaultFinding': 'FF',
    'SwitchesBulbsParallel': 'PC', # Parallel Circuit
    'SwitchesBulbsSeries': 'SC' # Series Circuit
}

# Index question ids from each module in alphabetical order
# We will reference them by their index (1-based)
md_question_ids = {
    'FF': [
        'BULB_C_VOLTAGE_EXPLAIN_WHY1', 'BULB_C_VOLTAGE_EXPLAIN_WHY2', 'BULB_C_VOLTAGE_EXPLAIN_WHY4', 'BULB_C_VOLTAGE_EXPLAIN_WHY6', 'BULB_ONLY_EXPLAIN_WHY2', 'BULB_ONLY_EXPLAIN_WHY4', 'BULB_ONLY_EXPLAIN_WHY6', 'BURNED_BULB_LOCATE_EXPLAIN_Q', 'DESCRIBE_GAP_LOCATE_PROCEDURE_Q', 'OTHER_TERMINAL_STATE_EXPLAIN_Q', 'TERMINAL_STATE_EXPLAIN_Q', 'VOLTAGE_AND_GAP_DISCUSS_Q', 'VOLTAGE_DEFINE_Q', 'VOLTAGE_DIFF_DISCUSS_1_Q', 'VOLTAGE_DIFF_DISCUSS_2_Q', 'VOLTAGE_ELECTRICAL_STATE_DISCUSS_Q', 'VOLTAGE_GAP_EXPLAIN_WHY1', 'VOLTAGE_GAP_EXPLAIN_WHY2', 'VOLTAGE_GAP_EXPLAIN_WHY3', 'VOLTAGE_GAP_EXPLAIN_WHY4', 'VOLTAGE_GAP_EXPLAIN_WHY5', 'VOLTAGE_GAP_EXPLAIN_WHY6', 'VOLTAGE_INCOMPLETE_CIRCUIT_2_Q'
        ],
    'PC': [
        'BURNED_BULB_PARALLEL_EXPLAIN_Q1', 'BURNED_BULB_PARALLEL_EXPLAIN_Q2', 'BURNED_BULB_PARALLEL_EXPLAIN_Q3', 'BURNED_BULB_PARALLEL_WHY_Q', 'GIVE_CIRCUIT_TYPE_HYBRID_EXPLAIN_Q2', 'GIVE_CIRCUIT_TYPE_HYBRID_EXPLAIN_Q3', 'GIVE_CIRCUIT_TYPE_PARALLEL_EXPLAIN_Q2', 'HYBRID_BURNED_OUT_EXPLAIN_Q1', 'HYBRID_BURNED_OUT_EXPLAIN_Q2', 'HYBRID_BURNED_OUT_EXPLAIN_Q3', 'HYBRID_BURNED_OUT_WHY_Q1', 'HYBRID_BURNED_OUT_WHY_Q2', 'HYBRID_BURNED_OUT_WHY_Q3', 'OPT1_EXPLAIN_Q2', 'OPT2_EXPLAIN_Q', 'PARALLEL_SWITCH_EXPLAIN_Q1', 'PARALLEL_SWITCH_EXPLAIN_Q2', 'PARALLEL_SWITCH_EXPLAIN_Q3', 'SWITCH_TABLE_EXPLAIN_Q1', 'SWITCH_TABLE_EXPLAIN_Q2', 'SWITCH_TABLE_EXPLAIN_Q3'
        ],
    'SC': [
        'BURNED_BULB_SERIES_Q2', 'CLOSED_PATH_EXPLAIN', 'CONDITIONS_FOR_BULB_TO_LIGHT', 'DAMAGED_BUILD_EXPLAIN_Q', 'DAMAGED_BULB_EXPLAIN_2_Q', 'DAMAGED_BULB_SWITCH_Q', 'GIVE_CIRCUIT_TYPE_SERIES_EXPLAIN_Q', 'SHORT_CIRCUIT_EXPLAIN_Q_2', 'SHORT_CIRCUIT_EXPLAIN_Q_4', 'SHORT_CIRCUIT_EXPLAIN_Q_5', 'SHORT_CIRCUIT_X_Q', 'SWITCH_OPEN_EXPLAIN_Q'
        ]
}

In [3]:
def format_id(answer_id: str) -> str:
    '''
    Reformats answer ids.
    
    Args:
        answer_id (str): Original answer id.
    
    Returns:
        str: Reformatted answer id.
    '''
    
    # Split the answer id into segments
    segments = list(re.fullmatch(r'([a-z]+)-([^\.]+)\.([^-]+)-([^\.]+)\.q(a\d+)', answer_id, flags=re.IGNORECASE).groups())
    # Replace the module name
    segments[0] = md_modules[segments[0]]
    # Replace the question id with its index (1-based)
    segments[1] = 'q' + str(md_question_ids[segments[0]].index(segments[1]) + 1)
    # Change "qa" to "sa" to align with the term "student answer"
    segments[-1] = 's' + segments[-1]
    # Combine all modified segments into a single string separated by periods and convert it to uppercase
    return '.'.join(segments).upper()

In [4]:
# Dictionary to store the data, with one key for each set and a dedicated key to list all the reference answers
data = {
    'train': [],
    'test_ua': [],
    'test_uq': [],
    'all_reference_answers': []
}

# File location of each set
data_map = {
    'train': 'Raw/train/5-way/',
    'test_ua': 'Raw/test/5-way/test-unseen-answers/',
    'test_uq': 'Raw/test/5-way/test-unseen-questions/'
}

In [5]:
# Parse the files and load the data
for set_name in data_map:
    # Traverse through the files in each set
    for file in os.scandir(data_map[set_name]):
        # Parse XML files
        if file.is_file() and file.name.endswith('.xml'):
            # Use a separate name for the file handle to avoid shadowing the directory entry
            with open(file.path, 'r') as f:
                xml = BeautifulSoup(f, 'xml')
            root = xml.find('question')
            
            # Extract and convert the module name
            module = md_modules[root.get('module')]
            # Extract and convert the question id
            question_id = md_question_ids[module].index(root.get('id')) + 1
            # Extract the question
            question = xml.find('questionText').text
            
            # Extract and store all the reference answers, except for the "ua" set, whose questions are a subset of those in train
            if set_name == 'test_ua':
                # Extract the first best reference answer to be included in samples
                reference_answer = xml.find('referenceAnswer', attrs={'category': 'BEST'}).text
            else:
                # Variable to store the first best reference answer, to be included in samples later.
                # NOTE: There can be more than one best reference answer for the same question.
                reference_answer = None
                # Start a new counter to generate ids for the reference answers in the order they are listed
                ra_id = 1
                # Iterate through the reference answer elements
                for el in xml.find_all('referenceAnswer'):
                    # Extract the standard/quality of the answer
                    standard = el.get('category').lower()
                    # Skip if it is not an answer but rather a keyword
                    if standard == 'keyword': continue
                    # Check and store the first best reference answer
                    if not reference_answer and standard == 'best':
                        reference_answer = el.text
                    # Store the reference answer as a sample in its dedicated set
                    data['all_reference_answers'].append({
                        'id': f'{module}.Q{question_id}.RA{ra_id}',
                        'question': question,
                        'reference_answer': el.text,
                        'standard': standard
                    })
                    # Increment the counter
                    ra_id += 1
            
            # Extract the student answers and store each as a sample
            for el in xml.find_all('studentAnswer'):
                data[set_name].append({
                    'id': format_id(el.get('id')),
                    'question': question,
                    'reference_answer': reference_answer,
                    'student_answer': el.text,
                    'label': el.get('accuracy')
                })
    
    # Sort the samples by their ids
    data[set_name].sort(key=lambda e: e['id'])

### Export Data

In [6]:
import json
import pickle

# For preview
with open('Beetle.json', 'w') as file:
    json.dump(data, file, indent=4)

# For scripts
with open('Beetle.pkl', 'wb') as file:
    pickle.dump(data, file)

# Create Dataset for Hugging Face

Before we started building the dataset, we created a new dataset in our Hugging Face account and cloned the repository locally to a directory named `HuggingFace`.

### Import Data

In [7]:
import pickle

with open('Beetle.pkl', 'rb') as file:
    data = pickle.load(file)

### Prepare Features

In [8]:
from datasets import ClassLabel
from datasets import Features
from datasets import Value

In [9]:
# Define the internal structure of the dataset.
# NOTE: The class labels are not in alphabetical order since it is
# important to preserve their conceptual relationship and direction.
features_default = Features({
    'id': Value('string'),
    'question': Value('string'),
    'reference_answer': Value('string'),
    'student_answer': Value('string'),
    'label': ClassLabel(names=['correct', 'contradictory', 'partially_correct_incomplete', 'irrelevant', 'non_domain'])
})

# We created a dedicated set for all the reference answers.
# This set has a different internal structure and needs to be defined independently.
# ara stands for "all reference answers"
features_ara = Features({
    'id': Value('string'),
    'question': Value('string'),
    'reference_answer': Value('string'),
    'standard': ClassLabel(names=['minimal', 'good', 'best'])
})
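
The order of the names passed to `ClassLabel` matters because each label is stored as the integer position of its name in that list, which is also what the generated README metadata reflects (`'0': correct`, `'1': contradictory`, and so on). The sketch below mimics that encoding in plain Python, without the `datasets` library; the variable names are hypothetical.

```python
# Same order as in features_default
LABEL_NAMES = ['correct', 'contradictory', 'partially_correct_incomplete', 'irrelevant', 'non_domain']

# Name -> integer id, assigned by position in the list
name_to_id = {name: i for i, name in enumerate(LABEL_NAMES)}

raw_labels = ['correct', 'irrelevant', 'contradictory']
encoded = [name_to_id[name] for name in raw_labels]
decoded = [LABEL_NAMES[i] for i in encoded]

print(encoded)  # [0, 3, 1]
print(decoded)  # ['correct', 'irrelevant', 'contradictory']
```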

### Transform Data Into Datasets

Please note that the term "dataset" (not the variable) in this section refers to a single set/split and not the entire data.

In [10]:
import os
from datasets import Dataset

In [11]:
# Dictionary to store the datasets
dataset = {}

# Ensure the directory exists to export the datasets.
# All data files should be stored in the "data" subdirectory, following standard practice.
os.makedirs('HuggingFace/data/', exist_ok=True)

# Iterate through each set
for set_name in data:
    # Transform the set into a dataset
    dataset[set_name] = Dataset.from_list(data[set_name], features=features_ara if set_name == 'all_reference_answers' else features_default)
    # Export the dataset in Parquet format
    dataset[set_name].to_parquet(f'HuggingFace/data/{set_name.replace("_", "-")}-00001.parquet')

Creating parquet from Arrow format: 100%|██████████| 4/4 [00:00<00:00, 571.92ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 500.87ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 333.78ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 500.93ba/s]


In [12]:
# Overview of the datasets
print(dataset)

{'train': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 3941
}), 'test_ua': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 439
}), 'test_uq': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 819
}), 'all_reference_answers': Dataset({
    features: ['id', 'question', 'reference_answer', 'standard'],
    num_rows: 242
})}


# Generate Metadata For Readme

In [13]:
# Begin metadata section
print('---')

# Print dataset information section
print('dataset_info:')

# Print dataset features
print('  features:')
for name, value in features_default.items():
    print('  - name:', name)
    if name == 'label':
        print(' '*3, 'dtype:')
        print(' '*5, 'class_label:')
        print(' '*7, 'names:')
        for i, label_name in enumerate(value.names):
            print(' '*9, f"'{i}': {label_name}")
    else:
        print(' '*3, 'dtype:', value.dtype)

# Print dataset splits metadata and calculate dataset size
print('  splits:')
dataset_size = 0
for set_name in dataset:
    # Skip "all_reference_answers" since it is not part of the default subset
    if set_name == 'all_reference_answers':
        continue
    
    print('  - name:', set_name)
    print(' '*3, 'num_examples:', len(dataset[set_name]))
    num_bytes = os.stat(f'HuggingFace/data/{set_name.replace("_", "-")}-00001.parquet').st_size
    print(' '*3, 'num_bytes:', num_bytes)
    dataset_size += num_bytes

# Print dataset size
print('  dataset_size:', dataset_size)

# Print data file configurations
print('configs:')
# Define config for the default subset
print('- config_name: default')
print('  data_files:')
for set_name in dataset:
    # Skip "all_reference_answers" since we will put it in a separate subset
    if set_name == 'all_reference_answers':
        continue
    
    print('  - split:', set_name)
    print(' '*3, 'path:', f'data/{set_name.replace("_", "-")}-*')

# Define a separate subset/config for the all_reference_answers set
print('''
- config_name: all_reference_answers
  data_files:
  - split: all_reference_answers
    path: data/all-reference-answers-*
'''.strip())

# End metadata section
print('---')

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: reference_answer
    dtype: string
  - name: student_answer
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
  splits:
  - name: train
    num_examples: 3941
    num_bytes: 120274
  - name: test_ua
    num_examples: 439
    num_bytes: 21208
  - name: test_uq
    num_examples: 819
    num_bytes: 27339
  dataset_size: 168821
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
- config_name: all_reference_answers
  data_files:
  - split: all_reference_answers
    path: data/all-reference-answers-*
---


# Upload Dataset to Hugging Face

1. Copy the generated metadata from this notebook into the README.md file.
2. Use the Metadata UI on the Hugging Face website to populate the remaining metadata (e.g., dataset name, license, and task categories), then copy the generated text into the README.md file.
3. Populate the README.md file with information about the dataset, including instructions, label distribution, citation, references, and more.
4. Commit and push the changes using Git.

### Dataset URL: [https://huggingface.co/datasets/nkazi/Beetle](https://huggingface.co/datasets/nkazi/Beetle)
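
Once the repository is pushed, the two configs defined in the metadata should be loadable with `load_dataset` from the `datasets` library. The following is a minimal sketch, assuming the repository is public and network access is available; `load_beetle` is a hypothetical convenience wrapper, not something shipped with the dataset.

```python
REPO_ID = 'nkazi/Beetle'
CONFIGS = ('default', 'all_reference_answers')


def load_beetle(config: str = 'default'):
    """Load one config of the dataset from the Hugging Face Hub."""
    if config not in CONFIGS:
        raise ValueError(f'Unknown config {config!r}; expected one of {CONFIGS}')
    # Imported lazily so that this module can be imported without the datasets package
    from datasets import load_dataset
    return load_dataset(REPO_ID, name=config)


# Example usage (downloads the data on first call):
# beetle = load_beetle()
# print(beetle['train'].features['label'].names)
# reference_answers = load_beetle('all_reference_answers')
```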