# Summary

The SciEntsBank dataset was released in XML format. This notebook outlines the steps taken to convert it into a Hugging Face dataset.

# Install Packages

In [None]:
# For parsing XML files
%pip install beautifulsoup4 lxml

# For creating Hugging Face dataset
%pip install datasets

# Load Data

The dataset is provided in XML format, distributed across 331 files, with each file corresponding to a single question in the set. Each file contains a question, a reference answer, and student answers along with their associated labels. Our goal is to parse the information from these XML files and organize it into a unified structure.

To accomplish this, we will use the Beautiful Soup library to parse the XML files. In addition to the primary information, we will also extract and store answer identifiers, allowing us to trace each answer back to its source if needed in the future.
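
To make the file layout concrete, the following is a minimal sketch that parses a made-up question file. It uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup so that it runs without extra packages. The sample text, the IDs, and the exact nesting of the elements are assumptions made for illustration; only the tag and attribute names (`questionText`, `referenceAnswer`, `studentAnswer`, `id`, `accuracy`) are taken from the parsing code in this notebook. `parse_question` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file: the text, IDs, and nesting are invented for
# illustration; only the tag and attribute names mirror this notebook.
SAMPLE_XML = """
<question id="Q_1">
  <questionText>Why does the bulb light up?</questionText>
  <referenceAnswers>
    <referenceAnswer id="Q_1-a1">The circuit is closed.</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer id="Q_1.S1" accuracy="correct">The path is complete.</studentAnswer>
    <studentAnswer id="Q_1.S2" accuracy="irrelevant">Because it is a bulb.</studentAnswer>
  </studentAnswers>
</question>
"""


def parse_question(xml_text):
    """Return one sample dict per student answer found in a question file."""
    root = ET.fromstring(xml_text.strip())
    question = root.findtext('questionText')
    # Take the first reference answer anywhere below the root element
    reference_answer = root.find('.//referenceAnswer').text
    return [
        {
            'id': el.get('id'),
            'question': question,
            'reference_answer': reference_answer,
            'student_answer': el.text,
            'label': el.get('accuracy'),
        }
        for el in root.iter('studentAnswer')
    ]


samples = parse_question(SAMPLE_XML)
```

Each student answer becomes its own sample, so the question and reference answer are repeated across the samples that come from the same file.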

In [1]:
import os
from bs4 import BeautifulSoup

In [2]:
# Dictionary to store the data, with a key for each set
data = {
    'train': [],
    'test_ua': [],
    'test_uq': [],
    'test_ud': []
}

# File location of each set
data_map = {
    'train': 'Raw/train/5-way/',
    'test_ua': 'Raw/test/5-way/test-unseen-answers/',
    'test_uq': 'Raw/test/5-way/test-unseen-questions/',
    'test_ud': 'Raw/test/5-way/test-unseen-domains/'
}

In [3]:
# Parse the files and load the data
for set_name in data_map:
    # Traverse through the files in each set
    for entry in os.scandir(data_map[set_name]):
        # Parse XML files
        if entry.is_file() and entry.name.endswith('.xml'):
            # Use distinct names for the directory entry and the file handle
            # so that entry.path stays available for the warning below
            with open(entry.path, 'r') as file:
                xml = BeautifulSoup(file, 'xml')

            # Extract question
            question = xml.find('questionText').text

            # Extract reference answer
            reference_answer = xml.find_all('referenceAnswer')
            # Check whether multiple reference answers exist
            if len(reference_answer) > 1:
                print('[ WARNING ]  Found more than one reference answer in', entry.path)
            reference_answer = reference_answer[0].text
            
            # Extract student answers and store each as a sample
            for el in xml.find_all('studentAnswer'):
                data[set_name].append({
                    'id': el.get('id'),
                    'question': question,
                    'reference_answer': reference_answer,
                    'student_answer': el.text,
                    'label': el.get('accuracy')
                })
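
After parsing, a quick sanity check on the labels can catch surprises before the data is exported. The sketch below counts the labels per set and flags any label that is not one of the five class names used later in this notebook. `summarize_labels` is a hypothetical helper, and `toy_data` is invented sample data standing in for the real `data` dictionary.

```python
from collections import Counter

# The five class names used for the ClassLabel feature in this notebook
EXPECTED_LABELS = {
    'correct',
    'contradictory',
    'partially_correct_incomplete',
    'irrelevant',
    'non_domain',
}


def summarize_labels(data):
    """Count labels per set and collect labels outside the expected names."""
    summary = {}
    for set_name, samples in data.items():
        counts = Counter(sample['label'] for sample in samples)
        summary[set_name] = {
            'total': len(samples),
            'counts': dict(counts),
            'unexpected': sorted(set(counts) - EXPECTED_LABELS),
        }
    return summary


# Invented toy data with the same shape as the parsed samples
toy_data = {
    'train': [{'label': 'correct'}, {'label': 'correct'}, {'label': 'irrelevant'}],
    'test_ua': [{'label': 'non_domain'}, {'label': 'typo_label'}],
}

summary = summarize_labels(toy_data)
```

With the real data, `summarize_labels(data)` would be called instead; an empty `unexpected` list for every set suggests that the labels match the class names.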

### Export Data

In [4]:
import json
import pickle

# For preview
with open('SciEntsBank.json', 'w') as file:
    json.dump(data, file, indent=4)

# For scripts
with open('SciEntsBank.pkl', 'wb') as file:
    pickle.dump(data, file)

# Create Dataset for Hugging Face

Before we started building the dataset, we created a new dataset in our Hugging Face account and cloned the repository locally to a directory named `HuggingFace`.

### Import Data

In [5]:
import pickle

with open('SciEntsBank.pkl', 'rb') as file:
    data = pickle.load(file)

### Prepare Features

In [6]:
from datasets import ClassLabel
from datasets import Features
from datasets import Value

In [7]:
# Define the internal structure of the dataset.
# NOTE: The class labels are not in alphabetical order since it is
# important to preserve their conceptual relationship and direction.
features = Features({
    'id': Value('string'),
    'question': Value('string'),
    'reference_answer': Value('string'),
    'student_answer': Value('string'),
    'label': ClassLabel(names=['correct', 'contradictory', 'partially_correct_incomplete', 'irrelevant', 'non_domain'])
})
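
The order of `names` matters because each label's integer ID is its position in that list, which is also how the IDs appear in the generated metadata further down ('0' for `correct` through '4' for `non_domain`). The sketch below mimics that mapping in plain Python so it can be inspected without the `datasets` library; it is a stand-in for illustration, not the library's implementation, and `encode_labels` is a hypothetical helper. In `datasets` itself, `ClassLabel.str2int` and `ClassLabel.int2str` perform this lookup.

```python
# Label names in the same order as the ClassLabel definition in this notebook
LABEL_NAMES = [
    'correct',
    'contradictory',
    'partially_correct_incomplete',
    'irrelevant',
    'non_domain',
]

# Plain-Python stand-in for the name <-> ID lookup (not the library's code)
label_to_id = {name: index for index, name in enumerate(LABEL_NAMES)}
id_to_label = dict(enumerate(LABEL_NAMES))


def encode_labels(labels):
    """Map label strings to integer IDs, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(LABEL_NAMES))
    if unknown:
        raise ValueError(f'Unknown labels: {unknown}')
    return [label_to_id[label] for label in labels]


encoded = encode_labels(['correct', 'non_domain', 'contradictory'])
```

Reordering `LABEL_NAMES` would change every ID, so the list should stay fixed once the dataset is published.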

### Transform Data Into Datasets

Please note that the term "dataset" (not the variable) in this section refers to a single set/split and not the entire data.

In [8]:
import os
from datasets import Dataset

In [9]:
# Dictionary to store the datasets
dataset = {}

# Ensure the directory exists before exporting the datasets.
# All data files should be stored in the "data" subdirectory, following standard practice.
os.makedirs('HuggingFace/data/', exist_ok=True)

# Iterate through each set
for set_name in data:
    # Transform the set into a dataset
    dataset[set_name] = Dataset.from_list(data[set_name], features=features)
    # Export the dataset in Parquet format
    dataset[set_name].to_parquet(f'HuggingFace/data/{set_name.replace("_", "-")}-00001.parquet')

Creating parquet from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 1250.09ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 749.26ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 999.83ba/s]
Creating parquet from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 2499.88ba/s]
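
The split names use underscores (`test_ua`), while the exported file names use hyphens (`test-ua-00001.parquet`), and the dataset card later refers to these files through patterns such as `data/test-ua-*`. The sketch below checks that the two naming schemes stay consistent. `parquet_filename` and `card_pattern` are hypothetical helpers that repeat the string formatting used in this notebook, and `fnmatch` is only an approximation of how the Hugging Face Hub resolves such patterns.

```python
from fnmatch import fnmatch

SET_NAMES = ['train', 'test_ua', 'test_uq', 'test_ud']


def parquet_filename(set_name, shard=1):
    """Build the file name for a split, e.g. test_ua -> test-ua-00001.parquet."""
    return f'{set_name.replace("_", "-")}-{shard:05d}.parquet'


def card_pattern(set_name):
    """Build the data file pattern that the dataset card uses for a split."""
    return f'data/{set_name.replace("_", "-")}-*'


# Map each split to the other splits whose pattern would also match its file
collisions = {}
for set_name in SET_NAMES:
    path = f'data/{parquet_filename(set_name)}'
    collisions[set_name] = [
        other for other in SET_NAMES
        if other != set_name and fnmatch(path, card_pattern(other))
    ]
```

An empty list for every split indicates that each pattern picks up only its own file.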


In [10]:
# Overview of the datasets
print(dataset)

{'train': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 4969
}), 'test_ua': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 540
}), 'test_uq': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 733
}), 'test_ud': Dataset({
    features: ['id', 'question', 'reference_answer', 'student_answer', 'label'],
    num_rows: 4562
})}


# Generate Dataset Card For Readme

In [11]:
import re
import yaml

In [12]:
# Begin metadata section
print('---')

# Print dataset information section
print('dataset_info:')

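# Print dataset features
# NOTE: _to_yaml_list() is a private method of Features, so it may change
# between versions of the datasets library.
print('  features:')
print(re.sub(r'^', '  ', yaml.safe_dump(features._to_yaml_list()), flags=re.MULTILINE))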

# Print dataset splits metadata and calculate dataset size
print('  splits:')
dataset_size = 0
for set_name in dataset:
    print('  - name:', set_name)
    print(' '*3, 'num_examples:', len(dataset[set_name]))
    num_bytes = os.stat(f'HuggingFace/data/{set_name.replace("_", "-")}-00001.parquet').st_size
    print(' '*3, 'num_bytes:', num_bytes)
    dataset_size += num_bytes

# Print dataset size
print('  dataset_size:', dataset_size)

# Print data file configurations
print('configs:')
# Define config for the default subset
print('- config_name: default')
print('  data_files:')
for set_name in dataset:
    print('  - split:', set_name)
    print(' '*3, 'path:', f'data/{set_name.replace("_", "-")}-*')

# End metadata section
print('---')

---
dataset_info:
  features:
  - dtype: string
    name: id
  - dtype: string
    name: question
  - dtype: string
    name: reference_answer
  - dtype: string
    name: student_answer
  - dtype:
      class_label:
        names:
          '0': correct
          '1': contradictory
          '2': partially_correct_incomplete
          '3': irrelevant
          '4': non_domain
    name: label
  splits:
  - name: train
    num_examples: 4969
    num_bytes: 232655
  - name: test_ua
    num_examples: 540
    num_bytes: 52730
  - name: test_uq
    num_examples: 733
    num_bytes: 35716
  - name: test_ud
    num_examples: 4562
    num_bytes: 177307
  dataset_size: 498408
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_ua
    path: data/test-ua-*
  - split: test_uq
    path: data/test-uq-*
  - split: test_ud
    path: data/test-ud-*
---


# Upload Dataset to Hugging Face

1. Copy the generated metadata from this notebook into the README.md file.
2. Use the Metadata UI on the Hugging Face website to populate the remaining metadata (e.g., dataset name, license, task categories, etc.), then copy the generated text into the README.md file.
3. Populate the README.md file with information about the dataset, including instructions, label distribution, citation, references, and more.
4. Commit and push the changes using Git.
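
As an alternative to copying the metadata by hand in step 1, the printed output can be captured and written to a file. The sketch below is one possible approach under stated assumptions: `capture_output` and `write_readme` are hypothetical helpers, `print_metadata` is a shortened stand-in for the metadata cell above, and the README body passed in is a placeholder. It does not cover the additional metadata produced by the Metadata UI in step 2.

```python
import io
from contextlib import redirect_stdout
from pathlib import Path


def capture_output(func):
    """Run func() and return everything it printed as a single string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def print_metadata():
    """Shortened stand-in for the cell that prints the dataset card metadata."""
    print('---')
    print('configs:')
    print('- config_name: default')
    print('---')


def write_readme(path, front_matter, body=''):
    """Write the metadata block followed by the README body text."""
    Path(path).write_text(front_matter + body, encoding='utf-8')


front_matter = capture_output(print_metadata)
```

A call such as `write_readme('HuggingFace/README.md', front_matter, '\n# SciEntsBank\n')` would then create the file in the cloned repository; the path follows the directory name used earlier in this notebook.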

### Dataset URL: [https://huggingface.co/datasets/nkazi/SciEntsBank](https://huggingface.co/datasets/nkazi/SciEntsBank)