In [1]:
import datasets

ds = datasets.load_dataset('joyheyueya/250428_abstract_pair', split='train')
ds

Dataset({
    features: ['joint_prompt', 'paper1_prompt', 'paper2_prompt', 'no_context_prompt', 'abstracts', 'forum_id_1', 'forum_id_2', 'pair_id'],
    num_rows: 99997
})

In [2]:
ds[0]

{'joint_prompt': 'Your task is to identify and elaborate on an insight that only becomes apparent by combining information from both documents together—i.e., an insight that has high relevance when treating the documents jointly but low relevance if you were to consider each document alone. Write the insight as a standalone statement that does not reference the original texts or use terms like "paper" or "document". The insight should involve indirect connections or intermediate reasoning steps, such as linking cause and effect through a shared variable or mechanism. Here are some examples: Let A = magnesium — either dietary or internal to the body. Implicitly refers, inmost cases, to quantity of magnesium. Let M = migraine headache. Let L = one or more intermediate physiological links.  -> means \'can or might influence\'. == means \'equivalent in action\' or \'equivalent in mechanism\' depending on whether it connects two drugs or two diseases. Paper 1:\nStress and Type A behavior ar

In [3]:
import os
from typing import Any, Dict, Optional


def make_map_fn(split: str):
    """Create a mapping function to process dataset examples.

    Args:
        split: Dataset split name ('train' or 'test')

    Returns:
        Function that processes individual dataset examples
    """
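    def process_fn(example: Dict[str, Any], idx: int) -> Optional[Dict[str, Any]]:
        # Remove 'joint_prompt' from the row; its text becomes the user message below.
        question = example.pop('joint_prompt')
        # Placeholder: the dataset has no reference-answer column.
        answer = ''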

        data = {
            "data_source": "",
            "prompt": [{
                "role": "user",
                "content": question
            }],
            "ability": "insight",
            "reward_model": {
                "style": "rule",
                "ground_truth": answer
            },
            "extra_info": {
                'paper1_prompt': example['paper1_prompt'],
                'paper2_prompt': example['paper2_prompt'],
                'no_context_prompt': example['no_context_prompt'],
                'abstracts': example['abstracts'],
                'forum_id_1': example['forum_id_1'],
                'forum_id_2': example['forum_id_2'],
                'pair_id': example['pair_id'],
                'split': split,
                'index': idx
            }
        }
        return data
    return process_fn

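# Note: both calls map the same `ds`, so the 'test' output contains the same rows
# as 'train'; only extra_info['split'] differs between them.
ds_train = ds.map(function=make_map_fn('train'), with_indices=True, num_proc=os.cpu_count())
ds_test = ds.map(function=make_map_fn('test'), with_indices=True, num_proc=os.cpu_count())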


ds_train

Dataset({
    features: ['paper1_prompt', 'paper2_prompt', 'no_context_prompt', 'abstracts', 'forum_id_1', 'forum_id_2', 'pair_id', 'data_source', 'prompt', 'ability', 'reward_model', 'extra_info'],
    num_rows: 99997
})
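To make the shape of a mapped record easier to see, here is a minimal sketch that applies the same transformation to a plain Python dict, without the `datasets` library. The helper name `to_rl_record` and every value in `toy` are hypothetical and only illustrate the structure; the column names are the ones listed in the dataset output above. Unlike this sketch, `Dataset.map` merges the returned keys into the existing columns, which is why the original columns other than `joint_prompt` still appear in `ds_train`.

```python
from typing import Any, Dict

# Hypothetical toy row: real column names, made-up values.
toy = {
    "joint_prompt": "JOINT",
    "paper1_prompt": "P1",
    "paper2_prompt": "P2",
    "no_context_prompt": "NC",
    "abstracts": ["abstract one", "abstract two"],
    "forum_id_1": "f1",
    "forum_id_2": "f2",
    "pair_id": "f1_f2",
}

# Columns copied into extra_info, in the same order as in the notebook cell.
EXTRA_KEYS = (
    "paper1_prompt", "paper2_prompt", "no_context_prompt",
    "abstracts", "forum_id_1", "forum_id_2", "pair_id",
)


def to_rl_record(example: Dict[str, Any], idx: int, split: str) -> Dict[str, Any]:
    """Build the record for one row (illustrative stand-in for process_fn)."""
    example = dict(example)  # shallow copy so the caller's dict keeps 'joint_prompt'
    question = example.pop("joint_prompt")
    extra_info = {key: example[key] for key in EXTRA_KEYS}
    extra_info.update({"split": split, "index": idx})
    return {
        "data_source": "",
        "prompt": [{"role": "user", "content": question}],
        "ability": "insight",
        "reward_model": {"style": "rule", "ground_truth": ""},  # empty placeholder
        "extra_info": extra_info,
    }


record = to_rl_record(toy, idx=0, split="train")
print(sorted(record))
print(record["prompt"])
```

The copy at the top of `to_rl_record` is a design choice of this sketch only: it keeps the toy input reusable, whereas the notebook's `process_fn` pops the key from the row it is given.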

In [4]:
ds_train['extra_info'][0]

{'abstracts': ['Machine learning models can capture and amplify biases present in data, leading to disparate test performance across social groups. To better understand, evaluate, and mitigate these biases, a deeper theoretical understanding of how model design choices and data distribution properties contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models feedforward neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we observe that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be differences in test error between groups that are not alleviated with increased parameterization. Importantly, o

In [5]:
ds_train.to_parquet('/home/anikait.singh/rl_behaviors_verl_stable/data_insights_rl/train.parquet')
ds_test.to_parquet('/home/anikait.singh/rl_behaviors_verl_stable/data_insights_rl/test.parquet')

Creating parquet from Arrow format:   0%|          | 0/100 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/100 [00:00<?, ?ba/s]

1764711852
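Because `train.parquet` and `test.parquet` are both built from the same rows, any evaluation on the test file is run on training data. If a disjoint hold-out set is wanted, one option is a deterministic split keyed on `pair_id`. The sketch below is an assumption about how that could be done, not part of the original pipeline: the function name `assign_split`, the 5% fraction, and the example ids are all hypothetical.

```python
import hashlib


def assign_split(pair_id, test_fraction: float = 0.05) -> str:
    """Map a pair_id to 'train' or 'test' using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(str(pair_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Made-up ids, only to show that the assignment is stable and roughly proportional.
pair_ids = [f"pair-{i}" for i in range(1000)]
splits = {pid: assign_split(pid) for pid in pair_ids}
n_test = sum(1 for s in splits.values() if s == "test")
print(n_test, len(pair_ids) - n_test)
```

With the `datasets` library, the same predicate could be passed to `ds.filter(lambda ex: assign_split(ex['pair_id']) == 'test')` before mapping; `Dataset.train_test_split` is a built-in alternative when a random split is acceptable.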