# 🙌 Data


## Setup

---

Let's import the necessary dependencies and set global variables.

In [None]:
import autorootcwd

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
# Modules
from transformers import AutoTokenizer

In [None]:
# Globals
TOKENIZER = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

## HF Dataset Loading

---

In [None]:
# Load from HF Hub
from datasets import load_dataset
data = load_dataset("lhoestq/demo1", split="train")

data

## BaseDataset

---

In [None]:
# Load `DummyData`
from src.data import DummyData

dummy = DummyData(split="train")
dummy

## Preference Data

---

In [None]:
# Load `PreferenceData`
from src.data import PreferenceData

preference = PreferenceData(split="train", filtering_strategy="none")
preference

For the `PreferenceData` class, the `filtering_strategy` parameter can be set to one of the following values:

* `none`: No filtering is applied.
* `keep_first`: Only the first preference pair for each question is kept.
* `global_threshold`: Only the preference pairs whose score is greater than or equal to (or less than or equal to) a global threshold are kept. *Requires: `mode` and `threshold` parameters.*
* `local_tolerance`: Only the preference pairs whose score is greater than or equal to the maximum minus a tolerance (or less than or equal to the minimum plus a tolerance) are kept. *Requires: `mode` and `tolerance` parameters.*
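
The strategies above can be sketched in plain Python on toy records. This is only an illustration, not the code in `src.data`: the record fields (`question_id`, `score`), the helper names, the `"most"` mode value, and the reading of `"least"` as "at least" are all assumptions, as is computing the maximum and minimum per question for `local_tolerance`.

```python
from collections import defaultdict

# Toy preference pairs; the field names are hypothetical.
pairs = [
    {"question_id": "q1", "score": 5},
    {"question_id": "q1", "score": 3},
    {"question_id": "q2", "score": 4},
    {"question_id": "q2", "score": 2},
]


def keep_first(pairs):
    """Keep only the first pair seen for each question."""
    seen, kept = set(), []
    for pair in pairs:
        if pair["question_id"] not in seen:
            seen.add(pair["question_id"])
            kept.append(pair)
    return kept


def global_threshold(pairs, mode, threshold):
    """Keep pairs scoring at least (or at most) one global threshold."""
    if mode == "least":  # assumed meaning: score >= threshold
        return [p for p in pairs if p["score"] >= threshold]
    if mode == "most":  # assumed counterpart: score <= threshold
        return [p for p in pairs if p["score"] <= threshold]
    raise ValueError(f"Unknown mode: {mode!r}")


def local_tolerance(pairs, mode, tolerance):
    """Keep pairs within a tolerance of the per-question max (or min)."""
    scores = defaultdict(list)
    for pair in pairs:
        scores[pair["question_id"]].append(pair["score"])

    kept = []
    for pair in pairs:
        group = scores[pair["question_id"]]
        if mode == "least":
            keep = pair["score"] >= max(group) - tolerance
        elif mode == "most":
            keep = pair["score"] <= min(group) + tolerance
        else:
            raise ValueError(f"Unknown mode: {mode!r}")
        if keep:
            kept.append(pair)
    return kept
```

Under these assumptions, a tolerance of `0` with `mode="least"` keeps only the top-scoring pair(s) of each question, while a larger tolerance admits pairs that score slightly below the best one.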

In [None]:
PreferenceData(split="train", filtering_strategy="none")

In [None]:
PreferenceData(split="train", filtering_strategy="keep_first")

In [None]:
PreferenceData(split="train", filtering_strategy="global_threshold", mode="least", threshold=4)

In [None]:
PreferenceData(split="train", filtering_strategy="local_tolerance", mode="least", tolerance=0)

## Open-Answer Data

In [None]:
# Load `OrcaMathData`
from src.data import OrcaMathData

orcamath = OrcaMathData(filter_on_length=True)
orcamath

In [None]:
from src.data import TuluData
from src.data import TuluDatasetIDs

tulu = TuluData(sub_datasets=[TuluDatasetIDs.SCIENCE_EVIDENCE_INFERENCE], filter_out_english=False)
tulu

## MCQ Data

### SciQ

The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

**Size**: Train: 11.7K, Val: 1K, Test: 1K
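
To make the record layout concrete, here is a minimal sketch of how one SciQ-style record could be turned into a multiple-choice prompt. The field names (`question`, `correct_answer`, `distractor1` to `distractor3`, `support`) follow the SciQ schema on the HF Hub; the toy record, the helper `format_mcq`, and the way the explanation is appended are assumptions for illustration and do not necessarily match what `SciQData` does.

```python
import random

# Toy record shaped like a SciQ example (content invented for illustration).
record = {
    "question": "Which gas do plants absorb during photosynthesis?",
    "correct_answer": "carbon dioxide",
    "distractor1": "oxygen",
    "distractor2": "nitrogen",
    "distractor3": "hydrogen",
    "support": "Plants take in carbon dioxide and release oxygen.",
}


def format_mcq(record, include_explanation=True, seed=0):
    """Return a (prompt, target) pair with shuffled answer options."""
    options = [
        record["correct_answer"],
        record["distractor1"],
        record["distractor2"],
        record["distractor3"],
    ]
    # Shuffle with a fixed seed so the correct answer is not always "A".
    random.Random(seed).shuffle(options)

    letters = "ABCD"
    lines = [f"Question: {record['question']}"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    prompt = "\n".join(lines)

    # The target is the letter of the correct option, optionally followed
    # by the supporting paragraph when one is available.
    target = letters[options.index(record["correct_answer"])]
    if include_explanation and record.get("support"):
        target = f"{target}. Explanation: {record['support']}"
    return prompt, target


prompt, target = format_mcq(record)
print(prompt)
print(target)
```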

In [None]:
# Load `SciQData`
from src.data import SciQData

sciq = SciQData(tokenizer=TOKENIZER, include_explanation=True)
sciq

### AI2 Arc

The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2. These are text-only, English language exam questions that span several grade levels as indicated in the files. Each question has a multiple choice structure (typically 4 answer options). The questions are sorted into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.

**Size**:
* Arc-Easy: Train: 2.25K, Validation: 570, Test: 2.38K
* Arc-Challenge: Train: 1.12K, Validation: 299, Test: 1.17K

In [None]:
from src.data import ArcEasyData

arc_easy = ArcEasyData(tokenizer=TOKENIZER)
arc_easy

In [None]:
from src.data import ArcChallengeData

arc_challenge = ArcChallengeData(tokenizer=TOKENIZER)
arc_challenge

### OpenBookQA

OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject.

**Size**:
* Main: Train: 4.96K, Validation: 500, Test: 500
* Additional: Train: 4.96K, Validation: 500, Test: 500

In [None]:
from src.data import OpenBookQAMainData

openbookqa_main = OpenBookQAMainData(tokenizer=TOKENIZER)
openbookqa_main

In [None]:
from src.data import OpenBookQAAdditionalData

openbookqa_additional = OpenBookQAAdditionalData(tokenizer=TOKENIZER)
openbookqa_additional

### GPQA

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4–based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions—for example, when developing new scientific knowledge—we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

**Size**:

* Diamond: Train: 198
* Extended: Train: 546
* Main: Train: 448

> 🚨 Because the dataset only contains a single split, we cannot train on it.

In [None]:
# SKIP

### MMLU

This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.

We choose the STEM subjects.

> 🚨 Because the dataset only contains validation and test (plus dev) splits, we cannot train on it.

In [None]:
# SKIP

### Custom Data

In [None]:
# Load `HardCodedData`
from src.data import HardCodedData

hardcoded = HardCodedData(duplicate=1)

## Data Recipe

In [None]:
from src.data import DataRecipe

datasets = [sciq, arc_easy, arc_challenge, openbookqa_main, openbookqa_additional, hardcoded]
ratio = [1.0] * len(datasets)
recipe = DataRecipe(
    datasets=datasets,
    ratio=ratio
)
recipe
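
One plausible reading of `ratio` is a per-dataset sampling fraction, where `1.0` keeps every example. The sketch below illustrates that idea on plain lists. It is an assumption, not the `DataRecipe` implementation: the helper `mix_datasets`, its `seed` parameter, and the restriction of ratios to the range 0 to 1 are hypothetical.

```python
import random


def mix_datasets(datasets, ratio, seed=42):
    """Subsample each dataset by its ratio, then shuffle the union."""
    if len(datasets) != len(ratio):
        raise ValueError("Need exactly one ratio per dataset.")

    rng = random.Random(seed)
    mixed = []
    for examples, fraction in zip(datasets, ratio):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("This sketch only supports ratios in [0, 1].")
        examples = list(examples)
        # Number of examples to keep from this dataset.
        n_keep = int(len(examples) * fraction)
        mixed.extend(rng.sample(examples, n_keep))

    # Shuffle so examples from different datasets are interleaved.
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for real datasets.
numbers = list(range(10))
letters = ["a", "b", "c", "d"]
mixture = mix_datasets([numbers, letters], ratio=[0.5, 1.0])
print(len(mixture))
```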