# Preference Data

This notebook filters the preference data down to multiple-choice questions (MCQs) and outlines how to extract their answer keys.

## Setup

---

Let's import the necessary dependencies and enable autoreload.

In [1]:
import autorootcwd

In [2]:
%reload_ext autoreload
%autoreload 2

In [34]:
# Modules
import re
import pandas as pd
from pprint import pprint

from src.data.datasets.preference import PreferenceData

## Preference

In [113]:
train = PreferenceData(split="train", filtering_strategy="none")
val = PreferenceData(split="val", filtering_strategy="none")

print(f"Train: {len(train)}, Val: {len(val)}")

Train: 21390, Val: 5348


### Find MCQ Question

First, we need to find the MCQs in the preference data. Although the `question` and `question_options` keys have been merged into `question_complete` (called `prompt` in `PreferenceData`), we can easily identify MCQs because the available options are always formatted in the same way:

* The question is followed by the string `Options:\n`
* It then lists the options `A. [Option A]\n`, `B. [Option B]`, ...
* Most MCQs have four answer options, but some are True/False questions

We can write a regex pattern to retain only the MCQs.

In [114]:
def is_mcq(question: str) -> bool:
    """
    Helper utility to check if a question is a multiple choice question.
    In the preference dataset, multiple choice questions are formatted as follows:

    ```
    Options:
    A. Option 1
    B. Option 2
    ...
    ```

    Args:
        question (str): The question to check.

    Returns:
        bool: True if the question is a multiple choice question, False otherwise.
    """
    pattern = r'Options:\n(?:[A-D]\..+\n)+'
    return bool(re.search(pattern, question))
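
As a quick sanity check, the pattern can be exercised on a few toy prompts. The prompts below are made up for illustration and are not taken from the dataset, and `matches_mcq_pattern` is just a local copy of the check above so that the snippet runs on its own. One thing it makes visible: every counted option line must end in a newline, so a prompt whose only option has no trailing newline is not detected.

```python
import re

# Same pattern as in is_mcq above, copied so this snippet is self-contained.
MCQ_PATTERN = r'Options:\n(?:[A-D]\..+\n)+'


def matches_mcq_pattern(question: str) -> bool:
    """Return True if the prompt contains an 'Options:' block with lettered lines."""
    return bool(re.search(MCQ_PATTERN, question))


# Toy prompts (assumptions for illustration, not real dataset rows).
four_options = "What is 2 + 2?\nOptions:\nA. 3\nB. 4\nC. 5\nD. 6\n"
true_false = "The sky is blue.\nOptions:\nA. True\nB. False"
open_ended = "Explain why the sky is blue."
single_unterminated = "Pick one.\nOptions:\nA. only option"

for prompt in (four_options, true_false, open_ended, single_unterminated):
    print(repr(prompt[:20]), matches_mcq_pattern(prompt))
```

The True/False prompt still matches without a final newline because the `A.` line alone satisfies the pattern.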

In [115]:
train_mcq = train.filter(is_mcq, input_columns="prompt")
val_mcq = val.filter(is_mcq, input_columns="prompt")

print(f"Train: {len(train_mcq)} ({100*(len(train_mcq)/ len(train)):.1f}%), Val: {len(val_mcq)} ({100*(len(val_mcq)/ len(val)):.1f}%)")

Train: 11005 (51.4%), Val: 2778 (51.9%)
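Since the options always follow the `Options:\n` marker, the same formatting can be used to split a prompt into its question stem and a letter-to-option mapping, which is handy later when mapping an answer back to a letter. The following is a minimal sketch under that assumption; `split_mcq` is a hypothetical helper and not part of `PreferenceData`, and the example prompt is made up.

```python
import re
from typing import Dict, Tuple

# One option per line: a letter A-D, a dot, then the option text.
OPTION_LINE = re.compile(r"^([A-D])\.[ \t]*(.+)$", re.MULTILINE)


def split_mcq(prompt: str) -> Tuple[str, Dict[str, str]]:
    """Split a prompt into (question stem, {letter: option text}).

    Hypothetical helper: assumes the 'Options:\\n' marker described above.
    If the marker is missing, the mapping is empty.
    """
    stem, _, options_block = prompt.partition("Options:\n")
    options = {
        letter: text.strip()
        for letter, text in OPTION_LINE.findall(options_block)
    }
    return stem.strip(), options


# Made-up prompt for illustration.
example = "What is 2 + 2?\nOptions:\nA. 3\nB. 4\nC. 5\nD. 6\n"
stem, options = split_mcq(example)
print(stem)
print(options)
```
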


### Find Answer Key

Next, we need to extract the answer key from the preferred answer (`chosen`) and the non-preferred answer (`rejected`) of each multiple-choice question. For SFT fine-tuning we will only use the chosen response, but we might also consider DPO training on the MCQ-formatted dataset.

I noticed the following in the answers:
* Sometimes, the model answers with digits (1-4), or strings like `Option 1` instead of letters.
* Sometimes, the model only gives the answer in natural language without referring to the question options
* ...

Given this, we will likely have to fall back to an LLM to extract the selected answer from each response in the preference pair and map it back to the corresponding letter (A-D) among the question's options.
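
Before reaching for an LLM, the explicit cases noted above (letters, digits 1-4, and strings like `Option 1`) could be handled by a rule-based first pass, leaving only the remaining responses for the LLM. The sketch below is one possible heuristic, not a method used in this notebook: `extract_letter` and its patterns are assumptions, the example responses are made up, and the rules can misfire (for example on a response that starts with a number such as `1.5`).

```python
import re
from typing import Optional

# Map digit answers onto letters, assuming 1 -> A, ..., 4 -> D.
DIGIT_TO_LETTER = {"1": "A", "2": "B", "3": "C", "4": "D"}

# Hypothetical pattern, anchored at the start of the response:
#   "Option 1", "Option B is correct"   -> first alternative
#   "B", "B.", "2)", "C: because ..."   -> second alternative
EXPLICIT_ANSWER = re.compile(
    r"(?:Option\s+([A-D1-4])\b|([A-D1-4])(?:[.):]|$))"
)


def extract_letter(response: str) -> Optional[str]:
    """Return the answer letter if the response names it explicitly, else None."""
    match = EXPLICIT_ANSWER.match(response.strip())
    if match is None:
        return None  # natural-language answer: leave for the LLM fallback
    key = match.group(1) or match.group(2)
    return DIGIT_TO_LETTER.get(key, key)


# Made-up responses for illustration.
for response in ["B. Paris", "Option 1", "3) because ...", "The capital is Paris."]:
    print(repr(response), "->", extract_letter(response))
```

Responses for which `extract_letter` returns `None` are exactly the ones that would still need the LLM-based extraction.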

In [None]:
# TODO: Use LLM to extract the correct answer for MCQs.