# Finetuning Dataset

The finetuning dataset has a relatively small fraction (~10%) of positive examples. This leads to some training batches that contain only negative examples, and to a loss function that prioritizes getting the negative examples right. As a simple strategy, we duplicate positive entries in the training set until the classes are roughly balanced.
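
To make the arithmetic concrete, here is a minimal sketch of the same idea on plain Python lists, without the `datasets` library. The helper `oversample_positives` and the toy rows are hypothetical and only illustrate the strategy; string labels `"True"`/`"False"` are assumed to match the training file.

```python
import random


def oversample_positives(rows, label_key="label", positive="True", seed=42):
    """Duplicate positive rows until the two classes are roughly balanced."""
    positives = [r for r in rows if r[label_key] == positive]
    negatives = [r for r in rows if r[label_key] != positive]
    if not positives:
        raise ValueError("no positive examples to duplicate")
    # Number of copies of the positive set that lands closest to the negative count.
    n_copies = max(1, round(len(negatives) / len(positives)))
    balanced = positives * n_copies + negatives
    # Shuffle so duplicated positives are spread across the training batches.
    random.Random(seed).shuffle(balanced)
    return balanced


# Toy data (hypothetical): 10 positive and 90 negative rows, i.e. ~10% positives.
toy = [
    {"text": f"example {i}", "label": "True" if i % 10 == 0 else "False"}
    for i in range(100)
]
balanced = oversample_positives(toy)
n_true = sum(r["label"] == "True" for r in balanced)
n_false = len(balanced) - n_true
print(f"{n_true} TRUE and {n_false} FALSE examples after oversampling")
```

With a 1:9 class ratio, each positive row appears nine times, so both classes end up with the same number of rows.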

In [None]:
import sys

from datasets import load_dataset, concatenate_datasets

# Add the parent folder to the import path so that `classifier` can be imported.
sys.path.append("..")
from classifier.paths import data_folder

dataset = load_dataset("json", data_files={
  "train": str(data_folder / "finetuning" / "train.jsonl"),
})

In [None]:
true_examples = dataset["train"].filter(lambda x: x["label"] == "True")
false_examples = dataset["train"].filter(lambda x: x["label"] == "False")

print(f"The training dataset has {len(true_examples)} TRUE and {len(false_examples)} FALSE examples")

In [None]:
# Duplicate the true examples until both classes have roughly equal frequency.
# The copies are not shuffled here; the combined dataset is shuffled in the next cell.
n_copies = max(1, round(len(false_examples) / len(true_examples)))
augmented = concatenate_datasets([true_examples] * n_copies)

In [None]:
augmented = concatenate_datasets([augmented, false_examples]).shuffle(seed=42)
augmented.to_json(data_folder / "finetuning" / "augmented_train.jsonl")
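
As a sanity check, the class balance of the written file can be verified by reading it back. The sketch below assumes the output is in JSON Lines format (one JSON object per line) with a `label` field; `count_labels` is a hypothetical helper, and the demo runs on a throwaway file rather than on `augmented_train.jsonl`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_labels(jsonl_path, label_key="label"):
    """Count how often each label value occurs in a JSON Lines file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line)[label_key]] += 1
    return counts


# Demo on a temporary file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    rows = [
        {"text": "a", "label": "True"},
        {"text": "b", "label": "False"},
        {"text": "c", "label": "True"},
    ]
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    demo_counts = count_labels(demo_path)

print(demo_counts)
```

Pointing `count_labels` at `data_folder / "finetuning" / "augmented_train.jsonl"` should then report similar counts for both labels.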