### 📊 Dataset Preparation Summary

To make the large Amazon Polarity dataset manageable for semi-supervised learning experiments, we created a balanced, downsized subset of **4,000 samples** (2,000 positive, 2,000 negative). This subset is split as follows:

- **Validation Set**: 800 samples (400 positive, 400 negative)
- **Labeled Training Set**: 400 samples (200 positive, 200 negative)
- **Unlabeled Training Set**: 2,800 samples (1,400 positive, 1,400 negative)

Each subset maintains a **50/50 class balance**, so neither class dominates training and validation accuracy is not skewed by class imbalance.

The splits are saved in `.parquet` format to the `data/` directory at the project root for compact storage and fast I/O during development.

This setup enables easy experimentation with semi-supervised learning approaches, where only a small fraction of the data is labeled.
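
The per-class slicing that produces these splits can be sketched in plain Python. This is only an illustration under stated assumptions: `split_per_class` is a hypothetical helper, integer ids stand in for the real reviews, and the boundaries (400 for validation, the next 200 for the labeled pool, the rest unlabeled) mirror the slices used in the code cell that follows.

```python
def split_per_class(items, n_val=400, n_labeled=200):
    """Slice one class's examples into validation, labeled, and unlabeled parts."""
    val = items[:n_val]
    labeled = items[n_val:n_val + n_labeled]
    unlabeled = items[n_val + n_labeled:]
    return val, labeled, unlabeled

# Stand-in data: 2,000 integer ids per class instead of real reviews.
positives = list(range(2000))
negatives = list(range(2000, 4000))

pos_val, pos_lab, pos_unl = split_per_class(positives)
neg_val, neg_lab, neg_unl = split_per_class(negatives)

split_sizes = {
    "validation": len(pos_val) + len(neg_val),
    "labeled": len(pos_lab) + len(neg_lab),
    "unlabeled": len(pos_unl) + len(neg_unl),
}
print(split_sizes)  # {'validation': 800, 'labeled': 400, 'unlabeled': 2800}
```

Because the three slices of each class do not overlap, no example appears in more than one split.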

In [1]:
from datasets import load_dataset, Dataset, DatasetDict
from collections import Counter

# Load and shuffle dataset
dataset = load_dataset("fancyzhx/amazon_polarity", split="train").shuffle(seed=42)

# Separate by label
positives = [x for x in dataset if x['label'] == 1]
negatives = [x for x in dataset if x['label'] == 0]

# Take only 2000 positive and 2000 negative (total 4000)
positives = positives[:2000]
negatives = negatives[:2000]

# 400 pos + 400 neg for validation
val_pos = positives[:400]
val_neg = negatives[:400]

# 200 pos + 200 neg for labeled (total 400)
labeled_pos = positives[400:600]
labeled_neg = negatives[400:600]

# 1400 pos + 1400 neg for unlabeled (rest)
unlabeled_pos = positives[600:2000]
unlabeled_neg = negatives[600:2000]

# Convert to HF Datasets
validation = Dataset.from_list(val_pos + val_neg).shuffle(seed=42)
labeled = Dataset.from_list(labeled_pos + labeled_neg).shuffle(seed=42)
unlabeled = Dataset.from_list(unlabeled_pos + unlabeled_neg).shuffle(seed=42)

# Wrap into DatasetDict
final_dataset = DatasetDict({
    "validation": validation,
    "labeled": labeled,
    "unlabeled": unlabeled
})

# Check counts
print("Validation:", Counter(final_dataset["validation"]["label"]))
print("Labeled:", Counter(final_dataset["labeled"]["label"]))
print("Unlabeled:", Counter(final_dataset["unlabeled"]["label"]))
total = len(validation) + len(labeled) + len(unlabeled)
print("Total samples:", total)

Validation: Counter({1: 400, 0: 400})
Labeled: Counter({0: 200, 1: 200})
Unlabeled: Counter({1: 1400, 0: 1400})
Total samples: 4000
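
Note that the unlabeled split built above still carries its `label` column. One possible way to keep those labels out of a semi-supervised training loop is to strip them when the rows are handed to the model. The sketch below assumes rows are plain dicts with the `label`, `title`, and `content` fields of Amazon Polarity; `hide_labels` is a hypothetical helper and the two example rows are made-up placeholders, not dataset content.

```python
def hide_labels(rows, label_key="label"):
    """Return copies of the rows without the label field."""
    return [{k: v for k, v in row.items() if k != label_key} for row in rows]

# Made-up placeholder rows with the same fields as the dataset.
rows = [
    {"label": 1, "title": "Great", "content": "Works as described."},
    {"label": 0, "title": "Poor", "content": "Stopped working after a week."},
]

unlabeled_view = hide_labels(rows)
print(sorted(unlabeled_view[0].keys()))  # ['content', 'title']
```

The original rows are left untouched, so the true labels remain available for later analysis, for example to measure pseudo-label accuracy.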


Save the splits

In [2]:
import os
import rootutils

# Set up root utils
root = rootutils.setup_root(search_from=".", indicator=".git")

DATA_DIR = root / "data"

# Create the data directory under the project root (not the current working directory)
os.makedirs(DATA_DIR, exist_ok=True)

# save each split to data
final_dataset["validation"].to_parquet(DATA_DIR / "validation.parquet")
final_dataset["labeled"].to_parquet(DATA_DIR / "labeled.parquet")
final_dataset["unlabeled"].to_parquet(DATA_DIR / "unlabeled.parquet")

1239076

In [3]:
import pandas as pd

# read the 400-example labeled pool
full = pd.read_parquet(DATA_DIR / "labeled.parquet")

# shuffle once so that each smaller subset is a prefix of every larger one (nested subsets)
full = full.sample(frac=1, random_state=42).reset_index(drop=True)

# sizes
sizes = [25, 50, 100, 150, 200, 250, 300, 350, 400]

# for each size, take top-n examples and write out
for n in sizes:
    subset = full.iloc[:n]
    subset.to_parquet(DATA_DIR / f"train_{n}.parquet", index=False)
    print(f"Wrote train_{n}.parquet with {len(subset)} examples")


Wrote train_25.parquet with 25 examples
Wrote train_50.parquet with 50 examples
Wrote train_100.parquet with 100 examples
Wrote train_150.parquet with 150 examples
Wrote train_200.parquet with 200 examples
Wrote train_250.parquet with 250 examples
Wrote train_300.parquet with 300 examples
Wrote train_350.parquet with 350 examples
Wrote train_400.parquet with 400 examples
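
Since the pool is shuffled once and each subset is a prefix of it, the smaller subsets are not guaranteed to keep the exact 50/50 balance of the full labeled pool. If that matters for an experiment, one option is to interleave the two classes before taking prefixes. The sketch below is an assumption-laden illustration rather than the notebook's method: `balanced_nested_order` is a hypothetical helper that works on plain dicts, and the toy rows stand in for the labeled pool.

```python
import random

def balanced_nested_order(rows, label_key="label", seed=42):
    """Reorder rows so that every even-length prefix has equal class counts.

    Assumes binary labels (0/1). If one class is larger, its surplus rows
    are dropped because zip() stops at the shorter list.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    ordered = []
    for p, n in zip(pos, neg):
        ordered.extend([p, n])  # alternate positive, negative
    return ordered

# Toy stand-in for the labeled pool: 200 positives and 200 negatives.
rows = [{"id": i, "label": i % 2} for i in range(400)]
ordered = balanced_nested_order(rows)

sizes = [25, 50, 100, 150, 200, 250, 300, 350, 400]
for n in sizes:
    subset = ordered[:n]
    n_pos = sum(r["label"] for r in subset)
    print(n, "examples:", n_pos, "positive,", n - n_pos, "negative")
```

With this ordering the subsets stay nested, and an odd size such as 25 differs from perfect balance by a single example.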
