# Create Tiny Smoke Dataset

This notebook creates a **tiny subset** of the main dataset (by default 8 training and 2 validation samples, each at most 1,500 characters long) and a matching data config YAML.

- Input dataset: `../dataset/train.json`
- Output dataset: `../dataset_tiny/{train.json, validation.json}`
- Output config: `../config/data/resume_tiny.yaml`

Run this notebook once, then point `01_orchestrate_training.ipynb` at `resume_tiny.yaml`
whenever you want to exercise the full orchestration quickly on a tiny dataset.


In [7]:
from pathlib import Path
import json
import yaml

RAW_DATA_DIR = Path("../dataset")
RAW_TRAIN = RAW_DATA_DIR / "train.json"

TINY_DATA_DIR = Path("../dataset_tiny")
TINY_TRAIN = TINY_DATA_DIR / "train.json"
TINY_VAL = TINY_DATA_DIR / "validation.json"

# How many samples to keep for the tiny smoke dataset
TINY_TRAIN_SAMPLES = 8
TINY_VAL_SAMPLES = 2

print("Raw train path:", RAW_TRAIN.resolve())
print("Tiny dataset directory:", TINY_DATA_DIR.resolve())


Raw train path: /workspaces/resume-ner-azureml/dataset/train.json
Tiny dataset directory: /workspaces/resume-ner-azureml/dataset_tiny


In [8]:
# Build tiny train/validation JSON files
if not RAW_TRAIN.exists():
    raise FileNotFoundError(f"Raw train.json not found at {RAW_TRAIN}")

with RAW_TRAIN.open("r", encoding="utf-8") as f:
    full_train = json.load(f)

if not isinstance(full_train, list) or not full_train:
    raise ValueError("Expected train.json to be a non-empty list of samples")

TINY_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Keep only samples with non-empty, reasonably short text so tokenization/training stays fast.
# Samples that are not dicts, or whose "text" is missing or empty, are skipped.
MAX_CHARS = 1500
short_samples = [
    sample
    for sample in full_train
    if isinstance(sample, dict)
    and isinstance(sample.get("text"), str)
    and 0 < len(sample["text"]) <= MAX_CHARS
]

if len(short_samples) < TINY_TRAIN_SAMPLES + TINY_VAL_SAMPLES:
    raise ValueError(
        f"Not enough short samples (<= {MAX_CHARS} chars). "
        f"Found {len(short_samples)}, need at least {TINY_TRAIN_SAMPLES + TINY_VAL_SAMPLES}."
    )

train_slice = short_samples[:TINY_TRAIN_SAMPLES]
val_slice = short_samples[TINY_TRAIN_SAMPLES:TINY_TRAIN_SAMPLES + TINY_VAL_SAMPLES]

with TINY_TRAIN.open("w", encoding="utf-8") as f:
    json.dump(train_slice, f, ensure_ascii=False, indent=2)
with TINY_VAL.open("w", encoding="utf-8") as f:
    json.dump(val_slice, f, ensure_ascii=False, indent=2)

print(f"Wrote tiny train ({len(train_slice)} samples, max {MAX_CHARS} chars) to {TINY_TRAIN}")
print(f"Wrote tiny validation ({len(val_slice)} samples, max {MAX_CHARS} chars) to {TINY_VAL}")


Wrote tiny train (8 samples, max 1500 chars) to ../dataset_tiny/train.json
Wrote tiny validation (2 samples, max 1500 chars) to ../dataset_tiny/validation.json
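
The cell above always takes the first short samples in file order. If `train.json` happens to be ordered (for example by source or by length), a seeded shuffle before slicing may give a more varied subset while staying reproducible. The following is a minimal sketch, not part of the original notebook; `make_tiny_split` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import random


def make_tiny_split(samples, n_train, n_val, max_chars=1500, seed=None):
    """Return (train, val) lists of short samples with no overlap.

    Hypothetical helper: keeps only dict samples whose "text" is a
    non-empty string of at most max_chars characters, optionally
    shuffles them with a fixed seed, then slices train and validation.
    With seed=None the original file order is preserved.
    """
    short = [
        s for s in samples
        if isinstance(s, dict)
        and isinstance(s.get("text"), str)
        and 0 < len(s["text"]) <= max_chars
    ]
    needed = n_train + n_val
    if len(short) < needed:
        raise ValueError(f"Found {len(short)} short samples, need at least {needed}.")
    if seed is not None:
        # A local Random instance leaves the global RNG state untouched.
        random.Random(seed).shuffle(short)
    return short[:n_train], short[n_train:needed]


# Usage with made-up in-memory samples (not real dataset records)
demo = [{"text": "x" * n} for n in (10, 20, 3000, 30, 40, 50)]
train, val = make_tiny_split(demo, n_train=3, n_val=2, max_chars=1500, seed=0)
print(len(train), len(val))
```

In the notebook this could be called as `make_tiny_split(full_train, TINY_TRAIN_SAMPLES, TINY_VAL_SAMPLES, MAX_CHARS, seed=...)`. Because train and validation are disjoint slices of one list, no sample can land in both files.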


In [9]:
# Create a tiny data config YAML by copying resume_v1.yaml
BASE_CONFIG_PATH = Path("../config/data/resume_v1.yaml")
TINY_CONFIG_PATH = Path("../config/data/resume_tiny.yaml")

if not BASE_CONFIG_PATH.exists():
    raise FileNotFoundError(f"Base data config not found: {BASE_CONFIG_PATH}")

with BASE_CONFIG_PATH.open("r", encoding="utf-8") as f:
    base_cfg = yaml.safe_load(f)

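# Override name/version/description for the tiny dataset.
# Everything else (schema, entity types, stats) is copied unchanged from resume_v1.yaml
# and is not recomputed for the tiny subset.
# Bump the version whenever the tiny dataset generation logic changes materially.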
base_cfg["name"] = "resume-ner-data-tiny-short"
base_cfg["version"] = "v2"
base_cfg["description"] = "Tiny smoke-test subset of Resume NER dataset (short-text version for fast orchestration tests)"

with TINY_CONFIG_PATH.open("w", encoding="utf-8") as f:
    yaml.safe_dump(base_cfg, f, sort_keys=False)

print("Wrote tiny data config to:", TINY_CONFIG_PATH.resolve())
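# Print the YAML on its own line; passing it as a second print() argument
# would prefix the first key with a stray space.
print("Config contents:")
print(yaml.safe_dump(base_cfg, sort_keys=False))


Wrote tiny data config to: /workspaces/resume-ner-azureml/config/data/resume_tiny.yaml
Config contents:
name: resume-ner-data-tiny-short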
version: v2
description: Tiny smoke-test subset of Resume NER dataset (short-text version for
  fast orchestration tests)
schema:
  format: json
  annotation_format: character_spans
  entity_types:
  - SKILL
  - EDUCATION
  - DESIGNATION
  - EXPERIENCE
  - NAME
  - EMAIL
  - PHONE
  - LOCATION
  stats:
    median_sentence_length: 19
    mean_sentence_length: 20
    p95_sentence_length: 40
    suggested_sequence_length: 40
    entity_density: 0.35
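
Since the tiny config is a copy of `resume_v1.yaml` with three fields overridden, a quick structural check can catch a base config that has drifted. The sketch below works on the parsed dictionary (the kind of object `yaml.safe_load` returns). The helper name `check_data_config` and the choice of required keys are assumptions inferred only from the output printed above, not a schema the project defines.

```python
def check_data_config(cfg):
    """Return a list of problems found in a parsed data config dict.

    Hypothetical helper; the required keys are inferred from the printed
    config above, not from a documented schema.
    """
    problems = []
    for key in ("name", "version", "description", "schema"):
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")
    schema = cfg.get("schema")
    if isinstance(schema, dict):
        entity_types = schema.get("entity_types")
        if not isinstance(entity_types, list) or not entity_types:
            problems.append("schema.entity_types must be a non-empty list")
        elif len(set(entity_types)) != len(entity_types):
            problems.append("schema.entity_types contains duplicates")
    elif "schema" in cfg:
        problems.append("schema must be a mapping")
    return problems


# Usage with a dict shaped like the config printed above
example_cfg = {
    "name": "resume-ner-data-tiny-short",
    "version": "v2",
    "description": "Tiny smoke-test subset of Resume NER dataset",
    "schema": {
        "format": "json",
        "annotation_format": "character_spans",
        "entity_types": ["SKILL", "EDUCATION", "DESIGNATION", "EXPERIENCE",
                         "NAME", "EMAIL", "PHONE", "LOCATION"],
    },
}
print(check_data_config(example_cfg))  # prints []
```

In the notebook, calling `check_data_config(base_cfg)` right before writing `resume_tiny.yaml` would report any problems without stopping the run; raising on a non-empty result is an equally reasonable choice.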

