##### 00 - Project Overview & Reproducibility (Entry Notebook)

This is the entry notebook for the repo.

What it does:
1) Forces the working directory to the repo root (so imports work).
2) Validates repo structure (`src/`, `src/configs/`).
3) Ensures `src/configs/project.yaml` exists (single source of truth).
4) Creates local folders (`data/*`, `runs/`) safely.
5) Sets reproducibility controls (fixed seed + deterministic mode).
6) Creates a RUN_ID and writes `runs/<RUN_ID>/meta.json`.
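
Step 5 could look roughly like the sketch below. The helper name `set_seed` is hypothetical, and only the standard library is seeded here; if the project uses NumPy or PyTorch, those libraries need their own seeding calls, which are mentioned in comments but not executed.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Seed Python-level randomness (hypothetical helper, stdlib only)."""
    # PYTHONHASHSEED only affects interpreters started after this point;
    # the hash randomization of the current process is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # Assumption: third-party libraries are seeded separately, e.g.
    # numpy.random.seed(seed) or torch.manual_seed(seed), if they are used.


# Re-seeding with the same value reproduces the same random sequence.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print("reproducible:", first == second)
```

The seed value 42 matches the default in the project config; in practice it would be read from the `repro.seed` key rather than hard-coded.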

Non-negotiable rules:
- Only real datasets and real results produced by this repo.
- No toy/synthetic/example data.
- Every run must save metadata to `runs/<RUN_ID>/meta.json`.
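
One possible way to satisfy the metadata rule is sketched below. The function names `make_run_id` and `write_meta`, the RUN_ID format, and the exact fields stored in `meta.json` are assumptions for illustration, not the notebook's actual implementation.

```python
import json
import platform
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path


def make_run_id() -> str:
    """Build a RUN_ID from a UTC timestamp plus a short random suffix (assumed format)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{uuid.uuid4().hex[:6]}"


def write_meta(runs_dir: Path, run_id: str, seed: int, deterministic: bool) -> Path:
    """Create runs/<RUN_ID>/ and write meta.json into it; return the file path."""
    run_dir = runs_dir / run_id
    # exist_ok=False: refuse to silently overwrite an existing run folder.
    run_dir.mkdir(parents=True, exist_ok=False)
    meta = {
        "run_id": run_id,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "deterministic": deterministic,
    }
    meta_path = run_dir / "meta.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path


# Example call (not executed here, to avoid creating folders as a side effect):
# write_meta(Path("runs"), make_run_id(), seed=42, deterministic=True)
```

Refusing to reuse an existing run folder is a design choice: it keeps each `meta.json` tied to exactly one run.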


Imports (single place for shared imports)

In [None]:
# [CELL 00-01] Imports (keep shared imports here; avoid repeating)
import os
import sys
import json
from pathlib import Path

import yaml

Bootstrap: force repo root + sys.path

In [30]:
# [CELL 00-02] Bootstrap: locate repo root reliably (Windows-safe)

import os
import sys
from pathlib import Path

CWD = Path.cwd().resolve()
print("Initial CWD:", CWD)

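def find_repo_root(start: Path) -> Path:
    """Return the nearest ancestor of `start` (inclusive) that holds a repo marker."""
    # README.md is a weak marker: it can also match a parent workspace folder.
    markers = (".git", "PROJECT_STATE.md", "README.md")
    # Search upward only; a repo located below `start` is never found this way.
    for p in [start, *start.parents]:
        if any((p / m).exists() for m in markers):
            return p
    # Fallback: machine-specific absolute path (edit for your environment)
    return Path(r"D:\00_DS-ML-Workspace\mooc-coldstart-session-meta").resolve()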

REPO_ROOT = find_repo_root(CWD)

os.chdir(REPO_ROOT)

repo_str = str(REPO_ROOT)
if repo_str not in sys.path:
    sys.path.insert(0, repo_str)

print("REPO_ROOT:", REPO_ROOT)
print("CWD now:", Path.cwd())
print("tree checks:",
      "src=", (REPO_ROOT / "src").exists(),
      "notebooks=", (REPO_ROOT / "notebooks").exists(),
      "PROJECT_STATE.md=", (REPO_ROOT / "PROJECT_STATE.md").exists())


Initial CWD: D:\00_DS-ML-Workspace
REPO_ROOT: D:\00_DS-ML-Workspace
CWD now: D:\00_DS-ML-Workspace
tree checks: src= True notebooks= False PROJECT_STATE.md= False
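
The recorded output resolves `REPO_ROOT` to `D:\00_DS-ML-Workspace` while `notebooks=` and `PROJECT_STATE.md=` are both False, which suggests that a weak marker (`.git` or `README.md`) matched at the workspace folder rather than at the repo itself. A stricter variant, sketched below under the assumption that the repo root always contains both `src/` and `PROJECT_STATE.md`, requires every marker to be present and raises instead of falling back to a hard-coded path. The name `find_repo_root_strict` and the default marker set are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def find_repo_root_strict(
    start: Path,
    required: Iterable[str] = ("src", "PROJECT_STATE.md"),  # assumed marker set
) -> Path:
    """Return the nearest ancestor of `start` that contains ALL required markers."""
    required = tuple(required)
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / name).exists() for name in required):
            return candidate
    # Fail loudly rather than guessing a machine-specific path.
    raise FileNotFoundError(
        f"No ancestor of {start} contains all of: {', '.join(required)}"
    )
```

Raising on failure makes a wrong working directory visible immediately, before any folders or config files are created in the wrong place.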


Ensure src/configs/project.yaml

In [31]:
# [CELL 00-03] Ensure config exists at: src/configs/project.yaml

cfg_dir = REPO_ROOT / "src" / "configs"
cfg_dir.mkdir(parents=True, exist_ok=True)

cfg_path = cfg_dir / "project.yaml"

default_cfg = """project:
  name: mooc-coldstart-session-meta

paths:
  data_raw: data/raw
  data_processed: data/processed
  runs: runs

repro:
  seed: 42
  deterministic: true

training:
  num_workers: 2
  pin_memory: false
"""

if not cfg_path.exists():
    cfg_path.write_text(default_cfg, encoding="utf-8")
    print("Created:", cfg_path)
else:
    print("Exists:", cfg_path)

print("Config path:", cfg_path.resolve())


Exists: D:\00_DS-ML-Workspace\src\configs\project.yaml
Config path: D:\00_DS-ML-Workspace\src\configs\project.yaml
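
Because the cell above only checks that the file exists, a stale or hand-edited `project.yaml` could still lack keys that later cells rely on. The sketch below checks a loaded config dictionary for the sections and keys that appear in the default config; the helper name `missing_config_keys` and the choice of which keys count as required are assumptions.

```python
from typing import Any, Dict, List

# Sections and keys taken from the default config text (assumed to be required).
REQUIRED_KEYS = {
    "project": ("name",),
    "paths": ("data_raw", "data_processed", "runs"),
    "repro": ("seed", "deterministic"),
}


def missing_config_keys(cfg: Dict[str, Any]) -> List[str]:
    """Return dotted names of required sections/keys absent from `cfg`."""
    missing: List[str] = []
    for section, keys in REQUIRED_KEYS.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            # Whole section is missing or is not a mapping.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


# Typical use with PyYAML (not executed here):
# cfg = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
# problems = missing_config_keys(cfg)
# if problems:
#     raise KeyError(f"project.yaml is missing: {problems}")
```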


Validate/Create src/ structure (fix for src exists: False)

In [32]:
# [CELL 00-04] Repo structure validation (and minimal fix)
# If src/ is missing, create it (safe, no data assumptions).
# This prevents ModuleNotFoundError: No module named 'src'

src_dir = REPO_ROOT / "src"
utils_dir = src_dir / "utils"

if not src_dir.exists():
    src_dir.mkdir(parents=True, exist_ok=True)
    print("Created:", src_dir)

if not utils_dir.exists():
    utils_dir.mkdir(parents=True, exist_ok=True)
    print("Created:", utils_dir)

# Ensure packages are importable
init_src = src_dir / "__init__.py"
init_utils = utils_dir / "__init__.py"

if not init_src.exists():
    init_src.write_text("", encoding="utf-8")
    print("Created:", init_src)

if not init_utils.exists():
    init_utils.write_text("", encoding="utf-8")
    print("Created:", init_utils)

print("src exists:", src_dir.exists())
print("src/utils exists:", utils_dir.exists())


src exists: True
src/utils exists: True
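
The repeated "create the folder, then create `__init__.py`" pattern in the cell above could be folded into one idempotent helper. The sketch below is one way to do that; `ensure_package` is a hypothetical name, and returning the list of created paths is a design choice made so the caller can log only what actually changed.

```python
from pathlib import Path
from typing import List


def ensure_package(pkg_dir: Path) -> List[Path]:
    """Make `pkg_dir` an importable package; return the paths that were created."""
    created: List[Path] = []
    if not pkg_dir.is_dir():
        pkg_dir.mkdir(parents=True, exist_ok=True)
        created.append(pkg_dir)
    init_file = pkg_dir / "__init__.py"
    if not init_file.exists():
        init_file.touch()  # empty file is enough to mark a regular package
        created.append(init_file)
    return created


# Typical use (not executed here):
# for pkg in (REPO_ROOT / "src", REPO_ROOT / "src" / "utils"):
#     for path in ensure_package(pkg):
#         print("Created:", path)
```

Calling the helper a second time returns an empty list, so re-running the notebook does not print misleading "Created" messages.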
