# GNS Kaggle Pipeline Notebook

Edit this notebook locally, keep it under version control, and push it to Kaggle when you want to run it on Kaggle's remote hardware. Attach this repository to the Kaggle Notebook (as a zip or as a Kaggle Dataset) so that the `gns_kaggle` package and the config files are available at run time and the relative paths used below resolve.

**Local -> Kaggle workflow**
1. Update the repo locally, including this notebook.
2. Publish the updated sources (e.g. `kaggle kernels push -p gns_kaggle` from the project root, where `-p` points at the folder holding the notebook and its `kernel-metadata.json`, or ship a new Dataset version).
3. Open the Notebook on Kaggle and run the cells from top to bottom.

Outputs are written to `kaggle_models/` (checkpoints) and `kaggle_rollouts/` (inference results) so they can be collected as Notebook outputs.
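
On Kaggle, attached Datasets are mounted read-only under `/kaggle/input/`, while the notebook starts in `/kaggle/working`. One way to keep the repository's relative paths valid is to copy it into the working directory before running anything else. The following is a minimal sketch under that assumption: `find_repo_root` and `stage_repository` are hypothetical helpers, and the marker `gns_kaggle/pipeline.py` is assumed to identify the repository root.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_repo_root(search_root: Path, package: str = "gns_kaggle",
                   module: str = "pipeline.py") -> Optional[Path]:
    """Return the directory that contains `package/module`, or None."""
    if not search_root.is_dir():
        return None
    for hit in sorted(search_root.rglob(module)):
        if hit.parent.name == package:
            return hit.parent.parent
    return None


def stage_repository(source: Path, destination: Path) -> Path:
    """Copy the repository into a writable directory, skipping VCS and cache folders."""
    shutil.copytree(
        source,
        destination,
        dirs_exist_ok=True,  # requires Python 3.8+
        ignore=shutil.ignore_patterns(".git", "__pycache__"),
    )
    return destination


# Assumed locations: attached Datasets under /kaggle/input, notebook in /kaggle/working.
repo_root = find_repo_root(Path("/kaggle/input"))
if repo_root is None:
    print("Repository not found under /kaggle/input; nothing staged.")
else:
    staged = stage_repository(repo_root, Path("/kaggle/working"))
    print(f"Staged {repo_root} -> {staged}")
```

Everything copied into `/kaggle/working` is likely to end up among the Notebook outputs, so for a large repository it may be preferable to copy only the package and config directories.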

## Configure the pipeline
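Toggle the flags below to control which stages run during the Kaggle job. With the defaults shown, dependency installation is skipped (`SKIP_INSTALL = True`) and the job runs dataset generation -> training -> rollout, with rollout analysis and HTML visualization turned off. Set `SKIP_INSTALL = False` to run the full flow, including the dependency install.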

In [3]:
from pathlib import Path

IN_KAGGLE = Path("/kaggle").exists()
print(f"Running inside Kaggle: {IN_KAGGLE}")
print(f"Working directory: {Path.cwd()}")

# Pipeline toggles ---------------------------------------------------------
SKIP_INSTALL = True
FORCE_REINSTALL = False
SKIP_GENERATE = False
SKIP_TRAIN = False
SKIP_ROLLOUT = False
RUN_ANALYSIS = False
VISUALIZE_HTML = False

# Configuration files ------------------------------------------------------
DATASET_CONFIG = Path("datasets/config/fluid_kaggle.yaml")
TRAIN_CONFIG = Path("config_kaggle.yaml")
ROLLOUT_CONFIG = Path("config_kaggle_rollout.yaml")

for label, cfg in (
    ("Dataset", DATASET_CONFIG),
    ("Train", TRAIN_CONFIG),
    ("Rollout", ROLLOUT_CONFIG),
):
    status = "OK" if cfg.exists() else "MISSING"
    print(f"{label} config: {cfg} [{status}]")

Running inside Kaggle: True
Working directory: /kaggle/working
Dataset config: datasets/config/fluid_kaggle.yaml [MISSING]
Train config: config_kaggle.yaml [MISSING]
Rollout config: config_kaggle_rollout.yaml [MISSING]
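
All three configs are reported as `MISSING` in the output above, yet nothing stops the run until a stage tries to load its file. A fail-fast check that looks only at the configs needed by the enabled stages could look like the sketch below. `missing_configs` is a hypothetical helper, and the flag and path values simply mirror the configuration cell.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def missing_configs(stages: Dict[str, Tuple[bool, Path]]) -> List[str]:
    """Return the labels of enabled stages whose config file does not exist.

    `stages` maps a label to (skip_flag, config_path).
    """
    return [
        label
        for label, (skip, cfg) in stages.items()
        if not skip and not cfg.is_file()
    ]


# Values copied from the configuration cell; adjust them to match your flags.
stages = {
    "Dataset": (False, Path("datasets/config/fluid_kaggle.yaml")),
    "Train": (False, Path("config_kaggle.yaml")),
    "Rollout": (False, Path("config_kaggle_rollout.yaml")),
}

missing = missing_configs(stages)
if missing:
    print("Missing configs for enabled stages: " + ", ".join(missing))
else:
    print("All configs for enabled stages are present.")
```

Skipped stages are ignored on purpose, so a rollout-only run does not fail because the dataset config is absent.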


## Run the orchestration script
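This cell calls `gns_kaggle/pipeline.py`, which installs dependencies and invokes the dataset-generation, training, and rollout steps with lightweight, Kaggle-friendly settings. A stage runs only when its `SKIP_*` flag is `False`. Analysis and HTML visualization run only as part of the rollout stage, and a missing config file raises `FileNotFoundError` before its stage starts.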

In [4]:
from pathlib import Path

from gns_kaggle import pipeline as gns_pipeline


def _ensure(path: Path, label: str) -> Path:
    resolved = path.expanduser().resolve()
    if not resolved.exists():
        raise FileNotFoundError(f"{label} not found: {resolved}")
    return resolved


if not SKIP_INSTALL:
    gns_pipeline.install_dependencies(force=FORCE_REINSTALL)
else:
    print("[notebook] Skipping dependency installation.")

if not SKIP_GENERATE:
    gns_pipeline.generate_dataset(_ensure(DATASET_CONFIG, "Dataset config"))
else:
    print("[notebook] Skipping dataset generation.")

if not SKIP_TRAIN:
    gns_pipeline.train_model(_ensure(TRAIN_CONFIG, "Training config"))
else:
    print("[notebook] Skipping training.")

if not SKIP_ROLLOUT:
    gns_pipeline.run_rollout(_ensure(ROLLOUT_CONFIG, "Rollout config"))
    if RUN_ANALYSIS:
        gns_pipeline.analyze_rollouts()
    if VISUALIZE_HTML:
        gns_pipeline.visualize_rollouts(html=True)
else:
    print("[notebook] Skipping rollout inference.")

ModuleNotFoundError: No module named 'gns_kaggle'
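
The `ModuleNotFoundError` above means Python could not find the `gns_kaggle` package on `sys.path` when the cell ran. A pre-flight check with `importlib.util.find_spec` can make that failure easier to diagnose. In this sketch, `ensure_importable` is a hypothetical helper and the candidate directories are assumptions about where the repository might be mounted.

```python
import importlib.util
import sys
from pathlib import Path
from typing import Iterable


def ensure_importable(package: str, candidates: Iterable[Path]) -> bool:
    """Return True if `package` can be imported, extending sys.path if needed.

    The first candidate directory that contains a `package` folder is
    prepended to sys.path.
    """
    if importlib.util.find_spec(package) is not None:
        return True
    for directory in candidates:
        if (directory / package).is_dir():
            sys.path.insert(0, str(directory))
            importlib.invalidate_caches()
            return importlib.util.find_spec(package) is not None
    return False


input_dir = Path("/kaggle/input")  # assumed mount point for attached Datasets
mounted = sorted(p for p in input_dir.iterdir() if p.is_dir()) if input_dir.is_dir() else []

if ensure_importable("gns_kaggle", [Path.cwd(), *mounted]):
    print("gns_kaggle is importable.")
else:
    print("gns_kaggle not found; attach the repository or adjust the candidates.")
```

Running this before `from gns_kaggle import pipeline` turns a bare import error into a message that says where the package was looked for.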

## Inspect generated artifacts
Quickly list the checkpoint and rollout directories so you can decide what to keep as Kaggle Notebook outputs.

In [None]:
from pathlib import Path

def list_directory(path: Path) -> None:
    if not path.exists():
        print(f"{path}: (missing)")
        return
    print(f"{path}:")
    for child in sorted(path.iterdir()):
        if child.is_dir():
            marker = "<DIR>"
        else:
            marker = f"{child.stat().st_size / 1024:.1f} KiB"
        print(f"  {child.name:30s} {marker}")


list_directory(Path("kaggle_models"))
list_directory(Path("kaggle_rollouts"))
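
The listing above marks subdirectories as `<DIR>` without a size. When size is what decides which artifacts to keep, a recursive total may be more useful. The sketch below assumes the same two output directories; `directory_size` is a hypothetical helper.

```python
from pathlib import Path


def directory_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path (0 if missing)."""
    if not path.exists():
        return 0
    if path.is_file():
        return path.stat().st_size
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


for name in ("kaggle_models", "kaggle_rollouts"):
    size_mib = directory_size(Path(name)) / (1024 * 1024)
    print(f"{name}: {size_mib:.2f} MiB")
```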

## Package outputs (optional)
Copy the important directories into a single export folder and create a zip archive that can be downloaded from Kaggle.

In [None]:
import shutil
from pathlib import Path

ARTIFACT_ROOT = Path("kaggle_export")
ARTIFACT_ROOT.mkdir(exist_ok=True)

sources = (Path("kaggle_models"), Path("kaggle_rollouts"))
copied = []
for source in sources:
    target = ARTIFACT_ROOT / source.name
    if target.exists():
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    if source.exists():
        if source.is_dir():
            shutil.copytree(source, target)
        else:
            shutil.copy2(source, target)
        copied.append(target)
        print(f"Copied {source} -> {target}")
    else:
        print(f"Skipped missing source: {source}")

zip_path = Path("gns_artifacts.zip")
if zip_path.exists():
    zip_path.unlink()

if copied:
    # Archive only when at least one source directory was copied.
    archive = shutil.make_archive("gns_artifacts", "zip", root_dir=ARTIFACT_ROOT)
    archive_path = Path(archive)
    size_kib = archive_path.stat().st_size / 1024
    print(f"Created archive: {archive_path} ({size_kib:.1f} KiB)")
    # IN_KAGGLE is defined in the configuration cell.
    if IN_KAGGLE:
        print("Add gns_artifacts.zip to the Notebook output files before finishing the run.")
else:
    print("Nothing was copied; skipping archive creation.")
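
Before downloading, it can be worth confirming what actually went into the archive. The sketch below uses the standard `zipfile` module for that. `summarize_archive` is a hypothetical helper, and the archive name is assumed to match the packaging cell.

```python
import zipfile
from pathlib import Path
from typing import List, Tuple


def summarize_archive(zip_path: Path) -> List[Tuple[str, int]]:
    """Return (name, uncompressed size in bytes) for each file entry in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()
        ]


archive = Path("gns_artifacts.zip")
if archive.is_file():
    for name, size in summarize_archive(archive):
        print(f"{size:>12d}  {name}")
else:
    print(f"{archive} not found; run the packaging cell first.")
```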