<a href="https://colab.research.google.com/github/mahb97/Wake2vec/blob/main/Wake2vec_heartbeat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wake2Vec Training Monitor

Remote monitoring and checkpoint verification system for long-running embedding training experiments.

---

## Overview

This notebook provides monitoring utilities for tracking Wake2Vec training runs that execute on separate Colab instances. It inspects checkpoint directories and metrics logs without touching the training process itself; the only writes it performs are sentry backups, a heartbeat marker file, and (when needed) a rebuilt checkpoint on Google Drive. Both local ephemeral storage and persistent Google Drive locations are supported.

## Functionality

The notebook performs the following diagnostic operations:

- Training progress monitoring via loss history and evaluation metrics
- Checkpoint inventory and validation across multiple storage locations
- Automatic mirroring of validated checkpoints to secondary backup directories
- Temporal analysis of file modifications to detect stale or incomplete saves
- Embedding snapshot tracking at configurable step intervals
- Resume point identification for interrupted training sessions
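
The temporal check listed above can be sketched in a few lines. This is a minimal illustration rather than code taken from the training run: `is_stale` and its 30-minute default are hypothetical and would need tuning to the actual save interval. File age alone also cannot prove that a save is complete, so it is best combined with a weight-file check.

```python
import time
from pathlib import Path
from typing import Optional

def is_stale(path: Path, max_age_s: float = 30 * 60, now: Optional[float] = None) -> bool:
    """Return True if `path` is missing or was modified more than `max_age_s` seconds ago.

    `max_age_s` is an illustrative threshold (assumption), not a value from the run config.
    """
    if not path.exists():
        return True
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) > max_age_s
```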

## Implementation Details

Storage hierarchy: The system monitors three locations:

- Local ephemeral: `/content/runs/t4_*` (active training directory)
- Drive persistent: `/content/drive/MyDrive/wake2vec/runs/t4_*` (synchronized copy)
- Sentry backup: `/content/drive/MyDrive/wake2vec/sentry_backups/t4_*` (safety mirror)

Checkpoint validation: Weights are verified by checking for the presence of `model.safetensors`, `pytorch_model.bin`, or sharded weight files. Checkpoints without valid weights are excluded from resume candidates.
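
A minimal sketch of this check, using the file names and shard patterns that appear in the notebook's own cells. The constant names are illustrative.

```python
from pathlib import Path

# Single-file and sharded weight layouts checked by the notebook's cells
SINGLE_FILE_WEIGHTS = ("model.safetensors", "pytorch_model.bin")
SHARD_PATTERNS = ("model-*-of-*.safetensors", "pytorch_model-*-of-*.bin")

def has_weights(ckpt: Path) -> bool:
    """True if the checkpoint directory holds a single-file or sharded weight file."""
    if not ckpt.is_dir():
        return False
    if any((ckpt / name).exists() for name in SINGLE_FILE_WEIGHTS):
        return True
    # next(..., None) stops at the first matching shard instead of listing them all
    return any(next(ckpt.glob(pat), None) is not None for pat in SHARD_PATTERNS)
```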

Loss monitoring: The system preferentially reads from structured JSON logs at `metrics/phase1_loss_log.json`, falling back to `trainer_state.json` within checkpoint directories when streaming logs are unavailable.
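
The fallback order can be sketched as a single function. `read_loss_tail` is a hypothetical helper, and it assumes the stream log is a JSON list of objects with `step` and `loss` keys, as the loss cell in this notebook does. Unlike that cell, the sketch walks back through older checkpoints until it finds one that has a `trainer_state.json`.

```python
import json
import re
from pathlib import Path

def _step(p: Path) -> int:
    # 'checkpoint-300' -> 300, 'checkpoint-750-rebuilt' -> 750, otherwise -1
    m = re.match(r"checkpoint-(\d+)", p.name)
    return int(m.group(1)) if m else -1

def read_loss_tail(run_dir: Path, n: int = 5):
    """Return (source, [(step, loss), ...]) holding the last n logged training losses."""
    stream = run_dir / "metrics" / "phase1_loss_log.json"
    if stream.exists():
        logs = json.loads(stream.read_text())
        return "stream", [(d["step"], float(d["loss"])) for d in logs[-n:]]
    # Fallback: newest checkpoint that carries a trainer_state.json
    for ck in sorted(run_dir.glob("checkpoint-*"), key=_step, reverse=True):
        state_file = ck / "trainer_state.json"
        if state_file.exists():
            history = json.loads(state_file.read_text()).get("log_history", [])
            losses = [(d["step"], float(d["loss"])) for d in history if "loss" in d]
            return "trainer_state", losses[-n:]
    return "none", []
```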

Evaluation tracking: Evaluation metrics are extracted from the most recent checkpoint's trainer state, displaying the tail of recorded validation losses with corresponding step numbers and runtime statistics.


## Monitoring Schedule

The notebook is designed for manual execution at user-defined intervals during training. Typical usage patterns include:

- Hourly checks during active training phases
- Post-checkpoint verification after save events
- Pre-resume validation before launching continuation runs
- Post-mortem analysis for failed or interrupted sessions



In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)
print("Drive mounted.")

Mounted at /content/drive
Drive mounted.


Resolve active RUN

In [None]:
import pathlib, time

RUN_ID = None

LOCAL_ROOT = pathlib.Path("/content/runs")
DRIVE_ROOT = pathlib.Path("/content/drive/MyDrive/wake2vec")
def latest_run(root):
    if not root.exists(): return None
    runs = []
    for p in root.glob("t4_*"):
        try: runs.append((p.stat().st_mtime, p))
        except FileNotFoundError: pass
    return max(runs, key=lambda x: x[0])[1] if runs else None

LOCAL_RUN  = (LOCAL_ROOT/RUN_ID) if RUN_ID else latest_run(LOCAL_ROOT)
DRIVE_RUN  = (DRIVE_ROOT/"runs"/RUN_ID) if RUN_ID else latest_run(DRIVE_ROOT/"runs")
RUN = LOCAL_RUN or DRIVE_RUN
assert RUN is not None, "No local or Drive t4_* run found."

print("Watching:", RUN, "| mtime:", time.ctime(RUN.stat().st_mtime))
SENTRY = DRIVE_ROOT/"sentry_backups"/RUN.name
SENTRY.mkdir(parents=True, exist_ok=True)

Watching: /content/drive/MyDrive/wake2vec/runs/t4_1762376560 | mtime: Wed Nov  5 21:02:40 2025


loss tail

In [None]:
import json

mlog = RUN/"metrics"/"phase1_loss_log.json"
if mlog.exists():
    logs = json.loads(mlog.read_text())
    tail = logs[-5:]
    print("[LOSS stream] last:", [(d["step"], round(float(d["loss"]),4)) for d in tail])
else:
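    # Fallback: read the newest checkpoint's trainer_state.json.
    # The sort key tolerates suffixed names such as checkpoint-750-rebuilt.
    step_of = lambda p: int(p.name.split("-")[1]) if p.name.split("-")[1].isdigit() else -1
    cks = sorted(RUN.glob("checkpoint-*"), key=step_of)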
    if cks and (cks[-1]/"trainer_state.json").exists():
        state = json.loads((cks[-1]/"trainer_state.json").read_text())
        tail = [(d["step"], d["loss"]) for d in state.get("log_history", []) if "loss" in d][-5:]
        print("[LOSS state ] last:", tail if tail else "—")
    else:
        print("[LOSS] no logs yet")

[LOSS stream] last: [(350, 5.4883), (400, 5.3082), (450, 4.7776), (500, 4.0891), (550, 3.2604)]


Eval tail

In [None]:
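import json, re

def ckpt_step(p):
    # Leading numeric step of a checkpoint dir name ('checkpoint-750-rebuilt' -> 750); -1 if none
    m = re.match(r"checkpoint-(\d+)", p.name)
    return int(m.group(1)) if m else -1

cks = sorted(RUN.glob("checkpoint-*"), key=ckpt_step)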
if cks:
    p = cks[-1]/"trainer_state.json"
    if p.exists():
        state = json.loads(p.read_text())
        evals = [d for d in state.get("log_history", []) if "eval_loss" in d][-3:]
        print("[EVAL] tail:", evals if evals else "—")
    else:
        print("[EVAL] none yet (next at 400/600/800...)")
else:
    print("[EVAL] no checkpoints yet")

[EVAL] tail: [{'epoch': 3.50989010989011, 'eval_loss': 6.237724304199219, 'eval_runtime': 13.5689, 'eval_samples_per_second': 3.538, 'eval_steps_per_second': 0.442, 'step': 200}, {'epoch': 7.0175824175824175, 'eval_loss': 6.416950225830078, 'eval_runtime': 13.5887, 'eval_samples_per_second': 3.532, 'eval_steps_per_second': 0.442, 'step': 400}, {'epoch': 10.527472527472527, 'eval_loss': 7.096441268920898, 'eval_runtime': 13.6439, 'eval_samples_per_second': 3.518, 'eval_steps_per_second': 0.44, 'step': 600}]
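
The tail above shows `eval_loss` rising at each of the last three evaluations (steps 200, 400, 600) while the training loss in the previous cell keeps falling. A small helper can make that pattern explicit. `eval_trend` is a hypothetical name, and the "rising" flag is only a heuristic over the last few evaluations, not a verdict on the run.

```python
def eval_trend(log_history, k=3):
    """Return the last k (step, eval_loss) pairs and whether eval_loss rose every time."""
    evals = [(d["step"], d["eval_loss"]) for d in log_history if "eval_loss" in d][-k:]
    rising = len(evals) >= 2 and all(
        later[1] > earlier[1] for earlier, later in zip(evals, evals[1:])
    )
    return evals, rising

# Values rounded from the eval tail printed above
history = [
    {"step": 200, "eval_loss": 6.2377},
    {"step": 400, "eval_loss": 6.4170},
    {"step": 600, "eval_loss": 7.0964},
]
pairs, rising = eval_trend(history)
print(pairs, "| rising:", rising)
```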


checkpoint audit

In [None]:
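import re, shutil, time

def ckpt_step(p):
    # 'checkpoint-300' -> 300, 'checkpoint-750-rebuilt' -> 750, anything else -> -1
    m = re.match(r"checkpoint-(\d+)", p.name)
    return int(m.group(1)) if m else -1

def latest_full_ckpt(root):
    # Newest checkpoint that contains weights (single-file or sharded)
    cks = sorted(root.glob("checkpoint-*"), key=ckpt_step, reverse=True)
    for ck in cks:
        if ((ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists()
                or any(ck.glob("model-*-of-*.safetensors"))
                or any(ck.glob("pytorch_model-*-of-*.bin"))):
            return ck
    return None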

src = latest_full_ckpt(RUN)
if src is None:
    print("[SENTRY] No full checkpoint yet; wait for next save.")
else:
    dst = SENTRY/src.name
    if not dst.exists():
        shutil.copytree(src, dst)
        print(f"[SENTRY] mirrored {src.name} (mtime {time.ctime(src.stat().st_mtime)})")
    else:
        print("[SENTRY] already has", src.name)
    # Mirror metrics
    msrc = RUN/"metrics"; mdst = SENTRY/"metrics"; mdst.mkdir(parents=True, exist_ok=True)
    copied = 0
    if msrc.exists():
        for f in msrc.glob("*.json"):
            shutil.copy2(f, mdst/f.name); copied += 1
    print(f"[SENTRY] metrics mirrored ({copied} files)")

[SENTRY] already has checkpoint-300
[SENTRY] metrics mirrored (1 files)
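
The mirroring cell copies a checkpoint only when the destination directory does not exist yet, so a mirror that was interrupted part-way would never be repaired. As a sketch of how such a gap could be detected, the helper below compares relative file paths and sizes between source and mirror. `file_sizes` and `mirror_diff` are hypothetical names, and a size comparison is a cheap heuristic rather than a checksum.

```python
from pathlib import Path

def file_sizes(root: Path) -> dict:
    """Map each file's path relative to `root` to its size in bytes."""
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*") if p.is_file()
    }

def mirror_diff(src: Path, dst: Path) -> dict:
    """Report files that are missing from `dst` or differ in size from `src`."""
    s = file_sizes(src)
    d = file_sizes(dst) if dst.exists() else {}
    missing = sorted(set(s) - set(d))
    size_mismatch = sorted(k for k in s.keys() & d.keys() if s[k] != d[k])
    return {"missing": missing, "size_mismatch": size_mismatch}
```

An empty result for both keys suggests the mirror matches its source; anything else is a reason to re-copy before relying on the backup.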


embedding snapshot quick view

In [None]:
SNAPS_DIR = DRIVE_ROOT/"emb_snaps"/RUN.name
if SNAPS_DIR.exists():
    snaps = sorted(SNAPS_DIR.glob("emb_step*.pt"))
    print(f"[SNAPS] count={len(snaps)}  latest=", snaps[-1].name if snaps else "—")
    hb = SNAPS_DIR/"heartbeat.json"
    if hb.exists(): print("[SNAPS] heartbeat:", hb.read_text())
else:
    print("[SNAPS] none yet (first at step 350 if every 50)")

[SNAPS] count=9  latest= emb_step0750.pt
[SNAPS] heartbeat: {
  "step": 750,
  "rows": 32000,
  "dim": 2048,
  "ts": 1762729811.8068688
}


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


opt light sync

In [None]:
import os, time, pathlib
touch = RUN/"_heartbeat.touch"
touch.write_text(str(time.time()))
os.sync()
print("[SYNC] touched + sync hinted →", touch)

[SYNC] touched + sync hinted → /content/drive/MyDrive/wake2vec/runs/t4_1762376560/_heartbeat.touch


In [None]:
import shutil, pathlib
RUN = pathlib.Path("/content/drive/MyDrive/wake2vec/runs/t4_1762376560")
SENTRY = pathlib.Path("/content/drive/MyDrive/wake2vec/sentry_backups/t4_1762376560")
SENTRY.mkdir(parents=True, exist_ok=True)
src = RUN/"checkpoint-750"; dst = SENTRY/"checkpoint-750"
if src.exists() and not dst.exists():
    shutil.copytree(src, dst); print("[SENTRY] mirrored checkpoint-750")

In [None]:
import pathlib
RUN = pathlib.Path("/content/drive/MyDrive/wake2vec/runs/t4_1762376560")
ck = RUN/"checkpoint-750"
print("750 exists:", ck.exists(),
      "| weights:", (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists())

750 exists: False | weights: False


In [None]:
# unlink drive
try:
    from google.colab import drive
    drive.flush_and_unmount()
    print("[OK] flushed & unmounted")
except Exception as e:
    print("[INFO] flush/unmount skipped:", e)

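import shutil, os
# Remove the leftover mount point only when nothing is mounted there:
# rmtree on a live mount could delete files in Drive itself.
if os.path.exists("/content/drive") and not os.path.ismount("/content/drive"):
    shutil.rmtree("/content/drive", ignore_errors=True)
    print("[OK] removed /content/drive")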

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# sanity check
import pathlib
BASE = pathlib.Path("/content/drive/MyDrive/wake2vec")
print("[BASE exists]", BASE.exists())
print("[BASE contents]", [p.name for p in BASE.iterdir()] if BASE.exists() else "—")

Drive not mounted, so nothing to flush and unmount.
[OK] flushed & unmounted
[OK] removed /content/drive
Mounted at /content/drive
[BASE exists] True
[BASE contents] ['runs', 'adapters', 'reports', 'archives', 'notebooks', 'datasets', 'sentry_backups', 'emb_snaps']


In [None]:
import pathlib

BASE   = pathlib.Path("/content/drive/MyDrive/wake2vec")
RUNS   = BASE/"runs"
SENTRY = BASE/"sentry_backups"
SNAPS  = BASE/"emb_snaps"

def ckpt_step(ck):
    # 'checkpoint-300' -> 300, 'checkpoint-750-rebuilt' -> 750, no numeric step -> -1
    part = ck.name.split("-")[1]
    return int(part) if part.isdigit() else -1

def audit(run_root):
    # List every t4_* run under run_root and flag which checkpoints hold weights
    print(f"\n[{run_root}] exists:", run_root.exists())
    if not run_root.exists(): return
    for run in sorted(run_root.glob("t4_*")):
        print(" ", run.name)
        for ck in sorted(run.glob("checkpoint-*"), key=ckpt_step):
            has_w = (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists()
            print(f"   {ck.name:>20}  weights={int(has_w)}")

audit(RUNS)
audit(SENTRY)

print("\n[SNAPS] exists:", SNAPS.exists())
if SNAPS.exists():
    for r in sorted(SNAPS.glob("t4_*")):
        snaps = sorted(r.glob("emb_step*.pt"))
        print(" ", r.name, "| snaps:", len(snaps), "| latest:", snaps[-1].name if snaps else "—")


[/content/drive/MyDrive/wake2vec/runs] exists: True
  t4_1762100254
  t4_1762104879
  t4_1762105026
  t4_1762113417
  t4_1762375997
  t4_1762376307
  t4_1762376560
         checkpoint-100  weights=1
         checkpoint-200  weights=1
         checkpoint-300  weights=1
         checkpoint-400  weights=0
         checkpoint-500  weights=0
         checkpoint-600  weights=0
         checkpoint-700  weights=0

[/content/drive/MyDrive/wake2vec/sentry_backups] exists: True
  t4_1762376560
         checkpoint-300  weights=0
         checkpoint-400  weights=0
         checkpoint-500  weights=0
         checkpoint-600  weights=0
         checkpoint-700  weights=0

[SNAPS] exists: True
  t4_1762376560 | snaps: 9 | latest: emb_step0750.pt


In [None]:
RUN_ID = "t4_1762376560"
STEP = 700
ck = RUNS/RUN_ID/f"checkpoint-{STEP}"
print(ck)
print("\nFiles:")
for p in sorted(ck.glob("*")):
    print(" -", p.name)

/content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-700

Files:
 - config.json
 - generation_config.json
 - optimizer.pt
 - rng_state.pth
 - scheduler.pt
 - special_tokens_map.json
 - tokenizer.json
 - tokenizer.model
 - tokenizer_config.json
 - trainer_state.json
 - training_args.bin


In [None]:
# Rebuild a loadable checkpoint-750 from the embedding snapshot
import pathlib, torch, shutil, re
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE   = pathlib.Path("/content/drive/MyDrive/wake2vec")
RUNS   = BASE/"runs"
SENTRY = BASE/"sentry_backups"
SNAPS  = BASE/"emb_snaps"

# Pick RUN_ID from snapshots
RUN_IDS = sorted([p.name for p in SNAPS.glob("t4_*")])
assert RUN_IDS, "No t4_* in emb_snaps — check Drive mount/account."
RUN_ID = RUN_IDS[-1]
print("[RUN_ID]", RUN_ID)

# Find base checkpoint ≤ 750
def full_ckpts(root):
    out = []
    d = root/RUN_ID
    if not d.exists(): return out
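    for ck in d.glob("checkpoint-*"):
        # Only plain 'checkpoint-<step>' dirs qualify as a base; skip e.g. checkpoint-750-rebuilt
        m = re.fullmatch(r"checkpoint-(\d+)", ck.name)
        if not m: continue
        step = int(m.group(1))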
        has_w = (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists() \
                or list(ck.glob("model-*-of-*.safetensors")) or list(ck.glob("pytorch_model-*-of-*.bin"))
        if has_w: out.append((step, ck))
    return sorted(out, key=lambda x: x[0], reverse=True)

bases = full_ckpts(SENTRY) + full_ckpts(RUNS)
assert bases, "No base checkpoints with weights found in sentry_backups/ or runs/."
base_step, BASE_CK = next(((s, p) for s, p in bases if s <= 750), bases[-1])
print(f"[BASE] Using {BASE_CK} (step {base_step})")

# Load embedding snapshot @ 750
EMB = SNAPS/RUN_ID/"emb_step0750.pt"
assert EMB.exists(), "emb_step0750.pt not found."
emb = torch.load(EMB, map_location="cpu")

# Load base and inject embeddings; re-tie head
tok = AutoTokenizer.from_pretrained(str(BASE_CK), use_fast=True)
model = AutoModelForCausalLM.from_pretrained(str(BASE_CK), torch_dtype=torch.float32, device_map="cpu")
with torch.no_grad():
    model.get_input_embeddings().weight[:emb.size(0), :].copy_(emb)
    model.get_output_embeddings().weight = model.get_input_embeddings().weight
print("[REBUILD] Injected emb_step0750 into base")

# Save as checkpoint-750-rebuilt
OUT_RUN    = RUNS/RUN_ID/"checkpoint-750-rebuilt"
OUT_SENTRY = SENTRY/RUN_ID/"checkpoint-750-rebuilt"
for d in (OUT_RUN, OUT_SENTRY):
    if d.exists(): shutil.rmtree(d)
model.save_pretrained(str(OUT_RUN), safe_serialization=True)
tok.save_pretrained(str(OUT_RUN))
shutil.copytree(OUT_RUN, OUT_SENTRY)
print("[SAVED] →", OUT_RUN)
print("[MIRRORED] →", OUT_SENTRY)

[RUN_ID] t4_1762376560
[BASE] Using /content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-300 (step 300)


`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-300 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[REBUILD] Injected emb_step0750 into base
[SAVED] → /content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-750-rebuilt
[MIRRORED] → /content/drive/MyDrive/wake2vec/sentry_backups/t4_1762376560/checkpoint-750-rebuilt


In [None]:
from pathlib import Path
ck = Path("/content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-750-rebuilt")
has_w = (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists() \
        or list(ck.glob("model-*-of-*.safetensors")) or list(ck.glob("pytorch_model-*-of-*.bin"))
print("750-rebuilt loadable:", bool(has_w))

750-rebuilt loadable: True


In [None]:
# HB-1
from pathlib import Path
from google.colab import drive
import re

drive.mount("/content/drive")

RUN_ID = "t4_1762376560"
DRIVE = Path("/content/drive/MyDrive/wake2vec")
RUN_LOCAL = Path(f"/content/runs/{RUN_ID}")
RUN_DRIVE = DRIVE / "runs" / RUN_ID
RESUME_750 = RUN_DRIVE / "checkpoint-750-rebuilt"

def has_weights(ck: Path) -> bool:
    if not ck.exists(): return False
    if (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists():
        return True
    if list(ck.glob("model-*-of-*.safetensors")) or list(ck.glob("pytorch_model-*-of-*.bin")):
        return True
    return False

def ckpt_step(p: Path) -> int:
    """
    Extract numeric step safely.
    - 'checkpoint-1234' -> 1234
    - 'checkpoint-750-rebuilt' -> 750
    - 'checkpoint-final' -> very large sentinel so it's treated as latest
    - anything else -> -1 (ignored)
    """
    m = re.search(r"checkpoint-(\d+)", p.name)
    if m:
        return int(m.group(1))
    if p.name == "checkpoint-final":
        return 10**9
    return -1

candidates = []
for root in (RUN_LOCAL, RUN_DRIVE):
    if root.exists():
        for p in root.glob("checkpoint-*"):
            if p.is_dir() and has_weights(p):
                candidates.append(p)

# Sort for best
catalog = sorted([(ckpt_step(p), p) for p in candidates], key=lambda x: x[0])
print("[CANDIDATES]", [f"{s}:{p.name}" for s, p in catalog])

latest = max(candidates, key=ckpt_step) if candidates else None
if latest is not None and ckpt_step(latest) < 0:
    latest = None

# fallback to 750-rebuilt
if latest is None and has_weights(RESUME_750):
    latest = RESUME_750

print("[RUN]", RUN_ID)
print("[BASE]", RESUME_750, "| has_weights:", has_weights(RESUME_750))
print("[LATEST]", latest if latest else "none yet")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[CANDIDATES] ['100:checkpoint-100', '200:checkpoint-200', '300:checkpoint-300', '750:checkpoint-750-rebuilt']
[RUN] t4_1762376560
[BASE] /content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-750-rebuilt | has_weights: True
[LATEST] /content/drive/MyDrive/wake2vec/runs/t4_1762376560/checkpoint-750-rebuilt


hb watch

In [None]:
# HB quick status (with ages)
from pathlib import Path
import re, time, os

RUN_ID = "t4_1762376560"
ROOT   = Path("/content/drive/MyDrive/wake2vec")
EMB    = ROOT / "emb_snaps" / RUN_ID
SENTRY = ROOT / "sentry_backups" / RUN_ID

def has_weights(ck: Path) -> bool:
    if not ck.exists(): return False
    if (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists(): return True
    if list(ck.glob("model-*-of-*.safetensors")) or list(ck.glob("pytorch_model-*-of-*.bin")): return True
    return False

def step_of(name: str) -> int:
    if name == "checkpoint-final": return 10**9
    m = re.search(r"checkpoint-(\d+)", name); return int(m.group(1)) if m else -1

def age(p: Path):
    if not p.exists(): return "n/a"
    secs = time.time() - p.stat().st_mtime
    if secs < 90:   return f"{int(secs)}s"
    if secs < 5400: return f"{int(secs//60)}m"
    return f"{secs/3600:.1f}h"

# snapshots
snaps = sorted(EMB.glob("emb_step*.pt"))
snap_name = snaps[-1].name if snaps else "none"
snap_age  = age(snaps[-1]) if snaps else "n/a"
print(f"[SNAP] {snap_name}  (age {snap_age})")

hb = EMB / "heartbeat.json"
print(f"[HB]   {('present, age '+age(hb)) if hb.exists() else 'none'}")

# checkpoints
cands = [p for p in SENTRY.glob("checkpoint-*") if p.is_dir() and has_weights(p)]
cands = sorted(cands, key=lambda p: step_of(p.name))
if cands:
    latest = cands[-1]
    print(f"[CKPT] {latest.name}  (age {age(latest)})")
else:
    print("[CKPT] none yet (first expected: checkpoint-0825)")

[SNAP] emb_step0800.pt  (age 21m)
[HB]   present, age 20m
[CKPT] checkpoint-750-rebuilt  (age 142.2h)


In [None]:
# Heartbeat
from pathlib import Path
import re, time

ROOT   = Path("/content/drive/MyDrive/wake2vec")
RUN_ID = "t4_1762376560"
EMB    = ROOT / "emb_snaps" / RUN_ID
SENTRY = ROOT / "sentry_backups" / RUN_ID

def has_weights(ck: Path) -> bool:
    if not ck.exists(): return False
    if (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists(): return True
    if list(ck.glob("model-*-of-*.safetensors")) or list(ck.glob("pytorch_model-*-of-*.bin")): return True
    return False

def age_str(p: Path) -> str:
    if not p.exists(): return "n/a"
    secs = time.time() - p.stat().st_mtime
    return (f"{int(secs)}s" if secs < 90 else f"{int(secs//60)}m" if secs < 5400 else f"{secs/3600:.1f}h")

# Expected embedding snapshots every 50 from 800..1300
snap_targets = [EMB / f"emb_step{n:04d}.pt" for n in range(800, 1301, 50)]
snap_missing = [p.name for p in snap_targets if not p.exists()]
snap_existing = [p.name for p in snap_targets if p.exists()]

# Expected checkpoints:
# - pre-1000: every 75 (825, 900, 975)
# - ≥1000: extra every 100 (1000, 1100, 1200, 1300); you may ALSO see 1075/1175/1275 from base 75 saves
ck_steps = sorted(set([825, 900, 975, 1000, 1075, 1100, 1175, 1200, 1275, 1300]))
# Checkpoint dirs in this run are not zero-padded (e.g. checkpoint-300), so match that form
ck_targets = [SENTRY / f"checkpoint-{s}" for s in ck_steps]
ck_missing = [f.name for f in ck_targets if not has_weights(f)]
ck_existing = [f.name for f in ck_targets if has_weights(f)]

print("[WATCHING]")
print("  SNAPS :", EMB)
print("  SENTRY:", SENTRY)

# latest snap & checkpoint
snaps = sorted(EMB.glob("emb_step*.pt"))
print("\n[LATEST SNAP ]", snaps[-1].name if snaps else "none", "| age:", age_str(snaps[-1]) if snaps else "n/a")

cands = [p for p in SENTRY.glob("checkpoint-*") if p.is_dir() and has_weights(p)]
def step_of(name: str) -> int:
    if name == "checkpoint-final": return 10**9
    m = re.search(r"checkpoint-(\d+)", name); return int(m.group(1)) if m else -1
latest_ck = max(cands, key=lambda p: step_of(p.name)) if cands else None
print("[LATEST CKPT]", latest_ck.name if latest_ck else "none", "| age:", age_str(latest_ck) if latest_ck else "n/a")

print("\n[NEXT SNAP CANDIDATES]", (snap_missing[:5] or ["(all later)"]))
print("[HAVE SNAPS        ]", (snap_existing[-5:] or ["none"]))
print("\n[NEXT CKPT CANDS  ]", (ck_missing[:5] or ["(all later)"]))
print("[HAVE CKPTS       ]", (ck_existing[-5:] or ["none"]))

[WATCHING]
  SNAPS : /content/drive/MyDrive/wake2vec/emb_snaps/t4_1762376560
  SENTRY: /content/drive/MyDrive/wake2vec/sentry_backups/t4_1762376560

[LATEST SNAP ] emb_step0800.pt | age: 21m
[LATEST CKPT] checkpoint-750-rebuilt | age: 142.2h

[NEXT SNAP CANDIDATES] ['emb_step0850.pt', 'emb_step0900.pt', 'emb_step0950.pt', 'emb_step1000.pt', 'emb_step1050.pt']
[HAVE SNAPS        ] ['emb_step0800.pt']

[NEXT CKPT CANDS  ] ['checkpoint-0825', 'checkpoint-0900', 'checkpoint-0975', 'checkpoint-1000', 'checkpoint-1075']
[HAVE CKPTS       ] ['none']


In [None]:
# HB StepSpeed
from pathlib import Path
import re, time, math, json

RUN_ID = "t4_1762376560"
ROOT   = Path("/content/drive/MyDrive/wake2vec")
SNAPS  = ROOT / "emb_snaps" / RUN_ID
SENTRY = ROOT / "sentry_backups" / RUN_ID

MILESTONES = [800, 825, 850, 900, 975, 1000, 1050, 1100, 1200, 1300]

def last_snapshot():
    snaps = sorted(SNAPS.glob("emb_step*.pt"))
    if not snaps: return None
    p = snaps[-1]
    m = re.search(r"emb_step(\d+)\.pt", p.name)
    step = int(m.group(1)) if m else None
    ts = p.stat().st_mtime
    return step, ts, p

def last_ckpt():
    cands = [p for p in SENTRY.glob("checkpoint-*") if p.is_dir()]
    if not cands: return None
    def step_of(p):
        m = re.search(r"checkpoint-(\d+)", p.name)
        return int(m.group(1)) if m else (-1 if p.name!="checkpoint-final" else 10**9)
    p = max(cands, key=step_of)
    return step_of(p), p.stat().st_mtime, p

def fmt_eta(sec):
    if sec is None or sec == float("inf"): return "—"
    if sec < 90: return f"{int(sec)}s"
    if sec < 3600: return f"{int(sec//60)}m"
    return f"{sec/3600:.1f}h"

print("[HB] watching:")
print("  snaps :", SNAPS)
print("  sentry:", SENTRY)

ema = None
alpha = 0.3  # EMA smoothing

prev = last_snapshot()
if prev:
    print(f"[INIT] last snapshot: step {prev[0]} at {time.ctime(prev[1])}")
else:
    print("[INIT] no snapshots yet; first expected at step 0800")

while True:
    cur = last_snapshot()
    ck  = last_ckpt()
    now = time.time()

    if cur and prev and cur[0] > prev[0]:
        d_steps = cur[0] - prev[0]
        d_time  = cur[1] - prev[1]
        sps = d_time / max(d_steps, 1)
        ema = sps if ema is None else (alpha*sps + (1-alpha)*ema)

        # ETA to upcoming milestones
        nxt = [m for m in MILESTONES if m > cur[0]]
        etas = {m: fmt_eta(ema*(m - cur[0])) for m in nxt[:5]}

        print(f"[{cur[0]:4d}] Δ{d_steps} in {d_time:.1f}s → {sps:.2f}s/step | EMA {ema:.2f}s/step")
        if ck and ck[0] >= 800:
            print(f"      last ckpt: {ck[2].name} ({int(now-ck[1])}s ago)")
        print("      next:", ", ".join([f"{m}:{etas[m]}" for m in nxt[:5]]) if nxt else "done")

        prev = cur

    elif cur and not prev:
        prev = cur

    else:
        # still waiting for first new artifact
        if cur:
            print(f"[{cur[0]:4d}] waiting for next snapshot… ({int(now-cur[1])}s since last)")
        else:
            print("[WAIT] no snapshots yet (pre-0800)")

    time.sleep(15)

[HB] watching:
  snaps : /content/drive/MyDrive/wake2vec/emb_snaps/t4_1762376560
  sentry: /content/drive/MyDrive/wake2vec/sentry_backups/t4_1762376560
[INIT] last snapshot: step 800 at Sat Nov 15 21:55:50 2025
[ 800] waiting for next snapshot… (1276s since last)


KeyboardInterrupt: 

In [None]:
# HB one-shot
from pathlib import Path
import re, time, json

RUN_ID = "t4_1762376560"
ROOT   = Path("/content/drive/MyDrive/wake2vec")
SNAPS  = ROOT / "emb_snaps" / RUN_ID
SENTRY = ROOT / "sentry_backups" / RUN_ID
MILESTONES = [800, 825, 850, 900, 975, 1000, 1050, 1100, 1200, 1300]

def last_snapshot():
    snaps = sorted(SNAPS.glob("emb_step*.pt"))
    if not snaps: return None
    p = snaps[-1]
    m = re.search(r"emb_step(\d+)\.pt", p.name)
    step = int(m.group(1)) if m else None
    ts = p.stat().st_mtime
    return step, ts, p

def last_ckpt():
    cands = [p for p in SENTRY.glob("checkpoint-*") if p.is_dir()]
    if not cands: return None
    def step_of(p):
        m = re.search(r"checkpoint-(\d+)", p.name)
        return int(m.group(1)) if m else (-1 if p.name!="checkpoint-final" else 10**9)
    p = max(cands, key=step_of)
    return step_of(p), p.stat().st_mtime, p

cur = last_snapshot()
ck  = last_ckpt()

print("[SNAP]", cur[2].name if cur else "none", "| age:", f"{int(time.time()-cur[1])}s" if cur else "n/a")
print("[CKPT]", ck[2].name if ck else "none",  "| age:", f"{int(time.time()-ck[1])}s" if ck else "n/a")

EMA_FILE = SNAPS.parent / "speed_ema.json"
def load_ema():
    try:
        return json.loads(EMA_FILE.read_text())["ema_s_per_step"]
    except Exception:
        return None

ema = load_ema()
if cur and ema:
    nxt = [m for m in MILESTONES if m > cur[0]][:2]
    for m in nxt:
        eta = ema*(m - cur[0])
        print(f"[ETA] → {m}: ~{int(eta//60)}m {int(eta%60)}s")
else:
    print("[ETA] need at least two snapshots (e.g., 750→800) to estimate speed.")

[SNAP] emb_step0800.pt | age: 1172s
[CKPT] checkpoint-750-rebuilt | age: 511827s
[ETA] need at least two snapshots (e.g., 750→800) to estimate speed.
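
The one-shot cell reads `speed_ema.json`, but none of the cells shown here writes that file, which is consistent with the `[ETA]` fallback message above. One way to close the loop is sketched below, under the assumption that the watch loop would call it after each new snapshot. The key `ema_s_per_step` and `alpha = 0.3` come from the cells above; `update_ema` and `save_ema` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Optional

def update_ema(prev_ema: Optional[float], d_steps: int, d_time: float, alpha: float = 0.3) -> float:
    """Blend the newest seconds-per-step measurement into the running EMA."""
    s_per_step = d_time / max(d_steps, 1)
    return s_per_step if prev_ema is None else alpha * s_per_step + (1 - alpha) * prev_ema

def save_ema(path: Path, ema: float) -> None:
    """Persist the EMA under the key that the one-shot cell expects."""
    path.write_text(json.dumps({"ema_s_per_step": ema}))

def load_ema(path: Path) -> Optional[float]:
    """Read the EMA back; return None if the file is missing or malformed."""
    try:
        return json.loads(path.read_text())["ema_s_per_step"]
    except (OSError, ValueError, KeyError):
        return None
```

Snapshot mtimes that straddle a paused or restarted session would inflate the estimate, so it seems safer to feed `update_ema` only deltas measured within a single live session.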


for <750

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

from pathlib import Path
import time

RUN_ID = "t4_1762376560"
SENTRY = Path("/content/drive/MyDrive/wake2vec/sentry_backups/t4_1762376560")

print("[HEARTBEAT MONITOR]")
print(f"Watching: {SENTRY}\n")

def has_weights(ck):
    return (ck/"model.safetensors").exists() or (ck/"pytorch_model.bin").exists()

def step_of(p):
    # 'checkpoint-700' -> 700, 'checkpoint-750-rebuilt' -> 750, no numeric step -> -1
    part = p.name.split("-")[1]
    return int(part) if part.isdigit() else -1

last_seen = -1

try:
    while True:
        checkpoints = sorted([p for p in SENTRY.glob("checkpoint-*") if p.is_dir() and has_weights(p)],
                             key=step_of)

        if checkpoints:
            latest = checkpoints[-1]
            step = step_of(latest)

            if step > last_seen:
                mtime = time.ctime(latest.stat().st_mtime)
                print(f"[{time.strftime('%H:%M:%S')}] {latest.name} (saved: {mtime})")
                last_seen = step

                remaining = 1300 - step
                print(f"  Progress: {step}/1300 ({remaining} steps remaining)\n")
        else:
            print(f"[{time.strftime('%H:%M:%S')}] No checkpoints in sentry yet...")

        time.sleep(600)

except KeyboardInterrupt:
    print(f"\n[STOPPED] Last checkpoint: {last_seen}")