feat(ops): daily DB backup schedule, idempotent install via session-start hook #39

Merged
metazen11 merged 1 commit into feat/v2-finetune-data-pipeline from feat/v2-daily-backup
May 14, 2026

@metazen11
Owner

Adds a daily pg_dump schedule. scripts/backup.sh (with rotation logic) was already in the repo; this PR wires it into a macOS launchd job that fires at 03:14 daily and adds idempotent auto-install via the session-start hook.

Not tied to a particular issue — surfaced while taking a manual safety backup before #28 work and noticing that the backup script existed but had never been scheduled.

Summary

  • scripts/install_backup_schedule.sh — installer with --check + --uninstall. Renders the launchd plist with absolute paths; bootstraps the job.
  • scripts/com.metazen.agent-memory-backup.plist — launchd template (StartCalendarInterval = 03:14 daily; Hour: 3 / Minute: 14 — off the :00 mark per scheduling guidance).
  • hooks/ensure-services.js — new ensureBackupSchedule() runs at end of Main. Idempotent: re-installs only when the template is newer than the installed plist OR the target is missing. macOS-only. Failures debug-logged and swallowed — never blocks session start.
  • scripts/backup.sh — bug fixes (see below).
  • docs/backups.md — operator reference.
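
The idempotency rule described for ensureBackupSchedule() (re-install only when the target is missing or the template is newer) can be sketched in shell as below. This is an illustration only: the real check lives in hooks/ensure-services.js, and the function name and arguments here are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the code in hooks/ensure-services.js.
# Returns 0 when the launchd job should be (re)installed, 1 otherwise.
needs_install() {
  local template="$1" installed="$2"

  # Installed plist missing: first-time install.
  [ ! -f "$installed" ] && return 0

  # Template modified more recently than the installed copy: refresh it.
  [ "$template" -nt "$installed" ] && return 0

  # Installed copy is current: nothing to do.
  return 1
}
```

A caller would presumably gate this on macOS (for example by checking that uname -s prints Darwin) and ignore any failure, matching the "never blocks session start" behaviour described above.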

scripts/backup.sh bug fixes

The pre-existing script hard-coded user=agentmem and didn't carry a password, which silently failed against the real dev setup (which uses DATABASE_URL from .env).

Auth resolution order is now:

  1. DATABASE_URL (preferred — matches the running FastAPI server's DSN).
  2. POSTGRES_USER + PGPASSWORD env var fallback.
  3. Defaults: user agentmem, database agent_memory, port 5432.
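
A minimal sketch of that resolution order, assuming DATABASE_URL is a plain postgresql:// URI that pg_dump accepts through --dbname. The helper name is hypothetical, and reading the defaults as user / database / port is an assumption; this is not the code in scripts/backup.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the pg_dump connection arguments, one per line.
resolve_pg_args() {
  if [ -n "${DATABASE_URL:-}" ]; then
    # 1. Preferred: the full DSN. pg_dump accepts a connection URI via --dbname.
    printf '%s\n' "--dbname=${DATABASE_URL}"
  else
    # 2. POSTGRES_USER if set. PGPASSWORD needs no flag here: libpq reads it
    #    from the environment on its own.
    # 3. Otherwise the defaults (assumed mapping: user / database / port).
    printf '%s\n' \
      "--username=${POSTGRES_USER:-agentmem}" \
      "--port=5432" \
      "--dbname=agent_memory"
  fi
}
```

Printing one argument per line keeps the helper easy to test; the caller decides how to pass the lines to pg_dump.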

Plus a min-size guard: if the produced file is < 1 KB the script deletes it and exits 1. Catches silent auth-failure cases that would otherwise leave a useless 20-byte .gz in data/backups/.
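
The guard could look roughly like the sketch below. The function name is hypothetical and "1 KB" is read as 1024 bytes, which is an assumption. For context, gzip of empty input is 20 bytes, which is consistent with the failed-auth file size mentioned above if the dump is piped straight into gzip (also an assumption about the script).

```shell
#!/usr/bin/env bash
# Hypothetical guard, not the code in scripts/backup.sh.
MIN_BYTES=1024  # "1 KB" interpreted as 1024 bytes (assumption)

# Deletes the file and returns 1 when it is smaller than MIN_BYTES.
check_min_size() {
  local file="$1" size
  # wc -c behaves the same on macOS and Linux; tr strips macOS's padding.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$MIN_BYTES" ]; then
    echo "backup too small (${size} bytes), removing: $file" >&2
    rm -f -- "$file"
    return 1
  fi
  return 0
}
```

The script would then call it right after the dump, along the lines of check_min_size "$out" || exit 1.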

Test plan

  • bash scripts/install_backup_schedule.sh installs the plist and bootstraps the job.
  • bash scripts/install_backup_schedule.sh --check reports plist installed + job loaded.
  • launchctl list | grep com.metazen.agent-memory-backup shows the loaded job.
  • bash scripts/backup.sh produces a valid 319 MB gzipped dump.
  • Rotation keeps only the 3 most recent daily_*.sql.gz files; manual snapshots with other prefixes are preserved.
  • Min-size guard removes a 20-byte failed-auth file from earlier testing.
  • hooks/ensure-services.js passes a node -c syntax check (the hook already runs in production every session).
  • Cannot test in CI: the launchd 03:14 daily fire happens out-of-process. Will be verified by checking data/backups/ after tomorrow's run.

Verification commands

bash scripts/install_backup_schedule.sh --check
ls -lht data/backups/
launchctl list | grep com.metazen.agent-memory-backup
tail -f ~/Library/Logs/agent-memory-backup.log

Rollback

bash scripts/install_backup_schedule.sh --uninstall

Reverting the commit also reverts the ensure-services.js hook, so subsequent sessions won't re-install the job. Existing dumps in data/backups/ stay; data is never destroyed.

Out of scope

  • Linux cron path is sketched in the installer (crontab -e fallback) but not exercised on a real Linux host; needs verification when the second-Mac / Linux deploy lands (issue #14, "ops: move agentMemory off Dropbox to local SSD").
  • Off-machine backup destination (S3/B2) — out of scope. Local-only.
  • Encryption of dumps at rest — out of scope. Dumps may contain redacted-but-not-encrypted data.

Why not in #28 instead

This is operational infra, independent of the v2 fine-tune pipeline. Wanted a safety backup before #28's bulk import (which it now has — see data/backups/pre_v2_backfill_20260513_211653.sql.gz), and discovered the schedule was missing. Decoupling so the backup fix can ship independently.

Wires the existing scripts/backup.sh into a macOS launchd job that fires
daily at 03:14 local time. Retention: 3 most recent daily_*.sql.gz files
(rotation already in backup.sh). Manual snapshots with other prefixes
(e.g. pre_v2_backfill_*) are preserved.
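
For illustration, the retention rule above can be sketched as follows. The function name is hypothetical, and ordering by modification time is an assumption; the rotation already in backup.sh may order files differently.

```shell
#!/usr/bin/env bash
# Hypothetical rotation helper, not the code in scripts/backup.sh.
# Keeps the $keep newest daily_*.sql.gz files in $dir and deletes the rest.
# Files with any other prefix never match the glob, so they are left alone.
rotate_daily() {
  local dir="$1" keep="${2:-3}" old
  # ls -t lists newest first; tail skips the first $keep entries.
  # Assumes file names contain no newlines (timestamped names do not).
  ls -t "$dir"/daily_*.sql.gz 2>/dev/null |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}
```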

What this adds
--------------
* scripts/com.metazen.agent-memory-backup.plist — launchd template with
  __PROJECT_DIR__ and __HOME__ placeholders.
* scripts/install_backup_schedule.sh — renders the template, copies it
  to ~/Library/LaunchAgents/, bootstraps the job. Supports --check and
  --uninstall. Falls back to crontab on non-macOS.
* hooks/ensure-services.js — new ensureBackupSchedule() called at the
  end of Main after services come up. Idempotent: re-installs only when
  the template is newer than the installed plist or the target is
  missing. macOS only. Never fails session start (debug-log + swallow).
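
A rough sketch of the template-rendering step, assuming plain sed substitution of the two placeholders. The function name is hypothetical and the installer may render the template differently.

```shell
#!/usr/bin/env bash
# Hypothetical renderer, not the code in scripts/install_backup_schedule.sh.
# Prints the template with both placeholders replaced by absolute paths.
render_plist() {
  local template="$1" project_dir="$2" home_dir="$3"
  # '|' is the sed delimiter so the slashes in paths need no escaping.
  # Paths containing '|', '&' or backslashes would need escaping first.
  sed -e "s|__PROJECT_DIR__|${project_dir}|g" \
      -e "s|__HOME__|${home_dir}|g" \
      "$template"
}
```

The rendered output would then be written under ~/Library/LaunchAgents/ before the job is bootstrapped.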

scripts/backup.sh — bug fixes
-----------------------------
Pre-existing script hard-coded user=agentmem with no password support,
which fails against the actual dev setup that uses DATABASE_URL. Now:
- Prefers DATABASE_URL (matches the FastAPI server's DSN).
- Falls back to POSTGRES_USER + PGPASSWORD env when DATABASE_URL absent.
- Refuses to keep a backup file < 1 KB (catches silent auth failures
  that would otherwise produce a near-empty .gz).

Verified
--------
* install_backup_schedule.sh --check reports plist installed + job loaded.
* backup.sh produced a 319 MB gzipped dump and rotated correctly.
* launchctl list confirms com.metazen.agent-memory-backup is scheduled.

Docs
----
* docs/backups.md (new) — operator reference: setup, verification,
  restore, manual run, disabling, non-macOS fallback.
* README.md, handoff.md — short pointers to docs/backups.md.
@metazen11 metazen11 merged commit 925391e into feat/v2-finetune-data-pipeline May 14, 2026
@metazen11 metazen11 deleted the feat/v2-daily-backup branch May 14, 2026 04:31
metazen11 added a commit that referenced this pull request May 14, 2026
…omplete (#43)

* HANDOFF.md: replaces stale 'next session = backfill' section with
  current state — all 7 v2 data-pipeline sub-issues closed (PRs
  #34/#35/#36/#37/#38/#39/#40/#41/#42 merged). Sole remaining v2 work
  is #33 (retrain). Includes the actual training procedure to run.
  Live DB stats (28,599 backfilled tool_calls, 100% linked) and v2
  dataset stats (23,983 train rows) captured for the next session.

* README.md: adds 'V2 Tool-Call Dataset' subsection under Fine-Tuning
  Dataset Exports — documents data/processed/qwen25_tools/v2/ shape,
  build command, source-of-truth tables, and link to the plan doc.

* docs/fine_tune/V2_DATA_PIPELINE_PLAN.md: per-step checklist now
  reflects merged PR status. #31 explicitly marked deferred (not on
  training critical path). #36 (project consolidation) and the daily
  backup work flagged as bonus items from the data audit.

Co-authored-by: MZ <mz@wfca.com>