fix(update): avoid OOM during large SQLite snapshot imports #65

Merged
steipete merged 2 commits into openclaw:main from hxy91819:codebuddy/oom-import-memory
May 13, 2026
Conversation

@hxy91819 (Member) commented May 13, 2026

Summary

  • Problem: large Git snapshot imports can exceed memory on small hosts during SQLite import / FTS rebuild.
  • Why it matters: a reported 7.3GB local archive on a ~7.4GB RAM server could not reliably complete discrawl update; the process was killed under memory pressure.
  • What changed: snapshot imports now use file-backed SQLite temporary storage and a smaller page cache (32 MiB instead of 256 MiB), while preserving WAL crash recovery settings.
  • What did NOT change (scope boundary): no CLI behavior changed; no discrawl update flags were added; no git wrapper / pull / checkout behavior was changed.
  • Update behavior: ImportIfChanged skips already-imported manifests, uses incremental import when a previous manifest supports it, and falls back to full import only for first imports or unsupported snapshot shape changes. This PR mainly affects full imports and FTS-rebuild imports.
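The PRAGMA change described above can be sketched as follows. This is a minimal illustration, not the actual internal/share code: the function name and grouping are assumptions, and the exact synchronous level is an assumption too (the PR only states it stays non-zero).

```go
package main

import "fmt"

// importPragmas sketches the import-time SQLite settings before and
// after this PR. In SQLite, a negative cache_size means KiB, so
// -262144 is ~256 MiB and -32768 is 32 MiB.
func importPragmas(afterFix bool) []string {
	common := []string{
		"PRAGMA journal_mode = WAL",   // WAL crash recovery is preserved
		"PRAGMA synchronous = NORMAL", // assumed level; the PR only requires non-zero
	}
	if afterFix {
		return append(common,
			"PRAGMA temp_store = FILE",   // file-backed temporary storage
			"PRAGMA cache_size = -32768", // 32 MiB page cache
		)
	}
	return append(common,
		"PRAGMA temp_store = MEMORY",  // in-memory temporary structures
		"PRAGMA cache_size = -262144", // ~256 MiB page cache
	)
}

func main() {
	for _, p := range importPragmas(true) {
		fmt.Println(p)
	}
}
```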

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes N/A
  • Related N/A
  • This PR fixes a bug or regression

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: memory pressure / OOM during SQLite snapshot import and FTS rebuild.
  • Real environment tested:
    • Host: Linux x86_64, 32 vCPU, 62 GiB RAM (nproc = 32; free -h reports 62Gi total memory)
    • Repo under test: /data/code/openclaw/discrawl-oom-import-memory
    • Real snapshot repo: /data/code/openclaw/discord-store, du -sh reports 7.2G
    • Snapshot import row count observed from manifest/progress: 2,305,374
  • Exact steps or command run after this patch:
cd /data/code/openclaw/discrawl-oom-import-memory
/usr/bin/time -v env \
  GOTOOLCHAIN=auto \
  DISCRAWL_REAL_REPO=/data/code/openclaw/discord-store \
  go test ./internal/share -run TestImportRealSnapshot -count=1 -timeout=90m -v
  • Evidence after fix:
=== RUN   TestImportRealSnapshot
    import_memory_test.go:36: import progress phase=start total_rows=2305374
    import_memory_test.go:36: import progress phase=rebuild_fts total_rows=0
    import_memory_test.go:36: import progress phase=done total_rows=2305374
--- PASS: TestImportRealSnapshot (576.16s)
PASS
ok  	github.com/openclaw/discrawl/internal/share	576.169s
Command being timed: "env GOTOOLCHAIN=auto DISCRAWL_REAL_REPO=/data/code/openclaw/discord-store go test ./internal/share -run TestImportRealSnapshot -count=1 -timeout=90m -v"
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:36.64
Maximum resident set size (kbytes): 357076
Exit status: 0
  • Observed result after fix: full real snapshot import completed, FTS rebuild completed, and the test verified messages count equals message_fts count.
  • What was not tested: the git wrapper / discrawl update pull failure was intentionally out of scope for this PR.
  • Before evidence:
docker run --rm \
  --memory=768m --memory-swap=768m \
  -v /usr/local/go:/usr/local/go:ro \
  -v /root/go:/root/go \
  -v /root/.cache/go-build:/root/.cache/go-build \
  -v /data/code/openclaw/discrawl-oom-import-memory:/src \
  -w /src \
  -e PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  -e GOTOOLCHAIN=auto \
  -e DISCRAWL_OOM_REGRESSION=1 \
  -e DISCRAWL_OOM_ROWS=80000 \
  -e DISCRAWL_OOM_TEXT_BYTES=2048 \
  openclaw-e2e-systemd-node:latest \
  go test ./internal/share -run TestImportMemoryBounded -count=1 -timeout=30m -v

Pre-fix output after the synthetic snapshot was built and import started:

=== RUN   TestImportMemoryBounded
    import_memory_test.go:29: building synthetic snapshot rows=80000 text_bytes=2048
    import_memory_test.go:31: synthetic snapshot built; starting import
signal: killed
FAIL	github.com/openclaw/discrawl/internal/share	25.197s
FAIL

Post-fix, the exact same memory-limited Docker command passes:

=== RUN   TestImportMemoryBounded
    import_memory_test.go:29: building synthetic snapshot rows=80000 text_bytes=2048
    import_memory_test.go:31: synthetic snapshot built; starting import
--- PASS: TestImportMemoryBounded (55.45s)
PASS
ok  	github.com/openclaw/discrawl/internal/share	55.453s

Performance comparison on the same high-memory host (32 vCPU / 62 GiB RAM) and real 7.2G snapshot (2,305,374 rows), both measured with /usr/bin/time -v:

Baseline main (temp_store=memory, cache_size=-262144):
  Elapsed: 9:35.14
  Max RSS: 1,606,248 KB

This PR (temp_store=file, cache_size=-32768):
  Elapsed: 9:36.64
  Max RSS:   357,076 KB

Observed trade-off in this 32 vCPU / 62 GiB RAM run: wall time increased by ~1.5s (~0.3%) while Max RSS dropped by ~1.25GB (~78%). The speed trade-off may differ on slower disks or smaller hosts, but this high-memory baseline did not show a meaningful slowdown.

Root Cause (if applicable)

  • Root cause: applyImportPragmas forced pragma temp_store = memory and set pragma cache_size = -262144 (~256 MiB) for imports that can touch most of the archive and rebuild FTS indexes. On memory-constrained hosts, SQLite temporary structures plus page cache plus Go process memory can exceed available RAM.
  • Missing detection / guardrail: no memory-limited import regression existed; default unit tests used small fixtures and did not exercise large full import + FTS rebuild under cgroup limits.
  • Contributing context (if known): the JSONL gzip import path itself is streaming; the higher-risk phase is SQLite import/FTS rebuild configuration rather than reading the whole snapshot into Go memory.
  • Historical context: these import PRAGMAs were introduced in 9e2fd991 (perf: speed up git snapshot imports) as an import-speed optimization. That original change also disabled journaling. A later hardening change, 0487ccc1 (fix: harden discrawl archive imports), restored WAL / synchronous safety but left temp_store=memory and the 256 MiB cache unchanged. This PR continues that reliability direction by bounding memory while preserving the measured import throughput.
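For readers unfamiliar with SQLite's cache_size convention: a positive value is a page count, while a negative value -N requests roughly N KiB of cache. The two settings in the root cause translate as follows (a standalone arithmetic sketch, not project code):

```go
package main

import "fmt"

// cacheMiB converts a negative SQLite cache_size value (KiB units)
// to MiB. Positive values (page counts) are out of scope here.
func cacheMiB(cacheSize int) int {
	if cacheSize >= 0 {
		return 0 // page-count form, not handled in this sketch
	}
	return -cacheSize / 1024
}

func main() {
	fmt.Println(cacheMiB(-262144)) // baseline main: 256 (MiB)
	fmt.Println(cacheMiB(-32768))  // this PR: 32 (MiB)
}
```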

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • internal/share/share_test.go: lightweight PRAGMA regression.
    • internal/share/import_memory_test.go: opt-in synthetic OOM regression and opt-in real snapshot validation.
  • Scenario the test should lock in:
    • imports use file-backed temp storage and bounded cache;
    • a synthetic large snapshot import + FTS rebuild completes under Docker memory limits after the fix;
    • a real snapshot import can be validated by maintainers with DISCRAWL_REAL_REPO=/path/to/store.
  • Why this is the smallest reliable guardrail: the default test checks the exact SQLite PRAGMA settings quickly, while the heavier OOM reproduction is opt-in so CI does not become slow or resource-dependent.
  • Existing test that already covers this (if any): none for memory-limited large import.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

None. No CLI flags, config fields, git behavior, or output format changed.

Diagram (if applicable)

Before:
snapshot import -> SQLite temp_store=memory + 256 MiB cache -> full import/FTS rebuild -> high RSS / possible OOM

After:
snapshot import -> SQLite temp_store=file + 32 MiB cache -> full import/FTS rebuild -> bounded memory, same imported data

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux x86_64
  • Host resources for real-data benchmark: 32 vCPU, 62 GiB RAM
  • Runtime/container: Docker --memory=768m --memory-swap=768m for synthetic OOM repro; host Go test for real snapshot validation.
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): snapshot repo path only; no tokens/secrets used.

Steps

  1. Run the synthetic memory regression in a constrained Docker container with DISCRAWL_OOM_REGRESSION=1.
  2. Confirm pre-fix behavior is signal: killed after import starts.
  3. Apply this patch.
  4. Run the same Docker command again and confirm it passes.
  5. Optionally validate a real snapshot with:
DISCRAWL_REAL_REPO=/path/to/discord-store \
  go test ./internal/share -run TestImportRealSnapshot -count=1 -timeout=90m -v

Expected

  • Synthetic large import completes under the configured memory limit after the fix.
  • Real snapshot import completes and message_fts row count matches messages row count.

Actual

  • Pre-fix synthetic constrained run: signal: killed.
  • Post-fix synthetic constrained run: PASS.
  • Post-fix real snapshot run: PASS, Max RSS 357,076 KB, 2,305,374 rows.
  • Same-host real snapshot comparison: baseline 9:35.14 / 1,606,248 KB; this PR 9:36.64 / 357,076 KB.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What I personally verified (not just CI), and how:

  • Verified scenarios:
    • pre-fix synthetic import under Docker 768 MiB was killed;
    • post-fix synthetic import under the same Docker limit passed;
    • post-fix real /data/code/openclaw/discord-store import passed;
    • compared baseline vs this PR on the same real snapshot and host;
    • GOTOOLCHAIN=auto go test ./internal/share passed.
  • Edge cases checked:
    • crash recovery settings remain enabled (journal_mode is not off; synchronous is non-zero);
    • opt-in heavy tests skip by default unless env vars are set.
  • What I did not verify:
    • discrawl update git wrapper behavior; intentionally out of scope.
    • Windows-specific import behavior.
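The crash-recovery edge case above can be stated as a tiny predicate, sketching the invariant the lightweight PRAGMA regression locks in (the helper name is invented here; the real check lives in internal/share/share_test.go):

```go
package main

import "fmt"

// crashRecoveryKept reports whether the import PRAGMAs still provide
// crash recovery: journal_mode must not be "off" and synchronous must
// be non-zero, matching the hardening restored in 0487ccc1.
func crashRecoveryKept(journalMode string, synchronous int) bool {
	return journalMode != "off" && synchronous != 0
}

func main() {
	fmt.Println(crashRecoveryKept("wal", 1)) // settings kept by this PR
	fmt.Println(crashRecoveryKept("off", 0)) // the pre-hardening state
}
```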

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

No bot review conversations have been addressed yet.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No user config changes. New env vars are test-only: DISCRAWL_OOM_REGRESSION, DISCRAWL_OOM_ROWS, DISCRAWL_OOM_TEXT_BYTES, DISCRAWL_REAL_REPO.
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: file-backed SQLite temporary storage may be slower than memory temp storage on large imports.
    • Mitigation: this only affects snapshot import/rebuild phases; it trades memory headroom for possible speed impact. On the measured 7.2G real snapshot using a 32 vCPU / 62 GiB RAM host, wall time changed from 9:35.14 to 9:36.64 (~+0.3%) while Max RSS dropped from 1,606,248 KB to 357,076 KB.
  • Risk: file-backed temporary storage can increase temporary disk writes during full imports / FTS rebuilds, which may matter on slow disks, constrained ephemeral disks, or SSD write-budget-sensitive deployments.
    • Mitigation: the write amplification is limited to snapshot import/rebuild paths, not steady-state reads/searches. update is not always a full import: same manifests are skipped, supported manifest deltas use incremental import, and this PR mainly affects first imports or imports that must rebuild FTS. Operators running very large imports should keep enough temp disk space available.
  • Risk: smaller SQLite page cache may reduce import throughput.
    • Mitigation: the cache remains bounded at 32 MiB and the same-host real-data comparison showed negligible throughput impact for this snapshot.

@hxy91819 hxy91819 changed the title from "fix: bound SQLite import memory" to "fix(update): avoid OOM during large SQLite snapshot imports" on May 13, 2026
@hxy91819 (Member, Author)

Real Environment Validation (VPS)

Host: Ubuntu 24.04, Linux 6.8.0-71-generic (x64), 2 vCPU, 7.4GB RAM
Database: ~/.discrawl/discrawl.db — 7.3GB
Build: go build ./cmd/discrawl from branch codebuddy/oom-import-memory (v0.7.1)

Before fix (v0.7.0)

| Metric | Value |
| --- | --- |
| Peak RSS | 807MB (10.3% of 7.4GB) |
| Result | SIGKILL (OOM) |
| Runtime | >8min, never completed |
| DB write | None (killed before completion) |

After fix (v0.7.1)

| Metric | Value |
| --- | --- |
| Peak RSS (VmHWM) | 232MB (3.1% of 7.4GB) |
| VmPeak | 234MB |
| Result | Running, not killed |
| Runtime | >14min (killed manually to save time; was progressing normally) |
| DB write | 7.3GB → 7.4GB (normal incremental) |

• Memory reduction: 71% (807MB → 232MB)
• OOM resolved: the process was no longer SIGKILLed under memory pressure
• DB integrity: normal incremental writes observed
• Runtime is I/O bound on this low-spec VPS (7.4GB RAM); larger machines should be significantly faster.

@hxy91819 (Member, Author)

local codex review:

  • The functional change is limited to import-time SQLite pragma tuning (temp_store=file, smaller cache_size) plus tests that validate import behavior under memory pressure. I did not find a discrete, actionable regression introduced by this diff that would break existing behavior or correctness.

@hxy91819 hxy91819 marked this pull request as ready for review May 13, 2026 10:55
@hxy91819 hxy91819 force-pushed the codebuddy/oom-import-memory branch from cd4ebc2 to be8a42a on May 13, 2026 at 10:59
@steipete (Collaborator)

Verification before landing:

  • Local full gate on PR code: GOWORK=off go test ./... passed.
  • Focused package gate: go test ./internal/share passed.
  • Reduced opt-in synthetic import path: DISCRAWL_OOM_REGRESSION=1 DISCRAWL_OOM_ROWS=50 DISCRAWL_OOM_TEXT_BYTES=256 GOWORK=off go test ./internal/share -run TestImportMemoryBounded -count=1 -v passed.
  • Live real snapshot import: /usr/bin/time -l env DISCRAWL_REAL_REPO=/Users/steipete/.discrawl/share GOWORK=off go test ./internal/share -run TestImportRealSnapshot -count=1 -timeout=90m -v passed against a 7.6G local share repo, importing 2,301,330 rows into a temp DB and rebuilding FTS. Runtime 180.45s; max resident set size 391,086,080 bytes.
  • GitHub checks on refreshed SHA c91e462: ci 25803359920, CodeQL 25803360121, Security Gate: Secret Scanning 25803359995 all passed.

Known proof gap: I did not mutate the production archive DB; the live import test writes to a temp DB by design.

@steipete steipete merged commit 762c701 into openclaw:main May 13, 2026
8 checks passed
@steipete (Collaborator)

Landed via rebase onto main.

  • Source PR commits: be8a42a and c91e462
  • Landed commits: faec828 and 762c701
  • Verification: local full gate, focused import tests, live 7.6G real snapshot import, and refreshed GitHub checks all passed.

Thanks @hxy91819.
