
[codex] Add attachment media backup support#66

Merged
steipete merged 3 commits into main from codex/attachment-media-support
May 15, 2026

@steipete
Collaborator

Summary

This adds first-class Discord attachment media support to Discrawl. Attachments were already tracked as searchable metadata and small text snippets, but the binary media itself was only reachable through Discord CDN URLs. This change lets Discrawl cache attachment bytes locally, keep that cache verified by content hash, and include non-DM media in Git snapshot backups so subscribers can restore the same files offline.

What changed

  • Adds a schema v3 migration for attachment media metadata: relative media path, SHA-256, byte size, fetch time, fetch status, and fetch error.
  • Adds discrawl attachments to list attachment rows with filters for channel, author, message, filename, content type, time range, guild scope, DM scope, and missing cached media.
  • Adds discrawl attachments fetch to download attachment media into cache_dir/media, with missing-only fetches, forced refetch, byte limits, cache verification, long filename handling, and corrupt-cache repair.
  • Adds discrawl sync --with-media, backed by [sync] attachment_media and [sync] max_attachment_bytes, so sync can fetch media for the same Discord or wiretap scope it just updated.
  • Extends Git sharing so publish exports cached non-DM media into media/, and update or subscribe restore that media into the subscriber cache by default.
  • Adds --no-media for publish, update, and subscribe; subscribe --no-media persists [share] media = false.
  • Keeps @me DM media local-only. DM rows and media are excluded from shared snapshots, and attachments --dm / attachments fetch --dm skip Git share auto-update.
  • Preserves subscriber-local media metadata when importing shared snapshots that do not yet carry media metadata.
  • Documents the new command, config, storage layout, share behavior, and privacy boundary.

Design notes

The media logic lives in Discrawl rather than Crawlkit. Crawlkit remains the generic snapshot and mirror layer; the new behavior depends on Discord-specific attachment URLs, proxy URLs, message and guild scoping, the @me privacy boundary, and Discrawl's attachment table. If another crawler needs sidecar binary artifact support later, Crawlkit can grow a generic artifact API from this shape, but moving this first implementation there would force Discord semantics into the shared package too early.

Media files are content-addressed under cache_dir/media/attachments/<sha-prefix>/<sha>-<filename>. The database stores the relative path and hash, not the bytes. Git snapshots copy only verified regular files and reject symlink escapes during import. Export also rejects overlapping repo/cache media roots so a misconfigured share path cannot delete or overwrite the local cache.

Validation

  • GOWORK=off go test ./...
  • git diff --check
  • node scripts/build-docs-site.mjs
  • codex-review --parallel-tests "GOWORK=off go test ./..."

The nested review sandbox could not bind httptest ports during its own direct media package run, but the helper's parallel test invocation and the local full suite both passed outside that sandbox limitation.


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a6054f039

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/media/cache.go
Comment on lines +233 to +234
if err := os.Rename(tmpPath, target); err != nil {
_ = os.Remove(tmpPath)

P1 Badge Remove existing media file before renaming temp download

fetchURL can enter the write path when a cached attachment already exists but is corrupt (or when --force is used), yet it calls os.Rename(tmpPath, target) without deleting an existing regular target first. On Windows, rename-to-existing fails, so forced refetch and corruption repair will consistently fail with a fetch error instead of updating the cache.


Comment thread internal/share/share.go
Comment on lines +1140 to +1141
if err := os.Rename(tmpPath, target); err != nil {
_ = os.Remove(tmpPath)

P2 Badge Replace existing cache target before renaming copied media

copyFile uses the same temp-file rename pattern and also renames onto target without removing an existing file. importMedia calls this helper when a local cached media file exists but hash-mismatches, so on Windows update/subscribe media restoration will fail whenever a cached file needs replacement.


steipete merged commit 724adb0 into main May 15, 2026
12 checks passed
steipete deleted the codex/attachment-media-support branch May 15, 2026 09:16

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93ec2463e0


Comment on lines +231 to +235
List: store.AttachmentListOptions{
GuildIDs: opts.GuildIDs,
ExcludeGuildIDs: []string{store.DirectMessageGuildID},
ChannelIDs: channelIDs,
Since: opts.Since,

P1 Badge Limit media fetch to the latest-only sync window

When sync runs in its default latest-only mode (LatestOnly true with no --since), fetchSyncMedia passes only opts.Since into AttachmentListOptions, which is zero in that mode, so media.Fetch scans all attachments in the targeted guilds instead of the latest delta. Because media.Fetch verifies reusable files by hashing them, enabling --with-media (or [sync].attachment_media=true) can turn routine syncs into full-archive media verification/download passes on every run, causing severe performance regressions and unexpected backlog fetching.

