Reader view images break when source pages go offline or block hotlinking #2712

anpryl · 2026-04-16T07:36:14Z

❓ Question for the maintainer (please answer first)

Before diving into a detailed design discussion, I want to check: is this feature still of interest to you?

#1563 is labelled status/approved, and the separate-worker / 50-image-cap direction originally came from your own comment on that issue (see decision #1 below). But several months have passed, priorities shift, and I'd rather not burn your review time on a deep back-and-forth if the answer is "not right now" or "the approach has changed."

If you reply "yes, let's discuss", I'll engage seriously with each of the design decisions and trade-offs below, address review feedback, and drive this toward a mergeable state. The rest of this document is supporting context for that conversation, not a request for a full review yet.

Problem

Karakeep's reader view extracts readable HTML from a page and stores it, but the images inside that HTML remain as external <img src="https://origin/…"> references. Two user-visible symptoms follow from this:

Source goes offline — the article text is still there, but images are broken placeholders. For tutorials, step-by-step guides, and longform visual content, this often destroys the value of the archive.
Hotlink protection on the origin — even while the source is up, servers that check Referer reject requests coming from the Karakeep UI. Reader view renders with broken images even for "live" bookmarks.

The full-page archive path (SingleFile) already inlines images, but doing that for the default reader content is not viable — base64-embedding images in HTML that lives in the database would balloon storage and memory usage for every reader-view fetch.

The maintainer-approved approach (#1563): parse the extracted HTML, download referenced images as attached assets, and rewrite <img src> to point at the asset URLs.

Related work

Both issues below are currently open. "Closes" in the PR body is the GitHub keyword that auto-closes them when the PR merges — it is not the current state.

Addresses #1563, "[FR] Cache and store images in the extracted reader content" (status/approved, pri/medium). Will auto-close on merge.
Addresses #2363, "Feature request: Download and embed article images in Readview to bypass hotlink protection". Will auto-close on merge.

Our PR

#2595 — feat: cache reader content images for offline access (draft)

This is a proof-of-concept PR. It has been tested end-to-end on my self-hosted instance (Docker image built from this branch, reader view verified with real bookmarks). It is not production-hardened — see "Trade-offs" and "Open questions" below for the gaps that would need to close before a production rollout.

21 commits, ~6,400 lines added. Introduces a new contentImageWorker, a CONTENT_IMAGE asset type, a DB column for per-bookmark status, admin UI integration, and per-bookmark debugger instrumentation.

If you want to pursue this direction, I will split this PR into smaller, review-sized pieces before asking for review — I don't expect you to review ~6,400 lines in a single branch. The current shape is a POC artifact to prove the end-to-end path on my instance; it isn't the shape I'd ask you to merge.

Solution overview

After a successful crawl, the crawler enqueues a ContentImageQueue job. A dedicated contentImageWorker:

1. Loads the stored reader HTML.
2. Extracts every external image reference, including <img src>, lazy-load attributes, srcset (picks the largest candidate), and SVG <image href> / xlink:href.
3. Downloads each image with browser-like headers (including Referer derived from the source page), validates format by magic bytes, and saves it as a CONTENT_IMAGE asset.
4. Rewrites each original URL in the HTML to /api/assets/{assetId}.
5. Persists the rewritten HTML back and records per-bookmark status (pending / success / failure) in a new bookmarkLinks.contentImageStatus column.
6. Cleans up stale content images no longer referenced in the re-crawled HTML.

Reader view then renders cached assets — images survive source outages and hotlink restrictions.
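To make the "picks the largest candidate" part of the extraction step concrete, here is a minimal sketch of a srcset picker. The names `parseSrcset` and `pickLargestSrcsetUrl` are hypothetical and not taken from the PR. It assumes candidates are comma-separated and use either `w` or `x` descriptors (not mixed), and it does not handle URLs that themselves contain commas.

```typescript
// Hypothetical helper, not the PR's code: choose the largest srcset candidate.
interface SrcsetCandidate {
  url: string;
  weight: number; // numeric part of the "480w" or "2x" descriptor
}

function parseSrcset(srcset: string): SrcsetCandidate[] {
  const candidates: SrcsetCandidate[] = [];
  for (const part of srcset.split(",")) {
    const tokens = part.trim().split(/\s+/);
    const url = tokens[0];
    if (!url) continue; // skip empty entries such as trailing commas
    // A candidate without a descriptor is treated as 1x.
    const descriptor = tokens[1] ?? "1x";
    const weight = Number.parseFloat(descriptor);
    candidates.push({ url, weight: Number.isFinite(weight) ? weight : 1 });
  }
  return candidates;
}

function pickLargestSrcsetUrl(srcset: string): string | null {
  let best: SrcsetCandidate | null = null;
  for (const candidate of parseSrcset(srcset)) {
    if (best === null || candidate.weight > best.weight) best = candidate;
  }
  return best ? best.url : null;
}
```

For `"a.jpg 480w, b.jpg 1200w, c.jpg 800w"` this returns `b.jpg`, so only one download is queued per responsive image.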

Key design decisions and why

1. Separate worker, not inline in the crawler

The crawler already has many failure modes (browser crash, timeout, quota). Folding image downloads into the same code path would let one slow or broken image stall or fail an otherwise successful crawl. A dedicated worker with its own queue isolates reliability domains and lets us tune retries independently.

Origin of this decision: proposed by @MohamedBassem on #1563 — "maybe the crawler can spin up a new job to do the image downloads to not degrade the reliability of the crawler worker." (comment)

2. Default OFF (`CRAWLER_STORE_CONTENT_IMAGES=false`)

Turning this on silently for existing installs would surprise self-hosters with extra bandwidth, storage, and queue pressure. Opt-in is the safe default; admins who want the feature can flip a single env var.

Open for discussion: CRAWLER_STORE_CONTENT_IMAGES is currently a server-wide env var — one switch for the whole instance. We may want to move this (and the budget knobs below) to per-user application-level settings so individual users can opt in without admin involvement. Not done in this POC.

3. Deterministic asset IDs — `SHA-256(bookmarkId:sourceUrl).slice(0,32)`

Non-deterministic IDs would make retries and re-crawls messy: a failed download retried later would either duplicate the asset or need out-of-band dedup logic. Deriving the ID from stable inputs gives us free idempotency — a retry either finds the asset already saved and skips, or writes to the same slot.
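A minimal sketch of the ID derivation in the heading, using Node's built-in `crypto` module. The function name is hypothetical, and taking the first 32 characters of the hex digest is an assumption about what `.slice(0,32)` operates on.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a stable asset ID from (bookmarkId, sourceUrl).
// Assumption: the slice is taken over the hex-encoded SHA-256 digest.
function contentImageAssetId(bookmarkId: string, sourceUrl: string): string {
  return createHash("sha256")
    .update(`${bookmarkId}:${sourceUrl}`)
    .digest("hex")
    .slice(0, 32);
}
```

Calling it twice with the same bookmark and URL yields the same 32-character ID, which is what lets a retry land on the same slot instead of creating a duplicate.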

4. Extended format support (SVG, AVIF, APNG) on top of JPEG/PNG/GIF/WebP

Real-world article content uses all of these. SVG for diagrams and logos is particularly common on technical blogs. We kept the global IMAGE_ASSET_TYPES unchanged and added a worker-scoped CONTENT_IMAGE_ASSET_TYPES set so the broader asset system isn't affected.

5. Magic bytes detection for misconfigured servers

Many origins return images with Content-Type: text/html or application/octet-stream. Trusting the header blindly would drop legitimate images. We sniff the first few bytes against known image signatures and fall back to that when the header is wrong.
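As an illustration only, signature sniffing with a header-first fallback could look like the sketch below. The function names are hypothetical, and the signature table covers only the well-known binary signatures (JPEG, PNG, GIF, WebP, AVIF). APNG shares the PNG signature and is reported as `image/png` here. SVG is text and has no magic bytes, so this sketch does not detect it.

```typescript
// Hypothetical helpers, not the PR's code.
function hasBytes(buf: Uint8Array, sig: number[], offset = 0): boolean {
  if (buf.length < offset + sig.length) return false;
  return sig.every((byte, i) => buf[offset + i] === byte);
}

function ascii(text: string): number[] {
  return Array.from(text, (ch) => ch.charCodeAt(0));
}

function sniffImageType(buf: Uint8Array): string | null {
  if (hasBytes(buf, [0xff, 0xd8, 0xff])) return "image/jpeg";
  if (hasBytes(buf, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (hasBytes(buf, ascii("GIF87a")) || hasBytes(buf, ascii("GIF89a"))) return "image/gif";
  // WebP: "RIFF" <4-byte size> "WEBP"
  if (hasBytes(buf, ascii("RIFF")) && hasBytes(buf, ascii("WEBP"), 8)) return "image/webp";
  // AVIF: ISO BMFF "ftyp" box with an avif/avis major brand
  if (
    hasBytes(buf, ascii("ftyp"), 4) &&
    (hasBytes(buf, ascii("avif"), 8) || hasBytes(buf, ascii("avis"), 8))
  ) {
    return "image/avif";
  }
  return null;
}

// Use the declared type when it is an image type, otherwise fall back to sniffing.
function resolveContentType(header: string | null, buf: Uint8Array): string | null {
  const declared = (header ?? "").split(";")[0]?.trim().toLowerCase() ?? "";
  if (declared.startsWith("image/")) return declared;
  return sniffImageType(buf);
}
```

With this shape, a PNG served as `text/html` resolves to `image/png`, while an actual HTML error page served as `application/octet-stream` resolves to `null` and can be rejected.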

6. Browser-like request headers with `Referer`

Chrome UA + Accept: image/* + a Referer set to the original page URL is the minimum needed to satisfy the most common hotlink checks (#2363). This alone recovers a meaningful fraction of images that would otherwise fail.
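A sketch of the header set described above. The function name and the exact User-Agent string are assumptions for illustration. Stripping the fragment and any credentials from the Referer mirrors what browsers do and is not something the PR is known to do.

```typescript
// Assumed UA string for illustration; the PR may use a different one.
const ASSUMED_CHROME_UA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";

// Hypothetical helper: build browser-like headers for an image download.
function buildImageRequestHeaders(pageUrl: string): Record<string, string> {
  const referer = new URL(pageUrl);
  // Browsers never send fragments or credentials in Referer.
  referer.hash = "";
  referer.username = "";
  referer.password = "";
  return {
    "User-Agent": ASSUMED_CHROME_UA,
    Accept: "image/*",
    Referer: referer.toString(),
  };
}
```

The returned object can be passed as the `headers` option of whatever HTTP client the worker uses.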

7. Per-image exponential backoff (up to 10 retries, 1s → 30s cap)

Image hosts and CDNs return transient 403/429/5xx responses far more often than permanent failures. Aggressive but bounded retries (~3 min worst case per image) strike a reasonable balance between resilience and bounded work per job.
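The "~3 min worst case" figure can be checked with a small sketch. It assumes the delay doubles on each retry starting from 1 s and is clamped at 30 s; the doubling factor and the function name are assumptions, since only the bounds are stated above.

```typescript
// Hypothetical schedule: delay before retry n is min(base * 2^n, cap).
function backoffDelaysMs(maxRetries = 10, baseMs = 1_000, capMs = 30_000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) =>
    Math.min(baseMs * 2 ** attempt, capMs),
  );
}

const delays = backoffDelaysMs();
// [1000, 2000, 4000, 8000, 16000, 30000, 30000, 30000, 30000, 30000]
const totalWaitMs = delays.reduce((sum, ms) => sum + ms, 0);
// 181000 ms, i.e. about 3 minutes of waiting, excluding request time
```

Under these assumptions the waits for a single image that exhausts all retries (181 s) are longer than the 120 s default job timeout listed under Configuration, so how the two limits interact may be worth spelling out.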

8. Sequential downloads per bookmark

Parallel downloads against a single origin trip rate limits fast, especially for image-heavy articles hosted on one CDN. Sequential keeps us under most per-host budgets, and the work is queued — total throughput is regulated by worker concurrency, not per-job fanout.

9. Stale asset cleanup on re-crawl — only after at least one new image succeeded

When a bookmark is re-crawled, the extracted HTML can change (site layout updates, article edits). Images no longer referenced would otherwise accumulate forever. We gate cleanup on "at least one new image cached successfully" so a transient total failure doesn't wipe the existing working set.
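The gating rule can be sketched as a pure function. The name and parameters below are hypothetical: `existingIds` are the content-image assets already attached to the bookmark, `referencedIds` are the ones the rewritten HTML still points at, and `newlyCachedCount` is how many images this run saved successfully.

```typescript
// Hypothetical helper: decide which content-image assets to delete after a re-crawl.
function selectStaleAssetIds(
  existingIds: string[],
  referencedIds: Set<string>,
  newlyCachedCount: number,
): string[] {
  // Gate: if nothing was cached in this run, treat it as a transient
  // failure and keep the existing working set untouched.
  if (newlyCachedCount === 0) return [];
  return existingIds.filter((id) => !referencedIds.has(id));
}
```

For example, with existing assets `a`, `b`, `c`, references to `a` and `c`, and one new success, only `b` is selected for deletion; with zero successes nothing is.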

10. Content images are system-managed, hidden from `AttachmentBox`

These are an implementation detail, not user uploads. Showing 30+ tiny assets per bookmark in the attachment UI would be noise. isAllowedToAttachAsset / isAllowedToDetachAsset both return false for CONTENT_IMAGE so users can't accidentally break reader view by deleting one.

11. Data-URI inlining for API-key clients

The web client authenticates asset requests via session cookies, so relative /api/assets/{id} URLs in <img src> just work. API-key clients — mobile app, CLI, browser extension, MCP server — cannot attach bearer tokens to plain <img> requests, so the same URLs render as broken images for them. When a request comes in via Bearer-token auth, the tRPC layer detects ctx.apiAuth and inlines small cached images as base64 data URIs directly into the returned HTML. Large images still flow through the normal asset endpoint.

POC choice. This was picked specifically to ship a working end-to-end experience without requiring coordinated changes across four client codebases (iOS/Android via React Native, CLI, browser extension, MCP server). It is not necessarily the right long-term design — data-URI payloads bloat HTML responses, duplicate bytes across fetches, and can't leverage HTTP caching. Ready to discuss alternatives before merging. No alternatives have been formally brainstormed yet; this note is the invitation to do so.
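For discussion purposes, the inlining step might look roughly like the sketch below. Everything named here is hypothetical: `inlineSmallAssets`, the `CachedAsset` shape, and the `maxInlineBytes` threshold (the text above only says "small"). The regex rewrites every `/api/assets/{id}` occurrence in the HTML string, which is simpler than attribute-aware rewriting and may differ from what the PR does.

```typescript
// Hypothetical shape of a cached content image.
interface CachedAsset {
  contentType: string;
  bytes: Uint8Array;
}

function toDataUri(asset: CachedAsset): string {
  return `data:${asset.contentType};base64,${Buffer.from(asset.bytes).toString("base64")}`;
}

// Replace /api/assets/{id} with a data URI when the asset is known and small enough.
function inlineSmallAssets(
  html: string,
  assets: Map<string, CachedAsset>,
  maxInlineBytes: number, // assumed threshold, not a value from the PR
): string {
  return html.replace(/\/api\/assets\/([A-Za-z0-9_-]+)/g, (match: string, id: string) => {
    const asset = assets.get(id);
    if (!asset || asset.bytes.length > maxInlineBytes) return match; // leave URL as-is
    return toDataUri(asset);
  });
}
```

The sketch also makes the cost visible: base64 grows each image by roughly a third, and the bytes are re-sent on every fetch of the HTML.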

12. First-class operability — admin panel + bookmark debugger

Background Jobs card: pending/failed counters, "Recache failed content images" bulk action. Hidden when the feature is disabled.
Bookmark Debugger: Image Crawl Status badge + per-bookmark "Re-cache images" button for investigating one-off failures without shelling into the worker.

This wasn't originally scoped, but a feature that's off by default and silently fails per-bookmark is hard to trust. The instrumentation pays for itself the first time a report says "reader view is broken for this one page".

Trade-offs we accepted

Re-cache without re-crawl is a no-op when HTML already contains /api/assets/ URLs (original URLs are already gone). Users must re-crawl first. Alternative (keep original URLs alongside asset IDs) doubles the storage footprint of the HTML.
Images beyond the per-bookmark cap are silently dropped. The default cap is 50, matching @MohamedBassem's suggestion on #1563: "an artificial limit of maybe 50 images per link or something" (comment). If a page references more than 50 images, extraction stops at the 50th: images 51+ are never queued, never retried, never cached, and there is no partial-retry mechanism to pick them up later. We considered spilling the overflow into a follow-up job, but that adds moving parts for a case we have not seen in practice (articles with >50 images are rare). If we start seeing it, we can revisit.
Bookmark deletion cascades DB rows but not asset files — cleanup is handled by the existing tidyAssets maintenance job. Same trade-off as every other asset type; not worth special-casing.
Video poster and CSS background-image are not handled. This is intentionally out of scope: the long tail is noisy, and the feature already covers the common case of inline article images.

Configuration

All knobs below are currently server-wide env vars. Same caveat as decision #2: these may want to move to per-user application-level settings so individual users can pick their own budget without admin involvement.

| Env var | Default | Purpose |
| --- | --- | --- |
| `CRAWLER_STORE_CONTENT_IMAGES` | `false` | Master on/off |
| `CRAWLER_CONTENT_IMAGE_MAX_COUNT` | `50` | Per-bookmark image cap |
| `CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB` | `5` | Per-image size cap |
| `CONTENT_IMAGE_NUM_WORKERS` | `1` | Worker concurrency |
| `CONTENT_IMAGE_JOB_TIMEOUT_SEC` | `120` | Per-bookmark job timeout |
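A minimal sketch of how these variables and defaults could be read. The function name, the returned field names, and the rule that invalid or non-positive numbers fall back to the default are assumptions for illustration; the project's actual config parsing may work differently.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical reader for the env vars listed above, with the same defaults.
function readContentImageConfig(env: Env) {
  const positiveInt = (name: string, fallback: number): number => {
    const parsed = Number.parseInt(env[name] ?? "", 10);
    // Assumption: anything unparsable or non-positive falls back to the default.
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    enabled: env["CRAWLER_STORE_CONTENT_IMAGES"] === "true", // default OFF
    maxCount: positiveInt("CRAWLER_CONTENT_IMAGE_MAX_COUNT", 50),
    maxSizeMb: positiveInt("CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB", 5),
    numWorkers: positiveInt("CONTENT_IMAGE_NUM_WORKERS", 1),
    jobTimeoutSec: positiveInt("CONTENT_IMAGE_JOB_TIMEOUT_SEC", 120),
  };
}
```

Passing an empty object returns the defaults from the table, and setting only `CRAWLER_STORE_CONTENT_IMAGES=true` enables the feature with the default budget.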

Database migration

0081_add_content_image_status — nullable contentImageStatus column on bookmarkLinks. No data migration; existing rows stay NULL (not processed) until re-crawled.

Note: this migration number will collide once we rebase on main — main has added migrations up through 0082. Will renumber to 0083 during rebase.

Testing

91 unit tests (extraction, rewriting, download, magic bytes, pipeline, stale cleanup)
5 integration tests with a real HTTP server serving minimal valid images
Manual acceptance criteria pending: enable flag, bookmark an image-heavy page, take source offline, verify reader view still renders; verify admin panel counters; verify per-bookmark re-cache action.

Open questions for review

Is default-off the right shipping posture, or should we default-on in the next major release?
Should re-cache automatically trigger a re-crawl rather than no-op? (Tension: re-crawl is expensive; users may want to try re-caching cheaply first.)
Is 50 images / 5 MB the right default budget? We have no data on the real distribution yet.
