❓ Question for the maintainer (please answer first)
Before diving into a detailed design discussion, I want to check: is this feature still of interest to you?
#1563 is labelled status/approved, and the separate-worker / 50-image-cap direction originally came from your own comment on that issue (see decision #1 below). But several months have passed, priorities shift, and I'd rather not burn your review time on a deep back-and-forth if the answer is "not right now" or "the approach has changed."
If you reply "yes, let's discuss," I'll engage seriously with each of the design decisions and trade-offs below, address review feedback, and drive this toward a mergeable state. The rest of this document is supporting context for that conversation, not an ask for a full review yet.
Problem
Karakeep's reader view extracts readable HTML from a page and stores it, but the images inside that HTML remain as external <img src="https://origin/…"> references. Two user-visible symptoms follow from this:
Source goes offline — the article text is still there, but images are broken placeholders. For tutorials, step-by-step guides, and longform visual content, this often destroys the value of the archive.
Hotlink protection on the origin — even while the source is up, servers that check Referer reject requests coming from the Karakeep UI. Reader view renders with broken images even for "live" bookmarks.
The full-page archive path (SingleFile) already inlines images, but doing that for the default reader content is not viable — base64-embedding images in HTML that lives in the database would balloon storage and memory usage for every reader-view fetch.
The maintainer-approved approach (#1563): parse the extracted HTML, download referenced images as attached assets, and rewrite <img src> to point at the asset URLs.
Related work
Both issues below are currently open. "Closes" in the PR body is the GitHub keyword that auto-closes them when the PR merges — it is not the current state.
Our PR
#2595 (draft): feat: cache reader content images for offline access
This is a proof-of-concept PR. It has been tested end-to-end on my self-hosted instance (Docker image built from this branch, reader view verified with real bookmarks). It is not production-hardened — see "Trade-offs" and "Open questions" below for the gaps that would need to close before a production rollout.
21 commits, ~6,400 lines added. Introduces a new contentImageWorker, a CONTENT_IMAGE asset type, a DB column for per-bookmark status, admin UI integration, and per-bookmark debugger instrumentation.
If you want to pursue this direction, I will split this PR into smaller, review-sized pieces before asking for review — I don't expect you to review ~6,400 lines in a single branch. The current shape is a POC artifact to prove the end-to-end path on my instance; it isn't the shape I'd ask you to merge.
Solution overview
After a successful crawl, the crawler enqueues a ContentImageQueue job. A dedicated contentImageWorker:
Loads the stored reader HTML.
Extracts every external image reference — including <img src>, lazy-load attributes, srcset (picks the largest candidate), and SVG <image href> / xlink:href.
Downloads each image with browser-like headers (including Referer derived from the source page), validates format by magic bytes, and saves it as a CONTENT_IMAGE asset.
Rewrites each original URL in the HTML to /api/assets/{assetId}.
Persists the rewritten HTML back and records per-bookmark status (pending / success / failure) in a new bookmarkLinks.contentImageStatus column.
Cleans up stale content images no longer referenced in the re-crawled HTML.
Reader view then renders cached assets — images survive source outages and hotlink restrictions.
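The "picks the largest candidate" step for srcset can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the function name pickLargestSrcsetCandidate is hypothetical, and it assumes all candidates in one srcset use the same kind of descriptor (all widths or all densities), with a missing descriptor treated as 1.

```typescript
// Hypothetical helper: return the URL of the largest candidate in a srcset value.
// Candidates look like "image.jpg 800w" or "image.jpg 2x"; a missing descriptor counts as 1.
function pickLargestSrcsetCandidate(srcset: string): string | null {
  // One candidate: a URL, an optional numeric descriptor ending in w or x,
  // then a comma or the end of the string.
  const candidate = /(\S+)(?:\s+(\d+(?:\.\d+)?)[wx])?\s*(?:,|$)/g;
  let best: { url: string; size: number } | null = null;
  let match: RegExpExecArray | null;
  while ((match = candidate.exec(srcset)) !== null) {
    const url = match[1];
    const size = match[2] !== undefined ? parseFloat(match[2]) : 1;
    if (best === null || size > best.size) {
      best = { url, size };
    }
  }
  return best === null ? null : best.url;
}
```

Only the winning URL would then be downloaded and rewritten, so a responsive image costs one asset rather than one per candidate.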
Key design decisions and why
1. Separate worker, not inline in the crawler
The crawler already has many failure modes (browser crash, timeout, quota). Folding image downloads into the same code path would let one slow or broken image stall or fail an otherwise successful crawl. A dedicated worker with its own queue isolates reliability domains and lets us tune retries independently.
Origin of this decision: proposed by @MohamedBassem on #1563 — "maybe the crawler can spin up a new job to do the image downloads to not degrade the reliability of the crawler worker." (comment)
2. Default OFF (CRAWLER_STORE_CONTENT_IMAGES=false)
Turning this on silently for existing installs would surprise self-hosters with extra bandwidth, storage, and queue pressure. Opt-in is the safe default; admins who want the feature can flip a single env var.
Open for discussion: CRAWLER_STORE_CONTENT_IMAGES is currently a server-wide env var, one switch for the whole instance. We may want to move this (and the budget knobs below) to per-user application-level settings so individual users can opt in without admin involvement. Not done in this POC.
3. Deterministic asset IDs: SHA-256(bookmarkId:sourceUrl).slice(0,32)
Non-deterministic IDs would make retries and re-crawls messy: a failed download retried later would either duplicate the asset or need out-of-band dedup logic. Deriving the ID from stable inputs gives us free idempotency: a retry either finds the asset already saved and skips, or writes to the same slot.
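A minimal sketch of such an ID derivation, following the formula SHA-256(bookmarkId:sourceUrl).slice(0,32). The function name contentImageAssetId is hypothetical, and hex encoding of the digest is an assumption; the formula itself does not say which encoding is used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a stable asset ID from the bookmark and the image's source URL.
// Assumption: the digest is hex-encoded before truncating to 32 characters.
function contentImageAssetId(bookmarkId: string, sourceUrl: string): string {
  return createHash("sha256")
    .update(`${bookmarkId}:${sourceUrl}`)
    .digest("hex")
    .slice(0, 32);
}
```

Because the same (bookmarkId, sourceUrl) pair always maps to the same ID, a retry can check for an existing asset under that ID before downloading again.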
4. Extended format support (SVG, AVIF, APNG) on top of JPEG/PNG/GIF/WebP
Real-world article content uses all of these. SVG for diagrams and logos is particularly common on technical blogs. We kept the global IMAGE_ASSET_TYPES unchanged and added a worker-scoped CONTENT_IMAGE_ASSET_TYPES set so the broader asset system isn't affected.
5. Magic bytes detection for misconfigured servers
Many origins return images with Content-Type: text/html or application/octet-stream. Trusting the header blindly would drop legitimate images. We sniff the first few bytes against known image signatures and fall back to that when the header is wrong.
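The header-then-sniff fallback might look roughly like this. It is a sketch under assumptions: sniffImageType and resolveImageType are hypothetical names, and only the binary formats are covered. APNG shares the PNG signature, so it is reported as image/png here, and SVG is text, so it would need a separate check that is not shown.

```typescript
// True when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

function ascii(text: string): number[] {
  return Array.from(text, (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: identify an image by its leading bytes, or return null.
function sniffImageType(bytes: Uint8Array): string | null {
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return "image/jpeg";
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (hasSignature(bytes, ascii("GIF87a")) || hasSignature(bytes, ascii("GIF89a"))) return "image/gif";
  // WebP: "RIFF" <4-byte size> "WEBP"
  if (hasSignature(bytes, ascii("RIFF")) && hasSignature(bytes, ascii("WEBP"), 8)) return "image/webp";
  // AVIF: ISO BMFF "ftyp" box whose major brand is "avif" or "avis"
  if (
    hasSignature(bytes, ascii("ftyp"), 4) &&
    (hasSignature(bytes, ascii("avif"), 8) || hasSignature(bytes, ascii("avis"), 8))
  ) {
    return "image/avif";
  }
  return null;
}

// Trust an image/* Content-Type header; otherwise fall back to the sniffed type.
function resolveImageType(contentTypeHeader: string | null, bytes: Uint8Array): string | null {
  const headerType = (contentTypeHeader ?? "").split(";")[0].trim().toLowerCase();
  if (headerType.startsWith("image/")) return headerType;
  return sniffImageType(bytes);
}
```

A response labelled text/html whose body starts with the PNG signature would resolve to image/png, while a genuine HTML error page would resolve to null and be skipped.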
6. Browser-like request headers with Referer
Chrome UA + Accept: image/* + a Referer set to the original page URL is the minimum needed to satisfy the most common hotlink checks (#2363). This alone recovers a meaningful fraction of images that would otherwise fail.
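As a rough sketch, the header set described above could be built like this. The function name buildImageRequestHeaders is hypothetical and the exact User-Agent string is a placeholder assumption, not the value used in the PR.

```typescript
// Hypothetical helper: browser-like headers for fetching an image embedded in `pageUrl`.
function buildImageRequestHeaders(pageUrl: string): Record<string, string> {
  return {
    // Placeholder Chrome-style UA; the real string used by the worker may differ.
    "User-Agent":
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    Accept: "image/*",
    // Hotlink checks typically compare this against the site's own pages.
    Referer: pageUrl,
  };
}

// Usage (not executed here): fetch(imageUrl, { headers: buildImageRequestHeaders(pageUrl) })
```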
7. Per-image exponential backoff (up to 10 retries, 1s → 30s cap)
Image hosts and CDNs return transient 403/429/5xx far more than they return permanent failures. Aggressive but bounded retries (~3 min worst case per image) strike a reasonable balance between resilience and bounded work per job.
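The worst-case figure can be checked with a small sketch. It uses up to 10 retries, a 1s initial delay, and a 30s cap, and it assumes the delay doubles on each retry with no jitter; the growth factor is an assumption. The names retryDelayMs and retrySchedule are hypothetical.

```typescript
const MAX_RETRIES = 10;
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

// Delay before retry number `attempt` (0-based), assuming doubling with a hard cap.
function retryDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Full schedule of waits if every retry is needed.
function retrySchedule(): number[] {
  return Array.from({ length: MAX_RETRIES }, (_, attempt) => retryDelayMs(attempt));
}

// Total time spent waiting in the worst case, excluding the requests themselves.
const worstCaseWaitMs = retrySchedule().reduce((sum, delay) => sum + delay, 0);
```

Under these assumptions the waits are 1, 2, 4, 8, and 16 seconds followed by five waits of 30 seconds, about 181 seconds in total, which is consistent with the roughly 3 minute worst case quoted above.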
8. Sequential downloads per bookmark
Parallel downloads against a single origin trip rate limits fast, especially for image-heavy articles hosted on one CDN. Sequential keeps us under most per-host budgets, and the work is queued — total throughput is regulated by worker concurrency, not per-job fanout.
9. Stale asset cleanup on re-crawl — only after at least one new image succeeded
When a bookmark is re-crawled, the extracted HTML can change (site layout updates, article edits). Images no longer referenced would otherwise accumulate forever. We gate cleanup on "at least one new image cached successfully" so a transient total failure doesn't wipe the existing working set.
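The cleanup gate can be expressed as a small pure function. This is a sketch only; selectStaleAssetIds and its parameters are hypothetical names, not the worker's real interface.

```typescript
// Hypothetical helper: decide which previously cached content images to delete after a re-crawl.
// existingIds:      asset IDs currently attached to the bookmark as content images
// referencedIds:    asset IDs still referenced by the newly rewritten HTML
// newlyCachedCount: number of images cached successfully in this run
function selectStaleAssetIds(
  existingIds: string[],
  referencedIds: Set<string>,
  newlyCachedCount: number,
): string[] {
  // Gate: if nothing succeeded this run, treat it as a transient failure and keep everything.
  if (newlyCachedCount === 0) return [];
  return existingIds.filter((id) => !referencedIds.has(id));
}
```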
10. Content images are system-managed, hidden from AttachmentBox
These are implementation detail, not user uploads. Showing 30+ tiny assets per bookmark in the attachment UI would be noise. isAllowedToAttachAsset / isAllowedToDetachAsset both return false for CONTENT_IMAGE so users can't accidentally break reader view by deleting one.
11. Data-URI inlining for API-key clients
The web client authenticates asset requests via session cookies, so relative /api/assets/{id} URLs in <img src> just work. API-key clients — mobile app, CLI, browser extension, MCP server — cannot attach bearer tokens to plain <img> requests, so the same URLs render as broken images for them. When a request comes in via Bearer-token auth, the tRPC layer detects ctx.apiAuth and inlines small cached images as base64 data URIs directly into the returned HTML. Large images still flow through the normal asset endpoint.
POC choice. This was picked specifically to ship a working end-to-end experience without requiring coordinated changes across four client codebases (iOS/Android via React Native, CLI, browser extension, MCP server). It is not necessarily the right long-term design — data-URI payloads bloat HTML responses, duplicate bytes across fetches, and can't leverage HTTP caching. Ready to discuss alternatives before merging. No alternatives have been formally brainstormed yet; this note is the invitation to do so.
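For discussion purposes, the inlining step could be sketched as below. The names inlineSmallImages and CachedImage, the 100 KB threshold, and the assumption that rewritten attributes look exactly like src="/api/assets/{id}" are all illustrative; the PR's actual threshold and matching logic may differ.

```typescript
interface CachedImage {
  contentType: string;
  bytes: Uint8Array;
}

// Hypothetical threshold for what counts as a "small" image.
const MAX_INLINE_BYTES = 100 * 1024;

function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Replace src="/api/assets/{id}" with a data URI when the cached image is small enough.
// Unknown or oversized assets keep their original URL.
function inlineSmallImages(
  html: string,
  images: Map<string, CachedImage>,
  maxBytes: number = MAX_INLINE_BYTES,
): string {
  return html.replace(
    /src="\/api\/assets\/([A-Za-z0-9_-]+)"/g,
    (original: string, assetId: string) => {
      const image = images.get(assetId);
      if (image === undefined || image.bytes.length > maxBytes) return original;
      return `src="data:${image.contentType};base64,${toBase64(image.bytes)}"`;
    },
  );
}
```

Base64 grows each payload by about a third, which is one concrete reason the response-size concern raised above matters for image-heavy bookmarks.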
12. First-class operability: admin panel + bookmark debugger
Background Jobs card: pending/failed counters, "Recache failed content images" bulk action. Hidden when the feature is disabled.
Bookmark Debugger: Image Crawl Status badge + per-bookmark "Re-cache images" button for investigating one-off failures without shelling into the worker.
This wasn't originally scoped, but a feature that's off by default and silently fails per-bookmark is hard to trust. The instrumentation pays for itself the first time a report says "reader view is broken for this one page".
Trade-offs we accepted
Re-cache without re-crawl is a no-op when HTML already contains /api/assets/ URLs (original URLs are already gone). Users must re-crawl first. Alternative (keep original URLs alongside asset IDs) doubles the storage footprint of the HTML.
Images beyond the per-bookmark cap are silently dropped. The default cap is 50, matching @MohamedBassem's suggestion on #1563: "an artificial limit of maybe 50 images per link or something" (comment). If a page references more than 50 images, extraction stops at the 50th: images 51+ are never queued, never retried, never cached, and there is no partial-retry mechanism to pick them up later. We considered spilling the overflow into a follow-up job, but that adds moving parts for a case we have not seen in practice (articles with >50 images are rare). If we start seeing it, we can revisit.
Bookmark deletion cascades DB rows but not asset files — cleanup is handled by the existing tidyAssets maintenance job. Same trade-off as every other asset type; not worth special-casing.
Video poster and CSS background-image not handled. Intentionally out of scope — the long tail is noisy, and the feature already covers the 95% case.
Configuration
All knobs below are currently server-wide env vars. Same caveat as decision #2: we may want to move these to per-user application-level settings so individual users can pick their own budget without admin involvement.
| Env var | Default | Purpose |
| --- | --- | --- |
| CRAWLER_STORE_CONTENT_IMAGES | false | Master on/off |
| CRAWLER_CONTENT_IMAGE_MAX_COUNT | 50 | Per-bookmark image cap |
| CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB | 5 | Per-image size cap |
| CONTENT_IMAGE_NUM_WORKERS | 1 | Worker concurrency |
| CONTENT_IMAGE_JOB_TIMEOUT_SEC | 120 | Per-bookmark job timeout |
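To make the defaults concrete, here is a minimal sketch of reading these variables with their documented fallbacks. The ContentImageConfig shape and loadContentImageConfig function are hypothetical and do not reflect how Karakeep's real config module is structured; the sketch also assumes the master switch is enabled only by the literal string "true".

```typescript
interface ContentImageConfig {
  enabled: boolean;
  maxCount: number;
  maxSizeMb: number;
  numWorkers: number;
  jobTimeoutSec: number;
}

// Parse an integer env value, falling back to the default when unset or not a number.
function intFromEnv(value: string | undefined, fallback: number): number {
  if (value === undefined) return fallback;
  const parsed = parseInt(value, 10);
  return isNaN(parsed) ? fallback : parsed;
}

// Hypothetical loader: maps the env vars in the table to a typed config object.
function loadContentImageConfig(env: Record<string, string | undefined>): ContentImageConfig {
  return {
    enabled: env["CRAWLER_STORE_CONTENT_IMAGES"] === "true",
    maxCount: intFromEnv(env["CRAWLER_CONTENT_IMAGE_MAX_COUNT"], 50),
    maxSizeMb: intFromEnv(env["CRAWLER_CONTENT_IMAGE_MAX_SIZE_MB"], 5),
    numWorkers: intFromEnv(env["CONTENT_IMAGE_NUM_WORKERS"], 1),
    jobTimeoutSec: intFromEnv(env["CONTENT_IMAGE_JOB_TIMEOUT_SEC"], 120),
  };
}
```

With an empty environment this yields the feature disabled and the 50 / 5 / 1 / 120 defaults from the table.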
Database migration
0081_add_content_image_status — nullable contentImageStatus column on bookmarkLinks. No data migration; existing rows stay NULL (not processed) until re-crawled.
Note: this migration number will collide once we rebase on main — main has added migrations up through 0082. Will renumber to 0083 during rebase.
Testing
91 unit tests (extraction, rewriting, download, magic bytes, pipeline, stale cleanup)
5 integration tests with a real HTTP server serving minimal valid images
Manual acceptance criteria pending: enable flag, bookmark an image-heavy page, take source offline, verify reader view still renders; verify admin panel counters; verify per-bookmark re-cache action.
Open questions for review
Is default-off the right shipping posture, or should we default-on in the next major release?
Should re-cache automatically trigger a re-crawl rather than no-op? (Tension: re-crawl is expensive; users may want to try re-caching cheaply first.)
Is 50 images / 5 MB the right default budget? We have no data on the real distribution yet.