Extract individual files out of a huge remote ZIP archive on Cloudflare Workers — using HTTP byte-range reads, without downloading the whole archive.
Point it at a multi-gigabyte .zip sitting on any server that supports HTTP
range requests. rangezip reads only the archive's index, then pulls out exactly
the files you ask for and streams them into R2. A 50 GB archive where you want
three files costs you three files' worth of bandwidth and a few kilobytes of
index — not 50 GB.
The interesting part of this project is doing what looks impossible inside the platform's hard constraints. A Cloudflare Worker isolate has a small memory ceiling (128 MB), yet rangezip extracts files out of archives orders of magnitude larger than that ceiling. The trick is that each extraction holds only one in-flight byte-range slice in memory at a time: the ZIP index is read from the tail, each file's compressed bytes are streamed through the decompressor and into R2, and nothing (not the archive, not even a single extracted file) is ever fully buffered in the isolate.

    remote ZIP (could be 50 GB, on any range-capable host)

    [local hdr][data] … [local hdr][data] | central dir | EOCD

    ① range-GET tail (~64 KB)
       → find EOCD / ZIP64, learn where the central directory is
    ② range-GET the central directory bytes
       → parse every entry (name, method, sizes, CRC32, local offset)
    ③ per requested file: range-GET its 30-byte local header
       → compute the exact data offset
    ④ range-GET just the compressed bytes (streamed, never buffered)
    ⑤ DecompressionStream('deflate-raw') (method 8) or pass-through (method 0)
       ──► FixedLengthStream(uncompressedSize)
           (gives R2 an exact Content-Length; asserts the produced byte count)
       ──► R2.put

Coordinated by a Durable Object (one per job): tracks status in SQLite, fans
files out with bounded concurrency, isolates per-file failures.
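The fan-out with per-file failure isolation can be sketched as a small worker pool. This is a minimal illustration, not the project's code: `runBounded`, `Settled`, and the lane structure are hypothetical names, and the real job also persists status to SQLite, which is omitted here.

```typescript
// Hypothetical helper: run `worker` over `items`, at most `limit` at a time.
// A rejected worker is recorded for its own item and never aborts the others.
type Settled<T> =
  | { status: "done"; value: T }
  | { status: "failed"; error: string };

async function runBounded<I, O>(
  items: readonly I[],
  limit: number,
  worker: (item: I) => Promise<O>,
): Promise<Settled<O>[]> {
  const results = new Array<Settled<O>>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane keeps claiming the next unclaimed item until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { status: "done", value: await worker(items[i] as I) };
      } catch (err) {
        results[i] = { status: "failed", error: String(err) };
      }
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

Calling `runBounded(entries, 6, extractOne)` (with `extractOne` standing in for the per-file pipeline) would keep at most six extractions in flight and return one result per entry, in input order.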
1. Read the index without downloading the file. Range-GET the last ~64 KB and scan backwards for the End-Of-Central-Directory record (signature `0x06054b50`). If the archive is larger than 4 GB or has more than 65 535 entries, the classic EOCD fields hold sentinel values and we follow the ZIP64 EOCD locator (`0x07064b50`) → ZIP64 EOCD (`0x06064b50`) to read the real 64-bit offsets. Then range-GET exactly the central-directory bytes and parse each central-directory file header (`0x02014b50`) into a list of entries: name, compression method, compressed/uncompressed size, local-header offset, CRC32.
2. Find each file's true data offset. The central directory tells us where a file's local header is, but not where its data starts: the local header has its own filename-length and extra-field-length fields, which routinely differ from the central directory's. So per file we range-GET the 30-byte local header (`0x04034b50`) and compute: `dataOffset = localHeaderOffset + 30 + localNameLength + localExtraLength`
3. Stream the bytes out. Range-GET just the compressed bytes as a stream. For DEFLATE (method 8) we pipe through the platform-native `new DecompressionStream('deflate-raw')`; for STORED (method 0) we pass the bytes straight through. The output is wrapped in a `FixedLengthStream` sized to the known uncompressed length, then `put` into R2 at `${prefix}/${name}`. A `FixedLengthStream` gives R2 a precise `Content-Length` (so it never buffers the object to discover its size) and turns a truncated or corrupt inflate into a clean write-time error instead of a silently short object.

   Large STORED entries take a parallel path. A STORED entry above a threshold (64 MiB) needs no decompression, so its byte range splits cleanly: instead of one streamed `put`, rangezip plans the range into parts (each ≥ 5 MiB to satisfy R2), range-GETs the parts in parallel with the same bounded concurrency as the per-file pool, streams each into an R2 `uploadPart`, then completes the multipart upload. This lifts a single huge STORED file past the limits of one Worker invocation's single-stream pump. DEFLATE entries and small STORED entries keep the single-stream path.
4. Coordinate with a Durable Object. One DO per job owns the job's state in SQLite, fans the per-file extractions out with bounded concurrency (6 at a time), records progress, and isolates per-file failures: one bad entry is recorded as `failed` and the rest of the job continues.
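Steps 1 and 2 reduce to two small pure functions over `Uint8Array`. The sketch below is illustrative only: `findEocd`, `dataOffset`, and the `Eocd` shape are hypothetical names, the ZIP64 follow-up is left out, and a real scan would also need to guard against the signature bytes appearing inside the archive comment.

```typescript
const EOCD_SIG = 0x06054b50; // End-Of-Central-Directory record
const LOCAL_SIG = 0x04034b50; // local file header
const EOCD_MIN = 22; // fixed EOCD size, excluding the trailing comment
const LOCAL_FIXED = 30; // fixed local-header size

interface Eocd {
  entries: number; // 0xffff is the "see ZIP64" sentinel
  cdSize: number; // 0xffffffff is the "see ZIP64" sentinel
  cdOffset: number; // 0xffffffff is the "see ZIP64" sentinel
}

// Scan the tail backwards for the EOCD signature (ZIP fields are little-endian).
function findEocd(tail: Uint8Array): Eocd | null {
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  for (let i = tail.byteLength - EOCD_MIN; i >= 0; i--) {
    if (view.getUint32(i, true) !== EOCD_SIG) continue;
    return {
      entries: view.getUint16(i + 10, true),
      cdSize: view.getUint32(i + 12, true),
      cdOffset: view.getUint32(i + 16, true),
    };
  }
  return null;
}

// dataOffset = localHeaderOffset + 30 + localNameLength + localExtraLength
function dataOffset(localHeaderOffset: number, header: Uint8Array): number {
  if (header.byteLength < LOCAL_FIXED) throw new Error("short local header");
  const view = new DataView(header.buffer, header.byteOffset, header.byteLength);
  if (view.getUint32(0, true) !== LOCAL_SIG) {
    throw new Error("bad local header signature");
  }
  const nameLen = view.getUint16(26, true); // filename length
  const extraLen = view.getUint16(28, true); // extra-field length
  return localHeaderOffset + LOCAL_FIXED + nameLen + extraLen;
}
```

Both lengths are read from the local header itself rather than from the central directory, which is the reason the extra 30-byte range-GET is needed at all.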
The Worker also serves a self-contained demo UI at / (single HTML page,
vanilla JS, no build step). It walks you through: enter an access code → pick a
source (paste a ZIP URL or choose a sample) → pick a destination (the ephemeral
demo bucket, or your own S3/R2 bucket) → Extract → watch live progress over a
WebSocket with a real metrics panel, a per-file status log, a file browser
with downloads, and a countdown to cleanup.
Auth is an access-code gate that issues a signed (HMAC-SHA256) session cookie.
Access codes are not a static secret — they live in a SQLite-backed
Registry Durable Object and are created/labelled/revoked from the admin panel
(see below). The only secrets are the HMAC key and the admin key:

    # HMAC key for signing BOTH session cookies (access-code + admin). Long + random.
    wrangler secret put SESSION_SECRET   # e.g. `openssl rand -hex 32`
    # Admin-panel key: gates /admin and the code-management API. Long + random.
    wrangler secret put ADMIN_KEY        # e.g. `openssl rand -hex 32`

(There is no longer an ACCESS_CODES secret; the Registry is the source of
truth. If you're upgrading an older deploy, you can drop it.)
For local wrangler dev, put them in a gitignored .dev.vars file:

    SESSION_SECRET=local-dev-only-not-a-real-secret
    ADMIN_KEY=local-dev-only-admin-key

The access-code model. POST /auth { code } calls
Registry.validateCode(code) — a code is accepted iff it exists and hasn't been
revoked. On success it records a session_start usage event and sets
rangezip_session, an HttpOnly; Secure; SameSite=Lax; Max-Age=86400 cookie
carrying <base64url(payload)>.<base64url(hmac)> whose sub is the access
code (so all downstream activity attributes to it). The middleware on the
gated routes recomputes the HMAC (constant-time) and checks expiry; there's no
server-side session store. POST /logout clears it; GET /me reports whether a
valid session is present.
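The non-cryptographic half of that check (splitting the token, comparing MACs without an early exit, testing expiry) might look like the sketch below. The names `splitToken`, `constantTimeEqual`, and `isExpired` are hypothetical, and the payload shape with an `exp` field in unix-ms is an assumption; the HMAC-SHA256 computation itself is left out.

```typescript
// Compare two byte strings without an early exit on the first mismatch, so
// timing does not reveal how many leading bytes matched. The length check
// does return early, which is acceptable for fixed-length MACs.
function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.byteLength !== b.byteLength) return false;
  let diff = 0;
  for (let i = 0; i < a.byteLength; i++) {
    diff |= (a[i] as number) ^ (b[i] as number);
  }
  return diff === 0;
}

// Assumed payload shape: `sub` is the access code, `exp` is unix-ms.
interface SessionPayload {
  sub: string;
  exp: number;
}

// Split "<payload>.<mac>" into its two base64url parts; null if malformed.
function splitToken(token: string): { payload: string; mac: string } | null {
  const dot = token.indexOf(".");
  const malformed =
    dot <= 0 || dot !== token.lastIndexOf(".") || dot === token.length - 1;
  return malformed
    ? null
    : { payload: token.slice(0, dot), mac: token.slice(dot + 1) };
}

function isExpired(payload: SessionPayload, nowMs: number): boolean {
  return nowMs >= payload.exp;
}
```

A verifier built on these would recompute the MAC over the `payload` part, compare it to the decoded `mac` with `constantTimeEqual`, and only then parse the payload and check `isExpired`.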
A separate self-contained page (same dark theme + logo) for managing access
codes and reviewing usage. It's gated by ADMIN_KEY, not an access-code
session: POST /admin/auth { key } constant-time-compares the key and issues a
distinct admin cookie (rangezip_admin, sub = "admin"). The two cookies are
mutually exclusive — an admin session can't satisfy a code-gated route and vice
versa (different cookie name + a sub sentinel check). All /admin/* data
routes require the admin session.
From the panel you can:
- Generate a code with a label (e.g. "Investor demo — Acme"); the new code is shown to copy.
- See a code list: label · code · created · redeemed? · last active · #sessions · #jobs · #files · bytes — all aggregated from real usage events.
- Revoke a code (it can no longer sign in; existing sessions expire on TTL).
- Open a per-code usage timeline of every recorded event.
Usage tracking. Codes accrue events in the Registry: session_start on
sign-in, plus job_started ({ source, requestedFiles }) and job_completed
({ filesExtracted, bytes, percentOfArchive, durationMs }) emitted by the
extraction job, attributed to the session's code. The job records these
best-effort — a Registry hiccup never disrupts an extraction. Every number in
the admin view is a real measurement the job already computes (same honesty bar
as the demo metrics).
Every number the demo shows is a real measurement, precisely labelled:
| Metric | What it is |
|---|---|
| range bytes vs archive | the headline — bytes range-GET'd / total archive size, as a % |
| bytes fetched / archive size | raw byte totals behind the headline |
| bytes not fetched | the savings (archive size − bytes fetched) |
| range requests | count of every range-GET (size probe + index + per-file header + data) |
| peak concurrency | max simultaneous in-flight extractions observed (bounded at 6) |
| files extracted | files successfully written |
| bucket writes | successful R2 / BYO puts |
| index-read time | measured wallclock from job start to central directory parsed |
| extraction time | measured wallclock from index parsed to last file settled |
| total time | index-read + extraction |
| compute time (measured) | summed performance.now() deltas around the inflate/stream pump |
| central dir reads (reused) | the central directory is read once and reused across every file |
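The headline and savings rows are plain arithmetic over two byte counters. A minimal sketch, with `summarizeTransfer` and its field names being hypothetical, and with no clamping applied (an assumption, since the table only gives the formulas):

```typescript
interface TransferSummary {
  percentOfArchive: number; // bytes range-GET'd / total archive size, as a %
  bytesNotFetched: number; // archive size minus bytes fetched
}

function summarizeTransfer(
  bytesFetched: number,
  archiveSize: number,
): TransferSummary {
  if (archiveSize <= 0) throw new RangeError("archiveSize must be positive");
  return {
    percentOfArchive: (bytesFetched / archiveSize) * 100,
    // Assumption: not clamped, so this goes negative when more bytes were
    // fetched than the archive contains (headers + index + every file).
    bytesNotFetched: archiveSize - bytesFetched,
  };
}
```

For example, fetching 50 bytes of a 200-byte archive yields 25% and 150 bytes not fetched, while fetching slightly more than the archive size yields a percentage above 100.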
Labelling honesty notes (a critical reviewer should know these):
- "compute time (measured)" is wallclock measured around the CPU-bound decompress+stream section with `performance.now()`. It is not billed CPU-ms (the Workers platform doesn't expose per-section CPU time to the isolate), and it includes some stream/IO overlap. It's labelled "measured" for exactly this reason: an honest measured wallclock delta, not a synthetic billing figure.
- The headline percentage can exceed 100% when you extract all files of a small archive: you fetch every file's compressed bytes plus per-file local headers plus the index, which sums to slightly more than the archive. That's the metric being honest, not broken: the dramatic savings appear when you extract a subset of a large archive (e.g. 2 files of 30 GB → well under 1%). The demo never fakes this.
- "central dir reads (reused)" is a genuine reuse metric: the central directory is read exactly once per job and reused across every per-file extraction. It's not an invented "N reuses" number.
When a demo-bucket job completes, the Durable Object schedules an alarm at
now + EXTRACT_TTL_HOURS (default 2h, set via the EXTRACT_TTL_HOURS var in
wrangler.jsonc). The alarm() handler deletes every R2 object under the
job's own <jobId>/… prefix (and only that prefix — never a shared or another
job's). The job's expiresAt drives a live countdown in the UI ("clears in
1:58:14"). As a backstop, add a bucket-wide R2 lifecycle rule (demo output
keys live at the bucket root under their per-job <jobId>/ namespace):

    # Belt-and-suspenders: even if a DO never fires its alarm, R2 expires the object.
    wrangler r2 bucket lifecycle add rangezip-output --expire-days 1

(BYO-bucket jobs are never cleaned up; that's the user's data.)
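The expiry timestamp and the UI countdown are simple time arithmetic. The sketch below assumes `expiresAt` is unix-ms (as in the status payload) and uses hypothetical helper names; it is not the project's code.

```typescript
const MS_PER_HOUR = 3_600_000;

// Cleanup time: completion time plus the configured TTL (EXTRACT_TTL_HOURS).
function expiresAt(completedAtMs: number, ttlHours: number): number {
  return completedAtMs + ttlHours * MS_PER_HOUR;
}

// Format the remaining time as h:mm:ss, floored to whole seconds and
// clamped at zero once the expiry has passed.
function formatCountdown(expiresAtMs: number, nowMs: number): string {
  const total = Math.max(0, Math.floor((expiresAtMs - nowMs) / 1000));
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  const pad = (n: number): string => String(n).padStart(2, "0");
  return `${h}:${pad(m)}:${pad(s)}`;
}
```

With the default 2 h TTL, a job that completed 106 seconds ago formats as `1:58:14`.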
POST /auth { code } → 200 + Set-Cookie on success, 401 on a bad code. All routes
below require the resulting session cookie (401 without it).
Returns 202 with { "jobId": "...", "status": "pending" }. The request does a
cheap pre-flight (reads the index, validates that any explicitly-requested files
exist → 404 if not) before accepting the job, then returns immediately while
extraction runs in the background.
Output keys are scoped per job. There is no caller-supplied output prefix — the server derives the R2 key prefix from the per-job id so two concurrent jobs can never collide (overwrite each other's objects, serve the wrong download, or wipe each other's data on cleanup):
- demo bucket: `<jobId>/<filename>`
- BYO bucket: `<byoPrefix>/<jobId>/<filename>` (the optional in-bucket `prefix` is the user's bucket layout; the `<jobId>` subfolder keeps it collision-safe).
{ destination: { endpoint, region, bucket, accessKeyId, secretAccessKey, prefix? } }
→ signs a cheap real round-trip (SigV4 PUT then DELETE of
${prefix}/.rangezip-check via aws4fetch) → { valid: true } or
{ valid: false, reason }. The UI requires this to pass before enabling
Extract in BYO mode.
Returns the job status, per-file results, destination/expiry, and live metrics:

    {
      "id": "3f9c…",                 // the jobId; also the demo output key prefix
      "status": "completed",         // pending | running | completed | failed
      "sourceUrl": "https://example.com/huge.zip",
      "prefix": "3f9c…",             // server-derived: <jobId> (demo) or <byoPrefix>/<jobId> (byo)
      "total": 2,
      "done": 2,
      "failed": 0,
      "error": null,
      "destination": "demo",
      "expiresAt": 1779473689526,    // unix-ms cleanup time (null for byo)
      "percent": 100,
      "metrics": {
        /* see "Honest metrics" above */
      },
      "files": [
        {
          "name": "docs/report.pdf",
          "key": "3f9c…/docs/report.pdf",
          "status": "done",
          "bytes": 1048576,
          "error": null
        }
      ]
    }

A job is reported completed even if some individual files failed; inspect the
per-file status/error. The job is only failed if extraction couldn't even
start (e.g. the source doesn't support range requests, or the index is
unparseable).
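Because `completed` does not imply that every file succeeded, a client has to look at the per-file entries. A small sketch of that check, using a trimmed-down, hypothetical `JobReport` type that mirrors only the fields shown in the payload above:

```typescript
type FileStatus = "queued" | "extracting" | "done" | "failed";

interface FileResult {
  name: string;
  key: string;
  status: FileStatus;
  bytes: number;
  error: string | null;
}

interface JobReport {
  status: "pending" | "running" | "completed" | "failed";
  total: number;
  done: number;
  failed: number;
  files: FileResult[];
}

type Outcome = "in-progress" | "ok" | "partial" | "failed";

// Collapse job-level and file-level status into one client-side outcome.
function classify(report: JobReport): Outcome {
  if (report.status === "failed") return "failed"; // extraction never started
  if (report.status !== "completed") return "in-progress";
  return report.files.some((f) => f.status === "failed") ? "partial" : "ok";
}

// The entries a client should surface or retry.
function failedFiles(report: JobReport): FileResult[] {
  return report.files.filter((f) => f.status === "failed");
}
```

The `partial` outcome is the case the paragraph above warns about: the job finished, but at least one entry carries its own `error`.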
WebSocket upgrade. The Worker validates the session + upgrade header, then
forwards the request to the job's Durable Object, which accepts the socket via
the Hibernation API (ctx.acceptWebSocket) and broadcasts progress messages
as extraction runs: a snapshot on connect, a file delta per file transition
(queued/extracting/done/failed), and a terminal done. Each message
carries the live metrics. (The DO can hibernate between bursts without dropping
the socket.)
Lists extracted objects under the job prefix: { files: [{ name, key, size }] }.
For BYO jobs the list is empty (the files are in the user's bucket).
Streams one extracted object back (Content-Disposition: attachment), never
buffering it in the isolate. 409 for BYO jobs (the file lives in the user's
bucket).
When you choose "My S3/R2 bucket", the credentials are treated as transient:
- They live only in the Durable Object's in-memory job state for the duration of the run. They are never written to DO storage / SQLite, never logged, and never returned in any API response. They are discarded the moment the run completes (the DO drops them).
- They travel over HTTPS only and are used solely to SigV4-sign the per-file PUTs (and the validate probe) to your endpoint.
- rangezip never deletes extracted files from a BYO bucket (the only delete it issues is the validate probe removing its own `.rangezip-check` object): there's no cleanup alarm and no countdown for BYO jobs. The files are yours to keep.
Use a scoped, write-only key for just this bucket/prefix — not a root key.
The validate step needs s3:PutObject + s3:DeleteObject on the target prefix;
the extraction itself only needs s3:PutObject.
scripts/make-sample.ts generates a synthetic many-file ZIP of a target size
(2 / 10 / 30 GB; a mix of STORED incompressible + DEFLATE text entries, so the
central directory and per-file extraction are meaningful). It uses fflate's
streaming ZIP API piped to a file with backpressure, so even a 30 GB sample
is generated in bounded memory (same discipline as the Worker).

    # Generate locally (writes ./samples/rangezip-sample-2gb.zip):
    bun run scripts/make-sample.ts --size 2gb
    # Generate AND upload to the demo R2 bucket via wrangler:
    bun run scripts/make-sample.ts --size 10gb --upload

After uploading, set the public, range-capable URL and flip available: true
for that preset in src/samples.ts (the demo UI reads the presets from there).
Until then the preset URLs are documented placeholders (the UI labels them as
such). Generating/uploading 10–30 GB is a one-time manual step — the script,
presets, and instructions ship ready; the actual large upload is yours to run.
Being honest about where this technique shines and where it doesn't:
- Parallelism is across files, not within a compressed file. The Durable Object fans the per-file extractions out with bounded concurrency (6 at a time), so the sweet spot is many-file archives: the more entries, the more the work overlaps. A 10 000-file ZIP parallelises beautifully; a ZIP that is one enormous file does not get the same benefit from the per-file pool.
- A single large STORED entry IS parallelised. STORED (uncompressed) data needs no inflate, so its byte range is freely splittable. Above the 64 MiB threshold rangezip range-GETs the parts in parallel and uploads them via an R2 multipart upload (see pipeline step 3). This is the common case for the big files inside real archives: large media is usually already-compressed (JPEG/PNG/MP4/…) and therefore stored uncompressed in the ZIP.
- A single very large DEFLATE entry is the genuine limit. A raw deflate stream is not randomly seekable: byte N of the compressed stream can't be inflated without having inflated everything before it. So a huge DEFLATE entry must inflate sequentially, in one invocation, and that one invocation is CPU-bound on the inflate. There is no splitting it; this is a property of DEFLATE, not a missing feature. In practice it's rare: the files big enough to hit this are almost always already-compressed media, which ZIPs store as STORED (the parallel path above), not DEFLATE.

  If you ever genuinely needed to extract a single multi-GB DEFLATE entry that exceeds one invocation's CPU budget, the path is to offload the sequential inflate to a runtime without the per-invocation CPU limit, such as Cloudflare Workflows (durable, long-running steps) or Containers (a full process, no isolate CPU ceiling), streaming the inflate output into R2 the same way. That's a deliberate non-goal here, called out plainly rather than hidden: this reference targets the memory-bounded, range-driven core, and the STORED-parallel + many-file cases that cover the overwhelming majority of real archives.
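The STORED-parallel path depends on cutting one entry's byte range into contiguous parts. A minimal planning sketch follows; `planParts`, `Part`, and `rangeHeader` are hypothetical names, and the part size is left as a caller-chosen parameter because the text only states the 5 MiB minimum.

```typescript
const MIB = 1024 * 1024;
const MIN_PART = 5 * MIB; // minimum part size stated above

interface Part {
  partNumber: number; // 1-based, as multipart uploads expect
  start: number; // absolute byte offset in the archive, inclusive
  end: number; // absolute byte offset in the archive, inclusive
}

// Split `size` bytes starting at `dataOffset` into parts of `partSize`.
// Only the final part may be shorter than `partSize`.
function planParts(dataOffset: number, size: number, partSize: number): Part[] {
  if (partSize < MIN_PART) {
    throw new RangeError("partSize is below the 5 MiB minimum");
  }
  const parts: Part[] = [];
  for (let done = 0; done < size; done += partSize) {
    const len = Math.min(partSize, size - done);
    parts.push({
      partNumber: parts.length + 1,
      start: dataOffset + done,
      end: dataOffset + done + len - 1,
    });
  }
  return parts;
}

// HTTP Range header value for one part (both ends inclusive).
function rangeHeader(part: Part): string {
  return `bytes=${part.start}-${part.end}`;
}
```

Each planned part maps to one range-GET and one `uploadPart` call, so the parts can be fetched in any order and with the same bounded concurrency as the per-file pool.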
- Streaming over buffering, everywhere. The whole design exists to respect the isolate memory ceiling. The index reads are tiny (tail + central directory). Each file's data is streamed range-GET → decompress → `FixedLengthStream` → R2, so peak memory is one in-flight slice, not one whole file. This is what makes multi-GB archives tractable on Workers.
- Native `DecompressionStream` over WASM. A production system that inspired this technique used a hand-tuned WASM inflate for raw throughput. This reference deliberately uses the platform-native `DecompressionStream('deflate-raw')` instead. The honest trade-off: WASM can be faster and gives finer control (custom dictionaries, partial flushes), but the native stream is zero-dependency, already audited by the runtime, and reads as plain web-standard code. For a reference implementation, legibility wins; if you were chasing maximum throughput on hot paths, WASM is the lever to reach for.
- Pure logic separated from the IO shell. All ZIP-format parsing (EOCD scan, central-directory parse, local-header offset math) lives in `src/zip/` as pure functions over `Uint8Array`: no network, no Effect, no R2. The IO and orchestration live in the Effect shell (`src/effect/`, `src/extract.ts`) and the Durable Object (`src/job.ts`). This keeps the fiddly binary logic trivially unit-testable and the side-effecting code thin. See `CLAUDE.md`.
- Effect for typed errors. Failures travel as tagged errors (`RangeFetchError` 502, `ZipParseError` 422, `EntryNotFoundError` 404, `R2WriteError` 500, `DecompressError` 422) in the Effect error channel, with no `try/catch` swallowing. The error's HTTP `status` doubles as the API contract: a failed Effect maps straight to a JSON `Response` with that status.
- A Durable Object per job. Extraction is stateful, coordinated, and possibly long-running. A DO gives a single serialization point for progress updates and a durable place (SQLite) to read status from while work proceeds. `waitUntil` is a no-op inside a DO; the instance simply stays alive while the background extraction promise has pending I/O. The same DO also hosts the live-progress WebSocket (Hibernation API, so it can be evicted between progress bursts) and the cleanup alarm (one alarm per DO, deleting the demo-bucket output at the TTL).
- A singleton `Registry` Durable Object (`src/registry/registry.ts`, addressed by `getByName("registry")`) is the source of truth for access codes and usage events, using the same DO+SQLite pattern (no D1). Its pure bits (random code generation and the usage-aggregation reducer) live in `src/registry/codes.ts` with their own unit tests, per the pure-logic-vs-shell split. The wrangler `v2` migration introduces this SQLite class (the `v1` `ExtractJob` migration is untouched).
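The tagged-error-to-status contract can be illustrated in plain TypeScript without the Effect library. This is a simplified stand-in, not the project's code: it uses a lookup table where the text describes a `status` carried on the error itself, and the response body shape is an assumption.

```typescript
// The five error tags and their HTTP statuses, as listed above.
type ErrorTag =
  | "RangeFetchError"
  | "ZipParseError"
  | "EntryNotFoundError"
  | "R2WriteError"
  | "DecompressError";

interface ExtractError {
  _tag: ErrorTag;
  message: string;
}

// Record<ErrorTag, number> makes the compiler reject a missing or extra tag.
const STATUS: Record<ErrorTag, number> = {
  RangeFetchError: 502,
  ZipParseError: 422,
  EntryNotFoundError: 404,
  R2WriteError: 500,
  DecompressError: 422,
};

// Hypothetical mapping from a failure to a status plus JSON body.
function toHttp(err: ExtractError): { status: number; body: string } {
  return {
    status: STATUS[err._tag],
    body: JSON.stringify({ error: err._tag, message: err.message }),
  };
}
```

Because the table is keyed by the tag union, adding a sixth error type without assigning it a status becomes a compile-time error rather than a silent 500.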
rangezip requires the source to honour HTTP range requests (it checks, and
fails with a clear 502 if not). For a source that can't — say a CDN that only
serves the whole object — the technique still works with a copy-in + restitch
front phase, documented here but not built:
- Fetch the source in parallel byte-range chunks (or a sequence of `Range` requests if the origin allows ranges on read but you want to control chunking) and write them as parts of an R2 multipart upload, assembling the archive into R2 first.
- Then operate on the R2 object exactly as above. R2 natively supports range reads, so the index-read / per-file-extract pipeline is unchanged; only the `Source` implementation swaps from "remote URL" to "R2 object".

This trades extra ingest bandwidth and storage for the ability to handle any source, while keeping the memory-bounded extraction core identical.
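How rangezip performs its range-support check is not spelled out here, so the following is only one plausible shape for it: request a tiny range, then require a `206` status with a parseable `Content-Range` header, which also reveals the total archive size. `parseContentRange` and `supportsRanges` are hypothetical names.

```typescript
interface ContentRange {
  start: number;
  end: number;
  total: number | null; // null when the server reports "*" (unknown size)
}

// Parse e.g. "bytes 0-0/53687091200"; null for a missing or malformed header.
function parseContentRange(header: string | null): ContentRange | null {
  if (header === null) return null;
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(header.trim());
  if (m === null) return null;
  const start = Number(m[1]);
  const end = Number(m[2]);
  const total = m[3] === "*" ? null : Number(m[3]);
  if (end < start) return null;
  return { start, end, total };
}

// A server that ignores Range typically answers 200 with the whole body.
function supportsRanges(status: number, contentRange: string | null): boolean {
  return status === 206 && parseContentRange(contentRange) !== null;
}
```

A probe such as `fetch(url, { headers: { Range: "bytes=0-0" } })` would feed `response.status` and `response.headers.get("Content-Range")` into `supportsRanges`, and a `false` result is where the 502 described above would be raised.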

    bun install
    bun run typecheck   # tsc --noEmit (strict + noUncheckedIndexedAccess + verbatimModuleSyntax)
    bun run test        # vitest, inside the real Workers runtime via @cloudflare/vitest-pool-workers
    bun run dev         # wrangler dev (put SESSION_SECRET + ADMIN_KEY in .dev.vars first)
    bun run deploy      # wrangler deploy (create the R2 bucket + set secrets first)

Before dev/deploy, set the auth secrets (see "Auth setup" above): a
gitignored .dev.vars for local dev, wrangler secret put for deploy. After
deploy, open /admin, sign in with ADMIN_KEY, and generate the first access
code — there are no codes until you make one.
Tests build real ZIP fixtures in-memory with fflate and assert the
central-directory parse, the local-header data-offset computation, and a full
parse → extract round-trip that reproduces the original bytes against the genuine
DecompressionStream, FixedLengthStream, R2, and Durable Object SQLite inside
workerd (not Node shims). On top of the core, the suite covers the new pure
logic and shells: session signing/verification (test/auth.test.ts), metrics
math (test/metrics.test.ts, test/metrics-collection.test.ts), BYO destination
helpers (test/destination.test.ts), progress math (test/progress.test.ts),
the auth gate + routing (test/app.test.ts), and the DO's cleanup alarm + report
shape (test/job.test.ts). 105 tests, all run in workerd.
WebSocket completion over a Durable Object can't run under the Vitest pool's per-file storage isolation (a documented pool limitation), so the WS tests assert the gating + the DO's non-upgrade contract; the live upgrade + broadcast is verified via
wrangler dev.
MIT © Nik Divjak. See LICENSE.
Extraction job request body:

    {
      "sourceUrl": "https://example.com/huge.zip",   // must support HTTP range requests
      "files": ["docs/report.pdf", "data/rows.csv"], // OPTIONAL: omit to extract all
      "destination": "demo",                         // "demo" (ephemeral, auto-cleaned) | "byo"
      "byo": {                                       // REQUIRED when destination == "byo"
        "endpoint": "https://acct.r2.cloudflarestorage.com",
        "region": "auto",
        "bucket": "my-bucket",
        "accessKeyId": "…",
        "secretAccessKey": "…",
        "prefix": "exports/"                         // optional in-bucket prefix (the user's bucket layout)
      }
    }