English · 中文
Browse fully interactive, exported flipbooks right in your browser: click hotspots to drill in, no install needed.
✨ Click anywhere on a generated image. The backend infers what you clicked, searches the web when useful, generates a child diagram, and links it back. A flipbook of explorable knowledge, one click at a time.
💡 Inspired by, and a re-implementation of, the product idea behind flipbook.page. Credit to the original team for the click-to-explore canvas concept.
A long-running web product: Express + SSE backend, Vite + React + TS frontend, a pluggable multi-model image pipeline, web-search-augmented planning, per-node concurrency, read-only share links, fullscreen casting, and a fully responsive mobile layout.
Most "AI drawing" demos stop at one image. This one turns each image into a playable knowledge surface:
- 🖱️ Long-press anywhere on a picture: the model reads what's under your finger, decides whether the topic needs fresh sources, optionally hits the web, then paints a brand-new annotated diagram zoomed into that concept.
- Encyclopedia-style output: every node ships with a 150–220-character caption and 20–40 in-image labels (place names, dates, numbers…), all OCR'd back into a transparent text layer so you can drag-select and copy any fragment straight off the picture.
- 🌳 Infinite tree of canvases: every click spawns a child node; the whole exploration tree is persisted, shareable, and replayable.
- ⏳ Watch it think: a node is saved and linkable the instant you click, then its title / caption / scene prompt type out live; share the link and a friend on another device watches the same stream fill in.

| Click-to-explore | End-to-end pipeline | Gallery + canvas |
|---|---|---|
| long-press any region to drill in | search → planner → ImageGen → drill-down | every canvas is persisted, shareable, replayable |

- 🖱️ Click-to-explore: long-press (1 s) anywhere on a node's image. The backend infers the label, decides whether to web-search, then generates a child node. Spatial + semantic dedup means clicking the same region again jumps straight into the existing child.
- ⏳ Live-streaming, linkable generating nodes: the moment you click, the child node is persisted under its final id and its parent hotspot links to it immediately, so it's shareable / openable on any device while still generating. Its title, caption and image prompt type out live (token-streamed via SSE), the catalog shows a spinner row, and a refresh or cross-device open resumes the stream from the on-disk snapshot. On failure the half-built node is auto-deleted.
- 🌫️ Progressive image loading: every PNG gets blur → thumbnail → medium → full variants (via sharp). Gallery cards blur up and the canvas swaps to full-res when ready: no broken-image flashes, fast first paint.
- 🖼️ Portrait & landscape canvases: pick orientation per canvas (mobile portrait viewports default to portrait); filter the gallery by All / Landscape / Portrait with the choice synced to the URL.
- ⚡ Per-node parallelism: up to 4 different spots in parallel per parent (configurable). Each in-flight click streams a phase chip (Inferring label… → Searching the web… → Generating image…) on the hotspot. Hit the cap and the cursor changes to a "blocked" symbol.
- Encyclopedia register: the planner produces 150–220-character captions with 20–40 in-image text fragments, like reading a richly annotated diagram in a children's encyclopedia. Long captions clamp to 2 lines with a 查看更多 / Show more toggle.
- Web-search augmented: a "decide-then-search" gate asks the LLM whether a topic benefits from up-to-date sources. When yes, results are fetched and fed into the planner; sources are persisted to disk + DB and rendered as a hover badge over the canvas.
- Resilient SSE: Last-Event-ID replay + per-job snapshot resume. A dropped connection or page refresh mid-generation reconnects and catches up on everything it missed, including the in-flight typewriter.
- 🎬 Scene transitions: drill-in / drill-out / fade animations make navigation feel like a zooming flipbook rather than a page swap.
- Share as preview: any canvas → read-only `?s=<token>` URL. Viewers can navigate and watch live SSE updates from in-flight generations, but cannot trigger new ones.
- 📺 Fullscreen casting: ⛶ requests browser fullscreen; toggle the chrome (breadcrumb + caption + hint) on/off for a clean projection view.
- 🔤 Selectable in-image text: every label baked into the diagram is OCR'd with Apple Vision (`zh-Hans` + `en-US`) and overlaid as invisible HTML, so users can drag-select and Cmd-C copy any text directly off the picture while the painted pixels remain the visual ground truth.
- Voice narration: each node's title + caption is synthesised to speech with Microsoft Edge neural voices (msedge-tts: free, no API key). Pick a character voice per flipbook from the live Edge catalogue (filtered to the UI language); the picker reads "晓晓 · 女声" (Xiaoxiao · Female) instead of raw locale IDs. Switching voices re-narrates the whole book and restarts in-flight playback. Auto-narration is on by default (toggleable) and is bundled into exports so the static site speaks offline too.
- 📱 Mobile responsive: sticky top bar that pins on scroll, single-column gallery, pinch-zoom image lightbox, smaller hotspots and pending bubbles.
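
The Last-Event-ID replay mentioned under "Resilient SSE" can be pictured with a small in-memory event log. This is only a sketch: `EventLog`, `replayAfter`, and `toSseFrame` are hypothetical names, and the real server also resumes from per-job snapshots on disk, which is not modelled here. Browsers' `EventSource` sends the last received `id` back in the `Last-Event-ID` header when it reconnects, which is what makes this kind of catch-up possible.

```javascript
// Hypothetical per-job event log; names are illustrative, not the project's API.
class EventLog {
  constructor(limit = 1000) {
    this.limit = limit; // keep at most this many recent events
    this.nextId = 1;
    this.events = []; // [{ id, type, data }]
  }

  // Record an event and return it with its assigned id.
  append(type, data) {
    const event = { id: this.nextId++, type, data };
    this.events.push(event);
    if (this.events.length > this.limit) this.events.shift();
    return event;
  }

  // Events the client has not seen yet, given its Last-Event-ID header value.
  replayAfter(lastEventId) {
    const last = Number.parseInt(lastEventId ?? '', 10);
    if (Number.isNaN(last)) return [...this.events]; // first connect: send the whole buffer
    return this.events.filter((e) => e.id > last);
  }
}

// Serialize one event in the text/event-stream wire format.
function toSseFrame({ id, type, data }) {
  return `id: ${id}\nevent: ${type}\ndata: ${JSON.stringify(data)}\n\n`;
}

const log = new EventLog();
log.append('phase', { phase: 'planning' });
log.append('token', { text: 'Wood' });
log.append('token', { text: 'pecker' });

// A client reconnects after having seen only event 1.
const missed = log.replayAfter('1');
```

On reconnect, a handler along these lines would write `missed.map(toSseFrame)` to the response before subscribing the client to live events.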
Flipbook Canvas is built around a pluggable multimodal pipeline. Five modalities are wired end-to-end:
| Modality | What it does | Pluggable into |
|---|---|---|
| Text / JSON LLM | planner, click-label inference, decide-then-search verdict | any chat-completion-style model |
| Image generation | turns a structured prompt into a 2752×1536 annotated diagram with baked-in text labels | OpenAI, Nano Banana (Gemini), Seedream/Seeddance, or your own provider |
| Web search | rephrased query → top-N normalized results → planner context + sources panel | any search backend |
| OCR (Apple Vision) | zh-Hans + en-US recognition over every generated PNG, projected as a selectable HTML overlay | local, no API keys needed |
| TTS (Edge neural voices) | synthesises each node's title + caption to an mp3, per-flipbook character voice | Microsoft Edge online voices via msedge-tts, no API key |

The image layer is a provider chain (`IMAGE_PROVIDER=...,svg`): the first
enabled provider wins, and `svg` is always appended last as a placeholder so the
UI never breaks. Adding a new model is a single file:

    // server/src/generation/providers/<name>.js
    export default {
      name: 'my-model',
      // The provider is only considered when its API key is configured.
      enabled(config) { return Boolean(config.MY_API_KEY); },
      async generate({ imagePrompt, outputDir, size, title, hash, onEvent }) {
        // call your model, write <hash>.png into outputDir, push phase events
      },
    };

Out of the box:

| Provider | Trigger to enable | Status |
|---|---|---|
| `openai` | `OPENAI_API_KEY` set | stub: implement in `providers/openai.js` |
| `nanobanana` | `NANOBANANA_API_KEY` or `GEMINI_API_KEY` | stub |
| `seeddance` | `SEEDDANCE_API_KEY` or `ARK_API_KEY` | stub |
| `codebuddy` | `ENABLE_CODEBUDDY=1` | ✅ reference impl (used in the demo gif) |
| `svg` | always | ✅ fallback placeholder |

🎯 The reference implementation wires the `codebuddy` CLI as a subprocess driver for planner / ImageGen / WebSearch. Subprocess lifecycle (concurrency cap, per-call timeouts, single retry, file-size sanity check on generated PNGs, graceful degradation) lives in `server/src/codebuddyClient.js` and is a useful template if you ever shell out to any CLI-based model.
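
To make the "first enabled provider wins, `svg` last" rule concrete, here is a minimal sketch of how a chain string might be resolved. `pickProvider` and the stand-in `registry` are hypothetical and only mirror the `{ name, enabled }` shape from the provider example; the actual orchestrator in `image.js` may differ, for instance by falling through to the next provider when generation fails.

```javascript
// Hypothetical helper: resolve an IMAGE_PROVIDER chain into the provider that will run.
function pickProvider(chainSpec, registry, config) {
  const names = String(chainSpec || '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s && s !== 'svg');
  names.push('svg'); // svg always goes last as the placeholder fallback

  for (const name of names) {
    const provider = registry[name];
    if (provider && provider.enabled(config)) return provider;
  }
  throw new Error('no image provider is enabled'); // only reachable if svg is not registered
}

// Stand-in providers with simplified enable checks (assumptions, not the real modules).
const registry = {
  openai: { name: 'openai', enabled: (c) => Boolean(c.OPENAI_API_KEY) },
  codebuddy: { name: 'codebuddy', enabled: (c) => c.ENABLE_CODEBUDDY === '1' },
  svg: { name: 'svg', enabled: () => true },
};

// openai has no key, so the chain falls through to codebuddy.
const chosen = pickProvider('openai,codebuddy', registry, { ENABLE_CODEBUDDY: '1' });
// Nothing configured: the appended svg placeholder is used.
const fallback = pickProvider('openai', registry, {});
```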
Type 啄木鸟 (woodpecker) into the top bar and watch the entire pipeline run:
decide-then-search → planner → ImageGen → click to drill into the tongue
anatomy / nest cavity / ant-foraging zones, each spawning its own annotated
diagram with its own sources.

    .
    ├── prompts/                      # system / planner / click-label / image-prompt / decide-search
    ├── scripts/
    │   ├── sync-prompts.mjs
    │   ├── serve-preview.mjs         # build + serve one canvas's static preview
    │   └── example-doc-publish.mjs   # publish canvases to GitHub Pages
    ├── server/
    │   └── src/
    │       ├── routes/               # canvas, click, events (SSE), assets, share
    │       ├── export/               # static-site exporter + viewer template
    │       │   ├── buildExport.js    # buildCanvasSite / buildCanvasExport (zip)
    │       │   └── template/         # self-contained index.html + viewer.js/css
    │       ├── lib/zip.js            # dependency-free ZIP writer
    │       ├── generation/
    │       │   ├── pipeline.js       # generateRoot + expandFromClick + per-node concurrency
    │       │   ├── decideSearch.js   # decide-then-search gate
    │       │   ├── webSearch.js      # WebSearch subprocess + result normaliser
    │       │   ├── queue.js          # PerCanvasQueue / Semaphore / PerKeySemaphore
    │       │   ├── planner.js / clickLabel.js
    │       │   ├── image.js          # provider-chain orchestrator
    │       │   └── providers/        # codebuddy, openai, nanobanana, seeddance, svg
    │       ├── db/                   # Sequelize models + hydrateFromDisk
    │       ├── store/                # filesystem layer
    │       ├── sse/                  # event hub
    │       └── codebuddyClient.js    # reference CLI-subprocess wrapper
    └── web/                          # Vite + React + TS

- Filesystem (source of truth for big artifacts): `server/data/canvases/<id>/{data/tree.json, data/nodes/<hash>.json, images/<hash>.{png,svg}, manifest.json}`.
- SQLite (`server/data/flipbook.sqlite`, via Sequelize): metadata index with Canvases / Nodes / Hotspots / ShareLinks / Sources tables. Drives the gallery, spatial dedup, share lookup, and sources hover panel. On boot the server runs `hydrateFromDisk()` to rebuild this index if it's missing.
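
The on-disk layout above can be summarised as a small path helper. This is an illustrative sketch only: `canvasPaths` is a hypothetical name, and the real `store/` layer may build paths differently (for example with `path.join`).

```javascript
// Hypothetical helper that spells out the documented per-canvas layout.
function canvasPaths(dataDir, canvasId) {
  const root = `${dataDir}/canvases/${canvasId}`;
  return {
    root,
    manifest: `${root}/manifest.json`,
    tree: `${root}/data/tree.json`,
    // One JSON file per node, keyed by its hash.
    node: (hash) => `${root}/data/nodes/${hash}.json`,
    // Images are stored as png (generated) or svg (placeholder).
    image: (hash, ext = 'png') => `${root}/images/${hash}.${ext}`,
  };
}

const paths = canvasPaths('server/data', 'abc123');
```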

Install dependencies and start both dev servers:

    npm install
    npm run dev   # server on :8787 + Vite on :5173 in parallel

Open http://127.0.0.1:5173.
By default `ENABLE_CODEBUDDY=0` (stub mode: fast, SVG placeholders, no LLM).
Set `ENABLE_CODEBUDDY=1` to use the reference CLI provider for planner +
ImageGen + WebSearch:

    ENABLE_CODEBUDDY=1 npm run dev:server

⏱️ With the reference provider, each node takes ~70–95 s end-to-end (planner ~25 s + ImageGen ~50–60 s including cold start; +5–15 s if web search runs). ImageGen produces a 2752×1536 PNG (~6 MB).
Up to 4 click expansions per parent node run in parallel; excess clicks
queue. Different parents and different canvases run independently. A
per-parent write lock serializes only the short read-modify-write of the
parent node JSON. Tunable via MAX_PARALLEL_CLICKS_PER_NODE (default 4).
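
The per-parent cap described above behaves like a semaphore keyed by parent node id. The sketch below is a simplified stand-in, not the project's `PerKeySemaphore` from `queue.js`: the class body, `running`, and `queued` are assumptions made for illustration, and the short per-parent write lock is left out.

```javascript
// Hypothetical per-key semaphore: at most `limit` holders per key, FIFO waiters.
class PerKeySemaphore {
  constructor(limit) {
    this.limit = limit;
    this.active = new Map(); // key -> number of running holders
    this.waiters = new Map(); // key -> queue of pending grant callbacks
  }

  // Resolves once a slot for `key` is free; the resolved value releases the slot.
  acquire(key) {
    return new Promise((resolve) => {
      const grant = () => {
        this.active.set(key, (this.active.get(key) || 0) + 1);
        let released = false;
        resolve(() => {
          if (released) return; // releasing twice is a no-op
          released = true;
          this.release(key);
        });
      };
      if ((this.active.get(key) || 0) < this.limit) {
        grant();
      } else {
        if (!this.waiters.has(key)) this.waiters.set(key, []);
        this.waiters.get(key).push(grant);
      }
    });
  }

  // Free one slot for `key` and hand it to the oldest waiter, if any.
  release(key) {
    const remaining = (this.active.get(key) || 0) - 1;
    if (remaining > 0) this.active.set(key, remaining);
    else this.active.delete(key);
    const queue = this.waiters.get(key);
    if (queue && queue.length > 0) queue.shift()();
    if (queue && queue.length === 0) this.waiters.delete(key);
  }

  running(key) { return this.active.get(key) || 0; }
  queued(key) { return (this.waiters.get(key) || []).length; }
}

const clicks = new PerKeySemaphore(4); // mirrors MAX_PARALLEL_CLICKS_PER_NODE=4
for (let i = 0; i < 6; i++) clicks.acquire('parent-A'); // 4 run, 2 wait
clicks.acquire('parent-B'); // a different parent is unaffected
```

In an expansion handler, the usual pattern would be `const release = await clicks.acquire(parentId)` followed by the work in a `try` block and `release()` in `finally`, so a failed generation still frees its slot.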
A pre-planner gate (`decideSearch.js` + `prompts/decide-search.md`) calls the
LLM with the proposed subject and asks: do recent / authoritative sources
materially improve this node? The default leans yes; only clearly
abstract / timeless subjects skip search. When yes:
- The web-search backend runs with the rephrased query.
- Results are normalised into `{title, url, snippet, source}`.
- Top results are passed into the planner prompt.
- Sources are persisted both into `nodes/<hash>.json` and into the SQLite `Sources` table.
- The frontend renders a sources badge near the breadcrumb. Hover to see a popover with the source list (220 ms grace period so the popover is reachable with the mouse).
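
The normalisation step in the list above could look roughly like the following. Everything here is an assumption for illustration: `normaliseResults` is a hypothetical name, the raw field names (`link`, `description`) stand in for whatever a given search backend returns, and treating `source` as the hostname is a guess at what the field holds.

```javascript
// Hypothetical normaliser: raw backend hits -> { title, url, snippet, source }.
function normaliseResults(rawResults, topN = 5) {
  const seen = new Set();
  const out = [];
  for (const raw of rawResults || []) {
    let parsed;
    try {
      parsed = new URL(raw.url || raw.link || '');
    } catch {
      continue; // skip hits without a valid absolute URL
    }
    if (seen.has(parsed.href)) continue; // drop exact duplicate URLs
    seen.add(parsed.href);
    out.push({
      title: String(raw.title || '').trim() || parsed.hostname,
      url: parsed.href,
      // Collapse whitespace so snippets fit cleanly into the planner prompt.
      snippet: String(raw.snippet || raw.description || '').replace(/\s+/g, ' ').trim(),
      source: parsed.hostname.replace(/^www\./, ''), // assumed meaning of `source`
    });
    if (out.length === topN) break;
  }
  return out;
}

const sources = normaliseResults([
  { title: ' Woodpecker ', link: 'https://www.example.org/woodpecker', description: 'Tongue\n anatomy.' },
  { title: 'Duplicate', url: 'https://www.example.org/woodpecker' },
  { title: 'No URL' },
]);
```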
Any canvas can be exported as a fully self-contained static site: a
read-only replica of the preview with all data and images inlined, openable
directly from file:// with zero network requests.
- In-app: the ··· More menu → Export preview downloads a `.zip` (`index.html` / `viewer.js` / `viewer.css` / `data.js` + `images/`).
- Serve one locally for quick viewing in a browser:

      npm run serve-preview -- <canvasId> [--lang en] [--port 8088]

  Builds the static site to a temp dir, starts a tiny static HTTP server, and prints the URL. Ctrl-C cleans up.
- Publish to GitHub Pages (one or more canvases: a routed gallery landing page at `/`, each example at `/<canvasId>/`):

      npm run example:publish -- <canvasId> [<canvasId> ...] [--lang en] [--no-push]

  Builds each canvas, regenerates the landing index, and pushes to the `gh-pages` branch (accumulating: re-publishing a new id keeps the others). See the result at https://imcuttle.github.io/flipbook-app/.

The exported viewer mirrors the live read-only preview: image stage with collision-avoiding hotspot labels, leader lines, selectable OCR text overlay, caption, breadcrumb, catalog and sources, plus progressive image loading, scene transitions, and next-layer image prefetch. Per-node narration mp3s are bundled too, so the static site auto-narrates offline (toggleable in the top bar). It never calls the server.

- `POST /api/canvas/:id/share` → `{token, url}`. Reuses an existing token for the same canvas.
- `GET /api/share/:token` → `{canvasId, topic, readOnly:true}`.
- Frontend: opening `…?s=<token>` puts the UI in read-only preview mode: no topic input, no clicks on the image, a "Preview" badge in the corner. SSE stays connected, so a viewer watching mid-generation sees images stream in in real time.
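
The token-reuse behaviour of the two share endpoints can be sketched with an in-memory stand-in for the `ShareLinks` table. The class, the token format (16 random bytes, base64url), and the URL shape are assumptions; the real response also carries `topic`, which is omitted here.

```javascript
const { randomBytes } = require('node:crypto');

// Hypothetical in-memory stand-in for the ShareLinks table.
class ShareLinks {
  constructor() {
    this.byCanvas = new Map(); // canvasId -> token
    this.byToken = new Map(); // token -> canvasId
  }

  // Mirrors POST /api/canvas/:id/share: reuse the canvas's token if one exists.
  create(canvasId) {
    let token = this.byCanvas.get(canvasId);
    if (!token) {
      token = randomBytes(16).toString('base64url'); // assumed token format
      this.byCanvas.set(canvasId, token);
      this.byToken.set(token, canvasId);
    }
    return { token, url: `/?s=${token}` }; // URL shape is an assumption
  }

  // Mirrors GET /api/share/:token: resolve to a read-only view, or null if unknown.
  resolve(token) {
    const canvasId = this.byToken.get(token);
    return canvasId ? { canvasId, readOnly: true } : null;
  }
}

const links = new ShareLinks();
const first = links.create('canvas-1');
const second = links.create('canvas-1'); // same canvas: same token comes back
```

Keeping one token per canvas means a link that was already handed out keeps working no matter how many times the share action is triggered.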

- The ⛶ button in TopBar requests browser fullscreen; it uses CSS-only fullscreen on iOS Safari, where the API isn't supported.
- A second button (visible while in fullscreen) toggles the breadcrumb + caption + hint. Useful for clean projection.
- The long-press hint is suppressed in fullscreen by default; the press still works.

Reset generated state:

    npm run clean:data   # reset server/data (all canvases)
    npm run clean:dist   # reset web/dist
    npm run clean        # both

Build and serve for production:

    npm run build   # builds web/dist
    npm start       # serves web/dist + API from :8787

Give the app a stable hostname (e.g. http://flipbook.lan) reachable from any
device on your LAN, with no port number needed. Uses dnsmasq (resolves the
domain → this machine's LAN IP) + Caddy (reverse-proxies :80 to the app).

    npm run lan:up     # flipbook.lan → dev :5173 (preferred), falls back to prod :8787
    npm run lan:down   # tear it down
    # custom: scripts/lan-domain-setup.sh <domain> <devPort> <prodPort>
    bash scripts/lan-domain-setup.sh studio.lan 5173 8787

The proxy tries the dev port (5173) first and automatically falls back to
the prod port (8787) when dev isn't running (passive health check, 3 s
blacklist). So `npm run dev` and `npm start` both work behind the same domain.
`lan:up` installs dnsmasq/caddy via Homebrew if missing and needs sudo
(dnsmasq binds port 53, Caddy binds port 80). It only configures this machine; to
reach the domain from other devices, point their DNS at this machine's LAN IP
(router DHCP DNS, per-device DNS, or a hosts entry; the script prints the
exact options and your IP).

| Var | Default | Purpose |
|---|---|---|
| `PORT` | `8787` | server port |
| `HOST` | `127.0.0.1` | server bind |
| `DATA_DIR` | `server/data` | canvas state on disk |
| `PROMPTS_DIR` | `prompts` | prompt files |
| `DB_PATH` | `<DATA_DIR>/flipbook.sqlite` | SQLite file |
| `MAX_PARALLEL_CLICKS_PER_NODE` | `4` | concurrent click expansions per parent |
| `MAX_PARALLEL_CODEBUDDY` | `20` | concurrent planner/LLM subprocesses |
| `MAX_PARALLEL_IMAGE` | `20` | concurrent image-generation jobs (separate pool from the LLM limit) |
| `PLANNER_TIMEOUT_MS` | `90000` | per-call planner timeout |
| `IMAGE_TIMEOUT_MS` | `180000` | per-call ImageGen timeout |
| `WEB_SEARCH_TIMEOUT_MS` | `60000` | per-call WebSearch timeout |
| `IMAGE_PROVIDER` | `codebuddy` | provider chain (e.g. `openai,nanobanana,svg`) |
| `IMAGE_SIZE` | `1920x1080` | requested size (provider may pick its own) |
| `ENABLE_CODEBUDDY` | `0` | flip to `1` to enable the reference CLI provider |
| `ENABLE_WEB_SEARCH` | follows `ENABLE_CODEBUDDY` | force-disable with `0` |
| `ENABLE_OCR` | `1` | run Apple Vision OCR on each generated PNG to produce a selectable text overlay; set to `0` to skip |
| `OCR_TIMEOUT_MS` | `25000` | per-call OCR timeout |
| `OCR_MIN_CONFIDENCE` | `0.4` | drop OCR spans below this confidence |
| `ENABLE_AUDIO` | `1` | synthesise Edge neural-voice narration (mp3) for each node; set to `0` to skip. Non-blocking: failures never stop image generation |
| `AUDIO_TIMEOUT_MS` | `30000` | per-call TTS synthesis timeout |
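
As a reading aid for the table, here is a sketch of how a few of these variables and their documented defaults might be resolved, including the "follows `ENABLE_CODEBUDDY`" rule for web search. `loadConfig` and the property names are hypothetical; the project's actual config module is not shown in this README.

```javascript
// Hypothetical loader covering a subset of the documented variables.
function loadConfig(env) {
  // Parse an integer variable, falling back to the documented default.
  const int = (name, fallback) => {
    const n = Number.parseInt(env[name] ?? '', 10);
    return Number.isNaN(n) ? fallback : n;
  };

  const dataDir = env.DATA_DIR || 'server/data';
  const enableCodebuddy = env.ENABLE_CODEBUDDY === '1'; // default 0 (stub mode)

  return {
    port: int('PORT', 8787),
    host: env.HOST || '127.0.0.1',
    dataDir,
    dbPath: env.DB_PATH || `${dataDir}/flipbook.sqlite`,
    maxParallelClicksPerNode: int('MAX_PARALLEL_CLICKS_PER_NODE', 4),
    plannerTimeoutMs: int('PLANNER_TIMEOUT_MS', 90000),
    imageTimeoutMs: int('IMAGE_TIMEOUT_MS', 180000),
    enableCodebuddy,
    // Web search follows ENABLE_CODEBUDDY and can be force-disabled with 0.
    enableWebSearch: enableCodebuddy && env.ENABLE_WEB_SEARCH !== '0',
  };
}

const config = loadConfig({ ENABLE_CODEBUDDY: '1', DATA_DIR: '/tmp/flipbook' });
```

In a server entry point this would typically be called once as `loadConfig(process.env)`.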


