A Node.js + TypeScript service that ingests OHLCV candle data from market data providers into PostgreSQL. It backfills as far back as the provider will serve, detects and fills gaps in already-stored data, and continues polling for newly-closed candles while it runs.
Built as a portfolio project to demonstrate clean backend architecture, data-pipeline thinking, and production-minded engineering. Not a trading system — no orders, positions, or P&L.
- Two providers, one interface. Yahoo Finance (no API key) and Alpaca Markets (free tier) behind a uniform `Provider` contract. Adding a third is a single-file PR.
- WebSocket streaming for Alpaca 1m bars. When the provider supports push-based delivery, the live job consumes `streamLiveCandles` directly; otherwise it falls back to REST polling. The multiplexer client keeps Alpaca's one-connection-per-account limit honest across multiple subscribed symbols.
- Full historical backfill from the provider's earliest-available timestamp. Per-timeframe chunking (and pagination for Alpaca) is the provider's concern; the orchestrator just iterates.
- Idempotent writes. Composite PK on `(provider_id, symbol, timeframe, open_time)` with `INSERT ... ON CONFLICT DO UPDATE`. Re-running is always safe.
- Gap detection and automatic refill. Walks between consecutive stored candles, re-requests the bracketing range, upserts only the candles that actually fill a gap. Buckets the provider confirms are missing (weekends, holidays, exchange downtime) get recorded to `known_gaps` so they aren't re-requested forever.
- Resume-on-restart. `ingestion_state` tracks progress per series; backfill picks up from the last stored `open_time`. If stored data is older than the provider's retention window, the cursor advances with a warning instead of crashing.
- Graceful shutdown. SIGINT/SIGTERM aborts in-flight work cleanly. A second signal exits immediately. For Alpaca streaming, the service also closes its single-account WebSocket on shutdown so a fast restart does not usually trip the provider's connection cap.
- Config validated at boot. zod schema, bulleted error messages, no runtime surprises.
- Strict TypeScript (ESM, `noUncheckedIndexedAccess`, `verbatimModuleSyntax`), ESLint flat config, Prettier, Vitest.
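The gap-refill walk described above can be sketched as a pure function. This is an illustrative sketch, not the project's actual `findGaps`: the function name `findMissingBuckets` and its signature are assumptions; only the fixed-step enumeration between consecutive stored candles comes from the README.

```typescript
// Illustrative sketch of the gap walk (not the real findGaps implementation).
// Given the sorted open_times (ms since epoch) of stored candles for one series
// and the bucket width, return every bucket that should sit between two stored
// neighbors but doesn't.
function findMissingBuckets(storedOpenTimes: number[], bucketMs: number): number[] {
  const missing: number[] = [];
  for (let i = 1; i < storedOpenTimes.length; i++) {
    const prev = storedOpenTimes[i - 1]!; // non-null under noUncheckedIndexedAccess
    const next = storedOpenTimes[i]!;
    // Any range wider than one bucket contains at least one missing candle.
    for (let t = prev + bucketMs; t < next; t += bucketMs) {
      missing.push(t);
    }
  }
  return missing;
}
```

The fixed step assumes uniform bucket widths; buckets the provider then confirms empty (weekends, holidays) would be recorded to `known_gaps` instead of being re-requested on every scan.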
```mermaid
flowchart TD
    CLI["src/index.ts<br/>composition root"] --> Orch[Orchestrator]
    Orch -->|per series| GapScan[GapScanner]
    Orch -->|per series| Backfill[BackfillJob]
    Orch -->|per series, concurrent| Live[LiveIngestJob]
    GapScan --> P[Provider interface]
    Backfill --> P
    Live --> P
    P -.-> Yahoo[YahooProvider<br/>REST, chunked]
    P -.-> Alpaca[AlpacaProvider<br/>REST, paginated]
    GapScan --> Repos[Repositories]
    Backfill --> Repos
    Live --> Repos
    Repos --> DB[(PostgreSQL)]
```
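The Repositories-to-PostgreSQL edge ultimately issues the idempotent upsert from the feature list. A minimal sketch of the statement text follows; the helper name and the exact column list beyond the composite PK are assumptions, with only the conflict target taken from the README.

```typescript
// Sketch of the idempotent candle upsert. The real repository code may differ
// in column list and parameter wiring; the composite PK matches the README.
function candleUpsertSql(): string {
  return `
    INSERT INTO candles
      (provider_id, symbol, timeframe, open_time, close_time,
       open, high, low, close, volume)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
    ON CONFLICT (provider_id, symbol, timeframe, open_time)
    DO UPDATE SET
      close_time = EXCLUDED.close_time,
      open   = EXCLUDED.open,
      high   = EXCLUDED.high,
      low    = EXCLUDED.low,
      close  = EXCLUDED.close,
      volume = EXCLUDED.volume
  `;
}
```

Because the conflict target is the full series identity plus `open_time`, replaying any range of candles rewrites rather than duplicates, which is what makes re-running backfill and gap refill always safe.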
For each configured (provider, symbol, timeframe):
- Gap scan. Walks consecutive stored candles; any range wider than one bucket gets re-fetched. Only the candles that actually fill a gap get upserted. Unfillable buckets go to `known_gaps`.
- Backfill. Resumes from `findLatestOpenTime` + 1 bucket (or the provider's earliest-available on a fresh series). Streams candles in batches, upserts, updates progress.
- Live polling. One loop per series, concurrent across series. Poll interval is `max(15s, min(tf/3, 5min))`. Only persists candles whose `close_time < now` so unclosed current-bucket candles don't corrupt the series. Transient errors are logged and retried; the loop survives them.
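The live-polling cadence and the closed-candle filter can be sketched directly from the formulas above; the function names here are illustrative, not the project's actual identifiers.

```typescript
// Sketch of the live-poll cadence and closed-candle filter described above.
const MIN_POLL_MS = 15_000;  // 15s floor
const MAX_POLL_MS = 300_000; // 5min ceiling

// Poll interval: max(15s, min(tf/3, 5min)), with tf in milliseconds.
function pollIntervalMs(timeframeMs: number): number {
  return Math.max(MIN_POLL_MS, Math.min(timeframeMs / 3, MAX_POLL_MS));
}

// Only fully closed candles are persisted, so the still-forming current
// bucket never corrupts the stored series.
function isClosed(closeTimeMs: number, nowMs: number): boolean {
  return closeTimeMs < nowMs;
}
```

For a 1d timeframe (86,400,000 ms) this yields 300,000 ms, matching the `pollIntervalMs: 300000` visible in the sample log output below.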
See docs/architecture.md for the detailed lifecycle.
Prerequisites: Node 22.9+, Docker Desktop (or Podman Desktop).
```shell
git clone <this-repo>
cd market-ingest
npm install

# Start Postgres
docker compose up -d postgres

# Configure. Edit SYMBOLS / TIMEFRAMES / PROVIDER as desired.
cp .env.example .env

# Apply schema
npm run migrate

# Run
npm run dev
```

With the default config (`PROVIDER=yahoo`, `SYMBOLS=AAPL`, `TIMEFRAMES=1d`) you'll see roughly 11,400 daily candles backfilled from AAPL's December 1980 IPO through today in a few seconds, then the service enters live polling mode.
Shut down with Ctrl+C — in-flight writes drain and the pool closes cleanly.
```
[09:29:12.296] INFO: market-ingest starting
    service: "market-ingest"
    nodeEnv: "development"
    provider: "yahoo"
    symbols: ["AAPL"]
    timeframes: ["1d"]
[09:29:12.820] INFO: backfill: starting
    service: "market-ingest"
    component: "backfill"
    symbol: "AAPL"
    timeframe: "1d"
    startTime: "1980-12-12T14:30:00.000Z"
    batchSize: 1000
[09:29:15.559] INFO: backfill: complete
    service: "market-ingest"
    component: "backfill"
    symbol: "AAPL"
    timeframe: "1d"
    totalUpserted: 11439
[09:29:15.559] INFO: orchestrator: backfill + gap scan phase complete
    service: "market-ingest"
[09:29:15.559] INFO: orchestrator: entering live phase
    service: "market-ingest"
[09:29:15.560] INFO: live ingest: starting
    service: "market-ingest"
    component: "live-ingest"
    symbol: "AAPL"
    timeframe: "1d"
    pollIntervalMs: 300000
```
Dev mode uses pino-pretty; production (NODE_ENV=production) emits the same fields as JSON lines for log aggregators.
All configuration is via environment variables, validated on boot. See .env.example for the reference.
| Variable | Required | Default | Notes |
|---|---|---|---|
| `DATABASE_URL` | yes | — | Postgres connection string |
| `SYMBOLS` | yes | — | Comma-separated, uppercased on load |
| `TIMEFRAMES` | yes | — | Comma-separated (1m, 5m, 1h, 1d, 1w, …) |
| `PROVIDER` | no | `yahoo` | `yahoo` or `alpaca` |
| `LOG_LEVEL` | no | `info` | `fatal`/`error`/`warn`/`info`/`debug`/`trace` |
| `NODE_ENV` | no | `development` | Toggles pino-pretty vs JSON logs |
| `BACKFILL_BATCH_SIZE` | no | `1000` | Rows per upsert |
| `ALPACA_KEY_ID` | if `PROVIDER=alpaca` | — | Free at alpaca.markets |
| `ALPACA_SECRET_KEY` | if `PROVIDER=alpaca` | — | |
| `ALPACA_DATA_URL` | no | `https://data.alpaca.markets` | Override for testing |
| `ALPACA_FEED` | no | `iex` | `iex` (free) or `sip` (paid, full US market) |
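The boot-time validation mentioned above ("zod schema, bulleted error messages") might look roughly like the following config fragment. This is a hedged sketch, not the real schema in `src/config/index.ts`: the schema shape, transforms, and error formatting are assumptions drawn from the table above.

```typescript
// Sketch of fail-fast config validation with zod; the project's actual schema
// will differ in detail. Variable names and defaults mirror the table above.
import { z } from "zod";

const EnvSchema = z.object({
  DATABASE_URL: z.string().min(1),
  SYMBOLS: z.string().transform((s) => s.split(",").map((x) => x.trim().toUpperCase())),
  TIMEFRAMES: z.string().transform((s) => s.split(",").map((x) => x.trim())),
  PROVIDER: z.enum(["yahoo", "alpaca"]).default("yahoo"),
  LOG_LEVEL: z.enum(["fatal", "error", "warn", "info", "debug", "trace"]).default("info"),
  BACKFILL_BATCH_SIZE: z.coerce.number().int().positive().default(1000),
});

function loadConfig(env: NodeJS.ProcessEnv) {
  const parsed = EnvSchema.safeParse(env);
  if (!parsed.success) {
    // One bullet per problem, thrown before anything touches the network or DB.
    const bullets = parsed.error.issues
      .map((i) => `  - ${i.path.join(".")}: ${i.message}`)
      .join("\n");
    throw new Error(`Invalid configuration:\n${bullets}`);
  }
  return parsed.data;
}
```

Collecting every issue into one bulleted message means a bad `.env` surfaces all problems in a single run instead of one at a time.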
- Implement the `Provider` interface in `src/providers/<name>/`.
- Write a mapping function that produces a canonical `Candle` — see `src/providers/alpaca/mapping.ts` for the template. The comment block at the top lists the five normalization invariants every mapper must satisfy (UTC `Date` timestamps, 8-decimal string prices, computed `closeTime`, canonical timeframe code, `providerId` from context).
- Register the provider in `src/providers/registry.ts` and add the name to the `ProviderName` union in `src/config/index.ts`.
- Seed the provider in a new migration (`migrations/000N_provider_seed.sql`).
- Add unit tests for timeframe translation and mapping; an integration test against the real upstream is nice-to-have.
Edit .env. The service parses, validates, and fails fast on bad values.
```shell
npm run dev          # Run via tsx (dev-friendly; no build required)
npm run dev:watch    # tsx watch — restarts on source changes
npm run build        # tsc -p tsconfig.build.json
npm start            # Run compiled dist/ (needs `npm run build` first)
npm run start:build  # Build then run
```

All destructive scripts (`delete`, `reset-db`) run as a preview by default and require `--yes` to actually mutate data.
```shell
npm run migrate                       # Apply pending SQL migrations
npm run health                        # One-screen summary of every series
npm run count -- <SYM> <TF>           # Per-provider detail for one series
npm run recent -- <SYM> <TF> [N]      # Last N candles (default 10, max 1000)
npm run gaps -- <SYM> <TF> [PROV]     # Missing buckets the scanner would refill
npm run delete -- <SYM> <TF> [--yes]  # Wipe one series
npm run reset-db [-- --yes]           # Truncate all data tables
```

The `--` separates npm's own args from args passed to the script.
```shell
npm run typecheck     # tsc --noEmit
npm run lint          # eslint .
npm run lint:fix      # eslint . --fix
npm run format        # prettier --write .
npm run format:check  # prettier --check .

npm test              # vitest run (requires Postgres + applied migrations)
npm run test:watch    # vitest watch
```

Integration tests hit the real Postgres started by docker compose. `vitest.config.ts` sets `fileParallelism: false` so the integration tests don't step on each other's `TRUNCATE`s. If tests fail with a cryptic `Cannot read properties of undefined (reading 'config')`, clear the Vite cache: `rm -rf node_modules/.vite node_modules/.vitest`.
- `Cannot read properties of undefined (reading 'config')` from vitest — stale Vite transform cache. `rm -rf node_modules/.vite node_modules/.vitest && npm test`.
- Migration fails with FK violation on `providers` — you're upgrading from a pre-0003 snapshot that had a `binance` seed. Migration `0003_provider_seed.sql` renames it to `yahoo` safely; just run `npm run migrate`.
- Alpaca returns 401 — check `ALPACA_KEY_ID`/`ALPACA_SECRET_KEY`. Paper-trading keys work for IEX market data.
- Alpaca WebSocket returns `406 connection limit exceeded` — Alpaca allows one bars-stream connection per account. The service now closes that socket during shutdown, but a very fast restart can still briefly race a server-side drain. Wait a few seconds and retry; the live loop also logs this case explicitly and backs off automatically.
- Yahoo returns "1m data not available" — Yahoo caps 1m history at roughly 7 days. The service detects when stored data predates the provider's window and advances the cursor with a warning.
- Docker Desktop pipe error on Windows — Docker Desktop isn't running. Start it.
Honest about what this is not:
- Yahoo is unofficial. Endpoints can change without notice; breakage is usually patched upstream in `yahoo-finance2`.
- No market-calendar awareness. Live polling fires every interval regardless of market hours. Wasted calls, not incorrect behavior.
- WebSocket streaming is Alpaca 1m only. 5m/15m/1h/1d on Alpaca and all Yahoo timeframes stay on REST polling (Yahoo has no WebSocket feed; Alpaca's bars stream is minute-only). Aggregating trade ticks into higher-timeframe bars is a roadmap item.
- Single-process. Two instances pointing at the same DB would race on `ingestion_state.last_live_tick_at`. For horizontal scaling, add advisory locks or a simple worker-partitioning scheme.
- Cross-DST gap detection edge case. The step-based enumeration in `findGaps` may overshoot the real bucket across a DST transition for stock data. Rare; a market-calendar-aware provider method would fix it.
See docs/roadmap.md.
- Architecture — layered design, ingestion lifecycle, error strategy.
- Data model — schema, type choices, the gap-detection query.
- Providers — interface contract, Yahoo + Alpaca specifics, how to add one.
- Roadmap — prioritized next work.
MIT. See LICENSE.