
market-ingest

A Node.js + TypeScript service that ingests OHLCV candle data from market data providers into PostgreSQL. It backfills as far back as the provider will serve, detects and fills gaps in already-stored data, and continues polling for newly closed candles while it runs.

Built as a portfolio project to demonstrate clean backend architecture, data-pipeline thinking, and production-minded engineering. Not a trading system — no orders, positions, or P&L.

Features

  • Two providers, one interface. Yahoo Finance (no API key) and Alpaca Markets (free tier) behind a uniform Provider contract. Adding a third is a single-file PR.
  • WebSocket streaming for Alpaca 1m bars. When the provider supports push-based delivery, the live job consumes streamLiveCandles directly; otherwise it falls back to REST polling. A multiplexing client shares the one WebSocket connection Alpaca allows per account across all subscribed symbols.
  • Full historical backfill from the provider's earliest-available timestamp. Per-timeframe chunking (and pagination for Alpaca) is the provider's concern; the orchestrator just iterates.
  • Idempotent writes. Composite PK on (provider_id, symbol, timeframe, open_time) with INSERT ... ON CONFLICT DO UPDATE. Re-running is always safe.
  • Gap detection and automatic refill. Walks between consecutive stored candles, re-requests the bracketing range, upserts only the candles that actually fill a gap. Buckets the provider confirms are missing (weekends, holidays, exchange downtime) get recorded to known_gaps so they aren't re-requested forever.
  • Resume-on-restart. ingestion_state tracks progress per series; backfill picks up from the last stored open_time. If stored data is older than the provider's retention window, the cursor advances with a warning instead of crashing.
  • Graceful shutdown. SIGINT/SIGTERM aborts in-flight work cleanly. A second signal exits immediately. For Alpaca streaming, the service also closes its single-account WebSocket on shutdown so a fast restart does not usually trip the provider's connection cap.
  • Config validated at boot. zod schema, bulleted error messages, no runtime surprises.
  • Strict TypeScript (ESM, noUncheckedIndexedAccess, verbatimModuleSyntax), ESLint flat config, Prettier, Vitest.
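The uniform contract behind both providers might look roughly like the sketch below. Only streamLiveCandles is named in this README, so the other members, their signatures, and the exact Candle shape are assumptions for illustration.

```typescript
// Canonical candle shape (sketch): UTC timestamps, fixed-precision string prices.
interface Candle {
  providerId: number;
  symbol: string;
  timeframe: string; // canonical code, e.g. "1m", "1d"
  openTime: Date;    // UTC
  closeTime: Date;   // computed from openTime + timeframe
  open: string;      // 8-decimal strings to avoid float drift
  high: string;
  low: string;
  close: string;
  volume: string;
}

// Sketch of the uniform Provider contract. Method names other than
// streamLiveCandles are assumptions.
interface Provider {
  readonly name: string;
  // Earliest timestamp the upstream will serve for this series.
  earliestAvailable(symbol: string, timeframe: string): Promise<Date>;
  // Historical candles in [from, to); chunking/pagination is internal.
  fetchCandles(
    symbol: string,
    timeframe: string,
    from: Date,
    to: Date,
  ): AsyncIterable<Candle[]>;
  // Optional push-based delivery; REST polling is the fallback.
  streamLiveCandles?(symbols: string[], timeframe: string): AsyncIterable<Candle>;
}
```

A third provider would implement this interface in its own directory and register itself, which is what keeps the "single-file PR" claim above plausible.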

Architecture

flowchart TD
    CLI["src/index.ts<br/>composition root"] --> Orch[Orchestrator]

    Orch -->|per series| GapScan[GapScanner]
    Orch -->|per series| Backfill[BackfillJob]
    Orch -->|per series, concurrent| Live[LiveIngestJob]

    GapScan --> P[Provider interface]
    Backfill --> P
    Live --> P

    P -.-> Yahoo[YahooProvider<br/>REST, chunked]
    P -.-> Alpaca[AlpacaProvider<br/>REST, paginated]

    GapScan --> Repos[Repositories]
    Backfill --> Repos
    Live --> Repos

    Repos --> DB[(PostgreSQL)]
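The idempotent write path can be sketched as one parameterized statement built per batch. Table and column names below are assumptions for illustration; the authoritative schema lives in migrations/.

```typescript
// Build a parameterized multi-row upsert against the composite PK
// (provider_id, symbol, timeframe, open_time). Table/column names are
// assumptions, not copied from the real schema.
interface CandleRow {
  providerId: number;
  symbol: string;
  timeframe: string;
  openTime: Date;
  closeTime: Date;
  open: string;
  high: string;
  low: string;
  close: string;
  volume: string;
}

function buildUpsert(rows: CandleRow[]): { text: string; values: unknown[] } {
  const cols = [
    "provider_id", "symbol", "timeframe", "open_time", "close_time",
    "open", "high", "low", "close", "volume",
  ];
  const values: unknown[] = [];
  const tuples = rows.map((r, i) => {
    values.push(
      r.providerId, r.symbol, r.timeframe, r.openTime, r.closeTime,
      r.open, r.high, r.low, r.close, r.volume,
    );
    const base = i * cols.length;
    // One "($1, $2, ...)" tuple per row, numbering continued across rows.
    return `(${cols.map((_, j) => `$${base + j + 1}`).join(", ")})`;
  });
  const text =
    `INSERT INTO candles (${cols.join(", ")}) VALUES ${tuples.join(", ")} ` +
    `ON CONFLICT (provider_id, symbol, timeframe, open_time) DO UPDATE SET ` +
    `close_time = EXCLUDED.close_time, open = EXCLUDED.open, high = EXCLUDED.high, ` +
    `low = EXCLUDED.low, close = EXCLUDED.close, volume = EXCLUDED.volume`;
  return { text, values };
}
```

Because the conflict target is the full series key, replaying any batch overwrites rows with identical data rather than duplicating them, which is what makes re-running always safe.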

For each configured (provider, symbol, timeframe):

  1. Gap scan. Walks consecutive stored candles; any range wider than one bucket gets re-fetched. Only the candles that actually fill a gap get upserted. Unfillable buckets go to known_gaps.
  2. Backfill. Resumes from findLatestOpenTime + 1 bucket (or the provider's earliest-available on a fresh series). Streams candles in batches, upserts, updates progress.
  3. Live polling. One loop per series, concurrent across series. Poll interval is max(15s, min(tf/3, 5min)). Only persists candles whose close_time < now so unclosed current-bucket candles don't corrupt the series. Transient errors are logged and retried; the loop survives them.
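The poll-interval formula in step 3, as a one-liner:

```typescript
// max(15s, min(timeframe/3, 5min)), in milliseconds.
function pollIntervalMs(timeframeMs: number): number {
  const FIFTEEN_SECONDS = 15_000;
  const FIVE_MINUTES = 300_000;
  return Math.max(FIFTEEN_SECONDS, Math.min(timeframeMs / 3, FIVE_MINUTES));
}
```

For 1m candles this polls every 20s; for 1d it caps at 5 minutes, which matches the pollIntervalMs: 300000 in the sample output below.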

See docs/architecture.md for the detailed lifecycle.
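The step-based walk behind the gap scan can be sketched as below. This is a simplification: per the features list, the real scanner also records unfillable buckets in known_gaps so they are not re-requested, and it carries the DST caveat noted under Limitations.

```typescript
// Enumerate missing bucket open-times between consecutive stored candles.
// Any stride wider than one bucket yields the open-times in between.
function findGaps(openTimes: number[], bucketMs: number): number[] {
  const gaps: number[] = [];
  for (let i = 1; i < openTimes.length; i++) {
    const prev = openTimes[i - 1]!;
    const curr = openTimes[i]!;
    for (let t = prev + bucketMs; t < curr; t += bucketMs) {
      gaps.push(t);
    }
  }
  return gaps;
}
```

Each reported gap becomes a re-fetch of the bracketing range, and only candles that actually land in a gap are upserted.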

Quickstart

Prerequisites: Node 22.9+, Docker Desktop (or Podman Desktop).

git clone <this-repo>
cd market-ingest

npm install

# Start Postgres
docker compose up -d postgres

# Configure. Edit SYMBOLS / TIMEFRAMES / PROVIDER as desired.
cp .env.example .env

# Apply schema
npm run migrate

# Run
npm run dev

With the default config (PROVIDER=yahoo, SYMBOLS=AAPL, TIMEFRAMES=1d) you'll see roughly 11,400 daily candles backfilled from AAPL's December 1980 IPO through today in a few seconds, then the service enters live polling mode.

Shut down with Ctrl+C — in-flight writes drain and the pool closes cleanly.

Sample output

[09:29:12.296] INFO: market-ingest starting
    service: "market-ingest"
    nodeEnv: "development"
    provider: "yahoo"
    symbols: ["AAPL"]
    timeframes: ["1d"]
[09:29:12.820] INFO: backfill: starting
    service: "market-ingest"
    component: "backfill"
    symbol: "AAPL"
    timeframe: "1d"
    startTime: "1980-12-12T14:30:00.000Z"
    batchSize: 1000
[09:29:15.559] INFO: backfill: complete
    service: "market-ingest"
    component: "backfill"
    symbol: "AAPL"
    timeframe: "1d"
    totalUpserted: 11439
[09:29:15.559] INFO: orchestrator: backfill + gap scan phase complete
    service: "market-ingest"
[09:29:15.559] INFO: orchestrator: entering live phase
    service: "market-ingest"
[09:29:15.560] INFO: live ingest: starting
    service: "market-ingest"
    component: "live-ingest"
    symbol: "AAPL"
    timeframe: "1d"
    pollIntervalMs: 300000

Dev mode uses pino-pretty; production (NODE_ENV=production) emits the same fields as JSON lines for log aggregators.

Configuration

All configuration is via environment variables, validated on boot. See .env.example for the reference.

| Variable | Required | Default | Notes |
|---|---|---|---|
| DATABASE_URL | yes | — | Postgres connection string |
| SYMBOLS | yes | — | Comma-separated, uppercased on load |
| TIMEFRAMES | yes | — | Comma-separated (1m, 5m, 1h, 1d, 1w, …) |
| PROVIDER | no | yahoo | yahoo or alpaca |
| LOG_LEVEL | no | info | fatal/error/warn/info/debug/trace |
| NODE_ENV | no | development | Toggles pino-pretty vs JSON logs |
| BACKFILL_BATCH_SIZE | no | 1000 | Rows per upsert |
| ALPACA_KEY_ID | if PROVIDER=alpaca | — | Free at alpaca.markets |
| ALPACA_SECRET_KEY | if PROVIDER=alpaca | — | |
| ALPACA_DATA_URL | no | https://data.alpaca.markets | Override for testing |
| ALPACA_FEED | no | iex | iex (free) or sip (paid, full US market) |
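The fail-fast parsing for list-valued variables behaves roughly like this sketch; the actual validation lives in a zod schema in src/config.

```typescript
// Parse a comma-separated env var into a non-empty, uppercased list,
// failing fast with an actionable message (mirrors the documented
// SYMBOLS behavior; the real service does this inside zod).
function parseSymbols(raw: string | undefined): string[] {
  const items = (raw ?? "")
    .split(",")
    .map((s) => s.trim().toUpperCase())
    .filter((s) => s.length > 0);
  if (items.length === 0) {
    throw new Error("SYMBOLS is required: comma-separated list, e.g. AAPL,MSFT");
  }
  return items;
}
```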

Extending

Adding a new provider

  1. Implement the Provider interface in src/providers/<name>/.
  2. Write a mapping function that produces a canonical Candle — see src/providers/alpaca/mapping.ts for the template. The comment block at the top lists the five normalization invariants every mapper must satisfy (UTC Date timestamps, 8-decimal string prices, computed closeTime, canonical timeframe code, providerId from context).
  3. Register the provider in src/providers/registry.ts and add the name to the ProviderName union in src/config/index.ts.
  4. Seed the provider in a new migration (migrations/000N_provider_seed.sql).
  5. Add unit tests for timeframe translation and mapping; an integration test against the real upstream is nice-to-have.
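A mapper satisfying the five invariants might look like the sketch below. RawBar and TIMEFRAME_MS are assumptions standing in for the provider-specific payload and the project's real timeframe tables; see src/providers/alpaca/mapping.ts for the actual template.

```typescript
// Hypothetical upstream bar shape; real mappers adapt each provider's payload.
interface RawBar { t: string; o: number; h: number; l: number; c: number; v: number }

// Assumed bucket sizes for illustration.
const TIMEFRAME_MS: Record<string, number> = {
  "1m": 60_000, "1h": 3_600_000, "1d": 86_400_000,
};

function mapBar(bar: RawBar, timeframe: string, providerId: number, symbol: string) {
  const openTime = new Date(bar.t);                   // invariant 1: UTC Date timestamp
  const bucket = TIMEFRAME_MS[timeframe];
  if (bucket === undefined) throw new Error(`unknown timeframe ${timeframe}`);
  return {
    providerId,                                       // invariant 5: providerId from context
    symbol,
    timeframe,                                        // invariant 4: canonical timeframe code
    openTime,
    closeTime: new Date(openTime.getTime() + bucket), // invariant 3: computed closeTime
    open: bar.o.toFixed(8),                           // invariant 2: 8-decimal string prices
    high: bar.h.toFixed(8),
    low: bar.l.toFixed(8),
    close: bar.c.toFixed(8),
    volume: String(bar.v),
  };
}
```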

Adding symbols or timeframes

Edit .env. The service parses, validates, and fails fast on bad values.

Scripts

Running the service

npm run dev               # Run via tsx (dev-friendly; no build required)
npm run dev:watch         # tsx watch — restarts on source changes
npm run build             # tsc -p tsconfig.build.json
npm start                 # Run compiled dist/ (needs `npm run build` first)
npm run start:build       # Build then run

Database utilities

All destructive scripts (delete, reset-db) run as a preview by default and require --yes to actually mutate data.

npm run migrate                         # Apply pending SQL migrations
npm run health                          # One-screen summary of every series
npm run count   -- <SYM> <TF>           # Per-provider detail for one series
npm run recent  -- <SYM> <TF> [N]       # Last N candles (default 10, max 1000)
npm run gaps    -- <SYM> <TF> [PROV]    # Missing buckets the scanner would refill
npm run delete  -- <SYM> <TF> [--yes]   # Wipe one series
npm run reset-db [-- --yes]             # Truncate all data tables

The -- separates npm's own args from args passed to the script.

Code quality

npm run typecheck         # tsc --noEmit
npm run lint              # eslint .
npm run lint:fix          # eslint . --fix
npm run format            # prettier --write .
npm run format:check      # prettier --check .

Tests

npm test                  # vitest run (requires Postgres + applied migrations)
npm run test:watch        # vitest watch

Integration tests hit the real Postgres started by docker compose. vitest.config.ts sets fileParallelism: false so the integration tests don't step on each other's TRUNCATEs. If tests fail with a cryptic Cannot read properties of undefined (reading 'config'), clear the Vite cache: rm -rf node_modules/.vite node_modules/.vitest.
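The relevant fragment of vitest.config.ts, per the note above (a sketch showing only the setting in question):

```typescript
// vitest.config.ts — serialize test files so integration tests sharing
// one Postgres don't interleave their TRUNCATEs.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    fileParallelism: false,
  },
});
```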

Troubleshooting

  • Cannot read properties of undefined (reading 'config') from vitest — stale Vite transform cache. rm -rf node_modules/.vite node_modules/.vitest && npm test.
  • Migration fails with FK violation on providers — you're upgrading from a pre-0003 snapshot that had a binance seed. Migration 0003_provider_seed.sql renames it to yahoo safely; just run npm run migrate.
  • Alpaca returns 401 — check ALPACA_KEY_ID / ALPACA_SECRET_KEY. Paper-trading keys work for IEX market data.
  • Alpaca returns WebSocket 406 connection limit exceeded — Alpaca allows one bars-stream connection per account. The service now closes that socket during shutdown, but a very fast restart can still briefly race a server-side drain. Wait a few seconds and retry; the live loop also logs this case explicitly and backs off automatically.
  • Yahoo returns "1m data not available" — Yahoo caps 1m history at roughly 7 days. The service detects when stored data predates the provider's window and advances the cursor with a warning.
  • Docker Desktop pipe error on Windows — Docker Desktop isn't running. Start it.

Limitations

Honest about what this is not:

  • Yahoo is unofficial. Endpoints can change without notice; breakage is usually patched upstream in yahoo-finance2.
  • No market-calendar awareness. Live polling fires every interval regardless of market hours. Wasted calls, not incorrect behavior.
  • WebSocket streaming is Alpaca 1m only. 5m/15m/1h/1d on Alpaca and all Yahoo timeframes stay on REST polling (Yahoo has no WebSocket feed; Alpaca's bars stream is minute-only). Aggregating trade ticks into higher-timeframe bars is a roadmap item.
  • Single-process. Two instances pointing at the same DB would race on ingestion_state.last_live_tick_at. For horizontal scaling, add advisory locks or a simple worker-partitioning scheme.
  • Cross-DST gap detection edge case. The step-based enumeration in findGaps may overshoot the real bucket across a DST transition for stock data. Rare; a market-calendar-aware provider method would fix it.
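For the advisory-lock route, one sketch: derive a stable signed 64-bit key per series and wrap each series' jobs in pg_try_advisory_lock(key). The key scheme below is an illustration under stated assumptions, not project code.

```typescript
// FNV-1a 64-bit hash of "provider:symbol:timeframe", folded into the
// signed bigint range that Postgres advisory locks accept. Two workers
// computing the same key for a series can then race on
// SELECT pg_try_advisory_lock($1) and only one wins.
function seriesLockKey(provider: string, symbol: string, timeframe: string): bigint {
  const input = `${provider}:${symbol}:${timeframe}`;
  let hash = 0xcbf29ce484222325n;          // FNV-1a 64-bit offset basis
  const prime = 0x100000001b3n;            // FNV-1a 64-bit prime
  for (let i = 0; i < input.length; i++) {
    hash ^= BigInt(input.charCodeAt(i));
    hash = (hash * prime) & 0xffffffffffffffffn;
  }
  return BigInt.asIntN(64, hash);          // pg_advisory_lock takes a signed 64-bit int
}
```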

See docs/roadmap.md.

Docs

  • Architecture — layered design, ingestion lifecycle, error strategy.
  • Data model — schema, type choices, the gap-detection query.
  • Providers — interface contract, Yahoo + Alpaca specifics, how to add one.
  • Roadmap — prioritized next work.

License

MIT. See LICENSE.
