Article crawling and extraction pipeline for Firefox New Tab content recommendations. Crawls publisher pages, discovers articles, extracts content via Zyte, and streams results to BigQuery for ML ranking.
```sh
nvm use   # Node 24
pnpm install
pnpm build
pnpm test
```

Run a service locally (no build step, uses tsx):

```sh
pnpm --filter crawl-agent dev
pnpm --filter crawl-worker dev
```

| Command | Description |
|---|---|
| `pnpm build` | Build all packages and services |
| `pnpm test` | Run all tests |
| `pnpm lint` | Lint all packages |
| `pnpm format` | Format source files with Prettier |
| `pnpm format:check` | Check formatting (CI) |
| `pnpm clean` | Remove all build artifacts and node_modules |
See the Article Crawler Technical Spec for the full design. In brief:
- Crawl Agent runs a tick loop every 60s, checking which publisher pages and live articles need crawling based on Redis state, then enqueues jobs to Pub/Sub.
- Crawl Worker consumes from two Pub/Sub queues: `crawl-article-discovery` (page crawling) and `crawl-article` (article extraction). Results stream to BigQuery via Pub/Sub subscriptions.
- Redis (Memorystore) tracks crawl timestamps, prevents duplicate fetches, and provides distributed locking.
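The agent's tick loop boils down to a due-check against stored crawl timestamps plus an enqueue step. The sketch below is illustrative only, under stated assumptions: `isDue`, `tick`, and the `Target` type are hypothetical names not taken from this repo, and Redis and Pub/Sub are stood in by an in-memory `Map` and a callback:

```typescript
// Hypothetical sketch of the agent's 60s tick loop. A Map of last-crawl
// timestamps stands in for Redis; an enqueue callback stands in for Pub/Sub.

interface Target {
  url: string;
  intervalMs: number; // configured crawl interval for this page or article
}

// A target is due when it has never been crawled or its interval has elapsed.
function isDue(
  lastCrawledMs: number | undefined,
  intervalMs: number,
  nowMs: number,
): boolean {
  return lastCrawledMs === undefined || nowMs - lastCrawledMs >= intervalMs;
}

function tick(
  targets: Target[],
  lastCrawled: Map<string, number>, // stand-in for Redis crawl-state keys
  enqueue: (url: string) => void, // stand-in for a Pub/Sub publish
  nowMs: number,
): void {
  for (const t of targets) {
    if (isDue(lastCrawled.get(t.url), t.intervalMs, nowMs)) {
      enqueue(t.url);
      lastCrawled.set(t.url, nowMs); // record so the next tick skips it
    }
  }
}

// Example: one page crawled 10s ago (not yet due), one never crawled (due).
const queue: string[] = [];
const state = new Map([["https://example.com/politics", 50_000]]);
tick(
  [
    { url: "https://example.com/politics", intervalMs: 60_000 },
    { url: "https://example.com/tech", intervalMs: 60_000 },
  ],
  state,
  (url) => queue.push(url),
  60_000,
);
// queue now holds only the never-crawled URL
```

Keeping the due-check pure makes the scheduling rule trivially unit-testable, independent of Redis or Pub/Sub connectivity.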
```
hnt-content/
├── services/
│   ├── crawl-agent/     # Scheduler: enqueues crawl jobs on configured intervals
│   └── crawl-worker/    # Worker: discovers articles and extracts content
├── packages/
│   └── crawl-common/    # Shared types, utilities, Zyte client
├── Dockerfile           # Multi-stage build with turbo prune + pnpm deploy
├── turbo.json
└── pnpm-workspace.yaml
```
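The distributed locking mentioned above is commonly built on Redis's `SET key value NX PX ttl` semantics. The sketch below illustrates that general pattern only; `acquireLock`/`releaseLock` are hypothetical names, not utilities from `crawl-common`, and an in-memory `Map` stands in for Redis:

```typescript
// Hypothetical sketch of the SET-NX-with-expiry locking pattern that keeps
// two workers from crawling the same URL at once. A Map stands in for Redis.

type LockStore = Map<string, { owner: string; expiresAtMs: number }>;

// Acquire succeeds only if the key is absent or its TTL has lapsed
// (mirroring Redis `SET key owner NX PX ttl`).
function acquireLock(
  store: LockStore,
  key: string,
  owner: string,
  ttlMs: number,
  nowMs: number,
): boolean {
  const existing = store.get(key);
  if (existing && existing.expiresAtMs > nowMs) return false;
  store.set(key, { owner, expiresAtMs: nowMs + ttlMs });
  return true;
}

// Release is compare-and-delete: only the lock's owner may remove it,
// so a worker cannot clobber a lock it lost to TTL expiry.
function releaseLock(store: LockStore, key: string, owner: string): boolean {
  const existing = store.get(key);
  if (!existing || existing.owner !== owner) return false;
  store.delete(key);
  return true;
}

const store: LockStore = new Map();
const got = acquireLock(store, "crawl:https://example.com/a", "worker-1", 30_000, 0);
const blocked = acquireLock(store, "crawl:https://example.com/a", "worker-2", 30_000, 1_000);
// got is true; blocked is false until the 30s TTL lapses
```

The TTL is the safety valve: if a worker crashes mid-crawl, the lock expires on its own and another worker can pick the URL up.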
The Dockerfile builds a single image containing all services. Each Helm workload overrides the command to select which service to run:
```sh
docker build -t hnt-content .
docker run -e PORT=8080 hnt-content node crawl-agent/dist/main.js
docker run -e PORT=8080 hnt-content node crawl-worker/dist/main.js
```

The Dockerfile uses Turborepo Docker pruning and `pnpm deploy --prod` to produce a minimal image with only production dependencies. Services deploy to GKE via ArgoCD (mozcloud Helm chart).