hnt-content

Article crawling and extraction pipeline for Firefox New Tab content recommendations. Crawls publisher pages, discovers articles, extracts content via Zyte, and streams results to BigQuery for ML ranking.

Development

nvm use  # Node 24
pnpm install
pnpm build
pnpm test

Run a service locally (no build step, uses tsx):

pnpm --filter crawl-agent dev
pnpm --filter crawl-worker dev

Command            Description
pnpm build         Build all packages and services
pnpm test          Run all tests
pnpm lint          Lint all packages
pnpm format        Format source files with Prettier
pnpm format:check  Check formatting (CI)
pnpm clean         Remove all build artifacts and node_modules

Architecture

See the Article Crawler Technical Spec for the full design. In brief:

  • Crawl Agent runs a tick loop every 60s, checking which publisher pages and live articles need crawling based on Redis state, then enqueues jobs to Pub/Sub.
  • Crawl Worker consumes from two Pub/Sub queues: crawl-article-discovery (page crawling) and crawl-article (article extraction). Results stream to BigQuery via Pub/Sub subscriptions.
  • Redis (Memorystore) tracks crawl timestamps, prevents duplicate fetches, and provides distributed locking.
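
The tick logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Crawl Agent's actual code: `CrawlTarget`, `CrawlState`, `tick`, and the per-target `intervalMs` field are hypothetical names, an in-memory `Map` stands in for Redis, and a plain callback stands in for the Pub/Sub publisher. Only the two topic names come from this README; mapping them by target kind is an assumption.

```typescript
// Hypothetical shapes: none of these names come from the hnt-content codebase.
type CrawlTarget = {
  url: string;
  kind: "page" | "article";
  intervalMs: number; // assumed: how often this target should be re-crawled
};

type Job = { topic: string; url: string };

// Topic names are taken from the README; the mapping by kind is an assumption.
const TOPIC_BY_KIND: Record<CrawlTarget["kind"], string> = {
  page: "crawl-article-discovery",
  article: "crawl-article",
};

// In-memory stand-in for the Redis state that records last-crawl timestamps.
class CrawlState {
  private lastCrawled = new Map<string, number>();

  // A target is due if it was never crawled or its interval has elapsed.
  isDue(target: CrawlTarget, now: number): boolean {
    const last = this.lastCrawled.get(target.url);
    return last === undefined || now - last >= target.intervalMs;
  }

  markEnqueued(url: string, now: number): void {
    this.lastCrawled.set(url, now);
  }
}

// One tick: enqueue a job for every due target and record the time, so the
// next tick does not enqueue the same target again. Returns the job count.
function tick(
  targets: CrawlTarget[],
  state: CrawlState,
  now: number,
  enqueue: (job: Job) => void,
): number {
  let enqueued = 0;
  for (const target of targets) {
    if (!state.isDue(target, now)) continue;
    enqueue({ topic: TOPIC_BY_KIND[target.kind], url: target.url });
    state.markEnqueued(target.url, now);
    enqueued += 1;
  }
  return enqueued;
}
```

In a long-running process, `tick` would presumably be driven by a 60-second timer such as `setInterval`, with `Date.now()` passed as `now`. Passing the clock in as an argument keeps the function easy to test without waiting on real time.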

Repository structure

hnt-content/
├── services/
│   ├── crawl-agent/      # Scheduler: enqueues crawl jobs on configured intervals
│   └── crawl-worker/     # Worker: discovers articles and extracts content
├── packages/
│   └── crawl-common/     # Shared types, utilities, Zyte client
├── Dockerfile            # Multi-stage build with turbo prune + pnpm deploy
├── turbo.json
└── pnpm-workspace.yaml

Deployment

The Dockerfile builds a single image containing all services. Each Helm workload overrides the command to select which service to run:

docker build -t hnt-content .
docker run -e PORT=8080 hnt-content node crawl-agent/dist/main.js
docker run -e PORT=8080 hnt-content node crawl-worker/dist/main.js

The Dockerfile uses Turborepo Docker pruning and pnpm deploy --prod to produce a minimal image with only production dependencies. Services deploy to GKE via ArgoCD (mozcloud Helm chart).
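
The `docker run` commands above pass `PORT` as an environment variable. As a minimal sketch of how a service entry point might read it, assuming a hypothetical `resolvePort` helper and an assumed default of 8080 (neither is confirmed by this README):

```typescript
// Hypothetical helper; the real services may read PORT differently.
// The default of 8080 is an assumption that mirrors the docker run examples.
const DEFAULT_PORT = 8080;

function resolvePort(env: Record<string, string | undefined>): number {
  const raw = env.PORT;
  // Fall back to the default when PORT is unset or blank.
  if (raw === undefined || raw.trim() === "") return DEFAULT_PORT;
  const port = Number(raw);
  // Reject non-numeric values and anything outside the valid TCP port range.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${raw}`);
  }
  return port;
}
```

A service would call `resolvePort(process.env)` once at startup. Failing fast on an invalid value makes a misconfigured workload visible at startup instead of leaving it listening on an unexpected port.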
