Article crawling and extraction pipeline for Firefox New Tab content recommendations. Crawls publisher pages, discovers articles, extracts content via Zyte, and streams results to BigQuery for ML ranking.
```sh
nvm use   # Node 24
pnpm install
pnpm build
pnpm test
```

Run a service locally (no build step, uses tsx):

```sh
pnpm --filter crawl-agent dev
pnpm --filter crawl-worker dev
```

| Command | Description |
|---|---|
| `pnpm build` | Build all packages and services |
| `pnpm test` | Run all tests |
| `pnpm lint` | Lint all packages |
| `pnpm format` | Format source files with Prettier |
| `pnpm format:check` | Check formatting (CI) |
| `pnpm clean` | Remove all build artifacts and node_modules |
See the Article Crawler Technical Spec for the full design. In brief:
- Crawl Agent runs a tick loop every 60s, checking which publisher pages and live articles need crawling based on Redis state, then enqueues jobs to Pub/Sub.
- Crawl Worker consumes from two Pub/Sub queues: `crawl-article-discovery` (page crawling) and `crawl-article` (article extraction). Results stream to BigQuery via Pub/Sub subscriptions.
- Redis (Memorystore) tracks crawl timestamps, prevents duplicate fetches, and provides distributed locking.
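The agent's tick loop boils down to a due-check against stored crawl timestamps plus an enqueue step. The sketch below is illustrative only, under stated assumptions: `isDue`, `tick`, and the `Target` type are hypothetical names not taken from this repo, and Redis and Pub/Sub are stood in by an in-memory `Map` and a callback:

```typescript
// Hypothetical sketch of the agent's 60s tick loop. A Map of last-crawl
// timestamps stands in for Redis; an enqueue callback stands in for Pub/Sub.

interface Target {
  url: string;
  intervalMs: number; // configured crawl interval for this page or article
}

// A target is due when it has never been crawled or its interval has elapsed.
function isDue(
  lastCrawledMs: number | undefined,
  intervalMs: number,
  nowMs: number,
): boolean {
  return lastCrawledMs === undefined || nowMs - lastCrawledMs >= intervalMs;
}

function tick(
  targets: Target[],
  lastCrawled: Map<string, number>, // stand-in for Redis crawl-state keys
  enqueue: (url: string) => void, // stand-in for a Pub/Sub publish
  nowMs: number,
): void {
  for (const t of targets) {
    if (isDue(lastCrawled.get(t.url), t.intervalMs, nowMs)) {
      enqueue(t.url);
      lastCrawled.set(t.url, nowMs); // record so the next tick skips it
    }
  }
}

// Example: one page crawled 10s ago (not yet due), one never crawled (due).
const queue: string[] = [];
const state = new Map([["https://example.com/politics", 50_000]]);
tick(
  [
    { url: "https://example.com/politics", intervalMs: 60_000 },
    { url: "https://example.com/tech", intervalMs: 60_000 },
  ],
  state,
  (url) => queue.push(url),
  60_000,
);
// queue now holds only the never-crawled URL
```

Keeping the due-check pure makes the scheduling rule trivially unit-testable, independent of Redis or Pub/Sub connectivity.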
```
hnt-content/
├── services/
│   ├── crawl-agent/     # Scheduler: enqueues crawl jobs on configured intervals
│   └── crawl-worker/    # Worker: discovers articles and extracts content
├── packages/
│   └── crawl-common/    # Shared types, utilities, Zyte client
├── Dockerfile           # Multi-stage build with turbo prune + pnpm deploy
├── turbo.json
└── pnpm-workspace.yaml
```
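The distributed locking mentioned above is commonly built on Redis's `SET key value NX PX ttl` semantics. The sketch below illustrates that general pattern only; `acquireLock`/`releaseLock` are hypothetical names, not utilities from `crawl-common`, and an in-memory `Map` stands in for Redis:

```typescript
// Hypothetical sketch of the SET-NX-with-expiry locking pattern that keeps
// two workers from crawling the same URL at once. A Map stands in for Redis.

type LockStore = Map<string, { owner: string; expiresAtMs: number }>;

// Acquire succeeds only if the key is absent or its TTL has lapsed
// (mirroring Redis `SET key owner NX PX ttl`).
function acquireLock(
  store: LockStore,
  key: string,
  owner: string,
  ttlMs: number,
  nowMs: number,
): boolean {
  const existing = store.get(key);
  if (existing && existing.expiresAtMs > nowMs) return false;
  store.set(key, { owner, expiresAtMs: nowMs + ttlMs });
  return true;
}

// Release is compare-and-delete: only the lock's owner may remove it,
// so a worker cannot clobber a lock it lost to TTL expiry.
function releaseLock(store: LockStore, key: string, owner: string): boolean {
  const existing = store.get(key);
  if (!existing || existing.owner !== owner) return false;
  store.delete(key);
  return true;
}

const store: LockStore = new Map();
const got = acquireLock(store, "crawl:https://example.com/a", "worker-1", 30_000, 0);
const blocked = acquireLock(store, "crawl:https://example.com/a", "worker-2", 30_000, 1_000);
// got is true; blocked is false until the 30s TTL lapses
```

The TTL is the safety valve: if a worker crashes mid-crawl, the lock expires on its own and another worker can pick the URL up.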
The Dockerfile builds a single image containing all services. Each Helm workload overrides the command to select which service to run:
```sh
docker build -t hnt-content .
docker run -e PORT=8080 hnt-content node crawl-agent/dist/main.js
docker run -e PORT=8080 hnt-content node crawl-worker/dist/main.js
```

The Dockerfile uses Turborepo Docker pruning and `pnpm deploy --prod` to produce a minimal image with only production dependencies. Services deploy to GKE via ArgoCD (mozcloud Helm chart).