Trawl is a high-performance, platform-agnostic ETL pipeline designed to bridge the gap between unstructured web data and modern search engines. Built with Go's high-concurrency primitives, Trawl allows you to scrape, normalize, and index thousands of records into engines like Typesense, Algolia, or Elasticsearch with minimal latency.
Most scraping solutions are either simple scripts that don't scale or massive distributed systems that are overkill. Trawl occupies the "Goldilocks" zone:
- High Concurrency: Uses Go Worker Pools and buffered channels to process thousands of pages in parallel.
- Platform Agnostic: Swap your search provider (e.g., Algolia to Typesense) via a single config line.
- Polite by Design: Built-in rate limiting and retries to respect target server constraints.
- Dual-Purpose: Use it as a Go library (go get) in your existing projects or run it as a standalone API server.
Trawl utilizes a Concurrent Pipeline architecture. Instead of sequential execution, data flows through "pipes" (Channels), allowing the scraper, the parser, and the indexer to work at maximum capacity simultaneously.
graph LR
subgraph Trawl Engine
A[Seed URLs] --> B(Generator)
B -->|Jobs Channel| C{Worker Pool}
C -->|Scrape & Parse| D(Results Channel)
D -->|Batching| E(Search Adapter)
end
E -->|Index| F[Typesense / Algolia]
G[Voice UI / Frontend] -->|Query| H[Trawl API]
H -->|Search| F
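The flow in the diagram can be sketched in plain Go as follows. This is an illustration under stated assumptions, not Trawl's source: `Record`, `SearchAdapter`, `memoryAdapter`, `scrape`, and `runPipeline` are hypothetical names, and `scrape` is a stub that does no network I/O. The sketch assumes `batchSize` is greater than zero.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Record is a hypothetical normalized document.
type Record struct {
	URL   string
	Title string
}

// SearchAdapter is a hypothetical interface that each provider would implement.
type SearchAdapter interface {
	Index(batch []Record) error
}

// memoryAdapter records batch sizes so the sketch runs without a search server.
type memoryAdapter struct{ batchSizes []int }

func (m *memoryAdapter) Index(batch []Record) error {
	m.batchSizes = append(m.batchSizes, len(batch))
	return nil
}

// scrape stands in for the real scrape-and-parse step.
func scrape(u string) Record { return Record{URL: u, Title: strings.ToUpper(u)} }

// runPipeline pushes urls through a worker pool and indexes the results in
// batches of at most batchSize. It returns the number of records indexed.
func runPipeline(urls []string, workers, batchSize int, adapter SearchAdapter) (int, error) {
	jobs := make(chan string, len(urls))
	results := make(chan Record, len(urls))

	// Generator: feed seed URLs into the jobs channel.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Worker pool: each worker scrapes jobs until the channel is closed.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- scrape(u)
			}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	// Batching: collect results and hand full batches to the adapter.
	indexed := 0
	batch := make([]Record, 0, batchSize)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		if err := adapter.Index(batch); err != nil {
			return err
		}
		indexed += len(batch)
		batch = make([]Record, 0, batchSize)
		return nil
	}
	for r := range results {
		batch = append(batch, r)
		if len(batch) >= batchSize {
			if err := flush(); err != nil {
				return indexed, err
			}
		}
	}
	err := flush() // index the final partial batch
	return indexed, err
}

func main() {
	adapter := &memoryAdapter{}
	n, err := runPipeline([]string{"a", "b", "c", "d", "e"}, 3, 2, adapter)
	fmt.Println(n, adapter.batchSizes, err)
}
```

Because the pipeline depends only on the `SearchAdapter` interface, swapping providers in a design like this means passing a different adapter value; the generator, workers, and batching stay unchanged.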
Integrate Trawl's concurrency engine directly into your Go application.
go get github.com/ntekim/trawl

package main

import "github.com/ntekim/trawl/pkg/engine"

func main() {
	// Seed URLs to crawl (example value).
	urls := []string{"https://example-movies.com/latest"}

	t := engine.New(engine.Config{
		Concurrency: 50,
		Provider:    "typesense",
	})
	t.Run(urls)
}

Download and install the binary.

go install github.com/ntekim/trawl/cmd/trawl@latest

Define your scraping rules and search targets without writing a single line of Go code.
version: "1"
concurrency: 20
rate_limit: 5 # requests per second
search_provider:
  type: "typesense"
  host: "http://localhost:8108"
  api_key: "${TYPESENSE_API_KEY}"
sources:
  - name: "movie_archive"
    url: "https://example-movies.com/latest"
    driver: "colly" # Use 'chromedp' for JavaScript-heavy sites
    selectors:
      item_container: ".movie-card"
      fields:
        title: "h2.title"
        description: ".synopsis"
        image: "img.poster @src"
        year: ".meta .year"

To demonstrate Trawl's power, we've included a Voice-Activated Media Library in the /examples folder.
- Trawl scrapes your favorite movie/music metadata sites.
- Data is indexed into Typesense.
- The React Frontend uses the Web Speech API to turn your voice into search queries.
- User: "Show me sci-fi movies from the 90s."
- Trawl: Instantly filters the indexed data and displays results.
- Core Pipeline & Worker Pool logic
- Colly Driver (Static HTML)
- Typesense Adapter
- Chromedp Driver (Headless JS)
- Algolia Adapter
- Built-in API Proxy with Authentication
- Prometheus metrics for scraping performance
Trawl is open-source. We love PRs!
- Fork the repo.
- Create your feature branch (git checkout -b feature/amazing-feature).
- Commit your changes (git commit -m 'Add amazing feature').
- Push to the branch (git push origin feature/amazing-feature).
- Open a Pull Request.
Distributed under the Apache-2.0 License. See LICENSE for more information.
Built with 💙 using Go.
Trawl: Stop searching for data. Start finding it.