Trawl 🛰️

Trawl is a high-performance, platform-agnostic ETL pipeline that moves unstructured web data into modern search engines. Built on Go's concurrency primitives (goroutines and channels), it lets you scrape, normalize, and index thousands of records into engines such as Typesense, Algolia, or Elasticsearch with low latency.

🔥 Why Trawl?

Most scraping solutions are either simple scripts that don't scale or massive distributed systems that are overkill. Trawl occupies the "Goldilocks" zone:

- High Concurrency: Uses Go worker pools and buffered channels to process thousands of pages in parallel.
- Platform Agnostic: Swap your search provider (e.g., Algolia to Typesense) via a single config line.
- Polite by Design: Built-in rate limiting and retries to respect target server constraints.
- Dual-Purpose: Use it as a Go library (go get) in your existing projects or run it as a standalone API server.
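
To make the "Polite by Design" point concrete, here is a minimal sketch of rate limiting combined with retries, using only the standard library. It is an illustration under assumptions, not Trawl's actual implementation: limiter, newLimiter, politeFetch, and errTemporary are hypothetical names, and the fetch function is a stand-in for a real HTTP request.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary simulates a transient failure (hypothetical, for the demo only).
var errTemporary = errors.New("temporary failure")

// limiter hands out one permit per tick of its ticker.
type limiter struct{ ticker *time.Ticker }

// newLimiter allows roughly perSecond requests per second.
func newLimiter(perSecond int) *limiter {
	return &limiter{ticker: time.NewTicker(time.Second / time.Duration(perSecond))}
}

func (l *limiter) wait() { <-l.ticker.C }
func (l *limiter) stop() { l.ticker.Stop() }

// politeFetch waits for a permit before every attempt and retries a failed
// fetch up to maxRetries times, doubling the backoff after each failure.
// It returns the number of attempts made.
func politeFetch(l *limiter, fetch func(url string) error, url string, maxRetries int, backoff time.Duration) (int, error) {
	var err error
	for attempt := 1; attempt <= maxRetries+1; attempt++ {
		l.wait() // rate limit applies to retries as well
		if err = fetch(url); err == nil {
			return attempt, nil
		}
		if attempt <= maxRetries {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxRetries + 1, fmt.Errorf("giving up on %s: %w", url, err)
}

func main() {
	lim := newLimiter(5) // 5 requests per second
	defer lim.stop()

	calls := 0
	flaky := func(url string) error { // fails twice, then succeeds
		calls++
		if calls < 3 {
			return errTemporary
		}
		return nil
	}

	attempts, err := politeFetch(lim, flaky, "https://example-movies.com/latest", 3, 10*time.Millisecond)
	fmt.Println(attempts, err) // prints: 3 <nil>
}
```

Sharing one limiter across all workers keeps the total request rate bounded no matter how high the concurrency is set.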

🏗️ Architecture: The Pipeline Pattern

Trawl uses a concurrent pipeline architecture. Instead of running each stage sequentially, data flows through "pipes" (Go channels), so the scraper, the parser, and the indexer run concurrently instead of waiting for one another to finish.

graph LR
    subgraph Trawl Engine
        A[Seed URLs] --> B(Generator)
        B -->|Jobs Channel| C{Worker Pool}
        C -->|Scrape & Parse| D(Results Channel)
        D -->|Batching| E(Search Adapter)
    end
    E -->|Index| F[Typesense / Algolia]
    G[Voice UI / Frontend] -->|Query| H[Trawl API]
    H -->|Search| F
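
The stages in the diagram can be sketched in plain Go as follows. This is a simplified illustration of the pattern, not Trawl's source: Record, generate, scrape, workerPool, and batch are hypothetical names, and scrape fakes the fetch-and-parse step.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is a hypothetical normalized document.
type Record struct{ URL, Title string }

// generate feeds seed URLs into a buffered jobs channel and closes it when done.
func generate(urls []string) <-chan string {
	jobs := make(chan string, len(urls))
	go func() {
		defer close(jobs)
		for _, u := range urls {
			jobs <- u
		}
	}()
	return jobs
}

// scrape is a stand-in for the real fetch + parse step.
func scrape(u string) Record {
	return Record{URL: u, Title: "title of " + u}
}

// workerPool starts n workers that drain jobs and emit records.
// The results channel is closed once every worker has finished.
func workerPool(n int, jobs <-chan string) <-chan Record {
	results := make(chan Record, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- scrape(u)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()
	return results
}

// batch groups records into slices of at most size and passes each to index.
func batch(results <-chan Record, size int, index func([]Record)) {
	buf := make([]Record, 0, size)
	for r := range results {
		buf = append(buf, r)
		if len(buf) == size {
			index(buf)
			buf = make([]Record, 0, size)
		}
	}
	if len(buf) > 0 {
		index(buf) // flush the final partial batch
	}
}

func main() {
	urls := []string{"a", "b", "c", "d", "e"}
	total := 0
	batch(workerPool(3, generate(urls)), 2, func(b []Record) {
		total += len(b)
		fmt.Printf("indexing batch of %d\n", len(b))
	})
	fmt.Println("indexed", total) // prints: indexed 5
}
```

Closing each channel when its producer finishes is what lets the downstream stage exit its range loop, so the whole pipeline shuts down cleanly without extra signalling.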

🚀 Getting Started

1. As a Go Library

Integrate Trawl's concurrency engine directly into your Go application.

go get github.com/ntekim/trawl

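package main

import "github.com/ntekim/trawl/pkg/engine"

func main() {
    // Seed URLs to scrape (placeholder value).
    urls := []string{"https://example-movies.com/latest"}

    // Configure the engine: concurrency level and search provider.
    t := engine.New(engine.Config{
        Concurrency: 50,
        Provider:    "typesense",
    })

    // Run the pipeline over the seed URLs.
    t.Run(urls)
}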
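
The Provider field selects the search backend. As a rough illustration of how a provider name could map to an interchangeable backend, the sketch below uses a small interface plus a registry keyed by name. Adapter, memoryAdapter, registry, and newAdapter are hypothetical names invented for this example and are not part of Trawl's API; the in-memory adapter stands in for real Typesense or Algolia clients.

```go
package main

import "fmt"

// Adapter is a hypothetical interface; Trawl's real one may differ.
type Adapter interface {
	Index(docs []map[string]string) error
}

// memoryAdapter keeps documents in memory in place of a real search client.
type memoryAdapter struct {
	name string
	docs []map[string]string
}

func (m *memoryAdapter) Index(docs []map[string]string) error {
	m.docs = append(m.docs, docs...)
	return nil
}

// registry maps a provider name from the config to a constructor.
var registry = map[string]func() Adapter{
	"typesense": func() Adapter { return &memoryAdapter{name: "typesense"} },
	"algolia":   func() Adapter { return &memoryAdapter{name: "algolia"} },
}

// newAdapter looks up the provider and fails loudly on unknown names.
func newAdapter(provider string) (Adapter, error) {
	factory, ok := registry[provider]
	if !ok {
		return nil, fmt.Errorf("unknown search provider %q", provider)
	}
	return factory(), nil
}

func main() {
	a, err := newAdapter("typesense")
	if err != nil {
		panic(err)
	}
	err = a.Index([]map[string]string{{"title": "Example"}})
	fmt.Println("indexed:", err == nil) // prints: indexed: true

	_, err = newAdapter("solr")
	fmt.Println(err) // prints: unknown search provider "solr"
}
```

Because the rest of the pipeline only sees the interface, switching providers comes down to changing one string.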

2. As a Standalone Server/CLI

Build and install the binary with the Go toolchain.

go install github.com/ntekim/trawl/cmd/trawl@latest

📄 Configuration (`config.yaml`)

Define your scraping rules and search targets without writing a single line of Go code.

version: "1"
concurrency: 20
rate_limit: 5 # requests per second

search_provider:
  type: "typesense"
  host: "http://localhost:8108"
  api_key: "${TYPESENSE_API_KEY}"

sources:
  - name: "movie_archive"
    url: "https://example-movies.com/latest"
    driver: "colly" # Use 'chromedp' for JavaScript-heavy sites
    selectors:
      item_container: ".movie-card"
      fields:
        title: "h2.title"
        description: ".synopsis"
        image: "img.poster @src"
        year: ".meta .year"
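
Two conventions in this config are worth spelling out: the ${TYPESENSE_API_KEY} placeholder and the "selector @attribute" form used for the image field. The sketch below shows one plausible way to interpret them, assuming that placeholders are filled from environment variables and that a trailing @name means "read this attribute instead of the element text". splitSelector is a hypothetical helper written for this example; how Trawl itself parses these values is not shown in this README.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// splitSelector separates "img.poster @src" into a CSS selector and an
// attribute name. An empty attribute means "use the element's text".
// Hypothetical helper, not Trawl's API.
func splitSelector(field string) (selector, attr string) {
	if i := strings.LastIndex(field, " @"); i >= 0 {
		return strings.TrimSpace(field[:i]), strings.TrimSpace(field[i+2:])
	}
	return strings.TrimSpace(field), ""
}

func main() {
	// Assumption: ${VAR} placeholders are resolved from the environment.
	os.Setenv("TYPESENSE_API_KEY", "demo-key")
	fmt.Println(os.ExpandEnv("${TYPESENSE_API_KEY}")) // prints: demo-key

	sel, attr := splitSelector("img.poster @src")
	fmt.Printf("%q %q\n", sel, attr) // prints: "img.poster" "src"

	sel, attr = splitSelector("h2.title")
	fmt.Printf("%q %q\n", sel, attr) // prints: "h2.title" ""
}
```

Keeping the API key in an environment variable means config.yaml can be committed without leaking credentials.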

🎙️ Flagship Demo: Voice-Activated Media Library

To show Trawl working end to end, we've included a Voice-Activated Media Library in the /examples folder.

1. Trawl scrapes your favorite movie/music metadata sites.
2. Data is indexed into Typesense.
3. The React Frontend uses the Web Speech API to turn your voice into search queries.
   - User: "Show me sci-fi movies from the 90s."
   - Trawl: Instantly filters the indexed data and displays results.
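
For a sense of what the example query could turn into on the search side, here is a hedged sketch that builds a Typesense search request for "sci-fi" limited to 1990 through 1999. The endpoint path, the q, query_by, and filter_by parameters, and the X-TYPESENSE-API-KEY header come from Typesense's search API. Everything else is an assumption: buildSearchRequest is a hypothetical helper, the collection name movie_archive is borrowed from the sample config, and the year field would need to be indexed as a number for the range filter to work. Turning the spoken sentence into these parameters is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// buildSearchRequest prepares (but does not send) a Typesense search request.
// Hypothetical helper; the demo's actual query logic may look different.
func buildSearchRequest(host, apiKey, collection, q string, fromYear, toYear int) (*http.Request, error) {
	params := url.Values{}
	params.Set("q", q)
	params.Set("query_by", "title,description") // fields from the sample config
	params.Set("filter_by", fmt.Sprintf("year:[%d..%d]", fromYear, toYear))

	endpoint := fmt.Sprintf("%s/collections/%s/documents/search?%s",
		host, url.PathEscape(collection), params.Encode())

	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-TYPESENSE-API-KEY", apiKey)
	return req, nil
}

func main() {
	req, err := buildSearchRequest("http://localhost:8108", "demo-key", "movie_archive", "sci-fi", 1990, 1999)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

In a real deployment the frontend would call the Trawl API rather than Typesense directly, so the API key stays on the server.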

🛠️ Roadmap

- Core Pipeline & Worker Pool logic
- Colly Driver (Static HTML)
- Typesense Adapter
- Chromedp Driver (Headless JS)
- Algolia Adapter
- Built-in API Proxy with Authentication
- Prometheus metrics for scraping performance

🤝 Contributing

Trawl is open-source. We love PRs!

1. Fork the repo.
2. Create your feature branch (git checkout -b feature/amazing-feature).
3. Commit your changes (git commit -m 'Add amazing feature').
4. Push to the branch (git push origin feature/amazing-feature).
5. Open a Pull Request.

📄 License

Distributed under the Apache-2.0 License. See LICENSE for more information.

Built with 💙 using Go.
Trawl: Stop searching for data. Start finding it.
