web-crawler

A concurrent web crawler written in Go. It crawls all internal pages of a single website, extracts basic page data (heading, first paragraph, outgoing links, image URLs), and writes the result to a JSON report.

Requirements

Go 1.26 or newer

Build

go build -o crawler

Usage

./crawler BASE_URL MAX_CONCURRENCY MAX_PAGES

BASE_URL — the site to crawl (e.g. https://example.com)
MAX_CONCURRENCY — max number of pages fetched in parallel
MAX_PAGES — stop after this many pages have been crawled
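
As a rough illustration of how the two numeric arguments could interact, here is a minimal sketch that caps in-flight fetches with a buffered channel and stops accepting pages once the limit is reached. The function name visitAll and the fetch callback are hypothetical and not taken from this repository; the sketch also assumes MAX_CONCURRENCY is at least 1.

```go
package main

import (
	"fmt"
	"sync"
)

// visitAll is a hypothetical helper: it calls fetch for each distinct URL,
// keeping at most maxConcurrency calls in flight and accepting at most
// maxPages URLs. It returns the number of pages accepted.
// Assumes maxConcurrency >= 1; a zero-capacity channel would block forever.
func visitAll(urls []string, maxConcurrency, maxPages int, fetch func(string)) int {
	var wg sync.WaitGroup
	visited := make(map[string]bool)
	slots := make(chan struct{}, maxConcurrency) // counting semaphore

	for _, u := range urls {
		// Skip duplicates and anything past the page limit.
		if len(visited) >= maxPages || visited[u] {
			continue
		}
		visited[u] = true

		wg.Add(1)
		slots <- struct{}{} // blocks while maxConcurrency fetches are running
		go func(u string) {
			defer wg.Done()
			defer func() { <-slots }() // free the slot when done
			fetch(u)
		}(u)
	}
	wg.Wait()
	return len(visited)
}

func main() {
	urls := []string{
		"https://example.com/",
		"https://example.com/about",
		"https://example.com/", // duplicate, skipped
		"https://example.com/contact",
	}
	n := visitAll(urls, 2, 2, func(u string) { fmt.Println("fetching", u) })
	fmt.Println("pages visited:", n)
}
```

In this sketch only the loop goroutine touches the visited map, so no mutex is needed. A crawler whose workers add newly discovered links themselves would need to guard that map, for example with a sync.Mutex.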

The crawler only follows links on the same domain as BASE_URL.
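
One plausible way to implement the same-domain rule is to resolve each link against BASE_URL and compare hostnames. The sketch below uses only the standard net/url package; the function name sameDomain is hypothetical, and treating subdomains as different domains is an assumption, not necessarily what this crawler does.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameDomain is a hypothetical helper: it reports whether link, resolved
// against baseURL, has exactly the same hostname as baseURL.
// Assumption: subdomains (blog.example.com) count as a different domain.
func sameDomain(baseURL, link string) bool {
	base, err := url.Parse(baseURL)
	if err != nil {
		return false
	}
	// base.Parse resolves relative references such as "/about".
	resolved, err := base.Parse(link)
	if err != nil {
		return false
	}
	return resolved.Hostname() == base.Hostname()
}

func main() {
	base := "https://example.com"
	for _, link := range []string{"/about", "https://example.com/contact", "https://other.org/"} {
		fmt.Println(link, sameDomain(base, link))
	}
}
```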

Example

./crawler https://example.com 5 50

This writes a report.json file in the current directory containing one entry per crawled page:

[
  {
    "url": "https://example.com/",
    "heading": "Example Heading",
    "first_paragraph": "First paragraph text...",
    "outgoing_links": ["https://example.com/about"],
    "image_urls": ["https://example.com/logo.png"]
  }
]
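
For readers who want to consume the report from Go, the entry shape above maps onto a struct with JSON tags. This is a minimal sketch: the type name PageData and the function reportJSON are hypothetical, and only the JSON field names come from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PageData is a hypothetical type mirroring one report entry.
// The JSON keys match the example report; the Go names are assumptions.
type PageData struct {
	URL            string   `json:"url"`
	Heading        string   `json:"heading"`
	FirstParagraph string   `json:"first_paragraph"`
	OutgoingLinks  []string `json:"outgoing_links"`
	ImageURLs      []string `json:"image_urls"`
}

// reportJSON encodes the pages as an indented JSON array.
func reportJSON(pages []PageData) ([]byte, error) {
	return json.MarshalIndent(pages, "", "  ")
}

func main() {
	pages := []PageData{{
		URL:            "https://example.com/",
		Heading:        "Example Heading",
		FirstParagraph: "First paragraph text...",
		OutgoingLinks:  []string{"https://example.com/about"},
		ImageURLs:      []string{"https://example.com/logo.png"},
	}}
	out, err := reportJSON(pages)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The same struct can be used with json.Unmarshal to read an existing report.json back into memory.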

Tests

go test ./...
