Web Crawler

A simple, concurrent web crawler written in Go. This program starts at a given URL, downloads the content of the web page, saves it to a specified directory, and extracts valid URLs from the page. It then recursively crawls the extracted URLs and repeats the process. The program ensures that URLs are not visited multiple times by using a thread-safe data structure.
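The "visit each URL only once" guarantee can be illustrated with a mutex-protected set. This is a minimal sketch rather than the project's actual code: the names `SafeVisited`, `NewSafeVisited`, and `TryVisit` are assumptions chosen for illustration.

```go
package main

import "sync"

// SafeVisited is a hypothetical thread-safe set of visited URLs.
type SafeVisited struct {
	mu   sync.Mutex
	seen map[string]bool
}

// NewSafeVisited returns an empty set ready for use.
func NewSafeVisited() *SafeVisited {
	return &SafeVisited{seen: make(map[string]bool)}
}

// TryVisit marks url as visited and reports whether this call was the first
// to do so. Checking and marking under a single lock avoids a race in which
// two goroutines both decide to crawl the same URL.
func (s *SafeVisited) TryVisit(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[url] {
		return false
	}
	s.seen[url] = true
	return true
}
```

A crawling goroutine would call `TryVisit` before downloading a page and return early when it reports false. Combining the check and the update in one method is the important design choice here; separate "has" and "add" calls would leave a window between them.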

Installation

To install the web crawler, make sure Go is installed on your system. Then clone the repository and install webcrawler with the following commands:

git clone https://github.com/oliverasistemas/webcrawler.git
cd webcrawler
go install

Usage

To run the web crawler, execute the following command:

webcrawler <starting-url> <destination-directory>

Replace <starting-url> with the URL you want to start crawling from and <destination-directory> with the directory where you want to save the downloaded content.

For example, to start crawling from https://example.com and save the content in a directory named output:

webcrawler https://example.com output

Dependencies

The web crawler uses two custom packages:

crawler/pagedownloader: Responsible for downloading the content of a URL and saving it to a specified directory.
crawler/url_extractor: Responsible for extracting valid URLs from the downloaded content.
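To make the extraction step concrete, here is a rough sketch of pulling links out of a downloaded page using only the standard library. It is not the code in crawler/url_extractor: the function name `ExtractLinks` is hypothetical, and matching `href` attributes with a regular expression is a simplification that a real HTML parser would handle more robustly.

```go
package main

import (
	"net/url"
	"regexp"
)

// hrefPattern matches double-quoted href attributes. This is an assumed
// simplification; single-quoted or unquoted attributes are not matched.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// ExtractLinks returns the absolute http(s) URLs found in page, resolved
// against baseURL. Duplicates within the page are dropped.
func ExtractLinks(baseURL, page string) ([]string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	var links []string
	seen := make(map[string]bool)
	for _, m := range hrefPattern.FindAllStringSubmatch(page, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip malformed references
		}
		abs := base.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // skip mailto:, javascript:, and similar schemes
		}
		abs.Fragment = "" // "#section" anchors point at the same page
		s := abs.String()
		if !seen[s] {
			seen[s] = true
			links = append(links, s)
		}
	}
	return links, nil
}
```

Resolving each reference against the page's own URL turns relative links such as `/about` or `page.html` into absolute URLs that can be crawled directly.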

Limitations

This web crawler is a basic implementation and is not optimized for large-scale web crawling.
It does not handle JavaScript rendering or dynamic content.
There is no support for robots.txt or other crawling rules.
It does not yet support resuming an interrupted crawl. A future version could check the destination directory for already-downloaded pages and skip downloading and processing them where not necessary.
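One possible shape for the resume check mentioned in the last item is sketched below. Everything here is an assumption made for illustration: the helper names `pagePath`, `savePage`, and `alreadyDownloaded` are hypothetical, and naming files by a SHA-256 hash of the URL is not necessarily how this crawler names its output.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// pagePath maps a URL to a file inside dir. Hashing the URL is an assumed
// naming scheme that yields a fixed-length, filesystem-safe name.
func pagePath(dir, rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	return filepath.Join(dir, hex.EncodeToString(sum[:])+".html")
}

// savePage writes body to the file for rawURL, creating dir if needed.
func savePage(dir, rawURL string, body []byte) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(pagePath(dir, rawURL), body, 0o644)
}

// alreadyDownloaded reports whether a non-empty regular file for rawURL
// exists in dir, which a resumed crawl could use to skip the download.
func alreadyDownloaded(dir, rawURL string) bool {
	info, err := os.Stat(pagePath(dir, rawURL))
	return err == nil && info.Mode().IsRegular() && info.Size() > 0
}
```

Skipping the download alone is not enough to resume a crawl: the saved file would still need to be read back so its links can be extracted, unless those links were also recorded during the first run.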
