An engineering exercise implemented in Go.
A simple web crawler that visits all pages within a given domain but does not follow external links. It outputs a structured site map showing, for each page (a rough sketch follows the list):
- domain-internal page links
- external page links
- links to static content such as images
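For illustration only, the per-page record described above could be modeled roughly as follows; the type and field names here are assumptions, not the project's actual API.

```go
package crawler

// Page is an illustrative model of one site-map entry; the field
// names are assumptions, not taken from the project's source.
type Page struct {
	URL      string   // the page itself
	Internal []string // domain-internal page links
	External []string // external page links
	Static   []string // links to static content such as images
}
```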
This entire project can be cloned directly from GitHub: https://github.com/nerophon/crawler
- The Go Programming Language must be installed to build, test, and install this software.
- Clone this project.
- `cd` to the project directory.
- Run `go install`.

The software will be installed to the `$GOPATH/bin` directory by default.
This software includes unit tests. They can be run in the standard way for Go tests:
- `cd` to the source folder containing the test files you wish to run.
- Run `go test`.
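For reference, a test in this style might look like the minimal, self-contained sketch below. The `isInternal` helper and its signature are hypothetical stand-ins, not the project's actual code.

```go
package crawler

import (
	"net/url"
	"testing"
)

// isInternal reports whether link belongs to the crawl domain.
// This helper is a hypothetical stand-in for the real predicate.
func isInternal(link, domain string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return u.Host == "" || u.Host == domain
}

// TestIsInternal is a minimal table-driven test in the standard Go style.
func TestIsInternal(t *testing.T) {
	cases := []struct {
		link string
		want bool
	}{
		{"/about", true},
		{"https://example.com/page", true},
		{"https://other.org/page", false},
	}
	for _, c := range cases {
		if got := isInternal(c.link, "example.com"); got != c.want {
			t.Errorf("isInternal(%q) = %v, want %v", c.link, got, c.want)
		}
	}
}
```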
Benchmarks exist for key steps in the process. They can be run from the root project directory via the crawler_test.go file. I suggest running each benchmark separately, using the following commands:

```
go test -bench=BenchmarkFetch -benchtime=7s
go test -bench=BenchmarkCrawl -benchtime=15s
```

Please be aware that this kind of benchmark could, if run without care, be interpreted as a DoS attack. The `-benchtime` flag may need to be adjusted depending on which website is being used in the test. I strongly advise NOT using commonly attacked websites such as those belonging to major corporations.
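For context, these benchmarks follow Go's standard `testing.B` pattern. The sketch below is illustrative only; the target URL and loop body are assumptions, not the project's actual benchmark.

```go
package crawler

import (
	"net/http"
	"testing"
)

// BenchmarkFetch sketches the standard testing.B pattern used by
// Go benchmarks. The placeholder URL and the body of the loop are
// assumptions, not the project's actual benchmark code.
func BenchmarkFetch(b *testing.B) {
	const target = "http://example.com/" // placeholder URL
	for i := 0; i < b.N; i++ {
		resp, err := http.Get(target)
		if err != nil {
			b.Fatal(err)
		}
		resp.Body.Close()
	}
}
```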
To run the application:
- `cd` to the install directory, usually `$GOPATH/bin`.
- Run `./crawler`.
At the application command prompt, the following commands are available:
- `crawl [URL]`: begin crawling the specified domain
- `help`: show available commands
- `quit`: exit the application
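A command prompt like this is typically implemented as a small read-dispatch loop. The following is a minimal sketch of that pattern under stated assumptions, not the project's actual source.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// main sketches a minimal command prompt in the style described
// above. The dispatch logic is illustrative; the real crawler's
// loop may differ.
func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for {
		fmt.Print("> ")
		if !scanner.Scan() {
			return
		}
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		switch fields[0] {
		case "crawl":
			if len(fields) < 2 {
				fmt.Println("usage: crawl [URL]")
				continue
			}
			fmt.Println("crawling", fields[1]) // the real crawl would start here
		case "help":
			fmt.Println("commands: crawl [URL], help, quit")
		case "quit":
			return
		default:
			fmt.Println("unknown command; type 'help'")
		}
	}
}
```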
Press `ctrl-c` during a crawl to halt and force-quit back to the OS command line.
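Interrupt handling of this kind is usually wired up with Go's `os/signal` package. The sketch below shows that standard pattern with a stand-in work loop; the real application's shutdown path may differ.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"time"
)

// This sketch shows the standard os/signal pattern for catching
// ctrl-c (SIGINT). The loop body is a placeholder for crawl work,
// not the project's actual implementation.
func main() {
	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)

	for {
		select {
		case <-interrupt:
			fmt.Println("interrupted, exiting")
			return
		default:
			// placeholder for one unit of crawl work
			time.Sleep(100 * time.Millisecond)
		}
	}
}
```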