This is a simple crawler program written in Go. Given one seed URL, it recursively fetches all relative URLs belonging to that domain and prints them to the console.
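Below is a rough sketch of the core idea, for illustration only (it is not the project's actual code): fetch a page, collect its links, and keep only the URLs that live on the seed's host. It assumes `golang.org/x/net/html` for HTML parsing.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"golang.org/x/net/html"
)

// sameHostLinks fetches pageURL and returns absolute URLs that share its host.
func sameHostLinks(pageURL string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					// Resolve relative references against the page URL and keep same-host links.
					if u, err := base.Parse(a.Val); err == nil && u.Host == base.Host {
						links = append(links, u.String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := sameHostLinks("https://example.com")
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```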
- You'll need `go1.13` to build.
- To build: `go build .`
- To test: `go test .`
```
$ ./simpleCrawler -s "https://example.com" -w 5000 -d 60
```
```
Usage of ./simpleCrawler:
  -d int
        Number of seconds to crawl, default will be forever until no more crawling is needed (default -1)
  -s string
        Seed url to start crawling from. (required)
  -w int
        Max number of workers to crawl. (default 1000)
```
This project isn't "production ready" yet due to a couple of outstanding points:
- Write unit tests for the Crawler & Fetcher packages using gomock (a test sketch follows this list).
- Write proper documentation for the major parts of the system (Crawler, Queue and Fetcher)
- Implement a politeness delay to avoid getting throttled or overloading the target servers (a rate-limiting sketch follows this list).
- Make the number of cores (`GOMAXPROCS`) configurable; currently it defaults to the number of logically available cores (a flag sketch follows this list).
- Sometimes this program might hit `socket: too many open files` if you use a lot of workers (depends on your default limits):
  - Check your limit for max open files per process with `ulimit -n`.
- Handle errors more gracefully:
  - e.g. currently the Fetcher swallows errors when fetching a URL fails (it assumes that URL has no children). We should implement retries with exponential backoff (a backoff sketch follows this list).
- Write the results to a file, with a status update on the console at a fixed interval (e.g. every 10 seconds); a sketch follows this list.
  - Currently, it logs every URL found to the console (you can redirect it with `./simpleCrawler .. > urls.txt`).
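For the unit-testing point, here is a minimal gomock sketch. It assumes a hypothetical `Fetcher` interface with a `Fetch(url string) ([]string, error)` method and a mock generated by `mockgen` into a `mocks` package; the import paths and names are placeholders, not the project's actual API.

```go
package crawler_test

import (
	"testing"

	"github.com/golang/mock/gomock"

	// Hypothetical path; the real package layout may differ.
	"github.com/example/simpleCrawler/mocks"
)

func TestFetcherIsCalledWithSeed(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	// Mock generated by e.g.: mockgen -source=fetcher.go -destination=mocks/fetcher.go
	f := mocks.NewMockFetcher(ctrl)
	f.EXPECT().
		Fetch("https://example.com").
		Return([]string{"https://example.com/about"}, nil)

	// In a real test the mock would be handed to the Crawler; here we just
	// exercise the expectation directly to keep the sketch self-contained.
	got, err := f.Fetch("https://example.com")
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if len(got) != 1 || got[0] != "https://example.com/about" {
		t.Fatalf("unexpected links: %v", got)
	}
}
```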
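For the politeness delay, one option (a sketch, not the current implementation) is a rate limiter shared by all workers, e.g. with `golang.org/x/time/rate`; the one-request-per-500ms figure is arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow roughly one request every 500ms across all workers, with a burst of 2.
	limiter := rate.NewLimiter(rate.Every(500*time.Millisecond), 2)

	urls := []string{"https://example.com", "https://example.com/about"}
	for _, u := range urls {
		// Wait blocks until the limiter permits the next request.
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("limiter:", err)
			return
		}
		resp, err := http.Get(u)
		if err != nil {
			fmt.Println("fetch:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(u, resp.Status)
	}
}
```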
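Making `GOMAXPROCS` configurable could look something like the sketch below; the `-p` flag name is made up.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

func main() {
	// <=0 keeps the Go runtime default (the number of logical CPUs).
	procs := flag.Int("p", 0, "GOMAXPROCS value; <=0 keeps the runtime default")
	flag.Parse()

	if *procs > 0 {
		runtime.GOMAXPROCS(*procs)
	}
	// GOMAXPROCS(0) reports the current value without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```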
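For retries with exponential backoff, a minimal sketch (the attempt count and base delay are arbitrary, not values from this project):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry retries a GET with exponential backoff: 1s, 2s, 4s, ...
func fetchWithRetry(url string, attempts int) (*http.Response, error) {
	var lastErr error
	delay := time.Second
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		} else {
			lastErr = err
		}
		time.Sleep(delay)
		delay *= 2
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	resp, err := fetchWithRetry("https://example.com", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```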
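For writing results to a file with a periodic console status, one possible shape is sketched below; the channel, file name, and intervals are illustrative rather than the crawler's real internals.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	found := make(chan string)

	// Simulated producer; in the crawler, workers would send discovered URLs here.
	go func() {
		for i := 0; i < 25; i++ {
			found <- fmt.Sprintf("https://example.com/page/%d", i)
			time.Sleep(500 * time.Millisecond)
		}
		close(found)
	}()

	f, err := os.Create("urls.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	defer w.Flush()

	// The README suggests ~10s; a shorter tick here just makes the demo visible.
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	count := 0
	for {
		select {
		case u, ok := <-found:
			if !ok {
				fmt.Printf("done, %d urls written to urls.txt\n", count)
				return
			}
			fmt.Fprintln(w, u)
			count++
		case <-ticker.C:
			fmt.Printf("status: %d urls found so far\n", count)
		}
	}
}
```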