A fast and flexible web crawler written in Go.
For optimal performance, we recommend running the crawler on a system with at least:
- 4 vCPUs (virtual CPUs)
- 8GB RAM
- SSD storage (for faster disk operations)
- Stable internet connection
The web crawler crawls up to 100,000 domains in 6 minutes.[1][2]
To install requirements, run:
make dep
Run make or make build to compile the project and the plugins, or run make run to compile and start the application.
To compile the plugins, run make plugin, or just run make build.
For tests, run make test.
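Putting these together, a typical first run looks like this:

make dep
make build
make run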
The application accepts the following commands:

clear    - clear the screen
continue - continue the web crawler
exit     - exit the program
help     - display help
load     - load the website CSV file
pause    - pause the web crawler
quit     - stop the web crawler and the app
start    - start the web crawler
status   - get the status of the crawler
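For example, a typical interactive session might look like this (the prompt is illustrative, and load presumably reads the file configured as WebsiteCsvFile):

> load
> start
> status
> pause
> continue
> quit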
The default configuration file:
Depth: 0
DirectStart: false
Workers: 4
Blacklist:
RateLimit: 100
Browser: 0
Crawler: Get Html
SkipOutputer: true
Outputer:
Writer: Csv
BatchSize: 100
CsvField: 1
WebsiteCsvFile: websites.csv
DBUsername: "root"
DBPassword: ""
DBHost: "127.0.0.1"
DBPort: 3306
DBName: "crawlerDB"
KafkaBroker: "localhost:9092"
KafkaTopic: "test-topic"
In this configuration:

- Browser is 0 for Chrome or 1 for Firefox.
- BatchSize is the number of domains to work through in each iteration.
- RateLimit is the maximum number of domains per minute per worker.
- Workers defines the number of crawler and outputer instances.
- Depth is the crawl depth, where 0 means no depth.
- Blacklist is the file path to the blacklist.
- CsvField is the CSV field which will be crawled.
- The DB fields have to be used when the database plugin is chosen as the writer.
- The Kafka fields have to be used when the Kafka plugin is used as the writer.
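For example, a configuration that uses eight workers, follows links one level deep, crawls with Firefox, and applies a blacklist could change the relevant fields as follows (the values are only illustrative):

Depth: 1
Workers: 8
Blacklist: blacklist.txt
Browser: 1
RateLimit: 50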
There are three types of plugins: the crawler plugin, the outputer plugin, and the writer plugin.
The crawler plugin is responsible for getting the information from a page and writing it into an output struct. It receives a CRAWLER_INPUT interface as input:
type CRAWLER_INPUT interface {
Page() playwright.Page
Browser() playwright.BrowserContext
Locators() []playwright.Locator
}

where Page() gets the current Playwright page, Browser() gets the current browser context, and Locators() gets an array of Playwright locators.
As output, the function must return an OUTPUT struct:
type OUTPUT struct {
Str string
Str_array []string
Byte_array []byte
Ctx interface{}
Choice int
}

The outputer plugin is responsible for transforming and changing the information from the crawler plugin. It receives an OUTPUT struct (as shown above) as input and must return an OUTPUT struct as output.
The writer plugin is responsible for writing the information in different formats. It receives an array of OUTPUT structs as input and returns an integer or a src.ERR typedef.
Each plugin must have the following structure, consisting of a Name() function and a Start() function. In addition, each plugin must export a variable named Plugin!
type DemoPlugin struct{}
func (p *DemoPlugin) Name() string {
return "Demo Plugin"
}
func (p *DemoPlugin) Start(<CRAWLER_INPUT> || <OUTPUT>) <OUTPUT> || src.ERR {
}
var Plugin DemoPlugin

After one iteration, a state file is created which saves the current state of the batch.
The program automatically resumes when a state file is located; otherwise it will start from the beginning.
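As an illustration, a minimal crawler plugin that extracts the page title could follow the skeleton above. This is only a sketch: the import path of the src package (assumed here to define CRAWLER_INPUT and OUTPUT) and the meaning of the Choice field are assumptions that have to be adapted to the actual repository layout.

package main

import (
	"example.com/crawler/src" // assumed import path for the package defining CRAWLER_INPUT and OUTPUT
)

type TitlePlugin struct{}

func (p *TitlePlugin) Name() string {
	return "Title Plugin"
}

// Crawler plugins receive a CRAWLER_INPUT and return an OUTPUT struct.
func (p *TitlePlugin) Start(in src.CRAWLER_INPUT) src.OUTPUT {
	title, err := in.Page().Title()
	if err != nil {
		title = ""
	}
	return src.OUTPUT{
		Str:    title, // crawled value as a plain string
		Choice: 0,     // assumption: selects which OUTPUT field the next stage reads
	}
}

var Plugin TitlePlugin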
A Dockerfile is included in this repository. To create a Docker image, run: make docker.
The workdir of the Docker container is /opt/abc. You have to mount this directory as a volume for persistence.
For example: sudo docker run -v <your-path>:/opt/abc.
The default plugins will be copied to the volume mount if no plugins are present.
If no configuration file is given, a default configuration file will be written and can only be changed when you set up a volume.
To send data live to Kafka, the Kafka-Writer plugin can be used. Broker address and Topic can be written into the config.
If another Writer-Plugin is being used but data streaming to Kafka is still desired, a Kafka instance can be initialized within the respective plugin as follows:
writer := api.Kafka_writer_init(<BrokerAddress>, <Topic>)

where the following parameters must be specified:
BrokerAddress: The address of the Kafka broker
Topic: The Kafka topic to which the data should be written
The parameters can also be set via the configuration. To do this, load the configuration using config.Load_config(path string), after which they can be accessed.
After initialization, data can be written to Kafka using the following command:
api.Kafka_write(writer, <String>, <String>)
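For example, a writer plugin that streams every crawled record to Kafka could be sketched as follows. The import paths, the package that defines OUTPUT and ERR, the success value, and the key/value layout are assumptions that need to be adapted to the repository:

package main

import (
	"example.com/crawler/api" // assumed import path for the api package
	"example.com/crawler/src" // assumed import path for the package defining OUTPUT and ERR
)

type KafkaStreamWriter struct{}

func (p *KafkaStreamWriter) Name() string {
	return "Kafka Stream Writer"
}

// Writer plugins receive an array of OUTPUT structs and return a src.ERR typedef.
func (p *KafkaStreamWriter) Start(out []src.OUTPUT) src.ERR {
	// Broker and topic are hardcoded to the defaults from the configuration file;
	// alternatively they could be read after config.Load_config(path).
	writer := api.Kafka_writer_init("localhost:9092", "test-topic")

	for _, o := range out {
		// The key/value layout is an assumption; here the crawled string is sent as both.
		api.Kafka_write(writer, o.Str, o.Str)
	}
	return src.ERR(0) // assumption: 0 signals success
}

var Plugin KafkaStreamWriter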
Footnotes

1. The crawling speed can be lower for different websites and can lead to a disruption of the internet connection.
2. With 20 workers, the same website https://michaelengel.net (i.e. with browser caching), and the plugins Get Html and Csv.