WebCrawler

A fast and flexible web crawler written in Go.

Dependencies

Building

System Requirements

For optimal performance, we recommend running the crawler on a system with at least:

  • 4 vCPUs (virtual CPUs)
  • 8GB RAM
  • SSD storage (for faster disk operations)
  • Stable internet connection

The web crawler crawls up to 100,000 domains in 6 minutes [1] [2].

Requirements

To install requirements, run: make dep

Build:

Run make or make build to compile the project and the plugins, or run make run to compile and start the application.

Plugins:

To compile plugins, run make plugin or just run make build.

Tests:

For tests, run make test

Usage

clear         clear the screen
continue      continue the webcrawler
exit          exit the program
help          display help
load          load website csv file
pause         pause the webcrawler
quit          stop the webcrawler and app
start         start the webcrawler
status        get the status of the crawler

Config

The default configuration file:

Depth: 0
DirectStart: false
Workers: 4
Blacklist:
RateLimit: 100
Browser: 0
Crawler: Get Html
SkipOutputer: true
Outputer:
Writer: Csv
BatchSize: 100
CsvField: 1
WebsiteCsvFile: websites.csv
DBUsername: "root"
DBPassword: ""
DBHost: "127.0.0.1"
DBPort: 3306
DBName: "crawlerDB"
KafkaBroker: "localhost:9092"
KafkaTopic: "test-topic"

where Browser is 0 for Chrome or 1 for Firefox, and
BatchSize is the number of domains processed in each iteration.

RateLimit is the maximum number of domains per minute per worker, and
Workers defines the number of crawler and outputer instances.

Depth is the crawl depth, where 0 means no depth, and
Blacklist is the file path to the blacklist.

CsvField is the CSV field that will be crawled.

The DB fields have to be set when the database plugin is chosen as the writer.

The Kafka fields have to be set when the Kafka plugin is used as the writer.

Plugins

There are three types of plugins: the crawler plugin, the outputer plugin, and the writer plugin.

Crawler plugin

The crawler plugin is responsible for getting the information from a page and writing it into an output struct. The crawler plugin receives a CRAWLER_INPUT interface as input:

type CRAWLER_INPUT interface {
	Page() playwright.Page
	Browser() playwright.BrowserContext

	Locators() []playwright.Locator
}

where Page() returns the current Playwright page, Browser() returns the current browser context, and Locators() returns a slice of Playwright locators.

As output, the function must return an OUTPUT struct:

type OUTPUT struct {
	Str       string
	Str_array []string

	Byte_array []byte
	Ctx        interface{}

	Choice int
}
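
For illustration, a crawler Start body in the style of the Get Html plugin might look like the following sketch. Only the CRAWLER_INPUT and OUTPUT types come from this README; the import path and the choice of which OUTPUT fields to fill are assumptions.

package main

// Hypothetical import path for the project's types; adjust it to the
// actual module path of this repository.
import "webcrawler/src"

// Sketch: read the rendered HTML of the current page and pass it on to
// the outputer via the Str field.
func crawl(in src.CRAWLER_INPUT) src.OUTPUT {
	html, err := in.Page().Content() // full HTML of the current page
	if err != nil {
		return src.OUTPUT{} // error convention is an assumption
	}
	return src.OUTPUT{Str: html}
}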

Outputer plugin

The outputer plugin is responsible for transforming and changing the information from the crawler plugin. It takes an OUTPUT struct as input and returns an OUTPUT struct of the same type:

type OUTPUT struct {
	Str       string
	Str_array []string

	Byte_array []byte
	Ctx        interface{}

	Choice int
}
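
As an illustration, an outputer Start body could post-process the crawler's result, for example by trimming the HTML and additionally exposing it line by line. Which fields a real outputer fills is up to the plugin author; the import path below is an assumption.

package main

import (
	"strings"

	// Hypothetical import path for the project's types.
	"webcrawler/src"
)

// Sketch: normalise the crawled text and also provide it as a slice of lines.
func transform(in src.OUTPUT) src.OUTPUT {
	in.Str = strings.TrimSpace(in.Str)
	in.Str_array = strings.Split(in.Str, "\n")
	return in
}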

Writer plugin

The writer plugin is responsible for writing the information in different formats. As input, it receives an array of OUTPUT structs (as defined above); as output, the function returns a src.ERR typedef (an integer).
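
For illustration, a writer Start body could append each result to a text file, as in the sketch below. The import path, the file name, and the assumption that src.ERR is an integer code with 0 meaning success are not confirmed by this README.

package main

import (
	"os"

	// Hypothetical import path for the project's types.
	"webcrawler/src"
)

// Sketch: append every crawled result to results.txt and report success
// or failure via src.ERR (assumed: 0 = ok, 1 = error).
func write(outputs []src.OUTPUT) src.ERR {
	f, err := os.OpenFile("results.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return 1
	}
	defer f.Close()

	for _, out := range outputs {
		if _, err := f.WriteString(out.Str + "\n"); err != nil {
			return 1
		}
	}
	return 0
}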

Plugin definition

Each plugin must have the following structure, consisting of a Name() function and a Start() function. In addition, each plugin must export a variable named Plugin!

type DemoPlugin struct{}

func (p *DemoPlugin) Name() string {

	return "Demo Plugin"
}

// Start receives a CRAWLER_INPUT (crawler plugin) or an OUTPUT / array of
// OUTPUT (outputer / writer plugin) and returns an OUTPUT or a src.ERR,
// depending on the plugin type.
func (p *DemoPlugin) Start(<CRAWLER_INPUT> || <OUTPUT>) <OUTPUT> || src.ERR {

}

var Plugin DemoPlugin
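
Since the webcrawler looks up the exported Plugin variable at load time, the plugins are presumably compiled as Go plugin shared objects; make plugin most likely wraps a command along the lines of go build -buildmode=plugin -o demoplugin.so demoplugin.go, where the file names here are only placeholders.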

Persistence

After one iteration, a state file is created that saves the current state of the batch.
The program automatically resumes when a state file is found; otherwise it starts from the beginning.

Containerization

A Dockerfile is included in this repository. To create a Docker image, run: make docker.

The workdir of the Docker container is /opt/abc. You have to mount this path for persistence.
For example: sudo docker run -v <your-path>:/opt/abc.

The default plugins will be copied to the volume mount if no plugins are present.
If no configuration file is given, a default configuration file will be written; it can only be changed if a volume is mounted.

Data-Streaming

To send data live to Kafka, the Kafka-Writer plugin can be used. Broker address and Topic can be written into the config.
If another Writer-Plugin is being used but data streaming to Kafka is still desired, a Kafka instance can be initialized within the respective plugin as follows:

writer := api.Kafka_writer_init(<BrokerAddress>, <Topic>)

Where the following parameters must be specified:
BrokerAddress: The address of the Kafka broker
Topic: The Kafka topic to which the data should be written

The parameters can also be set via the configuration. To do this, load the configuration using config.Load_config(path string), after which they can be accessed.
After initialization, data can be written to Kafka using the following command:

api.Kafka_write(writer, <String>, <String>)
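
Put together, streaming from inside a custom writer plugin might look like the following sketch. Only Kafka_writer_init and Kafka_write are taken from this README; the import paths, the broker and topic values, and the assumption that the two string arguments are a key and a value are not confirmed here.

package main

import (
	// Hypothetical import paths for the project's packages.
	"webcrawler/api"
	"webcrawler/src"
)

// Sketch: stream every crawled result to Kafka from a writer plugin.
func streamToKafka(outputs []src.OUTPUT) {
	// Example broker and topic; in practice they can be read from the
	// configuration after calling config.Load_config(path).
	writer := api.Kafka_writer_init("localhost:9092", "test-topic")

	for _, out := range outputs {
		api.Kafka_write(writer, out.Str, out.Str) // assumed: key, value
	}
}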

Footnotes

  1. The crawling speed can be lower for different websites, and heavy crawling can disrupt the internet connection.

  2. Measured with 20 workers, the same website https://michaelengel.net (i.e. with browser caching), and the get html and csv plugins.
