WIER: Crawler implementation

IMPORTANT: THE report.pdf FILE IN THE ROOT OF THE REPOSITORY IS SOFT-LINKED TO report/report.pdf

Brief description

This is an implementation of a web crawler written in Scala.

It uses Akka Actors to spawn multiple workers that run the crawl in parallel. To follow the crawling rules given in robots.txt files we use crawler-commons, which also provides sitemap support.

Using the crawler, we gather pages from as many .gov.si websites as possible. To render the pages and obtain their status codes, we use HtmlUnit. We start from a limited static seed list provided in CrawlerApp.scala. We gather the HTML content of all encountered pages, along with all images and binary files of type pdf, doc, docx, ppt, and pptx, within the seed-list domains and their subdomains.
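The scope rule described above (stay within the seed domains and their subdomains, and recognise the listed binary file types) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this repository: the object name CrawlScope, its method names, and the default domain set are all hypothetical, and the real seed list lives in CrawlerApp.scala.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of the repository.
object CrawlScope {
  // Binary file types named in the description above.
  val binaryExtensions: Set[String] = Set("pdf", "doc", "docx", "ppt", "pptx")

  // Lower-cased host of a URL, or None if the URL has no parsable host.
  def hostOf(url: String): Option[String] =
    Try(new URI(url).getHost).toOption.flatMap(Option(_)).map(_.toLowerCase)

  // True when the host equals one of the given domains or is a subdomain of one.
  // The default domain set is an assumption made for this sketch.
  def inScope(url: String, domains: Set[String] = Set("gov.si")): Boolean =
    hostOf(url).exists(h => domains.exists(d => h == d || h.endsWith("." + d)))

  // Extension of the URL path if it is one of the binary types, else None.
  def binaryType(url: String): Option[String] =
    Try(new URI(url).getPath).toOption.flatMap(Option(_)).flatMap { path =>
      val dot = path.lastIndexOf('.')
      if (dot < 0) None
      else Some(path.substring(dot + 1).toLowerCase).filter(binaryExtensions.contains)
    }
}
```

Matching on "." + domain rather than a plain suffix keeps hosts such as www.gov.si.example.com out of scope. Detecting binaries by file extension is a simplification; a crawler could also inspect the Content-Type response header.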

All data gathered by the crawler is written to a local PostgreSQL database, using Slick as the relational mapper.

Requirements

In order to run the code you need the following:

A local instance of a PostgreSQL database, set up as follows:
1. Import the modified crawldb.sql into a database called crawldb.
2. Have a user postgres with no password (the default user).
sbt installed on the machine where the code will run.
A queue directory within the working directory of the executable. (If you are running everything from the root directory you should be fine.)

NOTE: The script provided in the repository is required to run the code because the database models that the code relies upon are generated at compile time.

NOTE: To change the database name, username, or password, modify the local configuration in application.conf.

Running

To run the code, make sure the database is set up as described in the previous section, then run the following command in the root directory:

sbt run

To stop the execution, send an interrupt signal (SIGINT) from the console by pressing CTRL + C.

Packaging

To generate a .jar executable, run the following command in the root directory:

# generate a fat jar:
sbt assembly
# upon completion run the jar:
java -jar ./target/scala-2.12/ieps-crawler-assembly-0.1.jar
