IMPORTANT: THE report.pdf FILE IN THE ROOT OF THE REPOSITORY IS SOFT-LINKED TO report/report.pdf
This is an implementation of a web crawler written in Scala.
It uses Akka Actors to spawn multiple workers so that crawling runs in parallel. To follow the crawling rules provided in robots.txt files we use crawler-commons, which also gives us sitemap support.
Using the crawler, we gather pages from as many .gov.si websites as possible. To render the websites and get the status code, we use HtmlUnit. We start off with a limited static seed list provided in CrawlerApp.scala. We gather the HTML content of all encountered pages, including all images and binary files of type pdf, doc, docx, ppt, and pptx, within the seed list and its subdomains.
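The scoping rules described above (stay within .gov.si, treat pdf, doc, docx, ppt and pptx as binary documents) can be illustrated with a small shell sketch. This is not the crawler's Scala code: the function names `is_binary_document` and `is_gov_si` are hypothetical, and the URL parsing is deliberately simplified (for example, it does not handle user:password@host URLs).

```shell
# Hypothetical helper: succeeds when the URL path ends in one of the binary
# document extensions the crawler collects (pdf, doc, docx, ppt, pptx).
is_binary_document() {
  # Drop any query string or fragment, then lowercase for a case-insensitive match.
  path=$(printf '%s' "${1%%[?#]*}" | tr '[:upper:]' '[:lower:]')
  case "$path" in
    *.pdf|*.doc|*.docx|*.ppt|*.pptx) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeeds when the URL's host is gov.si or one of its subdomains.
is_gov_si() {
  host=${1#*://}        # strip the scheme, if present
  host=${host%%[/?#]*}  # strip path, query string and fragment
  host=${host%%:*}      # strip the port, if present
  host=$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')
  case "$host" in
    gov.si|*.gov.si) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_gov_si "https://www.gov.si/" && echo "in scope"` prints a line only for URLs whose host falls under gov.si.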
All the data gathered by the crawler is written to a local instance of a PostgreSQL database using Slick as a relational mapper.
In order to run the code you need the following:
- A local instance of a PostgreSQL database needs to be set up:
  - Have the modified crawldb.sql imported into a database called crawldb
  - Have a user postgres with no password (default user)
- Make sure that sbt is installed on the machine where the code will be running
- Make sure that a queue directory exists within the working directory of the executable. (If you are running everything from the root directory you should be fine.)
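The prerequisites above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the location of crawldb.sql (here the current directory, overridable through `SCHEMA_FILE`) is a guess, and the PostgreSQL client tools `createdb` and `psql` are assumed to be installed. Only the queue directory is created automatically; the database step is left as a function to call once PostgreSQL is running.

```shell
# Database name and user as given in this README; the schema path is an assumption.
DB_NAME=crawldb
DB_USER=postgres
SCHEMA_FILE=${SCHEMA_FILE:-crawldb.sql}

# Print the name of every given command-line tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Create the database and import the modified schema.
# Requires a running PostgreSQL server that accepts the postgres user.
setup_database() {
  createdb -U "$DB_USER" "$DB_NAME" &&
    psql -U "$DB_USER" -d "$DB_NAME" -f "$SCHEMA_FILE"
}

# Create the queue directory in the current working directory if it is absent.
ensure_queue_dir() {
  mkdir -p queue
}

ensure_queue_dir
missing_tools sbt psql createdb
```

Running the script from the repository root creates `queue` if needed and lists any missing tools; calling `setup_database` afterwards performs the import.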
NOTE: The script provided in the repository is required to run the code because the database models that the code relies upon are generated at compile time.
NOTE: To change the database name, username, or password, please modify the local configuration in application.conf.
To run the code please make sure you have the database set up (previous section) and then run the following command in the root directory:
sbt run
To stop the execution, press CTRL + C in the console, which sends a SIGINT signal to the process.
To generate a .jar executable, run the following commands in the root directory:
# generate a fat jar:
sbt assembly
# upon completion run the jar:
java -jar ./target/scala-2.12/ieps-crawler-assembly-0.1.jar