pr3mar/ieps-crawler

WIER: Crawler implementation

IMPORTANT: THE report.pdf FILE IN THE ROOT OF THE REPOSITORY IS SOFT-LINKED TO report/report.pdf

Brief description

This is an implementation of a web crawler written in Scala.

It uses Akka Actors to spawn multiple workers that crawl in parallel. To follow the crawling rules given in robots.txt files, we use crawler-commons, which also provides sitemap support.

Using the crawler, we gather pages from as many .gov.si websites as possible. To render the pages and get their status codes, we use HtmlUnit. We start with a limited static seed list defined in CrawlerApp.scala. We gather the HTML content of all encountered pages, along with all images and binary files of type pdf, doc, docx, ppt, and pptx, within the seed domains and their subdomains.

All the data gathered by the crawler is written to a local instance of a PostgreSQL database using Slick as a relational mapper.

Requirements

To run the code, you need the following:

  1. A local instance of a PostgreSQL database, set up with:
    1. The modified crawldb.sql imported into a database called crawldb
    2. A user postgres with no password (default user)
  2. sbt installed on the machine where the code will run
  3. A queue directory inside the working directory of the executable. (If you run everything from the root directory, you should be fine.)
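
The requirements above could be satisfied roughly as follows. This is only a sketch: it assumes a PostgreSQL server is already running locally, that the postgres user can connect without a password, that the standard createdb and psql client tools are on the PATH, and that crawldb.sql sits in the current directory (adjust the path to wherever the script lives in the repository).

```shell
# Create the database (reports an error if it already exists).
createdb -U postgres crawldb

# Import the schema; the path to crawldb.sql is an assumption.
psql -U postgres -d crawldb -f crawldb.sql

# Check that sbt is available; prints its path if found.
command -v sbt

# Create the queue directory in the working directory if it is missing.
mkdir -p queue
```

Run these from the repository root so that the queue directory ends up where the crawler expects it.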

NOTE: The script provided in the repository is required to run the code, because the database models that the code relies on are generated at compile time.

NOTE: To change the database name, username, or password, modify the local configuration in application.conf.

Running

To run the code, make sure the database is set up (see the previous section), then run the following command in the root directory:

sbt run

To stop the execution, press Ctrl+C in the console, which sends a SIGINT signal to the process.

Packaging

To generate a .jar executable, run the following command in the root directory:

# generate a fat jar:
sbt assembly
# upon completion run the jar:
java -jar ./target/scala-2.12/ieps-crawler-assembly-0.1.jar