No description, website, or topics provided.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
.gitignore
README
pom.xml

README

SimCrawler
==========

This project was develop as an programming exercise for the course Computational
Advertising at the University of Leipzig. The main task was to implement a
simulated web crawler, with three parts. The first one was to implement a
breadth-first search crawling strategy. The second task consisted of writing two
page level and two site level crawling strategies, which should be freely
combinable. The page level strategies were round robin and max page priority and
for the site level backlink and OPIC. The third task was to combine the site
level strategies to one and implement it as optimal strategy.



Files
-----

There are three different input files:
1. the link graph file, this file holds the entry link graph which will be used. The
file needs to be sorted (for faster read) and each line has the following format:
<from URL><TAB><to URL>

2. the quality mapping file, this file holds the mapping whether a URL is a good
one or not, we had values which were either 1 or 0. Each line has the following
format:
<URL> <quality>

3. the seed file, this file contains the seed URL with which the crawler will
start, each line has one URL.



Usage
-----

simcrawler -k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>
	-k			: urls per crawling step
	-qm
	--quality_mapping	: quality mapping input file
	-wg
	--web_graph		: web graph input file
	-sf
	--seed_file		: seed url input file
	-sq
	--step_quality		: step quality output file
	-ms
	--max_steps		: maximum number of steps (optinal)
	-rr
	-raound_robin		: round robin page crawling strategy
	-mpp
	-max_page_priority	: max page priority page crawling strategy
	-bl
	-backlink		: backlink site crawling strategy
	-opic			: opic site crawling strategy
	-o
	-optimal		: optimal site crawling strategy
	-b
	--batch_size		: batch size

If no web graph file and/or quality mapping file is given a mapdb file is expected in ./data.



Compile
-------

[Maven]
mvn clean compile test exec:java -Dexec.args="-k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>"