Skip to content

jnphilipp/SimCrawler

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
src
 
 
 
 
 
 
 
 
SimCrawler
==========

This project was develop as an programming exercise for the course Computational
Advertising at the University of Leipzig. The main task was to implement a
simulated web crawler, with three parts. The first one was to implement a
breadth-first search crawling strategy. The second task consisted of writing two
page level and two site level crawling strategies, which should be freely
combinable. The page level strategies were round robin and max page priority and
for the site level backlink and OPIC. The third task was to combine the site
level strategies to one and implement it as optimal strategy.



Files
-----

There are three different input files:
1. the link graph file, this file holds the entry link graph which will be used. The
file needs to be sorted (for faster read) and each line has the following format:
<from URL><TAB><to URL>

2. the quality mapping file, this file holds the mapping whether a URL is a good
one or not, we had values which were either 1 or 0. Each line has the following
format:
<URL> <quality>

3. the seed file, this file contains the seed URL with which the crawler will
start, each line has one URL.



Usage
-----

simcrawler -k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>
	-k			: urls per crawling step
	-qm
	--quality_mapping	: quality mapping input file
	-wg
	--web_graph		: web graph input file
	-sf
	--seed_file		: seed url input file
	-sq
	--step_quality		: step quality output file
	-ms
	--max_steps		: maximum number of steps (optinal)
	-rr
	-raound_robin		: round robin page crawling strategy
	-mpp
	-max_page_priority	: max page priority page crawling strategy
	-bl
	-backlink		: backlink site crawling strategy
	-opic			: opic site crawling strategy
	-o
	-optimal		: optimal site crawling strategy
	-b
	--batch_size		: batch size

If no web graph file and/or quality mapping file is given a mapdb file is expected in ./data.



Compile
-------

[Maven]
mvn clean compile test exec:java -Dexec.args="-k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>"

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages