jnphilipp/SimCrawler
SimCrawler
==========

This project was developed as a programming exercise for the course Computational Advertising at the University of Leipzig. The main task was to implement a simulated web crawler in three parts. The first was to implement a breadth-first search crawling strategy. The second consisted of writing two page-level and two site-level crawling strategies that can be freely combined. The page-level strategies were round robin and max page priority; the site-level strategies were backlink and OPIC. The third task was to combine the site-level strategies into one and implement it as the optimal strategy.

Files
-----

There are three different input files:

1. The link graph file holds the link graph that will be used. The file needs to be sorted (for faster reading), and each line has the following format:

        <from URL><TAB><to URL>

2. The quality mapping file records whether a URL is a good one or not; our values were either 1 or 0. Each line has the following format:

        <URL> <quality>

3. The seed file contains the seed URLs with which the crawler will start, one URL per line.

Usage
-----

    simcrawler -k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>

    -k                       : URLs per crawling step
    -qm --quality_mapping    : quality mapping input file
    -wg --web_graph          : web graph input file
    -sf --seed_file          : seed URL input file
    -sq --step_quality       : step quality output file
    -ms --max_steps          : maximum number of steps (optional)
    -rr -raound_robin        : round robin page crawling strategy
    -mpp -max_page_priority  : max page priority page crawling strategy
    -bl -backlink            : backlink site crawling strategy
    -opic                    : OPIC site crawling strategy
    -o -optimal              : optimal site crawling strategy
    -b --batch_size          : batch size

If no web graph file and/or quality mapping file is given, a MapDB file is expected in ./data.
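
To make the file formats and the `-k` / `-ms` options more concrete, here is a minimal sketch of a simulated breadth-first crawl. It is written in Python for brevity and is not the project's implementation. The function names (`parse_link_graph`, `parse_quality`, `bfs_crawl`) are hypothetical, the quality file is assumed to be whitespace-separated, and "step quality" is assumed to mean the number of good URLs fetched in a step; the README does not specify the actual output format.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional


def parse_link_graph(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build an adjacency list from '<from URL><TAB><to URL>' lines."""
    graph: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        src, dst = line.split("\t", 1)
        graph.setdefault(src, []).append(dst)
    return graph


def parse_quality(lines: Iterable[str]) -> Dict[str, int]:
    """Read '<URL> <quality>' lines (separator assumed to be whitespace)."""
    quality: Dict[str, int] = {}
    for line in lines:
        parts = line.rsplit(None, 1)
        if len(parts) == 2:
            quality[parts[0]] = int(parts[1])
    return quality


def bfs_crawl(
    graph: Dict[str, List[str]],
    quality: Dict[str, int],
    seeds: Iterable[str],
    k: int,
    max_steps: Optional[int] = None,
) -> List[int]:
    """Crawl up to k URLs per step in FIFO order.

    Returns, for each step, the number of good URLs fetched in that step
    (an assumed definition of step quality).
    """
    frontier: deque = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    step_quality: List[int] = []
    while frontier and (max_steps is None or len(step_quality) < max_steps):
        good = 0
        # Only URLs already queued when the step starts are fetched in it.
        for _ in range(min(k, len(frontier))):
            url = frontier.popleft()
            good += quality.get(url, 0)  # unknown URLs count as not good
            for target in graph.get(url, []):
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
        step_quality.append(good)
    return step_quality


if __name__ == "__main__":
    graph = parse_link_graph(["a\tb\n", "a\tc\n", "b\td\n", "c\td\n"])
    quality = parse_quality(["a 1\n", "b 0\n", "c 1\n", "d 0\n"])
    print(bfs_crawl(graph, quality, ["a"], k=2))  # prints [1, 1, 0]
```

The FIFO queue is what makes this breadth-first. The page-level and site-level strategies described above would replace the queue with a different ordering of the frontier.
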

Compile
-------

[Maven]

    mvn clean compile test exec:java -Dexec.args="-k <k> -qm <quality mapping> -wg <web graph> -sf <seed urls> -sq <step quality> -ms <max steps>"