web-crawler

A didactic web crawler for the Web Search Engines (CS 6913) course at NYU.

Requirements

Python 3.7
pip
virtualenv (optional but recommended)

Install

$ pip install -r requirements.txt

Instructions

Crawling the query "new york university" with priority score:

$ python crawler.py "new york university"

Crawling the query "new york university" without priority score (BFS-style):

$ python crawler.py --bfs "new york university"
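The difference between the two modes comes down to the order in which discovered URLs are visited. The sketch below is illustrative only and is not taken from crawler.py: the class names `PriorityFrontier` and `BFSFrontier` are hypothetical, and it assumes the priority mode visits the highest-scoring URL first while the BFS mode visits URLs in discovery order.

```python
import heapq
from collections import deque


class PriorityFrontier:
    """Hypothetical frontier that pops the URL with the highest score first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]


class BFSFrontier:
    """Hypothetical frontier that pops URLs in discovery order, ignoring scores."""

    def __init__(self):
        self._queue = deque()

    def push(self, url, score=None):
        self._queue.append(url)

    def pop(self):
        return self._queue.popleft()
```

With the same three URLs pushed as `a` (score 1), `b` (score 5), and `c` (score 3), the priority frontier would return `b`, `c`, `a`, while the BFS frontier would return `a`, `b`, `c`.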

Saving output to file:

$ python crawler.py "new york university" > output.txt

Following the saved output in real time:

$ tail -f output.txt

Getting help:

$ python crawler.py -h

Priority Score

The total priority score is simply the sum of the novelty and importance scores.

Novelty starts at 10 and decreases by 0.1 each time the URL's domain is crawled, down to a minimum of 0.

Importance starts at 0. It increases by 1 each time the specific URL is parsed out of a crawled page, and by 0.01 each time its domain is parsed out.

score = novelty + importance
novelty = max(0, 10 - 0.1*domain_crawled)
importance = 1*url_parsed + 0.01*domain_parsed
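As a rough illustration, the formulas above can be written as plain Python functions. This is a minimal sketch and not the code in crawler.py: the function names and the parameters `domain_crawled`, `url_parsed`, and `domain_parsed` are assumptions, standing for how many times the domain has been crawled, how many times the URL has been parsed out, and how many times the domain has been parsed out.

```python
def novelty(domain_crawled):
    """Start at 10, lose 0.1 per crawl of the domain, never go below 0."""
    return max(0.0, 10 - 0.1 * domain_crawled)


def importance(url_parsed, domain_parsed):
    """Gain 1 per time the URL was parsed out, 0.01 per time its domain was."""
    return 1 * url_parsed + 0.01 * domain_parsed


def priority_score(domain_crawled, url_parsed, domain_parsed):
    """Total priority is the sum of novelty and importance."""
    return novelty(domain_crawled) + importance(url_parsed, domain_parsed)


# A never-seen URL on a never-crawled domain scores exactly the initial novelty.
print(priority_score(0, 0, 0))  # 10.0
```

Under these definitions, novelty reaches its floor of 0 once a domain has been crawled 100 times. After that point a URL's score depends only on how often it and its domain have been parsed out.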

Missing features

MIME type checking
Overall crawling statistics

