
web-crawler

Didactic web crawler for the Web Search Engines (CS 6913) course at NYU

Requirements

  • Python 3.7
  • pip
  • virtualenv (optional but recommended)

Install

$ pip install -r requirements.txt

Instructions

Crawling the query new york university with priority scoring:

$ python crawler.py "new york university"

Crawling the query new york university without priority scoring (BFS-style):

$ python crawler.py --bfs "new york university"

Saving the output to a file:

$ python crawler.py "new york university" > output.txt

Printing the output in real time (in another terminal, while the crawler writes to output.txt):

$ tail -f output.txt

Getting help:

$ python crawler.py -h

Priority Score

The total priority score is the sum of the novelty and importance scores.

Novelty starts at 10 and decreases by 0.1 each time the URL's domain is crawled; it never drops below 0.

Importance starts at 0 and increases by 1 each time the specific URL is parsed out of a crawled page, and by 0.01 each time its domain is parsed out of a crawled page.

score = novelty + importance
novelty = max(0, 10 - 0.1*times_domain_crawled)
importance = 1*times_url_parsed + 0.01*times_domain_parsed
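
A minimal Python sketch of this scoring, assuming per-URL and per-domain counters; the names (priority_score, domain_crawls, url_refs, domain_refs) are illustrative and not taken from crawler.py:

from collections import defaultdict

# Illustrative counters; crawler.py may track these differently.
domain_crawls = defaultdict(int)  # times each domain has been crawled
url_refs = defaultdict(int)       # times each URL was parsed from crawled pages
domain_refs = defaultdict(int)    # times each domain was parsed from crawled pages

def priority_score(url, domain):
    # score = novelty + importance
    novelty = max(0, 10 - 0.1 * domain_crawls[domain])
    importance = 1 * url_refs[url] + 0.01 * domain_refs[domain]
    return novelty + importance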

Missing features

  • MIME type checking
  • Overall crawling statistics
