
mxku/web-crawler

web-crawler

Web-Crawler has been built from a bare-minimum set of components:

  1. HTTP Fetcher
  2. Page Extractor
  3. Policy - PageFetch, Revisit
  4. Visited URL eliminator
  5. DataStore
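To make component 4 concrete, here is a minimal sketch of a visited-URL eliminator: a thread-safe set that admits each normalized URL exactly once. The class and method names are illustrative assumptions, not taken from the repository.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the "Visited URL eliminator" component:
// a thread-safe set that accepts each normalized URL exactly once,
// so downstream components fetch each page only one time.
public class VisitedUrlEliminator {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a URL is offered.
    public boolean offer(String url) {
        return visited.add(normalize(url));
    }

    // Deliberately naive normalization: trim whitespace and strip a
    // trailing slash. A real crawler needs fuller canonicalization
    // (scheme/host case, default ports, fragment removal, and so on).
    private String normalize(String url) {
        String u = url.trim();
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    public static void main(String[] args) {
        VisitedUrlEliminator e = new VisitedUrlEliminator();
        System.out.println(e.offer("http://example.com/a"));  // true: first visit
        System.out.println(e.offer("http://example.com/a/")); // false: duplicate after normalization
    }
}
```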

The Jsoup library (http://jsoup.org/) is used to parse the fetched HTML pages.
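The repository's own extractor class is not shown here, but the Jsoup calls involved in link extraction look roughly like this: parse the HTML against a base URI, select anchor elements, and resolve each `href` to an absolute URL via the `abs:href` attribute key. The class name and sample HTML are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of the page-extraction step with Jsoup (hypothetical class name).
public class LinkExtractorDemo {
    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"/about\">About</a>"
                + "<a href=\"http://other.example/page\">Other</a>"
                + "</body></html>";
        // The base URI lets Jsoup resolve relative links.
        Document doc = Jsoup.parse(html, "http://example.com/");
        for (Element link : doc.select("a[href]")) {
            // "abs:" prefix resolves the attribute to an absolute URL.
            System.out.println(link.attr("abs:href"));
        }
        // prints:
        //   http://example.com/about
        //   http://other.example/page
    }
}
```

Note that running this requires the Jsoup jar on the classpath; it is not part of the JDK.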

This implementation differs from a full production architecture in the following ways:

  1. An in-process LinkedBlockingQueue is used as the message-transfer mechanism between components
  2. Persistence is provided by an in-memory instance of HSQLDB
  3. The indexer and page-caching components are mock implementations
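Point 1 above can be sketched as follows: a producer thread (standing in for the HTTP fetcher) puts pages on a bounded `LinkedBlockingQueue`, and a consumer (standing in for the extractor) takes them off. `put` and `take` block when the queue is full or empty, which gives simple back-pressure for free. All names here are illustrative, not from the repository.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the in-process queue handoff between two
// crawler components (fetcher -> extractor).
public class QueueHandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: put() blocks when full, take() blocks when empty.
        BlockingQueue<String> fetchedPages = new LinkedBlockingQueue<>(100);

        Thread fetcher = new Thread(() -> {
            try {
                // Stand-in for real HTTP fetches.
                fetchedPages.put("<html>page 1</html>");
                fetchedPages.put("<html>page 2</html>");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        // Extractor side: take() waits until a page is available.
        System.out.println(fetchedPages.take());
        System.out.println(fetchedPages.take());
        fetcher.join();
    }
}
```

Swapping the queue for a real message broker would let the components run in separate processes without changing their logic.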
