The web crawler is built from a minimal set of components:
- HTTP Fetcher
- Page Extractor
- Policy - PageFetch, Revisit
- Visited URL eliminator
- DataStore
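
As an illustration of the visited URL eliminator, here is a minimal sketch using only the JDK. The class name `VisitedUrlFilter`, the method `markIfNew`, and the normalization rules (drop the fragment and a trailing slash) are assumptions made for this example, not the project's actual code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a visited-URL eliminator; all names are assumptions.
public class VisitedUrlFilter {
    // Thread-safe set of URLs that have already been seen.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is offered, false on repeats. */
    public boolean markIfNew(String url) {
        return seen.add(normalize(url));
    }

    // Simplistic normalization (an assumption): strip the fragment and a trailing slash.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        String s = hash >= 0 ? url.substring(0, hash) : url;
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        VisitedUrlFilter filter = new VisitedUrlFilter();
        System.out.println(filter.markIfNew("http://example.com/a"));     // first visit
        System.out.println(filter.markIfNew("http://example.com/a#top")); // same page, different fragment
    }
}
```

A real crawler would probably need stricter normalization (host case, default ports, query parameter order), which this sketch leaves out.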
HTML pages are parsed with the Jsoup library (http://jsoup.org/).
The implementation given here differs from the architecture in the following ways:
- An in-process LinkedBlockingQueue is used as the message transfer mechanism between components
- Persistence is provided by an in-memory instance of HSQLDB
- The indexer and page caching components are mock implementations
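
To show how an in-process LinkedBlockingQueue can carry messages between two components, here is a minimal sketch with a stub fetcher and a stub extractor. The class name `QueuePipeline`, the `STOP` sentinel, and the fake page content are assumptions made for this example; the project's actual wiring may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: two components connected by an in-process queue.
public class QueuePipeline {
    // Sentinel value that tells the consumer to stop (an assumption of this sketch).
    static final String STOP = "__STOP__";

    /** Runs a stub fetcher and a stub extractor; returns the pages the extractor received. */
    static List<String> run(List<String> urls) {
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();
        List<String> processed = new ArrayList<>();

        // Producer: pretends to fetch each URL and puts the "page" on the queue.
        Thread fetcher = new Thread(() -> {
            try {
                for (String url : urls) {
                    fetched.put("<html>page for " + url + "</html>");
                }
                fetched.put(STOP);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: blocks on take() until a page or the sentinel arrives.
        Thread extractor = new Thread(() -> {
            try {
                for (String page = fetched.take(); !page.equals(STOP); page = fetched.take()) {
                    processed.add(page);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        extractor.start();
        try {
            fetcher.join();
            extractor.join(); // join() makes the extractor's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("http://example.com/a", "http://example.com/b")));
    }
}
```

Because the queue is in-process, both components must run in the same JVM; replacing it with an external message broker would be the usual way to distribute them.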