Crawl, index and search web content

Overview

Treeing is a simple Java project that demonstrates website crawling, indexing, and searching using Apache Lucene. It provides the fundamentals for building a website search engine.

Usage

Treeing provides a small set of classes, including a multi-threaded web crawler and a simple HTML parser, and is easy to embed in a Java application.

Crawling and indexing websites

The following example runs the website crawler, which crawls the site and indexes its content.

  // Run the crawler on its own thread and wait for it to finish.
  Thread t = new Thread( new WebCrawler( "http://www.python.org", 2, "c:/test/luc" ) );
  t.start();
  t.join();

The snippet above crawls and indexes the entire website; the call to t.join() blocks until the crawl has finished.
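
Treeing's own parser is not shown in this README. As a rough illustration of the kind of step a crawler performs on each fetched page, the sketch below pulls links out of an HTML string and resolves them against a base URL. The class name LinkExtractorSketch and the method extractLinks are hypothetical and are not part of Treeing; the regular expression is a deliberate simplification and will miss cases that a real HTML parser handles.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Treeing class.
public class LinkExtractorSketch {

    // Matches the quoted href value of an anchor tag (simplified; unquoted
    // attributes and malformed markup are not handled).
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct http/https links found in html, resolved against
    // baseUrl, in order of first appearance.
    public static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Drop the fragment so that page#a and page#b count as one page.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                // Skip mailto:, javascript: and other non-web schemes.
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // The href is not a valid URI (for example, it contains spaces); ignore it.
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href='docs/intro.html#top'>Docs</a>";
        for (String link : extractLinks(html, "http://www.python.org/")) {
            System.out.println(link);
        }
    }
}
```

A crawler would typically feed the returned links back into its queue of pages to visit, skipping any it has already seen.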

Searching the index

Once the index has been created, it can be searched using the Lucene API. See the test case provided with the source for more details.

Resources

Apache Lucene
Crawl, index and search