Crawler and content extractor for building a full text index of a website's contents. Uses Ferret for indexing.

RDig

RDig provides an HTTP crawler and content extraction utilities to help you build a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know of no way to specify such an OR dependency in a gem specification, the gem depends on Hpricot. If this is a problem for you, install the gem with --force and then run +gem install rubyful_soup+ manually.

basic usage

Index creation

  • create a config file based on the template in doc/examples

  • to create an index:

    rdig -c CONFIGFILE
  • to run a query against the index (just to try it out)

    rdig -c CONFIGFILE -q 'your query'

    this will dump the first 10 search results to STDOUT

Handle search in your application:

require 'rdig'
require 'rdig_config'   # load your config file here
search_results = RDig.searcher.search(query)

see RDig::Search::Searcher for more information.

usage in rails

  • add to config/environment.rb :

    require 'rdig'
    require 'rdig_config'
  • place rdig_config.rb into config/ directory.

  • build index:

    rdig -c config/rdig_config.rb
  • in your controller that handles the search form:

    search_results = RDig.searcher.search(params[:query])
    @results = search_results[:list]
    @hitcount = search_results[:hitcount]
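
Search forms are often submitted empty, so it can help to skip the index lookup in that case. The following is a minimal sketch, not part of RDig: +safe_search+ is a hypothetical helper name, and the only assumption about the searcher is that it responds to +search(query)+ and returns a hash with +:list+ and +:hitcount+, as in the controller snippet above.

```ruby
# Hypothetical helper, not part of RDig: skips the index lookup for blank
# queries and returns the same :list / :hitcount shape in both cases.
def safe_search(searcher, query)
  query = query.to_s.strip            # nil and whitespace-only input become ''
  return { :list => [], :hitcount => 0 } if query.empty?
  searcher.search(query)
end

# In a controller this might be called as:
#   search_results = safe_search(RDig.searcher, params[:query])
```

Passing the searcher in as an argument keeps the helper easy to exercise without a real index.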

search result paging

Use the :first_doc and :num_docs options to implement paging through search results. (:num_docs is 10 by default, so without these options only the first 10 results are retrieved.)
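
As a rough sketch of how a page number could be mapped onto these two options: the helper names +paging_options+ and +page_count+ are hypothetical and not part of RDig. The sketch assumes that :first_doc is a zero-based offset into the result list and that the options are passed to +search+ as a hash in the second argument.

```ruby
# Hypothetical helpers, not part of RDig.

# Translate a 1-based page number into :first_doc / :num_docs options.
# Assumes :first_doc is a zero-based offset into the result list.
def paging_options(page, per_page = 10)
  page = [page.to_i, 1].max           # treat nil, 0 or negative input as page 1
  { :first_doc => (page - 1) * per_page, :num_docs => per_page }
end

# Number of pages needed to display hitcount results.
def page_count(hitcount, per_page = 10)
  (hitcount + per_page - 1) / per_page   # integer ceiling division
end

# Assumed call shape (options hash as second argument):
#   search_results = RDig.searcher.search(params[:query],
#                                         paging_options(params[:page]))
#   @pages = page_count(search_results[:hitcount])
```

Calling +to_i+ on the page value lets the helper accept the string that arrives in a request parameter as well as an integer.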

sample configuration

From doc/examples/config.rb. The tag_selector properties are called with a BeautifulSoup instance as their parameter. See the RubyfulSoup site for more information about this library. You can also have a look at the html_content_extractor unit test.

:include:doc/examples/config.rb
