Ivory: A Hadoop toolkit for web-scale information retrieval research
<p>Ivory is a Hadoop toolkit for web-scale information retrieval research.</p>
<p>In order to temper expectations: Ivory is a research system, not a
full-featured search engine! It's aimed at information retrieval
researchers who generally know their way around retrieval algorithms,
postings lists, etc. If you want to, for example, play with the latest
research coming out of SIGIR and related venues, then Ivory is for
you. On the other hand, if you just want search capabilities as a
"black box", <a href="">Lucene</a>
is likely a better
choice. <a href="">Katta</a> is a
framework for serving distributed Lucene indexes that plays well with
Hadoop clusters.</p>
<p>Ivory was specifically designed to work with Hadoop "out of the
box" on
the <a href="">ClueWeb09
collection</a>, a 1-billion-page (25 TB) Web crawl distributed by
Carnegie Mellon University. The initial release of Ivory is meant to
serve as a reference implementation of indexing and retrieval
algorithms that can operate at the multi-terabyte scale. Another
interesting experimental aspect of Ivory is the retrieval
architecture: we've been playing with retrieval engines that directly
read postings from HDFS.</p>
<p>Ivory is available on <a href="">GitHub</a>.</p>
<ul>
<li><a href="docs/api/index.html">Ivory API javadoc</a></li>
<li>Getting started with <a href="docs/trec.html">TREC disks 4-5</a></li>
<li>Getting started with <a href="docs/wt10g.html">the Wt10g collection</a></li>
<li>Getting started with <a href="docs/gov2.html">the Gov2 collection</a></li>
<li>Getting started with <a href="docs/clue.html">the ClueWeb09 collection</a></li>
<li>Ivory <a href="docs/pipeline.html">preprocessing and indexing pipeline</a></li>
<li>Ivory <a href="docs/pwsim.html">pairwise document similarity computation</a></li>
<li><a href="docs/regression.html">Experimental results</a> with Ivory</li>
<li><a href="docs/team.html">Project team</a></li>
</ul>
<p style="line-height:90%"><small>This work is or has been supported
by the following sources: NSF under awards IIS-0836560 and
IIS-0705832; Google and IBM under the Academic Cloud Computing
Initiative (ACCI); the Intramural Research Program of the NIH,
National Library of Medicine; DARPA/IPTO Contract No. HR0011-06-2-0001
under the GALE program; and Amazon Web Services. Any opinions,
findings, conclusions, or recommendations expressed here do not
necessarily reflect those of the sponsors. </small></p>
