kowsik/rucene

Rucene

Jotting down an idea before I forget. I've used Lucene in the past for searches (http://www.pcapr.net) and I absolutely <3 it. But some of the blog posts on repurposing Redis for free-text search are way off: storing all prefixes is such a kludge. So the question is, can we do something better?

Idea

This is in a very raw form. Use a simple tokenizer and a bunch of stop words to convert the input document into a set of terms. For each term, compute tf, the term frequency within the document, and store it in a sorted set keyed by the term, with the docid as the member and tf as the score. The docid is a simple opaque document identifier.

<term1>: zset<docid, tf(term in doc)>
<term2>: zset<docid, tf(term in doc)>
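
A minimal sketch of the indexing step in Python, with a plain dict standing in for the per-term sorted sets so it runs without a Redis server. The tokenizer, the tiny stop-word list and the names tokenize and index_document are all assumptions for illustration; against real Redis each dict entry would be a ZADD on the term's key.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def index_document(index, docid, text):
    """Store tf(term in doc) under each term, like ZADD <term> <tf> <docid>."""
    for term, tf in Counter(tokenize(text)).items():
        index[term][docid] = tf

# term -> {docid: tf}, an in-memory stand-in for one zset per term
index = defaultdict(dict)
index_document(index, "doc1", "The quick fox and the lazy fox")
index_document(index, "doc2", "A quick search in Redis")
```

Here "fox" ends up with a single entry for doc1 with tf 2, while "quick" has one entry per document.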

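During query time, compute idf from df, the number of documents that contain the term (zcard for the term really). idf has to go down as df goes up, so that rare terms count for more than common ones. Lucene does a lot more than that with boosts and what not.

weight = idf(t)^2
idf(t) = 1 + log(N / (df(t) + 1)), roughly what Lucene's classic similarity uses
df(t) = no of documents that contain term (zcard for the term)
N = total no of documents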
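
A sketch of the per-term weight, assuming the zcard count (the document frequency) is turned into an inverse weight along the lines of Lucene's classic similarity, so that rare terms count for more. The total document count num_docs is an assumption here; the notes above do not say where it would be kept.

```python
import math

def idf(df, num_docs):
    """Inverse document frequency: shrinks as more documents contain the term."""
    return 1.0 + math.log(num_docs / (df + 1))

def query_weight(df, num_docs):
    """Per-term weight for the query, idf(t)^2."""
    return idf(df, num_docs) ** 2

# df would come from ZCARD <term>; these numbers are made up.
rare = query_weight(1, 100)
common = query_weight(50, 100)
```

With these made-up counts the rare term gets the larger weight, which is the point of the inversion.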

So when we do the searching, simply intersect the sorted sets of the query terms, apply the per-term weighting as above and aggregate with sum. Each docid that contains every query term ends up with a score of sum(weight(t) * tf(t in doc)), so this effectively gives a set of docids sorted by relevance. Simplistic, but works.

zinterstore dst num_terms termi ... termk weights wi ... wk aggregate sum
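
To make the scoring concrete, here is a small in-memory emulation of what the command above computes, again with dicts standing in for the sorted sets. The function names, the sample index and the weights are assumptions for illustration, not code from this repo.

```python
def zinterstore(zsets, weights):
    """Emulate ZINTERSTORE ... WEIGHTS ... AGGREGATE SUM on plain dicts."""
    if not zsets:
        return {}
    # Only docids present in every term's set survive the intersection.
    common = set(zsets[0])
    for z in zsets[1:]:
        common &= set(z)
    # Each surviving docid scores sum(weight_i * tf_i) over the query terms.
    return {
        docid: sum(w * z[docid] for z, w in zip(zsets, weights))
        for docid in common
    }

def search(index, terms, weights):
    """Return docids containing all terms, most relevant first."""
    zsets = [index.get(t, {}) for t in terms]
    scores = zinterstore(zsets, weights)
    return sorted(scores, key=scores.get, reverse=True)

# term -> {docid: tf}; made-up sample data
index = {
    "quick": {"doc1": 1, "doc2": 1, "doc3": 2},
    "fox": {"doc1": 2, "doc3": 1},
}
ranked = search(index, ["quick", "fox"], [1.0, 4.0])
```

doc2 drops out because it lacks "fox". doc1 scores 1*1 + 4*2 = 9 and doc3 scores 1*2 + 4*1 = 6, so doc1 ranks first. On a real server the same ordering would come from reading dst back with ZREVRANGE.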

Code to come soon.

About

A poor-man's Lucene just using redis
