A simple implementation of a fulltext search engine

mirkosertic/InvertLight

InvertLight

Overview

InvertLight is a simple implementation of a fulltext search engine based on inverted indices.

InvertLight itself performs no content extraction; use another framework for that purpose. It only provides the required data structures and lets you query the fulltext index.
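
To make the term "inverted index" concrete, here is a minimal sketch of the underlying idea: a map from each token to the set of documents containing it. The class `TinyInvertedIndex` and its methods are hypothetical names chosen for illustration; this is not InvertLight's `InvertedIndex` class, and whitespace tokenization plus lowercasing are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index; not InvertLight's implementation.
class TinyInvertedIndex {
    // token -> IDs of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits the text on whitespace (an assumption) and records each lowercased token.
    void add(String docId, String text) {
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the IDs of all documents containing the token, or an empty set.
    Set<String> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("doc1", "this is a test");
        index.add("doc2", "this is another test");
        System.out.println(index.lookup("another")); // prints [doc2]
        System.out.println(index.lookup("test"));    // prints [doc1, doc2]
    }
}
```

A lookup is a single map access, which is why queries against an inverted index do not need to scan the documents themselves.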

Everything is available under the Apache License, Version 2.0.

API Usage

Initialize and query an inverted index

public static void main(String[] args) {
    // Initialize the index
    InvertedIndex theIndex = new InvertedIndex();
    UpdateIndexHandler theIndexHandler = new UpdateIndexHandler(theIndex);
    Tokenizer theTokenizer = new Tokenizer(new ToLowercaseTokenHandler(theIndexHandler));

    // Add some content
    theTokenizer.process(new Document("doc1", "this is a test"));

    // Token query
    Result theResult1 = theIndex.query(new TokenQuery("this", "is"));

    // Token Query with wildcards
    Result theResult2 = theIndex.query(new TokenQuery("th*", "i?"));

    // Query for a sequence of tokens
    Result theResult3 = theIndex.query(new TokenSequenceQuery("this", "is"));

    // Sequence query with wildcards
    Result theResult4 = theIndex.query(new TokenSequenceQuery("th*s", "i?"));
}

Query types

InvertLight supports the following query types:

| Query type | Description |
|------------|-------------|
| Token Query | Searches for documents containing all tokens, in any order. |
| Token Sequence Query | Searches for documents containing all tokens in exact order. |
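
The difference between the two query types can be sketched with a positional index, which stores where each token occurs in each document. `PositionalIndexSketch`, `allTokens` and `sequence` are hypothetical names, not InvertLight's `TokenQuery` or `TokenSequenceQuery`, and the sketch assumes that "exact order" means the tokens occupy adjacent positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of both query types; not InvertLight's implementation.
class PositionalIndexSketch {
    // token -> (document ID -> positions of the token in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    void add(String docId, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Token query: documents that contain every token, in any order.
    Set<String> allTokens(String... tokens) {
        Set<String> result = null;
        for (String token : tokens) {
            Map<String, List<Integer>> empty = Collections.emptyMap();
            Set<String> docs = postings.getOrDefault(token, empty).keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Token sequence query: the tokens must occupy adjacent positions (assumption).
    Set<String> sequence(String... tokens) {
        Set<String> result = new TreeSet<>();
        for (String docId : allTokens(tokens)) {
            for (int start : postings.get(tokens[0]).get(docId)) {
                if (matchesFrom(docId, start, tokens)) {
                    result.add(docId);
                    break;
                }
            }
        }
        return result;
    }

    // Checks that token i sits at position start + i for every remaining token.
    private boolean matchesFrom(String docId, int start, String[] tokens) {
        for (int i = 1; i < tokens.length; i++) {
            if (!postings.get(tokens[i]).get(docId).contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add("doc1", "this is a test");
        index.add("doc2", "is this a test");
        System.out.println(index.allTokens("this", "is")); // prints [doc1, doc2]
        System.out.println(index.sequence("this", "is"));  // prints [doc1]
    }
}
```

Both documents contain "this" and "is", so the token query returns both; only doc1 has them in that order, so the sequence query returns doc1 alone.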

Wildcards

InvertLight supports wildcards as query tokens. The following wildcards are supported:

| Wildcard | Meaning |
|----------|---------|
| ? | Any single character |
| * | Many characters |

Examples:

  • "*at" matches "that", "hat", "damnrat" and so on

  • "th?s" matches "this"
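
One common way to evaluate such wildcards is to translate them into a regular expression. The sketch below does that with `java.util.regex`; `WildcardSketch` is a hypothetical helper, not how InvertLight necessarily matches tokens, and it assumes that `*` may also match zero characters.

```java
import java.util.regex.Pattern;

// Hypothetical wildcard matcher; not InvertLight's implementation.
class WildcardSketch {
    // Translates '?' to "." and '*' to ".*"; every other character is quoted literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*"); // assumption: '*' also matches the empty string
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole token matches the wildcard expression.
    static boolean matches(String wildcard, String token) {
        return toPattern(wildcard).matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*at", "damnrat")); // prints true
        System.out.println(matches("th?s", "this"));   // prints true
        System.out.println(matches("th?s", "ths"));    // prints false
    }
}
```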

Suggestion queries

InvertLight supports suggestions based on token sequences. Here is an example:

InvertedIndex theIndex = new InvertedIndex();
UpdateIndexHandler theIndexHandler = new UpdateIndexHandler(theIndex);
Tokenizer theTokenizer = new Tokenizer(theIndexHandler);

theTokenizer.process(new Document("doc1", "this is a test"));
theTokenizer.process(new Document("doc2", "this is another test"));
theTokenizer.process(new Document("doc3", "welcome home"));
theTokenizer.process(new Document("doc4", "nothing here"));
theTokenizer.process(new Document("doc5", "nothing already"));
theTokenizer.process(new Document("doc6", "nothing already there"));

Suggester theSuggester = new TokenSequenceSuggester("nothing");
SuggestResult theResult = theIndex.suggest(theSuggester);

assertEquals(2, theResult.getSuggestions().length);
assertEquals("already", theResult.getSuggestions()[0]);
assertEquals("here", theResult.getSuggestions()[1]);
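
To illustrate what a token-sequence suggestion could compute, the sketch below collects every token that directly follows the given sequence and ranks the candidates. `SuggestSketch` is a hypothetical helper working on raw strings, not InvertLight's `TokenSequenceSuggester`. The ranking (most frequent first, ties alphabetical) is an assumption; the expected order in the example above is consistent with it but does not confirm it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical suggestion sketch; not InvertLight's implementation.
class SuggestSketch {
    // Returns the tokens that directly follow the sequence in any document.
    static List<String> suggest(List<String> documents, String... sequence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String document : documents) {
            String[] tokens = document.trim().split("\\s+");
            // i + sequence.length must stay in bounds, since that is the follower's index
            for (int i = 0; i + sequence.length < tokens.length; i++) {
                String[] window = Arrays.copyOfRange(tokens, i, i + sequence.length);
                if (Arrays.equals(window, sequence)) {
                    counts.merge(tokens[i + sequence.length], 1, Integer::sum);
                }
            }
        }
        List<String> suggestions = new ArrayList<>(counts.keySet());
        // Assumed ranking: higher count first, then alphabetical order
        suggestions.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return suggestions;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "this is a test", "this is another test", "welcome home",
                "nothing here", "nothing already", "nothing already there");
        System.out.println(suggest(docs, "nothing")); // prints [already, here]
    }
}
```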

Performance Metrics (Beta)

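| Number of Documents | Size on Disk (plain text) | Number of Tokens | Java Heap Size | Phrase Query Time |
|---------------------|---------------------------|------------------|----------------|-------------------|
| 846 | 73 MB | 383477 | 86 MB | 960 ms / 100000 queries |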
