TriLite, An Inverted Trigram Index for Accelerated String Matching in Sqlite
TriLite is an experimental Sqlite extension for creating an inverted trigram index to accelerate string matching queries, such as regular expressions and substring matching.
A trigram is essential any substring of length 3, the inverted index maps from trigrams, to a list of document ids for documents containing the given trigram. When querying for a substring, trigrams is extracted from the substring, and documents id lists for the trigrams is fetched and merged, after which only documents containing all the trigrams are matched against the substring and returned if considered a result.
TriLite sources is unless otherwise noted released into public domain.
TriLite is under active development and currently supports the bare minimum to
make it usable for static databases, ie.
INSERT and accelerated substring and
regular expression matching.
TriLite still has a lot of short comings, and this code is a work in progress.
To sum this up do NOT use TriLite in a production environment, unless you
are me. Currently TriLite can do a few tricks as outlined in
which you can pipe to
sqlite3 after building TriLite.
Things To Do
- Add reference counted regular expression -> compiled expression cache to vtable
- Add rowid, column -> first match cache to regular expression cache entries
- Support multiple columns of different types
- Alias rowid, with PRIMARY INTEGER KEY
- Auxiliary non-indexed columns (useful for storing extents and file id)
- Indexed text columns (useful for tname and tqualname)
- Handle extents for expressions with the empty string in the language
- Limit number of extents returned per file by queries
- Facilitate update/delete
- Use a hash set to optimize extraction of unique trigrams
- Optimize temporary hash table
- Query documents inserted in the same transaction, ie. before commit as currently required
- Table specific options, configurable at runtime not compile time
- Forbid full table scan with MATCH using regular expressions
- Max pending insert cache (bytes)
- Max pending insert list
- Max pending delete cache (bytes)
- Max pending delete list
- Handle LIKE/GLOB patterns (ie. case (in)sensitive ordered substrings)
- Write Documentation
TriLite would not been as fast without these awesome projects.
Other projects with similar aim,