markusfurtlehner/ReperioNet

ReperioNet

Embedded, fuzzy, multilingual full-text search for .NET, built on SQLite FTS5. Pure-managed, cross-platform (Windows / macOS / Linux / Android / iOS), no server, no native build steps.

ReperioNet builds and queries a full-text search index over SQLite FTS5 with a tiny, friendly async API. It is generic and domain-agnostic: you supply text plus a strongly typed metadata payload, and searches return that metadata plus a relevance score. Fuzzy search covers typo tolerance (fuzzy re-ranking), substring/partial matching (trigram index), and spelling variants and word forms (stemming + phonetic codes). It is multilingual out of the box for the European languages.

Quick start

using ReperioNet;
using ReperioNet.Languages.All;
using ReperioNet.LanguageDetection;

var index = await SearchIndex<DocMeta>.OpenAsync("index.db", o =>
{
    o.MetadataTypeInfo = AppJsonContext.Default.DocMeta;   // source-generated, required (AOT-safe)
    o.AddAllEuropeanLanguages();                           // ReperioNet.Languages.All
    o.LanguageDetector = new NTextCatDetector();           // ReperioNet.LanguageDetection (optional)
});

// doc, text and name come from your own document source
await index.AddAsync(new SearchEntry<DocMeta>(
    Id: doc.Path,
    Content: text,
    Metadata: new DocMeta(doc.Path, name)));

var hits = await index.SearchAsync("müler rechnng");       // typo-tolerant, multilingual
foreach (var h in hits)
    Console.WriteLine($"{h.Score:F2}  {h.Metadata.FileName}");

TMeta serialization uses System.Text.Json source generation — supply a JsonTypeInfo<TMeta> from your own JsonSerializerContext. There is no reflection fallback; this keeps the library trimming- and AOT-clean (iOS/MAUI).
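The quick start references AppJsonContext and DocMeta without defining them. A minimal sketch of what they could look like follows, using the standard System.Text.Json source-generation pattern. The shape of DocMeta (and the name of its first property, Path) is an assumption inferred from the quick start, and MetadataRoundTrip is a hypothetical helper that only demonstrates the generated type info works without reflection.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical metadata shape, inferred from the quick start: two positional
// values, with FileName read back from search hits. "Path" is an assumed name.
public sealed record DocMeta(string Path, string FileName);

// The System.Text.Json source generator completes this partial class at compile
// time and exposes AppJsonContext.Default.DocMeta as a JsonTypeInfo<DocMeta>.
[JsonSerializable(typeof(DocMeta))]
public partial class AppJsonContext : JsonSerializerContext
{
}

public static class MetadataRoundTrip
{
    // Serializes and deserializes using only the generated type info,
    // so no reflection-based serializer path is involved.
    public static DocMeta RoundTrip(DocMeta meta)
    {
        string json = JsonSerializer.Serialize(meta, AppJsonContext.Default.DocMeta);
        return JsonSerializer.Deserialize(json, AppJsonContext.Default.DocMeta)!;
    }
}
```

The value passed as o.MetadataTypeInfo in the quick start would then be AppJsonContext.Default.DocMeta.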

Packages

| Package | Contents |
| --- | --- |
| ReperioNet | Core engine: SearchIndex<TMeta>, SQLite FTS5 schema, trigram recall, fuzzy re-ranking, snippets |
| ReperioNet.Languages.De … .Tr | One pack per language: vendored Snowball stemmer + stop words (German adds Kölner Phonetik, English adds Double Metaphone) |
| ReperioNet.Languages.All | Meta-package: AddAllEuropeanLanguages() registers all fifteen packs |
| ReperioNet.LanguageDetection | NTextCatDetector (ILanguageDetector backed by NTextCat, Core14 profile bundled) |

Supported language packs: German, English, French, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Russian, Hungarian, Romanian, Turkish. Unknown or undetected languages fall back to an identity analyzer — search still works on the base token stream.

How search works

For each document ReperioNet indexes three token streams (base, stems, phonetic codes) in one FTS5 table plus an optional trigram table for substring/typo recall. A query gathers candidates from all of them, merges by best bm25 rank, re-ranks the bounded candidate pool with fuzzy similarity (0.6 * fuzzy + 0.4 * normalized bm25, plus an exact-substring boost), then applies MinScore, paging and optional <mark>-style snippets. Scores are normalized to 0..1, higher is better.
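The re-ranking blend quoted above can be sketched as a small pure function. Only the 0.6 / 0.4 weights and the 0..1 output range come from the text; the size of the exact-substring boost (0.1 here), the clamping, and the names ScoreBlend and Combine are assumptions made for illustration, not the library's actual implementation.

```csharp
using System;

public static class ScoreBlend
{
    // Weights quoted in the README text.
    private const double FuzzyWeight = 0.6;
    private const double Bm25Weight = 0.4;

    // fuzzy and normalizedBm25 are expected in 0..1 (higher is better).
    // substringBoost is an assumed value; the README does not state its size.
    public static double Combine(
        double fuzzy,
        double normalizedBm25,
        bool containsExactSubstring,
        double substringBoost = 0.1)
    {
        double blended =
            FuzzyWeight * Math.Clamp(fuzzy, 0.0, 1.0) +
            Bm25Weight * Math.Clamp(normalizedBm25, 0.0, 1.0);

        if (containsExactSubstring)
        {
            blended += substringBoost;
        }

        // Keep the final score in the documented 0..1 range.
        return Math.Clamp(blended, 0.0, 1.0);
    }
}
```

Under these assumptions, a candidate with a strong fuzzy match but a weak bm25 rank can still outrank one that matched only on keywords, which is the point of weighting fuzzy similarity more heavily.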

Multi-token queries default to all-terms (AND) semantics (SearchQueryOptions.TermMatch): documents must contain every base term. This matches the common user intent and is far cheaper to rank than OR, because FTS5 must bm25-score every matching row and the AND intersection is much smaller than the OR union. When the strict pass yields fewer candidates than Limit, an any-term pass widens recall automatically; fallback hits always rank after all-terms hits. Stem/phonetic variant matching and trigram substring recall keep their OR semantics in the fallback, so inflections and typos are still caught. Set TermMatch = TermMatch.AnyTerms for the widest recall up front.
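The strict-then-fallback behaviour can be illustrated with an in-memory sketch. This is not how ReperioNet queries FTS5; it only models the ordering rule (all-terms hits first, any-term hits appended when the strict pass comes up short). TermMatchSketch is a hypothetical name, and ordering by id stands in for real bm25 ranking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermMatchSketch
{
    // docs maps a document id to its set of base terms.
    // Returns up to 'limit' ids: all-terms hits first, then any-term fallback hits.
    public static List<string> Search(
        IReadOnlyDictionary<string, HashSet<string>> docs,
        IReadOnlyCollection<string> queryTerms,
        int limit)
    {
        // Strict pass: every query term must be present (AND).
        List<string> strict = docs
            .Where(d => queryTerms.All(t => d.Value.Contains(t)))
            .Select(d => d.Key)
            .OrderBy(id => id, StringComparer.Ordinal) // stand-in for bm25 order
            .ToList();

        if (strict.Count >= limit)
        {
            return strict.Take(limit).ToList();
        }

        // Fallback pass: any query term is enough (OR), excluding strict hits.
        var strictIds = new HashSet<string>(strict, StringComparer.Ordinal);
        IEnumerable<string> fallback = docs
            .Where(d => !strictIds.Contains(d.Key)
                        && queryTerms.Any(t => d.Value.Contains(t)))
            .Select(d => d.Key)
            .OrderBy(id => id, StringComparer.Ordinal);

        // Fallback hits are appended, so they always rank after strict hits.
        return strict.Concat(fallback).Take(limit).ToList();
    }
}
```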

Index profiles

Two named presets capture the benchmark-derived recommendations (see benchmarks/RESULTS.md); both are chainable extension methods on ReperioOptions<TMeta>:

o.UseDesktopProfile();   // = the defaults: trigram + stored content + phonetic, unbounded text
o.UseMobileProfile();    // trigram off, stop words removed, MaxContentChars = 4000
  • UseDesktopProfile(): full fidelity. Mid-word substring search (trigram), snippets, phonetic variants. Cost: the largest database (~4.4× raw content on the benchmark corpus, roughly half of it the trigram index) and the slowest indexing.
  • UseMobileProfile(): the size- and battery-conscious choice. Dropping the trigram index is the one change that improves database size (~4.4× → ~2× raw content), query latency and indexing throughput together. StoreContent stays on because it costs nothing in size: with content off, the same text is kept in rank_text for fuzzy re-ranking anyway, so turning it off saves no space and only loses snippets. Phonetic codes stay on (cheap, valuable for name variants), and stop words are removed from the stem/phonetic streams to trim common-term cost. Because rank_text sets a floor on database size, MaxContentChars = 4000 is the lever that actually shrinks long bodies; treat it as a starting default to tune. Lost: mid-word substring search. Kept: typo tolerance (fuzzy re-ranking over content/rank_text), word forms (stemming), phonetic variants, short-query prefix matching, snippets.
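To make concrete what a trigram index buys (and why dropping it loses mid-word substring search), here is a sketch of the general trigram technique. It is an illustration only, not ReperioNet's or FTS5's implementation; TrigramSketch and its methods are hypothetical names, and it works on UTF-16 chars rather than full Unicode code points.

```csharp
using System;
using System.Collections.Generic;

public static class TrigramSketch
{
    // Returns the set of all 3-character windows of the lower-cased text.
    public static HashSet<string> Trigrams(string text)
    {
        var grams = new HashSet<string>(StringComparer.Ordinal);
        string s = text.ToLowerInvariant();
        for (int i = 0; i + 3 <= s.Length; i++)
        {
            grams.Add(s.Substring(i, 3));
        }
        return grams;
    }

    // Candidate test: every trigram of the query occurs in the document.
    // This is necessary but not sufficient for a true substring match,
    // so a real engine still verifies candidates afterwards.
    public static bool MayContain(string document, string query)
    {
        HashSet<string> queryGrams = Trigrams(query);
        return queryGrams.Count > 0 && queryGrams.IsSubsetOf(Trigrams(document));
    }
}
```

A query such as "chnun" has no word boundary to anchor on, yet its trigrams all occur inside "Rechnung", which is why a trigram table can recall mid-word matches that a word-token index cannot. Every document contributes one entry per 3-character window, which is consistent with the trigram index dominating database size.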

The flags these presets set are persisted layout flags: reopening an existing database with a different profile throws ReperioException (no silent rebuild) — open with the original options and call RebuildAsync() after changing flags, or start a new database file.

Options worth knowing

  • StoreContent (default on): stores one copy of the content — enables snippets and full-text fuzzy re-ranking.
  • EnableTrigram / EnableStemming / EnablePhonetic (default on): layout-affecting flags, persisted in the index; reopening with different values throws — call RebuildAsync() to migrate.
  • RemoveStopWords (default off): strips stop words from the stem/phonetic streams only, never from base.
  • MaxContentChars (default 0 = unbounded): caps the indexed text length.
  • SearchQueryOptions: Limit/Offset, MinScore, EnableFuzzy, EnablePhonetic, Language, IncludeSnippet, CandidatePoolSize.

Bulk indexing

AddRangeAsync indexes the whole batch in one transaction and is heavily optimized for large loads: SQLite writes stay on the single dedicated write connection (SQLite is single-writer by nature), but the text analysis — tokenization, stemming, phonetic encoding, metadata serialization — runs in parallel across CPU cores ahead of the writer, with stem/phonetic results memoized for the duration of the batch. During a bulk batch the write connection also runs with a temporarily enlarged page cache (32 MiB) and a raised WAL checkpoint threshold, both restored when the batch completes. Entries are written strictly in input order (for duplicate ids the last one wins), and an invalid entry rolls back the entire batch.

One contract follows from this: IStemmer, IPhoneticEncoder, IStopWordFilter and ILanguageDetector implementations must be thread-safe (all bundled implementations are; they keep no mutable state).

Practical tips for big loads: prefer AddRangeAsync over many single AddAsync calls, pass batches of ~10k–50k entries per call so you can report progress between batches, call OptimizeAsync() once at the end, and consider the index-layout options described above (Index profiles, Options worth knowing). The trigram index is the dominant cost in both indexing time and database size.
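The batching advice can be sketched as a small generic helper. To stay self-contained it takes delegates instead of a SearchIndex<TMeta>; BulkLoad and IndexInBatchesAsync are hypothetical names, and the delegate signatures are assumptions about how AddRangeAsync and OptimizeAsync would be wrapped, not the library's documented signatures.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BulkLoad
{
    // Feeds entries to addRangeAsync in fixed-size batches, reports the running
    // total after each batch, and calls optimizeAsync exactly once at the end.
    public static async Task<int> IndexInBatchesAsync<T>(
        IEnumerable<T> entries,
        int batchSize,
        Func<IReadOnlyList<T>, Task> addRangeAsync,
        Func<Task> optimizeAsync,
        IProgress<int>? progress = null)
    {
        int total = 0;

        // Enumerable.Chunk (.NET 6+) yields arrays of at most batchSize items,
        // preserving input order.
        foreach (T[] batch in entries.Chunk(batchSize))
        {
            await addRangeAsync(batch);
            total += batch.Length;
            progress?.Report(total);
        }

        await optimizeAsync();
        return total;
    }
}
```

With a real index, the two delegates would presumably wrap index.AddRangeAsync and index.OptimizeAsync. Since each AddRangeAsync call is its own transaction, an invalid entry would then roll back only the batch it belongs to, not the whole load.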

Benchmarks

benchmarks/ReperioNet.Benchmark is a combined scale smoke test and benchmark. It generates deterministic email-like documents (~860 bytes, en/de/fr mix), bulk-indexes them, asserts correctness at scale (needle recall and ranking, stemming, phonetic, substring, snippets, CRUD consistency), then measures a search-latency battery (cold + p50/p95/max), concurrent throughput, single-document mutation latency, peak process memory and raw-content-vs-database size.

# single run with 100,000 docs (defaults when flags are omitted: 1,000,000 docs, profile "full")
dotnet run -c Release --project benchmarks/ReperioNet.Benchmark -- --docs 100000

# full matrix: 4 index-layout profiles x 3 simulated device classes -> benchmarks/RESULTS.md
benchmarks/run-matrix.sh 100000

Index-layout profiles (--profile): full (trigram + stemming + phonetic + stored content — best recall, biggest database), no-trigram (drops substring recall), compact (additionally stores no content copy — no snippets, fuzzy re-ranks on rank_text), smallest (additionally no phonetic codes, stop words removed from the stem stream). Device classes are simulated by restricting CPU affinity (taskset): desktop (all cores), fast phone (4 cores), slow phone (2 cores) — note this models reduced parallelism, not slower silicon or flash storage. Results, including the exact CPU, memory consumption and db/content ratios for every combination, are checked in at benchmarks/RESULTS.md.

Operational notes

  • Local storage only: the index uses SQLite WAL journaling, which is unsafe on network file systems (SMB/NFS). Keep the database file on a local disk.
  • One SearchIndex<TMeta> instance per database file per process. All writes are serialized internally; reads run concurrently — you will never see SQLITE_BUSY.
  • Minimum SQLite 3.43.0 with FTS5 — satisfied by the bundled SQLitePCLRaw.bundle_e_sqlite3 engine (the only native artifact, prebuilt for all target platforms including iOS/Android).
  • ReperioNet and the language packs are trimming/AOT-clean (net8.0). ReperioNet.LanguageDetection depends on NTextCat (an unannotated netstandard2.0 library without a formal trim-compatibility guarantee); in practice the detection path publishes with zero IL warnings under both PublishTrimmed and Native AOT and works in the resulting binary.

Samples

  • samples/ReperioNet.Sample.Console — end-to-end demo; also serves as the trimmed-publish AOT smoke test (dotnet publish -c Release -r <rid> --self-contained -p:PublishTrimmed=true).
  • samples/ReperioNet.Sample.Maui — minimal .NET MAUI app (iOS/Android) exercising the index on device; build with the MAUI workloads installed (not part of the main solution).

License

MIT — see LICENSE. The language packs contain C# ports of the Snowball stemming algorithms (BSD 3-clause, © Dr Martin Porter / Richard Boulton) and other derived material; see THIRD-PARTY-NOTICES.md for the full notices.
