skimmatch

skimmatch is an in-process fzf/skim-style fuzzy finder for Python, implemented in Rust.

It is designed for ranked abbreviation matching over a fixed list of candidate strings. You give it strings such as filenames, references, titles, symbols, or command labels; users type short abbreviation-style queries; skimmatch returns the best candidates, scores, and optional highlight positions.

from skimmatch import Matcher

candidates = [
    "Follmer and Schied, Stochastic Finance, 2011",
    "Mildenhall and Major, Pricing Insurance Risk",
    "Wang distortion risk measures",
    "Archive reference catalogue",
]

matcher = Matcher(candidates)

for result in matcher.search("wang distortion", limit=3):
    print(result)

Example result:

{
    "index": 2,
    "score": 260,
    "text": "Wang distortion risk measures",
    "matches": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
}

Scores are backend-specific, and higher is better. Treat the numeric value as ranking information only; it is not a stable metric across versions.

What This Is

skimmatch solves the same broad problem as interactive fuzzy finders such as fzf and skim: finding good abbreviation matches quickly.

For example, a query like:

fs sf 2011

can match:

Follmer and Schied, Stochastic Finance, 2011

because the query characters and tokens appear in useful positions and in the right order.
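The in-order idea can be illustrated with a toy subsequence check. This is only a sketch: `subsequence_positions` is a hypothetical helper, not part of skimmatch, and it simply takes the leftmost occurrence of each character instead of scoring alignments the way the real backends do.

```python
def subsequence_positions(token, text):
    """Return leftmost in-order positions of token's characters in text, or None.

    Toy illustration only (not skimmatch's algorithm). Assumes lowercasing
    does not change the string length, which holds for ASCII input.
    """
    lowered = text.lower()
    positions = []
    start = 0
    for ch in token.lower():
        idx = lowered.find(ch, start)
        if idx == -1:
            return None  # a query character is missing or out of order
        positions.append(idx)
        start = idx + 1
    return positions


candidate = "Follmer and Schied, Stochastic Finance, 2011"
for token in "fs sf 2011".split():
    # Each whitespace-separated token is checked on its own.
    print(token, subsequence_positions(token, candidate))
```

A real matcher would consider many possible alignments and prefer those that land on word boundaries; the sketch only shows why every token of the query can be found in the candidate.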

This is different from edit-distance fuzzy matching. Libraries such as RapidFuzz, Levenshtein, or token-ratio matchers are excellent for typo correction, deduplication, OCR cleanup, and record linkage. skimmatch is aimed at fast candidate selection, interactive search, and highlightable abbreviation matching.
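To make the contrast concrete, here is a small sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in for a similarity-ratio matcher. It is not one of the libraries named above and not part of skimmatch. A similarity ratio rates a typo highly but rates an abbreviation query poorly, because the query is much shorter than the candidate.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (stand-in for a ratio matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


candidate = "Follmer and Schied, Stochastic Finance, 2011"

# A typo is close to the intended string, so the ratio is high.
typo_ratio = similarity("Stochastic Finnace", "Stochastic Finance")

# An abbreviation query shares few characters relative to the combined length,
# so the ratio is low even though it is a good abbreviation of the candidate.
abbrev_ratio = similarity("fs sf 2011", candidate)

print(f"typo: {typo_ratio:.2f}  abbreviation: {abbrev_ratio:.2f}")
```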

Features

  • In-process Python extension: no external fzf executable required.
  • Rust matching backends using SkimMatcherV2, nucleo-matcher, and frizbee.
  • Preloaded candidate lists for fast repeated queries.
  • Single-token and multi-token search modes.
  • Optional highlight indices for UI rendering.
  • Legacy tuple-returning APIs for compatibility with the earlier rustfuzz shape.
  • Structured Matcher.search(...) API for new code.
  • A backend argument that is already part of the API, so future backends can be added without changing the public matcher classes.

Installation

When published on PyPI:

pip install skimmatch

From a local checkout:

uv pip install -e .

or build with maturin:

uv run maturin develop

The current package metadata targets Python 3.13 or newer.

Quick Start

Use Matcher for new code.

from skimmatch import Matcher

candidates = [
    "Buhlmann, Mathematical Methods in Risk Theory",
    "Cramer, Collective Risk Theory",
    "Mildenhall and Major, Pricing Insurance Risk",
    "Kaas, Goovaerts, Dhaene, and Denuit, Modern Actuarial Risk Theory",
]

matcher = Matcher(candidates)
results = matcher.search("risk theory", limit=5)

for result in results:
    print(result["index"], result["score"], result["text"])

By default, search:

  • splits the query on whitespace;
  • requires every query token to match;
  • returns up to 20 results;
  • includes candidate text;
  • includes highlight positions.

Structured API

matcher = Matcher(candidates, backend="nucleo", threads=None)  # or "skim" or "frizbee"
results = matcher.search(
    query,
    limit=20,
    highlights=True,
    include_text=True,
    multi=True,
)

Each result is a dictionary containing:

{
    "index": 0,         # original candidate index
    "score": 123,       # backend score, higher is better
    "text": "...",      # included when include_text=True
    "matches": [0, 3],  # included when highlights=True
}

Parameters

query

The search string. In multi-token mode, whitespace-separated tokens are matched independently and every token must match the candidate.

limit

The maximum number of results to return. limit=0 returns an empty list.

highlights

When true, results include matches, a sorted and deduplicated list of matched positions. Turn this off when you only need ranking; score-only matching does less work.
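As a sketch of how the `matches` list might be used for UI rendering, the hypothetical helper below wraps each run of matched positions in brackets. It assumes the positions are character indices into `text`; that is an assumption of this example, not something the API description above guarantees.

```python
def render_highlights(text, matches, open_mark="[", close_mark="]"):
    """Wrap each run of consecutive matched positions in open/close markers.

    Hypothetical helper; assumes `matches` holds character indices into `text`.
    """
    marked = set(matches)
    out = []
    inside = False
    for i, ch in enumerate(text):
        if i in marked and not inside:
            out.append(open_mark)
            inside = True
        elif i not in marked and inside:
            out.append(close_mark)
            inside = False
        out.append(ch)
    if inside:
        out.append(close_mark)  # close a run that reaches the end of the text
    return "".join(out)


result = {
    "text": "Wang distortion risk measures",
    "matches": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
}
print(render_highlights(result["text"], result["matches"]))
```

The markers can be swapped for ANSI escape codes or HTML tags, depending on where the results are displayed.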

include_text

When true, each result includes the original candidate string. Turn this off if you already have the candidate list and want smaller result objects.

multi

When true, the query is split on whitespace and all tokens are required. When false, the whole query is sent to the matcher as one pattern.
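The difference between the two modes can be sketched with a toy in-order matcher. The functions below are hypothetical and only mimic the splitting rule described above; how a real backend treats a space inside a single pattern may differ.

```python
def is_subsequence(pattern, text):
    """True if pattern's characters appear in text in order (case-insensitive)."""
    chars = iter(text.lower())
    # `ch in chars` consumes the iterator, which enforces left-to-right order.
    return all(ch in chars for ch in pattern.lower())


def toy_match(query, text, multi=True):
    """Toy model of the multi flag: split on whitespace, or keep one pattern."""
    patterns = query.split() if multi else [query]
    return all(is_subsequence(p, text) for p in patterns)


text = "Cramer, Collective Risk Theory"

# Tokens are checked independently, so their order in the query does not matter.
print(toy_match("theory risk", text, multi=True))

# As a single pattern, the characters (including the space) must appear in order.
print(toy_match("theory risk", text, multi=False))
```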

threads

Constructor option. For backend="nucleo" and backend="frizbee", threads=None uses all available cores. Pass threads=1 for single-threaded matching, or any positive integer to cap the number of worker threads. The skim backend currently ignores this option.

Legacy APIs

The package also exports compatibility classes with tuple return shapes:

from skimmatch import FuzzyMatcher, FuzzyMatcherMulti, FuzzyMatcherMultiHi

FuzzyMatcher

Treats the whole query as one pattern.

matcher = FuzzyMatcher(candidates, threads=None)
indices, scores = matcher.query("sf", top_k=10)

FuzzyMatcherMulti

Splits the query on whitespace. Every token must match.

matcher = FuzzyMatcherMulti(candidates)
indices, scores = matcher.query("pricing insurance", top_k=10)

FuzzyMatcherMultiHi

Like FuzzyMatcherMulti, but also returns highlight positions.

matcher = FuzzyMatcherMultiHi(candidates)
indices, scores, highlights = matcher.query("pricing insurance", top_k=10)
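When migrating from the tuple-returning classes, it can help to reshape their output into the dictionary form used by Matcher.search(...). The sketch below assumes the returned sequences are parallel (one entry per result, in the same order); `to_structured` is a hypothetical helper and the sample values are made up for illustration.

```python
def to_structured(candidates, indices, scores, highlights=None):
    """Convert parallel legacy sequences into a list of result dictionaries.

    Hypothetical helper; assumes indices, scores, and highlights line up
    position by position.
    """
    results = []
    for pos, (index, score) in enumerate(zip(indices, scores)):
        result = {"index": index, "score": score, "text": candidates[index]}
        if highlights is not None:
            result["matches"] = list(highlights[pos])
        results.append(result)
    return results


candidates = ["Pricing Insurance Risk", "Collective Risk Theory"]

# Illustrative values only, shaped like a (indices, scores, highlights) tuple.
indices, scores, highlights = [0], [100], [[0, 1, 2]]

for result in to_structured(candidates, indices, scores, highlights):
    print(result)
```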

Matching Behavior

The available backends are:

backend="skim"
backend="nucleo"
backend="frizbee"

backend="skim" uses SkimMatcherV2 from the Rust fuzzy-matcher crate and is kept for compatibility.

backend="nucleo" is the default. It uses nucleo-matcher, the lower-level matcher from the nucleo ecosystem, a modern fzf-like matcher that may rank candidates differently from skim. Scores are backend-specific and should not be compared between backends.

backend="frizbee" uses frizbee, a SIMD matcher with typo-resistant matching support. skimmatch currently runs it with typo tolerance disabled for a closer comparison with the other fzf-style backends. It matches against bytes, so highlight lists are intentionally empty for this backend until Unicode offset semantics are defined.
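The byte-versus-character issue is easy to see in plain Python. This sketch does not use frizbee or skimmatch; it only shows how a byte offset and a character index diverge once the text contains a multi-byte character.

```python
text = "F\u00f6llmer"  # "Föllmer": the second character takes two bytes in UTF-8
data = text.encode("utf-8")

char_index = text.index("m")   # position counted in characters
byte_index = data.index(b"m")  # position counted in UTF-8 bytes

# One possible conversion: decode the prefix and count its characters.
# This is valid only when the byte offset falls on a character boundary.
converted = len(data[:byte_index].decode("utf-8"))

print(char_index, byte_index, converted)
```

A UI that highlights by character index would mark the wrong character if it were handed raw byte offsets for non-ASCII text.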

Scoring tends to reward:

  • characters appearing in order;
  • compact alignments;
  • word-boundary matches;
  • punctuation-separated and camel-case transitions;
  • early matches;
  • consecutive query-character matches.

In multi-token mode, a candidate must also match every query token to be returned at all.

skimmatch returns candidates sorted by descending score. Ties are ordered by the original candidate index for deterministic output.
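The ordering rule can be written as a sort key. This is a plain-Python sketch of the rule as stated, not the extension's Rust implementation; `rank` is a hypothetical name.

```python
import heapq


def rank(scored, limit):
    """Return the top `limit` (index, score) pairs.

    Higher scores come first; equal scores fall back to the lower
    original index, so the output is deterministic.
    """
    return heapq.nsmallest(limit, scored, key=lambda pair: (-pair[1], pair[0]))


scored = [(0, 90), (1, 120), (2, 90), (3, 120)]
print(rank(scored, 3))
```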

When To Use It

skimmatch is a good fit for:

  • command palettes;
  • file pickers;
  • bibliography and reference search;
  • symbol search;
  • autocomplete over known labels;
  • terminal or web UI candidate selection;
  • fast repeated queries over a preloaded list.

It is probably not the right tool for:

  • typo correction;
  • deduplication;
  • record linkage;
  • token-sort similarity;
  • OCR cleanup;
  • semantic search;
  • embedding-based retrieval.

Those are useful problems, but they are different from fzf/skim-style abbreviation matching.

Performance Notes

Candidate strings are copied into Rust once when the matcher is constructed. Repeated calls to query or search scan that Rust-owned list and return only the final top results to Python.

For best performance:

  • construct one matcher and reuse it across queries;
  • set highlights=False when you only need indices and scores;
  • set include_text=False when you already have the candidate strings;
  • use limit to keep returned result objects small.
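A sketch of the score-only pattern from the list above: request neither text nor highlights, then map indices back to the list you already hold. `search_labels` is a hypothetical helper, and `StubMatcher` is a stand-in that mimics the documented search(...) keywords with plain substring checks, so the example runs without the extension installed.

```python
class StubMatcher:
    """Stand-in for a matcher object; uses substring containment, not fuzzy matching."""

    def __init__(self, candidates):
        self._candidates = list(candidates)

    def search(self, query, limit=20, highlights=True, include_text=True, multi=True):
        tokens = query.lower().split() if multi else [query.lower()]
        hits = []
        for index, text in enumerate(self._candidates):
            if all(token in text.lower() for token in tokens):
                result = {"index": index, "score": len(tokens)}
                if include_text:
                    result["text"] = text
                if highlights:
                    result["matches"] = []
                hits.append(result)
        return hits[:limit]


def search_labels(matcher, candidates, query, limit=10):
    """Return (label, score) pairs, requesting only indices and scores."""
    results = matcher.search(query, limit=limit, highlights=False, include_text=False)
    return [(candidates[r["index"]], r["score"]) for r in results]


candidates = [
    "Buhlmann, Mathematical Methods in Risk Theory",
    "Cramer, Collective Risk Theory",
    "Mildenhall and Major, Pricing Insurance Risk",
    "Kaas, Goovaerts, Dhaene, and Denuit, Modern Actuarial Risk Theory",
]
matcher = StubMatcher(candidates)  # construct once, reuse for every query
pairs = search_labels(matcher, candidates, "risk theory", limit=5)
for label, score in pairs:
    print(score, label)
```

With the real package, the same helper should work when passed a Matcher built once from the same candidate list.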

Development

This project is a Python package with a Rust extension built by maturin.

Run the tests:

uv run pytest tests/test_skimmatch.py -q

Check Rust formatting:

cargo fmt --check

Important files:

  • src/lib.rs: Rust/PyO3 extension implementation.
  • python/skimmatch/__init__.py: Python re-exports.
  • tests/test_skimmatch.py: API and behavior tests.
  • pyproject.toml: Python packaging and maturin configuration.
  • Cargo.toml: Rust crate configuration.

Backend Roadmap

The public API accepts a backend argument. Today "skim", "nucleo", and "frizbee" are implemented. frizbee is experimental and currently exposes score/ranking behavior without highlight positions.

Unknown backend names currently raise ValueError.
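If the backend name comes from user configuration, it may be convenient to validate it before constructing a matcher. The helper below is a hypothetical sketch that mirrors the three names listed above; it is not part of skimmatch and would need updating if backends are added.

```python
KNOWN_BACKENDS = ("skim", "nucleo", "frizbee")


def validate_backend(name):
    """Return the backend name unchanged, or raise ValueError if it is unknown."""
    if name not in KNOWN_BACKENDS:
        allowed = ", ".join(KNOWN_BACKENDS)
        raise ValueError(f"unknown backend {name!r}; expected one of: {allowed}")
    return name


print(validate_backend("nucleo"))

try:
    validate_backend("fzf")
except ValueError as exc:
    print(exc)
```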

License

MIT.
