regexr

A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.

⚠️ Experimental - API May Change

This library was created as the regex backend for splintr, an LLM tokenizer. It is highly experimental and the API may change drastically between versions.

While it passes compliance tests for industry-standard tokenizer patterns (OpenAI's cl100k_base, Meta's Llama 3), it has not been proven in production environments.

Recommended for: Research, experimentation, tokenizer development, data preprocessing.

Not recommended for: Production systems requiring stability guarantees.

Please report issues on the Issue Tracker.

🎯 When to use `regexr`

This is a specialized tool, not a general-purpose replacement.

The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.

Only use regexr if you specifically need:

- Lookarounds: You need features like (?=...), (?<=...), or (?!\S) without C dependencies.
  - Why not regex? It intentionally omits lookarounds to guarantee linear time.
  - Why not pcre2? It requires a C library and FFI.
- JIT Compilation in Pure Rust: You want native code generation for hot patterns without C dependencies.
  - Why not regex/fancy-regex? Neither offers JIT compilation.
  - Why not pcre2? It requires a C library and FFI.
- Pure Rust Dependency: You need advanced features (lookarounds, backreferences) but cannot use pcre2 due to unsafe C bindings or build complexity.
- Bounded Execution: You want ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like pcre2).
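
To make the "memoizes states" idea concrete, here is a minimal sketch of a memoized backtracking matcher. It is not regexr's implementation: the `is_match` and `match_from` functions below are hypothetical, support only literal bytes, `.`, and `*`, and match the whole input. The point is that each (pattern index, text index) pair is explored at most once, so the work is bounded by pattern length times text length even on inputs that make a naive backtracker blow up.

```rust
use std::collections::HashSet;

/// Whole-string match for patterns built from literal bytes, `.`, and `*`.
/// Illustrative only; works on bytes, so intended for ASCII input.
fn is_match(pattern: &str, text: &str) -> bool {
    // (pattern index, text index) pairs already proven to fail.
    let mut failed: HashSet<(usize, usize)> = HashSet::new();
    match_from(pattern.as_bytes(), text.as_bytes(), 0, 0, &mut failed)
}

fn match_from(
    p: &[u8],
    t: &[u8],
    pi: usize,
    ti: usize,
    failed: &mut HashSet<(usize, usize)>,
) -> bool {
    if failed.contains(&(pi, ti)) {
        return false; // Already explored: skip instead of re-running.
    }
    let matched = if pi == p.len() {
        ti == t.len()
    } else {
        let first = ti < t.len() && (p[pi] == b'.' || p[pi] == t[ti]);
        if pi + 1 < p.len() && p[pi + 1] == b'*' {
            // Either take zero repetitions, or consume one byte and stay on the star.
            match_from(p, t, pi + 2, ti, failed)
                || (first && match_from(p, t, pi, ti + 1, failed))
        } else {
            first && match_from(p, t, pi + 1, ti + 1, failed)
        }
    };
    if !matched {
        failed.insert((pi, ti));
    }
    matched
}

fn main() {
    assert!(is_match("a*b", "aaab"));
    assert!(!is_match("a.c", "ab"));
    // Eight stacked `a*` groups followed by a miss: hostile to naive backtracking,
    // but the memo table caps the number of distinct states visited.
    let text = "a".repeat(200);
    assert!(!is_match("a*a*a*a*a*a*a*a*b", &text));
    println!("all checks passed");
}
```

A timeout-based guard would abort the last check partway through; the memoized version simply finishes, because a failed state is never re-entered.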

The Problem Solved

Developers building LLM tokenizers (such as those used by GPT-4 or Llama 3) currently face a dilemma in Rust:

regex crate: Fast, safe, but lacks lookarounds and JIT compilation.
fancy-regex: Supports lookarounds, but lacks JIT compilation.
pcre2: Supports everything including JIT, but introduces unsafe C bindings and external dependencies.

regexr bridges this gap. It provides Lookarounds + JIT compilation + Backreferences while remaining 100% Pure Rust.
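
To show why lookarounds matter for these tokenizers, the sketch below hand-codes the behaviour of the `\s+(?!\S)` fragment that appears in the cl100k_base pattern. It uses only the standard library; the function name `match_ws_not_before_non_ws` is hypothetical, and `char::is_whitespace` is assumed to be a close enough stand-in for `\s`. The greedy run of whitespace backs off until the negative lookahead holds, which is what leaves a single space attached to the following word.

```rust
/// Emulates one anchored match attempt of `\s+(?!\S)` at byte offset `start`.
/// Returns the end offset of the match, or `None` if there is no match.
/// `start` is assumed to lie on a char boundary.
fn match_ws_not_before_non_ws(text: &str, start: usize) -> Option<usize> {
    let rest = &text[start..];
    // Offsets (relative to `start`) just past each leading whitespace char.
    let mut ends: Vec<usize> = Vec::new();
    for (i, c) in rest.char_indices() {
        if c.is_whitespace() {
            ends.push(i + c.len_utf8());
        } else {
            break;
        }
    }
    // Greedy: try the longest run first, then back off while the lookahead fails.
    while let Some(end) = ends.pop() {
        let next = rest[end..].chars().next();
        let followed_by_non_ws = matches!(next, Some(c) if !c.is_whitespace());
        if !followed_by_non_ws {
            return Some(start + end);
        }
    }
    None
}

fn main() {
    // Three spaces before "b": the match keeps two and leaves one for the next token.
    assert_eq!(match_ws_not_before_non_ws("a   b", 1), Some(3));
    // A single space directly before a word cannot satisfy the lookahead.
    assert_eq!(match_ws_not_before_non_ws("a b", 1), None);
    // Trailing whitespace at end of input matches in full.
    assert_eq!(match_ws_not_before_non_ws("a  ", 1), Some(3));
    println!("all checks passed");
}
```

Writing this by hand for every lookaround in a tokenizer pattern is what an engine with native lookaround support avoids.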

Installation

Add this to your Cargo.toml:

[dependencies]
regexr = "0.x"

For JIT compilation support (the full feature enables both JIT and SIMD):

[dependencies]
regexr = { version = "0.x", features = ["full"] }

Usage

Basic matching

use regexr::Regex;

let re = Regex::new(r"\w+").unwrap();
assert!(re.is_match("hello"));

// Find first match
if let Some(m) = re.find("hello world") {
    println!("Found: {}", m.as_str()); // "hello"
}

// Find all matches
for m in re.find_iter("hello world") {
    println!("{}", m.as_str());
}

Capture groups

use regexr::Regex;

let re = Regex::new(r"(\w+)@(\w+)\.(\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();

println!("{}", &caps[0]); // "user@example.com"
println!("{}", &caps[1]); // "user"
println!("{}", &caps[2]); // "example"
println!("{}", &caps[3]); // "com"

Named captures

use regexr::Regex;

let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();

println!("{}", &caps["user"]);   // "user"
println!("{}", &caps["domain"]); // "example.com"

JIT compilation

Enable JIT for patterns that will be matched many times:

use regexr::RegexBuilder;

let re = RegexBuilder::new(r"\w+")
    .jit(true)
    .build()
    .unwrap();

assert!(re.is_match("hello"));

Prefix optimization for tokenizers

For patterns with many literal alternatives (e.g., keyword matching in tokenizers):

use regexr::RegexBuilder;

let re = RegexBuilder::new(r"(function|for|while|if|else|return)")
    .optimize_prefixes(true)
    .build()
    .unwrap();

assert!(re.is_match("function"));
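
This README does not describe how optimize_prefixes works internally. As a rough illustration of one common technique for literal alternations, the sketch below uses a first-byte prefilter: positions whose byte cannot start any keyword are skipped before any full comparison is attempted. The `find_keyword` function is hypothetical and uses only the standard library.

```rust
/// Finds the leftmost position where any keyword starts, using a first-byte table
/// to skip positions that cannot begin a match. Returns (byte offset, keyword).
fn find_keyword<'a>(text: &str, keywords: &[&'a str]) -> Option<(usize, &'a str)> {
    // first_bytes[b] is true when at least one keyword starts with byte b.
    let mut first_bytes = [false; 256];
    for k in keywords {
        if let Some(&b) = k.as_bytes().first() {
            first_bytes[b as usize] = true;
        }
    }
    let bytes = text.as_bytes();
    for (pos, &b) in bytes.iter().enumerate() {
        if !first_bytes[b as usize] {
            continue; // Cheap rejection: no keyword can start here.
        }
        // Verify candidates in declaration order, like a regex alternation.
        for k in keywords {
            if !k.is_empty() && bytes[pos..].starts_with(k.as_bytes()) {
                return Some((pos, *k));
            }
        }
    }
    None
}

fn main() {
    let keywords = ["function", "for", "while", "if", "else", "return"];
    assert_eq!(find_keyword("x = if y", &keywords), Some((4, "if")));
    assert_eq!(find_keyword("for x", &keywords), Some((0, "for")));
    assert_eq!(find_keyword("nothing here", &keywords), None);
    println!("all checks passed");
}
```

Real engines typically go further (shared-prefix tries, SIMD scans for candidate bytes), but the skip-then-verify shape is the same.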

Text replacement

use regexr::Regex;

let re = Regex::new(r"\d+").unwrap();

// Replace first match
let result = re.replace("abc 123 def", "NUM");
assert_eq!(result, "abc NUM def");

// Replace all matches
let result = re.replace_all("abc 123 def 456", "NUM");
assert_eq!(result, "abc NUM def NUM");

Feature Flags

simd (default): Enables SIMD-accelerated literal search
jit: Enables JIT compilation (x86-64 and ARM64)
full: Enables both JIT and SIMD

Platform Support

Platform	JIT Support	SIMD Support
Linux x86-64	✓	✓ (AVX2)
Linux ARM64	✓	✗
macOS x86-64	✓	✓ (AVX2)
macOS ARM64 (Apple Silicon)	✓	✗
Windows x86-64	✓	✓ (AVX2)
Other	✗	✗

Build without default features for a minimal installation:

cargo build --no-default-features

Build with all optimizations:

cargo build --features "full"
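
Since the table lists AVX2 as the SIMD path on x86-64 only, it can be useful to check what the current machine supports. The sketch below uses the standard library's `is_x86_feature_detected!` macro; the `avx2_available` and `literal_search_strategy` helpers are hypothetical and say nothing about how regexr itself dispatches.

```rust
/// Reports whether AVX2 is usable on this machine (x86-64 builds only).
#[cfg(target_arch = "x86_64")]
fn avx2_available() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On other architectures the AVX2 path does not exist.
#[cfg(not(target_arch = "x86_64"))]
fn avx2_available() -> bool {
    false
}

/// Picks a label for the literal-search strategy an application might log.
fn literal_search_strategy() -> &'static str {
    if avx2_available() {
        "avx2"
    } else {
        "scalar"
    }
}

fn main() {
    println!("literal search strategy: {}", literal_search_strategy());
}
```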

Engine Selection

The library automatically selects the best execution engine based on pattern characteristics:

Non-JIT mode (default):

ShiftOr: Small patterns (≤64 states) without anchors/word boundaries
EagerDfa: Patterns with word boundaries or anchors
LazyDfa: General patterns with on-demand state construction
BacktrackingVm: Patterns with backreferences
PikeVm: Patterns with lookaround or non-greedy quantifiers

JIT mode (with jit feature):

BacktrackingJit: Patterns with backreferences
TaggedNfa: Patterns with lookaround or non-greedy quantifiers
JitShiftOr: Small patterns with alternations
DFA JIT: General patterns, benefits from SIMD prefiltering

See docs/architecture.md for details on the engine selection logic.
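
For readers unfamiliar with the Shift-Or technique named above, here is a minimal standard-library sketch for plain literal patterns. It is not regexr's ShiftOr engine, which handles more than literals; the `shift_or_find` function is hypothetical. It does show where a 64-state limit comes from: the entire matcher state lives in one u64, with one bit per pattern position.

```rust
/// Finds the first occurrence of `pattern` in `text` with the Shift-Or (bitap) algorithm.
/// Returns the starting byte offset. The pattern must be 1 to 64 bytes long so that
/// the state fits in a single u64.
fn shift_or_find(pattern: &[u8], text: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 || m > 64 {
        return None;
    }
    // masks[b] has bit i cleared exactly when pattern[i] == b.
    let mut masks = [!0u64; 256];
    for (i, &b) in pattern.iter().enumerate() {
        masks[b as usize] &= !(1u64 << i);
    }
    let accept = 1u64 << (m - 1);
    // Bit i of `state` is 0 when pattern[0..=i] matches the text ending at the current byte.
    let mut state = !0u64;
    for (pos, &b) in text.iter().enumerate() {
        // Shifting in a 0 starts a new attempt; OR-ing the mask kills mismatched attempts.
        state = (state << 1) | masks[b as usize];
        if state & accept == 0 {
            return Some(pos + 1 - m);
        }
    }
    None
}

fn main() {
    assert_eq!(shift_or_find(b"world", b"hello world"), Some(6));
    assert_eq!(shift_or_find(b"aab", b"aaab"), Some(1));
    assert_eq!(shift_or_find(b"xyz", b"hello"), None);
    println!("all checks passed");
}
```

Each input byte costs one shift, one OR, and one test, regardless of pattern length, which is why the approach suits small patterns.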

Performance

Highlights (speedup relative to the regex crate; higher is better):

Benchmark	`regexr`	`regexr-jit`	`pcre2-jit`
log_parsing	0.80-0.84x	3.91-4.09x	3.57-3.71x
url_extraction	0.81-0.83x	1.95-1.99x	2.10-2.13x
unicode_letters	1.24x	1.43-1.44x	1.65-1.72x
html_tags	0.82-0.87x	1.33-1.43x	0.80-0.85x
word_boundary	1.19-1.24x	1.15-1.19x	0.72-0.74x
email_validation	0.99-1.00x	1.00-1.11x	0.94-1.00x
alternation	0.88-1.01x	0.88-1.01x	0.12-0.15x

regexr-jit excels at log parsing (about 4x faster than regex)
regexr (non-JIT) stays close to regex on these benchmarks (roughly 0.8x to 1.24x)
Both outperform fancy-regex and pcre2 (non-JIT) consistently

Documentation

Architecture Overview - Engine architecture and selection logic
Features - Detailed feature documentation

Citation

If you use regexr in your research, please cite:

@software{regexr2025,
  author       = {Syah, Farhan},
  title        = {regexr: A Pure-Rust Regex Engine with JIT Compilation for LLM Tokenization},
  year         = {2025},
  url          = {https://github.com/farhan-syah/regexr},
  note         = {Experimental regex engine with lookaround support and JIT compilation}
}
