Constant-Delay Enumeration for Nondeterministic Document Spanners

This tool allows you to find efficiently all matches of a regular expression in a string, i.e., find all contiguous substrings of the string that satisfy the regular expression (including overlapping substrings).

The tool is being actively developed and has not been thoroughly tested yet. Use at your own risk.

It is the reimplementation of the previous Python prototype and was forked from the earlier version by Rémi Dupré.

Requirements

It has been tested and developed using rustc 1.34 and cargo 1.34. It will not work with older Rust versions shipped by some Linux distributions, e.g., with version 1.32. You can check your rust version with rustc --version, and install manually a more recent Rust version from the Rust website.

Specific library requirements can be found in Cargo.toml and Cargo.lock.

Batch Usage

The program can be used to run some benchmarks in an automated fashion. Please refer to the benchmarks folder for details.

Usage

The program can also be run manually via Cargo by specifying a regular expression with captures on the command line.

# Display all occurences of a pattern (regexp) in a file
cargo run --release -- [regexp] [file]
cat [file] | cargo run --release -- [regexp]

# For instance, this example will match 'aa@aa', 'aa@a', 'a@aa' and 'a@a'
echo "aa@aa" | cargo run --release -- ".+@.+"

# List optional parameters
cargo run -- --help

# Run unit tests
cargo test

The matches displayed correspond to all distincts substrings of the text that match the given pattern. If the pattern contains named groups, the tool will output one match for each possible assignment of the groups.

Named groups

You can define named groups as follows: (?P<group_a>a+)(?P<group_b>b+). This example will extract any group of a's followed by a group of b's.

If no named group is specified, the algorithm will implicitly assume a group named match that captures the whole expression.

If a double underscore appears in a group name, the double underscore and evrything behing is stripped. This allows to workaround a limitation in rust regexp handling, where a group name has to be unique. To use the same group name several times just use a__1, a__2, etc.

Supported Syntax for Regular Expressions

The tool supports the same syntax as the Rust's regex crate, which is specified here, except for anchors, which are not implemented yet.

Underlying Algorithm

The algorithm used by this tool is described in the research paper Constant-Delay Enumeration for Nondeterministic Document Spanners, by Amarilli, Bourhis, Mengel and Niewerth.

It has been presented at the ICDT'19 conference.

The tool will first compile the regular expression into a non-deterministic finite automaton, and then apply an enumeration algorithm. Specifically, it will first pre-process the string (without producing any matches), in time linear in the string and polynomial in the regular expression. After this pre-computation, the algorithm produces the matches sequentially, with constant delay between each match.

Name		Name	Last commit message	Last commit date
Latest commit History 194 Commits
benchmarks		benchmarks
src		src
.gitignore		.gitignore
.rustfmt.toml		.rustfmt.toml
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Constant-Delay Enumeration for Nondeterministic Document Spanners

Requirements

Batch Usage

Usage

Named groups

Supported Syntax for Regular Expressions

Underlying Algorithm

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

PoDMR/enum-spanner-rs

Folders and files

Latest commit

History

Repository files navigation

Constant-Delay Enumeration for Nondeterministic Document Spanners

Requirements

Batch Usage

Usage

Named groups

Supported Syntax for Regular Expressions

Underlying Algorithm

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages