Skip to content
No description, website, or topics provided.
HTML Shell C Other
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
benchmark
graphs
src
LICENESE.txt
Makefile
README.md
main.pdf

README.md

Regular Expression Search on Compressed Text

What is it? zearch is a regular expression engine that takes in input a regular expression and a grammar-based compressed text and returns every line of the uncompressed text containing a match for the regular expression.

Limitations

  • no match across lines
  • no support invert match option
  • only regular languages (like RE2) → e.g. no backreferences
  • no zero-width character
  • no named character classes (e.g. alnum, digit, lower,…)
  • no UTF8 support, only ASCII characters
  • no highlighted output, only matching line are reported in full

Compiling

Installing libfa

Instal the tools and libraries required to build augeas (besides the usual tools: gcc, autoconf, automake etc.)

sudo apt-get install bison, flex, readline-devel, libxml2-devel

Clone the augeas project repository.

git clone git@github.com:hercules-team/augeas.git

From the root folder run

./autogen.sh
make
sudo make install

Finally, update the library path so that zearch can find libfa

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LD_LIBRARY_PATH

Compiling Zearch

From the root folder of this repository run make zearch stats debug which generates three executables:

  • zearch is the main program.
  • stats works as zearch but produces some statistical information about memory usage.
  • debug prints a lot of information for debugging through stderr.

These three tools are invoked in the same way:

./{zearch,debug,stats} [-m] <option> <input_regex> <input_file>

where

  • -m is an optional argument. When present, the program will determinize and minimize the automaton, following the algorithm used by libfa. Note that this option does not always improves performance.
  • option can be -c, -l, -a or -b to print the number of matching lines, the matching lines, both of them or simply inform about whether there is (at least) one match or not respectively.
  • input_regex is a regular expression following the format accepted by libfa.
  • input_file is a repair-compressed file.

Comparison with other tools

We have compared the performance of zearch with the state of the art approach for regular expression matching on compressed text. Due to the limited functionality of our tool, the comparison only considers the running time required by these tools to report the number of lines in the original file containing a match. For the experiments the state of the art is represented by the following command

{lz4,zstd} -dc compressed_file | {grep,rg,hyperscan} -c regex

The experiments show that our tool outperforms the state of the art, even though decompression and search are done in parallel. The detailed results of this comparison are available here

You can’t perform that action at this time.