Hpipe is the acronym of "High Performance Incremental Parser Engine".
It generates optimized code to evaluate regular expressions with embedded actions (arbitrary code, that can be defined by the user).
Incremental means that it works with streams: the crunching of the incoming data can be stopped/restarted at any point, with proper and automated handling of buffers (with reference counter or other techniques,depending on what the user wants). It is a prerequisite for different types of data coming from the network or for instance from huge files that do not fit into memory. It is the first difference compared to tools like for instance re2c which assumes that data come once, in a contiguous buffer.
Besides, all other things being equal,
- Hpipe is designed for performance. Hpipe supports training, anticipation and zero copy for most of the buffer styles, tries to read the data only once, etc...
- Hpipe's syntax and semantic focuses on security and clarity (over compactness). For instance, user codes are executed only if the paths fully succeed (not the case for ragel), escape sequences are avoided when possible, etc...
- What is Hpipe ?
- Table of contents
- Show me an example
- Installation
- A word on performance
- Syntax
- Important flags
- Output file
The following code is an example of a "machine" that sums the last digit of each 'bar[0-9]', and that counts the number of 'foo' and 'bar' not followed by a digit (which can be EOF or the beginning of something else...).
main = (
( 'bar' '0' .. '9' { a += *data - '0'; } ) |
( 'bar' { b += 1; } ) |
( 'foo' { c += 1; } ) |
any_utf8
)**
If the input is "bar5", bar += 1
is not going to be executed: a user code is executed only if the belonging path succeeds and it won't be the case for ( 'bar' { b += 1; } )
because there is a priority for ( 'bar' '0' .. '9' { a += *data - '0'; } )
.
It produces code like
l_2:
// as in the Boyer-Moore algorithm, hpipe may decide to test several
// chars ahead if it leads to reduction of the overall execution time
if ( data + 3 > end_m1 ) goto l_1;
data += 3;
// partition and ordering of tests depend on the result of "training"
// (see the paragraph on performance)
if ( data[ 0 ] == 'a' ) goto l_19;
if ( data[ 0 ] != 'o' ) goto l_20;
if ( data[ -1 ] != 'o' ) goto l_21;
if ( data[ -2 ] != 'f' ) goto l_2;
l_3:
{ c += 1; }
goto l_2;
l_21:
if ( data[ -1 ] != 'f' ) goto l_2;
l_6:
// end of the buffer (different styles are possible)
if ( data >= end_m1 ) goto c_2;
++data;
l_22:
if ( data[ 0 ] == 'o' ) goto l_3;
...
c_7:
// interruption of the data is fully handled (including in the cases
// where one needs to go backward, ...)
if ( last_buf ) goto l_11;
hpipe_data->inp_cont = &&e_7;
return RET_CONT;
e_7:
if ( data > end_m1 ) goto c_7;
goto l_16;
# support function. res is a function argument
read_dec[ res ] =
( digit { res = *data - '0'; } )
( digit { res = 10 * res + *data - '0'; } )**
number_flt =
read_dec[ "nfl" ]
# in this example, '.' is mandatory for floating point numbers
'.' { mul = 1; } ( digit { nfl += ( mul *= 0.1 ) * ( *data - '0' ); } )**
# 'e' and 'E' are optional
( 'e' | 'E'
( '+'? read_dec[ "num" ] { nfl *= std::pow( 10.0, num ); } ) |
( '-' read_dec[ "num" ] { nfl *= std::pow( 10.0, -num ); } )
)??
{ os << nfl << " "; }
In order to allow zero copy, hpipe proposes to take care of markers, for all the types of buffers (contiguous or not, interruptible or not, etc...).
For instance, a http parser can be written with
# read until `end` char, reference the result in `$name` if success
read_string[ name, end = 10 | 13 ] = beg_str_next[ name ] ( any - end )** end_str_next[ name ]
# read `size` bytes, reference the result in `$name` if success
read_ssized[ name, size ] = beg_str_next[ name ] skip[ size ] end_str_next[ name ]
# read a decimal number. Store the result in `$name`
read_number[ name ] = digit { name = *data - '0'; } ( digit { name = 10 * name + *data - '0'; } )**
# this http parser simply get url, content length and content
main = 'GET ' read_string[ 'url', ' ' ] (
( 10 'Content-Length: ' read_number[ 'content_length' ] ) |
( 10 eol read_ssized[ 'content', 'content_length' ] { os << "url:" << url << " cl:" << content_length << " content:" << content; } any** ) |
any
)**
For instance, for linked lists of buffers, the generated will then look like:
...
if ( __builtin_expect( data[ 0 ] == 'C', 0 ) ) goto l_10;
if ( data[ 0 ] == 13 ) goto l_9;
l_28:
if ( data[ 0 ] != 10 ) goto l_6;
// hpipe uses markers (in this example)
HPIPE_DATA.__beg_content_buf = buf;
HPIPE_DATA.__beg_content_data = data;
if ( data + content_length <= end_m1 ) {
data += content_length;
} else {
// if stream is interrupted
HPIPE_DATA.__bytes_to_skip = content_length - ( buf->data + buf->used - data );
t_9:
if ( ! buf->next ) {
if ( last_buf )
goto l_38;
// handling of buffer retention
HPIPE_DATA.pending_buf = buf;
HPIPE_BUFF_T__INC_REF( buf );
HPIPE_DATA.inp_cont = &&e_9;
return RET_CONT;
e_9:
HPIPE_DATA.pending_buf->next = buf;
HPIPE_DATA.pending_buf = buf;
goto c_9;
}
buf = buf->next;
c_9:
if ( HPIPE_DATA.__bytes_to_skip >= buf->used ) {
HPIPE_DATA.__bytes_to_skip -= buf->used;
goto t_9;
}
data = buf->data + HPIPE_DATA.__bytes_to_skip;
end_m1 = buf->data + buf->used - 1;
}
HPIPE_BUFF_T__INC_REF( buf );
HPIPE_CB_STRING__ASSIGN_BEG_END( HPIPE_DATA.content, HPIPE_DATA.__beg_content_buf, HPIPE_DATA.__beg_content_data, buf, data + 1 );
...
git clone https://github.com/hleclerc/Hpipe.git && cd Hpipe && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=release .. && make -j8
sudo make install
Hpipe is made from the ground for performance. Here are some illustrations:
Training data can be used by hpipe to transform the instruction graph before the code generation. It enables in particular to automatically decompose the tests, in order to minimize the average number of tests and branch mispredictions.
Let's consider the following declaration:
a = 'a' { cpt += 'a'; }
b = 'b' { cpt += 'b'; }
c = 'c' { cpt += 'c'; }
d = 'd' { cpt += 'd'; }
l = lf { cpt += 10; } # lf = line feed
# we simply test presence of some given chars
main = ( a | b | c | d | l )**
# some training data
beg_training
input
aaaacdacaabda... (a majority of 'a')
freq
1
end_training
Launched on (very long) data similar to the training one, execution times are
Without training | With training | speedup | |
---|---|---|---|
w/o profile opt | 2.27894s | 1.11687s | 2.04x |
with profile opt | 1.80447s | 0.778634s | 2.31x |
("profile opt" means use of g++ -fprofile-instr-generate
-- clang gives the same kind of results with or without profile optimizations)
Here is a comparison of the performance of a basic js scanner, written with re2c, written with ragel or written using hpipe (see tests/re2c/js.re, tests/re2c/js.ragel and tests/hpipe/js.hpipe):
re2c | re2c -g | ragel -G2 | hpipe | speedup | |
---|---|---|---|---|---|
w/o profile opt | 6.06785s | 3.26868s | 4.09402s | 2.15281s | 2.49x 1.52x 1.90x |
with profile opt | 6.36135s | 3.26959s | 2.42014s | 1.82015s | 3.49x 1.80x 1.32x |
re2c is not meant to work on interrupted streams, but the base techniques are roughly the same, hence the relevance of the test. By the way, languages like javascript are meant to be easily parsed, meaning that the most simple parsing codes should already been roughly optimized (i.e. this test is only about basic training, not about the other kinds of optimizations applicable for other kinds of input).
Like in the Boyer-Moore algorithm it may be beneficial to test several chars ahead, in particular if the test leads most of the time to paths where it is of no use to test the chars before.
For instance, with
main = (
( 'bcdefghi' { cpt += 1; } ) |
any
)**
we obtain (on a very long random data set) the following result (using --boyer-more
flag or not):
Without BM | With BM | speedup | |
---|---|---|---|
execution time | 0.937635s | 0.126285s | 7.42x |
"foo"
or'foo'
: look for the text "foo" in the incoming data. "f" must be the first character of the incoming data (looking for a substring can be expressed asany* "foo"
).104
(unsigned char value): look for a byte with specified (unsigned) numeric value. For encoding where characters can use more than one byte (UTFx...), concatenation can be used.'a' .. 'v'
(range): look for a byte value between 'a' and 'v' inclusive. Range works also with numerical values (e.g.'a' .. 105
).A - B
(substraction): look for a byte value inA
but not inB
.A B
(concatenation): test forA
, and if there's a match for machineA
, advance in the buffer and look forB
.{ something in the target language }
(execution): execute the code between the{}
if the previous machines in the lane have a match and if there is at least one ending path through the following machines (which can be guessed or tested). For example, if the machine is'A' {foo();} 'B'
and the text is 'AC',foo
won't be executed ('B'
is not matched). Sometimes, it is not necessary to test the coming machines because there is at least a sure path (e.g. with non mandatory things likeA*
). If it is necessary to test the coming machines, Hpipe may add a marker to keep the data available until there is an "up to the end" path.A | B
(priority or): testA
; ifA
does not work, testB
with the same input data. Of course, internally, it works as a state machine, and whileA
is tested, we keep an eye onB
to avoid buffering of the incoming data, as much as possible. Nevertheless, sometimes, it can be mandatory, specifically when there are execution machines ({...}
) which depend on the data and tests to be made (leading to "data dependent" postponed execution).A*
(loop, not having the priority): test for 0 or any number of successive matches forA
. If there is a following machine, it will have the priority onA
.
A**
(priority loop): test for as much as possible (0 or any number of successive) matches forA
.A
has the priority over the eventual following machines.
A+
(one or more, not having the priority): test for 1 or more successive matches forA
. If there is a following machine, it will have the priority onA
.A++
(one or more with priority): test for as much as possible matches forA
, provided that there is at least one match.A?
(option, not having the priority): test for zero or one match ofA
. For example, theeol
machine is equivalent tocr? lf
to take into account optional dos end lines. If there is a following machine, the following machine have priority.A??
(option with priority): test for zero or one match ofA
. IfA
matches, it has the priority.-> label ... <- label
(goto):-> label
means "jump to label".<- label
is to define the label. It allows notably to exit loops with complex priority patterns. For instance, we can scan C strings using
'"' ( '\\' | '\"' | ( '"' -> out_string ) | any )** <- out_string
in place of
'"' ( '\\' | '\"' | any - '"' )** '"'
where the end machine is written twice (please note that there is no escaping in hpipe strings).
eof
: ok if end of file.beg_str[ "A" ]
: beginning of a string namedA
. If the correspondingend_str
is in a succeeding path,A
will contain (referenced) data betweenbeg_str
andend_str
. This machine will register an attribute inHpipeData
. Type of this attribute will depend on the buffer type.beg_str_next[ "A" ]
: same thing thanbeg_str[ "A" ]
but the actual first char will be after the current one.end_str[ "A" ]
: end of a string namedA
. To be used in conjunction with beg_str.end_str_next[ "A" ]
: end of a string namedA
, current char being be included inA
.skip[ "variable or value" ]
: skip some bytes (defined by an integer). At a given point, if the only active machine are skips, bytes are directly skipped.
There are some internal predefined machine to help generate code:
add_variable[ type, name, optionnal_default_value ]
will add a variable declaration in theHpipeData
structure (+use ofoptionnal_default_value
in ctor).add_include[ 'my_include.h' ]
will add#include <my_include.h>
(only once) in the preliminary of the generated code.<>
and""
can be specified explicitely (as inadd_include[ '"my_include.h"' ]
)add_prel[ 'some_code' ]
oradd_preliminary[ 'some_code' ]
will addsome_code
(only once) in the preliminary of the generated codeimport my_file.hpipe
allows to import content from another hpipe file.
# Arguments values are machines. It may have default values.
my_machine[ val, num = "123", end = digit 'e' | 'E' ] =
{ res += num; } # substring with argument names are replaced by the values
val any* end # variable can be used as normal machines
#
other[ foo = eol ] = ...
#
main =
# arguments can be specified using their names
my_machine[ any, num = "5" ]
# if no argument of default values are available at every position, a machine can be called without []
other
They are defined in src/Hpipe/Predef.sipe.
It is possible to define test(s) inside machine definition, using:
beg_test preproc
input
...
output
... status=...
end_test
input
and output
are shifted to the left (space at the beginning of the line are here to separate sections). hpipe --test foo.hpipe
allows to launch the tests in foo.hpipe
(one can use the flag style
to test with different kinds of buffers).
Training data uses a similar syntax:
beg_training
input
...
freq
42
end_training
freq
represent a frequency (of course relevant only if several training data are present).
beg_methods
void something_in_the_class();
int an_attribute;
// ...
end_methods
will add some code inside the declaration section (which may be in a class or not, depending on how the generated code is used).
beg_flags
--never-ending
--...
end_flags
allows to inline some command line flags (for instance for machine intended to work only under some context, ...).
group | operators |
---|---|
0 | [ |
1 | -> <- |
2 | .. |
3 | - |
4 | ** ++ ?? * + ? |
5 | ` |
6 | , |
7 | = |
--never-ending
indicates that a machine is not going to fail due to the end of the stream. For instance in{ some_code; } any any
, by default, one will have to read the two chars before executing{ some_code; }
. If--never-ending
is specified,{ some_code; }
will be executed directly: one will consider that lack of data is not a failure reason.--style buffer_style
: defines how the data are given (contiguous buffers or not, interruptible or not, ...). See Buffer styles.
Hpipe generates only one output file, with sections that can be activated using defines (#define HPIPE_DECLARATIONS
, #define HPIPE_DEFINITIONS
, ...). A lot of flexibility is allowed using macros, users are encouraged to read them in the generated code.
By default (but there are other choices), hpipe generate a parse
function with a signature like:
unsigned parse( HpipeData* hpipe_data, Hpipe::Buffer* buf, bool last_buf = false );
Hpipe::Buffer
is basically a data chunk with a reference counter and a next
field for strings made from several data chunks (Hpipe::CbString
and Hpipe::CbStringPtr
enable to handle those lists regular strings). A standard reading pattern using Hpipe::Buffer
would be
Buffer *inp_buff = Buffer::New( Buffer::default_size );
HpipeData hpipe_data;
...
void input_data_event() {
while ( true ) {
// if not possible to reuse the old buffer, create a new one
if ( inp_buff->cpt_use > 0 ) {
--inp_buff->cpt_use;
inp_buff = Buffer::New( Buffer::default_size );
}
// try to read some data
inp_buff->used = recv( fd, inp_buff->data, inp_buff->size, ... );
if ( int( inp_buff->used ) <= 0 ) ...;
// parse
parse( &hpipe_data, inp_buff );
}
}
Of course, using this kind of buffer is not the taste of everyone. One can use the option -s
or --style
to choose the buffer style:
-s HPIPE_BUFFER
generates code as shown above.-s HPIPE_CB_STRING_PTR
generates code that assumes that data comes once, in a chained list of buffer with a begin and an end offset (relative to the first buffer, the one given as argument).-s BEG_END
will generate code for a unique (non interruptible) buffer defined using aconst unsigned char *
for the beginning and also for the end.-s C_STR
will generate code for a unique (non interruptible) buffer defined using aconst unsigned char *
for the beginning, and ended by the value after the--stop-char
cmd arg (0 by default).
For instance, -s BEG_END
will generate a function with a signature like:
unsigned parse( HpipeData* hpipe_data, const unsigned char* data, const unsigned char* end_m1 ); // `end_m1` must point to the last char (e.g. for data.size == 0, end_m1 must be equal to data - 1).