Partial matches #127

cpitclaudel · 2016-12-15T19:28:14Z

As an alternative to matching on arbitrary input streams, how hard would it be to support partial matches (in PCRE's sense)?

That is, how hard would it be to extend the API so that one could write something along the lines of:

state = RE2::PartialMatchOnIncompleteInput(hel", "h.*o", NULL)
final_state RE2::PartialMatchOnIncompleteInput("lo", "h.*o", state)

assert(RE2::PartialMatchOnIncompleteInputSuceeded(final_state))

Such a feature would make it easy to use RE2 to search Emacs' gap buffer.

The text was updated successfully, but these errors were encountered:

junyer · 2016-12-16T10:16:21Z

That's essentially what I was thinking when I wrote "hacking the DFA execution engine to handle input strings in chunks" in #126 (comment). In theory, you'd just have to pass around a State*. In practice, it's much more complicated than that.

How would this fit into the current interface and implementation? I'm struggling to imagine how the public API should look, for example. And if the caller wants to know where the regular expression matched, not just whether it matched, then you'd have to run the DFA backwards in order to find the start of the match. What if you need to read previous chunks? What if the caller doesn't have previous chunks? Then there's the whole problem of elegantly plumbing and wiring the needful.

Also, because the DFA execution engine permits concurrent use of a DFA object by multiple threads, the State* would have to be pinned before being returned in order to avoid being invalidated when another thread resets (i.e. clears) the DFA state cache. Pinning (and unpinning) would most likely need flag_ to become atomic. Resetting would become more complex due to memory budget arithmetic. You could try dodging all of that by relying on StateSaver, I guess, but that seems worse because it would involve (a) copying data, (b) handling failures and (c) also wrangling the initial State*, which the main loop compares with the State* to trigger the memchr(3) optimisation.

Sorry, this isn't going to happen in the foreseeable future. :(

(Ping, @rsc and @BurntSushi.)

rsc · 2016-12-16T15:41:40Z

Agreed.

cpitclaudel · 2016-12-16T19:31:03Z

Thanks for the detailed and thoughtful reply. Ultimately, my goal is to be able to use RE2 in Emacs; it already works nicely for strings, but not for buffer contents, as Emacs stores buffer text in a gap buffer. One can think of this as essentially two strings, and indeed in the past glibc had a __re_search_2 function.

Ideally, a solution for this use case would also cover similar ones, like ropes or piece tables. There are multiple possible interfaces for this, but here are the three that I can think of:

pcrepartial's style, which let's you continue a match by passing a pointer to a workspace into DFA-based matching functions;
TRE's style, which fetches character using a function call
Glibc's style, where you can pass two strings. This could be extended to passing an array of strings.

It looks like 1 would be pretty complex to implement, and 2 would be prohibitively costly, so 3 may be the only option left?

junyer · 2016-12-19T10:52:10Z

One can think of this as essentially two strings, and indeed in the past glibc had a __re_search_2 function.

FWIW, glibc still has the re_search_2() and re_match_2() functions in regex.h.

It looks like 1 would be pretty complex to implement, and 2 would be prohibitively costly, so 3 may be the only option left?

You'd have to do this for the general case, not just for the specific case of two strings, and you'd have to use Google's Cord, which supports efficient forward iteration by yielding StringPiece fragments, but doesn't support efficient reverse iteration, so running the DFA backwards would still be problematic. Unfortunately, since you'd no longer be constrained to the DFA execution engine, you'd have to juggle Cord and StringPiece everywhere in the API for the sake of consistency.

Again, this isn't going to happen in the foreseeable future.

junyer · 2016-12-19T10:55:25Z

P.S. Having now seen k-takata/Onigmo#83 and kkos/oniguruma#45, I'm rather curious about your ultimate goal.

cpitclaudel · 2016-12-19T15:10:55Z

FWIW, glibc still has the re_search_2() and re_match_2() functions in regex.h.

No, not really:

static int
internal_function
re_search_2_stub (struct re_pattern_buffer *bufp, const char *string1,
		  int length1, const char *string2, int length2, int start,
		  int range, struct re_registers *regs,
		  int stop, int ret_len)
{
  const char *str;
  int rval;
  int len = length1 + length2;
  char *s = NULL;

  …
  /* Concatenate the strings.  */
  if (length2 > 0)
    if (length1 > 0)
      {
	s = re_malloc (char, len);

	if (BE (s == NULL, 0))
	  return -2;
	memcpy (s, string1, length1);
	memcpy (s + length1, string2, length2);
	str = s;
      }
    else
      str = string2;
  else
    str = string1;

  rval = re_search_stub (bufp, str, len, start, range, stop, regs, ret_len);
  re_free (s);
  return rval;
}

Again, this isn't going to happen in the foreseeable future.

Thanks for taking the time to clarify!

junyer closed this as completed Dec 16, 2016

BurntSushi mentioned this issue Dec 3, 2017

execute a regex on text streams rust-lang/regex#425

Open

junyer mentioned this issue Aug 2, 2019

Stream API (or matching across successive segments of data)? #213

Closed

junyer mentioned this issue Mar 20, 2020

Do you think it is possible to support streams without a maximum length? #251

Closed

spacewander mentioned this issue Mar 11, 2022

request help: Does response-rewrite plugin support partial or regular substitution apache/apisix#5451

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial matches #127

Partial matches #127

cpitclaudel commented Dec 15, 2016

junyer commented Dec 16, 2016

rsc commented Dec 16, 2016

cpitclaudel commented Dec 16, 2016

junyer commented Dec 19, 2016

junyer commented Dec 19, 2016

cpitclaudel commented Dec 19, 2016

Partial matches #127

Partial matches #127

Comments

cpitclaudel commented Dec 15, 2016

junyer commented Dec 16, 2016

rsc commented Dec 16, 2016

cpitclaudel commented Dec 16, 2016

junyer commented Dec 19, 2016

junyer commented Dec 19, 2016

cpitclaudel commented Dec 19, 2016