Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial matches #127

Closed
cpitclaudel opened this issue Dec 15, 2016 · 6 comments
Closed

Partial matches #127

cpitclaudel opened this issue Dec 15, 2016 · 6 comments

Comments

@cpitclaudel
Copy link

As an alternative to matching on arbitrary input streams, how hard would it be to support partial matches (in PCRE's sense)?

That is, how hard would it be to extend the API so that one could write something along the lines of:

state = RE2::PartialMatchOnIncompleteInput(hel", "h.*o", NULL)
final_state RE2::PartialMatchOnIncompleteInput("lo", "h.*o", state)

assert(RE2::PartialMatchOnIncompleteInputSuceeded(final_state))

Such a feature would make it easy to use RE2 to search Emacs' gap buffer.

@junyer
Copy link
Contributor

junyer commented Dec 16, 2016

That's essentially what I was thinking when I wrote "hacking the DFA execution engine to handle input strings in chunks" in #126 (comment). In theory, you'd just have to pass around a State*. In practice, it's much more complicated than that.

How would this fit into the current interface and implementation? I'm struggling to imagine how the public API should look, for example. And if the caller wants to know where the regular expression matched, not just whether it matched, then you'd have to run the DFA backwards in order to find the start of the match. What if you need to read previous chunks? What if the caller doesn't have previous chunks? Then there's the whole problem of elegantly plumbing and wiring the needful.

Also, because the DFA execution engine permits concurrent use of a DFA object by multiple threads, the State* would have to be pinned before being returned in order to avoid being invalidated when another thread resets (i.e. clears) the DFA state cache. Pinning (and unpinning) would most likely need flag_ to become atomic. Resetting would become more complex due to memory budget arithmetic. You could try dodging all of that by relying on StateSaver, I guess, but that seems worse because it would involve (a) copying data, (b) handling failures and (c) also wrangling the initial State*, which the main loop compares with the State* to trigger the memchr(3) optimisation.

Sorry, this isn't going to happen in the foreseeable future. :(

(Ping, @rsc and @BurntSushi.)

@junyer junyer closed this as completed Dec 16, 2016
@rsc
Copy link
Contributor

rsc commented Dec 16, 2016

Agreed.

@cpitclaudel
Copy link
Author

Thanks for the detailed and thoughtful reply. Ultimately, my goal is to be able to use RE2 in Emacs; it already works nicely for strings, but not for buffer contents, as Emacs stores buffer text in a gap buffer. One can think of this as essentially two strings, and indeed in the past glibc had a __re_search_2 function.

Ideally, a solution for this use case would also cover similar ones, like ropes or piece tables. There are multiple possible interfaces for this, but here are the three that I can think of:

  • pcrepartial's style, which let's you continue a match by passing a pointer to a workspace into DFA-based matching functions;
  • TRE's style, which fetches character using a function call
  • Glibc's style, where you can pass two strings. This could be extended to passing an array of strings.

It looks like 1 would be pretty complex to implement, and 2 would be prohibitively costly, so 3 may be the only option left?

@junyer
Copy link
Contributor

junyer commented Dec 19, 2016

One can think of this as essentially two strings, and indeed in the past glibc had a __re_search_2 function.

FWIW, glibc still has the re_search_2() and re_match_2() functions in regex.h.

It looks like 1 would be pretty complex to implement, and 2 would be prohibitively costly, so 3 may be the only option left?

You'd have to do this for the general case, not just for the specific case of two strings, and you'd have to use Google's Cord, which supports efficient forward iteration by yielding StringPiece fragments, but doesn't support efficient reverse iteration, so running the DFA backwards would still be problematic. Unfortunately, since you'd no longer be constrained to the DFA execution engine, you'd have to juggle Cord and StringPiece everywhere in the API for the sake of consistency.

Again, this isn't going to happen in the foreseeable future.

@junyer
Copy link
Contributor

junyer commented Dec 19, 2016

P.S. Having now seen k-takata/Onigmo#83 and kkos/oniguruma#45, I'm rather curious about your ultimate goal.

@cpitclaudel
Copy link
Author

FWIW, glibc still has the re_search_2() and re_match_2() functions in regex.h.

No, not really:

static int
internal_function
re_search_2_stub (struct re_pattern_buffer *bufp, const char *string1,
		  int length1, const char *string2, int length2, int start,
		  int range, struct re_registers *regs,
		  int stop, int ret_len)
{
  const char *str;
  int rval;
  int len = length1 + length2;
  char *s = NULL;

  …
  /* Concatenate the strings.  */
  if (length2 > 0)
    if (length1 > 0)
      {
	s = re_malloc (char, len);

	if (BE (s == NULL, 0))
	  return -2;
	memcpy (s, string1, length1);
	memcpy (s + length1, string2, length2);
	str = s;
      }
    else
      str = string2;
  else
    str = string1;

  rval = re_search_stub (bufp, str, len, start, range, stop, regs, ret_len);
  re_free (s);
  return rval;
}

Again, this isn't going to happen in the foreseeable future.

Thanks for taking the time to clarify!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants