Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex->trigram has wrong behaviour when trigrams are question-marked #17

Closed
GoogleCodeExporter opened this issue Mar 30, 2015 · 5 comments

Comments

@GoogleCodeExporter
Copy link

bash$ csearch -verbose "foo_(bar_)?" >/dev/null
2012/03/05 10:44:37 query: "_ba" "foo" "o_b" "oo_"
2012/03/05 10:44:37 post query identified 11 possible files
bash$ csearch -verbose "foo_b(ar_)?" >/dev/null
2012/03/05 10:44:44 query: "foo" "o_b" "oo_"
2012/03/05 10:44:44 post query identified 12 possible files

In the first example, "_ba" and "o_b"' should not be required trigrams for the 
file.  In the second example, "ar_" is correctly _not_ included in the list of 
required trigrams.

I'll take a look at this later tonight.  It looks like an issue with precedence 
in index/regexp.go.  (I.e., the question mark is only applying to the final 
trigram, and not all trigrams included in the grouping. ).

Original issue reported on code.google.com by dgryski on 5 Mar 2012 at 10:07

@GoogleCodeExporter
Copy link
Author

A bit of investigation has narrowed my focus to concat(x,y)'s handling when x 
or y has canEmpty=true and a prefix/suffix list rather than a list of exact 
matches.  The hunt continues...

Original comment by dgryski on 5 Mar 2012 at 6:29

  • Added labels: ****
  • Removed labels: ****

@GoogleCodeExporter
Copy link
Author

I'm stumped.  I think this bug is beyond my current familiarity with the 
codebase.  I'm certain the issue is with concat(), but after many hours of 
staring at this I really have no idea what the fix should look like.

Original comment by dgryski on 6 Mar 2012 at 8:40

  • Added labels: ****
  • Removed labels: ****

@GoogleCodeExporter
Copy link
Author

So, it looks like this particular issue is actually in alternate().  The code 
assumes that either x and y are both exact or both have prefix/suffix lists.  
This obviously isn't the case.

The above bug is caused by the fact that emptyString() has an exact match but 
no prefix/suffix lists, so when being combined we were unioning with a nil 
list.  This didn't crash, it just didn't have the expected behaviour.

This also fixes the case where, when searching for '(123|4567)q' , we would get 
"67q" as the only required trigram, instead of the correct ("23q"|"67q").

Original comment by dgryski on 7 Mar 2012 at 9:37

  • Added labels: ****
  • Removed labels: ****

Attachments:

@GoogleCodeExporter
Copy link
Author

After writing that last sentence, I realized that my "corrected" trigrams were 
still missing some.  My implementation of alternate ignored the exact matches.

We now call addExact() before combining the matches.  This allows the (this 
time for sure ;) correct trigram set: 
    ("123")|("456" "567")  ("23q"|"67q")
to be generated.


Original comment by dgryski on 7 Mar 2012 at 11:07

  • Added labels: ****
  • Removed labels: ****

Attachments:

@GoogleCodeExporter
Copy link
Author

This issue was closed by revision 56b76ffbf8bb.

Original comment by dgryski on 2 May 2012 at 8:26

  • Changed state: Fixed
  • Added labels: ****
  • Removed labels: ****

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant