HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST #11

Closed
sadegh01 opened this issue Jan 4, 2016 · 4 comments

Comments


sadegh01 commented Jan 4, 2016

These two flags conflict.
When I use them together, HS_FLAG_SOM_LEFTMOST does not work correctly: in the eventHandler() function, the value of from is always 0.

Contributor

jviiret commented Jan 4, 2016

These two flags should definitely work together. Please post a test case with the following information:

  • The full expression, flags and mode passed to hs_compile(),
  • The data being scanned,
  • The locations of the matches you are receiving, and where you expect them to be.

Author

sadegh01 commented Jan 5, 2016

These are my flags:
flags.push_back(HS_FLAG_DOTALL | HS_FLAG_SOM_LEFTMOST );

This is my pattern to extract JavaScript:
1:/<script[^/][^&gt;]>(.?)</script[^&gt;]>|<javascript[^/][^&gt;]>(.?)</javascript[^&gt;]>/

This is my file:
<script type="text/javascript">//<![CDATA[si_ST=new Date;//]]></script>
lablablab
<script type="text/javascript">//<![CDATA[
_G.HT=new Date;
//]]></script></html>
lablablab

I want to use HS_FLAG_DOTALL so that the pattern matches across newlines, and HS_FLAG_SOM_LEFTMOST to get the start offset of each match.

Note: I HTML-escaped my file and my pattern before pasting them here. For your test, please use this link to convert them back to unescaped form:
http://www.freeformatter.com/html-escape.html#ad-output

Contributor

jviiret commented Jan 5, 2016

I think your escaped markup might have stripped some characters from your pattern. I'm assuming from your description that this is what it should look like (on GitHub, indenting with four spaces makes the Markdown support render text as code, without formatting):

/<script[^/][^>]*>(.*?)</script[^>]*>|<javascript[^/][^>]*>(.*?)</javascript[^>]*>/

I think the issue here is that you are assuming backtracking semantics, whereas Hyperscan provides automata semantics. This means that instead of providing one "best match" that takes into account greedy/ungreedy repeats, alternation ordering, etc., as PCRE does, Hyperscan delivers all possible matches for a given regex. In these semantics, there is no difference between .* and .*?.

This is a fundamental difference from the way that a backtracking matcher like PCRE operates. We have a more detailed description of it in the Semantics section of the Hyperscan developer reference.

In this particular case, this is why SOM_LEFTMOST is always reporting a from offset of zero: your .*? repeats in DOTALL mode will match any sequence of characters, so the leftmost start of any match is the first occurrence of a match for <script[^/][^>]*>, which is the <script type="text/javascript"> at offset 0 of your file.
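
To make this concrete, here is a small self-contained sketch that models the behaviour rather than calling Hyperscan itself. It assumes literal `<script` and `</script>` markers in place of the full pattern, and the helper name `leftmost_som_matches` is made up for illustration. Under all-matches semantics with DOTALL, every closing tag ends a match, and the leftmost start for each of those matches is the first opening tag in the data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified model (NOT the Hyperscan API): report one (from, to) pair per
// closing tag, where `from` is the leftmost opening tag that can start a
// match ending there. With DOTALL, ".*" can span anything in between, so
// the leftmost start is always the first opening tag in the buffer.
std::vector<std::pair<std::size_t, std::size_t>>
leftmost_som_matches(const std::string &data) {
    const std::string open = "<script";    // stands in for <script[^/][^>]*>
    const std::string close = "</script>"; // stands in for </script[^>]*>
    std::vector<std::pair<std::size_t, std::size_t>> out;

    const std::size_t first_open = data.find(open);
    if (first_open == std::string::npos) {
        return out; // no opening tag, so nothing can match
    }

    // Every closing tag after the first opening tag ends a match.
    std::size_t pos = data.find(close, first_open + open.size());
    while (pos != std::string::npos) {
        const std::size_t to = pos + close.size(); // end offset, exclusive
        assert(first_open < to);
        out.emplace_back(first_open, to);
        pos = data.find(close, pos + 1);
    }
    return out;
}
```

For a buffer with two script blocks, this model yields two matches with different end offsets but the same start offset, which mirrors the from value of 0 reported in this issue.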

I would suggest that the easiest way to use Hyperscan to extract the data between two script tags would be to split your pattern up into four patterns:

1:/<script[^/][^>]*>/
2:/</script[^>]*>/
3:/<javascript[^/][^>]*>/
4:/</javascript[^>]*>/

You can then track the offsets at which patterns 1 and 2 match, extracting the data between them, and similarly for 3 and 4.
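
A minimal sketch of that offset tracking follows. The callback has the same parameter list as Hyperscan's match_event_handler, but `TagTracker`, `on_match` and the enum names are hypothetical, and the sketch assumes the closing-tag patterns were compiled with HS_FLAG_SOM_LEFTMOST so that `from` points at the start of the closing tag. It also assumes tags do not nest.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Pattern IDs, numbered as in the four-pattern list above.
enum : unsigned int { SCRIPT_OPEN = 1, SCRIPT_CLOSE = 2, JS_OPEN = 3, JS_CLOSE = 4 };

// Hypothetical per-scan state: where the body of the currently open tag
// begins, plus the bodies extracted so far.
struct TagTracker {
    const std::string *data = nullptr; // the buffer being scanned
    bool open = false;                 // are we inside an opening tag's body?
    unsigned int open_id = 0;          // which opening pattern fired
    unsigned long long body_start = 0; // end offset of the opening tag
    std::vector<std::string> bodies;
};

// Same parameter list as Hyperscan's match_event_handler callback.
int on_match(unsigned int id, unsigned long long from, unsigned long long to,
             unsigned int /*flags*/, void *ctx) {
    auto *t = static_cast<TagTracker *>(ctx);
    if (id == SCRIPT_OPEN || id == JS_OPEN) {
        if (!t->open) { // ignore further opening matches until we see a close
            t->open = true;
            t->open_id = id;
            t->body_start = to; // body begins right after the opening tag
        }
    } else if (t->open && id == t->open_id + 1 && from >= t->body_start) {
        // Matching close: the body runs from the end of the opening tag
        // to the start of the closing tag.
        t->bodies.push_back(t->data->substr(
            static_cast<std::size_t>(t->body_start),
            static_cast<std::size_t>(from - t->body_start)));
        t->open = false;
    }
    return 0; // returning non-zero would tell the scanner to stop
}
```

In a real program, `on_match` would be passed to hs_scan() as the event handler, with a pointer to the tracker as the context argument.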

jviiret closed this as completed Jan 5, 2016
Author

sadegh01 commented Jan 6, 2016

Dear jviiret,
Thank you very much for your attention and explanation ;)

intel locked and limited conversation to collaborators Aug 8, 2016