-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recovering segments from pygmetized code fails for some lexers #143
base: master
Are you sure you want to change the base?
Conversation
…hat there will be a single space between SEGMENT and DIVIDER. This fails for some lexers. Relaxing this assumption fixes the issue.
@@ -509,7 +509,7 @@ module.exports = Utils = | |||
result = result.replace('<div class="highlight"><pre>', '').replace('</pre></div>', '') | |||
|
|||
# Extract our segments from the pygmentized source. | |||
highlighted = "\n#{result}\n".split ///.*<span.*#{seg}\s#{div}.*<\/span>.*/// | |||
highlighted = "\n#{result}\n".split ///.*<span.*#{seg}.*?#{div}.*<\/span>.*/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using .? seems a bit dangerous. It could match Anything. \s would be safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree its not ideal, but for some languages, JSP for instance, the words SEGMENT
and DIVIDER
are in separate spans. It might be possible to come up with a more exact regex. I'll try and post an example of the parsed JSP comment HTML today and we can discuss options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they wind up in separate spans??? wow... I wasn't expecting that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is especially disconcerting because ORIGINALLY "#{seg}\s#{div}" was just a literal "SEGMENT DIVIDER"
I split it up into 2 words like this to prevent an error when running groc on itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need this for the xml/html language support?
I suspect that the lexer might be doing something wierd... hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one option might be: (?:\s|INSERT_VERY_SPECIFIC_REGEX_FOR_THIS_ISSUE) so we match either a single space Or the extra close/open span
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose .*?
was a bit lazy of me, sorry. Here is the output for JSP comments:
<span class="k"><%-</span><span class="o">-</span> <span class="n">SEGMENT</span> <span class="n">DIVIDER</span> <span class="o">--</span><span class="k">%></span>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so </span> <span class="n">
is the bad stuff we need to match
(?:\s|<\/span>\s*<span\sclass="n">)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would work. I'm not sure if the same pattern applies to comments in other languages though. I'll have to check XML and HTML.
Please read this comment: #134 (comment) If we implement the code-segmentation in the proper tool, which is to my mind the syntax-highlighter, we would not just simplify groc's implementation, because groc would get a well known JSON-formatted stream of raw-comments and formatted-code, but also raise groc's precision dramatically. Another benefit of this approach is, to add new syntax-highlighters, we just need to implement this kind of glue-API that spits out the preprocessed JSON-stream. Possible candidates: ruby's rouge or node's highlight.js. As rouge uses the same styles as pygments, only highlight.js would have a different styling. |
The regex that splits the segments from the pygmetized code assumes that there will be a single space between SEGMENT and DIVIDER. This fails for some lexers. Relaxing this assumption fixes the issue.