Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovering segments from pygmetized code fails for some lexers #143

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

shiftyp
Copy link
Contributor

@shiftyp shiftyp commented Dec 1, 2013

The regex that splits the segments from the pygmetized code assumes that there will be a single space between SEGMENT and DIVIDER. This fails for some lexers. Relaxing this assumption fixes the issue.

…hat there will be a single space between SEGMENT and DIVIDER. This fails for some lexers. Relaxing this assumption fixes the issue.
@@ -509,7 +509,7 @@ module.exports = Utils =
result = result.replace('<div class="highlight"><pre>', '').replace('</pre></div>', '')

# Extract our segments from the pygmentized source.
highlighted = "\n#{result}\n".split ///.*<span.*#{seg}\s#{div}.*<\/span>.*///
highlighted = "\n#{result}\n".split ///.*<span.*#{seg}.*?#{div}.*<\/span>.*///
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using .? seems a bit dangerous. It could match Anything. \s would be safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree its not ideal, but for some languages, JSP for instance, the words SEGMENT and DIVIDER are in separate spans. It might be possible to come up with a more exact regex. I'll try and post an example of the parsed JSP comment HTML today and we can discuss options.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they wind up in separate spans??? wow... I wasn't expecting that...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is especially disconcerting because ORIGINALLY "#{seg}\s#{div}" was just a literal "SEGMENT DIVIDER"

I split it up into 2 words like this to prevent an error when running groc on itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need this for the xml/html language support?

I suspect that the lexer might be doing something wierd... hmm...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one option might be: (?:\s|INSERT_VERY_SPECIFIC_REGEX_FOR_THIS_ISSUE) so we match either a single space Or the extra close/open span

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose .*? was a bit lazy of me, sorry. Here is the output for JSP comments:

<span class="k">&lt;%-</span><span class="o">-</span> <span class="n">SEGMENT</span> <span class="n">DIVIDER</span> <span class="o">--</span><span class="k">%&gt;</span>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so </span> <span class="n"> is the bad stuff we need to match

(?:\s|<\/span>\s*<span\sclass="n">)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would work. I'm not sure if the same pattern applies to comments in other languages though. I'll have to check XML and HTML.

@sjorek
Copy link
Contributor

sjorek commented Dec 21, 2013

Please read this comment: #134 (comment)

If we implement the code-segmentation in the proper tool, which is to my mind the syntax-highlighter, we would not just simplify groc's implementation, because groc would get a well known JSON-formatted stream of raw-comments and formatted-code, but also raise groc's precision dramatically. Another benefit of this approach is, to add new syntax-highlighters, we just need to implement this kind of glue-API that spits out the preprocessed JSON-stream. Possible candidates: ruby's rouge or node's highlight.js. As rouge uses the same styles as pygments, only highlight.js would have a different styling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants