Recovering segments from pygmentized code fails for some lexers #143

Open: 1 commit to be merged into base: master
2 changes: 1 addition & 1 deletion lib/utils.coffee
@@ -509,7 +509,7 @@ module.exports = Utils =
result = result.replace('<div class="highlight"><pre>', '').replace('</pre></div>', '')

# Extract our segments from the pygmentized source.
- highlighted = "\n#{result}\n".split ///.*<span.*#{seg}\s#{div}.*<\/span>.*///
+ highlighted = "\n#{result}\n".split ///.*<span.*#{seg}.*?#{div}.*<\/span>.*///
Collaborator

Using .*? seems a bit dangerous. It could match anything. \s would be safer.
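
To make the concern concrete, here is a small TypeScript sketch (not groc's code) comparing the two patterns from the diff. It assumes `seg` and `div` expand to the words SEGMENT and DIVIDER, and both input lines are made up for illustration.

```typescript
// Assumed marker words; the real `seg` and `div` values are defined in lib/utils.coffee.
const segWord = "SEGMENT";
const divWord = "DIVIDER";

// The two patterns from the diff, rebuilt with the RegExp constructor.
const strictPattern = new RegExp(`.*<span.*${segWord}\\s${divWord}.*<\\/span>.*`);
const loosePattern = new RegExp(`.*<span.*${segWord}.*?${divWord}.*<\\/span>.*`);

// Made-up input: an ordinary highlighted comment that merely mentions both words.
const ordinaryComment = '<span class="c1"># each SEGMENT ends before the next DIVIDER</span>';
// Made-up input: a genuine marker line as a single-span lexer might emit it.
const markerLine = '<span class="c1"># SEGMENT DIVIDER</span>';

const strictHitsOrdinary = strictPattern.test(ordinaryComment); // false
const looseHitsOrdinary = loosePattern.test(ordinaryComment);   // true
const strictHitsMarker = strictPattern.test(markerLine);        // true
```

With the lazy wildcard, any highlighted line that contains both words in order would be treated as a segment boundary, which is the over-matching risk raised here.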

Contributor Author

I agree it's not ideal, but for some languages, JSP for instance, the words SEGMENT and DIVIDER are in separate spans. It might be possible to come up with a more exact regex. I'll try to post an example of the parsed JSP comment HTML today and we can discuss options.

Collaborator

they wind up in separate spans??? wow... I wasn't expecting that...

Collaborator

this is especially disconcerting because ORIGINALLY "#{seg}\s#{div}" was just a literal "SEGMENT DIVIDER"

I split it up into 2 words like this to prevent an error when running groc on itself.

Collaborator

you need this for the xml/html language support?

I suspect that the lexer might be doing something weird... hmm...

Collaborator

one option might be: (?:\s|INSERT_VERY_SPECIFIC_REGEX_FOR_THIS_ISSUE) so we match either a single space or the extra close/open span

Contributor Author

I suppose .*? was a bit lazy of me, sorry. Here is the output for JSP comments:

<span class="k">&lt;%-</span><span class="o">-</span> <span class="n">SEGMENT</span> <span class="n">DIVIDER</span> <span class="o">--</span><span class="k">%&gt;</span>
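
As a rough illustration of why this output trips the length check in the hunk, the TypeScript sketch below runs both patterns from the diff over it. It assumes the marker words are SEGMENT and DIVIDER, and the surrounding "segments" are placeholder text rather than real highlighter output.

```typescript
// Assumption: `seg` and `div` expand to SEGMENT and DIVIDER (defined in lib/utils.coffee).
const strictSplitter = /.*<span.*SEGMENT\sDIVIDER.*<\/span>.*/;
const looseSplitter = /.*<span.*SEGMENT.*?DIVIDER.*<\/span>.*/;

// The JSP marker line quoted above, verbatim.
const jspMarker =
  '<span class="k">&lt;%-</span><span class="o">-</span> <span class="n">SEGMENT</span> ' +
  '<span class="n">DIVIDER</span> <span class="o">--</span><span class="k">%&gt;</span>';

// Placeholder document: two segments separated by the marker line,
// wrapped in newlines the same way the hunk does before splitting.
const highlightedHtml = `\nfirst segment\n${jspMarker}\nsecond segment\n`;

// SEGMENT is followed by "</span>", not whitespace, so the strict pattern never matches.
const strictPieces = highlightedHtml.split(strictSplitter); // 1 piece
// The lazy wildcard spans the span boundary, so the marker line is consumed.
const loosePieces = highlightedHtml.split(looseSplitter);   // 2 pieces
```

With the `\s` pattern the whole document comes back as a single piece, so `highlighted.length` no longer equals `segments.length`.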

Collaborator

so </span> <span class="n"> is the bad stuff we need to match

(?:\s|<\/span>\s*<span\sclass="n">)
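
A quick TypeScript sketch of how that separator might behave when dropped into the full pattern. The marker words SEGMENT and DIVIDER are assumed, and every input except the JSP line is hypothetical.

```typescript
// The separator suggested above: one whitespace character,
// or a span boundary between two class="n" tokens.
const markerGap = '(?:\\s|<\\/span>\\s*<span\\sclass="n">)';
// Assumption: `seg` and `div` expand to SEGMENT and DIVIDER.
const proposedSplitter = new RegExp(`.*<span.*SEGMENT${markerGap}DIVIDER.*<\\/span>.*`);

// The JSP marker line from the thread.
const jspLine =
  '<span class="k">&lt;%-</span><span class="o">-</span> <span class="n">SEGMENT</span> ' +
  '<span class="n">DIVIDER</span> <span class="o">--</span><span class="k">%&gt;</span>';
// Hypothetical: a lexer that keeps the whole comment in one span.
const singleSpanLine = '<span class="c1"># SEGMENT DIVIDER</span>';
// Hypothetical: an ordinary comment that mentions both words.
const proseLine = '<span class="c1"># a SEGMENT is not a DIVIDER</span>';
// Hypothetical: a lexer that tags the two words with a different token class.
const otherClassLine = '<span class="nt">SEGMENT</span> <span class="nt">DIVIDER</span>';

const matchesJsp = proposedSplitter.test(jspLine);               // true
const matchesSingleSpan = proposedSplitter.test(singleSpanLine); // true
const matchesProse = proposedSplitter.test(proseLine);           // false
const matchesOtherClass = proposedSplitter.test(otherClassLine); // false
```

The last case shows the limit of hard-coding `class="n"`: a lexer that splits the words into spans with another class would still slip through.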

Contributor Author

Yeah, that would work. I'm not sure if the same pattern applies to comments in other languages though. I'll have to check XML and HTML.


if highlighted.length != segments.length
  console.log(result)