Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of indented markdown features by requiring line-break delimiter #2821

Closed
nickfargo opened this issue Mar 12, 2013 · 13 comments
Closed
Labels

Comments

@nickfargo
Copy link

Here is an example of a markdown-coffee passage that cannot be lexed as literate (#1786) in 1.6.1, because the list item’s continuation line passes the simple per-line /[ ]{4}|\t/ test and as such is incorrectly identified as a code block.

Allow me to enumerate:

  * Long list items may
    continue
  * Oops.

    @however is code

This seems to imply that literate code blocks should be not only indented, but also delimited from overhead markdown by at least one immediately preceding empty line. As much is suggested by example, and by the syntax highlighting rules in the tmbundle, however it is not presently stipulated by the implementation.

One approach for a fix might be to expand Lexer::clean a bit, allowing the markdown passages to be read in blocks, rather than individual lines:

  # Preprocess the code to remove leading and trailing whitespace, carriage
  # returns, etc.
  clean: (code) ->
    code = code.slice(1) if code.charCodeAt(0) is BOM
    code = code.replace(/\r/g, '').replace TRAILING_SPACES, ''
    if WHITESPACE.test code
      code = "\n#{code}"
      @chunkLine--
    code = @cleanMarkdown code if @literate
    code

  # If we're lexing literate CoffeeScript, identify the blocks of extrinsic
  # Markdown and transform them into `#`-comments.
  cleanMarkdown: (code) ->
    COFFEE = /^(?:[ ]{4}|\t)(.*)/
    EMPTY = /^\s*$/

    lines = code.split '\n'
    line = lines[0]
    state =
      if COFFEE.test line then 'coffee'
      else if EMPTY.test line then 'markdown-newline'
      else 'markdown'

    # Allow transition into `coffee` from `markdown` only after an empty line.
    for line, i in lines
      switch state
        when 'markdown'
          if EMPTY.test line
            state = 'markdown-newline'
            line = ''
          else
            line = '# ' + line
        when 'markdown-newline'
          if match = COFFEE.exec line
            state = 'coffee'
            line = match[1]
          else if EMPTY.test line
            line = '# ' + line
          else
            state = 'markdown'
            line = '# ' + line
        when 'coffee'
          if match = COFFEE.exec line
            line = match[1]
          else if EMPTY.test line
            line = ''
          else
            state = 'markdown'
            line = '# ' + line
      lines[i] = line

    lines.join '\n'

With this in place, existing tests appear to pass, as do ad-hoc cases similar to the example above that previously would fail.

Please offer thoughts on whether this might be an approach worth evolving into a pull.

@jashkenas
Copy link
Owner

Although it would be possible for us to start doing more sophisticated parses of Markdown, ideally, we'd try and keep this feature very simple. Having just a one-line-at-a-time, is-it-literate-or-is-it-not heuristic makes it way easier to for other tools to support the same syntax for other purposes.

Are you sure it wouldn't be easier just to write it out like this?

Allow me to enumerate:

* Long list items may
  continue
* Oops.

    @however is code

@nickfargo
Copy link
Author

A poor example; mea culpa. Something less easily “fixable” would have illustrated the point better, like making the list nested:

* Even when nested
  * long list items may
    continue
  * Oops.

or a canonical markdown example:

To make lists look nice, you can wrap items with hanging indents:

*   Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
    Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
    viverra nec, fringilla in, laoreet vitae, risus.

or the more edgy cases, like neatly indented multi-paragraph list items.

To be sure, some features are well enough off the beaten path to disregard, but others are common enough, and in a few cases the danger could lie well hidden, and lead to some pernicious bugs.

I appreciate the power of simplicity above all else, but here I think it will prove worthwhile to find an acceptable way to chase the line break.

Thanks for having a look.

@jashkenas
Copy link
Owner

You're almost certainly right. Any interested in cooking up a patch?

@shesek
Copy link

shesek commented Mar 14, 2013

This is especially problematic because in many cases plain English text is parsed as valid CoffeeScript code, which means the indented markdown text would end up as javascript.

@nickfargo
Copy link
Author

So I can think of two ways to go here. There’s the line-by-line state-machinery, as in the cleanMarkdown function above, and then there’s cramming the states into a set of global regexes — for no particular reason let’s call this one deliterate:

  deliterate: (code) ->
    PROSE = /([\s\S]*?(?:\n|$))(?:[ \t]*?\n|$)/g
    BLANK = /((?:[ \t]*\n)+)(?:( {4}|\t)|.)/g
    CODE = /((?:(?: {4}|\t).*(?:\n|$))+)(?:([ \t]*\n)|.)/g
    out = []

    rx = if /^(?: {4}\t).*(?:\n|$)/.test code then CODE
    else if /^\s*(?:\n|$)/.test code          then BLANK
    else                                           PROSE

    while match = rx.exec code
      chunk = match[1]
      break unless lines = chunk.replace(/([\s\S]*?)\n?$/, '$1').split '\n'
      switch rx
        when PROSE
          out.push '# ' + line for line in lines
          rx = BLANK
        when BLANK
          out.push '' for line in lines
          rx = if match[2]? then CODE else PROSE
        when CODE
          out.push /^(?: {4}|\t)?(.*)/.exec(line)[1] for line in lines
          rx = if match[2]? then BLANK else PROSE
      rx.lastIndex = match.index + chunk.length
    out.join '\n'

Uglier looking, but here executions operate on multi-line blocks rather than per-line, so — although it’s far too soon to care — deliterate is getting through a few-hundred line source file about twice as fast as cleanMarkdown.

Anyway those are just two first quick stabs, but before moving forward I thought I’d ask for an opinion on which might be a better strategy to try, or just a more appropriate fit. (I suspect Docco will need to be kept in mind as well.)

And certainly, parsing markdown is not my wheelhouse; if there’s a smarter approach that blows both of these away, I hope someone can chime in.

marchaefner pushed a commit to marchaefner/coffee-script that referenced this issue Mar 14, 2013
* Expect a blank line as delimiter between text and code (jashkenas#2821).
* Don't change indentation of code. It is not necessary and leads to
  erroneous locationData. (jashkenas#2835)
* Don't modify blank lines and reverse the change in the lexer.
* Don't ignore indentation with mixed whitespace.
@jashkenas
Copy link
Owner

Pinging @marchaefner to take a look at the approaches in this ticket and compare 'em to his, in his.

@jashkenas
Copy link
Owner

Or vice-versa, @nickfargo, if you have any thoughts about the simplification in #2838

@epidemian
Copy link
Contributor

Wouldn't it be better to use an existing Markdown parser and then extract the <code> tags from the output, or something along those lines, instead of trying to correctly parse Markdown with its million corner-cases?

Here's an example of a possible simple interpreter in the command line, using marked as the Markdown parser and hxselect to extract the <code> tags from the Markdown output:

$ echo '# Awesome Literate CoffeeScript Interpreter

* Lorem
  * ipsum 
    * dolor
  * sit
    amet,
    consectetur adipisicing elit,
    sed do eiusmod tempor

incididunt ut labore

    some coffee code

et dolore

    even
    more code

magna aliqua.' | marked | hxselect 'code' -c -s '\n' | coffee -cs

Output:

// Generated by CoffeeScript 1.6.1
(function() {

  some(coffee(code));

  even;

  more(code);

}).call(this);

@bjmiller
Copy link

That's an awesome idea! As long as no one mixes in code blocks that aren't coffeescript.

@marchaefner
Copy link
Collaborator

Wouldn't it be better to use an existing Markdown parser and then extract the <code> tags from the output...

Unless you wanted those nice error messages and source maps to coincide with the code in your .litcoffee file.

@shesek
Copy link

shesek commented Mar 15, 2013

@epidemian Its a nice approach, but it seems somewhat wasteful to run a full markdown parser just to find where the code parts are. But the biggest problem is probably generating source maps, as @marchaefner mentioned.

@epidemian
Copy link
Contributor

Unless you wanted those nice error messages and source maps to coincide with the code in your .litcoffee file.

Ho, that's right; totally forgot about that. Then it seems the pain of reinventing a part of a Markdown parser is necessary 😓. The corner cases in Markdown are pretty hairy though, things like code inside lists (or nested lists), or code after lists. I wouldn't like to see those pesky corner cases creep into LCS.

[...] but it seems somewhat wasteful to run a full markdown parser just to find where the code is.

Well, it depends on what you're worried to "waste". One could also argue that trying to make and maintain a (subset of a) Mardown parser when some many already exist is a waste of development time.

@nickfargo
Copy link
Author

I think #2838 takes it.

The global regexes of deliterate that break code into multi-line chunks do still appear to offer a negligible performance advantage, but that cannot justify the function’s relative heft, complexity, and un-readability.

Ready to close. Thanks @marchaefner!

hden pushed a commit to hden/coffee-script that referenced this issue Mar 18, 2013
* Expect a blank line as delimiter between text and code (jashkenas#2821).
* Don't change indentation of code. It is not necessary and leads to
  erroneous locationData. (jashkenas#2835)
* Don't modify blank lines and reverse the change in the lexer.
* Don't ignore indentation with mixed whitespace.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants