Fix parsing of indented markdown features by requiring line-break delimiter #2821

nickfargo · 2013-03-12T11:21:45Z

Here is an example of a markdown-coffee passage that cannot be lexed as literate (#1786) in 1.6.1, because the list item’s continuation line passes the simple per-line /[ ]{4}|\t/ test and as such is incorrectly identified as a code block.

Allow me to enumerate:

  * Long list items may
    continue
  * Oops.

    @however is code

This seems to imply that literate code blocks should be not only indented, but also delimited from overhead markdown by at least one immediately preceding empty line. As much is suggested by example, and by the syntax highlighting rules in the tmbundle, however it is not presently stipulated by the implementation.

One approach for a fix might be to expand Lexer::clean a bit, allowing the markdown passages to be read in blocks, rather than individual lines:

  # Preprocess the code to remove leading and trailing whitespace, carriage
  # returns, etc.
  clean: (code) ->
    code = code.slice(1) if code.charCodeAt(0) is BOM
    code = code.replace(/\r/g, '').replace TRAILING_SPACES, ''
    if WHITESPACE.test code
      code = "\n#{code}"
      @chunkLine--
    code = @cleanMarkdown code if @literate
    code

  # If we're lexing literate CoffeeScript, identify the blocks of extrinsic
  # Markdown and transform them into `#`-comments.
  cleanMarkdown: (code) ->
    COFFEE = /^(?:[ ]{4}|\t)(.*)/
    EMPTY = /^\s*$/

    lines = code.split '\n'
    line = lines[0]
    state =
      if COFFEE.test line then 'coffee'
      else if EMPTY.test line then 'markdown-newline'
      else 'markdown'

    # Allow transition into `coffee` from `markdown` only after an empty line.
    for line, i in lines
      switch state
        when 'markdown'
          if EMPTY.test line
            state = 'markdown-newline'
            line = ''
          else
            line = '# ' + line
        when 'markdown-newline'
          if match = COFFEE.exec line
            state = 'coffee'
            line = match[1]
          else if EMPTY.test line
            line = '# ' + line
          else
            state = 'markdown'
            line = '# ' + line
        when 'coffee'
          if match = COFFEE.exec line
            line = match[1]
          else if EMPTY.test line
            line = ''
          else
            state = 'markdown'
            line = '# ' + line
      lines[i] = line

    lines.join '\n'

With this in place, existing tests appear to pass, as do ad-hoc cases similar to the example above that previously would fail.

Please offer thoughts on whether this might be an approach worth evolving into a pull.

The text was updated successfully, but these errors were encountered:

jashkenas · 2013-03-13T07:50:58Z

Although it would be possible for us to start doing more sophisticated parses of Markdown, ideally, we'd try and keep this feature very simple. Having just a one-line-at-a-time, is-it-literate-or-is-it-not heuristic makes it way easier to for other tools to support the same syntax for other purposes.

Are you sure it wouldn't be easier just to write it out like this?

Allow me to enumerate:

* Long list items may
  continue
* Oops.

    @however is code

nickfargo · 2013-03-14T15:27:24Z

A poor example; mea culpa. Something less easily “fixable” would have illustrated the point better, like making the list nested:

* Even when nested
  * long list items may
    continue
  * Oops.

or a canonical markdown example:

To make lists look nice, you can wrap items with hanging indents:

*   Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
    Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
    viverra nec, fringilla in, laoreet vitae, risus.

or the more edgy cases, like neatly indented multi-paragraph list items.

To be sure, some features are well enough off the beaten path to disregard, but others are common enough, and in a few cases the danger could lie well hidden, and lead to some pernicious bugs.

I appreciate the power of simplicity above all else, but here I think it will prove worthwhile to find an acceptable way to chase the line break.

Thanks for having a look.

jashkenas · 2013-03-14T15:30:09Z

You're almost certainly right. Any interested in cooking up a patch?

shesek · 2013-03-14T18:29:01Z

This is especially problematic because in many cases plain English text is parsed as valid CoffeeScript code, which means the indented markdown text would end up as javascript.

nickfargo · 2013-03-14T22:50:55Z

So I can think of two ways to go here. There’s the line-by-line state-machinery, as in the cleanMarkdown function above, and then there’s cramming the states into a set of global regexes — for no particular reason let’s call this one deliterate:

  deliterate: (code) ->
    PROSE = /([\s\S]*?(?:\n|$))(?:[ \t]*?\n|$)/g
    BLANK = /((?:[ \t]*\n)+)(?:( {4}|\t)|.)/g
    CODE = /((?:(?: {4}|\t).*(?:\n|$))+)(?:([ \t]*\n)|.)/g
    out = []

    rx = if /^(?: {4}\t).*(?:\n|$)/.test code then CODE
    else if /^\s*(?:\n|$)/.test code          then BLANK
    else                                           PROSE

    while match = rx.exec code
      chunk = match[1]
      break unless lines = chunk.replace(/([\s\S]*?)\n?$/, '$1').split '\n'
      switch rx
        when PROSE
          out.push '# ' + line for line in lines
          rx = BLANK
        when BLANK
          out.push '' for line in lines
          rx = if match[2]? then CODE else PROSE
        when CODE
          out.push /^(?: {4}|\t)?(.*)/.exec(line)[1] for line in lines
          rx = if match[2]? then BLANK else PROSE
      rx.lastIndex = match.index + chunk.length
    out.join '\n'

Uglier looking, but here executions operate on multi-line blocks rather than per-line, so — although it’s far too soon to care — deliterate is getting through a few-hundred line source file about twice as fast as cleanMarkdown.

Anyway those are just two first quick stabs, but before moving forward I thought I’d ask for an opinion on which might be a better strategy to try, or just a more appropriate fit. (I suspect Docco will need to be kept in mind as well.)

And certainly, parsing markdown is not my wheelhouse; if there’s a smarter approach that blows both of these away, I hope someone can chime in.

* Expect a blank line as delimiter between text and code (jashkenas#2821). * Don't change indentation of code. It is not necessary and leads to erroneous locationData. (jashkenas#2835) * Don't modify blank lines and reverse the change in the lexer. * Don't ignore indentation with mixed whitespace.

jashkenas · 2013-03-14T23:58:22Z

Pinging @marchaefner to take a look at the approaches in this ticket and compare 'em to his, in his.

jashkenas · 2013-03-15T00:24:55Z

Or vice-versa, @nickfargo, if you have any thoughts about the simplification in #2838

epidemian · 2013-03-15T00:39:29Z

Wouldn't it be better to use an existing Markdown parser and then extract the <code> tags from the output, or something along those lines, instead of trying to correctly parse Markdown with its million corner-cases?

Here's an example of a possible simple interpreter in the command line, using marked as the Markdown parser and hxselect to extract the <code> tags from the Markdown output:

$ echo '# Awesome Literate CoffeeScript Interpreter

* Lorem
  * ipsum 
    * dolor
  * sit
    amet,
    consectetur adipisicing elit,
    sed do eiusmod tempor

incididunt ut labore

    some coffee code

et dolore

    even
    more code

magna aliqua.' | marked | hxselect 'code' -c -s '\n' | coffee -cs

Output:

// Generated by CoffeeScript 1.6.1
(function() {

  some(coffee(code));

  even;

  more(code);

}).call(this);

bjmiller · 2013-03-15T00:44:23Z

That's an awesome idea! As long as no one mixes in code blocks that aren't coffeescript.

marchaefner · 2013-03-15T00:56:46Z

Wouldn't it be better to use an existing Markdown parser and then extract the <code> tags from the output...

Unless you wanted those nice error messages and source maps to coincide with the code in your .litcoffee file.

shesek · 2013-03-15T01:04:09Z

@epidemian Its a nice approach, but it seems somewhat wasteful to run a full markdown parser just to find where the code parts are. But the biggest problem is probably generating source maps, as @marchaefner mentioned.

epidemian · 2013-03-15T01:17:49Z

Unless you wanted those nice error messages and source maps to coincide with the code in your .litcoffee file.

Ho, that's right; totally forgot about that. Then it seems the pain of reinventing a part of a Markdown parser is necessary 😓. The corner cases in Markdown are pretty hairy though, things like code inside lists (or nested lists), or code after lists. I wouldn't like to see those pesky corner cases creep into LCS.

[...] but it seems somewhat wasteful to run a full markdown parser just to find where the code is.

Well, it depends on what you're worried to "waste". One could also argue that trying to make and maintain a (subset of a) Mardown parser when some many already exist is a waste of development time.

nickfargo · 2013-03-15T05:13:04Z

I think #2838 takes it.

The global regexes of deliterate that break code into multi-line chunks do still appear to offer a negligible performance advantage, but that cannot justify the function’s relative heft, complexity, and un-readability.

Ready to close. Thanks @marchaefner!

* Expect a blank line as delimiter between text and code (jashkenas#2821). * Don't change indentation of code. It is not necessary and leads to erroneous locationData. (jashkenas#2835) * Don't modify blank lines and reverse the change in the lexer. * Don't ignore indentation with mixed whitespace.

marchaefner mentioned this issue Mar 14, 2013

Improve literal mode. #2838

Merged

nickfargo closed this as completed Mar 15, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing of indented markdown features by requiring line-break delimiter #2821

Fix parsing of indented markdown features by requiring line-break delimiter #2821

nickfargo commented Mar 12, 2013

jashkenas commented Mar 13, 2013

nickfargo commented Mar 14, 2013

jashkenas commented Mar 14, 2013

shesek commented Mar 14, 2013

nickfargo commented Mar 14, 2013

jashkenas commented Mar 14, 2013

jashkenas commented Mar 15, 2013

epidemian commented Mar 15, 2013

bjmiller commented Mar 15, 2013

marchaefner commented Mar 15, 2013

shesek commented Mar 15, 2013

epidemian commented Mar 15, 2013

nickfargo commented Mar 15, 2013

Fix parsing of indented markdown features by requiring line-break delimiter #2821

Fix parsing of indented markdown features by requiring line-break delimiter #2821

Comments

nickfargo commented Mar 12, 2013

jashkenas commented Mar 13, 2013

nickfargo commented Mar 14, 2013

jashkenas commented Mar 14, 2013

shesek commented Mar 14, 2013

nickfargo commented Mar 14, 2013

jashkenas commented Mar 14, 2013

jashkenas commented Mar 15, 2013

epidemian commented Mar 15, 2013

bjmiller commented Mar 15, 2013

marchaefner commented Mar 15, 2013

shesek commented Mar 15, 2013

epidemian commented Mar 15, 2013

nickfargo commented Mar 15, 2013