Backtrack a minimum of one character when footnote does not match #51

elebow · 2018-06-23T03:52:18Z

When a footnote fails to match in Footnote.match_reference, lines._index is moved backwards by a number of characters equal to the number of newlines remaining in lines. This causes buggy behavior when the input markdown does not contain any newlines, because lines._index does not change.

Instead, always backtrack at least one character, so the token start [ is covered in case there are zero newlines.

I confess I do not fully understand the Footnote parsing. It may seem that the correct behavior is actually to always backtrack one character?

def backtrack(lines, string, offset):
    lines._index -= 1

The tests pass for either solution.

When a footnote fails to match in Footnote.match_reference, `lines._index` is moved backwards by a number of characters equal to the number of newlines remaining in `lines`. This causes buggy behavior when the input markdown does not contain any newlines, because `lines._index` does not change. Instead, always backtrack at least one character, so the token start `[` is covered in case there are zero newlines.

coveralls · 2018-06-23T03:53:48Z

Coverage increased (+0.01%) to 93.125% when pulling f395eb5 on elebow:backtrack-at-least-one-char into 7e28934 on miyuchina:dev.

miyuchina · 2018-06-23T14:08:45Z

Thanks for the pull request! Yeah, figuring out how to parse footnotes took me a while. I should add some documentation at a certain point.

I think the important thing is that all BlockToken subclasses match by line rather than by character. All BlockToken.read methods assume they are taking an argument lines, which is an iterator containing the remaining lines with their respective newlines at the end.

What complicates Footnote is that whether a line is a part of a footnote reference cannot be decided from only the current line. Instead, we first join all the lines before an empty line:

line_buffer = []
next_line = lines.peek()
while next_line is not None and next_line.strip() != '':
    line_buffer.append(next(lines))
    next_line = lines.peek()
string = ''.join(line_buffer)

... and then do our matches on the concatenated string (with newlines in them). When we encounter anything that does not form a footnote reference, we need to "give back" the number of lines still unmatched in the concatenated string, because they are probably part of another block-level token. This is why every time before we return None in Footnote.match_reference, we need to make a call to Footnote.backtrack. The number of lines we need to give back is equal to the number of newlines remaining, so this is where lines._index -= string[offset+1:].count('\n') comes from.

More related to your question: are you passing a list of lines without newlines into mistletoe? The assumption that block-level tokens can match on a list of newline-terminating lines is quite entrenched at this point (which is why, for example, this line exists). My reasoning is that most people will be using a Markdown parser like this:

with open('foo.md', 'r') as fin:
    return markdown(fin)

... and fin will have all the newlines we need. Does this apply to your scenario? Would you prefer a built-in mechanism to append newlines?

You have alerted me to another edge case, however, where a document does not terminate with newlines. I'll open another issue for that!

Let me know what you think. Thanks again!

elebow · 2018-06-23T19:31:02Z

Thank you for the explanation about concatenating the remainder of the lines to find the end of the footnote. That makes sense.

The situation I ran into is actually the second one you mentioned, where the document does not terminate with a newline. I was writing integration tests for an application and using short strings (without newlines) as the test input data. Though it was not my intention, I do believe this is still a realistic scenario; the Commonmark spec section 2.1 does not seem to require a terminal newline, and Markdown can be used in web apps that store data in, say, a database instead of a file.

Even when the document is being read from a file, I don't believe it's safe to assume a newline. POSIX requires one, but Windows (and probably other systems) does not, and of course it would be nice to accommodate non-compliant files on POSIX systems if it's not difficult.

Perhaps the best solution is to simply append a newline in Document.__init__ if one is not present? Then the footnote parser can remain unchanged with the assumption that there is at least one newline

miyuchina · 2018-06-25T13:19:55Z

The best solution is to simply append a newline in Document.__init__ if one is not present.

I implemented this in 1a43a68, and will close this pull request. As for POSIX vs. Windows newlines, see this documentation. Unless I'm misreading the doc, it seems that alternative newlines when reading files are already handled by Python itself.

Thanks for the contribution!

miyuchina closed this Jun 25, 2018

miyuchina added the bug label Jun 25, 2018

miyuchina mentioned this pull request Jun 25, 2018

StopIteration Exception without terminating newlines #52

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backtrack a minimum of one character when footnote does not match #51

Backtrack a minimum of one character when footnote does not match #51

elebow commented Jun 23, 2018

coveralls commented Jun 23, 2018 •

edited

Loading

miyuchina commented Jun 23, 2018

elebow commented Jun 23, 2018

miyuchina commented Jun 25, 2018

Backtrack a minimum of one character when footnote does not match #51

Backtrack a minimum of one character when footnote does not match #51

Conversation

elebow commented Jun 23, 2018

coveralls commented Jun 23, 2018 • edited Loading

miyuchina commented Jun 23, 2018

elebow commented Jun 23, 2018

miyuchina commented Jun 25, 2018

coveralls commented Jun 23, 2018 •

edited

Loading