Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backtrack a minimum of one character when footnote does not match #51

Closed
wants to merge 1 commit into from

Conversation

elebow
Copy link

@elebow elebow commented Jun 23, 2018

When a footnote fails to match in Footnote.match_reference, lines._index is moved backwards by a number of characters equal to the number of newlines remaining in lines. This causes buggy behavior when the input markdown does not contain any newlines, because lines._index does not change.

Instead, always backtrack at least one character, so the token start [ is covered in case there are zero newlines.


I confess I do not fully understand the Footnote parsing. It may seem that the correct behavior is actually to always backtrack one character?

def backtrack(lines, string, offset):
    lines._index -= 1

The tests pass for either solution.

When a footnote fails to match in Footnote.match_reference,
`lines._index` is moved backwards by a number of characters equal to the
number of newlines remaining in `lines`. This causes buggy behavior when
the input markdown does not contain any newlines, because `lines._index`
does not change.

Instead, always backtrack at least one character, so the token start `[`
is covered in case there are zero newlines.
@coveralls
Copy link

coveralls commented Jun 23, 2018

Coverage Status

Coverage increased (+0.01%) to 93.125% when pulling f395eb5 on elebow:backtrack-at-least-one-char into 7e28934 on miyuchina:dev.

@miyuchina
Copy link
Owner

Thanks for the pull request! Yeah, figuring out how to parse footnotes took me a while. I should add some documentation at a certain point.

I think the important thing is that all BlockToken subclasses match by line rather than by character. All BlockToken.read methods assume they are taking an argument lines, which is an iterator containing the remaining lines with their respective newlines at the end.

What complicates Footnote is that whether a line is a part of a footnote reference cannot be decided from only the current line. Instead, we first join all the lines before an empty line:

line_buffer = []
next_line = lines.peek()
while next_line is not None and next_line.strip() != '':
    line_buffer.append(next(lines))
    next_line = lines.peek()
string = ''.join(line_buffer)

... and then do our matches on the concatenated string (with newlines in them). When we encounter anything that does not form a footnote reference, we need to "give back" the number of lines still unmatched in the concatenated string, because they are probably part of another block-level token. This is why every time before we return None in Footnote.match_reference, we need to make a call to Footnote.backtrack. The number of lines we need to give back is equal to the number of newlines remaining, so this is where lines._index -= string[offset+1:].count('\n') comes from.

More related to your question: are you passing a list of lines without newlines into mistletoe? The assumption that block-level tokens can match on a list of newline-terminating lines is quite entrenched at this point (which is why, for example, this line exists). My reasoning is that most people will be using a Markdown parser like this:

with open('foo.md', 'r') as fin:
    return markdown(fin)

... and fin will have all the newlines we need. Does this apply to your scenario? Would you prefer a built-in mechanism to append newlines?

You have alerted me to another edge case, however, where a document does not terminate with newlines. I'll open another issue for that!

Let me know what you think. Thanks again!

@elebow
Copy link
Author

elebow commented Jun 23, 2018

Thank you for the explanation about concatenating the remainder of the lines to find the end of the footnote. That makes sense.

The situation I ran into is actually the second one you mentioned, where the document does not terminate with a newline. I was writing integration tests for an application and using short strings (without newlines) as the test input data. Though it was not my intention, I do believe this is still a realistic scenario; the Commonmark spec section 2.1 does not seem to require a terminal newline, and Markdown can be used in web apps that store data in, say, a database instead of a file.

Even when the document is being read from a file, I don't believe it's safe to assume a newline. POSIX requires one, but Windows (and probably other systems) does not, and of course it would be nice to accommodate non-compliant files on POSIX systems if it's not difficult.

Perhaps the best solution is to simply append a newline in Document.__init__ if one is not present? Then the footnote parser can remain unchanged with the assumption that there is at least one newline

@miyuchina
Copy link
Owner

The best solution is to simply append a newline in Document.__init__ if one is not present.

I implemented this in 1a43a68, and will close this pull request. As for POSIX vs. Windows newlines, see this documentation. Unless I'm misreading the doc, it seems that alternative newlines when reading files are already handled by Python itself.

Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants