Improve output normalization with custom parser. #31

Closed. cirosantilli wants to merge 5 commits into master.

Contributor

cirosantilli commented Apr 22, 2014

Fixes #5.

I have come up with a normalization that seems to overcome all trivial errors.

The details are well explained in the doctests for normalize_output.

This uses Python's HTML parser, and formats the output in a way that is useful for our use case. No more BeautifulSoup dependencies.
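
To make the idea concrete, here is a minimal sketch of a parser-based normalizer built on the standard library's `html.parser`. It is not the `normalize_output` from this PR: the class name `SketchNormalizer`, the function `normalize`, and the one-token-per-line output format are assumptions made only for illustration.

```python
from html.parser import HTMLParser


class SketchNormalizer(HTMLParser):
    """Hypothetical normalizer for illustration, not the PR's normalize_output."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Attribute order carries no meaning in HTML, so sort by attribute name.
        ordered = sorted(attrs, key=lambda pair: pair[0])
        rendered = "".join(' {}="{}"'.format(name, value or "") for name, value in ordered)
        self.lines.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        self.lines.append("</{}>".format(tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.lines.append(text)


def normalize(html_text):
    """Return one tag or text chunk per line, with sorted attributes."""
    parser = SketchNormalizer()
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.lines)


# Two outputs that differ only in attribute order and whitespace.
first = normalize('<a title="t" href="u">x</a>\n')
second = normalize('<a href="u"   title="t">\n  x\n</a>')
print(first == second)
```

With a scheme like this, outputs that differ only in whitespace or attribute order compare equal, which is the kind of trivial error the table below counts.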

This is built on top of #30, so please only consider the last commit of this PR.

The number of errors per implementation falls as follows (before, then after):

implementation   before  after
gfm                  20     20
kramdown             23      7
multimarkdown        27      9
pandoc               41     13
redcarpet             0      0

karlcow commented on run-tests.py in d7f8b32 Apr 25, 2014

This normalization will create issues. :)
The DOM view for

<!DOCTYPE html>
<p>a < b</p>

is

    DOCTYPE: html
    HTML
        HEAD
        BODY
            P
                #text: a < b

but the DOM view for

<!DOCTYPE html>
<p>a <b</p>

is

    DOCTYPE: html
    HTML
        HEAD
        BODY
            P
                #text: a
                B< p=""
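
A rough way to see the same effect from Python is to record what `html.parser` reports for the two inputs. This is only a sketch: `EventRecorder` and `record` are hypothetical names, and `html.parser` is a tokenizer-style parser rather than an HTML5 tree builder, so its events are not the DOM views shown above and may vary between Python versions.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: collects start tag names and text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def record(html_text):
    """Return (list of start tag names, concatenated text) for the input."""
    recorder = EventRecorder()
    recorder.feed(html_text)
    recorder.close()
    return recorder.start_tags, "".join(recorder.text)


spaced = record("<p>a < b</p>")    # "<" followed by a space stays text
unspaced = record("<p>a <b</p>")   # "<b" opens a tag
print(spaced)
print(unspaced)
```

The point of the comparison is that a single space changes whether `<` is text or the start of a tag, so any normalization has to be checked against the parsed structure, not the raw strings.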

karlcow commented on d7f8b32 Apr 25, 2014

To check whether the normalization would create issues, you need to compare what the DOM views would be.

You can use either

Contributor

cirosantilli commented Apr 25, 2014

@karlcow I'm sorry, but:

  1. I merged this in by mistake with https://github.com/karlcow/markdown-testsuite/pull/34/files, because I was using it to make the tests more meaningful while testing that branch. What shall we do, change history, or revert in a new PR?

  2. I don't quite understand your example.

    Are the inputs:

    <p>a < b</p>
    

    and:

    <p>a <b</p>
    

    valid HTML5, considering the < sign? http://validator.w3.org/check says no. If they are not, then we don't need to worry about them: invalid HTML does not even correspond to a DOM tree (?). We can just run things through a validator before normalizing. Perhaps the Python HTML parser already raises an exception in that case, or has another way to validate.

    Or do you mean the Markdown input a < b? I agree that there can be some errors, because this currently normalizes character references like &lt; to the corresponding UTF-8 characters, so &lt; could get confused with the start of a tag. But this way it's much easier to read the outputs, and only a devilishly built example would break because of that, so I decided to keep it simple.

    Intuitively, the current normalization corresponds to a stripped DOM tree, since the input is parsed, all tags are kept, and things without order are ordered. Please correct me if I'm wrong, HTML newb here 😄
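
The &lt; confusion mentioned in point 2 can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `Reserializer` and `reserialize` are hypothetical names, and attributes are ignored to keep the example short. It shows that writing text back out with character references decoded makes literal text and real markup collide, while re-escaping the text keeps them apart.

```python
import html
from html.parser import HTMLParser


class Reserializer(HTMLParser):
    """Hypothetical helper: writes tags and text back out as one string."""

    def __init__(self, escape_text):
        super().__init__(convert_charrefs=True)
        self.escape_text = escape_text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.parts.append("<{}>".format(tag))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # data arrives with &lt; and friends already decoded.
        if self.escape_text:
            data = html.escape(data, quote=False)
        self.parts.append(data)


def reserialize(html_text, escape_text):
    parser = Reserializer(escape_text)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


literal = "<p>&lt;b&gt;x&lt;/b&gt;</p>"  # the text "<b>x</b>" inside a paragraph
markup = "<p><b>x</b></p>"               # a real b element inside a paragraph

print(reserialize(literal, False) == reserialize(markup, False))
print(reserialize(literal, True) == reserialize(markup, True))
```

Decoded text is easier to read in test output, which is the trade-off described above; escaping only `<`, `>` and `&` in text would be one way to keep most of the readability without the collision.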

Owner

karlcow commented Apr 30, 2014

What shall we do, change history, or revert in a new PR?

As it is already merged and in the history here, let's close it.

karlcow closed this Apr 30, 2014

cirosantilli deleted the cirosantilli:normalize branch Apr 30, 2014
