Difference of block element handling in jsoup version 1.15.4 to previous version #1926

Closed
AnanasPizza opened this issue Mar 28, 2023 · 1 comment

Hi, we noticed a change in how jsoup handles newlines and whitespace between inline/block elements when it parses HTML into a document and serializes it again.

So here is a minimal testcase:
Input html:

<!DOCTYPE html>
<html>
    <body>
        <table>
            <tr>
                <td>
                    <p style="display:inline;">A</p>
                    <p style="display:inline;">B</p>
                </td>
            </tr>
        </table>
    </body>
</html>

Testcode:


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document document = Jsoup.parse(html);
String result = document.toString();

This is the result for versions before 1.15.4:

<!doctype html>
<html>
    <head></head>
    <body>
        <table>
            <tbody>
                <tr>
                    <td><p style="display:inline;">A</p> <p style="display:inline;">B</p></td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

This is the result for 1.15.4:

<!doctype html>
<html>
    <head></head>
    <body>
        <table>
            <tbody>
                <tr>
                    <td><p style="display:inline;">A</p><p style="display:inline;">B</p></td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

The difference is in what happens between the two p tags. In previous versions, the whitespace between them was collapsed to a single space. Now that whitespace is removed entirely. We noticed this change inside table cells, but it may occur in other places too.

Why do we think the previous behavior was better:
The additional whitespace between genuine block elements did no harm. And because the Tag#isBlock method does not take display:inline styling into account, jsoup cannot be sure that an element is really rendered as a block, so it would be better to let the browser handle it.

The normal browser behavior is that two consecutive block elements are rendered on separate lines, regardless of whether the raw HTML has no whitespace, a single space, or a line break between them.
Two consecutive inline elements are rendered on the same line, whether they are separated by a space or by a line break. But it does make a difference if there is no whitespace at all between them: the browser then adds no space and renders the text of the two elements directly next to each other.

So if one pastes the resulting HTML from above into an HTML file and opens it in a browser, there is a visible difference. The browser now sees two inline elements with no whitespace or line break between them and renders them exactly like that, whereas the original HTML renders with a space between them.
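A regression like this can be caught without a browser by checking whether any whitespace survives between the two p elements. The following is only a minimal sketch using plain string matching on the cell from the testcase above; the class name `InlineGapCheck` and the method `hasGap` are hypothetical helpers and not part of jsoup, and the regular expression only covers adjacent p tags.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: detects whether whitespace
// separates a closing </p> from the next opening <p ...> tag.
class InlineGapCheck {
    // </p>, then one or more whitespace characters, then the next <p tag.
    private static final Pattern GAP =
            Pattern.compile("</p>\\s+<p[\\s>]", Pattern.CASE_INSENSITIVE);

    /** Returns true if some whitespace separates two adjacent p elements. */
    static boolean hasGap(String html) {
        return GAP.matcher(html).find();
    }

    public static void main(String[] args) {
        String p = "<p style=\"display:inline;\">";
        // The table cell as written in the input, and as serialized by each version.
        String original = "<td>\n  " + p + "A</p>\n  " + p + "B</p>\n</td>";
        String older = "<td>" + p + "A</p> " + p + "B</p></td>";
        String newer = "<td>" + p + "A</p>" + p + "B</p></td>";

        System.out.println("original: " + hasGap(original)); // original: true
        System.out.println("older:    " + hasGap(older));    // older:    true
        System.out.println("newer:    " + hasGap(newer));    // newer:    false
    }
}
```

A browser treats the line break in the original and the single space in the older output the same way, so only the newer output changes what is rendered.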

Original in browser:
[image]

Parsed with versions before 1.15.4:
[image]

Parsed with version 1.15.4:
[image]

@AnanasPizza AnanasPizza changed the title Difference in jsoup version 1.15.4 to previous version Difference of block element handling in jsoup version 1.15.4 to previous version Mar 28, 2023
@jhy jhy closed this as completed in 8e2b868 Apr 29, 2023
@jhy jhy self-assigned this Apr 29, 2023
@jhy jhy added this to the 1.16.1 milestone Apr 29, 2023
@jhy jhy added the bug Confirmed bug that we should fix label Apr 29, 2023

jhy commented Apr 29, 2023

Thanks for the detailed and clear report. I have fixed this by causing nested inlineable content elements (like TDs and Ps) to wrap.

The pretty-printer code is now unfortunately pretty gnarly. I think it will be useful to completely refactor the implementation to simplify it and make the output more customisable.
