Issue with line-breaks tags #58

Open
isaring opened this issue Dec 14, 2021 · 5 comments

@isaring

isaring commented Dec 14, 2021

Hi,

I'm facing an issue with line-break tags when they are written as <br/> instead of <br>.

Considering this simple example:

>>> import markdownify
>>> markdownify.markdownify("<p>11111<br>22222<br>33333<br/>44444<br><55555</p>", heading_style=markdownify.ATX)

Expected:

'11111  \n22222  \n33333  \n44444  \n55555 \n\n'

Actual:

'11111  \n22222  \n33333  \n\n\n'

My workaround is to call .replace('<br/>','<br>') on the input first, but it's a bit of a pity...
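
A slightly more tolerant version of that workaround could look like the sketch below. It assumes that rewriting every bare line-break tag to the plain <br> spelling before conversion is enough to sidestep the problem. The name normalize_br is hypothetical and is not part of markdownify.

```python
import re

# Matches <br>, <br/>, <br /> and <BR> in any letter case.
# Tags that carry attributes (e.g. <br class="x">) are deliberately left alone.
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def normalize_br(html: str) -> str:
    """Rewrite every bare line-break tag to the plain <br> spelling."""
    return _BR_TAG.sub("<br>", html)


html = "<p>11111<br>22222<br>33333<br/>44444<br>55555</p>"
print(normalize_br(html))
```

The normalized string can then be passed to markdownify.markdownify() as usual.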

Could you fix this in a future release?

Regards,

@AlexVonB
Collaborator

Hi isaring,

Interestingly, Beautiful Soup, the HTML parser we use, parses your code as <p>1<br/>2<br/>3<br>4<br/>5</br></p>, enclosing 4 and 5 in a nonexistent br tag pair. I have no idea why this happens. This would be a bug to report at the Beautiful Soup Launchpad: https://bugs.launchpad.net/beautifulsoup/ I'm afraid that we cannot really do anything for you in this case :(

Keep us updated if you learn something new!
Best

@j77h

j77h commented Jan 2, 2022

"<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

Did you notice? There's an extra "<" before the fives.

@isaring
Author

isaring commented Jan 2, 2022

Oops, that's just a typo of mine!
Unfortunately, it has no effect on the result.

@LaundroMat

@isaring: you can parse your HTML yourself with the html5lib parser and then use markdownify.MarkdownConverter to convert the resulting soup (see https://replit.com/@mathieud/DependentThinDowngrade#main.py).

import bs4
from markdownify import markdownify, MarkdownConverter

assert bs4.__version__ == '4.9.0'  # using lowest possible version

html = "<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

# Using html.parser
soup = bs4.BeautifulSoup(html, 'html.parser')

assert "<br>" in str(soup)
assert markdownify(html) == '11111  \n22222  \n33333  \n\n\n'

# Using html5lib parser
soup = bs4.BeautifulSoup(html, 'html5lib')

assert "<br>" not in str(soup)
assert MarkdownConverter().convert_soup(soup) == '11111  \n22222  \n33333  \n44444  \n<55555\n\n'

@AlexVonB: So it's not a bs4 issue as such, it's a problem with the parser bs4 is told to use. Should the parser at https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L96 be switched to html5lib for the next release of markdownify?

@chrispy-snps
Collaborator

There indeed seems to be some kind of bug in the html.parser parser. I think there is a heuristic that tries to identify the <br>/<br/> convention of the content, because if only one style is used, then it seems to be parsed properly:

>>> print(bs4.BeautifulSoup('1<br>2<br>3', 'html.parser'))
1<br/>2<br/>3
>>> print(bs4.BeautifulSoup('1<br/>2<br/>3', 'html.parser'))
1<br/>2<br/>3

But if a mix is used, then html.parser seems to get confused:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html.parser'))
1<br/>2<br>3</br>

whereas the other parsers do not:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html5lib'))
<html><head></head><body>1<br/>2<br/>3</body></html>
                         ^^^^^^^^^^^^^

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'lxml'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^
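
If the mixed-style observation above holds, a quick pre-check can flag input that is likely to hit the problem. This is only a sketch of that idea: mixes_br_styles is a hypothetical helper, and it inspects the raw text with a regular expression instead of asking any parser.

```python
import re

# Group 1 captures the optional slash, so we can tell <br> from <br/>.
_BR_TAG = re.compile(r"<br\s*(/?)>", re.IGNORECASE)


def mixes_br_styles(html: str) -> bool:
    """Return True if the text contains both <br> and <br/> spellings."""
    styles = {"<br/>" if m.group(1) else "<br>" for m in _BR_TAG.finditer(html)}
    return len(styles) == 2


for sample in ("1<br>2<br>3", "1<br/>2<br/>3", "1<br>2<br/>3"):
    print(sample, mixes_br_styles(sample))
```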

Beautiful Soup tries to choose the best available HTML parser by default:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^

It might be best to let Beautiful Soup pick its parser by default, but implement a Markdownify option that allows a particular parser to be explicitly requested.
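
Until such an option exists, the same idea can be approximated in user code. The sketch below is an assumption about how it might look, not markdownify's API: to_markdown and available_parsers are hypothetical names, while BeautifulSoup(markup, features) and MarkdownConverter.convert_soup are the calls already used earlier in this thread.

```python
from importlib.util import find_spec


def available_parsers():
    """List parser names usable with Beautiful Soup, in its documented
    order of preference: lxml, then html5lib, then html.parser."""
    parsers = []
    if find_spec("lxml") is not None:
        parsers.append("lxml")
    if find_spec("html5lib") is not None:
        parsers.append("html5lib")
    parsers.append("html.parser")  # standard library, always present
    return parsers


def to_markdown(html, parser=None, **options):
    """Convert HTML to Markdown, optionally forcing a specific parser.

    With parser=None, Beautiful Soup picks the best parser it can find.
    """
    # Imported lazily so available_parsers() works without these packages.
    import bs4
    from markdownify import MarkdownConverter

    soup = bs4.BeautifulSoup(html, parser)
    return MarkdownConverter(**options).convert_soup(soup)
```

A call such as to_markdown(html, parser="html5lib") would then request html5lib explicitly. Note that Beautiful Soup may emit a warning when no parser is named.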
