Issue with line-breaks tags #58

isaring · 2021-12-14T10:27:33Z

Hi,

I'm facing an issue with line-break tags when they are written like <br> instead of <br/>.

Considering this simple example:

>>> import markdownify
>>> markdownify.markdownify("<p>11111<br>22222<br>33333<br/>44444<br><55555</p>", heading_style=markdownify.ATX)

Expected:

'11111  \n22222  \n33333  \n44444  \n55555 \n\n'

Actual:

'11111  \n22222  \n33333  \n\n\n'

My workaround is to .replace('<br>', '<br/>'), but that's a bit of a pity...
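
A slightly more general form of that workaround, as a minimal sketch: it rewrites every attribute-free br tag to the self-closing spelling before the HTML is handed to markdownify. The helper name `normalize_br` is hypothetical and not part of markdownify, and the regular expression deliberately ignores br tags that carry attributes.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants, but not <br class="x">.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def normalize_br(html: str) -> str:
    """Rewrite every attribute-free br tag as <br/> (hypothetical helper)."""
    return _BR_TAG.sub("<br/>", html)


print(normalize_br("<p>1<br>2<BR />3<br/>4</p>"))
```

The normalized string can then be passed to markdownify.markdownify() in place of the original HTML.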

Could you fix this in a future release?

Regards,

AlexVonB · 2021-12-14T10:52:27Z

Hi isaring,

Interestingly, BeautifulSoup, the HTML parser we use, parses your code as 1 2 3 4 5, enclosing 4 and 5 in a nonexistent br tag pair. I have no idea why this happens. This would be a bug to report at the BS Launchpad: https://bugs.launchpad.net/beautifulsoup/ I'm afraid that we cannot really do anything for you in this case :(

Keep us updated if you learn something new!
Best

j77h · 2022-01-02T02:45:52Z

"<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

Did you notice? There's an extra "<" before the fives.

isaring · 2022-01-02T08:31:29Z

Oops, that's just a typo on my part!
Unfortunately, it has no effect on the result.

LaundroMat · 2022-10-12T07:19:54Z

@isaring: you can parse your HTML yourself with the html5lib parser and then use markdownify.MarkdownConverter to convert the resulting soup to Markdown (see https://replit.com/@mathieud/DependentThinDowngrade#main.py).

import bs4
from markdownify import markdownify, MarkdownConverter

assert bs4.__version__ == '4.9.0'  # using lowest possible version

html = "<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

# Using html.parser
soup = bs4.BeautifulSoup(html, 'html.parser')

assert "<br>" in str(soup)
assert markdownify(html) == '11111  \n22222  \n33333  \n\n\n'

# Using html5lib parser
soup = bs4.BeautifulSoup(html, 'html5lib')

assert "<br>" not in str(soup)
assert MarkdownConverter().convert_soup(soup) == '11111  \n22222  \n33333  \n44444  \n<55555\n\n'

@AlexVonB: So it's not a bs4 issue, it's a parser problem. So should the parser at https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L96 be upgraded to html5lib for the next release of markdownify?

chrispy-snps · 2024-01-14T17:18:50Z

There does indeed seem to be some kind of bug in the html.parser parser. I think there is a heuristic that tries to identify the <br> / <br/> convention of the content, because if only one style is used, then it seems to be parsed properly:

>>> print(bs4.BeautifulSoup('1<br>2<br>3', 'html.parser'))
1<br/>2<br/>3
>>> print(bs4.BeautifulSoup('1<br/>2<br/>3', 'html.parser'))
1<br/>2<br/>3

But if a mix is used, then html.parser seems to get confused:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html.parser'))
1<br/>2<br>3</br>

whereas the other parsers do not:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html5lib'))
<html><head></head><body>1<br/>2<br/>3</body></html>
                         ^^^^^^^^^^^^^

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'lxml'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^
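
The two spellings already differ at the tokenizer level in Python's standard html.parser module, which the 'html.parser' option of Beautiful Soup builds on. The sketch below uses only the standard library and logs the callbacks that fire; the class name `EventLogger` and the function `tokenize` are made up for illustration. It shows the raw events only; how Beautiful Soup's tree builder reacts to them is not shown here.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which HTMLParser callbacks fire, in order (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for a plain opening tag such as <br>.
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for an XHTML-style empty tag such as <br/>.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(html):
    logger = EventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


for event in tokenize("1<br>2<br/>3"):
    print(event)
```

The first br arrives as a start-tag event and the second as a start-end-tag event, so any consumer of these callbacks has to decide for itself that a bare <br> is a void element.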

Beautiful Soup tries to choose the best available HTML parser by default:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^

It might be best to use its default behavior by default, but implement a Markdownify option that allows a particular parser to be explicitly requested.
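
A rough sketch of what such an option could look like from the caller's side, assuming the third-party packages bs4 and markdownify are installed. The names `pick_parser` and `to_markdown` are hypothetical and not part of markdownify; the only library calls used are bs4.BeautifulSoup(html, parser) and MarkdownConverter().convert_soup(soup), both of which appear earlier in this thread. The fallback order (lxml, then html5lib, then html.parser) is an assumption modeled on Beautiful Soup's documented preference.

```python
from importlib.util import find_spec


def pick_parser(requested=None, preferred=("lxml", "html5lib")):
    """Return the parser name to pass to Beautiful Soup (hypothetical helper).

    An explicitly requested parser always wins. Otherwise the first
    installed package from `preferred` is used, falling back to the
    standard-library 'html.parser'.
    """
    if requested is not None:
        return requested
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return "html.parser"


def to_markdown(html, parser=None):
    """Parse with the chosen parser, then convert the soup to Markdown."""
    # Imported lazily so that pick_parser works without these packages.
    import bs4
    from markdownify import MarkdownConverter

    soup = bs4.BeautifulSoup(html, pick_parser(parser))
    return MarkdownConverter().convert_soup(soup)
```

With this shape, to_markdown(html) follows the best-available behavior, while to_markdown(html, parser="html5lib") pins a specific parser.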
