{{ article.content }} gets redundant <html><body> tags #1

Closed
kernc opened this issue Dec 22, 2014 · 7 comments

@kernc
commented Dec 22, 2014

After this plugin runs, each Content.content has a <html><body> prefix and </body></html> suffix, both inconveniently added by BeautifulSoup, resulting in this issue: getpelican/pelican#984

@ingwinlu

Owner

commented Dec 22, 2014

I just tried verifying your problem and I am not able to reproduce:

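from bs4 import BeautifulSoup
from pelican import contents

# replace_tables() and replace_images() are defined elsewhere in the plugin.
def bootstrapify(content):
    # Static files carry no HTML to rewrite.
    if isinstance(content, contents.Static):
        return

    print(content._content)  # debug: markup before parsing
    input()  # pause to inspect the output
    soup = BeautifulSoup(content._content)  # no parser named, so bs4 picks one
    replace_tables(soup)
    replace_images(soup)

    content._content = soup.decode()
    print(content._content)  # debug: markup after the round trip
    input()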

After the soup.decode() call, content._content does not contain <html> or <body> tags.

@kernc

Author

commented Dec 22, 2014

Interesting. This is what I see:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('some text')
<html><body><p>some text</p></body></html>
>>> BeautifulSoup('more text').decode()
u'<html><body><p>more text</p></body></html>'

That output is surely responsible for the invalid resulting HTML. I see it with BeautifulSoup 4.1.0 as well as 4.3.2 (just upgraded).

@ingwinlu

Owner

commented Dec 22, 2014

Python 3.4.2 (default, Oct  9 2014, 07:20:34) 
[GCC 4.8.2 20131219 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('some text')
some text
>>> BeautifulSoup('some text').decode()
'some text'
>>> BeautifulSoup('<p>some text</p>').decode()
'<p>some text</p>'
>>> BeautifulSoup('<p>some text</p>')
<p>some text</p>
>>> 
[winlu@micronuke ~]$ pip list
Beaker (1.6.4)
beautifulsoup4 (4.3.2)

@kernc

Author

commented Dec 22, 2014

Confirming your observations with Python 3.
Would you consider remaining backwards-compatible and hacking in a fixup for Python 2?

@ingwinlu

Owner

commented Dec 22, 2014

This might be less a Python 2 vs. Python 3 issue and more a parser issue like the one described in https://medium.com/@as_w/beware-beautiful-soup-and-lxml-f2fa442daf99.

I pushed an update, d2c48c0, which specifies html.parser as the parser for BeautifulSoup. Can you check whether this resolves the issue?
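
For anyone landing here later, a minimal sketch of what naming the parser looks like. This is not the code from the commit; the `roundtrip` helper and its `parser` argument are made up for illustration. It assumes Python 3 with beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup


def roundtrip(fragment, parser="html.parser"):
    """Parse an HTML fragment with an explicitly named parser and serialize it back.

    `roundtrip` is a hypothetical helper, not part of the plugin.
    """
    soup = BeautifulSoup(fragment, parser)
    return soup.decode()


# html.parser leaves fragments as fragments: no <html><body> wrapper is added.
print(roundtrip("<p>some text</p>"))
print(roundtrip("some text"))
```

Passing the parser name as the second argument means the result no longer depends on which optional parser libraries happen to be installed.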

@kernc

Author

commented Dec 22, 2014

It does, thanks.

@kernc kernc closed this Dec 22, 2014

@ingwinlu

Owner

commented Dec 22, 2014

Thanks for reporting. I was not aware that BeautifulSoup changes its default parser depending on which packages are installed.
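
A rough way to check which parser an unqualified BeautifulSoup(markup) call is likely to pick on a given machine. This only approximates the preference order given in the BeautifulSoup documentation (lxml, then html5lib, then the built-in html.parser); it does not call into bs4 at all, and `guess_default_parser` is a made-up name.

```python
from importlib.util import find_spec

# Optional third-party parsers, in the order the bs4 docs say they are preferred.
OPTIONAL_PARSERS = ("lxml", "html5lib")


def guess_default_parser():
    """Return the name of the parser bs4 would probably choose by default.

    Hypothetical helper: it only checks which packages are importable.
    """
    for name in OPTIONAL_PARSERS:
        if find_spec(name) is not None:
            return name
    # Python's built-in parser is always available as the fallback.
    return "html.parser"


print(guess_default_parser())
```

If this prints lxml on one machine and html.parser on another, that would be consistent with the differing output seen earlier in this thread.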
