HTML parsing abnormal #469

Open
Alalalalaki opened this issue Jul 3, 2021 · 2 comments
Comments

@Alalalalaki
Alalalalaki commented Jul 3, 2021

I recently ran "conda update --all" and then found that the HTML parsing in requests-html started to behave abnormally. In particular, the object returned by html.find() still contains the entire content of the page. For example, if a = html.find("something", first=True), then a.text still shows all the text of the page.

I then created a clean environment with only requests-html installed, and it works as expected there. So I guess a recently updated version of some other package in my main environment conflicts with the HTML parsing in requests-html. But I have no idea how this could happen or which package might be the problem.

Any suggestions would be appreciated.
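
One way to narrow this down is to print the installed versions of the relevant packages in both the working and the broken environment and compare them. Below is a minimal sketch using only the standard library. The package list is an assumption about what sits on requests-html's parsing path, not something confirmed in this thread, so adjust it as needed.

```python
from importlib import metadata

# Assumed list of packages that may affect HTML parsing in requests-html.
# This list is a guess for illustration; edit it to match your environment.
PACKAGES = ["requests-html", "lxml", "pyquery", "beautifulsoup4", "cssselect"]


def installed_versions(names=PACKAGES):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to the string 'not installed'.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Print one "name==version" line per package, in a form that is easy to diff.
for name, version in installed_versions().items():
    print(f"{name}=={version}")
```

Running this in each environment and diffing the two outputs should show which packages differ between them.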

@DanielPython2021

I had the same problem. It is strange, since I have seen similar code running on YouTube with the expected results, but that is not my experience.
To help, I am copying my code here so you can reproduce the problem (these are cells from a Jupyter notebook). I also print the BeautifulSoup results for comparison.

from requests_html import HTMLSession, HTML

doc = '<div class="class1">text1</div><div class="class2">text2</div><div class="class3">text3</div><div  class="class4">text4</div>'
html = HTML(html=doc)

for cl in ['class1', 'class2', 'class3', 'class4']:
    print(html.find('div.' + cl, first=True).html)
    print(html.find('div.' + cl, first=True).text)
    print('-' * 100)

text1
text2
text3
text4
text1 text2 text3 text4
----------------------------------------------------------------------------------------------------
text2
text3
text4
text2 text3 text4
----------------------------------------------------------------------------------------------------
text3
text4
text3 text4
----------------------------------------------------------------------------------------------------
text4
text4
----------------------------------------------------------------------------------------------------

for x in html.lxml:
    print(x.tag, x.attrib, x.text)
    print()

div {'class': 'class1'} text1

div {'class': 'class2'} text2

div {'class': 'class3'} text3

div {'class': 'class4'} text4

from bs4 import BeautifulSoup as bs

# Name the parser explicitly so bs4 does not pick one based on what is installed
soup = bs(doc, 'html.parser')
for cl in ['class1', 'class2', 'class3', 'class4']:
    print(soup.find('div', {'class': cl}))
    print(soup.find('div', {'class': cl}).text)
    print('-' * 80)

text1
text1 --------------------------------------------------------------------------------
text2
text2 --------------------------------------------------------------------------------
text3
text3 --------------------------------------------------------------------------------
text4
text4 --------------------------------------------------------------------------------

@DanielPython2021

The BeautifulSoup results in my previous comment were pasted incorrectly. They should read as follows:

<div class="class1">text1</div>
text1
--------------------------------------------------------------------------------
<div class="class2">text2</div>
text2
--------------------------------------------------------------------------------
<div class="class3">text3</div>
text3
--------------------------------------------------------------------------------
<div class="class4">text4</div>
text4
--------------------------------------------------------------------------------
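
For a baseline that does not depend on lxml at all, the expected per-div text can also be computed with the standard library's html.parser. This is only a sketch for the flat test document above: the class name DivTextCollector is made up for illustration, and it assumes the divs are not nested.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text inside each <div class="..."> of a flat document.

    Hypothetical helper for this thread's test document; it assumes that
    divs are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.text_by_class = {}
        self._current = None  # class of the div currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._current = dict(attrs).get("class")
            if self._current is not None:
                self.text_by_class.setdefault(self._current, "")

    def handle_endtag(self, tag):
        if tag == "div":
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears while a classed div is open.
        if self._current is not None:
            self.text_by_class[self._current] += data


doc = ('<div class="class1">text1</div><div class="class2">text2</div>'
       '<div class="class3">text3</div><div  class="class4">text4</div>')

collector = DivTextCollector()
collector.feed(doc)
collector.close()
expected = collector.text_by_class
print(expected)
```

In an affected environment, comparing html.find('div.class1', first=True).text against expected['class1'] should show the mismatch, since the buggy result also contains the text of the following divs.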

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022
to fix a bug of wrong parse result when use find() with lxml==4.9.0(maybe lower). see issues psf#469 and psf#479.
AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022
fix the bug caused by lxml==4.9.0(maybe lower?). see issue psf#469 and psf#479.